Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
[Christoph Martin]
> check_linux_raid should also warn if a md device is in resync mode
> and not only if in recovery mode.

I suspect this is the wrong conclusion to the problem. I ran into a similar issue with my Nagios-monitored RAID, where one of the disks failed and the spare was automatically resynced into the RAID. But I do not want a warning because of the resync. I want a warning because there is a failing disk. So in this case:

[Jan Wagner]
> md3 : active raid10 sdd4[4](F) sdc4[1] sdb4[5](F) sda4[0]
>       1887974656 blocks 64K chunks 2 near-copies [4/2] [UU__]
>       [==..] recovery = 50.3% (474987648/943987328) finish=4363639.0min speed=1K/sec

I believe the module should report the devices listed with '(F)' as at least a warning and preferably a critical issue, and ignore the fact that a sync/recovery is in progress.

It would also be nice if it would report the disk serial number of the failing disk, to make it easier to locate the correct disk when replacing disks. The serial number can either be discovered using 'hdparm -I /dev/sdd4' (in the example above), or by looking in /sys/.

-- 
Happy hacking
Petter Reinholdtsen

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
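The suggestion to flag members marked '(F)' could look roughly like the following. This is only a sketch in Python, not the plugin's Perl, and `MEMBER_RE` and `failed_members` are hypothetical names that do not exist in check_linux_raid. It assumes the device line has the shape shown in the md3 example above.

```python
import re

# Hypothetical helper, not part of check_linux_raid.
# A member looks like "sdd4[4]" optionally followed by flags such as "(F)" or "(S)".
MEMBER_RE = re.compile(r"(\w+)\[\d+\]((?:\([A-Z]\))*)")

def failed_members(device_line):
    """Return the member names marked (F) on an mdstat device line."""
    # Only look at the part after "mdN :" so the array name itself is skipped.
    _, _, members = device_line.partition(":")
    return [name for name, flags in MEMBER_RE.findall(members) if "(F)" in flags]

line = "md3 : active raid10 sdd4[4](F) sdc4[1] sdb4[5](F) sda4[0]"
print(failed_members(line))  # ['sdd4', 'sdb4']
```

A check built on this could raise CRITICAL whenever the returned list is non-empty, independent of whether a recovery or resync line is present.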
Bug#575382: [Pkg-nagios-devel] Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
Jan Wagner wrote on 29.03.2010 00:02:
> Hi Christoph,
>
> Okay ... found an example, hopefully equivalent to your problem:
>
> bb:~# cat /tmp/mdstat
> Personalities : [raid1] [raid10]
> md2 : active raid1 sdd2[4] sdc2[1] sdb2[2] sda2[0]
>       16386176 blocks [4/3] [UUU_]
>         resync=DELAYED
>
> md1 : active raid1 sdd3[4] sdc3[1] sdb3[2] sda3[0]
>       6144768 blocks [4/3] [UUU_]
>         resync=DELAYED
>
> md3 : active raid10 sdd4[4](F) sdc4[1] sdb4[5](F) sda4[0]
>       1887974656 blocks 64K chunks 2 near-copies [4/2] [UU__]
>       [==..] recovery = 50.3% (474987648/943987328) finish=4363639.0min speed=1K/sec
>
> md0 : active raid1 sdd1[4](S) sdc1[5] sdb1[2] sda1[0]
>       10241280 blocks [4/2] [U_U_]
>         resync=DELAYED
>
> unused devices: <none>
>
> bb:~# /tmp/check_linux_raid md3
> WARNING md3 status=[UU__], recovery=50.3%, finish=4363639.0min.
> bb:~# /tmp/check_linux_raid_new md3
> WARNING md3 status=[UU__], recovery=recovery, finish=4363639.0min.
>
> bb:~# diff -Nur /tmp/check_linux_raid /tmp/check_linux_raid_new
> --- /tmp/check_linux_raid 2010-03-28 23:19:08.0 +0200
> +++ /tmp/check_linux_raid_new 2010-03-28 21:33:44.0 +0200
> @@ -61,7 +61,7 @@
>      if (defined $device) {
>          if (/(\[[_U]+\])/) {
>              $status{$device} = $1;
> -        } elsif (/recovery = (.*?)\s/) {
> +        } elsif (/(recovery|resync) = (.*?)\s/) {
>              $recovery{$device} = $1;
>              ($finish{$device}) = /finish=(.*?min)/;
>          } elsif (/^\s*$/) {
>
> I don't see how that should help. :) Did I miss something?

I see now that the patch is wrong. It should not output recovery = x%. There should be a $2 in the line below:

        } elsif (/(recovery|resync) = (.*?)\s/) {
            $recovery{$device} = $2;

> I just had a closer look. What do you think about the following:
>
> --- /tmp/check_linux_raid 2010-03-28 23:19:08.0 +0200
> +++ /tmp/check_linux_raid_new 2010-03-28 23:48:11.0 +0200
> @@ -50,6 +50,7 @@
>  my $msg = "";
>  my %status;
>  my %recovery;
> +my %resync;
>  my %finish;
>  my %active;
>  my %devices;
> @@ -64,6 +65,8 @@
>          } elsif (/recovery = (.*?)\s/) {
>              $recovery{$device} = $1;
>              ($finish{$device}) = /finish=(.*?min)/;
> +        } elsif (/resync=(.*?)\s/) {
> +            $resync{$device} = $1;
>          } elsif (/^\s*$/) {
>              $device=undef;
>          }
> @@ -84,6 +87,10 @@
>              $msg .= sprintf "%s status=%s, recovery=%s, finish=%s.",
>                  $devices{$k}, $status{$k}, $recovery{$k}, $finish{$k};
>              $code = max_state($code, WARNING);
> +        } elsif (defined $resync{$k}) {
> +            $msg .= sprintf "%s status=%s, resync=%s.",
> +                $devices{$k}, $status{$k}, $resync{$k};
> +            $code = max_state($code, WARNING);
>          } else {
>              $msg .= sprintf "%s status=%s.", $devices{$k}, $status{$k};
>              $code = max_state($code, CRITICAL);
>
> Maybe that's what you are expecting?

This is also not correct, because the resync line might also have the %-value and the finish= value.

Christoph
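Christoph's last point, that a resync line may appear either as a bare "resync=DELAYED" or with a percentage and a finish= estimate, suggests one pattern that tolerates both shapes. The following is a rough Python sketch of that idea, not the plugin's Perl and not a proposed patch. `PROGRESS_RE`, `FINISH_RE` and `parse_progress` are hypothetical names, and the percentage form of a resync line is assumed to mirror the recovery line quoted above.

```python
import re

# Hypothetical names, for illustration only.
# Optional whitespace around "=" covers both "recovery = 50.3%" and "resync=DELAYED".
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*(\S+)")
FINISH_RE = re.compile(r"finish=(\S+?min)")

def parse_progress(line):
    """Return (operation, value, finish) or None; finish is None when absent."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    f = FINISH_RE.search(line)
    # group(1) is the keyword, group(2) is the value after "=".
    return m.group(1), m.group(2), f.group(1) if f else None

print(parse_progress("[==..] recovery = 50.3% (474987648/943987328) "
                     "finish=4363639.0min speed=1K/sec"))
print(parse_progress("resync=DELAYED"))
```

With this shape the reporting code would have to cope with a missing finish value, which is the case for the DELAYED lines in the example.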
Bug#575382: [Pkg-nagios-devel] Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
Hi Martin,

On Thursday 25 March 2010, Christoph Martin wrote:
> Thanks for the last patch. There is another one:
>
> check_linux_raid should also warn if a md device is in resync mode and
> not only if in recovery mode.
>
> *** /usr/lib/nagios/plugins/check_linux_raid.pl~ Fri Mar 19 12:06:24 2010
> --- /usr/lib/nagios/plugins/check_linux_raid.pl Wed Mar 24 00:21:04 2010
> ***************
> *** 61,67 ****
>       if (defined $device) {
>           if (/(\[[_U]+\])/) {
>               $status{$device} = $1;
> !         } elsif (/recovery = (.*?)\s/) {
>               $recovery{$device} = $1;
>               ($finish{$device}) = /finish=(.*?min)/;
>               $device=undef;
> --- 61,67 ----
>       if (defined $device) {
>           if (/(\[[_U]+\])/) {
>               $status{$device} = $1;
> !         } elsif (/(recovery|resync) = (.*?)\s/) {
>               $recovery{$device} = $1;
>               ($finish{$device}) = /finish=(.*?min)/;
>               $device=undef;

I'm not very familiar with mdadm, but how does your patch help here? Once a month /etc/cron.d/mdadm is doing a resync.

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda3[0] sdb3[1]
      133821376 blocks [2/2] [UU]
      [=...] check = 29.6% (3924/133821376) finish=24.1min speed=64932K/sec

unused devices: <none>

check_linux_raid is reporting "OK md1 status=[UU]." with and without your patch.

Under which conditions does your regex match? Do you have an example, and when will this happen?

Thanks and with kind regards, Jan.
-- 
Never write mail to w...@spamfalle.info, you have been warned!
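Jan's observation can be reproduced outside the plugin. A minimal sketch, in Python rather than the plugin's Perl, that applies the original and the patched pattern to the "check" progress line from the mdstat output above. The names `ORIGINAL`, `PATCHED` and `check_line` are made up for this illustration.

```python
import re

# The two patterns from the patch, transcribed to Python (same regex syntax here).
ORIGINAL = re.compile(r"recovery = (.*?)\s")
PATCHED = re.compile(r"(recovery|resync) = (.*?)\s")

# Progress line of the monthly check run, as quoted in the message above.
check_line = "[=...] check = 29.6% (3924/133821376) finish=24.1min speed=64932K/sec"

for name, pattern in (("original", ORIGINAL), ("patched", PATCHED)):
    # Neither pattern contains the keyword "check", so neither matches.
    print(name, bool(pattern.search(check_line)))
```

Both lines print False, which fits the report that the plugin's output is the same with and without the patch while a check is running.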
Bug#575382: [Pkg-nagios-devel] Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
Hi Jan,

Jan Wagner wrote on 28.03.2010 22:19:
> Hi Martin,

s/Martin/Christoph/

> On Thursday 25 March 2010, Christoph Martin wrote:
>> Thanks for the last patch. There is another one:
>>
>> check_linux_raid should also warn if a md device is in resync mode and
>> not only if in recovery mode.
>>
>> *** /usr/lib/nagios/plugins/check_linux_raid.pl~ Fri Mar 19 12:06:24 2010
>> --- /usr/lib/nagios/plugins/check_linux_raid.pl Wed Mar 24 00:21:04 2010
>> ***************
>> *** 61,67 ****
>>       if (defined $device) {
>>           if (/(\[[_U]+\])/) {
>>               $status{$device} = $1;
>> !         } elsif (/recovery = (.*?)\s/) {
>>               $recovery{$device} = $1;
>>               ($finish{$device}) = /finish=(.*?min)/;
>>               $device=undef;
>> --- 61,67 ----
>>       if (defined $device) {
>>           if (/(\[[_U]+\])/) {
>>               $status{$device} = $1;
>> !         } elsif (/(recovery|resync) = (.*?)\s/) {
>>               $recovery{$device} = $1;
>>               ($finish{$device}) = /finish=(.*?min)/;
>>               $device=undef;
>
> I'm not very familiar with mdadm, but how does your patch help here? Once a
> month /etc/cron.d/mdadm is doing a resync.
>
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md1 : active raid1 sda3[0] sdb3[1]
>       133821376 blocks [2/2] [UU]
>       [=...] check = 29.6% (3924/133821376) finish=24.1min speed=64932K/sec
>
> unused devices: <none>
>
> check_linux_raid is reporting "OK md1 status=[UU]." with and without your
> patch.
>
> Under which conditions does your regex match? Do you have an example, and
> when will this happen?

If a disk fails in a raid6, an automatic recovery is done with one missing disk. With raid6, two disks may fail and the RAID can still be recovered. If you add a good disk, after the recovery a resync is done with state UUU_. Only after this resync it will get into state again. This resync should generate a warning.

Christoph
Bug#575382: [Pkg-nagios-devel] Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
Hi Christoph,

On Sunday 28 March 2010, Christoph Martin wrote:
> Jan Wagner wrote on 28.03.2010 22:19:
>> Hi Martin,
>
> s/Martin/Christoph/

sorry. :/

> If a disk fails in a raid6, an automatic recovery is done with one missing
> disk. With raid6, two disks may fail and the RAID can still be recovered. If
> you add a good disk, after the recovery a resync is done with state UUU_.
> Only after this resync it will get into state again. This resync should
> generate a warning.

Okay ... found an example, hopefully equivalent to your problem:

bb:~# cat /tmp/mdstat
Personalities : [raid1] [raid10]
md2 : active raid1 sdd2[4] sdc2[1] sdb2[2] sda2[0]
      16386176 blocks [4/3] [UUU_]
        resync=DELAYED

md1 : active raid1 sdd3[4] sdc3[1] sdb3[2] sda3[0]
      6144768 blocks [4/3] [UUU_]
        resync=DELAYED

md3 : active raid10 sdd4[4](F) sdc4[1] sdb4[5](F) sda4[0]
      1887974656 blocks 64K chunks 2 near-copies [4/2] [UU__]
      [==..] recovery = 50.3% (474987648/943987328) finish=4363639.0min speed=1K/sec

md0 : active raid1 sdd1[4](S) sdc1[5] sdb1[2] sda1[0]
      10241280 blocks [4/2] [U_U_]
        resync=DELAYED

unused devices: <none>

bb:~# /tmp/check_linux_raid md3
WARNING md3 status=[UU__], recovery=50.3%, finish=4363639.0min.
bb:~# /tmp/check_linux_raid_new md3
WARNING md3 status=[UU__], recovery=recovery, finish=4363639.0min.

bb:~# diff -Nur /tmp/check_linux_raid /tmp/check_linux_raid_new
--- /tmp/check_linux_raid 2010-03-28 23:19:08.0 +0200
+++ /tmp/check_linux_raid_new 2010-03-28 21:33:44.0 +0200
@@ -61,7 +61,7 @@
     if (defined $device) {
         if (/(\[[_U]+\])/) {
             $status{$device} = $1;
-        } elsif (/recovery = (.*?)\s/) {
+        } elsif (/(recovery|resync) = (.*?)\s/) {
             $recovery{$device} = $1;
             ($finish{$device}) = /finish=(.*?min)/;
         } elsif (/^\s*$/) {

I don't see how that should help. :) Did I miss something?

I just had a closer look. What do you think about the following:

--- /tmp/check_linux_raid 2010-03-28 23:19:08.0 +0200
+++ /tmp/check_linux_raid_new 2010-03-28 23:48:11.0 +0200
@@ -50,6 +50,7 @@
 my $msg = "";
 my %status;
 my %recovery;
+my %resync;
 my %finish;
 my %active;
 my %devices;
@@ -64,6 +65,8 @@
         } elsif (/recovery = (.*?)\s/) {
             $recovery{$device} = $1;
             ($finish{$device}) = /finish=(.*?min)/;
+        } elsif (/resync=(.*?)\s/) {
+            $resync{$device} = $1;
         } elsif (/^\s*$/) {
             $device=undef;
         }
@@ -84,6 +87,10 @@
             $msg .= sprintf "%s status=%s, recovery=%s, finish=%s.",
                 $devices{$k}, $status{$k}, $recovery{$k}, $finish{$k};
             $code = max_state($code, WARNING);
+        } elsif (defined $resync{$k}) {
+            $msg .= sprintf "%s status=%s, resync=%s.",
+                $devices{$k}, $status{$k}, $resync{$k};
+            $code = max_state($code, WARNING);
         } else {
             $msg .= sprintf "%s status=%s.", $devices{$k}, $status{$k};
             $code = max_state($code, CRITICAL);

Maybe that's what you are expecting?

With kind regards, Jan.
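The odd "recovery=recovery" in the output of the first patched version comes from capture-group numbering: adding the (recovery|resync) group in front shifts the percentage from group 1 to group 2. A small Python sketch of the same pattern illustrates this (Python, not the plugin's Perl; `PATCHED` and `line` are names chosen only for this example):

```python
import re

# The pattern from the first patch: the new alternation becomes group 1.
PATCHED = re.compile(r"(recovery|resync) = (.*?)\s")

line = "[==..] recovery = 50.3% (474987648/943987328) finish=4363639.0min speed=1K/sec"
m = PATCHED.search(line)

print(m.group(1))  # recovery  (the keyword, which the patched code stored)
print(m.group(2))  # 50.3%     (the value that was intended)

# The pattern also requires spaces around "=", so the DELAYED form is not matched.
print(PATCHED.search("resync=DELAYED "))  # None
```

The last line shows a second limitation of that pattern: the "resync=DELAYED" lines in the example mdstat have no spaces around "=", so they would not match it either.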
Bug#575382: nagios-plugins-standard: check_linux_raid does not warn if resync is in process
Package: nagios-plugins-standard
Version: 1.4.12-5
Severity: normal

Thanks for the last patch. There is another one:

check_linux_raid should also warn if a md device is in resync mode and not only if in recovery mode.

*** /usr/lib/nagios/plugins/check_linux_raid.pl~ Fri Mar 19 12:06:24 2010
--- /usr/lib/nagios/plugins/check_linux_raid.pl Wed Mar 24 00:21:04 2010
***************
*** 61,67 ****
      if (defined $device) {
          if (/(\[[_U]+\])/) {
              $status{$device} = $1;
!         } elsif (/recovery = (.*?)\s/) {
              $recovery{$device} = $1;
              ($finish{$device}) = /finish=(.*?min)/;
              $device=undef;
--- 61,67 ----
      if (defined $device) {
          if (/(\[[_U]+\])/) {
              $status{$device} = $1;
!         } elsif (/(recovery|resync) = (.*?)\s/) {
              $recovery{$device} = $1;
              ($finish{$device}) = /finish=(.*?min)/;
              $device=undef;

Christoph

-- System Information:
Debian Release: 5.0.4
  APT prefers stable
  APT policy: (900, 'stable'), (90, 'oldstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.26-2-686-bigmem (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash

Versions of packages nagios-plugins-standard depends on:
ii  dnsutils         1:9.5.1.dfsg.P3-1+lenny1  Clients provided with BIND
ii  fping            2.4b2-to-ipv6-15          sends ICMP ECHO_REQUEST packets to
ii  host             2331-9                    utility for querying DNS servers
ii  libc6            2.7-18lenny2              GNU C Library: Shared libraries
ii  libldap-2.4-2    2.4.11-1+lenny1           OpenLDAP libraries
ii  libmysqlclient1  5.0.51a-24+lenny3         MySQL database client library
ii  libnet-snmp-per  5.2.0-1                   Script SNMP connections
ii  libpq5           8.3.9-0lenny1             PostgreSQL C client library
ii  libradiusclient  0.5.5-1                   Enhanced RADIUS client library
ii  nagios-plugins-  1.4.12-5                  Plugins for the nagios network mon
ii  qstat            2.11-1                    Command-line tool for querying qua
ii  radiusclient1    0.3.2-11.1                /bin/login replacement which uses
ii  smbclient        2:3.2.5-4lenny9           a LanManager-like simple client fo
ii  snmp             5.4.1~dfsg-12             SNMP (Simple Network Management Pr
ii  ucf              3.0016                    Update Configuration File: preserv

nagios-plugins-standard recommends no packages.

Versions of packages nagios-plugins-standard suggests:
pn  nagios3          <none>                    (no description available)
pn  whois            <none>                    (no description available)

-- no debconf information