Bug#658701: mdadm: should send email if mismatches are reported by a check
On Sat, 26 May 2012 18:39:08 +0400 Michael Tokarev wrote:

> Neil, can you comment on the change to Monitor offered
> in the mentioned bug report, please?
>
> On 12.04.2012 23:28, Michael Tokarev wrote:
> > Neil, re http://bugs.debian.org/658701 , what do you think:
> > is it okay if mdadm --monitor sends email when a check
> > finds mismatches, the same way it sends email about other,
> > more critical errors?
> >
> > I think Russell has a good point here, but there's one more
> > source of mismatches in the kernel - some "sporadic"
> > mismatches in raid1 and raid10, especially when these are
> > used as swap space...
> >
> > In Debian we have several bug reports already requesting more
> > attention to mismatch_cnt, see:
> >
> > http://bugs.debian.org/658701 (this one)
> > http://bugs.debian.org/599821
> > http://bugs.debian.org/588516
> >
> > Thank you!
> >
> > /mjt

Sorry for not replying the first time :-(

I do not agree with the suggested change to mdadm.

A non-zero mismatch count may not be a problem. It could be due to swap writing to a RAID1/RAID10. It could also be due to a RAID1/RAID10/RAID6 having been created with --assume-clean. This is a perfectly safe thing to do but results in a non-zero mismatch_cnt.

mdadm --monitor will run a program on every event. If someone wants more events reported than are currently reported, they are free to write a script to do whatever they like.

If md finds unreadable blocks and fixes them, then that certainly might be interesting. However, that is interesting much more broadly than just for md, and I believe 'smart' makes that information available. So having it reported from SMART would be more sensible.

In brief: mismatch_cnt may be useful to someone who understands what it means and is investigating some issues, but it is not something that should be automatically reported to a casual sysadmin.

NeilBrown
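The "write a script" route mentioned above could look roughly like the following. This is only a sketch, not part of mdadm: the function names are hypothetical, and it assumes the handler receives the event name, the md device and an optional third argument, and that for a RebuildFinished event with mismatches the third argument carries the same text seen in the log line quoted in this bug report ("mismatches found: 20608 (on raid level 1)").

```python
import re

# Detail text as it appears in the log line quoted in this bug report.
# That mdadm hands the same text to the handler is an assumption of this sketch.
MISMATCH_RE = re.compile(r"mismatches found: (\d+)")


def mismatch_count(event, detail=None):
    """Return the mismatch count carried by a RebuildFinished event, else 0."""
    if event != "RebuildFinished" or not detail:
        return 0
    match = MISMATCH_RE.search(detail)
    return int(match.group(1)) if match else 0


def format_report(event, device, detail=None):
    """Build a one-line report, or return None when there is nothing to say."""
    count = mismatch_count(event, detail)
    if count == 0:
        return None
    return f"{device}: check finished with mismatch_cnt={count}"


# Example call with the values from the quoted log line.
print(format_report("RebuildFinished", "/dev/md0",
                    "mismatches found: 20608 (on raid level 1)"))
```

A small wrapper that calls format_report(*sys.argv[1:4]) and pipes any non-empty result to a mailer could then be named with mdadm's --program option or a PROGRAM line in mdadm.conf. Whether a non-zero count deserves a mail at all is exactly the policy question debated in this thread, so the decision is left to the wrapper.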
Bug#658701: mdadm: should send email if mismatches are reported by a check
Neil, can you comment on the change to Monitor offered in the mentioned bug report, please?

On 12.04.2012 23:28, Michael Tokarev wrote:
> Neil, re http://bugs.debian.org/658701 , what do you think:
> is it okay if mdadm --monitor sends email when a check
> finds mismatches, the same way it sends email about other,
> more critical errors?
>
> I think Russell has a good point here, but there's one more
> source of mismatches in the kernel - some "sporadic"
> mismatches in raid1 and raid10, especially when these are
> used as swap space...
>
> In Debian we have several bug reports already requesting more
> attention to mismatch_cnt, see:
>
> http://bugs.debian.org/658701 (this one)
> http://bugs.debian.org/599821
> http://bugs.debian.org/588516
>
> Thank you!
>
> /mjt
Bug#658701: mdadm: should send email if mismatches are reported by a check
Neil, re http://bugs.debian.org/658701 , what do you think: is it okay if mdadm --monitor sends email when a check finds mismatches, the same way it sends email about other, more critical errors?

I think Russell has a good point here, but there's one more source of mismatches in the kernel - some "sporadic" mismatches in raid1 and raid10, especially when these are used as swap space...

In Debian we have several bug reports already requesting more attention to mismatch_cnt, see:

http://bugs.debian.org/658701 (this one)
http://bugs.debian.org/599821
http://bugs.debian.org/588516

Thank you!

/mjt
Bug#658701: mdadm: should send email if mismatches are reported by a check
On 05.02.2012 19:40, Michael Tokarev wrote:
> On 05.02.2012 18:58, Russell Coker wrote:
>> On Mon, 6 Feb 2012, Michael Tokarev wrote:
[]
>>> And second, more to the point, Neil gave a very good writeup of these
>>> checks and repairs of raid arrays, about deciding which part/component of
>>> the array is "more right". Unfortunately I can't find it right now.
>>
>> Unfortunately at the moment it seems impossible to determine which disk had
>> the error, if you even know that there was an error.
>
> Yes, that's the bottom line of that article, and that's exactly what I had
> in mind. It describes in great detail (without touching latent errors much)
> why it is so.

I meant this one: http://neil.brown.name/blog/20100211050355 "Smart or simple RAID recovery??".

/mjt
Bug#658701: mdadm: should send email if mismatches are reported by a check
On 05.02.2012 18:58, Russell Coker wrote:
> On Mon, 6 Feb 2012, Michael Tokarev wrote:
>>> I believe that this is a serious bug, it seems to me that one of the most
>>> significant conditions it can encounter that should be immediately
>>> reported to the sysadmin is the fact that the contents of disks are
>>> changing and breaking RAID consistency!
>>
>> Yes, that's the condition it may encounter indeed. The question is WHY -
>> under normal conditions there should be no such errors.
>
> http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805
>
> The disk just has errors sometimes. The above article has some calculations
> of the probabilities.

The point here is latent errors only. Yes, these become more and more common "per drive" as drives grow in size, and they also become less and less common with new/improved technologies (like switching to 4k sector size, where error detection checksums work a bit differently and have more chances to detect the error). Note that all of this is internal to the drive, is usually the manufacturer's know-how, and can be changed without breaking any compatibility whatsoever, since again these are all internal things. It is just not realistic to draw an extrapolation line based on current volumes, because handling larger volumes may require more reliable error detection mechanisms, be it internal to the drives or by external means (adding (meta)data checksumming, using various raid techniques and so on).

>> There are two points there.
>>
>> First, a formal one. Would it be a serious issue if such a check weren't
>> done at all? I think that in this case this bug report wouldn't exist to
>> start with.
>
> http://etbe.coker.com.au/2012/02/06/reliability-raid/

I repeat, this is a "formal point". Lack of any scrubbing is a serious bug, but lack of reporting is a wishlist item; that's what I'm saying, nothing more.

> If there were no checks at all then we would migrate to BTRFS even sooner, at
> the above URL I've written some of the thoughts about BTRFS vs software RAID.
>
>> And second, more to the point, Neil gave a very good writeup of these
>> checks and repairs of raid arrays, about deciding which part/component of
>> the array is "more right". Unfortunately I can't find it right now.
>
> Unfortunately at the moment it seems impossible to determine which disk had
> the error, if you even know that there was an error.

Yes, that's the bottom line of that article, and that's exactly what I had in mind. It describes in great detail (without touching latent errors much) why it is so.

For the future, I think drive manufacturers will do something to reduce the probability of latent errors dramatically, maybe to cryptographically-impossible levels, by changing how error detection and correction are done.

Please note that I'm not disputing the lack of reporting, only the severity of the bug report.

/mjt
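The two opposing trends described above (bigger drives raise the per-drive chance of a latent error, better error detection lowers it) can be illustrated with a back-of-the-envelope model. This is a sketch under stated assumptions, not the calculation from the linked article: it treats bit errors as independent, and the error rates used (one per 10^14 and one per 10^15 bits read) are illustrative figures chosen here, not values taken from this thread.

```python
import math


def p_any_latent_error(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable error while reading
    bytes_read bytes, assuming independent errors at a fixed per-bit rate.

    Computes 1 - (1 - p)**bits via log1p/expm1 so that tiny rates do not
    vanish in floating-point rounding.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))


TB = 10 ** 12  # decimal terabyte, as used on drive labels

# Illustrative sizes and rates only; both are assumptions of this sketch.
for size_tb in (2, 12):
    for rate in (1e-14, 1e-15):
        p = p_any_latent_error(size_tb * TB, rate)
        print(f"{size_tb:>2} TB read at {rate:g} errors/bit: {p:.3f}")
```

Under this simple model the probability grows with the amount of data read and shrinks as the per-bit rate improves, which is why extrapolating from a fixed rate to much larger drives is questionable.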
Bug#658701: mdadm: should send email if mismatches are reported by a check
On Mon, 6 Feb 2012, Michael Tokarev wrote:
> > I believe that this is a serious bug, it seems to me that one of the most
> > significant conditions it can encounter that should be immediately
> > reported to the sysadmin is the fact that the contents of disks are
> > changing and breaking RAID consistency!
>
> Yes, that's the condition it may encounter indeed. The question is WHY -
> under normal conditions there should be no such errors.

http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805

The disk just has errors sometimes. The above article has some calculations of the probabilities.

> There are two points there.
>
> First, a formal one. Would it be a serious issue if such a check weren't
> done at all? I think that in this case this bug report wouldn't exist to
> start with.

http://etbe.coker.com.au/2012/02/06/reliability-raid/

If there were no checks at all then we would migrate to BTRFS even sooner; at the above URL I've written some thoughts about BTRFS vs software RAID.

> And second, more to the point, Neil gave a very good writeup of these
> checks and repairs of raid arrays, about deciding which part/component of
> the array is "more right". Unfortunately I can't find it right now.

Unfortunately at the moment it seems impossible to determine which disk had the error, if you even know that there was an error.

-- 
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/
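Why a mismatch on a two-way mirror says nothing about which disk is wrong, while a third copy at least allows a guess, can be shown with a toy comparison. This is purely an illustration of the counting argument: the function name is hypothetical, and it is not a description of what md's check or repair actually does.

```python
from collections import Counter


def classify_copies(copies):
    """Compare the same block as read from each mirror member.

    Returns ("consistent", None) when all copies agree,
    ("located", index) when exactly one copy disagrees with a strict majority,
    and ("ambiguous", None) when a mismatch exists but no culprit stands out.
    """
    counts = Counter(copies)
    if len(counts) == 1:
        return ("consistent", None)
    winner, votes = counts.most_common(1)[0]
    minority = [i for i, block in enumerate(copies) if block != winner]
    if votes * 2 > len(copies) and len(minority) == 1:
        return ("located", minority[0])
    return ("ambiguous", None)


print(classify_copies([b"data", b"data"]))           # two-way mirror, clean
print(classify_copies([b"data", b"dat?"]))           # two-way mirror, mismatch
print(classify_copies([b"data", b"data", b"dat?"]))  # three-way mirror, mismatch
```

Even in the three-copy case the majority is only a guess: nothing in the comparison proves that the two matching copies hold the correct data.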
Bug#658701: mdadm: should send email if mismatches are reported by a check
On 05.02.2012 16:34, Russell Coker wrote:
> Package: mdadm
> Version: 3.2.3-2
> Severity: important
>
> Feb 5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device
> /dev/md0, component device mismatches found: 20608 (on raid level 1)
>
> When a check initiated by /etc/cron.d/mdadm finds an error mdadm will discover
> this and log an error such as the above with facility DAEMON. But it doesn't
> send an email.

This is the same as discussed in #599821 and #588516. I'll think about merging all 3 together.

> I believe that this is a serious bug, it seems to me that one of the most
> significant conditions it can encounter that should be immediately reported to
> the sysadmin is the fact that the contents of disks are changing and breaking
> RAID consistency!

Yes, that's the condition it may encounter indeed. The question is WHY - under normal conditions there should be no such errors.

There are two points there.

First, a formal one. Would it be a serious issue if such a check weren't done at all? I think that in this case this bug report wouldn't exist to start with.

And second, more to the point, Neil gave a very good writeup of these checks and repairs of raid arrays, about deciding which part/component of the array is "more right". Unfortunately I can't find it right now.

> For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long
> as all the other disks are fine. If you have an array with double-redundancy
> and one disk fails entirely while another returns dodgy data then you lose,
> and obviously anyone who creates a doubly-redundant array wants protection
> against that sort of thing.
>
> With a RAID-1 or RAID-5 array every mismatch is an indication of real data
> corruption and is very important.
>
> The following patch makes mdadm send email about such events.
>
> --- /tmp/Monitor.c 2012-02-05 23:28:41.873079816 +1100
> +++ ./Monitor.c 2012-02-05 23:32:03.961132380 +1100
> @@ -364,6 +364,7 @@
>      (strncmp(event, "Fail", 4)==0 ||
>       strncmp(event, "Test", 4)==0 ||
>       strncmp(event, "Spares", 6)==0 ||
> +     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
>       strncmp(event, "Degrade", 7)==0)) {
>      FILE *mp = popen(Sendmail, "w");
>      if (mp) {

This might be a more interesting approach than those already offered in the two other mentioned patches.

/mjt
Bug#658701: mdadm: should send email if mismatches are reported by a check
Package: mdadm
Version: 3.2.3-2
Severity: important

Feb 5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 20608 (on raid level 1)

When a check initiated by /etc/cron.d/mdadm finds an error mdadm will discover this and log an error such as the above with facility DAEMON. But it doesn't send an email.

I believe that this is a serious bug, it seems to me that one of the most significant conditions it can encounter that should be immediately reported to the sysadmin is the fact that the contents of disks are changing and breaking RAID consistency!

For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long as all the other disks are fine. If you have an array with double-redundancy and one disk fails entirely while another returns dodgy data then you lose, and obviously anyone who creates a doubly-redundant array wants protection against that sort of thing.

With a RAID-1 or RAID-5 array every mismatch is an indication of real data corruption and is very important.

The following patch makes mdadm send email about such events.

--- /tmp/Monitor.c 2012-02-05 23:28:41.873079816 +1100
+++ ./Monitor.c 2012-02-05 23:32:03.961132380 +1100
@@ -364,6 +364,7 @@
     (strncmp(event, "Fail", 4)==0 ||
      strncmp(event, "Test", 4)==0 ||
      strncmp(event, "Spares", 6)==0 ||
+     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
      strncmp(event, "Degrade", 7)==0)) {
     FILE *mp = popen(Sendmail, "w");
     if (mp) {

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages mdadm depends on:
ii  debconf      1.5.41
ii  initscripts  2.88dsf-22
ii  libc6        2.13-25
ii  lsb-base     3.2-28.1
ii  udev         175-3

Versions of packages mdadm recommends:
ii  module-init-tools               3.16-1
ii  postfix [mail-transport-agent]  2.8.7-1

mdadm suggests no packages.

-- debconf information:
  mdadm/initrdstart_msg_errexist:
  mdadm/initrdstart_msg_intro:
* mdadm/autostart: false
  mdadm/autocheck: true
  mdadm/initrdstart_msg_errblock:
  mdadm/mail_to: root
  mdadm/initrdstart_msg_errmd:
* mdadm/initrdstart: none
  mdadm/initrdstart_msg_errconf:
  mdadm/initrdstart_notinconf: false
  mdadm/start_daemon: true
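For anyone who wants to look at the number behind the log line directly, the kernel exposes it per array as md/mismatch_cnt in sysfs. The following is a minimal sketch of reading it: the helper names and the threshold parameter are hypothetical, and the demo runs against a fake directory tree so that it works on a machine without an md array.

```python
from pathlib import Path
import tempfile


def read_mismatch_cnt(array, sysfs_root="/sys/block"):
    """Return mismatch_cnt for an md array such as "md0", or None if the
    file is missing or unreadable."""
    path = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def needs_attention(count, threshold=0):
    """Hypothetical policy knob: flag counts above a chosen threshold."""
    return count is not None and count > threshold


# Demo against a fake sysfs tree; the value is the one from the log line above.
with tempfile.TemporaryDirectory() as fake_root:
    md_dir = Path(fake_root) / "md0" / "md"
    md_dir.mkdir(parents=True)
    (md_dir / "mismatch_cnt").write_text("20608\n")
    demo_count = read_mismatch_cnt("md0", sysfs_root=fake_root)
    missing = read_mismatch_cnt("md1", sysfs_root=fake_root)

print(demo_count, needs_attention(demo_count))
```

The threshold is there because a non-zero count is not always a sign of trouble, so a site may prefer to flag only counts above some value of its own choosing.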