Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-05-27 Thread NeilBrown
On Sat, 26 May 2012 18:39:08 +0400 Michael Tokarev  wrote:

> Neil, can you comment on the change to Monitor offered
> in the mentioned bugreport please?
> 
> On 12.04.2012 23:28, Michael Tokarev wrote:
> > Neil, re http://bugs.debian.org/658701 , what do you think:
> > is it okay for mdadm --monitor to send email when a check
> > finds mismatches, the same way it sends email about other
> > more critical errors?
> > 
> > I think Russell has a good point here, but there's one more
> > source of mismatches we have in the kernel - some "sporadic"
> > mismatches in raid1 and raid10, especially when these are
> > used as swap space...
> > 
> > In Debian we've several bugreports already requesting more
> > attention to mismatch_cnt, see:
> > 
> >  http://bugs.debian.org/658701 (this one)
> >  http://bugs.debian.org/599821
> >  http://bugs.debian.org/588516
> > 
> > Thank you!
> > 
> > /mjt

Sorry for not replying the first time :-(

I do not agree with the suggested change to mdadm.
A non-zero mismatch count may not be a problem.
It could be due to swap writing to a RAID1/RAID10.
It could also be due to a RAID1/RAID10/RAID6 having been
created with --assume-clean.  This is a perfectly safe thing
to do but results in a non-zero mismatch_cnt.

mdadm --monitor will run a program on every event.  If someone
wants more events reported than currently are reported, they are
free to write a script to do whatever they like.
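
As a rough illustration of the kind of handler meant here, the sketch below
decides whether a RebuildFinished event is worth reporting.  It assumes the
monitor program is invoked with the event name, the md device and an optional
third argument, and that after a check the third argument carries text of the
form seen in the syslog line from the original report
(" mismatches found: 20608 (on raid level 1)").  The names parse_mismatches
and wants_report and the threshold parameter are made up for this example; a
real handler would still need a main() that passes argv[1] and argv[3]
through and then sends the mail.

```c
/* Hypothetical policy helpers for a program run by mdadm --monitor.
 * Assumption: the program is called as  prog EVENT MD_DEVICE [DETAIL]. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract N from a detail string such as
 * " mismatches found: 20608 (on raid level 1)".
 * Returns -1 if the string is absent or has a different shape. */
static long parse_mismatches(const char *detail)
{
	long n;

	if (detail == NULL)
		return -1;
	/* The leading space in the format skips any leading whitespace. */
	if (sscanf(detail, " mismatches found: %ld", &n) != 1)
		return -1;
	return n;
}

/* Return 1 if this event should be reported, 0 otherwise.
 * threshold is a site-local choice, not something mdadm defines. */
static int wants_report(const char *event, const char *detail, long threshold)
{
	assert(event != NULL);
	if (strcmp(event, "RebuildFinished") != 0)
		return 0;
	return parse_mismatches(detail) > threshold;
}
```

Keeping the threshold in the handler leaves the policy with the admin: an
array known to carry swap could use a high threshold, while a data array
could use 0.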

If md finds unreadable blocks and fixes them, then that certainly
might be interesting.  However that is interesting much more broadly than
just for md, and I believe 'smart' makes that information available.  So
having it reported from SMART would be more sensible.

In brief: mismatch_cnt may be useful to someone who understands what it means
and is investigating some issues, but it is not something that should be
automatically reported to a casual sysadmin.
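
For someone who does want to look at the value while investigating, a minimal
sketch of reading it directly follows.  It assumes the usual sysfs layout,
where the counter for an array such as md0 appears as
/sys/block/md0/md/mismatch_cnt; read_count and mismatch_path are hypothetical
helper names, not part of mdadm.

```c
/* Sketch: read an md array's mismatch counter from sysfs.
 * Assumption: the attribute lives at /sys/block/<name>/md/mismatch_cnt. */
#include <assert.h>
#include <stdio.h>

/* Read a single non-negative integer from a sysfs-style attribute file.
 * Returns the value, or -1 if the file cannot be opened or parsed. */
static long read_count(const char *path)
{
	FILE *fp;
	long value = -1;
	int ok;

	assert(path != NULL);
	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	ok = (fscanf(fp, "%ld", &value) == 1);
	fclose(fp);
	return (ok && value >= 0) ? value : -1;
}

/* Build the attribute path for an array name such as "md0".
 * Returns 0 on success, -1 if the buffer is too small. */
static int mismatch_path(char *buf, size_t len, const char *md_name)
{
	int n;

	assert(buf != NULL && md_name != NULL);
	n = snprintf(buf, len, "/sys/block/%s/md/mismatch_cnt", md_name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would build the path with mismatch_path() and pass it to
read_count(); a result of -1 means the attribute could not be read, not that
the array is clean.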

NeilBrown




Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-05-26 Thread Michael Tokarev
Neil, can you comment on the change to Monitor offered
in the mentioned bugreport please?

On 12.04.2012 23:28, Michael Tokarev wrote:
> Neil, re http://bugs.debian.org/658701 , what do you think:
> is it okay for mdadm --monitor to send email when a check
> finds mismatches, the same way it sends email about other
> more critical errors?
> 
> I think Russell has a good point here, but there's one more
> source of mismatches we have in the kernel - some "sporadic"
> mismatches in raid1 and raid10, especially when these are
> used as swap space...
> 
> In Debian we've several bugreports already requesting more
> attention to mismatch_cnt, see:
> 
>  http://bugs.debian.org/658701 (this one)
>  http://bugs.debian.org/599821
>  http://bugs.debian.org/588516
> 
> Thank you!
> 
> /mjt



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-04-12 Thread Michael Tokarev
Neil, re http://bugs.debian.org/658701 , what do you think:
is it okay for mdadm --monitor to send email when a check
finds mismatches, the same way it sends email about other
more critical errors?

I think Russell has a good point here, but there's one more
source of mismatches we have in the kernel - some "sporadic"
mismatches in raid1 and raid10, especially when these are
used as swap space...

In Debian we've several bugreports already requesting more
attention to mismatch_cnt, see:

 http://bugs.debian.org/658701 (this one)
 http://bugs.debian.org/599821
 http://bugs.debian.org/588516

Thank you!

/mjt






Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-02-11 Thread Michael Tokarev
On 05.02.2012 19:40, Michael Tokarev wrote:
> On 05.02.2012 18:58, Russell Coker wrote:
>> On Mon, 6 Feb 2012, Michael Tokarev  wrote:
[]
>>> And second, more to the point, Neil gave a very good writeup of these
>>> checks and repairs of raid arrays, about deciding which part/component of
>>> the array is "more right".  Unfortunately I can't find it right now.
>>
>> Unfortunately at the moment it seems impossible to determine which disk had 
>> the error, if you even know that there was an error.
> 
> Yes that's the bottom line of that article, and that's exactly what I had
> in mind.  It describes in great detail (without touching latent errors much)
> why it is so.

I meant this one: http://neil.brown.name/blog/20100211050355
"Smart or simple RAID recovery??".

/mjt






Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-02-05 Thread Michael Tokarev
On 05.02.2012 18:58, Russell Coker wrote:
> On Mon, 6 Feb 2012, Michael Tokarev  wrote:
>>> I believe that this is a serious bug, it seems to me that one of the most
>>> significant conditions it can encounter that should be immediately
>>> reported to the sysadmin is the fact that the contents of disks are
>>> changing and breaking RAID consistency!
>>
>> Yes that's the condition it may encounter indeed.  The question is WHY -
>> under normal conditions there should be no such errors.
> 
> http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805
> 
> The disk just has errors sometimes.  The above article has some calculations 
> of the probabilities.

The point here is latent errors only.  Yes, these become more and more
common "per drive" as drives grow in size, but they also become less and less
common with new/improved technologies (like switching to a 4k sector size,
where error detection checksums work a bit differently and have a better
chance of detecting the error).  Note that these are all internal to the drive,
are usually the manufacturer's know-how, and can be changed without breaking
any compatibility whatsoever, since again these are all internal things.
It is just not realistic to extrapolate from current
volumes, because handling larger volumes may require more reliable error
detection mechanisms, be it internal to the drives or by external means
(adding (meta)data checksumming, using various raid techniques and so on).

>> There are two points there.
>>
>> First, a formal one.  Would it be a serious issue if such a check weren't
>> done at all?  I think that in that case this bugreport wouldn't exist to
>> start with.
> 
> http://etbe.coker.com.au/2012/02/06/reliability-raid/

Again, this is a "formal point".  Lack of any scrubbing is a serious
bug, but lack of reporting is a wishlist item; that's all I'm saying, nothing
more.

> If there were no checks at all then we would migrate to BTRFS even sooner; at 
> the above URL I've written some thoughts about BTRFS vs software RAID.
> 
>> And second, more to the point, Neil gave a very good writeup of these
>> checks and repairs of raid arrays, about deciding which part/component of
>> the array is "more right".  Unfortunately I can't find it right now.
> 
> Unfortunately at the moment it seems impossible to determine which disk had 
> the error, if you even know that there was an error.

Yes that's the bottom line of that article, and that's exactly what I had
in mind.  It describes in great detail (without touching latent errors much)
why it is so.

For the future, I think drive manufacturers will do something to reduce
the probability of latent errors dramatically, maybe to
cryptographically-impossible levels, by changing how error detection and
correction are done.

Please note that I don't dispute that the reporting is missing - I'm only
arguing about the severity of the bugreport.

/mjt







Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-02-05 Thread Russell Coker
On Mon, 6 Feb 2012, Michael Tokarev  wrote:
> > I believe that this is a serious bug, it seems to me that one of the most
> > significant conditions it can encounter that should be immediately
> > reported to the sysadmin is the fact that the contents of disks are
> > changing and breaking RAID consistency!
> 
> Yes that's the condition it may encounter indeed.  The question is WHY -
> under normal conditions there should be no such errors.

http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805

The disk just has errors sometimes.  The above article has some calculations 
of the probabilities.

> There are two points there.
> 
> First, a formal one.  Would it be a serious issue if such a check weren't
> done at all?  I think that in that case this bugreport wouldn't exist to
> start with.

http://etbe.coker.com.au/2012/02/06/reliability-raid/

If there were no checks at all then we would migrate to BTRFS even sooner; at 
the above URL I've written some thoughts about BTRFS vs software RAID.

> And second, more to the point, Neil gave a very good writeup of these
> checks and repairs of raid arrays, about deciding which part/component of
> the array is "more right".  Unfortunately I can't find it right now.

Unfortunately at the moment it seems impossible to determine which disk had 
the error, if you even know that there was an error.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/






Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-02-05 Thread Michael Tokarev
On 05.02.2012 16:34, Russell Coker wrote:
> Package: mdadm
> Version: 3.2.3-2
> Severity: important
> 
> Feb  5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device 
> /dev/md0, component device  mismatches found: 20608 (on raid level 1)
> 
> When a check initiated by /etc/cron.d/mdadm finds an error mdadm will discover
> this and log an error such as the above with facility DAEMON.  But it doesn't
> send an email.

This is the same as discussed in #599821 and #588516.  I'll think about
merging all 3 together.

> I believe that this is a serious bug, it seems to me that one of the most
> significant conditions it can encounter that should be immediately reported to
> the sysadmin is the fact that the contents of disks are changing and breaking
> RAID consistency!

Yes that's the condition it may encounter indeed.  The question is WHY - under
normal conditions there should be no such errors.

There are two points there.

First, a formal one.  Would it be a serious issue if such a check weren't done
at all?  I think that in that case this bugreport wouldn't exist to start with.

And second, more to the point, Neil gave a very good writeup of these checks and
repairs of raid arrays, about deciding which part/component of the array is
"more right".  Unfortunately I can't find it right now.

> 
> For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long
> as all the other disks are fine.  If you have an array with double-redundancy
> and one disk fails entirely while another returns dodgy data then you lose,
> and obviously anyone who creates a doubly-redundant array wants protection
> against that sort of thing.
> 
> With a RAID-1 or RAID-5 array every mismatch is an indication of real data
> corruption and is very important.
> 
> The following patch makes mdadm send email about such events.
> 
> --- /tmp/Monitor.c  2012-02-05 23:28:41.873079816 +1100
> +++ ./Monitor.c 2012-02-05 23:32:03.961132380 +1100
> @@ -364,6 +364,7 @@
>      (strncmp(event, "Fail", 4)==0 ||
>       strncmp(event, "Test", 4)==0 ||
>       strncmp(event, "Spares", 6)==0 ||
> +     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
>       strncmp(event, "Degrade", 7)==0)) {
>          FILE *mp = popen(Sendmail, "w");
>          if (mp) {
> 

This might be a more interesting approach than the ones already offered in
the two other patches mentioned.

/mjt






Bug#658701: mdadm: should send email if mismatches are reported by a check

2012-02-05 Thread Russell Coker
Package: mdadm
Version: 3.2.3-2
Severity: important

Feb  5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device 
/dev/md0, component device  mismatches found: 20608 (on raid level 1)

When a check initiated by /etc/cron.d/mdadm finds an error mdadm will discover
this and log an error such as the above with facility DAEMON.  But it doesn't
send an email.

I believe that this is a serious bug, it seems to me that one of the most
significant conditions it can encounter that should be immediately reported to
the sysadmin is the fact that the contents of disks are changing and breaking
RAID consistency!

For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long
as all the other disks are fine.  If you have an array with double-redundancy
and one disk fails entirely while another returns dodgy data then you lose,
and obviously anyone who creates a doubly-redundant array wants protection
against that sort of thing.

With a RAID-1 or RAID-5 array every mismatch is an indication of real data
corruption and is very important.

The following patch makes mdadm send email about such events.

--- /tmp/Monitor.c  2012-02-05 23:28:41.873079816 +1100
+++ ./Monitor.c 2012-02-05 23:32:03.961132380 +1100
@@ -364,6 +364,7 @@
     (strncmp(event, "Fail", 4)==0 ||
      strncmp(event, "Test", 4)==0 ||
      strncmp(event, "Spares", 6)==0 ||
+     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
      strncmp(event, "Degrade", 7)==0)) {
         FILE *mp = popen(Sendmail, "w");
         if (mp) {
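
Pulled out of Monitor.c into a standalone function, the condition as changed
by the patch can be sketched as below.  should_mail is a hypothetical name
used only here, and the sketch assumes that disc (the third argument of the
alert, printed as "component device" in the log line above) is NULL unless a
check reported mismatches.  Because the comparisons are prefix matches via
strncmp, longer event names that start with these strings also match.

```c
/* Standalone sketch of the patched event test; not the actual mdadm source. */
#include <assert.h>
#include <string.h>

/* Return 1 if the event would trigger mail under the patched condition.
 * disc is the optional detail argument; the RebuildFinished case only
 * fires when it is non-NULL. */
static int should_mail(const char *event, const char *disc)
{
	assert(event != NULL);
	return strncmp(event, "Fail", 4) == 0 ||
	       strncmp(event, "Test", 4) == 0 ||
	       strncmp(event, "Spares", 6) == 0 ||
	       (strncmp(event, "RebuildFinished", 15) == 0 && disc != NULL) ||
	       strncmp(event, "Degrade", 7) == 0;
}
```

With this test a RebuildFinished event that carries no detail string stays
silent, so an ordinary completed check or resync would not generate mail.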

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages mdadm depends on:
ii  debconf      1.5.41
ii  initscripts  2.88dsf-22
ii  libc6        2.13-25
ii  lsb-base     3.2-28.1
ii  udev         175-3

Versions of packages mdadm recommends:
ii  module-init-tools   3.16-1
ii  postfix [mail-transport-agent]  2.8.7-1

mdadm suggests no packages.

-- debconf information:
  mdadm/initrdstart_msg_errexist:
  mdadm/initrdstart_msg_intro:
* mdadm/autostart: false
  mdadm/autocheck: true
  mdadm/initrdstart_msg_errblock:
  mdadm/mail_to: root
  mdadm/initrdstart_msg_errmd:
* mdadm/initrdstart: none
  mdadm/initrdstart_msg_errconf:
  mdadm/initrdstart_notinconf: false
  mdadm/start_daemon: true


