Bug#658701: mdadm: should send email if mismatches are reported by a check

Michael Tokarev Sun, 05 Feb 2012 07:45:16 -0800

On 05.02.2012 18:58, Russell Coker wrote:
> On Mon, 6 Feb 2012, Michael Tokarev <m...@tls.msk.ru> wrote:
>>> I believe that this is a serious bug, it seems to me that one of the most
>>> significant conditions it can encounter that should be immediately
>>> reported to the sysadmin is the fact that the contents of disks are
>>> changing and breaking RAID consistency!
>>
>> Yes that's the condition it may encounter indeed.  The question is WHY -
>> under normal conditions there should be no such errors.
> 
> http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805
> 
> The disk just has errors sometimes.  The above article has some calculations 
> of the probabilities.


The point here is latent errors only.  Yes, these become more and more
common "per drive" as drives grow in size, but they also become less and
less common with new or improved technologies (like the switch to 4k
sectors, where the error detection checksums work a bit differently and
have a better chance of detecting an error).  Note that all of this is
internal to the drive, is usually the manufacturer's know-how, and can be
changed without breaking any compatibility whatsoever.  It is just not
realistic to extrapolate from current volumes, because handling larger
volumes may require more reliable error detection mechanisms, whether
internal to the drives or by external means (adding (meta)data
checksumming, using various RAID techniques and so on).
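
For reference, the kind of straight-line estimate being questioned here
can be sketched as follows.  This is only an illustration: it assumes
independent bit errors and a fixed unrecoverable read error (URE) rate of
1 per 10^14 bits, a commonly quoted consumer-drive spec that is an
assumption here and not a figure from this thread.  The function name is
hypothetical.

```python
import math


def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of hitting at least one URE while reading bytes_read bytes.

    Assumes every bit fails independently with probability ure_per_bit
    (a simplification; real drives do not fail this uniformly).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to avoid rounding to zero
    return -math.expm1(bits * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    ASSUMED_URE_RATE = 1e-14  # assumed spec value, not from this thread
    for terabytes in (1, 4, 12):
        p = p_at_least_one_ure(terabytes * 1e12, ASSUMED_URE_RATE)
        print(f"{terabytes:>2} TB read: P(at least one URE) = {p:.3f}")
```

The result depends entirely on the assumed per-bit rate staying constant
as capacity grows, which is exactly the assumption disputed above.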

>> There are two points there.
>>
>> First, a formal one.  Would it be a serious issue if such a check
>> weren't done at all?  I think that in that case this bug report
>> wouldn't exist to start with.
> 
> http://etbe.coker.com.au/2012/02/06/reliability-raid/

Again, this is the "formal point".  Lack of any scrubbing would be a
serious bug, but lack of reporting is a wishlist item.  That's all I'm
saying, nothing more.

> If there were no checks at all then we would migrate to BTRFS even sooner.
> At the above URL I've written some thoughts about BTRFS vs software RAID.
> 
>> And second, more to the point, Neil gave a very good writeup of these
>> checks and repairs of raid arrays, about deciding which part/component of
>> the array is "more right".  Unfortunately I can't find it right now.
> 
> Unfortunately at the moment it seems impossible to determine which disk had 
> the error, if you even know that there was an error.

Yes, that's the bottom line of that article, and that's exactly what I had
in mind.  It describes in great detail (without touching latent errors much)
why it is so.

For the future, I think drive manufacturers will do something to reduce
the probability of latent errors dramatically, maybe to cryptographically
impossible levels, by changing how error detection and correction are done.

Please note that I'm not disputing that the lack of reporting is a
problem, only the severity assigned to the bug report.
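
To make the missing piece concrete, here is a minimal sketch of what such
reporting could look like.  It is not how mdadm does or would do it.  It
relies on the kernel's md sysfs attribute mismatch_cnt under
/sys/block/mdX/md/; the function names are hypothetical, and actually
mailing the report is left out.

```python
from pathlib import Path
from typing import Dict


def find_mismatches(sysfs_root: str = "/sys/block") -> Dict[str, int]:
    """Return {array_name: mismatch_cnt} for md arrays with a nonzero count."""
    found = {}
    for counter in sorted(Path(sysfs_root).glob("md*/md/mismatch_cnt")):
        try:
            count = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # array went away or the file is unreadable; skip it
        if count > 0:
            # counter is <root>/mdX/md/mismatch_cnt, so two levels up is mdX
            found[counter.parent.parent.name] = count
    return found


def format_report(mismatches: Dict[str, int]) -> str:
    """Build a plain-text report suitable for a mail body."""
    lines = [f"{name}: mismatch_cnt = {count}"
             for name, count in sorted(mismatches.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    result = find_mismatches()
    if result:
        # A real tool would hand this to the local mailer instead.
        print("RAID check reported mismatches:\n" + format_report(result))
```

The counter only says that mismatches were found during the last check,
not which component device holds the bad data, so a report like this can
alert the sysadmin but cannot say which disk to distrust.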

/mjt



