Miles Nordin wrote:
>>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>>>             
>
>     re> not all devices return error codes which indicate
>     re> unrecoverable reads.
>
> What you mean is, ``devices sometimes return bad data instead of an
> error code.''
>
> If you really mean there are devices out there which never return
> error codes, and always silently return bad data, please tell us which
> one and the story of when you encountered it, because I'm incredulous.
> I've never seen or heard of anything like that.  Not even 5.25"
> floppies do that.
>   

I blogged about one such case.
http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

However, I'm not inclined to publicly chastise the vendor or device model.
It is a major vendor and a popular device. 'nuff said.

> Well...wait, actually I have.  I heard some SGI disks had special
> firmware which could be ordered to behave this way, and some kind of
> ioctl or mount option to turn it on per-file or per-filesystem.  But
> the drives wouldn't disable error reporting unless ordered to.
> Another interesting lesson SGI offers here: they pushed this feature
> through their entire stack.  The point was, for some video playback,
> data which arrives after the playback point has passed is just as
> useless as silently corrupt data, so the disk, driver, filesystem, all
> need to modify their exception handling to deliver the largest amount
> of on-time data possible, rather than the traditional goal of
> eventually returning the largest amount of correct data possible and
> clear errors instead of silent corruption.  This whole-stack approach
> is exactly what I thought ``green line'' was promising, and exactly
> what's kept out of Solaris by the ``go blame the drivers'' mantra.
>
> Maybe I was thinking of this SGI firmware when I suggested the
> customized firmware netapp loads into the drives in their study could
> silently return bad data more often than the firmware we're all using,
> the standard firmware with 512-byte sectors intended for RAID layers
> without block checksums.
>
>     re> I would love for you produce data to that effect.
>
> Read the netapp paper you cited earlier
>
>   
> http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
>
> on page 234 there's a comparison of the relative prevalence of each
> kind of error.
>
>   Latent sector errors / Unrecoverable reads
>
>    nearline disks experiencing latent read errors per year:   9.5%
>   

This number should scare the *%^ out of you.  It basically means
that running without data redundancy is a recipe for disaster.
Fortunately, ZFS can give you data redundancy without requiring a
logical volume manager to mirror your data.  This is especially
useful on single-disk systems such as laptops.
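
For example, on a single-disk pool you can ask ZFS to keep extra
copies of your data via the copies property.  A minimal sketch (the
pool and dataset names here are placeholders):

  # keep two copies of each block in this dataset (ditto blocks);
  # applies only to data written after the property is set
  zfs set copies=2 tank/home

  # confirm the setting
  zfs get copies tank/home

This roughly doubles the space consumed by that dataset, and it
protects against bad sectors and silent corruption, not against
losing the whole disk.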

>    Netapp calls the UNC errors, where the drive returns an error
>    instead of data, ``latent sector errors.''  Software RAID systems
>    other than ZFS *do* handle this error, usually better than ZFS to
>    my impression.  And AIUI when it doesn't freeze and reboot, ZFS
>    counts this as a READ error.  In addition to reporting it, most
>    consumer drives seem to log the last five of these non-volatilely,
>    and you can read the log with 'smartctl -a' (if you're using Linux
>    always, or under Solaris only if smartctl is working with your
>    particular disk driver).
>
>
>   Silent corruption
>
>    nearline disks experiencing silent corruption per year:    0.466%
>
>    What netapp calls ``silent data corruption'' is bad data silently
>    returned by drives with no error indication, counted by ZFS as
>    CKSUM and seems not to cause ZFS to freeze.  I think you have been
>    lumping this in with unrecoverable reads, but using the word
>    ``silent'' makes it clearer because unrecoverable makes it sound to
>    me like the drive tried to recover, and failed, in which case the
>    drive probably also reported the error making it a ``latent sector
>    error''.
>   

Likewise, this number should scare you.  AFAICT, logical volume
managers like SVM will not detect this.

Terminology-wise, silent errors are, by definition, not detected.  But
in the literature you will see the term in failure studies where the
author wants to distinguish a system which detects such errors from
one which does not.
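
With ZFS, a scrub reads back and verifies every checksummed block, so
this class of error shows up in the CKSUM column.  Roughly like this
(the pool name and counts below are made up for illustration):

  # read back and verify all data in the pool
  zpool scrub tank

  # checksum errors appear in the CKSUM column
  zpool status -v tank
        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     2
            c0t1d0    ONLINE       0     0     0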

>
>   filesystem corruption
>
>    This is also discovered silently w.r.t. the driver: the corruption
>    that happens to ZFS systems when SAN targets disappear suddenly or
>    when you offline a target and then reboot (which is also counted in
>    the CKSUM column, and which ZFS-level redundancy also helps fix).
>    I would call this ``ZFS bugs'', ``filesystem corruption,'' or
>    ``manual resilvering''.  Obviously it's not included on the Netapp
>    table.  It would be nice if ZFS had two separate CKSUM columns to
>    distinguish between what netapp calls ``checksum errors'' vs
>    ``identity discrepancies''.  For ZFS the ``checksum error'' would
>    point with high certainty to the storage and silent corruption, and
>    the ``identity discrepancy'' would be more like filesystem
>    corruption and flag things like one side of a mirror being
>    out-of-date when ZFS thinks it shouldn't be.  but currently we have
>    only one CKSUM column for both cases.
>
>   

This distinction is noted in the FMA e-reports.
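
If you want to look at them, fmdump will show the raw e-reports.
Something along these lines (abbreviated, illustrative output):

  # list the error reports the fault manager has received
  fmdump -e

  # dump each e-report in full; the class names distinguish checksum
  # errors from other failures, e.g.
  fmdump -eV
  ...
        class = ereport.fs.zfs.checksum
  ...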

> so, I would say, yes, the type of read error that other software RAID
> systems besides ZFS do still handle is a lot more common: 9.5%/yr vs
> 0.466%/yr for nearline disks, and the same ~20x factor for enterprise
> disks.  The rare silent error which other software LVM's miss and only
> ZFS/Netapp/EMC/... handles is still common enough to worry about, at
> least on the nearline disks in the Netapp drive population.
>   

0.466%/yr is a per-disk rate.  If you have 10 disks, your exposure
is roughly 4.6% per year; with 100 disks it is on the order of 40%
per year, and so on (the sketch below shows the exact calculation).
For systems with thousands of disks this is a big problem.
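
More precisely, if each disk independently has a p = 0.466% chance of
seeing silent corruption in a year, the chance that at least one of N
disks is hit is 1 - (1 - p)^N.  A quick back-of-the-envelope check
(assuming independence):

  # chance that at least one of N disks sees silent corruption in a
  # year, given a 0.466%/yr per-disk rate
  awk 'BEGIN { p = 0.00466;
               for (n = 10; n <= 10000; n *= 10)
                 printf "%6d disks: %5.1f%%\n", n, 100*(1 - (1-p)^n) }'

That works out to about 4.6% for 10 disks, 37% for 100, and better
than 99% for 1000, so a thousand-disk shop is all but guaranteed to
see it every year.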

But I don't think a rate per unit time is the best way to look at
this problem, because if you never read the data, you don't care.
This is why disk vendors spec unrecoverable error rates (UERs) as a
rate per bits read.  I have some field data on bits read over time,
but routine activities like backups, zfs sends, or scrubs can change
the number of bits read per unit time by a significant amount.
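
To put the per-bits-read spec in perspective, a common consumer-drive
rating is one unrecoverable read per 1e14 bits read, which is one
expected error for roughly every 12.5 TB read:

  # 1e14 bits -> bytes -> TB
  awk 'BEGIN { printf "%.1f TB read per expected error\n", 1e14/8/1e12 }'

A scrub or a full backup of a large pool reads a meaningful fraction
of that, which is why how much you read matters more than how long
the disks have been spinning.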

> What this also shows, though, is that about 1 in 10 drives will return
> an UNC per year, and possibly cause ZFS to freeze up.  It's worth
> worrying about availability during an exception as common as that---it
> might even be more important for some applications than catching the
> silent corruption.  not for my own application, but for some readily
> imaginable ones, yes.
>   

UNCs don't cause ZFS to freeze as long as failmode != wait or
ZFS manages the data redundancy.
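
The failmode behavior is a per-pool property.  A quick illustration
("tank" is a placeholder pool name):

  # check the current setting (the default is wait)
  zpool get failmode tank

  # return an error to applications instead of blocking I/O when the
  # pool suffers an unrecoverable failure
  zpool set failmode=continue tank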
 -- richard
