>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:

    re> not all devices return error codes which indicate
    re> unrecoverable reads.

What you mean is, ``devices sometimes return bad data instead of an
error code.''

If you really mean there are devices out there which never return
error codes, and always silently return bad data, please tell us which
one and the story of when you encountered it, because I'm incredulous.
I've never seen or heard of anything like that.  Not even 5.25"
floppies do that.

Well...wait, actually I have.  I heard some SGI disks had special
firmware which could be ordered to behave this way, and some kind of
ioctl or mount option to turn it on per-file or per-filesystem.  But
the drives wouldn't disable error reporting unless ordered to.
Another interesting lesson SGI offers here: they pushed this feature
through their entire stack.  The point was, for some video playback,
data which arrives after the playback point has passed is just as
useless as silently corrupt data, so the disk, driver, and filesystem
all need to modify their exception handling to deliver the largest
amount of on-time data possible, rather than pursuing the traditional
goal of eventually returning the largest amount of correct data
possible, with clear errors instead of silent corruption.  This
whole-stack approach
is exactly what I thought ``green line'' was promising, and exactly
what's kept out of Solaris by the ``go blame the drivers'' mantra.
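
To make the idea concrete, here's a rough sketch of the difference
between the two exception-handling policies (plain Python, with names
and the fake device invented by me; nothing to do with SGI's actual
interfaces):

  import time

  class FakeDisk(object):
      # stand-in for a block device; read() returns (data, ok)
      block_size = 512
      def read(self, lba):
          return b"\0" * self.block_size, True

  def read_retry_until_correct(dev, lba, retries=5, delay=1.0):
      # traditional policy: keep retrying until the data comes back
      # good, and raise a clear error if it never does
      for _ in range(retries):
          data, ok = dev.read(lba)
          if ok:
              return data
          time.sleep(delay)             # drive recalibrates, re-reads, ...
      raise IOError("unrecoverable read at LBA %d" % lba)

  def read_before_deadline(dev, lba, deadline):
      # playback policy: one quick attempt; if the frame's deadline has
      # passed (or the read failed), hand back filler rather than stall
      # the whole pipeline on retries
      data, ok = dev.read(lba)
      if ok and time.time() <= deadline:
          return data
      return b"\0" * dev.block_size     # late data is as useless as bad data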

Maybe I was thinking of this SGI firmware when I suggested that the
customized firmware Netapp loads into the drives in their study could
silently return bad data more often than the firmware the rest of us
are using: the standard 512-byte-sector firmware intended for RAID
layers without block checksums.

    re> I would love for you produce data to that effect.

Read the Netapp paper you cited earlier:

  
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

On page 234 there's a comparison of the relative prevalence of each
kind of error.

  Latent sector errors / Unrecoverable reads

   nearline disks experiencing latent read errors per year:   9.5%

   Netapp calls the UNC errors, where the drive returns an error
   instead of data, ``latent sector errors.''  Software RAID systems
   other than ZFS *do* handle this error, usually better than ZFS in
   my impression.  And AIUI, when it doesn't freeze and reboot, ZFS
   counts this as a READ error.  In addition to reporting it, most
   consumer drives seem to log the last five of these non-volatilely,
   and you can read the log with 'smartctl -a' (always, if you're
   using Linux; under Solaris, only if smartctl works with your
   particular disk driver).
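
   Here's a rough sketch of how you could watch for these yourself; it
   assumes smartctl is installed and that the drive prints an ``ATA
   Error Count:'' line when its error log is non-empty, so treat the
   parsing as an assumption rather than gospel:

     import re, subprocess

     def ata_error_count(device):
         # run 'smartctl -a' and pull out the drive's logged error
         # count, if any; returns 0 when no such line is printed
         out = subprocess.run(["smartctl", "-a", device],
                              capture_output=True, text=True).stdout
         m = re.search(r"ATA Error Count:\s*(\d+)", out)
         return int(m.group(1)) if m else 0

     if __name__ == "__main__":
         # device name is just an example
         print(ata_error_count("/dev/sda"))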


  Silent corruption

   nearline disks experiencing silent corruption per year:    0.466%

   What Netapp calls ``silent data corruption'' is bad data silently
   returned by the drive with no error indication; ZFS counts it as
   CKSUM, and it seems not to cause ZFS to freeze.  I think you have
   been lumping this in with unrecoverable reads, but the word
   ``silent'' makes it clearer: ``unrecoverable'' sounds to me like
   the drive tried to recover and failed, in which case the drive
   probably also reported the error, making it a ``latent sector
   error''.
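
   The reason a block checksum catches this class when nothing else
   does is simple enough to sketch (toy Python, not ZFS's actual
   on-disk format or code):

     import hashlib

     def write_block(store, sums, addr, data):
         # keep the checksum separately from the data, the way ZFS
         # keeps it in the parent block pointer: a drive that silently
         # returns bad data can't hand back a matching checksum with it
         store[addr] = data
         sums[addr] = hashlib.sha256(data).digest()

     def read_block(store, sums, addr, counters):
         data = store[addr]
         if hashlib.sha256(data).digest() != sums[addr]:
             counters["CKSUM"] += 1    # the drive said ``ok'' but lied
             raise IOError("checksum mismatch at block %r" % (addr,))
         return data

     # a flipped bit is caught even though no error was ever reported:
     store, sums, counters = {}, {}, {"CKSUM": 0}
     write_block(store, sums, 7, b"hello world")
     store[7] = b"hello worle"         # simulate silent corruption
     try:
         read_block(store, sums, 7, counters)
     except IOError:
         pass
     print(counters)                   # {'CKSUM': 1}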


  filesystem corruption

   This is also discovered silently w.r.t. the driver: the corruption
   that happens to ZFS systems when SAN targets disappear suddenly or
   when you offline a target and then reboot (which is also counted in
   the CKSUM column, and which ZFS-level redundancy also helps fix).
   I would call this ``ZFS bugs'', ``filesystem corruption,'' or
   ``manual resilvering''.  Obviously it's not included on the Netapp
   table.  It would be nice if ZFS had two separate CKSUM columns to
   distinguish between what netapp calls ``checksum errors'' vs
   ``identity discrepancies''.  For ZFS, a ``checksum error'' would
   point with high certainty at the storage, i.e. silent corruption,
   while an ``identity discrepancy'' would look more like filesystem
   corruption and would flag things like one side of a mirror being
   out-of-date when ZFS thinks it shouldn't be.  But currently we have
   only one CKSUM column covering both cases.
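
   Roughly what I mean by two columns (toy code again, and the
   ``identity'' field is just my guess at what such a check would key
   on, not how ZFS or Netapp actually do it):

     import hashlib

     def classify_read(block, expected_identity, counters):
         # 'block' is a dict carrying the data plus the checksum and
         # identity metadata written alongside it; a toy illustration
         # of the paper's two categories, not a real on-disk layout
         if hashlib.sha256(block["data"]).digest() != block["checksum"]:
             counters["CKSUM_checksum"] += 1   # bits changed in the storage
             return "checksum error (silent corruption)"
         if block["identity"] != expected_identity:
             counters["CKSUM_identity"] += 1   # well-formed data, wrong block:
             return "identity discrepancy"     # stale, lost, or misdirected write
         return "ok"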


So, I would say, yes, the type of read error that other software RAID
systems besides ZFS *do* still handle is a lot more common: 9.5%/yr vs
0.466%/yr for nearline disks, with roughly the same ~20x factor for
enterprise disks.  The rarer silent error, which other software LVMs
miss and only ZFS/Netapp/EMC/... handle, is still common enough to
worry about, at least on the nearline disks in the Netapp drive
population.
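
To put the two rates next to each other in pool-sized terms (back of
the envelope only, and the 48-drive pool is just a number I picked):

  # expected number of affected drives per year in a hypothetical
  # 48-drive nearline pool, using the per-drive annual rates above
  drives = 48
  latent_rate, silent_rate = 0.095, 0.00466

  print("latent sector errors: %.1f drives/yr" % (drives * latent_rate))  # ~4.6
  print("silent corruption:    %.2f drives/yr" % (drives * silent_rate))  # ~0.22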

What this also shows, though, is that about 1 in 10 nearline drives
will return a UNC per year, and possibly cause ZFS to freeze up.  It's
worth worrying about availability during an exception as common as
that---it might even be more important for some applications than
catching the silent corruption.  Not for my own application, but for
some readily imaginable ones, yes.
