[zfs-discuss] What are the usual suspects in data errors?

Gary Mills Wed, 14 Jan 2009 14:43:46 -0800

I realize that any error can occur in a storage subsystem, but most
of these have an extremely low probability.  In this discussion, I'm
interested only in those that do occur occasionally and that are
not catastrophic.


Consider the common configuration of two SCSI disks, connected to
the same HBA, that are configured as a mirror in some manner.  In this
case, the data path in general consists of:

o The application
o The filesystem
o The drivers
o The HBA
o The SCSI bus
o The controllers
o The heads and platters

Many of those components have their own error checking.  Some have
error correction.  For example, parity checking is done on a SCSI bus,
unless it's specifically disabled.  Do SATA and PATA connections also
do error checking?  Disk sector I/O uses CRC error checking and
correction.  Memory buffers would often be protected by parity memory.
Are there any others that I've missed?
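
To make the parity point concrete, here is a small Python sketch of a
single parity bit protecting one byte.  It assumes odd parity, and the
names (parity_bit, parity_ok) are made up for illustration; this is
not how any particular HBA implements it.  It shows that one flipped
bit is caught but two flipped bits slip through.

```python
def parity_bit(byte: int) -> int:
    """Odd parity: choose the bit so the total count of 1 bits is odd."""
    return (bin(byte).count("1") + 1) % 2

def parity_ok(byte: int, received_parity: int) -> bool:
    """Receiver side: recompute the parity and compare with what arrived."""
    return parity_bit(byte) == received_parity

sent = 0b10110010
p = parity_bit(sent)

one_flip = sent ^ 0b00000100    # a single bit corrupted in transit
two_flips = sent ^ 0b00000110   # two bits corrupted in transit

single_detected = not parity_ok(one_flip, p)
double_detected = not parity_ok(two_flips, p)
print(single_detected, double_detected)
```

Any even number of flipped bits leaves the parity unchanged, which is
one way a lower-layer check can pass bad data upward.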

Now, let's consider common errors.  To me, the most frequent would
be a bit error on a disk sector.  In this case, the controller would
report a CRC error and would not return bad data.  The filesystem
would obtain the data from its redundant copy.  I assume that ZFS
would also rewrite the bad sector to correct it.  The application
would not see an error.  Similar events would happen for a parity
error on the SCSI bus.
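
As I understand it, the sequence would look roughly like the toy
Python model below.  Everything in it is an assumption for
illustration: the mirror is a list of two byte arrays, SHA-256 stands
in for whatever checksum is actually used, and mirrored_read is a
hypothetical name rather than anything in ZFS.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # SHA-256 is only a stand-in for the filesystem's real checksum.
    return hashlib.sha256(data).digest()

def mirrored_read(copies: list, expected: bytes) -> bytes:
    """Return data from the first copy that matches `expected`,
    then overwrite any copy that does not match (the repair step)."""
    good = None
    for copy in copies:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("no copy matches the expected checksum")
    for i, copy in enumerate(copies):
        if checksum(bytes(copy)) != expected:
            copies[i] = bytearray(good)  # rewrite the bad side of the mirror
    return good

original = b"block contents"
expected = checksum(original)
mirror = [bytearray(original), bytearray(original)]
mirror[0][3] ^= 0x01  # flip one bit on the first disk

data = mirrored_read(mirror, expected)
```

The caller gets the good data back, and both sides of the mirror match
again afterwards.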

What can go wrong with the disk controller?  A simple seek to the
wrong track is not a problem because the track number is encoded on
the platter.  The controller will simply recalibrate the mechanism and
retry the seek.  If it computes the wrong sector, that would be a
problem.  Does this happen with any frequency?  In this case, ZFS
would detect a checksum error and obtain the data from its redundant
copy.
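
Here is a rough Python sketch of why I think the wrong-sector case is
different from a plain bit error.  The disk is a hypothetical dict
keyed by LBA, CRC-32 stands in for the drive's own per-sector check,
and SHA-256 stands in for the checksum that ZFS keeps in the parent
block pointer.  The misdirect_by parameter is invented to simulate the
fault.

```python
import hashlib
import zlib

# Hypothetical disk: each LBA holds a payload plus the drive's own check value.
disk = {}
for lba, payload in [(7, b"wanted block"), (8, b"neighbouring block")]:
    disk[lba] = (payload, zlib.crc32(payload))

def drive_read(lba: int, misdirect_by: int = 0) -> bytes:
    """Read a sector; `misdirect_by` simulates landing on the wrong LBA."""
    payload, stored_crc = disk[lba + misdirect_by]
    if zlib.crc32(payload) != stored_crc:
        raise IOError("sector check failed")
    return payload

# The parent records the checksum of the data that belongs at LBA 7.
parent_checksum = hashlib.sha256(b"wanted block").digest()

# The drive reads LBA 8 instead of LBA 7.  The sector itself is intact,
# so the per-sector check passes and no error is raised.
returned = drive_read(7, misdirect_by=1)

# Only the checksum held outside the sector notices the wrong data.
parent_check_passed = hashlib.sha256(returned).digest() == parent_checksum
```

The per-sector check can only say that a sector is internally
consistent, not that it is the sector that was asked for.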

A logic error in ZFS might result in incorrect metadata being written
with a valid checksum.  In this case, ZFS might panic on import or might
correct the error.  How is this sort of error prevented?

If the application wrote bad data to the filesystem, none of the
error checking in lower layers would detect it.  This would be
strictly an error in the application.

Some errors might result from a loss of power if some ZFS data was
written to a disk cache but was never written to the disk platter.
Again, ZFS might panic on import or might correct the error.  How is
this sort of error prevented?

After all of this discussion, what other errors can ZFS checksums
reasonably detect?  Certainly if some of the other error checking
failed to detect an error, ZFS would still detect one.  How likely
are these other error checks to fail?

Is there anything else I've missed in this analysis?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
