well, since this is part of how I make my living, or at least
what is in my current job description...

Gary Mills wrote:
> I realize that any error can occur in a storage subsystem, but most
> of these have an extremely low probability.  I'm interested in this
> discussion in only those that do occur occasionally, and that are
> not catastrophic.

excellent... fertile ground for research.  One of the things we
see with ZFS is that it detects errors which were previously
going undetected.  You can see this happen on this forum when
people try to kill the canary (ZFS).  I think a better analogy
is astronomy: as our ability to see the universe improves, we
see more of it -- but that also raises the number of questions
we can't answer... well... yet...

> Consider the common configuration of two SCSI disks connected to
> the same HBA that are configured as a mirror in some manner.  In this
> case, the data path in general consists of:

Beware of the Decomposition Law, which says that the part is more
than a fraction of the whole.  This is what trips people up when
they think that if every part performs flawlessly, then the whole
will perform flawlessly.
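
A toy calculation (Python; the stage list and the per-stage
numbers are made up for illustration, not measured) shows why:
even if every stage in the path is individually quite reliable,
the path as a whole is less reliable than any one of them, and
none of the stages checks the others end to end.

    # Series reliability: the whole path works only if every stage works.
    # Per-stage numbers are illustrative, not measured.
    stages = {
        "application": 0.9999,
        "filesystem":  0.9999,
        "driver":      0.9999,
        "HBA":         0.9995,
        "SCSI bus":    0.9995,
        "controller":  0.9995,
        "media":       0.9990,
    }

    whole = 1.0
    for r in stages.values():
        whole *= r

    print("every stage is at least 0.999 reliable")
    print("the whole path is only %.4f reliable" % whole)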

> o The application
> o The filesystem
> o The drivers
> o The HBA
> o The SCSI bus
> o The controllers
> o The heads and platters
> 
> Many of those components have their own error checking.  Some have
> error correction.  For example, parity checking is done on a SCSI bus,
> unless it's specifically disabled.  Do SATA and PATA connections also
> do error checking?  Disk sector I/O uses CRC error checking and
> correction.  Memory buffers would often be protected by parity memory.
> Is there any more that I've missed?

thousands more ;-)
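
As a toy illustration of what that per-layer checking buys you
(Python; the payload is made up), a CRC catches a flipped bit,
but only within the scope of the layer that computed it:

    import zlib

    sector = bytearray(b"some sector payload " * 20)
    crc_at_write = zlib.crc32(sector)

    # flip one bit in transit, as a flaky bus or cable might
    sector[100] ^= 0x01

    crc_at_read = zlib.crc32(sector)
    print("corruption detected:", crc_at_read != crc_at_write)   # True

A layer that never recomputes the check, or a corruption that
happens after the check was verified, goes unnoticed -- which is
the gap an end-to-end checksum is meant to cover.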


> Now, let's consider common errors.  To me, the most frequent would
> be a bit error on a disk sector.  In this case, the controller would
> report a CRC error and would not return bad data.  The filesystem
> would obtain the data from its redundant copy.  I assume that ZFS
> would also rewrite the bad sector to correct it.  The application
> would not see an error.  Similar events would happen for a parity
> error on the SCSI bus.

Nit: modern disks can detect and correct multi-byte errors within
a sector.  If ZFS can correct the error (that depends on the ZFS
configuration), it will, but it does not rewrite the defective
sector -- it writes to a different sector.  While that seems
better, it also introduces at least one new failure mode and can
help expose other, existing failure modes, such as phantom writes.
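
A rough sketch of that read path on a mirror (Python; the
function and layout are mine for illustration, not the actual
ZFS code): the checksum is verified on every read, a copy that
fails verification is never returned, and the good copy is used
to issue a repair.

    import hashlib

    def read_block(copies, expected_sha):
        # Toy model of self-healing on a two-way mirror.
        good = None
        for data in copies:
            if hashlib.sha256(data).hexdigest() == expected_sha:
                good = data
                break
        if good is None:
            raise IOError("all copies failed the checksum -- unrecoverable")
        # issue a repair for any copy that did not verify
        for i, data in enumerate(copies):
            if hashlib.sha256(data).hexdigest() != expected_sha:
                copies[i] = good    # stand-in for the repair write
        return good

    # usage: copy 0 is silently corrupted, copy 1 is good
    block = b"important block"
    expected = hashlib.sha256(block).hexdigest()
    mirror = [b"important blocj", block]
    assert read_block(mirror, expected) == block   # and mirror[0] was repaired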

> What can go wrong with the disk controller?  A simple seek to the
> wrong track is not a problem because the track number is encoded on
> the platter.  The controller will simply recalibrate the mechanism and
> retry the seek.  If it computes the wrong sector, that would be a
> problem.  Does this happen with any frequency?  In this case, ZFS
> would detect a checksum error and obtain the data from its redundant
> copy.
> 
> A logic error in ZFS might result in incorrect metadata being written
> with valid checksum.  In this case, ZFS might panic on import or might
> correct the error.  How is this sort of error prevented?
> 
> If the application wrote bad data to the filesystem, none of the
> error checking in lower layers would detect it.  This would be
> strictly an error in the application.
> 
> Some errors might result from a loss of power if some ZFS data was
> written to a disk cache but never was written to the disk platter.
> Again, ZFS might panic on import or might correct the error.  How is
> this sort of error prevented?
> 
> After all of this discussion, what other errors can ZFS checksums
> reasonably detect?  Certainly if some of the other error checking
> failed to detect an error, ZFS would still detect one.  How likely
> are these other error checks to fail?
> 
> Is there anything else I've missed in this analysis?

Everything along the way.  If you search the archives here you will
find anecdotes of:
        + bad disks -- of all sorts
        + bad power supplies
        + bad FC switch firmware
        + flaky cables
        + bugs in NIC drivers
        + transient and permanent DRAM errors
        + and, of course, bugs in ZFS code

Basically, anything your data touches can fail.

However, to make the problem tractable, we often
divide failures into two classifications:

1. mechanical, including quantum-mechanical

2. design or implementation, including software defects,
    design deficiencies, and manufacturing defects

There is a lot of experience with measuring mechanical failure
modes, so for #1 we have some ways to assign reliability budgets
and make predictions.  For #2, the science we use for #1 simply
doesn't apply.
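
To give a feel for the kind of arithmetic available for #1
(Python; the MTBF figure is a made-up example, not a vendor
number): under the usual constant-failure-rate assumption an
MTBF becomes an annualized failure rate, and a failure budget
for a population of parts follows from that.

    import math

    mtbf_hours = 1_000_000        # illustrative figure, not from a datasheet
    hours_per_year = 8760

    # exponential (constant hazard) assumption: P(fail within one year)
    afr = 1.0 - math.exp(-hours_per_year / mtbf_hours)
    print("AFR per drive: %.2f%%" % (afr * 100))

    # expected failures per year across 1,000 such drives
    print("expected annual failures in 1000 drives: %.1f" % (1000 * afr))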
  -- richard
