On Wed, Jan 14, 2009 at 04:39:03PM -0600, Gary Mills wrote:
> I realize that any error can occur in a storage subsystem, but most
> of these have an extremely low probability.  I'm interested in this
> discussion in only those that do occur occasionally, and that are
> not catastrophic.

What level is "extremely low" here?

> Many of those components have their own error checking.  Some have
> error correction.  For example, parity checking is done on a SCSI bus,
> unless it's specifically disabled.  Do SATA and PATA connections also
> do error checking?  Disk sector I/O uses CRC error checking and
> correction.  Memory buffers would often be protected by parity memory.
> Is there any more that I've missed?

Reports suggest that bugs in drive firmware could account for errors at
a level that is not insignificant.

> What can go wrong with the disk controller?  A simple seek to the
> wrong track is not a problem because the track number is encoded on
> the platter.  The controller will simply recalibrate the mechanism and
> retry the seek.  If it computes the wrong sector, that would be a
> problem.  Does this happen with any frequency?

Netapp documents certain rewrite bugs that they've specifically seen.  I
would imagine they have good data on the frequency that they see it in
the field.

> In this case, ZFS
> would detect a checksum error and obtain the data from its redundant
> copy.

Correct.

> A logic error in ZFS might result in incorrect metadata being written
> with valid checksum.  In this case, ZFS might panic on import or might
> correct the error.  How is this sort of error prevented?

It's very difficult to protect yourself from software bugs with the same
piece of software.  You can create assertions that are hopefully simpler
and less prone to errors, but they will not catch all bugs.

> Some errors might result from a loss of power if some ZFS data was
> written to a disk cache but never was written to the disk platter.
> Again, ZFS might panic on import or might correct the error.  How is
> this sort of error prevented?

ZFS uses a multi-stage commit.  It relies on the "disk" responding to a
request to flush caches to the disk.  If that assumption is correct,
then there is no problem in general with power issues.
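
To make the ordering argument concrete, here is a toy model of a
two-stage commit against a disk with a volatile write cache.  It is only
a sketch and not how ZFS is actually implemented: ToyDisk, commit,
read_tree, the crash_at stage names and the single fixed ROOT slot are
all invented for illustration, and the root write is assumed to be
atomic.  The only point is that new blocks are made durable before
anything durable points at them.

```python
import zlib

ROOT = "root"  # hypothetical fixed address of the root record


class ToyDisk:
    """Invented model of a disk with a volatile write cache."""

    def __init__(self, honours_flush=True):
        self.platter = {}  # durable contents: address -> value
        self.cache = {}    # acknowledged writes that are not yet durable
        self.honours_flush = honours_flush

    def write(self, addr, data):
        self.cache[addr] = data

    def flush(self):
        # An honest disk makes every cached write durable before returning.
        # A lying disk (honours_flush=False) acknowledges and does nothing.
        if self.honours_flush:
            self.platter.update(self.cache)
            self.cache.clear()

    def power_loss(self, survivors=()):
        # Cached writes are lost, except those that happened to reach the
        # platter on their own (listed in survivors).
        for addr in survivors:
            if addr in self.cache:
                self.platter[addr] = self.cache[addr]
        self.cache.clear()

    def read(self, addr):
        if addr in self.cache:
            return self.cache[addr]
        return self.platter.get(addr)


def commit(disk, txg, blocks, crash_at=None):
    """Write blocks for transaction group txg, then publish a new root."""
    # Stage 1: write new data to fresh addresses; live blocks are untouched.
    for name, data in blocks.items():
        disk.write((txg, name), data)
    if crash_at == "data-written":
        return disk.power_loss()
    disk.flush()
    if crash_at == "data-flushed":
        return disk.power_loss()
    # Stage 2: only now write the root, which records address and checksum.
    pointers = {name: ((txg, name), zlib.crc32(data))
                for name, data in blocks.items()}
    disk.write(ROOT, (txg, pointers))
    if crash_at == "root-written":
        return disk.power_loss()
    disk.flush()


def read_tree(disk):
    """Follow the root and verify every block against its checksum."""
    root = disk.read(ROOT)
    if root is None:
        return None, {}
    txg, pointers = root
    tree = {}
    for name, (addr, crc) in pointers.items():
        data = disk.read(addr)
        if data is None or zlib.crc32(data) != crc:
            raise OSError(f"checksum error on block {name!r} of txg {txg}")
        tree[name] = data
    return txg, tree
```

With an honest ToyDisk, cutting power at any of the three stages leaves
read_tree returning the previous transaction group intact.  With
honours_flush=False, a power loss in which only the root happens to
survive leaves a root pointing at blocks that never reached the platter,
and read_tree raises.  That is the failure the flush assumption rules
out.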
The disk is consistent both before and after the cache is flushed.

-- 
Darren
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss