I realize that any error can occur in a storage subsystem, but most of these have an extremely low probability. In this discussion I'm interested only in those that do occur occasionally and that are not catastrophic.

Consider the common configuration of two SCSI disks connected to the same HBA and configured as a mirror in some manner. In this case, the data path in general consists of:

 o The application
 o The filesystem
 o The drivers
 o The HBA
 o The SCSI bus
 o The controllers
 o The heads and platters

Many of those components have their own error checking. Some have error correction. For example, parity checking is done on a SCSI bus, unless it's specifically disabled. Do SATA and PATA connections also do error checking? Disk sector I/O uses error checking and correction codes (CRC/ECC). Memory buffers are often protected by parity memory. Is there any more that I've missed?

Now, let's consider common errors. To me, the most frequent would be a bit error on a disk sector. In this case, the controller would report a CRC error and would not return bad data. The filesystem would obtain the data from its redundant copy. I assume that ZFS would also rewrite the bad sector to correct it. The application would not see an error. Similar events would happen for a parity error on the SCSI bus.

What can go wrong with the disk controller? A simple seek to the wrong track is not a problem because the track number is encoded on the platter. The controller will simply recalibrate the mechanism and retry the seek. If it computes the wrong sector, that would be a problem. Does this happen with any frequency? In this case, ZFS would detect a checksum error and obtain the data from its redundant copy.

A logic error in ZFS might result in incorrect metadata being written with a valid checksum. In this case, ZFS might panic on import or might correct the error. How is this sort of error prevented?

If the application wrote bad data to the filesystem, none of the error checking in the lower layers would detect it. This would be strictly an error in the application.

Some errors might result from a loss of power if some ZFS data was written to a disk cache but was never written to the disk platter.
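
To make my mental model of the mirrored read described above concrete, here is a minimal sketch in Python. It is only how I picture the behaviour, not ZFS's actual code: the names ToyMirror and ChecksumError are made up, SHA-256 is just a stand-in checksum, and the dicts stand in for disks. The one assumption that matters is that the checksum is stored apart from the block it covers, so a wrong-sector or stale write cannot carry a matching checksum with it.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no copy of a block matches its stored checksum."""


def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in; the real filesystem's algorithm may differ.
    return hashlib.sha256(data).digest()


class ToyMirror:
    """Hypothetical N-way mirror that keeps checksums apart from the data."""

    def __init__(self, sides: int = 2):
        # Each dict maps block number -> bytes and stands in for one disk.
        self.disks = [dict() for _ in range(sides)]
        # Checksums live outside the data blocks (assumption, see above).
        self.sums = {}
        self.repairs = 0

    def write(self, blkno: int, data: bytes) -> None:
        self.sums[blkno] = checksum(data)
        for disk in self.disks:
            disk[blkno] = data

    def read(self, blkno: int) -> bytes:
        # This toy checks every side on each read; a real mirror would
        # presumably read one side and consult the other only on failure.
        expected = self.sums[blkno]
        good = None
        bad_sides = []
        for disk in self.disks:
            data = disk.get(blkno)
            if data is not None and checksum(data) == expected:
                if good is None:
                    good = data
            else:
                bad_sides.append(disk)
        if good is None:
            raise ChecksumError(f"block {blkno}: no valid copy")
        # Self-heal: overwrite every bad copy with the verified data.
        for disk in bad_sides:
            disk[blkno] = good
            self.repairs += 1
        return good


if __name__ == "__main__":
    m = ToyMirror()
    m.write(7, b"payload")
    m.disks[0][7] = b"paXload"   # simulate silent corruption on one side
    print(m.read(7), m.repairs)
```

In this model the application only sees an error when every copy fails verification, which is the case that redundancy alone cannot cover.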
After such a power loss, ZFS might again panic on import or might correct the error. How is this sort of error prevented?

After all of this discussion, what other errors can ZFS checksums reasonably detect? Certainly, if some of the other error checking failed to detect an error, ZFS would still detect one. How likely are these other error checks to fail?

Is there anything else I've missed in this analysis?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss