> There is a category of errors that are 
> not caused by firmware, or any type of software. The
> hardware just doesn't write or read the correct bit value this time
> around. Without a checksum there's no way for the firmware to know, and
> next time it very well may write or read the correct bit value from the
> exact same spot on the disk, so scrubbing is not going to flag this
> sector as 'bad'.

There seems to be a lot of ignorance about how disks actually work in this 
thread.

Here's the data path, to a first approximation.

  Processor <=> RAM <=> controller RAM <=> disk cache RAM <=> read/write head <=> media

There are four buses in the above (which is a slight oversimplification): the 
processor/memory bus, the internal I/O bus (e.g. PCI), the external I/O bus 
(e.g. SATA), and the internal disk bus. (The last arrow isn't a bus, it's the 
magnetic field.)

Errors can be introduced at any point and there are corresponding error 
detection and correction mechanisms at each point.

Processor: Usually parity on internal registers & buses, ECC on larger cache.
Processor/memory bus: Usually ECC (SECDED).
RAM: Usually SECDED or better for better servers, parity for cheap servers, 
nothing at the low end.
Internal I/O bus: Usually parity (PCI) or CRC (PCI-E).
Controller RAM: Usually parity for low-end controllers, rarely ECC for high-end 
controllers.
External I/O bus: Usually CRC.
Disk cache RAM: Usually parity for low-end disks, ECC for high-end disks.
Internal disk bus: Media ECC.
Read/write head: N/A, doesn't hold bits.
Media: Media ECC.
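
To make the SECDED entries above concrete (single error correct, double error detect), here's a minimal sketch in Python. It uses an extended Hamming(8,4) code over 4 data bits purely for illustration; real memory ECC typically protects a 64-bit word with 8 check bits, and the function names encode/decode are made up for this example, not taken from any hardware or library.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid codeword and equals the flipped position after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped (position 0 if the syndrome is 0): fix it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # one flipped bit is silently repaired
    print(decode(word))
    word[2] ^= 1                 # a second flipped bit is reported, not repaired
    print(decode(word))
```

The important property is the last case: a double-bit error isn't fixed, but it isn't passed along as good data either.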

The disk, as it's transferring data from its cache to the media, adds a very 
large and complex error-correcting code to the data. This protects against a 
huge number of errors, 20 or more bits in a single 512-byte block.  It has to, 
because the media is very noisy.

So there is far *better* protection than a checksum for the data once it gets 
to the disk, and the disk can't possibly (well, not with any reasonable 
probability) return bad data.  You'll get an I/O error ("media error" 
in SCSI parlance) instead.

ZFS protects against an error introduced between memory and the disk.  "Aha!", 
you say, "there's a lot of steps there, and we could get an error at any 
point!"  There are a lot of points there, but very few where the data isn't 
already protected by either CRC or parity.  (Why do controllers usually use 
parity internally?  The same reason the processor uses parity for L1; access is 
speed-critical, and the data is "live" in the cache/FIFO for such a small 
amount of time that the probability of a multi-bit error is negligible.)
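
As a small illustration of the parity trade-off just described, here's a sketch in Python. The helper names parity_bit and parity_ok are hypothetical, and a single even-parity bit per byte is assumed. A one-bit flip is always caught; a two-bit flip in the same byte slips through, which is why parity is only acceptable where data is short-lived.

```python
def parity_bit(byte):
    """Even-parity bit for one byte: 1 if the byte has an odd number of 1s."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


if __name__ == "__main__":
    original = 0b10110010
    p = parity_bit(original)

    single_flip = original ^ 0b00000100   # one bit changed
    double_flip = original ^ 0b00010100   # two bits changed

    print(parity_ok(original, p))      # unchanged data passes
    print(parity_ok(single_flip, p))   # single-bit error is detected
    print(parity_ok(double_flip, p))   # double-bit error goes unnoticed
```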

> Now you may claim that this type of error happens so infrequently that 
> it's not worth it.

I do claim that the error you described -- a bit error on the disk, undetected 
by the disk's ECC -- is infrequent to the point of being negligible.  The much 
more frequent case, an error which is detected but not corrected by ECC, is 
handled by simple mirroring.
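
For what "handled by simple mirroring" looks like in practice, here's a toy model in Python. Everything in it (the Disk class, MediaError, mirrored_read) is invented for illustration and isn't any real volume manager's API. The assumption is that a disk whose ECC can't recover a block raises an error instead of returning data, so the mirror layer just reads the other side and, in this sketch, rewrites the failed copy.

```python
class MediaError(IOError):
    """Stands in for the I/O error a disk returns when its ECC gives up."""


class Disk:
    """Toy disk: a dict of blocks plus a set of unreadable block addresses."""

    def __init__(self, blocks, bad=()):
        self.blocks = dict(blocks)
        self.bad = set(bad)

    def read(self, lba):
        if lba in self.bad:
            raise MediaError(f"unrecoverable read error at block {lba}")
        return self.blocks[lba]

    def write(self, lba, data):
        # Assumption: rewriting a block makes it readable again (a real
        # disk may remap the sector at this point).
        self.blocks[lba] = data
        self.bad.discard(lba)


def mirrored_read(disks, lba):
    """Return the block from the first mirror side that can read it."""
    failed = []
    for disk in disks:
        try:
            data = disk.read(lba)
        except MediaError:
            failed.append(disk)
            continue
        for other in failed:
            other.write(lba, data)   # repair the sides that reported an error
        return data
    raise MediaError(f"block {lba} unreadable on all mirror sides")


if __name__ == "__main__":
    a = Disk({7: b"payload"}, bad={7})
    b = Disk({7: b"payload"})
    print(mirrored_read([a, b], 7))  # served from the second side
    print(a.read(7))                 # first side was repaired by the read
```

Note that this only works because the error is *detected*: the mirror layer never has to guess which copy is right.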

Anton
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
