On Oct 19, 2011, at 1:52 PM, Richard Elling wrote:

> On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
>
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tim Cook
>>>
>>> I had and have redundant storage, and it has *NEVER* automatically fixed it. You're the first person I've heard from who has had it automatically fix it.
>>
>> That's probably just because it's normal and expected behavior to automatically fix it - I always have redundancy, and every cksum error I ever find is always automatically fixed. I never tell anyone here because it's normal and expected.
>
> Yes, and in fact the automated tests for ZFS developers intentionally corrupt data so that the repair code can be tested. Also, the same checksum code is used to calculate the checksum when writing and when reading.
>
>> If you have redundancy, and cksum errors, and it's not automatically fixed, then you should report the bug.
>
> For modern Solaris-based implementations, each checksum mismatch that is repaired reports a bitmap of the corrupted vs. expected data. Obviously, if the data cannot be repaired, you cannot know the expected data, so the error is reported without identification of the broken bits.
>
> In the archives, you can find reports of recoverable and unrecoverable errors attributed to:
> 1. ZFS software (rare, but a bug a few years ago mishandled a raidz case)
> 2. SAN switch firmware
> 3. "Hardware" RAID array firmware
> 4. Power supplies
> 5. RAM
> 6. HBA
> 7. PCI-X bus
> 8. BIOS settings
> 9. CPU and chipset errata
>
> Personally, I've seen all of the above except #7, because PCI-X hardware is hard to find now.
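The self-healing read path Richard describes - verify the checksum on read, repair a bad copy from a good one, and report a bitmap of corrupted vs. expected data - can be sketched in miniature. This is a hypothetical toy, not ZFS code: it models a two-way mirror and uses SHA-256 as a stand-in for the pool's actual checksum algorithm; the class and function names are invented for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """The same checksum code is used on both the write and read paths.

    SHA-256 here is a stand-in; ZFS pools use fletcher4/sha256 etc.
    """
    return hashlib.sha256(data).digest()


class MirrorVdev:
    """Toy two-way mirror, a hypothetical stand-in for a ZFS mirror vdev."""

    def __init__(self, data: bytes):
        # Write path: store two copies plus the checksum of the data.
        self.copies = [bytearray(data), bytearray(data)]
        self.cksum = checksum(data)

    def read(self) -> bytes:
        # Read path: find any copy that matches the stored checksum.
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.cksum:
                continue
            # Self-heal: rewrite any sibling copy that fails verification,
            # and report which bits differed (corrupted vs. expected).
            for j, other in enumerate(self.copies):
                if j != i and checksum(bytes(other)) != self.cksum:
                    bad = bytes(other)
                    self.copies[j] = bytearray(copy)
                    bitmap = bytes(a ^ b for a, b in zip(bad, copy))
                    print(f"repaired copy {j}, bad-bit map: {bitmap.hex()}")
            return bytes(copy)
        # No copy verifies: the expected data is unknowable, so the error
        # is reported without identifying the broken bits.
        raise IOError("unrecoverable: no copy matches stored checksum")
```

Flipping one bit in one copy is silently repaired on the next read; corrupting both copies raises the unrecoverable error, matching the behavior the thread describes for redundant vs. non-redundant damage.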
I've seen #7. I have some PCI-X hardware that is flaky in my home lab. ;-)

There was a case of #1 not very long ago, but it was a difficult-to-trigger race and is fixed in illumos and, I presume, other derivatives (including NexentaStor).

	- Garrett

> If you consistently see unrecoverable data from a system that has protected data, then there may be an issue with a part of the system that is a single point of failure. Very, very, very few x86 systems are designed with no SPOF.
> -- richard
>
> --
>
> ZFS and performance consulting
> http://www.RichardElling.com
> VMworld Copenhagen, October 17-20
> OpenStorage Summit, San Jose, CA, October 24-27
> LISA '11, Boston, MA, December 4-9
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss