On Oct 19, 2011, at 1:52 PM, Richard Elling wrote:

> On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
>
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tim Cook
>>>
>>> I had and have redundant storage, and it has *NEVER* automatically fixed it. You're the first person I've heard from who has had it automatically fix it.
>>
>> That's probably just because it's normal and expected behavior to automatically fix it - I always have redundancy, and every cksum error I ever find is always automatically fixed. I never tell anyone here because it's normal and expected.
>
> Yes, and in fact the automated tests for ZFS developers intentionally corrupt data so that the repair code can be tested. Also, the same checksum code is used to calculate the checksum when writing and when reading.
>
>> If you have redundancy, and cksum errors, and it's not automatically fixed, then you should report the bug.
>
> For modern Solaris-based implementations, each checksum mismatch that is repaired reports a bitmap of the corrupted vs. expected data. Obviously, if the data cannot be repaired, you cannot know the expected data, so the error is reported without identification of the broken bits.
>
> In the archives, you can find reports of recoverable and unrecoverable errors attributed to:
> 1. ZFS software (rare, but a bug a few years ago mishandled a raidz case)
> 2. SAN switch firmware
> 3. "Hardware" RAID array firmware
> 4. Power supplies
> 5. RAM
> 6. HBA
> 7. PCI-X bus
> 8. BIOS settings
> 9. CPU and chipset errata
>
> Personally, I've seen all of the above except #7, because PCI-X hardware is hard to find now.
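The self-healing read path Richard describes - verify the checksum on read, repair a bad copy from a good one, and report a bitmap of corrupted vs. expected data - can be sketched in miniature. This is a hypothetical toy, not ZFS code: it models a two-way mirror and uses SHA-256 as a stand-in for the pool's actual checksum algorithm; the class and function names are invented for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """The same checksum code is used on both the write and read paths.

    SHA-256 here is a stand-in; ZFS pools use fletcher4/sha256 etc.
    """
    return hashlib.sha256(data).digest()


class MirrorVdev:
    """Toy two-way mirror, a hypothetical stand-in for a ZFS mirror vdev."""

    def __init__(self, data: bytes):
        # Write path: store two copies plus the checksum of the data.
        self.copies = [bytearray(data), bytearray(data)]
        self.cksum = checksum(data)

    def read(self) -> bytes:
        # Read path: find any copy that matches the stored checksum.
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.cksum:
                continue
            # Self-heal: rewrite any sibling copy that fails verification,
            # and report which bits differed (corrupted vs. expected).
            for j, other in enumerate(self.copies):
                if j != i and checksum(bytes(other)) != self.cksum:
                    bad = bytes(other)
                    self.copies[j] = bytearray(copy)
                    bitmap = bytes(a ^ b for a, b in zip(bad, copy))
                    print(f"repaired copy {j}, bad-bit map: {bitmap.hex()}")
            return bytes(copy)
        # No copy verifies: the expected data is unknowable, so the error
        # is reported without identifying the broken bits.
        raise IOError("unrecoverable: no copy matches stored checksum")
```

Flipping one bit in one copy is silently repaired on the next read; corrupting both copies raises the unrecoverable error, matching the behavior the thread describes for redundant vs. non-redundant damage.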
I've seen #7. I have some PCI-X hardware that is flaky in my home lab. ;-)

There was a case of #1 not very long ago, but it was a difficult-to-trigger race and is fixed in illumos and, I presume, other derivatives (including NexentaStor).

	- Garrett

> If you consistently see unrecoverable data from a system that has protected data, then there may be an issue with a part of the system that is a single point of failure. Very, very, very few x86 systems are designed with no SPOF.
> -- richard
>
> --
>
> ZFS and performance consulting
> http://www.RichardElling.com
> VMworld Copenhagen, October 17-20
> OpenStorage Summit, San Jose, CA, October 24-27
> LISA '11, Boston, MA, December 4-9
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss