> Very nice; and ultimately will be very interesting to
> see what percentage of checksum errors within a
> particular deployment turn out to most likely be
> correctable single bit errors. 

Very promising and very nice, indeed.

One thing needs to be established carefully in the course of this analysis, and 
that's resiliency with respect to where the errors are introduced (regardless 
of how).  Do these properties hold up when the error is introduced in the 
checksum block rather than in the data block(s)?
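
To make that distinction concrete, here is a standalone toy sketch (my own, not 
the patch under review) of fletcher4 over 32-bit words.  A flipped bit in the 
data perturbs all four running sums by amounts that depend on the bit position 
and on how many words follow it, which is presumably the structure the 
correction logic keys off of; a flipped bit in the stored checksum leaves the 
recomputed sums untouched, so that mismatch carries no positional information 
at all:

/*
 * Toy illustration only, not the proposed ZFS code: fletcher4 over
 * 32-bit words, comparing a bit flip in the data block against a bit
 * flip in the stored checksum.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NWORDS  1024

struct cksum { uint64_t a, b, c, d; };

static void
fletcher4(const uint32_t *buf, struct cksum *zc)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < NWORDS; i++) {
        a += buf[i]; b += a; c += b; d += c;
    }
    zc->a = a; zc->b = b; zc->c = c; zc->d = d;
}

int
main(void)
{
    static uint32_t data[NWORDS];
    struct cksum good, bad, stored;

    for (int i = 0; i < NWORDS; i++)
        data[i] = (uint32_t)i * 2654435761u;    /* arbitrary payload */
    fletcher4(data, &good);
    stored = good;                              /* checksum as kept on disk */

    /* Case 1: a single bit flips in the data block.  The deltas below
     * are structured (e.g. delta_b scales with the number of words that
     * follow the flipped one), which a corrector can exploit. */
    data[100] ^= 1u << 7;
    fletcher4(data, &bad);
    printf("data flip:  delta_a=%llu delta_b=%llu\n",
        (unsigned long long)(bad.a - good.a),
        (unsigned long long)(bad.b - good.b));
    data[100] ^= 1u << 7;                       /* undo */

    /* Case 2: a single bit flips in the stored checksum instead.  The
     * data still produces the same four sums, so the only evidence is a
     * mismatch against 'stored' with no structure pointing at any bit. */
    stored.b ^= 1ull << 13;
    fletcher4(data, &bad);
    printf("cksum flip: data self-consistent=%d  matches stored=%d\n",
        memcmp(&bad, &good, sizeof (bad)) == 0,
        memcmp(&bad, &stored, sizeof (bad)) == 0);
    return (0);
}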

This brings up another question, too.  ZFS uses ditto blocks for metadata, 
which includes the checksums, and those are in turn covered by checksums in 
their enclosing parent blocks, so in theory all checksum data should itself be 
verified before being used to verify user data.

If a data block fails to verify, does ZFS consider at all the possibility that 
the damage may be in the checksum data (perhaps corrupted by bad memory after 
being read)?  Does it attempt to re-read or re-verify checksums at the same 
time as it looks for alternate copies of the user data when trying to correct 
a checksum failure?
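
Purely to illustrate the scenario I have in mind, here is a throwaway toy model 
(my own, not the actual ZFS I/O path): if what got damaged is the in-core copy 
of the expected checksum, then retrying alternate copies of the data can never 
succeed, while re-fetching the expected checksum from the (ditto-protected) 
parent would:

/*
 * Hypothetical toy model, not ZFS source.  A parent block supplies the
 * expected checksum for a child block; we simulate that in-core copy
 * being corrupted after it was read, and compare the two recovery
 * strategies asked about above.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
toy_cksum(const uint8_t *buf, size_t len)       /* stand-in for fletcher4 */
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return (sum);
}

int
main(void)
{
    uint8_t data[512];
    for (size_t i = 0; i < sizeof (data); i++)
        data[i] = (uint8_t)i;                   /* pristine user data */

    uint64_t on_disk_expected = toy_cksum(data, sizeof (data)); /* in parent */
    uint64_t in_core_expected = on_disk_expected;
    in_core_expected ^= 1ull << 5;      /* bit flip in the cached checksum */

    /* Strategy A: only retry (identical, undamaged) copies of the data;
     * every copy fails against the damaged in-core checksum. */
    int ok_a = (toy_cksum(data, sizeof (data)) == in_core_expected);

    /* Strategy B: also re-fetch the expected checksum from the parent
     * block, which is itself verified against the grandparent. */
    int ok_b = (toy_cksum(data, sizeof (data)) == on_disk_expected);

    printf("retry data copies only:      %s\n", ok_a ? "verified" : "ECKSUM");
    printf("also re-read checksum path:  %s\n", ok_b ? "verified" : "ECKSUM");
    return (0);
}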


> (And thereby possibly
> even measurably help improve the integrity of
> non-redundant array configurations, short of the
> catastrophic failure of a sector or drive itself.)
> 
> After reviewing the code (and presuming you intended
> "if (base->a less-than bad->a)"), I can't quite seem
> to convince myself the implementation is immune from
> misdiagnosing a double/triple bit error as a single
> bit error in general (although likely staring me in
> the face; as all correct, single, and double bit
> error checksums are warranted to be unique; as should
> also be all 4 and 5 bit error checksums for a
> corrected fletcher4 implementation to my
> understanding)?
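
For what it's worth, on a small enough block the uniqueness question in that 
last paragraph can simply be brute-forced.  Here's a throwaway sketch of mine 
(using the plain 64-bit-accumulator fletcher4, and exhaustive only for the one 
arbitrary buffer and tiny block size it tests, so it proves nothing about 
full-sized blocks) that enumerates every 1-bit and 2-bit error and counts any 
checksum collisions between them:

/*
 * Throwaway sketch: enumerate every 1-bit and every 2-bit error in one
 * specific small buffer and count how many 2-bit errors produce the
 * same fletcher4 result as some 1-bit error (i.e. could be misdiagnosed
 * from the checksum alone).
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NWORDS  8                       /* 8 x 32-bit words = 256 bits */
#define NBITS   (NWORDS * 32)

struct cksum { uint64_t a, b, c, d; };

static void
fletcher4(const uint32_t *buf, struct cksum *zc)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < NWORDS; i++) {
        a += buf[i]; b += a; c += b; d += c;
    }
    zc->a = a; zc->b = b; zc->c = c; zc->d = d;
}

static void
flip(uint32_t *buf, int bit)
{
    buf[bit / 32] ^= 1u << (bit % 32);
}

int
main(void)
{
    uint32_t data[NWORDS];
    struct cksum singles[NBITS], zc;
    int collisions = 0;

    for (int i = 0; i < NWORDS; i++)
        data[i] = (uint32_t)i * 2654435761u;    /* one arbitrary buffer */

    for (int i = 0; i < NBITS; i++) {           /* all 1-bit errors */
        flip(data, i);
        fletcher4(data, &singles[i]);
        flip(data, i);
    }

    for (int i = 0; i < NBITS; i++) {           /* all 2-bit errors */
        for (int j = i + 1; j < NBITS; j++) {
            flip(data, i); flip(data, j);
            fletcher4(data, &zc);
            flip(data, i); flip(data, j);
            for (int k = 0; k < NBITS; k++)
                if (memcmp(&zc, &singles[k], sizeof (zc)) == 0)
                    collisions++;
        }
    }
    printf("2-bit errors colliding with some 1-bit error: %d\n", collisions);
    return (0);
}
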
-- 
This message posted from opensolaris.org
