Thanks to everyone who made suggestions! This machine has run
memtest for a week and VTS for several days with no errors. It
seems that the problem is probably in the CPU cache.

On 03/24/10 10:07 AM, Damon Atkins wrote:
You could try copying the file to /tmp (i.e. swap/ram) and do a
continuous loop of checksums

As a variation on your suggestion, I implemented a bash script
that runs sha1sum 10,000 times with a pause of 0.1 s between
attempts and compares each result against what seemed to be the
correct checksum.
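
The exact script isn't reproduced here, but a minimal sketch of the
loop described above might look like this (it takes the first run's
output as the "correct" checksum, and the fractional pause assumes a
sleep(1) that accepts sub-second values):

    #!/usr/bin/env bash
    # Checksum the same file 10,000 times and count mismatches
    # against the first result, which is assumed to be correct.
    FILE=/lib/libdlpi.so.1                      # file under test
    GOOD=$(sha1sum "$FILE" | awk '{print $1}')
    FAILS=0
    for ((i = 1; i <= 10000; i++)); do
        SUM=$(sha1sum "$FILE" | awk '{print $1}')
        if [ "$SUM" != "$GOOD" ]; then
            FAILS=$((FAILS + 1))
            echo "iteration $i: bad checksum $SUM"
        fi
        sleep 0.1    # 0.1 s pause; needs sub-second sleep support
    done
    echo "$FAILS failures out of 10000"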

sha1sum on /lib/libdlpi.so.1 resulted in incorrect checksums 11% of the time
sha1sum on /tmp/libdlpi.so.1 resulted in 5 failures out of 10,000
sha1sum on /lib/libpam.so.1 resulted in zero errors in 10,000
sha1sum on /tmp/libpam.so.1: ditto.

So what we have is a pattern-sensitive failure that is also
sensitive to how busy the CPU is (and that doesn't fail while
running VTS). md5sum and sha256sum produced similar results, and
presumably so would fletcher2. To get really meaningful results, the
machine should be otherwise idle (but then, maybe it wouldn't fail).

Is anyone willing to speculate (or suggest further experiments)
about what failure mode could make a checksum calculation
pattern-sensitive and also thousands of times more likely to fail
when the file is read from disk rather than from tmpfs? FWIW the
failures are pretty consistent, mostly but not always producing the
same bad checksum.

So at boot, the CPU is busy, increasing the probability of this
pattern-sensitive failure, and this one time it failed on every
read of /lib/libdlpi.so.1. With copies=1 this was twice as likely
to happen, and when it did, ZFS returned an error on every attempt
to read the file. With copies=2, as in this case, ZFS doesn't
return an error when attempting to read. Also, there were no
set-bit errors this time, but then I have no idea what a set-bit
error is.
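
For reference, copies is a per-dataset ZFS property; a minimal
sketch of setting and checking it (the dataset name here is
hypothetical):

    # "copies" controls how many copies of each data block ZFS keeps;
    # it only affects blocks written after the property is changed.
    zfs set copies=2 rpool/export/home    # hypothetical dataset
    zfs get copies rpool/export/home      # verify the setting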

On 03/24/10 12:32 PM, Richard Elling wrote:

Clearly, fletcher2 identified the problem.

Ironically, on this hardware it seems it created the problem :-).
However, you have been vindicated: it was a pattern-sensitive
problem, as you have long suggested it might be.

So: that the file is still readable is a mystery, but how it came
to be flagged as bad in ZFS isn't any more.

Cheers -- Frank


