2012-01-13 4:26, Richard Elling wrote:
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...

Bug alert: it seems the block-pointer block with that
mismatching checksum did not get invalidated, so my
attempts to rsync known-good versions of the bad files
from an external source seemed to work, but in fact
failed: subsequent reads of the files produced IO errors.
Apparently (my wild guess), upon writing the blocks,
checksums were calculated and a matching DDT entry
was found. ZFS did not care that the entry pointed to
inconsistent data (no longer matching the checksum);
it simply incremented the DDT reference counter.

The problem was solved by disabling dedup for the dataset
involved and rsync-updating the file in-place. Once dedup
was off and the new blocks were written uniquely,
everything was readable (and md5sums matched) as expected.
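
For reference, this is roughly what the fix amounted to
(a sketch; pool/export, backuphost and the file path are
placeholders for my actual dataset, source and file):

root@openindiana:~# zfs set dedup=off pool/export
root@openindiana:~# rsync -av --inplace \
    backuphost:/export/path/to/file /pool/export/path/to/file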

I can think of a couple of solutions:

In theory, the verify option will correct this going forward.

But in practice there are many suggestions to disable
verification, because it slows down writes even beyond
what the DDT itself costs in performance, and since there
is only something like a 10^-77 chance that two different
blocks would produce the same checksum, it is supposedly
there only for the paranoid.
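
For completeness, turning verification back on would be a
one-liner (a sketch, with pool/export standing in for the
actual dataset name):

root@openindiana:~# zfs set dedup=verify pool/export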


If a block is detected to be corrupt (its checksum no
longer matches the data), the checksum value in the block
pointers and the DDT entry should be rewritten to an
"impossible" value, perhaps all-zeroes or such, as soon
as the error is detected.

What if it is a transient fault?

Reread disk, retest checksums?.. I don't know... :)


Alternatively (opportunistically), a flag might be set
in the DDT entry requesting that a new write matching
this stored checksum should get committed to disk, thus
"repairing" all files which reference the block (or at
least stopping the IO errors).

verify eliminates this failure mode.

Sounds right, though I didn't try that.
But my scrub is not yet complete, so maybe there will be
more test subjects ;)
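
When the scrub does complete, something like this should
list any remaining victims (a sketch, assuming the pool is
simply named "pool"):

root@openindiana:~# zpool status -v pool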


Alas, so far there is anyway no guarantee that it was
not the checksum itself that got corrupted (short of
using ZDB to retrieve the block contents and comparing
them with a known-good copy of the data, if any), so
corruption of the checksum would also cause replacement
of "really-good-but-normally-inaccessible" data.

Extremely unlikely. The metadata is also checksummed. To arrive here
you would have to have two corruptions, each of which generates the proper
checksum. Not impossible, but… I'd buy a lottery ticket instead.

I rather meant the opposite: the file data is actually
good, but the checksums (apparently both the DDT and
block-pointer ones, with all their ditto copies) are bad,
either due to disk rot or RAM failures. For example, are
the "block pointer" and "dedup" copies of the sha256
checksum recalculated independently at each stage, or
reused, on writes of a block?..


See also dedupditto. I could argue that the default value of dedupditto
should be 2 rather than "off".

I couldn't set it to smallish values (like 64) on the
oi_148a LiveUSB; it seems nonzero values below some
built-in minimum (100, if I read the illumos sources
right: ZIO_DEDUPDITTO_MIN, checked in spa_prop_validate())
are rejected, which would explain why 127 is accepted
below:

root@openindiana:~# zpool set dedupditto=64 pool
cannot set property for 'pool': invalid argument for this pool operation

root@openindiana:~# zpool set dedupditto=2 pool
cannot set property for 'pool': invalid argument for this pool operation

root@openindiana:~# zpool set dedupditto=127 pool
root@openindiana:~# zpool get dedupditto pool
NAME  PROPERTY    VALUE       SOURCE
pool  dedupditto  127         local


Thanks,
//Jim