2012-01-21 0:33, Jim Klimov wrote:
2012-01-13 4:12, Jim Klimov wrote:
As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...
Well, after some crawling over my data with zdb, od and dd,
I guess ZFS was right about finding checksum errors - the
checksum stored in the block pointer matched that of the
block on the original system, while the on-disk data block
itself was indeed corrupt.
Well, as I'm moving to close my quest with broken data, I'd
like to draw up some conclusions and RFEs. I am still not
sure if they are factually true, I'm still learning the ZFS
internals. So "it currently seems to me, that":
1) My on-disk data could get corrupted for whatever reason
ZFS tries to protect it from - at least once, probably
by a misdirected write (i.e. the head landed somewhere
other than where it was asked to write). It cannot be
ruled out that the checksums got broken in non-ECC RAM
before the block pointers for some of my data were
written, thus leading to mismatches. One way or another,
ZFS noted the discrepancy during scrubs and "normal"
file accesses. There is no (automatic) way to tell which
part is faulty - the checksum or the data.
2) In the case where on-disk data did get corrupted, the
checksum in block pointer was correct (matching original
data), but the raidz2 redundancy did not aid recovery.
3) The file in question was created on a dataset with enabled
deduplication, so at the very least the dedup bit was set
on the corrupted block's pointer and a DDT entry likely
existed. Attempts to rewrite the block with the original
one (with "dedup=on") in fact failed, probably because
the matching checksum was already in the DDT.
Rewrites of such blocks with "dedup=off" or "dedup=verify"
succeeded.
Failure/success was tested by "sync; md5sum FILE" some
time after the fix attempt. (When done just after the
fix, the test tends to return success even if the on-disk
data is bad, "thanks" to caching.)
My last attempt was to set "dedup=on" and write the block
again and sync; the (remote) computer hung instantly :(
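For reference, the repair sequence that worked for me can be sketched
roughly as follows - the pool/dataset name "tank/data" and the file
paths are placeholders, not my actual ones:

```shell
# Disable dedup (or set dedup=verify) on the dataset before rewriting
# the damaged file, so the write is not short-circuited by the stale
# DDT entry that still carries the "good" checksum:
zfs set dedup=off tank/data

# Rewrite the file from a known-good copy and flush the TXG:
rsync -c /backup/FILE /tank/data/FILE
sync
```

The later "sync; md5sum FILE" comparison should then be run only after
some delay, for the caching reason noted above.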
3*) The RFE stands: deduped blocks found to be invalid and not
recovered by redundancy should somehow be evicted from the DDT
(or marked as requiring verification-before-write) so as
not to pollute further writes, including repair attempts.
Alternatively, "dedup=verify" takes care of the situation
and should be the recommended option.
3**) It was suggested to set "dedupditto" to small values,
like "2". My oi_148a refused to set values smaller than 100.
Moreover, it seems reasonable to have two dedupditto values:
for example, to make a ditto copy when DDT reference counter
exceeds some small value (2-5), and add ditto copies every
"N" values for frequently-referenced data (every 64-128).
4) I did not get to check whether "dedup=verify" triggers a
checksum-mismatch alarm if the preexisting on-disk data
does not in fact match the checksum.
I think such an alarm should exist and fire, just as it
would for a scrub, a read, or any other means of error
detection and recovery.
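If someone wants to test this, the setup would presumably look like the
following (dataset and paths are placeholders again):

```shell
# Enable verify-on-dedup-match, then rewrite the suspect file:
zfs set dedup=verify tank/data
cp /backup/FILE /tank/data/FILE
sync

# Whether a checksum error shows up here after verify trips over bad
# on-disk data is exactly the open question of point 4:
zpool status -v tank
```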
5) It seems like a worthy RFE to include a pool-wide option to
"verify-after-write/commit" - to test that recent TXG sync
data has indeed made it to disk on (consumer-grade) hardware
into the designated sector numbers. Perhaps the test should
be delayed several seconds after the sync writes.
If the verification fails, data from recent TXGs can be
recovered from on-disk redundancy and/or may still exist
in the RAM cache, and can be rewritten again (and tested again).
More importantly, a failed test *may* mean that the write
landed on disk randomly, and the pool should be scrubbed
ASAP. It may be guessed that the yet-unknown error can lie
within "epsilon" tracks (sector numbers) from the currently
found non-written data, so if it is possible to scrub just
a portion of the pool based on DVAs - that's a preferred
start. It is possible that some data can be recovered if
it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
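Lacking such a pool option, a crude userland approximation can be
scripted. This is only a sketch of the intended check: on a real pool
the re-read would have to bypass the ARC (e.g. by a long delay or an
export/import), which a simple script like this does not guarantee.

```shell
# Sketch: write a file, sync, then re-read and compare checksums.
# On real hardware the read may still be served from cache, so this
# demonstrates the check itself, not a guaranteed read from disk.
verify_write() {
    src=$1
    dst=$2
    cp "$src" "$dst"
    sync
    a=$(md5sum < "$src")
    b=$(md5sum < "$dst")
    if [ "$a" = "$b" ]; then
        echo "verified"
    else
        echo "MISMATCH - rewrite the file and scrub the pool"
    fi
}
```

Usage would be e.g. `verify_write /backup/FILE /tank/data/FILE`, run
some seconds after the write as suggested above.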
Finally, I should say I'm sorry for lame questions arising
from not reading the format spec and zdb blogs carefully ;)
In particular, it was my understanding for a long time that
block pointers each have a sector of their own, leading to
overheads that I've seen. Now I know (and have checked) that
most of the block-pointer tree is made of larger groupings
(128 blkptr_t's in a single 16KB block), reducing the impact
of BPs on fragmentation and/or the slack-space waste of large
sectors that I predicted and expected for the past year.
Sad that nobody ever contradicted that (mis)understanding
of mine.
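The arithmetic behind that grouping, for the record:

```shell
# A blkptr_t is 128 bytes, so a 16 KB (16384-byte) indirect block
# holds 16384 / 128 = 128 block pointers:
echo $((16 * 1024 / 128))    # prints 128
```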
//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss