On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
> 2012-01-21 0:33, Jim Klimov wrote:
>> 2012-01-13 4:12, Jim Klimov wrote:
>>> As I recently wrote, my data pool has experienced some
>>> "unrecoverable errors". It seems that a userdata block
>>> of deduped data got corrupted and no longer matches the
>>> stored checksum. For whatever reason, raidz2 did not
>>> help in recovery of this data, so I rsync'ed the files
>>> over from another copy. Then things got interesting...
>> 
>> 
>> Well, after some crawling over my data with zdb, od and dd,
>> I guess ZFS was right about finding checksum errors - the
>> metadata's checksum matched that of a block on original
>> system, and the data block was indeed erring.
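
For what it's worth, that kind of inspection can be driven from zdb alone. A rough sketch (pool/dataset name, object number and the DVA triple below are placeholders, not your actual values):

    # Dump the file's dnode and its whole block-pointer tree; with five
    # -d's zdb prints each blkptr's DVAs, sizes and cksum= field.
    zdb -ddddd tank/data <object#>

    # Dump the block at a given DVA (vdev:offset:size, in hex) straight
    # off the pool, to compare against the known-good copy with od/dd.
    zdb -R tank 0:1a2b3c000:20000 | less
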
> 
> Well, as I'm moving to close my quest with broken data, I'd
> like to draw up some conclusions and RFEs. I am still not
> sure if they are factually true, I'm still learning the ZFS
> internals. So "it currently seems to me, that":
> 
> 1) My on-disk data could get corrupted for whatever reason
>   ZFS tries to protect it from, at least once probably
>   from misdirected writes (i.e. the head landed not where
>   it was asked to write). It can not be ruled out that the
>   checksums got broken in non-ECC RAM before writes of
>   block pointers for some of my data, thus leading to
>   mismatches. One way or another, ZFS noted the discrepancy
>   during scrubs and "normal" file accesses. There is no
>   (automatic) way to tell which part is faulty - checksum
>   or data.

Untrue. If a block pointer is corrupted, the mismatch is detected on
read, logged, and that copy is ignored. I'm not sure you have grasped
the concept of checksums in the parent object.
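
To illustrate that concept: every block's checksum is kept in the blkptr that references it, one level up, all the way to the uberblock; a data block never carries its own checksum. A hedged way to see it (dataset and object number are placeholders):

    # The cksum= shown on each L0 (data) entry is stored in the L1
    # indirect block above it; the L1's checksum is in its parent L2,
    # and so on up the tree.
    zdb -ddddd tank/data <object#>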

> 
> 2) In the case where on-disk data did get corrupted, the
>   checksum in block pointer was correct (matching original
>   data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root cause?
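
As a starting point for root-cause analysis (pool name is a placeholder):

    # Which vdevs accumulated CKSUM errors, and which files are affected:
    zpool status -v tank

    # The FMA error reports behind them (ereport.fs.zfs.checksum and
    # friends) carry per-vdev detail and timestamps, which helps separate
    # a single flaky disk from controller or RAM trouble:
    fmdump -eV | less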

> 
> 3) The file in question was created on a dataset with
>   deduplication enabled, so at the very least the dedup bit was
>   set on the corrupted block's pointer and a DDT entry likely
>   existed. Attempts to rewrite the block with the original
>   one (with "dedup=on") did in fact fail, probably because
>   the matching checksum was already in the DDT.

Works as designed.
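
That is, with dedup=on and no verify, a write whose checksum already has a DDT entry never touches the old data: the entry's reference count is bumped and the new copy is discarded, so a good rewrite cannot displace the corrupt block. A hedged way to look at the table (pool name is a placeholder):

    # One -D gives the dedup ratio summary, two give a histogram;
    # more D's dump the individual DDT entries with refcnt and DVAs.
    zdb -DD tank
    zdb -DDDD tank | less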

> 
>   Rewrites of such blocks with "dedup=off" or "dedup=verify"
>   succeeded.
> 
>   Failure/success was tested with "sync; md5sum FILE" some
>   time after the fix attempt. (When done just after the
>   fix, the test tends to report success even if the on-disk
>   data is bad, "thanks" to caching).

No, I think you've missed the root cause. By default, data that does
not match its checksum is not used.
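
One way to make sure the check really hits the disks rather than the ARC is to export and re-import the pool before re-reading (pool name and path are placeholders):

    sync
    zpool export tank && zpool import tank
    md5sum /tank/data/FILE    # this read is now satisfied from disk, not cache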

> 
>   My last attempt was to set "dedup=on" and write the block
>   again and sync; the (remote) computer hung instantly :(
> 
> 3*) The RFE stands: deduped blocks found to be invalid and not
>   recovered by redundancy should somehow be evicted from the DDT
>   (or marked for required verification-before-write) so as
>   not to pollute further writes, including repair attempts.
> 
>   Alternatively, "dedup=verify" takes care of the situation
>   and should be the recommended option.

I have lobbied for this, but so far people prefer performance to dependability.
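
For reference, turning it on is just (dataset name is a placeholder):

    # With verify, a DDT checksum hit triggers a read and byte-for-byte
    # comparison of the existing block before the write is deduped, so a
    # corrupt on-disk copy can no longer silently absorb a good rewrite.
    zfs set dedup=verify tank/data
    zfs get dedup tank/data
    # or pair verification with an explicit strong hash:
    zfs set dedup=sha256,verify tank/data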

> 
> 3**) It was suggested to set "dedupditto" to small values,
>   like "2". My oi_148a refused to set values smaller than 100.
>   Moreover, it seems reasonable to have two dedupditto values:
>   for example, to make a ditto copy when the DDT reference counter
>   exceeds some small value (2-5), and to add further ditto copies
>   every "N" references for frequently-referenced data (every 64-128).
> 
> 4) I did not get to check whether "dedup=verify" triggers a
>   checksum mismatch alarm if the preexisting on-disk data
>   does not in fact match the checksum.

All checksum mismatches are handled the same way.

> 
>   I think such an alarm should exist and do as much as a scrub,
>   read or other means of error detection and recovery would.

Checksum mismatches are logged. What was your root cause?
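
An easy way to check is to compare the pool's error state around the rewrite (pool name is a placeholder):

    zpool status -v tank    # note the CKSUM counters and the errors: list
    # ... rewrite the file on the dedup=verify dataset, then:
    sync
    zpool status -v tank    # a mismatch found during verify should show up here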

> 
> 5) It seems like a worthy RFE to include a pool-wide option to
>   "verify-after-write/commit" - to test that recent TXG sync
>   data has indeed made it to disk on (consumer-grade) hardware
>   into the designated sector numbers. Perhaps the test should
>   be delayed several seconds after the sync writes.

There are highly reliable systems in the fault-tolerant market that
do this.

> 
>   If the verification fails, data from recent TXGs may still
>   be in the RAM cache or recoverable from on-disk redundancy,
>   and can be rewritten (and tested) again.
> 
>   More importantly, a failed test *may* mean that the write
>   landed on disk randomly, and the pool should be scrubbed
>   ASAP. It may be guessed that the yet-unknown error can lie
>   within "epsilon" tracks (sector numbers) from the currently
>   found non-written data, so if it is possible to scrub just
>   a portion of the pool based on DVAs - that's a preferred
>   start. It is possible that some data can be recovered if
>   it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
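
There is no way today to scrub only a DVA range; the available remedy after a suspected stray write is a whole-pool scrub, which re-reads every allocated block and repairs from redundancy where it can (pool name is a placeholder):

    zpool scrub tank
    zpool status tank    # scrub progress, repaired bytes and error counts
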
> 
> Finally, I should say I'm sorry for lame questions arising
> from not reading the format spec and zdb blogs carefully ;)
> 
> In particular, it was my understanding for a long time that
> block pointers each have a sector of their own, leading to
> the overheads that I've seen. Now I know (and checked) that most
> of the block-pointer tree is made of larger groupings (128
> blkptr_t's in a single 16KB block), reducing the impact of
> BPs on fragmentation and/or the slack waste of large sectors
> that I predicted and expected for the past year.
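
For the record, the arithmetic: a blkptr_t is 128 bytes and the default indirect block size is 16 KB, so each indirect block holds 16384 / 128 = 128 child pointers; the L1/L2 lines in zdb output show the 0x4000 (16 KB) logical size (dataset and object number are placeholders):

    echo $((16384 / 128))             # -> 128 pointers per indirect block
    zdb -ddddd tank/data <object#>    # L1/L2 entries show size=4000L/...P
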
> 
> Sad that nobody ever contradicted that (mis)understanding
> of mine.

Perhaps some day you can become a ZFS guru, but the journey is long...
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
