Yet again, I don't see how RAID-Z reconstruction is related to the
subject discussed (what data should be sha256'ed when both dedupe and
compression are enabled, raw or compressed). sha256 has nothing to do
with bad block detection (maybe it will when encryption is
implemented, but for now sha256 is used for duplicate candidate
look-up only).
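
To be concrete, here's roughly what I mean by duplicate candidate
look-up (a toy Python sketch, not the actual dedup code;
allocate_and_write() is a made-up stand-in for the write path):

  import hashlib

  ddt = {}  # toy stand-in for the dedup table, keyed by checksum

  def allocate_and_write(on_disk_bytes):
      # made-up helper: pretend to allocate a block and return a "pointer"
      return ('blkptr', hashlib.sha256(on_disk_bytes).hexdigest()[:8])

  def dedup_write(on_disk_bytes):
      # the key is the sha256 of the block as it is stored on disk,
      # i.e. of the compressed data when compression is enabled
      key = hashlib.sha256(on_disk_bytes).digest()
      entry = ddt.get(key)
      if entry is not None:
          entry['refcount'] += 1   # duplicate candidate found, no new write
          return entry['blkptr']
      blkptr = allocate_and_write(on_disk_bytes)
      ddt[key] = {'blkptr': blkptr, 'refcount': 1}
      return blkptr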

Regards,
Andrey




On Wed, Dec 16, 2009 at 5:18 PM, Kjetil Torgrim Homme
<kjeti...@linpro.no> wrote:
> Andrey Kuzmin <andrey.v.kuz...@gmail.com> writes:
>
>> Kjetil Torgrim Homme wrote:
>>> for some reason I, like Steve, thought the checksum was calculated on
>>> the uncompressed data, but a look in the source confirms you're right,
>>> of course.
>>>
>>> thinking about the consequences of changing it, RAID-Z recovery would be
>>> much more CPU intensive if hashing was done on uncompressed data --
>>
>> I don't quite see how dedupe (based on sha256) and parity (based on
>> crc32) are related.
>
> I tried to hint at an explanation:
>
>>> every possible combination of the N-1 disks would have to be
>>> decompressed (and most combinations would fail), and *then* the
>>> remaining candidates would be hashed to see if the data is correct.
>
> the key is that you don't know which block is corrupt.  if everything is
> hunky-dory, the parity will match the data.  parity in RAID-Z1 is not a
> checksum like CRC32; it is simply XOR (like in RAID 5).  here's an
> example with four data disks and one parity disk:
>
>  D1  D2  D3  D4  PP
>  00  01  10  10  01
>
> this is a single stripe with 2-bit disk blocks for simplicity.  if you
> XOR together all the blocks, you get 00.  that's the simple premise for
> reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and
> so on.
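>
> to make that concrete, here's a toy sketch in Python (not the actual
> raidz code, just the XOR relation above):
>
>   def xor_parity(blocks):
>       # parity is simply the XOR of all the data blocks
>       p = 0
>       for b in blocks:
>           p ^= b
>       return p
>
>   d1, d2, d3, d4 = 0b00, 0b01, 0b10, 0b10
>   pp = xor_parity([d1, d2, d3, d4])     # == 0b01, as in the stripe above
>
>   # any single block can be rebuilt from the other blocks plus parity
>   assert d1 == xor_parity([d2, d3, d4, pp])
>   assert d4 == xor_parity([d1, d2, d3, pp])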
>
> so what happens if a bit flips in D4 and it becomes 00?  the total XOR
> isn't 00 anymore, it is 10 -- something is wrong.  but unless you get a
> hardware signal from D4, you don't know which block is corrupt.  this is
> a major problem with RAID 5: the data is irrevocably corrupt.  the
> parity discovers the error, and can alert the user, but that's the best
> it can do.  in RAID-Z the hash saves the day: first *assume* D1 is bad
> and reconstruct it from parity.  if the hash for the block is OK, D1
> *was* bad.  otherwise, assume D2 is bad.  and so on.
>
> so, the parity calculation will indicate which stripes contain bad
> blocks, but the hash (the sanity check that tells you which disk
> blocks are actually bad) must be calculated over all the stripes a
> ZFS block (record) consists of.
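>
> in sketch form it looks something like this (Python, much simplified;
> reassemble() is a made-up helper that glues the data columns back into
> the logical block so it can be checked against the sha256 stored in
> the block pointer):
>
>   import hashlib
>
>   def xor_bytes(a, b):
>       # bytewise XOR of two equally sized columns
>       return bytes(x ^ y for x, y in zip(a, b))
>
>   def try_reconstruct(columns, parity, expected_sha256, reassemble):
>       for bad in range(len(columns)):
>           # assume column 'bad' is the corrupt one and rebuild it
>           rebuilt = parity
>           for i, col in enumerate(columns):
>               if i != bad:
>                   rebuilt = xor_bytes(rebuilt, col)
>           candidate = list(columns)
>           candidate[bad] = rebuilt
>           # the checksum is over the whole record, not per stripe
>           digest = hashlib.sha256(reassemble(candidate)).digest()
>           if digest == expected_sha256:
>               return candidate
>       return None   # more than one column is bad -- unrecoverable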
>
>>> this would be done on a per recordsize basis, not per stripe, which
>>> means reconstruction would fail if two disk blocks (512 octets) on
>>> different disks and in different stripes go bad.  (doing an exhaustive
>>> search for all possible permutations to handle that case doesn't seem
>>> realistic.)
>
> actually that limitation is the same whether the hash is taken before
> or after compression.  it's just that each permutation becomes more
> expensive to check when the hash covers the uncompressed data.
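>
> i.e. the difference per candidate is roughly this (sketch only;
> decompress() stands for lzjb or whatever the block was compressed
> with):
>
>   import hashlib
>
>   def candidate_ok_today(on_disk_bytes, expected_sha256):
>       # hash over the compressed (on-disk) bytes: one sha256 pass
>       return hashlib.sha256(on_disk_bytes).digest() == expected_sha256
>
>   def candidate_ok_if_hashing_uncompressed(on_disk_bytes, expected_sha256,
>                                            decompress):
>       # hash over the logical data: every candidate must first be
>       # decompressed, and most bad candidates won't decompress cleanly
>       try:
>           data = decompress(on_disk_bytes)
>       except Exception:
>           return False
>       return hashlib.sha256(data).digest() == expected_sha256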
>
>>> in addition, hashing becomes slightly more expensive since more data
>>> needs to be hashed.
>>>
>>> overall, my guess is that this choice (made before dedup!) will give
>>> worse performance in normal situations in the future, when dedup+lzjb
>>> will be very common, at a cost of faster and more reliable resilver.  in
>>> any case, there is not much to be done about it now.
>
> --
> Kjetil T. Homme
> Redpill Linpro AS - Changing the game
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
