Andrey Kuzmin <andrey.v.kuz...@gmail.com> writes:

> Kjetil Torgrim Homme wrote:
>> for some reason I, like Steve, thought the checksum was calculated on
>> the uncompressed data, but a look in the source confirms you're right,
>> of course.
>>
>> thinking about the consequences of changing it, RAID-Z recovery would be
>> much more CPU intensive if hashing was done on uncompressed data --
>
> I don't quite see how dedupe (based on sha256) and parity (based on
> crc32) are related.

I tried to hint at an explanation:

>> every possible combination of the N-1 disks would have to be
>> decompressed (and most combinations would fail), and *then* the
>> remaining candidates would be hashed to see if the data is correct.

the key is that you don't know which block is corrupt.  if everything is
hunky-dory, the parity will match the data.  parity in RAID-Z1 is not a
checksum like CRC32, it is simply XOR (like in RAID 5).  here's an
example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

this is a single stripe with 2-bit disk blocks for simplicity.  if you
XOR together all the blocks, you get 00.  that's the simple premise for
reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and
so on.
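
to make the premise concrete, here's a tiny python sketch (just an
illustration, nothing to do with how the actual ZFS code is written):

  # toy illustration: XOR parity over the four 2-bit data blocks in
  # the stripe above.
  def parity(blocks):
      p = 0
      for b in blocks:
          p ^= b
      return p

  data = [0b00, 0b01, 0b10, 0b10]          # D1..D4
  pp = parity(data)                        # 0b01, matching PP above
  assert parity(data + [pp]) == 0          # the whole stripe XORs to zero

  # say D2 is lost: rebuild it from the other data blocks plus parity
  d2 = parity([data[0], data[2], data[3], pp])
  assert d2 == data[1]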

so what happens if a bit flips in D4 and it becomes 00?  the total XOR
isn't 00 anymore, it is 10 -- something is wrong.  but unless you get a
hardware signal from D4, you don't know which block is corrupt.  this is
a major problem with RAID 5: the data is irrevocably corrupt.  the
parity discovers the error, and can alert the user, but that's the best
it can do.  in RAID-Z the hash saves the day: first *assume* D1 is bad
and reconstruct it from parity.  if the hash for the block is OK, D1
*was* bad.  otherwise, assume D2 is bad.  and so on.

so, the parity calculation will indicate which stripes contain bad
blocks.  but the hash, the sanity check that tells us which disk blocks
are actually bad, must be calculated over all the stripes a ZFS block
(record) consists of.
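
in rough pseudo-python the idea looks like this (again only a sketch --
try_recover(), xor_all() and the flat column layout are made up for
illustration, the real RAID-Z code is nothing this simple).  note that
since ZFS checksums the data as it is stored on disk, each candidate
record can be hashed directly; if the checksum covered uncompressed
data, every candidate would first have to be decompressed:

  import hashlib

  def xor_all(cols):
      # XOR a list of equal-length byte strings together
      out = bytearray(len(cols[0]))
      for c in cols:
          for i, byte in enumerate(c):
              out[i] ^= byte
      return bytes(out)

  # columns: the data read from each disk for one ZFS record,
  # pp: the parity column, expected: the checksum from the block
  # pointer (sha256 here).  assumes the checksum over the record as
  # read has already failed, so some column must be bad.
  def try_recover(columns, pp, expected):
      for bad in range(len(columns)):              # assume disk 'bad' is wrong
          candidate = list(columns)
          others = [c for i, c in enumerate(columns) if i != bad]
          candidate[bad] = xor_all(others + [pp])  # rebuild it from parity
          record = b"".join(candidate)             # reassemble the record
          if hashlib.sha256(record).digest() == expected:
              return candidate                     # the assumption was right
      return None                                  # more than one column bad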

>> this would be done on a per recordsize basis, not per stripe, which
>> means reconstruction would fail if two disk blocks (512 octets) on
>> different disks and in different stripes go bad.  (doing an exhaustive
>> search for all possible permutations to handle that case doesn't seem
>> realistic.)

actually this limitation is the same whether the hash is computed before
or after compression; it's just that each permutation is more expensive
to check when every candidate has to be decompressed before it can be
hashed.
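
just to put a number on "doesn't seem realistic", a back-of-the-envelope
count, under the simplifying (and purely hypothetical) assumption of a
128 KiB record striped over 4 data disks + 1 parity in 512-octet blocks,
guessing independently per stripe which disk, if any, is bad:

  disks, stripes = 5, 64           # 4 data + parity, 128 KiB / 4 / 512 = 64
  # per stripe: "no disk is bad" or any one of the 5 disks, so an
  # exhaustive search over independent per-stripe guesses means hashing
  print((disks + 1) ** stripes)    # 6**64 candidate records -- 50 digits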

>> in addition, hashing becomes slightly more expensive since more data
>> needs to be hashed.
>>
>> overall, my guess is that this choice (made before dedup!) will give
>> worse performance in normal situations in the future, when dedup+lzjb
>> will be very common, at a cost of faster and more reliable resilver.  in
>> any case, there is not much to be done about it now.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
