Kjetil Torgrim Homme wrote:

> I don't know how tightly interwoven the dedup hash tree and the block
> pointer hash tree are, or if it is at all possible to disentangle them.

At the moment I'd say very interwoven by design.

> conceptually it doesn't seem impossible, but that's easy for me to
> say, with no knowledge of the zio pipeline...

Correct, it isn't impossible, but there would probably need to be two checksums held: one of the untransformed data (i.e. uncompressed and unencrypted) and one of the transformed data (compressed and encrypted). That has different tradeoffs, and SHA256 can be expensive too; see:


Note also that compress/encrypt/checksum and dedup are separate pipeline stages, so while dedup is happening for block N, block N+1 can be getting transformed. This is designed to take advantage of multiple scheduling units (threads, CPUs, cores, etc.).
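As a rough illustration of that overlap (a toy sketch in Python, not the real zio pipeline code), the two stages can be modelled as two threads joined by a queue. Everything here is an assumption made for the example: the names transform_stage, dedup_stage and run_pipeline are hypothetical, the "dedup table" is a plain dict, and encryption is left out so that only the stage structure is visible.

```python
import hashlib
import queue
import threading
import zlib


def transform_stage(blocks, out_q):
    """Stage 1: compress each block and checksum the transformed bytes."""
    for index, data in enumerate(blocks):
        compressed = zlib.compress(data)
        checksum = hashlib.sha256(compressed).hexdigest()
        out_q.put((index, checksum, compressed))
    out_q.put(None)  # sentinel: no more blocks


def dedup_stage(in_q, table, results):
    """Stage 2: look each checksum up in a toy dedup table (a dict)."""
    while True:
        item = in_q.get()
        if item is None:
            break
        index, checksum, compressed = item
        if checksum in table:
            table[checksum]["refs"] += 1  # already stored: bump refcount
            results.append((index, "duplicate"))
        else:
            table[checksum] = {"refs": 1, "data": compressed}
            results.append((index, "unique"))


def run_pipeline(blocks):
    """Run both stages concurrently; block N+1 can be transformed
    while block N is still being looked up in the dedup table."""
    q = queue.Queue(maxsize=4)
    table, results = {}, []
    producer = threading.Thread(target=transform_stage, args=(blocks, q))
    consumer = threading.Thread(target=dedup_stage, args=(q, table, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return table, results
```

Calling run_pipeline([b"a" * 4096, b"b" * 4096, b"a" * 4096]) reports the third block as a duplicate of the first. The point is only the shape: the dedup lookup works on the checksum of the already transformed data, and the two stages do not have to wait for each other.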

> oh, how does encryption play into this?  just don't?  knowing that
> someone else has the same block as you is leaking information, but that
> may be acceptable -- just make different pools for people you don't

compress, encrypt, checksum, dedup.

You are correct that it is an information leak, but only within a dataset and its clones, and only if you can observe the deduplication stats. You need to use zdb to get enough info to see the leak, and that means you have access to the raw devices; the dedupratio isn't really enough unless the pool is really idle or has only one user writing at a time.

For the encryption case, deduplication of the same plaintext block will only work within a dataset or a clone of it, because only in those cases do you have the same key (and the way I have implemented the IV generation for AES CCM/GCM mode ensures that the same plaintext will have the same IV, so the ciphertexts will match). Also, if you place a block in an unencrypted dataset that happens to match the ciphertext in an encrypted dataset, they won't dedup either (you need to understand what I've done with the AES CCM/GCM MAC and the zio_cksum_t field in the blkptr_t, and how that is used by dedup, to see why).
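To make the "same key, same plaintext, same ciphertext" point concrete, here is a toy model in Python. It is a sketch under loud assumptions, not the actual implementation: the IV is assumed to be a keyed hash of the plaintext (the text above only says that the same plaintext gets the same IV, not how), the "cipher" is a SHA-256 keystream standing in for AES CCM/GCM and is not secure, and the names derive_iv, toy_encrypt and dedup_key are hypothetical. The MAC handling and the unencrypted-dataset case are deliberately not modelled.

```python
import hashlib
import hmac


def derive_iv(key: bytes, plaintext: bytes) -> bytes:
    # ASSUMPTION: the IV is a keyed function of the plaintext, so the
    # same plaintext under the same key always yields the same IV.
    return hmac.new(key, plaintext, hashlib.sha256).digest()[:12]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Stand-in for AES CCM/GCM: XOR with a SHA-256 based keystream.
    # NOT secure; it only needs to be deterministic in (key, IV).
    iv = derive_iv(key, plaintext)
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return iv + bytes(p ^ s for p, s in zip(plaintext, stream))


def dedup_key(key: bytes, plaintext: bytes) -> str:
    # Dedup sees only the checksum of the transformed (encrypted) block.
    return hashlib.sha256(toy_encrypt(key, plaintext)).hexdigest()
```

With two different per-dataset keys, dedup_key(key_a, block) and dedup_key(key_b, block) differ for the same block, so nothing dedups across datasets; with the same key they are equal, which is both what lets dedup work inside a dataset and its clones and what leaks the fact that two blocks are identical.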

If that small information leak isn't acceptable even within the dataset, then don't enable both encryption and deduplication on those datasets, and don't delegate that property to your users either. Or you can frequently rekey your per-dataset data encryption keys ('zfs key -K'), but then you might as well turn dedup off. That said, there are some very good use cases in multi-level security where doing dedup/encryption and rekey provides a nice effect.

Darren J Moffat
zfs-discuss mailing list
