Kjetil Torgrim Homme wrote:
> I don't know how tightly interwoven the dedup hash tree and the block
> pointer hash tree are, or if it is at all possible to disentangle them.
At the moment I'd say very interwoven by design.
> conceptually it doesn't seem impossible, but that's easy for me to
> say, with no knowledge of the zio pipeline...
Correct, it isn't impossible, but there would probably need to be
two checksums held: one of the untransformed data (i.e. uncompressed and
unencrypted) and one of the transformed data (compressed and encrypted).
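A minimal sketch of that two-checksum idea, in Python. The function name and the use of zlib compression as a stand-in for the full compress/encrypt transform are illustrative assumptions, not ZFS code:

```python
import hashlib
import zlib

def two_checksums(data: bytes) -> tuple[str, str]:
    """Illustrative only: checksum the logical (untransformed) data,
    then transform it (zlib compression stands in here for the
    compress+encrypt stages) and checksum the physical result too."""
    logical = hashlib.sha256(data).hexdigest()          # what dedup could match on
    transformed = zlib.compress(data)                   # stand-in transform
    physical = hashlib.sha256(transformed).hexdigest()  # verifies the on-disk bits
    return logical, physical

block = b"example block contents" * 16
logical_sum, physical_sum = two_checksums(block)
# The two checksums differ because the bytes differ after transformation.
assert logical_sum != physical_sum
```

The cost Darren alludes to is visible here: every block pays for two SHA256 passes instead of one.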
That has different tradeoffs and SHA256 can be expensive too see:
http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via
Note also that the compress/encrypt/checksum and the dedup are separate
pipeline stages, so while dedup is happening for block N, block N+1 can be
getting transformed. This is designed to take advantage of multiple
scheduling units (threads, CPUs, cores, etc.).
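A rough sketch of that stage overlap, assuming hypothetical stage functions (zlib again standing in for the transform; this is not the zio pipeline itself):

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions standing in for the zio pipeline stages.
def transform(block: bytes) -> bytes:
    return zlib.compress(block)          # compress (encryption omitted)

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def dedup_lookup(table: dict, digest: str, block: bytes) -> bool:
    """Return True if an identical block was already stored (a dedup hit)."""
    if digest in table:
        return True
    table[digest] = block
    return False

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
table: dict = {}
with ThreadPoolExecutor() as pool:
    # Transformation of block N+1 proceeds while other blocks are
    # being handled - the stages are independent units of work.
    transformed = list(pool.map(transform, blocks))
hits = [dedup_lookup(table, checksum(t), t) for t in transformed]
# The third block is identical to the first, so it dedups.
assert hits == [False, False, True]
```

The real pipeline interleaves stages per block rather than batching them, but the point is the same: independent stages can run on different scheduling units concurrently.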
> oh, how does encryption play into this? just don't? knowing that
> someone else has the same block as you is leaking information, but that
> may be acceptable -- just make different pools for people you don't
> trust.
compress, encrypt, checksum, dedup.
You are correct that it is an information leak, but only within a dataset
and its clones, and only if you can observe the deduplication stats (and
you need to use zdb to get enough info to see the leak, which means
you have access to the raw devices). The dedupratio isn't really enough
unless the pool is really idle or has only one user writing at a time.
For the encryption case, deduplication of the same plaintext block will
only work within a dataset or a clone of it, because only in those
cases do you have the same key (and the way I have implemented the IV
generation for AES CCM/GCM mode ensures that the same plaintext will
have the same IV, so the ciphertexts will match). Also, if you place a
block in an unencrypted dataset that happens to match the ciphertext in
an encrypted dataset, they won't dedup either (you need to understand
what I've done with the AES CCM/GCM MAC and the zio_cksum_t field in
the blkptr_t and how that is used by dedup to see why).
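To illustrate the property being described (not the actual implementation): if the IV is derived deterministically from the plaintext under the per-dataset key, then equal plaintexts in the same dataset encrypt to equal ciphertexts, while the same plaintext under another dataset's key does not. A conceptual sketch, with an assumed HMAC-based derivation:

```python
import hmac
import hashlib

def derive_iv(dataset_key: bytes, plaintext: bytes, iv_len: int = 12) -> bytes:
    """Conceptual sketch only: derive the IV deterministically from the
    plaintext under the per-dataset key, so the same plaintext in the
    same dataset always encrypts with the same IV (and hence the same
    ciphertext), which is what lets dedup still find matches."""
    return hmac.new(dataset_key, plaintext, hashlib.sha256).digest()[:iv_len]

key_a = b"dataset-A-key-00"
key_b = b"dataset-B-key-00"
block = b"identical plaintext block"

# Same key + same plaintext -> same IV -> dedupable ciphertext.
assert derive_iv(key_a, block) == derive_iv(key_a, block)
# A different dataset key yields a different IV, so no cross-dataset dedup.
assert derive_iv(key_a, block) != derive_iv(key_b, block)
```

The trade-off is exactly the leak discussed above: deterministic IVs reveal when two plaintext blocks are equal, but only to an observer who shares (or can inspect) that dataset.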
If that small information leak isn't acceptable even within the dataset,
then don't enable both encryption and deduplication on those datasets -
and don't delegate that property to your users either. Or you can
frequently rekey your per-dataset data encryption keys ('zfs key -K'), but
then you might as well turn dedup off - though there are some very good
use cases in multi-level security where doing dedup/encryption and rekey
provides a nice effect.
--
Darren J Moffat
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss