On Tue, Oct 13, 2009 at 03:06:59PM -0400, James Carlson wrote:
> Matthew Ahrens wrote:
> > A new '-D' option to 'zfs send' is proposed.  This option will cause
> > dedup processing to be performed on the data being written to a send
> > stream.  Dedup processing is optional because it isn't always appropriate
> > (some kinds of data have very little duplication) and it has significant
> > costs:  the checksumming required to detect duplicate blocks is
> > CPU-intensive and the data that must be maintained while the stream
> > is being processed can occupy a very large amount of memory.
> 
> "Must" seems a little strong.  As it's just an optimization, throwing
> away old checksums if you have a large number of new ones to store --
> and thus possibly sending some things uncompressed that could have been
> compressed if you'd had infinite memory -- seems like a plausible
> trade-off to avoid using "very large" amounts of memory.  Moreover, if
> you find that you're seeing a lot of novel checksums (and thus using up
> a lot of memory), then that also implies that you're not getting much
> compression bang for the buck, and you might want to disable compression
> on the fly.  (Many stream processing compressors do something like this;
> disabling the compressor at least temporarily if the compression ratio
> drops below some set limit.)

Throwing away cached blocks probably needs to be done synchronously by
both ends, or else the receiver has to keep an index from block checksum
to block pointer for every block previously seen in the stream.
Synchronizing the two caches may require additional record types in the
stream.  But I agree with you: it should be possible to bound the memory
usage of zfs send dedup.
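
Roughly what I have in mind, as a sketch only (the record names, the
tiny table size, and the FIFO eviction policy below are made up for
illustration, not part of the proposal): the sender keeps a fixed-size
table of checksums, and whenever it evicts an entry it emits an explicit
record so the receiver can drop the same entry from its index.

/*
 * Minimal sketch of a bounded dedup table for a send stream.
 * Hypothetical record types; not the actual zfs send implementation.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define DDT_SLOTS 4              /* tiny bound, for illustration only */

typedef struct {
    uint8_t  cksum[32];          /* e.g. SHA-256 of the block (assumption) */
    uint64_t object;             /* where the block was first seen */
    uint64_t offset;
    int      valid;
} ddt_entry_t;

static ddt_entry_t table[DDT_SLOTS];
static int next_victim;          /* FIFO eviction cursor */

/* Hypothetical stream records the two sides would agree on. */
static void emit_write_by_ref(uint64_t obj, uint64_t off) {
    printf("WRITE_BYREF -> object %llu offset %llu\n",
        (unsigned long long)obj, (unsigned long long)off);
}
static void emit_forget(const ddt_entry_t *e) {
    printf("FORGET block first seen at object %llu offset %llu\n",
        (unsigned long long)e->object, (unsigned long long)e->offset);
}

/*
 * Look up a block's checksum; if seen before, reference it, otherwise
 * insert it, evicting (and telling the receiver to evict) when full.
 * Returns 1 if the block was deduplicated, 0 if it must be sent in full.
 */
static int ddt_process(const uint8_t cksum[32], uint64_t obj, uint64_t off) {
    for (int i = 0; i < DDT_SLOTS; i++) {
        if (table[i].valid && memcmp(table[i].cksum, cksum, 32) == 0) {
            emit_write_by_ref(table[i].object, table[i].offset);
            return 1;
        }
    }
    ddt_entry_t *slot = &table[next_victim];
    if (slot->valid)
        emit_forget(slot);       /* keep the receiver's index in sync */
    memcpy(slot->cksum, cksum, 32);
    slot->object = obj;
    slot->offset = off;
    slot->valid = 1;
    next_victim = (next_victim + 1) % DDT_SLOTS;
    return 0;                    /* novel block: send its data */
}

int main(void) {
    uint8_t a[32] = {1}, b[32] = {2};
    ddt_process(a, 5, 0);        /* novel */
    ddt_process(b, 5, 4096);     /* novel */
    ddt_process(a, 7, 0);        /* duplicate -> WRITE_BYREF */
    return 0;
}

With something like that, memory on both sides is bounded by the table
size, and a block whose checksum has been evicted is simply sent in full
again, as you suggest.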

Also, in ZFS today block checksums are used for integrity protection,
not for block equality comparisons.  The fact that blocks here would be
treated as equal on a checksum match alone, without comparing the actual
data, does worry me somewhat.
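
To make that concrete, here is the difference between trusting the
checksum alone and verifying the match by reading the referenced block
back and comparing the data (again only a sketch; read_back() is a
made-up helper, and the verify path costs an extra read per
deduplicated block):

/*
 * Sketch of checksum-only match vs. verified match.  Hypothetical
 * helpers; not actual ZFS code.
 */
#include <stdint.h>
#include <string.h>

#define BLKSZ 4096

/* Trusting the checksum: two blocks are "equal" if their hashes match. */
static int match_by_checksum(const uint8_t a[32], const uint8_t b[32]) {
    return memcmp(a, b, 32) == 0;   /* false positive possible on collision */
}

/*
 * Verified match: after the checksums agree, read the previously
 * written block back and compare the actual data.  This removes the
 * collision risk at the cost of an extra read.
 */
static int match_verified(const uint8_t new_data[BLKSZ],
    const uint8_t a[32], const uint8_t b[32],
    int (*read_back)(uint8_t old_data[BLKSZ]))
{
    uint8_t old_data[BLKSZ];

    if (!match_by_checksum(a, b))
        return 0;
    if (read_back(old_data) != 0)
        return 0;                   /* can't verify: treat as novel */
    return memcmp(new_data, old_data, BLKSZ) == 0;
}

/* Tiny usage example with a dummy read_back that returns zeroed data. */
static int dummy_read_back(uint8_t old_data[BLKSZ]) {
    memset(old_data, 0, BLKSZ);
    return 0;
}

int main(void) {
    uint8_t blk[BLKSZ] = {0};
    uint8_t ck[32] = {0};
    return match_verified(blk, ck, ck, dummy_read_back) ? 0 : 1;
}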

Nico