Matthew Ahrens wrote:
> A new '-D' option to 'zfs send' is proposed. This option will cause
> dedup processing to be performed on the data being written to a send
> stream. Dedup processing is optional because it isn't always appropriate
> (some kinds of data have very little duplication) and it has significant
> costs: the checksumming required to detect duplicate blocks is
> CPU-intensive and the data that must be maintained while the stream
> is being processed can occupy a very large amount of memory.
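[For illustration, here is a minimal sketch of the mechanism the proposal describes, not the actual zfs send code: the function names, the table layout, and the weak 64-bit FNV-1a checksum are all hypothetical stand-ins. Each block is checksummed, and a table of checksums already seen decides whether to emit the block's data or a short reference to an earlier identical block. The two costs in the proposal map directly onto the two comments below: the checksum computation is the CPU cost, and the table of remembered checksums is the memory cost.]

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    uint64_t      cksum;       /* checksum of a block already sent */
    uint64_t      stream_off;  /* where that block appeared in the stream */
    struct entry *next;
} entry_t;

#define BUCKETS 65536
static entry_t *table[BUCKETS];

/* Illustrative stand-in for a real, collision-resistant checksum. */
static uint64_t checksum_block(const void *data, size_t len) {
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

static entry_t *lookup(uint64_t c) {
    for (entry_t *e = table[c % BUCKETS]; e != NULL; e = e->next)
        if (e->cksum == c)
            return (e);
    return (NULL);
}

static void remember(uint64_t c, uint64_t stream_off) {
    entry_t *e = malloc(sizeof (*e));
    e->cksum = c;
    e->stream_off = stream_off;
    e->next = table[c % BUCKETS];
    table[c % BUCKETS] = e;
}

/* Called for each block headed into the send stream. */
static void send_block(const void *data, size_t len, uint64_t stream_off) {
    uint64_t c = checksum_block(data, len);     /* the CPU cost */
    entry_t *dup = lookup(c);
    if (dup != NULL) {
        printf("offset %llu: reference to earlier block at %llu\n",
            (unsigned long long)stream_off,
            (unsigned long long)dup->stream_off);
    } else {
        remember(c, stream_off);                /* the memory cost */
        printf("offset %llu: full %zu-byte write record\n",
            (unsigned long long)stream_off, len);
    }
}

int main(void) {
    char a[512], b[512];
    memset(a, 'A', sizeof (a));
    memset(b, 'B', sizeof (b));
    send_block(a, sizeof (a), 0);      /* new data: full record */
    send_block(b, sizeof (b), 512);    /* new data: full record */
    send_block(a, sizeof (a), 1024);   /* duplicate: short reference */
    return (0);
}
```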
"Must" seems a little strong. As it's just an optimization, throwing away old checksums if you have a large number of new ones to store -- and thus possibly sending some things uncompressed that could have been compressed if you'd had infinite memory -- seems like a plausible trade-off to avoid using "very large" amounts of memory. Moreover, if you find that you're seeing a lot of novel checksums (and thus using up a lot of memory), then that also implies that you're not getting much compression bang for the buck, and you might want to disable compression on the fly. (Many stream processing compressors do something like this; disabling the compressor at least temporarily if the compression ratio drops below some set limit.) How often does the user know for certain whether the undocumented data stream actually has a lot or a little duplicated data blocks? -- James Carlson 42.703N 71.076W <carlsonj at workingcode.com>