Matthew Ahrens wrote:
> A new '-D' option to 'zfs send' is proposed. This option will cause
> dedup processing to be performed on the data being written to a send
> stream. Dedup processing is optional because it isn't always appropriate
> (some kinds of data have very little duplication) and it has significant
> costs: the checksumming required to detect duplicate blocks is
> CPU-intensive and the data that must be maintained while the stream
> is being processed can occupy a very large amount of memory.
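[For illustration, here is a minimal sketch of the mechanism the proposal describes, not the actual zfs send code: the function names, the table layout, and the weak 64-bit FNV-1a checksum are all hypothetical stand-ins. Each block is checksummed, and a table of checksums already seen decides whether to emit the block's data or a short reference to an earlier identical block. The two costs in the proposal map directly onto the two comments below: the checksum computation is the CPU cost, and the table of remembered checksums is the memory cost.]

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    uint64_t      cksum;       /* checksum of a block already sent */
    uint64_t      stream_off;  /* where that block appeared in the stream */
    struct entry *next;
} entry_t;

#define BUCKETS 65536
static entry_t *table[BUCKETS];

/* Illustrative stand-in for a real, collision-resistant checksum. */
static uint64_t checksum_block(const void *data, size_t len) {
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

static entry_t *lookup(uint64_t c) {
    for (entry_t *e = table[c % BUCKETS]; e != NULL; e = e->next)
        if (e->cksum == c)
            return (e);
    return (NULL);
}

static void remember(uint64_t c, uint64_t stream_off) {
    entry_t *e = malloc(sizeof (*e));
    e->cksum = c;
    e->stream_off = stream_off;
    e->next = table[c % BUCKETS];
    table[c % BUCKETS] = e;
}

/* Called for each block headed into the send stream. */
static void send_block(const void *data, size_t len, uint64_t stream_off) {
    uint64_t c = checksum_block(data, len);     /* the CPU cost */
    entry_t *dup = lookup(c);
    if (dup != NULL) {
        printf("offset %llu: reference to earlier block at %llu\n",
            (unsigned long long)stream_off,
            (unsigned long long)dup->stream_off);
    } else {
        remember(c, stream_off);                /* the memory cost */
        printf("offset %llu: full %zu-byte write record\n",
            (unsigned long long)stream_off, len);
    }
}

int main(void) {
    char a[512], b[512];
    memset(a, 'A', sizeof (a));
    memset(b, 'B', sizeof (b));
    send_block(a, sizeof (a), 0);      /* new data: full record */
    send_block(b, sizeof (b), 512);    /* new data: full record */
    send_block(a, sizeof (a), 1024);   /* duplicate: short reference */
    return (0);
}
```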
"Must" seems a little strong. As it's just an optimization, throwing away old checksums if you have a large number of new ones to store -- and thus possibly sending some things uncompressed that could have been compressed if you'd had infinite memory -- seems like a plausible trade-off to avoid using "very large" amounts of memory. Moreover, if you find that you're seeing a lot of novel checksums (and thus using up a lot of memory), then that also implies that you're not getting much compression bang for the buck, and you might want to disable compression on the fly. (Many stream processing compressors do something like this; disabling the compressor at least temporarily if the compression ratio drops below some set limit.) How often does the user know for certain whether the undocumented data stream actually has a lot or a little duplicated data blocks? -- James Carlson 42.703N 71.076W <carlsonj at workingcode.com>