On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
> I like the in-memory dedup backend. It's lightweight, only a heuristic,
> does not need any IO or persistent storage. OTOH I consider it a subpart
> of the in-band deduplication that does all the persistency etc. So I
> treat the ioctl interface from a broader aspect.
>
> A usecase I find interesting is to keep the in-memory dedup cache and
> then flush it to disk on demand, compared to automatically synced dedup
> (eg. at commit time).

The tradeoff depends on a lot of parameters, like your expected dup rate,
memory size, and seek latency. If the dup rate is high (say 40%) and your
seek latency is high (low-cost spinning rust) and you don't have enough
RAM to load the whole hash table into memory, then an on-disk dedup cache
_itself_ creates an unusable I/O load. Hash table lookups generate random
I/O, and at 40% dup rate every other block you write requires a
performance-crippling disk seek to read the half of the cache that isn't
in RAM.

I looked at my parameters and concluded that an in-memory cache (with
persistence by saving the data at regular intervals) was the *only* kind
of cache I'd ever be able to use with any dedup implementation.

If the dup rate is lower or you're using SSD then you might trade some
IOs for more free RAM and consider an on-disk hash table with some sort
of paging scheme.

If you have huge amounts of RAM you don't need an on-disk scheme at
all--you can add persistence e.g. by trickle-writing the data over the
space of an hour to avoid adding a lot of latency or memory pressure at
once.

> > Users can get better dedupe via the ioctl today than with what
> > you propose go in as an experimental feature so I don't see many people
> > caring to test it. IMHO you would have to provide a more compelling reason
> > to include this code.
>
> I see it as a complementary feature in the deduplication capabilities,
> covering more usecases.

If you have unlimited amounts of RAM, fast CPU, and slow disks, then it
certainly makes sense, even with the SHA256 hash. That seems to be the
use case ZFS was designed for.
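
To put rough numbers on the seek-cost argument above, here is a
back-of-envelope sketch in Python. Everything in it is an assumption for
illustration, not a measurement: the function name, the 4 KiB block size,
and the idea that each written block costs exactly one hash lookup which
misses RAM in proportion to the part of the table that is not resident.
It ignores the dup rate, lookup locality, and any filter placed in front
of the table.

```python
def lookup_bound_write_rate(resident_fraction, seek_latency_s, block_size=4096):
    """Upper bound on write throughput (bytes/s) imposed by hash lookups.

    Hypothetical model: one lookup per written block, and a lookup costs
    one disk seek whenever the needed part of the table is not in RAM.
    """
    if not 0.0 <= resident_fraction <= 1.0:
        raise ValueError("resident_fraction must be between 0 and 1")
    miss_probability = 1.0 - resident_fraction
    if miss_probability == 0.0:
        # Whole table in RAM: lookups add no seeks in this model.
        return float("inf")
    seconds_per_block = miss_probability * seek_latency_s
    return block_size / seconds_per_block


if __name__ == "__main__":
    # Assumed latencies: ~10 ms per seek for a spinning disk, ~0.1 ms for an SSD.
    for name, latency in (("spinning disk", 0.010), ("SSD", 0.0001)):
        rate = lookup_bound_write_rate(0.5, latency)
        print(f"{name}: about {rate / 1e6:.2f} MB/s with half the table in RAM")
```

With half the table in RAM and an assumed 10 ms seek, this model caps
writes at under 1 MB/s on spinning disks, while the assumed SSD latency
gives a cap about a hundred times higher. That is the gap behind the
"in-memory only on slow disks, maybe paged on SSD" conclusion above.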