On Tue, Aug 13, 2013 at 11:04:31PM -0300, Rogério Brito wrote:
> On Tue, Aug 13, 2013 at 7:06 AM, Stefano Zacchiroli <z...@debian.org> wrote:
> > Heya,
> > for http://sources.debian.net (and to implement pabs' crazy ideas ---
> > it's always pabs' fault!) I'm playing with btrfs deduplication [1].
>
> Geee, I really miss not being able to attend this year's DebConf, as
> deduplication is something that is very dear to my heart.
>
> As soon as btrfs supports deduplication (offline is better for my use
> case), I will seriously consider switching.
There are two ways:

* a hacky way. It works with existing kernels but has some quirks; I asked
  for its inclusion in #703169 (the problems are listed there).

* a nice and clean way. The kernel interface would need to be: "hey kernel,
  I think the block at fd 1 offset 0 might be the same as a block at fd 2
  offset 4096, care to compare and perhaps combine them?". There's no
  problem with races or with locking to avoid them; userland could use
  hashes as short as 64 bits [1], etc. I think someone even posted a patch,
  but it got lost in a flamewar between developers.

> Online deduplication, at least with ZFS, was catastrophic in my experience

Online (i.e., at write time) deduplication has some benefits in theory, but
it comes with MASSIVE memory/disk costs. Even with a large block size, it
takes several GB of memory for a not-so-big filesystem. There are so many
better ways to use that memory.

Offline deduplication (a confusing name: the filesystem is mounted, the
dedup just happens at a later time) costs you a read + a write, but it can
be done at your leisure, perhaps one-shot, perhaps incrementally by saving
the hash table. It can be done with short hashes, and it uses no memory the
rest of the time.

[1] It could be done with a traditional hash table, but here's my
algorithm:

Blocks need to be indexable by integers from a limited range. Let's say a
2TB disk with 4KB blocks -> 2^29 slots. A hash table with in-table chains
(+1, +i^2, double hashing, etc.) needs no extra memory (unlike one with
bucket lists), but it must know the maximum occupancy beforehand (the
number of blocks in the filesystem), and it degrades once it's more than
~80% full. With 64-bit entries, that costs us 5GB (if you can't afford
that, use bigger blocks; that would still be much better than ZFS not
working with files smaller than 128KB).

Calculate your favourite hash for every block on the disk. It needs to
include a secret salt to prevent DoS attacks. Take a part of the block
hash to be the index into our hash table (29 bits here).
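To make the scheme concrete, here is a rough sketch in Python of the
salted-hash, in-table-chain idea. This is my own illustration, not any real
btrfs tool: the table geometry is scaled down (12 index bits and 16 extra
bits instead of 29 and 35), blake2b stands in for "your favourite hash",
and linear +1 probing stands in for the chain variants.

```python
import hashlib
import os

BLOCK_SIZE = 4096      # 4 KB blocks, as in the mail
INDEX_BITS = 12        # the mail uses 29 bits (2 TB / 4 KB); scaled down here
EXTRA_BITS = 16        # the mail stores 35 extra hash bits; scaled down here
NSLOTS = 1 << INDEX_BITS
SALT = os.urandom(16)  # secret salt, so an attacker can't precompute collisions

def block_hash(data):
    """Salted 64-bit hash of one block ("your favourite hash" in the mail)."""
    digest = hashlib.blake2b(data, key=SALT, digest_size=8).digest()
    return int.from_bytes(digest, "big")

def dedup_candidates(blocks):
    """Yield (block_no, earlier_block_no) pairs worth comparing on disk.

    Open addressing with +1 probing stands in for the "+1, +i^2, double
    hashed" in-table chains; each slot holds one small entry and there
    are no bucket lists, so no extra memory beyond the table itself.
    """
    table = [None] * NSLOTS
    for blkno, data in enumerate(blocks):
        h = block_hash(data)
        idx = h >> (64 - INDEX_BITS)                 # top bits: table index
        extra = (h >> (64 - INDEX_BITS - EXTRA_BITS)) & ((1 << EXTRA_BITS) - 1)
        while table[idx] is not None:
            stored_blkno, stored_extra = table[idx]
            if stored_extra == extra:
                # Potential duplicate: a real tool would now ask the
                # kernel to compare (and perhaps combine) the two blocks.
                yield blkno, stored_blkno
                break
            idx = (idx + 1) % NSLOTS                 # follow the chain
        else:
            table[idx] = (blkno, extra)
```

For example, `list(dedup_candidates([b"a" * 4096, b"b" * 4096, b"a" * 4096]))`
flags block 2 as a candidate duplicate of block 0. A false positive here
only costs one extra on-disk comparison in the kernel, which is why short
hashes are acceptable.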
The entry consists of the block's index and 35 other bits of the block
hash. If the number of blocks is not a power of two and you like to show
off, you can even store a partial bit (how: an exercise for the reader :)).

If you get a collision, check those 35 bits of the hash. If they match, you
have a potential match, and you can tell the kernel to compare the blocks
on the disk (one will already be in the page cache). If they don't, you
follow the hash chain. The number of false positives should be acceptable
until the disk size reaches well into the hundreds of TB (with 4KB blocks).
Enlarging the block size gives you more wiggle room (at the cost of finding
fewer duplicates). But hey, we can use 65-bit entries!

The hash table can be saved to the disk, to be reused by an incremental
run. Another modification would be doing multiple passes, giving every pass
1/N of the hash space and discarding hashes outside that range. End result:
1/N the speed, but also 1/N the memory needed.

Using a binary tree instead of a hash table would increase memory costs
(say, 14 bytes instead of 8 per entry) but would allow saving parts of the
tree to the disk (with linear access!), allowing less computationally
demanding multi-pass runs.

See how much fun we can have with data structures? And best of all, the
kernel needs just a single syscall, with all the complexity done in
userspace.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

_______________________________________________
Debconf-discuss mailing list
Debconf-discuss@lists.debconf.org
http://lists.debconf.org/mailman/listinfo/debconf-discuss
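P.S. For concreteness, a small sketch (mine, with made-up helper names) of
the 64-bit entry layout (29-bit block number + 35 hash bits) and the 1/N
multi-pass slicing of the hash space described above:

```python
BLOCK_BITS = 29                  # block numbers on a 2 TB disk / 4 KB blocks
EXTRA_BITS = 35                  # remaining hash bits kept in the entry
EXTRA_MASK = (1 << EXTRA_BITS) - 1

def pack_entry(blkno, extra):
    """One 64-bit word: 29-bit block number + 35 extra hash bits."""
    assert 0 <= blkno < (1 << BLOCK_BITS) and 0 <= extra <= EXTRA_MASK
    return (blkno << EXTRA_BITS) | extra

def unpack_entry(entry):
    return entry >> EXTRA_BITS, entry & EXTRA_MASK

def in_this_pass(block_hash, npasses, this_pass):
    """Multi-pass variant: pass k keeps only hashes in slice k of the
    64-bit hash space, so each pass needs 1/N the table memory."""
    lo = this_pass * (1 << 64) // npasses
    hi = (this_pass + 1) * (1 << 64) // npasses
    return lo <= block_hash < hi
```

The slices partition the hash space exactly, so every block is considered
in exactly one pass and no duplicates within a slice are missed.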