On 12/8/16 10:42 AM, Austin S. Hemmelgarn wrote:
> On 2016-12-08 10:11, Swâmi Petaramesh wrote:
>> Hi, some real-world figures about running duperemove deduplication on
>> BTRFS:
>>
>> I have an external 2.5", 5400 RPM, 1 TB USB3 HD, on which I store the
>> BTRFS backups (full rsync) of 5 PCs running 2 different distros,
>> typically at the same update level, and all of them more or less
>> sharing all or part of the same set of user files.
>>
>> For each of these PCs I keep a series of 4-5 BTRFS subvolume
>> snapshots, so I have complete backups from different points in time.
>>
>> The HD was 93% full and made a good testbed for deduplication.
>>
>> So I ran duperemove on this HD, using a hashfile, on a machine doing
>> "only this". The machine is an Intel i5 with 6 GB of RAM.
>>
>> Well, the damn thing had been running for 15 days uninterrupted!
>> ...Until I Ctrl-C'd it this morning, as I had to take the machine with
>> me (I wasn't expecting it to last THAT long...).
>>
>> It took about 48 hours just to calculate the file hashes.
>>
>> Then it took another 48 hours just for "loading the hashes of duplicate
>> extents".
>>
>> Then it took 11 days deduplicating until I killed it.
>>
>> At the end, the disk that was 93% full is now 76% full, so I saved 17%
>> of 1 TB (170 GB) by deduplicating for 15 days.
>>
>> Well, the thing "works" and my disk isn't full anymore, so that's a
>> very partial success, but I still wonder if the gain is worth the
>> effort...
> So, some general explanation here:
> Duperemove hashes data in blocks of 128kB by default, which means that
> for ~930GB you've got about 7,618,560 blocks to hash, which partly
> explains why the hashing took so long.  Once that's done, it has to
> compare hashes across all combinations of those blocks, which comes to
> 58,042,456,473,600 comparisons (hence that stage taking a long time).
> The block size is thus a trade-off between hashing performance and
> actual space savings (a smaller block size makes hashing take longer,
> but gives slightly better deduplication results overall).
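
For scale, the arithmetic behind those figures is easy to reproduce.  A
quick back-of-envelope sketch (Python; the ~930 GiB of data and the
128kB default block size are taken from the numbers above, not from
duperemove's actual accounting):

# Rough reproduction of the figures quoted above; 930 GiB and the 128kB
# default block size come from the mail, the rest is derived from them.
data_bytes = 930 * 1024**3            # ~930 GiB of data to scan
block_size = 128 * 1024               # duperemove's default block size

blocks = data_bytes // block_size     # blocks to checksum
all_pairs = blocks * blocks           # worst-case all-pairs comparisons

print(f"{blocks:,} blocks to hash")              # 7,618,560
print(f"{all_pairs:,} worst-case comparisons")   # 58,042,456,473,600

Halving the block size doubles the number of hashes and roughly
quadruples that worst-case comparison count, which is the trade-off
described above.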

IIRC, the core of the duperemove duplicate matcher isn't an O(n^2)
algorithm.  I think Mark used a bloom filter to reduce the data set
prior to matching, but I haven't looked at the code in a while.
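
Either way, the shape of that approach is easy to sketch.  Something
along these lines (illustrative Python only, not duperemove's actual
code): run every block hash through a Bloom filter on a first pass and
keep only the hashes that look like repeats, so the exact matching
stage never sees the (usually large) majority of unique blocks.

# Illustrative sketch only -- not duperemove's implementation -- of
# using a Bloom filter to shrink the candidate set before exact
# duplicate matching.
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m_bits=1 << 24, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k independent bit positions by prefixing the item with
        # the hash-function index before hashing it.
        for i in range(self.k):
            digest = hashlib.blake2b(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def candidate_hashes(block_hashes):
    """First pass: keep only hashes (probably) seen more than once."""
    seen = BloomFilter()
    candidates = set()
    for h in block_hashes:
        if h in seen:            # probable repeat (false positives possible)
            candidates.add(h)
        else:
            seen.add(h)
    return candidates

A Bloom filter can return false positives, so the odd unique block may
slip into the candidate set, but it never drops a real duplicate, and
the exact matching stage afterwards weeds out the false positives.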

-Jeff

-- 
Jeff Mahoney
SUSE Labs
