On 01/05/2011 09:46 PM, Gordan Bobic wrote:
On 01/05/2011 07:46 PM, Josef Bacik wrote:

Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is better.


If I may give my opinion, I wouldn't want dedup to be enabled online for the whole filesystem.

Three reasons:

1- Virtual machine disk images should not get deduplicated, IMHO, if you care about performance, because fragmentation matters more in that case. So offline dedup is preferable, or at least online dedup should happen only on configured paths.

2- I don't want performance to drop all the time. I would run dedup periodically during less active hours, hence offline. A rate limiter should also be implemented so as not to thrash the drives too much. A stop/continue mechanism should also be implemented, so that a dedup pass which couldn't finish within a certain time frame (e.g. one night) can continue the following night without restarting from the beginning.

3- Only some directories should be deduped, for performance reasons. You can foresee where duplicate blocks are likely to exist and where they are not: typically backup directories, or mail server directories. The rest is probably a waste of time. (A rough sketch of such a directory-scoped, rate-limited pass follows below.)
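To make points 2 and 3 concrete, here is a very rough userspace sketch (Python, purely illustrative: the paths, the 4 KiB block size and the 50 MB/s limit are assumptions of mine, and a real tool would of course work through the kernel on extents rather than re-reading files like this). It only collects per-block hashes for the configured directories while throttling reads; the checkpoint/continue part is left out.

import hashlib
import os
import time

CONFIGURED_PATHS = ["/srv/backups", "/var/mail"]   # dedup only these trees (made-up paths)
BLOCK_SIZE = 4096                                  # filesystem block size
MAX_BYTES_PER_SEC = 50 * 1024 * 1024               # crude rate limit

class Throttle:
    """Naive rate limiter so the pass does not thrash the drives."""
    def __init__(self, max_bps):
        self.max_bps = max_bps
        self.start = time.monotonic()
        self.done = 0

    def account(self, nbytes):
        self.done += nbytes
        ahead = self.done / self.max_bps - (time.monotonic() - self.start)
        if ahead > 0:
            time.sleep(ahead)

def blocks(path):
    """Yield (offset, sha256 digest) for every full block of one file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                return                      # ignore the short tail block
            yield offset, hashlib.sha256(block).digest()
            offset += BLOCK_SIZE

def scan(paths):
    throttle = Throttle(MAX_BYTES_PER_SEC)
    seen = {}                               # digest -> first (path, offset)
    for root in paths:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                try:
                    for offset, digest in blocks(full):
                        throttle.account(BLOCK_SIZE)
                        if digest in seen:
                            print("candidate:", (full, offset), "==", seen[digest])
                        else:
                            seen[digest] = (full, offset)
                except OSError:
                    continue

if __name__ == "__main__":
    scan(CONFIGURED_PATHS)

The "candidate" pairs would then still have to be verified byte for byte before anything is actually merged (more on that below).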

Dedup isn't for an average desktop user. Dedup is for backup storage and virtual images.

Not virtual images imho, for the reason above.

Also, the OS is small even if it is identical across multiple virtual images; how much space is it going to occupy anyway? Usually less than 5GB per disk image. And that's the only thing that would be deduped, because the data is likely to be different on each instance. How many VMs do you have running? 20? That's at most 100GB saved, one time, at the cost of a lot of fragmentation.

Now if you back up those images periodically, that's a place where I would run dedup.

I'd just make it always use the fs block size. No point in making it variable.

Agreed. What would be the reason for a variable block size?

And then let's bring up the fact that you _have_ to manually compare any data you are going to dedup. I don't care if you think you have the greatest hashing algorithm known to man, you are still going to have collisions somewhere at some point, so in order to make sure you don't lose data, you have to manually memcmp
the data.

Totally agreed

So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write. That
is going to make your write performance _suck_.

IIRC, this is configurable in ZFS so that you can switch off the physical block comparison. If you use SHA256, the probability of a collision (unless SHA is broken, in which case we have much bigger problems) is about 1 in 2^128. At 4KB per block, that is one collision in roughly 10^24 exabytes. That's one trillion trillion (that's double trillion) exabytes.

I like mathematics, but I don't care this time. I would never enable dedup without a full block compare. I think most users and most companies would do the same.

If there is a full block compare, a simpler/faster algorithm could be chosen, like MD5. Or even a 64-bit MD, which I don't think exists as such, but you can take MD4 and XOR the first 8 bytes with the second 8 bytes to reduce it to 8 bytes. This is just because it saves about 60% of the RAM used during dedup, which is expected to be large, and collisions are still insignificant at 64 bits. Clearly you then need to do the full block compare afterwards.
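For example (a minimal sketch; I use MD5 here because Python's hashlib exposes it directly, while MD4 typically needs OpenSSL's legacy provider -- the XOR folding is the same either way):

import hashlib

def folded_digest64(block: bytes) -> bytes:
    """128-bit MD5 of the block, XOR-folded down to 8 bytes."""
    d = hashlib.md5(block).digest()            # 16 bytes
    return bytes(a ^ b for a, b in zip(d[:8], d[8:]))

Back of the envelope: with 2^30 blocks (4TB at 4KB per block) and 64-bit digests, the expected number of accidental matches is about 2^59 / 2^64, i.e. well under one, and the full compare catches whatever does slip through.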

BTW, if you want to allow (as an option) dedup without a full block compare, SHA-1 is not so good: SHA-0 already had problems, and now SHA-1 has problems too; I would almost not suggest it for cryptographically sensitive uses, looking ahead. Use RIPEMD-160, or even better RIPEMD-256, which is also faster according to http://www.cryptopp.com/benchmarks.html. The RIPEMDs are much better algorithms than the SHAs in this respect: they have no known weaknesses. Note that deduplication IS a cryptographically sensitive matter, because if SHA-1 is cracked, people can nuke (or maybe even alter, and with this, escalate privileges on) other users' files by providing blocks with the same SHA and waiting for dedup to run.
Same thing for AES, BTW: it is showing weaknesses; use Blowfish or Twofish.
SHA-1 and AES are two flawed standards...

Dedup without a full block compare does seem suited to online dedup (which I wouldn't enable, now for one more reason), because with full block compares performance would really suck. But please keep the full block compare for offline dedup.
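In other words, for the offline pass the hash should only ever nominate candidates, roughly like this (sketch again; read_block and dedup_blocks are placeholders for whatever the real tool would use, e.g. some clone/extent-same style operation -- not an existing API):

BLOCK_SIZE = 4096

def read_block(path, offset):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(BLOCK_SIZE)

def maybe_dedup(a, b, dedup_blocks):
    """a and b are (path, offset) pairs whose digests matched."""
    block_a = read_block(*a)
    block_b = read_block(*b)
    if block_a == block_b:        # the mandatory memcmp step
        dedup_blocks(a, b)        # placeholder: share the extent, drop one copy
        return True
    return False                  # genuine hash collision: keep both copies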

Also, I could suggest a third type of deduplication, but this is harder... a file-level deduplication which works like xdelta, i.e. capable of recognizing pieces of identical data in two files even when they are not at the same offset and not aligned to a block boundary. For this, a rolling hash like rsync's, or the xdelta 3.0 algorithm, could be used. For this to work I suppose Btrfs would need to handle padding of filesystem blocks... which I'm not sure was foreseen.
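Rough idea of the weak rolling checksum part (toy code modelled on rsync's weak checksum; a real implementation would confirm every weak match with a strong hash plus a full byte compare):

class RollingChecksum:
    MOD = 1 << 16

    def __init__(self, window: bytes):
        self.window_len = len(window)
        self.a = sum(window) % self.MOD
        self.b = sum((self.window_len - i) * byte
                     for i, byte in enumerate(window)) % self.MOD

    def digest(self) -> int:
        return (self.b << 16) | self.a

    def roll(self, outgoing: int, incoming: int) -> int:
        """Slide the window one byte: drop `outgoing`, append `incoming`."""
        self.a = (self.a - outgoing + incoming) % self.MOD
        self.b = (self.b - self.window_len * outgoing + self.a) % self.MOD
        return self.digest()

The point is that roll() is O(1) per byte, so you can slide the window over a whole file and look each digest up in a table of per-block checksums of the other file, which is what lets you find equal runs that are not block-aligned.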


Above in this thread you said:
The _only_ reason to defer deduping is that hashing costs CPU time. But the chances are that a modern CPU core can churn out MD5 and/or SHA256 hashes faster than a modern mechanical disk can keep up. A 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with.

A normal 1TB disk with platters can do 130MB/sec sequential, no problem.
An SSD can do more like 200MB/sec write and 280MB/sec read, sequential or random, and is actually limited only by SATA 3.0Gbit/sec; soon enough they will have SATA/SAS 6.0Gbit/sec. More cores can be used for hashing, but a multicore implementation of something that is not naturally threaded (as opposed to, say, parallel and completely separate queries to a DB) is usually very difficult to do well. E.g. it was attempted recently for MD RAID parity computation by knowledgeable people, but it performed so much worse than single-core that it was disabled.
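The single-core hashing side is easy to sanity-check on any box, e.g. (Python, so treat the number as a lower bound compared to an optimized C implementation):

import hashlib
import time

BLOCK_SIZE = 4096
TOTAL = 512 * 1024 * 1024          # hash 512 MiB worth of blocks
block = b"\xab" * BLOCK_SIZE

start = time.monotonic()
for _ in range(TOTAL // BLOCK_SIZE):
    hashlib.sha256(block).digest()
elapsed = time.monotonic() - start
print("SHA256: %.0f MB/s on one core" % (TOTAL / elapsed / 1e6))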
