Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted:

> For example, my 'large' duperemove test involves about 750 gigabytes
> of general purpose data - quite literally /home off my workstation.
>
> After the run I'm usually seeing between 65-75 gigabytes saved for a
> total of only 10% duplicated data. I would expect this to be fairly
> 'average' - /home on my machine has the usual stuff - documents,
> source code, media, etc.
>
> So if you were writing your whole fs out you could expect about the
> same from inline dedupe - 10%-ish. Let's be generous and go with that
> number though as a general 'this is how much dedupe we get'.
>
> What the memory backend is doing then is providing a cache of
> sha256/block calculations. This cache is very expensive to fill, and
> every written block must go through it. On top of that, the cache
> does not persist between mounts, and has items regularly removed from
> it when we run low on memory. All of this will drive down the amount
> of duplicated data we can find.
>
> So our best case savings is probably way below 10% - let's be
> _really_ nice and say 5%.
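Just to make sure we're all picturing the same mechanism: as I read
the above, the memory backend is essentially a bounded, lossy,
in-memory lookaside table keyed by the sha256 of each written block.
Below is a toy userspace sketch of that behaviour. To be clear, this
is NOT the actual kernel code; the fixed-slot table, the
overwrite-on-collision eviction and every name in it are my own
invention, purely to illustrate why a bounded, non-persistent cache
leaves duplicates on the table:

/*
 * Toy model of a bounded, in-memory dedupe hash cache. Not btrfs
 * code, just an illustration: every written block is hashed, looked
 * up in a fixed-size table, and either matched (a dedupe
 * opportunity) or inserted, possibly evicting an older entry. The
 * table lives only as long as the process, mirroring the lack of
 * persistence across mounts.
 *
 * Build (with OpenSSL installed):  cc toy_dedupe.c -lcrypto
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

#define BLOCK_SIZE   4096        /* pretend filesystem block size */
#define CACHE_SLOTS  (1 << 16)   /* bounded cache: ~64k entries   */

struct cache_entry {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        uint64_t      blocknr;   /* where we first saw this data  */
        int           in_use;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Map a digest onto a slot index. */
static unsigned int slot_for(const unsigned char *digest)
{
        unsigned int slot;

        memcpy(&slot, digest, sizeof(slot));
        return slot % CACHE_SLOTS;
}

/*
 * Returns the block number of a previously seen identical block, or
 * UINT64_MAX if this data is new, or its entry was already evicted -
 * a miss we simply live with, which is what drives the hit rate down.
 */
static uint64_t dedupe_lookup_or_insert(const void *data, uint64_t blocknr)
{
        unsigned char digest[SHA256_DIGEST_LENGTH];
        struct cache_entry *ent;

        SHA256(data, BLOCK_SIZE, digest);
        ent = &cache[slot_for(digest)];

        if (ent->in_use && !memcmp(ent->digest, digest, sizeof(digest)))
                return ent->blocknr;    /* duplicate found */

        /* Miss: overwrite the slot, silently evicting what was there. */
        memcpy(ent->digest, digest, sizeof(digest));
        ent->blocknr = blocknr;
        ent->in_use  = 1;
        return UINT64_MAX;
}

int main(void)
{
        unsigned char block[BLOCK_SIZE] = { 0 };
        uint64_t hit;

        /* Write the "same" block twice; the second should hit. */
        dedupe_lookup_or_insert(block, 1);
        hit = dedupe_lookup_or_insert(block, 2);
        printf("duplicate of block %llu\n", (unsigned long long)hit);
        return 0;
}

Every miss, whether because the data really is new or because the
matching entry was already evicted, is a duplicate the inline pass
will never find again until an offline pass comes along, which is
exactly the effect behind your "way below 10%" estimate.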
My understanding is that this "general purpose data" use-case isn't
being targeted by the in-memory dedup at all, because indeed it's a
very poor fit, for exactly the reason you explain. Instead, think data
centers where perhaps 50% of all files are duplicated thousands of
times over... and it's exactly those files that are most frequently
used. That's a totally different use-case, where the 5% you'd see on
general purpose data could easily skyrocket to 50%+.

Refining that a bit, as I understand it, the idea with the in-memory
inline dedup is pretty much opportunity-based dedup: where an easy
opportunity presents itself, grab it, but don't go out of your way to
do anything fancy. Then, somewhat later, a much more thorough offline
dedup pass comes along and dedup-packs everything else.

In that scenario a quick-opportunity hit rate of 20% may well be
acceptable, while actual hit rates may approach 50% due to the skew
toward the most common (and thus most duplicated) files. The
dedup-packer then comes along and finishes the job, possibly bringing
total savings to, say, 70% or so.

Even if the inline in-memory dedup doesn't get that common-skew boost
and ends up nearer 20%, that's still a significant saving for the
initial inline pass, with the dedup-packer coming along later to clean
things up properly.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman