Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted:

> For example, my 'large' duperemove test involves about 750 gigabytes
> of general purpose data - quite literally /home off my workstation.
>
> After the run I'm usually seeing between 65-75 gigabytes saved for a
> total of only 10% duplicated data. I would expect this to be fairly
> 'average' - /home on my machine has the usual stuff - documents,
> source code, media, etc.
>
> So if you were writing your whole fs out you could expect about the
> same from inline dedupe - 10%-ish. Let's be generous and go with that
> number though as a general 'this is how much dedupe we get'.
>
> What the memory backend is doing then is providing a cache of
> sha256/block calculations. This cache is very expensive to fill, and
> every written block must go through it. On top of that, the cache
> does not persist between mounts, and has items regularly removed from
> it when we run low on memory. All of this will drive down the amount
> of duplicated data we can find.
>
> So our best case savings is probably way below 10% - let's be
> _really_ nice and say 5%.
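Just to make sure we're all picturing the same mechanism: as I read
the above, the memory backend is essentially a bounded, lossy,
in-memory lookaside table keyed by the sha256 of each written block.
Below is a toy userspace sketch of that behaviour. To be clear, this
is NOT the actual kernel code; the fixed-slot table, the
overwrite-on-collision eviction and every name in it are my own
invention, purely to illustrate why a bounded, non-persistent cache
leaves duplicates on the table:

/*
 * Toy model of a bounded, in-memory dedupe hash cache. Not btrfs
 * code, just an illustration: every written block is hashed, looked
 * up in a fixed-size table, and either matched (a dedupe
 * opportunity) or inserted, possibly evicting an older entry. The
 * table lives only as long as the process, mirroring the lack of
 * persistence across mounts.
 *
 * Build (with OpenSSL installed):  cc toy_dedupe.c -lcrypto
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

#define BLOCK_SIZE   4096        /* pretend filesystem block size */
#define CACHE_SLOTS  (1 << 16)   /* bounded cache: ~64k entries   */

struct cache_entry {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        uint64_t      blocknr;   /* where we first saw this data  */
        int           in_use;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Map a digest onto a slot index. */
static unsigned int slot_for(const unsigned char *digest)
{
        unsigned int slot;

        memcpy(&slot, digest, sizeof(slot));
        return slot % CACHE_SLOTS;
}

/*
 * Returns the block number of a previously seen identical block, or
 * UINT64_MAX if this data is new, or its entry was already evicted -
 * a miss we simply live with, which is what drives the hit rate down.
 */
static uint64_t dedupe_lookup_or_insert(const void *data, uint64_t blocknr)
{
        unsigned char digest[SHA256_DIGEST_LENGTH];
        struct cache_entry *ent;

        SHA256(data, BLOCK_SIZE, digest);
        ent = &cache[slot_for(digest)];

        if (ent->in_use && !memcmp(ent->digest, digest, sizeof(digest)))
                return ent->blocknr;    /* duplicate found */

        /* Miss: overwrite the slot, silently evicting what was there. */
        memcpy(ent->digest, digest, sizeof(digest));
        ent->blocknr = blocknr;
        ent->in_use  = 1;
        return UINT64_MAX;
}

int main(void)
{
        unsigned char block[BLOCK_SIZE] = { 0 };
        uint64_t hit;

        /* Write the "same" block twice; the second should hit. */
        dedupe_lookup_or_insert(block, 1);
        hit = dedupe_lookup_or_insert(block, 2);
        printf("duplicate of block %llu\n", (unsigned long long)hit);
        return 0;
}

Every miss, whether because the data really is new or because the
matching entry was already evicted, is a duplicate the inline pass
will never find again until an offline pass comes along, which is
exactly the effect behind your "way below 10%" estimate.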
My understanding is that this "general purpose data" use-case isn't
being targeted by the in-memory dedup at all, because indeed it's a
very poor fit, for exactly the reason you explain. Instead, think data
centers where perhaps 50% of all files are duplicated thousands of
times over... and it's exactly those files that are most frequently
used. That's a totally different use-case, where the 5% you'd see on
general purpose data could easily skyrocket to 50%+.

Refining that a bit, as I understand it, the idea with the in-memory
inline dedup is pretty much opportunity-based dedup: where an easy
opportunity presents itself, grab it, but don't go out of your way to
do anything fancy. Then, somewhat later, a much more thorough offline
dedup pass comes along and dedup-packs everything else.

In that scenario a quick-opportunity hit rate of 20% may well be
acceptable, while actual hit rates may approach 50% due to the skew
toward the most common (and thus most duplicated) files. The
dedup-packer then comes along and finishes the job, possibly bringing
total savings to, say, 70% or so.

Even if the inline in-memory dedup doesn't get that common-skew boost
and ends up nearer 20%, that's still a significant saving for the
initial inline pass, with the dedup-packer coming along later to clean
things up properly.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman