On 21/5/2014 1:37 AM, Mark Fasheh wrote:
> On Tue, May 20, 2014 at 01:07:50AM +0300, Konstantinos Skarlatos wrote:
>>> Duperemove will be shipping as supported software in a major SUSE release,
>>> so it will be bug-fixed, etc., as you would expect. At the moment I'm very
>>> busy trying to fix qgroup bugs, so I haven't had much time to add features,
>>> handle external bug reports, and so on. Also, I'm not very good at
>>> advertising my software, which would be why it hasn't really been mentioned
>>> on the list lately :)

>>> I would say the state it's in is that I've gotten the feature set to a
>>> point which feels reasonable, and I've fixed enough bugs that I'd
>>> appreciate folks giving it a spin and providing reasonable feedback.
>> Well, after having good results with duperemove on a few gigs of data, I
>> tried it on a 500GB subvolume. After it scanned all the files, it has been
>> stuck at 100% of one CPU core for about 5 hours and still hasn't done any
>> deduping. My CPU is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so I
>> guess that's not the problem. It looks like the speed of duperemove drops
>> dramatically as the data volume increases.
> Yeah, I doubt it's your CPU. Duperemove is right now targeted at smaller
> data sets (a few VMs, ISO images, etc.) than the one you threw at it, as
> you have undoubtedly figured out. It will need a bit of work before it can
> handle entire file systems. My guess is that it was spending an enormous
> amount of time finding duplicates (it has a very thorough check that could
> probably be optimized).
It finished after 9 or so hours, so I agree it was checking for duplicates. It gets through a few GB in just seconds, so the runtime seems to grow much faster than linearly with data size.
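(A rough, assumed back-of-the-envelope: if the duplicate-finding pass effectively compares every candidate block against every other, the work grows with roughly the square of the data size, so going from a few GB to 500GB is about 100x the data but on the order of 10,000x the comparisons. Tens of seconds times 10,000 lands in the range of hours to a day, which is at least in the same ballpark as the 9 hours seen here.)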

> For what it's worth, handling larger data sets is the type of work I want
> to be doing on it in the future.
I can help with testing :)
I would also suggest announcing on this list any changes you make, so that your program becomes better known among btrfs users, or even sending out a new announcement mail or adding a page to the btrfs wiki.

Finally, I would like to request the ability to do file-level dedup with a reflink. That has the advantage of consuming very little metadata compared to block-level dedup. It could be done as a two-pass dedup: first comparing all the same-sized files, and after that doing your normal block-level dedup.

By the way, does anybody have a good program/script that can do file-level dedup with reflinks and checksum comparison?
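
For illustration, a minimal sketch of that kind of two-pass, file-level dedup could look something like the Python below (hypothetical and untested, not duperemove's actual method; it groups files by size, compares SHA-256 checksums, and shells out to GNU cp --reflink=always, so it only helps on btrfs):

#!/usr/bin/env python3
# Hypothetical sketch: file-level dedup via size grouping, checksum
# comparison and reflink copies. Assumes GNU coreutils cp on btrfs.

import hashlib
import os
import subprocess
import sys
from collections import defaultdict

def checksum(path, bufsize=1 << 20):
    """SHA-256 of a whole file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup_tree(root):
    # Pass 1: group regular files by size; only equal-size files can match.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                by_size[os.path.getsize(p)].append(p)

    # Pass 2: within each size bucket, checksum files and replace
    # duplicates with reflink copies of the first file seen, so the
    # copies share extents instead of duplicating data.
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        first_with = {}  # digest -> path that keeps its own extents
        for p in paths:
            digest = checksum(p)
            if digest not in first_with:
                first_with[digest] = p
                continue
            tmp = p + ".reflink-tmp"
            # Note: this keeps the content but not the duplicate's own
            # timestamps or xattrs; a real tool would preserve those.
            subprocess.run(["cp", "--reflink=always", first_with[digest], tmp],
                           check=True)
            os.rename(tmp, p)
            print("reflinked", p, "->", first_with[digest])

if __name__ == "__main__":
    dedup_tree(sys.argv[1])

Grouping by size first keeps the number of files that actually get checksummed small, and copying to a temporary name before renaming avoids losing a file if cp fails partway.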

Kind regards,
Konstantinos Skarlatos
