On 19/05/14 15:00, Scott Middleton wrote:
On 19 May 2014 09:07, Marc MERLIN <m...@merlins.org> wrote:
On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
I read so much about btrfs that I mistook Bedup for Duperemove.
Duperemove is actually what I am testing.
I'm currently using programs that find files that are the same, and
hardlink them together:
http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html

hardlink.py actually seems to be the fastest (in memory and CPU) one
even though it's in Python.
I can get others to run out of RAM on my 8GB server easily :(
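
For reference, the core of that approach is small enough to sketch here (a minimal illustration, not hardlink.py's actual code; the hash choice and temp-file suffix are mine):

    #!/usr/bin/env python3
    # Minimal sketch of the hash-and-hardlink approach: group files by
    # size, hash only the candidates, then hardlink identical files.
    import hashlib, os, sys
    from collections import defaultdict

    def file_hash(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(chunk), b''):
                h.update(block)
        return h.digest()

    def hardlink_dupes(root):
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
        for paths in by_size.values():
            if len(paths) < 2:
                continue                       # unique size, can't match
            by_hash = defaultdict(list)
            for p in paths:
                by_hash[file_hash(p)].append(p)
            for dupes in by_hash.values():
                keep = dupes[0]
                for p in dupes[1:]:
                    if os.path.samefile(keep, p):
                        continue               # already the same inode
                    tmp = p + '.hardlink-tmp'  # arbitrary temp suffix
                    os.link(keep, tmp)
                    os.replace(tmp, p)         # atomic swap

    if __name__ == '__main__':
        hardlink_dupes(sys.argv[1])

Grouping by size first is what keeps memory use flat here: you only ever hold the hashes for one size class at a time.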

Interesting app.

An issue with hardlinking (though with the backups use-case this problem isn't
likely to happen) is that if you modify a file, the change shows up in every
hardlink to it - including the ones you didn't want changed.

@Marc: Since you've been using btrfs for a while now, I'm sure you've already 
considered whether a reflink copy is the better or worse option.
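
For what it's worth, a reflink copy sidesteps exactly that problem: the copies share extents copy-on-write and diverge as soon as either side is written. With recent coreutils it's just "cp --reflink=always src dst"; programmatically it's a single ioctl. A minimal sketch (the FICLONE number is the standard Linux value; older headers call it BTRFS_IOC_CLONE):

    import fcntl

    FICLONE = 0x40049409  # _IOW(0x94, 9, int); BTRFS_IOC_CLONE on older headers

    def reflink_copy(src, dst):
        # Share src's extents with dst copy-on-write. Unlike a hardlink,
        # a later write to either file no longer affects the other.
        with open(src, 'rb') as s, open(dst, 'wb') as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())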


Bedup should be better, but the last time I tried it I couldn't get it
to work. It's been updated since then; I just haven't had a chance to
try it again.

Please post what you find out, or if you have a hardlink maker that's
better than the ones I found :)


Thanks for that.

I may be completely wrong in my approach.

I am not looking for a file-level comparison; Bedup worked fine for
that. I have a lot of virtual machine images and ShadowProtect images
where the difference may be only a few megabytes, so a file-level hash
and comparison doesn't really achieve my goals.

I thought duperemove might work at a lower level.

https://github.com/markfasheh/duperemove

"Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the btrfs-extent-same ioctl."

It defaults to 128k but you can make it smaller.
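
To make that concrete, the call duperemove submits matches through looks roughly like this from userspace. This is a sketch, not duperemove's actual code: the ioctl number below is the VFS-level FIDEDUPERANGE, which is numerically the same as the btrfs-specific BTRFS_IOC_FILE_EXTENT_SAME that 3.12-era kernels exposed. The kernel re-verifies that the ranges are byte-identical before sharing extents, so a bad hash match in userspace can't corrupt data:

    import fcntl, struct

    FIDEDUPERANGE = 0xC0189436  # same number as BTRFS_IOC_FILE_EXTENT_SAME

    def dedupe_range(src_fd, src_off, length, dst_fd, dst_off):
        # struct file_dedupe_range followed by one file_dedupe_range_info
        hdr  = struct.pack('=QQHHI', src_off, length, 1, 0, 0)
        info = struct.pack('=qQQiI', dst_fd, dst_off, 0, 0, 0)
        buf = bytearray(hdr + info)
        fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
        # the kernel writes back bytes_deduped and a per-dest status
        _, _, deduped, status, _ = struct.unpack('=qQQiI', bytes(buf[24:]))
        return deduped, status

Open the source read-only and the destination read-write, then hand it the matching offsets. As I understand it, the block size you give duperemove just changes the hashing granularity that decides which ranges end up being submitted this way.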

I hit a hurdle though. The 3TB HDD I used seemed OK when I did a long
SMART test, but it seems to die every few hours. Admittedly, it was part
of a failed mdadm RAID array that I pulled out of a client's machine.

The only other copy I have of the data is the original mdadm array
that was recently replaced with a new server, so I am loath to use
that HDD yet. At least for another couple of weeks!


I am still hopeful duperemove will work.

Duperemove does look exactly like what you are looking for. The last traffic on the mailing list regarding it was in August last year, and it looks like the btrfs-extent-same ioctl it relies on was pulled into the mainline kernel repository on September 1st.

The last commit to the duperemove application was on April 20th this year. Maybe Mark (cc'd) can provide further insight on its current status.

--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

