Re: Easy way to find/identify files which share some content/blocks

2011-07-21 Thread Thomas Keusch
On 2011-05-02, Justin Sherrill jus...@shiningsilence.com wrote:

Hi Justin,

> You could dump out the B-tree information.  I don't know how clear a
> picture would come from that, and it may require some massaging of
> the data anyway, since even non-duplicated files may contain some
> degree of matching, duplicated data, especially when dealing with
> larger image files.

That's a bit beyond my current C programming skills, I guess, and a
little too much effort for this little cleanup project. Anyway, thanks
for the idea.
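
For reference, hammer(8) itself can apparently dump the raw B-tree
without writing any C; a sketch, with a hypothetical device name:

  hammer -f /dev/da0s1a show

Making sense of that output is another matter, of course.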

> If you are sure that the corruption lies at the end of the files, you
> could loop over the files, read the first x bytes of each, then MD5
> that data.  Matching MD5 = matching file.

It mostly is at the end. This suggestion (partitioning files into
chunks and hashing them) is what I had done so far (on Linux), first
with a few lines of shell (an old existing script I adapted), then,
due to the inherent inefficiencies, in Python.

A handful of lines that output inode, chunk ID, and hash to a file or
an SQL database, then go from there.
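
For the curious, roughly what my Python version boils down to (a
sketch, not the actual script; the chunk size is arbitrary):

  #!/usr/bin/env python
  # Emit "inode chunk-index md5" for each fixed-size chunk of every
  # file named on the command line; sort/join the output afterwards
  # to find files whose leading chunks match.
  import hashlib
  import os
  import sys

  CHUNK = 64 * 1024  # 64 KiB; tune to taste

  for path in sys.argv[1:]:
      inode = os.stat(path).st_ino
      with open(path, 'rb') as f:
          index = 0
          while True:
              data = f.read(CHUNK)
              if not data:
                  break
              print("%d %d %s"
                    % (inode, index, hashlib.md5(data).hexdigest()))
              index += 1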

I had hoped HAMMER, as a deduplicating filesystem, would have tools
that could easily give me that information without hacks like the
above.

Regards
Thomas



> On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
> fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:
> > Hello,
> >
> > now that DragonFly's HAMMER has gained deduplication, I ask myself
> > whether there is a simple way to identify pairs or groups of files
> > which share a lot of data, i.e. are mostly identical.
> >
> > I have a rather large repository of downloaded pictures, which
> > contains a lot of dupes in multiple locations. I have no problem
> > finding those, given some time and a shell prompt.
> >
> > I'm interested in identifying broken files. Broken in the sense that
> > A is an incomplete version of B (some bytes missing), or B is a
> > damaged version of A (some additional bytes at the end).
> >
> > Is there a way to get to something like this:
> >
> > File A shares 1234 (98.3%) data blocks with file B
> > File A shares  (xx.x%) data blocks with file C
> >
> > Getting a step closer helps too.
> >
> > Thanks for any insights.
> >
> >
> > Regards
> > Thomas



Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th (> 30%) Disk Space from an Almost Full Drive

2011-07-20 Thread Thomas Keusch
On 2011-07-19, Siju George sgeorge...@gmail.com wrote:

Hi Siju,

> Short summary before dedup of first hard disk
>
> Filesystem            Size   Used  Avail Capacity  Mounted on
> Backup1               454G   451G   2.8G    99%    /Backup1
>
> Short summary after dedup of first hard disk
>
> Filesystem            Size   Used  Avail Capacity  Mounted on
> /Backup1/pfs/@@-1:1   454G   313G   141G    69%    /Backup1/Data
[...]

nice statistics. I cannot provide stats of my own, as I don't run
DragonFly yet, so I'm more of a hypothetical user right now. But one
thing that's of interest to me: how long did the dedup process take?
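
If/when I get around to trying it myself, I'd probably measure it
along these lines (using hammer(8)'s offline dedup directives; the
path is hypothetical):

  # estimate the potential savings first, then time the real pass
  hammer dedup-simulate /Backup1
  time hammer dedup /Backup1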

Regards
Thomas


Easy way to find/identify files which share some content/blocks

2011-05-01 Thread Thomas Keusch
Hello,

now that DragonFly's HAMMER has gained deduplication, I ask myself
whether there is a simple way to identify pairs or groups of files
which share a lot of data, i.e. are mostly identical.

I have a rather large repository of downloaded pictures, which
contains a lot of dupes in multiple locations. I have no problem
finding those, given some time and a shell prompt.

I'm interested in identifying broken files. Broken in the sense that
A is an incomplete version of B (some bytes missing), or B is a
damaged version of A (some additional bytes at the end).

Is there a way to get to something like this:

File A shares 1234 (98.3%) data blocks with file B
File A shares  (xx.x%) data blocks with file C

Getting a step closer helps too.
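
To make it concrete, here is roughly the userland computation I have
in mind (a Python sketch; the block size is a guess, not necessarily
HAMMER's dedup granularity):

  # Report how many fixed-size data blocks two files have in common,
  # via a multiset intersection of per-block MD5 hashes.
  import hashlib
  import sys

  BLOCK = 16 * 1024  # hypothetical block size

  def block_hashes(path):
      counts = {}
      with open(path, 'rb') as f:
          while True:
              data = f.read(BLOCK)
              if not data:
                  break
              h = hashlib.md5(data).hexdigest()
              counts[h] = counts.get(h, 0) + 1
      return counts

  a, b = sys.argv[1], sys.argv[2]
  ha, hb = block_hashes(a), block_hashes(b)
  shared = sum(min(n, hb.get(h, 0)) for h, n in ha.items())
  total = sum(ha.values()) or 1  # guard against empty files
  print("File %s shares %d (%.1f%%) data blocks with file %s"
        % (a, shared, 100.0 * shared / total, b))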

Thanks for any insights.


Regards
Thomas