On 2011-05-02, Justin Sherrill jus...@shiningsilence.com wrote:
Hi Justin,
You could dump out the B-tree information. I don't know how clear a
picture would come from that, and it may require some massaging of
the data anyway, since files that are not duplicates may still share
some matching, deduplicated data, especially when dealing with larger
image files.
That's a bit beyond my current C programming skills I guess, and a
little too much effort for this little cleanup project. Anyway, thanks
for the idea.
If you are sure that the corruption lies at the end of the files, you
could loop over the files, read the first x bytes of each, then MD5
that data. Matching MD5 = matching file.
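A minimal sketch of that prefix-hash idea in Python. The function names
and the 64 KiB default for x are assumptions made for illustration, not
anything HAMMER provides. Note that a file shorter than x bytes is
hashed whole, so x has to be smaller than the shortest truncated file
for the match to work.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def prefix_md5(path, nbytes=65536):
    """Return the MD5 hex digest of the first nbytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def group_by_prefix(root, nbytes=65536):
    """Map prefix digest -> sorted list of paths, keeping only groups
    that contain more than one file."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[prefix_md5(p, nbytes)].append(str(p))
    return {h: sorted(ps) for h, ps in groups.items() if len(ps) > 1}
```

Every group returned is a set of candidates that agree on their first
nbytes and may differ only further along, e.g. at a damaged tail.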
It mostly is at the end. This suggestion (partitioning files into
chunks) is what I have done so far on Linux, first with a few lines of
shell (adapting an existing script), then, because the shell version
was inherently inefficient, in Python.
It takes a handful of lines: output inode, chunkId and hash to a file
or an SQL database, then go from there.
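Roughly along these lines; this is a sketch rather than the actual
script, and the 64 KiB chunk size, the function names and the CSV
output are assumptions.

```python
import csv
import hashlib
import os


def chunk_hashes(path, chunk_size=65536):
    """Yield (inode, chunk_id, md5_hex) for each fixed-size chunk."""
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        chunk_id = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield inode, chunk_id, hashlib.md5(data).hexdigest()
            chunk_id += 1


def dump_chunks(paths, out_path, chunk_size=65536):
    """Write one CSV row (inode, chunk_id, hash) per chunk."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            writer.writerows(chunk_hashes(path, chunk_size))
```

Sorting or joining the rows on the hash column then shows which inodes
have chunks in common.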
I had hoped HAMMER, as a deduplicating filesystem, had tools that could
easily give me that information without hacks like the above.
Regards
Thomas
On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:
Hello,
Now that DragonFly's HAMMER has deduplication, I wonder whether there
is a simple way to identify pairs or groups of files which share a lot
of data, i.e. are mostly identical.
I have a rather large repository of downloaded pictures, which contains
a lot of dupes in multiple locations. I have no problems finding those
given some time and a shell prompt.
I'm interested in identifying broken files. Broken in the sense that
A is an incomplete version of B (some bytes missing), or B a damaged
version of A (some additional bytes at the end).
Is there a way to get to something like this:
File A shares 1234 (98.3%) data blocks with file B
File A shares (xx.x%) data blocks with file C
Getting a step closer helps too.
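For illustration, a userland approximation of such a report can be
sketched in Python. It does not use HAMMER's deduplication data at all;
it simply compares fixed-size blocks at the same offsets, and the 16 KiB
block size and the function names are assumptions.

```python
import hashlib


def block_digests(path, block_size=16384):
    """Return the list of MD5 digests of a file's fixed-size blocks."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).digest())
    return digests


def shared_report(path_a, path_b, block_size=16384):
    """Return (count, percent) of A's blocks that match the block at
    the same offset in B."""
    a = block_digests(path_a, block_size)
    b = block_digests(path_b, block_size)
    shared = sum(1 for x, y in zip(a, b) if x == y)
    percent = 100.0 * shared / len(a) if a else 0.0
    return shared, percent


def format_report(name_a, name_b, shared, percent):
    """Format one line in the style of the wished-for output."""
    return "File %s shares %d (%.1f%%) data blocks with file %s" % (
        name_a, shared, percent, name_b)
```

Comparing by offset only catches truncation or extra bytes at the end;
bytes missing in the middle shift every later block and would need a
different approach.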
Thanks for any insights.
Regards
Thomas