On Thu, Dec 11, 2008 at 03:42:58AM +0000, Oliver Mattos wrote:
> Here is a script to locate duplicate data WITHIN files:
> 
> On some test file sets of binary data with no duplicated files, about 3%
> of the data blocks were duplicated, and about 0.1% of the data blocks
> were nulls.  The data was mainly elf and win32 binaries plus some random
> game data, office documents and a few images.
> 
> This code is hideously slow, so don't give it more than a couple of MB
> of files to chew through at once.  In retrospect I should've just
> written it in plain fast C instead of fighting with bash pipes!
> 
> Note: to get "verbose" output, just remove everything after the word
> "sort" in the code.
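
For reference, the core idea here (hash fixed-size blocks, then count
repeated hashes) can be sketched as a small shell function.  The 4 KiB
block size and md5sum are my assumptions for illustration, not
necessarily what Oliver's script uses:

```shell
# Sketch of within-file duplicate-block detection: hash each
# fixed-size block of a file, then print only hashes that occur
# more than once.  Block size (4096) and md5sum are assumptions.
dupe_blocks() {
    f=$1
    bs=4096
    size=$(wc -c < "$f")
    n=$(( (size + bs - 1) / bs ))
    i=0
    while [ "$i" -lt "$n" ]; do
        # read block i and print its hash on its own line
        dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null \
            | md5sum | cut -d' ' -f1
        i=$((i + 1))
    done | sort | uniq -c | awk '$1 > 1'
}
```

Running e.g. `dupe_blocks somefile.bin` prints a "count hash" line per
duplicated block.  Spawning one dd per block is exactly the kind of
pipe-fighting that makes this slow, as Oliver notes.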

Neat.  Thanks much.  It'd be cool to output each block hash to a
database so you can get a feel for how many blocks are duplicated
across files as well.

I'd like to run this in a similar setup on all my VMware VMDK files and
get an idea of how much space savings there would be across 20+ Windows
2003 VMDK files... probably *lots* of common blocks.
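
A first estimate of that doesn't even need a database: one hedged way
to sketch it in shell (again assuming 4 KiB blocks and md5sum, same as
above) is to emit one hash line per block for every file in the set
and then count hashes seen more than once anywhere:

```shell
# Rough cross-file estimate: hash every 4 KiB block of every file
# given on the command line, then report hashes that occur more than
# once across the whole set.  For each reported line, (count - 1) *
# 4096 bytes is an upper bound on what block-level dedup could
# reclaim for that block.  Block size and hash are assumptions.
cross_file_dupes() {
    bs=4096
    for f in "$@"; do
        size=$(wc -c < "$f")
        n=$(( (size + bs - 1) / bs ))
        i=0
        while [ "$i" -lt "$n" ]; do
            dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null \
                | md5sum | cut -d' ' -f1
            i=$((i + 1))
        done
    done | sort | uniq -c | awk '$1 > 1'
}
```

So something like
`cross_file_dupes *.vmdk | awk '{s += ($1 - 1) * 4096} END {print s " bytes reclaimable"}'`
would give a ballpark figure for the VMDK set, with the same speed
caveat as the original script.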

Ray
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
