On Thu, Dec 11, 2008 at 03:42:58AM +0000, Oliver Mattos wrote:
> Here is a script to locate duplicate data WITHIN files:
>
> On some test file sets of binary data with no duplicated files, about 3%
> of the data blocks were duplicated, and about 0.1% of the data blocks
> were nulls. The data was mainly elf and win32 binaries plus some random
> game data, office documents and a few images.
>
> This code is hideously slow, so don't give it more than a couple of MB
> of files to chew through at once. In retrospect I should've just
> written it in plain fast C instead of fighting with bash pipes!
>
> Note: to get "verbose" output, just remove everything after the word
> "sort" in the code.
Neat. Thanks much. It'd be cool to output the results of each of your
hashes to a database so you can get a feel for how many duplicate blocks
there are across files as well. I'd like to run this in a similar setup
on all my VMware VMDK files and get an idea of how much space savings
there would be across 20+ Windows 2003 VMDK files... probably *lots* of
common blocks.

Ray
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
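For reference, the cross-file tally Ray describes could be sketched roughly as below. This is not Oliver's original script; the 4 KiB block size and the use of md5sum are assumptions (any fixed block size and strong hash would do), and a real run over many VMDKs would want a database rather than sort/uniq in a pipe.

```shell
#!/bin/sh
# Hedged sketch: hash every fixed-size block of every file given as an
# argument, then count how many distinct block hashes occur more than
# once across the whole set.
block_dups() {
    bs=4096                                     # assumed block size
    for f in "$@"; do
        # number of blocks, rounding the final partial block up
        blocks=$(( ($(wc -c < "$f") + bs - 1) / bs ))
        i=0
        while [ "$i" -lt "$blocks" ]; do
            # one md5 per block; dd's skip is in units of $bs
            dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null | md5sum
            i=$((i + 1))
        done
    done | sort | uniq -d | wc -l
}
```

Running it over two identical one-block files reports one duplicated block hash; the same idea scales (slowly) to whole VM images.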