Nick Rout wrote:

On Sat, 2004-12-18 at 19:10 +1300, Steve Holdoway wrote:


Jamie Dobbs wrote:



I have a directory full of sound bites and clip art (approx. 45,000 files) that I have collected over many years.
I want to search the entire directory (and subdirectories) and find any duplicate files, by content rather than by filename or file size. Can anyone tell me of any command-line programs for Linux that might allow me to do this, and then give me the option of deleting (or moving) the duplicate files?


Thanks

Jamie




I'd use a checksum of some kind, and take a multi-step approach...

# one line per file: checksum, then path
find . -type f | while read -r file
do
    md5sum "$file"
done | sort -u > biglist
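
If GNU find and xargs are available, a null-separated pipeline should do the same job in one line, and copes with awkward filenames (leading spaces, even newlines):

find . -type f -print0 | xargs -0 md5sum | sort -u > biglist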



just an idle query, why sort -u? afaict that will weed out identical
entries, ie same name and md5sum, which seem to be the prime candidates!


No particular reason... force of habit, I suppose. It won't affect the result, as the filenames will force the uniqueness (if that's a real word).
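
For example, two identical clips stored under different names come out as two distinct lines in biglist (the checksum and names below are made up for illustration), so sort -u keeps both:

3f8a2c91d0b44e7a9c1d5e6f7a8b9c0d  ./clips/beep.wav
3f8a2c91d0b44e7a9c1d5e6f7a8b9c0d  ./backup/beep-copy.wav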




Then you need to post-process the list...

for sum in `awk '{print $1}' biglist | sort -u`
do
    if [ `grep -c "^$sum" biglist` -gt 1 ]
    then
        # add your processing for managing multiple entries here,
        # or just collect the offending lines and fix them manually:
        grep "^$sum" biglist >> duplicateslist
    fi
done
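
If the per-checksum grep turns out too slow over a 45k-line biglist, a single awk pass should pick out the same lines. Just a sketch: it reads the list twice, counting checksums on the first pass and printing any line whose checksum occurred more than once on the second:

awk 'NR == FNR { count[$1]++; next } count[$1] > 1' biglist biglist > duplicateslist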


I'd probably run them overnight on 45k files, tho!

Cheers,

Steve




