Nick Rout wrote:
On Sat, 2004-12-18 at 19:10 +1300, Steve Holdoway wrote:
Jamie Dobbs wrote:
I have a directory full of sound bites and clip art (approx. 45,000
files) that I have collected over many years.
I want to search the entire directory (and sub directories) and find
any duplicate files, by content rather than by filename or filesize.
Can anyone tell me of any command line programs for Linux that might
allow me to do this, then give me the option of deleting (or moving)
the duplicate files.
Thanks
Jamie
I'd use a checksum of some kind, and take a multi-step approach...
for file in `find . -type f`
do
    md5sum "$file"      # one "checksum  path" line per file
done | sort -u > biglist
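
One caveat: the backtick loop splits on whitespace, so any filename with a
space in it gets mangled before md5sum ever sees it. A more robust variant
(assuming GNU find and xargs, which support -print0 / -0) does the whole
thing in one pipeline:

find . -type f -print0 | xargs -0 md5sum | sort -u > biglist

Same biglist at the end, one "checksum  path" line per file.
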
Just an idle query: why sort -u? AFAICT that will weed out identical
entries, i.e. same name and md5sum, which seem to be the prime candidates!
No particular reason... force of habit, I suppose. It won't affect the
result, as the filenames will force the uniqueness (if that's a real
word).
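
For example, two copies of the same file (an empty one here, purely for
illustration, with made-up names) hash identically but still give
different lines, so sort -u keeps both:

$ printf '%s\n' \
    'd41d8cd98f00b204e9800998ecf8427e  a.wav' \
    'd41d8cd98f00b204e9800998ecf8427e  b.wav' | sort -u
d41d8cd98f00b204e9800998ecf8427e  a.wav
d41d8cd98f00b204e9800998ecf8427e  b.wav
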
Then you need to post process the list...
for sum in `awk '{print $1}' biglist | sort -u`
do
    if [ `grep -c "^$sum" biglist` -gt 1 ]
    then
        # add your processing for managing multiple entries here, or
        # just collect the matching lines and fix them manually:
        grep "^$sum" biglist >> duplicateslist
    fi
done
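
With 45,000 files that loop greps biglist once per unique checksum, which
is a lot of passes. If it turns out too slow, an awk double-pass over
biglist gets the same duplicates list in one go (a sketch; duplicateslist
is just the output name used above):

awk 'NR==FNR { count[$1]++; next } count[$1] > 1' biglist biglist > duplicateslist

The first pass counts each checksum, the second prints every line whose
checksum was seen more than once; since biglist is already sorted, the
duplicates come out grouped by checksum.
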
I'd probably run them overnight on 45k files, tho!
Cheers,
Steve