On Sat, 2004-12-18 at 19:10 +1300, Steve Holdoway wrote:
> Jamie Dobbs wrote:
>
> > I have a directory full of sound bytes and clip art (approx. 45,000
> > files) that I have collected over many years.
> > I want to search the entire directory (and subdirectories) and find
> > any duplicate files, by content rather than by filename or file size.
> > Can anyone tell me of any command-line programs for Linux that might
> > allow me to do this, then give me the option of deleting (or moving)
> > the duplicate files.
> >
> > Thanks
> >
> > Jamie
>
> I'd use a checksum of some kind, and take a multi-step approach...
>
> for file in `find . -type f`
> do
>     md5sum "$file"
> done | sort -u > biglist
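As an aside, that backtick loop word-splits on whitespace, so any
filename with a space in it (likely in a clip art collection) will
break it. A safer first pass might be (just a sketch, assuming GNU
find and xargs with the -print0/-0 options):

  find . -type f -print0 | xargs -0 md5sum | sort > biglist

That also sorts the list by checksum, so duplicate files end up on
adjacent lines.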
Just an idle query: why sort -u? AFAICT that will weed out identical
entries, i.e. same name and md5sum, which seem to be the prime
candidates!

> Then you need to post-process the list...
>
> for sum in `awk '{print $1}' biglist | sort -u`
> do
>     if [ `grep "^$sum" biglist | wc -l` -gt 1 ]
>     then
>         # ... add in your processing for managing multiple entries
>         # here, or just cat the names >> duplicateslist and fix them
>         # manually
>         :
>     fi
> done
>
> I'd probably run them overnight on 45k files, tho!
>
> Cheers,
>
> Steve

-- 
Nick Rout <[EMAIL PROTECTED]>
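P.S. An alternative to the grep-per-checksum loop, assuming GNU uniq
(the -w and --all-repeated options): an md5 digest is 32 hex
characters, so you can group lines whose checksums match and keep only
the groups with more than one member:

  sort biglist | uniq -w32 --all-repeated=separate > duplicateslist

Each group of suspected duplicates comes out separated by a blank
line, ready to fix up manually, and it's a single pass over the list
rather than one grep per checksum, which should help with 45k files.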