Nick Rout wrote:
On Sat, 2004-12-18 at 19:10 +1300, Steve Holdoway wrote:
Jamie Dobbs wrote:
I have a directory full of sound bites and clip art (approx. 45,000
files) that I have collected over many years.
I want to search the entire directory (and sub directories) and find
any duplicate files, by content rather than by filename or filesize.
Can anyone tell me of any command line programs for Linux that might
allow me to do this, then give me the option of deleting (or moving)
the duplicate files.
Thanks
Jamie
I'd use a checksum of some kind, and take a multi-step approach...
for file in `find . -type f`
do
    md5sum "$file"      # one "checksum  path" line per file
done | sort -u > biglist
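
One caveat: the backtick loop splits on whitespace, so any filename with a
space in it gets mangled before md5sum ever sees it. A more robust variant
(assuming GNU find and xargs, which support -print0 / -0) does the whole
thing in one pipeline:

find . -type f -print0 | xargs -0 md5sum | sort -u > biglist

Same biglist at the end, one "checksum  path" line per file.
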
Just an idle query: why sort -u? AFAICT that will weed out identical
entries, i.e. same name and md5sum, which seem to be the prime candidates!
No particular reason... force of habit, I suppose. It won't affect the
result, as the filenames will force the uniqueness (if that's a real
word).
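
For example, two copies of the same file (an empty one here, purely for
illustration, with made-up names) hash identically but still give
different lines, so sort -u keeps both:

$ printf '%s\n' \
    'd41d8cd98f00b204e9800998ecf8427e  a.wav' \
    'd41d8cd98f00b204e9800998ecf8427e  b.wav' | sort -u
d41d8cd98f00b204e9800998ecf8427e  a.wav
d41d8cd98f00b204e9800998ecf8427e  b.wav
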
Then you need to post process the list...
for sum in `awk '{print $1}' biglist | sort -u`
do
    if [ `grep -c "^$sum" biglist` -gt 1 ]
    then
        # add your processing for managing multiple entries here, or
        # just collect the matching lines and fix them manually:
        grep "^$sum" biglist >> duplicateslist
    fi
done
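
With 45,000 files that loop greps biglist once per unique checksum, which
is a lot of passes. If it turns out too slow, an awk double-pass over
biglist gets the same duplicates list in one go (a sketch; duplicateslist
is just the output name used above):

awk 'NR==FNR { count[$1]++; next } count[$1] > 1' biglist biglist > duplicateslist

The first pass counts each checksum, the second prints every line whose
checksum was seen more than once; since biglist is already sorted, the
duplicates come out grouped by checksum.
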
I'd probably run them overnight on 45k files, tho!
Cheers,
Steve