On Sat, 2004-12-18 at 19:10 +1300, Steve Holdoway wrote:
> Jamie Dobbs wrote:
> 
> > I have a directory full of sound bites and clip art (approx. 45,000
> > files) that I have collected over many years.
> > I want to search the entire directory (and its subdirectories) and find
> > any duplicate files, by content rather than by filename or file size.
> > Can anyone tell me of any command-line programs for Linux that would
> > allow me to do this, and then give me the option of deleting (or
> > moving) the duplicate files?
> >
> > Thanks
> >
> > Jamie
> >
> >
> I'd use a checksum of some kind, and take a multi-step approach...
> 
> for file in `find . -type f`
> do
>     md5sum "$file"
> done | sort -u > biglist

Just an idle query: why sort -u? AFAICT that will only weed out
completely identical entries, i.e. same name and same md5sum, which seem
to be the prime candidates!
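
For what it's worth, the whole checksum pass can also be done as one
pipeline that copes with spaces in filenames (a minimal sketch, assuming
GNU find and xargs with -print0/-0 support; it produces the same biglist
working file as above):

# Checksum every regular file; NUL-delimited names survive spaces.
# Sorting the output groups identical checksums together.
find . -type f -print0 | xargs -0 md5sum | sort > biglist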


> 
> Then you need to post-process the list...
> 
> for sum in `awk '{print $1}' biglist | sort -u`
> do
>     if [ `grep -c "^$sum" biglist` -gt 1 ]
>     then
>         # add your processing for managing multiple entries here, or
>         # just collect the names and fix them manually:
>         grep "^$sum" biglist >> duplicateslist
>     fi
> done
> 
> I'd probably run them overnight on 45k files, tho!
> 
> Cheers,
> 
> Steve
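
One other note: the loop above rescans biglist once for every unique
checksum. A single awk pass does the same job in one read of the file
(again just a sketch, writing the duplicateslist file suggested above):

# Bucket every biglist line under its checksum, then print only the
# checksums seen more than once, one blank-line-separated group per
# duplicate set.
awk '{
    count[$1]++
    names[$1] = (names[$1] ? names[$1] ORS : "") $0
}
END {
    for (sum in count)
        if (count[sum] > 1)
            print names[sum] ORS
}' biglist > duplicateslist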
-- 
Nick Rout <[EMAIL PROTECTED]>
