As Matt implies above, this isn't too difficult with standard *nix utilities if the files are actually duplicates.
I use a tip from Jim McNamara to do this in single directories; compare checksums and dump dupes into a file for review, then delete known duplicates:

    cksum *.jpg | sort -n > filelist
    # You can use md5 instead of cksum.
    # Now review the contents of filelist and make sure it's getting rid
    # of the right stuff, then:
    old=""
    # cksum prints three fields: checksum, byte count, filename
    while read -r sum size filename
    do
        if [[ "$sum" != "$old" ]] ; then
            old="$sum"
            continue
        fi
        rm -f "$filename"
    done < filelist

Combined with "find -exec" you shouldn't be far from a workable shell script that can iterate through multiple directories. Careful with wildcards when you're working out the exact syntax for the find.

-Josh

On Mon, Dec 7, 2015 at 9:01 PM, Matt Morgan <m...@concretecomputing.com> wrote:

> You could do this with a shell script. One way: write a `find -exec ...`
> that runs through all the files, outputting the md5sums in some usable
> way. Sort the list and look for multiples (double-checking with diff on
> matches, if you're worried), and replace duplicates with symlinks
> if/where you need them, chmodding as necessary. But if multiple versions
> of the same file have different perms, that's a problem in the first
> place, most likely.
>
> Good luck!
>
> On 12/07/2015 06:32 PM, Perian Sully wrote:
>
>> Hi everyone:
>>
>> I know this is possibly something of a fool's errand, but I'm hoping
>> someone has come up with some magic tool or process for more easily
>> cleaning up file storage than going through 12 years of files
>> one by one.
>>
>> As part of our DAMS project, I've run some TreeSize Pro scans on three
>> of the 20-25 or so network storage directories. Just in those three,
>> there are approximately 66,467 duplicate files. We initially thought
>> about creating hardlinks for the duplicates, which will at least help
>> the server access files more efficiently, but it won't solve the
>> problem of actually having files all over the place that the DAMS will
>> ultimately ingest.
>> Another thought was to do symlinks, but as far as I know, there aren't
>> easy tools to automagically create these for Windows desktops or
>> servers. Plus, it might create havoc for all of the file permissions.
>>
>> So does anyone have any other ideas that I might try? Or are we really
>> just stuck with all of this junk until someone manually goes in and
>> cleans it up?
>>
>> Thanks,
>>
>> ~Perian
>>
>> _______________________________________________
>> You are currently subscribed to mcn-l, the listserv of the Museum
>> Computer Network (http://www.mcn.edu)
>>
>> To post to this list, send messages to: mcn-l@mcn.edu
>>
>> To unsubscribe or change mcn-l delivery options visit:
>> http://mcn.edu/mailman/listinfo/mcn-l
>>
>> The MCN-L archives can be found at:
>> http://www.mail-archive.com/mcn-l@mcn.edu/

--
Joshua D. McDonald
(510) 472-4512
jmcdo...@jhu.edu
j0sh...@umd.edu
joshuadmcdonald.com <http://joshuadmcdonald.com>
Johns Hopkins Scholarly Communication Group
<http://guides.library.jhu.edu/content.php?pid=315747&sid=2583663>
http://prisonlibraryresources.wordpress.com/
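The "find -exec"/md5sum approach that Josh and Matt describe could be sketched roughly as follows. This is a minimal, untested outline, assuming GNU md5sum and find are available; the directory names in the usage line are placeholders, and the awk field-splitting will be confused by filenames containing spaces or newlines, so treat the output strictly as a review list, not something to pipe straight into rm.

```shell
# Sketch only: list likely duplicate files under the given directories
# by md5 checksum. Assumes GNU coreutils (md5sum) and findutils.
list_dupes() {
    # Walk every directory passed as an argument, checksum each regular
    # file, sort so identical checksums land on adjacent lines, then
    # print the filename whenever a checksum repeats the previous one.
    find "$@" -type f -print0 \
        | xargs -0 md5sum \
        | sort \
        | awk 'prev == $1 { print $2 } { prev = $1 }'
}

# Example usage (placeholder paths) -- review dupes.txt by hand before
# deleting anything:
# list_dupes /path/to/dir1 /path/to/dir2 > dupes.txt
```

The sort step is what lets a single pass catch duplicates: once identical checksums are adjacent, awk only has to remember the previous line's hash, which is the same trick Josh's while-read loop uses.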