As Matt implies in his message below, this isn't too difficult with
standard *nix utilities if the files are actually duplicates.

I use a tip from Jim McNamara to do this within a single directory:
compare checksums and dump the dupes into a file for review, then delete
the known duplicates:

cksum *.jpg | sort -n > filelist

# cksum prints three fields: <CRC checksum> <byte count> <filename>
# (you can use md5sum instead of cksum -- see the variant below)
# now review the contents of filelist and make sure it's getting rid
# of the right stuff, then:

old=""
while read -r sum size filename
do
      # keep the first file with each checksum...
      if [[ "$sum" != "$old" ]] ; then
            old="$sum"
            continue
      fi
      # ...and delete every later file that repeats it
      rm -f "$filename"
done < filelist
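
A minimal sketch of the md5sum variant, if you go that route (GNU
coreutils assumed; note md5sum prints only two fields, hash and
filename, so the read changes and a plain lexical sort suffices):

md5sum *.jpg | sort > filelist

old=""
while read -r sum filename
do
      if [[ "$sum" != "$old" ]] ; then
            old="$sum"
            continue
      fi
      rm -f "$filename"
done < filelist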

Combined with "find -exec" you shouldn't be far from a workable shell
script that can iterate through multiple directories; see the sketch
below. Be careful with wildcards when you're working out the exact
syntax for the find.
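
As a starting point, here's an untested sketch (the storage path and
extension are placeholders you'd swap for your own):

# checksum every .jpg under an entire tree; the single quotes keep
# the shell from expanding the wildcard before find sees it
find /path/to/storage -type f -name '*.jpg' -exec cksum {} + |
      sort -n > filelist

# then review filelist and run the same delete loop as above

Keep in mind that across directories the "first" copy kept is just
whichever happens to sort first, so the review step matters even more here.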

-Josh


On Mon, Dec 7, 2015 at 9:01 PM, Matt Morgan <m...@concretecomputing.com>
wrote:

> You could do this with a shell script. One way: write a `find -exec ...`
> that runs through all the files, outputting the md5sums in some usable way.
> Sort the list and look for multiples (double-checking with diff on matches,
> if you're worried), and replace duplicates with symlinks if/where you need
> them, chmodding as necessary. But if multiple versions of the same file
> have different perms, that's a problem in the first place, most likely.
>
> Good luck!
>
>
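
For the symlink swap Matt mentions above, the core step might look like
this rough sketch ($original and $dupe are placeholders for a kept file
and one confirmed duplicate):

# double-check the two files really match, then replace the dupe
# with a symlink; use an absolute path for $original so the link
# resolves from anywhere
if diff -q "$original" "$dupe" > /dev/null ; then
      ln -sf "$original" "$dupe"
fi
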
> On 12/07/2015 06:32 PM, Perian Sully wrote:
>
>> Hi everyone:
>>
>> I know this is possibly something of a fool's errand, but I'm hoping
>> someone has come up with some magic tool or process for more easily
>> cleaning up file storage than going through 12 years of files one by one.
>>
>> As part of our DAMS project, I've run some TreeSize Pro scans on three of
>> the 20-25 or so network storage directories. Just in those three, there are
>> approximately 66,467 duplicate files. We initially thought about creating
>> hardlinks for the duplicates, which will at least help the server access
>> files more efficiently, but it won't solve the problem of actually having
>> files all over the place that the DAMS will ultimately ingest.
>>
>> Another thought was to do symlinks, but as far as I know, there aren't easy
>> tools to automagically create these for Windows desktops or servers. Plus,
>> it might create havoc for all of the file permissions.
>>
>> So does anyone have any other ideas that I might try? Or are we really just
>> stuck with all of this junk until someone manually goes in and cleans it up?
>>
>> Thanks,
>>
>> ~Perian
>>


-- 
Joshua D. McDonald
(510) 472-4512
jmcdo...@jhu.edu

j0sh...@umd.edu
http://joshuadmcdonald.com
Johns Hopkins Scholarly Communication Group
<http://guides.library.jhu.edu/content.php?pid=315747&sid=2583663>
http://prisonlibraryresources.wordpress.com/