Carmen,

The following code may be able to help.
https://github.com/Georgetown-University-Libraries/File-Analyzer

This application can scan a file system and report counts of files by type. It can also report on files by checksum. If you are trying to find exact file duplicates, the checksum report will identify them across a file system.

I will be presenting an overview of this application during the virtual lightning talks session on April 3. If this looks useful to you, I will be glad to walk you through the application.

Terry

On Tue, Mar 19, 2013 at 4:51 PM, Carmen Mitchell <carmenmitch...@gmail.com> wrote:

> Hello Code4Libbers,
>
> I'm working with a faculty member and trying to help them to formalize
> their data collection practices. Part of this process is also going through
> old data and trying to assess what they currently have. This particular
> faculty member has been doing research for 10 years without any kind of
> structure or regular method. So far we have over 2 TB of data in various
> states. (With more to come.)
>
> I've got a programmer working with me to:
> a) identify file types
> b) count how many files of each type
>
> We are now working on de-duping and assessing file size, focusing on the
> JPEGs first. With over 300,000 of them...it might take a while. (Of
> course they aren't following any kind of file naming structure,
> either...It's a mess.)
>
> Any tips or tricks or tools that you might know of to help speed up this
> process? Is there a good image recognition tool that you could suggest that
> would help us with automation?
>
> Thanks,
>
> Carmen Mitchell
> Institutional Repository Librarian
> Cal State San Marcos
>

--
Terry Brady
Applications Programmer Analyst
Lauinger Information Technology
202-687-7053
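
As a rough illustration of the checksum approach to exact duplicates mentioned above, here is a minimal Python sketch. It is not taken from File Analyzer and does not reflect how that application is implemented. The function names (`file_checksum`, `find_exact_duplicates`), the choice of SHA-256, and the size-based first pass are all assumptions made for the example. Grouping by file size first is only a shortcut: files of different sizes cannot be identical, so only files that share a size need to be hashed, which may help with a few hundred thousand JPEGs.

```python
import hashlib
import os
import sys
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return {checksum: sorted paths} for checksums shared by 2+ files."""
    # First pass (assumed optimization): group regular files by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only files whose size matches another file.
    by_checksum = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_checksum[file_checksum(path)].append(path)

    # Keep only checksums that occur more than once.
    return {c: sorted(p) for c, p in by_checksum.items() if len(p) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_dupes.py /path/to/data
    for checksum, paths in find_exact_duplicates(sys.argv[1]).items():
        print(checksum)
        for path in paths:
            print("    " + path)
```

Note that this only catches byte-for-byte duplicates. Two JPEGs of the same picture saved at different sizes or quality settings will have different checksums, so it is not a substitute for image recognition.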