Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread Kyle Banerjee
On Wed, Mar 20, 2013 at 2:22 AM, chris fitzpatrick wrote: > Anyone please correct me if this is wrong. A md5/sha1 file hash would also > not get any image derivatives, like crops or they added text or tweaked the > contrast or photoshopped their cat into the shot... > > If you really wanted to ge
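The point about file hashes is easy to check. Here is a minimal sketch using Python's standard `hashlib`, with made-up byte strings standing in for image files (the names `file_digest`, `original`, `exact_copy`, and `tweaked` are illustrative assumptions, not from the thread). A byte-for-byte copy gets the same digest, while any change to the bytes gives a different one, so a crop or contrast tweak would not match its source image.

```python
import hashlib


def file_digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of a byte string for the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Hypothetical stand-ins for file contents; these are not real images.
original = b"\xff\xd8\xff\xe0 pretend JPEG payload"
exact_copy = bytes(original)   # identical bytes
tweaked = original + b"\x00"   # a single extra byte

same = file_digest(original) == file_digest(exact_copy)
different = file_digest(original) != file_digest(tweaked)
print(same, different)
```

The same holds for `"md5"`; a cryptographic hash only identifies exact duplicates.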

Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread chris fitzpatrick

Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread Dave Caroline
I had a project to de-duplicate many images and other files too. I wrote a little ditty in PHP, but the idea can be used in any language. I have a set of tables in MySQL. Give the utility a set of root directories to test and compare, trawl the filesystems for filename, location, and size, and store in
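A rough sketch of the indexing step described here, in Python rather than PHP. The message is cut off, so everything beyond "record filename, location, and size in a table" is an assumption: SQLite stands in for MySQL, the table and function names (`files`, `index_files`, `same_size_groups`) are hypothetical, and using matching sizes to shortlist candidates is one plausible next step, not necessarily the author's.

```python
import os
import sqlite3


def index_files(roots, conn):
    """Walk each root directory and record name, location, and size per file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT, location TEXT, size INTEGER)"
    )
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # unreadable or vanished file: skip it
                conn.execute(
                    "INSERT INTO files VALUES (?, ?, ?)", (name, dirpath, size)
                )
    conn.commit()


def same_size_groups(conn):
    """Return {size: [paths]} for every size shared by more than one file."""
    rows = conn.execute(
        "SELECT size, location, name FROM files "
        "WHERE size IN (SELECT size FROM files GROUP BY size HAVING COUNT(*) > 1) "
        "ORDER BY size, location, name"
    ).fetchall()
    groups = {}
    for size, location, name in rows:
        groups.setdefault(size, []).append(os.path.join(location, name))
    return groups
```

Usage would look like `conn = sqlite3.connect(":memory:"); index_files(["/some/root"], conn)`. Files of equal size are only candidates; confirming a duplicate still needs a content comparison or checksum.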

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread BWS Johnson
I just want to say that even if this isn't what you had in mind, Wikimedia is very serious and very respectful about Indigenous cultural persistence and language preservation. I can only imagine that sharing your data would be most welcome. Cheers, Brooke

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Kyle Banerjee
On Tue, Mar 19, 2013 at 3:14 PM, Carmen Mitchell wrote: > Heh, well she works with teams of students and willing volunteers from > native communities. The faculty member in question has been > doing documentation and revitalization of endangered languages and has > worked on language revitalizatio

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Carmen Mitchell
Heh, well she works with teams of students and willing volunteers from native communities. The faculty member in question has been doing documentation and revitalization of endangered languages and has worked on language revitalization efforts with several communities, including the Oklahoma Kickap

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Kyle Banerjee
On Tue, Mar 19, 2013 at 1:51 PM, Carmen Mitchell wrote: > We are now working on de-duping and assessing file size, focusing on the > JPEGs first. With over 300,000 of them...it might take a while. (Of > course they aren't following any kind of file naming structure, > either...It's a mess.) > 3

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Carmen Mitchell
Thanks, Shaun and Terry. I'll pass this info along. Terry, I may have Tyson contact you directly if he has questions. I look forward to seeing your lightning talk! Carmen On Tue, Mar 19, 2013 at 2:09 PM, Shaun Ellis wrote: > Carmen, > If you are only interested in de-duping and assessing file

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Terry Brady
Carmen, The following code may be able to help. https://github.com/Georgetown-University-Libraries/File-Analyzer This application can scan a file system and report counts of files by type. The application can also report on files by checksum. If you are trying to find exact file duplicates, th
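For readers who want to see the idea in miniature, the two reports mentioned here (counts of files by type, and files grouped by checksum) can be sketched in a few lines of Python. This is an independent illustration, not File Analyzer's code or interface. The function names (`sha256_of`, `scan`), the choice of SHA-256, and the use of the file extension as a proxy for "type" are all assumptions.

```python
import hashlib
import os
from collections import Counter, defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root):
    """Return (counts by lowercase extension, {checksum: [paths]} for duplicates)."""
    type_counts = Counter()
    by_checksum = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            type_counts[ext] += 1
            by_checksum[sha256_of(path)].append(path)
    # Keep only checksums shared by two or more files: the exact duplicates.
    duplicates = {k: v for k, v in by_checksum.items() if len(v) > 1}
    return type_counts, duplicates
```

With several hundred thousand files, hashing everything is the slow part; hashing only files whose sizes collide would cut the work considerably.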

Re: [CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Shaun Ellis
Carmen, If you are only interested in de-duping and assessing file size, it may be overkill. Picasa has some good organizing and browsing features. Your developer may want to look at the Picasa (Desktop Client) Button API, which can kick off scripts for processing selected photos: https://de

[CODE4LIB] Image de-duping and file identification

2013-03-19 Thread Carmen Mitchell
Hello Code4Libbers, I'm working with a faculty member and trying to help them to formalize their data collection practices. Part of this process is also going through old data and trying to assess what they currently have. This particular faculty member has been doing research for 10 years without