I want to second this recommendation. I wrote a script that recursively 
descends a directory tree and writes out the MD5, SHA1, file length, and 
file path for each file it finds. Using those first three parameters 
*in combination* is darn close to 100% reliable for determining file 
uniqueness. I have never come across two files that differ but still have 
the same 

        $MD5 . $SHA1 . $LENGTH

(had to throw in some Perl :-)
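
For anyone who wants to try this, here is a minimal sketch of that kind of 
script. It is not the original script, just one way to do it under stated 
assumptions: the sub names file_signature and scan_tree are made up for 
illustration, the output is tab-separated, and the modules used are 
File::Find, Digest::MD5, and Digest::SHA.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;
use Digest::SHA;

# Hypothetical helper: read the file once in binary mode and return
# (MD5 hex digest, SHA1 hex digest, length in bytes).
# Returns an empty list if the file cannot be opened.
sub file_signature {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $md5    = Digest::MD5->new;
    my $sha1   = Digest::SHA->new(1);    # 1 selects SHA-1
    my $length = 0;
    my $buf;
    while (my $n = read $fh, $buf, 65536) {
        $md5->add($buf);
        $sha1->add($buf);
        $length += $n;
    }
    close $fh;
    return ($md5->hexdigest, $sha1->hexdigest, $length);
}

# Hypothetical helper: recursively descend $root and return one
# "MD5<TAB>SHA1<TAB>LENGTH<TAB>PATH" record per regular file.
sub scan_tree {
    my ($root) = @_;
    my @records;
    find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            return unless -f $_;
            my @sig = file_signature($_) or return;
            push @records, join("\t", @sig, $_);
        },
    }, $root);
    @records = sort @records;
    return @records;
}

# Only scan when a directory is given on the command line.
if (@ARGV) {
    print "$_\n" for scan_tree($ARGV[0]);
}
```

Because the records are sorted, files with identical MD5, SHA1, and length 
end up on adjacent lines, which makes candidate duplicates easy to spot.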

--
Mike Arms


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Ken Cornetet
Sent: Thursday, September 23, 2010 1:45 PM
To: Francisco Zarabozo; Active State Perl Mailing List
Subject: RE: Best way to compare to files in Perl

Your requirements are impossible to fulfill.

Think about this for a minute. There are an infinite number of possible 
input files, but only a finite number of digests or checksums of any given 
fixed length. So some distinct files must share the same digest, and no 
digest by itself can guarantee uniqueness.

That said, in practical terms if you store the length of each existing file, 
its MD5 digest, and its SHA1 digest, you can be pretty sure you'll never reject 
a non-duplicate file. Pretty sure, but not 100% positive.
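
If "pretty sure" is not good enough, one option that goes beyond what is 
suggested above is to treat the stored values only as a fast lookup key and 
then confirm any match with a byte-by-byte comparison. The sketch below 
assumes an in-memory hash standing in for the database table; the names 
file_key, find_duplicate, and remember are hypothetical. It uses 
Digest::MD5, Digest::SHA, and File::Compare.

```perl
use strict;
use warnings;
use Digest::MD5;
use Digest::SHA;
use File::Compare qw(compare);

# Hypothetical helper: build the "MD5.SHA1.LENGTH" lookup key for a file.
sub file_key {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $md5  = Digest::MD5->new;
    my $sha1 = Digest::SHA->new(1);    # 1 selects SHA-1
    my $buf;
    while (read $fh, $buf, 65536) {
        $md5->add($buf);
        $sha1->add($buf);
    }
    close $fh;
    return join '.', $md5->hexdigest, $sha1->hexdigest, -s $path;
}

# Return the stored path that is byte-identical to $path, or undef.
# $index is a hash ref (a stand-in for the database):
#   key => [ paths already stored under that key ]
sub find_duplicate {
    my ($index, $path) = @_;
    my $key = file_key($path);
    for my $stored (@{ $index->{$key} || [] }) {
        # compare() returns 0 when the two files have identical contents.
        return $stored if compare($path, $stored) == 0;
    }
    return;
}

# Record a newly accepted file under its key.
sub remember {
    my ($index, $path) = @_;
    push @{ $index->{ file_key($path) } }, $path;
    return;
}
```

The full comparison only runs when the key already exists, so the cost per 
upload stays small, and a key collision between two different files results 
in both being kept rather than one being wrongly rejected.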


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Francisco 
Zarabozo
Sent: Thursday, September 23, 2010 3:29 PM
To: Active State Perl Mailing List
Subject: Best way to compare to files in Perl


Hello All,


I have thousands of files that I need to analyze with Perl and discard any 
duplicates. I also need to implement a way to *not* save on disk any file 
that a visitor uploads on the website in the case it's a file we already 
have on disk.

So, I need to compare files and have some kind of identifiers in a database 
that can help me quickly identify when a duplicate file is received 
(comparing each upload byte-for-byte against every file on the server is 
not really an option, since it could take forever). I've heard a little about 
CRC and checksum (about how you can obtain a little identifier/result that 
can be stored in the DB) but I'm not really sure how to use it in Perl for 
file comparison, or whether that's the best way to do this.

Someone told me that CRC can sometimes make you believe it's a duplicate 
when it's not (that it can give you the same result with two different 
files), and I need to be 100% certain that a file is not a duplicate of 
another already on the server.

Can you guys please give me some advice on how to do this and maybe point me 
to the right modules?

Thanks a lot! :-)

Francisco 


_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
