On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>       a) Use Chris Angelico's suggestion and hash each of the files (use the 
> standard library's 'hashlib' for this).  Identical files will always have 
> identical hashes, but there may be false positives, so you'll need to verify 
> that files that have identical hashes are indeed identical.
>       b) If your files tend to have sections that are very different (e.g., 
> the first 32 bytes tend to differ), then you can treat that section of the 
> file as its hash.  You can then do the same trick as above.  (The advantage 
> is that you read in a lot less data than if you have to hash the entire 
> file.)
>       c) You may be able to do something clever by reading portions of each 
> file.  That is, use zip() combined with read(1024) to read each of the files 
> in sections, while keeping running hashes of the files.  Or, maybe you'll be 
> able to read portions of them and sort the list as you're reading.  In either 
> case, if any files are NOT identical, you'll be able to stop work as soon as 
> you discover this, rather than having to read every file in full.
> 
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.
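
For reference, suggestion (a) boils down to something like the following
(a rough, untested sketch; the glob pattern, hash algorithm, and chunk
size are just placeholders):

import hashlib
from pathlib import Path

def file_digest(path, chunk_size=64 * 1024):
    # Hash the file in chunks so it never has to fit in memory at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Group files by digest.  Only files that share a digest can be
# identical, and those still need a byte-for-byte check afterwards.
groups = {}
for path in Path('.').glob('*.dat'):    # placeholder pattern
    groups.setdefault(file_digest(path), []).append(path)

candidates = [paths for paths in groups.values() if len(paths) > 1]

Suggestion (b) is the same idea with file_digest() replaced by something
that only reads the first few bytes of each file.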

Hashing a file with a conventional hashing algorithm requires reading
the whole file, though. Unless the files are very likely to be identical
_until_ near the end, you're better off just reading the first N bytes
of both files, then the next N bytes, and so on, until you find
somewhere they differ. The filecmp module may be useful for this.
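
That early-exit comparison might look something like this for two files
(untested sketch; the chunk size is arbitrary):

import os

def files_equal(path_a, path_b, chunk_size=64 * 1024):
    # Cheapest check first: files of different sizes can't be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False        # stop at the first difference
            if not chunk_a:         # both files exhausted, no mismatch
                return True

(filecmp.cmp(path_a, path_b, shallow=False) is the relevant call there;
with shallow=False it compares file contents rather than just the
os.stat() signatures.)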