On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
> a) Use Chris Angelico's suggestion and hash each of the files (use the
> standard library's 'hashlib' for this). Identical files will always have
> identical hashes, but there may be false positives, so you'll need to verify
> that files that have identical hashes are indeed identical.
> b) If your files tend to have sections that are very different (e.g.,
> the first 32 bytes tend to be different), then you pretend that section of
> the file is its hash. You can then do the same trick as above. (the
> advantage of this is that you will read in a lot less data than if you have
> to hash the entire file).
> c) You may be able to do something clever by reading portions of each
> file. That is, use zip() combined with read(1024) to read each of the files
> in sections, while keeping hashes of the files. Or, maybe you'll be able to
> read portions of them and sort the list as you're reading. In either case,
> if any files are NOT identical, then you'll be able to stop work as soon as
> you figure this out, rather than having to read the entire file at once.
>
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.
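For reference, the chunked hashing in (a) might look something like the sketch
below (sha256, the helper name and the 64 KiB read size are just illustrative
choices, not from the thread):

    import hashlib

    def file_digest(path, chunk_size=65536):
        """Hash a file in chunks so the whole file never sits in memory."""
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    # Group candidate files by digest; only groups with more than one member
    # need a byte-for-byte check to rule out hash collisions.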
Hashing a file with a conventional hashing algorithm requires reading the whole file. Unless the files are very likely to be identical _until_ near the end, you're better off just reading the first N bytes of both files, then the next N bytes, and so on, until you find somewhere they differ. The filecmp module may be useful for this.
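A minimal sketch of that incremental comparison (the helper name and block
size are illustrative, not something the thread specifies):

    def files_equal(path_a, path_b, block_size=65536):
        """Compare two files block by block, stopping at the first difference."""
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            while True:
                block_a = a.read(block_size)
                block_b = b.read(block_size)
                if block_a != block_b:
                    return False          # differing content (or lengths)
                if not block_a:
                    return True           # both files exhausted together

Or lean on the standard library directly: filecmp.cmp(path_a, path_b,
shallow=False) compares the contents buffer by buffer and stops at the first
mismatch, so it also avoids reading whole files unnecessarily.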
