On Feb 7, 2016, at 4:46 PM, Paulo da Silva <[email protected]>
wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I already have a class MyFile that has an efficient method (or
> operator) to compare two MyFile instances for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.
If you're after strict equality (every byte in a pair of files is identical),
then here are a few heuristics that may help you:
1) Test for file length, without reading in the whole file. You can use
os.path.getsize() to do this; it reads only the file's metadata (via
os.stat()), not its contents, so it's cheap regardless of file size. As Oscar
Benjamin suggested, you can create a defaultdict(list) that maps each size to
the list of files of that size. This should help you gather your potentially
identical files quickly.
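Here's a minimal sketch of that first pass (group_by_size and its paths
argument are names I've made up for illustration):

import os
from collections import defaultdict

def group_by_size(paths):
    # Map each file size to the list of paths that have that size.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # Only groups with at least two members can contain duplicates;
    # everything else is unique and can be discarded.
    return [group for group in by_size.values() if len(group) > 1]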
2) Once you have the dictionary from above, you can iterate over its values,
each of which will be a list. If a list has only one file in it, you know it's
unique, and you don't have to do any more work on it. If there are two or more
files in the list, then you have several different options:
a) Use Chris Angelico's suggestion and hash each of the files (use the
standard library's 'hashlib' for this; there's a sketch after this list).
Identical files will always have identical hashes, but there can be false
positives, so you'll need to verify that files with identical hashes are
indeed identical.
b) If your files tend to have sections that differ early on (e.g., the
first 32 bytes are usually different), then you can treat that section of the
file as a stand-in for its hash and do the same trick as above. (The advantage
is that you read far less data than if you had to hash the entire file.)
c) You may be able to do something clever by reading portions of each
file. That is, use zip() combined with read(1024) to step through the files in
parallel, chunk by chunk, while updating a running hash of each one. Or maybe
you'll be able to read portions of them and sort the list as you're reading.
In either case, if any files are NOT identical, you'll be able to stop work as
soon as you discover this, rather than having to read the files in their
entirety (see the comparison sketch below).
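To illustrate (a), here's a rough sketch of hashing with hashlib; it reads
each file in chunks so large files don't need to fit in memory (group_by_hash
and chunk_size are illustrative names):

import hashlib
from collections import defaultdict

def group_by_hash(paths, chunk_size=65536):
    # Map each digest to the list of paths that hash to it.
    by_hash = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            # Read fixed-size chunks until read() returns b''.
            for block in iter(lambda: f.read(chunk_size), b''):
                digest.update(block)
        by_hash[digest.hexdigest()].append(path)
    # As noted above, matching digests still need a byte-for-byte check.
    return [group for group in by_hash.values() if len(group) > 1]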
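And for (c), a sketch of a chunk-by-chunk comparison that stops at the first
difference (files_equal is an illustrative name; the size check from step 1
guarantees both files run out of data at the same time):

def files_equal(path_a, path_b, chunk_size=1024):
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                # Stop as soon as a difference appears.
                return False
            if not block_a:
                # Both files exhausted with no differences found.
                return True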
The main purpose of these suggestions is to reduce the amount of reading you're
doing. Storage tends to be slow, and any tricks that reduce the number of
bytes you need to read in will be helpful to you.
Good luck!
Cem Karan