Re: A sets algorithm
On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.

My intention was that every N was "number of files being compared", but it's possible I made a mistake elsewhere. Assuming the hashes become integers too large for a bucket sort (which they certainly will, unless you want inordinate numbers of hash collisions), the code would look something like this:

    hash_to_filename = defaultdict(list)
    for fn in files:
        # Step 1: Hash every file.
        hash = calculate_hash(fn)
        # Step 2: Locate all pairs of files with identical hashes
        hash_to_filename[hash].append(fn)

This loop executes once for each file, so calculate_hash() is called once for each file. The cost of that is O(N), or possibly O(N*M).
If you're fairly sure the files are going to differ in size, you could use the file size *as* the hash - it's cheap to calculate (no M component whatsoever - just a stat call, and even that could be optimized away in some cases, eg if you're scanning a whole directory), but isn't cryptographically secure and ignores file content altogether.

The second step involves one dictionary lookup or insertion for each file. That's an O(log N) operation, where N is the number of elements in the dictionary. So yes, it's not quite the same as the number of files (if every file exists exactly twice, this N will be half the other N), but it's broadly going to correspond. And an O(log N) operation performed N times has an overall cost of O(N log N).

I might have something wrong here, but definitely I meant to have the Ns all mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
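[The size-as-hash idea from the post above can be sketched like this; the function name and the use of plain filenames rather than MyFile objects are my own assumptions, not from the thread:]

```python
import os
from collections import defaultdict

def group_by_size(filenames):
    """Bucket files by size: one stat call per file, no content reads.

    A file alone in its bucket cannot have a duplicate, so only
    buckets with two or more files are returned as candidates."""
    by_size = defaultdict(list)
    for fn in filenames:
        by_size[os.path.getsize(fn)].append(fn)
    return {size: fns for size, fns in by_size.items() if len(fns) > 1}
```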
Re: A sets algorithm
On Tuesday 09 February 2016 02:11, Chris Angelico wrote:

> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.

You're mixing up N's there :-) In the first two (compare every file against every other file, and hash lookup), N stands for the number of files. But in the hash comparison, well, I'm not sure what you mean by that, unless you mean the calculation of the hash itself, which will be O(M) where M is the size of the file.

Unfortunately, even using a hash, you still have to do O(N**2) work, or rather, O(N*M) where N is the number of files and M is their size.

Assuming that duplicate files are relatively rare, the best way to do this is to collate files with the same size. Looking up the file size ought to be pretty fast. It is an IO operation, but it is approximately constant time in the sense that it doesn't actually depend on the size of the file. (If you're using a programming language or operating system where the only way to find out how big a file is is to open it and count the bytes, this obviously won't work for you.)

So let's collect files into a separate bucket for each unique file size:

    bysize = {}
    for file in files:
        bysize.setdefault(file.getsize(), []).append(file)

In practice, you may find that nearly all files have different sizes, and most of the buckets have only a single file in them. Those files you can just ignore, as they must be unique. So instead of having to compare N files against each of the others, you are left with a much smaller number of buckets, each with (hopefully) only two or three files of the same size.
You might have gone from N=100 to thirty buckets with five files each. And most of those files will probably differ on the very first byte, or at least within the first 16K of bytes, so it's hardly worth hashing them (especially if the hash algorithm is an expensive cryptographic hash, or if the file is large).

I would delay the hash comparison as long as possible, something like this:

    def __hash__(self):
        if self._hash is None:
            self._hash = calculate_hash()  # expensive, so we cache it
        return self._hash

    def __eq__(self, other):
        size = self.getsize()
        if size != other.getsize():
            return False
        if self._hash and other._hash:
            # Both hashes are already done, so might as well use them.
            return (hash(self) == hash(other)
                    # in case of collisions...
                    and self.read() == other.read())
        if size < MAXSIZE:
            return self.read(size) == other.read(size)
        else:
            if self.read(MAXSIZE) != other.read(MAXSIZE):
                return False
            else:
                return (hash(self) == hash(other)
                        # in case of collisions...
                        and self.read() == other.read())

If your hash is strong enough that collisions are vanishingly rare, you could ignore the `and self.read() == other.read()` check.

--
Steve
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
Chris Angelico wrote:

    hash_to_filename = defaultdict(list)
    for fn in files:
        # Step 1: Hash every file.
        hash = calculate_hash(fn)
        # Step 2: Locate all pairs of files with identical hashes
        hash_to_filename[hash].append(fn)

I think you can avoid hashing the files altogether. First divide the files into groups of the same size. Then for each group, open all the files at once, read them in parallel and compare them with each other. As soon as you find a difference, split the group into smaller groups. When a group gets down to just one file, you can stop reading that file.

Assuming that most of the differing files differ near the beginning, this strategy means that you will hardly ever have to read a whole file. Hashing the files beforehand, in contrast, requires reading all of every file every time.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list
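[A rough sketch of the group-splitting strategy described above; the names and chunk size are my own, and it assumes each same-size group is small enough to keep all of its files open at once:]

```python
from collections import defaultdict

CHUNK = 4096

def split_same_size_group(filenames):
    """Read same-sized files in lockstep, splitting the group at the
    first chunk where contents differ; returns lists of identical files."""
    files = [open(fn, "rb") for fn in filenames]
    try:
        groups, identical = [files], []
        while groups:
            next_groups = []
            for group in groups:
                if len(group) == 1:
                    continue  # unique: stop reading this file
                by_chunk = defaultdict(list)
                for f in group:
                    by_chunk[f.read(CHUNK)].append(f)
                for chunk, subgroup in by_chunk.items():
                    if chunk == b"":
                        # Same size, so they reach EOF together: identical.
                        if len(subgroup) > 1:
                            identical.append([f.name for f in subgroup])
                    else:
                        next_groups.append(subgroup)
            groups = next_groups
        return identical
    finally:
        for f in files:
            f.close()
```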
Re: A sets algorithm
On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
> a) Use Chris Angelico's suggestion and hash each of the files (use the
> standard library's 'hashlib' for this). Identical files will always have
> identical hashes, but there may be false positives, so you'll need to verify
> that files that have identical hashes are indeed identical.
> b) If your files tend to have sections that are very different (e.g.,
> the first 32 bytes tend to be different), then you pretend that section of
> the file is its hash. You can then do the same trick as above. (The
> advantage of this is that you will read in a lot less data than if you have
> to hash the entire file.)
> c) You may be able to do something clever by reading portions of each
> file. That is, use zip() combined with read(1024) to read each of the files
> in sections, while keeping hashes of the files. Or, maybe you'll be able to
> read portions of them and sort the list as you're reading. In either case,
> if any files are NOT identical, then you'll be able to stop work as soon as
> you figure this out, rather than having to read the entire file at once.
>
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.

Hashing a file using a conventional hashing algorithm requires reading the whole file. Unless the files are very likely to be identical _until_ near the end, you're better off just reading the first N bytes of both files, then the next N bytes, etc, until you find somewhere they're different. The filecmp module may be useful for this.

--
https://mail.python.org/mailman/listinfo/python-list
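[For a single pair of files, the stdlib module mentioned above already does the chunked, stop-at-first-difference comparison; a small wrapper for illustration:]

```python
import filecmp

def files_equal(path_a, path_b):
    # shallow=False forces a chunk-by-chunk content comparison that
    # stops at the first differing block, rather than comparing
    # os.stat() signatures only.
    return filecmp.cmp(path_a, path_b, shallow=False)
```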
Re: A sets algorithm
On Tue, Feb 9, 2016 at 1:49 AM, Random832 wrote:
> On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>> a) Use Chris Angelico's suggestion and hash each of the files (use the
>> standard library's 'hashlib' for this). Identical files will always have
>> identical hashes, but there may be false positives, so you'll need to verify
>> that files that have identical hashes are indeed identical.
>> b) If your files tend to have sections that are very different (e.g.,
>> the first 32 bytes tend to be different), then you pretend that section of
>> the file is its hash. You can then do the same trick as above. (The
>> advantage of this is that you will read in a lot less data than if you have
>> to hash the entire file.)
>> c) You may be able to do something clever by reading portions of each
>> file. That is, use zip() combined with read(1024) to read each of the files
>> in sections, while keeping hashes of the files. Or, maybe you'll be able to
>> read portions of them and sort the list as you're reading. In either case,
>> if any files are NOT identical, then you'll be able to stop work as soon as
>> you figure this out, rather than having to read the entire file at once.
>>
>> The main purpose of these suggestions is to reduce the amount of reading
>> you're doing.
>
> Hashing a file using a conventional hashing algorithm requires reading
> the whole file. Unless the files are very likely to be identical _until_
> near the end, you're better off just reading the first N bytes of both
> files, then the next N bytes, etc, until you find somewhere they're
> different. The filecmp module may be useful for this.

That's fine for comparing one file against one other. He started out by saying he already had a way to compare files for equality. What he wants is a way to capitalize on that to find all the identical files in a group.
A naive approach would simply compare every file against every other, for O(N*N) comparisons - but a hash lookup can make that O(N) on the files themselves, plus (I think) an O(N log N) hash comparison job, which has much lower constant factors. The key here is the hashing algorithm though.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFiles for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way. This has two costs:

1) You need to figure out some hashing algorithm such that any two equal files have the same hash, and ideally, that unequal files will generally have unequal hashes.

2) Hash each file once, and then do your comparisons on the hashes, finally testing for actual equality only on those with the same hash.

(The last step takes care of hash collisions - where different files happen to have the same hash - so ideally, you shouldn't have to do this often.)

If your definition of "equal" among MyFiles is simply based on the file content, it's easy - just hash the content. But if there are ways for files to be considered equal without being bit-for-bit identical, you'll have to reflect that in the hash. For instance, if you consider files equal if they differ only in whitespace, then you'd need to convert all whitespace to a single space before hashing; or if you don't care about the order of lines in the file, you could either hash the lines separately and sum/xor the hashes, or sort the lines before hashing.

But the rest of the job is pretty straightforward.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
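[The whitespace example above could be sketched as a normalize-then-hash helper; the function name and the choice of SHA-256 are my own assumptions:]

```python
import hashlib

def whitespace_insensitive_hash(path):
    """Hash file content with runs of whitespace collapsed to single
    spaces, so files that differ only in whitespace hash identically."""
    with open(path, "rb") as f:
        normalized = b" ".join(f.read().split())
    return hashlib.sha256(normalized).hexdigest()
```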
Re: A sets algorithm
On 7 Feb 2016 21:51, "Paulo da Silva" wrote:
>
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFiles for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If you can make the MyFiles hashable then this is easily done with defaultdict(list).

--
Oscar
--
https://mail.python.org/mailman/listinfo/python-list
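[A sketch of the defaultdict(list) approach, with a stand-in MyFile that is hashable on its content; the class here is my own illustration, not the poster's actual code:]

```python
from collections import defaultdict

class MyFile:
    """Stand-in for the poster's class; hashable on content (assumption)."""
    def __init__(self, content):
        self.content = content
    def __hash__(self):
        return hash(self.content)
    def __eq__(self, other):
        return self.content == other.content

def equal_sets(myfiles):
    """Group objects by hash, keeping only groups of two or more."""
    buckets = defaultdict(list)
    for mf in myfiles:
        buckets[hash(mf)].append(mf)
    # A final pairwise == within each bucket would guard against hash
    # collisions; omitted here for brevity.
    return [grp for grp in buckets.values() if len(grp) > 1]
```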
A sets algorithm
Hello!

This may not be a strict python question, but ...

Suppose I have already a class MyFile that has an efficient method (or operator) to compare two MyFiles for equality.

What is the most efficient way to obtain all sets of equal files (of course each set must have more than one file - all single files are discarded)?

Thanks for any suggestions.

Paulo
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
On 07-02-2016 at 22:17, Tim Chase wrote:
> On 2016-02-07 21:46, Paulo da Silva wrote:
...
> If the MyFile objects can be unique but compare equal
> (e.g. two files on the file-system that have the same SHA1 hash, but
> you want to know the file-names), you'd have to do a paired search
> which would have worse performance and would need to iterate over the
> data multiple times:
>
> all_files = list(generate_MyFile_objects())
> interesting = [
>     (my_file1, my_file2)
>     for i, my_file1 in enumerate(all_files, 1)
>     for my_file2 in all_files[i:]
>     if my_file1 == my_file2
> ]

"my_file1 == my_file2" can be implemented in the MyFile class taking advantage of caching sizes (if the sizes differ, the files are different), hashes, or even content (for small files) or file headers (first n bytes). However this seems to have a problem:

all_files: a b c d e ...

If a==b then comparing b with c,d,e is useless. Maybe use several steps with dicts - sizes, then hashes for same-size files, etc ...

Another solution I thought of could be defining some methods (I still don't know which ones) in MyFile so that I could use sets intersection. Would this one be a faster solution?

Thanks
--
https://mail.python.org/mailman/listinfo/python-list
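[The several-steps idea (group by size first, then hash only within same-size groups) could be sketched like this, assuming plain filenames rather than MyFile objects; names are mine:]

```python
import hashlib
import os
from collections import defaultdict

def duplicate_sets(filenames):
    """Two-stage duplicate finder: cheap size buckets first, then
    content hashes only for files that share a size."""
    by_size = defaultdict(list)
    for fn in filenames:
        by_size[os.path.getsize(fn)].append(fn)
    dupes = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, so a unique file: no hashing needed
        by_hash = defaultdict(list)
        for fn in same_size:
            with open(fn, "rb") as f:
                by_hash[hashlib.sha256(f.read()).digest()].append(fn)
        dupes.extend(g for g in by_hash.values() if len(g) > 1)
    return dupes
```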
Re: A sets algorithm
On 2016-02-07 21:46, Paulo da Silva wrote:
> Suppose I have already a class MyFile that has an efficient method
> (or operator) to compare two MyFiles for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If your MyFile grows a __hash__ method that is based off the uniqueness-qualities that you're comparing, you can use collections.Counter and filter the results:

    from collections import Counter
    interesting = [
        my_file
        for my_file, count in Counter(generate_MyFile_objects()).iteritems()
        if count > 1
    ]

Which should be roughly O(n). It also has the advantage of iterating over your generator once.

If the MyFile objects can be unique but compare equal (e.g. two files on the file-system that have the same SHA1 hash, but you want to know the file-names), you'd have to do a paired search which would have worse performance and would need to iterate over the data multiple times:

    all_files = list(generate_MyFile_objects())
    interesting = [
        (my_file1, my_file2)
        for i, my_file1 in enumerate(all_files, 1)
        for my_file2 in all_files[i:]
        if my_file1 == my_file2
    ]

-tkc
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
On Feb 7, 2016, at 4:46 PM, Paulo da Silva wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFiles for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file. You can use os.path.getsize() to do this (I hope that this is a constant-time operation, but I haven't tested it). As Oscar Benjamin suggested, you can create a defaultdict(list) which will make it possible to gather lists of files of equal size. This should help you gather your potentially identical files quickly.

2) Once you have your dictionary from above, you can iterate its values, each of which will be a list. If a list has only one file in it, you know it's unique, and you don't have to do any more work on it. If there are two files in the list, then you have several different options:

a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this). Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.

b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash. You can then do the same trick as above. (The advantage of this is that you will read in a lot less data than if you have to hash the entire file.)

c) You may be able to do something clever by reading portions of each file.
That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files. Or, maybe you'll be able to read portions of them and sort the list as you're reading. In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.

The main purpose of these suggestions is to reduce the amount of reading you're doing. Storage tends to be slow, and any tricks that reduce the number of bytes you need to read in will be helpful to you.

Good luck!
Cem Karan
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
On 2016-02-08 00:05, Paulo da Silva wrote:
> On 07-02-2016 at 22:17, Tim Chase wrote:
>> all_files = list(generate_MyFile_objects())
>> interesting = [
>>     (my_file1, my_file2)
>>     for i, my_file1 in enumerate(all_files, 1)
>>     for my_file2 in all_files[i:]
>>     if my_file1 == my_file2
>> ]
>
> "my_file1 == my_file2" can be implemented in the MyFile class taking
> advantage of caching sizes (if the sizes differ, the files are
> different), hashes or even content (for small files) or file headers
> (first n bytes). However this seems to have a problem:
> all_files: a b c d e ...
> If a==b then comparing b with c,d,e is useless.

Depends on what the OP wants to have happen if more than one input file is equal. I.e., a == b == c. Does one just want "a has duplicates" (and optionally "and here's one of them"), or does one want "a == b", "a == c" and "b == c" in the output?

> Another solution I thought of could be defining some methods (I
> still don't know which ones) in MyFile so that I could use sets
> intersection. Would this one be a faster solution?

Adding __hash__ would allow for the set operations, but would require (as ChrisA points out) knowing how to create a hash function that encompasses the information you want to compare.

-tkc
--
https://mail.python.org/mailman/listinfo/python-list
Re: A sets algorithm
On 07-02-2016 at 21:46, Paulo da Silva wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFiles for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

After reading all the suggestions I decided to first try the defaultdict(list), as first suggested by Oscar, in several steps. First with sizes and then with other partial contents and/or "strong" hashes as suggested by Cem.

Thank you very much to all who responded for all the helpful suggestions. If I find something better I'll report here.

Paulo
--
https://mail.python.org/mailman/listinfo/python-list