Chris Angelico wrote:
hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)
    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)
I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then, for each group, open all the files at once, read them in parallel, and compare them with each other. As soon as you find a difference, split the group into smaller groups. When a group gets down to just one file, you can stop reading that file.

Assuming that most of the differing files differ near the beginning, this strategy means that you will hardly ever have to read a whole file. Hashing the files beforehand, in contrast, requires reading all of every file every time.

--
Greg
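
A rough sketch of the scheme Greg describes might look like the following (the chunk size, function names, and the list-of-groups return value are illustrative choices, not taken from the post):

import os
from collections import defaultdict

CHUNK_SIZE = 65536  # illustrative choice; the post doesn't specify a read size

def find_duplicate_groups(filenames):
    # Files of different sizes cannot be identical, so only files of
    # the same size ever need to be compared against each other.
    by_size = defaultdict(list)
    for fn in filenames:
        by_size[os.path.getsize(fn)].append(fn)

    duplicates = []
    for group in by_size.values():
        if len(group) > 1:
            duplicates.extend(compare_group(group))
    return duplicates

def compare_group(group):
    # Read all files in the group side by side, splitting the group
    # into smaller groups whenever their contents diverge.
    handles = {fn: open(fn, 'rb') for fn in group}
    try:
        active = [group]   # groups whose files are still indistinguishable
        identical = []     # groups fully read and proven byte-identical
        while active:
            still_active = []
            for g in active:
                # Partition the group by the next chunk of each file.
                by_chunk = defaultdict(list)
                for fn in g:
                    by_chunk[handles[fn].read(CHUNK_SIZE)].append(fn)
                for chunk, subgroup in by_chunk.items():
                    if len(subgroup) < 2:
                        continue                    # unique so far: stop reading it
                    if chunk == b'':
                        identical.append(subgroup)  # all hit EOF together: duplicates
                    else:
                        still_active.append(subgroup)
            active = still_active
        return identical
    finally:
        for f in handles.values():
            f.close()

Here find_duplicate_groups(files) returns a list of groups, each a list of filenames whose contents are byte-identical. One practical caveat with this sketch: because every file in a size group is opened at once, a very large group could bump into the operating system's open-file limit.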
