David Eppstein wrote:

The hard part is verifying that the files that look like duplicates really are duplicates. Doing so for a group of m files that appear to be the same requires 2(m-1) full reads if you use a comparison-based method, or m full reads if you use a strong hashing method. You can't hope to cut the reads off early when using comparisons, because the files really are identical, so no comparison will ever fail.
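
For reference, a minimal sketch of that strong-hashing method in Python (hashlib is standard library; the helper name group_by_digest and the choice of SHA-256 as the "strong hash" are mine, just for illustration):

import hashlib

def group_by_digest(paths, chunk_size=1 << 20):
    # One full read per file: hash each candidate, then bucket by digest.
    groups = {}
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        groups.setdefault(digest.hexdigest(), []).append(path)
    # Buckets with two or more members are duplicates (up to hash collisions).
    return [g for g in groups.values() if len(g) > 1]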

If you read them in parallel, it's _at most_ m full reads (m is the worst case, hit only when every candidate really is a duplicate), not 2(m-1). In my tests, it has always been significantly less than m.
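
Here's a minimal sketch of the kind of lockstep comparison I mean (confirm_duplicates is an illustrative name, not the exact code from my tests, and it assumes the candidates are ordinary files, not pipes):

def confirm_duplicates(paths, chunk_size=1 << 20):
    # Read all candidates in lockstep, one chunk at a time, splitting
    # them into groups whose contents still agree.  A file stops being
    # read the moment it ends up alone in its group.
    handles = [open(p, "rb") for p in paths]
    try:
        groups = [list(range(len(paths)))]
        while groups:
            next_groups = []
            for group in groups:
                by_chunk = {}
                for i in group:
                    by_chunk.setdefault(handles[i].read(chunk_size), []).append(i)
                for chunk, members in by_chunk.items():
                    if len(members) < 2:
                        continue                # diverged: not a duplicate
                    elif chunk == b"":
                        yield [paths[i] for i in members]  # hit EOF together: equal
                    else:
                        next_groups.append(members)
            groups = next_groups
    finally:
        for f in handles:
            f.close()

Only files that really are identical have to be read to the end; everything else drops out at the first chunk where it diverges, which is why the total is usually well under m full reads.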


-pu
--
http://mail.python.org/mailman/listinfo/python-list
