David Eppstein wrote:

> Well, but the spec didn't say efficiency was the primary criterion, it said
> minimizing the number of comparisons was.

That's exactly what my program does.

> More seriously, the best I can think of that doesn't use a strong slow hash
> would be to group files by (file size, cheap hash), then compare each file in
> a group with a representative of each distinct file found among earlier files
> in the same group -- that leads to an average of about three reads per
> duplicated file copy: one to hash it, and two for the comparison between it
> and its representative (almost all of the comparisons will turn out equal,
> but you still need to check unless you use a strong hash).

My point is: forget hashes. If you work with hashes, you do have to read each file completely, cheap hash or not. My program normally reads *at most* 100% of the files to analyse, but usually much less. Also, I do plain comparisons, which are much cheaper than hash calculations.
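Stripped down to a sketch, the idea is roughly this (a much-simplified illustration, not the actual fdups code):

import os
from collections import defaultdict

BLOCK = 64 * 1024   # bytes read per file and per round -- arbitrary for this sketch

def identical_file_groups(paths):
    """Yield groups of byte-identical files, without computing any hash."""
    # Only files of the same size can possibly be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    for size, candidates in by_size.items():
        if len(candidates) < 2:
            continue
        # Keep one open handle per candidate and compare them block by block,
        # splitting the group as soon as blocks differ.  A file is never read
        # past its first distinguishing block.
        groups = [[(p, open(p, 'rb')) for p in candidates]]
        offset = 0
        while offset < size and groups:
            next_groups = []
            for group in groups:
                by_block = defaultdict(list)
                for path, fh in group:
                    by_block[fh.read(BLOCK)].append((path, fh))
                for subgroup in by_block.values():
                    if len(subgroup) > 1:
                        next_groups.append(subgroup)
                    else:
                        subgroup[0][1].close()   # now known unique: stop reading it
            groups = next_groups
            offset += BLOCK
        for group in groups:
            for path, fh in group:
                fh.close()
            yield [path for path, fh in group]

Files that are unique in size are never opened at all, and files that differ early are dropped after their first block, which is why the total amount read is usually well below 100%.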


> I'm assuming of course that there are too many files and/or they're too
> large just to keep them all in core.

I assume there are enough file handles available to keep one open per file of the same size. This led to trouble on Windows installations, but I guess that's a parameter one can change. On Linux, I never had the problem.


Regarding buffer size, I use a maximum which is then split up between all open files.
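In other words, something along these lines (a sketch with made-up numbers, not the actual fdups settings):

MAX_BUFFER = 8 * 1024 * 1024      # total read budget per comparison round

def block_size(open_file_count):
    # Split the global budget evenly over the files currently open for
    # comparison, with a floor so the reads never get absurdly small.
    return max(4096, MAX_BUFFER // max(1, open_file_count))

A size group with many members is then simply read in thinner slices, and overall memory use stays bounded.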

> Anyone have any data on whether reading files and SHA-256'ing them (or
> whatever other cryptographic hash you think is strong enough) is I/O-bound
> or CPU-bound? That is, is three reads with very little CPU overhead really
> cheaper than one read with a strong hash?

It also depends on the OS. I found that my program runs much slower on Windows, probably due to the way Linux anticipates reads and tries to reduce head movement.
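One crude way to get data on that would be to time a plain read against a read plus SHA-256 over the same amount of data, roughly like this (a sketch; the outcome obviously depends on disk, CPU and OS):

import hashlib
import time

BLOCK = 1024 * 1024   # 1 MB reads

def read_only(path):
    with open(path, 'rb') as f:
        while f.read(BLOCK):
            pass

def read_and_hash(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            h.update(data)
    return h.hexdigest()

def timed(func, path):
    start = time.perf_counter()
    func(path)
    return time.perf_counter() - start

If read_and_hash on a large file is barely slower than read_only on another (equally large, not yet cached) file, the job is I/O-bound and the hash is nearly free; if it is much slower, the hash is the bottleneck. Use different files for the two runs so the second one is not served from the OS page cache.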


> I guess it also depends on the number of files you expect to have duplicates
> of. If most of the files exist in only one copy, it's clear that the cheap
> hash will find them more cheaply than the expensive hash. In that case you
> could combine the (file size, cheap hash) filtering with the expensive hash
> and get only two reads per copy rather than three.
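If I understand that correctly, the combined scheme would look roughly like this (my own sketch of what you describe, not code from either program; treating the "cheap hash" as a hash of the first few KB is just my assumption):

import hashlib
import os
from collections import defaultdict

CHEAP_BYTES = 16 * 1024     # how much the cheap hash looks at -- my guess

def cheap_key(path):
    # First filter: file size plus a hash of the first few KB (one short read).
    with open(path, 'rb') as f:
        head = f.read(CHEAP_BYTES)
    return os.path.getsize(path), hashlib.md5(head).hexdigest()

def strong_key(path, block=1024 * 1024):
    # Second filter: a full SHA-256 of the file (the one full read per copy).
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            data = f.read(block)
            if not data:
                break
            h.update(data)
    return h.hexdigest()

def duplicate_groups(paths):
    by_cheap = defaultdict(list)
    for p in paths:
        by_cheap[cheap_key(p)].append(p)
    for candidates in by_cheap.values():
        if len(candidates) < 2:
            continue                       # already unique after the cheap filter
        by_strong = defaultdict(list)
        for p in candidates:
            by_strong[strong_key(p)].append(p)
        for group in by_strong.values():
            if len(group) > 1:
                yield group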

Sorry, but I still can't see the point of using hashes. Maybe you'll have a look at my program and tell me where a hash could be useful?


It's available at http://www.homepages.lu/pu/fdups.html.

Regards,
-pu
