John Machin wrote:

Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time


Shouldn't you add the additional comparison time that has to be done after the hash calculation? Hashes do not give a 100% guarantee. If a large number of files end up with identical hashes, you still need to read all of those files to make sure they really are identical.
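
Just to make that concrete, here is a minimal sketch of what the hash-based approach has to do (my own illustration, not anybody's actual program): every file is read in full to compute its digest, and files that share a digest are still only candidates that have to be verified byte for byte if you want to be certain:

import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    # the whole file is read once just to get the digest
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.digest()

def same_content(path_a, path_b, chunk_size=65536):
    # byte-for-byte comparison, still needed to be 100% sure
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                return False
            if not a:
                return True

def possible_duplicates(paths):
    groups = defaultdict(list)
    for p in paths:                      # every file is read in full: O(N*S)
        groups[file_digest(p)].append(p)
    # files sharing a digest are only candidates; they still have to be
    # compared (or you accept the small risk of a collision)
    return [g for g in groups.values() if len(g) > 1]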

Just to explain why I appear to be a lawyer about this: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I came up with 2 possible reasons myself:

1) it's easier to program: you don't compare several files in parallel, you just process them one by one. But it's not perfect, and you still need to compare afterwards. In the worst case, you end up with 3 files with identical hashes, of which 2 are identical and 1 is not. To find that out, you'd still have to implement the algorithm I use (a rough sketch of it follows after point 2), unless you say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want to reduce bandwidth. A hash lets you do that at the cost of local CPU and disk usage, which may be OK. That was not my case.
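
For what it's worth, here is a rough, simplified sketch of the parallel comparison I mean (not the actual code of my program; it assumes the files have already been grouped by size and that you can keep them all open at once): read all candidates block by block and split the group as soon as the blocks differ, so a group of files that differ after d bytes is only read up to about d bytes.

def group_identical(paths, chunk_size=65536):
    # paths are assumed to be files of the same size
    handles = {p: open(p, 'rb') for p in paths}
    groups = [list(paths)]
    try:
        while True:
            next_groups = []
            more_data = False
            for group in groups:
                if len(group) < 2:
                    # a unique file: no need to read it any further
                    next_groups.append(group)
                    continue
                # read the next block of every file still in this group
                blocks = {p: handles[p].read(chunk_size) for p in group}
                if any(blocks.values()):
                    more_data = True
                # split the group by block content
                split = {}
                for p in group:
                    split.setdefault(blocks[p], []).append(p)
                next_groups.extend(split.values())
            groups = next_groups
            if not more_data:
                break
    finally:
        for f in handles.values():
            f.close()
    # groups with more than one member are genuinely identical files
    return [g for g in groups if len(g) > 1]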

-pu