Patrick Useldinger wrote:
> John Machin wrote:
>
>> Just look at the efficiency of processing N files of the same size S,
>> where they differ after d bytes: [If they don't differ, d = S]
>>
>> PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d,
>> which is important for small N and large d]
>>
>> Hashing method: O(NS) reading time, O(NS) hash calc time
>
> Shouldn't you add the additional comparison time that has to be done
> after hash calculation? Hashes do not give a 100% guarantee. If there's
> a large number of identical hashes, you'd still need to read all of
> these files to make sure.
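The O(Nd) behaviour of the direct-comparison method comes from reading the candidate files block by block in parallel and splitting a group as soon as the current blocks differ, so no file is read past its first differing byte. A minimal sketch (this is my own illustration, not the actual program under discussion; the block size and function name are made up):

```python
from collections import defaultdict

CHUNK = 8192  # illustrative block size, not from the original program

def group_identical(paths):
    """Group same-size files by content, reading only until they diverge.

    Files are compared block by block in parallel; a group splits as soon
    as its members' current blocks differ, so each file is read only up
    to the first differing byte (d), not its full size (S).
    """
    handles = {p: open(p, 'rb') for p in paths}
    try:
        groups = [list(paths)]
        results = []
        while groups:
            next_groups = []
            for group in groups:
                if len(group) < 2:
                    continue  # nothing left to compare against
                buckets = defaultdict(list)
                for p in group:
                    buckets[handles[p].read(CHUNK)].append(p)
                for block, members in buckets.items():
                    if len(members) < 2:
                        continue  # unique content so far: not a duplicate
                    if block == b'':
                        results.append(members)  # EOF together: identical
                    else:
                        next_groups.append(members)
            groups = next_groups
        return results
    finally:
        for f in handles.values():
            f.close()
```

Note that this keeps one handle open per candidate file, which is exactly why the maximum-open-files question below matters.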
Maybe I was wrong: lawyers are noted for irritating precision. You meant
to say in your own defence: "If there are *any* number (n >= 2) of
identical hashes, you'd still need to *RE*-read and *compare* ...".

1. You are ahead already, even before adding any extra time for
comparison of files with equal hashes: the hashing method reads all S
bytes of every file, while direct comparison stops after the first d.

2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody in
their right mind [1] would contemplate automatically deleting n-1 out of
a bunch of n reportedly duplicate files without further investigation.
Duplicate files are usually (in the same directory with different names,
or in different-but-related directories with the same names) and/or
(have a plausible explanation for how they were duplicated) -- the
one-in-a-zillion-chance false positive should stand out as implausible.

Different subject: the maximum number of files that can be open at once.
I raised this issue with you because I had painful memories of having to
work around max=20, years ago on MS-DOS, and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/stdout/stderr to get a plausible number]. This seems high enough
to me compared to the likely number of files of the same size, but you
might like to consider a fall-back detection method instead of just
quitting immediately if you run out of handles.

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux. Re (a): what
evidence do you have? Re (b): famous last words! How long would it take
you to do a test and announce the margin of safety that you have?

[1] Yes, I am aware of the subject of this thread.

Regards,
John
--
http://mail.python.org/mailman/listinfo/python-list
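P.S. On "vanishingly small" in point 2: the standard birthday-bound approximation makes it concrete. A hypothetical helper (the function name, and the choice of a million files against a 128-bit hash, are just illustrative):

```python
import math

def collision_probability(n_files, hash_bits):
    """Birthday-bound estimate of the chance that any two of n_files
    distinct files share a hash, assuming a uniform hash_bits-bit hash.

    Uses the standard approximation p ~ 1 - exp(-n(n-1) / 2**(bits+1)),
    computed with expm1 for accuracy when the probability is tiny.
    """
    n = n_files
    return -math.expm1(-n * (n - 1) / 2.0 ** (hash_bits + 1))
```

For a million distinct files and a 128-bit hash this comes out around 10**-27, which is the "one-in-a-zillion" figure above: far less likely than, say, a disk read error during the comparison itself.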
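P.P.S. The test suggested above (count successful opens until the OS refuses) takes only a few lines. A sketch, assuming it is enough to open one temporary file repeatedly; the `limit` cap is my own addition so the probe cannot run away on systems with very high limits:

```python
import os
import tempfile

def max_open_files(limit=4096):
    """Probe how many extra files this process can open at once.

    Repeatedly opens one temporary file until the OS refuses or `limit`
    is reached, then closes everything and returns the count. stdin,
    stdout and stderr are not included in the figure (hence "add 3").
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    handles = []
    try:
        while len(handles) < limit:
            try:
                handles.append(open(path, 'rb'))
            except OSError:
                break  # out of handles: this is the empirical maximum
        return len(handles)
    finally:
        for f in handles:
            f.close()
        os.remove(path)
```

Comparing the returned figure against the largest same-size group you expect gives exactly the margin of safety asked for above.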