John Machin wrote:

> Maybe I was wrong: lawyers are noted for irritating precision. You
> meant to say in your own defence: "If there are *any* number (n >= 2)
> of identical hashes, you'd still need to *RE*-read and *compare* ...".

Right, that is what I meant.

> 2. As others have explained, with a decent hash function, the
> probability of a false positive is vanishingly small. Further, nobody
> in their right mind [1] would contemplate automatically deleting n-1
> out of a bunch of n reportedly duplicate files without further
> investigation. Duplicate files are usually (in the same directory with
> different names or in different-but-related directories with the same
> names) and/or (have a plausible explanation for how they were
> duplicated) -- the one-in-zillion-chance false-positive should stand
> out as implausible.

Still, if you can get it 100% right automatically, why would you bother checking manually? Why fall back on arguments like "impossible", "implausible" or "can't be" when you can have a simple and correct answer - yes or no?
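
To make that concrete: once two candidates have the same size (and the same hash, if a hash is used at all), a byte-for-byte comparison gives that definite yes-or-no. A minimal sketch, not the actual fdups code - the helper name and block size are just illustrative:

    def same_content(path_a, path_b, block_size=64 * 1024):
        """Return True only if the two files are byte-for-byte identical."""
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                block_a = fa.read(block_size)
                block_b = fb.read(block_size)
                if block_a != block_b:
                    return False        # a definite "no"
                if not block_a:         # both files exhausted at the same point
                    return True         # a definite "yes"

No hash collision, however improbable, can turn this into a false positive, because the verdict comes from the bytes themselves.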


Anyway, fdups does nothing other than report duplicates. Deleting, hardlinking or anything else might be an option depending on the context in which you use fdups, but then we'd have to discuss that context. I never assumed any context, in order to keep it as universal as possible.

> Different subject: maximum number of files that can be open at once. I
> raised this issue with you because I had painful memories of having to
> work around max=20 years ago on MS-DOS and was aware that this magic
> number was copied blindly from early Unix. I did tell you that
> empirically I could get 509 successful opens on Win 2000 [add 3 for
> stdin/out/err to get a plausible number] -- this seems high enough to
> me compared to the likely number of files with the same size -- but you
> might like to consider a fall-back detection method instead of just
> quitting immediately if you ran out of handles.

For the time being, the additional files are ignored and a warning is issued. fdups does not quit, so why do you say it does?
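
For clarity, this is roughly what I mean by "ignored with a warning" - a sketch with invented names, not the actual fdups code: candidates that cannot be opened because the handle limit has been reached are reported and dropped from the comparison set instead of aborting the run.

    import warnings

    def open_candidates(paths):
        """Open as many candidate files as the OS allows; warn about the rest."""
        handles = []
        for i, path in enumerate(paths):
            try:
                handles.append(open(path, 'rb'))
            except (IOError, OSError):   # e.g. "too many open files"
                warnings.warn("out of file handles, ignoring %d remaining file(s)"
                              % (len(paths) - i))
                break
        return handles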


A fallback solution would be to open the file before every _block_ read, and close it afterwards. In my mind, it would be a command-line option, because it's difficult to determine the number of available file handles in a multitasking environment.
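
Something along these lines - again only a sketch under my own assumptions (the names, the block size and the restriction to same-sized files are illustrative): each file is reopened for every block and closed again right away, so the number of simultaneously open handles stays constant no matter how many candidates there are.

    def read_block(path, offset, block_size=8192):
        """Read one block at the given offset, holding the handle only briefly."""
        f = open(path, 'rb')
        try:
            f.seek(offset)
            return f.read(block_size)
        finally:
            f.close()

    def all_identical(paths, block_size=8192):
        """Byte-compare same-sized files without keeping their handles open."""
        offset = 0
        while True:
            blocks = [read_block(p, offset, block_size) for p in paths]
            if any(b != blocks[0] for b in blocks[1:]):
                return False
            if not blocks[0]:            # end of all files reached together
                return True
            offset += block_size

The obvious price is one open/seek/close per file per block, which is why I would rather make it a command-line option than the default.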

It is not difficult to implement, but I first wanted to refactor the code into a proper class that can be used in other Python programs, as you also asked. That is what I sent you tonight. It's not that I don't care about the file-handle problem, it's just that I make changes according to my own priorities.

> You wrote at some stage in this thread that (a) this caused problems on
> Windows and (b) you hadn't had any such problems on Linux.
>
> Re (a): what evidence do you have?

I've run into this myself on my girlfriend's XP box, and it was certainly with fewer than 500 files of the same length.


> Re (b): famous last words! How long would it take you to do a test and
> announce the margin of safety that you have?

Sorry, I do not understand what you mean by this.

-pu