On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> I'm indexing ~10000 documents per day but since I'm getting a lot of
real duplicates (100% the same document content) I want to check the
content before indexing...
> My idea is to create a checksum of the documents content and store it
> within document inside the index, before indexing a new document I
> will compare the new documents checksum with the ones in the index.
>
Is that a good idea? does someone have experiences with that method?
any tools available?
That could work.
You will need a big sum though. MD5?
Just as a reference, Nutch uses an MD5 digest to detect duplicate web
pages. It works fine, except of course when two docs differ by only
an insignificant text delta. There's some recent work in this area -
check out TextProfileSignature.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]