Re: Checking for duplicates inside index

Ken Krugler Mon, 22 May 2006 21:03:47 -0700

On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:

 > I'm indexing ~10000 documents per day but since I'm getting a lot of

 real duplicates (100% the same document content) I want to check the
 content before indexing...

 > My idea is to create a checksum of the documents content and store it
 > within document inside the index, before indexing a new document I
 > will compare the new documents checksum with the ones in the index.
 >

 Is that a good idea? does someone have experiences with that method?
 any tools available?


That could work.

You will need a big sum though. MD5?

Just as a reference, Nutch uses an MD5 digest to detect duplicate webpages. It works fine, except of course when two docs differ by onlyan insignificant text delta. There's some recent work in this area -check out TextProfileSignature.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Checking for duplicates inside index

Reply via email to