Ken Krugler wrote:
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:

> I'm indexing ~10000 documents per day, but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
>
> My idea is to create a checksum of the document's content and store it
> within the document inside the index; before indexing a new document I
> will compare the new document's checksum with the ones in the index.
>
> Is that a good idea? Does someone have experience with that method?
> Are any tools available?

That could work.

You will need a big sum, though. MD5?

Just as a reference, Nutch uses an MD5 digest to detect duplicate web pages. It works fine, except of course when two docs differ by only an insignificant text delta. There's some recent work in this area - check out TextProfileSignature.
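
Roughly, the dedup check could look like this. This is only a minimal sketch against the Lucene 1.9-era API; the "md5" and "content" field names, and the class itself, are just for illustration:

import java.security.MessageDigest;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ChecksumDedup {

    // Hex-encode the MD5 digest of the raw document content.
    static String md5Hex(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes("UTF-8"));
        StringBuffer hex = new StringBuffer(digest.length * 2);
        for (int i = 0; i < digest.length; i++) {
            hex.append(Character.forDigit((digest[i] >> 4) & 0xf, 16));
            hex.append(Character.forDigit(digest[i] & 0xf, 16));
        }
        return hex.toString();
    }

    // Only add the document if no doc with the same checksum is indexed.
    // Note: the searcher only sees documents that were committed when it
    // was opened, so duplicates within one batch need a separate check.
    static void indexIfNew(IndexWriter writer, IndexSearcher searcher,
                           String content) throws Exception {
        String sum = md5Hex(content);
        Hits hits = searcher.search(new TermQuery(new Term("md5", sum)));
        if (hits.length() > 0) {
            return; // exact duplicate already in the index
        }
        Document doc = new Document();
        // UN_TOKENIZED so the checksum is indexed as a single term
        doc.add(new Field("md5", sum, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}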

-- Ken
Hi,

thank you very much - I'm currently exploring the options and found an interesting locality-sensitive hash called nilsimsa: http://ixazon.dynip.com/~cmeclax/nilsimsa.html (I'm still searching for a Java implementation).
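
For what it's worth, the digest comparison itself looks straightforward to write. A sketch assuming 32-byte (256-bit) nilsimsa digests and the usual score of matching bits minus 128 (so 128 means identical and -128 means every bit differs); computing the digests is left to whatever implementation turns up:

// Compare two 32-byte nilsimsa digests; returns a score in -128..128.
static int nilsimsaScore(byte[] a, byte[] b) {
    int matching = 0;
    for (int i = 0; i < 32; i++) {
        // XOR marks differing bits; count the bits that match instead.
        matching += 8 - Integer.bitCount((a[i] ^ b[i]) & 0xff);
    }
    return matching - 128;
}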

H.
