Ken Krugler wrote:
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:

> I'm indexing ~10000 documents per day, but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
>
> My idea is to create a checksum of the document's content and store it
> within the document inside the index; before indexing a new document I
> will compare the new document's checksum with the ones in the index.
>
> Is that a good idea? Does someone have experience with that method?
> Are any tools available?

That could work.

You will need a big sum, though. MD5?

Just as a reference, Nutch uses an MD5 digest to detect duplicate web pages. It works fine, except of course when two docs differ by only an insignificant text delta. There's some recent work in this area - check out TextProfileSignature.
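
Roughly, the dedup check could look like this. This is only a minimal sketch against the Lucene 1.9-era API; the "md5" and "content" field names, and the class itself, are just for illustration:

import java.security.MessageDigest;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ChecksumDedup {

    // Hex-encode the MD5 digest of the raw document content.
    static String md5Hex(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes("UTF-8"));
        StringBuffer hex = new StringBuffer(digest.length * 2);
        for (int i = 0; i < digest.length; i++) {
            hex.append(Character.forDigit((digest[i] >> 4) & 0xf, 16));
            hex.append(Character.forDigit(digest[i] & 0xf, 16));
        }
        return hex.toString();
    }

    // Only add the document if no doc with the same checksum is indexed.
    // Note: the searcher only sees documents that were committed when it
    // was opened, so duplicates within one batch need a separate check.
    static void indexIfNew(IndexWriter writer, IndexSearcher searcher,
                           String content) throws Exception {
        String sum = md5Hex(content);
        Hits hits = searcher.search(new TermQuery(new Term("md5", sum)));
        if (hits.length() > 0) {
            return; // exact duplicate already in the index
        }
        Document doc = new Document();
        // UN_TOKENIZED so the checksum is indexed as a single term
        doc.add(new Field("md5", sum, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}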

-- Ken
Hi,

thank you very much - I'm currently exploring the options and found an interesting locality-sensitive hash called nilsimsa: http://ixazon.dynip.com/~cmeclax/nilsimsa.html (I'm still searching for a Java implementation).
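
For what it's worth, the digest comparison itself looks straightforward to write. A sketch assuming 32-byte (256-bit) nilsimsa digests and the usual score of matching bits minus 128 (so 128 means identical and -128 means every bit differs); computing the digests is left to whatever implementation turns up:

// Compare two 32-byte nilsimsa digests; returns a score in -128..128.
static int nilsimsaScore(byte[] a, byte[] b) {
    int matching = 0;
    for (int i = 0; i < 32; i++) {
        // XOR marks differing bits; count the bits that match instead.
        matching += 8 - Integer.bitCount((a[i] ^ b[i]) & 0xff);
    }
    return matching - 128;
}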

H.
