I have created a method that can delete duplicate docs. Basically,
during indexing, each doc is associated with an id (a term field that
you define) that is indexed. Then you call the method to delete
duplicates whenever you update the index.
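
A minimal sketch of what such a dedupe pass might look like. This is
not the actual method, only an illustration in plain Java with no
Lucene calls. It assumes the id of every doc can be listed in index
order and that the most recently added doc for each id is the one to
keep; the names DuplicateFinder and positionsToDelete are made up for
the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateFinder {
    /**
     * ids.get(i) is the id of the doc at position i.
     * Returns the positions of docs that share an id with a later doc,
     * i.e. everything except the last occurrence of each id.
     */
    static List<Integer> positionsToDelete(List<String> ids) {
        Map<String, Integer> latest = new HashMap<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            // put() returns the previous position for this id, if any
            Integer previous = latest.put(ids.get(i), i);
            if (previous != null) {
                toDelete.add(previous);
            }
        }
        Collections.sort(toDelete);
        return toDelete;
    }
}
```

In a real index the returned positions would then be handed to
whatever delete call your Lucene version offers.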

I haven't contributed it back to the Lucene community yet because our
code is still in beta testing.

My former colleague, Chris, got agreement from Doug Cutting last August
that this feature would be nice to have.

Eugene


-----Original Message-----
From: Omar Didi [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 22, 2006 6:47 PM
To: java-user@lucene.apache.org
Subject: RE: Checking for duplicates inside index

You have two choices that I can think of:
1- Before adding a document, check whether it already exists in the
index. You can do this by querying on a unique field, if you have one.
2- Index all your documents, and once the indexing is done, dedupe.
(Lucene has built-in methods that can help with this.)


If your index doesn't have a unique key, you need to add one, like the
one you suggested.
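
A rough sketch of choice 1, check before adding. To keep it runnable
without Lucene, an in-memory set stands in for the query on the unique
field; that stand-in and the names SeenKeys and addIfAbsent are
assumptions made for the example, not part of any Lucene API.

```java
import java.util.HashSet;
import java.util.Set;

class SeenKeys {
    // Stand-in for "query the index on the unique field"
    private final Set<String> keys = new HashSet<>();

    /**
     * Records the key and returns true if it was not seen before
     * (the caller should index the document), or false if it is a
     * duplicate (the caller should skip it).
     */
    boolean addIfAbsent(String uniqueKey) {
        return keys.add(uniqueKey);
    }
}
```

With a real index, the set lookup would be replaced by a search on the
unique field before each add.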

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 6:05 PM
To: java-user@lucene.apache.org
Subject: Re: Checking for duplicates inside index


On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> 
> I'm indexing ~10000 documents per day, but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
> 
> My idea is to create a checksum of the document's content and store it
> with the document inside the index. Before indexing a new document, I
> will compare the new document's checksum with the ones in the index.
> 
> Is that a good idea? Does anyone have experience with that method?
> Are any tools available?

That could work.

You will need a big sum though. MD5?
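
For what it's worth, a minimal sketch of computing such a checksum with
the JDK's own MessageDigest. The class and method names are made up for
the example, and it assumes the content is available as a String and
that UTF-8 is an acceptable encoding for hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentChecksum {
    /** Returns the MD5 digest of the content as 32 lowercase hex characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The hex string could be stored in its own untokenized field, so that an
exact-match lookup on it tells you whether the same content is already
indexed.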



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

