you have two choices that I can think of: 1- before adding a document, check if it does't exist in the index. you can do this by querying on a unique field if you have it . 2- you can index all your documents, and once the indexing is done you can dedupe. (Lucene has built in methods that can help with this)
if your index doesn't have a unique key, you need to add one like the one you suggested. -----Original Message----- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Monday, May 22, 2006 6:05 PM To: [email protected] Subject: Re: Checking for duplicates inside index On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > > I'm indexing ~10000 documents per day but since I'm getting a lot of > real duplicates (100% the same document content) I want to check the > content before indexing... > > My idea is to create a checksum of the documents content and store it > within document inside the index, before indexing a new document I > will compare the new documents checksum with the ones in the index. > > Is that a good idea? does someone have experiences with that method? > any tools available? That could work. You will need a big sum though. MD5? --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
