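You could maintain a Bloom filter over the de-dup key and run the exact index
search only for keys the filter reports as "positives", to rule out false
positives. A key the filter reports as absent is definitely new, so it can be
added without any lookup. If only a small percentage of your updates are
duplicates (unique documents dominate), this should help a lot on the
performance side.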
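
Roughly what I have in mind, as a minimal sketch rather than a tested
implementation. DedupGate, shouldIndex and the existsInIndex callback are names
I made up for illustration; none of them are Lucene API. The callback stands in
for whatever exact lookup you already do against the index on your key field.
The sketch assumes the filter has seen every key that is already in the index
(start from an empty index, or pre-populate the filter from the existing keys).

```java
import java.util.BitSet;
import java.util.function.Predicate;

/** Hypothetical helper (not part of Lucene): a Bloom filter in front of an exact lookup. */
final class DedupGate {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    // Exact duplicate check supplied by the caller, e.g. a lookup on the key field.
    private final Predicate<String> existsInIndex;
    private int exactLookups = 0;

    DedupGate(int numBits, int numHashes, Predicate<String> existsInIndex) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.existsInIndex = existsInIndex;
    }

    // Double hashing: derive the i-th bit position from two base hashes of the key.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    private boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false; // at least one bit unset: key was never added
            }
        }
        return true; // all bits set: seen before, or a false positive
    }

    private void remember(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /**
     * Returns true if the key is new and the caller should add the document.
     * Not thread-safe: the check and the subsequent add still have to go
     * through a single choke point.
     */
    boolean shouldIndex(String key) {
        if (mightContain(key)) {
            exactLookups++; // only "positives" pay for the exact search
            if (existsInIndex.test(key)) {
                return false; // confirmed duplicate
            }
            // false positive: fall through and treat the key as new
        }
        remember(key);
        return true;
    }

    int exactLookups() {
        return exactLookups;
    }
}
```

You would call shouldIndex(key) before each add and skip the document when it
returns false. Keys the filter has never seen cost only a few bit tests, so the
exact search is paid only for real duplicates plus the occasional false
positive. The filter size and number of hashes are yours to tune.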



----- Original Message ----
> From: markharw00d <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, 21 July, 2008 8:44:26 PM
> Subject: Re: How to avoid duplicate records in lucene
> 
> >>could you define duplicate?
> 
> That's your choice of field that you want to de-dup on.
> That could be a field such as "DatabasePrimaryKey" or perhaps a field 
> containing an MD5 hash of document content.
> The DuplicateFilter ensures only one document can exist in results for 
> each unique value for the choice of field.
> 
> Cheers
> Mark
> 
> Erick Erickson wrote:
> > could you define duplicate? As far as I know, you don't
> > get the same (internal) doc id back more than once, so what
> > is a duplicate?
> >
> > Best
> > Erick
> >
> > On Mon, Jul 21, 2008 at 9:40 AM, Sebastin wrote:
> >
> >  
> >> At search time, while querying the data.
> >> markrmiller wrote:
> >>    
> >>> Sebastin wrote:
> >>>      
> >>>> Hi All,
> >>>>
> >>>> Is there any possibility to avoid duplicate records in lucene  2.3.1?
> >>>>
> >>>>        
> >>> I don't believe that there is a very high performance way to do this.
> >>> You are basically going to have to query the index for an id before
> >>> adding a new doc. The best way I can think of off the top of my head is
> >>> to batch - first check that ids in the batch are unique, then check all
> >>> ids in the batch against the IndexReader, then add the ones that are not
> >>> dupes. Of course all of your docs would have to be added through this
> >>> single choke point so that you knew other threads had not added that id
> >>> after the first thread had looked but before it added the doc.
> >>>
> >>> I think Mark H has you covered if getting the dupes out after are okay.
> >>>
> >>> - Mark
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>
> >>>
> >>>
> >>>      
> >> --
> >> View this message in context:
> >> http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >
> >  
> 
> 
> 



