Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
> Is there a way to eliminate duplicate hits being returned from the index?

Sure, don't put duplicate documents in the index :)

Erik

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 8:35 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
> Is there a way

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
> Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

I was dealing with a similar requirement recently. I eventually decided on storing the MD5 checksum of the document as a keyword. It means reading it
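A minimal sketch of the checksum step John describes, using only the Java standard library. The thread does not show John's code, so the class name Md5Checksum and the method md5Hex are hypothetical. Storing the resulting hex string as an untokenized keyword field in the Lucene document is assumed and not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the MD5 checksum of a document's raw bytes
// as a lowercase hex string suitable for storing as a keyword value.
class Md5Checksum {

    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(doc)); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Two documents with identical bytes produce the same checksum, so the keyword value can serve as the duplicate key. Documents that differ only in whitespace or metadata would still get different checksums.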

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
-----Original Message-----
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:06 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

Jerry Jalenak wrote:
> Given Erik's response of 'don't put duplicate documents in the index', how can I

Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote:
> Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

As John said - you'll have to come up with some way of knowing whether you should index or not. For example, when dealing with

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
> Nice idea John - one I hadn't considered. Once you have the checksum, do you 'check' in the index first before storing the second document? Or do you filter on the query side?

I do a quick search for the MD5 checksum before indexing. Although I suspect not applicable in
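The check-before-index flow can be sketched as below. This is an illustration under stated assumptions, not John's code: an in-memory Set stands in for the search against the index, and the names ChecksumGate and shouldIndex are hypothetical. In a real setup the lookup would be a query on the checksum keyword field.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical gate: decides whether a document should be indexed,
// based on whether its checksum has been seen before.
class ChecksumGate {

    // Assumption: stands in for "search the index for this checksum".
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a checksum is offered, false for duplicates.
    boolean shouldIndex(String checksum) {
        return seen.add(checksum);
    }

    public static void main(String[] args) {
        ChecksumGate gate = new ChecksumGate();
        String[] checksums = { "aaa", "bbb", "aaa" };
        for (String c : checksums) {
            if (gate.shouldIndex(c)) {
                System.out.println("index " + c);
            } else {
                System.out.println("skip duplicate " + c);
            }
        }
    }
}
```

The gate only catches duplicates it has already been told about, so checksums of documents indexed in earlier runs would need to be loaded first or looked up in the index itself.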

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

-----Original Message-----
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:39 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

Jerry Jalenak

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Sent: Tuesday, February 01, 2005 9:48 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

Jerry Jalenak wrote:
> Just to make sure I understand. Do you keep an IndexReader open at the same time you are running the IndexWriter? From what I can see in the JavaDocs, it looks like only

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
> OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open. I may have to try and deal with

Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
> OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.

Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote:
> On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
>> OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an

Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread PA
On Jan 24, 2005, at 09:14, Jason Polites wrote:
> I am aware of the Filter object; however, the unique identifier of my document is a field within the Lucene document itself (messageid), and I am reluctant to access this field using the public API for every Hit as I fear it will have drastic

Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
there are several hundred or several thousand distinct indexes.

Thanks,

- JP

----- Original Message -----
From: PA [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Monday, January 24, 2005 10:43 PM
Subject: Re: Duplicate hits using ParallelMultiSearcher

On Jan 24, 2005