Just to make sure I understand: do you keep an IndexReader open at the same time you are running the IndexWriter? From what I can see in the JavaDocs, it looks like only IndexReader (or IndexSearcher) can peek into the index and see whether a document exists or not.
Thanks!

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

-----Original Message-----
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:39 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

Jerry Jalenak wrote:

>Nice idea John - one I hadn't considered. Once you have the checksum, do
>you 'check' in the index first before storing the second document? Or do
>you filter on the query side?

I do a quick search for the md5 checksum before indexing. Although I suspect not applicable in your case, I also maintained a "last time something was indexed" time alongside the index. I used this to drastically prune the number of documents that needed to be considered for indexing if I restarted; anything modified before then wasn't a candidate.

Since the MD5 checksum provides the definitive (for a sufficiently loose definition of definitive) indication of whether a document is indexed I didn't need to worry about ultra-fine granularity in the time stamp and I didn't need to worry about it being committed to disk; it generally got committed to the magnetic stuff every few seconds or so.

It does help a lot though if documents have nice unique identifiers that you can use instead, then you can use the identifier and the last modified time to decide whether or not to re-index.

jch
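For anyone reading this thread later, here is a rough sketch of the approach John describes: compute an MD5 checksum, look it up before indexing, and use a coarse "last indexed" time to prune candidates after a restart. This is not John's code and it does not use the Lucene API. The class `DedupIndexer`, its methods, and the in-memory `HashSet` are all hypothetical stand-ins; in a real setup the set lookup would presumably be a search on a checksum field in the index.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the HashSet stands in for "search the index for this checksum".
class DedupIndexer {
    private final Set<String> indexedChecksums = new HashSet<>();
    // Assumed to be the "last time something was indexed" value loaded at restart.
    private final long cutoffMillis;

    DedupIndexer(long cutoffMillis) {
        this.cutoffMillis = cutoffMillis;
    }

    // Hex-encoded MD5 digest of the document content.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the document would be indexed, false if it is skipped.
    boolean indexIfNew(byte[] content, long lastModifiedMillis) {
        // Cheap prune: anything modified before the cutoff is not a candidate.
        if (lastModifiedMillis < cutoffMillis) {
            return false;
        }
        // The checksum decides: skip if the same content was already seen.
        if (!indexedChecksums.add(md5Hex(content))) {
            return false;
        }
        // A real implementation would add the document to the index here,
        // storing the checksum so later lookups can find it.
        return true;
    }
}
```

Because the checksum lookup makes the final decision, the cutoff time only has to be roughly right. A slightly stale value just means a few extra documents get checksummed and then skipped as duplicates.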