Nice idea, John - one I hadn't considered. Once you have the checksum, do you check the index for it before storing the second document? Or do you filter on the query side?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

-----Original Message-----
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:06 AM
To: Lucene Users List
Subject: Re: Duplicate Hits

Jerry Jalenak wrote:

>Given Erik's response of 'don't put duplicate documents in the index', how
>can I accomplish this in the IndexWriter?

I was dealing with a similar requirement recently. I eventually decided on storing the MD5 checksum of the document as a keyword. It means reading it twice (once to calculate the checksum, once to index it), but it seems to do the trick.

jch

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
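
For illustration, the checksum idea in John's message, combined with the "check before storing" variant asked about in the reply, might look roughly like the sketch below. It is not taken from the thread: the class name `ChecksumDedup`, the methods `md5Hex` and `addIfNew`, and the in-memory `Set` are all hypothetical. The set merely stands in for the index; in a real Lucene setup the checksum would presumably be stored as a keyword field and looked up there before adding the document.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: skips documents whose MD5 checksum was already seen.
class ChecksumDedup {
    // Assumption: this set stands in for a lookup of the checksum keyword in the index.
    private final Set<String> seenChecksums = new HashSet<>();

    // First pass over the document: compute its MD5 checksum as lowercase hex.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the content is new (and would be indexed in a second pass),
    // false if a document with the same checksum was already recorded.
    boolean addIfNew(byte[] content) {
        return seenChecksums.add(md5Hex(content));
    }

    public static void main(String[] args) {
        ChecksumDedup dedup = new ChecksumDedup();
        byte[] doc = "same text".getBytes(StandardCharsets.UTF_8);
        byte[] other = "different text".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.addIfNew(doc));   // true: first time seen
        System.out.println(dedup.addIfNew(doc));   // false: duplicate checksum
        System.out.println(dedup.addIfNew(other)); // true: different content
    }
}
```

Checking at index time like this keeps duplicates out of the index entirely, which matches Erik's advice quoted above; filtering on the query side would instead leave the duplicates stored and collapse hits by checksum when results are read.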