Take a look at the source in IndexHTML.java (C:\lucene- 2.1.0\src\demo\org\apache\lucene\demo on my machine). The code goes through quite a bit of effort to remove old documents identified by uid. My comment was really that the underlying engine doesn't recognize duplicates, any such requirements must be implemented on top of the base engine.
But as an exercise, how would you imagine the *engine* could implement anything like this? The only thing I can imagine is that a field would be identified as unique (similar to a database UNIQUE constraint on a column). But now we're mixing databases and text searching, and I don't want to go there.... Of course, this would all work if we could just create the DWIM algorithm... Do What I Mean...... Erick On 4/21/07, jim shirreffs <[EMAIL PROTECTED]> wrote:
"Lucene has no concept of "document identity" in that you can index the same document 15 times in a row and Lucene will have 15 entries. " Is this true? When ever I run the demo indexing logic document already indexed are skipped. What am I missing. jim s start java org.apache.lucene.demo.IndexHTML -index /opt/lucene/index .. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]