On Fri, 16 Jul 2004 15:07:11 +0200, Christoph Goller <[EMAIL PROTECTED]> wrote:
> Giulio Cesare Solaroli wrote:
> > This is the main problem; in my current arrangement, it is quite
> > difficult to find out the documents that needs to be updated in
> > advance; it would have been much easier to find out whether every
> > single document where a new entry or a document already present, and
> > thus to update (instead of insert).
>
> You could perhaps do it the other way round, first add all modified
> documents and then delete the old versions.

I have been thinking about this for a while, but could not find a
reasonable solution. The basic problems are:

- Where do I (safely) store the list of documents that need to be
  deleted?
- How can I uniquely identify the Lucene documents that I have to
  delete, given that several Lucene documents match a single "real"
  document?

The second problem could be "easily" solved by adding a kind of version
field (stored in the Lucene index) that is incremented every time a new
version of a document is inserted. That way, when searching for
duplicated documents (using the "real" document ID), I will find a set
of Lucene documents and can delete all but the one with the highest
version number.

The real problem is where to keep the list of documents to be deleted.
I could keep the list in memory, but if my application crashes (or, more
often, we kill it), I will be left with duplicated documents in the
index. I could store it in the DB (where all the real documents are),
but in that case I could only store the real ID, as the DocID in the
Lucene index can change. This is probably feasible, but with quite a
high overhead.

Giulio Cesare Solaroli
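
To make the version-field idea and the DB-backed deletion list a bit more
concrete, here is a rough sketch in Python. It does not use the Lucene API:
the "index" is just a list of dicts, the "DB" is an in-memory SQLite
database, and the names real_id, version, pending_delete, stale_entries,
mark_pending and purge_pending are all made up for illustration. It only
shows the bookkeeping, under the assumption that each index entry stores
the real document ID and a version number.

```python
import sqlite3
from collections import defaultdict


def stale_entries(index_docs):
    """Return every entry except the highest-version one per real ID."""
    by_real_id = defaultdict(list)
    for doc in index_docs:
        by_real_id[doc["real_id"]].append(doc)
    stale = []
    for docs in by_real_id.values():
        newest = max(docs, key=lambda d: d["version"])
        stale.extend(d for d in docs if d is not newest)
    return stale


def open_pending_store(path=":memory:"):
    """Open the (hypothetical) durable list of real IDs awaiting cleanup."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending_delete (real_id TEXT PRIMARY KEY)"
    )
    db.commit()
    return db


def mark_pending(db, real_id):
    """Record that old versions of real_id must be removed later."""
    db.execute(
        "INSERT OR IGNORE INTO pending_delete (real_id) VALUES (?)", (real_id,)
    )
    db.commit()


def purge_pending(db, index_docs):
    """Drop stale versions of all pending real IDs, then clear the list.

    Only real IDs are stored, so this still works if the index-internal
    document numbers changed in the meantime (or after a restart).
    """
    pending = {row[0] for row in db.execute("SELECT real_id FROM pending_delete")}
    candidates = [d for d in index_docs if d["real_id"] in pending]
    stale_ids = {id(d) for d in stale_entries(candidates)}
    kept = [d for d in index_docs if id(d) not in stale_ids]
    db.execute("DELETE FROM pending_delete")
    db.commit()
    return kept
```

With a file path instead of ":memory:", the pending list would survive a
crash or a kill, and purge_pending could simply be run again at startup.
Whether the extra DB write per updated document is acceptable is exactly
the overhead question raised above.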