On Fri, 16 Jul 2004 15:07:11 +0200, Christoph Goller
<[EMAIL PROTECTED]> wrote:
> Giulio Cesare Solaroli wrote:
> > This is the main problem; in my current arrangement, it is quite
> > difficult to find out in advance which documents need to be
> > updated; it would have been much easier to find out whether each
> > single document was a new entry or a document already present, and
> > thus to be updated (instead of inserted).
> 
> You could perhaps do it the other way round, first add all modified
> documents and then delete the old versions.

I have been thinking about this for a while, but could not find a
reasonable solution.
The basic problems are:
- where do I (safely) store the list of the documents that need to be deleted?
- how can I uniquely identify the Lucene documents that I have to
delete, given that several Lucene documents can match a
single "real" document?

The second problem could be "easily" solved by adding a kind of version
field (stored in the Lucene index) that is incremented every time a
new version of a document is inserted. In this way, when searching for
duplicate documents (using the "real" document ID) I will find a set
of Lucene documents and I can delete all but the one with the
highest version number.
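
A minimal sketch of that selection step, in plain Java with no Lucene
calls. The IndexedCopy and StaleCopyFinder names are hypothetical, and
the code assumes the search by "real" ID has already returned the doc
number and version field of every matching Lucene document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one Lucene document: its current doc number,
// the "real" document ID, and the proposed version field.
final class IndexedCopy {
    final int docNumber;
    final String realId;
    final long version;

    IndexedCopy(int docNumber, String realId, long version) {
        this.docNumber = docNumber;
        this.realId = realId;
        this.version = version;
    }
}

final class StaleCopyFinder {
    // Returns the doc numbers of every copy that is not the newest one
    // for its real ID; those are the copies to delete.
    static List<Integer> staleDocNumbers(List<IndexedCopy> copies) {
        // First pass: remember the highest-version copy per real ID.
        Map<String, IndexedCopy> newest = new HashMap<>();
        for (IndexedCopy c : copies) {
            IndexedCopy best = newest.get(c.realId);
            if (best == null || c.version > best.version) {
                newest.put(c.realId, c);
            }
        }
        // Second pass: everything that is not the remembered copy is stale.
        List<Integer> stale = new ArrayList<>();
        for (IndexedCopy c : copies) {
            if (newest.get(c.realId) != c) {
                stale.add(c.docNumber);
            }
        }
        return stale;
    }
}
```

The doc numbers returned are only meaningful for the index state they
were read from, so the deletes would have to be issued before the index
is modified again.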

The real problem is where to keep the list of documents to be deleted. I
could keep the list in memory, but if my application crashes (or, more
often, we kill it), I will have duplicate documents in the index.
I could store it in the DB (where all the real documents are), but in
this case I could only store the real ID, as the DocID in the Lucene
index could change.
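
A rough sketch of such a persistent list, keyed by the real ID only.
The PendingDeletes class is hypothetical, and a plain text file stands
in for the DB table purely to keep the example self-contained; it
assumes the IDs contain no line breaks.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical durable list of "real" document IDs whose old index copies
// still have to be deleted. A plain file stands in for the DB table here.
final class PendingDeletes {
    private final Path file;

    PendingDeletes(Path file) {
        this.file = file;
    }

    // Record the real ID before the new version is added to the index.
    void markPending(String realId) throws IOException {
        Files.write(file,
                Collections.singletonList(realId),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // After a restart, read back every ID that may still have duplicates.
    Set<String> load() throws IOException {
        if (!Files.exists(file)) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Once the old copies of all pending IDs have been deleted, clear the list.
    void clear() throws IOException {
        Files.deleteIfExists(file);
    }
}
```

On startup the application would call load(), search the index for each
real ID, delete the older copies, and then call clear(). This sketch
does not force the write to disk, so a real version would need the DB's
transaction guarantees or an explicit sync.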

This is probably feasible, but with quite a high overhead.

Giulio Cesare Solaroli

