Dmitry Serebrennikov wrote:
Another solution that works well in some applications is to rely on the document number. This number will remain the same for the life of an IndexReader. This number is also always larger for documents added later. So given two documents with the same ID, the one with the higher document number is the later one. The rest can be deleted. One way to store a list of documents easily is to use a filter (which could also be serialized to disk if needed). This filter would only be valid for the IndexReader used to create it.
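
To illustrate the "highest document number wins" rule, here is a rough sketch in plain Java. It does not use the Lucene API at all: the list index simply stands in for the document number, and the class and method names (LatestByDocNumber, staleDocs) are made up for this example.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class LatestByDocNumber {

    // idsByDocNumber.get(n) is the application-level ID of document number n.
    // Returns a bitset with one bit set for every document that has a later
    // copy (higher document number) with the same ID.
    static BitSet staleDocs(List<String> idsByDocNumber) {
        BitSet stale = new BitSet(idsByDocNumber.size());
        Map<String, Integer> latest = new HashMap<>();
        for (int doc = 0; doc < idsByDocNumber.size(); doc++) {
            // put() returns the previously recorded document number, if any.
            Integer previous = latest.put(idsByDocNumber.get(doc), doc);
            if (previous != null) {
                stale.set(previous); // an older copy of the same ID
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        BitSet stale = staleDocs(Arrays.asList("a", "b", "a", "c", "b"));
        System.out.println(stale); // prints {0, 1}
    }
}
```

As with the filter described above, the resulting bitset would only make sense against the same snapshot of document numbers it was computed from.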

So here's a modified sequence of operations, perhaps a bit more efficient than proposed by Christoph:
1) Open an IndexReader for searching - S. Keep it open until the transaction is committed.
2) Open a second IndexReader for deletions - D.
3) Create a filter bitset F (or use any other mechanism for storing document numbers to be deleted).
4) Open an IndexWriter for new documents - W.
5) As documents come in, add them using W. Find their old versions in D and record their document numbers in F. D will not show any new documents, only documents present at the time D was created.
6) Close W.
7) Use D to delete all documents marked in F.
8) Close D.


Step 8 commits the transaction. At this point, another IndexReader S2 can be created and all new searches can go to that. Once all searches using S are done, S can be closed.
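
The sequence above can be modelled with a toy in-memory "index" to see how F gets filled. This is only a sketch under loud assumptions: there is no Lucene here, a document is reduced to its ID, its position in a list plays the role of the document number, and the class UpdateTransactionSketch and its methods are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy stand-in for an index: a document's position in `ids` is its document number.
final class UpdateTransactionSketch {
    private final List<String> ids = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    UpdateTransactionSketch(List<String> initialIds) {
        ids.addAll(initialIds);
    }

    // Models steps 2-8 for one batch of incoming documents and returns F.
    BitSet update(List<String> incomingIds) {
        int snapshotSize = ids.size();        // D only sees documents present now
        BitSet f = new BitSet(snapshotSize);  // step 3: the filter F
        for (String id : incomingIds) {       // step 5
            ids.add(id);                      // add the new version via "W"
            for (int doc = 0; doc < snapshotSize; doc++) {
                // look up old versions in the snapshot "D" and record them in F
                if (!deleted.get(doc) && ids.get(doc).equals(id)) {
                    f.set(doc);
                }
            }
        }
        deleted.or(f);                        // step 7: delete everything marked in F
        return f;
    }

    // IDs of the documents that are not deleted, in document-number order.
    List<String> liveIds() {
        List<String> live = new ArrayList<>();
        for (int doc = 0; doc < ids.size(); doc++) {
            if (!deleted.get(doc)) {
                live.add(ids.get(doc));
            }
        }
        return live;
    }

    public static void main(String[] args) {
        UpdateTransactionSketch index =
                new UpdateTransactionSketch(Arrays.asList("a", "b", "c"));
        BitSet f = index.update(Arrays.asList("b", "d", "a"));
        System.out.println(f);               // prints {0, 1}
        System.out.println(index.liveIds()); // prints [c, b, d, a]
    }
}
```

In this toy model, two incoming documents that share an ID within the same batch would both survive, because the snapshot used for lookups cannot see either of them.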

Would this work? I think it might. Does anyone see any holes in this? This can even allow multiple Ws to be used concurrently, and perhaps even multiple machines can be utilized that write to the same index, but I'm not sure if this is desirable.

The proposed mechanism could indeed be made thread-safe, and efficient multithreaded updates would be possible. That's probably what you have in mind. However, having more than one IndexWriter is not possible and not required, since IndexWriter is already optimized for multithreading. Well, I think you know this anyway; I add it just for other listeners.

Yea, this would be a great thing to have available in Lucene...
Dmitry.

One could add a class called IndexUpdate that could handle all that. It should be possible to specify a field or set of fields for identifying duplicate documents.
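
Identifying duplicates by a set of fields could look roughly like the following. This is plain Java and purely hypothetical: a document is modelled as a field-name-to-value map, and DuplicateKeySketch and key are names invented here, not an existing or planned Lucene class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds a duplicate-detection key from chosen fields.
final class DuplicateKeySketch {

    // Two documents count as duplicates when their keys are equal. A List is
    // used as the key because it already defines equals() and hashCode().
    static List<String> key(Map<String, String> doc, List<String> keyFields) {
        List<String> key = new ArrayList<>();
        for (String field : keyFields) {
            key.add(doc.get(field)); // null if the document lacks the field
        }
        return key;
    }

    public static void main(String[] args) {
        List<String> keyFields = Arrays.asList("url", "lang");

        Map<String, String> oldDoc = new HashMap<>();
        oldDoc.put("url", "/index.html");
        oldDoc.put("lang", "en");
        oldDoc.put("body", "old text");

        Map<String, String> newDoc = new HashMap<>();
        newDoc.put("url", "/index.html");
        newDoc.put("lang", "en");
        newDoc.put("body", "new text");

        // Same url and lang, different body: treated as the same document.
        System.out.println(key(oldDoc, keyFields).equals(key(newDoc, keyFields))); // prints true
    }
}
```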

Christoph

