Dmitry Serebrennikov wrote:
Another solution that works well in some applications is to rely on
the document number. This number will remain the same for the life of an
IndexReader. This number is also always larger for documents added
later. So given two documents with the same ID, the one with the highest
document number is the latest one. The rest can be deleted. One way to
store a list of documents easily is to use a filter (which could also be
serialized to disk if needed). This filter would only be valid for the
IndexReader used to create it.
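
To make the document-number idea concrete, here is a minimal sketch in plain Java. It does not use Lucene's API: the class StaleDocMarker and the list of IDs indexed by document number are assumptions made for illustration, standing in for what one fixed IndexReader would report. Only java.util.BitSet plays the role of the filter.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ids.get(n) is the application-level
// ID of the document whose document number is n in one fixed reader snapshot.
class StaleDocMarker {
    // Returns a bitset with a bit set for every document that has a later
    // copy (a higher document number) carrying the same ID.
    static BitSet markStale(List<String> ids) {
        BitSet stale = new BitSet(ids.size());
        Map<String, Integer> latest = new HashMap<>();
        for (int docNum = 0; docNum < ids.size(); docNum++) {
            // put() returns the previous document number for this ID, if any;
            // that older copy is superseded by the current one.
            Integer previous = latest.put(ids.get(docNum), docNum);
            if (previous != null) {
                stale.set(previous);
            }
        }
        return stale;
    }
}
```

The resulting bitset is only meaningful against the snapshot it was computed from, which matches the restriction on the filter described above.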
So here's a modified sequence of operations, perhaps a bit more
efficient than proposed by Christoph:
1) Open an IndexReader for searching - S. Keep it open until the
transaction is committed.
2) Open a second IndexReader for deletions - D.
3) Create a filter bitset F (or use any other mechanism for storing
document numbers to be deleted).
4) Open an IndexWriter for new documents - W.
5) As documents come in, add them using W. Find their old versions in D
and record their document numbers in F. D will not show any new
documents, only documents present at the time D was created.
6) Close W.
7) Use D to delete all documents marked in F.
8) Close D.
Step 8 commits the transaction. At this point, another IndexReader S2
can be created and all new searches can go to that. Once all searches
using S are done, S can be closed.
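
The sequence above can be sketched with a toy in-memory model. This is plain Java and deliberately not Lucene's API: ToyIndex, its Reader snapshot, and BatchUpdate are hypothetical names invented for this sketch, and the model assumes a reader sees only the documents and deletions that existed when it was opened.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy in-memory model (NOT Lucene's API) of the steps above.
class ToyIndex {
    private final List<String> ids = new ArrayList<>(); // doc number -> ID
    private final BitSet deleted = new BitSet();        // committed deletions

    // Snapshot view: sees only what existed when it was opened.
    final class Reader {
        private final int maxDoc = ids.size();
        private final BitSet deletedAtOpen = (BitSet) deleted.clone();

        // Document numbers of live documents with this ID in the snapshot.
        List<Integer> find(String id) {
            List<Integer> hits = new ArrayList<>();
            for (int n = 0; n < maxDoc; n++) {
                if (!deletedAtOpen.get(n) && ids.get(n).equals(id)) {
                    hits.add(n);
                }
            }
            return hits;
        }

        // Steps 7 and 8: apply the recorded deletions and publish them.
        void deleteAndClose(BitSet f) {
            deleted.or(f);
        }
    }

    Reader openReader() { return new Reader(); }

    // Stands in for the writer W: appends a document, giving it the next number.
    void add(String id) { ids.add(id); }
}

class BatchUpdate {
    // The caller is assumed to hold its own search reader S (step 1).
    static void apply(ToyIndex index, List<String> incomingIds) {
        ToyIndex.Reader d = index.openReader(); // step 2: reader D
        BitSet f = new BitSet();                // step 3: filter F
        for (String id : incomingIds) {         // steps 4 and 5
            index.add(id);
            for (int docNum : d.find(id)) {
                f.set(docNum);                  // old version seen through D
            }
        }
        d.deleteAndClose(f);                    // steps 6 to 8
    }
}
```

In this model a reader opened before BatchUpdate.apply keeps returning the old version, while a reader opened afterwards sees only the new one. One thing the model makes visible: if the same ID arrives twice within one batch, both new copies survive, because D never sees documents added by W.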
Would this work? I think it might. Does anyone see any holes in this? This
can even allow multiple Ws to be used concurrently, and perhaps even
multiple machines can be utilized that write to the same index, but I'm
not sure if this is desirable.
The proposed mechanism could indeed be made thread-safe, and efficient
multithreaded updates would be possible. That's probably what you have in
mind. However, having more than one IndexWriter is not possible and not
required, since IndexWriter is already optimized for multithreading. Well,
I think you know this anyway; I add it just for other readers of the list.
Yea, this would be a great thing to have available in Lucene...
Dmitry.
One could add a class called IndexUpdate that could handle all that.
It should be possible to specify a field or set of fields for
identifying duplicate documents.
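
As a rough illustration of the "set of fields" part, here is a small sketch in plain Java. Nothing in it is Lucene API: DuplicateKey is a hypothetical helper, and a document is modeled as a simple map from field name to value purely for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: neither the class nor the method exists in Lucene.
class DuplicateKey {
    // Builds the identity of a document from the configured key fields,
    // in the configured order. Two documents are treated as duplicates
    // when their keys are equal; fields outside keyFields are ignored.
    static List<String> of(List<String> keyFields, Map<String, String> doc) {
        return keyFields.stream()
                .map(doc::get) // a missing field contributes null to the key
                .collect(Collectors.toList());
    }
}
```

Because the key is a List, it can be used directly as a HashMap key when looking up the old versions of incoming documents.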
Christoph