Ning Li <[EMAIL PROTECTED]> wrote on 09/05/2006 02:07:26 AM:

> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.

It appears that this separation of IndexWriter and IndexReader does indeed bother most new Lucene users, forcing each one to come up with their own buffering tricks to implement document updates. Questions like this, and workaround-type solutions, appear on the java-user mailing list very often.

So it's probably a great idea to eradicate this problem once and for all, and to do it in an integrated way (inside IndexWriter), like you did, rather than in a roundabout way in external objects which do more buffering.

I have a couple of small questions on your proposed changes:

> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.

This will indeed be enough for most uses, but I was wondering whether we can, and should, provide even more "reading" functionality in the IndexWriter.

Consider this use case (which actually happened to me): You want to index mail messages which have attachments. The mail message, and the text of each attachment, get indexed as separate Lucene documents so they can be found independently. When we delete a mail message, we also need to delete the attachment documents, so we keep a list of attachments in the message document. When deleting a mail message, we need to read that list field first, and then delete all the attachment documents as well.

The problem is that this requires not only a deleteDocuments() method, but also a method which finds the document and returns it (or better yet, just the one field we need). So I wonder whether the IndexWriter shouldn't contain more of the reading features that previously were only found in IndexReader. In the long run, should our goal perhaps be to leave only one object, say call it simply "Index", which is basically the old IndexWriter with all of IndexReader's capabilities added to it?

> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.

I agree. IndexModifier should perhaps also be deprecated (or just become an empty shell around IndexWriter).

> We experimented with three workloads:
> - Insert only. 1.6M documents were inserted and the final
>   index size was 2.3GB.
> - Insert/delete (big batches). The same documents were
>   inserted, but 25% were deleted. 1000 documents were
>   deleted for every 4000 inserted.
> - Insert/delete (small batches). In this case, 5 documents
>   were deleted for every 20 inserted.

Thanks, these benchmarks are very important. If you can do it, I'd love to see the results of a fourth benchmark, which represents a typical situation (which you also mentioned) of document updates: every single insert is preceded by a delete; 25% of these deletes actually delete something (the updated document existed previously), and the rest find no old document and delete nothing. I expect this benchmark to show an even greater improvement of your approach over the naive IndexModifier.

--
Nadav Har'El
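
To make the mail-and-attachments use case above concrete, here is a minimal sketch of the read-then-delete sequence it needs. It deliberately does not use the Lucene API: ToyIndex, getField() and deleteMessage() are hypothetical names, the index is just an in-memory list of field maps, and the comma-separated "attachments" field is an assumption made for illustration. The only point is that a single object has to offer both a lookup and a delete for the cascade to work.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for an index; hypothetical, NOT the Lucene API. */
public class ToyIndex {
    // Each document is a flat map from field name to field value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    /** Deletes every document whose field equals value; returns the number removed. */
    public int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** The "reading" feature: one field of the first matching document, or null. */
    public String getField(String field, String value, String wanted) {
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                return d.get(wanted);
            }
        }
        return null;
    }

    public int size() {
        return docs.size();
    }

    /** Deletes a mail message and its attachment documents; returns the number removed. */
    public int deleteMessage(String messageId) {
        // Read the attachment list first: once the message is gone, the list is gone too.
        String attachments = getField("id", messageId, "attachments");
        int deleted = 0;
        if (attachments != null && !attachments.isEmpty()) {
            for (String attachmentId : attachments.split(",")) {
                deleted += deleteDocuments("id", attachmentId);
            }
        }
        return deleted + deleteDocuments("id", messageId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(Map.of("id", "msg1", "attachments", "att1,att2"));
        index.addDocument(Map.of("id", "att1", "text", "first attachment"));
        index.addDocument(Map.of("id", "att2", "text", "second attachment"));
        index.addDocument(Map.of("id", "msg2", "attachments", ""));

        System.out.println(index.deleteMessage("msg1")); // prints 3
        System.out.println(index.size());                // prints 1
    }
}
```

With only a deleteDocuments()-style method available, the getField() step would have to go through a separately opened reader, which is exactly the open/close interleaving the proposal is trying to avoid.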