[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]
Ning Li updated LUCENE-565: --------------------------- Attachment: newMergePolicy.Sept08.patch This patch features the new more robust merge policy. Reference on the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147 - The patch passes all the tests except that one in TestIndexModifier (see an earlier comment on this issue). - Since the test itself has a problem, it is fixed (one line change) and the patch passes the fixed test. - A new test call TestIndexWriterMergePolicy is included which shows the robustness of the new merge policy. The following is a detailed description of the new merge policy and its properties. Overview of merge policy: A flush is triggered either by close() or by the number of ram segments reaching maxBufferedDocs. After a disk segment is created by the flush, further merges may be triggered. LowerBound and upperBound set the limits on the doc count of a segment which may be merged. Initially, lowerBound is set to 0 and upperBound to maxBufferedDocs. Starting from the rightmost* segment whose doc count > lowerBound and <= upperBound, count the number of consecutive segments whose doc count <= upperBound. Case 1: number of worthy segments < mergeFactor, no merge, done. Case 2: number of worthy segments == mergeFactor, merge these segments. If the doc count of the merged segment <= upperBound, done. Otherwise, set lowerBound to upperBound, and multiply upperBound by mergeFactor, go through the process again. Case 3: number of worthy segments > mergeFactor (in the case mergeFactor M changes), merge the leftmost* M segments. If the doc count of the merged segment <= upperBound, consider the merged segment for further merges on this same level. Merge the now leftmost* M segments, and so on, until number of worthy segments < mergeFactor. If the doc count of all the merged segments <= upperBound, done. Otherwise, set lowerBound to upperBound, and multiply upperBound by mergeFactor, go through the process again. Note that case 2 can be considerd as a special case of case 3. This merge policy guarantees two invariants if M does not change and segment doc count is not reaching maxMergeDocs: B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B))) 1: If i (left*) and i+1 (right*) are two consecutive segments of doc counts x and y, then f(x) >= f(y). 2: The number of committed segments on the same level (f(n)) <= M. > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Provided) > --------------------------------------------------------------------------------- > > Key: LUCENE-565 > URL: http://issues.apache.org/jira/browse/LUCENE-565 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Reporter: Ning Li > Attachments: IndexWriter.java, IndexWriter.July09.patch, > IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, > NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, > perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, > TestWriterDelete.java > > > Today, applications have to open/close an IndexWriter and open/close an > IndexReader directly or indirectly (via IndexModifier) in order to handle a > mix of inserts and deletes. This performs well when inserts and deletes > come in fairly large batches. However, the performance can degrade > dramatically when inserts and deletes are interleaved in small batches. > This is because the ramDirectory is flushed to disk whenever an IndexWriter > is closed, causing a lot of small segments to be created on disk, which > eventually need to be merged. > We would like to propose a small API change to eliminate this problem. We > are aware that this kind change has come up in discusions before. See > http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 > . The difference this time is that we have implemented the change and > tested its performance, as described below. > API Changes > ----------- > We propose adding a "deleteDocuments(Term term)" method to IndexWriter. > Using this method, inserts and deletes can be interleaved using the same > IndexWriter. > Note that, with this change it would be very easy to add another method to > IndexWriter for updating documents, allowing applications to avoid a > separate delete and insert to update a document. > Also note that this change can co-exist with the existing APIs for deleting > documents using an IndexReader. But if our proposal is accepted, we think > those APIs should probably be deprecated. > Coding Changes > -------------- > Coding changes are localized to IndexWriter. Internally, the new > deleteDocuments() method works by buffering the terms to be deleted. > Deletes are deferred until the ramDirectory is flushed to disk, either > because it becomes full or because the IndexWriter is closed. Using Java > synchronization, care is taken to ensure that an interleaved sequence of > inserts and deletes for the same document are properly serialized. > We have attached a modified version of IndexWriter in Release 1.9.1 with > these changes. Only a few hundred lines of coding changes are needed. All > changes are commented by "CHANGE". We have also attached a modified version > of an example from Chapter 2.2 of Lucene in Action. > Performance Results > ------------------- > To test the performance our proposed changes, we ran some experiments using > the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel > Xeon server running Linux. The disk storage was configured as RAID0 array > with 5 drives. Before indexes were built, the input documents were parsed > to remove the HTML from them (i.e., only the text was indexed). This was > done to minimize the impact of parsing on performance. A simple > WhitespaceAnalyzer was used during index build. > We experimented with three workloads: > - Insert only. 1.6M documents were inserted and the final > index size was 2.3GB. > - Insert/delete (big batches). The same documents were > inserted, but 25% were deleted. 1000 documents were > deleted for every 4000 inserted. > - Insert/delete (small batches). In this case, 5 documents > were deleted for every 20 inserted. > current current new > Workload IndexWriter IndexModifier IndexWriter > ----------------------------------------------------------------------- > Insert only 116 min 119 min 116 min > Insert/delete (big batches) -- 135 min 125 min > Insert/delete (small batches) -- 338 min 134 min > As the experiments show, with the proposed changes, the performance > improved by 60% when inserts and deletes were interleaved in small batches. > Regards, > Ning > Ning Li > Search Technologies > IBM Almaden Research Center > 650 Harry Road > San Jose, CA 95120 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]