LogMergePolicy should use the number of deleted docs when deciding which 
segments to merge
------------------------------------------------------------------------------------------

                 Key: LUCENE-1634
                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Yasuhiro Matsuda


I found that the IndexWriter.optimize(int) method does not pick up large
segments with a lot of deletes, even when most of their docs are deleted. The
existence of such segments degraded query performance significantly.

I created an index with 1 million docs, then went over all docs and updated a
few thousand at a time. I ran optimize(20) occasionally. What I saw were large
segments with most of their docs deleted. Although these segments did not have
valid docs, they remained in the directory for a very long time, until more
segments of comparable or bigger size were created.

This is because LogMergePolicy.findMergeForOptimize uses the size of segments 
but does not take the number of deleted documents into consideration when it 
decides which segments to merge. So, a simple fix is to use the delete count to 
calibrate the segment size. I can create a patch for this.
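
A minimal sketch of the idea, assuming the segment's byte size is simply scaled
by its fraction of live (non-deleted) docs. The class and method names below
(DeleteAwareSize, adjustedSize) are hypothetical and not part of Lucene's API;
an actual patch may calibrate the size differently.

```java
// Hypothetical helper, not Lucene code: scales a segment's size by the
// fraction of documents that are still live, so that a large segment
// consisting mostly of deletes looks "small" to the merge policy.
final class DeleteAwareSize {

    private DeleteAwareSize() {}

    /**
     * @param sizeInBytes raw on-disk size of the segment
     * @param maxDoc      total docs in the segment, including deleted ones
     * @param delCount    number of deleted docs in the segment
     * @return the size prorated by the live-doc ratio
     */
    static long adjustedSize(long sizeInBytes, int maxDoc, int delCount) {
        if (maxDoc <= 0) {
            // Empty segment: nothing to prorate.
            return sizeInBytes;
        }
        if (delCount < 0 || delCount > maxDoc) {
            throw new IllegalArgumentException("delCount out of range: " + delCount);
        }
        int liveDocs = maxDoc - delCount;
        // Floating-point math avoids long overflow for very large segments.
        return Math.round((double) sizeInBytes * liveDocs / maxDoc);
    }

    public static void main(String[] args) {
        // A 1000-byte segment with 90 of its 100 docs deleted.
        System.out.println(adjustedSize(1000L, 100, 90));
    }
}
```

With a size adjusted like this, a segment whose docs are mostly deleted would
be compared against other segments by its live content rather than by its raw
size on disk.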



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

