[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

John Wang (JIRA) Wed, 13 May 2009 09:50:11 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709002#action_12709002
 ]


John Wang commented on LUCENE-1634:
-----------------------------------

RE: implementing a custom MergePolicy
Let me describe in detail the problems with implementing a custom MergePolicy:

1) IndexWriter calls methods on MergePolicy such as findMergesForOptimize. I 
believe those methods are the contract for implementing your own MergePolicy. 
However, they are "hidden" in two ways: the javadoc does not document them, 
and the methods themselves are package protected. So to implement your own 
MergePolicy, you have to resort to sneaking the class into the package.

2) Not only is set/getUseCompoundFile no longer applicable if LogMergePolicy is 
not used; popular methods such as set/getMergeFactor are also only applicable 
to LogMergePolicy. (Just to clarify, useCompoundFile is a package-level 
protected method on the base MergePolicy class, so my guess is that 
set/getUseCompoundFile should be applicable to all implementations of 
MergePolicy.)

This brings up another issue with the practice of having to "sneak" classes 
into a package. We are looking at making our Lucene code OSGi compliant, and 
this becomes an issue because we cannot have multiple "bundles" exporting the 
same package. That means I would have to repackage Lucene to include the 
classes that I have snuck into Lucene packages. I would like to use a 
standard distribution of a Lucene jar (as suggested/echoed by some Luceners).


> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments 
> with a lot of deletes even when most of the docs are deleted. And the 
> existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a 
> few thousand at a time.  I ran optimize(20) occasionally. What I saw were large 
> segments with most of their docs deleted. Although these segments did not have 
> valid docs they remained in the directory for a very long time until more 
> segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergesForOptimize uses the size of segments 
> but does not take the number of deleted documents into consideration when it 
> decides which segments to merge. So, a simple fix is to use the delete count 
> to calibrate the segment size. I can create a patch for this.
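
To make the calibration proposed in the quoted description concrete, here is a 
minimal sketch. It is not the attached patch and not Lucene code: the class 
DeleteAwareSizeSketch and the method calibratedSize are hypothetical names, 
and it assumes the calibration is a plain pro-rating of the segment's byte 
size by its fraction of live (non-deleted) docs.

```java
/**
 * Hypothetical helper, not part of Lucene and not the LUCENE-1634 patch.
 * Assumption: "calibrating" a segment's size means scaling its byte size
 * by the fraction of documents that are still live.
 */
public final class DeleteAwareSizeSketch {

    private DeleteAwareSizeSketch() {}

    /**
     * Returns the segment size pro-rated by the ratio of live docs.
     *
     * @param sizeInBytes     raw on-disk size of the segment
     * @param docCount        total docs in the segment, including deleted ones
     * @param deletedDocCount number of docs marked deleted
     */
    public static long calibratedSize(long sizeInBytes, int docCount, int deletedDocCount) {
        if (docCount <= 0) {
            // Nothing to pro-rate against, so fall back to the raw size.
            return sizeInBytes;
        }
        if (deletedDocCount < 0 || deletedDocCount > docCount) {
            throw new IllegalArgumentException("deletedDocCount out of range: " + deletedDocCount);
        }
        double liveRatio = (double) (docCount - deletedDocCount) / docCount;
        return Math.round(sizeInBytes * liveRatio);
    }

    public static void main(String[] args) {
        // A segment with 750 of its 1000 docs deleted counts as a quarter of its raw size.
        System.out.println(calibratedSize(4000L, 1000, 750));
    }
}
```

With a size like this, a large segment that is mostly deletes would look small 
to a size-based policy and so would become a merge candidate much sooner, 
which is the behavior the description asks for.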

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
