RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Steven Parkes Sun, 25 Mar 2007 10:04:07 -0800

   * I think maxBufferedDocs should not be exposed in any *MergePolicy
    classes or interfaces?  I'm planning on deprecating this param
    with LUCENE-843 when we switch by default to "buffering by RAM
    usage" and it really relates to "how/when should writer flush its
    RAM buffer".


  * I also think "minMergeDocs" (which today is one and the same as
    maxBufferedDocs in IndexWriter but conceptually could be a
    different configuration) also should not appear in the MergePolicy
    interface.  I think it should only appear in
    LogarithmicMergePolicy?

Yeah, I've thought about these two. As far as I've been able to see,
making maxBufferedDocs an IW (IndexWriter) thing is good.

I've been wondering about taking minMergeDocs out of LMP
(LogarithmicMergePolicy): if IW is doing maxBufferedDocs, can we get by
with
        ceil(log(docs))
rather than
        ceil(log(ceil(docs/minMergeDocs))
(That's not exactly right, but it's close). The simplicity appeals to
me, but ...

    If we remove these from the MergePolicy interface then maybe we
    don't need MergePolicyBase?  (Just to makes things simpler).

Just a DRY class. I have no strong feeling about this. In fact, I went
back and forth on it. It's served as a placeholder while I experimented.

  * I think we should not create a LegacyMergePolicy interface.

Yeah, I agree with this. Hard coding LMP in IW made me nervous, but
you're right, it's only in there long enough to support the deprecated
methods. Anybody adding a new merge policy doesn't need to rely on this.

  * I was a little spooked by this change to TestAddIndexesNoOptimize:

      -    assertEquals(2, writer.getSegmentCount());
      +    assertEquals(3, writer.getSegmentCount());

    I think with just the refactoring, there should not need to be any
    changes to unit tests right?

I don't know if I this got into what I wrote either in e-mail or in the
start of the comments. I guess I've done two steps in one here: the
factoring isn't just renaming methods and classes. I did create an
MergePolicy interface that is has a slight simplificatin on how the
merge policy is currently implemented.

In particular, the current merge policy as implemented depends on more
than just a sequence of segments: it knows (or assumes) to some extent
how that sequence is created. In particular, in
TestAddIndexesNoOptimize, it knows that a portion of the sequence comes
from the current index and the remainder comes from other indexes.

The MergePolicy as I drafted it has just sequences of segments, without
this extra level of detail.

When I started, I didn't know how big a change this would be so it was
kind of an experiment.

At this point, the straw man version is pretty similar to the
non-factored version: this test is the only one where there's a
difference and it's pretty inconsequential. But the two version do
generate (slightly?) different merge operations and I still want to test
them on bigger corpora to see if there is any significant difference.

If there weren't, my vote would be to keep the simpler version. Does
anyone have any compelling argument for the more complicated case? In
addition to
        merge( SegmentInfos )
in MergePolicy, it would have 
        merge( SegmentInfos, SegmentInfos )
or maybe even
        merge( SegmentInfos[] )
I just don't have got any compelling examples for the extra complexity.

  * It's interesting that you've pulled "useCompoundFile" into the
    LegacyMergePolicy.  I'm torn on whether it belongs in MergePolicy
    at all, since this is really a file format issue?

Well, the idea was here that you might want to use non-compound files
for big segments (since you have few of them) and compound for smaller
segments. It basically reflects the idea that to some extent, the merge
policy is factoring the number of file descriptors required into its
decision.

-----Original Message-----
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 25, 2007 5:49 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-847) Factor merge policy out of
IndexWriter


    [
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira
.plugin.system.issuetabpanels:comment-tabpanel#action_12483931 ] 

Michael McCandless commented on LUCENE-847:
-------------------------------------------

OK some specific comments, only on the refactoring (ie I haven't
really looked at the new merge policies yet):

  * I also think "minMergeDocs" (which today is one and the same as
    maxBufferedDocs in IndexWriter but conceptually could be a
    different configuration) also should not appear in the MergePolicy
    interface.  I think it should only appear in
    LogarithmicMergePolicy?


  * I think we should not create a LegacyMergePolicy interface.  But I
    realize you need this so the deprecated methods in IndexWriter
    (setMergeFactor, setMaxBufferedDocs, setMaxMergeDocs, etc.) can be
    implemented.  How about instead these methods will only work if
    the current merge policy is the LogarithmicMergePolicy?  You can
    check if the current mergePolicy is an instanceof
    LogarithmicMergePolicy and then throw eg an IllegalStateException
    if it's not?

    Ie, going forward, with new and interesting merge policies,
    developers should interact with their merge policy to configure
    it.

  * I was a little spooked by this change to TestAddIndexesNoOptimize:

      -    assertEquals(2, writer.getSegmentCount());
      +    assertEquals(3, writer.getSegmentCount());

    I think with just the refactoring, there should not need to be any
    changes to unit tests right?

  * It's interesting that you've pulled "useCompoundFile" into the
    LegacyMergePolicy.  I'm torn on whether it belongs in MergePolicy
    at all, since this is really a file format issue?

    For example, newly written segments (no longer a "merge" with
    LUCENE-843) must also know whether to write in compound file
    format.  If we make interesting file format changes in the future
    that are configurable by the developer we wouldn't want to change
    all MergePolicy classes to propogate that.  It feels like using
    compound file or not should remain only in IndexWriter?


> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>         Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it
pluggable, making it possible for apps to choose a custom merge policy
and for easier experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to