[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Michael McCandless (JIRA) Tue, 07 Aug 2007 09:36:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518184
 ]


Michael McCandless commented on LUCENE-847:
-------------------------------------------


Some more feedback:

  - Is the separate IndexMerger interface really necessary?  Can't we
    just use IndexWriter directly?  It's somewhat awkward/forced now,
    because the interface has getters for ramBufferSizeMB and
    maxBufferedDocs that are really a "writer" (flushing) thing not a
    "merging" thing.

    While LogDocMergePolicy does need "maxBufferedDocs", I think,
    instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
    through" to the LogDocMergePolicy if that is the merge policy in
    use (and LogDocMergePolicy should store its own private
    "minMergeDocs").

    I think the three getters may not even be needed (based on
    comments below), in which case it seems like we shouldn't be
    creating a new interface.

  - In LogDocMergePolicy, it seems like the checking that's done for
    whether a SegmentInfo is in a different directory, as well as the
    subsequent copy to move it over to the IndexWriter's directory,
    should not live in the MergePolicy?

    Otherwise, every MergePolicy is going to have to duplicate this
    code.  Not to mention, we may someday create a more efficient way
    to copy than running a single-segment merge (which is a very
    inefficient, but, we only do it in addIndexes* I think).  I'd like
    to capture this in one place (IndexWriter).

    EG, the writer could have its own "copyExternalSegments" method
    which is called in addIndexes* and also optimize(), after the
    merge policy has done its merge, which does the check for wrong
    directory and subsequent copy.

    I think this would mean IndexMerger.getDirectory() is not really
    needed.

  - The "checkOptimize" method in LogDocMergePolicy seems like it
    belongs back in IndexWriter: I think we don't want every
    MergePolicy having to replicate that tricky while condition.

    Maybe we could change MergePolicy.merge to take a boolean "forced"
    which, if true, means that the MergePolicy must now pick a merge
    even if it didn't think it was time to.  This would let us move
    that tricky while condition logic back into IndexWriter.

    Also, I think at some point we may want to have an optimize()
    method that takes an int parameter stating the max # segments in
    the resulting index.  This would allow you to optimize down to <=
    N segments w/o paying full cost of a complete "only one segment"
    optimize.  If we had a "forced" boolean then we could build such
    an optimize method in the future, whereas as it stands now it
    would not be so easy to retrofit previously created MergePolicy
    classes to do this.

And some more minor feedback:

  - I love the simplification of addIndexesNoOptimize :)  Though (same
    comment as above) I think we should move that final "copy if
    different directory" step back in IndexWriter.

  - There are some minor things that should not be committed eg the
    added "infoStream = null" in IndexFileDeleter.  I typically try to
    put a comment "// nocommit" above such changes as I make them to
    remind myself and keep them from slipping in.

  - In the deprecated IndexWriter methods you're doing a hard cast to
    LogDocMergePolicy which gives a bad result if you're using a
    different merge policy; maybe catch the class cast exception (or,
    better, check upfront if it's an instanceof) and raise a more
    reasonable exception if not?

  - IndexWriter around line 1908 has for loop that has commented out
    "System.err.println"; we should just comment out/remove for loop

  - These commented out synchronized spook me a bit:

      /* synchronized(segmentInfos) */ {

    Are they needed in these contexts?  Is this only once we have
    concurrent merging?  (This ties back to the first feedback to
    simplify MergePolicy API so that this kind of locking is only
    needed inside IndexWriter).

  - Can we support non-contiguous merges?  If I understand it
    correctly, the MergeSpecification can express such a merge, it's
    just that the current IndexMerger (IndexWriter) cannot execute it,
    right?  So we are at least not precluding fixing this in the
    future.

  - It confuses me that MergePolicy has a method "merge(...)" -- can
    we rename it to "maybeMerge(..)" or "checkForMerge(...)"?  Ie,
    this method determines whether a merge is necessary and, if so, it
    then asks IndexMerger to in fact execute the merge (or, returns
    the spec)?

  - Instead of IndexWriter.releaseMergePolicy() can we have
    IndexWriter only close the merge policy if it was the one that had
    created it?  (Similar to how IndexWriter closes the dir if it has
    opened it from a String or File, but does not close it if it was
    passed in).



> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to