[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Steven Parkes (JIRA) Thu, 16 Aug 2007 09:27:52 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520293
 ]


Steven Parkes commented on LUCENE-847:
--------------------------------------

        One new small item: you've added a "public void merge()" to
        IndexWriter so that people can externally kick off a merge request,
        which is good I think.

        But, is it really necessary to flush here?  It would be better to not
        flush so that users then have two separate methods (flush() and
        merge()) to do each function independently.

That makes sense.

Note that merge() was added not for users (which I have no strong opinion 
about) but so that, potentially, CMP can check again for merges when a set of 
merge threads completes, i.e., cascade.

        I think instead we should leave the methods, not deprecated, as
        convenience (sugar) methods.  Simple things should be simple; complex
        things should be possible.

I think this argues for a LegacyMergePolicy interface again, then? If we change 
the default merge policy and someone changes their code to use LogDoc for their 
own purposes, in both cases the getters/setters should work? So cast to the 
interface and as long as the merge policy supports this, the getters/setters 
work (unless the merge policy decides to throw within), otherwise the 
getters/setters throw? 

        Uh, no: when someone calls optimize that means it really must be done,
        right?  So "optimize" is the right name I think.

Yeah, but it might do nothing. Just as merge might do nothing.

        Can you factor this out, eg add a private method
        "getLogDocMergePolicy(String reason)"

Sure.

        Looks good, thanks.  Can you add javadocs (w/ params) for both of
        these new methods?

Sure.

        Though, this raises the tricky question of index
        consistency ...

Definitely. I'm still trying to understand all the subtleties here.

        IndexWriter commits the new segments file right after
        mergePolicy.merge returns ... so for CMP we suddenly have an unusable
        index (as seen by an IndexReader).

How so? I figured that after mergePolicy.merge returns, in the case of CMP with 
an ongoing merge, segmentInfos won't have changed at all. Is that a problem?

I thought the issue would be on the other end, where the concurrent merge 
finishes and needs to update segmentInfos.

        Maybe it's too ambitious to allow merges of segments from other
        directories to run concurrently?

Yeah, that might be the case. At least as a default?

        I would consider it a hard error in IndexWriter if after calling
        mergePolicy.merge from any of the addIndexes*, there remain segments
        in other directories.  I think we should catch this & throw an
        exception?

It would be easy enough for CMP to block in this case, rather than returning 
immediately. Wouldn't that be better? And I suppose it's possible to imagine an 
API on CMP for specifying this behavior?

        I opened a separate issue LUCENE-982 to track this.

I think this is good. I think it's an interesting issue but not directly 
related to the refactor?

        Although ... do you think we need need some way for merge policy to
        state where the new segment should be inserted into SegmentInfos?

Right now I assumed it would replace the left most-segment.

Since I don't really know the details of what such a merge policy would like, I 
don't really know what it needs.

If you've thought about this more, do you have a suggestion? I suppose we could 
just add an int. But, then again, I'd do that as a separate function, leaving 
the original available, so we can do this later, completely compatibly?

        Hmmm, does CMP block on close while it joins to any running merge
        threads?

Yeah, at least in my sandbox.

        How can the user close IndexWriter and abort the running
        merges?  I guess CMP would provide a method to abort any running
        merges, and user would first call that before calling
        IndexWriter.close?

I hadn't really thought about this but I can see that should be made possible. 
It's always safe to abandon a merge so it should be available, for fast, safe, 
and clean shutdown.

        True, LogDoc as it now stands would never exploit concurrency (it will
        always return the highest level that needs merging).  But, we could
        relax that such that if ever the lowest level has > 2*mergeFactor
        pending segments to merge then we select the 2nd set.

Okay. But it will always return that? Still doesn't sound concurrent?

The thing is, the serial merge policy has no concept of concurrent merges, so 
if the API is always to select the best merge, until a pervious merge finishes, 
it will always return that as the best merge.

Concurrent is going to require, by hook or by crook, that a merge policy be 
able to generate a set of non-conflicting merges, is it not?

I think the LUCENE-845 merge policy does this now, given that CMP gathers up 
the merge calls. I'm not sure the current LUCENE-847 merge policy does (I'd 
have to double check) because it sometimes will try to use the result of the 
current merge in the next merge. The LUCENE-845 merge doesn't try to do this 
which is a(n) (inconsequential) change?

        This is another benefit of the simplified API: MergePolicy.maybeMerge
        would only be called with a lock already acquired (by IndexWriter) on
        the segmentInfos.

Do you really mean a lock on segmentInfos or just the lock on IndexWriter? I'm 
assuming the latter and I think this is the case for both API models.

I don't think it's feasible to have a lock on segmentInfos separately. Only 
IndexWriter should change segmentInfos and no code should try to look at 
segmentInfos w/o being called via an IW synch method.

This does imply that CMP has to copy any segmentInfos data it plans to use 
during concurrent merging, since the IW lock is not held during these periods. 
Then, when the merge is done, segmentInfos is updated in IndexWriter via a 
synch call to IW#replace.

This means IW#segmentInfos can change while a merge is in progress and this has 
to be accounted for. That's what I'm walking through now.


> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to