[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Steven Parkes (JIRA) Tue, 14 Aug 2007 14:49:58 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519790
 ]


Steven Parkes commented on LUCENE-847:
--------------------------------------

        Are you going to fix all unit tests that call the now-deprecated APIs?

Yeah. Thanks for the reminder.

As to the IndexWriter vs. IndexMerger issue, I still think having the interface 
is useful if not only that it makes my testing much easier. I have a 
MockIndexMerger that implements only the functions in the interface and 
therefore I can test merge policies without creating a writer. For me this has 
been a big win ...

        Exactly: these settings decide when a segment is flushed, so, why put
        them into IndexMerger interface?  They shouldn't have anything to with
        merging; I think they should be removed.

        For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
        does not use maxBufferedDocs.

I can see that one could write a merge policy that didn't have any idea of how 
the initial buffering was done, but I worry about precluding it. Maybe the 
LUCENE-845 patch will show a strong enough pattern to believe no merge policies 
will need it?

        I think factoring into a base class is an OK solution, but, it
        shouldn't be MergePolicy's job to remember to call this final "move
        any segments in the wrong directory over" code.  As long as its in one
        place and people don't have to copy/paste code between MergePolicy
        sources.

In the case of concurrent merges, I think this gets more complicated. When do 
you do those directory copies? I think you can't do them at the return from the 
merge policy because the merge policy may want to do them, but later.

I don't think IndexWriter has enough information to know when the copies need 
to done. Doubly so if we have concurrent merges?

I still stand by it should be the merge policy making the choice. You could 
have the code in IndexWriter too, but then there'd be duplicate code. To put 
the code only in IndexWriter removes the choice from the merge policy.

        I think there
        should in fact be a default optimize() in the base class that does
        what current IndexWriter now does so that a MergePolicy need not
        implement optimize at all.

It'd be nice, but I don't know how to do it: the merge factor is not generic, 
so I don't know how to implement the loop generically.

Ah ... I see: with your forced merge ... hmmm.

        No, forced would mean the merge policy must do a merge; whereas,
        normally, it's free not to do a merge until it wants to.

Hmmmm ...

I think adding a forced merge concept here is new ... If it's simply to support 
optimize, I'm not sure I find it too compelling. LogDoc as it stands uses 
different algorithms for incremental merges and optimize, so there's not too 
much of a concept of forced merges vs. optional merges to be factored out. So I 
guess I'm not seeing a strong compelling case for creating it?

        Well, it's sort of awkward if you want to vary that max # segments.
        Say during the day you optimize down to 15 segments every time you
        update the index, but then at night you want to optimize down to 5.
        If we don't add method to IndexWriter you then must have instance var
        on your MergePolicy that you set, then you call optimize.  It's not
        clean since really it should be a parameter.

Well, I don't know if I buy the argument that it should be a parameter. The 
merge policy has lots of state like docs/seg. I don't really see why 
segs/optimize is different.

My main reason for not wanting put this into IndexWriter is then every merge 
policy must support it.

        Wait: there is a barrier, right?  In IndexWriter.replace we don't do
        the right thing with non-contiguous merges?

Yeah, I meant that I'm not aware of any barriers except fixing 
IndexWriter#replace, in other words, I'm not aware of any other places where 
non-contiguity would cause a failure.

        I'm not wed to "maybeMerge()" but I really don't like "merge" :)  It's
        overloaded now.

        EG IndexMerger interface has a method called "merge" that is named
        correctly because it will specifically go a do the requested merge.
        So does IndexWriter.

        Then, you have other [overloaded] methods in LogDocMergePolicy called
        "merge" that are named appropriately (they will do a specific merge).

        How about "checkForMerges()"?

I don't find it ambiguous based on class and argument type. It's all personal, 
of course.

I'd definitely prefer maybe over checkFor because that sounds like a predicate.

        Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
        doClose)" and default doClose to true?

That sounds good.

        Finally: does MergePolicy really need a close()?

I think so. The concurrent merge policy maintains all sorts of state.

        I don't see how it can work (building the generic concurrency wrapper
        "under" IndexMerger) because the MergePolicy is in "serial control",
        eg, when it wants to cascade merges.  How will you return that thread
        back to IndexWriter?

So this is how it looks now: the concurrent merge policy is both a merge policy 
and an index merger. The serial merge policy knows nothing about it other than 
it does not get IndexWriter as its merge.

The index writer wants its merge, so it does it merge/maybeMerge call on the 
concurrent merge policy. The CMP calls merge on the serial policy, but 
substitutes itself for the merger rather than IndexWriter.

The serial merge policy goes on its merry way, looking for merges to do (in the 
current model, this is a loop; more on that in a minute). Each time it has a 
subset of segments to merge, it calls merger.merge(...).

At this point, the concurrent merge policy takes over again. It looks at the 
segments to be merged and other segments being processed by all existing merge 
threads and determines if there's a conflict (a request to merge a segment 
that's currently in a merge). If there's no conflict, it starts a merge thread 
and calls IndexWriter#merge on the thread. The original calling thread returns 
immediately. (I have a few ideas how to handle conflicts, the simplest of which 
is to wait for the conflicting merge and the restart the serial merge, e.g., 
revert to serial).

This seems to work pretty well, so far. The only difference in API for the 
serial merges is that the merge operation can't return the number of documents 
in the result (since it isn't known how many docs will be deleted).   

        With concurrency wrapper "on the top" it's able to easily take a merge
        request as returned by the policy, kick it off in the backrground, and
        immediately return control of original thread back to IndexWriter.

What I don't know how to do with this is figure out how to do a bunch of 
merges. Lets say I have two levels in LogDoc that are merge worthy. If I call 
LogDoc, it'll return the lower level. That's all good. But what about doing the 
higher level in parallel? If I call LogDoc again, it's going to return the 
lower level again because it knows nothing about the current merge going on.

LogDoc already does things in a loop: it's pretty much set up to call all 
possible merges at one time (if they return immediately).

It would be possible to have it return a vector of segmentInfo subsets, but I 
don't see the gain (and it doesn't work out as well for my putative conflict 
resolution).

        have all necessary locking for SegmentInfos inside
        IndexWriter

This was a red-herring on my part. All the "segmentInfos locking" has always 
been in IndexWriter. That's note exactly sufficient. The fundamental issue is 
that IndexWriter#merge has to operate without a lock on IndexWriter. At some 
point, I was thinking that meant it would have to lock SegmentInfos but that's 
ludicrous, actually. It's sufficient for IndexWriter#replace to be synchronized.

        If CMP implements IndexMerger you must have locking inside any
        MergePolicy that's calling into CMP?

No. CMP does it's own locking (for purposes of thread management) but the 
serial merge policies no nothing of this (and they can expect to be called 
synchronously).

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to