[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Michael McCandless (JIRA) Thu, 16 Aug 2007 07:21:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520260
 ]


Michael McCandless commented on LUCENE-847:
-------------------------------------------

One new small item: you've added a "public void merge()" to
IndexWriter so that people can externally kick off a merge request,
which is good I think.

But, is it really necessary to flush here?  It would be better to not
flush so that users then have two separate methods (flush() and
merge()) to do each function independently.

> > Are you going to fix all unit tests that call the now-deprecated
> > APIs?
> 
> Yeah. Thanks for the reminder.

On thinking about this more ... and on seeing all the diffs ... I no
longer feel we should be deprecating "get/setUseCompoundFile()" nor
"get/setMergeFactor()" nor "get/setMaxMergeDocs()" in IndexWriter.

The vast majoriy of Lucene users will not make their own merge policy
(just use the default merge policy) and so I don't think we should be
complicating their lives with having to now write lines like this when
they want to change settings:

   
((LogDocMergePolicy)writer.getMergePolicy()).setUseCompoundFile(useCompoundFile);

Also, this ties our hands if ever we want to change the default merge
policy (which, under LUCENE-845, I'd like to do).

I think instead we should leave the methods, not deprecated, as
convenience (sugar) methods.  Simple things should be simple; complex
things should be possible.  Sorry I didn't think of this before you
made the new patch Steve :(

> > I'm not wed to "maybeMerge()" but I really don't like "merge" :)
> > It's overloaded now.
> > 
> > EG IndexMerger interface has a method called "merge" that is named
> > correctly because it will specifically go a do the requested
> > merge.  So does IndexWriter.
> >
> > Then, you have other [overloaded] methods in LogDocMergePolicy
> > called "merge" that are named appropriately (they will do a
> > specific merge).
> > 
> > How about "checkForMerges()"?
>
> I don't find it ambiguous based on class and argument type. It's all
> personal, of course.
> 
> I'd definitely prefer maybe over checkFor because that sounds like a
> predicate.

OK let's settle on "maybeMerge"?

>    - Does this mean optimize -> maybeOptimize, too?

Uh, no: when someone calls optimize that means it really must be done,
right?  So "optimize" is the right name I think.

> * Make LDMP casts not throw bad cast 

Can you factor this out, eg add a private method
"getLogDocMergePolicy(String reason)" that would be the one place that
does the class casting & throwing an error message from one single
source line?  Right now the message is copied in multiple places and,
it's wrong for mergeFactor (was copied from useCompoundFile).

> * Get rid of releaseMergePolicy and add doClose parameter on set

Looks good, thanks.  Can you add javadocs (w/ params) for both of
these new methods?

> As to the IndexWriter vs. IndexMerger issue, I still think having
> the interface is useful if not only that it makes my testing much
> easier. I have a MockIndexMerger that implements only the functions
> in the interface and therefore I can test merge policies without
> creating a writer. For me this has been a big win ...

Subclassing IndexWriter to make MockIndexMerger would also work for
testing?  This is what MockRAMDirectory does for example...

> > Exactly: these settings decide when a segment is flushed, so, why
> > put them into IndexMerger interface? They shouldn't have anything
> > to with merging; I think they should be removed.
> > 
> > For LUCENE-845 I'm working on a replacement for LogDocMergePolicy
> > that does not use maxBufferedDocs.

> I can see that one could write a merge policy that didn't have any
> idea of how the initial buffering was done, but I worry about
> precluding it. Maybe the LUCENE-845 patch will show a strong enough
> pattern to believe no merge policies will need it?

We wouldn't be precluding it (people can still get it from their
IndexWriter).  This is one of the big reasons that I don't like making
an interface out of IndexMerger: here we are having to pick & choose
which settings from IndexWriter a merge policy is "allowed" to use.  I
don't think that's necessary (we are just making extra work for
ourselves) and inevitably we won't get it right...

> > I think factoring into a base class is an OK solution, but, it
> > shouldn't be MergePolicy's job to remember to call this final
> > "move any segments in the wrong directory over" code. As long as
> > its in one place and people don't have to copy/paste code
> > between MergePolicy sources.
> 
> In the case of concurrent merges, I think this gets more
> complicated. When do you do those directory copies? I think you
> can't do them at the return from the merge policy because the merge
> policy may want to do them, but later.
>
> I don't think IndexWriter has enough information to know when the
> copies need to done. Doubly so if we have concurrent merges? 

Ahh, good point.  Though, this raises the tricky question of index
consistency ... IndexWriter commits the new segments file right after
mergePolicy.merge returns ... so for CMP we suddenly have an unusable
index (as seen by an IndexReader).  EG if things crash any time after
this point and before the background merging finishes & commits,
you're hosed.

Maybe it's too ambitious to allow merges of segments from other
directories to run concurrently?

I would consider it a hard error in IndexWriter if after calling
mergePolicy.merge from any of the addIndexes*, there remain segments
in other directories.  I think we should catch this & throw an
exception?

> I still stand by it should be the merge policy making the
> choice. You could have the code in IndexWriter too, but then there'd
> be duplicate code. To put the code only in IndexWriter removes the
> choice from the merge policy.

I agree that merge policy should be the one making the choice, but the
execution of it should be a centralized place (IndexWriter).  EG with
the simplified API, the merge policy would just return, one by one,
each of the segments that is in a different directory...

We can't all be copy/pasting code (like I had to do for LUCENE-845)
for checking & then moving segments across directories.  I think we
need single source for this, somehow.

> > I think there should in fact be a default optimize() in the base class
> > that does what current IndexWriter now does so that a MergePolicy need
> > not implement optimize at all.
>
> It'd be nice, but I don't know how to do it: the merge factor is not
> generic, so I don't know how to implement the loop generically.

Hmmm, OK.  I think what you did (factoring out that massive
conditional) is good here.

> Ah ... I see: with your forced merge ... hmmm.
> 
> No, forced would mean the merge policy must do a merge; whereas,
> normally, it's free not to do a merge until it wants to.
>
> I think adding a forced merge concept here is new ... If it's simply
> to support optimize, I'm not sure I find it too compelling. LogDoc
> as it stands uses different algorithms for incremental merges and
> optimize, so there's not too much of a concept of forced merges
> vs. optional merges to be factored out. So I guess I'm not seeing a
> strong compelling case for creating it?

OK, I agree, let's not add "forced".  How about, instead we only
require mergePolicy to implement optimize(int maxNumSegments)?  (And
current IndexWriter.optimize() calls this with parameter "1").

> > Well, it's sort of awkward if you want to vary that max #
> > segments.  Say during the day you optimize down to 15 segments
> > every time you update the index, but then at night you want to
> > optimize down to 5.  If we don't add method to IndexWriter you
> > then must have instance var on your MergePolicy that you set,
> > then you call optimize. It's not clean since really it should be
> > a parameter.
>
> Well, I don't know if I buy the argument that it should be a
> parameter. The merge policy has lots of state like docs/seg. I don't
> really see why segs/optimize is different.

I think this would be a useful enough method that it should be "made
simple" (ie, this is different from the "other state" that a merge
policy would store).  I opened a separate issue LUCENE-982 to track
this.

> My main reason for not wanting put this into IndexWriter is then
> every merge policy must support it.

This is why I want to address it now, while we are cementing the
MergePolicy API: I don't want to preclude it.

> > Wait: there is a barrier, right? In IndexWriter.replace we don't do
> > the right thing with non-contiguous merges?
> 
> Yeah, I meant that I'm not aware of any barriers except fixing
> IndexWriter#replace, in other words, I'm not aware of any other
> places where non-contiguity would cause a failure.

OK, good, that's my impression too.

Although ... do you think we need need some way for merge policy to
state where the new segment should be inserted into SegmentInfos?  For
the contiguous case it seems clear that we should default to what is
done now (new segment goes into same spot where old segments were).
But for the non-contiguous case, how would IndexWriter know where to
put the newly created segment?

> > Finally: does MergePolicy really need a close()?
> 
> I think so. The concurrent merge policy maintains all sorts of
> state.

OK.  Hmmm, does CMP block on close while it joins to any running merge
threads?  How can the user close IndexWriter and abort the running
merges?  I guess CMP would provide a method to abort any running
merges, and user would first call that before calling
IndexWriter.close?

> > I don't see how it can work (building the generic concurrency
> > wrapper "under" IndexMerger) because the MergePolicy is in "serial
> > control", eg, when it wants to cascade merges. How will you return
> > that thread back to IndexWriter?
>
> So this is how it looks now: the concurrent merge policy is both a
> merge policy and an index merger. The serial merge policy knows
> nothing about it other than it does not get IndexWriter as its
> merge.
>
> The index writer wants its merge, so it does it merge/maybeMerge
> call on the concurrent merge policy. The CMP calls merge on the
> serial policy, but substitutes itself for the merger rather than
> IndexWriter.
>
> The serial merge policy goes on its merry way, looking for merges to
> do (in the current model, this is a loop; more on that in a
> minute). Each time it has a subset of segments to merge, it calls
> merger.merge(...).
>
> At this point, the concurrent merge policy takes over again. It
> looks at the segments to be merged and other segments being
> processed by all existing merge threads and determines if there's a
> conflict (a request to merge a segment that's currently in a
> merge). If there's no conflict, it starts a merge thread and calls
> IndexWriter#merge on the thread. The original calling thread returns
> immediately. (I have a few ideas how to handle conflicts, the
> simplest of which is to wait for the conflicting merge and the
> restart the serial merge, e.g., revert to serial).
>
> This seems to work pretty well, so far. The only difference in API
> for the serial merges is that the merge operation can't return the
> number of documents in the result (since it isn't known how many
> docs will be deleted).

Hmmm.  This looks more complex than the proposed API simplification,
because you now have CMP on the top and on the bottom.  Also, this
requires the IndexMerger interface, but with the simplification we
would not need a separate interface.  Finally, I'm pretty sure you
have locking issues (more below...), which are required of all merge
policies, that the simplified API wouldn't have.

How we handle conflicts is important but I think independent of this
API discussion (ie both your CMP and my CMP have this same challenge,
and I agree we should start simple by just blocking when the selected
merge conflicts with a previous one that's still in progress).

> > With concurrency wrapper "on the top" it's able to easily take a
> > merge request as returned by the policy, kick it off in the
> > backrground, and immediately return control of original thread
> > back to IndexWriter.
>
> What I don't know how to do with this is figure out how to do a
> bunch of merges. Lets say I have two levels in LogDoc that are merge
> worthy. If I call LogDoc, it'll return the lower level. That's all
> good. But what about doing the higher level in parallel? If I call
> LogDoc again, it's going to return the lower level again because it
> knows nothing about the current merge going on.

True, LogDoc as it now stands would never exploit concurrency (it will
always return the highest level that needs merging).  But, we could
relax that such that if ever the lowest level has > 2*mergeFactor
pending segments to merge then we select the 2nd set.  This would
expose concurrency that would only be used when CMP is in use.  But I
think we should do this, later, as an enhancement.  Let's focus on
simplifying the API now...

> It would be possible to have it return a vector of segmentInfo
> subsets, but I don't see the gain (and it doesn't work out as well
> for my putative conflict resolution).

Yeah that would make the API even more complex, which is the wrong
direction here :)

> > have all necessary locking for SegmentInfos inside IndexWriter
>
> This was a red-herring on my part. All the "segmentInfos locking"
> has always been in IndexWriter. That's note exactly sufficient. The
> fundamental issue is that IndexWriter#merge has to operate without a
> lock on IndexWriter. At some point, I was thinking that meant it
> would have to lock SegmentInfos but that's ludicrous, actually. It's
> sufficient for IndexWriter#replace to be synchronized.

Right: merging certainly shouldn't hold lock on IndexWriter nor
segmentInfos.

> > If CMP implements IndexMerger you must have locking inside any
> > MergePolicy that's calling into CMP?
>
> No. CMP does it's own locking (for purposes of thread management)
> but the serial merge policies no nothing of this (and they can
> expect to be called synchronously).

This I don't get: it seems to me that the serial merge policies must
do their own locking when they access the SegmentInfos that's passed
in?  And that lock must be released, somehow, when they call merge?
Would merge (inside IndexWriter) somehow release the lock on being
called?  I don't see how you're going to make the locking work, but I
think it's required with the current API.

This is another benefit of the simplified API: MergePolicy.maybeMerge
would only be called with a lock already acquired (by IndexWriter) on
the segmentInfos.  Then maybeMerge looks @ the segmentInfos, makes its
choice, and returns it, and the lock is released.  The lock is not
held for an extended period of time...


> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to