[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Michael McCandless (JIRA) Wed, 08 Aug 2007 05:49:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518435
 ]


Michael McCandless commented on LUCENE-847:
-------------------------------------------

New feedback:

  * Are you going to fix all unit tests that call the now-deprecated
    APIs?  (You should still leave a few tests using the deprecated
    APIs to make sure they in fact continue to work, but most tests
    should not use the deprecated APIs).

Responses to previous feedback:

> As an example, it's not a good idea for merge policies to be able to
> access IndexWriter#segmentInfos directly. (That's a case I would
> like to solve by making segmentInfos private, but I haven't looked
> at that completely yet and even beyond that case, I'd still like
> merge policies to have a very clean interface with IWs.)

Agreed, but the solution to that is to make segmentInfos private, not
to make a whole new interface.  Ie, this is an IndexWriter problem, so
we should fix it in IndexWriter.

> > While LogDocMergePolicy does need "maxBufferedDocs", I think,
> > instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
> > through" to the LogDocMergePolicy if that is the merge policy in
> > use (and LogDocMergePolicy should store its own private
> > "minMergeDocs").
>
> The thing here is that the merge policy has nothing to do with max
> buffered docs, right? The code for buffering docs for the first
> segment is wholly in the IndexWriter. LogDocMergePolicy happens to
> need it (in its current incarnation) in order to handle the way the
> log levels are computed. This could, of course, be changed. There's
> nothing that says a merge policy has to look at these values, just
> that they're available should the merge policy want to look.

Exactly: these settings decide when a segment is flushed, so, why put
them into IndexMerger interface?  They shouldn't have anything to with
merging; I think they should be removed.

For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
does not use maxBufferedDocs.

> I guess my idea was that the index writer was responsible for
> handling the initial segment (with DocsWriter, if it wants) and also
> to provide an indication to the merge policies how it was handling
> them.
>
> It's possible to have the merge policy influence the first segment
> size but that opens up a lot of issues. Does every merge policy then
> have to know how to trade between buffering by doc count and
> buffering by ram? I was hoping to avoid that.

Yeah, I don't think merge policy should dictate flushing either;
especially because app logic above IndexWriter can already easily call
flush() if necessary.

> > In LogDocMergePolicy, it seems like the checking that's done for
> > whether a SegmentInfo is in a different directory, as well as the
> > subsequent copy to move it over to the IndexWriter's directory,
> > should not live in the MergePolicy?
>
> I see two parts to this.
> 
> First, I hesitate to make it not possible for merge policy to see
> the directory information, i.e., remove
> IndexMerger#getDirectory(). Copying a segment is one way to get it
> into the current directory, but merging with other segments does
> to. A merge policy could decide to go ahead and merge a bunch of
> segments that are in other directories rather than just copy them
> individually. Taking away getDirectory() removes that option.

Agreed, a "sophisticated" merge policy would go and merge segments in
other directories.  But, it should not have to.

I'm not proposing making it "not possible"; I'm proposing the merge
policy is given IndexWriter at which point it can getDirectory() from
it.  (Ie the extra interface solely for this purpose is overkill).

> As to how to handle the case where single segments are copied, I
> guess my main reason for leaving that in the merge policy would be
> for simplicity. Seems nicer to have all segment amalgamation
> management in one place, rather than some in one class and some in
> another. Could be factored into an optional base merge policy for
> derived classes to use as they might like.

I don't see this as simpler: I see it as making the MergePolicy
writer's job more complex.  I also see it as substantial duplicated
code (I just had to copy/paste a bunch of code in working on my
MergePolicy in LUCENE-845).

I think factoring into a base class is an OK solution, but, it
shouldn't be MergePolicy's job to remember to call this final "move
any segments in the wrong directory over" code.  As long as its in one
place and people don't have to copy/paste code between MergePolicy
sources.

> > The "checkOptimize" method in LogDocMergePolicy seems like it
> > belongs back in IndexWriter: I think we don't want every
> > MergePolicy having to replicate that tricky while condition.
>
> The reason for not doing this was I can imagine different merge
> policies having a different model of what optimize means. I can
> imagine some policies that would not optimize everything down to a
> single segment but instead obeyed a max segment size. But we could
> factor the default conditional into an optional base class, as
> above.
>
> It is an ugly conditional that there might be better way to
> formulate, so that policies don't have to look at the grody details
> like hasSeparateNorms. But I'd still like the merge policies to be
> able to decide what optimize means at a high level.

Agreed: I too don't want to preclude variants to optimize like
"optimize to at most N segments".  (Maybe we should add that method,
now, to IndexWriter, and fix MergePolicy to work with this?).

So, yes, please at least factor this out into a base class.  In
LUCENE-845 this was another copy/paste for me (ick).  I think there
should in fact be a default optimize() in the base class that does
what current IndexWriter now does so that a MergePolicy need not
implement optimize at all.

> > Maybe we could change MergePolicy.merge to take a boolean "forced"
> > which, if true, means that the MergePolicy must now pick a merge
> > even if it didn't think it was time to. This would let us move
> > that tricky while condition logic back into IndexWriter.
>
> I didn't follow this. forced == optimize? If not, what does pick a
> merge mean? Not sure what LogDoc should do for merge(force=true) if
> it thinks everything is fine?

No, forced would mean the merge policy must do a merge; whereas,
normally, it's free not to do a merge until it wants to.  Instead of
boolean argument we could have two methods, one called "merge" (you
must merge) and one called "maybeMerge" or "checkForMerges".

Ie, optimize is really a series of forced merges: we are merging
segments from different levels, N times, until we are down to 1
segment w/ no deletes, norms, etc.

> > Also, I think at some point we may want to have an optimize()
> > method that takes an int parameter stating the max # segments in
> > the resulting index.
>
> I think this is great functionality for a merge policy, but what
> about just making that part of the individual merge policy
> interface, rather than linked to IndexWriter? That was one goal of
> making a factored merge policy: that you could do these tweaks
> without changing IndexWriter.

Well, it's sort of awkward if you want to vary that max # segments.
Say during the day you optimize down to 15 segments every time you
update the index, but then at night you want to optimize down to 5.
If we don't add method to IndexWriter you then must have instance var
on your MergePolicy that you set, then you call optimize.  It's not
clean since really it should be a parameter.

And, with the merge/maybeMerge separation above, every merge policy
could have a default implementation for optimize(int maxNumSegments)
(in MergePolicy base class or in IndexWriter).

> > Can we support non-contiguous merges? If I understand it
> > correctly, the MergeSpecification can express such a merge, it's
> > just that the current IndexMerger (IndexWriter) cannot execute it,
> > right? So we are at least not precluding fixing this in the
> > future.
>
> As far as I've seen so far, there are no barriers to non-contiguous
> merges. Maybe something will crop up that is, but in what I've done,
> I haven't seen any.

Wait: there is a barrier, right?  In IndexWriter.replace we don't do
the right thing with non-contiguous merges?  What I'm asking is: is
that the only barrier?  Ie MergePolicy API will not need to change in
the future once we fix IndexWriter.replace to handle non-contiguous
merges?

> > It confuses me that MergePolicy has a method "merge(...)" -- can
> > we rename it to "maybeMerge(..)" or "checkForMerge(...)"?
>
> I suppose. I'm not a big fan of the "maybeFoo" style of naming. I
> think of "merge" like "optimize": make it so / idempotent. But I'm
> certainly willing to write whatever people find clearest.

I'm not wed to "maybeMerge()" but I really don't like "merge" :)  It's
overloaded now.

EG IndexMerger interface has a method called "merge" that is named
correctly because it will specifically go a do the requested merge.
So does IndexWriter.

Then, you have other [overloaded] methods in LogDocMergePolicy called
"merge" that are named appropriately (they will do a specific merge).

How about "checkForMerges()"?

> > Instead of IndexWriter.releaseMergePolicy() can we have
> > IndexWriter only close the merge policy if it was the one that had
> > created it? (Similar to how IndexWriter closes the dir if it has
> > opened it from a String or File, but does not close it if it was
> > passed in).
> 
> This precludes
> 
> iw.setMergePolicy(new MyMergePolicy(...));
>      ...
> iw.close();
>
> The implementation's much cleaner using the semantics you describe,
> but I was thinking it'd be better to optimize for the usability of the
> common client code case? 

The thing is, that method leaves IndexWriter in a broken state (null
mergePolicy).  What if you keep adding docs after that then suddenly
hit an NPE?

Also, I'm OK if people need to separately close their MergePolicy
instances: this is an advanced use of Lucene so it's OK to expect that
("simple things should be simple; complex things should be possible").

Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
doClose)" and default doClose to true?

Finally: does MergePolicy really need a close()?  Is this overkill (at
this point)?

> > Well I think the current MergePolicy API (where the "merge" method
> calls IndexWriter.merge itself, must cascade itself, etc.) makes it
> hard to build a generic ConcurrentMergePolicy "wrapper" that you
> could use to make any MergePolicy concurrent (?). How would you do
> it?
>
> I really haven't had time to go heads down on this (the old
> concurrent merge policy was a derived class rather than a wrapper
> class). But I was thinking that perhaps ConurrentMergePolicy would
> actually wrap IndexWriter as well as the serial merge policy, i.e.,
> implement IndexMerger (my biggest argument for IM at this
> point). But I haven't looked deeply at whether this will work but I
> think it has at least a chance.
>
> I should know more about this is a day or two. 

I don't see how it can work (building the generic concurrency wrapper
"under" IndexMerger) because the MergePolicy is in "serial control",
eg, when it wants to cascade merges.  How will you return that thread
back to IndexWriter?

Also it feels like the wrong place for concurrency -- I think
generally for "macro" concurrency you want it higher up, not lower
down, in the call stack.

With concurrency wrapper "on the top" it's able to easily take a merge
request as returned by the policy, kick it off in the backrground, and
immediately return control of original thread back to IndexWriter.

But if you see a way to make it work "on the bottom", let's definitely
explore it & understand the tradeoffs.

> > I think you can still have state (as instance variables in your
> > class)? How would this simplification restrict the space of merge
> > policies?
>
> s/state/stack state/. Yeah, you can always unwind your loops and
> lift your recursions, put all that stack state into instance
> variables. But, well, yuck. I'd like to make it easy to write simple
> merge policies and take up the heavy lifting elsewhere. Hopefully
> there will be more merge policies than index writers.

Can you give an example of the kind of "state" we're talking about?
Is this just academic?

Since the segments change in an unpredictable way (as seen by
MergePolicy) eg from addIndexes*, flushing, concurrent merge swapping
things at random times (thus requiring careful locking), etc, it seems
like you can't be keeping much state (stack or instance) that depends
on what segments are in the index?

> > Oh, no: ConcurrentMergePolicy would still call
> > IndexWriter.merge(spec), just with a separate thread. And so all
> > synchronization required is still inside IndexWriter (I think?).
>
> That's my idea.

Actually I was talking about my idea (to "simplify MergePolicy.merge
API").  With the simplification (whereby MergePolicy.merge just
returns the MergeSpecification instead of driving the merge itself) I
believe it's simple to make a concurrency wrapper around any merge
policy, and, have all necessary locking for SegmentInfos inside
IndexWriter.

> > In fact, if we stick with the current MergePolicy API, aren't you
> > going to have to put some locking into eg the LogDocMergePolicy
> > when concurrent merges might be happening?
>
> Yes and no. If CMP implements IndexMerger, I think we might be okay?

If CMP implements IndexMerger you must have locking inside any
MergePolicy that's calling into CMP?  Whereas with the simplified
MergePolicy.merge API, no locking is necessary because IndexWriter
would lock segmentInfos whenever it calls MergePolicy.merge.

> In the previous iteration, I used derivation so that
> ConcurrentLogDocMergePolicy derived from the serial version but had
> the necessary threading. I agree that a wrapper is better solution
> if it can be made to work.

I think it (concurrency wrapper around any merge policy) can be made
to work, if we do simplify the MergePolicy.merge API.  I'm not sure it
can be made to work if we don't, but if you have an approach we should
work through it!


> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to