[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210
]
Steven Parkes commented on LUCENE-847:
--------------------------------------
Is the separate IndexMerger interface really necessary?
I wrestled with this, so it wouldn't have taken much to convince me otherwise.
The reason for the extra interface is that I was hoping to discourage merge
policies from looking too deeply into the internals of IndexWriter, or at least
to create a little extra friction for doing so.
As an example, it's not a good idea for merge policies to be able to access
IndexWriter#segmentInfos directly. (That's a case I would like to solve by
making segmentInfos private, but I haven't looked at that completely yet; and
even beyond that case, I'd still like merge policies to have a very clean
interface to IndexWriter.)
It feels kinda weird for me to be arguing for constraints since I work mostly
in dynamic languages that have none of this stuff. But it reflects my goal that
merge policies simply be algorithms, not real workers.
Moreover, I think it may be useful for implementing concurrency (see below).
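To make that concrete, here's roughly the narrow view I'd like merge policies
to have of the writer. This is only a sketch; the method names and signatures
approximate what's in the patch rather than quoting it:

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Sketch only: the shape of the contract, not the exact signatures in the
// patch.  The point is that a MergePolicy talks to this narrow interface,
// not to IndexWriter's internals.
public interface IndexMerger {
  Directory getDirectory();      // where merged segments end up
  int getMaxBufferedDocs();      // how the writer is buffering the first segment
  void merge(MergeSpecification spec) throws IOException;
}

IndexWriter would implement it, and, I hope, so could a concurrent wrapper
(more on that below).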
While LogDocMergePolicy does need "maxBufferedDocs", I think,
instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
through" to the LogDocMergePolicy if that is the merge policy in
use (and LogDocMergePolicy should store its own private
"minMergeDocs").
The thing here is that the merge policy has nothing to do with max buffered
docs, right? The code for buffering docs for the first segment is wholly in the
IndexWriter. LogDocMergePolicy happens to need it (in its current incarnation)
in order to handle the way the log levels are computed. This could, of course,
be changed. There's nothing that says a merge policy has to look at these
values, just that they're available should the merge policy want to look.
I guess my idea was that the index writer was responsible for handling the
initial segment (with DocsWriter, if it wants) and also for providing an
indication to the merge policies of how it was handling them.
It's possible to have the merge policy influence the first segment size, but
that opens up a lot of issues. Does every merge policy then have to know how to
trade off between buffering by doc count and buffering by RAM? I was hoping to
avoid that.
I'm not all that happy with the way this is; supporting both by-doc and by-RAM
buffering is messy but seems necessary. This was my take on making it the least
messy.
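If we did go with the write-through idea, I'd picture something like the
following inside IndexWriter. Sketch only; it assumes LogDocMergePolicy grows
a setMinMergeDocs() of its own, which isn't in the current patch:

// Sketch of the "write through" behavior, not code from the patch.
// IndexWriter keeps its own buffering knob and forwards it to the merge
// policy only when the policy in use actually cares about it.
public void setMaxBufferedDocs(int maxBufferedDocs) {
  this.maxBufferedDocs = maxBufferedDocs;
  if (mergePolicy instanceof LogDocMergePolicy) {
    ((LogDocMergePolicy) mergePolicy).setMinMergeDocs(maxBufferedDocs);
  }
}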
In LogDocMergePolicy, it seems like the checking that's done for
whether a SegmentInfo is in a different directory, as well as the
subsequent copy to move it over to the IndexWriter's directory,
should not live in the MergePolicy?
I see two parts to this.
First, I hesitate to make it impossible for a merge policy to see the directory
information, i.e., to remove IndexMerger#getDirectory(). Copying a segment is
one way to get it into the current directory, but merging it with other
segments does too. A merge policy could decide to go ahead and merge a bunch of
segments that are in other directories rather than just copy them individually.
Taking away getDirectory() removes that option.
As to how to handle the case where single segments are copied, I guess my main
reason for leaving that in the merge policy would be simplicity. It seems nicer
to have all segment amalgamation management in one place, rather than some in
one class and some in another. It could be factored into an optional base merge
policy for derived classes to use as they like.
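Something like the following is what I mean by an optional base class. The
names are made up, it assumes MergePolicy stays an interface, and a policy that
doesn't want the behavior simply doesn't call the helper:

// Sketch of an optional helper, not part of the patch.  SegmentInfo.dir is
// compared against the merger's directory to decide whether the segment
// needs to be copied (or merged) over.
public abstract class BaseMergePolicy implements MergePolicy {
  protected boolean isForeign(SegmentInfo info, IndexMerger merger) {
    return info.dir != merger.getDirectory();
  }
}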
The "checkOptimize" method in LogDocMergePolicy seems like it
belongs back in IndexWriter: I think we don't want every
MergePolicy having to replicate that tricky while condition.
The reason for not doing this was that I can imagine different merge policies
having a different model of what optimize means. I can imagine some policies
that would not optimize everything down to a single segment but would instead
obey a maximum segment size. But we could factor the default conditional into
an optional base class, as above.
It is an ugly conditional, and there may be a better way to formulate it so
that policies don't have to look at grody details like hasSeparateNorms. But
I'd still like the merge policies to be able to decide, at a high level, what
optimize means.
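For instance, the default test could live in that optional base class as an
overridable hook, roughly like this. The details of the conditional are
approximate and the hook name is made up:

// Roughly the default "is the index optimized yet?" test, pulled into an
// overridable method so a policy can redefine what optimize means (e.g.
// "no segment larger than X") without copying this conditional.
protected boolean isOptimized(SegmentInfos infos, IndexMerger merger)
    throws IOException {
  return infos.size() == 1
      && !SegmentReader.hasDeletions(infos.info(0))
      && !SegmentReader.hasSeparateNorms(infos.info(0))
      && infos.info(0).dir == merger.getDirectory();
}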
Maybe we could change MergePolicy.merge to take a boolean "forced"
which, if true, means that the MergePolicy must now pick a merge
even if it didn't think it was time to. This would let us move
that tricky while condition logic back into IndexWriter.
I didn't follow this. Does forced == optimize? If not, what does "pick a merge"
mean? I'm not sure what LogDoc should do for merge(forced=true) if it thinks
everything is fine.
Also, I think at some point we may want to have an optimize()
method that takes an int parameter stating the max # segments in
the resulting index.
I think this is great functionality for a merge policy, but what about just
making it part of the individual merge policy's interface, rather than linking
it to IndexWriter? That was one goal of factoring out the merge policy: you can
make these tweaks without changing IndexWriter.
This would allow you to optimize down to <=
N segments w/o paying full cost of a complete "only one segment"
optimize. If we had a "forced" boolean then we could build such
an optimize method in the future, whereas as it stands now it
would not be so easy to retrofit previously created MergePolicy
classes to do this.
I haven't looked at how difficult it would be to make LogDoc sufficiently
derivable to allow a derived class to add this tweak. If it could be done,
would that be enough?
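If it were derivable enough, the capped optimize would be a pretty small
subclass. Purely illustrative, and it assumes a hook like the isOptimized()
sketch above; the knob name is made up:

// Illustrative only: a policy that treats "optimized" as "at most N segments"
// rather than "exactly one segment".
public class CappedOptimizeMergePolicy extends LogDocMergePolicy {
  private int maxOptimizeSegments = 10;

  public void setMaxOptimizeSegments(int n) { maxOptimizeSegments = n; }

  protected boolean isOptimized(SegmentInfos infos, IndexMerger merger) {
    return infos.size() <= maxOptimizeSegments;
  }
}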
There are some minor things that should not be committed eg the
added "infoStream = null" in IndexFileDeleter. I typically try to
put a comment "// nocommit" above such changes as I make them to
remind myself and keep them from slipping in.
You're right and thanks for the idea. Obvious now.
In the deprecated IndexWriter methods you're doing a hard cast to
LogDocMergePolicy which gives a bad result if you're using a
different merge policy; maybe catch the class cast exception (or,
better, check upfront if it's an instanceof) and raise a more
reasonable exception if not?
Agreed.
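Something along these lines for the deprecated setters, i.e., check the type up
front and fail with a message that says what went wrong. Sketch only; it
assumes LogDocMergePolicy carries the corresponding setter, and the exception
choice is open:

// Sketch of the fix for the deprecated IndexWriter setters: verify the merge
// policy type before casting, and give a meaningful error if it doesn't match.
public void setMaxMergeDocs(int maxMergeDocs) {
  if (!(mergePolicy instanceof LogDocMergePolicy)) {
    throw new IllegalArgumentException(
        "this method requires LogDocMergePolicy; current merge policy is "
        + mergePolicy.getClass().getName());
  }
  ((LogDocMergePolicy) mergePolicy).setMaxMergeDocs(maxMergeDocs);
}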
IndexWriter around line 1908 has for loop that has commented out
"System.err.println"; we should just comment out/remove for loop
The whole loop will be gone before commit but I didn't want to delete it yet
since I might need to turn it back on for more debugging. It should, of
course, have a "// nocommit" comment.
These commented out synchronized spook me a bit:
/* synchronized(segmentInfos) */ {
This is from my old code. I left it in there as a hint as I work on the
concurrent stuff again. It's only a weak hint, though, because things have
changed a lot since that code was even partially functional. Ignore it. It
won't go into the serial patch and anything for LUCENE-870 will have to have
its own justification.
Can we support non-contiguous merges? If I understand it
correctly, the MergeSpecification can express such a merge, it's
just that the current IndexMerger (IndexWriter) cannot execute it,
right? So we are at least not precluding fixing this in the
future.
As far as I've seen so far, there are no barriers to non-contiguous merges.
Maybe something will crop up that is a barrier, but in what I've done so far, I
haven't seen any.
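As a hypothetical illustration (the real MergeSpecification in the patch may be
shaped differently), nothing about describing a merge forces the chosen
segments to be adjacent:

// Hypothetical illustration only: a policy picking non-adjacent segments.
// Whether IndexWriter can execute such a merge today is a separate question.
SegmentInfos toMerge = new SegmentInfos();
toMerge.add(segmentInfos.info(0));   // non-contiguous picks
toMerge.add(segmentInfos.info(2));
toMerge.add(segmentInfos.info(5));
// merger.merge(new MergeSpecification(toMerge));  // constructor shape is a guess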
It confuses me that MergePolicy has a method "merge(...)" -- can
we rename it to "maybeMerge(..)" or "checkForMerge(...)"?
I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of
"merge" like "optimize": make it so / idempotent. But I'm certainly willing to
write whatever people find clearest.
Instead of IndexWriter.releaseMergePolicy() can we have
IndexWriter only close the merge policy if it was the one that had
created it? (Similar to how IndexWriter closes the dir if it has
opened it from a String or File, but does not close it if it was
passed in).
This precludes
iw.setMergePolicy(new MyMergePolicy(...));
...
iw.close();
You're always going to have to
MergePolicy mp = new MyMergePolicy(...);
iw.setMergePolicy(mp);
...
iw.close();
mp.close();
The implementation's much cleaner using the semantics you describe, but I was
thinking it'd be better to optimize for the usability of the common client-code
case.
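For comparison, the bookkeeping you're describing would look roughly like this
inside IndexWriter (sketch only; the field names and the default are made up):

// Sketch of the "close it only if we created it" semantics, not the patch.
private MergePolicy mergePolicy = new LogDocMergePolicy();  // writer-created default
private boolean closeMergePolicy = true;                    // we made it, so we close it

public void setMergePolicy(MergePolicy mp) {
  mergePolicy = mp;
  closeMergePolicy = false;   // caller-supplied: the caller is expected to close it
}

// ... and in close():
//   if (closeMergePolicy) mergePolicy.close();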
Well I think the current MergePolicy API (where the "merge" method
calls IndexWriter.merge itself, must cascade itself, etc.) makes it
hard to build a generic ConcurrentMergePolicy "wrapper" that you could
use to make any MergePolicy concurrent (?). How would you do it?
I really haven't had time to go heads down on this (the old concurrent merge
policy was a derived class rather than a wrapper class). But I was thinking
that perhaps ConcurrentMergePolicy would actually wrap IndexWriter as well as
the serial merge policy, i.e., implement IndexMerger (my biggest argument for
IndexMerger at this point). I haven't looked deeply at whether this will work,
but I think it has at least a chance.
I should know more about this in a day or two.
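In the meantime, the shape I have in mind is very roughly the following.
Everything here is a guess about signatures (in particular what
MergePolicy.merge takes), and the threading shown is certainly not sufficient
by itself:

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Very rough sketch: ConcurrentMergePolicy sits between the writer and a
// serial policy.  To the writer it looks like a MergePolicy; to the wrapped
// policy it looks like an IndexMerger, so merges the wrapped policy asks for
// can be handed off to a background thread.  (close(), the optimize path,
// etc. are omitted here.)
public class ConcurrentMergePolicy implements MergePolicy, IndexMerger {
  private final IndexMerger writer;        // the real IndexWriter
  private final MergePolicy serialPolicy;  // the policy being wrapped

  public ConcurrentMergePolicy(IndexMerger writer, MergePolicy serialPolicy) {
    this.writer = writer;
    this.serialPolicy = serialPolicy;
  }

  // MergePolicy side: the writer calls us; we call the wrapped policy,
  // passing ourselves as its IndexMerger.
  public void merge(IndexMerger ignored) throws IOException {
    serialPolicy.merge(this);
  }

  // IndexMerger side: delegate the read-only bits to the writer...
  public Directory getDirectory() { return writer.getDirectory(); }
  public int getMaxBufferedDocs() { return writer.getMaxBufferedDocs(); }

  // ...and run the actual merge on a separate thread.
  public void merge(final MergeSpecification spec) throws IOException {
    new Thread() {
      public void run() {
        try {
          writer.merge(spec);
        } catch (IOException e) {
          // error handling is an open question in this sketch
        }
      }
    }.start();
  }
}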
I think you can still have state (as instance variables in your
class)? How would this simplification restrict the space of merge
policies?
s/state/stack state/. Yeah, you can always unwind your loops, lift your
recursions, and put all that stack state into instance variables. But, well,
yuck. I'd like to make it easy to write simple merge policies and do the heavy
lifting elsewhere. Hopefully there will be more merge policies than index
writers.
Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
just with a separate thread. And so all synchronization required is
still inside IndexWriter (I think?).
That's my idea.
The synchronization is still tricky, since parts of segmentInfos get changed at
various times and there are references to and/or copies of it in other places.
And, as Ning pointed out to me, we also have to deal with buffered delete
terms. I'd say I got about 80% of the way there on the last go-around. I'm
hoping to get all the way this time.
In fact, if we stick with the current MergePolicy API, aren't you
going to have to put some locking into eg the LogDocMergePolicy when
concurrent merges might be happening?
Yes and no. If CMP implements IndexMerger, I think we might be okay. In the
previous iteration, I used derivation, so that ConcurrentLogDocMergePolicy
derived from the serial version but had the necessary threading. I agree that a
wrapper is a better solution if it can be made to work.
> Factor merge policy out of IndexWriter
> --------------------------------------
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Parkes
> Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt,
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.