On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:

[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210 ]

Steven Parkes commented on LUCENE-847:
--------------------------------------

> Is the separate IndexMerger interface really necessary?

I wrestled with this, so in the past it wouldn't have taken much to convince me otherwise. The reason for the extra interface is that I was hoping to discourage, or at least create a little extra friction for, merge policies looking too deeply into the internals of IndexWriter.

As an example, it's not a good idea for merge policies to be able to access IndexWriter#segmentInfos directly. (That's a case I'd like to solve by making segmentInfos private, but I haven't looked at that completely yet; even beyond that case, I'd still like merge policies to have a very clean interface with IndexWriter.)

It feels kind of weird for me to be arguing for constraints, since I work mostly in dynamic languages that have none of this stuff. But it reflects my goal that merge policies simply be algorithms, not real workers.

Moreover, I think it may be useful for implementing concurrency (see below).

> While LogDocMergePolicy does need "maxBufferedDocs", I think, instead, in
> IndexWriter's "setMaxBufferedDocs()" it should "write through" to the
> LogDocMergePolicy if that is the merge policy in use (and LogDocMergePolicy
> should store its own private "minMergeDocs").

The thing here is that the merge policy has nothing to do with max buffered docs, right? The code for buffering docs for the first segment is wholly in IndexWriter. LogDocMergePolicy happens to need the value (in its current incarnation) in order to handle the way the log levels are computed. This could, of course, be changed.
There's nothing that says a merge policy has to look at these values, just that they're available should the merge policy want to look.

I guess my idea was that the index writer was responsible for handling the initial segment (with DocsWriter, if it wants) and also for providing an indication to the merge policies of how it was handling them.

It's possible to have the merge policy influence the first segment size, but that opens up a lot of issues. Does every merge policy then have to know how to trade off between buffering by doc count and buffering by RAM? I was hoping to avoid that.

I'm not all that happy with this the way it is; supporting both by-doc and by-RAM is messy but seems necessary. This was my take on making it least messy.

> In LogDocMergePolicy, it seems like the checking that's done for whether a
> SegmentInfo is in a different directory, as well as the subsequent copy to
> move it over to the IndexWriter's directory, should not live in the
> MergePolicy?

I see two parts to this.

First, I hesitate to make it impossible for a merge policy to see the directory information, i.e., to remove IndexMerger#getDirectory(). Copying a segment is one way to get it into the current directory, but merging it with other segments does too. A merge policy could decide to go ahead and merge a bunch of segments that are in other directories rather than just copy them individually. Taking away getDirectory() removes that option.

As to how to handle the case where single segments are copied, my main reason for leaving that in the merge policy would be simplicity. It seems nicer to have all segment amalgamation management in one place, rather than some in one class and some in another. It could be factored into an optional base merge policy for derived classes to use as they like.
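The "optional base merge policy" factoring could look roughly like the following sketch. All class and method names here (SegmentInfoStub, BaseMergePolicy, foreignSegments, and so on) are hypothetical stand-ins for illustration, not Lucene's actual API or the patch under discussion:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a segment descriptor: just a name and the
// directory it lives in. (Hypothetical; not Lucene's SegmentInfo.)
class SegmentInfoStub {
    final String name;
    final String directory;
    SegmentInfoStub(String name, String directory) {
        this.name = name;
        this.directory = directory;
    }
}

// The directory check lives in an optional base class; subclasses can
// reuse it or ignore it entirely, which is the factoring proposed above.
abstract class BaseMergePolicy {
    // Segments living outside the writer's directory need to be copied
    // (or merged) into it; this helper finds them.
    protected List<SegmentInfoStub> foreignSegments(
            List<SegmentInfoStub> segments, String writerDirectory) {
        List<SegmentInfoStub> foreign = new ArrayList<>();
        for (SegmentInfoStub si : segments) {
            if (!si.directory.equals(writerDirectory)) {
                foreign.add(si);
            }
        }
        return foreign;
    }
}

// A derived policy that chooses to use the helper: it would copy each
// foreign segment individually rather than merging them together.
class CopyingMergePolicy extends BaseMergePolicy {
    List<SegmentInfoStub> toCopy(List<SegmentInfoStub> segments, String dir) {
        return foreignSegments(segments, dir);
    }
}
```

A policy that instead wanted to merge all foreign segments in one pass could override or bypass the helper, which is why keeping getDirectory() visible matters.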
> The "checkOptimize" method in LogDocMergePolicy seems like it belongs back
> in IndexWriter: I think we don't want every MergePolicy having to replicate
> that tricky while condition.

The reason for not doing this was that I can imagine different merge policies having different models of what optimize means. I can imagine policies that would not optimize everything down to a single segment but would instead obey a maximum segment size. But we could factor the default conditional into an optional base class, as above.

It is an ugly conditional, and there may be a better way to formulate it so that policies don't have to look at grody details like hasSeparateNorms. But I'd still like the merge policies to be able to decide, at a high level, what optimize means.

> Maybe we could change MergePolicy.merge to take a boolean "forced" which, if
> true, means that the MergePolicy must now pick a merge even if it didn't
> think it was time to. This would let us move that tricky while condition
> logic back into IndexWriter.

I didn't follow this. Is forced == optimize? If not, what does "pick a merge" mean? I'm not sure what LogDoc should do for merge(force=true) if it thinks everything is fine.

> Also, I think at some point we may want to have an optimize() method that
> takes an int parameter stating the max # of segments in the resulting index.

I think this is great functionality for a merge policy, but what about just making that part of the individual merge policy interface, rather than linking it to IndexWriter? That was one goal of making a factored merge policy: that you could do these tweaks without changing IndexWriter.

> This would allow you to optimize down to <= N segments w/o paying the full
> cost of a complete "only one segment" optimize.
> If we had a "forced" boolean then we could build such an optimize method in
> the future, whereas as it stands now it would not be so easy to retrofit
> previously created MergePolicy classes to do this.

I haven't looked at how difficult it would be to make LogDoc sufficiently derivable to allow a derived class to add this tweak. If I could, would that be enough?

> There are some minor things that should not be committed, e.g., the added
> "infoStream = null" in IndexFileDeleter. I typically try to put a comment
> "// nocommit" above such changes as I make them, to remind myself and keep
> them from slipping in.

You're right, and thanks for the idea. Obvious now.

> In the deprecated IndexWriter methods you're doing a hard cast to
> LogDocMergePolicy, which gives a bad result if you're using a different
> merge policy; maybe catch the class cast exception (or, better, check
> upfront if it's an instanceof) and raise a more reasonable exception if not?

Agreed.

> IndexWriter around line 1908 has a for loop with a commented-out
> "System.err.println"; we should just comment out/remove the for loop.

The whole loop will be gone before commit, but I didn't want to delete it yet since I might need to turn it back on for more debugging. It should, of course, have a "// nocommit" comment.

> These commented-out synchronized blocks spook me a bit:
>
>     /* synchronized(segmentInfos) */ {

This is from my old code. I left it in there as a hint as I work on the concurrent stuff again. It's only a weak hint, though, because things have changed a lot since that code was even partially functional. Ignore it. It won't go into the serial patch, and anything for LUCENE-870 will have to have its own justification.

> Can we support non-contiguous merges? If I understand it correctly, the
> MergeSpecification can express such a merge; it's just that the current
> IndexMerger (IndexWriter) cannot execute it, right?
> So we are at least not precluding fixing this in the future.

As far as I've seen so far, there are no barriers to non-contiguous merges. Maybe something will crop up, but in what I've done, I haven't seen any.

> It confuses me that MergePolicy has a method "merge(...)" -- can we rename
> it to "maybeMerge(..)" or "checkForMerge(...)"?

I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of "merge" like "optimize": make it so / idempotent. But I'm certainly willing to write whatever people find clearest.

> Instead of IndexWriter.releaseMergePolicy() can we have IndexWriter only
> close the merge policy if it was the one that had created it? (Similar to
> how IndexWriter closes the dir if it has opened it from a String or File,
> but does not close it if it was passed in.)

This precludes

    iw.setMergePolicy(new MyMergePolicy(...));
    ...
    iw.close();

You're always going to have to

    MergePolicy mp = new MyMergePolicy(...);
    iw.setMergePolicy(mp);
    ...
    iw.close();
    mp.close();

The implementation's much cleaner using the semantics you describe, but I was thinking it would be better to optimize for the usability of the common client code case.

> Well, I think the current MergePolicy API (where the "merge" method calls
> IndexWriter.merge itself, must cascade itself, etc.) makes it hard to build
> a generic ConcurrentMergePolicy "wrapper" that you could use to make any
> MergePolicy concurrent (?). How would you do it?

I really haven't had time to go heads down on this (the old concurrent merge policy was a derived class rather than a wrapper class). But I was thinking that perhaps ConcurrentMergePolicy would actually wrap IndexWriter as well as the serial merge policy, i.e., implement IndexMerger (my biggest argument for IndexMerger at this point). I haven't looked deeply at whether this will work, but I think it has at least a chance.

I should know more about this in a day or two.
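The ownership question above (close-if-created vs. an explicit releaseMergePolicy()) can be sketched with a couple of stand-in classes. WriterStub and PolicyStub are hypothetical, not Lucene classes; the point is only the "close it if and only if the writer created it" bookkeeping:

```java
// Hypothetical stand-in for a closeable merge policy.
class PolicyStub {
    boolean closed = false;
    void close() { closed = true; }
}

// Hypothetical stand-in for IndexWriter's ownership bookkeeping,
// mirroring how IndexWriter closes a Directory it opened itself but
// leaves a caller-supplied one alone.
class WriterStub {
    private PolicyStub mergePolicy = new PolicyStub(); // default: writer-owned
    private boolean ownsPolicy = true;

    void setMergePolicy(PolicyStub mp) {
        if (ownsPolicy) {
            mergePolicy.close(); // release the default policy we created
        }
        mergePolicy = mp;
        ownsPolicy = false; // caller is now responsible for closing it
    }

    PolicyStub getMergePolicy() { return mergePolicy; }

    void close() {
        if (ownsPolicy) {
            mergePolicy.close();
        }
    }
}
```

Under these semantics a caller who sets a custom policy must keep a reference and close it after the writer, which is exactly the extra client-code burden the reply argues against.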
> I think you can still have state (as instance variables in your class)?
> How would this simplification restrict the space of merge policies?

s/state/stack state/. Yeah, you can always unwind your loops and lift your recursions, putting all that stack state into instance variables. But, well, yuck. I'd like to make it easy to write simple merge policies and take up the heavy lifting elsewhere. Hopefully there will be more merge policies than index writers.

> Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
> just with a separate thread. And so all synchronization required is still
> inside IndexWriter (I think?).

That's my idea.

The synchronization is still tricky, since parts of segmentInfos are getting changed at various times and there are references and/or copies of it in other places. And as Ning pointed out to me, we also have to deal with buffered delete terms. I'd say I got about 80% of the way there on the last go-round. I'm hoping to get all the way this time.

> In fact, if we stick with the current MergePolicy API, aren't you going to
> have to put some locking into, e.g., the LogDocMergePolicy when concurrent
> merges might be happening?

Yes and no. If ConcurrentMergePolicy implements IndexMerger, I think we might be okay. In the previous iteration, I used derivation, so that ConcurrentLogDocMergePolicy derived from the serial version but had the necessary threading. I agree that a wrapper is a better solution if it can be made to work.
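The wrapper idea being discussed — a concurrent layer that hands the chosen merge to a background thread while all shared-state synchronization stays inside the writer — might be sketched like this. MergerStub, SynchronizedWriterStub, and ConcurrentMergeWrapper are illustrative names standing in for IndexMerger, IndexWriter, and ConcurrentMergePolicy; this is a sketch of the approach, not the actual patch:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Stand-in for the narrow IndexMerger interface the serial policy sees.
interface MergerStub {
    void merge(String spec); // stands in for IndexWriter.merge(spec)
}

// Stand-in for the writer: merge() is synchronized, so the locking
// lives here rather than in any merge policy.
class SynchronizedWriterStub implements MergerStub {
    final List<String> executed = new CopyOnWriteArrayList<>();
    public synchronized void merge(String spec) {
        executed.add(spec);
    }
}

// The concurrent wrapper: it does not pick merges itself; it only runs
// the serial policy's chosen merge on its own thread.
class ConcurrentMergeWrapper {
    private final MergerStub writer;
    ConcurrentMergeWrapper(MergerStub writer) {
        this.writer = writer;
    }

    // Return the thread so callers (or close()) can join it.
    Thread mergeInBackground(String spec) {
        Thread t = new Thread(() -> writer.merge(spec));
        t.start();
        return t;
    }
}
```

Because the wrapper only moves the merge() call onto another thread, the serial policy's own logic needs no locking — which is the "yes" half of the "yes and no" answer above.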
> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.txt
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]