[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210
]
Steven Parkes commented on LUCENE-847:
--------------------------------------
Is the separate IndexMerger interface really necessary?
I wrestled with this, so it wouldn't have taken much to convince me otherwise.
The reason for the extra interface is that I was hoping to discourage merge
policies from looking too deeply into the internals of IndexWriter, or at least
to create a little extra friction for doing so.
As an example, it's not a good idea for merge policies to be able to access
IndexWriter#segmentInfos directly. (That's a case I would like to solve by
making segmentInfos private, but I haven't looked at that completely yet; and
even beyond that case, I'd still like merge policies to have a very clean
interface to IndexWriter.)
It feels kinda weird for me to be arguing for constraints since I work mostly
in dynamic languages that have none of this stuff. But it reflects my goal that
merge policies simply be algorithms, not real workers.
Moreover, I think it may be useful for implementing concurrency (see below).
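To make that concrete, here's roughly the narrow view I'd like merge policies
to have of the writer. This is only a sketch; the method names and signatures
approximate what's in the patch rather than quoting it:

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Sketch only: the shape of the contract, not the exact signatures in the
// patch.  The point is that a MergePolicy talks to this narrow interface,
// not to IndexWriter's internals.
public interface IndexMerger {
  Directory getDirectory();      // where merged segments end up
  int getMaxBufferedDocs();      // how the writer is buffering the first segment
  void merge(MergeSpecification spec) throws IOException;
}

IndexWriter would implement it, and, I hope, so could a concurrent wrapper
(more on that below).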
While LogDocMergePolicy does need "maxBufferedDocs", I think,
instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
through" to the LogDocMergePolicy if that is the merge policy in
use (and LogDocMergePolicy should store its own private
"minMergeDocs").
The thing here is that the merge policy has nothing to do with max buffered
docs, right? The code for buffering docs for the first segment is wholly in the
IndexWriter. LogDocMergePolicy happens to need it (in its current incarnation)
in order to handle the way the log levels are computed. This could, of course,
be changed. There's nothing that says a merge policy has to look at these
values, just that they're available should the merge policy want to look.
I guess my idea was that the index writer was responsible for handling the
initial segment (with DocsWriter, if it wants) and also for providing an
indication to the merge policies of how it was handling them.
It's possible to have the merge policy influence the first segment size, but
that opens up a lot of issues. Does every merge policy then have to know how to
trade off between buffering by doc count and buffering by RAM? I was hoping to
avoid that.
I'm not all that happy with the way this is; supporting both by-doc and by-RAM
buffering is messy but seems necessary. This was my take on making it the least
messy.
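If we did go with the write-through idea, I'd picture something like the
following inside IndexWriter. Sketch only; it assumes LogDocMergePolicy grows
a setMinMergeDocs() of its own, which isn't in the current patch:

// Sketch of the "write through" behavior, not code from the patch.
// IndexWriter keeps its own buffering knob and forwards it to the merge
// policy only when the policy in use actually cares about it.
public void setMaxBufferedDocs(int maxBufferedDocs) {
  this.maxBufferedDocs = maxBufferedDocs;
  if (mergePolicy instanceof LogDocMergePolicy) {
    ((LogDocMergePolicy) mergePolicy).setMinMergeDocs(maxBufferedDocs);
  }
}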
In LogDocMergePolicy, it seems like the checking that's done for
whether a SegmentInfo is in a different directory, as well as the
subsequent copy to move it over to the IndexWriter's directory,
should not live in the MergePolicy?
I see two parts to this.
First, I hesitate to make it impossible for a merge policy to see the directory
information, i.e., to remove IndexMerger#getDirectory(). Copying a segment is
one way to get it into the current directory, but merging it with other
segments does too. A merge policy could decide to go ahead and merge a bunch of
segments that are in other directories rather than just copy them individually.
Taking away getDirectory() removes that option.
As to how to handle the case where single segments are copied, I guess my main
reason for leaving that in the merge policy would be simplicity. It seems nicer
to have all segment amalgamation management in one place, rather than some in
one class and some in another. It could be factored into an optional base merge
policy for derived classes to use as they like.
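Something like the following is what I mean by an optional base class. The
names are made up, it assumes MergePolicy stays an interface, and a policy that
doesn't want the behavior simply doesn't call the helper:

// Sketch of an optional helper, not part of the patch.  SegmentInfo.dir is
// compared against the merger's directory to decide whether the segment
// needs to be copied (or merged) over.
public abstract class BaseMergePolicy implements MergePolicy {
  protected boolean isForeign(SegmentInfo info, IndexMerger merger) {
    return info.dir != merger.getDirectory();
  }
}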
The "checkOptimize" method in LogDocMergePolicy seems like it
belongs back in IndexWriter: I think we don't want every
MergePolicy having to replicate that tricky while condition.
The reason for not doing this was that I can imagine different merge policies
having a different model of what optimize means. I can imagine some policies
that would not optimize everything down to a single segment but would instead
obey a maximum segment size. But we could factor the default conditional into
an optional base class, as above.
It is an ugly conditional, and there may be a better way to formulate it so
that policies don't have to look at grody details like hasSeparateNorms. But
I'd still like the merge policies to be able to decide, at a high level, what
optimize means.
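For instance, the default test could live in that optional base class as an
overridable hook, roughly like this. The details of the conditional are
approximate and the hook name is made up:

// Roughly the default "is the index optimized yet?" test, pulled into an
// overridable method so a policy can redefine what optimize means (e.g.
// "no segment larger than X") without copying this conditional.
protected boolean isOptimized(SegmentInfos infos, IndexMerger merger)
    throws IOException {
  return infos.size() == 1
      && !SegmentReader.hasDeletions(infos.info(0))
      && !SegmentReader.hasSeparateNorms(infos.info(0))
      && infos.info(0).dir == merger.getDirectory();
}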
Maybe we could change MergePolicy.merge to take a boolean "forced"
which, if true, means that the MergePolicy must now pick a merge
even if it didn't think it was time to. This would let us move
that tricky while condition logic back into IndexWriter.
I didn't follow this. Does forced == optimize? If not, what does "pick a merge"
mean? I'm not sure what LogDoc should do for merge(forced=true) if it thinks
everything is fine.
Also, I think at some point we may want to have an optimize()
method that takes an int parameter stating the max # segments in
the resulting index.
I think this is great functionality for a merge policy, but what about just
making it part of the individual merge policy's interface, rather than linking
it to IndexWriter? That was one goal of factoring out the merge policy: you can
make these tweaks without changing IndexWriter.
This would allow you to optimize down to <=
N segments w/o paying full cost of a complete "only one segment"
optimize. If we had a "forced" boolean then we could build such
an optimize method in the future, whereas as it stands now it
would not be so easy to retrofit previously created MergePolicy
classes to do this.
I haven't looked at how difficult it would be to make LogDoc sufficiently
derivable to allow a derived class to add this tweak. If it could be done,
would that be enough?
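If it were derivable enough, the capped optimize would be a pretty small
subclass. Purely illustrative, and it assumes a hook like the isOptimized()
sketch above; the knob name is made up:

// Illustrative only: a policy that treats "optimized" as "at most N segments"
// rather than "exactly one segment".
public class CappedOptimizeMergePolicy extends LogDocMergePolicy {
  private int maxOptimizeSegments = 10;

  public void setMaxOptimizeSegments(int n) { maxOptimizeSegments = n; }

  protected boolean isOptimized(SegmentInfos infos, IndexMerger merger) {
    return infos.size() <= maxOptimizeSegments;
  }
}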
There are some minor things that should not be committed eg the
added "infoStream = null" in IndexFileDeleter. I typically try to
put a comment "// nocommit" above such changes as I make them to
remind myself and keep them from slipping in.
You're right and thanks for the idea. Obvious now.
In the deprecated IndexWriter methods you're doing a hard cast to
LogDocMergePolicy which gives a bad result if you're using a
different merge policy; maybe catch the class cast exception (or,
better, check upfront if it's an instanceof) and raise a more
reasonable exception if not?
Agreed.
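Something along these lines for the deprecated setters, i.e., check the type up
front and fail with a message that says what went wrong. Sketch only; it
assumes LogDocMergePolicy carries the corresponding setter, and the exception
choice is open:

// Sketch of the fix for the deprecated IndexWriter setters: verify the merge
// policy type before casting, and give a meaningful error if it doesn't match.
public void setMaxMergeDocs(int maxMergeDocs) {
  if (!(mergePolicy instanceof LogDocMergePolicy)) {
    throw new IllegalArgumentException(
        "this method requires LogDocMergePolicy; current merge policy is "
        + mergePolicy.getClass().getName());
  }
  ((LogDocMergePolicy) mergePolicy).setMaxMergeDocs(maxMergeDocs);
}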
IndexWriter around line 1908 has for loop that has commented out
"System.err.println"; we should just comment out/remove for loop
The whole loop will be gone before commit but I didn't want to delete it yet
since I might need to turn it back on for more debugging. It should, of
course, have a "// nocommit" comment.
These commented out synchronized spook me a bit:
/* synchronized(segmentInfos) */ {
This is from my old code. I left it in there as a hint as I work on the
concurrent stuff again. It's only a weak hint, though, because things have
changed a lot since that code was even partially functional. Ignore it. It
won't go into the serial patch and anything for LUCENE-870 will have to have
its own justification.
Can we support non-contiguous merges? If I understand it
correctly, the MergeSpecification can express such a merge, it's
just that the current IndexMerger (IndexWriter) cannot execute it,
right? So we are at least not precluding fixing this in the
future.
As far as I've seen so far, there are no barriers to non-contiguous merges.
Maybe something will crop up that is a barrier, but in what I've done so far, I
haven't seen any.
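As a hypothetical illustration (the real MergeSpecification in the patch may be
shaped differently), nothing about describing a merge forces the chosen
segments to be adjacent:

// Hypothetical illustration only: a policy picking non-adjacent segments.
// Whether IndexWriter can execute such a merge today is a separate question.
SegmentInfos toMerge = new SegmentInfos();
toMerge.add(segmentInfos.info(0));   // non-contiguous picks
toMerge.add(segmentInfos.info(2));
toMerge.add(segmentInfos.info(5));
// merger.merge(new MergeSpecification(toMerge));  // constructor shape is a guess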
It confuses me that MergePolicy has a method "merge(...)" -- can
we rename it to "maybeMerge(..)" or "checkForMerge(...)"?
I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of
"merge" like "optimize": make it so / idempotent. But I'm certainly willing to
write whatever people find clearest.
Instead of IndexWriter.releaseMergePolicy() can we have
IndexWriter only close the merge policy if it was the one that had
created it? (Similar to how IndexWriter closes the dir if it has
opened it from a String or File, but does not close it if it was
passed in).
This precludes
iw.setMergePolicy(new MyMergePolicy(...));
...
iw.close();
You're always going to have to
MergePolicy mp = new MyMergePolicy(...);
iw.setMergePolicy(mp);
...
iw.close();
mp.close();
The implementation's much cleaner using the semantics you describe, but I was
thinking it'd be better to optimize for the usability of the common client-code
case.
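For comparison, the bookkeeping you're describing would look roughly like this
inside IndexWriter (sketch only; the field names and the default are made up):

// Sketch of the "close it only if we created it" semantics, not the patch.
private MergePolicy mergePolicy = new LogDocMergePolicy();  // writer-created default
private boolean closeMergePolicy = true;                    // we made it, so we close it

public void setMergePolicy(MergePolicy mp) {
  mergePolicy = mp;
  closeMergePolicy = false;   // caller-supplied: the caller is expected to close it
}

// ... and in close():
//   if (closeMergePolicy) mergePolicy.close();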
Well I think the current MergePolicy API (where the "merge" method
calls IndexWriter.merge itself, must cascade itself, etc.) makes it
hard to build a generic ConcurrentMergePolicy "wrapper" that you could
use to make any MergePolicy concurrent (?). How would you do it?
I really haven't had time to go heads down on this (the old concurrent merge
policy was a derived class rather than a wrapper class). But I was thinking
that perhaps ConcurrentMergePolicy would actually wrap IndexWriter as well as
the serial merge policy, i.e., implement IndexMerger (my biggest argument for
IndexMerger at this point). I haven't looked deeply at whether this will work,
but I think it has at least a chance.
I should know more about this in a day or two.
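In the meantime, the shape I have in mind is very roughly the following.
Everything here is a guess about signatures (in particular what
MergePolicy.merge takes), and the threading shown is certainly not sufficient
by itself:

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Very rough sketch: ConcurrentMergePolicy sits between the writer and a
// serial policy.  To the writer it looks like a MergePolicy; to the wrapped
// policy it looks like an IndexMerger, so merges the wrapped policy asks for
// can be handed off to a background thread.  (close(), the optimize path,
// etc. are omitted here.)
public class ConcurrentMergePolicy implements MergePolicy, IndexMerger {
  private final IndexMerger writer;        // the real IndexWriter
  private final MergePolicy serialPolicy;  // the policy being wrapped

  public ConcurrentMergePolicy(IndexMerger writer, MergePolicy serialPolicy) {
    this.writer = writer;
    this.serialPolicy = serialPolicy;
  }

  // MergePolicy side: the writer calls us; we call the wrapped policy,
  // passing ourselves as its IndexMerger.
  public void merge(IndexMerger ignored) throws IOException {
    serialPolicy.merge(this);
  }

  // IndexMerger side: delegate the read-only bits to the writer...
  public Directory getDirectory() { return writer.getDirectory(); }
  public int getMaxBufferedDocs() { return writer.getMaxBufferedDocs(); }

  // ...and run the actual merge on a separate thread.
  public void merge(final MergeSpecification spec) throws IOException {
    new Thread() {
      public void run() {
        try {
          writer.merge(spec);
        } catch (IOException e) {
          // error handling is an open question in this sketch
        }
      }
    }.start();
  }
}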
I think you can still have state (as instance variables in your
class)? How would this simplification restrict the space of merge
policies?
s/state/stack state/. Yeah, you can always unwind your loops, lift your
recursions, and put all that stack state into instance variables. But, well,
yuck. I'd like to make it easy to write simple merge policies and do the heavy
lifting elsewhere. Hopefully there will be more merge policies than index
writers.
Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
just with a separate thread. And so all synchronization required is
still inside IndexWriter (I think?).
That's my idea.
The synchronization is still tricky, since parts of segmentInfos get changed at
various times and there are references to and/or copies of it in other places.
And, as Ning pointed out to me, we also have to deal with buffered delete
terms. I'd say I got about 80% of the way there on the last go-around. I'm
hoping to get all the way this time.
In fact, if we stick with the current MergePolicy API, aren't you
going to have to put some locking into eg the LogDocMergePolicy when
concurrent merges might be happening?
Yes and no. If CMP implements IndexMerger, I think we might be okay. In the
previous iteration, I used derivation, so that ConcurrentLogDocMergePolicy
derived from the serial version but had the necessary threading. I agree that a
wrapper is a better solution if it can be made to work.
> Factor merge policy out of IndexWriter
> --------------------------------------
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Parkes
> Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt,
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.