factoring the merge policy

Steven Parkes Mon, 12 Mar 2007 19:25:46 -0800

I've been thinking about merge issues for a while and going through
IndexWriter to see if I could convince myself I understood it.


There are areas I'm interested in exploring by tweaking the merge policy.
For example, it might be nice to have an optimize-like operation that
could look at the number of deleted documents in a segment (or sequence
of segments) and perform merges to remove deleted docs but without
reducing the index to a single file. Tweaks like this via IndexWriter
are, well, not easy. Perhaps not even possible, given what's final and
what's not.
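
As an illustration of the kind of decision such an operation would need
to make, here is a minimal sketch that finds runs of adjacent segments
whose deleted-document ratio is above a threshold. DeleteRunFinder,
findRuns and the ratio parameter are hypothetical names, and segments are
reduced to plain counts; nothing here is existing Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: segments are modeled as two
// parallel arrays of total and deleted document counts.
final class DeleteRunFinder {

    // Returns {first, last} index pairs (both inclusive) for each maximal
    // run of adjacent segments whose deleted ratio is >= minDeletedRatio.
    static List<int[]> findRuns(int[] docCounts, int[] deletedCounts,
                                double minDeletedRatio) {
        List<int[]> runs = new ArrayList<>();
        int start = -1; // start of the run currently being extended, or -1
        for (int i = 0; i < docCounts.length; i++) {
            boolean heavy = docCounts[i] > 0
                && (double) deletedCounts[i] / docCounts[i] >= minDeletedRatio;
            if (heavy && start < 0) {
                start = i;                          // open a new run
            } else if (!heavy && start >= 0) {
                runs.add(new int[] {start, i - 1}); // close the current run
                start = -1;
            }
        }
        if (start >= 0) {
            runs.add(new int[] {start, docCounts.length - 1});
        }
        return runs;
    }
}
```

Each run could then be merged on its own, which would drop the deleted
docs in that run without touching the rest of the index.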

So I've been experimenting with factoring the merge policy out of
IndexWriter and making it pluggable.  In such a world, IndexWriter would
still do all the merges, but it would delegate the decision on what
segments to merge to a merge policy:

interface MergePolicy {

  static class MergeSpecification {
    SegmentInfos segmentInfos;
    int first;
    int last;
    boolean useCompoundFile;
  }

  void merge( SegmentInfos segmentInfos )
                throws CorruptIndexException, IOException;
  void optimize( SegmentInfos segmentInfos )
                throws CorruptIndexException, IOException;

}

The merge policy gets created with an IndexWriter and then called via
merge or optimize with the relevant segmentInfos. It in turn calls back
to the IndexWriter for each primitive merge operation:

        int merge( MergePolicy.MergeSpecification m )
                    throws CorruptIndexException, IOException;
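
The round trip described above (writer asks the policy, policy calls back
for each primitive merge) can be sketched with stand-in types. Everything
in this block is hypothetical: SketchWriter, SketchMergePolicy,
SketchMergeSpec and EqualRunPolicy are not Lucene classes, a segment is
reduced to its document count, "last" is assumed to be inclusive, and the
toy policy merges whenever the trailing mergeFactor segments are the same
size.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for MergeSpecification: which subsequence to merge.
final class SketchMergeSpec {
    final int first; // index of the first segment to merge
    final int last;  // index of the last segment, inclusive (assumption)
    SketchMergeSpec(int first, int last) { this.first = first; this.last = last; }
}

// Stand-in for the policy interface: decides what to merge, never merges itself.
interface SketchMergePolicy {
    void merge(List<Integer> segments);
}

// Stand-in for IndexWriter: owns the segment list and performs the merges.
final class SketchWriter {
    final List<Integer> segments = new ArrayList<>();
    private SketchMergePolicy policy;

    void setMergePolicy(SketchMergePolicy policy) { this.policy = policy; }

    void addSegment(int docCount) {
        segments.add(docCount);
        if (policy != null) policy.merge(segments); // delegate the decision
    }

    // Primitive merge: replace segments first..last with one combined segment.
    int merge(SketchMergeSpec spec) {
        List<Integer> range = segments.subList(spec.first, spec.last + 1);
        int total = 0;
        for (int docCount : range) total += docCount;
        range.clear();
        segments.add(spec.first, total);
        return total;
    }
}

// Toy policy: while the last mergeFactor segments are equal in size, merge them.
final class EqualRunPolicy implements SketchMergePolicy {
    private final SketchWriter writer;
    private final int mergeFactor;

    EqualRunPolicy(SketchWriter writer, int mergeFactor) {
        this.writer = writer;
        this.mergeFactor = mergeFactor;
    }

    @Override
    public void merge(List<Integer> segments) {
        while (segments.size() >= mergeFactor) {
            int last = segments.size() - 1;
            int first = last - mergeFactor + 1;
            int size = segments.get(last);
            for (int i = first; i < last; i++) {
                if (segments.get(i).intValue() != size) return; // run not uniform
            }
            writer.merge(new SketchMergeSpec(first, last)); // call back to the writer
        }
    }
}
```

With a mergeFactor of 3, adding nine one-document segments to a
SketchWriter cascades through two levels and leaves a single segment of
nine documents.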

When I started playing with this, I wasn't sure how well it would work
but it seems to have come out pretty well. A lot more feasible, I
think, than actually trying to derive from IndexWriter to change only
the policy.

My initial impetus for this was to be able to tweak how the large, old
segments were handled. But all the concurrent merge stuff from a few
weeks ago gave me a bunch more to think about.

One of the things discussed there was having a single background thread
vs. multiple threads. Given the logarithmic nature of the current merge
policy, it seems to me that an obvious candidate threaded policy is to
have a merge thread at each logarithmic level (with a sanity bound for
small levels). That way when you kick off a merge of big segments, which
will take a while, you can still be merging smaller segments in
parallel. The obvious candidate concurrency limit would be to not allow
two merges at the same level concurrently. It wouldn't solve all
problems, but I think it would tend to spread out the effects of
cascading merges. And it so nicely fits in with the logarithmic merge
policy.
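
A minimal sketch of the "at most one merge per level" rule, under stated
assumptions: LevelGate, levelOf and startIfFree are hypothetical names, a
segment's level is taken to be the floor of log base mergeFactor of its
document count, and the sanity bound for small levels is not modeled.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: tracks which logarithmic levels
// currently have a merge running.
final class LevelGate {
    private final Set<Integer> busyLevels = new HashSet<>();

    // Level 0 holds segments smaller than mergeFactor docs, level 1 those
    // smaller than mergeFactor^2, and so on. Assumes mergeFactor >= 2.
    static int levelOf(int docCount, int mergeFactor) {
        int level = 0;
        for (long bound = mergeFactor; docCount >= bound; bound *= mergeFactor) {
            level++;
        }
        return level;
    }

    // True if the level was free and is now reserved for the caller.
    synchronized boolean tryAcquire(int level) {
        return busyLevels.add(level);
    }

    synchronized void release(int level) {
        busyLevels.remove(level);
    }

    // Runs mergeWork on a new thread if no merge is active at this level.
    // Returns the started thread, or null if the level was busy.
    Thread startIfFree(int level, Runnable mergeWork) {
        if (!tryAcquire(level)) return null;
        Thread t = new Thread(() -> {
            try {
                mergeWork.run();
            } finally {
                release(level); // free the level even if the merge fails
            }
        });
        t.start();
        return t;
    }
}
```

A long-running merge at a high level holds only that level's slot, so
merges of small segments at lower levels can still be started.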

There are all sorts of issues to consider with something like that, but
what's interested me this last week is that I think factoring the
merge policy actually makes it pretty easy to play with these ideas. The
classic merge policy, i.e., LogarithmicMergePolicy, simply decides what
merges to pursue and calls IndexWriter#merge to make them happen. A
ThreadedLogarithmicMergePolicy, derived from the LMP, can keep track of
the necessary merges and their levels, creating/using threads when a
merge is needed on a level where there's no conflict. In that new
thread, it calls back to IndexWriter#merge while the calling thread
returns normally. Obviously this means IndexWriter#merge isn't
synchronized and, at least in this case, should only be called via the
merge policy. The only thing IndexWriter#merge really needs is the
SegmentInfos subsequence that it is merging. It shouldn't need any other
state. When it's done, it'll need to update the SegmentInfos object and
that operation needs to be synchronized, but that's only during the
update. The merges by definition do not overlap (and thus maintain
document order).
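
The split between unsynchronized merge work and a short synchronized
update can be sketched as follows. This is only an illustration:
SketchSegment and SketchSegmentList are hypothetical stand-ins for
SegmentInfo and SegmentInfos, and locating the merged range by object
identity at commit time is an assumption made here so that another
merge committing earlier in the list does not invalidate stored indexes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a segment: a name and a document count.
final class SketchSegment {
    final String name;
    final int docCount;
    SketchSegment(String name, int docCount) { this.name = name; this.docCount = docCount; }
}

// Hypothetical stand-in for the shared segment list.
final class SketchSegmentList {
    private final List<SketchSegment> segments = new ArrayList<>();

    synchronized void add(SketchSegment s) { segments.add(s); }

    // Copy of the subsequence first..last (inclusive) for a merge to work on.
    synchronized List<SketchSegment> range(int first, int last) {
        return new ArrayList<>(segments.subList(first, last + 1));
    }

    // The expensive part: needs only its own subsequence, so no lock is held.
    static SketchSegment mergeUnlocked(List<SketchSegment> sources) {
        StringBuilder name = new StringBuilder();
        int docs = 0;
        for (SketchSegment s : sources) {
            name.append(s.name);
            docs += s.docCount;
        }
        return new SketchSegment(name.toString(), docs);
    }

    // The short part: swap the sources for the merged segment, in place,
    // so document order is preserved.
    synchronized void commit(List<SketchSegment> sources, SketchSegment merged) {
        int first = segments.indexOf(sources.get(0));
        if (first < 0 || first + sources.size() > segments.size()) {
            throw new IllegalStateException("merged range is no longer in the list");
        }
        for (int i = 0; i < sources.size(); i++) {
            if (segments.get(first + i) != sources.get(i)) {
                throw new IllegalStateException("merged range is no longer contiguous");
            }
        }
        segments.subList(first, first + sources.size()).clear();
        segments.add(first, merged);
    }

    synchronized String names() {
        StringBuilder sb = new StringBuilder();
        for (SketchSegment s : segments) {
            if (sb.length() > 0) sb.append(',');
            sb.append(s.name);
        }
        return sb.toString();
    }
}
```

Two merges over disjoint ranges can commit in either order here; the
second commit finds its range again by identity even though the first
commit shifted the indexes.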

Making concurrency like that possible in IndexWriter would take a little
tweaking, but I don't think we're talking much code or any significant
performance impact. The default merge policy would be what it is now
(factored), and only when using a threaded merge policy would any of the
threading stuff come into play. The choice would be made via
IndexWriter#setMergePolicy.

All the threading stuff is speculation at this point: I've got a
factored (unpolished) version of trunk working with a factored merge
policy but haven't tried to implement threading yet.

Trying to simplify the putative merge policy interface does have some
minor impact on the existing merge policy: at this point, one test gives
different results, a result of the fact that the merge policy, as I've
been thinking of it, doesn't have different calls for combining multiple
indexes vs. merging a single index with segments of inconsistent size.
Rather than assume that segment sequence is consistent except when told
otherwise, it always checks. But that's a whole 'nother discussion ...
