> I haven't read the details, but should maxBufferedDocs be exposed in
> some subinterfaces instead of the MergePolicy interface?

I've been wondering about this, too, but haven't come to any strong
opinions (yet). I figured maybe playing with a few merge policies might
make things clearer.

maxBufferedDocs: is this truly an invariant of all merge policies? I
don't know. Actually, I think a better question is whether merge
policies should have any role in this at all, or whether IndexWriter
should just handle it itself. If we go forward with Mike's stuff about
writing a segment w/multiple docs w/o a merge, it's sounding more like
the buffering of docs is not actually a merge policy question.

maxMergeDocs: should all merge policies accept this?

> 1) A merge thread is started when an IndexWriter is created and
> stopped when the IndexWriter is closed. (A single merge thread is used
> for simplicity. Multiple merge threads could be used.)

I haven't looked at pooling of threads, whether it be one or more than
one, but I agree it needs to be looked at. I've heard that threads can't
be created willy-nilly in J2EE apps but instead have to be drawn from
the J2EE pool, so I figured when we look at pooling, we might need to
accommodate that kind of constrained environment.
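One way to accommodate that (just a sketch with invented names, not part of the patch): instead of spawning threads directly, the merger could take a java.util.concurrent Executor, so a J2EE app can hand in the container's managed pool while a standalone app passes whatever it likes.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch: the merger never creates threads itself; it hands
// merge tasks to whatever Executor the caller supplied, which lets a J2EE
// container route them through its own managed thread pool.
class ConcurrentMerger {
    private final Executor executor;

    ConcurrentMerger(Executor executor) {
        this.executor = executor;
    }

    void submitMerge(Runnable mergeTask) {
        // The executor decides which (pooled) thread actually runs the merge.
        executor.execute(mergeTask);
    }
}
```

A standalone app could pass `Executors.newSingleThreadExecutor()` to get the single-merge-thread behavior described above.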

> 2) The merge thread periodically checks if there is merge work to do.
> This is done by synchronously checking segmentInfos and producing a
> MergeSpecification if there is merge work to do.

It does this check via a synchronized call on IndexWriter, right?
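The way I'm picturing that check (names like checkForMerge are made up here, not taken from the patch): a synchronized method on the writer snapshots the segment state and either returns a spec or null.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the periodic check, with invented names. Because both methods
// are synchronized on the writer, the merge thread's snapshot of
// segmentInfos is consistent with concurrent addSegment calls.
class WriterSketch {
    private final List<String> segmentInfos = new ArrayList<>();
    private final int mergeFactor = 10;

    synchronized void addSegment(String name) {
        segmentInfos.add(name);
    }

    // Returns the segments to merge, or null if there's no merge work.
    synchronized List<String> checkForMerge() {
        if (segmentInfos.size() < mergeFactor) {
            return null;  // nothing to do yet
        }
        // Take the oldest mergeFactor segments as the "spec" (a copy, so the
        // merge thread can use it after releasing the lock).
        return new ArrayList<>(segmentInfos.subList(0, mergeFactor));
    }
}
```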

> 3) If a MergeSpecification is produced, the merge thread goes ahead
> and does the merge. Importantly, documents in the segments being
> merged may be deleted while the concurrent merge is happening. These
> deletes have to be remembered.

Yup, and I haven't looked at that yet.
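A minimal way to picture "remembering" those deletes (purely illustrative; the real code would also have to remap document numbers into the merged segment):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: deletes that arrive for a segment while it is being
// merged get buffered, then replayed against the merged segment afterwards.
class PendingDeletes {
    private final Set<Integer> bufferedDocIds = new HashSet<>();
    private boolean mergeRunning = true;

    synchronized void delete(int docId) {
        if (mergeRunning) {
            // Can't touch the segment mid-merge; remember the delete.
            bufferedDocIds.add(docId);
        }
        // else: apply directly to the live segment (omitted)
    }

    // Called when the merge commits: hand back the buffered deletes to replay.
    synchronized Set<Integer> drainForReplay() {
        mergeRunning = false;
        Set<Integer> toReplay = new HashSet<>(bufferedDocIds);
        bufferedDocIds.clear();
        return toReplay;
    }
}
```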

> I see you start a thread whenever there is merge work. Would it be
> hard to control system load?

I think it needs to be looked at. Since concurrent conflicting merges
aren't allowed, there is a bound on concurrency, but it might be too
loose a bound. I'm setting up tests to start getting a feel for the
dynamics.

My strawman model was to start with as much concurrency as the data
allowed, then scale it back as necessary.

My main interest is in reducing the latency of add docs. In the example
in my head, I have segments on a number of levels. Let's say merges at
the higher end are going to take 3 seconds, 3 hours, and 3 days. I'd
like to launch the 3 day merge and let it run in the background. It
should be a while before a 3 hour merge is required, but if one is
required before the 3 day merge is complete, I'd like not to block in
that case, too. If load is an issue, the idea would be to lower the
priority or suspend the 3 day merge while the 3 hour merge is going.
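The priority part is easy to sketch (the suspend part is not: true suspension would need cooperative pause points in the merge loop, since Thread.suspend() is deprecated and deadlock-prone):

```java
// Sketch only: drop the priority of the thread running the 3-day merge so
// the 3-hour merge gets the CPU. This is a hint to the scheduler, not a
// guarantee, and it does nothing about I/O contention.
class MergePriority {
    static void deprioritize(Thread longRunningMerge) {
        longRunningMerge.setPriority(Thread.MIN_PRIORITY);
    }
}
```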

My focus isn't on slowing things down, i.e., handling a system where you
truly can't keep up, but in spreading out the big lumps of work, rather
than putting them in the add doc control path.

It's possible that at some point you'll want to do a merge that includes
segments that are being merged concurrently. In that case, the code
currently blocks. There are alternatives, like allowing more than
mergeFactor segments on a level, at least temporarily, but I haven't
gone that way yet. So my way of keeping things simple (if any version of
concurrent can be called simple) is not to make blocking impossible, but
to make it less likely. In the serial case, it's a certainty.
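The conflict check itself is simple to picture (a sketch with made-up names): a merge conflicts if any of its segments is already in flight, and in that case the current code just waits.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: track which segments are being merged; a new merge that overlaps
// an in-flight one blocks until the overlap clears. Names are invented.
class MergeRegistry {
    private final Set<String> inFlight = new HashSet<>();

    synchronized boolean conflicts(List<String> segments) {
        for (String s : segments) {
            if (inFlight.contains(s)) return true;
        }
        return false;
    }

    synchronized void begin(List<String> segments) {
        while (conflicts(segments)) {
            try {
                wait();  // this is where the serial-style blocking happens
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        inFlight.addAll(segments);
    }

    synchronized void end(List<String> segments) {
        inFlight.removeAll(segments);
        notifyAll();  // wake any merge waiting on these segments
    }
}
```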

The main thing I've been trying to understand up until now was the
concurrency of IndexWriter#segmentInfos, given that multiple merges
could be running. If you allow that merges could be running AND a merge
might be blocked, you can't make a synchronized call on IndexWriter,
because the blocked merge request holds that lock.

But my most recent thinking has been that I've been going down the wrong
path trying to separately synchronize segmentInfos. I think instead the
merge threads can make a separate queue of merge results that
IndexWriter can look at when it wants to. I'm gonna look at that soon.
Currently my concurrent stuff won't work because this part is
incomplete.
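Concretely, the queue idea might look something like this (a sketch, not working patch code): merge threads publish completed merges, and IndexWriter folds them into segmentInfos whenever it already holds its own lock, so a blocked merge never holds anything the writer needs.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the "separate queue of merge results" idea, with invented names.
// Merge threads never touch segmentInfos directly; they enqueue results,
// and the writer applies them at its convenience under its own lock.
class MergeResult {
    final String mergedSegment;

    MergeResult(String mergedSegment) {
        this.mergedSegment = mergedSegment;
    }
}

class ResultQueue {
    private final Queue<MergeResult> completed = new ConcurrentLinkedQueue<>();

    // Called by a merge thread when its merge finishes.
    void publish(MergeResult result) {
        completed.add(result);
    }

    // Called by IndexWriter, inside its own synchronized section, to fold
    // finished merges into segmentInfos. Returns how many were applied.
    int applyPending() {
        int applied = 0;
        MergeResult r;
        while ((r = completed.poll()) != null) {
            // ...replace the merged-away segments with r.mergedSegment...
            applied++;
        }
        return applied;
    }
}
```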

