[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520874
]
Michael McCandless commented on LUCENE-847:
-------------------------------------------
> > I don't think so: I think if someone changes the merge policy to
> > something else, it's fine to require that they then do settings
> > directly through that merge policy.
>
> You're going to want to change the default merge policy, right? So
> you're going to change the hard cast in IW to that policy? So it'll
> fail for anyone that wants to just getMergePolicy back to the old
> policy?
I don't really follow... my feeling is we should not deprecate
setUseCompoundFile, setMergeFactor, setMaxMergeDocs.
> > I think we shouldn't allow any mergePolicy to leave the index
> > inconsistent (failing to copy over segments from other
> > directories).
>
> That makes sense to me. CMP could enforce this, even in the case of
> concurrent merges.
I think IndexWriter should enforce it? Ie no merge policy should be
allowed to leave segments in other dirs (= an inconsistent index) at
the point of commit.
> Perhaps this is sufficient, but not necessary? I see it as simpler
> just to have the merge policy (abstractly) generate a set of
> non-conflicting merges and let someone else worry about scheduling
> them.
I like that idea :) It fits well w/ the stateless API. Ie, merge
policy returns all possible merges and "someone above" takes care of
scheduling them.
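
To make the stateless idea concrete, here is a rough sketch of what I
have in mind. All names here (MergeSpec, StatelessMergePolicy,
SimpleLevelPolicy, findMerges) are made up for illustration and are not
taken from any attached patch. The policy only looks at the current
segment sizes and returns non-overlapping ranges to merge; it never
runs a merge itself, so whoever calls it is free to run the merges
serially or hand them to background threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical type: a contiguous range of segments to merge.
final class MergeSpec {
    final int start; // index of the first segment, inclusive
    final int end;   // index past the last segment, exclusive
    MergeSpec(int start, int end) { this.start = start; this.end = end; }
}

// Hypothetical stateless API: look at the segments, return candidate merges.
interface StatelessMergePolicy {
    List<MergeSpec> findMerges(List<Integer> segmentDocCounts);
}

// Assumed log-like policy: group adjacent segments of the same "level"
// and propose one merge per full group of mergeFactor segments.
final class SimpleLevelPolicy implements StatelessMergePolicy {
    private final int mergeFactor;
    SimpleLevelPolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }

    public List<MergeSpec> findMerges(List<Integer> sizes) {
        List<MergeSpec> merges = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= sizes.size(); i++) {
            boolean runEnds = i == sizes.size()
                || level(sizes.get(i)) != level(sizes.get(runStart));
            if (runEnds) {
                // Emit every full group inside the run [runStart, i).
                for (int s = runStart; s + mergeFactor <= i; s += mergeFactor) {
                    merges.add(new MergeSpec(s, s + mergeFactor));
                }
                runStart = i;
            }
        }
        return merges; // ranges never overlap, so they can run in any order
    }

    // Level = number of times mergeFactor divides into the doc count range.
    private int level(int docCount) {
        int level = 0;
        for (long bound = mergeFactor; docCount >= bound; bound *= mergeFactor) {
            level++;
        }
        return level;
    }
}
```

Since the returned ranges are disjoint, the caller can schedule them
however it likes, and the policy does not need to know who the caller
is.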
> > But, providing just a single concurrent merge already gains us
> > concurrency of merging with adding of docs.
>
> I'm worried about when you start the leftmost merge, that, say, is
> going to take a day. With a steady influx of docs, it's not going to
> be long before you need another merge and if you have only one
> thread, you're going to block for the rest of the day. You've bought
> a little concurrency, but it's the almost day-long block I really
> want to avoid.
Ahh ... very good point. I agree.
> With a log-like policy, I think it's feasible to have logN
> threads. You might not want them all doing disk i/o at the same
> time: you'd want to prioritize threads on the small merges and/or
> suspend large merge threads. The speed of the larger merge
> threads can vary when other merges are taking place; you just have
> to not stop them and start over.
Agreed: CMP should do this.
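
As a rough illustration of the "small merges first" part only (it does
not cover suspending a large merge that is already running), pending
merges could sit in a queue ordered by size, with merge threads always
taking the smallest one. PendingMerge and SmallestFirstMergeQueue are
hypothetical names; only java.util.concurrent classes are real here.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical description of a merge waiting to run.
final class PendingMerge {
    final String name;
    final long totalDocs; // assumed cost estimate: docs across input segments
    PendingMerge(String name, long totalDocs) {
        this.name = name;
        this.totalDocs = totalDocs;
    }
}

// Thread-safe queue that always hands out the cheapest pending merge.
final class SmallestFirstMergeQueue {
    private final PriorityBlockingQueue<PendingMerge> queue =
        new PriorityBlockingQueue<>(
            11, Comparator.comparingLong((PendingMerge m) -> m.totalDocs));

    void submit(PendingMerge merge) {
        queue.offer(merge); // the queue is unbounded, so this always succeeds
    }

    // Returns the smallest pending merge, or null if nothing is waiting.
    PendingMerge poll() {
        return queue.poll();
    }
}
```

A merge thread would loop on poll() (or a blocking take) and run
whatever comes out, so a day-long merge never sits in front of a quick
one in the queue.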
> > Right, the LUCENE-845 merge policy doesn't look @ the return
> > result of "merge". It just looks at the newly created
> > SegmentInfos.
>
> Yeah. My thinking was this would be tweaked. If merger.merge returns
> a valid number of docs, it could recurse as it does. If merger.merge
> returned -1 (which CMP does), it would not recurse but simply
> continue the loop.
Hmm. This means each merge policy must know whether it's talking to
CMP or IndexWriter underneath? With the stateless approach this
wouldn't happen.
> > Hmmmm, in fact, I think your CMP wrapper would not work with the
> > merge policy in LUCENE-845, right? Ie, won't it just recurse
> > forever? So actually I don't see how your CMP (using the current
> > API) can in general safely "wrap" around a merge policy w/o
> > breaking things?
>
> I think it's safe, just not concurrent. The recursion would generate
> the same set of segments to merge and CMP would make the second call
> block (abstractly, anyway: it actually throws an exception that
> unwinds the stack and causes the call to start again from the top
> when the conflicting merge finishes).
Oh I see... that's kind of sneaky (planning on using exceptions to
abort a merge requested by the policy). I think the stateless
approach would be cleaner here.
> > But, if you lock on IndexWriter, what about apps that use multiple
> > threads to add documents but don't use CMP? When one thread
> > gets tied up merging, you'll then block on the other synchronized
> > methods? And you also can't flush from other threads either? I
> > think flushing a new segment should be allowed to run concurrently
> > with the merge?
>
> I'm not sure I'm following this. That's what happens now, right? Are
> you trying to get more concurrency than there is now w/o using CMP?
> I certainly haven't been trying to do that.
True, this is something new. But since you're already doing the work
to allow a merge to run in the BG without blocking adding of docs,
flushing, etc, wouldn't this come nearly for free? Actually I think
all that's necessary, regardless of sync'ing on IndexWriter or
SegmentInfos, is to move the "if (triggerMerge)" out of the
synchronized method/block.
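
Roughly what I mean, as a toy sketch and not the real IndexWriter code:
SketchWriter, slowMerge and snapshot are invented names, and segments
are just strings. The lock is held only while the segment list is read
or changed; the decision to merge is recorded inside the lock, but the
merge itself runs after the lock is released, so other threads can keep
flushing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for a writer; not the real IndexWriter.
final class SketchWriter {
    private final List<String> segments = new ArrayList<>();
    private final Set<String> merging = new HashSet<>(); // inputs of running merges
    private final int mergeFactor;
    private int nextId = 0;

    SketchWriter(int mergeFactor) { this.mergeFactor = mergeFactor; }

    void flush() {
        boolean triggerMerge;
        synchronized (this) {
            // Adding the new segment is a tiny, fleeting change.
            segments.add("_" + nextId++);
            triggerMerge = segments.size() - merging.size() >= mergeFactor;
        }
        // The "if (triggerMerge)" lives outside the synchronized block.
        if (triggerMerge) {
            merge();
        }
    }

    private void merge() {
        List<String> picked = new ArrayList<>();
        String mergedName;
        synchronized (this) {
            for (String s : segments) {
                if (!merging.contains(s) && picked.size() < mergeFactor) {
                    picked.add(s);
                }
            }
            if (picked.size() < mergeFactor) {
                return; // another thread already claimed these segments
            }
            merging.addAll(picked);
            mergedName = "_" + nextId++;
        }
        slowMerge(picked, mergedName); // long-running; no lock held
        synchronized (this) {
            // Swap the inputs for the merged segment, again a brief change.
            int at = segments.indexOf(picked.get(0));
            segments.removeAll(picked);
            segments.add(at, mergedName);
            merging.removeAll(picked);
        }
    }

    // Placeholder for the actual merge work.
    private void slowMerge(List<String> inputs, String output) { }

    synchronized List<String> snapshot() { return new ArrayList<>(segments); }
}
```

The same shape works whether the monitor is the writer or the segment
list; what matters is that the slow call sits outside it.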
> > I guess I don't see the reason to synchronize on IndexWriter
> > instead of segmentInfos.
>
> I looked at trying to make IW work when a synchronization of IW
> didn't imply a synchronization of segmentInfos. It's a very, very
> heavily used little data structure. I found it very hard to convince
> myself I could catch all the places locks would be required. And at
> the same time, I seemed to be able to do everything I needed with IW
> locking.
Well, eg flush() now synchronizes on IndexWriter: we don't want 2
threads doing this at once. But, the touching of segmentInfos inside
flush (to add the new SegmentInfo) is a tiny fleeting event (like
replace) and so you would want segmentInfos to be free to change while
the flushing was running (eg by a BG merge that has finished).
> Hmmm ... I guess our approaches are pretty different. If you want to
> take a stab at this ...
OK I will try to take a rough stab at the stateless approach...
> Factor merge policy out of IndexWriter
> --------------------------------------
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Parkes
> Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt,
> LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.