[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520874 ]
Michael McCandless commented on LUCENE-847:
-------------------------------------------

> > I don't think so: I think if someone changes the merge policy to
> > something else, it's fine to require that they then do settings
> > directly through that merge policy.
>
> You're going to want to change the default merge policy, right? So
> you're going to change the hard cast in IW to that policy? So it'll
> fail for anyone that wants to just getMergePolicy back to the old
> policy?

I don't really follow... my feeling is we should not deprecate
setUseCompoundFile, setMergeFactor, setMaxMergeDocs.

> > I think we shouldn't allow any mergePolicy to leave the index
> > inconsistent (failing to copy over segments from other
> > directories).
>
> That makes sense to me. CMP could enforce this, even in the case of
> concurrent merges.

I think IndexWriter should enforce it? Ie no merge policy should be
allowed to leave segments in other dirs (= an inconsistent index) at
the point of commit.

> Perhaps this is sufficient, but not necessary? I see it as simpler
> just to have the merge policy (abstractly) generate a set of
> non-conflicting merges and let someone else worry about scheduling
> them.

I like that idea :) It fits well w/ the stateless API. Ie, merge
policy returns all possible merges and "someone above" takes care of
scheduling them.

> > But, providing just a single concurrent merge already gains us
> > concurrency of merging with adding of docs.
>
> I'm worried about when you start the leftmost merge, that, say, is
> going to take a day. With a steady influx of docs, it's not going to
> be long before you need another merge and if you have only one
> thread, you're going to block for the rest of the day. You've bought
> a little concurrency, but it's the almost day-long block I really
> want to avoid.

Ahh ... very good point. I agree.

> With a log-like policy, I think it's feasible to have logN threads.
> You might not want them all doing disk i/o at the same time: you'd
> want to prioritize threads on the small merges and/or suspend large
> merge threads. The speed at which the larger merge threads run can
> vary when other merges are taking place; you just have to not stop
> them and start over.

Agreed: CMP should do this.

> > Right, the LUCENE-845 merge policy doesn't look @ the return
> > result of "merge". It just looks at the newly created
> > SegmentInfos.
>
> Yeah. My thinking was this would be tweaked. If merger.merge returns
> a valid number of docs, it could recurse as it does. If merger.merge
> returned -1 (which CMP does), it would not recurse but simply
> continue the loop.

Hmm. This means each merge policy must know whether it's talking to
CMP or IndexWriter underneath? With the stateless approach this
wouldn't happen.

> > Hmmmm, in fact, I think your CMP wrapper would not work with the
> > merge policy in LUCENE-845, right? Ie, won't it just recurse
> > forever? So actually I don't see how your CMP (using the current
> > API) can in general safely "wrap" around a merge policy w/o
> > breaking things?
>
> I think it's safe, just not concurrent. The recursion would generate
> the same set of segments to merge and CMP would make the second call
> block (abstractly, anyway: it actually throws an exception that
> unwinds the stack and causes the call to start again from the top
> when the conflicting merge finishes).

Oh I see... that's kind of sneaky (planning on using exceptions to
abort a merge requested by the policy). I think the stateless
approach would be cleaner here.

> > But, if you lock on IndexWriter, what about apps that use multiple
> > threads to add documents but don't use CMP? When one thread
> > gets tied up merging, you'll then block on the other synchronized
> > methods? And you also can't flush from other threads either? I
> > think flushing a new segment should be allowed to run concurrently
> > with the merge?
>
> I'm not sure I'm following this. That's what happens now, right? Are
> you trying to get more concurrency than there is now w/o using CMP?
> I certainly haven't been trying to do that.

True, this is something new. But since you're already doing the work
to allow a merge to run in the BG without blocking adding of docs,
flushing, etc, wouldn't this come nearly for free? Actually I think
all that's necessary, regardless of sync'ing on IndexWriter or
SegmentInfos, is to move the "if (triggerMerge)" out of the
synchronized method/block.

> > I guess I don't see the reason to synchronize on IndexWriter
> > instead of segmentInfos.
>
> I looked at trying to make IW work when a synchronization of IW
> didn't imply a synchronization of segmentInfos. It's a very, very
> heavily used little data structure. I found it very hard to convince
> myself I could catch all the places locks would be required. And at
> the same time, I seemed to be able to do everything I needed with IW
> locking.

Well, eg flush() now synchronizes on IndexWriter: we don't want 2
threads doing this at once. But, the touching of segmentInfos inside
flush (to add the new SegmentInfo) is a tiny fleeting event (like
replace) and so you would want segmentInfos to be free to change
while the flushing was running (eg by a BG merge that has finished).

> Hmmm ... I guess our approaches are pretty different. If you want to
> take a stab at this ...

OK I will try to take a rough stab at the stateless approach....
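
For illustration only, the stateless split discussed above (a merge
policy that just proposes non-conflicting merges, with "someone above"
deciding which ones to run) could look roughly like the Java sketch
below. Every name in it (StatelessMergeSketch, MergeSpec, findMerges,
selectRunnable) and the toy "level" rule are assumptions made up for
this sketch; none of it is taken from the LUCENE-847 or LUCENE-845
patches or from the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a stateless merge policy plus a separate scheduler step. */
class StatelessMergeSketch {

    /** A proposed merge of the contiguous segment range [start, end). */
    static final class MergeSpec {
        final int start;
        final int end;

        MergeSpec(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Two merges conflict if their segment ranges share any segment. */
        boolean overlaps(MergeSpec other) {
            return start < other.end && other.start < end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Toy log-like level (assumption): how often docCount divides by mergeFactor. */
    static int level(int docCount, int mergeFactor) {
        int level = 0;
        while (docCount >= mergeFactor) {
            docCount /= mergeFactor;
            level++;
        }
        return level;
    }

    /**
     * The "policy": looks only at the segment sizes it is handed and returns
     * disjoint merges. It keeps no state and never runs a merge itself.
     */
    static List<MergeSpec> findMerges(int[] docCounts, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<MergeSpec> merges = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= docCounts.length; i++) {
            boolean runEnds = i == docCounts.length
                || level(docCounts[i], mergeFactor) != level(docCounts[runStart], mergeFactor);
            if (runEnds) {
                // Each full group of mergeFactor same-level neighbours becomes one merge.
                for (int s = runStart; s + mergeFactor <= i; s += mergeFactor) {
                    merges.add(new MergeSpec(s, s + mergeFactor));
                }
                runStart = i;
            }
        }
        return merges;
    }

    /** The "someone above": drops proposals that touch segments already being merged. */
    static List<MergeSpec> selectRunnable(List<MergeSpec> proposed, List<MergeSpec> running) {
        List<MergeSpec> runnable = new ArrayList<>();
        for (MergeSpec p : proposed) {
            boolean conflict = false;
            for (MergeSpec r : running) {
                if (p.overlaps(r)) {
                    conflict = true;
                    break;
                }
            }
            if (!conflict) {
                runnable.add(p);
            }
        }
        return runnable;
    }

    public static void main(String[] args) {
        int[] docCounts = {100, 10, 10, 10, 1, 1, 1};
        List<MergeSpec> proposed = findMerges(docCounts, 3);
        List<MergeSpec> running = List.of(new MergeSpec(4, 7)); // pretend this one is in flight
        System.out.println("proposed: " + proposed);
        System.out.println("runnable: " + selectRunnable(proposed, running));
    }
}
```

Because findMerges has no side effects and never calls back into the
writer, the same policy could sit under a blocking caller or a
concurrent one without knowing which it is talking to; only
selectRunnable (or whatever plays that role) needs to know what is
already running.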

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt,
>                      LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.