[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519790 ]
Steven Parkes commented on LUCENE-847: -------------------------------------- Are you going to fix all unit tests that call the now-deprecated APIs? Yeah. Thanks for the reminder. As to the IndexWriter vs. IndexMerger issue, I still think having the interface is useful if not only that it makes my testing much easier. I have a MockIndexMerger that implements only the functions in the interface and therefore I can test merge policies without creating a writer. For me this has been a big win ... Exactly: these settings decide when a segment is flushed, so, why put them into IndexMerger interface? They shouldn't have anything to with merging; I think they should be removed. For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that does not use maxBufferedDocs. I can see that one could write a merge policy that didn't have any idea of how the initial buffering was done, but I worry about precluding it. Maybe the LUCENE-845 patch will show a strong enough pattern to believe no merge policies will need it? I think factoring into a base class is an OK solution, but, it shouldn't be MergePolicy's job to remember to call this final "move any segments in the wrong directory over" code. As long as its in one place and people don't have to copy/paste code between MergePolicy sources. In the case of concurrent merges, I think this gets more complicated. When do you do those directory copies? I think you can't do them at the return from the merge policy because the merge policy may want to do them, but later. I don't think IndexWriter has enough information to know when the copies need to done. Doubly so if we have concurrent merges? I still stand by it should be the merge policy making the choice. You could have the code in IndexWriter too, but then there'd be duplicate code. To put the code only in IndexWriter removes the choice from the merge policy. I think there should in fact be a default optimize() in the base class that does what current IndexWriter now does so that a MergePolicy need not implement optimize at all. It'd be nice, but I don't know how to do it: the merge factor is not generic, so I don't know how to implement the loop generically. Ah ... I see: with your forced merge ... hmmm. No, forced would mean the merge policy must do a merge; whereas, normally, it's free not to do a merge until it wants to. Hmmmm ... I think adding a forced merge concept here is new ... If it's simply to support optimize, I'm not sure I find it too compelling. LogDoc as it stands uses different algorithms for incremental merges and optimize, so there's not too much of a concept of forced merges vs. optional merges to be factored out. So I guess I'm not seeing a strong compelling case for creating it? Well, it's sort of awkward if you want to vary that max # segments. Say during the day you optimize down to 15 segments every time you update the index, but then at night you want to optimize down to 5. If we don't add method to IndexWriter you then must have instance var on your MergePolicy that you set, then you call optimize. It's not clean since really it should be a parameter. Well, I don't know if I buy the argument that it should be a parameter. The merge policy has lots of state like docs/seg. I don't really see why segs/optimize is different. My main reason for not wanting put this into IndexWriter is then every merge policy must support it. Wait: there is a barrier, right? In IndexWriter.replace we don't do the right thing with non-contiguous merges? Yeah, I meant that I'm not aware of any barriers except fixing IndexWriter#replace, in other words, I'm not aware of any other places where non-contiguity would cause a failure. I'm not wed to "maybeMerge()" but I really don't like "merge" :) It's overloaded now. EG IndexMerger interface has a method called "merge" that is named correctly because it will specifically go a do the requested merge. So does IndexWriter. Then, you have other [overloaded] methods in LogDocMergePolicy called "merge" that are named appropriately (they will do a specific merge). How about "checkForMerges()"? I don't find it ambiguous based on class and argument type. It's all personal, of course. I'd definitely prefer maybe over checkFor because that sounds like a predicate. Maybe we could add a "setMergePolicy(MergePolicy policy, boolean doClose)" and default doClose to true? That sounds good. Finally: does MergePolicy really need a close()? I think so. The concurrent merge policy maintains all sorts of state. I don't see how it can work (building the generic concurrency wrapper "under" IndexMerger) because the MergePolicy is in "serial control", eg, when it wants to cascade merges. How will you return that thread back to IndexWriter? So this is how it looks now: the concurrent merge policy is both a merge policy and an index merger. The serial merge policy knows nothing about it other than it does not get IndexWriter as its merge. The index writer wants its merge, so it does it merge/maybeMerge call on the concurrent merge policy. The CMP calls merge on the serial policy, but substitutes itself for the merger rather than IndexWriter. The serial merge policy goes on its merry way, looking for merges to do (in the current model, this is a loop; more on that in a minute). Each time it has a subset of segments to merge, it calls merger.merge(...). At this point, the concurrent merge policy takes over again. It looks at the segments to be merged and other segments being processed by all existing merge threads and determines if there's a conflict (a request to merge a segment that's currently in a merge). If there's no conflict, it starts a merge thread and calls IndexWriter#merge on the thread. The original calling thread returns immediately. (I have a few ideas how to handle conflicts, the simplest of which is to wait for the conflicting merge and the restart the serial merge, e.g., revert to serial). This seems to work pretty well, so far. The only difference in API for the serial merges is that the merge operation can't return the number of documents in the result (since it isn't known how many docs will be deleted). With concurrency wrapper "on the top" it's able to easily take a merge request as returned by the policy, kick it off in the backrground, and immediately return control of original thread back to IndexWriter. What I don't know how to do with this is figure out how to do a bunch of merges. Lets say I have two levels in LogDoc that are merge worthy. If I call LogDoc, it'll return the lower level. That's all good. But what about doing the higher level in parallel? If I call LogDoc again, it's going to return the lower level again because it knows nothing about the current merge going on. LogDoc already does things in a loop: it's pretty much set up to call all possible merges at one time (if they return immediately). It would be possible to have it return a vector of segmentInfo subsets, but I don't see the gain (and it doesn't work out as well for my putative conflict resolution). have all necessary locking for SegmentInfos inside IndexWriter This was a red-herring on my part. All the "segmentInfos locking" has always been in IndexWriter. That's note exactly sufficient. The fundamental issue is that IndexWriter#merge has to operate without a lock on IndexWriter. At some point, I was thinking that meant it would have to lock SegmentInfos but that's ludicrous, actually. It's sufficient for IndexWriter#replace to be synchronized. If CMP implements IndexMerger you must have locking inside any MergePolicy that's calling into CMP? No. CMP does it's own locking (for purposes of thread management) but the serial merge policies no nothing of this (and they can expect to be called synchronously). > Factor merge policy out of IndexWriter > -------------------------------------- > > Key: LUCENE-847 > URL: https://issues.apache.org/jira/browse/LUCENE-847 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Steven Parkes > Assignee: Steven Parkes > Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, > LUCENE-847.txt > > > If we factor the merge policy out of IndexWriter, we can make it pluggable, > making it possible for apps to choose a custom merge policy and for easier > experimenting with merge policy variants. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]