On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:

[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210 ]

Steven Parkes commented on LUCENE-847:
--------------------------------------

> Is the separate IndexMerger interface really necessary?

I wrestled with this, so in the past it wouldn't have taken much to convince me otherwise. The reason for the extra interface is that I was hoping to discourage, or at least create a little extra friction for, merge policies looking too deeply into the internals of IndexWriter.

As an example, it's not a good idea for merge policies to be able to access IndexWriter#segmentInfos directly. (That's a case I'd like to solve by making segmentInfos private, but I haven't looked at that completely yet; even beyond that case, I'd still like merge policies to have a very clean interface with IndexWriter.)

It feels kind of weird for me to be arguing for constraints, since I work mostly in dynamic languages that have none of this stuff. But it reflects my goal that merge policies simply be algorithms, not real workers.

Moreover, I think it may be useful for implementing concurrency (see below).

> While LogDocMergePolicy does need "maxBufferedDocs", I think, instead, in
> IndexWriter's "setMaxBufferedDocs()" it should "write through" to the
> LogDocMergePolicy if that is the merge policy in use (and LogDocMergePolicy
> should store its own private "minMergeDocs").

The thing here is that the merge policy has nothing to do with max buffered docs, right? The code for buffering docs for the first segment is wholly in IndexWriter. LogDocMergePolicy happens to need the value (in its current incarnation) in order to handle the way the log levels are computed. This could, of course, be changed.
There's nothing that says a merge policy has to look at these values, just that they're available should the merge policy want to look.

I guess my idea was that the index writer was responsible for handling the initial segment (with DocsWriter, if it wants) and also for providing an indication to the merge policies of how it was handling them.

It's possible to have the merge policy influence the first segment size, but that opens up a lot of issues. Does every merge policy then have to know how to trade off between buffering by doc count and buffering by RAM? I was hoping to avoid that.

I'm not all that happy with this the way it is; supporting both by-doc and by-RAM is messy but seems necessary. This was my take on making it least messy.

> In LogDocMergePolicy, it seems like the checking that's done for whether a
> SegmentInfo is in a different directory, as well as the subsequent copy to
> move it over to the IndexWriter's directory, should not live in the
> MergePolicy?

I see two parts to this.

First, I hesitate to make it impossible for a merge policy to see the directory information, i.e., to remove IndexMerger#getDirectory(). Copying a segment is one way to get it into the current directory, but merging it with other segments does too. A merge policy could decide to go ahead and merge a bunch of segments that are in other directories rather than just copy them individually. Taking away getDirectory() removes that option.

As to how to handle the case where single segments are copied, my main reason for leaving that in the merge policy would be simplicity. It seems nicer to have all segment amalgamation management in one place, rather than some in one class and some in another. It could be factored into an optional base merge policy for derived classes to use as they like.
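The "optional base merge policy" factoring could look roughly like the following sketch. All class and method names here (SegmentInfoStub, BaseMergePolicy, foreignSegments, and so on) are hypothetical stand-ins for illustration, not Lucene's actual API or the patch under discussion:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a segment descriptor: just a name and the
// directory it lives in. (Hypothetical; not Lucene's SegmentInfo.)
class SegmentInfoStub {
    final String name;
    final String directory;
    SegmentInfoStub(String name, String directory) {
        this.name = name;
        this.directory = directory;
    }
}

// The directory check lives in an optional base class; subclasses can
// reuse it or ignore it entirely, which is the factoring proposed above.
abstract class BaseMergePolicy {
    // Segments living outside the writer's directory need to be copied
    // (or merged) into it; this helper finds them.
    protected List<SegmentInfoStub> foreignSegments(
            List<SegmentInfoStub> segments, String writerDirectory) {
        List<SegmentInfoStub> foreign = new ArrayList<>();
        for (SegmentInfoStub si : segments) {
            if (!si.directory.equals(writerDirectory)) {
                foreign.add(si);
            }
        }
        return foreign;
    }
}

// A derived policy that chooses to use the helper: it would copy each
// foreign segment individually rather than merging them together.
class CopyingMergePolicy extends BaseMergePolicy {
    List<SegmentInfoStub> toCopy(List<SegmentInfoStub> segments, String dir) {
        return foreignSegments(segments, dir);
    }
}
```

A policy that instead wanted to merge all foreign segments in one pass could override or bypass the helper, which is why keeping getDirectory() visible matters.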
> The "checkOptimize" method in LogDocMergePolicy seems like it belongs back
> in IndexWriter: I think we don't want every MergePolicy having to replicate
> that tricky while condition.

The reason for not doing this was that I can imagine different merge policies having different models of what optimize means. I can imagine policies that would not optimize everything down to a single segment but would instead obey a maximum segment size. But we could factor the default conditional into an optional base class, as above.

It is an ugly conditional, and there may be a better way to formulate it so that policies don't have to look at grody details like hasSeparateNorms. But I'd still like the merge policies to be able to decide, at a high level, what optimize means.

> Maybe we could change MergePolicy.merge to take a boolean "forced" which, if
> true, means that the MergePolicy must now pick a merge even if it didn't
> think it was time to. This would let us move that tricky while condition
> logic back into IndexWriter.

I didn't follow this. Is forced == optimize? If not, what does "pick a merge" mean? I'm not sure what LogDoc should do for merge(force=true) if it thinks everything is fine.

> Also, I think at some point we may want to have an optimize() method that
> takes an int parameter stating the max # of segments in the resulting index.

I think this is great functionality for a merge policy, but what about just making that part of the individual merge policy interface, rather than linking it to IndexWriter? That was one goal of making a factored merge policy: that you could do these tweaks without changing IndexWriter.

> This would allow you to optimize down to <= N segments w/o paying the full
> cost of a complete "only one segment" optimize.
> If we had a "forced" boolean then we could build such an optimize method in
> the future, whereas as it stands now it would not be so easy to retrofit
> previously created MergePolicy classes to do this.

I haven't looked at how difficult it would be to make LogDoc sufficiently derivable to allow a derived class to add this tweak. If I could, would that be enough?

> There are some minor things that should not be committed, e.g., the added
> "infoStream = null" in IndexFileDeleter. I typically try to put a comment
> "// nocommit" above such changes as I make them, to remind myself and keep
> them from slipping in.

You're right, and thanks for the idea. Obvious now.

> In the deprecated IndexWriter methods you're doing a hard cast to
> LogDocMergePolicy, which gives a bad result if you're using a different
> merge policy; maybe catch the class cast exception (or, better, check
> upfront if it's an instanceof) and raise a more reasonable exception if not?

Agreed.

> IndexWriter around line 1908 has a for loop with a commented-out
> "System.err.println"; we should just comment out/remove the for loop.

The whole loop will be gone before commit, but I didn't want to delete it yet since I might need to turn it back on for more debugging. It should, of course, have a "// nocommit" comment.

> These commented-out synchronized blocks spook me a bit:
>
>     /* synchronized(segmentInfos) */ {

This is from my old code. I left it in there as a hint as I work on the concurrent stuff again. It's only a weak hint, though, because things have changed a lot since that code was even partially functional. Ignore it. It won't go into the serial patch, and anything for LUCENE-870 will have to have its own justification.

> Can we support non-contiguous merges? If I understand it correctly, the
> MergeSpecification can express such a merge; it's just that the current
> IndexMerger (IndexWriter) cannot execute it, right?
> So we are at least not precluding fixing this in the future.

As far as I've seen so far, there are no barriers to non-contiguous merges. Maybe something will crop up, but in what I've done, I haven't seen any.

> It confuses me that MergePolicy has a method "merge(...)" -- can we rename
> it to "maybeMerge(..)" or "checkForMerge(...)"?

I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of "merge" like "optimize": make it so / idempotent. But I'm certainly willing to write whatever people find clearest.

> Instead of IndexWriter.releaseMergePolicy() can we have IndexWriter only
> close the merge policy if it was the one that had created it? (Similar to
> how IndexWriter closes the dir if it has opened it from a String or File,
> but does not close it if it was passed in.)

This precludes

    iw.setMergePolicy(new MyMergePolicy(...));
    ...
    iw.close();

You're always going to have to

    MergePolicy mp = new MyMergePolicy(...);
    iw.setMergePolicy(mp);
    ...
    iw.close();
    mp.close();

The implementation's much cleaner using the semantics you describe, but I was thinking it would be better to optimize for the usability of the common client code case.

> Well, I think the current MergePolicy API (where the "merge" method calls
> IndexWriter.merge itself, must cascade itself, etc.) makes it hard to build
> a generic ConcurrentMergePolicy "wrapper" that you could use to make any
> MergePolicy concurrent (?). How would you do it?

I really haven't had time to go heads down on this (the old concurrent merge policy was a derived class rather than a wrapper class). But I was thinking that perhaps ConcurrentMergePolicy would actually wrap IndexWriter as well as the serial merge policy, i.e., implement IndexMerger (my biggest argument for IndexMerger at this point). I haven't looked deeply at whether this will work, but I think it has at least a chance.

I should know more about this in a day or two.
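The ownership question above (close-if-created vs. an explicit releaseMergePolicy()) can be sketched with a couple of stand-in classes. WriterStub and PolicyStub are hypothetical, not Lucene classes; the point is only the "close it if and only if the writer created it" bookkeeping:

```java
// Hypothetical stand-in for a closeable merge policy.
class PolicyStub {
    boolean closed = false;
    void close() { closed = true; }
}

// Hypothetical stand-in for IndexWriter's ownership bookkeeping,
// mirroring how IndexWriter closes a Directory it opened itself but
// leaves a caller-supplied one alone.
class WriterStub {
    private PolicyStub mergePolicy = new PolicyStub(); // default: writer-owned
    private boolean ownsPolicy = true;

    void setMergePolicy(PolicyStub mp) {
        if (ownsPolicy) {
            mergePolicy.close(); // release the default policy we created
        }
        mergePolicy = mp;
        ownsPolicy = false; // caller is now responsible for closing it
    }

    PolicyStub getMergePolicy() { return mergePolicy; }

    void close() {
        if (ownsPolicy) {
            mergePolicy.close();
        }
    }
}
```

Under these semantics a caller who sets a custom policy must keep a reference and close it after the writer, which is exactly the extra client-code burden the reply argues against.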
> I think you can still have state (as instance variables in your class)?
> How would this simplification restrict the space of merge policies?

s/state/stack state/. Yeah, you can always unwind your loops and lift your recursions, putting all that stack state into instance variables. But, well, yuck. I'd like to make it easy to write simple merge policies and take up the heavy lifting elsewhere. Hopefully there will be more merge policies than index writers.

> Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
> just with a separate thread. And so all synchronization required is still
> inside IndexWriter (I think?).

That's my idea.

The synchronization is still tricky, since parts of segmentInfos are getting changed at various times and there are references and/or copies of it in other places. And as Ning pointed out to me, we also have to deal with buffered delete terms. I'd say I got about 80% of the way there on the last go-round. I'm hoping to get all the way this time.

> In fact, if we stick with the current MergePolicy API, aren't you going to
> have to put some locking into, e.g., the LogDocMergePolicy when concurrent
> merges might be happening?

Yes and no. If ConcurrentMergePolicy implements IndexMerger, I think we might be okay. In the previous iteration, I used derivation, so that ConcurrentLogDocMergePolicy derived from the serial version but had the necessary threading. I agree that a wrapper is a better solution if it can be made to work.
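The wrapper idea being discussed — a concurrent layer that hands the chosen merge to a background thread while all shared-state synchronization stays inside the writer — might be sketched like this. MergerStub, SynchronizedWriterStub, and ConcurrentMergeWrapper are illustrative names standing in for IndexMerger, IndexWriter, and ConcurrentMergePolicy; this is a sketch of the approach, not the actual patch:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Stand-in for the narrow IndexMerger interface the serial policy sees.
interface MergerStub {
    void merge(String spec); // stands in for IndexWriter.merge(spec)
}

// Stand-in for the writer: merge() is synchronized, so the locking
// lives here rather than in any merge policy.
class SynchronizedWriterStub implements MergerStub {
    final List<String> executed = new CopyOnWriteArrayList<>();
    public synchronized void merge(String spec) {
        executed.add(spec);
    }
}

// The concurrent wrapper: it does not pick merges itself; it only runs
// the serial policy's chosen merge on its own thread.
class ConcurrentMergeWrapper {
    private final MergerStub writer;
    ConcurrentMergeWrapper(MergerStub writer) {
        this.writer = writer;
    }

    // Return the thread so callers (or close()) can join it.
    Thread mergeInBackground(String spec) {
        Thread t = new Thread(() -> writer.merge(spec));
        t.start();
        return t;
    }
}
```

Because the wrapper only moves the merge() call onto another thread, the serial policy's own logic needs no locking — which is the "yes" half of the "yes and no" answer above.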
> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.txt
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable,
> making it possible for apps to choose a custom merge policy and for easier
> experimenting with merge policy variants.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]