I've been thinking about merge issues for a while and going through IndexWriter to see if I could convince myself I understood it.
There are areas I'm interested in exploring tweaking the merge policy. For example, it might be nice to have an optimize-like operation that could look at the number of deleted documents in a segment (or sequence of segments) and perform merges to remove deleted docs but without reducing the index to a single a file. Tweaks like this via IndexWriter are, well, not easy. Perhaps not even possible, given what's final and what's not. So I've been experimenting with factoring the merge policy out of IndexWriter and making it pluggable. In such a world, IndexWriter would still do all the merges, but it would delegate the decision on what segments to merge to a merge policy: interface MergePolicy { static class MergeSpecification { SegmentInfos segmentInfos; int first; int last; boolean useCompoundFile; } void merge( SegmentInfos segmentInfos ) throws CorruptIndexException, IOException; void optimize( SegmentInfos segmentInfos ) throws CorruptIndexException, IOException; } The merge policy gets created with an IndexWriter and then called via merge or optimize with the relevant segmentInfos. It in turn calls back to the IndexWriter for each primitive merge operation: int merge( MergePolicy.MergeSpecification m ) throws CorruptIndexException, IOException; When I started playing with this, I wasn't sure how well it would work but it's seems to have come out pretty very well. A lot more feasible, I think, than actually trying to derive from IndexWriter to change only the policy. My initial impetus for this was to be able to tweak how the large, old segments were handled. But all the concurrent merge stuff from a few weeks ago gave me a bunch more to think about. One of the things discussed there was having a single background thread vs. multiple threads. Given the logarithmic nature of the current merge policy, it seems to me that an obvious candidate threaded policy is to have a merge thread at each logarithmic level (with a sanity bound for small levels). That way when you kickoff a merge of big segments, which will take a while, you can still be merging smaller segments in parallel. The obvious candidate concurrency limit would be to not allow two merges at the same level concurrently. It wouldn't solve all problems, but I think it would tend to spread out the effects of cascading merges. And it so nicely fits in with the logarithmic merge policy. There are all sorts of issues to consider with something like that, but what's interested me in this last week, is that I think factoring the merge policy actually makes it pretty easy to play with these ideas. The classic merge policy, i.e., LogarithmicMergePolicy, simply decides what merges to pursue and calls IndexWriter#merge to make them happen. A ThreadedLogarithmicMergePolicy, derived from the LMP, can keep track of the necessary merges and their levels, creating/using threads when a merge is needed on a level where there's no conflict. In that new thread, it calls back to IndexWriter#merge while the calling thread returns normally. Obviously this means IndexWriter#merge isn't synchronized and, at least in this case, should only be called via the merge policy. The only thing IndexWriter#merge really needs is the SegmentInfos subsequence that it is merging. It shouldn't need any other state. When it's done, it'll need to update the SegmentInfos object and that operation needs to be synchronized, but that's only during the update. The merges by definition do not overlap (and thus maintain document order). Making concurrency like that possible in IndexWriter would take a little tweaking but I don't think we're talking much code or any of significant performance impact. The default merge policy would be what it is now (factored) and only when using a threaded merge policy would any of the threading stuff come to play. Decided via by IndexWriter#setMergePolicy. All the threading stuff is speculation at this point: I've got a factored (unpolished) version of trunk working with a factored merge policy but haven't tried to implement threading yet. Trying to simplify the putative merge policy interface does have some minor impact on the existing merge policy: at this point, one test gives different results, a result of the fact that the merge policy as I've been thinking of it, doesn't have different calls for combining multiple indexes vs. merging a single index with segments of inconsistent size. Rather than assume that segment sequence is consistent except when told otherwise, it always checks. But that's a whole 'nother discussion ... --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]