My question is in the context of Solr, but I think it would probably be best implemented in Lucene, for the benefit of all Lucene-based software. I'm describing it here to decide whether I should raise an issue.
I'm after something that would simply rewrite any segment containing deleted documents, without actually merging the segments. It would be *like* a merge, except that it would usually merge one segment to one segment, instead of many to one.

If the deleted documents are evenly scattered across the whole index (shard), simply doing forceMerge might be just as efficient, assuming disk space is not a concern. A use case with highly-bunched deletes and a relatively large number of segments would only need to work on some of the segments, and would complete faster. I suspect that bunched deletes are probably common in actual user indexes, at least for the ones where most deletes are related to document updates.

I don't know what this operation would be called. I can start the bikeshedding with something like wipeDeletes. Using expungeDeletes would be awesome, but this name is already used as a parameter for another operation, at least in Solr.

I can imagine two methods, one which has no arguments and one that takes two float percentage thresholds. For the second method, the thresholds would control what happens if the space used by segments with deletes is above or below the threshold.

The first threshold, which might be called "mergeThreshold", would merge the segments with deletes into a single segment IF the space used by the segments with deletes is less than or equal to that percentage of the whole index.

The second threshold, which might be called "forceMergeThreshold", would change the request into a forceMerge if the amount of space used by the segments with deletes is greater than or equal to that percentage of the whole index.

The no-arg method could go two ways: Either it *only* rewrites segments one to one (maybe calling the other method with Float.MIN_VALUE for both arguments), or it assigns reasonable default values to the two thresholds, perhaps 30 and 90 percent.

On my dev server, optimizing a 33GB index shard takes over 3500 seconds -- close to an hour.
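
To make the two-threshold idea concrete, here is a rough sketch of the decision logic in plain Java. It is only an illustration: SegmentStats, Action, and choose() are hypothetical names invented for this sketch, not Lucene or Solr classes, and sizes are treated as simple byte counts.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these types exist in Lucene or Solr.
class WipeDeletesSketch {

    /** Hypothetical per-segment summary: name, size on disk, deleted doc count. */
    static final class SegmentStats {
        final String name;
        final long sizeBytes;
        final int deletedDocs;

        SegmentStats(String name, long sizeBytes, int deletedDocs) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.deletedDocs = deletedDocs;
        }
    }

    /** What the proposed operation would end up doing. */
    enum Action { NOTHING, REWRITE_ONE_TO_ONE, MERGE_DELETED_INTO_ONE, FORCE_MERGE }

    /**
     * Picks an action from the share of index space held by segments that
     * contain deletes. Both thresholds are percentages of the whole index.
     */
    static Action choose(List<SegmentStats> segments,
                         float mergeThresholdPct,
                         float forceMergeThresholdPct) {
        long total = 0;
        long withDeletes = 0;
        for (SegmentStats s : segments) {
            total += s.sizeBytes;
            if (s.deletedDocs > 0) {
                withDeletes += s.sizeBytes;
            }
        }
        if (withDeletes == 0) {
            return Action.NOTHING; // no deletes anywhere, nothing to rewrite
        }
        double pct = 100.0 * withDeletes / total;
        if (pct >= forceMergeThresholdPct) {
            return Action.FORCE_MERGE; // deletes are spread over most of the index
        }
        if (pct <= mergeThresholdPct) {
            return Action.MERGE_DELETED_INTO_ONE; // small enough to combine cheaply
        }
        return Action.REWRITE_ONE_TO_ONE; // in between: rewrite each segment alone
    }

    public static void main(String[] args) {
        List<SegmentStats> segments = Arrays.asList(
                new SegmentStats("_a", 20L, 0),
                new SegmentStats("_b", 10L, 0),
                new SegmentStats("_c", 2L, 5),
                new SegmentStats("_d", 1L, 3));
        System.out.println(choose(segments, 30f, 90f));
    }
}
```

One wrinkle shows up when writing this down: Float.MIN_VALUE in Java is the smallest positive float, not a negative number, so with a literal "greater than or equal" comparison it would trip the forceMerge branch whenever any deletes exist. In this sketch, one-to-one-only behavior comes from thresholds that can never match, for example -1 and 101; a real implementation might prefer an explicit flag.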
I only do the optimize (forceMerge in Lucene) to clean out deletes so they don't accumulate. Any performance increase that I obtain is a nice bonus -- not the reason for the optimize. I would expect the operation I am describing here to take a fraction of that time, if it is run on an index that has never been optimized.

My TMP settings are roughly equivalent to a mergeFactor of 35. I have the potential for many segments.

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">35</int>
      <int name="segmentsPerTier">35</int>
      <int name="maxMergeAtOnceExplicit">105</int>
    </mergePolicy>

Most of my deletes are concentrated in the most recently added documents. Normal merging will eliminate some of them, and most of what is left will be in the first tier of merged segments, which should be pretty small. Getting rid of deleted documents should be very efficient on my indexes with this operation.

Thanks,
Shawn
