For the sake of history: somewhere around Solr/Lucene 3.2 a new
"MergePolicy" was introduced. The old one chose merges based simply on
age, or "index generation", so the older a segment was, the less likely
it was to get merged. That is why you needed optimize to clear out
deletes from your older segments.

The new MergePolicy, the TieredMergePolicy, uses a more intelligent
algorithm to decide which segments to merge, and is the single reason
why optimization isn't recommended anymore. According to the javadocs:

"For normal merging, this policy first computes a "budget" of how many
segments are allowed to be in the index. If the index is over-budget,
then the policy sorts segments by decreasing size (pro-rating by percent
deletes), and then finds the least-cost merge. Merge cost is measured by
a combination of the "skew" of the merge (size of largest segment
divided by smallest segment), total merge size and percent deletes
reclaimed, so that merges with lower skew, smaller size and those
reclaiming more deletes, are favored.

If a merge will produce a segment that's larger than
setMaxMergedSegmentMB(double), then the policy will merge fewer segments
(down to 1 at once, if that one has deletions) to keep the segment size
under budget."
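
To make the quoted description a bit more concrete, here is a toy sketch in
Python. It is not Lucene's code: the Segment class, merge_cost, pick_merge
and the way the three cost factors are multiplied together are all my own
assumptions for illustration. It also takes the budget as a given number
rather than computing it, and it simply skips candidates that would exceed
the size cap instead of retrying with fewer segments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for an index segment (not a Lucene class)."""
    name: str
    size_mb: float      # on-disk size, deleted docs included
    pct_deletes: float  # fraction of deleted docs, 0.0 to 1.0

    @property
    def live_mb(self) -> float:
        # Size pro-rated by percent deletes.
        return self.size_mb * (1.0 - self.pct_deletes)


def merge_cost(candidate: List[Segment]) -> float:
    """Toy cost: lower skew, smaller size and more reclaimed deletes win.

    Multiplying the three factors is an assumption made for this sketch.
    """
    live = [s.live_mb for s in candidate]
    skew = max(live) / max(min(live), 1e-9)  # largest / smallest
    total_mb = sum(s.size_mb for s in candidate)
    reclaimed = 1.0 - sum(live) / total_mb   # fraction of size freed
    return skew * total_mb * (1.0 - reclaimed)


def pick_merge(segments: List[Segment], budget: int, merge_factor: int,
               max_merged_mb: float) -> Optional[List[Segment]]:
    """Return the least-cost group of merge_factor segments, or None."""
    if len(segments) <= budget:
        return None  # index is within budget, nothing to merge
    # Sort by decreasing size, pro-rated by percent deletes.
    ordered = sorted(segments, key=lambda s: s.live_mb, reverse=True)
    best = None
    for start in range(len(ordered) - merge_factor + 1):
        window = ordered[start:start + merge_factor]
        if sum(s.live_mb for s in window) > max_merged_mb:
            continue  # merged segment would be over the size cap
        if best is None or merge_cost(window) < merge_cost(best):
            best = window
    return best


segments = [
    Segment("a", 1000.0, 0.0),
    Segment("b", 100.0, 0.0),
    Segment("c", 90.0, 0.5),
    Segment("d", 80.0, 0.0),
    Segment("e", 10.0, 0.0),
]
chosen = pick_merge(segments, budget=3, merge_factor=3, max_merged_mb=5000.0)
print([s.name for s in chosen])
```

In this made-up example the similarly sized segments b, d and c are picked
(c carries 50% deletes), rather than anything involving the big, clean
segment a. That is the point of the policy: deletes get reclaimed by
ordinary merges, without forcing everything down to one segment.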

Upayavira


On Mon, Jun 29, 2015, at 08:55 PM, Toke Eskildsen wrote:
> Reitzel, Charles <charles.reit...@tiaa-cref.org> wrote:
> > Is there really a good reason to consolidate down to a single segment?
> 
> In the scenario spawning this thread it does not seem to be the best
> choice. Speaking more broadly, there are Solr setups out there that deal
> with immutable data, often tied to a point in time, e.g. log data. We
> have such a setup (harvested web resources) and are able to lower heap
> requirements significantly and increase speed by building fully optimized
> and immutable shards.
> 
> > Any incremental query performance benefit is tiny compared to the loss of
> > manageability.
> 
> True in many cases and I agree that the "Optimize"-wording is a bit of a
> trap. While technically correct, it implies that one should do it
> occasionally to keep any index fit. A different wording and maybe a
> tooltip saying something like "Only recommended for non-changing indexes"
> might be better.
> 
> Turning it around: To minimize the risk of occasional
> performance-degrading large merges, one might want an index where all the
> shards are below a certain size. Splitting larger shards into smaller
> ones would in that case also be an optimization, just towards a different
> goal.
> 
> - Toke Eskildsen
