OK, I ran a quick test using Wikipedia docs; net/net, I think the behavior of TieredMergePolicy (the default) is fine. Once a too-large segment has > 50% deletes, it becomes eligible for merging and will be aggressively merged.
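
As a rough illustration of the > 50% rule, here is a minimal Python sketch. It is not Lucene code. It assumes, based on my reading of TieredMergePolicy, that a segment's size is pro-rated by its live (non-deleted) docs and that a segment is skipped as "too large" while that pro-rated size is at least half the max merged segment size. The function names are hypothetical, and the 800 MB value is the setting from my test, not the TMP default.

```python
def prorated_size_mb(size_mb: float, deleted_fraction: float) -> float:
    """Segment size counting only live docs (assumed pro-rating rule)."""
    return size_mb * (1.0 - deleted_fraction)


def is_merge_eligible(size_mb: float, deleted_fraction: float,
                      max_merged_segment_mb: float = 800.0) -> bool:
    """A segment is treated as 'too large' until its pro-rated size
    drops below half of the max merged segment size (assumption)."""
    return prorated_size_mb(size_mb, deleted_fraction) < max_merged_segment_mb / 2.0


if __name__ == "__main__":
    # (on-disk size in MB, fraction of docs deleted); made-up example values
    segments = [(790.0, 0.10), (790.0, 0.45), (790.0, 0.55), (120.0, 0.30)]
    for size_mb, deleted in segments:
        live_mb = prorated_size_mb(size_mb, deleted)
        eligible = is_merge_eligible(size_mb, deleted)
        print(f"{size_mb:6.0f} MB, {deleted:.0%} deleted -> "
              f"{live_mb:6.1f} MB live, eligible={eligible}")
```

Under these assumptions, a segment near the 800 MB cap stays out of merge selection until roughly half of its docs are deleted, which is why deletes can pile up in the biggest segments before they are reclaimed.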
To visualize this, I first built a 33.3M doc Wikipedia index (append only), then ran forever, randomly replacing each doc, which is a worst-case test since every update also deletes a previous doc. I set the max merged segment size to 800 MB so that I had a good number (17) of max-sized segments; otherwise I left TMP at defaults. I refreshed every 3 seconds and plotted the % of deleted but not yet merged docs over time. It quickly ramps up from 0 at the start, only falls again once the too-large segments start being merged, and eventually stabilizes in a fairly narrow range of 33%-45%.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Dec 4, 2014 at 5:30 AM, Michael McCandless <m...@elasticsearch.com> wrote:

> 25-40% is definitely "normal" for an index where many docs are being
> replaced; I've seen this go up to ~65% before large merges bring it back
> down.
>
> On 2) there may be some improvements we can make to Lucene's default
> TieredMergePolicy here, to reclaim deletes for the "too large" segments ...
> I'll have a look.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Dec 4, 2014 at 4:06 AM, Michal Taborsky <michal.tabor...@gmail.com> wrote:
>
>> Hello Nikolas,
>>
>> we are facing similar behavior. Did you find out anything?
>>
>> Thank you,
>> Michal
>>
>> On Monday, September 8, 2014 at 22:55:12 UTC+2, Nikolas Everett wrote:
>>
>>> My indexes change somewhat frequently. If I leave the merge
>>> settings as the default I end up with 25%-40% deleted documents (some
>>> indexes higher, some lower). I'm looking for some generic advice on:
>>> 1. Is that 25%-40% ok?
>>> 2. What kind of settings should I set to keep that in an acceptable
>>> range? For some meaning of acceptable.
>>>
>>> On (1) I'm pretty sure 25%-40% is OK for my low query traffic indexes -
>>> no use optimizing them anyway. But for my high search traffic indexes I
>>> _think_ I see a performance improvement when I have lower (<5%) deleted
>>> documents and fewer segments.
>>> But computers are complicated and my
>>> performance tests might just have been testing cache warming.... Does this
>>> conclusion match others' experience?
>>>
>>> On (2) I'm not really sure what to do. It _looks_ _like_ Lucene isn't
>>> picking up the bigger segments to merge the deletes out of them. I assume
>>> that is because they are bumping against the max allowed segment size and
>>> therefore it can only merge one at a time so it always has something better
>>> to do. I'm not sure that is healthy though. Some of those old segments
>>> can get really bloated - like 40%-50% deleted.
>>>
>>> Thanks!
>>>
>>> Nik
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/faec06a2-c352-4e3e-bea0-41ace2b35d6f%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.