[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447323#comment-16447323 ]
Michael McCandless commented on LUCENE-7976:
--------------------------------------------

{quote}About removing {{@lucene.experimental}}, yes that was deliberate, TMP has been around for a very long time and it seemed to me that it's now mainstream. I have no problem with putting it back. Let me know if that's your preference. Is putting it back for back-compat? Well, actually so we don't _have_ to maintain back-compat?{quote}

Well, it expresses that the API might change without back-compat, and as long as TMP has been around, I'm not sure it's safe to remove that label yet. E.g. here on this issue we are working out big changes to its behavior (though no API breaks, I think?).

{quote}What's the purpose here? Mechanically it's simple and I'll be glad to do it, I'd just like to know what the goal is. My guess is so we can have a clear distinction between changes in behavior in NATURAL indexing and refactoring.{quote}

Hmm, I was hoping to separate out changes that are just refactoring (with no change to behavior), which I think is the bulk of the change here, from changes that do alter behavior (the {{indexPctDeletedTarget}}). That makes large changes like this easier to review, I think.

{quote}When you say "no change in behavior" you were referring to NATURAL merging, correct? Not FORCE_MERGE or FORCE_MERGE_DELETES. Those will behave quite differently.{quote}

Hmm, then I'm confused – I thought the refactoring was to get all of these methods to use the scoring approach (enumerate all possible merges, score them, pick the best-scoring ones), and that that change alone should not change behavior; then, separately, we change the limits on % deletions of a max-sized segment before it can be merged.

{quote}Should you be using writer.numDeletesToMerge rather than the info.getDelDocs other places{quote}

Hmm, I think it's more correct to use {{writer.numDeletesToMerge}} – that API will also reflect any pending deletions, which can be significant.
If you use {{info.getDelCount}} you are using stale information. That first method should not be too costly, unless soft deletes are used?

> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>       Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch
>
> We're seeing situations "in the wild" where there are very large indexes (on disk) handled quite easily in a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents, as per a dev-list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue, since once you form, say, a 100G segment (by optimizing/forceMerging) it is not eligible for merging until 97.5G of the docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions welcome), which would default to 100 (i.e. the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:
> Any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE.
> There are two cases:
>> The segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
>> The segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment, removing all deleted docs no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.
> Of course this would lead to potentially much more I/O, which is why the default would be the same behavior we see now. As it stands now, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes at this point "in the wild", with 10s of shards replicated 3 or more times. And that doesn't even include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
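For reference, the "enumerate all possible merges, score them, pick the best-scoring one" refactoring discussed in the comment has roughly the following shape. This is a hedged standalone sketch, not Lucene's actual TieredMergePolicy code; the class, the window-based enumeration, and the skew score are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch (NOT Lucene code) of the refactored shape discussed above:
// enumerate candidate merges, score each one, and keep the best-scoring candidate.
public class ScoredMergeSketch {

    /** A candidate merge: which segments (by index) and how well it scores (lower = better). */
    static final class Candidate {
        final List<Integer> segments;
        final double score;
        Candidate(List<Integer> segments, double score) {
            this.segments = segments;
            this.score = score;
        }
    }

    /**
     * Enumerate every contiguous window of {@code mergeFactor} segments over a
     * size-sorted list, score each window by skew (largest segment / window total),
     * and return the best-scoring candidate, or null if there are too few segments.
     */
    static Candidate bestMerge(double[] sortedSizesGb, int mergeFactor) {
        Candidate best = null;
        for (int start = 0; start + mergeFactor <= sortedSizesGb.length; start++) {
            double total = 0, largest = 0;
            List<Integer> ids = new ArrayList<>();
            for (int i = start; i < start + mergeFactor; i++) {
                total += sortedSizesGb[i];
                largest = Math.max(largest, sortedSizesGb[i]);
                ids.add(i);
            }
            double score = largest / total; // skewed merges score worse (higher)
            if (best == null || score < best.score) {
                best = new Candidate(ids, score);
            }
        }
        return best;
    }
}
```

The point of this shape, as the comment notes, is that NATURAL, FORCE_MERGE, and FORCE_MERGE_DELETES can all reuse the same enumerate/score/pick machinery and differ only in which candidates are eligible.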
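The selection rule proposed in the description, a deleted-percentage cap plus the two rewrite cases, can be sketched as below. This is an illustration only: {{maxAllowedPctDeleted}} is the placeholder knob from the proposal, and none of these names are real TieredMergePolicy settings or APIs:

```java
// Sketch of the proposed rule from the issue description; MAX_ALLOWED_PCT_DELETED is the
// placeholder knob from the proposal, not a real TieredMergePolicy setting.
public class DeletePctProposalSketch {
    static final double MAX_SEGMENT_GB = 5.0;           // stands in for the 5G max segment size
    static final double MAX_ALLOWED_PCT_DELETED = 20.0; // proposed cap; 100 = today's behavior

    /** Over the cap: the segment is merged or rewritten no matter how large. */
    static boolean mustRewriteOrMerge(int maxDoc, int delCount) {
        return 100.0 * delCount / maxDoc > MAX_ALLOWED_PCT_DELETED;
    }

    /**
     * Case 1 vs case 2: merge with smaller segments if the live docs fit under the
     * size cap, otherwise rewrite the segment by itself (a "singleton" merge).
     */
    static String caseFor(double liveSizeGb) {
        return liveSizeGb < MAX_SEGMENT_GB ? "MERGE_WITH_SMALLER" : "SINGLETON_REWRITE";
    }

    /** Rewriting drops deletions: e.g. a 100G segment at 20% deleted becomes ~80G. */
    static double rewrittenSizeGb(double totalSizeGb, double pctDeleted) {
        return totalSizeGb * (1.0 - pctDeleted / 100.0);
    }
}
```

Under these assumptions, the 100G forceMerged segment from the description trips the cap once it passes 20% deletions and, having more than 5G of live docs, takes the singleton-rewrite path down to roughly 80G.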