[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506594#comment-16506594 ]

Erick Erickson commented on LUCENE-7976:
----------------------------------------

[~simonw]

bq. should we check here if the segDelDocs is less than the threshold rather 
than checking if there is at least one delete.

Not unless we redefine what forceMerge does. It's perfectly possible to have a 
segment at this point that's 4.999G with one document deleted. It'll be 
horribly wasteful, but it's no worse than what has always happened with 
forceMerge.

Outside of forceMerge, segments won't be eligible unless they have 10% deleted 
docs.
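
To make that concrete, here's a minimal sketch of the rule as described above. The class, method, and constant names are made up for illustration; this is not the actual patch code:

{code:java}
// Illustrative sketch only -- names are hypothetical, not from the patch.
class DeletesEligibilitySketch {
  // The 10% figure comes from the discussion above; purely illustrative here.
  static final double DELETES_PCT_THRESHOLD = 10.0;

  // Assumes maxDoc > 0.
  static boolean eligible(int maxDoc, int delCount, boolean isForceMerge) {
    if (isForceMerge) {
      // forceMerge rewrites any segment with at least one delete,
      // even a 4.999G segment with a single deleted document.
      return delCount > 0;
    }
    // Outside forceMerge, the segment must cross the deletes threshold.
    return 100.0 * delCount / maxDoc >= DELETES_PCT_THRESHOLD;
  }
}
{code}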

In the case of findMerges, I'm counting on the scoring mechanism to keep this 
from being a problem.

bq. no if we have not seen a too-large merge but the best one is too large we 
still add it? Is this correct? Don't we want to prevent that?

This is awkward at present in that it preserves the old behavior: 
findForcedDeletesMerges has always allowed multiple large merges. I'm leaving 
that for a later JIRA.

In the other cases, this will prevent multiple large merges. The first time we 
get a large merge, haveOneLargeMerge == false and bestTooLarge == true, so we 
create the large merge and haveOneLargeMerge flips to true.

Thereafter, any candidate with bestTooLarge == true is skipped.
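
A sketch of that control flow, assuming a loop over candidates ranked best-first. CandidateMerge and the loop structure are hypothetical stand-ins; only the two flag names mirror the actual discussion:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: CandidateMerge and select() are hypothetical;
// only haveOneLargeMerge and bestTooLarge come from the description above.
class LargeMergeGuardSketch {
  static class CandidateMerge {
    double estimatedSizeMB;
  }

  static List<CandidateMerge> select(List<CandidateMerge> rankedBestFirst,
                                     double maxMergedSegmentMB) {
    List<CandidateMerge> spec = new ArrayList<>();
    boolean haveOneLargeMerge = false;
    for (CandidateMerge best : rankedBestFirst) {
      boolean bestTooLarge = best.estimatedSizeMB > maxMergedSegmentMB;
      if (haveOneLargeMerge && bestTooLarge) {
        continue; // already admitted one over-sized merge this pass
      }
      // First over-sized candidate: haveOneLargeMerge is still false,
      // so it is added here and the flag flips to true.
      haveOneLargeMerge |= bestTooLarge;
      spec.add(best);
    }
    return spec;
  }
}
{code}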

bq. I do wonder about the naming here. Why is this named maxDoc? Should it be 
named delCount or so?

Brain fart, changed. I started out doing one thing, then changed it without 
noticing the name no longer fit.

bq. can I suggest to remove the seg prefix? It's obvious from the name. I also 
think it should be delCount instead.
Done.

bq. can you please use parentheses around this?
Done.

bq. in SegmentsInfoRequestHandler solr reads the SegmentInfos from disk which 
will not result in accurate counts.

Good to know; is there a better way to go? I don't think total accuracy is 
necessary here.
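
If it helps, here's a minimal sketch of what I assume the alternative would look like: opening an NRT reader from the IndexWriter, which sees in-memory deletes that SegmentInfos read from the last commit would miss. This is my guess at the pattern, not the actual SegmentsInfoRequestHandler code:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LeafReaderContext;

// Sketch only: an NRT reader opened from the writer reflects in-memory
// deletes, unlike SegmentInfos read from the last commit on disk.
class LiveDelCountsSketch {
  static void printDelCounts(IndexWriter writer) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(writer)) {
      for (LeafReaderContext ctx : reader.leaves()) {
        System.out.println(ctx.reader() + " deletedDocs=" + ctx.reader().numDeletedDocs());
      }
    }
  }
}
{code}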

bq. ...I would love to see them work without IndexWriter... Do you think you 
can still fix that easily?

I have no idea ;) I saw the discussion at 8330 but didn't see any test 
conversions I could copy. I'll put up another version of this patch 
momentarily; if you could show me the pattern to use, I'll see what I can do. 
That said, if it's involved at all I'd like to put it in a follow-on JIRA.

[~mikemccand] This set of changes is purely stylistic; there are no code 
changes. So unless there are objections, I'll commit it sometime next week. 




> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of 
> very large segments
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, 
> LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, 
> LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, 
> SOLR-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many 
> TB) solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name; suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO 
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with 
> >> smaller segments to bring the resulting segment up to 5G. If no smaller 
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
> >> It would be rewritten into a single segment removing all deleted docs no 
> >> matter how big it is to start. The 100G example above would be rewritten 
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.
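
For illustration, a minimal sketch of the decision rule the quoted description proposes; every name here is hypothetical:

{code:java}
// Sketch of the proposed rule: past the deleted-docs threshold, a
// segment is either merged with smaller segments or rewritten alone,
// depending on whether its "live" size already exceeds the max.
class ProposedRuleSketch {
  enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, SINGLETON_REWRITE }

  static Action decide(double totalSizeGB, double pctDeleted,
                       double maxAllowedPctDeleted, double maxSegmentGB) {
    if (pctDeleted <= maxAllowedPctDeleted) {
      return Action.LEAVE_ALONE; // current behavior below the threshold
    }
    double liveSizeGB = totalSizeGB * (100.0 - pctDeleted) / 100.0;
    if (liveSizeGB < maxSegmentGB) {
      return Action.MERGE_WITH_SMALLER; // fill the result up toward 5G
    }
    // The 100G example at 20% deletes: rewritten alone into ~80G.
    return Action.SINGLETON_REWRITE;
  }
}
{code}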


