[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214947#comment-16214947 ]

Michael McCandless commented on LUCENE-7976:
--------------------------------------------

{quote}
> But how can that work?

It will work as defined. For some, this will be worse and they should not have 
called forceMerge. For others, they knew what they were doing and it's exactly 
what they wanted.
If you don't want 1 big segment, don't call forceMerge(1).
{quote}

But then the bug is not fixed?  I.e. if we don't require forced merges and 
natural merges to respect the same segment size, then users who force merge and 
then insist on continuing to change the index can easily get themselves to 
segments with 97% deletions.
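
To put a number on that 97%: with the default 5 GB {{maxMergedSegmentMB}}, TMP today only reconsiders a big segment for natural merging once its live size has shrunk to roughly half of that cap, so a force-merged 100 GB segment has to shed nearly everything first. Rough arithmetic (the half-of-max threshold is the current behavior I'm assuming here):

{code:java}
// Back-of-the-envelope for the force-merged 100 GB segment:
double maxMergedSegmentGB = 5.0;                        // TieredMergePolicy default
double forcedSegmentGB    = 100.0;                      // size right after forceMerge(1)
double eligibleLiveGB     = maxMergedSegmentGB / 2.0;   // ~2.5 GB live before TMP looks at it again
double pctDeletedFirst    = 100.0 * (forcedSegmentGB - eligibleLiveGB) / forcedSegmentGB;
// pctDeletedFirst == 97.5, i.e. ~97% of the docs must already be deleted
// before the segment becomes a natural-merge candidate again
{code}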

Even with a single enforced max segment size, users can still get into trouble 
if they really want to, e.g. by setting it to {{Long.MAX_VALUE}}, running 
{{forceMerge}}, and then reducing it back to the {{5 GB}} default again.
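
In code that escape hatch is only a few lines (a sketch only; the {{StandardAnalyzer}} and the {{Directory}} are placeholders):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;

class ForceMergeAroundTheCap {
  static void run(Directory dir) throws Exception {
    // 1) temporarily lift the cap and collapse the index into one giant segment
    TieredMergePolicy unlimited = new TieredMergePolicy();
    unlimited.setMaxMergedSegmentMB(Long.MAX_VALUE);
    try (IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(unlimited))) {
      w.forceMerge(1);
    }
    // 2) reopen with the default 5 GB cap and keep updating as before; the
    //    giant segment is now back in the "accumulates deletes" state
    try (IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(new TieredMergePolicy()))) {
      // ... continue adds/updates/deletes here ...
    }
  }
}
{code}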

Or maybe we really should deprecate {{forceMerge}} and add a new 
{{forceMergeAndFreeze}} method...

bq. See LogByteSizeMergePolicy which already works correctly and defaults to 
maxSegmentSize=2GB, maxForcedMergeSegmentSize=Long.MAX_VALUE
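
(For reference, those are the two separate knobs on that policy, with the defaults as quoted:)

{code:java}
import org.apache.lucene.index.LogByteSizeMergePolicy;

// LogByteSizeMergePolicy keeps two independent caps; by default only the
// natural-merge cap is bounded, so forceMerge is free to build huge segments.
LogByteSizeMergePolicy lbsmp = new LogByteSizeMergePolicy();
lbsmp.setMaxMergeMB(2 * 1024);                       // natural merges: 2 GB default
lbsmp.setMaxMergeMBForForcedMerge(Long.MAX_VALUE);   // forced merges: unbounded default
{code}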

Just because an older merge policy did it this way does not mean we should 
continue to repeat the mistake.  Two wrongs don't make a right!

bq. I completely agree that removing the Solr optimize button should be done, 
take that as read. 

+1; it's insane how tempting that button makes this dangerous operation.  Who 
wouldn't want to "optimize" their index?  Hell if my toaster had a button that 
looked like Solr's optimize button, I would press it every time I made toast!

bq. I do not and will not agree that all uses of forceMerge are invalid. 
Currently, one thing that contributes to their being overused is the percentage 
of deleted documents in the index. If a user notices that near 50% of the docs 
are deleted, what else can they do? expungeDeletes doesn't help here, it still 
creates a massive segment.

But if we make the small change to allow max-sized segments to be merged 
regardless of their % deletes, then that should fix that reason for force merging?
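
Here's a rough sketch of the eligibility tweak I mean (hypothetical code, not the actual {{TieredMergePolicy}} internals):

{code:java}
// Today (roughly): any segment whose live size is >= maxMergedSegmentMB/2 is
// simply excluded from natural merge selection. The small change: keep such
// segments in the candidate pool once they carry deletes, so ordinary merging
// can reclaim the deleted docs instead of letting them pile up forever.
boolean eligibleForNaturalMerge(double liveMB, int delCount, double maxMergedSegmentMB) {
  if (liveMB < maxMergedSegmentMB / 2) {
    return true;            // normal-sized segment: same as today
  }
  return delCount > 0;      // proposed: max-sized segments stay eligible
}
{code}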

There are two separate bugs here:
#  If you force merge and then keep updating, you can get to segments with 97% 
deletes; fixing all force merges to respect the max segment size fixes this.
#  50% is too many deleted docs for some use cases; fixing TMP to let the large 
segments always be eligible for merging, plus maybe tuning up the existing 
{{reclaimDeletesWeight}} (see the sketch after this list), fixes that.
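
For the second one, the existing knob looks like this (values are illustrative, not a recommendation; whether tuning it is enough is part of what this issue is about):

{code:java}
import org.apache.lucene.index.TieredMergePolicy;

// Bias natural merge selection toward segments carrying deletes; the default
// reclaimDeletesWeight is 2.0, and larger values favor reclaiming deletes more.
TieredMergePolicy tmp = new TieredMergePolicy();
tmp.setMaxMergedSegmentMB(5 * 1024);       // keep the 5 GB default cap
tmp.setReclaimDeletesWeight(3.0);          // example: weigh deletes more heavily
{code}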

> Add a parameter to TieredMergePolicy to merge segments that have more than X 
> percent deleted documents
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many 
> TB) solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO 
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with 
> >> smaller segments to bring the resulting segment up to 5G. If no smaller 
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
> >> It would be rewritten into a single segment removing all deleted docs no 
> >> matter how big it is to start. The 100G example above would be rewritten 
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.


