[ https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544392#comment-16544392 ]
Marc Morissette commented on LUCENE-8263:
-----------------------------------------
{quote}the above simulations suggest around 2.1x more merging with 10% of
allowed deletes but I wouldn't be surprised that it could be much worse in
practice in production under certain conditions.{quote}
I understand why you would rather not give users another way to shoot
themselves in the foot, but I think you may underestimate how diverse and
idiosyncratic some use cases can get. There are many real-world situations
where a setting lower than 20% might be very appropriate:
* Very large indexes that are rarely updated, i.e. where size matters far more
than IO
* Indexes where large documents are updated more often than small documents,
which skews TieredMergePolicy's estimate of delete%
* Query-heavy, update-light indexes where update IO is a tiny fraction of query
IO
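To make the second bullet concrete, here is a minimal sketch of how a delete percentage computed from document counts can diverge from the share of bytes actually held by deleted documents. It is not Lucene code: the class name, method names, and document sizes are hypothetical, and it assumes the estimate is derived from document counts rather than bytes.

```java
// Hypothetical illustration only; this is not TieredMergePolicy code.
// Assumption: delete% is estimated from document counts, not from bytes.
final class DeleteSkewSketch {

    /** Percentage of documents marked deleted, measured by document count. */
    static double pctDeletedByDocCount(long[] docBytes, boolean[] deleted) {
        int deletedDocs = 0;
        for (boolean d : deleted) {
            if (d) {
                deletedDocs++;
            }
        }
        return 100.0 * deletedDocs / docBytes.length;
    }

    /** Percentage of total bytes that belong to deleted documents. */
    static double pctDeletedByBytes(long[] docBytes, boolean[] deleted) {
        long totalBytes = 0;
        long deadBytes = 0;
        for (int i = 0; i < docBytes.length; i++) {
            totalBytes += docBytes[i];
            if (deleted[i]) {
                deadBytes += docBytes[i];
            }
        }
        return 100.0 * deadBytes / totalBytes;
    }

    public static void main(String[] args) {
        // Made-up segment: two large documents (deleted by updates)
        // and eight small documents that are still live.
        long[] docBytes = {100_000, 100_000, 1_000, 1_000, 1_000,
                           1_000, 1_000, 1_000, 1_000, 1_000};
        boolean[] deleted = {true, true, false, false, false,
                             false, false, false, false, false};

        System.out.println("delete% by doc count: "
                + pctDeletedByDocCount(docBytes, deleted));
        System.out.println("delete% by bytes:     "
                + pctDeletedByBytes(docBytes, deleted));
    }
}
```

With these made-up sizes, 20% of the documents are deleted but roughly 96% of the bytes are reclaimable, so a count-based threshold would understate how much space is wasted.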
Users who will be looking to alter deletesPctAllowed will presumably be doing
so because the default is inappropriate for their use case. I feel that 20-50%
might be too narrow a range for some significant percentage of these use cases.
I think documenting the danger of setting too low a value and letting users do
their own experiments is the better course of action.
> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more
> aggressive merging
> ------------------------------------------------------------------------------------------------
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large
> indexes. This parameter will do more aggressive merging of segments with
> deleted documents when the _total_ percentage of deleted docs in the entire
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20%
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10%
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits
> that reference this new parameter. After it's checked in we can bring this
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in
> 7976. The first cut at this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging.
> Empirically, using the same percentage for both caused the merging to hover
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming
> my preliminary results hold, you specify this parameter at, say, 20% and once
> the index hits that % deleted docs it hovers right around there, even if
> you've forceMerged earlier down to 1 segment. This seems in line with what
> I'd expect and adding another parameter seems excessively complicated to no
> good purpose. We could always add something like that later if we wanted.
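The double-duty idea in 3a and 3b above can be sketched roughly as follows. This is only an illustration of the two checks as described in the issue text, not the actual patch: the class, method names, and the choice of strict versus non-strict comparisons are assumptions.

```java
// Hypothetical sketch of one percentage serving two purposes (3a and 3b).
// Not taken from the LUCENE-8263 patch; all names are illustrative.
final class DoubleDutySketch {

    /** 3a: does the index-wide share of deleted docs exceed the target? */
    static boolean indexOverTarget(long totalDocs, long deletedDocs, double targetPct) {
        return totalDocs > 0 && 100.0 * deletedDocs / totalDocs > targetPct;
    }

    /**
     * 3b: a segment larger than maxSegmentBytes / 2 is eligible for merging
     * only if its own deleted percentage reaches the same target.
     * Assumption: "reaches" is modelled as >=. Smaller segments are outside
     * this rule and are not modelled here.
     */
    static boolean largeSegmentEligible(long segmentBytes, long maxSegmentBytes,
                                        double segmentPctDeleted, double targetPct) {
        boolean isLarge = segmentBytes > maxSegmentBytes / 2;
        return isLarge && segmentPctDeleted >= targetPct;
    }

    public static void main(String[] args) {
        double targetPct = 20.0;                      // the single shared parameter
        long maxSegmentBytes = 5L * 1024 * 1024 * 1024; // illustrative 5 GB cap

        // Index-wide: 250 of 1000 docs deleted, i.e. 25% deleted.
        System.out.println(indexOverTarget(1000, 250, targetPct));

        // A 4 GB segment with 30% deletes, checked against the same target.
        System.out.println(largeSegmentEligible(4L * 1024 * 1024 * 1024,
                maxSegmentBytes, 30.0, targetPct));
    }
}
```

Using one value for both checks is what the description credits for the index hovering around the configured percentage; a second parameter would decouple the two comparisons.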