[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428605#comment-16428605 ]
Erick Erickson commented on LUCENE-7976:
----------------------------------------
Marc:
Thanks for looking, especially at how jumbled the code is right now!
I collected some preliminary stats on total bytes written, admittedly
unscientific and hacky: I set a low maxMergedSegmentSizeMB and reindexed the
same docs randomly. To my great surprise, the new code wrote _fewer_ bytes than
the current code. My expectation was exactly what you're pointing out; I
expected the new code to write a lot more bytes. This was with an index that
respected max segment sizes.
On my plate today is to reconcile my expectations with the measurements. What I
_think_ happened is that Mike's clever cost measurements are coming into play
here.
The singleton merge is not intended to be run against segments that respect the
max segment size (I'll have to ensure I didn't screw this up; thanks for
drawing attention to it). It's supposed to be there to allow recovery from the
case where someone optimized down to one huge segment. If it leads to a lot of
extra writes in that case, I think that's acceptable. If it leads to a lot more
bytes written in the case where the segments respect max segment size, I worry
a lot...
In the normal case, it's not that a segment is merged when it has > 20%
deleted docs; it's that it becomes _eligible_ for merging even if it has > 50%
of maxSegmentSize in "live" docs. What I have to figure out (all help
appreciated!) is how Mike's scoring algorithm influences this. The code
starting with
// Consider all merge starts:
is key here. Say I have 100 eligible segments and a "maxMergeAtOnce" of 30.
The code starts at 0, collects up to 30 segments, and scores that merge. Then
it starts at 1, collects up to 30 segments, and scores that. Repeat until you
start at 70, keeping the "best" merge as determined by the scoring method and
using that best-scoring one. What I _think_ is happening is that the large
segments do grow past 20% deleted before they're merged, due to the scoring.
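For anyone trying to follow that description, here's a minimal sketch of the
sliding-window selection in plain Java. The String segment list, the score()
placeholder, and the class name are illustrations only, not the actual
TieredMergePolicy code or scoring:

    import java.util.ArrayList;
    import java.util.List;

    class MergeSelectionSketch {
      // Pick the best-scoring window of up to maxMergeAtOnce segments,
      // trying every possible starting position (0..70 in the example above).
      static List<String> pickBestMerge(List<String> eligible, int maxMergeAtOnce) {
        List<String> best = null;
        double bestScore = Double.POSITIVE_INFINITY; // lower score == better merge
        for (int start = 0; start + maxMergeAtOnce <= eligible.size(); start++) {
          List<String> candidate =
              new ArrayList<>(eligible.subList(start, start + maxMergeAtOnce));
          double score = score(candidate); // stand-in for TMP's real merge scoring
          if (score < bestScore) {
            bestScore = score;
            best = candidate;
          }
        }
        return best; // only the single best-scoring candidate merge is returned
      }

      // Placeholder: the real scoring weighs segment sizes, skew, and pct deleted.
      static double score(List<String> candidate) {
        return candidate.size();
      }
    }

Nothing above is the real scoring; it just shows why a given large segment
isn't guaranteed to be picked the moment it crosses the 20% threshold.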
And there's a whole discussion to be had about what's a "good" number and
whether it should be user-configurable; I chose 20% semi-randomly (and
hard-coded it!) just to get something going.
All that said, performance is the next big chunk of this I need to tackle,
ensuring that this doesn't become horribly I/O intensive. Or, as you suggest,
we figure out a way to throttle it.
Or throw out the idea of singleton merges in the first place and, now that
expungeDeletes respects max segment size too, tell users who've optimized down
to single segments that they should occasionally run expungeDeletes as they
replace documents.
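For reference, expungeDeletes corresponds to IndexWriter.forceMergeDeletes() at
the Lucene level; a minimal sketch of running it after a batch of replacements
(the index path and analyzer choice are just illustrative):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    class ExpungeDeletesSketch {
      public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
          // Reclaim space from deleted docs; with this patch the rewrite also
          // respects the configured max merged segment size.
          writer.forceMergeDeletes();
        }
      }
    }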
> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of
> very large segments
> -------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-7976.patch, LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on
> disk) handled quite easily in a single Lucene index. This is particularly
> true as features like docValues move data into MMapDirectory space. The
> current TMP algorithm allows on the order of 50% deleted documents as per a
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many
> TB), solutions like "you need to distribute your collection over more shards"
> become very costly. Additionally, the tempting "optimize" button exacerbates
> the issue since once you form, say, a 100G segment (by
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the
> docs in it are deleted, i.e. until its live data drops below half of the
> current default 5G max segment size.
> The proposal here would be to add a new parameter to TMP, something like
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name;
> suggestions welcome) which would default to 100 (i.e. the same behavior we
> have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with
> >> smaller segments to bring the resulting segment up to 5G. If no smaller
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
> >> It would be rewritten into a single segment removing all deleted docs no
> >> matter how big it is to start. The 100G example above would be rewritten
> >> to an 80G segment for instance.
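> A minimal sketch of that two-case decision (the field and method names here
> are hypothetical illustrations, not actual TieredMergePolicy members):
>
>   class BigSegmentDecisionSketch {
>     long maxMergedSegmentBytes = 5L * 1024 * 1024 * 1024; // current 5G default
>     double maxAllowedPctDeleted = 20.0;                   // the proposed knob
>
>     // Returns a description of what would happen to a segment with the given
>     // live size and percentage of deleted documents.
>     String decide(long liveBytes, double pctDeleted) {
>       if (pctDeleted <= maxAllowedPctDeleted) {
>         return "not touched by this rule";     // default of 100 == today's behavior
>       } else if (liveBytes < maxMergedSegmentBytes) {
>         return "merge with smaller segments";  // pack back up toward the 5G cap
>       } else {
>         return "singleton rewrite";            // e.g. 100G w/ 20% deleted -> 80G
>       }
>     }
>   }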
> Of course this would lead to potentially much more I/O which is why the
> default would be the same behavior we see now. As it stands now, though,
> there's no way to recover from an optimize/forceMerge except to re-index from
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the
> wild" with 10s of shards replicated 3 or more times. And that doesn't even
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A
> new merge policy is certainly an alternative.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)