[
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497857#comment-16497857
]
Simon Willnauer commented on LUCENE-7976:
-----------------------------------------
{code:java}
+ // A singleton merge with no deletes makes no sense. We can get here when forceMerge is looping around...
+ if (candidate.size() == 1) {
+   SegmentSizeAndDocs segSizeDocs = segInfosSizes.get(candidate.get(0));
+   if (segSizeDocs.segDelDocs == 0) {
+     continue;
+   }
{code}
Should we check here whether segDelDocs is below the threshold, rather than
checking that there is at least one delete?
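A minimal, self-contained sketch of what I mean (names like _deletesPctAllowed_ and the _SegmentSizeAndDocs_ stub are illustrative stand-ins, not the patch's actual code):

{code:java}
// Sketch of the suggested change: skip a singleton merge only when the
// segment's deleted-doc percentage is at or below a configured threshold,
// instead of skipping only when it has zero deletes.
class SingletonMergeCheck {
  static class SegmentSizeAndDocs {
    final int delCount; // deleted docs in the segment
    final int maxDoc;   // total docs, including deleted ones
    SegmentSizeAndDocs(int delCount, int maxDoc) {
      this.delCount = delCount;
      this.maxDoc = maxDoc;
    }
  }

  // Returns true when the singleton candidate should be skipped.
  static boolean skipSingleton(SegmentSizeAndDocs seg, double deletesPctAllowed) {
    double pctDeleted = 100.0 * seg.delCount / seg.maxDoc;
    return pctDeleted <= deletesPctAllowed;
  }
}
{code}

With a 20% threshold this would also skip a singleton merge of a segment that has only a handful of deletes, which seems like the intent.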
{code:java}
+ if (haveOneLargeMerge == false || bestTooLarge == false || mergeType == MERGE_TYPE.FORCE_MERGE_DELETES) {
{code}
I have a question about this; I might just not understand it well enough:
* if we have already seen one or more large merges, we don't add the merge
* if the best one is too large, we don't add it either
* but when we do forceMergeDeletes we always add it

Now, if we have not yet seen a too-large merge but the best one is too large, we
still add it? Is that correct? Don't we want to prevent
these massive merges? I might just be missing something; sorry for being slow.
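To make the case I'm asking about explicit, here is the guard enumerated as a self-contained sketch (the enum stub is illustrative; the boolean names mirror the patch):

{code:java}
// Enumerates the guard as written in the patch. With no large merge seen
// yet (haveOneLargeMerge == false) the condition is true even when
// bestTooLarge == true, so the first too-large merge is still admitted.
class MergeGuard {
  enum MergeType { NATURAL, FORCE_MERGE, FORCE_MERGE_DELETES }

  static boolean addsMerge(boolean haveOneLargeMerge, boolean bestTooLarge, MergeType mergeType) {
    return haveOneLargeMerge == false
        || bestTooLarge == false
        || mergeType == MergeType.FORCE_MERGE_DELETES;
  }
}
{code}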
{code:java}
+ this.segDelDocs = maxDoc;
{code}
I do wonder about the naming here: why is this parameter named _maxDoc_? Should
it be named _delCount_ or so?
{code:java}
+ private final SegmentCommitInfo segInfo;
+ private final long segBytes;
+ private final int segDelDocs;
+ private final int segMaxDoc;
+ private final String segName;
{code}
Can I suggest removing the _seg_ prefix? It's obvious from the class name. I
also think it should be _delCount_ instead.
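Roughly this (a sketch of the suggestion, not the patch's actual code):

{code:java}
private final SegmentCommitInfo info;
private final long bytes;
private final int delCount;
private final int maxDoc;
private final String name;
{code}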
{code:java}
if (haveWork == false) return null;
{code}
Can you please use braces around this?
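i.e., the braced form:

{code:java}
if (haveWork == false) {
  return null;
}
{code}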
{code:java}
SegmentInfos infos = SegmentInfos.readLatestCommit(searcher.getIndexReader().directory());
{code}
In SegmentsInfoRequestHandler, Solr reads the SegmentInfos from disk, which
will not result in accurate counts. I know this
is a preexisting issue; I just want to point it out. IW will use the object
identity of the reader's SegmentCommitInfo to look
up its live stats for NRT deletes etc.
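A hedged sketch of the NRT-aware alternative, assuming the searcher's reader is a StandardDirectoryReader (error handling omitted; this is not necessarily the right fix for the handler):

{code:java}
// Sketch: take the SegmentInfos from the open reader rather than from
// disk, so delete counts reflect the NRT state; fall back to the on-disk
// commit for other reader types.
DirectoryReader reader = searcher.getIndexReader();
SegmentInfos infos;
if (reader instanceof StandardDirectoryReader) {
  infos = ((StandardDirectoryReader) reader).getSegmentInfos();
} else {
  infos = SegmentInfos.readLatestCommit(reader.directory());
}
{code}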
I do like the tests, but I would have loved to see them work without IndexWriter.
They should be real unit tests, not relying on stats in IW. Do you think you can
still fix that easily? Not a blocker, just a bummer :/
> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of
> very large segments
> -------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch,
> LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch,
> LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on
> disk) handled quite easily in a single Lucene index. This is particularly
> true as features like docValues move data into MMapDirectory space. The
> current TMP algorithm allows on the order of 50% deleted documents as per a
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many
> TB) solutions like "you need to distribute your collection over more shards"
> become very costly. Additionally, the tempting "optimize" button exacerbates
> the issue since once you form, say, a 100G segment (by
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with
> >> smaller segments to bring the resulting segment up to 5G. If no smaller
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
> >> It would be rewritten into a single segment removing all deleted docs no
> >> matter how big it is to start. The 100G example above would be rewritten
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the
> default would be the same behavior we see now. As it stands now, though,
> there's no way to recover from an optimize/forceMerge except to re-index from
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the
> wild" with 10s of shards replicated 3 or more times. And that doesn't even
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A
> new merge policy is certainly an alternative.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]