[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450104#comment-16450104 ]
Michael McCandless commented on LUCENE-7976:
--------------------------------------------

{quote}Right, but that has quite a few consequences when comparing old vs. new behavior for FORCE_MERGE and FORCE_MERGE_DELETES, for several reasons, mostly stemming from having these two operations respect maxSegmentBytes:{quote}

OK, I see ... I think it still makes sense to try to break these changes into a couple of issues. This one (just refactoring to share the scoring approach, with the corresponding change in behavior) is going to be big enough!

Hmm, I see some more failing tests, e.g.:

{quote}
[junit4] Suite: org.apache.lucene.search.TestTopFieldCollectorEarlyTermination
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestTopFieldCollectorEarlyTermination -Dtests.method=testEarlyTermination -Dtests.seed=355D07976851D85A -Dtests.badapples=true -Dtests.locale=nn-NO -Dtests.timezone=America/Cambridge_Bay -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 869s J3 | TestTopFieldCollectorEarlyTermination.testEarlyTermination <<<
[junit4] > Throwable #1: java.lang.OutOfMemoryError: GC overhead limit exceeded
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A:FACA46C8503D4859]:0)
[junit4] > at java.util.Arrays.copyOf(Arrays.java:3332)
[junit4] > at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
[junit4] > at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
[junit4] > at java.lang.StringBuilder.append(StringBuilder.java:136)
[junit4] > at org.apache.lucene.store.MockIndexInputWrapper.toString(MockIndexInputWrapper.java:224)
[junit4] > at java.lang.String.valueOf(String.java:2994)
[junit4] > at java.lang.StringBuilder.append(StringBuilder.java:131)
[junit4] > at org.apache.lucene.store.BufferedChecksumIndexInput.<init>(BufferedChecksumIndexInput.java:34)
[junit4] > at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:119)
[junit4] > at org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1072)
[junit4] > at org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.readEntries(Lucene50CompoundReader.java:105)
[junit4] > at org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:69)
[junit4] > at org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:70)
[junit4] > at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:100)
[junit4] > at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:78)
[junit4] > at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:202)
[junit4] > at org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:782)
[junit4] > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4221)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3910)
[junit4] > at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
[junit4] > at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2077)
[junit4] > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1910)
[junit4] > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1861)
[junit4] > at org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:454)
[junit4] > at org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.createRandomIndex(TestTopFieldCollectorEarlyTermination.java:96)
[junit4] > at org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.doTestEarlyTermination(TestTopFieldCollectorEarlyTermination.java:123)
[junit4] > at org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.testEarlyTermination(TestTopFieldCollectorEarlyTermination.java:113)
{quote}

and

{quote}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterDelete
-Dtests.method=testOnlyDeletesTriggersMergeOnClose -Dtests.seed=355D07976851D85A -Dtests.badapples=true -Dtests.locale=en-IE -Dtests.timezone=Australia/Perth -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 0.05s J0 | TestIndexWriterDelete.testOnlyDeletesTriggersMergeOnClose <<<
[junit4] > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=660, name=Lucene Merge Thread #6, state=RUNNABLE, group=TGRP-TestIndexWriterDelete]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: segments must include at least one segment
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at least one segment
[junit4] > at org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] > Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=661, name=Lucene Merge Thread #7, state=RUNNABLE, group=TGRP-TestIndexWriterDelete]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] > Caused by: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
[junit4] > at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4072)
[junit4] > at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4052)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3904)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at least one segment
[junit4] > at org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > ...
2 more
{quote}

and

{quote}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllSlowly -Dtests.seed=355D07976851D85A -Dtests.badapples=true -Dtests.locale=en-IE -Dtests.timezone=Australia/Perth -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 0.21s J0 | TestIndexWriterDelete.testDeleteAllSlowly <<<
[junit4] > Throwable #1: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot complete forceMerge
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A:C651573F1DF18CA2]:0)
[junit4] > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1917)
[junit4] > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1861)
[junit4] > at org.apache.lucene.index.RandomIndexWriter.doRandomForceMerge(RandomIndexWriter.java:371)
[junit4] > at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:386)
[junit4] > at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:332)
[junit4] > at org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllSlowly(TestIndexWriterDelete.java:984)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at least one segment
[junit4] > at org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] 2> Apr 24, 2018 9:27:54 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #6,5,TGRP-TestIndexWriterDelete]
[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: segments must include at least one segment
[junit4] 2> at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] 2> Caused by: java.lang.RuntimeException: segments must include at least one segment
[junit4] 2> at org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] 2> at org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] 2> at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] 2> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
{quote}

Can we make these ints, and cast to double when we need to divide them?

{quote}+ double totalDelDocs = 0;
+ double totalMaxDocs = 0;{quote}

Hmm, that {{50/100}} integer division will just be zero:

{quote}cutoffSize = (long) ((double) maxMergeSegmentBytesThisMerge * (1.0 - (50/100)));{quote}

Hmm, this left me hanging (in {{findForcedMerges}}):

{quote}// First condition is that{quote}

We define this:

{quote}int totalEligibleSegs = eligible.size();{quote}

But do not decrement it when we remove segments from {{eligible}} in the loop after?
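To make the two numeric points above concrete, here is a minimal, standalone sketch (not code from the patch; the variable name {{maxMergeSegmentBytesThisMerge}} is taken from the quoted snippet, the 5 GB value and the doc counts are just illustrative):

```java
// Shows why the quoted 50/100 always truncates to 0 in Java, and the
// keep-ints-and-cast-at-the-division alternative suggested above.
public class MergeCutoffSketch {
    public static void main(String[] args) {
        long maxMergeSegmentBytesThisMerge = 5L * 1024 * 1024 * 1024; // illustrative 5 GB

        // Quoted form: 50/100 is integer division and evaluates to 0,
        // so (1.0 - (50/100)) is 1.0 and the cutoff silently stays at the full size.
        long buggyCutoff = (long) ((double) maxMergeSegmentBytesThisMerge * (1.0 - (50 / 100)));

        // A floating-point literal in the division gives the intended 50% cutoff.
        long fixedCutoff = (long) (maxMergeSegmentBytesThisMerge * (1.0 - (50 / 100.0)));

        // Keeping the doc counts as ints and promoting to double only at the division:
        int totalDelDocs = 30;   // hypothetical counts, not from the patch
        int totalMaxDocs = 100;
        double pctDeleted = 100.0 * totalDelDocs / totalMaxDocs; // 100.0 forces double math

        System.out.println(buggyCutoff == maxMergeSegmentBytesThisMerge); // prints true
        System.out.println(fixedCutoff == maxMergeSegmentBytesThisMerge / 2); // prints true
        System.out.println(pctDeleted); // prints 30.0
    }
}
```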
In {{findForcedMerges}}, since we pre-compute the per-segment sizes using {{getSegmentSizes}}, can you use that map instead of calling {{size(info, writer)}} again?

> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments
> -------------------------------------------------------------------------------------------------
>
>         Key: LUCENE-7976
>         URL: https://issues.apache.org/jira/browse/LUCENE-7976
>     Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Erick Erickson
>    Assignee: Erick Erickson
>    Priority: Major
> Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch
>
> We're seeing situations "in the wild" where there are very large indexes (on disk) handled quite easily in a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents, as per a dev list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue since once you form, say, a 100G segment (by optimizing/forceMerging) it is not eligible for merging until 97.5G of the docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions welcome) which would default to 100 (i.e. the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:
> > Any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE. There are two cases:
> >> The segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment removing all deleted docs, no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.
> Of course this would lead to potentially much more I/O, which is why the default would be the same behavior we see now. As it stands now, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes at this point "in the wild", with 10s of shards replicated 3 or more times. And that doesn't even include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org