[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

Michael McCandless (Jira) Thu, 25 Jun 2020 14:44:55 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145871#comment-17145871
 ]


Michael McCandless commented on LUCENE-8962:
--------------------------------------------

Ooh, found another failing test (after 341 full iterations), and this one does 
reproduce!

 
{noformat}
[junit4:pickseed] Seed property 'tests.seed' already defined: D75F6483D2E6C62C
   [junit4] <JUnit4> says ¡Hola! Master seed: D75F6483D2E6C62C
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(3328881@localhost).
   [junit4] Suite: org.apache.lucene.index.TestIndexWriterMergePolicy
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=D75F6483D2E6C62C -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-CM -Dtests.timezone=Eire -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.27s | TestIndexWriterMergePolicy.testMergeOnCommit <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<1> but was:<6>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([D75F6483D2E6C62C:F3D97E843303240C]:0)
   [junit4]    >        at org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit(TestIndexWriterMergePolicy.java:340)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> Jun 25, 2020 5:41:39 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
   [junit4]   2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=Asserting(RandomSimilarity(queryNorm=true): {content=DFI(ChiSquared)}), locale=en-CM, timezone=Eire
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=477412792,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterMergePolicy]
   [junit4] Completed [1/1 (1!)] in 0.56s, 1 test, 1 failure <<< FAILURES!
   [junit4]
   [junit4]
   [junit4] Tests with failures [seed: D75F6483D2E6C62C]:
   [junit4]   - org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit
   [junit4]
   [junit4]
   [junit4] JVM J0:     0.34 ..     1.45 =     1.11s
   [junit4] Execution time total: 1.45 sec.
   [junit4] Tests summary: 1 suite, 1 test, 1 failure {noformat}

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>          Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
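
To make the idea in the description above concrete, here is a toy model in plain Java. It does not use any Lucene API: {{Segment}}, {{mergeSmallSegments}} and the threshold of 1000 docs are hypothetical names and values invented for illustration. It only shows the point-in-time aspect: the snapshot is taken when refresh starts, small segments in that snapshot are coalesced, and a segment flushed afterwards stays out of the reader's view.

```java
import java.util.ArrayList;
import java.util.List;

public class SmallSegmentRefreshSketch {

    /** Hypothetical stand-in for a segment: just a name and a document count. */
    static final class Segment {
        final String name;
        final int docCount;

        Segment(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    /**
     * Coalesces all segments in the snapshot that have fewer than
     * smallThreshold docs into a single merged segment. Larger segments
     * pass through unchanged. A lone small segment is left as-is, since
     * there is nothing to merge it with.
     */
    static List<Segment> mergeSmallSegments(List<Segment> snapshot, int smallThreshold) {
        List<Segment> result = new ArrayList<>();
        int mergedDocs = 0;
        int smallCount = 0;
        Segment lastSmall = null;
        for (Segment s : snapshot) {
            if (s.docCount < smallThreshold) {
                mergedDocs += s.docCount;
                smallCount++;
                lastSmall = s;
            } else {
                result.add(s);
            }
        }
        if (smallCount == 1) {
            result.add(lastSmall);
        } else if (smallCount > 1) {
            result.add(new Segment("merged", mergedDocs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Segments that exist when refresh starts.
        List<Segment> live = new ArrayList<>(List.of(
                new Segment("_0", 50_000),
                new Segment("_1", 12),
                new Segment("_2", 7),
                new Segment("_3", 30)));

        // Point-in-time snapshot: a copy, so later flushes do not leak in.
        List<Segment> snapshot = new ArrayList<>(live);

        // Indexing continues while the merge runs; a new segment is flushed.
        live.add(new Segment("_4", 5));

        // The reader sees the big segment plus one merged segment, not "_4".
        for (Segment s : mergeSmallSegments(snapshot, 1000)) {
            System.out.println(s.name + " docs=" + s.docCount);
        }
    }
}
```

The hard part discussed in this issue, namely doing this inside {{IndexWriter}} while merges run concurrently and may time out or fail, is deliberately left out of the sketch.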



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
