[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

David Smiley (Jira) Thu, 30 Jan 2020 13:16:17 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027006#comment-17027006
 ]


David Smiley commented on LUCENE-8962:
--------------------------------------

Woah; I defer to your expertise [~mikemccand].  IndexWriter has a huge bus 
factor and I haven't delved into it.  Still... I want to confirm what I think 
you are telling me.  Based on my understanding of when merges are triggered and 
become visible (to a reader/searcher), I wrote the following test in 
{{TestIndexWriterMergePolicy}}:

{code:java}
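  // Checks that a merge triggered by commit() is visible to an NRT reader.
  public void testMergeOnCommitIsSearchable() throws IOException {
    try (
        Directory dir = newDirectory();
        IndexWriter writer = new IndexWriter(dir,
            newIndexWriterConfig(new MockAnalyzer(random()))
                .setMaxBufferedDocs(10)  // flush a segment every 10 docs
                .setMergePolicy(new LogDocMergePolicy())
                .setMergeScheduler(new SerialMergeScheduler()))  // merge on the calling thread
    ) {
      for (int i = 0; i < 99; i++) {
        addDoc(writer);
        checkInvariants(writer);
      }
      // 99 docs at 10 per flush: 9 flushed segments, 9 docs still buffered.
      assertEquals(9, writer.getSegmentCount());
      assertEquals(9, writer.getNumBufferedDocuments());
      // commit() flushes the buffered docs as the 10th segment.
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(writer)) {
        // The NRT reader is expected to see one merged segment.
        assertEquals(1, reader.getSequentialSubReaders().size());
      }
    }
  }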
{code}

Generally speaking, and as seen here, after a commit a search application will 
open a new reader to be able to search over the recently committed documents.  
In the scenario above, I index a bunch of documents, some of which have already 
been flushed and some of which are still pending.  Also notice the 
SerialMergeScheduler, which makes the writing thread merge in-process / 
synchronously.  Then I open an NRT reader from the writer and count the 
segments.  I get 1, because flushing the buffer produces the 10th segment, and 
the configured LogDocMergePolicy merges all of them together once there are 10.
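
To make the arithmetic concrete, here is a minimal sketch of the counting only. It is a toy model, not Lucene code: the class {{ToySegmentModel}} and its methods are hypothetical names, and the merge rule is deliberately simplified (merge as soon as {{mergeFactor}} segments exist, ignoring the size levels that the real LogDocMergePolicy considers). It assumes a merge factor of 10.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of doc-count flushing and merging. NOT Lucene's LogDocMergePolicy. */
final class ToySegmentModel {
  final int maxBufferedDocs;
  final int mergeFactor;
  final List<Integer> segments = new ArrayList<>(); // doc count of each flushed segment
  int buffered = 0;                                 // docs not yet flushed

  ToySegmentModel(int maxBufferedDocs, int mergeFactor) {
    if (maxBufferedDocs < 1 || mergeFactor < 2) {
      throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
    }
    this.maxBufferedDocs = maxBufferedDocs;
    this.mergeFactor = mergeFactor;
  }

  /** Buffers one doc and flushes a segment once the buffer is full. */
  void addDoc() {
    buffered++;
    if (buffered == maxBufferedDocs) {
      flush();
    }
  }

  /** Models commit as "flush whatever is buffered", with merging done synchronously. */
  void commit() {
    flush();
  }

  private void flush() {
    if (buffered == 0) {
      return;
    }
    segments.add(buffered);
    buffered = 0;
    maybeMerge();
  }

  /** Simplified assumption: merge the oldest mergeFactor segments whenever that many exist. */
  private void maybeMerge() {
    while (segments.size() >= mergeFactor) {
      int total = 0;
      for (int i = 0; i < mergeFactor; i++) {
        total += segments.remove(0);
      }
      segments.add(0, total);
    }
  }

  public static void main(String[] args) {
    ToySegmentModel model = new ToySegmentModel(10, 10);
    for (int i = 0; i < 99; i++) {
      model.addDoc();
    }
    System.out.println("before commit: segments=" + model.segments.size()
        + " buffered=" + model.buffered);
    model.commit();
    System.out.println("after commit: " + model.segments);
  }
}
```

With 99 documents the model holds 9 segments and 9 buffered docs before the commit, and a single 99-doc segment after it, which mirrors the counts asserted in the test.  It says nothing about which segments end up in the commit point itself.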

_The test passes_.  So; am I missing something?

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



