[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027553#comment-17027553
 ] 

Michael McCandless commented on LUCENE-8962:
--------------------------------------------

{quote}IndexWriter has a huge bus factor
{quote}
You mean small bus factor!
{quote}and I haven't delved into it.
{quote}
You should dive into IW, to increase the bus factor.
{quote}_The test passes_. So; am I missing something?
{quote}
Well, using {{SerialMergeScheduler}} allows the test to pass, since the merges 
kicked off due to new segments after the commit will run synchronously (using 
the main thread in your test) to completion.  And then you open a new NRT 
{{IndexReader}} directly from {{IndexWriter}} that sees only the one merged 
segment.  If you make the test more realistic (use concurrent indexing threads, 
{{ConcurrentMergeScheduler}}), the assertion should fail.  Or, if you opened 
the {{IndexReader}} from {{Directory}} instead, it should also fail.
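
To make the distinction concrete, here is a toy model (plain Java, no Lucene dependency) of why a reader opened from the writer and a reader opened from the last commit point can disagree. Every name in it ({{ToyWriter}}, {{nrtView}}, {{committedView}}) is hypothetical and only mimics the behavior described above; it is a sketch, not how {{IndexWriter}} is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: hypothetical class, NOT a Lucene API.
final class ToyWriter {
    // Segments the writer currently holds (changes when merges finish).
    private List<String> live = new ArrayList<>();
    // Segment list frozen by the most recent commit.
    private List<String> committed = new ArrayList<>();

    // A new small segment is flushed to disk.
    void flush(String segment) {
        live.add(segment);
    }

    // commit() snapshots the live segments; later merges do not alter this snapshot.
    void commit() {
        committed = new ArrayList<>(live);
    }

    // A merge that completes after the commit replaces all live segments with one.
    void mergeAll(String mergedName) {
        live = new ArrayList<>();
        live.add(mergedName);
    }

    // Stand-in for an NRT reader opened from the writer: sees the live segments.
    List<String> nrtView() {
        return Collections.unmodifiableList(new ArrayList<>(live));
    }

    // Stand-in for a reader opened from the Directory: sees the last commit point.
    List<String> committedView() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }
}
```

In this model, flushing two segments, committing, and then merging leaves {{nrtView()}} with the single merged segment while {{committedView()}} still lists the two small ones, which is the case the passing test does not exercise.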

I think in order to see the actual committed {{SegmentInfos}} reflect the 
"cheap" merges, we need to take an approach similar to [~msfroh]'s PR.

 

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
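
The selection step in the issue description quoted above can be sketched in plain Java. This is only an illustration under stated assumptions: {{RefreshMergeSketch}}, {{pickSmall}}, {{viewAfterMerge}}, and the byte threshold are all hypothetical names and values, not Lucene's {{MergePolicy}} API. The idea it shows is that only segments in the refresh's point-in-time snapshot are candidates, so anything flushed later is excluded automatically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch only: hypothetical names, NOT Lucene's MergePolicy API.
final class RefreshMergeSketch {

    // Pick segments that are part of the refresh snapshot AND smaller than the threshold.
    // Segments flushed after the refresh are not in the snapshot, so they are never picked.
    static List<String> pickSmall(List<String> refreshSnapshot,
                                  Map<String, Long> sizeBytes,
                                  long maxSmallBytes) {
        List<String> picked = new ArrayList<>();
        for (String name : refreshSnapshot) {
            Long size = sizeBytes.get(name);
            if (size != null && size < maxSmallBytes) {
                picked.add(name);
            }
        }
        return picked;
    }

    // Point-in-time view after the merge: snapshot segments that were not picked,
    // plus the single merged segment that replaces the picked ones.
    static List<String> viewAfterMerge(List<String> refreshSnapshot,
                                       List<String> picked,
                                       String mergedName) {
        List<String> view = new ArrayList<>();
        for (String name : refreshSnapshot) {
            if (!picked.contains(name)) {
                view.add(name);
            }
        }
        if (!picked.isEmpty()) {
            view.add(mergedName);
        }
        return view;
    }
}
```

For example, with a snapshot of {{_0}}, {{_1}}, {{_2}} where only {{_2}} is large, the two small segments are picked and the resulting view holds {{_2}} plus the merged segment; a segment {{_3}} flushed while the merge ran never appears in that view.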



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
