[jira] [Comment Edited] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

Michael Froh (Jira) Wed, 04 Mar 2020 11:33:28 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051558#comment-17051558
 ]


Michael Froh edited comment on LUCENE-8962 at 3/4/20, 7:32 PM:
---------------------------------------------------------------

It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
properly handle the case where, say, an {{IOException}} occurred when committing 
the merge (after {{SegmentInfo.setFiles()}} was called, but before the files 
were actually written to disk).

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.
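
To make the idea concrete, here is a rough, standalone sketch of the flag approach. It does not use the real Lucene classes: {{SegmentInfoStub}}, {{OneMergeStub}}, {{TrackingMerge}}, {{FailurePoint}}, {{runMerge}} and the {{committed}} field are all hypothetical stand-ins, and the real change to {{IndexWriter.commitMerge()}} may well look different.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified stand-ins only; none of these are the real Lucene classes. */
class MergeSuccessFlagSketch {

    /** Where the simulated merge should fail. */
    enum FailurePoint { NONE, BEFORE_FILES_SET, DURING_COMMIT }

    /** Stand-in for the merged segment's info: files() throws until setFiles() is called. */
    static class SegmentInfoStub {
        private Set<String> files; // null until the merge has computed its files

        void setFiles(Set<String> files) { this.files = files; }

        Set<String> files() {
            if (files == null) {
                throw new IllegalStateException("files were not computed yet");
            }
            return files;
        }
    }

    /** Stand-in for OneMerge, carrying the proposed flag. */
    static class OneMergeStub {
        final SegmentInfoStub info = new SegmentInfoStub();
        private boolean aborted;
        private boolean committed; // hypothetical field: set only after a successful commit

        boolean isAborted() { return aborted; }
        void setCommitted() { committed = true; }
        boolean isCommitted() { return committed; }

        /** Hook invoked when the merge readers are closed, on success or failure. */
        void mergeFinished() {}
    }

    /** Override that only references the merged files when the merge really committed. */
    static class TrackingMerge extends OneMergeStub {
        final List<String> referencedFiles = new ArrayList<>();

        @Override
        void mergeFinished() {
            if (!isAborted() && isCommitted()) {
                referencedFiles.addAll(info.files()); // safe: files were set before commit
            }
        }
    }

    /** Simulates a merge that can fail before the files are set or while committing. */
    static void runMerge(OneMergeStub merge, FailurePoint failAt) {
        try {
            if (failAt == FailurePoint.BEFORE_FILES_SET) {
                throw new IOException("simulated codec failure");
            }
            merge.info.setFiles(new HashSet<>(Arrays.asList("_1.cfs", "_1.si")));
            if (failAt == FailurePoint.DURING_COMMIT) {
                throw new IOException("simulated failure while committing");
            }
            merge.setCommitted(); // set just before the readers are closed
        } catch (IOException e) {
            // Swallowed here for brevity; a real writer would propagate or record it.
        } finally {
            merge.mergeFinished(); // closing the merge readers always triggers the hook
        }
    }

    public static void main(String[] args) {
        for (FailurePoint p : FailurePoint.values()) {
            TrackingMerge merge = new TrackingMerge();
            runMerge(merge, p);
            System.out.println(p + " -> referenced files: " + merge.referencedFiles);
        }
    }
}
```

Note that in the {{DURING_COMMIT}} case {{files()}} would not throw, so a try-catch around it would wrongly treat the merge as successful; the flag covers both failure points.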


was (Author: msfroh):
It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
catch an IOException when e.g. committing the merge.

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



