[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563440#comment-17563440 ]
ASF subversion and git services commented on LUCENE-10216:
----------------------------------------------------------

Commit 698f40ad51af0c42b0a4a8321ab89968e8d0860b in lucene's branch refs/heads/main from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=698f40ad51a ]

LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges. (#633)

* Use merge policy and merge scheduler to run addIndexes merges
* wrapped reader does not see deletes - debug
* Partially fixed tests in TestAddIndexes
* Use writer object to invoke addIndexes merge
* Use merge object info
* Add javadocs for new methods
* TestAddIndexes passing
* verify field info schemas upfront from incoming readers
* rename flag to track pooled readers
* Keep addIndexes API transactional
* Maintain transactionality - register segments with iw after all merges complete
* fix checkstyle
* PR comments
* Fix pendingDocs - numDocs mismatch bug
* Tests with 1-1 merges and partial merge failures
* variable renaming and better comments
* add test for partial merge failures
* change tests to use 1-1 findmerges
* abort pending merges gracefully
* test null and empty merge specs
* test interim files are deleted
* test with empty readers
* test cascading merges triggered
* remove nocommits
* gradle check errors
* remove unused line
* remove printf
* spotless apply
* update TestIndexWriterOnDiskFull to accept mergeException from failing addIndexes calls
* return singleton reader mergespec in NoMergePolicy
* rethrow exceptions seen in merge threads on failure
* spotless apply
* update test to new exception type thrown
* spotlessApply
* test for maxDoc limit in IndexWriter
* spotlessApply
* Use DocValuesIterator instead of DocValuesFieldExistsQuery for counting soft deletes
* spotless apply
* change exception message for closed IW
* remove non-essential comments
* update api doc string
* doc string update
* spotless
* Changes file entry
* simplify findMerges API, add 1-1 merges to MockRandomMergePolicy
* update merge policies to new api
* remove unused imports
* spotless apply
* move changes entry to end of list
* fix testAddIndicesWithSoftDeletes
* test with 1-1 merge policy always enabled
* please spotcheck
* tidy
* test - never use 1-1 merge policy
* use 1-1 merge policy randomly
* Remove concurrent addIndexes findMerges from MockRandomMergePolicy
* Bug Fix: RuntimeException in addIndexes - Aborted pending merges were slipping through the merge exception check in API, and getting caught later in the RuntimeException check.
* tidy
* Rebase on main. Move changes to 10.0
* Synchronize IW.AddIndexesMergeSource on outer class IW object
* tidy

> Add concurrency to addIndexes(CodecReader…) API
> -----------------------------------------------
>
>                 Key: LUCENE-10216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Vigya Sharma
>           Priority: Major
>          Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the
> e-commerce platform. I’m working on a project that involves applying
> metadata+ETL transforms and indexing documents on n different _indexing_
> boxes, combining them into a single index on a separate _reducer_ box, and
> making it available for queries on m different _search_ boxes (replicas).
> Segments are asynchronously copied from indexers to reducers to searchers as
> they become available for the next layer to consume.
>
> I am using the addIndexes API to combine multiple indexes into one on the
> reducer boxes. Since we also have taxonomy data, we need to remap facet field
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments
> with new ordinal values while also merging all provided segments in the
> process.
>
> _This is however a blocking call that runs in a single thread._ Until we have
> written segments with new ordinal values, we cannot copy them to searcher
> boxes, which increases the time to make documents available for search.
>
> I was playing around with the API by creating multiple concurrent merges,
> each with only a single reader, creating a concurrently running 1:1
> conversion from old segments to new ones (with new ordinal values). We follow
> this up with non-blocking background merges.
> This lets us copy the segments to searchers and replicas as soon as they are
> available, and later replace them with merged segments as background jobs
> complete. On the Amazon dataset I profiled, this gave us around 2.5 to 3x
> improvement in addIndexes() time. Each call was given about 5 readers to add
> on average.
>
> This might be a useful addition to Lucene. We could create another
> {{addIndexes()}} API with a {{boolean}} flag for concurrency, that internally
> submits multiple merge jobs (each with a single reader) to the
> {{ConcurrentMergeScheduler}}, and waits for them to complete before
> returning.
>
> While this is doable from outside Lucene by using your own thread pool,
> starting multiple addIndexes() calls and waiting for them to complete, I felt
> it requires some understanding of what addIndexes does, why you need to wait
> on the merge, and why it makes sense to pass a single reader in the
> addIndexes API. Out-of-the-box support in Lucene could simplify this for
> folks with a similar use case.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
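The external workaround the reporter describes (submitting one addIndexes() call per reader from a caller-owned thread pool, then waiting for all of them before proceeding) boils down to a fan-out-and-wait pattern. Below is a minimal sketch using only java.util.concurrent; the per-reader merge is stubbed as a hypothetical mergeOneReader method, standing in for what would really be an IndexWriter.addIndexes(CodecReader...) call on a single reader:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentAddIndexesSketch {

  // Hypothetical stand-in for one reader's worth of work; in Lucene this
  // would be writer.addIndexes(codecReader) for a single CodecReader.
  static String mergeOneReader(String readerName) {
    return "merged:" + readerName;
  }

  // Submit one merge task per reader and block until all complete,
  // rethrowing the first task failure -- mirroring the "wait for them to
  // complete before returning" behavior the issue describes.
  public static List<String> mergeAll(List<String> readers) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, readers.size()));
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String reader : readers) {
        futures.add(pool.submit(() -> mergeOneReader(reader)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> f : futures) {
        try {
          results.add(f.get()); // blocks; surfaces failures from merge threads
        } catch (ExecutionException e) {
          throw new RuntimeException("addIndexes merge failed", e.getCause());
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(mergeAll(List.of("seg0", "seg1", "seg2")));
    // prints [merged:seg0, merged:seg1, merged:seg2]
  }
}
```

In the actual change committed above, this fan-out moved inside IndexWriter itself, which hands the per-reader merges to the configured MergePolicy and MergeScheduler rather than a caller-owned pool, and keeps the API transactional by registering segments only after all merges complete.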