Re: Deterministic index construction

Haoyu Zhai Sat, 19 Dec 2020 11:50:35 -0800

Hi Adrien,
I think Mike's comment is correct: we already have the index sorted, but we
want to reconstruct an index with exactly the same number of segments, where
each segment contains exactly the same documents.


Mike,
addIndexes can take CodecReader as input [1], which I think allows us to pass
in a customized FilteredIndexReader? Then it knows which docs to take.
Then, supposing the original index has N segments, we could open N IndexWriters
concurrently and rebuild those N segments, and finally somehow merge them
back into a whole index. (I am not quite sure whether we could achieve
the last step easily, but that doesn't sound so hard?)

[1]
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-
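
To make the "it knows which docs to take" part concrete, here is a minimal
sketch in plain Java (no Lucene dependency) of turning the dumped doc-segment
map into one bit mask per target segment. The class name SegmentMasks, the
method buildMasks, and the int[] representation of the map are all
hypothetical; the assumption is that each mask could then be exposed as the
live docs of a filtering reader (for instance a FilterCodecReader subclass)
handed to addIndexes, which is not shown here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SegmentMasks {

    /**
     * Builds one mask per target segment from a doc-segment map.
     * docToSegment[i] is the target segment (0..numSegments-1) of document i
     * in the source index. This layout is an assumption for illustration.
     */
    public static List<BitSet> buildMasks(int[] docToSegment, int numSegments) {
        List<BitSet> masks = new ArrayList<>();
        for (int s = 0; s < numSegments; s++) {
            masks.add(new BitSet(docToSegment.length));
        }
        for (int doc = 0; doc < docToSegment.length; doc++) {
            int seg = docToSegment[doc];
            if (seg < 0 || seg >= numSegments) {
                throw new IllegalArgumentException(
                    "doc " + doc + " maps to unknown segment " + seg);
            }
            // Mark this doc as "kept" for its target segment only.
            masks.get(seg).set(doc);
        }
        return masks;
    }

    public static void main(String[] args) {
        int[] docToSegment = {0, 1, 0, 2, 1};
        List<BitSet> masks = buildMasks(docToSegment, 3);
        for (int s = 0; s < masks.size(); s++) {
            System.out.println("segment " + s + " keeps docs " + masks.get(s));
        }
    }
}
```

Each of the N writers would then only need its own mask, so the N rebuilds
share no mutable state and can run concurrently.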

Michael Sokolov <[email protected]> wrote on Sat, Dec 19, 2020 at 9:13 AM:

> I don't know about addIndexes. Does that let you say which document goes
> where somehow? Wouldn't you have to select a subset of documents from each
> originally indexed segment?
>
> On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov <[email protected]> wrote:
>
>> I think the idea is to exert control over the distribution of documents
>> among the segments, in a deterministic reproducible way.
>>
>> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand <[email protected]> wrote:
>>
>>> Have you considered leveraging Lucene's built-in index sorting? It
>>> supports concurrent indexing and is quite fast.
>>>
>>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai <[email protected]> wrote:
>>>
>>>> Hi
>>>> Our team is seeking a way to construct (or rebuild) a deterministic
>>>> sorted index concurrently (I know Lucene can achieve that in a sequential
>>>> manner, but that might be too slow for us sometimes).
>>>> Currently we have roughly 2 ideas, both assuming there's a pre-built
>>>> index and that we have dumped a doc-segment map so that IndexWriter would
>>>> be aware of which doc belongs to which segment:
>>>> 1. First build the index in the normal way (concurrently); after the index
>>>> is built, use the "addIndexes" functionality to merge documents into the
>>>> correct segment.
>>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>>> segment created (before merge) has only the documents that belong to one of
>>>> the segments in the pre-built index. And create a dedicated MergePolicy to
>>>> only merge segments belonging to one pre-built segment.
>>>>
>>>> Basically we think the first one is easier to implement and the second
>>>> one is faster. We would welcome ideas, suggestions, and feedback here.
>>>>
>>>> Thanks
>>>> Patrick Zhai
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
