Hi Adrien,

I think Mike's comment is correct: we already have the index sorted, but we want to reconstruct an index with exactly the same number of segments, where each segment contains exactly the same documents.

Mike, addIndexes can take CodecReader as input [1], which I think would let us pass in a customized filtering reader (something like FilterCodecReader)? That reader would then decide which docs to take. So if the original index has N segments, we could open N IndexWriters concurrently, rebuild those N segments, and finally merge them back into a single index. (I am not sure we can do that last step easily, but it does not sound too hard?)

[1] https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-

Michael Sokolov <[email protected]> wrote on Sat, Dec 19, 2020 at 9:13 AM:

> I don't know about addIndexes. Does that let you say which document goes
> where somehow? Wouldn't you have to select a subset of documents from each
> originally indexed segment?
>
> On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov <[email protected]> wrote:
>
>> I think the idea is to exert control over the distribution of documents
>> among the segments, in a deterministic reproducible way.
>>
>> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand <[email protected]> wrote:
>>
>>> Have you considered leveraging Lucene's built-in index sorting? It
>>> supports concurrent indexing and is quite fast.
>>>
>>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai <[email protected]> wrote:
>>>
>>>> Hi
>>>> Our team is seeking a way to construct (or rebuild) a deterministic
>>>> sorted index concurrently (I know Lucene can achieve that in a sequential
>>>> manner, but that might be too slow for us sometimes).
>>>> Currently we have roughly 2 ideas, both assuming there is a pre-built
>>>> index and that we have dumped a doc-segment map, so that IndexWriter can
>>>> be aware of which doc belongs to which segment:
>>>> 1. First build the index in the normal way (concurrently); after the index
>>>> is built, use the "addIndexes" functionality to merge documents into the
>>>> correct segment.
>>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>>> segment created (before merge) has only the documents that belong to one of
>>>> the segments in the pre-built index. And create a dedicated MergePolicy to
>>>> only merge segments belonging to one pre-built segment.
>>>>
>>>> Basically we think the first one is easier to implement and the second one
>>>> is faster. We would like some ideas, suggestions and feedback here.
>>>>
>>>> Thanks
>>>> Patrick Zhai
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
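
To make the "one worker per original segment" idea a bit more concrete, here is a rough sketch in plain Java. It deliberately uses no Lucene calls: sorted lists of doc ids stand in for segments, and the class name, method names, and the "_0"/"_1" segment names are all made up for illustration. It only shows the planning part, assuming a dumped doc-segment map: group the docs by target segment, build each group in its own task, and collect the results in a fixed segment order so the outcome does not depend on which task finishes first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: sorted lists of doc ids stand in for Lucene segments.
final class DeterministicRebuildSketch {

    // Groups doc ids by target segment name; TreeMap keeps segment names in a fixed order.
    static TreeMap<String, List<String>> groupBySegment(Map<String, String> docToSegment) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> e : docToSegment.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    // Builds each segment in its own task, then collects the results in segment-name
    // order, so the output is independent of task completion order.
    static List<List<String>> rebuild(Map<String, String> docToSegment, int threads)
            throws InterruptedException, ExecutionException {
        TreeMap<String, List<String>> groups = groupBySegment(docToSegment);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> docs : groups.values()) {
                Callable<List<String>> task = () -> {
                    List<String> segment = new ArrayList<>(docs);
                    Collections.sort(segment); // stand-in for the index sort
                    return segment;
                };
                futures.add(pool.submit(task));
            }
            List<List<String>> segments = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                segments.add(f.get()); // futures are read in submission (segment) order
            }
            return segments;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed doc-segment map dumped from the pre-built index.
        Map<String, String> docToSegment = Map.of(
                "doc3", "_0", "doc1", "_0", "doc2", "_1", "doc4", "_1");
        System.out.println(rebuild(docToSegment, 2));
    }
}
```

In a real implementation each task would presumably drive its own IndexWriter rather than sort a list, and the final step of combining the N results into one index is still the part I am unsure about.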
