I think the idea is to control how documents are distributed among the segments, in a deterministic, reproducible way.

On Sat, Dec 19, 2020, 11:39 AM Adrien Grand <jpou...@gmail.com> wrote:

> Have you considered leveraging Lucene's built-in index sorting? It
> supports concurrent indexing and is quite fast.
>
> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai <zhai7...@gmail.com> wrote:
>
>> Hi
>> Our team is seeking a way of construct (or rebuild) a deterministic
>> sorted index concurrently (I know lucene could achieve that in a sequential
>> manner but that might be too slow for us sometimes)
>> Currently we have roughly 2 ideas, all assuming there's a pre-built index
>> and have dumped a doc-segment map so that IndexWriter would be able to be
>> aware of which doc belong to which segment:
>> 1. First build index in the normal way (concurrently), after the index is
>> built, using "addIndexes" functionality to merge documents into the correct
>> segment.
>> 2. By controlling FlushPolicy and other related classes, make sure each
>> segment created (before merge) has only the documents that belong to one of
>> the segments in the pre-built index. And create a dedicated MergePolicy to
>> only merge segments belonging to one pre-built segment.
>>
>> Basically we think first one is easier to implement and second one is
>> faster. Want to seek some ideas & suggestions & feedback here.
>>
>> Thanks
>> Patrick Zhai
>>
>
> --
> Adrien
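
To make the doc-segment map idea from the quoted message concrete, here is a rough sketch of the partitioning step only, in plain Java with no Lucene calls. The class name SegmentPartitioner, the method partition, and the use of string ids for documents and segments are all made up for illustration. How each group is then fed to an IndexWriter and flushed as one segment is left out, since that depends on the FlushPolicy/MergePolicy work described in idea 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: groups documents by the segment
// they belonged to in the pre-built index, using a dumped doc-segment map.
final class SegmentPartitioner {

    /**
     * Returns target segment name -> doc ids assigned to that segment.
     * Segments come back in sorted name order (TreeMap), and doc ids within
     * a segment keep the order they had in docIds, so the same inputs always
     * produce the same grouping.
     */
    static Map<String, List<String>> partition(List<String> docIds,
                                               Map<String, String> docToSegment) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String docId : docIds) {
            String segment = docToSegment.get(docId);
            if (segment == null) {
                // Assumption: every doc must appear in the dumped map.
                throw new IllegalArgumentException("No segment mapping for doc " + docId);
            }
            bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(docId);
        }
        return bySegment;
    }
}
```

Each resulting group could then be handed to its own indexing thread, so that no flushed segment mixes documents from two pre-built segments. Whether that is actually faster than the addIndexes route is something we would need to measure.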