Parallelizing segment merges would be nice. When indexing a dataset into a
single segment, it is not rare for the final merge down to one segment to take
longer than indexing itself, just because merging can only use one thread. It's
frustrating to wait for this merge to finish with only one busy core. :-)

Apart from the dependency of postings on norms in order to write impacts that
Mike S mentioned, I can't think of a reason why segment parts can't be
merged in parallel.
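
To make the ordering constraint concrete, here is a minimal sketch of how the
parts could be scheduled. It is not Lucene code: the class, the method, and the
part names are hypothetical stand-ins, and each "merge" only records its name.
The one dependency it models is that postings start after norms have finished.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: none of these names are Lucene APIs.
class ParallelPartsSketch {

  // Runs the stand-in merge steps on a pool and returns their completion order.
  static List<String> mergeParts(int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<String> completed = new CopyOnWriteArrayList<>();
    try {
      // Parts assumed to be independent are submitted right away.
      CompletableFuture<Void> storedFields =
          CompletableFuture.runAsync(() -> completed.add("storedFields"), pool);
      CompletableFuture<Void> docValues =
          CompletableFuture.runAsync(() -> completed.add("docValues"), pool);
      CompletableFuture<Void> points =
          CompletableFuture.runAsync(() -> completed.add("points"), pool);
      CompletableFuture<Void> norms =
          CompletableFuture.runAsync(() -> completed.add("norms"), pool);

      // Postings need norms to write impacts, so they are chained after norms.
      CompletableFuture<Void> postings =
          norms.thenRunAsync(() -> completed.add("postings"), pool);

      // Wait for every part before the segment would be considered merged.
      CompletableFuture.allOf(storedFields, docValues, points, norms, postings).join();
    } finally {
      pool.shutdown();
    }
    return List.copyOf(completed);
  }

  public static void main(String[] args) {
    System.out.println(mergeParts(4));
  }
}
```

The order of the independent parts varies from run to run; only "norms before
postings" is guaranteed by the chaining.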

For full-text collections, I believe that the bottleneck is usually
terms+postings, so merging segment parts in parallel might not save much time.
Maybe we could also parallelize on a per-field basis by writing to temporary
files and then copying the raw data to the target segment part. For instance,
for the Wikipedia dataset we use for nightly benchmarks, maybe the inverted
indexes for 'title' and 'body' could be merged in parallel this way.
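
A rough sketch of that per-field idea, using only java.nio and a thread pool.
Everything here is an assumption for illustration: the class and method names
are made up, the "merge" just writes placeholder bytes, and a real codec would
also have to fix up file pointers after the copy.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are Lucene APIs.
class PerFieldMergeSketch {

  // Stand-in for merging one field's inverted index into its own temporary file.
  static Path mergeFieldToTempFile(String field) {
    try {
      Path tmp = Files.createTempFile("merge-" + field + "-", ".tmp");
      Files.write(tmp, ("[" + field + "]").getBytes(StandardCharsets.UTF_8));
      return tmp;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Merges all fields concurrently, then copies the raw bytes into the target in field order.
  static void mergeFields(List<String> fields, Path target) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, fields.size()));
    try {
      List<Future<Path>> pending = new ArrayList<>();
      for (String field : fields) {
        Callable<Path> task = () -> mergeFieldToTempFile(field);
        pending.add(pool.submit(task));
      }
      try (OutputStream out = Files.newOutputStream(target)) {
        for (Future<Path> future : pending) {
          Path tmp = future.get(); // blocks until this field's merge is done
          Files.copy(tmp, out);    // sequential raw copy into the target file
          Files.delete(tmp);
        }
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    Path target = Files.createTempFile("segment-", ".bin");
    mergeFields(List.of("title", "body"), target);
    System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
  }
}
```

The final copy is still single-threaded, so this would only pay off if the
per-field merge work dominates the cost of copying raw bytes.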


On Mon, Jan 25, 2021 at 10:46 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:

>
> Thanks for early feedback.
>
> I freely admit I never had to touch codecs so I'm not sure what ordering
> dependencies need to be respected. But it's certainly something I'd like to
> look into since that "last" segment merge can now take ~10 minutes on
> mostly idle CPU (64 cores, remember...) and I/O. Worth a shot to improve
> this.
>
> Dawid
>
> On Mon, Jan 25, 2021 at 10:39 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> At least in theory, since the segmentWriteState is shared among these
>> phases, there could be dependencies, but it seems as if it ought to be
>> limited to making sure that the FieldInfos are written last? This is
>> pure speculation, I haven't dug deeply in the code. However, it would
>> be necessary to have some kind of synchronization on updates to that
>> state if these were to be run concurrently. If we do this, should we
>> also handle the various steps in IndexingChain.flush concurrently? I
>> guess the mechanism for providing threads to do so might be
>> different. At least in this case, there do seem to be *some*
>> dependencies, like between norms and terms?
>>
>> On Mon, Jan 25, 2021 at 1:58 PM David Smiley <dsmi...@apache.org> wrote:
>> >
>> > I suppose we should add a CallerRunsMergeScheduler (a new superclass of
>> > SerialMergeScheduler)?  Or make this aspect of SMS configurable.  We might
>> > use a semaphore to control how many callers can merge at once (1 == SMS of
>> > today, larger for expanded).  It might be debatable if it is then "serial"
>> > or not.
>> >
>> > I do think it'd be possible to merge parts of a segment at once!
>> > That'd be a cool feature to add.
>> >
>> > ~ David Smiley
>> > Apache Lucene/Solr Search Developer
>> > http://www.linkedin.com/in/davidwsmiley
>> >
>> >
>> > On Mon, Jan 25, 2021 at 11:05 AM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>
>> >> It makes sense to me. I don't have the full picture, but I did just
>> >> implement merging for vector format, and that at least, could be done
>> >> fully concurrent with other formats. I expect the same is true of
>> >> DocValues, Terms, etc. I'm not sure about the different kinds of
>> >> DocValues - they might want to be done together?
>> >>
>> >> On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>> >> >
>> >> >
>> >> > Hey everyone,
>> >> >
>> >> > I'm trying to cut the total wall-time of indexing for some fairly
>> >> > large document collections on machines with a high CPU count (> 32
>> >> > indexing threads). So far my observations are:
>> >> >
>> >> > 1) I gave up on using the concurrent merge scheduler in favor of
>> >> > "same thread" merging. This means the indexing thread that encounters
>> >> > a merge just does it. The CMS is designed to favor concurrent searches
>> >> > over indexing and it really didn't do anything I needed - in fact, I
>> >> > had to disable most things it offers. I/O throttling and thread
>> >> > stalling are not really practical on fast I/O in the absence of
>> >> > concurrent searches - you can literally just use as many merge threads
>> >> > as needed to saturate the I/O.
>> >> >
>> >> > 2) It is quite frequent that everything is churning nicely until the
>> >> > last few merges combine huge smaller segments and form a "long-tail"
>> >> > where most cores are just idle... Here comes my question - can we
>> >> > execute the individual "parts" involved in segment merging (the logic
>> >> > inside SegmentMerger) in separate threads? On the surface it looks
>> >> > like these steps can be done independently (even if they're executed
>> >> > sequentially at the moment) but perhaps I'm missing something?
>> >> >
>> >> > I'd like to ask before I try to tinker with it. Thanks for any
>> >> > feedback.
>> >> >
>> >> > Dawid
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
