[ https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525244#comment-17525244 ]

Vigya Sharma commented on LUCENE-8580:
--------------------------------------

I'm thinking of tackling this one data structure at a time, starting with 
postings.

Extending [~dweiss]'s patch, I was wondering if we could merge each {{field}} in 
parallel, as a separate fork-join task. Merging terms in parallel seems tricky 
because we want to retain their sorted order, but going one level deeper, we 
could parallelize merging the postings within a term: every {{ReaderSlice}} maps 
to a separate chunk of docIds in the target segment, so we could write the 
postings for a term from each subreader in parallel.
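
To make the per-field idea concrete, here is a minimal sketch (my own, not part 
of the attached patch) of fanning the work out as one fork-join task per field; 
{{mergeOneField}} is a hypothetical placeholder for whatever per-field 
terms/postings merge logic we end up with:

{code}
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.stream.Collectors;

/** Hypothetical sketch: one fork-join task per field, joined at the end. */
class PerFieldMergeSketch {

  void mergeFieldsInParallel(List<String> fieldNames, ForkJoinPool pool) {
    List<ForkJoinTask<?>> tasks = fieldNames.stream()
        .map(field -> pool.submit(() -> mergeOneField(field)))
        .collect(Collectors.toList());
    for (ForkJoinTask<?> task : tasks) {
      task.join(); // propagates any failure from the per-field task
    }
  }

  void mergeOneField(String field) {
    // Placeholder for the actual per-field terms/postings merge.
  }
}
{code}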

Starting with the fields part: would it work to simply spawn multiple 
{{FieldsConsumer.write(...)}} calls in parallel, each with a different subset of 
fields (maybe one field per task)?
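
As a rough illustration of the "subset of fields" part (again my own sketch, and 
the class name is made up), each task could wrap the merged {{Fields}} in a view 
that exposes only its own field and hand that to {{FieldsConsumer.write(...)}}. 
Whether each task would then also need its own consumer and output files is 
exactly the question below:

{code}
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;

/** Hypothetical sketch: a Fields view restricted to a single field. */
class SingleFieldFields extends Fields {
  private final Fields in;
  private final String field;

  SingleFieldFields(Fields in, String field) {
    this.in = in;
    this.field = field;
  }

  @Override
  public Iterator<String> iterator() {
    return Collections.singletonList(field).iterator();
  }

  @Override
  public Terms terms(String name) throws IOException {
    return field.equals(name) ? in.terms(name) : null;
  }

  @Override
  public int size() {
    return 1;
  }
}
{code}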

The terms index has an FST per field, and we already go field by field while 
writing terms and merging their postings, which makes this look viable. I'm not 
yet clear on whether it would require writing to multiple (tim/tip/tmd/other?) 
files, and whether that would be a problem: would we have to stitch them back 
together afterwards, which would drain away all the concurrency gains, or could 
we find a way to work with multiple files?

I'll dig further into how these files get written. I wanted to check with the 
community for pointers, whether I'm on the right track here, and whether there 
are obvious wrinkles I should be looking at.

> Make segment merging parallel in SegmentMerger
> ----------------------------------------------
>
>                 Key: LUCENE-8580
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8580
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not 
> of any high priority.
> At the moment any merging from N segments into one will happen sequentially 
> for each data structure involved in a segment (postings, norms, points, 
> etc.). If the input segments are large, the CPU (and I/O) are mostly unused 
> and the process takes a long time. 
> Merging of these data structures is mostly independent of each other, so it'd 
> be interesting to see if we can speed things up by allowing them to run 
> concurrently. I investigated this on a 40GB index with 22 segments, 
> force-merging this into 1 segment (of similar size). Quick and dirty patch 
> attached.
> I see some improvement, although it's not by much; the largest component 
> dominates everything else.
> Results from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time 950249 msec for 41922110 docs
> {code}
> Ideally, one would need to push forkjoin into individual subroutines so that, 
> for example, postings utilize concurrency when merging (pulling blocks of 
> terms concurrently from the input, calculating statistics, etc. and then 
> pushing in an ordered fashion to the codec). 
> [1] https://markmail.org/thread/dtejwq42qagykeac


