[ https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss reopened LUCENE-8580:
---------------------------------

About time, eh?

> Make segment merging parallel in SegmentMerger
> ----------------------------------------------
>
>                 Key: LUCENE-8580
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8580
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not of any high priority.
> At the moment any merging from N segments into one will happen sequentially for each data structure involved in a segment (postings, norms, points, etc.). If the input segments are large, the CPU (and I/O) are mostly unused and the process takes a long time.
> Merging of these data structures is mostly independent of each other, so it'd be interesting to see if we can speed things up by allowing them to run concurrently. I investigated this on a 40GB index with 22 segments, force-merging this into 1 segment (of similar size). Quick and dirty patch attached.
> I see some improvement, although it's not by much; the largest component dominates everything else.
> Results from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time 950249 msec for 41922110 docs
> {code}
> Ideally, one would need to push forkjoin into individual subroutines so that, for example, postings utilize concurrency when merging (pulling blocks of terms concurrently from the input, calculating statistics, etc. and then pushing in an ordered fashion to the codec).
>
> [1] https://markmail.org/thread/dtejwq42qagykeac
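
For readers who want to see the shape of the idea, here is a minimal sketch of the task-level approach the issue describes: start one task per independent data structure, then wait for all of them. It is not the attached patch and uses no Lucene APIs. The class and method names (ParallelMergeSketch, runConcurrently, sequentialMillis, concurrentLowerBoundMillis, beforeTimes) are hypothetical, and the "merge" tasks are stand-ins that only report which thread ran them. The timing helpers use the per-component numbers from the "Before" log to show why the gain is limited: with one task per component, the wall time cannot drop below the slowest component.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Lucene.
final class ParallelMergeSketch {

    /** Starts every task at once, then waits for each; results are keyed by task name. */
    static <T> Map<String, T> runConcurrently(Map<String, Supplier<T>> tasks) {
        Map<String, CompletableFuture<T>> futures = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<T>> e : tasks.entrySet()) {
            // supplyAsync without an executor normally runs on ForkJoinPool.commonPool().
            futures.put(e.getKey(), CompletableFuture.supplyAsync(e.getValue()));
        }
        Map<String, T> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<T>> e : futures.entrySet()) {
            results.put(e.getKey(), e.getValue().join()); // blocks until that task is done
        }
        return results;
    }

    /** Sequential cost model: the sum of the per-component times. */
    static long sequentialMillis(Map<String, Long> times) {
        return times.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Best case with one task per component: the slowest component. */
    static long concurrentLowerBoundMillis(Map<String, Long> times) {
        return times.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    /** Per-component merge times (msec) copied from the "Before" log in the issue. */
    static Map<String, Long> beforeTimes() {
        Map<String, Long> t = new LinkedHashMap<>();
        t.put("stored fields", 347237L);
        t.put("norms", 6562L);
        t.put("postings", 755507L);
        t.put("doc values", 0L);
        t.put("points", 0L);
        return t;
    }

    public static void main(String[] args) {
        Map<String, Long> times = beforeTimes();

        // Stand-in tasks: each one only reports the thread it ran on.
        Map<String, Supplier<String>> tasks = new LinkedHashMap<>();
        for (String name : times.keySet()) {
            tasks.put(name, () -> name + " merged on " + Thread.currentThread().getName());
        }
        runConcurrently(tasks).values().forEach(System.out::println);

        System.out.println("sequential sum: " + sequentialMillis(times) + " msec");
        System.out.println("concurrent lower bound: " + concurrentLowerBoundMillis(times) + " msec");
    }
}
```

With the "Before" numbers, the five component times sum to 1109306 msec, while the best case for one task per component is 755507 msec, the postings time. The measured "After" run took 950249 msec in total because postings took longer (938668 msec) while running alongside the other merges. That is consistent with the remark that the largest component dominates, and with the suggestion to push concurrency into the postings merge itself.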