[ 
https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9932.
----------------------------------
    Fix Version/s: 8.9
       Resolution: Fixed

> Performance improvement for BKD index building
> ----------------------------------------------
>
>                 Key: LUCENE-9932
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9932
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 8.8.2
>            Reporter: neoremind
>            Priority: Critical
>             Fix For: 8.9
>
>         Attachments: benchmark_data.png, flame-graph.png, 
> refined-code-benchmark.png, refined-code-benchmark2.png
>
>          Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> In BKD index building, the input bytes must be sorted before calling BKD 
> writer related API. The sorting method leverages MSB Radix Sort algorithm, 
> and the comparing method takes both the bytes itself and the DocId, but in 
> real cases, DocIds are usually monotonically increasing. This could yield one 
> possible performance enhancer. I found this enhancement when I dig into one 
> performance issue in our system. Then I research on the possible solution.
> DocId is usually increased by one when building index in a thread-safe way, 
> by assuming such condition, the comparing method can eliminate the 
> unnecessary comparing input - DocId, only leave the bytes itself to compare. 
> In order to do so, MSB radix sorting and its fallback sorting method must be 
> *stable*, so that when elements are the same, the sorting method maintains 
> its original order when added, which makes DocId still monotonically 
> increasing. To make MSB Radix Sort stable, it needs a trivial update; to make 
> fallback sort table, use merge sort instead of quick sort. Meanwhile, there 
> should introduce a switch which is able to turn the stable option on or off.
> To validate how much performance could be gained. I make a benchmark taking 
> down only the time elapsed in _MutablePointsReaderUtils.sort_ stage.
> *Test environment:* 
>  MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 
> MHz DDR3
> *Java version:*
>  java version "1.8.0_161"
>  Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> *Testcase:*
>  bytesPerDim = [1, 2, 3, 4, 8, 16, 32]
>  dim = 1
>  doc num = 2,000,000
>  warm up 5 time, run 10 times to calculate average time used.
> *Result:*
>  
> ||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id 
> (master branch)||
> |1|30989.594 us|1151149.9 us|
> |2|313469.47 us|1115595.1 us|
> |3|844617.8 us|1465465.1 us|
> |4|1350946.8 us|1465465.1 us|
> |8|1344814.6 us|1458115.5 us|
> |16|1344516.6 us|1459849.6 us|
> |32|1386847.8 us|1583097.5 us|
> !benchmark_data.png|width=580,height=283!
> Result shows that, by disabling sort DocId, sorting runs 1.73x to 37x faster 
> when there are many duplicate bytes (bytesPerDim = 1 or 2 or 3). When data 
> cardinality is high (bytesPerDim >= 4, test cases will generate random bytes 
> which are more scatter, not likely to be duplicate), the performance does not 
> go backward, still a little better.
> In conclusion, in the end to end process for building BKD index, which relies 
> on BKDWriter for some data types, performance could be better by ignoring 
> DocId if they are already monotonically increasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to