wjp719 edited a comment on pull request #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677
> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.

Hi @jpountz, I refactored the code. Now, when the conditions are met, I use a BKD binary search to find the min/max docIDs and use them to build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary-searching over doc values to find the min/max docIDs. As we know, doc values can only advance forward, but a binary search may need to step back to read the value of the middle doc, so it may have to create a new **SortedNumericDocValues** instance and advance from the first doc many times, which costs extra CPU and I/O. I also added an **allDocExist** flag to **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that checks whether the doc exists (see the sketch at the end of this comment).

### benchmark result
I also compared the performance of this PR against the main branch:

#### dataset
I use two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs; the big one is our application log with 1.4 billion docs.

#### query
The query is a boolean query with a range query clause and a term query clause. For the small dataset, the query is:
```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```

#### result
1. With the es rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
```
| Metric                        | Task          | Baseline | Contender | Diff     | Unit  |
| Min Throughput                | 200s-in-range | 9.54473  | 13.0162   | 3.47149  | ops/s |
| Mean Throughput               | 200s-in-range | 9.58063  | 13.0482   | 3.46758  | ops/s |
| Median Throughput             | 200s-in-range | 9.5815   | 13.0526   | 3.47114  | ops/s |
| Max Throughput                | 200s-in-range | 9.61395  | 13.0712   | 3.45725  | ops/s |
| 50th percentile latency       | 200s-in-range | 40581.6  | 25504.9   | -15076.7 | ms    |
| 90th percentile latency       | 200s-in-range | 43334.5  | 27291.3   | -16043.2 | ms    |
| 99th percentile latency       | 200s-in-range | 43949.5  | 27681.4   | -16268.1 | ms    |
| 100th percentile latency      | 200s-in-range | 44016.2  | 27723.8   | -16292.4 | ms    |
| 50th percentile service time  | 200s-in-range | 98.6711  | 73.0836   | -25.5875 | ms    |
| 90th percentile service time  | 200s-in-range | 100.634  | 74.586    | -26.048  | ms    |
| 99th percentile service time  | 200s-in-range | 121.701  | 91.0001   | -30.7012 | ms    |
| 100th percentile service time | 200s-in-range | 127.223  | 120.735   | -6.48813 | ms    |
```
2. Manual single run (all caches cleared):
```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             | 267ms               | 167ms           | -38%                |
| our application log | 2829ms              | 1093ms          | -62%                |
```