[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425：speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

GitBox Sat, 26 Feb 2022 22:06:23 -0800


wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677



   > This looks very similar to the implementation of `Weight#count` on 
`PointRangeQuery` and should only perform marginally faster? It's unclear to me 
whether this PR buys us much.
   
   Hi @jpountz, I have refactored the code. Now, when the conditions are met, 
I use a BKD binary search to find the min/max doc IDs when creating the 
**Scorer**, and use them to build the 
**IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator**, instead 
of binary searching over doc values to find the min/max doc IDs. 
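
To illustrate the idea of finding the doc ID bounds with a random-access binary search, here is a minimal sketch. It assumes the segment is sorted ascending on the field and every document has exactly one value, so a document's position in value order equals its doc ID. That is an assumption about the "conditions" mentioned above, which the comment does not spell out. The plain `long[]` only stands in for the BKD tree, and `SortedRangeBounds`, `lowerBound` and `docRange` are hypothetical names, not Lucene APIs.

```java
public class SortedRangeBounds {

    /** Returns the index of the first element >= key, or values.length if there is none. */
    static int lowerBound(long[] values, long key) {
        int low = 0;
        int high = values.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (values[mid] < key) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }

    /** Returns {firstDoc, lastDocExclusive} for the docs whose value lies in [gte, lt). */
    static int[] docRange(long[] valuesByDoc, long gte, long lt) {
        return new int[] {lowerBound(valuesByDoc, gte), lowerBound(valuesByDoc, lt)};
    }

    public static void main(String[] args) {
        // Hypothetical per-doc values, ascending because of the index sort.
        long[] timestamps = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] range = docRange(timestamps, 25, 65);
        System.out.println("docs [" + range[0] + ", " + range[1] + ")"); // docs [2, 6)
    }
}
```

Each probe here is a direct read at `mid`, so nothing has to be re-created or re-advanced between probes.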
   
   As we know, doc values can only advance forward, but a binary search may 
need to step back to read the value of the middle doc. So each probe of a 
doc-values binary search has to create a new **SortedNumericDocValues** 
instance and advance from the first doc, which costs more CPU and I/O.
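
The cost described above can be modelled with a small sketch. `ForwardOnlyValues` is a hypothetical stand-in for a forward-only doc-values iterator, not Lucene's `SortedNumericDocValues`. The counter only shows how many fresh iterators a binary search needs when each probe may land before the previous one.

```java
public class ForwardOnlyBinarySearch {

    /** Hypothetical forward-only iterator over one value per doc (not Lucene's API). */
    static final class ForwardOnlyValues {
        private final long[] values; // ascending because of the index sort
        private int doc = -1;

        ForwardOnlyValues(long[] values) {
            this.values = values;
        }

        /** Moves to target and returns its value; moving backwards is not allowed. */
        long advanceAndGet(int target) {
            if (target <= doc) {
                throw new IllegalStateException("iterator can only move forward");
            }
            doc = target;
            return values[doc];
        }
    }

    static int iteratorsCreated = 0;

    /** Returns the first doc whose value is >= lower, or values.length if there is none. */
    static int firstDocAtOrAbove(long[] values, long lower) {
        int low = 0;
        int high = values.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            // The next probe may lie before this one, so every probe gets a fresh iterator.
            ForwardOnlyValues it = new ForwardOnlyValues(values);
            iteratorsCreated++;
            if (it.advanceAndGet(mid) < lower) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return low;
    }

    public static void main(String[] args) {
        long[] timestamps = {10, 20, 30, 40, 50, 60, 70, 80};
        int first = firstDocAtOrAbove(timestamps, 45);
        // prints: first doc = 4, iterators created = 3
        System.out.println("first doc = " + first + ", iterators created = " + iteratorsCreated);
    }
}
```

In this model a probe is a cheap array read. With real doc values, each fresh iterator also has to advance from the start of the segment to `mid`, which is where the extra CPU and I/O would come from.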
   
   I also added a variable **allDocExist** to **BoundedDocSetIdIterator** to 
record whether every doc between the min and max doc exists. When it is true, 
**BoundedDocSetIdIterator#advance()** skips the call to 
**delegate.advance()** that checks whether the doc exists.
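
The fast path can be sketched as follows. This is a simplified model, not the PR's actual class: `BoundedIterator`, the exclusive `lastDoc` bound, and the `IntUnaryOperator` used as the delegate are all assumptions made for illustration.

```java
import java.util.function.IntUnaryOperator;

public class BoundedIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical simplification of a bounded doc ID iterator. */
    static final class BoundedIterator {
        private final int firstDoc;            // inclusive lower bound
        private final int lastDoc;             // exclusive upper bound (assumed)
        private final boolean allDocExist;     // true if every doc in the range has a value
        private final IntUnaryOperator delegateAdvance; // first existing doc >= target
        int delegateCalls = 0;

        BoundedIterator(int firstDoc, int lastDoc, boolean allDocExist,
                        IntUnaryOperator delegateAdvance) {
            this.firstDoc = firstDoc;
            this.lastDoc = lastDoc;
            this.allDocExist = allDocExist;
            this.delegateAdvance = delegateAdvance;
        }

        int advance(int target) {
            if (target < firstDoc) {
                target = firstDoc;
            }
            int result;
            if (allDocExist) {
                // Every doc in the range exists, so the target itself is the answer.
                result = target;
            } else {
                delegateCalls++;
                result = delegateAdvance.applyAsInt(target);
            }
            return result < lastDoc ? result : NO_MORE_DOCS;
        }
    }

    public static void main(String[] args) {
        boolean[] exists = {true, true, true, true, false, true, true, true};
        IntUnaryOperator sparse = target -> {
            for (int d = target; d < exists.length; d++) {
                if (exists[d]) {
                    return d;
                }
            }
            return NO_MORE_DOCS;
        };
        BoundedIterator dense = new BoundedIterator(3, 7, true, sparse);
        BoundedIterator checked = new BoundedIterator(3, 7, false, sparse);
        System.out.println(dense.advance(0) + " with " + dense.delegateCalls + " delegate calls");
        System.out.println(checked.advance(4) + " with " + checked.delegateCalls + " delegate calls");
    }
}
```

The flag trades one boolean check for a delegate call on every `advance()`, which only pays off when the range really is dense.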
   
   ### benchmark result
   I also compared the performance of this PR against the main branch: 
   #### dataset
   I used two datasets. The small one is 
[httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with 
about 200 million docs.
   The big one is our application log with 1.4 billion docs.
   #### query
   The query is a boolean query with a range query clause and a term query 
clause. For the small dataset, the query is
   ```
   {
     "query": {
       "bool": {
         "must": [
           {
             "range": {
               "@timestamp": {
                 "gte": "1998-06-08T05:00:01Z",
                 "lt": "1998-06-15T00:00:00Z"
               }
             }
           },
           {
             "match": {
               "status": "200"
             }
           }
         ]
       }
     }
   }
   ```
   #### result
   1. With the ES Rally tool (it runs the query many times, so the data on disk is cached):
   ```
   |                   Metric |          Task | Baseline | Contender |     Diff |  Unit |
   |           Min Throughput | 200s-in-range |  9.92683 |   10.0551 |  0.12825 | ops/s |
   |          Mean Throughput | 200s-in-range |  9.94556 |   10.0642 |  0.11868 | ops/s |
   |        Median Throughput | 200s-in-range |  9.94556 |   10.0633 |   0.1177 | ops/s |
   |           Max Throughput | 200s-in-range |  9.96398 |   10.0737 |  0.10974 | ops/s |
   |  50th percentile latency | 200s-in-range |  38664.7 |   38022.7 | -641.967 |    ms |
   |  90th percentile latency | 200s-in-range |  41349.8 |     40704 | -645.858 |    ms |
   |  99th percentile latency | 200s-in-range |  41954.2 |   41308.7 | -645.491 |    ms |
   | 100th percentile latency | 200s-in-range |  42021.6 |   41377.6 | -643.989 |    ms |
   ```
   2. A single manual run with all caches cleared:
   ```
   |            dataSet|main branch latency|this pr latency|latency improvement|
   |            httpLog|              267ms|          167ms|               -38%|
   |our application log|             2829ms|         1093ms|               -62%|
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
