[GitHub] [lucene] gashutos commented on issue #12448: [Performance] sort query improvement for sequential ordered data [e.g. timestamp field sort in log data]

via GitHub Thu, 20 Jul 2023 02:40:28 -0700


gashutos commented on issue #12448:
URL: https://github.com/apache/lucene/issues/12448#issuecomment-1643604606


   @jpountz for looking at this. 
   
   >Lazily heapifying sounds interesting, and thanks for sharing performance 
numbers when data occurs in random order. Do you also have performance numbers 
for the case when the index sort is the opposite order compared to the query 
sort? I'm curious how much this optimization can save in that case since this 
is what you're trying to optimize.
   
   For this, I have inserted `36 million` documents with below schema, keeping 
`@timestamp` long field in `asc` order almost.
   ```
   {
     "@timestamp": 898459201,
     "clientip": "211.11.9.0",
     "request": "GET /english/index.html HTTP/1.0",
     "status": 304,
     "size": 0
   }
   ```
   
   Below were the results, units are in `ms`, when I query `top(K)` in 
descending order. where k = 5, 100, 1000....
   
   before my changes,
   top(K)  | pq time | total time |  
   -- | -- | -- | --
   5 hits | 4 | 58 |  
   100 hits | 70 | 130 |  
   1000 hits | 95 | 197 |  
   
   after my changes,
   
   top(K)  | pq time | total time |  
   -- | -- | -- | --
   5 hits | 0 | 47 |  
   100 hits | 1 | 59 |  
   1000 hits | 2 | 101 |  
   
   
   The higher the value of `k` the higher heapifycation time is going to take.
   
   
   >You might also be interested in checking out this 
[paper](https://www.vldb.org/pvldb/vol15/p3472-yu.pdf) where Tencent describes 
optimizations that they made for a similar problem in section 4.5.2: they 
configure an index sort by ascending timestamp on their data, but still want to 
be able to perform both queries by ascending timestamp and descending 
timestamp. To handle the case when the index sort and the query sort are 
opposite, they query on exponentially growing windows of documents that match 
the end of the doc ID space.
   
   Yeah, I have read this paper :) The main thing which I kind of felt 
bottleneck there is they use `index sort` in timestamp field which leads higher 
indexing latencies. That's a trade off there.
   I like our current implementation without `index sort` and use BKD points to 
skip non-competitive hits.
   And if we add this `Leazy heapifycation` as proposed here, that will give 
good advantage on current implementation itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] gashutos commented on issue #12448: [Performance] sort query improvement for sequential ordered data [e.g. timestamp field sort in log data]

Reply via email to