gashutos commented on issue #12448:
URL: https://github.com/apache/lucene/issues/12448#issuecomment-1643604606
@jpountz for looking at this.
>Lazily heapifying sounds interesting, and thanks for sharing performance
numbers when data occurs in random order. Do you also have performance numbers
for the case when the index sort is the opposite order compared to the query
sort? I'm curious how much this optimization can save in that case since this
is what you're trying to optimize.
For this, I have inserted `36 million` documents with below schema, keeping
`@timestamp` long field in `asc` order almost.
```
{
"@timestamp": 898459201,
"clientip": "211.11.9.0",
"request": "GET /english/index.html HTTP/1.0",
"status": 304,
"size": 0
}
```
Below were the results, units are in `ms`, when I query `top(K)` in
descending order. where k = 5, 100, 1000....
before my changes,
top(K) | pq time | total time |
-- | -- | -- | --
5 hits | 4 | 58 |
100 hits | 70 | 130 |
1000 hits | 95 | 197 |
after my changes,
top(K) | pq time | total time |
-- | -- | -- | --
5 hits | 0 | 47 |
100 hits | 1 | 59 |
1000 hits | 2 | 101 |
The higher the value of `k` the higher heapifycation time is going to take.
>You might also be interested in checking out this
[paper](https://www.vldb.org/pvldb/vol15/p3472-yu.pdf) where Tencent describes
optimizations that they made for a similar problem in section 4.5.2: they
configure an index sort by ascending timestamp on their data, but still want to
be able to perform both queries by ascending timestamp and descending
timestamp. To handle the case when the index sort and the query sort are
opposite, they query on exponentially growing windows of documents that match
the end of the doc ID space.
Yeah, I have read this paper :) The main thing which I kind of felt
bottleneck there is they use `index sort` in timestamp field which leads higher
indexing latencies. That's a trade off there.
I like our current implementation without `index sort` and use BKD points to
skip non-competitive hits.
And if we add this `Leazy heapifycation` as proposed here, that will give
good advantage on current implementation itself.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]