gf2121 opened a new pull request, #12324: URL: https://github.com/apache/lucene/pull/12324
Today `Sparse#AdvanceExactWithinBlock` always needs to read the next doc and seek back if the target doc does not exist. This can hurt performance for dense-hit queries. For example, suppose a field exists only in docs 1 and 5. Each call to `advanceExact` with 2, 3, or 4 has to read the next doc (5) and seek back. I think caching the next existing doc in the block can help dense-hit queries without too much harm to other cases.

I ran a benchmark with `MatchAllDocsQuery` on some fields with different sparsity:

> sparsity=n means the field only exists when `doc % n == 0`

| sparsity | baseline(ms) | candidate(ms) | diff |
| -- | -- | -- | -- |
| 32 | 255 | 112 | -56.08% |
| 64 | 260 | 95 | -63.46% |
| 128 | 264 | 94 | -64.39% |
| 256 | 262 | 93 | -64.50% |
| 512 | 260 | 91 | -65.00% |
| 1024 | 259 | 90 | -65.25% |
| 2048 | 258 | 90 | -65.12% |
| 4096 | 253 | 90 | -64.43% |
| 8192 | 243 | 90 | -62.96% |
| 16384 | 224 | 90 | -59.82% |
| 32768 | 184 | 90 | -51.09% |
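To illustrate the caching idea, here is a minimal sketch. It is not the code in this PR: `SparseBlockSketch`, its sorted `int[]` of doc IDs, and the `reads` counter are hypothetical stand-ins for the on-disk sparse block and its input, and the sketch assumes targets arrive in non-decreasing order and are smaller than `Integer.MAX_VALUE`.

```java
/** Hypothetical stand-in for a sparse block: sorted doc IDs read through a forward-only cursor. */
final class SparseBlockSketch {
    private final int[] docs;       // sorted doc IDs that have a value (assumed layout)
    private int index = 0;          // position of the next unread entry
    private int nextExistDoc = -1;  // last doc ID read ahead; -1 means nothing cached yet
    int reads = 0;                  // number of simulated reads, for illustration only

    SparseBlockSketch(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    /** Returns true if {@code target} has a value. Targets must be non-decreasing. */
    boolean advanceExact(int target) {
        // Fast path: the cached doc already answers the question, no read needed.
        if (target < nextExistDoc) {
            return false;
        }
        if (target == nextExistDoc) {
            return true;
        }
        // Slow path: read forward until we reach a doc >= target, then cache it.
        while (index < docs.length) {
            int doc = docs[index++];
            reads++;
            if (doc >= target) {
                nextExistDoc = doc;
                return doc == target;
            }
        }
        // Block exhausted: cache a sentinel so every later target misses without reading.
        nextExistDoc = Integer.MAX_VALUE;
        return false;
    }
}
```

With docs 1 and 5, probing 2, 3, and 4 in this sketch reads doc 5 only once; the later probes are answered from the cached value instead of reading and seeking back each time.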
