Hi everyone, long-time lurker here. I recently investigated Elasticsearch/OpenSearch performance in a blog post [1] and saw some interesting behavior in numerical range queries and numerical sorting with regard to inlining and virtual calls.
In short, the DocIdsWriter::readInts method seems to get much slower once it has been called with 3 or more implementations of IntersectVisitor during the JVM's lifetime. I believe this is because IntersectVisitor.visit(docid) is heavily inlined while there are 2 or fewer IntersectVisitor implementations, but becomes a megamorphic virtual call with 3 or more.

This leads to two interesting points with regard to Lucene:

1) Benchmark warm-ups should not only trigger JIT speedups, but also put the JVM into a realistic production state. For the BKD point tree, that means loading at least 3 implementations of IntersectVisitor before measuring. I'm not sure whether this is top of mind when writing Lucene benchmarks?

2) I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead call the IntersectVisitor with a DocIdSetIterator, to reduce the number of virtual calls. In the benchmark setup by Elastic [2], I saw a 35-45% decrease in execution time for range queries and numerical sorting with this patch applied. PR: https://github.com/apache/lucene/pull/13149

I have not been able to reproduce the speedup with luceneutil - I suspect that its default tasks do not exercise this code path much.

If you want to understand more of my line of thinking, consider skimming through the blog post [1].

[1] https://blunders.io/posts/es-benchmark-4-inlining
[2] https://github.com/elastic/elasticsearch-opensearch-benchmark

Best regards,
Anton H
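P.S. To make the dispatch effect concrete, here is a minimal, self-contained sketch of the pattern I mean. The interface and class names are illustrative stand-ins, not Lucene's actual API: one hot loop with a single virtual call site, which HotSpot can typically inline while it has profiled at most two receiver types, but which generally goes megamorphic (a true virtual dispatch per doc) once a third implementation is loaded.

```java
// Illustrative sketch (not Lucene's real classes): a visitor interface
// standing in for PointValues.IntersectVisitor's per-doc visit(int).
interface IntVisitor {
    void visit(int docId);
}

public class MegamorphicSketch {
    // Counts visited doc ids; stands in for collecting hits.
    static final class Counter implements IntVisitor {
        long count;
        public void visit(int docId) { count++; }
    }

    // Sums visited doc ids; a second receiver type at the call site.
    static final class Summer implements IntVisitor {
        long sum;
        public void visit(int docId) { sum += docId; }
    }

    // Tracks the max doc id; loading this third type is what would
    // typically flip the call site from bimorphic to megamorphic.
    static final class MaxTracker implements IntVisitor {
        int max = Integer.MIN_VALUE;
        public void visit(int docId) { max = Math.max(max, docId); }
    }

    // Stand-in for DocIdsWriter::readInts: one hot loop, one virtual
    // call site that every IntVisitor implementation flows through.
    static void readInts(int[] docIds, IntVisitor visitor) {
        for (int docId : docIds) {
            visitor.visit(docId);
        }
    }

    public static void main(String[] args) {
        int[] docIds = {1, 2, 3, 4, 5};
        Counter c = new Counter();
        Summer s = new Summer();
        MaxTracker m = new MaxTracker();
        readInts(docIds, c);
        readInts(docIds, s);
        readInts(docIds, m);
        System.out.println(c.count + " " + s.sum + " " + m.max); // prints: 5 15 5
    }
}
```

This is also why a warm-up that only ever exercises one or two visitor types measures a JVM state (fully inlined visit) that a production process with many query shapes will never be in.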