Hi everyone, long-time lurker here.

I recently investigated Elasticsearch/OpenSearch performance in a blog post
[1], and saw some interesting behavior of numerical range queries and
numerical sorting with regard to inlining and virtual calls.

In short, the DocIdsWriter::readInts method seems to get much slower if it
is called with 3 or more implementations of IntersectVisitor during the JVM
lifetime. I believe this is because IntersectVisitor.visit(docID) gets
heavily inlined when 2 or fewer IntersectVisitor implementations are in use,
while it becomes a true virtual call with 3 or more.
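To make the effect concrete, here is a minimal standalone sketch (not Lucene code, and only a rough illustration; the class and visitor names are made up, the timings are naive, and a JMH harness with proper warm-up would be far more trustworthy) of how a hot interface call site can slow down once a third implementation shows up:

    // Illustrative only: a hot loop over an interface call, similar in shape
    // to readInts calling IntersectVisitor.visit(docID) per document.
    interface Visitor {
        void visit(int docId);
    }

    public class MegamorphicDemo {
        static final class CountingVisitor implements Visitor {
            long count;
            public void visit(int docId) { count++; }
        }
        static final class SummingVisitor implements Visitor {
            long sum;
            public void visit(int docId) { sum += docId; }
        }
        static final class MaxVisitor implements Visitor {
            int max;
            public void visit(int docId) { max = Math.max(max, docId); }
        }

        // The hot method: the visit call site is what may go megamorphic.
        static void drive(Visitor v, int docs) {
            for (int doc = 0; doc < docs; doc++) {
                v.visit(doc);
            }
        }

        public static void main(String[] args) {
            final int docs = 20_000_000;
            // Phase 1: only two implementations seen, the call can be inlined.
            time("1-2 impls seen", () -> {
                drive(new CountingVisitor(), docs);
                drive(new SummingVisitor(), docs);
            });
            // Phase 2: a third implementation is observed at the call site.
            drive(new MaxVisitor(), docs);
            time("3 impls seen  ", () -> {
                drive(new CountingVisitor(), docs);
                drive(new SummingVisitor(), docs);
            });
        }

        static void time(String label, Runnable r) {
            long start = System.nanoTime();
            r.run();
            System.out.printf("%s: %.1f ms%n", label, (System.nanoTime() - start) / 1e6);
        }
    }

On my understanding of HotSpot, the first phase can be compiled with the visit call inlined (monomorphic/bimorphic), and observing the third implementation typically forces a deoptimization and a recompile with a plain virtual dispatch.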

This leads to two interesting points with respect to Lucene:

1) For benchmarks, warm-ups should not only be done to trigger JIT
speedups, but also to put the JVM in a realistic production state. For the
BKDPointTree, this means exercising at least 3 implementations of
IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
benchmarks?
2) I tried changing DocIdsWriter::readInts32 (and readDelta16) to call the
IntersectVisitor with a DocIdSetIterator instead, to reduce the number of
virtual calls. In the benchmark setup by Elastic [2] I saw a 35-45% decrease
in execution time for range queries and numerical sorting with this patch
applied (a rough sketch of the idea follows below).
PR: https://github.com/apache/lucene/pull/13149
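Very roughly, the shape of the change is something like the following. This is a hedged sketch rather than the actual patch (the real DocIdsWriter decoding is more involved, and IntArrayDocIdSetIterator here is just an illustrative helper I made up); it relies on IntersectVisitor's existing visit(DocIdSetIterator) overload:

    import java.io.IOException;
    import org.apache.lucene.index.PointValues.IntersectVisitor;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.IndexInput;

    final class ReadInts32Sketch {

      // Baseline shape: one virtual visit(int) call per document.
      static void readInts32PerDoc(IndexInput in, int count, IntersectVisitor visitor)
          throws IOException {
        for (int i = 0; i < count; i++) {
          visitor.visit(in.readInt());
        }
      }

      // Sketch of the idea: buffer the doc IDs, then hand them to the visitor
      // through a single visit(DocIdSetIterator) call, so the per-doc loop runs
      // behind one call instead of `count` virtual calls.
      static void readInts32Buffered(IndexInput in, int count, IntersectVisitor visitor)
          throws IOException {
        int[] buffer = new int[count];
        for (int i = 0; i < count; i++) {
          buffer[i] = in.readInt();
        }
        visitor.visit(new IntArrayDocIdSetIterator(buffer, count));
      }

      // Minimal iterator over an int[] of doc IDs (illustrative helper).
      private static final class IntArrayDocIdSetIterator extends DocIdSetIterator {
        private final int[] docs;
        private final int length;
        private int i = -1;

        IntArrayDocIdSetIterator(int[] docs, int length) {
          this.docs = docs;
          this.length = length;
        }

        @Override
        public int docID() {
          return i < 0 ? -1 : i >= length ? NO_MORE_DOCS : docs[i];
        }

        @Override
        public int nextDoc() {
          return ++i >= length ? NO_MORE_DOCS : docs[i];
        }

        @Override
        public int advance(int target) throws IOException {
          return slowAdvance(target);
        }

        @Override
        public long cost() {
          return length;
        }
      }
    }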

I have not been able to reproduce the speedup with luceneutil - I suspect
that its default tasks do not exercise this code path enough.

If you want to understand more of my line of thinking, consider skimming
through the blog post [1].

[1] https://blunders.io/posts/es-benchmark-4-inlining
[2] https://github.com/elastic/elasticsearch-opensearch-benchmark

best regards,
Anton H
