> I am running Amazon Product Search's benchmarks to see if the change is
needle moving for us.

Results were flat to slightly positive (+0.94% redline QPS) on our workload.
Although we do have numeric range queries that would've improved, I suspect
the result is flat because our workload is heavily dominated by TermQueries
and their combinations with various clauses.

I'll try tweaking the query set to target queries with more Point hits
during the week and see what comes out.

Regards,
Gautam Worah.


On Sat, Mar 2, 2024 at 2:33 AM Anton Hägerstrand <an...@blunders.io> wrote:

> Thank you Gautam!
>
> > Yeah, it seems like luceneutil is not stressing the code path that
> ElasticSearch's benchmarks are?
>
> Yes, as far as I understand it - though it might just be that I don't
> understand luceneutil well enough. I believe that in order to see the
> performance diff, numerical range queries or numerical sorting would have to
> be involved - the more documents matched, the larger the difference. This is
> what the relevant benchmark operations from Elastic do.
>
> > So it seems like switching over from an iterative visit(int docID) call
> to a bulk visit(DocIdSetIterator iterator) gave us these gains? Cool!
>
> Yes, it seems like it, based on this benchmark.
>
> > I am running Amazon Product Search's benchmarks to see if the change is
> needle moving for us.
>
> Thank you, much appreciated!
>
> > Small suggestion on the blog...
>
> Thank you for the feedback! The post is definitely a bit confusing, I
> struggled with keeping it clear. I will try to make some edits to make it
> clearer what conclusions can be made after each section.
>
> /Anton
>
> On Sat, 2 Mar 2024 at 00:30, Gautam Worah <worah.gau...@gmail.com> wrote:
>
>> Hi Anton,
>>
>> It took me a while to get through the blog post, and I suspect I will
>> need to read through a couple more times to understand it fully.
>> Thanks for writing up something so detailed. I learnt a lot! (especially
>> about JVM method inlining).
>>
>> > I have not been able to reproduce the speedup with luceneutil - I
>> suspect that the default tasks in it would not trigger this code path that
>> much.
>>
>> Yeah, it seems like luceneutil is not stressing the code path that
>> ElasticSearch's benchmarks are?
>>
>> > I tried changing the DocIdsWriter::readInts32 (and readDelta16),
>> instead calling the IntersectVisitor with a DocIdSetIterator, to reduce the
>> number of virtual calls. In the benchmark setup by Elastic [2] I saw a
>> decrease of execution time of 35-45% for range queries and numerical
>> sorting with this patch applied.
>>
>> So it seems like switching over from an iterative visit(int docID) call
>> to a bulk visit(DocIdSetIterator iterator) gave us these gains? Cool!
>>
>> I am running Amazon Product Search's benchmarks to see if the change is
>> needle moving for us.
>>
>> Small suggestion on the blog: the JVM inlining and the ElasticSearch
>> short-circuiting/optimization causing a difference in performance could
>> have been a blog post of its own, part 1 maybe. I got confused when the
>> blog shifted from the performance differences between ElasticSearch and
>> OpenSearch to how you ended up improving Lucene.
>>
>> Regards,
>> Gautam Worah.
>>
>>
>> On Fri, Mar 1, 2024 at 2:42 AM Anton Hägerstrand <an...@blunders.io>
>> wrote:
>>
>>> Hi everyone, long time lurker here.
>>>
>>> I recently investigated Elasticsearch/OpenSearch performance in a blog
>>> post [1], and saw some interesting behavior of numerical range queries and
>>> numerical sorting with regards to inlining and virtual calls.
>>>
>>> In short, the DocIdsWriter::readInts method seems to get much slower if
>>> it is called with 3 or more implementations of IntersectVisitor during the
>>> JVM lifetime. I believe that this is due to IntersectVisitor.visit(docid)
>>> being heavily inlined with 2 or fewer IntersectVisitor implementations,
>>> while becoming a virtual call with 3 or more.
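
A minimal, self-contained Java sketch of the effect described above. The
Visitor interface and its three implementations are hypothetical stand-ins for
Lucene's IntersectVisitor, not actual Lucene code: with one or two receiver
types at a call site, HotSpot's inline cache can inline the call; once a third
type flows through the same site, it typically degrades to a megamorphic
virtual dispatch.

```java
// Hypothetical stand-in for Lucene's IntersectVisitor. After all three
// implementations have passed through drive()'s call site, the JIT can
// no longer inline v.visit() there (monomorphic -> bimorphic -> megamorphic).
interface Visitor {
    void visit(int docId);
}

public class MegamorphicSketch {
    static final class Counter implements Visitor {
        long count;
        public void visit(int docId) { count++; }
    }
    static final class Summer implements Visitor {
        long sum;
        public void visit(int docId) { sum += docId; }
    }
    static final class MaxTracker implements Visitor {
        int max = -1;
        public void visit(int docId) { max = Math.max(max, docId); }
    }

    // One shared call site for all visitor types.
    static void drive(Visitor v, int docs) {
        for (int d = 0; d < docs; d++) {
            v.visit(d);
        }
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        Summer s = new Summer();
        MaxTracker m = new MaxTracker();
        drive(c, 1000);
        drive(s, 1000);
        drive(m, 1000); // third receiver type: the call site goes megamorphic
        System.out.println(c.count + " " + s.sum + " " + m.max);
    }
}
```

This also illustrates the warm-up point below: a benchmark that only ever
exercises one visitor type measures the inlined fast path, not the state a
production JVM ends up in.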
>>>
>>> This leads to two interesting points wrt Lucene:
>>>
>>> 1) For benchmarks, warm-ups should not only trigger speedups from the
>>> JIT, but also put the JVM in a realistic production state. For the
>>> BKDPointTree, this means at least 3 implementations of the
>>> IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
>>> benchmarks?
>>> 2) I tried changing the DocIdsWriter::readInts32 (and readDelta16),
>>> instead calling the IntersectVisitor with a DocIdSetIterator, to reduce the
>>> number of virtual calls. In the benchmark setup by Elastic [2] I saw a
>>> decrease of execution time of 35-45% for range queries and numerical
>>> sorting with this patch applied. PR:
>>> https://github.com/apache/lucene/pull/13149
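
A simplified sketch of the shape of that change (the interfaces here are
hypothetical stand-ins, not Lucene's actual DocIdsWriter/DocIdSetIterator
API): instead of one potentially-megamorphic visit(int) call per matching
document, the caller hands over an iterator once, and the per-document loop
runs against a concrete type the JIT can optimize.

```java
import java.util.function.IntConsumer;

public class BulkVisitSketch {
    // Minimal stand-in for Lucene's DocIdSetIterator contract.
    interface DocIdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
    }

    static DocIdIterator fromArray(int[] docs) {
        return new DocIdIterator() {
            int i = 0;
            public int nextDoc() {
                return i < docs.length ? docs[i++] : NO_MORE_DOCS;
            }
        };
    }

    // Old style: one virtual call per matching doc.
    static void visitEach(int[] docs, IntConsumer visitor) {
        for (int d : docs) {
            visitor.accept(d); // potentially megamorphic call, per doc
        }
    }

    // New style: a single call hands over the whole block of doc ids.
    static long visitBulk(DocIdIterator it) {
        long sum = 0;
        for (int d = it.nextDoc(); d != DocIdIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            sum += d; // tight loop over a concrete iterator type
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docs = {1, 5, 9, 42};
        long[] acc = {0};
        visitEach(docs, d -> acc[0] += d);
        long bulk = visitBulk(fromArray(docs));
        System.out.println(acc[0] + " " + bulk); // both paths sum to 57
    }
}
```

Both paths compute the same result; the difference is how many virtual
dispatches sit on the hot path when millions of documents match.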
>>>
>>> I have not been able to reproduce the speedup with luceneutil - I suspect
>>> that the default tasks in it would not trigger this code path that much.
>>>
>>> If you want to understand more of my line of thinking, consider skimming
>>> through the blog post [1].
>>>
>>> [1] https://blunders.io/posts/es-benchmark-4-inlining
>>> [2] https://github.com/elastic/elasticsearch-opensearch-benchmark
>>>
>>> best regards,
>>> Anton H
>>>
>>
