Hi Anton,
Thank you for the links and details! I've tried disabling inlining by
adding `-XX:-Inline` as you suggest in the blogpost, both baseline and
candidate became slower, but baseline is still faster! I think we can rule
out inlining? Results after 12 runs:
TaskQPS baseline StdDevQPS
my_modified_version StdDev Pct diff p-value
HighTerm 154.89 (0.8%) 143.58
(0.6%) -7.3% ( -8% - -5%) 0.000
MedTerm 140.22 (1.4%) 133.85
(1.0%) -4.5% ( -6% - -2%) 0.000
Just for the reference, I'm testing using luceneutil fork that only runs
two tasks HighTerm and MedTerm but more times, see this comment
https://github.com/apache/lucene/pull/13657#issuecomment-2293467656
Thanks,
Egor
On Fri, 16 Aug 2024 at 13:59, Anton Hägerstrand <[email protected]> wrote:
> I did a bunch of digging into hotspot performance for a blog post that I
> wrote recently - the tl;dr; is that the order of execution of hotspot code
> could have this level of effect, due to inlining decisions. The post
> resulted in https://github.com/apache/lucene/pull/13149 which made things
> significantly faster by forcing different inlining.
>
> The post might be worth reading:
> https://blunders.io/posts/es-benchmark-4-inlining (the post is a part of
> an ES vs. OS series but the conclusions are Lucene/Hotspot). For a much
> deeper dive into how inlining is done in the JVM, I recommend
> https://shipilev.net/blog/2015/black-magic-method-dispatch , a post
> written by a jdk engineer. The best tool that I found to debug inlining
> decisions was to simply set -XX:+UnlockDiagnosticVMOptions
> -XX:+PrintInlining to print the inlining decisions - prepare for large
> amounts of output.
>
> I'm not saying that this must be inlining/hotspot issues, but it sounds
> like it could be. If inlining decisions are the cause, it is quite possible
> that your change is not a performance regression - it could be that the
> benchmarks tested the code with more favourable compiled/inlined machine
> code than real world users would see.
>
> /anton
>
> On Fri, 16 Aug 2024 at 13:13, Michael Sokolov <[email protected]> wrote:
>
>> Maybe getSlices has some side effect that messes up create Weight?
>>
>> On Fri, Aug 16, 2024, 7:10 AM Michael Sokolov <[email protected]> wrote:
>>
>>> That is super weird. I wonder if changing the names of variables will
>>> make a difference. Have you verified that this effect is observable during
>>> all lunar phases?
>>>
>>> I assume we liked at any profiler do offs we could get our hands on? If
>>> not, maybe some for would show up there.
>>>
>>> On Thu, Aug 15, 2024, 7:22 PM Greg Miller <[email protected]> wrote:
>>>
>>>> Hi folks-
>>>>
>>>> Egor Potemkin and I have been digging into a baffling performance
>>>> regression we're seeing in response to a one-line change that doesn't
>>>> rationally seem like it should have any performance impact what-so-ever.
>>>> There's more background on why we're trying to understand this, but I'll
>>>> save the broader context for now and just focus on the confusing issue
>>>> we're trying to understand.
>>>>
>>>> Inside IndexSearcher, we've staged a change that initializes an
>>>> ArrayList of Collectors slightly earlier than what we do today (see:
>>>> https://github.com/apache/lucene/pull/13657/files). We end up with
>>>> code that looks like this (note the isolated line that's initializing
>>>> `collectors`):
>>>>
>>>> ```
>>>> public <C extends Collector, T> T search(Query query,
>>>> CollectorManager<C, T> collectorManager)
>>>> throws IOException {
>>>> final LeafSlice[] leafSlices = getSlices();
>>>> final C firstCollector = collectorManager.newCollector();
>>>> query = rewrite(query, firstCollector.scoreMode().needsScores());
>>>> final Weight weight = createWeight(query,
>>>> firstCollector.scoreMode(), 1);
>>>>
>>>> final List<C> collectors = new ArrayList<>(leafSlices.length);
>>>>
>>>> return search(weight, collectorManager, firstCollector, collectors,
>>>> leafSlices);
>>>> }
>>>> ```
>>>>
>>>> What's baffling is that if we initialize the `collectors` list _after_
>>>> the call to `createWeight` (as shown here), there's no performance impact
>>>> at all (as expected). But if all we do is initialize `collectors` _before_
>>>> the call to `createWeight`, we see a very significant regression on
>>>> LowTerm, MedTerm, HighTerm tasks in luceneutil (e.g., %15 - 30%). At the
>>>> other end, we see a significant improvement to OrHighNotLow, OrHighNotMed,
>>>> OrHighNotHigh (e.g., 7% - 15%). (This is running wikimedium10m on an
>>>> x86-based AWS ec2 host, but results reproduced separately for Egor and in
>>>> our nightly benchmark runs; full luceneutil output at the bottom of this
>>>> email [1]). Some additional context and conversation is captured in this
>>>> "demo" PR: https://github.com/apache/lucene/pull/13657.
>>>>
>>>> My only hunch here is this has something to do with hotspot's decision
>>>> making or some other such runtime optimization, but I'm getting out of my
>>>> depth and hoping someone in this community will have ideas on ways to
>>>> continue this investigation. Anyone have a clue what might be going on? Or
>>>> any suggestions on other things to look at? This isn't a purely academic
>>>> exercise for what it's worth. This oddity has caused us to duplicate some
>>>> code in IndexSearcher to work with a new sandbox faceting module, so it
>>>> would be nice to figure this out so we can remove the code duplication.
>>>> (The code duplication is pretty minor, but it's still really frustrating
>>>> and it's a trap waiting to be hit by someone in the future that tries to
>>>> consolidate the code duplication and runs into this)
>>>>
>>>> Thanks for reading, and thanks in advance for any ideas!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>>
>>>> [1] Full Lucene util output:
>>>> ```
>>>> TaskQPS baseline StdDevQPS
>>>> my_modified_version StdDev Pct diff p-value
>>>> MedTerm 513.21 (4.9%) 369.43
>>>> (4.8%) -28.0% ( -35% - -19%) 0.000
>>>> HighTerm 523.20 (6.9%) 402.11
>>>> (5.0%) -23.1% ( -32% - -12%) 0.000
>>>> LowTerm 837.70 (3.9%) 715.94
>>>> (3.9%) -14.5% ( -21% - -6%) 0.000
>>>> BrowseDayOfYearSSDVFacets 11.97 (18.9%) 11.31
>>>> (11.9%) -5.5% ( -30% - 31%) 0.273
>>>> MedTermDayTaxoFacets 23.03 (4.9%) 21.95
>>>> (6.4%) -4.7% ( -15% - 6%) 0.009
>>>> HighPhrase 143.93 (8.3%) 139.35
>>>> (4.7%) -3.2% ( -14% - 10%) 0.136
>>>> Fuzzy2 53.03 (9.0%) 51.50
>>>> (7.3%) -2.9% ( -17% - 14%) 0.265
>>>> MedSpanNear 50.70 (5.1%) 49.26
>>>> (3.0%) -2.8% ( -10% - 5%) 0.032
>>>> LowPhrase 70.38 (4.9%) 68.60
>>>> (5.3%) -2.5% ( -12% - 8%) 0.118
>>>> MedPhrase 88.15 (5.2%) 86.03
>>>> (4.2%) -2.4% ( -11% - 7%) 0.105
>>>> OrHighMedDayTaxoFacets 7.01 (5.5%) 6.86
>>>> (5.4%) -2.0% ( -12% - 9%) 0.237
>>>> HighSpanNear 28.95 (2.7%) 28.42
>>>> (2.9%) -1.8% ( -7% - 3%) 0.043
>>>> MedSloppyPhrase 201.71 (3.3%) 198.58
>>>> (3.1%) -1.6% ( -7% - 4%) 0.124
>>>> BrowseDateTaxoFacets 23.97 (28.7%) 23.62
>>>> (22.8%) -1.5% ( -41% - 70%) 0.858
>>>> AndHighMedDayTaxoFacets 32.81 (5.8%) 32.35
>>>> (7.1%) -1.4% ( -13% - 12%) 0.493
>>>> AndHighHighDayTaxoFacets 27.86 (6.1%) 27.50
>>>> (6.5%) -1.3% ( -13% - 12%) 0.507
>>>> LowSloppyPhrase 149.20 (2.9%) 147.50
>>>> (3.0%) -1.1% ( -6% - 4%) 0.227
>>>> HighTermTitleBDVSort 66.72 (6.6%) 66.04
>>>> (5.7%) -1.0% ( -12% - 12%) 0.604
>>>> AndHighHigh 187.45 (7.4%) 185.75
>>>> (6.7%) -0.9% ( -13% - 14%) 0.684
>>>> LowSpanNear 102.21 (2.1%) 101.50
>>>> (1.5%) -0.7% ( -4% - 3%) 0.242
>>>> OrHighHigh 218.06 (6.3%) 216.74
>>>> (4.1%) -0.6% ( -10% - 10%) 0.721
>>>> HighTermTitleSort 132.14 (1.5%) 131.93
>>>> (1.3%) -0.2% ( -2% - 2%) 0.724
>>>> HighSloppyPhrase 31.43 (5.4%) 31.39
>>>> (6.6%) -0.1% ( -11% - 12%) 0.949
>>>> BrowseRandomLabelSSDVFacets 7.91 (10.2%) 7.91
>>>> (11.2%) -0.0% ( -19% - 23%) 0.992
>>>> AndHighMed 288.24 (4.9%) 288.33
>>>> (4.0%) 0.0% ( -8% - 9%) 0.982
>>>> AndHighLow 1339.09 (3.2%) 1345.87
>>>> (4.8%) 0.5% ( -7% - 8%) 0.694
>>>> OrHighMed 473.22 (3.9%) 476.21
>>>> (3.8%) 0.6% ( -6% - 8%) 0.603
>>>> BrowseDayOfYearTaxoFacets 23.67 (28.7%) 23.82
>>>> (23.5%) 0.6% ( -40% - 74%) 0.938
>>>> HighTermDayOfYearSort 415.29 (5.2%) 418.26
>>>> (5.9%) 0.7% ( -9% - 12%) 0.684
>>>> BrowseDateSSDVFacets 2.14 (21.4%) 2.16
>>>> (22.4%) 1.0% ( -35% - 56%) 0.887
>>>> Wildcard 489.21 (4.3%) 494.69
>>>> (4.5%) 1.1% ( -7% - 10%) 0.420
>>>> TermDTSort 216.56 (5.9%) 219.04
>>>> (4.8%) 1.1% ( -8% - 12%) 0.499
>>>> PKLookup 139.24 (8.7%) 140.89
>>>> (10.8%) 1.2% ( -16% - 22%) 0.703
>>>> Fuzzy1 74.44 (9.7%) 75.42
>>>> (8.3%) 1.3% ( -15% - 21%) 0.643
>>>> Respell 48.52 (7.2%) 49.20
>>>> (6.6%) 1.4% ( -11% - 16%) 0.519
>>>> OrNotHighLow 1260.39 (3.0%) 1279.03
>>>> (2.7%) 1.5% ( -4% - 7%) 0.101
>>>> MedIntervalsOrdered 132.03 (9.2%) 134.25
>>>> (12.6%) 1.7% ( -18% - 25%) 0.630
>>>> BrowseMonthTaxoFacets 24.51 (26.9%) 25.02
>>>> (25.5%) 2.0% ( -39% - 74%) 0.804
>>>> HighTermMonthSort 1117.15 (4.1%) 1143.38
>>>> (4.6%) 2.3% ( -6% - 11%) 0.090
>>>> BrowseRandomLabelTaxoFacets 15.54 (25.0%) 15.93
>>>> (19.7%) 2.5% ( -33% - 62%) 0.724
>>>> Prefix3 667.73 (11.1%) 684.51
>>>> (11.1%) 2.5% ( -17% - 27%) 0.474
>>>> LowIntervalsOrdered 118.38 (14.5%) 121.55
>>>> (14.8%) 2.7% ( -23% - 37%) 0.564
>>>> HighIntervalsOrdered 30.52 (9.2%) 31.34
>>>> (7.0%) 2.7% ( -12% - 20%) 0.298
>>>> OrNotHighMed 365.66 (5.9%) 376.73
>>>> (6.1%) 3.0% ( -8% - 15%) 0.110
>>>> OrHighLow 586.67 (5.7%) 608.48
>>>> (5.6%) 3.7% ( -7% - 15%) 0.037
>>>> OrNotHighHigh 257.09 (5.8%) 267.66
>>>> (6.5%) 4.1% ( -7% - 17%) 0.034
>>>> BrowseMonthSSDVFacets 11.21 (9.1%) 11.69
>>>> (7.1%) 4.3% ( -11% - 22%) 0.100
>>>> OrHighNotLow 446.78 (8.7%) 479.82
>>>> (7.1%) 7.4% ( -7% - 25%) 0.003
>>>> OrHighNotMed 591.66 (7.6%) 649.35
>>>> (4.8%) 9.8% ( -2% - 23%) 0.000
>>>> IntNRQ 202.12 (17.5%) 224.77
>>>> (28.1%) 11.2% ( -29% - 68%) 0.130
>>>> OrHighNotHigh 339.78 (8.0%) 393.02
>>>> (6.7%) 15.7% ( 0% - 33%) 0.000
>>>> ```
>>>>
>>>