Re: Baffling performance regression measured by luceneutil

Egor Potemkin Fri, 16 Aug 2024 06:57:34 -0700

Hi Anton,

Thank you for the links and details! I've tried disabling inlining by
adding `-XX:-Inline` as you suggest in the blogpost, both baseline and
candidate became slower, but baseline is still faster! I think we can rule
out inlining? Results after 12 runs:


                            TaskQPS baseline      StdDevQPS
my_modified_version      StdDev                Pct diff p-value
                        HighTerm      154.89      (0.8%)      143.58
 (0.6%)   -7.3% (  -8% -   -5%) 0.000
                         MedTerm      140.22      (1.4%)      133.85
 (1.0%)   -4.5% (  -6% -   -2%) 0.000

Just for the reference, I'm testing using luceneutil fork that only runs
two tasks HighTerm and MedTerm but more times, see this comment
https://github.com/apache/lucene/pull/13657#issuecomment-2293467656

Thanks,
Egor

On Fri, 16 Aug 2024 at 13:59, Anton Hägerstrand <[email protected]> wrote:

> I did a bunch of digging into hotspot performance for a blog post that I
> wrote recently - the tl;dr; is that the order of execution of hotspot code
> could have this level of effect, due to inlining decisions. The post
> resulted in https://github.com/apache/lucene/pull/13149 which made things
> significantly faster by forcing different inlining.
>
> The post might be worth reading:
> https://blunders.io/posts/es-benchmark-4-inlining (the post is a part of
> an ES vs. OS series but the conclusions are Lucene/Hotspot). For a much
> deeper dive into how inlining is done in the JVM, I recommend
> https://shipilev.net/blog/2015/black-magic-method-dispatch , a post
> written by a jdk engineer. The best tool that I found to debug inlining
> decisions was to simply set  -XX:+UnlockDiagnosticVMOptions
> -XX:+PrintInlining to print the inlining decisions - prepare for large
> amounts of output.
>
> I'm not saying that this must be inlining/hotspot issues, but it sounds
> like it could be. If inlining decisions are the cause, it is quite possible
> that your change is not a performance regression - it could be that the
> benchmarks tested the code with more favourable compiled/inlined machine
> code than real world users would see.
>
> /anton
>
> On Fri, 16 Aug 2024 at 13:13, Michael Sokolov <[email protected]> wrote:
>
>> Maybe getSlices has some side effect that messes up create Weight?
>>
>> On Fri, Aug 16, 2024, 7:10 AM Michael Sokolov <[email protected]> wrote:
>>
>>> That is super weird. I wonder if changing the names of variables will
>>> make a difference. Have you verified that this effect is observable during
>>> all lunar phases?
>>>
>>> I assume we liked at any profiler do offs we could get our hands on? If
>>> not, maybe some for would show up there.
>>>
>>> On Thu, Aug 15, 2024, 7:22 PM Greg Miller <[email protected]> wrote:
>>>
>>>> Hi folks-
>>>>
>>>> Egor Potemkin and I have been digging into a baffling performance
>>>> regression we're seeing in response to a one-line change that doesn't
>>>> rationally seem like it should have any performance impact what-so-ever.
>>>> There's more background on why we're trying to understand this, but I'll
>>>> save the broader context for now and just focus on the confusing issue
>>>> we're trying to understand.
>>>>
>>>> Inside IndexSearcher, we've staged a change that initializes an
>>>> ArrayList of Collectors slightly earlier than what we do today (see:
>>>> https://github.com/apache/lucene/pull/13657/files). We end up with
>>>> code that looks like this (note the isolated line that's initializing
>>>> `collectors`):
>>>>
>>>> ```
>>>>   public <C extends Collector, T> T search(Query query,
>>>> CollectorManager<C, T> collectorManager)
>>>>       throws IOException {
>>>>     final LeafSlice[] leafSlices = getSlices();
>>>>     final C firstCollector = collectorManager.newCollector();
>>>>     query = rewrite(query, firstCollector.scoreMode().needsScores());
>>>>     final Weight weight = createWeight(query,
>>>> firstCollector.scoreMode(), 1);
>>>>
>>>>     final List<C> collectors = new ArrayList<>(leafSlices.length);
>>>>
>>>>     return search(weight, collectorManager, firstCollector, collectors,
>>>> leafSlices);
>>>>   }
>>>> ```
>>>>
>>>> What's baffling is that if we initialize the `collectors` list _after_
>>>> the call to `createWeight` (as shown here), there's no performance impact
>>>> at all (as expected). But if all we do is initialize `collectors` _before_
>>>> the call to `createWeight`, we see a very significant regression on
>>>> LowTerm, MedTerm, HighTerm tasks in luceneutil (e.g., %15 - 30%). At the
>>>> other end, we see a significant improvement to OrHighNotLow, OrHighNotMed,
>>>> OrHighNotHigh (e.g., 7% - 15%). (This is running wikimedium10m on an
>>>> x86-based AWS ec2 host, but results reproduced separately for Egor and in
>>>> our nightly benchmark runs; full luceneutil output at the bottom of this
>>>> email [1]). Some additional context and conversation is captured in this
>>>> "demo" PR: https://github.com/apache/lucene/pull/13657.
>>>>
>>>> My only hunch here is this has something to do with hotspot's decision
>>>> making or some other such runtime optimization, but I'm getting out of my
>>>> depth and hoping someone in this community will have ideas on ways to
>>>> continue this investigation. Anyone have a clue what might be going on? Or
>>>> any suggestions on other things to look at? This isn't a purely academic
>>>> exercise for what it's worth. This oddity has caused us to duplicate some
>>>> code in IndexSearcher to work with a new sandbox faceting module, so it
>>>> would be nice to figure this out so we can remove the code duplication.
>>>> (The code duplication is pretty minor, but it's still really frustrating
>>>> and it's a trap waiting to be hit by someone in the future that tries to
>>>> consolidate the code duplication and runs into this)
>>>>
>>>> Thanks for reading, and thanks in advance for any ideas!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>>
>>>> [1] Full Lucene util output:
>>>> ```
>>>>                             TaskQPS baseline      StdDevQPS
>>>> my_modified_version      StdDev                Pct diff p-value
>>>>                          MedTerm      513.21      (4.9%)      369.43
>>>>    (4.8%)  -28.0% ( -35% -  -19%) 0.000
>>>>                         HighTerm      523.20      (6.9%)      402.11
>>>>    (5.0%)  -23.1% ( -32% -  -12%) 0.000
>>>>                          LowTerm      837.70      (3.9%)      715.94
>>>>    (3.9%)  -14.5% ( -21% -   -6%) 0.000
>>>>        BrowseDayOfYearSSDVFacets       11.97     (18.9%)       11.31
>>>>   (11.9%)   -5.5% ( -30% -   31%) 0.273
>>>>             MedTermDayTaxoFacets       23.03      (4.9%)       21.95
>>>>    (6.4%)   -4.7% ( -15% -    6%) 0.009
>>>>                       HighPhrase      143.93      (8.3%)      139.35
>>>>    (4.7%)   -3.2% ( -14% -   10%) 0.136
>>>>                           Fuzzy2       53.03      (9.0%)       51.50
>>>>    (7.3%)   -2.9% ( -17% -   14%) 0.265
>>>>                      MedSpanNear       50.70      (5.1%)       49.26
>>>>    (3.0%)   -2.8% ( -10% -    5%) 0.032
>>>>                        LowPhrase       70.38      (4.9%)       68.60
>>>>    (5.3%)   -2.5% ( -12% -    8%) 0.118
>>>>                        MedPhrase       88.15      (5.2%)       86.03
>>>>    (4.2%)   -2.4% ( -11% -    7%) 0.105
>>>>           OrHighMedDayTaxoFacets        7.01      (5.5%)        6.86
>>>>    (5.4%)   -2.0% ( -12% -    9%) 0.237
>>>>                     HighSpanNear       28.95      (2.7%)       28.42
>>>>    (2.9%)   -1.8% (  -7% -    3%) 0.043
>>>>                  MedSloppyPhrase      201.71      (3.3%)      198.58
>>>>    (3.1%)   -1.6% (  -7% -    4%) 0.124
>>>>             BrowseDateTaxoFacets       23.97     (28.7%)       23.62
>>>>   (22.8%)   -1.5% ( -41% -   70%) 0.858
>>>>          AndHighMedDayTaxoFacets       32.81      (5.8%)       32.35
>>>>    (7.1%)   -1.4% ( -13% -   12%) 0.493
>>>>         AndHighHighDayTaxoFacets       27.86      (6.1%)       27.50
>>>>    (6.5%)   -1.3% ( -13% -   12%) 0.507
>>>>                  LowSloppyPhrase      149.20      (2.9%)      147.50
>>>>    (3.0%)   -1.1% (  -6% -    4%) 0.227
>>>>             HighTermTitleBDVSort       66.72      (6.6%)       66.04
>>>>    (5.7%)   -1.0% ( -12% -   12%) 0.604
>>>>                      AndHighHigh      187.45      (7.4%)      185.75
>>>>    (6.7%)   -0.9% ( -13% -   14%) 0.684
>>>>                      LowSpanNear      102.21      (2.1%)      101.50
>>>>    (1.5%)   -0.7% (  -4% -    3%) 0.242
>>>>                       OrHighHigh      218.06      (6.3%)      216.74
>>>>    (4.1%)   -0.6% ( -10% -   10%) 0.721
>>>>                HighTermTitleSort      132.14      (1.5%)      131.93
>>>>    (1.3%)   -0.2% (  -2% -    2%) 0.724
>>>>                 HighSloppyPhrase       31.43      (5.4%)       31.39
>>>>    (6.6%)   -0.1% ( -11% -   12%) 0.949
>>>>      BrowseRandomLabelSSDVFacets        7.91     (10.2%)        7.91
>>>>   (11.2%)   -0.0% ( -19% -   23%) 0.992
>>>>                       AndHighMed      288.24      (4.9%)      288.33
>>>>    (4.0%)    0.0% (  -8% -    9%) 0.982
>>>>                       AndHighLow     1339.09      (3.2%)     1345.87
>>>>    (4.8%)    0.5% (  -7% -    8%) 0.694
>>>>                        OrHighMed      473.22      (3.9%)      476.21
>>>>    (3.8%)    0.6% (  -6% -    8%) 0.603
>>>>        BrowseDayOfYearTaxoFacets       23.67     (28.7%)       23.82
>>>>   (23.5%)    0.6% ( -40% -   74%) 0.938
>>>>            HighTermDayOfYearSort      415.29      (5.2%)      418.26
>>>>    (5.9%)    0.7% (  -9% -   12%) 0.684
>>>>             BrowseDateSSDVFacets        2.14     (21.4%)        2.16
>>>>   (22.4%)    1.0% ( -35% -   56%) 0.887
>>>>                         Wildcard      489.21      (4.3%)      494.69
>>>>    (4.5%)    1.1% (  -7% -   10%) 0.420
>>>>                       TermDTSort      216.56      (5.9%)      219.04
>>>>    (4.8%)    1.1% (  -8% -   12%) 0.499
>>>>                         PKLookup      139.24      (8.7%)      140.89
>>>>   (10.8%)    1.2% ( -16% -   22%) 0.703
>>>>                           Fuzzy1       74.44      (9.7%)       75.42
>>>>    (8.3%)    1.3% ( -15% -   21%) 0.643
>>>>                          Respell       48.52      (7.2%)       49.20
>>>>    (6.6%)    1.4% ( -11% -   16%) 0.519
>>>>                     OrNotHighLow     1260.39      (3.0%)     1279.03
>>>>    (2.7%)    1.5% (  -4% -    7%) 0.101
>>>>              MedIntervalsOrdered      132.03      (9.2%)      134.25
>>>>   (12.6%)    1.7% ( -18% -   25%) 0.630
>>>>            BrowseMonthTaxoFacets       24.51     (26.9%)       25.02
>>>>   (25.5%)    2.0% ( -39% -   74%) 0.804
>>>>                HighTermMonthSort     1117.15      (4.1%)     1143.38
>>>>    (4.6%)    2.3% (  -6% -   11%) 0.090
>>>>      BrowseRandomLabelTaxoFacets       15.54     (25.0%)       15.93
>>>>   (19.7%)    2.5% ( -33% -   62%) 0.724
>>>>                          Prefix3      667.73     (11.1%)      684.51
>>>>   (11.1%)    2.5% ( -17% -   27%) 0.474
>>>>              LowIntervalsOrdered      118.38     (14.5%)      121.55
>>>>   (14.8%)    2.7% ( -23% -   37%) 0.564
>>>>             HighIntervalsOrdered       30.52      (9.2%)       31.34
>>>>    (7.0%)    2.7% ( -12% -   20%) 0.298
>>>>                     OrNotHighMed      365.66      (5.9%)      376.73
>>>>    (6.1%)    3.0% (  -8% -   15%) 0.110
>>>>                        OrHighLow      586.67      (5.7%)      608.48
>>>>    (5.6%)    3.7% (  -7% -   15%) 0.037
>>>>                    OrNotHighHigh      257.09      (5.8%)      267.66
>>>>    (6.5%)    4.1% (  -7% -   17%) 0.034
>>>>            BrowseMonthSSDVFacets       11.21      (9.1%)       11.69
>>>>    (7.1%)    4.3% ( -11% -   22%) 0.100
>>>>                     OrHighNotLow      446.78      (8.7%)      479.82
>>>>    (7.1%)    7.4% (  -7% -   25%) 0.003
>>>>                     OrHighNotMed      591.66      (7.6%)      649.35
>>>>    (4.8%)    9.8% (  -2% -   23%) 0.000
>>>>                           IntNRQ      202.12     (17.5%)      224.77
>>>>   (28.1%)   11.2% ( -29% -   68%) 0.130
>>>>                    OrHighNotHigh      339.78      (8.0%)      393.02
>>>>    (6.7%)   15.7% (   0% -   33%) 0.000
>>>> ```
>>>>
>>>

Re: Baffling performance regression measured by luceneutil

Reply via email to