On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <[email protected]> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in
>> order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of
>> these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for
>> MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can
>> carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt
>> strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]]
>> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193]
>> !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before
>> and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add an platform-specific backend implementations for the
>> MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into
>> CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these
>> results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a
> merge or a rebase. The incremental webrev excludes the unrelated changes
> brought in by the merge/rebase. The pull request contains 44 additional
> commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/6ad0c61a...a190ae68
What is happening with int min/max needs a separate investigation because based
on my testing, the int min/max intrinsic is both a regression and a performance
improvement! Check this out:
make test
TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax"
MICRO="FORK=1"
Benchmark (probability) (size) Mode Cnt Score
Error Units
MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.585
± 0.348 ops/ms
MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.633
± 0.103 ops/ms
MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 460.580
± 0.091 ops/ms
make test
TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax"
MICRO="FORK=1;OPTIONS=-jvmArgs
-XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max"
Benchmark (probability) (size) Mode Cnt Score
Error Units
MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.479
± 0.044 ops/ms
MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.587
± 0.106 ops/ms
MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 1027.831
± 9.353 ops/ms
80%:
││ │ 0x00007ffb200fa089: cmpl %r11d, %r10d
3.04% ││ │ 0x00007ffb200fa08c: cmovll %r11d, %r10d
4.38% ││ │ 0x00007ffb200fa090: cmpl %ebx, %r10d
1.61% ││ │ 0x00007ffb200fa093: cmovll %ebx, %r10d
2.79% ││ │ 0x00007ffb200fa097: cmpl %edi, %r10d
2.92% ││ │ 0x00007ffb200fa09a: cmovll %edi, %r10d
;*ireturn {reexecute=0 rethrow=0 return_oop=0}
││ │
; - java.lang.Math::max@10 (line 2023)
││ │
; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax@23 (line
232)
100%:
3.11% │││││││ ││││││ │ 0x00007f26c00f8f9c: nopl
(%rax)
3.31% │││││││ ││││││ │ 0x00007f26c00f8fa0: cmpl
%r10d, %ecx
│││││││╭ ││││││ │ 0x00007f26c00f8fa3: jge
0x7f26c00f8ff1 ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
││││││││ ││││││ │
; - java.lang.Math::max@10 (line 2023)
││││││││ ││││││ │
; -
org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax@23 (line 232)
││││││││ ││││││ │
; -
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub@19
(line 124)
make test
TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax"
MICRO="FORK=1"
Benchmark (probability) (size) Mode Cnt
Score Error Units
MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4
2815.614 ± 0.406 ops/ms
MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4
2814.943 ± 2.174 ops/ms
MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4
2815.285 ± 1.725 ops/ms
make test
TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax"
MICRO="FORK=1;OPTIONS=-jvmArgs
-XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max"
Benchmark (probability) (size) Mode Cnt
Score Error Units
MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4
2802.062 ± 0.710 ops/ms
MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4
2814.874 ± 4.058 ops/ms
MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4
883.879 ± 0.327 ops/ms
80%:
3.54% │ ││ │││││ 0x00007faa700fa177: vpmaxsd %ymm4, %ymm5,
%ymm13;*ireturn {reexecute=0 rethrow=0 return_oop=0}
│ ││ │││││
; - java.lang.Math::max@10 (line 2023)
100:
7.50% ││││││││││││││││││↗ │ 0x00007f75280f8849: imull
$0xb, 0x2c(%rbp, %r11, 4), %r10d
│││││││││││││││││││ │
;*imul {reexecute=0 rethrow=0 return_oop=0}
│││││││││││││││││││ │
; -
org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax@20 (line 221)
│││││││││││││││││││ │
; -
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub@19
(line 124)
3.85% │││││││││││││││││││ │ 0x00007f75280f884f: cmpl
%r10d, %r8d
││││││││││╰││││││││ │ 0x00007f75280f8852: jl
0x7f75280f87d0 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
││││││││││ ││││││││ │
; - java.lang.Math::max@2 (line 2023)
││││││││││ ││││││││ │
; -
org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax@26 (line 222)
││││││││││ ││││││││ │
; -
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub@19
(line 124)
I ran the exact same test with longs and I don't see such an issue. The
performance is always the same either with the intrisinc or disabling it as
shown above.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664871838