This PR introduces the basic Ideal/Identity transformations for
`VectorBlendNode`.
The semantics of `VectorBlend(X, Y, M)` are, per lane: `M ? Y : X`.
**Identity**:
(VectorBlend X Y (Replicate -1)) => Y
(VectorBlend X Y (MaskAll -1)) => Y
(VectorBlend X Y (Replicate 0)) => X
(VectorBlend X Y (MaskAll 0)) => X
**Ideal**:
(VectorBlend (VectorBlend X A M) B M) => (VectorBlend X B M)
(VectorBlend A (VectorBlend B X M) M) => (VectorBlend A X M)
(VectorBlend A B (XorV/XorVMask M -1)) => (VectorBlend B A M)
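The rewrites listed above can be sanity-checked with a small lane-wise model. This is only a sketch of the stated semantics (`M ? Y : X`), not the C2 implementation: the class `BlendModel` and its helpers `blend`, `not`, `maskFromBits` and `checkAll` are hypothetical names introduced for illustration, masks are modeled as `boolean[]`, and `XorV/XorVMask M -1` is modeled as a per-lane logical negation.

```java
import java.util.Arrays;

// Hypothetical scalar model of VectorBlend; names are illustrative, not part of the JDK.
final class BlendModel {
    // VectorBlend(X, Y, M): a set mask lane selects Y, a clear lane selects X.
    public static int[] blend(int[] x, int[] y, boolean[] m) {
        int[] r = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = m[i] ? y[i] : x[i];
        }
        return r;
    }

    // Models (XorV/XorVMask M -1): flips every mask lane.
    public static boolean[] not(boolean[] m) {
        boolean[] r = new boolean[m.length];
        for (int i = 0; i < m.length; i++) {
            r[i] = !m[i];
        }
        return r;
    }

    // Builds a mask whose lane i is set when bit i of 'bits' is set.
    public static boolean[] maskFromBits(int bits, int lanes) {
        boolean[] m = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            m[i] = ((bits >> i) & 1) != 0;
        }
        return m;
    }

    // Checks the Identity and Ideal rewrites for every possible 4-lane mask.
    public static boolean checkAll() {
        int lanes = 4;
        int[] x = {1, 2, 3, 4};
        int[] a = {10, 20, 30, 40};
        int[] b = {100, 200, 300, 400};
        boolean[] allOnes = maskFromBits(0xF, lanes);
        boolean[] allZeros = maskFromBits(0x0, lanes);

        // Identity: an all-ones mask yields Y, an all-zeros mask yields X.
        boolean ok = Arrays.equals(blend(x, a, allOnes), a)
                  && Arrays.equals(blend(x, a, allZeros), x);

        for (int bits = 0; bits < (1 << lanes); bits++) {
            boolean[] m = maskFromBits(bits, lanes);
            // (VectorBlend (VectorBlend X A M) B M) => (VectorBlend X B M)
            ok &= Arrays.equals(blend(blend(x, a, m), b, m), blend(x, b, m));
            // (VectorBlend A (VectorBlend B X M) M) => (VectorBlend A X M)
            ok &= Arrays.equals(blend(a, blend(b, x, m), m), blend(a, x, m));
            // (VectorBlend A B (not M)) => (VectorBlend B A M)
            ok &= Arrays.equals(blend(a, b, not(m)), blend(b, a, m));
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(checkAll() ? "all rewrites hold" : "mismatch");
    }
}
```

Because the model enumerates every 4-lane mask, it exercises both the active and the inactive branch of each rule in every lane position.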
This PR also corrects the `VectorBlendNode` header comment: across all backends
(X86 SSE/AVX, AArch64 NEON/SVE, RISC-V V), an active mask lane selects `vec2`
(`in(2)`) and an inactive mask lane selects `vec1` (`in(1)`).
JTReg and JMH tests are also added for each optimization pattern. All tests
(tier1, tier2, and tier3) passed on AArch64 and X86 platforms.
JMH benchmark test results:
On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
Benchmark             Unit      Before  Error     After    Error  Uplift
blendNegatedMaskInt   ops/ms    7990.6    2.8   10215.2     11.0     1.3
identityAllOnesInt    ops/ms    3574.8    2.6    7967.1      0.3     2.2
identityAllZerosLong  ops/ms    3575.6    1.0    7966.0      3.6     2.2
nestedBlendInnerLong  ops/ms    3533.8    2.8  478573.0   3178.5   135.4
nestedBlendOuterInt   ops/ms    3537.6    3.4  472242.2   3034.2   133.5
On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
Benchmark             Unit      Before  Error     After    Error  Uplift
blendNegatedMaskInt   ops/ms    5171.9    5.2    8129.0     17.3     1.6
identityAllOnesInt    ops/ms    2722.0    0.1    5891.3      0.1     2.2
identityAllZerosLong  ops/ms    2722.4    0.1    5891.1      0.3     2.2
nestedBlendInnerLong  ops/ms    2697.6    0.0  312148.7   2366.4   115.7
nestedBlendOuterInt   ops/ms    2702.7    0.1  308686.0   2709.8   114.2
On an NVIDIA Grace (Neoverse-V2) machine with `-XX:UseSVE=0`:
Benchmark             Unit      Before  Error     After    Error  Uplift
blendNegatedMaskInt   ops/ms    7718.1    1.9    9515.9     54.0     1.2
identityAllOnesInt    ops/ms    3581.9    0.6    8062.5      0.5     2.3
identityAllZerosLong  ops/ms    3582.7    0.6    8058.5     11.9     2.2
nestedBlendInnerLong  ops/ms    3529.6    1.4  476029.8   5190.2   134.9
nestedBlendOuterInt   ops/ms    3536.9    2.1  486060.0   3442.1   137.4
On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:
Benchmark             Unit      Before  Error     After    Error  Uplift
blendNegatedMaskInt   ops/ms   36773.6  541.7   46467.4    499.4     1.3
identityAllOnesInt    ops/ms    5262.7    3.7   13644.7     12.1     2.6
identityAllZerosLong  ops/ms    5272.4    3.4   13665.3      8.4     2.6
nestedBlendInnerLong  ops/ms    5256.6    4.9  436643.3  14778.8    83.1
nestedBlendOuterInt   ops/ms    5253.2    1.5  223851.3   106003    42.6
On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=2`:
Benchmark             Unit      Before  Error     After    Error  Uplift
blendNegatedMaskInt   ops/ms   24335.3   32.1   30412.3     28.1     1.2
identityAllOnesInt    ops/ms    5248.8    5.0   13677.5     18.4     2.6
identityAllZerosLong  ops/ms    5248.8    2.2   13655.8      2.9     2.6
nestedBlendInnerLong  ops/ms    5146.2    4.6  649242.6   1174.4   126.2
nestedBlendOuterInt   ops/ms    5141.8    6.2  646255.2  10654.1   125.7
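For readers checking the numbers: the Uplift column in the tables above is consistent with the After score divided by the Before score, rounded to one decimal place. The sketch below reproduces two entries under that assumption; the class `Uplift` and method `uplift` are hypothetical names, and the rounding rule is inferred from the reported values rather than stated in the PR.

```java
// Hypothetical helper; the ratio-and-round rule is inferred from the tables, not stated in the PR.
final class Uplift {
    // Ratio of "After" to "Before" throughput, rounded to one decimal place.
    public static double uplift(double before, double after) {
        return Math.round(after / before * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // identityAllOnesInt on Grace with 128-bit SVE2
        System.out.println(uplift(3574.8, 7967.1));    // prints 2.2
        // nestedBlendInnerLong on Grace with 128-bit SVE2
        System.out.println(uplift(3533.8, 478573.0));  // prints 135.4
    }
}
```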
The microbenchmarks show a significant speedup, from 1.2x for
`blendNegatedMaskInt` up to 137.4x for `nestedBlendOuterInt`. This is mainly
because this PR eliminates redundant computations inside the loop by hoisting
them out of it. It also reduces the number of IR uses, which can in turn enable
further optimizations.
---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK
Interim AI Policy](https://openjdk.org/legal/ai).
-------------
Commit messages:
- 8384571: C2: Add some basic IGVN optimization for VectorBlendNode
Changes: https://git.openjdk.org/jdk/pull/31333/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31333&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8384571
Stats: 373 lines in 4 files changed: 371 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/31333.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/31333/head:pull/31333
PR: https://git.openjdk.org/jdk/pull/31333