RFR: 8384571: C2: Add some basic IGVN optimization for VectorBlendNode

Eric Fang Sun, 31 May 2026 19:12:38 -0700

This PR introduces the basic Ideal/Identity transformations for 
`VectorBlendNode`.


The semantics of `VectorBlend(X, Y, M)` are, per lane: `M ? Y : X`.

**Identity**:

  (VectorBlend X Y (Replicate -1)) => Y
  (VectorBlend X Y (MaskAll   -1)) => Y
  (VectorBlend X Y (Replicate  0)) => X
  (VectorBlend X Y (MaskAll    0)) => X
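
As a rough illustration of why these identities hold, here is a lane-wise scalar model of the blend semantics stated above. It is only a sketch, not C2 code: the class `BlendIdentitySketch` and the helper `maskAll` are hypothetical names, and a `boolean[]` stands in for the vector mask (an all-true array plays the role of `Replicate -1` / `MaskAll -1`, an all-false array the role of `Replicate 0` / `MaskAll 0`).

```java
import java.util.Arrays;

class BlendIdentitySketch {
    // Lane-wise model of VectorBlend(X, Y, M): a set mask lane selects Y,
    // a clear mask lane selects X.
    static int[] blend(int[] x, int[] y, boolean[] m) {
        int[] r = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = m[i] ? y[i] : x[i];
        }
        return r;
    }

    // Hypothetical helper: a mask whose lanes all hold the same value.
    static boolean[] maskAll(int lanes, boolean value) {
        boolean[] m = new boolean[lanes];
        Arrays.fill(m, value);
        return m;
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4};
        int[] y = {10, 20, 30, 40};
        // All-ones mask: the blend is just Y.
        System.out.println(Arrays.equals(blend(x, y, maskAll(4, true)), y));  // prints true
        // All-zeros mask: the blend is just X.
        System.out.println(Arrays.equals(blend(x, y, maskAll(4, false)), x)); // prints true
    }
}
```

Because the result equals one of the existing inputs, the blend node can be replaced by that input without creating any new node.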


**Ideal**:

  (VectorBlend (VectorBlend X A M) B M)  => (VectorBlend X B M)
  (VectorBlend A (VectorBlend B X M) M)  => (VectorBlend A X M)
  (VectorBlend A B (XorV/XorVMask M -1)) => (VectorBlend B A M)
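
The three rewrites can be checked against the same lane-wise model. The sketch below is again a scalar stand-in rather than C2 code, and `BlendIdealSketch`, `not`, `maskFromBits`, and `rulesHold` are hypothetical names. It tries every 4-lane mask and compares each left-hand side with its right-hand side; `not(m)` models `XorV/XorVMask M -1`.

```java
import java.util.Arrays;

class BlendIdealSketch {
    // Lane-wise model of VectorBlend(X, Y, M): M ? Y : X.
    static int[] blend(int[] x, int[] y, boolean[] m) {
        int[] r = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = m[i] ? y[i] : x[i];
        }
        return r;
    }

    // Models the mask negation (XorV/XorVMask M -1).
    static boolean[] not(boolean[] m) {
        boolean[] r = new boolean[m.length];
        for (int i = 0; i < m.length; i++) {
            r[i] = !m[i];
        }
        return r;
    }

    // Hypothetical helper: decode the low `lanes` bits of `bits` into a mask.
    static boolean[] maskFromBits(int bits, int lanes) {
        boolean[] m = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            m[i] = ((bits >> i) & 1) != 0;
        }
        return m;
    }

    // Returns true if all three rewrites preserve the result for mask m.
    static boolean rulesHold(boolean[] m) {
        int[] x = {1, 2, 3, 4};
        int[] a = {10, 20, 30, 40};
        int[] b = {100, 200, 300, 400};
        // (VectorBlend (VectorBlend X A M) B M) => (VectorBlend X B M)
        boolean firstInputNested =
            Arrays.equals(blend(blend(x, a, m), b, m), blend(x, b, m));
        // (VectorBlend A (VectorBlend B X M) M) => (VectorBlend A X M)
        boolean secondInputNested =
            Arrays.equals(blend(a, blend(b, x, m), m), blend(a, x, m));
        // (VectorBlend A B (not M)) => (VectorBlend B A M)
        boolean negatedMask =
            Arrays.equals(blend(a, b, not(m)), blend(b, a, m));
        return firstInputNested && secondInputNested && negatedMask;
    }

    public static void main(String[] args) {
        boolean allHold = true;
        for (int bits = 0; bits < 16; bits++) {
            allHold &= rulesHold(maskFromBits(bits, 4));
        }
        System.out.println(allHold); // prints true
    }
}
```

In the two nested cases the inner blend's contribution is overwritten or never selected under the shared mask `M`, so it drops out of the result. In the negated case, swapping the two data inputs absorbs the mask negation.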


Also corrects the VectorBlendNode header comment: across all backends (X86 
SSE/AVX, AArch64 NEON/SVE, RISC-V V) the active mask lane selects `vec2` 
(in(2)), and the inactive lane selects `vec1` (in(1)).

JTReg and JMH tests are also added for each optimization pattern. All tests 
(tier1, tier2, and tier3) passed on AArch64 and X86 platforms.

JMH benchmark test results:

On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

Benchmark             Unit    Before    Error   After     Error     Uplift
blendNegatedMaskInt   ops/ms  7990.6    2.8     10215.2   11.0      1.3
identityAllOnesInt    ops/ms  3574.8    2.6     7967.1    0.3       2.2
identityAllZerosLong  ops/ms  3575.6    1.0     7966.0    3.6       2.2
nestedBlendInnerLong  ops/ms  3533.8    2.8     478573.0  3178.5    135.4
nestedBlendOuterInt   ops/ms  3537.6    3.4     472242.2  3034.2    133.5


On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

Benchmark             Unit    Before    Error   After     Error     Uplift
blendNegatedMaskInt   ops/ms  5171.9    5.2     8129.0    17.3      1.6
identityAllOnesInt    ops/ms  2722.0    0.1     5891.3    0.1       2.2
identityAllZerosLong  ops/ms  2722.4    0.1     5891.1    0.3       2.2
nestedBlendInnerLong  ops/ms  2697.6    0.0     312148.7  2366.4    115.7
nestedBlendOuterInt   ops/ms  2702.7    0.1     308686.0  2709.8    114.2


On an Nvidia Grace (Neoverse-V2) machine with `-XX:UseSVE=0`:

Benchmark             Unit    Before    Error   After     Error     Uplift
blendNegatedMaskInt   ops/ms  7718.1    1.9     9515.9    54.0      1.2
identityAllOnesInt    ops/ms  3581.9    0.6     8062.5    0.5       2.3
identityAllZerosLong  ops/ms  3582.7    0.6     8058.5    11.9      2.2
nestedBlendInnerLong  ops/ms  3529.6    1.4     476029.8  5190.2    134.9
nestedBlendOuterInt   ops/ms  3536.9    2.1     486060.0  3442.1    137.4


On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:

Benchmark             Unit    Before    Error   After     Error     Uplift
blendNegatedMaskInt   ops/ms  36773.6   541.7   46467.4   499.4     1.3
identityAllOnesInt    ops/ms  5262.7    3.7     13644.7   12.1      2.6
identityAllZerosLong  ops/ms  5272.4    3.4     13665.3   8.4       2.6
nestedBlendInnerLong  ops/ms  5256.6    4.9     436643.3  14778.8   83.1
nestedBlendOuterInt   ops/ms  5253.2    1.5     223851.3  106003    42.6


On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=2`:

Benchmark             Unit    Before    Error   After     Error     Uplift
blendNegatedMaskInt   ops/ms  24335.3   32.1    30412.3   28.1      1.2
identityAllOnesInt    ops/ms  5248.8    5.0     13677.5   18.4      2.6
identityAllZerosLong  ops/ms  5248.8    2.2     13655.8   2.9       2.6
nestedBlendInnerLong  ops/ms  5146.2    4.6     649242.6  1174.4    126.2
nestedBlendOuterInt   ops/ms  5141.8    6.2     646255.2  10654.1   125.7


The microbenchmarks show a significant speedup, mainly because the simplified 
blends allow redundant computations to be hoisted out of the loop. The 
transformations also reduce the number of IR uses, which can in turn enable 
further optimizations.




---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK 
Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Commit messages:
 - 8384571: C2: Add some basic IGVN optimization for VectorBlendNode

Changes: https://git.openjdk.org/jdk/pull/31333/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31333&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8384571
  Stats: 373 lines in 4 files changed: 371 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/31333.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31333/head:pull/31333

PR: https://git.openjdk.org/jdk/pull/31333
