This PR continues the JDK-8384571 [1] work by further optimizing Blend-related 
IR patterns. It adds the factoring optimization that pulls a shared operand out 
of two matching lane-wise binary ops fed into a `VectorBlend`:

  (VectorBlend (op A C) (op B C) M) => (op (VectorBlend A B M) C)


For commutative ops the common operand may be in either position; for 
non-commutative ops it must appear in the same slot in both inner ops.
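
The rewrite can be sanity-checked with a scalar model of the lane-wise semantics. The sketch below is plain Java, not the C2 implementation. The class and the helpers `blend`, `map`, `unfactored` and `factored` are hypothetical names, and `blend` assumes that a set mask lane selects the second input.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

public class BlendFactoringModel {
    // Scalar model of VectorBlend (assumption: a set mask lane selects b).
    static int[] blend(int[] a, int[] b, boolean[] m) {
        int[] r = new int[a.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = m[i] ? b[i] : a[i];
        }
        return r;
    }

    // Scalar model of a lane-wise binary op.
    static int[] map(IntBinaryOperator op, int[] x, int[] y) {
        int[] r = new int[x.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = op.applyAsInt(x[i], y[i]);
        }
        return r;
    }

    // Left-hand side of the rule: two ops sharing c in the second slot, then a blend.
    static int[] unfactored(IntBinaryOperator op, int[] a, int[] b, int[] c, boolean[] m) {
        return blend(map(op, a, c), map(op, b, c), m);
    }

    // Right-hand side of the rule: blend first, then apply the op once.
    static int[] factored(IntBinaryOperator op, int[] a, int[] b, int[] c, boolean[] m) {
        return map(op, blend(a, b, m), c);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = {5, 6, 7, 8};
        boolean[] m = {true, false, true, false};
        IntBinaryOperator sub = (x, y) -> x - y;
        IntBinaryOperator add = Integer::sum;

        // Non-commutative op, shared operand in the same slot: both sides agree.
        System.out.println(Arrays.equals(unfactored(sub, a, b, c, m), factored(sub, a, b, c, m)));

        // Commutative op, shared operand in different slots: still agrees.
        int[] swappedAdd = blend(map(add, a, c), map(add, c, b), m);
        System.out.println(Arrays.equals(swappedAdd, factored(add, a, b, c, m)));

        // Non-commutative op, shared operand in different slots: results differ.
        int[] swappedSub = blend(map(sub, a, c), map(sub, c, b), m);
        System.out.println(Arrays.equals(swappedSub, factored(sub, a, b, c, m)));
    }
}
```

With these inputs the first two comparisons hold and the third does not, which matches the slot restriction for non-commutative ops.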

Shifts and rotates are not factored when the common operand sits in the count 
slot, since that turns a constant shift/rotate amount into a non-constant one 
which is not guaranteed to be a win in codegen.

Also extends `MulVLNode::has_int_inputs` / `has_uint_inputs` to peek through 
`VectorBlend` so the `MulVL -> vmuldq/vmuludq` narrowing on x86 keeps working 
after the factoring runs.
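
To illustrate why peeking through the blend is safe, here is a value-based scalar sketch in plain Java. It is only a model: the real `has_int_inputs` check inspects IR shapes rather than runtime values, all names below are hypothetical, and the narrowed multiply is assumed to be a signed 32x32->64 multiply of the low half of each 64-bit lane.

```java
import java.util.Arrays;

public class MulNarrowingModel {
    // True when every 64-bit lane holds a sign-extended 32-bit value.
    static boolean allLanesFitInt(long[] v) {
        for (long x : v) {
            if ((long) (int) x != x) {
                return false;
            }
        }
        return true;
    }

    // Scalar model of VectorBlend (assumption: a set mask lane selects b).
    static long[] blend(long[] a, long[] b, boolean[] m) {
        long[] r = new long[a.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = m[i] ? b[i] : a[i];
        }
        return r;
    }

    // Full 64x64 -> 64 lane-wise multiply, modelling MulVL.
    static long[] mulFull(long[] x, long[] y) {
        long[] r = new long[x.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = x[i] * y[i];
        }
        return r;
    }

    // Assumed narrowed form: signed 32x32 -> 64 multiply of the low halves.
    static long[] mulLowHalves(long[] x, long[] y) {
        long[] r = new long[x.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = (long) (int) x[i] * (long) (int) y[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {3L, -7L, 100_000L, Integer.MIN_VALUE};
        long[] b = {-1L, 42L, Integer.MAX_VALUE, 9L};
        long[] c = {Integer.MAX_VALUE, -5L, 65_536L, Integer.MIN_VALUE};
        boolean[] m = {false, true, true, false};

        // A blend only selects lanes from a or b, so int-range inputs stay int-range.
        long[] blended = blend(a, b, m);
        System.out.println(allLanesFitInt(blended));

        // With int-range lanes on both sides the narrowed multiply matches the full one.
        System.out.println(Arrays.equals(mulFull(blended, c), mulLowHalves(blended, c)));
    }
}
```

Because the blend never creates a lane value that was not already in one of its inputs, the property established for `A` and `B` carries over to `(VectorBlend A B M)`, which is what allows the narrowing to survive the factoring in this model.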

The `factorMulLong` benchmark exposed a performance issue with the `vmulL_neon`
rule. The encoder lowers `MulVL` to a per-lane `GPR extract -> scalar mul ->
mov dst.D[i]` sequence. The first `mov` inherits a partial-write merge
dependency on the old `dst`, and if the register allocator coalesces `dst` with
`src1`/`src2`, the subsequent `umov` for lane 1 must wait for that merge to
retire. This serialises the two scalar `mul`s within an iteration and across
unrolled iterations. This PR fixes the issue by forcing `dst` into a fresh
register, which breaks the chain.

This PR also extends the existing jtreg and JMH tests for `VectorBlend`. All
tests (tier1, tier2, and tier3) passed on AArch64 and x86 platforms.

JMH benchmark results (throughput, higher is better; Uplift = After / Before):

On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

Benchmark               Unit    Before  Err   After    Err     Uplift
factorAddFloat          ops/ms  2726.9  1.1   7790.8   12.1    2.9
factorAddInt            ops/ms  3383.4  3.9   7778.3   19.7    2.3
factorAndInt            ops/ms  3386.8  2.1   436377   4806.1  128.8
factorCompressBitsInt   ops/ms  1455.7  1.0   2244.7   0.7     1.5
factorDivFloat          ops/ms  1084.9  1.2   1777.4   0.8     1.6
factorExpandBitsLong    ops/ms  1455.6  1.7   2237.7   1.4     1.5
factorLShiftInt         ops/ms  1923.7  2.4   3518.7   3.6     1.8
factorMaxDouble         ops/ms  2712.8  1.6   7718.3   4.9     2.8
factorMinFloat          ops/ms  2725.0  4.7   7778.5   13.5    2.9
factorMulLong           ops/ms  1427.3  0.6   2773.2   3.4     1.9
factorOrLong            ops/ms  3377.3  3.0   472138   2687.2  139.8
factorRolInt            ops/ms  1188.0  1.7   1187.9   0.4     1.0
factorSAddInt           ops/ms  3384.8  3.3   7769.1   4.1     2.3
factorSSubInt           ops/ms  3387.4  8.2   7790.4   12.5    2.3
factorSubDouble         ops/ms  2717.7  3.1   7711.4   2.6     2.8
factorSubInt            ops/ms  3384.8  5.7   7767.4   3.4     2.3
factorXorInt            ops/ms  3386.7  1.9   15017.2  20.5    4.4


On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

Benchmark               Unit    Before  Err   After    Err     Uplift
factorAddFloat          ops/ms  2105.1  0.6   5779.4   0.1     2.7
factorAddInt            ops/ms  2624.2  0.5   5778.5   0.4     2.2
factorAndInt            ops/ms  2623.8  0.6   293407   4459.5  111.8
factorCompressBitsInt   ops/ms  22.0    0.0   22.1     0.0     1.0
factorDivFloat          ops/ms  605.1   0.0   1301.6   0.0     2.2
factorExpandBitsLong    ops/ms  17.0    0.0   17.0     0.0     1.0
factorLShiftInt         ops/ms  1481.8  0.4   2580.5   0.0     1.7
factorMaxDouble         ops/ms  1965.9  1.1   5223.3   1.9     2.7
factorMinFloat          ops/ms  1957.6  2.3   5247.9   1.0     2.7
factorMulLong           ops/ms  1046.7  0.0   2109.6   0.1     2.0
factorOrLong            ops/ms  2615.6  0.0   307047   867.7   117.4
factorRolInt            ops/ms  885.4   0.1   885.6    0.1     1.0
factorSAddInt           ops/ms  2623.8  0.5   5778.6   0.2     2.2
factorSSubInt           ops/ms  2623.7  0.6   5778.1   0.2     2.2
factorSubDouble         ops/ms  2094.9  0.1   5768.0   0.2     2.8
factorSubInt            ops/ms  2624.4  0.4   5778.4   0.2     2.2
factorXorInt            ops/ms  2624.0  0.3   5778.4   0.2     2.2


On an Nvidia Grace (Neoverse-V2) machine with `-XX:UseSVE=0`:

Benchmark               Unit    Before  Err   After    Err     Uplift
factorAddFloat          ops/ms  2207.6  1.4   7764.8   5.2     3.5
factorAddInt            ops/ms  2983.3  27    7775.5   16.8    2.6
factorAndInt            ops/ms  2960.1  82    436311   5503.8  147.4
factorCompressBitsInt   ops/ms  50.5    0.1   50.6     0.1     1.0
factorDivFloat          ops/ms  1078.3  0.5   1781.4   1.5     1.7
factorExpandBitsLong    ops/ms  36.7    0.1   36.7     0.1     1.0
factorLShiftInt         ops/ms  1902.4  4.8   3534.5   10.7    1.9
factorMaxDouble         ops/ms  2236.8  1.0   7697.7   15.2    3.4
factorMinFloat          ops/ms  2212.7  8.8   7787.0   11.7    3.5
factorMulLong           ops/ms  728.6   1.1   1037.8   1.7     1.4
factorOrLong            ops/ms  3209.2  5.8   465766   3478.5  145.1
factorRolInt            ops/ms  1005.9  1.7   1003.8   0.7     1.0
factorSAddInt           ops/ms  2922.7  64    7776.3   15.2    2.7
factorSSubInt           ops/ms  2975.3  55    7787.1   20.7    2.6
factorSubDouble         ops/ms  2235.8  0.9   7682.0   2.1     3.4
factorSubInt            ops/ms  2905.3  28    7797.8   12.0    2.7
factorXorInt            ops/ms  2921.3  46    15076.5  38.0    5.2


On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:

Benchmark               Unit    Before  Err   After    Err     Uplift
factorAddFloat          ops/ms  3646.2  0.7   6110.9   0.3     1.7
factorAddInt            ops/ms  4938.5  0.3   14243.0  0.8     2.9
factorAndInt            ops/ms  4938.5  0.2   296063   1510.5  59.9
factorCompressBitsInt   ops/ms  83.8    0.5   77.3     0.4     0.9
factorDivFloat          ops/ms  876.8   0.0   1664.3   0.1     1.9
factorExpandBitsLong    ops/ms  96.6    0.2   91.5     0.3     0.9
factorLShiftInt         ops/ms  4540.6  1.3   6821.2   1.2     1.5
factorMaxDouble         ops/ms  884.2   0.1   1231.5   0.4     1.4
factorMinFloat          ops/ms  1085.9  0.2   1730.4   0.7     1.6
factorMulLong           ops/ms  3642.3  0.2   6112.1   0.5     1.7
factorOrLong            ops/ms  4938.3  0.3   290656   6077.9  58.9
factorRolInt            ops/ms  3404.6  1.3   4978.0   0.2     1.5
factorSAddInt           ops/ms  698.3   0.1   1166.3   0.2     1.7
factorSSubInt           ops/ms  708.8   0.1   1189.2   0.1     1.7
factorSubDouble         ops/ms  3646.7  0.2   6110.0   1.4     1.7
factorSubInt            ops/ms  4937.7  1.1   14242.8  0.6     2.9
factorXorInt            ops/ms  4938.6  0.3   68699.2  202.2   13.9


On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=2`:

Benchmark               Unit    Before  Err   After    Err     Uplift
factorAddFloat          ops/ms  2920.3  1.3   11216.8  3.3     3.8
factorAddInt            ops/ms  5282.7  3.5   18689.1  39.4    3.5
factorAndInt            ops/ms  5284.6  1.6   426993   2099.4  80.8
factorCompressBitsInt   ops/ms  103.8   1.8   103.1    1.8     1.0
factorDivFloat          ops/ms  873.3   1.0   2063.6   0.7     2.4
factorExpandBitsLong    ops/ms  101.7   1.4   102.0    1.5     1.0
factorLShiftInt         ops/ms  3479.5  5.7   8532.7   2.5     2.5
factorMaxDouble         ops/ms  885.7   0.3   1265.2   0.4     1.4
factorMinFloat          ops/ms  971.8   0.6   1384.1   2.0     1.4
factorMulLong           ops/ms  787.3   0.3   1009.9   1.6     1.3
factorOrLong            ops/ms  5284.6  1.8   425419   789.9   80.5
factorRolInt            ops/ms  2201.4  0.7   2201.5   2.5     1.0
factorSAddInt           ops/ms  1343.0  0.4   2245.6   2.2     1.7
factorSSubInt           ops/ms  1345.2  0.6   2248.6   1.9     1.7
factorSubDouble         ops/ms  2920.4  0.8   11215.9  3.7     3.8
factorSubInt            ops/ms  5284.1  1.7   18719.1  5.1     3.5
factorXorInt            ops/ms  5284.3  3.7   18720.1  6.1     3.5


[1] https://bugs.openjdk.org/browse/JDK-8384571




---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK 
Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Depends on: https://git.openjdk.org/jdk/pull/31333

Commit messages:
 - 8385051: C2: Factor the shared lane-wise binary op out of VectorBlendNode

Changes: https://git.openjdk.org/jdk/pull/31379/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31379&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8385051
  Stats: 968 lines in 8 files changed: 965 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/31379.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31379/head:pull/31379

PR: https://git.openjdk.org/jdk/pull/31379
