This PR continues the JDK-8384571 [1] work by further optimizing Blend-related IR patterns. It adds a factoring optimization that pulls a shared operand out of two matching lane-wise binary ops feeding a `VectorBlend`:

`(VectorBlend (op A C) (op B C) M) => (op (VectorBlend A B M) C)`

For commutative ops the common operand may be in either position; for non-commutative ops it must appear in the same slot in both inner ops. Shifts and rotates are not factored when the common operand sits in the count slot, since factoring would turn a constant shift/rotate amount into a non-constant one, which is not guaranteed to be a win in codegen.

The PR also extends `MulVLNode::has_int_inputs` / `has_uint_inputs` to peek through `VectorBlend`, so that the `MulVL -> vmuldq/vmuludq` narrowing on x86 keeps working after the factoring runs.

The `factorMulLong` benchmark exposed a performance issue with the rule `vmulL_neon`. The encoder lowers `MulVL` to per-lane `GPR extract -> scalar mul -> mov dst.D[i]`. The first `mov` inherits a partial-write merge dependency on the old `dst`, and if the register allocator coalesces `dst` with `src1/src2`, the subsequent `umov` for lane 1 must wait for that merge to retire. This serialises the two scalar `mul`s within an iteration and across unrolled iterations. This PR fixes the issue by forcing `dst` into a fresh register, which breaks the chain.

This PR also extends the existing JTReg and JMH tests for `VectorBlend`. All tests (tier1, tier2, and tier3) passed on AArch64 and x86 platforms.
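The factoring rule described above can be sanity-checked on plain arrays. The following is only a scalar sketch of the identity, not the C2 implementation: the class `BlendFactoringSketch` and its helpers (`blend`, `lanewise`, `unfactored`, `factored`) are hypothetical names introduced for illustration, and the blend is assumed to pick the second input in lanes where the mask is set.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

public class BlendFactoringSketch {
    // Scalar model of a blend (assumption): lane i takes b[i] when m[i] is set, else a[i].
    public static int[] blend(int[] a, int[] b, boolean[] m) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = m[i] ? b[i] : a[i];
        }
        return r;
    }

    // Applies a binary op lane by lane.
    public static int[] lanewise(IntBinaryOperator op, int[] x, int[] y) {
        int[] r = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = op.applyAsInt(x[i], y[i]);
        }
        return r;
    }

    // Original shape: (VectorBlend (op A C) (op B C) M), two ops and one blend.
    public static int[] unfactored(IntBinaryOperator op, int[] a, int[] b, int[] c, boolean[] m) {
        return blend(lanewise(op, a, c), lanewise(op, b, c), m);
    }

    // Factored shape: (op (VectorBlend A B M) C), one blend and one op.
    public static int[] factored(IntBinaryOperator op, int[] a, int[] b, int[] c, boolean[] m) {
        return lanewise(op, blend(a, b, m), c);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = {5, 5, 5, 5};
        boolean[] m = {true, false, true, false};
        IntBinaryOperator sub = (x, y) -> x - y;

        // Non-commutative op with C in the same slot on both sides: the shapes agree.
        System.out.println(Arrays.equals(unfactored(sub, a, b, c, m), factored(sub, a, b, c, m)));

        // C in different slots (A - C versus C - B): the shapes differ, so no factoring.
        int[] mixedSlots = blend(lanewise(sub, a, c), lanewise(sub, c, b), m);
        System.out.println(Arrays.equals(mixedSlots, factored(sub, a, b, c, m)));
    }
}
```

The second comparison shows why a non-commutative op needs the common operand in the same slot of both inner ops: with `C` on the right in one op and on the left in the other, no single factored form produces the same lanes.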
JMH benchmark test results:

On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

| Benchmark | Unit | Before | Err | After | Err | Uplift |
|---|---|---|---|---|---|---|
| factorAddFloat | ops/ms | 2726.9 | 1.1 | 7790.8 | 12.1 | 2.9 |
| factorAddInt | ops/ms | 3383.4 | 3.9 | 7778.3 | 19.7 | 2.3 |
| factorAndInt | ops/ms | 3386.8 | 2.1 | 436377 | 4806.1 | 128.8 |
| factorCompressBitsInt | ops/ms | 1455.7 | 1.0 | 2244.7 | 0.7 | 1.5 |
| factorDivFloat | ops/ms | 1084.9 | 1.2 | 1777.4 | 0.8 | 1.6 |
| factorExpandBitsLong | ops/ms | 1455.6 | 1.7 | 2237.7 | 1.4 | 1.5 |
| factorLShiftInt | ops/ms | 1923.7 | 2.4 | 3518.7 | 3.6 | 1.8 |
| factorMaxDouble | ops/ms | 2712.8 | 1.6 | 7718.3 | 4.9 | 2.8 |
| factorMinFloat | ops/ms | 2725.0 | 4.7 | 7778.5 | 13.5 | 2.9 |
| factorMulLong | ops/ms | 1427.3 | 0.6 | 2773.2 | 3.4 | 1.9 |
| factorOrLong | ops/ms | 3377.3 | 3.0 | 472138 | 2687.2 | 139.8 |
| factorRolInt | ops/ms | 1188.0 | 1.7 | 1187.9 | 0.4 | 1.0 |
| factorSAddInt | ops/ms | 3384.8 | 3.3 | 7769.1 | 4.1 | 2.3 |
| factorSSubInt | ops/ms | 3387.4 | 8.2 | 7790.4 | 12.5 | 2.3 |
| factorSubDouble | ops/ms | 2717.7 | 3.1 | 7711.4 | 2.6 | 2.8 |
| factorSubInt | ops/ms | 3384.8 | 5.7 | 7767.4 | 3.4 | 2.3 |
| factorXorInt | ops/ms | 3386.7 | 1.9 | 15017.2 | 20.5 | 4.4 |

On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

| Benchmark | Unit | Before | Err | After | Err | Uplift |
|---|---|---|---|---|---|---|
| factorAddFloat | ops/ms | 2105.1 | 0.6 | 5779.4 | 0.1 | 2.7 |
| factorAddInt | ops/ms | 2624.2 | 0.5 | 5778.5 | 0.4 | 2.2 |
| factorAndInt | ops/ms | 2623.8 | 0.6 | 293407 | 4459.5 | 111.8 |
| factorCompressBitsInt | ops/ms | 22.0 | 0.0 | 22.1 | 0.0 | 1.0 |
| factorDivFloat | ops/ms | 605.1 | 0.0 | 1301.6 | 0.0 | 2.2 |
| factorExpandBitsLong | ops/ms | 17.0 | 0.0 | 17.0 | 0.0 | 1.0 |
| factorLShiftInt | ops/ms | 1481.8 | 0.4 | 2580.5 | 0.0 | 1.7 |
| factorMaxDouble | ops/ms | 1965.9 | 1.1 | 5223.3 | 1.9 | 2.7 |
| factorMinFloat | ops/ms | 1957.6 | 2.3 | 5247.9 | 1.0 | 2.7 |
| factorMulLong | ops/ms | 1046.7 | 0.0 | 2109.6 | 0.1 | 2.0 |
| factorOrLong | ops/ms | 2615.6 | 0.0 | 307047 | 867.7 | 117.4 |
| factorRolInt | ops/ms | 885.4 | 0.1 | 885.6 | 0.1 | 1.0 |
| factorSAddInt | ops/ms | 2623.8 | 0.5 | 5778.6 | 0.2 | 2.2 |
| factorSSubInt | ops/ms | 2623.7 | 0.6 | 5778.1 | 0.2 | 2.2 |
| factorSubDouble | ops/ms | 2094.9 | 0.1 | 5768.0 | 0.2 | 2.8 |
| factorSubInt | ops/ms | 2624.4 | 0.4 | 5778.4 | 0.2 | 2.2 |
| factorXorInt | ops/ms | 2624.0 | 0.3 | 5778.4 | 0.2 | 2.2 |

On a Nvidia Grace (Neoverse-V2) machine with `-XX:UseSVE=0`:

| Benchmark | Unit | Before | Err | After | Err | Uplift |
|---|---|---|---|---|---|---|
| factorAddFloat | ops/ms | 2207.6 | 1.4 | 7764.8 | 5.2 | 3.5 |
| factorAddInt | ops/ms | 2983.3 | 27 | 7775.5 | 16.8 | 2.6 |
| factorAndInt | ops/ms | 2960.1 | 82 | 436311 | 5503.8 | 147.4 |
| factorCompressBitsInt | ops/ms | 50.5 | 0.1 | 50.6 | 0.1 | 1.0 |
| factorDivFloat | ops/ms | 1078.3 | 0.5 | 1781.4 | 1.5 | 1.7 |
| factorExpandBitsLong | ops/ms | 36.7 | 0.1 | 36.7 | 0.1 | 1.0 |
| factorLShiftInt | ops/ms | 1902.4 | 4.8 | 3534.5 | 10.7 | 1.9 |
| factorMaxDouble | ops/ms | 2236.8 | 1.0 | 7697.7 | 15.2 | 3.4 |
| factorMinFloat | ops/ms | 2212.7 | 8.8 | 7787.0 | 11.7 | 3.5 |
| factorMulLong | ops/ms | 728.6 | 1.1 | 1037.8 | 1.7 | 1.4 |
| factorOrLong | ops/ms | 3209.2 | 5.8 | 465766 | 3478.5 | 145.1 |
| factorRolInt | ops/ms | 1005.9 | 1.7 | 1003.8 | 0.7 | 1.0 |
| factorSAddInt | ops/ms | 2922.7 | 64 | 7776.3 | 15.2 | 2.7 |
| factorSSubInt | ops/ms | 2975.3 | 55 | 7787.1 | 20.7 | 2.6 |
| factorSubDouble | ops/ms | 2235.8 | 0.9 | 7682.0 | 2.1 | 3.4 |
| factorSubInt | ops/ms | 2905.3 | 28 | 7797.8 | 12.0 | 2.7 |
| factorXorInt | ops/ms | 2921.3 | 46 | 15076.5 | 38.0 | 5.2 |

On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:

| Benchmark | Unit | Before | Err | After | Err | Uplift |
|---|---|---|---|---|---|---|
| factorAddFloat | ops/ms | 3646.2 | 0.7 | 6110.9 | 0.3 | 1.7 |
| factorAddInt | ops/ms | 4938.5 | 0.3 | 14243.0 | 0.8 | 2.9 |
| factorAndInt | ops/ms | 4938.5 | 0.2 | 296063 | 1510.5 | 59.9 |
| factorCompressBitsInt | ops/ms | 83.8 | 0.5 | 77.3 | 0.4 | 0.9 |
| factorDivFloat | ops/ms | 876.8 | 0.0 | 1664.3 | 0.1 | 1.9 |
| factorExpandBitsLong | ops/ms | 96.6 | 0.2 | 91.5 | 0.3 | 0.9 |
| factorLShiftInt | ops/ms | 4540.6 | 1.3 | 6821.2 | 1.2 | 1.5 |
| factorMaxDouble | ops/ms | 884.2 | 0.1 | 1231.5 | 0.4 | 1.4 |
| factorMinFloat | ops/ms | 1085.9 | 0.2 | 1730.4 | 0.7 | 1.6 |
| factorMulLong | ops/ms | 3642.3 | 0.2 | 6112.1 | 0.5 | 1.7 |
| factorOrLong | ops/ms | 4938.3 | 0.3 | 290656 | 6077.9 | 58.9 |
| factorRolInt | ops/ms | 3404.6 | 1.3 | 4978.0 | 0.2 | 1.5 |
| factorSAddInt | ops/ms | 698.3 | 0.1 | 1166.3 | 0.2 | 1.7 |
| factorSSubInt | ops/ms | 708.8 | 0.1 | 1189.2 | 0.1 | 1.7 |
| factorSubDouble | ops/ms | 3646.7 | 0.2 | 6110.0 | 1.4 | 1.7 |
| factorSubInt | ops/ms | 4937.7 | 1.1 | 14242.8 | 0.6 | 2.9 |
| factorXorInt | ops/ms | 4938.6 | 0.3 | 68699.2 | 202.2 | 13.9 |

On an AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=2`:

| Benchmark | Unit | Before | Err | After | Err | Uplift |
|---|---|---|---|---|---|---|
| factorAddFloat | ops/ms | 2920.3 | 1.3 | 11216.8 | 3.3 | 3.8 |
| factorAddInt | ops/ms | 5282.7 | 3.5 | 18689.1 | 39.4 | 3.5 |
| factorAndInt | ops/ms | 5284.6 | 1.6 | 426993 | 2099.4 | 80.8 |
| factorCompressBitsInt | ops/ms | 103.8 | 1.8 | 103.1 | 1.8 | 1.0 |
| factorDivFloat | ops/ms | 873.3 | 1.0 | 2063.6 | 0.7 | 2.4 |
| factorExpandBitsLong | ops/ms | 101.7 | 1.4 | 102.0 | 1.5 | 1.0 |
| factorLShiftInt | ops/ms | 3479.5 | 5.7 | 8532.7 | 2.5 | 2.5 |
| factorMaxDouble | ops/ms | 885.7 | 0.3 | 1265.2 | 0.4 | 1.4 |
| factorMinFloat | ops/ms | 971.8 | 0.6 | 1384.1 | 2.0 | 1.4 |
| factorMulLong | ops/ms | 787.3 | 0.3 | 1009.9 | 1.6 | 1.3 |
| factorOrLong | ops/ms | 5284.6 | 1.8 | 425419 | 789.9 | 80.5 |
| factorRolInt | ops/ms | 2201.4 | 0.7 | 2201.5 | 2.5 | 1.0 |
| factorSAddInt | ops/ms | 1343.0 | 0.4 | 2245.6 | 2.2 | 1.7 |
| factorSSubInt | ops/ms | 1345.2 | 0.6 | 2248.6 | 1.9 | 1.7 |
| factorSubDouble | ops/ms | 2920.4 | 0.8 | 11215.9 | 3.7 | 3.8 |
| factorSubInt | ops/ms | 5284.1 | 1.7 | 18719.1 | 5.1 | 3.5 |
| factorXorInt | ops/ms | 5284.3 | 3.7 | 18720.1 | 6.1 | 3.5 |

[1] https://bugs.openjdk.org/browse/JDK-8384571

---------

- [x] I confirm that I make this contribution in accordance with the [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Depends on: https://git.openjdk.org/jdk/pull/31333

Commit messages:
 - 8385051: C2: Factor the shared lane-wise binary op out of VectorBlendNode

Changes: https://git.openjdk.org/jdk/pull/31379/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31379&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8385051
  Stats: 968 lines in 8 files changed: 965 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/31379.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31379/head:pull/31379

PR: https://git.openjdk.org/jdk/pull/31379
