On Thu, 28 May 2026 02:32:56 GMT, Xiaohong Gong <[email protected]> wrote:
>> Eric Fang has updated the pull request with a new target base due to a merge
>> or a rebase. The incremental webrev excludes the unrelated changes brought
>> in by the merge/rebase. The pull request contains three additional commits
>> since the last revision:
>>
>> - Implement bitwise_blend in IGVN
>>
>>   The latest changes:
>>
>>   1. Defined a new IR `VectorBitwiseBlendNode`
>>   2. Do the optimization in IGVN:
>>      // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>>      // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>>      //     VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
>>   3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
>>   4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
>> - Merge branch 'master' into JDK-8382052-bitwise-blend
>> - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND
>>   operation with BSL
>>
>>   Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>>   generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>>   provides a more efficient mapping for this operation through the NEON
>>   `BSL` and SVE `BSL` (bitwise select) instructions.
>>
>>   This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
>>   lower them to the dedicated AArch64 instructions for better performance.
>>
>>   The change includes the AArch64 match rules and assembler support,
>>   updates the AArch64 asm tests, adds IR framework nodes for the new mach
>>   instructions, introduces a new jtreg IR test and extends the
>>   MaskedLogicOpts JMH benchmark for 128-bit long type.
>>
>>   JMH results show **11% - 54%** performance improvements for the
>>   optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
>>   SVE2, SVE1, and NEON configurations.
>>
>>   On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>   ```
>>   Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>>   bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>>   bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>>   bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>>   bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>>   bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>>   bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>>   ``...
>
> Looks a reasonable optimization and it looks good to me.

Hi @XiaohongGong, per your suggestion, I switched to IGVN for this optimization; it looks cleaner. Thanks!

Hi @theRealAph, the latest implementation avoids the need to add multiple commutative match rules. Could you please take another look when you have a moment? Thank you!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31269#issuecomment-4620081675
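
For readers who want to check the rewrite quoted above, here is a minimal scalar sketch of the identity behind `XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)`. It is not the C2 code from the PR: the class and method names are hypothetical, and it assumes each lane behaves like a plain `long`, so that the XOR/AND form can be compared with the bitwise-select form `(a & ~sel) | (b & sel)`.

```java
import java.util.Random;

// Hypothetical illustration only; none of these names come from the PR.
public class BitwiseBlendSketch {

    // The generic shape C2 currently sees: a ^ ((a ^ b) & sel).
    static long blendXorForm(long a, long b, long sel) {
        return a ^ ((a ^ b) & sel);
    }

    // Bitwise select: take bits of b where sel is 1, bits of a where sel is 0.
    static long blendSelectForm(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long sel = rnd.nextLong();
            if (blendXorForm(a, b, sel) != blendSelectForm(a, b, sel)) {
                throw new AssertionError("forms differ for a=" + a + " b=" + b + " sel=" + sel);
            }
        }
        System.out.println("XOR/AND form and select form agree on all samples");
    }
}
```

Because the two forms agree bit for bit, a single select-style node can stand in for the three-node XOR/AND/XOR sequence regardless of how the commutative inputs are ordered, which is presumably why matching the shape once in IGVN avoids enumerating the variants in the ad file.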
