Re: RFR: 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation with BSL [v2]

Eric Fang Thu, 04 Jun 2026 00:50:54 -0700

On Thu, 28 May 2026 02:32:56 GMT, Xiaohong Gong <[email protected]> wrote:


>> Eric Fang has updated the pull request with a new target base due to a merge 
>> or a rebase. The incremental webrev excludes the unrelated changes brought 
>> in by the merge/rebase. The pull request contains three additional commits 
>> since the last revision:
>> 
>>  - Implement bitwise_blend in IGVN
>>    
>>    The latest changes:
>>    
>>    1. Defined a new IR `VectorBitwiseBlendNode`
>>    2. Do the optimization in IGVN:
>>    // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>>    // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>>    //   VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
>>    
>>    3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
>>    4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
>>  - Merge branch 'master' into JDK-8382052-bitwise-blend
>>  - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND 
>> operation with BSL
>>    
>>    Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>>    generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>>    provides a more efficient mapping for this operation through the NEON
>>    `BSL` and SVE `BSL` (bitwise select) instructions.
>>    
>>    This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
>>    lower them to the dedicated AArch64 instructions for better performance.
>>    
>>    The change includes the AArch64 match rules and assembler support,
>>    updates the AArch64 asm tests, adds IR framework nodes for the new mach
>>    instructions, introduces a new jtreg IR test and extends the
>>    MaskedLogicOpts JMH benchmark for 128-bit long type.
>>    
>>    JMH results show **11% - 54%** performance improvements for the
>>    optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
>>    SVE2, SVE1, and NEON configurations.
>>    
>>    On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
>>    ```
>>    Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>>    bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>>    bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>>    bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>>    bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>>    bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>>    bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>>    ``...
>
> Looks like a reasonable optimization, and it looks good to me.

Hi @XiaohongGong, per your suggestion, I switched to IGVN for this 
optimization; it looks cleaner. Thanks!
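
For readers following along, the pattern the IGVN transform matches, 
`XorV(a, AndV(sel, XorV(a, b)))`, is a bit-select: each result bit comes from 
`b` where `sel` is 1 and from `a` where `sel` is 0. The sketch below checks 
that identity on scalar `long` values and models the masked form one lane at a 
time. It is only an illustration: the class and method names are hypothetical, 
plain Java stands in for the C2 nodes, and the masked model assumes that lanes 
with the mask unset keep the value of `a`.

```java
import java.util.Random;

public class BitwiseBlendSketch {
    // XOR/AND form, the shape named in the IGVN pattern: a ^ (sel & (a ^ b)).
    public static long xorAndForm(long a, long b, long sel) {
        return a ^ (sel & (a ^ b));
    }

    // Select form: b's bits where sel is 1, a's bits where sel is 0.
    public static long selectForm(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    // Per-lane model of the masked pattern (assumption: unset lanes keep a).
    public static long[] maskedBlend(long[] a, long[] b, long[] sel, boolean[] mask) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? xorAndForm(a[i], b[i], sel[i]) : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong(), b = rnd.nextLong(), sel = rnd.nextLong();
            if (xorAndForm(a, b, sel) != selectForm(a, b, sel)) {
                throw new AssertionError("forms disagree");
            }
        }
        System.out.println("forms agree");
    }
}
```
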

Hi @theRealAph, the latest implementation avoids the need to add multiple 
commutative match rules.
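
As a rough illustration of why operand order matters here: both XOR and AND 
are commutative, so the same blend can be written in eight operand orderings, 
all of which compute the same value. The sketch below enumerates them on 
scalars. The class and method names are hypothetical, and reading "multiple 
commutative match rules" as referring to these orderings is an assumption, not 
something stated in the PR.

```java
public class BlendOperandOrders {
    // All 2 x 2 x 2 operand orderings of a ^ (sel & (a ^ b)).
    public static long[] allOrderings(long a, long b, long sel) {
        return new long[] {
            a ^ (sel & (a ^ b)),
            a ^ (sel & (b ^ a)),
            a ^ ((a ^ b) & sel),
            a ^ ((b ^ a) & sel),
            (sel & (a ^ b)) ^ a,
            (sel & (b ^ a)) ^ a,
            ((a ^ b) & sel) ^ a,
            ((b ^ a) & sel) ^ a,
        };
    }

    // True when every ordering yields the same result as the first one.
    public static boolean allEqual(long a, long b, long sel) {
        long[] v = allOrderings(a, b, sel);
        for (long x : v) {
            if (x != v[0]) {
                return false;
            }
        }
        return true;
    }
}
```
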

Please take another look when you have a moment. Thank you!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31269#issuecomment-4620081675
