On Thu, 4 Jun 2026 07:46:10 GMT, Eric Fang <[email protected]> wrote:

>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
>> provides a more efficient mapping for this operation through the NEON `BSL` 
>> and SVE `BSL` (bitwise select) instructions.
>> 
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
>> them to the dedicated AArch64 instructions for better performance.
>> 
>> The change includes the AArch64 match rules and assembler support, updates 
>> the AArch64 asm tests, adds IR framework nodes for the new mach 
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts 
>> JMH benchmark for 128-bit long type.
>> 
>> JMH results show **11% - 54%** performance improvements for the optimized 
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
>> NEON configurations.
>> 
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>> 
>> 
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>> 
>> 
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128      ops/s       512.00   1145.50 ...
>
> Eric Fang has updated the pull request with a new target base due to a merge 
> or a rebase. The incremental webrev excludes the unrelated changes brought in 
> by the merge/rebase. The pull request contains three additional commits since 
> the last revision:
> 
>  - Implement bitwise_blend in IGVN
>    
>    The latest changes:
>    
>    1. Defined a new IR `VectorBitwiseBlendNode`
>    2. Do the optimization in IGVN:
>    // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>    // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>    //   VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
>    
>    3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
>    4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
>  - Merge branch 'master' into JDK-8382052-bitwise-blend
>  - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation 
> with BSL
>    
>    Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>    generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>    provides a more efficient mapping for this operation through the NEON
>    `BSL` and SVE `BSL` (bitwise select) instructions.
>    
>    This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
>    lower them to the dedicated AArch64 instructions for better performance.
>    
>    The change includes the AArch64 match rules and assembler support,
>    updates the AArch64 asm tests, adds IR framework nodes for the new mach
>    instructions, introduces a new jtreg IR test and extends the
>    MaskedLogicOpts JMH benchmark for 128-bit long type.
>    
>    JMH results show **11% - 54%** performance improvements for the
>    optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
>    SVE2, SVE1, and NEON configurations.
>    
>    On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>    ```
>    Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>    bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>    bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>    bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>    bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>    bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>    bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>    ```
>    
>    On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>    ```
>    Benchmar...

I also reran the JMH benchmarks, and the results were essentially the same 
as for the first commit.
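
For anyone reading along, the IGVN rewrite quoted above rests on the identity 
`a ^ (sel & (a ^ b)) == (a & ~sel) | (b & sel)`. Below is a minimal scalar Java 
sketch of that identity and of the masked variant. It is not code from the PR: 
the class and method names (`BitwiseBlendSketch`, `xorForm`, `selectForm`, 
`maskedBlend`) are made up for illustration, each `long` stands in for one 
64-bit lane, and the masked case assumes the usual semantics where lanes with 
an unset mask bit keep the value of `a`.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
class BitwiseBlendSketch {

    // The XOR/AND shape from the quoted rule: XorV(a, AndV(sel, XorV(a, b))).
    static long xorForm(long a, long b, long sel) {
        return a ^ (sel & (a ^ b));
    }

    // Bitwise select: take the bits of b where sel is 1, the bits of a elsewhere.
    static long selectForm(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    // Masked variant, modelled per lane: lanes whose mask is set receive the
    // bitwise blend, all other lanes keep a (assumed masked-lanewise semantics).
    static long[] maskedBlend(long[] a, long[] b, long[] sel, boolean[] mask) {
        long[] result = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = mask[i] ? selectForm(a[i], b[i], sel[i]) : a[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Compare the two forms on random inputs (fixed seed for repeatability).
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long sel = rnd.nextLong();
            if (xorForm(a, b, sel) != selectForm(a, b, sel)) {
                throw new AssertionError("forms differ for a=" + a + " b=" + b + " sel=" + sel);
            }
        }
        System.out.println("xorForm and selectForm agree on all sampled inputs");
    }
}
```

The random check is only a sanity test. The identity itself holds bit by bit: 
where a `sel` bit is 0 the XOR form leaves the bit of `a` unchanged, and where 
it is 1 the result is `a ^ a ^ b`, which is the bit of `b`.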

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31269#issuecomment-4620092713
