On Mon, 29 Dec 2025 17:39:42 GMT, Bhavana Kilambi <[email protected]> wrote:
>> This patch adds mid-end support for vectorized add/mul reduction operations
>> for half floats. It also includes backend aarch64 support for these
>> operations. Only vectorization support through autovectorization is added as
>> VectorAPI currently does not support Float16 vector species.
>>
>> Both add and mul reductions vectorized through autovectorization mandate a
>> strictly ordered implementation. The following describes how each of these
>> reductions is implemented on different aarch64 targets -
>>
>> **For AddReduction :**
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the
>> scalar `fadd` instruction for both 8B and 16B vector lengths. This is
>> because Neon does not provide a direct instruction for computing a strictly
>> ordered floating-point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which
>> computes add reduction for floating point in strict order.
>>
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing a strictly
>> ordered floating-point multiply reduction. For vector lengths of 8B and 16B,
>> a scalarized sequence of scalar `fmul` instructions is generated; multiply
>> reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in
>> `Float16OperationsBenchmark.java` tested on three different aarch64 machines
>> and with varying `MaxVectorSize` -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master branch
>> without this patch which generates a sequence of loads (`ldrsh`) to load the
>> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded
>> value to the running sum/product. The ratios given below compare the
>> throughput with this patch to the throughput without it; a ratio > 1
>> indicates that this patch performs better than the master branch.
>>
>> **N1 (UseSVE = 0, max vector length = 16B):**
>>
>> Benchmark         vectorDim  Mode  Cnt    8B   16B
>> ReductionAddFP16        256  thrpt   9  1.41  1.40
>> ReductionAddFP16        512  thrpt   9  1.41  1.41
>> ReductionAddFP16       1024  thrpt   9  1.43  1.40
>> ReductionAddFP16       2048  thrpt   9  1.43  1.40
>> ReductionMulFP16        256  thrpt   9  1.22  1.22
>> ReductionMulFP16        512  thrpt   9  1.21  1.23
>> ReductionMulFP16       1024  thrpt   9  1.21  1.22
>> ReductionMulFP16       2048  thrpt   9  1.20  1.22
>>
>>
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> Bhavana Kilambi has updated the pull request with a new target base due to a
> merge or a rebase. The pull request now contains seven commits:
>
> - Address review comments for the JTREG test and microbenchmark
> - Merge branch 'master'
> - Address review comments
> - Fix build failures on Mac
> - Address review comments
> - Merge 'master'
> - 8366444: Add support for add/mul reduction operations for Float16
>
> This patch adds mid-end support for vectorized add/mul reduction
> operations for half floats. It also includes backend aarch64 support for
> these operations. Only vectorization support through autovectorization
> is added as VectorAPI currently does not support Float16 vector species.
>
> Both add and mul reductions vectorized through autovectorization mandate
> a strictly ordered implementation. The following describes how each of
> these reductions is implemented on different aarch64 targets -
>
> For AddReduction :
> On Neon-only targets (UseSVE = 0): Generates scalarized additions
> using the scalar "fadd" instruction for both 8B and 16B vector lengths.
> This is because Neon does not provide a direct instruction for computing
> a strictly ordered floating-point add reduction.
>
> On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
> computes add reduction for floating point in strict order.
>
> For MulReduction :
> Neither Neon nor SVE provides a direct instruction for computing
> a strictly ordered floating-point multiply reduction. For vector lengths
> of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
> generated; multiply reduction for vector lengths > 16B is not
> supported.
>
> Below is the performance of the two newly added microbenchmarks in
> Float16OperationsBenchmark.java tested on three different aarch64
> machines and with varying MaxVectorSize -
>
> Note: On all machines, the score (ops/ms) is compared with the master
> branch without this patch which generates a sequence of loads ("ldrsh")
> to load the FP16 value into an FPR and a scalar "fadd/fmul" to
> add/multiply the loaded value to the running sum/product. The ratios
> given below compare the throughput with this patch to the throughput
> without it; a ratio > 1 indicates that this patch performs better than
> the master branch.
>
> N1 (UseSVE = 0, max vector length = 16B):
> Benchmark vectorDim Mode Cnt 8B 16B
> ReductionAddFP16 256 th...
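For context, the pattern being auto-vectorized here is a simple reduction loop with a loop-carried accumulator. A sketch in plain Java (`float` is used for illustration only; the actual benchmark operates on `Float16` values):

```java
public class ReductionShape {
    // Hypothetical sketch of the reduction shape the benchmark exercises;
    // the real code uses Float16 values, plain float is used here so the
    // sketch compiles on any JDK.
    public static float reductionAdd(float[] arr) {
        float sum = 0.0f;
        for (int i = 0; i < arr.length; i++) {
            // Loop-carried dependence on sum: the vectorized form must
            // preserve this strict left-to-right order.
            sum += arr[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reductionAdd(new float[]{0.5f, 1.5f, 2.0f})); // prints 4.0
    }
}
```

This loop-carried dependence is why the patch emits `fadda` on SVE targets (which reduces in order by definition) and a chain of scalar `fadd` instructions on Neon-only targets.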
Here are the new benchmark results:
Neoverse N1 (UseSVE = 0, max vector length = 16B):

Benchmark         vectorDim  Mode  Cnt    8B   16B
ReductionAddFP16        256  thrpt   9  1.17  1.21
ReductionAddFP16        512  thrpt   9  1.17  1.18
ReductionAddFP16       1024  thrpt   9  1.18  1.17
ReductionAddFP16       2048  thrpt   9  1.19  1.16
ReductionMulFP16        256  thrpt   9  1.03  1.04
ReductionMulFP16        512  thrpt   9  1.02  1.03
ReductionMulFP16       1024  thrpt   9  1.01  1.02
ReductionMulFP16       2048  thrpt   9  1.01  1.01

Neoverse V1 (UseSVE = 1, max vector length = 32B):

Benchmark         vectorDim  Mode  Cnt    8B   16B   32B
ReductionAddFP16        256  thrpt   9  1.12  1.75  1.95
ReductionAddFP16        512  thrpt   9  1.07  1.64  1.87
ReductionAddFP16       1024  thrpt   9  1.05  1.59  1.78
ReductionAddFP16       2048  thrpt   9  1.04  1.56  1.74
ReductionMulFP16        256  thrpt   9  1.12  1.12  1.11
ReductionMulFP16        512  thrpt   9  1.04  1.05  1.05
ReductionMulFP16       1024  thrpt   9  1.02  1.02  0.99
ReductionMulFP16       2048  thrpt   9  1.01  1.01  1.00

Neoverse V2 (UseSVE = 2, max vector length = 16B):

Benchmark         vectorDim  Mode  Cnt    8B   16B
ReductionAddFP16        256  thrpt   9  1.16  1.70
ReductionAddFP16        512  thrpt   9  1.07  1.61
ReductionAddFP16       1024  thrpt   9  1.03  1.53
ReductionAddFP16       2048  thrpt   9  1.02  1.50
ReductionMulFP16        256  thrpt   9  1.18  1.18
ReductionMulFP16        512  thrpt   9  1.08  1.07
ReductionMulFP16       1024  thrpt   9  1.04  1.04
ReductionMulFP16       2048  thrpt   9  1.02  1.01
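The strict-ordering requirement discussed above matters because floating-point addition is not associative: reassociating a reduction, as an unordered vector reduction would, can change the result. A minimal demonstration in plain Java (`float` is shown for portability; FP16 arithmetic has the same property):

```java
public class StrictOrderDemo {
    public static void main(String[] args) {
        // At magnitude 1e8f the spacing between adjacent floats is 8,
        // so adding 1.0f to 1e8f is absorbed and changes nothing.
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};

        // Strictly ordered reduction: (((a[0] + a[1]) + a[2]) + a[3])
        float strict = 0.0f;
        for (float v : a) {
            strict += v;
        }

        // Reassociated, as an unordered vector reduction might compute it:
        // (a[0] + a[2]) + (a[1] + a[3])
        float reassoc = (a[0] + a[2]) + (a[1] + a[3]);

        System.out.println(strict);   // prints 1.0 (one 1.0f was absorbed)
        System.out.println(reassoc);  // prints 2.0
    }
}
```

This is why the autovectorized form must keep the scalar order, either via SVE's in-order `fadda` or a scalarized `fadd`/`fmul` chain on Neon.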
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3861614647