On Tue, 10 Mar 2026 15:46:02 GMT, Ruben <[email protected]> wrote:

>> Bhavana Kilambi has updated the pull request with a new target base due to a
>> merge or a rebase. The pull request now contains 10 commits:
>>
>>  - Address review feedback
>>  - merge from main
>>  - Merge commit '9f13ec1ccb684398e311b5f139773ca9f39561fe' into HEAD
>>  - Address review comments for the JTREG test and microbenchmark
>>  - Merge branch 'master'
>>  - Address review comments
>>  - Fix build failures on Mac
>>  - Address review comments
>>  - Merge 'master'
>>  - 8366444: Add support for add/mul reduction operations for Float16
>>
>> This patch adds mid-end support for vectorized add/mul reduction
>> operations for half floats. It also includes aarch64 backend support for
>> these operations. Only vectorization through autovectorization is
>> added, as the Vector API currently does not support Float16 vector
>> species.
>>
>> Both add and mul reductions vectorized through autovectorization
>> mandate a strictly ordered implementation. Each of these reductions is
>> implemented as follows on the different aarch64 targets:
>>
>> For AddReduction:
>> On Neon-only targets (UseSVE = 0): generates scalarized additions
>> using the scalar "fadd" instruction for both 8B and 16B vector lengths,
>> because Neon does not provide a direct instruction for computing a
>> strictly ordered floating-point add reduction.
>>
>> On SVE targets (UseSVE > 0): generates the "fadda" instruction, which
>> computes the floating-point add reduction in strict order.
>>
>> For MulReduction:
>> Neither Neon nor SVE provides a direct instruction for computing a
>> strictly ordered floating-point multiply reduction. For vector lengths
>> of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
>> generated; multiply reduction for vector lengths > 16B is not
>> supported.
>>
>> Below is the performance of the two newly added microbenchmarks in
>> Float16OperationsBenchmark.java, tested on three different aarch64
>> machines and with varying MaxVectorSize:
>>
>> Note: On all machines, the score (ops/ms) is compared with the master
>> branch without this patch, which generates a sequence of loads ("ldrsh")
>> to load the FP16 value into an FPR and a scalar "fadd"/"fmul" to
>> add/multiply the loaded value into the running sum/product. The ratios
>> given below are the ratios between the throughput with this patch and
>> the throughput without it.
>> Ratio > 1 indicat...
>
> @eme64, could you suggest whether any further changes or clarification
> are needed?
@ruben-arm Thanks for the ping! The code looks good to me now :) I would like to do some internal testing.

@ruben-arm @Bhavana-Kilambi @yiwu0b11 Could you please merge with the newest master, so I can run my testing script?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-4032779161
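As background to the strict-ordering requirement in the quoted description: floating-point addition is not associative, so an unordered (e.g. pairwise) vector reduction can produce a different result than the sequential scalar loop it replaces. The following standalone Java sketch (not part of the patch; class and values are illustrative only) demonstrates this with float; the same argument applies to Float16.

```java
// Illustrative only: shows why a vectorized FP add reduction must be
// strictly ordered to match the scalar loop's semantics.
public class FpReductionOrder {
    public static void main(String[] args) {
        float[] v = {1e8f, 1.0f, -1e8f, 1.0f};

        // Strictly ordered (left-to-right) sum, as the scalar loop
        // computes it. 1e8f + 1.0f rounds back to 1e8f, the -1e8f
        // cancels it, and the final 1.0f survives.
        float ordered = 0.0f;
        for (float x : v) {
            ordered += x;
        }

        // A reassociated (pairwise) sum, as an unordered vector
        // reduction might compute it. Both partial sums round to
        // +/-1e8f, which then cancel to zero.
        float pairwise = (v[0] + v[1]) + (v[2] + v[3]);

        System.out.println("ordered  = " + ordered);   // prints 1.0
        System.out.println("pairwise = " + pairwise);  // prints 0.0
    }
}
```

This divergence is why, per the quoted description, SVE's "fadda" (strictly ordered) is used where available, and the reduction is otherwise scalarized rather than computed with an unordered tree reduction.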
