On Mon, 27 Apr 2026 12:22:30 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult()
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit
>> multiplication is not supported on Neon and manually performing this
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr
>> approach is used. Neon instructions are used to compute intermediate values
>> used in the last two iterations of the main "loop", while the GPRs compute
>> the first few iterations. At the method level this improves performance by
>> ~9% and at the API level roughly 5%.
>>
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a
> merge or a rebase. The pull request now contains three commits:
>
>  - Merged master.
>  - Removing a jar file.
>  - 8355216: Accelerate P-256 arithmetic on aarch64 (revived)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 13751:

> 13749:     if (UseKyberIntrinsics) {
> 13750:       StubRoutines::_kyberNtt = generate_kyberNtt();
> 13751:       StubRoutines::_kyberInverseNtt = generate_kyberInverseNtt();

This is a recurring pattern:

    __ umulh(hi, a, b);
    __ mul(lo, a, b);
    __ lsl(hi, hi, SHIFT1);
    __ lsr(tmp, lo, SHIFT2);
    __ orr(hi, hi, tmp);
    __ andr(lo, lo, mask);

which could be abstracted as a macro method (picking an arbitrary name that
you probably want to improve on):

    p256_partial_mul(Register a, Register b, Register hi, Register lo, Register tmp, Register mask)

You can then simplify the code that processes this limb (and likewise each
subsequent limb) to make it clearer what is being done to combine the results
of these macro computations:

    __ ldr(a_i, __ post(a, 8));
    p256_partial_mul(a_i, b_0, high, low, tmp, limb_mask);
    __ andr(n, low, limb_mask);
    neon_partial_mult_64(B, b_highs, a_vals, 0);
    p256_partial_mul(n, mod_0, mod_high, mod_low, tmp, limb_mask);
    __ add(low, low, mod_low);
    __ add(high, high, mod_high);
    __ lsr(c_i, low, shift2);
    __ add(c_i, c_i, high);

Also, note that the function consumes SHIFT1 and SHIFT2, which should be
defined as final int constants. They would be better defined at file scope
than declared and initialized as local variables.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3155280621
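For reference, the arithmetic performed by the recurring umulh/mul/lsl/lsr/orr/andr sequence above can be modelled in plain C++. This is only a sketch under stated assumptions: the values SHIFT1 = 12 and SHIFT2 = 52 (52-bit limbs, so that SHIFT1 + SHIFT2 == 64) are not quoted in the review and are assumed here, and `PartialProduct` and `p256_partial_mul_model` are hypothetical names that do not appear in the PR. It relies on the GCC/Clang `unsigned __int128` extension.

```cpp
#include <cstdint>

// Assumed limb layout: 52-bit limbs. These constants are illustrative
// assumptions; the PR's actual values are not quoted in the review.
constexpr int SHIFT1 = 12;
constexpr int SHIFT2 = 52;
constexpr uint64_t LIMB_MASK = (uint64_t{1} << SHIFT2) - 1;

// Hypothetical result type: the 128-bit product re-split at bit SHIFT2.
struct PartialProduct {
    uint64_t hi;  // product bits from SHIFT2 upwards
    uint64_t lo;  // low SHIFT2 bits of the product
};

// Scalar model of the six-instruction pattern. Each step is annotated with
// the aarch64 instruction it mirrors. The result is exact only while the
// product fits in 64 + SHIFT2 bits, which holds for SHIFT2-bit inputs.
inline PartialProduct p256_partial_mul_model(uint64_t a, uint64_t b) {
    unsigned __int128 full = static_cast<unsigned __int128>(a) * b;
    uint64_t hi = static_cast<uint64_t>(full >> 64);  // umulh hi, a, b
    uint64_t lo = static_cast<uint64_t>(full);        // mul   lo, a, b
    hi <<= SHIFT1;                                    // lsl   hi, hi, SHIFT1
    uint64_t tmp = lo >> SHIFT2;                      // lsr   tmp, lo, SHIFT2
    hi |= tmp;                                        // orr   hi, hi, tmp
    lo &= LIMB_MASK;                                  // andr  lo, lo, mask
    return {hi, lo};
}
```

Under these assumptions, `hi` equals the full product shifted right by SHIFT2 and `lo` equals the product masked to SHIFT2 bits, which is the contract a helper such as the suggested p256_partial_mul would need to document.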
