Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v2]

Andrew Dinn Tue, 28 Apr 2026 08:27:59 -0700

On Mon, 27 Apr 2026 12:22:30 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains three commits:
> 
>  - Merged master.
>  - Removing a jar file.
>  - 8355216: Accelerate P-256 arithmetic on aarch64 (revived)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 13751:

> 13749:     if (UseKyberIntrinsics) {
> 13750:       StubRoutines::_kyberNtt = generate_kyberNtt();
> 13751:       StubRoutines::_kyberInverseNtt = generate_kyberInverseNtt();

This is a recurring pattern:

    __ umulh(hi, a, b);
    __ mul(lo, a, b);
    __ lsl(hi, hi, SHIFT1);
    __ lsr(tmp, lo, SHIFT2);
    __ orr(hi, hi, tmp);
    __ andr(lo, lo, mask);

which could be abstracted as a macro method (picking an arbitrary name that you 
probably want to improve on):

    p256_partial_mul(Register a, Register b, Register hi, Register lo,
                     Register tmp, Register mask)
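
To make the intent of the sequence concrete, here is a minimal C++ model of the value such a macro method would compute. It is only a sketch, not code from the patch: it assumes 52-bit limbs (so SHIFT1 = 12 and SHIFT2 = 52), the names BITS_PER_LIMB, LIMB_MASK, PartialProduct, p256_partial_mul_model and split_at_limb_boundary are hypothetical, and it relies on the unsigned __int128 extension provided by GCC and Clang.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values: 52-bit limbs held in 64-bit registers.
constexpr int BITS_PER_LIMB = 52;
constexpr int SHIFT1 = 64 - BITS_PER_LIMB;  // 12
constexpr int SHIFT2 = BITS_PER_LIMB;       // 52
constexpr uint64_t LIMB_MASK = (uint64_t{1} << BITS_PER_LIMB) - 1;

// Hypothetical result type: the product split at the limb boundary.
struct PartialProduct {
  uint64_t hi;  // product bits from BITS_PER_LIMB upwards
  uint64_t lo;  // low BITS_PER_LIMB bits of the product
};

// Mirrors the six-instruction sequence one step at a time.
inline PartialProduct p256_partial_mul_model(uint64_t a, uint64_t b) {
  unsigned __int128 full = static_cast<unsigned __int128>(a) * b;
  uint64_t hi = static_cast<uint64_t>(full >> 64);  // umulh(hi, a, b)
  uint64_t lo = static_cast<uint64_t>(full);        // mul(lo, a, b)
  hi <<= SHIFT1;                                    // lsl(hi, hi, SHIFT1)
  uint64_t tmp = lo >> SHIFT2;                      // lsr(tmp, lo, SHIFT2)
  hi |= tmp;                                        // orr(hi, hi, tmp)
  lo &= LIMB_MASK;                                  // andr(lo, lo, mask)
  return {hi, lo};
}

// Reference: split the 128-bit product directly at the limb boundary.
inline PartialProduct split_at_limb_boundary(uint64_t a, uint64_t b) {
  unsigned __int128 full = static_cast<unsigned __int128>(a) * b;
  return {static_cast<uint64_t>(full >> BITS_PER_LIMB),
          static_cast<uint64_t>(full) & LIMB_MASK};
}
```

Under these assumptions the two functions return the same pair for any inputs, which is the contract a single well-named macro method would document in one place.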

You can then simplify the code that processes this limb (and likewise each 
subsequent limb) to make it clearer what is being done to combine the results 
of these macro computations:

    __ ldr(a_i, __ post(a, 8));

    p256_partial_mul(a_i, b_0, high, low, tmp, limb_mask);

    __ andr(n, low, limb_mask);

    neon_partial_mult_64(B, b_highs, a_vals, 0);

    p256_partial_mul(n, mod_0, mod_high, mod_low, tmp, limb_mask);

    __ add(low, low, mod_low);
    __ add(high, high, mod_high);
    __ lsr(c_i, low, shift2);
    __ add(c_i, c_i, high);

Also, note that the macro method consumes SHIFT1 and SHIFT2, which should be 
defined as constants (const int) at file scope rather than being declared and 
initialized as local variables.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3155280621
