Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v2]

Ferenc Rakoczi Wed, 13 May 2026 09:20:09 -0700

On Tue, 28 Apr 2026 15:42:48 GMT, Andrew Dinn <[email protected]> wrote:


>> Ferenc Rakoczi has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains three commits:
>> 
>>  - Merged master.
>>  - Removing a jar file.
>>  - 8355216: Accelerate P-256 arithmetic on aarch64 (revived)
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7838:
> 
>> 7836:     __ lsr(tmp, low, shift2);
>> 7837:     __ orr(high, high, tmp);
>> 7838:     __ andr(low, low, limb_mask);
> 
> This is a recurring pattern:
> 
> __ umulh(hi, a, b);
> __ mul(lo, a, b);
> __ lsl(hi, hi, SHIFT1);
> __ lsr(tmp, lo, SHIFT2);
> __ orr(hi, hi, tmp);
> __ andr(lo, lo, mask);
> 
> which could be abstracted as a macro method (picking an arbitrary name that 
> you probably want to improve on):
> 
> p256_partial_mul(Register a, Register b, Register hi, Register lo, Register 
> tmp, Register mask)
> 
> You can then simplify the code that processes this limb (and likewise each 
> subsequent limb) to make it clearer what is being done to combine the 
> results of these macro computations:
> 
> __ ldr(a_i, __ post(a, 8));
> 
> p256_partial_mul(a_i, b_0, high, low, tmp, limb_mask);
> 
> __ andr(n, low, limb_mask);
> 
> neon_partial_mult_64(B, b_highs, a_vals, 0);
> 
> p256_partial_mul(n, mod_0, mod_high, mod_low, tmp, limb_mask);
> 
> __ add(low, low, mod_low);
> __ add(high, high, mod_high);
> __ lsr(c_i, low, shift2);
> __ add(c_i, c_i, high);
> 
> Also, note that the function consumes SHIFT1 and SHIFT2 which should be 
> defined as final int constants and would be better defined at file scope 
> rather than being declared and initialized as local variables.

Very good idea! Thanks a lot!
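
To make the arithmetic of the suggested helper concrete, here is a minimal host-side sketch in plain C++ of what the six-instruction pattern computes. It is not the stub code. The 52-bit limb width (so SHIFT1 = 12, SHIFT2 = 52) is an assumption that the thread does not state, the names kLimbBits, kShift1, kShift2, kLimbMask, PartialProduct and p256_partial_mul_model are hypothetical, and unsigned __int128 is a GCC/Clang extension used here only to stand in for umulh/mul.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// ASSUMPTION: 52-bit limbs held in 64-bit words; the thread does not give the values.
constexpr int kLimbBits = 52;
constexpr int kShift1 = 64 - kLimbBits;  // plays the role of SHIFT1
constexpr int kShift2 = kLimbBits;       // plays the role of SHIFT2
constexpr u64 kLimbMask = (u64{1} << kLimbBits) - 1;

struct PartialProduct {
  u64 hi;  // bits kLimbBits and above of a * b
  u64 lo;  // low kLimbBits bits of a * b
};

// Hypothetical model of the umulh/mul/lsl/lsr/orr/andr sequence.
inline PartialProduct p256_partial_mul_model(u64 a, u64 b) {
  unsigned __int128 wide = static_cast<unsigned __int128>(a) * b;
  u64 hi = static_cast<u64>(wide >> 64);  // umulh(hi, a, b)
  u64 lo = static_cast<u64>(wide);        // mul(lo, a, b)
  assert((hi >> kShift2) == 0);           // hi must not lose bits in the lsl below
  hi <<= kShift1;                         // lsl(hi, hi, SHIFT1)
  u64 tmp = lo >> kShift2;                // lsr(tmp, lo, SHIFT2)
  hi |= tmp;                              // orr(hi, hi, tmp)
  lo &= kLimbMask;                        // andr(lo, lo, mask)
  return {hi, lo};
}
```

Under these assumptions the result satisfies a * b == hi * 2^52 + lo, so the pair is the product re-split at the limb boundary instead of at bit 64. Declaring the shift constants at namespace scope, as above, mirrors the file-scope suggestion in the review.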

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7958:
> 
>> 7956:     __ st1(D[2], __ T2D, __ post(mul_ptr, 16));
>> 7957:     __ st1(A[3], __ T2D, __ post(mul_ptr, 16));
>> 7958:     __ st1(D[3], __ T2D, mul_ptr);
> 
> You could usefully abstract this as a VSeq template function
> 
>     vs_st1_interleaved(VSeq<N> A, VSeq<N> B, Register dest) {
>       for (int i = 0; i < N; i++) {
>         __ st1(A[i], __ T2D, __ post(dest, 16));
>         __ st1(B[i], __ T2D, __ post(dest, 16));
>       }
>     }

Good idea. Thanks!
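
For illustration, the store order that such a helper would produce can be modelled on the host as follows. This is only a sketch under assumptions: Vec2D stands in for one 2D vector register, plain arrays stand in for VSeq<N>, and the names vs_st1_interleaved_model and interleave_example are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for one vector register viewed as two 64-bit lanes (T2D).
using Vec2D = std::array<std::uint64_t, 2>;

// Hypothetical model: store A[0], B[0], A[1], B[1], ... to dest,
// advancing dest by 16 bytes (two 64-bit words) after every store.
template <std::size_t N>
std::uint64_t* vs_st1_interleaved_model(const std::array<Vec2D, N>& A,
                                        const std::array<Vec2D, N>& B,
                                        std::uint64_t* dest) {
  for (std::size_t i = 0; i < N; i++) {
    dest[0] = A[i][0]; dest[1] = A[i][1]; dest += 2;  // st1(A[i], T2D, post(dest, 16))
    dest[0] = B[i][0]; dest[1] = B[i][1]; dest += 2;  // st1(B[i], T2D, post(dest, 16))
  }
  return dest;  // points just past the last stored lane
}

// Small usage example with N = 2.
inline std::array<std::uint64_t, 8> interleave_example() {
  std::array<Vec2D, 2> A = {{{1, 2}, {5, 6}}};
  std::array<Vec2D, 2> B = {{{3, 4}, {7, 8}}};
  std::array<std::uint64_t, 8> mem{};
  std::uint64_t* end = vs_st1_interleaved_model(A, B, mem.data());
  assert(end == mem.data() + mem.size());
  return mem;
}
```

One difference from the quoted code is visible in the model: the helper post-increments after every store, whereas the original final st1 at line 7958 leaves mul_ptr unchanged. That only matters if mul_ptr is read again afterwards.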

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3235772102
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3235769155
