On Tue, 28 Apr 2026 15:42:48 GMT, Andrew Dinn <[email protected]> wrote:
>> Ferenc Rakoczi has updated the pull request with a new target base due to a
>> merge or a rebase. The pull request now contains three commits:
>>
>> - Merged master.
>> - Removing a jar file.
>> - 8355216: Accelerate P-256 arithmetic on aarch64 (revived)
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7838:
>
>> 7836: __ lsr(tmp, low, shift2);
>> 7837: __ orr(high, high, tmp);
>> 7838: __ andr(low, low, limb_mask);
>
> This is a recurring pattern:
>
> __ umulh(hi, a, b);
> __ mul(lo, a, b);
> __ lsl(hi, hi, SHIFT1);
> __ lsr(tmp, lo, SHIFT2);
> __ orr(hi, hi, tmp);
> __ andr(lo, lo, mask);
>
> which could be abstracted as a macro method (picking an arbitrary name that
> you probably want to improve on):
>
> void p256_partial_mul(Register a, Register b, Register hi, Register lo,
>                       Register tmp, Register mask)
>
> You can then simplify the code that processes this limb (likewise for each
> subsequent limb) to make it clearer what is being done to combine the
> results of these macro computations:
>
> __ ldr(a_i, __ post(a, 8));
>
> p256_partial_mul(a_i, b_0, high, low, tmp, limb_mask);
>
> __ andr(n, low, limb_mask);
>
> neon_partial_mult_64(B, b_highs, a_vals, 0);
>
> p256_partial_mul(n, mod_0, mod_high, mod_low, tmp, limb_mask);
>
> __ add(low, low, mod_low);
> __ add(high, high, mod_high);
> __ lsr(c_i, low, shift2);
> __ add(c_i, c_i, high);
>
> Also, note that the function consumes SHIFT1 and SHIFT2, which should be
> defined as compile-time int constants and would be better defined at file
> scope rather than being declared and initialized as local variables.
Very good idea! Thanks a lot!
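
To make the arithmetic behind the proposed helper concrete, here is a minimal
plain C++ model of the six-instruction sequence. It is only a sketch under
stated assumptions: the names PartialProduct and p256_partial_mul_model are
hypothetical, the shift values (52-bit limbs, so SHIFT2 = 52 and SHIFT1 = 12)
are assumed rather than taken from the patch, and unsigned __int128 is a
GCC/Clang extension standing in for the umulh/mul pair.

```cpp
#include <cassert>
#include <cstdint>

// Assumed limb layout: 52-bit limbs. These values are not taken from the patch.
static constexpr int SHIFT2 = 52;            // limb width in bits
static constexpr int SHIFT1 = 64 - SHIFT2;   // bits of 'lo' that belong to 'hi'
static constexpr uint64_t LIMB_MASK = (uint64_t{1} << SHIFT2) - 1;

// Hypothetical result type: the product split at the limb boundary.
struct PartialProduct {
  uint64_t hi;  // product >> SHIFT2
  uint64_t lo;  // product & LIMB_MASK
};

// Plain C++ model of the recurring instruction sequence.
PartialProduct p256_partial_mul_model(uint64_t a, uint64_t b) {
  unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
  uint64_t hi = static_cast<uint64_t>(p >> 64);  // umulh(hi, a, b)
  uint64_t lo = static_cast<uint64_t>(p);        // mul(lo, a, b)
  // The left shift below drops bits unless the product fits in 64 + SHIFT2 bits.
  assert((hi >> SHIFT2) == 0);
  hi <<= SHIFT1;                                 // lsl(hi, hi, SHIFT1)
  uint64_t tmp = lo >> SHIFT2;                   // lsr(tmp, lo, SHIFT2)
  hi |= tmp;                                     // orr(hi, hi, tmp)
  lo &= LIMB_MASK;                               // andr(lo, lo, mask)
  return {hi, lo};
}
```

Under these assumptions, hi * 2^52 + lo reproduces the full product whenever
both inputs are at most 52 bits wide, which is the property the surrounding
limb code relies on when it adds the partial results together.
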
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7958:
>
>> 7956: __ st1(D[2], __ T2D, __ post(mul_ptr, 16));
>> 7957: __ st1(A[3], __ T2D, __ post(mul_ptr, 16));
>> 7958: __ st1(D[3], __ T2D, mul_ptr);
>
> You could usefully abstract this as a VSeq template function
>
> template<int N>
> void vs_st1_interleaved(VSeq<N> A, VSeq<N> B, Register dest) {
>   for (int i = 0; i < N; i++) {
>     __ st1(A[i], __ T2D, __ post(dest, 16));
>     __ st1(B[i], __ T2D, __ post(dest, 16));
>   }
> }
Good idea. Thanks!
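
For reference, the memory layout such a helper would produce can be modelled
in plain C++ as below. This is only an illustration, not HotSpot code: Vec2D
and store_interleaved_model are hypothetical names, each Vec2D stands in for
one 128-bit register viewed as T2D, and the returned vector stands in for the
memory written through the post-incremented pointer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one 128-bit vector register holding two 64-bit lanes.
using Vec2D = std::array<uint64_t, 2>;

// Model of the interleaved store: A[0], B[0], A[1], B[1], ... written
// back to back, 16 bytes (two 64-bit lanes) per store.
template <std::size_t N>
std::vector<uint64_t> store_interleaved_model(const std::array<Vec2D, N>& a,
                                              const std::array<Vec2D, N>& b) {
  std::vector<uint64_t> dest;
  dest.reserve(4 * N);
  for (std::size_t i = 0; i < N; i++) {
    dest.insert(dest.end(), a[i].begin(), a[i].end());  // st1(A[i], T2D, post(dest, 16))
    dest.insert(dest.end(), b[i].begin(), b[i].end());  // st1(B[i], T2D, post(dest, 16))
  }
  return dest;
}
```

One difference from the quoted lines is worth keeping in mind: the final st1
there uses mul_ptr without a post-increment, whereas a helper written as a
uniform loop advances the pointer after every store, so the pointer value
after the call would differ by 16 bytes.
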
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3235772102
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3235769155