Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]

Ferenc Rakoczi Fri, 22 May 2026 06:07:07 -0700

On Mon, 18 May 2026 08:29:18 GMT, Andrew Dinn <[email protected]> wrote:


>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Added AOT Code Cache related code + some cosmetic changes
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7738:
> 
>> 7736:   // so four calls with the appropriate parameters will produce the 64-bit
>> 7737:   // low32 * low32, low32 * high32, high32 * low32, high32 * high32
>> 7738:   // values in the output register sequences.
> 
> A little more detail would make it easier to understand this method and help
> to clarify what is happening in the code where it is called.
> Suggestion:
> 
>   // Calls to this function accept either the low 32 bits or high 20 bits
>   // of each b_i packed into bs in ascending order. a_0 and a_1 are packed
>   // into successive 64 bit elements of as. lane selects the low 32 or high
>   // 20 bits of each a_j value. So four calls with the appropriate parameters
>   // will produce the 64-bit low32 * low32, low32 * high20, high20 * low32,
>   // high20 * high20 values in the output register sequences vs. The
>   // 64-bit partial products are returned in vs in ascending order:
>   // vs[0] = (b_0*a_0, b_1*a_0) . . .  vs[3] = (b_2*a_1, b_3*a_1)

Accepted with minor changes.
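
For anyone following the thread, the arithmetic described in that comment can be modelled in scalar code. This is only a sketch: the names `PartialProducts`, `split_mul52` and `recombine` are hypothetical and do not appear in the patch, and it assumes both operands are 52-bit limbs. It shows why four 32-bit-wide multiplies are enough to recover the full product.

```cpp
#include <cstdint>

// Hypothetical scalar model of the four partial products; not code from the patch.
struct PartialProducts {
    uint64_t ll;  // low32(b)  * low32(a)
    uint64_t lh;  // low32(b)  * high20(a)
    uint64_t hl;  // high20(b) * low32(a)
    uint64_t hh;  // high20(b) * high20(a)
};

// Assumes a and b are both below 2^52, so the high parts fit in 20 bits.
inline PartialProducts split_mul52(uint64_t b, uint64_t a) {
    const uint64_t b_lo = b & 0xFFFFFFFFull, b_hi = b >> 32;
    const uint64_t a_lo = a & 0xFFFFFFFFull, a_hi = a >> 32;
    return PartialProducts{b_lo * a_lo, b_lo * a_hi, b_hi * a_lo, b_hi * a_hi};
}

// Recombine into the full product (unsigned __int128 is a GCC/Clang extension).
inline unsigned __int128 recombine(const PartialProducts& p) {
    return (unsigned __int128)p.ll
         + ((unsigned __int128)p.lh << 32)
         + ((unsigned __int128)p.hl << 32)
         + ((unsigned __int128)p.hh << 64);
}
```

Each partial product fits in 64 bits because no factor exceeds 32 bits, which is what lets the stub keep them in 64-bit vector lanes.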

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7820:
> 
>> 7818:     __ mov(c_ptr, sp);
>> 7819: 
>> 7820:     // Calculate limb mask
> 
> Suggestion:
> 
>     // Calculate (52-bit) limb masks for both gpr and vector registers

Accepted.
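
As an illustration of what a 52-bit limb mask does, here is a minimal scalar sketch. The names `kLimbBits`, `kLimbMask` and `reduce_limb` are made up for this example and are not taken from the stub, which computes the masks in registers.

```cpp
#include <cstdint>

// Hypothetical constants for illustration; the stub builds these in registers.
constexpr int kLimbBits = 52;
constexpr uint64_t kLimbMask = (uint64_t{1} << kLimbBits) - 1;

struct LimbAndCarry {
    uint64_t limb;   // low 52 bits of the input
    uint64_t carry;  // bits above position 51, to be added to the next limb
};

// Split a 64-bit accumulator into a 52-bit limb and its carry.
inline LimbAndCarry reduce_limb(uint64_t x) {
    return LimbAndCarry{x & kLimbMask, x >> kLimbBits};
}
```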

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7826:
> 
>> 7824:     //Load input arrays and modulus
>> 7825:     Register a_ptr = *common_regs++, mod_ptr = *common_regs++;
>> 7826:     __ add(a_ptr, a, 24);
> 
> Suggestion:
> 
>     // skip 3 limbs so a_ptr addresses trailing pair {a3, a4}
>     __ add(a_ptr, a, 24);

Accepted.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7838:
> 
>> 7836:     __ ldr(mod_4, mod_ptr);
>> 7837:     __ ld1(a_vals, __ T2D, a_ptr);
>> 7838:     __ ld2(b_lows, b_highs, __ T4S, b);
> 
> Suggestion:
> 
>     // use an interleaved load to group low 32 bits and high 20 bits
>     // of 4 successive b values into two vector registers
>     // n.b. these are the same inputs as the ones in b_0 ... b_4
>     __ ld2(b_lows, b_highs, __ T4S, b);

Accepted.
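
To illustrate the interleaved load, the sketch below models in scalar code what `ld2` with the `T4S` arrangement does to four 64-bit limbs stored little-endian. `Ld2Result` and `ld2_4s_model` are hypothetical names for this example only. The instruction reads eight consecutive 32-bit words and sends even-indexed words to the first register and odd-indexed words to the second.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of the two destination registers; not code from the patch.
struct Ld2Result {
    std::array<uint32_t, 4> first;   // even-indexed words: low 32 bits of each limb
    std::array<uint32_t, 4> second;  // odd-indexed words: high bits of each limb
};

inline Ld2Result ld2_4s_model(const std::array<uint64_t, 4>& b) {
    // Little-endian memory view of the limbs: eight consecutive 32-bit words.
    std::array<uint32_t, 8> words{};
    for (int i = 0; i < 4; ++i) {
        words[2 * i]     = static_cast<uint32_t>(b[i]);
        words[2 * i + 1] = static_cast<uint32_t>(b[i] >> 32);
    }
    // De-interleave: element i of each register comes from words 2i and 2i+1.
    Ld2Result r{};
    for (int i = 0; i < 4; ++i) {
        r.first[i]  = words[2 * i];
        r.second[i] = words[2 * i + 1];
    }
    return r;
}
```

With 52-bit limbs, the second register holds at most 20 significant bits per lane, matching the "high 20 bits" in the comment.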

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7854:
> 
>> 7852:       n = *common_regs++;
>> 7853: 
>> 7854:     VSeq<4> A(16);
> 
> Suggestion:
> 
>     // vector sequences used to compute and combine partial products of
>     // b_i * a_j for i = {0,1,2,3} j = {3,4}
>     VSeq<4> A(16);

Accepted.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288498738
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288499475
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288501682
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288500145
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288500841
