On Mon, 27 Apr 2026 12:22:30 GMT, Ferenc Rakoczi <[email protected]> wrote:

>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid Neon/GPR 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains three commits:
> 
>  - Merged master.
>  - Removing a jar file.
>  - 8355216: Accelerate P-256 arithmetic on aarch64 (revived)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7958:

> 7956:     __ st1(D[2], __ T2D, __ post(mul_ptr, 16));
> 7957:     __ st1(A[3], __ T2D, __ post(mul_ptr, 16));
> 7958:     __ st1(D[3], __ T2D, mul_ptr);

You could usefully abstract this as a VSeq template function:

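    // Stores A[0], B[0], A[1], B[1], ... to consecutive 16-byte slots.
    // Note: dest is post-incremented after every store, including the
    // last one, whereas the current code leaves the final store unindexed.
    template<int N>
    void vs_st1_interleaved(VSeq<N> A, VSeq<N> B, Register dest) {
      for (int i = 0; i < N; i++) {
        __ st1(A[i], __ T2D, __ post(dest, 16));
        __ st1(B[i], __ T2D, __ post(dest, 16));
      }
    }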
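
To make the intended memory layout concrete, here is a rough standalone model of the store order such a helper would produce. It is only a sketch and does not use the HotSpot assembler: `MockVSeq`, `Store`, `model_st1_interleaved` and `example_layout` are hypothetical names invented for illustration, and the register numbers are made up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vector register sequence: N register numbers.
template<int N>
struct MockVSeq {
  std::array<int, N> regs;
  int operator[](int i) const { return regs[i]; }
};

// One recorded store: which register was written at which byte offset.
struct Store {
  int reg;
  std::size_t offset;
};

// Models the interleaved helper: record a[i] then b[i], advancing the
// destination offset by 16 bytes (one 2D vector) after every store.
template<int N>
std::vector<Store> model_st1_interleaved(const MockVSeq<N>& a,
                                         const MockVSeq<N>& b,
                                         std::size_t& dest) {
  std::vector<Store> out;
  for (int i = 0; i < N; i++) {
    out.push_back({a[i], dest});
    dest += 16;
    out.push_back({b[i], dest});
    dest += 16;
  }
  return out;
}

// Example with two sequences of four registers (made-up numbering).
inline std::vector<Store> example_layout() {
  MockVSeq<4> a{{0, 1, 2, 3}};
  MockVSeq<4> d{{16, 17, 18, 19}};
  std::size_t dest = 0;
  std::vector<Store> stores = model_st1_interleaved(a, d, dest);
  assert(dest == 128);  // 8 stores of 16 bytes each
  return stores;
}
```

In this model the destination ends up 128 bytes past its starting value, so any code after the real helper that still relies on the unincremented final address would need adjusting.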

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3155461312
