Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]

Andrew Dinn Mon, 26 Jan 2026 05:43:12 -0800

On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <[email protected]> wrote:


>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` 
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%. 
>> 
>> Performance no intrinsic:
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    
>> Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 
>> 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 
>> 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 
>> 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 
>> 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256        
>>       thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256        
>>       thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256        
>>       thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256        
>>       thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  
>> (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256   
>>            EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  
>> (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              
>> EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    
>> Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 
>> 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  
>> 2.584  op...
>
> Ben Perez has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid 
> neon/gpr approach

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7282:

> 7280:     __ mov(mul_ptr, sp);
> 7281: 
> 7282:     __ umullv(A[0], __ T2D, b_lows, __ T2S, a_vals, __ S, 0);

I notice that this 4 line insn pattern recurs several times. You might consider 
defining a macro generator method to generate these sequences i.e.

vs_umullv(VSeq<4> vs, FloatRegister bs, FloatRegister as, int lane_lo) {
    __ umullv(vs[0], __ T2D, bs, __ T2S, as, __S, lane_lo);
    __ umull2v(vs[1], __ T2D, bs, __ T4S. as, __S, lane_lo);
    __ umullv(vs[2], __ T2D, bs, __T2S, as, __S, lane_lo + 2);
    __ umull2v(vs[3], __ T2D, bs, __T4S, as, __S, lane_lo + 2);
}

So, in this case you would call

  vs_umullv(A, b_lows, a_vals, 0);

and in other cases you can simply vary which arguments you provide for `vs`, 
`as`, `bs` and `lane_lo`.

I'm suggesting that because the various cross multiplies generated by this 
macro-method appear to implement a specific subsequence of the mont mul 
computation (similar to what is done in e.g. the kyber code). So, this 
generator method abstraction should also abstract that a clear step in any 
pseudo-code algorithm you provide and ought therefore to make clear that (also 
make clear /how/) the generated code implements that algorithm. If you can 
provide such a pseudo-code algorithm then we might even be able come up with a 
better name for the generator method (e.g. montmul_uproduct or something else 
that reflects what is happening here).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727648074

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]

Reply via email to