Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v11]

Ben Perez Thu, 12 Mar 2026 13:08:37 -0700

> An aarch64 implementation of `MontgomeryIntegerPolynomial256.mult()` and
> `IntegerPolynomial.conditionalAssign()`. Neon does not support 64-bit
> multiplication, and emulating it with 32-bit limbs is slower than using
> general-purpose registers (GPRs), so a hybrid Neon/GPR approach is used:
> Neon instructions compute intermediate values used in the last two
> iterations of the main "loop", while the GPRs compute the first few
> iterations. This improves performance by ~9% at the method level and by
> roughly 5% at the API level.
> 
> Performance without intrinsic (Apple M1):
> 
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
> 
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
> 
> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
> 
> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
> 
> 
> Performance with intrinsic (Apple M1):
> 
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  ops/s
> PolynomialP256Bench.benchSqua...
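
For readers unfamiliar with the limb decomposition mentioned in the quoted
description, here is a minimal Java sketch of building one unsigned
64x64-bit product from four 32x32-bit partial products, which is roughly the
extra work required when only 32-bit multiplies are available. The class and
method names (`LimbMulSketch`, `mul64`) are hypothetical, and this scalar code
is not taken from the PR; it only illustrates the arithmetic.

```java
import java.math.BigInteger;

public class LimbMulSketch {
    private static final long MASK32 = 0xFFFFFFFFL;

    // Returns {low 64 bits, high 64 bits} of the unsigned product x * y,
    // computed from four 32x32 -> 64-bit partial products.
    public static long[] mul64(long x, long y) {
        long x0 = x & MASK32, x1 = x >>> 32;
        long y0 = y & MASK32, y1 = y >>> 32;

        long p00 = x0 * y0;   // contributes at bit 0
        long p01 = x0 * y1;   // contributes at bit 32
        long p10 = x1 * y0;   // contributes at bit 32
        long p11 = x1 * y1;   // contributes at bit 64

        // Middle column: three values below 2^32, so the sum cannot overflow.
        long mid = (p00 >>> 32) + (p01 & MASK32) + (p10 & MASK32);

        long lo = (p00 & MASK32) | (mid << 32);
        long hi = p11 + (p01 >>> 32) + (p10 >>> 32) + (mid >>> 32);
        return new long[] {lo, hi};
    }

    // Reference result via BigInteger, used only to check the sketch.
    public static BigInteger reference(long x, long y) {
        BigInteger bx = new BigInteger(Long.toUnsignedString(x));
        BigInteger by = new BigInteger(Long.toUnsignedString(y));
        return bx.multiply(by);
    }

    // Reassembles {lo, hi} into a single unsigned 128-bit value.
    public static BigInteger combine(long[] r) {
        BigInteger lo = new BigInteger(Long.toUnsignedString(r[0]));
        BigInteger hi = new BigInteger(Long.toUnsignedString(r[1]));
        return hi.shiftLeft(64).add(lo);
    }

    public static void main(String[] args) {
        long x = 0x0123456789ABCDEFL;
        long y = 0xFEDCBA9876543210L;
        System.out.println(combine(mul64(x, y)).equals(reference(x, y)));
    }
}
```

One 64-bit product costs four multiplies plus carry handling here, which is
consistent with the description's remark that doing this with 32-bit limbs is
slower than a GPR multiply.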


Ben Perez has updated the pull request with a new target base due to a merge or 
a rebase. The incremental webrev excludes the unrelated changes brought in by 
the merge/rebase. The pull request contains 14 additional commits since the 
last revision:

 - Merge branch 'master' into aarch64_montmul256
 - fixed assert typo
 - Made register allocation safer, changed assert in umull{2}v
 - Added vs_tail method to simplify various VSeq operations, updated generate_intpoly_assign()
 - added comments to p256 intrinsics, fixed error message in umullv instruction
 - fixed indexing bug in vs_ldpq, simplified vector loads in generate_intpoly_assign()
 - Created subroutine for 32 bit vector multiplication
 - Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid neon/gpr approach
 - fixed assertions in assembler_aarch64.hpp
 - Fixed typo
 - ... and 4 more: https://git.openjdk.org/jdk/compare/e3130da9...909a2bfa
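
As an illustration of what a constant-time conditional assignment does, the
following Java sketch copies `b` into `a` when `set` is 1 and leaves `a`
untouched when `set` is 0, without branching on `set`. It assumes `set` is
always 0 or 1 and that both arrays have the same length; the class name is
hypothetical, and this is not the intrinsic code from the PR.

```java
import java.util.Arrays;

public class ConditionalAssignSketch {
    // Branch-free select: the same instructions run whether set is 0 or 1.
    public static void conditionalAssign(int set, long[] a, long[] b) {
        long mask = -(long) set;            // 0 -> all zero bits, 1 -> all one bits
        for (int i = 0; i < a.length; i++) {
            long diff = mask & (a[i] ^ b[i]);
            a[i] ^= diff;                   // a[i] if mask == 0, b[i] if mask == -1
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4, 5};
        long[] b = {10, 20, 30, 40, 50};

        conditionalAssign(0, a, b);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]

        conditionalAssign(1, a, b);
        System.out.println(Arrays.toString(a)); // [10, 20, 30, 40, 50]
    }
}
```

The masking form matters for elliptic-curve code because a data-dependent
branch on `set` could leak secret bits through timing.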

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27946/files
  - new: https://git.openjdk.org/jdk/pull/27946/files/dc03697f..909a2bfa

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=09-10

  Stats: 1158076 lines in 8994 files changed: 644279 ins; 411302 del; 102495 mod
  Patch: https://git.openjdk.org/jdk/pull/27946.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27946/head:pull/27946

PR: https://git.openjdk.org/jdk/pull/27946
