> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
> method and `IntegerPolynomial.conditionalAssign()`. Since Neon does not
> support 64-bit multiplication, and performing this operation manually with
> 32-bit limbs is slower than using GPRs, a hybrid Neon/GPR approach is used.
> Neon instructions compute intermediate values used in the last two
> iterations of the main "loop", while the GPRs compute the first few
> iterations. This improves performance by ~9% at the method level and by
> roughly 5% at the API level.
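
To illustrate the point in the quoted description about 32-bit limbs, here is a minimal Java sketch (not code from the pull request) of an unsigned 64x64 -> 128-bit product built only from 32x32 -> 64-bit partial products, which is the kind of work a lane-wise widening multiply has to do, next to the same product obtained from full 64-bit multiplies. The class and method names (`LimbMulSketch`, `mulWide32`, `mulWide64`) are hypothetical and chosen for illustration only.

```java
/** Hypothetical illustration only; not the intrinsic from the pull request. */
final class LimbMulSketch {
    private static final long MASK32 = 0xFFFFFFFFL;

    /**
     * Unsigned 64x64 -> 128-bit product using only 32x32 -> 64-bit
     * partial products. Returns { low 64 bits, high 64 bits }.
     */
    static long[] mulWide32(long a, long b) {
        long a0 = a & MASK32, a1 = a >>> 32;
        long b0 = b & MASK32, b1 = b >>> 32;

        // Four partial products; each fits in an unsigned 64-bit value.
        long p00 = a0 * b0;
        long p01 = a0 * b1;
        long p10 = a1 * b0;
        long p11 = a1 * b1;

        // Middle column: three terms below 2^32 each, so no 64-bit overflow.
        long mid = (p00 >>> 32) + (p01 & MASK32) + (p10 & MASK32);

        long lo = (p00 & MASK32) | (mid << 32);
        long hi = p11 + (p01 >>> 32) + (p10 >>> 32) + (mid >>> 32);
        return new long[] { lo, hi };
    }

    /**
     * The same product from full 64-bit multiplies: one for the low half
     * and one (Math.multiplyHigh, corrected to unsigned) for the high half.
     */
    static long[] mulWide64(long a, long b) {
        long lo = a * b;
        // Convert the signed high half to the unsigned high half.
        long hi = Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
        return new long[] { lo, hi };
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFF00000001L; // arbitrary sample operands
        long b = 0x123456789ABCDEF0L;
        long[] viaLimbs = mulWide32(a, b);
        long[] viaGpr = mulWide64(a, b);
        System.out.println("limbs match 64-bit multiply: "
                + (viaLimbs[0] == viaGpr[0] && viaLimbs[1] == viaGpr[1]));
    }
}
```

The limb version needs four multiplies plus carry handling for a single 64-bit product, whereas the 64-bit version needs two multiplies, which is consistent with the description's choice to keep part of the work in GPRs.
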
>
> Performance without intrinsic (Apple M1):
>
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ±  4.092  ops/s
>
>
> Performance with intrinsic (Apple M1):
>
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  ops/s
> PolynomialP256Bench.benchSqua...

Ben Perez has updated the pull request with a new target base due to a merge or
a rebase. The incremental webrev excludes the unrelated changes brought in by
the merge/rebase. The pull request contains 14 additional commits since the
last revision:
- Merge branch 'master' into aarch64_montmul256
- fixed assert typo
- Made register allocation safer, changed assert in umull{2}v
- Added vs_tail method to simplify various VSeq operations, updated
  generate_intpoly_assign()
- added comments to p256 intrinsics, fixed error message in umullv instruction
- fixed indexing bug in vs_ldpq, simplified vector loads in
  generate_intpoly_assign()
- Created subroutine for 32 bit vector multiplication
- Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid
  neon/gpr approach
- fixed assertions in assembler_aarch64.hpp
- Fixed typo
- ... and 4 more: https://git.openjdk.org/jdk/compare/e3130da9...909a2bfa
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/27946/files
- new: https://git.openjdk.org/jdk/pull/27946/files/dc03697f..909a2bfa
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=10
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=09-10
Stats: 1158076 lines in 8994 files changed: 644279 ins; 411302 del; 102495 mod
Patch: https://git.openjdk.org/jdk/pull/27946.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27946/head:pull/27946
PR: https://git.openjdk.org/jdk/pull/27946