On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <[email protected]> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit
>> multiplication is not supported on Neon and manually performing this
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr
>> approach is used. Neon instructions are used to compute intermediate values
>> used in the last two iterations of the main "loop", while the GPRs compute
>> the first few iterations. At the method level this improves performance by
>> ~9% and at the API level roughly 5%.
>>
>> Performance no intrinsic:
>>
>> Benchmark (isMontBench) Mode Cnt Score
>> Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 8 2427.562 ±
>> 24.923 ops/s
>> PolynomialP256Bench.benchMultiply false thrpt 8 1757.495 ±
>> 41.805 ops/s
>> PolynomialP256Bench.benchSquare true thrpt 8 2435.202 ±
>> 20.822 ops/s
>> PolynomialP256Bench.benchSquare false thrpt 8 2420.390 ±
>> 33.594 ops/s
>>
>> Benchmark (algorithm) (dataSize) (keyLength)
>> (provider) Mode Cnt Score Error Units
>> SignatureBench.ECDSA.sign SHA256withECDSA 1024 256
>> thrpt 40 8439.881 ± 29.838 ops/s
>> SignatureBench.ECDSA.sign SHA256withECDSA 16384 256
>> thrpt 40 7990.614 ± 30.998 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 1024 256
>> thrpt 40 2677.737 ± 8.400 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256
>> thrpt 40 2619.297 ± 9.737 ops/s
>>
>> Benchmark (algorithm) (keyLength)
>> (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> KeyAgreementBench.EC.generateSecret ECDH 256
>> EC thrpt 40 1905.369 ± 3.745 ops/s
>>
>> Benchmark (algorithm) (keyLength)
>> (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> KeyAgreementBench.EC.generateSecret ECDH 256
>> EC thrpt 40 1903.997 ± 4.092 ops/s
>>
>>
>> Performance with intrinsic
>>
>> Benchmark (isMontBench) Mode Cnt Score
>> Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 8 2676.599 ±
>> 24.722 ops/s
>> PolynomialP256Bench.benchMultiply false thrpt 8 1770.589 ±
>> 2.584 op...
>
> Ben Perez has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid
> neon/gpr approach
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7282:
> 7280: __ mov(mul_ptr, sp);
> 7281:
> 7282: __ umullv(A[0], __ T2D, b_lows, __ T2S, a_vals, __ S, 0);
I notice that this 4 line insn pattern recurs several times. You might consider
defining a macro generator method to generate these sequences i.e.
vs_umullv(VSeq<4> vs, FloatRegister bs, FloatRegister as, int lane_lo) {
__ umullv(vs[0], __ T2D, bs, __ T2S, as, __S, lane_lo);
__ umull2v(vs[1], __ T2D, bs, __ T4S. as, __S, lane_lo);
__ umullv(vs[2], __ T2D, bs, __T2S, as, __S, lane_lo + 2);
__ umull2v(vs[3], __ T2D, bs, __T4S, as, __S, lane_lo + 2);
}
So, in this case you would call
vs_umullv(A, b_lows, a_vals, 0);
and in other cases you can simply vary which arguments you provide for `vs`,
`as`, `bs` and `lane_lo`.
I'm suggesting that because the various cross multiplies generated by this
macro-method appear to implement a specific subsequence of the mont mul
computation (similar to what is done in e.g. the kyber code). So, this
generator method abstraction should also abstract that a clear step in any
pseudo-code algorithm you provide and ought therefore to make clear that (also
make clear /how/) the generated code implements that algorithm. If you can
provide such a pseudo-code algorithm then we might even be able come up with a
better name for the generator method (e.g. montmul_uproduct or something else
that reflects what is happening here).
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727648074