On Fri, 5 Jun 2026 20:15:30 GMT, Shawn Emery <[email protected]> wrote:

> Curve25519 polynomial arithmetic is performed with intrinsics implemented in
> GPR related instructions for multiplication operations (method mult()).
> Benchmark improvements include:
> 
> X25519 decapsulation: +9%
> X25519 encapsulation: +9%
> X25519 key agreement: +7%
> X25519 key-pair generation: +10%
> X25519-MLKEM decapsulation: +7%
> X25519-MLKEM encapsulation: +8%
> X25519-MLKEM key-pair generation: +8%
> EdDSA sign: +12%
> EdDSA verify: +12%
> EdDSA key-pair generation: +15%
> 
> Note 1: The differences between the Aarch64 and x86_64 intrinsics
> implementations include the lack of square() intrinsics; usage caused a 3.3%
> performance regression due to the efficiencies of the symmetric squaring shape
> in Java vs. the inefficiencies of the leaf calls and the additional cycles
> required for 64 bit multiplication in Aarch64.
> Note 2: The GPR related instructions were optimal when compared to a hybrid
> (GPR related instructions for the first two iterations and Neon instructions
> for the last two iterations) solution. This design produced a -4%/-1%
> performance drop in KEM decapsulation/encapsulation compared to the GPR
> related instructions where the overhead of performing the limb splits and
> reconstruction did not compensate enough for the efficiencies of SIMD
> parallelism.
> 
> ---------
> - [X] I confirm that I make this contribution in accordance with the [OpenJDK
> Interim AI Policy](https://openjdk.org/legal/ai).

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7678:

> 7676:   /**
> 7677:    * Arithmetic polynomial multiplicaiton in Curve25519.  The algorithm mimics
> 7678:    * the version in Java, including the use of all columns (no folding method).

Please mention class `IntegerPolynomial25519` or file `IntegerPolynomial25519.java` to make it easier for maintainers to find the source.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7720:

> 7718:     const int32_t columns = limbs * 2;
> 7719:     const uint64_t mask = 0x7FFFFFFFFFFFFULL;
> 7720:     const uint64_t CARRY_ADD = 0x4000000000000ULL;

It might be clearer to construct these from bpl rather than leave readers to count the number of zeroes:

Suggestion:

    const uint64_t mask = ((1ULL << bpl) - 1ULL);
    const uint64_t CARRY_ADD = (1ULL << (bpl - 1));

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3414184380
PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3414209558
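
For reference on the suggestion to derive mask and CARRY_ADD from bpl: a minimal standalone sketch that checks the derived expressions against the hex literals in the quoted hunk. It assumes bpl is 51 (bits per limb); the quoted lines do not show its definition, so that value is inferred from the literals and is not taken from the patch.

```cpp
#include <cstdint>

// Assumption: bpl (bits per limb) is 51. The quoted hunk does not show
// where bpl is defined; the value is inferred from the 51-bit mask literal.
constexpr int bpl = 51;

// Constants built from bpl, as in the review suggestion.
constexpr uint64_t mask = (1ULL << bpl) - 1ULL;    // low bpl bits set
constexpr uint64_t CARRY_ADD = 1ULL << (bpl - 1);  // half of 2^bpl

// Compile-time checks against the literals currently in the patch.
static_assert(mask == 0x7FFFFFFFFFFFFULL,
              "derived mask differs from the literal");
static_assert(CARRY_ADD == 0x4000000000000ULL,
              "derived CARRY_ADD differs from the literal");
```

If bpl has a different value in the stub, the static_asserts fail at compile time, which is the point of deriving the constants rather than spelling them out.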
