On Thu, 25 Jun 2026 05:59:54 GMT, Shawn Emery <[email protected]> wrote:

> > I ran ML-KEM and ML-DSA benchmarks on a Macbook Pro M3 36GB Memory, see 
> > attached. Everything is improved with the set of changes except for 
> > ML-KEM-512 (encapsulation:-0.92%/decapsulation:-2.25%) and ML-DSA-44 (key 
> > generation:-1.03%). 
> > [8384353BenchmarksM3Pro.pdf](https://github.com/user-attachments/files/29319713/8384353BenchmarksM3Pro.pdf)
> 
> Reran the specific benchmarks that showed regression on a quieter system 
> (MacbookAir M3 8GB Memory) and averaged the results, which showed negligible 
> differences of the afflicted benchmarks as follows: ML-KEM-512 
> (encapsulation:-0.38%/decapsulation:-0.21%) and ML-DSA-44 (key generation: 
> -0.18%). In other words, the current fix resolves the regression in 
> performance by either negligible differences or better performance, where the 
> high observed +6.55% (ML-KEM-768, which would be one of the algorithms that 
> would gain the most from this delta).

Thanks for the measurements, and the extra verification.. the first set of 
results really threw me for a loop, I dont see why it would be slower. 

The second rerun.. seems we are safe, so the PR performance is acceptable? (My 
theory is that GC is introducing a lot of variability to the benchmarks?)


------
We should now match _exactly_ the same number of `doubleKeccak` as before 
(admittedly, in possibly different order..)

int[][][] a = new int[mlDsa_k][mlDsa_l][];
ML_DSA_44  << here
                mlDsa_k = 4;
                mlDsa_l = 4;
ML_DSA_65
                mlDsa_k = 6;
                mlDsa_l = 5;
ML_DSA_87
                mlDsa_k = 8;
                mlDsa_l = 7;

short[][][] a = new short[mlKem_k][mlKem_k][];
ML-KEM-512 << here
                mlKem_k = 2;
ML-KEM-768
                mlKem_k = 3;
ML-KEM-1024
                mlKem_k = 4;


For the first batch, the call sequence..

ML_DSA_44
Before: 2 + 2; 2 + 2; 2 + 2; 2 + 2
Now: 4; 4; 4; 4 (but without quadKeccak, 4 becomes 2 + 2)

ML-KEM-512
Before: 2 ; 2
Now: 2 ; 2 (no change, each row has to be fully processed before moving to next)

----
> With the crypto intrinsics that don't return a value we followed the 
> convention that instead of declaring them void, we declare them int and have 
> the intrinsic version return 0 and the Java version return 1 (it makes it 
> easy for tests to decide which version was actually running). This convention 
> could be applied to the @IntrinsicCadidate methods in this class, too.

I meant to do this! Thanks for catching, done.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31648#issuecomment-4802731372

Reply via email to