On Thu, 22 Sep 2022 20:40:08 GMT, Xue-Lei Andrew Fan <xue...@openjdk.org> wrote:

> Hi,
> 
> Please review this performance improvement for Secp256R1 implementation in 
> OpenJDK.  With this update, there is an about 20% performance improvement for 
> Secp256R1 key generation and signature.
> 
> Basically, 256 bits EC curves could use 9 integer limbs for the computation.  
> The current implementation use 10 limbs instead.  By reducing the number of 
> limbs, the implementation could benefit from less integer computation 
> (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the 
> performance.
> 
> Here are the benchmark numbers without the patch:
> 
> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
> Signatures.sign               64  thrpt   15  1.414 ± 0.022  ops/ms
> Signatures.sign              512  thrpt   15  1.418 ± 0.004  ops/ms
> Signatures.sign             2048  thrpt   15  1.419 ± 0.005  ops/ms
> Signatures.sign            16384  thrpt   15  1.395 ± 0.003  ops/ms
> 
> KeyGenerators.keyPairGen          thrpt   15  1.475 ± 0.043  ops/ms
> 
> 
> And here are the numbers with the patch applied:
> 
> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
> ECSignature.sign               64  thrpt   15  1.719 ± 0.010  ops/ms
> ECSignature.sign              512  thrpt   15  1.704 ± 0.012  ops/ms
> ECSignature.sign             2048  thrpt   15  1.699 ± 0.018  ops/ms
> ECSignature.sign            16384  thrpt   15  1.681 ± 0.006  ops/ms
> 
> KeyGenerators.keyPairGen           thrpt   15  1.881 ± 0.008  ops/ms
> 
> 
> Thanks,
> Xuelei

Thank you for the feedback, Daniel.

> Can you prove that this is correct?

That would depend on the implementation details.   The JDK testing passed, 
although it does not means there is no problem.  I will try if I could describe 
the implementation logic more clear tomorrow.  The following is just a quick 
reply for what I can see now for the case you described.

> My back-of-the-envelope calculations suggest that this may overflow:
> 
> * starting with 29 bit limbs
> * after 2 additions (maxAdds) we have up to 31 bits
> * then we multiply limbs (62 bits) and add up to numLimbs (9) results 
> together = 65 bits, which overflows long
> * and we didn't start reducing yet.

For the multiply operation, I guess the result should have been carry reduced 
and fix in the limbs and sizes.  In the generated IntegerPolynomialP256.java, 
the code looks like:

    protected void mult(long[] a, long[] b, long[] r) {
        ... snipped ...
        long c16 = (a[8] * b[8]);

        carryReduce(r, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, 
c13, c14, c15, c16);
    }

carryReduce() is called, and the result 'r' is a set of limbs fit on the limbs 
size, I guess.


private void carryReduce(...) {
    ... snipped ...
        //carry from position 7
        t0 = (c7 + CARRY_ADD) >> 29;
        c7 -= (t0 << 29);
        c8 += t0;
    ... snipped ...
}


In the description of IntegerPolynomial.java, it is said that "Limb values will 
always fit within a long, so inputs to multiplication must be less than 32 
bits. All IntegerPolynomial implementations allow at most one addition before 
multiplication. Additions after that will result in an ArithmeticException."  

Did you see any code that use more than one addition before the multiplication? 
 Or did you refer to something else with "multiply limbs" other than the 
description in the IntegerPolynomial class?

-------------

PR: https://git.openjdk.org/jdk/pull/10398

Reply via email to