On Sat, 24 Sep 2022 08:17:56 GMT, Xue-Lei Andrew Fan <xue...@openjdk.org> wrote:

>> Hi,
>> 
>> Please review this performance improvement for Secp256R1 implementation in 
>> OpenJDK.  With this update, there is an about 20% performance improvement 
>> for Secp256R1 key generation and signature.
>> 
>> Basically, 256 bits EC curves could use 9 integer limbs for the computation. 
>>  The current implementation use 10 limbs instead.  By reducing the number of 
>> limbs, the implementation could benefit from less integer computation 
>> (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the 
>> performance.
>> 
>> Here are the benchmark numbers without the patch:
>> 
>> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
>> Signatures.sign               64  thrpt   15  1.414 ± 0.022  ops/ms
>> Signatures.sign              512  thrpt   15  1.418 ± 0.004  ops/ms
>> Signatures.sign             2048  thrpt   15  1.419 ± 0.005  ops/ms
>> Signatures.sign            16384  thrpt   15  1.395 ± 0.003  ops/ms
>> 
>> KeyGenerators.keyPairGen          thrpt   15  1.475 ± 0.043  ops/ms
>> 
>> 
>> And here are the numbers with the patch applied:
>> 
>> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
>> ECSignature.sign               64  thrpt   15  1.719 ± 0.010  ops/ms
>> ECSignature.sign              512  thrpt   15  1.704 ± 0.012  ops/ms
>> ECSignature.sign             2048  thrpt   15  1.699 ± 0.018  ops/ms
>> ECSignature.sign            16384  thrpt   15  1.681 ± 0.006  ops/ms
>> 
>> KeyGenerators.keyPairGen           thrpt   15  1.881 ± 0.008  ops/ms
>> 
>> 
>> Thanks,
>> Xuelei
>
> I check with the real bytes for the following computation.  For Secp256R1, 
> the p is defined.
> 
> p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
>   = 2^224(2^32 −1) + 2^192 + 2^96 − 1
>   = 2^256 - 2^224 + 2^192 + 2^96 − 1
> 
> 
> Let's use a full 256 bits integer for following computation, which is bigger 
> than p:
> 
> 
> FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
> 
>> starting with 29 bit limbs
> 
> With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian 
> order, as the following:
> 
> 
> a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 
> 00FFFFFF 
> 
> 
> From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits.  29 * 8 + 24 
> = 256
> 
>> after 2 additions (maxAdds) we have up to 31 bits
> 
> after 2 additions, the 9-29 bits words are:
> 
> a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
> 03FFFFFF
> 
> From a[0] to a[7],  31 bits for each limbs, and a[8] is 26 bits.  This is the 
> biggest number we could get from the finite filed computation.
> 
>> then we multiply limbs (62 bits) and add up to numLimbs (9) results together 
>> = 65 bits, which overflows long
>> and we didn't start reducing yet.
> 
> Let's multiply two limbs:
> 
> a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
> 03FFFFFF
> b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
> 03FFFFFF 
> 
> 
> The overflow we cared about is the following computation, the longest 
> additions.
> 
>         long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * 
> b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] 
> * b[0]);
> 
> Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] 
> and b[8] is 26-bit words.
> 
> Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between 
> 31-bits words and 26-bits words.  So we have the equivalent computation for 
> the a and b values above:
> 
> a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF;
> b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF;
> long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + 
> (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) 
>               = 0xBFFFFFF900000007L
> long i0x8 =  (a[0] * b[8]) +  (a[8] * b[0]);
>               = 0x03FFFFFEF8000002L
> 
> long c8 = i1x7 + i0x8
>              = 0xC3FFFFF7F8000009L
> 
> 
> c8 fills the 64 bits fully, but it does not overflow.
> 
> @djelinski does it sound right to you?
> 
> Although there is not overflow if I get the above computation right, it is 
> still not a straightforward thing we can know without detailed computation.  
> Instead, this kind of computation and checking could be performed in FieldGen 
> code automatically, or just reduce limbs operations result (up to no more 
> than twice the modulus) and remove the limit of maxAdd.
> 
> This RFE, JDK-8294248, is one of a few performance improvement of the EC 
> implementation in JDK 
> ([JDK-8294188](https://bugs.openjdk.org/browse/JDK-8294188)).  I would like 
> to address the checking or/and reducing improvement in a separately RFE.

> @XueleiFan as far as I can tell, we cannot assume that the starting number 
> will be 256 bit; the `setReduced` operation performs a limited reduce that 
> only ensures that limbs fit in bitsPerLimb; the carry from the last limb is 
> only performed in `finalCarryReduceLast`.

Thank you @djelinski, I missed this point.

I tried to prototype and allow only one addition, and use reduce if necessary, 
and get the following benchmarking numbers:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
Signatures.sign               64  thrpt   15  1.629 ± 0.008  ops/ms
Signatures.sign              512  thrpt   15  1.628 ± 0.007  ops/ms
Signatures.sign             2048  thrpt   15  1.625 ± 0.006  ops/ms
Signatures.sign            16384  thrpt   15  1.588 ± 0.014  ops/ms
ECKeyGen.keyPairGen               thrpt   15  1.779 ±(99.9%) 0.007 ops/ms
``` 

The improvement looks positive to me (15%-20% improvement).   It implies that 
it may be performance friendly by reducing or semi-reducing limbs operations 
rather than using maxAdd controlling.  However, as the update would impact more 
curves in JDK, I would think it twice if it is good to start from it or not.

-------------

PR: https://git.openjdk.org/jdk/pull/10398

Reply via email to