On Sat, 24 Sep 2022 08:17:56 GMT, Xue-Lei Andrew Fan <xue...@openjdk.org> wrote:
>> Hi, >> >> Please review this performance improvement for Secp256R1 implementation in >> OpenJDK. With this update, there is an about 20% performance improvement >> for Secp256R1 key generation and signature. >> >> Basically, 256 bits EC curves could use 9 integer limbs for the computation. >> The current implementation use 10 limbs instead. By reducing the number of >> limbs, the implementation could benefit from less integer computation >> (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the >> performance. >> >> Here are the benchmark numbers without the patch: >> >> Benchmark (messageLength) Mode Cnt Score Error Units >> Signatures.sign 64 thrpt 15 1.414 ± 0.022 ops/ms >> Signatures.sign 512 thrpt 15 1.418 ± 0.004 ops/ms >> Signatures.sign 2048 thrpt 15 1.419 ± 0.005 ops/ms >> Signatures.sign 16384 thrpt 15 1.395 ± 0.003 ops/ms >> >> KeyGenerators.keyPairGen thrpt 15 1.475 ± 0.043 ops/ms >> >> >> And here are the numbers with the patch applied: >> >> Benchmark (messageLength) Mode Cnt Score Error Units >> ECSignature.sign 64 thrpt 15 1.719 ± 0.010 ops/ms >> ECSignature.sign 512 thrpt 15 1.704 ± 0.012 ops/ms >> ECSignature.sign 2048 thrpt 15 1.699 ± 0.018 ops/ms >> ECSignature.sign 16384 thrpt 15 1.681 ± 0.006 ops/ms >> >> KeyGenerators.keyPairGen thrpt 15 1.881 ± 0.008 ops/ms >> >> >> Thanks, >> Xuelei > > I check with the real bytes for the following computation. For Secp256R1, > the p is defined. > > p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF > = 2^224(2^32 −1) + 2^192 + 2^96 − 1 > = 2^256 - 2^224 + 2^192 + 2^96 − 1 > > > Let's use a full 256 bits integer for following computation, which is bigger > than p: > > > FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF > >> starting with 29 bit limbs > > With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian > order, as the following: > > > a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF > 00FFFFFF > > > From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits. 29 * 8 + 24 > = 256 > >> after 2 additions (maxAdds) we have up to 31 bits > > after 2 additions, the 9-29 bits words are: > > a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF > 03FFFFFF > > From a[0] to a[7], 31 bits for each limbs, and a[8] is 26 bits. This is the > biggest number we could get from the finite filed computation. > >> then we multiply limbs (62 bits) and add up to numLimbs (9) results together >> = 65 bits, which overflows long >> and we didn't start reducing yet. > > Let's multiply two limbs: > > a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF > 03FFFFFF > b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF > 03FFFFFF > > > The overflow we cared about is the following computation, the longest > additions. > > long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * > b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] > * b[0]); > > Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] > and b[8] is 26-bit words. > > Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between > 31-bits words and 26-bits words. So we have the equivalent computation for > the a and b values above: > > a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF; > b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF; > long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + > (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) > = 0xBFFFFFF900000007L > long i0x8 = (a[0] * b[8]) + (a[8] * b[0]); > = 0x03FFFFFEF8000002L > > long c8 = i1x7 + i0x8 > = 0xC3FFFFF7F8000009L > > > c8 fills the 64 bits fully, but it does not overflow. > > @djelinski does it sound right to you? > > Although there is not overflow if I get the above computation right, it is > still not a straightforward thing we can know without detailed computation. > Instead, this kind of computation and checking could be performed in FieldGen > code automatically, or just reduce limbs operations result (up to no more > than twice the modulus) and remove the limit of maxAdd. > > This RFE, JDK-8294248, is one of a few performance improvement of the EC > implementation in JDK > ([JDK-8294188](https://bugs.openjdk.org/browse/JDK-8294188)). I would like > to address the checking or/and reducing improvement in a separately RFE. > @XueleiFan as far as I can tell, we cannot assume that the starting number > will be 256 bit; the `setReduced` operation performs a limited reduce that > only ensures that limbs fit in bitsPerLimb; the carry from the last limb is > only performed in `finalCarryReduceLast`. Thank you @djelinski, I missed this point. I tried to prototype and allow only one addition, and use reduce if necessary, and get the following benchmarking numbers: Benchmark (messageLength) Mode Cnt Score Error Units Signatures.sign 64 thrpt 15 1.629 ± 0.008 ops/ms Signatures.sign 512 thrpt 15 1.628 ± 0.007 ops/ms Signatures.sign 2048 thrpt 15 1.625 ± 0.006 ops/ms Signatures.sign 16384 thrpt 15 1.588 ± 0.014 ops/ms ECKeyGen.keyPairGen thrpt 15 1.779 ±(99.9%) 0.007 ops/ms ``` The improvement looks positive to me (15%-20% improvement). It implies that it may be performance friendly by reducing or semi-reducing limbs operations rather than using maxAdd controlling. However, as the update would impact more curves in JDK, I would think it twice if it is good to start from it or not. ------------- PR: https://git.openjdk.org/jdk/pull/10398