Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski
On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 308:
> 
>> 306: 
>> 307: /*
>> 308:  * public Point addition. Used by ECDSAOperations
> 
> Was the old description not applicable anymore?   It would be nice to improve 
> on the existing description that shortening it.

Forgot to go back and fix the comment. Fixed..

As for the 'meaning'. Notice the signature of the function changed (i.e. no 
longer a 'mixed point', but two ProjectivePoints. This is a good idea 
regardless of Montgomery, but it affects montgomery particularly badly (need to 
compute zInv for 'no reason'. )

For sake of completeness. Apart from constructor, the 'API' for ECOperations 
(i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) are these three 
functions (everything else is used internally by this class)

public void setSum(MutablePoint p, MutablePoint p2)
public MutablePoint multiply(AffinePoint affineP, byte[] s)
public MutablePoint multiply(ECPoint ecPoint, byte[] s)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 321:
> 
>> 319: ECOperations ops = this;
>> 320: if (this.montgomeryOps != null) {
>> 321: assert p.getField() instanceof 
>> IntegerMontgomeryFieldModuloP;
> 
> This should throw a ProviderException, I believe this would throw an 
> AssertionException

Missed this comment. No longer applicable (this.montgomeryOps got refactored 
away)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski
On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja  wrote:

>> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
>> alignment and loop rotation, but being a fallback figured its more important 
>> to keep it small&readable...
>
> It's all part of intrinsic, no harm in polishing it.

Done (normalized loop/backedge). There was actually a problem in the loop 
counter.. (`i-=1` instead of `i-=16`). Can't include a test since classes are 
sealed, but verified manually.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-23 Thread Anthony Scarpino
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

src/java.base/share/classes/sun/security/ec/ECOperations.java line 308:

> 306: 
> 307: /*
> 308:  * public Point addition. Used by ECDSAOperations

Was the old description not applicable anymore?   It would be nice to improve 
on the existing description that shortening it.

src/java.base/share/classes/sun/security/ec/ECOperations.java line 321:

> 319: ECOperations ops = this;
> 320: if (this.montgomeryOps != null) {
> 321: assert p.getField() instanceof IntegerMontgomeryFieldModuloP;

This should throw a ProviderException, I believe this would throw an 
AssertionException

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1556740469
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1558824543


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Jatin Bhateja
On Mon, 15 Apr 2024 22:04:14 GMT, Volodymyr Paprotski  wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
>> 
>>> 392:   __ lea(aLimbs, Address(aLimbs,8));
>>> 393:   __ lea(bLimbs, Address(bLimbs,8));
>>> 394:   __ jmp(L_DefaultLoop);
>> 
>> Both sub and cmp are flag affecting instructions and are macro-fusible. 
>> By doing a loop rotation i.e. moving the length <= 0 check outside the loop 
>> and pushing the loop exit check at bottom you can save additional compare 
>> checks.
>
> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
> alignment and loop rotation, but being a fallback figured its more important 
> to keep it small&readable...

It's all part of intrinsic, no harm in polishing it.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566630667


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:
> 
>> 37: };
>> 38: static address modulus_p256() {
>> 39:   return (address)MODULUS_P256;
> 
> Long constants should have UL suffix.

Properly ULL, but good point, fixed

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:
> 
>> 384:   __ jcc(Assembler::equal, L_Length19);
>> 385: 
>> 386:   // Default copy loop
> 
> Please add appropriate loop entry alignment.

This is actually a 'switch statement default'. The default should never happen 
(See "Known Length comment on line 335"), but added because java code has that 
behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the 
existing standard?)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
> 
>> 392:   __ lea(aLimbs, Address(aLimbs,8));
>> 393:   __ lea(bLimbs, Address(bLimbs,8));
>> 394:   __ jmp(L_DefaultLoop);
> 
> Both sub and cmp are flag affecting instructions and are macro-fusible. 
> By doing a loop rotation i.e. moving the length <= 0 check outside the loop 
> and pushing the loop exit check at bottom you can save additional compare 
> checks.

Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
alignment and loop rotation, but being a fallback figured its more important to 
keep it small&readable...

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Fixed all copyright years

git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read 
line; do echo $line =; grep Copyright $line | grep -v 
2024; done | less

Re splitting.. probably too late for that now.. (did consider it initially.. 
got hard to manage two changes while developing. And easier to justify the 
change when the entire patch is presented? but yes, far more code to review.. )

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino  
wrote:

>>> In `ECOperations.java`, if I understand this correctly, it is to replace 
>>> the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>>> when I look at the code, I see both are still options. If I read this 
>>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>>> Why doesn't it just replace the old implementation entry in the `fields` 
>>> Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
>> 
>> Thanks @ascarpino 
>> 
>> PS: Perhaps there is some middle ground, remove the `ECOperations 
>> montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
>> inaccessible..) the reference ECOperations in the fuzz test instead.. not 
>> sure how yet, but perhaps worth a further thought..
>
>> > In `ECOperations.java`, if I understand this correctly, it is to replace 
>> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>> > when I look at the code, I see both are still options. If I read this 
>> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>> > Why doesn't it just replace the old implementation entry in the `fields` 
>> > Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
> 
> I hadn't looked at your fuzz test until you mentioned it.  I see you are 
> using reflection to change the values.  Is that what you mean by "fallback"?  
> I'm assuming there is no to access the older implementation without 
> reflection.
> 
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
> 
> I wouldn

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-11 Thread Anthony Scarpino
On Wed, 10 Apr 2024 18:02:38 GMT, Volodymyr Paprotski  wrote:

> > In `ECOperations.java`, if I understand this correctly, it is to replace 
> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
> > when I look at the code, I see both are still options. If I read this 
> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
> > Why doesn't it just replace the old implementation entry in the `fields` 
> > Map? Is there a reason to keep it around?
> 
> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
> entirely".. that might also help in cleaning a few things up in the 
> construction. Maybe even get rid of this nested ECOperations inside 
> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
> ECC stack clearer is positive!
> 
> One functional reason that might justify keeping it as-is, is fuzz-testing; 
> with the fallback available, I am able to write the included Fuzz tests and 
> have them check the values against the existing implementation. While I also 
> included a few KAT tests using openssl-generated values, the fuzz tests check 
> millions of values and it does add a lot more certainty about correctness of 
> this code.

I hadn't looked at your fuzz test until you mentioned it.  I see you are using 
reflection to change the values.  Is that what you mean by "fallback"?  I'm 
assuming there is no to access the older implementation without reflection.

> 
> Can it be removed? For the operations that do not involve multiplication 
> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
> existing IntegerPolynomialP256 is no longer used (I should verify that again) 
> and only P256OrderField remains non-montgomery. So removing references to 
> IntegerPolynomialP256 in ECOperations should be possible and cleaner. 
> Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder 
> (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some 
> thought..
> 
> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
> it could also be chucked up as part of 'scaffolding' and removed in name of 
> code quality?

I wouldn't rip out the old implementation.  I have been wondering if we should 
make the older implementation available, maybe by security property.  I was 
looking at the static Maps at the top of `ECOperations`, `forParameters`, and 
the constructors where it checks if the `montgomeryOps` was null or set.  It 
would be nice if we could have one set of `fields` Maps by putting the 
montgomery entry into the `fields` to replace it.  I think that should work 
because `IntegerMontgomeryFieldModuloP` extends `IntegerFieldModuloP`.  
`instanceof` or other `montgomeryOps`  checks would still need to exist because 
not all the `fields` support mongomery, and the older implementation would 
still be accessible for your fuzz tester.   At least that is my theory.

> 
> Thanks @ascarpino
> 
> PS: Perhaps there is some middle ground, remove the `ECOperations 
> montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
> inaccessible..) the reference ECOperations in the fuzz test instead.. not 
> sure how yet, but perhaps worth a further thought..

It would be nice to remove the nesting and it would be nice to be a singleton.  
Maybe some combination of what I mentioned above chance can help that.  I 
haven't fully thought this out either.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2050148475


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski
On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Thanks Jatin, will fix!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski
On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino  
wrote:

> In `ECOperations.java`, if I understand this correctly, it is to replace the 
> existing `PointMultiplier` with montgomery-based PointMuliplier. But when I 
> look at the code, I see both are still options. If I read this correctly, it 
> checks for the old `IntegerFieldModuloP`, then looks for the new 
> `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why 
> doesn't it just replace the old implementation entry in the `fields` Map? Is 
> there a reason to keep it around?

Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) 
"for `CurveDB.P_256` remove the fallback path to non-montgomery entirely".. 
that might also help in cleaning a few things up in the construction. Maybe 
even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting 
isnt a big deal, but all attempts to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is, is fuzz-testing; 
with the fallback available, I am able to write the included Fuzz tests and 
have them check the values against the existing implementation. While I also 
included a few KAT tests using openssl-generated values, the fuzz tests check 
millions of values and it does add a lot more certainty about correctness of 
this code.

Can it be removed? For the operations that do not involve multiplication (i.e. 
`setSum(*)`), montgomery is expensive. I think I did go through the uses of 
this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing 
IntegerPolynomialP256 is no longer used (I should verify that again) and only 
P256OrderField remains non-montgomery. So removing references to 
IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing 
IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder 
(fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some 
thought..

I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it 
could also be chucked up as part of 'scaffolding' and removed in name of code 
quality?

Thanks @ascarpino 

PS: Perhaps there is some middle ground, remove the `ECOperations 
montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
inaccessible..) the reference ECOperations in the fuzz test instead.. not sure 
how yet, but perhaps worth a further thought..

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Anthony Scarpino
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

In `ECOperations.java`, if I understand this correctly, it is to replace the 
existing `PointMultiplier` with montgomery-based PointMuliplier.  But when I 
look at the code, I see both are still options.  If I read this correctly, it 
checks for the old `IntegerFieldModuloP`, then looks for the new 
`IntegerMontgomeryFieldModuloP`.  It appears to use the new one always.
Why doesn't it just replace the old implementation entry in the `fields` Map?  
Is there a reason to keep it around?

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048090075


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-05 Thread Volodymyr Paprotski
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would 
appreciate your feedback.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-05 Thread Jatin Bhateja
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

Few early comments.

Please update the copyright year of all the modified files.

You can even consider splitting this into two patches, Java side changes in one 
and  x86 optimized intrinsic in next one.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:

> 37: };
> 38: static address modulus_p256() {
> 39:   return (address)MODULUS_P256;

Long constants should have UL suffix.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:

> 384:   __ jcc(Assembler::equal, L_Length19);
> 385: 
> 386:   // Default copy loop

Please add appropriate loop entry alignment.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:

> 392:   __ lea(aLimbs, Address(aLimbs,8));
> 393:   __ lea(bLimbs, Address(bLimbs,8));
> 394:   __ jmp(L_DefaultLoop);

Both sub and cmp are flag affecting instructions and are macro-fusible. 
By doing a loop rotation i.e. moving the length <= 0 check outside the loop and 
pushing the loop exit check at bottom you can save additional compare checks.

-

PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-1981555803
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553056633
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1552710600
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553110376


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-03 Thread Magnus Ihse Bursie
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

Build changes are trivially fine.

-

Marked as reviewed by ihse (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-1976208258


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski
On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/module-info.java line 265:
> 
>> 263: jdk.jfr,
>> 264: jdk.unsupported,
>> 265: jdk.crypto.ec;
> 
> jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in 
> java.base. So I don't think you need this qualified export.

Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a 
few spots during rebase I guess)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  remove use of jdk.crypto.ec

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=00-01

  Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583