Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-06-05 Thread Tobias Hartmann
On Wed, 22 May 2024 14:19:36 GMT, Volodymyr Paprotski  wrote:

>> Volodymyr Paprotski has updated the pull request with a new target base due 
>> to a merge or a rebase. The incremental webrev excludes the unrelated 
>> changes brought in by the merge/rebase. The pull request contains 17 
>> additional commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>>  - shenandoah verifier
>>  - comments from Sandhya
>>  - whitespace
>>  - add message back
>>  - whitespace
>>  - Use AffinePoint to exit Montgomery domain
>>
>>Style notes:
>>Affine.equals()
>>- Mismatched fields only appear to be used from testing, perhaps 
>> should be moved there instead
>>Affine.getX(boolean)|getY(boolean)
>>- "Passing flag is bad design" - cleanest/performant alternative to 
>> several instanceof checks
>>- needed to convert Affine to Projective (need to stay in montgomery 
>> domain)
>>ECOperations.PointMultiplier
>>   - changes could probably be restored to original (since 
>> ProjectivePoint handling no longer required)
>>   - consider these changes an improvement? (fewer nested classes)
>>   - was an inner-class but not using inner-class features (i.e. ecOps 
>> variable should be converted)
>>  - whitespace
>>  - Comments from Tony and Jatin
>>  - Comments from Jatin and Tony
>>  - ... and 7 more: https://git.openjdk.org/jdk/compare/1adfff34...b1a33004
>
> Thanks Tobi!

Unfortunately, this caused a performance regression, see 
[JDK-8333583](https://bugs.openjdk.org/browse/JDK-8333583). @vpaprotsk, please 
have a look.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2149576062


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-22 Thread Volodymyr Paprotski
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>Style notes:
>Affine.equals()
>- Mismatched fields only appear to be used from testing, perhaps 
> should be moved there instead
>Affine.getX(boolean)|getY(boolean)
>- "Passing flag is bad design" - cleanest/performant alternative to 
> several instanceof checks
>- needed to convert Affine to Projective (need to stay in montgomery 
> domain)
>ECOperations.PointMultiplier
>   - changes could probably be restored to original (since ProjectivePoint 
> handling no longer required)
>   - consider these changes an improvement? (fewer nested classes)
>   - was an inner-class but not using inner-class features (i.e. ecOps 
> variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/c0032e2c...b1a33004

Thanks Tobi!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124924526


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-22 Thread Tobias Hartmann
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>Style notes:
>Affine.equals()
>- Mismatched fields only appear to be used from testing, perhaps 
> should be moved there instead
>Affine.getX(boolean)|getY(boolean)
>- "Passing flag is bad design" - cleanest/performant alternative to 
> several instanceof checks
>- needed to convert Affine to Projective (need to stay in montgomery 
> domain)
>ECOperations.PointMultiplier
>   - changes could probably be restored to original (since ProjectivePoint 
> handling no longer required)
>   - consider these changes an improvement? (fewer nested classes)
>   - was an inner-class but not using inner-class features (i.e. ecOps 
> variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/45457761...b1a33004

All tests passed.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124892444


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-21 Thread Tobias Hartmann
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>Style notes:
>Affine.equals()
>- Mismatched fields only appear to be used from testing, perhaps 
> should be moved there instead
>Affine.getX(boolean)|getY(boolean)
>- "Passing flag is bad design" - cleanest/performant alternative to 
> several instanceof checks
>- needed to convert Affine to Projective (need to stay in montgomery 
> domain)
>ECOperations.PointMultiplier
>   - changes could probably be restored to original (since ProjectivePoint 
> handling no longer required)
>   - consider these changes an improvement? (fewer nested classes)
>   - was an inner-class but not using inner-class features (i.e. ecOps 
> variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/9ee91a9f...b1a33004

Thanks! I submitted testing and will report back once it passed.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2123869579


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-21 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request with a new target base due to 
a merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 17 additional commits 
since the last revision:

 - Merge remote-tracking branch 'origin/master' into ecc-montgomery
 - shenandoah verifier
 - comments from Sandhya
 - whitespace
 - add message back
 - whitespace
 - Use AffinePoint to exit Montgomery domain
   
   Style notes:
   Affine.equals()
   - Mismatched fields only appear to be used from testing, perhaps should 
be moved there instead
   Affine.getX(boolean)|getY(boolean)
   - "Passing flag is bad design" - cleanest/performant alternative to 
several instanceof checks
   - needed to convert Affine to Projective (need to stay in montgomery 
domain)
   ECOperations.PointMultiplier
  - changes could probably be restored to original (since ProjectivePoint 
handling no longer required)
  - consider these changes an improvement? (fewer nested classes)
  - was an inner-class but not using inner-class features (i.e. ecOps 
variable should be converted)
 - whitespace
 - Comments from Tony and Jatin
 - Comments from Jatin and Tony
 - ... and 7 more: https://git.openjdk.org/jdk/compare/12e8009b...b1a33004

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/df4fe6fa..b1a33004

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=10-11

  Stats: 190975 lines in 3949 files changed: 105304 ins; 64688 del; 20983 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583