from:"Volodymyr Paprotski"

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v48]

2024-05-30 Thread Volodymyr Paprotski

On Thu, 30 May 2024 13:56:30 GMT, Emanuel Peter  wrote:

>> Control question: Are we confident with this potentially going into JDK 23 
>> or should we rather postpone to JDK 24? The fork is next week.
>
>> Control question: Are we confident with this potentially going into JDK 23 
>> or should we rather postpone to JDK 24? The fork is next week.
> 
> I would hold off. @asgibbons it may pass our tests, and your extensive 
> testing. But you never know what the fuzzer can find over a few weeks once it 
> runs with your changes. I have made that experience many times. Let's just 
> give it a few days, and then we have one JDK version less to worry about for 
> backports on possible follow-up bugs ;)

@eme64 I guess to add some confidence.. we did also 'test it independently' to 
catch blind spots. i.e. `String/IndexOf.java` is from me. I tried to be as 
paranoid as possible with non-random strings. Passed everything I could throw 
at it..

-

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2139882544

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Volodymyr Paprotski

On Tue, 28 May 2024 17:36:03 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
>> 
>>> 486:   __ cmpq(r11, nMinusK);
>>> 487:   __ ja_b(L_return);
>>> 488:   __ movq(rax, r11);
>> 
>> At places where we know that return value in r11 is correct, we dont need to 
>> checkRange so this could have its own label.
>
> I don't want to change this because its reason for existence is to ensure we 
> don't return a value that's beyond the end of the haystack.  We don't yet 
> have a good enough test to validate whether we're reading past the end of the 
> haystack, so I like this as insurance.

I would recommend an experiment. Disable the range-check and run 
String/IndexOf.java test. Particularly run test4(), which is designed exactly 
to test the reads beyond the end. 

It wont find all the bad reads, but right now if there are any failures, they 
are 'hidden' by this range-check.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617888680

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-24 Thread Volodymyr Paprotski

On Wed, 22 May 2024 14:50:40 GMT, Scott Gibbons  wrote:

>> test/jdk/java/lang/StringBuffer/IndexOf.java line 284:
>> 
>>> 282: 
>>> 283:   // Note: it is possible although highly improbable that failCount 
>>> will
>>> 284:   // be > 0 even if everthing is working ok
>> 
>> This sounds like either a bug or a testcase bug? Same as line 301, 
>> `extremely remote possibility of > 1 match`?
>
> This was there from the original author.  I think they were trying to infer 
> that a match could occur in the rare case that the same random string was 
> produced.  They're random after all, and there's no reason the same sequence 
> could be generated.

Makes sense

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613872215

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v37]

2024-05-24 Thread Volodymyr Paprotski

On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   mov64 => lea(InternalAddress)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:

> 4631: andl(result, 0x000f);  //   tail count (in bytes)
> 4632: andl(limit, 0xfff0);   // vector count (in bytes)
> 4633: jcc(Assembler::zero, COMPARE_TAIL);

In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I 
think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, 
inside the if (and move the other jcc into the else branch)?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:

> 4637: negptr(limit);
> 4638: 
> 4639: bind(COMPARE_WIDE_VECTORS_16);

Understanding-check.. this loop will execute at most 2 times, right?

i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks 
then byte-by-byte?

(Still a good optimization, just trying to understand the scope)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:

> 4716: jmp(TRUE_LABEL);
> 4717:   } else {
> 4718: movl(chr, Address(ary1, limit, scaleFactor));

scaleFactor is always Address::times_1 here (expand_ary2==false), might be 
clearer to change it back

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:

> 55: 
> 56: generator = new Random();
> 57: long seed = generator.nextLong();//-5291521104060046276L;

dead code

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:

> 61: ///  WARM-UP //
> 62: 
> 63: for (int i = 0; i < 2; i++) {

-Xcomp should be more deterministic (and quicker) way to reach the intrinsic 
(i.e. like the other tests)

On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check 
please.. these tests will run as part of every build from this 
point-till-infinity; Therefore, long test will affect every openjdk developer. 
But if this test is not run on every build, then the build-time does not 
matter, then this test can run for as long as it 'wants'.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:

> 158:   }
> 159: 
> 160:   private static String generateTestString(int min, int max) {

I see you have various `Charset[] charSets` above, but this function still only 
generates LL. Are those separate tests? Or am I missing some concatenation 
somewhere that will convert the generated string string to the correct encoding?

You could had implemented my suggestion from IndexOf.generateTestString here 
instead, so that the tests that do call this function endup with multiple 
encodings; i.e. similar to what you already do in the next function.

I suppose, with addition of String/IndexOf.java that is a moot point.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:

> 183:   }
> 184: 
> 185:   private static int

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-24 Thread Volodymyr Paprotski

On Fri, 17 May 2024 23:59:05 GMT, Scott Gibbons  wrote:

>> test/jdk/java/lang/StringBuffer/IndexOf.java line 40:
>> 
>>> 38:   private static boolean failure = false;
>>> 39:   public static void main(String[] args) throws Exception {
>>> 40: String testName = "IndexOf";
>> 
>> intentation
>
> Fixed

(missed a `git add`? don't see the updates for this file)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613870558

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-24 Thread Volodymyr Paprotski

On Wed, 22 May 2024 14:41:36 GMT, Scott Gibbons  wrote:

>> test/micro/org/openjdk/bench/java/lang/StringIndexOfHuge.java line 132:
>> 
>>> 130:   @Benchmark
>>> 131:   public int searchHugeLargeSubstring() {
>>> 132:   return dataStringHuge.indexOf("B".repeat(30) + "X" + 
>>> "A".repeat(30), 74);
>> 
>> .repeat() call and string concatenation shouldn't be part of the benchmark 
>> (here and several other @Benchmark functions in this file) since it will 
>> detract from the measurement.
>> 
>> (String concatenation gets converted (by javac) into 
>> StringBuilder().append().append()append().toString())
>
> Since we're only concerned with the delta of performance, does this really 
> matter?  Can you suggest an alternative?

The needle really should be like the all the other strings, e.g. 
`dataStringHuge` itself, generated by the setup. 

As to weather it really matters; the answer is Amdahl's law. You can indeed 
measure the delta, but you can't measure the speedup of just the indexOf; not 
with repeat and concatenation obscuring the numbers.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613864094

Integrated: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

2024-05-22 Thread Volodymyr Paprotski

On Tue, 2 Apr 2024 15:42:05 GMT, Volodymyr Paprotski  wrote:

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

This pull request has now been integrated.

Changeset: afed7d0b
Author:Volodymyr Paprotski 
Committer: Sandhya Viswanathan 
URL:   
https://git.openjdk.org/jdk/commit/afed7d0b0593864e5595840a6b645c210ff28c7c
Stats: 2409 lines in 36 files changed: 2093 ins; 156 del; 160 mod

8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Reviewed-by: ihse, ascarpino, sviswanathan

-

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-22 Thread Volodymyr Paprotski

On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>Style notes:
>Affine.equals()
>- Mismatched fields only appear to be used from testing, perhaps 
> should be moved there instead
>Affine.getX(boolean)|getY(boolean)
>- "Passing flag is bad design" - cleanest/performant alternative to 
> several instanceof checks
>- needed to convert Affine to Projective (need to stay in montgomery 
> domain)
>ECOperations.PointMultiplier
>   - changes could probably be restored to original (since ProjectivePoint 
> handling no longer required)
>   - consider these changes an improvement? (fewer nested classes)
>   - was an inner-class but not using inner-class features (i.e. ecOps 
> variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/c0032e2c...b1a33004

Thanks Tobi!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124924526

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-21 Thread Volodymyr Paprotski

On Tue, 21 May 2024 07:21:14 GMT, Tobias Hartmann  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   shenandoah verifier
>
> I'm getting some conflicts when trying to apply this to master. Could you 
> please merge the PR?

Hi @TobiHartmann , merged with no issues for me. Could you please run the tests 
again? (I think Tony did run them, but can't hurt verifying again). Thanks!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2123122468

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-21 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request with a new target base due to 
a merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 17 additional commits 
since the last revision:

 - Merge remote-tracking branch 'origin/master' into ecc-montgomery
 - shenandoah verifier
 - comments from Sandhya
 - whitespace
 - add message back
 - whitespace
 - Use AffinePoint to exit Montgomery domain
   
   Style notes:
   Affine.equals()
   - Mismatched fields only appear to be used from testing, perhaps should 
be moved there instead
   Affine.getX(boolean)|getY(boolean)
   - "Passing flag is bad design" - cleanest/performant alternative to 
several instanceof checks
   - needed to convert Affine to Projective (need to stay in montgomery 
domain)
   ECOperations.PointMultiplier
  - changes could probably be restored to original (since ProjectivePoint 
handling no longer required)
  - consider these changes an improvement? (fewer nested classes)
  - was an inner-class but not using inner-class features (i.e. ecOps 
variable should be converted)
 - whitespace
 - Comments from Tony and Jatin
 - Comments from Jatin and Tony
 - ... and 7 more: https://git.openjdk.org/jdk/compare/12e8009b...b1a33004

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/df4fe6fa..b1a33004

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=11
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=10-11

  Stats: 190975 lines in 3949 files changed: 105304 ins; 64688 del; 20983 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-17 Thread Volodymyr Paprotski

On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   shenandoah verifier

Thanks Sandhya!

Now that I have @ascarpino approval as well, I plan to integrate next Tuesday.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118443577

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-17 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  shenandoah verifier

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/5c360e35..df4fe6fa

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=10
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=09-10

  Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-17 Thread Volodymyr Paprotski

On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan 
 wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   whitespace
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:
> 
>> 166:   XMMRegister broadcast5 = xmm24;
>> 167:   KRegister limb0 = k1;
>> 168:   KRegister limb5 = k2;
> 
> limb5 and select are not being used anymore.

Thanks, fixed (and also broadcast5)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:
> 
>> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), 
>> false, Assembler::AVX_512bit, rscratch);
>> 184: 
>> 185:   // A = load(*aLimbs)
> 
> A little bit more description in comments on what the load step involves 
> would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, 
> or in the lowest limb.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:
> 
>> 268:   __ push(r14);
>> 269:   __ push(r15);
>> 270: 
> 
> No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in 
> montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed 
> to r8.

Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 
stub)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:
> 
>> 284:   __ mov(aLimbs, c_rarg0);
>> 285:   __ mov(bLimbs, c_rarg1);
>> 286:   __ mov(rLimbs, c_rarg2);
> 
> We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then 
> these moves are not necessary.

Gave them symbolic names and passed the gpr temp and parameter. vector register 
map still in the montgomeryMultiply function, but gprs explicitly passed in. 
'close enough'?

> src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:
> 
>> 1368: 
>> 1369: #ifdef _LP64
>> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize 
>> >= 64) {
> 
> No need to tie the intrinsic to MaxVectorSize setting.

Done

> src/hotspot/share/opto/library_call.cpp line 7564:
> 
>> 7562: 
>> 7563:   if (!stubAddr) return false;
>> 7564:   if (stopped())  return true;
> 
> Line 7564 seems redundant here as there is no range check or anything like 
> that  before this.

Oh. That is what that is for...  I thought it was some soft of 'VM quitting' 
short-circuit. Removed.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v10]

2024-05-17 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  comments from Sandhya

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/8cd095dd..5c360e35

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=09
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=08-09

  Stats: 82 lines in 4 files changed: 1 ins; 59 del; 22 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-15 Thread Volodymyr Paprotski

On Wed, 15 May 2024 19:21:37 GMT, Volodymyr Paprotski  wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rearrange; add lambdas for clarity
>
> test/jdk/java/lang/StringBuffer/IndexOf.java line 47:
> 
>> 45:   char[] haystack_16 = new char[128];
>> 46: 
>> 47:   for (int i = 0; i < 128; i++) {
> 
> you can use `char` instead of `int` as iterator

combine into single loop
haystack[i] = (char) i;
haystack_16[i] = (char) (i + 256);

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1602141543

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-15 Thread Volodymyr Paprotski

On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rearrange; add lambdas for clarity

First pass at StringIndexOfHuge.java and IndexOf.java

test/jdk/java/lang/StringBuffer/IndexOf.java line 40:

> 38:   private static boolean failure = false;
> 39:   public static void main(String[] args) throws Exception {
> 40: String testName = "IndexOf";

intentation

test/jdk/java/lang/StringBuffer/IndexOf.java line 47:

> 45:   char[] haystack_16 = new char[128];
> 46: 
> 47:   for (int i = 0; i < 128; i++) {

you can use `char` instead of `int` as iterator

test/jdk/java/lang/StringBuffer/IndexOf.java line 54:

> 52:   // for (int i = 1; i < 128; i++) {
> 53:   //   haystack_16[i] = (char) (i);
> 54:   // }

dead code

test/jdk/java/lang/StringBuffer/IndexOf.java line 64:

> 62:   Charset hs_charset = StandardCharsets.UTF_16;
> 63:   Charset needleCharset = StandardCharsets.ISO_8859_1;
> 64:   // Charset needleCharset = StandardCharsets.UTF_16;

Move from main() into a function that takes `needleCharset` as a parameter, 
then call that function twice.

test/jdk/java/lang/StringBuffer/IndexOf.java line 81:

> 79: sourceBuffer = new StringBuffer(sourceString);
> 80: targetString = generateTestString(10, 11);
> 81: } while (sourceString.indexOf(targetString) != -1);

Should really keep the original test unmodified and add new tests as needed

test/jdk/java/lang/StringBuffer/IndexOf.java line 83:

> 81:   shs = "$&),,18+-!'8)+";
> 82:   endNeedle = "8)-";
> 83:   l_offset = 9;

dead code

test/jdk/java/lang/StringBuffer/IndexOf.java line 89:

> 87:   StringBuffer bshs = new StringBuffer(shs);
> 88: 
> 89:   // printStringBytes(shs.getBytes(hs_charset));

dead code (and next two comments)

test/jdk/java/lang/StringBuffer/IndexOf.java line 90:

> 88: 
> 89:   // printStringBytes(shs.getBytes(hs_charset));
> 90:   for (int i = 0; i < 20; i++) {

This wont be a deterministic way to reach the intrinsic. I would suggest 
copying the idea from 
test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java

i.e. Have two `@run main` invocations at the top of this file, one with default 
parameters, one with  `-Xcomp -XX:-TieredCompilation`. You dont need a 'driver' 
program, that was to handle something else.


/*
 * @test
 * @modules java.base/com.sun.crypto.provider
 * @run main java.base/com.sun.crypto.provider.Poly1305KAT
 * @summary Unit test for com.sun.crypto.provider.Poly1305.
 */

/*
 * @test
 * @modules java.base/com.sun.crypto.provider
 * @summary Unit test for IntrinsicCandidate in 
com.sun.crypto.provider.Poly1305.
 * @run main/othervm -Xcomp -XX:-TieredCompilation 
-XX:+UnlockDiagnosticVMOptions -XX:+ForceUnreachable 
java.base/com.sun.crypto.provider.Poly1305KAT
 */

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

2024-05-13 Thread Volodymyr Paprotski

On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rearrange; add lambdas for clarity

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4492:

> 4490: 
> 4491: // Compare char[] or byte[] arrays aligned to 4 bytes or substrings.
> 4492: void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1,

I liked the old style better, fewer longer lines.. same for rest of the changes 
in this file.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4594:

> 4592: #endif //_LP64
> 4593: bind(COMPARE_WIDE_VECTORS);
> 4594: vmovdqu(vec1, Address(ary1, limit,

create a local scale variable instead of ternary operators. Used several times.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4250:

> 4248:   generate_chacha_stubs();
> 4249: 
> 4250:   if ((UseAVX == 2) && EnableX86ECoreOpts && 
> VM_Version::supports_avx2()) {

Just `if (EnableX86ECoreOpts)`?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 391:

> 389:   }
> 390: 
> 391:   __ cmpq(needle_len, isU ? 2 : 1);

Can we remove this comparison? i.e.
- broadcast first and last character unconditionally (same character). Or
- move broadcasts 'down' into individual cases..
There is already specialized code to handle needle of size 1.. This adds extra 
pathlength. (Will we actually call this intrinsic for needle_size==1? Assume 
length>=2?)

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1365:

> 1363:   // Compare first byte of needle to haystack
> 1364:  vpcmpeq(cmp_0, byte_0, Address(haystack, 0), 
> Assembler::AVX_256bit);
> 1365:   if (size != (isU ? 2 : 1)) {

`if (size != scale)`

Though in this case, `elem_size` might hold more meaning.

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1372:

> 1370: 
> 1371: if (bytesToCompare > 2) {
> 1372:   if (size > (isU ? 4 : 2)) {

`if (size > 2*scale)`?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1373:

> 1371: if (bytesToCompare > 2) {
> 1372:   if (size > (isU ? 4 : 2)) {
> 1373: if (doEarlyBailout) {

Is there a big perf difference when `doEarlyBailout` is enabled? And/or just 
for this function?

(i.e. removing `doEarlyBailout` in this function will mean less pathlength. 
Feels like a few extra vpands should be cheap enough.)

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1469:

> 1467: 
> 1468:   if (isU && (size & 1)) {
> 1469: __ emit_int8(0xcc);

This should also be an `assert()` to catch this at compile-time.

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1633:

> 1631:   if (isU) {
> 1632: if ((size & 1) != 0) {
> 1633:   __ emit_int8(0xcc);

Compile-time assert to ensure this code is never called instead?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1889:

> 1887: //  r13 = (needle length - 1)
> 1888: //  r14 =

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-09 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=08
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]

2024-05-09 Thread Volodymyr Paprotski

On Thu, 9 May 2024 23:36:03 GMT, Anthony Scarpino  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   whitespace
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 701:
> 
>> 699: if (!m.equals(v)) {
>> 700: java.util.HexFormat hex = 
>> java.util.HexFormat.of();
>> 701: throw new RuntimeException();
> 
> I think your cleanup went to far. You should have some message saying they 
> are not equal and if you don't want to print hex, remove getting an instance.

I put the message back.. I removed it 'half'-intentionally; Was comparing 
against the original version and it didn't have  any details, thought maybe 
should follow suit. But I did find this message helpful, so its back.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1596116606

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v8]

2024-05-09 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  add message back

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/1ecfdc44..83b21310

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=07
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=06-07

  Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]

2024-05-09 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/8ff243a2..1ecfdc44

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=06
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=05-06

  Stats: 753 lines in 9 files changed: 303 ins; 101 del; 349 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v6]

2024-05-06 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Use AffinePoint to exit Montgomery domain
  
  Style notes:
  Affine.equals()
  - Mismatched fields only appear to be used from testing, perhaps should 
be moved there instead
  Affine.getX(boolean)|getY(boolean)
  - "Passing flag is bad design" - cleanest/performant alternative to 
several instanceof checks
  - needed to convert Affine to Projective (need to stay in montgomery 
domain)
  ECOperations.PointMultiplier
 - changes could probably be restored to original (since ProjectivePoint 
handling no longer required)
 - consider these changes an improvement? (fewer nested classes)
 - was an inner-class but not using inner-class features (i.e. ecOps 
variable should be converted)

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/a1984501..8ff243a2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=05
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=04-05

  Stats: 268 lines in 7 files changed: 89 ins; 147 del; 32 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v5]

2024-04-25 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/c93a71f0..a1984501

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=04
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=03-04

  Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski

On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 308:
> 
>> 306: 
>> 307: /*
>> 308:  * public Point addition. Used by ECDSAOperations
> 
> Was the old description not applicable anymore?   It would be nice to improve 
> on the existing description that shortening it.

Forgot to go back and fix the comment. Fixed..

As for the 'meaning'. Notice the signature of the function changed (i.e. no 
longer a 'mixed point', but two ProjectivePoints. This is a good idea 
regardless of Montgomery, but it affects montgomery particularly badly (need to 
compute zInv for 'no reason'. )

For sake of completeness. Apart from constructor, the 'API' for ECOperations 
(i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) are these three 
functions (everything else is used internally by this class)

public void setSum(MutablePoint p, MutablePoint p2)
public MutablePoint multiply(AffinePoint affineP, byte[] s)
public MutablePoint multiply(ECPoint ecPoint, byte[] s)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 321:
> 
>> 319: ECOperations ops = this;
>> 320: if (this.montgomeryOps != null) {
>> 321: assert p.getField() instanceof 
>> IntegerMontgomeryFieldModuloP;
> 
> This should throw a ProviderException, I believe this would throw an 
> AssertionException

Missed this comment. No longer applicable (this.montgomeryOps got refactored 
away)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]

2024-04-24 Thread Volodymyr Paprotski

On Tue, 23 Apr 2024 19:55:57 GMT, Anthony Scarpino  
wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Comments from Jatin and Tony
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 204:
> 
>> 202:  * @return the product
>> 203:  */
>> 204: public MutablePoint multiply(AffinePoint affineP, byte[] s) {
> 
> It seems like there could be some combining of both `multiply()`.  If 
> `multiply(AffinePoint, ...)` is called, it can call `DefaultMultiplier` with 
> the `affineP`, but internally call the other `multiply(ECPoint, ...)` for the 
> other situations.  I'd rather not have two methods doing most of the same 
> code, but different methods.

Thanks, they indeed look identical, didnt notice. Fixed. (repeated the same 
hashmap refactoring and didnt notice I produced identical code twice)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 467:
> 
>> 465: sealed static abstract class SmallWindowMultiplier implements 
>> PointMultiplier
>> 466: permits DefaultMultiplier, DefaultMontgomeryMultiplier {
>> 467: private final AffinePoint affineP;
> 
> I don't think `affineP` needs to be a class variable anymore.  It's only used 
> in the constructor

Didn't notice, thanks, fixed.

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 592:
> 
>> 590: }
>> 591: 
>> 592: private final ProjectivePoint.Immutable[][] points;
> 
> Can you define this at the top please.

Done

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 668:
> 
>> 666: }
>> 667: 
>> 668: private final BigInteger[] base;
> 
> Can you define this at the top.  You use it in the constructor but it's 
> defined later on.

Done

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578117929
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578147190
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578148562
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578150303

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski

On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja  wrote:

>> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
>> alignment and loop rotation, but being a fallback figured its more important 
>> to keep it small
>
> It's all part of intrinsic, no harm in polishing it.

Done (normalized loop/backedge). There was actually a problem in the loop 
counter.. (`i-=1` instead of `i-=16`). Can't include a test since classes are 
sealed, but verified manually.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v4]

2024-04-24 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Comments from Tony and Jatin

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/6f9ac046..c93a71f0

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=03
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=02-03

  Stats: 48 lines in 2 files changed: 20 ins; 20 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski

On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:
> 
>> 37: };
>> 38: static address modulus_p256() {
>> 39:   return (address)MODULUS_P256;
> 
> Long constants should have UL suffix.

Properly ULL, but good point, fixed

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:
> 
>> 384:   __ jcc(Assembler::equal, L_Length19);
>> 385: 
>> 386:   // Default copy loop
> 
> Please add appropriate loop entry alignment.

This is actually a 'switch statement default'. The default should never happen 
(See "Known Length comment on line 335"), but added because java code has that 
behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the 
existing standard?)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
> 
>> 392:   __ lea(aLimbs, Address(aLimbs,8));
>> 393:   __ lea(bLimbs, Address(bLimbs,8));
>> 394:   __ jmp(L_DefaultLoop);
> 
> Both sub and cmp are flag affecting instructions and are macro-fusible. 
> By doing a loop rotation i.e. moving the length <= 0 check outside the loop 
> and pushing the loop exit check at bottom you can save additional compare 
> checks.

Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
alignment and loop rotation, but being a fallback figured its more important to 
keep it small

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski

On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Fixed all copyright years

git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read 
line; do echo $line =; grep Copyright $line | grep -v 
2024; done | less

Re splitting.. probably too late for that now.. (did consider it initially.. 
got hard to manage two changes while developing. And easier to justify the 
change when the entire patch is presented? but yes, far more code to review.. )

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski

On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino  
wrote:

>>> In `ECOperations.java`, if I understand this correctly, it is to replace 
>>> the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>>> when I look at the code, I see both are still options. If I read this 
>>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>>> Why doesn't it just replace the old implementation entry in the `fields` 
>>> Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
>> 
>> Thanks @ascarpino 
>> 
>> PS: Perhaps there is some middle ground, remove the `ECOperations 
>> montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
>> inaccessible..) the reference ECOperations in the fuzz test instead.. not 
>> sure how yet, but perhaps worth a further thought..
>
>> > In `ECOperations.java`, if I understand this correctly, it is to replace 
>> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>> > when I look at the code, I see both are still options. If I read this 
>> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>> > Why doesn't it just replace the old implementation entry in the `fields` 
>> > Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
> 
> I hadn't looked at your fuzz test until you mentioned it.  I see you are 
> using reflection to change the values.  Is that what you mean by "fallback"?  
> I'm assuming there is no to access the older implementation without 
> reflection.
> 
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
> 
> I

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]

2024-04-15 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Comments from Jatin and Tony

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/82b6dae7..6f9ac046

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=02
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=01-02

  Stats: 97 lines in 20 files changed: 4 ins; 36 del; 57 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski

On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Thanks Jatin, will fix!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski

On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino  
wrote:

> In `ECOperations.java`, if I understand this correctly, it is to replace the 
> existing `PointMultiplier` with montgomery-based PointMuliplier. But when I 
> look at the code, I see both are still options. If I read this correctly, it 
> checks for the old `IntegerFieldModuloP`, then looks for the new 
> `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why 
> doesn't it just replace the old implementation entry in the `fields` Map? Is 
> there a reason to keep it around?

Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) 
"for `CurveDB.P_256` remove the fallback path to non-montgomery entirely".. 
that might also help in cleaning a few things up in the construction. Maybe 
even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting 
isnt a big deal, but all attempts to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is, is fuzz-testing; 
with the fallback available, I am able to write the included Fuzz tests and 
have them check the values against the existing implementation. While I also 
included a few KAT tests using openssl-generated values, the fuzz tests check 
millions of values and it does add a lot more certainty about correctness of 
this code.

Can it be removed? For the operations that do not involve multiplication (i.e. 
`setSum(*)`), montgomery is expensive. I think I did go through the uses of 
this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing 
IntegerPolynomialP256 is no longer used (I should verify that again) and only 
P256OrderField remains non-montgomery. So removing references to 
IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing 
IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder 
(fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some 
thought..

I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it 
could also be chucked up as part of 'scaffolding' and removed in name of code 
quality?

Thanks @ascarpino 

PS: Perhaps there is some middle ground, remove the `ECOperations 
montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
inaccessible..) the reference ECOperations in the fuzz test instead.. not sure 
how yet, but perhaps worth a further thought..

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-05 Thread Volodymyr Paprotski

On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would 
appreciate your feedback.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski

On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/module-info.java line 265:
> 
>> 263: jdk.jfr,
>> 264: jdk.unsupported,
>> 265: jdk.crypto.ec;
> 
> jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in 
> java.base. So I don't think you need this qualified export.

Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a 
few spots during rebase I guess)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  remove use of jdk.crypto.ec

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=01
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=00-01

  Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

2024-04-02 Thread Volodymyr Paprotski

Performance. Before:

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt ScoreError  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  6443.934 ±  6.491  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3  6152.979 ±  4.954  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3  1895.410 ± 36.979  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3  1878.955 ± 45.487  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  1357.810 ± 26.584  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  1352.119 ± 23.547  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 10.970 
 ops/s

Performance, no intrinsic:

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt Score Error  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  6529.839 ±  42.420  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3  6199.747 ± 133.566  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3  1973.676 ±  54.071  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3  1932.127 ±  35.920  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  1355.788 ± 29.858  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  1346.523 ± 28.722  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 10.591 
 ops/s

Performance, **with intrinsics**

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt  Score Error  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  10384.591 ±  65.274  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3   9592.912 ± 236.411  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3   3479.494 ±  44.578  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3   3402.147 ±  26.772  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  2527.678 ± 64.791  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  2541.258 ± 66.634  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply   true  thrpt3  3021.139 ± 98.289 
 ops/s


Summary on design (see code for 'ASCII art', references and details on math):
- Added a new `IntegerPolynomial` field (`MontgomeryIntegerPolynomialP256`) 
with 52-bit limbs
   - `getElement(*)/fromMontgomery()` to convert numbers into/out of the field
 - `ECOperations` is the primary use of the new field
   - flattened some extra deep nested class hierarchy (also in prep for further 
other field optimizations)
   - `forParameters()/multiply()/setSum()` generates numbers in the new field
 - `ProjectivePoint/Montgomery{Imm|M}utable.asAffine()` to convert out of the 
new field
 - Added Fuzz Testing and KAT verified with OpenSSL

-

Commit messages:
 - remove trailing whitespace
 - Remeasure performance
 - Fix rebase typo
 - Address comments from Anas and thorough cleanup
 - conditionalAssign intrinsic
 - rebase

Changes: https://git.openjdk.org/jdk/pull/18583/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk=18583=00
  Issue: https://bugs.openjdk.org/browse/JDK-8329538
  Stats: 2335 lines in 34 files changed: 2037 ins; 162 del; 136 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

37 matches

Mail list logo