Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v48]
On Thu, 30 May 2024 13:56:30 GMT, Emanuel Peter wrote:

>> Control question: Are we confident with this potentially going into JDK 23
>> or should we rather postpone to JDK 24? The fork is next week.
>
> I would hold off. @asgibbons it may pass our tests, and your extensive
> testing. But you never know what the fuzzer can find over a few weeks once it
> runs with your changes. I have made that experience many times. Let's just
> give it a few days, and then we have one JDK version less to worry about for
> backports on possible follow-up bugs ;)

@eme64 I guess to add some confidence.. we did also 'test it independently' to catch blind spots. i.e. `String/IndexOf.java` is from me. I tried to be as paranoid as possible with non-random strings. Passed everything I could throw at it..

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2139882544
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 17:36:03 GMT, Scott Gibbons wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
>>
>>> 486: __ cmpq(r11, nMinusK);
>>> 487: __ ja_b(L_return);
>>> 488: __ movq(rax, r11);
>>
>> At places where we know that return value in r11 is correct, we don't need to
>> checkRange so this could have its own label.
>
> I don't want to change this because its reason for existence is to ensure we
> don't return a value that's beyond the end of the haystack. We don't yet
> have a good enough test to validate whether we're reading past the end of the
> haystack, so I like this as insurance.

I would recommend an experiment. Disable the range-check and run the String/IndexOf.java test. In particular, run test4(), which is designed exactly to test the reads beyond the end. It won't find all the bad reads, but right now, if there are any failures, they are 'hidden' by this range-check.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617888680
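To make the kind of check discussed above concrete, here is a minimal sketch of an end-of-haystack test. It is not test4() from String/IndexOf.java (that code is not shown in this thread); the class and method names are hypothetical. It can only detect wrong results, not the out-of-bounds reads themselves: it places the needle at the last valid index, and separately lets a needle prefix run off the end of the haystack, comparing `String.indexOf` against a plain reference search.

```java
/** Hypothetical helper, not part of the PR: cross-checks indexOf near the end of the haystack. */
class EndOfHaystackSketch {

    /** Straightforward reference search used as the oracle. */
    static int naiveIndexOf(String haystack, String needle) {
        int n = haystack.length();
        int k = needle.length();
        for (int i = 0; i + k <= n; i++) {
            if (haystack.regionMatches(i, needle, 0, k)) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the number of cases where String.indexOf disagrees with the oracle. */
    static int check(int maxHaystackLen, int maxNeedleLen) {
        int failures = 0;
        for (int n = 1; n <= maxHaystackLen; n++) {
            for (int k = 1; k <= Math.min(n, maxNeedleLen); k++) {
                String needle = "b".repeat(k - 1) + "c";

                // Case 1: the needle ends exactly at the end of the haystack.
                String atEnd = "a".repeat(n - k) + needle;
                if (atEnd.indexOf(needle) != naiveIndexOf(atEnd, needle)) {
                    failures++;
                }

                // Case 2: only the first k-1 needle characters fit before the end,
                // so a correct search must report "not found".
                String nearMiss = "a".repeat(n - k + 1) + "b".repeat(k - 1);
                if (nearMiss.indexOf(needle) != naiveIndexOf(nearMiss, needle)) {
                    failures++;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + check(80, 40));
    }
}
```

Running such a check with the stub's final range-check disabled would show whether that check is masking wrong answers, which is the experiment suggested above.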
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Wed, 22 May 2024 14:50:40 GMT, Scott Gibbons wrote:

>> test/jdk/java/lang/StringBuffer/IndexOf.java line 284:
>>
>>> 282:
>>> 283: // Note: it is possible although highly improbable that failCount will
>>> 284: // be > 0 even if everthing is working ok
>>
>> This sounds like either a bug or a testcase bug? Same as line 301,
>> `extremely remote possibility of > 1 match`?
>
> This was there from the original author. I think they were trying to imply
> that a match could occur in the rare case that the same random string was
> produced. They're random after all, and there's no reason the same sequence
> couldn't be generated.

Makes sense

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613872215
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v37]
On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>> Benchmark                               Score      Latest
>> StringIndexOf.advancedWithMediumSub     343.573    317.934    0.925375393x
>> StringIndexOf.advancedWithShortSub1     1039.081   1053.96    1.014319384x
>> StringIndexOf.advancedWithShortSub2     55.828     110.541    1.980027943x
>> StringIndexOf.constantPattern           9.361      11.906     1.271872663x
>> StringIndexOf.searchCharLongSuccess     4.216      4.218      1.000474383x
>> StringIndexOf.searchCharMediumSuccess   3.133      3.216      1.02649218x
>> StringIndexOf.searchCharShortSuccess    3.76       3.761      1.000265957x
>> StringIndexOf.success                   9.186      9.713      1.057369911x
>> StringIndexOf.successBig                14.341     46.343     3.231504079x
>> StringIndexOfChar.latin1_AVX2_String    6220.918   12154.52   1.953814533x
>> StringIndexOfChar.latin1_AVX2_char      5503.556   5540.044   1.006629895x
>> StringIndexOfChar.latin1_SSE4_String    6978.854   6818.689   0.977049957x
>> StringIndexOfChar.latin1_SSE4_char      5657.499   5474.624   0.967675646x
>> StringIndexOfChar.latin1_Short_String   7132.541   6863.359   0.962260014x
>> StringIndexOfChar.latin1_Short_char     16013.389  16162.437  1.009307711x
>> StringIndexOfChar.latin1_mixed_String   7386.123   14771.622  1.15517x
>> StringIndexOfChar.latin1_mixed_char     9901.671   9782.245   0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   mov64 => lea(InternalAddress)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:

> 4631: andl(result, 0x000f); // tail count (in bytes)
> 4632: andl(limit, 0xfff0); // vector count (in bytes)
> 4633: jcc(Assembler::zero, COMPARE_TAIL);

In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, inside the if (and move the other jcc into the else branch)?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:

> 4637: negptr(limit);
> 4638:
> 4639: bind(COMPARE_WIDE_VECTORS_16);

Understanding-check.. this loop will execute at most 2 times, right? i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks, then byte-by-byte? (Still a good optimization, just trying to understand the scope)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:

> 4716: jmp(TRUE_LABEL);
> 4717: } else {
> 4718: movl(chr, Address(ary1, limit, scaleFactor));

scaleFactor is always Address::times_1 here (expand_ary2==false), might be clearer to change it back

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:

> 55:
> 56: generator = new Random();
> 57: long seed = generator.nextLong();//-5291521104060046276L;

dead code

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:

> 61: /// WARM-UP //
> 62:
> 63: for (int i = 0; i < 2; i++) {

-Xcomp should be a more deterministic (and quicker) way to reach the intrinsic (i.e. like the other tests)

On the other hand, perhaps this doesn't matter? @vnkozlov Understanding-check please.. these tests will run as part of every build from this point till infinity; therefore, a long test will affect every OpenJDK developer. But if this test is not run on every build, then the build time does not matter, and this test can run for as long as it 'wants'.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:

> 158: }
> 159:
> 160: private static String generateTestString(int min, int max) {

I see you have various `Charset[] charSets` above, but this function still only generates LL. Are those separate tests? Or am I missing some concatenation somewhere that will convert the generated string to the correct encoding?

You could have implemented my suggestion from IndexOf.generateTestString here instead, so that the tests that do call this function end up with multiple encodings; i.e. similar to what you already do in the next function. I suppose, with the addition of String/IndexOf.java, that is a moot point.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:

> 183: }
> 184:
> 185: private static int
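As an illustration of the "multiple encodings" point raised for generateTestString, here is a minimal sketch that exercises all four haystack/needle combinations (LL, LU, UL, UU). It is not the code from ECoreIndexOf.java or String/IndexOf.java; every name is hypothetical. It assumes compact strings are enabled (the JDK default), so a string whose characters all fit in one byte is stored as Latin-1 and any character above U+00FF forces the UTF-16 form.

```java
import java.util.Random;

/** Hypothetical sketch: random haystack/needle pairs in every Latin-1/UTF-16 combination. */
class MixedCoderSketch {

    /** Builds a random string whose characters are all Latin-1 or all above U+00FF. */
    static String randomString(Random rnd, int length, boolean utf16) {
        int base = utf16 ? 0x0100 : 0x0020;
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) (base + rnd.nextInt(64)));
        }
        return sb.toString();
    }

    /** Reference search used as the oracle. */
    static int naiveIndexOf(String haystack, String needle) {
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            if (haystack.regionMatches(i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns how often String.indexOf disagrees with the oracle. */
    static int countMismatches(long seed, int rounds) {
        Random rnd = new Random(seed);
        boolean[] coders = {false, true};
        int mismatches = 0;
        for (int r = 0; r < rounds; r++) {
            for (boolean hayUtf16 : coders) {
                for (boolean needleUtf16 : coders) {
                    String needle = randomString(rnd, 1 + rnd.nextInt(20), needleUtf16);
                    String filler = randomString(rnd, 1 + rnd.nextInt(200), hayUtf16);
                    String haystack = filler;
                    // A UTF-16 needle cannot occur in a Latin-1 haystack, so that
                    // combination is searched without planting the needle.
                    if (hayUtf16 || !needleUtf16) {
                        int pos = rnd.nextInt(filler.length() + 1);
                        haystack = filler.substring(0, pos) + needle + filler.substring(pos);
                    }
                    if (haystack.indexOf(needle) != naiveIndexOf(haystack, needle)) {
                        mismatches++;
                    }
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(42L, 200));
    }
}
```

A fixed seed keeps any failure reproducible; run under `-Xcomp`, as suggested for the warm-up loop above, the same calls would reach compiled code without relying on iteration counts.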
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Fri, 17 May 2024 23:59:05 GMT, Scott Gibbons wrote:

>> test/jdk/java/lang/StringBuffer/IndexOf.java line 40:
>>
>>> 38: private static boolean failure = false;
>>> 39: public static void main(String[] args) throws Exception {
>>> 40: String testName = "IndexOf";
>>
>> indentation
>
> Fixed

(missed a `git add`? don't see the updates for this file)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613870558
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Wed, 22 May 2024 14:41:36 GMT, Scott Gibbons wrote:

>> test/micro/org/openjdk/bench/java/lang/StringIndexOfHuge.java line 132:
>>
>>> 130: @Benchmark
>>> 131: public int searchHugeLargeSubstring() {
>>> 132: return dataStringHuge.indexOf("B".repeat(30) + "X" + "A".repeat(30), 74);
>>
>> .repeat() call and string concatenation shouldn't be part of the benchmark
>> (here and several other @Benchmark functions in this file) since it will
>> detract from the measurement.
>>
>> (String concatenation gets converted (by javac) into
>> StringBuilder().append().append().append().toString())
>
> Since we're only concerned with the delta of performance, does this really
> matter? Can you suggest an alternative?

The needle really should be like all the other strings, e.g. `dataStringHuge` itself, generated by the setup.

As to whether it really matters: the answer is Amdahl's law. You can indeed measure the delta, but you can't measure the speedup of just the indexOf; not with repeat and concatenation obscuring the numbers.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613864094
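A minimal sketch of both points follows. The JMH annotations are omitted so that the block compiles without the JMH dependency; in the real benchmark, `setup()` would presumably carry `@Setup` and the search method `@Benchmark`. The haystack contents and the field name `needle` are made up for illustration; only `dataStringHuge` and the needle expression come from the quoted code. `overallSpeedup` is the standard Amdahl's law formula.

```java
/** Hypothetical sketch: needle built once in setup, plus Amdahl's law for the measured delta. */
class HoistedNeedleSketch {
    String dataStringHuge;   // stand-in haystack; the real one is built by the benchmark's setup
    String needle;

    /** Runs once before measurement, so repeat() and concatenation are not timed. */
    void setup() {
        needle = "B".repeat(30) + "X" + "A".repeat(30);
        dataStringHuge = "A".repeat(100) + needle + "A".repeat(100);
    }

    /** The measured body now contains only the indexOf call. */
    int searchHugeLargeSubstring() {
        return dataStringHuge.indexOf(needle, 74);
    }

    /**
     * Amdahl's law: if a fraction p of the measured time is spent in the part
     * that gets faster by a factor s, the whole measurement speeds up by
     * 1 / ((1 - p) + p / s).
     */
    static double overallSpeedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        HoistedNeedleSketch bench = new HoistedNeedleSketch();
        bench.setup();
        System.out.println("index: " + bench.searchHugeLargeSubstring());
        // If indexOf were only half of the measured time, a 2x faster indexOf
        // would show up as roughly a 1.33x improvement.
        System.out.println("observed speedup: " + overallSpeedup(0.5, 2.0));
    }
}
```

The fraction 0.5 is purely illustrative; the actual share of time spent in repeat() and concatenation is not given in the thread.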
Integrated: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic
On Tue, 2 Apr 2024 15:42:05 GMT, Volodymyr Paprotski wrote:

> Performance. Before:
>
> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)  Mode   Cnt  Score     Error     Units
> SignatureBench.ECDSA.sign    SHA256withECDSA  1024        256                      thrpt  3    6443.934  ± 6.491   ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA  16384       256                      thrpt  3    6152.979  ± 4.954   ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  1024        256                      thrpt  3    1895.410  ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  16384       256                      thrpt  3    1878.955  ± 45.487  ops/s
>
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)  Mode   Cnt  Score     Error     Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret   ECDH         256          EC                          thrpt  3    1357.810  ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret  ECDH         256          EC                          thrpt  3    1352.119  ± 23.547  ops/s
>
> Benchmark                          (isMontBench)  Mode   Cnt  Score     Error     Units
> PolynomialP256Bench.benchMultiply  false          thrpt  3    1746.126  ± 10.970  ops/s
>
> Performance, no intrinsic:
>
> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)  Mode   Cnt  Score     Error      Units
> SignatureBench.ECDSA.sign    SHA256withECDSA  1024        256                      thrpt  3    6529.839  ± 42.420   ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA  16384       256                      thrpt  3    6199.747  ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  1024        256                      thrpt  3    1973.676  ± 54.071   ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  16384       256                      thrpt  3    1932.127  ± 35.920   ops/s
>
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)  Mode   Cnt  Score     Error     Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret   ECDH         256          EC                          thrpt  3    1355.788  ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret  ECDH         256          EC                          thrpt  3    1346.523  ± 28.722  ops/s
>
> Benchmark                          (isMontBench)  Mode   Cnt  Score     Error     Units
> PolynomialP256Bench.benchMultiply  true           thrpt  3    1919.574  ± 10.591  ops/s
>
> Performance, **with intrinsics*...

This pull request has now been integrated.

Changeset: afed7d0b
Author:    Volodymyr Paprotski
Committer: Sandhya Viswanathan
URL:       https://git.openjdk.org/jdk/commit/afed7d0b0593864e5595840a6b645c210ff28c7c
Stats:     2409 lines in 36 files changed: 2093 ins; 156 del; 160 mod

8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Reviewed-by: ihse, ascarpino, sviswanathan

-------------

PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski wrote: > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase.
> The incremental webrev excludes the unrelated changes brought in by the
> merge/rebase. The pull request contains 17 additional commits since the
> last revision:
>
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>    Style notes:
>    Affine.equals()
>    - Mismatched fields only appear to be used from testing, perhaps should be moved there instead
>    Affine.getX(boolean)|getY(boolean)
>    - "Passing flag is bad design" - cleanest/performant alternative to several instanceof checks
>    - needed to convert Affine to Projective (need to stay in montgomery domain)
>    ECOperations.PointMultiplier
>    - changes could probably be restored to original (since ProjectivePoint handling no longer required)
>    - consider these changes an improvement? (fewer nested classes)
>    - was an inner-class but not using inner-class features (i.e. ecOps variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/c0032e2c...b1a33004

Thanks Tobi!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124924526
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
On Tue, 21 May 2024 07:21:14 GMT, Tobias Hartmann wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   shenandoah verifier
>
> I'm getting some conflicts when trying to apply this to master. Could you
> please merge the PR?

Hi @TobiHartmann , merged with no issues for me. Could you please run the tests again? (I think Tony did run them, but can't hurt verifying again). Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2123122468
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]
Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 17 additional commits since the last revision:

 - Merge remote-tracking branch 'origin/master' into ecc-montgomery
 - shenandoah verifier
 - comments from Sandhya
 - whitespace
 - add message back
 - whitespace
 - Use AffinePoint to exit Montgomery domain

   Style notes:
   Affine.equals()
   - Mismatched fields only appear to be used from testing, perhaps should be moved there instead
   Affine.getX(boolean)|getY(boolean)
   - "Passing flag is bad design" - cleanest/performant alternative to several instanceof checks
   - needed to convert Affine to Projective (need to stay in montgomery domain)
   ECOperations.PointMultiplier
   - changes could probably be restored to original (since ProjectivePoint handling no longer required)
   - consider these changes an improvement? (fewer nested classes)
   - was an inner-class but not using inner-class features (i.e. ecOps variable should be converted)
 - whitespace
 - Comments from Tony and Jatin
 - Comments from Jatin and Tony
 - ... and 7 more: https://git.openjdk.org/jdk/compare/12e8009b...b1a33004

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/df4fe6fa..b1a33004

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=10-11

  Stats: 190975 lines in 3949 files changed: 105304 ins; 64688 del; 20983 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
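For readers unfamiliar with the "Montgomery domain" mentioned in the commit notes above, here is a minimal textbook sketch using `BigInteger`. It is not the PR's limb-based implementation or the intrinsic; the class and method names are hypothetical, and the choice R = 2^260 is an assumption made purely for illustration. Values are moved into the domain by multiplying with R, multiplied there with a built-in division by R, and moved back out at the end.

```java
import java.math.BigInteger;

/** Hypothetical textbook sketch of Montgomery-domain arithmetic modulo the P-256 prime. */
class MontgomerySketch {
    /** The P-256 field prime: 2^256 - 2^224 + 2^192 + 2^96 - 1. */
    static final BigInteger P = BigInteger.TWO.pow(256)
            .subtract(BigInteger.TWO.pow(224))
            .add(BigInteger.TWO.pow(192))
            .add(BigInteger.TWO.pow(96))
            .subtract(BigInteger.ONE);

    /** Montgomery radix; 2^260 is an assumption chosen for this example only. */
    static final BigInteger R = BigInteger.TWO.pow(260);
    static final BigInteger R_INV = R.modInverse(P);

    /** Enter the Montgomery domain: x -> x * R mod p. */
    static BigInteger toMont(BigInteger x) {
        return x.multiply(R).mod(P);
    }

    /** Exit the Montgomery domain: xm -> xm * R^-1 mod p. */
    static BigInteger fromMont(BigInteger xm) {
        return xm.multiply(R_INV).mod(P);
    }

    /** Montgomery product: (aR)(bR)R^-1 = (ab)R mod p, so the result stays in the domain. */
    static BigInteger montMul(BigInteger am, BigInteger bm) {
        return am.multiply(bm).multiply(R_INV).mod(P);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(12345);
        BigInteger b = BigInteger.valueOf(67890);
        BigInteger viaMont = fromMont(montMul(toMont(a), toMont(b)));
        System.out.println(viaMont.equals(a.multiply(b).mod(P)));
    }
}
```

Because a chain of multiplications can stay in the domain, the conversion cost is paid only on entry and exit, which is why the notes above care about where a point leaves the Montgomery domain.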
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski wrote:
> Volodymyr Paprotski has updated the pull request incrementally with one
> additional commit since the last revision:
>
>   shenandoah verifier

Thanks Sandhya! Now that I have @ascarpino approval as well, I plan to integrate next Tuesday.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118443577
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:

  shenandoah verifier

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/5c360e35..df4fe6fa

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=09-10

  Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   whitespace
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:
>
>> 166: XMMRegister broadcast5 = xmm24;
>> 167: KRegister limb0 = k1;
>> 168: KRegister limb5 = k2;
>
> limb5 and select are not being used anymore.

Thanks, fixed (and also broadcast5)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:
>
>> 183: __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), false, Assembler::AVX_512bit, rscratch);
>> 184:
>> 185: // A = load(*aLimbs)
>
> A little bit more description in comments on what the load step involves
> would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm,
> or in the lowest limb.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:
>
>> 268: __ push(r14);
>> 269: __ push(r15);
>> 270:
>
> No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in
> montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed
> to r8.

Seems I forgot to completely clean up, thanks! (Originally copied from the poly1305 stub)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:
>
>> 284: __ mov(aLimbs, c_rarg0);
>> 285: __ mov(bLimbs, c_rarg1);
>> 286: __ mov(rLimbs, c_rarg2);
>
> We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then
> these moves are not necessary.

Gave them symbolic names and passed the gpr temp and parameter. The vector register map is still in the montgomeryMultiply function, but the gprs are explicitly passed in. 'close enough'?

> src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:
>
>> 1368:
>> 1369: #ifdef _LP64
>> 1370: if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize >= 64) {
>
> No need to tie the intrinsic to MaxVectorSize setting.

Done

> src/hotspot/share/opto/library_call.cpp line 7564:
>
>> 7562:
>> 7563: if (!stubAddr) return false;
>> 7564: if (stopped()) return true;
>
> Line 7564 seems redundant here as there is no range check or anything like
> that before this.

Oh. That is what that is for... I thought it was some sort of 'VM quitting' short-circuit. Removed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v10]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:

  comments from Sandhya

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/8cd095dd..5c360e35

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=08-09

  Stats: 82 lines in 4 files changed: 1 ins; 59 del; 22 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Wed, 15 May 2024 19:21:37 GMT, Volodymyr Paprotski wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>>   Rearrange; add lambdas for clarity
>
> test/jdk/java/lang/StringBuffer/IndexOf.java line 47:
>
>> 45: char[] haystack_16 = new char[128];
>> 46:
>> 47: for (int i = 0; i < 128; i++) {
>
> you can use `char` instead of `int` as iterator

combine into single loop

    haystack[i] = (char) i;
    haystack_16[i] = (char) (i + 256);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1602141543
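Put together, the two suggestions (a `char` iterator and a single loop) could look like the sketch below. The declaration of `haystack` is not visible in the quoted diff, so its type and size here are assumptions, and the wrapper class and method are hypothetical.

```java
/** Hypothetical sketch of the combined initialization loop. */
class HaystackInitSketch {

    /** Returns {haystack, haystack_16}, both filled in one pass. */
    static char[][] build() {
        char[] haystack = new char[128];      // assumed to mirror haystack_16
        char[] haystack_16 = new char[128];

        // With a char iterator, the first assignment needs no cast; the second
        // still does, because i + 256 is promoted to int.
        for (char i = 0; i < 128; i++) {
            haystack[i] = i;
            haystack_16[i] = (char) (i + 256);
        }
        return new char[][] {haystack, haystack_16};
    }

    public static void main(String[] args) {
        char[][] arrays = build();
        System.out.println((int) arrays[0][65] + " " + (int) arrays[1][65]);
    }
}
```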
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark                               Score       Latest      Ratio
>> StringIndexOf.advancedWithMediumSub     343.573     317.934     0.925375393x
>> StringIndexOf.advancedWithShortSub1     1039.081    1053.96     1.014319384x
>> StringIndexOf.advancedWithShortSub2     55.828      110.541     1.980027943x
>> StringIndexOf.constantPattern           9.361       11.906      1.271872663x
>> StringIndexOf.searchCharLongSuccess     4.216       4.218       1.000474383x
>> StringIndexOf.searchCharMediumSuccess   3.133       3.216       1.02649218x
>> StringIndexOf.searchCharShortSuccess    3.76        3.761       1.000265957x
>> StringIndexOf.success                   9.186       9.713       1.057369911x
>> StringIndexOf.successBig                14.341      46.343      3.231504079x
>> StringIndexOfChar.latin1_AVX2_String    6220.918    12154.52    1.953814533x
>> StringIndexOfChar.latin1_AVX2_char      5503.556    5540.044    1.006629895x
>> StringIndexOfChar.latin1_SSE4_String    6978.854    6818.689    0.977049957x
>> StringIndexOfChar.latin1_SSE4_char      5657.499    5474.624    0.967675646x
>> StringIndexOfChar.latin1_Short_String   7132.541    6863.359    0.962260014x
>> StringIndexOfChar.latin1_Short_char     16013.389   16162.437   1.009307711x
>> StringIndexOfChar.latin1_mixed_String   7386.123    14771.622   1.15517x
>> StringIndexOfChar.latin1_mixed_char     9901.671    9782.245    0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Rearrange; add lambdas for clarity

First pass at StringIndexOfHuge.java and IndexOf.java

test/jdk/java/lang/StringBuffer/IndexOf.java line 40:

> 38: private static boolean failure = false;
> 39: public static void main(String[] args) throws Exception {
> 40: String testName = "IndexOf";

indentation

test/jdk/java/lang/StringBuffer/IndexOf.java line 47:

> 45: char[] haystack_16 = new char[128];
> 46:
> 47: for (int i = 0; i < 128; i++) {

you can use `char` instead of `int` as iterator

test/jdk/java/lang/StringBuffer/IndexOf.java line 54:

> 52: // for (int i = 1; i < 128; i++) {
> 53: // haystack_16[i] = (char) (i);
> 54: // }

dead code

test/jdk/java/lang/StringBuffer/IndexOf.java line 64:

> 62: Charset hs_charset = StandardCharsets.UTF_16;
> 63: Charset needleCharset = StandardCharsets.ISO_8859_1;
> 64: // Charset needleCharset = StandardCharsets.UTF_16;

Move from main() into a function that takes `needleCharset` as a parameter, then call that function twice.

test/jdk/java/lang/StringBuffer/IndexOf.java line 81:

> 79: sourceBuffer = new StringBuffer(sourceString);
> 80: targetString = generateTestString(10, 11);
> 81: } while (sourceString.indexOf(targetString) != -1);

Should really keep the original test unmodified and add new tests as needed

test/jdk/java/lang/StringBuffer/IndexOf.java line 83:

> 81: shs = "$&),,18+-!'8)+";
> 82: endNeedle = "8)-";
> 83: l_offset = 9;

dead code

test/jdk/java/lang/StringBuffer/IndexOf.java line 89:

> 87: StringBuffer bshs = new StringBuffer(shs);
> 88:
> 89: // printStringBytes(shs.getBytes(hs_charset));

dead code (and next two comments)

test/jdk/java/lang/StringBuffer/IndexOf.java line 90:

> 88:
> 89: // printStringBytes(shs.getBytes(hs_charset));
> 90: for (int i = 0; i < 20; i++) {

This won't be a deterministic way to reach the intrinsic. I would suggest copying the idea from test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java, i.e. have two `@run main` invocations at the top of this file, one with default parameters, one with `-Xcomp -XX:-TieredCompilation`. You don't need a 'driver' program; that was to handle something else.

/*
 * @test
 * @modules java.base/com.sun.crypto.provider
 * @run main java.base/com.sun.crypto.provider.Poly1305KAT
 * @summary Unit test for com.sun.crypto.provider.Poly1305.
 */

/*
 * @test
 * @modules java.base/com.sun.crypto.provider
 * @summary Unit test for IntrinsicCandidate in com.sun.crypto.provider.Poly1305.
 * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+ForceUnreachable java.base/com.sun.crypto.provider.Poly1305KAT
 */
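For illustration only, here is a minimal sketch of how three of the suggestions above could look together in one test file: two jtreg `@test` blocks (one default, one with `-Xcomp -XX:-TieredCompilation`), a `char` loop variable, and a helper that takes `needleCharset` as a parameter and is called twice. The class name `IndexOfSketch`, the helper `check`, and the `"xyz"` needle are hypothetical and are not taken from the PR; the charset round-trip merely stands in for however the real test builds its needle.

```java
/*
 * @test
 * @summary Hypothetical sketch: run the indexOf checks with default VM parameters.
 * @run main IndexOfSketch
 */

/*
 * @test
 * @summary Hypothetical sketch: same checks with up-front compilation, as suggested in the review.
 * @run main/othervm -Xcomp -XX:-TieredCompilation IndexOfSketch
 */

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class IndexOfSketch {

    // Hypothetical helper: takes the needle charset as a parameter so that
    // main() can call it once per charset instead of commenting lines in and out.
    static int check(Charset needleCharset) {
        char[] haystack = new char[128];
        // A char loop variable avoids the (char) cast needed with an int index.
        for (char c = 0; c < 128; c++) {
            haystack[c] = c;
        }
        String hs = new String(haystack);
        // Round-trip the needle through the charset (illustrative only).
        String needle = new String("xyz".getBytes(needleCharset), needleCharset);
        return hs.indexOf(needle);
    }

    public static void main(String[] args) {
        Charset[] charsets = { StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16 };
        for (Charset cs : charsets) {
            int idx = check(cs);
            // haystack[i] == (char) i, so "xyz" must be found at index 'x'.
            if (idx != 'x') {
                throw new RuntimeException("indexOf returned " + idx + " for needle charset " + cs);
            }
        }
    }
}
```

With this layout jtreg would run `main` twice, once per `@test` block, so no separate driver class is needed.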
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark                               Score       Latest      Ratio
>> StringIndexOf.advancedWithMediumSub     343.573     317.934     0.925375393x
>> StringIndexOf.advancedWithShortSub1     1039.081    1053.96     1.014319384x
>> StringIndexOf.advancedWithShortSub2     55.828      110.541     1.980027943x
>> StringIndexOf.constantPattern           9.361       11.906      1.271872663x
>> StringIndexOf.searchCharLongSuccess     4.216       4.218       1.000474383x
>> StringIndexOf.searchCharMediumSuccess   3.133       3.216       1.02649218x
>> StringIndexOf.searchCharShortSuccess    3.76        3.761       1.000265957x
>> StringIndexOf.success                   9.186       9.713       1.057369911x
>> StringIndexOf.successBig                14.341      46.343      3.231504079x
>> StringIndexOfChar.latin1_AVX2_String    6220.918    12154.52    1.953814533x
>> StringIndexOfChar.latin1_AVX2_char      5503.556    5540.044    1.006629895x
>> StringIndexOfChar.latin1_SSE4_String    6978.854    6818.689    0.977049957x
>> StringIndexOfChar.latin1_SSE4_char      5657.499    5474.624    0.967675646x
>> StringIndexOfChar.latin1_Short_String   7132.541    6863.359    0.962260014x
>> StringIndexOfChar.latin1_Short_char     16013.389   16162.437   1.009307711x
>> StringIndexOfChar.latin1_mixed_String   7386.123    14771.622   1.15517x
>> StringIndexOfChar.latin1_mixed_char     9901.671    9782.245    0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Rearrange; add lambdas for clarity

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4492:

> 4490:
> 4491: // Compare char[] or byte[] arrays aligned to 4 bytes or substrings.
> 4492: void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1,

I liked the old style better, fewer longer lines.. same for rest of the changes in this file.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4594:

> 4592: #endif //_LP64
> 4593: bind(COMPARE_WIDE_VECTORS);
> 4594: vmovdqu(vec1, Address(ary1, limit,

Create a local `scale` variable instead of repeating the ternary operators. It is used several times.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4250:

> 4248: generate_chacha_stubs();
> 4249:
> 4250: if ((UseAVX == 2) && EnableX86ECoreOpts && VM_Version::supports_avx2()) {

Just `if (EnableX86ECoreOpts)`?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 391:

> 389: }
> 390:
> 391: __ cmpq(needle_len, isU ? 2 : 1);

Can we remove this comparison? i.e.
- broadcast first and last character unconditionally (same character). Or
- move broadcasts 'down' into individual cases..

There is already specialized code to handle a needle of size 1. This adds extra pathlength. (Will we actually call this intrinsic for needle_size==1? Assume length>=2?)

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1365:

> 1363: // Compare first byte of needle to haystack
> 1364: vpcmpeq(cmp_0, byte_0, Address(haystack, 0), Assembler::AVX_256bit);
> 1365: if (size != (isU ? 2 : 1)) {

`if (size != scale)`. Though in this case, `elem_size` might hold more meaning.

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1372:

> 1370:
> 1371: if (bytesToCompare > 2) {
> 1372: if (size > (isU ? 4 : 2)) {

`if (size > 2*scale)`?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1373:

> 1371: if (bytesToCompare > 2) {
> 1372: if (size > (isU ? 4 : 2)) {
> 1373: if (doEarlyBailout) {

Is there a big perf difference when `doEarlyBailout` is enabled? And/or just for this function? (i.e. removing `doEarlyBailout` in this function will mean less pathlength. Feels like a few extra vpands should be cheap enough.)

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1469:

> 1467:
> 1468: if (isU && (size & 1)) {
> 1469: __ emit_int8(0xcc);

This should also be an `assert()` to catch this at compile-time.

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1633:

> 1631: if (isU) {
> 1632: if ((size & 1) != 0) {
> 1633: __ emit_int8(0xcc);

Compile-time assert to ensure this code is never called instead?

src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1889:

> 1887: // r13 = (needle length - 1)
> 1888: // r14 =
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
> Performance. Before:
>
> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)  Mode   Cnt  Score       Error    Units
> SignatureBench.ECDSA.sign    SHA256withECDSA  1024        256                      thrpt  3    6443.934  ± 6.491    ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA  16384       256                      thrpt  3    6152.979  ± 4.954    ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  1024        256                      thrpt  3    1895.410  ± 36.979   ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  16384       256                      thrpt  3    1878.955  ± 45.487   ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)  Mode   Cnt  Score       Error    Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret   ECDH         256          EC                          thrpt  3    1357.810  ± 26.584   ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret  ECDH         256          EC                          thrpt  3    1352.119  ± 23.547   ops/s
> Benchmark                          (isMontBench)  Mode   Cnt  Score       Error    Units
> PolynomialP256Bench.benchMultiply  false          thrpt  3    1746.126  ± 10.970   ops/s
>
> Performance, no intrinsic:
>
> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)  Mode   Cnt  Score       Error    Units
> SignatureBench.ECDSA.sign    SHA256withECDSA  1024        256                      thrpt  3    6529.839  ± 42.420   ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA  16384       256                      thrpt  3    6199.747  ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  1024        256                      thrpt  3    1973.676  ± 54.071   ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA  16384       256                      thrpt  3    1932.127  ± 35.920   ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)  Mode   Cnt  Score       Error    Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret   ECDH         256          EC                          thrpt  3    1355.788  ± 29.858   ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret  ECDH         256          EC                          thrpt  3    1346.523  ± 28.722   ops/s
> Benchmark                          (isMontBench)  Mode   Cnt  Score       Error    Units
> PolynomialP256Bench.benchMultiply  true           thrpt  3    1919.574  ± 10.591   ops/s
>
> Performance, **with intrinsics*...
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=08 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]
On Thu, 9 May 2024 23:36:03 GMT, Anthony Scarpino wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   whitespace
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 701:
>
>> 699: if (!m.equals(v)) {
>> 700: java.util.HexFormat hex = java.util.HexFormat.of();
>> 701: throw new RuntimeException();
>
> I think your cleanup went too far. You should have some message saying they
> are not equal and if you don't want to print hex, remove getting an instance.

I put the message back. I removed it 'half'-intentionally: I was comparing against the original version, which didn't have any details, and thought maybe I should follow suit. But I did find this message helpful, so it's back.

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1596116606
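As a rough sketch of the kind of message being discussed, assuming the two compared values can be rendered as byte arrays (the real types of `m` and `v` are not shown in this thread), a `HexFormat` instance could be used like this. The class and method names are hypothetical, and `java.util.HexFormat` requires JDK 17 or later.

```java
import java.util.Arrays;
import java.util.HexFormat;

class MismatchMessageSketch {

    // Hypothetical helper: builds a diagnostic string with both values in hex.
    static String mismatchMessage(byte[] expected, byte[] actual) {
        HexFormat hex = HexFormat.of(); // lowercase digits, no delimiter
        return "values are not equal: expected=" + hex.formatHex(expected)
                + " actual=" + hex.formatHex(actual);
    }

    // Hypothetical check: throws with the detailed message instead of an empty exception.
    static void requireEqual(byte[] expected, byte[] actual) {
        if (!Arrays.equals(expected, actual)) {
            throw new RuntimeException(mismatchMessage(expected, actual));
        }
    }
}
```

If the hex dump is not wanted, the alternative raised in the review is simply to drop the `HexFormat` instance and keep a plain "not equal" message.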
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v8]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: add message back - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/1ecfdc44..83b21310 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=07 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=06-07 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/8ff243a2..1ecfdc44 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=06 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=05-06 Stats: 753 lines in 9 files changed: 303 ins; 101 del; 349 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v6]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:

  Use AffinePoint to exit Montgomery domain

  Style notes:
  Affine.equals()
  - Mismatched fields only appear to be used from testing, perhaps should be moved there instead
  Affine.getX(boolean)|getY(boolean)
  - "Passing flag is bad design"
  - cleanest/performant alternative to several instanceof checks
  - needed to convert Affine to Projective (need to stay in montgomery domain)
  ECOperations.PointMultiplier
  - changes could probably be restored to original (since ProjectivePoint handling no longer required)
  - consider these changes an improvement? (fewer nested classes)
  - was an inner-class but not using inner-class features (i.e. ecOps variable should be converted)

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/a1984501..8ff243a2

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk=18583=05
  - incr: https://webrevs.openjdk.org/?repo=jdk=18583=04-05

Stats: 268 lines in 7 files changed: 89 ins; 147 del; 32 mod
Patch: https://git.openjdk.org/jdk/pull/18583.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v5]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/c93a71f0..a1984501 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=04 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=03-04 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 308:
>
>> 306:
>> 307: /*
>> 308: * public Point addition. Used by ECDSAOperations
>
> Was the old description not applicable anymore? It would be nice to improve
> on the existing description rather than shortening it.

Forgot to go back and fix the comment. Fixed.

As for the 'meaning': notice the signature of the function changed (i.e. no longer a 'mixed point', but two ProjectivePoints). This is a good idea regardless of Montgomery, but it affects Montgomery particularly badly (need to compute zInv for 'no reason').

For the sake of completeness: apart from the constructor, the 'API' for ECOperations (i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) consists of these three functions (everything else is used internally by this class):

    public void setSum(MutablePoint p, MutablePoint p2)
    public MutablePoint multiply(AffinePoint affineP, byte[] s)
    public MutablePoint multiply(ECPoint ecPoint, byte[] s)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 321:
>
>> 319: ECOperations ops = this;
>> 320: if (this.montgomeryOps != null) {
>> 321: assert p.getField() instanceof IntegerMontgomeryFieldModuloP;
>
> This should throw a ProviderException, I believe this would throw an
> AssertionError

Missed this comment. No longer applicable (this.montgomeryOps got refactored away)

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140
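The `assert` versus `ProviderException` point can be sketched as follows. This is not the code from the PR (the check was refactored away, as noted above); the `Field`, `MontgomeryField` and `PlainField` types are hypothetical stand-ins. The relevant Java behaviour is that an `assert` only fires when assertions are enabled (`-ea`) and then raises `AssertionError`, whereas an explicit check always runs and can throw a provider-level exception.

```java
import java.security.ProviderException;

class FieldCheckSketch {

    // Hypothetical stand-ins for the field types mentioned in the review.
    interface Field {}
    static final class MontgomeryField implements Field {}
    static final class PlainField implements Field {}

    // Explicit check: runs regardless of -ea and reports a ProviderException.
    static void requireMontgomery(Field f) {
        if (!(f instanceof MontgomeryField)) {
            throw new ProviderException(
                    "Expected a Montgomery field, got " + f.getClass().getSimpleName());
        }
    }

    // Assertion-based check: a no-op unless the JVM runs with assertions enabled.
    static void assertMontgomery(Field f) {
        assert f instanceof MontgomeryField : "Expected a Montgomery field";
    }
}
```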
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]
On Tue, 23 Apr 2024 19:55:57 GMT, Anthony Scarpino wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   Comments from Jatin and Tony
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 204:
>
>> 202: * @return the product
>> 203: */
>> 204: public MutablePoint multiply(AffinePoint affineP, byte[] s) {
>
> It seems like there could be some combining of both `multiply()`. If
> `multiply(AffinePoint, ...)` is called, it can call `DefaultMultiplier` with
> the `affineP`, but internally call the other `multiply(ECPoint, ...)` for the
> other situations. I'd rather not have two methods doing most of the same
> code, but different methods.

Thanks, they indeed look identical; I didn't notice. Fixed. (I repeated the same hashmap refactoring and didn't notice that I had produced identical code twice.)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 467:
>
>> 465: sealed static abstract class SmallWindowMultiplier implements PointMultiplier
>> 466: permits DefaultMultiplier, DefaultMontgomeryMultiplier {
>> 467: private final AffinePoint affineP;
>
> I don't think `affineP` needs to be a class variable anymore. It's only used
> in the constructor

Didn't notice, thanks, fixed.

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 592:
>
>> 590: }
>> 591:
>> 592: private final ProjectivePoint.Immutable[][] points;
>
> Can you define this at the top please.

Done

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 668:
>
>> 666: }
>> 667:
>> 668: private final BigInteger[] base;
>
> Can you define this at the top. You use it in the constructor but it's
> defined later on.

Done

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578117929
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578147190
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578148562
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578150303
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja wrote:

>> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add
>> alignment and loop rotation, but being a fallback figured it's more important
>> to keep it small
>
> It's all part of intrinsic, no harm in polishing it.

Done (normalized loop/backedge). There was actually a problem in the loop counter (`i-=1` instead of `i-=16`). I can't include a test since the classes are sealed, but I verified it manually.

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v4]
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Comments from Tony and Jatin - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/6f9ac046..c93a71f0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=03 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=02-03 Stats: 48 lines in 2 files changed: 20 ins; 20 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   remove use of jdk.crypto.ec
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:
>
>> 37: };
>> 38: static address modulus_p256() {
>> 39: return (address)MODULUS_P256;
>
> Long constants should have UL suffix.

Properly ULL, but good point, fixed

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:
>
>> 384: __ jcc(Assembler::equal, L_Length19);
>> 385:
>> 386: // Default copy loop
>
> Please add appropriate loop entry alignment.

This is actually a 'switch statement default'. The default should never happen (see the "Known Length" comment on line 335), but it was added because the Java code has that behavior (i.e. in the unlikely case that NIST adds a new elliptic curve to the existing standard?).

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
>
>> 392: __ lea(aLimbs, Address(aLimbs,8));
>> 393: __ lea(bLimbs, Address(bLimbs,8));
>> 394: __ jmp(L_DefaultLoop);
>
> Both sub and cmp are flag affecting instructions and are macro-fusible.
> By doing a loop rotation i.e. moving the length <= 0 check outside the loop
> and pushing the loop exit check at bottom you can save additional compare
> checks.

Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add alignment and loop rotation, but being a fallback I figured it's more important to keep it small

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848
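The loop rotation suggested above can be illustrated at the source level, with the caveat that the actual stub is hand-written assembly and that a JIT may perform this transformation on Java code by itself. The sketch below only shows the control-flow shape: the top-tested form checks the condition at the top and jumps back unconditionally, while the rotated form checks `length <= 0` once before the loop and puts the exit test at the bottom. Class and method names are hypothetical.

```java
class LoopRotationSketch {

    // Top-tested copy loop: compare at the top, unconditional jump back at the bottom.
    static void copyTopTested(long[] src, long[] dst, int length) {
        int i = 0;
        while (i < length) {
            dst[i] = src[i];
            i++;
        }
    }

    // Rotated copy loop: the length <= 0 check runs once outside the loop,
    // and the exit test sits at the bottom as the only branch per iteration.
    static void copyRotated(long[] src, long[] dst, int length) {
        if (length <= 0) {
            return;
        }
        int i = 0;
        do {
            dst[i] = src[i];
            i++;
        } while (i < length);
    }
}
```

Both methods copy the same elements; the rotated form trades a slightly larger code footprint (the extra guard) for one fewer branch per iteration, which is the size-versus-polish trade-off debated in this exchange.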
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski wrote:

> Few early comments.
>
> Please update the copyright year of all the modified files.
>
> You can even consider splitting this into two patches, Java side changes in
> one and x86 optimized intrinsic in next one.

Fixed all copyright years

    git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read line; do echo $line =; grep Copyright $line | grep -v 2024; done | less

Re splitting: probably too late for that now. (I did consider it initially, but it got hard to manage two changes while developing. And it is easier to justify the change when the entire patch is presented? But yes, far more code to review.)

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino wrote:

>>> In `ECOperations.java`, if I understand this correctly, it is to replace
>>> the existing `PointMultiplier` with montgomery-based PointMultiplier. But
>>> when I look at the code, I see both are still options. If I read this
>>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the
>>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always.
>>> Why doesn't it just replace the old implementation entry in the `fields`
>>> Map? Is there a reason to keep it around?
>>
>> Hmm, that's a good point I haven't fully considered; i.e. (if I read
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery
>> entirely".. that might also help in cleaning a few things up in the
>> construction. Maybe even get rid of this nested ECOperations inside
>> ECOperations.. Perhaps nesting isn't a big deal, but any attempt to make the
>> ECC stack clearer is positive!
>>
>> One functional reason that might justify keeping it as-is, is fuzz-testing;
>> with the fallback available, I am able to write the included Fuzz tests and
>> have them check the values against the existing implementation. While I also
>> included a few KAT tests using openssl-generated values, the fuzz tests
>> check millions of values and it does add a lot more certainty about
>> correctness of this code.
>>
>> Can it be removed? For the operations that do not involve multiplication
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and
>> existing IntegerPolynomialP256 is no longer used (I should verify that
>> again) and only P256OrderField remains non-montgomery. So removing
>> references to IntegerPolynomialP256 in ECOperations should be possible and
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also
>> worth some thought..
>>
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but
>> it could also be chalked up as part of 'scaffolding' and removed in name of
>> code quality?
>>
>> Thanks @ascarpino
>>
>> PS: Perhaps there is some middle ground, remove the `ECOperations
>> montgomeryOps` nesting, and construct (somehow?? singleton makes most things
>> inaccessible..) the reference ECOperations in the fuzz test instead.. not
>> sure how yet, but perhaps worth a further thought..
>
>> > In `ECOperations.java`, if I understand this correctly, it is to replace
>> > the existing `PointMultiplier` with montgomery-based PointMultiplier. But
>> > when I look at the code, I see both are still options. If I read this
>> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the
>> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always.
>> > Why doesn't it just replace the old implementation entry in the `fields`
>> > Map? Is there a reason to keep it around?
>>
>> Hmm, that's a good point I haven't fully considered; i.e. (if I read
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery
>> entirely".. that might also help in cleaning a few things up in the
>> construction. Maybe even get rid of this nested ECOperations inside
>> ECOperations.. Perhaps nesting isn't a big deal, but any attempt to make the
>> ECC stack clearer is positive!
>>
>> One functional reason that might justify keeping it as-is, is fuzz-testing;
>> with the fallback available, I am able to write the included Fuzz tests and
>> have them check the values against the existing implementation. While I also
>> included a few KAT tests using openssl-generated values, the fuzz tests
>> check millions of values and it does add a lot more certainty about
>> correctness of this code.
>
> I hadn't looked at your fuzz test until you mentioned it. I see you are
> using reflection to change the values. Is that what you mean by "fallback"?
> I'm assuming there is no way to access the older implementation without
> reflection.
>
>>
>> Can it be removed? For the operations that do not involve multiplication
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and
>> existing IntegerPolynomialP256 is no longer used (I should verify that
>> again) and only P256OrderField remains non-montgomery. So removing
>> references to IntegerPolynomialP256 in ECOperations should be possible and
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also
>> worth some thought..
>>
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but
>> it could also be chalked up as part of 'scaffolding' and removed in name of
>> code quality?
>
> I
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]
> Performance. Before:
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6443.934 ±   6.491  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6152.979 ±   4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1895.410 ±  36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1878.955 ±  45.487  ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1357.810 ±  26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1352.119 ±  23.547  ops/s
> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
> PolynomialP256Bench.benchMultiply          false  thrpt    3   1746.126 ±  10.970  ops/s
>
> Performance, no intrinsic:
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1932.127 ±  35.920  ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1355.788 ±  29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1346.523 ±  28.722  ops/s
> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    3   1919.574 ±  10.591  ops/s
>
> Performance, **with intrinsics*...
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:

  Comments from Jatin and Tony

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/82b6dae7..6f9ac046

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=02
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=01-02

  Stats: 97 lines in 20 files changed: 4 ins; 36 del; 57 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja wrote: > Few early comments. > > Please update the copyright year of all the modified files. > > You can even consider splitting this into two patches, Java side changes in > one and x86 optimized intrinsic in next one. Thanks Jatin, will fix! - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino wrote:

> In `ECOperations.java`, if I understand this correctly, it is to replace the
> existing `PointMultiplier` with a Montgomery-based `PointMultiplier`. But when
> I look at the code, I see both are still options. If I read this correctly, it
> checks for the old `IntegerFieldModuloP`, then looks for the new
> `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why
> doesn't it just replace the old implementation entry in the `fields` Map? Is
> there a reason to keep it around?

Hmm, that's a good point I haven't fully considered; i.e. (if I read correctly) "for `CurveDB.P_256` remove the fallback path to non-Montgomery entirely".. that might also help in cleaning a few things up in the construction. Maybe even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting isn't a big deal, but any attempt to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is is fuzz testing; with the fallback available, I am able to write the included fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using OpenSSL-generated values, the fuzz tests check millions of values and add a lot more certainty about the correctness of this code.

Can it be removed? For the operations that do not involve multiplication (i.e. `setSum(*)`), Montgomery is expensive. I think I did go through the uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and the existing IntegerPolynomialP256 is no longer used (I should verify that again) and only P256OrderField remains non-Montgomery. So removing references to IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some thought..

I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it could also be chalked up as part of 'scaffolding' and removed in the name of code quality?

Thanks @ascarpino

PS: Perhaps there is some middle ground: remove the `ECOperations montgomeryOps` nesting, and construct (somehow?? the singleton makes most things inaccessible..) the reference ECOperations in the fuzz test instead.. not sure how yet, but perhaps worth a further thought..

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645
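The differential-testing idea described above (check the new Montgomery path against an independent reference on many random inputs) can be sketched in a few lines. This is only an illustration, not the code in `ECOperationsFuzzTest.java`: the class `MontgomeryFuzzSketch` and its methods are hypothetical, both sides are modelled with `BigInteger`, and the radix R = 2^260 is an assumption taken from the "5 limbs of 52 bits" layout rather than from the PR sources.

```java
import java.math.BigInteger;
import java.util.Random;

// Hypothetical sketch: differential fuzzing of a Montgomery-domain multiply
// against a plain modular multiply. Not the PR's actual test.
public class MontgomeryFuzzSketch {
    // P-256 field prime: 2^256 - 2^224 + 2^192 + 2^96 - 1
    static final BigInteger P = BigInteger.ONE.shiftLeft(256)
            .subtract(BigInteger.ONE.shiftLeft(224))
            .add(BigInteger.ONE.shiftLeft(192))
            .add(BigInteger.ONE.shiftLeft(96))
            .subtract(BigInteger.ONE);

    // Assumed Montgomery radix R = 2^260 (5 limbs x 52 bits).
    static final BigInteger R = BigInteger.ONE.shiftLeft(260);
    static final BigInteger R_INV = R.modInverse(P);

    // Map a into the Montgomery domain: a * R mod p.
    static BigInteger toMontgomery(BigInteger a) {
        return a.multiply(R).mod(P);
    }

    // Map back out of the Montgomery domain: aR * R^-1 mod p.
    static BigInteger fromMontgomery(BigInteger aR) {
        return aR.multiply(R_INV).mod(P);
    }

    // Montgomery product: (aR * bR) * R^-1 mod p, which equals (a*b)R mod p.
    static BigInteger montMultiply(BigInteger aR, BigInteger bR) {
        return aR.multiply(bR).multiply(R_INV).mod(P);
    }

    // Runs the comparison on random inputs and returns the mismatch count.
    static int fuzz(long seed, int iterations) {
        Random rnd = new Random(seed);
        int mismatches = 0;
        for (int i = 0; i < iterations; i++) {
            BigInteger a = new BigInteger(256, rnd).mod(P);
            BigInteger b = new BigInteger(256, rnd).mod(P);
            BigInteger expected = a.multiply(b).mod(P);          // reference path
            BigInteger actual = fromMontgomery(
                    montMultiply(toMontgomery(a), toMontgomery(b))); // path under test
            if (!expected.equals(actual)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + fuzz(42L, 10_000));
    }
}
```

In a real fuzz test the "path under test" would be the optimized implementation and the reference would be the older one, which is why keeping a fallback reachable is convenient.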
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote:

>> Performance. Before:
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6443.934 ±   6.491  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6152.979 ±   4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1895.410 ±  36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1878.955 ±  45.487  ops/s
>> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1357.810 ±  26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1352.119 ±  23.547  ops/s
>> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
>> PolynomialP256Bench.benchMultiply          false  thrpt    3   1746.126 ±  10.970  ops/s
>>
>> Performance, no intrinsic:
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1932.127 ±  35.920  ops/s
>> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1355.788 ±  29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1346.523 ±  28.722  ops/s
>> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    3   1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one
> additional commit since the last revision:
>
>   remove use of jdk.crypto.ec

@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would appreciate your feedback.

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/module-info.java line 265:
>
>> 263: jdk.jfr,
>> 264: jdk.unsupported,
>> 265: jdk.crypto.ec;
>
> jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in
> java.base. So I don't think you need this qualified export.

Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a few spots during rebase I guess)

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
> Performance. Before:
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6443.934 ±   6.491  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6152.979 ±   4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1895.410 ±  36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1878.955 ±  45.487  ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1357.810 ±  26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1352.119 ±  23.547  ops/s
> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
> PolynomialP256Bench.benchMultiply          false  thrpt    3   1746.126 ±  10.970  ops/s
>
> Performance, no intrinsic:
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1932.127 ±  35.920  ops/s
> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1355.788 ±  29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1346.523 ±  28.722  ops/s
> Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    3   1919.574 ±  10.591  ops/s
>
> Performance, **with intrinsics*...
Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:

  remove use of jdk.crypto.ec

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=01
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=00-01

  Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic
Performance. Before:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6443.934 ±   6.491  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6152.979 ±   4.954  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1895.410 ±  36.979  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1878.955 ±  45.487  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1357.810 ±  26.584  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1352.119 ±  23.547  ops/s
Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
PolynomialP256Bench.benchMultiply          false  thrpt    3   1746.126 ±  10.970  ops/s

Performance, no intrinsic:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3   6529.839 ±  42.420  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   6199.747 ± 133.566  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   1973.676 ±  54.071  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   1932.127 ±  35.920  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   1355.788 ±  29.858  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   1346.523 ±  28.722  ops/s
Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3   1919.574 ±  10.591  ops/s

Performance, **with intrinsics**

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  10384.591 ±  65.274  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   9592.912 ± 236.411  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   3479.494 ±  44.578  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   3402.147 ±  26.772  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt      Score     Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3   2527.678 ±  64.791  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3   2541.258 ±  66.634  ops/s
Benchmark                          (isMontBench)   Mode  Cnt      Score     Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3   3021.139 ±  98.289  ops/s

Summary on design (see code for 'ASCII art', references and details on math):
- Added a new `IntegerPolynomial` field (`MontgomeryIntegerPolynomialP256`) with 52-bit limbs
  - `getElement(*)/fromMontgomery()` to convert numbers into/out of the field
- `ECOperations` is the primary use of the new field
  - flattened some extra deep nested class hierarchy (also in prep for further other field optimizations)
  - `forParameters()/multiply()/setSum()` generates numbers in the new field
  - `ProjectivePoint/Montgomery{Imm|M}utable.asAffine()` to convert out of the new field
- Added Fuzz Testing and KAT verified with OpenSSL

Commit messages:
 - remove trailing whitespace
 - Remeasure performance
 - Fix rebase typo
 - Address comments from Anas and thorough cleanup
 - conditionalAssign intrinsic
 - rebase

Changes: https://git.openjdk.org/jdk/pull/18583/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8329538
  Stats: 2335 lines in 34 files changed: 2037 ins; 162 del; 136 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583
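To make the "52-bit limbs" point in the design summary concrete, here is a minimal sketch of splitting a value of up to 260 bits into five 52-bit limbs and recombining it. The class `LimbSketch`, the little-endian limb order, and the use of `BigInteger` for conversion are all assumptions made for illustration; the actual `MontgomeryIntegerPolynomialP256` code may lay things out differently.

```java
import java.math.BigInteger;

// Hypothetical sketch of a 5 x 52-bit limb representation (not the PR's code).
public class LimbSketch {
    static final int BITS_PER_LIMB = 52;
    static final int NUM_LIMBS = 5;                        // 5 * 52 = 260 bits >= 256
    static final long LIMB_MASK = (1L << BITS_PER_LIMB) - 1;

    // Split a non-negative value into limbs, least significant limb first (assumed order).
    static long[] toLimbs(BigInteger v) {
        if (v.signum() < 0 || v.bitLength() > BITS_PER_LIMB * NUM_LIMBS) {
            throw new IllegalArgumentException("value out of range");
        }
        long[] limbs = new long[NUM_LIMBS];
        BigInteger mask = BigInteger.valueOf(LIMB_MASK);
        for (int i = 0; i < NUM_LIMBS; i++) {
            limbs[i] = v.shiftRight(i * BITS_PER_LIMB).and(mask).longValue();
        }
        return limbs;
    }

    // Recombine limbs into a single value: sum of limbs[i] * 2^(52*i).
    static BigInteger fromLimbs(long[] limbs) {
        BigInteger v = BigInteger.ZERO;
        for (int i = limbs.length - 1; i >= 0; i--) {
            v = v.shiftLeft(BITS_PER_LIMB).add(BigInteger.valueOf(limbs[i]));
        }
        return v;
    }

    public static void main(String[] args) {
        BigInteger x = BigInteger.ONE.shiftLeft(256).subtract(BigInteger.ONE);
        long[] limbs = toLimbs(x);
        System.out.println("round trip ok: " + fromLimbs(limbs).equals(x));
    }
}
```

One commonly cited reason for choosing limbs narrower than 64 bits is that each `long` keeps spare high bits, so a few additions can accumulate before carries have to be propagated; whether that is the motivation here is not stated in the description.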