Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/java.base/share/classes/sun/security/ec/ECOperations.java line 308: > >> 306: >> 307: /* >> 308: * public Point addition. Used by ECDSAOperations > > Was the old description not applicable anymore? It would be nice to improve > on the existing description that shortening it. Forgot to go back and fix the comment. Fixed.. As for the 'meaning'. Notice the signature of the function changed (i.e. no longer a 'mixed point', but two ProjectivePoints. This is a good idea regardless of Montgomery, but it affects montgomery particularly badly (need to compute zInv for 'no reason'. ) For sake of completeness. Apart from constructor, the 'API' for ECOperations (i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) are these three functions (everything else is used internally by this class) public void setSum(MutablePoint p, MutablePoint p2) public MutablePoint multiply(AffinePoint affineP, byte[] s) public MutablePoint multiply(ECPoint ecPoint, byte[] s) > src/java.base/share/classes/sun/security/ec/ECOperations.java line 321: > >> 319: ECOperations ops = this; >> 320: if (this.montgomeryOps != null) { >> 321: assert p.getField() instanceof >> IntegerMontgomeryFieldModuloP; > > This should throw a ProviderException, I believe this would throw an > AssertionException Missed this comment. No longer applicable (this.montgomeryOps got refactored away) - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja wrote: >> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add >> alignment and loop rotation, but being a fallback figured its more important >> to keep it small&readable... > > It's all part of intrinsic, no harm in polishing it. Done (normalized loop/backedge). There was actually a problem in the loop counter.. (`i-=1` instead of `i-=16`). Can't include a test since classes are sealed, but verified manually. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec src/java.base/share/classes/sun/security/ec/ECOperations.java line 308: > 306: > 307: /* > 308: * public Point addition. Used by ECDSAOperations Was the old description not applicable anymore? It would be nice to improve on the existing description that shortening it. src/java.base/share/classes/sun/security/ec/ECOperations.java line 321: > 319: ECOperations ops = this; > 320: if (this.montgomeryOps != null) { > 321: assert p.getField() instanceof IntegerMontgomeryFieldModuloP; This should throw a ProviderException, I believe this would throw an AssertionException - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1556740469 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1558824543
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Mon, 15 Apr 2024 22:04:14 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394: >> >>> 392: __ lea(aLimbs, Address(aLimbs,8)); >>> 393: __ lea(bLimbs, Address(bLimbs,8)); >>> 394: __ jmp(L_DefaultLoop); >> >> Both sub and cmp are flag affecting instructions and are macro-fusible. >> By doing a loop rotation i.e. moving the length <= 0 check outside the loop >> and pushing the loop exit check at bottom you can save additional compare >> checks. > > Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add > alignment and loop rotation, but being a fallback figured its more important > to keep it small&readable... It's all part of intrinsic, no harm in polishing it. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566630667
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39: > >> 37: }; >> 38: static address modulus_p256() { >> 39: return (address)MODULUS_P256; > > Long constants should have UL suffix. Properly ULL, but good point, fixed > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386: > >> 384: __ jcc(Assembler::equal, L_Length19); >> 385: >> 386: // Default copy loop > > Please add appropriate loop entry alignment. This is actually a 'switch statement default'. The default should never happen (See "Known Length comment on line 335"), but added because java code has that behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the existing standard?) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394: > >> 392: __ lea(aLimbs, Address(aLimbs,8)); >> 393: __ lea(bLimbs, Address(bLimbs,8)); >> 394: __ jmp(L_DefaultLoop); > > Both sub and cmp are flag affecting instructions and are macro-fusible. > By doing a loop rotation i.e. moving the length <= 0 check outside the loop > and pushing the loop exit check at bottom you can save additional compare > checks. Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add alignment and loop rotation, but being a fallback figured its more important to keep it small&readable... - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski wrote: > Few early comments. > > Please update the copyright year of all the modified files. > > You can even consider splitting this into two patches, Java side changes in > one and x86 optimized intrinsic in next one. Fixed all copyright years git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read line; do echo $line =; grep Copyright $line | grep -v 2024; done | less Re splitting.. probably too late for that now.. (did consider it initially.. got hard to manage two changes while developing. And easier to justify the change when the entire patch is presented? but yes, far more code to review.. ) - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino wrote: >>> In `ECOperations.java`, if I understand this correctly, it is to replace >>> the existing `PointMultiplier` with montgomery-based PointMuliplier. But >>> when I look at the code, I see both are still options. If I read this >>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the >>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. >>> Why doesn't it just replace the old implementation entry in the `fields` >>> Map? Is there a reason to keep it around? >> >> Hmm, thats a good point I haven't fully considered; i.e. (if I read >> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery >> entirely".. that might also help in cleaning a few things up in the >> construction. Maybe even get rid of this nested ECOperations inside >> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the >> ECC stack clearer is positive! >> >> One functional reason that might justify keeping it as-is, is fuzz-testing; >> with the fallback available, I am able to write the included Fuzz tests and >> have them check the values against the existing implementation. While I also >> included a few KAT tests using openssl-generated values, the fuzz tests >> check millions of values and it does add a lot more certainty about >> correctness of this code. >> >> Can it be removed? For the operations that do not involve multiplication >> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the >> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and >> existing IntegerPolynomialP256 is no longer used (I should verify that >> again) and only P256OrderField remains non-montgomery. So removing >> references to IntegerPolynomialP256 in ECOperations should be possible and >> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 >> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also >> worth some thought.. >> >> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but >> it could also be chucked up as part of 'scaffolding' and removed in name of >> code quality? >> >> Thanks @ascarpino >> >> PS: Perhaps there is some middle ground, remove the `ECOperations >> montgomeryOps` nesting, and construct (somehow?? singleton makes most things >> inaccessible..) the reference ECOperations in the fuzz test instead.. not >> sure how yet, but perhaps worth a further thought.. > >> > In `ECOperations.java`, if I understand this correctly, it is to replace >> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But >> > when I look at the code, I see both are still options. If I read this >> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the >> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. >> > Why doesn't it just replace the old implementation entry in the `fields` >> > Map? Is there a reason to keep it around? >> >> Hmm, thats a good point I haven't fully considered; i.e. (if I read >> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery >> entirely".. that might also help in cleaning a few things up in the >> construction. Maybe even get rid of this nested ECOperations inside >> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the >> ECC stack clearer is positive! >> >> One functional reason that might justify keeping it as-is, is fuzz-testing; >> with the fallback available, I am able to write the included Fuzz tests and >> have them check the values against the existing implementation. While I also >> included a few KAT tests using openssl-generated values, the fuzz tests >> check millions of values and it does add a lot more certainty about >> correctness of this code. > > I hadn't looked at your fuzz test until you mentioned it. I see you are > using reflection to change the values. Is that what you mean by "fallback"? > I'm assuming there is no to access the older implementation without > reflection. > >> >> Can it be removed? For the operations that do not involve multiplication >> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the >> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and >> existing IntegerPolynomialP256 is no longer used (I should verify that >> again) and only P256OrderField remains non-montgomery. So removing >> references to IntegerPolynomialP256 in ECOperations should be possible and >> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 >> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also >> worth some thought.. >> >> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but >> it could also be chucked up as part of 'scaffolding' and removed in name of >> code quality? > > I wouldn
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 18:02:38 GMT, Volodymyr Paprotski wrote: > > In `ECOperations.java`, if I understand this correctly, it is to replace > > the existing `PointMultiplier` with montgomery-based PointMuliplier. But > > when I look at the code, I see both are still options. If I read this > > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the > > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. > > Why doesn't it just replace the old implementation entry in the `fields` > > Map? Is there a reason to keep it around? > > Hmm, thats a good point I haven't fully considered; i.e. (if I read > correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery > entirely".. that might also help in cleaning a few things up in the > construction. Maybe even get rid of this nested ECOperations inside > ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the > ECC stack clearer is positive! > > One functional reason that might justify keeping it as-is, is fuzz-testing; > with the fallback available, I am able to write the included Fuzz tests and > have them check the values against the existing implementation. While I also > included a few KAT tests using openssl-generated values, the fuzz tests check > millions of values and it does add a lot more certainty about correctness of > this code. I hadn't looked at your fuzz test until you mentioned it. I see you are using reflection to change the values. Is that what you mean by "fallback"? I'm assuming there is no to access the older implementation without reflection. > > Can it be removed? For the operations that do not involve multiplication > (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the > uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and > existing IntegerPolynomialP256 is no longer used (I should verify that again) > and only P256OrderField remains non-montgomery. So removing references to > IntegerPolynomialP256 in ECOperations should be possible and cleaner. > Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder > (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some > thought.. > > I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but > it could also be chucked up as part of 'scaffolding' and removed in name of > code quality? I wouldn't rip out the old implementation. I have been wondering if we should make the older implementation available, maybe by security property. I was looking at the static Maps at the top of `ECOperations`, `forParameters`, and the constructors where it checks if the `montgomeryOps` was null or set. It would be nice if we could have one set of `fields` Maps by putting the montgomery entry into the `fields` to replace it. I think that should work because `IntegerMontgomeryFieldModuloP` extends `IntegerFieldModuloP`. `instanceof` or other `montgomeryOps` checks would still need to exist because not all the `fields` support mongomery, and the older implementation would still be accessible for your fuzz tester. At least that is my theory. > > Thanks @ascarpino > > PS: Perhaps there is some middle ground, remove the `ECOperations > montgomeryOps` nesting, and construct (somehow?? singleton makes most things > inaccessible..) the reference ECOperations in the fuzz test instead.. not > sure how yet, but perhaps worth a further thought.. It would be nice to remove the nesting and it would be nice to be a singleton. Maybe some combination of what I mentioned above chance can help that. I haven't fully thought this out either. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2050148475
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja wrote: > Few early comments. > > Please update the copyright year of all the modified files. > > You can even consider splitting this into two patches, Java side changes in > one and x86 optimized intrinsic in next one. Thanks Jatin, will fix! - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino wrote: > In `ECOperations.java`, if I understand this correctly, it is to replace the > existing `PointMultiplier` with montgomery-based PointMuliplier. But when I > look at the code, I see both are still options. If I read this correctly, it > checks for the old `IntegerFieldModuloP`, then looks for the new > `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why > doesn't it just replace the old implementation entry in the `fields` Map? Is > there a reason to keep it around? Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery entirely".. that might also help in cleaning a few things up in the construction. Maybe even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the ECC stack clearer is positive! One functional reason that might justify keeping it as-is, is fuzz-testing; with the fallback available, I am able to write the included Fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using openssl-generated values, the fuzz tests check millions of values and it does add a lot more certainty about correctness of this code. Can it be removed? For the operations that do not involve multiplication (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing IntegerPolynomialP256 is no longer used (I should verify that again) and only P256OrderField remains non-montgomery. So removing references to IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some thought.. I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it could also be chucked up as part of 'scaffolding' and removed in name of code quality? Thanks @ascarpino PS: Perhaps there is some middle ground, remove the `ECOperations montgomeryOps` nesting, and construct (somehow?? singleton makes most things inaccessible..) the reference ECOperations in the fuzz test instead.. not sure how yet, but perhaps worth a further thought.. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec In `ECOperations.java`, if I understand this correctly, it is to replace the existing `PointMultiplier` with montgomery-based PointMuliplier. But when I look at the code, I see both are still options. If I read this correctly, it checks for the old `IntegerFieldModuloP`, then looks for the new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why doesn't it just replace the old implementation entry in the `fields` Map? Is there a reason to keep it around? - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048090075
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec @ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would appreciate your feedback. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec Few early comments. Please update the copyright year of all the modified files. You can even consider splitting this into two patches, Java side changes in one and x86 optimized intrinsic in next one. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39: > 37: }; > 38: static address modulus_p256() { > 39: return (address)MODULUS_P256; Long constants should have UL suffix. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386: > 384: __ jcc(Assembler::equal, L_Length19); > 385: > 386: // Default copy loop Please add appropriate loop entry alignment. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394: > 392: __ lea(aLimbs, Address(aLimbs,8)); > 393: __ lea(bLimbs, Address(bLimbs,8)); > 394: __ jmp(L_DefaultLoop); Both sub and cmp are flag affecting instructions and are macro-fusible. By doing a loop rotation i.e. moving the length <= 0 check outside the loop and pushing the loop exit check at bottom you can save additional compare checks. - PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-1981555803 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553056633 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1552710600 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553110376
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec Build changes are trivially fine. - Marked as reviewed by ihse (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-1976208258
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/java.base/share/classes/module-info.java line 265: > >> 263: jdk.jfr, >> 264: jdk.unsupported, >> 265: jdk.crypto.ec; > > jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in > java.base. So I don't think you need this qualified export. Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a few spots during rebase I guess) - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: remove use of jdk.crypto.ec - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=00-01 Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583