Re: [jdk23] RFR: 8333583: Crypto-XDH.generateSecret regression after JDK-8329538
On Tue, 25 Jun 2024 23:50:20 GMT, Volodymyr Paprotski wrote: > Hi all, > > This pull request contains a backport of commit > [f101e153](https://github.com/openjdk/jdk/commit/f101e153cee68750fcf1f12da10e29806875b522) > from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Volodymyr Paprotski on 25 Jun > 2024 and was reviewed by Sandhya Viswanathan, Vladimir Kozlov, Ferenc Rakoczi > and Anthony Scarpino. > > Thanks! Marked as reviewed by sviswanathan (Reviewer). - PR Review: https://git.openjdk.org/jdk/pull/19893#pullrequestreview-2142217862
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > shenandoah verifier Marked as reviewed by sviswanathan (Reviewer). The intrinsics and the C2 changes look good to me. - PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-2064439617 PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118426661
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > whitespace src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168: > 166: XMMRegister broadcast5 = xmm24; > 167: KRegister limb0 = k1; > 168: KRegister limb5 = k2; limb5 and select are not being used anymore. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185: > 183: __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), > false, Assembler::AVX_512bit, rscratch); > 184: > 185: // A = load(*aLimbs) A little bit more description in comments on what the load step involves would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the lowest limb. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270: > 268: __ push(r14); > 269: __ push(r15); > 270: No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to r8. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286: > 284: __ mov(aLimbs, c_rarg0); > 285: __ mov(bLimbs, c_rarg1); > 286: __ mov(rLimbs, c_rarg2); We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these moves are not necessary. src/hotspot/cpu/x86/vm_version_x86.cpp line 1370: > 1368: > 1369: #ifdef _LP64 > 1370: if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize > >= 64) { No need to tie the intrinsic to MaxVectorSize setting. src/hotspot/share/opto/library_call.cpp line 7564: > 7562: > 7563: if (!stubAddr) return false; > 7564: if (stopped()) return true; Line 7564 seems redundant here as there is no range check or anything like that before this. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604169603 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604141586 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604174141 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604175443 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603792252 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603865712
Re: RFR: 8328074: Add jcheck whitespace checking for assembly files [v4]
On Wed, 13 Mar 2024 11:59:36 GMT, Magnus Ihse Bursie wrote: >> As part of the ongoing effort to enable jcheck whitespace checking to all >> text files, it is now time to address assembly (.S) files. The hotspot >> assembly files were fixed as part of the hotspot mapfile removal, so only a >> single incorrect jdk library remains. >> >> The main issue with `libjsvml` was inconsistent line starts, sometimes using >> tabs. I used the `expand` unix command line tool to replace these with >> spaces. > > Magnus Ihse Bursie has updated the pull request incrementally with one > additional commit since the last revision: > > Make all ALIGN/.align aligned Looks good to me. - Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18268#pullrequestreview-1940824156
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
On Wed, 6 Dec 2023 18:26:34 GMT, Vladimir Kozlov wrote: >> @TobiHartmann @vnkozlov Please advice if we can go head and integrate this >> PR today before the fork. > >> @TobiHartmann @vnkozlov Please advice if we can go head and integrate this >> PR today before the fork. > > Too late. Changes looks fine to me (I am still on fence that we moving to C++ > implementation of intrinsics and require latest C++ compiler version). I need > to run testing for latest version of changes before approval. Lets not rush. Thanks a lot @vnkozlov, we will wait for your approval. - PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843454162
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
On Wed, 6 Dec 2023 17:48:04 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for >> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the >> performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit >> datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> >> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> >> >> >> >> >> >> >> >> >> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | >> Speedup >> -- | -- | -- | -- | -- >> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2 >> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0 >> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5 >> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5 >> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2 >> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4 >> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6 >> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3 >> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1 >> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0 >> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4 >> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3 >> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0 >> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9 >> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4 >> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7 >> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0 >> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6 >> >> >> >> >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > add missing header files @TobiHartmann @vnkozlov Please advice if we can go head and integrate this PR today before the fork. - PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843407940
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for >> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the >> performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit >> datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> >> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> >> >> >> >> >> >> >> >> >> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | >> Speedup >> -- | -- | -- | -- | -- >> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2 >> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0 >> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5 >> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5 >> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2 >> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4 >> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6 >> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3 >> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1 >> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0 >> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4 >> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3 >> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0 >> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9 >> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4 >> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7 >> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0 >> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6 >> >> >> >> >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> > Srinivas Vamsi Parasa has updated the pull request with a new target base due > to a merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 17 additional > commits since the last revision: > > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - add GCC version guards > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - Remove C++17 from C flags > - add avoid masked stores operation > - update the code to check for supported simd sort cpus > - Disable AVX2 sort for 64-bit types > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - fix jcheck failures due to windows encoding > - fix carriage return and change insertion sort thresholds > - ... and 7 more: https://git.openjdk.org/jdk/compare/d4804a12...bc590d9f The PR looks good to me. - Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1763680754
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]
On Tue, 28 Nov 2023 20:52:35 GMT, Srinivas Vamsi Parasa wrote: >> Thanks Sandhya, will fix this issue. > > Thanks Sandhya for suggesting the change to use supports_simd_sort(BasicType > bt). Please see the updated code upstreamed. @vamsi-parasa Thanks, your changes look good to me. - PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1410117695
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]
On Sat, 18 Nov 2023 01:21:09 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for >> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the >> performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit >> datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> >> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> >> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> >> >> >> >> >> >> >> >> >> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | >> Speedup >> -- | -- | -- | -- | -- >> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2 >> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0 >> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5 >> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5 >> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2 >> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4 >> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6 >> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3 >> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1 >> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0 >> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4 >> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3 >> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0 >> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9 >> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4 >> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7 >> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0 >> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6 >> >> >> >> >> >> > xmlns:o="urn:schemas-microsoft-com:office:office" >> xmlns:x="urn:schemas-microsoft-com:office:excel" >> xmlns="http://www.w3.org/TR/REC-html40";> >> >> >> >> > Srinivas Vamsi Parasa has updated the pull request with a new target base due > to a merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 11 additional > commits since the last revision: > > - Disable AVX2 sort for 64-bit types > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - fix jcheck failures due to windows encoding > - fix carriage return and change insertion sort thresholds > - fix formatting and white spaces > - cleanup unused code > - fix insertion sort thresholds > - add insertion sort > - fix headers > - revert to xss-common-qsort > - ... and 1 more: https://git.openjdk.org/jdk/compare/36991c25...08307b6a src/hotspot/share/opto/library_call.cpp line 5391: > 5389: BasicType bt = elem_type->basic_type(); > 5390: // Disable the intrinsic for 64-bit types with AVX2 > 5391: if ((bt == T_LONG || bt == T_DOUBLE) && UseAVX == 2) { UseAVX is platform specific and cannot be used in library_call.cpp. - PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1406923009
Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]
On Tue, 21 Nov 2023 15:14:28 GMT, Dalibor Topic wrote: >> src/java.base/linux/native/libsimdsort/avx2-32bit-qsort.hpp line 3: >> >>> 1: /* >>> 2: * Copyright (c) 2021, 2023, Intel Corporation. All rights reserved. >>> 3: * Copyright (c) 2021 Serge Sans Paille. All rights reserved. >> >> Is this person an OpenJDK Contributor? > > Not listed here: https://oca.opensource.oracle.com/?ojr=contributors Yes, Vamsi is part of Intel Java team. He also has the author status (https://openjdk.org/census#sparasa). - PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1406904710
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]
On Fri, 13 Oct 2023 10:31:14 GMT, himichael wrote: >> @himichael Please refer to [this >> question](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) >> for how to correctly benchmark Java code. > >> @himichael Please refer to [this >> question](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) >> for how to correctly benchmark Java code. > > thanks for your reply, but I think this has nothing to do with benchmark. > I sort a random array of int[], use: ```java.util.Arrays.sort()``` > in jdk 8, run command: > ```java JDKSort``` , the result display: 15079 ms > > in open jdk 22.19 , run command: > ```java --add-modules jdk.incubator.vector JDKSort``` > the result display: 11619 ms > > my question is that this feature should improve performance several times, > but it doesn't look like there's much difference between open jdk 22.19 and > jdk 8. > is there a problem with my configuration ? @himichael Please try java -Xbatch -XX:-TieredCompilation JDKSort. The method DualPivotQuickSort.sort() needs to be C2 compiled for you to see the benefit. - PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762361069
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v5]
On Wed, 11 Oct 2023 23:14:26 GMT, Vladimir Ivanov wrote: > Proposed patch has one disadvantage: there's no way to override ergonomics > decisions on AMD CPUs and forcibly enable the intrinsic without rebuilding > the JVM. > > For many other intrinsics there are flags which enable finer grained control > over JVM behavior (e.g., `UseVectorizedHashCodeIntrinsic`). There were some > complaints that there are way to many individual flags to control intrinsics, > so now there's `-XX:ControlIntrinsic=`, but as of now ergonomics doesn't rely > on it to control what intrinsics are enabled/disabled. > > It makes sense to let `-XX:ControlIntrinsic=` overrule > `VM_Version::is_intel()` check and enable the intrinsics when `AVX512DQ` is > supported. May be it could be done as part of https://bugs.openjdk.org/browse/JDK-8317976. - PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758699096
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]
On Wed, 11 Oct 2023 23:25:30 GMT, Vladimir Ivanov wrote: >> src/java.base/share/classes/java/util/DualPivotQuicksort.java line 157: >> >>> 155: @ForceInline >>> 156: private static void sort(Class elemType, A array, long >>> offset, int low, int high, SortOperation so) { >>> 157: so.sort(array, low, high); >> >> I'm late to the party, but how does the fallback implementation work (or is >> intended to) for off-heap case (null + absolute base address)? > > Also, for on-heap case the fallback implementation is equivalent to > intrinsified case only when offset points at the 0th element of the array. @iwanowww Yes, you are late to the party :). The fallback implementation could be similar to the vectorizedMismatch regarding base/offset for non heap case. Please see java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java. Yes, the fallback implementation for non intrinsic case is kept to be equivalent to what was before the SIMD sort PR. This is done to not affect the fallback performance on other platforms. - PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1355886081
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v4]
On Wed, 11 Oct 2023 22:25:14 GMT, Erik Joelsson wrote: >> Hi Erik (@erikj79), >> BUILD_LIBFALLBACKLINKER is from different PR (#13079). If I understand >> correctly, for LIB_SIMD_SORT, are you suggesting that we don't pad the lines >> with spaces to align features into columns and instead just use 4 spaces for >> the indentation after the line break? > > I see now that this is an unrelated change. In that case please avoid > changing whitespace in unrelated files for this PR. @erikj79 This space was inadvertently added as part of (https://github.com/openjdk/jdk/pull/14227) and @magicus requested to remove it in this follow-up PR. - PR Review Comment: https://git.openjdk.org/jdk/pull/16124#discussion_r1355851543
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]
On Wed, 11 Oct 2023 18:31:44 GMT, Sandhya Viswanathan wrote: >> Also @forceinline in these changes only works for case when new intrinsics >> are not used. >> I would suggest to adapt/update JMH benchmark to cover all cases and see >> effect @forceinline without intrinsics. >> That will tell us which @forceinline annotations are needed. > > @vnkozlov Please advice if we can integrate this PR or if you would like to > run some tests first. > Okay. I will start testing for current changes. @sviswa7 please file RFE for > Zen 4. If we get patch for it we do followup changes in that RFE. Thanks a lot Vladimir. RFE filed: https://bugs.openjdk.org/browse/JDK-8317976. - PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758550761
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]
On Tue, 10 Oct 2023 22:29:55 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> fix whitespace in build script > > Also @forceinline in these changes only works for case when new intrinsics > are not used. > I would suggest to adapt/update JMH benchmark to cover all cases and see > effect @forceinline without intrinsics. > That will tell us which @forceinline annotations are needed. @vnkozlov Please advice if we can integrate this PR or if you would like to run some tests first. - PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758270977
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR
On Wed, 11 Oct 2023 09:25:15 GMT, Andrew Haley wrote: > > Forgive me, I might be missing something very obvious, but is there any > > particular reason to entirely disable the SIMD accelerated sort on Zen 4 > > rather than having an alternate code path for Zen 4 where it has the > > `compressstoreu` instructions split up into separate `compress` and > > `storeu` instructions so that Zen 4 platforms can still benefit from a > > decent degree of performance uplift from AVX512 acceleration of sort? > > I don't think you're missing anything. This should be done, rather than > disabling the intrinsic. With this PR we want to fix the urgent Alpine Linux build issues and the Zen 4 perf regression. At the minimum, we don't want to hurt performance for Zen 4, hence restricting the intrinsic to only Intel for now as suggested in the [Reddit discussion](https://www.reddit.com/r/java/comments/171t5sj/comment/k3v9ko5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). Improving perf for it could be a separate PR from someone more familiar with Zen 4. - PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758268675
Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v4]
On Wed, 11 Oct 2023 17:28:12 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to address the follow-up comments to the SIMD >> accelerated sort PR (#14227) which implemented AVX512 intrinsics for >> Arrays.sort() methods. >> The proposed changes are: >> >> 1) Restriction of the AVX512 sort acceleration to only Intel CPUs. A >> performance regression (due to micro-architectural differences) was reported >> for AMD Zen4 CPUs in the comments section of PR. >> 2) Addressing the build failure due to a bug in GCC 12 (which was fixed in >> version 12.3.1). The details of the bug are at: >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593 >> 3) Minor changes in Javadoc strings > > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > Add @ForceInline annotation to insertion and mixedInsertion sort Looks good to me. - Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16124#pullrequestreview-1672015105
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]
On Wed, 30 Aug 2023 02:01:38 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Clean up parameters passed to arrayPartition; update the check to load >> library > > Good. Thank you. @vnkozlov Vamsi has approval from Paul on the Java side. He has also implemented all the review comments on the build changes from Eric and Magnus. Could you please take a relook at the PR and let us know if you are ok with the changes made since your last review. This has been a long journey. We are looking forward to your review. Thanks a lot. - PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1734126795
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v40]
On Wed, 20 Sep 2023 17:19:42 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX512 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and >> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the >> performance data below. >> >> >> **Arrays.sort performance data using JMH benchmarks for arrays with random >> data** >> >> |Arrays.sort benchmark | Array Size | Baseline >> (us/op)| AVX512 Sort (us/op) | Speedup | >> |--- | --- | --- | --- | --- >> | >> |ArraysSort.doubleSort | 10 | 0.034 | 0.035 >> | 1.0 | >> |ArraysSort.doubleSort | 25 | 0.116 | 0.089 >> | 1.3 | >> |ArraysSort.doubleSort | 50 | 0.282 | 0.291 >> | 1.0 | >> |ArraysSort.doubleSort | 75 | 0.474 | 0.358 >> | 1.3 | >> |ArraysSort.doubleSort | 100 | 0.654 | 0.623 >> | 1.0 | >> |ArraysSort.doubleSort | 1000| 9.274 | 6.331 >> | 1.5 | >> |ArraysSort.doubleSort | 1 | 323.339 | 71.228 >> | **4.5** | >> |ArraysSort.doubleSort | 10 | 4471.871| >> 1002.748| **4.5** | >> |ArraysSort.doubleSort | 100 | 51660.742 | >> 12921.295 | **4.0** | >> |ArraysSort.floatSort| 10 | 0.045 | 0.046 >> | 1.0 | >> |ArraysSort.floatSort| 25 | 0.103 | 0.084 >> | 1.2 | >> |ArraysSort.floatSort| 50 | 0.285 | 0.33 >> | 0.9 | >> |ArraysSort.floatSort| 75 | 0.492 | 0.346 >> | 1.4 | >> |ArraysSort.floatSort| 100 | 0.597 | 0.326 >> | 1.8 | >> |ArraysSort.floatSort| 1000| 9.811 | 5.294 >> | 1.9 | >> |ArraysSort.floatSort| 1 | 323.955 | 50.547 >> | **6.4** | >> |ArraysSort.floatSort| 10 | 4326.38 | 731.152 >> | **5.9** | >> |ArraysSort.floatSort| 100 | 52413.88| >> 8409.193| **6.2** | >> |ArraysSort.intSort | 10 | 0.033 | 0.033 >> | 1.0 | >> |ArraysSort.intSort | 25 | 0.086 | 0.051 >> | 1.7 | >> |ArraysSort.intSort | 50 | 0.236 | 0.151 >> | 1.6 | >> |ArraysSort.intSort | 75 | 0.416 | 0.332 >> | 1.3 | >> |ArraysSort.intSort | 100 | 0.63| 0.521 >> | 1.2 | >> |ArraysSort.intSort | 1000| 10.518 | 4.698 >> | 2.2 | >> |ArraysSort.intSort | 1 | 309.659 | 42.518 >> | **7.3** | >> |ArraysSort.intSort | 10 | 4130.917| >> 573.956 | **7.2** | >> |ArraysSort.intSort | 100 | 49876.307 | >> 6712.812| **7.4** | >> |ArraysSort.longSort | 10 | 0.036 | 0.037 >> | 1.0 | >> |ArraysSort.longSort | 25 | 0.094 | 0.08 >> | 1.2 | >> |ArraysSort.longSort | 50 | 0.218 | 0.227 >> | 1.0 | >> |ArraysSort.longSort | 75 | 0.466 | 0.402 >> | 1.2 | >> |ArraysSort.longSort | 100 | 0.76| 0.58 >> | 1.3 | >> |ArraysSort.longSort | 1000| 10.449 | 6 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > change variable names of indexPivot* to pivotIndex* > Hi Vamsi, > > In this comment [#13568 > (comment)](https://github.com/openjdk/jdk/pull/13568#issuecomment-1728082819) > Paul suggested comparing of performance. > > Could you please run benchmarking of the following 4 class? > > [1] current implementation in JDK [2] your AVX12 version based on [1], from > this PR [3] my new version with Radix sort for parallel case plus your AVX12 > changes > https://github.com/iaroslavski/sorting/blob/master/radixsort/DualPivotQuicksort_RadixForParallel.java > [4] my new version with Radix sort for all cases plus your AVX12 changes >
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]
On Tue, 29 Aug 2023 19:28:17 GMT, Alan Bateman wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Clean up parameters passed to arrayPartition; update the check to load >> library > > The changes to DualPivotQuicksort will need detailed review to ensure that > this is understandable and maintainable, there is a lot here to study. > > The addition of libx86_64 and having the stub generation call out to this > library also needs discussion to make sure there is an agreement on how this > would be integrated. @AlanBateman If it helps, the changes made by @vamsi-parasa to DualPivotQuickSort.java don't change the logic as written in Java. They only carve out the functionality into separate Java methods retaining the meaning exactly as before. These Java methods are then optimized through a stub. - PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698079501
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]
On Mon, 28 Aug 2023 21:27:25 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX512 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and >> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the >> performance data below. >> >> >> **Arrays.sort performance data using JMH benchmarks for arrays with random >> data** >> >> |Arrays.sort benchmark | Array Size | Baseline >> (us/op)| AVX512 Sort (us/op) | Speedup | >> |--- | --- | --- | --- | --- >> | >> |ArraysSort.doubleSort | 10 | 0.034 | 0.035 >> | 1.0 | >> |ArraysSort.doubleSort | 25 | 0.116 | 0.089 >> | 1.3 | >> |ArraysSort.doubleSort | 50 | 0.282 | 0.291 >> | 1.0 | >> |ArraysSort.doubleSort | 75 | 0.474 | 0.358 >> | 1.3 | >> |ArraysSort.doubleSort | 100 | 0.654 | 0.623 >> | 1.0 | >> |ArraysSort.doubleSort | 1000| 9.274 | 6.331 >> | 1.5 | >> |ArraysSort.doubleSort | 1 | 323.339 | 71.228 >> | **4.5** | >> |ArraysSort.doubleSort | 10 | 4471.871| >> 1002.748| **4.5** | >> |ArraysSort.doubleSort | 100 | 51660.742 | >> 12921.295 | **4.0** | >> |ArraysSort.floatSort| 10 | 0.045 | 0.046 >> | 1.0 | >> |ArraysSort.floatSort| 25 | 0.103 | 0.084 >> | 1.2 | >> |ArraysSort.floatSort| 50 | 0.285 | 0.33 >> | 0.9 | >> |ArraysSort.floatSort| 75 | 0.492 | 0.346 >> | 1.4 | >> |ArraysSort.floatSort| 100 | 0.597 | 0.326 >> | 1.8 | >> |ArraysSort.floatSort| 1000| 9.811 | 5.294 >> | 1.9 | >> |ArraysSort.floatSort| 1 | 323.955 | 50.547 >> | **6.4** | >> |ArraysSort.floatSort| 10 | 4326.38 | 731.152 >> | **5.9** | >> |ArraysSort.floatSort| 100 | 52413.88| >> 8409.193| **6.2** | >> |ArraysSort.intSort | 10 | 0.033 | 0.033 >> | 1.0 | >> |ArraysSort.intSort | 25 | 0.086 | 0.051 >> | 1.7 | >> |ArraysSort.intSort | 50 | 0.236 | 0.151 >> | 1.6 | >> |ArraysSort.intSort | 75 | 0.416 | 0.332 >> | 1.3 | >> |ArraysSort.intSort | 100 | 0.63| 0.521 >> | 1.2 | >> |ArraysSort.intSort | 1000| 10.518 | 4.698 >> | 2.2 | >> |ArraysSort.intSort | 1 | 309.659 | 42.518 >> | **7.3** | >> |ArraysSort.intSort | 10 | 4130.917| >> 573.956 | **7.2** | >> |ArraysSort.intSort | 100 | 49876.307 | >> 6712.812| **7.4** | >> |ArraysSort.longSort | 10 | 0.036 | 0.037 >> | 1.0 | >> |ArraysSort.longSort | 25 | 0.094 | 0.08 >> | 1.2 | >> |ArraysSort.longSort | 50 | 0.218 | 0.227 >> | 1.0 | >> |ArraysSort.longSort | 75 | 0.466 | 0.402 >> | 1.2 | >> |ArraysSort.longSort | 100 | 0.76| 0.58 >> | 1.3 | >> |ArraysSort.longSort | 1000| 10.449 | 6 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > Clean up parameters passed to arrayPartition; update the check to load > library Thanks for considering all the review comments and fixing them. The PR looks good to me. @PaulSandoz Could you please review the Java part? - Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1599251153 PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1696551426
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]
On Fri, 25 Aug 2023 18:46:53 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Remove unnecessary import in Arrays.java > > After I fixed it Tier1 passed and I submitted other tiers. @vnkozlov The _mm512_set1_* are all C/C++ intrinsics for Intel instructions documented at https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Both GCC and Microsoft C implements them. https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list. - PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1694025189
Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v23]
On Tue, 22 Aug 2023 23:38:47 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX512 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and >> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the >> performance data below. >> >> >> **Arrays.sort performance data using JMH benchmarks for arrays with random >> data** >> >> |Arrays.sort benchmark | Array Size | Baseline >> (us/op)| AVX512 Sort (us/op) | Speedup | >> |--- | --- | --- | --- | --- >> | >> |ArraysSort.doubleSort | 10 | 0.034 | 0.035 >> | 1.0 | >> |ArraysSort.doubleSort | 25 | 0.116 | 0.089 >> | 1.3 | >> |ArraysSort.doubleSort | 50 | 0.282 | 0.291 >> | 1.0 | >> |ArraysSort.doubleSort | 75 | 0.474 | 0.358 >> | 1.3 | >> |ArraysSort.doubleSort | 100 | 0.654 | 0.623 >> | 1.0 | >> |ArraysSort.doubleSort | 1000| 9.274 | 6.331 >> | 1.5 | >> |ArraysSort.doubleSort | 1 | 323.339 | 71.228 >> | **4.5** | >> |ArraysSort.doubleSort | 10 | 4471.871| >> 1002.748| **4.5** | >> |ArraysSort.doubleSort | 100 | 51660.742 | >> 12921.295 | **4.0** | >> |ArraysSort.floatSort| 10 | 0.045 | 0.046 >> | 1.0 | >> |ArraysSort.floatSort| 25 | 0.103 | 0.084 >> | 1.2 | >> |ArraysSort.floatSort| 50 | 0.285 | 0.33 >> | 0.9 | >> |ArraysSort.floatSort| 75 | 0.492 | 0.346 >> | 1.4 | >> |ArraysSort.floatSort| 100 | 0.597 | 0.326 >> | 1.8 | >> |ArraysSort.floatSort| 1000| 9.811 | 5.294 >> | 1.9 | >> |ArraysSort.floatSort| 1 | 323.955 | 50.547 >> | **6.4** | >> |ArraysSort.floatSort| 10 | 4326.38 | 731.152 >> | **5.9** | >> |ArraysSort.floatSort| 100 | 52413.88| >> 8409.193| **6.2** | >> |ArraysSort.intSort | 10 | 0.033 | 0.033 >> | 1.0 | >> |ArraysSort.intSort | 25 | 0.086 | 0.051 >> | 1.7 | >> |ArraysSort.intSort | 50 | 0.236 | 0.151 >> | 1.6 | >> |ArraysSort.intSort | 75 | 0.416 | 0.332 >> | 1.3 | >> |ArraysSort.intSort | 100 | 0.63| 0.521 >> | 1.2 | >> |ArraysSort.intSort | 1000| 10.518 | 4.698 >> | 2.2 | >> |ArraysSort.intSort | 1 | 309.659 | 42.518 >> | **7.3** | >> |ArraysSort.intSort | 10 | 4130.917| >> 573.956 | **7.2** | >> |ArraysSort.intSort | 100 | 49876.307 | >> 6712.812| **7.4** | >> |ArraysSort.longSort | 10 | 0.036 | 0.037 >> | 1.0 | >> |ArraysSort.longSort | 25 | 0.094 | 0.08 >> | 1.2 | >> |ArraysSort.longSort | 50 | 0.218 | 0.227 >> | 1.0 | >> |ArraysSort.longSort | 75 | 0.466 | 0.402 >> | 1.2 | >> |ArraysSort.longSort | 100 | 0.76| 0.58 >> | 1.3 | >> |ArraysSort.longSort | 1000| 10.449 | 6 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > Decomposed DPQS using AVX512 partitioning and AVX512 sort (for small > arrays). Works for serial and parallel sort. src/java.base/linux/native/libx86_64/avx512-common-qsort.h line 29: > 27: > 28: // This implementation is based on > x86-simd-sort(https://github.com/intel/x86-simd-sort) > 29: #include Is the include iostream needed? - PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1305047404