Re: [jdk23] RFR: 8333583: Crypto-XDH.generateSecret regression after JDK-8329538

2024-06-26 Thread Sandhya Viswanathan
On Tue, 25 Jun 2024 23:50:20 GMT, Volodymyr Paprotski  wrote:

> Hi all,
> 
> This pull request contains a backport of commit 
> [f101e153](https://github.com/openjdk/jdk/commit/f101e153cee68750fcf1f12da10e29806875b522)
>  from the [openjdk/jdk](https://git.openjdk.org/jdk) repository.
> 
> The commit being backported was authored by Volodymyr Paprotski on 25 Jun 
> 2024 and was reviewed by Sandhya Viswanathan, Vladimir Kozlov, Ferenc Rakoczi 
> and Anthony Scarpino.
> 
> Thanks!

Marked as reviewed by sviswanathan (Reviewer).

-

PR Review: https://git.openjdk.org/jdk/pull/19893#pullrequestreview-2142217862


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-17 Thread Sandhya Viswanathan
On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   shenandoah verifier

Marked as reviewed by sviswanathan (Reviewer).

The intrinsics and the C2 changes look good to me.

-

PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-2064439617
PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118426661


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-16 Thread Sandhya Viswanathan
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   whitespace

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:

> 166:   XMMRegister broadcast5 = xmm24;
> 167:   KRegister limb0 = k1;
> 168:   KRegister limb5 = k2;

limb5 and select are not being used anymore.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:

> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), 
> false, Assembler::AVX_512bit, rscratch);
> 184: 
> 185:   // A = load(*aLimbs)

A little bit more description in comments on what the load step involves would 
be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the 
lowest limb.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:

> 268:   __ push(r14);
> 269:   __ push(r15);
> 270: 

No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in 
montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to 
r8.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:

> 284:   __ mov(aLimbs, c_rarg0);
> 285:   __ mov(bLimbs, c_rarg1);
> 286:   __ mov(rLimbs, c_rarg2);

We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these 
moves are not necessary.

src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:

> 1368: 
> 1369: #ifdef _LP64
> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize 
> >= 64) {

No need to tie the intrinsic to MaxVectorSize setting.

src/hotspot/share/opto/library_call.cpp line 7564:

> 7562: 
> 7563:   if (!stubAddr) return false;
> 7564:   if (stopped())  return true;

Line 7564 seems redundant here as there is no range check or anything like that 
 before this.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604169603
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604141586
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604174141
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604175443
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603792252
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603865712


Re: RFR: 8328074: Add jcheck whitespace checking for assembly files [v4]

2024-03-16 Thread Sandhya Viswanathan
On Wed, 13 Mar 2024 11:59:36 GMT, Magnus Ihse Bursie  wrote:

>> As part of the ongoing effort to enable jcheck whitespace checking to all 
>> text files, it is now time to address assembly (.S) files. The hotspot 
>> assembly files were fixed as part of the hotspot mapfile removal, so only a 
>> single incorrect jdk library remains.
>> 
>> The main issue with `libjsvml` was inconsistent line starts, sometimes using 
>> tabs. I used the `expand` unix command line tool to replace these with 
>> spaces.
>
> Magnus Ihse Bursie has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Make all ALIGN/.align aligned

Looks good to me.

-

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18268#pullrequestreview-1940824156


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Sandhya Viswanathan
On Wed, 6 Dec 2023 18:26:34 GMT, Vladimir Kozlov  wrote:

>> @TobiHartmann @vnkozlov Please advice if we can go head and integrate this 
>> PR today before the fork.
>
>> @TobiHartmann @vnkozlov Please advice if we can go head and integrate this 
>> PR today before the fork.
> 
> Too late. Changes looks fine to me (I am still on fence that we moving to C++ 
> implementation of intrinsics and require latest C++ compiler version). I need 
> to run testing for latest version of changes before approval. Lets not rush.

Thanks a lot @vnkozlov, we will wait for your approval.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843454162


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Sandhya Viswanathan
On Wed, 6 Dec 2023 17:48:04 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   add missing header files

@TobiHartmann @vnkozlov Please advice if we can go head and integrate this PR 
today before the fork.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843407940


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]

2023-12-04 Thread Sandhya Viswanathan
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>  - add GCC version guards
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>  - Remove C++17 from C flags
>  - add avoid masked stores operation
>  - update the code to check for supported simd sort cpus
>  - Disable AVX2 sort for 64-bit types
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>  - fix jcheck failures due to windows encoding
>  - fix carriage return and change insertion sort thresholds
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/d4804a12...bc590d9f

The PR looks good to me.

-

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1763680754


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]

2023-11-29 Thread Sandhya Viswanathan
On Tue, 28 Nov 2023 20:52:35 GMT, Srinivas Vamsi Parasa  
wrote:

>> Thanks Sandhya, will fix this issue.
>
> Thanks Sandhya for suggesting the change to use supports_simd_sort(BasicType 
> bt). Please see the updated code upstreamed.

@vamsi-parasa Thanks, your changes look good to me.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1410117695


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]

2023-11-27 Thread Sandhya Viswanathan
On Sat, 18 Nov 2023 01:21:09 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 11 additional 
> commits since the last revision:
> 
>  - Disable AVX2 sort for 64-bit types
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>  - fix jcheck failures due to windows encoding
>  - fix carriage return and change insertion sort thresholds
>  - fix formatting and white spaces
>  - cleanup unused code
>  - fix insertion sort thresholds
>  - add insertion sort
>  - fix headers
>  - revert to xss-common-qsort
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/36991c25...08307b6a

src/hotspot/share/opto/library_call.cpp line 5391:

> 5389: BasicType bt = elem_type->basic_type();
> 5390: // Disable the intrinsic for 64-bit types with AVX2
> 5391: if ((bt == T_LONG || bt == T_DOUBLE) && UseAVX == 2) {

UseAVX is platform specific and cannot be used in library_call.cpp.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1406923009


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]

2023-11-27 Thread Sandhya Viswanathan
On Tue, 21 Nov 2023 15:14:28 GMT, Dalibor Topic  wrote:

>> src/java.base/linux/native/libsimdsort/avx2-32bit-qsort.hpp line 3:
>> 
>>> 1: /*
>>> 2:  * Copyright (c) 2021, 2023, Intel Corporation. All rights reserved.
>>> 3:  * Copyright (c) 2021 Serge Sans Paille. All rights reserved.
>> 
>> Is this person an OpenJDK Contributor?
>
> Not listed here: https://oca.opensource.oracle.com/?ojr=contributors

Yes, Vamsi is part of Intel Java team. He also has the author status 
(https://openjdk.org/census#sparasa).

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1406904710


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

2023-10-13 Thread Sandhya Viswanathan
On Fri, 13 Oct 2023 10:31:14 GMT, himichael  wrote:

>> @himichael Please refer to [this 
>> question](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java)
>>  for how to correctly benchmark Java code.
>
>> @himichael Please refer to [this 
>> question](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java)
>>  for how to correctly benchmark Java code.   
> 
> thanks for your reply, but I think this has nothing to do with benchmark.   
> I sort a random array of int[], use: ```java.util.Arrays.sort()```
> in jdk 8, run command:  
> ```java JDKSort``` , the result display: 15079 ms   
>
> in open jdk 22.19 , run command:   
> ```java --add-modules jdk.incubator.vector JDKSort```   
> the result display:  11619 ms 
>
> my question is that this feature should improve performance several times, 
> but it doesn't look like there's much difference between open jdk 22.19 and 
> jdk 8.   
> is there a problem with my configuration ?

@himichael Please try java -Xbatch -XX:-TieredCompilation JDKSort. The method 
DualPivotQuickSort.sort() needs to be C2 compiled for you to see the benefit.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762361069


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v5]

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 23:14:26 GMT, Vladimir Ivanov  wrote:

> Proposed patch has one disadvantage: there's no way to override ergonomics 
> decisions on AMD CPUs and forcibly enable the intrinsic without rebuilding 
> the JVM.
> 
> For many other intrinsics there are flags which enable finer grained control 
> over JVM behavior (e.g., `UseVectorizedHashCodeIntrinsic`). There were some 
> complaints that there are way to many individual flags to control intrinsics, 
> so now there's `-XX:ControlIntrinsic=`, but as of now ergonomics doesn't rely 
> on it to control what intrinsics are enabled/disabled.
> 
> It makes sense to let `-XX:ControlIntrinsic=` overrule 
> `VM_Version::is_intel()` check and enable the intrinsics when `AVX512DQ` is 
> supported.

May be it could be done as part of https://bugs.openjdk.org/browse/JDK-8317976.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758699096


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 23:25:30 GMT, Vladimir Ivanov  wrote:

>> src/java.base/share/classes/java/util/DualPivotQuicksort.java line 157:
>> 
>>> 155: @ForceInline
>>> 156: private static  void sort(Class elemType, A array, long 
>>> offset, int low, int high, SortOperation so) {
>>> 157: so.sort(array, low, high);
>> 
>> I'm late to the party, but how does the fallback implementation work (or is 
>> intended to) for off-heap case (null + absolute base address)?
>
> Also, for on-heap case the fallback implementation is equivalent to 
> intrinsified case only when offset points at the 0th element of the array.

@iwanowww Yes, you are late to the party :). The fallback implementation could 
be similar to the vectorizedMismatch regarding base/offset for non heap case. 
Please see 
java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java.
Yes, the fallback implementation for non intrinsic case is kept to be 
equivalent to what was before the SIMD sort PR. This is done to not affect the 
fallback performance on other platforms.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1355886081


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v4]

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 22:25:14 GMT, Erik Joelsson  wrote:

>> Hi Erik (@erikj79), 
>> BUILD_LIBFALLBACKLINKER is from different PR (#13079). If I understand 
>> correctly, for LIB_SIMD_SORT, are you suggesting that we don't pad the lines 
>> with spaces to align features into columns and instead just use 4 spaces for 
>> the indentation after the line break?
>
> I see now that this is an unrelated change. In that case please avoid 
> changing whitespace in unrelated files for this PR.

@erikj79 This space was inadvertently added as part of 
(https://github.com/openjdk/jdk/pull/14227) and @magicus requested to remove it 
in this follow-up PR.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16124#discussion_r1355851543


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 18:31:44 GMT, Sandhya Viswanathan 
 wrote:

>> Also @forceinline in these changes only works for case when new intrinsics 
>> are not used.
>> I would suggest to adapt/update JMH benchmark to cover all cases and see 
>> effect @forceinline without intrinsics.
>> That will tell us which @forceinline annotations are needed.
>
> @vnkozlov Please advice if we can integrate this PR or if you would like to 
> run some tests first.

> Okay. I will start testing for current changes. @sviswa7 please file RFE for 
> Zen 4. If we get patch for it we do followup changes in that RFE.

Thanks a lot Vladimir. RFE filed: https://bugs.openjdk.org/browse/JDK-8317976.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758550761


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]

2023-10-11 Thread Sandhya Viswanathan
On Tue, 10 Oct 2023 22:29:55 GMT, Vladimir Kozlov  wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   fix whitespace in build script
>
> Also @forceinline in these changes only works for case when new intrinsics 
> are not used.
> I would suggest to adapt/update JMH benchmark to cover all cases and see 
> effect @forceinline without intrinsics.
> That will tell us which @forceinline annotations are needed.

@vnkozlov Please advice if we can integrate this PR or if you would like to run 
some tests first.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758270977


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 09:25:15 GMT, Andrew Haley  wrote:

> > Forgive me, I might be missing something very obvious, but is there any 
> > particular reason to entirely disable the SIMD accelerated sort on Zen 4 
> > rather than having an alternate code path for Zen 4 where it has the 
> > `compressstoreu` instructions split up into separate `compress` and 
> > `storeu` instructions so that Zen 4 platforms can still benefit from a 
> > decent degree of performance uplift from AVX512 acceleration of sort?
> 
> I don't think you're missing anything. This should be done, rather than 
> disabling the intrinsic.

With this PR we want to fix the urgent Alpine Linux build issues and the Zen 4 
perf regression. At the minimum, we don't want to hurt performance for Zen 4, 
hence restricting the intrinsic to only Intel for now as suggested in the 
[Reddit 
discussion](https://www.reddit.com/r/java/comments/171t5sj/comment/k3v9ko5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).
  Improving perf for it could be a separate PR from someone more familiar with 
Zen 4.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758268675


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v4]

2023-10-11 Thread Sandhya Viswanathan
On Wed, 11 Oct 2023 17:28:12 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal of this PR is to address the follow-up comments to the SIMD 
>> accelerated sort PR (#14227) which implemented AVX512 intrinsics for 
>> Arrays.sort() methods.
>> The proposed changes are:
>> 
>> 1) Restriction of the AVX512 sort acceleration to only Intel CPUs. A 
>> performance regression (due to micro-architectural differences) was reported 
>> for AMD Zen4 CPUs in the comments section of PR.
>> 2) Addressing the build failure due to a bug in GCC 12 (which was fixed in 
>> version 12.3.1). The details of the bug are at: 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593
>> 3) Minor changes in Javadoc strings
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Add @ForceInline annotation to insertion and mixedInsertion sort

Looks good to me.

-

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16124#pullrequestreview-1672015105


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-09-25 Thread Sandhya Viswanathan
On Wed, 30 Aug 2023 02:01:38 GMT, Vladimir Kozlov  wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Clean up parameters passed to arrayPartition; update the check to load 
>> library
>
> Good. Thank you.

@vnkozlov Vamsi has approval from Paul on the Java side. He has also 
implemented all the review comments on the build changes from Eric and Magnus. 
Could you please take a relook at the PR and let us know if you are ok with the 
changes made since your last review. This has been a long journey. We are 
looking forward to your review. Thanks a lot.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1734126795


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v40]

2023-09-20 Thread Sandhya Viswanathan
On Wed, 20 Sep 2023 17:19:42 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   change variable names of indexPivot* to pivotIndex*

> Hi Vamsi,
> 
> In this comment [#13568 
> (comment)](https://github.com/openjdk/jdk/pull/13568#issuecomment-1728082819) 
> Paul suggested comparing of performance.
> 
> Could you please run benchmarking of the following 4 class?
> 
> [1] current implementation in JDK [2] your AVX12 version based on [1], from 
> this PR [3] my new version with Radix sort for parallel case plus your AVX12 
> changes 
> https://github.com/iaroslavski/sorting/blob/master/radixsort/DualPivotQuicksort_RadixForParallel.java
>  [4] my new version with Radix sort for all cases plus your AVX12 changes 
> 

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-29 Thread Sandhya Viswanathan
On Tue, 29 Aug 2023 19:28:17 GMT, Alan Bateman  wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Clean up parameters passed to arrayPartition; update the check to load 
>> library
>
> The changes to DualPivotQuicksort will need detailed review to ensure that 
> this is understandable and maintainable, there is a lot here to study.
> 
> The addition of libx86_64 and having the stub generation call out to this 
> library also needs discussion to make sure there is an agreement on how this 
> would be integrated.

@AlanBateman If it helps, the changes made by @vamsi-parasa to 
DualPivotQuickSort.java don't change the logic as written in Java. They only 
carve out the functionality into separate Java methods retaining the meaning 
exactly as before. These Java methods are then optimized through a stub.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698079501


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-28 Thread Sandhya Viswanathan
On Mon, 28 Aug 2023 21:27:25 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Clean up parameters passed to arrayPartition; update the check to load 
> library

Thanks for considering all the review comments and fixing them. The PR looks 
good to me.

@PaulSandoz Could you please review the Java part?

-

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1599251153
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1696551426


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-25 Thread Sandhya Viswanathan
On Fri, 25 Aug 2023 18:46:53 GMT, Vladimir Kozlov  wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Remove unnecessary import in Arrays.java
>
> After I fixed it Tier1 passed and I submitted other tiers.

@vnkozlov The  _mm512_set1_*  are all C/C++ intrinsics for Intel instructions 
documented at 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Both 
GCC and Microsoft C implements them.
https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1694025189


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v23]

2023-08-24 Thread Sandhya Viswanathan
On Tue, 22 Aug 2023 23:38:47 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Decomposed DPQS using AVX512 partitioning and AVX512 sort (for small 
> arrays). Works for serial and parallel sort.

src/java.base/linux/native/libx86_64/avx512-common-qsort.h line 29:

> 27: 
> 28: // This implementation is based on 
> x86-simd-sort(https://github.com/intel/x86-simd-sort)
> 29: #include 

Is the include iostream needed?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1305047404