On Fri, 13 Oct 2023 23:59:55 GMT, Srinivas Vamsi Parasa <d...@openjdk.org> wrote:
> > my question is that this feature should improve performance several times,
> > but it doesn't look like there's much difference between open jdk 22.19 and
> > jdk 8. is there a problem with my configuration ?
>
> Hello @himichael,
>
> Using your code snippet, please see the output below using the latest JDK and
> JDK 20 (which does not have AVX512 sort):
>
> JDK 20 (without AVX512 sort):
> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
>
> elapse time -> **7501 ms**
>
> JDK 22 (with AVX512 sort):
> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
>
> elapse time -> **1607 ms**
>
> It shows 4.66x speedup.

Hello @vamsi-parasa,

I used the commands you provided, but the result barely changed. My test procedure was as follows.

Using JDK 8 (without AVX512 sort):

/data/soft/jdk1.8.0_371/bin/javac JDKSort.java
/data/soft/jdk1.8.0_371/bin/java JDKSort

elapse time -> **15309 ms**

Using OpenJDK 22.19 (with AVX512 sort):

/data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100
elapse time -> **11687 ms**

The elapsed time dropped only from 15309 ms to 11687 ms, far short of the 4.66x speedup shown above.

My JDK info:

OpenJDK 22.19:

/data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

JDK 8:

/data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)

I also tested Intel's **x86-simd-sort** directly; my code is as follows:

```c++
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <cstdlib>  // rand()
#include "src/avx512-32bit-qsort.hpp"

int main() {
    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);
    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }

    // Time only the sort itself, not the array setup.
    auto start_time = std::chrono::steady_clock::now();
    avx512_qsort(random_array.data(), size);
    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}
```

Compile command:

g++ -o sort -O3 -mavx512f -mavx512dq sort.cpp

elapse time -> **1151 ms**

That is roughly an order of magnitude faster than either Java run above.

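To put the 1151 ms figure in context, a scalar baseline on the same machine could be timed with plain `std::sort`. The following is only a minimal sketch under stated assumptions: the helper names `make_random_ints` and `time_std_sort_ms` and the fixed seed are hypothetical and were not part of the test above, and no timing for it has been measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: build n pseudo-random non-negative ints from a fixed
// seed, so that every sort implementation can be fed identical input.
std::vector<int> make_random_ints(std::size_t n, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
    std::vector<int> data(n);
    for (auto &value : data) {
        value = dist(gen);
    }
    return data;
}

// Hypothetical helper: sort a private copy with std::sort and return the
// elapsed wall-clock time in milliseconds. Only the sort call is timed.
long long time_std_sort_ms(std::vector<int> data) {
    auto start_time = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    auto end_time = std::chrono::steady_clock::now();
    // Sanity check outside the timed region: the result must be ordered.
    assert(std::is_sorted(data.begin(), data.end()));
    return std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
}
```

Calling `time_std_sort_ms(make_random_ints(100000000, 42))` from `main` and printing the result in the same `elapse time -> ... ms` format would give a number that is directly comparable with the `avx512_qsort` run, since both would be compiled with the same `g++ -O3` flags.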
Here is my CPU information:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Skylake, IBRS)
Stepping: 4
CPU MHz: 2394.374
BogoMIPS: 4788.74
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl

Running ```lscpu | grep avx``` shows that the following instructions are supported:

- avx
- avx2
- avx512f
- avx512dq
- avx512cd
- avx512bw
- avx512vl

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762543464