On Fri, 8 Sep 2023 18:10:33 GMT, Srinivas Vamsi Parasa <d...@openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |    Arrays.sort benchmark   |       Array Size      |       Baseline 
>> (us/op)        |       AVX512 Sort (us/op)     |       Speedup |
>> |    ---     |       ---     |       ---     |       ---     |       ---     
>> |
>> |    ArraysSort.doubleSort   |       10      |       0.034   |       0.035   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       25      |       0.116   |       0.089   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       50      |       0.282   |       0.291   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       75      |       0.474   |       0.358   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       100     |       0.654   |       0.623   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       1000    |       9.274   |       6.331   
>> |       1.5     |
>> |    ArraysSort.doubleSort   |       10000   |       323.339 |       71.228  
>> |       **4.5** |
>> |    ArraysSort.doubleSort   |       100000  |       4471.871        |       
>> 1002.748        |       **4.5** |
>> |    ArraysSort.doubleSort   |       1000000 |       51660.742       |       
>> 12921.295       |       **4.0** |
>> |    ArraysSort.floatSort    |       10      |       0.045   |       0.046   
>> |       1.0     |
>> |    ArraysSort.floatSort    |       25      |       0.103   |       0.084   
>> |       1.2     |
>> |    ArraysSort.floatSort    |       50      |       0.285   |       0.33    
>> |       0.9     |
>> |    ArraysSort.floatSort    |       75      |       0.492   |       0.346   
>> |       1.4     |
>> |    ArraysSort.floatSort    |       100     |       0.597   |       0.326   
>> |       1.8     |
>> |    ArraysSort.floatSort    |       1000    |       9.811   |       5.294   
>> |       1.9     |
>> |    ArraysSort.floatSort    |       10000   |       323.955 |       50.547  
>> |       **6.4** |
>> |    ArraysSort.floatSort    |       100000  |       4326.38 |       731.152 
>> |       **5.9** |
>> |    ArraysSort.floatSort    |       1000000 |       52413.88        |       
>> 8409.193        |       **6.2** |
>> |    ArraysSort.intSort      |       10      |       0.033   |       0.033   
>> |       1.0     |
>> |    ArraysSort.intSort      |       25      |       0.086   |       0.051   
>> |       1.7     |
>> |    ArraysSort.intSort      |       50      |       0.236   |       0.151   
>> |       1.6     |
>> |    ArraysSort.intSort      |       75      |       0.416   |       0.332   
>> |       1.3     |
>> |    ArraysSort.intSort      |       100     |       0.63    |       0.521   
>> |       1.2     |
>> |    ArraysSort.intSort      |       1000    |       10.518  |       4.698   
>> |       2.2     |
>> |    ArraysSort.intSort      |       10000   |       309.659 |       42.518  
>> |       **7.3** |
>> |    ArraysSort.intSort      |       100000  |       4130.917        |       
>> 573.956 |       **7.2** |
>> |    ArraysSort.intSort      |       1000000 |       49876.307       |       
>> 6712.812        |       **7.4** |
>> |    ArraysSort.longSort     |       10      |       0.036   |       0.037   
>> |       1.0     |
>> |    ArraysSort.longSort     |       25      |       0.094   |       0.08    
>> |       1.2     |
>> |    ArraysSort.longSort     |       50      |       0.218   |       0.227   
>> |       1.0     |
>> |    ArraysSort.longSort     |       75      |       0.466   |       0.402   
>> |       1.2     |
>> |    ArraysSort.longSort     |       100     |       0.76    |       0.58    
>> |       1.3     |
>> |    ArraysSort.longSort     |       1000    |       10.449  |       6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fix regression when intrinsics are disabled; enable insertion sort in 
> intrinsic, change library name to libsimdsort

Hello Paul,

Could you please take a look at the Java side changes in the new commit that 
was pushed?
It fixes the regression that was observed when the intrinsics were disabled.

It turns out that the switch is not the problem. Separating the intrinsics for` 
insertionSort` and `mixedInsertionSort` into separate methods (while retaining 
the switch based dispatch) + returning pivotIndices as an integer array, is 
fixing the regression when intrinsics are disabled. 
(For `MemorySegment`, a long array of indices could be returned.)

> That's surprising, i wonder if that is related to the use of switch and 
> somehow C2 cannot elide the allocation via escape analysis? One solution to 
> avoid allocation is to pack the return of the two indices in a long value. 
> That may also simplify the C2 intrinsic implementation. However, that 
> restricts indices to int values. To support `MemorySegment` we need to 
> support long value indices.
>
When passing the pivotIndices array as a parameter, regression was observed 
even in the absence of the intrinsic (i.e. switch). To test this, I took the 
stock JDK implementation of the `DualPivotQuicksort.java` and just only 
refactored the partitioning logic into separate methods. There were no 
intrinsics at all. Even in this case, there was a regression of up to 12%. 
Passing the pivots as seperate parameters and returning the updated pivot 
indices as a new array fixed the regression.

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1712135477

Reply via email to