On Fri, 14 May 2021 17:50:11 GMT, Piotr Tarsa <github.com+1282159+ta...@openjdk.org> wrote:
>> I made a JMH test on jdk16 to test count4 (xor) performance: >> https://github.com/bourgesl/nearly-optimal-mergesort-code/tree/master/sort-bench/results/count_xor >> >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayXorBenchmark.arrayAndOriginal 1000000 avgt 20 684561,951 ± 2177,120 >> ns/op >> ArrayXorBenchmark.arrayXorOriginal 1000000 avgt 20 777255,465 ± 1815,136 >> ns/op >> ArrayXorBenchmark.arrayXor_Masked 1000000 avgt 20 814163,377 ± 2665,436 ns/op >> ArrayXorBenchmark.arrayXor_Unsafe 1000000 avgt 20 707273,922 ± 2994,380 ns/op >> >> Masked xor does not get optimized by c2 too. >> >> Using Unsafe is better, see: >> https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/src/main/java/edu/sorting/bench/ArrayXorBenchmark.java >> >> If you want, I could make another radixsort() using on or off heap count >> buffers and Unsafe, as I did in Marlin to avoid bound checks... > > I think an uncontroversial and efficient solution would be to replace > > count4[(a[i] >>> 24) ^ 0x80]--; > > with > > count4[(a[i] >>> 24) & 0xFF]--; > > and after finishing counting, first half of `count4` should be swapped with > second half of `count4`. FYI I made another radixsortNew() using Unsafe to access to all count arrays: https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/dbfbd731ffd798defc75528cc1db04063bdb4619/src/edu/sorting/DualPivotQuicksort202105.java#L795 But it is only 2% faster in radix-sort-benchmark (1M arrays): https://github.com/bourgesl/radix-sort-benchmark/blob/main/results/2021-ref/cmp-16-1M-full.log It looks not worth the extra complexity for few percents, I think. ------------- PR: https://git.openjdk.java.net/jdk/pull/3938