Below is baseline data collected using a modified version of the
java.lang.foreign.xor microbenchmark referenced by @mcimadamore in the bug
report. I collected the data on an Ubuntu 22.04 laptop with a Tiger Lake
i7-1185G7, which supports AVX-512.
Baseline data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
--------------------------------------------------------------------------------------
XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
XorTest.xor   FOREIGN      LARGE       avgt   30   123115513.182 ±  1287935.621  ns/op
The 'copy' benchmark was added to measure the memory copy components of the
'xor' benchmark, separate from the memory allocation and xor data update
components.
Profile data for the baseline REGION LARGE case shows two hotspots covering
about 90% of cycles:
Baseline REGION LARGE (r231)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
---------------------------------------------------------------------------------------
xor_op                            63.7%  18,189,000,000        52,464,000,000     0.347
__memcpy_evex_unaligned_erms      28.5%   7,608,000,000         3,459,000,000     2.199
The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:
Baseline FOREIGN LARGE (r226)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
---------------------------------------------------------------------------------------
xor_op                            46.4%  18,345,000,000        52,476,000,000     0.350
jlong_disjoint_arraycopy_avx3     29.3%  11,124,000,000         1,404,000,000     7.923
Copy::fill_to_memory_atomic       15.3%   5,016,000,000         8,010,000,000     0.626
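As a reading aid for the profile tables, the CPI Rate column is the ratio of
Clockticks to Instructions Retired. The following minimal Java sketch recomputes
it from the baseline FOREIGN LARGE rows; the class and method names are
hypothetical and exist only for this illustration.

```java
// Hypothetical helper, not part of the PR: recomputes the CPI Rate column
// from the Clockticks and Instructions Retired columns of the profile table.
class CpiCheck {
    // CPI = clockticks / instructions retired
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / instructionsRetired;
    }

    public static void main(String[] args) {
        // Values copied from the baseline FOREIGN LARGE profile.
        System.out.printf("xor_op                        %.3f%n",
                cpi(18_345_000_000L, 52_476_000_000L));
        System.out.printf("jlong_disjoint_arraycopy_avx3 %.3f%n",
                cpi(11_124_000_000L, 1_404_000_000L));
        System.out.printf("Copy::fill_to_memory_atomic   %.3f%n",
                cpi(5_016_000_000L, 8_010_000_000L));
    }
}
```

A CPI near 8 for the array copy stub, against roughly 0.35 for xor_op, is
consistent with the copy loop spending most of its cycles waiting on memory
rather than executing instructions.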
This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The
Copy::fill_to_memory_atomic hotspot (which I believe is associated with the
benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3
array copy code is optimized by increasing the loop granularity from 192 to 256
bytes, adding source address prefetches, and using non-temporal writes with a
store fence. The optimized code is only used for copies larger than a set
threshold, currently 2.5MB. This is the size at which the optimized code was
observed to be faster than the original code. The profile data with
optimization is:
Optimized FOREIGN LARGE (r277)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
---------------------------------------------------------------------------------------
xor_op                            51.2%  18,153,000,000        52,404,000,000     0.346
jlong_disjoint_arraycopy_avx3     22.4%   7,581,000,000         2,364,000,000     3.207
Copy::fill_to_memory_atomic       16.3%   5,316,000,000         7,917,000,000     0.671
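The size-based selection between the original and the optimized copy loop can
be pictured with the Java sketch below. It is only a model of the dispatch
arithmetic, not the stub itself: the class, constant, and method names are
hypothetical, "2.5MB" is assumed to mean 2.5 * 1024 * 1024 bytes, and the
prefetches, non-temporal stores, and store fence have no Java equivalent and
are not modeled.

```java
// Hypothetical model of the copy-path selection described in this PR.
// None of these names come from the HotSpot sources.
class LargeCopyPlan {
    // Assumption: "2.5MB" is read as 2.5 * 1024 * 1024 bytes.
    static final long LARGE_COPY_THRESHOLD_BYTES = 5L * 1024 * 1024 / 2;
    static final int ORIGINAL_LOOP_BYTES = 192; // bytes per iteration, original loop
    static final int LARGE_LOOP_BYTES = 256;    // bytes per iteration, optimized loop

    // The optimized loop is only selected for copies above the threshold.
    static boolean usesLargeCopyPath(long byteCount) {
        return byteCount > LARGE_COPY_THRESHOLD_BYTES;
    }

    static int loopBytes(long byteCount) {
        return usesLargeCopyPath(byteCount) ? LARGE_LOOP_BYTES : ORIGINAL_LOOP_BYTES;
    }

    // Number of full main-loop iterations for a copy of the given size.
    static long fullIterations(long byteCount) {
        return byteCount / loopBytes(byteCount);
    }

    // Bytes left over after the main loop; how the stub copies them is not modeled.
    static long tailBytes(long byteCount) {
        return byteCount % loopBytes(byteCount);
    }

    public static void main(String[] args) {
        long[] sizes = {1L << 20, LARGE_COPY_THRESHOLD_BYTES, 4L << 20};
        for (long size : sizes) {
            System.out.printf("%d bytes: large path=%b, %d iterations of %d bytes, %d tail bytes%n",
                    size, usesLargeCopyPath(size), fullIterations(size),
                    loopBytes(size), tailBytes(size));
        }
    }
}
```

In this model a copy of exactly the threshold size still takes the original
path, matching the "larger than" wording above; whether the actual comparison
is strict is an assumption.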
The optimization brings the cycles for the mem copy work roughly to parity with
the REGION LARGE case. Benchmark data for the optimized case:
Optimized data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
--------------------------------------------------------------------------------------
XorTest.copy  ELEMENTS     SMALL       avgt   30   551072938.467 ±  4287149.108  ns/op
XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272304419.633 ±  2993793.130  ns/op
XorTest.copy  ELEMENTS     LARGE       avgt   30  1013925081.233 ±  8590245.238  ns/op
XorTest.copy  REGION       SMALL       avgt   30     7472329.003 ±    77394.114  ns/op
XorTest.copy  REGION       MEDIUM      avgt   30    19882540.205 ±   349544.602  ns/op
XorTest.copy  REGION       LARGE       avgt   30    21185593.636 ±   404369.655  ns/op
XorTest.copy  CRITICAL     SMALL       avgt   30       52358.715 ±     1382.355  ns/op
XorTest.copy  CRITICAL     MEDIUM      avgt   30        2525.108 ±       22.396  ns/op
XorTest.copy  CRITICAL     LARGE       avgt   30         528.865 ±       11.747  ns/op
XorTest.copy  FOREIGN      SMALL       avgt   30     7748587.890 ±    67352.844  ns/op
XorTest.copy  FOREIGN      MEDIUM      avgt   30    19401977.378 ±   256247.071  ns/op
XorTest.copy  FOREIGN      LARGE       avgt   30    21519594.325 ±   124712.980  ns/op
XorTest.xor   ELEMENTS     SMALL       avgt   30   221049328.389 ±  2629557.148  ns/op
XorTest.xor   ELEMENTS     MEDIUM      avgt   30   503362446.150 ±  3759664.343  ns/op
XorTest.xor   ELEMENTS     LARGE       avgt   30  1186563496.067 ±  5135607.671  ns/op
XorTest.xor   REGION       SMALL       avgt   30    88402928.083 ±   790941.309  ns/op
XorTest.xor   REGION       MEDIUM      avgt   30    80041519.052 ±   597221.491  ns/op
XorTest.xor   REGION       LARGE       avgt   30    87706448.917 ±   751350.609  ns/op
XorTest.xor   CRITICAL     SMALL       avgt   30    56869387.315 ±   408618.338  ns/op
XorTest.xor   CRITICAL     MEDIUM      avgt   30    59041245.745 ±   820141.039  ns/op
XorTest.xor   CRITICAL     LARGE       avgt   30    60433672.443 ±   500954.831  ns/op
XorTest.xor   FOREIGN      SMALL       avgt   30    72838421.976 ±   410147.170  ns/op
XorTest.xor   FOREIGN      MEDIUM      avgt   30    87970109.478 ±  1058857.783  ns/op
XorTest.xor   FOREIGN      LARGE       avgt   30   103970690.407 ±  1033001.637  ns/op
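To put the FOREIGN LARGE results in relative terms, the sketch below computes
the percent change between the baseline and optimized scores and checks whether
the reported JMH error intervals overlap. The class and method names are
hypothetical; the numbers are copied from the two benchmark tables.

```java
// Hypothetical helper, not part of the PR: compares baseline and optimized
// JMH scores (avgt mode, ns/op, so lower is better).
class ScoreDelta {
    // Negative result means the optimized run took less time per op.
    static double percentChange(double baseline, double optimized) {
        return (optimized - baseline) / baseline * 100.0;
    }

    // True if [score - error, score + error] of the two runs overlap.
    static boolean intervalsOverlap(double baseline, double baselineError,
                                    double optimized, double optimizedError) {
        return Math.abs(baseline - optimized) <= baselineError + optimizedError;
    }

    public static void main(String[] args) {
        // XorTest.copy FOREIGN LARGE
        System.out.printf("copy FOREIGN LARGE: %.1f%%, overlap=%b%n",
                percentChange(34616758.085, 21519594.325),
                intervalsOverlap(34616758.085, 340300.726, 21519594.325, 124712.980));
        // XorTest.xor FOREIGN LARGE
        System.out.printf("xor  FOREIGN LARGE: %.1f%%, overlap=%b%n",
                percentChange(123115513.182, 103970690.407),
                intervalsOverlap(123115513.182, 1287935.621, 103970690.407, 1033001.637));
    }
}
```

With these inputs the copy case comes out at roughly a 38% reduction in time
per op and the xor case at roughly 16%, with non-overlapping error intervals in
both.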
I am very much looking forward to contributing to OpenJDK! Please review this
PR and let me know how it can be improved.
-------------
Commit messages:
- - fix whitespace issues
- - initial commit -- optimize large array cases in
StubGenerator::generate_disjoint_copy_avx3_masked
Changes: https://git.openjdk.org/jdk/pull/16575/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8310159
Stats: 597 lines in 11 files changed: 596 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/16575.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575
PR: https://git.openjdk.org/jdk/pull/16575