Below is baseline data collected using a modified version of the
java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
report. I collected the data on an Ubuntu 22.04 laptop with a Tiger Lake
i7-1185G7, which supports AVX512.

Baseline data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
--------------------------------------------------------------------------------------
XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
XorTest.xor      CRITICAL       SMALL  avgt   30    57231375.744 ±   438223.342  ns/op
XorTest.xor      CRITICAL      MEDIUM  avgt   30    58583884.930 ±   375355.215  ns/op
XorTest.xor      CRITICAL       LARGE  avgt   30    60644832.949 ±   588120.738  ns/op
XorTest.xor       FOREIGN       SMALL  avgt   30    73868679.405 ±   819965.524  ns/op
XorTest.xor       FOREIGN      MEDIUM  avgt   30    88156275.944 ±  1051257.152  ns/op
XorTest.xor       FOREIGN       LARGE  avgt   30   123115513.182 ±  1287935.621  ns/op

The 'copy' benchmark was added to measure the memory copy components of the 
'xor' benchmark, separate from the memory allocation and xor data update 
components.

Profile data for the baseline REGION LARGE case shows two hotspots covering
about 90% of cycles:


Baseline REGION LARGE (r231)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          63.7%       18,189,000,000  52,464,000,000          0.347
__memcpy_evex_unaligned_erms    28.5%        7,608,000,000   3,459,000,000          2.199

The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:

Baseline FOREIGN LARGE (r226)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          46.4%       18,345,000,000  52,476,000,000          0.350
jlong_disjoint_arraycopy_avx3   29.3%       11,124,000,000   1,404,000,000          7.923
Copy::fill_to_memory_atomic     15.3%        5,016,000,000   8,010,000,000          0.626
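
For readers less familiar with the profile columns, the CPI Rate values in the baseline tables can be reproduced as clockticks divided by instructions retired. The short Java sketch below is only an illustration (the class and method names are made up for this note); it recomputes the column from the counter values listed above. The high CPI of jlong_disjoint_arraycopy_avx3 compared with xor_op suggests that the stub spends most of its cycles waiting rather than retiring instructions, which would be consistent with a memory-bound copy.

```java
import java.util.Locale;

// Illustrative helper, not part of the PR: recomputes the "CPI Rate" column.
final class CpiRateCheck {
    /** Cycles per instruction: clockticks divided by instructions retired. */
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / instructionsRetired;
    }

    static void report(String function, long clockticks, long instructionsRetired) {
        System.out.println(String.format(Locale.ROOT, "%-40s %.3f",
                function, cpi(clockticks, instructionsRetired)));
    }

    public static void main(String[] args) {
        // Counter values copied from the baseline REGION LARGE profile.
        report("xor_op (REGION)", 18_189_000_000L, 52_464_000_000L);
        report("__memcpy_evex_unaligned_erms", 7_608_000_000L, 3_459_000_000L);
        // Counter values copied from the baseline FOREIGN LARGE profile.
        report("xor_op (FOREIGN)", 18_345_000_000L, 52_476_000_000L);
        report("jlong_disjoint_arraycopy_avx3", 11_124_000_000L, 1_404_000_000L);
        report("Copy::fill_to_memory_atomic", 5_016_000_000L, 8_010_000_000L);
    }
}
```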

This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The
Copy::fill_to_memory_atomic hotspot (which I believe is associated with the
benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3
array copy code is optimized by increasing the loop granularity from 192 to 256
bytes, adding source address prefetches, and using non-temporal writes with a
store fence. The optimized code is only used for copies larger than a set
threshold, currently 2.5MB. This is the size at which the optimized code was
observed to be faster than the original code. The profile data with
optimization is:

Optimized FOREIGN LARGE (r277)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          51.2%       18,153,000,000  52,404,000,000          0.346
jlong_disjoint_arraycopy_avx3   22.4%        7,581,000,000   2,364,000,000          3.207
Copy::fill_to_memory_atomic     16.3%        5,316,000,000   7,917,000,000          0.671
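
To make the size-based dispatch easier to picture, here is a rough Java sketch of its control flow only. It is not the code in this PR: the real change is in generated AVX-512 stub code, and prefetches, non-temporal stores and the store fence cannot be expressed in plain Java, so they appear only as comments. The class name, the method names, and the exact byte value used for "2.5MB" (2.5 * 1024 * 1024) are assumptions made for illustration.

```java
// Illustrative control-flow sketch only; all names here are hypothetical.
final class LargeCopyDispatchSketch {
    // Assumption: "2.5MB" is read as 2.5 * 1024 * 1024 bytes.
    static final long LARGE_COPY_THRESHOLD_BYTES = 2_621_440L;
    // Loop granularity described in the PR text: 256 bytes per iteration.
    static final int LOOP_BYTES = 256;
    static final int LONGS_PER_ITERATION = LOOP_BYTES / Long.BYTES; // 32 longs

    /** The large-copy path is taken only above the threshold. */
    static boolean useLargeCopyPath(long byteCount) {
        return byteCount > LARGE_COPY_THRESHOLD_BYTES;
    }

    static void copyLongs(long[] src, long[] dst, int count) {
        long byteCount = (long) count * Long.BYTES;
        if (!useLargeCopyPath(byteCount)) {
            // Small and medium copies keep using the existing path.
            System.arraycopy(src, 0, dst, 0, count);
            return;
        }
        int i = 0;
        int mainEnd = count - (count % LONGS_PER_ITERATION);
        for (; i < mainEnd; i += LONGS_PER_ITERATION) {
            // In the stub: prefetch source data ahead of the current position,
            // then write this 256-byte block with non-temporal stores.
            System.arraycopy(src, i, dst, i, LONGS_PER_ITERATION);
        }
        // Remaining elements that do not fill a whole 256-byte block.
        System.arraycopy(src, i, dst, i, count - i);
        // In the stub: a store fence follows the non-temporal writes.
    }

    public static void main(String[] args) {
        long[] src = new long[400_003]; // about 3.2 MB, above the threshold
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 31L;
        }
        long[] dst = new long[src.length];
        copyLongs(src, dst, src.length);
        System.out.println("copied correctly: " + java.util.Arrays.equals(src, dst));
    }
}
```

The threshold check keeps the cost of the extra setup away from copies that fit comfortably in cache, where the original loop was observed to be faster.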

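The optimization brings the cycles for the memory copy work roughly to parity
with the REGION LARGE case (7,581,000,000 clockticks for
jlong_disjoint_arraycopy_avx3 versus 7,608,000,000 for
__memcpy_evex_unaligned_erms). Benchmark data for the optimized case: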

Optimized data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score         Error  Units
XorTest.copy     ELEMENTS       SMALL  avgt   30   551072938.467 ± 4287149.108  ns/op
XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272304419.633 ± 2993793.130  ns/op
XorTest.copy     ELEMENTS       LARGE  avgt   30  1013925081.233 ± 8590245.238  ns/op
XorTest.copy       REGION       SMALL  avgt   30     7472329.003 ±   77394.114  ns/op
XorTest.copy       REGION      MEDIUM  avgt   30    19882540.205 ±  349544.602  ns/op
XorTest.copy       REGION       LARGE  avgt   30    21185593.636 ±  404369.655  ns/op
XorTest.copy     CRITICAL       SMALL  avgt   30       52358.715 ±    1382.355  ns/op
XorTest.copy     CRITICAL      MEDIUM  avgt   30        2525.108 ±      22.396  ns/op
XorTest.copy     CRITICAL       LARGE  avgt   30         528.865 ±      11.747  ns/op
XorTest.copy      FOREIGN       SMALL  avgt   30     7748587.890 ±   67352.844  ns/op
XorTest.copy      FOREIGN      MEDIUM  avgt   30    19401977.378 ±  256247.071  ns/op
XorTest.copy      FOREIGN       LARGE  avgt   30    21519594.325 ±  124712.980  ns/op
XorTest.xor      ELEMENTS       SMALL  avgt   30   221049328.389 ± 2629557.148  ns/op
XorTest.xor      ELEMENTS      MEDIUM  avgt   30   503362446.150 ± 3759664.343  ns/op
XorTest.xor      ELEMENTS       LARGE  avgt   30  1186563496.067 ± 5135607.671  ns/op
XorTest.xor        REGION       SMALL  avgt   30    88402928.083 ±  790941.309  ns/op
XorTest.xor        REGION      MEDIUM  avgt   30    80041519.052 ±  597221.491  ns/op
XorTest.xor        REGION       LARGE  avgt   30    87706448.917 ±  751350.609  ns/op
XorTest.xor      CRITICAL       SMALL  avgt   30    56869387.315 ±  408618.338  ns/op
XorTest.xor      CRITICAL      MEDIUM  avgt   30    59041245.745 ±  820141.039  ns/op
XorTest.xor      CRITICAL       LARGE  avgt   30    60433672.443 ±  500954.831  ns/op
XorTest.xor       FOREIGN       SMALL  avgt   30    72838421.976 ±  410147.170  ns/op
XorTest.xor       FOREIGN      MEDIUM  avgt   30    87970109.478 ± 1058857.783  ns/op
XorTest.xor       FOREIGN       LARGE  avgt   30   103970690.407 ± 1033001.637  ns/op
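
As a quick way to read the two tables together, the following Java sketch divides the baseline FOREIGN LARGE scores by the optimized ones, and compares the optimized FOREIGN LARGE copy against the optimized REGION LARGE copy. The class and method names are made up for this note; the scores are copied from the tables above and the error margins are ignored, so the ratios are only approximate.

```java
import java.util.Locale;

// Illustrative arithmetic on the JMH scores above; not part of the PR.
final class ForeignLargeSpeedup {
    /** Ratio of baseline to optimized average time; values above 1 mean faster. */
    static double speedup(double baselineNsPerOp, double optimizedNsPerOp) {
        return baselineNsPerOp / optimizedNsPerOp;
    }

    public static void main(String[] args) {
        // XorTest.copy FOREIGN LARGE, baseline vs optimized.
        double copy = speedup(34_616_758.085, 21_519_594.325);
        // XorTest.xor FOREIGN LARGE, baseline vs optimized.
        double xor = speedup(123_115_513.182, 103_970_690.407);
        // Optimized XorTest.copy: FOREIGN LARGE relative to REGION LARGE.
        double foreignVsRegion = 21_519_594.325 / 21_185_593.636;

        System.out.println(String.format(Locale.ROOT, "copy FOREIGN LARGE speedup: %.2fx", copy));
        System.out.println(String.format(Locale.ROOT, "xor  FOREIGN LARGE speedup: %.2fx", xor));
        System.out.println(String.format(Locale.ROOT, "optimized copy FOREIGN/REGION: %.2f", foreignVsRegion));
    }
}
```

With these inputs the copy case comes out at roughly 1.61x and the xor case at roughly 1.18x, and the optimized FOREIGN LARGE copy lands within about 2% of REGION LARGE.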

I am very much looking forward to contributing to OpenJDK!  Please review this 
PR and let me know how it can be improved.

-------------

Commit messages:
 - - fix whitespace issues
 - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked

Changes: https://git.openjdk.org/jdk/pull/16575/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8310159
  Stats: 597 lines in 11 files changed: 596 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
