On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann <d...@openjdk.org> wrote:

>> Update: the XorTest::xor results shown in this message used test code from 
>> PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit 
>> a788f066af17.  The XorTest has since been updated and XorTest::copy is no 
>> longer needed and has been removed from this pull request.  See comment 
>> [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) 
>> for updated performance data. 
>> 
>> Below is baseline data collected using a modified version of the 
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
>> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
>> i7-1185G7, which does support AVX512. 
>> 
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          
>> Error  Units
>> --------------------------------------------------------------------------------------
>> XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 
>> 60414308.540  ns/op
>> XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  
>> 2924954.498  ns/op
>> XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 
>> 28334453.652  ns/op
>> XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   
>> 216821.819  ns/op
>> XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   
>> 147398.572  ns/op
>> XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   
>> 179263.875  ns/op
>> XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      
>> 542.482  ns/op
>> XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       
>> 11.375  ns/op
>> XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        
>> 5.831  ns/op
>> XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    
>> 79489.276  ns/op
>> XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   
>> 500505.099  ns/op
>> XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   
>> 340300.726  ns/op
>> XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  
>> 2329417.319  ns/op
>> XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  
>> 3818334.424  ns/op
>> XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  
>> 5877981.900  ns/op
>> XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   
>> 599704.491  ns/op
>> XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  
>> 1406342.118  ns/op
>> XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   
>> 775577.613  ns/op
>> XorTest.xor      CRITICAL       SMALL  avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains ten commits:
> 
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added 
> org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - - initial commit -- optimize large array cases in 
> StubGenerator::generate_disjoint_copy_avx3_masked
>      - add src address prefetches
>      - switch to non-temporal writes
>      - added modified jmh benchmark based on xor benchmark from Maurizio 
> Cimadamore

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:

> 751:   Label L_pre_main_post_large;
> 752: 
> 753:   if (MaxVectorSize == 64) {

This should be an assert here instead of if check as this method shouldn't be 
called if MaxVectorSize is < 64.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:

> 766:     }
> 767:     __ movq(temp3, temp2);
> 768:     copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);

The last argument should be "true" or "1" instead of "0" or "false".  This is 
as temp3 (length) could be less than 32 as well. This case is only handled when 
use64byteVector argument is true.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:

> 775: 
> 776:     __ BIND(L_main_pre_loop_large);
> 777:     __ subq(temp1, loop_size[shift]);  // whay is this here

Spurious comment.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:

> 795:     __ jcc(Assembler::lessEqual, L_exit_large);
> 796:     arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
> 797:                                  temp4, temp3, L_entry_large, 
> L_exit_large);

When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need 
to be copied so we don't need to go back to L_entry_large.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:

> 1056:   };
> 1057: 
> 1058:   if (MaxVectorSize == 64) {

This should be an assert here instead of if check as this method shouldn't be 
called if MaxVectorSize is < 64.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1175:

> 1173: void StubGenerator::copy256_avx3(Register dst, Register src, Register 
> index, XMMRegister xmm1,
> 1174:                                 XMMRegister xmm2, XMMRegister xmm3, 
> XMMRegister xmm4,
> 1175:                                 bool conjoint, int shift, int offset) {

The conjoint parameter is not used so could be removed from this function.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916497
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399918640
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399896421
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399934713
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916567
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399881916

Reply via email to