Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]

2023-11-26 Thread Tobias Hartmann
On Tue, 21 Nov 2023 21:03:20 GMT, Steve Dohrmann  wrote:

>> Update: the XorTest::xor results shown in this message used test code from 
>> PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit 
>> a788f066af17.  The XorTest has since been updated and XorTest::copy is no 
>> longer needed and has been removed from this pull request.  See comment 
>> [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) 
>> for updated performance data. 
>> 
>> Below is baseline data collected using a modified version of the 
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
>> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
>> i7-1185G7, which does support AVX512. 
>> 
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> ----------------------------------------------------------------------------------------
>> XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor      CRITICAL       SMALL  avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 11 commits:
> 
>  - Merge branch 'master' into memcpy
>  - Updates based on reviewer (sviswa7) comments including
>    - use asserts instead of conditionals in two logically unreachable blocks
>    - remove unused function parameters
>    - use 64-byte vector path in pre-loop masked write
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/e47cf611...02ad27fa

Correctness and performance testing passed.

-

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1749803030


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]

2023-11-21 Thread Sandhya Viswanathan
On Tue, 21 Nov 2023 21:03:20 GMT, Steve Dohrmann  wrote:

Thanks a lot for taking care of all the review comments. The PR looks good to 
me now.

-

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1743562444


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

2023-11-21 Thread Steve Dohrmann
On Tue, 21 Nov 2023 01:14:49 GMT, Sandhya Viswanathan wrote:

>> Steve Dohrmann has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains ten commits:
>> 
>>  - Merge branch 'master' into memcpy
>>  - Update full name
>>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>>  - - remerge upstream master
>>    - remove ::copy test from XorTest
>>  - Merge branch 'master' into memcpy
>>  - - fix whitespace error
>>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>>  - - fix whitespace issues
>>    - fix xor test foreign impl constructor signature
>>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>>  - add src address prefetches
>>  - switch to non-temporal writes
>>  - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:
> 
>> 766: }
>> 767: __ movq(temp3, temp2);
>> 768: copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);
> 
> The last argument should be "true" or "1" instead of "0" or "false".  This is 
> as temp3 (length) could be less than 32 as well. This case is only handled 
> when use64byteVector argument is true.

Thanks, done.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401229608


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]

2023-11-21 Thread Steve Dohrmann

Steve Dohrmann has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 11 commits:

 - Merge branch 'master' into memcpy
 - Updates based on reviewer (sviswa7) comments including
   - use asserts instead of conditionals in two logically unreachable blocks
   - remove unused function parameters
   - use 64-byte vector path in pre-loop masked write
 - Merge branch 'master' into memcpy
 - Update full name
   Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
 - - remerge upstream master
   - remove ::copy test from XorTest
 - Merge branch 'master' into memcpy
 - - fix whitespace error
 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - ... and 1 more: https://git.openjdk.org/jdk/compare/e47cf611...02ad27fa

-

Changes: https://git.openjdk.org/jdk/pull/16575/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=05
  Stats: 259 lines in 5 files changed: 259 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

2023-11-21 Thread Steve Dohrmann
On Tue, 21 Nov 2023 01:10:40 GMT, Sandhya Viswanathan wrote:

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:
> 
>> 751:   Label L_pre_main_post_large;
>> 752: 
>> 753:   if (MaxVectorSize == 64) {
> 
> This should be an assert here instead of if check as this method shouldn't be 
> called if MaxVectorSize is < 64.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:
> 
>> 775: 
>> 776: __ BIND(L_main_pre_loop_large);
>> 777: __ subq(temp1, loop_size[shift]);  // whay is this here
> 
> Spurious comment.

Thanks, done

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:
> 
>> 795: __ jcc(Assembler::lessEqual, L_exit_large);
>> 796: arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
>> 797:  temp4, temp3, L_entry_large, L_exit_large);
> 
> When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes 
> need to be copied so we don't need to go back to L_entry_large.

Thanks, done

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:
> 
>> 1056:   };
>> 1057: 
>> 1058:   if (MaxVectorSize == 64) {
> 
> This should be an assert here instead of if check as this method shouldn't be 
> called if MaxVectorSize is < 64.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1175:
> 
>> 1173: void StubGenerator::copy256_avx3(Register dst, Register src, Register index, XMMRegister xmm1,
>> 1174: XMMRegister xmm2, XMMRegister xmm3, XMMRegister xmm4,
>> 1175: bool conjoint, int shift, int offset) {
> 
> The conjoint parameter is not used so could be removed from this function.

Thanks, done.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401179711
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401178991
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401180641
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401180258
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401177983


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

2023-11-20 Thread Sandhya Viswanathan
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann  wrote:

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:

> 751:   Label L_pre_main_post_large;
> 752: 
> 753:   if (MaxVectorSize == 64) {

This should be an assert rather than an if check, since this method shouldn't 
be called if MaxVectorSize is < 64.
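
To illustrate the point above, here is a minimal sketch of the assert-versus-if 
distinction. It is not the HotSpot code: max_vector_size, kLargeThreshold, 
copy_large and copy_dispatch are hypothetical names, the threshold value is 
arbitrary, and the standard C++ assert stands in for HotSpot's own assert 
macro. The caller checks the vector size once; the helper only documents and 
verifies the precondition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the MaxVectorSize flag (in bytes).
static int max_vector_size = 64;

// Hypothetical size above which the large-copy path is chosen (arbitrary value).
constexpr std::size_t kLargeThreshold = 4096;

// Large-copy helper. It must only be reached when 64-byte vectors are
// available, so the condition is asserted rather than branched on.
inline void copy_large(unsigned char* dst, const unsigned char* src, std::size_t len) {
    assert(max_vector_size == 64 && "large-copy path requires 64-byte vectors");
    std::memcpy(dst, src, len);  // placeholder for the real vectorized loop
}

// The caller owns the decision; returns true when the large path was taken.
inline bool copy_dispatch(unsigned char* dst, const unsigned char* src, std::size_t len) {
    if (max_vector_size == 64 && len >= kLargeThreshold) {
        copy_large(dst, src, len);
        return true;
    }
    std::memcpy(dst, src, len);
    return false;
}
```

With this split, a violated precondition fails loudly in debug builds instead 
of silently skipping the block.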

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:

> 766: }
> 767: __ movq(temp3, temp2);
> 768: copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);

The last argument should be "true" or "1" instead of "0" or "false", because 
temp3 (length) could also be less than 32. That case is only handled when the 
use64byteVector argument is true.
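
As a rough illustration of why a single 64-byte masked move can cover any 
short length, including lengths below 32, the sketch below emulates an opmask 
in plain C++. It assumes byte elements and makes no claim about what 
copy64_masked_avx does internally; tail_mask and masked_copy64 are 
hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build a 64-bit lane mask with the low `len` bits set, for len in 0..64.
inline std::uint64_t tail_mask(std::size_t len) {
    assert(len <= 64);
    // Shifting a 64-bit value by 64 is undefined, so handle the full case first.
    return len == 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << len) - 1);
}

// Emulate a masked 64-byte move: only lanes whose mask bit is set are written.
inline void masked_copy64(unsigned char* dst, const unsigned char* src, std::uint64_t mask) {
    for (std::size_t lane = 0; lane < 64; ++lane) {
        if (mask & (std::uint64_t{1} << lane)) {
            dst[lane] = src[lane];
        }
    }
}
```

For example, masked_copy64(dst, src, tail_mask(20)) writes only the first 20 
bytes and leaves the remaining lanes of dst untouched.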

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:

> 775: 
> 776: __ BIND(L_main_pre_loop_large);
> 777: __ subq(temp1, loop_size[shift]);  // whay is this here

Spurious comment.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:

> 795: __ jcc(Assembler::lessEqual, L_exit_large);
> 796: arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
> 797:  temp4, temp3, L_entry_large, L_exit_large);

When we reach arraycopy_avx3_special_cases_256 here, at most 256 bytes remain 
to be copied, so we don't need to go back to L_entry_large.
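
The control flow being suggested can be sketched as a main loop followed by a 
one-shot tail that falls through to the exit. This is a simplified C++ model, 
not the stub itself: kChunk and copy_main_then_tail are hypothetical, and the 
fixed 256-byte chunk is an assumption standing in for the real loop size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical chunk size; the real stub derives its loop size from the element shift.
constexpr std::size_t kChunk = 256;

// Copies len bytes and returns how many full main-loop iterations ran.
inline std::size_t copy_main_then_tail(unsigned char* dst, const unsigned char* src,
                                       std::size_t len) {
    std::size_t off = 0;
    std::size_t main_iterations = 0;

    // Main loop: runs only while more than one chunk remains.
    while (len - off > kChunk) {
        std::memcpy(dst + off, src + off, kChunk);
        off += kChunk;
        ++main_iterations;
    }

    // Tail: at most kChunk bytes are left, so one pass finishes the copy
    // and control goes straight to the exit, never back to the loop entry.
    assert(len - off <= kChunk);
    std::memcpy(dst + off, src + off, len - off);
    return main_iterations;
}
```

Because the tail handler is entered only when the remainder fits in one chunk, 
it needs an exit label but no loop-entry label.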

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-20 Thread Steve Dohrmann
On Mon, 20 Nov 2023 07:27:12 GMT, Tobias Hartmann  wrote:

>> Steve Dohrmann has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   - fix whitespace error
>
> test/micro/org/openjdk/bench/java/lang/foreign/xor/XorOp.java line 10:
> 
>> 8: void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
>> 9: =======
>> 10: >>>>>>> 9727f4bdddc071e6f59806087339f345405ab004
> 
> You have multiple merge conflicts in the micro benchmark files.

Sorry, not sure how I missed the conflicts.  They should be resolved now. 
Thanks!

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399887729


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

2023-11-20 Thread Steve Dohrmann
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann  wrote:

The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the 
introductory comment above used XorTest code from PR commit 7cc272e86279, which 
was based on Maurizio Cimadamore's commit a788f066af17.  The XorTest has since 
been updated, and XorTest::copy is no longer needed and has been removed from 
this pull request.  Performance can be evaluated using both the new XorTest and 
a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this 
PR.  Results from these two benchmarks are shown below:

In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active in 
sizes 5MB and 10MB.


// Baseline
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte        10  avgt   15    2434.515 ±    11.526  ns/op
ArrayCopyAlignedLarge.testByte       100  avgt   15   51211.235 ±   539.355  ns/op
ArrayCopyAlignedLarge.testByte       200  avgt   15  104837.012 ±  1338.823  ns/op
ArrayCopyAlignedLarge.testByte       500  avgt   15  293357.745 ±  3233.745  ns/op
ArrayCopyAlignedLarge.testByte      1000  avgt   15  957068.292 ± 15509.983  ns/op

// PR
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte        10  avgt   15    2443.354 ±    17.996  ns/op

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

2023-11-20 Thread Steve Dohrmann
> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 
> 
> Baseline data
> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
> ----------------------------------------------------------------------------------------
> XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
> XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
> XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
> XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
> XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
> XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
> XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
> XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
> XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
> XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
> XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
> XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
> XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
> XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
> XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
> XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
> XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
> XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
> XorTest.xor      CRITICAL       SMALL  avgt   30    57231375.744 ±   438223.342  ns/op
> XorTest.xor      CRITICAL      MEDIUM  avgt   30    58583884.930 ±   375355.215  ns/op
> XorTest.xor      CRITICAL       LARGE  avgt   30    60644832.949 ±   588120.738  ns/op
> XorTest.xor       FOREIGN       SMALL  avgt   30    73868679.405 ±   819965.524  ns/op
> XorTest.xor       FOREIGN      MEDIUM  avgt   30    88156275.944 ±  1051257.152  ns/op
> XorTest.xor       FOREIGN       LARGE  avgt   30   123115513...

Steve Dohrmann has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains ten commits:

 - Merge branch 'master' into memcpy
 - Update full name
   Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
 - - remerge upstream master
   - remove ::copy test from XorTest
 - Merge branch 'master' into memcpy
 - - fix whitespace error
 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
 - add src address prefetches
 - switch to non-temporal writes
 - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

-

Changes: https://git.openjdk.org/jdk/pull/16575/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=04
  Stats: 264 lines in 5 files changed: 264 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-20 Thread Jatin Bhateja
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann  wrote:

> Steve Dohrmann has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   - fix whitespace error

Hi @steveatgh, the x86 code changes look good to me.

-

Marked as reviewed by jbhateja (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1740295564


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-19 Thread Tobias Hartmann
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann  wrote:

> Steve Dohrmann has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   - fix whitespace error

test/micro/org/openjdk/bench/java/lang/foreign/xor/XorOp.java line 10:

> 8: void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
> 9: =======
> 10: >>>>>>> 9727f4bdddc071e6f59806087339f345405ab004

You have multiple merge conflicts in the micro benchmark files.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1398749880


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-19 Thread Tobias Hartmann
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann  wrote:

> Steve Dohrmann has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   - fix whitespace error

Thanks, I re-submitted testing.

-

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1818334820


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-16 Thread Steve Dohrmann
On Mon, 13 Nov 2023 08:36:44 GMT, Tobias Hartmann  wrote:

>> Steve Dohrmann has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   - fix whitespace error
>
> I submitted some quick testing and I'm seeing the following failure with 
> multiple tests:
> 
> 
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error 
> (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201),
>  pid=24136, tid=24139
> #  assert(MaxVectorSize == 64) failed: vector length != 64
> #
> # JRE version:  (22.0) (fastdebug build )
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 
> 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, 
> compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, 
> Register, XMMRegister, KRegister, Register, Register, Register, int, int, 
> bool)+0x366
> 
> Stack: [0x7f0b5e919000,0x7f0b5ea1a000],  sp=0x7f0b5ea17150,  free 
> space=1016k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, 
> Register, XMMRegister, KRegister, Register, Register, Register, int, int, 
> bool)+0x366  (stubGenerator_x86_64_arraycopy.cpp:1201)
> V  [libjvm.so+0x16c0ecd]  
> StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, 
> Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d  
> (stubGenerator_x86_64_arraycopy.cpp:1055)
> V  [libjvm.so+0x16c16c1]  StubGenerator::arraycopy_avx3_large(Register, 
> Register, Register, Register, Register, Register, Register, XMMRegister, 
> XMMRegister, XMMRegister, XMMRegister, int)+0x3f1  
> (stubGenerator_x86_64_arraycopy.cpp:790)
> V  [libjvm.so+0x16c22f0]  
> StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char 
> const*, int, bool, bool, bool)+0xa90  (stubGenerator_x86_64_arraycopy.cpp:728)
> V  [libjvm.so+0x16c4b85]  StubGenerator::generate_disjoint_byte_copy(bool, 
> unsigned char**, char const*)+0x965  (stubGenerator_x86_64_arraycopy.cpp:1277)
> V  [libjvm.so+0x16cb309]  StubGenerator::generate_arraycopy_stubs()+0x29  
> (stubGenerator_x86_64_arraycopy.cpp:88)
> V  [libjvm.so+0x16a1089]  StubGenerator::generate_final_stubs()+0xb9  
> (stubGenerator_x86_64.cpp:4051)
> V  [libjvm.so+0x16a22a5]  StubGenerator_generate(CodeBuffer*, 
> StubCodeGenerator::StubsKind)+0x105  (stubGenerator_x86_64.cpp:4296)
> V  [libjvm.so+0x16f349e]  initialize_stubs(StubCodeGenerator::StubsKind, int, 
> int, char const*, char const*, char const*)+0x13e  (stubRoutines.cpp:241)
> V  [libjvm.so+0x16f500d]  final_stubs_init()+0x3d  (stubRoutines.cpp:288)
> V  [libjvm.so+0xe30c59]...

@TobiHartmann I updated the PR with a fix for the assert you saw. The large 
copy code generation / use is now predicated on MaxVectorSize == 64.
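
For readers following the thread, the shape of that guard can be sketched in plain Java. This is only an illustration and not the stub generator code: the class and method names are hypothetical, the 2.5 MB threshold is taken from a review comment elsewhere in this thread rather than from the patch itself, and it is assumed that the shifted register in the quoted assembly holds the element count.

```java
// Illustrative sketch only: mirrors the shape of the check discussed in the
// thread, not the actual HotSpot stub generator code.
final class LargeCopyGuard {
    // Hypothetical threshold: the review mentions a check at about 2.5 MB.
    static final long LARGE_THRESHOLD_BYTES = 2_621_440L; // 2.5 * 1024 * 1024

    // elementCount << shift converts an element count to a byte count,
    // where shift is log2 of the element size (0 for byte, 1 for char, ...).
    static long byteLength(long elementCount, int shift) {
        return elementCount << shift;
    }

    // The large-copy path is only eligible when 64-byte vectors are in use
    // (MaxVectorSize == 64) and the copy is at least the threshold in size.
    static boolean useLargeCopy(long elementCount, int shift, int maxVectorSize) {
        return maxVectorSize == 64
                && byteLength(elementCount, shift) >= LARGE_THRESHOLD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(useLargeCopy(4_000_000L, 0, 64)); // large byte[] copy, 64-byte vectors
        System.out.println(useLargeCopy(4_000_000L, 0, 32)); // same copy, 32-byte vectors
        System.out.println(useLargeCopy(100_000L, 0, 64));   // small copy
    }
}
```

With a guard of this shape, a VM running with a smaller MaxVectorSize never reaches the code that asserts on 64-byte vectors.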

-

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1815351213


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-16 Thread Steve Dohrmann
On Thu, 16 Nov 2023 05:43:11 GMT, Jatin Bhateja  wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
>> 
>>> 583: __ shlq(temp2, shift);
>>> 584: __ cmpq(temp2, large_threshold);
>>> 585: __ jcc(Assembler::greaterEqual, L_copy_large);
>> 
>> Hi @steveatgh, can you please share the performance numbers for the other 
>> array copy JMH micros in the following directory: 
>> https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang
>
> I would still ask you to run the benchmarks in the above path; we may see 
> performance dips for sizes just past the special cases because of the 
> additional comparisons.

Here are the results on my Ubuntu laptop running at 3 GHz

// Baseline
Benchmark                                    (length)  (size)  Mode   Cnt      Score      Error  Units
ArrayCopyObject.conjoint_micro                    N/A      31  thrpt   15  77157.933 ± 1977.467  ops/ms
ArrayCopyObject.conjoint_micro                    N/A      63  thrpt   15  58329.157 ± 1667.574  ops/ms
ArrayCopyObject.conjoint_micro                    N/A     127  thrpt   15  49322.065 ± 2332.342  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    2047  thrpt   15  13895.531 ±  239.300  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    4095  thrpt   15   7926.854 ±  201.238  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    8191  thrpt   15   4289.582 ±   31.734  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      31  thrpt   15  74711.699 ± 2463.378  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      63  thrpt   15  65229.586 ± 1329.809  ops/ms
ArrayCopyObject.disjoint_micro                    N/A     127  thrpt   15  54330.794 ± 2372.868  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    2047  thrpt   15   9338.340 ±  132.987  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    4095  thrpt   15   5035.553 ±  109.679  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    8191  thrpt   15   1192.069 ±   10.765  ops/ms
ArrayCopy.arrayCopy                               N/A     N/A  avgt    15      1.356 ±    0.029  ns/op
ArrayCopy.arrayCopyChar                           N/A     N/A  avgt    15      4.368 ±    0.038  ns/op
ArrayCopy.arrayCopyCharNonConst                   N/A     N/A  avgt    15      4.749 ±    0.113  ns/op
ArrayCopy.arrayCopyLocalArray                     N/A     N/A  avgt    15      0.503 ±    0.001  ns/op
ArrayCopy.arrayCopyNonConst                       N/A     N/A  avgt    15      1.955 ±    0.108  ns/op
ArrayCopy.arrayCopyObject                         N/A     N/A  avgt    15     22.403 ±    0.563  ns/op
ArrayCopy.arrayCopyObjectNonConst                 N/A     N/A  avgt    15     25.188 ±    0.484  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward       N/A     N/A  avgt    15     17.785 ±    0.781  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward        N/A     N/A  avgt    15     17.347 ±    0.126  ns/op
ArrayCopy.copyLoop                                N/A     N/A  avgt    15      5.189 ±    0.100  ns/op
ArrayCopy.copyLoopLocalArray                      N/A     N/A  avgt    15      3.685 ±    0.085  ns/op
ArrayCopy.copyLoopNonConst                        N/A     N/A  avgt    15      5.436 ±    0.040  ns/op
ArrayCopyAligned.testByte                           1     N/A  avgt    15      2.366 ±    0.028  ns/op
ArrayCopyAligned.testByte                           3     N/A  avgt    15      2.381 ±    0.063  ns/op
ArrayCopyAligned.testByte                           5     N/A  avgt    15      2.362 ±    0.035  ns/op
ArrayCopyAligned.testByte                          10     N/A  avgt    15      2.364 ±    0.048  ns/op
ArrayCopyAligned.testByte                          20     N/A  avgt    15      2.353 ±    0.026  ns/op
ArrayCopyAligned.testByte                          70     N/A  avgt    15      5.214 ±    0.082  ns/op
ArrayCopyAligned.testByte                         150     N/A  avgt    15      6.081 ±    0.140  ns/op
ArrayCopyAligned.testByte                         300     N/A  avgt    15      9.399 ±    0.262  ns/op
ArrayCopyAligned.testByte                         600     N/A  avgt    15     12.710 ±    0.149  ns/op
ArrayCopyAligned.testByte                        1200     N/A  avgt    15     21.873 ±    0.237  ns/op
ArrayCopyAligned.testChar                           1     N/A  avgt    15      2.349 ±    0.014  ns/op
ArrayCopyAligned.testChar                           3     N/A  avgt    15      2.360 ±    0.041  ns/op
ArrayCopyAligned.testChar                           5     N/A  avgt    15      2.359 ±    0.021  ns/op
ArrayCopyAligned.testChar                          10     N/A  avgt    15      2.369 ±    0.042  ns/op
ArrayCopyAligned.testChar                          20     N/A  avgt    15      5.101 ±    0.080  ns/op
ArrayCopyAligned.testChar                          70     N/A  avgt    15      5.961 ±    0.096

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-16 Thread Steve Dohrmann
> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 
> 
> Baseline data
> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
> --------------------------------------------------------------------------------------
> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
> XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
> XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
> XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
> XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
> XorTest.xor   FOREIGN      LARGE       avgt   30   123115513...

Steve Dohrmann has updated the pull request incrementally with one additional 
commit since the last revision:

  - fix whitespace error

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16575/files
  - new: https://git.openjdk.org/jdk/pull/16575/files/10313c9a..3dbf3d5a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v3]

2023-11-16 Thread Steve Dohrmann

Steve Dohrmann has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains four commits:

 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - - initial commit -- optimize large array cases in
     StubGenerator::generate_disjoint_copy_avx3_masked
   - add src address prefetches
   - switch to non-temporal writes
   - added modified jmh benchmark based on xor benchmark from Maurizio
     Cimadamore

-

Changes: https://git.openjdk.org/jdk/pull/16575/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=02
  Stats: 357 lines in 11 files changed: 357 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v2]

2023-11-16 Thread Steve Dohrmann

Steve Dohrmann has updated the pull request incrementally with one additional 
commit since the last revision:

  - bug fix: only generate / use large copy code if MaxVectorSize == 64

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16575/files
  - new: https://git.openjdk.org/jdk/pull/16575/files/6c0fdf3e..983df17a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=00-01

  Stats: 128 lines in 1 file changed: 43 ins; 32 del; 53 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-16 Thread Steve Dohrmann
On Thu, 16 Nov 2023 09:49:00 GMT, Andrew Haley  wrote:

>> Thanks for the clarification. I agree that the behavior is similar to the 
>> non-NT case; in fact, using NT stores for huge copy operations will avoid 
>> polluting the caches with destination cache line fills.
>
> But won't it also cause performance regressions in the common case where the 
> caller needs to use the destination array?

One component of the included XorTest.xor benchmark reads the bytes from the 
two copied arrays (see line 155 in libjnitest.c). 
The NT stores are only used in the FOREIGN LARGE case, and it shows a net 
speedup of roughly 123 ms -> 104 ms.
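
The pattern described here, copying two arrays and then immediately reading the copies, looks roughly like the following in Java. This is a simplified sketch with hypothetical names, not the XorTest or libjnitest.c code; it only shows that the destination arrays are consumed right after the copy, which is the situation the question above is about.

```java
import java.util.Arrays;

final class CopyThenXor {
    // Copies both inputs, then XORs the copies into out. The copies are read
    // right after they are written, so any cost of the destination not being
    // in cache is part of the work. All three arrays must have equal length.
    static void xor(byte[] a, byte[] b, byte[] out) {
        byte[] aCopy = new byte[a.length];
        byte[] bCopy = new byte[b.length];
        System.arraycopy(a, 0, aCopy, 0, a.length);
        System.arraycopy(b, 0, bCopy, 0, b.length);
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (aCopy[i] ^ bCopy[i]);
        }
    }

    public static void main(String[] args) {
        byte[] a = {0x0F, 0x33, 0x55};
        byte[] b = {0x01, 0x03, 0x05};
        byte[] out = new byte[3];
        xor(a, b, out);
        System.out.println(Arrays.toString(out));
    }
}
```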

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1396050045


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-16 Thread Andrew Haley
On Thu, 16 Nov 2023 05:38:30 GMT, Jatin Bhateja  wrote:

>> The results a concurrent reader sees could be different if the copy is using 
>> nt writes, but if the read of the destination is not synced with the copy 
>> operation, I think the reader would not see consistent state in either case. 
>>  Is it worse with nt writes?
>
> Thanks for the clarification. I agree that the behavior is similar to the 
> non-NT case; in fact, using NT stores for huge copy operations will avoid 
> polluting the caches with destination cache line fills.

But won't it also cause performance regressions in the common case where the 
caller needs to use the destination array?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395431482


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-15 Thread Jatin Bhateja
On Wed, 15 Nov 2023 17:03:38 GMT, Steve Dohrmann  wrote:

>> Do you see any concerns in the multithreaded case, where a writer is busy 
>> copying 256-byte blocks in a loop and a reader tries to access a location 
>> not yet flushed out of the write-combining buffer?
>
> The results a concurrent reader sees could be different if the copy is using 
> nt writes, but if the read of the destination is not synced with the copy 
> operation, I think the reader would not see consistent state in either case.  
> Is it worse with nt writes?

Thanks for the clarification. I agree that the behavior is similar to the 
non-NT case; in fact, using NT stores for huge copy operations will avoid 
polluting the caches with destination cache line fills.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395171526


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-15 Thread Jatin Bhateja
On Tue, 14 Nov 2023 07:59:22 GMT, Jatin Bhateja  wrote:

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
> 
>> 583: __ shlq(temp2, shift);
>> 584: __ cmpq(temp2, large_threshold);
>> 585: __ jcc(Assembler::greaterEqual, L_copy_large);
> 
> Hi @steveatgh, can you please share the performance numbers for the other 
> array copy JMH micros in the following directory: 
> https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang

I would still ask you to run the benchmarks in the above path; we may see 
performance dips for sizes just past the special cases because of the 
additional comparisons.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395174203


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-15 Thread Steve Dohrmann
On Wed, 15 Nov 2023 01:44:56 GMT, Jatin Bhateja  wrote:

>> Thanks, there is a store fence upon completion of the main loop for the 
>> large size code:
>> 
>> ![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)
>
> Do you see any concerns in the multithreaded case, where a writer is busy 
> copying 256-byte blocks in a loop and a reader tries to access a location 
> not yet flushed out of the write-combining buffer?

The results a concurrent reader sees could be different if the copy is using nt 
writes, but if the read of the destination is not synced with the copy 
operation, I think the reader would not see consistent state in either case.  
Is it worse with nt writes?
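
The point that an unsynchronized reader has no consistency guarantee either way can be illustrated at the Java level. In the sketch below (hypothetical names, ordinary System.arraycopy rather than anything specific to this PR), the reader only looks at the destination after Thread.join, which establishes a happens-before edge with the copying thread. Without such an edge the reader could observe a partially copied array regardless of how the stores are issued.

```java
import java.util.Arrays;

final class SyncedCopyReader {
    // Copies src into a fresh array on a separate thread and returns the
    // copy only after that thread has finished.
    static byte[] copyInWriterThread(byte[] src) {
        byte[] dst = new byte[src.length];
        Thread writer = new Thread(() -> System.arraycopy(src, 0, dst, 0, src.length));
        writer.start();
        try {
            // join() gives a happens-before edge from everything the writer
            // did to whatever this thread does next, so dst is fully visible.
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for copy", e);
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = new byte[4 * 1024 * 1024];
        Arrays.fill(src, (byte) 42);
        byte[] dst = copyInWriterThread(src);
        System.out.println(Arrays.equals(src, dst));
    }
}
```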

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1394500454


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 15 Nov 2023 01:17:05 GMT, Steve Dohrmann  wrote:

>> @jatin-bhateja There is a sfence at line 781.
>
> Thanks, there is a store fence upon completion of the main loop for the 
> large size code:
> 
> ![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)

How will it handle the multithreaded case where a writer is busy copying 256 
bytes in a loop and a reader tries to access a location not yet flushed out of 
the write-combining buffer?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393529359


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Steve Dohrmann
On Wed, 15 Nov 2023 00:39:29 GMT, Sandhya Viswanathan wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:
>> 
>>> 1184: __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
>>> 1185: __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
>>> 1186: __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);
>> 
>> These are non-temporal memory moves; to force eviction from the 
>> write-combining buffers we may need to emit additional fences, otherwise a 
>> subsequent read from destination memory may see incorrect values.
>
> @jatin-bhateja There is a sfence at line 781.

Thanks, there is a store fence upon completion of the main loop for the large 
size code:

![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393511087


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Steve Dohrmann
On Tue, 14 Nov 2023 08:00:13 GMT, Jatin Bhateja  wrote:

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
> 
>> 583: __ shlq(temp2, shift);
>> 584: __ cmpq(temp2, large_threshold);
>> 585: __ jcc(Assembler::greaterEqual, L_copy_large);
> 
> I suspect additional checks for 2.5MB array size may hit the performance of 
> other general sizes.

Comparing several runs of the XorTest.copy SMALL (100K) benchmark, baseline 
against PR, I see an average slowdown of 1.7% (7.566 ms/op vs 7.696 ms/op).
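
For anyone who wants to re-derive the percentage, here is a minimal sketch of the arithmetic. The class and method names are hypothetical and only illustrate the calculation; the two inputs are the ms/op figures quoted above.

```java
// Hypothetical helper, not part of the PR: relative slowdown in percent.
final class SlowdownCheck {
    /** Slowdown of the patched build relative to the baseline, in percent. */
    static double slowdownPercent(double baselineMsPerOp, double patchedMsPerOp) {
        return (patchedMsPerOp - baselineMsPerOp) / baselineMsPerOp * 100.0;
    }

    public static void main(String[] args) {
        // 7.566 ms/op (baseline) vs 7.696 ms/op (PR), as reported above.
        System.out.printf("slowdown: %.2f%%%n", slowdownPercent(7.566, 7.696));
    }
}
```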

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393511001


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Sandhya Viswanathan
On Tue, 14 Nov 2023 08:09:28 GMT, Jatin Bhateja  wrote:

>> Below is baseline data collected using a modified version of the 
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
>> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
>> i7-1185G7, which does support AVX512. 
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:
> 
>> 1184: __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, 
>> Assembler::AVX_512bit);
>> 1185: __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, 
>> Assembler::AVX_512bit);
>> 1186: __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, 
>> Assembler::AVX_512bit);
> 
> These are non-temporal memory moves, to force eviction from write combining 
> buffers we may need to emit additional fences, else a subsequent read from 
> destination memory may see incorrect values.

@jatin-bhateja There is a sfence at line 781.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393486384


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann  wrote:

> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:

> 583: __ shlq(temp2, shift);
> 584: __ cmpq(temp2, large_threshold);
> 585: __ jcc(Assembler::greaterEqual, L_copy_large);

Hi @steveatgh, can you please share the performance numbers of the other array 
copy JMH micros in the following directory: 
https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:

> 583: __ shlq(temp2, shift);
> 584: __ cmpq(temp2, large_threshold);
> 585: __ jcc(Assembler::greaterEqual, L_copy_large);

I suspect additional checks for 2.5MB array size may hit the performance of 
other general sizes.
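
To make the cost in question concrete, the three quoted instructions can be read roughly as the Java sketch below. This is only an illustration under assumptions: that `temp2` holds the element count, that `shift` is log2 of the element size, and that "2.5MB" means 2.5 * 1024 * 1024 bytes. The class and method names are hypothetical.

```java
// Hypothetical model of the shlq / cmpq / jcc sequence; not HotSpot code.
final class LargeCopyCheck {
    // Assumption: 2.5MB is taken as 2.5 * 1024 * 1024 = 2,621,440 bytes.
    static final long LARGE_THRESHOLD_BYTES = (long) (2.5 * 1024 * 1024);

    /**
     * Scales the element count to a byte count (shlq), compares it with the
     * threshold (cmpq), and reports whether the large path is taken
     * (jcc greaterEqual).
     */
    static boolean takesLargePath(long elementCount, int shift) {
        long byteCount = elementCount << shift;
        return byteCount >= LARGE_THRESHOLD_BYTES;
    }

    public static void main(String[] args) {
        // 12,800 jlongs = 100K bytes: stays on the existing path.
        System.out.println(takesLargePath(12_800, 3));
        // 327,680 jlongs = exactly 2.5 * 1024 * 1024 bytes: takes the large path.
        System.out.println(takesLargePath(327_680, 3));
    }
}
```

Under this reading, copies below the threshold pay for one shift, one compare and one branch that is not taken.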

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392137605
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392138600


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann  wrote:

> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:

> 1184: __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, 
> Assembler::AVX_512bit);
> 1185: __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, 
> Assembler::AVX_512bit);
> 1186: __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, 
> Assembler::AVX_512bit);

These are non-temporal memory moves, to force eviction from write combining 
buffers we may need to emit additional fences, else a subsequent read from 
destination memory may see incorrect values.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392149191


Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-13 Thread Steve Dohrmann
On Mon, 13 Nov 2023 08:36:44 GMT, Tobias Hartmann  wrote:

>> Below is baseline data collected using a modified version of the 
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
>> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
>> i7-1185G7, which does support AVX512. 
>
> I submitted some quick testing and I'm seeing the following failure with 
> multiple tests:
> 
> 
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error 
> (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201),
>  pid=24136, tid=24139
> #  assert(MaxVectorSize == 64) failed: vector length != 64
> #
> # JRE version:  (22.0) (fastdebug build )
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 
> 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, 
> compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, 
> Register, XMMRegister, KRegister, Register, Register, Register, int, int, 
> bool)+0x366
> 
> Stack: [0x7f0b5e919000,0x7f0b5ea1a000],  sp=0x7f0b5ea17150,  free 
> space=1016k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, 
> Register, XMMRegister, KRegister, Register, Register, Register, int, int, 
> bool)+0x366  (stubGenerator_x86_64_arraycopy.cpp:1201)
> V  [libjvm.so+0x16c0ecd]  
> StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, 
> Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d  
> (stubGenerator_x86_64_arraycopy.cpp:1055)
> V  [libjvm.so+0x16c16c1]  StubGenerator::arraycopy_avx3_large(Register, 
> Register, Register, Register, Register, Register, Register, XMMRegister, 
> XMMRegister, XMMRegister, XMMRegister, int)+0x3f1  
> (stubGenerator_x86_64_arraycopy.cpp:790)
> V  [libjvm.so+0x16c22f0]  
> StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char 
> const*, int, bool, bool, bool)+0xa90  (stubGenerator_x86_64_arraycopy.cpp:728)
> V  [libjvm.so+0x16c4b85]  StubGenerator::generate_disjoint_byte_copy(bool, 
> unsigned char**, char const*)+0x965  (stubGenerator_x86_64_arraycopy.cpp:1277)
> V  [libjvm.so+0x16cb309]  StubGenerator::generate_arraycopy_stubs()+0x29  
> (stubGenerator_x86_64_arraycopy.cpp:88)
> V  [libjvm.so+0x16a1089]  StubGenerator::generate_final_stubs()+0xb9  
> (stubGenerator_x86_64.cpp:4051)
> V  [libjvm.so+0x16a22a5]  StubGenerator_generate(CodeBuffer*, 
> 

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-13 Thread Tobias Hartmann
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann  wrote:

> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 

I submitted some quick testing and I'm seeing the following failure with 
multiple tests:


# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error 
(/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), 
pid=24136, tid=24139
#  assert(MaxVectorSize == 64) failed: vector length != 64
#
# JRE version:  (22.0) (fastdebug build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 
22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, 
compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, 
Register, XMMRegister, KRegister, Register, Register, Register, int, int, 
bool)+0x366

Stack: [0x7f0b5e919000,0x7f0b5ea1a000],  sp=0x7f0b5ea17150,  free 
space=1016k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, Register, 
XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366  
(stubGenerator_x86_64_arraycopy.cpp:1201)
V  [libjvm.so+0x16c0ecd]  
StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, 
Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d  
(stubGenerator_x86_64_arraycopy.cpp:1055)
V  [libjvm.so+0x16c16c1]  StubGenerator::arraycopy_avx3_large(Register, 
Register, Register, Register, Register, Register, Register, XMMRegister, 
XMMRegister, XMMRegister, XMMRegister, int)+0x3f1  
(stubGenerator_x86_64_arraycopy.cpp:790)
V  [libjvm.so+0x16c22f0]  
StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, 
int, bool, bool, bool)+0xa90  (stubGenerator_x86_64_arraycopy.cpp:728)
V  [libjvm.so+0x16c4b85]  StubGenerator::generate_disjoint_byte_copy(bool, 
unsigned char**, char const*)+0x965  (stubGenerator_x86_64_arraycopy.cpp:1277)
V  [libjvm.so+0x16cb309]  StubGenerator::generate_arraycopy_stubs()+0x29  
(stubGenerator_x86_64_arraycopy.cpp:88)
V  [libjvm.so+0x16a1089]  StubGenerator::generate_final_stubs()+0xb9  
(stubGenerator_x86_64.cpp:4051)
V  [libjvm.so+0x16a22a5]  StubGenerator_generate(CodeBuffer*, 
StubCodeGenerator::StubsKind)+0x105  (stubGenerator_x86_64.cpp:4296)
V  [libjvm.so+0x16f349e] 

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-09 Thread Steve Dohrmann
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann  wrote:

> Below is baseline data collected using a modified version of the 
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
> report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake 
> i7-1185G7, which does support AVX512. 

I'm part of the Intel Java team

-

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1802923064


RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-09 Thread Steve Dohrmann
Below is baseline data collected using a modified version of the 
java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug 
report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, 
which does support AVX512. 

Baseline data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
--------------------------------------------------------------------------------------
XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
XorTest.xor   FOREIGN      LARGE       avgt   30   123115513.182 ±  1287935.621  ns/op

The 'copy' benchmark was added to measure the memory copy components of the 
'xor' benchmark, separate from the memory allocation and xor data update 
components.

Profile data for the baseline REGION LARGE case shows two hotspots covering 
about 90% of cycles:

Baseline REGION LARGE (r231)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            63.7%  18,189,000,000        52,464,000,000     0.347
__memcpy_evex_unaligned_erms      28.5%   7,608,000,000         3,459,000,000     2.199

The baseline FOREIGN LARGE case shows three hotspots covering about 90%:

Baseline FOREIGN LARGE (r226)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            46.4%  18,345,000,000        52,476,000,000     0.350
jlong_disjoint_arraycopy_avx3     29.3%  11,124,000,000         1,404,000,000     7.923
Copy::fill_to_memory_atomic       15.3%   5,016,000,000         8,010,000,000     0.626
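
As a reading aid for the CPI Rate column: the values in the tables above are consistent with clockticks divided by instructions retired. The sketch below recomputes them from the reported counts. The class and method names are hypothetical, and it assumes the profiler defines CPI this way.

```java
// Hypothetical helper: cycles per instruction from the profile counters above.
final class CpiRate {
    /** CPI = clockticks / instructions retired (assumed definition). */
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / instructionsRetired;
    }

    public static void main(String[] args) {
        // Counts taken from the baseline FOREIGN LARGE profile.
        System.out.printf("xor_op: %.3f%n",
                cpi(18_345_000_000L, 52_476_000_000L));
        System.out.printf("jlong_disjoint_arraycopy_avx3: %.3f%n",
                cpi(11_124_000_000L, 1_404_000_000L));
        System.out.printf("Copy::fill_to_memory_atomic: %.3f%n",
                cpi(5_016_000_000L, 8_010_000_000L));
    }
}
```

A CPI near 8 for the array copy stub, against roughly 0.35 for xor_op, suggests the stub spends most of its cycles stalled rather than executing instructions.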

This PR optimizes the jlong_disjoint_arraycopy_avx3 code.  The 
Copy::fill_to_memory_atomic hotspot (which I believe is associated with the 
benchmark's per-op off-heap buffer allocation) is not optimized here.  The avx3 
array copy code is optimized by increasing the loop granularity from 192 to 256 
bytes, adding source address prefetches, and using non-temporal writes with a 
store fence.  The optimized code is only used for copies greater than a set 
threshold number of bytes, currently 2.5MB.  This is the size at which the 
optimized code was observed to be faster than the original code.  The profile 
data with optimization is:

Optimized FOREIGN LARGE (r277)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            51.2%  18,153,000,000        52,404,000,000     0.346
jlong_disjoint_arraycopy_avx3     22.4%   7,581,000,000         2,364,000,000     3.207
Copy::fill_to_memory_atomic       16.3%   5,316,000,000         7,917,000,000     0.671
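
For a quick side-by-side of the two FOREIGN LARGE profiles, the sketch below computes the relative drop in clockticks for jlong_disjoint_arraycopy_avx3 from the reported counts. The class and method names are hypothetical. Note that the two profiles come from separate runs (r226 and r277), so the result is indicative only.

```java
// Hypothetical helper: relative reduction between two profile counters.
final class ArraycopyDelta {
    /** Reduction from before to after, in percent of the before value. */
    static double reductionPercent(long before, long after) {
        return (double) (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Clockticks for jlong_disjoint_arraycopy_avx3: baseline vs optimized.
        System.out.printf("clockticks reduction: %.2f%%%n",
                reductionPercent(11_124_000_000L, 7_581_000_000L));
    }
}
```

The same rows show instructions retired rising from 1,404,000,000 to 2,364,000,000 while CPI falls from 7.923 to 3.207, so the optimized stub executes more instructions but stalls less per instruction.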

The optimization brings the cycles for the mem