Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]
On Tue, 21 Nov 2023 21:03:20 GMT, Steve Dohrmann wrote:

>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
>
>  - Merge branch 'master' into memcpy
>  - Updates based on reviewer (sviswa7) comments including
>    - use asserts instead of conditionals in two logically unreachable blocks
>    - remove unused function parmeters
>    - use 64-byte vector path in pre-loop masked write
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/e47cf611...02ad27fa

Correctness and performance testing passed.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1749803030
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]
On Tue, 21 Nov 2023 21:03:20 GMT, Steve Dohrmann wrote:

>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
>
>  - Merge branch 'master' into memcpy
>  - Updates based on reviewer (sviswa7) comments including
>    - use asserts instead of conditionals in two logically unreachable blocks
>    - remove unused function parmeters
>    - use 64-byte vector path in pre-loop masked write
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/e47cf611...02ad27fa

Thanks a lot for taking care of all the review comments. The PR looks good to me now.

-------------

Marked as reviewed by sviswanathan (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1743562444
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
On Tue, 21 Nov 2023 01:14:49 GMT, Sandhya Viswanathan wrote:

>> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>>
>>  - Merge branch 'master' into memcpy
>>  - Update full name
>>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>>  - - remerge upstream master
>>    - remove ::copy test from XorTest
>>  - Merge branch 'master' into memcpy
>>  - - fix whitespace error
>>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>>  - - fix whitespace issues
>>    - fix xor test foreign impl constructor signature
>>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>>    - add src address prefetches
>>    - switch to non-temporal writes
>>    - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:
>
>> 766:     }
>> 767:     __ movq(temp3, temp2);
>> 768:     copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);
>
> The last argument should be "true" or "1" instead of "0" or "false". This is as temp3 (length) could be less than 32 as well. This case is only handled when use64byteVector argument is true.

Thanks, done.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401229608
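The review comment above turns on whether a single masked 64-byte move can cover a length shorter than 32 bytes. The following is only a plain-Java model of that idea, not the HotSpot stub code: the class `MaskedTailCopy` and its methods `laneMask` and `maskedCopy64` are hypothetical names introduced here, and the model assumes one mask bit per byte lane. It shows that a mask with the low `length` bits set writes exactly `length` bytes for any length from 0 to 64.

```java
import java.util.Arrays;

// Hypothetical model of a masked 64-byte copy; not the HotSpot implementation.
public class MaskedTailCopy {

    // Lane mask with the low `length` bits set, for 0 <= length <= 64.
    static long laneMask(int length) {
        if (length < 0 || length > 64) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        // (1L << 64) would wrap around in Java, so 64 is handled separately.
        return length == 64 ? -1L : (1L << length) - 1;
    }

    // Models one masked 64-byte move: only lanes whose mask bit is set are written.
    static void maskedCopy64(byte[] src, int sOff, byte[] dst, int dOff, long mask) {
        for (int lane = 0; lane < 64; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                dst[dOff + lane] = src[sOff + lane];
            }
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        byte[] dst = new byte[64];
        Arrays.fill(src, (byte) 1);

        int length = 20; // a tail shorter than 32 bytes
        maskedCopy64(src, 0, dst, 0, laneMask(length));

        // Each copied byte is 1, so the sum equals the number of bytes written.
        int written = 0;
        for (byte b : dst) {
            written += b;
        }
        System.out.println("bytes written: " + written);
    }
}
```

With `length = 20` the model writes 20 bytes and leaves the remaining 44 lanes of `dst` untouched. How the non-64-byte path in the stub behaves for such lengths is not modelled here.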
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v6]
> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>
> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>
> Baseline data
> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
> --
> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
> XorTest.x...

Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:

 - Merge branch 'master' into memcpy
 - Updates based on reviewer (sviswa7) comments including
   - use asserts instead of conditionals in two logically unreachable blocks
   - remove unused function parmeters
   - use 64-byte vector path in pre-loop masked write
 - Merge branch 'master' into memcpy
 - Update full name
   Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
 - - remerge upstream master
   - remove ::copy test from XorTest
 - Merge branch 'master' into memcpy
 - - fix whitespace error
 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - ... and 1 more: https://git.openjdk.org/jdk/compare/e47cf611...02ad27fa

-------------

Changes: https://git.openjdk.org/jdk/pull/16575/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=05
  Stats: 259 lines in 5 files changed: 259 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
On Tue, 21 Nov 2023 01:10:40 GMT, Sandhya Viswanathan wrote:

>> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>>
>>  - Merge branch 'master' into memcpy
>>  - Update full name
>>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>>  - - remerge upstream master
>>    - remove ::copy test from XorTest
>>  - Merge branch 'master' into memcpy
>>  - - fix whitespace error
>>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>>  - - fix whitespace issues
>>    - fix xor test foreign impl constructor signature
>>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>>    - add src address prefetches
>>    - switch to non-temporal writes
>>    - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:
>
>> 751:   Label L_pre_main_post_large;
>> 752:
>> 753:   if (MaxVectorSize == 64) {
>
> This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:
>
>> 775:
>> 776:     __ BIND(L_main_pre_loop_large);
>> 777:     __ subq(temp1, loop_size[shift]); // whay is this here
>
> Spurious comment.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:
>
>> 795:     __ jcc(Assembler::lessEqual, L_exit_large);
>> 796:     arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
>> 797:                                      temp4, temp3, L_entry_large, L_exit_large);
>
> When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need to be copied so we don't need to go back to L_entry_large.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:
>
>> 1056: };
>> 1057:
>> 1058:   if (MaxVectorSize == 64) {
>
> This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64.

Thanks, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1175:
>
>> 1173: void StubGenerator::copy256_avx3(Register dst, Register src, Register index, XMMRegister xmm1,
>> 1174:                                  XMMRegister xmm2, XMMRegister xmm3, XMMRegister xmm4,
>> 1175:                                  bool conjoint, int shift, int offset) {
>
> The conjoint parameter is not used so could be removed from this function.

Thanks, done.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401179711
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401178991
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401180641
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401180258
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1401177983
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann wrote:

>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>    - add src address prefetches
>    - switch to non-temporal writes
>    - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:

> 751:   Label L_pre_main_post_large;
> 752:
> 753:   if (MaxVectorSize == 64) {

This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:

> 766:     }
> 767:     __ movq(temp3, temp2);
> 768:     copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);

The last argument should be "true" or "1" instead of "0" or "false". This is as temp3 (length) could be less than 32 as well. This case is only handled when use64byteVector argument is true.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:

> 775:
> 776:     __ BIND(L_main_pre_loop_large);
> 777:     __ subq(temp1, loop_size[shift]); // whay is this here

Spurious comment.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:

> 795:     __ jcc(Assembler::lessEqual, L_exit_large);
> 796:     arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
> 797:                                      temp4, temp3, L_entry_large, L_exit_large);

When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need to be copied so we don't need to go back to L_entry_large.

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Mon, 20 Nov 2023 07:27:12 GMT, Tobias Hartmann wrote:

>> Steve Dohrmann has updated the pull request incrementally with one additional commit since the last revision:
>>
>>  - fix whitespace error
>
> test/micro/org/openjdk/bench/java/lang/foreign/xor/XorOp.java line 10:
>
>> 8:     void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
>> 9: =======
>> 10: >>>>>>> 9727f4bdddc071e6f59806087339f345405ab004
>
> You have multiple merge conflicts in the micro benchmark files.

Sorry, not sure how I missed the conflicts. They should be resolved now. Thanks!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399887729
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann wrote:

>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>    - add src address prefetches
>    - switch to non-temporal writes
>    - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the introductory comment above used XorTest code from PR commit 7cc272e86279, which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated, and XorTest::copy is no longer needed and has been removed from this pull request.

Performance can be evaluated using both the new XorTest and a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this PR. Results from these two benchmarks are shown below.

In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active in sizes 5MB and 10MB.

// Baseline
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte        10  avgt   15    2434.515 ±    11.526  ns/op
ArrayCopyAlignedLarge.testByte       100  avgt   15   51211.235 ±   539.355  ns/op
ArrayCopyAlignedLarge.testByte       200  avgt   15  104837.012 ±  1338.823  ns/op
ArrayCopyAlignedLarge.testByte       500  avgt   15  293357.745 ±  3233.745  ns/op
ArrayCopyAlignedLarge.testByte      1000  avgt   15  957068.292 ± 15509.983  ns/op

// PR
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte        10  avgt   15    2443.354 ±    17.996  ns/op
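The only row present in both the Baseline and PR tables above is the first one (printed with length 10), which lies outside the sizes where the PR code is active. As a small worked check, the sketch below compares those two scores. It assumes that each "Score ± Error" pair can be treated as an interval; the class `ScoreCompare` and its methods are hypothetical helpers, not part of the PR or of JMH.

```java
// Hypothetical helper for comparing two "Score ± Error" benchmark rows.
public class ScoreCompare {

    // Relative change of candidate versus baseline, in percent.
    static double relativeChangePercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    // True if [scoreA - errA, scoreA + errA] and [scoreB - errB, scoreB + errB] overlap.
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        return (scoreA - errA) <= (scoreB + errB) && (scoreB - errB) <= (scoreA + errA);
    }

    public static void main(String[] args) {
        // First testByte row from the Baseline and PR tables (ns/op).
        double baseScore = 2434.515, baseErr = 11.526;
        double prScore = 2443.354, prErr = 17.996;

        System.out.printf("relative change: %.3f%%%n",
                relativeChangePercent(baseScore, prScore));
        System.out.println("intervals overlap: "
                + intervalsOverlap(baseScore, baseErr, prScore, prErr));
    }
}
```

The difference is about 0.36% and the two intervals overlap, which is what one would expect for a size the change does not touch.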
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>
> Baseline data
> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
> --
> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
> XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
> XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
> XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
> XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
> XorTest.xor   FOREIGN      LARGE       avgt   30  123115513...

Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:

 - Merge branch 'master' into memcpy
 - Update full name
   Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
 - - remerge upstream master
   - remove ::copy test from XorTest
 - Merge branch 'master' into memcpy
 - - fix whitespace error
 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
   - add src address prefetches
   - switch to non-temporal writes
   - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

-------------

Changes: https://git.openjdk.org/jdk/pull/16575/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=04
  Stats: 264 lines in 5 files changed: 264 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann wrote:

>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
>> XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
>> XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
>> XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
>> XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
>> Xo...
>
> Steve Dohrmann has updated the pull request incrementally with one additional commit since the last revision:
>
>  - fix whitespace error

Hi @steveatgh, the x86 code changes look good to me.

-------------

Marked as reviewed by jbhateja (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16575#pullrequestreview-1740295564
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann wrote:

>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --
>> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
>> XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
>> XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
>> XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
>> XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
>> Xo...
>
> Steve Dohrmann has updated the pull request incrementally with one additional commit since the last revision:
>
>  - fix whitespace error

test/micro/org/openjdk/bench/java/lang/foreign/xor/XorOp.java line 10:

> 8:     void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
> 9: =======
> 10: >>>>>>> 9727f4bdddc071e6f59806087339f345405ab004

You have multiple merge conflicts in the micro benchmark files.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1398749880
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann wrote:

>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> ...
>
> Steve Dohrmann has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   - fix whitespace error

Thanks, I re-submitted testing.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1818334820
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Mon, 13 Nov 2023 08:36:44 GMT, Tobias Hartmann wrote:

>> Steve Dohrmann has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   - fix whitespace error
>
> I submitted some quick testing and I'm seeing the following failure with
> multiple tests:
>
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), pid=24136, tid=24139
> #  assert(MaxVectorSize == 64) failed: vector length != 64
> #
> # JRE version:  (22.0) (fastdebug build )
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366
>
> Stack: [0x7f0b5e919000,0x7f0b5ea1a000],  sp=0x7f0b5ea17150,  free space=1016k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366  (stubGenerator_x86_64_arraycopy.cpp:1201)
> V  [libjvm.so+0x16c0ecd]  StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d  (stubGenerator_x86_64_arraycopy.cpp:1055)
> V  [libjvm.so+0x16c16c1]  StubGenerator::arraycopy_avx3_large(Register, Register, Register, Register, Register, Register, Register, XMMRegister, XMMRegister, XMMRegister, XMMRegister, int)+0x3f1  (stubGenerator_x86_64_arraycopy.cpp:790)
> V  [libjvm.so+0x16c22f0]  StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, int, bool, bool, bool)+0xa90  (stubGenerator_x86_64_arraycopy.cpp:728)
> V  [libjvm.so+0x16c4b85]  StubGenerator::generate_disjoint_byte_copy(bool, unsigned char**, char const*)+0x965  (stubGenerator_x86_64_arraycopy.cpp:1277)
> V  [libjvm.so+0x16cb309]  StubGenerator::generate_arraycopy_stubs()+0x29  (stubGenerator_x86_64_arraycopy.cpp:88)
> V  [libjvm.so+0x16a1089]  StubGenerator::generate_final_stubs()+0xb9  (stubGenerator_x86_64.cpp:4051)
> V  [libjvm.so+0x16a22a5]  StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0x105  (stubGenerator_x86_64.cpp:4296)
> V  [libjvm.so+0x16f349e]  initialize_stubs(StubCodeGenerator::StubsKind, int, int, char const*, char const*, char const*)+0x13e  (stubRoutines.cpp:241)
> V  [libjvm.so+0x16f500d]  final_stubs_init()+0x3d  (stubRoutines.cpp:288)
> V  [libjvm.so+0xe30c59]...

@TobiHartmann I updated the PR with a fix for the assert you saw. The large copy code is now generated and used only when MaxVectorSize == 64.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1815351213
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
On Thu, 16 Nov 2023 05:43:11 GMT, Jatin Bhateja wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
>>
>>> 583:         __ shlq(temp2, shift);
>>> 584:         __ cmpq(temp2, large_threshold);
>>> 585:         __ jcc(Assembler::greaterEqual, L_copy_large);
>>
>> Hi @steveatgh , Can you please share the performance number of other Array
>> copy JMH micros in following directory
>> https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang
>
> I will still request you to run BM in above path, we may see performance dips
> for sizes after special cases due to additional comparisons.

Here are the results on my Ubuntu laptop running at 3 GHz

// Baseline
Benchmark                                    (length)  (size)  Mode   Cnt      Score      Error  Units
ArrayCopyObject.conjoint_micro                    N/A      31  thrpt   15  77157.933 ± 1977.467  ops/ms
ArrayCopyObject.conjoint_micro                    N/A      63  thrpt   15  58329.157 ± 1667.574  ops/ms
ArrayCopyObject.conjoint_micro                    N/A     127  thrpt   15  49322.065 ± 2332.342  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    2047  thrpt   15  13895.531 ±  239.300  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    4095  thrpt   15   7926.854 ±  201.238  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    8191  thrpt   15   4289.582 ±   31.734  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      31  thrpt   15  74711.699 ± 2463.378  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      63  thrpt   15  65229.586 ± 1329.809  ops/ms
ArrayCopyObject.disjoint_micro                    N/A     127  thrpt   15  54330.794 ± 2372.868  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    2047  thrpt   15   9338.340 ±  132.987  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    4095  thrpt   15   5035.553 ±  109.679  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    8191  thrpt   15   1192.069 ±   10.765  ops/ms
ArrayCopy.arrayCopy                               N/A     N/A  avgt    15      1.356 ±    0.029  ns/op
ArrayCopy.arrayCopyChar                           N/A     N/A  avgt    15      4.368 ±    0.038  ns/op
ArrayCopy.arrayCopyCharNonConst                   N/A     N/A  avgt    15      4.749 ±    0.113  ns/op
ArrayCopy.arrayCopyLocalArray                     N/A     N/A  avgt    15      0.503 ±    0.001  ns/op
ArrayCopy.arrayCopyNonConst                       N/A     N/A  avgt    15      1.955 ±    0.108  ns/op
ArrayCopy.arrayCopyObject                         N/A     N/A  avgt    15     22.403 ±    0.563  ns/op
ArrayCopy.arrayCopyObjectNonConst                 N/A     N/A  avgt    15     25.188 ±    0.484  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward       N/A     N/A  avgt    15     17.785 ±    0.781  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward        N/A     N/A  avgt    15     17.347 ±    0.126  ns/op
ArrayCopy.copyLoop                                N/A     N/A  avgt    15      5.189 ±    0.100  ns/op
ArrayCopy.copyLoopLocalArray                      N/A     N/A  avgt    15      3.685 ±    0.085  ns/op
ArrayCopy.copyLoopNonConst                        N/A     N/A  avgt    15      5.436 ±    0.040  ns/op
ArrayCopyAligned.testByte                           1     N/A  avgt    15      2.366 ±    0.028  ns/op
ArrayCopyAligned.testByte                           3     N/A  avgt    15      2.381 ±    0.063  ns/op
ArrayCopyAligned.testByte                           5     N/A  avgt    15      2.362 ±    0.035  ns/op
ArrayCopyAligned.testByte                          10     N/A  avgt    15      2.364 ±    0.048  ns/op
ArrayCopyAligned.testByte                          20     N/A  avgt    15      2.353 ±    0.026  ns/op
ArrayCopyAligned.testByte                          70     N/A  avgt    15      5.214 ±    0.082  ns/op
ArrayCopyAligned.testByte                         150     N/A  avgt    15      6.081 ±    0.140  ns/op
ArrayCopyAligned.testByte                         300     N/A  avgt    15      9.399 ±    0.262  ns/op
ArrayCopyAligned.testByte                         600     N/A  avgt    15     12.710 ±    0.149  ns/op
ArrayCopyAligned.testByte                        1200     N/A  avgt    15     21.873 ±    0.237  ns/op
ArrayCopyAligned.testChar                           1     N/A  avgt    15      2.349 ±    0.014  ns/op
ArrayCopyAligned.testChar                           3     N/A  avgt    15      2.360 ±    0.041  ns/op
ArrayCopyAligned.testChar                           5     N/A  avgt    15      2.359 ±    0.021  ns/op
ArrayCopyAligned.testChar                          10     N/A  avgt    15      2.369 ±    0.042  ns/op
ArrayCopyAligned.testChar                          20     N/A  avgt    15      5.101 ±    0.080  ns/op
ArrayCopyAligned.testChar                          70     N/A  avgt    15      5.961 ±    0.096
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]
> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
> --------------------------------------------------------------------------------------
> XorTest.copy  ELEMENTS     SMALL       avgt   30   584737355.767 ± 60414308.540  ns/op
> XorTest.copy  ELEMENTS     MEDIUM      avgt   30   272248995.683 ±  2924954.498  ns/op
> XorTest.copy  ELEMENTS     LARGE       avgt   30  1019200210.900 ± 28334453.652  ns/op
> XorTest.copy  REGION       SMALL       avgt   30     7399944.164 ±   216821.819  ns/op
> XorTest.copy  REGION       MEDIUM      avgt   30    20591454.558 ±   147398.572  ns/op
> XorTest.copy  REGION       LARGE       avgt   30    21649266.051 ±   179263.875  ns/op
> XorTest.copy  CRITICAL     SMALL       avgt   30       51079.357 ±      542.482  ns/op
> XorTest.copy  CRITICAL     MEDIUM      avgt   30        2496.961 ±       11.375  ns/op
> XorTest.copy  CRITICAL     LARGE       avgt   30         515.454 ±        5.831  ns/op
> XorTest.copy  FOREIGN      SMALL       avgt   30     7558432.075 ±    79489.276  ns/op
> XorTest.copy  FOREIGN      MEDIUM      avgt   30    19730666.341 ±   500505.099  ns/op
> XorTest.copy  FOREIGN      LARGE       avgt   30    34616758.085 ±   340300.726  ns/op
> XorTest.xor   ELEMENTS     SMALL       avgt   30   219832692.489 ±  2329417.319  ns/op
> XorTest.xor   ELEMENTS     MEDIUM      avgt   30   505138197.167 ±  3818334.424  ns/op
> XorTest.xor   ELEMENTS     LARGE       avgt   30  1189608474.667 ±  5877981.900  ns/op
> XorTest.xor   REGION       SMALL       avgt   30    64093872.804 ±   599704.491  ns/op
> XorTest.xor   REGION       MEDIUM      avgt   30    81544576.454 ±  1406342.118  ns/op
> XorTest.xor   REGION       LARGE       avgt   30    90091424.883 ±   775577.613  ns/op
> XorTest.xor   CRITICAL     SMALL       avgt   30    57231375.744 ±   438223.342  ns/op
> XorTest.xor   CRITICAL     MEDIUM      avgt   30    58583884.930 ±   375355.215  ns/op
> XorTest.xor   CRITICAL     LARGE       avgt   30    60644832.949 ±   588120.738  ns/op
> XorTest.xor   FOREIGN      SMALL       avgt   30    73868679.405 ±   819965.524  ns/op
> XorTest.xor   FOREIGN      MEDIUM      avgt   30    88156275.944 ±  1051257.152  ns/op
> XorTest.xor   FOREIGN      LARGE       avgt   30  123115513...

Steve Dohrmann has updated the pull request incrementally with one additional commit since the last revision:

  - fix whitespace error

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16575/files
  - new: https://git.openjdk.org/jdk/pull/16575/files/10313c9a..3dbf3d5a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v3]
> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> ...

Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:

 - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
 - - bug fix: only generate / use large copy code if MaxVectorSize == 64
 - - fix whitespace issues
   - fix xor test foreign impl constructor signature
 - - initial commit
   -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
   - add src address prefetches
   - switch to non-temporal writes
   - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

-------------

Changes: https://git.openjdk.org/jdk/pull/16575/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=02
  Stats: 357 lines in 11 files changed: 357 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v2]
> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> ...

Steve Dohrmann has updated the pull request incrementally with one additional commit since the last revision:

  - bug fix: only generate / use large copy code if MaxVectorSize == 64

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16575/files
  - new: https://git.openjdk.org/jdk/pull/16575/files/6c0fdf3e..983df17a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=00-01

  Stats: 128 lines in 1 file changed: 43 ins; 32 del; 53 mod
  Patch: https://git.openjdk.org/jdk/pull/16575.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575

PR: https://git.openjdk.org/jdk/pull/16575
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Thu, 16 Nov 2023 09:49:00 GMT, Andrew Haley wrote:

>> Thanks for the clarification, agree behavior is similar to non-NT case, in
>> fact using NT for huge copy operations will prevent polluting caches due to
>> destination cache line fills.
>
> But won't it also cause performance regressions in the common case where the
> caller needs to use the destination array?

One component of the included XorTest.xor benchmark reads the bytes from the two copied arrays; see line 155 in libjnitest.c. The NT stores are only used in the FOREIGN LARGE case, and it shows a net speedup: ~123 ms -> 104 ms.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1396050045
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Thu, 16 Nov 2023 05:38:30 GMT, Jatin Bhateja wrote:

>> The results a concurrent reader sees could be different if the copy is using
>> nt writes, but if the read of the destination is not synced with the copy
>> operation, I think the reader would not see consistent state in either case.
>> Is it worse with nt writes?
>
> Thanks for the clarification, agree behavior is similar to non-NT case, in
> fact using NT for huge copy operations will prevent polluting caches due to
> destination cache line fills.

But won't it also cause performance regressions in the common case where the caller needs to use the destination array?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395431482
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 15 Nov 2023 17:03:38 GMT, Steve Dohrmann wrote:

>> Do you see any concerns while handling multithreaded case where writer is
>> busy copying 256 bytes block in loop and reader try to access a location
>> still not flushed out of write combining buffer.
>
> The results a concurrent reader sees could be different if the copy is using
> nt writes, but if the read of the destination is not synced with the copy
> operation, I think the reader would not see consistent state in either case.
> Is it worse with nt writes?

Thanks for the clarification. I agree the behavior is similar to the non-NT case; in fact, using NT stores for huge copy operations will prevent polluting the caches with destination cache line fills.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395171526
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Tue, 14 Nov 2023 07:59:22 GMT, Jatin Bhateja wrote:

>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> ...
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
>
>> 583:         __ shlq(temp2, shift);
>> 584:         __ cmpq(temp2, large_threshold);
>> 585:         __ jcc(Assembler::greaterEqual, L_copy_large);
>
> Hi @steveatgh , Can you please share the performance number of other Array
> copy JMH micros in following directory
> https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang

I would still request that you run the benchmarks in the above path; we may see performance dips for sizes after the special cases due to the additional comparisons.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1395174203
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 15 Nov 2023 01:44:56 GMT, Jatin Bhateja wrote:

>> Thanks, there is a store fence upon completion of the main loop for the
>> large size code:
>>
>> ![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)
>
> Do you see any concerns while handling multithreaded case where writer is
> busy copying 256 bytes block in loop and reader try to access a location
> still not flushed out of write combining buffer.

The results a concurrent reader sees could be different if the copy is using NT writes, but if the read of the destination is not synchronized with the copy operation, I think the reader would not see a consistent state in either case. Is it worse with NT writes?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1394500454
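The point about unsynchronized readers can be illustrated at the Java level with a small sketch. This is not code from the PR and it does not exercise the non-temporal store path directly; the class name `PublishAfterCopy`, the method `copyThenRead`, and the choice of `CountDownLatch` are assumptions made purely for illustration. It shows a reader that waits for an explicit signal from the copying thread, which is the kind of synchronization a reader would need regardless of how the copy stub writes memory.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration, not part of the PR: a reader that waits for the
// copy to complete before touching the destination array.
final class PublishAfterCopy {

    static byte[] copyThenRead(byte[] src) {
        byte[] dst = new byte[src.length];
        byte[][] seen = new byte[1][];            // slot for the reader's snapshot
        CountDownLatch copied = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            System.arraycopy(src, 0, dst, 0, src.length);
            copied.countDown();                   // signal: the copy has finished
        });
        Thread reader = new Thread(() -> {
            try {
                copied.await();                   // ordered after countDown()
                seen[0] = dst.clone();            // safe to read dst now
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();                        // makes seen[0] visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        byte[] src = new byte[4 * 1024 * 1024];
        Arrays.fill(src, (byte) 0x5A);
        System.out.println(Arrays.equals(src, copyThenRead(src)));
    }
}
```

Without the latch (or some other happens-before edge), the reader could observe a partially copied destination whether or not NT stores are used, which is the argument made in the reply above.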
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 15 Nov 2023 01:17:05 GMT, Steve Dohrmann wrote:

>> @jatin-bhateja There is a sfence at line 781.
>
> Thanks, there is a store fence upon completion of the main loop for the
> large size code:
>
> ![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)

How will it handle the multithreaded case where the writer is busy copying 256 bytes in a loop and a reader tries to access a location still not flushed out of the write-combining buffer?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393529359
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 15 Nov 2023 00:39:29 GMT, Sandhya Viswanathan wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:
>>
>>> 1184:   __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
>>> 1185:   __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
>>> 1186:   __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);
>>
>> These are non-temporal memory moves, to force eviction from write combining
>> buffers we may need to emit additional fences, else a subsequent read from
>> destination memory may see incorrect values.
>
> @jatin-bhateja There is a sfence at line 781.

Thanks, there is a store fence upon completion of the main loop for the large size code:

![image](https://github.com/openjdk/jdk/assets/3858882/3bcea3c6-3bda-458c-aa7c-29ed6010cde2)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393511087
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Tue, 14 Nov 2023 08:00:13 GMT, Jatin Bhateja wrote:

>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> ...
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:
>
>> 583:         __ shlq(temp2, shift);
>> 584:         __ cmpq(temp2, large_threshold);
>> 585:         __ jcc(Assembler::greaterEqual, L_copy_large);
>
> I suspect additional checks for 2.5MB array size may hit the performance of
> other general sizes.

Comparing several runs of the XorTest.copy SMALL (100K) benchmark, baseline against PR, I see an average slowdown of 1.7% (7.566 ms/op vs 7.696 ms/op).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393511001
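For readers who want to reproduce the arithmetic behind the quoted slowdown, here is a minimal sketch. The class and method names (`RelativeChange`, `percentChange`) are hypothetical; only the two ms/op figures come from the message above.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: relative change between two
// average-time measurements (lower is better).
final class RelativeChange {

    /** Percent change from baseline to candidate; positive means the candidate is slower. */
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double baselineMsPerOp = 7.566; // baseline figure quoted in the message
        double patchedMsPerOp = 7.696;  // PR figure quoted in the message
        System.out.println(String.format(Locale.ROOT, "slowdown: %.1f%%",
                percentChange(baselineMsPerOp, patchedMsPerOp)));
    }
}
```

With the quoted figures, (7.696 - 7.566) / 7.566 is roughly 0.017, which matches the 1.7% reported.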
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Tue, 14 Nov 2023 08:09:28 GMT, Jatin Bhateja wrote:

>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> ...
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:
>
>> 1184:   __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
>> 1185:   __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
>> 1186:   __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);
>
> These are non-temporal memory moves, to force eviction from write combining
> buffers we may need to emit additional fences, else a subsequent read from
> destination memory may see incorrect values.

@jatin-bhateja There is a sfence at line 781.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1393486384
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:

> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> [...]

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:

> 583: __ shlq(temp2, shift);
> 584: __ cmpq(temp2, large_threshold);
> 585: __ jcc(Assembler::greaterEqual, L_copy_large);

Hi @steveatgh,
Can you please share the performance numbers for the other array copy JMH micros in the following directory?
https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 585:

> 583: __ shlq(temp2, shift);
> 584: __ cmpq(temp2, large_threshold);
> 585: __ jcc(Assembler::greaterEqual, L_copy_large);

I suspect the additional checks for the 2.5MB array size may hurt the performance of other, more general sizes.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392137605
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392138600
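The check quoted above scales the element count to a byte count and branches to the large-copy path when it reaches the threshold. A minimal C++ sketch of that dispatch follows, for illustration only: `CopyPath`, `select_copy_path`, and `kLargeThresholdBytes` are hypothetical names, not HotSpot code, and the value 2.5 * 1024 * 1024 is an assumed reading of the "2.5MB" mentioned in the review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; none of these exist in HotSpot.
enum class CopyPath { Regular, Large };

// Assumption: "2.5MB" means 2.5 * 1024 * 1024 bytes.
constexpr std::int64_t kLargeThresholdBytes = 2621440;

// Mirrors the quoted shlq / cmpq / jcc(greaterEqual) sequence:
// shift the element count left by log2(element size) to get bytes,
// then do a signed comparison against the threshold.
inline CopyPath select_copy_path(std::int64_t element_count, int shift,
                                 std::int64_t large_threshold = kLargeThresholdBytes) {
    assert(shift >= 0 && shift <= 3);  // byte, short, int, long elements
    const std::int64_t byte_count = element_count << shift;
    return byte_count >= large_threshold ? CopyPath::Large : CopyPath::Regular;
}
```

Every copy pays for the shift, compare, and branch regardless of size, which is the cost the review comment asks to have measured on the other array copy micros.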
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:

> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> [...]

src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1186:

> 1184: __ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
> 1185: __ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
> 1186: __ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);

These are non-temporal memory moves. To force eviction from the write-combining buffers we may need to emit additional fences; otherwise a subsequent read from destination memory may see incorrect values.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1392149191
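To make the fence concern concrete, here is a small C++ sketch of a copy loop that uses non-temporal stores and then issues a store fence before returning. It is an illustration under stated assumptions, not the PR's stub code: `stream_copy` is a hypothetical helper, and it substitutes the 128-bit SSE2 intrinsic `_mm_stream_si128` for the 512-bit `evmovntdquq` in the quoted lines so that it builds on any x86-64 compiler. On other targets it simply falls back to `memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAVE_STREAMING_STORES 1
#endif

// Hypothetical helper: copy `bytes` bytes from src to dst, using
// non-temporal (streaming) stores when dst is 16-byte aligned.
inline void stream_copy(void* dst, const void* src, std::size_t bytes) {
    assert(bytes == 0 || (dst != nullptr && src != nullptr));
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    std::size_t i = 0;
#ifdef SKETCH_HAVE_STREAMING_STORES
    if (reinterpret_cast<std::uintptr_t>(d) % 16 == 0) {
        for (; i + 16 <= bytes; i += 16) {
            // Unaligned load from the source, streaming store to the destination.
            const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
            _mm_stream_si128(reinterpret_cast<__m128i*>(d + i), v);
        }
        // Streaming stores are weakly ordered; the fence orders them
        // before any later stores issued by this thread.
        _mm_sfence();
    }
#endif
    // Tail bytes, or the whole copy when the streaming path was not taken.
    std::memcpy(d + i, s + i, bytes - i);
}
```

The single `_mm_sfence()` after the loop corresponds to the "store fence" that the PR description pairs with its non-temporal writes.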
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Mon, 13 Nov 2023 08:36:44 GMT, Tobias Hartmann wrote:

>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> [...]
>
> I submitted some quick testing and I'm seeing the following failure with
> multiple tests:
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), pid=24136, tid=24139
> # assert(MaxVectorSize == 64) failed: vector length != 64
> #
> # JRE version: (22.0) (fastdebug build )
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366
>
> Stack: [0x7f0b5e919000,0x7f0b5ea1a000], sp=0x7f0b5ea17150, free space=1016k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366 (stubGenerator_x86_64_arraycopy.cpp:1201)
> V [libjvm.so+0x16c0ecd] StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d (stubGenerator_x86_64_arraycopy.cpp:1055)
> V [libjvm.so+0x16c16c1] StubGenerator::arraycopy_avx3_large(Register, Register, Register, Register, Register, Register, Register, XMMRegister, XMMRegister, XMMRegister, XMMRegister, int)+0x3f1 (stubGenerator_x86_64_arraycopy.cpp:790)
> V [libjvm.so+0x16c22f0] StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, int, bool, bool, bool)+0xa90 (stubGenerator_x86_64_arraycopy.cpp:728)
> V [libjvm.so+0x16c4b85] StubGenerator::generate_disjoint_byte_copy(bool, unsigned char**, char const*)+0x965 (stubGenerator_x86_64_arraycopy.cpp:1277)
> V [libjvm.so+0x16cb309] StubGenerator::generate_arraycopy_stubs()+0x29 (stubGenerator_x86_64_arraycopy.cpp:88)
> V [libjvm.so+0x16a1089] StubGenerator::generate_final_stubs()+0xb9 (stubGenerator_x86_64.cpp:4051)
> V [libjvm.so+0x16a22a5] StubGenerator_generate(CodeBuffer*,
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:

> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> [...]

I submitted some quick testing and I'm seeing the following failure with multiple tests:

# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), pid=24136, tid=24139
# assert(MaxVectorSize == 64) failed: vector length != 64
#
# JRE version: (22.0) (fastdebug build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366

Stack: [0x7f0b5e919000,0x7f0b5ea1a000], sp=0x7f0b5ea17150, free space=1016k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366 (stubGenerator_x86_64_arraycopy.cpp:1201)
V [libjvm.so+0x16c0ecd] StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d (stubGenerator_x86_64_arraycopy.cpp:1055)
V [libjvm.so+0x16c16c1] StubGenerator::arraycopy_avx3_large(Register, Register, Register, Register, Register, Register, Register, XMMRegister, XMMRegister, XMMRegister, XMMRegister, int)+0x3f1 (stubGenerator_x86_64_arraycopy.cpp:790)
V [libjvm.so+0x16c22f0] StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, int, bool, bool, bool)+0xa90 (stubGenerator_x86_64_arraycopy.cpp:728)
V [libjvm.so+0x16c4b85] StubGenerator::generate_disjoint_byte_copy(bool, unsigned char**, char const*)+0x965 (stubGenerator_x86_64_arraycopy.cpp:1277)
V [libjvm.so+0x16cb309] StubGenerator::generate_arraycopy_stubs()+0x29 (stubGenerator_x86_64_arraycopy.cpp:88)
V [libjvm.so+0x16a1089] StubGenerator::generate_final_stubs()+0xb9 (stubGenerator_x86_64.cpp:4051)
V [libjvm.so+0x16a22a5] StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0x105 (stubGenerator_x86_64.cpp:4296)
V [libjvm.so+0x16f349e]
Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:

> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which does support AVX512.
>
> Baseline data
> [...]

I'm part of the Intel Java team

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1802923064
RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.

Baseline data

Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
---------------------------------------------------------------------------------------
XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
XorTest.xor      CRITICAL       SMALL  avgt   30    57231375.744 ±   438223.342  ns/op
XorTest.xor      CRITICAL      MEDIUM  avgt   30    58583884.930 ±   375355.215  ns/op
XorTest.xor      CRITICAL       LARGE  avgt   30    60644832.949 ±   588120.738  ns/op
XorTest.xor       FOREIGN       SMALL  avgt   30    73868679.405 ±   819965.524  ns/op
XorTest.xor       FOREIGN      MEDIUM  avgt   30    88156275.944 ±  1051257.152  ns/op
XorTest.xor       FOREIGN       LARGE  avgt   30   123115513.182 ±  1287935.621  ns/op

The 'copy' benchmark was added to measure the memory copy components of the 'xor' benchmark, separate from the memory allocation and xor data update components.

Profile data for the baseline REGION LARGE case shows two hotspots covering about 90% of cycles:

Baseline REGION LARGE (r231)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            63.7%  18,189,000,000        52,464,000,000     0.347
__memcpy_evex_unaligned_erms      28.5%   7,608,000,000         3,459,000,000     2.199

The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:

Baseline FOREIGN LARGE (r226)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            46.4%  18,345,000,000        52,476,000,000     0.350
jlong_disjoint_arraycopy_avx3     29.3%  11,124,000,000         1,404,000,000     7.923
Copy::fill_to_memory_atomic       15.3%   5,016,000,000         8,010,000,000     0.626

This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The Copy::fill_to_memory_atomic hotspot (which I believe is associated with the benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3 array copy code is optimized by increasing the loop granularity from 192 to 256 bytes, adding source address prefetches, and using non-temporal writes with a store fence. The optimized code is only used for copies greater than a set threshold number of bytes, currently 2.5MB. This is the size at which the optimized code was observed to be faster than the original code.

The profile data with optimization is:

Optimized FOREIGN LARGE (r277)
Function                       CPU Time      Clockticks  Instructions Retired  CPI Rate
xor_op                            51.2%  18,153,000,000        52,404,000,000     0.346
jlong_disjoint_arraycopy_avx3     22.4%   7,581,000,000         2,364,000,000     3.207
Copy::fill_to_memory_atomic       16.3%   5,316,000,000         7,917,000,000     0.671

The optimization brings the cycles for the mem