On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann <d...@openjdk.org> wrote:
>> Update: the XorTest::xor results shown in this message used test code from >> PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit >> a788f066af17. The XorTest has since been updated and XorTest::copy is no >> longer needed and has been removed from this pull request. See comment >> [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) >> for updated performance data. >> >> Below is baseline data collected using a modified version of the >> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug >> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake >> i7-1185G7, which does support AVX512. >> >> Baseline data >> Benchmark (arrayKind) (sizeKind) Mode Cnt Score >> Error Units >> -------------------------------------------------------------------------------------- >> XorTest.copy ELEMENTS SMALL avgt 30 584737355.767 ± >> 60414308.540 ns/op >> XorTest.copy ELEMENTS MEDIUM avgt 30 272248995.683 ± >> 2924954.498 ns/op >> XorTest.copy ELEMENTS LARGE avgt 30 1019200210.900 ± >> 28334453.652 ns/op >> XorTest.copy REGION SMALL avgt 30 7399944.164 ± >> 216821.819 ns/op >> XorTest.copy REGION MEDIUM avgt 30 20591454.558 ± >> 147398.572 ns/op >> XorTest.copy REGION LARGE avgt 30 21649266.051 ± >> 179263.875 ns/op >> XorTest.copy CRITICAL SMALL avgt 30 51079.357 ± >> 542.482 ns/op >> XorTest.copy CRITICAL MEDIUM avgt 30 2496.961 ± >> 11.375 ns/op >> XorTest.copy CRITICAL LARGE avgt 30 515.454 ± >> 5.831 ns/op >> XorTest.copy FOREIGN SMALL avgt 30 7558432.075 ± >> 79489.276 ns/op >> XorTest.copy FOREIGN MEDIUM avgt 30 19730666.341 ± >> 500505.099 ns/op >> XorTest.copy FOREIGN LARGE avgt 30 34616758.085 ± >> 340300.726 ns/op >> XorTest.xor ELEMENTS SMALL avgt 30 219832692.489 ± >> 2329417.319 ns/op >> XorTest.xor ELEMENTS MEDIUM avgt 30 505138197.167 ± >> 3818334.424 ns/op >> XorTest.xor ELEMENTS LARGE avgt 30 1189608474.667 ± >> 5877981.900 ns/op >> XorTest.xor REGION SMALL avgt 30 64093872.804 ± >> 599704.491 ns/op >> XorTest.xor REGION MEDIUM avgt 30 81544576.454 ± >> 1406342.118 ns/op >> XorTest.xor REGION LARGE avgt 30 90091424.883 ± >> 775577.613 ns/op >> XorTest.xor CRITICAL SMALL avgt ... > > Steve Dohrmann has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains ten commits: > > - Merge branch 'master' into memcpy > - Update full name > Previous commit (fcbbc0d7880) added > org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark > - - remerge upstream master > - remove ::copy test from XorTest > - Merge branch 'master' into memcpy > - - fix whitespace error > - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy > - - bug fix: only generate / use large copy code if MaxVectorSize == 64 > - - fix whitespace issues > - fix xor test foreign impl constructor signature > - - initial commit -- optimize large array cases in > StubGenerator::generate_disjoint_copy_avx3_masked > - add src address prefetches > - switch to non-temporal writes > - added modified jmh benchmark based on xor benchmark from Maurizio > Cimadamore src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753: > 751: Label L_pre_main_post_large; > 752: > 753: if (MaxVectorSize == 64) { This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64. src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768: > 766: } > 767: __ movq(temp3, temp2); > 768: copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0); The last argument should be "true" or "1" instead of "0" or "false". This is as temp3 (length) could be less than 32 as well. This case is only handled when use64byteVector argument is true. src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777: > 775: > 776: __ BIND(L_main_pre_loop_large); > 777: __ subq(temp1, loop_size[shift]); // whay is this here Spurious comment. src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797: > 795: __ jcc(Assembler::lessEqual, L_exit_large); > 796: arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift, > 797: temp4, temp3, L_entry_large, > L_exit_large); When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need to be copied so we don't need to go back to L_entry_large. src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058: > 1056: }; > 1057: > 1058: if (MaxVectorSize == 64) { This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64. src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1175: > 1173: void StubGenerator::copy256_avx3(Register dst, Register src, Register > index, XMMRegister xmm1, > 1174: XMMRegister xmm2, XMMRegister xmm3, > XMMRegister xmm4, > 1175: bool conjoint, int shift, int offset) { The conjoint parameter is not used so could be removed from this function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916497 PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399918640 PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399896421 PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399934713 PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916567 PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399881916