On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 >> 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 >> 1.000265957x >> StringIndexOf.success 9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 >> 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > Rearrange; add lambdas for clarity src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1589: > 1587: case 3: > 1588: case 4: > 1589: __ movl(needleVal, Address(needle, offsetOfFirstByteToCompare)); If the size of the needle is 7 and it is an LL case with NUMBER_OF_NEEDLE_BYTES_TO_COMPARE set as 3: bytesLeftToCompare = 4 (i.e. 7-3); offsetOfFirstByteToCompare = 2 (i.e. 3-1); the movl will be loading bytes 2,3,4,5 So we seem to be missing loading the last byte of the needle. Is that correct? src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1735: > 1733: // generated with 32 - (n - k + 1) bits set that ensures matches past > the end of the original > 1734: // haystack do not get considered during compares. > 1735: // Mask is generated below with (n-k+1) bits set and not 32- (n-k+1) bits set. Also it will be helpful if we specify what is n and k. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1784: > 1782: __ subq(tmp, haystack_len); > 1783: } > 1784: __ leaq(haystack, Address(rsp, tmp, Address::times_1)); This whole code is repeated in two places. Could be made into a function and used at both places. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1838: > 1836: __ shrq(rax, 1); > 1837: } > 1838: We need to be consistent either use tzcntl, shrl, testl or tzcntq, shrq, testq. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1600787103 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1600760538 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1600489229 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1600765277