Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 21:17:07 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: >> >>> 486: __ cmpq(r11, nMinusK); >>> 487: __ ja_b(L_return); >>> 488: __ movq(rax, r11); >> >> At places where we know that return value in r11 is correct, we dont need to >> checkRange so this could have its own label. > > Disabling causes the test to succeed, so we're not finding matches beyond the > end of the string, correct? Are we confident that this test passing can > warrant removing the range check? @sviswa7 ? Removed. >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621: >> >>> 619: __ addq(hsPtrRet, index); >>> 620: __ movq(r11, hsPtrRet); >>> 621: __ jmp(L_checkRangeAndReturn); >> >> Why do we have to checkRange here, would it not be always correct? It so we >> could return r11 directly (by moving into rax). > > There are cases where r11 could have a value that, when added to (k - 1) > would go past the end of the haystack. I did all in my power to ensure that > it doesn't but there's no test I know of to ensure that condition. I would > recommend leaving this in for now. Removed checkRangeAndReturn - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617956870 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617956635
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 16:37:23 GMT, Sandhya Viswanathan wrote: >> Scott Gibbons has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Fix tests > > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: > >> 486: __ cmpq(r11, nMinusK); >> 487: __ ja_b(L_return); >> 488: __ movq(rax, r11); > > At places where we know that return value in r11 is correct, we dont need to > checkRange so this could have its own label. Disabling causes the test to succeed, so we're not finding matches beyond the end of the string, correct? Are we confident that this test passing can warrant removing the range check? @sviswa7 ? - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617901070
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 20:56:42 GMT, Scott Gibbons wrote: >> We can also do full reads for (n-k) == 31, as we also compare the kth byte. > > I'll change and test. Passes tests, so I'll change. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617886613
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 20:37:43 GMT, Sandhya Viswanathan wrote: >> I listed all registers for clarity. This ensures that we know what can be >> used as values or as scratch registers with no ambiguity. > > Sounds good. We could keep only comment out of the two as it is the same for > both small haystack and big haystack. OK - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617889756
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 17:36:03 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: >> >>> 486: __ cmpq(r11, nMinusK); >>> 487: __ ja_b(L_return); >>> 488: __ movq(rax, r11); >> >> At places where we know that return value in r11 is correct, we dont need to >> checkRange so this could have its own label. > > I don't want to change this because its reason for existence is to ensure we > don't return a value that's beyond the end of the haystack. We don't yet > have a good enough test to validate whether we're reading past the end of the > haystack, so I like this as insurance. I would recommend an experiment. Disable the range-check and run String/IndexOf.java test. Particularly run test4(), which is designed exactly to test the reads beyond the end. It wont find all the bad reads, but right now if there are any failures, they are 'hidden' by this range-check. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617888680
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 20:35:26 GMT, Sandhya Viswanathan wrote: >> No. For (n-k)==32 we can do full reads. I'll clarify by changing the label >> name. > > We can also do full reads for (n-k) == 31, as we also compare the kth byte. I'll change and test. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617883225
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 20:29:38 GMT, Sandhya Viswanathan wrote: >> No. This is checking for a zero length haystack. The following compare >> checks for needle length longer than haystack, regardless of the value in >> each. The comparison is signed, so a haystack length of 0 with a needle >> length of -1 will pass the following test and assume validity. > > But we have already checked for needle length to be greater than 0 in the > following lines: > __ cmpq(needle_len_p, 0); > __ jg_b(L_nextCheck); OK >> Again, I think we ought to leave this in. Although it executes ~3 >> instructions that may not be necessary in some cases I think it's best to >> perform the check. Once we have a good enough test to check reading past >> the end of the haystack we can change it. > > In this particular case, we are returning -1 (NoMatch), so no need to do > L_checkRangeAndReturn here, we could directly jump to L_returnError. OK. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617876757 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617874637
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 18:11:13 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333: >> >>> 1331: >>> 1332: __ cmpq(nMinusK, 32); >>> 1333: __ jae_b(L_greaterThan32); >> >> Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31 >> __ cmpq(nMinusK, 31); >> __ jae_b(L_greaterThan32); > > No. For (n-k)==32 we can do full reads. I'll clarify by changing the label > name. We can also do full reads for (n-k) == 31, as we also compare the kth byte. >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750: >> >>> 1748: // r15 = unused >>> 1749: // XMM_BYTE_0 - first element of needle, broadcast >>> 1750: // XMM_BYTE_K - last element of needle, broadcast >> >> This comment is duplicated for both small haystack case and big haystack >> case, could be made a common comment. >> Also the only registers that are used as input in the switch case are: >> r14 = needle >> rbx = haystack >> rsi = haystack length (n) >> r12 = needle length (k) >> r10 = n - k (where k is needle length) >> XMM_BYTE_0 = first element of needle, broadcast >> XMM_BYTE_K = last element of needle, broadcast >> So we could only list these, making it easier to comprehend. > > I listed all registers for clarity. This ensures that we know what can be > used as values or as scratch registers with no ambiguity. Sounds good. We could keep only comment out of the two as it is the same for both small haystack and big haystack. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617862799 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617865049
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 17:30:24 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278: >> >>> 276: __ bind(L_nextCheck); >>> 277: __ testq(haystack_len_p, haystack_len_p); >>> 278: __ je(L_zeroCheckFailed); >> >> This check could be removed as the next check covers this one. > > No. This is checking for a zero length haystack. The following compare > checks for needle length longer than haystack, regardless of the value in > each. The comparison is signed, so a haystack length of 0 with a needle > length of -1 will pass the following test and assume validity. But we have already checked for needle length to be greater than 0 in the following lines: __ cmpq(needle_len_p, 0); __ jg_b(L_nextCheck); - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617857240
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 17:59:49 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578: >> >>> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value. >>> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, >>> mask, hsPtrRet, needleLen, >>> 578: needle, haystack, hsLength, tmp1, tmp2, tmp3, >>> rScratch, ae, _masm); >> >> If we run out of haystack instead of jumping to L_checkRangeAndReturn, we >> could directly jump to L_retrunError. > > Again, I think we ought to leave this in. Although it executes ~3 > instructions that may not be necessary in some cases I think it's best to > perform the check. Once we have a good enough test to check reading past the > end of the haystack we can change it. In this particular case, we are returning -1 (NoMatch), so no need to do L_checkRangeAndReturn here, we could directly jump to L_returnError. - PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617853337
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Tue, 28 May 2024 12:48:19 GMT, Sandhya Viswanathan wrote: >> Scott Gibbons has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Fix tests > > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239: > >> 237: // the needle size is less than 32 bytes, we default to a >> 238: // byte-by-byte comparison (this will be rare). >> 239: // > > Is this still true? Yes. For UL, the code within `L_compareFull` effectively does byte-by-byte. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278: > >> 276: __ bind(L_nextCheck); >> 277: __ testq(haystack_len_p, haystack_len_p); >> 278: __ je(L_zeroCheckFailed); > > This check could be removed as the next check covers this one. No. This is checking for a zero length haystack. The following compare checks for needle length longer than haystack, regardless of the value in each. The comparison is signed, so a haystack length of 0 with a needle length of -1 will pass the following test and assume validity. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360: > >> 358: __ push(rcx); >> 359: __ push(r8); >> 360: __ push(r9); > > No need to save/restore rcx/r8/r9 on windows platform as well. OK. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379: > >> 377: >> 378: // Assume failure >> 379: __ movq(rbp, -1); > > We are no more using rbp at return point so this is not needed now? Removed. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: > >> 486: __ cmpq(r11, nMinusK); >> 487: __ ja_b(L_return); >> 488: __ movq(rax, r11); > > At places where we know that return value in r11 is correct, we dont need to > checkRange so this could have its own label. I don't want to change this because its reason for existence is to ensure we don't return a value that's beyond the end of the haystack. We don't yet have a good enough test to validate whether we're reading past the end of the haystack, so I like this as insurance. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566: > >> 564: // rbp: -1 >> 565: // XMM_BYTE_0 - first element of needle broadcast >> 566: // XMM_BYTE_K - last element of needle broadcast > > The only registers that are used as input in the switch case are: > r14 = needle > rbx = haystack > rsi = haystack length (n) > r12 = needle length (k) > r10 = n - k (where k is needle length) > XMM_BYTE_0 = first element of needle, broadcast > XMM_BYTE_K = last element of needle, broadcast > So we could only list these, making it easier to comprehend. I listed these registers to make it clear which registers had no expected value and could be used for temps, etc. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578: > >> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value. >> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, >> mask, hsPtrRet, needleLen, >> 578: needle, haystack, hsLength, tmp1, tmp2, tmp3, >> rScratch, ae, _masm); > > If we run out of haystack instead of jumping to L_checkRangeAndReturn, we > could directly jump to L_retrunError. Again, I think we ought to leave this in. Although it executes ~3 instructions that may not be necessary in some cases I think it's best to perform the check. Once we have a good enough test to check reading past the end of the haystack we can change it. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597: > >> 595: >> 596: // Need a lot of registers here to preserve state across >> arrays_equals call >> 597: > > This comment is no longer valid, could be removed. OK > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621: > >> 619: __ addq(hsPtrRet, index); >> 620: __ movq(r11, hsPtrRet); >> 621: __ jmp(L_checkRangeAndReturn); > > Why do we have to checkRange here, would it not be always correct? It so we > could return r11 directly (by moving into rax). There are cases where r11 could have a value that, when added to (k - 1) would go past the end of the haystack. I did all in my power to ensure that it doesn't but there's no test I know of to ensure that condition. I would recommend leaving this in for now. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660: > >> 658: // Haystack always copied to stack, so 32-byte reads OK >> 659: // Haystack length < 32 >> 660: // 10 < needle length < 32 > > Haystack length <= 32 > 10 < needle length <= 32 Changed. > src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721: > >> 719:false /* char */, knoreg); >> 720: __ testl(rTmp3, rTmp3); >> 721: __ jne(L_checkRangeAndReturn); > > Why do we have to checkRange here, would it not be always correct? It so we > could return r11 directly (by moving into rax). OK >
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub11039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub255.828110.541 >> 1.980027943x >> StringIndexOf.constantPattern9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.763.761 >> 1.000265957x >> StringIndexOf.success9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.34146.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.3590.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.12314771.622 >> 1.15517x >> StringIndexOfChar.latin1_mixed_char9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > Fix tests src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239: > 237: // the needle size is less than 32 bytes, we default to a > 238: // byte-by-byte comparison (this will be rare). > 239: // Is this still true? src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278: > 276: __ bind(L_nextCheck); > 277: __ testq(haystack_len_p, haystack_len_p); > 278: __ je(L_zeroCheckFailed); This check could be removed as the next check covers this one. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360: > 358: __ push(rcx); > 359: __ push(r8); > 360: __ push(r9); No need to save/restore rcx/r8/r9 on windows platform as well. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379: > 377: > 378: // Assume failure > 379: __ movq(rbp, -1); We are no more using rbp at return point so this is not needed now? src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: > 486: __ cmpq(r11, nMinusK); > 487: __ ja_b(L_return); > 488: __ movq(rax, r11); At places where we know that return value in r11 is correct, we dont need to checkRange so this could have its own label. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566: > 564: // rbp: -1 > 565: // XMM_BYTE_0 - first element of needle broadcast > 566: // XMM_BYTE_K - last element of needle broadcast The only registers that are used as input in the switch case are: r14 = needle rbx = haystack rsi = haystack length (n) r12 = needle length (k) r10 = n - k (where k is needle length) XMM_BYTE_0 = first element of needle, broadcast XMM_BYTE_K = last element of needle, broadcast So we could only list these, making it easier to comprehend. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578: > 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value. > 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, > mask, hsPtrRet, needleLen, > 578: needle, haystack, hsLength, tmp1, tmp2, tmp3, > rScratch, ae, _masm); If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could directly jump to L_retrunError. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597: > 595: > 596: // Need a lot of registers here to preserve state across > arrays_equals call > 597: This comment is no longer valid, could be removed. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621: > 619: __ addq(hsPtrRet, index); > 620: __ movq(r11, hsPtrRet); > 621: __ jmp(L_checkRangeAndReturn); Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax).
Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
> Re-write the IndexOf code without the use of the pcmpestri instruction, only > using AVX2 instructions. This change accelerates String.IndexOf on average > 1.3x for AVX2. The benchmark numbers: > > > BenchmarkScore > Latest > StringIndexOf.advancedWithMediumSub 343.573 317.934 > 0.925375393x > StringIndexOf.advancedWithShortSub1 1039.081 1053.96 > 1.014319384x > StringIndexOf.advancedWithShortSub2 55.828110.541 > 1.980027943x > StringIndexOf.constantPattern 9.361 11.906 > 1.271872663x > StringIndexOf.searchCharLongSuccess 4.216 4.218 > 1.000474383x > StringIndexOf.searchCharMediumSuccess 3.133 3.216 > 1.02649218x > StringIndexOf.searchCharShortSuccess 3.763.761 > 1.000265957x > StringIndexOf.success 9.186 9.713 > 1.057369911x > StringIndexOf.successBig14.34146.343 > 3.231504079x > StringIndexOfChar.latin1_AVX2_String6220.918 12154.52 > 1.953814533x > StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 > 1.006629895x > StringIndexOfChar.latin1_SSE4_String6978.854 6818.689 > 0.977049957x > StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 > 0.967675646x > StringIndexOfChar.latin1_Short_String 7132.541 6863.359 > 0.962260014x > StringIndexOfChar.latin1_Short_char 16013.389 16162.437 > 1.009307711x > StringIndexOfChar.latin1_mixed_String 7386.12314771.622 > 1.15517x > StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 > 0.987938803 Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: Fix tests - Changes: - all: https://git.openjdk.org/jdk/pull/16753/files - new: https://git.openjdk.org/jdk/pull/16753/files/e13c7ea4..15994a39 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=16753=42 - incr: https://webrevs.openjdk.org/?repo=jdk=16753=41-42 Stats: 2 lines in 2 files changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753 PR: https://git.openjdk.org/jdk/pull/16753