Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 21:17:07 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
>> 
>>> 486:   __ cmpq(r11, nMinusK);
>>> 487:   __ ja_b(L_return);
>>> 488:   __ movq(rax, r11);
>> 
>> At places where we know that return value in r11 is correct, we dont need to 
>> checkRange so this could have its own label.
>
> Disabling causes the test to succeed, so we're not finding matches beyond the 
> end of the string, correct?  Are we confident that this test passing can 
> warrant removing the range check? @sviswa7 ?

Removed.

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:
>> 
>>> 619: __ addq(hsPtrRet, index);
>>> 620: __ movq(r11, hsPtrRet);
>>> 621: __ jmp(L_checkRangeAndReturn);
>> 
>> Why do we have to checkRange here, would it not be always correct? It so we 
>> could return r11 directly (by moving into rax).
>
> There are cases where r11 could have a value that, when added to (k - 1) 
> would go past the end of the haystack.  I did all in my power to ensure that 
> it doesn't but there's no test I know of to ensure that condition.  I would 
> recommend leaving this in for now.

Removed checkRangeAndReturn

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617956870
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617956635


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 16:37:23 GMT, Sandhya Viswanathan 
 wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fix tests
>
> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
> 
>> 486:   __ cmpq(r11, nMinusK);
>> 487:   __ ja_b(L_return);
>> 488:   __ movq(rax, r11);
> 
> At places where we know that return value in r11 is correct, we dont need to 
> checkRange so this could have its own label.

Disabling causes the test to succeed, so we're not finding matches beyond the 
end of the string, correct?  Are we confident that this test passing can 
warrant removing the range check? @sviswa7 ?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617901070


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 20:56:42 GMT, Scott Gibbons  wrote:

>> We can also do full reads for (n-k) == 31, as we also compare the kth byte.
>
> I'll change and test.

Passes tests, so I'll change.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617886613


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 20:37:43 GMT, Sandhya Viswanathan 
 wrote:

>> I listed all registers for clarity.  This ensures that we know what can be 
>> used as values or as scratch registers with no ambiguity.
>
> Sounds good. We could keep only comment out of the two as it is the same for 
> both small haystack and big haystack.

OK

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617889756


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Volodymyr Paprotski
On Tue, 28 May 2024 17:36:03 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
>> 
>>> 486:   __ cmpq(r11, nMinusK);
>>> 487:   __ ja_b(L_return);
>>> 488:   __ movq(rax, r11);
>> 
>> At places where we know that return value in r11 is correct, we dont need to 
>> checkRange so this could have its own label.
>
> I don't want to change this because its reason for existence is to ensure we 
> don't return a value that's beyond the end of the haystack.  We don't yet 
> have a good enough test to validate whether we're reading past the end of the 
> haystack, so I like this as insurance.

I would recommend an experiment. Disable the range-check and run 
String/IndexOf.java test. Particularly run test4(), which is designed exactly 
to test the reads beyond the end. 

It wont find all the bad reads, but right now if there are any failures, they 
are 'hidden' by this range-check.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617888680


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 20:35:26 GMT, Sandhya Viswanathan 
 wrote:

>> No. For (n-k)==32 we can do full reads.  I'll clarify by changing the label 
>> name.
>
> We can also do full reads for (n-k) == 31, as we also compare the kth byte.

I'll change and test.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617883225


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 20:29:38 GMT, Sandhya Viswanathan 
 wrote:

>> No.  This is checking for a zero length haystack.  The following compare 
>> checks for needle length longer than haystack, regardless of the value in 
>> each.  The comparison is signed, so a haystack length of 0 with a needle 
>> length of -1 will pass the following test and assume validity.
>
> But we have already checked for needle length to be greater than 0 in the 
> following lines:
> __ cmpq(needle_len_p, 0);
>  __ jg_b(L_nextCheck);

OK

>> Again, I think we ought to leave this in.  Although it executes ~3 
>> instructions that may not be necessary in some cases I think it's best to 
>> perform the check.  Once we have a good enough test to check reading past 
>> the end of the haystack we can change it.
>
> In this particular case, we are returning -1 (NoMatch), so no need to do 
> L_checkRangeAndReturn here, we could directly jump to L_returnError.

OK.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617876757
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617874637


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Sandhya Viswanathan
On Tue, 28 May 2024 18:11:13 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333:
>> 
>>> 1331: 
>>> 1332:   __ cmpq(nMinusK, 32);
>>> 1333:   __ jae_b(L_greaterThan32);
>> 
>> Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31
>>  __ cmpq(nMinusK, 31);
>>  __ jae_b(L_greaterThan32);
>
> No. For (n-k)==32 we can do full reads.  I'll clarify by changing the label 
> name.

We can also do full reads for (n-k) == 31, as we also compare the kth byte.

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750:
>> 
>>> 1748: //  r15 = unused
>>> 1749: //  XMM_BYTE_0 - first element of needle, broadcast
>>> 1750: //  XMM_BYTE_K - last element of needle, broadcast
>> 
>> This comment is duplicated for both small haystack case and big haystack 
>> case, could be made a common comment. 
>> Also the only registers that are used as input in the switch case are:
>> r14 = needle
>> rbx = haystack
>> rsi = haystack length (n)
>> r12 = needle length (k)
>> r10 = n - k (where k is needle length)
>> XMM_BYTE_0 = first element of needle, broadcast
>> XMM_BYTE_K = last element of needle, broadcast
>> So we could only list these, making it easier to comprehend.
>
> I listed all registers for clarity.  This ensures that we know what can be 
> used as values or as scratch registers with no ambiguity.

Sounds good. We could keep only comment out of the two as it is the same for 
both small haystack and big haystack.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617862799
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617865049


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Sandhya Viswanathan
On Tue, 28 May 2024 17:30:24 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:
>> 
>>> 276:   __ bind(L_nextCheck);
>>> 277:   __ testq(haystack_len_p, haystack_len_p);
>>> 278:   __ je(L_zeroCheckFailed);
>> 
>> This check could be removed as the next check covers this one.
>
> No.  This is checking for a zero length haystack.  The following compare 
> checks for needle length longer than haystack, regardless of the value in 
> each.  The comparison is signed, so a haystack length of 0 with a needle 
> length of -1 will pass the following test and assume validity.

But we have already checked for needle length to be greater than 0 in the 
following lines:
__ cmpq(needle_len_p, 0);
 __ jg_b(L_nextCheck);

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617857240


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Sandhya Viswanathan
On Tue, 28 May 2024 17:59:49 GMT, Scott Gibbons  wrote:

>> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:
>> 
>>> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value.
>>> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, 
>>> mask, hsPtrRet, needleLen,
>>> 578:  needle, haystack, hsLength, tmp1, tmp2, tmp3, 
>>> rScratch, ae, _masm);
>> 
>> If we run out of haystack instead of jumping to L_checkRangeAndReturn, we 
>> could directly jump to  L_retrunError.
>
> Again, I think we ought to leave this in.  Although it executes ~3 
> instructions that may not be necessary in some cases I think it's best to 
> perform the check.  Once we have a good enough test to check reading past the 
> end of the haystack we can change it.

In this particular case, we are returning -1 (NoMatch), so no need to do 
L_checkRangeAndReturn here, we could directly jump to L_returnError.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617853337


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Scott Gibbons
On Tue, 28 May 2024 12:48:19 GMT, Sandhya Viswanathan 
 wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fix tests
>
> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239:
> 
>> 237:   //  the needle size is less than 32 bytes, we default to a
>> 238:   //  byte-by-byte comparison (this will be rare).
>> 239:   //
> 
> Is this still true?

Yes.  For UL, the code within `L_compareFull` effectively does byte-by-byte.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:
> 
>> 276:   __ bind(L_nextCheck);
>> 277:   __ testq(haystack_len_p, haystack_len_p);
>> 278:   __ je(L_zeroCheckFailed);
> 
> This check could be removed as the next check covers this one.

No.  This is checking for a zero length haystack.  The following compare checks 
for needle length longer than haystack, regardless of the value in each.  The 
comparison is signed, so a haystack length of 0 with a needle length of -1 will 
pass the following test and assume validity.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360:
> 
>> 358:   __ push(rcx);
>> 359:   __ push(r8);
>> 360:   __ push(r9);
> 
> No need to save/restore rcx/r8/r9 on windows platform as well.

OK.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379:
> 
>> 377: 
>> 378:   // Assume failure
>> 379:   __ movq(rbp, -1);
> 
> We are no more using rbp at return point so this is not needed now?

Removed.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
> 
>> 486:   __ cmpq(r11, nMinusK);
>> 487:   __ ja_b(L_return);
>> 488:   __ movq(rax, r11);
> 
> At places where we know that return value in r11 is correct, we dont need to 
> checkRange so this could have its own label.

I don't want to change this because its reason for existence is to ensure we 
don't return a value that's beyond the end of the haystack.  We don't yet have 
a good enough test to validate whether we're reading past the end of the 
haystack, so I like this as insurance.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566:
> 
>> 564: //  rbp: -1
>> 565: //  XMM_BYTE_0 - first element of needle broadcast
>> 566: //  XMM_BYTE_K - last element of needle broadcast
> 
> The only registers that are used as input in the switch case are:
> r14 = needle
> rbx = haystack
> rsi = haystack length (n)
> r12 = needle length (k)
> r10 = n - k (where k is needle length)
> XMM_BYTE_0 = first element of needle, broadcast
> XMM_BYTE_K = last element of needle, broadcast
> So we could only list these, making it easier to comprehend.

I listed these registers to make it clear which registers had no expected value 
and could be used for temps, etc.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:
> 
>> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value.
>> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, 
>> mask, hsPtrRet, needleLen,
>> 578:  needle, haystack, hsLength, tmp1, tmp2, tmp3, 
>> rScratch, ae, _masm);
> 
> If we run out of haystack instead of jumping to L_checkRangeAndReturn, we 
> could directly jump to  L_retrunError.

Again, I think we ought to leave this in.  Although it executes ~3 instructions 
that may not be necessary in some cases I think it's best to perform the check. 
 Once we have a good enough test to check reading past the end of the haystack 
we can change it.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597:
> 
>> 595: 
>> 596: // Need a lot of registers here to preserve state across 
>> arrays_equals call
>> 597: 
> 
> This comment is no longer valid, could be removed.

OK

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:
> 
>> 619: __ addq(hsPtrRet, index);
>> 620: __ movq(r11, hsPtrRet);
>> 621: __ jmp(L_checkRangeAndReturn);
> 
> Why do we have to checkRange here, would it not be always correct? It so we 
> could return r11 directly (by moving into rax).

There are cases where r11 could have a value that, when added to (k - 1) would 
go past the end of the haystack.  I did all in my power to ensure that it 
doesn't but there's no test I know of to ensure that condition.  I would 
recommend leaving this in for now.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660:
> 
>> 658:   //  Haystack always copied to stack, so 32-byte reads OK
>> 659:   //  Haystack length < 32
>> 660:   //  10 < needle length < 32
> 
> Haystack length <= 32
> 10 < needle length <= 32

Changed.

> src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721:
> 
>> 719:false /* char */, knoreg);
>> 720: __ testl(rTmp3, rTmp3);
>> 721: __ jne(L_checkRangeAndReturn);
> 
> Why do we have to checkRange here, would it not be always correct? It so we 
> could return r11 directly (by moving into rax).

OK

> 

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-28 Thread Sandhya Viswanathan
On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix tests

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239:

> 237:   //  the needle size is less than 32 bytes, we default to a
> 238:   //  byte-by-byte comparison (this will be rare).
> 239:   //

Is this still true?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:

> 276:   __ bind(L_nextCheck);
> 277:   __ testq(haystack_len_p, haystack_len_p);
> 278:   __ je(L_zeroCheckFailed);

This check could be removed as the next check covers this one.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360:

> 358:   __ push(rcx);
> 359:   __ push(r8);
> 360:   __ push(r9);

No need to save/restore rcx/r8/r9 on windows platform as well.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379:

> 377: 
> 378:   // Assume failure
> 379:   __ movq(rbp, -1);

We are no more using rbp at return point so this is not needed now?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:

> 486:   __ cmpq(r11, nMinusK);
> 487:   __ ja_b(L_return);
> 488:   __ movq(rax, r11);

At places where we know that return value in r11 is correct, we dont need to 
checkRange so this could have its own label.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566:

> 564: //  rbp: -1
> 565: //  XMM_BYTE_0 - first element of needle broadcast
> 566: //  XMM_BYTE_K - last element of needle broadcast

The only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:

> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value.
> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, 
> mask, hsPtrRet, needleLen,
> 578:  needle, haystack, hsLength, tmp1, tmp2, tmp3, 
> rScratch, ae, _masm);

If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could 
directly jump to  L_retrunError.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597:

> 595: 
> 596: // Need a lot of registers here to preserve state across 
> arrays_equals call
> 597: 

This comment is no longer valid, could be removed.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:

> 619: __ addq(hsPtrRet, index);
> 620: __ movq(r11, hsPtrRet);
> 621: __ jmp(L_checkRangeAndReturn);

Why do we have to checkRange here, would it not be always correct? It so we 
could return r11 directly (by moving into rax).


Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

2024-05-25 Thread Scott Gibbons
> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
> using AVX2 instructions.  This change accelerates String.IndexOf on average 
> 1.3x for AVX2.  The benchmark numbers:
> 
> 
> BenchmarkScore
> Latest  
> StringIndexOf.advancedWithMediumSub   343.573 317.934 
> 0.925375393x
> StringIndexOf.advancedWithShortSub1 1039.081  1053.96 
> 1.014319384x
> StringIndexOf.advancedWithShortSub2 55.828110.541 
> 1.980027943x
> StringIndexOf.constantPattern 9.361   11.906  
> 1.271872663x
> StringIndexOf.searchCharLongSuccess   4.216   4.218   
> 1.000474383x
> StringIndexOf.searchCharMediumSuccess 3.133   3.216   
> 1.02649218x
> StringIndexOf.searchCharShortSuccess  3.763.761   
> 1.000265957x
> StringIndexOf.success 9.186   9.713   
> 1.057369911x
> StringIndexOf.successBig14.34146.343  
> 3.231504079x
> StringIndexOfChar.latin1_AVX2_String6220.918  12154.52
> 1.953814533x
> StringIndexOfChar.latin1_AVX2_char  5503.556  5540.044
> 1.006629895x
> StringIndexOfChar.latin1_SSE4_String6978.854  6818.689
> 0.977049957x
> StringIndexOfChar.latin1_SSE4_char  5657.499  5474.624
> 0.967675646x
> StringIndexOfChar.latin1_Short_String   7132.541  6863.359
> 0.962260014x
> StringIndexOfChar.latin1_Short_char   16013.389 16162.437 
> 1.009307711x
> StringIndexOfChar.latin1_mixed_String   7386.12314771.622 
> 1.15517x
> StringIndexOfChar.latin1_mixed_char 9901.671  9782.245
> 0.987938803

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Fix tests

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16753/files
  - new: https://git.openjdk.org/jdk/pull/16753/files/e13c7ea4..15994a39

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=16753=42
 - incr: https://webrevs.openjdk.org/?repo=jdk=16753=41-42

  Stats: 2 lines in 2 files changed: 0 ins; 2 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16753.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753

PR: https://git.openjdk.org/jdk/pull/16753