Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v37]

Volodymyr Paprotski Fri, 24 May 2024 13:28:14 -0700

On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons <[email protected]> wrote:


>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark                                                   Score            
>> Latest          
>> StringIndexOf.advancedWithMediumSub   343.573                317.934         
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub1    1039.081              1053.96         
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub2        55.828            110.541         
>> 1.980027943x
>> StringIndexOf.constantPattern                        9.361           11.906  
>>         1.271872663x
>> StringIndexOf.searchCharLongSuccess          4.216           4.218           
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess        3.133           3.216           
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76                    3.761           
>> 1.000265957x
>> StringIndexOf.success                                        9.186           
>> 9.713           1.057369911x
>> StringIndexOf.successBig                           14.341            46.343  
>>         3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918              12154.52        
>>         1.953814533x
>> StringIndexOfChar.latin1_AVX2_char     5503.556              5540.044        
>>         1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854              6818.689        
>>         0.977049957x
>> StringIndexOfChar.latin1_SSE4_char     5657.499              5474.624        
>>         0.967675646x
>> StringIndexOfChar.latin1_Short_String          7132.541              
>> 6863.359                0.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389             16162.437         
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String          7386.123            14771.622 
>>         1.999915517x
>> StringIndexOfChar.latin1_mixed_char    9901.671              9782.245        
>>         0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   mov64 => lea(InternalAddress)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:

> 4631:     andl(result, 0x0000000f);  //   tail count (in bytes)
> 4632:     andl(limit, 0xfffffff0);   // vector count (in bytes)
> 4633:     jcc(Assembler::zero, COMPARE_TAIL);

In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I 
think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, 
inside the if (and move the other jcc into the else branch)?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:

> 4637:     negptr(limit);
> 4638: 
> 4639:     bind(COMPARE_WIDE_VECTORS_16);

Understanding-check.. this loop will execute at most 2 times, right?

i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks 
then byte-by-byte?

(Still a good optimization, just trying to understand the scope)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:

> 4716:     jmp(TRUE_LABEL);
> 4717:   } else {
> 4718:     movl(chr, Address(ary1, limit, scaleFactor));

scaleFactor is always Address::times_1 here (expand_ary2==false), might be 
clearer to change it back

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:

> 55: 
> 56:     generator = new Random();
> 57:     long seed = generator.nextLong();//-5291521104060046276L;

dead code

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:

> 61:     ///////////////////////////  WARM-UP //////////////////////////
> 62: 
> 63:     for (int i = 0; i < 20000; i++) {

-Xcomp should be more deterministic (and quicker) way to reach the intrinsic 
(i.e. like the other tests)

On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check 
please.. these tests will run as part of every build from this 
point-till-infinity; Therefore, long test will affect every openjdk developer. 
But if this test is not run on every build, then the build-time does not 
matter, then this test can run for as long as it 'wants'.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:

> 158:   }
> 159: 
> 160:   private static String generateTestString(int min, int max) {

I see you have various `Charset[] charSets` above, but this function still only 
generates LL. Are those separate tests? Or am I missing some concatenation 
somewhere that will convert the generated string string to the correct encoding?

You could had implemented my suggestion from IndexOf.generateTestString here 
instead, so that the tests that do call this function endup with multiple 
encodings; i.e. similar to what you already do in the next function.

I suppose, with addition of String/IndexOf.java that is a moot point.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:

> 183:   }
> 184: 
> 185:   private static int indexOfKernel(String haystack, String needle) {

Is the intention of kernels not to be inlined so that it would be part of 
separate compilation?

If so, you probably want to annotate it with 
`@CompilerControl(CompilerControl.Mode.DONT_INLINE)`

i.e. 
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_16_CompilerControl.java

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 539:

> 537:     failCount = indexOfKernel("", "");
> 538: 
> 539:     for (int x = 0; x < 1000000; x++) {

Should we be concerned about the increased run-time? Or does this execute 
'quickly enough'

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613940896
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613943518
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613946470
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955620
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955354
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613970971
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613967681
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613983597

Re: RFR: 8320448: Accelerate IndexOf using AVX2 [v37]

Reply via email to