On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons <[email protected]> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark Score
>> Latest
>> StringIndexOf.advancedWithMediumSub 343.573 317.934
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub1 1039.081 1053.96
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub2 55.828 110.541
>> 1.980027943x
>> StringIndexOf.constantPattern 9.361 11.906
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess 4.216 4.218
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess 3.133 3.216
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76 3.761
>> 1.000265957x
>> StringIndexOf.success 9.186
>> 9.713 1.057369911x
>> StringIndexOf.successBig 14.341 46.343
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String 7132.541
>> 6863.359 0.962260014x
>> StringIndexOfChar.latin1_Short_char 16013.389 16162.437
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622
>> 1.999915517x
>> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional
> commit since the last revision:
>
> mov64 => lea(InternalAddress)
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:
> 4631: andl(result, 0x0000000f); // tail count (in bytes)
> 4632: andl(limit, 0xfffffff0); // vector count (in bytes)
> 4633: jcc(Assembler::zero, COMPARE_TAIL);
In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I
think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549,
inside the if (and move the other jcc into the else branch)?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:
> 4637: negptr(limit);
> 4638:
> 4639: bind(COMPARE_WIDE_VECTORS_16);
Understanding-check.. this loop will execute at most 2 times, right?
i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks
then byte-by-byte?
(Still a good optimization, just trying to understand the scope)
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:
> 4716: jmp(TRUE_LABEL);
> 4717: } else {
> 4718: movl(chr, Address(ary1, limit, scaleFactor));
scaleFactor is always Address::times_1 here (expand_ary2==false), might be
clearer to change it back
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:
> 55:
> 56: generator = new Random();
> 57: long seed = generator.nextLong();//-5291521104060046276L;
dead code
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:
> 61: /////////////////////////// WARM-UP //////////////////////////
> 62:
> 63: for (int i = 0; i < 20000; i++) {
-Xcomp should be more deterministic (and quicker) way to reach the intrinsic
(i.e. like the other tests)
On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check
please.. these tests will run as part of every build from this
point-till-infinity; Therefore, long test will affect every openjdk developer.
But if this test is not run on every build, then the build-time does not
matter, then this test can run for as long as it 'wants'.
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:
> 158: }
> 159:
> 160: private static String generateTestString(int min, int max) {
I see you have various `Charset[] charSets` above, but this function still only
generates LL. Are those separate tests? Or am I missing some concatenation
somewhere that will convert the generated string string to the correct encoding?
You could had implemented my suggestion from IndexOf.generateTestString here
instead, so that the tests that do call this function endup with multiple
encodings; i.e. similar to what you already do in the next function.
I suppose, with addition of String/IndexOf.java that is a moot point.
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:
> 183: }
> 184:
> 185: private static int indexOfKernel(String haystack, String needle) {
Is the intention of kernels not to be inlined so that it would be part of
separate compilation?
If so, you probably want to annotate it with
`@CompilerControl(CompilerControl.Mode.DONT_INLINE)`
i.e.
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_16_CompilerControl.java
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 539:
> 537: failCount = indexOfKernel("", "");
> 538:
> 539: for (int x = 0; x < 1000000; x++) {
Should we be concerned about the increased run-time? Or does this execute
'quickly enough'
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613940896
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613943518
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613946470
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955620
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955354
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613970971
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613967681
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613983597