On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 >> 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 >> 1.000265957x >> StringIndexOf.success 9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 >> 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > mov64 => lea(InternalAddress) src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633: > 4631: andl(result, 0x0000000f); // tail count (in bytes) > 4632: andl(limit, 0xfffffff0); // vector count (in bytes) > 4633: jcc(Assembler::zero, COMPARE_TAIL); In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, inside the if (and move the other jcc into the else branch)? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639: > 4637: negptr(limit); > 4638: > 4639: bind(COMPARE_WIDE_VECTORS_16); Understanding-check.. this loop will execute at most 2 times, right? i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks then byte-by-byte? (Still a good optimization, just trying to understand the scope) src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718: > 4716: jmp(TRUE_LABEL); > 4717: } else { > 4718: movl(chr, Address(ary1, limit, scaleFactor)); scaleFactor is always Address::times_1 here (expand_ary2==false), might be clearer to change it back test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57: > 55: > 56: generator = new Random(); > 57: long seed = generator.nextLong();//-5291521104060046276L; dead code test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63: > 61: /////////////////////////// WARM-UP ////////////////////////// > 62: > 63: for (int i = 0; i < 20000; i++) { -Xcomp should be more deterministic (and quicker) way to reach the intrinsic (i.e. like the other tests) On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check please.. these tests will run as part of every build from this point-till-infinity; Therefore, long test will affect every openjdk developer. But if this test is not run on every build, then the build-time does not matter, then this test can run for as long as it 'wants'. test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160: > 158: } > 159: > 160: private static String generateTestString(int min, int max) { I see you have various `Charset[] charSets` above, but this function still only generates LL. Are those separate tests? Or am I missing some concatenation somewhere that will convert the generated string string to the correct encoding? You could had implemented my suggestion from IndexOf.generateTestString here instead, so that the tests that do call this function endup with multiple encodings; i.e. similar to what you already do in the next function. I suppose, with addition of String/IndexOf.java that is a moot point. test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185: > 183: } > 184: > 185: private static int indexOfKernel(String haystack, String needle) { Is the intention of kernels not to be inlined so that it would be part of separate compilation? If so, you probably want to annotate it with `@CompilerControl(CompilerControl.Mode.DONT_INLINE)` i.e. https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_16_CompilerControl.java test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 539: > 537: failCount = indexOfKernel("", ""); > 538: > 539: for (int x = 0; x < 1000000; x++) { Should we be concerned about the increased run-time? Or does this execute 'quickly enough' ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613940896 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613943518 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613946470 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955620 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955354 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613970971 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613967681 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613983597