Re: RFR: 8329331: Intrinsify Unsafe::setMemory [v6]
On Sun, 7 Apr 2024 01:49:01 GMT, Dean Long wrote: >> Scott Gibbons has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Oops > > I went ahead and tried a pure-Java implementation, and it is faster for small > sizes (up to 8) and only about 1.5x slower for larger sizes, so that might > make for an interesting fallback if there is no customized assembler > implementation available or if the size is known to me small. > > Ideally, I think we would want C2 to be more aware of setMemory stores, so > that it can remove redundant stores, like it does with InitializeNode. @dean-long in my old PR I have done the same, choosing a (not yet) configurable cutoff value. See https://github.com/openjdk/jdk/pull/16760 - PR Comment: https://git.openjdk.org/jdk/pull/18555#issuecomment-2041314429
Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]
On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub11039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub255.828110.541 >> 1.980027943x >> StringIndexOf.constantPattern9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.763.761 >> 1.000265957x >> StringIndexOf.success9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.34146.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.3590.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.12314771.622 >> 1.15517x >> StringIndexOfChar.latin1_mixed_char9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'openjdk:master' into indexof > - Cleaned up, ready for review > - Pre-cleanup code > - Add JMH. Add 16-byte compares to arrays_equals > - Better method for mask creation > - Merge branch 'openjdk:master' into indexof > - Most cleanup done. > - Remove header dependency > - Works - needs cleanup > - Passes tests. > - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 Sure thing: https://github.com/netty/netty/pull/13534#issuecomment-1685247165 It's the comparison with String::indexOf while the impl is: https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L590 - PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016576139
Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]
On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub11039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub255.828110.541 >> 1.980027943x >> StringIndexOf.constantPattern9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.763.761 >> 1.000265957x >> StringIndexOf.success9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.34146.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.3590.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.12314771.622 >> 1.15517x >> StringIndexOfChar.latin1_mixed_char9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'openjdk:master' into indexof > - Cleaned up, ready for review > - Pre-cleanup code > - Add JMH. Add 16-byte compares to arrays_equals > - Better method for mask creation > - Merge branch 'openjdk:master' into indexof > - Most cleanup done. > - Remove header dependency > - Works - needs cleanup > - Passes tests. > - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, which is manually loop unrolling the head processing (first < 8 bytes) to artificially make sure the branch predictor got different branches to care AND JIT won't make it wrong. We have measured (I can provide a link of the benchmark and results, If you are interested) that it delivers a much better performance on tiny strings and makes a smoother degradation vs perfectly aligned string length as well. Clearly this tends to be much visible if the input strings have shuffled delimiter positions, to make the branch prediction misses more relevant. - PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016432345
Re: RFR: 8316704: Regex-free parsing of Formatter and FormatProcessor specifiers
On Sun, 17 Sep 2023 16:01:33 GMT, Shaojin Wen wrote: > @cl4es made performance optimizations for the simple specifiers of > String.format in PR https://github.com/openjdk/jdk/pull/2830. Based on the > same idea, I continued to make improvements. I made patterns like %2d %02d > also be optimized. > > The following are the test results based on MacBookPro M1 Pro: > > > -Benchmark Mode Cnt Score Error Units > -StringFormat.complexFormat avgt 15 1862.233 ? 217.479 ns/op > -StringFormat.int02Format avgt 15 312.491 ? 26.021 ns/op > -StringFormat.intFormat avgt 1584.432 ? 4.145 ns/op > -StringFormat.longFormatavgt 1587.330 ? 6.111 ns/op > -StringFormat.stringFormat avgt 1563.985 ? 11.366 ns/op > -StringFormat.stringIntFormat avgt 1587.422 ? 0.147 ns/op > -StringFormat.widthStringFormat avgt 15 250.740 ? 32.639 ns/op > -StringFormat.widthStringIntFormat avgt 15 312.474 ? 16.309 ns/op > > +Benchmark Mode CntScoreError Units > +StringFormat.complexFormat avgt 15 740.626 ? 66.671 ns/op > (+151.45) > +StringFormat.int02Format avgt 15 131.049 ? 0.432 ns/op > (+138.46) > +StringFormat.intFormat avgt 15 67.229 ? 4.155 ns/op > (+25.59) > +StringFormat.longFormatavgt 15 66.444 ? 0.614 ns/op > (+31.44) > +StringFormat.stringFormat avgt 15 62.619 ? 4.652 ns/op (+2.19) > +StringFormat.stringIntFormat avgt 15 89.606 ? 13.966 ns/op (-2.44) > +StringFormat.widthStringFormat avgt 15 52.462 ? 15.649 ns/op > (+377.95) > +StringFormat.widthStringIntFormat avgt 15 101.814 ? 3.147 ns/op > (+206.91) src/java.base/share/classes/java/util/Formatter.java line 2935: > 2933: for (int size = 0; off < max; c = s.charAt(++off), size++) { > 2934: if (!isDigit(c)) { > 2935: if (size > 0) { I would put the first iteration outside of this loop and have a loop which doesn't need to check for it - PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1338342542
Re: RFR: 8316704: Regex-free parsing of Formatter and FormatProcessor specifiers [v11]
On Wed, 27 Sep 2023 09:35:47 GMT, 温绍锦 wrote: >> @cl4es made performance optimizations for the simple specifiers of >> String.format in PR https://github.com/openjdk/jdk/pull/2830. Based on the >> same idea, I continued to make improvements. I made patterns like %2d %02d >> also be optimized. >> >> The following are the test results based on MacBookPro M1 Pro: >> >> >> -Benchmark Mode Cnt Score Error Units >> -StringFormat.complexFormat avgt 15 1862.233 ? 217.479 ns/op >> -StringFormat.int02Format avgt 15 312.491 ? 26.021 ns/op >> -StringFormat.intFormat avgt 1584.432 ? 4.145 ns/op >> -StringFormat.longFormatavgt 1587.330 ? 6.111 ns/op >> -StringFormat.stringFormat avgt 1563.985 ? 11.366 ns/op >> -StringFormat.stringIntFormat avgt 1587.422 ? 0.147 ns/op >> -StringFormat.widthStringFormat avgt 15 250.740 ? 32.639 ns/op >> -StringFormat.widthStringIntFormat avgt 15 312.474 ? 16.309 ns/op >> >> +Benchmark Mode CntScoreError Units >> +StringFormat.complexFormat avgt 15 740.626 ? 66.671 ns/op >> (+151.45) >> +StringFormat.int02Format avgt 15 131.049 ? 0.432 ns/op >> (+138.46) >> +StringFormat.intFormat avgt 15 67.229 ? 4.155 ns/op >> (+25.59) >> +StringFormat.longFormatavgt 15 66.444 ? 0.614 ns/op >> (+31.44) >> +StringFormat.stringFormat avgt 15 62.619 ? 4.652 ns/op >> (+2.19) >> +StringFormat.stringIntFormat avgt 15 89.606 ? 13.966 ns/op >> (-2.44) >> +StringFormat.widthStringFormat avgt 15 52.462 ? 15.649 ns/op >> (+377.95) >> +StringFormat.widthStringIntFormat avgt 15 101.814 ? 3.147 ns/op >> (+206.91) > > 温绍锦 has updated the pull request incrementally with one additional commit > since the last revision: > > code format src/java.base/share/classes/java/util/Formatter.java line 2935: > 2933: for (int size = 0; off < max; c = s.charAt(++off), size++) { > 2934: if (!isDigit(c)) { > 2935: if (size > 0) { I would put the first iteration outside of this loop and have a loop which doesn't need to check for it - PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1338342542
Re: RFR: 8291966: SwitchBootstrap.typeSwitch could be faster [v2]
On Tue, 30 May 2023 09:32:02 GMT, Jan Lahoda wrote: >> @forax >> >> Hi! Sorry for this sudden message, but this one captured my attention >> >>> and subtype checks are usually fast. >> >> And I hope this PR to be the right place to raise this. >> >> I was looking this PR to better understand how it applies to secondary >> supers and https://bugs.openjdk.org/browse/JDK-8180450 (that's still WIP) >> doesn't look like subtype checks are yet fast nor scalable (with multiple >> interfaces and > 1 receiver types observed) - see >> https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckscalabilitybenchmark >> that can be modified to use type switch too. >> >> In addition, at least for secondary types, a missed `IsInstance` (ie which >> type isn't implemented) can cost O(n) over the secondary supers types (see >> https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckslowpathbenchmark) >> that's still not ideal, due to the high bootstrapping cost of `prnz scas`. >> >> Hence, the implementation of type switch is likely to account for the >> existing (performance) deficiencies of the secondary supers type check or is >> relying on the fix https://bugs.openjdk.org/browse/JDK-8180450 that will >> appear at some point? >> >> Hope to have interacted in the right way with this with the JDK dev >> community, and thanks again for your hard work on this wonderful piece of >> engineering! > > @franz1981, thanks for your comment. I am afraid I don't know much about > JDK-8180450, but I suspect that it will affect the (type) switch lookup. > Please correct me if I am wrong, but it seems this patch is still an > improvement over the current state, and when JDK-8180450 is resolved, it > should automatically improve the type switch lookup as well. So, this patch > still seems useful to me, with a potential for future improvements, both > inside and outside type switch bootstrap itself. @lahodaj > So, this patch still seems useful to me, with a potential for future > improvements, both inside and outside type switch bootstrap itself. Yep, it is indeed still valuable: sorry if I didn't make it clear. My point is more related the need of the amount of type checks in some cases (reducing the impact of JDK-8180450); if a typecheck-like op is still required to route to some case (and which cost depends by many factors really), any target `checkcast` could be maybe saved. I'm currently unaware if the current state of this PR bring any improvement on this front too (or if is even possible), but this would be welcome regardless JDK-8180450 and much more if considering the mentioned issue too. This can be left for a follow-up PR too, obviously. I'm bringing the attention to this exactly because regardless JDK-8180450, secondary supers negative type check still have costs that would be awesome if could be addressed while exposing this feature to final users. - PR Comment: https://git.openjdk.org/jdk/pull/9779#issuecomment-1570214110
Re: RFR: 8291966: SwitchBootstrap.typeSwitch could be faster [v2]
On Tue, 2 May 2023 13:57:37 GMT, Rémi Forax wrote: >> Jan Lahoda has updated the pull request with a new target base due to a >> merge or a rebase. The incremental webrev excludes the unrelated changes >> brought in by the merge/rebase. The pull request contains four additional >> commits since the last revision: >> >> - Adding comments >> - Improving performance >> - Merge branch 'master' into JDK-8291966 >> - 8291966: SwitchBootstrap.typeSwitch could be faster > > Also there is a lot of use cases where the type switch is called only with > instances of the same class, for those scenarii, the VM should be able to > fully remove the type switch and inline the body of the corresponding case > like this is done with a virtual call. > > I don't think there is a way currently to explain to the VM that the a hash > map really acts as a cache so if the key is a constant then the value is a > constant too (this optimization is also missing when JITing ClassValue.get()). @forax Hi! Sorry for this sudden message, but this one captured my attention > and subtype checks are usually fast. And I hope this PR to be the right place to raise this. I was looking this PR to better understand how it applies to secondary supers and https://bugs.openjdk.org/browse/JDK-8180450 (that's still WIP) doesn't look like subtype checks are yet fast nor scalable (with multiple interfaces and > 1 receiver types observed) - see https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckscalabilitybenchmark that can be modified to use type switch too. In addition, at least for secondary types, a missed `IsInstance` (ie which type isn't implemented) can cost O(n) over the secondary supers types (see https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckslowpathbenchmark) that's still not ideal, due to the high bootstrapping cost of `prnz scas`. Hence, the implementation of type switch is likely to account for the existing (performance) deficiencies of the secondary supers type check or is relying on the fix https://bugs.openjdk.org/browse/JDK-8180450 that will appear at some point? Hope to have interacted in the right way with this with the JDK dev community, and thanks again for your hard work on this wonderful piece of engineering! - PR Comment: https://git.openjdk.org/jdk/pull/9779#issuecomment-1566691245
Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test [v2]
On Wed, 15 Mar 2023 14:56:28 GMT, Eirik Bjorsnos wrote: >> Eirik Bjorsnos has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Update StringLatin1.canEncode to sync with same test in CharacterData.of > > Just for fun, I tried with a benchmark where the code point is Latin1 in > every other call: > > > @Benchmark > public void isDigitVarying(Blackhole blackhole) { > blackhole.consume(Character.isDigit(48)); > blackhole.consume(Character.isDigit(1632)); > } > > > With this benchmark, there is no difference between the baseline, the PR and > using StringLatin1.canEncode: > > Baseline: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitVarying 1632 avgt 15 1.198 ± 0.056 ns/op > > > PR: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitVarying 1632 avgt 15 1.195 ± 0.058 ns/op > > > StringLatin1.canEncode: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitVarying 1632 avgt 15 1.193 ± 0.055 ns/op > ``` > > At this point, I'm starting to wonder a bit if the performance benefits > suggested by this PR might be dubious and will only surface in very narrow > benchmarks. On the other hand, it does not seem harmful either. What do > people think? @eirbjo I suggest to add `-prof perfasm` and see what's going on; I suspect it's better that inputs are loaded from `@State` field...and don't use `BlackHole` but combine the results of the 2 operations (although I remember that compiler assisted black holes are not as noisy as I'm used to observe with JDK 11 ones) - PR: https://git.openjdk.org/jdk/pull/13040
Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test
On Wed, 15 Mar 2023 13:42:22 GMT, Eirik Bjorsnos wrote: >> Can you check what happen adding much more inputs to the dataset including >> non-latin chars as well and use `-prof perfnorm` to check what `perf` report >> re branches/branch-misses? >> >> You can use `SplittableRandom` to pre-populate an array of inputs which >> sequence is "random" but still allow deterministic benchmarking and feed the >> benchmark method by cycling the pre-computed inputs. >> In the real world I expect `isDigit` to happen on different input types and >> both having C2 with both branches places based on prev inputs distribution >> and a confused branch-predictor to allow comparing vs something that looks a >> bit nearest to the real world (TBD, I know). >> I expect in that case that a single cmp + mask to work better depending on >> latin input distribution/occurrence > > I created a randomized version of `Characters.isDigit` which tests with code > points picked at random such that any category (Latin1, negative, different > planes, unassiged) are equally probable. > > Baseline: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitRandom 1632 avgt 15 5.503 ± 0.371 ns/op > > > Current PR: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitRandom 1632 avgt 15 5.393 ± 0.336 ns/op > > > Using StringLatin1.canEncode: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigitRandom 1632 avgt 15 5.377 ± 0.322 ns/op > > > Seems the PR still has a small improvement for this scenario. The > StringLatin1.canEncode regression disappears. > > In the real world ASCII/Latin1 seems to dominate most data, so this scenario > is perhaps not very realistic. > > I'm running this on a Mac, so cannot try `-prof perfnorm`. Many thanks to have tried, yep, I was curious indeed re the "StringLatin1.canEncode regression" case. I would still modify the benchmark to use inputs (I know that will make it memory bound sadly, due to reading inputs - but the size of such inputs can be a benchmark parameter, together with the bias eg "latin","mix", "non-latin") "semi-randomly" generated based on the mentioned strategies/biases. It will benefit future tests on this, although could be provided as a separate PR. - PR: https://git.openjdk.org/jdk/pull/13040
Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test
On Wed, 15 Mar 2023 12:28:05 GMT, Eirik Bjorsnos wrote: >>> `if (ch && 0xFF00 == 0) {` >> >> This seems to perform similar to baseline: >> >> >> Benchmark (codePoint) Mode Cnt Score Error Units >> Characters.isDigit 48 avgt 15 0.890 ± 0.025 ns/op >> Characters.isDigit 1632 avgt 15 2.174 ± 0.011 ns/op >> >> >> Would be interesting to check the performance on non-Intel architectures. If >> you want to give it a spin on your M1, here's the benchmark command I used: >> >> `make test TEST='micro:java.lang.Characters.isDigit' MICRO="OPTIONS=-p >> codePoint=48,1632"` > >> It seems reasonable to keep these two in sync, yes. (`CharacterData.of` >> could even call into `StringLatin1.canEncode`, unless that's cause for some >> performance anomaly) > > If I update `StringLatin1.canEncode` and call into that from > `CharacterData.of`, I observe no regression for the Latin1 case, but a > significant regression for the non-Latin1 case. I have no idea how to explain > that: > > > Benchmark (codePoint) Mode Cnt Score Error Units > Characters.isDigit 48 avgt 15 0.675 ± 0.029 ns/op > Characters.isDigit 1632 avgt 15 2.435 ± 0.032 ns/op Can you check what happen adding much more inputs to the dataset that includes non-latin chars as well and use `-prof perfnorm` to check what `perf` report re branches/branch-misses? You can use SplittableRandom to pre-populate an array of inputs which sequence is "random" but still allow deterministic benchmarking and feed the benchmark method cycling the pre-computed inputs - PR: https://git.openjdk.org/jdk/pull/13040
Re: RFR: 8301958: Reduce Arrays.copyOf/-Range overheads [v7]
On Wed, 8 Feb 2023 00:07:14 GMT, Claes Redestad wrote: >> This patch adds special-cases to `Arrays.copyOf` and `Arrays.copyOfRange` to >> clone arrays when `newLength` or range inputs span the input array. This >> helps eliminate range checks and has been verified to help various String >> operations. Example: >> >> Baseline >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 16.817 ± 0.369 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 16.866 ± 0.449 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 22.198 ± 0.396 ns/op >> >> Patch: >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 14.666 ± 0.336 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 14.582 ± 0.288 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 20.339 ± 0.328 ns/op > > Claes Redestad has updated the pull request incrementally with one additional > commit since the last revision: > > Minimize, force inline, generalize test/micro/org/openjdk/bench/java/lang/StringConstructor.java line 40: > 38: > 39: @Param({"0", "7", "64"}) > 40: public int size; I suggest to add the param `offset` for future experiment: together with `perfasm` it helps to check how different stubs are used and emulate the different branches of the optimized code - PR: https://git.openjdk.org/jdk/pull/12453
Re: RFR: 8301958: Reduce Arrays.copyOfRange overheads [v6]
On Tue, 7 Feb 2023 22:43:15 GMT, Claes Redestad wrote: >> This adds a local, specialized `copyBytes` method to `String` that avoids >> certain redundant range checks and clamping that JIT has issues removing >> fully. >> >> This has a small but statistically significant effect on `String` >> microbenchmarks, eg.: >> >> Baseline >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 16.817 ± 0.369 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 16.866 ± 0.449 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 22.198 ± 0.396 ns/op >> >> >> Patch: >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 15.477 ± 0.342 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 15.557 ± 0.352 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 21.272 ± 0.398 ns/op >> >> >> Care has to be taken to ensure preconditions have been checked when using >> `checkBytes`. In the case of `String(AbstractStringBuilder)` there's a >> possible pre-existing issue where the constructor might either throw an >> exception or truncate the buffer if the builder byte array and length is not >> in agreement (theoretically possible if you clear/remove and call >> `trimToSize()` concurrently). Adding an explicit check here seem to be the >> right thing to do regardless of this RFE. > > Claes Redestad has updated the pull request incrementally with two additional > commits since the last revision: > > - Apply @franz1981's idea but directly to Arrays.copyOfRange, factoring out > range checks and helping JIT pick the best arraycopy adapter > - Back out all changes except micro LGTM - Marked as reviewed by franz1...@github.com (no known OpenJDK username). PR: https://git.openjdk.org/jdk/pull/12453
Re: RFR: 8301958: Avoid Arrays.copyOfRange overhead in java.lang.String [v5]
On Tue, 7 Feb 2023 20:32:11 GMT, Claes Redestad wrote: >> src/java.base/share/classes/java/lang/String.java line 698: >> >>> 696: } >>> 697: >>> 698: static byte[] copyBytes(byte[] bytes, int offset, int length) { >> >> Given that the stub generated for array copy seems highly dependent by the >> call site constrains, did you tried adding a check for offset == 0 and/or >> length == bytes.length? >> >> If (offset == 0 && bytes.length == length) { >> System.arrayCopy(bytes, 0, dst, 0, bytes.length); >> // etc etc the other combinations >> >> This should have different generated stubs with much smaller ASM depending >> by the enforced constrains (and shouldn't affect terribly the code size of >> the method, given that the stub won't be inlined AFAIK) >> >> Beware, as noted by others, I'm not suggesting that's the way to fix this, >> but it would be interesting to check how much perf we leave on the ground >> due to the this supposed "inefficient" stub generation (if that's the issue). > > I did some quick experiments but saw no clear win from doing anything like > this here. Feel free to experiment and see if there's some particular > configuration that comes out ahead. > > FTR I did not intend for this RFE to solve > https://bugs.openjdk.org/browse/JDK-8295496 completely, but provide a small, > partial win that might possibly clear a path to solving that likely > orthogonal issue. I've created a separate benchmark for this (named as your by accident - given that I've used it as a blueprint): https://gist.github.com/franz1981/658c2bf6796aab4ae04a84bef1ef34b6 results are Benchmark (offset) (size) Mode Cnt Score Error Units StringConstructor.arrayCopy 0 7 avgt 10 9.519 ± 0.131 ns/op StringConstructor.arrayCopy 1 7 avgt 10 9.194 ± 0.232 ns/op StringConstructor.copyOf 0 7 avgt 10 11.548 ± 0.133 ns/op StringConstructor.copyOf 1 7 avgt 10 9.812 ± 0.018 ns/op StringConstructor.optimizedArrayCopy 0 7 avgt 10 6.854 ± 0.355 ns/op< THAT'S COOL StringConstructor.optimizedArrayCopy 1 7 avgt 10 9.088 ± 0.049 ns/op the optimized array copy is helping C2 on stub generation. I didn't checked yet if this applies to the `String` case and I didn't created a long enough dataset array to check the effects on the branch predictor with the newly introduced conditions too, but in term of generated stub, there's a difference. - PR: https://git.openjdk.org/jdk/pull/12453
Re: RFR: 8301958: Avoid Arrays.copyOfRange overhead in java.lang.String [v5]
On Tue, 7 Feb 2023 15:25:05 GMT, Claes Redestad wrote: >> This adds a local, specialized `copyBytes` method to `String` that avoids >> certain redundant range checks and clamping that JIT has issues removing >> fully. >> >> This has a small but statistically significant effect on `String` >> microbenchmarks, eg.: >> >> Baseline >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 16.817 ± 0.369 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 16.866 ± 0.449 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 22.198 ± 0.396 ns/op >> >> >> Patch: >> >> Benchmark(size) Mode Cnt >> Score Error Units >> StringConstructor.newStringFromArray 7 avgt 15 >> 15.477 ± 0.342 ns/op >> StringConstructor.newStringFromArrayWithCharset 7 avgt 15 >> 15.557 ± 0.352 ns/op >> StringConstructor.newStringFromArrayWithCharsetName 7 avgt 15 >> 21.272 ± 0.398 ns/op >> >> >> Care has to be taken to ensure preconditions have been checked when using >> `checkBytes`. In the case of `String(AbstractStringBuilder)` there's a >> possible pre-existing issue where the constructor might either throw an >> exception or truncate the buffer if the builder byte array and length is not >> in agreement (theoretically possible if you clear/remove and call >> `trimToSize()` concurrently). Adding an explicit check here seem to be the >> right thing to do regardless of this RFE. > > Claes Redestad has updated the pull request incrementally with one additional > commit since the last revision: > > copyrights Thanks @cl4es to look into this! src/java.base/share/classes/java/lang/String.java line 698: > 696: } > 697: > 698: static byte[] copyBytes(byte[] bytes, int offset, int length) { Given that the stub generated for array copy seems highly dependent by the call site constrains, did you tried adding a check for offset == 0 and/or length == bytes.length? If (offset == 0 && bytes.length == length) { System.arrayCopy(bytes, 0, dst, bytes.length); // etc etc the other combinations This should have different generated stubs with much smaller ASM depending by the enforced constrains - Changes requested by franz1...@github.com (no known OpenJDK username). PR: https://git.openjdk.org/jdk/pull/12453