Re: RFR: 8329331: Intrinsify Unsafe::setMemory [v6]

2024-04-06 Thread Francesco Nigro
On Sun, 7 Apr 2024 01:49:01 GMT, Dean Long  wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Oops
>
> I went ahead and tried a pure-Java implementation, and it is faster for small 
> sizes (up to 8) and only about 1.5x slower for larger sizes, so that might 
> make for an interesting fallback if there is no customized assembler 
> implementation available or if the size is known to me small.
> 
> Ideally, I think we would want C2 to be more aware of setMemory stores, so 
> that it can remove redundant stores, like it does with InitializeNode.

@dean-long in my old PR I have done the same, choosing a (not yet) configurable 
cutoff value. 

See https://github.com/openjdk/jdk/pull/16760

-

PR Comment: https://git.openjdk.org/jdk/pull/18555#issuecomment-2041314429


Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

2024-03-23 Thread Francesco Nigro
On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 46 commits:
> 
>  - Merge branch 'openjdk:master' into indexof
>  - Cleaned up, ready for review
>  - Pre-cleanup code
>  - Add JMH. Add 16-byte compares to arrays_equals
>  - Better method for mask creation
>  - Merge branch 'openjdk:master' into indexof
>  - Most cleanup done.
>  - Remove header dependency
>  - Works - needs cleanup
>  - Passes tests.
>  - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12

Sure thing: https://github.com/netty/netty/pull/13534#issuecomment-1685247165

It's the comparison with String::indexOf while the impl is: 
https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L590

-

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016576139


Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

2024-03-23 Thread Francesco Nigro
On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons  wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark   Score
>> Latest  
>> StringIndexOf.advancedWithMediumSub   343.573317.934 
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub11039.081  1053.96 
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub255.828110.541 
>> 1.980027943x
>> StringIndexOf.constantPattern9.361   11.906  
>> 1.271872663x
>> StringIndexOf.searchCharLongSuccess  4.216   4.218   
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess3.133   3.216   
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.763.761   
>> 1.000265957x
>> StringIndexOf.success9.186   
>> 9.713   1.057369911x
>> StringIndexOf.successBig   14.34146.343  
>> 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918  12154.52
>> 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556  5540.044
>> 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854  6818.689
>> 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499  5474.624
>> 0.967675646x
>> StringIndexOfChar.latin1_Short_String  7132.541  
>> 6863.3590.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389 16162.437 
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String  7386.12314771.622 
>> 1.15517x
>> StringIndexOfChar.latin1_mixed_char9901.671  9782.245
>> 0.987938803
>
> Scott Gibbons has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 46 commits:
> 
>  - Merge branch 'openjdk:master' into indexof
>  - Cleaned up, ready for review
>  - Pre-cleanup code
>  - Add JMH. Add 16-byte compares to arrays_equals
>  - Better method for mask creation
>  - Merge branch 'openjdk:master' into indexof
>  - Most cleanup done.
>  - Remove header dependency
>  - Works - needs cleanup
>  - Passes tests.
>  - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12

Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, 
which is manually loop unrolling the head processing (first < 8 bytes) to 
artificially make sure the branch predictor got different branches to care AND 
JIT won't make it wrong. We have measured (I can provide a link of the 
benchmark and results, If you are interested) that it delivers a much better 
performance on tiny strings and makes a smoother  degradation vs perfectly 
aligned string length as well. Clearly this tends to be much visible if the 
input strings have shuffled delimiter positions, to make the branch prediction 
misses more relevant.

-

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016432345


Re: RFR: 8316704: Regex-free parsing of Formatter and FormatProcessor specifiers

2023-10-16 Thread Francesco Nigro
On Sun, 17 Sep 2023 16:01:33 GMT, Shaojin Wen  wrote:

> @cl4es made performance optimizations for the simple specifiers of 
> String.format in PR https://github.com/openjdk/jdk/pull/2830. Based on the 
> same idea, I continued to make improvements. I made patterns like %2d %02d 
> also be optimized.
> 
> The following are the test results based on MacBookPro M1 Pro: 
> 
> 
> -Benchmark  Mode  Cnt Score Error  Units
> -StringFormat.complexFormat avgt   15  1862.233 ? 217.479  ns/op
> -StringFormat.int02Format   avgt   15   312.491 ?  26.021  ns/op
> -StringFormat.intFormat avgt   1584.432 ?   4.145  ns/op
> -StringFormat.longFormatavgt   1587.330 ?   6.111  ns/op
> -StringFormat.stringFormat  avgt   1563.985 ?  11.366  ns/op
> -StringFormat.stringIntFormat   avgt   1587.422 ?   0.147  ns/op
> -StringFormat.widthStringFormat avgt   15   250.740 ?  32.639  ns/op
> -StringFormat.widthStringIntFormat  avgt   15   312.474 ?  16.309  ns/op
> 
> +Benchmark  Mode  CntScoreError  Units
> +StringFormat.complexFormat avgt   15  740.626 ? 66.671  ns/op 
> (+151.45)
> +StringFormat.int02Format   avgt   15  131.049 ?  0.432  ns/op 
> (+138.46)
> +StringFormat.intFormat avgt   15   67.229 ?  4.155  ns/op 
> (+25.59)
> +StringFormat.longFormatavgt   15   66.444 ?  0.614  ns/op 
> (+31.44)
> +StringFormat.stringFormat  avgt   15   62.619 ?  4.652  ns/op (+2.19)
> +StringFormat.stringIntFormat   avgt   15   89.606 ? 13.966  ns/op (-2.44)
> +StringFormat.widthStringFormat avgt   15   52.462 ? 15.649  ns/op 
> (+377.95)
> +StringFormat.widthStringIntFormat  avgt   15  101.814 ?  3.147  ns/op 
> (+206.91)

src/java.base/share/classes/java/util/Formatter.java line 2935:

> 2933: for (int size = 0; off < max; c = s.charAt(++off), size++) {
> 2934: if (!isDigit(c)) {
> 2935: if (size > 0) {

I would put the first iteration outside of this loop and have a loop which 
doesn't need to check for it

-

PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1338342542


Re: RFR: 8316704: Regex-free parsing of Formatter and FormatProcessor specifiers [v11]

2023-09-27 Thread Francesco Nigro
On Wed, 27 Sep 2023 09:35:47 GMT, 温绍锦  wrote:

>> @cl4es made performance optimizations for the simple specifiers of 
>> String.format in PR https://github.com/openjdk/jdk/pull/2830. Based on the 
>> same idea, I continued to make improvements. I made patterns like %2d %02d 
>> also be optimized.
>> 
>> The following are the test results based on MacBookPro M1 Pro: 
>> 
>> 
>> -Benchmark  Mode  Cnt Score Error  Units
>> -StringFormat.complexFormat avgt   15  1862.233 ? 217.479  ns/op
>> -StringFormat.int02Format   avgt   15   312.491 ?  26.021  ns/op
>> -StringFormat.intFormat avgt   1584.432 ?   4.145  ns/op
>> -StringFormat.longFormatavgt   1587.330 ?   6.111  ns/op
>> -StringFormat.stringFormat  avgt   1563.985 ?  11.366  ns/op
>> -StringFormat.stringIntFormat   avgt   1587.422 ?   0.147  ns/op
>> -StringFormat.widthStringFormat avgt   15   250.740 ?  32.639  ns/op
>> -StringFormat.widthStringIntFormat  avgt   15   312.474 ?  16.309  ns/op
>> 
>> +Benchmark  Mode  CntScoreError  Units
>> +StringFormat.complexFormat avgt   15  740.626 ? 66.671  ns/op 
>> (+151.45)
>> +StringFormat.int02Format   avgt   15  131.049 ?  0.432  ns/op 
>> (+138.46)
>> +StringFormat.intFormat avgt   15   67.229 ?  4.155  ns/op 
>> (+25.59)
>> +StringFormat.longFormatavgt   15   66.444 ?  0.614  ns/op 
>> (+31.44)
>> +StringFormat.stringFormat  avgt   15   62.619 ?  4.652  ns/op 
>> (+2.19)
>> +StringFormat.stringIntFormat   avgt   15   89.606 ? 13.966  ns/op 
>> (-2.44)
>> +StringFormat.widthStringFormat avgt   15   52.462 ? 15.649  ns/op 
>> (+377.95)
>> +StringFormat.widthStringIntFormat  avgt   15  101.814 ?  3.147  ns/op 
>> (+206.91)
>
> 温绍锦 has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   code format

src/java.base/share/classes/java/util/Formatter.java line 2935:

> 2933: for (int size = 0; off < max; c = s.charAt(++off), size++) {
> 2934: if (!isDigit(c)) {
> 2935: if (size > 0) {

I would put the first iteration outside of this loop and have a loop which 
doesn't need to check for it

-

PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1338342542


Re: RFR: 8291966: SwitchBootstrap.typeSwitch could be faster [v2]

2023-05-31 Thread Francesco Nigro
On Tue, 30 May 2023 09:32:02 GMT, Jan Lahoda  wrote:

>> @forax 
>> 
>> Hi! Sorry for this sudden message, but this one captured my attention
>> 
>>> and subtype checks are usually fast.
>> 
>> And I hope this PR to be the right place to raise this.
>> 
>> I was looking this PR to better understand how it applies to secondary 
>> supers and https://bugs.openjdk.org/browse/JDK-8180450 (that's still WIP) 
>> doesn't look like subtype checks are yet fast nor scalable (with multiple 
>> interfaces and > 1 receiver types observed) - see 
>> https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckscalabilitybenchmark
>>  that can be modified to use type switch too.
>>  
>> In addition, at least for secondary types, a missed `IsInstance` (ie which 
>> type isn't implemented) can cost O(n) over the secondary supers types (see 
>> https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckslowpathbenchmark)
>>  that's still not ideal, due to the high bootstrapping cost of `prnz scas`.
>> 
>> Hence, the implementation of type switch is likely to account for the 
>> existing (performance) deficiencies of the secondary supers type check or is 
>> relying on the fix https://bugs.openjdk.org/browse/JDK-8180450 that will 
>> appear at some point?
>> 
>> Hope to have interacted in the right way with this with the JDK dev 
>> community, and thanks again for your hard work on this wonderful piece of 
>> engineering!
>
> @franz1981, thanks for your comment. I am afraid I don't know much about 
> JDK-8180450, but I suspect that it will affect the (type) switch lookup. 
> Please correct me if I am wrong, but it seems this patch is still an 
> improvement over the current state, and when JDK-8180450 is resolved, it 
> should automatically improve the type switch lookup as well. So, this patch 
> still seems useful to me, with a potential for future improvements, both 
> inside and outside type switch bootstrap itself.

@lahodaj 

> So, this patch still seems useful to me, with a potential for future 
> improvements, both inside and outside type switch bootstrap itself.

Yep, it is indeed still valuable: sorry if I didn't make it clear.
My point is more related the need of the amount of type checks in some cases 
(reducing the impact of JDK-8180450); if a typecheck-like op is still required  
to route to some case (and which cost depends by many factors really), any 
target `checkcast` could be maybe saved. 
I'm currently unaware if the current state of this PR bring any improvement on 
this front too (or if is even possible), but this would be welcome regardless  
JDK-8180450 and much more if considering the mentioned issue too.
This can be left for a follow-up PR too, obviously. I'm bringing the attention 
to this exactly because regardless JDK-8180450, secondary supers negative type 
check still have costs that would be awesome if could be addressed while 
exposing this feature to final users.

-

PR Comment: https://git.openjdk.org/jdk/pull/9779#issuecomment-1570214110


Re: RFR: 8291966: SwitchBootstrap.typeSwitch could be faster [v2]

2023-05-29 Thread Francesco Nigro
On Tue, 2 May 2023 13:57:37 GMT, Rémi Forax  wrote:

>> Jan Lahoda has updated the pull request with a new target base due to a 
>> merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains four additional 
>> commits since the last revision:
>> 
>>  - Adding comments
>>  - Improving performance
>>  - Merge branch 'master' into JDK-8291966
>>  - 8291966: SwitchBootstrap.typeSwitch could be faster
>
> Also there is a lot of use cases where the type switch is called only with 
> instances of the same class, for those scenarii, the VM should be able to 
> fully remove the type switch and inline the body of the corresponding case 
> like this is done with a virtual call.
> 
> I don't think there is a way currently to explain to the VM that the a hash 
> map really acts as a cache so if the key is a constant then the value is a 
> constant too (this optimization is also missing when JITing ClassValue.get()).

@forax 

Hi! Sorry for this sudden message, but this one captured my attention

> and subtype checks are usually fast.

And I hope this PR to be the right place to raise this.

I was looking this PR to better understand how it applies to secondary supers 
and https://bugs.openjdk.org/browse/JDK-8180450 (that's still WIP) doesn't look 
like subtype checks are yet fast nor scalable (with multiple interfaces and > 1 
receiver types observed) - see 
https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckscalabilitybenchmark
 that can be modified to use type switch too.
 
In addition, at least for secondary types, a missed `IsInstance` (ie which type 
isn't implemented) can cost O(n) over the secondary supers types (see 
https://ionutbalosin.com/2023/03/jvm-performance-comparison-for-jdk-17/#typecheckslowpathbenchmark)
 that's still not ideal, due to the high bootstrapping cost of `prnz scas`.

Hence, the implementation of type switch is likely to account for the existing 
(performance) deficiencies of the secondary supers type check or is relying on 
the fix https://bugs.openjdk.org/browse/JDK-8180450 that will appear at some 
point?

Hope to have interacted in the right way with this with the JDK dev community, 
and thanks again for your hard work on this wonderful piece of engineering!

-

PR Comment: https://git.openjdk.org/jdk/pull/9779#issuecomment-1566691245


Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test [v2]

2023-03-15 Thread Francesco Nigro
On Wed, 15 Mar 2023 14:56:28 GMT, Eirik Bjorsnos  wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Update StringLatin1.canEncode to sync with same test in CharacterData.of
>
> Just for fun, I tried with a benchmark where the code point is Latin1 in 
> every other call:
> 
> 
> @Benchmark
> public void isDigitVarying(Blackhole blackhole) {
> blackhole.consume(Character.isDigit(48));
> blackhole.consume(Character.isDigit(1632));
> }
> 
> 
> With this benchmark, there is no difference between the baseline, the PR and 
> using StringLatin1.canEncode:
> 
> Baseline:
> 
> 
> Benchmark  (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitVarying 1632  avgt   15  1.198 ± 0.056  ns/op
> 
> 
> PR:
> 
> 
> Benchmark  (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitVarying 1632  avgt   15  1.195 ± 0.058  ns/op
> 
> 
> StringLatin1.canEncode:
> 
> 
> Benchmark  (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitVarying 1632  avgt   15  1.193 ± 0.055  ns/op
> ``` 
> 
> At this point, I'm starting to wonder a bit if the performance benefits 
> suggested by this PR might be dubious and will only surface in very narrow 
> benchmarks. On the other hand, it does not seem harmful either. What do 
> people think?

@eirbjo I suggest to add `-prof perfasm` and see what's going on; I suspect 
it's better that inputs are loaded from `@State` field...and don't use 
`BlackHole` but combine the results of the 2 operations (although I remember 
that compiler assisted black holes are not as noisy as I'm used to observe with 
JDK 11 ones)

-

PR: https://git.openjdk.org/jdk/pull/13040


Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test

2023-03-15 Thread Francesco Nigro
On Wed, 15 Mar 2023 13:42:22 GMT, Eirik Bjorsnos  wrote:

>> Can you check what happen adding much more inputs to the dataset including 
>> non-latin chars as well and use `-prof perfnorm` to check what `perf` report 
>> re branches/branch-misses?
>> 
>> You can use `SplittableRandom` to pre-populate an array of inputs which 
>> sequence is "random" but still allow deterministic benchmarking and feed the 
>> benchmark method by cycling the pre-computed inputs.
>> In the real world I expect `isDigit` to happen on different input types and 
>> both having C2 with both branches places based on prev inputs distribution 
>> and a confused branch-predictor to allow comparing vs something that looks a 
>> bit nearest to the real world (TBD, I know).
>> I expect in that case that a single cmp + mask to work better depending on 
>> latin input distribution/occurrence
>
> I created a randomized version of `Characters.isDigit` which tests with code 
> points picked at random such that any category (Latin1, negative, different 
> planes, unassiged) are equally probable.
> 
> Baseline:
> 
> 
> Benchmark (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitRandom 1632  avgt   15  5.503 ± 0.371  ns/op
> 
> 
> Current PR:
> 
> 
> Benchmark (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitRandom 1632  avgt   15  5.393 ± 0.336  ns/op
> 
> 
> Using StringLatin1.canEncode:
> 
> 
> Benchmark (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigitRandom 1632  avgt   15  5.377 ± 0.322  ns/op
> 
> 
> Seems the PR still has a small improvement for this scenario. The 
> StringLatin1.canEncode regression disappears.
> 
> In the real world ASCII/Latin1 seems to dominate most data, so this scenario 
> is perhaps not very realistic.
> 
> I'm running this on a Mac, so cannot try `-prof perfnorm`.

Many thanks to have tried, yep, I was curious indeed re the 
"StringLatin1.canEncode regression" case.
I would still modify the benchmark to use inputs (I know that will make it 
memory bound sadly, due to reading inputs - but the size of such inputs can be 
a benchmark parameter, together with the bias eg "latin","mix", "non-latin") 
"semi-randomly" generated based on the mentioned strategies/biases. 
It will benefit future tests on this, although could be provided as a separate 
PR.

-

PR: https://git.openjdk.org/jdk/pull/13040


Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test

2023-03-15 Thread Francesco Nigro
On Wed, 15 Mar 2023 12:28:05 GMT, Eirik Bjorsnos  wrote:

>>> `if (ch && 0xFF00 == 0) {`
>> 
>> This seems to perform similar to baseline:
>> 
>> 
>> Benchmark   (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigit   48  avgt   15  0.890 ± 0.025  ns/op
>> Characters.isDigit 1632  avgt   15  2.174 ± 0.011  ns/op
>> 
>> 
>> Would be interesting to check the performance on non-Intel architectures. If 
>> you want to give it a spin on your M1, here's the benchmark command I used:
>> 
>> `make test TEST='micro:java.lang.Characters.isDigit' MICRO="OPTIONS=-p 
>> codePoint=48,1632"`
>
>> It seems reasonable to keep these two in sync, yes. (`CharacterData.of` 
>> could even call into `StringLatin1.canEncode`, unless that's cause for some 
>> performance anomaly)
> 
> If I update `StringLatin1.canEncode` and call into that from 
> `CharacterData.of`, I observe no regression for the Latin1 case, but a 
> significant regression for the non-Latin1 case. I have no idea how to explain 
> that:
> 
> 
> Benchmark   (codePoint)  Mode  Cnt  Score   Error  Units
> Characters.isDigit   48  avgt   15  0.675 ± 0.029  ns/op
> Characters.isDigit 1632  avgt   15  2.435 ± 0.032  ns/op

Can you check what happen adding much more inputs to the dataset that includes 
non-latin chars as well and use `-prof perfnorm` to check what `perf` report re 
branches/branch-misses?

You can use SplittableRandom to pre-populate an array of inputs which sequence 
is "random" but still allow deterministic benchmarking and feed the benchmark 
method cycling the pre-computed inputs

-

PR: https://git.openjdk.org/jdk/pull/13040


Re: RFR: 8301958: Reduce Arrays.copyOf/-Range overheads [v7]

2023-02-08 Thread Francesco Nigro
On Wed, 8 Feb 2023 00:07:14 GMT, Claes Redestad  wrote:

>> This patch adds special-cases to `Arrays.copyOf` and `Arrays.copyOfRange` to 
>> clone arrays when `newLength` or range inputs span the input array. This 
>> helps eliminate range checks and has been verified to help various String 
>> operations. Example:
>> 
>> Baseline
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 16.817 ± 0.369  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 16.866 ± 0.449  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 22.198 ± 0.396  ns/op
>> 
>> Patch: 
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 14.666 ± 0.336  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 14.582 ± 0.288  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 20.339 ± 0.328  ns/op
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Minimize, force inline, generalize

test/micro/org/openjdk/bench/java/lang/StringConstructor.java line 40:

> 38: 
> 39:   @Param({"0", "7", "64"})
> 40:   public int size;

I suggest to add the param `offset` for future experiment: together with 
`perfasm` it helps to check how different stubs are used and emulate the 
different branches of the optimized code

-

PR: https://git.openjdk.org/jdk/pull/12453


Re: RFR: 8301958: Reduce Arrays.copyOfRange overheads [v6]

2023-02-07 Thread Francesco Nigro
On Tue, 7 Feb 2023 22:43:15 GMT, Claes Redestad  wrote:

>> This adds a local, specialized `copyBytes` method to `String` that avoids 
>> certain redundant range checks and clamping that JIT has issues removing 
>> fully.
>> 
>> This has a small but statistically significant effect on `String` 
>> microbenchmarks, eg.:
>> 
>> Baseline
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 16.817 ± 0.369  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 16.866 ± 0.449  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 22.198 ± 0.396  ns/op
>> 
>> 
>> Patch:
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 15.477 ± 0.342  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 15.557 ± 0.352  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 21.272 ± 0.398  ns/op
>> 
>> 
>> Care has to be taken to ensure preconditions have been checked when using 
>> `checkBytes`. In the case of `String(AbstractStringBuilder)` there's a 
>> possible pre-existing issue where the constructor might either throw an 
>> exception or truncate the buffer if the builder byte array and length is not 
>> in agreement (theoretically possible if you clear/remove and call 
>> `trimToSize()` concurrently). Adding an explicit check here seem to be the 
>> right thing to do regardless of this RFE.
>
> Claes Redestad has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Apply @franz1981's idea but directly to Arrays.copyOfRange, factoring out 
> range checks and helping JIT pick the best arraycopy adapter
>  - Back out all changes except micro

LGTM

-

Marked as reviewed by franz1...@github.com (no known OpenJDK username).

PR: https://git.openjdk.org/jdk/pull/12453


Re: RFR: 8301958: Avoid Arrays.copyOfRange overhead in java.lang.String [v5]

2023-02-07 Thread Francesco Nigro
On Tue, 7 Feb 2023 20:32:11 GMT, Claes Redestad  wrote:

>> src/java.base/share/classes/java/lang/String.java line 698:
>> 
>>> 696: }
>>> 697: 
>>> 698: static byte[] copyBytes(byte[] bytes, int offset, int length) {
>> 
>> Given that the stub generated for array copy seems highly dependent by the 
>> call site constrains, did you tried adding a check for offset == 0 and/or 
>> length == bytes.length?
>> 
>> If (offset == 0 && bytes.length == length) {
>> System.arrayCopy(bytes, 0, dst, 0, bytes.length);
>> // etc etc the other combinations 
>> 
>> This should have different generated stubs with much smaller ASM depending 
>> by the enforced constrains (and shouldn't affect terribly the code size of 
>> the method, given that the stub won't be inlined AFAIK)
>> 
>> Beware, as noted by others, I'm not suggesting that's the way to fix this, 
>> but it would be interesting to check how much perf we leave on the ground 
>> due to the this supposed "inefficient" stub generation (if that's the issue).
>
> I did some quick experiments but saw no clear win from doing anything like 
> this here. Feel free to experiment and see if there's some particular 
> configuration that comes out ahead.
> 
> FTR I did not intend for this RFE to solve 
> https://bugs.openjdk.org/browse/JDK-8295496 completely, but provide a small, 
> partial win that might possibly clear a path to solving that likely 
> orthogonal issue.

I've created a separate benchmark for this (named as your by accident - given 
that I've used it as a blueprint):
https://gist.github.com/franz1981/658c2bf6796aab4ae04a84bef1ef34b6
results are

Benchmark (offset)  (size)  Mode  Cnt   Score   
Error  Units
StringConstructor.arrayCopy  0   7  avgt   10   9.519 ± 
0.131  ns/op
StringConstructor.arrayCopy  1   7  avgt   10   9.194 ± 
0.232  ns/op
StringConstructor.copyOf 0   7  avgt   10  11.548 ± 
0.133  ns/op
StringConstructor.copyOf 1   7  avgt   10   9.812 ± 
0.018  ns/op
StringConstructor.optimizedArrayCopy 0   7  avgt   10   6.854 ± 
0.355  ns/op< THAT'S COOL
StringConstructor.optimizedArrayCopy 1   7  avgt   10   9.088 ± 
0.049  ns/op

the optimized array copy is helping C2 on stub generation.
I didn't checked yet if this applies to the `String` case and I didn't created 
a long enough dataset array to check the effects on the branch predictor with 
the newly introduced conditions too, but in term of generated stub, there's a 
difference.

-

PR: https://git.openjdk.org/jdk/pull/12453


Re: RFR: 8301958: Avoid Arrays.copyOfRange overhead in java.lang.String [v5]

2023-02-07 Thread Francesco Nigro
On Tue, 7 Feb 2023 15:25:05 GMT, Claes Redestad  wrote:

>> This adds a local, specialized `copyBytes` method to `String` that avoids 
>> certain redundant range checks and clamping that JIT has issues removing 
>> fully.
>> 
>> This has a small but statistically significant effect on `String` 
>> microbenchmarks, eg.:
>> 
>> Baseline
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 16.817 ± 0.369  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 16.866 ± 0.449  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 22.198 ± 0.396  ns/op
>> 
>> 
>> Patch:
>> 
>> Benchmark(size)  Mode  Cnt   
>> Score   Error  Units
>> StringConstructor.newStringFromArray  7  avgt   15  
>> 15.477 ± 0.342  ns/op
>> StringConstructor.newStringFromArrayWithCharset   7  avgt   15  
>> 15.557 ± 0.352  ns/op
>> StringConstructor.newStringFromArrayWithCharsetName   7  avgt   15  
>> 21.272 ± 0.398  ns/op
>> 
>> 
>> Care has to be taken to ensure preconditions have been checked when using 
>> `checkBytes`. In the case of `String(AbstractStringBuilder)` there's a 
>> possible pre-existing issue where the constructor might either throw an 
>> exception or truncate the buffer if the builder byte array and length is not 
>> in agreement (theoretically possible if you clear/remove and call 
>> `trimToSize()` concurrently). Adding an explicit check here seem to be the 
>> right thing to do regardless of this RFE.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   copyrights

Thanks @cl4es to look into this!

src/java.base/share/classes/java/lang/String.java line 698:

> 696: }
> 697: 
> 698: static byte[] copyBytes(byte[] bytes, int offset, int length) {

Given that the stub generated for array copy seems highly dependent by the call 
site constrains, did you tried adding a check for offset == 0 and/or length == 
bytes.length?

If (offset == 0 && bytes.length == length) {
System.arrayCopy(bytes, 0, dst, bytes.length);
// etc etc the other combinations 

This should have different generated stubs with much smaller ASM depending by 
the enforced constrains

-

Changes requested by franz1...@github.com (no known OpenJDK username).

PR: https://git.openjdk.org/jdk/pull/12453