Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-25 Thread Claes Redestad
On Wed, 24 Apr 2024 10:05:27 GMT, Aleksey Shipilev  wrote:

>>> I really wish this change was not done with ClassFile API, but with a 
>>> simple bundled ASM, so it would be easily backportable, if we decide to. It 
>>> does not look like CFA buys us a lot here readability/complexity wise: 
>>> [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
>> 
>> I would be open to splitting out and pushing the ASM version first and do 
>> this CFA port as a follow-up.
>
>> > I really wish this change was not done with ClassFile API, but with a 
>> > simple bundled ASM, so it would be easily backportable, if we decide to. 
>> > It does not look like CFA buys us a lot here readability/complexity wise: 
>> > [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
>> 
>> I would be open to splitting out and pushing the ASM version first and do 
>> this CFA port as a follow-up.
> 
> That would be good, thanks.

@shipilev @mlchung opened new PR #18953 for pushing the ASM-based version. 
Adapted applicable code changes from this PR so it should be equivalent in 
behavior.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2077319721


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Aleksey Shipilev
On Wed, 24 Apr 2024 10:01:47 GMT, Claes Redestad  wrote:

> > I really wish this change was not done with ClassFile API, but with a 
> > simple bundled ASM, so it would be easily backportable, if we decide to. It 
> > does not look like CFA buys us a lot here readability/complexity wise: 
> > [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
> 
> I would be open to splitting out and pushing the ASM version first and do 
> this CFA port as a follow-up.

That would be good, thanks.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2074585421


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Claes Redestad
On Wed, 24 Apr 2024 09:57:42 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java
>   
>   Co-authored-by: Mandy Chung 

> I really wish this change was not done with ClassFile API, but with a simple 
> bundled ASM, so it would be easily backportable, if we decide to. It does not 
> look like CFA buys us a lot here readability/complexity wise: 
> [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)

I would be open to splitting out and pushing the ASM version first and do this 
CFA port as a follow-up.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2074577669


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Louis Wasserman
On Sun, 14 Apr 2024 14:33:26 GMT, Claes Redestad  wrote:

>> What are the scenarios which had regressions? 
>> Given the conservative growth for StringBuilder, it surprises me a bit that 
>> any scenario would regress.
>
> I took a second look and it turns out that there were neither regressions nor 
> improvements in my test, only a few false positives: For C2 the SB chain is 
> recognized by the (fragile) C2 OptimizeStringConcat pass and transformed into 
> a shape where the initial size in java bytecode - if any - is ignored.
> 
> With C1 having an initial size can have a significant effect. One way to 
> provoke a regression there is to have a huge constant suffix while 
> underestimating the size of the operands, which can lead to excessive 
> allocation:
> 
> 
> Name Cnt BaseError   TestError   
> Unit  Change
> StringConcat.concat23StringConst   5  385,268 ±  5,238341,251 ±  2,606  
> ns/op   1,13x (p = 0,000*)
>   :gc.alloc.rate 6039,095 ± 82,309  12764,169 ± 97,146 
> MB/sec   2,11x (p = 0,000*)
>   :gc.alloc.rate.norm2440,003 ±  0,000   4568,002 ±  0,000   
> B/op   1,87x (p = 0,000*)
> 
> 
> Still a bit better on throughput from less actual copying.
> 
> *If* the `StringBuilder` is sized sufficiently (constants + args * N) things 
> look much better, of course: 
> 
> -XX:TieredStopAtLevel=1
> Name Cnt BaseError  TestError   
> Unit  Change
> StringConcat.concat23StringConst   5  385,268 ±  5,238   259,628 ±  2,515  
> ns/op   1,48x (p = 0,000*)
>   :gc.alloc.rate 6039,095 ± 82,309  8902,803 ± 86,563 
> MB/sec   1,47x (p = 0,000*)
>   :gc.alloc.rate.norm2440,003 ±  0,000  2424,002 ±  0,000   
> B/op   0,99x (p = 0,000*)
> 
> 
> For most cases having a size based on the sum of the constants plus some 
> small factor per argument seem to be a decent heuristic - for C1. Since this 
> adds just one bytecode to the generated method I guess it wouldn't hurt.

Presizing this StringBuilder is a thing I looked into once upon a time, 
discussed here: 
https://mail.openjdk.org/pipermail/compiler-dev/2015-March/009356.html  This 
work, I understand, the indyfication of string concatenation in the first place.

Primitive values can get bounded at appropriate lengths for their types (see 
e.g. https://stackoverflow.com/a/21146952/869736).  It sounds like you're 
trying to conserve bytecode length, so maybe you don't want to presize the 
StringBuilder with the exact Object.toString() lengths, though.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1576819289


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Update src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java
  
  Co-authored-by: Mandy Chung 

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/83f4048f..e7cbaaf5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18690=06
 - incr: https://webrevs.openjdk.org/?repo=jdk=18690=05-06

  Stats: 5 lines in 1 file changed: 1 ins; 2 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690