Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v4]

2024-04-26 Thread Claes Redestad
On Fri, 26 Apr 2024 09:42:06 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix nits, use iconst as intended

Running another round of sanity testing (tier1+2) before integration.

-

PR Comment: https://git.openjdk.org/jdk/pull/18953#issuecomment-2079038010


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-26 Thread Claes Redestad
On Fri, 26 Apr 2024 08:45:45 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Copyright and comment about ProxyClassDumper

Thanks, nits resolved. Also some minor improvements (improved the for loop; use 
`iconst(mv, len)` as originally intended).
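For reference, the `iconst(mv, len)` helper referred to above is the usual ASM 
idiom for emitting the smallest instruction that loads a given int constant. A 
minimal sketch (standard ASM package names; not necessarily the patch's exact 
code):

    static void iconst(org.objectweb.asm.MethodVisitor mv, int cst) {
        if (cst >= -1 && cst <= 5) {
            // ICONST_M1 .. ICONST_5 are consecutive opcodes
            mv.visitInsn(org.objectweb.asm.Opcodes.ICONST_0 + cst);
        } else if (cst >= Byte.MIN_VALUE && cst <= Byte.MAX_VALUE) {
            mv.visitIntInsn(org.objectweb.asm.Opcodes.BIPUSH, cst);
        } else if (cst >= Short.MIN_VALUE && cst <= Short.MAX_VALUE) {
            mv.visitIntInsn(org.objectweb.asm.Opcodes.SIPUSH, cst);
        } else {
            mv.visitLdcInsn(cst);
        }
    }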

-

PR Comment: https://git.openjdk.org/jdk/pull/18953#issuecomment-2079016011


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v4]

2024-04-26 Thread Claes Redestad
> Splitting out the ASM-based version from #18690 to push that first under the 
> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up on 
> this as a subtask. See #18690 for more details, discussion, and motivation.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Fix nits, use iconst as intended

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18953/files
  - new: https://git.openjdk.org/jdk/pull/18953/files/591d2d1c..9b083f4b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=02-03

  Stats: 9 lines in 1 file changed: 1 ins; 3 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18953.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18953/head:pull/18953

PR: https://git.openjdk.org/jdk/pull/18953


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-26 Thread Aleksey Shipilev
On Fri, 26 Apr 2024 08:45:45 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Copyright and comment about ProxyClassDumper

Looks okay, thanks. Only cosmetic nits:

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 108:

> 106: static {
> 107: String highArity = VM.getSavedProperty("java.lang.invoke.StringConcat.highArityThreshold");
> 108: HIGH_ARITY_THRESHOLD = Integer.parseInt(highArity != null ? highArity : "20");

Maybe write this as `= (highArity != null) ? Integer.parseInt(highArity) : 20` 
to help the interpreter a bit.
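Applied to the quoted static initializer, the suggestion would read roughly as 
follows (a sketch; only the right-hand side changes):

    static {
        String highArity = VM.getSavedProperty("java.lang.invoke.StringConcat.highArityThreshold");
        HIGH_ARITY_THRESHOLD = (highArity != null) ? Integer.parseInt(highArity) : 20;
    }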

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1105:

> 1103: for (int c = 0; c < constants.length; c++) {
> 1104: if (constants[c] != null)
> 1105: len += constants[c].length();

Braces?

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1118:

> 1116: );
> 1117: 
> 1118: 

Extra new-line?

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1164:

> 1162: }
> 1163: 
> 1164: 

Extra new-line?

-

Marked as reviewed by shade (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18953#pullrequestreview-2024514742
PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580723293
PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580723981
PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580724174
PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580724568


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-26 Thread Claes Redestad
On Thu, 25 Apr 2024 19:53:46 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove accidental use of java.lang.classfile

@shipilev are you OK with this?

-

PR Comment: https://git.openjdk.org/jdk/pull/18953#issuecomment-2078922183


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-26 Thread Claes Redestad
> Splitting out the ASM-based version from #18690 to push that first under the 
> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up on 
> this as a subtask. See #18690 for more details, discussion, and motivation.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Copyright and comment about ProxyClassDumper

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18953/files
  - new: https://git.openjdk.org/jdk/pull/18953/files/a4dfbfca..591d2d1c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=01-02

  Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18953.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18953/head:pull/18953

PR: https://git.openjdk.org/jdk/pull/18953


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-26 Thread Claes Redestad
On Fri, 26 Apr 2024 02:36:22 GMT, Chen Liang  wrote:

> Also a side note for backport: ClassFileDumper exists only since JDK 21, so 
> for earlier backports you need to use an alternative dumping method or remove 
> dumping ability.

Yes, the original code used ProxyClassDumper, which was renamed/reworked in 21 by 
[JDK-8304846](https://bugs.openjdk.org/browse/JDK-8304846). I think we've gone 
as far as we can code-wise in ensuring this is cleanly backportable to at least 
21. I'll add a comment about ProxyClassDumper.

-

PR Comment: https://git.openjdk.org/jdk/pull/18953#issuecomment-2078910803


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-26 Thread Claes Redestad
On Fri, 26 Apr 2024 03:14:54 GMT, Mandy Chung  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Remove accidental use of java.lang.classfile
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1057:
> 
>> 1055:  */
>> 1056: private static final class SimpleStringBuilderStrategy {
>> 1057: static final int CLASSFILE_VERSION = 52; // JDK 8
> 
> Alternatively, this can use `ClassFileFormatVersion.latest().major()` which 
> exists since JDK 20.

As the original code just had 52 verbatim and I don't know how far we'll want 
to backport this, I think it's fine to leave this at 52.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580682911


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-25 Thread Mandy Chung
On Thu, 25 Apr 2024 19:53:46 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove accidental use of java.lang.classfile

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1057:

> 1055:  */
> 1056: private static final class SimpleStringBuilderStrategy {
> 1057: static final int CLASSFILE_VERSION = 52; // JDK 8

Alternatively, this can use `ClassFileFormatVersion.latest().major()` which 
exists since JDK 20.
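For comparison, the two alternatives discussed here (the PR keeps the literal to 
stay backport-friendly; sketch only):

    static final int CLASSFILE_VERSION = 52; // JDK 8 class file format, works for backports
    // or, available only on JDK 20 and later:
    // static final int CLASSFILE_VERSION = ClassFileFormatVersion.latest().major();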

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580432036


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-25 Thread Chen Liang
On Thu, 25 Apr 2024 19:53:46 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove accidental use of java.lang.classfile

Also a side note for backport: ClassFileDumper exists only since JDK 21, so for 
earlier backports you need to use an alternative dumping method or remove 
dumping ability.

-

PR Comment: https://git.openjdk.org/jdk/pull/18953#issuecomment-2078532971


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v9]

2024-04-25 Thread Mandy Chung
On Wed, 24 Apr 2024 19:13:43 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Nits, clean up constant, introduce missing constant MethodTypeDesc for 
> toString

It's a little confusing for JDK-8327247 to have 2 PRs, but it's okay. We can 
add a note in JDK-8327247 to clarify. I assume you plan to use this PR to 
convert to the ClassFile API after you pull in JDK-8327247 once it is integrated?

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2078512445


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-25 Thread Mandy Chung
On Thu, 25 Apr 2024 19:53:46 GMT, Claes Redestad  wrote:

>> Splitting out the ASM-based version from #18690 to push that first under the 
>> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up 
>> on this as a subtask. See #18690 for more details, discussion, and motivation.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove accidental use of java.lang.classfile

Nit: the copyright header needs an update before integration.

-

Marked as reviewed by mchung (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18953#pullrequestreview-2023964506


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-25 Thread Claes Redestad
On Thu, 25 Apr 2024 16:46:25 GMT, Chen Liang  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Remove accidental use of java.lang.classfile
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1059:
> 
>> 1057:  */
>> 1058: private static final class SimpleStringBuilderStrategy {
>> 1059: static final int CLASSFILE_VERSION = ClassFile.latestMajorVersion();
> 
> Still breaks the backward ASM port; we should use a fixed version like 52 for 
> JAVA_8 and convert to latest only in the CF conversion later.

Good catch. In the code I am 
[reviving](https://github.com/cl4es/jdk/commit/36c4b11bc6cf5a008d5935934aa75f2d2bbe6a23#diff-1339c269a3729d849799d29a7431ccd508a034ced91c1796b952795396843891L771)
 this field was simply set to `52`. Fixed.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1580047931


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-25 Thread Claes Redestad
> Splitting out the ASM-based version from #18690 to push that first under the 
> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up on 
> this as a subtask. See #18690 for more details, discussion, and motivation.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Remove accidental use of java.lang.classfile

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18953/files
  - new: https://git.openjdk.org/jdk/pull/18953/files/d8d5e5d3..a4dfbfca

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=00-01

  Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18953.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18953/head:pull/18953

PR: https://git.openjdk.org/jdk/pull/18953


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-25 Thread Chen Liang
On Thu, 25 Apr 2024 14:15:56 GMT, Claes Redestad  wrote:

> Splitting out the ASM-based version from #18690 to push that first under the 
> JBS issue (to help backporting). Keeping #18690 open to rebase and follow up on 
> this as a subtask. See #18690 for more details, discussion, and motivation.

Only with ASM can we realize how concise the ClassFile API is!

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1059:

> 1057:  */
> 1058: private static final class SimpleStringBuilderStrategy {
> 1059: static final int CLASSFILE_VERSION = ClassFile.latestMajorVersion();

Still breaks the backward ASM port; we should use a fixed version like 52 for 
JAVA_8 and convert to latest only in the CF conversion later.

-

PR Review: https://git.openjdk.org/jdk/pull/18953#pullrequestreview-2023078249
PR Review Comment: https://git.openjdk.org/jdk/pull/18953#discussion_r1579822765


RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-25 Thread Claes Redestad
Splitting out the ASM-based version from #18690 to push that first under the 
JBS issue (to help backporting). Keeping #18690 open to rebase and follow up on 
this as a subtask. See #18690 for more details, discussion, and motivation.

-

Commit messages:
 - Adapt main PR feedback to ASM version
 - ASM fallback version

Changes: https://git.openjdk.org/jdk/pull/18953/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18953&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8327247
  Stats: 300 lines in 2 files changed: 294 ins; 3 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/18953.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18953/head:pull/18953

PR: https://git.openjdk.org/jdk/pull/18953


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-25 Thread Claes Redestad
On Wed, 24 Apr 2024 10:05:27 GMT, Aleksey Shipilev  wrote:

>>> I really wish this change was not done with ClassFile API, but with a 
>>> simple bundled ASM, so it would be easily backportable, if we decide to. It 
>>> does not look like CFA buys us a lot here readability/complexity wise: 
>>> [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
>> 
>> I would be open to splitting out and pushing the ASM version first and do 
>> this CFA port as a follow-up.
>
>> > I really wish this change was not done with ClassFile API, but with a 
>> > simple bundled ASM, so it would be easily backportable, if we decide to. 
>> > It does not look like CFA buys us a lot here readability/complexity wise: 
>> > [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
>> 
>> I would be open to splitting out and pushing the ASM version first and do 
>> this CFA port as a follow-up.
> 
> That would be good, thanks.

@shipilev @mlchung I opened a new PR, #18953, for pushing the ASM-based version. 
I adapted the applicable code changes from this PR, so it should be equivalent 
in behavior.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2077319721


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v9]

2024-04-24 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.
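To illustrate the quoted description: for a call site above the threshold, the 
generated method is conceptually just a chain of StringBuilder appends. A rough 
Java-source equivalent of what the simple strategy emits (a sketch with made-up 
constants and a low arity, not the generator's actual output):

    static String concat(String a0, int a1, Object a2) {
        // initial capacity: total length of the constants plus a small per-argument factor
        return new StringBuilder(32)
                .append("prefix-").append(a0)
                .append('/').append(a1)
                .append(a2)
                .append("-suffix")
                .toString();
    }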

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Nits, clean up constant, introduce missing constant MethodTypeDesc for 
toString

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/9742f074..0ac8f24b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=08
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=07-08

  Stats: 13 lines in 1 file changed: 2 ins; 5 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v8]

2024-04-24 Thread Claes Redestad
On Wed, 24 Apr 2024 10:08:42 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Make Set.of(STRONG) a constant, fix compilation, minor code improvements

Thanks for reviewing! I'll split out and PR an ASM version as a subtask and 
rebase this PR on top of that.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2075654943


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v8]

2024-04-24 Thread Mandy Chung
On Wed, 24 Apr 2024 10:08:42 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Make Set.of(STRONG) a constant, fix compilation, minor code improvements

Looks fine to me. Indeed, splitting this out with ASM and then converting it to 
the ClassFile API would help the backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1077:

> 1075: 
> 1076: /**
> 1077:  * Ensure a capacity in the initial StringBuilder to accommodate all constants plus this factor times the number

Nit: wrap long line.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1085:

> 1083: 
> 1084: static {
> 1085: DUMPER = ClassFileDumper.getInstance("java.lang.invoke.StringConcatFactory.dump", "stringConcatClasses");

Nit: this static block isn't strictly needed.  Can assign at the declaration of 
the static variable.
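That is, the quoted lines could collapse into a plain field initializer, roughly 
(modifiers and field type assumed):

    private static final ClassFileDumper DUMPER =
            ClassFileDumper.getInstance("java.lang.invoke.StringConcatFactory.dump", "stringConcatClasses");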

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1112:

> 1110: return hiddenLookup.findStatic(innerClass, METHOD_NAME, args);
> 1111: } catch (Exception e) {
> 1112: DUMPER.dumpFailedClass(className, classBytes);

This line is no longer needed. The bytes will be dumped, if dumping is enabled, 
for both the success and failure cases.

-

Marked as reviewed by mchung (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18690#pullrequestreview-2020345792
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1578178759
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1578176295
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1578173742


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v8]

2024-04-24 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Make Set.of(STRONG) a constant, fix compilation, minor code improvements

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/e7cbaaf5..9742f074

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=06-07

  Stats: 9 lines in 1 file changed: 4 ins; 0 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Aleksey Shipilev
On Wed, 24 Apr 2024 10:01:47 GMT, Claes Redestad  wrote:

> > I really wish this change was not done with ClassFile API, but with a 
> > simple bundled ASM, so it would be easily backportable, if we decide to. It 
> > does not look like CFA buys us a lot here readability/complexity wise: 
> > [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)
> 
> I would be open to splitting out and pushing the ASM version first and do 
> this CFA port as a follow-up.

That would be good, thanks.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2074585421


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Claes Redestad
On Wed, 24 Apr 2024 09:57:42 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java
>   
>   Co-authored-by: Mandy Chung 

> I really wish this change was not done with ClassFile API, but with a simple 
> bundled ASM, so it would be easily backportable, if we decide to. It does not 
> look like CFA buys us a lot here readability/complexity wise: 
> [d99b159](https://github.com/openjdk/jdk/commit/d99b1596c5ca57b110c1db88be430c6c54c3d599)

I would be open to splitting out and pushing the ASM version first and do this 
CFA port as a follow-up.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2074577669


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Louis Wasserman
On Sun, 14 Apr 2024 14:33:26 GMT, Claes Redestad  wrote:

>> What are the scenarios which had regressions? 
>> Given the conservative growth for StringBuilder, it surprises me a bit that 
>> any scenario would regress.
>
> I took a second look and it turns out that there were neither regressions nor 
> improvements in my test, only a few false positives: for C2, the SB chain is 
> recognized by the (fragile) C2 OptimizeStringConcat pass and transformed into 
> a shape where the initial size in the Java bytecode - if any - is ignored.
> 
> With C1, having an initial size can have a significant effect. One way to 
> provoke a regression there is to have a huge constant suffix while 
> underestimating the size of the operands, which can lead to excessive 
> allocation:
> 
> 
> Name                               Cnt      Base    Error       Test   Error    Unit  Change
> StringConcat.concat23StringConst     5   385,268 ±  5,238    341,251 ±  2,606   ns/op  1,13x (p = 0,000*)
>   :gc.alloc.rate                        6039,095 ± 82,309  12764,169 ± 97,146  MB/sec  2,11x (p = 0,000*)
>   :gc.alloc.rate.norm                   2440,003 ±  0,000   4568,002 ±  0,000    B/op  1,87x (p = 0,000*)
> 
> 
> Still a bit better on throughput from less actual copying.
> 
> *If* the `StringBuilder` is sized sufficiently (constants + args * N) things 
> look much better, of course: 
> 
> -XX:TieredStopAtLevel=1
> Name                               Cnt      Base    Error      Test   Error    Unit  Change
> StringConcat.concat23StringConst     5   385,268 ±  5,238   259,628 ±  2,515   ns/op  1,48x (p = 0,000*)
>   :gc.alloc.rate                        6039,095 ± 82,309  8902,803 ± 86,563  MB/sec  1,47x (p = 0,000*)
>   :gc.alloc.rate.norm                   2440,003 ±  0,000  2424,002 ±  0,000    B/op  0,99x (p = 0,000*)
> 
> 
> For most cases, having a size based on the sum of the constants plus some 
> small factor per argument seems to be a decent heuristic - for C1. Since this 
> adds just one bytecode to the generated method, I guess it wouldn't hurt.

Presizing this StringBuilder is a thing I looked into once upon a time, 
discussed here: 
https://mail.openjdk.org/pipermail/compiler-dev/2015-March/009356.html  That 
work, as I understand it, predates the indyfication of string concatenation in 
the first place.

Primitive values can get bounded at appropriate lengths for their types (see 
e.g. https://stackoverflow.com/a/21146952/869736).  It sounds like you're 
trying to conserve bytecode length, so maybe you don't want to presize the 
StringBuilder with the exact Object.toString() lengths, though.
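As a concrete illustration of bounding primitives (a sketch based on the linked 
answer, not code from this patch): the maximum decimal string length of each 
primitive type is known statically, so per-argument sizing does not need the 
exact Object.toString() lengths:

    // Upper bound on String.valueOf(x).length() for a value of the given primitive type.
    static int maxStringLength(Class<?> type) {
        if (type == boolean.class) return 5;  // "false"
        if (type == byte.class)    return 4;  // "-128"
        if (type == char.class)    return 1;
        if (type == short.class)   return 6;  // "-32768"
        if (type == int.class)     return 11; // "-2147483648"
        if (type == long.class)    return 20; // "-9223372036854775808"
        if (type == float.class)   return 15; // approximate worst case of Float.toString
        if (type == double.class)  return 26; // approximate worst case of Double.toString
        throw new IllegalArgumentException(type.toString());
    }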

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1576819289


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v7]

2024-04-24 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Update src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java
  
  Co-authored-by: Mandy Chung 

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/83f4048f..e7cbaaf5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=05-06

  Stats: 5 lines in 1 file changed: 1 ins; 2 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v6]

2024-04-23 Thread Mandy Chung
On Thu, 18 Apr 2024 14:50:30 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Minor SimpleStringBuilderStrategy:: overhead reduction

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1108:

> 1106: DUMPER.dumpClass(innerClass.getName(), innerClass, classBytes);
> 1107: MethodHandle mh = hiddenLookup.findStatic(innerClass, METHOD_NAME, args);
> 1108: return mh;

Here is an alternative way to define the hidden class that detects whether the 
dumper is enabled and dumps the class file accordingly.

Suggestion:

    Lookup hiddenLookup = lookup.makeHiddenClassDefiner(className, classBytes, Set.of(STRONG), DUMPER)
                                .defineClassAsLookup(true);
    Class<?> innerClass = hiddenLookup.lookupClass();
    return hiddenLookup.findStatic(innerClass, METHOD_NAME, args);

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1576880600


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v6]

2024-04-23 Thread Aleksey Shipilev
On Thu, 18 Apr 2024 14:50:30 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Minor SimpleStringBuilderStrategy:: overhead reduction

I really wish this change was not done with ClassFile API, but with a simple 
bundled ASM, so it would be easily backportable, if we decide to. It does not 
look like CFA buys us a lot here readability/complexity wise: 
https://github.com/openjdk/jdk/pull/18690/commits/d99b1596c5ca57b110c1db88be430c6c54c3d599

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2073284920


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v6]

2024-04-23 Thread Mandy Chung
On Mon, 22 Apr 2024 08:59:33 GMT, Claes Redestad  wrote:

> @mlchung @asotona: Alan asked me to use the ClassFile API here. As it's only 
> used in what's effectively a slow path it shouldn't be blocked by the startup 
> investigations we're doing there, right?

I agree that this should not be blocked by the startup investigation for the 
ClassFile API. It only affects high-arity expressions.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2073085437


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v6]

2024-04-22 Thread Claes Redestad
On Thu, 18 Apr 2024 14:50:30 GMT, Claes Redestad  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Minor SimpleStringBuilderStrategy:: overhead reduction

This one is still waiting for a review. 

@mlchung @asotona: Alan asked me to use the ClassFile API here. As it's only 
used in what's effectively a slow path it shouldn't be blocked by the startup 
investigations we're doing there, right?

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2068862978


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v6]

2024-04-18 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Minor SimpleStringBuilderStrategy:: overhead reduction

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/fd9c40d2..83f4048f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=04-05

  Stats: 11 lines in 1 file changed: 1 ins; 0 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v5]

2024-04-18 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains seven commits:

 - Merge master, resolve conflicts due pr/18688
 - Dump using the hidden class name
 - Pre-size stringbuilders based on constant size and a small argument factor
 - @liach feedback
 - Bump threshold after experiments
 - Port ASM to CFA
 - Lower compilation overhead for large concat expressions, initial ASM-based 
version based on pre-existing implementation

-

Changes: https://git.openjdk.org/jdk/pull/18690/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=04
  Stats: 237 lines in 2 files changed: 229 ins; 2 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v4]

2024-04-15 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Dump using the hidden class name

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/796d968b..9eed6609

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-15 Thread Claes Redestad
On Fri, 12 Apr 2024 15:06:36 GMT, Chen Liang  wrote:

>> I'd prefer considering such optimizations as follow-ups. We need to think 
>> about where to define such shared classes in a way that considers the full 
>> lifecycle, facilitates class unloading (one cache per classloader?) while 
>> still getting good reuse.
>
> I think `Invokers` might be a good place. Its lifetime is bound to its MT, 
> and MT already handles class unloading well. If we are doing erased then a 
> plain array suffices.

I think we should be wary about adding things that are strongly kept alive by 
the MT. While the MTs themselves are weakly referenced and can be unloaded if 
all usage goes away, many oft-used MTs might be kept effectively forever.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1565912250


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-14 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Pre-size stringbuilders based on constant size and a small argument factor

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/1c3d354c..796d968b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=01-02

  Stats: 44 lines in 2 files changed: 42 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

2024-04-14 Thread Claes Redestad
On Sat, 13 Apr 2024 17:54:54 GMT, Brett Okken  wrote:

>> Possibly. I tried a few simple variants that initialized the `StringBuilder` 
>> capacity at various guesses, such as sum of constant sizes + some factor 
>> based on args, but across a diverse set of micros that gives both some wins 
>> and some regressions. Getting the estimation just right is pretty tricky, 
>> especially when size of the arguments are arbitrary (like for any 
>> String/Object arg).
>
> What are the scenarios which had regressions? 
> Given the conservative growth for StringBuilder, it surprises me a bit that 
> any scenario would regress.

I took a second look and it turns out that there were neither regressions nor 
improvements in my test, only a few false positives: For C2 the SB chain is 
recognized by the (fragile) C2 OptimizeStringConcat pass and transformed into a 
shape where the initial size in java bytecode - if any - is ignored.

With C1 having an initial size can have a significant effect. One way to 
provoke a regression there is to have a huge constant suffix while 
underestimating the size of the operands, which can lead to excessive 
allocation:


Name                               Cnt  Base                Test                 Unit    Change
StringConcat.concat23StringConst     5   385,268 ±  5,238     341,251 ±  2,606   ns/op   1,13x (p = 0,000*)
  :gc.alloc.rate                        6039,095 ± 82,309   12764,169 ± 97,146   MB/sec  2,11x (p = 0,000*)
  :gc.alloc.rate.norm                   2440,003 ±  0,000    4568,002 ±  0,000   B/op    1,87x (p = 0,000*)


Still a bit better on throughput from less actual copying.

*If* the `StringBuilder` is sized sufficiently (constants + args * N) things 
look much better, of course: 

-XX:TieredStopAtLevel=1
Name                               Cnt  Base                Test                Unit    Change
StringConcat.concat23StringConst     5   385,268 ±  5,238    259,628 ±  2,515   ns/op   1,48x (p = 0,000*)
  :gc.alloc.rate                        6039,095 ± 82,309   8902,803 ± 86,563   MB/sec  1,47x (p = 0,000*)
  :gc.alloc.rate.norm                   2440,003 ±  0,000   2424,002 ±  0,000   B/op    0,99x (p = 0,000*)


For most cases having a size based on the sum of the constants plus some small 
factor per argument seems to be a decent heuristic - for C1. Since this adds 
just one bytecode to the generated method I guess it wouldn't hurt.
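
To make that concrete, here is a rough Java-source equivalent of a generated 
concat method with a presized builder; the constant strings and the 
8-characters-per-argument factor are made-up assumptions for illustration, and 
the real method is of course emitted directly as bytecode:

static String concat(String a, int b, String c) {
    // initial capacity = total constant length (5) + an assumed 8 chars per argument
    StringBuilder sb = new StringBuilder(5 + 3 * 8);
    sb.append("key=");   // 4 constant chars
    sb.append(a);
    sb.append(b);
    sb.append(c);
    sb.append("!");      // 1 constant char
    return sb.toString();
}

With C1 this avoids most of the intermediate array growth; with C2 the 
OptimizeStringConcat pass ignores the explicit capacity anyway, as noted above.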

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1564753612


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-13 Thread Brett Okken
On Fri, 12 Apr 2024 10:19:27 GMT, Claes Redestad  wrote:

>> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
>> 1430:
>> 
>>> 1428: cb.new_(STRING_BUILDER);
>>> 1429: cb.dup();
>>> 1430: cb.invokespecial(STRING_BUILDER, "<init>", 
>>> MethodTypeDesc.ofDescriptor("()V"));
>> 
>> Would there be value in initializing to a larger capacity? Given the number 
>> of append calls, seems the default cap of 16 is unlikely to be sufficient.
>
> Possibly. I tried a few simple variants that initialized the `StringBuilder` 
> capacity at various guesses, such as sum of constant sizes + some factor 
> based on args, but across a diverse set of micros that gives both some wins 
> and some regressions. Getting the estimation just right is pretty tricky, 
> especially when size of the arguments are arbitrary (like for any 
> String/Object arg).

What are the scenarios which had regressions? 
Given the conservative growth for StringBuilder, it surprises me a bit that any 
scenario would regress.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1564165946


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-12 Thread Claes Redestad
> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  @liach feedback

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18690/files
  - new: https://git.openjdk.org/jdk/pull/18690/files/b5928994..1c3d354c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=00-01

  Stats: 32 lines in 1 file changed: 4 ins; 23 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 14:26:04 GMT, Chen Liang  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   @liach feedback
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1406:
> 
>> 1404: @Override
>> 1405: public void accept(ClassBuilder clb) {
>> 1406: clb.withFlags(AccessFlag.PUBLIC, 
>> AccessFlag.FINAL, AccessFlag.SUPER, AccessFlag.SYNTHETIC)
> 
> Why is this hidden class public?

Removed `PUBLIC`.

> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1458:
> 
>> 1456: 
>> 1457: // Load the argument of type cl at slot onto stack, return the 
>> number of argument stack slots consumed.
>> 1458: private static int load(CodeBuilder cb, Class<?> cl, int slot) 
>> {
> 
> You can replace all calls to `load(cb, cl, slot)` with 
> `cb.loadInstruction(TypeKind.from(cl), slot)`, and the returned slot count can 
> be accessed by `TypeKind::slotSize`

Thanks, I've tried to simplify.
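
For reference, a minimal sketch of the simplified argument-loading loop using 
the API names quoted above; the loop structure and variable names here are 
hypothetical, not the PR's exact code:

int slot = 0;
for (Class<?> cl : args.parameterList()) {
    TypeKind kind = TypeKind.from(cl);  // picks the kind matching cl (int, long, reference, ...)
    cb.loadInstruction(kind, slot);     // emits the matching iload/lload/fload/dload/aload
    slot += kind.slotSize();            // 2 for long/double, 1 for everything else
}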

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562702306
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562701048


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Chen Liang
On Fri, 12 Apr 2024 14:57:45 GMT, Claes Redestad  wrote:

>> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
>> 1437:
>> 
>>> 1435: for (int c = 0; c < args.parameterCount(); 
>>> c++) {
>>> 1436: if (constants[c] != null) {
>>> 1437: cb.ldc(constants[c]);
>> 
>> I think for Remi's approach, we change:
>> 1. Insert an extra String array (maybe need a way to mark it stable?) arg 
>> representing constants
>> 2. Change this ldc into aload + aaload (or List.get if we use immutable List)
>> 3. Call `bindTo(constantStrings)` on the resulting MH
>> 
>> This approach can significantly reduce the number of classes spun instead 
>> of generating one class per constant array; might need to measure 
>> performance to see if this is a good tradeoff
>> 
>> Oh, I just noticed that we need to erase everything to the generic method 
>> type. I think Remi's "value class" means there's no overhead for converting 
>> primitives into wrappers in this conversion to generic method type.
>
> I'd prefer considering such optimizations as follow-ups. We need to think 
> about where to define such shared classes in a way that considers the full 
> lifecycle, facilitates class unloading (one cache per classloader?) while 
> still getting good reuse.

I think `Invokers` might be a good place. Its lifetime is bound to its MT, and 
MT already handles class unloading well. If we are doing erased then a plain 
array suffices.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562686793


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 14:53:58 GMT, Chen Liang  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1437:
> 
>> 1435: for (int c = 0; c < args.parameterCount(); 
>> c++) {
>> 1436: if (constants[c] != null) {
>> 1437: cb.ldc(constants[c]);
> 
> I think for Remi's approach, we change:
> 1. Insert an extra String array (maybe need a way to mark it stable?) arg 
> representing constants
> 2. Change this ldc into aload + aaload (or List.get if we use immutable List)
> 3. Call `bindTo(constantStrings)` on the resulting MH
> 
> This approach can significantly reduce the number of classes spun instead 
> of generating one class per constant array; might need to measure performance 
> to see if this is a good tradeoff
> 
> Oh, I just noticed that we need to erase everything to the generic method 
> type. I think Remi's "value class" means there's no overhead for converting 
> primitives into wrappers in this conversion to generic method type.

I'd prefer considering such optimizations as follow-ups. We need to think about 
where to define such shared classes in a way that considers the full lifecycle, 
facilitates class unloading (one cache per classloader?) while still getting 
good reuse.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562673638


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Chen Liang
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1437:

> 1435: for (int c = 0; c < args.parameterCount(); c++) 
> {
> 1436: if (constants[c] != null) {
> 1437: cb.ldc(constants[c]);

I think for Remi's approach, we change:
1. Insert an extra String array (maybe need a way to mark it stable?) arg 
representing constants
2. Change this ldc into aload + aaload (or List.get if we use immutable List)
3. Call `bindTo(constantStrings)` on the resulting MH

This approach can significantly reduce the number of classes spun instead of 
generating one class per constant array; might need to measure performance to 
see if this is a good tradeoff

Oh, I just noticed that we need to erase everything to the generic method type. 
I think Remi's "value class" means there's no overhead for converting 
primitives into wrappers in this conversion to generic method type.
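
To illustrate the shape being discussed, a minimal sketch under those 
assumptions (class and method names are made up, this is not what the PR 
generates, and a production version would still need the stable-array and 
erased-type considerations mentioned above):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// One shared class per shape: constants arrive through a bound String[]
// instead of being hard-coded with ldc in a per-callsite class.
class SharedConcat2 {
    static String concat(String[] constants, Object a0, Object a1) {
        StringBuilder sb = new StringBuilder();
        sb.append(constants[0]).append(a0)
          .append(constants[1]).append(a1)
          .append(constants[2]);
        return sb.toString();
    }

    // Per-callsite handle: bind the callsite's constant strings to the shared method.
    static MethodHandle handleFor(String[] callSiteConstants) throws ReflectiveOperationException {
        MethodHandle shared = MethodHandles.lookup().findStatic(
                SharedConcat2.class, "concat",
                MethodType.methodType(String.class, String[].class, Object.class, Object.class));
        return shared.bindTo(callSiteConstants);
    }
}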

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562667143


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 14:32:05 GMT, Chen Liang  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1425:
> 
>> 1423: 
>> 1424: private static Consumer<ClassBuilder> generateMethod(String[] 
>> constants, MethodType args) {
>> 1425: return new Consumer<ClassBuilder>() {
> 
> Are you using inner classes for startup performance concerns? 
> java.lang.invoke should be ready when you can reach here so using lambdas is 
> safe.

We've opted to avoid lambda use in `java.lang.invoke` historically to avoid 
bootstrap surprises. A small price to pay, imho.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562661264


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 13:44:23 GMT, Rémi Forax  wrote:

> One class per arity + value classes can be a good combo ?

Not sure value classes matter here? We would need one static instance per call 
site holding the constants. Trickier for performance is the potential for 
profile pollution between such classes.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051885862


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Chen Liang
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1406:

> 1404: @Override
> 1405: public void accept(ClassBuilder clb) {
> 1406: clb.withFlags(AccessFlag.PUBLIC, 
> AccessFlag.FINAL, AccessFlag.SUPER, AccessFlag.SYNTHETIC)

Why is this hidden class public?

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1409:

> 1407: .withMethodBody(METHOD_NAME,
> 1408: 
> MethodTypeDesc.ofDescriptor(args.toMethodDescriptorString()),
> 1409: ClassFile.ACC_FINAL | 
> ClassFile.ACC_PUBLIC | ClassFile.ACC_STATIC,

Why is this method final? Don't think this method can ever be hidden.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1416:

>> 1414: Class<?> innerClass = 
> lookup.defineHiddenClass(classBytes, true, STRONG).lookupClass();
> 1415: DUMPER.dumpClass(className, innerClass, classBytes);
> 1416: MethodHandle mh = lookup.findStatic(innerClass, 
> METHOD_NAME, args);

We can probably make the method private, and use the lookup from 
`defineHiddenClass` instead of the host class lookup.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1425:

> 1423: 
> 1424: private static Consumer<ClassBuilder> generateMethod(String[] 
> constants, MethodType args) {
> 1425: return new Consumer<ClassBuilder>() {

Are you using inner classes for startup performance concerns? java.lang.invoke 
should be ready when you can reach here so using lambdas is safe.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1458:

> 1456: 
> 1457: // Load the argument of type cl at slot onto stack, return the 
> number of argument stack slots consumed.
> 1458: private static int load(CodeBuilder cb, Class<?> cl, int slot) {

You can replace all calls to `load(cb, cl, slot)` with 
> `cb.loadInstruction(TypeKind.from(cl), slot)`, and the returned slot count can be 
accessed by `TypeKind::slotSize`
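
On the `defineHiddenClass` point above, a minimal sketch of that variant using 
the names from the quoted PR code (classBytes, METHOD_NAME, args, STRONG, 
DUMPER); a sketch of the suggestion, not the PR's actual code:

// Use the Lookup returned by defineHiddenClass() so the generated method can
// be private instead of public; no separate host-class lookup is needed.
MethodHandles.Lookup hidden = lookup.defineHiddenClass(classBytes, true, STRONG);
Class<?> innerClass = hidden.lookupClass();
DUMPER.dumpClass(className, innerClass, classBytes);
MethodHandle mh = hidden.findStatic(innerClass, METHOD_NAME, args);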

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562629418
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562630031
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562635315
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562637025
PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562640252


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Rémi Forax
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

One class per arity + value classes can be a good combo ?

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051792260


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 11:41:17 GMT, Thomas Stuefe  wrote:

> Yes, we are concerned with that, especially for a possible future where 
> Lilliput is the sole default. Atm we can address about 4 million classes. 
> There are thoughts about making this number of classes infinite, but if 
> possible we would like to avoid that complexity.

4M classes should be enough for everyone :-) Either way this PR shouldn't make 
the situation worse. I think for expressions of high arity the average number 
of classes generated by the current strategy is likely larger than 1. Forcing 
generation of exactly one class per such call site is thus unlikely to be 
worse. But since it depends on the app, having a parameter to adjust the level at 
which we fall back to simple bytecode generation seems reasonable.

There's a range of possible follow-up enhancements to reduce classes generated. 
For example we could generate classes where constants are injected rather than 
hardcoded and cache/reuse those classes at different call-sites. Further out we 
could pre-generate classes (one per arity) containing methods for known 
signatures when dumping CDS archives or running jlink plugins so that there's 
little to no bytecode generation at runtime and a bound number of classes.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051734530


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Thomas Stuefe
On Fri, 12 Apr 2024 10:07:48 GMT, Claes Redestad  wrote:

> I guess Lilliput adds some hard or soft limit on the number of classes loaded?

Yes, we are concerned with that, especially for a possible future where 
Lilliput is the sole default. Atm we can address about 4 million classes. There 
are thoughts about making this number of classes infinite, but if possible we 
would like to avoid that complexity.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051597166


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 03:16:58 GMT, Brett Okken  wrote:

>> This patch suggests a workaround to an issue with huge SCF MH expression 
>> trees taking excessive JIT compilation resources by reviving (part of) the 
>> simple bytecode-generating strategy that was originally available as an 
>> all-or-nothing strategy choice. 
>> 
>> Instead of reintroducing a binary strategy choice I propose a threshold 
>> parameter, controlled by 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
>> below or at this threshold there's no change, for expressions with an arity 
>> above it we use the `StringBuilder`-chain bytecode generator. 
>> 
>> There are a few trade-offs at play here which influence the choice of 
>> threshold. The simple high arity strategy will for example not see any reuse 
>> of LambdaForms but strictly always generate a class per indy callsite, which 
>> means we might end up with a higher total number of classes generated and 
>> loaded in applications if we set this value too low. It may also produce 
>> worse performance on average. On the other hand there is the observed 
>> increase in C2 resource usage as expressions grow unwieldy. On the other 
>> other hand high arity expressions are likely rare to begin with, with less 
>> opportunities for sharing than the more common low-arity expressions. 
>> 
>> I turned the submitted test case into a few JMH benchmarks and did some 
>> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
>> 
>> Baseline strategy:
>> 13 args: 6.3M
>> 23 args: 18M
>> 123 args: 868M
>> 
>> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
>> 13 args: 2.11M
>> 23 args: 3.67M
>> 123 args: 4.75M
>> 
>> For 123 args the memory overhead of the baseline strategy is 180x, but for 
>> 23 args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
>> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
>> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
>> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
>> at the vast majority of call sites.
>> 
>> I was asked to use the new class file API for mainline. There's a version of 
>> this patch implemented using ASM in 7c52a9f which might be a reasonable 
>> basis for a backport.
>
> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
> 1430:
> 
>> 1428: cb.new_(STRING_BUILDER);
>> 1429: cb.dup();
>> 1430: cb.invokespecial(STRING_BUILDER, "<init>", 
>> MethodTypeDesc.ofDescriptor("()V"));
> 
> Would there be value in initializing to a larger capacity? Given the number 
> of append calls, seems the default cap of 16 is unlikely to be sufficient.

Possibly. I tried a few simple variants that initialized the `StringBuilder` 
capacity at various guesses, such as sum of constant sizes + some factor based 
on args, but across a diverse set of micros that gives both some wins and some 
regressions. Getting the estimation just right is pretty tricky, especially 
when size of the arguments are arbitrary (like for any String/Object arg).

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1562356874


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Claes Redestad
On Fri, 12 Apr 2024 06:34:32 GMT, Thomas Stuefe  wrote:

> Stupid question maybe, this would be one LambdaForm per argument set? E.g. 
> would two invocations with the same arguments re-use the same LambdaForm?
> 
> I would like to get an understanding of the numbers of classes involved for 
> these solutions, and whether they are bounded. Mostly for Lilliput reason.

Any concat expression sharing the same exact shape will ideally share the same 
chain of LFs, and many of the intermediary LFs will be shared between concat 
expressions of all manner of shapes and sizes. On the extreme of an application 
with one singular, huge expression we might see quite a few LF classes per call 
site, while the incremental number of classes loaded for an expression whose 
shape has been visited elsewhere will be none. 

One could guess there is some upper bound based on the number of variant types 
(7: int, long, float, double, char, boolean, Object) to the power of the 
highest possible arity (100 - after which javac will start splitting and 
chaining the indy calls), though since there are some softly referenced caches 
that the LF editor leans on then in theory it might very well be unbounded. In 
practice we're bounded by the number of call sites times some factor that 
depends on how much sharing is going on. Since most expressions are likely to 
be low arity with high degree of sharing, then a long tail of more complex high 
arity expressions that might see less degree of sharing the assumption has been 
that the number of classes loaded per concat will be low.

I guess Lilliput adds some hard or soft limit on the number of classes loaded?

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051462197


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Thomas Stuefe
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Fyi: 
https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2024-April/074787.html

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051184229


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-12 Thread Thomas Stuefe
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low

Stupid question maybe, this would be one LambdaForm per argument set? E.g. 
would two invocations with the same arguments re-use the same LambdaForm?

I would like to get an understanding of the numbers of classes involved for 
these solutions, and whether they are bounded. Mostly for Lilliput reason.

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2051082454


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-11 Thread Brett Okken
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1430:

> 1428: cb.new_(STRING_BUILDER);
> 1429: cb.dup();
> 1430: cb.invokespecial(STRING_BUILDER, "<init>", 
> MethodTypeDesc.ofDescriptor("()V"));

Would there be value in initializing to a larger capacity? Given the number of 
append calls, seems the default cap of 16 is unlikely to be sufficient.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1561966067


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-11 Thread Claes Redestad
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

Anyone out there?

-

PR Comment: https://git.openjdk.org/jdk/pull/18690#issuecomment-2050707573


RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-09 Thread Claes Redestad
This patch suggests a workaround to an issue with huge SCF MH expression trees 
taking excessive JIT compilation resources by reviving (part of) the simple 
bytecode-generating strategy that was originally available as an all-or-nothing 
strategy choice. 

Instead of reintroducing a binary strategy choice I propose a threshold 
parameter, controlled by 
`-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
below or at this threshold there's no change, for expressions with an arity 
above it we use the straightforward one-class-per-expression strategy.

There are a few trade-offs at play here which influence the choice of 
threshold. The simple high arity strategy will for example not see any reuse of 
LambdaForms, which means we might end up with a higher number of classes 
generated and loaded in applications if we set this value too low. It may also 
produce worse performance on average. On the other hand there is the observed 
increase in C2 resource usage as expressions grow unwieldy. On the other other 
hand high arity expressions are likely rare to begin with, with less 
opportunities for sharing than the more common low-arity expressions. 

I turned the submitted test case into a few JMH benchmarks and did some 
experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:

Baseline strategy:
13 args: 6.3M
23 args: 18M
123 args: 868M

`-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
13 args: 2.11M
23 args: 3.67M
123 args: 4.75M

For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
args we're down to a 5x memory overhead, and down to a 3x overhead for 13 args. 
Since the absolute overhead for 23 is borderline acceptable (+15Mb) I've 
conservatively chosen a threshold at arity 20. This keeps C2 resource pressure 
at a reasonable level (< 18M) while avoiding perturbing performance at the vast 
majority of call sites.

I was asked to use the new class file API for mainline. There's a version of 
this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
for a backport.
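
For anyone wanting to reproduce the behaviour, a made-up example of the kind of 
expression that goes over the default threshold of 20 arguments (javac compiles 
the whole concatenation into a single invokedynamic):

// 21 dynamic arguments in one concat expression, so with the default
// highArityThreshold of 20 this callsite takes the simple bytecode path.
static String describe(int id, String name, String street, String city, String zip,
                       String country, String phone, String email, long created,
                       long updated, String status, double score, boolean flagA,
                       boolean flagB, boolean flagC, String note1, String note2,
                       String note3, String note4, String note5, String note6) {
    return "id=" + id + ", name=" + name + ", street=" + street + ", city=" + city
            + ", zip=" + zip + ", country=" + country + ", phone=" + phone
            + ", email=" + email + ", created=" + created + ", updated=" + updated
            + ", status=" + status + ", score=" + score + ", flagA=" + flagA
            + ", flagB=" + flagB + ", flagC=" + flagC + ", note1=" + note1
            + ", note2=" + note2 + ", note3=" + note3 + ", note4=" + note4
            + ", note5=" + note5 + ", note6=" + note6;
}

Running with -Djava.lang.invoke.StringConcat.highArityThreshold=0 forces every 
indy concat callsite onto that path, which is how the numbers above were 
gathered.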

-

Commit messages:
 - Bump threshold after experiments
 - Port ASM to CFA
 - Lower compilation overhead for large concat expressions, initial ASM-based 
version based on pre-existing implementation

Changes: https://git.openjdk.org/jdk/pull/18690/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18690&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8327247
  Stats: 213 lines in 2 files changed: 210 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/18690.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18690/head:pull/18690

PR: https://git.openjdk.org/jdk/pull/18690