Re: RFR: 8382700: C2: Delay inlining instead of giving up when hit NodeCountInliningCutoff [v2]

Maurizio Cimadamore Thu, 23 Apr 2026 04:48:57 -0700

On Thu, 23 Apr 2026 07:53:30 GMT, Quan Anh Mai <[email protected]> wrote:


>> ## Preliminaries
>> 
>> ### 1. The inlining heuristic NodeCountInliningCutoff
>> 
>> In general, we don't want to inline a call if the graph is already too 
>> large. However, it is hard to decide whether the graph is large when we are 
>> still constructing the graph.
>> 
>> - There are still more bytecodes that need parsing, and more nodes that need 
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead 
>> during parsing.
>> 
>> Due to the issues above, the heuristic depends on the number of generated 
>> nodes, which is an upper bound of the number of live nodes, and the 
>> threshold is pretty conservative.
>> 
>> ### 2. Inlining a method
>> 
>> To inline a method, C2 needs to generate the structure for the callee to 
>> reside in. This includes the map for the exception path, the map for the 
>> merge of all normal paths, their memory states, etc. My experiment shows 
>> that inlining a call generates around 20 more nodes than if the callee is 
>> inlined by hand in the source code.
>> 
>>     private int v() {
>>         return this.v;
>>     }
>> 
>>     int test1() {
>>         return this.v();
>>     }
>> 
>>     int test2() {
>>         return this.v;
>>     }
>> 
>> This means that inlining a call consumes part of the budget of 
>> `NodeCountInliningCutoff`, which may prevent other calls from being inlined, 
>> even if other heuristics say that inlining is preferable. However, in 
>> practice, it is rarely an issue, because there is a difference of 3 orders 
>> of magnitude between the extra nodes generated by inlining, and the default 
>> value of `NodeCountInliningCutoff` (16000).
>> 
>> ### 3. Foreign memory access API
>> 
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of 
>> magnitude larger than the number of extra nodes generated when inlining a 
>> call is broken due to how the FMA API is implemented. A memory access such 
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs 
>> inlining:
>> 
>>     @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   
>> force inline by annotation   callee changed to  
>> io.github.merykitty.BenchmarkDraft::test1 (14 bytes)    -> TypeProfile 
>> (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>>       @ 1   
>> jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24 
>> bytes)   force inline by annotation
>>       @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force 
>> inline by annotation
>>         @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 
>> bytes)   force inline by annot...
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add benchmark

I was looking at the `reinterpret` variant of the benchmark, which is still 
slow. The issue there has to do with escape analysis getting "stuck" on such a 
big IR. Unfortunately, I was able to reproduce this even without FFM/restricted 
methods:

https://github.com/mcimadamore/jdk/blob/6b51dfae06ad935a5487698fe2ab93babdcc4ade/test/micro/org/openjdk/bench/java/lang/EscapeAnalysisDupAccess.java

When the optimization in this PR is enabled, we get something like this:


Benchmark                           Mode  Cnt      Score      Error  Units
EscapeAnalysisDupAccess.copyDirect  avgt    3    157.055 ±    1.428  ns/op
EscapeAnalysisDupAccess.copyDup1    avgt    3    161.035 ±   18.928  ns/op
EscapeAnalysisDupAccess.copyDup2    avgt    3  18519.558 ± 7707.574  ns/op


If we disable the optimization (e.g. using `-XX:-IncrementalInline`), then we 
get this:


Benchmark                           Mode  Cnt     Score    Error  Units
EscapeAnalysisDupAccess.copyDirect  avgt    3   159.383 ± 27.783  ns/op
EscapeAnalysisDupAccess.copyDup1    avgt    3   766.249 ± 40.345  ns/op
EscapeAnalysisDupAccess.copyDup2    avgt    3  2324.964 ± 93.478  ns/op


I think this comparison is interesting because it shows many things at once:
* delayed inlining doesn't affect `copyDirect`
* delayed inlining _improves_ `copyDup1`
* delayed inlining significantly _regresses_ `copyDup2`

That is, as long as escape analysis can keep up, delayed inlining wins. But 
when escape analysis stops working (and C2 never finishes compiling the 
benchmark method), then the numbers are much worse than when no inlining occurs 
at all.
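
To put rough numbers on this, here is a small sketch that computes the ratios from the two tables above. The scores are copied from those runs; the class and method names are only for illustration. With `Cnt = 3` and the wide error on `copyDup2`, these should be read as ballpark figures, not precise measurements.

```java
public class DelayedInliningRatios {
    // Ratio of the score without delayed inlining to the score with it.
    // Scores are ns/op (lower is better), so a value above 1 means
    // delayed inlining is faster, and a value below 1 means it is slower.
    static double speedup(double withDelayed, double withoutDelayed) {
        return withoutDelayed / withDelayed;
    }

    public static void main(String[] args) {
        System.out.println("copyDirect: " + speedup(157.055, 159.383));
        System.out.println("copyDup1:   " + speedup(161.035, 766.249));
        System.out.println("copyDup2:   " + speedup(18519.558, 2324.964));
    }
}
```

That works out to roughly no change for `copyDirect`, about 4.8x faster for `copyDup1`, and about 8x slower for `copyDup2`.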

When the optimization is disabled, we can clearly see inlining failures:

@ 274   org.openjdk.bench.java.lang.EscapeAnalysisDupAccess$Dup2Foo::z (10 
bytes)   failed to inline: NodeCountInliningCutoff


In a way, these failures "protect" escape analysis from combinatorial 
explosion.

So the question is:
* is this a problem?
* if so, is this a problem unique to escape analysis?
* or might other phases exhibit similar runaway behavior as the IR graph grows?

Depending on what the answer is (and I'm not expert enough on C2 to comment on 
this), different approaches might be required. E.g. if it's "just" an EA issue, 
we might only delay inlining if there's no allocation in the bytecode of the 
method to be inlined. This is probably not hard to do. The question is: is that 
enough?
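
To make the idea concrete, here is a minimal sketch of what such a check could look like. This is not C2 code (the real thing would live in the C++ inlining logic and walk the callee's bytecode stream); `DelayInliningHeuristic`, `containsAllocation` and `shouldDelayInlining` are hypothetical names, and the input is assumed to be the callee's already-decoded opcode sequence with operands stripped.

```java
import java.util.Set;

public class DelayInliningHeuristic {
    // JVM opcodes that allocate: new, newarray, anewarray, multianewarray.
    static final Set<Integer> ALLOCATING_OPCODES = Set.of(0xbb, 0xbc, 0xbd, 0xc5);

    // Hypothetical helper: 'opcodes' holds only opcodes (no operands), so
    // no instruction-length decoding is needed in this sketch.
    static boolean containsAllocation(int[] opcodes) {
        for (int op : opcodes) {
            if (ALLOCATING_OPCODES.contains(op)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical policy: once the node budget is exceeded, delay inlining
    // only for callees whose own bytecode is allocation-free; otherwise
    // keep the existing behavior (fail with NodeCountInliningCutoff).
    static boolean shouldDelayInlining(boolean overNodeBudget, int[] calleeOpcodes) {
        return overNodeBudget && !containsAllocation(calleeOpcodes);
    }

    public static void main(String[] args) {
        int[] getter = {0x2a, 0xb4, 0xac};          // aload_0, getfield, ireturn
        int[] factory = {0xbb, 0x59, 0xb7, 0xb0};   // new, dup, invokespecial, areturn
        System.out.println(shouldDelayInlining(true, getter));
        System.out.println(shouldDelayInlining(true, factory));
    }
}
```

Note that this only looks at the callee's own bytecode: an allocation-free callee can still call something that allocates, which is part of why I'm unsure a check like this would be sufficient.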

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4304089914
