On Thu, 23 Apr 2026 07:53:30 GMT, Quan Anh Mai <[email protected]> wrote:
>> ## Preliminaries
>>
>> ### 1. The inlining heuristic NodeCountInliningCutoff
>>
>> In general, we don't want to inline a call if the graph is already too
>> large. However, it is hard to decide whether the graph is large when we are
>> still constructing the graph.
>>
>> - There are still more bytecodes that need parsing, and more nodes that need
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead
>> during parsing.
>>
>> Due to the issues above, the heuristic depends on the number of generated
>> nodes, which is an upper bound of the number of live nodes, and the
>> threshold is pretty conservative.
>>
>> ### 2. Inlining a method
>>
>> To inline a method, C2 needs to generate the structure for the callee to
>> reside in. This includes the map for the exception path, the map for the
>> merge of all normal paths, their memory states, etc. My experiment shows
>> that, inlining a call generates around 20 more nodes than if the call is
>> inlined in the source code.
>>
>>     private int v() {
>>         return this.v;
>>     }
>>
>>     int test1() {
>>         return this.v();
>>     }
>>
>>     int test2() {
>>         return this.v;
>>     }
>>
>> This means that, inlining a call consumes the budget of
>> `NodeCountInliningCutoff`, which may prevent other calls from being inlined,
>> even if other heuristics say that inlining is preferable. However, in
>> practice, it is rarely an issue, because there is a difference of 3 orders
>> of magnitude between the extra nodes generated by inlining, and the default
>> value of `NodeCountInliningCutoff` (16000).
>>
>> ### 3. Foreign memory access API
>>
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of
>> magnitude larger than the number of extra nodes generated when inlining a
>> call is broken due to how the FMA API is implemented. A memory access such
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs
>> inlining:
>>
>> @ 8 jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)
>> force inline by annotation callee changed to
>> io.github.merykitty.BenchmarkDraft::test1 (14 bytes) -> TypeProfile
>> (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>> @ 1
>> jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24
>> bytes) force inline by annotation
>> @ 8 java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes) force
>> inline by annotation
>> @ 3 java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29
>> bytes) force inline by annot...
>
> Quan Anh Mai has updated the pull request incrementally with one additional
> commit since the last revision:
>
> add benchmark
I was looking at the `reinterpret` variant of the benchmark, which is still
slow. The issue there has to do with escape analysis getting "stuck" on such a
big IR. Unfortunately, I was able to reproduce this even without FFM/restricted methods:
https://github.com/mcimadamore/jdk/blob/6b51dfae06ad935a5487698fe2ab93babdcc4ade/test/micro/org/openjdk/bench/java/lang/EscapeAnalysisDupAccess.java
When the optimization in this PR is enabled, we get something like this:
Benchmark                           Mode  Cnt      Score      Error  Units
EscapeAnalysisDupAccess.copyDirect  avgt    3    157.055 ±    1.428  ns/op
EscapeAnalysisDupAccess.copyDup1    avgt    3    161.035 ±   18.928  ns/op
EscapeAnalysisDupAccess.copyDup2    avgt    3  18519.558 ± 7707.574  ns/op
If we disable the optimization (e.g. using `-XX:-IncrementalInline`), then we
get this:
Benchmark                           Mode  Cnt     Score     Error  Units
EscapeAnalysisDupAccess.copyDirect  avgt    3   159.383 ±  27.783  ns/op
EscapeAnalysisDupAccess.copyDup1    avgt    3   766.249 ±  40.345  ns/op
EscapeAnalysisDupAccess.copyDup2    avgt    3  2324.964 ±  93.478  ns/op
I think this comparison is interesting because it shows many things at once:
* delayed inlining doesn't affect `copyDirect`
* delayed inlining _improves_ `copyDup1`
* delayed inlining significantly _regresses_ `copyDup2`
That is, as long as escape analysis can keep up, delayed inlining wins. But
when escape analysis stops working (and C2 never finishes compiling the
benchmark method), then the numbers are much worse than when no inlining occurs
at all.
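To put rough numbers on that, here is a small sketch that computes the ratios between the two runs. The scores are copied from the tables above; the class and method names are made up for illustration. With only 3 iterations and a wide error bar on `copyDup2`, the ratios should be read as approximate.

```java
public class DelayedInliningRatios {
    // Scores in ns/op copied from the two JMH tables (lower is better).
    static final double DUP1_DELAYED = 161.035;    // optimization enabled
    static final double DUP1_BASELINE = 766.249;   // optimization disabled
    static final double DUP2_DELAYED = 18519.558;  // optimization enabled
    static final double DUP2_BASELINE = 2324.964;  // optimization disabled

    // Hypothetical helper: how many times larger 'slower' is than 'faster'.
    static double ratio(double slower, double faster) {
        return slower / faster;
    }

    public static void main(String[] args) {
        System.out.printf("copyDup1: delayed inlining is %.2fx faster%n",
                ratio(DUP1_BASELINE, DUP1_DELAYED));
        System.out.printf("copyDup2: delayed inlining is %.2fx slower%n",
                ratio(DUP2_DELAYED, DUP2_BASELINE));
    }
}
```

So `copyDup1` gets a bit under 5x faster, while `copyDup2` gets roughly 8x slower.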
When the optimization is disabled, we can clearly see inlining failures:
@ 274 org.openjdk.bench.java.lang.EscapeAnalysisDupAccess$Dup2Foo::z (10
bytes) failed to inline: NodeCountInliningCutoff
In some way, these failures "protect" escape analysis from combinatorial
explosion.
So the questions are:
* is this a problem?
* if so, is this a problem unique to escape analysis?
* or might other phases exhibit similar runaway behavior as the IR graph grows?
Depending on what the answer is (and I'm not expert enough on C2 to comment on
this), different approaches might be required. E.g. if it's "just" an EA issue,
we might only delay inlining if there's no allocation in the bytecode of the
method to be inlined. This is probably not hard to do. The question is: is that
enough?
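
To make the "no allocation in the bytecode" idea concrete, here is a minimal sketch of such a check. It is written in Java purely for illustration and is not C2 code: the class, the `mayDelayInlining` predicate, and the assumption that the opcodes have already been decoded (operands stripped) are all hypothetical. Only the opcode values come from the JVM specification.

```java
import java.util.Set;

public class AllocationFreeCheck {
    // JVM opcodes that allocate an object or an array.
    static final int NEW = 0xbb;
    static final int NEWARRAY = 0xbc;
    static final int ANEWARRAY = 0xbd;
    static final int MULTIANEWARRAY = 0xc5;

    static final Set<Integer> ALLOCATING =
            Set.of(NEW, NEWARRAY, ANEWARRAY, MULTIANEWARRAY);

    // Hypothetical predicate: delay inlining only if the callee's own
    // bytecode contains no allocation. 'opcodes' is assumed to hold one
    // entry per instruction, with operands already removed.
    static boolean mayDelayInlining(int[] opcodes) {
        for (int op : opcodes) {
            if (ALLOCATING.contains(op)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // aload_0, getfield, ireturn: a trivial getter
        int[] getter = {0x2a, 0xb4, 0xac};
        // new, dup, invokespecial, areturn: allocates and returns an object
        int[] factory = {0xbb, 0x59, 0xb7, 0xb0};

        System.out.println(mayDelayInlining(getter));
        System.out.println(mayDelayInlining(factory));
    }
}
```

Note that a check like this only sees the callee's own bytecode. Allocations in methods that the callee itself calls would go unnoticed, which is part of why I am unsure it would be enough.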
-------------
PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4304089914