Re: RFR: 8382700: C2: Delay inlining instead of giving up when hit NodeCountInliningCutoff [v2]

Maurizio Cimadamore Fri, 24 Apr 2026 02:20:42 -0700

On Thu, 23 Apr 2026 07:53:30 GMT, Quan Anh Mai <[email protected]> wrote:


>> ## Preliminaries
>> 
>> ### 1. The inlining heuristic NodeCountInliningCutoff
>> 
>> In general, we don't want to inline a call if the graph is already too 
>> large. However, it is hard to decide whether the graph is large when we are 
>> still constructing the graph.
>> 
>> - There are still more bytecodes that need parsing, and more nodes that need 
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead 
>> during parsing.
>> 
>> Due to the issues above, the heuristic depends on the number of generated 
>> nodes, which is an upper bound of the number of live nodes, and the 
>> threshold is pretty conservative.
>> 
>> ### 2. Inlining a method
>> 
>> To inline a method, C2 needs to generate the structure for the callee to 
>> reside in. This includes the map for the exception path, the map for the 
>> merge of all normal paths, their memory states, etc. My experiment shows
>> that inlining a call generates around 20 more nodes than if the callee's
>> body were written directly at the call site in the source code.
>> 
>>     private int v() {
>>         return this.v;
>>     }
>> 
>>     int test1() {
>>         return this.v();
>>     }
>> 
>>     int test2() {
>>         return this.v;
>>     }
>> 
>> This means that inlining a call itself consumes part of the
>> `NodeCountInliningCutoff` budget, which may prevent other calls from being
>> inlined even if other heuristics say that inlining is preferable. In
>> practice, however, this is rarely an issue, because there is a difference
>> of 3 orders of magnitude between the extra nodes generated by inlining and
>> the default value of `NodeCountInliningCutoff` (16000).
>> 
>> ### 3. Foreign memory access API
>> 
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of 
>> magnitude larger than the number of extra nodes generated when inlining a 
>> call is broken due to how the FMA API is implemented. A memory access such 
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs 
>> inlining:
>> 
>>     @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   force inline by annotation   callee changed to  io.github.merykitty.BenchmarkDraft::test1 (14 bytes)    -> TypeProfile (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>>       @ 1   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24 bytes)   force inline by annotation
>>       @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force inline by annotation
>>         @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annot...
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add benchmark

Thanks for the feedback.

> No idea if this is related to escape analysis and/or other algorithms. It is 
> entirely impractical, usually needing 15+ JMH warmup iterations to reach peak 
> performance, especially with `MemorySegment::reinterpret`

This specific issue is very likely related to the EA problem described here. I
found similar pathological behavior when `reinterpret` or `asSlice` is used in
the hot path together with the incremental inlining changes described in this
PR. EA simply can't keep up (for now) with such a large IR and ends up taking
a very long time, comparable to the overall benchmark execution.

IMHO, this stresses the fact that some of this magic is not free. Some tricks,
like doing an on-the-fly reinterpret, might work well in small synthetic
benchmarks, but they almost always blow up in more realistic conditions. In
its current state, I think this pattern causes more problems than it solves,
and, while it is an interesting stress test (e.g. to see how far we can push
C2), I don't think we should specifically tune for it.
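
To make the "hot path" pattern concrete, here is a minimal sketch; it is not taken from the benchmark under discussion. The class name, the 16-byte record layout and the field offset are assumptions chosen purely for illustration, and it uses `asSlice` rather than `reinterpret` only because `asSlice` is not a restricted method and so runs without extra flags (JDK 22 or later assumed). The first loop creates a temporary segment on every iteration, which has to be inlined and then removed by EA for the allocation to disappear; the second computes the same sum with plain offsets on the original segment.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class SliceInHotPath {
    // Hypothetical layout: 16-byte records with an int field at offset 4.
    static final long RECORD_SIZE = 16;
    static final long FIELD_OFFSET = 4;

    // Creates a fresh slice per iteration (the pattern discussed above).
    static long sumWithSlices(MemorySegment records, long count) {
        long sum = 0;
        for (long i = 0; i < count; i++) {
            MemorySegment record = records.asSlice(i * RECORD_SIZE, RECORD_SIZE);
            sum += record.get(ValueLayout.JAVA_INT, FIELD_OFFSET);
        }
        return sum;
    }

    // Same result, using offset arithmetic on the original segment only.
    static long sumWithOffsets(MemorySegment records, long count) {
        long sum = 0;
        for (long i = 0; i < count; i++) {
            sum += records.get(ValueLayout.JAVA_INT, i * RECORD_SIZE + FIELD_OFFSET);
        }
        return sum;
    }

    public static void main(String[] args) {
        long count = 4;
        try (Arena arena = Arena.ofConfined()) {
            // 8-byte alignment keeps every int field 4-byte aligned.
            MemorySegment records = arena.allocate(count * RECORD_SIZE, 8);
            for (long i = 0; i < count; i++) {
                records.set(ValueLayout.JAVA_INT, i * RECORD_SIZE + FIELD_OFFSET, (int) (i + 1));
            }
            System.out.println(sumWithSlices(records, count));
            System.out.println(sumWithOffsets(records, count));
        }
    }
}
```

Both methods return the same value; the sketch only illustrates the two source shapes and says nothing about how many nodes either one generates.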

> For FFM specifically, a refactoring away from VarHandles, at least for the 
> plain memory access methods available on `MemorySegment`, might be worth 
> exploring. I also hope `MemorySegment::reinterpret` could be simplified. 
> Ideas:
> 
> - Make the default implementations throw UOE, move the actual
>   implementations to `NativeMemorySegmentImpl`. This removes the need for the
>   `!isNative()` check.
>
> - Inline `reinterpretInternal` into `reinterpret`. This eliminates the
>   `cleanupAction` conditional, which should help escape analysis. It also
>   eliminates the call completely for the `MemorySegment reinterpret(long
>   newSize)` overload (the most important for us).
>
> - Uhm... `Reflection.ensureNativeAccess`. I get the importance, but I wish
>   this could be called once per module, instead of carrying its complexity in
>   every call-site.

I agree with all the points above.

Btw, one thing I found adds more cost than anticipated is the fluent style
adopted by the LWJGL accessors. Fluent setters seem bigger than void setters,
and instance accessors seem generally quite a bit bigger than static accessors
that take a memory segment parameter (jextract style). All these little
factors contribute to what we see in the end.
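
For readers unfamiliar with the two accessor shapes, a minimal sketch follows. Everything in it is hypothetical: the struct `{ int x; int y; }`, the class names and the offsets are made up for illustration, and neither class is actual LWJGL or jextract output. It only shows what "fluent instance accessors" and "static accessors taking a segment" look like at the source level (JDK 22 or later assumed).

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class AccessorShapes {
    // Hypothetical struct { int x; int y; }: two 4-byte ints, 8 bytes total.
    static final long X_OFFSET = 0;
    static final long Y_OFFSET = 4;
    static final long SIZE = 8;

    // Shape 1: instance wrapper whose setters return `this` (fluent style).
    static final class FluentPoint {
        private final MemorySegment segment;

        FluentPoint(MemorySegment segment) { this.segment = segment; }

        int x() { return segment.get(ValueLayout.JAVA_INT, X_OFFSET); }
        int y() { return segment.get(ValueLayout.JAVA_INT, Y_OFFSET); }

        FluentPoint x(int value) {
            segment.set(ValueLayout.JAVA_INT, X_OFFSET, value);
            return this;
        }

        FluentPoint y(int value) {
            segment.set(ValueLayout.JAVA_INT, Y_OFFSET, value);
            return this;
        }
    }

    // Shape 2: static accessors that take the segment as a parameter,
    // with void setters.
    static final class StaticPoint {
        private StaticPoint() {}

        static int x(MemorySegment struct) { return struct.get(ValueLayout.JAVA_INT, X_OFFSET); }
        static int y(MemorySegment struct) { return struct.get(ValueLayout.JAVA_INT, Y_OFFSET); }

        static void x(MemorySegment struct, int value) { struct.set(ValueLayout.JAVA_INT, X_OFFSET, value); }
        static void y(MemorySegment struct, int value) { struct.set(ValueLayout.JAVA_INT, Y_OFFSET, value); }
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment point = arena.allocate(SIZE, 4);
            new FluentPoint(point).x(1).y(2);                  // fluent, instance-based
            StaticPoint.x(point, StaticPoint.x(point) + 10);   // static, void setter
            System.out.println(StaticPoint.x(point) + " " + StaticPoint.y(point));
        }
    }
}
```

Both shapes read and write the same memory; the difference is only in how much wrapper code sits around each access.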

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4312018397
