Re: RFR: 8382700: C2: Delay inlining instead of giving up when hit NodeCountInliningCutoff

Quan Anh Mai Thu, 23 Apr 2026 01:13:48 -0700

On Wed, 22 Apr 2026 11:30:08 GMT, Maurizio Cimadamore <[email protected]> 
wrote:


>> ## Preliminaries
>> 
>> ### 1. The inlining heuristic NodeCountInliningCutoff
>> 
>> In general, we don't want to inline a call if the graph is already too 
>> large. However, it is hard to decide whether the graph is large when we are 
>> still constructing the graph.
>> 
>> - There are still more bytecodes that need parsing, and more nodes that need 
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead 
>> during parsing.
>> 
>> Due to the issues above, the heuristic depends on the number of generated 
>> nodes, which is an upper bound of the number of live nodes, and the 
>> threshold is pretty conservative.
>> 
>> ### 2. Inlining a method
>> 
>> To inline a method, C2 needs to generate the structure for the callee to 
>> reside in. This includes the map for the exception path, the map for the 
>> merge of all normal paths, their memory states, etc. My experiment shows 
>> that inlining a call generates around 20 more nodes than manually inlining 
>> the callee in the source code:
>> 
>>     private int v() {
>>         return this.v;
>>     }
>> 
>>     int test1() {
>>         return this.v();
>>     }
>> 
>>     int test2() {
>>         return this.v;
>>     }
>> 
>> This means that inlining a call itself consumes part of the 
>> `NodeCountInliningCutoff` budget, which may prevent other calls from being 
>> inlined even if other heuristics say that inlining is preferable. However, in 
>> practice, it is rarely an issue, because there is a difference of 3 orders 
>> of magnitude between the extra nodes generated by inlining, and the default 
>> value of `NodeCountInliningCutoff` (16000).
>> 
>> ### 3. Foreign memory access API
>> 
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of 
>> magnitude larger than the number of extra nodes generated when inlining a 
>> call is broken due to how the FMA API is implemented. A memory access such 
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs 
>> inlining:
>> 
>>     @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   force inline by annotation   callee changed to  io.github.merykitty.BenchmarkDraft::test1 (14 bytes)    -> TypeProfile (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>>       @ 1   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24 bytes)   force inline by annotation
>>       @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force inline by annotation
>>         @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annot...
>
> Note: I'd still advise against using `copySegmentReinterpret` as a general 
> strategy. That trick can, under ideal conditions, effectively remove all kinds 
> of bounds checks. That said, as shown here, under more realistic conditions, 
> it just doesn't perform very well, while simpler and more idiomatic solutions 
> basically reach parity with Unsafe (at least in this benchmark).

@mcimadamore @vnkozlov Thanks for taking a look, I have added the benchmark.

@iwanowww Thanks a lot for your comment.

> First, I'd like to clarify the root cause of the problem. It's not specific 
> to FFM, but relates to how `MethodHandle`s and `VarHandle`s are implemented.

I would say it is intensified by how FFM is implemented. If you look at the 
inline tree of `AbstractMemorySegmentImpl::get` above, most of the calls are 
below `java.lang.invoke.VarHandleSegmentAsInts::get`. So, while it is often an 
issue with `MethodHandle`s and `VarHandle`s, it is more severe with the FFM API.

> So, the question I have is what kind of performance testing besides 
> microbenchmarks have you run on it? I'd like to get a better understanding 
> how severe the risks of overinlining are and then decide how to better 
> address the issue.

I am running some benchmarks that measure compiler performance, as well as 
the familiar benchmark suites Renaissance, DaCapo, and SPECjbb.

As for overinlining, I think the risk is really small, because normally we 
hit `DesiredMethodLimit` first: it is based on the total bytecode size and has 
a much smaller threshold. The reason we usually see `NodeCountInliningCutoff` 
with the FFM API is that all of the methods in its implementation are 
`@ForceInline` (`@FI`), and for `@FI` methods 
`ciMethod::code_size_for_inlining` returns 1, which means they hide themselves 
from this heuristic, allowing us to reach `NodeCountInliningCutoff`. In fact, 
if I remove the part of `ciMethod::code_size_for_inlining` that returns 1 for 
`@FI` methods, we hit `DesiredMethodLimit` instead.
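
To make the interaction between the two budgets concrete, here is a toy model 
in Java. It is only a sketch and not HotSpot code: the class `InlineBudgetModel`, 
the `Callee` type, the per-callee sizes, and the value used for the bytecode 
budget are all made up for illustration. Only the 16000 node cutoff quoted in 
the PR description and the "size 1 for `@FI` methods" behaviour come from this 
thread.

```java
import java.util.Collections;
import java.util.List;

public class InlineBudgetModel {
    // Default quoted in the PR description above.
    public static final int NODE_COUNT_INLINING_CUTOFF = 16_000;
    // Placeholder value for the bytecode-size budget; an assumption for this sketch.
    public static final int DESIRED_METHOD_LIMIT = 8_000;

    /** Hypothetical callee: its bytecode size, the nodes it generates, and whether it is @ForceInline. */
    public static final class Callee {
        final int bytecodeSize;
        final int nodes;
        final boolean forceInline;

        public Callee(int bytecodeSize, int nodes, boolean forceInline) {
            this.bytecodeSize = bytecodeSize;
            this.nodes = nodes;
            this.forceInline = forceInline;
        }

        // Models the behaviour described above: @ForceInline methods report size 1.
        int codeSizeForInlining() {
            return forceInline ? 1 : bytecodeSize;
        }
    }

    public enum Stop { DESIRED_METHOD_LIMIT, NODE_COUNT_INLINING_CUTOFF, NONE }

    /** Inlines callees in order and reports which budget is exhausted first. */
    public static Stop firstLimitHit(List<Callee> callees) {
        int bytecodes = 0;
        int nodes = 0;
        for (Callee c : callees) {
            if (bytecodes >= DESIRED_METHOD_LIMIT) return Stop.DESIRED_METHOD_LIMIT;
            if (nodes >= NODE_COUNT_INLINING_CUTOFF) return Stop.NODE_COUNT_INLINING_CUTOFF;
            bytecodes += c.codeSizeForInlining();
            nodes += c.nodes;
        }
        return Stop.NONE;
    }

    public static void main(String[] args) {
        // 1000 identical callees of 40 bytecodes and 60 nodes each (made-up numbers).
        List<Callee> ordinary = Collections.nCopies(1000, new Callee(40, 60, false));
        List<Callee> forced = Collections.nCopies(1000, new Callee(40, 60, true));
        System.out.println("ordinary callees stop at: " + firstLimitHit(ordinary));
        System.out.println("@ForceInline callees stop at: " + firstLimitHit(forced));
    }
}
```

With these made-up numbers, ordinary callees exhaust the bytecode budget after 
200 inlines, while the `@FI` callees barely touch it and run into the node 
cutoff instead, which matches the behaviour I described.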

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4302743800
