On Thu, 23 Apr 2026 07:53:30 GMT, Quan Anh Mai <[email protected]> wrote:

>> ## Preliminaries
>> 
>> ### 1. The inlining heuristic NodeCountLimitCutoff
>> 
>> In general, we don't want to inline a call if the graph is already too 
>> large. However, it is hard to decide whether the graph is large when we are 
>> still constructing the graph.
>> 
>> - There are still more bytecodes that need parsing, and more nodes that need 
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead 
>> during parsing.
>> 
>> Due to the issues above, the heuristic depends on the number of generated 
>> nodes, which is an upper bound of the number of live nodes, and the 
>> threshold is pretty conservative.
>> 
>> ### 2. Inlining a method
>> 
>> To inline a method, C2 needs to generate the structure for the callee to 
>> reside in. This includes the map for the exception path, the map for the 
>> merge of all normal paths, their memory states, etc. My experiment shows 
>> that, inlining a call generates around 20 more nodes than if the call is 
>> inlined in the source code.
>> 
>>     private int v() {
>>         return this,v;
>>     }
>> 
>>     int test1() {
>>         return this.v();
>>     }
>> 
>>     int test2() {
>>         return this.v;
>>     }
>> 
>> This means that, inlining a call consumes the budget of 
>> `NodeCountInliningCutoff`, which may prevent other calls from being inlined, 
>> even if other heuristics say that inlining is preferable. However, in 
>> practice, it is rarely an issue, because there is a difference of 3 orders 
>> of magnitude between the extra nodes generated by inlining, and the default 
>> value of `NodeCountInliningCutoff` (16000).
>> 
>> ### 3. Foreign memory access API
>> 
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of 
>> magnitude larger than the number of extra nodes generated when inlining a 
>> call is broken due to how the FMA API is implemented. A memory access such 
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs 
>> inlining:
>> 
>>     @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   
>> force inline by annotation   callee changed to  
>> io.github.merykitty.BenchmarkDraft::test1 (14 bytes)    -> TypeProfile 
>> (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>>       @ 1   
>> jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24 
>> bytes)   force inline by annotation
>>       @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force 
>> inline by annotation
>>         @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 
>> bytes)   force inline by annot...
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add benchmark

Hello, thank you all for looking into this!

> Also pinging @Spasi (in case he doesn't want the benchmark added -- after 
> all, he wrote the initial version of it :-) )

No problem of course, I'm glad it helped! I've been testing all day with this 
patch, my feedback:

I can confirm that it works as intended and gives a lot of breathing room to 
most cases that were hitting `NodeCountLimitCutoff` before.

I can also confirm that `DesiredMethodLimit` keeps the inlining decisions in 
check. The two limits seem to pull back unrestricted inlining from different 
directions, but at around the same ballpark. Big methods that were hitting 
`NodeCountLimitCutoff` at a certain point, now hit `DesiredMethodLimit` at 
around the same point, +/- a few method calls. More reasonable methods that 
were nowhere close to `DesiredMethodLimit` but were easily hitting 
`NodeCountLimitCutoff`, now get optimized properly.

Testing was done in a series of benchmarks that specifically target this issue, 
on server applications with mostly vanilla Java code and on LWJGL-based client 
applications with typical off-heap access patterns and foreign function calls. 
I could not spot any performance regressions or instances of slower JIT 
compilation, but I see how this kind of change requires a lot of careful 
testing and validation.

If I were to highlight the most interesting scenario, it would be this:

https://github.com/LWJGL/lwjgl3/blob/f7909c672a8c5e93479a5114239c9ec5e83609bc/modules/samples/src/test/java/org/lwjgl/demo/vulkan/HelloVulkan.java#L1632

Excuse the unsavory code style (it intentionally matches the original C code), 
but it is a good example of actual Vulkan rendering code. This method is called 
on every frame and does the typical Vulkan dance of filling structs with values 
and calling API functions. It is quite big, so it has always been hitting 
`NodeCountLimitCutoff`.

* With this change and the Unsafe backend, it is the first time I see full 
optimization, everything is inlined and all allocations are scalar replaced.
* With this change and the FFM backend, it simply hits `DesiredMethodLimit` 
instead of `NodeCountLimitCutoff`.

The point at which this happens varies depending on the actual offheap access 
method used. Using an `EVERYTHING` segment is lighter than 
`MemorySegment::reinterpret`. However, "hiding" `MemorySegment::reinterpret` 
behind a `VarHandle` seems to help. For reference, like this:


static {
    try {
        var lookup = MethodHandles.lookup();

        var ofAddress = lookup.findStatic(MemorySegment.class, "ofAddress", 
MethodType.methodType(MemorySegment.class, long.class));
        var reinterpret = lookup.findVirtual(MemorySegment.class, 
"reinterpret", MethodType.methodType(MemorySegment.class, long.class));

        VH_JAVA_BYTE = createMemoryAccessVH(ValueLayout.JAVA_BYTE, ofAddress, 
reinterpret).withInvokeExactBehavior();
        VH_JAVA_SHORT_UNALIGNED = 
createMemoryAccessVH(ValueLayout.JAVA_SHORT_UNALIGNED, ofAddress, 
reinterpret).withInvokeExactBehavior();
        VH_JAVA_INT_UNALIGNED = 
createMemoryAccessVH(ValueLayout.JAVA_INT_UNALIGNED, ofAddress, 
reinterpret).withInvokeExactBehavior();
        VH_JAVA_LONG_UNALIGNED = 
createMemoryAccessVH(ValueLayout.JAVA_LONG_UNALIGNED, ofAddress, 
reinterpret).withInvokeExactBehavior();
        VH_JAVA_FLOAT_UNALIGNED = 
createMemoryAccessVH(ValueLayout.JAVA_FLOAT_UNALIGNED, ofAddress, 
reinterpret).withInvokeExactBehavior();
        VH_JAVA_DOUBLE_UNALIGNED = 
createMemoryAccessVH(ValueLayout.JAVA_DOUBLE_UNALIGNED, ofAddress, 
reinterpret).withInvokeExactBehavior();
                ...
}

private static VarHandle createMemoryAccessVH(ValueLayout layout, MethodHandle 
ofAddress, MethodHandle reinterpret) {
        var vh = layout.varHandle();

        vh = MethodHandles.insertCoordinates(vh, 1, 0L);
        vh = MethodHandles.filterCoordinates(vh, 0, 
MethodHandles.filterReturnValue(
                ofAddress,
                MethodHandles.insertArguments(reinterpret, 1, layout.byteSize())
        ));

        return vh;
}


Anyway, no matter what, we're still hitting `DesiredMethodLimit`. There is a 
workaround `product` flag for this however, `-XX:-ClipInlining`. This is a bad 
idea in general, but testing with it was interesting. Specifically, how it 
affects time-to-peak-performance. With this flag, all approaches are able to 
fully inline all calls. However, the time it takes for that to happen grows 
exponentially for the non-Unsafe approaches. No idea if this is related to 
escape analysis and/or other algorithms. It is entirely impractical, usually 
needing 15+ JMH warmup iterations to reach peak performance, especially with 
`MemorySegment::reinterpret`. Other than time, an equally interesting aspect is 
JIT resource usage in general. I've seen C1 level 3 compilations that are 
easily 100x the size (in number of instructions) of the final C2 compilation.

So, one might think that the next step is making `@ForceInline` methods not 
contribute to `DesiredMethodLimit`. That is definitely a bad idea with the 
current implementation.

The `NodeCountLimitCutoff` fix will likely help a lot of cases, not only FFM 
related, but any code that works with VH/MH behind the scenes.

For FFM specifically, a refactoring away from VarHandles, at least for the 
plain memory access methods available on `MemorySegment`, might be worth 
exploring. I also hope `MemorySegment::reinterpret` could be simplified. Ideas:

* Make the default implementations throw UOE, move the actual implementations 
to `NativeMemorySegmentImpl`. This removes the need for the `!isNative()` check.
* Inline `reinterpretInternal` into `reinterpret`. This eliminates the 
`cleanupAction` conditional, which should help escape analysis. It also 
eliminates the call completely for the `MemorySegment reinterpret(long 
newSize)` overload (the most important for us).
* Uhm... `Reflection.ensureNativeAccess`. I get the importance, but I wish this 
could be called once per module, instead of carrying its complexity in every 
call-site.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4308745143

Reply via email to