On Thu, 23 Apr 2026 07:53:30 GMT, Quan Anh Mai <[email protected]> wrote:
>> ## Preliminaries
>>
>> ### 1. The inlining heuristic NodeCountLimitCutoff
>>
>> In general, we don't want to inline a call if the graph is already too
>> large. However, it is hard to decide whether the graph is large when we are
>> still constructing the graph.
>>
>> - There are still more bytecodes that need parsing, and more nodes that need
>> generating.
>> - It is hard (maybe impossible) to reliably determine whether a node is dead
>> during parsing.
>>
>> Due to the issues above, the heuristic depends on the number of generated
>> nodes, which is an upper bound of the number of live nodes, and the
>> threshold is pretty conservative.
>>
>> ### 2. Inlining a method
>>
>> To inline a method, C2 needs to generate the structure for the callee to
>> reside in. This includes the map for the exception path, the map for the
>> merge of all normal paths, their memory states, etc. My experiment shows
>> that, inlining a call generates around 20 more nodes than if the call is
>> inlined in the source code.
>>
>> private int v() {
>> return this,v;
>> }
>>
>> int test1() {
>> return this.v();
>> }
>>
>> int test2() {
>> return this.v;
>> }
>>
>> This means that, inlining a call consumes the budget of
>> `NodeCountInliningCutoff`, which may prevent other calls from being inlined,
>> even if other heuristics say that inlining is preferable. However, in
>> practice, it is rarely an issue, because there is a difference of 3 orders
>> of magnitude between the extra nodes generated by inlining, and the default
>> value of `NodeCountInliningCutoff` (16000).
>>
>> ### 3. Foreign memory access API
>>
>> The aforementioned property that `NodeCountInliningCutoff` is 3 orders of
>> magnitude larger than the number of extra nodes generated when inlining a
>> call is broken due to how the FMA API is implemented. A memory access such
>> as `j.l.f.MemorySegment::get` results in a huge call tree that needs
>> inlining:
>>
>> @ 8 jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)
>> force inline by annotation callee changed to
>> io.github.merykitty.BenchmarkDraft::test1 (14 bytes) -> TypeProfile
>> (9083/9083 counts) = jdk/internal/foreign/NativeMemorySegmentImpl
>> @ 1
>> jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (24
>> bytes) force inline by annotation
>> @ 8 java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes) force
>> inline by annotation
>> @ 3 java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29
>> bytes) force inline by annot...
>
> Quan Anh Mai has updated the pull request incrementally with one additional
> commit since the last revision:
>
> add benchmark
Hello, thank you all for looking into this!
> Also pinging @Spasi (in case he doesn't want the benchmark added -- after
> all, he wrote the initial version of it :-) )
No problem of course, I'm glad it helped! I've been testing all day with this
patch, my feedback:
I can confirm that it works as intended and gives a lot of breathing room to
most cases that were hitting `NodeCountLimitCutoff` before.
I can also confirm that `DesiredMethodLimit` keeps the inlining decisions in
check. The two limits seem to pull back unrestricted inlining from different
directions, but at around the same ballpark. Big methods that were hitting
`NodeCountLimitCutoff` at a certain point, now hit `DesiredMethodLimit` at
around the same point, +/- a few method calls. More reasonable methods that
were nowhere close to `DesiredMethodLimit` but were easily hitting
`NodeCountLimitCutoff`, now get optimized properly.
Testing was done in a series of benchmarks that specifically target this issue,
on server applications with mostly vanilla Java code and on LWJGL-based client
applications with typical off-heap access patterns and foreign function calls.
I could not spot any performance regressions or instances of slower JIT
compilation, but I see how this kind of change requires a lot of careful
testing and validation.
If I were to highlight the most interesting scenario, it would be this:
https://github.com/LWJGL/lwjgl3/blob/f7909c672a8c5e93479a5114239c9ec5e83609bc/modules/samples/src/test/java/org/lwjgl/demo/vulkan/HelloVulkan.java#L1632
Excuse the unsavory code style (it intentionally matches the original C code),
but it is a good example of actual Vulkan rendering code. This method is called
on every frame and does the typical Vulkan dance of filling structs with values
and calling API functions. It is quite big, so it has always been hitting
`NodeCountLimitCutoff`.
* With this change and the Unsafe backend, it is the first time I see full
optimization, everything is inlined and all allocations are scalar replaced.
* With this change and the FFM backend, it simply hits `DesiredMethodLimit`
instead of `NodeCountLimitCutoff`.
The point at which this happens varies depending on the actual offheap access
method used. Using an `EVERYTHING` segment is lighter than
`MemorySegment::reinterpret`. However, "hiding" `MemorySegment::reinterpret`
behind a `VarHandle` seems to help. For reference, like this:
static {
try {
var lookup = MethodHandles.lookup();
var ofAddress = lookup.findStatic(MemorySegment.class, "ofAddress",
MethodType.methodType(MemorySegment.class, long.class));
var reinterpret = lookup.findVirtual(MemorySegment.class,
"reinterpret", MethodType.methodType(MemorySegment.class, long.class));
VH_JAVA_BYTE = createMemoryAccessVH(ValueLayout.JAVA_BYTE, ofAddress,
reinterpret).withInvokeExactBehavior();
VH_JAVA_SHORT_UNALIGNED =
createMemoryAccessVH(ValueLayout.JAVA_SHORT_UNALIGNED, ofAddress,
reinterpret).withInvokeExactBehavior();
VH_JAVA_INT_UNALIGNED =
createMemoryAccessVH(ValueLayout.JAVA_INT_UNALIGNED, ofAddress,
reinterpret).withInvokeExactBehavior();
VH_JAVA_LONG_UNALIGNED =
createMemoryAccessVH(ValueLayout.JAVA_LONG_UNALIGNED, ofAddress,
reinterpret).withInvokeExactBehavior();
VH_JAVA_FLOAT_UNALIGNED =
createMemoryAccessVH(ValueLayout.JAVA_FLOAT_UNALIGNED, ofAddress,
reinterpret).withInvokeExactBehavior();
VH_JAVA_DOUBLE_UNALIGNED =
createMemoryAccessVH(ValueLayout.JAVA_DOUBLE_UNALIGNED, ofAddress,
reinterpret).withInvokeExactBehavior();
...
}
private static VarHandle createMemoryAccessVH(ValueLayout layout, MethodHandle
ofAddress, MethodHandle reinterpret) {
var vh = layout.varHandle();
vh = MethodHandles.insertCoordinates(vh, 1, 0L);
vh = MethodHandles.filterCoordinates(vh, 0,
MethodHandles.filterReturnValue(
ofAddress,
MethodHandles.insertArguments(reinterpret, 1, layout.byteSize())
));
return vh;
}
Anyway, no matter what, we're still hitting `DesiredMethodLimit`. There is a
workaround `product` flag for this however, `-XX:-ClipInlining`. This is a bad
idea in general, but testing with it was interesting. Specifically, how it
affects time-to-peak-performance. With this flag, all approaches are able to
fully inline all calls. However, the time it takes for that to happen grows
exponentially for the non-Unsafe approaches. No idea if this is related to
escape analysis and/or other algorithms. It is entirely impractical, usually
needing 15+ JMH warmup iterations to reach peak performance, especially with
`MemorySegment::reinterpret`. Other than time, an equally interesting aspect is
JIT resource usage in general. I've seen C1 level 3 compilations that are
easily 100x the size (in number of instructions) of the final C2 compilation.
So, one might think that the next step is making `@ForceInline` methods not
contribute to `DesiredMethodLimit`. That is definitely a bad idea with the
current implementation.
The `NodeCountLimitCutoff` fix will likely help a lot of cases, not only FFM
related, but any code that works with VH/MH behind the scenes.
For FFM specifically, a refactoring away from VarHandles, at least for the
plain memory access methods available on `MemorySegment`, might be worth
exploring. I also hope `MemorySegment::reinterpret` could be simplified. Ideas:
* Make the default implementations throw UOE, move the actual implementations
to `NativeMemorySegmentImpl`. This removes the need for the `!isNative()` check.
* Inline `reinterpretInternal` into `reinterpret`. This eliminates the
`cleanupAction` conditional, which should help escape analysis. It also
eliminates the call completely for the `MemorySegment reinterpret(long
newSize)` overload (the most important for us).
* Uhm... `Reflection.ensureNativeAccess`. I get the importance, but I wish this
could be called once per module, instead of carrying its complexity in every
call-site.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/30874#issuecomment-4308745143