On Tue, 6 Aug 2024 17:26:55 GMT, Jorn Vernee <jver...@openjdk.org> wrote:

> As discussed in the JBS issue:
> 
> FFM upcall stubs embed a `Method*` of the target method in the stub. This 
> `Method*` is read from the `LambdaForm::vmentry` field associated with the 
> target method handle at the time when the upcall stub is generated. The MH 
> instance itself is stashed in a global JNI ref. So, there should be a 
> reachability chain to the holder class: `MH (receiver) -> LF (form) -> 
> MemberName (vmentry) -> ResolvedMethodName (method) -> Class<?> (vmholder)`
> 
> However, it appears that, due to multiple threads racing to initialize the 
> `vmentry` field of the `LambdaForm` of the target method handle of an upcall 
> stub, it is possible that the `vmentry` is updated _after_ we embed the 
> corresponding `Method*` into an upcall stub (or rather, the latest update is 
> not visible to the thread generating the upcall stub). Technically, it is 
> fine to keep using a 'stale' `vmentry`, but the problem is that now the 
> reachability chain is broken, since the upcall stub only extracts the target 
> `Method*`, and doesn't keep the stale `vmentry` reachable. The holder class 
> can then be unloaded, resulting in a crash.
> 
> The fix I've chosen for this is to mimic what we already do in 
> `MethodHandles::jump_to_lambda_form`, and re-load the `vmentry` field from 
> the target method handle each time. Luckily, this does not really seem to 
> impact performance.
> 
> <details>
> <summary>Performance numbers</summary>
> x64:
> 
> before:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  69.216 ± 1.791  ns/op
> 
> 
> after:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  67.787 ± 0.684  ns/op
> 
> 
> aarch64:
> 
> before:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  61.574 ± 0.801  ns/op
> 
> 
> after:
> 
> Benchmark             Mode  Cnt   Score   Error  Units
> Upcalls.panama_blank  avgt   30  61.218 ± 0.554  ns/op
> 
> </details>
> 
> As for the added TestUpcallStress test, it takes about 800 seconds to run 
> this test on the dev machine I'm using, so I've set the timeout quite high. 
> Since it runs for so long, I've dropped it from the default `jdk_foreign` 
> test suite, which runs in tier2. Instead the new test will run in tier4.
> 
> Testing: tier 1-4
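
To make the failure mode above concrete, here is a minimal Java sketch of the 
two shapes involved. All names are hypothetical stand-ins (this is not the 
actual HotSpot code): `Entry` plays the role of the `MemberName`/
`ResolvedMethodName` pair, the `long` stands in for the raw `Method*`, and the 
two methods contrast embedding a snapshot at stub-generation time with 
re-loading on every call, which is what the fix does conceptually.


class VmentrySketch {
    static final class Entry {        // stands in for MemberName (vmentry)
        final long method;            // stands in for the raw Method*
        final Object holder;          // stands in for Class<?> (vmholder)
        Entry(long method, Object holder) {
            this.method = method;
            this.holder = holder;
        }
    }

    volatile Entry vmentry;           // stands in for LambdaForm::vmentry
    long embeddedMethod;              // raw value baked into the generated stub

    // Buggy shape: copy the raw value out once, at stub-generation time.
    // If another thread then installs a fresh Entry, the old one (and the
    // holder it kept alive) has no remaining root, but 'embeddedMethod'
    // still points into whatever it used to guard.
    void generateStub() {
        embeddedMethod = vmentry.method;
    }

    // Fixed shape: follow the live reference chain on every upcall, so the
    // Entry backing the value we use stays reachable while we use it.
    long onUpcall() {
        return vmentry.method;
    }
}
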

I had to adjust the stub size on x64 when running on fastdebug/shenandoah. That 
may also be needed on other platforms (except for aarch64), but I can't test 
there.

To find the right stub size, just increase `upcall_stub_code_base_size` to a 
high number (e.g. 2048), and then run this program:


import java.lang.foreign.*;
import java.lang.invoke.*;

public class Main {
    public static void main(String[] args) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            // Generating an upcall stub makes the upcall_stub blob (and its
            // size) show up in the LogCompilation output.
            MemorySegment stub = Linker.nativeLinker().upcallStub(
                MethodHandles.empty(MethodType.methodType(void.class)),
                FunctionDescriptor.ofVoid(),
                arena);
        }
    }
}



$ javac -d classes Main.java
$ java -cp classes -XX:+UseShenandoahGC -XX:+LogCompilation Main
$ grep upcall_stub -A 1 hotspot_pid*
<blob name='upcall_stub_()V' total_size='1360'>
<sect index='1' capacity='1360' size='1195' remaining='165'/>


Then set `upcall_stub_code_base_size` to a value larger than the reported 
`size`.

Using a static stub is still a bit slower, but much more in line with the 
performance of inline loads:


Current:

Benchmark                  Mode  Cnt    Score   Error  Units
QSort.panama_upcall_qsort  avgt   30  608.953 ± 3.047  ns/op


Fully C++:

Benchmark                  Mode  Cnt    Score   Error  Units
QSort.panama_upcall_qsort  avgt   30  725.142 ± 2.718  ns/op   ~19% slower


Static stub:

Benchmark                  Mode  Cnt    Score   Error  Units
QSort.panama_upcall_qsort  avgt   30  627.661 ± 2.099  ns/op   ~3% slower


I think that's a good compromise. (Although I wish the C++ code were just as 
fast, as it's much nicer.)

> Some of the DecoratorSet should be applicable and improve performance.

I gave `AS_NO_KEEPALIVE` a try. I'm not sure whether it's correct here, but it 
didn't really change performance. I'm not sure what other decorator would 
apply. I was looking at `ACCESS_READ`, but it seems that it cannot be used for 
these loads.

I've added the implementations with the stubs. I had to eyeball `s390`, `ppc`, 
and `riscv`, so testing is still needed on those platforms (GHA will do the 
cross builds).

I also spent quite a while messing with the test. This test is quite unstable, 
because it creates a new thread pool for each test case and then calls 
`ExecutorService::shutdownNow`, which doesn't allow submitted tasks to finish. 
It was running out of memory on some of the Mac machines we test on in CI. 
I've made several attempts to make it more stable, but all of those resulted 
in the issue no longer reproducing. For now I've restricted the test to 
linux/aarch64, since that's where we see the issue, and it seems stable enough 
to pass every time, at least in our CI. If it causes issues, though, I think 
we might have to just drop the test, or maybe mark it as `/manual` so it 
doesn't run in CI.
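
For reference, the unstable shape is roughly the following (a hypothetical 
sketch, not the actual TestUpcallStress code; the real test generates and 
invokes upcall stubs inside the submitted tasks):


import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StressShape {
    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int t = 0; t < 8; t++) {
                pool.submit(() -> {
                    // real test: generate an upcall stub and invoke it here
                });
            }
            // shutdownNow() abandons queued tasks instead of draining them,
            // which keeps memory and timing pressure high -- good for shaking
            // out the race, bad for test stability.
            pool.shutdownNow();
        }
    }
}
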

I'll be away until September. Will pick this back up then.

With the latest version, we are slightly faster than the baseline:


Baseline:

Benchmark                  Mode  Cnt    Score   Error  Units
QSort.panama_upcall_qsort  avgt   30  635.047 ± 2.181  ns/op

Patched:

Benchmark                  Mode  Cnt    Score   Error  Units
QSort.panama_upcall_qsort  avgt   30  625.385 ± 2.442  ns/op


@TheRealMDoerr Thanks for all the suggestions!

src/hotspot/cpu/x86/upcallLinker_x86_64.cpp line 311:

> 309:                     Address(rbx, 
> java_lang_invoke_ResolvedMethodName::vmtarget_offset()),
> 310:                     noreg, noreg);
> 311:   __ movptr(Address(r15_thread, JavaThread::callee_target_offset()), 
> rbx); // just in case callee is deoptimized

FWIW, I tried to move this code to `UpcallLinker::on_entry`, but there is a 
speed cost to that (around 10ns on my machine). Doing that results in 
several out-of-line calls to 
`AccessInternal::RuntimeDispatch<...>::_load_at_func` to load the values. It 
seems like a tradeoff between speed and space (in the code cache). I went with 
speed here.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2273455863
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2277864015
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2278175462
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2278557487
PR Comment: https://git.openjdk.org/jdk/pull/20479#issuecomment-2326568409
PR Review Comment: https://git.openjdk.org/jdk/pull/20479#discussion_r1706122881
