On Mon, 31 Jul 2023 12:22:00 GMT, Yasumasa Suenaga <[email protected]> wrote:
> In FFM, a native function is called via `nep_invoker_blob`. If the
> function has two arguments, the stub looks like this:
>
>
> Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
> --------------------------------------------------------------------------------
> 0x00007fcae394cd80: pushq %rbp
> 0x00007fcae394cd81: movq %rsp, %rbp
> 0x00007fcae394cd84: subq $0, %rsp
> ;; { argument shuffle
> 0x00007fcae394cd88: movq %r8, %rax
> 0x00007fcae394cd8b: movq %rsi, %r10
> 0x00007fcae394cd8e: movq %rcx, %rsi
> 0x00007fcae394cd91: movq %rdx, %rdi
> ;; } argument shuffle
> 0x00007fcae394cd94: callq *%r10
> 0x00007fcae394cd97: leave
> 0x00007fcae394cd98: retq
>
>
> `subq $0, %rsp` reserves shadow space on the stack, and `movq %r8, %rax` sets
> the number of vector arguments for a variadic function, so neither is needed
> in some cases. They should be removed when they are not needed, as follows:
>
>
> Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
> --------------------------------------------------------------------------------
> 0x00007fd8778e2880: pushq %rbp
> 0x00007fd8778e2881: movq %rsp, %rbp
> ;; { argument shuffle
> 0x00007fd8778e2884: movq %rsi, %r10
> 0x00007fd8778e2887: movq %rcx, %rsi
> 0x00007fd8778e288a: movq %rdx, %rdi
> ;; } argument shuffle
> 0x00007fd8778e288d: callq *%r10
> 0x00007fd8778e2890: leave
> 0x00007fd8778e2891: retq
>
>
> All java/foreign jtreg tests pass.
>
> These stubs can be seen by running the [ffmasm
> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/examples/cpumodel)
> with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode` and the hsdis library.
> This testcase links the code with `Linker.Option.isTrivial()`.
>
> After this change, FFM performance on [another ffmasm
> testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/benchmarks/funccall)
> improved:
>
> before:
>
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
> FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s
>
>
> after:
>
> Benchmark                           Mode  Cnt          Score          Error  Units
> FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
> FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s
>
>
> Environment:
> * CPU: AMD Ryzen 3 3300X
> * OS: Fedora 38 x86_64 (Kernel 6.3.8-200.fc38.x86_64)
> * Hyper-V 4vCPU, 8GB mem

FWIW, if you want to look into reducing the generated code further, I think we
can potentially reduce the amount of shuffling between registers that's needed
by reordering the arguments on the Java side so that each VMStorage
corresponding to an argument of the leaf method handle is the same as the
register for that argument in the Java calling convention.

I think the right place to do this is in DowncallLinker, where we create
the NativeEntryPoint. The way I think it should work:

1. compute the Java calling convention's argument registers for the leaf method
type.
2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered method
type, such that the VMStorage/type for a particular argument index matches the
register for the same index used in the Java calling convention as much as
possible.
3. use these 2 to create the native entry point + native method handle
4. apply the same reordering to the created native method handle (using
MethodHandles::permuteArguments) so that the resulting method handle has the
original argument order/method type.

Pushing this shuffling to the Java side should let the JIT reduce data
movement, which I think would reduce the shuffling needed overall.
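
Step 4 above can be sketched with plain method handles. This is only an illustration under stated assumptions, not the DowncallLinker code: `PermuteSketch`, `reorderedLeaf`, and the `1, 0` permutation are hypothetical stand-ins for the native method handle and for whatever re-ordering step 2 would compute. Only `MethodHandles.permuteArguments` is the real API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class PermuteSketch {
    // Hypothetical stand-in for the native method handle that was created
    // with re-ordered arguments: it takes (long b, int a).
    static String reorderedLeaf(long b, int a) {
        return "a=" + a + ",b=" + b;
    }

    // reorder[i] is the index, in the original argument list, of the value
    // passed as argument i of the re-ordered handle.
    static MethodHandle restoreOriginalOrder(MethodHandle reordered,
                                             MethodType originalType,
                                             int... reorder) {
        return MethodHandles.permuteArguments(reordered, originalType, reorder);
    }

    static String demo() {
        try {
            MethodHandle reordered = MethodHandles.lookup().findStatic(
                    PermuteSketch.class, "reorderedLeaf",
                    MethodType.methodType(String.class, long.class, int.class));
            // The type callers expect: (int a, long b).
            MethodType originalType =
                    MethodType.methodType(String.class, int.class, long.class);
            // Re-ordered argument 0 (b) comes from original index 1,
            // re-ordered argument 1 (a) comes from original index 0.
            MethodHandle original = restoreOriginalOrder(reordered, originalType, 1, 0);
            return (String) original.invokeExact(1, 2L);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "a=1,b=2"
    }
}
```

Callers still see the original `(int, long)` signature, while the handle underneath receives its arguments in the re-ordered positions.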
-------------
PR Comment: https://git.openjdk.org/jdk/pull/15089#issuecomment-1661382669