On Thu, Mar 14, 2024 at 04:07:33PM +0100, Bj"orn T"opel wrote:
> After reading Mark's reply, and discussing with OpenJDK folks (who does
> the most crazy text patching on all platforms), having to patch multiple
> instructions (where the address materialization is split over multiple
> instructions) is a no-go. It's just a too big can of worms. So, if we
> can only patch one insn, it's CALL_OPS.
> 
> A couple of options (in addition to Andy's), and all require a
> per-function landing address ala CALL_OPS) tweaking what Mark is doing
> on Arm (given the poor branch range).
> 
> ..and maybe we'll get RISC-V rainbows/unicorns in the future getting
> better reach (full 64b! ;-)).
> 
> A) Use auipc/jalr, only patch jalr to take us to a common
>    dispatcher/trampoline
>   
>  | <func_trace_target_data_8B> # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   aupic
>  |   nop <=> jalr # Text patch point -> common_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | common_dispatch:
>  |   load <func_trace_target_data_8B> based on ra
>  |   jalr
>  |   ...
> 
> The auipc is never touched, and will be overhead. Also, we need a mv to
> store ra in a scratch register as well -- like Arm. We'll have two insn
> per-caller overhead for a disabled caller.

Is the AUIPC a significant overhead? IIUC that's similar to Arm's ADRP, and I'd
have expected that to be pretty cheap.

IIUC your JALR can choose which destination register to store the return
address in, and if so, you could leave the original ra untouched (and recover
that in the common trampoline). Have I misunderstood that?

Maybe that doesn't play nicely with something else?

> B) Use jal, which can only take us +/-1M, and requires multiple
>    dispatchers (and tracking which one to use, and properly distribute
>    them. Ick.)
> 
>  | <func_trace_target_data_8B> # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | within_1M_to_func_dispatch:
>  |   load <func_trace_target_data_8B> based on ra
>  |   jalr
> 
> C) Use jal, which can only take us +/-1M, and use a per-function
>    trampoline requires multiple dispatchers (and tracking which one to
>    use). Blows up text size A LOT.
> 
>  | <func_trace_target_data_8B> # somewhere, but probably on a different 
> cacheline than the .text to avoid ping-ongs
>  | ...
>  | per_func_dispatch
>  |   load <func_trace_target_data_8B> based on ra
>  |   jalr
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> per_func_dispatch
>  |   ACTUAL_FUNC

Beware that with option (C) you'll need to handle that in your unwinder for
RELIABLE_STACKTRACE. If you don't have a symbol for per_func_dispatch (or
func_trace_target_data_8B), PC values within per_func_dispatch would be
symbolized as the prior function/data.

> It's a bit sad that we'll always have to have a dispatcher/trampoline,
> but it's still better than stop_machine(). (And we'll need a fencei IPI
> as well, but only one. ;-))
> 
> Today, I'm leaning towards A (which is what Mark suggested, and also
> Robbin).. Any other options?

Assuming my understanding of JALR above is correct, I reckon A is the nicest
option out of A/B/C.

Mark.

Reply via email to