Mark,

Mark Rutland <mark.rutl...@arm.com> writes:

>> A) Use auipc/jalr, only patch jalr to take us to a common
>>    dispatcher/trampoline
>>
>> | <func_trace_target_data_8B> # probably on a data cache-line != func .text
>> |                             # to avoid ping-pong
>> | ...
>> | func:
>> |   ...make sure ra isn't messed up...
>> |   auipc
>> |   nop <=> jalr # Text patch point -> common_dispatch
>> |   ACTUAL_FUNC
>> |
>> | common_dispatch:
>> |   load <func_trace_target_data_8B> based on ra
>> |   jalr
>> |   ...
>>
>> The auipc is never touched, and will be overhead. Also, we need a mv to
>> store ra in a scratch register as well -- like Arm. We'll have two insn
>> per-caller overhead for a disabled caller.
>
> Is the AUIPC a significant overhead? IIUC that's similar to Arm's ADRP, and
> I'd have expected that to be pretty cheap.

No, reg-to-reg moves are dirt cheap in my book.

> IIUC your JALR can choose which destination register to store the return
> address in, and if so, you could leave the original ra untouched (and recover
> that in the common trampoline). Have I misunderstood that?
>
> Maybe that doesn't play nicely with something else?

No, you're right, we can link to another register, and shave off an
instruction. I can imagine that some implementations prefer x1/x5 for
branch prediction reasons, but that's something that we can measure.

So, 1-2 movs + nop are unconditionally executed on the disabled case.
(1-2 depending on the ra save/jalr reg strategy).

>> B) Use jal, which can only take us +/-1M, and requires multiple
>>    dispatchers (and tracking which one to use, and properly distribute
>>    them. Ick.)
>>
>> | <func_trace_target_data_8B> # probably on a data cache-line != func .text
>> |                             # to avoid ping-pong
>> | ...
>> | func:
>> |   ...make sure ra isn't messed up...
>> |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
>> |   ACTUAL_FUNC
>> |
>> | within_1M_to_func_dispatch:
>> |   load <func_trace_target_data_8B> based on ra
>> |   jalr
>>
>> C) Use jal, which can only take us +/-1M, and use a per-function
>>    trampoline requires multiple dispatchers (and tracking which one to
>>    use). Blows up text size A LOT.
>>
>> | <func_trace_target_data_8B> # somewhere, but probably on a different
>> |                             # cacheline than the .text to avoid ping-pong
>> | ...
>> | per_func_dispatch
>> |   load <func_trace_target_data_8B> based on ra
>> |   jalr
>> | func:
>> |   ...make sure ra isn't messed up...
>> |   nop <=> jal # Text patch point -> per_func_dispatch
>> |   ACTUAL_FUNC
>
> Beware that with option (C) you'll need to handle that in your unwinder for
> RELIABLE_STACKTRACE. If you don't have a symbol for per_func_dispatch (or
> func_trace_target_data_8B), PC values within per_func_dispatch would be
> symbolized as the prior function/data.

Good point (but I don't like C much...)!

>> It's a bit sad that we'll always have to have a dispatcher/trampoline,
>> but it's still better than stop_machine(). (And we'll need a fence.i IPI
>> as well, but only one. ;-))
>>
>> Today, I'm leaning towards A (which is what Mark suggested, and also
>> Robbin). Any other options?
>
> Assuming my understanding of JALR above is correct, I reckon A is the nicest
> option out of A/B/C.

Yes! +1!


Björn
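
PS. For reference, a minimal Python sketch of the reach arithmetic behind
the options above: why a lone jal (B/C) is limited to +/-1M, and how a
pc-relative offset splits across the auipc/jalr pair in A. The function
names are made up for illustration and do not correspond to any kernel
code; the immediate widths are the ones from the base RISC-V ISA.

```python
def jal_reachable(pc, target):
    """True if a single JAL located at `pc` can jump to `target`.

    JAL carries a 21-bit signed immediate in units of 2 bytes, so the
    byte offset must be even and lie in [-2**20, 2**20 - 2].
    """
    off = target - pc
    return off % 2 == 0 and -(1 << 20) <= off <= (1 << 20) - 2


def split_auipc_jalr(off):
    """Split a pc-relative byte offset into (hi20, lo12) for auipc + jalr.

    jalr sign-extends its 12-bit immediate, so hi20 is rounded by adding
    0x800 before shifting; afterwards (hi20 << 12) + lo12 == off.
    """
    hi20 = (off + 0x800) >> 12
    lo12 = off - (hi20 << 12)
    if not -(1 << 19) <= hi20 < (1 << 19):
        raise ValueError("offset does not fit in an auipc/jalr pair")
    return hi20, lo12


if __name__ == "__main__":
    # A dispatcher 4 MiB away is out of reach for jal (options B/C),
    # but fits comfortably in an auipc/jalr pair (option A).
    func, dispatch = 0x80200000, 0x80600000
    print("jal reachable:", jal_reachable(func, dispatch))
    print("auipc/jalr split:", split_auipc_jalr(dispatch - func))
```

With a 4 MiB distance the jal check fails while the auipc/jalr split
succeeds, which is the reason B and C need dispatchers spread across the
text and A gets away with a single common one.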