Mark Rutland <mark.rutl...@arm.com> writes:

> Hi Bjorn
>
> (apologies, my corporate mail server has butchered your name here).

Ha! That's the price I have to pay for carrying double-umlauts
everywhere. Thanks for getting back with a really useful answer!

>> On Arm64, CALL_OPS makes it possible to implement direct calls, while
>> only patching one BL instruction -- nice!
>
> The key thing here isn't that we patch a single instruction (since we have to
> patch the ops pointer too!); it's that we can safely patch either of the ops
> pointer or BL/NOP at any time while threads are concurrently executing.

...which indeed is a very nice property!

> If you have a multi-instruction sequence, then threads can be preempted
> mid-sequence, and it's very painful/complex to handle all of the races that
> entails.

Oh yes. RISC-V is currently using auipc/jalr with stop_machine(), and
also requires that preemption is off. Unusable, to put it bluntly.

> For example, if your callsites use a sequence:
>
>     AUIPC <tmp>, <funcptr>
>     JALR <tmp2>, <funcptr>(<tmp>)
>
> Using stop_machine() won't allow you to patch that safely as some threads
> could be stuck mid-sequence, e.g.
>
>     AUIPC <tmp>, <funcptr>
>     [ preempted here ]
>     JALR <tmp2>, <funcptr>(<tmp>)
>
> ... and you can't update the JALR to use a new funcptr immediate until those
> have completed the sequence.
>
> There are ways around that, but they're complicated and/or expensive, e.g.
>
> * Use a sequence of multiple patches, starting with replacing the JALR with an
>   exception-generating instruction with a fixup handler, which is sort-of what
>   x86 does with UD2. This may require multiple passes with
>   synchronize_rcu_tasks() to make sure all threads have seen the latest
>   instructions, and that cannot be done under stop_machine(), so if you need
>   stop_machine() for CMODx reasons, you may need to use that several times
>   with intervening calls to synchronize_rcu_tasks().
>
> * Have the patching logic manually go over each thread and fix up the pt_regs
>   for the interrupted thread.
>   This is pretty horrid since you could have nested exceptions and a task
>   could have several pt_regs which might require updating.

Yup, and both of these have rather unpleasant overhead.

> The CALL_OPS approach is a bit easier to deal with as we can patch the
> per-callsite pointer atomically, then we can (possibly) enable/disable the
> callsite's branch, then wait for threads to drain once.
>
> As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
> generally in this area (CALL_OPS happens to side-step those, but trampoline
> usage is currently affected):
>
> https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/
>
> ... I'm looking into fixing that at the moment, and it looks like that's
> likely to require some per-architecture changes.
>
>> On RISC-V we cannot use the same ideas as Arm64 straight off,
>> because the range of jal (compare to BL) is simply too short (+/-1M).
>> So, on RISC-V we need to use a full auipc/jalr pair (the text patching
>> story is another chapter, but let's leave that aside for now). Since we
>> have to patch multiple instructions, the cmodx situation doesn't really
>> improve with CALL_OPS.
>
> The branch range thing is annoying, but I think this boils down to the same
> problem as arm64 has with needing a "MOV <tmp>, LR" instruction that we have
> to patch in once at boot time. You could do the same and patch in the AUIPC
> once, e.g. have
>
> | NOP
> | NOP
> | func:
> |     AUIPC <tmp>, <common_ftrace_caller>
> |     JALR <tmp2>, <common_ftrace_caller>(<tmp>) // patched with NOP
>
> ... which'd look very similar to arm64's sequence:
>
> | NOP
> | NOP
> | func:
> |     MOV X9, LR
> |     BL ftrace_caller // patched with NOP
>
> ... which I think means it *might* be better from a cmodx perspective?
>
>> Let's say that we continue building on your patch and implement direct
>> calls on CALL_OPS for RISC-V as well.
>>
>> From Florent's commit message for direct calls:
>>
>> | There are a few cases to distinguish:
>> | - If a direct call ops is the only one tracing a function:
>> |   - If the direct called trampoline is within the reach of a BL
>> |     instruction
>> |     -> the ftrace patchsite jumps to the trampoline
>> |   - Else
>> |     -> the ftrace patchsite jumps to the ftrace_caller trampoline which
>> |        reads the ops pointer in the patchsite and jumps to the direct
>> |        call address stored in the ops
>> | - Else
>> |   -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
>> |      ops literal points to ftrace_list_ops so it iterates over all
>> |      registered ftrace ops, including the direct call ops and calls its
>> |      call_direct_funcs handler which stores the direct called
>> |      trampoline's address in the ftrace_regs and the ftrace_caller
>> |      trampoline will return to that address instead of returning to the
>> |      traced function
>>
>> On RISC-V, where auipc/jalr is used, the direct called trampoline would
>> always be reachable, and then the first Else-clause would never be entered.
>> This means the performance for direct calls would be the same as the
>> one we have today (i.e. no regression!).
>>
>> RISC-V does like x86 does (-ish) -- patch multiple instructions, long
>> reach.
>>
>> Arm64 uses CALL_OPS and patches one instruction, BL.
>>
>> Now, with this background in mind, compared to what we have today,
>> CALL_OPS would give us (again assuming we're using it for direct calls):
>>
>> * Better performance for tracer per-call (faster ops lookup)  GOOD
>> * Larger text size (function alignment + extra nops)          BAD
>> * Same direct call performance                                NEUTRAL
>> * Same complicated text patching required                     NEUTRAL
>
> Is your current sequence safe for preemptible kernels (i.e. with
> PREEMPT_FULL=y or PREEMPT_DYNAMIC=y + "preempt=full" on the kernel cmdline) ?

It's very much not, and this was in fact presented by Andy (Cc) and
discussed at length at Plumbers two years back.

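To make the preemption hazard concrete, here is a minimal user-space
sketch in plain C (not kernel code). The helper names hi20(), lo12() and
auipc_jalr_target() are made up for illustration, and the sketch assumes
the usual split of a 32-bit PC-relative offset into a 20-bit upper part
for AUIPC and a signed 12-bit lower part for JALR. It models a thread
that ran the AUIPC before the callsite was patched and the JALR after:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; these are not kernel APIs. */

/* Part of the offset carried by AUIPC, rounded so the signed low part fits. */
static int64_t hi20(int64_t off)
{
	return (off + 0x800) & ~(int64_t)0xfff;
}

/* Part of the offset carried by JALR: a signed 12-bit immediate. */
static int64_t lo12(int64_t off)
{
	int64_t lo = off - hi20(off);

	assert(lo >= -2048 && lo <= 2047);
	return lo;
}

/*
 * Address reached when the AUIPC was encoded for auipc_off and the JALR
 * for jalr_off. Passing two different offsets models a thread that was
 * preempted between the two instructions while the callsite was patched.
 */
static uint64_t auipc_jalr_target(uint64_t pc, int64_t auipc_off,
				  int64_t jalr_off)
{
	uint64_t tmp = pc + (uint64_t)hi20(auipc_off);	/* AUIPC <tmp>, ... */

	return tmp + (uint64_t)lo12(jalr_off);		/* JALR ..., (<tmp>) */
}
```

With an old offset of 0x12345678 and a new offset of 0x0abcd123, calling
auipc_jalr_target(pc, old, new) gives pc + 0x12345123, which is neither
the old nor the new function.
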
Hmm, depending on RISC-V's CMODX path, the pros/cons of CALL_OPS vs
dynamic trampolines change quite a bit.

The more I look at the pains of patching two instructions ("split
immediates"), the better "patch data" + one insn patching looks. I wish
we had longer instructions, that could fit a 48b address or more! ;-)

Again, thanks for a thought-provoking reply!

Björn