On Thu, Mar 7, 2024 at 8:19 AM Puranjay Mohan <puranja...@gmail.com> wrote: > > Hi Alex, > > On Wed, Mar 6, 2024 at 9:35 PM Alexandre Ghiti <a...@ghiti.fr> wrote: > > > > Hi Puranjay, > > > > On 06/03/2024 17:59, Puranjay Mohan wrote: > > > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V. > > > This allows each ftrace callsite to provide an ftrace_ops to the common > > > ftrace trampoline, allowing each callsite to invoke distinct tracer > > > functions without the need to fall back to list processing or to > > > allocate custom trampolines for each callsite. This significantly speeds > > > up cases where multiple distinct trace functions are used and callsites > > > are mostly traced by a single tracer. > > > > > > The idea and most of the implementation is taken from the ARM64's > > > implementation of the same feature. The idea is to place a pointer to > > > the ftrace_ops as a literal at a fixed offset from the function entry > > > point, which can be recovered by the common ftrace trampoline. > > > > > > We use -fpatchable-function-entry to reserve 8 bytes above the function > > > entry by emitting 2 4 byte or 4 2 byte nops depending on the presence of > > > CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer > > > to the associated ftrace_ops for that callsite. Functions are aligned to > > > 8 bytes to make sure that the accesses to this literal are atomic. > > > > > > This approach allows for directly invoking ftrace_ops::func even for > > > ftrace_ops which are dynamically-allocated (or part of a module), > > > without going via ftrace_ops_list_func. > > > > > > I've benchamrked this with the ftrace_ops sample module on Qemu, with > > > the next version, I will provide benchmarks on real hardware: > > > > > > Without this patch: > > > > > > +-----------------------+-----------------+----------------------------+ > > > | Number of tracers | Total time (ns) | Per-call average time | > > > |-----------------------+-----------------+----------------------------| > > > | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | > > > |----------+------------+-----------------+------------+---------------| > > > | 0 | 0 | 15615700 | 156 | - | > > > | 0 | 1 | 15917600 | 159 | - | > > > | 0 | 2 | 15668000 | 156 | - | > > > | 0 | 10 | 14971500 | 149 | - | > > > | 0 | 100 | 15417600 | 154 | - | > > > | 0 | 200 | 15387000 | 153 | - | > > > |----------+------------+-----------------+------------+---------------| > > > | 1 | 0 | 119906800 | 1199 | 1043 | > > > | 1 | 1 | 137428600 | 1374 | 1218 | > > > | 1 | 2 | 159562400 | 1374 | 1218 | > > > | 1 | 10 | 302099900 | 3020 | 2864 | > > > | 1 | 100 | 2008785500 | 20087 | 19931 | > > > | 1 | 200 | 3965221900 | 39652 | 39496 | > > > |----------+------------+-----------------+------------+---------------| > > > | 1 | 0 | 119166700 | 1191 | 1035 | > > > | 2 | 0 | 157999900 | 1579 | 1423 | > > > | 10 | 0 | 425370100 | 4253 | 4097 | > > > | 100 | 0 | 3595252100 | 35952 | 35796 | > > > | 200 | 0 | 7023485700 | 70234 | 70078 | > > > +----------+------------+-----------------+------------+---------------+ > > > > > > Note: per-call overhead is estimated relative to the baseline case with > > > 0 relevant tracers and 0 irrelevant tracers. > > > > > > With this patch: > > > > > > +-----------------------+-----------------+----------------------------+ > > > | Number of tracers | Total time (ns) | Per-call average time | > > > |-----------------------+-----------------+----------------------------| > > > | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | > > > |----------+------------+-----------------+------------+---------------| > > > | 0 | 0 | 15254600 | 152 | - | > > > | 0 | 1 | 16136700 | 161 | - | > > > | 0 | 2 | 15329500 | 153 | - | > > > | 0 | 10 | 15148800 | 151 | - | > > > | 0 | 100 | 15746900 | 157 | - | > > > | 0 | 200 | 15737400 | 157 | - | > > > |----------+------------+-----------------+------------+---------------| > > > | 1 | 0 | 47909000 | 479 | 327 | > > > | 1 | 1 | 48297400 | 482 | 330 | > > > | 1 | 2 | 47314100 | 473 | 321 | > > > | 1 | 10 | 47844900 | 478 | 326 | > > > | 1 | 100 | 46591900 | 465 | 313 | > > > | 1 | 200 | 47178900 | 471 | 319 | > > > |----------+------------+-----------------+------------+---------------| > > > | 1 | 0 | 46715800 | 467 | 315 | > > > | 2 | 0 | 155134500 | 1551 | 1399 | > > > | 10 | 0 | 442672800 | 4426 | 4274 | > > > | 100 | 0 | 4092353900 | 40923 | 40771 | > > > | 200 | 0 | 7135796400 | 71357 | 71205 | > > > +----------+------------+-----------------+------------+---------------+ > > > > > > Note: per-call overhead is estimated relative to the baseline case with > > > 0 relevant tracers and 0 irrelevant tracers. > > > > > > As can be seen from the above: > > > > > > a) Whenever there is a single relevant tracer function associated with a > > > tracee, the overhead of invoking the tracer is constant, and does not > > > scale with the number of tracers which are *not* associated with that > > > tracee. > > > > > > b) The overhead for a single relevant tracer has dropped to ~1/3 of the > > > overhead prior to this series (from 1035ns to 315ns). This is largely > > > due to permitting calls to dynamically-allocated ftrace_ops without > > > going through ftrace_ops_list_func. > > > > > > Why is this patch a RFC patch: > > > 1. I saw some rcu stalls on Qemu and need to debug them and see if they > > > were introduced by this patch. > > > > > > FYI, I'm currently working on debugging such issues (and other) with the > > *current* ftrace implementation, so probably not caused by your > > patchset. But keep debugging too, maybe this introduces other issues or > > even better, you'll find the root cause :) > > > > > > > 2. This needs to be tested thoroughly on real hardware. > > > 3. Seeking reviews to fix any fundamental problems with this patch that > > > I > > > might have missed due to my lack of RISC-V architecture knowledge. > > > 4. I would like to benchmark this on real hardware and put the results > > > in > > > the commit message. > > > > > > Signed-off-by: Puranjay Mohan <puranja...@gmail.com> > > > --- > > > arch/riscv/Kconfig | 2 ++ > > > arch/riscv/Makefile | 8 +++++ > > > arch/riscv/include/asm/ftrace.h | 3 ++ > > > arch/riscv/kernel/asm-offsets.c | 3 ++ > > > arch/riscv/kernel/ftrace.c | 59 +++++++++++++++++++++++++++++++++ > > > arch/riscv/kernel/mcount-dyn.S | 42 ++++++++++++++++++++--- > > > 6 files changed, 112 insertions(+), 5 deletions(-) > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > > index 0bfcfec67ed5..e474742e23b2 100644 > > > --- a/arch/riscv/Kconfig > > > +++ b/arch/riscv/Kconfig > > > @@ -78,6 +78,7 @@ config RISCV > > > select EDAC_SUPPORT > > > select FRAME_POINTER if PERF_EVENTS || (FUNCTION_TRACER && > > > !DYNAMIC_FTRACE) > > > select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY if DYNAMIC_FTRACE > > > + select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS > > > > > > A recent discussion [1] states that -falign-functions cannot guarantee > > this alignment for all code and that gcc developers came up with a new > > option [2]: WDYT? I have added Andy and Evgenii in +cc to help on that. > > I saw arm64 uses the same and assumes this guarantee, maybe it is broken too? > Will look into the discussion and try to use the other option. I am > currently using Clang to build the kernel.
IIRC it should be fine if you're building with Clang. The "-falign-function" flag for gcc is not aligning cold functions. So I think we'd need some Kconfig checks for the difference. > > > > > [1] > > https://lore.kernel.org/linux-riscv/4fe4567b-96be-4102-952b-7d39109b2...@yadro.com/ > > [2] > > https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326 > > > > > > > select GENERIC_ARCH_TOPOLOGY > > > select GENERIC_ATOMIC64 if !64BIT > > > select GENERIC_CLOCKEVENTS_BROADCAST if SMP > > > @@ -127,6 +128,7 @@ config RISCV > > > select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && MMU && > > > (CLANG_SUPPORTS_DYNAMIC_FTRACE || GCC_SUPPORTS_DYNAMIC_FTRACE) > > > select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS > > > select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE > > > + select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS if > > > (DYNAMIC_FTRACE_WITH_REGS && !CFI_CLANG) > > > select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL > > > select HAVE_FUNCTION_GRAPH_TRACER > > > select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER > > > diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile > > > index 252d63942f34..875ad5dc3d32 100644 > > > --- a/arch/riscv/Makefile > > > +++ b/arch/riscv/Makefile > > > @@ -14,12 +14,20 @@ endif > > > ifeq ($(CONFIG_DYNAMIC_FTRACE),y) > > > LDFLAGS_vmlinux += --no-relax > > > KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY > > > +ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS), y) > > > +ifeq ($(CONFIG_RISCV_ISA_C),y) > > > + CC_FLAGS_FTRACE := -fpatchable-function-entry=8,4 > > > +else > > > + CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2 > > > +endif > > > +else > > > ifeq ($(CONFIG_RISCV_ISA_C),y) > > > CC_FLAGS_FTRACE := -fpatchable-function-entry=4 > > > else > > > CC_FLAGS_FTRACE := -fpatchable-function-entry=2 > > > endif > > > endif > > > +endif > > > > > > ifeq ($(CONFIG_CMODEL_MEDLOW),y) > > > KBUILD_CFLAGS_MODULE += -mcmodel=medany > > > diff --git a/arch/riscv/include/asm/ftrace.h > > > b/arch/riscv/include/asm/ftrace.h > > > index 329172122952..c9a84222c9ea 100644 > > > --- a/arch/riscv/include/asm/ftrace.h > > > +++ b/arch/riscv/include/asm/ftrace.h > > > @@ -28,6 +28,9 @@ > > > void MCOUNT_NAME(void); > > > static inline unsigned long ftrace_call_adjust(unsigned long addr) > > > { > > > + if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)) > > > + return addr + 8; > > > + > > > return addr; > > > } > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c > > > b/arch/riscv/kernel/asm-offsets.c > > > index a03129f40c46..7d7c4b486852 100644 > > > --- a/arch/riscv/kernel/asm-offsets.c > > > +++ b/arch/riscv/kernel/asm-offsets.c > > > @@ -488,4 +488,7 @@ void asm_offsets(void) > > > DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), > > > STACK_ALIGN)); > > > OFFSET(STACKFRAME_FP, stackframe, fp); > > > OFFSET(STACKFRAME_RA, stackframe, ra); > > > +#ifdef CONFIG_FUNCTION_TRACER > > > + DEFINE(FTRACE_OPS_FUNC, offsetof(struct ftrace_ops, func)); > > > +#endif > > > } > > > diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c > > > index f5aa24d9e1c1..e2e75e15d32e 100644 > > > --- a/arch/riscv/kernel/ftrace.c > > > +++ b/arch/riscv/kernel/ftrace.c > > > @@ -82,9 +82,52 @@ static int __ftrace_modify_call(unsigned long > > > hook_pos, unsigned long target, > > > return 0; > > > } > > > > > > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS > > > +static const struct ftrace_ops *riscv64_rec_get_ops(struct dyn_ftrace > > > *rec) > > > +{ > > > + const struct ftrace_ops *ops = NULL; > > > + > > > + if (rec->flags & FTRACE_FL_CALL_OPS_EN) { > > > + ops = ftrace_find_unique_ops(rec); > > > + WARN_ON_ONCE(!ops); > > > + } > > > + > > > + if (!ops) > > > + ops = &ftrace_list_ops; > > > + > > > + return ops; > > > +} > > > + > > > +static int ftrace_rec_set_ops(const struct dyn_ftrace *rec, > > > + const struct ftrace_ops *ops) > > > +{ > > > + unsigned long literal = rec->ip - 8; > > > + > > > + return patch_text_nosync((void *)literal, &ops, sizeof(ops)); > > ^^ > I will change this to use a new function that does a 64 bit write and > doesn't do flush_icache_range() as naturally aligned writes are > atomic and therefore synchronization is not required here. Yes, it's atomic. But it doesn't guarantee ordering if fence.i is not executed, IIUC. I am not sure if CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS assumes seeing the updated ftrace_rec argument right after the patch completes. If that is the case then probably a batched fence.i is needed on local + remote cores. > > > > +} > > > + > > > +static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) > > > +{ > > > + return ftrace_rec_set_ops(rec, &ftrace_nop_ops); > > > +} > > > + > > > +static int ftrace_rec_update_ops(struct dyn_ftrace *rec) > > > +{ > > > + return ftrace_rec_set_ops(rec, riscv64_rec_get_ops(rec)); > > > +} > > > +#else > > > +static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) { return 0; } > > > +static int ftrace_rec_update_ops(struct dyn_ftrace *rec) { return 0; } > > > +#endif > > > + > > > int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) > > > { > > > unsigned int call[2]; > > > + int ret; > > > + > > > + ret = ftrace_rec_update_ops(rec); > > > + if (ret) > > > + return ret; > > > > > > make_call_t0(rec->ip, addr, call); > > > > > > @@ -98,6 +141,11 @@ int ftrace_make_nop(struct module *mod, struct > > > dyn_ftrace *rec, > > > unsigned long addr) > > > { > > > unsigned int nops[2] = {NOP4, NOP4}; > > > + int ret; > > > + > > > + ret = ftrace_rec_set_nop_ops(rec); > > > + if (ret) > > > + return ret; > > > > > > if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE)) > > > return -EPERM; > > > @@ -125,6 +173,13 @@ int ftrace_init_nop(struct module *mod, struct > > > dyn_ftrace *rec) > > > > > > int ftrace_update_ftrace_func(ftrace_func_t func) > > > { > > > + /* > > > + * When using CALL_OPS, the function to call is associated with the > > > + * call site, and we don't have a global function pointer to update. > > > + */ > > > + if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)) > > > + return 0; > > > + > > > int ret = __ftrace_modify_call((unsigned long)&ftrace_call, > > > (unsigned long)func, true, true); > > > if (!ret) { > > > @@ -147,6 +202,10 @@ int ftrace_modify_call(struct dyn_ftrace *rec, > > > unsigned long old_addr, > > > make_call_t0(caller, old_addr, call); > > > ret = ftrace_check_current_call(caller, call); > > > > > > + if (ret) > > > + return ret; > > > + > > > + ret = ftrace_rec_update_ops(rec); > > > if (ret) > > > return ret; > > > > > > diff --git a/arch/riscv/kernel/mcount-dyn.S > > > b/arch/riscv/kernel/mcount-dyn.S > > > index b7561288e8da..cb241e36e514 100644 > > > --- a/arch/riscv/kernel/mcount-dyn.S > > > +++ b/arch/riscv/kernel/mcount-dyn.S > > > @@ -191,11 +191,35 @@ > > > .endm > > > > > > .macro PREPARE_ARGS > > > - addi a0, t0, -FENTRY_RA_OFFSET > > > + addi a0, t0, -FENTRY_RA_OFFSET // ip (callsite's auipc > > > insn) > > > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS > > > + /* > > > + * When CALL_OPS is enabled (2 or 4) nops [8B] are placed before the > > > + * function entry, these are later overwritten with the pointer to > > > the > > > + * associated struct ftrace_ops. > > > + * > > > + * -8: &ftrace_ops of the associated tracer function. > > > + *<ftrace enable>: > > > + * 0: auipc t0/ra, 0x? > > > + * 4: jalr t0/ra, ?(t0/ra) > > > + * > > > + * -8: &ftrace_nop_ops > > > + *<ftrace disable>: > > > + * 0: nop > > > + * 4: nop > > > + * > > > + * t0 is set to ip+8 after the jalr is executed at the callsite, > > > + * so we find the associated op at t0-16. > > > + */ > > > + mv a1, ra // parent_ip > > > + REG_L a2, -16(t0) // op > > > + REG_L ra, FTRACE_OPS_FUNC(a2) // op->func > > > +#else > > > la a1, function_trace_op > > > - REG_L a2, 0(a1) > > > - mv a1, ra > > > - mv a3, sp > > > + REG_L a2, 0(a1) // op > > > + mv a1, ra // parent_ip > > > +#endif > > > + mv a3, sp // regs > > > .endm > > > > > > #endif /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */ > > > @@ -233,8 +257,12 @@ SYM_FUNC_START(ftrace_regs_caller) > > > SAVE_ABI_REGS 1 > > > PREPARE_ARGS > > > > > > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS > > > + jalr ra > > > +#else > > > SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL) > > > call ftrace_stub > > > +#endif > > > > > > RESTORE_ABI_REGS 1 > > > bnez t1, .Ldirect > > > @@ -247,9 +275,13 @@ SYM_FUNC_START(ftrace_caller) > > > SAVE_ABI_REGS 0 > > > PREPARE_ARGS > > > > > > -SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) > > ^^ this hunk is a mistake, I will fix it in the next version. > > > > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS > > > + jalr ra > > > +#else > > > +SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL) > > > call ftrace_stub > > > > > > +#endif > > > RESTORE_ABI_REGS 0 > > > jr t0 > > > SYM_FUNC_END(ftrace_caller) > > > > > > As I'm diving into ftrace right now, I'll give a proper review soon. But > > as a note, I noticed that the function_graph tracer, when enabled, makes > > the whole system unresponsive (but still up, just very slow). A fix I > > sent recently seems to really improve that if you're interested in > > testing it (I am :)). You can find it here: > > https://lore.kernel.org/linux-riscv/20240229121056.203419-1-alexgh...@rivosinc.com/ > > I saw the same issue where function_graph was making the system slow on Qemu. > What hardware do you use for testing? or are you testing on Qemu as well? > > I tested your patch, it speeds up the process of patching the > instructions so the following > command completes ~2.5 seconds faster compared to without your patch. > $ time echo function_graph > current_tracer > > But at runtime the system is still slow and laggy with function_graph, > I guess because my > Qemu setup is not powerful enough to run function_graph. > > Thanks, > Puranjay > > _______________________________________________ > linux-riscv mailing list > linux-ri...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv