Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
On Wed, Jul 3, 2024 at 6:15 AM Masami Hiramatsu wrote: > > On Mon, 1 Jul 2024 15:39:25 -0700 > Andrii Nakryiko wrote: > > > It seems like uprobe_write_opcode() doesn't require writer locked > > mmap_sem, any lock (reader or writer) should be sufficient. This was > > established in a discussion in [0] and looking through existing code > > seems to confirm that there is no need for write-locked mmap_sem. > > > > Fix the comment to state this clearly. > > > > [0] > > https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/ > > > > Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into > > uprobe_write_opcode()") > > nit: why this has Fixes but [01/12] doesn't? > > Should I pick both to fixes branch? I'd keep both of them in probes/for-next, tbh. They are not literally fixing anything, just cleaning up comments. I can drop the Fixes tag from this one, if you'd like. > > Thank you, > > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/events/uprobes.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 081821fd529a..f87049c08ee9 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct > > mm_struct *mm, > > * @vaddr: the virtual address to store the opcode. > > * @opcode: opcode to be written at @vaddr. > > * > > - * Called with mm->mmap_lock held for write. > > + * Called with mm->mmap_lock held for read or write. > > * Return 0 (success) or a negative errno. > > */ > > int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v2 01/12] uprobes: update outdated comment
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>
> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> > 		goto out;
> >
> > 	/*
> > -	 * The NULL 'tsk' here ensures that any faults that occur here
> > -	 * will not be accounted to the task. 'mm' *is* current->mm,
> > -	 * but we treat this as a 'remote' access since it is
> > -	 * essentially a kernel access to the memory.
> > +	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +	 * it is essentially a kernel access to the memory.
> > 	 */
> > 	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov
>
> ---
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

I don't know, it was a drive-by cleanup which I'm starting to regret already :)

> Oleg.
>
Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
On Wed, Jul 3, 2024 at 3:13 AM Masami Hiramatsu wrote: > > On Wed, 3 Jul 2024 10:13:15 +0200 > Peter Zijlstra wrote: > > > On Mon, Jul 01, 2024 at 03:39:28PM -0700, Andrii Nakryiko wrote: > > > Simplify uprobe registration/unregistration interfaces by making offset > > > and ref_ctr_offset part of uprobe_consumer "interface". In practice, all > > > existing users already store these fields somewhere in uprobe_consumer's > > > containing structure, so this doesn't pose any problem. We just move > > > some fields around. > > > > > > On the other hand, this simplifies uprobe_register() and > > > uprobe_unregister() API by having only struct uprobe_consumer as one > > > thing representing attachment/detachment entity. This makes batched > > > versions of uprobe_register() and uprobe_unregister() simpler. > > > > > > This also makes uprobe_register_refctr() unnecessary, so remove it and > > > simplify consumers. > > > > > > No functional changes intended. > > > > > > Acked-by: Masami Hiramatsu (Google) > > > Signed-off-by: Andrii Nakryiko > > > --- > > > include/linux/uprobes.h | 18 +++ > > > kernel/events/uprobes.c | 19 ++- > > > kernel/trace/bpf_trace.c | 21 +++- > > > kernel/trace/trace_uprobe.c | 53 --- > > > .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 > > > 5 files changed, 55 insertions(+), 78 deletions(-) > > > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > index b503fafb7fb3..a75ba37ce3c8 100644 > > > --- a/include/linux/uprobes.h > > > +++ b/include/linux/uprobes.h > > > @@ -42,6 +42,11 @@ struct uprobe_consumer { > > > enum uprobe_filter_ctx ctx, > > > struct mm_struct *mm); > > > > > > + /* associated file offset of this probe */ > > > + loff_t offset; > > > + /* associated refctr file offset of this probe, or zero */ > > > + loff_t ref_ctr_offset; > > > + /* for internal uprobe infra use, consumers shouldn't touch fields > > > below */ > > > struct uprobe_consumer *next; > > > }; > > > > > > @@ -110,10 +115,9 @@ extern bool 
is_trap_insn(uprobe_opcode_t *insn); > > > extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); > > > extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); > > > extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct > > > mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); > > > -extern int uprobe_register(struct inode *inode, loff_t offset, struct > > > uprobe_consumer *uc); > > > -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, > > > loff_t ref_ctr_offset, struct uprobe_consumer *uc); > > > +extern int uprobe_register(struct inode *inode, struct uprobe_consumer > > > *uc); > > > extern int uprobe_apply(struct inode *inode, loff_t offset, struct > > > uprobe_consumer *uc, bool); > > > -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct > > > uprobe_consumer *uc); > > > +extern void uprobe_unregister(struct inode *inode, struct > > > uprobe_consumer *uc); > > > > It seems very weird and unnatural to split inode and offset like this. > > The whole offset thing only makes sense within the context of an inode. > > Hm, so would you mean we should have inode inside the uprobe_consumer? > If so, I think it is reasonable. > I don't think so, for at least two reasons. 1) We will be wasting 8 bytes per consumer saving exactly the same inode pointer for no good reason, while uprobe itself already stores this inode. One can argue that having offset and ref_ctr_offset inside uprobe_consumer is wasteful, in principle, and I agree. But all existing users already store them in the same struct that contains uprobe_consumer, so we are not regressing anything, while making the interface simpler. We can always optimize that further, if necessary, but right now that would give us nothing. But moving inode into uprobe_consumer will regress memory usage. 
2) In the context of batched APIs, offset and ref_ctr_offset varies with each uprobe_consumer, while inode is explicitly the same for all consumers in that batch. This API makes it very clear. Technically, we can remove inode completely from *uprobe_unregister*, because we now can access it from uprobe_consumer->uprobe->inode, but it still feels right for symmetry and explicitly making a point that callers should ensure that inode is kept alive. > Thank you, > > > > > So yeah, lets not do this. > > > -- > Masami Hiramatsu
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Tue, Jul 2, 2024 at 9:53 AM Steven Rostedt wrote: > > On Wed, 3 Jul 2024 00:19:05 +0900 > Masami Hiramatsu (Google) wrote: > > > > BTW, is this (batched register/unregister APIs) something you'd like > > > to use from the tracefs-based (or whatever it's called, I mean non-BPF > > > ones) uprobes as well? Or there is just no way to even specify a batch > > > of uprobes? Just curious if you had any plans for this. > > > > No, because current tracefs dynamic event interface is not designed for > > batched registration. I think we can expand it to pass wildcard symbols > > (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe. > > Um, that maybe another good idea. > > I don't see why not. The wild cards were added to the kernel > specifically for the tracefs interface (set_ftrace_filter). Nice, I'd be happy to adjust batch API to work for that use case as well (when we get there). > > -- Steve
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra wrote: > > On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote: > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 23449a8c5e7e..560cf1ca512a 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem); > > > > struct uprobe { > > struct rb_node rb_node;/* node in the rb tree */ > > - refcount_t ref; > > + atomic64_t ref;/* see UPROBE_REFCNT_GET > > below */ > > struct rw_semaphore register_rwsem; > > struct rw_semaphore consumer_rwsem; > > + struct rcu_head rcu; > > struct list_headpending_list; > > struct uprobe_consumer *consumers; > > struct inode*inode; /* Also hold a ref to inode */ > > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct > > mm_struct *mm, unsigned long v > > *(uprobe_opcode_t *)>insn); > > } > > > > -static struct uprobe *get_uprobe(struct uprobe *uprobe) > > +/* > > + * Uprobe's 64-bit refcount is actually two independent counters > > co-located in > > + * a single u64 value: > > + * - lower 32 bits are just a normal refcount with is increment and > > + * decremented on get and put, respectively, just like normal refcount > > + * would; > > + * - upper 32 bits are a tag (or epoch, if you will), which is always > > + * incremented by one, no matter whether get or put operation is done. > > + * > > + * This upper counter is meant to distinguish between: > > + * - one CPU dropping refcnt from 1 -> 0 and proceeding with > > "destruction", > > + * - while another CPU continuing further meanwhile with 0 -> 1 -> 0 > > refcnt > > + * sequence, also proceeding to "destruction". 
> > + * > > + * In both cases refcount drops to zero, but in one case it will have > > epoch N, > > + * while the second drop to zero will have a different epoch N + 2, > > allowing > > + * first destructor to bail out because epoch changed between refcount > > going > > + * to zero and put_uprobe() taking uprobes_treelock (under which overall > > + * 64-bit refcount is double-checked, see put_uprobe() for details). > > + * > > + * Lower 32-bit counter is not meant to over overflow, while it's expected > > So refcount_t very explicitly handles both overflow and underflow and > screams bloody murder if they happen. Your thing does not.. > Correct, because I considered that to be practically impossible to overflow this refcount. The main source of refcounts are uretprobes that are holding uprobe references. We limit the depth of supported recursion to 64, so you'd need 30+ millions of threads all hitting the same uprobe/uretprobe to overflow this. I guess in theory it could happen (not sure if we have some limits on total number of threads in the system and if they can be bumped to over 30mln), but it just seemed out of realm of practical possibility. Having said that, I can add similar checks that refcount_t does in refcount_add and do what refcount_warn_saturate does as well. > > + * that upper 32-bit counter will overflow occasionally. Note, though, > > that we > > + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit > > counter, > > + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and > > + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes > > + * epoch effectively a 31-bit counter with highest bit used as a flag to > > + * perform a fix-up. This ensures epoch and refcnt parts do not > > "interfere". > > + * > > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both* > > + * epoch and refcnt parts atomically with one atomic_add(). 
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part > > and > > + * *increment* epoch part. > > + */ > > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x00010001LL */ > > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0xLL */ > > + > > +/* > > + * Caller has to make sure that: > > + * a) either uprobe's refcnt is positive before this call; > > + * b) or uprobes_treelock is held (doesn't matter if for read or write), > > + * preventing uprobe's destructor from removing it from uprobes_tree. > &
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Mon, Jul 1, 2024 at 6:01 PM Masami Hiramatsu wrote: > > On Mon, 1 Jul 2024 15:15:56 -0700 > Andrii Nakryiko wrote: > > > On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko > > wrote: > > > > > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu > > > wrote: > > > > > > > > On Fri, 28 Jun 2024 09:34:26 -0700 > > > > Andrii Nakryiko wrote: > > > > > > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu > > > > > wrote: > > > > > > > > > > > > On Thu, 27 Jun 2024 09:47:10 -0700 > > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t > > > > > > > > > offset, > > > > > > > > > - loff_t ref_ctr_offset, struct > > > > > > > > > uprobe_consumer *uc) > > > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > > > > > > + uprobe_consumer_fn > > > > > > > > > get_uprobe_consumer, void *ctx) > > > > > > > > > > > > > > > > Is this interface just for avoiding memory allocation? Can't we > > > > > > > > just > > > > > > > > allocate a temporary array of *uprobe_consumer instead? > > > > > > > > > > > > > > Yes, exactly, to avoid the need for allocating another array that > > > > > > > would just contain pointers to uprobe_consumer. Consumers would > > > > > > > never > > > > > > > just have an array of `struct uprobe_consumer *`, because > > > > > > > uprobe_consumer struct is embedded in some other struct, so the > > > > > > > array > > > > > > > interface isn't the most convenient. > > > > > > > > > > > > OK, I understand it. 
> > > > > > > > > > > > > > > > > > > > If you feel strongly, I can do an array, but this necessitates > > > > > > > allocating an extra array *and keeping it* for the entire > > > > > > > duration of > > > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a > > > > > > > waste. This is because we don't want to do anything that can fail > > > > > > > in > > > > > > > the detachment logic (so no temporary array allocation there). > > > > > > > > > > > > No need to change it, that sounds reasonable. > > > > > > > > > > > > > > > > Great, thanks. > > > > > > > > > > > > > > > > > > > Anyways, let me know how you feel about keeping this callback. > > > > > > > > > > > > IMHO, maybe the interface function is better to change to > > > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller > > > > > > side uses a linked list of structure, index access will need to > > > > > > follow the list every time. > > > > > > > > > > This would be problematic. Note how we call get_uprobe_consumer(i, > > > > > ctx) with i going from 0 to N in multiple independent loops. So if we > > > > > are only allowed to ask for the next consumer, then > > > > > uprobe_register_batch and uprobe_unregister_batch would need to build > > > > > its own internal index and remember ith instance. Which again means > > > > > more allocations and possibly failing uprobe_unregister_batch(), which > > > > > isn't great. > > > > > > > > No, I think we can use a cursor variable as; > > > > > > > > int uprobe_register_batch(struct inode *inode, > > > > uprobe_consumer_fn get_uprobe_consumer, void *ctx) > > > > { > > > > void *cur = ctx; > > > > > > > > while (
[PATCH] perf,x86: avoid missing caller address in stack traces captured in uprobe
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often the `push %rbp` instruction in the function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which is the caller's frame, containing the address of the caller's caller). So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes an x86-64-specific heuristic to detect the `push %rbp` instruction being traced. Given that the entire kernel implementation of user space stack trace capturing works under the assumption that user space code was compiled with frame pointer register (%rbp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, the return address is still pointed to by %rsp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using frame pointer-based logic.
Signed-off-by: Andrii Nakryiko --- arch/x86/events/core.c | 20 include/linux/uprobes.h | 2 ++ kernel/events/uprobes.c | 2 ++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 5b0dd07b1ef1..82d5570b58ff 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs return; pagefault_disable(); + +#ifdef CONFIG_UPROBES + /* +* If we are called from uprobe handler, and we are indeed at the very +* entry to user function (which is normally a `push %rbp` instruction, +* under assumption of application being compiled with frame pointers), +* we should read return address from *regs->sp before proceeding +* to follow frame pointers, otherwise we'll skip immediate caller +* as %rbp is not yet setup. +*/ + if (current->utask) { + struct arch_uprobe *auprobe = current->utask->auprobe; + u64 ret_addr; + + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && + !__get_user(ret_addr, (const u64 __user *)regs->sp)) + perf_callchain_store(entry, ret_addr); + } +#endif + while (entry->nr < entry->max_stack) { if (!valid_user_frame(fp, sizeof(frame))) break; diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..a270a5892ab4 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -76,6 +76,8 @@ struct uprobe_task { struct uprobe *active_uprobe; unsigned long xol_vaddr; + struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned intdepth; }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 99be2adedbc0..6e22e4d80f1e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) bool need_prep = false; /* prepare return uprobe, when needed */ down_read(>register_rwsem); + current->utask->auprobe = >arch; for (uc = uprobe->consumers; uc; uc = 
uc->next) { int rc = 0; @@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) remove &= rc; } + current->utask->auprobe = NULL; if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */ -- 2.43.0
[PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
With all the batch uprobe API work done, we are now finally ready to reap the
benefits. Switch uprobes_treelock from a reader-writer spinlock to a much more
efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool, the
trig-uprobe-nop benchmark specifically, to see how uprobe total throughput
scales with the number of competing threads (mapped to individual CPUs). Here
are the results:

  # threads   BEFORE (mln/s)   AFTER (mln/s)
  ---------   --------------   -------------
          1            3.131           3.140
          2            3.394           3.601
          3            3.630           3.960
          4            3.317           3.551
          5            3.448           3.464
          6            3.345           3.283
          7            3.469           3.444
          8            3.182           3.258
          9            3.138           3.139
         10            2.999           3.212
         11            2.903           3.183
         12            2.802           3.027
         13            2.792           3.027
         14            2.695           3.086
         15            2.822           2.965
         16            2.679           2.939
         17            2.622           2.888
         18            2.628           2.914
         19            2.702           2.836
         20            2.561           2.837

One can see that the per-CPU RW semaphore-based implementation scales better
with the number of CPUs (especially given that single-CPU throughput is
basically the same). Note, scalability is still limited by register_rwsem, and
this will hopefully be addressed in follow-up patch set(s).
Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index bb480a2400e1..1d76551e5e23 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; */ #define no_uprobe_events() RB_EMPTY_ROOT(_tree) -static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */ +DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock); /* serialize rbtree access */ #define UPROBES_HASH_SZ13 /* serialize uprobe->pending_list */ @@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked) bool destroy; if (!tree_locked) - write_lock(_treelock); + percpu_down_write(_treelock); /* * We might race with find_uprobe()->__get_uprobe() executed * from inside read-locked uprobes_treelock, which can bump @@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked) if (destroy && uprobe_is_active(uprobe)) rb_erase(>rb_node, _tree); if (!tree_locked) - write_unlock(_treelock); + percpu_up_write(_treelock); /* * Beyond here we don't need RCU protection, we are either the @@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) { struct uprobe *uprobe; - read_lock(_treelock); + percpu_down_read(_treelock); uprobe = __find_uprobe(inode, offset); - read_unlock(_treelock); + percpu_up_read(_treelock); return uprobe; } @@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge up_write(>register_rwsem); } - write_lock(_treelock); + percpu_down_write(_treelock); for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); uprobe = uc->uprobe; @@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge __put_uprobe(uprobe, true); uc->uprobe = NULL; } - write_unlock(_treelock); + percpu_up_write(_treelock); } static struct uprobe_consumer 
*uprobe_consumer_identity(size_t idx, void *ctx) @@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt, } ret = 0; - write_lock(_treelock); + percpu_down_write(_treelock); for (i = 0; i < cnt; i++) { struct uprobe *cur_uprobe; @@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt, } } unlock_treelock: - write_unlock(_treelock); + percpu_up_write(_treelock); if (ret) goto cleanup_uprobes; @@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt, } cleanup_uprobes: /* put all the successfully allocated/reused uprobes */ - write_lock(_treelock); + percpu_down_write(_treelock); for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); @@ -1384,7 +1384,7
[PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
Switch internals of BPF multi-uprobes to batched version of uprobe registration and unregistration APIs. This also simplifies BPF clean up code a bit thanks to all-or-nothing guarantee of uprobes_register_batch(). Signed-off-by: Andrii Nakryiko --- kernel/trace/bpf_trace.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ba62baec3152..41bf6736c542 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx { struct bpf_uprobe *uprobe; }; -static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes, - u32 cnt) +static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx) { - u32 i; + struct bpf_uprobe_multi_link *link = ctx; - for (i = 0; i < cnt; i++) { - uprobe_unregister(d_real_inode(path->dentry), [i].consumer); - } + return >uprobes[idx].consumer; } static void bpf_uprobe_multi_link_release(struct bpf_link *link) @@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link) struct bpf_uprobe_multi_link *umulti_link; umulti_link = container_of(link, struct bpf_uprobe_multi_link, link); - bpf_uprobe_unregister(_link->path, umulti_link->uprobes, umulti_link->cnt); + uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt, + umulti_link_get_uprobe_consumer, umulti_link); if (umulti_link->task) put_task_struct(umulti_link->task); path_put(_link->path); @@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr bpf_link_init(>link, BPF_LINK_TYPE_UPROBE_MULTI, _uprobe_multi_link_lops, prog); - for (i = 0; i < cnt; i++) { - err = uprobe_register(d_real_inode(link->path.dentry), [i].consumer); - if (err) { - bpf_uprobe_unregister(, uprobes, i); - goto error_free; - } - } + err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt, + umulti_link_get_uprobe_consumer, link); + if (err) 
+ goto error_free; err = bpf_link_prime(>link, _primer); if (err) -- 2.43.0
[PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch
Similarly to what we did for uprobes_register_batch(), split
uprobe_unregister_batch() into two separate phases with different locking
needs. First, all the VMA unregistration is performed while holding a
per-uprobe register_rwsem. Then, we take a batched uprobes_treelock once to
__put_uprobe() for all uprobe_consumers. That uprobe_consumer->uprobe field is
really handy in helping with this.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
-	struct uprobe *uprobe;
 	struct uprobe_consumer *uc;
+	struct uprobe *uprobe;
 	int i;
 
 	for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		down_write(&uprobe->register_rwsem);
 		__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
-		put_uprobe(uprobe);
+	}
 
+	write_lock(&uprobes_treelock);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (!uprobe)
+			continue;
+
+		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0
[PATCH v2 09/12] uprobes: batch uprobes_treelock during registration
Now that we have a good separate of each registration step, take uprobes_treelock just once for relevant registration step, and then process all relevant uprobes in one go. Even if writer lock introduces a relatively large delay (as might happen with per-CPU RW semaphore), this will keep overall batch attachment reasonably fast. We teach put_uprobe(), though __put_uprobe() helper, to optionally take or not uprobes_treelock, to accommodate this pattern. With these changes we don't need insert_uprobe() operation that unconditionally takes uprobes_treelock, so get rid of it, leaving only lower-level __insert_uprobe() helper. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 45 + 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 128677ffe662..ced85284bbf4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu) kfree(uprobe); } -static void put_uprobe(struct uprobe *uprobe) +static void __put_uprobe(struct uprobe *uprobe, bool tree_locked) { s64 v; @@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe) if (unlikely((u32)v == 0)) { bool destroy; - write_lock(_treelock); + if (!tree_locked) + write_lock(_treelock); /* * We might race with find_uprobe()->__get_uprobe() executed * from inside read-locked uprobes_treelock, which can bump @@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe) destroy = atomic64_read(>ref) == v; if (destroy && uprobe_is_active(uprobe)) rb_erase(>rb_node, _tree); - write_unlock(_treelock); + if (!tree_locked) + write_unlock(_treelock); /* * Beyond here we don't need RCU protection, we are either the @@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe) rcu_read_unlock_trace(); } +static void put_uprobe(struct uprobe *uprobe) +{ + __put_uprobe(uprobe, false); +} + static __always_inline int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset, 
const struct uprobe *r) @@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) return u; } -/* - * Acquire uprobes_treelock and insert uprobe into uprobes_tree - * (or reuse existing one, see __insert_uprobe() comments above). - */ -static struct uprobe *insert_uprobe(struct uprobe *uprobe) -{ - struct uprobe *u; - - write_lock(_treelock); - u = __insert_uprobe(uprobe); - write_unlock(_treelock); - - return u; -} - static void ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) { @@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, uc->uprobe = uprobe; } + ret = 0; + write_lock(_treelock); for (i = 0; i < cnt; i++) { struct uprobe *cur_uprobe; @@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt, uprobe = uc->uprobe; /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); + cur_uprobe = __insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); + + __put_uprobe(cur_uprobe, true); ret = -EINVAL; - goto cleanup_uprobes; + goto unlock_treelock; } kfree(uprobe); uc->uprobe = cur_uprobe; } } +unlock_treelock: + write_unlock(_treelock); + if (ret) + goto cleanup_uprobes; for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); @@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt, } cleanup_uprobes: /* put all the successfully allocated/reused uprobes */ + write_lock(_treelock); for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); if (uc->uprobe) - put_uprobe(uc->uprobe); + __put_uprobe(uc->uprobe, true); uc->uprobe = NULL; } + write_unlock(_treelock); return ret; } -- 2.43.0
[PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
Now we are ready to split alloc-and-insert coupled step into two separate phases. First, we allocate and prepare all potentially-to-be-inserted uprobe instances, assuming corresponding uprobes are not yet in uprobes_tree. This is needed so that we don't do memory allocations under uprobes_treelock (once we batch locking for each step). Second, we insert new uprobes or reuse already existing ones into uprobes_tree. Any uprobe that turned out to be not necessary is immediately freed, as there are no other references to it. This concludes preparations that make uprobes_register_batch() ready to batch and optimize locking per each phase. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 0f928a72a658..128677ffe662 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, return -EINVAL; } + /* pre-allocate new uprobe instances */ for (i = 0; i < cnt; i++) { - struct uprobe *cur_uprobe; - uc = get_uprobe_consumer(i, ctx); uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); @@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt, RB_CLEAR_NODE(>rb_node); atomic64_set(>ref, 1); + uc->uprobe = uprobe; + } + + for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + /* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ @@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt, if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); - kfree(uprobe); ret = -EINVAL; goto cleanup_uprobes; } kfree(uprobe); - uprobe = cur_uprobe; + uc->uprobe = cur_uprobe; } - - uc->uprobe = uprobe; } for (i = 0; i 
< cnt; i++) { -- 2.43.0
[PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
To allow unbundling alloc-uprobe-and-insert step which is currently tightly coupled, inline alloc_uprobe() logic into uprobe_register_batch() loop. It's called from one place, so we don't really lose much in terms of maintainability. No functional changes. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 65 ++--- 1 file changed, 28 insertions(+), 37 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 68fdf1b8e4bf..0f928a72a658 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) (unsigned long long) uprobe->ref_ctr_offset); } -static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset) -{ - struct uprobe *uprobe, *cur_uprobe; - - uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); - if (!uprobe) - return ERR_PTR(-ENOMEM); - - uprobe->inode = inode; - uprobe->offset = offset; - uprobe->ref_ctr_offset = ref_ctr_offset; - init_rwsem(>register_rwsem); - init_rwsem(>consumer_rwsem); - RB_CLEAR_NODE(>rb_node); - atomic64_set(>ref, 1); - - /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); - /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe != uprobe) { - if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { - ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); - kfree(uprobe); - return ERR_PTR(-EINVAL); - } - kfree(uprobe); - uprobe = cur_uprobe; - } - - return uprobe; -} - static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(>consumer_rwsem); @@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt, } for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + uc = get_uprobe_consumer(i, ctx); - uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset); - if (IS_ERR(uprobe)) { - ret = PTR_ERR(uprobe); + uprobe = kzalloc(sizeof(struct uprobe), 
GFP_KERNEL); + if (!uprobe) { + ret = -ENOMEM; goto cleanup_uprobes; } + uprobe->inode = inode; + uprobe->offset = uc->offset; + uprobe->ref_ctr_offset = uc->ref_ctr_offset; + init_rwsem(>register_rwsem); + init_rwsem(>consumer_rwsem); + RB_CLEAR_NODE(>rb_node); + atomic64_set(>ref, 1); + + /* add to uprobes_tree, sorted on inode:offset */ + cur_uprobe = insert_uprobe(uprobe); + /* a uprobe exists for this inode:offset combination */ + if (cur_uprobe != uprobe) { + if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { + ref_ctr_mismatch_warn(cur_uprobe, uprobe); + put_uprobe(cur_uprobe); + kfree(uprobe); + ret = -EINVAL; + goto cleanup_uprobes; + } + kfree(uprobe); + uprobe = cur_uprobe; + } + uc->uprobe = uprobe; } -- 2.43.0
[PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs
Introduce batch versions of the uprobe registration (attachment) and unregistration (detachment) APIs. Unregistration is presumed to never fail, so that's easy. Batch registration can fail, and so the semantics of uprobe_register_batch() are such that either all uprobe_consumers are successfully attached or none of them remain attached after the return. There is no guarantee of atomicity of attachment, though, and so while batch attachment is proceeding, some uprobes might start firing before others are completely attached. Even if overall attachment eventually fails, some successfully attached uprobes might fire and callers have to be prepared to handle that. This is in no way a regression compared to the current approach of attaching uprobes one by one, though. One crucial implementation detail is the addition of a `struct uprobe *uprobe` field to `struct uprobe_consumer`, which is meant for internal uprobe subsystem usage only. We use this field both as temporary storage (to avoid unnecessary allocations) and as a back link to the associated uprobe to simplify and speed up uprobe unregistration, as we can now avoid yet another tree lookup when unregistering a uprobe_consumer. The general direction with the uprobe registration implementation is to do batch attachment in distinct steps, each step performing some set of checks or actions on all uprobe_consumers before proceeding to the next phase. This, after some more changes in subsequent patches, allows us to batch locking for each phase and thereby amortize any long delays that might be added by writer locks (especially once we switch uprobes_treelock to a per-CPU R/W semaphore later). Currently, uprobe_register_batch() performs all the sanity checks first. It then proceeds to allocate-and-insert (we'll split this up further later on) uprobe instances, as necessary. And then the last step is the actual uprobe registration for all affected VMAs.
We take care to undo all the actions in the event of an error at any point in this lengthy process, so end result is all-or-nothing, as described above. Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 17 kernel/events/uprobes.c | 180 2 files changed, 146 insertions(+), 51 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index a75ba37ce3c8..a6e6eb70539d 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -33,6 +33,8 @@ enum uprobe_filter_ctx { UPROBE_FILTER_MMAP, }; +typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx); + struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, @@ -48,6 +50,8 @@ struct uprobe_consumer { loff_t ref_ctr_offset; /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; + /* associated uprobe instance (or NULL if consumer isn't attached) */ + struct uprobe *uprobe; }; #ifdef CONFIG_UPROBES @@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); +extern int uprobe_register_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); +extern void uprobe_unregister_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ 
-160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { @@ -169,6 +182,10 @@ static inline void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } +static inline void uprobe_unregister_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ +} static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8759c6d0683e..68fdf1b8e4bf 100644
[PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
Simplify uprobe registration/unregistration interfaces by making offset and ref_ctr_offset part of uprobe_consumer "interface". In practice, all existing users already store these fields somewhere in uprobe_consumer's containing structure, so this doesn't pose any problem. We just move some fields around. On the other hand, this simplifies uprobe_register() and uprobe_unregister() API by having only struct uprobe_consumer as one thing representing attachment/detachment entity. This makes batched versions of uprobe_register() and uprobe_unregister() simpler. This also makes uprobe_register_refctr() unnecessary, so remove it and simplify consumers. No functional changes intended. Acked-by: Masami Hiramatsu (Google) Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 18 +++ kernel/events/uprobes.c | 19 ++- kernel/trace/bpf_trace.c | 21 +++- kernel/trace/trace_uprobe.c | 53 --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 5 files changed, 55 insertions(+), 78 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..a75ba37ce3c8 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -42,6 +42,11 @@ struct uprobe_consumer { enum uprobe_filter_ctx ctx, struct mm_struct *mm); + /* associated file offset of this probe */ + loff_t offset; + /* associated refctr file offset of this probe, or zero */ + loff_t ref_ctr_offset; + /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; }; @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc); +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ -152,11 +156,7 @@ static inline void uprobes_init(void) #define uprobe_get_trap_addr(regs) instruction_pointer(regs) static inline int -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - return -ENOSYS; -} -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo return -ENOSYS; } static inline void -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 560cf1ca512a..8759c6d0683e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) /* * uprobe_unregister - unregister an already registered probe. * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. + * @uc: identify which probe consumer to unregister. 
*/ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { struct uprobe *uprobe; - uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe(inode, uc->offset); if (WARN_ON(!uprobe)) return; @@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset, return ret; } -int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) +int uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { - return __uprobe_register(inode, offset, 0, uc); + return __uprobe_register(inod
[PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()
Return -ENOMEM instead of NULL, which makes caller's error handling just a touch simpler. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index f87049c08ee9..23449a8c5e7e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); if (!uprobe) - return NULL; + return ERR_PTR(-ENOMEM); uprobe->inode = inode; uprobe->offset = offset; @@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset, retry: uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); - if (!uprobe) - return -ENOMEM; if (IS_ERR(uprobe)) return PTR_ERR(uprobe); -- 2.43.0
[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
note the specific values of the UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that a single atomic64_t is actually two sort-of-independent 32-bit counters that are incremented/decremented with a single atomic_add_and_return() operation. Note also a small and extremely rare (and thus having no effect on performance) need to clear the highest bit every 2 billion get/put operations to prevent the high 32-bit counter from "bleeding over" into the lower 32-bit counter. Another aspect of this race is that the winning CPU might, at least theoretically, be so quick that it will free uprobe memory before the losing CPU gets a chance to discover that it lost. To prevent this, we protect and delay uprobe lifetime with RCU. We can't use rcu_read_lock() + rcu_read_unlock(), because we need to take locks inside the RCU critical section. Luckily, we have the RCU Tasks Trace flavor, which supports locking and sleeping. It is already used by the BPF subsystem for sleepable BPF programs (including sleepable BPF uprobe programs), and is optimized for reader-dominated workflows. It fits perfectly and doesn't seem to introduce any significant slowdowns in the uprobe hot path. All the above contained trickery aside, we end up with nice semantics for get and put operations, where get always succeeds and put handles all the races properly and transparently to the caller. And just to justify this somewhat unorthodox refcounting approach: under a uprobe-triggering micro-benchmark (using BPF selftests' bench tool) with 8 triggering threads, the atomic_inc_not_zero() approach was producing about 3.3 million/sec total uprobe triggerings across all threads, while the final atomic_add_and_return()-based approach managed to get 3.6 million/sec throughput under the same 8 competing threads.
Furthermore, CPU profiling showed the following overall CPU usage: - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for the atomic_inc_not_zero approach; - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for the atomic_add_and_return approach implemented by this patch. So the CPU spends relatively more time in get/put operations while delivering less total throughput when using atomic_inc_not_zero(). And this will be even more prominent once we optimize away uprobe->register_rwsem in subsequent patch sets. So while slightly less straightforward, the current approach seems to be clearly winning and justified. We also rename get_uprobe() to __get_uprobe() to indicate it's a delicate internal helper that is only safe to call under valid circumstances: - while holding uprobes_treelock (to synchronize with the exclusive write lock in put_uprobe(), as described above); - or if we have a guarantee that the uprobe's refcount is already positive through the caller holding at least one refcount (in this case there is no risk of the refcount dropping to zero by any other CPU). We also document why it's safe to do an unconditional __get_uprobe() at all call sites, to make it clear that we maintain the above invariants. Note also that we now don't have a race between registration and unregistration, so we remove the retry logic completely.
Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 260 ++-- 1 file changed, 195 insertions(+), 65 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 23449a8c5e7e..560cf1ca512a 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem); struct uprobe { struct rb_node rb_node; /* node in the rb tree */ - refcount_t ref; + atomic64_t ref; /* see UPROBE_REFCNT_GET below */ struct rw_semaphore register_rwsem; struct rw_semaphore consumer_rwsem; + struct rcu_head rcu; struct list_head pending_list; struct uprobe_consumer *consumers; struct inode *inode; /* Also hold a ref to inode */ @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v *(uprobe_opcode_t *)&auprobe->insn); } -static struct uprobe *get_uprobe(struct uprobe *uprobe) +/* + * Uprobe's 64-bit refcount is actually two independent counters co-located in + * a single u64 value: + * - lower 32 bits are just a normal refcount which is incremented and + * decremented on get and put, respectively, just like a normal refcount + * would be; + * - upper 32 bits are a tag (or epoch, if you will), which is always + * incremented by one, no matter whether get or put operation is done. + * + * This upper counter is meant to distinguish between: + * - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction", + * - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt + * sequence, also proceeding to "destruction
[PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
It seems like uprobe_write_opcode() doesn't require writer locked mmap_sem, any lock (reader or writer) should be sufficient. This was established in a discussion in [0] and looking through existing code seems to confirm that there is no need for write-locked mmap_sem. Fix the comment to state this clearly. [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/ Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()") Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 081821fd529a..f87049c08ee9 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm, * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * - * Called with mm->mmap_lock held for write. + * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. */ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, -- 2.43.0
[PATCH v2 01/12] uprobes: update outdated comment
There is no task_struct passed into get_user_pages_remote() anymore; drop the parts of the comment mentioning a NULL tsk, as they're just confusing at this point. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 99be2adedbc0..081821fd529a 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) goto out; /* -* The NULL 'tsk' here ensures that any faults that occur here -* will not be accounted to the task. 'mm' *is* current->mm, -* but we treat this as a 'remote' access since it is -* essentially a kernel access to the memory. +* 'mm' *is* current->mm, but we treat this as a 'remote' access since +* it is essentially a kernel access to the memory. */ result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL); if (result < 0) -- 2.43.0
[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
This patch set, ultimately, switches the global uprobes_treelock from an RW spinlock to a per-CPU RW semaphore, which has better performance and scales better under contention and multiple parallel threads triggering lots of uprobes. To make this work well with attaching multiple uprobes (through BPF multi-uprobe), we need to add batched versions of the uprobe register/unregister APIs. This is what most of the patch set is actually doing. The actual switch to the per-CPU RW semaphore is trivial after that and is done in the very last patch #12; see its commit message for some comparison numbers. Patch #4 is probably the most important patch in the series, revamping uprobe lifetime management and refcounting. See the patch description and added code comments for all the details. With the changes in patch #4, we open up the way to refactor the uprobe_register() and uprobe_unregister() implementations in such a way that we can avoid taking uprobes_treelock many times during a single batched attachment/detachment. This allows us to accommodate the much higher latency of taking a per-CPU RW semaphore for write. The end result of this patch set is that attaching 50 thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms both before and after the changes in this patch set. Patch #5 updates existing uprobe consumers to put all the necessary pieces into struct uprobe_consumer, without having to pass around offset/ref_ctr_offset. Existing consumers already keep this data around; we just formalize the interface. Patches #6 through #10 add batched versions of the register/unregister APIs and gradually refactor them in such a way as to allow taking a single (batched) uprobes_treelock, splitting the logic into multiple independent phases. Patch #11 switches BPF multi-uprobes to the batched uprobe APIs. As mentioned, a very straightforward patch #12 takes advantage of all the prep work and just switches uprobes_treelock to a per-CPU RW semaphore.
v1->v2: - added RCU-delayed uprobe freeing to put_uprobe() (Masami); - fixed clean up handling in uprobe_register_batch (Jiri); - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg); - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid comment instead (Oleg). Andrii Nakryiko (12): uprobes: update outdated comment uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() uprobes: simplify error handling for alloc_uprobe() uprobes: revamp uprobe refcounting and lifetime management uprobes: move offset and ref_ctr_offset into uprobe_consumer uprobes: add batch uprobe register/unregister APIs uprobes: inline alloc_uprobe() logic into __uprobe_register() uprobes: split uprobe allocation and uprobes_tree insertion steps uprobes: batch uprobes_treelock during registration uprobes: improve lock batching for uprobe_unregister_batch uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes uprobes: switch uprobes_treelock to per-CPU RW semaphore include/linux/uprobes.h | 29 +- kernel/events/uprobes.c | 550 -- kernel/trace/bpf_trace.c | 40 +- kernel/trace/trace_uprobe.c | 53 +- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 +- 5 files changed, 447 insertions(+), 247 deletions(-) -- 2.43.0
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko wrote: > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu wrote: > > > > On Fri, 28 Jun 2024 09:34:26 -0700 > > Andrii Nakryiko wrote: > > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu > > > wrote: > > > > > > > > On Thu, 27 Jun 2024 09:47:10 -0700 > > > > Andrii Nakryiko wrote: > > > > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > > > > wrote: > > > > > > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > > > > > > - loff_t ref_ctr_offset, struct > > > > > > > uprobe_consumer *uc) > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > > > > + uprobe_consumer_fn get_uprobe_consumer, > > > > > > > void *ctx) > > > > > > > > > > > > Is this interface just for avoiding memory allocation? Can't we just > > > > > > allocate a temporary array of *uprobe_consumer instead? > > > > > > > > > > Yes, exactly, to avoid the need for allocating another array that > > > > > would just contain pointers to uprobe_consumer. Consumers would never > > > > > just have an array of `struct uprobe_consumer *`, because > > > > > uprobe_consumer struct is embedded in some other struct, so the array > > > > > interface isn't the most convenient. > > > > > > > > OK, I understand it. > > > > > > > > > > > > > > If you feel strongly, I can do an array, but this necessitates > > > > > allocating an extra array *and keeping it* for the entire duration of > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a > > > > > waste. This is because we don't want to do anything that can fail in > > > > > the detachment logic (so no temporary array allocation there). > > > > > > > > No need to change it, that sounds reasonable. > > > > > > > > > > Great, thanks. > > > > > > > > > > > > > Anyways, let me know how you feel about keeping this callback. 
> > > > > > > > IMHO, maybe the interface function is better to change to > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller > > > > side uses a linked list of structure, index access will need to > > > > follow the list every time. > > > > > > This would be problematic. Note how we call get_uprobe_consumer(i, > > > ctx) with i going from 0 to N in multiple independent loops. So if we > > > are only allowed to ask for the next consumer, then > > > uprobe_register_batch and uprobe_unregister_batch would need to build > > > its own internal index and remember ith instance. Which again means > > > more allocations and possibly failing uprobe_unregister_batch(), which > > > isn't great. > > > > No, I think we can use a cursor variable as; > > > > int uprobe_register_batch(struct inode *inode, > > uprobe_consumer_fn get_uprobe_consumer, void *ctx) > > { > > void *cur = ctx; > > > > while ((uc = get_uprobe_consumer()) != NULL) { > > ... > > } > > > > cur = ctx; > > while ((uc = get_uprobe_consumer()) != NULL) { > > ... > > } > > } > > > > This can also remove the cnt. > > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use > for callers, but we have one right now, and might have another one, so > not a big deal. > Actually, now that I started implementing this, I really-really don't like it. In the example above you assume by storing and reusing original ctx value you will reset iteration to the very beginning. This is not how it works in practice though. Ctx is most probably a pointer to some struct somewhere with iteration state (e.g., array of all uprobes + current index), and so get_uprobe_consumer() doesn't update `void *ctx` itself, it updates the state of that struct. And so there is no easy and clean way to reset this iterator without adding another callback or something. At which point it becomes quite cumbersome and convoluted
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko wrote: > > On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu wrote: > > > > On Mon, 24 Jun 2024 17:21:36 -0700 > > Andrii Nakryiko wrote: > > > > > Anyways, under exclusive writer lock, we double-check that refcount > > > didn't change and is still zero. If it is, we proceed with destruction, > > > because at that point we have a guarantee that find_active_uprobe() > > > can't successfully look up this uprobe instance, as it's going to be > > > removed in destructor under writer lock. If, on the other hand, > > > find_active_uprobe() managed to bump refcount from zero to one in > > > between put_uprobe()'s atomic_dec_and_test(>ref) and > > > write_lock(_treelock), we'll deterministically detect this with > > > extra atomic_read(>ref) check, and if it doesn't hold, we > > > pretend like atomic_dec_and_test() never returned true. There is no > > > resource freeing or any other irreversible action taken up till this > > > point, so we just exit early. > > > > > > One tricky part in the above is actually two CPUs racing and dropping > > > refcnt to zero, and then attempting to free resources. This can happen > > > as follows: > > > - CPU #0 drops refcnt from 1 to 0, and proceeds to grab > > > uprobes_treelock; > > > - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at > > > which point it decides that it needs to free uprobe as well. > > > > > > At this point both CPU #0 and CPU #1 will believe they need to destroy > > > uprobe, which is obviously wrong. To prevent this situations, we augment > > > refcount with epoch counter, which is always incremented by 1 on either > > > get or put operation. This allows those two CPUs above to disambiguate > > > who should actually free uprobe (it's the CPU #1, because it has > > > up-to-date epoch). See comments in the code and note the specific values > > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. 
Keep in mind that > > > a single atomi64_t is actually a two sort-of-independent 32-bit counters > > > that are incremented/decremented with a single atomic_add_and_return() > > > operation. Note also a small and extremely rare (and thus having no > > > effect on performance) need to clear the highest bit every 2 billion > > > get/put operations to prevent high 32-bit counter from "bleeding over" > > > into lower 32-bit counter. > > > > I have a question here. > > Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets > > the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref > > again? e.g. > > > > CPU#0 CPU#1 > > > > put_uprobe() { > > atomic64_add_return() > > __get_uprobe(); > > put_uprobe() { > > > > kfree(uprobe) > > } > > write_lock(_treelock); > > atomic64_read(>ref); > > } > > > > I think it is very rare case, but I could not find any code to prevent > > this scenario. > > > > Yes, I think you are right, great catch! > > I concentrated on preventing double kfree() in this situation, and > somehow convinced myself that eager kfree() is fine. But I think I'll > need to delay freeing, probably with RCU. The problem is that we can't > use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has > to be a sleepable variant of RCU. I'm thinking of using > rcu_read_lock_trace(), the same variant of RCU we use for sleepable > BPF programs (including sleepable uprobes). srcu might be too heavy > for this. > > I'll try a few variants over the next few days and see how the > performance looks. > So I think I'm going with the changes below, incorporated into this patch (nothing else changes). __get_uprobe() doesn't need any added RCU protection (we know that uprobe is alive). It's only put_uprobe() that needs to guarantee RCU protection before we drop refcount all the way until we know whether we are the winning destructor or not. 
Good thing is that the changes are pretty minimal in code and also don't seem to regress performance/scalability. So I'm pretty happy about that, will send v2 soon. diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 07ad8b2e7508..41d9e37633ca 100644 --- a/kernel/events/
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu wrote: > > On Fri, 28 Jun 2024 09:34:26 -0700 > Andrii Nakryiko wrote: > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu > > wrote: > > > > > > On Thu, 27 Jun 2024 09:47:10 -0700 > > > Andrii Nakryiko wrote: > > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > > > wrote: > > > > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > > > > > - loff_t ref_ctr_offset, struct > > > > > > uprobe_consumer *uc) > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > > > + uprobe_consumer_fn get_uprobe_consumer, > > > > > > void *ctx) > > > > > > > > > > Is this interface just for avoiding memory allocation? Can't we just > > > > > allocate a temporary array of *uprobe_consumer instead? > > > > > > > > Yes, exactly, to avoid the need for allocating another array that > > > > would just contain pointers to uprobe_consumer. Consumers would never > > > > just have an array of `struct uprobe_consumer *`, because > > > > uprobe_consumer struct is embedded in some other struct, so the array > > > > interface isn't the most convenient. > > > > > > OK, I understand it. > > > > > > > > > > > If you feel strongly, I can do an array, but this necessitates > > > > allocating an extra array *and keeping it* for the entire duration of > > > > BPF multi-uprobe link (attachment) existence, so it feels like a > > > > waste. This is because we don't want to do anything that can fail in > > > > the detachment logic (so no temporary array allocation there). > > > > > > No need to change it, that sounds reasonable. > > > > > > > Great, thanks. > > > > > > > > > > Anyways, let me know how you feel about keeping this callback. > > > > > > IMHO, maybe the interface function is better to change to > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. 
If caller > > > side uses a linked list of structure, index access will need to > > > follow the list every time. > > > > This would be problematic. Note how we call get_uprobe_consumer(i, > > ctx) with i going from 0 to N in multiple independent loops. So if we > > are only allowed to ask for the next consumer, then > > uprobe_register_batch and uprobe_unregister_batch would need to build > > its own internal index and remember ith instance. Which again means > > more allocations and possibly failing uprobe_unregister_batch(), which > > isn't great. > > No, I think we can use a cursor variable as; > > int uprobe_register_batch(struct inode *inode, > uprobe_consumer_fn get_uprobe_consumer, void *ctx) > { > void *cur = ctx; > > while ((uc = get_uprobe_consumer()) != NULL) { > ... > } > > cur = ctx; > while ((uc = get_uprobe_consumer()) != NULL) { > ... > } > } > > This can also remove the cnt. Ok, if you prefer this I'll switch. It's a bit more cumbersome to use for callers, but we have one right now, and might have another one, so not a big deal. > > Thank you, > > > > > For now this API works well, I propose to keep it as is. For linked > > list case consumers would need to allocate one extra array or pay the > > price of O(N) search (which might be ok, depending on how many uprobes > > are being attached). But we don't have such consumers right now, > > thankfully. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) > > > > > > > > > -- > > > Masami Hiramatsu (Google) > > > -- > Masami Hiramatsu (Google)
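For readers following along, the index-based callback interface being debated can be mocked up in plain userspace C. Everything below is an illustrative sketch with made-up names (`struct my_link`, `register_batch`), not the kernel API:

```c
#include <stddef.h>
#include <stdio.h>

/* simplified stand-ins for the kernel types; names are illustrative */
struct uprobe_consumer { long offset; };
typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);

/* consumers embedded in a bigger struct, as in BPF multi-uprobe links,
 * so no separate array of consumer pointers has to be allocated */
struct my_link { struct uprobe_consumer consumers[3]; };

static struct uprobe_consumer *get_consumer(size_t idx, void *ctx)
{
	struct my_link *link = ctx;

	return &link->consumers[idx];
}

/* index-based access lets the batch API run several independent
 * 0..cnt passes, e.g. validate everything before touching anything */
static int register_batch(int cnt, uprobe_consumer_fn get, void *ctx)
{
	int i;

	for (i = 0; i < cnt; i++)	/* pass 1: validate */
		if (get(i, ctx)->offset < 0)
			return -1;
	for (i = 0; i < cnt; i++)	/* pass 2: "register" */
		printf("consumer %d at offset 0x%lx\n", i, get(i, ctx)->offset);
	return 0;
}
```

With a cursor-style `next_uprobe_consumer(void **cur)` callback, each pass would instead reset the cursor and re-walk the caller's data structure, which is what Masami's sketch above does.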
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu wrote: > > On Thu, 27 Jun 2024 09:47:10 -0700 > Andrii Nakryiko wrote: > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > wrote: > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > Andrii Nakryiko wrote: > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > > > - loff_t ref_ctr_offset, struct > > > > uprobe_consumer *uc) > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > + uprobe_consumer_fn get_uprobe_consumer, void > > > > *ctx) > > > > > > Is this interface just for avoiding memory allocation? Can't we just > > > allocate a temporary array of *uprobe_consumer instead? > > > > Yes, exactly, to avoid the need for allocating another array that > > would just contain pointers to uprobe_consumer. Consumers would never > > just have an array of `struct uprobe_consumer *`, because > > uprobe_consumer struct is embedded in some other struct, so the array > > interface isn't the most convenient. > > OK, I understand it. > > > > > If you feel strongly, I can do an array, but this necessitates > > allocating an extra array *and keeping it* for the entire duration of > > BPF multi-uprobe link (attachment) existence, so it feels like a > > waste. This is because we don't want to do anything that can fail in > > the detachment logic (so no temporary array allocation there). > > No need to change it, that sounds reasonable. > Great, thanks. > > > > Anyways, let me know how you feel about keeping this callback. > > IMHO, maybe the interface function is better to change to > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller > side uses a linked list of structure, index access will need to > follow the list every time. This would be problematic. Note how we call get_uprobe_consumer(i, ctx) with i going from 0 to N in multiple independent loops. 
So if we are only allowed to ask for the next consumer, then uprobe_register_batch and uprobe_unregister_batch would need to build its own internal index and remember ith instance. Which again means more allocations and possibly failing uprobe_unregister_batch(), which isn't great. For now this API works well, I propose to keep it as is. For linked list case consumers would need to allocate one extra array or pay the price of O(N) search (which might be ok, depending on how many uprobes are being attached). But we don't have such consumers right now, thankfully. > > Thank you, > > > > > > > > > > Thank you, > > > > > > -- > > > Masami Hiramatsu (Google) > > > -- > Masami Hiramatsu (Google)
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu wrote: > > On Mon, 24 Jun 2024 17:21:38 -0700 > Andrii Nakryiko wrote: > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > - loff_t ref_ctr_offset, struct uprobe_consumer > > *uc) > > +int uprobe_register_batch(struct inode *inode, int cnt, > > + uprobe_consumer_fn get_uprobe_consumer, void *ctx) > > Is this interface just for avoiding memory allocation? Can't we just > allocate a temporary array of *uprobe_consumer instead? Yes, exactly, to avoid the need for allocating another array that would just contain pointers to uprobe_consumer. Consumers would never just have an array of `struct uprobe_consumer *`, because uprobe_consumer struct is embedded in some other struct, so the array interface isn't the most convenient. If you feel strongly, I can do an array, but this necessitates allocating an extra array *and keeping it* for the entire duration of BPF multi-uprobe link (attachment) existence, so it feels like a waste. This is because we don't want to do anything that can fail in the detachment logic (so no temporary array allocation there). Anyways, let me know how you feel about keeping this callback. > > Thank you, > > -- > Masami Hiramatsu (Google)
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu wrote:
>
> On Mon, 24 Jun 2024 17:21:36 -0700
> Andrii Nakryiko wrote:
>
> > Anyways, under exclusive writer lock, we double-check that refcount
> > didn't change and is still zero. If it is, we proceed with destruction,
> > because at that point we have a guarantee that find_active_uprobe()
> > can't successfully look up this uprobe instance, as it's going to be
> > removed in destructor under writer lock. If, on the other hand,
> > find_active_uprobe() managed to bump refcount from zero to one in
> > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > pretend like atomic_dec_and_test() never returned true. There is no
> > resource freeing or any other irreversible action taken up till this
> > point, so we just exit early.
> >
> > One tricky part in the above is actually two CPUs racing and dropping
> > refcnt to zero, and then attempting to free resources. This can happen
> > as follows:
> >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
> >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> >     which point it decides that it needs to free uprobe as well.
> >
> > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > uprobe, which is obviously wrong. To prevent these situations, we augment
> > refcount with epoch counter, which is always incremented by 1 on either
> > get or put operation. This allows those two CPUs above to disambiguate
> > who should actually free uprobe (it's the CPU #1, because it has
> > up-to-date epoch). See comments in the code and note the specific values
> > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants.
Keep in mind that
> > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > that are incremented/decremented with a single atomic_add_and_return()
> > operation. Note also a small and extremely rare (and thus having no
> > effect on performance) need to clear the highest bit every 2 billion
> > get/put operations to prevent high 32-bit counter from "bleeding over"
> > into lower 32-bit counter.
>
> I have a question here.
> Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> the uprobes_treelock, and free the uprobe before CPU#0 validates uprobe->ref
> again? e.g.
>
> CPU#0                                  CPU#1
>
> put_uprobe() {
>     atomic64_add_return()
>                                        __get_uprobe();
>                                        put_uprobe() {
>                                            kfree(uprobe)
>                                        }
>     write_lock(&uprobes_treelock);
>     atomic64_read(&uprobe->ref);
> }
>
> I think it is a very rare case, but I could not find any code to prevent
> this scenario.
>

Yes, I think you are right, great catch! I concentrated on preventing
double kfree() in this situation, and somehow convinced myself that
eager kfree() is fine. But I think I'll need to delay freeing, probably
with RCU. The problem is that we can't use
rcu_read_lock()/rcu_read_unlock() because we take locks, so it has to be
a sleepable variant of RCU. I'm thinking of using rcu_read_lock_trace(),
the same variant of RCU we use for sleepable BPF programs (including
sleepable uprobes). srcu might be too heavy for this.

I'll try a few variants over the next few days and see how the
performance looks.

> Thank you,
>
> --
> Masami Hiramatsu (Google)
>
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Wed, Jun 26, 2024 at 4:27 AM Jiri Olsa wrote:
>
> On Mon, Jun 24, 2024 at 05:21:38PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > +	for (i = 0; i < cnt; i++) {
> > +		uc = get_uprobe_consumer(i, ctx);
> > +
> > +		/* Each consumer must have at least one set consumer */
> > +		if (!uc || (!uc->handler && !uc->ret_handler))
> > +			return -EINVAL;
> > +		/* Racy, just to catch the obvious mistakes */
> > +		if (uc->offset > i_size_read(inode))
> > +			return -EINVAL;
> > +		if (uc->uprobe)
> > +			return -EINVAL;
> > +		/*
> > +		 * This ensures that copy_from_page(), copy_to_page() and
> > +		 * __update_ref_ctr() can't cross page boundary.
> > +		 */
> > +		if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
> > +			return -EINVAL;
> > +		if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
> > +			return -EINVAL;
> > +	}
> >
> > -	down_write(&uprobe->register_rwsem);
> > -	consumer_add(uprobe, uc);
> > -	ret = register_for_each_vma(uprobe, uc);
> > -	if (ret)
> > -		__uprobe_unregister(uprobe, uc);
> > -	up_write(&uprobe->register_rwsem);
> > +	for (i = 0; i < cnt; i++) {
> > +		uc = get_uprobe_consumer(i, ctx);
> >
> > -	if (ret)
> > -		put_uprobe(uprobe);
> > +		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
> > +		if (IS_ERR(uprobe)) {
> > +			ret = PTR_ERR(uprobe);
> > +			goto cleanup_uprobes;
> > +		}
> > +
> > +		uc->uprobe = uprobe;
> > +	}
> > +
> > +	for (i = 0; i < cnt; i++) {
> > +		uc = get_uprobe_consumer(i, ctx);
> > +		uprobe = uc->uprobe;
> > +
> > +		down_write(&uprobe->register_rwsem);
> > +		consumer_add(uprobe, uc);
> > +		ret = register_for_each_vma(uprobe, uc);
> > +		if (ret)
> > +			__uprobe_unregister(uprobe, uc);
> > +		up_write(&uprobe->register_rwsem);
> > +
> > +		if (ret) {
> > +			put_uprobe(uprobe);
> > +			goto cleanup_unreg;
> > +		}
> > +	}
> > +
> > +	return 0;
> >
> > +cleanup_unreg:
> > +	/* unregister all uprobes we managed to register until failure */
> > +	for (i--; i >= 0; i--) {
> > +		uc = get_uprobe_consumer(i, ctx);
> > +
> > +		down_write(&uprobe->register_rwsem);
> > +		__uprobe_unregister(uc->uprobe, uc);
> > +		up_write(&uprobe->register_rwsem);
> > +	}
> > +cleanup_uprobes:
>
> when we jump here from 'goto cleanup_uprobes' not all of the
> consumers might have uc->uprobe set up
>
> perhaps we can set cnt = i - 1 before the goto, or just check uc->uprobe below

yep, you are right, I missed this part during multiple rounds of
refactorings. I think the `if (uc->uprobe)` check is the cleanest
approach here.

>
> > +	/* put all the successfully allocated/reused uprobes */
> > +	for (i = cnt - 1; i >= 0; i--) {
>
> curious, any reason why we go from the top here?

No particular reason. This started as (i = i - 1; i >= 0; i--), but
then as I kept splitting steps I needed to do this over all uprobes.
Anyways, I can do a clean `i = 0; i < cnt; i++` with `if (uc->uprobe)`
check.

>
> thanks,
> jirka
>
> > +		uc = get_uprobe_consumer(i, ctx);
> > +
> > +		put_uprobe(uc->uprobe);
> > +		uc->uprobe = NULL;
> > +	}
> >  	return ret;
> >  }
> >
> >  int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
> >  {
> > -	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
> > +	return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
> >  }
> >  EXPORT_SYMBOL_GPL(uprobe_register);
> >
> > --
> > 2.43.0
> >
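The `if (uc->uprobe)` guard agreed on above is the standard C pattern for unwinding a partially completed batch: the cleanup path must tolerate entries that were never set up. A hypothetical, self-contained illustration (not the kernel code; `fake_alloc` simulates an allocation that can fail mid-way):

```c
#include <stddef.h>

#define CNT 3

struct consumer { void *uprobe; };	/* NULL until set up, as in the patch */

static int storage[CNT];

static void *fake_alloc(int i, int fail_at)
{
	return i == fail_at ? NULL : &storage[i];
}

/* set up one resource per consumer; on mid-way failure, release only
 * what was actually set up -- the NULL check makes it safe to reach
 * the cleanup loop with a partially initialized array */
static int setup_all(struct consumer *cs, int fail_at)
{
	int i, ret = 0;

	for (i = 0; i < CNT; i++) {
		cs[i].uprobe = fake_alloc(i, fail_at);
		if (!cs[i].uprobe) {
			ret = -1;
			goto cleanup;
		}
	}
	return 0;
cleanup:
	for (i = 0; i < CNT; i++) {
		if (!cs[i].uprobe)	/* never set up: skip */
			continue;
		cs[i].uprobe = NULL;	/* "put" it */
	}
	return ret;
}
```

The caller is expected to zero-initialize the array, just like `uc->uprobe` starts out NULL in the kernel patch.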
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Tue, Jun 25, 2024 at 11:03 PM kernel test robot wrote: > > Hi Andrii, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on next-20240624] > [also build test WARNING on v6.10-rc5] > [cannot apply to perf-tools-next/perf-tools-next tip/perf/core > perf-tools/perf-tools linus/master acme/perf/core v6.10-rc5 v6.10-rc4 > v6.10-rc3] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: > https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-update-outdated-comment/20240626-001728 > base: next-20240624 > patch link: > https://lore.kernel.org/r/20240625002144.3485799-5-andrii%40kernel.org > patch subject: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime > management > config: x86_64-defconfig > (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfm0xj-...@intel.com/config) > compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0 > reproduce (this is a W=1 build): > (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfm0xj-...@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version > of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot > | Closes: > https://lore.kernel.org/oe-kbuild-all/202406261300.ebbfm0xj-...@intel.com/ > > All warnings (new ones prefixed by >>): > > >> kernel/events/uprobes.c:638: warning: Function parameter or struct member > >> 'uprobe' not described in '__get_uprobe' > >> kernel/events/uprobes.c:638: warning: expecting prototype for Caller has > >> to make sure that(). Prototype was for __get_uprobe() instead > > > vim +638 kernel/events/uprobes.c > > b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24 625 > b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24 626 /** I shouldn't have used /** here, I'll fix this. 
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  627  * Caller has to make sure that:
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  628  * a) either uprobe's refcnt is positive before this call;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  629  * b) or uprobes_treelock is held (doesn't matter if for read or write),
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  630  *    preventing uprobe's destructor from removing it from uprobes_tree.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  631  *
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  632  * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  633  * it detects that its refcount went back to being positive again inbetween it
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  634  * dropping to zero at some point and (potentially delayed) destructor
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  635  * callback actually running.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  636  */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  637 static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> f231722a2b27ee Oleg Nesterov   2015-07-21 @638 {
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  639 	s64 v;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  640
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  641 	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  642
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  643 	/*
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  644 	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  645 	 * we don't retry because there is another CPU that just managed to
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  646 	 * update refcnt and will attempt the same "fix up". Eventually one of
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  647 	 * them will succeed to clear highset bit.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  648 	 */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  649 	if (unlikely(v < 0))
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  650 		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  651
> f231722a2b27ee Oleg Nesterov   2015-07-21  652 	return uprobe;
> f231722a2b27ee Oleg Nesterov   2015-07-21  653 }
> f231722a2b27ee Oleg Nesterov   2015-07-21  654
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
On Tue, Jun 25, 2024 at 12:09 PM Oleg Nesterov wrote:
>
> On 06/25, Andrii Nakryiko wrote:
> >
> > On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov wrote:
> > >
> > > Why?
> > >
> > > So far I don't understand this change. Quite possibly I missed something,
> > > but in this case the changelog should explain the problem more clearly.
> > >
> >
> > I just went off of "Called with mm->mmap_lock held for write." comment
> > in uprobe_write_opcode(), tbh.
>
> Ah, indeed... and git blame makes me sad ;)
>
> I _think_ that 29dedee0e693a updated this comment without any thinking,
> but today I can't recall. In any case, today this has nothing to do with
> mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect)
> when it returns -EAGAIN but this is another story.
>
> > If we don't actually need writer
> > mmap_lock, we should probably update at least that comment.
>
> Agreed.

Ok, I'll drop this change and will just update the comment.

> > There is a
> > lot going on in uprobe_write_opcode(), and I don't understand all the
> > requirements there.
>
> Heh. Neither me today ;)
>
> Oleg.
>
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov wrote: > > On 06/25, Masami Hiramatsu wrote: > > > > On Mon, 24 Jun 2024 17:21:34 -0700 > > Andrii Nakryiko wrote: > > > > > Given unapply_uprobe() can call remove_breakpoint() which eventually > > > calls uprobe_write_opcode(), which can modify a set of memory pages and > > > expects mm->mmap_lock held for write, it needs to have writer lock. > > > > Oops, it is an actual bug, right? > > Why? > > So far I don't understand this change. Quite possibly I missed something, > but in this case the changelog should explain the problem more clearly. > I just went off of "Called with mm->mmap_lock held for write." comment in uprobe_write_opcode(), tbh. If we don't actually need writer mmap_lock, we should probably update at least that comment. There is a lot going on in uprobe_write_opcode(), and I don't understand all the requirements there. > Oleg. >
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Tue, Jun 25, 2024 at 7:45 AM Oleg Nesterov wrote:
>
> Again, I'll try to read (at least this) patch later this week,
> just one cosmetic nit for now...

Thanks, and yep, please take your time. I understand that this is not a
trivial change to a code base that has been sitting untouched for many
years now. But I'd really appreciate it if you can give it a thorough
review anyways!

>
> On 06/24, Andrii Nakryiko wrote:
> >
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
> > +#define UPROBE_REFCNT_PUT (0xffffffffLL)
>
> How about
>
> 	#define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)
> 	#define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)

It's cute, I'll change to that. But I'll probably also add a comment
with the final value in hex for someone like me (because I can reason
about 0xffffffff and its effect on refcount, not so much with
`(1LL << 32) - 1`).

>
> ? this makes them symmetrical and makes it clear why
> atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment.
>
> Feel free to ignore if you don't like it.
>
> Oleg.
>
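Oleg's symmetric form makes the two-counters-in-one-atomic64 scheme easy to sanity-check in userspace. A quick standalone demo with C11 atomics (not the kernel code), assuming the layout the patch describes: low 32 bits hold the refcount, high 32 bits hold the epoch. PUT bumps the epoch via the carry out of the low half, so it only behaves as described while the refcount is >= 1:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Oleg's symmetric form of the constants */
#define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)	/* 0x100000001 */
#define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)	/* 0x0ffffffff */

/* get: refcnt + 1 and epoch + 1, as a single 64-bit atomic add */
static inline uint64_t ref_get(_Atomic uint64_t *ref)
{
	return atomic_fetch_add(ref, UPROBE_REFCNT_GET) + UPROBE_REFCNT_GET;
}

/* put: refcnt - 1 and epoch + 1; the epoch increment comes from the
 * carry when 0xffffffff is added to a non-zero low half */
static inline uint64_t ref_put(_Atomic uint64_t *ref)
{
	return atomic_fetch_add(ref, UPROBE_REFCNT_PUT) + UPROBE_REFCNT_PUT;
}

static inline uint32_t refcnt(uint64_t v) { return (uint32_t)v; }
static inline uint32_t epoch(uint64_t v)  { return (uint32_t)(v >> 32); }
```

Starting from `epoch=0, refcnt=1`, a get yields `epoch=1, refcnt=2`, and two puts yield `epoch=2, refcnt=1` and then `epoch=3, refcnt=0` — every operation moves the epoch, which is what lets two racing CPUs that both saw refcnt hit zero disambiguate who frees.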
Re: [PATCH v2 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces
On Mon, Jun 24, 2024 at 6:14 PM Masami Hiramatsu wrote:
>
> On Tue, 21 May 2024 18:38:45 -0700
> Andrii Nakryiko wrote:
>
> > Add a set of tests to validate that stack traces captured from or in the
> > presence of active uprobes and uretprobes are valid and complete.
> >
> > For this we use BPF programs that are installed either on entry or exit
> > of user function, plus deep-nested USDT. One of target functions
> > (target_1) is recursive to generate two different entries in the stack
> > trace for the same uprobe/uretprobe, testing potential edge conditions.
> >
> > Without fixes in this patch set, we get something like this for one of
> > the scenarios:
> >
> >  caller: 0x758fff - 0x7595ab
> >  target_1: 0x758fd5 - 0x758fff
> >  target_2: 0x758fca - 0x758fd5
> >  target_3: 0x758fbf - 0x758fca
> >  target_4: 0x758fb3 - 0x758fbf
> >  ENTRY #0: 0x758fb3 (in target_4)
> >  ENTRY #1: 0x758fd3 (in target_2)
> >  ENTRY #2: 0x758ffd (in target_1)
> >  ENTRY #3: 0x7fffe000
> >  ENTRY #4: 0x7fffe000
> >  ENTRY #5: 0x6f8f39
> >  ENTRY #6: 0x6fa6f0
> >  ENTRY #7: 0x7f403f229590
> >
> > Entry #3 and #4 (0x7fffe000) are uretprobe trampoline addresses
> > which obscure actual target_1 and another target_1 invocations. Also
> > note that between entry #0 and entry #1 we are missing an entry for
> > target_3, which is fixed in patch #2.
>
> Please avoid using `patch #2` because after commit, this means nothing.

Yep, makes sense, sorry about that, will keep descriptions a bit more
general going forward.
> > Thank you, > > > > > With all the fixes, we get desired full stack traces: > > > > caller: 0x758fff - 0x7595ab > > target_1: 0x758fd5 - 0x758fff > > target_2: 0x758fca - 0x758fd5 > > target_3: 0x758fbf - 0x758fca > > target_4: 0x758fb3 - 0x758fbf > > ENTRY #0: 0x758fb7 (in target_4) > > ENTRY #1: 0x758fc8 (in target_3) > > ENTRY #2: 0x758fd3 (in target_2) > > ENTRY #3: 0x758ffd (in target_1) > > ENTRY #4: 0x758ff3 (in target_1) > > ENTRY #5: 0x75922c (in caller) > > ENTRY #6: 0x6f8f39 > > ENTRY #7: 0x6fa6f0 > > ENTRY #8: 0x7f986adc4cd0 > > > > Now there is a logical and complete sequence of function calls. > > > > Signed-off-by: Andrii Nakryiko > > --- > > .../bpf/prog_tests/uretprobe_stack.c | 186 ++ > > .../selftests/bpf/progs/uretprobe_stack.c | 96 + > > 2 files changed, 282 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c > > create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c > > [...]
Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Mon, Jun 24, 2024 at 5:39 PM Masami Hiramatsu wrote:
>
> On Mon, 24 Jun 2024 13:32:35 -0700
> Andrii Nakryiko wrote:
>
> > On Mon, Jun 17, 2024 at 3:37 PM Andrii Nakryiko wrote:
> > >
> > > On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu wrote:
> > > > >
> > > > > On Tue, 21 May 2024 18:38:43 -0700
> > > > > Andrii Nakryiko wrote:
> > > > >
> > > > > > When kernel has pending uretprobes installed, it hijacks original user
> > > > > > function return address on the stack with a uretprobe trampoline
> > > > > > address. There could be multiple such pending uretprobes (either on
> > > > > > different user functions or on the same recursive one) at any given
> > > > > > time within the same task.
> > > > > >
> > > > > > This approach interferes with the user stack trace capture logic, which
> > > > > > would report surprising addresses (like 0x7fffe000) that correspond
> > > > > > to a special "[uprobes]" section that kernel installs in the target
> > > > > > process address space for uretprobe trampoline code, while logically it
> > > > > > should be an address somewhere within the calling function of another
> > > > > > traced user function.
> > > > > >
> > > > > > This is easy to correct for, though. Uprobes subsystem keeps track of
> > > > > > pending uretprobes and records original return addresses. This patch is
> > > > > > using this to do a post-processing step and restore each trampoline
> > > > > > address entry with the correct original return address. This is done only
> > > > > > if there are pending uretprobes for current task.
> > > > > >
> > > > > > This is a similar approach to what fprobe/kretprobe infrastructure is
> > > > > > doing when capturing kernel stack traces in the presence of pending
> > > > > > return probes.
> > > > > > > > > > > > > > > > This looks good to me because this trampoline information is only > > > > > managed in uprobes. And it should be provided when unwinding user > > > > > stack. > > > > > > > > > > Reviewed-by: Masami Hiramatsu (Google) > > > > > > > > > > Thank you! > > > > > > > > Great, thanks for reviewing, Masami! > > > > > > > > Would you take this fix through your tree, or where should it be routed > > > > to? > > > > > > > > > > Ping! What would you like me to do with this patch set? Should I > > > resend it without patch 3 (the one that tries to guess whether we are > > > at the entry to the function?), or did I manage to convince you that > > > this heuristic is OK, given perf's stack trace capturing logic already > > > makes heavy assumption of rbp register being used for frame pointer? > > > > > > Please let me know your preference, I could drop patch 3 and send it > > > separately, if that helps move the main fix forward. Thanks! > > > > Masami, > > > > Another week went by with absolutely no action or reaction from you. > > Is there any way I can help improve the collaboration here? > > OK, if there is no change without [3/4], let me pick the others on Thanks, Masami! Selftest is probably failing (as it expects correct stack trace), but that's ok, we can fix it up once linux-trace-kernel and bpf-next trees converge. > probes/for-next directly. [3/4] I need other x86 maintainer's > comments. And it should be handled by PMU maintainers. Sounds good, I'll repost it separately. Do I need to CC anyone else besides people on this thread already? > > Thanks, > > > > > > I'm preparing more patches for uprobes and about to submit them. If > > each reviewed and ready to be applied patch set has to sit idle for > > multiple weeks for no good reason, we all will soon be lost just plain > > forgetting the context in which the patch was prepared. 
> > > > Please, prioritize handling patches that are meant to be routed > > through your tree in a more timely fashion. Or propose some > > alternative acceptable arrangement. > > > > Thank you. > > > > > > > > > > > > > > > > Reported-by: Riham Selim > > > > > > Signed-off-by: Andrii Nakryiko > > > > > > --- > > > > > > kernel/events/callchain.c | 43 > > > > > > ++- > > > > > > kernel/events/uprobes.c | 9 > > > > > > 2 files changed, 51 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > [...] > > > -- > Masami Hiramatsu (Google)
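Mechanically, the fix under discussion is a post-processing pass over the captured user callchain: each entry that equals the uretprobe trampoline address gets substituted, in order, with the original return address that uprobes recorded when arming the corresponding uretprobe. A simplified, hypothetical sketch of the idea (the trampoline constant and all names are illustrative, not the kernel's):

```c
#include <stddef.h>
#include <stdint.h>

#define TRAMPOLINE_ADDR 0x7fffe000ULL	/* illustrative value */

/* replace trampoline entries in-place with the original return
 * addresses recorded when each uretprobe was armed; ret_addrs[] is
 * ordered from the deepest pending uretprobe upward */
static void fixup_callchain(uint64_t *ips, size_t n,
			    const uint64_t *ret_addrs, size_t n_pending)
{
	size_t next = 0;

	for (size_t i = 0; i < n; i++) {
		if (ips[i] == TRAMPOLINE_ADDR && next < n_pending)
			ips[i] = ret_addrs[next++];
	}
}
```

Applied to the example trace in the selftest commit message, the two `0x7fffe000` entries would become the two recorded `target_1` return addresses, restoring a logically complete call sequence.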
[PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
Switch internals of BPF multi-uprobes to batched version of uprobe registration and unregistration APIs. This also simplifies BPF clean up code a bit thanks to all-or-nothing guarantee of uprobes_register_batch().

Signed-off-by: Andrii Nakryiko
---
 kernel/trace/bpf_trace.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
 	struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
-				  u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx)
 {
-	u32 i;
+	struct bpf_uprobe_multi_link *link = ctx;
 
-	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
-	}
+	return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link)
 	struct bpf_uprobe_multi_link *umulti_link;
 
 	umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-	bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt);
+	uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt,
+				umulti_link_get_uprobe_consumer, umulti_link);
 	if (umulti_link->task)
 		put_task_struct(umulti_link->task);
 	path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
 		      &bpf_uprobe_multi_link_lops, prog);
 
-	for (i = 0; i < cnt; i++) {
-		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
-		if (err) {
-			bpf_uprobe_unregister(&path, uprobes, i);
-			goto error_free;
-		}
-	}
+	err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+				    umulti_link_get_uprobe_consumer, link);
+	if (err)
+		goto error_free;
 
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-- 
2.43.0
[PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
With all the batch uprobe APIs work we are now finally ready to reap the benefits. Switch uprobes_treelock from reader-writer spinlock to a much more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool, trig-uprobe-nop benchmark specifically, to see how uprobe total throughput scales with number of competing threads (mapped to individual CPUs). Here are results:

  # threads   BEFORE (mln/s)   AFTER (mln/s)
  ---------   --------------   -------------
  1           3.131            3.140
  2           3.394            3.601
  3           3.630            3.960
  4           3.317            3.551
  5           3.448            3.464
  6           3.345            3.283
  7           3.469            3.444
  8           3.182            3.258
  9           3.138            3.139
  10          2.999            3.212
  11          2.903            3.183
  12          2.802            3.027
  13          2.792            3.027
  14          2.695            3.086
  15          2.822            2.965
  16          2.679            2.939
  17          2.622            2.888
  18          2.628            2.914
  19          2.702            2.836
  20          2.561            2.837

One can see that per-CPU RW semaphore-based implementation scales better with number of CPUs (especially that single CPU throughput is basically the same).

Note, scalability is still limited by register_rwsem and this will hopefully be addressed in follow-up patch set(s).
Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7e94671a672a..e24b81b0e149 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -667,7 +667,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		bool destroy;
 
 		if (!tree_locked)
-			write_lock(&uprobes_treelock);
+			percpu_down_write(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -691,7 +691,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
 		if (!tree_locked)
-			write_unlock(&uprobes_treelock);
+			percpu_up_write(&uprobes_treelock);
 
 		/* uprobe got resurrected, pretend we never tried to free it */
 		if (!destroy)
@@ -789,9 +789,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -1178,7 +1178,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
 	}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 		uprobe = uc->uprobe;
@@ -1189,7 +1189,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1294,7 +1294,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1317,7 +1317,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		}
 	}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	if (ret)
 		goto cleanup_uprobes;
@@ -1349,14 +1349,14 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = cnt - 1; i >= 0; i--) {
 		uc = get_uprobe_consumer(i, ctx)
[PATCH 09/12] uprobes: batch uprobes_treelock during registration
Now that we have a good separation of each registration step, take uprobes_treelock just once for each relevant registration step, and then process all relevant uprobes in one go. Even if the writer lock introduces a relatively large delay (as might happen with a per-CPU RW semaphore), this will keep overall batch attachment reasonably fast.

We teach put_uprobe(), through the __put_uprobe() helper, to optionally take uprobes_treelock or not, to accommodate this pattern.

With these changes we don't need the insert_uprobe() operation that unconditionally takes uprobes_treelock, so get rid of it, leaving only the lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 45 +
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5e98e179d47d..416f408cbed9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -657,7 +657,7 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
        return !RB_EMPTY_NODE(&uprobe->rb_node);
 }

-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
        s64 v;

@@ -666,7 +666,8 @@ static void put_uprobe(struct uprobe *uprobe)
        if (unlikely((u32)v == 0)) {
                bool destroy;

-               write_lock(&uprobes_treelock);
+               if (!tree_locked)
+                       write_lock(&uprobes_treelock);
                /*
                 * We might race with find_uprobe()->__get_uprobe() executed
                 * from inside read-locked uprobes_treelock, which can bump
@@ -689,7 +690,8 @@ static void put_uprobe(struct uprobe *uprobe)
                destroy = atomic64_read(&uprobe->ref) == v;
                if (destroy && uprobe_is_active(uprobe))
                        rb_erase(&uprobe->rb_node, &uprobes_tree);
-               write_unlock(&uprobes_treelock);
+               if (!tree_locked)
+                       write_unlock(&uprobes_treelock);

                /* uprobe got resurrected, pretend we never tried to free it */
                if (!destroy)
@@ -718,6 +720,11 @@ static void put_uprobe(struct uprobe *uprobe)
        (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
 }

+static void put_uprobe(struct uprobe *uprobe)
+{
+       __put_uprobe(uprobe, false);
+}
+
 static __always_inline
int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset, const struct uprobe *r) @@ -817,21 +824,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) return u; } -/* - * Acquire uprobes_treelock and insert uprobe into uprobes_tree - * (or reuse existing one, see __insert_uprobe() comments above). - */ -static struct uprobe *insert_uprobe(struct uprobe *uprobe) -{ - struct uprobe *u; - - write_lock(_treelock); - u = __insert_uprobe(uprobe); - write_unlock(_treelock); - - return u; -} - static void ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) { @@ -1291,6 +1283,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, uc->uprobe = uprobe; } + ret = 0; + write_lock(_treelock); for (i = 0; i < cnt; i++) { struct uprobe *cur_uprobe; @@ -1298,19 +1292,24 @@ int uprobe_register_batch(struct inode *inode, int cnt, uprobe = uc->uprobe; /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); + cur_uprobe = __insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); + + __put_uprobe(cur_uprobe, true); ret = -EINVAL; - goto cleanup_uprobes; + goto unlock_treelock; } kfree(uprobe); uc->uprobe = cur_uprobe; } } +unlock_treelock: + write_unlock(_treelock); + if (ret) + goto cleanup_uprobes; for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); @@ -1340,12 +1339,14 @@ int uprobe_register_batch(struct inode *inode, int cnt, } cleanup_uprobes: /* put all the successfully allocated/reused uprobes */ + write_lock(_treelock); for (i = cnt - 1; i >= 0; i--) { uc = get_uprobe_consumer(i, ctx); - put_uprobe(uc->uprobe); + __put_uprobe(uc->uprobe, true); uc->uprobe = NULL; } + write_unlock(_treelock); return ret; } -- 2.43.0
[PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch
Similarly to what we did for uprobe_register_batch(), split uprobe_unregister_batch() into two separate phases with different locking needs. First, all the VMA unregistration is performed while holding a per-uprobe register_rwsem. Then, we take a batched uprobes_treelock once to __put_uprobe() all the uprobe_consumers. That uprobe_consumer->uprobe field is really handy in helping with this.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 416f408cbed9..7e94671a672a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1162,8 +1162,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
-       struct uprobe *uprobe;
        struct uprobe_consumer *uc;
+       struct uprobe *uprobe;
        int i;

        for (i = 0; i < cnt; i++) {
@@ -1176,10 +1176,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
                down_write(&uprobe->register_rwsem);
                __uprobe_unregister(uprobe, uc);
                up_write(&uprobe->register_rwsem);
-               put_uprobe(uprobe);
+       }

+       write_lock(&uprobes_treelock);
+       for (i = 0; i < cnt; i++) {
+               uc = get_uprobe_consumer(i, ctx);
+               uprobe = uc->uprobe;
+
+               if (!uprobe)
+                       continue;
+
+               __put_uprobe(uprobe, true);
                uc->uprobe = NULL;
        }
+       write_unlock(&uprobes_treelock);
 }

 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0
[PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
Now we are ready to split alloc-and-insert coupled step into two separate phases. First, we allocate and prepare all potentially-to-be-inserted uprobe instances, assuming corresponding uprobes are not yet in uprobes_tree. This is needed so that we don't do memory allocations under uprobes_treelock (once we batch locking for each step). Second, we insert new uprobes or reuse already existing ones into uprobes_tree. Any uprobe that turned out to be not necessary is immediately freed, as there are no other references to it. This concludes preparations that make uprobes_register_batch() ready to batch and optimize locking per each phase. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ebd8511b6eb2..5e98e179d47d 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1270,9 +1270,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, return -EINVAL; } + /* pre-allocate new uprobe instances */ for (i = 0; i < cnt; i++) { - struct uprobe *cur_uprobe; - uc = get_uprobe_consumer(i, ctx); uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); @@ -1289,6 +1288,15 @@ int uprobe_register_batch(struct inode *inode, int cnt, RB_CLEAR_NODE(>rb_node); atomic64_set(>ref, 1); + uc->uprobe = uprobe; + } + + for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + /* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ @@ -1296,15 +1304,12 @@ int uprobe_register_batch(struct inode *inode, int cnt, if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); - kfree(uprobe); ret = -EINVAL; goto cleanup_uprobes; } kfree(uprobe); - uprobe = cur_uprobe; + uc->uprobe = cur_uprobe; } - - uc->uprobe = uprobe; } for (i = 0; i < 
cnt; i++) { @@ -1318,10 +1323,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, __uprobe_unregister(uprobe, uc); up_write(>register_rwsem); - if (ret) { - put_uprobe(uprobe); + if (ret) goto cleanup_unreg; - } } return 0; -- 2.43.0
[PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
Introduce batch versions of uprobe registration (attachment) and unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of uprobe_register_batch() are such that either all uprobe_consumers are successfully attached or none of them remain attached after the return. There is no guarantee of atomicity of attachment, though, and so while batch attachment is proceeding, some uprobes might start firing before others are completely attached. Even if overall attachment eventually fails, some successfully attached uprobes might fire and callers have to be prepared to handle that. This is in no way a regression compared to the current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of a `struct uprobe *uprobe` field to `struct uprobe_consumer`, which is meant for internal uprobe subsystem usage only. We use this field both as temporary storage (to avoid unnecessary allocations) and as a back link to the associated uprobe, to simplify and speed up uprobe unregistration, as we can now avoid yet another tree lookup when unregistering a uprobe_consumer.

The general direction with the uprobe registration implementation is to do batch attachment in distinct steps, each step performing some set of checks or actions on all uprobe_consumers before proceeding to the next phase. This, after some more changes in the next patches, allows us to batch locking for each phase and in such a way amortize any long delays that might be added by writer locks (especially once we switch uprobes_treelock to a per-CPU RW semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first. Then it proceeds to allocate-and-insert (we'll split this up further later on) uprobe instances, as necessary. And then the last step is the actual uprobe registration for all affected VMAs.
We take care to undo all the actions in the event of an error at any point in this lengthy process, so end result is all-or-nothing, as described above. Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 17 kernel/events/uprobes.c | 181 +--- 2 files changed, 147 insertions(+), 51 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index a75ba37ce3c8..a6e6eb70539d 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -33,6 +33,8 @@ enum uprobe_filter_ctx { UPROBE_FILTER_MMAP, }; +typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx); + struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, @@ -48,6 +50,8 @@ struct uprobe_consumer { loff_t ref_ctr_offset; /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; + /* associated uprobe instance (or NULL if consumer isn't attached) */ + struct uprobe *uprobe; }; #ifdef CONFIG_UPROBES @@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); +extern int uprobe_register_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); +extern void uprobe_unregister_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); 
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { @@ -169,6 +182,10 @@ static inline void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } +static inline void uprobe_unregister_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ +} static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 2544e8b79bad..846efda614cb 100644
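The uprobe_consumer_fn callback lets callers hand consumers to the batch APIs regardless of how they store them (plain array, array of pointers, per-entry containers, etc.), without the API imposing a layout. A hypothetical userspace mock of this calling convention — the types are simplified stand-ins, not the real kernel structures:

```c
#include <stddef.h>

/* simplified stand-ins for the kernel types */
struct uprobe;  /* opaque here */

struct uprobe_consumer {
    long offset;
    struct uprobe *uprobe;  /* back link, set by the subsystem on attach */
};

typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);

/* adapter for consumers kept in a plain array */
static struct uprobe_consumer *array_get_consumer(size_t idx, void *ctx)
{
    return &((struct uprobe_consumer *)ctx)[idx];
}

/* mock "batch register": just visits every consumer it is handed */
static int mock_register_batch(int cnt, uprobe_consumer_fn get, void *ctx)
{
    int seen = 0;

    for (int i = 0; i < cnt; i++)
        if (get(i, ctx))
            seen++;
    return seen;
}
```

A caller with a different storage scheme only needs a different adapter function; the batch implementation iterates through indices 0..cnt-1 and never sees the storage layout.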
[PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
To allow unbundling alloc-uprobe-and-insert step which is currently tightly coupled, inline alloc_uprobe() logic into uprobe_register_batch() loop. It's called from one place, so we don't really lose much in terms of maintainability. No functional changes. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 65 ++--- 1 file changed, 28 insertions(+), 37 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 846efda614cb..ebd8511b6eb2 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -842,40 +842,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) (unsigned long long) uprobe->ref_ctr_offset); } -static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset) -{ - struct uprobe *uprobe, *cur_uprobe; - - uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); - if (!uprobe) - return ERR_PTR(-ENOMEM); - - uprobe->inode = inode; - uprobe->offset = offset; - uprobe->ref_ctr_offset = ref_ctr_offset; - init_rwsem(>register_rwsem); - init_rwsem(>consumer_rwsem); - RB_CLEAR_NODE(>rb_node); - atomic64_set(>ref, 1); - - /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); - /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe != uprobe) { - if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { - ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); - kfree(uprobe); - return ERR_PTR(-EINVAL); - } - kfree(uprobe); - uprobe = cur_uprobe; - } - - return uprobe; -} - static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(>consumer_rwsem); @@ -1305,14 +1271,39 @@ int uprobe_register_batch(struct inode *inode, int cnt, } for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + uc = get_uprobe_consumer(i, ctx); - uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset); - if (IS_ERR(uprobe)) { - ret = PTR_ERR(uprobe); + uprobe = kzalloc(sizeof(struct uprobe), 
GFP_KERNEL); + if (!uprobe) { + ret = -ENOMEM; goto cleanup_uprobes; } + uprobe->inode = inode; + uprobe->offset = uc->offset; + uprobe->ref_ctr_offset = uc->ref_ctr_offset; + init_rwsem(>register_rwsem); + init_rwsem(>consumer_rwsem); + RB_CLEAR_NODE(>rb_node); + atomic64_set(>ref, 1); + + /* add to uprobes_tree, sorted on inode:offset */ + cur_uprobe = insert_uprobe(uprobe); + /* a uprobe exists for this inode:offset combination */ + if (cur_uprobe != uprobe) { + if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { + ref_ctr_mismatch_warn(cur_uprobe, uprobe); + put_uprobe(cur_uprobe); + kfree(uprobe); + ret = -EINVAL; + goto cleanup_uprobes; + } + kfree(uprobe); + uprobe = cur_uprobe; + } + uc->uprobe = uprobe; } -- 2.43.0
[PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
Simplify uprobe registration/unregistration interfaces by making offset and ref_ctr_offset part of uprobe_consumer "interface". In practice, all existing users already store these fields somewhere in uprobe_consumer's containing structure, so this doesn't pose any problem. We just move some fields around. On the other hand, this simplifies uprobe_register() and uprobe_unregister() API by having only struct uprobe_consumer as one thing representing attachment/detachment entity. This makes batched versions of uprobe_register() and uprobe_unregister() simpler. This also makes uprobe_register_refctr() unnecessary, so remove it and simplify consumers. No functional changes intended. Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 18 +++ kernel/events/uprobes.c | 19 ++- kernel/trace/bpf_trace.c | 21 +++- kernel/trace/trace_uprobe.c | 53 --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 23 5 files changed, 55 insertions(+), 79 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..a75ba37ce3c8 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -42,6 +42,11 @@ struct uprobe_consumer { enum uprobe_filter_ctx ctx, struct mm_struct *mm); + /* associated file offset of this probe */ + loff_t offset; + /* associated refctr file offset of this probe, or zero */ + loff_t ref_ctr_offset; + /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; }; @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct 
uprobe_consumer *uc); +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ -152,11 +156,7 @@ static inline void uprobes_init(void) #define uprobe_get_trap_addr(regs) instruction_pointer(regs) static inline int -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - return -ENOSYS; -} -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo return -ENOSYS; } static inline void -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8ce669bc6474..2544e8b79bad 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1197,14 +1197,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) /* * uprobe_unregister - unregister an already registered probe. * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. + * @uc: identify which probe consumer to unregister. 
*/ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { struct uprobe *uprobe; - uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe(inode, uc->offset); if (WARN_ON(!uprobe)) return; @@ -1277,20 +1276,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset, return ret; } -int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) +int uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { - return __uprobe_register(inode, offset, 0, uc); + return __uprobe_register(inode, uc->offset, uc->
[PATCH 03/12] uprobes: simplify error handling for alloc_uprobe()
Return -ENOMEM instead of NULL, which makes the caller's error handling just a touch simpler.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e896eeecb091..aa59fa53ae67 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,

        uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
        if (!uprobe)
-               return NULL;
+               return ERR_PTR(-ENOMEM);

        uprobe->inode = inode;
        uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset,

  retry:
        uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-       if (!uprobe)
-               return -ENOMEM;
        if (IS_ERR(uprobe))
                return PTR_ERR(uprobe);
-- 
2.43.0
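The ERR_PTR()/IS_ERR()/PTR_ERR() trio used in this patch is a standard kernel idiom: a negative errno is encoded into the (invalid) top page of pointer values, so a function can report either a valid pointer or an error through a single return value. A userspace approximation of the idea (constants and helper names mirror the kernel's, but alloc_obj() is a made-up example):

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* encode a negative errno as an (invalid) pointer value */
static inline void *ERR_PTR(long error)
{
    return (void *)error;
}

/* recover the errno from an error pointer */
static inline long PTR_ERR(const void *ptr)
{
    return (long)ptr;
}

/* error pointers live in the top MAX_ERRNO bytes of the address space */
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* allocation helper in the style of alloc_uprobe() after this patch */
static void *alloc_obj(size_t size)
{
    void *obj = calloc(1, size);

    if (!obj)
        return ERR_PTR(-ENOMEM);  /* error encoded in the pointer */
    return obj;
}
```

This is exactly why the caller in the patch can drop its NULL check: every failure mode now comes back through the one IS_ERR()/PTR_ERR() path.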
[PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
note the specific values of the UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that a single atomic64_t is actually two sort-of-independent 32-bit counters that are incremented/decremented with a single atomic_add_and_return() operation. Note also a small and extremely rare (and thus having no effect on performance) need to clear the highest bit every 2 billion get/put operations, to prevent the high 32-bit counter from "bleeding over" into the lower 32-bit counter.

All the above contained trickery aside, we end up with nice semantics for the get and put operations, where get always succeeds and put handles all the races properly and transparently to the caller.

And just to justify this somewhat unorthodox refcounting approach: under a uprobe triggering micro-benchmark (using BPF selftests' bench tool) with 8 triggering threads, the atomic_inc_not_zero() approach was producing about 3.3 million uprobe triggerings per second across all threads, while the final atomic_add_and_return()-based approach managed 3.6 million per second under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for the atomic_inc_not_zero() approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for the atomic_add_and_return() approach implemented by this patch.

So the CPU spends relatively more time in get/put operations while delivering less total throughput when using atomic_inc_not_zero(). And this will be even more prominent once we optimize away uprobe->register_rwsem in subsequent patch sets. So, while slightly less straightforward, the current approach seems to be clearly winning and justified.
We also rename get_uprobe() to __get_uprobe() to indicate it's a delicate internal helper that is only safe to call under valid circumstances:
  - while holding uprobes_treelock (to synchronize with the exclusive write lock in put_uprobe(), as described above);
  - or if we have a guarantee that the uprobe's refcount is already positive through the caller holding at least one refcount (in this case there is no risk of the refcount dropping to zero from any other CPU).

We also document why it's safe to do an unconditional __get_uprobe() at all call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 231 +---
 1 file changed, 167 insertions(+), 64 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index aa59fa53ae67..8ce669bc6474 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,7 +53,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);

 struct uprobe {
        struct rb_node          rb_node;        /* node in the rb tree */
-       refcount_t              ref;
+       atomic64_t              ref;            /* see UPROBE_REFCNT_GET below */
        struct rw_semaphore     register_rwsem;
        struct rw_semaphore     consumer_rwsem;
        struct list_head        pending_list;
@@ -587,15 +587,114 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
                        *(uprobe_opcode_t *)&auprobe->insn);
 }

-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located in
+ * a single u64 value:
+ *   - lower 32 bits are just a normal refcount which is incremented and
+ *     decremented on get and put, respectively, just like a normal refcount
+ *     would be;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *     incremented by one, no matter whether a get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU meanwhile continues with a 0 -> 1 -> 0 refcnt
+ *     sequence, also proceeding to "destruction".
+ *
+ * In both cases the refcount drops to zero, but in one case it will have
+ * epoch N, while the second drop to zero will have a different epoch N + 2,
+ * allowing the first destructor to bail out because the epoch changed between
+ * the refcount going to zero and put_uprobe() taking uprobes_treelock (under
+ * which the overall 64-bit refcount is double-checked; see put_uprobe() for
+ * details).
+ *
+ * The lower 32-bit counter is not meant to overflow, while it's expected that
+ * the upper 32-bit counter will overflow occasionally. Note, though, that we
+ * can't allow the upper 32-bit counter to "bleed over" into the lower 32-bit
+ * counter, so whenever the epoch counter gets its highest bit set to 1,
+ * __get_uprobe() and put_uprobe() will attempt to clear that upper bit with
+ * cmpxchg(). This makes epoch effectively a 31-bit counter with highe
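The get/put arithmetic described above can be modeled in userspace. This is a simplified illustration of the epoch trick only — no tree locking, no highest-bit clearing, and the constant names merely mirror the kernel's:

```c
#include <stdatomic.h>
#include <stdint.h>

/* epoch +1 in the upper 32 bits; count +1 / -1 in the lower 32 bits */
#define REFCNT_GET 0x0000000100000001LL
/* adding this borrows 1 from the count and (for count >= 1) carries 1 into the epoch */
#define REFCNT_PUT 0x00000000ffffffffLL

static _Atomic int64_t ref = 1;  /* epoch 0, count 1 */

static int64_t ref_get(void)
{
    return atomic_fetch_add(&ref, REFCNT_GET) + REFCNT_GET;
}

/* returns the new value; (uint32_t)v == 0 means this put dropped the count to zero */
static int64_t ref_put(void)
{
    return atomic_fetch_add(&ref, REFCNT_PUT) + REFCNT_PUT;
}

/* destructor double-check: only destroy if no get/put raced in between */
static int may_destroy(int64_t v)
{
    return atomic_load(&ref) == v;
}
```

Walking through the race described in the comment: the first put sees count 0 at epoch 1; a racing get/put pair moves the value to count 0 at epoch 3; the first destructor's double-check then fails because the epoch moved on, and it bails out.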
[PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
Given that unapply_uprobe() can call remove_breakpoint(), which eventually calls uprobe_write_opcode(), which can modify a set of memory pages and expects mm->mmap_lock to be held for write, unapply_uprobe() needs to take the writer lock. Fix this by switching to mmap_write_lock()/mmap_write_unlock().

Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed task")
Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 197fbe4663b5..e896eeecb091 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
        struct vm_area_struct *vma;
        int err = 0;

-       mmap_read_lock(mm);
+       mmap_write_lock(mm);
        for_each_vma(vmi, vma) {
                unsigned long vaddr;
                loff_t offset;
@@ -1252,7 +1252,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
                vaddr = offset_to_vaddr(vma, uprobe->offset);
                err |= remove_breakpoint(uprobe, mm, vaddr);
        }
-       mmap_read_unlock(mm);
+       mmap_write_unlock(mm);

        return err;
 }
-- 
2.43.0
[PATCH 01/12] uprobes: update outdated comment
There is no task_struct passed into get_user_pages_remote() anymore, so drop the parts of the comment mentioning a NULL tsk; it's just confusing at this point.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2816e65729ac..197fbe4663b5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
                goto out;

        /*
-        * The NULL 'tsk' here ensures that any faults that occur here
-        * will not be accounted to the task.  'mm' *is* current->mm,
-        * but we treat this as a 'remote' access since it is
-        * essentially a kernel access to the memory.
+        * 'mm' *is* current->mm, but we treat this as a 'remote' access since
+        * it is essentially a kernel access to the memory.
         */
        result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
        if (result < 0)
-- 
2.43.0
[PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
This patch set, ultimately, switches the global uprobes_treelock from an RW spinlock to a per-CPU RW semaphore, which has better performance and scales better under contention with multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF multi-uprobe), we need to add batched versions of the uprobe register/unregister APIs. This is what most of the patch set is actually doing. The actual switch to the per-CPU RW semaphore is trivial after that and is done in the very last patch #12. See its commit message for some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe lifetime management and refcounting. See the patch description and added code comments for all the details.

With the changes in patch #4, we open up the way to refactor the uprobe_register() and uprobe_unregister() implementations in such a way that we can avoid taking uprobes_treelock many times during a single batched attachment/detachment. This allows us to accommodate a much higher latency of taking the per-CPU RW semaphore for write. The end result of this patch set is that attaching 50 thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the necessary pieces into struct uprobe_consumer, without having to pass around offset/ref_ctr_offset. Existing consumers already keep this data around; we just formalize the interface.

Patches #6 through #10 add batched versions of the register/unregister APIs and gradually refactor them in such a way as to allow taking a single (batched) uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switches BPF multi-uprobes to the batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep work and just switches uprobes_treelock to a per-CPU RW semaphore.
Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: grab write mmap lock in unapply_uprobe()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h                              |  29 +-
 kernel/events/uprobes.c                              | 522 --
 kernel/trace/bpf_trace.c                             |  40 +-
 kernel/trace/trace_uprobe.c                          |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c          |  23 +-
 5 files changed, 419 insertions(+), 248 deletions(-)

-- 
2.43.0
Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Mon, Jun 17, 2024 at 3:37 PM Andrii Nakryiko wrote: > > On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko > wrote: > > > > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu wrote: > > > > > > On Tue, 21 May 2024 18:38:43 -0700 > > > Andrii Nakryiko wrote: > > > > > > > When kernel has pending uretprobes installed, it hijacks original user > > > > function return address on the stack with a uretprobe trampoline > > > > address. There could be multiple such pending uretprobes (either on > > > > different user functions or on the same recursive one) at any given > > > > time within the same task. > > > > > > > > This approach interferes with the user stack trace capture logic, which > > > > would report suprising addresses (like 0x7fffe000) that correspond > > > > to a special "[uprobes]" section that kernel installs in the target > > > > process address space for uretprobe trampoline code, while logically it > > > > should be an address somewhere within the calling function of another > > > > traced user function. > > > > > > > > This is easy to correct for, though. Uprobes subsystem keeps track of > > > > pending uretprobes and records original return addresses. This patch is > > > > using this to do a post-processing step and restore each trampoline > > > > address entries with correct original return address. This is done only > > > > if there are pending uretprobes for current task. > > > > > > > > This is a similar approach to what fprobe/kretprobe infrastructure is > > > > doing when capturing kernel stack traces in the presence of pending > > > > return probes. > > > > > > > > > > This looks good to me because this trampoline information is only > > > managed in uprobes. And it should be provided when unwinding user > > > stack. > > > > > > Reviewed-by: Masami Hiramatsu (Google) > > > > > > Thank you! > > > > Great, thanks for reviewing, Masami! > > > > Would you take this fix through your tree, or where should it be routed to? > > > > Ping! 
What would you like me to do with this patch set? Should I > resend it without patch 3 (the one that tries to guess whether we are > at the entry to the function?), or did I manage to convince you that > this heuristic is OK, given perf's stack trace capturing logic already > makes heavy assumption of rbp register being used for frame pointer? > > Please let me know your preference, I could drop patch 3 and send it > separately, if that helps move the main fix forward. Thanks! Masami, Another week went by with absolutely no action or reaction from you. Is there any way I can help improve the collaboration here? I'm preparing more patches for uprobes and about to submit them. If each reviewed and ready to be applied patch set has to sit idle for multiple weeks for no good reason, we all will soon be lost just plain forgetting the context in which the patch was prepared. Please, prioritize handling patches that are meant to be routed through your tree in a more timely fashion. Or propose some alternative acceptable arrangement. Thank you. > > > > > > > > Reported-by: Riham Selim > > > > Signed-off-by: Andrii Nakryiko > > > > --- > > > > kernel/events/callchain.c | 43 ++- > > > > kernel/events/uprobes.c | 9 > > > > 2 files changed, 51 insertions(+), 1 deletion(-) > > > > > > > > [...]
Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko wrote: > > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu wrote: > > > > On Tue, 21 May 2024 18:38:43 -0700 > > Andrii Nakryiko wrote: > > > > > When kernel has pending uretprobes installed, it hijacks original user > > > function return address on the stack with a uretprobe trampoline > > > address. There could be multiple such pending uretprobes (either on > > > different user functions or on the same recursive one) at any given > > > time within the same task. > > > > > > This approach interferes with the user stack trace capture logic, which > > > would report surprising addresses (like 0x7fffe000) that correspond > > > to a special "[uprobes]" section that kernel installs in the target > > > process address space for uretprobe trampoline code, while logically it > > > should be an address somewhere within the calling function of another > > > traced user function. > > > > > > This is easy to correct for, though. Uprobes subsystem keeps track of > > > pending uretprobes and records original return addresses. This patch is > > > using this to do a post-processing step and restore each trampoline > > > address entry with the correct original return address. This is done only > > > if there are pending uretprobes for the current task. > > > > > > This is a similar approach to what fprobe/kretprobe infrastructure is > > > doing when capturing kernel stack traces in the presence of pending > > > return probes. > > > > > > > This looks good to me because this trampoline information is only > > managed in uprobes. And it should be provided when unwinding user > > stack. > > > > Reviewed-by: Masami Hiramatsu (Google) > > > > Thank you! > > Great, thanks for reviewing, Masami! > > Would you take this fix through your tree, or where should it be routed to? > Ping! What would you like me to do with this patch set? 
Should I resend it without patch 3 (the one that tries to guess whether we are at the entry to the function?), or did I manage to convince you that this heuristic is OK, given perf's stack trace capturing logic already makes heavy assumption of rbp register being used for frame pointer? Please let me know your preference, I could drop patch 3 and send it separately, if that helps move the main fix forward. Thanks! > > > > > Reported-by: Riham Selim > > > Signed-off-by: Andrii Nakryiko > > > --- > > > kernel/events/callchain.c | 43 ++- > > > kernel/events/uprobes.c | 9 > > > 2 files changed, 51 insertions(+), 1 deletion(-) > > > > > [...]
Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu wrote: > > On Tue, 21 May 2024 18:38:43 -0700 > Andrii Nakryiko wrote: > > > When kernel has pending uretprobes installed, it hijacks original user > > function return address on the stack with a uretprobe trampoline > > address. There could be multiple such pending uretprobes (either on > > different user functions or on the same recursive one) at any given > > time within the same task. > > > > This approach interferes with the user stack trace capture logic, which > > would report surprising addresses (like 0x7fffe000) that correspond > > to a special "[uprobes]" section that kernel installs in the target > > process address space for uretprobe trampoline code, while logically it > > should be an address somewhere within the calling function of another > > traced user function. > > > > This is easy to correct for, though. Uprobes subsystem keeps track of > > pending uretprobes and records original return addresses. This patch is > > using this to do a post-processing step and restore each trampoline > > address entry with the correct original return address. This is done only > > if there are pending uretprobes for the current task. > > > > This is a similar approach to what fprobe/kretprobe infrastructure is > > doing when capturing kernel stack traces in the presence of pending > > return probes. > > > > This looks good to me because this trampoline information is only > managed in uprobes. And it should be provided when unwinding user > stack. > > Reviewed-by: Masami Hiramatsu (Google) > > Thank you! Great, thanks for reviewing, Masami! Would you take this fix through your tree, or where should it be routed to? > > > Reported-by: Riham Selim > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/events/callchain.c | 43 ++- > > kernel/events/uprobes.c | 9 > > 2 files changed, 51 insertions(+), 1 deletion(-) > > [...]
Re: [PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe
On Tue, Jun 4, 2024 at 7:06 AM Masami Hiramatsu wrote: > > On Tue, 21 May 2024 18:38:44 -0700 > Andrii Nakryiko wrote: > > > When tracing user functions with uprobe functionality, it's common to > > install the probe (e.g., a BPF program) at the first instruction of the > > function. This is often going to be the `push %rbp` instruction in function > > preamble, which means that within that function frame pointer hasn't > > been established yet. This leads to consistently missing an actual > > caller of the traced function, because perf_callchain_user() only > > records current IP (capturing traced function) and then following frame > > pointer chain (which would be caller's frame, containing the address of > > caller's caller). > > I thought this problem might be solved by sframe. Eventually, yes, when real-world applications switch to sframe and we get proper support for it in the kernel. But right now there are tons of applications relying on kernel capturing stack traces based on frame pointers, so it would be good to improve that as well. > > > > > So when we have target_1 -> target_2 -> target_3 call chain and we are > > tracing an entry to target_3, captured stack trace will report > > target_1 -> target_3 call chain, which is wrong and confusing. > > > > This patch proposes an x86-64-specific heuristic to detect `push %rbp` > > instruction being traced. > > I like this kind of idea :) But I think this should be done in > the user-space, not in the kernel because it is not always sure > that the user program uses stack frames. Existing kernel code that captures user space stack trace already assumes that code was compiled with a frame pointer (unconditionally), because that's the best kernel can do. So under that assumption this heuristic is valid and not harmful, IMO. User space can do nothing about this, because it is the kernel that captures stack trace (e.g., from BPF program), and we will lose the calling frame if we don't do it here. 
> > If that's the case, with the assumption that > > application is compiled with frame pointers, this instruction would be > > a strong indicator that this is the entry to the function. In that case, > > return address is still pointed to by %rsp, so we fetch it and add to > > stack trace before proceeding to unwind the rest using frame > > pointer-based logic. > > Why don't we make it in the userspace BPF program? If it is done > in the user space, like perf-probe, I'm OK. But I doubt to do this in > kernel. That means it is not flexible. > You mean for the BPF program to capture the entire stack trace by itself, without asking the kernel for help? It's doable, but: a) it's inconvenient for all users to have to reimplement this low-level "primitive" operation, that we already promise is provided by kernel (through bpf_get_stack() API, and kernel has internal perf_callchain API for this) b) it's faster for kernel to do this, as kernel code disables page faults once and unwinds the stack, while BPF program would have to do multiple bpf_probe_read_user() calls, each individually disabling page faults. But really, there is an already existing API, which in some cases returns partially invalid stack traces (skipping function caller's frame), so this is an attempt to fix this issue. > More than anything, without user-space helper to find function > symbols, uprobe does not know the function entry. Then I'm curious > why don't you do this in the user space. You are mixing stack *capture* (in which we get memory addresses representing where a function call or currently running instruction pointer is) with stack *symbolization* (where user space needs ELF symbols and/or DWARF information to translate those addresses into something human-readable). This heuristic improves the former as performed by the kernel. Stack symbolization is completely orthogonal to all of this. > > At least, this should be done in the user of uprobes, like trace_uprobe > or bpf. 
> This is a really miserable user experience, if they have to implement their own stack trace capture for uprobes, but use built-in bpf_get_stack() API for any other type of program. > > Thank you, > > > > > Signed-off-by: Andrii Nakryiko > > --- > > arch/x86/events/core.c | 20 > > include/linux/uprobes.h | 2 ++ > > kernel/events/uprobes.c | 2 ++ > > 3 files changed, 24 insertions(+) > > [...]
Re: [PATCH v2 0/4] Fix user stack traces captured from uprobes
On Tue, May 21, 2024 at 6:38 PM Andrii Nakryiko wrote: > > This patch set reports two issues with captured stack traces. > > First issue, fixed in patch #2, deals with fixing up uretprobe trampoline > addresses in captured stack trace. This issue happens when there are pending > return probes, for which kernel hijacks some of the return addresses on user > stacks. The code matches those special uretprobe trampoline addresses against > the list of pending return probe instances and replaces them with actual > return addresses. This is the same fixup logic that fprobe/kretprobe has for > kernel stack traces. > > Second issue, which patch #3 is fixing with the help of a heuristic, has > to do with capturing user stack traces in entry uprobes. At the very entrance > to user function, frame pointer in rbp register is not yet setup, so actual > caller return address is still pointed to by rsp. Patch is using a simple > heuristic, looking for `push %rbp` instruction, to fetch this extra direct > caller return address, before proceeding to unwind the stack using rbp. > > Patch #4 adds tests into BPF selftests that validate that captured stack > traces at various points are what we expect to get. This patch, while being BPF > selftests, is isolated from any other BPF selftests changes and can go in > through non-BPF tree without the risk of merge conflicts. > > Patches are based on latest linux-trace/probes/for-next. > > v1->v2: > - fixed GCC aggressively inlining test_uretprobe_stack() function (BPF CI); > - fixed comments (Peter). 
> > Andrii Nakryiko (4): > uprobes: rename get_trampoline_vaddr() and make it global > perf,uprobes: fix user stack traces in the presence of pending > uretprobes > perf,x86: avoid missing caller address in stack traces captured in > uprobe > selftests/bpf: add test validating uprobe/uretprobe stack traces > > arch/x86/events/core.c| 20 ++ > include/linux/uprobes.h | 3 + > kernel/events/callchain.c | 43 +++- > kernel/events/uprobes.c | 17 +- > .../bpf/prog_tests/uretprobe_stack.c | 186 ++ > .../selftests/bpf/progs/uretprobe_stack.c | 96 + > 6 files changed, 361 insertions(+), 4 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c > create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c > > -- > 2.43.0 > Friendly ping. This is a real issue in practice that our production users are eager to be fixed, please help to get it upstream. Thanks!
[PATCH v2 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces
Add a set of tests to validate that stack traces captured from or in the presence of active uprobes and uretprobes are valid and complete. For this we use BPF programs that are installed either on entry or exit of user functions, plus deep-nested USDT. One of the target functions (target_1) is recursive to generate two different entries in the stack trace for the same uprobe/uretprobe, testing potential edge conditions. Without fixes in this patch set, we get something like this for one of the scenarios: caller: 0x758fff - 0x7595ab target_1: 0x758fd5 - 0x758fff target_2: 0x758fca - 0x758fd5 target_3: 0x758fbf - 0x758fca target_4: 0x758fb3 - 0x758fbf ENTRY #0: 0x758fb3 (in target_4) ENTRY #1: 0x758fd3 (in target_2) ENTRY #2: 0x758ffd (in target_1) ENTRY #3: 0x7fffe000 ENTRY #4: 0x7fffe000 ENTRY #5: 0x6f8f39 ENTRY #6: 0x6fa6f0 ENTRY #7: 0x7f403f229590 Entries #3 and #4 (0x7fffe000) are uretprobe trampoline addresses which obscure the two actual target_1 invocations. Also note that between entry #0 and entry #1 we are missing an entry for target_3, which is fixed in patch #3. With all the fixes, we get desired full stack traces: caller: 0x758fff - 0x7595ab target_1: 0x758fd5 - 0x758fff target_2: 0x758fca - 0x758fd5 target_3: 0x758fbf - 0x758fca target_4: 0x758fb3 - 0x758fbf ENTRY #0: 0x758fb7 (in target_4) ENTRY #1: 0x758fc8 (in target_3) ENTRY #2: 0x758fd3 (in target_2) ENTRY #3: 0x758ffd (in target_1) ENTRY #4: 0x758ff3 (in target_1) ENTRY #5: 0x75922c (in caller) ENTRY #6: 0x6f8f39 ENTRY #7: 0x6fa6f0 ENTRY #8: 0x7f986adc4cd0 Now there is a logical and complete sequence of function calls. 
Signed-off-by: Andrii Nakryiko --- .../bpf/prog_tests/uretprobe_stack.c | 186 ++ .../selftests/bpf/progs/uretprobe_stack.c | 96 + 2 files changed, 282 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c diff --git a/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c new file mode 100644 index ..6deb8d560ddd --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c @@ -0,0 +1,186 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ + +#include +#include "uretprobe_stack.skel.h" +#include "../sdt.h" + +/* We set up target_1() -> target_2() -> target_3() -> target_4() -> USDT() + * call chain, each being traced by our BPF program. On entry or return from + * each target_*() we are capturing user stack trace and recording it in + * global variable, so that user space part of the test can validate it. 
+ * + * Note, we put each target function into a custom section to get those + * __start_XXX/__stop_XXX symbols, generated by linker for us, which allow us + * to know address range of those functions + */ +__attribute__((section("uprobe__target_4"))) +__weak int target_4(void) +{ + STAP_PROBE1(uretprobe_stack, target, 42); + return 42; +} + +extern const void *__start_uprobe__target_4; +extern const void *__stop_uprobe__target_4; + +__attribute__((section("uprobe__target_3"))) +__weak int target_3(void) +{ + return target_4(); +} + +extern const void *__start_uprobe__target_3; +extern const void *__stop_uprobe__target_3; + +__attribute__((section("uprobe__target_2"))) +__weak int target_2(void) +{ + return target_3(); +} + +extern const void *__start_uprobe__target_2; +extern const void *__stop_uprobe__target_2; + +__attribute__((section("uprobe__target_1"))) +__weak int target_1(int depth) +{ + if (depth < 1) + return 1 + target_1(depth + 1); + else + return target_2(); +} + +extern const void *__start_uprobe__target_1; +extern const void *__stop_uprobe__target_1; + +extern const void *__start_uretprobe_stack_sec; +extern const void *__stop_uretprobe_stack_sec; + +struct range { + long start; + long stop; +}; + +static struct range targets[] = { + {}, /* we want target_1 to map to target[1], so need 1-based indexing */ + { (long)&__start_uprobe__target_1, (long)&__stop_uprobe__target_1 }, + { (long)&__start_uprobe__target_2, (long)&__stop_uprobe__target_2 }, + { (long)&__start_uprobe__target_3, (long)&__stop_uprobe__target_3 }, + { (long)&__start_uprobe__target_4, (long)&__stop_uprobe__target_4 }, +}; + +static struct range caller = { + (long)&__start_uretprobe_stack_sec, + (long)&__stop_uretprobe_stack_sec, +}; + +static void validate_stack(__u64 *ips, int stack_len, int cnt, ...) +{ + int i, j; + va_list args; + + if (!ASSERT_GT(stack_len, 0, "stack_len")) + return; + + stack_len /= 8; +
[PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be the `push %rbp` instruction in function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing an actual caller of the traced function, because perf_callchain_user() only records current IP (capturing traced function) and then following frame pointer chain (which would be caller's frame, containing the address of caller's caller). So when we have target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, captured stack trace will report target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes an x86-64-specific heuristic to detect `push %rbp` instruction being traced. If that's the case, with the assumption that application is compiled with frame pointers, this instruction would be a strong indicator that this is the entry to the function. In that case, return address is still pointed to by %rsp, so we fetch it and add to stack trace before proceeding to unwind the rest using frame pointer-based logic. 
Signed-off-by: Andrii Nakryiko --- arch/x86/events/core.c | 20 include/linux/uprobes.h | 2 ++ kernel/events/uprobes.c | 2 ++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 5b0dd07b1ef1..82d5570b58ff 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs return; pagefault_disable(); + +#ifdef CONFIG_UPROBES + /* +* If we are called from uprobe handler, and we are indeed at the very +* entry to user function (which is normally a `push %rbp` instruction, +* under assumption of application being compiled with frame pointers), +* we should read return address from *regs->sp before proceeding +* to follow frame pointers, otherwise we'll skip immediate caller +* as %rbp is not yet setup. +*/ + if (current->utask) { + struct arch_uprobe *auprobe = current->utask->auprobe; + u64 ret_addr; + + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && + !__get_user(ret_addr, (const u64 __user *)regs->sp)) + perf_callchain_store(entry, ret_addr); + } +#endif + while (entry->nr < entry->max_stack) { if (!valid_user_frame(fp, sizeof(frame))) break; diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0c57eec85339..7b785cd30d86 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -76,6 +76,8 @@ struct uprobe_task { struct uprobe *active_uprobe; unsigned long xol_vaddr; + struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned intdepth; }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 1c99380dc89d..504693845187 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) bool need_prep = false; /* prepare return uprobe, when needed */ down_read(>register_rwsem); + current->utask->auprobe = >arch; for (uc = uprobe->consumers; uc; uc = 
uc->next) { int rc = 0; @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) remove &= rc; } + current->utask->auprobe = NULL; if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */ -- 2.43.0
[PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
When kernel has pending uretprobes installed, it hijacks original user function return address on the stack with a uretprobe trampoline address. There could be multiple such pending uretprobes (either on different user functions or on the same recursive one) at any given time within the same task. This approach interferes with the user stack trace capture logic, which would report surprising addresses (like 0x7fffe000) that correspond to a special "[uprobes]" section that kernel installs in the target process address space for uretprobe trampoline code, while logically it should be an address somewhere within the calling function of another traced user function. This is easy to correct for, though. Uprobes subsystem keeps track of pending uretprobes and records original return addresses. This patch is using this to do a post-processing step and restore each trampoline address entry with the correct original return address. This is done only if there are pending uretprobes for the current task. This is a similar approach to what fprobe/kretprobe infrastructure is doing when capturing kernel stack traces in the presence of pending return probes. 
Reported-by: Riham Selim Signed-off-by: Andrii Nakryiko --- kernel/events/callchain.c | 43 ++- kernel/events/uprobes.c | 9 2 files changed, 51 insertions(+), 1 deletion(-) diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index 1273be84392c..b17e3323f7f6 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "internal.h" @@ -176,13 +177,51 @@ put_callchain_entry(int rctx) put_recursion_context(this_cpu_ptr(callchain_recursion), rctx); } +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry, + int start_entry_idx) +{ +#ifdef CONFIG_UPROBES + struct uprobe_task *utask = current->utask; + struct return_instance *ri; + __u64 *cur_ip, *last_ip, tramp_addr; + + if (likely(!utask || !utask->return_instances)) + return; + + cur_ip = >ip[start_entry_idx]; + last_ip = >ip[entry->nr - 1]; + ri = utask->return_instances; + tramp_addr = uprobe_get_trampoline_vaddr(); + + /* +* If there are pending uretprobes for the current thread, they are +* recorded in a list inside utask->return_instances; each such +* pending uretprobe replaces traced user function's return address on +* the stack, so when stack trace is captured, instead of seeing +* actual function's return address, we'll have one or many uretprobe +* trampoline addresses in the stack trace, which are not helpful and +* misleading to users. +* So here we go over the pending list of uretprobes, and each +* encountered trampoline address is replaced with actual return +* address. 
+*/ + while (ri && cur_ip <= last_ip) { + if (*cur_ip == tramp_addr) { + *cur_ip = ri->orig_ret_vaddr; + ri = ri->next; + } + cur_ip++; + } +#endif +} + struct perf_callchain_entry * get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark) { struct perf_callchain_entry *entry; struct perf_callchain_entry_ctx ctx; - int rctx; + int rctx, start_entry_idx; entry = get_callchain_entry(); if (!entry) @@ -215,7 +254,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user, if (add_mark) perf_callchain_store_context(, PERF_CONTEXT_USER); + start_entry_idx = entry->nr; perf_callchain_user(, regs); + fixup_uretprobe_trampoline_entries(entry, start_entry_idx); } } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index d60d24f0f2f4..1c99380dc89d 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2149,6 +2149,15 @@ static void handle_trampoline(struct pt_regs *regs) instruction_pointer_set(regs, ri->orig_ret_vaddr); do { + /* pop current instance from the stack of pending return instances, +* as it's not pending anymore: we just fixed up original +* instruction pointer in regs and are about to call handlers; +* this allows fixup_uretprobe_trampoline_entries() to properly fix up +* captured stack traces from uretprobe handlers, in which pending +* trampoline addresses on the stac
[PATCH v2 1/4] uprobes: rename get_trampoline_vaddr() and make it global
This helper is needed in another file, so make it a bit more uniquely named and expose it internally. Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 1 + kernel/events/uprobes.c | 6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index f46e0ca0169c..0c57eec85339 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, void *src, unsigned long len); +extern unsigned long uprobe_get_trampoline_vaddr(void); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8ae0eefc3a34..d60d24f0f2f4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, unsigned long flags) * * Returns -1 in case the xol_area is not allocated. */ -static unsigned long get_trampoline_vaddr(void) +unsigned long uprobe_get_trampoline_vaddr(void) { struct xol_area *area; unsigned long trampoline_vaddr = -1; @@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) if (!ri) return; - trampoline_vaddr = get_trampoline_vaddr(); + trampoline_vaddr = uprobe_get_trampoline_vaddr(); orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs); if (orig_ret_vaddr == -1) goto fail; @@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs) int is_swbp; bp_vaddr = uprobe_get_swbp_addr(regs); - if (bp_vaddr == get_trampoline_vaddr()) + if (bp_vaddr == uprobe_get_trampoline_vaddr()) return handle_trampoline(regs); uprobe = find_active_uprobe(bp_vaddr, _swbp); -- 2.43.0
[PATCH v2 0/4] Fix user stack traces captured from uprobes
This patch set reports two issues with captured stack traces. First issue, fixed in patch #2, deals with fixing up uretprobe trampoline addresses in captured stack trace. This issue happens when there are pending return probes, for which kernel hijacks some of the return addresses on user stacks. The code matches those special uretprobe trampoline addresses against the list of pending return probe instances and replaces them with actual return addresses. This is the same fixup logic that fprobe/kretprobe has for kernel stack traces. Second issue, which patch #3 is fixing with the help of a heuristic, has to do with capturing user stack traces in entry uprobes. At the very entrance to user function, frame pointer in rbp register is not yet setup, so actual caller return address is still pointed to by rsp. Patch is using a simple heuristic, looking for `push %rbp` instruction, to fetch this extra direct caller return address, before proceeding to unwind the stack using rbp. Patch #4 adds tests into BPF selftests that validate that captured stack traces at various points are what we expect to get. This patch, while being BPF selftests, is isolated from any other BPF selftests changes and can go in through non-BPF tree without the risk of merge conflicts. Patches are based on latest linux-trace/probes/for-next. v1->v2: - fixed GCC aggressively inlining test_uretprobe_stack() function (BPF CI); - fixed comments (Peter). 
Andrii Nakryiko (4): uprobes: rename get_trampoline_vaddr() and make it global perf,uprobes: fix user stack traces in the presence of pending uretprobes perf,x86: avoid missing caller address in stack traces captured in uprobe selftests/bpf: add test validating uprobe/uretprobe stack traces arch/x86/events/core.c| 20 ++ include/linux/uprobes.h | 3 + kernel/events/callchain.c | 43 +++- kernel/events/uprobes.c | 17 +- .../bpf/prog_tests/uretprobe_stack.c | 186 ++ .../selftests/bpf/progs/uretprobe_stack.c | 96 + 6 files changed, 361 insertions(+), 4 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c -- 2.43.0
[PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it deeper into __uprobe_trace_func(). This is problematic because __uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock() block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() -> mutex_lock(>mutex), leading to a splat about using mutex under non-sleepable RCU: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 ... Call Trace: dump_stack_lvl+0x3d/0xe0 __might_resched+0x24c/0x270 ? prepare_uprobe_buffer+0xd5/0x1d0 __mutex_lock+0x41/0x820 ? ___perf_sw_event+0x206/0x290 ? __perf_event_task_sched_in+0x54/0x660 ? __perf_event_task_sched_in+0x54/0x660 prepare_uprobe_buffer+0xd5/0x1d0 __uprobe_trace_func+0x4a/0x140 uprobe_dispatcher+0x135/0x280 ? uprobe_dispatcher+0x94/0x280 uprobe_notify_resume+0x650/0xec0 ? atomic_notifier_call_chain+0x21/0x110 ? atomic_notifier_call_chain+0xf8/0x110 irqentry_exit_to_user_mode+0xe2/0x1e0 asm_exc_int3+0x35/0x40 RIP: 0033:0x7f7e1d4da390 Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e RSP: 002b:7ffd2abc3608 EFLAGS: 0246 RAX: RBX: 76d325f1 RCX: RDX: 76d325f1 RSI: 000a RDI: 7ffd2abc3690 RBP: 000a R08: 00017fb7 R09: 00017fb7 R10: 00017fb7 R11: 0246 R12: 00017ff2 R13: 7ffd2abc3610 R14: R15: 7ffd2abc3780 Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside of RCU locked section. This still keeps this buffer preparation lazy and helps avoid the overhead when it's not needed. E.g., if there is only BPF uprobe handler installed on a given uprobe, buffer won't be initialized. 
Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not affected, as it doesn't prepare buffer under RCU read lock. Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily") Reported-by: Breno Leitao Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_uprobe.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 8541fa1494ae..c98e3b3386ba 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -970,19 +970,17 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu, static void __uprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, - struct uprobe_cpu_buffer **ucbp, + struct uprobe_cpu_buffer *ucb, struct trace_event_file *trace_file) { struct uprobe_trace_entry_head *entry; struct trace_event_buffer fbuffer; - struct uprobe_cpu_buffer *ucb; void *data; int size, esize; struct trace_event_call *call = trace_probe_event_call(>tp); WARN_ON(call != trace_file->event_call); - ucb = prepare_uprobe_buffer(tu, regs, ucbp); if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE)) return; @@ -1014,13 +1012,16 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, struct uprobe_cpu_buffer **ucbp) { struct event_file_link *link; + struct uprobe_cpu_buffer *ucb; if (is_ret_probe(tu)) return 0; + ucb = prepare_uprobe_buffer(tu, regs, ucbp); + rcu_read_lock(); trace_probe_for_each_link_rcu(link, >tp) - __uprobe_trace_func(tu, 0, regs, ucbp, link->file); + __uprobe_trace_func(tu, 0, regs, ucb, link->file); rcu_read_unlock(); return 0; @@ -1031,10 +1032,13 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct uprobe_cpu_buffer **ucbp) { struct event_file_link *link; + struct uprobe_cpu_buffer *ucb; + + ucb = prepare_uprobe_buffer(tu, regs, ucbp); rcu_read_lock(); trace_probe_for_each_link_rcu(link, >tp) - __uprobe_trace_func(tu, func, regs, ucbp, 
link->file); + __uprobe_trace_func(tu, func, regs, ucb, link->file); rcu_read_unlock(); } -- 2.43.0
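The fix above follows a general pattern: do everything that may sleep (here, the mutex_lock() hidden inside buffer preparation) before entering the non-sleepable section. A userspace analogy of that pattern, with hypothetical names and pthreads standing in for the kernel mutex — a sketch, not the kernel implementation:

```c
#include <pthread.h>
#include <stdlib.h>

struct ucb { char data[64]; };

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* May block (lock contention, allocation): must be called *before*
 * entering any region where blocking is forbidden. */
static struct ucb *prepare_buffer(void)
{
	struct ucb *b;

	pthread_mutex_lock(&pool_lock);
	b = calloc(1, sizeof(*b));
	pthread_mutex_unlock(&pool_lock);
	return b;
}

/* Stand-in for the RCU-protected region: only touches the buffer that
 * was prepared up front, never blocks. */
static void trace_with_buffer(struct ucb *b)
{
	b->data[0] = 1;
}
```

The patch applies exactly this restructuring: prepare_uprobe_buffer() moves from __uprobe_trace_func() (inside rcu_read_lock()) up into its callers, while staying lazy enough to be skipped when no tracing consumer needs it.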
Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Mon, May 20, 2024 at 8:20 AM Jiri Olsa wrote: > > On Wed, May 15, 2024 at 08:32:30AM -0600, Andrii Nakryiko wrote: > > On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra wrote: > > > > > > On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote: > > > > > > > +static void fixup_uretprobe_trampoline_entries(struct > > > > perf_callchain_entry *entry, > > > > +int start_entry_idx) > > > > +{ > > > > +#ifdef CONFIG_UPROBES > > > > + struct uprobe_task *utask = current->utask; > > > > + struct return_instance *ri; > > > > + __u64 *cur_ip, *last_ip, tramp_addr; > > > > + > > > > + if (likely(!utask || !utask->return_instances)) > > > > + return; > > > > + > > > > + cur_ip = >ip[start_entry_idx]; > > > > + last_ip = >ip[entry->nr - 1]; > > > > + ri = utask->return_instances; > > > > + tramp_addr = uprobe_get_trampoline_vaddr(); > > > > + > > > > + /* If there are pending uretprobes for current thread, they are > > > > > > Comment style fail. Also 'for *the* current thread'. > > > > > > > ack, will fix > > > > > > + * recorded in a list inside utask->return_instances; each such > > > > + * pending uretprobe replaces traced user function's return > > > > address on > > > > + * the stack, so when stack trace is captured, instead of seeing > > > > + * actual function's return address, we'll have one or many > > > > uretprobe > > > > + * trampoline addresses in the stack trace, which are not helpful > > > > and > > > > + * misleading to users. > > > > > > I would beg to differ, what if the uprobe is causing the performance > > > issue? > > > > If uprobe/uretprobe code itself is causing performance issues, you'll > > see that in other stack traces, where this code will be actively > > running on CPU. I don't think we make anything worse here. 
> > I think we do similar thing in kernel unwind for rethook trampoline used > in fprobe/kretprobe code, so seems ok to me to do it for uprobes as well > > > > > Here we are talking about the case where the uprobe part is done and > > it hijacked the return address on the stack, uretprobe is not yet > > running (and so not causing any performance issues). The presence of > > this "snooping" (pending) uretprobe is irrelevant to the user that is > > capturing stack trace. Right now address in [uprobes] VMA section > > installed by uretprobe infra code is directly replacing correct and > > actual calling function address. > > > > Worst case, one can argue that both [uprobes] and original caller > > address should be in the stack trace, but I think it still will be > > confusing to users. And also will make implementation less efficient > > because now we'll need to insert entries into the array and shift > > everything around. > > agreed this would be confusing.. also as you noted above the return > trampoline did not get executed yet at the time of the callstack, > so it's bit misleading > > might be stupid idea.. but we do have the 'special' context entries > that we store in the callstack to mark user/kernel/guest context .. only when explicitly requested (add_mark argument to get_perf_callchain), right? BPF doesn't ever set this to true and generally speaking users don't care and shouldn't care about pending uretprobe. I think we are conflating unrelated things here, uretprobe is not running, so it's not really in the stack trace. I'd just do nothing about it, it should stay transparent. If uretprobe *handler* is causing issues, you'll see that in all the other stack traces (according to relative CPU/resource usage of that handler). > maybe we could add some special entry (context does not fit too well) > to point out there's uretprobe going on .. 
perf tool could print > 'uretprobe' hint when displaying the original address > > jirka > > > > > So as I mentioned above, if the concern is seeing uprobe/uretprobe > > code using CPU, that doesn't change, we'll see that in the overall set > > of captured stack traces (be it custom uprobe handler code or BPF > > program). > > > > > > > > While I do think it makes sense to fix the unwind in the sense that we > > > should be able to continue the unwind, I don't think it makes sense to > > > c
Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra wrote: > > On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote: > > > +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry > > *entry, > > +int start_entry_idx) > > +{ > > +#ifdef CONFIG_UPROBES > > + struct uprobe_task *utask = current->utask; > > + struct return_instance *ri; > > + __u64 *cur_ip, *last_ip, tramp_addr; > > + > > + if (likely(!utask || !utask->return_instances)) > > + return; > > + > > + cur_ip = >ip[start_entry_idx]; > > + last_ip = >ip[entry->nr - 1]; > > + ri = utask->return_instances; > > + tramp_addr = uprobe_get_trampoline_vaddr(); > > + > > + /* If there are pending uretprobes for current thread, they are > > Comment style fail. Also 'for *the* current thread'. > ack, will fix > > + * recorded in a list inside utask->return_instances; each such > > + * pending uretprobe replaces traced user function's return address on > > + * the stack, so when stack trace is captured, instead of seeing > > + * actual function's return address, we'll have one or many uretprobe > > + * trampoline addresses in the stack trace, which are not helpful and > > + * misleading to users. > > I would beg to differ, what if the uprobe is causing the performance > issue? If uprobe/uretprobe code itself is causing performance issues, you'll see that in other stack traces, where this code will be actively running on CPU. I don't think we make anything worse here. Here we are talking about the case where the uprobe part is done and it hijacked the return address on the stack, uretprobe is not yet running (and so not causing any performance issues). The presence of this "snooping" (pending) uretprobe is irrelevant to the user that is capturing stack trace. Right now address in [uprobes] VMA section installed by uretprobe infra code is directly replacing correct and actual calling function address. 
Worst case, one can argue that both [uprobes] and original caller address should be in the stack trace, but I think it still will be confusing to users. And also will make implementation less efficient because now we'll need to insert entries into the array and shift everything around. So as I mentioned above, if the concern is seeing uprobe/uretprobe code using CPU, that doesn't change, we'll see that in the overall set of captured stack traces (be it custom uprobe handler code or BPF program). > > While I do think it makes sense to fix the unwind in the sense that we > should be able to continue the unwind, I don't think it makes sense to > completely hide the presence of uprobes. Unwind isn't broken in this sense, we do unwind the entire stack trace (see examples in the later patch). We just don't capture actual callers if they have uretprobe pending. > > > + * So here we go over the pending list of uretprobes, and each > > + * encountered trampoline address is replaced with actual return > > + * address. > > + */ > > + while (ri && cur_ip <= last_ip) { > > + if (*cur_ip == tramp_addr) { > > + *cur_ip = ri->orig_ret_vaddr; > > + ri = ri->next; > > + } > > + cur_ip++; > > + } > > +#endif > > +}
[PATCH 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces
Add a set of tests to validate that stack traces captured from or in the presence of active uprobes and uretprobes are valid and complete. For this we use BPF programs that are installed either on entry or exit of user functions, plus a deep-nested USDT. One of the target functions (target_1) is recursive, to generate two different entries in the stack trace for the same uprobe/uretprobe, testing potential edge conditions.

Without the fixes in this patch set, we get something like this for one of the scenarios:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb3 (in target_4)
 ENTRY #1: 0x758fd3 (in target_2)
 ENTRY #2: 0x758ffd (in target_1)
 ENTRY #3: 0x7fffe000
 ENTRY #4: 0x7fffe000
 ENTRY #5: 0x6f8f39
 ENTRY #6: 0x6fa6f0
 ENTRY #7: 0x7f403f229590

Entries #3 and #4 (0x7fffe000) are uretprobe trampoline addresses which obscure the actual target_1 and another target_1 invocation. Also note that between entry #0 and entry #1 we are missing an entry for target_3, which is fixed in patch #3.

With all the fixes, we get the desired full stack traces:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb7 (in target_4)
 ENTRY #1: 0x758fc8 (in target_3)
 ENTRY #2: 0x758fd3 (in target_2)
 ENTRY #3: 0x758ffd (in target_1)
 ENTRY #4: 0x758ff3 (in target_1)
 ENTRY #5: 0x75922c (in caller)
 ENTRY #6: 0x6f8f39
 ENTRY #7: 0x6fa6f0
 ENTRY #8: 0x7f986adc4cd0

Now there is a logical and complete sequence of function calls.
Signed-off-by: Andrii Nakryiko --- .../bpf/prog_tests/uretprobe_stack.c | 185 ++ .../selftests/bpf/progs/uretprobe_stack.c | 96 + 2 files changed, 281 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c diff --git a/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c new file mode 100644 index ..2974553da6e5 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c @@ -0,0 +1,185 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ + +#include +#include "uretprobe_stack.skel.h" +#include "../sdt.h" + +/* We set up target_1() -> target_2() -> target_3() -> target_4() -> USDT() + * call chain, each being traced by our BPF program. On entry or return from + * each target_*() we are capturing user stack trace and recording it in + * global variable, so that user space part of the test can validate it. 
+ * + * Note, we put each target function into a custom section to get those + * __start_XXX/__stop_XXX symbols, generated by linker for us, which allow us + * to know address range of those functions + */ +__attribute__((section("uprobe__target_4"))) +__weak int target_4(void) +{ + STAP_PROBE1(uretprobe_stack, target, 42); + return 42; +} + +extern const void *__start_uprobe__target_4; +extern const void *__stop_uprobe__target_4; + +__attribute__((section("uprobe__target_3"))) +__weak int target_3(void) +{ + return target_4(); +} + +extern const void *__start_uprobe__target_3; +extern const void *__stop_uprobe__target_3; + +__attribute__((section("uprobe__target_2"))) +__weak int target_2(void) +{ + return target_3(); +} + +extern const void *__start_uprobe__target_2; +extern const void *__stop_uprobe__target_2; + +__attribute__((section("uprobe__target_1"))) +__weak int target_1(int depth) +{ + if (depth < 1) + return 1 + target_1(depth + 1); + else + return target_2(); +} + +extern const void *__start_uprobe__target_1; +extern const void *__stop_uprobe__target_1; + +extern const void *__start_uretprobe_stack_sec; +extern const void *__stop_uretprobe_stack_sec; + +struct range { + long start; + long stop; +}; + +static struct range targets[] = { + {}, /* we want target_1 to map to target[1], so need 1-based indexing */ + { (long)&__start_uprobe__target_1, (long)&__stop_uprobe__target_1 }, + { (long)&__start_uprobe__target_2, (long)&__stop_uprobe__target_2 }, + { (long)&__start_uprobe__target_3, (long)&__stop_uprobe__target_3 }, + { (long)&__start_uprobe__target_4, (long)&__stop_uprobe__target_4 }, +}; + +static struct range caller = { + (long)&__start_uretprobe_stack_sec, + (long)&__stop_uretprobe_stack_sec, +}; + +static void validate_stack(__u64 *ips, int stack_len, int cnt, ...) +{ + int i, j; + va_list args; + + if (!ASSERT_GT(stack_len, 0, "stack_len")) + return; + + stack_len /= 8; +
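The __start_SECNAME/__stop_SECNAME trick the selftest relies on above is generic GNU linker behavior, not BPF-specific: for any orphan section whose name is a valid C identifier, the linker synthesizes bracketing symbols. A standalone sketch (hypothetical section name) that checks a function's address against its section bounds:

```c
#include <stdint.h>

/* Placing code in a named section makes the linker synthesize
 * __start_<sec>/__stop_<sec> symbols bracketing it, which gives us the
 * address range of the function without parsing symbol tables. */
__attribute__((section("demo_sec"), used, noinline))
int demo_target(int x)
{
	return x + 1;
}

extern const char __start_demo_sec[];
extern const char __stop_demo_sec[];

/* Returns non-zero if addr falls inside the "demo_sec" code section. */
static int addr_in_demo_sec(uintptr_t addr)
{
	return addr >= (uintptr_t)__start_demo_sec &&
	       addr <  (uintptr_t)__stop_demo_sec;
}
```

The selftest uses exactly this kind of range check to assert that every captured stack-trace entry lands inside the expected target function.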
[PATCH 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be the `push %rbp` instruction in the function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which would be the caller's frame, containing the address of the caller's caller).

So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic to detect the `push %rbp` instruction being traced. If that's the case, under the assumption that the application is compiled with frame pointers, this instruction is a strong indicator that we are at the entry to the function. In that case, the return address is still pointed to by %rsp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using the frame pointer-based logic.
Signed-off-by: Andrii Nakryiko --- arch/x86/events/core.c | 20 include/linux/uprobes.h | 2 ++ kernel/events/uprobes.c | 2 ++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 5b0dd07b1ef1..82d5570b58ff 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs return; pagefault_disable(); + +#ifdef CONFIG_UPROBES + /* +* If we are called from uprobe handler, and we are indeed at the very +* entry to user function (which is normally a `push %rbp` instruction, +* under assumption of application being compiled with frame pointers), +* we should read return address from *regs->sp before proceeding +* to follow frame pointers, otherwise we'll skip immediate caller +* as %rbp is not yet setup. +*/ + if (current->utask) { + struct arch_uprobe *auprobe = current->utask->auprobe; + u64 ret_addr; + + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && + !__get_user(ret_addr, (const u64 __user *)regs->sp)) + perf_callchain_store(entry, ret_addr); + } +#endif + while (entry->nr < entry->max_stack) { if (!valid_user_frame(fp, sizeof(frame))) break; diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0c57eec85339..7b785cd30d86 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -76,6 +76,8 @@ struct uprobe_task { struct uprobe *active_uprobe; unsigned long xol_vaddr; + struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned intdepth; }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 1c99380dc89d..504693845187 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) bool need_prep = false; /* prepare return uprobe, when needed */ down_read(>register_rwsem); + current->utask->auprobe = >arch; for (uc = uprobe->consumers; uc; uc = 
uc->next) { int rc = 0; @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) remove &= rc; } + current->utask->auprobe = NULL; if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */ -- 2.43.0
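The heuristic in the diff above boils down to a single opcode check; here is a minimal userspace sketch of it (hypothetical helper name; it encodes the same assumption the patch makes, namely that a frame-pointer-built x86-64 function starts with `push %rbp`, opcode 0x55):

```c
#include <stdbool.h>
#include <stdint.h>

/* At a `push %rbp` (opcode 0x55) the frame pointer is not established
 * yet, so a frame-pointer unwinder starting here would skip the
 * immediate caller; its return address must instead be read directly
 * from *%rsp before the normal frame-pointer walk begins. */
static bool insn_is_function_entry(const uint8_t *insn)
{
	return insn[0] == 0x55; /* push %rbp */
}
```

In the kernel patch, this check is done on auprobe->insn[0] (the saved original instruction byte, since the live text holds the int3 breakpoint), followed by a __get_user() read of the return address at regs->sp.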
[PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes
When the kernel has pending uretprobes installed, it hijacks the original user function return address on the stack with a uretprobe trampoline address. There could be multiple such pending uretprobes (either on different user functions or on the same recursive one) at any given time within the same task.

This approach interferes with the user stack trace capture logic, which would report surprising addresses (like 0x7fffe000) that correspond to a special "[uprobes]" section that the kernel installs in the target process address space for uretprobe trampoline code, while logically it should be an address somewhere within the calling function of another traced user function.

This is easy to correct for, though. The uprobes subsystem keeps track of pending uretprobes and records original return addresses. This patch uses that to do a post-processing step and restore each trampoline address entry with the correct original return address. This is done only if there are pending uretprobes for the current task.
Reported-by: Riham Selim
Signed-off-by: Andrii Nakryiko
---
 kernel/events/callchain.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 kernel/events/uprobes.c   |  9 +++++++++
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..2f7ceca7ae3f 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -11,6 +11,7 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <linux/sched/task_stack.h>
+#include <linux/uprobes.h>

 #include "internal.h"

@@ -176,13 +177,50 @@ put_callchain_entry(int rctx)
 	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 }

+static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry,
+					       int start_entry_idx)
+{
+#ifdef CONFIG_UPROBES
+	struct uprobe_task *utask = current->utask;
+	struct return_instance *ri;
+	__u64 *cur_ip, *last_ip, tramp_addr;
+
+	if (likely(!utask || !utask->return_instances))
+		return;
+
+	cur_ip = &entry->ip[start_entry_idx];
+	last_ip = &entry->ip[entry->nr - 1];
+	ri = utask->return_instances;
+	tramp_addr = uprobe_get_trampoline_vaddr();
+
+	/* If there are pending uretprobes for current thread, they are
+	 * recorded in a list inside utask->return_instances; each such
+	 * pending uretprobe replaces traced user function's return address on
+	 * the stack, so when stack trace is captured, instead of seeing
+	 * actual function's return address, we'll have one or many uretprobe
+	 * trampoline addresses in the stack trace, which are not helpful and
+	 * misleading to users.
+	 * So here we go over the pending list of uretprobes, and each
+	 * encountered trampoline address is replaced with actual return
+	 * address.
+	 */
+	while (ri && cur_ip <= last_ip) {
+		if (*cur_ip == tramp_addr) {
+			*cur_ip = ri->orig_ret_vaddr;
+			ri = ri->next;
+		}
+		cur_ip++;
+	}
+#endif
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
-	int rctx;
+	int rctx, start_entry_idx;

 	entry = get_callchain_entry(&rctx);
 	if (!entry)
@@ -215,7 +253,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);

+			start_entry_idx = entry->nr;
 			perf_callchain_user(&ctx, regs);
+			fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
 		}
 	}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d60d24f0f2f4..1c99380dc89d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2149,6 +2149,15 @@ static void handle_trampoline(struct pt_regs *regs)
 	instruction_pointer_set(regs, ri->orig_ret_vaddr);

 	do {
+		/* pop current instance from the stack of pending return instances,
+		 * as it's not pending anymore: we just fixed up original
+		 * instruction pointer in regs and are about to call handlers;
+		 * this allows fixup_uretprobe_trampoline_entries() to properly fix up
+		 * captured stack traces from uretprobe handlers, in which pending
+		 * trampoline addresses on the stack are replaced with correct
+		 * original return addresses
+		 */
+		utask
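The replacement walk in fixup_uretprobe_trampoline_entries() can be modeled in plain userspace C. This is a hypothetical standalone sketch (invented names and addresses taken from the selftest example, not kernel code) showing why a single ordered pass suffices: captured IPs go from innermost to outermost frame, and utask->return_instances is likewise ordered innermost-first:

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a pending return instance: remembers the original return
 * address that the uretprobe trampoline replaced on the user stack. */
struct ret_instance {
	uint64_t orig_ret_vaddr;
	struct ret_instance *next;
};

/* Single pass over captured IPs: every trampoline address is replaced,
 * in order, with the next pending original return address. */
static void fixup_trampoline_entries(uint64_t *ips, size_t nr,
				     struct ret_instance *ri,
				     uint64_t tramp_addr)
{
	uint64_t *cur_ip = &ips[0], *last_ip = &ips[nr - 1];

	while (ri && cur_ip <= last_ip) {
		if (*cur_ip == tramp_addr) {
			*cur_ip = ri->orig_ret_vaddr;
			ri = ri->next;
		}
		cur_ip++;
	}
}
```

Non-trampoline entries pass through untouched, and the walk stops as soon as the pending list is exhausted, so the cost is bounded by the stack trace length.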
[PATCH 1/4] uprobes: rename get_trampoline_vaddr() and make it global
This helper is needed in another file, so make it a bit more uniquely named and expose it internally.

Signed-off-by: Andrii Nakryiko
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..0c57eec85339 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 				  void *src, unsigned long len);
+extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8ae0eefc3a34..d60d24f0f2f4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, unsigned long flags)
  *
  * Returns -1 in case the xol_area is not allocated.
  */
-static unsigned long get_trampoline_vaddr(void)
+unsigned long uprobe_get_trampoline_vaddr(void)
 {
 	struct xol_area *area;
 	unsigned long trampoline_vaddr = -1;
@@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	if (!ri)
 		return;

-	trampoline_vaddr = get_trampoline_vaddr();
+	trampoline_vaddr = uprobe_get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
 		goto fail;
@@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs)
 	int is_swbp;

 	bp_vaddr = uprobe_get_swbp_addr(regs);
-	if (bp_vaddr == get_trampoline_vaddr())
+	if (bp_vaddr == uprobe_get_trampoline_vaddr())
 		return handle_trampoline(regs);

 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
--
2.43.0
[PATCH 0/4] Fix user stack traces captured from uprobes
This patch set fixes two issues with captured stack traces.

The first issue, fixed in patch #2, deals with fixing up uretprobe trampoline addresses in a captured stack trace. This issue happens when there are pending return probes, for which the kernel hijacks some of the return addresses on user stacks. The code matches those special uretprobe trampoline addresses against the list of pending return probe instances and replaces them with the actual return addresses.

The second issue, which patch #3 tries to fix with the help of a heuristic, has to do with capturing user stack traces in entry uprobes. At the very entrance to a user function, the frame pointer in the rbp register is not yet set up, so the actual caller return address is still pointed to by rsp. The patch uses a simple heuristic, looking for the `push %rbp` instruction, to fetch this extra direct caller return address before proceeding to unwind the stack using rbp.

Consider patch #3 an RFC; if there are better suggestions how this can be solved, I'd be happy to hear them.

Patch #4 adds tests to BPF selftests that validate that stack traces captured at various points are what we expect to get. This patch, while being BPF selftests, is isolated from any other BPF selftests changes and can go in through a non-BPF tree without the risk of merge conflicts.

Patches are based on the latest linux-trace probes/for-next branch.
Andrii Nakryiko (4): uprobes: rename get_trampoline_vaddr() and make it global perf,uprobes: fix user stack traces in the presence of pending uretprobes perf,x86: avoid missing caller address in stack traces captured in uprobe selftests/bpf: add test validating uprobe/uretprobe stack traces arch/x86/events/core.c| 20 ++ include/linux/uprobes.h | 3 + kernel/events/callchain.c | 42 +++- kernel/events/uprobes.c | 17 +- .../bpf/prog_tests/uretprobe_stack.c | 185 ++ .../selftests/bpf/progs/uretprobe_stack.c | 96 + 6 files changed, 359 insertions(+), 4 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c -- 2.43.0
Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko wrote: > > At the lowest level, rethook-based kretprobes on x86-64 architecture go > through arch_rethoook_trampoline() function, manually written in > assembly, which calls into a simple arch_rethook_trampoline_callback() > function, written in C, and only doing a few straightforward field > assignments, before calling further into rethook_trampoline_handler(), > which handles kretprobe callbacks generically. > > Looking at simplicity of arch_rethook_trampoline_callback(), it seems > not really worthwhile to spend an extra function call just to do 4 or > 5 assignments. As such, this patch proposes to "inline" > arch_rethook_trampoline_callback() into arch_rethook_trampoline() by > manually implementing it in an assembly code. > > This has two motivations. First, we do get a bit of runtime speed up by > avoiding function calls. Using BPF selftests's bench tool, we see > 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe > triggering code path: > > BEFORE (latest probes/for-next) > === > kretprobe : 10.455 ± 0.024M/s > kretprobe-multi: 11.150 ± 0.012M/s > > AFTER (probes/for-next + this patch) > > kretprobe : 10.540 ± 0.009M/s (+0.8%) > kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) > > Second, and no less importantly for some specialized use cases, this > avoids unnecessarily "polluting" LBR records with an extra function call > (recorded as a jump by CPU). This is the case for the retsnoop ([0]) > tool, which relies havily on capturing LBR records to provide users with > lots of insight into kernel internals. > > This RFC patch is only inlining this function for x86-64, but it's > possible to do that for 32-bit x86 arch as well and then remove > arch_rethook_trampoline_callback() implementation altogether. Please let > me know if this change is acceptable and whether I should complete it > with 32-bit "inlining" as well. Thanks! 
> > [0] > https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr > > Signed-off-by: Andrii Nakryiko > --- > arch/x86/kernel/asm-offsets_64.c | 4 > arch/x86/kernel/rethook.c| 37 +++- > 2 files changed, 36 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/asm-offsets_64.c > b/arch/x86/kernel/asm-offsets_64.c > index bb65371ea9df..5c444abc540c 100644 > --- a/arch/x86/kernel/asm-offsets_64.c > +++ b/arch/x86/kernel/asm-offsets_64.c > @@ -42,6 +42,10 @@ int main(void) > ENTRY(r14); > ENTRY(r15); > ENTRY(flags); > + ENTRY(ip); > + ENTRY(cs); > + ENTRY(ss); > + ENTRY(orig_ax); > BLANK(); > #undef ENTRY > > diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c > index 8a1c0111ae79..3e1c01beebd1 100644 > --- a/arch/x86/kernel/rethook.c > +++ b/arch/x86/kernel/rethook.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > > #include "kprobes/common.h" > > @@ -34,10 +35,36 @@ asm( > " pushq %rsp\n" > " pushfq\n" > SAVE_REGS_STRING > - " movq %rsp, %rdi\n" > - " call arch_rethook_trampoline_callback\n" > + " movq %rsp, %rdi\n" /* $rdi points to regs */ > + /* fixup registers */ > + /* regs->cs = __KERNEL_CS; */ > + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) > "(%rdi)\n" > + /* regs->ip = (unsigned long)_rethook_trampoline; */ > + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) > "(%rdi)\n" > + /* regs->orig_ax = ~0UL; */ > + " movq $0x, " __stringify(pt_regs_orig_ax) > "(%rdi)\n" > + /* regs->sp += 2*sizeof(long); */ > + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" > + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ > + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" BTW, all this __stringify() ugliness can be avoided if we move this assembly into its own .S file, like lots of other assembly functions in arch/x86/kernel subdir. That has another benefit of generating better line information in DWARF for those assembly instructions. 
It's lots more work, so before I do this, I'd like to get confirmation that this change is acceptable in principle. > + /* > +* The return address at 'frame_pointer' is recovered by the > +* a
Re: [PATCH 0/2] Objpool performance improvements
On Fri, Apr 26, 2024 at 7:25 AM Masami Hiramatsu wrote: > > Hi Andrii, > > On Wed, 24 Apr 2024 14:52:12 -0700 > Andrii Nakryiko wrote: > > > Improve objpool (used heavily in kretprobe hot path) performance with two > > improvements: > > - inlining performance critical objpool_push()/objpool_pop() operations; > > - avoiding re-calculating relatively expensive nr_possible_cpus(). > > Thanks for optimizing objpool. Both looks good to me. Great, thanks for applying. > > BTW, I don't intend to stop this short-term optimization attempts, > but I would like to ask you check the new fgraph based fprobe > (kretprobe-multi)[1] instead of objpool/rethook. You can see that I did :) There is tons of code and I'm not familiar with internals of function_graph infra, but you can see I did run benchmarks, so I'm paying attention. But as for the objpool itself, I think it's a performant and useful internal building block, and we might use it outside of rethook as well, so I think making it as fast as possible is good regardless. > > [1] > https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/ > > I'm considering to obsolete the kretprobe (and rethook) with fprobe > and eventually remove it. Those have similar feature and we should > choose safer one. > Yep, I had a few more semi-ready patches, but I'll hold off for now given this move to function graph, plus some of the changes that Jiri is making in multi-kprobe code. I'll wait for things to settle down a bit before looking again. But just to give you some context, I'm an author of retsnoop tool, and one of the killer features of it is LBR capture in kretprobes, which is a tremendous help in investigating kernel failures, especially in unfamiliar code (LBR allows to "look back" and figure out "how did we get to this condition" after the fact). 
And so it's important to minimize the amount of wasted LBR records between some kernel function returns error (and thus is "an interesting event" and BPF program that captures LBR is triggered). Big part of that is ftrace/fprobe/rethook infra, so I was looking into making that part as "minimal" as possible, in the sense of eliminating as many function calls and jump as possible. This has an added benefit of making this hot path faster, but my main motivation is LBR. Anyways, just a bit of context for some of the other patches (like inlining arch_rethook_trampoline_callback). > Thank you, > > > > > These opportunities were found when benchmarking and profiling kprobes and > > kretprobes with BPF-based benchmarks. See individual patches for details and > > results. > > > > Andrii Nakryiko (2): > > objpool: enable inlining objpool_push() and objpool_pop() operations > > objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids > > > > include/linux/objpool.h | 105 +++-- > > lib/objpool.c | 112 +++- > > 2 files changed, 107 insertions(+), 110 deletions(-) > > > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code
At the lowest level, rethook-based kretprobes on the x86-64 architecture go through the arch_rethook_trampoline() function, manually written in assembly, which calls into a simple arch_rethook_trampoline_callback() function, written in C, that only does a few straightforward field assignments before calling further into rethook_trampoline_handler(), which handles kretprobe callbacks generically.

Looking at the simplicity of arch_rethook_trampoline_callback(), it doesn't seem worthwhile to spend an extra function call just to do 4 or 5 assignments. As such, this patch proposes to "inline" arch_rethook_trampoline_callback() into arch_rethook_trampoline() by manually implementing it in assembly code.

This has two motivations. First, we do get a bit of runtime speed up by avoiding function calls. Using BPF selftests' bench tool, we see 0.6%-0.8% throughput improvement for the kretprobe/multi-kretprobe triggering code path:

BEFORE (latest probes/for-next)
===============================
kretprobe      : 10.455 ± 0.024M/s
kretprobe-multi: 11.150 ± 0.012M/s

AFTER (probes/for-next + this patch)
====================================
kretprobe      : 10.540 ± 0.009M/s (+0.8%)
kretprobe-multi: 11.219 ± 0.042M/s (+0.6%)

Second, and no less importantly for some specialized use cases, this avoids unnecessarily "polluting" LBR records with an extra function call (recorded as a jump by the CPU). This is the case for the retsnoop ([0]) tool, which relies heavily on capturing LBR records to provide users with lots of insight into kernel internals.

This RFC patch only inlines this function for x86-64, but it's possible to do that for the 32-bit x86 arch as well and then remove the arch_rethook_trampoline_callback() implementation altogether. Please let me know if this change is acceptable and whether I should complete it with 32-bit "inlining" as well. Thanks!
[0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr Signed-off-by: Andrii Nakryiko --- arch/x86/kernel/asm-offsets_64.c | 4 arch/x86/kernel/rethook.c| 37 +++- 2 files changed, 36 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index bb65371ea9df..5c444abc540c 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -42,6 +42,10 @@ int main(void) ENTRY(r14); ENTRY(r15); ENTRY(flags); + ENTRY(ip); + ENTRY(cs); + ENTRY(ss); + ENTRY(orig_ax); BLANK(); #undef ENTRY diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c index 8a1c0111ae79..3e1c01beebd1 100644 --- a/arch/x86/kernel/rethook.c +++ b/arch/x86/kernel/rethook.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "kprobes/common.h" @@ -34,10 +35,36 @@ asm( " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING - " movq %rsp, %rdi\n" - " call arch_rethook_trampoline_callback\n" + " movq %rsp, %rdi\n" /* $rdi points to regs */ + /* fixup registers */ + /* regs->cs = __KERNEL_CS; */ + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) "(%rdi)\n" + /* regs->ip = (unsigned long)_rethook_trampoline; */ + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) "(%rdi)\n" + /* regs->orig_ax = ~0UL; */ + " movq $0x, " __stringify(pt_regs_orig_ax) "(%rdi)\n" + /* regs->sp += 2*sizeof(long); */ + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" + /* +* The return address at 'frame_pointer' is recovered by the +* arch_rethook_fixup_return() which called from this +* rethook_trampoline_handler(). +*/ + " call rethook_trampoline_handler\n" + /* +* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF. +* +* We don't save/restore %rax below, because we ignore +* rethook_trampoline_handler result. 
+* +* *(unsigned long *)&regs->ss = regs->flags; +*/ + " mov " __stringify(pt_regs_flags) "(%rsp), %rax\n" + " mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n" RESTORE_REGS_STRING - /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */ + /* We just copied 'regs->flags' into 'regs->ss'. */ " addq $16, %rsp\n" " popfq\n" #else @@ -61,6 +88,7 @@ asm( ); NOKPROBE_SYMBOL(arch_retho
[PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
Profiling shows that calling nr_possible_cpus() in objpool_pop() takes a noticeable amount of CPU (when profiled on 80-core machine), as we need to recalculate number of set bits in a CPU bit mask. This number can't change, so there is no point in paying the price for recalculating it. As such, cache this value in struct objpool_head and use it in objpool_pop(). On the other hand, cached pool->nr_cpus isn't necessary, as it's not used in hot path and is also a pretty trivial value to retrieve. So drop pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the size of struct objpool_head remains the same, which is a nice bonus. Same BPF selftests benchmarks were used to evaluate the effect. Using changes in previous patch (inlining of objpool_pop/objpool_push) as baseline, here are the differences: BASELINE kretprobe :9.937 ± 0.174M/s kretprobe-multi: 10.440 ± 0.108M/s AFTER = kretprobe : 10.106 ± 0.120M/s (+1.7%) kretprobe-multi: 10.515 ± 0.180M/s (+0.7%) Cc: Matt (Qiang) Wu Signed-off-by: Andrii Nakryiko --- include/linux/objpool.h | 6 +++--- lib/objpool.c | 12 ++-- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index d8b1f7b91128..cb1758eaa2d3 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); * struct objpool_head - object pooling metadata * @obj_size: object size, aligned to sizeof(void *) * @nr_objs:total objs (to be pre-allocated with objpool) - * @nr_cpus:local copy of nr_cpu_ids + * @nr_possible_cpus: cached value of num_possible_cpus() * @capacity: max objs can be managed by one objpool_slot * @gfp:gfp flags for kmalloc & vmalloc * @ref:refcount of objpool @@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); struct objpool_head { int obj_size; int nr_objs; - int nr_cpus; + int nr_possible_cpus; int capacity; gfp_t gfp; refcount_t ref; @@ 
-176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool) raw_local_irq_save(flags); cpu = raw_smp_processor_id(); - for (i = 0; i < num_possible_cpus(); i++) { + for (i = 0; i < pool->nr_possible_cpus; i++) { obj = __objpool_try_get_slot(pool, cpu); if (obj) break; diff --git a/lib/objpool.c b/lib/objpool.c index f696308fc026..234f9d0bd081 100644 --- a/lib/objpool.c +++ b/lib/objpool.c @@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, { int i, cpu_count = 0; - for (i = 0; i < pool->nr_cpus; i++) { + for (i = 0; i < nr_cpu_ids; i++) { struct objpool_slot *slot; int nodes, size, rc; @@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, continue; /* compute how many objects to be allocated with this slot */ - nodes = nr_objs / num_possible_cpus(); - if (cpu_count < (nr_objs % num_possible_cpus())) + nodes = nr_objs / pool->nr_possible_cpus; + if (cpu_count < (nr_objs % pool->nr_possible_cpus)) nodes++; cpu_count++; @@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head *pool) if (!pool->cpu_slots) return; - for (i = 0; i < pool->nr_cpus; i++) + for (i = 0; i < nr_cpu_ids; i++) kvfree(pool->cpu_slots[i]); kfree(pool->cpu_slots); } @@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, /* initialize objpool pool */ memset(pool, 0, sizeof(struct objpool_head)); - pool->nr_cpus = nr_cpu_ids; + pool->nr_possible_cpus = num_possible_cpus(); pool->obj_size = object_size; pool->capacity = capacity; pool->gfp = gfp & ~__GFP_ZERO; pool->context = context; pool->release = release; - slot_size = pool->nr_cpus * sizeof(struct objpool_slot); + slot_size = nr_cpu_ids * sizeof(struct objpool_slot); pool->cpu_slots = kzalloc(slot_size, pool->gfp); if (!pool->cpu_slots) return -ENOMEM; -- 2.43.0
[PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations
objpool_push() and objpool_pop() are very performance-critical functions and can be called very frequently in kretprobe triggering path. As such, it makes sense to allow compiler to inline them completely to eliminate function calls overhead. Luckily, their logic is quite well isolated and doesn't have any sprawling dependencies. This patch moves both objpool_push() and objpool_pop() into include/linux/objpool.h and marks them as static inline functions, enabling inlining. To avoid anyone using internal helpers (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading underscores. We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi programs in a tight loop to evaluate the effect. BPF own overhead in this case is minimal and it mostly stresses the rest of in-kernel kretprobe infrastructure overhead. Results are in millions of calls per second. This is not super scientific, but shows the trend nevertheless. 
BEFORE == kretprobe :9.794 ± 0.086M/s kretprobe-multi: 10.219 ± 0.032M/s AFTER = kretprobe :9.937 ± 0.174M/s (+1.5%) kretprobe-multi: 10.440 ± 0.108M/s (+2.2%) Cc: Matt (Qiang) Wu Signed-off-by: Andrii Nakryiko --- include/linux/objpool.h | 101 +++- lib/objpool.c | 100 --- 2 files changed, 99 insertions(+), 102 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index 15aff4a17f0c..d8b1f7b91128 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -5,6 +5,10 @@ #include #include +#include +#include +#include +#include /* * objpool: ring-array based lockless MPMC queue */ @@ -118,13 +122,94 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, gfp_t gfp, void *context, objpool_init_obj_cb objinit, objpool_fini_cb release); +/* try to retrieve object from slot */ +static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + /* load head snapshot, other cpus may change it */ + uint32_t head = smp_load_acquire(&slot->head); + + while (head != READ_ONCE(slot->last)) { + void *obj; + + /* +* data visibility of 'last' and 'head' could be out of +* order since memory updating of 'last' and 'head' are +* performed in push() and pop() independently +* +* before any retrieving attempts, pop() must guarantee +* 'last' is behind 'head', that is to say, there must +* be available objects in slot, which could be ensured +* by condition 'last != head && last - head <= nr_objs' +* that is equivalent to 'last - head - 1 < nr_objs' as +* 'last' and 'head' are both unsigned int32 +*/ + if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { + head = READ_ONCE(slot->head); + continue; + } + + /* obj must be retrieved before moving forward head */ + obj = READ_ONCE(slot->entries[head & slot->mask]); + + /* move head forward to mark its consumption */ + if (try_cmpxchg_release(&slot->head, &head, head + 1)) + return obj; + } + + return NULL; +} + /** * objpool_pop() - 
allocate an object from objpool * @pool: object pool * * return value: object ptr or NULL if failed */ -void *objpool_pop(struct objpool_head *pool); +static inline void *objpool_pop(struct objpool_head *pool) +{ + void *obj = NULL; + unsigned long flags; + int i, cpu; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + + cpu = raw_smp_processor_id(); + for (i = 0; i < num_possible_cpus(); i++) { + obj = __objpool_try_get_slot(pool, cpu); + if (obj) + break; + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); + } + raw_local_irq_restore(flags); + + return obj; +} + +/* adding object to slot, abort if the slot was already full */ +static inline int +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + uint32_t head, tail; + + /* loading tail and head as a local snapshot, tail first */ + tail = READ_ONCE(slot->tail); + + do { + head = READ_ONCE(slot->head); + /* fault caught: something must be wrong */ + WARN_ON_ONCE(tail - head > pool->nr_objs); + } while (!try_cmpxchg_acqui
[PATCH 0/2] Objpool performance improvements
Improve objpool (used heavily in kretprobe hot path) performance with two improvements: - inlining performance critical objpool_push()/objpool_pop() operations; - avoiding re-calculating relatively expensive nr_possible_cpus(). These opportunities were found when benchmarking and profiling kprobes and kretprobes with BPF-based benchmarks. See individual patches for details and results. Andrii Nakryiko (2): objpool: enable inlining objpool_push() and objpool_pop() operations objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids include/linux/objpool.h | 105 +++-- lib/objpool.c | 112 +++- 2 files changed, 107 insertions(+), 110 deletions(-) -- 2.43.0
Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu wrote: > > On Thu, 18 Apr 2024 12:09:09 -0700 > Andrii Nakryiko wrote: > > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating > > that RCU is watching when trying to setup rethooko on a function entry. > > > > One notable exception when we force rcu_is_watching() check is > > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use > > old-style int3-based workflow instead of relying on ftrace, making RCU > > watching check important to validate. > > > > This further (in addition to improvements in the previous patch) > > improves BPF multi-kretprobe (which rely on rethook) runtime throughput > > by 2.3%, according to BPF benchmarks ([0]). > > > > [0] > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ > > > > Signed-off-by: Andrii Nakryiko > > > Thanks for update! This looks good to me. Thanks, Masami! Will you take it through your tree, or you'd like to route it through bpf-next? > > Acked-by: Masami Hiramatsu (Google) > > Thanks, > > > --- > > kernel/trace/rethook.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > > index fa03094e9e69..a974605ad7a5 100644 > > --- a/kernel/trace/rethook.c > > +++ b/kernel/trace/rethook.c > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > > if (unlikely(!handler)) > > return NULL; > > > > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || > > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) > > /* > >* This expects the caller will set up a rethook on a function entry. 
> >* When the function returns, the rethook will eventually be reclaimed > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > >*/ > > if (unlikely(!rcu_is_watching())) > > return NULL; > > +#endif > > > > return (struct rethook_node *)objpool_pop(>pool); > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. One notable exception when we force the rcu_is_watching() check is the CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use the old-style int3-based workflow instead of relying on ftrace, making the RCU watching check important to validate. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Signed-off-by: Andrii Nakryiko --- kernel/trace/rethook.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..a974605ad7a5 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!handler)) return NULL; +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) /* * This expects the caller will set up a rethook on a function entry. * When the function returns, the rethook will eventually be reclaimed @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) */ if (unlikely(!rcu_is_watching())) return NULL; +#endif return (struct rethook_node *)objpool_pop(&rh->pool); } -- 2.43.0
[PATCH v4 1/2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Acked-by: Masami Hiramatsu (Google) Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..7aebd1b8f93e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace execution" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort of + protection against recursion. 
This option is only to verify that + ftrace (and other users of ftrace_test_recursion_trylock()) are not + called outside of RCU, as if they are, it can cause a race. But it + also has a noticeable overhead when enabled. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
On Tue, Apr 9, 2024 at 3:48 PM Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 15:03:28 -0700 > Andrii Nakryiko wrote: > > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating > > that RCU is watching when trying to setup rethooko on a function entry. > > > > This further (in addition to improvements in the previous patch) > > improves BPF multi-kretprobe (which rely on rethook) runtime throughput > > by 2.3%, according to BPF benchmarks ([0]). > > > > [0] > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ > > > > Hi Andrii, > > Can you make this part depends on !KPROBE_EVENTS_ON_NOTRACE (with this > option, kretprobes can be used without ftrace, but with original int3) ? Sorry for the late response, I was out on vacation. Makes sense about KPROBE_EVENTS_ON_NOTRACE, I went with this condition: #if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) Will send an updated revision shortly. > This option should be set N on production system because of safety, > just for testing raw kretprobes. > > Thank you, > > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/trace/rethook.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > > index fa03094e9e69..15b8aa4048d9 100644 > > --- a/kernel/trace/rethook.c > > +++ b/kernel/trace/rethook.c > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > > if (unlikely(!handler)) > > return NULL; > > > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > /* > >* This expects the caller will set up a rethook on a function entry. 
> >* When the function returns, the rethook will eventually be reclaimed > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > >*/ > > if (unlikely(!rcu_is_watching())) > > return NULL; > > +#endif > > > > return (struct rethook_node *)objpool_pop(>pool); > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Fri, Apr 5, 2024 at 8:41 PM Masami Hiramatsu wrote: > > On Tue, 2 Apr 2024 22:21:00 -0700 > Andrii Nakryiko wrote: > > > On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko > > wrote: > > > > > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > > > > > > > On Wed, 3 Apr 2024 09:40:48 +0900 > > > > Masami Hiramatsu (Google) wrote: > > > > > > > > > OK, for me, this last sentence is preferred for the help message. > > > > > That explains > > > > > what this is for. > > > > > > > > > > All callbacks that attach to the function tracing have some > > > > > sort > > > > > of protection against recursion. This option is only to > > > > > verify that > > > > > ftrace (and other users of ftrace_test_recursion_trylock()) > > > > >are not > > > > > called outside of RCU, as if they are, it can cause a race. > > > > > But it also has a noticeable overhead when enabled. > > > > > > Sounds good to me, I can add this to the description of the Kconfig > > > option. > > > > > > > > > > > > > BTW, how much overhead does this introduce? and the race case a > > > > > kernel crash? > > > > > > I just checked our fleet-wide production data for the last 24 hours. > > > Within the kprobe/kretprobe code path (ftrace_trampoline and > > > everything called from it), rcu_is_watching (both calls, see below) > > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > > > prefer to be able to avoid that in production use cases. > > > > > > > I just ran synthetic microbenchmark testing multi-kretprobe > > throughput. We get (in millions of BPF kretprobe-multi program > > invocations per second): > > - 5.568M/s as baseline; > > - 5.679M/s with changes in this patch (+2% throughput improvement); > > - 5.808M/s with disabling rcu_is_watching in rethook_try_get() > > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). > > > > It's definitely noticeable. > > Thanks for checking the overhead! Hmm, it is considerable. 
> > > > > > or just messed up the ftrace buffer? > > > > > > > > There's a hypothetical race where it can cause a use after free. > > Hmm, so it might not lead a kernel crash but is better to enable with > other debugging options. > > > > > > > > > That is, after you shutdown ftrace, you need to call > > > > synchronize_rcu_tasks(), > > > > which requires RCU to be watching. There's a theoretical case where that > > > > task calls the trampoline and misses the synchronization. Note, these > > > > locations are with preemption disabled, as rcu is always watching when > > > > preemption is enabled. Thus it would be extremely fast where as the > > > > synchronize_rcu_tasks() is rather slow. > > > > > > > > We also have synchronize_rcu_tasks_rude() which would actually keep the > > > > trace from happening, as it would schedule on each CPU forcing all CPUs > > > > to > > > > have RCU watching. > > > > > > > > I have never heard of this race being hit. I guess it could happen on a > > > > VM > > > > where a vCPU gets preempted at the right moment for a long time and the > > > > other CPUs synchronize. > > > > > > > > But like lockdep, where deadlocks can crash the kernel, we don't enable > > > > it > > > > for production. > > > > > > > > The overhead is another function call within the function tracer. I had > > > > numbers before, but I guess I could run tests again and get new numbers. > > > > > > > > > > I just noticed another rcu_is_watching() call, in rethook_try_get(), > > > which seems to be a similar and complementary validation check to the > > > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > > > in this patch. It feels like both of them should be controlled by the > > > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > > guard around rcu_is_watching() check in rethook_try_get() as well? > > Hmmm, no, I think it should not change the rethook side because rethook > can be used with kprobes without ftrace. 
If we can detect it is used from It's a good thing that I split that into a separate patch, then. Hopefully the first patch looks good and you can apply it as is. > the ftrace, we can skip it. (From this reason, I would like to remove > return probe from kprobes...) I'm on PTO for the next two weeks and I can take a look at more properly guarding rcu_is_watching() in rethook_try_get() when I'm back. Thanks. > > Thank you, > > > > > > > > > > > Thanks, > > > > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Signed-off-by: Andrii Nakryiko --- kernel/trace/rethook.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..15b8aa4048d9 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!handler)) return NULL; +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING /* * This expects the caller will set up a rethook on a function entry. * When the function returns, the rethook will eventually be reclaimed @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) */ if (unlikely(!rcu_is_watching())) return NULL; +#endif return (struct rethook_node *)objpool_pop(&rh->pool); } -- 2.43.0
[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..7aebd1b8f93e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace execution" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort of + protection against recursion. 
This option is only to verify that + ftrace (and other users of ftrace_test_recursion_trylock()) are not + called outside of RCU, as if they are, it can cause a race. But it + also has a noticeable overhead when enabled. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko wrote: > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > > > On Wed, 3 Apr 2024 09:40:48 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > OK, for me, this last sentence is preferred for the help message. That > > > explains > > > what this is for. > > > > > > All callbacks that attach to the function tracing have some sort > > > of protection against recursion. This option is only to verify > > > that > > > ftrace (and other users of ftrace_test_recursion_trylock()) are not > > > called outside of RCU, as if they are, it can cause a race. > > > But it also has a noticeable overhead when enabled. > > Sounds good to me, I can add this to the description of the Kconfig option. > > > > > > > BTW, how much overhead does this introduce? and the race case a kernel > > > crash? > > I just checked our fleet-wide production data for the last 24 hours. > Within the kprobe/kretprobe code path (ftrace_trampoline and > everything called from it), rcu_is_watching (both calls, see below) > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > prefer to be able to avoid that in production use cases. > I just ran synthetic microbenchmark testing multi-kretprobe throughput. We get (in millions of BPF kretprobe-multi program invocations per second): - 5.568M/s as baseline; - 5.679M/s with changes in this patch (+2% throughput improvement); - 5.808M/s with disabling rcu_is_watching in rethook_try_get() (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). It's definitely noticeable. > > > or just messed up the ftrace buffer? > > > > There's a hypothetical race where it can cause a use after free. > > > > That is, after you shutdown ftrace, you need to call > > synchronize_rcu_tasks(), > > which requires RCU to be watching. There's a theoretical case where that > > task calls the trampoline and misses the synchronization. 
Note, these > > locations are with preemption disabled, as rcu is always watching when > > preemption is enabled. Thus it would be extremely fast where as the > > synchronize_rcu_tasks() is rather slow. > > > > We also have synchronize_rcu_tasks_rude() which would actually keep the > > trace from happening, as it would schedule on each CPU forcing all CPUs to > > have RCU watching. > > > > I have never heard of this race being hit. I guess it could happen on a VM > > where a vCPU gets preempted at the right moment for a long time and the > > other CPUs synchronize. > > > > But like lockdep, where deadlocks can crash the kernel, we don't enable it > > for production. > > > > The overhead is another function call within the function tracer. I had > > numbers before, but I guess I could run tests again and get new numbers. > > > > I just noticed another rcu_is_watching() call, in rethook_try_get(), > which seems to be a similar and complementary validation check to the > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > in this patch. It feels like both of them should be controlled by the > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > guard around rcu_is_watching() check in rethook_try_get() as well? > > > > Thanks, > > > > -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > On Wed, 3 Apr 2024 09:40:48 +0900 > Masami Hiramatsu (Google) wrote: > > > OK, for me, this last sentence is preferred for the help message. That > > explains > > what this is for. > > > > All callbacks that attach to the function tracing have some sort > > of protection against recursion. This option is only to verify that > > ftrace (and other users of ftrace_test_recursion_trylock()) are not > > called outside of RCU, as if they are, it can cause a race. > > But it also has a noticeable overhead when enabled. Sounds good to me, I can add this to the description of the Kconfig option. > > > > BTW, how much overhead does this introduce? and the race case a kernel > > crash? I just checked our fleet-wide production data for the last 24 hours. Within the kprobe/kretprobe code path (ftrace_trampoline and everything called from it), rcu_is_watching (both calls, see below) cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd prefer to be able to avoid that in production use cases. > > or just messed up the ftrace buffer? > > There's a hypothetical race where it can cause a use after free. > > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(), > which requires RCU to be watching. There's a theoretical case where that > task calls the trampoline and misses the synchronization. Note, these > locations are with preemption disabled, as rcu is always watching when > preemption is enabled. Thus it would be extremely fast where as the > synchronize_rcu_tasks() is rather slow. > > We also have synchronize_rcu_tasks_rude() which would actually keep the > trace from happening, as it would schedule on each CPU forcing all CPUs to > have RCU watching. > > I have never heard of this race being hit. I guess it could happen on a VM > where a vCPU gets preempted at the right moment for a long time and the > other CPUs synchronize. 
> > But like lockdep, where deadlocks can crash the kernel, we don't enable it > for production. > > The overhead is another function call within the function tracer. I had > numbers before, but I guess I could run tests again and get new numbers. > I just noticed another rcu_is_watching() call, in rethook_try_get(), which seems to be a similar and complementary validation check to the one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option in this patch. It feels like both of them should be controlled by the same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING guard around rcu_is_watching() check in rethook_try_get() as well? > Thanks, > > -- Steve
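[Editor's illustration] The property the thread keeps returning to — a validation check that warns when it fires but compiles down to nothing when the config option is off — can be sketched as a userspace analogue of trace_warn_on_no_rcu(). This is an illustration only, not the kernel code: rcu_is_watching() is stubbed to always return true, and fprintf() stands in for WARN_ON_ONCE().

```c
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for the real RCU check; always "watching" here. */
static inline bool rcu_is_watching(void) { return true; }

/*
 * Analogue of trace_warn_on_no_rcu(): with the validation option
 * compiled in, evaluate the check and warn; without it, the macro is a
 * constant false and the compiler drops the whole branch -- the
 * zero-overhead property being discussed.  Uses a GCC statement
 * expression, as the kernel macro does.
 */
#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
# define trace_warn_on_no_rcu(ip)                                     \
	({                                                            \
		bool __ret = !rcu_is_watching();                      \
		if (__ret)                                            \
			fprintf(stderr, "RCU not watching at %#lx\n", \
				(unsigned long)(ip));                 \
		__ret;                                                \
	})
#else
# define trace_warn_on_no_rcu(ip) false
#endif

/*
 * A caller bails out early when the check fires, the way the ftrace
 * recursion trylock refuses to trace in the kernel.
 */
static bool recursion_check_enter(unsigned long ip)
{
	if (trace_warn_on_no_rcu(ip))
		return false; /* refuse to trace */
	return true;
}
```

Built without -DCONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING, recursion_check_enter() contains no call and no branch from the check at all, which is why keeping the option off in production costs nothing.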
Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee wrote: > > rethook_find_ret_addr() prints a warning message and returns 0 when the > target task is running and not the "current" task to prevent returning an > incorrect return address. However, this check is incomplete as the target > task can still transition to the running state when finding the return > address, although it is safe with RCU. > > The issue we encounter is that the kernel frequently prints warning > messages when BPF profiling programs call to bpf_get_task_stack() on > running tasks. > > The callers should be aware and willing to take the risk of receiving an > incorrect return address from a task that is currently running other than > the "current" one. A warning is not needed here as the callers are intent > on it. > > Signed-off-by: Kui-Feng Lee > --- > kernel/trace/rethook.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > index fa03094e9e69..4297a132a7ae 100644 > --- a/kernel/trace/rethook.c > +++ b/kernel/trace/rethook.c > @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct > *tsk, unsigned long frame > if (WARN_ON_ONCE(!cur)) > return 0; > > - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk))) > + if (tsk != current && task_is_running(tsk)) > return 0; > This should probably go through Masami's tree, but the change makes sense to me, given this is an expected condition. Acked-by: Andrii Nakryiko > do { > -- > 2.34.1 > >
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote: > > On Mon, 1 Apr 2024 12:09:18 -0400 > Steven Rostedt wrote: > > > On Mon, 1 Apr 2024 20:25:52 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > > Masami, > > > > > > > > Are you OK with just keeping it set to N. > > > > > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > > > > > > > We could have other options like PROVE_LOCKING enable it. > > > > > > Agreed (but it should say this is a debug option) > > > > It does say "Validate" which to me is a debug option. What would you > > suggest? > > I think the help message should have "This is for debugging ftrace." > Sent v2 with adjusted wording, thanks! > Thank you, > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level debugging of ftrace subsystem. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 14 ++ 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..fcf45d5c60cb 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace recursion check" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort + of protection against recursion. This option performs additional + checks to make sure RCU is on when ftrace callbacks recurse. + + This is a feature useful for debugging ftrace. This will add more + overhead to all ftrace-based invocations. 
+ + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Mar 26, 2024 at 11:58 AM Steven Rostedt wrote: > > On Tue, 26 Mar 2024 09:16:33 -0700 > Andrii Nakryiko wrote: > > > > It's no different than lockdep. Test boxes should have it enabled, but > > > there's no reason to have this enabled in a production system. > > > > > > > I tend to agree with Steven here (which is why I sent this patch as it > > is), but I'm happy to do it as an opt-out, if Masami insists. Please > > do let me know if I need to send v2 or this one is actually the one > > we'll end up using. Thanks! > > Masami, > > Are you OK with just keeping it set to N. > > We could have other options like PROVE_LOCKING enable it. > So what's the conclusion, Masami? Should I send another version where this config is opt-out, or are you ok with keeping it as opt-in as proposed in this revision? > -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt wrote: > > On Mon, 25 Mar 2024 11:38:48 +0900 > Masami Hiramatsu (Google) wrote: > > > On Fri, 22 Mar 2024 09:03:23 -0700 > > Andrii Nakryiko wrote: > > > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > > control whether ftrace low-level code performs additional > > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > > violations. > > > > > > This check is expected to never be true in practice and would be best > > > controlled with extra config to let users decide if they are willing to > > > pay the price. > > > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > is OK, but tht should be set to Y by default. If you have already verified > > that your system never make it true and you want to optimize your ftrace > > path, you can manually set it to N at your own risk. > > > > Really, it's for debugging. I would argue that it should *not* be default y. > Peter added this to find all the locations that could be called where RCU > is not watching. But the issue I have is that this is that it *does cause > overhead* with function tracing. > > I believe we found pretty much all locations that were an issue, and we > should now just make it an option for developers. > > It's no different than lockdep. Test boxes should have it enabled, but > there's no reason to have this enabled in a production system. > I tend to agree with Steven here (which is why I sent this patch as it is), but I'm happy to do it as an opt-out, if Masami insists. Please do let me know if I need to send v2 or this one is actually the one we'll end up using. Thanks! > -- Steve > > > > > > > > Cc: Steven Rostedt > > > Cc: Masami Hiramatsu > > > Cc: Paul E. 
McKenney > > > Signed-off-by: Andrii Nakryiko > > > --- > > > include/linux/trace_recursion.h | 2 +- > > > kernel/trace/Kconfig| 13 + > > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/trace_recursion.h > > > b/include/linux/trace_recursion.h > > > index d48cd92d2364..24ea8ac049b4 100644 > > > --- a/include/linux/trace_recursion.h > > > +++ b/include/linux/trace_recursion.h > > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > > unsigned long parent_ip); > > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > > #endif > > > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > > # define trace_warn_on_no_rcu(ip) \ > > > ({ \ > > > bool __ret = !rcu_is_watching();\ > > > > BTW, maybe we can add "unlikely" in the next "if" line? > > > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > > index 61c541c36596..19bce4e217d6 100644 > > > --- a/kernel/trace/Kconfig > > > +++ b/kernel/trace/Kconfig > > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > > This file can be reset, but the limit can not change in > > > size at runtime. > > > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > > + bool "Validate RCU is on during ftrace recursion check" > > > + depends on FUNCTION_TRACER > > > + depends on ARCH_WANTS_NO_INSTR > > > > default y > > > > > + help > > > + All callbacks that attach to the function tracing have some sort > > > + of protection against recursion. This option performs additional > > > + checks to make sure RCU is on when ftrace callbacks recurse. > > > + > > > + This will add more overhead to all ftrace-based invocations. > > > > ... invocations, but keep it safe. 
> > > > > + > > > + If unsure, say N > > > > If unsure, say Y > > > > Thank you, > > > > > + > > > config RING_BUFFER_RECORD_RECURSION > > > bool "Record functions that recurse in the ring buffer" > > > depends on FTRACE_RECORD_RECURSION > > > -- > > > 2.43.0 > > > > > > > >
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu wrote: > > On Fri, 22 Mar 2024 09:03:23 -0700 > Andrii Nakryiko wrote: > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > control whether ftrace low-level code performs additional > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > violations. > > > > This check is expected to never be true in practice and would be best > > controlled with extra config to let users decide if they are willing to > > pay the price. > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > is OK, but tht should be set to Y by default. If you have already verified > that your system never make it true and you want to optimize your ftrace > path, you can manually set it to N at your own risk. Yeah, I don't think we ever see this warning across our machines. And sure, I can default it to Y, no problem. > > > > > Cc: Steven Rostedt > > Cc: Masami Hiramatsu > > Cc: Paul E. McKenney > > Signed-off-by: Andrii Nakryiko > > --- > > include/linux/trace_recursion.h | 2 +- > > kernel/trace/Kconfig| 13 + > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/trace_recursion.h > > b/include/linux/trace_recursion.h > > index d48cd92d2364..24ea8ac049b4 100644 > > --- a/include/linux/trace_recursion.h > > +++ b/include/linux/trace_recursion.h > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > unsigned long parent_ip); > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > #endif > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > # define trace_warn_on_no_rcu(ip)\ > > ({ \ > > bool __ret = !rcu_is_watching();\ > > BTW, maybe we can add "unlikely" in the next "if" line? 
sure, can add that as well > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > index 61c541c36596..19bce4e217d6 100644 > > --- a/kernel/trace/Kconfig > > +++ b/kernel/trace/Kconfig > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > This file can be reset, but the limit can not change in > > size at runtime. > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > + bool "Validate RCU is on during ftrace recursion check" > > + depends on FUNCTION_TRACER > > + depends on ARCH_WANTS_NO_INSTR > > default y > ok > > + help > > + All callbacks that attach to the function tracing have some sort > > + of protection against recursion. This option performs additional > > + checks to make sure RCU is on when ftrace callbacks recurse. > > + > > + This will add more overhead to all ftrace-based invocations. > > ... invocations, but keep it safe. > > > + > > + If unsure, say N > > If unsure, say Y > yep, will do, thanks! > Thank you, > > > + > > config RING_BUFFER_RECORD_RECURSION > > bool "Record functions that recurse in the ring buffer" > > depends on FTRACE_RECORD_RECURSION > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true in practice and would be best controlled with extra config to let users decide if they are willing to pay the price. Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..19bce4e217d6 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace recursion check" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort + of protection against recursion. This option performs additional + checks to make sure RCU is on when ftrace callbacks recurse. + + This will add more overhead to all ftrace-based invocations. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
[PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well
Recent changes to count number of matching symbols when creating a kprobe event failed to take into account kernel modules. As such, it breaks kprobes on kernel module symbols, by assuming there is no match. Fix this by calling module_kallsyms_on_each_symbol() in addition to kallsyms_on_each_match_symbol() to perform a proper counting. Cc: Francis Laniel Cc: sta...@vger.kernel.org Cc: Masami Hiramatsu Cc: Steven Rostedt Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_kprobe.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index effcaede4759..1efb27f35963 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long unused) return 0; } +struct sym_count_ctx { + unsigned int count; + const char *name; +}; + +static int count_mod_symbols(void *data, const char *name, unsigned long unused) +{ + struct sym_count_ctx *ctx = data; + + if (strcmp(name, ctx->name) == 0) + ctx->count++; + + return 0; +} + static unsigned int number_of_same_symbols(char *func_name) { - unsigned int count; + struct sym_count_ctx ctx = { .count = 0, .name = func_name }; + + kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx); - count = 0; - kallsyms_on_each_match_symbol(count_symbols, func_name, &count); + module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx); - return count; + return ctx.count; } static int __trace_kprobe_create(int argc, const char *argv[]) -- 2.34.1
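[Editor's illustration] The two-pass counting the patch introduces — scan the core kallsyms table, then continue counting over module symbols with the same context — can be demonstrated in a small userspace sketch. The symbol tables and names below are made up for illustration; count_matches() stands in for both kallsyms_on_each_match_symbol() and module_kallsyms_on_each_symbol().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the vmlinux and module symbol tables (made-up data). */
static const char *core_syms[] = { "vfs_read", "do_fork", "cleanup" };
static const char *mod_syms[]  = { "my_mod_init", "cleanup", "cleanup" };

struct sym_count_ctx {
	unsigned int count;
	const char *name;
};

/* Stand-in for both kallsyms iterators: bump count on each exact match. */
static void count_matches(const char **syms, size_t n,
			  struct sym_count_ctx *ctx)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(syms[i], ctx->name) == 0)
			ctx->count++;
}

/* Mirrors the fixed number_of_same_symbols(): core pass, then modules. */
static unsigned int number_of_same_symbols(const char *func_name)
{
	struct sym_count_ctx ctx = { .count = 0, .name = func_name };

	count_matches(core_syms, sizeof(core_syms) / sizeof(core_syms[0]),
		      &ctx); /* kallsyms_on_each_match_symbol() */
	count_matches(mod_syms, sizeof(mod_syms) / sizeof(mod_syms[0]),
		      &ctx); /* module_kallsyms_on_each_symbol() */
	return ctx.count;
}
```

With only the first pass (the pre-fix behavior), a module-only symbol like "my_mod_init" would count as zero matches and the kprobe would be rejected; the second pass makes it count as one, and an ambiguous name like "cleanup" correctly counts all occurrences across core and modules.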