Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Mon, 1 Jul 2024 15:15:56 -0700
Andrii Nakryiko wrote:

> On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko wrote:
> >
> > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu wrote:
> > >
> > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > Andrii Nakryiko wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu wrote:
> > > > >
> > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > Andrii Nakryiko wrote:
> > > > >
> > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu wrote:
> > > > > > >
> > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > Andrii Nakryiko wrote:
> > > > > > >
> > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > > -			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > +			  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > > > >
> > > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > >
> > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > > interface isn't the most convenient.
> > > > >
> > > > > OK, I understand it.
> > > > >
> > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > waste. This is because we don't want to do anything that can fail in
> > > > > > the detachment logic (so no temporary array allocation there).
> > > > >
> > > > > No need to change it, that sounds reasonable.
> > > >
> > > > Great, thanks.
> > > >
> > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > >
> > > > > IMHO, maybe the interface function is better to change to
> > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > side uses a linked list of structure, index access will need to
> > > > > follow the list every time.
> > > >
> > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > are only allowed to ask for the next consumer, then
> > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > its own internal index and remember ith instance. Which again means
> > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > isn't great.
> > >
> > > No, I think we can use a cursor variable as;
> > >
> > > int uprobe_register_batch(struct inode *inode,
> > >			      uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > {
> > >	void *cur = ctx;
> > >
> > >	while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > >		...
> > >	}
> > >
> > >	cur = ctx;
> > >	while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > >		...
> > >	}
> > > }
> > >
> > > This can also remove the cnt.
> >
> > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > for callers, but we have one right now, and might have another one, so
> > not a big deal.
>
> Actually, now that I started implementing this, I really-really don't
> like it. In the example above you assume by storing and reusing
> original ctx value you will reset iteration to the very beginning.
> This is not how it works in practice though. Ctx is most probably a
> pointer to some struct somewhere with iteration state (e.g., array of
> all uprobes + current index), and so get_uprobe_consumer() doesn't
> update `void *ctx` itself, it updates the state of that struct.

Yeah, that should be noted so that if the get_uprobe_consumer() is
called with original `ctx` value, it should return the same.

Ah, and I found we need to pass both ctx and pos...

	while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
		...
	}

Usually it is enough to pass the cursor as similar to the other
for_each_* macros. For example, struct foo has .list and .uc, then

struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
{
	struct foo *foo = *pos;

	if (!foo)
		return NULL;

	*pos = list_next_entry(foo, list);
	if (list_is_head(pos, (head)))
		*pos = NULL;

	return foo->uc;
}

This works something like this.

#define for_each_uprobe_consumer_from_foo(uc, pos, head)
Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.
On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li wrote:

> @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
>  	if (error)
>  		return error;
>  
> +	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> +				attr->ia_valid, attr->ia_mode,
> +				from_kuid(&init_user_ns, attr->ia_uid),
> +				from_kgid(&init_user_ns, attr->ia_gid),
> +				inode->i_size, attr->ia_size);
> +

That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.

-- Steve
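As a rough sketch of the suggestion above (field choices and formatting are illustrative, not the actual patch; this would live in the trace header and assumes the usual TRACE_EVENT() infrastructure), the tracepoint could take just the inode, dentry, and attr, and unpack everything in TP_fast_assign(), which only runs when the tracepoint is enabled:

```
TRACE_EVENT(hugetlbfs_setattr,

	TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

	TP_ARGS(inode, dentry, attr),

	TP_STRUCT__entry(
		__field(unsigned int,	ia_valid)
		__field(umode_t,	ia_mode)
		__field(loff_t,		old_size)
		__field(loff_t,		ia_size)
		__string(name,		dentry->d_name.name)
	),

	/* runs only when the event is enabled, keeping the call site small */
	TP_fast_assign(
		__entry->ia_valid = attr->ia_valid;
		__entry->ia_mode  = attr->ia_mode;
		__entry->old_size = inode->i_size;
		__entry->ia_size  = attr->ia_size;
		__assign_str(name, dentry->d_name.name);
	),

	TP_printk("name=%s valid=0x%x mode=0%o size=%lld->%lld",
		  __get_str(name), __entry->ia_valid, __entry->ia_mode,
		  (long long)__entry->old_size, (long long)__entry->ia_size)
);
```

With this shape, the call site shrinks to `trace_hugetlbfs_setattr(inode, dentry, attr);`, which is the icache benefit Steve is pointing at.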
[PATCH] perf,x86: avoid missing caller address in stack traces captured in uprobe
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be the `push %rbp` instruction in the function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which at that point is still the caller's frame, containing the address of the caller's caller).

So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic: detect the `push %rbp` instruction being traced. Given that the entire kernel implementation of user space stack trace capturing works under the assumption that user space code was compiled with frame pointer register (%rbp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, the return address is still pointed to by %rsp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using frame pointer-based logic.
Signed-off-by: Andrii Nakryiko
---
 arch/x86/events/core.c  | 20 ++++++++++++++++++++
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		return;
 
 	pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+	/*
+	 * If we are called from uprobe handler, and we are indeed at the very
+	 * entry to user function (which is normally a `push %rbp` instruction,
+	 * under assumption of application being compiled with frame pointers),
+	 * we should read return address from *regs->sp before proceeding
+	 * to follow frame pointers, otherwise we'll skip immediate caller
+	 * as %rbp is not yet setup.
+	 */
+	if (current->utask) {
+		struct arch_uprobe *auprobe = current->utask->auprobe;
+		u64 ret_addr;
+
+		if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+		    !__get_user(ret_addr, (const u64 __user *)regs->sp))
+			perf_callchain_store(entry, ret_addr);
+	}
+#endif
+
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
 	struct uprobe		*active_uprobe;
 	unsigned long		xol_vaddr;
 
+	struct arch_uprobe	*auprobe;
+
 	struct return_instance	*return_instances;
 	unsigned int		depth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	bool need_prep = false; /* prepare return uprobe, when needed */
 
 	down_read(&uprobe->register_rwsem);
+	current->utask->auprobe = &uprobe->arch;
 	for (uc = uprobe->consumers; uc; uc = uc->next) {
 		int rc = 0;
 
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 		remove &= rc;
 	}
 
+	current->utask->auprobe = NULL;
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0
[PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
With all the batch uprobe APIs work we are now finally ready to reap the benefits. Switch uprobes_treelock from reader-writer spinlock to a much more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool, trig-uprobe-nop benchmark specifically, to see how uprobe total throughput scales with the number of competing threads (mapped to individual CPUs). Here are the results:

  # threads   BEFORE (mln/s)   AFTER (mln/s)
  ---------   --------------   -------------
          1            3.131           3.140
          2            3.394           3.601
          3            3.630           3.960
          4            3.317           3.551
          5            3.448           3.464
          6            3.345           3.283
          7            3.469           3.444
          8            3.182           3.258
          9            3.138           3.139
         10            2.999           3.212
         11            2.903           3.183
         12            2.802           3.027
         13            2.792           3.027
         14            2.695           3.086
         15            2.822           2.965
         16            2.679           2.939
         17            2.622           2.888
         18            2.628           2.914
         19            2.702           2.836
         20            2.561           2.837

One can see that the per-CPU RW semaphore-based implementation scales better with the number of CPUs (especially given that single-CPU throughput is basically the same). Note, scalability is still limited by register_rwsem; this will hopefully be addressed in follow-up patch set(s).
Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		bool destroy;
 
 		if (!tree_locked)
-			write_lock(&uprobes_treelock);
+			percpu_down_write(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
 		if (!tree_locked)
-			write_unlock(&uprobes_treelock);
+			percpu_up_write(&uprobes_treelock);
 
 		/*
 		 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
 	}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 		uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		}
 	}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	if (ret)
 		goto cleanup_uprobes;
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 @@ int uprobe_register_batch(struct
[PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
Switch internals of BPF multi-uprobes to the batched versions of uprobe registration and unregistration APIs. This also simplifies BPF clean-up code a bit thanks to the all-or-nothing guarantee of uprobe_register_batch().

Signed-off-by: Andrii Nakryiko
---
 kernel/trace/bpf_trace.c | 23 +++++++----------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
 	struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
-				  u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx)
 {
-	u32 i;
+	struct bpf_uprobe_multi_link *link = ctx;
 
-	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
-	}
+	return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link)
 	struct bpf_uprobe_multi_link *umulti_link;
 
 	umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-	bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt);
+	uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt,
+				umulti_link_get_uprobe_consumer, umulti_link);
 	if (umulti_link->task)
 		put_task_struct(umulti_link->task);
 	path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
 		      &bpf_uprobe_multi_link_lops, prog);
 
-	for (i = 0; i < cnt; i++) {
-		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
-		if (err) {
-			bpf_uprobe_unregister(&link->path, uprobes, i);
-			goto error_free;
-		}
-	}
+	err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+				    umulti_link_get_uprobe_consumer, link);
+	if (err)
+		goto error_free;
 
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-- 
2.43.0
[PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch
Similarly to what we did for uprobe_register_batch(), split uprobe_unregister_batch() into two separate phases with different locking needs. First, all the VMA unregistration is performed while holding a per-uprobe register_rwsem. Then, we take a batched uprobes_treelock once to __put_uprobe() for all uprobe_consumers. That uprobe_consumer->uprobe field is really handy in helping with this.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
-	struct uprobe *uprobe;
 	struct uprobe_consumer *uc;
+	struct uprobe *uprobe;
 	int i;
 
 	for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		down_write(&uprobe->register_rwsem);
 		__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
-		put_uprobe(uprobe);
+	}
 
+	write_lock(&uprobes_treelock);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (!uprobe)
+			continue;
+
+		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0
[PATCH v2 09/12] uprobes: batch uprobes_treelock during registration
Now that we have a good separation of each registration step, take uprobes_treelock just once for the relevant registration step, and then process all relevant uprobes in one go. Even if the writer lock introduces a relatively large delay (as might happen with a per-CPU RW semaphore), this will keep overall batch attachment reasonably fast.

We teach put_uprobe(), through the __put_uprobe() helper, to optionally take or not take uprobes_treelock, to accommodate this pattern.

With these changes we don't need the insert_uprobe() operation that unconditionally takes uprobes_treelock, so get rid of it, leaving only the lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 45 +++++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 128677ffe662..ced85284bbf4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
 	kfree(uprobe);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
 	s64 v;
 
@@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe)
 	if (unlikely((u32)v == 0)) {
 		bool destroy;
 
-		write_lock(&uprobes_treelock);
+		if (!tree_locked)
+			write_lock(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		destroy = atomic64_read(&uprobe->ref) == v;
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
-		write_unlock(&uprobes_treelock);
+		if (!tree_locked)
+			write_unlock(&uprobes_treelock);
 
 		/*
 		 * Beyond here we don't need RCU protection, we are either the
@@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe)
 	rcu_read_unlock_trace();
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+	__put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
 	       const struct uprobe *r)
@@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 	return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-	struct uprobe *u;
-
-	write_lock(&uprobes_treelock);
-	u = __insert_uprobe(uprobe);
-	write_unlock(&uprobes_treelock);
-
-	return u;
-}
-
 static void ref_ctr_mismatch_warn(struct uprobe *cur_uprobe,
 				  struct uprobe *uprobe)
 {
@@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uc->uprobe = uprobe;
 	}
 
+	ret = 0;
+	write_lock(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uprobe = uc->uprobe;
 
 		/* add to uprobes_tree, sorted on inode:offset */
-		cur_uprobe = insert_uprobe(uprobe);
+		cur_uprobe = __insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
 		if (cur_uprobe != uprobe) {
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-				put_uprobe(cur_uprobe);
+
+				__put_uprobe(cur_uprobe, true);
 				ret = -EINVAL;
-				goto cleanup_uprobes;
+				goto unlock_treelock;
 			}
 			kfree(uprobe);
 			uc->uprobe = cur_uprobe;
 		}
 	}
+unlock_treelock:
+	write_unlock(&uprobes_treelock);
+	if (ret)
+		goto cleanup_uprobes;
 
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
@@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
+	write_lock(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 
 		if (uc->uprobe)
-			put_uprobe(uc->uprobe);
+			__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 	return ret;
 }
-- 
2.43.0
[PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
Now we are ready to split the coupled alloc-and-insert step into two separate phases. First, we allocate and prepare all potentially-to-be-inserted uprobe instances, assuming the corresponding uprobes are not yet in uprobes_tree. This is needed so that we don't do memory allocations under uprobes_treelock (once we batch locking for each step). Second, we insert new uprobes or reuse already-existing ones in uprobes_tree. Any uprobe that turned out to be unnecessary is immediately freed, as there are no other references to it.

This concludes the preparations that make uprobe_register_batch() ready to batch and optimize locking per each phase.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0f928a72a658..128677ffe662 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		return -EINVAL;
 	}
 
+	/* pre-allocate new uprobe instances */
 	for (i = 0; i < cnt; i++) {
-		struct uprobe *cur_uprobe;
-
 		uc = get_uprobe_consumer(i, ctx);
 
 		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		RB_CLEAR_NODE(&uprobe->rb_node);
 		atomic64_set(&uprobe->ref, 1);
 
+		uc->uprobe = uprobe;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
 		/* add to uprobes_tree, sorted on inode:offset */
 		cur_uprobe = insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
@@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 				put_uprobe(cur_uprobe);
-				kfree(uprobe);
 				ret = -EINVAL;
 				goto cleanup_uprobes;
 			}
 			kfree(uprobe);
-			uprobe = cur_uprobe;
+			uc->uprobe = cur_uprobe;
 		}
-
-		uc->uprobe = uprobe;
 	}
 
 	for (i = 0; i < cnt; i++) {
-- 
2.43.0
[PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
To allow unbundling the currently tightly coupled alloc-uprobe-and-insert step, inline alloc_uprobe() logic into the uprobe_register_batch() loop. It's called from one place, so we don't really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 65 ++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 68fdf1b8e4bf..0f928a72a658 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 		(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-				   loff_t ref_ctr_offset)
-{
-	struct uprobe *uprobe, *cur_uprobe;
-
-	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-	if (!uprobe)
-		return ERR_PTR(-ENOMEM);
-
-	uprobe->inode = inode;
-	uprobe->offset = offset;
-	uprobe->ref_ctr_offset = ref_ctr_offset;
-	init_rwsem(&uprobe->register_rwsem);
-	init_rwsem(&uprobe->consumer_rwsem);
-	RB_CLEAR_NODE(&uprobe->rb_node);
-	atomic64_set(&uprobe->ref, 1);
-
-	/* add to uprobes_tree, sorted on inode:offset */
-	cur_uprobe = insert_uprobe(uprobe);
-	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe != uprobe) {
-		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-			put_uprobe(cur_uprobe);
-			kfree(uprobe);
-			return ERR_PTR(-EINVAL);
-		}
-		kfree(uprobe);
-		uprobe = cur_uprobe;
-	}
-
-	return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
@@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
 		uc = get_uprobe_consumer(i, ctx);
 
-		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-		if (IS_ERR(uprobe)) {
-			ret = PTR_ERR(uprobe);
+		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+		if (!uprobe) {
+			ret = -ENOMEM;
 			goto cleanup_uprobes;
 		}
 
+		uprobe->inode = inode;
+		uprobe->offset = uc->offset;
+		uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+		init_rwsem(&uprobe->register_rwsem);
+		init_rwsem(&uprobe->consumer_rwsem);
+		RB_CLEAR_NODE(&uprobe->rb_node);
+		atomic64_set(&uprobe->ref, 1);
+
+		/* add to uprobes_tree, sorted on inode:offset */
+		cur_uprobe = insert_uprobe(uprobe);
+		/* a uprobe exists for this inode:offset combination */
+		if (cur_uprobe != uprobe) {
+			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
+				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+				put_uprobe(cur_uprobe);
+				kfree(uprobe);
+				ret = -EINVAL;
+				goto cleanup_uprobes;
+			}
+			kfree(uprobe);
+			uprobe = cur_uprobe;
+		}
+
 		uc->uprobe = uprobe;
 	}
-- 
2.43.0
[PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs
Introduce batch versions of uprobe registration (attachment) and unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of uprobe_register_batch() is such that either all uprobe_consumers are successfully attached or none of them remain attached after the return. There is no guarantee of atomicity of attachment, though, and so while batch attachment is proceeding, some uprobes might start firing before others are completely attached. Even if overall attachment eventually fails, some successfully attached uprobes might fire and callers have to be prepared to handle that. This is in no way a regression compared to the current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of a `struct uprobe *uprobe` field to `struct uprobe_consumer` which is meant for internal uprobe subsystem usage only. We use this field both as temporary storage (to avoid unnecessary allocations) and as a back link to the associated uprobe to simplify and speed up uprobe unregistration, as we now can avoid yet another tree lookup when unregistering an uprobe_consumer.

The general direction with the uprobe registration implementation is to do batch attachment in distinct steps, each step performing some set of checks or actions on all uprobe_consumers before proceeding to the next phase. This, after some more changes in the next patches, allows us to batch locking for each phase and in such a way amortize any long delays that might be added by writer locks (especially once we switch uprobes_treelock to a per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first. It then proceeds to allocate-and-insert (we'll split this up further later on) uprobe instances, as necessary. And then the last step is actual uprobe registration for all affected VMAs.
We take care to undo all the actions in the event of an error at any point in this lengthy process, so end result is all-or-nothing, as described above. Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 17 kernel/events/uprobes.c | 180 2 files changed, 146 insertions(+), 51 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index a75ba37ce3c8..a6e6eb70539d 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -33,6 +33,8 @@ enum uprobe_filter_ctx { UPROBE_FILTER_MMAP, }; +typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx); + struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, @@ -48,6 +50,8 @@ struct uprobe_consumer { loff_t ref_ctr_offset; /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; + /* associated uprobe instance (or NULL if consumer isn't attached) */ + struct uprobe *uprobe; }; #ifdef CONFIG_UPROBES @@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); +extern int uprobe_register_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); +extern void uprobe_unregister_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ 
-160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { @@ -169,6 +182,10 @@ static inline void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } +static inline void uprobe_unregister_batch(struct inode *inode, int cnt, +uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ +} static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8759c6d0683e..68fdf1b8e4bf 100644
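The callback-based contract above avoids any extra allocation because callers embed `struct uprobe_consumer` in their own containing struct. A minimal userspace sketch of how such a caller would implement the idx-based accessor (the `my_uprobe`/`my_ctx` names are hypothetical, and the kernel types are reduced to simplified stand-ins):

```c
#include <stddef.h>

/* simplified stand-ins for the kernel-side types */
struct uprobe_consumer {
	long offset;
	long ref_ctr_offset;
};

typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);

/* hypothetical caller-side struct embedding the consumer, the way
 * BPF multi-uprobe links embed uprobe_consumer in their own state */
struct my_uprobe {
	const char *name;
	struct uprobe_consumer consumer;
};

struct my_ctx {
	struct my_uprobe *probes;
	size_t cnt;
};

/* idx-based accessor: no separate array of uprobe_consumer pointers
 * needs to be allocated or kept alive */
static struct uprobe_consumer *my_get_consumer(size_t idx, void *ctx)
{
	struct my_ctx *c = ctx;

	return idx < c->cnt ? &c->probes[idx].consumer : NULL;
}

/* mimics how uprobe_register_batch() makes multiple independent
 * passes over consumers 0..cnt-1, one pass per phase */
static int count_consumers(int cnt, uprobe_consumer_fn get, void *ctx)
{
	int i, n = 0;

	for (i = 0; i < cnt; i++)
		if (get(i, ctx))
			n++;
	return n;
}
```

Because the accessor is pure indexing, every registration phase can re-walk the same consumers without the batch API keeping any per-call state of its own.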
[PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
Simplify uprobe registration/unregistration interfaces by making offset and ref_ctr_offset part of uprobe_consumer "interface". In practice, all existing users already store these fields somewhere in uprobe_consumer's containing structure, so this doesn't pose any problem. We just move some fields around. On the other hand, this simplifies uprobe_register() and uprobe_unregister() API by having only struct uprobe_consumer as one thing representing attachment/detachment entity. This makes batched versions of uprobe_register() and uprobe_unregister() simpler. This also makes uprobe_register_refctr() unnecessary, so remove it and simplify consumers. No functional changes intended. Acked-by: Masami Hiramatsu (Google) Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 18 +++ kernel/events/uprobes.c | 19 ++- kernel/trace/bpf_trace.c | 21 +++- kernel/trace/trace_uprobe.c | 53 --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 5 files changed, 55 insertions(+), 78 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..a75ba37ce3c8 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -42,6 +42,11 @@ struct uprobe_consumer { enum uprobe_filter_ctx ctx, struct mm_struct *mm); + /* associated file offset of this probe */ + loff_t offset; + /* associated refctr file offset of this probe, or zero */ + loff_t ref_ctr_offset; + /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; }; @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc); +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ -152,11 +156,7 @@ static inline void uprobes_init(void) #define uprobe_get_trap_addr(regs) instruction_pointer(regs) static inline int -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - return -ENOSYS; -} -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo return -ENOSYS; } static inline void -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 560cf1ca512a..8759c6d0683e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) /* * uprobe_unregister - unregister an already registered probe. * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. + * @uc: identify which probe consumer to unregister. 
*/ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { struct uprobe *uprobe; - uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe(inode, uc->offset); if (WARN_ON(!uprobe)) return; @@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset, return ret; } -int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) +int uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { - return __uprobe_register(inode, offset, 0, uc); + return __uprobe_register(inode, uc->offset,
[PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()
Return -ENOMEM instead of NULL, which makes caller's error handling just a touch simpler. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index f87049c08ee9..23449a8c5e7e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); if (!uprobe) - return NULL; + return ERR_PTR(-ENOMEM); uprobe->inode = inode; uprobe->offset = offset; @@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset, retry: uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); - if (!uprobe) - return -ENOMEM; if (IS_ERR(uprobe)) return PTR_ERR(uprobe); -- 2.43.0
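The ERR_PTR()/IS_ERR() convention this patch switches to lets a pointer-returning function carry a precise errno instead of an information-free NULL. A userspace model of the idiom (the real helpers live in the kernel's `<linux/err.h>`; these re-implementations are simplified, and `alloc_obj()` is a hypothetical stand-in for alloc_uprobe()):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO	4095

/* encode a negative errno into a pointer value */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* recover the errno from an error pointer */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* error pointers occupy the top MAX_ERRNO values of the address space */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* model of alloc_uprobe() after the patch: never returns NULL,
 * so callers only need a single IS_ERR() check */
static void *alloc_obj(size_t size)
{
	void *p = size ? malloc(size) : NULL;	/* size == 0 simulates failure */

	return p ? p : ERR_PTR(-ENOMEM);
}
```

With this shape the caller's `if (!uprobe) return -ENOMEM;` branch becomes redundant, which is exactly the two lines the patch deletes.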
[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
Revamp how struct uprobe is refcounted, and thus how its lifetime is managed. Right now, there are a few possible "owners" of uprobe refcount: - uprobes_tree RB tree assumes one refcount when uprobe is registered and added to the lookup tree; - while uprobe is triggered and kernel is handling it in the breakpoint handler code, temporary refcount bump is done to keep uprobe from being freed; - if we have uretprobe requested on a given struct uprobe instance, we take another refcount to keep uprobe alive until user space code returns from the function and triggers return handler. The uprobes_tree's extra refcount of 1 is problematic and inconvenient. Because of it, we have extra retry logic in uprobe_register(), and we have extra logic in __uprobe_unregister(), which checks that uprobe has no more consumers, and if that's the case, it removes struct uprobe from uprobes_tree (through delete_uprobe(), which takes writer lock on uprobes_tree), decrementing refcount after that. The latter is the source of an unfortunate race with uprobe_register(), necessitating retries. All of the above is a complication that makes adding batched uprobe registration/unregistration APIs hard, and generally makes following the logic harder. This patch changes the refcounting scheme in such a way as to not have uprobes_tree keeping an extra refcount for struct uprobe. Instead, each uprobe_consumer assumes this extra refcount, which is dropped when the consumer is unregistered. Other than that, all the active users of uprobe (entry and return uprobe handling code) keep exactly the same refcounting approach. With the above setup, once uprobe's refcount drops to zero, we need to make sure that uprobe's "destructor" removes uprobe from uprobes_tree, of course. This, though, races with uprobe entry handling code in handle_swbp(), which, through the find_active_uprobe()->find_uprobe() lookup, can observe a uprobe being destroyed after its refcount drops to zero (e.g., due to uprobe_consumer unregistering). 
This is because find_active_uprobe() bumps refcount without knowing for sure that uprobe's refcount is already positive (and it has to be this way, there is no way around that setup). One way to solve this, attempted initially, is the atomic_inc_not_zero() approach, turning get_uprobe() into try_get_uprobe(), which can fail to bump refcount if uprobe is already destined to be destroyed. This, unfortunately, turns out to be rather expensive due to the underlying cmpxchg() operation in atomic_inc_not_zero() and scales rather poorly with an increased number of parallel threads triggering uprobes. So, we devise a refcounting scheme that doesn't require cmpxchg(), instead relying only on atomic additions, which scale better and are faster. While the solution has a bit of a trick to it, all the logic is nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and doesn't leak outside of those low-level helpers. We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic) in such a way that we support "resurrecting" uprobe by bumping its refcount from zero back to one, and pretending like it never dropped to zero in the first place. This is done in a race-free way under exclusive writer uprobes_treelock. Crucially, we take the lock only once refcount drops to zero. If we had to take the lock before decrementing refcount, the approach would be prohibitively expensive. Anyways, under exclusive writer lock, we double-check that refcount didn't change and is still zero. If it is, we proceed with destruction, because at that point we have a guarantee that find_active_uprobe() can't successfully look up this uprobe instance, as it's going to be removed in the destructor under the writer lock. 
If, on the other hand, find_active_uprobe() managed to bump refcount from zero to one in between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and write_lock(&uprobes_treelock), we'll deterministically detect this with an extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we pretend like atomic_dec_and_test() never returned true. There is no resource freeing or any other irreversible action taken up till this point, so we just exit early. One tricky part in the above is actually two CPUs racing and dropping refcnt to zero, and then attempting to free resources. This can happen as follows: - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock; - before CPU #0 grabs the lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at which point it decides that it needs to free uprobe as well. At this point both CPU #0 and CPU #1 will believe they need to destroy uprobe, which is obviously wrong. To prevent this situation, we augment refcount with an epoch counter, which is always incremented by 1 on either get or put operation. This allows those two CPUs above to disambiguate who should actually free uprobe (it's CPU #1, because it has the up-to-date epoch). See comments in the code and note the specific values of UPROBE_REFCNT_GET
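The epoch trick can be modeled in userspace with C11 atomics: one 64-bit word carries two 32-bit counters, and a single atomic add updates both at once. This is only an illustrative sketch; the kernel patch's actual UPROBE_REFCNT_* values, memory-ordering choices, and the high-bit clearing every ~2 billion operations are more involved, and the model assumes refcnt > 0 on every put:

```c
#include <stdatomic.h>
#include <stdint.h>

/* one 64-bit word holds two counters: epoch (high 32 bits) and
 * refcnt (low 32 bits). Both GET and PUT bump the epoch; only their
 * effect on refcnt differs. Illustrative constants only. */
#define REFCNT_GET	0x0000000100000001ULL	/* epoch += 1, refcnt += 1 */
#define REFCNT_PUT	0x00000000ffffffffULL	/* epoch += 1, refcnt -= 1 (low
						 * underflow carries into high) */

static inline uint32_t refcnt(uint64_t v) { return (uint32_t)v; }
static inline uint32_t epoch(uint64_t v)  { return (uint32_t)(v >> 32); }

/* returns the post-operation value of the combined word */
static inline uint64_t get(_Atomic uint64_t *r)
{
	return atomic_fetch_add(r, REFCNT_GET) + REFCNT_GET;
}

static inline uint64_t put(_Atomic uint64_t *r)
{
	return atomic_fetch_add(r, REFCNT_PUT) + REFCNT_PUT;
}
```

Two CPUs that both observe refcnt == 0 can compare epochs: only the one whose snapshot carries the newest epoch proceeds with destruction, which resolves the 0 -> 1 -> 0 race described above without any cmpxchg().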
[PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
It seems like uprobe_write_opcode() doesn't require a write-locked mmap_sem; any lock (reader or writer) should be sufficient. This was established in a discussion in [0] and looking through existing code seems to confirm that there is no need for write-locked mmap_sem. Fix the comment to state this clearly. [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/ Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()") Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 081821fd529a..f87049c08ee9 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm, * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * - * Called with mm->mmap_lock held for write. + * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. */ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, -- 2.43.0
[PATCH v2 01/12] uprobes: update outdated comment
There is no task_struct passed into get_user_pages_remote() anymore, so drop the parts of the comment mentioning a NULL tsk; they're just confusing at this point. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 99be2adedbc0..081821fd529a 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) goto out; /* -* The NULL 'tsk' here ensures that any faults that occur here -* will not be accounted to the task. 'mm' *is* current->mm, -* but we treat this as a 'remote' access since it is -* essentially a kernel access to the memory. +* 'mm' *is* current->mm, but we treat this as a 'remote' access since +* it is essentially a kernel access to the memory. */ result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL); if (result < 0) -- 2.43.0
[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
This patch set, ultimately, switches the global uprobes_treelock from an RW spinlock to a per-CPU RW semaphore, which has better performance and scales better under contention and multiple parallel threads triggering lots of uprobes. To make this work well with attaching multiple uprobes (through BPF multi-uprobe), we need to add batched versions of uprobe register/unregister APIs. This is what most of the patch set is actually doing. The actual switch to per-CPU RW semaphore is trivial after that and is done in the very last patch #12. See its commit message for some comparison numbers. Patch #4 is probably the most important patch in the series, revamping uprobe lifetime management and refcounting. See patch description and added code comments for all the details. With changes in patch #4, we open up the way to refactor uprobe_register() and uprobe_unregister() implementations in such a way that we can avoid taking uprobes_treelock many times during a single batched attachment/detachment. This allows accommodating the much higher latency of taking a per-CPU RW semaphore for write. The end result of this patch set is that attaching 50 thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms both before and after the changes in this patch set. Patch #5 updates existing uprobe consumers to put all the relevant necessary pieces into struct uprobe_consumer, without having to pass around offset/ref_ctr_offset. Existing consumers already keep this data around, we just formalize the interface. Patches #6 through #10 add batched versions of register/unregister APIs and gradually factor them in such a way as to allow taking a single (batched) uprobes_treelock, splitting the logic into multiple independent phases. Patch #11 switches BPF multi-uprobes to batched uprobe APIs. As mentioned, a very straightforward patch #12 takes advantage of all the prep work and just switches uprobes_treelock to per-CPU RW semaphore. 
v1->v2: - added RCU-delayed uprobe freeing to put_uprobe() (Masami); - fixed clean up handling in uprobe_register_batch (Jiri); - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg); - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid comment instead (Oleg). Andrii Nakryiko (12): uprobes: update outdated comment uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() uprobes: simplify error handling for alloc_uprobe() uprobes: revamp uprobe refcounting and lifetime management uprobes: move offset and ref_ctr_offset into uprobe_consumer uprobes: add batch uprobe register/unregister APIs uprobes: inline alloc_uprobe() logic into __uprobe_register() uprobes: split uprobe allocation and uprobes_tree insertion steps uprobes: batch uprobes_treelock during registration uprobes: improve lock batching for uprobe_unregister_batch uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes uprobes: switch uprobes_treelock to per-CPU RW semaphore include/linux/uprobes.h | 29 +- kernel/events/uprobes.c | 550 -- kernel/trace/bpf_trace.c | 40 +- kernel/trace/trace_uprobe.c | 53 +- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 +- 5 files changed, 447 insertions(+), 247 deletions(-) -- 2.43.0
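The core payoff of patches #9 and #10, one writer-lock acquisition per batch instead of one per uprobe, can be illustrated with an instrumented toy lock (all names here are hypothetical; the kernel code operates on uprobes_tree under uprobes_treelock, and the cost being amortized is the expensive writer side of a per-CPU RW semaphore):

```c
#include <stddef.h>

#define MAX_ITEMS 64

/* toy stand-in for uprobes_tree, with an instrumented "lock" so the
 * number of writer acquisitions is observable */
struct registry {
	int write_locks_taken;	/* instrumentation, not a real lock */
	long items[MAX_ITEMS];
	size_t cnt;
};

static void write_lock(struct registry *r)   { r->write_locks_taken++; }
static void write_unlock(struct registry *r) { (void)r; }

/* one-by-one: one writer-lock round-trip per item */
static void insert_one(struct registry *r, long item)
{
	write_lock(r);
	r->items[r->cnt++] = item;
	write_unlock(r);
}

/* batched: a single writer-lock round-trip covers the whole batch,
 * amortizing writer acquisition cost across N insertions */
static void insert_batch(struct registry *r, const long *items, size_t cnt)
{
	size_t i;

	write_lock(r);
	for (i = 0; i < cnt; i++)
		r->items[r->cnt++] = items[i];
	write_unlock(r);
}
```

With a cheap RW spinlock the difference is modest, but once the writer path has to synchronize with every CPU's reader state (patch #12), paying it once per batch rather than 50,000 times is what keeps attachment time flat.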
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko wrote: > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu wrote: > > > > On Fri, 28 Jun 2024 09:34:26 -0700 > > Andrii Nakryiko wrote: > > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu > > > wrote: > > > > > > > > On Thu, 27 Jun 2024 09:47:10 -0700 > > > > Andrii Nakryiko wrote: > > > > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > > > > wrote: > > > > > > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > > > > > > - loff_t ref_ctr_offset, struct > > > > > > > uprobe_consumer *uc) > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > > > > + uprobe_consumer_fn get_uprobe_consumer, > > > > > > > void *ctx) > > > > > > > > > > > > Is this interface just for avoiding memory allocation? Can't we just > > > > > > allocate a temporary array of *uprobe_consumer instead? > > > > > > > > > > Yes, exactly, to avoid the need for allocating another array that > > > > > would just contain pointers to uprobe_consumer. Consumers would never > > > > > just have an array of `struct uprobe_consumer *`, because > > > > > uprobe_consumer struct is embedded in some other struct, so the array > > > > > interface isn't the most convenient. > > > > > > > > OK, I understand it. > > > > > > > > > > > > > > If you feel strongly, I can do an array, but this necessitates > > > > > allocating an extra array *and keeping it* for the entire duration of > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a > > > > > waste. This is because we don't want to do anything that can fail in > > > > > the detachment logic (so no temporary array allocation there). > > > > > > > > No need to change it, that sounds reasonable. > > > > > > > > > > Great, thanks. > > > > > > > > > > > > > Anyways, let me know how you feel about keeping this callback. 
> > > > > > > > IMHO, maybe the interface function is better to change to > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller > > > > side uses a linked list of structure, index access will need to > > > > follow the list every time. > > > > > > This would be problematic. Note how we call get_uprobe_consumer(i, > > > ctx) with i going from 0 to N in multiple independent loops. So if we > > > are only allowed to ask for the next consumer, then > > > uprobe_register_batch and uprobe_unregister_batch would need to build > > > their own internal index and remember the ith instance. Which again means > > > more allocations and possibly failing uprobe_unregister_batch(), which > > > isn't great. > > > > No, I think we can use a cursor variable as; > > > > int uprobe_register_batch(struct inode *inode, > > uprobe_consumer_fn get_uprobe_consumer, void *ctx) > > { > > void *cur = ctx; > > > > while ((uc = get_uprobe_consumer(&cur)) != NULL) { > > ... > > } > > > > cur = ctx; > > while ((uc = get_uprobe_consumer(&cur)) != NULL) { > > ... > > } > > } > > > > This can also remove the cnt. > > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use > for callers, but we have one right now, and might have another one, so > not a big deal. > Actually, now that I started implementing this, I really-really don't like it. In the example above you assume that by storing and reusing the original ctx value you will reset iteration to the very beginning. This is not how it works in practice though. Ctx is most probably a pointer to some struct somewhere with iteration state (e.g., array of all uprobes + current index), and so get_uprobe_consumer() doesn't update `void *ctx` itself, it updates the state of that struct. And so there is no easy and clean way to reset this iterator without adding another callback or something. At which point it becomes quite cumbersome and convoluted. How about this? 
I'll keep the existing get_uprobe_consumer(idx, ctx) contract, which works for the only user right now, BPF multi-uprobes. When it's time to add another consumer that works with a linked list, we can add another more complicated contract that would do iterator-style callbacks. This would be used by linked list users, and we can transparently implement the existing uprobe_register_batch() contract on top of it by implementing a trivial iterator wrapper on top of the get_uprobe_consumer(idx, ctx) approach. Let's not add unnecessary complications right now given we have a clear path forward to add it later, if necessary, without breaking anything. I'll send v2 without changes to get_uprobe_consumer() for now, hopefully my above plan makes sense to you. Thanks! > > > > Thank you, > > > > > > > > For now this API works well, I propose to keep it as is. For linked > > > list case consumers would need to allocate one extra array or
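The "trivial iterator wrapper" mentioned here can be sketched: a small cursor struct adapts the idx-based get_uprobe_consumer(idx, ctx) contract to a next()-style one, with an explicit reset instead of the ambiguous "reuse the original ctx" trick. All names below are hypothetical, and the kernel types are simplified stand-ins:

```c
#include <stddef.h>

struct uprobe_consumer { int dummy; };	/* simplified stand-in */

typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);

/* hypothetical iterator adapting the idx-based contract to a
 * next()-style one; resetting is explicit, so no assumptions about
 * what ctx points at are needed */
struct consumer_iter {
	uprobe_consumer_fn get;
	void *ctx;
	size_t idx;
};

static struct uprobe_consumer *consumer_next(struct consumer_iter *it)
{
	return it->get(it->idx++, it->ctx);
}

static void consumer_iter_reset(struct consumer_iter *it)
{
	it->idx = 0;
}

/* sample idx-based accessor over a plain array, for demonstration */
struct array_ctx {
	struct uprobe_consumer *arr;
	size_t cnt;
};

static struct uprobe_consumer *array_get(size_t idx, void *ctx)
{
	struct array_ctx *a = ctx;

	return idx < a->cnt ? &a->arr[idx] : NULL;
}
```

A future linked-list user would supply its own iterator state instead of `array_ctx`, while array-backed users like BPF multi-uprobes keep the cheap random-access path.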
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko wrote: > > On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu wrote: > > > > On Mon, 24 Jun 2024 17:21:36 -0700 > > Andrii Nakryiko wrote: > > > > > Anyways, under exclusive writer lock, we double-check that refcount > > > didn't change and is still zero. If it is, we proceed with destruction, > > > because at that point we have a guarantee that find_active_uprobe() > > > can't successfully look up this uprobe instance, as it's going to be > > > removed in destructor under writer lock. If, on the other hand, > > > find_active_uprobe() managed to bump refcount from zero to one in > > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and > > > write_lock(&uprobes_treelock), we'll deterministically detect this with > > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we > > > pretend like atomic_dec_and_test() never returned true. There is no > > > resource freeing or any other irreversible action taken up till this > > > point, so we just exit early. > > > > > > One tricky part in the above is actually two CPUs racing and dropping > > > refcnt to zero, and then attempting to free resources. This can happen > > > as follows: > > > - CPU #0 drops refcnt from 1 to 0, and proceeds to grab > > > uprobes_treelock; > > > - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at > > > which point it decides that it needs to free uprobe as well. > > > > > > At this point both CPU #0 and CPU #1 will believe they need to destroy > > > uprobe, which is obviously wrong. To prevent this situations, we augment > > > refcount with epoch counter, which is always incremented by 1 on either > > > get or put operation. This allows those two CPUs above to disambiguate > > > who should actually free uprobe (it's the CPU #1, because it has > > > up-to-date epoch). See comments in the code and note the specific values > > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. 
Keep in mind that > > > a single atomic64_t is actually two sort-of-independent 32-bit counters > > > that are incremented/decremented with a single atomic_add_and_return() > > > operation. Note also a small and extremely rare (and thus having no > > > effect on performance) need to clear the highest bit every 2 billion > > > get/put operations to prevent high 32-bit counter from "bleeding over" > > > into lower 32-bit counter. > > > > I have a question here. > > Is there any chance for CPU#1 to put the uprobe before CPU#0 gets > > the uprobes_treelock, and free uprobe before CPU#0 validates uprobe->ref > > again? e.g. > > > > CPU#0 CPU#1 > > > > put_uprobe() { > > atomic64_add_return() > > __get_uprobe(); > > put_uprobe() { > > > > kfree(uprobe) > > } > > write_lock(&uprobes_treelock); > > atomic64_read(&uprobe->ref); > > } > > > > I think it is a very rare case, but I could not find any code to prevent > > this scenario. > > > > Yes, I think you are right, great catch! > > I concentrated on preventing double kfree() in this situation, and > somehow convinced myself that eager kfree() is fine. But I think I'll > need to delay freeing, probably with RCU. The problem is that we can't > use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has > to be a sleepable variant of RCU. I'm thinking of using > rcu_read_lock_trace(), the same variant of RCU we use for sleepable > BPF programs (including sleepable uprobes). srcu might be too heavy > for this. > > I'll try a few variants over the next few days and see how the > performance looks. > So I think I'm going with the changes below, incorporated into this patch (nothing else changes). __get_uprobe() doesn't need any added RCU protection (we know that uprobe is alive). It's only put_uprobe() that needs to guarantee RCU protection before we drop refcount all the way until we know whether we are the winning destructor or not. 
The good thing is that the changes are pretty minimal in code and also don't seem to regress performance/scalability. So I'm pretty happy about that, will send v2 soon. diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 07ad8b2e7508..41d9e37633ca 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -56,6 +56,7 @@ struct uprobe { atomic64_t ref; /* see UPROBE_REFCNT_GET below */ struct rw_semaphore register_rwsem; struct rw_semaphore consumer_rwsem; + struct rcu_head rcu; struct list_head pending_list; struct uprobe_consumer *consumers; struct inode *inode; /* Also hold a ref to inode */ @@ -623,7 +624,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu wrote: > > On Fri, 28 Jun 2024 09:34:26 -0700 > Andrii Nakryiko wrote: > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu > > wrote: > > > > > > On Thu, 27 Jun 2024 09:47:10 -0700 > > > Andrii Nakryiko wrote: > > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu > > > > wrote: > > > > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700 > > > > > Andrii Nakryiko wrote: > > > > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset, > > > > > > - loff_t ref_ctr_offset, struct > > > > > > uprobe_consumer *uc) > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt, > > > > > > + uprobe_consumer_fn get_uprobe_consumer, > > > > > > void *ctx) > > > > > > > > > > Is this interface just for avoiding memory allocation? Can't we just > > > > > allocate a temporary array of *uprobe_consumer instead? > > > > > > > > Yes, exactly, to avoid the need for allocating another array that > > > > would just contain pointers to uprobe_consumer. Consumers would never > > > > just have an array of `struct uprobe_consumer *`, because > > > > uprobe_consumer struct is embedded in some other struct, so the array > > > > interface isn't the most convenient. > > > > > > OK, I understand it. > > > > > > > > > > > If you feel strongly, I can do an array, but this necessitates > > > > allocating an extra array *and keeping it* for the entire duration of > > > > BPF multi-uprobe link (attachment) existence, so it feels like a > > > > waste. This is because we don't want to do anything that can fail in > > > > the detachment logic (so no temporary array allocation there). > > > > > > No need to change it, that sounds reasonable. > > > > > > > Great, thanks. > > > > > > > > > > Anyways, let me know how you feel about keeping this callback. > > > > > > IMHO, maybe the interface function is better to change to > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. 
If caller > > > side uses a linked list of structure, index access will need to > > > follow the list every time. > > > > This would be problematic. Note how we call get_uprobe_consumer(i, > > ctx) with i going from 0 to N in multiple independent loops. So if we > > are only allowed to ask for the next consumer, then > > uprobe_register_batch and uprobe_unregister_batch would need to build > > their own internal index and remember the ith instance. Which again means > > more allocations and possibly failing uprobe_unregister_batch(), which > > isn't great. > > No, I think we can use a cursor variable as; > > int uprobe_register_batch(struct inode *inode, > uprobe_consumer_fn get_uprobe_consumer, void *ctx) > { > void *cur = ctx; > > while ((uc = get_uprobe_consumer(&cur)) != NULL) { > ... > } > > cur = ctx; > while ((uc = get_uprobe_consumer(&cur)) != NULL) { > ... > } > } > > This can also remove the cnt. Ok, if you prefer this I'll switch. It's a bit more cumbersome to use for callers, but we have one right now, and might have another one, so not a big deal. > > Thank you, > > > > For now this API works well, I propose to keep it as is. For linked > > list case consumers would need to allocate one extra array or pay the > > price of O(N) search (which might be ok, depending on how many uprobes > > are being attached). But we don't have such consumers right now, > > thankfully. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) > > > > > > > > > -- > > > Masami Hiramatsu (Google) > > > -- > Masami Hiramatsu (Google)