Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 6:01 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Jul 2024 15:15:56 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
> >  wrote:
> > >
> > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > > Andrii Nakryiko  wrote:
> > > > > > > >
> > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t 
> > > > > > > > > offset,
> > > > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > > > uprobe_consumer *uc)
> > > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > > +   uprobe_consumer_fn 
> > > > > > > > > get_uprobe_consumer, void *ctx)
> > > > > > > >
> > > > > > > > Is this interface just for avoiding memory allocation? Can't we 
> > > > > > > > just
> > > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > > >
> > > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > > would just contain pointers to uprobe_consumer. Consumers would 
> > > > > > > never
> > > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > > uprobe_consumer struct is embedded in some other struct, so the 
> > > > > > > array
> > > > > > > interface isn't the most convenient.
> > > > > >
> > > > > > OK, I understand it.
> > > > > >
> > > > > > >
> > > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > > allocating an extra array *and keeping it* for the entire 
> > > > > > > duration of
> > > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > > waste. This is because we don't want to do anything that can fail 
> > > > > > > in
> > > > > > > the detachment logic (so no temporary array allocation there).
> > > > > >
> > > > > > No need to change it, that sounds reasonable.
> > > > > >
> > > > >
> > > > > Great, thanks.
> > > > >
> > > > > > >
> > > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > > >
> > > > > > IMHO, maybe the interface function is better to change to
> > > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > > side uses a linked list of structure, index access will need to
> > > > > > follow the list every time.
> > > > >
> > > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > > are only allowed to ask for the next consumer, then
> > > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > > its own internal index and remember ith instance. Which again means
> > > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > > isn't great.
> > > >
> > > > No, I think we can use a cursor variable as;
> > > >
> > > > int uprobe_register_batch(struct inode *inode,
> > > >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > {
> > > > void *cur = ctx;
> > > >
> > > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > > ...
> > > > }
> > > >
> > > > cur = ctx;
> > > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > > ...
> > > > }
> > > > }
> > > >
> > > > This can also remove the cnt.
> > >
> > > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > > for callers, but we have one right now, and might have another one, so
> > > not a big deal.
> > >
> >
> > Actually, now that I started implementing this, I really-really don't
> > like it. In the example above you assume by storing and reusing
> > original ctx value you will reset iteration to the very beginning.
> > This is not how it works in practice though. Ctx is most probably a
> > pointer to some struct somewhere with iteration state (e.g., array of
> > all uprobes + current index), and so get_uprobe_consumer() doesn't
> > update `void *ctx` itself, it updates the state of that struct.
>
> Yeah, that should be noted, so that if get_uprobe_consumer() is
> called with the original `ctx` value, it returns the same sequence.
> Ah, and I found we need to pass both ctx and pos...
>
>    while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
>  ...
>  }
>
> Usually it is enough to pass the cursor, similar to the other
> for_each_* macros. For example, struct foo has .list and .uc, then
>
> struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
> {
>   

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Google
On Mon, 1 Jul 2024 15:15:56 -0700
Andrii Nakryiko  wrote:

> On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
>  wrote:
> >
> > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  
> > wrote:
> > >
> > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > Andrii Nakryiko  wrote:
> > > > > > >
> > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t 
> > > > > > > > offset,
> > > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > > uprobe_consumer *uc)
> > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > > > void *ctx)
> > > > > > >
> > > > > > > Is this interface just for avoiding memory allocation? Can't we 
> > > > > > > just
> > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > >
> > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > would just contain pointers to uprobe_consumer. Consumers would 
> > > > > > never
> > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > uprobe_consumer struct is embedded in some other struct, so the 
> > > > > > array
> > > > > > interface isn't the most convenient.
> > > > >
> > > > > OK, I understand it.
> > > > >
> > > > > >
> > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > allocating an extra array *and keeping it* for the entire duration 
> > > > > > of
> > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > waste. This is because we don't want to do anything that can fail in
> > > > > > the detachment logic (so no temporary array allocation there).
> > > > >
> > > > > No need to change it, that sounds reasonable.
> > > > >
> > > >
> > > > Great, thanks.
> > > >
> > > > > >
> > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > >
> > > > > IMHO, maybe the interface function is better to change to
> > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > side uses a linked list of structure, index access will need to
> > > > > follow the list every time.
> > > >
> > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > are only allowed to ask for the next consumer, then
> > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > its own internal index and remember ith instance. Which again means
> > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > isn't great.
> > >
> > > No, I think we can use a cursor variable as;
> > >
> > > int uprobe_register_batch(struct inode *inode,
> > >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > {
> > > void *cur = ctx;
> > >
> > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > ...
> > > }
> > >
> > > cur = ctx;
> > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > ...
> > > }
> > > }
> > >
> > > This can also remove the cnt.
> >
> > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > for callers, but we have one right now, and might have another one, so
> > not a big deal.
> >
> 
> Actually, now that I started implementing this, I really-really don't
> like it. In the example above you assume by storing and reusing
> original ctx value you will reset iteration to the very beginning.
> This is not how it works in practice though. Ctx is most probably a
> pointer to some struct somewhere with iteration state (e.g., array of
> all uprobes + current index), and so get_uprobe_consumer() doesn't
> update `void *ctx` itself, it updates the state of that struct.

Yeah, that should be noted, so that if get_uprobe_consumer() is
called with the original `ctx` value, it returns the same sequence.
Ah, and I found we need to pass both ctx and pos...

   while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
 ...
 }

Usually it is enough to pass the cursor, similar to the other
for_each_* macros. For example, struct foo has .list and .uc, then

struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
{
struct foo *foo = *pos;

if (!foo)
return NULL;

*pos = list_next_entry(foo, list);
if (list_is_head(pos, (head)))
*pos = NULL;

	return &foo->uc;
}

This works something like this.

#define for_each_uprobe_consumer_from_foo(uc, pos, head) 

Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-01 Thread Steven Rostedt
On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:

> @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
>   if (error)
>   return error;
>  
> + trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> + attr->ia_valid, attr->ia_mode,
> +	from_kuid(&init_user_ns, attr->ia_uid),
> +	from_kgid(&init_user_ns, attr->ia_gid),
> + inode->i_size, attr->ia_size);
> +

That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.

-- Steve
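
For illustration, a minimal sketch of the suggested approach, assuming a
hypothetical hugetlbfs_setattr TRACE_EVENT definition (field selection and
names here are illustrative, not the actual patch): the call site passes only
inode/dentry/attr and the field extraction moves into TP_fast_assign().

TRACE_EVENT(hugetlbfs_setattr,

	TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

	TP_ARGS(inode, dentry, attr),

	TP_STRUCT__entry(
		__field(ino_t,		ino)
		__field(unsigned int,	d_len)
		__string(d_name,	dentry->d_name.name)
		__field(unsigned int,	ia_valid)
		__field(umode_t,	ia_mode)
		__field(uid_t,		ia_uid)
		__field(gid_t,		ia_gid)
		__field(loff_t,		old_size)
		__field(loff_t,		ia_size)
	),

	TP_fast_assign(
		/* extraction happens here, off the call site's code path */
		__entry->ino		= inode->i_ino;
		__entry->d_len		= dentry->d_name.len;
		__assign_str(d_name, dentry->d_name.name);
		__entry->ia_valid	= attr->ia_valid;
		__entry->ia_mode	= attr->ia_mode;
		__entry->ia_uid		= from_kuid(&init_user_ns, attr->ia_uid);
		__entry->ia_gid		= from_kgid(&init_user_ns, attr->ia_gid);
		__entry->old_size	= inode->i_size;
		__entry->ia_size	= attr->ia_size;
	),

	TP_printk("ino=%lu name=%s valid=%#x mode=%o uid=%u gid=%u old_size=%lld size=%lld",
		  (unsigned long)__entry->ino, __get_str(d_name),
		  __entry->ia_valid, __entry->ia_mode, __entry->ia_uid,
		  __entry->ia_gid, __entry->old_size, __entry->ia_size)
);

The hugetlbfs_setattr() call site would then shrink to a single
trace_hugetlbfs_setattr(inode, dentry, attr) call.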



[PATCH] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-07-01 Thread Andrii Nakryiko
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function preamble, which means that within that function the frame pointer
hasn't been established yet. This leads to consistently missing the actual
caller of the traced function, because perf_callchain_user() only records
the current IP (capturing the traced function) and then follows the frame
pointer chain (which at this point is still the caller's frame, containing
the address of the caller's caller).

So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic to detect when the traced
instruction is `push %rbp`. Given that the entire kernel implementation of
user space stack trace capturing works under the assumption that user space
code was compiled with frame pointer register (%rbp) preservation, it seems
pretty reasonable to use this instruction as a strong indicator that this is
the entry to the function. In that case, the return address is still pointed
to by %rsp, so we fetch it and add it to the stack trace before proceeding
to unwind the rest using the frame pointer-based logic.

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/events/core.c  | 20 
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *regs
return;
 
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+   /*
+* If we are called from uprobe handler, and we are indeed at the very
+* entry to user function (which is normally a `push %rbp` instruction,
+* under assumption of application being compiled with frame pointers),
+* we should read return address from *regs->sp before proceeding
+* to follow frame pointers, otherwise we'll skip immediate caller
+* as %rbp is not yet setup.
+*/
+   if (current->utask) {
+   struct arch_uprobe *auprobe = current->utask->auprobe;
+   u64 ret_addr;
+
+   if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+   !__get_user(ret_addr, (const u64 __user *)regs->sp))
+   perf_callchain_store(entry, ret_addr);
+   }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
 
+   struct arch_uprobe  *auprobe;
+
struct return_instance  *return_instances;
unsigned intdepth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
 
 	down_read(&uprobe->register_rwsem);
+	current->utask->auprobe = &uprobe->arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
 
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
 
remove &= rc;
}
+   current->utask->auprobe = NULL;
 
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0




[PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
With all the batch uprobe API work done, we are now finally ready to reap the
benefits. Switch uprobes_treelock from a reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)AFTER (mln/s)
  -   ---
  1   3.131 3.140
  2   3.394 3.601
  3   3.630 3.960
  4   3.317 3.551
  5   3.448 3.464
  6   3.345 3.283
  7   3.469 3.444
  8   3.182 3.258
  9   3.138 3.139
  10  2.999 3.212
  11  2.903 3.183
  12  2.802 3.027
  13  2.792 3.027
  14  2.695 3.086
  15  2.822 2.965
  16  2.679 2.939
  17  2.622 2.888
  18  2.628 2.914
  19  2.702 2.836
  20  2.561 2.837

One can see that the per-CPU RW semaphore-based implementation scales better
with the number of CPUs (especially given that single-CPU throughput is
basically the same).

Note, scalability is still limited by register_rwsem and this will
hopefully be addressed in follow-up patch set(s).

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);  /* serialize rbtree access */
 
 #define UPROBES_HASH_SZ13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
bool destroy;
 
if (!tree_locked)
-			write_lock(&uprobes_treelock);
+			percpu_down_write(&uprobes_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
 		if (!tree_locked)
-			write_unlock(&uprobes_treelock);
+			percpu_up_write(&uprobes_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
loff_t offset)
 {
struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
__put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
if (ret)
goto cleanup_uprobes;
 
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 @@ int uprobe_register_batch(struct 

[PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes

2024-07-01 Thread Andrii Nakryiko
Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies the BPF cleanup code a bit thanks to the all-or-nothing
guarantee of uprobe_register_batch().

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/bpf_trace.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe 
*uprobes,
- u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, 
void *ctx)
 {
-   u32 i;
+   struct bpf_uprobe_multi_link *link = ctx;
 
-   for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
-	}
+	return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link 
*link)
struct bpf_uprobe_multi_link *umulti_link;
 
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-	bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt);
+	uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt,
+				umulti_link_get_uprobe_consumer, umulti_link);
if (umulti_link->task)
put_task_struct(umulti_link->task);
 	path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
 	bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
 		      &bpf_uprobe_multi_link_lops, prog);
 
-   for (i = 0; i < cnt; i++) {
-		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
-		if (err) {
-			bpf_uprobe_unregister(&path, uprobes, i);
-   goto error_free;
-   }
-   }
+   err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+   umulti_link_get_uprobe_consumer, link);
+   if (err)
+   goto error_free;
 
 	err = bpf_link_prime(&link->link, &link_primer);
if (err)
-- 
2.43.0




[PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch

2024-07-01 Thread Andrii Nakryiko
Similarly to what we did for uprobe_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take a batched uprobes_treelock once to __put_uprobe() for all
uprobe_consumers. That uprobe_consumer->uprobe field is really handy in
helping with this.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
 {
-   struct uprobe *uprobe;
struct uprobe_consumer *uc;
+   struct uprobe *uprobe;
int i;
 
for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
 		down_write(&uprobe->register_rwsem);
 		__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
-   put_uprobe(uprobe);
+   }
 
+	write_lock(&uprobes_treelock);
+   for (i = 0; i < cnt; i++) {
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
+   if (!uprobe)
+   continue;
+
+   __put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
+	write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0




[PATCH v2 09/12] uprobes: batch uprobes_treelock during registration

2024-07-01 Thread Andrii Nakryiko
Now that we have a good separation of each registration step, take
uprobes_treelock just once per relevant registration step, and then
process all relevant uprobes in one go.

Even if writer lock introduces a relatively large delay (as might happen
with per-CPU RW semaphore), this will keep overall batch attachment
reasonably fast.

We teach put_uprobe(), through the __put_uprobe() helper, to optionally take
uprobes_treelock or not, to accommodate this pattern.

With these changes we don't need insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 45 +
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 128677ffe662..ced85284bbf4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
kfree(uprobe);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
s64 v;
 
@@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe)
if (unlikely((u32)v == 0)) {
bool destroy;
 
-		write_lock(&uprobes_treelock);
+		if (!tree_locked)
+			write_lock(&uprobes_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		destroy = atomic64_read(&uprobe->ref) == v;
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
-		write_unlock(&uprobes_treelock);
+		if (!tree_locked)
+			write_unlock(&uprobes_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe)
rcu_read_unlock_trace();
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+   __put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
   const struct uprobe *r)
@@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe 
*uprobe)
return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-   struct uprobe *u;
-
-	write_lock(&uprobes_treelock);
-	u = __insert_uprobe(uprobe);
-	write_unlock(&uprobes_treelock);
-
-   return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uc->uprobe = uprobe;
}
 
+   ret = 0;
+	write_lock(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uprobe = uc->uprobe;
 
/* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
+   cur_uprobe = __insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
+
+   __put_uprobe(cur_uprobe, true);
ret = -EINVAL;
-   goto cleanup_uprobes;
+   goto unlock_treelock;
}
kfree(uprobe);
uc->uprobe = cur_uprobe;
}
}
+unlock_treelock:
+	write_unlock(&uprobes_treelock);
+   if (ret)
+   goto cleanup_uprobes;
 
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
@@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
+	write_lock(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
if (uc->uprobe)
-   put_uprobe(uc->uprobe);
+   __put_uprobe(uc->uprobe, true);
uc->uprobe = NULL;
}
+	write_unlock(&uprobes_treelock);
return ret;
 }
 
-- 
2.43.0




[PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps

2024-07-01 Thread Andrii Nakryiko
Now we are ready to split the coupled alloc-and-insert step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes the preparations that make uprobe_register_batch() ready to
batch and optimize locking for each phase.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0f928a72a658..128677ffe662 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
return -EINVAL;
}
 
+   /* pre-allocate new uprobe instances */
for (i = 0; i < cnt; i++) {
-   struct uprobe *cur_uprobe;
-
uc = get_uprobe_consumer(i, ctx);
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
+		RB_CLEAR_NODE(&uprobe->rb_node);
+		atomic64_set(&uprobe->ref, 1);
 
+   uc->uprobe = uprobe;
+   }
+
+   for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
@@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
-   kfree(uprobe);
ret = -EINVAL;
goto cleanup_uprobes;
}
kfree(uprobe);
-   uprobe = cur_uprobe;
+   uc->uprobe = cur_uprobe;
}
-
-   uc->uprobe = uprobe;
}
 
for (i = 0; i < cnt; i++) {
-- 
2.43.0




[PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()

2024-07-01 Thread Andrii Nakryiko
To allow unbundling the alloc-uprobe-and-insert step, which is currently
tightly coupled, inline the alloc_uprobe() logic into the
uprobe_register_batch() loop. It's called from one place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 65 ++---
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 68fdf1b8e4bf..0f928a72a658 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct 
uprobe *uprobe)
(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-  loff_t ref_ctr_offset)
-{
-   struct uprobe *uprobe, *cur_uprobe;
-
-   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-   if (!uprobe)
-   return ERR_PTR(-ENOMEM);
-
-   uprobe->inode = inode;
-   uprobe->offset = offset;
-   uprobe->ref_ctr_offset = ref_ctr_offset;
-	init_rwsem(&uprobe->register_rwsem);
-	init_rwsem(&uprobe->consumer_rwsem);
-	RB_CLEAR_NODE(&uprobe->rb_node);
-	atomic64_set(&uprobe->ref, 1);
-
-   /* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
-   /* a uprobe exists for this inode:offset combination */
-   if (cur_uprobe != uprobe) {
-   if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
-   kfree(uprobe);
-   return ERR_PTR(-EINVAL);
-   }
-   kfree(uprobe);
-   uprobe = cur_uprobe;
-   }
-
-   return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
@@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
uc = get_uprobe_consumer(i, ctx);
 
-   uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-   if (IS_ERR(uprobe)) {
-   ret = PTR_ERR(uprobe);
+   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+   if (!uprobe) {
+   ret = -ENOMEM;
goto cleanup_uprobes;
}
 
+   uprobe->inode = inode;
+   uprobe->offset = uc->offset;
+   uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+		init_rwsem(&uprobe->register_rwsem);
+		init_rwsem(&uprobe->consumer_rwsem);
+		RB_CLEAR_NODE(&uprobe->rb_node);
+		atomic64_set(&uprobe->ref, 1);
+
+   /* add to uprobes_tree, sorted on inode:offset */
+   cur_uprobe = insert_uprobe(uprobe);
+   /* a uprobe exists for this inode:offset combination */
+   if (cur_uprobe != uprobe) {
+   if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
+   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+   put_uprobe(cur_uprobe);
+   kfree(uprobe);
+   ret = -EINVAL;
+   goto cleanup_uprobes;
+   }
+   kfree(uprobe);
+   uprobe = cur_uprobe;
+   }
+
uc->uprobe = uprobe;
}
 
-- 
2.43.0




[PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction with the uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in the next patches, allows batching
the locking for each phase and in this way amortizing any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to a per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
Then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. And then the last step is actual
uprobe registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so end result is all-or-nothing, as
described above.
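
For reference, a hedged caller-side sketch of how these APIs are meant to be
used, modeled on the BPF multi-uprobe conversion in patch 11 (the my_* names
are illustrative, not existing kernel symbols):

/* consumers are embedded in caller-owned structs, no extra pointer array */
struct my_probe {
	struct uprobe_consumer consumer;	/* offset/ref_ctr_offset live here */
	/* ... caller-specific per-probe state ... */
};

struct my_link {
	struct my_probe *probes;
	int cnt;
};

static struct uprobe_consumer *my_get_uprobe_consumer(size_t idx, void *ctx)
{
	struct my_link *link = ctx;

	return &link->probes[idx].consumer;
}

/* attachment is all-or-nothing: on error nothing stays attached */
static int my_link_attach(struct inode *inode, struct my_link *link)
{
	return uprobe_register_batch(inode, link->cnt,
				     my_get_uprobe_consumer, link);
}

/* detachment can't fail, so no allocations are needed on this path */
static void my_link_detach(struct inode *inode, struct my_link *link)
{
	uprobe_unregister_batch(inode, link->cnt,
				my_get_uprobe_consumer, link);
}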

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h |  17 
 kernel/events/uprobes.c | 180 
 2 files changed, 146 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
loff_t ref_ctr_offset;
/* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
+   /* associated uprobe instance (or NULL if consumer isn't attached) */
+   struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs 
*regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn get_uprobe_consumer, void 
*ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct 
uprobe_consumer *uc)
 {
return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx)
+{
+   return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, 
bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8759c6d0683e..68fdf1b8e4bf 100644

[PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-07-01 Thread Andrii Nakryiko
Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies uprobe_register() and
uprobe_unregister() API by having only struct uprobe_consumer as one
thing representing attachment/detachment entity. This makes batched
versions of uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.
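
As a quick illustration of the resulting interface (a hedged sketch with
made-up my_* names, not code from this patch), a consumer now carries its own
offsets, and register/unregister take just the inode and the consumer:

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	return 0;
}

static struct uprobe_consumer my_consumer;

static int my_attach(struct inode *inode, loff_t offset, loff_t ref_ctr_offset)
{
	my_consumer.handler = my_handler;
	my_consumer.offset = offset;			/* used to be a separate argument */
	my_consumer.ref_ctr_offset = ref_ctr_offset;	/* used to need uprobe_register_refctr() */

	return uprobe_register(inode, &my_consumer);
}

static void my_detach(struct inode *inode)
{
	uprobe_unregister(inode, &my_consumer);		/* no offset argument anymore */
}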

Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h   | 18 +++
 kernel/events/uprobes.c   | 19 ++-
 kernel/trace/bpf_trace.c  | 21 +++-
 kernel/trace/trace_uprobe.c   | 53 ---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 
 5 files changed, 55 insertions(+), 78 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
 
+   /* associated file offset of this probe */
+   loff_t offset;
+   /* associated refctr file offset of this probe, or zero */
+   loff_t ref_ctr_offset;
+   /* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs) instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-   return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 560cf1ca512a..8759c6d0683e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
struct uprobe *uprobe;
 
-   uprobe = find_uprobe(inode, offset);
+   uprobe = find_uprobe(inode, uc->offset);
if (WARN_ON(!uprobe))
return;
 
@@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, 
loff_t offset,
return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-   struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-   return __uprobe_register(inode, offset, 0, uc);
+   return __uprobe_register(inode, uc->offset, 

[PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()

2024-07-01 Thread Andrii Nakryiko
Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f87049c08ee9..23449a8c5e7e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, 
loff_t offset,
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
if (!uprobe)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
uprobe->inode = inode;
uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t 
offset,
 
  retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-   if (!uprobe)
-   return -ENOMEM;
if (IS_ERR(uprobe))
return PTR_ERR(uprobe);
 
-- 
2.43.0




[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
handler code, temporary refcount bump is done to keep uprobe from
being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
take another refcount to keep uprobe alive until user space code
returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have an extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, it removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of unfortunate race with uprobe_register, necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead,
uprobe_consumer is assuming this extra refcount, which will be dropped
when the consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keep exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with the uprobe entry handling code in
handle_swbp(), which, through the find_active_uprobe()->find_uprobe() lookup,
can race with the uprobe being destroyed after its refcount drops to zero
(e.g., due to a uprobe_consumer unregistering). This is because
find_active_uprobe() bumps refcount without knowing for sure that
uprobe's refcount is already positive (and it has to be this way, there
is no way around that setup).

One, attempted initially, way to solve this is through using
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump refcount if uprobe is already
destined to be destroyed. This, unfortunately, turns out to be rather
expensive due to the underlying cmpxchg() operation in
atomic_inc_not_zero() and scales rather poorly with an increased number of
parallel threads triggering uprobes.

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situation, we augment
refcount with epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET 
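
To make the packed refcount-plus-epoch idea easier to follow, here is a
hedged user-space sketch of the scheme described above; the constants and the
pthread mutex standing in for uprobes_treelock are illustrative assumptions,
not the exact UPROBE_REFCNT_* values or kernel primitives:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* one 64-bit atomic packs [epoch:32 | refcount:32]; every get and put bumps
 * the epoch, so two threads that both see refcount==0 can tell who saw it last */
#define REF_GET ((int64_t)((1ULL << 32) | 1))   /* epoch += 1, refcount += 1 */
#define REF_PUT ((int64_t)((1ULL << 32) - 1))   /* epoch += 1, refcount -= 1 (via carry) */

struct obj {
	_Atomic int64_t ref;	/* initialized to 1: epoch 0, one reference */
};

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->ref, REF_GET);
}

/* returns true if the caller won the right to destroy the object */
static bool obj_put(struct obj *o)
{
	/* fetch_add returns the old value, so v is the new packed value */
	int64_t v = atomic_fetch_add(&o->ref, REF_PUT) + REF_PUT;
	bool destroy;

	if ((uint32_t)v != 0)
		return false;	/* refcount still positive, nothing to do */

	pthread_mutex_lock(&tree_lock);
	/*
	 * Re-check under the lock: if another thread did a 0 -> 1 -> 0
	 * get/put cycle in the meantime, the epoch half has moved on and that
	 * thread owns destruction; we pretend our dec-to-zero never happened.
	 */
	destroy = atomic_load(&o->ref) == v;
	pthread_mutex_unlock(&tree_lock);

	return destroy;
}

The real implementation additionally RCU-delays the actual freeing (see the
v1->v2 changelog) and periodically clears the highest bit of the counter, as
the commit message notes, so the epoch half never bleeds into the refcount half.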

[PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-01 Thread Andrii Nakryiko
It seems like uprobe_write_opcode() doesn't require a writer-locked
mmap_sem; any lock (reader or writer) should be sufficient. This was
established in a discussion in [0] and looking through existing code
seems to confirm that there is no need for write-locked mmap_sem.

Fix the comment to state this clearly.

  [0] 
https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/

Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
uprobe_write_opcode()")
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081821fd529a..f87049c08ee9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-- 
2.43.0




[PATCH v2 01/12] uprobes: update outdated comment

2024-07-01 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore, so
drop the parts of the comment mentioning a NULL tsk; they are just
confusing at this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
 	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
if (result < 0)
-- 
2.43.0




[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows us to accommodate the much higher latency of taking a per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

v1->v2:
  - added RCU-delayed uprobe freeing to put_uprobe() (Masami);
  - fixed clean up handling in uprobe_register_batch (Jiri);
  - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg);
  - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid
comment instead (Oleg).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h   |  29 +-
 kernel/events/uprobes.c   | 550 --
 kernel/trace/bpf_trace.c  |  40 +-
 kernel/trace/trace_uprobe.c   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  22 +-
 5 files changed, 447 insertions(+), 247 deletions(-)

-- 
2.43.0




Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
 wrote:
>
> On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
> >
> > On Fri, 28 Jun 2024 09:34:26 -0700
> > Andrii Nakryiko  wrote:
> >
> > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > uprobe_consumer *uc)
> > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > > void *ctx)
> > > > > >
> > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > >
> > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > interface isn't the most convenient.
> > > >
> > > > OK, I understand it.
> > > >
> > > > >
> > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > waste. This is because we don't want to do anything that can fail in
> > > > > the detachment logic (so no temporary array allocation there).
> > > >
> > > > No need to change it, that sounds reasonable.
> > > >
> > >
> > > Great, thanks.
> > >
> > > > >
> > > > > Anyways, let me know how you feel about keeping this callback.
> > > >
> > > > IMHO, maybe the interface function is better to change to
> > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > side uses a linked list of structure, index access will need to
> > > > follow the list every time.
> > >
> > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > are only allowed to ask for the next consumer, then
> > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > its own internal index and remember ith instance. Which again means
> > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > isn't great.
> >
> > No, I think we can use a cursor variable as;
> >
> > int uprobe_register_batch(struct inode *inode,
> >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > {
> > void *cur = ctx;
> >
> > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > ...
> > }
> >
> > cur = ctx;
> > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > ...
> > }
> > }
> >
> > This can also remove the cnt.
>
> Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> for callers, but we have one right now, and might have another one, so
> not a big deal.
>

Actually, now that I started implementing this, I really-really don't
like it. In the example above you assume by storing and reusing
original ctx value you will reset iteration to the very beginning.
This is not how it works in practice though. Ctx is most probably a
pointer to some struct somewhere with iteration state (e.g., array of
all uprobes + current index), and so get_uprobe_consumer() doesn't
update `void *ctx` itself, it updates the state of that struct.

And so there is no easy and clean way to reset this iterator without
adding yet another callback or something similar, at which point the
interface becomes quite cumbersome and convoluted.
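To make this concrete, here is a minimal sketch of the kind of ctx I
mean (all names below are made up, this is not the actual BPF
multi-uprobe code). The iteration state lives inside the struct that
ctx points to, so copying the ctx pointer around resets nothing:

struct my_uprobe {
	struct uprobe_consumer consumer;	/* embedded, as usual */
	/* ...per-uprobe state... */
};

struct my_batch_ctx {
	struct my_uprobe *uprobes;	/* array of wrapper structs */
	int cnt;
	int cur;			/* iteration state lives here */
};

/* next-style callback: *ctx itself never changes, only b->cur does */
static struct uprobe_consumer *my_next_consumer(void *ctx)
{
	struct my_batch_ctx *b = ctx;

	if (b->cur >= b->cnt)
		return NULL;
	return &b->uprobes[b->cur++].consumer;
}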

How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
contract, which works for the only user right now, BPF multi-uprobes.
When it's time to add another consumer that works with a linked list,
we can add a second, more flexible contract that does iterator-style
callbacks. That contract would be used by linked-list users, and we
can transparently implement the existing uprobe_register_batch()
contract on top of it by wrapping the get_uprobe_consumer(idx, ctx)
callback in a trivial iterator (see the sketch below).
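Roughly, such a wrapper could look like this (names are made up, just
to show the shape; existing get_uprobe_consumer(idx, ctx) callers
wouldn't have to change):

struct idx_iter {
	uprobe_consumer_fn get_uprobe_consumer;	/* existing idx-based callback */
	void *ctx;
	int idx;
	int cnt;
};

/* iterator-style callback implemented on top of the idx-based one */
static struct uprobe_consumer *idx_iter_next(void *iter_ctx)
{
	struct idx_iter *it = iter_ctx;

	if (it->idx >= it->cnt)
		return NULL;
	return it->get_uprobe_consumer(it->idx++, it->ctx);
}

/* and uprobe_register_batch() itself would just do something like:
 *	struct idx_iter it = { get_uprobe_consumer, ctx, 0, cnt };
 *	return uprobe_register_batch_iter(inode, idx_iter_next, &it);
 * where uprobe_register_batch_iter() is the hypothetical iterator-based core.
 */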

Let's not add unnecessary complications right now, given that we have
a clear path forward to add this later, if necessary, without breaking
anything. I'll send v2 without changes to get_uprobe_consumer() for
now; hopefully the plan above makes sense to you. Thanks!

> >
> > Thank you,
> >
> > >
> > > For now this API works well, I propose to keep it as is. For linked
> > > list case consumers would need to allocate one extra array or pay the
> > > price of O(N) search (which might be ok, depending on how many uprobes
> > > are being attached). But we don't have such consumers right now,
> > > thankfully.

Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko
 wrote:
>
> On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 24 Jun 2024 17:21:36 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Anyways, under exclusive writer lock, we double-check that refcount
> > > didn't change and is still zero. If it is, we proceed with destruction,
> > > because at that point we have a guarantee that find_active_uprobe()
> > > can't successfully look up this uprobe instance, as it's going to be
> > > removed in destructor under writer lock. If, on the other hand,
> > > find_active_uprobe() managed to bump refcount from zero to one in
> > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > > pretend like atomic_dec_and_test() never returned true. There is no
> > > resource freeing or any other irreversible action taken up till this
> > > point, so we just exit early.
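To spell out the flow described above in simplified form (the epoch
part explained below is ignored here; this is a sketch, not the exact
code from the patch):

static void put_uprobe(struct uprobe *uprobe)
{
	if (!atomic_dec_and_test(&uprobe->ref))
		return;

	write_lock(&uprobes_treelock);
	/*
	 * find_active_uprobe() may have bumped the refcount from zero between
	 * the decrement above and taking the writer lock; if so, act as if
	 * the decrement never hit zero and back off -- nothing irreversible
	 * has happened yet.
	 */
	if (atomic_read(&uprobe->ref) > 0) {
		write_unlock(&uprobes_treelock);
		return;
	}
	rb_erase(&uprobe->rb_node, &uprobes_tree);
	write_unlock(&uprobes_treelock);
	kfree(uprobe);
}

As the discussion below shows, the eager kfree() at the end is itself
not enough and needs RCU protection.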
> > >
> > > One tricky part in the above is actually two CPUs racing and dropping
> > > refcnt to zero, and then attempting to free resources. This can happen
> > > as follows:
> > >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab 
> > > uprobes_treelock;
> > >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > > which point it decides that it needs to free uprobe as well.
> > >
> > > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > > uprobe, which is obviously wrong. To prevent this situation, we augment
> > > refcount with epoch counter, which is always incremented by 1 on either
> > > get or put operation. This allows those two CPUs above to disambiguate
> > > who should actually free uprobe (it's the CPU #1, because it has
> > > up-to-date epoch). See comments in the code and note the specific values
> > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > > that are incremented/decremented with a single atomic64_add_return()
> > > operation. Note also a small and extremely rare (and thus having no
> > > effect on performance) need to clear the highest bit every 2 billion
> > > get/put operations to prevent the high 32-bit counter from "bleeding over"
> > > into the lower 32-bit counter.
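To illustrate the packing with made-up values (the bit layout and the
exact UPROBE_REFCNT_GET/UPROBE_REFCNT_PUT constants in the patch may
differ): put the refcount into the low 32 bits and the epoch into the
high 32 bits, then a single 64-bit add updates both, with the put side
relying on the carry out of the low half:

#define DEMO_REFCNT_GET	((1LL << 32) | 1)	/* epoch += 1, refcount += 1 */
#define DEMO_REFCNT_PUT	(0xffffffffLL)		/* refcount -= 1 and, via the carry, epoch += 1 */

static long long demo_epoch_refcnt(void)
{
	long long v = 0;

	v += DEMO_REFCNT_GET;	/* 0x0000000100000001: epoch 1, refcount 1 */
	v += DEMO_REFCNT_GET;	/* 0x0000000200000002: epoch 2, refcount 2 */
	v += DEMO_REFCNT_PUT;	/* 0x0000000300000001: epoch 3, refcount 1 */

	return v;
}

The DEMO_REFCNT_PUT addition only behaves this way while the refcount
half is >= 1, which is exactly what a put operation assumes.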
> >
> > I have a question here.
> > Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> > the uprobes_treelock, and to free the uprobe before CPU#0 validates
> > uprobe->ref again? E.g.:
> >
> > CPU#0                                   CPU#1
> >
> > put_uprobe() {
> >     atomic64_add_return()
> >                                         __get_uprobe();
> >                                         put_uprobe() {
> >
> >                                             kfree(uprobe)
> >                                         }
> >     write_lock(&uprobes_treelock);
> >     atomic64_read(&uprobe->ref);
> > }
> >
> > I think it is a very rare case, but I could not find any code to prevent
> > this scenario.
> >
>
> Yes, I think you are right, great catch!
>
> I concentrated on preventing double kfree() in this situation, and
> somehow convinced myself that eager kfree() is fine. But I think I'll
> need to delay freeing, probably with RCU. The problem is that we can't
> use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
> to be a sleepable variant of RCU. I'm thinking of using
> rcu_read_lock_trace(), the same variant of RCU we use for sleepable
> BPF programs (including sleepable uprobes). srcu might be too heavy
> for this.
>
> I'll try a few variants over the next few days and see how the
> performance looks.
>

So I think I'm going with the changes below, incorporated into this
patch (nothing else changes). __get_uprobe() doesn't need any added
RCU protection (we know that the uprobe is alive). It's only
put_uprobe() that needs to guarantee RCU protection from the moment we
drop the refcount all the way until we know whether we are the winning
destructor or not.

Good thing is that the changes are pretty minimal in code and also
don't seem to regress performance/scalability. So I'm pretty happy
about that, will send v2 soon.


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 07ad8b2e7508..41d9e37633ca 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -56,6 +56,7 @@ struct uprobe {
 	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
+	struct rcu_head		rcu;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
@@ -623,7 +624,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
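A rough sketch (not the actual v2 code) of how the struct rcu_head
added above might be used, with the epoch/refcount winner check
reduced to a placeholder:

static void uprobe_free_rcu(struct rcu_head *rcu)
{
	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);

	kfree(uprobe);
}

static void put_uprobe(struct uprobe *uprobe)
{
	/*
	 * Keep the uprobe's memory valid from the moment the refcount is
	 * dropped until we know whether we are the winning destructor; a
	 * sleepable RCU flavor is needed because locks are taken in between.
	 */
	rcu_read_lock_trace();

	if (drop_ref_and_check_winner(uprobe)) {	/* placeholder for the epoch/refcount logic */
		write_lock(&uprobes_treelock);
		rb_erase(&uprobe->rb_node, &uprobes_tree);
		write_unlock(&uprobes_treelock);
		/* defer the actual free past any racing put_uprobe()/lookup */
		call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
	}

	rcu_read_unlock_trace();
}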

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
>
> On Fri, 28 Jun 2024 09:34:26 -0700
> Andrii Nakryiko  wrote:
>
> > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > wrote:
> > >
> > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > -			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > +			  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > >
> > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > allocate a temporary array of *uprobe_consumer instead?
> > > >
> > > > Yes, exactly, to avoid the need for allocating another array that
> > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > just have an array of `struct uprobe_consumer *`, because
> > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > interface isn't the most convenient.
> > >
> > > OK, I understand it.
> > >
> > > >
> > > > If you feel strongly, I can do an array, but this necessitates
> > > > allocating an extra array *and keeping it* for the entire duration of
> > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > waste. This is because we don't want to do anything that can fail in
> > > > the detachment logic (so no temporary array allocation there).
> > >
> > > No need to change it, that sounds reasonable.
> > >
> >
> > Great, thanks.
> >
> > > >
> > > > Anyways, let me know how you feel about keeping this callback.
> > >
> > > IMHO, maybe the interface function is better to change to
> > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > side uses a linked list of structure, index access will need to
> > > follow the list every time.
> >
> > This would be problematic. Note how we call get_uprobe_consumer(i,
> > ctx) with i going from 0 to N in multiple independent loops. So if we
> > are only allowed to ask for the next consumer, then
> > uprobe_register_batch and uprobe_unregister_batch would need to build
> > its own internal index and remember ith instance. Which again means
> > more allocations and possibly failing uprobe_unregister_batch(), which
> > isn't great.
>
> No, I think we can use a cursor variable as;
>
> int uprobe_register_batch(struct inode *inode,
>                           uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> {
>         void *cur = ctx;
>
>         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
>                 ...
>         }
>
>         cur = ctx;
>         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
>                 ...
>         }
> }
>
> This can also remove the cnt.

Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
for callers, but we have one right now, and might have another one, so
not a big deal.

>
> Thank you,
>
> >
> > For now this API works well, I propose to keep it as is. For linked
> > list case consumers would need to allocate one extra array or pay the
> > price of O(N) search (which might be ok, depending on how many uprobes
> > are being attached). But we don't have such consumers right now,
> > thankfully.
> >
> > >
> > > Thank you,
> > >
> > >
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) 
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) 
>
>
> --
> Masami Hiramatsu (Google)