[PATCH] staging: Fix missing warning/taint on builtin code

2024-07-01 Thread Ágatha Isabelle Chris Moreira Guedes
Fix the absence of the warning message and kernel tainting when drivers
from the `drivers/staging` subtree are initialized from initcalls (i.e.
when they are configured as built-in).

When such a driver is built as a module and the module is loaded, the
`load_module()` function taints the kernel to signal that code of unknown
quality has been loaded, and produces a warning like this:

[8.076352] rts5208: module is from the staging directory, the
quality is unknown, you have been warned.

The same behaviour is absent, however, when a staging driver is built
into the kernel image, since loading it happens through initcalls and
not through load_module().

This might prevent relevant information from being available in a bug
report (e.g. in a panic log), among other possible problems.

Signed-off-by: Ágatha Isabelle Chris Moreira Guedes 
---
ACKNOWLEDGEMENTS
Thanks to Jookia, heat and ukleinek for the important comments &
suggestions on this patch prior to submission.

 drivers/staging/Makefile |  2 ++
 include/linux/init.h |  6 ++
 include/linux/module.h   | 33 +
 init/main.c  | 20 
 kernel/module/main.c |  4 +---
 5 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 5390879b5d1b..7cea13436426 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 # Makefile for staging directory
 
+subdir-ccflags-y += -DSTAGING_CODE
+
 obj-y  += media/
 obj-$(CONFIG_FB_OLPC_DCON) += olpc_dcon/
 obj-$(CONFIG_RTL8192E) += rtl8192e/
diff --git a/include/linux/init.h b/include/linux/init.h
index 58cef4c2e59a..68c37600958f 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -397,4 +397,10 @@ void __init parse_early_options(char *cmdline);
 #define __exit_p(x) NULL
 #endif
 
+#ifdef CONFIG_STAGING
+#ifndef __ASSEMBLY__
+extern void staging_taint(const char *code_id, bool module);
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_STAGING */
+
 #endif /* _LINUX_INIT_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index 330ffb59efe5..ffe58f5d143b 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -76,6 +76,39 @@ extern struct module_attribute module_uevent;
 extern int init_module(void);
 extern void cleanup_module(void);
 
+#ifdef CONFIG_STAGING
+
+#define __lower_define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id)
+
+/**
+ * __staging_define_initcall(fn,id) - staging initialization entry point
+ * @fn: the function to run at kernel boot time
+ * @id: the initcall level
+ *
+ * __staging_define_initcall() ensures the driver's init function is always
+ * called during initcalls for staging code by producing a wrapper function.
+ * It applies when a module from the drivers/staging subtree is built into the
+ * kernel. It reproduces the behavior of load_module() by tainting the kernel
+ * and logging a warning about the code quality.
+ */
+
+#define __staging_define_initcall(fn, id) \
+   static int __init __staging_wrapped_##fn(void) \
+   { \
+   staging_taint(__FILE__, false); \
+   return fn(); \
+   } \
+__lower_define_initcall(__staging_wrapped_##fn, id)
+
+#ifdef STAGING_CODE
+
+#undef __define_initcall
+#define __define_initcall(fn, id) __staging_define_initcall(fn, id)
+
+#endif /* STAGING_CODE */
+
+#endif /* CONFIG_STAGING */
+
 #ifndef MODULE
 /**
  * module_init() - driver initialization entry point
diff --git a/init/main.c b/init/main.c
index 206acdde51f5..fca889f3bcc0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1602,3 +1602,23 @@ static noinline void __init kernel_init_freeable(void)
 
integrity_load_keys();
 }
+
+#ifdef CONFIG_STAGING
+/**
+ * staging_taint() - taint the kernel when staging code is initialized or loaded
+ * @code_id: identifies the staging code (source file or module name)
+ * @module: true when called for a loaded module, false for built-in code
+ *
+ * Taints the kernel with TAINT_CRAP and logs a warning whenever staging code
+ * is initialized (from built-in drivers) or loaded (as modules).
+ */
+void staging_taint(const char *code_id, bool module)
+{
+   char *code_type = module ? "module" : "builtin driver at";
+
+   pr_warn("%s %s: The kernel contains code from staging directory"
+   "the quality is unknown, you have been warned.\n",
+   code_type, code_id);
+
+   add_taint(TAINT_CRAP, LOCKDEP_STILL_OK);
+}
+EXPORT_SYMBOL(staging_taint);
+
+#endif /* CONFIG_STAGING */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index d18a94b973e1..d7d6ab43 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2039,9 +2039,7 @@ static void module_augment_kernel_taints(struct module 
*mod, struct load_info *i
check_modinfo_retpoline(mod, info);
 
if (get_modinfo(info, "staging")) {
-   add_taint_module(mod, TAINT_CRAP, LOCKDEP_STILL_OK);
-   pr_warn("%s: module is from the staging directory, the quality "
-   "is unknown, you have been 

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 6:01 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Jul 2024 15:15:56 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
> >  wrote:
> > >
> > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > > Andrii Nakryiko  wrote:
> > > > > > > >
> > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t 
> > > > > > > > > offset,
> > > > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > > > uprobe_consumer *uc)
> > > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > > +   uprobe_consumer_fn 
> > > > > > > > > get_uprobe_consumer, void *ctx)
> > > > > > > >
> > > > > > > > Is this interface just for avoiding memory allocation? Can't we 
> > > > > > > > just
> > > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > > >
> > > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > > would just contain pointers to uprobe_consumer. Consumers would 
> > > > > > > never
> > > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > > uprobe_consumer struct is embedded in some other struct, so the 
> > > > > > > array
> > > > > > > interface isn't the most convenient.
> > > > > >
> > > > > > OK, I understand it.
> > > > > >
> > > > > > >
> > > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > > allocating an extra array *and keeping it* for the entire 
> > > > > > > duration of
> > > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > > waste. This is because we don't want to do anything that can fail 
> > > > > > > in
> > > > > > > the detachment logic (so no temporary array allocation there).
> > > > > >
> > > > > > No need to change it, that sounds reasonable.
> > > > > >
> > > > >
> > > > > Great, thanks.
> > > > >
> > > > > > >
> > > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > > >
> > > > > > IMHO, maybe the interface function is better to change to
> > > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > > side uses a linked list of structure, index access will need to
> > > > > > follow the list every time.
> > > > >
> > > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > > are only allowed to ask for the next consumer, then
> > > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > > its own internal index and remember ith instance. Which again means
> > > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > > isn't great.
> > > >
> > > > No, I think we can use a cursor variable as;
> > > >
> > > > int uprobe_register_batch(struct inode *inode,
> > > >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > {
> > > > void *cur = ctx;
> > > >
> > > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > > ...
> > > > }
> > > >
> > > > cur = ctx;
> > > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > > ...
> > > > }
> > > > }
> > > >
> > > > This can also remove the cnt.
> > >
> > > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > > for callers, but we have one right now, and might have another one, so
> > > not a big deal.
> > >
> >
> > Actually, now that I started implementing this, I really-really don't
> > like it. In the example above you assume by storing and reusing
> > original ctx value you will reset iteration to the very beginning.
> > This is not how it works in practice though. Ctx is most probably a
> > pointer to some struct somewhere with iteration state (e.g., array of
> > all uprobes + current index), and so get_uprobe_consumer() doesn't
> > update `void *ctx` itself, it updates the state of that struct.
>
> Yeah, that should be noted so that if the get_uprobe_consumer() is
> called with original `ctx` value, it should return the same.
> Ah, and I found we need to pass both ctx and pos...
>
>    while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
>  ...
>  }
>
> Usually it is enough to pass the cursor as similar to the other
> for_each_* macros. For example, struct foo has .list and .uc, then
>
> struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
> {
>   

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Google
On Mon, 1 Jul 2024 15:15:56 -0700
Andrii Nakryiko  wrote:

> On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
>  wrote:
> >
> > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  
> > wrote:
> > >
> > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > Andrii Nakryiko  wrote:
> > > > > > >
> > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t 
> > > > > > > > offset,
> > > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > > uprobe_consumer *uc)
> > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > > > void *ctx)
> > > > > > >
> > > > > > > Is this interface just for avoiding memory allocation? Can't we 
> > > > > > > just
> > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > >
> > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > would just contain pointers to uprobe_consumer. Consumers would 
> > > > > > never
> > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > uprobe_consumer struct is embedded in some other struct, so the 
> > > > > > array
> > > > > > interface isn't the most convenient.
> > > > >
> > > > > OK, I understand it.
> > > > >
> > > > > >
> > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > allocating an extra array *and keeping it* for the entire duration 
> > > > > > of
> > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > waste. This is because we don't want to do anything that can fail in
> > > > > > the detachment logic (so no temporary array allocation there).
> > > > >
> > > > > No need to change it, that sounds reasonable.
> > > > >
> > > >
> > > > Great, thanks.
> > > >
> > > > > >
> > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > >
> > > > > IMHO, maybe the interface function is better to change to
> > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > side uses a linked list of structure, index access will need to
> > > > > follow the list every time.
> > > >
> > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > are only allowed to ask for the next consumer, then
> > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > its own internal index and remember ith instance. Which again means
> > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > isn't great.
> > >
> > > No, I think we can use a cursor variable as;
> > >
> > > int uprobe_register_batch(struct inode *inode,
> > >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > {
> > > void *cur = ctx;
> > >
> > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > ...
> > > }
> > >
> > > cur = ctx;
> > > while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > ...
> > > }
> > > }
> > >
> > > This can also remove the cnt.
> >
> > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > for callers, but we have one right now, and might have another one, so
> > not a big deal.
> >
> 
> Actually, now that I started implementing this, I really-really don't
> like it. In the example above you assume by storing and reusing
> original ctx value you will reset iteration to the very beginning.
> This is not how it works in practice though. Ctx is most probably a
> pointer to some struct somewhere with iteration state (e.g., array of
> all uprobes + current index), and so get_uprobe_consumer() doesn't
> update `void *ctx` itself, it updates the state of that struct.

Yeah, that should be noted so that if the get_uprobe_consumer() is
called with original `ctx` value, it should return the same.
Ah, and I found we need to pass both ctx and pos...

   while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
 ...
 }

Usually it is enough to pass the cursor as similar to the other
for_each_* macros. For example, struct foo has .list and .uc, then

struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
{
struct foo *foo = *pos;

if (!foo)
return NULL;

*pos = list_next_entry(foo, list);
if (list_is_head(pos, (head)))
*pos = NULL;

return foo->uc;
}

This works something like this.

#define for_each_uprobe_consumer_from_foo(uc, pos, head) 

Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-01 Thread Steven Rostedt
On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:

> @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
>   if (error)
>   return error;
>  
> + trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> + attr->ia_valid, attr->ia_mode,
> + from_kuid(&init_user_ns, attr->ia_uid),
> + from_kgid(&init_user_ns, attr->ia_gid),
> + inode->i_size, attr->ia_size);
> +

That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.

-- Steve
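
As a concrete illustration of this suggestion, here is a hedged sketch (the event name and field layout are assumptions, not the actual hugetlbfs trace header) of passing just the inode/dentry/attr pointers and doing the unpacking in TP_fast_assign():

/* Sketch only; assumes a TRACE_EVENT in a hugetlbfs trace header. */
TRACE_EVENT(hugetlbfs_setattr,

        TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

        TP_ARGS(inode, dentry, attr),

        TP_STRUCT__entry(
                __field(unsigned int,   ia_valid)
                __field(unsigned int,   ia_mode)
                __field(uid_t,          ia_uid)
                __field(gid_t,          ia_gid)
                __field(loff_t,         old_size)
                __field(loff_t,         ia_size)
                __string(name,          dentry->d_name.name)
        ),

        /*
         * All the unpacking happens here, at trace time, instead of at the
         * call site, which keeps the calling code (and its icache footprint)
         * small.
         */
        TP_fast_assign(
                __entry->ia_valid = attr->ia_valid;
                __entry->ia_mode  = attr->ia_mode;
                __entry->ia_uid   = from_kuid(&init_user_ns, attr->ia_uid);
                __entry->ia_gid   = from_kgid(&init_user_ns, attr->ia_gid);
                __entry->old_size = inode->i_size;
                __entry->ia_size  = attr->ia_size;
                __assign_str(name, dentry->d_name.name);
        ),

        TP_printk("name=%s valid=%#x mode=0%o uid=%u gid=%u size=%lld->%lld",
                  __get_str(name), __entry->ia_valid, __entry->ia_mode,
                  __entry->ia_uid, __entry->ia_gid,
                  __entry->old_size, __entry->ia_size)
);

The call site would then shrink to something like trace_hugetlbfs_setattr(inode, dentry, attr);.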



[PATCH] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-07-01 Thread Andrii Nakryiko
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function prologue, which means that within that function the frame pointer
hasn't been established yet. This leads to consistently missing the actual
caller of the traced function, because perf_callchain_user() only records
the current IP (capturing the traced function) and then follows the frame
pointer chain (which at that point is the caller's frame, containing the
address of the caller's caller).

So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, the captured stack trace will report a
target_1 -> target_3 call chain, which is wrong and confusing.
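
For context, a minimal user-space sketch (hypothetical program, not part of the patch) of that situation; probing the first instruction of target_3 previously yielded the truncated stack trace described above:

/*
 * Compile with frame pointers preserved, e.g.:
 *   gcc -O1 -fno-omit-frame-pointer chain.c -o chain
 * A uprobe placed at the first instruction of target_3 (its `push %rbp`)
 * fires before %rbp switches to target_3's frame, so the old unwinding
 * logic reported target_1 -> target_3 and skipped target_2.
 */
__attribute__((noinline)) void target_3(void) { asm volatile(""); }
__attribute__((noinline)) void target_2(void) { target_3(); }
__attribute__((noinline)) void target_1(void) { target_2(); }

int main(void)
{
        target_1();
        return 0;
}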

This patch proposes an x86-64-specific heuristic to detect when the traced
instruction is `push %rbp`. Given that the entire kernel implementation of
user-space stack trace capturing works under the assumption that user-space
code was compiled with frame pointer register (%rbp) preservation, it seems
pretty reasonable to use this instruction as a strong indicator that this is
the entry to the function. In that case, the return address is still pointed
to by %rsp, so we fetch it and add it to the stack trace before proceeding to
unwind the rest using the frame pointer-based logic.

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/events/core.c  | 20 
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *regs
return;
 
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+   /*
+* If we are called from uprobe handler, and we are indeed at the very
+* entry to user function (which is normally a `push %rbp` instruction,
+* under assumption of application being compiled with frame pointers),
+* we should read return address from *regs->sp before proceeding
+* to follow frame pointers, otherwise we'll skip immediate caller
+* as %rbp is not yet setup.
+*/
+   if (current->utask) {
+   struct arch_uprobe *auprobe = current->utask->auprobe;
+   u64 ret_addr;
+
+   if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+   !__get_user(ret_addr, (const u64 __user *)regs->sp))
+   perf_callchain_store(entry, ret_addr);
+   }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
 
+   struct arch_uprobe  *auprobe;
+
struct return_instance  *return_instances;
unsigned intdepth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
 
down_read(>register_rwsem);
+   current->utask->auprobe = >arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
 
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
 
remove &= rc;
}
+   current->utask->auprobe = NULL;
 
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0




[PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
With all the batch uprobe API work in place, we are now finally ready to reap
the benefits. Switch uprobes_treelock from a reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)AFTER (mln/s)
  -   ---
  1   3.131 3.140
  2   3.394 3.601
  3   3.630 3.960
  4   3.317 3.551
  5   3.448 3.464
  6   3.345 3.283
  7   3.469 3.444
  8   3.182 3.258
  9   3.138 3.139
  10  2.999 3.212
  11  2.903 3.183
  12  2.802 3.027
  13  2.792 3.027
  14  2.695 3.086
  15  2.822 2.965
  16  2.679 2.939
  17  2.622 2.888
  18  2.628 2.914
  19  2.702 2.836
  20  2.561 2.837

One can see that the per-CPU RW semaphore-based implementation scales better
with the number of CPUs (especially given that single-CPU throughput is
basically the same).

Note that scalability is still limited by register_rwsem, and this will
hopefully be addressed in follow-up patch set(s).

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events() RB_EMPTY_ROOT(_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);  /* serialize rbtree access */
 
 #define UPROBES_HASH_SZ13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
bool destroy;
 
if (!tree_locked)
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
if (destroy && uprobe_is_active(uprobe))
rb_erase(>rb_node, _tree);
if (!tree_locked)
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
loff_t offset)
 {
struct uprobe *uprobe;
 
-   read_lock(_treelock);
+   percpu_down_read(_treelock);
uprobe = __find_uprobe(inode, offset);
-   read_unlock(_treelock);
+   percpu_up_read(_treelock);
 
return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
up_write(>register_rwsem);
}
 
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
__put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
ret = 0;
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
}
 unlock_treelock:
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
if (ret)
goto cleanup_uprobes;
 
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 @@ int uprobe_register_batch(struct 

[PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes

2024-07-01 Thread Andrii Nakryiko
Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies the BPF cleanup code a bit, thanks to the all-or-nothing
guarantee of uprobe_register_batch().

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/bpf_trace.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe 
*uprobes,
- u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, 
void *ctx)
 {
-   u32 i;
+   struct bpf_uprobe_multi_link *link = ctx;
 
-   for (i = 0; i < cnt; i++) {
-   uprobe_unregister(d_real_inode(path->dentry), 
[i].consumer);
-   }
+   return >uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link 
*link)
struct bpf_uprobe_multi_link *umulti_link;
 
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-   bpf_uprobe_unregister(_link->path, umulti_link->uprobes, 
umulti_link->cnt);
+   uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), 
umulti_link->cnt,
+   umulti_link_get_uprobe_consumer, umulti_link);
if (umulti_link->task)
put_task_struct(umulti_link->task);
path_put(_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
bpf_link_init(>link, BPF_LINK_TYPE_UPROBE_MULTI,
  _uprobe_multi_link_lops, prog);
 
-   for (i = 0; i < cnt; i++) {
-   err = uprobe_register(d_real_inode(link->path.dentry), 
[i].consumer);
-   if (err) {
-   bpf_uprobe_unregister(, uprobes, i);
-   goto error_free;
-   }
-   }
+   err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+   umulti_link_get_uprobe_consumer, link);
+   if (err)
+   goto error_free;
 
err = bpf_link_prime(>link, _primer);
if (err)
-- 
2.43.0




[PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch

2024-07-01 Thread Andrii Nakryiko
Similarly to what we did for uprobes_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take a batched uprobes_treelock once to __put_uprobe() for all
uprobe_consumers. That uprobe_consumer->uprobe field is really handy in
helping with this.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
 {
-   struct uprobe *uprobe;
struct uprobe_consumer *uc;
+   struct uprobe *uprobe;
int i;
 
for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
down_write(>register_rwsem);
__uprobe_unregister(uprobe, uc);
up_write(>register_rwsem);
-   put_uprobe(uprobe);
+   }
 
+   write_lock(_treelock);
+   for (i = 0; i < cnt; i++) {
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
+   if (!uprobe)
+   continue;
+
+   __put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0




[PATCH v2 09/12] uprobes: batch uprobes_treelock during registration

2024-07-01 Thread Andrii Nakryiko
Now that we have a good separation of each registration step, take
uprobes_treelock just once for each relevant registration step, and then
process all relevant uprobes in one go.

Even if writer lock introduces a relatively large delay (as might happen
with per-CPU RW semaphore), this will keep overall batch attachment
reasonably fast.

We teach put_uprobe(), through the __put_uprobe() helper, to optionally take
uprobes_treelock or not, to accommodate this pattern.

With these changes we don't need insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 45 +
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 128677ffe662..ced85284bbf4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
kfree(uprobe);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
s64 v;
 
@@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe)
if (unlikely((u32)v == 0)) {
bool destroy;
 
-   write_lock(_treelock);
+   if (!tree_locked)
+   write_lock(_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe)
destroy = atomic64_read(>ref) == v;
if (destroy && uprobe_is_active(uprobe))
rb_erase(>rb_node, _tree);
-   write_unlock(_treelock);
+   if (!tree_locked)
+   write_unlock(_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe)
rcu_read_unlock_trace();
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+   __put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
   const struct uprobe *r)
@@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe 
*uprobe)
return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-   struct uprobe *u;
-
-   write_lock(_treelock);
-   u = __insert_uprobe(uprobe);
-   write_unlock(_treelock);
-
-   return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uc->uprobe = uprobe;
}
 
+   ret = 0;
+   write_lock(_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uprobe = uc->uprobe;
 
/* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
+   cur_uprobe = __insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
+
+   __put_uprobe(cur_uprobe, true);
ret = -EINVAL;
-   goto cleanup_uprobes;
+   goto unlock_treelock;
}
kfree(uprobe);
uc->uprobe = cur_uprobe;
}
}
+unlock_treelock:
+   write_unlock(_treelock);
+   if (ret)
+   goto cleanup_uprobes;
 
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
@@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
+   write_lock(_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
if (uc->uprobe)
-   put_uprobe(uc->uprobe);
+   __put_uprobe(uc->uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(_treelock);
return ret;
 }
 
-- 
2.43.0




[PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps

2024-07-01 Thread Andrii Nakryiko
Now we are ready to split the coupled alloc-and-insert step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes the preparations that make uprobe_register_batch() ready to
batch and optimize locking for each phase.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0f928a72a658..128677ffe662 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
return -EINVAL;
}
 
+   /* pre-allocate new uprobe instances */
for (i = 0; i < cnt; i++) {
-   struct uprobe *cur_uprobe;
-
uc = get_uprobe_consumer(i, ctx);
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
RB_CLEAR_NODE(>rb_node);
atomic64_set(>ref, 1);
 
+   uc->uprobe = uprobe;
+   }
+
+   for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
@@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
-   kfree(uprobe);
ret = -EINVAL;
goto cleanup_uprobes;
}
kfree(uprobe);
-   uprobe = cur_uprobe;
+   uc->uprobe = cur_uprobe;
}
-
-   uc->uprobe = uprobe;
}
 
for (i = 0; i < cnt; i++) {
-- 
2.43.0




[PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()

2024-07-01 Thread Andrii Nakryiko
To allow unbundling the alloc-uprobe-and-insert step, which is currently
tightly coupled, inline the alloc_uprobe() logic into the
uprobe_register_batch() loop. It's called from one place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 65 ++---
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 68fdf1b8e4bf..0f928a72a658 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct 
uprobe *uprobe)
(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-  loff_t ref_ctr_offset)
-{
-   struct uprobe *uprobe, *cur_uprobe;
-
-   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-   if (!uprobe)
-   return ERR_PTR(-ENOMEM);
-
-   uprobe->inode = inode;
-   uprobe->offset = offset;
-   uprobe->ref_ctr_offset = ref_ctr_offset;
-   init_rwsem(>register_rwsem);
-   init_rwsem(>consumer_rwsem);
-   RB_CLEAR_NODE(>rb_node);
-   atomic64_set(>ref, 1);
-
-   /* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
-   /* a uprobe exists for this inode:offset combination */
-   if (cur_uprobe != uprobe) {
-   if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
-   kfree(uprobe);
-   return ERR_PTR(-EINVAL);
-   }
-   kfree(uprobe);
-   uprobe = cur_uprobe;
-   }
-
-   return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
down_write(>consumer_rwsem);
@@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
uc = get_uprobe_consumer(i, ctx);
 
-   uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-   if (IS_ERR(uprobe)) {
-   ret = PTR_ERR(uprobe);
+   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+   if (!uprobe) {
+   ret = -ENOMEM;
goto cleanup_uprobes;
}
 
+   uprobe->inode = inode;
+   uprobe->offset = uc->offset;
+   uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+   init_rwsem(>register_rwsem);
+   init_rwsem(>consumer_rwsem);
+   RB_CLEAR_NODE(>rb_node);
+   atomic64_set(>ref, 1);
+
+   /* add to uprobes_tree, sorted on inode:offset */
+   cur_uprobe = insert_uprobe(uprobe);
+   /* a uprobe exists for this inode:offset combination */
+   if (cur_uprobe != uprobe) {
+   if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
+   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+   put_uprobe(cur_uprobe);
+   kfree(uprobe);
+   ret = -EINVAL;
+   goto cleanup_uprobes;
+   }
+   kfree(uprobe);
+   uprobe = cur_uprobe;
+   }
+
uc->uprobe = uprobe;
}
 
-- 
2.43.0




[PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction for the uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in the next patches, allows batching
the locking for each phase and in this way amortizing any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to a per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
Then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. And then the last step is actual
uprobe registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so end result is all-or-nothing, as
described above.

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h |  17 
 kernel/events/uprobes.c | 180 
 2 files changed, 146 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
loff_t ref_ctr_offset;
/* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
+   /* associated uprobe instance (or NULL if consumer isn't attached) */
+   struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs 
*regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn get_uprobe_consumer, void 
*ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct 
uprobe_consumer *uc)
 {
return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx)
+{
+   return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, 
bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8759c6d0683e..68fdf1b8e4bf 100644

[PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-07-01 Thread Andrii Nakryiko
Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies uprobe_register() and
uprobe_unregister() API by having only struct uprobe_consumer as one
thing representing attachment/detachment entity. This makes batched
versions of uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.

Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h   | 18 +++
 kernel/events/uprobes.c   | 19 ++-
 kernel/trace/bpf_trace.c  | 21 +++-
 kernel/trace/trace_uprobe.c   | 53 ---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 
 5 files changed, 55 insertions(+), 78 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
 
+   /* associated file offset of this probe */
+   loff_t offset;
+   /* associated refctr file offset of this probe, or zero */
+   loff_t ref_ctr_offset;
+   /* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs) instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-   return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 560cf1ca512a..8759c6d0683e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
struct uprobe *uprobe;
 
-   uprobe = find_uprobe(inode, offset);
+   uprobe = find_uprobe(inode, uc->offset);
if (WARN_ON(!uprobe))
return;
 
@@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, 
loff_t offset,
return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-   struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-   return __uprobe_register(inode, offset, 0, uc);
+   return __uprobe_register(inode, uc->offset, 
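
A hedged caller-side sketch (hypothetical consumer and example offsets, not from the patch) of the simplified interface after this change:

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
        return 0;       /* keep the probe installed */
}

/* The file offsets now live inside the consumer itself. */
static struct uprobe_consumer my_uc = {
        .handler        = my_handler,
        .offset         = 0x1234,       /* example file offset of the probed insn */
        .ref_ctr_offset = 0,            /* no SDT semaphore */
};

static int my_attach(struct inode *inode)
{
        return uprobe_register(inode, &my_uc);  /* was: uprobe_register(inode, offset, &my_uc) */
}

static void my_detach(struct inode *inode)
{
        uprobe_unregister(inode, &my_uc);       /* was: uprobe_unregister(inode, offset, &my_uc) */
}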

[PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()

2024-07-01 Thread Andrii Nakryiko
Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f87049c08ee9..23449a8c5e7e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, 
loff_t offset,
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
if (!uprobe)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
uprobe->inode = inode;
uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t 
offset,
 
  retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-   if (!uprobe)
-   return -ENOMEM;
if (IS_ERR(uprobe))
return PTR_ERR(uprobe);
 
-- 
2.43.0




[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
handler code, temporary refcount bump is done to keep uprobe from
being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
take another refcount to keep uprobe alive until user space code
returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have an extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, it removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of unfortunate race with uprobe_register, necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes the refcounting scheme in such a way as to not have
uprobes_tree keeping an extra refcount for struct uprobe. Instead,
uprobe_consumer assumes this extra refcount, which will be dropped
when the consumer is unregistered. Other than that, all the active users of
uprobe (the entry and return uprobe handling code) keep exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with the uprobe entry handling code in
handle_swbp(): its find_active_uprobe()->find_uprobe() lookup
can race with the uprobe being destroyed after the refcount drops to zero
(e.g., due to an uprobe_consumer unregistering). This is because
find_active_uprobe() bumps refcount without knowing for sure that
uprobe's refcount is already positive (and it has to be this way, there
is no way around that setup).

One way to solve this, attempted initially, is to use the
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump the refcount if the uprobe is already
destined to be destroyed. This, unfortunately, turns out to be rather
expensive due to the underlying cmpxchg() operation in
atomic_inc_not_zero(), and it scales rather poorly with an increased number of
parallel threads triggering uprobes.

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump the refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with an
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situation, we augment
the refcount with an epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET 
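
A conceptual sketch of the put path described above (not the actual patch code; the UPROBE_REFCNT_* constant names and values, and the exact helpers, are assumptions here):

static void put_uprobe_sketch(struct uprobe *uprobe)
{
        s64 v;

        /* one atomic op: decrement the refcount part, bump the epoch part */
        v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

        if (unlikely((u32)v == 0)) {            /* refcount part hit zero */
                bool destroy;

                write_lock(&uprobes_treelock);
                /*
                 * Double-check under the writer lock: if a racing get
                 * resurrected the uprobe (or another CPU won the
                 * 0 -> 1 -> 0 race), the epoch part of ->ref has moved on,
                 * the comparison fails, and we back off without freeing.
                 */
                destroy = atomic64_read(&uprobe->ref) == v;
                if (destroy && uprobe_is_active(uprobe))
                        rb_erase(&uprobe->rb_node, &uprobes_tree);
                write_unlock(&uprobes_treelock);

                if (!destroy)
                        return;
                /* actual freeing is RCU-delayed, per the v2 changelog */
        }
}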

[PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-01 Thread Andrii Nakryiko
It seems like uprobe_write_opcode() doesn't require a write-locked
mmap_sem; any lock (reader or writer) should be sufficient. This was
established in a discussion in [0], and looking through the existing code
seems to confirm that there is no need for a write-locked mmap_sem.

Fix the comment to state this clearly.

  [0] 
https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/

Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
uprobe_write_opcode()")
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081821fd529a..f87049c08ee9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-- 
2.43.0




[PATCH v2 01/12] uprobes: update outdated comment

2024-07-01 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore, so
drop the parts of the comment mentioning a NULL tsk; they are just confusing
at this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
if (result < 0)
-- 
2.43.0




[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows us to accommodate the much higher latency of taking a per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

v1->v2:
  - added RCU-delayed uprobe freeing to put_uprobe() (Masami);
  - fixed clean up handling in uprobe_register_batch (Jiri);
  - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg);
  - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid
comment instead (Oleg).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h   |  29 +-
 kernel/events/uprobes.c   | 550 --
 kernel/trace/bpf_trace.c  |  40 +-
 kernel/trace/trace_uprobe.c   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  22 +-
 5 files changed, 447 insertions(+), 247 deletions(-)

-- 
2.43.0




Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
 wrote:
>
> On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
> >
> > On Fri, 28 Jun 2024 09:34:26 -0700
> > Andrii Nakryiko  wrote:
> >
> > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > uprobe_consumer *uc)
> > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > > void *ctx)
> > > > > >
> > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > >
> > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > interface isn't the most convenient.
> > > >
> > > > OK, I understand it.
> > > >
> > > > >
> > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > waste. This is because we don't want to do anything that can fail in
> > > > > the detachment logic (so no temporary array allocation there).
> > > >
> > > > No need to change it, that sounds reasonable.
> > > >
> > >
> > > Great, thanks.
> > >
> > > > >
> > > > > Anyways, let me know how you feel about keeping this callback.
> > > >
> > > > IMHO, maybe the interface function is better to change to
> > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > side uses a linked list of structure, index access will need to
> > > > follow the list every time.
> > >
> > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > are only allowed to ask for the next consumer, then
> > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > its own internal index and remember ith instance. Which again means
> > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > isn't great.
> >
> > No, I think we can use a cursor variable as;
> >
> > int uprobe_register_batch(struct inode *inode,
> >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > {
> > void *cur = ctx;
> >
> > while ((uc = get_uprobe_consumer()) != NULL) {
> > ...
> > }
> >
> > cur = ctx;
> > while ((uc = get_uprobe_consumer()) != NULL) {
> > ...
> > }
> > }
> >
> > This can also remove the cnt.
>
> Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> for callers, but we have one right now, and might have another one, so
> not a big deal.
>

Actually, now that I started implementing this, I really-really don't
like it. In the example above you assume that by storing and reusing the
original ctx value you will reset the iteration to the very beginning.
This is not how it works in practice though. Ctx is most probably a
pointer to some struct somewhere with iteration state (e.g., array of
all uprobes + current index), and so get_uprobe_consumer() doesn't
update `void *ctx` itself, it updates the state of that struct.

And so there is no easy and clean way to reset this iterator without
adding another callback or something. At which point it becomes quite
cumbersome and convoluted.

How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
contract, which works for the only user right now, BPF multi-uprobes.
When it's time to add another consumer that works with a linked list,
we can add another more complicated contract that would do
iterator-style callbacks. This would be used by linked list users, and
we can transparently implement existing uprobe_register_batch()
contract on top of it by implementing a trivial iterator wrapper on
top of get_uprobe_consumer(idx, ctx) approach.
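
(To illustrate why that wrapper would be trivial -- random access makes a
"reset" free -- something along these lines would do. All names below are made
up, and uprobe_consumer_fn is assumed to be the get_uprobe_consumer(idx, ctx)
callback type from this series.)

struct uc_iter {
	uprobe_consumer_fn get_uprobe_consumer;
	void *ctx;
	int cnt;
	int idx;
};

static struct uprobe_consumer *uc_iter_next(struct uc_iter *it)
{
	if (it->idx >= it->cnt)
		return NULL;
	return it->get_uprobe_consumer(it->idx++, it->ctx);
}

static void uc_iter_reset(struct uc_iter *it)
{
	it->idx = 0;
}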

Let's not add unnecessary complications right now given we have a
clear path forward to add it later, if necessary, without breaking
anything. I'll send v2 without changes to get_uprobe_consumer() for
now, hopefully my above plan makes sense to you. Thanks!

> >
> > Thank you,
> >
> > >
> > > For now this API works well, I propose to keep it as is. For linked
> > > list case consumers would need to allocate one extra array or 

Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko
 wrote:
>
> On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 24 Jun 2024 17:21:36 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Anyways, under exclusive writer lock, we double-check that refcount
> > > didn't change and is still zero. If it is, we proceed with destruction,
> > > because at that point we have a guarantee that find_active_uprobe()
> > > can't successfully look up this uprobe instance, as it's going to be
> > > removed in destructor under writer lock. If, on the other hand,
> > > find_active_uprobe() managed to bump refcount from zero to one in
> > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > > pretend like atomic_dec_and_test() never returned true. There is no
> > > resource freeing or any other irreversible action taken up till this
> > > point, so we just exit early.
> > >
> > > One tricky part in the above is actually two CPUs racing and dropping
> > > refcnt to zero, and then attempting to free resources. This can happen
> > > as follows:
> > >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab 
> > > uprobes_treelock;
> > >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > > which point it decides that it needs to free uprobe as well.
> > >
> > > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > > uprobe, which is obviously wrong. To prevent this situation, we augment
> > > refcount with epoch counter, which is always incremented by 1 on either
> > > get or put operation. This allows those two CPUs above to disambiguate
> > > who should actually free uprobe (it's the CPU #1, because it has
> > > up-to-date epoch). See comments in the code and note the specific values
> > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > > that are incremented/decremented with a single atomic_add_and_return()
> > > operation. Note also a small and extremely rare (and thus having no
> > > effect on performance) need to clear the highest bit every 2 billion
> > > get/put operations to prevent high 32-bit counter from "bleeding over"
> > > into lower 32-bit counter.
> >
> > I have a question here.
> > Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets
> > the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref
> > again? e.g.
> >
> > CPU#0   CPU#1
> >
> > put_uprobe() {
> > atomic64_add_return()
> > __get_uprobe();
> > put_uprobe() {
> > 
> > kfree(uprobe)
> > }
> > write_lock(&uprobes_treelock);
> > atomic64_read(&uprobe->ref);
> > }
> >
> > I think it is very rare case, but I could not find any code to prevent
> > this scenario.
> >
>
> Yes, I think you are right, great catch!
>
> I concentrated on preventing double kfree() in this situation, and
> somehow convinced myself that eager kfree() is fine. But I think I'll
> need to delay freeing, probably with RCU. The problem is that we can't
> use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
> to be a sleepable variant of RCU. I'm thinking of using
> rcu_read_lock_trace(), the same variant of RCU we use for sleepable
> BPF programs (including sleepable uprobes). srcu might be too heavy
> for this.
>
> I'll try a few variants over the next few days and see how the
> performance looks.
>

So I think I'm going with the changes below, incorporated into this
patch (nothing else changes). __get_uprobe() doesn't need any added
RCU protection (we know that uprobe is alive). It's only put_uprobe()
that needs to guarantee RCU protection before we drop refcount all the
way until we know whether we are the winning destructor or not.

Good thing is that the changes are pretty minimal in code and also
don't seem to regress performance/scalability. So I'm pretty happy
about that, will send v2 soon.


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 07ad8b2e7508..41d9e37633ca 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -56,6 +56,7 @@ struct uprobe {
atomic64_t  ref;/* see
UPROBE_REFCNT_GET below */
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
+   struct rcu_head rcu;
struct list_headpending_list;
struct uprobe_consumer  *consumers;
struct inode*inode; /* Also hold a ref to inode */
@@ -623,7 +624,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct
mm_struct *mm, 

[PATCH RT 0/1] Linux v4.19.316-rt136-rc1

2024-07-01 Thread Daniel Wagner
Dear RT Folks,

This is the RT stable review cycle of patch 4.19.316-rt136-rc1.

Please scream at me if I messed something up. Please test the patches
too.

The -rc release is also available on kernel.org

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

on the v4.19-rt-next branch.

If all goes well, this patch will be converted to the next main
release on 2024-07-08.

Signing key fingerprint:

  5BF6 7BC5 0826 72CA BB45  ACAE 587C 5ECA 5D0A 306C

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Daniel

Changes from v4.19.315-rt135:


Daniel Wagner (1):
  Linux 4.19.316-rt136

 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.45.2




[PATCH RT 1/1] Linux 4.19.316-rt136

2024-07-01 Thread Daniel Wagner
v4.19.316-rt136-rc1 stable review patch.
If anyone has any objections, please let me know.

---


Signed-off-by: Daniel Wagner 
---
 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localversion-rt b/localversion-rt
index e3026053f01e..f824f53c19ea 100644
--- a/localversion-rt
+++ b/localversion-rt
@@ -1 +1 @@
--rt135
+-rt136
-- 
2.45.2




Re: [PATCH v3 0/2] ARM: dts: qcom-msm8226-samsung-ms013g: Add initial device tree

2024-07-01 Thread Rob Herring (Arm)


On Sun, 30 Jun 2024 13:29:13 +, Raymond Hackley wrote:
> Samsung Galaxy Grand 2 is a phone based on MSM8226. It's similar to the
> other Samsung devices based on MSM8226 with only a few minor differences.
> 
> The device trees contain initial support with:
>  - GPIO keys
>  - Regulator haptic
>  - SDHCI (internal and external storage)
>  - UART (on USB connector via the TI TSU6721 MUIC)
>  - Regulators
>  - Touchscreen
>  - Accelerometer
> 
> ---
> v2: Adjust l3, l15, l22 and l27 regulator voltages. Sort nodes.
> Set regulator-allow-set-load for vqmmc supplies.
> v3: Rename node haptic to vibrator.
> 
> 
> 


My bot found new DTB warnings on the .dts files added or changed in this
series.

Some warnings may be from an existing SoC .dtsi. Or perhaps the warnings
are fixed by another series. Ultimately, it is up to the platform
maintainer whether these warnings are acceptable or not. No need to reply
unless the platform maintainer has comments.

If you already ran DT checks and didn't see these error(s), then
make sure dt-schema is up to date:

  pip3 install dtschema --upgrade


New warnings running 'make CHECK_DTBS=y qcom/qcom-msm8226-samsung-ms013g.dtb' 
for 20240630132859.2885-1-raymondhack...@protonmail.com:

arch/arm/boot/dts/qcom/qcom-msm8226-samsung-ms013g.dtb: syscon@f9011000: 
compatible: ['syscon'] is too short
from schema $id: http://devicetree.org/schemas/mfd/syscon-common.yaml#








Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
>
> On Fri, 28 Jun 2024 09:34:26 -0700
> Andrii Nakryiko  wrote:
>
> > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > wrote:
> > >
> > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > uprobe_consumer *uc)
> > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > void *ctx)
> > > > >
> > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > allocate a temporary array of *uprobe_consumer instead?
> > > >
> > > > Yes, exactly, to avoid the need for allocating another array that
> > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > just have an array of `struct uprobe_consumer *`, because
> > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > interface isn't the most convenient.
> > >
> > > OK, I understand it.
> > >
> > > >
> > > > If you feel strongly, I can do an array, but this necessitates
> > > > allocating an extra array *and keeping it* for the entire duration of
> > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > waste. This is because we don't want to do anything that can fail in
> > > > the detachment logic (so no temporary array allocation there).
> > >
> > > No need to change it, that sounds reasonable.
> > >
> >
> > Great, thanks.
> >
> > > >
> > > > Anyways, let me know how you feel about keeping this callback.
> > >
> > > IMHO, maybe the interface function is better to change to
> > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > side uses a linked list of structure, index access will need to
> > > follow the list every time.
> >
> > This would be problematic. Note how we call get_uprobe_consumer(i,
> > ctx) with i going from 0 to N in multiple independent loops. So if we
> > are only allowed to ask for the next consumer, then
> > uprobe_register_batch and uprobe_unregister_batch would need to build
> > its own internal index and remember ith instance. Which again means
> > more allocations and possibly failing uprobe_unregister_batch(), which
> > isn't great.
>
> No, I think we can use a cursor variable as;
>
> int uprobe_register_batch(struct inode *inode,
>  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> {
> void *cur = ctx;
>
> while ((uc = get_uprobe_consumer()) != NULL) {
> ...
> }
>
> cur = ctx;
> while ((uc = get_uprobe_consumer()) != NULL) {
> ...
> }
> }
>
> This can also remove the cnt.

Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
for callers, but we have one right now, and might have another one, so
not a big deal.

>
> Thank you,
>
> >
> > For now this API works well, I propose to keep it as is. For linked
> > list case consumers would need to allocate one extra array or pay the
> > price of O(N) search (which might be ok, depending on how many uprobes
> > are being attached). But we don't have such consumers right now,
> > thankfully.
> >
> > >
> > > Thank you,
> > >
> > >
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) 
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) 
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH 3/4] remoteproc: k3-r5: k3_r5_rproc_stop: code reorder

2024-07-01 Thread Richard GENOUD

On 01/07/2024 18:35, Mathieu Poirier wrote:

On Mon, Jul 01, 2024 at 10:03:22AM +0200, Richard GENOUD wrote:

On 28/06/2024 23:18, Mathieu Poirier wrote:

On Fri, Jun 21, 2024 at 05:00:57PM +0200, Richard Genoud wrote:

In the next commit, a RP_MBOX_SHUTDOWN message will be sent in
k3_r5_rproc_stop() to the remote proc (in lockstep or not).
Thus, the sanity check "do not allow core 0 to stop before core 1"
should be moved to the beginning of the function so that the generic case
can be dealt with.

In order to have an easier patch to review, those actions are broken into
two patches:
- this patch: moving the sanity check at the beginning (No functional
change).
- next patch: doing the real job (sending shutdown messages to remote
procs before halting them).

Basically, we had:
- cluster_mode actions
- !cluster_mode sanity check
- !cluster_mode actions
And now:
- !cluster_mode sanity check
- cluster_mode actions
- !cluster_mode actions

Signed-off-by: Richard Genoud 
---
   drivers/remoteproc/ti_k3_r5_remoteproc.c | 24 ++--
   1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 1f18b08618c8..a2ead87952c7 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -636,16 +636,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
struct k3_r5_core *core1, *core = kproc->core;
int ret;
-   /* halt all applicable cores */
-   if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
-   list_for_each_entry(core, &cluster->cores, elem) {
-   ret = k3_r5_core_halt(core);
-   if (ret) {
-   core = list_prev_entry(core, elem);
-   goto unroll_core_halt;
-   }
-   }
-   } else {
+
+   if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
/* do not allow core 0 to stop before core 1 */
core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
elem);
@@ -656,6 +648,18 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
ret = -EPERM;
goto out;
}
+   }
+
+   /* halt all applicable cores */
+   if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
+   list_for_each_entry(core, &cluster->cores, elem) {
+   ret = k3_r5_core_halt(core);
+   if (ret) {
+   core = list_prev_entry(core, elem);
+   goto unroll_core_halt;
+   }
+   }
+   } else {
ret = k3_r5_core_halt(core);
if (ret)


With this patch, the "else" in this "if" condition is coupled with the "if" from
the lockstep mode, making the code extremely hard to read.  The original code
has a k3_r5_core_halt() in both "if" conditions, making the conditions
independent from one another.


I'm not sure I understand what you mean.


With your patch applied I get the following: https://pastebin.com/yTZ0pKcS

Let's say the R5 is in split mode and k3_r5_rproc_stop() is called on core1.  The
if() that deals with that condition is on line 10, while the function that halts
the remote processor is on line 34, part of the else clause that handles lockstep
mode.  The two if() clauses are entangled and nothing good can come out of that.


Ok, I see your point now.

Thanks !




Anyway, I'm not happy with this diff; it doesn't reflect what was intended
(which is moving the check "core 0 should not be stopped before core 1" to the
beginning)

Tweaking around with the diff algorithms, I came up with something way easier to 
read I think:

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 1f18b08618c8..a2ead87952c7 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -636,6 +636,20 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
 struct k3_r5_core *core1, *core = kproc->core;
 int ret;
+
+   if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
+   /* do not allow core 0 to stop before core 1 */
+   core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
+   elem);
+   if (core != core1 && core1->rproc->state != RPROC_OFFLINE &&
+   core1->rproc->state != RPROC_SUSPENDED) {
+   dev_err(dev, "%s: can not stop core 0 before core 1\n",
+   __func__);
+   ret = -EPERM;
+   goto out;
+   }
+   }
+
 /* halt all applicable cores */
 if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
 list_for_each_entry(core, &cluster->cores, elem) {
@@ -646,16 +660,6 @@ static int 

Re: [PATCH 4/4] remoteproc: k3-r5: support for graceful stop of remote cores

2024-07-01 Thread Richard GENOUD

On 29/06/2024 00:50, Andrew Davis wrote:

On 6/21/24 10:00 AM, Richard Genoud wrote:

Introduce software IPC handshake between the K3-R5 remote proc driver
and the R5 MCU to gracefully stop/reset the remote core.

Upon a stop request, K3-R5 remote proc driver sends a RP_MBOX_SHUTDOWN
mailbox message to the remote R5 core.
The remote core is expected to:
- relinquish all the resources acquired through Device Manager (DM)
- disable its interrupts
- send back a mailbox acknowledgment RP_MBOX_SHUTDOWN_ACK
- enter WFI state.

Meanwhile, the K3-R5 remote proc driver does:
- wait for the RP_MBOX_SHUTDOWN_ACK from the remote core
- wait for the remote proc to enter WFI state
- reset the remote core through device manager

Based on work from: Hari Nagalla 

Signed-off-by: Richard Genoud 
---
  drivers/remoteproc/omap_remoteproc.h |  9 +-
  drivers/remoteproc/ti_k3_r5_remoteproc.c | 40 
  2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/omap_remoteproc.h 
b/drivers/remoteproc/omap_remoteproc.h

index 828e13256c02..c008f11fa2a4 100644
--- a/drivers/remoteproc/omap_remoteproc.h
+++ b/drivers/remoteproc/omap_remoteproc.h
@@ -42,6 +42,11 @@
   * @RP_MBOX_SUSPEND_CANCEL: a cancel suspend response from a remote 
processor

   * on a suspend request
   *
+ * @RP_MBOX_SHUTDOWN: shutdown request for the remote processor
+ *
+ * @RP_MBOX_SHUTDOWN_ACK: successful response from remote processor 
for a
+ * shutdown request. The remote processor should be in WFI state 
short after.

+ *
   * Introduce new message definitions if any here.
   *
   * @RP_MBOX_END_MSG: Indicates end of known/defined messages from 
remote core

@@ -59,7 +64,9 @@ enum omap_rp_mbox_messages {
  RP_MBOX_SUSPEND_SYSTEM    = 0xFF11,
  RP_MBOX_SUSPEND_ACK    = 0xFF12,
  RP_MBOX_SUSPEND_CANCEL    = 0xFF13,
-    RP_MBOX_END_MSG    = 0xFF14,
+    RP_MBOX_SHUTDOWN    = 0xFF14,
+    RP_MBOX_SHUTDOWN_ACK    = 0xFF15,
+    RP_MBOX_END_MSG    = 0xFF16,
  };
  #endif /* _OMAP_RPMSG_H */
diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c

index a2ead87952c7..918a15e1dd9a 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -21,6 +21,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
@@ -172,8 +173,23 @@ struct k3_r5_rproc {
  struct k3_r5_core *core;
  struct k3_r5_mem *rmem;
  int num_rmems;
+    struct completion shutdown_complete;
  };
+/*
+ * This will return true if the remote core is in Wait For Interrupt 
state.

+ */
+static bool k3_r5_is_core_in_wfi(struct k3_r5_core *core)
+{
+    int ret;
+    u64 boot_vec;
+    u32 cfg, ctrl, stat;
+
+    ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl, &stat);

+
+    return !ret ? !!(stat & PROC_BOOT_STATUS_FLAG_R5_WFI) : false;


Too fancy for me :) Just return if (ret) right after get_status().

Ok, too much punctuation :)
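
(For illustration, the suggested early return amounts to something like the
following sketch, based on the code quoted above:)

static bool k3_r5_is_core_in_wfi(struct k3_r5_core *core)
{
	u64 boot_vec;
	u32 cfg, ctrl, stat;
	int ret;

	ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl, &stat);
	if (ret)
		return false;

	return !!(stat & PROC_BOOT_STATUS_FLAG_R5_WFI);
}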



Looks like this function is called in a polling loop, if
ti_sci_proc_get_status() fails once, it won't get better,
no need to keep checking, we should just error out of
the polling loop.

Ok


Thanks !


Andrew


+}
+
  /**
   * k3_r5_rproc_mbox_callback() - inbound mailbox message handler
   * @client: mailbox client pointer used for requesting the mailbox 
channel
@@ -209,6 +225,10 @@ static void k3_r5_rproc_mbox_callback(struct 
mbox_client *client, void *data)

  case RP_MBOX_ECHO_REPLY:
  dev_info(dev, "received echo reply from %s\n", name);
  break;
+    case RP_MBOX_SHUTDOWN_ACK:
+    dev_dbg(dev, "received shutdown_ack from %s\n", name);
+    complete(&kproc->shutdown_complete);
+    break;
  default:
  /* silently handle all other valid messages */
  if (msg >= RP_MBOX_READY && msg < RP_MBOX_END_MSG)
@@ -634,6 +654,7 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
  struct k3_r5_cluster *cluster = kproc->cluster;
  struct device *dev = kproc->dev;
  struct k3_r5_core *core1, *core = kproc->core;
+    bool wfi;
  int ret;
@@ -650,6 +671,24 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
  }
  }
+    /* Send SHUTDOWN message to remote proc */
+    reinit_completion(&kproc->shutdown_complete);
+    ret = mbox_send_message(kproc->mbox, (void *)RP_MBOX_SHUTDOWN);
+    if (ret < 0) {
+    dev_err(dev, "Sending SHUTDOWN message failed: %d. Halting 
core anyway.\n", ret);

+    } else {
+    ret = wait_for_completion_timeout(&kproc->shutdown_complete,
+  msecs_to_jiffies(1000));
+    if (ret == 0) {
+    dev_err(dev, "Timeout waiting SHUTDOWN_ACK message. 
Halting core anyway.\n");

+    } else {
+    ret = readx_poll_timeout(k3_r5_is_core_in_wfi, core,
+ wfi, wfi, 200, 2000);
+    if (ret)
+    dev_err(dev, "Timeout waiting for remote proc to be 
in WFI state. Halting core 

[PATCHv2 bpf-next 9/9] selftests/bpf: Add uprobe session consumers test

2024-07-01 Thread Jiri Olsa
Adding a test that attaches/detaches multiple consumers on a
single uprobe and verifies all were hit as expected.

Signed-off-by: Jiri Olsa 
---
 .../bpf/prog_tests/uprobe_multi_test.c| 203 ++
 .../progs/uprobe_multi_session_consumers.c|  53 +
 2 files changed, 256 insertions(+)
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_consumers.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
index b521590fdbb9..83eac954cf00 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
@@ -9,6 +9,7 @@
 #include "uprobe_multi_session.skel.h"
 #include "uprobe_multi_session_cookie.skel.h"
 #include "uprobe_multi_session_recursive.skel.h"
+#include "uprobe_multi_session_consumers.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "testing_helpers.h"
 #include "../sdt.h"
@@ -739,6 +740,206 @@ static void test_session_recursive_skel_api(void)
uprobe_multi_session_recursive__destroy(skel);
 }
 
+static int uprobe_attach(struct uprobe_multi_session_consumers *skel, int bit)
+{
+   struct bpf_program **prog = &skel->progs.uprobe_0 + bit;
+   struct bpf_link **link = &skel->links.uprobe_0 + bit;
+   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts);
+
+   /*
+* bit: 0,1 uprobe session
+* bit: 2,3 uprobe entry
+* bit: 4,5 uprobe return
+*/
+   opts.session = bit < 2;
+   opts.retprobe = bit == 4 || bit == 5;
+
+   *link = bpf_program__attach_uprobe_multi(*prog, 0, "/proc/self/exe",
+                                            "uprobe_session_consumer_test",
+                                            &opts);
+   if (!ASSERT_OK_PTR(*link, "bpf_program__attach_uprobe_multi"))
+   return -1;
+   return 0;
+}
+
+static void uprobe_detach(struct uprobe_multi_session_consumers *skel, int bit)
+{
+   struct bpf_link **link = &skel->links.uprobe_0 + bit;
+
+   bpf_link__destroy(*link);
+   *link = NULL;
+}
+
+static bool test_bit(int bit, unsigned long val)
+{
+   return val & (1 << bit);
+}
+
+noinline int
+uprobe_session_consumer_test(struct uprobe_multi_session_consumers *skel,
+unsigned long before, unsigned long after)
+{
+   int bit;
+
+   /* detach uprobe for each unset bit in 'before' state ... */
+   for (bit = 0; bit < 6; bit++) {
+   if (test_bit(bit, before) && !test_bit(bit, after))
+   uprobe_detach(skel, bit);
+   }
+
+   /* ... and attach all new bits in 'after' state */
+   for (bit = 0; bit < 6; bit++) {
+   if (!test_bit(bit, before) && test_bit(bit, after)) {
+   if (!ASSERT_OK(uprobe_attach(skel, bit), 
"uprobe_attach_after"))
+   return -1;
+   }
+   }
+   return 0;
+}
+
+static void session_consumer_test(struct uprobe_multi_session_consumers *skel,
+ unsigned long before, unsigned long after)
+{
+   int err, bit;
+
+   /* 'before' is each, we attach uprobe for every set bit */
+   for (bit = 0; bit < 6; bit++) {
+   if (test_bit(bit, before)) {
+   if (!ASSERT_OK(uprobe_attach(skel, bit), 
"uprobe_attach_before"))
+   goto cleanup;
+   }
+   }
+
+   err = uprobe_session_consumer_test(skel, before, after);
+   if (!ASSERT_EQ(err, 0, "uprobe_session_consumer_test"))
+   goto cleanup;
+
+   for (bit = 0; bit < 6; bit++) {
+   const char *fmt = "BUG";
+   __u64 val = 0;
+
+   if (bit == 0) {
+   /*
+* session with return
+*  +1 if defined in 'before'
+*  +1 if defined in 'after'
+*/
+   if (test_bit(bit, before)) {
+   val++;
+   if (test_bit(bit, after))
+   val++;
+   }
+   fmt = "bit 0  : session with return";
+   } else if (bit == 1) {
+   /*
+* session without return
+*   +1 if defined in 'before'
+*/
+   if (test_bit(bit, before))
+   val++;
+   fmt = "bit 1  : session with NO return";
+   } else if (bit < 4) {
+   /*
+* uprobe entry
+*   +1 if define in 'before'
+*/
+   if (test_bit(bit, before))
+   val++;
+   fmt = "bit 3/4: uprobe";
+   } else {
+ 

[PATCHv2 bpf-next 8/9] selftests/bpf: Add uprobe session recursive test

2024-07-01 Thread Jiri Olsa
Adding uprobe session test that verifies the cookie value is stored
properly when a single uprobe-ed function is executed recursively.

Signed-off-by: Jiri Olsa 
---
 .../bpf/prog_tests/uprobe_multi_test.c| 57 +++
 .../progs/uprobe_multi_session_recursive.c| 44 ++
 2 files changed, 101 insertions(+)
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
index d5f78fc61013..b521590fdbb9 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
@@ -8,6 +8,7 @@
 #include "uprobe_multi_usdt.skel.h"
 #include "uprobe_multi_session.skel.h"
 #include "uprobe_multi_session_cookie.skel.h"
+#include "uprobe_multi_session_recursive.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "testing_helpers.h"
 #include "../sdt.h"
@@ -34,6 +35,12 @@ noinline void usdt_trigger(void)
STAP_PROBE(test, pid_filter_usdt);
 }
 
+noinline void uprobe_session_recursive(int i)
+{
+   if (i)
+   uprobe_session_recursive(i - 1);
+}
+
 struct child {
int go[2];
int c2p[2]; /* child -> parent channel */
@@ -684,6 +691,54 @@ static void test_session_cookie_skel_api(void)
uprobe_multi_session_cookie__destroy(skel);
 }
 
+static void test_session_recursive_skel_api(void)
+{
+   struct uprobe_multi_session_recursive *skel = NULL;
+   int i, err;
+
+   skel = uprobe_multi_session_recursive__open_and_load();
+   if (!ASSERT_OK_PTR(skel, 
"uprobe_multi_session_recursive__open_and_load"))
+   goto cleanup;
+
+   skel->bss->pid = getpid();
+
+   err = uprobe_multi_session_recursive__attach(skel);
+   if (!ASSERT_OK(err, "uprobe_multi_session_recursive__attach"))
+   goto cleanup;
+
+   for (i = 0; i < ARRAY_SIZE(skel->bss->test_uprobe_cookie_entry); i++)
+   skel->bss->test_uprobe_cookie_entry[i] = i + 1;
+
+   uprobe_session_recursive(5);
+
+   /*
+* entry uprobe:
+* uprobe_session_recursive(5) { *cookie = 1, return 0
+*   uprobe_session_recursive(4) {   *cookie = 2, return 1
+* uprobe_session_recursive(3) { *cookie = 3, return 0
+*   uprobe_session_recursive(2) {   *cookie = 4, return 1
+* uprobe_session_recursive(1) { *cookie = 5, return 0
+*   uprobe_session_recursive(0) {   *cookie = 6, return 1
+*  return uprobe:
+*   } i = 0  not executed
+* } i = 1
test_uprobe_cookie_return[0] = 5
+*   } i = 2  not executed
+* } i = 3
test_uprobe_cookie_return[1] = 3
+*   } i = 4  not executed
+* } i = 5
test_uprobe_cookie_return[2] = 1
+*/
+
+   ASSERT_EQ(skel->bss->idx_entry, 6, "idx_entry");
+   ASSERT_EQ(skel->bss->idx_return, 3, "idx_return");
+
+   ASSERT_EQ(skel->bss->test_uprobe_cookie_return[0], 5, 
"test_uprobe_cookie_return[0]");
+   ASSERT_EQ(skel->bss->test_uprobe_cookie_return[1], 3, 
"test_uprobe_cookie_return[1]");
+   ASSERT_EQ(skel->bss->test_uprobe_cookie_return[2], 1, 
"test_uprobe_cookie_return[2]");
+
+cleanup:
+   uprobe_multi_session_recursive__destroy(skel);
+}
+
 static void test_bench_attach_uprobe(void)
 {
long attach_start_ns = 0, attach_end_ns = 0;
@@ -776,4 +831,6 @@ void test_uprobe_multi_test(void)
test_session_skel_api();
if (test__start_subtest("session_cookie"))
test_session_cookie_skel_api();
+   if (test__start_subtest("session_cookie_recursive"))
+   test_session_recursive_skel_api();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c 
b/tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c
new file mode 100644
index ..8fbcd69fae22
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include "bpf_kfuncs.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+int pid = 0;
+
+int idx_entry = 0;
+int idx_return = 0;
+
+__u64 test_uprobe_cookie_entry[6];
+__u64 test_uprobe_cookie_return[3];
+
+static int check_cookie(void)
+{
+   __u64 *cookie = bpf_session_cookie();
+
+   if (bpf_session_is_return()) {
+   if (idx_return >= ARRAY_SIZE(test_uprobe_cookie_return))
+   return 1;
+   test_uprobe_cookie_return[idx_return++] = 

[PATCHv2 bpf-next 7/9] selftests/bpf: Add uprobe session cookie test

2024-07-01 Thread Jiri Olsa
Adding uprobe session test that verifies the cookie value
gets properly propagated from the entry to the return program.

Signed-off-by: Jiri Olsa 
---
 .../bpf/prog_tests/uprobe_multi_test.c| 31 
 .../bpf/progs/uprobe_multi_session_cookie.c   | 48 +++
 2 files changed, 79 insertions(+)
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
index cd9581f46c73..d5f78fc61013 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
@@ -7,6 +7,7 @@
 #include "uprobe_multi_bench.skel.h"
 #include "uprobe_multi_usdt.skel.h"
 #include "uprobe_multi_session.skel.h"
+#include "uprobe_multi_session_cookie.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "testing_helpers.h"
 #include "../sdt.h"
@@ -655,6 +656,34 @@ static void test_session_skel_api(void)
uprobe_multi_session__destroy(skel);
 }
 
+static void test_session_cookie_skel_api(void)
+{
+   struct uprobe_multi_session_cookie *skel = NULL;
+   int err;
+
+   skel = uprobe_multi_session_cookie__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "fentry_raw_skel_load"))
+   goto cleanup;
+
+   skel->bss->pid = getpid();
+
+   err = uprobe_multi_session_cookie__attach(skel);
+   if (!ASSERT_OK(err, " kprobe_multi_session__attach"))
+   goto cleanup;
+
+   /* trigger all probes */
+   uprobe_multi_func_1();
+   uprobe_multi_func_2();
+   uprobe_multi_func_3();
+
+   ASSERT_EQ(skel->bss->test_uprobe_1_result, 1, "test_uprobe_1_result");
+   ASSERT_EQ(skel->bss->test_uprobe_2_result, 2, "test_uprobe_2_result");
+   ASSERT_EQ(skel->bss->test_uprobe_3_result, 3, "test_uprobe_3_result");
+
+cleanup:
+   uprobe_multi_session_cookie__destroy(skel);
+}
+
 static void test_bench_attach_uprobe(void)
 {
long attach_start_ns = 0, attach_end_ns = 0;
@@ -745,4 +774,6 @@ void test_uprobe_multi_test(void)
test_attach_api_fails();
if (test__start_subtest("session"))
test_session_skel_api();
+   if (test__start_subtest("session_cookie"))
+   test_session_cookie_skel_api();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c 
b/tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c
new file mode 100644
index ..5befdf944dc6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include "bpf_kfuncs.h"
+
+char _license[] SEC("license") = "GPL";
+
+int pid = 0;
+
+__u64 test_uprobe_1_result = 0;
+__u64 test_uprobe_2_result = 0;
+__u64 test_uprobe_3_result = 0;
+
+static int check_cookie(__u64 val, __u64 *result)
+{
+   __u64 *cookie;
+
+   if (bpf_get_current_pid_tgid() >> 32 != pid)
+   return 1;
+
+   cookie = bpf_session_cookie();
+
+   if (bpf_session_is_return())
+   *result = *cookie == val ? val : 0;
+   else
+   *cookie = val;
+   return 0;
+}
+
+SEC("uprobe.session//proc/self/exe:uprobe_multi_func_1")
+int uprobe_1(struct pt_regs *ctx)
+{
+   return check_cookie(1, &test_uprobe_1_result);
+}
+
+SEC("uprobe.session//proc/self/exe:uprobe_multi_func_2")
+int uprobe_2(struct pt_regs *ctx)
+{
+   return check_cookie(2, &test_uprobe_2_result);
+}
+
+SEC("uprobe.session//proc/self/exe:uprobe_multi_func_3")
+int uprobe_3(struct pt_regs *ctx)
+{
+   return check_cookie(3, &test_uprobe_3_result);
+}
-- 
2.45.2




[PATCHv2 bpf-next 6/9] selftests/bpf: Add uprobe session test

2024-07-01 Thread Jiri Olsa
Adding uprobe session test and testing that the entry program
return value controls execution of the return probe program.

Signed-off-by: Jiri Olsa 
---
 .../bpf/prog_tests/uprobe_multi_test.c| 42 +++
 .../bpf/progs/uprobe_multi_session.c  | 53 +++
 2 files changed, 95 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_multi_session.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
index bf6ca8e3eb13..cd9581f46c73 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
@@ -6,6 +6,7 @@
 #include "uprobe_multi.skel.h"
 #include "uprobe_multi_bench.skel.h"
 #include "uprobe_multi_usdt.skel.h"
+#include "uprobe_multi_session.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "testing_helpers.h"
 #include "../sdt.h"
@@ -615,6 +616,45 @@ static void test_link_api(void)
__test_link_api(child);
 }
 
+static void test_session_skel_api(void)
+{
+   struct uprobe_multi_session *skel = NULL;
+   LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+   struct bpf_link *link = NULL;
+   int err;
+
+   skel = uprobe_multi_session__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "fentry_raw_skel_load"))
+   goto cleanup;
+
+   skel->bss->pid = getpid();
+
+   err = uprobe_multi_session__attach(skel);
+   if (!ASSERT_OK(err, " uprobe_multi_session__attach"))
+   goto cleanup;
+
+   /* trigger all probes */
+   skel->bss->uprobe_multi_func_1_addr = (__u64) uprobe_multi_func_1;
+   skel->bss->uprobe_multi_func_2_addr = (__u64) uprobe_multi_func_2;
+   skel->bss->uprobe_multi_func_3_addr = (__u64) uprobe_multi_func_3;
+
+   uprobe_multi_func_1();
+   uprobe_multi_func_2();
+   uprobe_multi_func_3();
+
+   /*
+* We expect 2 for uprobe_multi_func_2 because it runs both 
entry/return probe,
+* uprobe_multi_func_[13] run just the entry probe.
+*/
+   ASSERT_EQ(skel->bss->uprobe_session_result[0], 1, 
"uprobe_multi_func_1_result");
+   ASSERT_EQ(skel->bss->uprobe_session_result[1], 2, 
"uprobe_multi_func_2_result");
+   ASSERT_EQ(skel->bss->uprobe_session_result[2], 1, 
"uprobe_multi_func_3_result");
+
+cleanup:
+   bpf_link__destroy(link);
+   uprobe_multi_session__destroy(skel);
+}
+
 static void test_bench_attach_uprobe(void)
 {
long attach_start_ns = 0, attach_end_ns = 0;
@@ -703,4 +743,6 @@ void test_uprobe_multi_test(void)
test_bench_attach_usdt();
if (test__start_subtest("attach_api_fails"))
test_attach_api_fails();
+   if (test__start_subtest("session"))
+   test_session_skel_api();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_multi_session.c 
b/tools/testing/selftests/bpf/progs/uprobe_multi_session.c
new file mode 100644
index ..72c00ae68372
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_multi_session.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include "bpf_kfuncs.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+__u64 uprobe_multi_func_1_addr = 0;
+__u64 uprobe_multi_func_2_addr = 0;
+__u64 uprobe_multi_func_3_addr = 0;
+
+__u64 uprobe_session_result[3];
+
+int pid = 0;
+
+static int uprobe_multi_check(void *ctx, bool is_return)
+{
+   const __u64 funcs[] = {
+   uprobe_multi_func_1_addr,
+   uprobe_multi_func_2_addr,
+   uprobe_multi_func_3_addr,
+   };
+   unsigned int i;
+   __u64 addr;
+
+   if (bpf_get_current_pid_tgid() >> 32 != pid)
+   return 1;
+
+   addr = bpf_get_func_ip(ctx);
+
+   for (i = 0; i < ARRAY_SIZE(funcs); i++) {
+   if (funcs[i] == addr) {
+   uprobe_session_result[i]++;
+   break;
+   }
+   }
+
+   /* only uprobe_multi_func_2 executes return probe */
+   if ((addr == uprobe_multi_func_1_addr) ||
+   (addr == uprobe_multi_func_3_addr))
+   return 1;
+
+   return 0;
+}
+
+SEC("uprobe.session//proc/self/exe:uprobe_multi_func_*")
+int uprobe(struct pt_regs *ctx)
+{
+   return uprobe_multi_check(ctx, bpf_session_is_return());
+}
-- 
2.45.2




[PATCHv2 bpf-next 5/9] libbpf: Add uprobe session attach type names to attach_type_name

2024-07-01 Thread Jiri Olsa
Adding the uprobe session attach type name to attach_type_name,
so that libbpf_bpf_attach_type_str() returns the proper string name
for the BPF_TRACE_UPROBE_SESSION attach type.

Signed-off-by: Jiri Olsa 
---
 tools/lib/bpf/libbpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 492a8eb4d047..e69a54264580 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -133,6 +133,7 @@ static const char * const attach_type_name[] = {
[BPF_NETKIT_PRIMARY]= "netkit_primary",
[BPF_NETKIT_PEER]   = "netkit_peer",
[BPF_TRACE_KPROBE_SESSION]  = "trace_kprobe_session",
+   [BPF_TRACE_UPROBE_SESSION]  = "trace_uprobe_session",
 };
 
 static const char * const link_type_name[] = {
-- 
2.45.2




[PATCHv2 bpf-next 4/9] libbpf: Add support for uprobe multi session attach

2024-07-01 Thread Jiri Olsa
Adding support to attach a program in uprobe session mode
with the bpf_program__attach_uprobe_multi function.

Adding a 'session' bool to the bpf_uprobe_multi_opts struct that
allows the attachment to load and attach the bpf program via a
uprobe session, i.e. to create a uprobe multi session.

Also adding a new program loader section that allows:
  SEC("uprobe.session/bpf_fentry_test*")

and loads/attaches the uprobe program as a uprobe session.
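
For illustration, usage from an application would look roughly like the sketch
below (the skeleton program, binary path and function pattern names are made up):

	LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
		.session = true,
	);
	struct bpf_link *link;

	link = bpf_program__attach_uprobe_multi(skel->progs.my_session_prog,
						-1 /* any pid */, "/proc/self/exe",
						"my_traced_func*", &opts);
	if (!link)
		return -errno;	/* errno is set by libbpf on failure */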

Signed-off-by: Jiri Olsa 
---
 tools/lib/bpf/bpf.c|  1 +
 tools/lib/bpf/libbpf.c | 50 --
 tools/lib/bpf/libbpf.h |  4 +++-
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 2a4c71501a17..becdfa701c75 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -776,6 +776,7 @@ int bpf_link_create(int prog_fd, int target_fd,
return libbpf_err(-EINVAL);
break;
case BPF_TRACE_UPROBE_MULTI:
+   case BPF_TRACE_UPROBE_SESSION:
attr.link_create.uprobe_multi.flags = OPTS_GET(opts, 
uprobe_multi.flags, 0);
attr.link_create.uprobe_multi.cnt = OPTS_GET(opts, 
uprobe_multi.cnt, 0);
attr.link_create.uprobe_multi.path = ptr_to_u64(OPTS_GET(opts, 
uprobe_multi.path, 0));
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4a28fac4908a..492a8eb4d047 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9344,6 +9344,7 @@ static int attach_trace(const struct bpf_program *prog, 
long cookie, struct bpf_
 static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, 
struct bpf_link **link);
 static int attach_kprobe_session(const struct bpf_program *prog, long cookie, 
struct bpf_link **link);
 static int attach_uprobe_multi(const struct bpf_program *prog, long cookie, 
struct bpf_link **link);
+static int attach_uprobe_session(const struct bpf_program *prog, long cookie, 
struct bpf_link **link);
 static int attach_lsm(const struct bpf_program *prog, long cookie, struct 
bpf_link **link);
 static int attach_iter(const struct bpf_program *prog, long cookie, struct 
bpf_link **link);
 
@@ -9362,6 +9363,7 @@ static const struct bpf_sec_def section_defs[] = {
SEC_DEF("kprobe.session+",  KPROBE, BPF_TRACE_KPROBE_SESSION, 
SEC_NONE, attach_kprobe_session),
SEC_DEF("uprobe.multi+",KPROBE, BPF_TRACE_UPROBE_MULTI, 
SEC_NONE, attach_uprobe_multi),
SEC_DEF("uretprobe.multi+", KPROBE, BPF_TRACE_UPROBE_MULTI, 
SEC_NONE, attach_uprobe_multi),
+   SEC_DEF("uprobe.session+",  KPROBE, BPF_TRACE_UPROBE_SESSION, 
SEC_NONE, attach_uprobe_session),
SEC_DEF("uprobe.multi.s+",  KPROBE, BPF_TRACE_UPROBE_MULTI, 
SEC_SLEEPABLE, attach_uprobe_multi),
SEC_DEF("uretprobe.multi.s+",   KPROBE, BPF_TRACE_UPROBE_MULTI, 
SEC_SLEEPABLE, attach_uprobe_multi),
SEC_DEF("ksyscall+",KPROBE, 0, SEC_NONE, attach_ksyscall),
@@ -11698,6 +11700,40 @@ static int attach_uprobe_multi(const struct 
bpf_program *prog, long cookie, stru
return ret;
 }
 
+static int attach_uprobe_session(const struct bpf_program *prog, long cookie, 
struct bpf_link **link)
+{
+   char *binary_path = NULL, *func_name = NULL;
+   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+   .session = true,
+   );
+   int n, ret = -EINVAL;
+   const char *spec;
+
+   *link = NULL;
+
+   spec = prog->sec_name + sizeof("uprobe.session/") - 1;
+   n = sscanf(spec, "%m[^:]:%m[^\n]",
+  &binary_path, &func_name);
+
+   switch (n) {
+   case 1:
+   /* but auto-attach is impossible. */
+   ret = 0;
+   break;
+   case 2:
+   *link = bpf_program__attach_uprobe_multi(prog, -1, binary_path, func_name, &opts);
+   ret = *link ? 0 : -errno;
+   break;
+   default:
+   pr_warn("prog '%s': invalid format of section definition 
'%s'\n", prog->name,
+   prog->sec_name);
+   break;
+   }
+   free(binary_path);
+   free(func_name);
+   return ret;
+}
+
 static void gen_uprobe_legacy_event_name(char *buf, size_t buf_sz,
 const char *binary_path, uint64_t 
offset)
 {
@@ -11932,10 +11968,12 @@ bpf_program__attach_uprobe_multi(const struct 
bpf_program *prog,
const unsigned long *ref_ctr_offsets = NULL, *offsets = NULL;
LIBBPF_OPTS(bpf_link_create_opts, lopts);
unsigned long *resolved_offsets = NULL;
+   enum bpf_attach_type attach_type;
int err = 0, link_fd, prog_fd;
struct bpf_link *link = NULL;
char errmsg[STRERR_BUFSIZE];
char full_path[PATH_MAX];
+   bool retprobe, session;
const __u64 *cookies;
const char **syms;
size_t cnt;
@@ -12006,12 +12044,20 @@ bpf_program__attach_uprobe_multi(const struct 
bpf_program *prog,
offsets = resolved_offsets;

[PATCHv2 bpf-next 3/9] bpf: Add support for uprobe multi session context

2024-07-01 Thread Jiri Olsa
Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_uprobe_multi_run_ctx, so the session data can be retrieved
from uprobe_multi link.

Plus granting session kfuncs access to uprobe session programs.
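
For context, the existing session kfuncs (from the kprobe session work) resolve
the session layer roughly like this (shown slightly simplified), which is what
placing bpf_session_run_ctx in the middle enables for the uprobe_multi run
context as well:

	__bpf_kfunc __u64 *bpf_session_cookie(void)
	{
		struct bpf_session_run_ctx *session_ctx;

		session_ctx = container_of(current->bpf_ctx,
					   struct bpf_session_run_ctx, run_ctx);
		return session_ctx->data;
	}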

Signed-off-by: Jiri Olsa 
---
 kernel/trace/bpf_trace.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 1b19c1cdb5e1..d431b880ca11 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3184,7 +3184,7 @@ struct bpf_uprobe_multi_link {
 };
 
 struct bpf_uprobe_multi_run_ctx {
-   struct bpf_run_ctx run_ctx;
+   struct bpf_session_run_ctx session_ctx;
unsigned long entry_ip;
struct bpf_uprobe *uprobe;
 };
@@ -3297,10 +3297,15 @@ static const struct bpf_link_ops 
bpf_uprobe_multi_link_lops = {
 
 static int uprobe_prog_run(struct bpf_uprobe *uprobe,
   unsigned long entry_ip,
-  struct pt_regs *regs)
+  struct pt_regs *regs,
+  bool is_return, void *data)
 {
struct bpf_uprobe_multi_link *link = uprobe->link;
struct bpf_uprobe_multi_run_ctx run_ctx = {
+   .session_ctx = {
+   .is_return = is_return,
+   .data = data,
+   },
.entry_ip = entry_ip,
.uprobe = uprobe,
};
@@ -3319,7 +3324,7 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
 
migrate_disable();
 
-   old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
+   old_run_ctx = bpf_set_run_ctx(&run_ctx.session_ctx.run_ctx);
err = bpf_prog_run(link->link.prog, regs);
bpf_reset_run_ctx(old_run_ctx);
 
@@ -3349,7 +3354,7 @@ uprobe_multi_link_handler(struct uprobe_consumer *con, 
struct pt_regs *regs,
struct bpf_uprobe *uprobe;
 
uprobe = container_of(con, struct bpf_uprobe, consumer);
-   return uprobe_prog_run(uprobe, instruction_pointer(regs), regs);
+   return uprobe_prog_run(uprobe, instruction_pointer(regs), regs, false, 
data);
 }
 
 static int
@@ -3359,14 +3364,15 @@ uprobe_multi_link_ret_handler(struct uprobe_consumer 
*con, unsigned long func, s
struct bpf_uprobe *uprobe;
 
uprobe = container_of(con, struct bpf_uprobe, consumer);
-   return uprobe_prog_run(uprobe, func, regs);
+   return uprobe_prog_run(uprobe, func, regs, true, data);
 }
 
 static u64 bpf_uprobe_multi_entry_ip(struct bpf_run_ctx *ctx)
 {
struct bpf_uprobe_multi_run_ctx *run_ctx;
 
-   run_ctx = container_of(current->bpf_ctx, struct 
bpf_uprobe_multi_run_ctx, run_ctx);
+   run_ctx = container_of(current->bpf_ctx, struct 
bpf_uprobe_multi_run_ctx,
+  session_ctx.run_ctx);
return run_ctx->entry_ip;
 }
 
@@ -3374,7 +3380,8 @@ static u64 bpf_uprobe_multi_cookie(struct bpf_run_ctx 
*ctx)
 {
struct bpf_uprobe_multi_run_ctx *run_ctx;
 
-   run_ctx = container_of(current->bpf_ctx, struct 
bpf_uprobe_multi_run_ctx, run_ctx);
+   run_ctx = container_of(current->bpf_ctx, struct 
bpf_uprobe_multi_run_ctx,
+  session_ctx.run_ctx);
return run_ctx->uprobe->cookie;
 }
 
@@ -3565,7 +3572,7 @@ static int bpf_kprobe_multi_filter(const struct bpf_prog 
*prog, u32 kfunc_id)
if (!btf_id_set8_contains(&kprobe_multi_kfunc_set_ids, kfunc_id))
return 0;
 
-   if (!is_kprobe_session(prog))
+   if (!is_kprobe_session(prog) && !is_uprobe_session(prog))
return -EACCES;
 
return 0;
-- 
2.45.2




[PATCHv2 bpf-next 2/9] bpf: Add support for uprobe multi session attach

2024-07-01 Thread Jiri Olsa
Adding support to attach a bpf program for the entry and return probe
of the same function. This is a common use case which at the moment
requires creating two uprobe multi links.

Adding a new BPF_TRACE_UPROBE_SESSION attach type that instructs
the kernel to attach a single link program to both the entry and exit
probes.

Execution of the bpf program on the return probe is controlled by the
return value of the entry bpf program: returning zero runs the return
probe program, returning non-zero skips it.
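
For illustration, the resulting programming model looks roughly like the sketch
below (the section name comes from the libbpf patch later in this series; the
'my_pid' global and the target path/function are made up):

	int my_pid;

	SEC("uprobe.session//proc/self/exe:my_traced_func")
	int my_session(struct pt_regs *ctx)
	{
		if (bpf_session_is_return())
			return 0;	/* runs only when the entry run returned 0 */

		/* entry: 0 => run the return probe, non-zero => skip it */
		return (bpf_get_current_pid_tgid() >> 32) != my_pid;
	}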

Signed-off-by: Jiri Olsa 
---
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/syscall.c   |  9 +++--
 kernel/trace/bpf_trace.c   | 25 +++--
 tools/include/uapi/linux/bpf.h |  1 +
 4 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 35bcf52dbc65..1d93cb014884 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1116,6 +1116,7 @@ enum bpf_attach_type {
BPF_NETKIT_PRIMARY,
BPF_NETKIT_PEER,
BPF_TRACE_KPROBE_SESSION,
+   BPF_TRACE_UPROBE_SESSION,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 869265852d51..2a63a528fa3c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4049,10 +4049,14 @@ static int bpf_prog_attach_check_attach_type(const 
struct bpf_prog *prog,
if (prog->expected_attach_type == BPF_TRACE_UPROBE_MULTI &&
attach_type != BPF_TRACE_UPROBE_MULTI)
return -EINVAL;
+   if (prog->expected_attach_type == BPF_TRACE_UPROBE_SESSION &&
+   attach_type != BPF_TRACE_UPROBE_SESSION)
+   return -EINVAL;
if (attach_type != BPF_PERF_EVENT &&
attach_type != BPF_TRACE_KPROBE_MULTI &&
attach_type != BPF_TRACE_KPROBE_SESSION &&
-   attach_type != BPF_TRACE_UPROBE_MULTI)
+   attach_type != BPF_TRACE_UPROBE_MULTI &&
+   attach_type != BPF_TRACE_UPROBE_SESSION)
return -EINVAL;
return 0;
case BPF_PROG_TYPE_SCHED_CLS:
@@ -5315,7 +5319,8 @@ static int link_create(union bpf_attr *attr, bpfptr_t 
uattr)
else if (attr->link_create.attach_type == 
BPF_TRACE_KPROBE_MULTI ||
 attr->link_create.attach_type == 
BPF_TRACE_KPROBE_SESSION)
ret = bpf_kprobe_multi_link_attach(attr, prog);
-   else if (attr->link_create.attach_type == 
BPF_TRACE_UPROBE_MULTI)
+   else if (attr->link_create.attach_type == 
BPF_TRACE_UPROBE_MULTI ||
+attr->link_create.attach_type == 
BPF_TRACE_UPROBE_SESSION)
ret = bpf_uprobe_multi_link_attach(attr, prog);
break;
default:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 02d052639dfe..1b19c1cdb5e1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1645,6 +1645,17 @@ static inline bool is_kprobe_session(const struct 
bpf_prog *prog)
return prog->expected_attach_type == BPF_TRACE_KPROBE_SESSION;
 }
 
+static inline bool is_uprobe_multi(const struct bpf_prog *prog)
+{
+   return prog->expected_attach_type == BPF_TRACE_UPROBE_MULTI ||
+  prog->expected_attach_type == BPF_TRACE_UPROBE_SESSION;
+}
+
+static inline bool is_uprobe_session(const struct bpf_prog *prog)
+{
+   return prog->expected_attach_type == BPF_TRACE_UPROBE_SESSION;
+}
+
 static const struct bpf_func_proto *
 kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -1662,13 +1673,13 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
case BPF_FUNC_get_func_ip:
if (is_kprobe_multi(prog))
return &bpf_get_func_ip_proto_kprobe_multi;
-   if (prog->expected_attach_type == BPF_TRACE_UPROBE_MULTI)
+   if (is_uprobe_multi(prog))
return &bpf_get_func_ip_proto_uprobe_multi;
return &bpf_get_func_ip_proto_kprobe;
case BPF_FUNC_get_attach_cookie:
if (is_kprobe_multi(prog))
return &bpf_get_attach_cookie_proto_kmulti;
-   if (prog->expected_attach_type == BPF_TRACE_UPROBE_MULTI)
+   if (is_uprobe_multi(prog))
return &bpf_get_attach_cookie_proto_umulti;
return &bpf_get_attach_cookie_proto_trace;
default:
@@ -3387,7 +3398,7 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
if (sizeof(u64) != sizeof(void *))
return -EOPNOTSUPP;
 
-   if (prog->expected_attach_type != BPF_TRACE_UPROBE_MULTI)
+   if (!is_uprobe_multi(prog))
return -EINVAL;
 
flags = attr->link_create.uprobe_multi.flags;
@@ -3463,10 +3474,12 

[PATCHv2 bpf-next 1/9] uprobe: Add support for session consumer

2024-07-01 Thread Jiri Olsa
Adding support for a uprobe consumer to be defined as a session and have
new behaviour for the consumer's 'handler' and 'ret_handler' callbacks.

The session means that 'handler' and 'ret_handler' callbacks are
connected in a way that allows us to:

  - control execution of 'ret_handler' from 'handler' callback
  - share data between 'handler' and 'ret_handler' callbacks

The session is enabled by setting new 'session' bool field to true
in uprobe_consumer object.

We keep a count of session consumers for each uprobe and allocate a
session_consumer object for each of them in the return_instance object.
This allows us to store the return values of 'handler' callbacks and the
data pointers shared between both handlers.

The session concept fits our common use case where we do filtering
on the entry uprobe and, based on the result, we decide to run the
return uprobe (or not).

It's also convenient to share the data between session callbacks.

The control of 'ret_handler' callback execution is done via the return
value of the 'handler' callback. If it's 0 we install and execute the
return uprobe, if it's 1 we do not.
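
For illustration, a kernel-side session consumer would look roughly like the
sketch below (not from this patch; 'traced_pid' and the handler names are made
up, the callback signatures are the ones introduced here):

	static pid_t traced_pid;

	static int sample_handler(struct uprobe_consumer *self,
				  struct pt_regs *regs, __u64 *data)
	{
		if (current->pid != traced_pid)
			return 1;		/* skip the return probe for this hit */

		*data = ktime_get_ns();		/* shared with the return handler */
		return 0;			/* install and run the return probe */
	}

	static int sample_ret_handler(struct uprobe_consumer *self,
				      unsigned long func, struct pt_regs *regs,
				      __u64 *data)
	{
		pr_debug("0x%lx took %llu ns\n", func, ktime_get_ns() - *data);
		return 0;
	}

	static struct uprobe_consumer sample_consumer = {
		.handler	= sample_handler,
		.ret_handler	= sample_ret_handler,
		.session	= true,
	};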

Signed-off-by: Jiri Olsa 
---
 include/linux/uprobes.h |  16 -
 kernel/events/uprobes.c | 129 +---
 kernel/trace/bpf_trace.c|   6 +-
 kernel/trace/trace_uprobe.c |  12 ++--
 4 files changed, 144 insertions(+), 19 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..903a860a8d01 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -34,15 +34,18 @@ enum uprobe_filter_ctx {
 };
 
 struct uprobe_consumer {
-   int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+   int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data);
int (*ret_handler)(struct uprobe_consumer *self,
unsigned long func,
-   struct pt_regs *regs);
+   struct pt_regs *regs, __u64 *data);
bool (*filter)(struct uprobe_consumer *self,
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
 
struct uprobe_consumer *next;
+
+	bool			session;	/* marks uprobe session consumer */
+	unsigned int		session_id;	/* set when uprobe_consumer is registered */
 };
 
 #ifdef CONFIG_UPROBES
@@ -80,6 +83,12 @@ struct uprobe_task {
unsigned intdepth;
 };
 
+struct session_consumer {
+   __u64   cookie;
+   unsigned intid;
+   int rc;
+};
+
 struct return_instance {
struct uprobe   *uprobe;
unsigned long   func;
@@ -88,6 +97,9 @@ struct return_instance {
	bool			chained;	/* true, if instance is nested */
 
struct return_instance  *next;  /* keep as stack */
+
+   int sessions_cnt;
+   struct session_consumer sessions[];
 };
 
 enum rp_check {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..4da410460f2a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -63,6 +63,8 @@ struct uprobe {
loff_t  ref_ctr_offset;
unsigned long   flags;
 
+   unsigned intsessions_cnt;
+
/*
 * The generic code assumes that it has two members of unknown type
 * owned by the arch-specific code:
@@ -750,11 +752,30 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
return uprobe;
 }
 
+static void
+uprobe_consumer_account(struct uprobe *uprobe, struct uprobe_consumer *uc)
+{
+   static unsigned int session_id;
+
+   if (uc->session) {
+   uprobe->sessions_cnt++;
+   uc->session_id = ++session_id ?: ++session_id;
+   }
+}
+
+static void
+uprobe_consumer_unaccount(struct uprobe *uprobe, struct uprobe_consumer *uc)
+{
+   if (uc->session)
+   uprobe->sessions_cnt--;
+}
+
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
down_write(&uprobe->consumer_rwsem);
uc->next = uprobe->consumers;
uprobe->consumers = uc;
+   uprobe_consumer_account(uprobe, uc);
up_write(&uprobe->consumer_rwsem);
 }
 
@@ -773,6 +794,7 @@ static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
if (*con == uc) {
*con = uc->next;
ret = true;
+   uprobe_consumer_unaccount(uprobe, uc);
break;
}
}
@@ -1744,6 +1766,23 @@ static struct uprobe_task *get_utask(void)
return current->utask;
 }
 
+static size_t ri_size(int sessions_cnt)
+{
+   struct return_instance *ri __maybe_unused;
+
+   return sizeof(*ri) + sessions_cnt * sizeof(ri->sessions[0]);
+}
+
+static struct return_instance 

[PATCHv2 bpf-next 0/9] uprobe, bpf: Add session support

2024-07-01 Thread Jiri Olsa
hi,
this patchset is adding support for session uprobe attachment
and using it through bpf link for bpf programs.

The session means that the uprobe consumer is executed on entry
and return of probed function with additional control:
  - entry callback can control execution of the return callback
  - entry and return callbacks can share data/cookie

On more details please see patch #1.

v2 changes:
  - re-implement uprobe session support [Andrii]
  - added test for mixed uprobe consumers

thanks,
jirka


---
Jiri Olsa (9):
  uprobe: Add support for session consumer
  bpf: Add support for uprobe multi session attach
  bpf: Add support for uprobe multi session context
  libbpf: Add support for uprobe multi session attach
  libbpf: Add uprobe session attach type names to attach_type_name
  selftests/bpf: Add uprobe session test
  selftests/bpf: Add uprobe session cookie test
  selftests/bpf: Add uprobe session recursive test
  selftests/bpf: Add uprobe session consumers test

 include/linux/uprobes.h|  16 +++-
 include/uapi/linux/bpf.h   |   1 +
 kernel/bpf/syscall.c   |   9 ++-
 kernel/events/uprobes.c| 129 
++---
 kernel/trace/bpf_trace.c   |  54 
++
 kernel/trace/trace_uprobe.c|  12 ++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c|   1 +
 tools/lib/bpf/libbpf.c |  51 
-
 tools/lib/bpf/libbpf.h |   4 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c | 333 

 tools/testing/selftests/bpf/progs/uprobe_multi_session.c   |  53 
++
 tools/testing/selftests/bpf/progs/uprobe_multi_session_consumers.c |  53 
++
 tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c|  48 

 tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c |  44 
+++
 15 files changed, 771 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_multi_session.c
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_consumers.c
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_cookie.c
 create mode 100644 
tools/testing/selftests/bpf/progs/uprobe_multi_session_recursive.c



Re: [PATCH 1/4] remoteproc: k3-r5: Fix IPC-only mode detection

2024-07-01 Thread Mathieu Poirier
On Mon, Jul 01, 2024 at 04:13:00AM -0500, Hari Nagalla wrote:
> On 6/28/24 14:58, Mathieu Poirier wrote:
> > > This could lead to an incorrect IPC-only mode detection if the reset line is
> > > asserted (so reset_control_status() return > 0) and c_state != 0 and
> > > halted == 0.
> > > In this case, the old code would have detected an IPC-only mode instead
> > > of a mismatched mode.
> > > 
> > Your assessment seems to be correct.  That said I'd like to have an RB or a 
> > TB
> > from someone in the TI delegation - guys please have a look.
> Agree with Richard's assessment, and the proposed fix looks good.
> 
> Reviewed-by:
> Hari Nagalla 

I have applied this patch, no need to send it again.

Thanks,
Mathieu



Re: [PATCH 4/4] remoteproc: k3-r5: support for graceful stop of remote cores

2024-07-01 Thread Richard GENOUD

[adding Vibhore in Cc]

Le 28/06/2024 à 23:20, Mathieu Poirier a écrit :

On Fri, Jun 21, 2024 at 05:00:58PM +0200, Richard Genoud wrote:

Introduce software IPC handshake between the K3-R5 remote proc driver
and the R5 MCU to gracefully stop/reset the remote core.

Upon a stop request, K3-R5 remote proc driver sends a RP_MBOX_SHUTDOWN
mailbox message to the remote R5 core.
The remote core is expected to:
- relinquish all the resources acquired through Device Manager (DM)
- disable its interrupts
- send back a mailbox acknowledgment RP_MBOX_SHUDOWN_ACK
- enter WFI state.

Meanwhile, the K3-R5 remote proc driver does:
- wait for the RP_MBOX_SHUTDOWN_ACK from the remote core
- wait for the remote proc to enter WFI state
- reset the remote core through device manager
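
As a rough sketch of the driver-side sequence above (not the exact code from
the patch below; the timeout value and error handling are placeholders):

	reinit_completion(&kproc->shutdown_complete);
	ret = mbox_send_message(kproc->mbox, (void *)RP_MBOX_SHUTDOWN);
	if (ret < 0)
		return ret;

	/* wait for RP_MBOX_SHUTDOWN_ACK from the remote core ... */
	if (!wait_for_completion_timeout(&kproc->shutdown_complete,
					 msecs_to_jiffies(1000)))
		dev_warn(dev, "no SHUTDOWN_ACK, halting core anyway\n");

	/* ... then poll for WFI and reset the core through the device manager */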

Based on work from: Hari Nagalla 



Why is this needed now and what happens to system with a new kernel driver and
an older K3R5 firmware?


Without this patch, the IPC firmwares from recent TI PDK prevent
suspend:

platform 5d0.r5f: ti-sci processor set_control failed: -19
remoteproc remoteproc1: can't stop rproc: -19
platform 5c0.r5f: Failed to shutdown rproc (-19)
platform 5c0.r5f: k3_r5_rproc_suspend failed (-19)

With a new kernel driver and an old firmware, this will add a timeout 
before stopping it, and the message:


platform 5c0.r5f: Timeout waiting SHUTDOWN_ACK message. Halting core 
anyway.


(tested on old FW 09.00.00.01 and new FW 09.02.01.18, on J7200)

Regards,
Richard



Thanks,
Mathieu


Signed-off-by: Richard Genoud 
---
  drivers/remoteproc/omap_remoteproc.h |  9 +-
  drivers/remoteproc/ti_k3_r5_remoteproc.c | 40 
  2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/omap_remoteproc.h 
b/drivers/remoteproc/omap_remoteproc.h
index 828e13256c02..c008f11fa2a4 100644
--- a/drivers/remoteproc/omap_remoteproc.h
+++ b/drivers/remoteproc/omap_remoteproc.h
@@ -42,6 +42,11 @@
   * @RP_MBOX_SUSPEND_CANCEL: a cancel suspend response from a remote processor
   * on a suspend request
   *
+ * @RP_MBOX_SHUTDOWN: shutdown request for the remote processor
+ *
+ * @RP_MBOX_SHUTDOWN_ACK: successful response from remote processor for a
+ * shutdown request. The remote processor should be in WFI state shortly after.
+ *
   * Introduce new message definitions if any here.
   *
   * @RP_MBOX_END_MSG: Indicates end of known/defined messages from remote core
@@ -59,7 +64,9 @@ enum omap_rp_mbox_messages {
RP_MBOX_SUSPEND_SYSTEM  = 0xFF11,
RP_MBOX_SUSPEND_ACK = 0xFF12,
RP_MBOX_SUSPEND_CANCEL  = 0xFF13,
-   RP_MBOX_END_MSG = 0xFF14,
+   RP_MBOX_SHUTDOWN= 0xFF14,
+   RP_MBOX_SHUTDOWN_ACK= 0xFF15,
+   RP_MBOX_END_MSG = 0xFF16,
  };
  
  #endif /* _OMAP_RPMSG_H */

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index a2ead87952c7..918a15e1dd9a 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -21,6 +21,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -172,8 +173,23 @@ struct k3_r5_rproc {

struct k3_r5_core *core;
struct k3_r5_mem *rmem;
int num_rmems;
+   struct completion shutdown_complete;
  };
  
+/*

+ * This will return true if the remote core is in Wait For Interrupt state.
+ */
+static bool k3_r5_is_core_in_wfi(struct k3_r5_core *core)
+{
+   int ret;
+   u64 boot_vec;
+   u32 cfg, ctrl, stat;
+
+   ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl, &stat);
+
+   return !ret ? !!(stat & PROC_BOOT_STATUS_FLAG_R5_WFI) : false;
+}
+
  /**
   * k3_r5_rproc_mbox_callback() - inbound mailbox message handler
   * @client: mailbox client pointer used for requesting the mailbox channel
@@ -209,6 +225,10 @@ static void k3_r5_rproc_mbox_callback(struct mbox_client 
*client, void *data)
case RP_MBOX_ECHO_REPLY:
dev_info(dev, "received echo reply from %s\n", name);
break;
+   case RP_MBOX_SHUTDOWN_ACK:
+   dev_dbg(dev, "received shutdown_ack from %s\n", name);
+   complete(&kproc->shutdown_complete);
+   break;
default:
/* silently handle all other valid messages */
if (msg >= RP_MBOX_READY && msg < RP_MBOX_END_MSG)
@@ -634,6 +654,7 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
struct k3_r5_cluster *cluster = kproc->cluster;
struct device *dev = kproc->dev;
struct k3_r5_core *core1, *core = kproc->core;
+   bool wfi;
int ret;
  
  
@@ -650,6 +671,24 @@ static int k3_r5_rproc_stop(struct rproc *rproc)

}
}
  
+	/* Send SHUTDOWN message to remote proc */

+   reinit_completion(&kproc->shutdown_complete);
+   ret = mbox_send_message(kproc->mbox, (void *)RP_MBOX_SHUTDOWN);
+   if (ret < 0) {
+   dev_err(dev, 

Re: [PATCH 3/4] remoteproc: k3-r5: k3_r5_rproc_stop: code reorder

2024-07-01 Thread Mathieu Poirier
On Mon, Jul 01, 2024 at 10:03:22AM +0200, Richard GENOUD wrote:
> Le 28/06/2024 à 23:18, Mathieu Poirier a écrit :
> > On Fri, Jun 21, 2024 at 05:00:57PM +0200, Richard Genoud wrote:
> > > In the next commit, a RP_MBOX_SHUTDOWN message will be sent in
> > > k3_r5_rproc_stop() to the remote proc (in lockstep or not)
> > > Thus, the sanity check "do not allow core 0 to stop before core 1"
> > > should be moved at the beginning of the function so that the generic case
> > > can be dealt with.
> > > 
> > > In order to have an easier patch to review, those actions are broken into
> > > two patches:
> > > - this patch: moving the sanity check at the beginning (No functional
> > >change).
> > > - next patch: doing the real job (sending shutdown messages to remote
> > >procs before halting them).
> > > 
> > > Basically, we had:
> > > - cluster_mode actions
> > > - !cluster_mode sanity check
> > > - !cluster_mode actions
> > > And now:
> > > - !cluster_mode sanity check
> > > - cluster_mode actions
> > > - !cluster_mode actions
> > > 
> > > Signed-off-by: Richard Genoud 
> > > ---
> > >   drivers/remoteproc/ti_k3_r5_remoteproc.c | 24 ++--
> > >   1 file changed, 14 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
> > > b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> > > index 1f18b08618c8..a2ead87952c7 100644
> > > --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> > > +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> > > @@ -636,16 +636,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
> > >   struct k3_r5_core *core1, *core = kproc->core;
> > >   int ret;
> > > - /* halt all applicable cores */
> > > - if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
> > > - list_for_each_entry(core, >cores, elem) {
> > > - ret = k3_r5_core_halt(core);
> > > - if (ret) {
> > > - core = list_prev_entry(core, elem);
> > > - goto unroll_core_halt;
> > > - }
> > > - }
> > > - } else {
> > > +
> > > + if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
> > >   /* do not allow core 0 to stop before core 1 */
> > >   core1 = list_last_entry(>cores, struct 
> > > k3_r5_core,
> > >   elem);
> > > @@ -656,6 +648,18 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
> > >   ret = -EPERM;
> > >   goto out;
> > >   }
> > > + }
> > > +
> > > + /* halt all applicable cores */
> > > + if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
> > > + list_for_each_entry(core, >cores, elem) {
> > > + ret = k3_r5_core_halt(core);
> > > + if (ret) {
> > > + core = list_prev_entry(core, elem);
> > > + goto unroll_core_halt;
> > > + }
> > > + }
> > > + } else {
> > >   ret = k3_r5_core_halt(core);
> > >   if (ret)
> > 
> > With this patch, the "else" in this "if" condition is coupled with the "if"
> > from the lockstep mode, making the code extremely hard to read.  The
> > original code has a k3_r5_core_halt() in both "if" conditions, making the
> > conditions independent from one another.
> > 
> I'm not sure I understand what you mean.

With your patch applied I get the following: https://pastebin.com/yTZ0pKcS

Let's say the R5 is in split mode and k3_r5_rproc_stop() is called on core1.  The
if() that deals with that condition is on line 10, while the function that halts
the remote processor is on line 34, part of the else clause that handles lockstep
mode.  The two if() clauses are entangled and nothing good can come out of that.

> Anyway, I'm not happy with this diff, it doesn't reflect what was intended
> (which is moving the check "core 0 should not be stopped before core 1" to
> the beginning).
> 
> Tweaking around with the diff algorithms, I came up with something way
> easier to read, I think:
> 
> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
> b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> index 1f18b08618c8..a2ead87952c7 100644
> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> @@ -636,6 +636,20 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
> struct k3_r5_core *core1, *core = kproc->core;
> int ret;
> +
> +   if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
> +   /* do not allow core 0 to stop before core 1 */
> +   core1 = list_last_entry(>cores, struct k3_r5_core,
> +   elem);
> +   if (core != core1 && core1->rproc->state != RPROC_OFFLINE &&
> +   core1->rproc->state != RPROC_SUSPENDED) {
> +   dev_err(dev, "%s: can not stop core 0 before core 
> 1\n",
> +   __func__);
> +  

Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-07-01 Thread Stefano Garzarella

Hi Arseniy,

On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:

Previous calculation of 'free_space' was wrong (but worked as expected
in most cases, see below), because it didn't account number of bytes in
rx queue. Let's rework 'free_space' calculation in the following way:
as this value is considered free space at the rx side from the tx point of
view, it must be equal to the return value of 'virtio_transport_get_credit()' at
tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
value received from rx. So let's use same approach at rx side during
'free_space' calculation: add 'rx_cnt' counter which is number of
received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
Now we have:
1) 'rx_cnt' == 'tx_cnt' at both sides.
2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
  sent to tx, while second is last 'fwd_cnt' received from rx.

Now 'free_space' is handled correctly and also we don't need
'low_rx_bytes' flag - this was more like a hack.

Previous calculation of 'free_space' worked (in 99% cases), because if
we take a look on behaviour of both expressions (new and previous):

'(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'

Both of them always grow, at almost the same "speed": the only difference
is that 'rx_cnt' is incremented earlier, when a packet is received,
while 'fwd_cnt' is incremented when a packet is read by the user. So if
'rx_cnt' grows "faster", the resulting 'free_space' also becomes smaller,
so we send credit updates a little bit more often, but:

 * 'free_space' calculation based on 'rx_cnt' gives the same value
   that tx sees as free space at the rx side, so the original idea of
   'free_space' is now implemented as planned.
 * The hack with 'low_rx_bytes' is not needed any more.
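
As a small worked example (numbers purely illustrative): with 'buf_alloc' = 64K,
suppose 32K bytes have arrived since the last credit update
('rx_cnt' - 'last_fwd_cnt' = 32K) but the reader has consumed only 16K of them
('fwd_cnt' - 'last_fwd_cnt' = 16K). The old formula reports
'free_space' = 64K - 16K = 48K, while the tx side actually has only
64K - 32K = 32K of credit left; the new formula reports exactly that 32K.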

Also here is some performance comparison between both versions of
'free_space' calculation:

*--*--*--*
|  | 'rx_cnt' | previous |
*--*--*--*
|H -> G|   8.42   |   7.82   |
*--*--*--*
|G -> H|   11.6   |   12.1   |
*--*--*--*


I did some tests on an Intel(R) Xeon(R) Silver 4410Y using iperf-vsock:
- kernel 6.9.0
pkt_size G->H H->G
4k4.6  6.4
64k  13.8 11.5
128k 13.4 11.7

- kernel 6.9.0 with this series applied
pkt_size     G->H     H->G
4k            4.6     8.16
64k          12.2     8.9
128k         12.8     8.8

I see a big drop, especially on H->G with big packets. Can you try to 
replicate on your env?


I'll try to understand more and also try an i7 in the next days.

Thanks,
Stefano



As a benchmark, 'vsock-iperf' with default arguments was used. There is no
significant performance difference before and after this patch.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h| 1 +
net/vmw_vsock/virtio_transport_common.c | 8 +++-
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..3579491c411e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -135,6 +135,7 @@ struct virtio_vsock_sock {
u32 peer_buf_alloc;

/* Protected by rx_lock */
+   u32 rx_cnt;
u32 fwd_cnt;
u32 last_fwd_cnt;
u32 rx_bytes;
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..1d4e2328e06e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
virtio_vsock_sock *vvs,
return false;

vvs->rx_bytes += len;
+   vvs->rx_cnt += len;
return true;
}

@@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t bytes, total = 0;
struct sk_buff *skb;
u32 fwd_cnt_delta;
-   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;

@@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}

fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-   free_space = vvs->buf_alloc - fwd_cnt_delta;
-   low_rx_bytes = (vvs->rx_bytes <
-   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
+   free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt);

spin_unlock_bh(&vvs->rx_lock);

@@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * number of bytes in rx queue is not enough to wake up reader.
 */
if (fwd_cnt_delta &&
-   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE))
virtio_transport_send_credit_update(vsk);

return total;
--
2.25.1







Re: [PATCH] LoongArch: make the users of larch_insn_gen_break() constant

2024-07-01 Thread Huacai Chen
On Mon, Jul 1, 2024 at 2:22 PM Tiezhu Yang  wrote:
>
> On 06/29/2024 11:03 PM, Oleg Nesterov wrote:
> > LoongArch defines UPROBE_SWBP_INSN as a function call and this breaks
> > arch_uprobe_trampoline() which uses it to initialize a static variable.
> >
> > Add the new "__builtin_constant_p" helper, __emit_break(), and redefine
> > the current users of larch_insn_gen_break() to use it.
> >
> > The patch adds check_emit_break() into kprobes.c and uprobes.c to test
> > this change. They can be removed if LoongArch boots at least once, but
> > otoh these 2 __init functions will be discarded by free_initmem().
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return 
> > probe")
> > Reported-by: Nathan Chancellor 
> > Closes: https://lore.kernel.org/all/20240614174822.GA1185149@thelio-3990X/
> > Suggested-by: Andrii Nakryiko 
> > Signed-off-by: Oleg Nesterov 
>
> Tested on LoongArch machine with Loongson-3A5000 and Loongson-3A6000
> CPU, based on 6.10-rc3,
>
> KPROBE_BP_INSN == larch_insn_gen_break(BRK_KPROBE_BP)
> KPROBE_SSTEPBP_INSN == larch_insn_gen_break(BRK_KPROBE_SSTEPBP)
> UPROBE_SWBP_INSN  == larch_insn_gen_break(BRK_UPROBE_BP)
> UPROBE_XOLBP_INSN == larch_insn_gen_break(BRK_UPROBE_XOLBP)
>
> The two functions check_emit_break() can be removed in
> arch/loongarch/kernel/kprobes.c and arch/loongarch/kernel/uprobes.c
>
> Tested-by: Tiezhu Yang 
Queued, thanks.

Huacai
>
> Thanks,
> Tiezhu
>
>



Re: [PATCH PATCH net-next v2 2/2] vsock/virtio: avoid enqueue packets when work queue is empty

2024-07-01 Thread Luigi Leonardi
Hi all,

> + /* Inside RCU, can't sleep! */
> + ret = mutex_trylock(&vsock->tx_lock);
> + if (unlikely(ret == 0))
> + goto out_worker;

I just realized that here I don't release the tx_lock and 
that the email subject is "PATCH PATCH".
I will fix this in the next version.
Any feedback is welcome!
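
For reference, one minimal way to drop the lock on that path could look like
this (just a sketch against the !tx_run hunk in patch 2/2; whether the packet
should then still go through the worker is a separate question):

		/* Driver is being removed: release the lock before bailing out */
		if (!vsock->tx_run) {
			mutex_unlock(&vsock->tx_lock);
			goto out_worker;
		}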

Thanks,
Luigi



[PATCH net-next v2 0/2] vsock: avoid queuing on workqueue if possible

2024-07-01 Thread Luigi Leonardi via B4 Relay
This series introduces an optimization for vsock/virtio to reduce latency:
When the guest sends a packet to the host, and the workqueue is empty,
if there is enough space, the packet is put directly in the virtqueue.

In this v2 I replaced a mutex_lock with a mutex_trylock because it was inside
an RCU critical section. I also added a check on tx_run, so if the
module is being removed the packet is not queued. I'd like to thank Stefano
for reporting the tx_run issue.

v1->v2
Applied all Stefano's suggestions:
- Minor code style changes
- Minor commit text rewrite
Performed more experiments:
 - Check if all the packets go directly to the vq (Matias' suggestion)
 - Used iperf3 to see if there is any improvement in overall throughput
  from guest to host
 - Pinned the vhost process to a pCPU.
 - Run fio using 512B payload
Rebased on latest net-next

Signed-off-by: Luigi Leonardi 
---
Marco Pinna (2):
  vsock/virtio: refactor virtio_transport_send_pkt_work
  vsock/virtio: avoid enqueue packets when work queue is empty

 net/vmw_vsock/virtio_transport.c | 171 +--
 1 file changed, 109 insertions(+), 62 deletions(-)
---
base-commit: 2e7b471121b09e7fa8ffb437bfa0e59d13f96053
change-id: 20240701-pinna-611e8b6cdda0

Best regards,
-- 
Luigi Leonardi 





[PATCH PATCH net-next v2 2/2] vsock/virtio: avoid enqueue packets when work queue is empty

2024-07-01 Thread Luigi Leonardi via B4 Relay
From: Marco Pinna 

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is
put directly in the virtqueue reducing latency.

In the following benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Payload: 512
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before (Linux 6.8.11)
--
mean(1st percentile):380.56 ns
mean(overall):   780.83 ns
mean(99th percentile):  8300.24 ns

After
--
mean(1st percentile):   370.59 ns
mean(overall):  720.66 ns
mean(99th percentile): 7600.27 ns

Same setup, using 4K payload:

Before (Linux 6.8.11)
--
mean(1st percentile):458.84 ns
mean(overall):  1650.17 ns
mean(99th percentile): 42240.68 ns

After
--
mean(1st percentile):450.12 ns
mean(overall):  1460.84 ns
mean(99th percentile): 37632.45 ns

Throughput: iperf-vsock

Before (Linux 6.8.11)
G2H 28.7 Gb/s

After
G2H 40.8 Gb/s

The performance improvement is related to this optimization:
I checked that each packet was put directly on the vq,
avoiding the work queue.

Co-developed-by: Luigi Leonardi 
Signed-off-by: Luigi Leonardi 
Signed-off-by: Marco Pinna 
---
 net/vmw_vsock/virtio_transport.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index a74083d28120..3815aa8d956b 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -213,6 +213,7 @@ virtio_transport_send_pkt(struct sk_buff *skb)
 {
struct virtio_vsock_hdr *hdr;
struct virtio_vsock *vsock;
+   bool use_worker = true;
int len = skb->len;
 
hdr = virtio_vsock_hdr(skb);
@@ -234,8 +235,41 @@ virtio_transport_send_pkt(struct sk_buff *skb)
if (virtio_vsock_skb_reply(skb))
atomic_inc(&vsock->queued_replies);
 
-   virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+   /* If the workqueue (send_pkt_queue) is empty there is no need to 
enqueue the packet.
+* Just put it on the virtqueue using virtio_transport_send_skb.
+*/
+   if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
+   bool restart_rx = false;
+   struct virtqueue *vq;
+   int ret;
+
+   /* Inside RCU, can't sleep! */
+   ret = mutex_trylock(&vsock->tx_lock);
+   if (unlikely(ret == 0))
+   goto out_worker;
+
+   /* Driver is being removed, no need to enqueue the packet */
+   if (!vsock->tx_run)
+   goto out_rcu;
+
+   vq = vsock->vqs[VSOCK_VQ_TX];
+
+   if (!virtio_transport_send_skb(skb, vq, vsock, &restart_rx)) {
+   use_worker = false;
+   virtqueue_kick(vq);
+   }
+
+   mutex_unlock(&vsock->tx_lock);
+
+   if (restart_rx)
+   queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+   }
+
+out_worker:
+   if (use_worker) {
+   virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+   }
 
 out_rcu:
rcu_read_unlock();

-- 
2.45.2





[PATCH PATCH net-next v2 1/2] vsock/virtio: refactor virtio_transport_send_pkt_work

2024-07-01 Thread Luigi Leonardi via B4 Relay
From: Marco Pinna 

Preliminary patch to introduce an optimization to the
enqueue system.

All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.

Co-developed-by: Luigi Leonardi 
Signed-off-by: Luigi Leonardi 
Signed-off-by: Marco Pinna 
---
 net/vmw_vsock/virtio_transport.c | 133 +--
 1 file changed, 73 insertions(+), 60 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 43d405298857..a74083d28120 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -94,6 +94,77 @@ static u32 virtio_transport_get_local_cid(void)
return ret;
 }
 
+/* Caller needs to hold vsock->tx_lock on vq */
+static int virtio_transport_send_skb(struct sk_buff *skb, struct virtqueue *vq,
+struct virtio_vsock *vsock, bool 
*restart_rx)
+{
+   int ret, in_sg = 0, out_sg = 0;
+   struct scatterlist **sgs;
+   bool reply;
+
+   reply = virtio_vsock_skb_reply(skb);
+   sgs = vsock->out_sgs;
+   sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
+   sizeof(*virtio_vsock_hdr(skb)));
+   out_sg++;
+
+   if (!skb_is_nonlinear(skb)) {
+   if (skb->len > 0) {
+   sg_init_one(sgs[out_sg], skb->data, skb->len);
+   out_sg++;
+   }
+   } else {
+   struct skb_shared_info *si;
+   int i;
+
+   /* If skb is nonlinear, then its buffer must contain
+* only header and nothing more. Data is stored in
+* the fragged part.
+*/
+   WARN_ON_ONCE(skb_headroom(skb) != 
sizeof(*virtio_vsock_hdr(skb)));
+
+   si = skb_shinfo(skb);
+
+   for (i = 0; i < si->nr_frags; i++) {
+   skb_frag_t *skb_frag = &si->frags[i];
+   void *va;
+
+   /* We will use 'page_to_virt()' for the userspace page
+* here, because virtio or dma-mapping layers will call
+* 'virt_to_phys()' later to fill the buffer descriptor.
+* We don't touch memory at "virtual" address of this 
page.
+*/
+   va = page_to_virt(skb_frag_page(skb_frag));
+   sg_init_one(sgs[out_sg],
+   va + skb_frag_off(skb_frag),
+   skb_frag_size(skb_frag));
+   out_sg++;
+   }
+   }
+
+   ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
+   /* Usually this means that there is no more space available in
+* the vq
+*/
+   if (ret < 0)
+   return ret;
+
+   virtio_transport_deliver_tap_pkt(skb);
+
+   if (reply) {
+   struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
+   int val;
+
+   val = atomic_dec_return(&vsock->queued_replies);
+
+   /* Do we now have resources to resume rx processing? */
+   if (val + 1 == virtqueue_get_vring_size(rx_vq))
+   *restart_rx = true;
+   }
+
+   return 0;
+}
+
 static void
 virtio_transport_send_pkt_work(struct work_struct *work)
 {
@@ -111,77 +182,19 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
 
for (;;) {
-   int ret, in_sg = 0, out_sg = 0;
-   struct scatterlist **sgs;
struct sk_buff *skb;
-   bool reply;
+   int ret;
 
skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
if (!skb)
break;
 
-   reply = virtio_vsock_skb_reply(skb);
-   sgs = vsock->out_sgs;
-   sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
-   sizeof(*virtio_vsock_hdr(skb)));
-   out_sg++;
-
-   if (!skb_is_nonlinear(skb)) {
-   if (skb->len > 0) {
-   sg_init_one(sgs[out_sg], skb->data, skb->len);
-   out_sg++;
-   }
-   } else {
-   struct skb_shared_info *si;
-   int i;
-
-   /* If skb is nonlinear, then its buffer must contain
-* only header and nothing more. Data is stored in
-* the fragged part.
-*/
-   WARN_ON_ONCE(skb_headroom(skb) != 
sizeof(*virtio_vsock_hdr(skb)));
-
-   si = skb_shinfo(skb);
-
-   for (i = 0; i < si->nr_frags; i++) {
-   skb_frag_t *skb_frag = &si->frags[i];
-

Re: [PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-07-01 Thread Petr Mladek
On Fri 2024-06-28 10:36:45, Luis Chamberlain wrote:
> On Fri, Jun 28, 2024 at 02:23:49PM +0200, Miroslav Benes wrote:
> > On Fri, 7 Jun 2024, Song Liu wrote:
> > 
> > > Hi Miroslav,
> > > 
> > > Thanks for reviewing the patch!
> > > 
> > > On Fri, Jun 7, 2024 at 6:06 AM Miroslav Benes  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, 4 Jun 2024, Song Liu wrote:
> > > >
> > > > > With CONFIG_LTO_CLANG, the compiler may postfix symbols with 
> > > > > .llvm.
> > > > > to avoid symbol duplication. scripts/kallsyms.c sorted the symbols
> > > > > without these postfixes. The default symbol lookup also removes these
> > > > > postfixes before comparing symbols.
> > > > >
> > > > > On the other hand, livepatch need to look up symbols with the full 
> > > > > names.
> > > > > However, calling kallsyms_on_each_match_symbol with full name (with 
> > > > > the
> > > > > postfix) cannot find the symbol(s). As a result, we cannot livepatch
> > > > > kernel functions with .llvm. postfix or kernel functions that 
> > > > > use
> > > > > relocation information to symbols with .llvm. postfixes.
> > > > >
> > > > > Fix this by calling kallsyms_on_each_match_symbol without the postfix;
> > > > > and then match the full name (with postfix) in klp_match_callback.
> > > > >
> > > > > Signed-off-by: Song Liu 
> > > > > ---
> > > > >  include/linux/kallsyms.h | 13 +
> > > > >  kernel/kallsyms.c| 21 -
> > > > >  kernel/livepatch/core.c  | 32 +++-
> > > > >  3 files changed, 60 insertions(+), 6 deletions(-)
> > > >
> > > > I do not like much that something which seems to be kallsyms-internal is
> > > > leaked out. You need to export cleanup_symbol_name() and there is now a
> > > > lot of code outside. I would feel much more comfortable if it is all
> > > > hidden from kallsyms users and kept there. Would it be possible?
> > > 
> > > I think it is possible. Currently, kallsyms_on_each_match_symbol matches
> > > symbols without the postfix. We can add a variation or a parameter, so
> > > that it matches the full name with post fix.
> > 
> > I think it might be better.

Also, I agree that we need another variant of kallsyms_on_each_match_symbol()
which would match the full name.

> > Luis, what is your take on this?

[ Luis probably removed too much context here. IMHO, the following
  sentence was talking about another problem in the original mail..

> > If I am not mistaken, there was a patch set to address this. Luis might 
> > remember more.

I believe that Miroslav was talking about
https://lore.kernel.org/all/20231204214635.2916691-1-alessandro.carmin...@gmail.com/

It is a patch solving the opposite problem. It allows matching
an exact symbol even when there are more symbols of the same name.
It would allow getting rid of the sympos.


> Yeah, this is a real issue outside of CONFIG_LTO_CLANG; Rust modules are
> another example where, instead of symbol names, they want to use full
> hashes. So, as I hinted to you Sami, can we knock two birds with one stone
> here and move CONFIG_LTO_CLANG to use the same strategy as Rust so we
> have two users instead of just one? Then we resolve this. In fact,
> what I suggested was to allow even non-Rust, and in this case even
> gcc, to enable this world. This gives a much wider scope of testing /
> review / impact for these sorts of changes, and it would resolve the
> Rust case and the livepatch CONFIG_LTO_CLANG world too.

So, you suggest to search the symbols by a hash. Do I get it correctly?

Well, it might bring back the original problem. I mean
the commit 8b8e6b5d3b013b0 ("kallsyms: strip ThinLTO hashes from
static functions") added cleanup_symbol_name() so that user-space
tool do not need to take care of the "unstable" suffix.

So, it seems that we have two use cases:

   1. Some user-space tools want to ignore the extra suffix. I guess
  that it is in the case when the suffix is added only because
  the function was optimized.

  It can't work if there are two different functions of the same
  name. Otherwise, the user-space tool would not know which one
  they are tracing.


   2. There are other use-cases, including livepatching, where we
  want to be 100% sure that we match the right symbol.

  They want to match the full names. They even need to distinguish
  symbols with the same name.


IMHO, we need a separate API for each use-case.
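
A rough sketch of what such a pair of entry points could look like (the second
declaration and its name are purely hypothetical; the first mirrors the
existing helper as discussed above):

/* current behaviour: compares names with the .llvm.<hash> suffix stripped */
int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
				  const char *name, void *data);

/* hypothetical variant for livepatch: compares the full symbol name */
int kallsyms_on_each_full_match_symbol(int (*fn)(void *, unsigned long),
				       const char *name, void *data);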

Best Regards,
Petr



Re: [PATCH] vhost-vdpa: switch to use vmf_insert_pfn() in the fault handler

2024-07-01 Thread Michal Kubiak
On Mon, Jul 01, 2024 at 11:31:59AM +0800, Jason Wang wrote:
> remap_pfn_page() should not be called in the fault handler as it may
> change the vma->flags which may trigger lockdep warning since the vma
> write lock is not held. Actually there's no need to modify the
> vma->flags as it has been set in the mmap(). So this patch switches to
> use vmf_insert_pfn() instead.
> 
> Reported-by: Dragos Tatulea 
> Tested-by: Dragos Tatulea 
> Fixes: ddd89d0a059d ("vhost_vdpa: support doorbell mapping via mmap")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vdpa.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 63a53680a85c..6b9c12acf438 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1483,13 +1483,7 @@ static vm_fault_t vhost_vdpa_fault(struct vm_fault 
> *vmf)
>  
>   notify = ops->get_vq_notification(vdpa, index);
>  
> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> - if (remap_pfn_range(vma, vmf->address & PAGE_MASK,
> - PFN_DOWN(notify.addr), PAGE_SIZE,
> - vma->vm_page_prot))
> - return VM_FAULT_SIGBUS;
> -
> - return VM_FAULT_NOPAGE;
> + return vmf_insert_pfn(vma, vmf->address & PAGE_MASK, 
> PFN_DOWN(notify.addr));
>  }
>  
>  static const struct vm_operations_struct vhost_vdpa_vm_ops = {
> -- 
> 2.31.1
> 
> 

I would only consider pasting an example warning log. (It's my
suggestion only).

Anyway, the patch looks OK to me.

Thanks,
Reviewed-by: Michal Kubiak 



Re: [PATCH 0/4] remoteproc: k3-r5: Introduce suspend to ram support

2024-07-01 Thread Hari Nagalla

On 6/21/24 10:00, Richard Genoud wrote:

Richard Genoud (4):
   remoteproc: k3-r5: Fix IPC-only mode detection
   remoteproc: k3-r5: Introduce PM suspend/resume handlers
   remoteproc: k3-r5: k3_r5_rproc_stop: code reorder
   remoteproc: k3-r5: support for graceful stop of remote cores
IMO, the patches are better reordered as below (since there is an LPM
re-architecture in progress):

patch1 - Independent of other patches and is clearly a bug fix.
patch3,4 - Support for graceful shutdown. A separate feature used by 
customers wanting to change FW at runtime and is independent of 
suspend/resume.

patch 2 - suspend/resume handlers..




Re: [PATCH v2 1/2] dt-bindings: arm: qcom: Document samsung,ms013g

2024-07-01 Thread Krzysztof Kozlowski
On 27/06/2024 21:30, Raymond Hackley wrote:
> Document samsung,ms013g for Galaxy Grand 2.
> 
> Signed-off-by: Raymond Hackley 

Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH 1/4] remoteproc: k3-r5: Fix IPC-only mode detection

2024-07-01 Thread Hari Nagalla

On 6/28/24 14:58, Mathieu Poirier wrote:

This could lead to an incorrect IPC-only mode detection if the reset line is
asserted (so reset_control_status() return > 0) and c_state != 0 and
halted == 0.
In this case, the old code would have detected an IPC-only mode instead
of a mismatched mode.


Your assessment seems to be correct.  That said I'd like to have an RB or a TB
from someone in the TI delegation - guys please have a look.

Agree with Richard's assessment, and the proposed fix looks good.

Reviewed-by:
Hari Nagalla 



Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-01 Thread David Woodhouse
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
> On 28 June 2024 17:38:15 BST, Peter Hilber  
> wrote:
> > On 28.06.24 14:15, David Woodhouse wrote:
> > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> > > > On 27.06.24 16:52, David Woodhouse wrote:
> > > > > I already added a flags field, so this might look something like:
> > > > > 
> > > > >     /*
> > > > >  * Smearing flags. The UTC clock exposed through this 
> > > > > structure
> > > > >  * is only ever true UTC, but a guest operating system may
> > > > >  * choose to offer a monotonic smeared clock to its users. 
> > > > > This
> > > > >  * merely offers a hint about what kind of smearing to 
> > > > > perform,
> > > > >  * for consistency with systems in the nearby environment.
> > > > >  */
> > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* 
> > > > > draft-kuhn-leapsecond-00.txt */
> > > > > 
> > > > > (UTC-SLS is probably a bad example but are there formal definitions 
> > > > > for
> > > > > anything else?)
> > > > 
> > > > I think it could also be more generic, like flags for linear smearing,
> > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields 
> > > > (relative
> > > > to the leap second start). That could also represent UTC-SLS, and
> > > > noon-to-noon, and it would be well-defined.
> > > > 
> > > > This should reduce the likelihood that the guest doesn't know the 
> > > > smearing
> > > > variant.
> > > 
> > > I'm wary of making it too generic. That would seem to encourage a
> > > *proliferation* of false "UTC-like" clocks.
> > > 
> > > It's bad enough that we do smearing at all, let alone that we don't
> > > have a single definition of how to do it.
> > > 
> > > I made the smearing hint a full uint8_t instead of using bits in flags,
> > > in the end. That gives us a full 255 ways of lying to users about what
> > > the time is, so we're unlikely to run out. And it's easy enough to add
> > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> > > that get invented.
> > > 
> > > 
> > 
> > My concern is that the registry update may come after a driver has already
> > been implemented, so that it may be hard to ensure that the smearing which
> > has been chosen is actually implemented.
> 
> Well yes, but why in the name of all that is holy would anyone want
> to invent *new* ways to lie to users about the time? If we capture
> the existing ones as we write this, surely it's a good thing that
> there's a barrier to entry for adding more?

Ultimately though, this isn't the hill for me to die on. I'm pushing on
that topic because I want to avoid the proliferation of *ambiguity*. If
we have a precision clock, we should *know* what the time is.

So how about this proposal. I line up the fields in the proposed shared
memory structure to match your virtio-rtc proposal, using 'subtype' as
you proposed. But, instead of the 'subtype' being valid only for
VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.

So, you have:

+\begin{lstlisting}
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+\end{lstlisting}

I propose that you add

#define VIRTIO_RTC_CLOCK_SMEARED_UTC 3

If my proposed memory structure is subsumed into the virtio-rtc
proposal we'd literally use the same names, but for the time being I'll
update mine to:

/*
 * What time is exposed in the time_sec/time_frac_sec fields?
 */
uint8_t time_type;
#define VMCLOCK_TIME_UTC	0	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI	1	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC	2	/* Since undefined epoch */
#define VMCLOCK_TIME_INVALID	3	/* virtio-rtc uses this for smeared UTC */


I can then use your smearing subtype values as the 'hint' field in the
shared memory structure. You currently have:

+\begin{lstlisting}
+#define VIRTIO_RTC_SUBTYPE_STRICT 0
+#define VIRTIO_RTC_SUBTYPE_SMEAR 1
+#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
+#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
+\end{lstlisting}

I can certainly ensure that 'noon linear' has the same value. I don't
think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:


+\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
+   smearing time in the vicinity of the leap second, in a not
+   precisely defined manner. This avoids clock steps due to UTC
+   leap seconds.

...

+\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
+   standard w.r.t.\ leap second introduction in an unspecified
way
+   (leap seconds may, or may not, be smeared).

To the client, both of those just mean "for a day or so around a leap
second event, you can't trust this device to know what the time is".
There isn't any point in separating "does lie to you" from "might lie
to you", surely? The guest can't do 

Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-07-01 Thread Naina Mehta




On 6/26/2024 9:12 PM, Konrad Dybcio wrote:

On 24.06.2024 1:21 PM, Naina Mehta wrote:



On 6/18/2024 7:08 PM, Konrad Dybcio wrote:



On 6/18/24 15:13, Naina Mehta wrote:

Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850.
Split mpss_dsmharq_mem region into 2 separate regions and
reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---


Alright, we're getting somewhere. The commit message should however motivate
why such changes are necessary. For all we know, the splitting in two is
currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous
- does the firmware have some expectations about them being separate?



Since a different DSM region size is required for different modem firmware,
the mpss_dsmharq_mem region is being split into 2 separate regions.
This would provide the flexibility to remove the region which is
not required for a particular platform.
qlink_logging is being added at the memory region at the address of
0x8880 as the region is being used by the modem firmware.


Ok, now put that in the commit message :)

And I suppose:

"This would provide the flexibility to remove the region which is not
required for a particular platform." - but you still pass both to the
remoteproc in patch 4. Are these regions mutually exclusive?



Yes, for IDP platform, we are using both the DSM regions.
Based on the modem firmware either both the regions have to be used or 
only mpss_dsm_mem has to be used.


Regards,
Naina


Konrad





Re: [PATCH 3/4] remoteproc: k3-r5: k3_r5_rproc_stop: code reorder

2024-07-01 Thread Richard GENOUD

Le 28/06/2024 à 23:18, Mathieu Poirier a écrit :

On Fri, Jun 21, 2024 at 05:00:57PM +0200, Richard Genoud wrote:

In the next commit, a RP_MBOX_SHUTDOWN message will be sent in
k3_r5_rproc_stop() to the remote proc (in lockstep or not)
Thus, the sanity check "do not allow core 0 to stop before core 1"
should be moved at the beginning of the function so that the generic case
can be dealt with.

In order to have an easier patch to review, those actions are broken into
two patches:
- this patch: moving the sanity check at the beginning (No functional
   change).
- next patch: doing the real job (sending shutdown messages to remote
   procs before halting them).

Basically, we had:
- cluster_mode actions
- !cluster_mode sanity check
- !cluster_mode actions
And now:
- !cluster_mode sanity check
- cluster_mode actions
- !cluster_mode actions

Signed-off-by: Richard Genoud 
---
  drivers/remoteproc/ti_k3_r5_remoteproc.c | 24 ++--
  1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 1f18b08618c8..a2ead87952c7 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -636,16 +636,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
struct k3_r5_core *core1, *core = kproc->core;
int ret;
  
-	/* halt all applicable cores */

-   if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
-   list_for_each_entry(core, >cores, elem) {
-   ret = k3_r5_core_halt(core);
-   if (ret) {
-   core = list_prev_entry(core, elem);
-   goto unroll_core_halt;
-   }
-   }
-   } else {
+
+   if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
/* do not allow core 0 to stop before core 1 */
core1 = list_last_entry(>cores, struct k3_r5_core,
elem);
@@ -656,6 +648,18 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
ret = -EPERM;
goto out;
}
+   }
+
+   /* halt all applicable cores */
+   if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
+   list_for_each_entry(core, >cores, elem) {
+   ret = k3_r5_core_halt(core);
+   if (ret) {
+   core = list_prev_entry(core, elem);
+   goto unroll_core_halt;
+   }
+   }
+   } else {
  
  		ret = k3_r5_core_halt(core);

if (ret)


With this patch, the "else" in this "if" condition is coupled with the "if" from
the lockstep mode, making the code extremaly hard to read.  The original code
has a k3_r5_core_halt() in both "if" conditions, making the condition
independent from one another.


I'm not sure I understand what you mean.
Anyway, I'm not happy with this diff, it doesn't reflect what was intended
(which is moving the check "core 0 should not be stopped before core 1" to
the beginning).

Tweaking around with the diff algorithms, I came up with something way easier
to read, I think:

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 1f18b08618c8..a2ead87952c7 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -636,6 +636,20 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
struct k3_r5_core *core1, *core = kproc->core;
int ret;
 
+

+   if (cluster->mode != CLUSTER_MODE_LOCKSTEP) {
+   /* do not allow core 0 to stop before core 1 */
+   core1 = list_last_entry(>cores, struct k3_r5_core,
+   elem);
+   if (core != core1 && core1->rproc->state != RPROC_OFFLINE &&
+   core1->rproc->state != RPROC_SUSPENDED) {
+   dev_err(dev, "%s: can not stop core 0 before core 1\n",
+   __func__);
+   ret = -EPERM;
+   goto out;
+   }
+   }
+
/* halt all applicable cores */
if (cluster->mode == CLUSTER_MODE_LOCKSTEP) {
list_for_each_entry(core, >cores, elem) {
@@ -646,16 +660,6 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
}
}
} else {
-   /* do not allow core 0 to stop before core 1 */
-   core1 = list_last_entry(>cores, struct k3_r5_core,
-   elem);
-   if (core != core1 && core1->rproc->state != RPROC_OFFLINE &&
-   core1->rproc->state != RPROC_SUSPENDED) {
-   dev_err(dev, "%s: can not stop core 0 before core 1\n",
-   __func__);
-   ret = -EPERM;
-  

Re: [PATCH 2/4] remoteproc: k3-r5: Introduce PM suspend/resume handlers

2024-07-01 Thread Richard GENOUD

Le 28/06/2024 à 22:48, Mathieu Poirier a écrit :

On Fri, Jun 21, 2024 at 05:00:56PM +0200, Richard Genoud wrote:

This patch adds the support for system suspend/resume to the ti_k3_R5
remoteproc driver.

In order to save maximum power, the approach here is to completely shut
down the cores that were started by the kernel (i.e. those in the
RUNNING state).
Those which were started before the kernel (in attached mode) will be
detached.

The pm_notifier mechanism is used here because the remote procs' firmware
has to be reloaded at resume, and thus the driver must have access to
the file system where the firmware is stored.

On suspend, the running remote procs are stopped, the attached remote
procs are detached and processor control released.

On resume, the reverse operation is done.
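
As a quick orientation for the hunks below (a sketch only, not the actual
notifier body, which the quoted diff truncates further down):

static int k3_r5_cluster_pm_notifier_call(struct notifier_block *bl,
					  unsigned long state, void *unused)
{
	struct k3_r5_cluster *cluster =
		container_of(bl, struct k3_r5_cluster, pm_notifier);

	switch (state) {
	case PM_HIBERNATION_PREPARE:
	case PM_RESTORE_PREPARE:
	case PM_SUSPEND_PREPARE:
		/* filesystem still available: stop or detach each core */
		break;
	case PM_POST_HIBERNATION:
	case PM_POST_RESTORE:
	case PM_POST_SUSPEND:
		/* reconfigure, then reboot or re-attach each core */
		break;
	}
	return NOTIFY_OK;
}

	/* registered e.g. at probe time: */
	cluster->pm_notifier.notifier_call = k3_r5_cluster_pm_notifier_call;
	register_pm_notifier(&cluster->pm_notifier);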

Based on work from: Hari Nagalla 

Signed-off-by: Richard Genoud 
---
  drivers/remoteproc/ti_k3_r5_remoteproc.c | 123 ++-
  1 file changed, 121 insertions(+), 2 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 39a47540c590..1f18b08618c8 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -112,6 +113,7 @@ struct k3_r5_cluster {

struct list_head cores;
wait_queue_head_t core_transition;
const struct k3_r5_soc_data *soc_data;
+   struct notifier_block pm_notifier;
  };
  
  /**

@@ -577,7 +579,8 @@ static int k3_r5_rproc_start(struct rproc *rproc)
/* do not allow core 1 to start before core 0 */
core0 = list_first_entry(&cluster->cores, struct k3_r5_core,
 elem);
-   if (core != core0 && core0->rproc->state == RPROC_OFFLINE) {
+   if (core != core0 && (core0->rproc->state == RPROC_OFFLINE ||
+ core0->rproc->state == RPROC_SUSPENDED)) {


If I understand correctly, this is to address a possible race condition between
user space wanting to start core1 via sysfs while the system is being suspended.
Is this correct?  If so, please add a comment to explain what is going on.
Otherwise a comment is obviously needed.

Yes, you're right, I'll add a comment on the race condition at suspend.




dev_err(dev, "%s: can not start core 1 before core 0\n",
__func__);
ret = -EPERM;
@@ -646,7 +649,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
/* do not allow core 0 to stop before core 1 */
core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
elem);
-   if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+   if (core != core1 && core1->rproc->state != RPROC_OFFLINE &&
+   core1->rproc->state != RPROC_SUSPENDED) {
dev_err(dev, "%s: can not stop core 0 before core 1\n",
__func__);
ret = -EPERM;
@@ -1238,6 +1242,117 @@ static int k3_r5_rproc_configure_mode(struct 
k3_r5_rproc *kproc)
return ret;
  }
  
+static int k3_r5_rproc_suspend(struct k3_r5_rproc *kproc)

+{
+   unsigned int rproc_state = kproc->rproc->state;
+   int ret;
+
+   if (rproc_state != RPROC_RUNNING && rproc_state != RPROC_ATTACHED)
+   return 0;
+
+   if (rproc_state == RPROC_RUNNING)
+   ret = rproc_shutdown(kproc->rproc);
+   else
+   ret = rproc_detach(kproc->rproc);
+
+   if (ret) {
+   dev_err(kproc->dev, "Failed to %s rproc (%d)\n",
+   (rproc_state == RPROC_RUNNING) ? "shutdown" : "detach",
+   ret);
+   return ret;
+   }
+
+   kproc->rproc->state = RPROC_SUSPENDED;
+
+   return ret;
+}
+
+static int k3_r5_rproc_resume(struct k3_r5_rproc *kproc)
+{
+   int ret;
+
+   if (kproc->rproc->state != RPROC_SUSPENDED)
+   return 0;
+
+   ret = k3_r5_rproc_configure_mode(kproc);
+   if (ret < 0)
+   return -EBUSY;
+
+   /*
+* ret > 0 for IPC-only mode
+* ret == 0 for remote proc mode
+*/
+   if (ret == 0) {
+   /*
+* remote proc looses its configuration when powered off.
+* So, we have to configure it again on resume.
+*/
+   ret = k3_r5_rproc_configure(kproc);
+   if (ret < 0) {
+   dev_err(kproc->dev, "k3_r5_rproc_configure failed 
(%d)\n", ret);
+   return -EBUSY;
+   }
+   }
+
+   return rproc_boot(kproc->rproc);
+}
+
+static int k3_r5_cluster_pm_notifier_call(struct notifier_block *bl,
+ unsigned long state, void *unused)
+{
+ 

Re: [PATCH] LoongArch: make the users of larch_insn_gen_break() constant

2024-07-01 Thread Tiezhu Yang

On 06/29/2024 11:03 PM, Oleg Nesterov wrote:

LoongArch defines UPROBE_SWBP_INSN as a function call and this breaks
arch_uprobe_trampoline() which uses it to initialize a static variable.

Add the new "__builtin_constant_p" helper, __emit_break(), and redefine
the current users of larch_insn_gen_break() to use it.

The patch adds check_emit_break() into kprobes.c and uprobes.c to test
this change. They can be removed if LoongArch boots at least once, but
otoh these 2 __init functions will be discarded by free_initmem().

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Nathan Chancellor 
Closes: https://lore.kernel.org/all/20240614174822.GA1185149@thelio-3990X/
Suggested-by: Andrii Nakryiko 
Signed-off-by: Oleg Nesterov 


Tested on LoongArch machine with Loongson-3A5000 and Loongson-3A6000
CPU, based on 6.10-rc3,

KPROBE_BP_INSN == larch_insn_gen_break(BRK_KPROBE_BP)
KPROBE_SSTEPBP_INSN == larch_insn_gen_break(BRK_KPROBE_SSTEPBP)
UPROBE_SWBP_INSN  == larch_insn_gen_break(BRK_UPROBE_BP)
UPROBE_XOLBP_INSN == larch_insn_gen_break(BRK_UPROBE_XOLBP)

The two functions check_emit_break() can be removed in
arch/loongarch/kernel/kprobes.c and arch/loongarch/kernel/uprobes.c

Tested-by: Tiezhu Yang 

Thanks,
Tiezhu