[PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-20 Thread Andrii Nakryiko
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it
deeper into __uprobe_trace_func(). This is problematic because
__uprobe_trace_func() is called inside an rcu_read_lock()/rcu_read_unlock()
block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() ->
mutex_lock(&ucb->mutex), leading to a splat about using a mutex under
non-sleepable RCU:

  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
   in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq
   preempt_count: 0, expected: 0
   RCU nest depth: 1, expected: 0
   ...
   Call Trace:

dump_stack_lvl+0x3d/0xe0
__might_resched+0x24c/0x270
? prepare_uprobe_buffer+0xd5/0x1d0
__mutex_lock+0x41/0x820
? ___perf_sw_event+0x206/0x290
? __perf_event_task_sched_in+0x54/0x660
? __perf_event_task_sched_in+0x54/0x660
prepare_uprobe_buffer+0xd5/0x1d0
__uprobe_trace_func+0x4a/0x140
uprobe_dispatcher+0x135/0x280
? uprobe_dispatcher+0x94/0x280
uprobe_notify_resume+0x650/0xec0
? atomic_notifier_call_chain+0x21/0x110
? atomic_notifier_call_chain+0xf8/0x110
irqentry_exit_to_user_mode+0xe2/0x1e0
asm_exc_int3+0x35/0x40
   RIP: 0033:0x7f7e1d4da390
   Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00  0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e
   RSP: 002b:7ffd2abc3608 EFLAGS: 0246
   RAX:  RBX: 76d325f1 RCX: 
   RDX: 76d325f1 RSI: 000a RDI: 7ffd2abc3690
   RBP: 000a R08: 00017fb7 R09: 00017fb7
   R10: 00017fb7 R11: 0246 R12: 00017ff2
   R13: 7ffd2abc3610 R14:  R15: 7ffd2abc3780


Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called
slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside
of the RCU-locked section. This still keeps buffer preparation lazy and helps
avoid the overhead when it's not needed. E.g., if there is only a BPF uprobe
handler installed on a given uprobe, the buffer won't be initialized.

Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not
affected, as it doesn't prepare the buffer under an RCU read lock.

Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
Reported-by: Breno Leitao 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 8541fa1494ae..c98e3b3386ba 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -970,19 +970,17 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer **ucbp,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
-   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
	struct trace_event_call *call = trace_probe_event_call(&tu->tp);
 
WARN_ON(call != trace_file->event_call);
 
-   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
@@ -1014,13 +1012,16 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
 struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
+   struct uprobe_cpu_buffer *ucb;
 
if (is_ret_probe(tu))
return 0;
 
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
+
rcu_read_lock();
	trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucbp, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1031,10 +1032,13 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
+   struct uprobe_cpu_buffer *ucb;
+
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
 
rcu_read_lock();
	trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucbp, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
-- 
2.43.0




Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-20 Thread Andrii Nakryiko
On Mon, May 20, 2024 at 8:20 AM Jiri Olsa  wrote:
>
> On Wed, May 15, 2024 at 08:32:30AM -0600, Andrii Nakryiko wrote:
> > On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra  wrote:
> > >
> > > On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote:
> > >
> > > > +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry,
> > > > +					       int start_entry_idx)
> > > > +{
> > > > +#ifdef CONFIG_UPROBES
> > > > + struct uprobe_task *utask = current->utask;
> > > > + struct return_instance *ri;
> > > > + __u64 *cur_ip, *last_ip, tramp_addr;
> > > > +
> > > > + if (likely(!utask || !utask->return_instances))
> > > > + return;
> > > > +
> > > > +	cur_ip = &entry->ip[start_entry_idx];
> > > > +	last_ip = &entry->ip[entry->nr - 1];
> > > > + ri = utask->return_instances;
> > > > + tramp_addr = uprobe_get_trampoline_vaddr();
> > > > +
> > > > + /* If there are pending uretprobes for current thread, they are
> > >
> > > Comment style fail. Also 'for *the* current thread'.
> > >
> >
> > ack, will fix
> >
> > > > +  * recorded in a list inside utask->return_instances; each such
> > > > +  * pending uretprobe replaces traced user function's return address on
> > > > +  * the stack, so when stack trace is captured, instead of seeing
> > > > +  * actual function's return address, we'll have one or many uretprobe
> > > > +  * trampoline addresses in the stack trace, which are not helpful and
> > > > +  * misleading to users.
> > >
> > > I would beg to differ, what if the uprobe is causing the performance
> > > issue?
> >
> > If uprobe/uretprobe code itself is causing performance issues, you'll
> > see that in other stack traces, where this code will be actively
> > running on CPU. I don't think we make anything worse here.
>
> I think we do similar thing in kernel unwind for rethook trampoline used
> in fprobe/kretprobe code, so seems ok to me to do it for uprobes as well
>
> >
> > Here we are talking about the case where the uprobe part is done and
> > it hijacked the return address on the stack, uretprobe is not yet
> > running (and so not causing any performance issues). The presence of
> > this "snooping" (pending) uretprobe is irrelevant to the user that is
> > capturing stack trace. Right now address in [uprobes] VMA section
> > installed by uretprobe infra code is directly replacing correct and
> > actual calling function address.
> >
> > Worst case, one can argue that both [uprobes] and original caller
> > address should be in the stack trace, but I think it still will be
> > confusing to users. And also will make implementation less efficient
> > because now we'll need to insert entries into the array and shift
> > everything around.
>
> agreed this would be confusing.. also as you noted above the return
> trampoline did not get executed yet at the time of the callstack,
> so it's bit misleading
>
> might be stupid idea.. but we do have the 'special' context entries
> that we store in the callstack to mark user/kernel/guest context ..

only when explicitly requested (add_mark argument to
get_perf_callchain), right? BPF doesn't ever set this to true and
generally speaking users don't care and shouldn't care about pending
uretprobe. I think we are conflating unrelated things here, uretprobe
is not running, so it's not really in the stack trace. I'd just do
nothing about it, it should stay transparent.

If uretprobe *handler* is causing issues, you'll see that in all the
other stack traces (according to relative CPU/resource usage of that
handler).

> maybe we could add some special entry (context does not fit too well)
> to point out there's uretprobe going on .. perf tool could print
> 'uretprobe' hint when displaying the original address
>
> jirka
>
> >
> > So as I mentioned above, if the concern is seeing uprobe/uretprobe
> > code using CPU, that doesn't change, we'll see that in the overall set
> > of captured stack traces (be it custom uprobe handler code or BPF
> > program).
> >
> > >
> > > While I do think it makes sense to fix the unwind in the sense that we
> > > should be able to continue the unwind, I don't think it makes sense to
> > > completely hide the presence of uprobes.
> >
> > Unwind isn't broken in this sense, we do unwind the entire stack trace
> > (see examples in the later patch). We just don't capture actual
> > callers if they have uretprobe pending.
> >
> > >
> > > > +  * So here we go over the pending list of uretprobes, and each
> > > > +  * encountered trampoline address is replaced with actual return
> > > > +  * address.
> > > > +  */
> > > > + while (ri && cur_ip <= last_ip) {
> > > > + if (*cur_ip == tramp_addr) {
> > > > + *cur_ip = ri->orig_ret_vaddr;
> > > > + ri = ri->next;
> > > > + }
> > > > + cur_ip++;
> > > > + }
> > > > +#endif
> > > > +}

Re: [PATCH 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-05-20 Thread Jiri Olsa
On Wed, May 08, 2024 at 02:26:04PM -0700, Andrii Nakryiko wrote:
> When tracing user functions with uprobe functionality, it's common to
> install the probe (e.g., a BPF program) at the first instruction of the
> function. This is often going to be `push %rbp` instruction in function
> preamble, which means that within that function frame pointer hasn't
> been established yet. This leads to consistently missing an actual
> caller of the traced function, because perf_callchain_user() only
> records current IP (capturing traced function) and then following frame
> pointer chain (which would be caller's frame, containing the address of
> caller's caller).
> 
> So when we have target_1 -> target_2 -> target_3 call chain and we are
> tracing an entry to target_3, captured stack trace will report
> target_1 -> target_3 call chain, which is wrong and confusing.
> 
> This patch proposes an x86-64-specific heuristic to detect the `push %rbp`
> instruction being traced. If that's the case, with the assumption that the
> application is compiled with frame pointers, this instruction would be
> a strong indicator that this is the entry to the function. In that case,
> return address is still pointed to by %rsp, so we fetch it and add to
> stack trace before proceeding to unwind the rest using frame
> pointer-based logic.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
>  arch/x86/events/core.c  | 20 
>  include/linux/uprobes.h |  2 ++
>  kernel/events/uprobes.c |  2 ++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 5b0dd07b1ef1..82d5570b58ff 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
>   return;
>  
>   pagefault_disable();
> +
> +#ifdef CONFIG_UPROBES
> + /*
> +  * If we are called from uprobe handler, and we are indeed at the very
> +  * entry to user function (which is normally a `push %rbp` instruction,
> +  * under assumption of application being compiled with frame pointers),
> +  * we should read return address from *regs->sp before proceeding
> +  * to follow frame pointers, otherwise we'll skip immediate caller
> +  * as %rbp is not yet setup.
> +  */
> + if (current->utask) {
> + struct arch_uprobe *auprobe = current->utask->auprobe;
> + u64 ret_addr;
> +
> + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&

nice catch, I was wondering if we should set some auprobe flag in
arch_uprobe_analyze_insn and test it here instead of matching the
instruction opcode directly, but that would probably be just more
complicated.. this is simple and in arch code, so np

jirka

> + !__get_user(ret_addr, (const u64 __user *)regs->sp))
> + perf_callchain_store(entry, ret_addr);
> + }
> +#endif
> +
>   while (entry->nr < entry->max_stack) {
>   if (!valid_user_frame(fp, sizeof(frame)))
>   break;
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 0c57eec85339..7b785cd30d86 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -76,6 +76,8 @@ struct uprobe_task {
>   struct uprobe   *active_uprobe;
>   unsigned long   xol_vaddr;
>  
> + struct arch_uprobe  *auprobe;
> +
>   struct return_instance  *return_instances;
>   unsigned int		depth;
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 1c99380dc89d..504693845187 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>   bool need_prep = false; /* prepare return uprobe, when needed */
>  
>   down_read(&uprobe->register_rwsem);
> + current->utask->auprobe = &uprobe->arch;
>   for (uc = uprobe->consumers; uc; uc = uc->next) {
>   int rc = 0;
>  
> @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>  
>   remove &= rc;
>   }
> + current->utask->auprobe = NULL;
>  
>   if (need_prep && !remove)
>   prepare_uretprobe(uprobe, regs); /* put bp at return */
> -- 
> 2.43.0
> 
> 



Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-20 Thread Jiri Olsa
On Wed, May 15, 2024 at 08:32:30AM -0600, Andrii Nakryiko wrote:
> On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra  wrote:
> >
> > On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote:
> >
> > > +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry,
> > > +					       int start_entry_idx)
> > > +{
> > > +#ifdef CONFIG_UPROBES
> > > + struct uprobe_task *utask = current->utask;
> > > + struct return_instance *ri;
> > > + __u64 *cur_ip, *last_ip, tramp_addr;
> > > +
> > > + if (likely(!utask || !utask->return_instances))
> > > + return;
> > > +
> > > +	cur_ip = &entry->ip[start_entry_idx];
> > > +	last_ip = &entry->ip[entry->nr - 1];
> > > + ri = utask->return_instances;
> > > + tramp_addr = uprobe_get_trampoline_vaddr();
> > > +
> > > + /* If there are pending uretprobes for current thread, they are
> >
> > Comment style fail. Also 'for *the* current thread'.
> >
> 
> ack, will fix
> 
> > > +  * recorded in a list inside utask->return_instances; each such
> > > +  * pending uretprobe replaces traced user function's return address on
> > > +  * the stack, so when stack trace is captured, instead of seeing
> > > +  * actual function's return address, we'll have one or many uretprobe
> > > +  * trampoline addresses in the stack trace, which are not helpful and
> > > +  * misleading to users.
> >
> > I would beg to differ, what if the uprobe is causing the performance
> > issue?
> 
> If uprobe/uretprobe code itself is causing performance issues, you'll
> see that in other stack traces, where this code will be actively
> running on CPU. I don't think we make anything worse here.

I think we do a similar thing in the kernel unwind for the rethook trampoline
used in fprobe/kretprobe code, so it seems ok to me to do it for uprobes as well

> 
> Here we are talking about the case where the uprobe part is done and
> it hijacked the return address on the stack, uretprobe is not yet
> running (and so not causing any performance issues). The presence of
> this "snooping" (pending) uretprobe is irrelevant to the user that is
> capturing stack trace. Right now address in [uprobes] VMA section
> installed by uretprobe infra code is directly replacing correct and
> actual calling function address.
> 
> Worst case, one can argue that both [uprobes] and original caller
> address should be in the stack trace, but I think it still will be
> confusing to users. And also will make implementation less efficient
> because now we'll need to insert entries into the array and shift
> everything around.

agreed this would be confusing.. also as you noted above the return
trampoline did not get executed yet at the time of the callstack,
so it's a bit misleading

might be stupid idea.. but we do have the 'special' context entries
that we store in the callstack to mark user/kernel/guest context ..
maybe we could add some special entry (context does not fit too well)
to point out there's uretprobe going on .. perf tool could print
'uretprobe' hint when displaying the original address

jirka

> 
> So as I mentioned above, if the concern is seeing uprobe/uretprobe
> code using CPU, that doesn't change, we'll see that in the overall set
> of captured stack traces (be it custom uprobe handler code or BPF
> program).
> 
> >
> > While I do think it makes sense to fix the unwind in the sense that we
> > should be able to continue the unwind, I don't think it makes sense to
> > completely hide the presence of uprobes.
> 
> Unwind isn't broken in this sense, we do unwind the entire stack trace
> (see examples in the later patch). We just don't capture actual
> callers if they have uretprobe pending.
> 
> >
> > > +  * So here we go over the pending list of uretprobes, and each
> > > +  * encountered trampoline address is replaced with actual return
> > > +  * address.
> > > +  */
> > > + while (ri && cur_ip <= last_ip) {
> > > + if (*cur_ip == tramp_addr) {
> > > + *cur_ip = ri->orig_ret_vaddr;
> > > + ri = ri->next;
> > > + }
> > > + cur_ip++;
> > > + }
> > > +#endif
> > > +}
>