Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 6:15 AM Masami Hiramatsu  wrote:
>
> On Mon,  1 Jul 2024 15:39:25 -0700
> Andrii Nakryiko  wrote:
>
> > It seems like uprobe_write_opcode() doesn't require writer locked
> > mmap_sem, any lock (reader or writer) should be sufficient. This was
> > established in a discussion in [0] and looking through existing code
> > seems to confirm that there is no need for write-locked mmap_sem.
> >
> > Fix the comment to state this clearly.
> >
> >   [0] 
> > https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/
> >
> > Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
> > uprobe_write_opcode()")
>
> nit: why this has Fixes but [01/12] doesn't?
>
> Should I pick both to fixes branch?

I'd keep both of them in probes/for-next, tbh. They are not literally
fixing anything, just cleaning up comments. I can drop the Fixes tag
from this one, if you'd like.

>
> Thank you,
>
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  kernel/events/uprobes.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 081821fd529a..f87049c08ee9 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
> > mm_struct *mm,
> >   * @vaddr: the virtual address to store the opcode.
> >   * @opcode: opcode to be written at @vaddr.
> >   *
> > - * Called with mm->mmap_lock held for write.
> > + * Called with mm->mmap_lock held for read or write.
> >   * Return 0 (success) or a negative errno.
> >   */
> >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov  wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>
> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> > unsigned long vaddr)
> >   goto out;
> >
> >   /*
> > -  * The NULL 'tsk' here ensures that any faults that occur here
> > -  * will not be accounted to the task.  'mm' *is* current->mm,
> > -  * but we treat this as a 'remote' access since it is
> > -  * essentially a kernel access to the memory.
> > +  * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +  * it is essentially a kernel access to the memory.
> >*/
> >   result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov 
>
>
> -
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

I don't know, it was a drive-by cleanup which I'm starting to regret already :)

>
> Oleg.
>



Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 3:13 AM Masami Hiramatsu
 wrote:
>
> On Wed, 3 Jul 2024 10:13:15 +0200
> Peter Zijlstra  wrote:
>
> > On Mon, Jul 01, 2024 at 03:39:28PM -0700, Andrii Nakryiko wrote:
> > > Simplify uprobe registration/unregistration interfaces by making offset
> > > and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> > > existing users already store these fields somewhere in uprobe_consumer's
> > > containing structure, so this doesn't pose any problem. We just move
> > > some fields around.
> > >
> > > On the other hand, this simplifies uprobe_register() and
> > > uprobe_unregister() API by having only struct uprobe_consumer as one
> > > thing representing attachment/detachment entity. This makes batched
> > > versions of uprobe_register() and uprobe_unregister() simpler.
> > >
> > > This also makes uprobe_register_refctr() unnecessary, so remove it and
> > > simplify consumers.
> > >
> > > No functional changes intended.
> > >
> > > Acked-by: Masami Hiramatsu (Google) 
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  include/linux/uprobes.h   | 18 +++
> > >  kernel/events/uprobes.c   | 19 ++-
> > >  kernel/trace/bpf_trace.c  | 21 +++-
> > >  kernel/trace/trace_uprobe.c   | 53 ---
> > >  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 
> > >  5 files changed, 55 insertions(+), 78 deletions(-)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index b503fafb7fb3..a75ba37ce3c8 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -42,6 +42,11 @@ struct uprobe_consumer {
> > > enum uprobe_filter_ctx ctx,
> > > struct mm_struct *mm);
> > >
> > > +   /* associated file offset of this probe */
> > > +   loff_t offset;
> > > +   /* associated refctr file offset of this probe, or zero */
> > > +   loff_t ref_ctr_offset;
> > > +   /* for internal uprobe infra use, consumers shouldn't touch fields 
> > > below */
> > > struct uprobe_consumer *next;
> > >  };
> > >
> > > @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
> > >  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
> > >  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> > >  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct 
> > > mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > > -extern int uprobe_register(struct inode *inode, loff_t offset, struct 
> > > uprobe_consumer *uc);
> > > -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, 
> > > loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> > > +extern int uprobe_register(struct inode *inode, struct uprobe_consumer 
> > > *uc);
> > >  extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
> > > uprobe_consumer *uc, bool);
> > > -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> > > uprobe_consumer *uc);
> > > +extern void uprobe_unregister(struct inode *inode, struct 
> > > uprobe_consumer *uc);
> >
> > It seems very weird and unnatural to split inode and offset like this.
> > The whole offset thing only makes sense within the context of an inode.
>
> Hm, so would you mean we should have inode inside the uprobe_consumer?
> If so, I think it is reasonable.
>

I don't think so, for at least two reasons.

1) We will be wasting 8 bytes per consumer saving exactly the same
inode pointer for no good reason, while uprobe itself already stores
this inode. One can argue that having offset and ref_ctr_offset inside
uprobe_consumer is wasteful, in principle, and I agree. But all
existing users already store them in the same struct that contains
uprobe_consumer, so we are not regressing anything, while making the
interface simpler. We can always optimize that further, if necessary,
but right now that would give us nothing.

But moving inode into uprobe_consumer will regress memory usage.

2) In the context of batched APIs, offset and ref_ctr_offset vary
with each uprobe_consumer, while inode is explicitly the same for all
consumers in that batch. This API makes that very clear.

Technically, we can remove inode completely from *uprobe_unregister*,
because we can now access it from uprobe_consumer->uprobe->inode, but
keeping it still feels right for symmetry and for explicitly making the
point that callers should ensure the inode is kept alive.


> Thank you,
>
> >
> > So yeah, lets not do this.
>
>
> --
> Masami Hiramatsu



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Andrii Nakryiko
On Tue, Jul 2, 2024 at 9:53 AM Steven Rostedt  wrote:
>
> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.
> >
> > No, because the current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or a list of addresses (for uprobe).
> > Um, that may be another good idea.
>
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Nice, I'd be happy to adjust batch API to work for that use case as
well (when we get there).

>
> -- Steve



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-02 Thread Andrii Nakryiko
On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra  wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 23449a8c5e7e..560cf1ca512a 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >
> >  struct uprobe {
> >   struct rb_node  rb_node;/* node in the rb tree */
> > - refcount_t  ref;
> > + atomic64_t  ref;/* see UPROBE_REFCNT_GET 
> > below */
> >   struct rw_semaphore register_rwsem;
> >   struct rw_semaphore consumer_rwsem;
> > + struct rcu_head rcu;
> >   struct list_headpending_list;
> >   struct uprobe_consumer  *consumers;
> >   struct inode*inode; /* Also hold a ref to inode */
> > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct 
> > mm_struct *mm, unsigned long v
> >   *(uprobe_opcode_t *)&auprobe->insn);
> >  }
> >
> > -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > +/*
> > + * Uprobe's 64-bit refcount is actually two independent counters 
> > co-located in
> > + * a single u64 value:
> > + *   - lower 32 bits are just a normal refcount which is incremented and
> > + *   decremented on get and put, respectively, just like a normal refcount
> > + *   would;
> > + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> > + *   incremented by one, no matter whether get or put operation is done.
> > + *
> > + * This upper counter is meant to distinguish between:
> > + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with 
> > "destruction",
> > + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 
> > refcnt
> > + *   sequence, also proceeding to "destruction".
> > + *
> > + * In both cases refcount drops to zero, but in one case it will have 
> > epoch N,
> > + * while the second drop to zero will have a different epoch N + 2, 
> > allowing
> > + * first destructor to bail out because epoch changed between refcount 
> > going
> > + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> > + * 64-bit refcount is double-checked, see put_uprobe() for details).
> > + *
> > + * Lower 32-bit counter is not meant to overflow, while it's expected
>
> So refcount_t very explicitly handles both overflow and underflow and
> screams bloody murder if they happen. Your thing does not..
>

Correct, because I considered it practically impossible to
overflow this refcount. The main source of refcounts is uretprobes
that are holding uprobe references. We limit the depth of supported
recursion to 64, so you'd need 30+ million threads all hitting the
same uprobe/uretprobe to overflow this. I guess in theory it could
happen (not sure if we have some limits on the total number of threads in
the system and if they can be bumped to over 30mln), but it just
seemed out of the realm of practical possibility.

Having said that, I can add similar checks that refcount_t does in
refcount_add and do what refcount_warn_saturate does as well.

> > + * that upper 32-bit counter will overflow occasionally. Note, though, 
> > that we
> > + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit 
> > counter,
> > + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> > + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> > + * epoch effectively a 31-bit counter with highest bit used as a flag to
> > + * perform a fix-up. This ensures epoch and refcnt parts do not 
> > "interfere".
> > + *
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part 
> > and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x00010001LL */
> > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0xLL */
> > +
> > +/*
> > + * Caller has to make sure that:
> > + *   a) either uprobe's refcnt is positive before this call;
> > + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> > + *  preventing uprobe's destructor from removing it from uprobes_tree.
> &

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 6:01 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Jul 2024 15:15:56 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
> >  wrote:
> > >
> > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > > Andrii Nakryiko  wrote:
> > > > > > > >
> > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t 
> > > > > > > > > offset,
> > > > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > > > uprobe_consumer *uc)
> > > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > > +   uprobe_consumer_fn 
> > > > > > > > > get_uprobe_consumer, void *ctx)
> > > > > > > >
> > > > > > > > Is this interface just for avoiding memory allocation? Can't we 
> > > > > > > > just
> > > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > > >
> > > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > > would just contain pointers to uprobe_consumer. Consumers would 
> > > > > > > never
> > > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > > uprobe_consumer struct is embedded in some other struct, so the 
> > > > > > > array
> > > > > > > interface isn't the most convenient.
> > > > > >
> > > > > > OK, I understand it.
> > > > > >
> > > > > > >
> > > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > > allocating an extra array *and keeping it* for the entire 
> > > > > > > duration of
> > > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > > waste. This is because we don't want to do anything that can fail 
> > > > > > > in
> > > > > > > the detachment logic (so no temporary array allocation there).
> > > > > >
> > > > > > No need to change it, that sounds reasonable.
> > > > > >
> > > > >
> > > > > Great, thanks.
> > > > >
> > > > > > >
> > > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > > >
> > > > > > IMHO, maybe it is better to change the interface function to
> > > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If the caller
> > > > > > side uses a linked list of structures, index access will need to
> > > > > > follow the list every time.
> > > > >
> > > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > > are only allowed to ask for the next consumer, then
> > > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > > their own internal index and remember the i-th instance. Which again means
> > > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > > isn't great.
> > > >
> > > > No, I think we can use a cursor variable as;
> > > >
> > > > int uprobe_register_batch(struct inode *inode,
> > > >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > {
> > > > void *cur = ctx;
> > > >
> > > > while (

[PATCH] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-07-01 Thread Andrii Nakryiko
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function preamble, which means that within that function the frame pointer
hasn't been established yet. This leads to consistently missing the actual
caller of the traced function, because perf_callchain_user() only
records the current IP (capturing the traced function) and then follows the
frame pointer chain (which would be the caller's frame, containing the
address of the caller's caller).

So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, the captured stack trace will report
a target_1 -> target_3 call chain, which is wrong and confusing.
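
For concreteness, a sketch of the stack/register state the heuristic below
relies on at the moment the probe fires (nothing here beyond what is described
in the surrounding text):

	/*
	 * At the first instruction of target_3, before `push %rbp` executes:
	 *
	 *   %rsp -> return address into target_2 (the immediate caller that
	 *           plain frame-pointer unwinding would skip)
	 *   %rbp -> still target_2's frame, whose saved return address points
	 *           into target_1 (the caller's caller)
	 *
	 * Reading *%rsp first therefore recovers target_2 before walking %rbp.
	 */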

This patch proposes an x86-64-specific heuristic to detect the `push %rbp`
instruction being traced. Given that the entire kernel implementation of user
space stack trace capturing works under the assumption that user space code
was compiled with frame pointer register (%rbp) preservation, it seems
pretty reasonable to use this instruction as a strong indicator that
this is the entry to the function. In that case, the return address is still
pointed to by %rsp, so we fetch it and add it to the stack trace before
proceeding to unwind the rest using frame pointer-based logic.

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/events/core.c  | 20 
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *regs
return;
 
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+   /*
+* If we are called from uprobe handler, and we are indeed at the very
+* entry to user function (which is normally a `push %rbp` instruction,
+* under assumption of application being compiled with frame pointers),
+* we should read return address from *regs->sp before proceeding
+* to follow frame pointers, otherwise we'll skip immediate caller
+* as %rbp is not yet setup.
+*/
+   if (current->utask) {
+   struct arch_uprobe *auprobe = current->utask->auprobe;
+   u64 ret_addr;
+
+   if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+   !__get_user(ret_addr, (const u64 __user *)regs->sp))
+   perf_callchain_store(entry, ret_addr);
+   }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
 
+   struct arch_uprobe  *auprobe;
+
struct return_instance  *return_instances;
unsigned intdepth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
 
down_read(&uprobe->register_rwsem);
+   current->utask->auprobe = &uprobe->arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
 
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
 
remove &= rc;
}
+   current->utask->auprobe = NULL;
 
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0




[PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
With all the batch uprobe API work we are now finally ready to reap the
benefits. Switch uprobes_treelock from reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)AFTER (mln/s)
  -   ---
  1   3.131 3.140
  2   3.394 3.601
  3   3.630 3.960
  4   3.317 3.551
  5   3.448 3.464
  6   3.345 3.283
  7   3.469 3.444
  8   3.182 3.258
  9   3.138 3.139
  10  2.999 3.212
  11  2.903 3.183
  12  2.802 3.027
  13  2.792 3.027
  14  2.695 3.086
  15  2.822 2.965
  16  2.679 2.939
  17  2.622 2.888
  18  2.628 2.914
  19  2.702 2.836
  20  2.561 2.837

One can see that the per-CPU RW semaphore-based implementation scales better
with the number of CPUs (especially as single CPU throughput is basically
the same).

Note, scalability is still limited by register_rwsem and this will
hopefully be addressed in follow-up patch set(s).

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);  /* serialize rbtree access */
 
 #define UPROBES_HASH_SZ13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
bool destroy;
 
if (!tree_locked)
-   write_lock(&uprobes_treelock);
+   percpu_down_write(&uprobes_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
if (destroy && uprobe_is_active(uprobe))
rb_erase(&uprobe->rb_node, &uprobes_tree);
if (!tree_locked)
-   write_unlock(&uprobes_treelock);
+   percpu_up_write(&uprobes_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
loff_t offset)
 {
struct uprobe *uprobe;
 
-   read_lock(&uprobes_treelock);
+   percpu_down_read(&uprobes_treelock);
uprobe = __find_uprobe(inode, offset);
-   read_unlock(&uprobes_treelock);
+   percpu_up_read(&uprobes_treelock);
 
return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
up_write(&uprobe->register_rwsem);
}
 
-   write_lock(&uprobes_treelock);
+   percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
__put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
-   write_unlock(&uprobes_treelock);
+   percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
ret = 0;
-   write_lock(&uprobes_treelock);
+   percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
}
 unlock_treelock:
-   write_unlock(&uprobes_treelock);
+   percpu_up_write(&uprobes_treelock);
if (ret)
goto cleanup_uprobes;
 
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
-   write_lock(&uprobes_treelock);
+   percpu_down_write(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 

[PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes

2024-07-01 Thread Andrii Nakryiko
Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies BPF clean up code a bit thanks to all-or-nothing
guarantee of uprobes_register_batch().

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/bpf_trace.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe 
*uprobes,
- u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, 
void *ctx)
 {
-   u32 i;
+   struct bpf_uprobe_multi_link *link = ctx;
 
-   for (i = 0; i < cnt; i++) {
-   uprobe_unregister(d_real_inode(path->dentry), 
&uprobes[i].consumer);
-   }
+   return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link 
*link)
struct bpf_uprobe_multi_link *umulti_link;
 
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-   bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, 
umulti_link->cnt);
+   uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), 
umulti_link->cnt,
+   umulti_link_get_uprobe_consumer, umulti_link);
if (umulti_link->task)
put_task_struct(umulti_link->task);
path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
  &bpf_uprobe_multi_link_lops, prog);
 
-   for (i = 0; i < cnt; i++) {
-   err = uprobe_register(d_real_inode(link->path.dentry), 
&uprobes[i].consumer);
-   if (err) {
-   bpf_uprobe_unregister(&path, uprobes, i);
-   goto error_free;
-   }
-   }
+   err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+   umulti_link_get_uprobe_consumer, link);
+   if (err)
+   goto error_free;
 
err = bpf_link_prime(&link->link, &link_primer);
if (err)
-- 
2.43.0




[PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch

2024-07-01 Thread Andrii Nakryiko
Similarly to what we did for uprobes_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take a batched uprobes_treelock once to __put_uprobe() for all
uprobe_consumers. That uprobe_consumer->uprobe field is really handy in
helping with this.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
 {
-   struct uprobe *uprobe;
struct uprobe_consumer *uc;
+   struct uprobe *uprobe;
int i;
 
for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
down_write(&uprobe->register_rwsem);
__uprobe_unregister(uprobe, uc);
up_write(&uprobe->register_rwsem);
-   put_uprobe(uprobe);
+   }
 
+   write_lock(&uprobes_treelock);
+   for (i = 0; i < cnt; i++) {
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
+   if (!uprobe)
+   continue;
+
+   __put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0




[PATCH v2 09/12] uprobes: batch uprobes_treelock during registration

2024-07-01 Thread Andrii Nakryiko
Now that we have a good separation of each registration step, take
uprobes_treelock just once for each relevant registration step, and then
process all relevant uprobes in one go.

Even if the writer lock introduces a relatively large delay (as might happen
with a per-CPU RW semaphore), this will keep overall batch attachment
reasonably fast.

We teach put_uprobe(), through the __put_uprobe() helper, to optionally take
uprobes_treelock or not, to accommodate this pattern.

With these changes we don't need the insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only the
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 45 +
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 128677ffe662..ced85284bbf4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
kfree(uprobe);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
s64 v;
 
@@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe)
if (unlikely((u32)v == 0)) {
bool destroy;
 
-   write_lock(&uprobes_treelock);
+   if (!tree_locked)
+   write_lock(&uprobes_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe)
destroy = atomic64_read(&uprobe->ref) == v;
if (destroy && uprobe_is_active(uprobe))
rb_erase(&uprobe->rb_node, &uprobes_tree);
-   write_unlock(&uprobes_treelock);
+   if (!tree_locked)
+   write_unlock(&uprobes_treelock);
 
/*
 * Beyond here we don't need RCU protection, we are either the
@@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe)
rcu_read_unlock_trace();
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+   __put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
   const struct uprobe *r)
@@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe 
*uprobe)
return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-   struct uprobe *u;
-
-   write_lock(&uprobes_treelock);
-   u = __insert_uprobe(uprobe);
-   write_unlock(&uprobes_treelock);
-
-   return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uc->uprobe = uprobe;
}
 
+   ret = 0;
+   write_lock(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uprobe = uc->uprobe;
 
/* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
+   cur_uprobe = __insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
+
+   __put_uprobe(cur_uprobe, true);
ret = -EINVAL;
-   goto cleanup_uprobes;
+   goto unlock_treelock;
}
kfree(uprobe);
uc->uprobe = cur_uprobe;
}
}
+unlock_treelock:
+   write_unlock(&uprobes_treelock);
+   if (ret)
+   goto cleanup_uprobes;
 
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
@@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
+   write_lock(&uprobes_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
 
if (uc->uprobe)
-   put_uprobe(uc->uprobe);
+   __put_uprobe(uc->uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(&uprobes_treelock);
return ret;
 }
 
-- 
2.43.0




[PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps

2024-07-01 Thread Andrii Nakryiko
Now we are ready to split alloc-and-insert coupled step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes preparations that make uprobes_register_batch() ready to
batch and optimize locking per each phase.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0f928a72a658..128677ffe662 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
return -EINVAL;
}
 
+   /* pre-allocate new uprobe instances */
for (i = 0; i < cnt; i++) {
-   struct uprobe *cur_uprobe;
-
uc = get_uprobe_consumer(i, ctx);
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
RB_CLEAR_NODE(&uprobe->rb_node);
atomic64_set(&uprobe->ref, 1);
 
+   uc->uprobe = uprobe;
+   }
+
+   for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
@@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
-   kfree(uprobe);
ret = -EINVAL;
goto cleanup_uprobes;
}
kfree(uprobe);
-   uprobe = cur_uprobe;
+   uc->uprobe = cur_uprobe;
}
-
-   uc->uprobe = uprobe;
}
 
for (i = 0; i < cnt; i++) {
-- 
2.43.0




[PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()

2024-07-01 Thread Andrii Nakryiko
To allow unbundling the alloc-uprobe-and-insert step, which is currently
tightly coupled, inline alloc_uprobe()'s logic into the
uprobe_register_batch() loop. It's called from one place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 65 ++---
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 68fdf1b8e4bf..0f928a72a658 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct 
uprobe *uprobe)
(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-  loff_t ref_ctr_offset)
-{
-   struct uprobe *uprobe, *cur_uprobe;
-
-   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-   if (!uprobe)
-   return ERR_PTR(-ENOMEM);
-
-   uprobe->inode = inode;
-   uprobe->offset = offset;
-   uprobe->ref_ctr_offset = ref_ctr_offset;
-   init_rwsem(&uprobe->register_rwsem);
-   init_rwsem(&uprobe->consumer_rwsem);
-   RB_CLEAR_NODE(&uprobe->rb_node);
-   atomic64_set(&uprobe->ref, 1);
-
-   /* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
-   /* a uprobe exists for this inode:offset combination */
-   if (cur_uprobe != uprobe) {
-   if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
-   kfree(uprobe);
-   return ERR_PTR(-EINVAL);
-   }
-   kfree(uprobe);
-   uprobe = cur_uprobe;
-   }
-
-   return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
down_write(&uprobe->consumer_rwsem);
@@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
uc = get_uprobe_consumer(i, ctx);
 
-   uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-   if (IS_ERR(uprobe)) {
-   ret = PTR_ERR(uprobe);
+   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+   if (!uprobe) {
+   ret = -ENOMEM;
goto cleanup_uprobes;
}
 
+   uprobe->inode = inode;
+   uprobe->offset = uc->offset;
+   uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+   init_rwsem(&uprobe->register_rwsem);
+   init_rwsem(&uprobe->consumer_rwsem);
+   RB_CLEAR_NODE(&uprobe->rb_node);
+   atomic64_set(&uprobe->ref, 1);
+
+   /* add to uprobes_tree, sorted on inode:offset */
+   cur_uprobe = insert_uprobe(uprobe);
+   /* a uprobe exists for this inode:offset combination */
+   if (cur_uprobe != uprobe) {
+   if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
+   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+   put_uprobe(cur_uprobe);
+   kfree(uprobe);
+   ret = -EINVAL;
+   goto cleanup_uprobes;
+   }
+   kfree(uprobe);
+   uprobe = cur_uprobe;
+   }
+
uc->uprobe = uprobe;
}
 
-- 
2.43.0




[PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction with uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in the next patches, allows us to batch
locking for each phase and in this way amortize any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to a per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
Then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. And then the last step is actual
uprobe registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so end result is all-or-nothing, as
described above.
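
For illustration, a minimal (hypothetical) caller sketch of the resulting API
shape; `struct my_probe` and the helpers are made-up names, while the
uprobe_consumer fields and the uprobe_{register,unregister}_batch() signatures
are the ones introduced by this series:

struct my_probe {
	struct uprobe_consumer consumer; /* carries offset/ref_ctr_offset/handler */
	/* ... caller-private state ... */
};

static struct uprobe_consumer *my_get_consumer(size_t idx, void *ctx)
{
	struct my_probe *probes = ctx;

	return &probes[idx].consumer;
}

static int my_attach_all(struct inode *inode, struct my_probe *probes, int cnt)
{
	/* all-or-nothing: on error none of the consumers remain attached */
	return uprobe_register_batch(inode, cnt, my_get_consumer, probes);
}

static void my_detach_all(struct inode *inode, struct my_probe *probes, int cnt)
{
	uprobe_unregister_batch(inode, cnt, my_get_consumer, probes);
}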

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h |  17 
 kernel/events/uprobes.c | 180 
 2 files changed, 146 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
loff_t ref_ctr_offset;
/* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
+   /* associated uprobe instance (or NULL if consumer isn't attached) */
+   struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs 
*regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn get_uprobe_consumer, void 
*ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct 
uprobe_consumer *uc)
 {
return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx)
+{
+   return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, 
bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8759c6d0683e..68fdf1b8e4bf 100644

[PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-07-01 Thread Andrii Nakryiko
Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies uprobe_register() and
uprobe_unregister() API by having only struct uprobe_consumer as one
thing representing attachment/detachment entity. This makes batched
versions of uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.
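
For illustration, the caller-side change looks roughly like this (hypothetical
consumer `uc`; the signatures are the before/after ones shown in the diff):

	/* before: offsets passed alongside the consumer */
	err = uprobe_register_refctr(inode, offset, ref_ctr_offset, &uc);
	...
	uprobe_unregister(inode, offset, &uc);

	/* after: offsets carried by the consumer itself */
	uc.offset = offset;
	uc.ref_ctr_offset = ref_ctr_offset;
	err = uprobe_register(inode, &uc);
	...
	uprobe_unregister(inode, &uc);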

Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h   | 18 +++
 kernel/events/uprobes.c   | 19 ++-
 kernel/trace/bpf_trace.c  | 21 +++-
 kernel/trace/trace_uprobe.c   | 53 ---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 
 5 files changed, 55 insertions(+), 78 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
 
+   /* associated file offset of this probe */
+   loff_t offset;
+   /* associated refctr file offset of this probe, or zero */
+   loff_t ref_ctr_offset;
+   /* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs) instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-   return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 560cf1ca512a..8759c6d0683e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
struct uprobe *uprobe;
 
-   uprobe = find_uprobe(inode, offset);
+   uprobe = find_uprobe(inode, uc->offset);
if (WARN_ON(!uprobe))
return;
 
@@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, 
loff_t offset,
return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-   struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-   return __uprobe_register(inode, offset, 0, uc);
+   return __uprobe_register(inod

[PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()

2024-07-01 Thread Andrii Nakryiko
Return -ENOMEM instead of NULL, which makes the caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f87049c08ee9..23449a8c5e7e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, 
loff_t offset,
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
if (!uprobe)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
uprobe->inode = inode;
uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t 
offset,
 
  retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-   if (!uprobe)
-   return -ENOMEM;
if (IS_ERR(uprobe))
return PTR_ERR(uprobe);
 
-- 
2.43.0




[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
note the specific values
of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
a single atomic64_t is actually two sort-of-independent 32-bit counters
that are incremented/decremented with a single atomic64_add_return()
operation. Note also a small and extremely rare (and thus having no
effect on performance) need to clear the highest bit every 2 billion
get/put operations to prevent the high 32-bit counter from "bleeding over"
into the lower 32-bit counter.

Another aspect with this race is the winning CPU might, at least
theoretically, be so quick that it will free uprobe memory before losing
CPU gets a chance to discover that it lost. To prevent this, we
protect and delay uprobe freeing with RCU. We can't use
rcu_read_lock() + rcu_read_unlock(), because we need to take locks
inside the RCU critical section. Luckily, we have RCU Tasks Trace
flavor, which supports locking and sleeping. It is already used by BPF
subsystem for sleepable BPF programs (including sleepable BPF uprobe
programs), and is optimized for reader-dominated workflows. It fits
perfectly and doesn't seem to introduce any significant slowdowns in
uprobe hot path.

All the above contained trickery aside, we end up with nice semantics
for get and put operations, where get always succeeds and put handles
all the races properly and transparently to the caller.
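
To make the put side concrete, a condensed sketch of the resulting logic
(simplified from the actual patch; RCU-delayed freeing and other details are
omitted, and uprobes_treelock is shown as the plain rwlock it still is at this
point in the series):

static void put_uprobe(struct uprobe *uprobe)
{
	/* decrement the refcnt half, bump the epoch half */
	s64 v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

	if (unlikely((u32)v == 0)) {
		bool destroy;

		write_lock(&uprobes_treelock);
		/*
		 * Only destroy if nobody did a get/put since our drop to
		 * zero: the full 64-bit value (epoch + refcnt) must still
		 * match what we observed above.
		 */
		destroy = atomic64_read(&uprobe->ref) == v;
		if (destroy && uprobe_is_active(uprobe))
			rb_erase(&uprobe->rb_node, &uprobes_tree);
		write_unlock(&uprobes_treelock);

		/* ... RCU-delayed kfree() when destroy is true ... */
	}
}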

And just to justify this somewhat unorthodox refcounting approach: under
a uprobe triggering micro-benchmark (using BPF selftests' bench tool) with
8 triggering threads, the atomic_inc_not_zero() approach was producing about
3.3 million/sec total uprobe triggerings across all threads, while the
final atomic64_add_return()-based approach managed to get 3.6 million/sec
throughput under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
atomic_inc_not_zero approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
atomic_add_and_return approach implemented by this patch.

So, the CPU is spending relatively more time in get/put operations while
delivering less total throughput when using atomic_inc_not_zero(). And
this will be even more prominent once we optimize away uprobe->register_rwsem
in the subsequent patch sets. So while slightly less straightforward,
current approach seems to be clearly winning and justified.

We also rename get_uprobe() to __get_uprobe() to indicate it's
a delicate internal helper that is only safe to call under valid
circumstances:
  - while holding uprobes_treelock (to synchronize with exclusive write
lock in put_uprobe(), as described above);
  - or if we have a guarantee that uprobe's refcount is already positive
through caller holding at least one refcount (in this case there is
no risk of refcount dropping to zero by any other CPU).

We also document why it's safe to do unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 260 ++--
 1 file changed, 195 insertions(+), 65 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 23449a8c5e7e..560cf1ca512a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 struct uprobe {
struct rb_node  rb_node;/* node in the rb tree */
-   refcount_t  ref;
+   atomic64_t  ref;/* see UPROBE_REFCNT_GET below 
*/
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
+   struct rcu_head rcu;
struct list_headpending_list;
struct uprobe_consumer  *consumers;
struct inode*inode; /* Also hold a ref to inode */
@@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct 
mm_struct *mm, unsigned long v
*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located in
+ * a single u64 value:
+ *   - lower 32 bits are just a normal refcount which is incremented and
+ *   decremented on get and put, respectively, just like a normal refcount
+ *   would;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *   incremented by one, no matter whether get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
+ *   sequence, also proceeding to "destruction

[PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-01 Thread Andrii Nakryiko
It seems like uprobe_write_opcode() doesn't require writer locked
mmap_sem, any lock (reader or writer) should be sufficient. This was
established in a discussion in [0] and looking through existing code
seems to confirm that there is no need for write-locked mmap_sem.

Fix the comment to state this clearly.

  [0] 
https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/

Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
uprobe_write_opcode()")
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081821fd529a..f87049c08ee9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-- 
2.43.0




[PATCH v2 01/12] uprobes: update outdated comment

2024-07-01 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
if (result < 0)
-- 
2.43.0




[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

2024-07-01 Thread Andrii Nakryiko
This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows us to accommodate the much higher latency of taking a per-CPU
RW semaphore for write. The end result of this patch set is that attaching
50 thousand uprobes with BPF multi-uprobes doesn't regress and takes about
200ms both before and after these changes.

Patch #5 updates existing uprobe consumers to put all the necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data
around; we just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.
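
For illustration, the batched APIs are driven through an index-based
callback; a minimal caller-side sketch (the struct and array names here
are made up) looks roughly like this:

static struct uprobe_consumer *my_get_consumer(size_t idx, void *ctx)
{
	struct my_link *link = ctx;	/* hypothetical caller state */

	return &link->consumers[idx];
}

	err = uprobe_register_batch(inode, link->cnt, my_get_consumer, link);
	...
	uprobe_unregister_batch(inode, link->cnt, my_get_consumer, link);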

Patch #11 switches BPF multi-uprobes to the batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

v1->v2:
  - added RCU-delayed uprobe freeing to put_uprobe() (Masami);
  - fixed clean up handling in uprobe_register_batch (Jiri);
  - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg);
  - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid
comment instead (Oleg).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h   |  29 +-
 kernel/events/uprobes.c   | 550 --
 kernel/trace/bpf_trace.c  |  40 +-
 kernel/trace/trace_uprobe.c   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  22 +-
 5 files changed, 447 insertions(+), 247 deletions(-)

-- 
2.43.0




Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
 wrote:
>
> On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
> >
> > On Fri, 28 Jun 2024 09:34:26 -0700
> > Andrii Nakryiko  wrote:
> >
> > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu 
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > Andrii Nakryiko  wrote:
> > > > > >
> > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > > uprobe_consumer *uc)
> > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > > void *ctx)
> > > > > >
> > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > >
> > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > interface isn't the most convenient.
> > > >
> > > > OK, I understand it.
> > > >
> > > > >
> > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > waste. This is because we don't want to do anything that can fail in
> > > > > the detachment logic (so no temporary array allocation there).
> > > >
> > > > No need to change it, that sounds reasonable.
> > > >
> > >
> > > Great, thanks.
> > >
> > > > >
> > > > > Anyways, let me know how you feel about keeping this callback.
> > > >
> > > > IMHO, maybe the interface function is better to change to
> > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > side uses a linked list of structure, index access will need to
> > > > follow the list every time.
> > >
> > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > are only allowed to ask for the next consumer, then
> > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > its own internal index and remember ith instance. Which again means
> > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > isn't great.
> >
> > No, I think we can use a cursor variable as;
> >
> > int uprobe_register_batch(struct inode *inode,
> >  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > {
> > void *cur = ctx;
> >
> > while ((uc = get_uprobe_consumer()) != NULL) {
> > ...
> > }
> >
> > cur = ctx;
> > while ((uc = get_uprobe_consumer()) != NULL) {
> > ...
> > }
> > }
> >
> > This can also remove the cnt.
>
> Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> for callers, but we have one right now, and might have another one, so
> not a big deal.
>

Actually, now that I started implementing this, I really-really don't
like it. In the example above you assume that, by storing and reusing
the original ctx value, you will reset the iteration to the very
beginning. This is not how it works in practice, though. Ctx is most
probably a pointer to some struct somewhere with iteration state
(e.g., an array of all uprobes + current index), and so
get_uprobe_consumer() doesn't update `void *ctx` itself, it updates
the state of that struct.

And so there is no easy and clean way to reset this iterator without
adding another callback or something, at which point it becomes quite
cumbersome and convoluted.
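
To make the difference concrete (the caller-side names below are made
up for illustration), an index-based callback is stateless and can be
re-driven from idx == 0 in each independent pass, while a cursor-based
one keeps its position inside the ctx struct and can't be rewound just
by reusing the ctx pointer:

struct my_ctx {				/* hypothetical caller state */
	struct uprobe_consumer *consumers;
	size_t cnt;
	size_t cur;			/* cursor position */
};

/* index-based: every pass simply restarts at idx == 0 */
static struct uprobe_consumer *get_uprobe_consumer(size_t idx, void *ctx)
{
	struct my_ctx *c = ctx;

	return &c->consumers[idx];
}

/* cursor-based: advances c->cur, so `cur = ctx;` in the caller doesn't
 * actually rewind the iteration */
static struct uprobe_consumer *next_uprobe_consumer(void *ctx)
{
	struct my_ctx *c = ctx;

	return c->cur < c->cnt ? &c->consumers[c->cur++] : NULL;
}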

Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko
 wrote:
>
> On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 24 Jun 2024 17:21:36 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Anyways, under exclusive writer lock, we double-check that refcount
> > > didn't change and is still zero. If it is, we proceed with destruction,
> > > because at that point we have a guarantee that find_active_uprobe()
> > > can't successfully look up this uprobe instance, as it's going to be
> > > removed in destructor under writer lock. If, on the other hand,
> > > find_active_uprobe() managed to bump refcount from zero to one in
> > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > > pretend like atomic_dec_and_test() never returned true. There is no
> > > resource freeing or any other irreversible action taken up till this
> > > point, so we just exit early.
> > >
> > > One tricky part in the above is actually two CPUs racing and dropping
> > > refcnt to zero, and then attempting to free resources. This can happen
> > > as follows:
> > >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab 
> > > uprobes_treelock;
> > >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > > which point it decides that it needs to free uprobe as well.
> > >
> > > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > > uprobe, which is obviously wrong. To prevent this situation, we augment
> > > refcount with epoch counter, which is always incremented by 1 on either
> > > get or put operation. This allows those two CPUs above to disambiguate
> > > who should actually free uprobe (it's the CPU #1, because it has
> > > up-to-date epoch). See comments in the code and note the specific values
> > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > > that are incremented/decremented with a single atomic64_add_return()
> > > operation. Note also a small and extremely rare (and thus having no
> > > effect on performance) need to clear the highest bit every 2 billion
> > > get/put operations to prevent high 32-bit counter from "bleeding over"
> > > into lower 32-bit counter.
> >
> > I have a question here.
> > Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> > the uprobes_treelock, and free the uprobe before CPU#0 validates
> > uprobe->ref again? e.g.
> >
> > CPU#0                                  CPU#1
> >
> > put_uprobe() {
> >     atomic64_add_return()
> >                                        __get_uprobe();
> >                                        put_uprobe() {
> >                                            kfree(uprobe)
> >                                        }
> >     write_lock(&uprobes_treelock);
> >     atomic64_read(&uprobe->ref);
> > }
> >
> > I think it is a very rare case, but I could not find any code to prevent
> > this scenario.
> >
>
> Yes, I think you are right, great catch!
>
> I concentrated on preventing double kfree() in this situation, and
> somehow convinced myself that eager kfree() is fine. But I think I'll
> need to delay freeing, probably with RCU. The problem is that we can't
> use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
> to be a sleepable variant of RCU. I'm thinking of using
> rcu_read_lock_trace(), the same variant of RCU we use for sleepable
> BPF programs (including sleepable uprobes). srcu might be too heavy
> for this.
>
> I'll try a few variants over the next few days and see how the
> performance looks.
>

So I think I'm going with the changes below, incorporated into this
patch (nothing else changes). __get_uprobe() doesn't need any added
RCU protection (we know that uprobe is alive). It's only put_uprobe()
that needs to guarantee RCU protection from the moment we drop the
refcount all the way until we know whether we are the winning
destructor or not.
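
Roughly, the idea is something like this (a sketch only, the actual v2
code may differ in details):

static void put_uprobe(struct uprobe *uprobe)
{
	s64 v;

	/*
	 * Keep uprobe RCU-protected from the moment we drop the refcount
	 * until we know whether we won the destruction race; it has to be
	 * the sleepable tasks-trace flavor because we take locks inside.
	 */
	rcu_read_lock_trace();

	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
	if (unlikely((u32)v == 0)) {
		/* take uprobes_treelock, re-check the refcount, and only
		 * the winning CPU erases the uprobe from the tree and
		 * frees it with an RCU-delayed kfree()
		 */
	}

	rcu_read_unlock_trace();
}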

Good thing is that the changes are pretty minimal in code and also
don't seem to regress performance/scalability. So I'm pretty happy
about that, will send v2 soon.


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 07ad8b2e7508..41d9e37633ca 100644
--- a/kernel/events/

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Andrii Nakryiko
On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu  wrote:
>
> On Fri, 28 Jun 2024 09:34:26 -0700
> Andrii Nakryiko  wrote:
>
> > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  
> > wrote:
> > >
> > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > -  loff_t ref_ctr_offset, struct 
> > > > > > uprobe_consumer *uc)
> > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > +   uprobe_consumer_fn get_uprobe_consumer, 
> > > > > > void *ctx)
> > > > >
> > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > allocate a temporary array of *uprobe_consumer instead?
> > > >
> > > > Yes, exactly, to avoid the need for allocating another array that
> > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > just have an array of `struct uprobe_consumer *`, because
> > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > interface isn't the most convenient.
> > >
> > > OK, I understand it.
> > >
> > > >
> > > > If you feel strongly, I can do an array, but this necessitates
> > > > allocating an extra array *and keeping it* for the entire duration of
> > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > waste. This is because we don't want to do anything that can fail in
> > > > the detachment logic (so no temporary array allocation there).
> > >
> > > No need to change it, that sounds reasonable.
> > >
> >
> > Great, thanks.
> >
> > > >
> > > > Anyways, let me know how you feel about keeping this callback.
> > >
> > > IMHO, maybe the interface function is better to change to
> > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > side uses a linked list of structure, index access will need to
> > > follow the list every time.
> >
> > This would be problematic. Note how we call get_uprobe_consumer(i,
> > ctx) with i going from 0 to N in multiple independent loops. So if we
> > are only allowed to ask for the next consumer, then
> > uprobe_register_batch and uprobe_unregister_batch would need to build
> > its own internal index and remember ith instance. Which again means
> > more allocations and possibly failing uprobe_unregister_batch(), which
> > isn't great.
>
> No, I think we can use a cursor variable as;
>
> int uprobe_register_batch(struct inode *inode,
>  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> {
> void *cur = ctx;
>
> while ((uc = get_uprobe_consumer()) != NULL) {
> ...
> }
>
> cur = ctx;
> while ((uc = get_uprobe_consumer()) != NULL) {
> ...
> }
> }
>
> This can also remove the cnt.

Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
for callers, but we have one right now, and might have another one, so
not a big deal.

>
> Thank you,
>
> >
> > For now this API works well, I propose to keep it as is. For linked
> > list case consumers would need to allocate one extra array or pay the
> > price of O(N) search (which might be ok, depending on how many uprobes
> > are being attached). But we don't have such consumers right now,
> > thankfully.
> >
> > >
> > > Thank you,
> > >
> > >
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) 
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) 
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-28 Thread Andrii Nakryiko
On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  wrote:
>
> On Thu, 27 Jun 2024 09:47:10 -0700
> Andrii Nakryiko  wrote:
>
> > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  
> > wrote:
> > >
> > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > -  loff_t ref_ctr_offset, struct 
> > > > uprobe_consumer *uc)
> > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > +   uprobe_consumer_fn get_uprobe_consumer, void 
> > > > *ctx)
> > >
> > > Is this interface just for avoiding memory allocation? Can't we just
> > > allocate a temporary array of *uprobe_consumer instead?
> >
> > Yes, exactly, to avoid the need for allocating another array that
> > would just contain pointers to uprobe_consumer. Consumers would never
> > just have an array of `struct uprobe_consumer *`, because
> > uprobe_consumer struct is embedded in some other struct, so the array
> > interface isn't the most convenient.
>
> OK, I understand it.
>
> >
> > If you feel strongly, I can do an array, but this necessitates
> > allocating an extra array *and keeping it* for the entire duration of
> > BPF multi-uprobe link (attachment) existence, so it feels like a
> > waste. This is because we don't want to do anything that can fail in
> > the detachment logic (so no temporary array allocation there).
>
> No need to change it, that sounds reasonable.
>

Great, thanks.

> >
> > Anyways, let me know how you feel about keeping this callback.
>
> IMHO, maybe the interface function is better to change to
> `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> side uses a linked list of structure, index access will need to
> follow the list every time.

This would be problematic. Note how we call get_uprobe_consumer(i,
ctx) with i going from 0 to N in multiple independent loops. So if we
are only allowed to ask for the next consumer, then
uprobe_register_batch and uprobe_unregister_batch would need to build
its own internal index and remember ith instance. Which again means
more allocations and possibly failing uprobe_unregister_batch(), which
isn't great.

For now this API works well, I propose to keep it as is. For linked
list case consumers would need to allocate one extra array or pay the
price of O(N) search (which might be ok, depending on how many uprobes
are being attached). But we don't have such consumers right now,
thankfully.

>
> Thank you,
>
>
> >
> > >
> > > Thank you,
> > >
> > > --
> > > Masami Hiramatsu (Google) 
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-27 Thread Andrii Nakryiko
On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  wrote:
>
> On Mon, 24 Jun 2024 17:21:38 -0700
> Andrii Nakryiko  wrote:
>
> > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > -  loff_t ref_ctr_offset, struct uprobe_consumer 
> > *uc)
> > +int uprobe_register_batch(struct inode *inode, int cnt,
> > +   uprobe_consumer_fn get_uprobe_consumer, void *ctx)
>
> Is this interface just for avoiding memory allocation? Can't we just
> allocate a temporary array of *uprobe_consumer instead?

Yes, exactly, to avoid the need for allocating another array that
would just contain pointers to uprobe_consumer. Consumers would never
just have an array of `struct uprobe_consumer *`, because
uprobe_consumer struct is embedded in some other struct, so the array
interface isn't the most convenient.

If you feel strongly, I can do an array, but this necessitates
allocating an extra array *and keeping it* for the entire duration of
BPF multi-uprobe link (attachment) existence, so it feels like a
waste. This is because we don't want to do anything that can fail in
the detachment logic (so no temporary array allocation there).

Anyways, let me know how you feel about keeping this callback.
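
For reference, a rough sketch of the consumer-side layout this caters
to (simplified and approximate, modeled on the BPF multi-uprobe side),
where the consumer is embedded rather than referenced through a pointer:

struct bpf_uprobe {
	struct bpf_uprobe_multi_link *link;
	loff_t offset;
	/* ... */
	struct uprobe_consumer consumer;	/* embedded, not a pointer */
};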

>
> Thank you,
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-27 Thread Andrii Nakryiko
On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu  wrote:
>
> On Mon, 24 Jun 2024 17:21:36 -0700
> Andrii Nakryiko  wrote:
>
> > Anyways, under exclusive writer lock, we double-check that refcount
> > didn't change and is still zero. If it is, we proceed with destruction,
> > because at that point we have a guarantee that find_active_uprobe()
> > can't successfully look up this uprobe instance, as it's going to be
> > removed in destructor under writer lock. If, on the other hand,
> > find_active_uprobe() managed to bump refcount from zero to one in
> > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > pretend like atomic_dec_and_test() never returned true. There is no
> > resource freeing or any other irreversible action taken up till this
> > point, so we just exit early.
> >
> > One tricky part in the above is actually two CPUs racing and dropping
> > refcnt to zero, and then attempting to free resources. This can happen
> > as follows:
> >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
> >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > which point it decides that it needs to free uprobe as well.
> >
> > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > uprobe, which is obviously wrong. To prevent this situations, we augment
> > refcount with epoch counter, which is always incremented by 1 on either
> > get or put operation. This allows those two CPUs above to disambiguate
> > who should actually free uprobe (it's the CPU #1, because it has
> > up-to-date epoch). See comments in the code and note the specific values
> > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > that are incremented/decremented with a single atomic64_add_return()
> > operation. Note also a small and extremely rare (and thus having no
> > effect on performance) need to clear the highest bit every 2 billion
> > get/put operations to prevent high 32-bit counter from "bleeding over"
> > into lower 32-bit counter.
>
> I have a question here.
> Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> the uprobes_treelock, and free the uprobe before CPU#0 validates
> uprobe->ref again? e.g.
>
> CPU#0                                  CPU#1
>
> put_uprobe() {
>     atomic64_add_return()
>                                        __get_uprobe();
>                                        put_uprobe() {
>                                            kfree(uprobe)
>                                        }
>     write_lock(&uprobes_treelock);
>     atomic64_read(&uprobe->ref);
> }
>
> I think it is a very rare case, but I could not find any code to prevent
> this scenario.
>

Yes, I think you are right, great catch!

I concentrated on preventing double kfree() in this situation, and
somehow convinced myself that eager kfree() is fine. But I think I'll
need to delay freeing, probably with RCU. The problem is that we can't
use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
to be a sleepable variant of RCU. I'm thinking of using
rcu_read_lock_trace(), the same variant of RCU we use for sleepable
BPF programs (including sleepable uprobes). srcu might be too heavy
for this.

I'll try a few variants over the next few days and see how the
performance looks.

> Thank you,
>
>
> --
> Masami Hiramatsu (Google) 
>



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-26 Thread Andrii Nakryiko
On Wed, Jun 26, 2024 at 4:27 AM Jiri Olsa  wrote:
>
> On Mon, Jun 24, 2024 at 05:21:38PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > + for (i = 0; i < cnt; i++) {
> > + uc = get_uprobe_consumer(i, ctx);
> > +
> > + /* Each consumer must have at least one set consumer */
> > + if (!uc || (!uc->handler && !uc->ret_handler))
> > + return -EINVAL;
> > + /* Racy, just to catch the obvious mistakes */
> > + if (uc->offset > i_size_read(inode))
> > + return -EINVAL;
> > + if (uc->uprobe)
> > + return -EINVAL;
> > + /*
> > +  * This ensures that copy_from_page(), copy_to_page() and
> > +  * __update_ref_ctr() can't cross page boundary.
> > +  */
> > + if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
> > + return -EINVAL;
> > + if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
> > + return -EINVAL;
> > + }
> >
> > - down_write(>register_rwsem);
> > - consumer_add(uprobe, uc);
> > - ret = register_for_each_vma(uprobe, uc);
> > - if (ret)
> > - __uprobe_unregister(uprobe, uc);
> > - up_write(>register_rwsem);
> > + for (i = 0; i < cnt; i++) {
> > + uc = get_uprobe_consumer(i, ctx);
> >
> > - if (ret)
> > - put_uprobe(uprobe);
> > + uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
> > + if (IS_ERR(uprobe)) {
> > + ret = PTR_ERR(uprobe);
> > + goto cleanup_uprobes;
> > + }
> > +
> > + uc->uprobe = uprobe;
> > + }
> > +
> > + for (i = 0; i < cnt; i++) {
> > + uc = get_uprobe_consumer(i, ctx);
> > + uprobe = uc->uprobe;
> > +
> > + down_write(>register_rwsem);
> > + consumer_add(uprobe, uc);
> > + ret = register_for_each_vma(uprobe, uc);
> > + if (ret)
> > + __uprobe_unregister(uprobe, uc);
> > + up_write(>register_rwsem);
> > +
> > + if (ret) {
> > + put_uprobe(uprobe);
> > + goto cleanup_unreg;
> > + }
> > + }
> > +
> > + return 0;
> >
> > +cleanup_unreg:
> > + /* unregister all uprobes we managed to register until failure */
> > + for (i--; i >= 0; i--) {
> > + uc = get_uprobe_consumer(i, ctx);
> > +
> > + down_write(>register_rwsem);
> > + __uprobe_unregister(uc->uprobe, uc);
> > + up_write(>register_rwsem);
> > + }
> > +cleanup_uprobes:
>
> when we jump here from 'goto cleanup_uprobes' not all of the
> consumers might have uc->uprobe set up
>
> perhaps we can set cnt = i - 1 before the goto, or just check uc->uprobe below

yep, you are right, I missed this part during multiple rounds of
refactorings. I think the `if (uc->uprobe)` check is the cleanest
approach here.

>
>
> > + /* put all the successfully allocated/reused uprobes */
> > + for (i = cnt - 1; i >= 0; i--) {
>
> curious, any reason why we go from the top here?

No particular reason. This started as (i = i - 1; i >= 0; i--), but
then as I kept splitting steps I needed to do this over all uprobes.
Anyways, I can do a clean `i = 0; i < cnt; i++` with `if (uc->uprobe)`
check.
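
Something along these lines, roughly (a sketch of how the cleanup tail
could end up looking):

cleanup_uprobes:
	/* put all the successfully allocated/reused uprobes */
	for (i = 0; i < cnt; i++) {
		uc = get_uprobe_consumer(i, ctx);

		if (!uc->uprobe)
			continue;

		put_uprobe(uc->uprobe);
		uc->uprobe = NULL;
	}
	return ret;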

>
> thanks,
> jirka
>
> > + uc = get_uprobe_consumer(i, ctx);
> > +
> > + put_uprobe(uc->uprobe);
> > + uc->uprobe = NULL;
> > + }
> >   return ret;
> >  }
> >
> >  int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
> >  {
> > - return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
> > + return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
> >  }
> >  EXPORT_SYMBOL_GPL(uprobe_register);
> >
> > --
> > 2.43.0
> >



Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-26 Thread Andrii Nakryiko
On Tue, Jun 25, 2024 at 11:03 PM kernel test robot  wrote:
>
> Hi Andrii,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on next-20240624]
> [also build test WARNING on v6.10-rc5]
> [cannot apply to perf-tools-next/perf-tools-next tip/perf/core 
> perf-tools/perf-tools linus/master acme/perf/core v6.10-rc5 v6.10-rc4 
> v6.10-rc3]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:
> https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-update-outdated-comment/20240626-001728
> base:   next-20240624
> patch link:
> https://lore.kernel.org/r/20240625002144.3485799-5-andrii%40kernel.org
> patch subject: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime 
> management
> config: x86_64-defconfig 
> (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfm0xj-...@intel.com/config)
> compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfm0xj-...@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202406261300.ebbfm0xj-...@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> >> kernel/events/uprobes.c:638: warning: Function parameter or struct member 
> >> 'uprobe' not described in '__get_uprobe'
> >> kernel/events/uprobes.c:638: warning: expecting prototype for Caller has 
> >> to make sure that(). Prototype was for __get_uprobe() instead
>
>
> vim +638 kernel/events/uprobes.c
>
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  625
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  626  /**

I shouldn't have used /** here, I'll fix this.

> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  627   * Caller has to make sure 
> that:
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  628   *   a) either uprobe's 
> refcnt is positive before this call;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  629   *   b) or uprobes_treelock 
> is held (doesn't matter if for read or write),
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  630   *  preventing uprobe's 
> destructor from removing it from uprobes_tree.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  631   *
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  632   * In the latter case, 
> uprobe's destructor will "resurrect" uprobe instance if
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  633   * it detects that its 
> refcount went back to being positive again inbetween it
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  634   * dropping to zero at some 
> point and (potentially delayed) destructor
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  635   * callback actually running.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  636   */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  637  static struct uprobe 
> *__get_uprobe(struct uprobe *uprobe)
> f231722a2b27ee Oleg Nesterov   2015-07-21 @638  {
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  639      s64 v;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  640
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  641  v = 
> atomic64_add_return(UPROBE_REFCNT_GET, >ref);
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  642
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  643  /*
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  644   * If the highest bit 
> is set, we need to clear it. If cmpxchg() fails,
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  645   * we don't retry 
> because there is another CPU that just managed to
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  646   * update refcnt and 
> will attempt the same "fix up". Eventually one of
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  647   * them will succeed 
> to clear highset bit.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  648   */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  649  if (unlikely(v < 0))
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  650  
> (void)atomic64_cmpxchg(>ref, v, v & ~(1ULL << 63));
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  651
> f231722a2b27ee Oleg Nesterov   2015-07-21  652  return uprobe;
> f231722a2b27ee Oleg Nesterov   2015-07-21  653  }
> f231722a2b27ee Oleg Nesterov   2015-07-21  654
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki



Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-26 Thread Andrii Nakryiko
On Tue, Jun 25, 2024 at 12:09 PM Oleg Nesterov  wrote:
>
> On 06/25, Andrii Nakryiko wrote:
> >
> > On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov  wrote:
> > >
> > > Why?
> > >
> > > So far I don't understand this change. Quite possibly I missed something,
> > > but in this case the changelog should explain the problem more clearly.
> > >
> >
> > I just went off of "Called with mm->mmap_lock held for write." comment
> > in uprobe_write_opcode(), tbh.
>
> Ah, indeed... and git blame makes me sad ;)
>
> I _think_ that 29dedee0e693a updated this comment without any thinking,
> but today I can't recall. In any case, today this has nothing to do with
> mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect)
> when it returns -EAGAIN but this is another story.
>
> > If we don't actually need writer
> > mmap_lock, we should probably update at least that comment.
>
> Agreed.

Ok, I'll drop this change and will just update the comment.

>
> > There is a
> > lot going on in uprobe_write_opcode(), and I don't understand all the
> > requirements there.
>
> Heh. Neither me today ;)
>
> Oleg.
>



Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-25 Thread Andrii Nakryiko
On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov  wrote:
>
> On 06/25, Masami Hiramatsu wrote:
> >
> > On Mon, 24 Jun 2024 17:21:34 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Given unapply_uprobe() can call remove_breakpoint() which eventually
> > > calls uprobe_write_opcode(), which can modify a set of memory pages and
> > > expects mm->mmap_lock held for write, it needs to have writer lock.
> >
> > Oops, it is an actual bug, right?
>
> Why?
>
> So far I don't understand this change. Quite possibly I missed something,
> but in this case the changelog should explain the problem more clearly.
>

I just went off of "Called with mm->mmap_lock held for write." comment
in uprobe_write_opcode(), tbh. If we don't actually need writer
mmap_lock, we should probably update at least that comment. There is a
lot going on in uprobe_write_opcode(), and I don't understand all the
requirements there.


> Oleg.
>



Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-25 Thread Andrii Nakryiko
On Tue, Jun 25, 2024 at 7:45 AM Oleg Nesterov  wrote:
>
> Again, I'll try to read (at least this) patch later this week,
> just one cosmetic nit for now...

Thanks, and yep, please take your time. I understand that this is not
a trivial change to a code base that has been sitting untouched for
many years now. But I'd really appreciate it if you can give it a
thorough review anyway!

>
> On 06/24, Andrii Nakryiko wrote:
> >
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part 
> > and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
> > +#define UPROBE_REFCNT_PUT (0xffffffffLL)
>
> How about
>
> #define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)
> #define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)

It's cute, I'll change to that. But I'll probably also add a comment
with the final value in hex for someone like me (because I can reason
about 0xffffffff and its effect on refcount, not so much with
`(1LL << 32) - 1`).
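
Something like this, perhaps (just a sketch of such a comment; the hex
values follow directly from the definitions above):

/*
 * UPROBE_REFCNT_GET == (1ULL << 32) + 1ULL == 0x0000000100000001
 *	one atomic64_add() increments both the epoch (upper 32 bits)
 *	and the refcount (lower 32 bits) by one;
 * UPROBE_REFCNT_PUT == (1ULL << 32) - 1ULL == 0x00000000ffffffff
 *	adding it decrements the refcount by one and, via the carry
 *	out of the lower half, increments the epoch by one.
 */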

>
> ? this makes them symmetrical and makes it clear why
> atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment.
>
> Feel free to ignore if you don't like it.
>
> Oleg.
>



Re: [PATCH v2 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces

2024-06-24 Thread Andrii Nakryiko
On Mon, Jun 24, 2024 at 6:14 PM Masami Hiramatsu  wrote:
>
> On Tue, 21 May 2024 18:38:45 -0700
> Andrii Nakryiko  wrote:
>
> > Add a set of tests to validate that stack traces captured from or in the
> > presence of active uprobes and uretprobes are valid and complete.
> >
> > For this we use BPF programs that are installed either on entry or exit
> > of a user function, plus a deep-nested USDT. One of the target functions
> > (target_1) is recursive, to generate two different entries in the stack
> > trace for the same uprobe/uretprobe, testing potential edge conditions.
> >
> > Without fixes in this patch set, we get something like this for one of
> > the scenarios:
> >
> >  caller: 0x758fff - 0x7595ab
> >  target_1: 0x758fd5 - 0x758fff
> >  target_2: 0x758fca - 0x758fd5
> >  target_3: 0x758fbf - 0x758fca
> >  target_4: 0x758fb3 - 0x758fbf
> >  ENTRY #0: 0x758fb3 (in target_4)
> >  ENTRY #1: 0x758fd3 (in target_2)
> >  ENTRY #2: 0x758ffd (in target_1)
> >  ENTRY #3: 0x7fffe000
> >  ENTRY #4: 0x7fffe000
> >  ENTRY #5: 0x6f8f39
> >  ENTRY #6: 0x6fa6f0
> >  ENTRY #7: 0x7f403f229590
> >
> > Entry #3 and #4 (0x7fffe000) are uretprobe trampoline addresses
> > which obscure actual target_1 and another target_1 invocations. Also
> > note that between entry #0 and entry #1 we are missing an entry for
> > target_3, which is fixed in patch #2.
>
> Please avoid using `patch #2` because after commit, this means nothing.

Yep, makes sense, sorry about that, will keep descriptions a bit more
general going forward.

>
> Thank you,
>
> >
> > With all the fixes, we get desired full stack traces:
> >
> >  caller: 0x758fff - 0x7595ab
> >  target_1: 0x758fd5 - 0x758fff
> >  target_2: 0x758fca - 0x758fd5
> >  target_3: 0x758fbf - 0x758fca
> >  target_4: 0x758fb3 - 0x758fbf
> >  ENTRY #0: 0x758fb7 (in target_4)
> >  ENTRY #1: 0x758fc8 (in target_3)
> >  ENTRY #2: 0x758fd3 (in target_2)
> >  ENTRY #3: 0x758ffd (in target_1)
> >  ENTRY #4: 0x758ff3 (in target_1)
> >  ENTRY #5: 0x75922c (in caller)
> >  ENTRY #6: 0x6f8f39
> >  ENTRY #7: 0x6fa6f0
> >  ENTRY #8: 0x7f986adc4cd0
> >
> > Now there is a logical and complete sequence of function calls.
> >
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  .../bpf/prog_tests/uretprobe_stack.c  | 186 ++
> >  .../selftests/bpf/progs/uretprobe_stack.c |  96 +
> >  2 files changed, 282 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c
> >

[...]



Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-24 Thread Andrii Nakryiko
On Mon, Jun 24, 2024 at 5:39 PM Masami Hiramatsu  wrote:
>
> On Mon, 24 Jun 2024 13:32:35 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Jun 17, 2024 at 3:37 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko
> > >  wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu  
> > > > wrote:
> > > > >
> > > > > On Tue, 21 May 2024 18:38:43 -0700
> > > > > Andrii Nakryiko  wrote:
> > > > >
> > > > > > When kernel has pending uretprobes installed, it hijacks original 
> > > > > > user
> > > > > > function return address on the stack with a uretprobe trampoline
> > > > > > address. There could be multiple such pending uretprobes (either on
> > > > > > different user functions or on the same recursive one) at any given
> > > > > > time within the same task.
> > > > > >
> > > > > > This approach interferes with the user stack trace capture logic, 
> > > > > > which
> > > > > > would report surprising addresses (like 0x7fffe000) that 
> > > > > > correspond
> > > > > > to a special "[uprobes]" section that kernel installs in the target
> > > > > > process address space for uretprobe trampoline code, while 
> > > > > > logically it
> > > > > > should be an address somewhere within the calling function of 
> > > > > > another
> > > > > > traced user function.
> > > > > >
> > > > > > This is easy to correct for, though. Uprobes subsystem keeps track 
> > > > > > of
> > > > > > pending uretprobes and records original return addresses. This 
> > > > > > patch is
> > > > > > using this to do a post-processing step and restore each trampoline
> > > > > > address entries with correct original return address. This is done 
> > > > > > only
> > > > > > if there are pending uretprobes for current task.
> > > > > >
> > > > > > This is a similar approach to what fprobe/kretprobe infrastructure 
> > > > > > is
> > > > > > doing when capturing kernel stack traces in the presence of pending
> > > > > > return probes.
> > > > > >
> > > > >
> > > > > This looks good to me because this trampoline information is only
> > > > > managed in uprobes. And it should be provided when unwinding user
> > > > > stack.
> > > > >
> > > > > Reviewed-by: Masami Hiramatsu (Google) 
> > > > >
> > > > > Thank you!
> > > >
> > > > Great, thanks for reviewing, Masami!
> > > >
> > > > Would you take this fix through your tree, or where should it be routed 
> > > > to?
> > > >
> > >
> > > Ping! What would you like me to do with this patch set? Should I
> > > resend it without patch 3 (the one that tries to guess whether we are
> > > at the entry to the function?), or did I manage to convince you that
> > > this heuristic is OK, given perf's stack trace capturing logic already
> > > makes heavy assumption of rbp register being used for frame pointer?
> > >
> > > Please let me know your preference, I could drop patch 3 and send it
> > > separately, if that helps move the main fix forward. Thanks!
> >
> > Masami,
> >
> > Another week went by with absolutely no action or reaction from you.
> > Is there any way I can help improve the collaboration here?
>
> OK, if there is no change without [3/4], let me pick the others on

Thanks, Masami!

Selftest is probably failing (as it expects correct stack trace), but
that's ok, we can fix it up once linux-trace-kernel and bpf-next trees
converge.

> probes/for-next directly. [3/4] I need other x86 maintainer's
> comments. And it should be handled by PMU maintainers.

Sounds good, I'll repost it separately. Do I need to CC anyone else
besides people on this thread already?

>
> Thanks,
>
>
> >
> > I'm preparing more patches for uprobes and about to submit them. If
> > each reviewed and ready to be applied patch set has to sit idle for
> > multiple weeks for no good reason, we all will soon be lost just plain
> > forgetting the context in which the patch was prepared.
> >
> > Please, prioritize handling patches that are meant to be routed
> > through your tree in a more timely fashion. Or propose some
> > alternative acceptable arrangement.
> >
> > Thank you.
> >
> > >
> > > > >
> > > > > > Reported-by: Riham Selim 
> > > > > > Signed-off-by: Andrii Nakryiko 
> > > > > > ---
> > > > > >  kernel/events/callchain.c | 43 
> > > > > > ++-
> > > > > >  kernel/events/uprobes.c   |  9 
> > > > > >  2 files changed, 51 insertions(+), 1 deletion(-)
> > > > > >
> > > >
> > > > [...]
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes

2024-06-24 Thread Andrii Nakryiko
Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies the BPF cleanup code a bit thanks to the
all-or-nothing guarantee of uprobe_register_batch().

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/bpf_trace.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe 
*uprobes,
- u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, 
void *ctx)
 {
-   u32 i;
+   struct bpf_uprobe_multi_link *link = ctx;
 
-   for (i = 0; i < cnt; i++) {
-   uprobe_unregister(d_real_inode(path->dentry), 
[i].consumer);
-   }
+   return >uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link 
*link)
struct bpf_uprobe_multi_link *umulti_link;
 
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-   bpf_uprobe_unregister(_link->path, umulti_link->uprobes, 
umulti_link->cnt);
+   uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), 
umulti_link->cnt,
+   umulti_link_get_uprobe_consumer, umulti_link);
if (umulti_link->task)
put_task_struct(umulti_link->task);
path_put(_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
bpf_link_init(>link, BPF_LINK_TYPE_UPROBE_MULTI,
  _uprobe_multi_link_lops, prog);
 
-   for (i = 0; i < cnt; i++) {
-   err = uprobe_register(d_real_inode(link->path.dentry), 
[i].consumer);
-   if (err) {
-   bpf_uprobe_unregister(, uprobes, i);
-   goto error_free;
-   }
-   }
+   err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+   umulti_link_get_uprobe_consumer, link);
+   if (err)
+   goto error_free;
 
err = bpf_link_prime(>link, _primer);
if (err)
-- 
2.43.0




[PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore

2024-06-24 Thread Andrii Nakryiko
With all the batch uprobe APIs work we are now finally ready to reap the
benefits. Switch uprobes_treelock from reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)AFTER (mln/s)
  -   ---
  1   3.131 3.140
  2   3.394 3.601
  3   3.630 3.960
  4   3.317 3.551
  5   3.448 3.464
  6   3.345 3.283
  7   3.469 3.444
  8   3.182 3.258
  9   3.138 3.139
  10  2.999 3.212
  11  2.903 3.183
  12  2.802 3.027
  13  2.792 3.027
  14  2.695 3.086
  15  2.822 2.965
  16  2.679 2.939
  17  2.622 2.888
  18  2.628 2.914
  19  2.702 2.836
  20  2.561 2.837

One can see that the per-CPU RW semaphore-based implementation scales
better with the number of CPUs (while single-CPU throughput stays
basically the same).

Note that scalability is still limited by register_rwsem, and this will
hopefully be addressed in follow-up patch set(s).

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7e94671a672a..e24b81b0e149 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events() RB_EMPTY_ROOT(_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);  /* serialize rbtree access */
 
 #define UPROBES_HASH_SZ13
 /* serialize uprobe->pending_list */
@@ -667,7 +667,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
bool destroy;
 
if (!tree_locked)
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -691,7 +691,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool 
tree_locked)
if (destroy && uprobe_is_active(uprobe))
rb_erase(>rb_node, _tree);
if (!tree_locked)
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
 
/* uprobe got resurrected, pretend we never tried to free it */
if (!destroy)
@@ -789,9 +789,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
loff_t offset)
 {
struct uprobe *uprobe;
 
-   read_lock(_treelock);
+   percpu_down_read(_treelock);
uprobe = __find_uprobe(inode, offset);
-   read_unlock(_treelock);
+   percpu_up_read(_treelock);
 
return uprobe;
 }
@@ -1178,7 +1178,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
up_write(>register_rwsem);
}
 
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
uprobe = uc->uprobe;
@@ -1189,7 +1189,7 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
__put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1294,7 +1294,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
ret = 0;
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1317,7 +1317,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
}
 unlock_treelock:
-   write_unlock(_treelock);
+   percpu_up_write(_treelock);
if (ret)
goto cleanup_uprobes;
 
@@ -1349,14 +1349,14 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
-   write_lock(_treelock);
+   percpu_down_write(_treelock);
for (i = cnt - 1; i >= 0; i--) {
uc = get_uprobe_consumer(i, ctx)

[PATCH 09/12] uprobes: batch uprobes_treelock during registration

2024-06-24 Thread Andrii Nakryiko
Now that we have a good separation of each registration step, take
uprobes_treelock just once for the relevant registration step, and then
process all relevant uprobes in one go.

Even if the writer lock introduces a relatively large delay (as might
happen with a per-CPU RW semaphore), this will keep overall batch
attachment reasonably fast.

We teach put_uprobe(), through a __put_uprobe() helper, to optionally
take uprobes_treelock or not, to accommodate this pattern.

With these changes we don't need insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 45 +
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5e98e179d47d..416f408cbed9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -657,7 +657,7 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
return !RB_EMPTY_NODE(>rb_node);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
s64 v;
 
@@ -666,7 +666,8 @@ static void put_uprobe(struct uprobe *uprobe)
if (unlikely((u32)v == 0)) {
bool destroy;
 
-   write_lock(_treelock);
+   if (!tree_locked)
+   write_lock(_treelock);
/*
 * We might race with find_uprobe()->__get_uprobe() executed
 * from inside read-locked uprobes_treelock, which can bump
@@ -689,7 +690,8 @@ static void put_uprobe(struct uprobe *uprobe)
destroy = atomic64_read(>ref) == v;
if (destroy && uprobe_is_active(uprobe))
rb_erase(>rb_node, _tree);
-   write_unlock(_treelock);
+   if (!tree_locked)
+   write_unlock(_treelock);
 
/* uprobe got resurrected, pretend we never tried to free it */
if (!destroy)
@@ -718,6 +720,11 @@ static void put_uprobe(struct uprobe *uprobe)
(void)atomic64_cmpxchg(>ref, v, v & ~(1ULL << 63));
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+   __put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
   const struct uprobe *r)
@@ -817,21 +824,6 @@ static struct uprobe *__insert_uprobe(struct uprobe 
*uprobe)
return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-   struct uprobe *u;
-
-   write_lock(_treelock);
-   u = __insert_uprobe(uprobe);
-   write_unlock(_treelock);
-
-   return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1291,6 +1283,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uc->uprobe = uprobe;
}
 
+   ret = 0;
+   write_lock(_treelock);
for (i = 0; i < cnt; i++) {
struct uprobe *cur_uprobe;
 
@@ -1298,19 +1292,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
uprobe = uc->uprobe;
 
/* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
+   cur_uprobe = __insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
+
+   __put_uprobe(cur_uprobe, true);
ret = -EINVAL;
-   goto cleanup_uprobes;
+   goto unlock_treelock;
}
kfree(uprobe);
uc->uprobe = cur_uprobe;
}
}
+unlock_treelock:
+   write_unlock(_treelock);
+   if (ret)
+   goto cleanup_uprobes;
 
for (i = 0; i < cnt; i++) {
uc = get_uprobe_consumer(i, ctx);
@@ -1340,12 +1339,14 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 cleanup_uprobes:
/* put all the successfully allocated/reused uprobes */
+   write_lock(_treelock);
for (i = cnt - 1; i >= 0; i--) {
uc = get_uprobe_consumer(i, ctx);
 
-   put_uprobe(uc->uprobe);
+   __put_uprobe(uc->uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(_treelock);
return ret;
 }
 
-- 
2.43.0




[PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch

2024-06-24 Thread Andrii Nakryiko
Similarly to what we did for uprobe_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take uprobes_treelock once, in a batched fashion, to
__put_uprobe() for all uprobe_consumers. The uprobe_consumer->uprobe
field is really handy in helping with this.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 416f408cbed9..7e94671a672a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1162,8 +1162,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
 {
-   struct uprobe *uprobe;
struct uprobe_consumer *uc;
+   struct uprobe *uprobe;
int i;
 
for (i = 0; i < cnt; i++) {
@@ -1176,10 +1176,20 @@ void uprobe_unregister_batch(struct inode *inode, int 
cnt, uprobe_consumer_fn ge
down_write(>register_rwsem);
__uprobe_unregister(uprobe, uc);
up_write(>register_rwsem);
-   put_uprobe(uprobe);
+   }
 
+   write_lock(_treelock);
+   for (i = 0; i < cnt; i++) {
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
+   if (!uprobe)
+   continue;
+
+   __put_uprobe(uprobe, true);
uc->uprobe = NULL;
}
+   write_unlock(_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0




[PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps

2024-06-24 Thread Andrii Nakryiko
Now we are ready to split the coupled alloc-and-insert step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes the preparations that make uprobe_register_batch() ready
to batch and optimize locking for each phase.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ebd8511b6eb2..5e98e179d47d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1270,9 +1270,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
return -EINVAL;
}
 
+   /* pre-allocate new uprobe instances */
for (i = 0; i < cnt; i++) {
-   struct uprobe *cur_uprobe;
-
uc = get_uprobe_consumer(i, ctx);
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1289,6 +1288,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
RB_CLEAR_NODE(>rb_node);
atomic64_set(>ref, 1);
 
+   uc->uprobe = uprobe;
+   }
+
+   for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
+   uc = get_uprobe_consumer(i, ctx);
+   uprobe = uc->uprobe;
+
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
@@ -1296,15 +1304,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
-   kfree(uprobe);
ret = -EINVAL;
goto cleanup_uprobes;
}
kfree(uprobe);
-   uprobe = cur_uprobe;
+   uc->uprobe = cur_uprobe;
}
-
-   uc->uprobe = uprobe;
}
 
for (i = 0; i < cnt; i++) {
@@ -1318,10 +1323,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
__uprobe_unregister(uprobe, uc);
up_write(>register_rwsem);
 
-   if (ret) {
-   put_uprobe(uprobe);
+   if (ret)
goto cleanup_unreg;
-   }
}
 
return 0;
-- 
2.43.0




[PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-24 Thread Andrii Nakryiko
Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
the current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction with uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in the next patches, allows us to
batch locking for each phase and in this way amortize any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
It then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. The last step is the actual uprobe
registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so the end result is all-or-nothing, as
described above.
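
For illustration, here is a rough caller-side sketch of the proposed batch
API; my_batch_ctx, my_handler(), and my_attach() are made-up names for this
example and are not part of the patch:

#include <linux/uprobes.h>

struct my_batch_ctx {
	struct uprobe_consumer *consumers;	/* array of cnt consumers */
};

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	return 0;	/* don't request handler removal */
}

static struct uprobe_consumer *my_get_consumer(size_t idx, void *ctx)
{
	struct my_batch_ctx *bctx = ctx;

	return &bctx->consumers[idx];
}

static int my_attach(struct inode *inode, struct my_batch_ctx *bctx, int cnt)
{
	int i;

	for (i = 0; i < cnt; i++) {
		bctx->consumers[i].handler = my_handler;
		/* offset/ref_ctr_offset are set up by the caller, see patch #5 */
	}

	/* all-or-nothing: on error, none of the consumers remain attached */
	return uprobe_register_batch(inode, cnt, my_get_consumer, bctx);
}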

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h |  17 
 kernel/events/uprobes.c | 181 +---
 2 files changed, 147 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
loff_t ref_ctr_offset;
/* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
+   /* associated uprobe instance (or NULL if consumer isn't attached) */
+   struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs 
*regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn get_uprobe_consumer, void 
*ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct 
uprobe_consumer *uc)
 {
return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+   uprobe_consumer_fn get_uprobe_consumer, 
void *ctx)
+{
+   return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, 
bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+uprobe_consumer_fn 
get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2544e8b79bad..846efda614cb 100644

[PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()

2024-06-24 Thread Andrii Nakryiko
To allow unbundling the alloc-uprobe-and-insert step, which is currently
tightly coupled, inline alloc_uprobe()'s logic into the
uprobe_register_batch() loop. It's called from a single place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 65 ++---
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 846efda614cb..ebd8511b6eb2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -842,40 +842,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct 
uprobe *uprobe)
(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-  loff_t ref_ctr_offset)
-{
-   struct uprobe *uprobe, *cur_uprobe;
-
-   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-   if (!uprobe)
-   return ERR_PTR(-ENOMEM);
-
-   uprobe->inode = inode;
-   uprobe->offset = offset;
-   uprobe->ref_ctr_offset = ref_ctr_offset;
-   init_rwsem(>register_rwsem);
-   init_rwsem(>consumer_rwsem);
-   RB_CLEAR_NODE(>rb_node);
-   atomic64_set(>ref, 1);
-
-   /* add to uprobes_tree, sorted on inode:offset */
-   cur_uprobe = insert_uprobe(uprobe);
-   /* a uprobe exists for this inode:offset combination */
-   if (cur_uprobe != uprobe) {
-   if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-   put_uprobe(cur_uprobe);
-   kfree(uprobe);
-   return ERR_PTR(-EINVAL);
-   }
-   kfree(uprobe);
-   uprobe = cur_uprobe;
-   }
-
-   return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
down_write(>consumer_rwsem);
@@ -1305,14 +1271,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
}
 
for (i = 0; i < cnt; i++) {
+   struct uprobe *cur_uprobe;
+
uc = get_uprobe_consumer(i, ctx);
 
-   uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-   if (IS_ERR(uprobe)) {
-   ret = PTR_ERR(uprobe);
+   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+   if (!uprobe) {
+   ret = -ENOMEM;
goto cleanup_uprobes;
}
 
+   uprobe->inode = inode;
+   uprobe->offset = uc->offset;
+   uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+   init_rwsem(>register_rwsem);
+   init_rwsem(>consumer_rwsem);
+   RB_CLEAR_NODE(>rb_node);
+   atomic64_set(>ref, 1);
+
+   /* add to uprobes_tree, sorted on inode:offset */
+   cur_uprobe = insert_uprobe(uprobe);
+   /* a uprobe exists for this inode:offset combination */
+   if (cur_uprobe != uprobe) {
+   if (cur_uprobe->ref_ctr_offset != 
uprobe->ref_ctr_offset) {
+   ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+   put_uprobe(cur_uprobe);
+   kfree(uprobe);
+   ret = -EINVAL;
+   goto cleanup_uprobes;
+   }
+   kfree(uprobe);
+   uprobe = cur_uprobe;
+   }
+
uc->uprobe = uprobe;
}
 
-- 
2.43.0




[PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-06-24 Thread Andrii Nakryiko
Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies the uprobe_register() and
uprobe_unregister() APIs by having struct uprobe_consumer as the single
entity representing attachment/detachment. This makes batched versions of
uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.
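
To illustrate the resulting calling convention, here is a hypothetical
consumer (not one of the call sites converted below; the handler and offsets
are made-up examples):

#include <linux/uprobes.h>

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	return 0;
}

static struct uprobe_consumer my_consumer = {
	.handler	= my_handler,
	/* offset and ref_ctr_offset now live inside the consumer itself */
	.offset		= 0x1234,	/* file offset of the probed instruction */
	.ref_ctr_offset	= 0,		/* no SDT semaphore in this example */
};

static int my_attach(struct inode *inode)
{
	/* was: uprobe_register(inode, offset, &uc) or uprobe_register_refctr() */
	return uprobe_register(inode, &my_consumer);
}

static void my_detach(struct inode *inode)
{
	/* was: uprobe_unregister(inode, offset, &uc) */
	uprobe_unregister(inode, &my_consumer);
}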

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h   | 18 +++
 kernel/events/uprobes.c   | 19 ++-
 kernel/trace/bpf_trace.c  | 21 +++-
 kernel/trace/trace_uprobe.c   | 53 ---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 23 
 5 files changed, 55 insertions(+), 79 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
 
+   /* associated file offset of this probe */
+   loff_t offset;
+   /* associated refctr file offset of this probe, or zero */
+   loff_t ref_ctr_offset;
+   /* for internal uprobe infra use, consumers shouldn't touch fields 
below */
struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
*mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs) instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-   return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, 
loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8ce669bc6474..2544e8b79bad 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1197,14 +1197,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
struct uprobe *uprobe;
 
-   uprobe = find_uprobe(inode, offset);
+   uprobe = find_uprobe(inode, uc->offset);
if (WARN_ON(!uprobe))
return;
 
@@ -1277,20 +1276,12 @@ static int __uprobe_register(struct inode *inode, 
loff_t offset,
return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-   struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-   return __uprobe_register(inode, offset, 0, uc);
+   return __uprobe_register(inode, uc->offset, uc->

[PATCH 03/12] uprobes: simplify error handling for alloc_uprobe()

2024-06-24 Thread Andrii Nakryiko
Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e896eeecb091..aa59fa53ae67 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, 
loff_t offset,
 
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
if (!uprobe)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
uprobe->inode = inode;
uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t 
offset,
 
  retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-   if (!uprobe)
-   return -ENOMEM;
if (IS_ERR(uprobe))
return PTR_ERR(uprobe);
 
-- 
2.43.0




[PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-24 Thread Andrii Nakryiko
note the specific values
of the UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
a single atomic64_t actually holds two sort-of-independent 32-bit counters
that are incremented/decremented with a single atomic64_add_return()
operation. Note also the small and extremely rare (and thus
performance-neutral) need to clear the highest bit every 2 billion
get/put operations to prevent the high 32-bit counter from "bleeding over"
into the lower 32-bit counter.

All of the above contained trickery aside, we end up with nice semantics
for get and put operations, where get always succeeds and put handles
all the races properly and transparently to the caller.
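
To make the scheme above concrete, here is a small user-space model of the
epoch-tagged refcount; the helper names and the exact constant values are
assumptions for illustration only, the real kernel-side implementation is in
the diff below:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REFCNT_GET 0x0000000100000001ULL	/* epoch += 1, count += 1 */
#define REFCNT_PUT 0x00000000ffffffffULL	/* epoch += 1, count -= 1 (via carry) */

static inline uint64_t model_get(_Atomic uint64_t *ref)
{
	/* get always succeeds; callers must already guarantee count > 0 */
	return atomic_fetch_add(ref, REFCNT_GET) + REFCNT_GET;
}

static inline bool model_put(_Atomic uint64_t *ref)
{
	uint64_t v = atomic_fetch_add(ref, REFCNT_PUT) + REFCNT_PUT;

	/*
	 * Only the putter that observes the lower 32 bits hitting zero is
	 * a destruction candidate; it must still re-check the full 64-bit
	 * value (epoch included) under uprobes_treelock before freeing,
	 * which is what catches a racing 0 -> 1 -> 0 sequence.
	 */
	return (uint32_t)v == 0;
}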

And just to justify this somewhat unorthodox refcounting approach: under
a uprobe triggering micro-benchmark (using BPF selftests' bench tool) with
8 triggering threads, the atomic_inc_not_zero() approach was producing about
3.3 million uprobe triggerings per second across all threads, while the
final atomic64_add_return()-based approach managed 3.6 million triggerings
per second under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for the
atomic_inc_not_zero approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for the
atomic64_add_return approach implemented by this patch.

So the CPU spends relatively more time in get/put operations while
delivering lower total throughput when using atomic_inc_not_zero(). And
this will be even more prominent once we optimize away uprobe->register_rwsem
in the subsequent patch sets. So, while slightly less straightforward, the
current approach seems clearly winning and justified.

We also rename get_uprobe() to __get_uprobe() to indicate it's
a delicate internal helper that is only safe to call under valid
circumstances:
  - while holding uprobes_treelock (to synchronize with exclusive write
lock in put_uprobe(), as described above);
  - or if we have a guarantee that uprobe's refcount is already positive
through caller holding at least one refcount (in this case there is
no risk of refcount dropping to zero by any other CPU).

We also document why it's safe to do unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 231 +---
 1 file changed, 167 insertions(+), 64 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index aa59fa53ae67..8ce669bc6474 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,7 +53,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 struct uprobe {
struct rb_node  rb_node;/* node in the rb tree */
-   refcount_t  ref;
+   atomic64_t  ref;/* see UPROBE_REFCNT_GET below 
*/
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
struct list_headpending_list;
@@ -587,15 +587,114 @@ set_orig_insn(struct arch_uprobe *auprobe, struct 
mm_struct *mm, unsigned long v
*(uprobe_opcode_t *)>insn);
 }
 
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located in
+ * a single u64 value:
+ *   - lower 32 bits are just a normal refcount which is incremented and
+ *   decremented on get and put, respectively, just like a normal refcount
+ *   would be;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *   incremented by one, no matter whether get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
+ *   sequence, also proceeding to "destruction".
+ *
+ * In both cases refcount drops to zero, but in one case it will have epoch N,
+ * while the second drop to zero will have a different epoch N + 2, allowing
+ * first destructor to bail out because epoch changed between refcount going
+ * to zero and put_uprobe() taking uprobes_treelock (under which overall
+ * 64-bit refcount is double-checked, see put_uprobe() for details).
+ *
+ * The lower 32-bit counter is not meant to overflow, while it's expected
+ * that upper 32-bit counter will overflow occasionally. Note, though, that we
+ * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
+ * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
+ * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
+ * epoch effectively a 31-bit counter with highe

[PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-24 Thread Andrii Nakryiko
Given that unapply_uprobe() can call remove_breakpoint(), which eventually
calls uprobe_write_opcode(), which can modify a set of memory pages and
expects mm->mmap_lock to be held for write, it needs to take the writer lock.

Fix this by switching to mmap_write_lock()/mmap_write_unlock().

Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed 
task")
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 197fbe4663b5..e896eeecb091 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct 
mm_struct *mm)
struct vm_area_struct *vma;
int err = 0;
 
-   mmap_read_lock(mm);
+   mmap_write_lock(mm);
for_each_vma(vmi, vma) {
unsigned long vaddr;
loff_t offset;
@@ -1252,7 +1252,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct 
mm_struct *mm)
vaddr = offset_to_vaddr(vma, uprobe->offset);
err |= remove_breakpoint(uprobe, mm, vaddr);
}
-   mmap_read_unlock(mm);
+   mmap_write_unlock(mm);
 
return err;
 }
-- 
2.43.0




[PATCH 01/12] uprobes: update outdated comment

2024-06-24 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2816e65729ac..197fbe4663b5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
if (result < 0)
-- 
2.43.0




[PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

2024-06-24 Thread Andrii Nakryiko
This patch set, ultimately, switches the global uprobes_treelock from an RW
spinlock to a per-CPU RW semaphore, which has better performance and scales
better under contention, with multiple parallel threads triggering lots of
uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With the changes in patch #4, we open up the way to refactor the uprobe_register()
and uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows us to accommodate the much higher latency of taking the per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.
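
Roughly, the switch in patch #12 amounts to something like the following
sketch (declarations and call sites are simplified and illustrative; see the
actual patch for details):

#include <linux/percpu-rwsem.h>

/* was an RW spinlock guarded by read_lock()/write_lock() */
DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

static void read_side_sketch(void)
{
	percpu_down_read(&uprobes_treelock);
	/* ... rb-tree lookup of the inode:offset uprobe ... */
	percpu_up_read(&uprobes_treelock);
}

static void write_side_sketch(void)
{
	/* taken once per batched register/unregister, amortizing its cost */
	percpu_down_write(&uprobes_treelock);
	/* ... insert/remove a whole batch of uprobes ... */
	percpu_up_write(&uprobes_treelock);
}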

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: grab write mmap lock in unapply_uprobe()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h   |  29 +-
 kernel/events/uprobes.c   | 522 --
 kernel/trace/bpf_trace.c  |  40 +-
 kernel/trace/trace_uprobe.c   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  23 +-
 5 files changed, 419 insertions(+), 248 deletions(-)

-- 
2.43.0




Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-24 Thread Andrii Nakryiko
On Mon, Jun 17, 2024 at 3:37 PM Andrii Nakryiko
 wrote:
>
> On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko
>  wrote:
> >
> > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu  wrote:
> > >
> > > On Tue, 21 May 2024 18:38:43 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > When kernel has pending uretprobes installed, it hijacks original user
> > > > function return address on the stack with a uretprobe trampoline
> > > > address. There could be multiple such pending uretprobes (either on
> > > > different user functions or on the same recursive one) at any given
> > > > time within the same task.
> > > >
> > > > This approach interferes with the user stack trace capture logic, which
> > > > would report suprising addresses (like 0x7fffe000) that correspond
> > > > to a special "[uprobes]" section that kernel installs in the target
> > > > process address space for uretprobe trampoline code, while logically it
> > > > should be an address somewhere within the calling function of another
> > > > traced user function.
> > > >
> > > > This is easy to correct for, though. Uprobes subsystem keeps track of
> > > > pending uretprobes and records original return addresses. This patch is
> > > > using this to do a post-processing step and restore each trampoline
> > > > address entries with correct original return address. This is done only
> > > > if there are pending uretprobes for current task.
> > > >
> > > > This is a similar approach to what fprobe/kretprobe infrastructure is
> > > > doing when capturing kernel stack traces in the presence of pending
> > > > return probes.
> > > >
> > >
> > > This looks good to me because this trampoline information is only
> > > managed in uprobes. And it should be provided when unwinding user
> > > stack.
> > >
> > > Reviewed-by: Masami Hiramatsu (Google) 
> > >
> > > Thank you!
> >
> > Great, thanks for reviewing, Masami!
> >
> > Would you take this fix through your tree, or where should it be routed to?
> >
>
> Ping! What would you like me to do with this patch set? Should I
> resend it without patch 3 (the one that tries to guess whether we are
> at the entry to the function?), or did I manage to convince you that
> this heuristic is OK, given perf's stack trace capturing logic already
> makes heavy assumption of rbp register being used for frame pointer?
>
> Please let me know your preference, I could drop patch 3 and send it
> separately, if that helps move the main fix forward. Thanks!

Masami,

Another week went by with absolutely no action or reaction from you.
Is there any way I can help improve the collaboration here?

I'm preparing more patches for uprobes and am about to submit them. If
each reviewed and ready-to-be-applied patch set has to sit idle for
multiple weeks for no good reason, we will all soon be lost, plain
forgetting the context in which the patches were prepared.

Please, prioritize handling patches that are meant to be routed
through your tree in a more timely fashion. Or propose some
alternative acceptable arrangement.

Thank you.

>
> > >
> > > > Reported-by: Riham Selim 
> > > > Signed-off-by: Andrii Nakryiko 
> > > > ---
> > > >  kernel/events/callchain.c | 43 ++-
> > > >  kernel/events/uprobes.c   |  9 
> > > >  2 files changed, 51 insertions(+), 1 deletion(-)
> > > >
> >
> > [...]



Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-17 Thread Andrii Nakryiko
On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko
 wrote:
>
> On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu  wrote:
> >
> > On Tue, 21 May 2024 18:38:43 -0700
> > Andrii Nakryiko  wrote:
> >
> > > When kernel has pending uretprobes installed, it hijacks original user
> > > function return address on the stack with a uretprobe trampoline
> > > address. There could be multiple such pending uretprobes (either on
> > > different user functions or on the same recursive one) at any given
> > > time within the same task.
> > >
> > > This approach interferes with the user stack trace capture logic, which
> > > would report suprising addresses (like 0x7fffe000) that correspond
> > > to a special "[uprobes]" section that kernel installs in the target
> > > process address space for uretprobe trampoline code, while logically it
> > > should be an address somewhere within the calling function of another
> > > traced user function.
> > >
> > > This is easy to correct for, though. Uprobes subsystem keeps track of
> > > pending uretprobes and records original return addresses. This patch is
> > > using this to do a post-processing step and restore each trampoline
> > > address entries with correct original return address. This is done only
> > > if there are pending uretprobes for current task.
> > >
> > > This is a similar approach to what fprobe/kretprobe infrastructure is
> > > doing when capturing kernel stack traces in the presence of pending
> > > return probes.
> > >
> >
> > This looks good to me because this trampoline information is only
> > managed in uprobes. And it should be provided when unwinding user
> > stack.
> >
> > Reviewed-by: Masami Hiramatsu (Google) 
> >
> > Thank you!
>
> Great, thanks for reviewing, Masami!
>
> Would you take this fix through your tree, or where should it be routed to?
>

Ping! What would you like me to do with this patch set? Should I
resend it without patch 3 (the one that tries to guess whether we are
at the entry to the function?), or did I manage to convince you that
this heuristic is OK, given perf's stack trace capturing logic already
makes heavy assumption of rbp register being used for frame pointer?

Please let me know your preference, I could drop patch 3 and send it
separately, if that helps move the main fix forward. Thanks!

> >
> > > Reported-by: Riham Selim 
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  kernel/events/callchain.c | 43 ++-
> > >  kernel/events/uprobes.c   |  9 
> > >  2 files changed, 51 insertions(+), 1 deletion(-)
> > >
>
> [...]



Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-04 Thread Andrii Nakryiko
On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu  wrote:
>
> On Tue, 21 May 2024 18:38:43 -0700
> Andrii Nakryiko  wrote:
>
> > When kernel has pending uretprobes installed, it hijacks original user
> > function return address on the stack with a uretprobe trampoline
> > address. There could be multiple such pending uretprobes (either on
> > different user functions or on the same recursive one) at any given
> > time within the same task.
> >
> > This approach interferes with the user stack trace capture logic, which
> > would report suprising addresses (like 0x7fffe000) that correspond
> > to a special "[uprobes]" section that kernel installs in the target
> > process address space for uretprobe trampoline code, while logically it
> > should be an address somewhere within the calling function of another
> > traced user function.
> >
> > This is easy to correct for, though. Uprobes subsystem keeps track of
> > pending uretprobes and records original return addresses. This patch is
> > using this to do a post-processing step and restore each trampoline
> > address entries with correct original return address. This is done only
> > if there are pending uretprobes for current task.
> >
> > This is a similar approach to what fprobe/kretprobe infrastructure is
> > doing when capturing kernel stack traces in the presence of pending
> > return probes.
> >
>
> This looks good to me because this trampoline information is only
> managed in uprobes. And it should be provided when unwinding user
> stack.
>
> Reviewed-by: Masami Hiramatsu (Google) 
>
> Thank you!

Great, thanks for reviewing, Masami!

Would you take this fix through your tree, or where should it be routed to?

>
> > Reported-by: Riham Selim 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  kernel/events/callchain.c | 43 ++-
> >  kernel/events/uprobes.c   |  9 
> >  2 files changed, 51 insertions(+), 1 deletion(-)
> >

[...]



Re: [PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-06-04 Thread Andrii Nakryiko
On Tue, Jun 4, 2024 at 7:06 AM Masami Hiramatsu  wrote:
>
> On Tue, 21 May 2024 18:38:44 -0700
> Andrii Nakryiko  wrote:
>
> > When tracing user functions with uprobe functionality, it's common to
> > install the probe (e.g., a BPF program) at the first instruction of the
> > function. This is often going to be `push %rbp` instruction in function
> > preamble, which means that within that function frame pointer hasn't
> > been established yet. This leads to consistently missing an actual
> > caller of the traced function, because perf_callchain_user() only
> > records current IP (capturing traced function) and then following frame
> > pointer chain (which would be caller's frame, containing the address of
> > caller's caller).
>
> I thought this problem might be solved by sframe.

Eventually, yes, when real-world applications switch to sframe and we
get proper support for it in the kernel. But right now there are tons
of applications relying on kernel capturing stack traces based on
frame pointers, so it would be good to improve that as well.

>
> >
> > So when we have target_1 -> target_2 -> target_3 call chain and we are
> > tracing an entry to target_3, captured stack trace will report
> > target_1 -> target_3 call chain, which is wrong and confusing.
> >
> > This patch proposes a x86-64-specific heuristic to detect `push %rbp`
> > instruction being traced.
>
> I like this kind of idea :) But I think this should be done in
> the user-space, not in the kernel because it is not always sure
> that the user program uses stack frames.

Existing kernel code that captures user space stack trace already
assumes that code was compiled with a frame pointer (unconditionally),
because that's the best kernel can do. So under that assumption this
heuristic is valid and not harmful, IMO.

User space can do nothing about this, because it is the kernel that
captures stack trace (e.g., from BPF program), and we will lose the
calling frame if we don't do it here.

>
> > If that's the case, with the assumption that
> > applicatoin is compiled with frame pointers, this instruction would be
> > a strong indicator that this is the entry to the function. In that case,
> > return address is still pointed to by %rsp, so we fetch it and add to
> > stack trace before proceeding to unwind the rest using frame
> > pointer-based logic.
>
> Why don't we make it in the userspace BPF program? If it is done
> in the user space, like perf-probe, I'm OK. But I doubt to do this in
> kernel. That means it is not flexible.
>

You mean for the BPF program to capture the entire stack trace by
itself, without asking the kernel for help? It's doable, but:

  a) it's inconvenient for all users to have to reimplement this
low-level "primitive" operation, that we already promise is provided
by kernel (through bpf_get_stack() API, and kernel has internal
perf_callchain API for this)
  b) it's faster for kernel to do this, as kernel code disables page
faults once and unwinds the stack, while BPF program would have to do
multiple bpf_probe_read_user() calls, each individually disabling page
faults.

But really, there is an already existing API, which in some cases
returns partially invalid stack traces (skipping function caller's
frame), so this is an attempt to fix this issue.
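
For reference, this is roughly all a user has to write today to get a user
stack trace from a uprobe BPF program (a minimal sketch; the section name,
buffer size, and attach details are illustrative):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

__u64 stack[64];
long stack_len;

SEC("uprobe")
int handle_entry(struct pt_regs *ctx)
{
	/* one helper call: the kernel disables page faults once and unwinds
	 * the whole user stack using frame pointers
	 */
	stack_len = bpf_get_stack(ctx, stack, sizeof(stack), BPF_F_USER_STACK);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";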


> More than anything, without user-space helper to find function
> symbols, uprobe does not know the function entry. Then I'm curious
> why don't you do this in the user space.

You are mixing stack *capture* (in which we get memory addresses
representing where a function call or currently running instruction
pointer is) with stack *symbolization* (where user space needs ELF
symbols and/or DWARF information to translate those addresses into
something human-readable).

This heuristic improves the former as performed by the kernel. Stack
symbolization is completely orthogonal to all of this.

>
> At least, this should be done in the user of uprobes, like trace_uprobe
> or bpf.
>

This is a really miserable user experience, if they have to implement
their own stack trace capture for uprobes, but use built-in
bpf_get_stack() API for any other type of program.

>
> Thank you,
>
> >
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  arch/x86/events/core.c  | 20 
> >  include/linux/uprobes.h |  2 ++
> >  kernel/events/uprobes.c |  2 ++
> >  3 files changed, 24 insertions(+)
> >

[...]



Re: [PATCH v2 0/4] Fix user stack traces captured from uprobes

2024-06-03 Thread Andrii Nakryiko
On Tue, May 21, 2024 at 6:38 PM Andrii Nakryiko  wrote:
>
> This patch set reports two issues with captured stack traces.
>
> First issue, fixed in patch #2, deals with fixing up uretprobe trampoline
> addresses in captured stack trace. This issue happens when there are pending
> return probes, for which kernel hijacks some of the return addresses on user
> stacks. The code is matching those special uretprobe trampoline addresses with
> the list of pending return probe instances and replaces them with actual
> return addresses. This is the same fixup logic that fprobe/kretprobe has for
> kernel stack traces.
>
> Second issue, which patch #3 is fixing with the help of heuristic, is having
> to do with capturing user stack traces in entry uprobes. At the very entrance
> to user function, frame pointer in rbp register is not yet setup, so actual
> caller return address is still pointed to by rsp. Patch is using a simple
> heuristic, looking for `push %rbp` instruction, to fetch this extra direct
> caller return address, before proceeding to unwind the stack using rbp.
>
> Patch #4 adds tests into BPF selftests, that validate that captured stack
> traces at various points is what we expect to get. This patch, while being BPF
> selftests, is isolated from any other BPF selftests changes and can go in
> through non-BPF tree without the risk of merge conflicts.
>
> Patches are based on latest linux-trace/probes/for-next.
>
> v1->v2:
>   - fixed GCC aggressively inlining test_uretprobe_stack() function (BPF CI);
>   - fixed comments (Peter).
>
> Andrii Nakryiko (4):
>   uprobes: rename get_trampoline_vaddr() and make it global
>   perf,uprobes: fix user stack traces in the presence of pending
> uretprobes
>   perf,x86: avoid missing caller address in stack traces captured in
> uprobe
>   selftests/bpf: add test validating uprobe/uretprobe stack traces
>
>  arch/x86/events/core.c|  20 ++
>  include/linux/uprobes.h   |   3 +
>  kernel/events/callchain.c |  43 +++-
>  kernel/events/uprobes.c   |  17 +-
>  .../bpf/prog_tests/uretprobe_stack.c  | 186 ++
>  .../selftests/bpf/progs/uretprobe_stack.c |  96 +
>  6 files changed, 361 insertions(+), 4 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
>  create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c
>
> --
> 2.43.0
>

Friendly ping. This is a real issue in practice that our production
users are eager to see fixed; please help get it upstream. Thanks!



[PATCH v2 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces

2024-05-21 Thread Andrii Nakryiko
Add a set of tests to validate that stack traces captured from or in the
presence of active uprobes and uretprobes are valid and complete.

For this we use BPF programs that are installed either on entry or exit
of user functions, plus a deeply nested USDT. One of the target functions
(target_1) is recursive, to generate two different entries in the stack
trace for the same uprobe/uretprobe, testing potential edge conditions.

Without fixes in this patch set, we get something like this for one of
the scenarios:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb3 (in target_4)
 ENTRY #1: 0x758fd3 (in target_2)
 ENTRY #2: 0x758ffd (in target_1)
 ENTRY #3: 0x7fffe000
 ENTRY #4: 0x7fffe000
 ENTRY #5: 0x6f8f39
 ENTRY #6: 0x6fa6f0
 ENTRY #7: 0x7f403f229590

Entries #3 and #4 (0x7fffe000) are uretprobe trampoline addresses
which obscure the two actual target_1 invocations. Also
note that between entry #0 and entry #1 we are missing an entry for
target_3, which is fixed in patch #2.

With all the fixes, we get desired full stack traces:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb7 (in target_4)
 ENTRY #1: 0x758fc8 (in target_3)
 ENTRY #2: 0x758fd3 (in target_2)
 ENTRY #3: 0x758ffd (in target_1)
 ENTRY #4: 0x758ff3 (in target_1)
 ENTRY #5: 0x75922c (in caller)
 ENTRY #6: 0x6f8f39
 ENTRY #7: 0x6fa6f0
 ENTRY #8: 0x7f986adc4cd0

Now there is a logical and complete sequence of function calls.

Signed-off-by: Andrii Nakryiko 
---
 .../bpf/prog_tests/uretprobe_stack.c  | 186 ++
 .../selftests/bpf/progs/uretprobe_stack.c |  96 +
 2 files changed, 282 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c 
b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
new file mode 100644
index ..6deb8d560ddd
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include 
+#include "uretprobe_stack.skel.h"
+#include "../sdt.h"
+
+/* We set up target_1() -> target_2() -> target_3() -> target_4() -> USDT()
+ * call chain, each being traced by our BPF program. On entry or return from
+ * each target_*() we are capturing user stack trace and recording it in
+ * global variable, so that user space part of the test can validate it.
+ *
+ * Note, we put each target function into a custom section to get those
+ * __start_XXX/__stop_XXX symbols, generated by linker for us, which allow us
+ * to know address range of those functions
+ */
+__attribute__((section("uprobe__target_4")))
+__weak int target_4(void)
+{
+   STAP_PROBE1(uretprobe_stack, target, 42);
+   return 42;
+}
+
+extern const void *__start_uprobe__target_4;
+extern const void *__stop_uprobe__target_4;
+
+__attribute__((section("uprobe__target_3")))
+__weak int target_3(void)
+{
+   return target_4();
+}
+
+extern const void *__start_uprobe__target_3;
+extern const void *__stop_uprobe__target_3;
+
+__attribute__((section("uprobe__target_2")))
+__weak int target_2(void)
+{
+   return target_3();
+}
+
+extern const void *__start_uprobe__target_2;
+extern const void *__stop_uprobe__target_2;
+
+__attribute__((section("uprobe__target_1")))
+__weak int target_1(int depth)
+{
+   if (depth < 1)
+   return 1 + target_1(depth + 1);
+   else
+   return target_2();
+}
+
+extern const void *__start_uprobe__target_1;
+extern const void *__stop_uprobe__target_1;
+
+extern const void *__start_uretprobe_stack_sec;
+extern const void *__stop_uretprobe_stack_sec;
+
+struct range {
+   long start;
+   long stop;
+};
+
+static struct range targets[] = {
+   {}, /* we want target_1 to map to target[1], so need 1-based indexing */
+   { (long)&__start_uprobe__target_1, (long)&__stop_uprobe__target_1 },
+   { (long)&__start_uprobe__target_2, (long)&__stop_uprobe__target_2 },
+   { (long)&__start_uprobe__target_3, (long)&__stop_uprobe__target_3 },
+   { (long)&__start_uprobe__target_4, (long)&__stop_uprobe__target_4 },
+};
+
+static struct range caller = {
+   (long)&__start_uretprobe_stack_sec,
+   (long)&__stop_uretprobe_stack_sec,
+};
+
+static void validate_stack(__u64 *ips, int stack_len, int cnt, ...)
+{
+   int i, j;
+   va_list args;
+
+   if (!ASSERT_GT(stack_len, 0, "stack_len"))
+   return;
+
+   stack_len /= 8;
+

[PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-05-21 Thread Andrii Nakryiko
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function preamble, which means that within that function the frame pointer
hasn't been established yet. This leads to consistently missing the actual
caller of the traced function, because perf_callchain_user() only
records the current IP (capturing the traced function) and then follows the
frame pointer chain (which would be the caller's frame, containing the
address of the caller's caller).

So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic to detect the `push %rbp`
instruction being traced. If that's the case, under the assumption that the
application is compiled with frame pointers, this instruction is a strong
indicator that this is the entry to the function. In that case, the return
address is still pointed to by %rsp, so we fetch it and add it to the stack
trace before proceeding to unwind the rest using the frame pointer-based
logic.

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/events/core.c  | 20 
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *regs
return;
 
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+   /*
+* If we are called from uprobe handler, and we are indeed at the very
+* entry to user function (which is normally a `push %rbp` instruction,
+* under assumption of application being compiled with frame pointers),
+* we should read return address from *regs->sp before proceeding
+* to follow frame pointers, otherwise we'll skip immediate caller
+* as %rbp is not yet setup.
+*/
+   if (current->utask) {
+   struct arch_uprobe *auprobe = current->utask->auprobe;
+   u64 ret_addr;
+
+   if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+   !__get_user(ret_addr, (const u64 __user *)regs->sp))
+   perf_callchain_store(entry, ret_addr);
+   }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0c57eec85339..7b785cd30d86 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
 
+   struct arch_uprobe  *auprobe;
+
struct return_instance  *return_instances;
unsigned intdepth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1c99380dc89d..504693845187 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
 
down_read(>register_rwsem);
+   current->utask->auprobe = >arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
 
@@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
 
remove &= rc;
}
+   current->utask->auprobe = NULL;
 
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0




[PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-21 Thread Andrii Nakryiko
When kernel has pending uretprobes installed, it hijacks original user
function return address on the stack with a uretprobe trampoline
address. There could be multiple such pending uretprobes (either on
different user functions or on the same recursive one) at any given
time within the same task.

This approach interferes with the user stack trace capture logic, which
would report surprising addresses (like 0x7fffe000) that correspond
to a special "[uprobes]" section that the kernel installs in the target
process address space for uretprobe trampoline code, while logically it
should be an address somewhere within the calling function of another
traced user function.

This is easy to correct for, though. The uprobes subsystem keeps track of
pending uretprobes and records the original return addresses. This patch
uses that to do a post-processing step and restore each trampoline
address entry to the correct original return address. This is done only
if there are pending uretprobes for the current task.

This is a similar approach to what fprobe/kretprobe infrastructure is
doing when capturing kernel stack traces in the presence of pending
return probes.

Reported-by: Riham Selim 
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/callchain.c | 43 ++-
 kernel/events/uprobes.c   |  9 
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..b17e3323f7f6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -176,13 +177,51 @@ put_callchain_entry(int rctx)
put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 }
 
+static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry 
*entry,
+  int start_entry_idx)
+{
+#ifdef CONFIG_UPROBES
+   struct uprobe_task *utask = current->utask;
+   struct return_instance *ri;
+   __u64 *cur_ip, *last_ip, tramp_addr;
+
+   if (likely(!utask || !utask->return_instances))
+   return;
+
+   cur_ip = >ip[start_entry_idx];
+   last_ip = >ip[entry->nr - 1];
+   ri = utask->return_instances;
+   tramp_addr = uprobe_get_trampoline_vaddr();
+
+   /*
+* If there are pending uretprobes for the current thread, they are
+* recorded in a list inside utask->return_instances; each such
+* pending uretprobe replaces traced user function's return address on
+* the stack, so when stack trace is captured, instead of seeing
+* actual function's return address, we'll have one or many uretprobe
+* trampoline addresses in the stack trace, which are not helpful and
+* misleading to users.
+* So here we go over the pending list of uretprobes, and each
+* encountered trampoline address is replaced with actual return
+* address.
+*/
+   while (ri && cur_ip <= last_ip) {
+   if (*cur_ip == tramp_addr) {
+   *cur_ip = ri->orig_ret_vaddr;
+   ri = ri->next;
+   }
+   cur_ip++;
+   }
+#endif
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
   u32 max_stack, bool crosstask, bool add_mark)
 {
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
-   int rctx;
+   int rctx, start_entry_idx;
 
entry = get_callchain_entry();
if (!entry)
@@ -215,7 +254,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool 
kernel, bool user,
if (add_mark)
perf_callchain_store_context(, 
PERF_CONTEXT_USER);
 
+   start_entry_idx = entry->nr;
perf_callchain_user(, regs);
+   fixup_uretprobe_trampoline_entries(entry, 
start_entry_idx);
}
}
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d60d24f0f2f4..1c99380dc89d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2149,6 +2149,15 @@ static void handle_trampoline(struct pt_regs *regs)
 
instruction_pointer_set(regs, ri->orig_ret_vaddr);
do {
+   /* pop current instance from the stack of pending 
return instances,
+* as it's not pending anymore: we just fixed up 
original
+* instruction pointer in regs and are about to call 
handlers;
+* this allows fixup_uretprobe_trampoline_entries() to 
properly fix up
+* captured stack traces from uretprobe handlers, in 
which pending
+* trampoline addresses on the stac

[PATCH v2 1/4] uprobes: rename get_trampoline_vaddr() and make it global

2024-05-21 Thread Andrii Nakryiko
This helper is needed in another file, so make it a bit more uniquely
named and expose it internally.

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..0c57eec85339 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance 
*ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 void *src, unsigned long len);
+extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8ae0eefc3a34..d60d24f0f2f4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, unsigned 
long flags)
  *
  * Returns -1 in case the xol_area is not allocated.
  */
-static unsigned long get_trampoline_vaddr(void)
+unsigned long uprobe_get_trampoline_vaddr(void)
 {
struct xol_area *area;
unsigned long trampoline_vaddr = -1;
@@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
struct pt_regs *regs)
if (!ri)
return;
 
-   trampoline_vaddr = get_trampoline_vaddr();
+   trampoline_vaddr = uprobe_get_trampoline_vaddr();
orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, 
regs);
if (orig_ret_vaddr == -1)
goto fail;
@@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs)
int is_swbp;
 
bp_vaddr = uprobe_get_swbp_addr(regs);
-   if (bp_vaddr == get_trampoline_vaddr())
+   if (bp_vaddr == uprobe_get_trampoline_vaddr())
return handle_trampoline(regs);
 
uprobe = find_active_uprobe(bp_vaddr, _swbp);
-- 
2.43.0




[PATCH v2 0/4] Fix user stack traces captured from uprobes

2024-05-21 Thread Andrii Nakryiko
This patch set reports two issues with captured stack traces.

The first issue, fixed in patch #2, deals with fixing up uretprobe trampoline
addresses in captured stack traces. This issue happens when there are pending
return probes, for which the kernel hijacks some of the return addresses on user
stacks. The code matches those special uretprobe trampoline addresses against
the list of pending return probe instances and replaces them with the actual
return addresses. This is the same fixup logic that fprobe/kretprobe has for
kernel stack traces.

The second issue, which patch #3 fixes with the help of a heuristic, has to
do with capturing user stack traces in entry uprobes. At the very entrance
to a user function, the frame pointer in the rbp register is not yet set up,
so the actual caller return address is still pointed to by rsp. The patch
uses a simple heuristic, looking for the `push %rbp` instruction, to fetch
this extra direct caller return address before proceeding to unwind the
stack using rbp.

Patch #4 adds tests into BPF selftests, that validate that captured stack
traces at various points is what we expect to get. This patch, while being BPF
selftests, is isolated from any other BPF selftests changes and can go in
through non-BPF tree without the risk of merge conflicts.

Patches are based on latest linux-trace/probes/for-next.

v1->v2:
  - fixed GCC aggressively inlining test_uretprobe_stack() function (BPF CI);
  - fixed comments (Peter).

Andrii Nakryiko (4):
  uprobes: rename get_trampoline_vaddr() and make it global
  perf,uprobes: fix user stack traces in the presence of pending
uretprobes
  perf,x86: avoid missing caller address in stack traces captured in
uprobe
  selftests/bpf: add test validating uprobe/uretprobe stack traces

 arch/x86/events/core.c|  20 ++
 include/linux/uprobes.h   |   3 +
 kernel/events/callchain.c |  43 +++-
 kernel/events/uprobes.c   |  17 +-
 .../bpf/prog_tests/uretprobe_stack.c  | 186 ++
 .../selftests/bpf/progs/uretprobe_stack.c |  96 +
 6 files changed, 361 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c

-- 
2.43.0




[PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-20 Thread Andrii Nakryiko
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it
deeper into __uprobe_trace_func(). This is problematic because
__uprobe_trace_func() is called inside an rcu_read_lock()/rcu_read_unlock()
block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() ->
mutex_lock(&ucb->mutex), leading to a splat about using a mutex under
non-sleepable RCU:

  BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:585
   in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: 
stress-ng-sigq
   preempt_count: 0, expected: 0
   RCU nest depth: 1, expected: 0
   ...
   Call Trace:

dump_stack_lvl+0x3d/0xe0
__might_resched+0x24c/0x270
? prepare_uprobe_buffer+0xd5/0x1d0
__mutex_lock+0x41/0x820
? ___perf_sw_event+0x206/0x290
? __perf_event_task_sched_in+0x54/0x660
? __perf_event_task_sched_in+0x54/0x660
prepare_uprobe_buffer+0xd5/0x1d0
__uprobe_trace_func+0x4a/0x140
uprobe_dispatcher+0x135/0x280
? uprobe_dispatcher+0x94/0x280
uprobe_notify_resume+0x650/0xec0
? atomic_notifier_call_chain+0x21/0x110
? atomic_notifier_call_chain+0xf8/0x110
irqentry_exit_to_user_mode+0xe2/0x1e0
asm_exc_int3+0x35/0x40
   RIP: 0033:0x7f7e1d4da390
   Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff 
ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00  0f 1e fa b8 27 00 
00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e
   RSP: 002b:7ffd2abc3608 EFLAGS: 0246
   RAX:  RBX: 76d325f1 RCX: 
   RDX: 76d325f1 RSI: 000a RDI: 7ffd2abc3690
   RBP: 000a R08: 00017fb7 R09: 00017fb7
   R10: 00017fb7 R11: 0246 R12: 00017ff2
   R13: 7ffd2abc3610 R14:  R15: 7ffd2abc3780


Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called
slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside
of the RCU-locked section. This still keeps buffer preparation lazy and helps
avoid the overhead when it's not needed. E.g., if only a BPF uprobe
handler is installed on a given uprobe, the buffer won't be initialized.

Note that the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not
affected, as it doesn't prepare the buffer under the RCU read lock.

Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
Reported-by: Breno Leitao 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 8541fa1494ae..c98e3b3386ba 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -970,19 +970,17 @@ static struct uprobe_cpu_buffer 
*prepare_uprobe_buffer(struct trace_uprobe *tu,
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer **ucbp,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
-   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
	struct trace_event_call *call = trace_probe_event_call(&tu->tp);
 
WARN_ON(call != trace_file->event_call);
 
-   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
@@ -1014,13 +1012,16 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
+   struct uprobe_cpu_buffer *ucb;
 
if (is_ret_probe(tu))
return 0;
 
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
+
rcu_read_lock();
	trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucbp, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1031,10 +1032,13 @@ static void uretprobe_trace_func(struct trace_uprobe 
*tu, unsigned long func,
 struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
+   struct uprobe_cpu_buffer *ucb;
+
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
 
rcu_read_lock();
	trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucbp, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
-- 
2.43.0




Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-20 Thread Andrii Nakryiko
On Mon, May 20, 2024 at 8:20 AM Jiri Olsa  wrote:
>
> On Wed, May 15, 2024 at 08:32:30AM -0600, Andrii Nakryiko wrote:
> > On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra  wrote:
> > >
> > > On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote:
> > >
> > > > +static void fixup_uretprobe_trampoline_entries(struct 
> > > > perf_callchain_entry *entry,
> > > > +int start_entry_idx)
> > > > +{
> > > > +#ifdef CONFIG_UPROBES
> > > > + struct uprobe_task *utask = current->utask;
> > > > + struct return_instance *ri;
> > > > + __u64 *cur_ip, *last_ip, tramp_addr;
> > > > +
> > > > + if (likely(!utask || !utask->return_instances))
> > > > + return;
> > > > +
> > > > + cur_ip = &entry->ip[start_entry_idx];
> > > > + last_ip = &entry->ip[entry->nr - 1];
> > > > + ri = utask->return_instances;
> > > > + tramp_addr = uprobe_get_trampoline_vaddr();
> > > > +
> > > > + /* If there are pending uretprobes for current thread, they are
> > >
> > > Comment style fail. Also 'for *the* current thread'.
> > >
> >
> > ack, will fix
> >
> > > > +  * recorded in a list inside utask->return_instances; each such
> > > > +  * pending uretprobe replaces traced user function's return 
> > > > address on
> > > > +  * the stack, so when stack trace is captured, instead of seeing
> > > > +  * actual function's return address, we'll have one or many 
> > > > uretprobe
> > > > +  * trampoline addresses in the stack trace, which are not helpful 
> > > > and
> > > > +  * misleading to users.
> > >
> > > I would beg to differ, what if the uprobe is causing the performance
> > > issue?
> >
> > If uprobe/uretprobe code itself is causing performance issues, you'll
> > see that in other stack traces, where this code will be actively
> > running on CPU. I don't think we make anything worse here.
>
> I think we do similar thing in kernel unwind for rethook trampoline used
> in fprobe/kretprobe code, so seems ok to me to do it for uprobes as well
>
> >
> > Here we are talking about the case where the uprobe part is done and
> > it hijacked the return address on the stack, uretprobe is not yet
> > running (and so not causing any performance issues). The presence of
> > this "snooping" (pending) uretprobe is irrelevant to the user that is
> > capturing stack trace. Right now address in [uprobes] VMA section
> > installed by uretprobe infra code is directly replacing correct and
> > actual calling function address.
> >
> > Worst case, one can argue that both [uprobes] and original caller
> > address should be in the stack trace, but I think it still will be
> > confusing to users. And also will make implementation less efficient
> > because now we'll need to insert entries into the array and shift
> > everything around.
>
> agreed this would be confusing.. also as you noted above the return
> trampoline did not get executed yet at the time of the callstack,
> so it's bit misleading
>
> might be stupid idea.. but we do have the 'special' context entries
> that we store in the callstack to mark user/kernel/guest context ..

only when explicitly requested (add_mark argument to
get_perf_callchain), right? BPF doesn't ever set this to true and
generally speaking users don't care and shouldn't care about pending
uretprobe. I think we are conflating unrelated things here, uretprobe
is not running, so it's not really in the stack trace. I'd just do
nothing about it, it should stay transparent.

If uretprobe *handler* is causing issues, you'll see that in all the
other stack traces (according to relative CPU/resource usage of that
handler).

> maybe we could add some special entry (context does not fit too well)
> to point out there's uretprobe going on .. perf tool could print
> 'uretprobe' hint when displaying the original address
>
> jirka
>
> >
> > So as I mentioned above, if the concern is seeing uprobe/uretprobe
> > code using CPU, that doesn't change, we'll see that in the overall set
> > of captured stack traces (be it custom uprobe handler code or BPF
> > program).
> >
> > >
> > > While I do think it makes sense to fix the unwind in the sense that we
> > > should be able to continue the unwind, I don't think it makes sense to
> > > c

Re: [PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-15 Thread Andrii Nakryiko
On Wed, May 15, 2024 at 3:30 AM Peter Zijlstra  wrote:
>
> On Wed, May 08, 2024 at 02:26:03PM -0700, Andrii Nakryiko wrote:
>
> > +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry 
> > *entry,
> > +int start_entry_idx)
> > +{
> > +#ifdef CONFIG_UPROBES
> > + struct uprobe_task *utask = current->utask;
> > + struct return_instance *ri;
> > + __u64 *cur_ip, *last_ip, tramp_addr;
> > +
> > + if (likely(!utask || !utask->return_instances))
> > + return;
> > +
> > + cur_ip = &entry->ip[start_entry_idx];
> > + last_ip = &entry->ip[entry->nr - 1];
> > + ri = utask->return_instances;
> > + tramp_addr = uprobe_get_trampoline_vaddr();
> > +
> > + /* If there are pending uretprobes for current thread, they are
>
> Comment style fail. Also 'for *the* current thread'.
>

ack, will fix

> > +  * recorded in a list inside utask->return_instances; each such
> > +  * pending uretprobe replaces traced user function's return address on
> > +  * the stack, so when stack trace is captured, instead of seeing
> > +  * actual function's return address, we'll have one or many uretprobe
> > +  * trampoline addresses in the stack trace, which are not helpful and
> > +  * misleading to users.
>
> I would beg to differ, what if the uprobe is causing the performance
> issue?

If uprobe/uretprobe code itself is causing performance issues, you'll
see that in other stack traces, where this code will be actively
running on CPU. I don't think we make anything worse here.

Here we are talking about the case where the uprobe part is done and
it hijacked the return address on the stack, uretprobe is not yet
running (and so not causing any performance issues). The presence of
this "snooping" (pending) uretprobe is irrelevant to the user that is
capturing stack trace. Right now address in [uprobes] VMA section
installed by uretprobe infra code is directly replacing correct and
actual calling function address.

Worst case, one can argue that both [uprobes] and original caller
address should be in the stack trace, but I think it still will be
confusing to users. And also will make implementation less efficient
because now we'll need to insert entries into the array and shift
everything around.

So as I mentioned above, if the concern is seeing uprobe/uretprobe
code using CPU, that doesn't change, we'll see that in the overall set
of captured stack traces (be it custom uprobe handler code or BPF
program).

>
> While I do think it makes sense to fix the unwind in the sense that we
> should be able to continue the unwind, I don't think it makes sense to
> completely hide the presence of uprobes.

Unwind isn't broken in this sense, we do unwind the entire stack trace
(see examples in the later patch). We just don't capture actual
callers if they have uretprobe pending.

>
> > +  * So here we go over the pending list of uretprobes, and each
> > +  * encountered trampoline address is replaced with actual return
> > +  * address.
> > +  */
> > + while (ri && cur_ip <= last_ip) {
> > + if (*cur_ip == tramp_addr) {
> > + *cur_ip = ri->orig_ret_vaddr;
> > + ri = ri->next;
> > + }
> > + cur_ip++;
> > + }
> > +#endif
> > +}



[PATCH 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces

2024-05-08 Thread Andrii Nakryiko
Add a set of tests to validate that stack traces captured from or in the
presence of active uprobes and uretprobes are valid and complete.

For this we use BPF programs that are installed either on entry or exit
of a user function, plus a deeply nested USDT. One of the target functions
(target_1) is recursive, to generate two different entries in the stack
trace for the same uprobe/uretprobe, testing potential edge conditions.

Without fixes in this patch set, we get something like this for one of
the scenarios:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb3 (in target_4)
 ENTRY #1: 0x758fd3 (in target_2)
 ENTRY #2: 0x758ffd (in target_1)
 ENTRY #3: 0x7fffe000
 ENTRY #4: 0x7fffe000
 ENTRY #5: 0x6f8f39
 ENTRY #6: 0x6fa6f0
 ENTRY #7: 0x7f403f229590

Entries #3 and #4 (0x7fffe000) are uretprobe trampoline addresses,
which obscure the two actual (recursive) target_1 invocations. Also
note that between entry #0 and entry #1 we are missing an entry for
target_3, which is fixed in patch #3.

With all the fixes, we get desired full stack traces:

 caller: 0x758fff - 0x7595ab
 target_1: 0x758fd5 - 0x758fff
 target_2: 0x758fca - 0x758fd5
 target_3: 0x758fbf - 0x758fca
 target_4: 0x758fb3 - 0x758fbf
 ENTRY #0: 0x758fb7 (in target_4)
 ENTRY #1: 0x758fc8 (in target_3)
 ENTRY #2: 0x758fd3 (in target_2)
 ENTRY #3: 0x758ffd (in target_1)
 ENTRY #4: 0x758ff3 (in target_1)
 ENTRY #5: 0x75922c (in caller)
 ENTRY #6: 0x6f8f39
 ENTRY #7: 0x6fa6f0
 ENTRY #8: 0x7f986adc4cd0

Now there is a logical and complete sequence of function calls.
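
As a side note on the test structure shown below: the per-function address
ranges come from the __start_<section>/__stop_<section> symbols that the
linker generates for custom sections with C-identifier names. A tiny
stand-alone illustration of that mechanism (separate from this patch, with
made-up names):

#include <stdint.h>
#include <stdio.h>

__attribute__((section("demo_sec")))
static int in_demo_sec(void)
{
	return 42;
}

extern const char __start_demo_sec[];
extern const char __stop_demo_sec[];

int main(void)
{
	printf("demo_sec spans [%p, %p), in_demo_sec() is at %p\n",
	       (void *)__start_demo_sec, (void *)__stop_demo_sec,
	       (void *)(uintptr_t)in_demo_sec);
	return in_demo_sec() == 42 ? 0 : 1;
}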

Signed-off-by: Andrii Nakryiko 
---
 .../bpf/prog_tests/uretprobe_stack.c  | 185 ++
 .../selftests/bpf/progs/uretprobe_stack.c |  96 +
 2 files changed, 281 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c 
b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
new file mode 100644
index ..2974553da6e5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include "uretprobe_stack.skel.h"
+#include "../sdt.h"
+
+/* We set up target_1() -> target_2() -> target_3() -> target_4() -> USDT()
+ * call chain, each being traced by our BPF program. On entry or return from
+ * each target_*() we are capturing user stack trace and recording it in
+ * global variable, so that user space part of the test can validate it.
+ *
+ * Note, we put each target function into a custom section to get those
+ * __start_XXX/__stop_XXX symbols, generated by linker for us, which allow us
+ * to know address range of those functions
+ */
+__attribute__((section("uprobe__target_4")))
+__weak int target_4(void)
+{
+   STAP_PROBE1(uretprobe_stack, target, 42);
+   return 42;
+}
+
+extern const void *__start_uprobe__target_4;
+extern const void *__stop_uprobe__target_4;
+
+__attribute__((section("uprobe__target_3")))
+__weak int target_3(void)
+{
+   return target_4();
+}
+
+extern const void *__start_uprobe__target_3;
+extern const void *__stop_uprobe__target_3;
+
+__attribute__((section("uprobe__target_2")))
+__weak int target_2(void)
+{
+   return target_3();
+}
+
+extern const void *__start_uprobe__target_2;
+extern const void *__stop_uprobe__target_2;
+
+__attribute__((section("uprobe__target_1")))
+__weak int target_1(int depth)
+{
+   if (depth < 1)
+   return 1 + target_1(depth + 1);
+   else
+   return target_2();
+}
+
+extern const void *__start_uprobe__target_1;
+extern const void *__stop_uprobe__target_1;
+
+extern const void *__start_uretprobe_stack_sec;
+extern const void *__stop_uretprobe_stack_sec;
+
+struct range {
+   long start;
+   long stop;
+};
+
+static struct range targets[] = {
+   {}, /* we want target_1 to map to target[1], so need 1-based indexing */
+   { (long)&__start_uprobe__target_1, (long)&__stop_uprobe__target_1 },
+   { (long)&__start_uprobe__target_2, (long)&__stop_uprobe__target_2 },
+   { (long)&__start_uprobe__target_3, (long)&__stop_uprobe__target_3 },
+   { (long)&__start_uprobe__target_4, (long)&__stop_uprobe__target_4 },
+};
+
+static struct range caller = {
+   (long)&__start_uretprobe_stack_sec,
+   (long)&__stop_uretprobe_stack_sec,
+};
+
+static void validate_stack(__u64 *ips, int stack_len, int cnt, ...)
+{
+   int i, j;
+   va_list args;
+
+   if (!ASSERT_GT(stack_len, 0, "stack_len"))
+   return;
+
+   stack_len /= 8;
+

[PATCH 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-05-08 Thread Andrii Nakryiko
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function preamble, which means that within that function the frame
pointer hasn't been established yet. This leads to consistently missing
the actual caller of the traced function, because perf_callchain_user()
only records the current IP (capturing the traced function) and then
follows the frame pointer chain (which at that point is still the
caller's frame, containing the address of the caller's caller).

So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, the captured stack trace will report a
target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic to detect the
`push %rbp` instruction being traced. Under the assumption that the
application is compiled with frame pointers, this instruction is a strong
indicator that we are at the very entry of the function. In that case,
the return address is still pointed to by %rsp, so we fetch it and add it
to the stack trace before proceeding to unwind the rest using the frame
pointer-based logic.
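
A quick user-space way to see why the single 0x55 byte is a reasonable
signal (illustrative only, not part of the patch; it assumes a classic
frame-pointer prologue, e.g. -O0 -fno-omit-frame-pointer, and no endbr64
padding from -fcf-protection, which would place a 0xf3 byte first):

#include <stdint.h>
#include <stdio.h>

/* with a frame-pointer prologue, the function starts with `push %rbp`,
 * which is encoded as the single byte 0x55 on x86-64 */
__attribute__((noinline)) static int probed_target(int x)
{
	return x + 1;
}

int main(void)
{
	const uint8_t *insn = (const uint8_t *)(uintptr_t)probed_target;

	printf("first byte of probed_target(): 0x%02x (%s `push %%rbp`)\n",
	       insn[0], insn[0] == 0x55 ? "looks like" : "is not");
	return probed_target(41) == 42 ? 0 : 1;
}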

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/events/core.c  | 20 
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..82d5570b58ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *regs
return;
 
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+   /*
+* If we are called from uprobe handler, and we are indeed at the very
+* entry to user function (which is normally a `push %rbp` instruction,
+* under assumption of application being compiled with frame pointers),
+* we should read return address from *regs->sp before proceeding
+* to follow frame pointers, otherwise we'll skip immediate caller
+* as %rbp is not yet setup.
+*/
+   if (current->utask) {
+   struct arch_uprobe *auprobe = current->utask->auprobe;
+   u64 ret_addr;
+
+   if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
+   !__get_user(ret_addr, (const u64 __user *)regs->sp))
+   perf_callchain_store(entry, ret_addr);
+   }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0c57eec85339..7b785cd30d86 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
 
+   struct arch_uprobe  *auprobe;
+
struct return_instance  *return_instances;
unsigned intdepth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1c99380dc89d..504693845187 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
 
	down_read(&uprobe->register_rwsem);
+   current->utask->auprobe = &uprobe->arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
 
@@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
pt_regs *regs)
 
remove &= rc;
}
+   current->utask->auprobe = NULL;
 
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
-- 
2.43.0




[PATCH 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-05-08 Thread Andrii Nakryiko
When the kernel has pending uretprobes installed, it hijacks the original
user function's return address on the stack with a uretprobe trampoline
address. There could be multiple such pending uretprobes (either on
different user functions or on the same recursive one) at any given
time within the same task.

This approach interferes with the user stack trace capture logic, which
would report surprising addresses (like 0x7fffe000) that correspond
to a special "[uprobes]" section that the kernel installs in the target
process address space for uretprobe trampoline code, while logically it
should be an address somewhere within the calling function of another
traced user function.

This is easy to correct for, though. The uprobes subsystem keeps track of
pending uretprobes and records the original return addresses. This patch
uses that to do a post-processing step and replace each trampoline
address entry with the correct original return address. This is done only
if there are pending uretprobes for the current task.
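
For readers less familiar with the uprobes internals, here is a minimal
user-space model of the fixup loop added below (addresses are made up;
the kernel walks utask->return_instances and the perf callchain entry
instead):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct return_instance {
	uint64_t orig_ret_vaddr;		/* hijacked original return address */
	struct return_instance *next;		/* innermost pending uretprobe first */
};

static void fixup_entries(uint64_t *ip, int nr, uint64_t tramp_addr,
			  struct return_instance *ri)
{
	for (int i = 0; i < nr && ri; i++) {
		if (ip[i] == tramp_addr) {
			ip[i] = ri->orig_ret_vaddr;	/* restore the real caller */
			ri = ri->next;
		}
	}
}

int main(void)
{
	const uint64_t tramp = 0x7fffe000;		/* fake trampoline address */
	struct return_instance outer = { 0x405678, NULL };
	struct return_instance inner = { 0x401234, &outer };
	uint64_t ips[] = { 0x400abc, tramp, 0x400def, tramp, 0x4009ff };

	fixup_entries(ips, 5, tramp, &inner);
	for (int i = 0; i < 5; i++)
		printf("ENTRY #%d: 0x%" PRIx64 "\n", i, ips[i]);
	return 0;
}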

Reported-by: Riham Selim 
Signed-off-by: Andrii Nakryiko 
---
 kernel/events/callchain.c | 42 ++-
 kernel/events/uprobes.c   |  9 +
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..2f7ceca7ae3f 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/uprobes.h>
 
 #include "internal.h"
 
@@ -176,13 +177,50 @@ put_callchain_entry(int rctx)
put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 }
 
+static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry 
*entry,
+  int start_entry_idx)
+{
+#ifdef CONFIG_UPROBES
+   struct uprobe_task *utask = current->utask;
+   struct return_instance *ri;
+   __u64 *cur_ip, *last_ip, tramp_addr;
+
+   if (likely(!utask || !utask->return_instances))
+   return;
+
+   cur_ip = &entry->ip[start_entry_idx];
+   last_ip = &entry->ip[entry->nr - 1];
+   ri = utask->return_instances;
+   tramp_addr = uprobe_get_trampoline_vaddr();
+
+   /* If there are pending uretprobes for current thread, they are
+* recorded in a list inside utask->return_instances; each such
+* pending uretprobe replaces traced user function's return address on
+* the stack, so when stack trace is captured, instead of seeing
+* actual function's return address, we'll have one or many uretprobe
+* trampoline addresses in the stack trace, which are not helpful and
+* misleading to users.
+* So here we go over the pending list of uretprobes, and each
+* encountered trampoline address is replaced with actual return
+* address.
+*/
+   while (ri && cur_ip <= last_ip) {
+   if (*cur_ip == tramp_addr) {
+   *cur_ip = ri->orig_ret_vaddr;
+   ri = ri->next;
+   }
+   cur_ip++;
+   }
+#endif
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
   u32 max_stack, bool crosstask, bool add_mark)
 {
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
-   int rctx;
+   int rctx, start_entry_idx;
 
	entry = get_callchain_entry(&rctx);
if (!entry)
@@ -215,7 +253,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool 
kernel, bool user,
if (add_mark)
			perf_callchain_store_context(&ctx, 
PERF_CONTEXT_USER);
 
+   start_entry_idx = entry->nr;
			perf_callchain_user(&ctx, regs);
+   fixup_uretprobe_trampoline_entries(entry, 
start_entry_idx);
}
}
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d60d24f0f2f4..1c99380dc89d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2149,6 +2149,15 @@ static void handle_trampoline(struct pt_regs *regs)
 
instruction_pointer_set(regs, ri->orig_ret_vaddr);
do {
+   /* pop current instance from the stack of pending 
return instances,
+* as it's not pending anymore: we just fixed up 
original
+* instruction pointer in regs and are about to call 
handlers;
+* this allows fixup_uretprobe_trampoline_entries() to 
properly fix up
+* captured stack traces from uretprobe handlers, in 
which pending
+* trampoline addresses on the stack are replaced with 
correct
+* original return addresses
+*/
+   utask

[PATCH 1/4] uprobes: rename get_trampoline_vaddr() and make it global

2024-05-08 Thread Andrii Nakryiko
This helper is needed in another file, so make it a bit more uniquely
named and expose it internally.

Signed-off-by: Andrii Nakryiko 
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..0c57eec85339 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance 
*ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 void *src, unsigned long len);
+extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8ae0eefc3a34..d60d24f0f2f4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, unsigned 
long flags)
  *
  * Returns -1 in case the xol_area is not allocated.
  */
-static unsigned long get_trampoline_vaddr(void)
+unsigned long uprobe_get_trampoline_vaddr(void)
 {
struct xol_area *area;
unsigned long trampoline_vaddr = -1;
@@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
struct pt_regs *regs)
if (!ri)
return;
 
-   trampoline_vaddr = get_trampoline_vaddr();
+   trampoline_vaddr = uprobe_get_trampoline_vaddr();
orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, 
regs);
if (orig_ret_vaddr == -1)
goto fail;
@@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs)
int is_swbp;
 
bp_vaddr = uprobe_get_swbp_addr(regs);
-   if (bp_vaddr == get_trampoline_vaddr())
+   if (bp_vaddr == uprobe_get_trampoline_vaddr())
return handle_trampoline(regs);
 
	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
-- 
2.43.0




[PATCH 0/4] Fix user stack traces captured from uprobes

2024-05-08 Thread Andrii Nakryiko
This patch set addresses two issues with captured stack traces.

The first issue, fixed in patch #2, deals with fixing up uretprobe
trampoline addresses in captured stack traces. This issue happens when
there are pending return probes, for which the kernel hijacks some of the
return addresses on user stacks. The code matches those special uretprobe
trampoline addresses against the list of pending return probe instances
and replaces them with the actual return addresses.

The second issue, which patch #3 tries to fix with the help of a
heuristic, has to do with capturing user stack traces in entry uprobes.
At the very entrance to a user function, the frame pointer in the rbp
register is not yet set up, so the actual caller's return address is
still pointed to by rsp. The patch uses a simple heuristic, looking for
the `push %rbp` instruction, to fetch this extra direct caller return
address before proceeding to unwind the stack using rbp.

Consider patch #3 an RFC; if there are better suggestions for how this
can be solved, I'd be happy to hear them.

Patch #4 adds tests to BPF selftests that validate that the stack traces
captured at various points are what we expect to get. This patch, while
being part of BPF selftests, is isolated from any other BPF selftests
changes and can go in through a non-BPF tree without the risk of merge
conflicts.

Patches are based on the latest linux-trace probes/for-next branch.

Andrii Nakryiko (4):
  uprobes: rename get_trampoline_vaddr() and make it global
  perf,uprobes: fix user stack traces in the presence of pending
uretprobes
  perf,x86: avoid missing caller address in stack traces captured in
uprobe
  selftests/bpf: add test validating uprobe/uretprobe stack traces

 arch/x86/events/core.c|  20 ++
 include/linux/uprobes.h   |   3 +
 kernel/events/callchain.c |  42 +++-
 kernel/events/uprobes.c   |  17 +-
 .../bpf/prog_tests/uretprobe_stack.c  | 185 ++
 .../selftests/bpf/progs/uretprobe_stack.c |  96 +
 6 files changed, 359 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_stack.c

-- 
2.43.0




Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

2024-04-29 Thread Andrii Nakryiko
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko  wrote:
>
> At the lowest level, rethook-based kretprobes on x86-64 architecture go
> through the arch_rethook_trampoline() function, manually written in
> assembly, which calls into a simple arch_rethook_trampoline_callback()
> function, written in C, and only doing a few straightforward field
> assignments, before calling further into rethook_trampoline_handler(),
> which handles kretprobe callbacks generically.
>
> Looking at simplicity of arch_rethook_trampoline_callback(), it seems
> not really worthwhile to spend an extra function call just to do 4 or
> 5 assignments. As such, this patch proposes to "inline"
> arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
> manually implementing it in an assembly code.
>
> This has two motivations. First, we do get a bit of runtime speed up by
> avoiding function calls. Using BPF selftests's bench tool, we see
> 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
> triggering code path:
>
> BEFORE (latest probes/for-next)
> ===
> kretprobe  :   10.455 ± 0.024M/s
> kretprobe-multi:   11.150 ± 0.012M/s
>
> AFTER (probes/for-next + this patch)
> 
> kretprobe  :   10.540 ± 0.009M/s (+0.8%)
> kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)
>
> Second, and no less importantly for some specialized use cases, this
> avoids unnecessarily "polluting" LBR records with an extra function call
> (recorded as a jump by CPU). This is the case for the retsnoop ([0])
> tool, which relies heavily on capturing LBR records to provide users with
> lots of insight into kernel internals.
>
> This RFC patch is only inlining this function for x86-64, but it's
> possible to do that for 32-bit x86 arch as well and then remove
> arch_rethook_trampoline_callback() implementation altogether. Please let
> me know if this change is acceptable and whether I should complete it
> with 32-bit "inlining" as well. Thanks!
>
>   [0] 
> https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr
>
> Signed-off-by: Andrii Nakryiko 
> ---
>  arch/x86/kernel/asm-offsets_64.c |  4 
>  arch/x86/kernel/rethook.c| 37 +++-
>  2 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/asm-offsets_64.c 
> b/arch/x86/kernel/asm-offsets_64.c
> index bb65371ea9df..5c444abc540c 100644
> --- a/arch/x86/kernel/asm-offsets_64.c
> +++ b/arch/x86/kernel/asm-offsets_64.c
> @@ -42,6 +42,10 @@ int main(void)
> ENTRY(r14);
> ENTRY(r15);
> ENTRY(flags);
> +   ENTRY(ip);
> +   ENTRY(cs);
> +   ENTRY(ss);
> +   ENTRY(orig_ax);
> BLANK();
>  #undef ENTRY
>
> diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
> index 8a1c0111ae79..3e1c01beebd1 100644
> --- a/arch/x86/kernel/rethook.c
> +++ b/arch/x86/kernel/rethook.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "kprobes/common.h"
>
> @@ -34,10 +35,36 @@ asm(
> "   pushq %rsp\n"
> "   pushfq\n"
> SAVE_REGS_STRING
> -   "   movq %rsp, %rdi\n"
> -   "   call arch_rethook_trampoline_callback\n"
> +   "   movq %rsp, %rdi\n" /* $rdi points to regs */
> +   /* fixup registers */
> +   /* regs->cs = __KERNEL_CS; */
> +   "   movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) 
> "(%rdi)\n"
> +   /* regs->ip = (unsigned long)_rethook_trampoline; */
> +   "   movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) 
> "(%rdi)\n"
> +   /* regs->orig_ax = ~0UL; */
> +   "   movq $0x, " __stringify(pt_regs_orig_ax) 
> "(%rdi)\n"
> +   /* regs->sp += 2*sizeof(long); */
> +   "   addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
> +   /* 2nd arg is frame_pointer = (long *)(regs + 1); */
> +   "   lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"

BTW, all this __stringify() ugliness can be avoided if we move this
assembly into its own .S file, like lots of other assembly functions
in arch/x86/kernel subdir. That has another benefit of generating
better line information in DWARF for those assembly instructions. It's
lots more work, so before I do this, I'd like to get confirmation that
this change is acceptable in principle.

> +   /*
> +* The return address at 'frame_pointer' is recovered by the
> +* a

Re: [PATCH 0/2] Objpool performance improvements

2024-04-26 Thread Andrii Nakryiko
On Fri, Apr 26, 2024 at 7:25 AM Masami Hiramatsu  wrote:
>
> Hi Andrii,
>
> On Wed, 24 Apr 2024 14:52:12 -0700
> Andrii Nakryiko  wrote:
>
> > Improve objpool (used heavily in kretprobe hot path) performance with two
> > improvements:
> >   - inlining performance critical objpool_push()/objpool_pop() operations;
> >   - avoiding re-calculating relatively expensive nr_possible_cpus().
>
> Thanks for optimizing objpool. Both looks good to me.

Great, thanks for applying.

>
> BTW, I don't intend to stop this short-term optimization attempts,
> but I would like to ask you check the new fgraph based fprobe
> (kretprobe-multi)[1] instead of objpool/rethook.

You can see that I did :) There is tons of code and I'm not familiar
with internals of function_graph infra, but you can see I did run
benchmarks, so I'm paying attention.

But as for the objpool itself, I think it's a performant and useful
internal building block, and we might use it outside of rethook as
well, so I think making it as fast as possible is good regardless.

>
> [1] 
> https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/
>
> I'm considering to obsolete the kretprobe (and rethook) with fprobe
> and eventually remove it. Those have similar feature and we should
> choose safer one.
>

Yep, I had a few more semi-ready patches, but I'll hold off for now
given this move to function graph, plus some of the changes that Jiri
is making in multi-kprobe code. I'll wait for things to settle down a
bit before looking again.

But just to give you some context, I'm the author of the retsnoop tool,
and one of its killer features is LBR capture in kretprobes, which is a
tremendous help in investigating kernel failures, especially in
unfamiliar code (LBR allows you to "look back" and figure out "how did we
get to this condition" after the fact). And so it's important to minimize
the number of wasted LBR records between the point where some kernel
function returns an error (and thus becomes "an interesting event") and
the point where the BPF program that captures LBRs is triggered. A big
part of that is the ftrace/fprobe/rethook infra, so I was looking into
making that part as "minimal" as possible, in the sense of eliminating as
many function calls and jumps as possible. This has the added benefit of
making this hot path faster, but my main motivation is LBR.

Anyways, just a bit of context for some of the other patches (like
inlining arch_rethook_trampoline_callback).

> Thank you,
>
> >
> > These opportunities were found when benchmarking and profiling kprobes and
> > kretprobes with BPF-based benchmarks. See individual patches for details and
> > results.
> >
> > Andrii Nakryiko (2):
> >   objpool: enable inlining objpool_push() and objpool_pop() operations
> >   objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
> >
> >  include/linux/objpool.h | 105 +++--
> >  lib/objpool.c   | 112 +++-
> >  2 files changed, 107 insertions(+), 110 deletions(-)
> >
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

2024-04-24 Thread Andrii Nakryiko
At the lowest level, rethook-based kretprobes on x86-64 architecture go
through the arch_rethook_trampoline() function, manually written in
assembly, which calls into a simple arch_rethook_trampoline_callback()
function, written in C, and only doing a few straightforward field
assignments, before calling further into rethook_trampoline_handler(),
which handles kretprobe callbacks generically.

Looking at the simplicity of arch_rethook_trampoline_callback(), it seems
not really worthwhile to spend an extra function call just to do 4 or
5 assignments. As such, this patch proposes to "inline"
arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
manually implementing it in assembly code.

This has two motivations. First, we do get a bit of runtime speed up by
avoiding function calls. Using BPF selftests's bench tool, we see
0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
triggering code path:

BEFORE (latest probes/for-next)
===
kretprobe  :   10.455 ± 0.024M/s
kretprobe-multi:   11.150 ± 0.012M/s

AFTER (probes/for-next + this patch)

kretprobe  :   10.540 ± 0.009M/s (+0.8%)
kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)

Second, and no less importantly for some specialized use cases, this
avoids unnecessarily "polluting" LBR records with an extra function call
(recorded as a jump by the CPU). This is the case for the retsnoop ([0])
tool, which relies heavily on capturing LBR records to provide users with
lots of insight into kernel internals.

This RFC patch is only inlining this function for x86-64, but it's
possible to do that for 32-bit x86 arch as well and then remove
arch_rethook_trampoline_callback() implementation altogether. Please let
me know if this change is acceptable and whether I should complete it
with 32-bit "inlining" as well. Thanks!

  [0] 
https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/kernel/asm-offsets_64.c |  4 
 arch/x86/kernel/rethook.c| 37 +++-
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..5c444abc540c 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -42,6 +42,10 @@ int main(void)
ENTRY(r14);
ENTRY(r15);
ENTRY(flags);
+   ENTRY(ip);
+   ENTRY(cs);
+   ENTRY(ss);
+   ENTRY(orig_ax);
BLANK();
 #undef ENTRY
 
diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
index 8a1c0111ae79..3e1c01beebd1 100644
--- a/arch/x86/kernel/rethook.c
+++ b/arch/x86/kernel/rethook.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kprobes/common.h"
 
@@ -34,10 +35,36 @@ asm(
"   pushq %rsp\n"
"   pushfq\n"
SAVE_REGS_STRING
-   "   movq %rsp, %rdi\n"
-   "   call arch_rethook_trampoline_callback\n"
+   "   movq %rsp, %rdi\n" /* $rdi points to regs */
+   /* fixup registers */
+   /* regs->cs = __KERNEL_CS; */
+   "   movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) 
"(%rdi)\n"
+   /* regs->ip = (unsigned long)_rethook_trampoline; */
+   "   movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) 
"(%rdi)\n"
+   /* regs->orig_ax = ~0UL; */
+   "   movq $0x, " __stringify(pt_regs_orig_ax) 
"(%rdi)\n"
+   /* regs->sp += 2*sizeof(long); */
+   "   addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
+   /* 2nd arg is frame_pointer = (long *)(regs + 1); */
+   "   lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"
+   /*
+* The return address at 'frame_pointer' is recovered by the
+* arch_rethook_fixup_return() which called from this
+* rethook_trampoline_handler().
+*/
+   "   call rethook_trampoline_handler\n"
+   /*
+* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF.
+*
+* We don't save/restore %rax below, because we ignore
+* rethook_trampoline_handler result.
+*
+* *(unsigned long *)&regs->ss = regs->flags;
+*/
+   "   mov " __stringify(pt_regs_flags) "(%rsp), %rax\n"
+   "   mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n"
RESTORE_REGS_STRING
-   /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
+   /* We just copied 'regs->flags' into 'regs->ss'. */
"   addq $16, %rsp\n"
"   popfq\n"
 #else
@@ -61,6 +88,7 @@ asm(
 );
 NOKPROBE_SYMBOL(arch_retho

[PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids

2024-04-24 Thread Andrii Nakryiko
Profiling shows that calling num_possible_cpus() in objpool_pop() takes
a noticeable amount of CPU (when profiled on an 80-core machine), as we
need to recalculate the number of set bits in a CPU bit mask. This number
can't change, so there is no point in paying the price of recalculating
it. As such, cache this value in struct objpool_head and use it in
objpool_pop().

On the other hand, the cached pool->nr_cpus isn't necessary, as it's not
used in the hot path and is also a pretty trivial value to retrieve. So
drop pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the
size of struct objpool_head remains the same, which is a nice bonus.

The same BPF selftests benchmarks were used to evaluate the effect. Using
the changes in the previous patch (inlining of objpool_pop/objpool_push)
as the baseline, here are the differences:

BASELINE

kretprobe  :9.937 ± 0.174M/s
kretprobe-multi:   10.440 ± 0.108M/s

AFTER
=
kretprobe  :   10.106 ± 0.120M/s (+1.7%)
kretprobe-multi:   10.515 ± 0.180M/s (+0.7%)

Cc: Matt (Qiang) Wu 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/objpool.h |  6 +++---
 lib/objpool.c   | 12 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index d8b1f7b91128..cb1758eaa2d3 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, 
void *context);
  * struct objpool_head - object pooling metadata
  * @obj_size:   object size, aligned to sizeof(void *)
  * @nr_objs:total objs (to be pre-allocated with objpool)
- * @nr_cpus:local copy of nr_cpu_ids
+ * @nr_possible_cpus: cached value of num_possible_cpus()
  * @capacity:   max objs can be managed by one objpool_slot
  * @gfp:gfp flags for kmalloc & vmalloc
  * @ref:refcount of objpool
@@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, 
void *context);
 struct objpool_head {
int obj_size;
int nr_objs;
-   int nr_cpus;
+   int nr_possible_cpus;
int capacity;
gfp_t   gfp;
refcount_t  ref;
@@ -176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool)
raw_local_irq_save(flags);
 
cpu = raw_smp_processor_id();
-   for (i = 0; i < num_possible_cpus(); i++) {
+   for (i = 0; i < pool->nr_possible_cpus; i++) {
obj = __objpool_try_get_slot(pool, cpu);
if (obj)
break;
diff --git a/lib/objpool.c b/lib/objpool.c
index f696308fc026..234f9d0bd081 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int 
nr_objs,
 {
int i, cpu_count = 0;
 
-   for (i = 0; i < pool->nr_cpus; i++) {
+   for (i = 0; i < nr_cpu_ids; i++) {
 
struct objpool_slot *slot;
int nodes, size, rc;
@@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int 
nr_objs,
continue;
 
/* compute how many objects to be allocated with this slot */
-   nodes = nr_objs / num_possible_cpus();
-   if (cpu_count < (nr_objs % num_possible_cpus()))
+   nodes = nr_objs / pool->nr_possible_cpus;
+   if (cpu_count < (nr_objs % pool->nr_possible_cpus))
nodes++;
cpu_count++;
 
@@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head 
*pool)
if (!pool->cpu_slots)
return;
 
-   for (i = 0; i < pool->nr_cpus; i++)
+   for (i = 0; i < nr_cpu_ids; i++)
kvfree(pool->cpu_slots[i]);
kfree(pool->cpu_slots);
 }
@@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, 
int object_size,
 
/* initialize objpool pool */
memset(pool, 0, sizeof(struct objpool_head));
-   pool->nr_cpus = nr_cpu_ids;
+   pool->nr_possible_cpus = num_possible_cpus();
pool->obj_size = object_size;
pool->capacity = capacity;
pool->gfp = gfp & ~__GFP_ZERO;
pool->context = context;
pool->release = release;
-   slot_size = pool->nr_cpus * sizeof(struct objpool_slot);
+   slot_size = nr_cpu_ids * sizeof(struct objpool_slot);
pool->cpu_slots = kzalloc(slot_size, pool->gfp);
if (!pool->cpu_slots)
return -ENOMEM;
-- 
2.43.0




[PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations

2024-04-24 Thread Andrii Nakryiko
objpool_push() and objpool_pop() are very performance-critical functions
and can be called very frequently in the kretprobe triggering path.

As such, it makes sense to allow the compiler to inline them completely
to eliminate function call overhead. Luckily, their logic is quite well
isolated and doesn't have any sprawling dependencies.

This patch moves both objpool_push() and objpool_pop() into
include/linux/objpool.h and marks them as static inline functions,
enabling inlining. To avoid anyone using the internal helpers
(objpool_try_get_slot, objpool_try_add_slot), rename them to use leading
underscores.

We used the kretprobe microbenchmark from BPF selftests (bench trig-kprobe
and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi
programs in a tight loop to evaluate the effect. BPF's own overhead in
this case is minimal, and the benchmark mostly stresses the rest of the
in-kernel kretprobe infrastructure. Results are in millions of calls per
second. This is not super scientific, but it shows the trend nevertheless.

BEFORE
==
kretprobe  :9.794 ± 0.086M/s
kretprobe-multi:   10.219 ± 0.032M/s

AFTER
=
kretprobe  :9.937 ± 0.174M/s (+1.5%)
kretprobe-multi:   10.440 ± 0.108M/s (+2.2%)

Cc: Matt (Qiang) Wu 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/objpool.h | 101 +++-
 lib/objpool.c   | 100 ---
 2 files changed, 99 insertions(+), 102 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 15aff4a17f0c..d8b1f7b91128 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -5,6 +5,10 @@
 
 #include 
 #include 
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+#include <linux/irqflags.h>
+#include <linux/smp.h>
 
 /*
  * objpool: ring-array based lockless MPMC queue
@@ -118,13 +122,94 @@ int objpool_init(struct objpool_head *pool, int nr_objs, 
int object_size,
 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
 objpool_fini_cb release);
 
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu)
+{
+   struct objpool_slot *slot = pool->cpu_slots[cpu];
+   /* load head snapshot, other cpus may change it */
+   uint32_t head = smp_load_acquire(&slot->head);
+
+   while (head != READ_ONCE(slot->last)) {
+   void *obj;
+
+   /*
+* data visibility of 'last' and 'head' could be out of
+* order since memory updating of 'last' and 'head' are
+* performed in push() and pop() independently
+*
+* before any retrieving attempts, pop() must guarantee
+* 'last' is behind 'head', that is to say, there must
+* be available objects in slot, which could be ensured
+* by condition 'last != head && last - head <= nr_objs'
+* that is equivalent to 'last - head - 1 < nr_objs' as
+* 'last' and 'head' are both unsigned int32
+*/
+   if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
+   head = READ_ONCE(slot->head);
+   continue;
+   }
+
+   /* obj must be retrieved before moving forward head */
+   obj = READ_ONCE(slot->entries[head & slot->mask]);
+
+   /* move head forward to mark it's consumption */
+   if (try_cmpxchg_release(&slot->head, &head, head + 1))
+   return obj;
+   }
+
+   return NULL;
+}
+
 /**
  * objpool_pop() - allocate an object from objpool
  * @pool: object pool
  *
  * return value: object ptr or NULL if failed
  */
-void *objpool_pop(struct objpool_head *pool);
+static inline void *objpool_pop(struct objpool_head *pool)
+{
+   void *obj = NULL;
+   unsigned long flags;
+   int i, cpu;
+
+   /* disable local irq to avoid preemption & interruption */
+   raw_local_irq_save(flags);
+
+   cpu = raw_smp_processor_id();
+   for (i = 0; i < num_possible_cpus(); i++) {
+   obj = __objpool_try_get_slot(pool, cpu);
+   if (obj)
+   break;
+   cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+   }
+   raw_local_irq_restore(flags);
+
+   return obj;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int
+__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
+{
+   struct objpool_slot *slot = pool->cpu_slots[cpu];
+   uint32_t head, tail;
+
+   /* loading tail and head as a local snapshot, tail first */
+   tail = READ_ONCE(slot->tail);
+
+   do {
+   head = READ_ONCE(slot->head);
+   /* fault caught: something must be wrong */
+   WARN_ON_ONCE(tail - head > pool->nr_objs);
+   } while (!try_cmpxchg_acqui

[PATCH 0/2] Objpool performance improvements

2024-04-24 Thread Andrii Nakryiko
Improve objpool (used heavily in the kretprobe hot path) performance with
two improvements:
  - inlining performance critical objpool_push()/objpool_pop() operations;
  - avoiding re-calculating relatively expensive nr_possible_cpus().

These opportunities were found when benchmarking and profiling kprobes and
kretprobes with BPF-based benchmarks. See individual patches for details and
results.

Andrii Nakryiko (2):
  objpool: enable inlining objpool_push() and objpool_pop() operations
  objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids

 include/linux/objpool.h | 105 +++--
 lib/objpool.c   | 112 +++-
 2 files changed, 107 insertions(+), 110 deletions(-)

-- 
2.43.0




Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-19 Thread Andrii Nakryiko
On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu  wrote:
>
> On Thu, 18 Apr 2024 12:09:09 -0700
> Andrii Nakryiko  wrote:
>
> > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > that RCU is watching when trying to setup rethooko on a function entry.
> >
> > One notable exception when we force rcu_is_watching() check is
> > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> > old-style int3-based workflow instead of relying on ftrace, making RCU
> > watching check important to validate.
> >
> > This further (in addition to improvements in the previous patch)
> > improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> > by 2.3%, according to BPF benchmarks ([0]).
> >
> >   [0] 
> > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> >
> > Signed-off-by: Andrii Nakryiko 
>
>
> Thanks for update! This looks good to me.

Thanks, Masami! Will you take it through your tree, or you'd like to
route it through bpf-next?

>
> Acked-by: Masami Hiramatsu (Google) 
>
> Thanks,
>
> > ---
> >  kernel/trace/rethook.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > index fa03094e9e69..a974605ad7a5 100644
> > --- a/kernel/trace/rethook.c
> > +++ b/kernel/trace/rethook.c
> > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >   if (unlikely(!handler))
> >   return NULL;
> >
> > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
> >   /*
> >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed
> > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >*/
> >   if (unlikely(!rcu_is_watching()))
> >   return NULL;
> > +#endif
> >
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to set up a rethook on a function entry.

One notable exception, where we still force the rcu_is_watching() check,
is the CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which kretprobes use the
old-style int3-based workflow instead of relying on ftrace, making the RCU
watching check important to validate.

This further (in addition to the improvements in the previous patch)
improves BPF multi-kretprobe (which relies on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..a974605ad7a5 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
	return (struct rethook_node *)objpool_pop(&rh->pool);
 }
-- 
2.43.0




[PATCH v4 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-18 Thread Andrii Nakryiko
Introduce a CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never trigger and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
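
For anyone who does want the extra validation on a debug build, opting in
is a one-line .config change using the Kconfig symbol added below:

CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING=y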

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
On Tue, Apr 9, 2024 at 3:48 PM Masami Hiramatsu  wrote:
>
> On Wed,  3 Apr 2024 15:03:28 -0700
> Andrii Nakryiko  wrote:
>
> > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > that RCU is watching when trying to setup rethooko on a function entry.
> >
> > This further (in addition to improvements in the previous patch)
> > improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> > by 2.3%, according to BPF benchmarks ([0]).
> >
> >   [0] 
> > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> >
>
> Hi Andrii,
>
> Can you make this part depends on !KPROBE_EVENTS_ON_NOTRACE (with this
> option, kretprobes can be used without ftrace, but with original int3) ?

Sorry for the late response, I was out on vacation. Makes sense about
KPROBE_EVENTS_ON_NOTRACE, I went with this condition:

#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) ||
defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)

Will send an updated revision shortly.

> This option should be set N on production system because of safety,
> just for testing raw kretprobes.
>
> Thank you,
>
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  kernel/trace/rethook.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > index fa03094e9e69..15b8aa4048d9 100644
> > --- a/kernel/trace/rethook.c
> > +++ b/kernel/trace/rethook.c
> > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >   if (unlikely(!handler))
> >   return NULL;
> >
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >   /*
> >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed
> > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >*/
> >   if (unlikely(!rcu_is_watching()))
> >   return NULL;
> > +#endif
> >
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-06 Thread Andrii Nakryiko
On Fri, Apr 5, 2024 at 8:41 PM Masami Hiramatsu  wrote:
>
> On Tue, 2 Apr 2024 22:21:00 -0700
> Andrii Nakryiko  wrote:
>
> > On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> > > >
> > > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > > Masami Hiramatsu (Google)  wrote:
> > > >
> > > > > OK, for me, this last sentence is preferred for the help message. 
> > > > > That explains
> > > > > what this is for.
> > > > >
> > > > > All callbacks that attach to the function tracing have some 
> > > > > sort
> > > > > of protection against recursion. This option is only to 
> > > > > verify that
> > > > >    ftrace (and other users of ftrace_test_recursion_trylock()) 
> > > > >are not
> > > > > called outside of RCU, as if they are, it can cause a race.
> > > > > But it also has a noticeable overhead when enabled.
> > >
> > > Sounds good to me, I can add this to the description of the Kconfig 
> > > option.
> > >
> > > > >
> > > > > BTW, how much overhead does this introduce? And does the race cause
> > > > > a kernel crash?
> > >
> > > I just checked our fleet-wide production data for the last 24 hours.
> > > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > > everything called from it), rcu_is_watching (both calls, see below)
> > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > > prefer to be able to avoid that in production use cases.
> > >
> >
> > I just ran synthetic microbenchmark testing multi-kretprobe
> > throughput. We get (in millions of BPF kretprobe-multi program
> > invocations per second):
> >   - 5.568M/s as baseline;
> >   - 5.679M/s with changes in this patch (+2% throughput improvement);
> >   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> >
> > It's definitely noticeable.
>
> Thanks for checking the overhead! Hmm, it is considerable.
>
> > > > > or just mess up the ftrace buffer?
> > > >
> > > > There's a hypothetical race where it can cause a use after free.
>
> Hmm, so it might not lead a kernel crash but is better to enable with
> other debugging options.
>
> > > >
> > > > That is, after you shutdown ftrace, you need to call 
> > > > synchronize_rcu_tasks(),
> > > > which requires RCU to be watching. There's a theoretical case where that
> > > > task calls the trampoline and misses the synchronization. Note, these
> > > > locations are with preemption disabled, as rcu is always watching when
> > > > preemption is enabled. Thus it would be extremely fast where as the
> > > > synchronize_rcu_tasks() is rather slow.
> > > >
> > > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > > trace from happening, as it would schedule on each CPU forcing all
> > > > CPUs to have RCU watching.
> > > >
> > > > I have never heard of this race being hit. I guess it could happen on
> > > > a VM where a vCPU gets preempted at the right moment for a long time
> > > > and the other CPUs synchronize.
> > > >
> > > > But like lockdep, where deadlocks can crash the kernel, we don't
> > > > enable it for production.
> > > >
> > > > The overhead is another function call within the function tracer. I had
> > > > numbers before, but I guess I could run tests again and get new numbers.
> > > >
> > >
> > > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > > which seems to be a similar and complementary validation check to the
> > > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > > in this patch. It feels like both of them should be controlled by the
> > > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > > guard around rcu_is_watching() check in rethook_try_get() as well?
>
> Hmmm, no, I think it should not change the rethook side because rethook
> can be used with kprobes without ftrace. If we can detect it is used from

It's a good thing that I split that into a separate patch, then.
Hopefully the first patch looks good and you can apply it as is.

> the ftrace, we can skip it. (From this reason, I would like to remove
> return probe from kprobes...)

I'm on PTO for the next two weeks and I can take a look at more
properly guarding rcu_is_watching() in rethook_try_get() when I'm
back. Thanks.

>
> Thank you,
>
> > >
> > >
> > > > Thanks,
> > > >
> > > > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-03 Thread Andrii Nakryiko
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to set up a rethook on a function entry.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which relies on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..15b8aa4048d9 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
 	return (struct rethook_node *)objpool_pop(&rh->pool);
 }
-- 
2.43.0




[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Andrii Nakryiko
On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
 wrote:
>
> On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> >
> > On Wed, 3 Apr 2024 09:40:48 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > OK, for me, this last sentence is preferred for the help message.
> > > That explains what this is for.
> > >
> > > All callbacks that attach to the function tracing have some sort
> > > of protection against recursion. This option is only to verify that
> > > ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > > called outside of RCU, as if they are, it can cause a race.
> > > But it also has a noticeable overhead when enabled.
>
> Sounds good to me, I can add this to the description of the Kconfig option.
>
> > >
> > > BTW, how much overhead does this introduce? And does the race cause a
> > > kernel crash?
>
> I just checked our fleet-wide production data for the last 24 hours.
> Within the kprobe/kretprobe code path (ftrace_trampoline and
> everything called from it), rcu_is_watching (both calls, see below)
> cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> prefer to be able to avoid that in production use cases.
>

I just ran synthetic microbenchmark testing multi-kretprobe
throughput. We get (in millions of BPF kretprobe-multi program
invocations per second):
  - 5.568M/s as baseline;
  - 5.679M/s with changes in this patch (+2% throughput improvement);
  - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
(+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).

It's definitely noticeable.

> > > or just mess up the ftrace buffer?
> >
> > There's a hypothetical race where it can cause a use after free.
> >
> > That is, after you shutdown ftrace, you need to call 
> > synchronize_rcu_tasks(),
> > which requires RCU to be watching. There's a theoretical case where that
> > task calls the trampoline and misses the synchronization. Note, these
> > locations are with preemption disabled, as rcu is always watching when
> > preemption is enabled. Thus it would be extremely fast where as the
> > synchronize_rcu_tasks() is rather slow.
> >
> > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > have RCU watching.
> >
> > I have never heard of this race being hit. I guess it could happen on a VM
> > where a vCPU gets preempted at the right moment for a long time and the
> > other CPUs synchronize.
> >
> > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > for production.
> >
> > The overhead is another function call within the function tracer. I had
> > numbers before, but I guess I could run tests again and get new numbers.
> >
>
> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?
>
>
> > Thanks,
> >
> > -- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Andrii Nakryiko
On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
>
> On Wed, 3 Apr 2024 09:40:48 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > OK, for me, this last sentence is preferred for the help message.
> > That explains what this is for.
> >
> > All callbacks that attach to the function tracing have some sort
> > of protection against recursion. This option is only to verify that
> > ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > called outside of RCU, as if they are, it can cause a race.
> > But it also has a noticeable overhead when enabled.

Sounds good to me, I can add this to the description of the Kconfig option.

> >
> > BTW, how much overhead does this introduce? And does the race cause a
> > kernel crash?

I just checked our fleet-wide production data for the last 24 hours.
Within the kprobe/kretprobe code path (ftrace_trampoline and
everything called from it), rcu_is_watching (both calls, see below)
cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
prefer to be able to avoid that in production use cases.

> > or just mess up the ftrace buffer?
>
> There's a hypothetical race where it can cause a use after free.
>
> That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> which requires RCU to be watching. There's a theoretical case where that
> task calls the trampoline and misses the synchronization. Note, these
> locations are with preemption disabled, as rcu is always watching when
> preemption is enabled. Thus it would be extremely fast where as the
> synchronize_rcu_tasks() is rather slow.
>
> We also have synchronize_rcu_tasks_rude() which would actually keep the
> trace from happening, as it would schedule on each CPU forcing all CPUs to
> have RCU watching.
>
> I have never heard of this race being hit. I guess it could happen on a VM
> where a vCPU gets preempted at the right moment for a long time and the
> other CPUs synchronize.
>
> But like lockdep, where deadlocks can crash the kernel, we don't enable it
> for production.
>
> The overhead is another function call within the function tracer. I had
> numbers before, but I guess I could run tests again and get new numbers.
>

I just noticed another rcu_is_watching() call, in rethook_try_get(),
which seems to be a similar and complementary validation check to the
one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
in this patch. It feels like both of them should be controlled by the
same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
guard around rcu_is_watching() check in rethook_try_get() as well?


> Thanks,
>
> -- Steve



Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-02 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee  wrote:
>
> rethook_find_ret_addr() prints a warning message and returns 0 when the
> target task is running and not the "current" task to prevent returning an
> incorrect return address. However, this check is incomplete as the target
> task can still transition to the running state when finding the return
> address, although it is safe with RCU.
>
> The issue we encounter is that the kernel frequently prints warning
> messages when BPF profiling programs call to bpf_get_task_stack() on
> running tasks.
>
> The callers should be aware and willing to take the risk of receiving an
> incorrect return address from a task that is currently running other than
> the "current" one. A warning is not needed here as the callers are intent
> on it.
>
> Signed-off-by: Kui-Feng Lee 
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..4297a132a7ae 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> if (WARN_ON_ONCE(!cur))
> return 0;
>
> -   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> +   if (tsk != current && task_is_running(tsk))
> return 0;
>

This should probably go through Masami's tree, but the change makes
sense to me, given this is an expected condition.

Acked-by: Andrii Nakryiko 

> do {
> --
> 2.34.1
>
>



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt  wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."
>

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> >
> > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level debugging of ftrace subsystem. For most users it should
probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This is a feature useful for debugging ftrace. This will add more
+ overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-29 Thread Andrii Nakryiko
On Tue, Mar 26, 2024 at 11:58 AM Steven Rostedt  wrote:
>
> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko  wrote:
>
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >
> >
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
>
> Masami,
>
> Are you OK with just keeping it set to N.
>
> We could have other options like PROVE_LOCKING enable it.
>

So what's the conclusion, Masami? Should I send another version where
this config is opt-out, or are you ok with keeping it as opt-in as
proposed in this revision?

> -- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-26 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt  wrote:
>
> On Mon, 25 Mar 2024 11:38:48 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > On Fri, 22 Mar 2024 09:03:23 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > > control whether ftrace low-level code performs additional
> > > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > > violations.
> > >
> > > This check is expected to never be true in practice and would be best
> > > controlled with extra config to let users decide if they are willing to
> > > pay the price.
> >
> > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > is OK, but that should be set to Y by default. If you have already verified
> > that your system never makes it true and you want to optimize your ftrace
> > path, you can manually set it to N at your own risk.
> >
>
> Really, it's for debugging. I would argue that it should *not* be default y.
> Peter added this to find all the locations that could be called where RCU
> is not watching. But the issue I have is that this *does cause
> overhead* with function tracing.
>
> I believe we found pretty much all locations that were an issue, and we
> should now just make it an option for developers.
>
> It's no different than lockdep. Test boxes should have it enabled, but
> there's no reason to have this enabled in a production system.
>

I tend to agree with Steven here (which is why I sent this patch as it
is), but I'm happy to do it as an opt-out, if Masami insists. Please
do let me know if I need to send v2 or this one is actually the one
we'll end up using. Thanks!

> -- Steve
>
>
> > >
> > > Cc: Steven Rostedt 
> > > Cc: Masami Hiramatsu 
> > > Cc: Paul E. McKenney 
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  include/linux/trace_recursion.h |  2 +-
> > >  kernel/trace/Kconfig| 13 +
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > > index d48cd92d2364..24ea8ac049b4 100644
> > > --- a/include/linux/trace_recursion.h
> > > +++ b/include/linux/trace_recursion.h
> > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> > >  # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
> > >  #endif
> > >
> > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > >  # define trace_warn_on_no_rcu(ip)  \
> > > ({  \
> > > bool __ret = !rcu_is_watching();\
> >
> > BTW, maybe we can add "unlikely" in the next "if" line?
> >
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 61c541c36596..19bce4e217d6 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > >   This file can be reset, but the limit can not change in
> > >   size at runtime.
> > >
> > > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > > +   bool "Validate RCU is on during ftrace recursion check"
> > > +   depends on FUNCTION_TRACER
> > > +   depends on ARCH_WANTS_NO_INSTR
> >
> >   default y
> >
> > > +   help
> > > + All callbacks that attach to the function tracing have some sort
> > > + of protection against recursion. This option performs additional
> > > + checks to make sure RCU is on when ftrace callbacks recurse.
> > > +
> > > + This will add more overhead to all ftrace-based invocations.
> >
> >   ... invocations, but keep it safe.
> >
> > > +
> > > + If unsure, say N
> >
> >   If unsure, say Y
> >
> > Thank you,
> >
> > > +
> > >  config RING_BUFFER_RECORD_RECURSION
> > > bool "Record functions that recurse in the ring buffer"
> > > depends on FTRACE_RECORD_RECURSION
> > > --
> > > 2.43.0
> > >
> >
> >
>



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu  wrote:
>
> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko  wrote:
>
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> >
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.
>
> Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but that should be set to Y by default. If you have already verified
> that your system never makes it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.

Yeah, I don't think we ever see this warning across our machines. And
sure, I can default it to Y, no problem.

>
> >
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Paul E. McKenney 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig| 13 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
> >  #endif
> >
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)\
> >   ({  \
> >   bool __ret = !rcu_is_watching();\
>
> BTW, maybe we can add "unlikely" in the next "if" line?

sure, can add that as well
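
To be concrete, something along these lines in trace_recursion.h (rough
sketch only; the surrounding macro body is abbreviated here):

# define trace_warn_on_no_rcu(ip)				\
	({							\
		bool __ret = !rcu_is_watching();		\
		/* unlikely(): this should never fire in practice */	\
		if (unlikely(__ret))				\
			WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
		__ret;						\
	})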

>
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > This file can be reset, but the limit can not change in
> > size at runtime.
> >
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > + bool "Validate RCU is on during ftrace recursion check"
> > + depends on FUNCTION_TRACER
> > + depends on ARCH_WANTS_NO_INSTR
>
> default y
>

ok

> > + help
> > +   All callbacks that attach to the function tracing have some sort
> > +   of protection against recursion. This option performs additional
> > +   checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > +   This will add more overhead to all ftrace-based invocations.
>
> ... invocations, but keep it safe.
>
> > +
> > +   If unsure, say N
>
> If unsure, say Y
>

yep, will do, thanks!

> Thank you,
>
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> >   bool "Record functions that recurse in the ring buffer"
> >   depends on FTRACE_RECORD_RECURSION
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-22 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true in practice and would be best
controlled with extra config to let users decide if they are willing to
pay the price.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..19bce4e217d6 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This will add more overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




[PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Andrii Nakryiko
Recent changes to count the number of matching symbols when creating
a kprobe event failed to take kernel modules into account. As such, they
break kprobes on kernel module symbols by assuming there is no match.

Fix this by calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform proper counting.

Cc: Francis Laniel 
Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Steven Rostedt 
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_kprobe.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index effcaede4759..1efb27f35963 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long unused)
return 0;
 }
 
+struct sym_count_ctx {
+   unsigned int count;
+   const char *name;
+};
+
+static int count_mod_symbols(void *data, const char *name, unsigned long unused)
+{
+   struct sym_count_ctx *ctx = data;
+
+   if (strcmp(name, ctx->name) == 0)
+   ctx->count++;
+
+   return 0;
+}
+
 static unsigned int number_of_same_symbols(char *func_name)
 {
-   unsigned int count;
+   struct sym_count_ctx ctx = { .count = 0, .name = func_name };
+
+   kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
 
-   count = 0;
-   kallsyms_on_each_match_symbol(count_symbols, func_name, &count);
+   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);
 
-   return count;
+   return ctx.count;
 }
 
 static int __trace_kprobe_create(int argc, const char *argv[])
-- 
2.34.1