Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Andrii Nakryiko
On Tue, Jul 9, 2024 at 2:33 PM Oleg Nesterov  wrote:
>
> On 07/09, Andrii Nakryiko wrote:
> >
> > On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov  wrote:
> > >
> > > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > > > detects "improper" stack pointer progression, for example),
> > >
> > > In this case we a) assume that user-space tries to fool the kernel and
> >
> > Well, it's a bad assumption. User space might just be using fibers and
> > managing its own stack.
>
> Do you mean something like the "go" language?
>

No, I think it was a C++ application. We have some uses of fibers in
which an application does its own user-space scheduling and manages
stacks in user space. But it's basically the same class of problems
you'd get with Go, yes.

> Yes, not supported. And from the kernel perspective it still looks as if
> user-space tries to fool the kernel. I mean, if you insert a ret-probe,
> the kernel assumes that it "owns" the stack, if nothing else the kernel
> has to change the ret-address on stack.
>
> I agree, this is not good. But again, what else can the kernel do in
> this case?

Not that I'm proposing this, but the kernel could probably maintain a
lookup table keyed by the thread's stack pointer, instead of
maintaining an implicit stack (though that would probably be more
expensive). With some limits and safeguards this would probably work fine.

>
> > > Not really expected, and that is why the "TODO" comment in _unregister()
> > > was never implemented. Although the real reason is that we are lazy ;)
> >
> > Worked fine for 10+ years, which says something ;)
>
> Or maybe it doesn't, but we do not know because this code doesn't do
> uprobe_warn() ;)

sure :)

>
> Oleg.
>



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Oleg Nesterov
On 07/09, Andrii Nakryiko wrote:
>
> On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov  wrote:
> >
> > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > > detects "improper" stack pointer progression, for example),
> >
> > In this case we a) assume that user-space tries to fool the kernel and
>
> Well, it's a bad assumption. User space might just be using fibers and
> managing its own stack.

Do you mean something like the "go" language?

Yes, not supported. And from the kernel perspective it still looks as if
user-space tries to fool the kernel. I mean, if you insert a ret-probe,
the kernel assumes that it "owns" the stack, if nothing else the kernel
has to change the ret-address on stack.

I agree, this is not good. But again, what else can the kernel do in
this case?

> > Not really expected, and that is why the "TODO" comment in _unregister()
> > was never implemented. Although the real reason is that we are lazy ;)
>
> Worked fine for 10+ years, which says something ;)

Or maybe it doesn't, but we do not know because this code doesn't do
uprobe_warn() ;)

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Andrii Nakryiko
On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov  wrote:
>
> On 07/08, Andrii Nakryiko wrote:
> >
> > On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov  wrote:
> > >
> > > And I forgot to mention...
> > >
> > > In any case __uprobe_unregister() can't ignore the error code from
> > > register_for_each_vma(). If it fails to restore the original insn,
> > > we should not remove this uprobe from uprobes_tree.
> > >
> > > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> > > probed application.
> >
> > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > detects "improper" stack pointer progression, for example),
>
> In this case we a) assume that user-space tries to fool the kernel and

Well, it's a bad assumption. User space might just be using fibers and
managing its own stack. Not saying SIGILL is good, but it's part of
the uprobe system regardless.

> b) the kernel can't handle this case in any case, thus uprobe_warn().
>
> > but from
> > what I gather it's not really expected to fail on unregistration given
> > we successfully registered the uprobe.
>
> Not really expected, and that is why the "TODO" comment in _unregister()
> was never implemented. Although the real reason is that we are lazy ;)

Worked fine for 10+ years, which says something ;)

>
> But register_for_each_vma(NULL) can fail. Say, simply because
> kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should.
> A lot of other reasons.
>
> > I guess it's a decision between
> > leaking memory with an uprobe stuck in the tree or killing a process due
> > to some very rare (or buggy) condition?
>
> Yes. I think in this case it is better to leak uprobe than kill the
> no longer probed task.

Ok, I think it's not hard to keep the uprobe around if
__uprobe_unregister() fails; that should be an easy addition from what
I can see.

>
> Oleg.
>



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Oleg Nesterov
On 07/08, Andrii Nakryiko wrote:
>
> On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov  wrote:
> >
> > And I forgot to mention...
> >
> > In any case __uprobe_unregister() can't ignore the error code from
> > register_for_each_vma(). If it fails to restore the original insn,
> > we should not remove this uprobe from uprobes_tree.
> >
> > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> > probed application.
>
> Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> detects "improper" stack pointer progression, for example),

In this case we a) assume that user-space tries to fool the kernel and
b) the kernel can't handle this case in any case, thus uprobe_warn().

> but from
> what I gather it's not really expected to fail on unregistration given
> we successfully registered the uprobe.

Not really expected, and that is why the "TODO" comment in _unregister()
was never implemented. Although the real reason is that we are lazy ;)

But register_for_each_vma(NULL) can fail. Say, simply because
kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should.
A lot of other reasons.

> I guess it's a decision between
> leaking memory with an uprobe stuck in the tree or killing a process due
> to some very rare (or buggy) condition?

Yes. I think in this case it is better to leak uprobe than kill the
no longer probed task.

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-08 Thread Andrii Nakryiko
On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov  wrote:
>
> And I forgot to mention...
>
> In any case __uprobe_unregister() can't ignore the error code from
> register_for_each_vma(). If it fails to restore the original insn,
> we should not remove this uprobe from uprobes_tree.
>
> Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> probed application.

Yep, that would be unfortunate (just like SIGILL sent when uretprobe
detects "improper" stack pointer progression, for example), but from
what I gather it's not really expected to fail on unregistration given
we successfully registered the uprobe. I guess it's a decision between
leaking memory with an uprobe stuck in the tree or killing a process due
to some very rare (or buggy) condition?


>
> On 07/05, Oleg Nesterov wrote:
> >
> > Tried to read this patch, but I fail to understand it. It looks
> > obviously wrong to me, see below.
> >
> > I tend to agree with the comments from Peter, but let's ignore them
> > for the moment.
> >
> > On 07/01, Andrii Nakryiko wrote:
> > >
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > > -   if (refcount_dec_and_test(&uprobe->ref)) {
> > > +   s64 v;
> > > +
> > > +   /*
> > > +* here uprobe instance is guaranteed to be alive, so we use Tasks
> > > +* Trace RCU to guarantee that uprobe won't be freed from under us, if
> > > +* we end up being a losing "destructor" inside uprobe_treelock'ed
> > > +* section double-checking uprobe->ref value below.
> > > +* Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > > +*/
> > > +   rcu_read_lock_trace();
> > > +
> > > +   v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > > +
> > > +   if (unlikely((u32)v == 0)) {
> >
> > I must have missed something, but how can this ever happen?
> >
> > Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> > that this binary is not used, so _register() doesn't install 
> > breakpoints/etc.
> >
> > IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() 
> > succeeds.
> >
> > Now suppose that uprobe_unregister() is called right after that. It does
> >
> >   uprobe = find_uprobe(inode, offset);
> >
> > this increments the counter, (u32)uprobe->ref == 2
> >
> >   __uprobe_unregister(...);
> >
> > this won't change the counter,
> >
> >   put_uprobe(uprobe);
> >
> > this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> >
> > Where should the "final" put_uprobe() come from?
> >
> > IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> >
> > Oleg.
>



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-08 Thread Andrii Nakryiko
On Fri, Jul 5, 2024 at 8:38 AM Oleg Nesterov  wrote:
>
> Tried to read this patch, but I fail to understand it. It looks
> obviously wrong to me, see below.
>
> I tend to agree with the comments from Peter, but let's ignore them
> for the moment.
>
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > - if (refcount_dec_and_test(&uprobe->ref)) {
> > + s64 v;
> > +
> > + /*
> > +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> > +  * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> > +  * section double-checking uprobe->ref value below.
> > +  * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +  */
> > + rcu_read_lock_trace();
> > +
> > + v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > + if (unlikely((u32)v == 0)) {
>
> I must have missed something, but how can this ever happen?
>
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
>
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
>
> Now suppose that uprobe_unregister() is called right after that. It does
>
> uprobe = find_uprobe(inode, offset);
>
> this increments the counter, (u32)uprobe->ref == 2
>
> __uprobe_unregister(...);
>
> this won't change the counter,
>
> put_uprobe(uprobe);
>
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
>
> Where should the "final" put_uprobe() come from?
>
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?

Argh, this is an artifact of splitting the overall change into
separate patches. The final version of uprobe_unregister() doesn't do
find_uprobe(); we just get it from the uprobe_consumer->uprobe pointer
without any tree lookup.

>
> Oleg.
>



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-07 Thread Oleg Nesterov
And I forgot to mention...

In any case __uprobe_unregister() can't ignore the error code from
register_for_each_vma(). If it fails to restore the original insn,
we should not remove this uprobe from uprobes_tree.

Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
probed application.

On 07/05, Oleg Nesterov wrote:
>
> Tried to read this patch, but I fail to understand it. It looks
> obviously wrong to me, see below.
>
> I tend to agree with the comments from Peter, but let's ignore them
> for the moment.
>
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -   if (refcount_dec_and_test(&uprobe->ref)) {
> > +   s64 v;
> > +
> > +   /*
> > +* here uprobe instance is guaranteed to be alive, so we use Tasks
> > +* Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +* we end up being a losing "destructor" inside uprobe_treelock'ed
> > +* section double-checking uprobe->ref value below.
> > +* Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +*/
> > +   rcu_read_lock_trace();
> > +
> > +   v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > +   if (unlikely((u32)v == 0)) {
>
> I must have missed something, but how can this ever happen?
>
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
>
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
>
> Now suppose that uprobe_unregister() is called right after that. It does
>
>   uprobe = find_uprobe(inode, offset);
>
> this increments the counter, (u32)uprobe->ref == 2
>
>   __uprobe_unregister(...);
>
> this won't change the counter,
>
>   put_uprobe(uprobe);
>
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
>
> Where should the "final" put_uprobe() come from?
>
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
>
> Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-06 Thread Jiri Olsa
On Sat, Jul 06, 2024 at 07:00:34PM +0200, Jiri Olsa wrote:
> On Fri, Jul 05, 2024 at 05:37:05PM +0200, Oleg Nesterov wrote:
> > Tried to read this patch, but I fail to understand it. It looks
> > obviously wrong to me, see below.
> > 
> > I tend to agree with the comments from Peter, but let's ignore them
> > for the moment.
> > 
> > On 07/01, Andrii Nakryiko wrote:
> > >
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > > - if (refcount_dec_and_test(&uprobe->ref)) {
> > > + s64 v;
> > > +
> > > + /*
> > > +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> > > +  * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > > +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> > > +  * section double-checking uprobe->ref value below.
> > > +  * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > > +  */
> > > + rcu_read_lock_trace();
> > > +
> > > + v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > > +
> > > + if (unlikely((u32)v == 0)) {
> > 
> > I must have missed something, but how can this ever happen?
> > 
> > Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> > that this binary is not used, so _register() doesn't install 
> > breakpoints/etc.
> > 
> > IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() 
> > succeeds.
> > 
> > Now suppose that uprobe_unregister() is called right after that. It does
> > 
> > uprobe = find_uprobe(inode, offset);
> > 
> > this increments the counter, (u32)uprobe->ref == 2
> > 
> > __uprobe_unregister(...);
> > 
> > this won't change the counter,
> 
> __uprobe_unregister calls delete_uprobe that calls put_uprobe ?

ugh, wrong sources.. ok, don't know ;-)

jirka

> 
> jirka
> 
> > 
> > put_uprobe(uprobe);
> > 
> > this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> > 
> > Where should the "final" put_uprobe() come from?
> > 
> > IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> > 
> > Oleg.
> > 



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-06 Thread Jiri Olsa
On Fri, Jul 05, 2024 at 05:37:05PM +0200, Oleg Nesterov wrote:
> Tried to read this patch, but I fail to understand it. It looks
> obviously wrong to me, see below.
> 
> I tend to agree with the comments from Peter, but let's ignore them
> for the moment.
> 
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -   if (refcount_dec_and_test(&uprobe->ref)) {
> > +   s64 v;
> > +
> > +   /*
> > +* here uprobe instance is guaranteed to be alive, so we use Tasks
> > +* Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +* we end up being a losing "destructor" inside uprobe_treelock'ed
> > +* section double-checking uprobe->ref value below.
> > +* Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +*/
> > +   rcu_read_lock_trace();
> > +
> > +   v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > +   if (unlikely((u32)v == 0)) {
> 
> I must have missed something, but how can this ever happen?
> 
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
> 
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
> 
> Now suppose that uprobe_unregister() is called right after that. It does
> 
>   uprobe = find_uprobe(inode, offset);
> 
> this increments the counter, (u32)uprobe->ref == 2
> 
>   __uprobe_unregister(...);
> 
> this won't change the counter,

__uprobe_unregister calls delete_uprobe that calls put_uprobe ?

jirka

> 
>   put_uprobe(uprobe);
> 
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> 
> Where should the "final" put_uprobe() come from?
> 
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> 
> Oleg.
> 



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-05 Thread Oleg Nesterov
Tried to read this patch, but I fail to understand it. It looks
obviously wrong to me, see below.

I tend to agree with the comments from Peter, but let's ignore them
for the moment.

On 07/01, Andrii Nakryiko wrote:
>
>  static void put_uprobe(struct uprobe *uprobe)
>  {
> - if (refcount_dec_and_test(&uprobe->ref)) {
> + s64 v;
> +
> + /*
> +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> +  * Trace RCU to guarantee that uprobe won't be freed from under us, if
> +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> +  * section double-checking uprobe->ref value below.
> +  * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> +  */
> + rcu_read_lock_trace();
> +
> + v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> +
> + if (unlikely((u32)v == 0)) {

I must have missed something, but how can this ever happen?

Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
that this binary is not used, so _register() doesn't install breakpoints/etc.

IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.

Now suppose that uprobe_unregister() is called right after that. It does

uprobe = find_uprobe(inode, offset);

this increments the counter, (u32)uprobe->ref == 2

__uprobe_unregister(...);

this won't change the counter,

put_uprobe(uprobe);

this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.

Where should the "final" put_uprobe() come from?

IIUC, this patch lacks another put_uprobe() after consumer_del(), no?

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-04 Thread Google
On Thu, 4 Jul 2024 10:45:24 +0200
Peter Zijlstra  wrote:

> On Thu, Jul 04, 2024 at 10:03:48AM +0200, Peter Zijlstra wrote:
> 
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index c98e3b3386ba..4aafb4485be7 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe 
> > *tp)
> > if (!tu->inode)
> > continue;
> >  
> > -   uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> > +   uprobe_unregister(tu->inode, tu->offset, &tu->consumer,
> > + list_is_last(trace_probe_probe_list(tp), 
> > &tu->tp.list) ? 0 : URF_NO_SYNC);
> > tu->inode = NULL;
> > }
> >  }
> 
> 
> Hmm, that continue clause might ruin things. Still easy enough to add
> uprobe_unregister_sync() and simply always pass URF_NO_SYNC.
> 
> I really don't see why we should make this more complicated than it
> needs to be.
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 354cab634341..681741a51df3 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -115,7 +115,9 @@ extern int uprobe_write_opcode(struct arch_uprobe 
> *auprobe, struct mm_struct *mm
>  extern int uprobe_register(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc);
>  extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
> ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc);
> +#define URF_NO_SYNC  0x01
> +extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, unsigned int flags);
> +extern void uprobe_unregister_sync(void);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
>  extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
> unsigned long end);
>  extern void uprobe_start_dup_mmap(void);
> @@ -165,7 +167,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, boo
>   return -ENOSYS;
>  }
>  static inline void
> -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
> *uc)
> +uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
> *uc, unsigned int flags)

nit: IMHO, I would like to see uprobe_unregister_nosync() variant instead of
adding flags.

Thank you,

>  {
>  }
>  static inline int uprobe_mmap(struct vm_area_struct *vma)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 0b7574a54093..d09f7b942076 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
> uprobe_consumer *uc)
>   * @offset: offset from the start of the file.
>   * @uc: identify which probe if multiple probes are colocated.
>   */
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc)
> +void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, unsigned int flags)
>  {
>   scoped_guard (srcu, &uprobes_srcu) {
>   struct uprobe *uprobe = find_uprobe(inode, offset);
> @@ -1157,10 +1157,17 @@ void uprobe_unregister(struct inode *inode, loff_t 
> offset, struct uprobe_consume
>   mutex_unlock(&uprobe->register_mutex);
>   }
>  
> - synchronize_srcu(&uprobes_srcu); // XXX amortize / batch
> + if (!(flags & URF_NO_SYNC))
> + synchronize_srcu(&uprobes_srcu);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_unregister);
>  
> +void uprobe_unregister_sync(void)
> +{
> + synchronize_srcu(&uprobes_srcu);
> +}
> +EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
> +
>  /*
>   * __uprobe_register - register a probe
>   * @inode: the file in which the probe has to be placed.
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d1daeab1bbc1..1f6adabbb1e7 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3181,9 +3181,10 @@ static void bpf_uprobe_unregister(struct path *path, 
> struct bpf_uprobe *uprobes,
>   u32 i;
>  
>   for (i = 0; i < cnt; i++) {
> - uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
> -   &uprobes[i].consumer);
> + uprobe_unregister(d_real_inode(path->dentry), 
> uprobes[i].offset, &uprobes[i].consumer, URF_NO_SYNC);
>   }
> + if (cnt > 0)
> + uprobe_unregister_sync();
>  }
>  
>  static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..6b64470a1c5c 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, 
> filter_func_t filter)
>  static void __probe_event_disable(struct 

Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-04 Thread Peter Zijlstra
On Thu, Jul 04, 2024 at 10:03:48AM +0200, Peter Zijlstra wrote:

> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..4aafb4485be7 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe 
> *tp)
>   if (!tu->inode)
>   continue;
>  
> - uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> + uprobe_unregister(tu->inode, tu->offset, &tu->consumer,
> +   list_is_last(trace_probe_probe_list(tp), 
> &tu->tp.list) ? 0 : URF_NO_SYNC);
>   tu->inode = NULL;
>   }
>  }


Hmm, that continue clause might ruin things. Still easy enough to add
uprobe_unregister_sync() and simply always pass URF_NO_SYNC.

I really don't see why we should make this more complicated than it
needs to be.

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 354cab634341..681741a51df3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -115,7 +115,9 @@ extern int uprobe_write_opcode(struct arch_uprobe *auprobe, 
struct mm_struct *mm
 extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+#define URF_NO_SYNC0x01
+extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, unsigned int flags);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -165,7 +167,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc, unsigned int flags)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0b7574a54093..d09f7b942076 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, unsigned int flags)
 {
scoped_guard (srcu, &uprobes_srcu) {
struct uprobe *uprobe = find_uprobe(inode, offset);
@@ -1157,10 +1157,17 @@ void uprobe_unregister(struct inode *inode, loff_t 
offset, struct uprobe_consume
mutex_unlock(&uprobe->register_mutex);
}
 
-   synchronize_srcu(&uprobes_srcu); // XXX amortize / batch
+   if (!(flags & URF_NO_SYNC))
+   synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
+void uprobe_unregister_sync(void)
+{
+   synchronize_srcu(&uprobes_srcu);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
+
 /*
  * __uprobe_register - register a probe
  * @inode: the file in which the probe has to be placed.
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..1f6adabbb1e7 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3181,9 +3181,10 @@ static void bpf_uprobe_unregister(struct path *path, 
struct bpf_uprobe *uprobes,
u32 i;
 
for (i = 0; i < cnt; i++) {
-   uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
- &uprobes[i].consumer);
+   uprobe_unregister(d_real_inode(path->dentry), 
uprobes[i].offset, &uprobes[i].consumer, URF_NO_SYNC);
}
+   if (cnt > 0)
+   uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..6b64470a1c5c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, 
filter_func_t filter)
 static void __probe_event_disable(struct trace_probe *tp)
 {
struct trace_uprobe *tu;
+   bool sync = false;
 
tu = container_of(tp, struct trace_uprobe, tp);
WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct trace_probe *tp)
if (!tu->inode)
continue;
 
-

Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-04 Thread Peter Zijlstra
On Wed, Jul 03, 2024 at 01:47:23PM -0700, Andrii Nakryiko wrote:

> > When I cobble all that together (it really shouldn't be one patch, but
> > you get the idea I hope) it looks a little something like the below.
> >
> > I *think* it should work, but perhaps I've missed something?
> 
> Well, at the very least you missed that we can't delay SRCU (or any
> other sleepable RCU flavor) potentially indefinitely for uretprobes,
> which are completely under user space control.

Sure, but that's fixable. You can work around that by having (u)tasks
with a non-empty return_instance list carry a timer. When/if that timer
fires, it goes and converts the SRCU references to actual references.

Not so very hard to do, but very much not needed for a PoC.



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-04 Thread Peter Zijlstra
On Wed, Jul 03, 2024 at 01:47:23PM -0700, Andrii Nakryiko wrote:
> Your innocuous "// XXX amortize / batch" comment below is *the major
> point of this patch set*. Try to appreciate that. It's not a small
> todo, it took this entire patch set to allow for that.

Tada!

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 354cab634341..c9c9ec87ab9a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -115,7 +115,8 @@ extern int uprobe_write_opcode(struct arch_uprobe *auprobe, 
struct mm_struct *mm
 extern int uprobe_register(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc);
+#define URF_NO_SYNC0x01
+extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, unsigned int flags);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -165,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, boo
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc)
+uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
*uc, unsigned int flags)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0b7574a54093..1f4151c518ed 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
uprobe_consumer *uc)
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, loff_t offset, struct 
uprobe_consumer *uc, unsigned int flags)
 {
scoped_guard (srcu, &uprobes_srcu) {
struct uprobe *uprobe = find_uprobe(inode, offset);
@@ -1157,7 +1157,8 @@ void uprobe_unregister(struct inode *inode, loff_t 
offset, struct uprobe_consume
mutex_unlock(>register_mutex);
}
 
-   synchronize_srcu(_srcu); // XXX amortize / batch
+   if (!(flags & URF_NO_SYNC))
+   synchronize_srcu(_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..950b5241244a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3182,7 +3182,7 @@ static void bpf_uprobe_unregister(struct path *path, 
struct bpf_uprobe *uprobes,
 
for (i = 0; i < cnt; i++) {
uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
- [i].consumer);
+ [i].consumer, i != cnt-1 ? 
URF_NO_SYNC : 0);
}
 }
 
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..4aafb4485be7 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe *tp)
if (!tu->inode)
continue;
 
-   uprobe_unregister(tu->inode, tu->offset, >consumer);
+   uprobe_unregister(tu->inode, tu->offset, >consumer,
+ list_is_last(trace_probe_probe_list(tp), 
>tp.list) ? 0 : URF_NO_SYNC);
tu->inode = NULL;
}
 }



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 6:36 AM Peter Zijlstra  wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > One, attempted initially, way to solve this is through using
> > atomic_inc_not_zero() approach, turning get_uprobe() into
> > try_get_uprobe(),
>
> This is the canonical thing to do. Everybody does this.

Sure, and I provided arguments why I don't do it. Can you provide your
counter argument, please? "Everybody does this." is hardly one.

>
> > which can fail to bump refcount if uprobe is already
> > destined to be destroyed. This, unfortunately, turns out to be rather
> > expensive due to underlying cmpxchg() operation in
> > atomic_inc_not_zero() and scales rather poorly with increased amount of
> > parallel threads triggering uprobes.
>
> Different archs different trade-offs. You'll not see this on LL/SC archs
> for example.

Clearly x86-64 is the highest priority target, and I've shown that it
benefits from atomic addition vs cmpxchg. Sure, other architecture
might benefit less. But will atomic addition be slower than cmpxchg on
any other architecture?

It's clearly beneficial for x86-64 and not regressing other
architectures, right?

>
> > Furthermore, CPU profiling showed the following overall CPU usage:
> >   - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
> > atomic_inc_not_zero approach;
> >   - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
> > atomic_add_and_return approach implemented by this patch.
>
> I think those numbers suggest trying to not have a refcount in the first
> place. Both are pretty terrible, yes one is less terrible than the
> other, but still terrible.

Good, we are on the same page here, yes.

>
> Specifically, I'm thinking it is the refcounting in handle_swbp() that
> is actually the problem, all the other stuff is noise.
>
> So if you have SRCU protected consumers, what is the reason for still
> having a refcount in handle_swbp() ? Simply have the whole of it inside
> a single SRCU critical section, then all consumers you find get a hit.

That's the goal (except SRCU vs RCU Tasks Trace) and that's the next
step. I didn't want to add all that complexity to an already pretty
big and complex patch set. I do believe that batch APIs are the first
necessary step.

Your innocuous "// XXX amortize / batch" comment below is *the major
point of this patch set*. Try to appreciate that. It's not a small
todo, it took this entire patch set to allow for that.

Now, if you are so against percpu RW semaphore, I can just drop the
last patch for now, but the rest is necessary regardless.

Note how I didn't really touch locking *at all*. uprobes_treelock used
to be a spinlock, which we 1-to-1 replaced with rw_spinlock. And now
I'm replacing it, again 1-to-1, with percpu RW semaphore. Specifically
not to entangle batching with the locking schema changes.

>
> Hmm, return probes are a pain, they require the uprobe to stay extant
> between handle_swbp() and handle_trampoline(). I'm thinking we can do
> that with SRCU as well.

I don't think we can, and I'm surprised you don't think that way.

uretprobe might never be triggered for various reasons. Either user
space never returns from the function, or uretprobe was never
installed in the right place (and so uprobe part will trigger, but
there will never be returning probe triggering). I don't think it's
acceptable to delay whole global uprobes SRCU unlocking indefinitely
and leave that to user space code's will.

So, with that, I think refcounting *for return probe* will have to
stay. And will have to be fast.

>
> When I cobble all that together (it really shouldn't be one patch, but
> you get the idea I hope) it looks a little something like the below.
>
> I *think* it should work, but perhaps I've missed something?

Well, at the very least you missed that we can't delay SRCU (or any
other sleepable RCU flavor) potentially indefinitely for uretprobes,
which are completely under user space control.

>
> TL;DR replace treelock with seqcount+SRCU
>   replace register_rwsem with SRCU
>   replace handle_swbp() refcount with SRCU
>   replace return_instance refcount with a second SRCU

So, as I mentioned. I haven't considered seqcount just yet, and I will
think that through. This patch set was meant to add batched API to
unblock all of the above you describe. Percpu RW semaphore switch was
a no-brainer with batched APIs, so I went for that to get more
performance with zero added effort and complexity. If you hate that
part, I can drop it. But batching APIs are unavoidable, no matter what
specific RCU-protected locking schema we end up doing.

Can we agree on that and move this forward, please?

>
> Paul, I had to do something vile with SRCU. The basic problem is that we
> want to keep a SRCU critical section across fork(), which leads to both
> parent and child doing srcu_read_unlock(, idx). As such, I need an
> extra increment on the @idx ssp counter to even 

Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-03 Thread Peter Zijlstra
On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> One, attempted initially, way to solve this is through using
> atomic_inc_not_zero() approach, turning get_uprobe() into
> try_get_uprobe(),

This is the canonical thing to do. Everybody does this.

> which can fail to bump refcount if uprobe is already
> destined to be destroyed. This, unfortunately, turns out to be rather
> expensive due to underlying cmpxchg() operation in
> atomic_inc_not_zero() and scales rather poorly with increased amount of
> parallel threads triggering uprobes.

Different archs different trade-offs. You'll not see this on LL/SC archs
for example.

> Furthermore, CPU profiling showed the following overall CPU usage:
>   - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
> atomic_inc_not_zero approach;
>   - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
> atomic_add_and_return approach implemented by this patch.

I think those numbers suggest trying to not have a refcount in the first
place. Both are pretty terrible, yes one is less terrible than the
other, but still terrible.

Specifically, I'm thinking it is the refcounting in handle_swbp() that
is actually the problem, all the other stuff is noise. 

So if you have SRCU protected consumers, what is the reason for still
having a refcount in handle_swbp() ? Simply have the whole of it inside
a single SRCU critical section, then all consumers you find get a hit.

Hmm, return probes are a pain, they require the uprobe to stay extant
between handle_swbp() and handle_trampoline(). I'm thinking we can do
that with SRCU as well.

When I cobble all that together (it really shouldn't be one patch, but
you get the idea I hope) it looks a little something like the below.

I *think* it should work, but perhaps I've missed something?

TL;DR replace treelock with seqcount+SRCU
  replace register_rwsem with SRCU
  replace handle_swbp() refcount with SRCU
  replace return_instance refcount with a second SRCU

Paul, I had to do something vile with SRCU. The basic problem is that we
want to keep a SRCU critical section across fork(), which leads to both
parent and child doing srcu_read_unlock(, idx). As such, I need an
extra increment on the @idx ssp counter to even things out, see
__srcu_read_clone_lock().

---
 include/linux/rbtree.h  |  45 +
 include/linux/srcu.h|   2 +
 include/linux/uprobes.h |   2 +
 kernel/events/uprobes.c | 166 +++-
 kernel/rcu/srcutree.c   |   5 ++
 5 files changed, 161 insertions(+), 59 deletions(-)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index f7edca369eda..9847fa58a287 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -244,6 +244,31 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
return NULL;
 }
 
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+   int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+   struct rb_node **link = &tree->rb_node;
+   struct rb_node *parent = NULL;
+   int c;
+
+   while (*link) {
+   parent = *link;
+   c = cmp(node, parent);
+
+   if (c < 0)
+   link = &parent->rb_left;
+   else if (c > 0)
+   link = &parent->rb_right;
+   else
+   return parent;
+   }
+
+   rb_link_node_rcu(node, parent, link);
+   rb_insert_color(node, tree);
+   return NULL;
+}
+
 /**
  * rb_find() - find @key in tree @tree
  * @key: key to match
@@ -272,6 +297,26 @@ rb_find(const void *key, const struct rb_root *tree,
return NULL;
 }
 
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+   int (*cmp)(const void *key, const struct rb_node *))
+{
+   struct rb_node *node = tree->rb_node;
+
+   while (node) {
+   int c = cmp(key, node);
+
+   if (c < 0)
+   node = rcu_dereference_raw(node->rb_left);
+   else if (c > 0)
+   node = rcu_dereference_raw(node->rb_right);
+   else
+   return node;
+   }
+
+   return NULL;
+}
+
 /**
  * rb_find_first() - find the first @key in @tree
  * @key: key to match
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 236610e4a8fa..9b14acecbb9d 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -55,7 +55,9 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
void (*func)(struct rcu_head *head));
 void cleanup_srcu_struct(struct srcu_struct *ssp);
 int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_read_clone_lock(struct srcu_struct *ssp, int idx);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
+
 void synchronize_srcu(struct srcu_struct *ssp);
 unsigned long 

Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-02 Thread Andrii Nakryiko
On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra  wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 23449a8c5e7e..560cf1ca512a 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >
> >  struct uprobe {
> >   struct rb_node  rb_node;/* node in the rb tree */
> > - refcount_t  ref;
> > + atomic64_t  ref;/* see UPROBE_REFCNT_GET 
> > below */
> >   struct rw_semaphore register_rwsem;
> >   struct rw_semaphore consumer_rwsem;
> > + struct rcu_head rcu;
> >   struct list_headpending_list;
> >   struct uprobe_consumer  *consumers;
> >   struct inode*inode; /* Also hold a ref to inode */
> > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct 
> > mm_struct *mm, unsigned long v
> >   *(uprobe_opcode_t *)>insn);
> >  }
> >
> > -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > +/*
> > + * Uprobe's 64-bit refcount is actually two independent counters 
> > co-located in
> > + * a single u64 value:
> > + *   - lower 32 bits are just a normal refcount which is incremented and
> > + *   decremented on get and put, respectively, just like normal refcount
> > + *   would;
> > + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> > + *   incremented by one, no matter whether get or put operation is done.
> > + *
> > + * This upper counter is meant to distinguish between:
> > + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with 
> > "destruction",
> > + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 
> > refcnt
> > + *   sequence, also proceeding to "destruction".
> > + *
> > + * In both cases refcount drops to zero, but in one case it will have 
> > epoch N,
> > + * while the second drop to zero will have a different epoch N + 2, 
> > allowing
> > + * first destructor to bail out because epoch changed between refcount 
> > going
> > + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> > + * 64-bit refcount is double-checked, see put_uprobe() for details).
> > + *
> > + * Lower 32-bit counter is not meant to overflow, while it's expected
>
> So refcount_t very explicitly handles both overflow and underflow and
> screams bloody murder if they happen. Your thing does not..
>

Correct, because I considered it practically impossible to overflow
this refcount. The main source of refcounts is uretprobes that are
holding uprobe references. We limit the depth of supported recursion
to 64, so you'd need 30+ million threads all hitting the same
uprobe/uretprobe to overflow this. I guess in theory it could happen
(not sure if we have a limit on the total number of threads in the
system and whether it can be bumped to over 30 million), but it just
seemed out of the realm of practical possibility.

Having said that, I can add similar checks that refcount_t does in
refcount_add and do what refcount_warn_saturate does as well.

> > + * that upper 32-bit counter will overflow occasionally. Note, though, 
> > that we
> > + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit 
> > counter,
> > + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> > + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> > + * epoch effectively a 31-bit counter with highest bit used as a flag to
> > + * perform a fix-up. This ensures epoch and refcnt parts do not 
> > "interfere".
> > + *
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part 
> > and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000FFFFFFFFLL */
> > +
> > +/*
> > + * Caller has to make sure that:
> > + *   a) either uprobe's refcnt is positive before this call;
> > + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> > + *  preventing uprobe's destructor from removing it from uprobes_tree.
> > + *
> > + * In the latter case, uprobe's destructor will "resurrect" uprobe 
> > instance if
> > + * it detects that its refcount went back to being positive again 
> > inbetween it
> > + * dropping to zero at some point and (potentially delayed) destructor
> > + * callback actually running.
> > + */
> > +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> >  {
> > - refcount_inc(&uprobe->ref);
> > + s64 v;
> > +
> > + v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
>
> Distinct lack of u32 overflow testing here..
>
> > +
> > + /*
> > +  * If the 

Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-02 Thread Peter Zijlstra
On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23449a8c5e7e..560cf1ca512a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  
>  struct uprobe {
>   struct rb_node  rb_node;/* node in the rb tree */
> - refcount_t  ref;
> + atomic64_t  ref;/* see UPROBE_REFCNT_GET below 
> */
>   struct rw_semaphore register_rwsem;
>   struct rw_semaphore consumer_rwsem;
> + struct rcu_head rcu;
>   struct list_headpending_list;
>   struct uprobe_consumer  *consumers;
>   struct inode*inode; /* Also hold a ref to inode */
> @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct 
> mm_struct *mm, unsigned long v
>   *(uprobe_opcode_t *)>insn);
>  }
>  
> -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> +/*
> + * Uprobe's 64-bit refcount is actually two independent counters co-located 
> in
> + * a single u64 value:
> + *   - lower 32 bits are just a normal refcount which is incremented and
> + *   decremented on get and put, respectively, just like normal refcount
> + *   would;
> + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> + *   incremented by one, no matter whether get or put operation is done.
> + *
> + * This upper counter is meant to distinguish between:
> + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> + *   sequence, also proceeding to "destruction".
> + *
> + * In both cases refcount drops to zero, but in one case it will have epoch 
> N,
> + * while the second drop to zero will have a different epoch N + 2, allowing
> + * first destructor to bail out because epoch changed between refcount going
> + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> + * 64-bit refcount is double-checked, see put_uprobe() for details).
> + *
> + * Lower 32-bit counter is not meant to overflow, while it's expected

So refcount_t very explicitly handles both overflow and underflow and
screams bloody murder if they happen. Your thing does not.. 

> + * that upper 32-bit counter will overflow occasionally. Note, though, that 
> we
> + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit 
> counter,
> + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> + * epoch effectively a 31-bit counter with highest bit used as a flag to
> + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> + *
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000FFFFFFFFLL */
> +
> +/*
> + * Caller has to make sure that:
> + *   a) either uprobe's refcnt is positive before this call;
> + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> + *  preventing uprobe's destructor from removing it from uprobes_tree.
> + *
> + * In the latter case, uprobe's destructor will "resurrect" uprobe instance 
> if
> + * it detects that its refcount went back to being positive again inbetween 
> it
> + * dropping to zero at some point and (potentially delayed) destructor
> + * callback actually running.
> + */
> +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
>  {
> - refcount_inc(&uprobe->ref);
> + s64 v;
> +
> + v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);

Distinct lack of u32 overflow testing here..

> +
> + /*
> +  * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> +  * we don't retry because there is another CPU that just managed to
> +  * update refcnt and will attempt the same "fix up". Eventually one of
> +  * them will succeed to clear the highest bit.
> +  */
> + if (unlikely(v < 0))
> + (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
>   return uprobe;
>  }

>  static void put_uprobe(struct uprobe *uprobe)
>  {
> - if (refcount_dec_and_test(&uprobe->ref)) {
> + s64 v;
> +
> + /*
> +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> +  * Trace RCU to guarantee that uprobe won't be freed from under us, if

What's wrong with normal RCU?

> +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> +  * section double-checking uprobe->ref value below.
> +  * Note call_rcu_tasks_trace() + uprobe_free_rcu 

[PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-01 Thread Andrii Nakryiko
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
handler code, temporary refcount bump is done to keep uprobe from
being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
take another refcount to keep uprobe alive until user space code
returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have an extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, it removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of unfortunate race with uprobe_register, necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead,
uprobe_consumer is assuming this extra refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), whose find_active_uprobe()->find_uprobe() lookup
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). This is because
find_active_uprobe() bumps refcount without knowing for sure that
uprobe's refcount is already positive (and it has to be this way, there
is no way around that setup).

One, attempted initially, way to solve this is through using
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump refcount if uprobe is already
destined to be destroyed. This, unfortunately, turns out to be rather
expensive due to underlying cmpxchg() operation in
atomic_inc_not_zero() and scales rather poorly with increased amount of
parallel threads triggering uprobes.

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situation, we augment
refcount with epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET