unregister APIs and per-CPU RW semaphore

Paul E. McKenney Tue, 02 Jul 2024 16:57:00 -0700

On Tue, Jul 02, 2024 at 09:18:57PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
> 
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe 
> > > *uprobe)
> > >         return uprobe;
> > >  }
> > >
> > > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > > +{
> > > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > +       kfree(uprobe);
> > > +}
> > > +
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > >         if (refcount_dec_and_test(&uprobe->ref)) {
> > > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> > 
> > right above this we have roughly this:
> > 
> > percpu_down_write(&uprobes_treelock);
> > 
> > /* refcount check */
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > 
> > percpu_up_write(&uprobes_treelock);
> > 
> > 
> > This writer lock is necessary for modification of the RB tree. And I
> > was under impression that I shouldn't be doing
> > percpu_(down|up)_write() inside the normal
> > rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> > might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> > clarify.
> 
> preemptible RCU or SRCU would work.


I agree that SRCU would work from a functional viewpoint.  No so for
preemptible RCU, which permits preemption (and on -rt, blocking for
spinlocks), it does not permit full-up blocking, and for good reason.

> > But actually what's wrong with RCU Tasks Trace flavor? 
> 
> Paul, isn't this the RCU flavour you created to deal with
> !rcu_is_watching()? The flavour that never should have been created in
> favour of just cleaning up the mess instead of making more.

My guess is that you are instead thinking of RCU Tasks Rude, which can
be eliminated once all architectures get their entry/exit/deep-idle
functions either inlined or marked noinstr.

> > I will
> > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > and to protect uprobe->consumers iteration and uc->handler() calls,
> > which could be sleepable, so would need rcu_read_lock_trace().
> 
> I don't think you need trace-rcu for that. SRCU would do nicely I think.

>From a functional viewpoint, agreed.

However, in the past, the memory-barrier and array-indexing overhead
of SRCU has made it a no-go for lightweight probes into fastpath code.
And these cases were what motivated RCU Tasks Trace (as opposed to RCU
Tasks Rude).

The other rule for RCU Tasks Trace is that although readers are permitted
to block, this blocking can be for no longer than a major page fault.
If you need longer-term blocking, then you should instead use SRCU.

                                                        Thanx, Paul

> > >                 mutex_lock(&delayed_uprobe_lock);
> > >                 delayed_uprobe_remove(uprobe, NULL);
> > >                 mutex_unlock(&delayed_uprobe_lock);
> > > -               kfree(uprobe);
> > > +
> > > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> > >         }
> > >  }
> > >
> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode 
> > > *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >         struct uprobe *uprobe;
> > > +       unsigned seq;
> > >
> > > -       read_lock(&uprobes_treelock);
> > > -       uprobe = __find_uprobe(inode, offset);
> > > -       read_unlock(&uprobes_treelock);
> > > +       guard(rcu)();
> > >
> > > -       return uprobe;
> > > +       do {
> > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > +               uprobes = __find_uprobe(inode, offset);
> > > +               if (uprobes) {
> > > +                       /*
> > > +                        * Lockless RB-tree lookups are prone to 
> > > false-negatives.
> > > +                        * If they find something, it's good. If they do 
> > > not find,
> > > +                        * it needs to be validated.
> > > +                        */
> > > +                       return uprobes;
> > > +               }
> > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +       /* Really didn't find anything. */
> > > +       return NULL;
> > >  }
> > 
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
> 
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.
> 
> The patches you propose are quite complex, this alternative not so much.
> 
> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
> 
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.

Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

Reply via email to