Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Google
On Tue, 2 Jul 2024 12:53:20 -0400
Steven Rostedt  wrote:

> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.  
> > 
> > No, because the current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or a list of addresses (for uprobes).
> > Um, that may be another good idea.
> 
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Sorry for misleading you; I meant the current "dynamic_events" interface does
not support wildcards for the probed locations.
And I agree that we can update it to support something like

 p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3

(note: the kernel does not read the symbols in the user binary)
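
For illustration, a minimal user-space sketch of how such a definition might be
registered, assuming the extended multi-address syntax above were accepted by
the dynamic_events interface (current kernels will reject this string; the
offsets and binary path are placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Proposed (not yet supported) multi-address uprobe definition. */
	const char *cmd =
		"p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3\n";
	int fd = open("/sys/kernel/tracing/dynamic_events", O_WRONLY);

	if (fd < 0) {
		perror("open dynamic_events");
		return 1;
	}
	if (write(fd, cmd, strlen(cmd)) < 0)
		perror("write (expected to fail until such syntax is supported)");
	close(fd);
	return 0;
}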

Thank you,


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Andrii Nakryiko
On Tue, Jul 2, 2024 at 9:53 AM Steven Rostedt  wrote:
>
> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.
> >
> > No, because the current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or a list of addresses (for uprobes).
> > Um, that may be another good idea.
>
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Nice, I'd be happy to adjust the batch API to work for that use case as
well (when we get there).

>
> -- Steve



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-02 Thread Andrii Nakryiko
On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra  wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 23449a8c5e7e..560cf1ca512a 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >
> >  struct uprobe {
> >   struct rb_node  rb_node;/* node in the rb tree */
> > - refcount_t  ref;
> > + atomic64_t  ref;/* see UPROBE_REFCNT_GET below */
> >   struct rw_semaphore register_rwsem;
> >   struct rw_semaphore consumer_rwsem;
> > + struct rcu_head rcu;
> >   struct list_headpending_list;
> >   struct uprobe_consumer  *consumers;
> >   struct inode*inode; /* Also hold a ref to inode */
> > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >   *(uprobe_opcode_t *)&auprobe->insn);
> >  }
> >
> > -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > +/*
> > + * Uprobe's 64-bit refcount is actually two independent counters co-located in
> > + * a single u64 value:
> > + *   - lower 32 bits are just a normal refcount, which is incremented and
> > + *   decremented on get and put, respectively, just like a normal refcount
> > + *   would;
> > + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> > + *   incremented by one, no matter whether get or put operation is done.
> > + *
> > + * This upper counter is meant to distinguish between:
> > + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> > + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> > + *   sequence, also proceeding to "destruction".
> > + *
> > + * In both cases refcount drops to zero, but in one case it will have epoch N,
> > + * while the second drop to zero will have a different epoch N + 2, allowing
> > + * first destructor to bail out because epoch changed between refcount going
> > + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> > + * 64-bit refcount is double-checked, see put_uprobe() for details).
> > + *
> > + * Lower 32-bit counter is not meant to overflow, while it's expected
>
> So refcount_t very explicitly handles both overflow and underflow and
> screams bloody murder if they happen. Your thing does not..
>

Correct, because I considered it practically impossible to overflow
this refcount. The main source of refcounts is uretprobes that are
holding uprobe references. We limit the depth of supported recursion
to 64, so you'd need 30+ million threads all hitting the same
uprobe/uretprobe to overflow this. I guess in theory it could happen
(not sure if we have some limit on the total number of threads in the
system and whether it can be bumped to over 30 million), but it just
seemed out of the realm of practical possibility.

Having said that, I can add similar checks that refcount_t does in
refcount_add and do what refcount_warn_saturate does as well.
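
To illustrate the scheme, here is a minimal user-space sketch (not the kernel
code; the helper names are made up) showing how the two 32-bit halves move
under GET/PUT and where a refcount_warn_saturate()-style check could hook in:

#include <stdint.h>
#include <stdio.h>

#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL)	/* refcnt += 1, epoch += 1 */
#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL)	/* refcnt -= 1, epoch += 1 */

static uint32_t refcnt_part(int64_t v) { return (uint32_t)((uint64_t)v & 0xffffffffULL); }
static uint32_t epoch_part(int64_t v)  { return (uint32_t)((uint64_t)v >> 32); }

int main(void)
{
	int64_t ref = 1;		/* freshly created uprobe: epoch 0, refcnt 1 */

	ref += UPROBE_REFCNT_GET;	/* get: epoch 1, refcnt 2 */
	ref += UPROBE_REFCNT_PUT;	/* put: epoch 2, refcnt 1 */
	ref += UPROBE_REFCNT_PUT;	/* put: epoch 3, refcnt 0 -> destruction path */
	printf("epoch=%u refcnt=%u\n", epoch_part(ref), refcnt_part(ref));

	/*
	 * Illustrative saturation check in the spirit of refcount_warn_saturate():
	 * if the low half ever got close to UINT32_MAX, the next GET would carry
	 * into the epoch half, so a real implementation could saturate and warn
	 * here instead of silently corrupting the epoch.
	 */
	if (refcnt_part(ref) > UINT32_MAX / 2)
		fprintf(stderr, "refcnt saturated; leaking the reference\n");
	return 0;
}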

> > + * that upper 32-bit counter will overflow occasionally. Note, though, that we
> > + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
> > + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> > + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> > + * epoch effectively a 31-bit counter with highest bit used as a flag to
> > + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> > + *
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
> > +
> > +/*
> > + * Caller has to make sure that:
> > + *   a) either uprobe's refcnt is positive before this call;
> > + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> > + *  preventing uprobe's destructor from removing it from uprobes_tree.
> > + *
> > + * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> > + * it detects that its refcount went back to being positive again inbetween it
> > + * dropping to zero at some point and (potentially delayed) destructor
> > + * callback actually running.
> > + */
> > +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> >  {
> > - refcount_inc(&uprobe->ref);
> > + s64 v;
> > +
> > + v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
>
> Distinct lack of u32 overflow testing here..
>
> > +
> > + /*
> > +  * If the 

Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Steven Rostedt
On Wed, 3 Jul 2024 00:19:05 +0900
Masami Hiramatsu (Google)  wrote:

> > BTW, is this (batched register/unregister APIs) something you'd like
> > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > ones) uprobes as well? Or there is just no way to even specify a batch
> > of uprobes? Just curious if you had any plans for this.  
> 
> No, because the current tracefs dynamic event interface is not designed for
> batched registration. I think we can expand it to pass wildcard symbols
> (for kprobe and fprobe) or a list of addresses (for uprobes).
> Um, that may be another good idea.

I don't see why not. The wild cards were added to the kernel
specifically for the tracefs interface (set_ftrace_filter).

-- Steve



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Google
On Mon, 1 Jul 2024 18:34:55 -0700
Andrii Nakryiko  wrote:

> > > How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> > > contract, which works for the only user right now, BPF multi-uprobes.
> > > When it's time to add another consumer that works with a linked list,
> > > we can add another more complicated contract that would do
> > > iterator-style callbacks. This would be used by linked list users, and
> > > we can transparently implement existing uprobe_register_batch()
> > > contract on top of it by implementing a trivial iterator wrapper on
> > > top of get_uprobe_consumer(idx, ctx) approach.
> >
> > Agreed; anyway, as far as it uses an array of uprobe_consumer, it works.
> > When we need to register a list of the structures, we may be able to
> > allocate an array or introduce a new function.
> >
> 
> Cool, glad we agree. What you propose above with start + next + ctx
> seems like a way forward if we need this.
> 
> BTW, is this (batched register/unregister APIs) something you'd like
> to use from the tracefs-based (or whatever it's called, I mean non-BPF
> ones) uprobes as well? Or there is just no way to even specify a batch
> of uprobes? Just curious if you had any plans for this.

No, because the current tracefs dynamic event interface is not designed for
batched registration. I think we can expand it to pass wildcard symbols
(for kprobe and fprobe) or a list of addresses (for uprobes).
Um, that may be another good idea.

Thank you!

-- 
Masami Hiramatsu (Google) 
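
For reference, a rough sketch of the array-index contract discussed above and
a hypothetical iterator-style wrapper over it; all names and signatures here
are illustrative assumptions, not the actual API from the patch series:

#include <stddef.h>

struct uprobe_consumer;

/* Index-based contract: return the idx-th consumer, or NULL when done. */
typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);

/* Hypothetical iterator-style contract for future linked-list users. */
struct uprobe_consumer_iter {
	struct uprobe_consumer *(*start)(void *ctx);
	struct uprobe_consumer *(*next)(struct uprobe_consumer *cur, void *ctx);
	void *ctx;
};

/* The index-based contract wraps trivially into the iterator one. */
struct idx_iter_ctx {
	uprobe_consumer_fn get;
	void *ctx;
	size_t idx;
};

static struct uprobe_consumer *idx_iter_start(void *ctx)
{
	struct idx_iter_ctx *c = ctx;

	c->idx = 0;
	return c->get(c->idx, c->ctx);
}

static struct uprobe_consumer *idx_iter_next(struct uprobe_consumer *cur, void *ctx)
{
	struct idx_iter_ctx *c = ctx;

	(void)cur;
	return c->get(++c->idx, c->ctx);
}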



Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-02 Thread Hongbo Li




On 2024/7/2 21:30, Mathieu Desnoyers wrote:

On 2024-07-02 07:55, Hongbo Li wrote:



On 2024/7/2 7:49, Steven Rostedt wrote:

On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:

@@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,

  if (error)
  return error;
+    trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
+    attr->ia_valid, attr->ia_mode,
+    from_kuid(&init_user_ns, attr->ia_uid),
+    from_kgid(&init_user_ns, attr->ia_gid),
+    inode->i_size, attr->ia_size);
+


That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.


Thanks for reviewing!

Some logic, such as the kuid_t --> uid_t conversion, might be more reasonably
done in the filesystem layer. Passing the dentry and attr would require the
tracepoint to know the meaning of those structures, and perhaps the tracepoint
should not be aware of the members of these structures as much as possible.


As maintainer of the LTTng out-of-tree kernel tracer, I appreciate the
effort to decouple instrumentation from the subsystem instrumentation,
but as long as the structure sits in public headers and the global
variables used within the TP_fast_assign() logic (e.g. init_user_ns)
are export-gpl, this is enough to make it easy for tracer integration
and it keeps the tracepoint caller code footprint to a minimum.

The TRACE_EVENT definitions are specific to the subsystem anyway,
so I don't think it matters that the TRACE_EVENT() need to access
the dentry and attr structures.

So I agree with Steven's suggestion. However, just as a precision,
I suspect it will have mainly an impact on code size, but not
necessarily on icache footprint, because it will shrink the code
size within the tracepoint unlikely branch (cold instructions).

Thanks,

Mathieu

Thank you for your friendly elaboration and suggestion!
I will update this part based on your suggestion in next version.

Thanks,
Hongbo



-- Steve







Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-02 Thread Mathieu Desnoyers

On 2024-07-02 07:55, Hongbo Li wrote:



On 2024/7/2 7:49, Steven Rostedt wrote:

On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:

@@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,

  if (error)
  return error;
+    trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
+    attr->ia_valid, attr->ia_mode,
+    from_kuid(&init_user_ns, attr->ia_uid),
+    from_kgid(&init_user_ns, attr->ia_gid),
+    inode->i_size, attr->ia_size);
+


That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.


Thanks for reviewing!

Some logic, such as the kuid_t --> uid_t conversion, might be more reasonably
done in the filesystem layer. Passing the dentry and attr would require the
tracepoint to know the meaning of those structures, and perhaps the tracepoint
should not be aware of the members of these structures as much as possible.


As maintainer of the LTTng out-of-tree kernel tracer, I appreciate the
effort to decouple instrumentation from the subsystem instrumentation,
but as long as the structure sits in public headers and the global
variables used within the TP_fast_assign() logic (e.g. init_user_ns)
are export-gpl, this is enough to make it easy for tracer integration
and it keeps the tracepoint caller code footprint to a minimum.

The TRACE_EVENT definitions are specific to the subsystem anyway,
so I don't think it matters that the TRACE_EVENT() need to access
the dentry and attr structures.

So I agree with Steven's suggestion. However, just as a precision,
I suspect it will have mainly an impact on code size, but not
necessarily on icache footprint, because it will shrink the code
size within the tracepoint unlikely branch (cold instructions).

Thanks,

Mathieu



Thanks,
Hongbo



-- Steve



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
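
To make the suggested direction concrete, here is a rough sketch (not the
actual patch; the field selection below is an assumption) of a TRACE_EVENT
that takes dentry and attr and does the conversions inside TP_fast_assign(),
placed in the usual include/trace/events/*.h boilerplate:

TRACE_EVENT(hugetlbfs_setattr,

	TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

	TP_ARGS(inode, dentry, attr),

	TP_STRUCT__entry(
		__field(dev_t,		dev)
		__field(ino_t,		ino)
		__field(unsigned int,	ia_valid)
		__field(unsigned int,	mode)
		__field(uid_t,		uid)
		__field(gid_t,		gid)
		__field(loff_t,		old_size)
		__field(loff_t,		size)
		__string(name,		dentry->d_name.name)
	),

	TP_fast_assign(
		__entry->dev		= inode->i_sb->s_dev;
		__entry->ino		= inode->i_ino;
		__entry->ia_valid	= attr->ia_valid;
		__entry->mode		= attr->ia_mode;
		/* kuid_t/kgid_t -> uid_t/gid_t conversion happens here, off the call site */
		__entry->uid		= from_kuid(&init_user_ns, attr->ia_uid);
		__entry->gid		= from_kgid(&init_user_ns, attr->ia_gid);
		__entry->old_size	= inode->i_size;
		__entry->size		= attr->ia_size;
		__assign_str(name, dentry->d_name.name);
	),

	TP_printk("dev %d,%d ino %lu name %s valid 0x%x mode 0%o uid %u gid %u old_size %lld size %lld",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long)__entry->ino, __get_str(name),
		  __entry->ia_valid, __entry->mode, __entry->uid, __entry->gid,
		  (long long)__entry->old_size, (long long)__entry->size)
);

The call site would then shrink to trace_hugetlbfs_setattr(inode, dentry, attr);
(on kernels where __assign_str() takes a single argument, drop the second one).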




Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-02 Thread Hongbo Li




On 2024/7/2 7:49, Steven Rostedt wrote:

On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:


@@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
if (error)
return error;
  
+	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,

+   attr->ia_valid, attr->ia_mode,
+   from_kuid(&init_user_ns, attr->ia_uid),
+   from_kgid(&init_user_ns, attr->ia_gid),
+   inode->i_size, attr->ia_size);
+


That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.


Thanks for reviewing!

Some logic, such as the kuid_t --> uid_t conversion, might be more reasonably
done in the filesystem layer. Passing the dentry and attr would require the
tracepoint to know the meaning of those structures, and perhaps the tracepoint
should not be aware of the members of these structures as much as possible.

Thanks,
Hongbo



-- Steve





Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

2024-07-02 Thread Peter Zijlstra
On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.

Why not RCU + normal lock thing?



Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-02 Thread Peter Zijlstra
On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23449a8c5e7e..560cf1ca512a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  
>  struct uprobe {
>   struct rb_node  rb_node;/* node in the rb tree */
> - refcount_t  ref;
> + atomic64_t  ref;/* see UPROBE_REFCNT_GET below */
>   struct rw_semaphore register_rwsem;
>   struct rw_semaphore consumer_rwsem;
> + struct rcu_head rcu;
>   struct list_headpending_list;
>   struct uprobe_consumer  *consumers;
>   struct inode*inode; /* Also hold a ref to inode */
> @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>   *(uprobe_opcode_t *)&auprobe->insn);
>  }
>  
> -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> +/*
> + * Uprobe's 64-bit refcount is actually two independent counters co-located in
> + * a single u64 value:
> + *   - lower 32 bits are just a normal refcount, which is incremented and
> + *   decremented on get and put, respectively, just like a normal refcount
> + *   would;
> + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> + *   incremented by one, no matter whether get or put operation is done.
> + *
> + * This upper counter is meant to distinguish between:
> + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> + *   sequence, also proceeding to "destruction".
> + *
> + * In both cases refcount drops to zero, but in one case it will have epoch N,
> + * while the second drop to zero will have a different epoch N + 2, allowing
> + * first destructor to bail out because epoch changed between refcount going
> + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> + * 64-bit refcount is double-checked, see put_uprobe() for details).
> + *
> + * Lower 32-bit counter is not meant to overflow, while it's expected

So refcount_t very explicitly handles both overflow and underflow and
screams bloody murder if they happen. Your thing does not.. 

> + * that upper 32-bit counter will overflow occasionally. Note, though, that we
> + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
> + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> + * epoch effectively a 31-bit counter with highest bit used as a flag to
> + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> + *
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
> +
> +/*
> + * Caller has to make sure that:
> + *   a) either uprobe's refcnt is positive before this call;
> + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> + *  preventing uprobe's destructor from removing it from uprobes_tree.
> + *
> + * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> + * it detects that its refcount went back to being positive again inbetween it
> + * dropping to zero at some point and (potentially delayed) destructor
> + * callback actually running.
> + */
> +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
>  {
> - refcount_inc(&uprobe->ref);
> + s64 v;
> +
> + v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);

Distinct lack of u32 overflow testing here..

> +
> + /*
> +  * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> +  * we don't retry because there is another CPU that just managed to
> +  * update refcnt and will attempt the same "fix up". Eventually one of
> +  * them will succeed to clear the highest bit.
> +  */
> + if (unlikely(v < 0))
> + (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
>   return uprobe;
>  }

>  static void put_uprobe(struct uprobe *uprobe)
>  {
> - if (refcount_dec_and_test(&uprobe->ref)) {
> + s64 v;
> +
> + /*
> +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> +  * Trace RCU to guarantee that uprobe won't be freed from under us, if

What's wrong with normal RCU?

> +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> +  * section double-checking uprobe->ref value below.
> +  * Note call_rcu_tasks_trace() + uprobe_free_rcu