Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Tue, 2 Jul 2024 12:53:20 -0400 Steven Rostedt wrote:

> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google) wrote:
>
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.
> >
> > No, because current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> > Um, that may be another good idea.
>
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Sorry for misleading you, I meant the current "dynamic_events" interface does
not support the wildcard places. And I agree that we can update it to support
something like

  p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3

(note: the kernel does not read the symbols in the user binary)

Thank you,

--
Masami Hiramatsu (Google)
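Masami's `p:multi_uprobe` line above is only a proposed syntax, not an existing kernel interface. As an illustration of how such a spec could decompose into a batch of (path, offset) probes, here is a small hypothetical parser; all function and field names are mine:

```python
# Hypothetical parser for the proposed "p:EVENT ADDR,ADDR,...@PATH [args...]"
# dynamic_events syntax sketched in the message above. Illustration only;
# this format is a proposal, not something the kernel accepts today.
def parse_multi_uprobe(spec: str):
    """Split a proposed multi-address uprobe spec into its parts."""
    head, locs, *args = spec.split()
    kind, event = head.split(":", 1)        # e.g. "p" and "multi_uprobe"
    addrs, path = locs.split("@", 1)        # addresses before '@', binary after
    offsets = [int(a, 16) for a in addrs.split(",")]
    return {"type": kind, "event": event, "path": path,
            "offsets": offsets, "args": args}

probe = parse_multi_uprobe("p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2")
print(probe["path"], probe["offsets"])  # /bin/foo [4660, 8756, 12852]
```

A batched register API could then consume `probe["offsets"]` as one array, which is exactly the shape the uprobe_register_batch() discussion elsewhere in this thread assumes.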
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Tue, Jul 2, 2024 at 9:53 AM Steven Rostedt wrote:
>
> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google) wrote:
>
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.
> >
> > No, because current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> > Um, that may be another good idea.
>
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Nice, I'd be happy to adjust the batch API to work for that use case as well
(when we get there).

> -- Steve
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 23449a8c5e7e..560cf1ca512a 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >
> >  struct uprobe {
> >  	struct rb_node		rb_node;	/* node in the rb tree */
> > -	refcount_t		ref;
> > +	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
> >  	struct rw_semaphore	register_rwsem;
> >  	struct rw_semaphore	consumer_rwsem;
> > +	struct rcu_head		rcu;
> >  	struct list_head	pending_list;
> >  	struct uprobe_consumer	*consumers;
> >  	struct inode		*inode;		/* Also hold a ref to inode */
> > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >  			*(uprobe_opcode_t *)&auprobe->insn);
> >  }
> >
> > -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > +/*
> > + * Uprobe's 64-bit refcount is actually two independent counters co-located
> > + * in a single u64 value:
> > + *   - lower 32 bits are just a normal refcount which is incremented and
> > + *     decremented on get and put, respectively, just like a normal
> > + *     refcount would;
> > + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> > + *     incremented by one, no matter whether a get or put operation is done.
> > + *
> > + * This upper counter is meant to distinguish between:
> > + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> > + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> > + *     sequence, also proceeding to "destruction".
> > + *
> > + * In both cases refcount drops to zero, but in one case it will have epoch N,
> > + * while the second drop to zero will have a different epoch N + 2, allowing
> > + * the first destructor to bail out because the epoch changed between refcount
> > + * going to zero and put_uprobe() taking uprobes_treelock (under which the
> > + * overall 64-bit refcount is double-checked, see put_uprobe() for details).
> > + *
> > + * The lower 32-bit counter is not meant to overflow, while it's expected
>
> So refcount_t very explicitly handles both overflow and underflow and
> screams bloody murder if they happen. Your thing does not..
>

Correct, because I considered overflowing this refcount to be practically
impossible. The main source of refcounts are uretprobes that are holding
uprobe references. We limit the depth of supported recursion to 64, so you'd
need 30+ million threads all hitting the same uprobe/uretprobe to overflow
this. I guess in theory it could happen (not sure if we have some limit on
the total number of threads in the system and if it can be bumped to over
30mln), but it just seemed out of the realm of practical possibility.

Having said that, I can add checks similar to what refcount_t does in
refcount_add and do what refcount_warn_saturate does as well.

> > + * that the upper 32-bit counter will overflow occasionally. Note, though,
> > + * that we can't allow the upper 32-bit counter to "bleed over" into the
> > + * lower 32-bit counter, so whenever the epoch counter gets its highest bit
> > + * set to 1, __get_uprobe() and put_uprobe() will attempt to clear the
> > + * upper bit with cmpxchg(). This makes the epoch effectively a 31-bit
> > + * counter with the highest bit used as a flag to perform a fix-up. This
> > + * ensures the epoch and refcnt parts do not "interfere".
> > + *
> > + * The UPROBE_REFCNT_GET constant is chosen such that it will *increment
> > + * both* epoch and refcnt parts atomically with one atomic64_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* the refcnt
> > + * part and *increment* the epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL)	/* 0x100000001LL */
> > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL)	/* 0xFFFFFFFFLL */
> > +
> > +/*
> > + * Caller has to make sure that:
> > + *   a) either uprobe's refcnt is positive before this call;
> > + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> > + *      preventing uprobe's destructor from removing it from uprobes_tree.
> > + *
> > + * In the latter case, uprobe's destructor will "resurrect" the uprobe
> > + * instance if it detects that its refcount went back to being positive
> > + * again in between it dropping to zero at some point and the (potentially
> > + * delayed) destructor callback actually running.
> > + */
> > +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> >  {
> > -	refcount_inc(&uprobe->ref);
> > +	s64 v;
> > +
> > +	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
>
> Distinct lack of u32 overflow testing here..
>
> > +
> > +	/*
> > +	 * If the
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Wed, 3 Jul 2024 00:19:05 +0900
Masami Hiramatsu (Google) wrote:

> > BTW, is this (batched register/unregister APIs) something you'd like
> > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > ones) uprobes as well? Or there is just no way to even specify a batch
> > of uprobes? Just curious if you had any plans for this.
>
> No, because current tracefs dynamic event interface is not designed for
> batched registration. I think we can expand it to pass wildcard symbols
> (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> Um, that may be another good idea.

I don't see why not. The wild cards were added to the kernel
specifically for the tracefs interface (set_ftrace_filter).

-- Steve
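For readers unfamiliar with the set_ftrace_filter wildcards Steve mentions: they are glob-style patterns that select a batch of symbols at once. The sketch below uses Python's `fnmatch` purely as a stand-in for the kernel's matcher, with an invented symbol list, to illustrate the idea:

```python
# Stand-in illustration of glob-style symbol selection, as done by
# set_ftrace_filter. Python's fnmatch is used here only as an analogy
# for the kernel's internal matcher; the symbol list is invented.
from fnmatch import fnmatchcase

symbols = ["tcp_sendmsg", "tcp_recvmsg", "udp_sendmsg", "tcp_v4_connect"]

# One wildcard expression selects a whole batch of symbols.
batch = [s for s in symbols if fnmatchcase(s, "tcp_*")]
print(batch)  # ['tcp_sendmsg', 'tcp_recvmsg', 'tcp_v4_connect']
```

Extending dynamic_events with the same kind of pattern would let one write resolve to many kprobe/fprobe attachments, which is where a batched register API would pay off.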
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
On Mon, 1 Jul 2024 18:34:55 -0700
Andrii Nakryiko wrote:

> > > How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> > > contract, which works for the only user right now, BPF multi-uprobes.
> > > When it's time to add another consumer that works with a linked list,
> > > we can add another more complicated contract that would do
> > > iterator-style callbacks. This would be used by linked list users, and
> > > we can transparently implement the existing uprobe_register_batch()
> > > contract on top of it by implementing a trivial iterator wrapper on
> > > top of the get_uprobe_consumer(idx, ctx) approach.
> >
> > Agreed; anyway, as far as it uses an array of uprobe_consumer, it works.
> > When we need to register a list of the structure, we may be able to
> > allocate an array or introduce a new function.
>
> Cool, glad we agree. What you propose above with start + next + ctx
> seems like a way forward if we need this.
>
> BTW, is this (batched register/unregister APIs) something you'd like
> to use from the tracefs-based (or whatever it's called, I mean non-BPF
> ones) uprobes as well? Or there is just no way to even specify a batch
> of uprobes? Just curious if you had any plans for this.

No, because current tracefs dynamic event interface is not designed for
batched registration. I think we can expand it to pass wildcard symbols
(for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
Um, that may be another good idea.

Thank you!

--
Masami Hiramatsu (Google)
Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.
On 2024/7/2 21:30, Mathieu Desnoyers wrote:
> On 2024-07-02 07:55, Hongbo Li wrote:
> > On 2024/7/2 7:49, Steven Rostedt wrote:
> > > On Wed, 12 Jun 2024 09:11:56 +0800
> > > Hongbo Li wrote:
> > >
> > > > @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
> > > >  	if (error)
> > > >  		return error;
> > > > +	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> > > > +				attr->ia_valid, attr->ia_mode,
> > > > +				from_kuid(&init_user_ns, attr->ia_uid),
> > > > +				from_kgid(&init_user_ns, attr->ia_gid),
> > > > +				inode->i_size, attr->ia_size);
> > > > +
> > >
> > > That's a lot of parameters to pass to a tracepoint. Why not just pass
> > > the dentry and attr and do the above in the TP_fast_assign() logic?
> > > That would put less pressure on the icache for the code part.
> >
> > Thanks for reviewing!
> >
> > Some logic such as kuid_t --> uid_t might be reasonably obtained in the
> > filesystem layer. Passing the dentry and attr will let trace know the
> > meaning of the structures; perhaps the tracepoint should not be aware of
> > the members of these structures as much as possible.
>
> As maintainer of the LTTng out-of-tree kernel tracer, I appreciate the
> effort to decouple instrumentation from the subsystem instrumentation,
> but as long as the structure sits in public headers and the global
> variables used within the TP_fast_assign() logic (e.g. init_user_ns)
> are export-gpl, this is enough to make it easy for tracer integration

Thank you for your friendly elaboration and suggestion! I will update this
part based on your suggestion in the next version.

Thanks,
Hongbo

> and it keeps the tracepoint caller code footprint to a minimum.
>
> The TRACE_EVENT definitions are specific to the subsystem anyway, so I
> don't think it matters that the TRACE_EVENT() needs to access the dentry
> and attr structures. So I agree with Steven's suggestion.
>
> However, just as a precision, I suspect it will have mainly an impact on
> code size, but not necessarily on icache footprint, because it will
> shrink the code size within the tracepoint unlikely branch (cold
> instructions).
>
> Thanks,
>
> Mathieu
>
> > Thanks,
> > Hongbo
> > > -- Steve
Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.
On 2024-07-02 07:55, Hongbo Li wrote:
> On 2024/7/2 7:49, Steven Rostedt wrote:
> > On Wed, 12 Jun 2024 09:11:56 +0800
> > Hongbo Li wrote:
> >
> > > @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
> > >  	if (error)
> > >  		return error;
> > > +	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> > > +				attr->ia_valid, attr->ia_mode,
> > > +				from_kuid(&init_user_ns, attr->ia_uid),
> > > +				from_kgid(&init_user_ns, attr->ia_gid),
> > > +				inode->i_size, attr->ia_size);
> > > +
> >
> > That's a lot of parameters to pass to a tracepoint. Why not just pass
> > the dentry and attr and do the above in the TP_fast_assign() logic?
> > That would put less pressure on the icache for the code part.
>
> Thanks for reviewing!
>
> Some logic such as kuid_t --> uid_t might be reasonably obtained in the
> filesystem layer. Passing the dentry and attr will let trace know the
> meaning of the structures; perhaps the tracepoint should not be aware of
> the members of these structures as much as possible.

As maintainer of the LTTng out-of-tree kernel tracer, I appreciate the
effort to decouple instrumentation from the subsystem instrumentation, but
as long as the structure sits in public headers and the global variables
used within the TP_fast_assign() logic (e.g. init_user_ns) are export-gpl,
this is enough to make it easy for tracer integration and it keeps the
tracepoint caller code footprint to a minimum.

The TRACE_EVENT definitions are specific to the subsystem anyway, so I
don't think it matters that the TRACE_EVENT() needs to access the dentry
and attr structures. So I agree with Steven's suggestion.

However, just as a precision, I suspect it will have mainly an impact on
code size, but not necessarily on icache footprint, because it will shrink
the code size within the tracepoint unlikely branch (cold instructions).

Thanks,

Mathieu

> Thanks,
> Hongbo
> > -- Steve

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.
On 2024/7/2 7:49, Steven Rostedt wrote:
> On Wed, 12 Jun 2024 09:11:56 +0800
> Hongbo Li wrote:
>
> > @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
> >  	if (error)
> >  		return error;
> > +	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> > +				attr->ia_valid, attr->ia_mode,
> > +				from_kuid(&init_user_ns, attr->ia_uid),
> > +				from_kgid(&init_user_ns, attr->ia_gid),
> > +				inode->i_size, attr->ia_size);
> > +
>
> That's a lot of parameters to pass to a tracepoint. Why not just pass the
> dentry and attr and do the above in the TP_fast_assign() logic? That would
> put less pressure on the icache for the code part.

Thanks for reviewing!

Some logic such as kuid_t --> uid_t might be reasonably obtained in the
filesystem layer. Passing the dentry and attr will let trace know the
meaning of the structures; perhaps the tracepoint should not be aware of
the members of these structures as much as possible.

Thanks,
Hongbo

> -- Steve
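For illustration, a hedged sketch of what Steven's suggestion could look like: the tracepoint takes the raw `dentry` and `attr` pointers, and the kuid/kgid conversions move into TP_fast_assign(), which is compiled into the cold (unlikely) branch. The event name, field layout, and format string here are illustrative guesses, not the submitted patch:

```c
/* Sketch only: field set and names are assumptions, not the real patch. */
TRACE_EVENT(hugetlbfs_setattr,

	TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

	TP_ARGS(inode, dentry, attr),

	TP_STRUCT__entry(
		__field(ino_t,		ino)
		__string(name,		dentry->d_name.name)
		__field(unsigned int,	ia_valid)
		__field(uid_t,		uid)
		__field(gid_t,		gid)
		__field(loff_t,		old_size)
		__field(loff_t,		new_size)
	),

	TP_fast_assign(
		/* conversions happen here, off the hot path at the call site */
		__entry->ino		= inode->i_ino;
		__assign_str(name, dentry->d_name.name);
		__entry->ia_valid	= attr->ia_valid;
		__entry->uid		= from_kuid(&init_user_ns, attr->ia_uid);
		__entry->gid		= from_kgid(&init_user_ns, attr->ia_gid);
		__entry->old_size	= inode->i_size;
		__entry->new_size	= attr->ia_size;
	),

	TP_printk("ino=%lu name=%s valid=%x uid=%u gid=%u size=%lld->%lld",
		  (unsigned long)__entry->ino, __get_str(name),
		  __entry->ia_valid, __entry->uid, __entry->gid,
		  __entry->old_size, __entry->new_size)
);
```

With this shape the caller shrinks to `trace_hugetlbfs_setattr(inode, dentry, attr);`, which is the icache/code-size point Steven and Mathieu discuss above.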
Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:

> This patch set, ultimately, switches global uprobes_treelock from RW
> spinlock to per-CPU RW semaphore, which has better performance and scales
> better under contention and multiple parallel threads triggering lots of
> uprobes.

Why not RCU + normal lock thing?
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23449a8c5e7e..560cf1ca512a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>
>  struct uprobe {
>  	struct rb_node		rb_node;	/* node in the rb tree */
> -	refcount_t		ref;
> +	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
>  	struct rw_semaphore	register_rwsem;
>  	struct rw_semaphore	consumer_rwsem;
> +	struct rcu_head		rcu;
>  	struct list_head	pending_list;
>  	struct uprobe_consumer	*consumers;
>  	struct inode		*inode;		/* Also hold a ref to inode */
> @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>  			*(uprobe_opcode_t *)&auprobe->insn);
>  }
>
> -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> +/*
> + * Uprobe's 64-bit refcount is actually two independent counters co-located
> + * in a single u64 value:
> + *   - lower 32 bits are just a normal refcount which is incremented and
> + *     decremented on get and put, respectively, just like a normal
> + *     refcount would;
> + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> + *     incremented by one, no matter whether a get or put operation is done.
> + *
> + * This upper counter is meant to distinguish between:
> + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> + *     sequence, also proceeding to "destruction".
> + *
> + * In both cases refcount drops to zero, but in one case it will have epoch N,
> + * while the second drop to zero will have a different epoch N + 2, allowing
> + * the first destructor to bail out because the epoch changed between refcount
> + * going to zero and put_uprobe() taking uprobes_treelock (under which the
> + * overall 64-bit refcount is double-checked, see put_uprobe() for details).
> + *
> + * The lower 32-bit counter is not meant to overflow, while it's expected

So refcount_t very explicitly handles both overflow and underflow and
screams bloody murder if they happen. Your thing does not..

> + * that the upper 32-bit counter will overflow occasionally. Note, though,
> + * that we can't allow the upper 32-bit counter to "bleed over" into the
> + * lower 32-bit counter, so whenever the epoch counter gets its highest bit
> + * set to 1, __get_uprobe() and put_uprobe() will attempt to clear the
> + * upper bit with cmpxchg(). This makes the epoch effectively a 31-bit
> + * counter with the highest bit used as a flag to perform a fix-up. This
> + * ensures the epoch and refcnt parts do not "interfere".
> + *
> + * The UPROBE_REFCNT_GET constant is chosen such that it will *increment
> + * both* epoch and refcnt parts atomically with one atomic64_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* the refcnt
> + * part and *increment* the epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL)	/* 0x100000001LL */
> +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL)	/* 0xFFFFFFFFLL */
> +
> +/*
> + * Caller has to make sure that:
> + *   a) either uprobe's refcnt is positive before this call;
> + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> + *      preventing uprobe's destructor from removing it from uprobes_tree.
> + *
> + * In the latter case, uprobe's destructor will "resurrect" the uprobe
> + * instance if it detects that its refcount went back to being positive
> + * again in between it dropping to zero at some point and the (potentially
> + * delayed) destructor callback actually running.
> + */
> +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
>  {
> -	refcount_inc(&uprobe->ref);
> +	s64 v;
> +
> +	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);

Distinct lack of u32 overflow testing here..

> +
> +	/*
> +	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> +	 * we don't retry because there is another CPU that just managed to
> +	 * update refcnt and will attempt the same "fix up". Eventually one of
> +	 * them will succeed to clear the highest bit.
> +	 */
> +	if (unlikely(v < 0))
> +		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
> +	return uprobe;
> +}
>
>  static void put_uprobe(struct uprobe *uprobe)
>  {
> -	if (refcount_dec_and_test(&uprobe->ref)) {
> +	s64 v;
> +
> +	/*
> +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if

What's wrong with normal RCU?

> +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> +	 * section double-checking uprobe->ref value below.
> +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu