Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-04 Thread Google
nregister_sync();
>  }
>  
>  static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..6b64470a1c5c 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, 
> filter_func_t filter)
>  static void __probe_event_disable(struct trace_probe *tp)
>  {
>   struct trace_uprobe *tu;
> + bool sync = false;
>  
>   tu = container_of(tp, struct trace_uprobe, tp);
>   WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> @@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct trace_probe 
> *tp)
>   if (!tu->inode)
>   continue;
>  
> - uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> + uprobe_unregister(tu->inode, tu->offset, &tu->consumer, URF_NO_SYNC);
> + sync = true;
>   tu->inode = NULL;
>   }
> + if (sync)
> + uprobe_unregister_sync();
>  }
>  
>  static int probe_event_enable(struct trace_event_call *call,
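For illustration, here is a minimal consumer-side sketch of the pattern in the
diff above: unregister each probe without waiting for the RCU grace period,
then pay the synchronization cost once for the whole batch. URF_NO_SYNC and
uprobe_unregister_sync() follow this patch series and may differ from the
final API; the surrounding my_probe structure is purely hypothetical.

/* Hypothetical consumer-side detach loop mirroring the quoted diff. */
struct my_probe {
	struct inode *inode;
	loff_t offset;
	struct uprobe_consumer consumer;
};

static void my_detach_all(struct my_probe *probes, int cnt)
{
	bool sync = false;
	int i;

	for (i = 0; i < cnt; i++) {
		if (!probes[i].inode)
			continue;
		/* queue the unregistration, but skip the per-probe RCU sync */
		uprobe_unregister(probes[i].inode, probes[i].offset,
				  &probes[i].consumer, URF_NO_SYNC);
		sync = true;
	}
	/* a single grace-period wait covers every probe detached above */
	if (sync)
		uprobe_unregister_sync();
}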


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-03 Thread Google
On Wed, 3 Jul 2024 11:25:35 -0700
Andrii Nakryiko  wrote:

> On Wed, Jul 3, 2024 at 6:15 AM Masami Hiramatsu  wrote:
> >
> > On Mon,  1 Jul 2024 15:39:25 -0700
> > Andrii Nakryiko  wrote:
> >
> > > It seems like uprobe_write_opcode() doesn't require writer locked
> > > mmap_sem, any lock (reader or writer) should be sufficient. This was
> > > established in a discussion in [0] and looking through existing code
> > > seems to confirm that there is no need for write-locked mmap_sem.
> > >
> > > Fix the comment to state this clearly.
> > >
> > >   [0] 
> > > https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/
> > >
> > > Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
> > > uprobe_write_opcode()")
> >
> > nit: why this has Fixes but [01/12] doesn't?
> >
> > Should I pick both to fixes branch?
> 
> I'd keep both of them in probes/for-next, tbh. They are not literally
> fixing anything, just cleaning up comments. I can drop the Fixes tag
> from this one, if you'd like.

Got it. Usually a cleanup patch will not have a Fixes tag, so if you don't mind,
please drop it.

Thank you,

> 
> >
> > Thank you,
> >
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  kernel/events/uprobes.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 081821fd529a..f87049c08ee9 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, 
> > > struct mm_struct *mm,
> > >   * @vaddr: the virtual address to store the opcode.
> > >   * @opcode: opcode to be written at @vaddr.
> > >   *
> > > - * Called with mm->mmap_lock held for write.
> > > + * Called with mm->mmap_lock held for read or write.
> > >   * Return 0 (success) or a negative errno.
> > >   */
> > >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
> > > *mm,
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-03 Thread Google
On Mon,  1 Jul 2024 15:39:25 -0700
Andrii Nakryiko  wrote:

> It seems like uprobe_write_opcode() doesn't require writer locked
> mmap_sem, any lock (reader or writer) should be sufficient. This was
> established in a discussion in [0] and looking through existing code
> seems to confirm that there is no need for write-locked mmap_sem.
> 
> Fix the comment to state this clearly.
> 
>   [0] 
> https://lore.kernel.org/linux-trace-kernel/20240625190748.gc14...@redhat.com/
> 
> Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into 
> uprobe_write_opcode()")

nit: why this has Fixes but [01/12] doesn't?

Should I pick both to fixes branch?

Thank you,

> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/events/uprobes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 081821fd529a..f87049c08ee9 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
> mm_struct *mm,
>   * @vaddr: the virtual address to store the opcode.
>   * @opcode: opcode to be written at @vaddr.
>   *
> - * Called with mm->mmap_lock held for write.
> + * Called with mm->mmap_lock held for read or write.
>   * Return 0 (success) or a negative errno.
>   */
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Google
On Tue, 2 Jul 2024 12:53:20 -0400
Steven Rostedt  wrote:

> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.  
> > 
> > No, because the current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or a list of addresses (for uprobe).
> > Um, that may be another good idea.
> 
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Sorry for misleading you; I meant that the current "dynamic_events" interface
does not support wildcards for the probe location.
And I agree that we can update it to support something like

 p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3

(note: kernel does not read the symbols in user binary)
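If dynamic_events were extended this way, usage might look like the following
(hypothetical syntax based on the example above; current kernels reject it):

  # echo 'p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3' \
        >> /sys/kernel/tracing/dynamic_events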

Thank you,


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Google
On Mon, 1 Jul 2024 18:34:55 -0700
Andrii Nakryiko  wrote:

> > > How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> > > contract, which works for the only user right now, BPF multi-uprobes.
> > > When it's time to add another consumer that works with a linked list,
> > > we can add another more complicated contract that would do
> > > iterator-style callbacks. This would be used by linked list users, and
> > > we can transparently implement existing uprobe_register_batch()
> > > contract on top of if by implementing a trivial iterator wrapper on
> > > top of get_uprobe_consumer(idx, ctx) approach.
> >
> > Agreed; anyway, as far as it uses an array of uprobe_consumer, it works.
> > When we need to register a list of structures, we may be able to
> > allocate an array or introduce a new function.
> >
> 
> Cool, glad we agree. What you propose above with start + next + ctx
> seems like a way forward if we need this.
> 
> BTW, is this (batched register/unregister APIs) something you'd like
> to use from the tracefs-based (or whatever it's called, I mean non-BPF
> ones) uprobes as well? Or there is just no way to even specify a batch
> of uprobes? Just curious if you had any plans for this.

No, because the current tracefs dynamic event interface is not designed for
batched registration. I think we can expand it to pass wildcard symbols
(for kprobe and fprobe) or a list of addresses (for uprobe).
Um, that may be another good idea.

Thank you!

-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-01 Thread Google
 one, so
> > not a big deal.
> >
> 
> Actually, now that I started implementing this, I really-really don't
> like it. In the example above you assume by storing and reusing
> original ctx value you will reset iteration to the very beginning.
> This is not how it works in practice though. Ctx is most probably a
> pointer to some struct somewhere with iteration state (e.g., array of
> all uprobes + current index), and so get_uprobe_consumer() doesn't
> update `void *ctx` itself, it updates the state of that struct.

Yeah, that should be noted, so that if get_uprobe_consumer() is
called with the original `ctx` value, it returns the same.
Ah, and I found we need to pass both ctx and pos...

   while ((uc = get_uprobe_consumer(&pos, ctx)) != NULL) {
 ...
 }

Usually it is enough to pass a cursor, similar to the other
for_each_* macros. For example, if struct foo has .list and .uc, then:

struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
{
struct foo *foo = *pos;

if (!foo)
return NULL;

*pos = list_next_entry(foo, list);
if (list_is_head(pos, (head)))
*pos = NULL;

return foo->uc;
}

This works something like this.

#define for_each_uprobe_consumer_from_foo(uc, pos, head) \
list_for_each_entry(pos, head, list) \
if (uc = uprobe_consumer_from_foo(pos))

or, for array of *uprobe_consumer (array must be end with NULL), 

struct uprobe_consumer *get_uprobe_consumer_array(void **pos, void *head 
__unused)
{
struct uprobe_consumer **uc = *pos;

if (!*uc)
return NULL;

*pos = uc + 1;

return *uc;
}

But this may not be able to support array of uprobe_consumer. Hmm.


> And so there is no easy and clean way to reset this iterator without
> adding another callback or something. At which point it becomes quite
> cumbersome and convoluted.

If you consider that problematic, I think we can prepare a more
iterator-like object:

struct uprobe_consumer_iter_ops {
struct uprobe_consumer *(*start)(struct uprobe_consumer_iter_ops *);
struct uprobe_consumer *(*next)(struct uprobe_consumer_iter_ops *);
void *ctx; // or, just embed the data in this structure.
};
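A rough sketch of how such an iterator object could be implemented for a
linked-list consumer, reusing the struct foo with .list and .uc from the
example above (everything below is illustrative, not proposed kernel code):

/* Illustrative linked-list iteration via the proposed ops object. */
struct foo_iter {
	struct uprobe_consumer_iter_ops ops;
	struct list_head *head;		/* list of struct foo, linked via foo->list */
	struct foo *cur;
};

static struct uprobe_consumer *foo_iter_start(struct uprobe_consumer_iter_ops *ops)
{
	struct foo_iter *it = container_of(ops, struct foo_iter, ops);

	it->cur = list_first_entry_or_null(it->head, struct foo, list);
	return it->cur ? &it->cur->uc : NULL;
}

static struct uprobe_consumer *foo_iter_next(struct uprobe_consumer_iter_ops *ops)
{
	struct foo_iter *it = container_of(ops, struct foo_iter, ops);

	if (!it->cur || list_is_last(&it->cur->list, it->head))
		return NULL;
	it->cur = list_next_entry(it->cur, list);
	return &it->cur->uc;
}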


> How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> contract, which works for the only user right now, BPF multi-uprobes.
> When it's time to add another consumer that works with a linked list,
> we can add another more complicated contract that would do
> iterator-style callbacks. This would be used by linked list users, and
> we can transparently implement existing uprobe_register_batch()
> contract on top of if by implementing a trivial iterator wrapper on
> top of get_uprobe_consumer(idx, ctx) approach.

Agreed; anyway, as far as it uses an array of uprobe_consumer, it works.
When we need to register a list of structures, we may be able to
allocate an array or introduce a new function.

Thank you!

> 
> Let's not add unnecessary complications right now given we have a
> clear path forward to add it later, if necessary, without breaking
> anything. I'll send v2 without changes to get_uprobe_consumer() for
> now, hopefully my above plan makes sense to you. Thanks!
> 
> > >
> > > Thank you,
> > >
> > > >
> > > > For now this API works well, I propose to keep it as is. For linked
> > > > list case consumers would need to allocate one extra array or pay the
> > > > price of O(N) search (which might be ok, depending on how many uprobes
> > > > are being attached). But we don't have such consumers right now,
> > > > thankfully.
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > --
> > > > > > > Masami Hiramatsu (Google) 
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) 
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-29 Thread Google
On Fri, 28 Jun 2024 09:34:26 -0700
Andrii Nakryiko  wrote:

> On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu  wrote:
> >
> > On Thu, 27 Jun 2024 09:47:10 -0700
> > Andrii Nakryiko  wrote:
> >
> > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > -  loff_t ref_ctr_offset, struct 
> > > > > uprobe_consumer *uc)
> > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > +   uprobe_consumer_fn get_uprobe_consumer, void 
> > > > > *ctx)
> > > >
> > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > allocate a temporary array of *uprobe_consumer instead?
> > >
> > > Yes, exactly, to avoid the need for allocating another array that
> > > would just contain pointers to uprobe_consumer. Consumers would never
> > > just have an array of `struct uprobe_consumer *`, because
> > > uprobe_consumer struct is embedded in some other struct, so the array
> > > interface isn't the most convenient.
> >
> > OK, I understand it.
> >
> > >
> > > If you feel strongly, I can do an array, but this necessitates
> > > allocating an extra array *and keeping it* for the entire duration of
> > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > waste. This is because we don't want to do anything that can fail in
> > > the detachment logic (so no temporary array allocation there).
> >
> > No need to change it, that sounds reasonable.
> >
> 
> Great, thanks.
> 
> > >
> > > Anyways, let me know how you feel about keeping this callback.
> >
> > IMHO, it may be better to change the interface function to
> > `uprobe_consumer *next_uprobe_consumer(void **data)`. If the caller
> > side uses a linked list of structures, index-based access will need to
> > walk the list every time.
> 
> This would be problematic. Note how we call get_uprobe_consumer(i,
> ctx) with i going from 0 to N in multiple independent loops. So if we
> are only allowed to ask for the next consumer, then
> uprobe_register_batch and uprobe_unregister_batch would need to build
> its own internal index and remember ith instance. Which again means
> more allocations and possibly failing uprobe_unregister_batch(), which
> isn't great.

No, I think we can use a cursor variable, as in:

int uprobe_register_batch(struct inode *inode,
 uprobe_consumer_fn get_uprobe_consumer, void *ctx)
{
void *cur = ctx;

while ((uc = get_uprobe_consumer(&cur)) != NULL) {
...
} 

cur = ctx;
while ((uc = get_uprobe_consumer(&cur)) != NULL) {
...
} 
}

This also lets us remove the cnt argument.
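For comparison, a minimal consumer-side sketch of the index-based
get_uprobe_consumer(idx, ctx) contract that the series currently uses; the
struct and the exact callback prototype here are assumptions for illustration:

/* uprobe_consumer structs embedded in the caller's own array, returned by index. */
struct my_uprobe {
	struct uprobe_consumer consumer;
	/* ... other per-probe state ... */
};

static struct uprobe_consumer *my_get_consumer(size_t idx, void *ctx)
{
	struct my_uprobe *probes = ctx;

	return &probes[idx].consumer;
}

/* registration sketch: uprobe_register_batch(inode, cnt, my_get_consumer, probes); */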

Thank you,

> 
> For now this API works well, I propose to keep it as is. For linked
> list case consumers would need to allocate one extra array or pay the
> price of O(N) search (which might be ok, depending on how many uprobes
> are being attached). But we don't have such consumers right now,
> thankfully.
> 
> >
> > Thank you,
> >
> >
> > >
> > > >
> > > > Thank you,
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) 
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-28 Thread Google
On Thu, 27 Jun 2024 09:47:10 -0700
Andrii Nakryiko  wrote:

> On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu  wrote:
> >
> > On Mon, 24 Jun 2024 17:21:38 -0700
> > Andrii Nakryiko  wrote:
> >
> > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > -  loff_t ref_ctr_offset, struct uprobe_consumer 
> > > *uc)
> > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > +   uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> >
> > Is this interface just for avoiding memory allocation? Can't we just
> > allocate a temporary array of *uprobe_consumer instead?
> 
> Yes, exactly, to avoid the need for allocating another array that
> would just contain pointers to uprobe_consumer. Consumers would never
> just have an array of `struct uprobe_consumer *`, because
> uprobe_consumer struct is embedded in some other struct, so the array
> interface isn't the most convenient.

OK, I understand it.

> 
> If you feel strongly, I can do an array, but this necessitates
> allocating an extra array *and keeping it* for the entire duration of
> BPF multi-uprobe link (attachment) existence, so it feels like a
> waste. This is because we don't want to do anything that can fail in
> the detachment logic (so no temporary array allocation there).

No need to change it, that sounds reasonable.

> 
> Anyways, let me know how you feel about keeping this callback.

IMHO, it may be better to change the interface function to
`uprobe_consumer *next_uprobe_consumer(void **data)`. If the caller
side uses a linked list of structures, index-based access will need to
walk the list every time.

Thank you,


> 
> >
> > Thank you,
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-06-27 Thread Google
On Mon, 24 Jun 2024 17:21:38 -0700
Andrii Nakryiko  wrote:

> -static int __uprobe_register(struct inode *inode, loff_t offset,
> -  loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +int uprobe_register_batch(struct inode *inode, int cnt,
> +   uprobe_consumer_fn get_uprobe_consumer, void *ctx)

Is this interface just for avoiding memory allocation? Can't we just
allocate a temporary array of *uprobe_consumer instead? 

Thank you,

-- 
Masami Hiramatsu (Google) 



Re: [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-06-26 Thread Google
On Mon, 24 Jun 2024 17:21:37 -0700
Andrii Nakryiko  wrote:

> Simplify uprobe registration/unregistration interfaces by making offset
> and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> existing users already store these fields somewhere in uprobe_consumer's
> containing structure, so this doesn't pose any problem. We just move
> some fields around.
> 
> On the other hand, this simplifies uprobe_register() and
> uprobe_unregister() API by having only struct uprobe_consumer as one
> thing representing attachment/detachment entity. This makes batched
> versions of uprobe_register() and uprobe_unregister() simpler.
> 
> This also makes uprobe_register_refctr() unnecessary, so remove it and
> simplify consumers.
> 
> No functional changes intended.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks!

> Signed-off-by: Andrii Nakryiko 
> ---
>  include/linux/uprobes.h   | 18 +++
>  kernel/events/uprobes.c   | 19 ++-
>  kernel/trace/bpf_trace.c  | 21 +++-
>  kernel/trace/trace_uprobe.c   | 53 ---
>  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 23 
>  5 files changed, 55 insertions(+), 79 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b503fafb7fb3..a75ba37ce3c8 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -42,6 +42,11 @@ struct uprobe_consumer {
>   enum uprobe_filter_ctx ctx,
>   struct mm_struct *mm);
>  
> + /* associated file offset of this probe */
> + loff_t offset;
> + /* associated refctr file offset of this probe, or zero */
> + loff_t ref_ctr_offset;
> + /* for internal uprobe infra use, consumers shouldn't touch fields 
> below */
>   struct uprobe_consumer *next;
>  };
>  
> @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
>  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct 
> *mm, unsigned long vaddr, uprobe_opcode_t);
> -extern int uprobe_register(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc);
> -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t 
> ref_ctr_offset, struct uprobe_consumer *uc);
> +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc);
> +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer 
> *uc);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
>  extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, 
> unsigned long end);
>  extern void uprobe_start_dup_mmap(void);
> @@ -152,11 +156,7 @@ static inline void uprobes_init(void)
>  #define uprobe_get_trap_addr(regs)   instruction_pointer(regs)
>  
>  static inline int
> -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer 
> *uc)
> -{
> - return -ENOSYS;
> -}
> -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, 
> loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
>  {
>   return -ENOSYS;
>  }
> @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc, boo
>   return -ENOSYS;
>  }
>  static inline void
> -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer 
> *uc)
> +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
>  {
>  }
>  static inline int uprobe_mmap(struct vm_area_struct *vma)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8ce669bc6474..2544e8b79bad 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1197,14 +1197,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct 
> uprobe_consumer *uc)
>  /*
>   * uprobe_unregister - unregister an already registered probe.
>   * @inode: the file in which the probe has to be removed.
> - * @offset: offset from the start of the file.
> - * @uc: identify which probe if multiple probes are colocated.
> + * @uc: identify which probe consumer to unregister.
>   */
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> uprobe_consumer *uc)
> +void uprobe_unregister(struct inode *inode, struct upr

Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-26 Thread Google
On Mon, 24 Jun 2024 17:21:36 -0700
Andrii Nakryiko  wrote:

> Anyways, under exclusive writer lock, we double-check that refcount
> didn't change and is still zero. If it is, we proceed with destruction,
> because at that point we have a guarantee that find_active_uprobe()
> can't successfully look up this uprobe instance, as it's going to be
> removed in destructor under writer lock. If, on the other hand,
> find_active_uprobe() managed to bump refcount from zero to one in
> between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> write_lock(&uprobes_treelock), we'll deterministically detect this with
> extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> pretend like atomic_dec_and_test() never returned true. There is no
> resource freeing or any other irreversible action taken up till this
> point, so we just exit early.
> 
> One tricky part in the above is actually two CPUs racing and dropping
> refcnt to zero, and then attempting to free resources. This can happen
> as follows:
>   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
>   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> which point it decides that it needs to free uprobe as well.
> 
> At this point both CPU #0 and CPU #1 will believe they need to destroy
> uprobe, which is obviously wrong. To prevent this situations, we augment
> refcount with epoch counter, which is always incremented by 1 on either
> get or put operation. This allows those two CPUs above to disambiguate
> who should actually free uprobe (it's the CPU #1, because it has
> up-to-date epoch). See comments in the code and note the specific values
> of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> a single atomic64_t is actually two sort-of-independent 32-bit counters
> that are incremented/decremented with a single atomic_add_and_return()
> operation. Note also a small and extremely rare (and thus having no
> effect on performance) need to clear the highest bit every 2 billion
> get/put operations to prevent high 32-bit counter from "bleeding over"
> into lower 32-bit counter.
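To make the combined counter scheme described above more concrete, here is a
rough sketch; the constant values and the helper below are illustrative
assumptions, not the patch's exact code:

/*
 * One atomic64_t: low 32 bits hold the refcount, high 32 bits hold the epoch,
 * and both change with a single atomic add.
 */
#define UPROBE_REFCNT_GET 0x0000000100000001LL	/* epoch += 1, refcount += 1 */
#define UPROBE_REFCNT_PUT 0x00000000ffffffffLL	/* refcount -= 1, carry bumps epoch */

static bool put_dropped_to_zero(atomic64_t *ref)	/* hypothetical helper */
{
	s64 v = atomic64_add_return(UPROBE_REFCNT_PUT, ref);

	/*
	 * Low half hit zero: this CPU may attempt destruction, but only after
	 * re-checking the counter under uprobes_treelock, as described above.
	 */
	return (u32)v == 0;
}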

I have a question here.
Is there any chance for CPU#1 to put the uprobe before CPU#0 takes
the uprobes_treelock, and to free the uprobe before CPU#0 validates
uprobe->ref again? E.g.:

CPU#0                                CPU#1

put_uprobe() {
    atomic64_add_return()
                                     __get_uprobe();
                                     put_uprobe() {
                                         kfree(uprobe)
                                     }
    write_lock(&uprobes_treelock);
    atomic64_read(&uprobe->ref);
}

I think it is a very rare case, but I could not find any code that prevents
this scenario.

Thank you,


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-24 Thread Google
On Mon, 24 Jun 2024 17:21:34 -0700
Andrii Nakryiko  wrote:

> Given unapply_uprobe() can call remove_breakpoint() which eventually
> calls uprobe_write_opcode(), which can modify a set of memory pages and
> expects mm->mmap_lock held for write, it needs to have writer lock.
> 
> Fix this by switching to mmap_write_lock()/mmap_write_unlock().
> 

Oops, it is an actual bug, right?

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed 
> task")
> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/events/uprobes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 197fbe4663b5..e896eeecb091 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct 
> mm_struct *mm)
>   struct vm_area_struct *vma;
>   int err = 0;
>  
> - mmap_read_lock(mm);
> + mmap_write_lock(mm);
>   for_each_vma(vmi, vma) {
>   unsigned long vaddr;
>   loff_t offset;
> @@ -1252,7 +1252,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct 
> mm_struct *mm)
>   vaddr = offset_to_vaddr(vma, uprobe->offset);
>   err |= remove_breakpoint(uprobe, mm, vaddr);
>   }
> - mmap_read_unlock(mm);
> + mmap_write_unlock(mm);
>  
>   return err;
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 4/4] selftests/bpf: add test validating uprobe/uretprobe stack traces

2024-06-24 Thread Google
ut in EXIT uprobes
> +  */
> +
> + /* (uretprobe 4) everything up to target_4, but excluding it */
> + validate_stack(skel->bss->exit_stack4, skel->bss->exit4_len,
> +5, , [1], [1], [2], 
> [3]);
> + /* we didn't install uretprobes on target_2 and target_3 */
> + /* (uretprobe 1, recur) first target_1 call only */
> + validate_stack(skel->bss->exit_stack1_recur, skel->bss->exit1_recur_len,
> +2, , [1]);
> + /* (uretprobe 1) just a caller in the stack trace */
> + validate_stack(skel->bss->exit_stack1, skel->bss->exit1_len,
> +1, );
> +
> +cleanup:
> + uretprobe_stack__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/uretprobe_stack.c 
> b/tools/testing/selftests/bpf/progs/uretprobe_stack.c
> new file mode 100644
> index ..9fdcf396b8f4
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/uretprobe_stack.c
> @@ -0,0 +1,96 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +char _license[] SEC("license") = "GPL";
> +
> +__u64 entry_stack1[32], exit_stack1[32];
> +__u64 entry_stack1_recur[32], exit_stack1_recur[32];
> +__u64 entry_stack2[32];
> +__u64 entry_stack3[32];
> +__u64 entry_stack4[32], exit_stack4[32];
> +__u64 usdt_stack[32];
> +
> +int entry1_len, exit1_len;
> +int entry1_recur_len, exit1_recur_len;
> +int entry2_len, exit2_len;
> +int entry3_len, exit3_len;
> +int entry4_len, exit4_len;
> +int usdt_len;
> +
> +#define SZ sizeof(usdt_stack)
> +
> +SEC("uprobe//proc/self/exe:target_1")
> +int BPF_UPROBE(uprobe_1)
> +{
> + /* target_1 is recursive with depth of 2, so we capture two separate
> +  * stack traces, depending on which occurrence it is
> +  */
> + static bool recur = false;
> +
> + if (!recur)
> + entry1_len = bpf_get_stack(ctx, &entry_stack1, SZ, BPF_F_USER_STACK);
> + else
> + entry1_recur_len = bpf_get_stack(ctx, &entry_stack1_recur, SZ, BPF_F_USER_STACK);
> +
> + recur = true;
> + return 0;
> +}
> +
> +SEC("uretprobe//proc/self/exe:target_1")
> +int BPF_URETPROBE(uretprobe_1)
> +{
> + /* see above, target_1 is recursive */
> + static bool recur = false;
> +
> + /* NOTE: order of returns is reversed to order of entries */
> + if (!recur)
> + exit1_recur_len = bpf_get_stack(ctx, &exit_stack1_recur, SZ, BPF_F_USER_STACK);
> + else
> + exit1_len = bpf_get_stack(ctx, &exit_stack1, SZ, BPF_F_USER_STACK);
> +
> + recur = true;
> + return 0;
> +}
> +
> +SEC("uprobe//proc/self/exe:target_2")
> +int BPF_UPROBE(uprobe_2)
> +{
> + entry2_len = bpf_get_stack(ctx, &entry_stack2, SZ, BPF_F_USER_STACK);
> + return 0;
> +}
> +
> +/* no uretprobe for target_2 */
> +
> +SEC("uprobe//proc/self/exe:target_3")
> +int BPF_UPROBE(uprobe_3)
> +{
> + entry3_len = bpf_get_stack(ctx, &entry_stack3, SZ, BPF_F_USER_STACK);
> + return 0;
> +}
> +
> +/* no uretprobe for target_3 */
> +
> +SEC("uprobe//proc/self/exe:target_4")
> +int BPF_UPROBE(uprobe_4)
> +{
> + entry4_len = bpf_get_stack(ctx, &entry_stack4, SZ, BPF_F_USER_STACK);
> + return 0;
> +}
> +
> +SEC("uretprobe//proc/self/exe:target_4")
> +int BPF_URETPROBE(uretprobe_4)
> +{
> + exit4_len = bpf_get_stack(ctx, &exit_stack4, SZ, BPF_F_USER_STACK);
> + return 0;
> +}
> +
> +SEC("usdt//proc/self/exe:uretprobe_stack:target")
> +int BPF_USDT(usdt_probe)
> +{
> + usdt_len = bpf_get_stack(ctx, &usdt_stack, SZ, BPF_F_USER_STACK);
> + return 0;
> +}
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 1/4] uprobes: rename get_trampoline_vaddr() and make it global

2024-06-24 Thread Google
On Tue, 21 May 2024 18:38:42 -0700
Andrii Nakryiko  wrote:

> This helper is needed in another file, so make it a bit more uniquely
> named and expose it internally.
> 
> Signed-off-by: Andrii Nakryiko 

Sorry, I think this conflicts with

https://lore.kernel.org/all/2024062158.40795-4-jo...@kernel.org/

And that is already picked.

Thanks,

> ---
>  include/linux/uprobes.h | 1 +
>  kernel/events/uprobes.c | 6 +++---
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..0c57eec85339 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct 
> return_instance *ret, enum rp_check c
>  extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs 
> *regs);
>  extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>void *src, unsigned long len);
> +extern unsigned long uprobe_get_trampoline_vaddr(void);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8ae0eefc3a34..d60d24f0f2f4 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, 
> unsigned long flags)
>   *
>   * Returns -1 in case the xol_area is not allocated.
>   */
> -static unsigned long get_trampoline_vaddr(void)
> +unsigned long uprobe_get_trampoline_vaddr(void)
>  {
>   struct xol_area *area;
>   unsigned long trampoline_vaddr = -1;
> @@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
> struct pt_regs *regs)
>   if (!ri)
>   return;
>  
> - trampoline_vaddr = get_trampoline_vaddr();
> + trampoline_vaddr = uprobe_get_trampoline_vaddr();
>   orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, 
> regs);
>   if (orig_ret_vaddr == -1)
>   goto fail;
> @@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs)
>   int is_swbp;
>  
>   bp_vaddr = uprobe_get_swbp_addr(regs);
> - if (bp_vaddr == get_trampoline_vaddr())
> + if (bp_vaddr == uprobe_get_trampoline_vaddr())
>   return handle_trampoline(regs);
>  
>   uprobe = find_active_uprobe(bp_vaddr, _swbp);
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-24 Thread Google
On Mon, 24 Jun 2024 13:32:35 -0700
Andrii Nakryiko  wrote:

> On Mon, Jun 17, 2024 at 3:37 PM Andrii Nakryiko
>  wrote:
> >
> > On Tue, Jun 4, 2024 at 10:16 AM Andrii Nakryiko
> >  wrote:
> > >
> > > On Tue, Jun 4, 2024 at 7:13 AM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Tue, 21 May 2024 18:38:43 -0700
> > > > Andrii Nakryiko  wrote:
> > > >
> > > > > When kernel has pending uretprobes installed, it hijacks original user
> > > > > function return address on the stack with a uretprobe trampoline
> > > > > address. There could be multiple such pending uretprobes (either on
> > > > > different user functions or on the same recursive one) at any given
> > > > > time within the same task.
> > > > >
> > > > > This approach interferes with the user stack trace capture logic, 
> > > > > which
> > > > > would report suprising addresses (like 0x7fffe000) that correspond
> > > > > to a special "[uprobes]" section that kernel installs in the target
> > > > > process address space for uretprobe trampoline code, while logically 
> > > > > it
> > > > > should be an address somewhere within the calling function of another
> > > > > traced user function.
> > > > >
> > > > > This is easy to correct for, though. Uprobes subsystem keeps track of
> > > > > pending uretprobes and records original return addresses. This patch 
> > > > > is
> > > > > using this to do a post-processing step and restore each trampoline
> > > > > address entries with correct original return address. This is done 
> > > > > only
> > > > > if there are pending uretprobes for current task.
> > > > >
> > > > > This is a similar approach to what fprobe/kretprobe infrastructure is
> > > > > doing when capturing kernel stack traces in the presence of pending
> > > > > return probes.
> > > > >
> > > >
> > > > This looks good to me because this trampoline information is only
> > > > managed in uprobes. And it should be provided when unwinding user
> > > > stack.
> > > >
> > > > Reviewed-by: Masami Hiramatsu (Google) 
> > > >
> > > > Thank you!
> > >
> > > Great, thanks for reviewing, Masami!
> > >
> > > Would you take this fix through your tree, or where should it be routed 
> > > to?
> > >
> >
> > Ping! What would you like me to do with this patch set? Should I
> > resend it without patch 3 (the one that tries to guess whether we are
> > at the entry to the function?), or did I manage to convince you that
> > this heuristic is OK, given perf's stack trace capturing logic already
> > makes heavy assumption of rbp register being used for frame pointer?
> >
> > Please let me know your preference, I could drop patch 3 and send it
> > separately, if that helps move the main fix forward. Thanks!
> 
> Masami,
> 
> Another week went by with absolutely no action or reaction from you.
> Is there any way I can help improve the collaboration here?

OK, if the others work unchanged without [3/4], let me pick them up on
probes/for-next directly. For [3/4] I need comments from the x86
maintainers, and it should be handled by the PMU maintainers.

Thanks,


> 
> I'm preparing more patches for uprobes and about to submit them. If
> each reviewed and ready to be applied patch set has to sit idle for
> multiple weeks for no good reason, we all will soon be lost just plain
> forgetting the context in which the patch was prepared.
> 
> Please, prioritize handling patches that are meant to be routed
> through your tree in a more timely fashion. Or propose some
> alternative acceptable arrangement.
> 
> Thank you.
> 
> >
> > > >
> > > > > Reported-by: Riham Selim 
> > > > > Signed-off-by: Andrii Nakryiko 
> > > > > ---
> > > > >  kernel/events/callchain.c | 43 
> > > > > ++-
> > > > >  kernel/events/uprobes.c   |  9 
> > > > >  2 files changed, 51 insertions(+), 1 deletion(-)
> > > > >
> > >
> > > [...]


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 1/4] uprobes: rename get_trampoline_vaddr() and make it global

2024-06-24 Thread Google
On Tue, 21 May 2024 18:38:42 -0700
Andrii Nakryiko  wrote:

> This helper is needed in another file, so make it a bit more uniquely
> named and expose it internally.
> 
> Signed-off-by: Andrii Nakryiko 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you

> ---
>  include/linux/uprobes.h | 1 +
>  kernel/events/uprobes.c | 6 +++---
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..0c57eec85339 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct 
> return_instance *ret, enum rp_check c
>  extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs 
> *regs);
>  extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>void *src, unsigned long len);
> +extern unsigned long uprobe_get_trampoline_vaddr(void);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8ae0eefc3a34..d60d24f0f2f4 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1827,7 +1827,7 @@ void uprobe_copy_process(struct task_struct *t, 
> unsigned long flags)
>   *
>   * Returns -1 in case the xol_area is not allocated.
>   */
> -static unsigned long get_trampoline_vaddr(void)
> +unsigned long uprobe_get_trampoline_vaddr(void)
>  {
>   struct xol_area *area;
>   unsigned long trampoline_vaddr = -1;
> @@ -1878,7 +1878,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
> struct pt_regs *regs)
>   if (!ri)
>   return;
>  
> - trampoline_vaddr = get_trampoline_vaddr();
> + trampoline_vaddr = uprobe_get_trampoline_vaddr();
>   orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, 
> regs);
>   if (orig_ret_vaddr == -1)
>   goto fail;
> @@ -2187,7 +2187,7 @@ static void handle_swbp(struct pt_regs *regs)
>   int is_swbp;
>  
>   bp_vaddr = uprobe_get_swbp_addr(regs);
> - if (bp_vaddr == get_trampoline_vaddr())
> + if (bp_vaddr == uprobe_get_trampoline_vaddr())
>   return handle_trampoline(regs);
>  
>   uprobe = find_active_uprobe(bp_vaddr, _swbp);
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 2/4] perf,uprobes: fix user stack traces in the presence of pending uretprobes

2024-06-04 Thread Google
On Tue, 21 May 2024 18:38:43 -0700
Andrii Nakryiko  wrote:

> When kernel has pending uretprobes installed, it hijacks original user
> function return address on the stack with a uretprobe trampoline
> address. There could be multiple such pending uretprobes (either on
> different user functions or on the same recursive one) at any given
> time within the same task.
> 
> This approach interferes with the user stack trace capture logic, which
> would report suprising addresses (like 0x7fffe000) that correspond
> to a special "[uprobes]" section that kernel installs in the target
> process address space for uretprobe trampoline code, while logically it
> should be an address somewhere within the calling function of another
> traced user function.
> 
> This is easy to correct for, though. Uprobes subsystem keeps track of
> pending uretprobes and records original return addresses. This patch is
> using this to do a post-processing step and restore each trampoline
> address entries with correct original return address. This is done only
> if there are pending uretprobes for current task.
> 
> This is a similar approach to what fprobe/kretprobe infrastructure is
> doing when capturing kernel stack traces in the presence of pending
> return probes.
> 

This looks good to me because this trampoline information is only
managed in uprobes. And it should be provided when unwinding user
stack.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you!

> Reported-by: Riham Selim 
> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/events/callchain.c | 43 ++-
>  kernel/events/uprobes.c   |  9 
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1273be84392c..b17e3323f7f6 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -176,13 +177,51 @@ put_callchain_entry(int rctx)
>   put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
>  }
>  
> +static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry 
> *entry,
> +int start_entry_idx)
> +{
> +#ifdef CONFIG_UPROBES
> + struct uprobe_task *utask = current->utask;
> + struct return_instance *ri;
> + __u64 *cur_ip, *last_ip, tramp_addr;
> +
> + if (likely(!utask || !utask->return_instances))
> + return;
> +
> + cur_ip = &entry->ip[start_entry_idx];
> + last_ip = &entry->ip[entry->nr - 1];
> + ri = utask->return_instances;
> + tramp_addr = uprobe_get_trampoline_vaddr();
> +
> + /*
> +  * If there are pending uretprobes for the current thread, they are
> +  * recorded in a list inside utask->return_instances; each such
> +  * pending uretprobe replaces traced user function's return address on
> +  * the stack, so when stack trace is captured, instead of seeing
> +  * actual function's return address, we'll have one or many uretprobe
> +  * trampoline addresses in the stack trace, which are not helpful and
> +  * misleading to users.
> +  * So here we go over the pending list of uretprobes, and each
> +  * encountered trampoline address is replaced with actual return
> +  * address.
> +  */
> + while (ri && cur_ip <= last_ip) {
> + if (*cur_ip == tramp_addr) {
> + *cur_ip = ri->orig_ret_vaddr;
> + ri = ri->next;
> + }
> + cur_ip++;
> + }
> +#endif
> +}
> +
>  struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
>  u32 max_stack, bool crosstask, bool add_mark)
>  {
>   struct perf_callchain_entry *entry;
>   struct perf_callchain_entry_ctx ctx;
> - int rctx;
> + int rctx, start_entry_idx;
>  
>   entry = get_callchain_entry();
>   if (!entry)
> @@ -215,7 +254,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, 
> bool kernel, bool user,
>   if (add_mark)
>   perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>  
> + start_entry_idx = entry->nr;
>   perf_callchain_user(&ctx, regs);
> + fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
>   }
>   }
>  
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index d60d24f0f2f4..1c99380dc89d 100644
> --- a/kernel/

Re: [PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

2024-06-04 Thread Google
On Tue, 21 May 2024 18:38:44 -0700
Andrii Nakryiko  wrote:

> When tracing user functions with uprobe functionality, it's common to
> install the probe (e.g., a BPF program) at the first instruction of the
> function. This is often going to be `push %rbp` instruction in function
> preamble, which means that within that function frame pointer hasn't
> been established yet. This leads to consistently missing an actual
> caller of the traced function, because perf_callchain_user() only
> records current IP (capturing traced function) and then following frame
> pointer chain (which would be caller's frame, containing the address of
> caller's caller).

I thought this problem might be solved by sframe.

> 
> So when we have target_1 -> target_2 -> target_3 call chain and we are
> tracing an entry to target_3, captured stack trace will report
> target_1 -> target_3 call chain, which is wrong and confusing.
> 
> This patch proposes a x86-64-specific heuristic to detect `push %rbp`
> instruction being traced.

I like this kind of idea :) But I think this should be done in
user space, not in the kernel, because it is not always certain
that the user program uses stack frames.

> If that's the case, with the assumption that
> applicatoin is compiled with frame pointers, this instruction would be
> a strong indicator that this is the entry to the function. In that case,
> return address is still pointed to by %rsp, so we fetch it and add to
> stack trace before proceeding to unwind the rest using frame
> pointer-based logic.

Why don't we do this in the user-space BPF program? If it is done
in user space, like perf-probe, I'm OK with it. But I have doubts about
doing this in the kernel; it is not flexible.

More than anything, without a user-space helper to find function
symbols, uprobe does not know where the function entry is, so I'm
curious why you don't do this in user space.

At least, this should be done in the user of uprobes, like trace_uprobe
or BPF.
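For what it's worth, a rough sketch of how the same heuristic could be tried
from a BPF uprobe program in user-controlled code (this is not from the
thread; the attach point is made up, and it assumes the probe sits on the
function's first instruction, before `push %rbp`, on x86-64):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("uprobe//proc/self/exe:target")
int BPF_UPROBE(entry_probe)
{
	__u64 ret_addr = 0;

	/* at the first instruction the return address is still at *%rsp */
	bpf_probe_read_user(&ret_addr, sizeof(ret_addr),
			    (void *)PT_REGS_SP(ctx));
	/* ret_addr names the immediate caller; a program-built stack trace
	 * could prepend it before walking frame pointers itself */
	return 0;
}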


Thank you,

> 
> Signed-off-by: Andrii Nakryiko 
> ---
>  arch/x86/events/core.c  | 20 
>  include/linux/uprobes.h |  2 ++
>  kernel/events/uprobes.c |  2 ++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 5b0dd07b1ef1..82d5570b58ff 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *regs
>   return;
>  
>   pagefault_disable();
> +
> +#ifdef CONFIG_UPROBES
> + /*
> +  * If we are called from uprobe handler, and we are indeed at the very
> +  * entry to user function (which is normally a `push %rbp` instruction,
> +  * under assumption of application being compiled with frame pointers),
> +  * we should read return address from *regs->sp before proceeding
> +  * to follow frame pointers, otherwise we'll skip immediate caller
> +  * as %rbp is not yet setup.
> +  */
> + if (current->utask) {
> + struct arch_uprobe *auprobe = current->utask->auprobe;
> + u64 ret_addr;
> +
> + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ &&
> + !__get_user(ret_addr, (const u64 __user *)regs->sp))
> + perf_callchain_store(entry, ret_addr);
> + }
> +#endif
> +
>   while (entry->nr < entry->max_stack) {
>   if (!valid_user_frame(fp, sizeof(frame)))
>   break;
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 0c57eec85339..7b785cd30d86 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -76,6 +76,8 @@ struct uprobe_task {
>   struct uprobe   *active_uprobe;
>   unsigned long   xol_vaddr;
>  
> + struct arch_uprobe  *auprobe;
> +
>   struct return_instance  *return_instances;
>   unsigned intdepth;
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 1c99380dc89d..504693845187 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
> pt_regs *regs)
>   bool need_prep = false; /* prepare return uprobe, when needed */
>  
>   down_read(&uprobe->register_rwsem);
> + current->utask->auprobe = &uprobe->arch;
>   for (uc = uprobe->consumers; uc; uc = uc->next) {
>   int rc = 0;
>  
> @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct 
> pt_regs *regs)
>  
>   remove &= rc;
>   }
> + current->utask->auprobe = NULL;
>  
>   if (need_prep && !remove)
>   prepare_uretprobe(uprobe, regs); /* put bp at return */
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: How to trace kvm:kvm_exit using fprobe?

2024-05-30 Thread Google
Hi don,

Thanks for reporting!
I confirmed that the raw tracepoint probe event does not work on
tracepoints defined in modules (not only kvm but also other modules).

Let me fix the issue.

Thank you,

On Tue, 21 May 2024 17:28:48 -0700
don  wrote:

> Hi,  Mr Masami Hiramatsu,
> I see your presentation "Function Parameters with BTF Function tracing with
> parameters",
> and I started using fprobe and dynamic_events to trace functions.
> 
> I have a question regarding tracepoints using fprobe/dynamic_events
> I can trace the kvm:kvm_exit event using trace-cmd:
> 
trace-cmd stream -e kvm:kvm_exit
> But if I echo to dynamic_event will get "Invalid argument" error:
> # cd /sys/kernel/debug/tracing
> 
# echo 't:kvm kvm_exit' >> dynamic_events
-bash: echo: write error: Invalid argument
> 
> How to solve this problem?
> 
> Thanks,
> don


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations

2024-05-28 Thread Google
Hi,

Sorry for the late reply.

On Fri, 10 May 2024 10:20:56 +0200
Vlastimil Babka  wrote:

> On 5/10/24 9:59 AM, wuqiang.matt wrote:
> > On 2024/5/7 21:55, Vlastimil Babka wrote:
>  >>
> >>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
> >>> +
> >>> + /* now the tail position is reserved for the given obj */
> >>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>> + /* update sequence to make this obj available for pop() */
> >>> + smp_store_release(&slot->last, tail + 1);
> >>> +
> >>> + return 0;
> >>> +}
> >>>   
> >>>   /**
> >>>* objpool_push() - reclaim the object and return back to objpool
> >>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
> >>>* return: 0 or error code (it fails only when user tries to push
> >>>* the same object multiple times or wrong "objects" into objpool)
> >>>*/
> >>> -int objpool_push(void *obj, struct objpool_head *pool);
> >>> +static inline int objpool_push(void *obj, struct objpool_head *pool)
> >>> +{
> >>> + unsigned long flags;
> >>> + int rc;
> >>> +
> >>> + /* disable local irq to avoid preemption & interruption */
> >>> + raw_local_irq_save(flags);
> >>> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >> 
> >> And IIUC, we could in theory objpool_pop() on one cpu, then later another
> >> cpu might do objpool_push() and cause the latter cpu's pool to go over
> >> capacity? Is there some implicit requirements of objpool users to take care
> >> of having matched cpu for pop and push? Are the current objpool users
> >> obeying this requirement? (I can see the selftests do, not sure about the
> >> actual users).
> >> Or am I missing something? Thanks.
> > 
> > The objects are all pre-allocated along with creation of the new objpool
> > and the total number of objects never exceeds the capacity on local node.
> 
> Aha, I see, the capacity of entries is enough to hold objects from all nodes
> in the most unfortunate case they all end up freed from a single cpu.
> 
> > So objpool_push() would always find an available slot from the ring-array
> > for the given object to insert back. objpool_pop() would try looping all
> > the percpu slots until an object is found or whole objpool is empty.
> 
> So it's correct, but seems rather wasteful to have the whole capacity for
> entries replicated on every cpu? It does make objpool_push() simple and
> fast, but as you say, objpool_pop() still has to search potentially all
> non-local percpu slots, with disabled irqs, which is far from ideal.

For the kretprobe/fprobe use case, it is important to push (return) objects
fast. We can reserve enough objects when registering, but the push
operation will always happen on a random CPU.

> 
> And the "abort if the slot was already full" comment for
> objpool_try_add_slot() seems still misleading? Maybe that was your initial
> idea but changed later?

Ah, it should not happen...

> 
> > Currently kretprobe is the only actual usecase of objpool.

Note that fprobe also uses this objpool, but I'm currently working on
integrating fprobe with the function-graph tracer[1], which will make fprobe
stop using objpool. I'm also planning to replace kretprobe with the new
fprobe eventually. So if SLUB will use objpool for front-end caching, that
sounds good to me (maybe it can speed up object allocation/free).

> > 
> > I'm testing an updated objpool in our HIDS project for critical pathes,
> > which is widely deployed on servers inside my company. The new version
> > eliminates the raw_local_irq_save and raw_local_irq_restore pair of
> > objpool_push and gains up to 5% of performance boost.
> 
> Mind Ccing me and linux-mm once you are posting that?

Can you add me too?

Thank you,

> 
> Thanks,
> Vlastimil
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-23 Thread Google
;   return 0;
> @@ -1031,10 +1032,13 @@ static void uretprobe_trace_func(struct trace_uprobe 
> *tu, unsigned long func,
>struct uprobe_cpu_buffer **ucbp)
>  {
>   struct event_file_link *link;
> + struct uprobe_cpu_buffer *ucb;
> +
> + ucb = prepare_uprobe_buffer(tu, regs, ucbp);
>  
>   rcu_read_lock();
>   trace_probe_for_each_link_rcu(link, &tu->tp)
> - __uprobe_trace_func(tu, func, regs, ucbp, link->file);
> + __uprobe_trace_func(tu, func, regs, ucb, link->file);
>   rcu_read_unlock();
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/2] Objpool performance improvements

2024-04-26 Thread Google
Hi Andrii,

On Wed, 24 Apr 2024 14:52:12 -0700
Andrii Nakryiko  wrote:

> Improve objpool (used heavily in kretprobe hot path) performance with two
> improvements:
>   - inlining performance critical objpool_push()/objpool_pop() operations;
>   - avoiding re-calculating relatively expensive nr_possible_cpus().

Thanks for optimizing objpool. Both looks good to me.

BTW, I don't intend to stop these short-term optimization attempts,
but I would like to ask you to check the new fgraph-based fprobe
(kretprobe-multi)[1] instead of objpool/rethook.

[1] 
https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/

I'm considering obsoleting kretprobe (and rethook) in favor of fprobe and
eventually removing it. They provide similar features, and we should
choose the safer one.

Thank you,

> 
> These opportunities were found when benchmarking and profiling kprobes and
> kretprobes with BPF-based benchmarks. See individual patches for details and
> results.
> 
> Andrii Nakryiko (2):
>   objpool: enable inlining objpool_push() and objpool_pop() operations
>   objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
> 
>  include/linux/objpool.h | 105 +++--
>  lib/objpool.c   | 112 +++-
>  2 files changed, 107 insertions(+), 110 deletions(-)
> 
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-20 Thread Google
On Fri, 19 Apr 2024 10:59:09 -0700
Andrii Nakryiko  wrote:

> On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu  wrote:
> >
> > On Thu, 18 Apr 2024 12:09:09 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > > that RCU is watching when trying to setup rethooko on a function entry.
> > >
> > > One notable exception when we force rcu_is_watching() check is
> > > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> > > old-style int3-based workflow instead of relying on ftrace, making RCU
> > > watching check important to validate.
> > >
> > > This further (in addition to improvements in the previous patch)
> > > improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> > > by 2.3%, according to BPF benchmarks ([0]).
> > >
> > >   [0] 
> > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> > >
> > > Signed-off-by: Andrii Nakryiko 
> >
> >
> > Thanks for update! This looks good to me.
> 
> Thanks, Masami! Will you take it through your tree, or you'd like to
> route it through bpf-next?

OK, let me take it through the linux-trace tree.

Thank you!

> 
> >
> > Acked-by: Masami Hiramatsu (Google) 
> >
> > Thanks,
> >
> > > ---
> > >  kernel/trace/rethook.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > > index fa03094e9e69..a974605ad7a5 100644
> > > --- a/kernel/trace/rethook.c
> > > +++ b/kernel/trace/rethook.c
> > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook 
> > > *rh)
> > >   if (unlikely(!handler))
> > >   return NULL;
> > >
> > > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> > > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
> > >   /*
> > >* This expects the caller will set up a rethook on a function 
> > > entry.
> > >* When the function returns, the rethook will eventually be 
> > > reclaimed
> > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook 
> > > *rh)
> > >*/
> > >   if (unlikely(!rcu_is_watching()))
> > >   return NULL;
> > > +#endif
> > >
> > >   return (struct rethook_node *)objpool_pop(>pool);
> > >  }
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Google
On Thu, 18 Apr 2024 12:09:09 -0700
Andrii Nakryiko  wrote:

> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to setup rethooko on a function entry.
> 
> One notable exception when we force rcu_is_watching() check is
> CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> old-style int3-based workflow instead of relying on ftrace, making RCU
> watching check important to validate.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 
> Signed-off-by: Andrii Nakryiko 


Thanks for the update! This looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..a974605ad7a5 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
>   return (struct rethook_node *)objpool_pop(>pool);
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-09 Thread Google
On Wed,  3 Apr 2024 15:03:27 -0700
Andrii Nakryiko  wrote:

> Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> control whether ftrace low-level code performs additional
> rcu_is_watching()-based validation logic in an attempt to catch noinstr
> violations.
> 
> This check is expected to never be true and is mostly useful for
> low-level validation of ftrace subsystem invariants. For most users it
> should probably be kept disabled to eliminate unnecessary runtime
> overhead.
> 
> This improves BPF multi-kretprobe (relying on ftrace and rethook
> infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 

This looks good to me :)

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Paul E. McKenney 
> Signed-off-by: Andrii Nakryiko 
> ---
>  include/linux/trace_recursion.h |  2 +-
>  kernel/trace/Kconfig| 13 +
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index d48cd92d2364..24ea8ac049b4 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> unsigned long parent_ip);
>  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
>  #endif
>  
> -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>  # define trace_warn_on_no_rcu(ip)\
>   ({  \
>   bool __ret = !rcu_is_watching();\
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 61c541c36596..7aebd1b8f93e 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> This file can be reset, but the limit can not change in
> size at runtime.
>  
> +config FTRACE_VALIDATE_RCU_IS_WATCHING
> + bool "Validate RCU is on during ftrace execution"
> + depends on FUNCTION_TRACER
> + depends on ARCH_WANTS_NO_INSTR
> + help
> +   All callbacks that attach to the function tracing have some sort of
> +   protection against recursion. This option is only to verify that
> +   ftrace (and other users of ftrace_test_recursion_trylock()) are not
> +   called outside of RCU, as if they are, it can cause a race. But it
> +   also has a noticeable overhead when enabled.
> +
> +   If unsure, say N
> +
>  config RING_BUFFER_RECORD_RECURSION
>   bool "Record functions that recurse in the ring buffer"
>   depends on FTRACE_RECORD_RECURSION
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-09 Thread Google
On Wed,  3 Apr 2024 15:03:28 -0700
Andrii Nakryiko  wrote:

> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to setup rethooko on a function entry.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 

Hi Andrii,

Can you make this part depend on !KPROBE_EVENTS_ON_NOTRACE? (With this
option, kretprobes can be used without ftrace, but with the original int3
mechanism.) This option should be set to N on production systems for
safety; it is just for testing raw kretprobes.

Thank you,

> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..15b8aa4048d9 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
>   return (struct rethook_node *)objpool_pop(>pool);
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-08 Thread Google
On Mon,  8 Apr 2024 10:51:40 -0700
Kui-Feng Lee  wrote:

> The function rethook_find_ret_addr() prints a warning message and returns 0
> when the target task is running and is not the "current" task in order to
> prevent the incorrect return address, although it still may return an
> incorrect address.
> 
> However, the warning message turns into noise when BPF profiling programs
> call bpf_get_task_stack() on running tasks in a firm with a large number of
> hosts.
> 
> The callers should be aware and willing to take the risk of receiving an
> incorrect return address from a task that is currently running other than
> the "current" one. A warning is not needed here as the callers are intent
> on it.
> 

OK, looks good to me. Let me pick it up for probes/for-next. Thanks!

> Acked-by: Andrii Nakryiko 
> Acked-by: John Fastabend 
> Signed-off-by: Kui-Feng Lee 
> 
> ---
> Changes from v1:
> 
>  - Rephrased the commit log.
> 
>- Removed the confusing last part of the first paragraph.
> 
>- Removed "frequently" from the 2nd paragraph, replaced by "a firm with
>  a large number of hosts".
> 
> v1: https://lore.kernel.org/all/20240401191621.758056-1-thinker...@gmail.com/
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..4297a132a7ae 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
> *tsk, unsigned long frame
>   if (WARN_ON_ONCE(!cur))
>   return 0;
>  
> - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> + if (tsk != current && task_is_running(tsk))
>   return 0;
>  
>   do {
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-07 Thread Google
On Wed, 3 Apr 2024 16:36:25 +0200
Daniel Borkmann  wrote:

> On 4/2/24 6:58 PM, Andrii Nakryiko wrote:
> > On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee  wrote:
> >>
> >> rethook_find_ret_addr() prints a warning message and returns 0 when the
> >> target task is running and not the "current" task to prevent returning an
> >> incorrect return address. However, this check is incomplete as the target
> >> task can still transition to the running state when finding the return
> >> address, although it is safe with RCU.

Could you tell me more about this last part? This change just removes the
WARN_ON_ONCE() which warns that the user is trying to unwind the stack of a
running task. This means the task can change the stack in parallel if it is
running on another CPU.
Does BPF stop the task, or do you have some RCU magic to copy the stack?

> >>
> >> The issue we encounter is that the kernel frequently prints warning
> >> messages when BPF profiling programs call to bpf_get_task_stack() on
> >> running tasks.

Hmm, WARN_ON_ONCE should print it once, not frequently.

> >>
> >> The callers should be aware and willing to take the risk of receiving an
> >> incorrect return address from a task that is currently running other than
> >> the "current" one. A warning is not needed here as the callers are intent
> >> on it.
> >>
> >> Signed-off-by: Kui-Feng Lee 
> >> ---
> >>   kernel/trace/rethook.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> >> index fa03094e9e69..4297a132a7ae 100644
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
> >> *tsk, unsigned long frame
> >>  if (WARN_ON_ONCE(!cur))
> >>  return 0;
> >>
> >> -   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> >> +   if (tsk != current && task_is_running(tsk))
> >>  return 0;
> >>
> > 
> > This should probably go through Masami's tree, but the change makes
> > sense to me, given this is an expected condition.
> > 
> > Acked-by: Andrii Nakryiko 
> 
> Masami, I assume you'll pick this up?

OK, anyway it will just return 0 if this situation happens, and the caller will
get the trampoline address instead of the correct return address in this case.
I don't think it does anything unsafe, so I agree with removing it.
But I think the explanation is a bit confusing.

Thank you,

> 
> Thanks,
> Daniel


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-05 Thread Google
On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko  wrote:

> On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
>  wrote:
> >
> > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> > >
> > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > >
> > > > OK, for me, this last sentence is preferred for the help message. That 
> > > > explains
> > > > what this is for.
> > > >
> > > > All callbacks that attach to the function tracing have some sort
> > > > of protection against recursion. This option is only to verify 
> > > > that
> > > >    ftrace (and other users of ftrace_test_recursion_trylock()) are 
> > > >not
> > > > called outside of RCU, as if they are, it can cause a race.
> > > > But it also has a noticeable overhead when enabled.
> >
> > Sounds good to me, I can add this to the description of the Kconfig option.
> >
> > > >
> > > > BTW, how much overhead does this introduce? and the race case a kernel 
> > > > crash?
> >
> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Thanks for checking the overhead! Hmm, it is considerable.

> > > > or just messed up the ftrace buffer?
> > >
> > > There's a hypothetical race where it can cause a use after free.

Hmm, so it might not lead to a kernel crash, but it is better to enable it
together with other debugging options.

> > >
> > > That is, after you shutdown ftrace, you need to call 
> > > synchronize_rcu_tasks(),
> > > which requires RCU to be watching. There's a theoretical case where that
> > > task calls the trampoline and misses the synchronization. Note, these
> > > locations are with preemption disabled, as rcu is always watching when
> > > preemption is enabled. Thus it would be extremely fast where as the
> > > synchronize_rcu_tasks() is rather slow.
> > >
> > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > > have RCU watching.
> > >
> > > I have never heard of this race being hit. I guess it could happen on a VM
> > > where a vCPU gets preempted at the right moment for a long time and the
> > > other CPUs synchronize.
> > >
> > > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > > for production.
> > >
> > > The overhead is another function call within the function tracer. I had
> > > numbers before, but I guess I could run tests again and get new numbers.
> > >
> >
> > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > which seems to be a similar and complementary validation check to the
> > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > in this patch. It feels like both of them should be controlled by the
> > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > guard around rcu_is_watching() check in rethook_try_get() as well?

Hmmm, no, I think it should not change the rethook side, because rethook
can be used with kprobes without ftrace. If we can detect that it is used
from ftrace, we can skip it. (For this reason, I would like to remove the
return probe from kprobes...)

Thank you,

> >
> >
> > > Thanks,
> > >
> > > -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Google
On Mon, 1 Apr 2024 22:47:33 -0400
Steven Rostedt  wrote:

> On Mon, 1 Apr 2024 19:29:46 -0700
> Andrii Nakryiko  wrote:
> 
> > On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
> > >
> > > On Mon, 1 Apr 2024 12:09:18 -0400
> > > Steven Rostedt  wrote:
> > >  
> > > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > > Masami Hiramatsu (Google)  wrote:
> > > >  
> > > > > > Masami,
> > > > > >
> > > > > > Are you OK with just keeping it set to N.  
> > > > >
> > > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > > >  
> > > > > >
> > > > > > We could have other options like PROVE_LOCKING enable it.  
> > > > >
> > > > > Agreed (but it should say this is a debug option)  
> > > >
> > > > It does say "Validate" which to me is a debug option. What would you
> > > > suggest?  
> > >
> > > I think the help message should have "This is for debugging ftrace."
> > >  
> > 
> > Sent v2 with adjusted wording, thanks!
> 
> You may want to wait till Masami and I agree ;-)
> 
> Masami,
> 
> But it isn't really for "debugging", it's for validating. That is, it
> doesn't give us any information to debug ftrace. It only validates if it is
> executed properly. In other words, I never want to be asked "How can I use
> this option to debug ftrace?"
> 
> For example, we also have:
> 
>   "Verify ring buffer time stamp deltas"
> 
> that makes sure the time stamps of the ring buffer are not buggy.
> 
> And there's:
> 
>   "Verify compile time sorting of ftrace functions"
> 
> Which is also used to make sure things are working properly.
> 
> Neither of the above says they are for "debugging", even though they are
> more useful for debugging than this option.
> 
> I'm not sure saying this is "debugging ftrace" is accurate. It may help
> debug ftrace if it is called outside of an RCU location, which has a
> 1 in 100,000,000,000 chance of causing an actual bug, as the race window is
> extremely small. 
> 
> Now if it is also called outside of instrumentation, that will likely trigger
> other warnings even without this code, and this will not be needed to debug
> that.
> 
> ftrace has all sorts of "verifiers" that is used to make sure things are
> working properly. And yes, you can consider it as "debugging". But I would
> not consider this an option to enable if ftrace was broken, and you are
> looking into why it is broken.
> 
> To me, this option is only to verify that ftrace (and other users of
> ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
> are, it can cause a race. But it also has a noticeable overhead when enabled.

OK, for me, this last sentence is preferred for the help message. That explains
what this is for. 

  All callbacks that attach to the function tracing have some sort
  of protection against recursion. This option is only to verify that
  ftrace (and other users of ftrace_test_recursion_trylock()) are not
  called outside of RCU, as if they are, it can cause a race.
  But it also has a noticeable overhead when enabled.

BTW, how much overhead does this introduce? And does the race cause a kernel
crash, or just mess up the ftrace buffer?

Thank you,

> 
> -- Steve
> 
> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Mon, 1 Apr 2024 12:09:18 -0400
Steven Rostedt  wrote:

> On Mon, 1 Apr 2024 20:25:52 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > Masami,
> > > 
> > > Are you OK with just keeping it set to N.  
> > 
> > OK, if it is only for the debugging, I'm OK to set N this.
> > 
> > > 
> > > We could have other options like PROVE_LOCKING enable it.  
> > 
> > Agreed (but it should say this is a debug option)
> 
> It does say "Validate" which to me is a debug option. What would you
> suggest?

I think the help message should have "This is for debugging ftrace."

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Tue, 26 Mar 2024 15:01:21 -0400
Steven Rostedt  wrote:

> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko  wrote:
> 
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >  
> > 
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
> 
> Masami,
> 
> Are you OK with just keeping it set to N.

OK, if it is only for debugging, I'm OK with setting this to N.

> 
> We could have other options like PROVE_LOCKING enable it.

Agreed (but it should say this is a debug option)

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Google
On Fri, 22 Mar 2024 09:03:23 -0700
Andrii Nakryiko  wrote:

> Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> control whether ftrace low-level code performs additional
> rcu_is_watching()-based validation logic in an attempt to catch noinstr
> violations.
> 
> This check is expected to never be true in practice and would be best
> controlled with extra config to let users decide if they are willing to
> pay the price.

Hmm, to me this sounds like "WARN_ON(something) will never be true in practice,
so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
is OK, but it should be set to Y by default. If you have already verified
that your system never makes it true and you want to optimize your ftrace
path, you can manually set it to N at your own risk.

> 
> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Paul E. McKenney 
> Signed-off-by: Andrii Nakryiko 
> ---
>  include/linux/trace_recursion.h |  2 +-
>  kernel/trace/Kconfig| 13 +
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index d48cd92d2364..24ea8ac049b4 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> unsigned long parent_ip);
>  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
>  #endif
>  
> -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>  # define trace_warn_on_no_rcu(ip)\
>   ({  \
>   bool __ret = !rcu_is_watching();\

BTW, maybe we can add "unlikely" in the next "if" line?

> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 61c541c36596..19bce4e217d6 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> This file can be reset, but the limit can not change in
> size at runtime.
>  
> +config FTRACE_VALIDATE_RCU_IS_WATCHING
> + bool "Validate RCU is on during ftrace recursion check"
> + depends on FUNCTION_TRACER
> + depends on ARCH_WANTS_NO_INSTR

default y

> + help
> +   All callbacks that attach to the function tracing have some sort
> +   of protection against recursion. This option performs additional
> +   checks to make sure RCU is on when ftrace callbacks recurse.
> +
> +   This will add more overhead to all ftrace-based invocations.

... invocations, but keep it safe.

> +
> +   If unsure, say N

If unsure, say Y
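
Putting these suggestions together, the entry would look roughly like this
(just a sketch of the requested wording, not a formal patch):

config FTRACE_VALIDATE_RCU_IS_WATCHING
	bool "Validate RCU is on during ftrace recursion check"
	depends on FUNCTION_TRACER
	depends on ARCH_WANTS_NO_INSTR
	default y
	help
	  All callbacks that attach to the function tracing have some sort
	  of protection against recursion. This option performs additional
	  checks to make sure RCU is on when ftrace callbacks recurse.

	  This will add more overhead to all ftrace-based invocations,
	  but keep it safe.

	  If unsure, say Y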

Thank you,

> +
>  config RING_BUFFER_RECORD_RECURSION
>   bool "Record functions that recurse in the ring buffer"
>   depends on FTRACE_RECORD_RECURSION
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Google
On Wed, 20 Mar 2024 09:27:49 -0400
Steven Rostedt  wrote:

> On Wed, 20 Mar 2024 17:10:38 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > Fix to initialize 'val' local variable with zero.
> > Dan reported that Smatch static code checker reports an error that a local
> > 'val' variable needs to be initialized. Actually, the 'val' is expected to
> > be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> > initialize it with zero.
> 
> BTW, that loop should really have a comment stating that FETCH_OP_ARG is
> expected to happen before FETCH_OP_ST_EDATA.

Indeed, OK, let me add it.

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



[PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.

Reported-by: Dan Carpenter 
Closes: 
https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and 
fprobe)")
Signed-off-by: Masami Hiramatsu (Google) 
---
 kernel/trace/trace_probe.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 217169de0920..dfe3ee6035ec 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -839,7 +839,7 @@ int traceprobe_get_entry_data_size(struct trace_probe *tp)
 void store_trace_entry_data(void *edata, struct trace_probe *tp, struct 
pt_regs *regs)
 {
struct probe_entry_arg *earg = tp->entry_arg;
-   unsigned long val;
+   unsigned long val = 0;
int i;
 
if (!earg)




Re: [bug report] tracing/probes: Support $argN in return probe (kprobe and fprobe)

2024-03-19 Thread Google
On Tue, 19 Mar 2024 10:10:00 -0400
Steven Rostedt  wrote:

> On Tue, 19 Mar 2024 10:19:09 +0300
> Dan Carpenter  wrote:
> 
> > Hello Masami Hiramatsu (Google),
> > 
> > Commit 25f00e40ce79 ("tracing/probes: Support $argN in return probe
> > (kprobe and fprobe)") from Mar 4, 2024 (linux-next), leads to the
> > following Smatch static checker warning:
> > 
> > kernel/trace/trace_probe.c:856 store_trace_entry_data()
> > error: uninitialized symbol 'val'.
> > 
> > kernel/trace/trace_probe.c
> > 846 return;
> > 847 
> > 848 for (i = 0; i < earg->size; i++) {
> > 849 struct fetch_insn *code = >code[i];
> > 850 
> > 851 switch (code->op) {
> > 852 case FETCH_OP_ARG:
> > 853 val = regs_get_kernel_argument(regs, 
> > code->param);
> > 854 break;
> > 855 case FETCH_OP_ST_EDATA:
> > --> 856 *(unsigned long *)((unsigned long)edata + 
> > code->offset) = val;  
> > 
> > Probably the earg->code[i] always has FETCH_OP_ARG before
> > FETCH_OP_ST_EDATA but Smatch isn't smart enough to figure that out...
> 
> Looks that way:
> 
>   case FETCH_OP_END:
>   earg->code[i].op = FETCH_OP_ARG;
>   earg->code[i].param = argnum;
>   earg->code[i + 1].op = FETCH_OP_ST_EDATA;
>   earg->code[i + 1].offset = offset;
>   return offset;
> 
> But probably should still initialize val to zero or have a WARN_ON() if
> that doesn't happen.

OK, let's set val = 0 in store_trace_entry_data(), but a WARN_ON() in this loop
would be a bit strange. I think we should have a verifier instead.
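
Something like the following is the kind of verifier I mean -- only a
sketch against the current probe_entry_arg layout, run once when the probe
is defined instead of in the hot path:

static int verify_entry_arg_code(struct probe_entry_arg *earg)
{
	bool arg_loaded = false;
	int i;

	for (i = 0; i < earg->size; i++) {
		switch (earg->code[i].op) {
		case FETCH_OP_ARG:
			arg_loaded = true;
			break;
		case FETCH_OP_ST_EDATA:
			if (!arg_loaded)
				return -EINVAL;	/* store with no fetched value */
			break;
		case FETCH_OP_END:
			return 0;
		default:
			break;
		}
	}
	return 0;
}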

Thank you,

> 
> -- Steve
> 
> 
> > 
> > 857 break;
> > 858 case FETCH_OP_END:
> > 859     goto end;
> > 860 default:
> > 861 break;
> > 862 }
> > 863 }
> > 864 end:
> > 865 return;
> > 866 }
> > 
> > regards,
> > dan carpenter
> 


-- 
Masami Hiramatsu (Google) 



Re: tprobe event tracing error

2024-02-26 Thread Google
Hi,
(Cc: linux-trace-kernel ML for sharing this knowledge)

On Mon, 26 Feb 2024 16:36:29 +0300
Максим Морсков  wrote:

> 
> Hello, dear Masami.
> I am researching Linux event tracing subsystem in part of tprobes,
> and found interesting behavior in kernel version 6.6:
>  
> echo  't:my_fchmodat sys_enter_fchmodat' | sudo tee 
> ‘/sys/kernel/tracing/dynamic_events’
> bash: line 1: echo: write error: Invalid argument

Yeah, I understand why this behavior is confusing, but it is actually
expected. The syscalls:* events look like trace events based on tracepoints,
but they are software-generated trace events.

You can find the raw_syscalls:* trace events, which are based on real
tracepoints; the other syscalls:* events are built on top of those
raw_syscalls:* tracepoints. (IOW, they are a kind of pre-compiled dynamic
events.)

e.g.

/sys/kernel/tracing # echo "t sys_enter \$arg*" >> dynamic_events 
/sys/kernel/tracing # cat dynamic_events 
t:tracepoints/sys_enter sys_enter regs=regs id=id

/sys/kernel/tracing # echo "t sys_enter_open \$arg*" >> dynamic_events 
sh: write error: Invalid argument
/sys/kernel/tracing # cat error_log 
[  227.981347] trace_fprobe: error: Tracepoint is not found
  Command: t sys_enter_open $arg*
 ^

So, tprobe cannot find hard-coded tracepoints for those dynamically
generated syscall trace events, but the raw_syscalls sys_enter/sys_exit
tracepoints work fine.
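
For the fchmodat case, one workaround is to put the tprobe on the raw
sys_enter tracepoint and filter by syscall number. As a sketch (268 is
__NR_fchmodat on x86-64; the number is architecture dependent, so please
check asm/unistd.h on your target):

/sys/kernel/tracing # echo "t:my_fchmodat sys_enter \$arg*" >> dynamic_events
/sys/kernel/tracing # echo 'id == 268' > events/tracepoints/my_fchmodat/filter
/sys/kernel/tracing # echo 1 > events/tracepoints/my_fchmodat/enable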

Thank you,

>  
> sudo cat /sys/kernel/tracing/error_log
> trace_fprobe: error: Tracepoint is not found 
>  Command: t:my_fchmodat sys_enter_fchmodat
>                                                     ^
>  
> But such tracepoint ( sys_enter_fchmodat ) exists in system:
> sudo perf list | grep chmod
>  syscalls:sys_enter_chmod [Tracepoint event]
>  syscalls:sys_enter_fchmod [Tracepoint event]
>  syscalls:sys_enter_fchmodat [Tracepoint event]
>  syscalls:sys_enter_fchmodat2 [Tracepoint event]
>  syscalls:sys_exit_chmod [Tracepoint event]
>  syscalls:sys_exit_fchmod [Tracepoint event]
>  syscalls:sys_exit_fchmodat [Tracepoint event]
>  syscalls:sys_exit_fchmodat2 [Tracepoint event]
>  
> sudo ls -lha /sys/kernel/tracing/events/syscalls/ | grep chmod
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_enter_chmod
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_enter_fchmod
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_enter_fchmodat
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_enter_fchmodat2
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_exit_chmod
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_exit_fchmod
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_exit_fchmodat
> drwxr-xr-x 1 root root 0 Feb 26 16:07 sys_exit_fchmodat2
>  
> My kernel config in the attachment.
> I noticed that you are developing the event tracing tprobe subsystem.
> Maybe you could explain this behavior?
> Thank you in advance!
>  
> --
> Best regards
> Maksim Morskov

-- 
Masami Hiramatsu (Google) 



Re: Sleepable kprobes

2023-11-21 Thread Google
(Cc: linux-trace-kernel@vger.kernel.org)
Hello Tomer,

Sorry, I was busy with LPC23 for the last few weeks.

As far as I can see (and IIUC), this hacks ftrace's fentry to just call the
probe? (BTW, that is a very x86-specific feature.)

In that case, unlike kprobes, it does not probe the function body, so it is
not really a kprobe, and it cannot work together with ftrace and fgraph.

Moreover, since the probe is not called from a fixed point, the probed
function can be called inside a non-preemptible section. Thus it is not
generically able to sleep.

If you need to do I/O inside the kernel, you need to do it carefully.
I recommend using deferred work (a workqueue or a separate thread) to do the
I/O asynchronously. That will be a safer and more generic way.
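
For example, here is a rough sketch of that pattern as a small module
(all names are illustrative, not an existing API; the probed symbol is
just an example):

#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

static struct workqueue_struct *probe_wq;

struct probe_work {
	struct work_struct work;
	unsigned long ip;
};

static void probe_work_fn(struct work_struct *work)
{
	struct probe_work *pw = container_of(work, struct probe_work, work);

	/* process context: sleeping / blocking I/O is allowed here */
	pr_info("deferred handling for probe hit at %pS\n", (void *)pw->ip);
	kfree(pw);
}

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* probe context: cannot sleep, so only record the data and queue it */
	struct probe_work *pw = kmalloc(sizeof(*pw), GFP_ATOMIC);

	if (pw) {
		pw->ip = instruction_pointer(regs);
		INIT_WORK(&pw->work, probe_work_fn);
		queue_work(probe_wq, &pw->work);
	}
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "kernel_clone",	/* illustrative probe target */
	.pre_handler = handler_pre,
};

static int __init deferred_probe_init(void)
{
	int ret;

	probe_wq = alloc_workqueue("deferred_probe", WQ_UNBOUND, 0);
	if (!probe_wq)
		return -ENOMEM;

	ret = register_kprobe(&kp);
	if (ret)
		destroy_workqueue(probe_wq);
	return ret;
}

static void __exit deferred_probe_exit(void)
{
	unregister_kprobe(&kp);
	destroy_workqueue(probe_wq);	/* flushes any pending work */
}

module_init(deferred_probe_init);
module_exit(deferred_probe_exit);
MODULE_LICENSE("GPL");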
Also, a new feature needs to coexist with the existing features in a
maintainable way. I mean that using the ftrace fentry directly is not
recommended because it breaks ftrace.

Anyway, what you need to do is start a discussion on the mailing list
instead of pushing your change: explain 'why' I/O is needed inside the
probe callback, and propose your solution. (My suggestion is above, but
maybe other people have other opinions.)

Thank you,


On Thu, 16 Nov 2023 05:38:57 +
tomer samara  wrote:

> Hi,
> Just bumping up this message in case got lost,
> 
> Thanks,
> Tomer
> 
> 
> 
> From: tomer samara 
> Sent: Sunday, November 5, 2023 5:42 PM
> To: naveen.n@linux.ibm.com ; 
> anil.s.keshavamur...@intel.com ; 
> da...@davemloft.net ; mhira...@kernel.org 
> 
> Subject: Sleepable kprobes
> 
> Hi,
> 
> My name is Tomer and in the last few weeks I worked on a side project that 
> will create a sleepable kprobes.
> As you know kprobes is using PERCPU variables and can’t schedule or sleep.
> The motivation behind this is to allow the user to preform operation like I/O 
> operations, put the task in sleep so it will be inspected in user-mode with 
> easier tools or just call kmalloc without GFP_NOWAIT.
> 
> The code is very similar to parts kprobe itself since the idea is the same.
> You can see the full code here: https://github.com/Tsn0w/frogprobe
> The core logic is under src/frogprobe.c while other are just helpers.
> If you think this can be a good change to the current kprobe or as a 
> different subsystem, what can I push this? (beside support more archs, 
> support LKM symbols, blacklist symbols, …).
> 
> Thanks for your time and looking forward to your reply,
> Tomer Samara,


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 3/3] locking/atomic: openrisc: use generic_cmpxchg[64]_local for arch_cmpxchg[64]_local

2023-10-28 Thread Google
Hi Wuqiang,

On Thu, 26 Oct 2023 19:05:51 +0800
"wuqiang.matt"  wrote:

> On 2023/10/26 16:46, Arnd Bergmann wrote:
> > On Thu, Oct 26, 2023, at 09:39, wuqiang.matt wrote:
> >> arch_cmpxchg[64]_local() are not defined for openrisc. So implement
> >> them with generic_cmpxchg[64]_local, as advised by Masami Hiramatsu.
> >>
> >> Closes:
> >> https://lore.kernel.org/linux-trace-kernel/169824660459.24340.14614817132696360531.stgit@devnote2
> >> Closes:
> >> https://lore.kernel.org/oe-kbuild-all/202310241310.ir5uukog-...@intel.com
> >>
> >> Signed-off-by: wuqiang.matt 
> > 
> > I think on architectures that have actual atomics, you
> > generally want to define this to be the same as arch_cmpxchg()
> > rather than the generic version.
> > 
> > It depends on the relative cost of doing one atomic compared
> > to an irq-disable/enable pair, but everyone else went with
> > the former if they could. The exceptions are armv4/armv5,
> > sparc32 and parisc, which don't have a generic cmpxchg()
> > or similar operation.
> 
> Sure, better native than the generic. I'll try to collect more
> insights before next move.

So I will temporarily remove the last change (using arch_cmpxchg_local
in objpool) until this series is rewritten with arch-native code, so
that the next release will not break the kernel build.

But this must be fixed eventually, because arch_cmpxchg_local() is
required for each arch anyway.
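
For reference, until the arch-native version is ready, the minimal
direction would be something like this sketch for
arch/openrisc/include/asm/cmpxchg.h, built on the asm-generic fallbacks
(as Arnd noted, mapping these to arch_cmpxchg() is usually preferable when
the arch has real atomics):

#include <asm-generic/cmpxchg-local.h>

#define arch_cmpxchg_local(ptr, o, n)					\
	((__typeof__(*(ptr)))__generic_cmpxchg_local((ptr),		\
						     (unsigned long)(o),\
						     (unsigned long)(n),\
						     sizeof(*(ptr))))

#define arch_cmpxchg64_local(ptr, o, n)					\
	__generic_cmpxchg64_local((ptr), (o), (n))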

> 
> > You could do the thing that sparc64 and xtensa do, which
> > use the native cmpxchg for supported word sizes but the
> > generic version for 1- and 2-byte swaps, but that has its
> > own set of problems if you end up doing operations on both
> > the entire word and a sub-unit of the same thing.
> 
> Thank you for pointing out this. I'll do some research on these
> implementations.

arc also has LL/SC instructions, though that depends on the core
configuration, so I think we can use them.

Thank you,

> 
> >  Arnd
> 
> Regards,
> wuqiang
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Google
On Sat, 28 Oct 2023 10:41:44 +0900
Masami Hiramatsu (Google)  wrote:

> Hi,
> 
> On Fri, 27 Oct 2023 16:31:26 -0700
> Andrii Nakryiko  wrote:
> 
> > Recent changes to count number of matching symbols when creating
> > a kprobe event failed to take into account kernel modules. As such, it
> > breaks kprobes on kernel module symbols, by assuming there is no match.
> > 
> > Fix this my calling module_kallsyms_on_each_symbol() in addition to
> > kallsyms_on_each_match_symbol() to perform a proper counting.
> > 
> > Cc: Francis Laniel 
> > Cc: sta...@vger.kernel.org
> > Cc: Masami Hiramatsu 
> > Cc: Steven Rostedt 
> > Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func 
> > matches several symbols")
> > Signed-off-by: Andrii Nakryiko 
> 
> This fixes the "BPF" kprobe event, but not ftrace kprobe events.
> ftrace kprobe events only check this for symbols in vmlinux, not modules,
> which is why my selftest passed with the original change.

Ah, my bad. The ftrace kprobe event should accept the target symbol without
a module name, and in that case it indeed missed the count.

> Hmm, I need another enhancement like this for the events on offline
> modules.

Also, I need to add another test case for the form without a module name.
(That one should be a separate fix.)
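
For example, both of the following forms should create an event on the
same module symbol, and the duplicate-symbol counting has to cover the
second form as well (btrfs_sync_file is just an illustrative module
symbol):

echo 'p:myevent1 btrfs:btrfs_sync_file' >> /sys/kernel/tracing/kprobe_events
echo 'p:myevent2 btrfs_sync_file' >> /sys/kernel/tracing/kprobe_events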

Thank you,

> 
> Thank you,
> 
> > ---
> >  kernel/trace/trace_kprobe.c | 24 
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index effcaede4759..1efb27f35963 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long 
> > unused)
> > return 0;
> >  }
> >  
> > +struct sym_count_ctx {
> > +   unsigned int count;
> > +   const char *name;
> > +};
> > +
> > +static int count_mod_symbols(void *data, const char *name, unsigned long 
> > unused)
> > +{
> > +   struct sym_count_ctx *ctx = data;
> > +
> > +   if (strcmp(name, ctx->name) == 0)
> > +   ctx->count++;
> > +
> > +   return 0;
> > +}
> > +
> >  static unsigned int number_of_same_symbols(char *func_name)
> >  {
> > -   unsigned int count;
> > +   struct sym_count_ctx ctx = { .count = 0, .name = func_name };
> > +
> > +   kallsyms_on_each_match_symbol(count_symbols, func_name, );
> >  
> > -   count = 0;
> > -   kallsyms_on_each_match_symbol(count_symbols, func_name, );
> > +   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, );
> >  
> > -   return count;
> > +   return ctx.count;
> >  }
> >  
> >  static int __trace_kprobe_create(int argc, const char *argv[])
> > -- 
> > 2.34.1
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Google
Hi,

On Fri, 27 Oct 2023 16:31:26 -0700
Andrii Nakryiko  wrote:

> Recent changes to count number of matching symbols when creating
> a kprobe event failed to take into account kernel modules. As such, it
> breaks kprobes on kernel module symbols, by assuming there is no match.
> 
> Fix this my calling module_kallsyms_on_each_symbol() in addition to
> kallsyms_on_each_match_symbol() to perform a proper counting.
> 
> Cc: Francis Laniel 
> Cc: sta...@vger.kernel.org
> Cc: Masami Hiramatsu 
> Cc: Steven Rostedt 
> Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches 
> several symbols")
> Signed-off-by: Andrii Nakryiko 

This fixes the "BPF" kprobe event, but not ftrace kprobe events.
ftrace kprobe events only check this for symbols in vmlinux, not modules,
which is why my selftest passed with the original change.
Hmm, I need another enhancement like this for events on offline
modules.

Thank you,

> ---
>  kernel/trace/trace_kprobe.c | 24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index effcaede4759..1efb27f35963 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long 
> unused)
>   return 0;
>  }
>  
> +struct sym_count_ctx {
> + unsigned int count;
> + const char *name;
> +};
> +
> +static int count_mod_symbols(void *data, const char *name, unsigned long 
> unused)
> +{
> + struct sym_count_ctx *ctx = data;
> +
> + if (strcmp(name, ctx->name) == 0)
> + ctx->count++;
> +
> + return 0;
> +}
> +
>  static unsigned int number_of_same_symbols(char *func_name)
>  {
> - unsigned int count;
> + struct sym_count_ctx ctx = { .count = 0, .name = func_name };
> +
> + kallsyms_on_each_match_symbol(count_symbols, func_name, );
>  
> - count = 0;
> - kallsyms_on_each_match_symbol(count_symbols, func_name, );
> + module_kallsyms_on_each_symbol(NULL, count_mod_symbols, );
>  
> - return count;
> + return ctx.count;
>  }
>  
>  static int __trace_kprobe_create(int argc, const char *argv[])
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Google
On Fri, 27 Oct 2023 16:37:29 -0700
Song Liu  wrote:

> On Fri, Oct 27, 2023 at 4:31 PM Andrii Nakryiko  wrote:
> >
> > Recent changes to count number of matching symbols when creating
> > a kprobe event failed to take into account kernel modules. As such, it
> > breaks kprobes on kernel module symbols, by assuming there is no match.
> >
> > Fix this my calling module_kallsyms_on_each_symbol() in addition to
> > kallsyms_on_each_match_symbol() to perform a proper counting.
> >
> > Cc: Francis Laniel 
> > Cc: sta...@vger.kernel.org
> > Cc: Masami Hiramatsu 
> > Cc: Steven Rostedt 
> > Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func 
> > matches several symbols")
> > Signed-off-by: Andrii Nakryiko 
> 
> Acked-by: Song Liu 

Good catch! Thanks!



-- 
Masami Hiramatsu (Google) 



Re: [PATCH v6 0/2] Return EADDRNOTAVAIL when func matches several symbols during kprobe creation

2023-10-20 Thread Google
on of a previous RFC which 
> intended
> to install a PMU kprobe for all matching symbols [1, 2].
> 
> If you see any way to improve this contribution, please share your feedback.
> 
> Changes since:
>  v1:
>   * Use EADDRNOTAVAIL instead of adding a new error code.
>   * Correct also this behavior for sysfs kprobe.
>  v2:
>   * Count the number of symbols corresponding to function name and return
>   EADDRNOTAVAIL if higher than 1.
>   * Return ENOENT if above count is 0, as it would be returned later by while
>   registering the kprobe.
>  v3:
>   * Check symbol does not contain ':' before testing its uniqueness.
>   * Add a selftest to check this is not possible to install a kprobe for a non
>   unique symbol.
>  v5:
>   * No changes, just add linux-stable as recipient.
>  v6:
>   * Rephrase commit message.
>   * Add "Cc:" to stable.
> 
> Francis Laniel (2):
>   tracing/kprobes: Return EADDRNOTAVAIL when func matches several
> symbols
>   selftests/ftrace: Add new test case which checks non unique symbol
> 
>  kernel/trace/trace_kprobe.c   | 63 +++
>  kernel/trace/trace_probe.h|  1 +
>  .../test.d/kprobe/kprobe_non_uniq_symbol.tc   | 13 
>  3 files changed, 77 insertions(+)
>  create mode 100644 
> tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
> 
> Best regards and thank you in advance.
> ---
> [1]: 
> https://lore.kernel.org/lkml/20230816163517.112518-1-flan...@linux.microsoft.com/
> [2]: 
> https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea...@kernel.org/
> --
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 0/2] Return EADDRNOTAVAIL when func matches several symbols during kprobe creation

2023-10-19 Thread Google
On Thu, 19 Oct 2023 09:51:04 -0400
Steven Rostedt  wrote:

> On Thu, 19 Oct 2023 21:18:43 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > So why is this adding stable? (and as Greg's form letter states, that's 
> > > not
> > > how you do that)
> > > 
> > > I don't see this as a fix but a new feature.  
> > 
> > I asked him to make this a fix since the current kprobe event' behavior is
> > somewhat strange. It puts the probe on only the "first symbol" if user
> > specifies a symbol name which has multiple instances. In this case, the
> > actual probe address can not be solved by name. User must specify the
> > probe address by unique name + offset. Unless, it can put a probe on
> > unexpected address, especially if it specifies non-unique symbol + offset,
> > the address may NOT be the instruction boundary.
> > To avoid this issue, it should check the given symbol is unique.
> >
> 
> OK, so what is broken is that when you add a probe to a function that has
> multiple names, it will attach to the first one and not necessarily the one
> you want.
> 
> The change log needs to be more explicit in what the "bug" is. It does
> state this in a round about way, but it is written in a way that it doesn't
> stand out.
> 
> Previously to this commit, if func matches several symbols, a kprobe,
> being either sysfs or PMU, would only be installed for the first
> matching address. This could lead to some misunderstanding when some
> BPF code was never called because it was attached to a function which
> was indeed not called, because the effectively called one has no
> kprobes attached.
> 
> So, this commit returns EADDRNOTAVAIL when func matches several
> symbols. This way, user needs to use address to remove the ambiguity.
> 
> 
> What it should say is:
> 
> When a kprobe is attached to a function that's name is not unique (is
> static and shares the name with other functions in the kernel), the
> kprobe is attached to the first function it finds. This is a bug as the
> function that it is attaching to is not necessarily the one that the
> user wants to attach to.
> 
> Instead of blindly picking a function to attach to what is ambiguous,
> error with EADDRNOTAVAIL to let the user know that this function is not
> unique, and that the user must use another unique function with an
> address offset to get to the function they want to attach to.
> 

Great! Yes, this looks good to me too.

> And yes, it should have:
> 
> Cc: sta...@vger.kernel.org
> 
> which is how to mark something for stable, and
> 
> Fixes: ...
> 
> To the commit that caused the bug.

Yes, this should be the first one.

Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer")

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 0/2] Return EADDRNOTAVAIL when func matches several symbols during kprobe creation

2023-10-19 Thread Google
On Wed, 18 Oct 2023 13:00:42 -0400
Steven Rostedt  wrote:

> On Wed, 18 Oct 2023 17:40:28 +0300
> Francis Laniel  wrote:
> 
> > Changes since:
> >  v1:
> >   * Use EADDRNOTAVAIL instead of adding a new error code.
> >   * Correct also this behavior for sysfs kprobe.
> >  v2:
> >   * Count the number of symbols corresponding to function name and return
> >   EADDRNOTAVAIL if higher than 1.
> >   * Return ENOENT if above count is 0, as it would be returned later by 
> > while
> >   registering the kprobe.
> >  v3:
> >   * Check symbol does not contain ':' before testing its uniqueness.
> >   * Add a selftest to check this is not possible to install a kprobe for a 
> > non
> >   unique symbol.
> >  v5:
> >   * No changes, just add linux-stable as recipient.
> 
> So why is this adding stable? (and as Greg's form letter states, that's not
> how you do that)
> 
> I don't see this as a fix but a new feature.

I asked him to make this a fix since the current kprobe event behavior is
somewhat strange. It puts the probe only on the "first symbol" if the user
specifies a symbol name which has multiple instances. In this case, the
actual probe address cannot be resolved by name; the user must specify the
probe address by a unique name + offset. Otherwise, it can put a probe on an
unexpected address. In particular, if a non-unique symbol + offset is
specified, the resulting address may NOT even be on an instruction boundary.
To avoid this issue, we should check that the given symbol is unique.
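
For example, when the bare name is ambiguous, the probe has to be written
in an unambiguous form like the following (the symbol name and the address
are only illustrative):

# a unique neighbouring symbol plus an offset that lands on the wanted function
echo 'p:myprobe unique_symbol+0x10' >> /sys/kernel/tracing/kprobe_events
# or the raw text address taken from /proc/kallsyms
echo 'p:myprobe2 0xffffffff812345a0' >> /sys/kernel/tracing/kprobe_events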

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] riscv: ftrace: Fix to pass correct ftrace_regs to ftrace_func_t functions

2023-10-15 Thread Google
Hi,

Sorry, please ignore this. I found that arch/riscv doesn't support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, so this would actually add support for a new
feature. Let me update it.

Thank you,

On Sun, 15 Oct 2023 23:15:23 +0900
Masami Hiramatsu (Google)  wrote:

> Hi,
> 
> Gentle ping.
> 
> I think this should be an important fix because if a fprobe handler without
> FTRACE_OPS_FL_SAVE_REGS tries to access any register via ftrace_regs, that
> will get a wrong value.
> 
> Thank you,
> 
> On Mon,  2 Oct 2023 21:50:34 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > Since ftrace_func_t requires to pass 'struct ftrace_regs *' as the 4th
> > argument even if FTRACE_OPS_FL_SAVE_REGS is not set, ftrace_caller must
> > pass 'struct ftrace_regs *', which is a partial pt_regs, on the stack
> > to the ftrace_func_t functions, so that the ftrace_func_t functions can
> > access some partial registers.
> > 
> > Fix to allocate 'struct ftrace_regs' (which has the same size of 'struct
> > pt_regs') on the stack and save partial (argument) registers on it
> > instead of reduced size custom data structure.
> > 
> > Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of 
> > MCOUNT")
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  arch/riscv/kernel/mcount-dyn.S |   65 
> > +---
> >  1 file changed, 28 insertions(+), 37 deletions(-)
> > 
> > diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
> > index 669b8697aa38..84963680eff4 100644
> > --- a/arch/riscv/kernel/mcount-dyn.S
> > +++ b/arch/riscv/kernel/mcount-dyn.S
> > @@ -14,46 +14,37 @@
> > .text
> >  
> >  #define FENTRY_RA_OFFSET   8
> > -#define ABI_SIZE_ON_STACK  80
> > -#define ABI_A0 0
> > -#define ABI_A1 8
> > -#define ABI_A2 16
> > -#define ABI_A3 24
> > -#define ABI_A4 32
> > -#define ABI_A5 40
> > -#define ABI_A6 48
> > -#define ABI_A7 56
> > -#define ABI_T0 64
> > -#define ABI_RA 72
> >  
> > .macro SAVE_ABI
> > -   addisp, sp, -ABI_SIZE_ON_STACK
> > -
> > -   REG_S   a0, ABI_A0(sp)
> > -   REG_S   a1, ABI_A1(sp)
> > -   REG_S   a2, ABI_A2(sp)
> > -   REG_S   a3, ABI_A3(sp)
> > -   REG_S   a4, ABI_A4(sp)
> > -   REG_S   a5, ABI_A5(sp)
> > -   REG_S   a6, ABI_A6(sp)
> > -   REG_S   a7, ABI_A7(sp)
> > -   REG_S   t0, ABI_T0(sp)
> > -   REG_S   ra, ABI_RA(sp)
> > +   addisp, sp, -PT_SIZE_ON_STACK
> > +
> > +   /* Save t0 as epc for ftrace_regs_get_instruction_pointer() */
> > +   REG_S   t0, PT_EPC(sp)
> > +   REG_S   a0, PT_A0(sp)
> > +   REG_S   a1, PT_A1(sp)
> > +   REG_S   a2, PT_A2(sp)
> > +   REG_S   a3, PT_A3(sp)
> > +   REG_S   a4, PT_A4(sp)
> > +   REG_S   a5, PT_A5(sp)
> > +   REG_S   a6, PT_A6(sp)
> > +   REG_S   a7, PT_A7(sp)
> > +   REG_S   t0, PT_T0(sp)
> > +   REG_S   ra, PT_RA(sp)
> > .endm
> >  
> > .macro RESTORE_ABI
> > -   REG_L   a0, ABI_A0(sp)
> > -   REG_L   a1, ABI_A1(sp)
> > -   REG_L   a2, ABI_A2(sp)
> > -   REG_L   a3, ABI_A3(sp)
> > -   REG_L   a4, ABI_A4(sp)
> > -   REG_L   a5, ABI_A5(sp)
> > -   REG_L   a6, ABI_A6(sp)
> > -   REG_L   a7, ABI_A7(sp)
> > -   REG_L   t0, ABI_T0(sp)
> > -   REG_L   ra, ABI_RA(sp)
> > -
> > -   addisp, sp, ABI_SIZE_ON_STACK
> > +   REG_L   a0, PT_A0(sp)
> > +   REG_L   a1, PT_A1(sp)
> > +   REG_L   a2, PT_A2(sp)
> > +   REG_L   a3, PT_A3(sp)
> > +   REG_L   a4, PT_A4(sp)
> > +   REG_L   a5, PT_A5(sp)
> > +   REG_L   a6, PT_A6(sp)
> > +   REG_L   a7, PT_A7(sp)
> > +   REG_L   t0, PT_T0(sp)
> > +   REG_L   ra, PT_RA(sp)
> > +
> > +   addisp, sp, PT_SIZE_ON_STACK
> > .endm
> >  
> >  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > @@ -96,8 +87,8 @@ ftrace_call:
> > callftrace_stub
> >  
> >  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> > -   addia0, sp, ABI_RA
> > -   REG_L   a1, ABI_T0(sp)
> > +   addia0, sp, PT_RA
> > +   REG_L   a1, PT_T0(sp)
> > addia1, a1, -FENTRY_RA_OFFSET
> >  #ifdef HAVE_FUNCTION_GRAPH_FP_TEST
> > mv  a2, s0
> > 
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] riscv: ftrace: Fix to pass correct ftrace_regs to ftrace_func_t functions

2023-10-15 Thread Google
Hi,

Gentle ping.

I think this should be an important fix, because if a fprobe handler without
FTRACE_OPS_FL_SAVE_REGS tries to access any register via ftrace_regs, it
will get a wrong value.

Thank you,

On Mon,  2 Oct 2023 21:50:34 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Since ftrace_func_t requires to pass 'struct ftrace_regs *' as the 4th
> argument even if FTRACE_OPS_FL_SAVE_REGS is not set, ftrace_caller must
> pass 'struct ftrace_regs *', which is a partial pt_regs, on the stack
> to the ftrace_func_t functions, so that the ftrace_func_t functions can
> access some partial registers.
> 
> Fix to allocate 'struct ftrace_regs' (which has the same size of 'struct
> pt_regs') on the stack and save partial (argument) registers on it
> instead of reduced size custom data structure.
> 
> Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of 
> MCOUNT")
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  arch/riscv/kernel/mcount-dyn.S |   65 
> +---
>  1 file changed, 28 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
> index 669b8697aa38..84963680eff4 100644
> --- a/arch/riscv/kernel/mcount-dyn.S
> +++ b/arch/riscv/kernel/mcount-dyn.S
> @@ -14,46 +14,37 @@
>   .text
>  
>  #define FENTRY_RA_OFFSET 8
> -#define ABI_SIZE_ON_STACK80
> -#define ABI_A0   0
> -#define ABI_A1   8
> -#define ABI_A2   16
> -#define ABI_A3   24
> -#define ABI_A4   32
> -#define ABI_A5   40
> -#define ABI_A6   48
> -#define ABI_A7   56
> -#define ABI_T0   64
> -#define ABI_RA   72
>  
>   .macro SAVE_ABI
> - addisp, sp, -ABI_SIZE_ON_STACK
> -
> - REG_S   a0, ABI_A0(sp)
> - REG_S   a1, ABI_A1(sp)
> - REG_S   a2, ABI_A2(sp)
> - REG_S   a3, ABI_A3(sp)
> - REG_S   a4, ABI_A4(sp)
> - REG_S   a5, ABI_A5(sp)
> - REG_S   a6, ABI_A6(sp)
> - REG_S   a7, ABI_A7(sp)
> - REG_S   t0, ABI_T0(sp)
> - REG_S   ra, ABI_RA(sp)
> + addisp, sp, -PT_SIZE_ON_STACK
> +
> + /* Save t0 as epc for ftrace_regs_get_instruction_pointer() */
> + REG_S   t0, PT_EPC(sp)
> + REG_S   a0, PT_A0(sp)
> + REG_S   a1, PT_A1(sp)
> + REG_S   a2, PT_A2(sp)
> + REG_S   a3, PT_A3(sp)
> + REG_S   a4, PT_A4(sp)
> + REG_S   a5, PT_A5(sp)
> + REG_S   a6, PT_A6(sp)
> + REG_S   a7, PT_A7(sp)
> + REG_S   t0, PT_T0(sp)
> + REG_S   ra, PT_RA(sp)
>   .endm
>  
>   .macro RESTORE_ABI
> - REG_L   a0, ABI_A0(sp)
> - REG_L   a1, ABI_A1(sp)
> - REG_L   a2, ABI_A2(sp)
> - REG_L   a3, ABI_A3(sp)
> - REG_L   a4, ABI_A4(sp)
> - REG_L   a5, ABI_A5(sp)
> - REG_L   a6, ABI_A6(sp)
> - REG_L   a7, ABI_A7(sp)
> - REG_L   t0, ABI_T0(sp)
> - REG_L   ra, ABI_RA(sp)
> -
> - addisp, sp, ABI_SIZE_ON_STACK
> + REG_L   a0, PT_A0(sp)
> + REG_L   a1, PT_A1(sp)
> + REG_L   a2, PT_A2(sp)
> + REG_L   a3, PT_A3(sp)
> + REG_L   a4, PT_A4(sp)
> + REG_L   a5, PT_A5(sp)
> + REG_L   a6, PT_A6(sp)
> + REG_L   a7, PT_A7(sp)
> + REG_L   t0, PT_T0(sp)
> + REG_L   ra, PT_RA(sp)
> +
> + addisp, sp, PT_SIZE_ON_STACK
>   .endm
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> @@ -96,8 +87,8 @@ ftrace_call:
>   call    ftrace_stub
>  
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> - addi    a0, sp, ABI_RA
> - REG_L   a1, ABI_T0(sp)
> + addi    a0, sp, PT_RA
> + REG_L   a1, PT_T0(sp)
> addi    a1, a1, -FENTRY_RA_OFFSET
>  #ifdef HAVE_FUNCTION_GRAPH_FP_TEST
>   mv  a2, s0
> 
> 


-- 
Masami Hiramatsu (Google) 



[PATCH] riscv: ftrace: Fix to pass correct ftrace_regs to ftrace_func_t functions

2023-10-02 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Since ftrace_func_t requires passing 'struct ftrace_regs *' as the 4th
argument even if FTRACE_OPS_FL_SAVE_REGS is not set, ftrace_caller must
pass 'struct ftrace_regs *', which is a partial pt_regs, on the stack
to the ftrace_func_t functions, so that the ftrace_func_t functions can
access some partial registers.

Fix this by allocating 'struct ftrace_regs' (which has the same size as
'struct pt_regs') on the stack and saving the partial (argument) registers
in it instead of a reduced-size custom data structure.

Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT")
Signed-off-by: Masami Hiramatsu (Google) 
---
 arch/riscv/kernel/mcount-dyn.S |   65 +---
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index 669b8697aa38..84963680eff4 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -14,46 +14,37 @@
.text
 
 #define FENTRY_RA_OFFSET   8
-#define ABI_SIZE_ON_STACK  80
-#define ABI_A0 0
-#define ABI_A1 8
-#define ABI_A2 16
-#define ABI_A3 24
-#define ABI_A4 32
-#define ABI_A5 40
-#define ABI_A6 48
-#define ABI_A7 56
-#define ABI_T0 64
-#define ABI_RA 72
 
.macro SAVE_ABI
-   addi    sp, sp, -ABI_SIZE_ON_STACK
-
-   REG_S   a0, ABI_A0(sp)
-   REG_S   a1, ABI_A1(sp)
-   REG_S   a2, ABI_A2(sp)
-   REG_S   a3, ABI_A3(sp)
-   REG_S   a4, ABI_A4(sp)
-   REG_S   a5, ABI_A5(sp)
-   REG_S   a6, ABI_A6(sp)
-   REG_S   a7, ABI_A7(sp)
-   REG_S   t0, ABI_T0(sp)
-   REG_S   ra, ABI_RA(sp)
+   addi    sp, sp, -PT_SIZE_ON_STACK
+
+   /* Save t0 as epc for ftrace_regs_get_instruction_pointer() */
+   REG_S   t0, PT_EPC(sp)
+   REG_S   a0, PT_A0(sp)
+   REG_S   a1, PT_A1(sp)
+   REG_S   a2, PT_A2(sp)
+   REG_S   a3, PT_A3(sp)
+   REG_S   a4, PT_A4(sp)
+   REG_S   a5, PT_A5(sp)
+   REG_S   a6, PT_A6(sp)
+   REG_S   a7, PT_A7(sp)
+   REG_S   t0, PT_T0(sp)
+   REG_S   ra, PT_RA(sp)
.endm
 
.macro RESTORE_ABI
-   REG_L   a0, ABI_A0(sp)
-   REG_L   a1, ABI_A1(sp)
-   REG_L   a2, ABI_A2(sp)
-   REG_L   a3, ABI_A3(sp)
-   REG_L   a4, ABI_A4(sp)
-   REG_L   a5, ABI_A5(sp)
-   REG_L   a6, ABI_A6(sp)
-   REG_L   a7, ABI_A7(sp)
-   REG_L   t0, ABI_T0(sp)
-   REG_L   ra, ABI_RA(sp)
-
-   addi    sp, sp, ABI_SIZE_ON_STACK
+   REG_L   a0, PT_A0(sp)
+   REG_L   a1, PT_A1(sp)
+   REG_L   a2, PT_A2(sp)
+   REG_L   a3, PT_A3(sp)
+   REG_L   a4, PT_A4(sp)
+   REG_L   a5, PT_A5(sp)
+   REG_L   a6, PT_A6(sp)
+   REG_L   a7, PT_A7(sp)
+   REG_L   t0, PT_T0(sp)
+   REG_L   ra, PT_RA(sp)
+
+   addi    sp, sp, PT_SIZE_ON_STACK
.endm
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
@@ -96,8 +87,8 @@ ftrace_call:
	call    ftrace_stub
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   addi    a0, sp, ABI_RA
-   REG_L   a1, ABI_T0(sp)
+   addi    a0, sp, PT_RA
+   REG_L   a1, PT_T0(sp)
	addi    a1, a1, -FENTRY_RA_OFFSET
 #ifdef HAVE_FUNCTION_GRAPH_FP_TEST
mv  a2, s0




Re: [PATCH v5] eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()

2023-09-12 Thread Google
On Tue, 12 Sep 2023 21:47:52 +0800
Jinjie Ruan  wrote:

> Injecting a fault while probing btrfs.ko makes kstrdup() fail in
> eventfs_prepare_ef() in eventfs_add_dir(), which then returns an ERR_PTR
> that gets assigned to file->ef. But eventfs_remove(), reached via
> trace_module_remove_events(), only checks for NULL, which causes the
> NULL pointer dereference below.
> 
> As both Masami and Steven suggested, the allocator side should handle the
> error carefully and remove it, so fix the places where it failed.
> 
>  Could not create tracefs 'raid56_write' directory
>  Btrfs loaded, zoned=no, fsverity=no
>  Unable to handle kernel NULL pointer dereference at virtual address 
> 001c
>  Mem abort info:
>ESR = 0x9604
>EC = 0x25: DABT (current EL), IL = 32 bits
>SET = 0, FnV = 0
>EA = 0, S1PTW = 0
>FSC = 0x04: level 0 translation fault
>  Data abort info:
>ISV = 0, ISS = 0x0004, ISS2 = 0x
>CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>  user pgtable: 4k pages, 48-bit VAs, pgdp=000102544000
>  [001c] pgd=, p4d=
>  Internal error: Oops: 9604 [#1] PREEMPT SMP
>  Dumping ftrace buffer:
> (ftrace buffer empty)
>  Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 
> 8021q garp mrp stp llc ipv6 [last unloaded: btrfs]
>  CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40
>  Hardware name: linux,dummy-virt (DT)
>  pstate: 8005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : eventfs_remove_rec+0x24/0xc0
>  lr : eventfs_remove+0x68/0x1d8
>  sp : 800082d63b60
>  x29: 800082d63b60 x28: b84b80ddd00c x27: b84b3054ba40
>  x26: 0002 x25: 800082d63bf8 x24: b84b8398e440
>  x23: b84b82af3000 x22: dead0100 x21: dead0122
>  x20: 800082d63bf8 x19: fff4 x18: b84b82508820
>  x17:  x16:  x15: 83bc876a3166
>  x14: 006d x13: 006d x12: 
>  x11: 0001 x10: 17e0 x9 : 0001
>  x8 :  x7 :  x6 : b84b84289804
>  x5 :  x4 : 9696969696969697 x3 : 33a5b7601f38
>  x2 :  x1 : 800082d63bf8 x0 : fff4
>  Call trace:
>   eventfs_remove_rec+0x24/0xc0
>   eventfs_remove+0x68/0x1d8
>   remove_event_file_dir+0x88/0x100
>   event_remove+0x140/0x15c
>   trace_module_notify+0x1fc/0x230
>   notifier_call_chain+0x98/0x17c
>   blocking_notifier_call_chain+0x4c/0x74
>   __arm64_sys_delete_module+0x1a4/0x298
>   invoke_syscall+0x44/0x100
>   el0_svc_common.constprop.1+0x68/0xe0
>   do_el0_svc+0x1c/0x28
>   el0_svc+0x3c/0xc4
>   el0t_64_sync_handler+0xa0/0xc4
>   el0t_64_sync+0x174/0x178
>  Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400)
>  ---[ end trace  ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  Dumping ftrace buffer:
> (ftrace buffer empty)
>  Kernel Offset: 0x384b00c0 from 0xffff80008000
>  PHYS_OFFSET: 0xcc5b8000
>  CPU features: 0x88000203,3c02,1000421b
>  Memory Limit: none
>  Rebooting in 1 seconds..
> 

This looks good to me!

Acked-by: Masami Hiramatsu (Google) 

Thanks!

> Fixes: 5bdcd5f5331a ("eventfs: Implement removal of meta data from eventfs")
> Signed-off-by: Jinjie Ruan 
> Suggested-by: Masami Hiramatsu (Google) 
> Suggested-by: Steven Rostedt 
> ---
> v5:
> - Move the ef below dir to keep the "upside-down x-mas tree" format.
> v4:
> - Not change the returning error code to NULL and simplify the fix.
> v3:
> - Fix the places where it failed instead of IS_ERR_OR_NULL()
> - Update the commit message.
> v2:
> - Use IS_ERR_OR_NULL() to check instead of IS_ERR() as suggested.
> - Update the commit message.
> ---
>  kernel/trace/trace_events.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index ed367d713be0..68b02d6f4fbf 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2297,6 +2297,7 @@ event_subsystem_dir(struct trace_array *tr, const char 
> *name,
>  {
>   struct event_subsystem *system, *iter;
>   struct trace_subsystem_dir *dir;
> + struct eventfs_file *ef;
>   int res;
>  
>   /* First see if we did not already create this dir */
> @@ -2329,13 +2330,14 @@ event_subsystem_dir(struct trace_array *tr, const 
> char *name,
>   } else
>   __get_system(system);
>  
> - dir->ef = even

Re: [PATCH v3] eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()

2023-09-11 Thread Google
Hi Jinjie,

On Mon, 11 Sep 2023 13:28:17 +0800
Jinjie Ruan  wrote:

> Injecting a fault while probing btrfs.ko makes kstrdup() fail in
> eventfs_prepare_ef() in eventfs_add_dir(), which then returns an ERR_PTR
> that gets assigned to file->ef. But eventfs_remove(), reached via
> trace_module_remove_events(), only checks for NULL, which causes the
> NULL pointer dereference below.
> 
> As both Masami and Steven suggested, the allocator side should handle the
> error carefully and remove it, so fix the places where it failed.

Thanks for the update, but this is too much. Sorry if I confused you.

I meant that just *one site* must be fixed. *Returning the error code is not
a bad idea*, as long as it is kept in a temporary variable.

e.g. 

dir->ef = eventfs_add_subsystem_dir(name, parent);
if (IS_ERR(dir->ef)) {
...
}

This smells *bad* because if eventfs_add_subsystem_dir() can return an
error code, 'dir->ef' will hold that error pointer and may be referenced
by other code.

Instead,

ef = eventfs_add_subsystem_dir(name, parent);
if (IS_ERR(ef)) {
...
} else
dir->ef = ef;

This is safer, because 'dir->ef' always holds either NULL (uninitialized) or
a valid address (initialized). The error code is always assigned to the
temporary variable 'ef', so the caller can handle the error correctly by
checking that error code.

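Just to spell out why storing the error pointer is dangerous (a rough,
simplified sketch; the exact call sites are elided), the removal path only
checks for NULL, so an ERR_PTR() slips through and gets dereferenced:

	ef = eventfs_add_dir(name, parent_ef);	/* can fail and return ERR_PTR() */
	file->ef = ef;				/* the error pointer is stored */
	...
	/* later, e.g. on module removal: */
	if (file->ef)				/* ERR_PTR() is non-NULL, so this passes */
		eventfs_remove(file->ef);	/* walks the bogus pointer -> oops */
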
Thank you,

> 
>  Could not create tracefs 'raid56_write' directory
>  Btrfs loaded, zoned=no, fsverity=no
>  Unable to handle kernel NULL pointer dereference at virtual address 
> 001c
>  Mem abort info:
>ESR = 0x9604
>EC = 0x25: DABT (current EL), IL = 32 bits
>SET = 0, FnV = 0
>EA = 0, S1PTW = 0
>FSC = 0x04: level 0 translation fault
>  Data abort info:
>ISV = 0, ISS = 0x0004, ISS2 = 0x
>CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>  user pgtable: 4k pages, 48-bit VAs, pgdp=000102544000
>  [001c] pgd=, p4d=
>  Internal error: Oops: 9604 [#1] PREEMPT SMP
>  Dumping ftrace buffer:
> (ftrace buffer empty)
>  Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 
> 8021q garp mrp stp llc ipv6 [last unloaded: btrfs]
>  CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40
>  Hardware name: linux,dummy-virt (DT)
>  pstate: 8005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : eventfs_remove_rec+0x24/0xc0
>  lr : eventfs_remove+0x68/0x1d8
>  sp : 800082d63b60
>  x29: 800082d63b60 x28: b84b80ddd00c x27: b84b3054ba40
>  x26: 0002 x25: 800082d63bf8 x24: b84b8398e440
>  x23: b84b82af3000 x22: dead0100 x21: dead0122
>  x20: 800082d63bf8 x19: fff4 x18: b84b82508820
>  x17:  x16:  x15: 83bc876a3166
>  x14: 006d x13: 006d x12: 
>  x11: 0001 x10: 17e0 x9 : 0001
>  x8 :  x7 :  x6 : b84b84289804
>  x5 :  x4 : 9696969696969697 x3 : 33a5b7601f38
>  x2 :  x1 : 800082d63bf8 x0 : fff4
>  Call trace:
>   eventfs_remove_rec+0x24/0xc0
>   eventfs_remove+0x68/0x1d8
>   remove_event_file_dir+0x88/0x100
>   event_remove+0x140/0x15c
>   trace_module_notify+0x1fc/0x230
>   notifier_call_chain+0x98/0x17c
>   blocking_notifier_call_chain+0x4c/0x74
>   __arm64_sys_delete_module+0x1a4/0x298
>   invoke_syscall+0x44/0x100
>   el0_svc_common.constprop.1+0x68/0xe0
>   do_el0_svc+0x1c/0x28
>   el0_svc+0x3c/0xc4
>   el0t_64_sync_handler+0xa0/0xc4
>   el0t_64_sync+0x174/0x178
>  Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400)
>  ---[ end trace  ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  Dumping ftrace buffer:
> (ftrace buffer empty)
>  Kernel Offset: 0x384b00c0 from 0x80008000
>  PHYS_OFFSET: 0xcc5b8000
>  CPU features: 0x88000203,3c02,1000421b
>  Memory Limit: none
>  Rebooting in 1 seconds..
> 
> Fixes: 5bdcd5f5331a ("eventfs: Implement removal of meta data from eventfs")
> Signed-off-by: Jinjie Ruan 
> Suggested-by: Masami Hiramatsu (Google) 
> Suggested-by: Steven Rostedt 
> ---
> v3:
> - Fix the places where it failed instead of IS_ERR_OR_NULL()
> - Update the commit message.
> v2:
> - Use IS_ERR_OR_NULL() to check instead of IS_ERR() as suggested.
> - Update the commit message.
> ---
>  fs/tracefs/event_inode.c| 22 +++---
>  kernel/trace/trace_events.c |  4 ++--
>  2 files changed, 13 insertions(+), 13 deletions(-)
>