[PATCH bpf-next v5 06/12] bpf: Make bpf_session_cookie() kfunc return long *

2024-06-12 Thread Daniel Xu
We will soon be generating kfunc prototypes from BTF. As part of that,
we need to align the manual signatures in bpf_kfuncs.h with the actual
kfunc definitions. There is currently a conflicting signature for
bpf_session_cookie() w.r.t. return type.

The original intent was to return long * and not __u64 *. You can see
evidence of that intent in a3a5113393cc ("selftests/bpf: Add kprobe
session cookie test").

Fix the conflict by changing the kfunc definition.
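
For reference, the declaration that BPF programs consume looks roughly
like this (paraphrased; see bpf_kfuncs.h in the tree for the exact form
and attributes):

extern long *bpf_session_cookie(void) __weak __ksym;

The kfunc definition below is changed to match that return type.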

Fixes: 5c919acef851 ("bpf: Add support for kprobe session cookie")
Signed-off-by: Daniel Xu 
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..bc16e21a2a44 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3527,7 +3527,7 @@ __bpf_kfunc bool bpf_session_is_return(void)
return session_ctx->is_return;
 }
 
-__bpf_kfunc __u64 *bpf_session_cookie(void)
+__bpf_kfunc long *bpf_session_cookie(void)
 {
struct bpf_session_run_ctx *session_ctx;
 
-- 
2.44.0




Re: [RFC bpf-next 0/1] bpf: Add page cache iterator

2021-04-08 Thread Daniel Xu
On Thu, Apr 08, 2021 at 04:13:32PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 07, 2021 at 02:46:10PM -0700, Daniel Xu wrote:
> > There currently does not exist a way to answer the question: "What is in
> > the page cache?". There are various heuristics and counters but nothing
> > that can tell you anything like:
> > 
> >   * 3M from /home/dxu/foo.txt
> >   * 5K from ...
> 
> 5K?  That's an extraordinary Weird Machine(tm).

Just typing random numbers :)

> >   * etc.
> > 
> > The answer to the question is particularly useful in the stacked
> > container world. Stacked containers implies multiple containers are run
> > on the same physical host. Memory is precious resource on some (if not
> > most) of these systems. On these systems, it's useful to know how much
> > duplicated data is in the page cache. Once you know the answer, you can
> > do something about it. One possible technique would be bind mount common
> > items from the root host into each container.
> 
> Um, are you describing a system that uses BPF to deduplicating the page
> cache by using bind mounts?  Can the containers scribble on these files
> and thereby mess up the other containers?  What happens if the container
> wants to update itself and clobbers the root host's copy instead?  How
> do you deal with a software update process failing because the root host
> fights back against the container trying to change its files?

No, the BPF progs are not intended to modify the pages. This is just for
read-only observability.

> Also, I thought we weren't supposed to share resources across security
> boundaries anymore?

I can't speak to this, but bpf progs can be attached pretty much
anywhere, so this iterator doesn't expose anything new.

<...>

Thanks,
Daniel


Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

2021-04-08 Thread Daniel Xu
On Thu, Apr 08, 2021 at 04:45:37PM +, Al Viro wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> 
> > +static void fini_seq_pagecache(void *priv_data)
> > +{
> > +   struct bpf_iter_seq_pagecache_info *info = priv_data;
> > +   struct radix_tree_iter iter;
> > +   struct super_block *sb;
> > +   void **slot;
> > +
> > +   radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > +   sb = (struct super_block *)iter.index;
> > +   atomic_dec(&sb->s_active);
> > +   radix_tree_delete(&info->superblocks, iter.index);
> > +   }
> 
> ... and if in the meanwhile all other contributors to ->s_active have
> gone away, that will result in...?

Ah right, sorry. Nobody will clean up the super_block.

> IOW, NAK.  The objects you are playing with have non-trivial lifecycle
> and poking into the guts of data structures without bothering to
> understand it is not a good idea.
> 
> Rule of the thumb: if your code ends up using fields that are otherwise
> handled by a small part of codebase, the odds are that you need to be
> bloody careful.  In particular, ->ns_lock has 3 users - all in
> fs/namespace.c.  ->list/->mnt_list: all users in fs/namespace.c and
> fs/pnode.c.  ->s_active: majority in fs/super.c, with several outliers
> in filesystems and safety of those is not trivial.
> 
> Any time you see that kind of pattern, you are risking to reprise
> a scene from The Modern Times - the one with Charlie taking a trip
> through the guts of machinery.

I'll take a closer look at the lifetime semantics.

Hopefully the overall goal of the patch is ok. Happy to iterate on the
implementation details until it's correct.

Thanks,
Daniel


Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

2021-04-08 Thread Daniel Xu
On Thu, Apr 08, 2021 at 10:19:35AM +0200, Christian Brauner wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> > This commit introduces the bpf page cache iterator. This iterator allows
> > users to run a bpf prog against each page in the "page cache".
> > Internally, the "page cache" is extremely tied to VFS superblock + inode
> > combo. Because of this, iter_pagecache will only examine pages in the
> > caller's mount namespace.
> > 
> > Signed-off-by: Daniel Xu 
> > ---
> >  kernel/bpf/Makefile |   2 +-
> >  kernel/bpf/pagecache_iter.c | 293 
> >  2 files changed, 294 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/bpf/pagecache_iter.c

<...>

> > 
> > +static int init_seq_pagecache(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > +   struct bpf_iter_seq_pagecache_info *info = priv_data;
> > +   struct radix_tree_iter iter;
> > +   struct super_block *sb;
> > +   struct mount *mnt;
> > +   void **slot;
> > +   int err;
> > +
> > +   info->ns = current->nsproxy->mnt_ns;
> > +   get_mnt_ns(info->ns);
> > +   INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> > +
> > +   spin_lock(&info->ns->ns_lock);
> > +   list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> 
> Not just are there helpers for taking ns_lock
> static inline void lock_ns_list(struct mnt_namespace *ns)
> static inline void unlock_ns_list(struct mnt_namespace *ns)
> they are private to fs/namespace.c because it's the only place that
> should ever walk this list.

Thanks for the hints. Would it be acceptable to add some helpers to
fs/namespace.c to allow walking the list?

IIUC the only way to find a list of mounts is by looking at the mount
namespace. And walking each mount and looking at each `struct
super_block`'s inodes' `struct address_space` seemed like the cleanest
way to walk the page cache.

> This seems buggy: why is it ok here to only take ns_lock and not also
> namespace_sem like mnt_already_visible() and __is_local_mountpoint()
> or the relevant proc iterators? I might be missing something.

Thanks for the hints. I'll take a closer look at the locking. Most
probably I didn't get it right.

I should have also mentioned in the cover letter that I'm fairly sure I
messed up the locking somewhere.

> 
> > +   sb = mnt->mnt.mnt_sb;
> > +
> > +   /* The same mount may be mounted in multiple places */
> > +   if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> > +   continue;
> > +
> > +   err = radix_tree_insert(&info->superblocks,
> > +   (unsigned long)sb, (void *)1);
> > +   if (err)
> > +   goto out;
> > +   }
> > +
> > +   radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > +   sb = (struct super_block *)iter.index;
> > +   atomic_inc(&sb->s_active);
> 
> It also isn't nice that you mess with sb->s_active directly.
> 
> Imho, this is poking around in a lot of fs/ specific stuff that other
> parts of the kernel should not care about or have access to.

Re above: do you think it'd be appropriate to add more helpers to fs/ ?

<...>

Thanks,
Daniel


Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

2021-04-08 Thread Daniel Xu
On Thu, Apr 08, 2021 at 07:14:01AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> > +struct bpf_iter_seq_pagecache_info {
> > +   struct mnt_namespace *ns;
> > +   struct radix_tree_root superblocks;
> 
> Why are you adding a new radix tree?  Use an XArray instead.

Ah right, sorry. Will do.

> > +static struct page *goto_next_page(struct bpf_iter_seq_pagecache_info *info)
> > +{
> > +   struct page *page, *ret = NULL;
> > +   unsigned long idx;
> > +
> > +   rcu_read_lock();
> > +retry:
> > +   BUG_ON(!info->cur_inode);
> > +   ret = NULL;
> > +   xa_for_each_start(&info->cur_inode->i_data.i_pages, idx, page,
> > + info->cur_page_idx) {
> > +   if (!page_cache_get_speculative(page))
> > +   continue;
> 
> Why do you feel the need to poke around in i_pages directly?  Is there
> something wrong with find_get_entries()?

No reason other than I didn't know about the latter. Thanks for the
hint. find_get_entries() seems to return a pagevec of entries which
would complicate the iteration (a 4th layer of things to iterate over).

But I did find find_get_pages_range() which I think can be used to find
1 page at a time. I'll look into it further.

> > +static int __pagecache_seq_show(struct seq_file *seq, struct page *page,
> > +   bool in_stop)
> > +{
> > +   struct bpf_iter_meta meta;
> > +   struct bpf_iter__pagecache ctx;
> > +   struct bpf_prog *prog;
> > +
> > +   meta.seq = seq;
> > +   prog = bpf_iter_get_info(&meta, in_stop);
> > +   if (!prog)
> > +   return 0;
> > +
> > +   meta.seq = seq;
> > +   ctx.meta = &meta;
> > +   ctx.page = page;
> > +   return bpf_iter_run_prog(prog, &ctx);
> 
> I'm not really keen on the idea of random BPF programs being able to poke
> at pages in the page cache like this.  From your initial description,
> it sounded like all you needed was a list of which pages are present.

Could you elaborate on what "list of which pages are present" implies?
The overall goal with this patch is to detect duplicate content in the
page cache. So anything that helps achieve that goal I would (in theory)
be OK with.

My understanding is the user would need to hash the contents
of each page in the page cache. And BPF provides the flexibility such
that this work could be reused for currently unanticipated use cases.
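
For illustration, a consumer of the proposed iterator might look roughly
like the sketch below. This is hypothetical: it assumes the RFC is
applied (the "iter/pagecache" target and struct bpf_iter__pagecache come
from this patch), and page_hash() is the not-yet-written helper
mentioned in the cover letter.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* hypothetical helper that would hash a page's contents */
extern __u64 page_hash(struct page *page) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1 << 20);
	__type(key, __u64);	/* page content hash */
	__type(value, __u64);	/* number of cached pages with that hash */
} hashes SEC(".maps");

SEC("iter/pagecache")
int dump_pagecache(struct bpf_iter__pagecache *ctx)
{
	struct page *page = ctx->page;
	__u64 h, one = 1, *cnt;

	if (!page)
		return 0;

	h = page_hash(page);
	cnt = bpf_map_lookup_elem(&hashes, &h);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&hashes, &h, &one, BPF_NOEXIST);
	return 0;
}

char _license[] SEC("license") = "GPL";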

Furthermore, bpf programs can already observe every page that enters the
page cache by hooking into the filemap:mm_filemap_add_to_page_cache
tracepoint, albeit at a much slower rate. I figure the downside of
adding this page cache iterator is that we're explicitly condoning the
behavior.
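
Something like the following (untested sketch) is that existing route:
it sees insertions from attach time onward, but cannot enumerate what is
already cached.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

__u64 nr_added = 0;

SEC("raw_tracepoint/mm_filemap_add_to_page_cache")
int count_page_cache_adds(struct bpf_raw_tracepoint_args *ctx)
{
	/* a real tool would pull the page/inode out of ctx->args */
	__sync_fetch_and_add(&nr_added, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";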

> > +   INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> > +
> > +   spin_lock(&info->ns->ns_lock);
> > +   list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> > +   sb = mnt->mnt.mnt_sb;
> > +
> > +   /* The same mount may be mounted in multiple places */
> > +   if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> > +   continue;
> > +
> > +   err = radix_tree_insert(&info->superblocks,
> > +   (unsigned long)sb, (void *)1);
> > +   if (err)
> > +   goto out;
> > +   }
> > +
> > +   radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > +   sb = (struct super_block *)iter.index;
> > +   atomic_inc(&sb->s_active);
> > +   }
> 
> Uh.  What on earth made you think this was a good way to use the radix
> tree?  And, no, the XArray doesn't change that.

The idea behind the radix tree was to deduplicate the mounts by
superblock, because a single filesystem may be mounted in multiple
locations. I didn't find a set data structure I could reuse, so I
figured a radix tree / xarray would work too.

Happy to take any better ideas too.
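
For concreteness, the set-style usage I was going for is roughly the
following sketch (it still keys off the raw pointer value, which I
understand may be part of the objection):

#include <linux/fs.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(seen_sbs);

static int mark_sb_seen(struct super_block *sb)
{
	/* xa_insert() returns -EBUSY if this superblock was already recorded */
	return xa_insert(&seen_sbs, (unsigned long)sb, xa_mk_value(1),
			 GFP_KERNEL);
}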

> If you don't understand why this is so bad, call xa_dump() on it after
> constructing it.  I'll wait.

I did a dump and got the following results: http://ix.io/2VpY .

I received a hint that you may be referring to how the xarray/radix
tree would be as large as the largest pointer. To my uneducated eye it
doesn't look like that's the case in this dump. Could you please
clarify?

<...>

Thanks,
Daniel


Re: [RFC bpf-next 0/1] bpf: Add page cache iterator

2021-04-08 Thread Daniel Xu
Hi Christian, thanks for taking a look.

On Thu, Apr 08, 2021 at 09:51:17AM +0200, Christian Brauner wrote:
> On Wed, Apr 07, 2021 at 02:46:10PM -0700, Daniel Xu wrote:
> > There currently does not exist a way to answer the question: "What is in
> > the page cache?". There are various heuristics and counters but nothing
> > that can tell you anything like:
> > 
> >   * 3M from /home/dxu/foo.txt
> >   * 5K from ...
> >   * etc.
> > 
> > The answer to the question is particularly useful in the stacked
> > container world. Stacked containers implies multiple containers are run
> > on the same physical host. Memory is precious resource on some (if not
> 
> Just to clarify: what are "stacked containers"? Do you mean nested
> containers, i.e. containers running within containers?

I mean multiple containers running side by side on the same host.

Thanks,
Daniel


[RFC bpf-next 1/1] bpf: Introduce iter_pagecache

2021-04-07 Thread Daniel Xu
This commit introduces the bpf page cache iterator. This iterator allows
users to run a bpf prog against each page in the "page cache".
Internally, the "page cache" is extremely tied to VFS superblock + inode
combo. Because of this, iter_pagecache will only examine pages in the
caller's mount namespace.

Signed-off-by: Daniel Xu 
---
 kernel/bpf/Makefile |   2 +-
 kernel/bpf/pagecache_iter.c | 293 
 2 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/pagecache_iter.c

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 7f33098ca63f..3deb6a8d3f75 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o pagecache_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
diff --git a/kernel/bpf/pagecache_iter.c b/kernel/bpf/pagecache_iter.c
new file mode 100644
index ..8442ab0d4221
--- /dev/null
+++ b/kernel/bpf/pagecache_iter.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 Facebook */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../../fs/mount.h"
+
+struct bpf_iter_seq_pagecache_info {
+   struct mnt_namespace *ns;
+   struct radix_tree_root superblocks;
+   struct super_block *cur_sb;
+   struct inode *cur_inode;
+   unsigned long cur_page_idx;
+};
+
+static struct super_block *goto_next_sb(struct bpf_iter_seq_pagecache_info *info)
+{
+   struct super_block *sb = NULL;
+   struct radix_tree_iter iter;
+   void **slot;
+
+   radix_tree_for_each_slot(slot, &info->superblocks, &iter,
+((unsigned long)info->cur_sb + 1)) {
+   sb = (struct super_block *)iter.index;
+   break;
+   }
+
+   info->cur_sb = sb;
+   info->cur_inode = NULL;
+   info->cur_page_idx = 0;
+   return sb;
+}
+
+static bool inode_unusual(struct inode *inode) {
+   return ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+   (inode->i_mapping->nrpages == 0));
+}
+
+static struct inode *goto_next_inode(struct bpf_iter_seq_pagecache_info *info)
+{
+   struct inode *prev_inode = info->cur_inode;
+   struct inode *inode;
+
+retry:
+   BUG_ON(!info->cur_sb);
+   spin_lock(&info->cur_sb->s_inode_list_lock);
+
+   if (!info->cur_inode) {
+   list_for_each_entry(inode, &info->cur_sb->s_inodes, i_sb_list) {
+   spin_lock(&inode->i_lock);
+   if (inode_unusual(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   info->cur_inode = inode;
+   break;
+   }
+   } else {
+   inode = info->cur_inode;
+   info->cur_inode = NULL;
+   list_for_each_entry_continue(inode, &info->cur_sb->s_inodes,
+i_sb_list) {
+   spin_lock(&inode->i_lock);
+   if (inode_unusual(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   info->cur_inode = inode;
+   break;
+   }
+   }
+
+   /* Seen all inodes in this superblock */
+   if (!info->cur_inode) {
+   spin_unlock(&info->cur_sb->s_inode_list_lock);
+   if (!goto_next_sb(info)) {
+   inode = NULL;
+   goto out;
+   }
+
+   goto retry;
+   }
+
+   spin_unlock(&info->cur_sb->s_inode_list_lock);
+   info->cur_page_idx = 0;
+out:
+   iput(prev_inode);
+   return info->cur_inode;
+}
+
+static struct page *goto_next_page(struct bpf_iter_seq_pagecache_info *info)
+{
+   struct page *page, *ret = NULL;
+   unsigned long idx;
+
+   rcu_read_lock();
+retry:
+   BUG_ON(!info->cur_inode);
+   ret = NULL;
+   xa_for_each_start(&info->cur_inode->i_data.i_pages, i

[RFC bpf-next 0/1] bpf: Add page cache iterator

2021-04-07 Thread Daniel Xu
There currently does not exist a way to answer the question: "What is in
the page cache?". There are various heuristics and counters but nothing
that can tell you anything like:

  * 3M from /home/dxu/foo.txt
  * 5K from ...
  * etc.

The answer to the question is particularly useful in the stacked
container world. Stacked containers means multiple containers running
on the same physical host. Memory is a precious resource on some (if not
most) of these systems. On these systems, it's useful to know how much
duplicated data is in the page cache. Once you know the answer, you can
do something about it. One possible technique would be to bind mount
common items from the root host into each container.
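
As a hypothetical illustration of that last idea (the paths are made
up), each container would get a bind mount of the shared content instead
of its own copy:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* share one copy of /srv/common/usr into a container rootfs */
	if (mount("/srv/common/usr", "/var/lib/containers/c1/rootfs/usr",
		  NULL, MS_BIND, NULL)) {
		perror("mount");
		return 1;
	}
	/* a follow-up MS_REMOUNT | MS_BIND | MS_RDONLY pass would make it read-only */
	return 0;
}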

NOTES: 

  * This patch compiles and (maybe) works -- totally not fully tested
or in a final state

  * I'm sending this early RFC to get comments on the general approach.
I chatted w/ Johannes a little bit and it seems like the best way to
do this is through superblock -> inode -> address_space iteration
rather than going from numa node -> LRU iteration

  * I'll most likely add a page_hash() helper (or something) that hashes
    a page so that userspace can more easily tell which pages are
    duplicates

Daniel Xu (1):
  bpf: Introduce iter_pagecache

 kernel/bpf/Makefile |   2 +-
 kernel/bpf/pagecache_iter.c | 293 
 2 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/pagecache_iter.c

-- 
2.26.3



Re: [PATCH -tip v2 00/10] kprobes: Fix stacktrace with kretprobes

2021-03-12 Thread Daniel Xu
On Fri, Mar 12, 2021 at 03:41:44PM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> Here is the 2nd version of the series to fix the stacktrace with kretprobe.
> 
> The 1st series is here;
> 
> https://lore.kernel.org/bpf/161495873696.346821.10161501768906432924.stgit@devnote2/
> 
> In this version I merged the ORC unwinder fix for kretprobe which discussed 
> in the
> previous thread. [3/10] is updated according to the Miroslav's comment. 
> [4/10] is
> updated for simplify the code. [5/10]-[9/10] are discussed in the previsous 
> tread
> and are introduced to the series.
> 
> Daniel, can you also test this again? I and Josh discussed a bit different
> method and I've implemented it on this version.

Works great, thanks!

Tested-by: Daniel Xu 

> 
> This actually changes the kretprobe behavisor a bit, now the instraction 
> pointer in
> the pt_regs passed to kretprobe user handler is correctly set the real return
> address. So user handlers can get it via instruction_pointer() API.

While this changes behavior a little bit, I don't think anyone will
mind. I think it's more accurate now.

<...>

Thanks,
Daniel


Re: [PATCH -tip 0/5] kprobes: Fix stacktrace in kretprobes

2021-03-10 Thread Daniel Xu
On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote:
> Hi Josh and Daniel,
<...>

> commit aa452d999b524b1851f69cc947be3e1a2f3ca1ec
> Author: Masami Hiramatsu 
> Date:   Sat Mar 6 08:34:51 2021 +0900
> 
> x86/unwind/orc: Fixup kretprobe trampoline entry
> 
> Since the kretprobe replaces the function return address with
> the kretprobe_trampoline on the stack, the ORC unwinder can not
> continue the stack unwinding at that point.
> 
> To fix this issue, correct state->ip as like as function-graph
> tracer in the unwind_next_frame().
> 
> Signed-off-by: Masami Hiramatsu 
> 

I applied your original patchset + Josh's patch + this patch and can
confirm it works for bpftrace + kretprobes.

Tested-by: Daniel Xu 


Re: [PATCH -tip 0/5] kprobes: Fix stacktrace in kretprobes

2021-03-09 Thread Daniel Xu
Hi Masami,

Just want to clarify a few points:

On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote:
> On Sun, 7 Mar 2021 13:23:33 -0800
> Daniel Xu  wrote:
> To help your understanding, let me explain.
> 
> If we have a code here
> 
> caller_func:
> 0x00 add sp, 0x20 /* 0x20 bytes stack frame allocated */
> ...
> 0x10 call target_func
> 0x15 ... /* return address */
> 
> On the stack in the entry of target_func, we have
> 
> [stack]
> 0x0e0 caller_func+0x15
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> And when we put a kretprobe on the target_func, the stack will be
> 
> [stack]
> 0x0e0 kretprobe_trampoline
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> * "caller_func+0x15" is saved in current->kretprobe_instances.first.
> 
> When returning from the target_func, call consumed the 0x0e0 and
> jump to kretprobe_trampoline. Let's see the assembler code.
> 
> ".text\n"
> ".global kretprobe_trampoline\n"
> ".type kretprobe_trampoline, @function\n"
> "kretprobe_trampoline:\n"
> /* We don't bother saving the ss register */
> "   pushq %rsp\n"
> "   pushfq\n"
> SAVE_REGS_STRING
> "   movq %rsp, %rdi\n"
> "   call trampoline_handler\n"
> /* Replace saved sp with true return address. */
> "   movq %rax, 19*8(%rsp)\n"
> RESTORE_REGS_STRING
> "   popfq\n"
> "   ret\n"
> 
> When the entry of trampoline_handler, stack is like this;
> 
> [stack]
> 0x040 kretprobe_trampoline+0x25
> 0x048 r15
> ... /* pt_regs */
> 0x0d8 flags
> 0x0e0 rsp (=0x0e0)
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> And after returned from trampoline_handler, "movq" changes the
> stack like this.
> 
> [stack]
> 0x040 kretprobe_trampoline+0x25
> 0x048 r15
> ... /* pt_regs */
> 0x0d8 flags
> 0x0e0 caller_func+0x15
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */

Thanks for the detailed explanation. I think I understand kretprobe
mechanics at a somewhat high level (the kprobe saves the real return
address on entry, overwrites the return address with the trampoline,
then the trampoline runs the handler and finally restores the real
return address).

I don't usually write much assembly so the details escape me somewhat.

> So at the kretprobe handler, we have 2 issues.
> 1) the return address (caller_func+0x15) is not on the stack.
>this can be solved by searching from current->kretprobe_instances.

Yes, agreed.

> 2) the stack frame size of kretprobe_trampoline is unknown
>Since the stackframe is fixed, the fixed number (0x98) can be used.

I'm confused about why this is relevant. Is it so ORC knows where to
find the saved return address in the frame?

> However, those solutions are only for the kretprobe handler. The stacktrace
> from interrupt handler hit in the kretprobe_trampoline still doesn't work.
> 
> So, here is my idea;
> 
> 1) Change the trampline code to prepare stack frame at first and save
>registers on it, instead of "push". This will makes ORC easy to setup
>stackframe information for this code.

I'm confused on the details here. But this is what Josh solves in his
patch, right?

> 2) change the return addres fixup timing. Instead of using return value
>of trampoline handler, before removing the real return address from
>current->kretprobe_instances.

Is the idea to have `kretprobe_trampoline` place the real return address
on the stack (while telling ORC where to find it) _before_ running `call
trampoline_handler` ? So that an unwind from inside the user defined
kretprobe handler simply unwinds correctly?

And to be extra clear, this would only work for stack_trace_save() and
not stack_trace_save_regs()?

> 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
>checks the contents of the end of stackframe (at the place of regs->sp)
>is same as the address of it. If it is, it can find the correct address
>from current->kretprobe_instances. If not, there is a correct address.

What do you mean by "it" w.r.t. "is the same address of it"? I'm
confused on this point.

Thanks,
Daniel


Re: [PATCH -tip 0/5] kprobes: Fix stacktrace in kretprobes

2021-03-07 Thread Daniel Xu
On Sat, Mar 06, 2021 at 10:13:57AM +0900, Masami Hiramatsu wrote:
> On Fri, 5 Mar 2021 11:16:45 -0800
> Daniel Xu  wrote:
> 
> > Hi Masami,
> > 
> > On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote:
> > > Hello,
> > > 
> > > Here is a series of patches for kprobes and stacktracer to fix the 
> > > kretprobe
> > > entries in the kernel stack. This was reported by Daniel Xu. I thought 
> > > that
> > > was in the bpftrace, but it is actually more generic issue.
> > > So I decided to fix the issue in arch independent part.
> > > 
> > > While fixing the issue, I found a bug in ia64 related to kretprobe, which 
> > > is
> > > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the 
> > > main
> > > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
> > > internal change. And [5/5] removing the stacktrace kretprobe fixup code in
> > > ftrace. 
> > > 
> > > Daniel, can you also check that this fixes your issue too? I hope it is.
> > 
> > Unfortunately, this patch series does not fix the issue I reported.
> 
> Ah, OK. This was my mistake I didn't choose ORC unwinder for test kernel.
> 
> > 
> > I think the reason your tests work is because you're using ftrace and
> > the ORC unwinder is aware of ftrace trampolines (see
> > arch/x86/kernel/unwind_orc.c:orc_ftrace_find).
> 
> OK, so it has to be fixed in ORC unwinder.
> 
> > 
> > bpftrace kprobes go through perf event subsystem (ie not ftrace) so
> > naturally orc_ftrace_find() does not find an associated trampoline. ORC
> > unwinding fails in this case because
> > arch/x86/kernel/kprobes/core.c:trampoline_handler sets
> > 
> > regs->ip = (unsigned long)&kretprobe_trampoline;
> > 
> > and `kretprobe_trampoline` is marked
> > 
> > STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
> > 
> > so it doesn't have a valid ORC entry. Thus, ORC immediately bails when
> > trying to unwind past the first frame.
> 
> Hm, in ftrace case, it decode kretprobe's stackframe and stoped right
> after kretprobe_trampoline callsite.
> 
>  => kretprobe_trace_func+0x21f/0x340
>  => kretprobe_dispatcher+0x73/0xb0
>  => __kretprobe_trampoline_handler+0xcd/0x1a0
>  => trampoline_handler+0x3d/0x50
>  => kretprobe_trampoline+0x25/0x50
>  => 0
> 
> kretprobe replaces the real return address with kretprobe_trampoline
> and kretprobe_trampoline *calls* trampoline_handler (this part depends
> on architecture implementation).
> Thus, if kretprobe_trampoline has no stack frame information, ORC may
> fails at the first kretprobe_trampoline+0x25, that is different from
> the kretprobe_trampoline+0, so the hack doesn't work.

I'm not sure I follow 100% what you're saying, but assuming you're
asking why bpftrace fails at `kretprobe_trampoline+0` instead of
`kretprobe_trampoline+0x25`:

`regs` is set to &kretprobe_trampoline:

regs->ip = (unsigned long)&kretprobe_trampoline;

Then the kretprobe event is dispatched like this:

kretprobe_trampoline_handler
  __kretprobe_trampoline_handler
    rp->handler // ie kernel/trace/trace_kprobe.c:kretprobe_dispatcher
      kretprobe_perf_func
        trace_call_bpf(.., regs)
          BPF_PROG_RUN_ARRAY_CHECK
            bpf_get_stackid(regs, .., ..) // in bpftrace prog

And then `bpf_get_stackid` unwinds the stack via:

bpf_get_stackid
  get_perf_callchain(regs, ...)
    perf_callchain_kernel(.., regs)
      perf_callchain_store(.., regs->ip) // !!! first unwound entry
      unwind_start
      unwind_next_frame

In summary: unwinding via BPF begins at regs->ip, which
`trampoline_handler` sets to `&kretprobe_trampoline+0x0`.

> Hmm, how the other inline-asm function makes its stackframe info?
> If I get the kretprobe_trampoline+0, I can fix it up.

So I think I misunderstood the mechanics before. I think `call
trampoline_handler` does set up a new frame. My current guess is that
ftrace doesn't thread through `regs` like the bpf code path does. I'm
not very familiar with ftrace internals so I haven't looked.

> 
> > The only way I can think of to fix this issue is to make the ORC
> > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
> > you have another idea if my patch isn't acceptable.
> 
> OK, anyway, your patch doesn't care the case that the multiple functions
> are probed by kretprobes. In that case, you'll see several entries are
> replaced by the kretprobe_trampoline. To find it correctly, you have
> to pass a state-holder (@cur of the kretprobe_find_ret_addr())
> to the fixup entries.

I'll see if I can figure something out tomorrow.

Thanks,
Daniel


Re: [PATCH] x86: kprobes: orc: Fix ORC walks in kretprobes

2021-03-05 Thread Daniel Xu
On Fri, Mar 05, 2021 at 01:32:44PM -0600, Josh Poimboeuf wrote:
> On Fri, Mar 05, 2021 at 11:25:15AM -0800, Daniel Xu wrote:
> > > BTW, is this a regression? or CONFIG_UNWINDER_ORC has this issue before?
> > > It seems that the above commit just changed the default unwinder. This 
> > > means
> > > OCR stack unwinder has this bug before that commit.
> > 
> > I see your point -- I suppose it depends on point of view. Viewed from
> > userspace, a change in kernel defaults means that one kernel worked and
> > the next one didn't -- all without the user doing anything. Consider it
> > from the POV of a typical linux user who just takes whatever the distro
> > gives them and doesn't compile their own kernels.
> > 
> > From the kernel point of view, you're also right. ORC didn't regress, it
> > was always broken for this particular use case. But as a primarily
> > userspace developer, I would consider this a kernel regression.
> 
> Either way, if the bug has always existed in the ORC unwinder, the Fixes
> tag needs to reference the original ORC commit:
> 
>   Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder")
> 
> That makes it possible for stable kernels to identify the source of the
> bug and corresponding fix.  Many people used the ORC unwinder before it
> became the default.

Got it. I'll change it in the next version if we get to V2 (another
ongoing discussion in Masami's patchset).

Thanks,
Daniel


Re: [PATCH] x86: kprobes: orc: Fix ORC walks in kretprobes

2021-03-05 Thread Daniel Xu
On Fri, Mar 05, 2021 at 07:58:09PM +0900, Masami Hiramatsu wrote:
> On Fri, 5 Mar 2021 18:28:06 +0900
> Masami Hiramatsu  wrote:
> 
> > Hi Daniel,
> > 
> > On Thu,  4 Mar 2021 16:07:52 -0800
> > Daniel Xu  wrote:
> > 
> > > Getting a stack trace from inside a kretprobe used to work with frame
> > > pointer stack walks. After the default unwinder was switched to ORC,
> > > stack traces broke because ORC did not know how to skip the
> > > `kretprobe_trampoline` "frame".
> > > 
> > > Frame based stack walks used to work with kretprobes because
> > > `kretprobe_trampoline` does not set up a new call frame. Thus, the frame
> > > pointer based unwinder could walk directly to the kretprobe'd caller.
> > > 
> > > For example, this stack is walked incorrectly with ORC + kretprobe:
> > > 
> > > # bpftrace -e 'kretprobe:do_nanosleep { @[kstack] = count() }'
> > > Attaching 1 probe...
> > > ^C
> > > 
> > > @[
> > > kretprobe_trampoline+0
> > > ]: 1
> > > 
> > > After this patch, the stack is walked correctly:
> > > 
> > > # bpftrace -e 'kretprobe:do_nanosleep { @[kstack] = count() }'
> > > Attaching 1 probe...
> > > ^C
> > > 
> > > @[
> > > kretprobe_trampoline+0
> > > __x64_sys_nanosleep+150
> > > do_syscall_64+51
> > > entry_SYSCALL_64_after_hwframe+68
> > > ]: 12
> > > 
> > > Fixes: fc72ae40e303 ("x86/unwind: Make CONFIG_UNWINDER_ORC=y the default 
> > > in kconfig for 64-bit")
> > > Signed-off-by: Daniel Xu 
> > 
> > OK, basically good, but this is messy, and doing much more than fixing 
> > issue.

Thanks for taking a look!

> BTW, is this a regression? or CONFIG_UNWINDER_ORC has this issue before?
> It seems that the above commit just changed the default unwinder. This means
> OCR stack unwinder has this bug before that commit.

I see your point -- I suppose it depends on point of view. Viewed from
userspace, a change in kernel defaults means that one kernel worked and
the next one didn't -- all without the user doing anything. Consider it
from the POV of a typical linux user who just takes whatever the distro
gives them and doesn't compile their own kernels.

From the kernel point of view, you're also right. ORC didn't regress, it
was always broken for this particular use case. But as a primarily
userspace developer, I would consider this a kernel regression.


> If you choose the CONFIG_UNWINDER_FRAME_POINTER, it might work again.

Yes, I've confirmed that switching to CONFIG_UNWINDER_FRAME_POINTER does
fix the issue. But it's a non-starter for production machines b/c the
perf regression is too significant.

Thanks,
Daniel


Re: [PATCH -tip 0/5] kprobes: Fix stacktrace in kretprobes

2021-03-05 Thread Daniel Xu
Hi Masami,

On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> Here is a series of patches for kprobes and stacktracer to fix the kretprobe
> entries in the kernel stack. This was reported by Daniel Xu. I thought that
> was in the bpftrace, but it is actually more generic issue.
> So I decided to fix the issue in arch independent part.
> 
> While fixing the issue, I found a bug in ia64 related to kretprobe, which is
> fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main
> issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
> internal change. And [5/5] removing the stacktrace kretprobe fixup code in
> ftrace. 
> 
> Daniel, can you also check that this fixes your issue too? I hope it is.

Unfortunately, this patch series does not fix the issue I reported.

I think the reason your tests work is because you're using ftrace and
the ORC unwinder is aware of ftrace trampolines (see
arch/x86/kernel/unwind_orc.c:orc_ftrace_find).

bpftrace kprobes go through perf event subsystem (ie not ftrace) so
naturally orc_ftrace_find() does not find an associated trampoline. ORC
unwinding fails in this case because
arch/x86/kernel/kprobes/core.c:trampoline_handler sets

regs->ip = (unsigned long)&kretprobe_trampoline;

and `kretprobe_trampoline` is marked

STACK_FRAME_NON_STANDARD(kretprobe_trampoline);

so it doesn't have a valid ORC entry. Thus, ORC immediately bails when
trying to unwind past the first frame.

The only way I can think of to fix this issue is to make the ORC
unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
you have another idea if my patch isn't acceptable.

Thanks,
Daniel


[PATCH] x86: kprobes: orc: Fix ORC walks in kretprobes

2021-03-04 Thread Daniel Xu
Getting a stack trace from inside a kretprobe used to work with frame
pointer stack walks. After the default unwinder was switched to ORC,
stack traces broke because ORC did not know how to skip the
`kretprobe_trampoline` "frame".

Frame based stack walks used to work with kretprobes because
`kretprobe_trampoline` does not set up a new call frame. Thus, the frame
pointer based unwinder could walk directly to the kretprobe'd caller.

For example, this stack is walked incorrectly with ORC + kretprobe:

# bpftrace -e 'kretprobe:do_nanosleep { @[kstack] = count() }'
Attaching 1 probe...
^C

@[
kretprobe_trampoline+0
]: 1

After this patch, the stack is walked correctly:

# bpftrace -e 'kretprobe:do_nanosleep { @[kstack] = count() }'
Attaching 1 probe...
^C

@[
kretprobe_trampoline+0
__x64_sys_nanosleep+150
do_syscall_64+51
entry_SYSCALL_64_after_hwframe+68
]: 12

Fixes: fc72ae40e303 ("x86/unwind: Make CONFIG_UNWINDER_ORC=y the default in kconfig for 64-bit")
Signed-off-by: Daniel Xu 
---
 arch/x86/kernel/unwind_orc.c | 53 +++-
 kernel/kprobes.c |  8 +++---
 2 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 2a1d47f47eee..1b88d75e2e9e 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -1,7 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -77,9 +79,11 @@ static struct orc_entry *orc_module_find(unsigned long ip)
 }
 #endif
 
-#ifdef CONFIG_DYNAMIC_FTRACE
+#if defined(CONFIG_DYNAMIC_FTRACE) || defined(CONFIG_KRETPROBES)
 static struct orc_entry *orc_find(unsigned long ip);
+#endif
 
+#ifdef CONFIG_DYNAMIC_FTRACE
 /*
  * Ftrace dynamic trampolines do not have orc entries of their own.
  * But they are copies of the ftrace entries that are static and
@@ -117,6 +121,43 @@ static struct orc_entry *orc_ftrace_find(unsigned long ip)
 }
 #endif
 
+#ifdef CONFIG_KRETPROBES
+static struct orc_entry *orc_kretprobe_find(void)
+{
+   kprobe_opcode_t *correct_ret_addr = NULL;
+   struct kretprobe_instance *ri = NULL;
+   struct llist_node *node;
+
+   node = current->kretprobe_instances.first;
+   while (node) {
+   ri = container_of(node, struct kretprobe_instance, llist);
+
+   if ((void *)ri->ret_addr != &kretprobe_trampoline) {
+   /*
+* This is the real return address. Any other
+* instances associated with this task are for
+* other calls deeper on the call stack
+*/
+   correct_ret_addr = ri->ret_addr;
+   break;
+   }
+
+
+   node = node->next;
+   }
+
+   if (!correct_ret_addr)
+   return NULL;
+
+   return orc_find((unsigned long)correct_ret_addr);
+}
+#else
+static struct orc_entry *orc_kretprobe_find(void)
+{
+   return NULL;
+}
+#endif
+
 /*
  * If we crash with IP==0, the last successfully executed instruction
  * was probably an indirect function call with a NULL function pointer,
@@ -148,6 +189,16 @@ static struct orc_entry *orc_find(unsigned long ip)
if (ip == 0)
return _orc_entry;
 
+   /*
+* Kretprobe lookup -- must occur before vmlinux addresses as
+* kretprobe_trampoline is in the symbol table.
+*/
+   if (ip == (unsigned long)&kretprobe_trampoline) {
+   orc = orc_kretprobe_find();
+   if (orc)
+   return orc;
+   }
+
/* For non-init vmlinux addresses, use the fast lookup table: */
if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
unsigned int idx, start, stop;
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 745f08fdd7a6..334c23d33451 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1895,10 +1895,6 @@ unsigned long __kretprobe_trampoline_handler(struct pt_regs *regs,
BUG_ON(1);
 
 found:
-   /* Unlink all nodes for this frame. */
-   current->kretprobe_instances.first = node->next;
-   node->next = NULL;
-
/* Run them..  */
while (first) {
ri = container_of(first, struct kretprobe_instance, llist);
@@ -1917,6 +1913,10 @@ unsigned long __kretprobe_trampoline_handler(struct pt_regs *regs,
recycle_rp_inst(ri);
}
 
+   /* Unlink all nodes for this frame. */
+   current->kretprobe_instances.first = node->next;
+   node->next = NULL;
+
return (unsigned long)correct_ret_addr;
 }
 NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
-- 
2.30.1



Re: Broken kretprobe stack traces

2021-03-04 Thread Daniel Xu
On Wed, Mar 3, 2021, at 6:18 PM, Daniel Xu wrote:
> On Wed, Mar 03, 2021 at 03:37:40PM -0500, Steven Rostedt wrote:
> > On Wed, 03 Mar 2021 12:13:08 -0800
> > "Daniel Xu"  wrote:
> > 
> > > On Wed, Mar 3, 2021, at 11:58 AM, Daniel Xu wrote:
> > > > On Wed, Mar 03, 2021 at 09:26:04AM -0500, Steven Rostedt wrote:  
> > > > > On Wed, 3 Mar 2021 13:48:28 +0900
> > > > > Masami Hiramatsu  wrote:
> > > > >   
> > > > > >   
> > > > > > > 
> > > > > > > I think (can't prove) this used to work:
> > > > > 
> > > > > Would be good to find out if it did.  
> > > > 
> > > > I'm installing some older kernels now to check. Will report back.  
> > > 
> > > Yep, works in 4.11. So there was a regression somewhere.
> > 
> > Care to bisect? ;-)
> 
> Took a while (I'll probably be typing "test_regression.sh" in my sleep
> tonight) but I've bisected it down to f95b23a112f1 ("Merge branch
> 'x86/urgent' into x86/asm, to pick up dependent fixes").
> 
> I think I saw the default option for stack unwinder change from frame
> pointers -> ORC so that may be the root cause. Not sure, though. Need to
> look more closely at the commits in the merge commit.
> 
> <...>
> 
> Daniel
>

Compiling with:

CONFIG_UNWINDER_ORC=n
CONFIG_UNWINDER_FRAME_POINTER=y

fixes the issues and leads me to believe stacktraces on kretprobes
never worked with ORC.

Josh, any chance you have an idea?

Thanks,
Daniel


Re: Broken kretprobe stack traces

2021-03-03 Thread Daniel Xu
On Wed, Mar 03, 2021 at 03:37:40PM -0500, Steven Rostedt wrote:
> On Wed, 03 Mar 2021 12:13:08 -0800
> "Daniel Xu"  wrote:
> 
> > On Wed, Mar 3, 2021, at 11:58 AM, Daniel Xu wrote:
> > > On Wed, Mar 03, 2021 at 09:26:04AM -0500, Steven Rostedt wrote:  
> > > > On Wed, 3 Mar 2021 13:48:28 +0900
> > > > Masami Hiramatsu  wrote:
> > > >   
> > > > >   
> > > > > > 
> > > > > > I think (can't prove) this used to work:
> > > > 
> > > > Would be good to find out if it did.  
> > > 
> > > I'm installing some older kernels now to check. Will report back.  
> > 
> > Yep, works in 4.11. So there was a regression somewhere.
> 
> Care to bisect? ;-)

Took a while (I'll probably be typing "test_regression.sh" in my sleep
tonight) but I've bisected it down to f95b23a112f1 ("Merge branch
'x86/urgent' into x86/asm, to pick up dependent fixes").

I think I saw the default option for stack unwinder change from frame
pointers -> ORC so that may be the root cause. Not sure, though. Need to
look more closely at the commits in the merge commit.

<...>

Daniel


Re: Broken kretprobe stack traces

2021-03-03 Thread Daniel Xu
On Wed, Mar 03, 2021 at 01:48:28PM +0900, Masami Hiramatsu wrote:
> Hi Daniel,
> 
> On Tue, 02 Mar 2021 17:15:13 -0800
> "Daniel Xu"  wrote:
> 
> > Hi Masami,
> > 
> > Jakub reported a bug with kretprobe stack traces -- wondering if you've 
> > gotten
> > any bug reports related to stack traces being broken for kretprobes.
> 
> Yeah, stack dumper must check the stack entry is kretprobe'd and skip it.
> For example, ftrace checks it as below.
> 
> /sys/kernel/debug/tracing # echo r vfs_read > kprobe_events 
> /sys/kernel/debug/tracing # echo stacktrace > 
> events/kprobes/r_vfs_read_0/trigger 
> /sys/kernel/debug/tracing # echo 1 > events/kprobes/r_vfs_read_0/enable 
> /sys/kernel/debug/tracing # head -20 trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 15685/336094   #P:8
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
> #              | |       |   ||||      |         |
>               sh-132   [006] ...1   38.920352:
>  => kretprobe_dispatcher
>  => __kretprobe_trampoline_handler
>  => trampoline_handler
>  => [unknown/kretprobe'd]
>  => [unknown/kretprobe'd]
>  => __x64_sys_read
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> 
> 
> > 
> > I think (can't prove) this used to work:
> 
> I'm not sure the bpftrace had correctly handled it or not.
> 
> > 
> > # bpftrace -e 'kretprobe:__tcp_retransmit_skb { @[kstack()] = count() }'
> > Attaching 1 probe...
> > ^C
> > 
> > @[
> > kretprobe_trampoline+0
> > ]: 1
> 
> Would you know how the bpftrace stacktracer rewinds the stack entries?
> FYI, ftrace does it in trace_seq_print_sym()@kernel/trace/trace_output.c

Thanks for the hint, I'll take a look.

bpftrace generates a bpf program that calls into the kernel's
bpf_get_stackid():

https://github.com/torvalds/linux/blob/f69d02e37a85645aa90d18cacfff36dba370f797/include/uapi/linux/bpf.h#L1296

so it could be a bug with bpf.

<...>



Re: Broken kretprobe stack traces

2021-03-03 Thread Daniel Xu
On Wed, Mar 03, 2021 at 09:26:04AM -0500, Steven Rostedt wrote:
> On Wed, 3 Mar 2021 13:48:28 +0900
> Masami Hiramatsu  wrote:
> 
> > 
> > > 
> > > I think (can't prove) this used to work:  
> 
> Would be good to find out if it did.

I'm installing some older kernels now to check. Will report back.

<...>

Thanks,
Daniel


Re: Broken kretprobe stack traces

2021-03-03 Thread Daniel Xu
On Wed, Mar 3, 2021, at 11:58 AM, Daniel Xu wrote:
> On Wed, Mar 03, 2021 at 09:26:04AM -0500, Steven Rostedt wrote:
> > On Wed, 3 Mar 2021 13:48:28 +0900
> > Masami Hiramatsu  wrote:
> > 
> > > 
> > > > 
> > > > I think (can't prove) this used to work:  
> > 
> > Would be good to find out if it did.
> 
> I'm installing some older kernels now to check. Will report back.

Yep, works in 4.11. So there was a regression somewhere.

Thanks,
Daniel


Re: Why do kprobes and uprobes singlestep?

2021-03-03 Thread Daniel Xu
On Tue, Mar 02, 2021 at 06:18:23PM -0800, Alexei Starovoitov wrote:
> On Tue, Mar 2, 2021 at 5:46 PM Andy Lutomirski  wrote:
> >
> >
> > > On Mar 2, 2021, at 5:22 PM, Alexei Starovoitov 
> > >  wrote:
> > >
> > > On Tue, Mar 2, 2021 at 1:02 PM Andy Lutomirski  
> > > wrote:
> > >>
> > >>
> >  On Mar 2, 2021, at 12:24 PM, Alexei Starovoitov 
> >   wrote:
> > >>>
> > >>> On Tue, Mar 2, 2021 at 10:38 AM Andy Lutomirski  
> > >>> wrote:
> > 
> >  Is there something like a uprobe test suite?  How maintained /
> >  actively used is uprobe?
> > >>>
> > >>> uprobe+bpf is heavily used in production.
> > >>> selftests/bpf has only one test for it though.
> > >>>
> > >>> Why are you asking?
> > >>
> > >> Because the integration with the x86 entry code is a mess, and I want to 
> > >> know whether to mark it BROKEN or how to make sure the any cleanups 
> > >> actually work.
> > >
> > > Any test case to repro the issue you found?
> > > Is it a bug or just messy code?
> >
> > Just messy code.
> >
> > > Nowadays a good chunk of popular applications (python, mysql, etc) has
> > > USDTs in them.
> > > Issues reported with bcc:
> > > https://github.com/iovisor/bcc/issues?q=is%3Aissue+USDT
> > > Similar thing with bpftrace.
> > > Both standard USDT and semaphore based are used in the wild.
> > > uprobe for containers has been a long standing feature request.
> > > If you can improve uprobe performance that would be awesome.
> > > That's another thing that people report often. We optimized it a bit.
> > > More can be done.
> >
> >
> > Wait... USDT is much easier to implement well.  Are we talking just USDT or 
> > are we talking about general uprobes in which almost any instruction can 
> > get probed?  If the only users that care about uprobes are doing USDT, we 
> > could vastly simplify the implementation and probably make it faster, too.
> 
> USDTs are driving the majority of uprobe usage.

I'd say 50/50 in my experience. Larger userspace applications using bpf
for production monitoring tend to use USDT for stability and ABI reasons
(hard for bpf to read C++ classes). Bare uprobes (ie not USDT) are used
quite often for ad-hoc production debugging.

> If they can get faster it will increase their adoption even more.
> There are certainly cases of normal uprobes.
> They are at the start of the function 99% of the time.
> Like the following:
> "uprobe:/lib64/libc.so:malloc(u64 size):size:size,_ret",
> "uprobe:/lib64/libc.so:free(void *ptr)::ptr",
> is common despite its overhead.
> 
> Here is the most interesting and practical usage of uprobes:
> https://github.com/iovisor/bcc/blob/master/tools/sslsniff.py
> and the manpage for the tool:
> https://github.com/iovisor/bcc/blob/master/tools/sslsniff_example.txt
> 
> uprobe in the middle of the function is very rare.
> If the kernel starts rejecting uprobes on some weird instructions
> I suspect no one will complain.

I think it would be great if the kernel could reject mid-instruction
uprobes. Unlike with kprobes, you can place uprobes on immediate
operands, which can cause silent data corruption. See
https://github.com/iovisor/bpftrace/pull/803#issuecomment-507693933
for a funny example.

To prevent accidental (and silent) data corruption, bpftrace uses a
disassembler to ensure uprobes are placed on instruction boundaries.
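
The check itself is conceptually simple. A sketch using Capstone as the
disassembler (bpftrace's actual implementation may differ) looks like
this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <capstone/capstone.h>

/* return true if `off` lands on an instruction boundary within `code` */
static bool on_insn_boundary(const uint8_t *code, size_t len, uint64_t off)
{
	cs_insn *insn;
	size_t count, i;
	bool ok = false;
	csh handle;

	if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
		return false;

	count = cs_disasm(handle, code, len, 0 /* base address */, 0, &insn);
	for (i = 0; i < count; i++) {
		if (insn[i].address == off) {
			ok = true;
			break;
		}
	}

	if (count)
		cs_free(insn, count);
	cs_close(&handle);
	return ok;
}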

<...>

Daniel


Broken kretprobe stack traces

2021-03-03 Thread Daniel Xu
Hi Masami,

Jakub reported a bug with kretprobe stack traces -- wondering if you've gotten
any bug reports related to stack traces being broken for kretprobes.

I think (can't prove) this used to work:

# bpftrace -e 'kretprobe:__tcp_retransmit_skb { @[kstack()] = count() }'
Attaching 1 probe...
^C

@[
kretprobe_trampoline+0
]: 1

fentry/fexit probes seem to work:

# bpftrace -e 'kretfunc:__tcp_retransmit_skb { @[kstack()] = count() }'
Attaching 1 probe...
^C

@[
ftrace_trampoline+10799
bpf_get_stackid_raw_tp+121
ftrace_trampoline+10799
__tun_chr_ioctl.isra.0.cold+33312
__tcp_retransmit_skb+5
tcp_send_loss_probe+254
tcp_write_timer_handler+394
tcp_write_timer+149
call_timer_fn+41
__run_timers+493
run_timer_softirq+25
__softirqentry_text_start+207
asm_call_sysvec_on_stack+18
do_softirq_own_stack+55
irq_exit_rcu+158
sysvec_apic_timer_interrupt+54
asm_sysvec_apic_timer_interrupt+18
]: 1
@[
ftrace_trampoline+10799
bpf_get_stackid_raw_tp+121
ftrace_trampoline+10799
__tun_chr_ioctl.isra.0.cold+33312
__tcp_retransmit_skb+5
  <...>

which makes me suspect it's a kprobe specific issue.

Thanks,
Daniel


[PATCH bpf v7 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-17 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Acked-by: Song Liu 
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 2 files changed, 96 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..e419298132b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str1[] = "mestring";
+static const char str2[] = "mestringalittlebigger";
+static const char str3[] = "mestringblubblubblubblubblub";
+
+static int test_one_str(struct test_probe_read_user_str *skel, const char *str,
+   size_t len)
+{
+   int err, duration = 0;
+   char buf[256];
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, len);
+
+   /* Give prog our userspace pointer */
+   skel->bss->user_ptr = buf;
+
+   /* Trigger tracepoint */
+   usleep(1);
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog_ret", "prog returned: %ld\n",
+ skel->bss->ret))
+   return 1;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, len);
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   return 1;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + len, buf, sizeof(buf) - len);
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   return 1;
+
+   return 0;
+}
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int err, duration = 0;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   return;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   if (test_one_str(skel, str1, sizeof(str1)))
+   goto out;
+   if (test_one_str(skel, str2, sizeof(str2)))
+   goto out;
+   if (test_one_str(skel, str3, sizeof(str3)))
+   goto out;
+
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..3ae398b75dcd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+pid_t pid = 0;
+long ret = 0;
+void *user_ptr = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int on_write(void *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.29.2



[PATCH bpf v7 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-17 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is that when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.
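
A small userspace illustration of the effect on fixed-width keys (not
from the patch, just the failure mode described above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* same C string, but different bytes after the NUL terminator */
	char a[8] = { 'f', 'o', 'o', '\0', 0, 0, 0, 0 };
	char b[8] = { 'f', 'o', 'o', '\0', 'j', 'u', 'n', 'k' };

	/* strcmp stops at the NUL; a fixed-width map key comparison does not */
	printf("strcmp=%d memcmp=%d\n", strcmp(a, b), memcmp(a, b, sizeof(a)));
	return 0;
}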

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
As mentioned in the v6 discussion, I didn't think it would make a lot
of sense to put a comment in kernel/bpf/hashtab.c:alloc_htab_elem.
I opted to add the comment to bpf_probe_read_user_str_common() b/c it
seems like the next best place. Just let me know if you want it
somewhere else.

 kernel/trace/bpf_trace.c | 10 ++
 lib/strncpy_from_user.c  | 19 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 5113fd423cdf..048c655315f1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -181,6 +181,16 @@ bpf_probe_read_user_str_common(void *dst, u32 size,
 {
int ret;
 
+   /*
+* NB: We rely on strncpy_from_user() not copying junk past the NUL
+* terminator into `dst`.
+*
+* strncpy_from_user() does long-sized strides in the fast path. If the
+* strncpy does not mask out the bytes after the NUL in `unsafe_ptr`,
+* then there could be junk after the NUL in `dst`. If user takes `dst`
+* and keys a hash map with it, then semantically identical strings can
+* occupy multiple entries in the map.
+*/
ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
if (unlikely(ret < 0))
memset(dst, 0, size);
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..122d8d0e253c 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,32 @@ static inline long do_strncpy_from_user(char *dst, const char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
+   /*
+* Note that we mask out the bytes following the NUL. This is
+* important to do because string oblivious code may read past
+* the NUL. For those routines, we don't want to give them
+* potentially random bytes after the NUL in `src`.
+*
+* One example of such code is BPF map keys. BPF treats map keys
+* as an opaque set of bytes. Without the post-NUL mask, any BPF
+* maps keyed by strings returned from strncpy_from_user() may
+* have multiple entries for semantically identical strings.
+*/
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   *(unsigned long *)(dst+res) = c & mask;
return res + find_zero(data);
}
+
+   *(unsigned long *)(dst+res) = c;
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.29.2



[PATCH bpf v7 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-17 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v6 -> v7:
* Add comments

v5 -> v6:
* zero-pad up to sizeof(unsigned long) after NUL

v4 -> v5:
* don't read potentially uninitialized memory

v3 -> v4:
* directly pass userspace pointer to prog
* test more strings of different length

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 kernel/trace/bpf_trace.c  | 10 +++
 lib/strncpy_from_user.c   | 19 -
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 4 files changed, 123 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.29.2



Re: [PATCH bpf v6 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-16 Thread Daniel Xu
Hi Linus,

On Mon, Nov 16, 2020 at 02:15:52PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 1:17 PM Daniel Xu  wrote:
> >
> > Based on on-list discussion and some off-list discussion with Alexei,
> > I'd like to propose the v4-style patch without the `(*out & ~mask)`
> > bit.
> 
> So I've verified that at least on x86-64, this doesn't really make
> code generation any worse, and I'm ok with the patch from that
> standpoint.
> 
> However, this was not what the discussion actually aimed at as far
> as I'm concerned.
> 
> I mentioned that if BPF cares about the bytes past the end of the
> string, I want a *BIG COMMENT* about it. Yes, in strncpy_from_user()
> itself, but even more in the place that cares.

Sure, sorry. Will send another version with comments.

> And no, that does not mean bpf_probe_read_user_str().  That function
> clearly doesn't care at all, and doesn't access anything past the end
> of the string. I want a comment in whatever code that accesses past
> the end of the string.

If I'm reading things right, that place is technically in
kernel/bpf/hashtab.c:alloc_htab_elem. In line:

memcpy(l_new->key, key, key_size);

where `key_size` is the width of the hashtab key. The flow looks like:


  bpf_map_update_elem()
    htab_map_update_elem()
      alloc_htab_elem()
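
(For reference, the update path there hashes and compares the full
fixed-width key -- roughly the following, paraphrased from hashtab.c
rather than quoted:

	hash = htab_map_hash(key, key_size, htab->hashrnd);	/* jhash over all key_size bytes */
	l_old = lookup_elem_raw(head, hash, key, key_size);	/* ... && !memcmp(l->key, key, key_size) */

so any byte that differs after the NUL changes both the hash and the
compare, and therefore creates a separate entry.)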

It feels a bit weird to me to add a comment about strings in there
because the hash table code is string-agnostic, as mentioned in the
commit message.

> And your ABI point is actively misleading:
> 
> > We can't really zero out the rest of the buffer due to ABI issues.
> > The bpf docs state for bpf_probe_read_user_str():
> >
> > > In case the string length is smaller than *size*, the target is not
> > > padded with further NUL bytes.
> 
> This comment is actively wrong and misleading.
> 
> The code (after the patch) clearly *does* pad a bit with "further NUL
> bytes". It's just that it doesn't pad all the way to the end.

Right, it's a bit ugly and perhaps misleading. But it fixes the real
problem of keying a map with potentially random memory that
strncpy_from_user() might append. If we pad a deterministic number of
zero bytes, it should be ok.
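
To make that concrete, here is a small user-space sketch of what the
patched loop writes for the word that contains the NUL (assuming 64-bit
little-endian; the constants mirror the generic word-at-a-time helpers,
but this is only an illustration, not the kernel code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* The 8-byte word read from user memory: "abc" + NUL + junk */
	const char src[8] = { 'a', 'b', 'c', 0, 'J', 'U', 'N', 'K' };
	uint64_t c, bits, mask;
	unsigned char dst[8];
	size_t nul_at;

	memcpy(&c, src, sizeof(c));
	memset(dst, 0xAA, sizeof(dst));		/* pretend dst held old data */

	/* has_zero(): 0x80 set in each byte of c that is 0x00 */
	bits = (c - 0x0101010101010101ULL) & ~c & 0x8080808080808080ULL;
	/* create_zero_mask()/zero_bytemask(): 0xff in the bytes before the NUL */
	mask = ((bits - 1) & ~bits) >> 7;
	/* find_zero(): index of the NUL within the word */
	nul_at = (mask * 0x0001020304050608ULL) >> 56;

	c &= mask;				/* keep "abc", zero the NUL and the junk */
	memcpy(dst, &c, sizeof(c));

	printf("nul_at=%zu, dst =", nul_at);
	for (size_t i = 0; i < sizeof(dst); i++)
		printf(" %02x", dst[i]);
	printf("\n");	/* nul_at=3, dst = 61 62 63 00 00 00 00 00 */
	return 0;
}

Bytes from the NUL to the end of that unsigned long come out as zero, and
anything past that word in dst is never written, so the padding is
deterministic.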

> Where is the actual buffer zeroing done?

Usually the bpf prog does it. I believe (could be wrong) the verifier
enforces the key is initialized in some form.

For my specific use case, it's done in the bpftrace code:
https://github.com/iovisor/bpftrace/blob/0c88a1a4711a3d13e8c9475f7d08f83a5018fdfd/src/ast/codegen_llvm.cpp#L529-L530
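
(Roughly, the common BPF-side pattern looks like the sketch below; the map
name and key size are made up for illustration:

	char key[64] = {};	/* zeroed before the read */
	long one = 1;

	bpf_probe_read_user_str(key, sizeof(key), user_ptr);
	bpf_map_update_elem(&counts, &key, &one, BPF_ANY);

With the key fully zeroed up front and no junk copied after the NUL, two
reads of the same string always produce identical key bytes.)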

> Because without the buffer zeroing, this whole patch is completely
> pointless. Which is why I want that comment, and I want a pointer to
> where that zeroing is done.
> 
> Really. You have two cases:
> 
>  (a) the buffer isn't zeroed before the strncpy_from_user()
> 
>  (b) the buffer is guaranteed to be zero before that
> 
> and in case (a), this patch is pointless, since the data after the
> string is already undefined.

See above -- I think the verifier enforces that the data is initialized.

> And in case (b), I want to see a comment and a pointer to the code
> that actually does the zeroing.

See above also.

> HOWEVER. Look at bpf_probe_read_user_str_common(), and notice how it
> ALREADY does the zeroing of the buffer past the end, it's just that it
> only does it in the error case.

I don't think the error case is very relevant here. I normally wouldn't
make any assumptions about the state of a buffer after a failed function
call.

> Why do you send this patch, instead of
> 
>  (a) get rid of the pointless pre-zeroing
> 
>  (b) change bpf_probe_read_user_str_common() to do
> 
> int ret;
> u32 offset;
> 
> ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
> offset = ret < 0 ? 0 : ret;
> memset(dst+offset, 0, size-offset);
> return ret;
> 
> which seems to be much simpler anyway. The comment you quote about
> "target is not padded with further NUL bytes" is clearly wrong anyway,
> since that error case *does* pad the target with NUL bytes, and always
> has.

I think on the bpf side there's the same concern about performance.
You can't really dynamically size buffers in bpf land so users usually
use a larger buffer than necessary, sometimes on the order of KBs. So if
we unnecessarily zero out the rest of the buffer it could cause perf
regressions.

[...]

Thanks,
Daniel


[PATCH bpf v6 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-16 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, 
kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
Based on on-list discussion and some off-list discussion with Alexei,
I'd like to propose the v4-style patch without the `(*out & ~mask)`
bit.

We can't really zero out the rest of the buffer due to ABI issues.
The bpf docs state for bpf_probe_read_user_str():

> In case the string length is smaller than *size*, the target is not
> padded with further NUL bytes.

 lib/strncpy_from_user.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..de084f04e50d 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,21 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
if (has_zero(c, , )) {
data = prep_zero_mask(c, data, );
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   *(unsigned long *)(dst+res) = c & mask;
return res + find_zero(data);
}
+
+   *(unsigned long *)(dst+res) = c;
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.29.2



[PATCH bpf v6 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-16 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v5 -> v6:
* zero-pad up to sizeof(unsigned long) after NUL

v4 -> v5:
* don't read potentially uninitialized memory

v3 -> v4:
* directly pass userspace pointer to prog
* test more strings of different length

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 lib/strncpy_from_user.c   |  8 ++-
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 3 files changed, 102 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.29.2



[PATCH bpf v6 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-16 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Acked-by: Song Liu 
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 2 files changed, 96 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..e419298132b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str1[] = "mestring";
+static const char str2[] = "mestringalittlebigger";
+static const char str3[] = "mestringblubblubblubblubblub";
+
+static int test_one_str(struct test_probe_read_user_str *skel, const char *str,
+   size_t len)
+{
+   int err, duration = 0;
+   char buf[256];
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, len);
+
+   /* Give prog our userspace pointer */
+   skel->bss->user_ptr = buf;
+
+   /* Trigger tracepoint */
+   usleep(1);
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog_ret", "prog returned: %ld\n",
+ skel->bss->ret))
+   return 1;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, len);
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   return 1;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + len, buf, sizeof(buf) - len);
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   return 1;
+
+   return 0;
+}
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int err, duration = 0;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   return;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   if (test_one_str(skel, str1, sizeof(str1)))
+   goto out;
+   if (test_one_str(skel, str2, sizeof(str2)))
+   goto out;
+   if (test_one_str(skel, str3, sizeof(str3)))
+   goto out;
+
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..3ae398b75dcd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+pid_t pid = 0;
+long ret = 0;
+void *user_ptr = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int on_write(void *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.29.2



Re: [PATCH bpf v5 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-12 Thread Daniel Xu
On Wed Nov 11, 2020 at 3:22 PM PST, Andrii Nakryiko wrote:
> On Wed, Nov 11, 2020 at 2:46 PM Daniel Xu  wrote:
> >
> > 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
> > kernel}_str helpers") introduced a subtle bug where
> > bpf_probe_read_user_str() would potentially copy a few extra bytes after
> > the NUL terminator.
> >
> > This issue is particularly nefarious when strings are used as map keys,
> > as seemingly identical strings can occupy multiple entries in a map.
> >
> > This patchset fixes the issue and introduces a selftest to prevent
> > future regressions.
> >
> > v4 -> v5:
> > * don't read potentially uninitialized memory
>
> I think the bigger problem was that it could overwrite unintended
> memory. E.g., in BPF program, if you had something like:
>
> char my_buf[8 + 3];
> char my_precious_data[5] = {1, 2, 3, 4, 5};

How does that happen?

The

while (max >= sizeof(unsigned long)) {
/* copy sizeof(unsigned long) bytes */

max -= sizeof(unsigned long)
}

/* copy byte at a time */

where `max` is the user-supplied length should prevent that kind of
corruption, right?

[...]


[PATCH bpf v5 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-11 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit has strncpy start copying a byte at a time if a NUL is
spotted.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, 
kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
 lib/strncpy_from_user.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..83180742e729 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -40,12 +40,11 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
+   if (has_zero(c, , ))
+   goto byte_at_a_time;
+
*(unsigned long *)(dst+res) = c;
-   if (has_zero(c, , )) {
-   data = prep_zero_mask(c, data, );
-   data = create_zero_mask(data);
-   return res + find_zero(data);
-   }
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.29.2



Re: [selftest/bpf] b83590ee1a: BUG:KASAN:slab-out-of-bounds_in_l

2020-11-11 Thread Daniel Xu
Hi Daniel,

On Mon Nov 9, 2020 at 8:54 AM PST, Daniel Borkmann wrote:
> Hi Daniel,
>
> On 11/9/20 3:54 PM, kernel test robot wrote:
> > Greeting,
> > 
> > FYI, we noticed the following commit (built with gcc-9):
> > 
> > commit: b83590ee1add052518603bae607b0524632b7793 ("[PATCH bpf v3 2/2] 
> > selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after 
> > NUL")
> > url: 
> > https://github.com/0day-ci/linux/commits/Daniel-Xu/Fix-bpf_probe_read_user_str-overcopying/20201106-033210
> > base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf.git master
>
> I've tossed them from the tree for now as it looks like these are adding
> regressions
> for regular strncpy_from_user() calls, please take a look.
>
> Thanks!

Sorry about the KASAN issue.

I spent a day reproing. The kasan warnings seem a bit misleading but I
think I have a fix. I'll put up a v5 shortly. I'll see if any of the bots
find errors on it.

Thanks,
Daniel


[PATCH bpf v5 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-11 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v4 -> v5:
* don't read potentially uninitialized memory

v3 -> v4:
* directly pass userspace pointer to prog
* test more strings of different length

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 lib/strncpy_from_user.c   |  9 ++-
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 3 files changed, 100 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.29.2



[PATCH bpf v5 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-11 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Acked-by: Song Liu 
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 2 files changed, 96 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..e419298132b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str1[] = "mestring";
+static const char str2[] = "mestringalittlebigger";
+static const char str3[] = "mestringblubblubblubblubblub";
+
+static int test_one_str(struct test_probe_read_user_str *skel, const char *str,
+   size_t len)
+{
+   int err, duration = 0;
+   char buf[256];
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, len);
+
+   /* Give prog our userspace pointer */
+   skel->bss->user_ptr = buf;
+
+   /* Trigger tracepoint */
+   usleep(1);
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog_ret", "prog returned: %ld\n",
+ skel->bss->ret))
+   return 1;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, len);
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   return 1;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + len, buf, sizeof(buf) - len);
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   return 1;
+
+   return 0;
+}
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int err, duration = 0;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   return;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   if (test_one_str(skel, str1, sizeof(str1)))
+   goto out;
+   if (test_one_str(skel, str2, sizeof(str2)))
+   goto out;
+   if (test_one_str(skel, str3, sizeof(str3)))
+   goto out;
+
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..3ae398b75dcd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+pid_t pid = 0;
+long ret = 0;
+void *user_ptr = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int on_write(void *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.29.2



Re: [lib/strncpy_from_user.c] 00a4ef91e8: BUG:KASAN:slab-out-of-bounds_in_s

2020-11-06 Thread Daniel Xu
On Thu Nov 5, 2020 at 8:32 PM PST, Alexei Starovoitov wrote:
> Daniel,
>
> the kasan complains about the previous version of your patch,
> but your v4 version looks equivalent.
> Could you try to repro this issue?
> The code looks correct, but kasan complain is concerning.
>
> On Thu, Nov 5, 2020 at 5:56 PM kernel test robot 
> wrote:
> >
> > Greeting,
> >
> > FYI, we noticed the following commit (built with clang-12):
> >
> > commit: 00a4ef91e8f5af6edceb9bd4bceed2305f038796 ("[PATCH bpf-next] 
> > lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator")
> > url: 
> > https://github.com/0day-ci/linux/commits/Daniel-Xu/lib-strncpy_from_user-c-Don-t-overcopy-bytes-after-NUL-terminator/20201104-103306
> > base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git master

[...]

I'll take a look, thanks.

Seems like the original email went into my spam. I'll try to fix my spam
filter.


[PATCH bpf v4 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-05 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, 
kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
 lib/strncpy_from_user.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..82a67dde136b 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,22 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask, *out;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
if (has_zero(c, , )) {
data = prep_zero_mask(c, data, );
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   out = (unsigned long *)(dst+res);
+   *out = (*out & ~mask) | (c & mask);
return res + find_zero(data);
}
+
+   *(unsigned long *)(dst+res) = c;
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.28.0



[PATCH bpf v4 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-05 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v3 -> v4:
* directly pass userspace pointer to prog
* test more strings of different length

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 lib/strncpy_from_user.c   |  9 ++-
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 3 files changed, 103 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.28.0



[PATCH bpf v4 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-05 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 71 +++
 .../bpf/progs/test_probe_read_user_str.c  | 25 +++
 2 files changed, 96 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..e419298132b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str1[] = "mestring";
+static const char str2[] = "mestringalittlebigger";
+static const char str3[] = "mestringblubblubblubblubblub";
+
+static int test_one_str(struct test_probe_read_user_str *skel, const char *str,
+   size_t len)
+{
+   int err, duration = 0;
+   char buf[256];
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, len);
+
+   /* Give prog our userspace pointer */
+   skel->bss->user_ptr = buf;
+
+   /* Trigger tracepoint */
+   usleep(1);
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog_ret", "prog returned: %ld\n",
+ skel->bss->ret))
+   return 1;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, len);
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   return 1;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + len, buf, sizeof(buf) - len);
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   return 1;
+
+   return 0;
+}
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int err, duration = 0;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   return;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   if (test_one_str(skel, str1, sizeof(str1)))
+   goto out;
+   if (test_one_str(skel, str2, sizeof(str2)))
+   goto out;
+   if (test_one_str(skel, str3, sizeof(str3)))
+   goto out;
+
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..3ae398b75dcd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+pid_t pid = 0;
+long ret = 0;
+void *user_ptr = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int on_write(void *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.28.0



Re: [PATCH bpf v2 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-05 Thread Daniel Xu
On Thu Nov 5, 2020 at 3:31 PM PST, Song Liu wrote:
>
>
> > On Nov 5, 2020, at 3:22 PM, Daniel Xu  wrote:
> > 
> > On Thu Nov 5, 2020 at 1:32 PM PST, Andrii Nakryiko wrote:
> >> On Wed, Nov 4, 2020 at 8:51 PM Daniel Xu  wrote:
> > [...]
> >>> diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
> >>> b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> >>> new file mode 100644
> >>> index ..41c3e296566e
> >>> --- /dev/null
> >>> +++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> >>> @@ -0,0 +1,34 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +#include 
> >>> +#include 
> >>> +#include 
> >>> +
> >>> +#include 
> >>> +
> >>> +struct sys_enter_write_args {
> >>> +   unsigned long long pad;
> >>> +   int syscall_nr;
> >>> +   int pad1; /* 4 byte hole */
> >> 
> >> I have a hunch that this explicit padding might break on big-endian
> >> architectures?..
> >> 
> >> Can you instead include "vmlinux.h" in this file and use struct
> >> trace_event_raw_sys_enter? you'll just need ctx->args[2] to get that
> >> buffer pointer.
> >> 
> >> Alternatively, and it's probably simpler overall would be to just
> >> provide user-space pointer through global variable:
> >> 
> >> void *user_ptr;
> >> 
> >> 
> >> bpf_probe_read_user_str(buf, ..., user_ptr);
> >> 
> >> From user-space:
> >> 
> >> skel->bss->user_ptr = _userspace_buf;
> >> 
> >> Full control. You can trigger tracepoint with just an usleep(1), for
> >> instance.
> > 
> > Yeah, that sounds better. I'll send a v4 with passing a ptr.
> > 
> > Thanks,
> > Daniel
>
> One more comment, how about we test multiple strings with different
> lengths? In this way, we can catch other alignment issues.

Sure, will do that in v4 also.
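
For reference, the three strings used in the v4+ selftests place the NUL
at different offsets relative to the 8-byte word stride (assuming 64-bit
unsigned long), which is what exercises the different alignments:

	/* "mestring"                     -> NUL at byte 8  (first byte of the 2nd word) */
	/* "mestringalittlebigger"        -> NUL at byte 21 (middle of the 3rd word)     */
	/* "mestringblubblubblubblubblub" -> NUL at byte 28 (middle of the 4th word)     */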


Re: [PATCH bpf v2 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-05 Thread Daniel Xu
On Thu Nov 5, 2020 at 1:32 PM PST, Andrii Nakryiko wrote:
> On Wed, Nov 4, 2020 at 8:51 PM Daniel Xu  wrote:
[...]
> > diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
> > b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> > new file mode 100644
> > index ..41c3e296566e
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> > @@ -0,0 +1,34 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +struct sys_enter_write_args {
> > +   unsigned long long pad;
> > +   int syscall_nr;
> > +   int pad1; /* 4 byte hole */
>
> I have a hunch that this explicit padding might break on big-endian
> architectures?..
>
> Can you instead include "vmlinux.h" in this file and use struct
> trace_event_raw_sys_enter? you'll just need ctx->args[2] to get that
> buffer pointer.
>
> Alternatively, and it's probably simpler overall would be to just
> provide user-space pointer through global variable:
>
> void *user_ptr;
>
>
> bpf_probe_read_user_str(buf, ..., user_ptr);
>
> From user-space:
>
> skel->bss->user_ptr = _userspace_buf;
>
> Full control. You can trigger tracepoint with just an usleep(1), for
> instance.

Yeah, that sounds better. I'll send a v4 with passing a ptr.

Thanks,
Daniel

[...]


[PATCH bpf v3 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-05 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 lib/strncpy_from_user.c   |  9 ++-
 .../bpf/prog_tests/probe_read_user_str.c  | 60 +++
 .../bpf/progs/test_probe_read_user_str.c  | 34 +++
 3 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.28.0



[PATCH bpf v3 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-05 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, 
kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
 lib/strncpy_from_user.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..82a67dde136b 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,22 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask, *out;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
if (has_zero(c, , )) {
data = prep_zero_mask(c, data, );
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   out = (unsigned long *)(dst+res);
+   *out = (*out & ~mask) | (c & mask);
return res + find_zero(data);
}
+
+   *(unsigned long *)(dst+res) = c;
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.28.0



[PATCH bpf v3 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-05 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 60 +++
 .../bpf/progs/test_probe_read_user_str.c  | 34 +++
 2 files changed, 94 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..7c6422901b78
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str[] = "mestring";
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int fd, err, duration = 0;
+   char buf[256];
+   ssize_t n;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   return;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   fd = open("/dev/null", O_WRONLY);
+   if (CHECK(fd < 0, "open", "open /dev/null failed: %d\n", fd))
+   goto out;
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, sizeof(str));
+
+   /* Trigger tracepoint */
+   n = write(fd, buf, sizeof(buf));
+   if (CHECK(n != sizeof(buf), "write", "write failed: %ld\n", n))
+   goto fd_out;
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog_ret", "prog returned: %ld\n",
+ skel->bss->ret))
+   goto fd_out;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, sizeof(str));
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   goto fd_out;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + sizeof(str), buf, sizeof(buf) - 
sizeof(str));
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   goto fd_out;
+
+fd_out:
+   close(fd);
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..5da764a8bf85
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+struct sys_enter_write_args {
+   unsigned long long pad;
+   int syscall_nr;
+   int pad1; /* 4 byte hole */
+   unsigned int fd;
+   int pad2; /* 4 byte hole */
+   const char *buf;
+   size_t count;
+};
+
+pid_t pid = 0;
+long ret = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_write")
+int on_write(struct sys_enter_write_args *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), ctx->buf);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.28.0



Re: [PATCH bpf v2 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-05 Thread Daniel Xu
On Thu Nov 5, 2020 at 10:30 AM PST, Song Liu wrote:
>
>
> > On Nov 4, 2020, at 6:25 PM, Daniel Xu  wrote:
> > 
> > Previously, bpf_probe_read_user_str() could potentially overcopy the
> > trailing bytes after the NUL due to how do_strncpy_from_user() does the
> > copy in long-sized strides. The issue has been fixed in the previous
> > commit.
> > 
> > This commit adds a selftest that ensures we don't regress
> > bpf_probe_read_user_str() again.
> > 
> > Signed-off-by: Daniel Xu 
> > ---
> > .../bpf/prog_tests/probe_read_user_str.c  | 60 +++
> > .../bpf/progs/test_probe_read_user_str.c  | 34 +++
> > 2 files changed, 94 insertions(+)
> > create mode 100644 
> > tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
> > create mode 100644 
> > tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> > 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
> > b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
> > new file mode 100644
> > index ..597a166e6c8d
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
> > @@ -0,0 +1,60 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include 
> > +#include "test_probe_read_user_str.skel.h"
> > +
> > +static const char str[] = "mestring";
> > +
> > +void test_probe_read_user_str(void)
> > +{
> > +   struct test_probe_read_user_str *skel;
> > +   int fd, err, duration = 0;
> > +   char buf[256];
> > +   ssize_t n;
> > +
> > +   skel = test_probe_read_user_str__open_and_load();
> > +   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
> > + "skeleton open and load failed\n"))
> > +   goto out;
>
> nit: we can just return here.
>
> > +
> > +   err = test_probe_read_user_str__attach(skel);
> > +   if (CHECK(err, "test_probe_read_user_str__attach",
> > + "skeleton attach failed: %d\n", err))
> > +   goto out;
> > +
> > +   fd = open("/dev/null", O_WRONLY);
> > +   if (CHECK(fd < 0, "open", "open /dev/null failed: %d\n", fd))
> > +   goto out;
> > +
> > +   /* Give pid to bpf prog so it doesn't read from anyone else */
> > +   skel->bss->pid = getpid();
>
> It is better to set pid before attaching skel.
>
> > +
> > +   /* Ensure bytes after string are ones */
> > +   memset(buf, 1, sizeof(buf));
> > +   memcpy(buf, str, sizeof(str));
> > +
> > +   /* Trigger tracepoint */
> > +   n = write(fd, buf, sizeof(buf));
> > +   if (CHECK(n != sizeof(buf), "write", "write failed: %ld\n", n))
> > +   goto fd_out;
> > +
> > +   /* Did helper fail? */
> > +   if (CHECK(skel->bss->ret < 0, "prog ret", "prog returned: %d\n",
>
> In most cases, we use underscore instead of spaces in the second
> argument
> of CHECK(). IOW, please use "prog_ret" instead of "prog ret".
>
> > + skel->bss->ret))
> > +   goto fd_out;
> > +
> > +   /* Check that string was copied correctly */
> > +   err = memcmp(skel->bss->buf, str, sizeof(str));
> > +   if (CHECK(err, "memcmp", "prog copied wrong string"))
> > +   goto fd_out;
> > +
> > +   /* Now check that no extra trailing bytes were copied */
> > +   memset(buf, 0, sizeof(buf));
> > +   err = memcmp(skel->bss->buf + sizeof(str), buf, sizeof(buf) - 
> > sizeof(str));
> > +   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
> > +   goto fd_out;
> > +
> > +fd_out:
> > +   close(fd);
> > +out:
> > +   test_probe_read_user_str__destroy(skel);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
> > b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> > new file mode 100644
> > index ..41c3e296566e
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
> > @@ -0,0 +1,34 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +struct sys_enter_write_args {
> > +   unsigned long long pad;
> > +   int syscall_nr;
> > +   int pad1; /* 4 byte hole */
> > +   unsigned int fd;
> > +   int pad2; /* 4 byte hole */
> > +   const char *buf;
> > +   size_t count;
> > +};
> > +
> > +pid_t pid = 0;
> > +int ret = 0;
> > +char buf[256] = {};
> > +
> > +SEC("tracepoint/syscalls/sys_enter_write")
> > +int on_write(struct sys_enter_write_args *ctx)
> > +{
> > +   if (pid != (bpf_get_current_pid_tgid() >> 32))
> > +   return 0;
> > +
> > +   ret = bpf_probe_read_user_str(buf, sizeof(buf), ctx->buf);
>
> bpf_probe_read_user_str() returns "long". Let's use "long ret;"

Thanks for review, will send v3 with these changes.

[...]


Re: [PATCH bpf v2 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-05 Thread Daniel Xu
On Thu Nov 5, 2020 at 10:16 AM PST, Song Liu wrote:
>
>
> > On Nov 4, 2020, at 6:25 PM, Daniel Xu  wrote:
> > 
> > do_strncpy_from_user() may copy some extra bytes after the NUL
>
> We have multiple use of "NUL" here, should be "NULL"?
>
> > terminator into the destination buffer. This usually does not matter for
> > normal string operations. However, when BPF programs key BPF maps with
> > strings, this matters a lot.
> > 
> > A BPF program may read strings from user memory by calling the
> > bpf_probe_read_user_str() helper which eventually calls
> > do_strncpy_from_user(). The program can then key a map with the
> > resulting string. BPF map keys are fixed-width and string-agnostic,
> > meaning that map keys are treated as a set of bytes.
> > 
> > The issue is when do_strncpy_from_user() overcopies bytes after the NUL
> > terminator, it can result in seemingly identical strings occupying
> > multiple slots in a BPF map. This behavior is subtle and totally
> > unexpected by the user.
> > 
> > This commit uses the proper word-at-a-time APIs to avoid overcopying.
> > 
> > Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and 
> > probe_read_{user, kernel}_str helpers")
> > Signed-off-by: Daniel Xu 
> > ---
> > lib/strncpy_from_user.c | 9 +++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
> > index e6d5fcc2cdf3..d084189eb05c 100644
> > --- a/lib/strncpy_from_user.c
> > +++ b/lib/strncpy_from_user.c
> > @@ -35,17 +35,22 @@ static inline long do_strncpy_from_user(char *dst, 
> > const char __user *src,
> > goto byte_at_a_time;
> > 
> > while (max >= sizeof(unsigned long)) {
> > -   unsigned long c, data;
> > +   unsigned long c, data, mask, *out;
> > 
> > /* Fall back to byte-at-a-time if we get a page fault */
> > unsafe_get_user(c, (unsigned long __user *)(src+res), 
> > byte_at_a_time);
> > 
> > -   *(unsigned long *)(dst+res) = c;
> > if (has_zero(c, , )) {
> > data = prep_zero_mask(c, data, );
> > data = create_zero_mask(data);
> > +   mask = zero_bytemask(data);
> > +   out = (unsigned long *)(dst+res);
> > +   *out = (*out & ~mask) | (c & mask);
> > return res + find_zero(data);
> > +   } else  {
>
> This else clause is not needed, as we return in the if clause.

Thanks, will change in v3.

[..]


[PATCH bpf v2 1/2] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-04 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Fixes: 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, 
kernel}_str helpers")
Signed-off-by: Daniel Xu 
---
 lib/strncpy_from_user.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..d084189eb05c 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,22 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask, *out;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
if (has_zero(c, , )) {
data = prep_zero_mask(c, data, );
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   out = (unsigned long *)(dst+res);
+   *out = (*out & ~mask) | (c & mask);
return res + find_zero(data);
+   } else  {
+   *(unsigned long *)(dst+res) = c;
}
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.28.0



[PATCH bpf v2 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL

2020-11-04 Thread Daniel Xu
Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.

Signed-off-by: Daniel Xu 
---
 .../bpf/prog_tests/probe_read_user_str.c  | 60 +++
 .../bpf/progs/test_probe_read_user_str.c  | 34 +++
 2 files changed, 94 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c 
b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
new file mode 100644
index ..597a166e6c8d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include "test_probe_read_user_str.skel.h"
+
+static const char str[] = "mestring";
+
+void test_probe_read_user_str(void)
+{
+   struct test_probe_read_user_str *skel;
+   int fd, err, duration = 0;
+   char buf[256];
+   ssize_t n;
+
+   skel = test_probe_read_user_str__open_and_load();
+   if (CHECK(!skel, "test_probe_read_user_str__open_and_load",
+ "skeleton open and load failed\n"))
+   goto out;
+
+   err = test_probe_read_user_str__attach(skel);
+   if (CHECK(err, "test_probe_read_user_str__attach",
+ "skeleton attach failed: %d\n", err))
+   goto out;
+
+   fd = open("/dev/null", O_WRONLY);
+   if (CHECK(fd < 0, "open", "open /dev/null failed: %d\n", fd))
+   goto out;
+
+   /* Give pid to bpf prog so it doesn't read from anyone else */
+   skel->bss->pid = getpid();
+
+   /* Ensure bytes after string are ones */
+   memset(buf, 1, sizeof(buf));
+   memcpy(buf, str, sizeof(str));
+
+   /* Trigger tracepoint */
+   n = write(fd, buf, sizeof(buf));
+   if (CHECK(n != sizeof(buf), "write", "write failed: %ld\n", n))
+   goto fd_out;
+
+   /* Did helper fail? */
+   if (CHECK(skel->bss->ret < 0, "prog ret", "prog returned: %d\n",
+ skel->bss->ret))
+   goto fd_out;
+
+   /* Check that string was copied correctly */
+   err = memcmp(skel->bss->buf, str, sizeof(str));
+   if (CHECK(err, "memcmp", "prog copied wrong string"))
+   goto fd_out;
+
+   /* Now check that no extra trailing bytes were copied */
+   memset(buf, 0, sizeof(buf));
+   err = memcmp(skel->bss->buf + sizeof(str), buf, sizeof(buf) - 
sizeof(str));
+   if (CHECK(err, "memcmp", "trailing bytes were not stripped"))
+   goto fd_out;
+
+fd_out:
+   close(fd);
+out:
+   test_probe_read_user_str__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c 
b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
new file mode 100644
index ..41c3e296566e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_read_user_str.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+
+#include 
+
+struct sys_enter_write_args {
+   unsigned long long pad;
+   int syscall_nr;
+   int pad1; /* 4 byte hole */
+   unsigned int fd;
+   int pad2; /* 4 byte hole */
+   const char *buf;
+   size_t count;
+};
+
+pid_t pid = 0;
+int ret = 0;
+char buf[256] = {};
+
+SEC("tracepoint/syscalls/sys_enter_write")
+int on_write(struct sys_enter_write_args *ctx)
+{
+   if (pid != (bpf_get_current_pid_tgid() >> 32))
+   return 0;
+
+   ret = bpf_probe_read_user_str(buf, sizeof(buf), ctx->buf);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.28.0



[PATCH bpf v2 0/2] Fix bpf_probe_read_user_str() overcopying

2020-11-04 Thread Daniel Xu
6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

Daniel Xu (2):
  lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes
after NUL

 lib/strncpy_from_user.c   |  9 ++-
 .../bpf/prog_tests/probe_read_user_str.c  | 60 +++
 .../bpf/progs/test_probe_read_user_str.c  | 34 +++
 3 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_read_user_str.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_probe_read_user_str.c

-- 
2.28.0



Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-04 Thread Daniel Xu
On Wed Nov 4, 2020 at 2:36 PM PST, Daniel Borkmann wrote:
> On 11/4/20 9:18 PM, Daniel Xu wrote:
> > On Wed Nov 4, 2020 at 8:24 AM PST, Daniel Borkmann wrote:
> >> On 11/4/20 3:29 AM, Daniel Xu wrote:
> >>> do_strncpy_from_user() may copy some extra bytes after the NUL
> >>> terminator into the destination buffer. This usually does not matter for
> >>> normal string operations. However, when BPF programs key BPF maps with
> >>> strings, this matters a lot.
> >>>
> >>> A BPF program may read strings from user memory by calling the
> >>> bpf_probe_read_user_str() helper which eventually calls
> >>> do_strncpy_from_user(). The program can then key a map with the
> >>> resulting string. BPF map keys are fixed-width and string-agnostic,
> >>> meaning that map keys are treated as a set of bytes.
> >>>
> >>> The issue is when do_strncpy_from_user() overcopies bytes after the NUL
> >>> terminator, it can result in seemingly identical strings occupying
> >>> multiple slots in a BPF map. This behavior is subtle and totally
> >>> unexpected by the user.
> >>>
> >>> This commit uses the proper word-at-a-time APIs to avoid overcopying.
> >>>
> >>> Signed-off-by: Daniel Xu 
> >>
> >> It looks like this is a regression from the recent refactoring of the
> >> mem probing
> >> util functions?
> > 
> > I think it was like this from the beginning, at 6ae08ae3dea2 ("bpf: Add
> > probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers").
> > The old bpf_probe_read_str() used the kernel's byte-by-byte copying
> > routine. bpf_probe_read_user_str() started using strncpy_from_user()
> > which has been doing the long-sized strides since ~2012 or earlier.
> > 
> > I tried to build and test the kernel at that commit but it seems my
> > compiler is too new to build that old code. Bunch of build failures.
> > 
> > I assume the refactor you're referring to is 8d92db5c04d1 ("bpf: rework
> > the compat kernel probe handling").
>
> Ah I see, it was just reusing 3d7081822f7f ("uaccess: Add non-pagefault
> user-space
> read functions"). Potentially it might be safer choice to just rework
> the
> strncpy_from_user_nofault() to mimic strncpy_from_kernel_nofault() in
> that
> regard?

I'm a little reluctant to do that b/c it would introduce less efficient,
duplicated code. The word-at-a-time API already has the zero_bytemask()
API so it's clear that it was designed to handle this issue -- we're not
really hacking anything here.

I'll send out a V2 with the selftest shortly. Happy to change things
after that.

Thanks,
Daniel


Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-04 Thread Daniel Xu
Hi Daniel,

On Wed Nov 4, 2020 at 8:24 AM PST, Daniel Borkmann wrote:
> On 11/4/20 3:29 AM, Daniel Xu wrote:
> > do_strncpy_from_user() may copy some extra bytes after the NUL
> > terminator into the destination buffer. This usually does not matter for
> > normal string operations. However, when BPF programs key BPF maps with
> > strings, this matters a lot.
> > 
> > A BPF program may read strings from user memory by calling the
> > bpf_probe_read_user_str() helper which eventually calls
> > do_strncpy_from_user(). The program can then key a map with the
> > resulting string. BPF map keys are fixed-width and string-agnostic,
> > meaning that map keys are treated as a set of bytes.
> > 
> > The issue is when do_strncpy_from_user() overcopies bytes after the NUL
> > terminator, it can result in seemingly identical strings occupying
> > multiple slots in a BPF map. This behavior is subtle and totally
> > unexpected by the user.
> > 
> > This commit uses the proper word-at-a-time APIs to avoid overcopying.
> > 
> > Signed-off-by: Daniel Xu 
>
> It looks like this is a regression from the recent refactoring of the
> mem probing
> util functions?

I think it was like this from the beginning, at 6ae08ae3dea2 ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers").
The old bpf_probe_read_str() used the kernel's byte-by-byte copying
routine. bpf_probe_read_user_str() started using strncpy_from_user()
which has been doing the long-sized strides since ~2012 or earlier.

I tried to build and test the kernel at that commit but it seems my
compiler is too new to build that old code. Bunch of build failures.

I assume the refactor you're referring to is 8d92db5c04d1 ("bpf: rework
the compat kernel probe handling").

> Could we add a Fixes tag and then we'd also need to target the fix
> against bpf tree instead of bpf-next, no?

Sure, will do in v2.

>
> Moreover, a BPF kselftest would help to make sure it doesn't regress in
> future again.

Ditto.

[..]

Thanks,
Daniel


[PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-03 Thread Daniel Xu
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Signed-off-by: Daniel Xu 
---
 lib/strncpy_from_user.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index e6d5fcc2cdf3..d084189eb05c 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -35,17 +35,22 @@ static inline long do_strncpy_from_user(char *dst, const 
char __user *src,
goto byte_at_a_time;
 
while (max >= sizeof(unsigned long)) {
-   unsigned long c, data;
+   unsigned long c, data, mask, *out;
 
/* Fall back to byte-at-a-time if we get a page fault */
unsafe_get_user(c, (unsigned long __user *)(src+res), 
byte_at_a_time);
 
-   *(unsigned long *)(dst+res) = c;
if (has_zero(c, , )) {
data = prep_zero_mask(c, data, );
data = create_zero_mask(data);
+   mask = zero_bytemask(data);
+   out = (unsigned long *)(dst+res);
+   *out = (*out & ~mask) | (c & mask);
return res + find_zero(data);
+   } else  {
+   *(unsigned long *)(dst+res) = c;
}
+
res += sizeof(unsigned long);
max -= sizeof(unsigned long);
}
-- 
2.28.0



Re: [PATCH bpf-next] libbpf: Export bpf_object__load_vmlinux_btf

2020-05-27 Thread Daniel Xu
Hi Andrii,

On Tue May 26, 2020 at 3:09 PM PST, Andrii Nakryiko wrote:
> On Tue, May 26, 2020 at 7:09 PM Daniel Xu  wrote:
> >
> > Right now the libbpf model encourages loading the entire object at once.
> > In this model, libbpf handles loading BTF from vmlinux for us. However,
> > it can be useful to selectively load certain maps and programs inside an
> > object without loading everything else.
>
> There is no way to selectively load or not load a map. All maps are
> created, unless they are reusing map FD or pinned instances. See
> below, I'd like to understand the use case better.
>
> >
> > In the latter model, there was perviously no way to load BTF on-demand.
> > This commit exports the bpf_object__load_vmlinux_btf such that we are
> > able to load BTF on demand.
> >
>
> Let's start with the real problem, not a solution. Do you have
> specific use case where you need bpf_object__load_vmlinux_btf()? It
> might not do anything if none of BPF programs in the object requires
> BTF, because it's very much tightly coupled with loading bpf_object as
> a whole model. I'd like to understand what you are after with this,
> before exposing internal implementation details as an API.

If I try loading a program through the following sequence:

bpf_object__open_file()
bpf_object__find_program_by_name()
bpf_program__load()

And the program requires BTF (tp_btf), I get an unavoidable (to the best
of my knowledge) segfault in the following code path:

bpf_program__load()
  libbpf_find_attach_btf_id()          <-- [0]
    __find_vmlinux_btf_id()
      find_btf_by_prefix_kind()
        btf__find_by_name_kind()       <-- boom (btf->nr_types)

because [0] passes prog->obj->btf_vmlinux which is still null. So the
solution I'm proposing is exporting bpf_object__load_vmlinux_btf() and
calling that on struct bpf_object before performing prog loads.
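
Concretely, the flow I'd like to support looks something like this
(sketch with made-up object/program names, error handling dropped):

	struct bpf_object *obj;
	struct bpf_program *prog;

	obj = bpf_object__open_file("tracer.bpf.o", NULL);

	/* new: make sure obj->btf_vmlinux is populated before any
	 * individual program load needs to resolve its attach BTF id */
	bpf_object__load_vmlinux_btf(obj);

	prog = bpf_object__find_program_by_name(obj, "handle_sched_switch");
	bpf_program__load(prog, "GPL", 0);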

[...]

Thanks,
Daniel


[PATCH bpf-next] libbpf: Export bpf_object__load_vmlinux_btf

2020-05-26 Thread Daniel Xu
Right now the libbpf model encourages loading the entire object at once.
In this model, libbpf handles loading BTF from vmlinux for us. However,
it can be useful to selectively load certain maps and programs inside an
object without loading everything else.

In the latter model, there was previously no way to load BTF on demand.
This commit exports bpf_object__load_vmlinux_btf() so that BTF can be
loaded on demand.

Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c   | 2 +-
 tools/lib/bpf/libbpf.h   | 1 +
 tools/lib/bpf/libbpf.map | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5d60de6fd818..399094b1f580 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2477,7 +2477,7 @@ static inline bool libbpf_prog_needs_vmlinux_btf(struct 
bpf_program *prog)
return false;
 }
 
-static int bpf_object__load_vmlinux_btf(struct bpf_object *obj)
+int bpf_object__load_vmlinux_btf(struct bpf_object *obj)
 {
struct bpf_program *prog;
int err;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 1e2e399a5f2c..6cbd678eb124 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -147,6 +147,7 @@ LIBBPF_API unsigned int bpf_object__kversion(const struct 
bpf_object *obj);
 struct btf;
 LIBBPF_API struct btf *bpf_object__btf(const struct bpf_object *obj);
 LIBBPF_API int bpf_object__btf_fd(const struct bpf_object *obj);
+LIBBPF_API int bpf_object__load_vmlinux_btf(struct bpf_object *obj);
 
 LIBBPF_API struct bpf_program *
 bpf_object__find_program_by_title(const struct bpf_object *obj,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 381a7342ecfc..56415e671c70 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -261,6 +261,7 @@ LIBBPF_0.0.9 {
bpf_iter_create;
bpf_link_get_fd_by_id;
bpf_link_get_next_id;
+   bpf_object__load_vmlinux_btf;
bpf_program__attach_iter;
perf_buffer__consume;
 } LIBBPF_0.0.8;
-- 
2.26.2



Re: [PATCH bpf-next 1/5] perf/core: Add PERF_FORMAT_LOST read_format

2019-09-27 Thread Daniel Xu
Hi Jiri,

On Tue Sep 24, 2019 at 10:33 AM Jiri Olsa wrote:
> On Tue, Sep 17, 2019 at 06:30:52AM -0700, Daniel Xu wrote:
> 
> SNIP
> 
> > +   PERF_FORMAT_MAX = 1U << 5,  /* non-ABI */
> >  };
> >  
> >  #define PERF_ATTR_SIZE_VER064  /* sizeof first published 
> > struct */
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 0463c1151bae..ee08d3ed6299 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1715,6 +1715,9 @@ static void __perf_event_read_size(struct perf_event 
> > *event, int nr_siblings)
> > if (event->attr.read_format & PERF_FORMAT_ID)
> > entry += sizeof(u64);
> >  
> > +   if (event->attr.read_format & PERF_FORMAT_LOST)
> > +   entry += sizeof(u64);
> > +
> > if (event->attr.read_format & PERF_FORMAT_GROUP) {
> > nr += nr_siblings;
> > size += sizeof(u64);
> > @@ -4734,6 +4737,24 @@ u64 perf_event_read_value(struct perf_event *event, 
> > u64 *enabled, u64 *running)
> >  }
> >  EXPORT_SYMBOL_GPL(perf_event_read_value);
> >  
> > +static struct pmu perf_kprobe;
> > +static u64 perf_event_lost(struct perf_event *event)
> > +{
> > +   struct ring_buffer *rb;
> > +   u64 lost = 0;
> > +
> > +   rcu_read_lock();
> > +   rb = rcu_dereference(event->rb);
> > +   if (likely(!!rb))
> > +   lost += local_read(>lost);
> > +   rcu_read_unlock();
> > +
> > +   if (event->attr.type == perf_kprobe.type)
> > +   lost += perf_kprobe_missed(event);
> 
> not sure what was the peterz's suggestion, but here you are mixing
> ring buffer's lost count with kprobes missed count, seems wrong

To be honest, I'm not 100% sure what the correct semantics here should
be. I thought it might be less misleading if we included ring buffer
related misses as well.

Regardless, I am ok with either.

> maybe we could add PERF_FORMAT_KPROBE_MISSED

I think the feedback from the last patchset was that we want to keep
the misses unified.

Peter, do you have any thoughts?

Thanks,
Daniel


Re: [PATCH bpf-next 1/5] perf/core: Add PERF_FORMAT_LOST read_format

2019-09-17 Thread Daniel Xu
On Tue Sep 17, 2019 at 10:32 PM kbuild test robot wrote:
> All errors (new ones prefixed by >>):
> 
>kernel/events/core.c: In function 'perf_event_lost':
> >> kernel/events/core.c:4753:11: error: implicit declaration of function 
> >> 'perf_kprobe_missed'; did you mean 'perf_release'? 
> >> [-Werror=implicit-function-declaration]
>   lost += perf_kprobe_missed(event);
>   ^~
>   perf_release
>cc1: some warnings being treated as errors
> 

Ah forgot the #ifdef for CONFIG_KPROBE_EVENTS. I've applied the fix and
will send it in the next version.


[PATCH bpf-next 2/5] perf/core: Sync perf_event.h to tools

2019-09-17 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 tools/include/uapi/linux/perf_event.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..bd874c7257f0 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -273,6 +273,7 @@ enum {
  *   { u64 time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
  *   { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *   { u64 id;   } && PERF_FORMAT_ID
+ *   { u64 missed;   } && PERF_FORMAT_LOST
  * } && !PERF_FORMAT_GROUP
  *
  * { u64   nr;
@@ -280,6 +281,7 @@ enum {
  *   { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *   { u64 value;
  * { u64   id;   } && PERF_FORMAT_ID
+ * { u64   missed;   } && PERF_FORMAT_LOST
  *   } cntr[nr];
  * } && PERF_FORMAT_GROUP
  * };
@@ -289,8 +291,9 @@ enum perf_event_read_format {
PERF_FORMAT_TOTAL_TIME_RUNNING  = 1U << 1,
PERF_FORMAT_ID  = 1U << 2,
PERF_FORMAT_GROUP   = 1U << 3,
+   PERF_FORMAT_LOST= 1U << 4,
 
-   PERF_FORMAT_MAX = 1U << 4,  /* non-ABI */
+   PERF_FORMAT_MAX = 1U << 5,  /* non-ABI */
 };
 
 #define PERF_ATTR_SIZE_VER064  /* sizeof first published struct */
-- 
2.21.0



[PATCH bpf-next 4/5] libbpf: Set read_format PERF_FORMAT_LOST on kprobe perf fds

2019-09-17 Thread Daniel Xu
There is no way to get the nmissed count from kprobes that are created
using the perf API. The previous commits added read_format support for
this count. Enable it in this commit.

Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1ca0acd1d823..43f45f6d914d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -5031,7 +5031,7 @@ static int determine_uprobe_retprobe_bit(void)
 }
 
 static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name,
-uint64_t offset, int pid)
+uint64_t offset, int pid, uint64_t read_format)
 {
struct perf_event_attr attr = {};
char errmsg[STRERR_BUFSIZE];
@@ -5060,6 +5060,7 @@ static int perf_event_open_probe(bool uprobe, bool 
retprobe, const char *name,
}
attr.size = sizeof(attr);
attr.type = type;
+   attr.read_format = read_format;
attr.config1 = ptr_to_u64(name); /* kprobe_func or uprobe_path */
attr.config2 = offset;   /* kprobe_addr or probe_offset */
 
@@ -5087,7 +5088,8 @@ struct bpf_link *bpf_program__attach_kprobe(struct 
bpf_program *prog,
int pfd, err;
 
pfd = perf_event_open_probe(false /* uprobe */, retprobe, func_name,
-   0 /* offset */, -1 /* pid */);
+   0 /* offset */, -1 /* pid */,
+   PERF_FORMAT_LOST /* read_format */);
if (pfd < 0) {
pr_warning("program '%s': failed to create %s '%s' perf event: 
%s\n",
   bpf_program__title(prog, false),
@@ -5118,7 +5120,8 @@ struct bpf_link *bpf_program__attach_uprobe(struct 
bpf_program *prog,
int pfd, err;
 
pfd = perf_event_open_probe(true /* uprobe */, retprobe,
-   binary_path, func_offset, pid);
+   binary_path, func_offset, pid,
+   0 /* read_format */);
if (pfd < 0) {
pr_warning("program '%s': failed to create %s '%s:0x%zx' perf 
event: %s\n",
   bpf_program__title(prog, false),
-- 
2.21.0



[PATCH bpf-next 5/5] libbpf: Add selftest for PERF_FORMAT_LOST perf read_format

2019-09-17 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 .../selftests/bpf/prog_tests/attach_probe.c   | 32 ++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c 
b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 5ecc267d98b0..3d636cccb6dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -1,6 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
 
+struct perf_read_output {
+   u64 count;
+   u64 lost;
+};
+
 ssize_t get_base_addr() {
size_t start;
char buf[256];
@@ -32,6 +37,8 @@ void test_attach_probe(void)
const char *file = "./test_attach_probe.o";
struct bpf_program *kprobe_prog, *kretprobe_prog;
struct bpf_program *uprobe_prog, *uretprobe_prog;
+   struct perf_read_output kprobe_read_output;
+   const struct bpf_link_fd *kprobe_fd_link;
struct bpf_object *obj;
int err, prog_fd, duration = 0, res;
struct bpf_link *kprobe_link = NULL;
@@ -40,7 +47,8 @@ void test_attach_probe(void)
struct bpf_link *uretprobe_link = NULL;
int results_map_fd;
size_t uprobe_offset;
-   ssize_t base_addr;
+   ssize_t base_addr, nread;
+   int kprobe_fd;
 
base_addr = get_base_addr();
if (CHECK(base_addr < 0, "get_base_addr",
@@ -116,6 +124,28 @@ void test_attach_probe(void)
/* trigger & validate kprobe && kretprobe */
usleep(1);
 
+   kprobe_fd_link = bpf_link__as_fd(kprobe_link);
+   if (CHECK(!kprobe_fd_link, "kprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kprobe_fd = bpf_link_fd__fd(kprobe_fd_link);
+   if (CHECK(kprobe_fd < 0, "kprobe_get_perf_fd",
+   "failed to get perf fd from kprobe link\n"))
+   goto cleanup;
+
+   /* Set to unexpected value so we can check the read(2) did stuff */
+   kprobe_read_output.lost = 1;
+   nread = read(kprobe_fd, _read_output,
+sizeof(kprobe_read_output));
+   if (CHECK(nread != sizeof(kprobe_read_output), "kprobe_perf_read",
+ "failed to read from perf fd\n"))
+   goto cleanup;
+   if (CHECK(kprobe_read_output.lost != 0, "kprobe_lost",
+ "read wrong value from perf fd: %lu\n",
+ kprobe_read_output.lost))
+   goto cleanup;
+
err = bpf_map_lookup_elem(results_map_fd, _idx, );
if (CHECK(err, "get_kprobe_res",
  "failed to get kprobe res: %d\n", err))
-- 
2.21.0



[PATCH bpf-next 1/5] perf/core: Add PERF_FORMAT_LOST read_format

2019-09-17 Thread Daniel Xu
It's useful to know kprobe's nmissed count. For example with tracing
tools, it's important to know when events may have been lost.  debugfs
currently exposes a control file to get this information, but it is not
compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve this information about FD-based probes.

This patch adds a new field to perf's read_format that lets users query
misses. Misses include both misses from the underlying kprobe
infrastructure and misses from ringbuffer infrastructure.

Signed-off-by: Daniel Xu 
---
 include/linux/trace_events.h|  1 +
 include/uapi/linux/perf_event.h |  5 -
 kernel/events/core.c| 39 ++---
 kernel/trace/trace_kprobe.c |  8 +++
 4 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 30a8cdcfd4a4..952520c1240a 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -587,6 +587,7 @@ extern int bpf_get_kprobe_info(const struct perf_event 
*event,
   u32 *fd_type, const char **symbol,
   u64 *probe_offset, u64 *probe_addr,
   bool perf_type_tracepoint);
+extern u64 perf_kprobe_missed(const struct perf_event *event);
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
 extern int  perf_uprobe_init(struct perf_event *event,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..bd874c7257f0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -273,6 +273,7 @@ enum {
  *   { u64 time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
  *   { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *   { u64 id;   } && PERF_FORMAT_ID
+ *   { u64 missed;   } && PERF_FORMAT_LOST
  * } && !PERF_FORMAT_GROUP
  *
  * { u64   nr;
@@ -280,6 +281,7 @@ enum {
  *   { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *   { u64 value;
  * { u64   id;   } && PERF_FORMAT_ID
+ * { u64   missed;   } && PERF_FORMAT_LOST
  *   } cntr[nr];
  * } && PERF_FORMAT_GROUP
  * };
@@ -289,8 +291,9 @@ enum perf_event_read_format {
PERF_FORMAT_TOTAL_TIME_RUNNING  = 1U << 1,
PERF_FORMAT_ID  = 1U << 2,
PERF_FORMAT_GROUP   = 1U << 3,
+   PERF_FORMAT_LOST= 1U << 4,
 
-   PERF_FORMAT_MAX = 1U << 4,  /* non-ABI */
+   PERF_FORMAT_MAX = 1U << 5,  /* non-ABI */
 };
 
 #define PERF_ATTR_SIZE_VER064  /* sizeof first published struct */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..ee08d3ed6299 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1715,6 +1715,9 @@ static void __perf_event_read_size(struct perf_event 
*event, int nr_siblings)
if (event->attr.read_format & PERF_FORMAT_ID)
entry += sizeof(u64);
 
+   if (event->attr.read_format & PERF_FORMAT_LOST)
+   entry += sizeof(u64);
+
if (event->attr.read_format & PERF_FORMAT_GROUP) {
nr += nr_siblings;
size += sizeof(u64);
@@ -4734,6 +4737,24 @@ u64 perf_event_read_value(struct perf_event *event, u64 
*enabled, u64 *running)
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
+static struct pmu perf_kprobe;
+static u64 perf_event_lost(struct perf_event *event)
+{
+   struct ring_buffer *rb;
+   u64 lost = 0;
+
+   rcu_read_lock();
+   rb = rcu_dereference(event->rb);
+   if (likely(!!rb))
+   lost += local_read(>lost);
+   rcu_read_unlock();
+
+   if (event->attr.type == perf_kprobe.type)
+   lost += perf_kprobe_missed(event);
+
+   return lost;
+}
+
 static int __perf_read_group_add(struct perf_event *leader,
u64 read_format, u64 *values)
 {
@@ -4770,11 +4791,15 @@ static int __perf_read_group_add(struct perf_event 
*leader,
values[n++] += perf_event_count(leader);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(leader);
+   if (read_format & PERF_FORMAT_LOST)
+   values[n++] = perf_event_lost(leader);
 
for_each_sibling_event(sub, leader) {
values[n++] += perf_event_count(sub);
if (read_format & PERF_FORMAT_ID)
values[n++] = primary_event_id(sub);
+   if (read_format & PERF_FORMAT_LOST)
+   values[n

[PATCH bpf-next 3/5] libbpf: Add helpers to extract perf fd from bpf_link

2019-09-17 Thread Daniel Xu
It is sometimes necessary to perform operations on the underlying perf fd.
There is not currently a way to extract the fd given a bpf_link, so add
a pair of casting and getting helpers.

The casting and getting helpers are nice because they let us define
broad categories of links, which makes it clear to users what they can
expect to extract from each type of link.
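
For example, a user holding a bpf_link from bpf_program__attach_kprobe()
can now get at the perf fd and read its counters (sketch; assumes
PERF_FORMAT_LOST was set on the event, as done in the next patch):

	const struct bpf_link_fd *fd_link;
	__u64 vals[2];	/* { value, lost } for !PERF_FORMAT_GROUP */
	int pfd;

	fd_link = bpf_link__as_fd(link);
	if (!fd_link)		/* not an fd-backed link */
		return -EINVAL;

	pfd = bpf_link_fd__fd(fd_link);
	if (read(pfd, vals, sizeof(vals)) != sizeof(vals))
		return -errno;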

Acked-by: Song Liu 
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c   | 21 +
 tools/lib/bpf/libbpf.h   | 13 +
 tools/lib/bpf/libbpf.map |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e0276520171b..1ca0acd1d823 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4875,6 +4875,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
 
 struct bpf_link {
int (*destroy)(struct bpf_link *link);
+   enum bpf_link_type type;
 };
 
 int bpf_link__destroy(struct bpf_link *link)
@@ -4908,6 +4909,24 @@ static int bpf_link__destroy_perf_event(struct bpf_link 
*link)
return err;
 }
 
+const struct bpf_link_fd *bpf_link__as_fd(const struct bpf_link *link)
+{
+   if (link->type != LIBBPF_LINK_FD)
+   return NULL;
+
+   return (struct bpf_link_fd *)link;
+}
+
+enum bpf_link_type bpf_link__type(const struct bpf_link *link)
+{
+   return link->type;
+}
+
+int bpf_link_fd__fd(const struct bpf_link_fd *link)
+{
+   return link->fd;
+}
+
 struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
int pfd)
 {
@@ -4931,6 +4950,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct 
bpf_program *prog,
if (!link)
return ERR_PTR(-ENOMEM);
link->link.destroy = _link__destroy_perf_event;
+   link->link.type = LIBBPF_LINK_FD;
link->fd = pfd;
 
if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
@@ -5224,6 +5244,7 @@ struct bpf_link 
*bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
link = malloc(sizeof(*link));
if (!link)
return ERR_PTR(-ENOMEM);
+   link->link.type = LIBBPF_LINK_FD;
link->link.destroy = _link__destroy_fd;
 
pfd = bpf_raw_tracepoint_open(tp_name, prog_fd);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e8f70977d137..2ddef5315ff9 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -166,7 +166,20 @@ LIBBPF_API int bpf_program__pin(struct bpf_program *prog, 
const char *path);
 LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char *path);
 LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
 
+enum bpf_link_type {
+   LIBBPF_LINK_FD,
+};
+
 struct bpf_link;
+struct bpf_link_fd;
+
+/* casting APIs */
+LIBBPF_API const struct bpf_link_fd *
+bpf_link__as_fd(const struct bpf_link *link);
+
+/* getters APIs */
+LIBBPF_API enum bpf_link_type bpf_link__type(const struct bpf_link *link);
+LIBBPF_API int bpf_link_fd__fd(const struct bpf_link_fd *link);
 
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index d04c7cb623ed..216713b9eef6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -189,4 +189,7 @@ LIBBPF_0.0.4 {
 LIBBPF_0.0.5 {
global:
bpf_btf_get_next_id;
+   bpf_link__type;
+   bpf_link__as_fd;
+   bpf_link_fd__fd;
 } LIBBPF_0.0.4;
-- 
2.21.0



[PATCH bpf-next 0/5] Add PERF_FORMAT_LOST read_format

2019-09-17 Thread Daniel Xu
It's useful to know kprobe's nmissed count. For example with tracing
tools, it's important to know when events may have been lost.  debugfs
currently exposes a control file to get this information, but it is not
compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve this information about FD-based probes.

This patch adds a new field to perf's read_format that lets users query
misses. Misses include both misses from the underlying kprobe
infrastructure and misses from ringbuffer infrastructure.
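
From userspace the flow is then just a read(2) on the event fd, roughly
(sketch; kprobe_pmu_type stands in for the dynamic PMU type read from
/sys/bus/event_source/devices/kprobe/type, error handling omitted):

	struct perf_event_attr attr = {};
	struct { __u64 value; __u64 lost; } out;
	int fd;

	attr.size = sizeof(attr);
	attr.type = kprobe_pmu_type;
	attr.read_format = PERF_FORMAT_LOST;
	attr.kprobe_func = (__u64)(unsigned long)"do_sys_open";

	fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
		     -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);

	/* with !PERF_FORMAT_GROUP the layout is { u64 value; u64 lost; } */
	read(fd, &out, sizeof(out));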

I studied the various code paths for perf software events and perf
tracepoints and it does not look like anything can "miss" in the same
way kprobes can. They can all, however, suffer from ringbuffer misses.
It's possible I missed something while reading the code so please let
me know if I am mistaken.

Daniel Xu (5):
  perf/core: Add PERF_FORMAT_LOST read_format
  perf/core: Sync perf_event.h to tools
  libbpf: Add helpers to extract perf fd from bpf_link
  libbpf: Set read_format PERF_FORMAT_LOST on kprobe perf fds
  libbpf: Add selftest for PERF_FORMAT_LOST perf read_format

 include/linux/trace_events.h  |  1 +
 include/uapi/linux/perf_event.h   |  5 ++-
 kernel/events/core.c  | 39 +--
 kernel/trace/trace_kprobe.c   |  8 
 tools/include/uapi/linux/perf_event.h |  5 ++-
 tools/lib/bpf/libbpf.c| 30 --
 tools/lib/bpf/libbpf.h| 13 +++
 tools/lib/bpf/libbpf.map  |  3 ++
 .../selftests/bpf/prog_tests/attach_probe.c   | 32 ++-
 9 files changed, 127 insertions(+), 9 deletions(-)

-- 
2.21.0



Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-22 Thread Daniel Xu
On Thu Aug 22, 2019 at 11:05 AM Peter Zijlstra wrote:
> On Thu, Aug 22, 2019 at 07:54:16AM +, Song Liu wrote:
> > Hi Peter, 
> > 
> > > On Aug 22, 2019, at 12:47 AM, Peter Zijlstra  wrote:
> > > 
> > > On Wed, Aug 21, 2019 at 06:43:49PM +, Yonghong Song wrote:
> > >> On 8/21/19 11:31 AM, Peter Zijlstra wrote:
> > > 
> > >>> So extending PERF_RECORD_LOST doesn't work. But PERF_FORMAT_LOST might
> > >>> still work fine; but you get to implement it for all software events.
> > >> 
> > >> Could you give more specifics about PERF_FORMAT_LOST? Googling 
> > >> "PERF_FORMAT_LOST" only yields two emails which we are discussing here 
> > >> :-(
> > > 
> > > Look at what the other PERF_FORMAT_ flags do? Basically it is adding a
> > > field to the read(2) output.
> > 
> > Do we need to implement PERF_FORMAT_LOST for all software events? If user
> > space asks for PERF_FORMAT_LOST for events that do not support it, can we
> > just fail sys_perf_event_open()?
> 
> It really shouldn't be hard; and I'm failing to see why kprobes are
> special.

Thanks for the feedback, everyone. Really appreciate it.

I will look into extending read_format. I'll submit another patch series
after I get the code to work.

Daniel


[PATCH v4 bpf-next 2/4] libbpf: Add helpers to extract perf fd from bpf_link

2019-08-20 Thread Daniel Xu
It is sometimes necessary to perform ioctls on the underlying perf fd.
There is not currently a way to extract the fd given a bpf_link, so add
a pair of casting and getting helpers.

The casting and getting helpers are nice because they let us define
broad categories of links, which makes it clear to users what they can
expect to extract from each type of link.

Acked-by: Song Liu 
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c   | 21 +
 tools/lib/bpf/libbpf.h   | 13 +
 tools/lib/bpf/libbpf.map |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2233f919dd88..41588e13be2b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4876,6 +4876,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
 
 struct bpf_link {
int (*destroy)(struct bpf_link *link);
+   enum bpf_link_type type;
 };
 
 int bpf_link__destroy(struct bpf_link *link)
@@ -4909,6 +4910,24 @@ static int bpf_link__destroy_perf_event(struct bpf_link 
*link)
return err;
 }
 
+const struct bpf_link_fd *bpf_link__as_fd(const struct bpf_link *link)
+{
+   if (link->type != LIBBPF_LINK_FD)
+   return NULL;
+
+   return (struct bpf_link_fd *)link;
+}
+
+enum bpf_link_type bpf_link__type(const struct bpf_link *link)
+{
+   return link->type;
+}
+
+int bpf_link_fd__fd(const struct bpf_link_fd *link)
+{
+   return link->fd;
+}
+
 struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
int pfd)
 {
@@ -4932,6 +4951,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct 
bpf_program *prog,
if (!link)
return ERR_PTR(-ENOMEM);
link->link.destroy = _link__destroy_perf_event;
+   link->link.type = LIBBPF_LINK_FD;
link->fd = pfd;
 
if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
@@ -5225,6 +5245,7 @@ struct bpf_link 
*bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
link = malloc(sizeof(*link));
if (!link)
return ERR_PTR(-ENOMEM);
+   link->link.type = LIBBPF_LINK_FD;
link->link.destroy = _link__destroy_fd;
 
pfd = bpf_raw_tracepoint_open(tp_name, prog_fd);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e8f70977d137..2ddef5315ff9 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -166,7 +166,20 @@ LIBBPF_API int bpf_program__pin(struct bpf_program *prog, 
const char *path);
 LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char *path);
 LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
 
+enum bpf_link_type {
+   LIBBPF_LINK_FD,
+};
+
 struct bpf_link;
+struct bpf_link_fd;
+
+/* casting APIs */
+LIBBPF_API const struct bpf_link_fd *
+bpf_link__as_fd(const struct bpf_link *link);
+
+/* getters APIs */
+LIBBPF_API enum bpf_link_type bpf_link__type(const struct bpf_link *link);
+LIBBPF_API int bpf_link_fd__fd(const struct bpf_link_fd *link);
 
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 664ce8e7a60e..ed169579896f 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -188,4 +188,7 @@ LIBBPF_0.0.4 {
 LIBBPF_0.0.5 {
global:
bpf_btf_get_next_id;
+   bpf_link__type;
+   bpf_link__as_fd;
+   bpf_link_fd__fd;
 } LIBBPF_0.0.4;
-- 
2.21.0



[PATCH v4 bpf-next 4/4] tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

2019-08-20 Thread Daniel Xu
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 .../selftests/bpf/prog_tests/attach_probe.c   | 115 ++
 1 file changed, 115 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c 
b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 5ecc267d98b0..414d5f4f7ccd 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -27,17 +27,27 @@ void test_attach_probe(void)
const char *kretprobe_name = "kretprobe/sys_nanosleep";
const char *uprobe_name = "uprobe/trigger_func";
const char *uretprobe_name = "uretprobe/trigger_func";
+   struct perf_event_query_probe kprobe_query = {};
+   struct perf_event_query_probe kretprobe_query = {};
+   struct perf_event_query_probe uprobe_query = {};
+   struct perf_event_query_probe uretprobe_query = {};
const int kprobe_idx = 0, kretprobe_idx = 1;
const int uprobe_idx = 2, uretprobe_idx = 3;
const char *file = "./test_attach_probe.o";
struct bpf_program *kprobe_prog, *kretprobe_prog;
struct bpf_program *uprobe_prog, *uretprobe_prog;
struct bpf_object *obj;
+   const struct bpf_link_fd *kprobe_fd_link;
+   const struct bpf_link_fd *kretprobe_fd_link;
+   const struct bpf_link_fd *uprobe_fd_link;
+   const struct bpf_link_fd *uretprobe_fd_link;
int err, prog_fd, duration = 0, res;
struct bpf_link *kprobe_link = NULL;
struct bpf_link *kretprobe_link = NULL;
struct bpf_link *uprobe_link = NULL;
struct bpf_link *uretprobe_link = NULL;
+   int kprobe_fd, kretprobe_fd;
+   int uprobe_fd, uretprobe_fd;
int results_map_fd;
size_t uprobe_offset;
ssize_t base_addr;
@@ -116,6 +126,63 @@ void test_attach_probe(void)
/* trigger & validate kprobe && kretprobe */
usleep(1);
 
+   kprobe_fd_link = bpf_link__as_fd(kprobe_link);
+   if (CHECK(!kprobe_fd_link, "kprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kprobe_fd = bpf_link_fd__fd(kprobe_fd_link);
+   if (CHECK(kprobe_fd < 0, "kprobe_get_perf_fd",
+   "failed to get perf fd from kprobe link\n"))
+   goto cleanup;
+
+   kretprobe_fd_link = bpf_link__as_fd(kretprobe_link);
+   if (CHECK(!kretprobe_fd_link, "kretprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kretprobe_fd = bpf_link_fd__fd(kretprobe_fd_link);
+   if (CHECK(kretprobe_fd < 0, "kretprobe_get_perf_fd",
+   "failed to get perf fd from kretprobe link\n"))
+   goto cleanup;
+
+   kprobe_query.size = sizeof(kprobe_query);
+   err = ioctl(kprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, _query);
+   if (CHECK(err, "get_kprobe_ioctl",
+ "failed to issue kprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kprobe_query.size != sizeof(kprobe_query), "get_kprobe_ioctl",
+ "read incorrect size from kprobe_ioctl: %llu\n",
+ kprobe_query.size))
+   goto cleanup;
+   if (CHECK(kprobe_query.nmissed > 0, "get_kprobe_ioctl",
+ "read incorrect nmissed from kprobe_ioctl: %llu\n",
+ kprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kprobe_query.nhit == 0, "get_kprobe_ioctl",
+ "read incorrect nhit from kprobe_ioctl: %llu\n",
+ kprobe_query.nhit))
+   goto cleanup;
+
+   kretprobe_query.size = sizeof(kretprobe_query);
+   err = ioctl(kretprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, _query);
+   if (CHECK(err, "get_kretprobe_ioctl",
+ "failed to issue kretprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kretprobe_query.size != sizeof(kretprobe_query),
+ "get_kretprobe_ioctl",
+ "read incorrect size from kretprobe_ioctl: %llu\n",
+ kretprobe_query.size))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nmissed > 0, "get_kretprobe_ioctl",
+ "read incorrect nmissed from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nhit <= 0, "get_kretprobe_ioctl",
+ "read incorrect nhit from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nhit))
+   goto cleanup;
+
err = bpf_map_lookup_elem(results_map_fd, _idx, );
if (CHECK(err, "get_kprobe_res",
  

[PATCH v4 bpf-next 3/4] tracing/probe: Sync perf_event.h to tools

2019-08-20 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 tools/include/uapi/linux/perf_event.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..8783d29a807a 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -447,6 +447,28 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Size of structure for forward/backward compatibility
+*/
+   __u64   size;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +484,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
-- 
2.21.0



[PATCH v4 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-20 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.
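
From userspace the query looks like this (sketch; perf_fd is the fd
returned by perf_event_open(2) for the [uk]probe):

	struct perf_event_query_probe query = {};

	query.size = sizeof(query);
	if (ioctl(perf_fd, PERF_EVENT_IOC_QUERY_PROBE, &query))
		return -errno;

	printf("nhit=%llu nmissed=%llu\n",
	       (unsigned long long)query.nhit,
	       (unsigned long long)query.nmissed);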

Signed-off-by: Daniel Xu 
---
 include/linux/trace_events.h|  2 ++
 include/uapi/linux/perf_event.h | 23 +++
 kernel/events/core.c| 20 
 kernel/trace/trace_kprobe.c | 25 +
 kernel/trace/trace_uprobe.c | 25 +
 5 files changed, 95 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5150436783e8..96f3cf2e39b4 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -586,6 +586,7 @@ extern int bpf_get_kprobe_info(const struct perf_event 
*event,
   u32 *fd_type, const char **symbol,
   u64 *probe_offset, u64 *probe_addr,
   bool perf_type_tracepoint);
+extern int perf_kprobe_event_query(struct perf_event *event, void __user 
*info);
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
 extern int  perf_uprobe_init(struct perf_event *event,
@@ -594,6 +595,7 @@ extern void perf_uprobe_destroy(struct perf_event *event);
 extern int bpf_get_uprobe_info(const struct perf_event *event,
   u32 *fd_type, const char **filename,
   u64 *probe_offset, bool perf_type_tracepoint);
+extern int perf_uprobe_event_query(struct perf_event *event, void __user 
*info);
 #endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..8783d29a807a 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -447,6 +447,28 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Size of structure for forward/backward compatibility
+*/
+   __u64   size;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +484,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..ed33d50511a3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5060,6 +5060,8 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg);
 static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
  struct perf_event_attr *attr);
+static int perf_probe_event_query(struct perf_event *event,
+   void __user *info);
 
 static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned 
long arg)
 {
@@ -5143,6 +5145,10 @@ static long _perf_ioctl(struct perf_event *event, 
unsigned int cmd, unsigned lon
 
return perf_event_modify_attr(event,  _attr);
}
+#if defined(CONFIG_KPROBE_EVENTS) || defined(CONFIG_UPROBE_EVENTS)
+   case PERF_EVENT_IOC_QUERY_PROBE:
+   return perf_probe_event_query(event, (void __user *)arg);
+#endif
default:
return -ENOTTY;
}
@@ -8833,6 +8839,20 @@ static inline void perf_tp_register(void)
 #endif
 }
 
+static int perf_probe_event_query(struct perf_event *event,
+   void __user *info)
+{
+#ifdef CONFIG_KPROBE_EVENTS
+   if (event->attr.type == perf_kprobe.type)
+

[PATCH v4 bpf-next 0/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE

2019-08-20 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.

v3 -> v4:
- Make kernel code set size field on ioctl arg
- Update selftests to check size field
- Remove unnecessary function stubs

v2 -> v3:
- Introduce bpf_link_type and associated getter to track underlying link
  types
- Add back size field in perf_event_query_probe for forward/backwards
  compat
- Remove NULL checks, fix typos

v1 -> v2:
- More descriptive cover letter
- Make API more generic and support uprobes as well
- Use casters/getters for libbpf instead of single getter
- Fix typos
- Remove size field from ioctl struct
- Split out libbpf.h sync to tools dir to separate commit

Daniel Xu (4):
  tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl
  libbpf: Add helpers to extract perf fd from bpf_link
  tracing/probe: Sync perf_event.h to tools
  tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

 include/linux/trace_events.h  |   2 +
 include/uapi/linux/perf_event.h   |  23 
 kernel/events/core.c  |  20 +++
 kernel/trace/trace_kprobe.c   |  25 
 kernel/trace/trace_uprobe.c   |  25 
 tools/include/uapi/linux/perf_event.h |  23 
 tools/lib/bpf/libbpf.c|  21 
 tools/lib/bpf/libbpf.h|  13 ++
 tools/lib/bpf/libbpf.map  |   3 +
 .../selftests/bpf/prog_tests/attach_probe.c   | 115 ++
 10 files changed, 270 insertions(+)

-- 
2.21.0



Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-20 Thread Daniel Xu
Hi Peter,

On Tue Aug 20, 2019 at 4:45 PM Peter Zijlstra wrote:
> On Fri, Aug 16, 2019 at 03:31:46PM -0700, Daniel Xu wrote:
> > It's useful to know [uk]probe's nmissed and nhit stats. For example with
> > tracing tools, it's important to know when events may have been lost.
> > debugfs currently exposes a control file to get this information, but
> > it is not compatible with probes registered with the perf API.
> 
> What is this nmissed and nhit stuff?

nmissed is the number of times the probe's handler should have run but
didn't. nhit is the number of times the probe's handler has run. I've
documented this information in the uapi header. If you'd like, I can put
it in the commit message too.

Daniel


Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-19 Thread Daniel Xu
On Mon Aug 19, 2019 at 6:26 PM Alexei Starovoitov wrote:
> On Fri, Aug 16, 2019 at 3:33 PM Daniel Xu  wrote:
> >
> > It's useful to know [uk]probe's nmissed and nhit stats. For example with
> > tracing tools, it's important to know when events may have been lost.
> > debugfs currently exposes a control file to get this information, but
> > it is not compatible with probes registered with the perf API.
> >
> > While bpf programs may be able to manually count nhit, there is no way
> > to gather nmissed. In other words, it is currently not possible to
> > retrieve information about FD-based probes.
> >
> > This patch adds a new ioctl that lets users query nmissed (as well as
> > nhit for completeness). We currently only add support for [uk]probes
> > but leave the possibility open for other probes like tracepoint.
> >
> > Signed-off-by: Daniel Xu 
> ...
> > +int perf_kprobe_event_query(struct perf_event *event, void __user *info)
> > +{
> > +   struct perf_event_query_probe __user *uquery = info;
> > +   struct perf_event_query_probe query = {};
> > +   struct trace_event_call *call = event->tp_event;
> > +   struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> > +   u64 ncopy;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EPERM;
> > +   if (copy_from_user(, uquery,
> > +  offsetofend(struct perf_event_query_probe, 
> > size)))
> > +   return -EFAULT;
> > +
> > +   ncopy = min_t(u64, query.size, sizeof(query));
> > +   query.nhit = trace_kprobe_nhit(tk);
> > +   query.nmissed = tk->rp.kp.nmissed;
> > +
> > +   if (copy_to_user(uquery, , ncopy))
> > +   return -EFAULT;
> 
> shouldn't kernel update query.size before copying back?
> Otherwise how user space would know which fields
> were populated?

Ah yes, sorry. Will add that.
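
Roughly like this, I think (sketch of the v4 change on top of the
perf_kprobe_event_query() above):

	ncopy = min_t(u64, query.size, sizeof(query));
	query.size = ncopy;	/* tell userspace how much was filled in */
	query.nhit = trace_kprobe_nhit(tk);
	query.nmissed = tk->rp.kp.nmissed;

	if (copy_to_user(uquery, &query, ncopy))
		return -EFAULT;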


Re: [PATCH v3 bpf-next 2/4] libbpf: Add helpers to extract perf fd from bpf_link

2019-08-19 Thread Daniel Xu
On Mon, Aug 19, 2019, at 10:45 AM, Andrii Nakryiko wrote:
> On Fri, Aug 16, 2019 at 3:32 PM Daniel Xu  wrote:
> >
> > It is sometimes necessary to perform ioctl's on the underlying perf fd.
> > There is not currently a way to extract the fd given a bpf_link, so add a
> > a pair of casting and getting helpers.
> >
> > The casting and getting helpers are nice because they let us define
> > broad categories of links that makes it clear to users what they can
> > expect to extract from what type of link.
> >
> > Acked-by: Song Liu 
> > Signed-off-by: Daniel Xu 
> > ---
> 
> This looks great, thanks a lot!
> 
> I think you might have a conflict with dadb81d0afe7 ("libbpf: make
> libbpf.map source of truth for libbpf version") in libbpf.map, so you
> might need to pull, rebase and re-post rebased version. But in any
> case:
> 

The patchset is already rebased on top :). Thanks for the tip.

Daniel


[PATCH v3 bpf-next 3/4] tracing/probe: Sync perf_event.h to tools

2019-08-16 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 tools/include/uapi/linux/perf_event.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..8783d29a807a 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -447,6 +447,28 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Size of structure for forward/backward compatibility
+*/
+   __u64   size;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +484,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
-- 
2.20.1



[PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-16 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.

Signed-off-by: Daniel Xu 
---
 include/linux/trace_events.h| 12 
 include/uapi/linux/perf_event.h | 23 +++
 kernel/events/core.c| 20 
 kernel/trace/trace_kprobe.c | 24 
 kernel/trace/trace_uprobe.c | 24 
 5 files changed, 103 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5150436783e8..61558f19696a 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -586,6 +586,12 @@ extern int bpf_get_kprobe_info(const struct perf_event 
*event,
   u32 *fd_type, const char **symbol,
   u64 *probe_offset, u64 *probe_addr,
   bool perf_type_tracepoint);
+extern int perf_kprobe_event_query(struct perf_event *event, void __user 
*info);
+#else
+int perf_kprobe_event_query(struct perf_event *event, void __user *info)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
 extern int  perf_uprobe_init(struct perf_event *event,
@@ -594,6 +600,12 @@ extern void perf_uprobe_destroy(struct perf_event *event);
 extern int bpf_get_uprobe_info(const struct perf_event *event,
   u32 *fd_type, const char **filename,
   u64 *probe_offset, bool perf_type_tracepoint);
+extern int perf_uprobe_event_query(struct perf_event *event, void __user 
*info);
+#else
+int perf_uprobe_event_query(struct perf_event *event, void __user *info)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..8783d29a807a 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -447,6 +447,28 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Size of structure for forward/backward compatibility
+*/
+   __u64   size;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +484,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..ed33d50511a3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5060,6 +5060,8 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg);
 static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
  struct perf_event_attr *attr);
+static int perf_probe_event_query(struct perf_event *event,
+   void __user *info);
 
 static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned 
long arg)
 {
@@ -5143,6 +5145,10 @@ static long _perf_ioctl(struct perf_event *event, 
unsigned int cmd, unsigned lon
 
return perf_event_modify_attr(event,  _attr);
}
+#if defined(CONFIG_KPROBE_EVENTS) || defined(CONFIG_UPROBE_EVENTS)
+   case PERF_EVENT_IOC_QUERY_PROBE:
+   return perf_probe_event_query(event, (void __user *)arg);
+#endif
default:
return -ENOTTY;
}
@@ -8833,6 +8839,20 @@ static inline void perf_tp_registe

[PATCH v3 bpf-next 2/4] libbpf: Add helpers to extract perf fd from bpf_link

2019-08-16 Thread Daniel Xu
It is sometimes necessary to perform ioctls on the underlying perf fd.
There is not currently a way to extract the fd given a bpf_link, so add
a pair of casting and getting helpers.

The casting and getting helpers are nice because they let us define
broad categories of links, which makes it clear to users what they can
expect to extract from each type of link.

Acked-by: Song Liu 
Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c   | 21 +
 tools/lib/bpf/libbpf.h   | 13 +
 tools/lib/bpf/libbpf.map |  4 
 3 files changed, 38 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2233f919dd88..41588e13be2b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4876,6 +4876,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
 
 struct bpf_link {
int (*destroy)(struct bpf_link *link);
+   enum bpf_link_type type;
 };
 
 int bpf_link__destroy(struct bpf_link *link)
@@ -4909,6 +4910,24 @@ static int bpf_link__destroy_perf_event(struct bpf_link 
*link)
return err;
 }
 
+const struct bpf_link_fd *bpf_link__as_fd(const struct bpf_link *link)
+{
+   if (link->type != LIBBPF_LINK_FD)
+   return NULL;
+
+   return (struct bpf_link_fd *)link;
+}
+
+enum bpf_link_type bpf_link__type(const struct bpf_link *link)
+{
+   return link->type;
+}
+
+int bpf_link_fd__fd(const struct bpf_link_fd *link)
+{
+   return link->fd;
+}
+
 struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
int pfd)
 {
@@ -4932,6 +4951,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct 
bpf_program *prog,
if (!link)
return ERR_PTR(-ENOMEM);
link->link.destroy = _link__destroy_perf_event;
+   link->link.type = LIBBPF_LINK_FD;
link->fd = pfd;
 
if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
@@ -5225,6 +5245,7 @@ struct bpf_link 
*bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
link = malloc(sizeof(*link));
if (!link)
return ERR_PTR(-ENOMEM);
+   link->link.type = LIBBPF_LINK_FD;
link->link.destroy = _link__destroy_fd;
 
pfd = bpf_raw_tracepoint_open(tp_name, prog_fd);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e8f70977d137..2ddef5315ff9 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -166,7 +166,20 @@ LIBBPF_API int bpf_program__pin(struct bpf_program *prog, 
const char *path);
 LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char *path);
 LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
 
+enum bpf_link_type {
+   LIBBPF_LINK_FD,
+};
+
 struct bpf_link;
+struct bpf_link_fd;
+
+/* casting APIs */
+LIBBPF_API const struct bpf_link_fd *
+bpf_link__as_fd(const struct bpf_link *link);
+
+/* getters APIs */
+LIBBPF_API enum bpf_link_type bpf_link__type(const struct bpf_link *link);
+LIBBPF_API int bpf_link_fd__fd(const struct bpf_link_fd *link);
 
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 4e72df8e98ba..ee9945177100 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -186,4 +186,8 @@ LIBBPF_0.0.4 {
 } LIBBPF_0.0.3;
 
 LIBBPF_0.0.5 {
+   global:
+   bpf_link__type;
+   bpf_link__as_fd;
+   bpf_link_fd__fd;
 } LIBBPF_0.0.4;
-- 
2.20.1



[PATCH v3 bpf-next 0/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE

2019-08-16 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.
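
For example, querying the stats from userspace looks roughly like the
sketch below (consistent with the selftest in patch 4/4; perf_fd is
assumed to be the perf event fd backing the probe):

	struct perf_event_query_probe query = {};

	query.size = sizeof(query);
	if (ioctl(perf_fd, PERF_EVENT_IOC_QUERY_PROBE, &query))
		perror("PERF_EVENT_IOC_QUERY_PROBE");	/* old kernel, or not a [uk]probe event */
	else
		printf("hit: %llu, missed: %llu\n", query.nhit, query.nmissed);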

v2 -> v3:
- Introduce bpf_link_type and associated getter to track underlying link
  types
- Add back size field in perf_event_query_probe for forward/backwards
  compat
- Remove NULL checks, fix typos

v1 -> v2:
- More descriptive cover letter
- Make API more generic and support uprobes as well
- Use casters/getters for libbpf instead of single getter
- Fix typos
- Remove size field from ioctl struct
- Split out libbpf.h sync to tools dir to separate commit

Daniel Xu (4):
  tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl
  libbpf: Add helpers to extract perf fd from bpf_link
  tracing/probe: Sync perf_event.h to tools
  tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

 include/linux/trace_events.h  |  12 ++
 include/uapi/linux/perf_event.h   |  23 
 kernel/events/core.c  |  20 
 kernel/trace/trace_kprobe.c   |  24 
 kernel/trace/trace_uprobe.c   |  24 
 tools/include/uapi/linux/perf_event.h |  23 
 tools/lib/bpf/libbpf.c|  21 
 tools/lib/bpf/libbpf.h|  13 +++
 tools/lib/bpf/libbpf.map  |   4 +
 .../selftests/bpf/prog_tests/attach_probe.c   | 106 ++
 10 files changed, 270 insertions(+)

-- 
2.20.1



[PATCH v3 bpf-next 4/4] tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

2019-08-16 Thread Daniel Xu
Acked-by: Andrii Nakryiko 
Signed-off-by: Daniel Xu 
---
 .../selftests/bpf/prog_tests/attach_probe.c   | 106 ++
 1 file changed, 106 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c 
b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 5ecc267d98b0..c14db7557881 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -27,17 +27,27 @@ void test_attach_probe(void)
const char *kretprobe_name = "kretprobe/sys_nanosleep";
const char *uprobe_name = "uprobe/trigger_func";
const char *uretprobe_name = "uretprobe/trigger_func";
+   struct perf_event_query_probe kprobe_query = {};
+   struct perf_event_query_probe kretprobe_query = {};
+   struct perf_event_query_probe uprobe_query = {};
+   struct perf_event_query_probe uretprobe_query = {};
const int kprobe_idx = 0, kretprobe_idx = 1;
const int uprobe_idx = 2, uretprobe_idx = 3;
const char *file = "./test_attach_probe.o";
struct bpf_program *kprobe_prog, *kretprobe_prog;
struct bpf_program *uprobe_prog, *uretprobe_prog;
struct bpf_object *obj;
+   const struct bpf_link_fd *kprobe_fd_link;
+   const struct bpf_link_fd *kretprobe_fd_link;
+   const struct bpf_link_fd *uprobe_fd_link;
+   const struct bpf_link_fd *uretprobe_fd_link;
int err, prog_fd, duration = 0, res;
struct bpf_link *kprobe_link = NULL;
struct bpf_link *kretprobe_link = NULL;
struct bpf_link *uprobe_link = NULL;
struct bpf_link *uretprobe_link = NULL;
+   int kprobe_fd, kretprobe_fd;
+   int uprobe_fd, uretprobe_fd;
int results_map_fd;
size_t uprobe_offset;
ssize_t base_addr;
@@ -116,6 +126,54 @@ void test_attach_probe(void)
/* trigger & validate kprobe && kretprobe */
usleep(1);
 
+   kprobe_fd_link = bpf_link__as_fd(kprobe_link);
+   if (CHECK(!kprobe_fd_link, "kprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kprobe_fd = bpf_link_fd__fd(kprobe_fd_link);
+   if (CHECK(kprobe_fd < 0, "kprobe_get_perf_fd",
+   "failed to get perf fd from kprobe link\n"))
+   goto cleanup;
+
+   kretprobe_fd_link = bpf_link__as_fd(kretprobe_link);
+   if (CHECK(!kretprobe_fd_link, "kretprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kretprobe_fd = bpf_link_fd__fd(kretprobe_fd_link);
+   if (CHECK(kretprobe_fd < 0, "kretprobe_get_perf_fd",
+   "failed to get perf fd from kretprobe link\n"))
+   goto cleanup;
+
+   kprobe_query.size = sizeof(kprobe_query);
+   err = ioctl(kprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, &kprobe_query);
+   if (CHECK(err, "get_kprobe_ioctl",
+ "failed to issue kprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kprobe_query.nmissed > 0, "get_kprobe_ioctl",
+ "read incorrect nmissed from kprobe_ioctl: %llu\n",
+ kprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kprobe_query.nhit == 0, "get_kprobe_ioctl",
+ "read incorrect nhit from kprobe_ioctl: %llu\n",
+ kprobe_query.nhit))
+   goto cleanup;
+
+   kretprobe_query.size = sizeof(kretprobe_query);
+   err = ioctl(kretprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, &kretprobe_query);
+   if (CHECK(err, "get_kretprobe_ioctl",
+ "failed to issue kretprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nmissed > 0, "get_kretprobe_ioctl",
+ "read incorrect nmissed from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nhit <= 0, "get_kretprobe_ioctl",
+ "read incorrect nhit from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nhit))
+   goto cleanup;
+
	err = bpf_map_lookup_elem(results_map_fd, &kprobe_idx, &res);
if (CHECK(err, "get_kprobe_res",
  "failed to get kprobe res: %d\n", err))
@@ -135,6 +193,54 @@ void test_attach_probe(void)
/* trigger & validate uprobe & uretprobe */
get_base_addr();
 
+   uprobe_fd_link = bpf_link__as_fd(uprobe_link);
+   if (CHECK(!uprobe_fd_link, "uprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   uprobe_fd = bpf_link_fd__fd(uprobe_fd_link);
+   if (CHECK(uprobe_fd < 0,

Re: [PATCH v2 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-13 Thread Daniel Xu
On Tue, Aug 13, 2019, at 2:47 PM, Song Liu wrote:
> On Fri, Aug 9, 2019 at 2:48 PM Daniel Xu  wrote:
> >
> > It's useful to know [uk]probe's nmissed and nhit stats. For example with
> > tracing tools, it's important to know when events may have been lost.
> > debugfs currently exposes a control file to get this information, but
> > it is not compatible with probes registered with the perf API.
> >
> > While bpf programs may be able to manually count nhit, there is no way
> > to gather nmissed. In other words, it is currently not possible to
> > retrieve information about FD-based probes.
> >
> > This patch adds a new ioctl that lets users query nmissed (as well as
> > nhit for completeness). We currently only add support for [uk]probes
> > but leave the possibility open for other probes like tracepoint.
> >
> > Signed-off-by: Daniel Xu 
> > ---
> [...]
> 
> > +int perf_uprobe_event_query(struct perf_event *event, void __user *info)
> > +{
> > +   struct perf_event_query_probe __user *uquery = info;
> > +   struct perf_event_query_probe query = {};
> > +   struct trace_event_call *call = event->tp_event;
> > +   struct trace_uprobe *tu = (struct trace_uprobe *)call->data;
> > +   u64 nmissed, nhit;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EPERM;
> > +   if (copy_from_user(&query, uquery, sizeof(query)))
> > +   return -EFAULT;
> > +
> > +   nhit = tu->nhit;
> > +   nmissed = 0;
> 
> Blindly return 0 is a little weird. Maybe return 0x so
> that the user
> can tell this is not a valid 0. Or some other idea?
> 
> Thanks,
> Song
>

My (maybe flawed) understanding is that uprobes cannot really miss the same
way a kprobe can. From skimming the code a little, it seems the main reason
kprobes can miss is when the processing of one kprobe results in hitting
another kprobe; the nested kprobe cannot be handled, for whatever reason. The
same cannot really happen for uprobes, as the kernel doesn't call into
userspace. That's why I made it 0 (that and the fact that I didn't see any
accounting for uprobe misses).

cc Srikar who authored the uprobe patches.

Srikar, do you mind clarifying if uprobes can miss?

Thanks,
Daniel


Re: [PATCH v2 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-12 Thread Daniel Xu
On Mon, Aug 12, 2019, at 8:57 AM, Andrii Nakryiko wrote:
> On Fri, Aug 9, 2019 at 2:47 PM Daniel Xu  wrote:
> >
> > It's useful to know [uk]probe's nmissed and nhit stats. For example with
> > tracing tools, it's important to know when events may have been lost.
> > debugfs currently exposes a control file to get this information, but
> > it is not compatible with probes registered with the perf API.
> >
> > While bpf programs may be able to manually count nhit, there is no way
> > to gather nmissed. In other words, it is currently not possible to
> > retrieve information about FD-based probes.
> >
> > This patch adds a new ioctl that lets users query nmissed (as well as
> > nhit for completeness). We currently only add support for [uk]probes
> > but leave the possibility open for other probes like tracepoint.
> >
> > Signed-off-by: Daniel Xu 
> > ---
> >  include/linux/trace_events.h| 12 
> >  include/uapi/linux/perf_event.h | 19 +++
> >  kernel/events/core.c| 20 
> >  kernel/trace/trace_kprobe.c | 23 +++
> >  kernel/trace/trace_uprobe.c | 23 +++
> >  5 files changed, 97 insertions(+)
> >
[...]
> > +   struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> > +   u64 nmissed, nhit;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EPERM;
> > +   if (copy_from_user(&query, uquery, sizeof(query)))
> 
> what about forward/backward compatibility? Didn't you have a size
> field for perf_event_query_probe?

I initially did, yes. But after thinking about it more, I'm not convinced it
is necessary. It seems the last change to the debugfs counterpart was in the
initial commit cd7e7bd5e4, ~10 years ago. I cannot think of any other
information that would be useful off the top of my head, so I figured it'd
be best not to complicate something that doesn't seem likely to change. If
we really needed something else, adding another ioctl is pretty cheap.

If you (or anyone) feels strongly about adding it back, I can make it a
u64 so there's no holes.
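
For reference, what a size field usually buys is letting the kernel accept
both older and newer userspace layouts. A rough sketch of that pattern, with
an illustrative u64 size field that is not taken from this patch:

	u64 usize;

	if (get_user(usize, &uquery->size))
		return -EFAULT;
	if (usize > PAGE_SIZE)
		return -E2BIG;
	/*
	 * Copy only what both sides know about; query is zero-initialized,
	 * so any newer fields the user didn't supply stay 0.
	 */
	if (copy_from_user(&query, uquery, min_t(u64, usize, sizeof(query))))
		return -EFAULT;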

> 
> > +   return -EFAULT;
> > +
> > +   nhit = trace_kprobe_nhit(tk);
> > +   nmissed = tk->rp.kp.nmissed;
> > +
> > +   if (put_user(nmissed, &uquery->nmissed) ||
> > +   put_user(nhit, &uquery->nhit))
> 
> Wouldn't it be nicer to just do one user put for entire struct (or at
> least relevant part of it with backward/forward compatibility?).

Not sure how that didn't occur to me. Thanks.
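
For concreteness, the change would look roughly like this (a sketch of the
suggested single copy, not the final patch):

	struct perf_event_query_probe query = {};

	query.nhit = trace_kprobe_nhit(tk);
	query.nmissed = tk->rp.kp.nmissed;

	if (copy_to_user(uquery, &query, sizeof(query)))
		return -EFAULT;
	return 0;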


Re: [PATCH v2 bpf-next 0/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE

2019-08-09 Thread Daniel Xu
On Fri, Aug 9, 2019, at 2:47 PM, Daniel Xu wrote:
> It's useful to know [uk]probe's nmissed and nhit stats. For example with
> tracing tools, it's important to know when events may have been lost.
> debugfs currently exposes a control file to get this information, but
> it is not compatible with probes registered with the perf API.
> 
> While bpf programs may be able to manually count nhit, there is no way
> to gather nmissed. In other words, it is currently not possible to
> retrieve information about FD-based probes.
> 
> This patch adds a new ioctl that lets users query nmissed (as well as
> nhit for completeness). We currently only add support for [uk]probes
> but leave the possibility open for other probes like tracepoint.
> 
> v1 -> v2:
> - More descriptive cover letter
> - Make API more generic and support uprobes as well
> - Use casters/getters for libbpf instead of single getter
> - Fix typos
> - Remove size field from ioctl struct
> - Split out libbpf.h sync to tools dir to separate commit
> 
> Daniel Xu (4):
>   tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl
>   libbpf: Add helpers to extract perf fd from bpf_link
>   tracing/probe: Sync perf_event.h to tools
>   tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE
> 
>  include/linux/trace_events.h  |  12 +++
>  include/uapi/linux/perf_event.h   |  19 
>  kernel/events/core.c  |  20 
>  kernel/trace/trace_kprobe.c   |  23 
>  kernel/trace/trace_uprobe.c   |  23 
>  tools/include/uapi/linux/perf_event.h |  19 
>  tools/lib/bpf/libbpf.c|  19 
>  tools/lib/bpf/libbpf.h|   8 ++
>  tools/lib/bpf/libbpf.map  |   6 ++
>  .../selftests/bpf/prog_tests/attach_probe.c   | 102 ++
>  10 files changed, 251 insertions(+)
> 
> -- 
> 2.20.1
> 
>

CC PeterZ, whose email I misspelled. Apologies.

Daniel


[PATCH v2 bpf-next 3/4] tracing/probe: Sync perf_event.h to tools

2019-08-09 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 tools/include/uapi/linux/perf_event.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..65faa9b2a3b4 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -447,6 +447,24 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +480,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
-- 
2.20.1



[PATCH v2 bpf-next 4/4] tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

2019-08-09 Thread Daniel Xu
Signed-off-by: Daniel Xu 
---
 .../selftests/bpf/prog_tests/attach_probe.c   | 102 ++
 1 file changed, 102 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c 
b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 5ecc267d98b0..bb53103ddb66 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -27,17 +27,27 @@ void test_attach_probe(void)
const char *kretprobe_name = "kretprobe/sys_nanosleep";
const char *uprobe_name = "uprobe/trigger_func";
const char *uretprobe_name = "uretprobe/trigger_func";
+   struct perf_event_query_probe kprobe_query = {};
+   struct perf_event_query_probe kretprobe_query = {};
+   struct perf_event_query_probe uprobe_query = {};
+   struct perf_event_query_probe uretprobe_query = {};
const int kprobe_idx = 0, kretprobe_idx = 1;
const int uprobe_idx = 2, uretprobe_idx = 3;
const char *file = "./test_attach_probe.o";
struct bpf_program *kprobe_prog, *kretprobe_prog;
struct bpf_program *uprobe_prog, *uretprobe_prog;
struct bpf_object *obj;
+   const struct bpf_link_fd *kprobe_fd_link;
+   const struct bpf_link_fd *kretprobe_fd_link;
+   const struct bpf_link_fd *uprobe_fd_link;
+   const struct bpf_link_fd *uretprobe_fd_link;
int err, prog_fd, duration = 0, res;
struct bpf_link *kprobe_link = NULL;
struct bpf_link *kretprobe_link = NULL;
struct bpf_link *uprobe_link = NULL;
struct bpf_link *uretprobe_link = NULL;
+   int kprobe_fd, kretprobe_fd;
+   int uprobe_fd, uretprobe_fd;
int results_map_fd;
size_t uprobe_offset;
ssize_t base_addr;
@@ -116,6 +126,52 @@ void test_attach_probe(void)
/* trigger & validate kprobe && kretprobe */
usleep(1);
 
+   kprobe_fd_link = bpf_link__as_fd(kprobe_link);
+   if (CHECK(!kprobe_fd_link, "kprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kprobe_fd = bpf_link_fd__fd(kprobe_fd_link);
+   if (CHECK(kprobe_fd < 0, "kprobe_get_perf_fd",
+   "failed to get perf fd from kprobe link\n"))
+   goto cleanup;
+
+   kretprobe_fd_link = bpf_link__as_fd(kretprobe_link);
+   if (CHECK(!kretprobe_fd_link, "kretprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   kretprobe_fd = bpf_link_fd__fd(kretprobe_fd_link);
+   if (CHECK(kretprobe_fd < 0, "kretprobe_get_perf_fd",
+   "failed to get perf fd from kretprobe link\n"))
+   goto cleanup;
+
+   err = ioctl(kprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, &kprobe_query);
+   if (CHECK(err, "get_kprobe_ioctl",
+ "failed to issue kprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kprobe_query.nmissed > 0, "get_kprobe_ioctl",
+ "read incorect nmissed from kprobe_ioctl: %llu\n",
+ kprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kprobe_query.nhit == 0, "get_kprobe_ioctl",
+ "read incorect nhit from kprobe_ioctl: %llu\n",
+ kprobe_query.nhit))
+   goto cleanup;
+
+   err = ioctl(kretprobe_fd, PERF_EVENT_IOC_QUERY_PROBE, &kretprobe_query);
+   if (CHECK(err, "get_kretprobe_ioctl",
+ "failed to issue kretprobe query ioctl\n"))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nmissed > 0, "get_kretprobe_ioctl",
+ "read incorect nmissed from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nmissed))
+   goto cleanup;
+   if (CHECK(kretprobe_query.nhit <= 0, "get_kretprobe_ioctl",
+ "read incorect nhit from kretprobe_ioctl: %llu\n",
+ kretprobe_query.nhit))
+   goto cleanup;
+
	err = bpf_map_lookup_elem(results_map_fd, &kprobe_idx, &res);
if (CHECK(err, "get_kprobe_res",
  "failed to get kprobe res: %d\n", err))
@@ -135,6 +191,52 @@ void test_attach_probe(void)
/* trigger & validate uprobe & uretprobe */
get_base_addr();
 
+   uprobe_fd_link = bpf_link__as_fd(uprobe_link);
+   if (CHECK(!uprobe_fd_link, "uprobe_link_as_fd",
+ "failed to cast link to fd link\n"))
+   goto cleanup;
+
+   uprobe_fd = bpf_link_fd__fd(uprobe_fd_link);
+   if (CHECK(uprobe_fd < 0, "uprobe_get_perf_fd",
+   "failed to get perf fd from uprobe link\n"))
+   goto 

[PATCH v2 bpf-next 0/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE

2019-08-09 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.
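
As a concrete example, a tracing tool could snapshot the counters around a
trace session to detect lost events (sketch only, error checking elided;
perf_fd is assumed to be the probe's perf event fd, and this uses the v2
struct layout, which has no size field):

	struct perf_event_query_probe before = {}, after = {};

	ioctl(perf_fd, PERF_EVENT_IOC_QUERY_PROBE, &before);
	/* ... run and trace the workload ... */
	ioctl(perf_fd, PERF_EVENT_IOC_QUERY_PROBE, &after);

	if (after.nmissed > before.nmissed)
		fprintf(stderr, "warning: %llu probe hits were missed\n",
			after.nmissed - before.nmissed);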

v1 -> v2:
- More descriptive cover letter
- Make API more generic and support uprobes as well
- Use casters/getters for libbpf instead of single getter
- Fix typos
- Remove size field from ioctl struct
- Split out libbpf.h sync to tools dir to separate commit

Daniel Xu (4):
  tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl
  libbpf: Add helpers to extract perf fd from bpf_link
  tracing/probe: Sync perf_event.h to tools
  tracing/probe: Add self test for PERF_EVENT_IOC_QUERY_PROBE

 include/linux/trace_events.h  |  12 +++
 include/uapi/linux/perf_event.h   |  19 
 kernel/events/core.c  |  20 
 kernel/trace/trace_kprobe.c   |  23 
 kernel/trace/trace_uprobe.c   |  23 
 tools/include/uapi/linux/perf_event.h |  19 
 tools/lib/bpf/libbpf.c|  19 
 tools/lib/bpf/libbpf.h|   8 ++
 tools/lib/bpf/libbpf.map  |   6 ++
 .../selftests/bpf/prog_tests/attach_probe.c   | 102 ++
 10 files changed, 251 insertions(+)

-- 
2.20.1



[PATCH v2 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

2019-08-09 Thread Daniel Xu
It's useful to know [uk]probe's nmissed and nhit stats. For example with
tracing tools, it's important to know when events may have been lost.
debugfs currently exposes a control file to get this information, but
it is not compatible with probes registered with the perf API.

While bpf programs may be able to manually count nhit, there is no way
to gather nmissed. In other words, it is currently not possible to
retrieve information about FD-based probes.

This patch adds a new ioctl that lets users query nmissed (as well as
nhit for completeness). We currently only add support for [uk]probes
but leave the possibility open for other probes like tracepoint.

Signed-off-by: Daniel Xu 
---
 include/linux/trace_events.h| 12 
 include/uapi/linux/perf_event.h | 19 +++
 kernel/events/core.c| 20 
 kernel/trace/trace_kprobe.c | 23 +++
 kernel/trace/trace_uprobe.c | 23 +++
 5 files changed, 97 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5150436783e8..61558f19696a 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -586,6 +586,12 @@ extern int bpf_get_kprobe_info(const struct perf_event 
*event,
   u32 *fd_type, const char **symbol,
   u64 *probe_offset, u64 *probe_addr,
   bool perf_type_tracepoint);
+extern int perf_kprobe_event_query(struct perf_event *event, void __user 
*info);
+#else
+static inline int perf_kprobe_event_query(struct perf_event *event, void __user *info)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
 extern int  perf_uprobe_init(struct perf_event *event,
@@ -594,6 +600,12 @@ extern void perf_uprobe_destroy(struct perf_event *event);
 extern int bpf_get_uprobe_info(const struct perf_event *event,
   u32 *fd_type, const char **filename,
   u64 *probe_offset, bool perf_type_tracepoint);
+extern int perf_uprobe_event_query(struct perf_event *event, void __user 
*info);
+#else
+static inline int perf_uprobe_event_query(struct perf_event *event, void __user *info)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..65faa9b2a3b4 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -447,6 +447,24 @@ struct perf_event_query_bpf {
__u32   ids[0];
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_PROBE command
+ * to query information about the probe attached to the perf
+ * event. Currently only supports [uk]probes.
+ */
+struct perf_event_query_probe {
+   /*
+* Set by the kernel to indicate number of times this probe
+* was temporarily disabled
+*/
+   __u64   nmissed;
+   /*
+* Set by the kernel to indicate number of times this probe
+* was hit
+*/
+   __u64   nhit;
+};
+
 /*
  * Ioctls that can be done on a perf event fd:
  */
@@ -462,6 +480,7 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_QUERY_PROBE _IOR('$', 12, struct 
perf_event_query_probe *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 026a14541a38..3e0fe6eaaad0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5060,6 +5060,8 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg);
 static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
  struct perf_event_attr *attr);
+static int perf_probe_event_query(struct perf_event *event,
+   void __user *info);
 
 static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned 
long arg)
 {
@@ -5143,6 +5145,10 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
 
	return perf_event_modify_attr(event,  &new_attr);
}
+#if defined(CONFIG_KPROBE_EVENTS) || defined(CONFIG_UPROBE_EVENTS)
+   case PERF_EVENT_IOC_QUERY_PROBE:
+   return perf_probe_event_query(event, (void __user *)arg);
+#endif
default:
return -ENOTTY;
}
@@ -8833,6 +8839,20 @@ static inline void perf_tp_register(void)
 #endif
 }
 
+static int perf_probe_event_query(struct perf_event *event,
+   void 

[PATCH v2 bpf-next 2/4] libbpf: Add helpers to extract perf fd from bpf_link

2019-08-09 Thread Daniel Xu
It is sometimes necessary to perform ioctls on the underlying perf fd.
There is currently no way to extract the fd given a bpf_link, so add a
pair of casting and getting helpers.

The casting and getting helpers are nice because they let us define
broad categories of links, which makes it clear to users what they can
expect to extract from each type of link.
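
To make the intended pattern concrete, a hypothetical future link category
would get its own caster and getters in the same style (the names below are
made up for illustration and are not part of this patch):

	/* casting API for a hypothetical raw-tracepoint-backed link */
	LIBBPF_API const struct bpf_link_raw_tp *
	bpf_link__as_raw_tp(const struct bpf_link *link);

	/* getter API */
	LIBBPF_API const char *
	bpf_link_raw_tp__name(const struct bpf_link_raw_tp *link);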

Signed-off-by: Daniel Xu 
---
 tools/lib/bpf/libbpf.c   | 19 +++
 tools/lib/bpf/libbpf.h   |  8 
 tools/lib/bpf/libbpf.map |  6 ++
 3 files changed, 33 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ead915aec349..f4d750863abd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4004,6 +4004,25 @@ static int bpf_link__destroy_perf_event(struct bpf_link 
*link)
return err;
 }
 
+const struct bpf_link_fd *bpf_link__as_fd(const struct bpf_link *link)
+{
+   if (!link)
+   return NULL;
+
+   if (link->destroy != &bpf_link__destroy_perf_event)
+   return NULL;
+
+   return (struct bpf_link_fd *)link;
+}
+
+int bpf_link_fd__fd(const struct bpf_link_fd *link)
+{
+   if (!link)
+   return -1;
+
+   return link->fd;
+}
+
 struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
int pfd)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8a9d462a6f6d..4498b6ae459a 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -166,6 +166,14 @@ LIBBPF_API int bpf_program__unpin(struct bpf_program 
*prog, const char *path);
 LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_link;
+struct bpf_link_fd;
+
+/* casting APIs */
+LIBBPF_API const struct bpf_link_fd *
+bpf_link__as_fd(const struct bpf_link *link);
+
+/* getters APIs */
+LIBBPF_API int bpf_link_fd__fd(const struct bpf_link_fd *link);
 
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f9d316e873d8..b58dd0f0259c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -184,3 +184,9 @@ LIBBPF_0.0.4 {
perf_buffer__new_raw;
perf_buffer__poll;
 } LIBBPF_0.0.3;
+
+LIBBPF_0.0.5 {
+   global:
+   bpf_link__as_fd;
+   bpf_link_fd__fd;
+} LIBBPF_0.0.4;
-- 
2.20.1