Re: [PATCH] net: bpf: type fixes for __be16/__be32

2019-10-17 Thread Martin Lau
On Wed, Oct 16, 2019 at 12:16:35PM +0100, Ben Dooks (Codethink) wrote:
> In bpf_skb_load_helper_16_no_cache and
> bpf_skb_load_helper_32_no_cache they
> read a u16/u32 where actually these
> are __be16 and __be32.
> 
> Fix the following warnings by making
> the types match:
> 
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
Acked-by: Martin KaFai Lau 
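For context on what sparse is complaining about: __be16/__be32 mark values that are still in network (big-endian) byte order, and casting them straight to a plain u16/u32 is what triggers "cast to restricted". A hedged user-space sketch of the correct pattern (plain C with uint16_t standing in for the kernel's __be16; not the filter.c code itself):

```c
#include <arpa/inet.h>   /* ntohs() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 16-bit big-endian (network order) field from a packet buffer
 * and convert it to host byte order.  In the kernel, the intermediate
 * value would be typed __be16 so sparse can verify that an ntohs()-style
 * conversion happens before the value is used as a plain integer. */
static uint16_t load_be16(const unsigned char *p)
{
	uint16_t be;                  /* still big-endian after the copy */

	memcpy(&be, p, sizeof(be));   /* memcpy handles unaligned reads */
	return ntohs(be);             /* now host byte order */
}
```

The result is 0x1234 regardless of host endianness, which is exactly the property the type annotation is there to guarantee.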


Re: [PATCH bpf-next] bpf: sk_storage: Fix out of bounds memory access

2019-06-13 Thread Martin Lau
On Thu, Jun 13, 2019 at 10:15:38AM -0700, Andrii Nakryiko wrote:
> On Thu, Jun 13, 2019 at 8:16 AM Arthur Fabre  wrote:
> >
> > bpf_sk_storage maps use multiple spin locks to reduce contention.
> > The number of locks to use is determined by the number of possible CPUs.
> > With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.
Thanks for the report.

> >
> > When updating elements, the correct lock is determined with hash_ptr().
> > Calling hash_ptr() with 0 bits is undefined behavior, as it does:
> >
> > x >> (64 - bits)
> >
> > Using the value results in an out of bounds memory access.
> > In my case, this manifested itself as a page fault when raw_spin_lock_bh()
> > is called later, when running the self tests:
> >
> > ./tools/testing/selftests/bpf/test_verifier 773 775
> >
> > [   16.366342] BUG: unable to handle page fault for address: 
> > 8fe7a66f93f8
> > [   16.367139] #PF: supervisor write access in kernel mode
> > [   16.367751] #PF: error_code(0x0002) - not-present page
> > [   16.368323] PGD 35a01067 P4D 35a01067 PUD 0
> > [   16.368796] Oops: 0002 [#1] SMP PTI
> > [   16.369175] CPU: 0 PID: 189 Comm: test_verifier Not tainted 5.2.0-rc2+ 
> > #10
> > [   16.369960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.12.0-1 04/01/2014
> > [   16.371021] RIP: 0010:_raw_spin_lock_bh 
> > (/home/afabre/linux/./include/trace/events/initcall.h:48)
> > [ 16.371571] Code: 02 00 00 31 c0 ba ff 00 00 00 3e 0f b1 17 75 01 c3 e9 82 
> > 12 5f ff 66 90 65 81 05 ad 14 6f 41 00 02 00 00 31 c0 ba 01 00 00 00 <3e> 
> > 0f b1 17 75 01 c3 89 c6 e9 f0 02 5f ff b8 00 02 00 00 3e 0f c1
> > All code
> > 
> >0:   02 00   add(%rax),%al
> >2:   00 31   add%dh,(%rcx)
> >4:   c0 ba ff 00 00 00 3esarb   $0x3e,0xff(%rdx)
> >b:   0f b1 17cmpxchg %edx,(%rdi)
> >e:   75 01   jne0x11
> >   10:   c3  retq
> >   11:   e9 82 12 5f ff  jmpq   0xff5f1298
> >   16:   66 90   xchg   %ax,%ax
> >   18:   65 81 05 ad 14 6f 41addl   $0x200,%gs:0x416f14ad(%rip)# 
> > 0x416f14d0
> >   1f:   00 02 00 00
> >   23:   31 c0   xor%eax,%eax
> >   25:   ba 01 00 00 00  mov$0x1,%edx
> >   2a:   3e 0f b1 17 cmpxchg %edx,%ds:*(%rdi)<-- 
> > trapping instruction
> >   2e:   75 01   jne0x31
> >   30:   c3  retq
> >   31:   89 c6   mov%eax,%esi
> >   33:   e9 f0 02 5f ff  jmpq   0xff5f0328
> >   38:   b8 00 02 00 00  mov$0x200,%eax
> >   3d:   3e  ds
> >   3e:   0f  .byte 0xf
> >   3f:   c1  .byte 0xc1
> >
> > Code starting with the faulting instruction
> > ===
> >0:   3e 0f b1 17 cmpxchg %edx,%ds:(%rdi)
> >4:   75 01   jne0x7
> >6:   c3  retq
> >7:   89 c6   mov%eax,%esi
> >9:   e9 f0 02 5f ff  jmpq   0xff5f02fe
> >e:   b8 00 02 00 00  mov$0x200,%eax
> >   13:   3e  ds
> >   14:   0f  .byte 0xf
> >   15:   c1  .byte 0xc1
> > [   16.373398] RSP: 0018:a759809d3be0 EFLAGS: 00010246
> > [   16.373954] RAX:  RBX: 8fe7a66f93f0 RCX: 
> > 0040
> > [   16.374645] RDX: 0001 RSI: 8fdaf9f0d180 RDI: 
> > 8fe7a66f93f8
> > [   16.375338] RBP: 8fdaf9f0d180 R08: 8fdafba2c320 R09: 
> > 8fdaf9f0d0c0
> > [   16.376028] R10:  R11:  R12: 
> > 8fdafa346700
> > [   16.376719] R13: 8fe7a66f93f8 R14: 8fdaf9f0d0c0 R15: 
> > 0001
> > [   16.377413] FS:  7fda724c0740() GS:8fdafba0() 
> > knlGS:
> > [   16.378204] CS:  0010 DS:  ES:  CR0: 80050033
> > [   16.378763] CR2: 8fe7a66f93f8 CR3: 000139d1c006 CR4: 
> > 00360ef0
> > [   16.379453] DR0:  DR1:  DR2: 
> > 
> > [   16.380144] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [   16.380864] Call Trace:
> > [   16.381112] selem_link_map 
> > (/home/afabre/linux/./include/linux/compiler.h:221 
> > /home/afabre/linux/net/core/bpf_sk_storage.c:243)
> > [   16.381476] sk_storage_update 
> > (/home/afabre/linux/net/core/bpf_sk_storage.c:355 
> > /home/afabre/linux/net/core/bpf_sk_storage.c:414)
> > [   16.381888] bpf_sk_storage_get 
> > (/home/afabre/linux/net/core/bpf_sk_storage.c:760 
> > /home/afabre/linux/net/core/bpf_sk_storage.c:741)
> > [   16.382285] ___bpf_prog_run (/home/afabre/linux/kernel/bpf/core.c:1447)
> > [   16.382679] ? __bpf_prog_run32 
> > (/home/afabre/linux/kernel/bpf/core.c:1603)
> > [   16.383074] ? alloc_file_pseudo (/home/afa
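To illustrate the failure mode outside the kernel: hash_64() ends in `x >> (64 - bits)`, and a shift count of 64 is undefined behavior in C, so `bits` must be at least 1. A minimal user-space sketch of the clamp the fix needs (hypothetical helper names, not the kernel's bpf_sk_storage code):

```c
#include <assert.h>
#include <stdint.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Mirrors hash_64(): the shift is only defined for 1 <= bits <= 63.
 * Calling this with bits == 0 would shift by 64, which is UB. */
static uint64_t hash_64_sketch(uint64_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_64) >> (64 - bits);
}

/* Pick a bucket_log for ncpus possible CPUs, clamped to at least 1 so
 * that the hash is never taken with bits == 0 (and so at least two
 * buckets/locks exist even on a single-CPU system). */
static unsigned int bucket_log_for(unsigned int ncpus)
{
	unsigned int log = 0;

	while ((1u << log) < ncpus)
		log++;
	return log ? log : 1;
}
```

With 1 possible CPU the unclamped log would be 0, reproducing the undefined shift; the clamp keeps the bucket index in range.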

Re: [PATCH bpf-next] bpf: allow CGROUP_SKB programs to use bpf_skb_cgroup_id() helper

2019-06-06 Thread Martin Lau
On Thu, Jun 06, 2019 at 01:30:12PM -0700, Roman Gushchin wrote:
> Currently bpf_skb_cgroup_id() is not supported for CGROUP_SKB
> programs. An attempt to load such a program generates an error
> like this:
> 
> libbpf:
> 0: (b7) r6 = 0
> ...
> 9: (85) call bpf_skb_cgroup_id#79
> unknown func bpf_skb_cgroup_id#79
> 
> There are no particular reasons for denying it, and we have some
> use cases where it might be useful.
> 
> So let's add it to the list of allowed helpers.
Acked-by: Martin KaFai Lau 


Re: [PATCH][next] bpf: hbm: fix spelling mistake "notifcations" -> "notifications"

2019-06-03 Thread Martin Lau
On Mon, Jun 03, 2019 at 02:36:53PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> There is a spelling mistake in the help information, fix this.
> 
> Signed-off-by: Colin Ian King 
> ---
>  samples/bpf/hbm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
> index 480b7ad6a1f2..bdfce592207a 100644
> --- a/samples/bpf/hbm.c
> +++ b/samples/bpf/hbm.c
> @@ -411,7 +411,7 @@ static void Usage(void)
>  "-l also limit flows using loopback\n"
>  "-n <#> to create cgroup \"/hbm#\" and attach prog\n"
>  "   Default is /hbm1\n"
> -"--no_cndisable CN notifcations\n"
> +"--no_cndisable CN notifications\n"
Acked-by: Martin KaFai Lau 

This should go to bpf-next.


Re: [PATCH v3 net] ipv6: Fix dangling pointer when ipv6 fragment

2019-04-02 Thread Martin Lau
On Tue, Apr 02, 2019 at 06:49:03PM +0800, kbuild test robot wrote:
> Hi hujunwei,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on net/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/hujunwei/ipv6-Fix-dangling-pointer-when-ipv6-fragment/20190402-175602
> config: i386-randconfig-x005-201913 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>net//ipv6/ip6_output.c: In function 'ip6_fragment':
> >> net//ipv6/ip6_output.c:609:27: warning: 'prevhdr' is used uninitialized in 
> >> this function [-Wuninitialized]
>  nexthdr_offset = prevhdr - skb_network_header(skb);
>   ^
> 
> vim +/prevhdr +609 net//ipv6/ip6_output.c
> 
>594
>595int ip6_fragment(struct net *net, struct sock *sk, struct 
> sk_buff *skb,
>596 int (*output)(struct net *, struct sock *, 
> struct sk_buff *))
>597{
>598struct sk_buff *frag;
>599struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
>600struct ipv6_pinfo *np = skb->sk && 
> !dev_recursion_level() ?
>601inet6_sk(skb->sk) : NULL;
>602struct ipv6hdr *tmp_hdr;
>603struct frag_hdr *fh;
>604unsigned int mtu, hlen, left, len, nexthdr_offset;
>605int hroom, troom;
>606__be32 frag_id;
>607int ptr, offset = 0, err = 0;
>608u8 *prevhdr, nexthdr = 0;
>  > 609nexthdr_offset = prevhdr - skb_network_header(skb);
hmm... This line has been moved up since v2. :(

>610
>611err = ip6_find_1stfragopt(skb, &prevhdr);
>612if (err < 0)
>613goto fail;
>614hlen = err;
>615nexthdr = *prevhdr;
>616
>617mtu = ip6_skb_dst_mtu(skb);
>618
>619/* We must not fragment if the socket is set to force 
> MTU discovery
>620 * or if the skb it not generated by a local socket.
>621 */
>622if (unlikely(!skb->ignore_df && skb->len > mtu))
>623goto fail_toobig;
>624
>625if (IP6CB(skb)->frag_max_size) {
>626if (IP6CB(skb)->frag_max_size > mtu)
>627goto fail_toobig;
>628
>629/* don't send fragments larger than what we 
> received */
>630mtu = IP6CB(skb)->frag_max_size;
>631if (mtu < IPV6_MIN_MTU)
>632mtu = IPV6_MIN_MTU;
>633}
>634
>635if (np && np->frag_size < mtu) {
>636if (np->frag_size)
>637mtu = np->frag_size;
>638}
>639if (mtu < hlen + sizeof(struct frag_hdr) + 8)
>640goto fail_toobig;
>641mtu -= hlen + sizeof(struct frag_hdr);
>642
>643frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
>644&ipv6_hdr(skb)->saddr);
>645
>646if (skb->ip_summed == CHECKSUM_PARTIAL &&
>647(err = skb_checksum_help(skb)))
>648goto fail;
>649
>650prevhdr = skb_network_header(skb) + nexthdr_offset;
>651hroom = LL_RESERVED_SPACE(rt->dst.dev);
>652if (skb_has_frag_list(skb)) {
>653unsigned int first_len = skb_pagelen(skb);
>654struct sk_buff *frag2;
>655
>656if (first_len - hlen > mtu ||
>657((first_len - hlen) & 7) ||
>658skb_cloned(skb) ||
>659skb_headroom(skb) < (hroom + sizeof(struct 
> frag_hdr)))
>660goto slow_path;
>661
>662skb_walk_frags(skb, frag) {
>663/* Correct geometry. */
>664if (frag->len > mtu ||
>665((frag->len & 7) && frag->next) ||
>666skb_headroom(frag) < (hlen + hroom 
> + sizeof(struct frag_hdr)))
>667goto slow_path_clean;
>668
>669
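Martin's comment above points at the ordering bug: in v3, `nexthdr_offset = prevhdr - skb_network_header(skb)` was moved above the `ip6_find_1stfragopt()` call that actually initializes `prevhdr`, so the offset is computed from an uninitialized pointer. A toy user-space sketch of the correct ordering (hypothetical offsets and names, not the ip6_output.c code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for ip6_find_1stfragopt(): points *prevhdr at the last
 * next-header byte inside buf.  The offset 6 is a made-up example. */
static int find_1stfragopt_sketch(const unsigned char *buf,
				  const unsigned char **prevhdr)
{
	*prevhdr = buf + 6;
	return 40;            /* pretend unfragmentable-part length */
}

/* Correct ordering: derive the offset only after prevhdr has been set.
 * Computing it before the call (as the v3 patch did) reads an
 * uninitialized pointer, which is what gcc's warning flags. */
static ptrdiff_t nexthdr_offset_of(const unsigned char *buf)
{
	const unsigned char *prevhdr;

	(void)find_1stfragopt_sketch(buf, &prevhdr);
	return prevhdr - buf;
}
```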

Re: [PATCH] bpf: drop refcount if bpf_map_new_fd() fails in map_create()

2019-02-27 Thread Martin Lau
On Wed, Feb 27, 2019 at 10:36:25PM +0800, zerons wrote:
> In bpf/syscall.c, map_create() first sets map->usercnt to 1, as a file
> descriptor is supposed to be returned to userspace. When bpf_map_new_fd()
> fails, drop the refcount.
Thanks for the patch.

Please add a Fixes tag for bug fixes in the future.

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Acked-by: Martin KaFai Lau 


Re: [PATCH] bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()

2019-02-26 Thread Martin Lau
On Tue, Feb 26, 2019 at 10:15:37PM +0800, zerons wrote:
> [ Upstream commit c91951f15978f1a0c6b65f063d30f7ea7bc6fb42 ]
> 
> In bpf/syscall.c, bpf_map_get_fd_by_id() uses bpf_map_inc_not_zero() to
> increase the refcount, both map->refcnt and map->usercnt. Then, if
> bpf_map_new_fd() fails, map->usercnt should be handled too.
Good catch!  Thanks for the fix.

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Acked-by: Martin KaFai Lau 
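Both of these fixes follow the same error-path rule: any reference taken before a fallible step must be released on that step's failure path. A minimal user-space sketch of the pattern (hypothetical types and helpers, not the kernel's syscall.c code):

```c
#include <assert.h>

struct map_sketch {
	int refcnt;
	int usercnt;
};

static void map_get(struct map_sketch *m) { m->refcnt++; m->usercnt++; }
static void map_put(struct map_sketch *m) { m->refcnt--; m->usercnt--; }

/* Stand-in for bpf_map_new_fd(): fails when told to. */
static int new_fd_sketch(int should_fail) { return should_fail ? -1 : 3; }

/* Mirrors the fixed flow of bpf_map_get_fd_by_id(): take the reference,
 * and drop it again if fd creation fails, instead of leaking it. */
static int get_fd_by_id_sketch(struct map_sketch *m, int should_fail)
{
	int fd;

	map_get(m);
	fd = new_fd_sketch(should_fail);
	if (fd < 0)
		map_put(m);   /* the missing cleanup both patches add */
	return fd;
}
```

After a failed call, both counters are back to their previous values; without the `map_put()` on the error path they would leak by one per failed syscall.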


Re: [PATCH bpf-next v1] bpf, lpm: fix lookup bug in map_delete_elem

2019-02-21 Thread Martin Lau
On Thu, Feb 21, 2019 at 05:39:26PM +0100, Alban Crequy wrote:
> From: Alban Crequy 
> 
> trie_delete_elem() was deleting an entry even though its key did not match,
> as long as the prefixlen was correct. This patch adds a check on matchlen.
> 
> Reproducer:
> 
> $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 
> entries 128 name mylpm flags 1
> $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb 
> cc dd value hex 01
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> key: 10 00 00 00 aa bb cc dd  value: 01
> Found 1 element
> $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff 
> ff ff
> $ echo $?
> 0
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> Found 0 elements
The change makes sense to me.  Can you add this reproducer to
tools/testing/selftests/bpf/test_lpm_map.c?

A bug fix should go to the "bpf" tree instead of "bpf-next".
A Fixes tag is also required, like:

Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek 

> 
> Signed-off-by: Alban Crequy 
> ---
>  kernel/bpf/lpm_trie.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index abf1002080df..93a5cbbde421 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -471,6 +471,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
> *_key)
>   }
>  
>   if (!node || node->prefixlen != key->prefixlen ||
> + node->prefixlen != matchlen ||
>   (node->flags & LPM_TREE_NODE_FLAG_IM)) {
>   ret = -ENOENT;
>   goto out;
> -- 
> 2.20.1
> 
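The condition being added can be sketched in user space: the trie walk can stop at a node whose prefixlen equals the requested prefixlen even though the key bits diverge earlier, so deletion must also require that the number of actually matching bits (matchlen) equals the prefixlen. A hedged standalone sketch (not the kernel's lpm_trie.c code):

```c
#include <assert.h>
#include <stddef.h>

/* Count the leading bits of a and b that match, up to prefixlen bits. */
static size_t prefix_matchlen(const unsigned char *a, const unsigned char *b,
			      size_t prefixlen)
{
	size_t i;

	for (i = 0; i < prefixlen; i++) {
		unsigned int bit = 7 - (i % 8);

		if (((a[i / 8] >> bit) & 1) != ((b[i / 8] >> bit) & 1))
			return i;
	}
	return prefixlen;
}

/* Delete only when the stored entry fully matches the requested key:
 * an equal prefixlen alone is not enough -- that was the bug the
 * reproducer above demonstrates with keys aa.bb.cc.dd vs ff.ff.ff.ff. */
static int should_delete(const unsigned char *node_data, size_t node_prefixlen,
			 const unsigned char *key_data, size_t key_prefixlen)
{
	size_t matchlen = prefix_matchlen(node_data, key_data, node_prefixlen);

	return node_prefixlen == key_prefixlen && node_prefixlen == matchlen;
}
```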


Re: [RFC] Loading BTF and pretty printing maps with bpftool

2019-01-25 Thread Martin Lau
On Fri, Jan 25, 2019 at 11:20:57AM +0100, Arnaldo Carvalho de Melo wrote:
>   # bpftool version
>   bpftool v5.0.0-rc3
>   #
> 
>   # bpftool prog | tail -6
>   309: tracepoint  name sys_enter  tag 819967866022f1e1  gpl
>   loaded_at 2019-01-25T11:05:41+0100  uid 0
>   xlated 528B  jited 381B  memlock 4096B  map_ids 200,199,198
>   310: tracepoint  name sys_exit  tag c1bd85c092d6e4aa  gpl
>   loaded_at 2019-01-25T11:05:41+0100  uid 0
>   xlated 256B  jited 191B  memlock 4096B  map_ids 200,199
>   #
> 
> And the maps:
> 
>   # bpftool map | tail -6
>   198: perf_event_array  name __augmented_sys  flags 0x0
>   key 4B  value 4B  max_entries 8  memlock 4096B
>   199: array  name syscalls  flags 0x0
>   key 4B  value 1B  max_entries 512  memlock 8192B
>   200: hash  name pids_filtered  flags 0x0
>   key 4B  value 1B  max_entries 64  memlock 8192B
>   #
> 
> So, dumping the entries for those entries:
> 
> [root@quaco ~]# egrep sleep 
> /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c
>   [35] = "nanosleep",
>   [230] = "clock_nanosleep",
> [root@quaco ~]#
> 
> Looking at just the open and nanosleep:
> 
>   # bpftool map dump id 199 | grep "value: 01"
>   key: 23 00 00 00  value: 01
>   key: e6 00 00 00  value: 01
>   #
> 
>   # bpftool map lookup id 199 key 35
>   Error: key expected 4 bytes got 1
>   #
> 
>   # bpftool map lookup id 199 key 35 00 00 00
>   key: 23 00 00 00  value: 01
>   # bpftool map lookup id 199 key 230 00 00 00
>   key: e6 00 00 00  value: 01
>   # 
> 
> I thought it was that --pretty option, so I tried:
> 
>   # bpftool map --pretty lookup id 199 key 230 00 00 00
>   {
> "key": ["0xe6","0x00","0x00","0x00"
> ],
> "value": ["0x01"
> ]
>   }
>   #
libbpf prints a pr_warning on failing to load BTF, or on failing
to create a map after BTF has been loaded and btf_map_xxx can be
found.

Did you see any of them?  From your email, it seems it can load
the BTF.

It may be useful to set libbpf's __pr_debug, which should be NULL
by default, iirc.


Re: [PATCH] bpf: Make function btf_name_offset_valid static

2019-01-16 Thread Martin Lau
On Wed, Jan 16, 2019 at 08:29:40PM +0100, Mathieu Malaterre wrote:
> Initially in commit 69b693f0aefa ("bpf: btf: Introduce BPF Type Format
> (BTF)") the function 'btf_name_offset_valid' was introduced as a static
> function; it was later changed to a non-static one, and then finally
> in commit c454a46b5efd ("bpf: Add bpf_line_info support") the function
> prototype was removed.
Instead of c454a46b5efd, it was removed from btf.h in
commit 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name 
use case")

> 
> Revert back to original implementation and make the function static.
> Remove warning triggered with W=1:
> 
>   kernel/bpf/btf.c:470:6: warning: no previous prototype for 
> 'btf_name_offset_valid' [-Wmissing-prototypes]
> 
> Signed-off-by: Mathieu Malaterre 
Fixes: 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name 
use case")

Thanks for the fix!

Other than the above,

Acked-by: Martin KaFai Lau 


Re: "bpf: Improve the info.func_info and info.func_info_rec_size behavior" breaks strace self tests

2019-01-03 Thread Martin Lau
On Thu, Jan 03, 2019 at 11:41:18PM +0100, Heiko Carstens wrote:
> On Thu, Jan 03, 2019 at 07:12:05PM +0000, Martin Lau wrote:
> > On Thu, Jan 03, 2019 at 12:46:13PM +0100, Heiko Carstens wrote:
> > > Hello,
> > > 
> > > the kernel commit 7337224fc150 ("bpf: Improve the info.func_info and
> > > info.func_info_rec_size behavior") breaks one of strace's self tests:
> > > 
> > > FAIL: bpf-obj_get_info_by_fd-prog-v.gen
> > strace's bpf-obj_get_info_by_fd-prog-v test did fail.  However,
> > I failed to see how 7337224fc150 broke it.  How did you trace it down to
> > commit 7337224fc150, and can you share your test output?
> > 
> > The failure I can reproduce is EFAULT.  It should have been failing
> > as early as "nr_jited_ksyms" is added to "struct bpf_prog_info"
> > in linux/bpf.h.
> 
> Ah, sorry(!), I forgot to mention an important detail: the test
> failure happens only if executed as normal (non-root) user.
> 
> With 7337224fc150 ("bpf: Improve the info.func_info and
> info.func_info_rec_size behavior") the failure happens. With commit
> 30da46b5dc3a ("tools: bpftool: add a command to dump the trace pipe")
> it passes; which is one commit earlier.
> 
> FAIL: bpf-obj_get_info_by_fd-prog-v.gen
> ===
> 
> --- exp   2019-01-03 23:31:49.576949303 +0100
> +++ log   2019-01-03 23:31:49.576949303 +0100
> @@ -1,8 +1,8 @@
>  bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=8, 
> max_entries=1, map_flags=0, inner_map_fd=0, map_name="test_map", 
> map_ifindex=0}, 48) = 3
>  bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=9, 
> insns=[{code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_1, src_reg=BPF_REG_0, 
> off=0, imm=0}, {code=BPF_STX|BPF_W|BPF_MEM, dst_reg=BPF_REG_10, 
> src_reg=BPF_REG_1, off=-4, imm=0}, {code=BPF_ALU64|BPF_X|BPF_MOV, 
> dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0}, 
> {code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, 
> imm=0xfffc}, {code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, 
> src_reg=BPF_REG_1, off=0, imm=0x3}, {code=BPF_LD|BPF_W|BPF_IMM, 
> dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, 
> {code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, 
> imm=0x1}, {code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, 
> src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_EXIT, 
> dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}], license="BSD", 
> log_level=42, log_size=4096, log_buf="", kern_version=KERNEL_VERSION(57005, 
> 192, 222), prog_flags=0, prog_name="test_prog", prog_ifindex=0, 
> expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = 4
>  bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=3, 
> info_len=128 => 80, info={type=BPF_MAP_TYPE_ARRAY, id=15, key_size=4, 
> value_size=8, max_entries=1, map_flags=0, name="test_map", ifindex=0, 
> netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
> -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, 
> info_len=168 => 152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, 
> tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, 
> jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=NULL, 
> load_time=28281352029, created_by_uid=1000, nr_map_ids=0 => 1, map_ids=NULL, 
> name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
> -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, 
> info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, 
> tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, 
> jited_prog_insns=NULL, xlated_prog_len=336, xlated_prog_insns=0x3ff8d6f7000, 
> load_time=28281352029, created_by_uid=1000, nr_map_ids=2, 
> map_ids=0x3ff8d6f1000, name="test_prog", ifindex=0, netns_dev=makedev(0, 0), 
> netns_ino=0}}}, 16) = -1 EFAULT (Bad address)
> -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, 
> info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, 
> tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, 
> jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=[], 
> load_time=28281352029, created_by_uid=1000, nr_map_ids=0 => 1, map_ids=[], 
> name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
> -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, 
> info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, 
> tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, 
> jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=[], 
> load_time=28281352029, created_by_uid=1000, nr_map_ids=2 => 1, map_ids=[15], 
> name="test_prog", ifindex=0, 

Re: "bpf: Improve the info.func_info and info.func_info_rec_size behavior" breaks strace self tests

2019-01-03 Thread Martin Lau
On Thu, Jan 03, 2019 at 12:46:13PM +0100, Heiko Carstens wrote:
> Hello,
> 
> the kernel commit 7337224fc150 ("bpf: Improve the info.func_info and
> info.func_info_rec_size behavior") breaks one of strace's self tests:
> 
> FAIL: bpf-obj_get_info_by_fd-prog-v.gen
strace's bpf-obj_get_info_by_fd-prog-v test did fail.  However,
I failed to see how 7337224fc150 broke it.  How did you trace it down to
commit 7337224fc150, and can you share your test output?

The failure I can reproduce is EFAULT.  It should have been failing
as early as "nr_jited_ksyms" is added to "struct bpf_prog_info"
in linux/bpf.h.

The sys_bpf(BPF_OBJ_GET_INFO_BY_FD, &bpf_prog_get_info_attr, ...) is called
in a loop. The bpf_prog_get_info_attr.info object is of size 104 but
bpf_prog_get_info_attr.info_len is 168.  Hence, if the prog is jited, from
the second iteration onwards nr_jited_ksyms == 1, which asks the kernel to
fill bpf_prog_get_info_attr.info.jited_ksyms while it is NULL.

This fixes strace's test program:
diff --git i/tests/bpf-obj_get_info_by_fd.c w/tests/bpf-obj_get_info_by_fd.c
index e95afda27b3d..953083720822 100644
--- i/tests/bpf-obj_get_info_by_fd.c
+++ w/tests/bpf-obj_get_info_by_fd.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -341,6 +342,7 @@ main(void)
size_t old_prog_info_len = PROG_INFO_SZ;
 
for (unsigned int i = 0; i < 4; i++) {
+   memset(prog_info + 1, 0, PROG_INFO_SZ - sizeof(*prog_info));
prog_info->jited_prog_len = 0;
switch (i) {
case 1:

After the above change:
[root@arch-fb-vm1 tests]# ./bpf-obj_get_info_by_fd-prog-v
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=8, 
max_entries=1, map_flags=0, inner_map_fd=0, map_name="test_map", 
map_ifindex=0}, 48) = 3
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=9, 
insns=[{code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_1, src_reg=BPF_REG_0, 
off=0, imm=0}, {code=BPF_STX|BPF_W|BPF_MEM, dst_reg=BPF_REG_10, 
src_reg=BPF_REG_1, off=-4, imm=0}, {code=BPF_ALU64|BPF_X|BPF_MOV, 
dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0}, 
{code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, 
imm=0xfffc}, {code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, 
src_reg=BPF_REG_1, off=0, imm=0x3}, {code=BPF_LD|BPF_W|BPF_IMM, 
dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, 
{code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, 
imm=0x1}, {code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, 
off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_EXIT, dst_reg=BPF_REG_0, 
src_reg=BPF_REG_0, off=0, imm=0}], license="BSD", log_level=42, log_size=4096, 
log_buf="", kern_version=KERNEL_VERSION(57005, 192, 222), prog_flags=0, 
prog_name="test_prog", prog_ifindex=0, 
expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = 4
bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=3, info_len=128 
=> 80, info={type=BPF_MAP_TYPE_ARRAY, id=13, key_size=4, value_size=8, 
max_entries=1, map_flags=0, name="test_map", ifindex=0, netns_dev=makedev(0, 
0), netns_ino=0}}}, 16) = 0
bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, 
info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, 
tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, 
jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=NULL, 
load_time=1360454956729, created_by_uid=0, nr_map_ids=0 => 1, map_ids=NULL, 
name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, 
info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, 
tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0, 
jited_prog_insns=NULL, xlated_prog_len=336, xlated_prog_insns=0x7fda42a02000, 
load_time=1360454956729, created_by_uid=0, nr_map_ids=2, 
map_ids=0x7fda429f5000, name="test_prog", ifindex=0, netns_dev=makedev(0, 0), 
netns_ino=0}}}, 16) = -1 EFAULT (Bad address)
bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, 
info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, 
tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, 
jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=[], 
load_time=1360454956729, created_by_uid=0, nr_map_ids=0 => 1, map_ids=[], 
name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, 
info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, 
tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, 
jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=[], 
load_time=1360454956729, created_by_uid=0, nr_map_ids=2 => 1, map_ids=[13], 
name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0
+++ exited with 0 +++

> 
> Looking into the kernel commit, it seems that the user space visible
> uapi change is intentional, even though it might break existing user
> space.
> 
> To reproduce:
> 
> git clone

Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules

2018-12-13 Thread Martin Lau
On Thu, Dec 13, 2018 at 11:38:51AM -0800, Matt Mullins wrote:
> On Thu, 2018-12-13 at 19:22 +0000, Martin Lau wrote:
> > On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote:
> > > Distributions build drivers as modules, including network and filesystem
> > > drivers which export numerous tracepoints.  This enables
> > > bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
> > >
Acked-by: Martin KaFai Lau 

[ ... ]

> > > +#ifdef CONFIG_MODULES
> > > +int bpf_event_notify(struct notifier_block *nb, unsigned long op, void 
> > > *module)
> > > +{
> > > + struct bpf_trace_module *btm, *tmp;
> > > + struct module *mod = module;
> > > +
> > > + if (mod->num_bpf_raw_events == 0 ||
> > > + (op != MODULE_STATE_COMING && op != MODULE_STATE_GOING))
> > > + return 0;
> > > +
> > > + mutex_lock(&bpf_module_mutex);
> > > +
> > > + switch (op) {
> > > + case MODULE_STATE_COMING:
> > > + btm = kzalloc(sizeof(*btm), GFP_KERNEL);
> > > + if (btm) {
> > > + btm->module = module;
> > > + list_add(&btm->list, &bpf_trace_modules);
> > > + }
> > 
> > Is it fine to return 0 on !btm case?
> 
> That effectively just means we'll be ignoring tracepoints for a module
> that is loaded while we can't allocate a bpf_trace_module (24 bytes) to
> track it.  That feels like reasonable behavior to me.
ok.


Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules

2018-12-13 Thread Martin Lau
On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote:
> Distributions build drivers as modules, including network and filesystem
> drivers which export numerous tracepoints.  This enables
> bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
> 
> Signed-off-by: Matt Mullins 
> ---
> v1->v2:
>   * avoid taking the mutex in bpf_event_notify when op is neither COMING nor
> GOING.
>   * check that kzalloc actually succeeded
> 
> I didn't try to check list_empty before taking the mutex since I want to avoid
> races between bpf_event_notify and bpf_get_raw_tracepoint.  Additionally,
> list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, 
> but
> Alexei suggested I use it to protect against fragility if the subsequent 
> break;
> eventually disappears.
> 
>  include/linux/module.h   |  4 ++
>  include/linux/trace_events.h |  8 ++-
>  kernel/bpf/syscall.c | 11 ++--
>  kernel/module.c  |  5 ++
>  kernel/trace/bpf_trace.c | 99 +++-
>  5 files changed, 120 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fce6b4335e36..5f147dd5e709 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -432,6 +432,10 @@ struct module {
>   unsigned int num_tracepoints;
>   tracepoint_ptr_t *tracepoints_ptrs;
>  #endif
> +#ifdef CONFIG_BPF_EVENTS
> + unsigned int num_bpf_raw_events;
> + struct bpf_raw_event_map *bpf_raw_events;
> +#endif
>  #ifdef HAVE_JUMP_LABEL
>   struct jump_entry *jump_entries;
>   unsigned int num_jump_entries;
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 4130a5497d40..8a62731673f7 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event);
>  int perf_event_query_prog_array(struct perf_event *event, void __user *info);
>  int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
>  int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog 
> *prog);
> -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
> +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
> +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
>  int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
>   u32 *fd_type, const char **buf,
>   u64 *probe_offset, u64 *probe_addr);
> @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct 
> bpf_raw_event_map *btp, struct bpf
>  {
>   return -EOPNOTSUPP;
>  }
> -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char 
> *name)
> +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char 
> *name)
>  {
>   return NULL;
>  }
> +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> +{
> +}
>  static inline int bpf_get_perf_event_info(const struct perf_event *event,
> u32 *prog_id, u32 *fd_type,
> const char **buf, u64 *probe_offset,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 70fb11106fc2..754370e3155e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode 
> *inode, struct file *filp)
>   bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
>   bpf_prog_put(raw_tp->prog);
>   }
> + bpf_put_raw_tracepoint(raw_tp->btp);
>   kfree(raw_tp);
>   return 0;
>  }
> @@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union 
> bpf_attr *attr)
>   return -EFAULT;
>   tp_name[sizeof(tp_name) - 1] = 0;
>  
> - btp = bpf_find_raw_tracepoint(tp_name);
> + btp = bpf_get_raw_tracepoint(tp_name);
>   if (!btp)
>   return -ENOENT;
>  
>   raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
> - if (!raw_tp)
> - return -ENOMEM;
> + if (!raw_tp) {
> + err = -ENOMEM;
> + goto out_put_btp;
> + }
>   raw_tp->btp = btp;
>  
>   prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
> @@ -1668,6 +1671,8 @@ static int bpf_raw_tracepoint_open(const union bpf_attr 
> *attr)
>   bpf_prog_put(prog);
>  out_free_tp:
>   kfree(raw_tp);
> +out_put_btp:
> + bpf_put_raw_tracepoint(btp);
>   return err;
>  }
>  
> diff --git a/kernel/module.c b/kernel/module.c
> index 49a405891587..06ec68f08387 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3093,6 +3093,11 @@ static int find_module_sections(struct module *mod, 
> struct load_info *info)
>sizeof(*mod->tracepoints_ptrs),
>&mod->num_tracepoints);
>  #endif
> +#ifdef CONFIG_BPF_EVENTS
> + mod->bpf_raw

Re: [PATCH v2 bpf-next 3/3] selftests/bpf: add btf annotations for cgroup_local_storage maps

2018-12-11 Thread Martin Lau
On Mon, Dec 10, 2018 at 03:43:02PM -0800, Roman Gushchin wrote:
> Add btf annotations to cgroup local storage maps (per-cpu and shared)
> in the network packet counting example.
Acked-by: Martin KaFai Lau 


Re: [PATCH v2 bpf-next 2/3] bpf: add bpffs pretty print for cgroup local storage maps

2018-12-11 Thread Martin Lau
On Mon, Dec 10, 2018 at 03:43:01PM -0800, Roman Gushchin wrote:
> Implement bpffs pretty printing for cgroup local storage maps
> (both shared and per-cpu).
> Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c):
> 
> Shared:
>   $ cat /sys/fs/bpf/map_2
>   # WARNING!! The output is for debug purpose only
>   # WARNING!! The output format will change
>   {4294968594,1}: {,1039896}
> 
> Per-cpu:
>   $ cat /sys/fs/bpf/map_1
>   # WARNING!! The output is for debug purpose only
>   # WARNING!! The output format will change
>   {4294968594,1}: {
>   cpu0: {0,0,0,0,0}
>   cpu1: {0,0,0,0,0}
>   cpu2: {1,104,0,0,0}
>   cpu3: {0,0,0,0,0}
>   }
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 1/3] bpf: pass struct btf pointer to the map_check_btf() callback

2018-12-08 Thread Martin Lau
On Fri, Dec 07, 2018 at 04:53:13PM -0800, Roman Gushchin wrote:
> If key_type or value_type are of non-trivial data types
> (e.g. structure or typedef), it's not possible to check them without
> the additional information, which can't be obtained without a pointer
> to the btf structure.
> 
> So, let's pass btf pointer to the map_check_btf() callbacks.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 2/3] bpf: add bpffs pretty print for cgroup local storage maps

2018-12-08 Thread Martin Lau
On Fri, Dec 07, 2018 at 04:53:14PM -0800, Roman Gushchin wrote:
> Implement bpffs pretty printing for cgroup local storage maps
> (both shared and per-cpu).
> Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c):
> 
> Shared:
>   $ cat /sys/fs/bpf/map_2
>   # WARNING!! The output is for debug purpose only
>   # WARNING!! The output format will change
>   {4294968594,1}: {,1039896}
> 
> Per-cpu:
>   $ cat /sys/fs/bpf/map_1
>   # WARNING!! The output is for debug purpose only
>   # WARNING!! The output format will change
>   {4294968594,1}: {
>   cpu0: {0,0,0,0,0}
>   cpu1: {0,0,0,0,0}
>   cpu2: {1,104,0,0,0}
>   cpu3: {0,0,0,0,0}
>   }
> 
> Signed-off-by: Roman Gushchin 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
>  include/linux/btf.h| 10 +
>  kernel/bpf/local_storage.c | 90 +-
>  2 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 8c2199b5d250..ac67bc4cbfd9 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -5,6 +5,7 @@
>  #define _LINUX_BTF_H 1
>  
>  #include 
> +#include 
>  
>  struct btf;
>  struct btf_type;
> @@ -63,4 +64,13 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
>  }
>  #endif
>  
> +static inline const struct btf_type *btf_orig_type(const struct btf *btf,
> +const struct btf_type *t)
> +{
> + while (t && BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF)
> + t = btf_type_by_id(btf, t->type);
Only typedef is allowed?  Not even const?
If that is not the case, can btf_type_id_size() be reused?
The "type following" has already been done once and then remembered
in the verification time.

If cgroup_storage_check_btf() can only allow typedef, please
move "btf_orig_type()" to the local_storage.c.

> +
> + return t;
> +}
> +
>  #endif
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index b65017dead44..7b51fe1aba3c 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -1,11 +1,13 @@
>  //SPDX-License-Identifier: GPL-2.0
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
> DEFINE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]);
>  
> @@ -308,6 +310,91 @@ static int cgroup_storage_delete_elem(struct bpf_map *map, void *key)
>   return -EINVAL;
>  }
>  
> +static int cgroup_storage_check_btf(const struct bpf_map *map,
> + const struct btf *btf,
> + const struct btf_type *key_type,
Actually, in "map_check_btf()" (just before cgroup_storage_check_btf()
is called), btf_type_id_size() has already been called
to get the true size and the resolved type (i.e. BTF_INFO_KIND_STRUCT here)
in order to reject "key_size != map->key_size".  Hence, the key_type
passed to cgroup_storage_check_btf() here will not be KIND_TYPEDEF or
KIND_CONST.

> + const struct btf_type *value_type)
> +{
> + const struct btf_type *st, *t;
> + struct btf_member *m;
> +
> + /* Key is expected to be of struct bpf_cgroup_storage_key type,
> +  * which is:
> +  * struct bpf_cgroup_storage_key {
> +  *  __u64   cgroup_inode_id;
> +  *  __u32   attach_type;
> +  * };
> +  */
> +
> + /*
> +  * Key_type must be a structure (or a typedef of a structure) with
> +  * two members.
> +  */
> + st = btf_orig_type(btf, key_type);
> + if (BTF_INFO_KIND(st->info) != BTF_KIND_STRUCT ||
> + BTF_INFO_VLEN(st->info) != 2)
> + return -EINVAL;
> +
> + /*
> +  * The first field must be a 64 bit integer at 0 offset.
> +  */
> + m = (struct btf_member *)(st + 1);
> + t = btf_orig_type(btf, btf_type_by_id(btf, m->type));
> + if (!t || BTF_INFO_KIND(t->info) != BTF_KIND_INT || m->offset ||
> + t->size !=
> + FIELD_SIZEOF(struct bpf_cgroup_storage_key, cgroup_inode_id))
Instead of t->size,
BTF_INT_BITS() and BTF_INT_OFFSET() need to be checked (please refer to the
key_type check in array_map_check_btf()).

I think exposing "btf_type_int_is_regular()" from btf.c will be easier here.
Actually, add "btf_type_is_reg_int(t, expected_size)" to btf.h and btf.c,
like (uncompiled and untested code):

/*
 * Not a bit field and it must be the expected size.
 */
bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size)
{
u8 nr_bits, nr_bytes;
u32 int_data;

if (!btf_type_is_int(t))
return false;

int_data = btf_type_int(t);
nr_bits = BTF_INT_BITS(int_data);
nr_bytes = BITS_ROUNDUP_BYTES(nr_bits);
if (BITS_PER_BYTE_MASKED(nr_bits) ||
BTF_INT_OFFSET(int_data) ||
nr_bytes != expected_size)
return false;

return true;
}

RE: [PATCH RESEND] tcp_estats: ebpf hacks

2016-02-03 Thread Martin Lau
Please ignore. It is a mistake...

From: Martin KaFai Lau [ka...@fb.com]
Sent: Wednesday, February 03, 2016 9:39 PM
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar; Masami Hiramatsu; Steven Rostedt; Alexei Starovoitov; Josef 
Bacik; Kernel Team
Subject: [PATCH RESEND] tcp_estats: ebpf hacks

Signed-off-by: Martin KaFai Lau 
---
 kernel/trace/bpf_trace.c |  20 ++
 samples/Makefile |   2 +-
 samples/bpf/Makefile |  11 +-
 samples/bpf/bpf_helpers.h|   4 +
 samples/bpf/bpf_load.c   |  44 +++--
 samples/bpf/tcp_trace.h  |  51 +
 samples/bpf/tcp_trace_kern.c | 454 +++
 samples/bpf/tcp_trace_user.c | 115 +++
 tools/net/Makefile   |   6 +-
 9 files changed, 689 insertions(+), 18 deletions(-)
 create mode 100644 samples/bpf/tcp_trace.h
 create mode 100644 samples/bpf/tcp_trace_kern.c
 create mode 100644 samples/bpf/tcp_trace_user.c

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 47febbe..977702e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -68,6 +68,7 @@ static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
void *unsafe_ptr = (void *) (long) r3;

return probe_kernel_read(dst, unsafe_ptr, size);
+   /* return __bpf_probe_read_hack(dst, unsafe_ptr, size); */
 }

 static const struct bpf_func_proto bpf_probe_read_proto = {
@@ -79,6 +80,25 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type  = ARG_ANYTHING,
 };

+static u64 bpf_probe_read_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   u32 *dst = (u32 *) (long) r1;
+   int size = (int) r2;
+   u32 *unsafe_ptr = (void *) (long) r3;
+
+   *dst = *unsafe_ptr;
+   return probe_kernel_read(dst, unsafe_ptr, size);
+}
+
+static const struct bpf_func_proto bpf_probe_read_u32_proto = {
+   .func   = bpf_probe_read,
+   .gpl_only   = true,
+   .ret_type   = RET_VOID,
+   .arg1_type  = ARG_PTR_TO_STACK,
+   .arg2_type  = ARG_CONST_STACK_SIZE,
+   .arg3_type  = ARG_ANYTHING,
+};
+
 /*
  * limited trace_printk()
  * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed
diff --git a/samples/Makefile b/samples/Makefile
index f00257b..fb87be5 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code

 obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
-  hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+  hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ bpf/
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 97e5243..02885ae 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += trace_output
 hostprogs-y += lathist
+hostprogs-y += tcp_trace

 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -28,6 +29,7 @@ tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
 tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
 trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
 lathist-objs := bpf_load.o libbpf.o lathist_user.o
+tcp_trace-objs := bpf_load.o libbpf.o tcp_trace_user.o

 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -42,6 +44,7 @@ always += tracex5_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
 always += lathist_kern.o
+always += tcp_trace_kern.o

 HOSTCFLAGS += -I$(objtree)/usr/include

@@ -56,14 +59,16 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
+HOSTLOADLIBES_tcp_trace += -lelf

 # point this to your LLVM backend with bpf support
-LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+LLC=/home/kafai/local/llvm-git-master/bin/llc
+CLANG=/home/kafai/local/llvm-git-master/bin/clang

 $(obj)/%.o: $(src)/%.c
-   clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+   $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
-   clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+   $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index e84dd3c..df3f00e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -33,6 +33,10 @@ static int (*bpf_get_current_comm)(void *buf, int buf_size) =
(void *) BPF_FUNC_get_current_comm;
 static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void 
*data, int size) 

Re: [PATCH] ring-buffer: Fix polling on trace_pipe

2014-07-15 Thread Martin Lau
Yes, I think it should go to both stable and 3.16.

Thanks,
--Martin

On Tue, Jul 15, 2014 at 03:46:12PM -0400, Steven Rostedt wrote:
> On Thu, 10 Jul 2014 15:20:30 -0700
> Martin Lau  wrote:
> 
> > Hi Steve,
> > 
> > Do you have a chance to give it another try?
> > 
> 
> OK, I finally got around to testing it. Yep I see your point. Do you
> think it should go to 3.16 and stable? I can add it to my next git pull
> request.
> 
> Also, I modified your test such that it wouldn't hang on failure, but
> detects that it hung and returns a exit value to add this to my
> testing. That is, it exits with zero on success and non zero on failure.
> 
> -- Steve

> #include <assert.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/epoll.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> static const char * debugfs_list[] = {
>   "/debug/tracing",
>   "/sys/kernel/debug/tracing",
>   "/d/tracing",
>   NULL,
> };
> 
> static const char *debugfs;
> static int markfd;
> static int trace_pipe_fd;
> 
> static void alog(const char *name, int ret)
> {
>   printf("%d: %s: %d\n", getpid(), name, ret);
> }
> 
> static const char *find_debugfs(void)
> {
>   struct stat st;
>   int i;
>   int r;
> 
>   for (i = 0; debugfs_list[i]; i++) {
>   r = stat(debugfs_list[i], &st);
>   if (r < 0)
>   continue;
>   if (S_ISDIR(st.st_mode))
>   return debugfs_list[i];
>   }
>   return NULL;
> }
> 
> static char * make_path(const char *file)
> {
>   char *path;
>   int size;
> 
>   size = strlen(debugfs) + strlen(file) + 2;
>   path = malloc(size);
>   if (!path) {
>   perror("malloc");
>   exit(-1);
>   }
>   sprintf(path, "%s/%s", debugfs, file);
>   return path;
> }
> 
> static void mark_write(const char *str)
> {
>   int ret;
>   ret = write(markfd, str, strlen(str));
>   alog("write(markfd)", ret);
> }
> 
> static void read_trace_pipe(void)
> {
>   char buf[1024];
>   int r;
> 
>   while ((r = read(trace_pipe_fd, buf, 1024)) > 0)
>   printf("%.*s", r, buf);
> }
> 
> int main (int argc, char **argv)
> {
>   struct epoll_event ee;
>   char *marker;
>   char *pipe;
>   int efd;
>   int ret;
>   pid_t dwrt_pid;
> 
>   debugfs = find_debugfs();
>   if (!debugfs) {
>   fprintf(stderr, "Could not find debugfs\n");
>   exit(-1);
>   }
> 
>   marker = make_path("trace_marker");
>   pipe = make_path("trace_pipe");
> 
>   markfd = open(marker, O_WRONLY);
>   if (markfd < 0) {
>   perror("marker");
>   exit(-1);
>   }
>   trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK);
>   if (trace_pipe_fd < 0) {
>   perror("trace_pipe");
>   exit(-1);
>   }
> 
>   efd = epoll_create(1);
>   if (efd < 0) {
>   perror("epoll_create");
>   exit(-1);
>   }
> 
>   mark_write("some data");
>   memset(&ee, 0, sizeof(ee));
>   ee.events = EPOLLIN;
>   ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee);
>   if (ret < 0) {
>   perror("epoll_ctl");
>   exit(-1);
>   }
>   alog("waiting data..", 0);
>   ret = epoll_wait(efd, &ee, 1, -1);
>   alog("epoll_wait()", ret);
>   read_trace_pipe();
>   dwrt_pid = fork();
>   assert(dwrt_pid != -1);
>   if (dwrt_pid > 0) {
>   int status;
>   sleep(10);
>   mark_write("more data");
>   sleep(10);
>   ret = waitpid(dwrt_pid, &status, WNOHANG);
>   if (ret != dwrt_pid) {
>   alog("Poll never finished!", 0);
>   kill(dwrt_pid, 9);
>   exit(-1);
>   }
>   if (WEXITSTATUS(status) != 0) {
>   alog("Something failed on polling!",
>WEXITSTATUS(status));
>   exit(-1);
>   }
>   } else {
>   alog("waiting for more data..", 0);
>   ret = epoll_wait(efd, &ee, 1, -1);
>   alog("epoll_wait()", ret);
>   read_trace_pipe();
>   if (ret < 0)
>   exit(errno);
>   }
>   exit (0);
> }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ring-buffer: Fix polling on trace_pipe

2014-07-10 Thread Martin Lau
Hi Steve,

Do you have a chance to give it another try?

Thanks,
Martin

On Thu, Jun 26, 2014 at 11:52:44PM -0700, Martin Lau wrote:
> Hi Steve,
> 
> I retried the test program (with the kernel patch).  It does block from
> time to time.  I spotted that ee.events was not set to EPOLLIN before
> calling epoll_ctl(efd, EPOLL_CTL_ADD, ...).  I fixed it and ran
> the test in a bash loop.
> 
> I have the kafai-2 version attached (with some more logging in case
> if it still blocks).
> 
> Can you retry?
> 
> Thanks,
> Martin
> 
> On Thu, Jun 26, 2014 at 08:53:27PM -0400, Steven Rostedt wrote:
> > On Thu, 26 Jun 2014 11:34:46 -0700
> > Martin Lau  wrote:
> > 
> > > Hi Steve,
> > > 
> > > Can the modified test program reproduce the problem in your test setup?
> > 
> > Ah sorry, got distracted by other work.
> > 
> > OK, I just tried it out, and here's my results:
> > 
> > I ran your code with my current kernel and this is what I got:
> > 
> > # ./ftrace-test-epoll-kafai 
> ><...>-3183  [002] ...181.777891: tracing_mark_write:
> > some data 3183: waitting for more data..
> > 3184: written more data
> > 
> > And it just hung there.
> > 
> > 
> > Then I applied your patch, compiled and booted it, and ran the test
> > again, and I got this:
> > 
> > # ./ftrace-test-epoll-kafai
> > 
> > It just hung there. No forward progress.
> > 
> > I don't think that was the result you intended.
> > 
> > -- Steve
> > 
> > 

> #include <assert.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/epoll.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> static const char * debugfs_list[] = {
>   "/debug/tracing",
>   "/sys/kernel/debug/tracing",
>   "/d/tracing",
>   NULL,
> };
> 
> static const char *debugfs;
> static int markfd;
> static int trace_pipe_fd;
> 
> static void alog(const char *name, int ret)
> {
>   printf("%d: %s: %d\n", getpid(), name, ret);
> }
> 
> static const char *find_debugfs(void)
> {
>   struct stat st;
>   int i;
>   int r;
> 
>   for (i = 0; debugfs_list[i]; i++) {
>   r = stat(debugfs_list[i], &st);
>   if (r < 0)
>   continue;
>   if (S_ISDIR(st.st_mode))
>   return debugfs_list[i];
>   }
>   return NULL;
> }
> 
> static char * make_path(const char *file)
> {
>   char *path;
>   int size;
> 
>   size = strlen(debugfs) + strlen(file) + 2;
>   path = malloc(size);
>   if (!path) {
>   perror("malloc");
>   exit(-1);
>   }
>   sprintf(path, "%s/%s", debugfs, file);
>   return path;
> }
> 
> static void mark_write(const char *str)
> {
>   int ret;
>   ret = write(markfd, str, strlen(str));
>   alog("write(markfd)", ret);
> }
> 
> static void read_trace_pipe(void)
> {
>   char buf[1024];
>   int r;
> 
>   while ((r = read(trace_pipe_fd, buf, 1024)) > 0)
>   printf("%.*s", r, buf);
> }
> 
> int main (int argc, char **argv)
> {
>   struct epoll_event ee;
>   char *marker;
>   char *pipe;
>   int efd;
>   int ret;
>   pid_t dwrt_pid;
> 
>   debugfs = find_debugfs();
>   if (!debugfs) {
>   fprintf(stderr, "Could not find debugfs\n");
>   exit(-1);
>   }
> 
>   marker = make_path("trace_marker");
>   pipe = make_path("trace_pipe");
> 
>   markfd = open(marker, O_WRONLY);
>   if (markfd < 0) {
>   perror("marker");
>   exit(-1);
>   }
>   trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK);
>   if (trace_pipe_fd < 0) {
>   perror("trace_pipe");
>   exit(-1);
>   }
> 
>   efd = epoll_create(1);
>   if (efd < 0) {
>   perror("epoll_create");
>   exit(-1);
>   }
> 
>   mark_write("some data");
>   memset(&ee, 0, sizeof(ee));
>   ee.events = EPOLLIN;
>   ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee);
>   if (ret < 0) {
>   perror("epoll_ctl");
>   exit(-1);
>   }
>   alog("waiting data..", 0);
>   ret = epoll_wait(efd, &ee, 1, -1);
>   alog("epoll_wait()", ret);
>   read_trace_pipe();
>   dwrt_pid = fork();
>   assert(dwrt_pid != -1);
>   if (dwrt_pid == 0) {
>   sleep(10);
>   mark_write("more data");
>   } else {
>   alog("waiting for more data..", 0);
>   ret = epoll_wait(efd, &ee, 1, -1);
>   alog("epoll_wait()", ret);
>   read_trace_pipe();
>   wait(NULL);
>   }
>   return 0;
> }



Re: [PATCH] ring-buffer: Fix polling on trace_pipe

2014-06-26 Thread Martin Lau
Hi Steve,

I retried the test program (with the kernel patch).  It does block from
time to time.  I spotted that ee.events was not set to EPOLLIN before
calling epoll_ctl(efd, EPOLL_CTL_ADD, ...).  I fixed it and ran
the test in a bash loop.

I have the kafai-2 version attached (with some more logging in case
if it still blocks).

Can you retry?

Thanks,
Martin

On Thu, Jun 26, 2014 at 08:53:27PM -0400, Steven Rostedt wrote:
> On Thu, 26 Jun 2014 11:34:46 -0700
> Martin Lau  wrote:
> 
> > Hi Steve,
> > 
> > Can the modified test program reproduce the problem in your test setup?
> 
> Ah sorry, got distracted by other work.
> 
> OK, I just tried it out, and here's my results:
> 
> I ran your code with my current kernel and this is what I got:
> 
> # ./ftrace-test-epoll-kafai 
><...>-3183  [002] ...181.777891: tracing_mark_write:
> some data 3183: waitting for more data..
> 3184: written more data
> 
> And it just hung there.
> 
> 
> Then I applied your patch, compiled and booted it, and ran the test
> again, and I got this:
> 
> # ./ftrace-test-epoll-kafai
> 
> It just hung there. No forward progress.
> 
> I don't think that was the result you intended.
> 
> -- Steve
> 
> 
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static const char * debugfs_list[] = {
"/debug/tracing",
"/sys/kernel/debug/tracing",
"/d/tracing",
NULL,
};

static const char *debugfs;
static int markfd;
static int trace_pipe_fd;

static void alog(const char *name, int ret)
{
printf("%d: %s: %d\n", getpid(), name, ret);
}

static const char *find_debugfs(void)
{
struct stat st;
int i;
int r;

for (i = 0; debugfs_list[i]; i++) {
r = stat(debugfs_list[i], &st);
if (r < 0)
continue;
if (S_ISDIR(st.st_mode))
return debugfs_list[i];
}
return NULL;
}

static char * make_path(const char *file)
{
char *path;
int size;

size = strlen(debugfs) + strlen(file) + 2;
path = malloc(size);
if (!path) {
perror("malloc");
exit(-1);
}
sprintf(path, "%s/%s", debugfs, file);
return path;
}

static void mark_write(const char *str)
{
int ret;
ret = write(markfd, str, strlen(str));
alog("write(markfd)", ret);
}

static void read_trace_pipe(void)
{
char buf[1024];
int r;

while ((r = read(trace_pipe_fd, buf, 1024)) > 0)
printf("%.*s", r, buf);
}

int main (int argc, char **argv)
{
struct epoll_event ee;
char *marker;
char *pipe;
int efd;
int ret;
pid_t dwrt_pid;

debugfs = find_debugfs();
if (!debugfs) {
fprintf(stderr, "Could not find debugfs\n");
exit(-1);
}

marker = make_path("trace_marker");
pipe = make_path("trace_pipe");

markfd = open(marker, O_WRONLY);
if (markfd < 0) {
perror("marker");
exit(-1);
}
trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK);
if (trace_pipe_fd < 0) {
perror("trace_pipe");
exit(-1);
}

efd = epoll_create(1);
if (efd < 0) {
perror("epoll_create");
exit(-1);
}

mark_write("some data");
memset(&ee, 0, sizeof(ee));
ee.events = EPOLLIN;
ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee);
if (ret < 0) {
perror("epoll_ctl");
exit(-1);
}
alog("waiting data..", 0);
ret = epoll_wait(efd, &ee, 1, -1);
alog("epoll_wait()", ret);
read_trace_pipe();
dwrt_pid = fork();
assert(dwrt_pid != -1);
if (dwrt_pid == 0) {
sleep(10);
mark_write("more data");
} else {
alog("waiting for more data..", 0);
ret = epoll_wait(efd, &ee, 1, -1);
alog("epoll_wait()", ret);
read_trace_pipe();
wait(NULL);
}
return 0;
}


Re: [PATCH] ring-buffer: Fix polling on trace_pipe

2014-06-26 Thread Martin Lau
Hi Steve,

Can the modified test program reproduce the problem in your test setup?

Thanks,
--Martin

On Tue, Jun 10, 2014 at 10:58:14PM -0700, Martin Lau wrote:
> Hi Steve,
> 
> Attached is the modified test program.  Here is the sample output:
> 
> localhost ~ # ./ftrace-test-epoll-kafai
><...>-1857  [000] ...1   720.174295: tracing_mark_write: some data
> 1857: waitting for more data..
> 1858: written more data
> 
> 
> Thanks,
> --Martin
> 
> On Tue, Jun 10, 2014 at 11:49:15AM -0400, Steven Rostedt wrote:
> > On Mon, 9 Jun 2014 23:06:42 -0700
> > Martin Lau  wrote:
> > 
> > > ring_buffer_poll_wait() should always put the poll_table to its wait_queue
> > > even there is immediate data available.  Otherwise, the following epoll 
> > > and
> > > read sequence will eventually hang forever:
> > > 
> > > 1. Put some data to make the trace_pipe ring_buffer read ready first
> > > 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
> > > 3. epoll_wait()
> > > 4. read(trace_pipe_fd) till EAGAIN
> > > 5. Add some more data to the trace_pipe ring_buffer
> > > 6. epoll_wait() -> this epoll_wait() will block forever
> > > 
> > > ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
> > >   ring_buffer_poll_wait() returns immediately without adding poll_table,
> > >   which has poll_table->_qproc pointing to ep_poll_callback(), to its
> > >   wait_queue.
> > > ~ During the epoll_wait() call in step 3 and step 6,
> > >   ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
> > >   because the poll_table->_qproc is NULL and it is how epoll works.
> > > ~ When there is new data available in step 6, ring_buffer does not know
> > >   it has to call ep_poll_callback() because it is not in its wait queue.
> > >   Hence, block forever.
> > > 
> > > Other poll implementations seem to call poll_wait() unconditionally as the very
> > > first thing to do.  For example, tcp_poll() in tcp.c.
> > 
> > I'm trying to see the effect of this bug, but can't seem to reproduce
> > it. Maybe I did something wrong. Attached is a test program I wrote
> > trying to follow your instructions. I don't use epoll, so perhaps I
> > didn't use it correctly.
> > 
> > Can you modify it to show me the problem this is trying to fix. That
> > is, without this patch it hangs, but with the patch it does not.
> > 
> > Thanks!
> > 
> > -- Steve
> > 
> > > 
> > > Signed-off-by: Martin Lau 
> > > ---
> > >  kernel/trace/ring_buffer.c | 4 
> > >  1 file changed, 4 deletions(-)
> > > 
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index fd12cc5..a6e64e8 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer 
> > > *buffer, int cpu,
> > >   struct ring_buffer_per_cpu *cpu_buffer;
> > >   struct rb_irq_work *work;
> > >  
> > > - if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
> > > - (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, 
> > > cpu)))
> > > - return POLLIN | POLLRDNORM;
> > > -
> > >   if (cpu == RING_BUFFER_ALL_CPUS)
> > >   work = &buffer->irq_work;
> > >   else {
> > 
> 

> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/epoll.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> static const char * debugfs_list[] = {
>   "/debug/tracing",
>   "/sys/kernel/debug/tracing",
>   "/d/tracing",
>   NULL,
> };
> 
> static const char *debugfs;
> static int markfd;
> static int trace_pipe_fd;
> 
> static const char *find_debugfs(void)
> {
>   struct stat st;
>   int i;
>   int r;
> 
>   for (i = 0; debugfs_list[i]; i++) {
>   r = stat(debugfs_list[i], &st);
>   if (r < 0)
>   continue;
>   if (S_ISDIR(st.st_mode))
>   return debugfs_list[i];
>   }
>   return NULL;
> }
> 
> static char * make_path(const char *file)
> {
>   char *path;
>   int size;
> 
>   size = strlen(debugfs) + strlen(file) + 2;
>   path = malloc(size);
>   if (!path) {
>   perror("malloc");

Re: [PATCH] ring-buffer: Fix polling on trace_pipe

2014-06-10 Thread Martin Lau
Hi Steve,

Attached is the modified test program.  Here is the sample output:

localhost ~ # ./ftrace-test-epoll-kafai
   <...>-1857  [000] ...1   720.174295: tracing_mark_write: some data
1857: waitting for more data..
1858: written more data


Thanks,
--Martin

On Tue, Jun 10, 2014 at 11:49:15AM -0400, Steven Rostedt wrote:
> On Mon, 9 Jun 2014 23:06:42 -0700
> Martin Lau  wrote:
> 
> > ring_buffer_poll_wait() should always put the poll_table to its wait_queue
> > even there is immediate data available.  Otherwise, the following epoll and
> > read sequence will eventually hang forever:
> > 
> > 1. Put some data to make the trace_pipe ring_buffer read ready first
> > 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
> > 3. epoll_wait()
> > 4. read(trace_pipe_fd) till EAGAIN
> > 5. Add some more data to the trace_pipe ring_buffer
> > 6. epoll_wait() -> this epoll_wait() will block forever
> > 
> > ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
> >   ring_buffer_poll_wait() returns immediately without adding poll_table,
> >   which has poll_table->_qproc pointing to ep_poll_callback(), to its
> >   wait_queue.
> > ~ During the epoll_wait() call in step 3 and step 6,
> >   ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
> >   because the poll_table->_qproc is NULL and it is how epoll works.
> > ~ When there is new data available in step 6, ring_buffer does not know
> >   it has to call ep_poll_callback() because it is not in its wait queue.
> >   Hence, block forever.
> > 
> > Other poll implementations seem to call poll_wait() unconditionally as the very
> > first thing to do.  For example, tcp_poll() in tcp.c.
> 
> I'm trying to see the effect of this bug, but can't seem to reproduce
> it. Maybe I did something wrong. Attached is a test program I wrote
> trying to follow your instructions. I don't use epoll, so perhaps I
> didn't use it correctly.
> 
> Can you modify it to show me the problem this is trying to fix. That
> is, without this patch it hangs, but with the patch it does not.
> 
> Thanks!
> 
> -- Steve
> 
> > 
> > Signed-off-by: Martin Lau 
> > ---
> >  kernel/trace/ring_buffer.c | 4 
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index fd12cc5..a6e64e8 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, 
> > int cpu,
> > struct ring_buffer_per_cpu *cpu_buffer;
> > struct rb_irq_work *work;
> >  
> > -   if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
> > -   (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, 
> > cpu)))
> > -   return POLLIN | POLLRDNORM;
> > -
> > if (cpu == RING_BUFFER_ALL_CPUS)
> > work = &buffer->irq_work;
> > else {
> 

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static const char * debugfs_list[] = {
"/debug/tracing",
"/sys/kernel/debug/tracing",
"/d/tracing",
NULL,
};

static const char *debugfs;
static int markfd;
static int trace_pipe_fd;

static const char *find_debugfs(void)
{
struct stat st;
int i;
int r;

for (i = 0; debugfs_list[i]; i++) {
r = stat(debugfs_list[i], &st);
if (r < 0)
continue;
if (S_ISDIR(st.st_mode))
return debugfs_list[i];
}
return NULL;
}

static char * make_path(const char *file)
{
char *path;
int size;

size = strlen(debugfs) + strlen(file) + 2;
path = malloc(size);
if (!path) {
perror("malloc");
exit(-1);
}
sprintf(path, "%s/%s", debugfs, file);
return path;
}

static void mark_write(const char *str)
{
write(markfd, str, strlen(str));
}

static void read_trace_pipe(void)
{
char buf[1024];
int r;

while ((r = read(trace_pipe_fd, buf, 1024)) > 0)
printf("%.*s", r, buf);
}

int main (int argc, char **argv)
{
struct epoll_event ee;
char *marker;
char *pipe;
int efd;
int ret;
pid_t dwrt_pid;

debugfs = find_debugfs();
if (!debugfs) {
fprintf(stderr, "Could not find debugfs\n");
exit(-1);
}


[PATCH] ring-buffer: Fix polling on trace_pipe

2014-06-09 Thread Martin Lau
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even when there is immediate data available.  Otherwise, the following epoll and
read sequence will eventually hang forever:

1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever

~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
  ring_buffer_poll_wait() returns immediately without adding poll_table,
  which has poll_table->_qproc pointing to ep_poll_callback(), to its
  wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
  ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
  because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
  it has to call ep_poll_callback() because it is not in its wait queue.
  Hence, block forever.

Other poll implementations seem to call poll_wait() unconditionally as the very
first thing to do.  For example, tcp_poll() in tcp.c.

Signed-off-by: Martin Lau 
---
 kernel/trace/ring_buffer.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd12cc5..a6e64e8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
struct ring_buffer_per_cpu *cpu_buffer;
struct rb_irq_work *work;
 
-   if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
-   (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
-   return POLLIN | POLLRDNORM;
-
if (cpu == RING_BUFFER_ALL_CPUS)
work = &buffer->irq_work;
else {
-- 
1.8.1
