Re: [PATCH] net: bpf: type fixes for __be16/__be32
On Wed, Oct 16, 2019 at 12:16:35PM +0100, Ben Dooks (Codethink) wrote:
> In bpf_skb_load_helper_8_no_cache and
> bpf_skb_load_helper_16_no_cache they
> read a u16/u32 where actually these
> are __be16 and __be32.
>
> Fix the following by making the types
> match:
>
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:215:32: warning: cast to restricted __be16
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32
> net/core/filter.c:242:32: warning: cast to restricted __be32

Acked-by: Martin KaFai Lau
Re: [PATCH bpf-next] bpf: sk_storage: Fix out of bounds memory access
On Thu, Jun 13, 2019 at 10:15:38AM -0700, Andrii Nakryiko wrote:
> On Thu, Jun 13, 2019 at 8:16 AM Arthur Fabre wrote:
> >
> > bpf_sk_storage maps use multiple spin locks to reduce contention.
> > The number of locks to use is determined by the number of possible CPUs.
> > With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.

Thanks for report.

> >
> > When updating elements, the correct lock is determined with hash_ptr().
> > Calling hash_ptr() with 0 bits is undefined behavior, as it does:
> >
> > x >> (64 - bits)
> >
> > Using the value results in an out of bounds memory access.
> > In my case, this manifested itself as a page fault when raw_spin_lock_bh()
> > is called later, when running the self tests:
> >
> > ./tools/testing/selftests/bpf/test_verifier 773 775
> >
> > [ 16.366342] BUG: unable to handle page fault for address: 8fe7a66f93f8
> > [ 16.367139] #PF: supervisor write access in kernel mode
> > [ 16.367751] #PF: error_code(0x0002) - not-present page
> > [ 16.368323] PGD 35a01067 P4D 35a01067 PUD 0
> > [ 16.368796] Oops: 0002 [#1] SMP PTI
> > [ 16.369175] CPU: 0 PID: 189 Comm: test_verifier Not tainted 5.2.0-rc2+ #10
> > [ 16.369960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 16.371021] RIP: 0010:_raw_spin_lock_bh (/home/afabre/linux/./include/trace/events/initcall.h:48)
> > [ 16.371571] Code: 02 00 00 31 c0 ba ff 00 00 00 3e 0f b1 17 75 01 c3 e9 82 12 5f ff 66 90 65 81 05 ad 14 6f 41 00 02 00 00 31 c0 ba 01 00 00 00 <3e> 0f b1 17 75 01 c3 89 c6 e9 f0 02 5f ff b8 00 02 00 00 3e 0f c1
> >
> > All code
> > ===
> >    0: 02 00                 add    (%rax),%al
> >    2: 00 31                 add    %dh,(%rcx)
> >    4: c0 ba ff 00 00 00 3e  sarb   $0x3e,0xff(%rdx)
> >    b: 0f b1 17              cmpxchg %edx,(%rdi)
> >    e: 75 01                 jne    0x11
> >   10: c3                    retq
> >   11: e9 82 12 5f ff        jmpq   0xff5f1298
> >   16: 66 90                 xchg   %ax,%ax
> >   18: 65 81 05 ad 14 6f 41  addl   $0x200,%gs:0x416f14ad(%rip)  # 0x416f14d0
> >   1f: 00 02 00 00
> >   23: 31 c0                 xor    %eax,%eax
> >   25: ba 01 00 00 00        mov    $0x1,%edx
> >   2a: 3e 0f b1 17           cmpxchg %edx,%ds:*(%rdi)  <-- trapping instruction
> >   2e: 75 01                 jne    0x31
> >   30: c3                    retq
> >   31: 89 c6                 mov    %eax,%esi
> >   33: e9 f0 02 5f ff        jmpq   0xff5f0328
> >   38: b8 00 02 00 00        mov    $0x200,%eax
> >   3d: 3e                    ds
> >   3e: 0f                    .byte 0xf
> >   3f: c1                    .byte 0xc1
> >
> > Code starting with the faulting instruction
> > ===
> >    0: 3e 0f b1 17           cmpxchg %edx,%ds:(%rdi)
> >    4: 75 01                 jne    0x7
> >    6: c3                    retq
> >    7: 89 c6                 mov    %eax,%esi
> >    9: e9 f0 02 5f ff        jmpq   0xff5f02fe
> >    e: b8 00 02 00 00        mov    $0x200,%eax
> >   13: 3e                    ds
> >   14: 0f                    .byte 0xf
> >   15: c1                    .byte 0xc1
> >
> > [ 16.373398] RSP: 0018:a759809d3be0 EFLAGS: 00010246
> > [ 16.373954] RAX: RBX: 8fe7a66f93f0 RCX: 0040
> > [ 16.374645] RDX: 0001 RSI: 8fdaf9f0d180 RDI: 8fe7a66f93f8
> > [ 16.375338] RBP: 8fdaf9f0d180 R08: 8fdafba2c320 R09: 8fdaf9f0d0c0
> > [ 16.376028] R10: R11: R12: 8fdafa346700
> > [ 16.376719] R13: 8fe7a66f93f8 R14: 8fdaf9f0d0c0 R15: 0001
> > [ 16.377413] FS: 7fda724c0740() GS:8fdafba0() knlGS:
> > [ 16.378204] CS: 0010 DS: ES: CR0: 80050033
> > [ 16.378763] CR2: 8fe7a66f93f8 CR3: 000139d1c006 CR4: 00360ef0
> > [ 16.379453] DR0: DR1: DR2:
> > [ 16.380144] DR3: DR6: fffe0ff0 DR7: 0400
> > [ 16.380864] Call Trace:
> > [ 16.381112] selem_link_map (/home/afabre/linux/./include/linux/compiler.h:221 /home/afabre/linux/net/core/bpf_sk_storage.c:243)
> > [ 16.381476] sk_storage_update (/home/afabre/linux/net/core/bpf_sk_storage.c:355 /home/afabre/linux/net/core/bpf_sk_storage.c:414)
> > [ 16.381888] bpf_sk_storage_get (/home/afabre/linux/net/core/bpf_sk_storage.c:760 /home/afabre/linux/net/core/bpf_sk_storage.c:741)
> > [ 16.382285] ___bpf_prog_run (/home/afabre/linux/kernel/bpf/core.c:1447)
> > [ 16.382679] ? __bpf_prog_run32 (/home/afabre/linux/kernel/bpf/core.c:1603)
> > [ 16.383074] ? alloc_file_pseudo (/home/afa
Re: [PATCH bpf-next] bpf: allow CGROUP_SKB programs to use bpf_skb_cgroup_id() helper
On Thu, Jun 06, 2019 at 01:30:12PM -0700, Roman Gushchin wrote:
> Currently bpf_skb_cgroup_id() is not supported for CGROUP_SKB
> programs. An attempt to load such a program generates an error
> like this:
>
> libbpf:
> 0: (b7) r6 = 0
> ...
> 9: (85) call bpf_skb_cgroup_id#79
> unknown func bpf_skb_cgroup_id#79
>
> There are no particular reasons for denying it, and we have some
> use cases where it might be useful.
>
> So let's add it to the list of allowed helpers.

Acked-by: Martin KaFai Lau
Re: [PATCH][next] bpf: hbm: fix spelling mistake "notifcations" -> "notifications"
On Mon, Jun 03, 2019 at 02:36:53PM +0100, Colin King wrote:
> From: Colin Ian King
>
> There is a spelling mistake in the help information, fix this.
>
> Signed-off-by: Colin Ian King
> ---
> samples/bpf/hbm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
> index 480b7ad6a1f2..bdfce592207a 100644
> --- a/samples/bpf/hbm.c
> +++ b/samples/bpf/hbm.c
> @@ -411,7 +411,7 @@ static void Usage(void)
> "-l        also limit flows using loopback\n"
> "-n <#>    to create cgroup \"/hbm#\" and attach prog\n"
> "          Default is /hbm1\n"
> -"--no_cn   disable CN notifcations\n"
> +"--no_cn   disable CN notifications\n"

Acked-by: Martin KaFai Lau

This should go to bpf-next.
Re: [PATCH v3 net] ipv6: Fix dangling pointer when ipv6 fragment
On Tue, Apr 02, 2019 at 06:49:03PM +0800, kbuild test robot wrote:
> Hi hujunwei,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on net/master]
>
> url: https://github.com/0day-ci/linux/commits/hujunwei/ipv6-Fix-dangling-pointer-when-ipv6-fragment/20190402-175602
> config: i386-randconfig-x005-201913 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
>    net//ipv6/ip6_output.c: In function 'ip6_fragment':
> >> net//ipv6/ip6_output.c:609:27: warning: 'prevhdr' is used uninitialized in this function [-Wuninitialized]
>       nexthdr_offset = prevhdr - skb_network_header(skb);
>                                ^
>
> vim +/prevhdr +609 net//ipv6/ip6_output.c
>
>    594
>    595	int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
>    596			 int (*output)(struct net *, struct sock *, struct sk_buff *))
>    597	{
>    598		struct sk_buff *frag;
>    599		struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
>    600		struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
>    601					inet6_sk(skb->sk) : NULL;
>    602		struct ipv6hdr *tmp_hdr;
>    603		struct frag_hdr *fh;
>    604		unsigned int mtu, hlen, left, len, nexthdr_offset;
>    605		int hroom, troom;
>    606		__be32 frag_id;
>    607		int ptr, offset = 0, err = 0;
>    608		u8 *prevhdr, nexthdr = 0;
>  > 609		nexthdr_offset = prevhdr - skb_network_header(skb);

hmm... This line has been moved up since v2. :(

>    610
>    611		err = ip6_find_1stfragopt(skb, &prevhdr);
>    612		if (err < 0)
>    613			goto fail;
>    614		hlen = err;
>    615		nexthdr = *prevhdr;
>    616
>    617		mtu = ip6_skb_dst_mtu(skb);
>    618
>    619		/* We must not fragment if the socket is set to force MTU discovery
>    620		 * or if the skb it not generated by a local socket.
>    621		 */
>    622		if (unlikely(!skb->ignore_df && skb->len > mtu))
>    623			goto fail_toobig;
>    624
>    625		if (IP6CB(skb)->frag_max_size) {
>    626			if (IP6CB(skb)->frag_max_size > mtu)
>    627				goto fail_toobig;
>    628
>    629			/* don't send fragments larger than what we received */
>    630			mtu = IP6CB(skb)->frag_max_size;
>    631			if (mtu < IPV6_MIN_MTU)
>    632				mtu = IPV6_MIN_MTU;
>    633		}
>    634
>    635		if (np && np->frag_size < mtu) {
>    636			if (np->frag_size)
>    637				mtu = np->frag_size;
>    638		}
>    639		if (mtu < hlen + sizeof(struct frag_hdr) + 8)
>    640			goto fail_toobig;
>    641		mtu -= hlen + sizeof(struct frag_hdr);
>    642
>    643		frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
>    644					    &ipv6_hdr(skb)->saddr);
>    645
>    646		if (skb->ip_summed == CHECKSUM_PARTIAL &&
>    647		    (err = skb_checksum_help(skb)))
>    648			goto fail;
>    649
>    650		prevhdr = skb_network_header(skb) + nexthdr_offset;
>    651		hroom = LL_RESERVED_SPACE(rt->dst.dev);
>    652		if (skb_has_frag_list(skb)) {
>    653			unsigned int first_len = skb_pagelen(skb);
>    654			struct sk_buff *frag2;
>    655
>    656			if (first_len - hlen > mtu ||
>    657			    ((first_len - hlen) & 7) ||
>    658			    skb_cloned(skb) ||
>    659			    skb_headroom(skb) < (hroom + sizeof(struct frag_hdr)))
>    660				goto slow_path;
>    661
>    662			skb_walk_frags(skb, frag) {
>    663				/* Correct geometry. */
>    664				if (frag->len > mtu ||
>    665				    ((frag->len & 7) && frag->next) ||
>    666				    skb_headroom(frag) < (hlen + hroom + sizeof(struct frag_hdr)))
>    667					goto slow_path_clean;
>    668
>    669
Re: [PATCH] bpf: drop refcount if bpf_map_new_fd() fails in map_create()
On Wed, Feb 27, 2019 at 10:36:25PM +0800, zerons wrote:
> In bpf/syscall.c, map_create() first set map->usercnt to 1, a file
> descriptor is supposed to return to userspace. When bpf_map_new_fd()
> fails, drop the refcount.

Thanks for the patch. Please add a Fixes tag for bug fix in the future.

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")

Acked-by: Martin KaFai Lau
Re: [PATCH] bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()
On Tue, Feb 26, 2019 at 10:15:37PM +0800, zerons wrote:
> [ Upstream commit c91951f15978f1a0c6b65f063d30f7ea7bc6fb42 ]
>
> In bpf/syscall.c, bpf_map_get_fd_by_id() use bpf_map_inc_not_zero() to
> increase the refcount, both map->refcnt and map->usercnt. Then, if
> bpf_map_new_fd() fails, should handle map->usercnt too.

Good catch! Thanks for the fix.

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")

Acked-by: Martin KaFai Lau
Re: [PATCH bpf-next v1] bpf, lpm: fix lookup bug in map_delete_elem
On Thu, Feb 21, 2019 at 05:39:26PM +0100, Alban Crequy wrote:
> From: Alban Crequy
>
> trie_delete_elem() was deleting an entry even though it was not matching
> if the prefixlen was correct. This patch adds a check on matchlen.
>
> Reproducer:
>
> $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
> $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> key: 10 00 00 00 aa bb cc dd value: 01
> Found 1 element
> $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
> $ echo $?
> 0
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> Found 0 elements

The change makes sense to me.

Can you add this reproducer to tools/testing/selftests/bpf/test_lpm_map.c?

Bug fix should be for the "bpf" tree instead of "bpf-next". A Fixes tag is
also required, like

Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek

> Signed-off-by: Alban Crequy
> ---
> kernel/bpf/lpm_trie.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index abf1002080df..93a5cbbde421 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -471,6 +471,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
> 	}
>
> 	if (!node || node->prefixlen != key->prefixlen ||
> +	    node->prefixlen != matchlen ||
> 	    (node->flags & LPM_TREE_NODE_FLAG_IM)) {
> 		ret = -ENOENT;
> 		goto out;
> --
> 2.20.1
>
Re: [RFC] Loading BTF and pretty printing maps with bpftool
On Fri, Jan 25, 2019 at 11:20:57AM +0100, Arnaldo Carvalho de Melo wrote:
> # bpftool version
> bpftool v5.0.0-rc3
> #
>
> # bpftool prog | tail -6
> 309: tracepoint  name sys_enter  tag 819967866022f1e1  gpl
> 	loaded_at 2019-01-25T11:05:41+0100  uid 0
> 	xlated 528B  jited 381B  memlock 4096B  map_ids 200,199,198
> 310: tracepoint  name sys_exit  tag c1bd85c092d6e4aa  gpl
> 	loaded_at 2019-01-25T11:05:41+0100  uid 0
> 	xlated 256B  jited 191B  memlock 4096B  map_ids 200,199
> #
>
> And the maps:
>
> # bpftool map | tail -6
> 198: perf_event_array  name __augmented_sys  flags 0x0
> 	key 4B  value 4B  max_entries 8  memlock 4096B
> 199: array  name syscalls  flags 0x0
> 	key 4B  value 1B  max_entries 512  memlock 8192B
> 200: hash  name pids_filtered  flags 0x0
> 	key 4B  value 1B  max_entries 64  memlock 8192B
> #
>
> So, dumping the entries for those entries:
>
> [root@quaco ~]# egrep sleep /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c
> 	[35] = "nanosleep",
> 	[230] = "clock_nanosleep",
> [root@quaco ~]#
>
> Looking at just the open and nanosleep:
>
> # bpftool map dump id 199 | grep "value: 01"
> key: 23 00 00 00  value: 01
> key: e6 00 00 00  value: 01
> #
>
> # bpftool map lookup id 199 key 35
> Error: key expected 4 bytes got 1
> #
>
> # bpftool map lookup id 199 key 35 00 00 00
> key: 23 00 00 00  value: 01
> # bpftool map lookup id 199 key 230 00 00 00
> key: e6 00 00 00  value: 01
> #
>
> I thought it was that --pretty option, so I tried:
>
> # bpftool map --pretty lookup id 199 key 230 00 00 00
> {
>     "key": ["0xe6","0x00","0x00","0x00"
>     ],
>     "value": ["0x01"
>     ]
> }
> #

libbpf pr_warning()s on failing to load BTF, or on failing to create a map
after BTF has been loaded and btf_map_xxx can be found. Did you see any of
them? It seems it can load the BTF from your email.

It may be useful to set libbpf's __pr_debug, which should be NULL by
default iirc.
Re: [PATCH] bpf: Make function btf_name_offset_valid static
On Wed, Jan 16, 2019 at 08:29:40PM +0100, Mathieu Malaterre wrote:
> Initially in commit 69b693f0aefa ("bpf: btf: Introduce BPF Type Format
> (BTF)") the function 'btf_name_offset_valid' was introduced as static
> function it was later on changed to a non-static one, and then finally
> in commit c454a46b5efd ("bpf: Add bpf_line_info support") the function
> prototype was removed.

Instead of c454a46b5efd, it was removed from btf.h in commit 23127b33ec80
("bpf: Create a new btf_name_by_offset() for non type name use case").

> Revert back to original implementation and make the function static.
> Remove warning triggered with W=1:
>
> kernel/bpf/btf.c:470:6: warning: no previous prototype for 'btf_name_offset_valid' [-Wmissing-prototypes]
>
> Signed-off-by: Mathieu Malaterre

Fixes: 23127b33ec80 ("bpf: Create a new btf_name_by_offset() for non type name use case")

Thanks for the fix! Other than the above,

Acked-by: Martin KaFai Lau
Re: "bpf: Improve the info.func_info and info.func_info_rec_size behavior" breaks strace self tests
On Thu, Jan 03, 2019 at 11:41:18PM +0100, Heiko Carstens wrote:
> On Thu, Jan 03, 2019 at 07:12:05PM +0000, Martin Lau wrote:
> > On Thu, Jan 03, 2019 at 12:46:13PM +0100, Heiko Carstens wrote:
> > > Hello,
> > >
> > > the kernel commit 7337224fc150 ("bpf: Improve the info.func_info and
> > > info.func_info_rec_size behavior") breaks one of strace's self tests:
> > >
> > > FAIL: bpf-obj_get_info_by_fd-prog-v.gen
> >
> > The strace's bpf-obj_get_info_by_fd-prog-v test did fail. However,
> > I failed to see how 7337224fc150 broke it. How do you trace down to
> > commit 7337224fc150 and can you share your test output?
> >
> > The failure I can reproduce is EFAULT. It should have been failing
> > as early as "nr_jited_ksyms" is added to "struct bpf_prog_info"
> > in linux/bpf.h.
>
> Ah, sorry(!), I forgot to mention an important detail: the test
> failure happens only if executed as normal (non-root) user.
>
> With 7337224fc150 ("bpf: Improve the info.func_info and
> info.func_info_rec_size behavior") the failure happens. With commit
> 30da46b5dc3a ("tools: bpftool: add a command to dump the trace pipe")
> it passes; which is one commit earlier.
> > FAIL: bpf-obj_get_info_by_fd-prog-v.gen > === > > --- exp 2019-01-03 23:31:49.576949303 +0100 > +++ log 2019-01-03 23:31:49.576949303 +0100 > @@ -1,8 +1,8 @@ > bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=8, > max_entries=1, map_flags=0, inner_map_fd=0, map_name="test_map", > map_ifindex=0}, 48) = 3 > bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=9, > insns=[{code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_1, src_reg=BPF_REG_0, > off=0, imm=0}, {code=BPF_STX|BPF_W|BPF_MEM, dst_reg=BPF_REG_10, > src_reg=BPF_REG_1, off=-4, imm=0}, {code=BPF_ALU64|BPF_X|BPF_MOV, > dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0}, > {code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, > imm=0xfffc}, {code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, > src_reg=BPF_REG_1, off=0, imm=0x3}, {code=BPF_LD|BPF_W|BPF_IMM, > dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, > {code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, > imm=0x1}, {code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, > src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_EXIT, > dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}], license="BSD", > log_level=42, log_size=4096, log_buf="", kern_version=KERNEL_VERSION(57005, > 192, 222), prog_flags=0, prog_name="test_prog", prog_ifindex=0, > expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = 4 > bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=3, > info_len=128 => 80, info={type=BPF_MAP_TYPE_ARRAY, id=15, key_size=4, > value_size=8, max_entries=1, map_flags=0, name="test_map", ifindex=0, > netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 > -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, > info_len=168 => 152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, > tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, > jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=NULL, > load_time=28281352029, created_by_uid=1000, nr_map_ids=0 => 1, map_ids=NULL, > name="test_prog", 
ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 > -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, > info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, > tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, > jited_prog_insns=NULL, xlated_prog_len=336, xlated_prog_insns=0x3ff8d6f7000, > load_time=28281352029, created_by_uid=1000, nr_map_ids=2, > map_ids=0x3ff8d6f1000, name="test_prog", ifindex=0, netns_dev=makedev(0, 0), > netns_ino=0}}}, 16) = -1 EFAULT (Bad address) > -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, > info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, > tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, > jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=[], > load_time=28281352029, created_by_uid=1000, nr_map_ids=0 => 1, map_ids=[], > name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 > -bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, > info_len=152, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=8, > tag="\xde\x90\x83\x18\xfb\x16\xd8\x9c", jited_prog_len=0, > jited_prog_insns=NULL, xlated_prog_len=0, xlated_prog_insns=[], > load_time=28281352029, created_by_uid=1000, nr_map_ids=2 => 1, map_ids=[15], > name="test_prog", ifindex=0,
Re: "bpf: Improve the info.func_info and info.func_info_rec_size behavior" breaks strace self tests
On Thu, Jan 03, 2019 at 12:46:13PM +0100, Heiko Carstens wrote:
> Hello,
>
> the kernel commit 7337224fc150 ("bpf: Improve the info.func_info and
> info.func_info_rec_size behavior") breaks one of strace's self tests:
>
> FAIL: bpf-obj_get_info_by_fd-prog-v.gen

The strace's bpf-obj_get_info_by_fd-prog-v test did fail. However,
I failed to see how 7337224fc150 broke it. How do you trace down to
commit 7337224fc150 and can you share your test output?

The failure I can reproduce is EFAULT. It should have been failing
as early as "nr_jited_ksyms" is added to "struct bpf_prog_info"
in linux/bpf.h.

The sys_bpf(BPF_OBJ_GET_INFO_BY_FD, &bpf_prog_get_info_attr, ...) is
called in a loop. The bpf_prog_get_info_attr.info object is in size 104
but bpf_prog_get_info_attr.info_len is in size 168. Hence, if the prog
is jited, the second iteration onwards will have nr_jited_ksyms == 1
which is asking the kernel to fill the
bpf_prog_get_info_attr.info.jited_ksyms and it is NULL.

This fixes the strace test program:

diff --git i/tests/bpf-obj_get_info_by_fd.c w/tests/bpf-obj_get_info_by_fd.c
index e95afda27b3d..953083720822 100644
--- i/tests/bpf-obj_get_info_by_fd.c
+++ w/tests/bpf-obj_get_info_by_fd.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -341,6 +342,7 @@ main(void)
 	size_t old_prog_info_len = PROG_INFO_SZ;

 	for (unsigned int i = 0; i < 4; i++) {
+		memset(prog_info + 1, 0, PROG_INFO_SZ - sizeof(*prog_info));
 		prog_info->jited_prog_len = 0;
 		switch (i) {
 		case 1:

After the above change:

[root@arch-fb-vm1 tests]# ./bpf-obj_get_info_by_fd-prog-v
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=8, max_entries=1, map_flags=0, inner_map_fd=0, map_name="test_map", map_ifindex=0}, 48) = 3
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=9, insns=[{code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_1, src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_STX|BPF_W|BPF_MEM, dst_reg=BPF_REG_10, src_reg=BPF_REG_1, off=-4, imm=0},
{code=BPF_ALU64|BPF_X|BPF_MOV, dst_reg=BPF_REG_2, src_reg=BPF_REG_10, off=0, imm=0}, {code=BPF_ALU64|BPF_K|BPF_ADD, dst_reg=BPF_REG_2, src_reg=BPF_REG_0, off=0, imm=0xfffc}, {code=BPF_LD|BPF_DW|BPF_IMM, dst_reg=BPF_REG_1, src_reg=BPF_REG_1, off=0, imm=0x3}, {code=BPF_LD|BPF_W|BPF_IMM, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_CALL, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0x1}, {code=BPF_ALU64|BPF_K|BPF_MOV, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}, {code=BPF_JMP|BPF_K|BPF_EXIT, dst_reg=BPF_REG_0, src_reg=BPF_REG_0, off=0, imm=0}], license="BSD", log_level=42, log_size=4096, log_buf="", kern_version=KERNEL_VERSION(57005, 192, 222), prog_flags=0, prog_name="test_prog", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 72) = 4 bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=3, info_len=128 => 80, info={type=BPF_MAP_TYPE_ARRAY, id=13, key_size=4, value_size=8, max_entries=1, map_flags=0, name="test_map", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=NULL, load_time=1360454956729, created_by_uid=0, nr_map_ids=0 => 1, map_ids=NULL, name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0, jited_prog_insns=NULL, xlated_prog_len=336, xlated_prog_insns=0x7fda42a02000, load_time=1360454956729, created_by_uid=0, nr_map_ids=2, map_ids=0x7fda429f5000, name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = -1 EFAULT (Bad address) bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, 
tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=[], load_time=1360454956729, created_by_uid=0, nr_map_ids=0 => 1, map_ids=[], name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 bpf(BPF_OBJ_GET_INFO_BY_FD, {info={bpf_fd=4, info_len=168, info={type=BPF_PROG_TYPE_SOCKET_FILTER, id=12, tag="\xda\xbf\x02\x07\xd1\x99\x24\x86", jited_prog_len=0 => 110, jited_prog_insns=NULL, xlated_prog_len=0 => 120, xlated_prog_insns=[], load_time=1360454956729, created_by_uid=0, nr_map_ids=2 => 1, map_ids=[13], name="test_prog", ifindex=0, netns_dev=makedev(0, 0), netns_ino=0}}}, 16) = 0 +++ exited with 0 +++ > > Looking into the kernel commit, it seems that the user space visible > uapi change is intentional; even though it might break existing user > space. > > To reproduce: > > git clone
Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules
On Thu, Dec 13, 2018 at 11:38:51AM -0800, Matt Mullins wrote:
> On Thu, 2018-12-13 at 19:22 +0000, Martin Lau wrote:
> > On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote:
> > > Distributions build drivers as modules, including network and filesystem
> > > drivers which export numerous tracepoints. This enables
> > > bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.

Acked-by: Martin KaFai Lau

[ ... ]

> > > +#ifdef CONFIG_MODULES
> > > +int bpf_event_notify(struct notifier_block *nb, unsigned long op, void *module)
> > > +{
> > > +	struct bpf_trace_module *btm, *tmp;
> > > +	struct module *mod = module;
> > > +
> > > +	if (mod->num_bpf_raw_events == 0 ||
> > > +	    (op != MODULE_STATE_COMING && op != MODULE_STATE_GOING))
> > > +		return 0;
> > > +
> > > +	mutex_lock(&bpf_module_mutex);
> > > +
> > > +	switch (op) {
> > > +	case MODULE_STATE_COMING:
> > > +		btm = kzalloc(sizeof(*btm), GFP_KERNEL);
> > > +		if (btm) {
> > > +			btm->module = module;
> > > +			list_add(&btm->list, &bpf_trace_modules);
> > > +		}
> >
> > Is it fine to return 0 on !btm case?
>
> That effectively just means we'll be ignoring tracepoints for a module
> that is loaded while we can't allocate a bpf_trace_module (24 bytes) to
> track it. That feels like reasonable behavior to me.

ok.
Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules
On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote:
> Distributions build drivers as modules, including network and filesystem
> drivers which export numerous tracepoints. This enables
> bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
>
> Signed-off-by: Matt Mullins
> ---
> v1->v2:
> * avoid taking the mutex in bpf_event_notify when op is neither COMING nor GOING.
> * check that kzalloc actually succeeded
>
> I didn't try to check list_empty before taking the mutex since I want to avoid
> races between bpf_event_notify and bpf_get_raw_tracepoint. Additionally,
> list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, but
> Alexei suggested I use it to protect against fragility if the subsequent break;
> eventually disappears.
>
> include/linux/module.h       |  4 ++
> include/linux/trace_events.h |  8 ++-
> kernel/bpf/syscall.c         | 11 ++--
> kernel/module.c              |  5 ++
> kernel/trace/bpf_trace.c     | 99 +++-
> 5 files changed, 120 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fce6b4335e36..5f147dd5e709 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -432,6 +432,10 @@ struct module {
> 	unsigned int num_tracepoints;
> 	tracepoint_ptr_t *tracepoints_ptrs;
> #endif
> +#ifdef CONFIG_BPF_EVENTS
> +	unsigned int num_bpf_raw_events;
> +	struct bpf_raw_event_map *bpf_raw_events;
> +#endif
> #ifdef HAVE_JUMP_LABEL
> 	struct jump_entry *jump_entries;
> 	unsigned int num_jump_entries;
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 4130a5497d40..8a62731673f7 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event);
> int perf_event_query_prog_array(struct perf_event *event, void __user *info);
> int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
> int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
> -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
> +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
> +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
> int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
> 			    u32 *fd_type, const char **buf,
> 			    u64 *probe_offset, u64 *probe_addr);
> @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf
> {
> 	return -EOPNOTSUPP;
> }
> -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
> +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name)
> {
> 	return NULL;
> }
> +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> +{
> +}
> static inline int bpf_get_perf_event_info(const struct perf_event *event,
> 					  u32 *prog_id, u32 *fd_type,
> 					  const char **buf, u64 *probe_offset,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 70fb11106fc2..754370e3155e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
> 		bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
> 		bpf_prog_put(raw_tp->prog);
> 	}
> +	bpf_put_raw_tracepoint(raw_tp->btp);
> 	kfree(raw_tp);
> 	return 0;
> }
> @@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
> 		return -EFAULT;
> 	tp_name[sizeof(tp_name) - 1] = 0;
>
> -	btp = bpf_find_raw_tracepoint(tp_name);
> +	btp = bpf_get_raw_tracepoint(tp_name);
> 	if (!btp)
> 		return -ENOENT;
>
> 	raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
> -	if (!raw_tp)
> -		return -ENOMEM;
> +	if (!raw_tp) {
> +		err = -ENOMEM;
> +		goto out_put_btp;
> +	}
> 	raw_tp->btp = btp;
>
> 	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
> @@ -1668,6 +1671,8 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
> 	bpf_prog_put(prog);
> out_free_tp:
> 	kfree(raw_tp);
> +out_put_btp:
> +	bpf_put_raw_tracepoint(btp);
> 	return err;
> }
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 49a405891587..06ec68f08387 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3093,6 +3093,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
> 					     sizeof(*mod->tracepoints_ptrs),
> 					     &mod->num_tracepoints);
> #endif
> +#ifdef CONFIG_BPF_EVENTS
> +	mod->bpf_raw
Re: [PATCH v2 bpf-next 3/3] selftests/bpf: add btf annotations for cgroup_local_storage maps
On Mon, Dec 10, 2018 at 03:43:02PM -0800, Roman Gushchin wrote:
> Add btf annotations to cgroup local storage maps (per-cpu and shared)
> in the network packet counting example.

Acked-by: Martin KaFai Lau
Re: [PATCH v2 bpf-next 2/3] bpf: add bpffs pretty print for cgroup local storage maps
On Mon, Dec 10, 2018 at 03:43:01PM -0800, Roman Gushchin wrote:
> Implement bpffs pretty printing for cgroup local storage maps
> (both shared and per-cpu).
> Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c):
>
> Shared:
> $ cat /sys/fs/bpf/map_2
> # WARNING!! The output is for debug purpose only
> # WARNING!! The output format will change
> {4294968594,1}: {,1039896}
>
> Per-cpu:
> $ cat /sys/fs/bpf/map_1
> # WARNING!! The output is for debug purpose only
> # WARNING!! The output format will change
> {4294968594,1}: {
> 	cpu0: {0,0,0,0,0}
> 	cpu1: {0,0,0,0,0}
> 	cpu2: {1,104,0,0,0}
> 	cpu3: {0,0,0,0,0}
> }

Acked-by: Martin KaFai Lau
Re: [PATCH bpf-next 1/3] bpf: pass struct btf pointer to the map_check_btf() callback
On Fri, Dec 07, 2018 at 04:53:13PM -0800, Roman Gushchin wrote: > If key_type or value_type are of non-trivial data types > (e.g. structure or typedef), it's not possible to check them without > the additional information, which can't be obtained without a pointer > to the btf structure. > > So, let's pass btf pointer to the map_check_btf() callbacks. Acked-by: Martin KaFai Lau
Re: [PATCH bpf-next 2/3] bpf: add bpffs pretty print for cgroup local storage maps
On Fri, Dec 07, 2018 at 04:53:14PM -0800, Roman Gushchin wrote: > Implement bpffs pretty printing for cgroup local storage maps > (both shared and per-cpu). > Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c): > > Shared: > $ cat /sys/fs/bpf/map_2 > # WARNING!! The output is for debug purpose only > # WARNING!! The output format will change > {4294968594,1}: {,1039896} > > Per-cpu: > $ cat /sys/fs/bpf/map_1 > # WARNING!! The output is for debug purpose only > # WARNING!! The output format will change > {4294968594,1}: { > cpu0: {0,0,0,0,0} > cpu1: {0,0,0,0,0} > cpu2: {1,104,0,0,0} > cpu3: {0,0,0,0,0} > } > > Signed-off-by: Roman Gushchin > Cc: Alexei Starovoitov > Cc: Daniel Borkmann > --- > include/linux/btf.h| 10 + > kernel/bpf/local_storage.c | 90 +- > 2 files changed, 99 insertions(+), 1 deletion(-) > > diff --git a/include/linux/btf.h b/include/linux/btf.h > index 8c2199b5d250..ac67bc4cbfd9 100644 > --- a/include/linux/btf.h > +++ b/include/linux/btf.h > @@ -5,6 +5,7 @@ > #define _LINUX_BTF_H 1 > > #include > +#include > > struct btf; > struct btf_type; > @@ -63,4 +64,13 @@ static inline const char *btf_name_by_offset(const struct > btf *btf, > } > #endif > > +static inline const struct btf_type *btf_orig_type(const struct btf *btf, > +const struct btf_type *t) > +{ > + while (t && BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF) > + t = btf_type_by_id(btf, t->type); Only typedef is allowed? Not even const? If that is not the case, can btf_type_id_size() be reused? The "type following" has already been done once and then remembered in the verification time. If cgroup_storage_check_btf() can only allow typedef, please move "btf_orig_type()" to the local_storage.c. 
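The question above — whether only typedefs are followed — can be illustrated without the kernel's BTF machinery. Below is a minimal userspace sketch (toy types, not the real struct btf_type encoding; all names here are made up for illustration) of the typedef-only walk that the quoted btf_orig_type() performs:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's BTF kinds; values are arbitrary. */
enum kind { KIND_INT, KIND_STRUCT, KIND_TYPEDEF, KIND_CONST };

struct toy_type {
	enum kind kind;
	const struct toy_type *target;	/* what a typedef/const points at */
};

/* Follow only typedefs, as the quoted btf_orig_type() does: a const
 * qualifier (or any other kind) stops the walk, which is exactly what
 * the review comment is asking about. */
static const struct toy_type *orig_type(const struct toy_type *t)
{
	while (t && t->kind == KIND_TYPEDEF)
		t = t->target;
	return t;
}
```

With a chain typedef -> const -> struct, this resolver stops at the const node, so a const-qualified key type would fall through to the struct check below and be rejected.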
> + > + return t; > +} > + > #endif > diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c > index b65017dead44..7b51fe1aba3c 100644 > --- a/kernel/bpf/local_storage.c > +++ b/kernel/bpf/local_storage.c > @@ -1,11 +1,13 @@ > //SPDX-License-Identifier: GPL-2.0 > #include > #include > +#include > #include > #include > #include > #include > #include > +#include > > DEFINE_PER_CPU(struct bpf_cgroup_storage*, > bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); > > @@ -308,6 +310,91 @@ static int cgroup_storage_delete_elem(struct bpf_map > *map, void *key) > return -EINVAL; > } > > +static int cgroup_storage_check_btf(const struct bpf_map *map, > + const struct btf *btf, > + const struct btf_type *key_type, Actually, in "map_check_btf()" (just before cgroup_storage_check_btf() is called), btf_type_id_size() has already been called to get the true size and the resolved type (i.e. BTF_INFO_KIND_STRUCT here) in order to reject "key_size != map->key_size". Hence, the key_type passed to cgroup_storage_check_btf() here will not be KIND_TYPEDEF or KIND_CONST. > + const struct btf_type *value_type) > +{ > + const struct btf_type *st, *t; > + struct btf_member *m; > + > + /* Key is expected to be of struct bpf_cgroup_storage_key type, > + * which is: > + * struct bpf_cgroup_storage_key { > + * __u64 cgroup_inode_id; > + * __u32 attach_type; > + * }; > + */ > + > + /* > + * Key_type must be a structure (or a typedef of a structure) with > + * two members. > + */ > + st = btf_orig_type(btf, key_type); > + if (BTF_INFO_KIND(st->info) != BTF_KIND_STRUCT || > + BTF_INFO_VLEN(st->info) != 2) > + return -EINVAL; > + > + /* > + * The first field must be a 64 bit integer at 0 offset. 
> + */ > + m = (struct btf_member *)(st + 1); > + t = btf_orig_type(btf, btf_type_by_id(btf, m->type)); > + if (!t || BTF_INFO_KIND(t->info) != BTF_KIND_INT || m->offset || > + t->size != > + FIELD_SIZEOF(struct bpf_cgroup_storage_key, cgroup_inode_id)) Instead of t->size, BTF_INT_BITS() and BTF_INT_OFFSET() need to be checked (please refer to the key_type check in array_map_check_btf()). I think exposing "btf_type_int_is_regular()" from btf.c will be easier here. Actually, add "btf_type_is_reg_int(t, expected_size)" to btf.h and btf.c, like (uncompiled and untested code): /* * Not a bit field and it must be the expected size. */ bool btf_type_is_reg_int(const struct btf_type *t, u32 expected_size) { u8 nr_bits, nr_bytes; u32 int_data; if (!btf_type_is_int(t)) return false; int_data = btf_type_int(t); nr_bits = BTF_INT_BITS(int_data); nr_bytes = BITS_ROUNDUP_BYTES(nr_bits); if (BITS_PER_BYTE_MASKED(nr_bits) || BTF_INT_OFFSET(int_data) || nr_bytes != expected_size) return false;
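The check sketched in the review comment is mostly bit arithmetic and can be exercised on its own. Below is a hedged userspace recast of the suggested btf_type_is_reg_int() logic; the macro names mirror the kernel's, but the function signature is simplified (raw bit counts instead of a struct btf_type) and is an assumption for illustration, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define BITS_PER_BYTE		8
#define BITS_ROUNDUP_BYTES(bits) \
	(((bits) + BITS_PER_BYTE - 1) / BITS_PER_BYTE)
#define BITS_PER_BYTE_MASKED(bits)	((bits) & (BITS_PER_BYTE - 1))

/* A "regular" integer: not a bit-field (whole number of bytes, zero bit
 * offset) and exactly the size the map layout expects. */
static bool is_regular_int(unsigned int nr_bits, unsigned int bit_offset,
			   unsigned int expected_size)
{
	unsigned int nr_bytes = BITS_ROUNDUP_BYTES(nr_bits);

	if (BITS_PER_BYTE_MASKED(nr_bits) || bit_offset ||
	    nr_bytes != expected_size)
		return false;
	return true;
}
```

The point of checking bits rather than t->size is visible here: a 63-bit bit-field still has t->size == 8, but fails the whole-byte test.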
RE: [PATCH RESEND] tcp_estats: ebpf hacks
Please ignore. It is a mistake... From: Martin KaFai Lau [ka...@fb.com] Sent: Wednesday, February 03, 2016 9:39 PM To: linux-kernel@vger.kernel.org Cc: Ingo Molnar; Masami Hiramatsu; Steven Rostedt; Alexei Starovoitov; Josef Bacik; Kernel Team Subject: [PATCH RESEND] tcp_estats: ebpf hacks Signed-off-by: Martin KaFai Lau --- kernel/trace/bpf_trace.c | 20 ++ samples/Makefile | 2 +- samples/bpf/Makefile | 11 +- samples/bpf/bpf_helpers.h| 4 + samples/bpf/bpf_load.c | 44 +++-- samples/bpf/tcp_trace.h | 51 + samples/bpf/tcp_trace_kern.c | 454 +++ samples/bpf/tcp_trace_user.c | 115 +++ tools/net/Makefile | 6 +- 9 files changed, 689 insertions(+), 18 deletions(-) create mode 100644 samples/bpf/tcp_trace.h create mode 100644 samples/bpf/tcp_trace_kern.c create mode 100644 samples/bpf/tcp_trace_user.c diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 47febbe..977702e 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -68,6 +68,7 @@ static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) void *unsafe_ptr = (void *) (long) r3; return probe_kernel_read(dst, unsafe_ptr, size); + /* return __bpf_probe_read_hack(dst, unsafe_ptr, size); */ } static const struct bpf_func_proto bpf_probe_read_proto = { @@ -79,6 +80,25 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_probe_read_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + u32 *dst = (u32 *) (long) r1; + int size = (int) r2; + u32 *unsafe_ptr = (void *) (long) r3; + + *dst = *unsafe_ptr; + return probe_kernel_read(dst, unsafe_ptr, size); +} + +static const struct bpf_func_proto bpf_probe_read_u32_proto = { + .func = bpf_probe_read, + .gpl_only = true, + .ret_type = RET_VOID, + .arg1_type = ARG_PTR_TO_STACK, + .arg2_type = ARG_CONST_STACK_SIZE, + .arg3_type = ARG_ANYTHING, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed diff --git a/samples/Makefile 
b/samples/Makefile index f00257b..fb87be5 100644 --- a/samples/Makefile +++ b/samples/Makefile @@ -1,4 +1,4 @@ # Makefile for Linux samples code obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \ - hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ + hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ bpf/ diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 97e5243..02885ae 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -14,6 +14,7 @@ hostprogs-y += tracex4 hostprogs-y += tracex5 hostprogs-y += trace_output hostprogs-y += lathist +hostprogs-y += tcp_trace test_verifier-objs := test_verifier.o libbpf.o test_maps-objs := test_maps.o libbpf.o @@ -28,6 +29,7 @@ tracex4-objs := bpf_load.o libbpf.o tracex4_user.o tracex5-objs := bpf_load.o libbpf.o tracex5_user.o trace_output-objs := bpf_load.o libbpf.o trace_output_user.o lathist-objs := bpf_load.o libbpf.o lathist_user.o +tcp_trace-objs := bpf_load.o libbpf.o tcp_trace_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -42,6 +44,7 @@ always += tracex5_kern.o always += trace_output_kern.o always += tcbpf1_kern.o always += lathist_kern.o +always += tcp_trace_kern.o HOSTCFLAGS += -I$(objtree)/usr/include @@ -56,14 +59,16 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt HOSTLOADLIBES_tracex5 += -lelf HOSTLOADLIBES_trace_output += -lelf -lrt HOSTLOADLIBES_lathist += -lelf +HOSTLOADLIBES_tcp_trace += -lelf # point this to your LLVM backend with bpf support -LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc +LLC=/home/kafai/local/llvm-git-master/bin/llc +CLANG=/home/kafai/local/llvm-git-master/bin/clang $(obj)/%.o: $(src)/%.c - clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ + $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \ -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ - clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ + $(CLANG) $(NOSTDINC_FLAGS) 
$(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \ -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h index e84dd3c..df3f00e 100644 --- a/samples/bpf/bpf_helpers.h +++ b/samples/bpf/bpf_helpers.h @@ -33,6 +33,10 @@ static int (*bpf_get_current_comm)(void *buf, int buf_size) = (void *) BPF_FUNC_get_current_comm; static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, int size)
Re: [PATCH] ring-buffer: Fix polling on trace_pipe
Yes, I think it should go to both stable and 3.16. Thanks, --Martin On Tue, Jul 15, 2014 at 03:46:12PM -0400, Steven Rostedt wrote: > On Thu, 10 Jul 2014 15:20:30 -0700 > Martin Lau wrote: > > > Hi Steve, > > > > Do you have a chance to give it another try? > > > > OK, I finally got around to testing it. Yep I see your point. Do you > think it should go to 3.16 and stable? I can add it to my next git pull > request. > > Also, I modified your test such that it wouldn't hang on failure, but > detects that it hung and returns a exit value to add this to my > testing. That is, it exits with zero on success and non zero on failure. > > -- Steve > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > > static const char * debugfs_list[] = { > "/debug/tracing", > "/sys/kernel/debug/tracing", > "/d/tracing", > NULL, > }; > > static const char *debugfs; > static int markfd; > static int trace_pipe_fd; > > static void alog(const char *name, int ret) > { > printf("%d: %s: %d\n", getpid(), name, ret); > } > > static const char *find_debugfs(void) > { > struct stat st; > int i; > int r; > > for (i = 0; debugfs_list[i]; i++) { > r = stat(debugfs_list[i], &st); > if (r < 0) > continue; > if (S_ISDIR(st.st_mode)) > return debugfs_list[i]; > } > return NULL; > } > > static char * make_path(const char *file) > { > char *path; > int size; > > size = strlen(debugfs) + strlen(file) + 2; > path = malloc(size); > if (!path) { > perror("malloc"); > exit(-1); > } > sprintf(path, "%s/%s", debugfs, file); > return path; > } > > static void mark_write(const char *str) > { > int ret; > ret = write(markfd, str, strlen(str)); > alog("write(markfd)", ret); > } > > static void read_trace_pipe(void) > { > char buf[1024]; > int r; > > while ((r = read(trace_pipe_fd, buf, 1024)) > 0) > printf("%.*s", r, buf); > } > > int main (int argc, char **argv) > { > struct epoll_event ee; > char *marker; > char *pipe; > int efd; > int 
ret; > pid_t dwrt_pid; > > debugfs = find_debugfs(); > if (!debugfs) { > fprintf(stderr, "Could not find debugfs\n"); > exit(-1); > } > > marker = make_path("trace_marker"); > pipe = make_path("trace_pipe"); > > markfd = open(marker, O_WRONLY); > if (markfd < 0) { > perror("marker"); > exit(-1); > } > trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK); > if (trace_pipe_fd < 0) { > perror("trace_pipe"); > exit(-1); > } > > efd = epoll_create(1); > if (efd < 0) { > perror("epoll_create"); > exit(-1); > } > > mark_write("some data"); > memset(&ee, 0, sizeof(ee)); > ee.events = EPOLLIN; > ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee); > if (ret < 0) { > perror("epoll_ctl"); > exit(-1); > } > alog("waiting data..", 0); > ret = epoll_wait(efd, &ee, 1, -1); > alog("epoll_wait()", ret); > read_trace_pipe(); > dwrt_pid = fork(); > assert(dwrt_pid != -1); > if (dwrt_pid > 0) { > int status; > sleep(10); > mark_write("more data"); > sleep(10); > ret = waitpid(dwrt_pid, &status, WNOHANG); > if (ret != dwrt_pid) { > alog("Poll never finished!", 0); > kill(dwrt_pid, 9); > exit(-1); > } > if (WEXITSTATUS(status) != 0) { > alog("Something failed on polling!", >WEXITSTATUS(status)); > exit(-1); > } > } else { > alog("waiting form more data..", 0); > ret = epoll_wait(efd, &ee, 1, -1); > alog("epoll_wait()", ret); > read_trace_pipe(); > if (ret < 0) > exit(errno); > } > exit (0); > } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ring-buffer: Fix polling on trace_pipe
Hi Steve, Do you have a chance to give it another try? Thanks, Martin On Thu, Jun 26, 2014 at 11:52:44PM -0700, Martin Lau wrote: > Hi Steve, > > I retried the test program (with the kernel patch). It does block from > time to time. I spotted the ee.events is not set to EPOLLIN before > calling epll_ctl(eft, EPOLL_CTL_ADD,...). I fixed it and ran > the test in a bash-loop. > > I have the kafai-2 version attached (with some more logging in case > if it still blocks). > > Can you retry? > > Thanks, > Martin > > On Thu, Jun 26, 2014 at 08:53:27PM -0400, Steven Rostedt wrote: > > On Thu, 26 Jun 2014 11:34:46 -0700 > > Martin Lau wrote: > > > > > Hi Steve, > > > > > > Can the modified test program reproduce the problem in your test setup? > > > > Ah sorry, got distracted by other work. > > > > OK, I just tried it out, and here's my results: > > > > I ran you code with my current kernel and this is what I got: > > > > # ./ftrace-test-epoll-kafai > ><...>-3183 [002] ...181.777891: tracing_mark_write: > > some data 3183: waitting for more data.. > > 3184: written more data > > > > And it just hung there. > > > > > > Then I applied your patch, compiled and booted it, and ran the test > > again, and I got this: > > > > # ./ftrace-test-epoll-kafai > > > > It just hung there. No forward progress. > > > > I don't think that was the result you intended. 
> > > > -- Steve > > > > > #include > #include > #include > #include > #include > #include > #include > #include > #include > > static const char * debugfs_list[] = { > "/debug/tracing", > "/sys/kernel/debug/tracing", > "/d/tracing", > NULL, > }; > > static const char *debugfs; > static int markfd; > static int trace_pipe_fd; > > static void alog(const char *name, int ret) > { > printf("%d: %s: %d\n", getpid(), name, ret); > } > > static const char *find_debugfs(void) > { > struct stat st; > int i; > int r; > > for (i = 0; debugfs_list[i]; i++) { > r = stat(debugfs_list[i], &st); > if (r < 0) > continue; > if (S_ISDIR(st.st_mode)) > return debugfs_list[i]; > } > return NULL; > } > > static char * make_path(const char *file) > { > char *path; > int size; > > size = strlen(debugfs) + strlen(file) + 2; > path = malloc(size); > if (!path) { > perror("malloc"); > exit(-1); > } > sprintf(path, "%s/%s", debugfs, file); > return path; > } > > static void mark_write(const char *str) > { > int ret; > ret = write(markfd, str, strlen(str)); > alog("write(markfd)", ret); > } > > static void read_trace_pipe(void) > { > char buf[1024]; > int r; > > while ((r = read(trace_pipe_fd, buf, 1024)) > 0) > printf("%.*s", r, buf); > } > > int main (int argc, char **argv) > { > struct epoll_event ee; > char *marker; > char *pipe; > int efd; > int ret; > pid_t dwrt_pid; > > debugfs = find_debugfs(); > if (!debugfs) { > fprintf(stderr, "Could not find debugfs\n"); > exit(-1); > } > > marker = make_path("trace_marker"); > pipe = make_path("trace_pipe"); > > markfd = open(marker, O_WRONLY); > if (markfd < 0) { > perror("marker"); > exit(-1); > } > trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK); > if (trace_pipe_fd < 0) { > perror("trace_pipe"); > exit(-1); > } > > efd = epoll_create(1); > if (efd < 0) { > perror("epoll_create"); > exit(-1); > } > > mark_write("some data"); > memset(&ee, 0, sizeof(ee)); > ee.events = EPOLLIN; > ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee); > if (ret 
< 0) { > perror("epoll_ctl"); > exit(-1); > } > alog("waiting data..", 0); > ret = epoll_wait(efd, &ee, 1, -1); > alog("epoll_wait()", ret); > read_trace_pipe(); > dwrt_pid = fork(); > assert(dwrt_pid != -1); > if (dwrt_pid == 0) { > sleep(10); > mark_write("more data"); > } else { > alog("waiting form more data..", 0); > ret = epoll_wait(efd, &ee, 1, -1); > alog("epoll_wait()", ret); > read_trace_pipe(); > wait(NULL); > } > return 0; > }
Re: [PATCH] ring-buffer: Fix polling on trace_pipe
Hi Steve, I retried the test program (with the kernel patch). It does block from time to time. I spotted that ee.events was not set to EPOLLIN before calling epoll_ctl(efd, EPOLL_CTL_ADD,...). I fixed it and ran the test in a bash-loop. I have the kafai-2 version attached (with some more logging in case it still blocks). Can you retry? Thanks, Martin On Thu, Jun 26, 2014 at 08:53:27PM -0400, Steven Rostedt wrote: > On Thu, 26 Jun 2014 11:34:46 -0700 > Martin Lau wrote: > > > Hi Steve, > > > > Can the modified test program reproduce the problem in your test setup? > > Ah sorry, got distracted by other work. > > OK, I just tried it out, and here's my results: > > I ran you code with my current kernel and this is what I got: > > # ./ftrace-test-epoll-kafai ><...>-3183 [002] ...181.777891: tracing_mark_write: > some data 3183: waitting for more data.. > 3184: written more data > > And it just hung there. > > > Then I applied your patch, compiled and booted it, and ran the test > again, and I got this: > > # ./ftrace-test-epoll-kafai > > It just hung there. No forward progress. > > I don't think that was the result you intended. 
> > -- Steve > > #include #include #include #include #include #include #include #include #include static const char * debugfs_list[] = { "/debug/tracing", "/sys/kernel/debug/tracing", "/d/tracing", NULL, }; static const char *debugfs; static int markfd; static int trace_pipe_fd; static void alog(const char *name, int ret) { printf("%d: %s: %d\n", getpid(), name, ret); } static const char *find_debugfs(void) { struct stat st; int i; int r; for (i = 0; debugfs_list[i]; i++) { r = stat(debugfs_list[i], &st); if (r < 0) continue; if (S_ISDIR(st.st_mode)) return debugfs_list[i]; } return NULL; } static char * make_path(const char *file) { char *path; int size; size = strlen(debugfs) + strlen(file) + 2; path = malloc(size); if (!path) { perror("malloc"); exit(-1); } sprintf(path, "%s/%s", debugfs, file); return path; } static void mark_write(const char *str) { int ret; ret = write(markfd, str, strlen(str)); alog("write(markfd)", ret); } static void read_trace_pipe(void) { char buf[1024]; int r; while ((r = read(trace_pipe_fd, buf, 1024)) > 0) printf("%.*s", r, buf); } int main (int argc, char **argv) { struct epoll_event ee; char *marker; char *pipe; int efd; int ret; pid_t dwrt_pid; debugfs = find_debugfs(); if (!debugfs) { fprintf(stderr, "Could not find debugfs\n"); exit(-1); } marker = make_path("trace_marker"); pipe = make_path("trace_pipe"); markfd = open(marker, O_WRONLY); if (markfd < 0) { perror("marker"); exit(-1); } trace_pipe_fd = open(pipe, O_RDONLY|O_NONBLOCK); if (trace_pipe_fd < 0) { perror("trace_pipe"); exit(-1); } efd = epoll_create(1); if (efd < 0) { perror("epoll_create"); exit(-1); } mark_write("some data"); memset(&ee, 0, sizeof(ee)); ee.events = EPOLLIN; ret = epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, &ee); if (ret < 0) { perror("epoll_ctl"); exit(-1); } alog("waiting data..", 0); ret = epoll_wait(efd, &ee, 1, -1); alog("epoll_wait()", ret); read_trace_pipe(); dwrt_pid = fork(); assert(dwrt_pid != -1); if (dwrt_pid == 0) { sleep(10); 
mark_write("more data"); } else { alog("waiting form more data..", 0); ret = epoll_wait(efd, &ee, 1, -1); alog("epoll_wait()", ret); read_trace_pipe(); wait(NULL); } return 0; }
Re: [PATCH] ring-buffer: Fix polling on trace_pipe
Hi Steve, Can the modified test program reproduce the problem in your test setup? Thanks, --Martin On Tue, Jun 10, 2014 at 10:58:14PM -0700, Martin Lau wrote: > Hi Steve, > > Attached is the modified test program. Here is the sample output: > > localhost ~ # ./ftrace-test-epoll-kafai ><...>-1857 [000] ...1 720.174295: tracing_mark_write: some data > 1857: waitting for more data.. > 1858: written more data > > > Thanks, > --Martin > > On Tue, Jun 10, 2014 at 11:49:15AM -0400, Steven Rostedt wrote: > > On Mon, 9 Jun 2014 23:06:42 -0700 > > Martin Lau wrote: > > > > > ring_buffer_poll_wait() should always put the poll_table to its wait_queue > > > even there is immediate data available. Otherwise, the following epoll > > > and > > > read sequence will eventually hang forever: > > > > > > 1. Put some data to make the trace_pipe ring_buffer read ready first > > > 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee) > > > 3. epoll_wait() > > > 4. read(trace_pipe_fd) till EAGAIN > > > 5. Add some more data to the trace_pipe ring_buffer > > > 6. epoll_wait() -> this epoll_wait() will block forever > > > > > > ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2, > > > ring_buffer_poll_wait() returns immediately without adding poll_table, > > > which has poll_table->_qproc pointing to ep_poll_callback(), to its > > > wait_queue. > > > ~ During the epoll_wait() call in step 3 and step 6, > > > ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue > > > because the poll_table->_qproc is NULL and it is how epoll works. > > > ~ When there is new data available in step 6, ring_buffer does not know > > > it has to call ep_poll_callback() because it is not in its wait queue. > > > Hence, block forever. > > > > > > Other poll implementation seems to call poll_wait() unconditionally as > > > the very > > > first thing to do. For example, tcp_poll() in tcp.c. > > > > I'm trying to see the effect of this bug, but can't seem to reproduce > > it. 
Maybe I did something wrong. Attached is a test program I wrote > > trying to follow your instructions. I don't use epoll, so perhaps I > > didn't use it correctly. > > > > Can you modify it to show me the problem this is trying to fix. That > > is, without this patch it hangs, but with the patch it does not. > > > > Thanks! > > > > -- Steve > > > > > > > > Signed-off-by: Martin Lau > > > --- > > > kernel/trace/ring_buffer.c | 4 > > > 1 file changed, 4 deletions(-) > > > > > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > > index fd12cc5..a6e64e8 100644 > > > --- a/kernel/trace/ring_buffer.c > > > +++ b/kernel/trace/ring_buffer.c > > > @@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer > > > *buffer, int cpu, > > > struct ring_buffer_per_cpu *cpu_buffer; > > > struct rb_irq_work *work; > > > > > > - if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) || > > > - (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, > > > cpu))) > > > - return POLLIN | POLLRDNORM; > > > - > > > if (cpu == RING_BUFFER_ALL_CPUS) > > > work = &buffer->irq_work; > > > else { > > > > #include > #include > #include > #include > #include > #include > #include > #include > #include > > static const char * debugfs_list[] = { > "/debug/tracing", > "/sys/kernel/debug/tracing", > "/d/tracing", > NULL, > }; > > static const char *debugfs; > static int markfd; > static int trace_pipe_fd; > > static const char *find_debugfs(void) > { > struct stat st; > int i; > int r; > > for (i = 0; debugfs_list[i]; i++) { > r = stat(debugfs_list[i], &st); > if (r < 0) > continue; > if (S_ISDIR(st.st_mode)) > return debugfs_list[i]; > } > return NULL; > } > > static char * make_path(const char *file) > { > char *path; > int size; > > size = strlen(debugfs) + strlen(file) + 2; > path = malloc(size); > if (!path) { > perror("malloc&
Re: [PATCH] ring-buffer: Fix polling on trace_pipe
Hi Steve, Attached is the modified test program. Here is the sample output: localhost ~ # ./ftrace-test-epoll-kafai <...>-1857 [000] ...1 720.174295: tracing_mark_write: some data 1857: waitting for more data.. 1858: written more data Thanks, --Martin On Tue, Jun 10, 2014 at 11:49:15AM -0400, Steven Rostedt wrote: > On Mon, 9 Jun 2014 23:06:42 -0700 > Martin Lau wrote: > > > ring_buffer_poll_wait() should always put the poll_table to its wait_queue > > even there is immediate data available. Otherwise, the following epoll and > > read sequence will eventually hang forever: > > > > 1. Put some data to make the trace_pipe ring_buffer read ready first > > 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee) > > 3. epoll_wait() > > 4. read(trace_pipe_fd) till EAGAIN > > 5. Add some more data to the trace_pipe ring_buffer > > 6. epoll_wait() -> this epoll_wait() will block forever > > > > ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2, > > ring_buffer_poll_wait() returns immediately without adding poll_table, > > which has poll_table->_qproc pointing to ep_poll_callback(), to its > > wait_queue. > > ~ During the epoll_wait() call in step 3 and step 6, > > ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue > > because the poll_table->_qproc is NULL and it is how epoll works. > > ~ When there is new data available in step 6, ring_buffer does not know > > it has to call ep_poll_callback() because it is not in its wait queue. > > Hence, block forever. > > > > Other poll implementation seems to call poll_wait() unconditionally as the > > very > > first thing to do. For example, tcp_poll() in tcp.c. > > I'm trying to see the effect of this bug, but can't seem to reproduce > it. Maybe I did something wrong. Attached is a test program I wrote > trying to follow your instructions. I don't use epoll, so perhaps I > didn't use it correctly. > > Can you modify it to show me the problem this is trying to fix. 
That > is, without this patch it hangs, but with the patch it does not. > > Thanks! > > -- Steve > > > > > Signed-off-by: Martin Lau > > --- > > kernel/trace/ring_buffer.c | 4 > > 1 file changed, 4 deletions(-) > > > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index fd12cc5..a6e64e8 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > @@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, > > int cpu, > > struct ring_buffer_per_cpu *cpu_buffer; > > struct rb_irq_work *work; > > > > - if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) || > > - (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, > > cpu))) > > - return POLLIN | POLLRDNORM; > > - > > if (cpu == RING_BUFFER_ALL_CPUS) > > work = &buffer->irq_work; > > else { > #include #include #include #include #include #include #include #include #include static const char * debugfs_list[] = { "/debug/tracing", "/sys/kernel/debug/tracing", "/d/tracing", NULL, }; static const char *debugfs; static int markfd; static int trace_pipe_fd; static const char *find_debugfs(void) { struct stat st; int i; int r; for (i = 0; debugfs_list[i]; i++) { r = stat(debugfs_list[i], &st); if (r < 0) continue; if (S_ISDIR(st.st_mode)) return debugfs_list[i]; } return NULL; } static char * make_path(const char *file) { char *path; int size; size = strlen(debugfs) + strlen(file) + 2; path = malloc(size); if (!path) { perror("malloc"); exit(-1); } sprintf(path, "%s/%s", debugfs, file); return path; } static void mark_write(const char *str) { write(markfd, str, strlen(str)); } static void read_trace_pipe(void) { char buf[1024]; int r; while ((r = read(trace_pipe_fd, buf, 1024)) > 0) printf("%.*s", r, buf); } int main (int argc, char **argv) { struct epoll_event ee; char *marker; char *pipe; int efd; int ret; pid_t dwrt_pid; debugfs = find_debugfs(); if (!debugfs) { fprintf(stderr, "Could not find debugfs\n"); exit(-1); }
[PATCH] ring-buffer: Fix polling on trace_pipe
ring_buffer_poll_wait() should always add the poll_table to its wait_queue even when there is immediate data available. Otherwise, the following epoll and read sequence will eventually hang forever: 1. Put some data to make the trace_pipe ring_buffer read ready first 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee) 3. epoll_wait() 4. read(trace_pipe_fd) till EAGAIN 5. Add some more data to the trace_pipe ring_buffer 6. epoll_wait() -> this epoll_wait() will block forever ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2, ring_buffer_poll_wait() returns immediately without adding poll_table, which has poll_table->_qproc pointing to ep_poll_callback(), to its wait_queue. ~ During the epoll_wait() calls in step 3 and step 6, ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue because poll_table->_qproc is NULL, which is how epoll works. ~ When new data becomes available in step 6, the ring_buffer does not know it has to call ep_poll_callback() because it is not in its wait queue. Hence, it blocks forever. Other poll implementations call poll_wait() unconditionally as the very first thing they do; for example, tcp_poll() in tcp.c. 
Signed-off-by: Martin Lau --- kernel/trace/ring_buffer.c | 4 1 file changed, 4 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index fd12cc5..a6e64e8 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -613,10 +613,6 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu, struct ring_buffer_per_cpu *cpu_buffer; struct rb_irq_work *work; - if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) || - (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu))) - return POLLIN | POLLRDNORM; - if (cpu == RING_BUFFER_ALL_CPUS) work = &buffer->irq_work; else { -- 1.8.1
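The six-step sequence in the commit message is plain level-triggered epoll usage, and any file whose poll handler registers the waiter unconditionally satisfies it. As a userspace illustration only (a pipe standing in for trace_pipe; not the kernel code), the same sequence completes without hanging:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Runs steps 1-6 from the commit message against a pipe and returns the
 * result of the second epoll_wait(): 1 if the new data woke us up, which
 * is the behaviour trace_pipe lacked before this patch. */
static int epoll_sequence(void)
{
	int pfd[2];
	char buf[64];
	struct epoll_event ee;
	int efd, n;

	if (pipe(pfd) < 0)
		return -1;
	fcntl(pfd[0], F_SETFL, O_NONBLOCK);

	write(pfd[1], "some data", 9);			/* 1. make it read-ready */
	efd = epoll_create(1);
	memset(&ee, 0, sizeof(ee));
	ee.events = EPOLLIN;
	epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &ee);	/* 2. add while ready */
	epoll_wait(efd, &ee, 1, 1000);			/* 3. first wait */
	while (read(pfd[0], buf, sizeof(buf)) > 0)	/* 4. drain till EAGAIN */
		;
	write(pfd[1], "more data", 9);			/* 5. new data */
	n = epoll_wait(efd, &ee, 1, 1000);		/* 6. must wake again */

	close(pfd[0]);
	close(pfd[1]);
	close(efd);
	return n;
}
```

Before the patch, step 6 against trace_pipe never returned because ring_buffer_poll_wait() had skipped poll_wait() in step 2; a pipe's poll handler registers the waiter unconditionally, so the same sequence works.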