Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint
Jason Wang wrote: > On Fri, Feb 23, 2024 at 9:42 AM Xuan Zhuo wrote: > > > > On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni wrote: > > > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote: > > > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni wrote: > > > > > > > > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote: > > > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni > > > > > > wrote: > > > > > > > > > > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote: > > > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer > > > > > > > > wrote: > > > > > > > > > On 02/02/2024 13.11, Liang Chen wrote: > > > > > > > [...] > > > > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct > > > > > > > > > > xdp_buff *xdp) > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct > > > > > > > > > > virtnet_xdp_buff *virtnet_xdp, > > > > > > > > > > + struct net_device *dev, > > > > > > > > > > + struct > > > > > > > > > > virtio_net_hdr_v1_hash *hdr_hash) > > > > > > > > > > +{ > > > > > > > > > > + if (dev->features & NETIF_F_RXHASH) { > > > > > > > > > > + virtnet_xdp->hash_value = > > > > > > > > > > hdr_hash->hash_value; > > > > > > > > > > + virtnet_xdp->hash_report = > > > > > > > > > > hdr_hash->hash_report; > > > > > > > > > > + } > > > > > > > > > > +} > > > > > > > > > > + > > > > > > > > > > > > > > > > > > Would it be possible to store a pointer to hdr_hash in > > > > > > > > > virtnet_xdp_buff, > > > > > > > > > with the purpose of delaying extracting this, until and only > > > > > > > > > if XDP > > > > > > > > > bpf_prog calls the kfunc? > > > > > > > > > > > > > > > > > > > > > > > > > That seems to be the way v1 works, > > > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/ > > > > > > > > . But it was pointed out that the inline header may be > > > > > > > > overwritten by > > > > > > > > the xdp prog, so the hash is copied out to maintain its > > > > > > > > integrity. > > > > > > > > > > > > > > Why? isn't XDP supposed to get write access only to the pkt > > > > > > > contents/buffer? > > > > > > > > > > > > > > > > > > > Normally, an XDP program accesses only the packet data. However, > > > > > > there's also an XDP RX Metadata area, referenced by the data_meta > > > > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to > > > > > > point somewhere ahead of the data buffer, thereby granting the XDP > > > > > > program access to the virtio header located immediately before the > > > > > > > > > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data > > > > > before > > > > > xdp->data_hard_start: > > > > > > > > > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210 > > > > > > > > > > and virtio net set such field after the virtio_net_hdr: > > > > > > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218 > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420 > > > > > > > > > > I don't see how the virtio hdr could be touched? Possibly even more > > > > > important: if such thing is possible, I think is should be somewhat > > > > > denied (for the same reason an H/W nic should prevent XDP from > > > > > modifying its own buffer descriptor). > > > > > > > > Thank you for highlighting this concern. The header layout differs > > > > slightly between small and mergeable mode. Taking 'mergeable mode' as > > > > an example, after calling xdp_prepare_buff the layout of xdp_buff > > > > would be as depicted in the diagram below, > > > > > > > > buf > > > >| > > > >v > > > > +--+--+-+ > > > > | xdp headroom | virtio header| packet | > > > > | (256 bytes) | (20 bytes) | content | > > > > +--+--+-+ > > > > ^ ^ > > > > | | > > > > data_hard_startdata > > > > data_meta > > > > > > > > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little > > > > towards 'data_hard_start', it would point to the inline header, thus > > > > potentially allowing the XDP program to access the inline header. Fairly late to the thread sorry. Given above layout does it make sense to just delay extraction to the kfunc as suggested above? Sure the XDP program could smash the entry in virtio header, but this is already the case for anything else there. A program writing over the virtio header is likely buggy anyways. Worse that might happen is bad rss values and mappings? I like seeing more use cases for the hints though. Thanks! Joh
RE: [syzbot] WARNING: refcount bug in sk_psock_get
syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:9c54130c Add linux-next specific files for 20210406 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=17d8d7aad0 > kernel config: https://syzkaller.appspot.com/x/.config?x=d125958c3995ddcd > dashboard link: https://syzkaller.appspot.com/bug?extid=b54a1ce86ba4a623b7f0 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1729797ed0 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1190f46ad0 > > The issue was bisected to: > > commit 997acaf6b4b59c6a9c259740312a69ea549cc684 > Author: Mark Rutland > Date: Mon Jan 11 15:37:07 2021 + > > lockdep: report broken irq restoration > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=11a6cc96d0 > final oops: https://syzkaller.appspot.com/x/report.txt?x=13a6cc96d0 > console output: https://syzkaller.appspot.com/x/log.txt?x=15a6cc96d0 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+b54a1ce86ba4a623b...@syzkaller.appspotmail.com > Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration") > > [ cut here ] > refcount_t: saturated; leaking memory. > WARNING: CPU: 1 PID: 8414 at lib/refcount.c:19 > refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 > Modules linked in: > CPU: 1 PID: 8414 Comm: syz-executor793 Not tainted > 5.12.0-rc6-next-20210406-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 > Code: 1d 69 0c e6 09 31 ff 89 de e8 c8 b4 a6 fd 84 db 75 ab e8 0f ae a6 fd 48 > c7 c7 e0 52 c2 89 c6 05 49 0c e6 09 01 e8 91 0f 00 05 <0f> 0b eb 8f e8 f3 ad > a6 fd 0f b6 1d 33 0c e6 09 31 ff 89 de e8 93 > RSP: 0018:c9eef388 EFLAGS: 00010282 > RAX: RBX: RCX: > RDX: 88801bbdd580 RSI: 815c2e05 RDI: f520001dde63 > RBP: R08: R09: > R10: 815bcc6e R11: R12: 1920001dde74 > R13: 90200301 R14: 888026e0 R15: c9eef3c0 > FS: 01422300() GS:8880b9d0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 2000 CR3: 12b3b000 CR4: 001506e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > __refcount_add_not_zero include/linux/refcount.h:163 [inline] > __refcount_inc_not_zero include/linux/refcount.h:227 [inline] > refcount_inc_not_zero include/linux/refcount.h:245 [inline] > sk_psock_get+0x3b0/0x400 include/linux/skmsg.h:435 > bpf_exec_tx_verdict+0x11e/0x11a0 net/tls/tls_sw.c:799 > tls_sw_sendmsg+0xa41/0x1800 net/tls/tls_sw.c:1013 > inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821 [...] This is likely a problem with latest round of sockmap patches I'll tke a look.
RE: [syzbot] WARNING: suspicious RCU usage in tcp_bpf_update_proto
syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:514e1150 net: x25: Queue received packets in the drivers i.. > git tree: net-next > console output: https://syzkaller.appspot.com/x/log.txt?x=112a8831d0 > kernel config: https://syzkaller.appspot.com/x/.config?x=7eff0f22b8563a5f > dashboard link: https://syzkaller.appspot.com/bug?extid=320a3bc8d80f478c37e4 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1532d711d0 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15f44c5ed0 > > The issue was bisected to: > > commit 4dfe6bd94959222e18d512bdf15f6bf9edb9c27c > Author: Rustam Kovhaev > Date: Wed Feb 24 20:00:30 2021 + > > ntfs: check for valid standard information attribute > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=16207a81d0 > final oops: https://syzkaller.appspot.com/x/report.txt?x=15207a81d0 > console output: https://syzkaller.appspot.com/x/log.txt?x=11207a81d0 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+320a3bc8d80f478c3...@syzkaller.appspotmail.com > Fixes: 4dfe6bd94959 ("ntfs: check for valid standard information attribute") > > = > WARNING: suspicious RCU usage > 5.12.0-rc4-syzkaller #0 Not tainted > - > include/linux/skmsg.h:286 suspicious rcu_dereference_check() usage! > > other info that might help us debug this: > > > rcu_scheduler_active = 2, debug_locks = 1 > 1 lock held by syz-executor383/8454: > #0: 888013a99b48 (clock-AF_INET){++..}-{2:2}, at: > sk_psock_drop+0x2c/0x460 net/core/skmsg.c:788 > > stack backtrace: > CPU: 1 PID: 8454 Comm: syz-executor383 Not tainted 5.12.0-rc4-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:79 [inline] > dump_stack+0x141/0x1d7 lib/dump_stack.c:120 > sk_psock include/linux/skmsg.h:286 [inline] > tcp_bpf_update_proto+0x530/0x5f0 net/ipv4/tcp_bpf.c:504 > sk_psock_restore_proto include/linux/skmsg.h:408 [inline] > sk_psock_drop+0xdf/0x460 net/core/skmsg.c:789 > sk_psock_put include/linux/skmsg.h:446 [inline] > tcp_bpf_recvmsg+0x42d/0x480 net/ipv4/tcp_bpf.c:208 The bisection is bogus, but we will get a fix for this ASAP. The real commit is, commit 8a59f9d1e3d4340659fdfee8879dc09a6f2546e1 Author: Cong Wang Date: Tue Mar 30 19:32:31 2021 -0700 sock: Introduce sk->sk_prot->psock_update_sk_prot() Thanks, John
RE: [PATCH bpf-next 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros
Pedro Tammela wrote: > This macro was refactored out of the bpf selftests. > > Since percpu values are rounded up to '8' in the kernel, a careless > user in userspace might encounter unexpected values when parsing the > output of the batched operations. > > Now that both array and hash maps have support for batched ops in the > percpu variant, let's provide a convenient macro to declare percpu map > value types. > > Updates the tests to a "reference" usage of the new macro. > > Signed-off-by: Pedro Tammela > --- Other than the initial patch needing a bit of description the series looks good to me. Thanks.
RE: [PATCH bpf-next 1/3] bpf: add batched ops support for percpu array
Pedro Tammela wrote: > Suggested-by: Jamal Hadi Salim > Signed-off-by: Pedro Tammela > --- A commit message describing some of the change details and a note it uses the for-each cpu copies (same as normal syscall on percpu map) and not the per-cpu ones would be nice. I at least had to go and check the generic_map* batch operations. Also something about why generic_map_delete_batch is omitted? > kernel/bpf/arraymap.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > index 463d25e1e67e..3c4105603f9d 100644 > --- a/kernel/bpf/arraymap.c > +++ b/kernel/bpf/arraymap.c > @@ -698,6 +698,8 @@ const struct bpf_map_ops percpu_array_map_ops = { > .map_delete_elem = array_map_delete_elem, > .map_seq_show_elem = percpu_array_map_seq_show_elem, > .map_check_btf = array_map_check_btf, > + .map_lookup_batch = generic_map_lookup_batch, > + .map_update_batch = generic_map_update_batch, > .map_set_for_each_callback_args = map_set_for_each_callback_args, > .map_for_each_callback = bpf_for_each_array_elem, > .map_btf_name = "bpf_array", > -- > 2.25.1 >
RE: [PATCH v7 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
Alexander Lobakin wrote: > This series introduces XSK generic zerocopy xmit by adding XSK umem > pages as skb frags instead of copying data to linear space. > The only requirement for this for drivers is to be able to xmit skbs > with skb_headlen(skb) == 0, i.e. all data including hard headers > starts from frag 0. > To indicate whether a particular driver supports this, a new netdev > priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net > as it's already capable of doing it). So consider implementing this > in your drivers to greatly speed-up generic XSK xmit. [...] > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data: > > size64 512 10241500 > copy1916747 1775988 1600203 1440054 > page1974058 1953655 1945463 1904478 > percent 3.0%10.0% 21.58% 32.3% > For the series, but might be good to get Dave or Jakub to check 2/6 to be sure they agree. Acked-by: John Fastabend
Re: RE: [PATCH v7 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
Xuan Zhuo wrote: > On Wed, 17 Feb 2021 16:46:04 -0800, John Fastabend > wrote: > > Alexander Lobakin wrote: > > > From: Xuan Zhuo > > > > > > This patch is used to construct skb based on page to save memory copy > > > overhead. > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > directly construct skb. If this feature is not supported, it is still > > > necessary to copy data to construct skb. > > > > > > Performance Testing > > > > > > The test environment is Aliyun ECS server. > > > Test cmd: > > > ``` > > > xdpsock -i eth0 -t -S -s > > > ``` > > > > > > Test result data: > > > > > > size64 512 10241500 > > > copy1916747 1775988 1600203 1440054 > > > page1974058 1953655 1945463 1904478 > > > percent 3.0%10.0% 21.58% 32.3% > > > > > > Signed-off-by: Xuan Zhuo > > > Reviewed-by: Dust Li > > > [ alobakin: > > > - expand subject to make it clearer; > > > - improve skb->truesize calculation; > > > - reserve some headroom in skb for drivers; > > > - tailroom is not needed as skb is non-linear ] > > > Signed-off-by: Alexander Lobakin > > > Acked-by: Magnus Karlsson > > > --- > > > > [...] > > > > > + buffer = xsk_buff_raw_get_data(pool, addr); > > > + offset = offset_in_page(buffer); > > > + addr = buffer - pool->addrs; > > > + > > > + for (copied = 0, i = 0; copied < len; i++) { > > > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; > > > > Looks like we could walk off the end of pgs[] if len is larger than > > the number of pgs? Do we need to guard against a misconfigured socket > > causing a panic here? AFAIU len here is read from the user space > > descriptor so is under user control. Or maybe I missed a check somewhere. > > > > Thanks, > > John > > > > Don't worry about this, the legality of desc has been checked. > > xskq_cons_peek_desc -> xskq_cons_read_desc -> >xskq_cons_is_valid_desc -> xp_validate_desc Ah OK I didn't dig past the cons_read_desc(). In that case LGTM. Acked-by: John Fastabend
RE: [PATCH v7 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
Alexander Lobakin wrote: > From: Xuan Zhuo > > This patch is used to construct skb based on page to save memory copy > overhead. > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > directly construct skb. If this feature is not supported, it is still > necessary to copy data to construct skb. > > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data: > > size64 512 10241500 > copy1916747 1775988 1600203 1440054 > page1974058 1953655 1945463 1904478 > percent 3.0%10.0% 21.58% 32.3% > > Signed-off-by: Xuan Zhuo > Reviewed-by: Dust Li > [ alobakin: > - expand subject to make it clearer; > - improve skb->truesize calculation; > - reserve some headroom in skb for drivers; > - tailroom is not needed as skb is non-linear ] > Signed-off-by: Alexander Lobakin > Acked-by: Magnus Karlsson > --- [...] > + buffer = xsk_buff_raw_get_data(pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - pool->addrs; > + > + for (copied = 0, i = 0; copied < len; i++) { > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; Looks like we could walk off the end of pgs[] if len is larger than the number of pgs? Do we need to guard against a misconfigured socket causing a panic here? AFAIU len here is read from the user space descriptor so is under user control. Or maybe I missed a check somewhere. Thanks, John > + get_page(page); > + > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += ts; > + > + refcount_add(ts, &xs->sk.sk_wmem_alloc); > + > + return skb; > +} > +
RE: [PATCH bpf-next v6 00/11] Atomics for eBPF
Brendan Jackman wrote: > Happy new year everyone, and thanks once again for the reviews. > > There's still one unresolved review comment from John[3] but I don't > think it needs to block the patchset as it stands, it can be a > separate patch. Hope that's OK. Its ok on my side if it comes as a follow up. I think it limits the use cases without it, but no reason to hold up the series IMO.
RE: [PATCH bpf-next] xsk: build skb by page
Xuan Zhuo wrote: > This patch is used to construct skb based on page to save memory copy > overhead. > > Taking into account the problem of addr unaligned, and the > possibility of frame size greater than page in the future. > > Signed-off-by: Xuan Zhuo > --- > net/xdp/xsk.c | 68 > --- > 1 file changed, 51 insertions(+), 17 deletions(-) > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index ac4a317..7cab40f 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -430,6 +430,55 @@ static void xsk_destruct_skb(struct sk_buff *skb) > sock_wfree(skb); > } > > +static struct sk_buff *xsk_build_skb_bypage(struct xdp_sock *xs, struct > xdp_desc *desc) > +{ > + char *buffer; > + u64 addr; > + u32 len, offset, copy, copied; > + int err, i; > + struct page *page; > + struct sk_buff *skb; > + > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); Because this is just grabbing an skb did you consider build_skb? > + if (unlikely(!skb)) > + return NULL; I think it would be best to push err back to caller here with ERR_PTR(). > + > + addr = desc->addr; > + len = desc->len; > + > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - (char *)xs->pool->addrs; > + > + for (copied = 0, i = 0; copied < len; ++i) { > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > + > + get_page(page); Is it obvious why this get_page() is needed? Maybe a small comment would be nice. Something like, "we need to inc refcnt on page to ensure skb does not release page from pool". > + > + copy = min((u32)(PAGE_SIZE - offset), len - copied); > + nit: take it or leave it, seems like a lot of new lines imo. I would just put all these together. Not really important though. > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += len; > + > + refcount_add(len, &xs->sk.sk_wmem_alloc); > + > + skb->dev = xs->dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; > +} > + > static int xsk_generic_xmit(struct sock *sk) > { > struct xdp_sock *xs = xdp_sk(sk); > @@ -445,40 +494,25 @@ static int xsk_generic_xmit(struct sock *sk) > goto out; > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { > - char *buffer; > - u64 addr; > - u32 len; > - > if (max_batch-- == 0) { > err = -EAGAIN; > goto out; > } > > - len = desc.len; > - skb = sock_alloc_send_skb(sk, len, 1, &err); > + skb = xsk_build_skb_bypage(xs, &desc); > if (unlikely(!skb)) Is err set here? Either way if skb is an ERR_PTR we can use that here for better error handling. > goto out; > > - skb_put(skb, len); > - addr = desc.addr; > - buffer = xsk_buff_raw_get_data(xs->pool, addr); > - err = skb_store_bits(skb, 0, buffer, len); > /* This is the backpressure mechanism for the Tx path. >* Reserve space in the completion queue and only proceed >* if there is space in it. This avoids having to implement >* any buffering in the Tx path. >*/ > - if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) { > + if (xskq_prod_reserve(xs->pool->cq)) { > kfree_skb(skb); Same here, do we need to set err now that its not explicit above in err = skb_store_bits... > goto out; > } > > - skb->dev = xs->dev; > - skb->priority = sk->sk_priority; > - skb->mark = sk->sk_mark; > - skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr; > - skb->destructor = xsk_destruct_skb; > - > err = __dev_direct_xmit(skb, xs->queue_id); > if (err == NETDEV_TX_BUSY) { > /* Tell user-space to retry the send */ > -- > 1.8.3.1 >
Re: [PATCH bpf-next v4 04/11] bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
Brendan Jackman wrote: > Hi John, thanks a lot for the reviews! > > On Mon, Dec 07, 2020 at 01:56:53PM -0800, John Fastabend wrote: > > Brendan Jackman wrote: > > > A subsequent patch will add additional atomic operations. These new > > > operations will use the same opcode field as the existing XADD, with > > > the immediate discriminating different operations. > > > > > > In preparation, rename the instruction mode BPF_ATOMIC and start > > > calling the zero immediate BPF_ADD. > > > > > > This is possible (doesn't break existing valid BPF progs) because the > > > immediate field is currently reserved MBZ and BPF_ADD is zero. > > > > > > All uses are removed from the tree but the BPF_XADD definition is > > > kept around to avoid breaking builds for people including kernel > > > headers. > > > > > > Signed-off-by: Brendan Jackman > > > --- [...] > > > + case BPF_STX | BPF_ATOMIC | BPF_W: > > > + case BPF_STX | BPF_ATOMIC | BPF_DW: > > > + if (insn->imm != BPF_ADD) { > > > + pr_err("bpf-jit: not supported: atomic operation %02x > > > ***\n", > > > +insn->imm); > > > + return -EINVAL; > > > + } > > > > Can we standardize the error across jits and the error return code? It seems > > odd that we use pr_err, pr_info_once, pr_err_ratelimited and then return > > ENOTSUPP, EFAULT or EINVAL. > > That would be a noble cause but I don't think it makes sense in this > patchset: they are already inconsistent, so here I've gone for intra-JIT > consistency over inter-JIT consistency. > > I think it would be more annoying, for example, if the s390 JIT returned > -EOPNOTSUPP for a bad atomic but -1 for other unsupported ops, than it > is already that the s390 JIT returns -1 where the MIPS returns -EINVAL. ok works for me thanks for the explanation.
RE: [PATCH bpf-next v4 07/11] bpf: Add instructions for atomic_[cmp]xchg
Brendan Jackman wrote: > This adds two atomic opcodes, both of which include the BPF_FETCH > flag. XCHG without the BPF_FETCH flag would naturally encode > atomic_set. This is not supported because it would be of limited > value to userspace (it doesn't imply any barriers). CMPXCHG without > BPF_FETCH woulud be an atomic compare-and-write. We don't have such > an operation in the kernel so it isn't provided to BPF either. > > There are two significant design decisions made for the CMPXCHG > instruction: > > - To solve the issue that this operation fundamentally has 3 >operands, but we only have two register fields. Therefore the >operand we compare against (the kernel's API calls it 'old') is >hard-coded to be R0. x86 has similar design (and A64 doesn't >have this problem). > >A potential alternative might be to encode the other operand's >register number in the immediate field. > > - The kernel's atomic_cmpxchg returns the old value, while the C11 >userspace APIs return a boolean indicating the comparison >result. Which should BPF do? A64 returns the old value. x86 returns >the old value in the hard-coded register (and also sets a >flag). That means return-old-value is easier to JIT. > > Signed-off-by: Brendan Jackman > --- Sorry if this is a dup, client crashed while I sent the previous version and don't see it on the list. > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -3608,11 +3608,14 @@ static int check_mem_access(struct bpf_verifier_env > *env, int insn_idx, u32 regn > > static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct > bpf_insn *insn) > { > + int load_reg; > int err; > > switch (insn->imm) { > case BPF_ADD: > case BPF_ADD | BPF_FETCH: > + case BPF_XCHG: > + case BPF_CMPXCHG: > break; > default: > verbose(env, "BPF_ATOMIC uses invalid atomic opcode %02x\n", > insn->imm); > @@ -3634,6 +3637,13 @@ static int check_atomic(struct bpf_verifier_env *env, > int insn_idx, struct bpf_i > if (err) > return err; > > + if (insn->imm == BPF_CMPXCHG) { > + /* Check comparison of R0 with memory location */ > + err = check_reg_arg(env, BPF_REG_0, SRC_OP); > + if (err) > + return err; > + } > + Need to think a bit more on this, but do we need to update is_reg64() here as well? > if (is_pointer_value(env, insn->src_reg)) { > verbose(env, "R%d leaks addr into mem\n", insn->src_reg); > return -EACCES; > @@ -3664,8 +3674,13 @@ static int check_atomic(struct bpf_verifier_env *env, > int insn_idx, struct bpf_i > if (!(insn->imm & BPF_FETCH)) > return 0; > > - /* check and record load of old value into src reg */ > - err = check_reg_arg(env, insn->src_reg, DST_OP); > + if (insn->imm == BPF_CMPXCHG) > + load_reg = BPF_REG_0; > + else > + load_reg = insn->src_reg; > + > + /* check and record load of old value */ > + err = check_reg_arg(env, load_reg, DST_OP); > if (err) > return err; > Thanks, John
RE: [PATCH bpf-next v4 07/11] bpf: Add instructions for atomic_[cmp]xchg
Brendan Jackman wrote: > This adds two atomic opcodes, both of which include the BPF_FETCH > flag. XCHG without the BPF_FETCH flag would naturally encode > atomic_set. This is not supported because it would be of limited > value to userspace (it doesn't imply any barriers). CMPXCHG without > BPF_FETCH woulud be an atomic compare-and-write. We don't have such > an operation in the kernel so it isn't provided to BPF either. > > There are two significant design decisions made for the CMPXCHG > instruction: > > - To solve the issue that this operation fundamentally has 3 >operands, but we only have two register fields. Therefore the >operand we compare against (the kernel's API calls it 'old') is >hard-coded to be R0. x86 has similar design (and A64 doesn't >have this problem). > >A potential alternative might be to encode the other operand's >register number in the immediate field. > > - The kernel's atomic_cmpxchg returns the old value, while the C11 >userspace APIs return a boolean indicating the comparison >result. Which should BPF do? A64 returns the old value. x86 returns >the old value in the hard-coded register (and also sets a >flag). That means return-old-value is easier to JIT. Just a nit as it looks like perhaps we get one more spin here. Would be nice to be explicit about what the code does here. The above reads like it could go either way. Just an extra line "So we use ...' would be enough. > > Signed-off-by: Brendan Jackman > --- One question below. > arch/x86/net/bpf_jit_comp.c| 8 > include/linux/filter.h | 22 ++ > include/uapi/linux/bpf.h | 4 +++- > kernel/bpf/core.c | 20 > kernel/bpf/disasm.c| 15 +++ > kernel/bpf/verifier.c | 19 +-- > tools/include/linux/filter.h | 22 ++ > tools/include/uapi/linux/bpf.h | 4 +++- > 8 files changed, 110 insertions(+), 4 deletions(-) > [...] > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index f8c4e809297d..f5f4460b3e4e 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -3608,11 +3608,14 @@ static int check_mem_access(struct bpf_verifier_env > *env, int insn_idx, u32 regn > > static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct > bpf_insn *insn) > { > + int load_reg; > int err; > > switch (insn->imm) { > case BPF_ADD: > case BPF_ADD | BPF_FETCH: > + case BPF_XCHG: > + case BPF_CMPXCHG: > break; > default: > verbose(env, "BPF_ATOMIC uses invalid atomic opcode %02x\n", > insn->imm); > @@ -3634,6 +3637,13 @@ static int check_atomic(struct bpf_verifier_env *env, > int insn_idx, struct bpf_i > if (err) > return err; > > + if (insn->imm == BPF_CMPXCHG) { > + /* Check comparison of R0 with memory location */ > + err = check_reg_arg(env, BPF_REG_0, SRC_OP); > + if (err) > + return err; > + } > + I need to think a bit more about it, but do we need to update is_reg64() at all for these? > if (is_pointer_value(env, insn->src_reg)) { > verbose(env, "R%d leaks addr into mem\n", insn->src_reg); > return -EACCES; > @@ -3664,8 +3674,13 @@ static int check_atomic(struct bpf_verifier_env *env, > int insn_idx, struct bpf_i > if (!(insn->imm & BPF_FETCH)) > return 0; > > - /* check and record load of old value into src reg */ > - err = check_reg_arg(env, insn->src_reg, DST_OP); > + if (insn->imm == BPF_CMPXCHG) > + load_reg = BPF_REG_0; > + else > + load_reg = insn->src_reg; > + > + /* check and record load of old value */ > + err = check_reg_arg(env, load_reg, DST_OP); > if (err) > return err;
RE: [PATCH bpf-next v4 06/11] bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
Brendan Jackman wrote: > The BPF_FETCH field can be set in bpf_insn.imm, for BPF_ATOMIC > instructions, in order to have the previous value of the > atomically-modified memory location loaded into the src register > after an atomic op is carried out. > > Suggested-by: Yonghong Song > Signed-off-by: Brendan Jackman > --- I like Yonghong suggestion #define BPF_ATOMIC_FETCH_ADD(SIZE, DST, SRC, OFF) \ BPF_ATOMIC(SIZE, DST, SRC, OFF, BPF_ADD | BPF_FETCH) otherwise LGTM. One observation to consider below. Acked-by: John Fastabend > arch/x86/net/bpf_jit_comp.c| 4 > include/linux/filter.h | 1 + > include/uapi/linux/bpf.h | 3 +++ > kernel/bpf/core.c | 13 + > kernel/bpf/disasm.c| 7 +++ > kernel/bpf/verifier.c | 33 - > tools/include/linux/filter.h | 11 +++ > tools/include/uapi/linux/bpf.h | 3 +++ > 8 files changed, 66 insertions(+), 9 deletions(-) [...] > @@ -3652,8 +3656,20 @@ static int check_atomic(struct bpf_verifier_env *env, > int insn_idx, struct bpf_i > return err; > > /* check whether we can write into the same memory */ > - return check_mem_access(env, insn_idx, insn->dst_reg, insn->off, > - BPF_SIZE(insn->code), BPF_WRITE, -1, true); > + err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off, > +BPF_SIZE(insn->code), BPF_WRITE, -1, true); > + if (err) > + return err; > + > + if (!(insn->imm & BPF_FETCH)) > + return 0; > + > + /* check and record load of old value into src reg */ > + err = check_reg_arg(env, insn->src_reg, DST_OP); This will mark the reg unknown. I think this is fine here. Might be nice to carry bounds through though if possible > + if (err) > + return err; > + > + return 0; > } >
RE: [PATCH bpf-next v4 05/11] bpf: Move BPF_STX reserved field check into BPF_STX verifier code
Brendan Jackman wrote: > I can't find a reason why this code is in resolve_pseudo_ldimm64; > since I'll be modifying it in a subsequent commit, tidy it up. > > Signed-off-by: Brendan Jackman > --- > kernel/bpf/verifier.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) Acked-by: John Fastabend
RE: [PATCH bpf-next v4 04/11] bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
Brendan Jackman wrote: > A subsequent patch will add additional atomic operations. These new > operations will use the same opcode field as the existing XADD, with > the immediate discriminating different operations. > > In preparation, rename the instruction mode BPF_ATOMIC and start > calling the zero immediate BPF_ADD. > > This is possible (doesn't break existing valid BPF progs) because the > immediate field is currently reserved MBZ and BPF_ADD is zero. > > All uses are removed from the tree but the BPF_XADD definition is > kept around to avoid breaking builds for people including kernel > headers. > > Signed-off-by: Brendan Jackman > --- > Documentation/networking/filter.rst | 30 - > arch/arm/net/bpf_jit_32.c | 7 ++- > arch/arm64/net/bpf_jit_comp.c | 16 +-- > arch/mips/net/ebpf_jit.c | 11 +++-- > arch/powerpc/net/bpf_jit_comp64.c | 25 --- > arch/riscv/net/bpf_jit_comp32.c | 20 +++-- > arch/riscv/net/bpf_jit_comp64.c | 16 +-- > arch/s390/net/bpf_jit_comp.c | 27 ++- > arch/sparc/net/bpf_jit_comp_64.c | 17 +-- > arch/x86/net/bpf_jit_comp.c | 45 ++- > arch/x86/net/bpf_jit_comp32.c | 6 +-- > drivers/net/ethernet/netronome/nfp/bpf/jit.c | 14 -- > drivers/net/ethernet/netronome/nfp/bpf/main.h | 4 +- > .../net/ethernet/netronome/nfp/bpf/verifier.c | 15 --- > include/linux/filter.h| 29 ++-- > include/uapi/linux/bpf.h | 5 ++- > kernel/bpf/core.c | 31 + > kernel/bpf/disasm.c | 6 ++- > kernel/bpf/verifier.c | 24 +- > lib/test_bpf.c| 14 +++--- > samples/bpf/bpf_insn.h| 4 +- > samples/bpf/cookie_uid_helper_example.c | 6 +-- > samples/bpf/sock_example.c| 2 +- > samples/bpf/test_cgrp2_attach.c | 5 ++- > tools/include/linux/filter.h | 28 ++-- > tools/include/uapi/linux/bpf.h| 5 ++- > .../bpf/prog_tests/cgroup_attach_multi.c | 4 +- > .../selftests/bpf/test_cgroup_storage.c | 2 +- > tools/testing/selftests/bpf/verifier/ctx.c| 7 ++- > .../bpf/verifier/direct_packet_access.c | 4 +- > .../testing/selftests/bpf/verifier/leak_ptr.c | 10 ++--- > .../selftests/bpf/verifier/meta_access.c | 4 +- > tools/testing/selftests/bpf/verifier/unpriv.c | 3 +- > .../bpf/verifier/value_illegal_alu.c | 2 +- > tools/testing/selftests/bpf/verifier/xadd.c | 18 > 35 files changed, 317 insertions(+), 149 deletions(-) > [...] > +++ a/arch/mips/net/ebpf_jit.c [...] > - if (BPF_MODE(insn->code) == BPF_XADD) { > + if (BPF_MODE(insn->code) == BPF_ATOMIC) { > + if (insn->imm != BPF_ADD) { > + pr_err("ATOMIC OP %02x NOT HANDLED\n", > insn->imm); > + return -EINVAL; > + } > + > /* [...] > +++ b/arch/powerpc/net/bpf_jit_comp64.c > - case BPF_STX | BPF_XADD | BPF_W: > + case BPF_STX | BPF_ATOMIC | BPF_W: > + if (insn->imm != BPF_ADD) { > + pr_err_ratelimited( > + "eBPF filter atomic op code %02x (@%d) > unsupported\n", > + code, i); > + return -ENOTSUPP; > + } [...] > @@ -699,8 +707,15 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 > *image, > - case BPF_STX | BPF_XADD | BPF_DW: > + case BPF_STX | BPF_ATOMIC | BPF_DW: > + if (insn->imm != BPF_ADD) { > + pr_err_ratelimited( > + "eBPF filter atomic op code %02x (@%d) > unsupported\n", > + code, i); > + return -ENOTSUPP; > + } [...] > + case BPF_STX | BPF_ATOMIC | BPF_W: > + if (insn->imm != BPF_ADD) { > + pr_info_once( > + "bpf-jit: not supported: atomic operation %02x > ***\n", > + insn->imm); > + return -EFAULT; > + } [...] > + case BPF_STX | BPF_ATOMIC | BPF_W: > + case BPF_STX | BPF_ATOMIC | BPF_DW: > + if (insn->imm != BPF_ADD) { > + pr_err("bpf-jit: not supported: atomic operation %02x > ***\n", > +insn->imm); > + return -EINVAL; > + } Can we standardize the error across jits and the error return code? It seems odd that we
RE: [PATCH bpf-next v4 03/11] bpf: x86: Factor out a lookup table for some ALU opcodes
Brendan Jackman wrote: > A later commit will need to lookup a subset of these opcodes. To > avoid duplicating code, pull out a table. > > The shift opcodes won't be needed by that later commit, but they're > already duplicated, so fold them into the table anyway. > > Signed-off-by: Brendan Jackman > --- Acked-byy: John Fastabend
RE: [PATCH bpf-next v4 02/11] bpf: x86: Factor out emission of REX byte
Brendan Jackman wrote: > The JIT case for encoding atomic ops is about to get more > complicated. In order to make the review & resulting code easier, > let's factor out some shared helpers. > > Signed-off-by: Brendan Jackman > --- Acked-by: John Fastabend
RE: [PATCH bpf-next v4 01/11] bpf: x86: Factor out emission of ModR/M for *(reg + off)
Brendan Jackman wrote: > The case for JITing atomics is about to get more complicated. Let's > factor out some common code to make the review and result more > readable. > > NB the atomics code doesn't yet use the new helper - a subsequent > patch will add its use as a side-effect of other changes. > > Signed-off-by: Brendan Jackman > --- Small nit on style preference below. Acked-by: John Fastabend [...] > > @@ -1240,11 +1250,7 @@ st:if (is_imm8(insn->off)) > goto xadd; > case BPF_STX | BPF_XADD | BPF_DW: > EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01); > -xadd:if (is_imm8(insn->off)) > - EMIT2(add_2reg(0x40, dst_reg, src_reg), > insn->off); > - else > - EMIT1_off32(add_2reg(0x80, dst_reg, src_reg), > - insn->off); > +xadd:emit_modrm_dstoff(&prog, dst_reg, src_reg, > insn->off); I at least prefer the xadd on its own line above the emit_*(). That seems more consistent with the rest of the code in this file. The only other example like this is st:.
Re: [PATCH v2 bpf-next 00/13] Atomics for eBPF
Andrii Nakryiko wrote: > On Tue, Dec 1, 2020 at 9:53 PM John Fastabend > wrote: > > > > Yonghong Song wrote: > > > > > > > > > > [...] > > > > > > Great, this means that all existing valid uses of > > > > __sync_fetch_and_add() will generate BPF_XADD instructions and will > > > > work on old kernels, right? > > > > > > That is correct. > > > > > > > > > > > If that's the case, do we still need cpu=v4? The new instructions are > > > > *only* going to be generated if the user uses previously unsupported > > > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > > > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > > > > tautology then? > > > > > > This is a very good question. Essentially this boils to when users can > > > use the new functionality including meaningful return value of > > > __sync_fetch_and_add(). > > >(1). user can write a small bpf program to test the feature. If user > > > gets a failed compilation (fatal error), it won't be supported. > > > Otherwise, it is supported. > > >(2). compiler provides some way to tell user it is safe to use, e.g., > > > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > > > > > I guess since kernel already did a lot of feature discovery. Option (1) > > > is probably fine. > > > > For option (2) we can use BTF with kernel version check. If kernel is > > greater than kernel this lands in we use the the new instructions if > > not we use a slower locked version. That should work for all cases > > unless someone backports patches into an older case. > > Two different things: Clang support detection and kernel support > detection. You are talking about kernel detection, I and Yonghong were > talking about Clang detection and explicit cpu=v4 opt-in. > Ah right, catching up on email and reading the thread backwords I lost the context thanks! So, I live in a dev world where I control the build infrastructure so always know clang/llvm versions and features. What I don't know as well is where the program I just built might be run. So its a bit of an odd question from my perspective to ask if my clang supports feature X. If it doesn't support feature X and I want it we upgrade clang so that it does support it. I don't think we would ever write a program to test the assertion. Anyways thanks. > For kernel detection, if there is an enum value or type that gets > added along the feature, then with CO-RE built-ins it's easy to detect > and kernel dead code elimination will make sure that unsupported > instructions won't trip up the BPF verifier. Still need Clang support > to compile the program in the first place, though. +1 > > If there is no such BTF-based way to check, it is still possible to > try to load a trivial BPF program with __sync_fech_and_xxx() to do > feature detection and then use .rodata to turn off code paths relying > on a new instruction set. Right. > > > > > At least thats what I'll probably end up wrapping in a helper function.
Re: [PATCH v2 bpf-next 00/13] Atomics for eBPF
Yonghong Song wrote: > > [...] > > Great, this means that all existing valid uses of > > __sync_fetch_and_add() will generate BPF_XADD instructions and will > > work on old kernels, right? > > That is correct. > > > > > If that's the case, do we still need cpu=v4? The new instructions are > > *only* going to be generated if the user uses previously unsupported > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > > tautology then? > > This is a very good question. Essentially this boils to when users can > use the new functionality including meaningful return value of > __sync_fetch_and_add(). >(1). user can write a small bpf program to test the feature. If user > gets a failed compilation (fatal error), it won't be supported. > Otherwise, it is supported. >(2). compiler provides some way to tell user it is safe to use, e.g., > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > I guess since kernel already did a lot of feature discovery. Option (1) > is probably fine. For option (2) we can use BTF with kernel version check. If kernel is greater than kernel this lands in we use the the new instructions if not we use a slower locked version. That should work for all cases unless someone backports patches into an older case. At least thats what I'll probably end up wrapping in a helper function.
Re: [PATCH v3] bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
Andrii Nakryiko wrote: > On Tue, Nov 10, 2020 at 9:03 PM wrote: > > > > From: Kaixu Xia > > > > The unsigned variable datasec_id is assigned a return value from the call > > to check_pseudo_btf_id(), which may return negative error code. > > > > Fixes coccicheck warning: > > > > ./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared > > with zero: datasec_id > 0 > > > > Reported-by: Tosk Robot > > Signed-off-by: Kaixu Xia > > --- > > Looks good. > > Acked-by: Andrii Nakryiko > > [...] Acked-by: John Fastabend
Re: [PATCH v2 bpf] tools: bpftool: Add missing close before bpftool net attach exit
wanghai (M) wrote: > > 在 2020/11/10 12:33, John Fastabend 写道: > > Wang Hai wrote: > >> progfd is created by prog_parse_fd(), before 'bpftool net attach' exit, > >> it should be closed. > >> > >> Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP > >> on interface") > >> Signed-off-by: Wang Hai > >> --- [...] > > Alternatively we could add an 'err = 0' here, but above should never > > return a value >0 as far as I can see. > It's true that 'err > 0' doesn't exist currently , but adding 'err = 0' > would make the code clearer. Thanks for your advice. > >> +cleanup: > >> + close(progfd); > >> + return err; > >> } > >> > >> static int do_detach(int argc, char **argv) > >> -- > >> 2.17.1 > >> > Can it be fixed like this? > > --- a/tools/bpf/bpftool/net.c > +++ b/tools/bpf/bpftool/net.c > @@ -578,8 +578,8 @@ static int do_attach(int argc, char **argv) > > ifindex = net_parse_dev(&argc, &argv); > if (ifindex < 1) { > - close(progfd); > - return -EINVAL; > + err = -EINVAL; > + goto cleanup; > } > > if (argc) { > @@ -587,8 +587,8 @@ static int do_attach(int argc, char **argv) > overwrite = true; > } else { > p_err("expected 'overwrite', got: '%s'?", *argv); > - close(progfd); > - return -EINVAL; > + err = -EINVAL; > + goto cleanup; > } > } > > @@ -597,16 +597,19 @@ static int do_attach(int argc, char **argv) > err = do_attach_detach_xdp(progfd, attach_type, ifindex, > overwrite); > > - if (err < 0) { > + if (err) { > p_err("interface %s attach failed: %s", > attach_type_strings[attach_type], strerror(-err)); > - return err; > + goto cleanup; > } > > if (json_output) > jsonw_null(json_wtr); > > - return 0; > + ret = 0; > +cleanup: > + close(progfd); > + return err; > } > LGTM. Send a v3.
RE: [PATCH v2 bpf] tools: bpftool: Add missing close before bpftool net attach exit
Wang Hai wrote: > progfd is created by prog_parse_fd(), before 'bpftool net attach' exit, > it should be closed. > > Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP on > interface") > Signed-off-by: Wang Hai > --- > v1->v2: use cleanup tag instead of repeated closes > tools/bpf/bpftool/net.c | 14 -- > 1 file changed, 8 insertions(+), 6 deletions(-) > > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c > index 910e7bac6e9e..1ac7228167e6 100644 > --- a/tools/bpf/bpftool/net.c > +++ b/tools/bpf/bpftool/net.c > @@ -578,8 +578,8 @@ static int do_attach(int argc, char **argv) > > ifindex = net_parse_dev(&argc, &argv); > if (ifindex < 1) { > - close(progfd); > - return -EINVAL; > + err = -EINVAL; > + goto cleanup; > } > > if (argc) { > @@ -587,8 +587,8 @@ static int do_attach(int argc, char **argv) > overwrite = true; > } else { > p_err("expected 'overwrite', got: '%s'?", *argv); > - close(progfd); > - return -EINVAL; > + err = -EINVAL; > + goto cleanup; > } > } > > @@ -600,13 +600,15 @@ static int do_attach(int argc, char **argv) I think now that return value depends on this err it should be 'if (err)' otherwise we risk retunring non-zero error code from do_attach which will cause programs to fail. > if (err < 0) { if (err) { > p_err("interface %s attach failed: %s", > attach_type_strings[attach_type], strerror(-err)); > - return err; > + goto cleanup; > } > > if (json_output) > jsonw_null(json_wtr); > > - return 0; Alternatively we could add an 'err = 0' here, but above should never return a value >0 as far as I can see. Thanks, John > +cleanup: > + close(progfd); > + return err; > } > > static int do_detach(int argc, char **argv) > -- > 2.17.1 >
Re: [PATCH] net/xdp: remove unused macro REG_STATE_NEW
Alex Shi wrote: > > > 在 2020/11/7 上午12:13, Jesper Dangaard Brouer 写道: > > Hmm... REG_STATE_NEW is zero, so it is implicitly set via memset zero. > > But it is true that it is technically not directly used or referenced. > > > > It is mentioned in a comment, so please send V2 with this additional change: > > Hi Jesper, > > Thanks a lot for comments. here is the v2: > > From 2908d25bf2e1c90ad71a83ba056743f45da283e8 Mon Sep 17 00:00:00 2001 > From: Alex Shi > Date: Fri, 6 Nov 2020 13:41:58 +0800 > Subject: [PATCH v2] net/xdp: remove unused macro REG_STATE_NEW > > To tame gcc warning on it: > net/core/xdp.c:20:0: warning: macro "REG_STATE_NEW" is not used > [-Wunused-macros] > And change related comments as Jesper Dangaard Brouer suggested. > > Signed-off-by: Alex Shi > --- > net/core/xdp.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/core/xdp.c b/net/core/xdp.c > index 48aba933a5a8..0df5ee5682d9 100644 > --- a/net/core/xdp.c > +++ b/net/core/xdp.c > @@ -19,7 +19,6 @@ > #include > #include > > -#define REG_STATE_NEW0x0 > #define REG_STATE_REGISTERED 0x1 > #define REG_STATE_UNREGISTERED 0x2 > #define REG_STATE_UNUSED 0x3 I think having the define there makes it more readable and clear what the zero state is. But if we run with unused-macros I guess its even uglier to try and mark it with unused attribute. Acked-by: John Fastabend > @@ -175,7 +174,7 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, > return -ENODEV; > } > > - /* State either UNREGISTERED or NEW */ > + /* State either UNREGISTERED or zero */ > xdp_rxq_info_init(xdp_rxq); > xdp_rxq->dev = dev; > xdp_rxq->queue_index = queue_index; > -- > 1.8.3.1 >
Re: [PATCH bpf] tools: bpftool: Add missing close before bpftool net attach exit
Michal Rostecki wrote: > On 11/9/20 8:04 AM, Wang Hai wrote: > > progfd is created by prog_parse_fd(), before 'bpftool net attach' exit, > > it should be closed. > > > > Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP > > on interface") > > Signed-off-by: Wang Hai > > --- > > tools/bpf/bpftool/net.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c > > index 910e7bac6e9e..3e9b40e64fb0 100644 > > --- a/tools/bpf/bpftool/net.c > > +++ b/tools/bpf/bpftool/net.c > > @@ -600,12 +600,14 @@ static int do_attach(int argc, char **argv) > > if (err < 0) { > > p_err("interface %s attach failed: %s", > > attach_type_strings[attach_type], strerror(-err)); > > + close(progfd); > > return err; > > } > > > > if (json_output) > > jsonw_null(json_wtr); > > > > + close(progfd); > > return 0; > > } > > > > Nit - wouldn't it be better to create a `cleanup`/`out` section before > return and use goto, to avoid copying the `close` call? I agree would be nicer.
RE: [PATCH] net: sched: fix misspellings using misspell-fixer tool
menglong8.dong@ wrote: > From: Menglong Dong > > Some typos are found out by misspell-fixer tool: > > $ misspell-fixer -rnv ./net/sched/ > ./net/sched/act_api.c:686 > ./net/sched/act_bpf.c:68 > ./net/sched/cls_rsvp.h:241 > ./net/sched/em_cmp.c:44 > ./net/sched/sch_pie.c:408 > > Fix typos found by misspell-fixer. > > Signed-off-by: Menglong Dong > --- Hi, you will want to add net-next to the [PATCH *] line next time to make it clear this is for net-next. The contents make it obvious in this case though. Also I'm not sure why the bpf@ include but OK. Acked-by: John Fastabend
RE: [PATCH v3 bpf] trace: bpf: Fix passing zero to PTR_ERR()
Wang Qing wrote: > There is a bug when passing zero to PTR_ERR() and return. > Fix smatch err. > > Signed-off-by: Wang Qing > --- > kernel/trace/bpf_trace.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 4517c8b..5113fd4 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1198,7 +1198,7 @@ static int bpf_btf_printf_prepare(struct btf_ptr *ptr, > u32 btf_ptr_size, > *btf = bpf_get_btf_vmlinux(); > > if (IS_ERR_OR_NULL(*btf)) > - return PTR_ERR(*btf); > + return IS_ERR(*btf) ? PTR_ERR(*btf) : -EINVAL; > > if (ptr->type_id > 0) > *btf_id = ptr->type_id; > -- > 2.7.4 > Acked-by: John Fastabend
RE: [PATCH] bpf: remove duplicate include
Wang Qing wrote: > Remove duplicate header which is included twice. > > Signed-off-by: Wang Qing > --- > kernel/bpf/btf.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > index ed7d02e..6324de8 > --- a/kernel/bpf/btf.c > +++ b/kernel/bpf/btf.c > @@ -22,7 +22,6 @@ > #include > #include > #include > -#include > #include > > /* BTF (BPF Type Format) is the meta data format which describes > -- > 2.7.4 > Looks fine to me. But, these types of things should go to bpf-next I see no reason to push these into bpf tree.
RE: [PATCH] samples/bpf: remove duplicate include
menglong8.dong@ wrote: > From: Menglong Dong > > Obviously, 'bpf/bpf.h' in 'samples/bpf/hbm.c' is duplicated. > > Signed-off-by: Menglong Dong > --- Acked-by: John Fastabend
Re: [PATCH bpf-next v2 7/8] bpf: Add tests for task_local_storage
Alexei Starovoitov wrote: > On Tue, Nov 3, 2020 at 5:55 PM KP Singh wrote: > > > > [...] > > > > > > > > > > I saw the docs mention that these are not exposed to tracing programs > > > > due to > > > > insufficient preemption checks. Do you think it would be okay to allow > > > > them > > > > for LSM programs? > > > > > > hmm. Isn't it allowed already? > > > The verifier does: > > > if ((is_tracing_prog_type(prog_type) || > > > prog_type == BPF_PROG_TYPE_SOCKET_FILTER) && > > > map_value_has_spin_lock(map)) { > > > verbose(env, "tracing progs cannot use bpf_spin_lock > > > yet\n"); > > > return -EINVAL; > > > } > > > > > > BPF_PROG_TYPE_LSM is not in this list. > > > > The verifier does not have any problem, it's just that the helpers are not > > exposed to LSM programs via bpf_lsm_func_proto. > > > > So all we need is: > > > > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c > > index 61f8cc52fd5b..93383df2140b 100644 > > --- a/kernel/bpf/bpf_lsm.c > > +++ b/kernel/bpf/bpf_lsm.c > > @@ -63,6 +63,10 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const > > struct bpf_prog *prog) > > return &bpf_task_storage_get_proto; > > case BPF_FUNC_task_storage_delete: > > return &bpf_task_storage_delete_proto; > > + case BPF_FUNC_spin_lock: > > + return &bpf_spin_lock_proto; > > + case BPF_FUNC_spin_unlock: > > + return &bpf_spin_unlock_proto; > > Ahh. Yes. That should do it. Right now I don't see concerns with safety > of the bpf_spin_lock in bpf_lsm progs. What about sleepable lsm hooks? Normally we wouldn't expect to sleep with a spinlock held. Should we have a check to ensure programs bpf_spin_lock are not also sleepable?
Re: [PATCH] net: sockmap: Don't call bpf_prog_put() on NULL pointer
Jakub Sitnicki wrote: > On Mon, Oct 12, 2020 at 07:09 PM CEST, Alex Dewar wrote: > > If bpf_prog_inc_not_zero() fails for skb_parser, then bpf_prog_put() is > > called unconditionally on skb_verdict, even though it may be NULL. Fix > > and tidy up error path. > > > > Addresses-Coverity-ID: 1497799: Null pointer dereferences (FORWARD_NULL) > > Fixes: 743df8b7749f ("bpf, sockmap: Check skb_verdict and skb_parser > > programs explicitly") > > Signed-off-by: Alex Dewar > > --- > > Acked-by: Jakub Sitnicki Thanks. Jakub, any opinions on if we should just throw an error if users try to add a sock to a map with a parser but no verdict? At the moment we fall through and add the socket, but it wont do any receive parsing/verdict. At the moment I think its fine with above fix. The useful cases for RX are parser+verdict, verdict, and empty. Where empty is just used for redirects or other socket account tricks. Just something to keep in mind. Acked-by: John Fastabend
RE: [PATCH bpf] bpf: sockmap: add locking annotations to iterator
Lorenz Bauer wrote: > The sparse checker currently outputs the following warnings: > > include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in > 'sock_hash_seq_start' - wrong count at exit > include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in > 'sock_map_seq_start' - wrong count at exit > > Add the necessary __acquires and __release annotations to make the > iterator locking schema palatable to sparse. Also add __must_hold > for good measure. > > The kernel codebase uses both __acquires(rcu) and __acquires(RCU). > I couldn't find any guidance which one is preferred, so I used > what is easier to type out. > > Fixes: 0365351524d7 ("net: Allow iterating sockmap and sockhash") > Reported-by: kernel test robot > Signed-off-by: Lorenz Bauer > --- LGTM Acked-by: John Fastabend
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Cong Wang wrote: > On Thu, Sep 3, 2020 at 10:08 PM John Fastabend > wrote: > > Maybe this would unlock us, > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 7df6c9617321..9b09429103f1 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, > > struct Qdisc *q, > > > > if (q->flags & TCQ_F_NOLOCK) { > > rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; > > - qdisc_run(q); > > + __qdisc_run(q); > > > > if (unlikely(to_free)) > > kfree_skb_list(to_free); > > > > > > Per other thread we also need the state deactivated check added > > back. > > I guess no, because pfifo_dequeue() seems to require q->seqlock, > according to comments in qdisc_run(), so we can not just get rid of > qdisc_run_begin()/qdisc_run_end() here. > > Thanks. Seems we would have to revert this as well then, commit 021a17ed796b62383f7623f4fea73787abddad77 Author: Paolo Abeni Date: Tue May 15 16:24:37 2018 +0200 pfifo_fast: drop unneeded additional lock on dequeue After the previous patch, for NOLOCK qdiscs, q->seqlock is always held when the dequeue() is invoked, we can drop any additional locking to protect such operation. Then I think it should be safe. Back when I was working on the ptr ring implementation I opted not to do a case without the spinlock because the performance benefit was minimal in the benchmarks I was looking at. I assumed at some point it would be worth going back to it, but just changing those to the __ptr_ring* cases is not safe without a lock. I remember having a discussion with Tsirkin about the details, but would have to go through the mail servers to find it. FWIW the initial perf looked like this, (https://lwn.net/Articles/698135/) nolock pfifo_fast 1: 1417597 1407479 1418913 1439601 2: 1882009 1867799 1864374 1855950 4: 1806736 1804261 1803697 1806994 8: 1354318 1358686 1353145 1356645 12: 1331928 1333079 1333476 1335544 locked pfifo_fast 1: 1471479 1469142 1458825 1456788 2: 1746231 1749490 1753176 1753780 4: 1119626 1120515 1121478 1119220 8: 1001471 999308 1000318 1000776 12: 989269 992122 991590 986581 As you can see measurable improvement on many cores. But, actually worse if you have enough nic queues to map 1:1 with cores. nolock mq 1:1417768 1438712 1449092 1426775 2:2644099 2634961 2628939 2712867 4:4866133 4862802 4863396 4867423 8:9422061 9464986 9457825 9467619 12: 13854470 13213735 13664498 13213292 locked mq 1: 1448374 1444208 1437459 1437088 2: 2687963 2679221 2651059 2691630 4: 5153884 4684153 5091728 4635261 8: 9292395 9625869 9681835 9711651 12: 13553918 13682410 14084055 13946138 So only better if you have more cores than hardware queues which was the case on some of the devices we had at the time. Thanks, John
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Cong Wang wrote: > On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni wrote: > > > > On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote: > > > Can you test the attached one-line fix? I think we are overthinking, > > > probably all > > > we need here is a busy wait. > > > > I think that will solve, but I also think that will kill NOLOCK > > performances due to really increased contention. > > Yeah, we somehow end up with more locks (seqlock, skb array lock) > for lockless qdisc. What an irony... ;) I went back to the original nolock implementation code to try and figure out how this was working in the first place. After initial patch series we have this in __dev_xmit_skb() if (q->flags & TCQ_F_NOLOCK) { if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { __qdisc_drop(skb, &to_free); rc = NET_XMIT_DROP; } else { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; __qdisc_run(q); } if (unlikely(to_free)) kfree_skb_list(to_free); return rc; } One important piece here is we used __qdisc_run(q) instead of what we have there now qdisc_run(q). Here is the latest code, if (q->flags & TCQ_F_NOLOCK) { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; qdisc_run(q); ... __qdisc_run is going to always go into a qdisc_restart loop and dequeue packets. There is no check here to see if another CPU is running or not. Compare that to qdisc_run() static inline void qdisc_run(struct Qdisc *q) { if (qdisc_run_begin(q)) { __qdisc_run(q); qdisc_run_end(q); } } Here we have all the racing around qdisc_is_running() that seems unsolvable. Seems we flipped __qdisc_run to qdisc_run here 32f7b44d0f566 ("sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers"). Its not clear to me from thatpatch though why it was even done there? Maybe this would unlock us, diff --git a/net/core/dev.c b/net/core/dev.c index 7df6c9617321..9b09429103f1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, if (q->flags & TCQ_F_NOLOCK) { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; - qdisc_run(q); + __qdisc_run(q); if (unlikely(to_free)) kfree_skb_list(to_free); Per other thread we also need the state deactivated check added back. > > > > > At this point I fear we could consider reverting the NOLOCK stuff. > > I personally would hate doing so, but it looks like NOLOCK benefits are > > outweighed by its issues. > > I agree, NOLOCK brings more pains than gains. There are many race > conditions hidden in generic qdisc layer, another one is enqueue vs. > reset which is being discussed in another thread. Sure. Seems they crept in over time. I had some plans to write a lockless HTB implementation. But with fq+EDT with BPF it seems that it is no longer needed, we have a more generic/better solution. So I dropped it. Also most folks should really be using fq, fq_codel, etc. by default anyways. Using pfifo_fast alone is not ideal IMO. Thanks, John
Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc
Cong Wang wrote: > On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin wrote: > > > > On 2020/9/3 9:48, Cong Wang wrote: > > > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin > > > wrote: > > >> > > >> On 2020/9/3 8:35, Cong Wang wrote: > > >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin > > >>> wrote: > > > > On 2020/9/2 12:41, Cong Wang wrote: > > > On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin > > > wrote: > > >> > > >> On 2020/9/2 2:24, Cong Wang wrote: > > >>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin > > >>> wrote: > > > > Currently there is concurrent reset and enqueue operation for the > > same lockless qdisc when there is no lock to synchronize the > > q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in > > qdisc_deactivate() called by dev_deactivate_queue(), which may > > cause > > out-of-bounds access for priv->ring[] in hns3 driver if user has > > requested a smaller queue num when __dev_xmit_skb() still enqueue a > > skb with a larger queue_mapping after the corresponding qdisc is > > reset, and call hns3_nic_net_xmit() with that skb later. > > >>> > > >>> Can you be more specific here? Which call path requests a smaller > > >>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly > > >>> we already have a synchronize_net() there. > > >> > > >> When the netdevice is in active state, the synchronize_net() seems to > > >> do the correct work, as below: > > >> > > >> CPU 0: CPU1: > > >> __dev_queue_xmit() > > >> netif_set_real_num_tx_queues() > > >> rcu_read_lock_bh(); > > >> netdev_core_pick_tx(dev, skb, sb_dev); > > >> . > > >> . dev->real_num_tx_queues = > > >> txq; > > >> . . > > >> . . > > >> . synchronize_net(); > > >> . . > > >> q->enqueue(). > > >> . . > > >> rcu_read_unlock_bh(). > > >> qdisc_reset_all_tx_gt > > >> > > >> > > > > > > Right. > > > > > > > > >> but dev->real_num_tx_queues is not RCU-protected, maybe that is a > > >> problem > > >> too. > > >> > > >> The problem we hit is as below: > > >> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is > > >> called > > >> to deactive the netdevice when user requested a smaller queue num, > > >> and > > >> txq->qdisc is already changed to noop_qdisc when calling > > >> netif_set_real_num_tx_queues(), so the synchronize_net() in the > > >> function > > >> netif_set_real_num_tx_queues() does not help here. > > > > > > How could qdisc still be running after deactivating the device? > > > > qdisc could be running during the device deactivating process. > > > > The main process of changing channel number is as below: > > > > 1. dev_deactivate() > > 2. hns3 handware related setup > > 3. netif_set_real_num_tx_queues() > > 4. netif_tx_wake_all_queues() > > 5. dev_activate() > > > > During step 1, qdisc could be running while qdisc is resetting, so > > there could be skb left in the old qdisc(which will be restored back to > > txq->qdisc during dev_activate()), as below: > > > > CPU 0: CPU1: > > __dev_queue_xmit(): dev_deactivate_many(): > > rcu_read_lock_bh(); qdisc_deactivate(qdisc); > > q = rcu_dereference_bh(txq->qdisc); . > > netdev_core_pick_tx(dev, skb, sb_dev); . > > . > > . > > rcu_assign_pointer(dev_queue->qdisc, qdisc_default); > > . . > > . . > > . . > > . . > > q->enqueue(). > > >>> > > >>> > > >>> Well, like I said, if the deactivated bit were tested before > > >>> ->enqueue(), > > >>> there would be no packet queued after qdisc_deactivate(). Trying to unwind this through git history :/ Original code had a test_bit in dev_xmit_skb(), if (q->flags & TCQ_F_NOLOCK) { if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { __qdisc_drop(skb, &to_free); rc = NET_XMIT_DROP;
RE: [PATCH bpf-next 5/6] bpf: sockmap: allow update from BPF
John Fastabend wrote: > Lorenz Bauer wrote: > > Allow calling bpf_map_update_elem on sockmap and sockhash from a BPF > > context. The synchronization required for this is a bit fiddly: we > > need to prevent the socket from changing it's state while we add it > > to the sockmap, since we rely on getting a callback via > > sk_prot->unhash. However, we can't just lock_sock like in > > sock_map_sk_acquire because that might sleep. So instead we disable > > softirq processing and use bh_lock_sock to prevent further > > modification. > > > > Signed-off-by: Lorenz Bauer > > --- > > kernel/bpf/verifier.c | 6 -- > > net/core/sock_map.c | 24 > > 2 files changed, 28 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > index 47f9b94bb9d4..421fccf18dea 100644 > > --- a/kernel/bpf/verifier.c > > +++ b/kernel/bpf/verifier.c > > @@ -4254,7 +4254,8 @@ static int check_map_func_compatibility(struct > > bpf_verifier_env *env, > > func_id != BPF_FUNC_map_delete_elem && > > func_id != BPF_FUNC_msg_redirect_map && > > func_id != BPF_FUNC_sk_select_reuseport && > > - func_id != BPF_FUNC_map_lookup_elem) > > + func_id != BPF_FUNC_map_lookup_elem && > > + func_id != BPF_FUNC_map_update_elem) > > goto error; > > break; > > case BPF_MAP_TYPE_SOCKHASH: > > @@ -4263,7 +4264,8 @@ static int check_map_func_compatibility(struct > > bpf_verifier_env *env, > > func_id != BPF_FUNC_map_delete_elem && > > func_id != BPF_FUNC_msg_redirect_hash && > > func_id != BPF_FUNC_sk_select_reuseport && > > - func_id != BPF_FUNC_map_lookup_elem) > > + func_id != BPF_FUNC_map_lookup_elem && > > + func_id != BPF_FUNC_map_update_elem) > > I lost track of a detail here, map_lookup_elem should return > PTR_TO_MAP_VALUE_OR_NULL but if we want to feed that back into > the map_update_elem() we need to return PTR_TO_SOCKET_OR_NULL > and then presumably have a null check to get a PTR_TO_SOCKET > type as expect. > > Can we use the same logic for expected arg (previous patch) on the > ret_type. Or did I miss it:/ Need some coffee I guess. OK, I tracked this down. It looks like we rely on mark_ptr_or_null_reg() to update the reg->tyype to PTR_TO_SOCKET. I do wonder if it would be a bit more straight forward to do something similar to the previous patch and refine it earlier to PTR_TO_SOCKET_OR_NULL, but should be safe as-is for now. I still have the below question though. > > > goto error; > > break; > > case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 018367fb889f..b2c886c34566 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -603,6 +603,28 @@ int sock_map_update_elem_sys(struct bpf_map *map, void > > *key, > > return ret; > > } > > > > +static int sock_map_update_elem(struct bpf_map *map, void *key, > > + void *value, u64 flags) > > +{ > > + struct sock *sk = (struct sock *)value; > > + int ret; > > + > > + if (!sock_map_sk_is_suitable(sk)) > > + return -EOPNOTSUPP; > > + > > + local_bh_disable(); > > + bh_lock_sock(sk); > > How do ensure we are not being called from some context which > already has the bh_lock_sock() held? It seems we can call map_update_elem() > from any context, kprobes, tc, xdp, etc.? > > > + if (!sock_map_sk_state_allowed(sk)) > > + ret = -EOPNOTSUPP; > > + else if (map->map_type == BPF_MAP_TYPE_SOCKMAP) > > + ret = sock_map_update_common(map, *(u32 *)key, sk, flags); > > + else > > + ret = sock_hash_update_common(map, key, sk, flags); > > + bh_unlock_sock(sk); > > + local_bh_enable(); > > + return ret; > > +} > > + > > BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops, > >struct bpf_map *, map, void *, key, u64, flags) > > { > > @@ -687,6 +709,7 @@ const struct bpf_map_ops sock_map_ops = { > > .map_free = sock_map_free, > > .map_get_next_key = sock_map_get_next_key, > > .map_lookup_elem_sys_only = sock_map_lookup_sys, > > + .map_update_elem= so
RE: [PATCH bpf-next 5/6] bpf: sockmap: allow update from BPF
Lorenz Bauer wrote: > Allow calling bpf_map_update_elem on sockmap and sockhash from a BPF > context. The synchronization required for this is a bit fiddly: we > need to prevent the socket from changing it's state while we add it > to the sockmap, since we rely on getting a callback via > sk_prot->unhash. However, we can't just lock_sock like in > sock_map_sk_acquire because that might sleep. So instead we disable > softirq processing and use bh_lock_sock to prevent further > modification. > > Signed-off-by: Lorenz Bauer > --- > kernel/bpf/verifier.c | 6 -- > net/core/sock_map.c | 24 > 2 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 47f9b94bb9d4..421fccf18dea 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -4254,7 +4254,8 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > func_id != BPF_FUNC_map_delete_elem && > func_id != BPF_FUNC_msg_redirect_map && > func_id != BPF_FUNC_sk_select_reuseport && > - func_id != BPF_FUNC_map_lookup_elem) > + func_id != BPF_FUNC_map_lookup_elem && > + func_id != BPF_FUNC_map_update_elem) > goto error; > break; > case BPF_MAP_TYPE_SOCKHASH: > @@ -4263,7 +4264,8 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > func_id != BPF_FUNC_map_delete_elem && > func_id != BPF_FUNC_msg_redirect_hash && > func_id != BPF_FUNC_sk_select_reuseport && > - func_id != BPF_FUNC_map_lookup_elem) > + func_id != BPF_FUNC_map_lookup_elem && > + func_id != BPF_FUNC_map_update_elem) I lost track of a detail here, map_lookup_elem should return PTR_TO_MAP_VALUE_OR_NULL but if we want to feed that back into the map_update_elem() we need to return PTR_TO_SOCKET_OR_NULL and then presumably have a null check to get a PTR_TO_SOCKET type as expect. Can we use the same logic for expected arg (previous patch) on the ret_type. Or did I miss it:/ Need some coffee I guess. > goto error; > break; > case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 018367fb889f..b2c886c34566 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -603,6 +603,28 @@ int sock_map_update_elem_sys(struct bpf_map *map, void > *key, > return ret; > } > > +static int sock_map_update_elem(struct bpf_map *map, void *key, > + void *value, u64 flags) > +{ > + struct sock *sk = (struct sock *)value; > + int ret; > + > + if (!sock_map_sk_is_suitable(sk)) > + return -EOPNOTSUPP; > + > + local_bh_disable(); > + bh_lock_sock(sk); How do ensure we are not being called from some context which already has the bh_lock_sock() held? It seems we can call map_update_elem() from any context, kprobes, tc, xdp, etc.? > + if (!sock_map_sk_state_allowed(sk)) > + ret = -EOPNOTSUPP; > + else if (map->map_type == BPF_MAP_TYPE_SOCKMAP) > + ret = sock_map_update_common(map, *(u32 *)key, sk, flags); > + else > + ret = sock_hash_update_common(map, key, sk, flags); > + bh_unlock_sock(sk); > + local_bh_enable(); > + return ret; > +} > + > BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops, > struct bpf_map *, map, void *, key, u64, flags) > { > @@ -687,6 +709,7 @@ const struct bpf_map_ops sock_map_ops = { > .map_free = sock_map_free, > .map_get_next_key = sock_map_get_next_key, > .map_lookup_elem_sys_only = sock_map_lookup_sys, > + .map_update_elem= sock_map_update_elem, > .map_delete_elem= sock_map_delete_elem, > .map_lookup_elem= sock_map_lookup, > .map_release_uref = sock_map_release_progs, > @@ -1180,6 +1203,7 @@ const struct bpf_map_ops sock_hash_ops = { > .map_alloc = sock_hash_alloc, > .map_free = sock_hash_free, > .map_get_next_key = sock_hash_get_next_key, > + .map_update_elem= sock_map_update_elem, > .map_delete_elem= sock_hash_delete_elem, > .map_lookup_elem= sock_hash_lookup, > .map_lookup_elem_sys_only = sock_hash_lookup_sys, > -- > 2.25.1 >
RE: [PATCH bpf-next 4/6] bpf: override the meaning of ARG_PTR_TO_MAP_VALUE for sockmap and sockhash
Lorenz Bauer wrote: > The verifier assumes that map values are simple blobs of memory, and > therefore treats ARG_PTR_TO_MAP_VALUE, etc. as such. However, there are > map types where this isn't true. For example, sockmap and sockhash store > sockets. In general this isn't a big problem: we can just > write helpers that explicitly requests PTR_TO_SOCKET instead of > ARG_PTR_TO_MAP_VALUE. > > The one exception are the standard map helpers like map_update_elem, > map_lookup_elem, etc. Here it would be nice we could overload the > function prototype for different kinds of maps. Unfortunately, this > isn't entirely straight forward: > We only know the type of the map once we have resolved meta->map_ptr > in check_func_arg. This means we can't swap out the prototype > in check_helper_call until we're half way through the function. > > Instead, modify check_func_arg to treat ARG_PTR_TO_MAP_VALUE* to > mean "the native type for the map" instead of "pointer to memory" > for sockmap and sockhash. This means we don't have to modify the > function prototype at all > > Signed-off-by: Lorenz Bauer > --- > kernel/bpf/verifier.c | 40 > 1 file changed, 40 insertions(+) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index b6ccfce3bf4c..47f9b94bb9d4 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -3872,6 +3872,38 @@ static int int_ptr_type_to_size(enum bpf_arg_type type) > return -EINVAL; > } > > +static int override_map_arg_type(struct bpf_verifier_env *env, > + const struct bpf_call_arg_meta *meta, > + enum bpf_arg_type *arg_type) One nit can we rename this to refine_map_arg_type or resolve_map_arg_type I don't like the name "override" we are getting a more precise type here. > +{ > + if (!meta->map_ptr) { > + /* kernel subsystem misconfigured verifier */ > + verbose(env, "invalid map_ptr to access map->type\n"); > + return -EACCES; > + } > + > + switch (meta->map_ptr->map_type) { > + case BPF_MAP_TYPE_SOCKMAP: > + case BPF_MAP_TYPE_SOCKHASH: > + switch (*arg_type) { > + case ARG_PTR_TO_MAP_VALUE: > + *arg_type = ARG_PTR_TO_SOCKET; > + break; > + case ARG_PTR_TO_MAP_VALUE_OR_NULL: > + *arg_type = ARG_PTR_TO_SOCKET_OR_NULL; > + break; > + default: > + verbose(env, "invalid arg_type for sockmap/sockhash\n"); Might be worth pushing the arg_type into the verbose message so its obvious where the types went wrong. We will probably "know" just based on the switch in front of this, but users in general wont. Just a suggestion if you think its overkill go ahead and skip. > + return -EINVAL; > + } > + break; > + > + default: > + break; > + } > + return 0; > +} > + Otherwise LGTM. > static int check_func_arg(struct bpf_verifier_env *env, u32 arg, > struct bpf_call_arg_meta *meta, > const struct bpf_func_proto *fn) > @@ -3904,6 +3936,14 @@ static int check_func_arg(struct bpf_verifier_env > *env, u32 arg, > return -EACCES; > } > > + if (arg_type == ARG_PTR_TO_MAP_VALUE || > + arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE || > + arg_type == ARG_PTR_TO_MAP_VALUE_OR_NULL) { > + err = override_map_arg_type(env, meta, &arg_type); > + if (err) > + return err; > + } > + > if (arg_type == ARG_PTR_TO_MAP_KEY || > arg_type == ARG_PTR_TO_MAP_VALUE || > arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE || > -- > 2.25.1 >
RE: [PATCH bpf-next 3/6] bpf: sockmap: call sock_map_update_elem directly
Lorenz Bauer wrote: > Don't go via map->ops to call sock_map_update_elem, since we know > what function to call in bpf_map_update_value. Since > check_map_func_compatibility doesn't allow calling > BPF_FUNC_map_update_elem from BPF context, we can remove > ops->map_update_elem and rename the function to > sock_map_update_elem_sys. > > Signed-off-by: Lorenz Bauer > --- Acked-by: John Fastabend
RE: [PATCH bpf-next 2/6] bpf: sockmap: merge sockmap and sockhash update functions
Lorenz Bauer wrote: > Merge the two very similar functions sock_map_update_elem and > sock_hash_update_elem into one. > > Signed-off-by: Lorenz Bauer > --- > net/core/sock_map.c | 53 - > 1 file changed, 9 insertions(+), 44 deletions(-) > Fixup the warning, but otherwise Acked-by: John Fastabend
RE: [PATCH bpf-next 1/6] net: sk_msg: simplify sk_psock initialization
Lorenz Bauer wrote: > Initializing psock->sk_proto and other saved callbacks is only > done in sk_psock_update_proto, after sk_psock_init has returned. > The logic for this is difficult to follow, and needlessly complex. > > Instead, initialize psock->sk_proto whenever we allocate a new > psock. Additionally, assert the following invariants: > > * The SK has no ULP: ULP does it's own finagling of sk->sk_prot > * sk_user_data is unused: we need it to store sk_psock > > Protect our access to sk_user_data with sk_callback_lock, which > is what other users like reuseport arrays, etc. do. > > The result is that an sk_psock is always fully initialized, and > that psock->sk_proto is always the "original" struct proto. > The latter allows us to use psock->sk_proto when initializing > IPv6 TCP / UDP callbacks for sockmap. > > Signed-off-by: Lorenz Bauer > --- Looks like a nice bit of cleanup thanks. I think we might be able to fold up more of this code as well, but I'll take a look in a bit. Acked-by: John Fastabend
RE: [PATCH] kprobes: fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
Muchun Song wrote: > Fix compiler warning(as show below) for !CONFIG_KPROBES_ON_FTRACE. > > kernel/kprobes.c: In function 'kill_kprobe': > kernel/kprobes.c:1116:33: warning: statement with no effect > [-Wunused-value] > 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV) > | ^ > kernel/kprobes.c:2154:3: note: in expansion of macro > 'disarm_kprobe_ftrace' > 2154 | disarm_kprobe_ftrace(p); > > Link: https://lore.kernel.org/r/20200805142136.0331f...@canb.auug.org.au > > Reported-by: Stephen Rothwell > Fixes: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at > kprobe_ftrace_handler") > Signed-off-by: Muchun Song > --- Acked-by: John Fastabend
Re: BUG: unable to handle kernel NULL pointer dereference in bpf_prog_ADDR
Eric Dumazet wrote: > > > On 8/2/20 3:45 PM, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:ac3a0c84 Merge git://git.kernel.org/pub/scm/linux/kernel/g.. > > git tree: upstream > > console output: https://syzkaller.appspot.com/x/log.txt?x=1323497090 > > kernel config: https://syzkaller.appspot.com/x/.config?x=c0cfcf935bcc94d2 > > dashboard link: https://syzkaller.appspot.com/bug?extid=192a7fbbece55f740074 > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=141541ea90 > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+192a7fbbece55f740...@syzkaller.appspotmail.com > > > > BUG: kernel NULL pointer dereference, address: > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x) - not-present page > > PGD 9176a067 P4D 9176a067 PUD 9176b067 PMD 0 > > Oops: [#1] PREEMPT SMP KASAN > > CPU: 1 PID: 8142 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > RIP: 0010:bpf_prog_e48ebe87b99394c4+0x1f/0x590 > > Code: cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 48 81 ec > > 00 00 00 00 53 41 55 41 56 41 57 6a 00 31 c0 48 8b 47 28 <48> 8b 40 00 8b > > 80 00 01 00 00 5b 41 5f 41 5e 41 5d 5b c9 c3 cc cc > > RSP: 0018:c900038a7b00 EFLAGS: 00010246 > > RAX: RBX: dc00 RCX: dc00 > > RDX: 88808cfb0200 RSI: c9e7e038 RDI: c900038a7ca8 > > RBP: c900038a7b28 R08: R09: > > R10: R11: R12: c9e7e000 > > R13: c9e7e000 R14: 0001 R15: > > FS: 7fda07fef700() GS:8880ae70() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: CR3: 91769000 CR4: 001406e0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > Call Trace: > > bpf_prog_run_xdp include/linux/filter.h:734 [inline] > > bpf_test_run+0x221/0xc70 net/bpf/test_run.c:47 > > bpf_prog_test_run_xdp+0x2ca/0x510 net/bpf/test_run.c:524 > > bpf_prog_test_run kernel/bpf/syscall.c:2983 [inline] > > __do_sys_bpf+0x2117/0x4b10 kernel/bpf/syscall.c:4135 > > do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:384 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > RIP: 0033:0x45cc79 > > Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 > > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > > ff 0f 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > > RSP: 002b:7fda07feec78 EFLAGS: 0246 ORIG_RAX: 0141 > > RAX: ffda RBX: 1740 RCX: 0045cc79 > > RDX: 0028 RSI: 2080 RDI: 000a > > RBP: 0078bfe0 R08: R09: > > R10: R11: 0246 R12: 0078bfac > > R13: 7ffc3ef769bf R14: 7fda07fef9c0 R15: 0078bfac > > Modules linked in: > > CR2: > > ---[ end trace b2d24107e7fdae7d ]--- > > RIP: 0010:bpf_prog_e48ebe87b99394c4+0x1f/0x590 > > Code: cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 48 81 ec > > 00 00 00 00 53 41 55 41 56 41 57 6a 00 31 c0 48 8b 47 28 <48> 8b 40 00 8b > > 80 00 01 00 00 5b 41 5f 41 5e 41 5d 5b c9 c3 cc cc > > RSP: 0018:c900038a7b00 EFLAGS: 00010246 > > RAX: RBX: dc00 RCX: dc00 > > RDX: 88808cfb0200 RSI: c9e7e038 RDI: c900038a7ca8 > > RBP: c900038a7b28 R08: R09: > > R10: R11: R12: c9e7e000 > > R13: c9e7e000 R14: 0001 R15: > > FS: 7fda07fef700() GS:8880ae70() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: CR3: 91769000 CR4: 001406e0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > > > > > --- > > This report is generated by a bot. It may contain errors. > > See https://goo.gl/tpsmEJ for more information about syzbot. > > syzbot engineers can be reached at syzkal...@googlegroups.com. > > > > syzbot will keep track of this issue. See: > > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > syzbot can test patches for this issue, for details see: > > https://goo.gl/tpsmEJ#testing-patches > > > > > # > https://syzkaller.appspot.com/bug?id=d60883a0b19a778d2bcab55f3f6459467f4a3ea7 > # See https://goo.gl/kgGztJ for information about syzkaller reproducers. > #{"threaded":true,"collide":
Re: [PATCH] tools/bpf/bpftool: Fix wrong return value in do_dump()
Andrii Nakryiko wrote: > On Sun, Aug 2, 2020 at 4:16 AM Tianjia Zhang > wrote: > > > > In case of btf_id does not exist, a negative error code -ENOENT > > should be returned. > > > > Fixes: c93cc69004df3 ("bpftool: add ability to dump BTF types") > > Cc: Andrii Nakryiko > > Signed-off-by: Tianjia Zhang > > --- > > > Acked-by: Andrii Nakryiko > Acked-by: John Fastabend
Re: [PATCH bpf 1/2] flow_dissector: reject invalid attach_flags
Alexei Starovoitov wrote: > On Tue, Jun 16, 2020 at 1:30 AM Lorenz Bauer wrote: > > > > On Tue, 16 Jun 2020 at 04:55, Alexei Starovoitov > > wrote: > > > > > > On Mon, Jun 15, 2020 at 7:43 AM Lorenz Bauer wrote: > > > > > > > > On Fri, 12 Jun 2020 at 23:36, Alexei Starovoitov > > > > wrote: > > > > > > > > > > On Fri, Jun 12, 2020 at 9:02 AM Lorenz Bauer > > > > > wrote: > > > > > > > > > > > > Using BPF_PROG_ATTACH on a flow dissector program supports neither > > > > > > flags > > > > > > nor target_fd but accepts any value. Return EINVAL if either are > > > > > > non-zero. > > > > > > > > > > > > Signed-off-by: Lorenz Bauer > > > > > > Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog > > > > > > callbacks") > > > > > > --- > > > > > > kernel/bpf/net_namespace.c | 3 +++ > > > > > > 1 file changed, 3 insertions(+) > > > > > > > > > > > > diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c > > > > > > index 78cf061f8179..56133e78ae4f 100644 > > > > > > --- a/kernel/bpf/net_namespace.c > > > > > > +++ b/kernel/bpf/net_namespace.c > > > > > > @@ -192,6 +192,9 @@ int netns_bpf_prog_attach(const union bpf_attr > > > > > > *attr, struct bpf_prog *prog) > > > > > > struct net *net; > > > > > > int ret; > > > > > > > > > > > > + if (attr->attach_flags || attr->target_fd) > > > > > > + return -EINVAL; > > > > > > + > > > > > > > > > > In theory it makes sense, but how did you test it? > > > > > > > > Not properly it seems, sorry! > > > > > > > > > test_progs -t flow > > > > > fails 5 tests. > > > > > > > > I spent today digging through this, and the issue is actually more > > > > annoying than > > > > I thought. BPF_PROG_DETACH for sockmap and flow_dissector ignores > > > > attach_bpf_fd. The cgroup and lirc2 attach point use this to make sure > > > > that the > > > > program being detached is actually what user space expects. We actually > > > > have > > > > tests that set attach_bpf_fd for these to attach points, which tells > > > > me that this is > > > > an easy mistake to make. In sockmap case I didn't manage to think what multiple programs of the same type on the same map would look like so we can just remove whatever program is there. Is there a problem with this or is it that we just want the sanity check. > > > > > > > > Unfortunately I can't come up with a good fix that seems backportable: > > > > - Making sockmap and flow_dissector have the same semantics as cgroup > > > > and lirc2 requires a bunch of changes (probably a new function for > > > > sockmap) > > > > > > making flow dissector pass prog_fd as cg and lirc is certainly my > > > preference. > > > Especially since tests are passing fd user code is likely doing the same, > > > so breakage is unlikely. Also it wasn't done that long ago, so > > > we can backport far enough. > > > It will remove cap_net_admin ugly check in bpf_prog_detach() > > > which is the only exception now in cap model. > > > > SGTM. What about sockmap though? The code for that has been around for ages. > > you mean the second patch that enforces sock_map_get_from_fd doesn't > use attach_flags? > I think it didn't break anything, so enforcing is fine. I'm ok with enforcing it. > > or the detach part that doesn't use prog_fd ? > I'm not sure what's the best here. > At least from cap perspective it's fine because map_fd is there. > > John, wdyt? I think we can keep the current detach without the prog_fd as-is. And then add logic so that if the prog_fd is included we check it?
RE: [PATCH] [bpf] xdp_redirect_cpu_user: Fix null pointer dereference
Gaurav Singh wrote: > Memset() on the pointer right after malloc() can cause > a null pointer dereference if it failed to allocate memory. > Fix this by replacing malloc/memset with a single calloc(). > > Signed-off-by: Gaurav Singh > --- > samples/bpf/xdp_redirect_cpu_user.c | 11 +++ > 1 file changed, 3 insertions(+), 8 deletions(-) > > diff --git a/samples/bpf/xdp_redirect_cpu_user.c > b/samples/bpf/xdp_redirect_cpu_user.c > index f3468168982e..2ae7a9a1d950 100644 > --- a/samples/bpf/xdp_redirect_cpu_user.c > +++ b/samples/bpf/xdp_redirect_cpu_user.c > @@ -207,11 +207,8 @@ static struct datarec *alloc_record_per_cpu(void) > { > unsigned int nr_cpus = bpf_num_possible_cpus(); > struct datarec *array; > - size_t size; > > - size = sizeof(struct datarec) * nr_cpus; > - array = malloc(size); > - memset(array, 0, size); > + array = calloc(nr_cpus, sizeof(struct datarec)); > if (!array) { > fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus); > exit(EXIT_FAIL_MEM); > @@ -222,11 +219,9 @@ static struct datarec *alloc_record_per_cpu(void) > static struct stats_record *alloc_stats_record(void) > { > struct stats_record *rec; > - int i, size; > + int i; > > - size = sizeof(*rec) + n_cpus * sizeof(struct record); > - rec = malloc(size); > - memset(rec, 0, size); > + rec = calloc(n_cpus + 1, sizeof(struct record)); > if (!rec) { > fprintf(stderr, "Mem alloc error\n"); > exit(EXIT_FAIL_MEM); > -- > 2.17.1 > Acked-by: John Fastabend
RE: [PATCH] [bpf] xdp_monitor_user: Fix null pointer dereference
Gaurav Singh wrote: > Memset() on the pointer right after malloc() can cause > a null pointer dereference if it failed to allocate memory. > Fix this by replacing malloc/memset with a single calloc(). > > Signed-off-by: Gaurav Singh > --- > samples/bpf/xdp_monitor_user.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c > index dd558cbb2309..ef53b93db573 100644 > --- a/samples/bpf/xdp_monitor_user.c > +++ b/samples/bpf/xdp_monitor_user.c > @@ -509,11 +509,8 @@ static void *alloc_rec_per_cpu(int record_size) > { > unsigned int nr_cpus = bpf_num_possible_cpus(); > void *array; > - size_t size; > > - size = record_size * nr_cpus; > - array = malloc(size); > - memset(array, 0, size); > + array = calloc(nr_cpus, record_size); > if (!array) { > fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus); > exit(EXIT_FAIL_MEM); > @@ -528,8 +525,7 @@ static struct stats_record *alloc_stats_record(void) > int i; > > /* Alloc main stats_record structure */ > - rec = malloc(sizeof(*rec)); > - memset(rec, 0, sizeof(*rec)); > + rec = calloc(1, sizeof(*rec)); > if (!rec) { > fprintf(stderr, "Mem alloc error\n"); > exit(EXIT_FAIL_MEM); > -- > 2.17.1 > Acked-by: John Fastabend
Re: [PATCH] xdp_rxq_info_user: Fix null pointer dereference. Replace malloc/memset with calloc.
Jesper Dangaard Brouer wrote: > On Fri, 12 Jun 2020 14:53:27 -0400 > Gaurav Singh wrote: > > > Memset on the pointer right after malloc can cause a > > null pointer deference if it failed to allocate memory. > > A simple fix is to replace malloc/memset with a calloc() > > > > Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to > > xdp_rxq_info") > > Signed-off-by: Gaurav Singh > > Acked-by: Jesper Dangaard Brouer > Acked-by: John Fastabend
Re: [PATCH bpf] bpf: sockmap: don't attach programs to UDP sockets
Jakub Sitnicki wrote: > On Thu, 11 Jun 2020 18:25:20 +0100 > Lorenz Bauer wrote: > > > The stream parser infrastructure isn't set up to deal with UDP > > sockets, so we mustn't try to attach programs to them. > > > > I remember making this change at some point, but I must have lost > > it while rebasing or something similar. > > > > Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") > > Signed-off-by: Lorenz Bauer > > --- > > net/core/sock_map.c | 10 ++ > > 1 file changed, 6 insertions(+), 4 deletions(-) > > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 00a26cf2cfe9..35cea36f3892 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -424,10 +424,7 @@ static int sock_map_get_next_key(struct bpf_map *map, > > void *key, void *next) > > return 0; > > } > > > > -static bool sock_map_redirect_allowed(const struct sock *sk) > > -{ > > - return sk->sk_state != TCP_LISTEN; > > -} > > +static bool sock_map_redirect_allowed(const struct sock *sk); > > > > static int sock_map_update_common(struct bpf_map *map, u32 idx, > > struct sock *sk, u64 flags) > > @@ -508,6 +505,11 @@ static bool sk_is_udp(const struct sock *sk) > >sk->sk_protocol == IPPROTO_UDP; > > } > > > > +static bool sock_map_redirect_allowed(const struct sock *sk) > > +{ > > + return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN; > > +} > > + > > static bool sock_map_sk_is_suitable(const struct sock *sk) > > { > > return sk_is_tcp(sk) || sk_is_udp(sk); > > Acked-by: Jakub Sitnicki Thanks. Acked-by: John Fastabend
RE: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg
dihu wrote: > From 865a45747de6b68fd02a0ff128a69a5c8feb73c3 Mon Sep 17 00:00:00 2001 > From: dihu > Date: Mon, 25 May 2020 17:23:16 +0800 > Subject: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg > > When user application calls read() with MSG_PEEK flag to read data > of bpf sockmap socket, kernel panic happens at > __tcp_bpf_recvmsg+0x12c/0x350. sk_msg is not removed from ingress_msg > queue after read out under MSG_PEEK flag is set. Because it's not > judged whether sk_msg is the last msg of ingress_msg queue, the next > sk_msg may be the head of ingress_msg queue, whose memory address of > sg page is invalid. So it's necessary to add check codes to prevent > this problem. > > [20759.125457] BUG: kernel NULL pointer dereference, address: > 0008 > [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: GE > 5.4.32 #1 > [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS > 4.1.12 06/18/2017 > [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300 > [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350 > [20759.276099] tcp_bpf_recvmsg+0x113/0x370 > [20759.281137] inet_recvmsg+0x55/0xc0 > [20759.285734] __sys_recvfrom+0xc8/0x130 > [20759.290566] ? __audit_syscall_entry+0x103/0x130 > [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0 > [20759.301700] ? __audit_syscall_exit+0x1e4/0x290 > [20759.307235] __x64_sys_recvfrom+0x24/0x30 > [20759.312226] do_syscall_64+0x55/0x1b0 > [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Signed-off-by: dihu > --- > net/ipv4/tcp_bpf.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index 5a05327..c0d4624 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c > @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock > *psock, >} while (i != msg_rx->sg.end); > >if (unlikely(peek)) { > + if (msg_rx == list_last_entry(&psock->ingress_msg, > + struct sk_msg, list)) > +break; Thanks. Change looks good but spacing is a bit off . Can we turn those spaces into tabs? Otherwise adding fixes tag and my ack would be great. Fixes: 02c558b2d5d67 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress") Acked-by: John Fastabend
Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
Jakub Kicinski wrote: > On Fri, 8 May 2020 22:58:29 -0700 Jakub Kicinski wrote: > > On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote: > > > If systemd is configured to use hybrid mode which enables the use of > > > both cgroup v1 and v2, systemd will create new cgroup on both the default > > > root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach > > > task to the two cgroups. If the task does some network thing then the v2 > > > cgroup can never be freed after the session exited. > > > > > > One of our machines ran into OOM due to this memory leak. > > > > > > In the scenario described above when sk_alloc() is called > > > cgroup_sk_alloc() > > > thought it's in v2 mode, so it stores the cgroup pointer in > > > sk->sk_cgrp_data > > > and increments the cgroup refcnt, but then sock_update_netprioidx() > > > thought > > > it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the > > > cgroup refcnt will never be freed. > > > > > > Currently we do the mode switch when someone writes to the ifpriomap > > > cgroup > > > control file. The easiest fix is to also do the switch when a task is > > > attached > > > to a new cgroup. > > > > > > Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup") > > > > ^ space missing here > > > > > Reported-by: Yang Yingliang > > > Tested-by: Yang Yingliang > > > Signed-off-by: Zefan Li > > Fixed up the commit message and applied, thank you. Hi Zefan, Tejun, This is causing a regression where previously cgroupv2 bpf sockops programs could be attached and would run even if netprio_cgroup was enabled as long as the netprio cgroup had not been configured. After this the bpf sockops programs can still be attached but only programs attached to the root cgroup will be run. For example I hit this when I ran bpf selftests on a box that also happened to have netprio cgroup enabled, tests started failing after bumping kernel to rc5. I'm a bit on the fence here if it needs to be reverted. For my case its just a test box and easy enough to work around. Also all the production cases I have already have to be aware of this to avoid the configured error. So it may be fine but worth noting at least. Added Alexei to see if he has any opinion and/or uses net_prio+cgroubv2. I only looked it over briefly but didn't see any simple rc6 worthy fixes that would fix the issue above and also keep the original behavior. And then while reviewing I also wonder do we have the same issue described here in netclasid_cgroup.c with the cgrp_attach()? It would be best to keep netcls and netprio in sync in this regard imo. At least netcls calls cgroup_sk_alloc_disable in the write_classid() hook so I suspect it makes sense to also add that to the attach hook? Thanks, John
RE: [PATCH 1/1] bpf: Fix bpf_event_output re-entry issue
Allan Zhang wrote: > BPF_PROG_TYPE_SOCK_OPS program can reenter bpf_event_output because it can > be called from atomic and non-atomic contexts since we don't have > bpf_prog_active to prevent it happen. > > This patch enables 3 level of nesting to support normal, irq and nmi > context. > > We can easily reproduce the issue by running neper crr mode with 100 flows > and 10 threads from neper client side. > > Here is the whole stack dump: > > [ 515.228898] WARNING: CPU: 20 PID: 14686 at kernel/trace/bpf_trace.c:549 > bpf_event_output+0x1f9/0x220 > [ 515.228903] CPU: 20 PID: 14686 Comm: tcp_crr Tainted: GW > 4.15.0-smp-fixpanic #44 > [ 515.228904] Hardware name: Intel TBG,ICH10/Ikaria_QC_1b, BIOS 1.22.0 > 06/04/2018 > [ 515.228905] RIP: 0010:bpf_event_output+0x1f9/0x220 [...] > Fixes: a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops") > > Effort: BPF > Signed-off-by: Allan Zhang > Reviewed-by: Stanislav Fomichev > Reviewed-by: Eric Dumazet > --- LGTM thanks. Acked-by: John Fastabend
Re: general protection fault in tls_sk_proto_close (2)
Jakub Kicinski wrote: > On Thu, 29 Aug 2019 11:48:32 -0700, John Fastabend wrote: > > Jakub Kicinski wrote: > > > On Thu, 29 Aug 2019 11:52:00 +0800, Hillf Danton wrote: > > > > Alternatively work is done if sock is closed again. Anyway ctx is reset > > > > under sock's callback lock in write mode. > > > > > > > > --- a/net/tls/tls_main.c > > > > +++ b/net/tls/tls_main.c > > > > @@ -295,6 +295,8 @@ static void tls_sk_proto_close(struct so > > > > long timeo = sock_sndtimeo(sk, 0); > > > > bool free_ctx; > > > > > > > > + if (!ctx) > > > > + return; > > > > if (ctx->tx_conf == TLS_SW) > > > > tls_sw_cancel_work_tx(ctx); > > > > > > That's no bueno, the real socket's close will never get called. > > > > Seems when we refactored BPF side we dropped the check for ULP on one > > path so I'll add that back now. It would be nice and seems we are > > getting closer now that tls side is a bit more dynamic if the ordering > > didn't matter. > > We'd probably need some more generic way of communicating the changes > in sk_proto stack, e.g. by moving the update into one of sk_proto > callbacks? but yes. > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 1330a7442e5b..30d11558740e 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -666,6 +666,8 @@ static int sock_hash_update_common(struct bpf_map *map, > > void *key, > > WARN_ON_ONCE(!rcu_read_lock_held()); > > if (unlikely(flags > BPF_EXIST)) > > return -EINVAL; > > + if (unlikely(icsk->icsk_ulp_data)) > > + return -EINVAL; > > > > link = sk_psock_init_link(); > > if (!link) > > Thanks! That looks good, if you feel like submitting officially feel > free to add my Reviewed-by! I'll send it out this evening after running the selftests. .John
Re: general protection fault in tls_sk_proto_close (2)
Jakub Kicinski wrote: > On Thu, 29 Aug 2019 11:52:00 +0800, Hillf Danton wrote: > > Alternatively work is done if sock is closed again. Anyway ctx is reset > > under sock's callback lock in write mode. > > > > --- a/net/tls/tls_main.c > > +++ b/net/tls/tls_main.c > > @@ -295,6 +295,8 @@ static void tls_sk_proto_close(struct so > > long timeo = sock_sndtimeo(sk, 0); > > bool free_ctx; > > > > + if (!ctx) > > + return; > > if (ctx->tx_conf == TLS_SW) > > tls_sw_cancel_work_tx(ctx); > > That's no bueno, the real socket's close will never get called. Seems when we refactored BPF side we dropped the check for ULP on one path so I'll add that back now. It would be nice and seems we are getting closer now that tls side is a bit more dynamic if the ordering didn't matter. diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 1330a7442e5b..30d11558740e 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -666,6 +666,8 @@ static int sock_hash_update_common(struct bpf_map *map, void *key, WARN_ON_ONCE(!rcu_read_lock_held()); if (unlikely(flags > BPF_EXIST)) return -EINVAL; + if (unlikely(icsk->icsk_ulp_data)) + return -EINVAL; link = sk_psock_init_link(); if (!link)
Re: general protection fault in tls_write_space
Jakub Kicinski wrote: > On Tue, 13 Aug 2019 11:30:00 -0700, John Fastabend wrote: > > Jakub Kicinski wrote: > > > On Tue, 13 Aug 2019 10:17:06 -0700, John Fastabend wrote: > > > > > Followup of commit 95fa145479fb > > > > > ("bpf: sockmap/tls, close can race with map free") > > > > > > > > > > --- a/net/tls/tls_main.c > > > > > +++ b/net/tls/tls_main.c > > > > > @@ -308,6 +308,9 @@ static void tls_sk_proto_close(struct so > > > > > if (free_ctx) > > > > > icsk->icsk_ulp_data = NULL; > > > > > sk->sk_prot = ctx->sk_proto; > > > > > + /* tls will go; restore sock callback before enabling bh */ > > > > > + if (sk->sk_write_space == tls_write_space) > > > > > + sk->sk_write_space = ctx->sk_write_space; > > > > > write_unlock_bh(&sk->sk_callback_lock); > > > > > release_sock(sk); > > > > > if (ctx->tx_conf == TLS_SW) > > > > > > > > Hi Hillf, > > > > > > > > We need this patch (although slightly updated for bpf tree) do > > > > you want to send it? Otherwise I can. We should only set this if > > > > TX path was enabled otherwise we null it. Checking against > > > > tls_write_space seems best to me as well. > > > > > > > > Against bpf this patch should fix it. > > > > > > > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c > > > > index ce6ef56a65ef..43252a801c3f 100644 > > > > --- a/net/tls/tls_main.c > > > > +++ b/net/tls/tls_main.c > > > > @@ -308,7 +308,8 @@ static void tls_sk_proto_close(struct sock *sk, > > > > long timeout) > > > > if (free_ctx) > > > > icsk->icsk_ulp_data = NULL; > > > > sk->sk_prot = ctx->sk_proto; > > > > - sk->sk_write_space = ctx->sk_write_space; > > > > + if (sk->sk_write_space == tls_write_space) > > > > + sk->sk_write_space = ctx->sk_write_space; > > > > write_unlock_bh(&sk->sk_callback_lock); > > > > release_sock(sk); > > > > if (ctx->tx_conf == TLS_SW) > > > > > > This is already in net since Friday: > > > > Don't we need to guard that with an > > > > if (sk->sk_write_space == tls_write_space) > > > > or something similar? Where is ctx->sk_write_space set in the rx only > > case? In do_tls_setsockop_conf() we have this block > > > > if (tx) { > > ctx->sk_write_space = sk->sk_write_space; > > sk->sk_write_space = tls_write_space; > > } else { > > sk->sk_socket->ops = &tls_sw_proto_ops; > > } > > > > which makes me think ctx->sk_write_space may not be set correctly in > > all cases. > > Ah damn, you're right I remember looking at that but then I went down > the rabbit hole of trying to repro and forgot :/ > > Do you want to send an incremental change? Sure I'll send something out this afternoon.
Re: general protection fault in tls_write_space
Jakub Kicinski wrote: > On Tue, 13 Aug 2019 10:17:06 -0700, John Fastabend wrote: > > > Followup of commit 95fa145479fb > > > ("bpf: sockmap/tls, close can race with map free") > > > > > > --- a/net/tls/tls_main.c > > > +++ b/net/tls/tls_main.c > > > @@ -308,6 +308,9 @@ static void tls_sk_proto_close(struct so > > > if (free_ctx) > > > icsk->icsk_ulp_data = NULL; > > > sk->sk_prot = ctx->sk_proto; > > > + /* tls will go; restore sock callback before enabling bh */ > > > + if (sk->sk_write_space == tls_write_space) > > > + sk->sk_write_space = ctx->sk_write_space; > > > write_unlock_bh(&sk->sk_callback_lock); > > > release_sock(sk); > > > if (ctx->tx_conf == TLS_SW) > > > > Hi Hillf, > > > > We need this patch (although slightly updated for bpf tree) do > > you want to send it? Otherwise I can. We should only set this if > > TX path was enabled otherwise we null it. Checking against > > tls_write_space seems best to me as well. > > > > Against bpf this patch should fix it. > > > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c > > index ce6ef56a65ef..43252a801c3f 100644 > > --- a/net/tls/tls_main.c > > +++ b/net/tls/tls_main.c > > @@ -308,7 +308,8 @@ static void tls_sk_proto_close(struct sock *sk, long > > timeout) > > if (free_ctx) > > icsk->icsk_ulp_data = NULL; > > sk->sk_prot = ctx->sk_proto; > > - sk->sk_write_space = ctx->sk_write_space; > > + if (sk->sk_write_space == tls_write_space) > > + sk->sk_write_space = ctx->sk_write_space; > > write_unlock_bh(&sk->sk_callback_lock); > > release_sock(sk); > > if (ctx->tx_conf == TLS_SW) > > This is already in net since Friday: Don't we need to guard that with an if (sk->sk_write_space == tls_write_space) or something similar? Where is ctx->sk_write_space set in the rx only case? In do_tls_setsockop_conf() we have this block if (tx) { ctx->sk_write_space = sk->sk_write_space; sk->sk_write_space = tls_write_space; } else { sk->sk_socket->ops = &tls_sw_proto_ops; } which makes me think ctx->sk_write_space may not be set correctly in all cases. Thanks. > > commit 57c722e932cfb82e9820bbaae1b1f7222ea97b52 > Author: Jakub Kicinski > Date: Fri Aug 9 18:36:23 2019 -0700 > > net/tls: swap sk_write_space on close > > Now that we swap the original proto and clear the ULP pointer > on close we have to make sure no callback will try to access > the freed state. sk_write_space is not part of sk_prot, remember > to swap it. > > Reported-by: syzbot+dcdc9deefaec44785...@syzkaller.appspotmail.com > Fixes: 95fa145479fb ("bpf: sockmap/tls, close can race with map free") > Signed-off-by: Jakub Kicinski > Signed-off-by: David S. Miller > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c > index 9cbbae606ced..ce6ef56a65ef 100644 > --- a/net/tls/tls_main.c > +++ b/net/tls/tls_main.c > @@ -308,6 +308,7 @@ static void tls_sk_proto_close(struct sock *sk, long > timeout) > if (free_ctx) > icsk->icsk_ulp_data = NULL; > sk->sk_prot = ctx->sk_proto; > + sk->sk_write_space = ctx->sk_write_space; > write_unlock_bh(&sk->sk_callback_lock); > release_sock(sk); > if (ctx->tx_conf == TLS_SW)
Re: general protection fault in tls_write_space
Hillf Danton wrote: > > On Sat, 10 Aug 2019 01:23:06 -0700 > > > > syzbot has found a reproducer for the following crash on: > > > > HEAD commit:ca497fb6 taprio: remove unused variable 'entry_list_policy' > > git tree: net-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=109f380260 > > kernel config: https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7 > > dashboard link: https://syzkaller.appspot.com/bug?extid=dcdc9deefaec44785f32 > > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11c78cd260 > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+dcdc9deefaec44785...@syzkaller.appspotmail.com > > > > kasan: CONFIG_KASAN_INLINE enabled > > kasan: GPF could be caused by NULL-ptr deref or user memory access > > general protection fault: [#1] PREEMPT SMP KASAN > > CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.3.0-rc3+ #125 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > RIP: 0010:tls_write_space+0x51/0x170 net/tls/tls_main.c:239 > > Code: c1 ea 03 80 3c 02 00 0f 85 26 01 00 00 49 8b 9c 24 b0 06 00 00 48 b8 > > 00 00 00 00 00 fc ff df 48 8d 7b 6a 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 > > 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 df 00 00 00 > > RSP: 0018:8880a98b74c8 EFLAGS: 00010202 > > RAX: dc00 RBX: RCX: 860a27a2 > > RDX: 000d RSI: 862c86c1 RDI: 006a > > RBP: 8880a98b74e0 R08: 8880a98a2240 R09: fbfff167c289 > > R10: fbfff167c288 R11: 8b3e1447 R12: 8880a4de41c0 > > R13: 8880a4de45b8 R14: 000a R15: > > FS: () GS:8880ae80() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: CR3: 8c9d1000 CR4: 001406f0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > Call Trace: > > tcp_new_space net/ipv4/tcp_input.c:5151 [inline] > > tcp_check_space+0x191/0x760 net/ipv4/tcp_input.c:5162 > > tcp_data_snd_check net/ipv4/tcp_input.c:5172 [inline] > > tcp_rcv_state_process+0xe24/0x4e48 net/ipv4/tcp_input.c:6303 > > tcp_v6_do_rcv+0x7d7/0x12c0 net/ipv6/tcp_ipv6.c:1381 > > tcp_v6_rcv+0x31f1/0x3500 net/ipv6/tcp_ipv6.c:1588 > > ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397 > > ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438 > > NF_HOOK include/linux/netfilter.h:305 [inline] > > NF_HOOK include/linux/netfilter.h:299 [inline] > > ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447 > > dst_input include/net/dst.h:442 [inline] > > ip6_rcv_finish+0x1de/0x2f0 net/ipv6/ip6_input.c:76 > > NF_HOOK include/linux/netfilter.h:305 [inline] > > NF_HOOK include/linux/netfilter.h:299 [inline] > > ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:272 > > __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5006 > > __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5120 > > process_backlog+0x206/0x750 net/core/dev.c:5951 > > napi_poll net/core/dev.c:6388 [inline] > > net_rx_action+0x4d6/0x1080 net/core/dev.c:6456 > > __do_softirq+0x262/0x98c kernel/softirq.c:292 > > run_ksoftirqd kernel/softirq.c:603 [inline] > > run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595 > > smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165 > > kthread+0x361/0x430 kernel/kthread.c:255 > > ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 > > Modules linked in: > > ---[ end trace c21a83505707bb9d ]--- > > Followup of commit 95fa145479fb > ("bpf: sockmap/tls, close can race with map free") > > --- a/net/tls/tls_main.c > +++ b/net/tls/tls_main.c > @@ -308,6 +308,9 @@ static void tls_sk_proto_close(struct so > if (free_ctx) > icsk->icsk_ulp_data = NULL; > sk->sk_prot = ctx->sk_proto; > + /* tls will go; restore sock callback before enabling bh */ > + if (sk->sk_write_space == tls_write_space) > + sk->sk_write_space = ctx->sk_write_space; > write_unlock_bh(&sk->sk_callback_lock); > release_sock(sk); > if (ctx->tx_conf == TLS_SW) Hi Hillf, We need this patch (although slightly updated for bpf tree) do you want to send it? Otherwise I can. We should only set this if TX path was enabled otherwise we null it. Checking against tls_write_space seems best to me as well. Against bpf this patch should fix it. diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index ce6ef56a65ef..43252a801c3f 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -308,7 +308,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout) if (free_ctx) icsk->icsk_ulp_data = NULL; sk->sk_prot = ctx->sk_proto; - sk->sk_write_space = ctx->sk_write_space; + if (sk->sk_write_spac
Re: BUG: spinlock recursion in release_sock
syzbot wrote: > syzbot has bisected this bug to: > > commit 8822e270d697010e6a4fd42a319dbefc33db91e1 > Author: John Hurley > Date: Sun Jul 7 14:01:54 2019 + > > net: core: move push MPLS functionality from OvS to core helper > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13ca5a5c60 > start commit: 9e6dfe80 Add linux-next specific files for 20190724 > git tree: linux-next > final crash:https://syzkaller.appspot.com/x/report.txt?x=102a5a5c60 > console output: https://syzkaller.appspot.com/x/log.txt?x=17ca5a5c60 > kernel config: https://syzkaller.appspot.com/x/.config?x=6cbb8fc2cf2842d7 > dashboard link: https://syzkaller.appspot.com/bug?extid=e67cf584b5e6b35a8ffa > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1368059460 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15b3414460 > > Reported-by: syzbot+e67cf584b5e6b35a8...@syzkaller.appspotmail.com > Fixes: 8822e270d697 ("net: core: move push MPLS functionality from OvS to > core helper") > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection This commit is wrong, it appears to be introduced by some other fixes we pushed last couple days for tls/bpf. I'll look into it. Thanks.
Re: kernel panic: stack is corrupted in pointer
Dmitry Vyukov wrote: > On Wed, Jul 17, 2019 at 10:58 AM syzbot > wrote: > > > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:1438cde7 Add linux-next specific files for 20190716 > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=1398805860 > > kernel config: https://syzkaller.appspot.com/x/.config?x=3430a151e1452331 > > dashboard link: https://syzkaller.appspot.com/bug?extid=79f5f028005a77ecb6bb > > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=111fc8afa0 > > From the repro it looks like the same bpf stack overflow bug. +John > We need to dup them onto some canonical report for this bug, or this > becomes unmanageable. Fixes in bpf tree should fix this. Hopefully, we will squash this once fixes percolate up. #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git > > #syz dup: kernel panic: corrupted stack end in dput > > > The bug was bisected to: > > > > commit 96a5d8d4915f3e241ebb48d5decdd110ab9c7dcf > > Author: Leo Liu > > Date: Fri Jul 13 15:26:28 2018 + > > > > drm/amdgpu: Make sure IB tests flushed after IP resume > > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=14a4620060 > > final crash:https://syzkaller.appspot.com/x/report.txt?x=16a4620060 > > console output: https://syzkaller.appspot.com/x/log.txt?x=12a4620060 > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+79f5f028005a77ecb...@syzkaller.appspotmail.com > > Fixes: 96a5d8d4915f ("drm/amdgpu: Make sure IB tests flushed after IP > > resume") > > > > Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: > > pointer+0x702/0x750 lib/vsprintf.c:2187 > > Shutting down cpus with NMI > > Kernel Offset: disabled > > > > > > --- > > This bug is generated by a bot. It may contain errors. > > See https://goo.gl/tpsmEJ for more information about syzbot. > > syzbot engineers can be reached at syzkal...@googlegroups.com. > > > > syzbot will keep track of this bug report. See: > > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection > > syzbot can test patches for this bug, for details see: > > https://goo.gl/tpsmEJ#testing-patches
Re: WARNING in mark_lock
syzbot wrote: > Hello, > > syzbot has tested the proposed patch but the reproducer still triggered > crash: > KASAN: use-after-free Read in class_equal > > == > BUG: KASAN: use-after-free in class_equal+0x40/0x50 > kernel/locking/lockdep.c:1527 > Read of size 8 at addr 88808a268ba0 by task syz-executor.1/9270 > > CPU: 0 PID: 9270 Comm: syz-executor.1 Not tainted 5.2.0-rc3+ #1 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > > Allocated by task 2647419968: > BUG: unable to handle page fault for address: 8c00b020 > #PF: supervisor read access in kernel mode > #PF: error_code(0x) - not-present page > PGD 8a70067 P4D 8a70067 PUD 8a71063 PMD 0 > Thread overran stack, or stack corrupted > Oops: [#1] PREEMPT SMP KASAN > CPU: 0 PID: 9270 Comm: syz-executor.1 Not tainted 5.2.0-rc3+ #1 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:stack_depot_fetch+0x10/0x30 lib/stackdepot.c:203 > Code: e9 7b fd ff ff 4c 89 ff e8 8d b4 62 fe e9 e6 fd ff ff 90 90 90 90 90 > 90 90 90 89 f8 c1 ef 11 25 ff ff 1f 00 81 e7 f0 3f 00 00 <48> 03 3c c5 20 > 6c 04 8b 48 8d 47 18 48 89 06 8b 47 0c c3 0f 1f 00 > RSP: 0018:88808a2688e8 EFLAGS: 00010006 > RAX: 001f8880 RBX: 88808a269304 RCX: > RDX: RSI: 88808a2688f0 RDI: 3ff0 > RBP: 88808a268908 R08: 0020 R09: ed1015d044fa > R10: ed1015d044f9 R11: 8880ae8227cf R12: ea0002289a00 > R13: 88808a268ba0 R14: 8880aa58ec40 R15: 88808a269300 > FS: 570ba940() GS:8880ae80() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 8c00b020 CR3: 8dd0 CR4: 001406f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > Modules linked in: > CR2: 8c00b020 > ---[ end trace 4acfe4b59fbc9cdb ]--- > RIP: 0010:stack_depot_fetch+0x10/0x30 lib/stackdepot.c:203 > Code: e9 7b fd ff ff 4c 89 ff e8 8d b4 62 fe e9 e6 fd ff ff 90 90 90 90 90 > 90 90 90 89 f8 c1 ef 11 25 ff ff 1f 00 81 e7 f0 3f 00 00 <48> 03 3c c5 20 > 6c 04 8b 48 8d 47 18 48 89 06 8b 47 0c c3 0f 1f 00 > RSP: 0018:88808a2688e8 EFLAGS: 00010006 > RAX: 001f8880 RBX: 88808a269304 RCX: > RDX: RSI: 88808a2688f0 RDI: 3ff0 > RBP: 88808a268908 R08: 0020 R09: ed1015d044fa > R10: ed1015d044f9 R11: 8880ae8227cf R12: ea0002289a00 > R13: 88808a268ba0 R14: 8880aa58ec40 R15: 88808a269300 > FS: 570ba940() GS:8880ae80() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 8c00b020 CR3: 8dd0 CR4: 001406f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > > > Tested on: > > commit: 0b58d013 bpf: tls, implement unhash to avoid transition ou.. > git tree: git://github.com/cilium/linux ktls-unhash > console output: https://syzkaller.appspot.com/x/log.txt?x=153368a3a0 > kernel config: https://syzkaller.appspot.com/x/.config?x=2cc918d28ebd06b4 > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > #syz test: git://github.com/cilium/linux fix-unhash
Re: kernel panic: corrupted stack end in dput
Eric Biggers wrote: > [+bpf and tls maintainers] > > On Wed, Jul 03, 2019 at 04:23:34PM +0100, Al Viro wrote: > > On Wed, Jul 03, 2019 at 03:40:00PM +0100, Al Viro wrote: > > > On Wed, Jul 03, 2019 at 02:43:07PM +0800, Hillf Danton wrote: > > > > > > > > This is very much *NOT* fine. > > > > > 1) trylock can fail from any number of reasons, starting > > > > > with "somebody is going through the hash chain doing a lookup on > > > > > something completely unrelated" > > > > > > > > They are also a red light that we need to bail out of spiraling up > > > > the directory hierarchy imho. > > > > > > Translation: "let's leak the reference to parent, shall we?" > > > > > > > > 2) whoever had been holding the lock and whatever they'd > > > > > been doing might be over right after we get the return value from > > > > > spin_trylock(). > > > > > > > > Or after we send a mail using git. I don't know. > > > > > > > > > 3) even had that been really somebody adding children in > > > > > the same parent *AND* even if they really kept doing that, rather > > > > > than unlocking and buggering off, would you care to explain why > > > > > dentry_unlist() called by __dentry_kill() and removing the victim > > > > > from the list of children would be safe to do in parallel with that? > > > > > > > > > My bad. I have to walk around that unsafety. > > > > > > WHAT unsafety? Can you explain what are you seeing and how to > > > reproduce it, whatever it is? > > > > BTW, what makes you think that it's something inside dput() itself? > > All I see is that at some point in the beginning of the loop body > > in dput() we observe a buggered stack. > > > > Is that the first iteration through the loop? IOW, is that just > > the place where we first notice preexisting corruption, or is > > that something the code called from that loop does? If it's > > a stack overflow, I would be very surprised to see it here - > > dput() is iterative and it's called on a very shallow stack in > > those traces. > > > > What happens if you e.g. turn that > > dput(dentry); > > in __fput() into > > rcu_read_lock(); rcu_read_unlock(); // trigger the check > > dput(dentry); > > > > and run your reporducer? > > > > Please don't waste your time on this, it looks like just another report from > the > massive memory corruption in BPF and/or TLS. Look at reproducer: > > bpf$MAP_CREATE(0x0, &(0x7f000280)={0xf, 0x4, 0x4, 0x400, 0x0, 0x1}, 0x3c) > socket$rxrpc(0x21, 0x2, 0x8a) > r0 = socket$inet6_tcp(0xa, 0x1, 0x0) > setsockopt$inet6_tcp_int(r0, 0x6, 0x13, &(0x7fc0)=0x10001, 0x1d4) > connect$inet6(r0, &(0x7f000140), 0x1c) > bpf$MAP_CREATE(0x0, &(0x7f00)={0x5}, 0xfdcb) > bpf$MAP_CREATE(0x2, &(0x7f003000)={0x3, 0x0, 0x77fffb, 0x0, > 0x1002000, 0x0}, 0x2c) > setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f40)='tls\x00', 0x4) > > It's the same as like 20 other syzbot reports. There is a missing synchronize_rcu we need to add and we have a race between map_free and tls close at the moment. The race cuases us to incorrectly set the sk->prot pointers when tls socket is closed in this case. I've added a hook to the ULP side now that should let the map_free reset the saved sk->prot pointers on the TLS side and am testing this now. The 20 syzbot reports appear to all be due to these two issues. This has nothing to do with dput(). Thanks, John > > - Eric
RE: [net/bpf] Re: WARNING in mark_lock
Eric Biggers wrote: > [+bpf list] > > On Tue, Jun 25, 2019 at 08:20:56AM +0200, Thomas Gleixner wrote: > > On Mon, 24 Jun 2019, syzbot wrote: > > > > > Hello, > > > > CC++ Peterz > > > > > > > > syzbot found the following crash on: > > > > > > HEAD commit:dc636f5d Add linux-next specific files for 20190620 > > > git tree: linux-next > > > console output: https://syzkaller.appspot.com/x/log.txt?x=162b68b1a0 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=99c104b0092a557b > > > dashboard link: > > > https://syzkaller.appspot.com/bug?extid=a861f52659ae2596492b > > > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=110b24f6a0 > > > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > > Reported-by: syzbot+a861f52659ae25964...@syzkaller.appspotmail.com > > The syz repro looks bpf related, and essentially the same repro is in lots of > other open syzbot reports which I've assigned to the bpf subsystem... > https://lore.kernel.org/lkml/20190624050114.GA30702@sol.localdomain/ > > {"threaded":true,"repeat":true,"procs":6,"sandbox":"none","fault_call":-1,"tun":true,"netdev":true,"resetnet":true,"cgroups":true,"binfmt_misc":true,"close_fds":true,"tmpdir":true,"segv":true} > bpf$MAP_CREATE(0x0, &(0x7f000280)={0xf, 0x4, 0x4, 0x400, 0x0, 0x1}, 0x3c) > socket$rxrpc(0x21, 0x2, 0x8a) > r0 = socket$inet6_tcp(0xa, 0x1, 0x0) > setsockopt$inet6_tcp_int(r0, 0x6, 0x13, &(0x7fc0)=0x10001, 0x1d4) > connect$inet6(r0, &(0x7f000140), 0x1c) > bpf$MAP_CREATE(0x0, &(0x7f00)={0x5}, 0xfdcb) > bpf$MAP_CREATE(0x2, &(0x7f003000)={0x3, 0x0, 0x77fffb, 0x0, > 0x1002000, 0x0}, 0x2c) > setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f40)='tls\x00', 0x4) #syz test: git://github.com/cilium/linux ktls-unhash
RE: memory leak in create_ctx
syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:79c3ba32 Merge tag 'drm-fixes-2019-06-07-1' of git://anong.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=170e0bfea0 > kernel config: https://syzkaller.appspot.com/x/.config?x=d5c73825cbdc7326 > dashboard link: https://syzkaller.appspot.com/bug?extid=06537213db7ba2745c4a > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10aa806aa0 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+06537213db7ba2745...@syzkaller.appspotmail.com > > IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready > 2019/06/08 14:55:51 executed programs: 15 > 2019/06/08 14:55:56 executed programs: 31 > 2019/06/08 14:56:02 executed programs: 51 > BUG: memory leak > unreferenced object 0x888117ceae00 (size 512): >comm "syz-executor.3", pid 7233, jiffies 4294949016 (age 13.640s) >hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >backtrace: > [] kmemleak_alloc_recursive > include/linux/kmemleak.h:55 [inline] > [ ] slab_post_alloc_hook mm/slab.h:439 [inline] > [ ] slab_alloc mm/slab.c:3326 [inline] > [ ] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 > [<14132182>] kmalloc include/linux/slab.h:547 [inline] > [<14132182>] kzalloc include/linux/slab.h:742 [inline] > [<14132182>] create_ctx+0x25/0x70 net/tls/tls_main.c:601 > [ ] tls_init net/tls/tls_main.c:787 [inline] > [ ] tls_init+0x97/0x1e0 net/tls/tls_main.c:769 > [<37b0c43c>] __tcp_set_ulp net/ipv4/tcp_ulp.c:126 [inline] > [<37b0c43c>] tcp_set_ulp+0xe2/0x190 net/ipv4/tcp_ulp.c:147 > [<7a284277>] do_tcp_setsockopt.isra.0+0x19a/0xd60 > net/ipv4/tcp.c:2784 > [ ] tcp_setsockopt+0x71/0x80 net/ipv4/tcp.c:3098 > [ ] sock_common_setsockopt+0x38/0x50 > net/core/sock.c:3124 > [<06b0801f>] __sys_setsockopt+0x98/0x120 net/socket.c:2072 > [ ] __do_sys_setsockopt net/socket.c:2083 [inline] > [ ] __se_sys_setsockopt net/socket.c:2080 [inline] > [ ] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080 > [ ] do_syscall_64+0x76/0x1a0 > arch/x86/entry/common.c:301 > [ ] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > BUG: memory leak > unreferenced object 0x88810965dc00 (size 512): >comm "syz-executor.2", pid 7235, jiffies 4294949016 (age 13.640s) >hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >backtrace: > [ ] kmemleak_alloc_recursive > include/linux/kmemleak.h:55 [inline] > [ ] slab_post_alloc_hook mm/slab.h:439 [inline] > [ ] slab_alloc mm/slab.c:3326 [inline] > [ ] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 > [<14132182>] kmalloc include/linux/slab.h:547 [inline] > [<14132182>] kzalloc include/linux/slab.h:742 [inline] > [<14132182>] create_ctx+0x25/0x70 net/tls/tls_main.c:601 > [ ] tls_init net/tls/tls_main.c:787 [inline] > [ ] tls_init+0x97/0x1e0 net/tls/tls_main.c:769 > [<37b0c43c>] __tcp_set_ulp net/ipv4/tcp_ulp.c:126 [inline] > [<37b0c43c>] tcp_set_ulp+0xe2/0x190 net/ipv4/tcp_ulp.c:147 > [<7a284277>] do_tcp_setsockopt.isra.0+0x19a/0xd60 > net/ipv4/tcp.c:2784 > [ ] tcp_setsockopt+0x71/0x80 net/ipv4/tcp.c:3098 > [ ] sock_common_setsockopt+0x38/0x50 > net/core/sock.c:3124 > [<06b0801f>] __sys_setsockopt+0x98/0x120 net/socket.c:2072 > [ ] __do_sys_setsockopt net/socket.c:2083 [inline] > [ ] __se_sys_setsockopt net/socket.c:2080 [inline] > [ ] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080 > [ ] do_syscall_64+0x76/0x1a0 > arch/x86/entry/common.c:301 > [ ] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > BUG: memory leak > unreferenced object 0x8881207d7600 (size 512): >comm "syz-executor.5", pid 7244, jiffies 4294949019 (age 13.610s) >hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >backtrace: > [ ] kmemleak_alloc_recursive > include/linux/kmemleak
Re: KASAN: slab-out-of-bounds Write in validate_chain
Eric Biggers wrote: > Hi John, > > On Tue, Jun 25, 2019 at 04:07:00PM -0700, syzbot wrote: > > syzbot has bisected this bug to: > > > > commit e9db4ef6bf4ca9894bb324c76e01b8f1a16b2650 > > Author: John Fastabend > > Date: Sat Jun 30 13:17:47 2018 + > > > > bpf: sockhash fix omitted bucket lock in sock_close > > > > Are you working on this? This is the 6th open syzbot report that has been > bisected to this commit, and I suspect it's the cause of many of the other > 30 open syzbot reports I assigned to the bpf subsystem too > (https://lore.kernel.org/bpf/20190624050114.GA30702@sol.localdomain/). > > Also, this is happening in mainline (v5.2-rc6). > > - Eric Should have a fix today. It seems syzbot has found this bug repeatedly. .John
RE: kernel panic: corrupted stack end in corrupted
syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:29f785ff Merge branch 'fixes' of git://git.kernel.org/pub/.. > git tree: net > console output: https://syzkaller.appspot.com/x/log.txt?x=1158d411a0 > kernel config: https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b > dashboard link: https://syzkaller.appspot.com/bug?extid=b764c7ca388222ddfb17 > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=100ca932a0 > > The bug was bisected to: > > commit e9db4ef6bf4ca9894bb324c76e01b8f1a16b2650 > Author: John Fastabend > Date: Sat Jun 30 13:17:47 2018 + > > bpf: sockhash fix omitted bucket lock in sock_close > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=135e8a3aa0 > final crash:https://syzkaller.appspot.com/x/report.txt?x=10de8a3aa0 > console output: https://syzkaller.appspot.com/x/log.txt?x=175e8a3aa0 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+b764c7ca388222ddf...@syzkaller.appspotmail.com > Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close") > > Kernel panic - not syncing: corrupted stack end detected inside scheduler > CPU: 0 PID: 8770 Comm: syz-executor.4 Not tainted 5.2.0-rc5+ #57 > Hardware name: Google GooglSeaBIOS (version 1.8.2-20190503_170316-google) > I believe this is the same or similar bug to the others found with sockmap + ulp running. Will have fixes shortly. Thanks, John
Re: memory leak in create_ctx
Hillf Danton wrote: > > Hi Dmitry > > On Tue, 11 Jun 2019 19:45:28 +0800 Dmitry Vyukov wrote: > > > > I've run the repro as "./syz-execprog -repeat=0 -procs=6 repro" and > > in 10 mins I got the following splat, which indeed suggests a bpf bug. > > But we of course can have both bpf stack overflow and a memory leak in tls. > > > > > > > > 2019/06/11 10:26:52 executed programs: 887 > > 2019/06/11 10:26:57 executed programs: 899 > > 2019/06/11 10:27:02 executed programs: 916 > > [ 429.171049][ T9870] BUG: stack guard page was hit at a78467b9 > > (stack is 1452e9df..4fb93e51) > > [ 429.173714][ T9870] kernel stack overflow (double-fault): [#1] > > PREEMPT SMP > > [ 429.174819][ T9870] CPU: 3 PID: 9870 Comm: syz-executor Not tainted > > 5.2.0-rc4+ #6 > > [ 429.175901][ T9870] Hardware name: QEMU Standard PC (i440FX + PIIX, > > 1996), BIOS 1.10.2-1 04/01/2014 > > [ 429.177215][ T9870] RIP: 0010:tcp_bpf_unhash+0xc/0x80 > > [ 429.177950][ T9870] Code: 28 4c 89 ee 48 89 df ff 10 e8 30 56 66 fe > > 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 80 00 00 00 00 55 48 89 e5 41 55 41 > > 54 53 48 89 fb 0f 56 66 fe e8 6a bb 5f fe 4c 8b a3 80 02 00 00 4d > > 85 e4 74 2f > > [ 429.180707][ T9870] RSP: 0018:c9000369 EFLAGS: 00010293 > > [ 429.181562][ T9870] RAX: 888066a72000 RBX: 88806695b640 RCX: > > 82c82f80 > > [ 429.182681][ T9870] RDX: RSI: 0007 RDI: > > 88806695b640 > > [ 429.183807][ T9870] RBP: c90003690018 R08: R09: > > > > [ 429.184931][ T9870] R10: c90003693e70 R11: R12: > > 82c82f10 > > [ 429.186104][ T9870] R13: 0007 R14: 88806695b710 R15: > > 88806695b710 > > [ 429.187303][ T9870] FS: 569fc940() > > GS:88807db8() knlGS: > > [ 429.188678][ T9870] CS: 0010 DS: ES: CR0: 80050033 > > [ 429.189674][ T9870] CR2: c9000368fff8 CR3: 762bc002 CR4: > > 007606e0 > > [ 429.190880][ T9870] DR0: DR1: DR2: > > > > [ 429.192094][ T9870] DR3: DR6: fffe0ff0 DR7: > > 0400 > > [ 429.193295][ T9870] PKRU: 5554 > > [ 429.193829][ T9870] Call Trace: > > [ 429.194326][ T9870] ? tcp_bpf_close+0xa0/0xa0 > > [ 429.195020][ T9870] tcp_bpf_unhash+0x76/0x80 > > [ 429.195706][ T9870] ? tcp_bpf_close+0xa0/0xa0 > > [ 429.196400][ T9870] tcp_bpf_unhash+0x76/0x80 > > [ 429.197079][ T9870] ? tcp_bpf_close+0xa0/0xa0 > > [ 429.197773][ T9870] tcp_bpf_unhash+0x76/0x80 > > [ 429.651942][ T9870] ? tcp_bpf_close+0xa0/0xa0 > > > ... duplicated info trimed ... > > > > [ 429.652512][ T9870] tcp_bpf_unhash+0x76/0x80 > > [ 429.656467][ T9870] ? tcp_bpf_close+0xa0/0xa0 > > [ 429.657037][ T9870] tcp_bpf_unhash+0x76/0x80 > > [ 429.657600][ T9870] tcp_set_state+0x7b/0x220 > > [ 429.658160][ T9870] ? put_object+0x20/0x30 > > [ 429.658699][ T9870] ? debug_smp_processor_id+0x2b/0x130 > > [ 429.659382][ T9870] tcp_disconnect+0x518/0x610 > > [ 429.659973][ T9870] tcp_close+0x41d/0x540 > > [ 429.660501][ T9870] ? tcp_check_oom+0x180/0x180 > > [ 429.661095][ T9870] tls_sk_proto_close+0x86/0x2a0 > > [ 429.661711][ T9870] ? locks_remove_posix+0x114/0x1c0 > > [ 429.662359][ T9870] inet_release+0x44/0x80 > > [ 429.662899][ T9870] inet6_release+0x36/0x50 > > [ 429.663453][ T9870] __sock_release+0x4b/0x100 > > [ 429.664024][ T9870] ? __sock_release+0x100/0x100 > > [ 429.664625][ T9870] sock_close+0x19/0x20 > > [ 429.665141][ T9870] __fput+0xe7/0x2f0 > > [ 429.665624][ T9870] fput+0x15/0x20 > > [ 429.666120][ T9870] task_work_run+0xa4/0xd0 > > [ 429.71][ T9870] exit_to_usermode_loop+0x16f/0x180 > > [ 429.667329][ T9870] do_syscall_64+0x187/0x1b0 > > [ 429.667920][ T9870] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 429.668654][ T9870] RIP: 0033:0x412451 > > [ 429.669141][ T9870] Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff > > ff 0f 83 94 1a 00 00 c3 48 83 ec 08 e8 0a fc ff ff 48 89 04 24 b8 03 > > 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 53 fc ff ff 48 89 d0 48 83 c4 > > 08 48 3d 01 > > [ 429.671586][ T9870] RSP: 002b:7ffde18b5470 EFLAGS: 0293 > > ORIG_RAX: 0003 > > [ 429.672636][ T9870] RAX: RBX: 0005 RCX: > > 00412451 > > [ 429.673628][ T9870] RDX: RSI: 0081 RDI: > > 0004 > > [ 429.674643][ T9870] RBP: R08: 00721170 R09: > > > > [ 429.675641][ T9870] R10: 7ffde18b5580 R11: 0293 R12: > > > > [ 429.676636][ T9870] R13: 0071bf00 R14: 006e3140 R15: > > > > [ 429.677630][ T9870] Modules linked in: > > [ 429.678119][ T9870] ---[ end trace a429c7ce256ca7bb ]--- > > [ 429.678798][ T9870] RIP: 0010:tcp_bpf_unhash+0xc/0x80 > > [ 4
Re: [PATCH 1/2] bpf: sockmap, fix uninitialized variable
On 05/17/2018 07:08 AM, Gustavo A. R. Silva wrote: > There is a potential execution path in which variable err is > returned without being properly initialized previously. > > Fix this by initializing variable err to 0. > > Addresses-Coverity-ID: 1468964 ("Uninitialized scalar variable") > Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work > with hashmap") > Signed-off-by: Gustavo A. R. Silva > --- > kernel/bpf/sockmap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c > index c6de139..41b41fc 100644 > --- a/kernel/bpf/sockmap.c > +++ b/kernel/bpf/sockmap.c > @@ -1713,7 +1713,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map > *map, > struct smap_psock_map_entry *e = NULL; > struct smap_psock *psock; > bool new = false; > - int err; > + int err = 0; > > /* 1. If sock map has BPF programs those will be inherited by the >* sock being added. If the sock is already attached to BPF programs > Thanks for catching this and the quick fix. The path to hit this case is to add a sock to a map (without a BPF program) where the sock already has been added to another map. I don't have any tests for the case with socks in multiple maps so I'll add some to the selftests so I remember this case. The alternative fix would be to always 'return 0' at the end of the function, but I think its probably better to init err here like above. Acked-by: John Fastabend
Re: [PATCH 2/2] bpf: sockmap, fix double-free
On 05/17/2018 07:11 AM, Gustavo A. R. Silva wrote: > `e' is being freed twice. > > Fix this by removing one of the kfree() calls. > > Addresses-Coverity-ID: 1468983 ("Double free") > Fixes: 81110384441a ("bpf: sockmap, add hash map support") > Signed-off-by: Gustavo A. R. Silva > --- > kernel/bpf/sockmap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c > index 41b41fc..c682669 100644 > --- a/kernel/bpf/sockmap.c > +++ b/kernel/bpf/sockmap.c > @@ -1823,7 +1823,6 @@ static int __sock_map_ctx_update_elem(struct bpf_map > *map, > write_unlock_bh(&sock->sk_callback_lock); > return err; > out_free: > - kfree(e); > smap_release_sock(psock, sock); > out_progs: > if (verdict) > Thanks. This can happen when a user tries to add a kTLS socket to a sockmap. We need to add some tests for kTLS + sockmap cases. Acked-by: John Fastabend
Re: [bug, bisected] pfifo_fast causes packet reordering
On 03/22/2018 03:16 AM, Jakob Unterwurzacher wrote: > On 21.03.18 21:52, John Fastabend wrote: >> Can you try this, >> >> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h >> index d4907b5..1e596bd 100644 >> --- a/include/net/sch_generic.h >> +++ b/include/net/sch_generic.h >> @@ -30,6 +30,7 @@ struct qdisc_rate_table { >> enum qdisc_state_t { >> __QDISC_STATE_SCHED, >> __QDISC_STATE_DEACTIVATED, >> + __QDISC_STATE_RUNNING, >> }; >> [...] > > Tested, looks good. No OOO observed, no side effects observed, iperf > numbers on Gigabit Ethernet look the same. > > Thanks, > Jakob Thanks, I'll send out an official patch today. .John
Re: [bug, bisected] pfifo_fast causes packet reordering
On 03/21/2018 12:44 PM, Jakob Unterwurzacher wrote: > On 21.03.18 19:43, John Fastabend wrote: >> Thats my theory at least. Are you able to test a patch if I generate >> one to fix this? > > Yes, no problem. Can you try this, diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index d4907b5..1e596bd 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -30,6 +30,7 @@ struct qdisc_rate_table { enum qdisc_state_t { __QDISC_STATE_SCHED, __QDISC_STATE_DEACTIVATED, + __QDISC_STATE_RUNNING, }; struct qdisc_size_table { diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 190570f..cf7c37d 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -377,20 +377,26 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets) struct netdev_queue *txq; struct net_device *dev; struct sk_buff *skb; - bool validate; + bool more, validate; /* Dequeue packet */ + if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) + return false; + skb = dequeue_skb(q, &validate, packets); - if (unlikely(!skb)) + if (unlikely(!skb)) { + clear_bit(__QDISC_STATE_RUNNING, &q->state); return false; + } if (!(q->flags & TCQ_F_NOLOCK)) root_lock = qdisc_lock(q); dev = qdisc_dev(q); txq = skb_get_tx_queue(dev, skb); - - return sch_direct_xmit(skb, q, dev, txq, root_lock, validate); + more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate); + clear_bit(__QDISC_STATE_RUNNING, &q->state); + return more; } > > I just tested with the flag change you suggested (see below, I had to keep > TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far. > Right because the code expects per cpu stats if the CPUSTATS flag is removed it will crash. > Thanks, > Jakob > > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 190570f21b20..51b68ef4977b 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = { > .dump = pfifo_fast_dump, > .change_tx_queue_len = pfifo_fast_change_tx_queue_len, > .owner = THIS_MODULE, > - .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS, > + .static_flags = TCQ_F_CPUSTATS, > }; > EXPORT_SYMBOL(pfifo_fast_ops);
Re: [bug, bisected] pfifo_fast causes packet reordering
On 03/21/2018 03:01 AM, Jakob Unterwurzacher wrote: > On 16.03.18 11:26, Jakob Unterwurzacher wrote: >> On 15.03.18 23:30, John Fastabend wrote: >>>> I have reproduced it using two USB network cards connected to each other. >>>> The test tool sends UDP packets containing a counter and listens on the >>>> other interface, it is available at >>>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py >>> >>> Great thanks, can you also run this with taskset to bind to >>> a single CPU, >>> >>> # taskset 0x1 ./pifof_stress.py >>> >>> And let me know if you still see the OOO. >> >> Interesting. Looks like it depends on which core it runs on. CPU0 is clean, >> CPU1 is not. > > So we are at v4.16-rc6 now - have you managed to reproduce this is or should > I try to get the revert correct? I have a theory on what is going on here. Because we now run without locks we can have multiple qdisc_run() calls running in parallel. Possible if we send packets using multiple cores. --- application --- cpu0cpu1 | | | | enqueue enqueue | | pfifo_fast | | dequeue dequeue \/ ndo_xmit The skb->ooo_okay flag will keep the enqueue side packets in order. So that is covered. But on the dequeue side if two cores dequeue in parallel we will race to ndo ops to ensure packets are in order. Rarely, I guess the second dequeue could actually call ndo hook before first dequeued packet. Because usually the dequeue happens on the same queue the enqueue happened on we don't see this very often. But there seems to be a case where the driver calls netif_tx_wake_queue() on a different core (from the rx interrupt context). The wake queue call then eventually runs the dequeue on a different core. So when taskset is aligned with the interrupt everything is in-order when it is moved to a different core we see the OOO. Thats my theory at least. Are you able to test a patch if I generate one to fix this? FWIW the revert on this is trivial, but I think we can fix this without too much work. Also, if you had a driver tx queue per core this would not be an issue. diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 190570f..171f470 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -792,7 +792,6 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = { .dump = pfifo_fast_dump, .change_tx_queue_len = pfifo_fast_change_tx_queue_len, .owner = THIS_MODULE, - .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS, }; EXPORT_SYMBOL(pfifo_fast_ops); > > Best regards, > Jakob
Re: [bug, bisected] pfifo_fast causes packet reordering
On 03/15/2018 11:08 AM, Jakob Unterwurzacher wrote: > On 14.03.18 05:03, John Fastabend wrote: >> On 03/13/2018 11:35 AM, Dave Taht wrote: >>> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher >>> wrote: >>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux >>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets >>>> are >>>> delivered out-of-order. >>>> >> >> Is the stress-testing tool available somewhere? What type of packets >> are being sent? > > > I have reproduced it using two USB network cards connected to each other. The > test tool sends UDP packets containing a counter and listens on the other > interface, it is available at > https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py > Great thanks, can you also run this with taskset to bind to a single CPU, # taskset 0x1 ./pifof_stress.py And let me know if you still see the OOO. > Here is what I get: > > root@rk3399-q7:~# ./pfifo_stress.py > [...] > expected ctr 0xcdc, received 0xcdd > expected ctr 0xcde, received 0xcdc > expected ctr 0xcdd, received 0xcde > expected ctr 0xe3c, received 0xe3d > expected ctr 0xe3e, received 0xe3c > expected ctr 0xe3d, received 0xe3e > expected ctr 0x1097, received 0x1098 > expected ctr 0x1099, received 0x1097 > expected ctr 0x1098, received 0x1099 > expected ctr 0x17c0, received 0x17c1 > expected ctr 0x17c2, received 0x17c0 > [...] > > Best regards, > Jakob
Re: [bug, bisected] pfifo_fast causes packet reordering
On 03/13/2018 11:35 AM, Dave Taht wrote: > On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher > wrote: >> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux >> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are >> delivered out-of-order. >> Is the stress-testing tool available somewhere? What type of packets are being sent? >> We have tracked the problem down to the driver interface level, and it seems >> that the driver's net_device_ops.ndo_start_xmit() function gets the packets >> handed over in the wrong order. >> >> This behavior was not observed on Linux v4.15 and I have bisected the >> problem down to this patch: >> >>> commit c5ad119fb6c09b0297446be05bd66602fa564758 >>> Author: John Fastabend >>> Date: Thu Dec 7 09:58:19 2017 -0800 >>> >>>net: sched: pfifo_fast use skb_array >>> >>>This converts the pfifo_fast qdisc to use the skb_array data structure >>>and set the lockless qdisc bit. pfifo_fast is the first qdisc to >>> support >>>the lockless bit that can be a child of a qdisc requiring locking. So >>>we add logic to clear the lock bit on initialization in these cases >>> when >>>the qdisc graft operation occurs. >>> >>>This also removes the logic used to pick the next band to dequeue from >>>and instead just checks a per priority array for packets from top >>> priority >>>to lowest. This might need to be a bit more clever but seems to work >>>for now. >>> >>>Signed-off-by: John Fastabend >>>Signed-off-by: David S. Miller >> >> >> The patch does not revert cleanly, but moving to one commit earlier makes >> the problem go away. >> >> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go >> away as well. > Is this a single queue device or a multiqueue device? Running 'tc -s qdisc show dev foo' would help some. > I am of course, a fan of obsoleting pfifo_fast. There's no good reason > for it anymore. > >> >> Is this an unintended side-effect of the patch or is there something the >> driver has to do to request in-order delivery? >> If we introduced a OOO edge case somewhere that was not intended so I'll take a look into it. But, if you can provide a bit more details on how stress testing is done to cause the issue that would help. Thanks, John >> Thanks, >> Jakob > > >
Re: [PATCH net-next 06/12] Revert "net: ptr_ring: otherwise safe empty checks can overrun array bounds"
On 01/25/2018 03:36 PM, Michael S. Tsirkin wrote: > This reverts commit bcecb4bbf88aa03171c30652bca761cf27755a6b. > > If we try to allocate an extra entry as the above commit did, and when > the requested size is UINT_MAX, addition overflows causing zero size to > be passed to kmalloc(). > > kmalloc then returns ZERO_SIZE_PTR with a subsequent crash. > > Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com > Cc: John Fastabend > Signed-off-by: Michael S. Tsirkin > --- Dang, I missed this case. Thanks. Acked-by: John Fastabend > include/linux/ptr_ring.h | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index f175846..3a19ebd 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -466,12 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct > ptr_ring *r, > > static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t > gfp) > { > - /* Allocate an extra dummy element at end of ring to avoid consumer head > - * or produce head access past the end of the array. Possible when > - * producer/consumer operations and __ptr_ring_peek operations run in > - * parallel. > - */ > - return kcalloc(size + 1, sizeof(void *), gfp); > + return kcalloc(size, sizeof(void *), gfp); > } > > static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) >
Re: [PATCH net-next 01/12] ptr_ring: keep consumer_head valid at all times
On 01/25/2018 03:36 PM, Michael S. Tsirkin wrote: > The comment near __ptr_ring_peek says: > > * If ring is never resized, and if the pointer is merely > * tested, there's no need to take the lock - see e.g. __ptr_ring_empty. > > but this was in fact never possible since consumer_head would sometimes > point outside the ring. Refactor the code so that it's always > pointing within a ring. > > Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array") > Signed-off-by: Michael S. Tsirkin > --- > include/linux/ptr_ring.h | 25 - > 1 file changed, 16 insertions(+), 9 deletions(-) > Thanks for fixing this up. Acked-by: John Fastabend
Re: [PATCH] ptr_ring: document usage around __ptr_ring_peek
On 01/10/2018 06:03 AM, Michael S. Tsirkin wrote: > This explains why is the net usage of __ptr_ring_peek > actually ok without locks. > > Signed-off-by: Michael S. Tsirkin > --- > > John - I think this is more or less what you meant. Is that right? Yep, thanks for following up. Acked-by: John Fastabend
Re: WARNING in strp_data_ready
On 10/24/2017 08:20 AM, syzbot wrote: > Hello, > > syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me > include/net/sock.h:1505 [inline] > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user > include/net/sock.h:1511 [inline] > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 > strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > > __dump_stack lib/dump_stack.c:16 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:52 > panic+0x1e4/0x417 kernel/panic.c:181 > __warn+0x1c4/0x1d9 kernel/panic.c:542 > report_bug+0x211/0x2d0 lib/bug.c:183 > fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 > do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] > do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 > do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 > invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 > RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] > RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] > RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 > RSP: 0018:8801db206b18 EFLAGS: 00010206 > RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: > RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 > RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd > R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 > R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 > psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 Looks like KCM is calling sk_data_ready() without first taking the sock lock. /* Called with lower sock held */ static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) { [...] if (kcm_queue_rcv_skb(&kcm->sk, skb)) { In this case kcm->sk is not the same lock the comment is referring to. And kcm_queue_rcv_skb() will eventually call sk_data_ready(). @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? I don't have anything better in mind immediately. Thanks, John
Re: [PATCH net 3/3] bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations
On 10/17/2017 07:55 AM, Daniel Borkmann wrote: > PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu > allocator. Given we support __GFP_NOWARN now, lets just let > the allocation request fail naturally instead. The two call > sites from BPF mistakenly assumed __GFP_NOWARN would work, so > no changes needed to their actual __alloc_percpu_gfp() calls > which use the flag already. > > Signed-off-by: Daniel Borkmann > --- Nice cleanup. Thanks! Acked-by: John Fastabend
Re: [PATCH net 2/3] bpf: fix splat for illegal devmap percpu allocation
On 10/17/2017 07:55 AM, Daniel Borkmann wrote: > It was reported that syzkaller was able to trigger a splat on > devmap percpu allocation due to illegal/unsupported allocation > request size passed to __alloc_percpu(): > > [ 70.094249] illegal size (32776) or align (8) for percpu allocation > [ 70.094256] [ cut here ] > [ 70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 > pcpu_alloc+0x96/0x630 > [...] > [ 70.094325] Call Trace: > [ 70.094328] __alloc_percpu_gfp+0x12/0x20 > [ 70.094330] dev_map_alloc+0x134/0x1e0 > [ 70.094331] SyS_bpf+0x9bc/0x1610 > [ 70.094333] ? selinux_task_setrlimit+0x5a/0x60 > [ 70.094334] ? security_task_setrlimit+0x43/0x60 > [ 70.094336] entry_SYSCALL_64_fastpath+0x1a/0xa5 > > This was due to too large max_entries for the map such that we > surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to > fail naturally here, so switch to __alloc_percpu_gfp() and pass > __GFP_NOWARN instead. > > Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map") > Reported-by: Mark Rutland > Reported-by: Shankara Pailoor > Reported-by: Richard Weinberger > Signed-off-by: Daniel Borkmann > Cc: John Fastabend > --- Thanks! Acked-by: John Fastabend
Re: [PATCH] bpf: devmap: Check attr->max_entries more carefully
On 10/15/2017 03:13 PM, Richard Weinberger wrote: > Am Montag, 16. Oktober 2017, 00:00:20 CEST schrieb Richard Weinberger: >> max_entries is user controlled and used as input for __alloc_percpu(). >> This function expects that the allocation size is a power of two and >> less than PCPU_MIN_UNIT_SIZE. >> Otherwise a WARN() is triggered. > > On a second though, I think we should also have a hard limit for > ->max_entries > or check for an overflow here: > Or just, static u64 dev_map_bitmap_size(const union bpf_attr *attr) { return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long); } Let me know if you want me to send a patch or if you want to. Thanks, John > static u64 dev_map_bitmap_size(const union bpf_attr *attr) > { > return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long); > } > > Thanks, > //richard >
Re: [PATCH net-next 3/3] virtio-net: support XDP_REDIRECT
On 09/19/2017 02:42 AM, Jason Wang wrote: > This patch tries to add XDP_REDIRECT for virtio-net. The changes are > not complex as we could use exist XDP_TX helpers for most of the > work. The rest is passing the XDP_TX to NAPI handler for implementing > batching. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- [...] > @@ -678,12 +711,20 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > } > break; > case XDP_TX: > - if (unlikely(!virtnet_xdp_xmit(vi, &xdp))) > + if (unlikely(!__virtnet_xdp_xmit(vi, &xdp))) > trace_xdp_exception(vi->dev, xdp_prog, act); > + else > + *xdp_xmit = true; > if (unlikely(xdp_page != page)) > goto err_xdp; > rcu_read_unlock(); > goto xdp_xmit; > + case XDP_REDIRECT: > + err = xdp_do_redirect(dev, &xdp, xdp_prog); > + if (err) > + *xdp_xmit = true; Is this a typo? Should above be if (!err) *xdp_xmit = true; this would match the pattern in receive_small and also make more sense. > + rcu_read_unlock(); > + goto xdp_xmit; > default: > bpf_warn_invalid_xdp_action(act); > case XDP_ABORTED: > @@ -788,7 +829,7 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > } > [...] Otherwise looks good to me thanks for doing this. .John
Re: [PATCH][net-next][V3] bpf: test_maps: fix typos, "conenct" and "listeen"
On 08/30/2017 10:15 AM, Colin King wrote: > From: Colin Ian King > > Trivial fix to typos in printf error messages: > "conenct" -> "connect" > "listeen" -> "listen" > > thanks to Daniel Borkmann for spotting one of these mistakes > > Signed-off-by: Colin Ian King > --- Thanks. Even though its a spelling typo best to add fixes tag I think. Fixes: 6f6d33f3b3d0 ("bpf: selftests add sockmap tests") Acked-by: John Fastabend
Re: linux-next: build failure after merge of the net-next tree
On 07/17/2017 06:09 PM, Stephen Rothwell wrote: > Hi all, > > After merging the net-next tree, today's linux-next build (arm > multi_v7_defconfig) failed like this: > > net/core/filter.o: In function `xdp_do_flush_map': > filter.c:(.text+0x3770): undefined reference to `__dev_map_flush' > net/core/filter.o: In function `__bpf_tx_xdp': > filter.c:(.text+0x56dc): undefined reference to `__dev_map_insert_ctx' > net/core/filter.o: In function `xdp_do_redirect_map': > filter.c:(.text+0x5e24): undefined reference to `__dev_map_lookup_elem' > > Caused by commit > > 11393cc9b9be ("xdp: Add batching support to redirect map") > > This build has > > CONFIG_BPF=y > # CONFIG_BPF_SYSCALL is not set > > I have used the net-next tree from next-20170717 for today. > I missed testing without CONFIG_BPF_SYSCALL. Patch is posted to netdev and on patchworks at, https://patchwork.ozlabs.org/patch/789862/ Thanks, John
Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer
On 17-02-21 12:46 AM, Jason Wang wrote: > This patch switch to use build_skb() for small buffer which can have > better performance for both TCP and XDP (since we can work at page > before skb creation). It also remove lots of XDP codes since both > mergeable and small buffer use page frag during refill now. > >Before | After > XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps > > Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf. When you do the xdp tests are you generating packets with pktgen on the corresponding tap devices? Also another thought, have you looked at using some of the buffer recycling techniques used in the hardware drivers such as ixgbe and with Eric's latest patches mlx? I have seen significant performance increases for some workloads doing this. I wanted to try something like this out on virtio but haven't had time yet. > > Signed-off-by: Jason Wang > --- [...] > static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue > *rq, >gfp_t gfp) > { > - int headroom = GOOD_PACKET_LEN + virtnet_get_headroom(vi); > + struct page_frag *alloc_frag = &rq->alloc_frag; > + char *buf; > unsigned int xdp_headroom = virtnet_get_headroom(vi); > - struct sk_buff *skb; > - struct virtio_net_hdr_mrg_rxbuf *hdr; > + int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom; > int err; > > - skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp); > - if (unlikely(!skb)) > + len = SKB_DATA_ALIGN(len) + > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) > return -ENOMEM; > > - skb_put(skb, headroom); > - > - hdr = skb_vnet_hdr(skb); > - sg_init_table(rq->sg, 2); > - sg_set_buf(rq->sg, hdr, vi->hdr_len); > - skb_to_sgvec(skb, rq->sg + 1, xdp_headroom, skb->len - xdp_headroom); > - > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp); > + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; > + get_page(alloc_frag->page); > + alloc_frag->offset += len; > + sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom, > + vi->hdr_len + GOOD_PACKET_LEN); > + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp); Nice this cleans up a lot of the branching code. Thanks. Acked-by: John Fastabend
Re: [PATCH net-next] virito-net: set queues after reset during xdp_set
On 17-02-16 09:10 PM, Jason Wang wrote: > > > On 2017年02月17日 12:53, John Fastabend wrote: >> On 17-02-15 01:08 AM, Jason Wang wrote: >>> We set queues before reset which will cause a crash[1]. This is >>> because is_xdp_raw_buffer_queue() depends on the old xdp queue pairs >>> number to do the correct detection. So fix this by: >>> >>> - set queues after reset, to keep the old vi->curr_queue_pairs. (in >>>fact setting queues before reset does not works since after feature >>>set, all queue pairs were enabled by default during reset). >>> - change xdp_queue_pairs only after virtnet_reset() is succeed. >>> >>> [1] >> I'm guessing this occurs when enabling XDP while receiving lots of traffic? > > I hit this then disabling XDP while receiving lots of traffic. > [...] >>> +vi->xdp_queue_pairs = xdp_qp; >> But xdp_queue_pairs is being used to detect if we should allocate the XDP >> headroom. If we move it here do we have a set of buffers in the ring without >> the proper headroom when we assign the xdp program below? > > Right, so how about passing xdp_queue_pairs as a parameter to virtnet_reset(). > Then virtnet_reset() can set it after _remove_vq_common() but before > virtnet_restore_up()? > > Thanks > Sounds like a reasonable fix to me. >> >>> +} >>> + >>> +err = _virtnet_set_queues(vi, curr_qp + xdp_qp); >>> +if (err) { >>> +dev_warn(&dev->dev, "XDP Device queue allocation failure.\n"); >>> +goto virtio_queue_err; >>> } >>> netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp); >>> @@ -1844,17 +1844,18 @@ static int virtnet_xdp_set(struct net_device *dev, >>> struct bpf_prog *prog) >>> return 0; >> Thanks, >> John >> >
Re: [PATCH net-next] virito-net: set queues after reset during xdp_set
On 17-02-15 01:08 AM, Jason Wang wrote: > We set queues before reset which will cause a crash[1]. This is > because is_xdp_raw_buffer_queue() depends on the old xdp queue pairs > number to do the correct detection. So fix this by: > > - set queues after reset, to keep the old vi->curr_queue_pairs. (in > fact setting queues before reset does not works since after feature > set, all queue pairs were enabled by default during reset). > - change xdp_queue_pairs only after virtnet_reset() is succeed. > > [1] I'm guessing this occurs when enabling XDP while receiving lots of traffic? > > [ 74.328168] general protection fault: [#1] SMP > [ 74.328625] Modules linked in: nfsd xfs libcrc32c virtio_net virtio_pci > [ 74.329117] CPU: 0 PID: 2849 Comm: xdp2 Not tainted 4.10.0-rc7+ #499 > [ 74.329577] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014 > [ 74.330424] task: 88007a894000 task.stack: c90004388000 > [ 74.330844] RIP: 0010:skb_release_head_state+0x28/0x80 > [ 74.331298] RSP: 0018:c9000438b8d0 EFLAGS: 00010206 > [ 74.331676] RAX: RBX: 88007ad96300 RCX: > > [ 74.332217] RDX: 88007fc137a8 RSI: 88007fc0db28 RDI: > 0001bf0001be > [ 74.332758] RBP: c9000438b8d8 R08: 0005008f R09: > 05f9 > [ 74.333274] R10: 88007d001700 R11: 820a8a4d R12: > 88007ad96300 > [ 74.333787] R13: 0002 R14: 880036604000 R15: > 77ff8000 > [ 74.334308] FS: 7fc70d8a7b40() GS:88007fc0() > knlGS: > [ 74.334891] CS: 0010 DS: ES: CR0: 80050033 > [ 74.335314] CR2: 7fff4144a710 CR3: 7ab56000 CR4: > 003406f0 > [ 74.335830] DR0: DR1: DR2: > > [ 74.336373] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 74.336895] Call Trace: > [ 74.337086] skb_release_all+0xd/0x30 > [ 74.337356] consume_skb+0x2c/0x90 > [ 74.337607] free_unused_bufs+0x1ff/0x270 [virtio_net] > [ 74.337988] ? vp_synchronize_vectors+0x3b/0x60 [virtio_pci] > [ 74.338398] virtnet_xdp+0x21e/0x440 [virtio_net] > [ 74.338741] dev_change_xdp_fd+0x101/0x140 > [ 74.339048] do_setlink+0xcf4/0xd20 > [ 74.339304] ? symcmp+0xf/0x20 > [ 74.339529] ? mls_level_isvalid+0x52/0x60 > [ 74.339828] ? mls_range_isvalid+0x43/0x50 > [ 74.340135] ? nla_parse+0xa0/0x100 > [ 74.340400] rtnl_setlink+0xd4/0x120 > [ 74.340664] ? cpumask_next_and+0x30/0x50 > [ 74.340966] rtnetlink_rcv_msg+0x7f/0x1f0 > [ 74.341259] ? sock_has_perm+0x59/0x60 > [ 74.341586] ? napi_consume_skb+0xe2/0x100 > [ 74.342010] ? rtnl_newlink+0x890/0x890 > [ 74.342435] netlink_rcv_skb+0x92/0xb0 > [ 74.342846] rtnetlink_rcv+0x23/0x30 > [ 74.343277] netlink_unicast+0x162/0x210 > [ 74.343677] netlink_sendmsg+0x2db/0x390 > [ 74.343968] sock_sendmsg+0x33/0x40 > [ 74.344233] SYSC_sendto+0xee/0x160 > [ 74.344482] ? SYSC_bind+0xb0/0xe0 > [ 74.344806] ? sock_alloc_file+0x92/0x110 > [ 74.345106] ? fd_install+0x20/0x30 > [ 74.345360] ? sock_map_fd+0x3f/0x60 > [ 74.345586] SyS_sendto+0x9/0x10 > [ 74.345790] entry_SYSCALL_64_fastpath+0x1a/0xa9 > [ 74.346086] RIP: 0033:0x7fc70d1b8f6d > [ 74.346312] RSP: 002b:7fff4144a708 EFLAGS: 0246 ORIG_RAX: > 002c > [ 74.346785] RAX: ffda RBX: RCX: > 7fc70d1b8f6d > [ 74.347244] RDX: 002c RSI: 7fff4144a720 RDI: > 0003 > [ 74.347683] RBP: 0003 R08: R09: > > [ 74.348544] R10: R11: 0246 R12: > 7fff4144bd90 > [ 74.349082] R13: 0002 R14: 0002 R15: > 7fff4144cda0 > [ 74.349607] Code: 00 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 58 48 85 ff 74 > 0e 40 f6 c7 01 74 3d 48 c7 43 58 00 00 00 00 48 8b 7b 68 48 85 ff 74 05 > ff 0f 74 20 48 8b 43 60 48 85 c0 74 14 65 8b 15 f3 ab 8d 7e > [ 74.351008] RIP: skb_release_head_state+0x28/0x80 RSP: c9000438b8d0 > [ 74.351625] ---[ end trace fe6e19fd11cfc80b ]--- > > Fixes: 2de2f7f40ef9 ("virtio_net: XDP support for adjust_head") > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 35 ++- > 1 file changed, 18 insertions(+), 17 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 11e2853..9ff959c 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -1775,7 +17
Re: [PATCH v2] virtio_net: fix PAGE_SIZE > 64k
On 17-01-24 01:56 PM, Michael S. Tsirkin wrote: > On Tue, Jan 24, 2017 at 04:10:46PM -0500, David Miller wrote: >> This works in the regimen that XDP packets always live in exactly one >> page. That will be needed to mmap the RX ring into userspace, and it >> helps make adjust_header trivial as well. I still don't see why this is a hard requirement for mmap let me post some patches later tonight to show how we do this with af_packet. > > I think the point was to avoid resets across xdp attach/detach. If we > are doing resets now, we could do whatever buffering we want. We could > also just disable mergeable buffers for that matter. > >> MTU 1500, PAGESIZE >= 4096, so a headroom of 256 is no problem, and >> we still have enough tailroom for skb_shared_info should we wrap >> the buffer into a real SKB and push it into the stack. >> >> If you are trying to do buffering differently for virtio_net, well... >> that's a self inflicted wound as far as I can tell. > > Right but I was wondering about the fact that this makes XDP_PASS > much slower than processing skbs without XDP, as truesize is huge > so we'll quickly run out of rmem space. > > When XDP is used to fight DOS attacks, why isn't this a concern? > It is a concern on my side. I want XDP and Linux stack to work reasonably well together. .John
Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
On 17-01-02 10:16 PM, Jason Wang wrote: > > > On 2017年01月03日 06:43, John Fastabend wrote: >> On 16-12-23 06:37 AM, Jason Wang wrote: >>> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of >>> small receive buffer untouched. This will confuse the user who want to >>> set XDP but use small buffers. Other than forbid XDP in small buffer >>> mode, let's make it work. XDP then can only work at skb->data since >>> virtio-net create skbs during refill, this is sub optimal which could >>> be optimized in the future. >>> >>> Cc: John Fastabend >>> Signed-off-by: Jason Wang >>> --- >>> drivers/net/virtio_net.c | 112 >>> --- >>> 1 file changed, 87 insertions(+), 25 deletions(-) >>> >> Hi Jason, >> >> I was doing some more testing on this what do you think about doing this >> so that free_unused_bufs() handles the buffer free with dev_kfree_skb() >> instead of put_page in small receive mode. Seems more correct to me. >> >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index 783e842..27ff76c 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct >> virtnet_info >> *vi) >> >> static bool is_xdp_queue(struct virtnet_info *vi, int q) >> { >> + /* For small receive mode always use kfree_skb variants */ >> + if (!vi->mergeable_rx_bufs) >> + return false; >> + >> if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs)) >> return false; >> else if (q < vi->curr_queue_pairs) >> >> >> patch is untested just spotted doing code review. >> >> Thanks, >> John > > We probably need a better name for this function. > > Acked-by: Jason Wang > How about is_xdp_raw_buffer_queue()? I'll submit a proper patch today.
Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
On 16-12-23 06:37 AM, Jason Wang wrote: > Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of > small receive buffer untouched. This will confuse the user who want to > set XDP but use small buffers. Other than forbid XDP in small buffer > mode, let's make it work. XDP then can only work at skb->data since > virtio-net create skbs during refill, this is sub optimal which could > be optimized in the future. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 112 > --- > 1 file changed, 87 insertions(+), 25 deletions(-) > Hi Jason, I was doing some more testing on this what do you think about doing this so that free_unused_bufs() handles the buffer free with dev_kfree_skb() instead of put_page in small receive mode. Seems more correct to me. diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 783e842..27ff76c 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info *vi) static bool is_xdp_queue(struct virtnet_info *vi, int q) { + /* For small receive mode always use kfree_skb variants */ + if (!vi->mergeable_rx_bufs) + return false; + if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs)) return false; else if (q < vi->curr_queue_pairs) patch is untested just spotted doing code review. Thanks, John
Re: [PATCH net 0/9] several fixups for virtio-net XDP
On 16-12-23 06:37 AM, Jason Wang wrote: > Merry Xmas and a Happy New year to all: > > This series tries to fixes several issues for virtio-net XDP which > could be categorized into several parts: > > - fix several issues during XDP linearizing > - allow csumed packet to work for XDP_PASS > - make EWMA rxbuf size estimation works for XDP > - forbid XDP when GUEST_UFO is support > - remove big packet XDP support > - add XDP support or small buffer > > Please see individual patches for details. > > Thanks > > Jason Wang (9): > virtio-net: remove the warning before XDP linearizing > virtio-net: correctly xmit linearized page on XDP_TX > virtio-net: fix page miscount during XDP linearizing > virtio-net: correctly handle XDP_PASS for linearized packets > virtio-net: unbreak csumed packets for XDP_PASS > virtio-net: make rx buf size estimation works for XDP > virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support > virtio-net: remove big packet XDP codes > virtio-net: XDP support for small buffers > > drivers/net/virtio_net.c | 172 > --- > 1 file changed, 102 insertions(+), 70 deletions(-) > Thanks a lot Jason. The last piece that is needed is support to complete XDP support is to get the adjust_head part correct. I'll send out a patch in a bit but will need to merge it on top of this set. .John
Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
On 16-12-23 06:37 AM, Jason Wang wrote: > Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of > small receive buffer untouched. This will confuse the user who want to > set XDP but use small buffers. Other than forbid XDP in small buffer > mode, let's make it work. XDP then can only work at skb->data since > virtio-net create skbs during refill, this is sub optimal which could > be optimized in the future. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- Looks good to me thanks! Acked-by: John Fastabend
Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
On 16-12-23 08:02 AM, John Fastabend wrote: > On 16-12-23 06:37 AM, Jason Wang wrote: >> When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO >> packet that exceeds a single page which could not be handled >> correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is >> supported. While at it, forbid XDP for ECN (which comes only from GRO) >> too to prevent user from misconfiguration. >> Is sending packets greater than single page though normal in this case? I don't have any need to support big packet mode other than MST asked for it. And I wasn't seeing this in my tests. MTU is capped at 4k - hdr when XDP is enabled. .John >> Cc: John Fastabend >> Signed-off-by: Jason Wang >> --- >> drivers/net/virtio_net.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index 77ae358..c1f66d8 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, >> struct bpf_prog *prog) >> int i, err; >> >> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) || >> -virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) { >> +virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) || >> +virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) || >> +virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) { >> netdev_warn(dev, "can't set XDP while host is implementing LRO, >> disable LRO first\n"); >> return -EOPNOTSUPP; >> } >> > > Acked-by: John Fastabend >
Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
On 16-12-23 06:37 AM, Jason Wang wrote: > When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO > packet that exceeds a single page which could not be handled > correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is > supported. While at it, forbid XDP for ECN (which comes only from GRO) > too to prevent user from misconfiguration. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 77ae358..c1f66d8 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, > struct bpf_prog *prog) > int i, err; > > if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) || > - virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) { > + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) || > + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) || > + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) { > netdev_warn(dev, "can't set XDP while host is implementing LRO, > disable LRO first\n"); > return -EOPNOTSUPP; > } > Acked-by: John Fastabend
Re: [PATCH net 6/9] virtio-net: make rx buf size estimation works for XDP
On 16-12-23 06:37 AM, Jason Wang wrote: > We don't update ewma rx buf size in the case of XDP. This will lead > underestimation of rx buf size which causes host to produce more than > one buffers. This will greatly increase the possibility of XDP page > linearization. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 0778dc8..77ae358 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -584,10 +584,12 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > put_page(page); > head_skb = page_to_skb(vi, rq, xdp_page, > 0, len, PAGE_SIZE); > + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len); > return head_skb; > } > break; > case XDP_TX: > + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len); > if (unlikely(xdp_page != page)) > goto err_xdp; > rcu_read_unlock(); > @@ -596,6 +598,7 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > default: > if (unlikely(xdp_page != page)) > __free_pages(xdp_page, 0); > + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len); > goto err_xdp; > } > } > Looks needed although I guess it will only be the case with MTU > ETH_DATA_LEN because of the clamp in get_mergeable_buf_len(). Although XDP setup allows MTU up to page_size - hdr so certainly will happen with ~MTU > 1500. I need to add some various MTU tests to my setup. Acked-by: John Fastabend
Re: [PATCH net 5/9] virtio-net: unbreak csumed packets for XDP_PASS
On 16-12-23 06:37 AM, Jason Wang wrote: > We drop csumed packet when do XDP for packets. This breaks > XDP_PASS when GUEST_CSUM is supported. Fix this by allowing csum flag > to be set. With this patch, simple TCP works for XDP_PASS. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 470293e..0778dc8 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -440,7 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev, > struct virtio_net_hdr_mrg_rxbuf *hdr = buf; > u32 act; > > - if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags)) > + if (unlikely(hdr->hdr.gso_type)) > goto err_xdp; > act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len); > switch (act) { > @@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, >* the receive path after XDP is loaded. In practice I >* was not able to create this condition. >*/ > - if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags)) > + if (unlikely(hdr->hdr.gso_type)) > goto err_xdp; > > act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len); > Acked-by: John Fastabend
Re: [PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets
On 16-12-23 06:37 AM, Jason Wang wrote: > When XDP_PASS were determined for linearized packets, we try to get > new buffers in the virtqueue and build skbs from them. This is wrong, > we should create skbs based on existed buffers instead. Fixing them by > creating skb based on xdp_page. > > With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 58ad40e..470293e 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len); > switch (act) { > case XDP_PASS: > - if (unlikely(xdp_page != page)) > - __free_pages(xdp_page, 0); > + /* We can only create skb based on xdp_page. */ > + if (unlikely(xdp_page != page)) { > + rcu_read_unlock(); > + put_page(page); > + head_skb = page_to_skb(vi, rq, xdp_page, > +0, len, PAGE_SIZE); > + return head_skb; > + } > break; > case XDP_TX: > if (unlikely(xdp_page != page)) > Great thanks. This was likely working before because of the memory leak fixed in 3/9. Acked-by: John Fastabend
Re: [PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing
On 16-12-23 06:37 AM, Jason Wang wrote: > We don't put page during linearizing, the would cause leaking when > xmit through XDP_TX or the packet exceeds PAGE_SIZE. Fix them by > put page accordingly. Also decrease the number of buffers during > linearizing to make sure caller can free buffers correctly when packet > exceeds PAGE_SIZE. With this patch, we won't get OOM after linearize > huge number of packets. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- Thanks! looks good. By the way do you happen to have any actual configuration where this path is hit? I obviously didn't test this very long other than a quick test with my hacked vhost driver. Acked-by: John Fastabend
Re: [PATCH net 2/9] virtio-net: correctly xmit linearized page on XDP_TX
On 16-12-23 06:37 AM, Jason Wang wrote: > After we linearize page, we should xmit this page instead of the page > of first buffer which may lead unexpected result. With this patch, we > can see correct packet during XDP_TX. > > Cc: John Fastabend > Signed-off-by: Jason Wang > --- > drivers/net/virtio_net.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 1067253..fe4562d 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags)) > goto err_xdp; > > - act = do_xdp_prog(vi, rq, xdp_prog, page, offset, len); > + act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len); > switch (act) { > case XDP_PASS: > if (unlikely(xdp_page != page)) > Thanks clumsy merge conflict resolution between v4 and v6 versions :/ Acked-by: John Fastabend
Re: [PATCH net-next] virtio-net: enable multiqueue by default
On 16-11-28 08:28 AM, David Miller wrote: > From: "Michael S. Tsirkin" > Date: Fri, 25 Nov 2016 06:43:08 +0200 > >> On Fri, Nov 25, 2016 at 12:37:26PM +0800, Jason Wang wrote: >>> We use single queue even if multiqueue is enabled and let admin to >>> enable it through ethtool later. This is used to avoid possible >>> regression (small packet TCP stream transmission). But looks like an >>> overkill since: >>> >>> - single queue user can disable multiqueue when launching qemu >>> - brings extra troubles for the management since it needs extra admin >>> tool in guest to enable multiqueue >>> - multiqueue performs much better than single queue in most of the >>> cases >>> >>> So this patch enables multiqueue by default: if #queues is less than or >>> equal to #vcpu, enable as much as queue pairs; if #queues is greater >>> than #vcpu, enable #vcpu queue pairs. >>> >>> Cc: Hannes Frederic Sowa >>> Cc: Michael S. Tsirkin >>> Cc: Neil Horman >>> Cc: Jeremy Eder >>> Cc: Marko Myllynen >>> Cc: Maxime Coquelin >>> Signed-off-by: Jason Wang >> >> OK at some level but all uses of num_online_cpus() >> like this are racy versus hotplug. >> I know we already have this bug but shouldn't we fix it >> before we add more? > > This is more being used like a heuristic in this scenerio, and in > fact I would say one would keep the code this way even once proper > hotplug handlers are installed to adjust the queued dynamically if > there is a desired (which is also not necessarily the case). > > I really don't think this change should be held on up on this issue. > So can we please make some forward progress here? > > Thanks. > Also it might be worth noting all the other multiqueue capable ethernet devices I checked use some variation of this heuristic. Typically related to what other features are enables RSS, DCB, etc. So it at least should be familiar to folks who do hotplug cpus on bare metal boxes. .John