Re: [RFC PATCH net-next v8 14/14] selftests: add ncdevmem, netcat for devmem TCP

2024-04-08 Thread Cong Wang
On Tue, Apr 02, 2024 at 05:20:51PM -0700, Mina Almasry wrote:
> +static char *server_ip = "192.168.1.4";
> +static char *client_ip = "192.168.1.2";
> +static char *port = "5201";
> +static size_t do_validation;
> +static int start_queue = 8;
> +static int num_queues = 8;
> +static char *ifname = "eth1";
> +static unsigned int ifindex = 3;
> +static char *nic_pci_addr = "0000:06:00.0";

It seems this is set but never used.

Thanks.


Re: [RFC PATCH net-next v6 02/15] net: page_pool: create hooks for custom page providers

2024-04-08 Thread Cong Wang
On Mon, Apr 01, 2024 at 12:22:24PM -0700, Mina Almasry wrote:
> On Thu, Mar 28, 2024 at 12:31 AM Christoph Hellwig  wrote:
> >
> > On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
> > >
> > > Are you envisioning that dmabuf support would be added to the block
> > > layer
> >
> > Yes.
> >
> > > (which I understand is part of the VFS and not driver specific),
> >
> > The block layer isn't really the VFS, it's just another core stack
> > like the network stack.
> >
> > > or as part of the specific storage driver (like nvme for example)? If
> > > we can add dmabuf support to the block layer itself that sounds
> > > awesome. We may then be able to do devmem TCP on all/most storage
> > > devices without having to modify each individual driver.
> >
> > I suspect we'll still need to touch the drivers to understand it,
> > but hopefully all the main infrastructure can live in the block layer.
> >
> > > In your estimation, is adding dmabuf support to the block layer
> > > something technically feasible & acceptable upstream? I notice you
> > > suggested it so I'm guessing yes to both, but I thought I'd confirm.
> >
> > I think so, and I know there has been quite some interest to at least
> > pre-register userspace memory so that the iommu overhead can be
> > pre-loaded.  It also is a much better interface for Peer to Peer
> > transfers than what we currently have.
> >

Thanks for copying me on this. This sounds really great.

Also, P2PDMA requires the PCI root complex to support this kind of direct
transfer; IIUC, dmabuf does not have such a hardware dependency.

> 
> I think this is positively thrilling news for me. I was worried that
> adding devmemTCP support to storage devices would involve using a
> non-dmabuf standard of buffer sharing like pci_p2pdma_
> (drivers/pci/p2pdma.c) and that would require messy changes to
> pci_p2pdma_ that would get nacked. Also it would require adding
> pci_p2pdma_ support to devmem TCP, which is a can of worms. If adding
> dma-buf support to storage devices is feasible and desirable, that's a
> much better approach IMO. (a) it will maybe work with devmem TCP
> without any changes needed on the netdev side of things and (b)
> dma-buf support may be generically useful and a good contribution even
> outside of devmem TCP.

I think the major difference is the interface: it exposes an mmap'ed memory
region instead of an fd: https://lwn.net/Articles/906092/.

> 
> I don't have a concrete user for devmem TCP for storage devices but
> the use case is very similar to GPU and I imagine the benefits in perf
> can be significant in some setups.

We have storage use cases at ByteDance: we use NVMe SSDs to cache videos
transferred over the network, so moving data directly from SSD to NIC
would help a lot.

Thanks!


[Patch net] vsock: improve tap delivery accuracy

2023-05-02 Thread Cong Wang
From: Cong Wang 

When virtqueue_add_sgs() fails, the skb is put back to send queue,
we should not deliver the copy to tap device in this case. So we
need to move virtio_transport_deliver_tap_pkt() down after all
possible failures.

Fixes: 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks")
Cc: Stefan Hajnoczi 
Cc: Stefano Garzarella 
Cc: Bobby Eshleman 
Signed-off-by: Cong Wang 
---
 net/vmw_vsock/virtio_transport.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..055678628c07 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -109,9 +109,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
if (!skb)
break;
 
-   virtio_transport_deliver_tap_pkt(skb);
-   reply = virtio_vsock_skb_reply(skb);
-
sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
sgs[out_sg++] = &hdr;
if (skb->len > 0) {
@@ -128,6 +125,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
break;
}
 
+   virtio_transport_deliver_tap_pkt(skb);
+   reply = virtio_vsock_skb_reply(skb);
if (reply) {
struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
int val;
-- 
2.34.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [syzbot] [net?] [virt?] [io-uring?] [kvm?] BUG: soft lockup in vsock_connect

2023-03-26 Thread Cong Wang
On Sat, Mar 18, 2023 at 12:32:39AM +, Bobby Eshleman wrote:
> On Fri, Mar 24, 2023 at 09:38:38AM +0100, Stefano Garzarella wrote:
> > Hi Bobby,
> > FYI we have also this one, but it seems related to
> > syzbot+befff0a9536049e79...@syzkaller.appspotmail.com
> > 
> > Thanks,
> > Stefano
> > 
> 
> Got it, I'll look into it.
> 

It seems you forgot to set skb->sk??

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 957cdc01c8e8..d47ad27b409d 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -236,6 +236,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
}

virtio_transport_inc_tx_pkt(vvs, skb);
+   skb_set_owner_w(skb, sk_vsock(vsk));

return t_ops->send_pkt(skb);
 }



Re: [PATCH RFC net-next v2 0/3] vsock: add support for sockmap

2023-02-05 Thread Cong Wang
On Mon, Jan 30, 2023 at 08:35:11PM -0800, Bobby Eshleman wrote:
> Add support for sockmap to vsock.
> 
> We're testing usage of vsock as a way to redirect guest-local UDS requests to
> the host and this patch series greatly improves the performance of such a
> setup.
> 
> Compared to copying packets via userspace, this improves throughput by 121% in
> basic testing.
> 
> Tested as follows.
> 
> Setup: guest unix dgram sender -> guest vsock redirector -> host vsock server
> Threads: 1
> Payload: 64k
> No sockmap:
> - 76.3 MB/s
> - The guest vsock redirector was
>   "socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
> Using sockmap (this patch):
> - 168.8 MB/s (+121%)
> - The guest redirector was a simple sockmap echo server,
>   redirecting unix ingress to vsock 2:1234 egress.
> - Same sender and server programs
> 
> *Note: these numbers are from RFC v1
> 
> Only the virtio transport has been tested. The loopback transport was used in
> writing bpf/selftests, but not thoroughly tested otherwise.
> 
> This series requires the skb patch.
> 

Looks good to me. Definitely good to go as non-RFC.

Thanks.


Re: [PATCH RFC 1/3] vsock: support sockmap

2023-01-21 Thread Cong Wang
On Wed, Jan 18, 2023 at 12:27:39PM -0800, Bobby Eshleman wrote:
> +static int vsock_read_skb(struct sock *sk, skb_read_actor_t read_actor)
> +{
> + struct vsock_sock *vsk = vsock_sk(sk);
> +
> + if (!vsk->transport)
> + return -ENODEV;
> +
> + if (!vsk->transport->read_skb)
> + return -EOPNOTSUPP;

Can we move these two checks to sockmap update path? It would make
vsock_read_skb() faster.

> +
> + return vsk->transport->read_skb(vsk, read_actor);
> +}

Essentially can be just this one line.

Thanks.


Re: [syzbot] kernel BUG in vhost_vsock_handle_tx_kick

2023-01-03 Thread Cong Wang
On Tue, Jan 03, 2023 at 04:08:51PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:c76083fac3ba Add linux-next specific files for 20221226
> git tree:   linux-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=1723da4248
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c217c755f1884ab6
> dashboard link: https://syzkaller.appspot.com/bug?extid=30b72abaa17c07fe39dd
> compiler:   gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils 
> for Debian) 2.35.2
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14fc414c48
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1604b20a48
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/e388f26357fd/disk-c76083fa.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/e24f0bae36d5/vmlinux-c76083fa.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/a5a69a059716/bzImage-c76083fa.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+30b72abaa17c07fe3...@syzkaller.appspotmail.com

+bobby.eshle...@gmail.com

Bobby, please take a look.

Thanks.

> 
> skbuff: skb_over_panic: text:8768d6f1 len:25109 put:25109 
> head:88802b5ac000 data:88802b5ac02c tail:0x6241 end:0xc0 dev:
> [ cut here ]
> kernel BUG at net/core/skbuff.c:121!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 5072 Comm: vhost-5071 Not tainted 
> 6.2.0-rc1-next-20221226-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 10/26/2022
> RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:121
> Code: f7 4c 8b 4c 24 10 8b 4b 70 41 56 45 89 e8 4c 89 e2 41 57 48 89 ee 48 c7 
> c7 40 04 5b 8b ff 74 24 10 ff 74 24 20 e8 09 8e bf ff <0f> 0b e8 1a 67 82 f7 
> 4c 8b 64 24 18 e8 80 3d d0 f7 48 c7 c1 40 12
> RSP: 0018:c90003cefca0 EFLAGS: 00010282
> RAX: 008d RBX: 88802b674500 RCX: 
> RDX: 8880236bba80 RSI: 81663b9c RDI: f5200079df86
> RBP: 8b5b1280 R08: 008d R09: 
> R10: 8000 R11:  R12: 8768d6f1
> R13: 6215 R14: 8b5b0400 R15: 00c0
> FS:  () GS:8880b980() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 2380 CR3: 2985f000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  skb_over_panic net/core/skbuff.c:126 [inline]
>  skb_put.cold+0x24/0x24 net/core/skbuff.c:2218
>  virtio_vsock_skb_rx_put include/linux/virtio_vsock.h:56 [inline]
>  vhost_vsock_alloc_skb drivers/vhost/vsock.c:374 [inline]
>  vhost_vsock_handle_tx_kick+0xad1/0xd00 drivers/vhost/vsock.c:509
>  vhost_worker+0x241/0x3e0 drivers/vhost/vhost.c:364
>  kthread+0x2e8/0x3a0 kernel/kthread.c:376
>  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
>  
> Modules linked in:
> ---[ end trace  ]---
> RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:121
> Code: f7 4c 8b 4c 24 10 8b 4b 70 41 56 45 89 e8 4c 89 e2 41 57 48 89 ee 48 c7 
> c7 40 04 5b 8b ff 74 24 10 ff 74 24 20 e8 09 8e bf ff <0f> 0b e8 1a 67 82 f7 
> 4c 8b 64 24 18 e8 80 3d d0 f7 48 c7 c1 40 12
> RSP: 0018:c90003cefca0 EFLAGS: 00010282
> RAX: 008d RBX: 88802b674500 RCX: 
> RDX: 8880236bba80 RSI: 81663b9c RDI: f5200079df86
> RBP: 8b5b1280 R08: 008d R09: 
> R10: 8000 R11:  R12: 8768d6f1
> R13: 6215 R14: 8b5b0400 R15: 00c0
> FS:  () GS:8880b990() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fdc6f4a4298 CR3: 2985f000 CR4: 003506e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches


Re: [PATCH v2] vsock: replace virtio_vsock_pkt with sk_buff

2022-10-15 Thread Cong Wang
On Mon, Oct 03, 2022 at 12:11:39AM +, Bobby Eshleman wrote:
> On Thu, Oct 06, 2022 at 09:34:10AM +0200, Stefano Garzarella wrote:
> > On Thu, Oct 06, 2022 at 03:08:12AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Oct 05, 2022 at 06:19:44PM -0700, Bobby Eshleman wrote:
> > > > This patch replaces the struct virtio_vsock_pkt with struct sk_buff.
> > > > 
> > > > Using sk_buff in vsock benefits it by a) allowing vsock to be extended
> > > > for socket-related features like sockmap, b) vsock may in the future
> > > > use other sk_buff-dependent kernel capabilities, and c) vsock shares
> > > > commonality with other socket types.
> > > > 
> > > > This patch is taken from the original series found here:
> > > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshle...@bytedance.com/
> > > > 
> > > > Small-sized packet throughput improved by ~5% (from 18.53 Mb/s to 19.51
> > > > Mb/s). Tested using uperf, 16B payloads, 64 threads, 100s, averaged from
> > > > 10 test runs (n=10). This improvement is likely due to packet merging.
> > > > 
> > > > Large-sized packet throughput decreases ~9% (from 27.25 Gb/s to 25.04
> > > > Gb/s). Tested using uperf, 64KB payloads, 64 threads, 100s, averaged
> > > > from 10 test runs (n=10).
> > > > 
> > > > Medium-sized packet throughput decreases ~5% (from 4.0 Gb/s to 3.81
> > > > Gb/s). Tested using uperf, 4k to 8k payload sizes picked randomly
> > > > according to normal distribution, 64 threads, 100s, averaged from 10
> > > > test runs (n=10).
> > > 
> > > It is surprizing to me that the original vsock code managed to outperform
> > > the new one, given that to my knowledge we did not focus on optimizing it.
> > 
> > Yeah mee to.
> > 
> 
> Indeed.
> 
> > From this numbers maybe the allocation cost has been reduced as it performs
> > better with small packets. But with medium to large packets we perform
> > worse, perhaps because previously we were allocating a contiguous buffer up
> > to 64k?
> > Instead alloc_skb() could allocate non-contiguous pages ? (which would solve
> > the problems we saw a few days ago)
> > 
> 
> I think this would be the case with alloc_skb_with_frags(), but
> internally alloc_skb() uses kmalloc() for the payload and sk_buff_head
> slab allocations for the sk_buff itself (all the more confusing to me,
> as the prior allocator also uses two separate allocations per packet).

I think it is related to your implementation of
virtio_transport_add_to_queue(), where you introduced much more
complicated logic than before:

-   spin_lock_bh(&vsock->send_pkt_list_lock);
-   list_add_tail(&pkt->list, &vsock->send_pkt_list);
-   spin_unlock_bh(&vsock->send_pkt_list_lock);
-
+   virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);

A simple list_add_tail() is definitely faster than your
virtio_transport_skbs_can_merge() check. So why do you have to merge
skbs while we never merged virtio_vsock_pkt?

_If_ you are trying to mimic TCP, I think you are doing it wrong: it can
be much more efficient if you do the merge in sendmsg(), before the skb
is even allocated; see tcp_sendmsg_locked().

Thanks.


Re: [Cake] [PATCH net-next] net: sched: remove redundant NULL check in change hook function

2022-08-28 Thread Cong Wang via Cake
On Sat, Aug 27, 2022 at 09:49:10AM +0800, Zhengchao Shao wrote:
> Currently, the change function can be called by two ways. The one way is
> that qdisc_change() will call it. Before calling change function,
> qdisc_change() ensures tca[TCA_OPTIONS] is not empty. The other way is
> that .init() will call it. The opt parameter is also checked before
> calling the change function in .init(). Therefore, there is no need to
> check the input parameter opt in the change function.
> 

Right.. but the one below:

> diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
> index c50a0853dcb9..e23d3dbb7272 100644
> --- a/net/sched/sch_gred.c
> +++ b/net/sched/sch_gred.c
> @@ -413,9 +413,6 @@ static int gred_change_table_def(struct Qdisc *sch, struct nlattr *dps,
>   bool red_flags_changed;
>   int i;
>  
> - if (!dps)
> - return -EINVAL;
> -

I don't think anyone checks tb[TCA_GRED_DPS]. What you intended to patch
is gred_change(), right?

Thanks.
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [ovs-dev] [PATCH net v2 2/3] net/sched: flow_dissector: Fix matching on zone id for invalid conns

2021-12-11 Thread Cong Wang
On Fri, Dec 10, 2021 at 8:52 PM Jakub Kicinski  wrote:
>
> On Thu, 9 Dec 2021 09:57:33 +0200 Paul Blakey wrote:
> > @@ -238,10 +239,12 @@ void
> >  skb_flow_dissect_ct(const struct sk_buff *skb,
> >   struct flow_dissector *flow_dissector,
> >   void *target_container, u16 *ctinfo_map,
> > - size_t mapsize, bool post_ct)
> > + size_t mapsize)
> >  {
> >  #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> > + bool post_ct = tc_skb_cb(skb)->post_ct;
> >   struct flow_dissector_key_ct *key;
> > + u16 zone = tc_skb_cb(skb)->zone;
> >   enum ip_conntrack_info ctinfo;
> >   struct nf_conn_labels *cl;
> >   struct nf_conn *ct;
> > @@ -260,6 +263,7 @@ skb_flow_dissect_ct(const struct sk_buff *skb,
> >   if (!ct) {
> >   key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >   TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> > + key->ct_zone = zone;
> >   return;
> >   }
> >
>
> Why is flow dissector expecting skb cb to be TC now?
> Please keep the appropriate abstractions intact.

Probably because only fl_classify() calls it, but I agree with you
on this point: this function is supposed to be TC-independent.

Thanks.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [Patch net] wireguard: preserve skb->mark on ingress side

2021-10-07 Thread Cong Wang
Hi, Jason

On Mon, Sep 27, 2021 at 8:27 PM Cong Wang  wrote:
>
> On Mon, Sep 27, 2021 at 8:22 PM Jason A. Donenfeld  wrote:
> >
> > Hi Cong,
> >
> > I'm not so sure this makes sense, as the inner packet is in fact
> > totally different. If you want to distinguish the ingress interface,
>
> The contents are definitely different, but skb itself is the same.
>
> Please also take a look at other tunnels, they all preserve this
> in similar ways, that is, comparing net namespaces. Any reason
> why wireguard is so different from other tunnels?

Any response?

Thanks.


[Patch net] wireguard: preserve skb->mark on ingress side

2021-09-27 Thread Cong Wang
From: Cong Wang 

On the ingress side, wg_reset_packet() clears skb->mark twice: once in
skb_scrub_packet() (with xnet==true) and again in the memset() that
follows it. But skb->mark does not have to be cleared, at least when
staying in the same net namespace; other tunnels preserve it in similar
ways, especially vxlan.

In our use case, we would like to preserve this skb->mark to
distinguish which wireguard device the packets are routed from.

Tested-by: Peilin Ye 
Cc: "Jason A. Donenfeld" 
Signed-off-by: Cong Wang 
---
 drivers/net/wireguard/queueing.h | 9 +++--
 drivers/net/wireguard/receive.c  | 2 +-
 drivers/net/wireguard/send.c | 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index 4ef2944a68bc..3516c1c59df0 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -73,15 +73,20 @@ static inline bool wg_check_packet_protocol(struct sk_buff 
*skb)
return real_protocol && skb->protocol == real_protocol;
 }
 
-static inline void wg_reset_packet(struct sk_buff *skb, bool encapsulating)
+static inline void wg_reset_packet(struct sk_buff *skb, bool encapsulating,
+  bool xnet)
 {
u8 l4_hash = skb->l4_hash;
u8 sw_hash = skb->sw_hash;
u32 hash = skb->hash;
-   skb_scrub_packet(skb, true);
+   u32 mark;
+
+   skb_scrub_packet(skb, xnet);
+   mark = skb->mark;
memset(&skb->headers_start, 0,
   offsetof(struct sk_buff, headers_end) -
   offsetof(struct sk_buff, headers_start));
+   skb->mark = mark;
if (encapsulating) {
skb->l4_hash = l4_hash;
skb->sw_hash = sw_hash;
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 7dc84bcca261..385b2b60cfd9 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -476,7 +476,7 @@ int wg_packet_rx_poll(struct napi_struct *napi, int budget)
if (unlikely(wg_socket_endpoint_from_skb(&endpoint, skb)))
goto next;
 
-   wg_reset_packet(skb, false);
+   wg_reset_packet(skb, false, !net_eq(dev_net(peer->device->dev), dev_net(skb->dev)));
wg_packet_consume_data_done(peer, skb, &endpoint);
free = false;
 
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 5368f7c35b4b..c77ef0815c2e 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -296,7 +296,7 @@ void wg_packet_encrypt_worker(struct work_struct *work)
skb_list_walk_safe(first, skb, next) {
if (likely(encrypt_packet(skb,
PACKET_CB(first)->keypair))) {
-   wg_reset_packet(skb, true);
+   wg_reset_packet(skb, true, true);
} else {
state = PACKET_STATE_DEAD;
break;
-- 
2.30.2



Re: [ovs-dev] [PATCH] ovs: Only clear tstamp when changing namespaces

2021-09-21 Thread Cong Wang
On Sun, Sep 19, 2021 at 10:44 PM Tyler Stachecki
 wrote:
> Sorry if this is a no-op -- I'm admittedly not familiar with this part
> of the tree.  I had added this check based on this discussion on the
> OVS mailing list:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2021-February/050966.html
>
> The motivation to add it was based on the fact that skb_scrub_packet
> is doing it conditionally as well, but you seem to indicate that
> skb_scrub_packet itself is already being done somewhere?

I mean, skb->tstamp has already been cleared when crossing netns,
so: 1) you don't need to clear it again for this case; and 2) we clearly
fixed the other cases with commit 01634047bf0d, so your patch would break
them again.

Thanks.


Re: [ovs-dev] [PATCH] ovs: Only clear tstamp when changing namespaces

2021-09-21 Thread Cong Wang
On Sun, Sep 19, 2021 at 10:59 AM Tyler J. Stachecki
 wrote:
>
> As of "ovs: clear skb->tstamp in forwarding path", the
> tstamp is now being cleared unconditionally to fix fq qdisc
> operation with ovs vports.
>
> While this is mostly correct and fixes forwarding for that
> use case, a slight adjustment is necessary to ensure that
> the tstamp is cleared *only when the forwarding is across
> namespaces*.

Hmm? I am sure timestamp has already been cleared when
crossing netns:

void skb_scrub_packet(struct sk_buff *skb, bool xnet)
{
...
if (!xnet)
return;

ipvs_reset(skb);
skb->mark = 0;
skb->tstamp = 0;
}

So, what are you trying to fix?

>
> Signed-off-by: Tyler J. Stachecki 
> ---
>  net/openvswitch/vport.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index cf2ce5812489..c2d32a5c3697 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -507,7 +507,8 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
> }
>
> skb->dev = vport->dev;
> -   skb->tstamp = 0;
> +   if (dev_net(skb->dev))

Doesn't dev_net() always return a non-NULL pointer?

If you really want to check whether it is cross-netns, you should
use net_eq() to compare src netns with dst netns, something like:
if (!net_eq(dev_net(vport->dev), dev_net(skb->dev))).

Thanks.


Re: [Patch] blk-wbt: fix a divide-by-zero error in rwb_arm_timer()

2021-04-20 Thread Cong Wang
On Sat, Apr 17, 2021 at 9:41 PM Cong Wang  wrote:
>
> From: Cong Wang 
>
> We hit a divide error in rwb_arm_timer() and crash dump shows
> rqd->scale_step is 16777215 (0xff in hex), so the expression
> "(rqd->scale_step + 1) << 8)" is 0x1, which is just beyond
> 32-bit integer range, hence it is truncated to 0 and int_sqrt(0)
> returns 0 too, so we end up passing 0 as a divisor to div_u64().
>

Never mind. rqd->scale_step should be capped by
rq_depth_scale_down(), so it should never be so large. In the old
calc_wb_limits() implementation, rwb->wb_max was accidentally set to
zero.

Thanks.


[Patch] blk-wbt: fix a divide-by-zero error in rwb_arm_timer()

2021-04-17 Thread Cong Wang
From: Cong Wang 

We hit a divide error in rwb_arm_timer() and crash dump shows
rqd->scale_step is 16777215 (0xff in hex), so the expression
"(rqd->scale_step + 1) << 8)" is 0x1, which is just beyond
32-bit integer range, hence it is truncated to 0 and int_sqrt(0)
returns 0 too, so we end up passing 0 as a divisor to div_u64().

Looking at the assembly code generated:

add    $0x1,%edi
shl    $0x8,%edi
movslq %edi,%rdi
mov    0x10(%rbx),%rdi
xor    %edx,%edx
mov    %eax,%ecx
shl    $0x4,%rdi
mov    %rdi,%rax
div    %rcx

we notice that the left shift is still using 32 bit register %edi,
because the type of rqd->scale_step is 'int'. But actually int_sqrt()
takes 'long' as a parameter, so the temporary result should fit well
at least on x86_64. Fix this by explicitly casting the expression to
u64 and call int_sqrt64() to avoid any ambiguity on 32 bit.

After this patch, the assembly code looks correct:

add    $0x1,%edi
movslq %edi,%rdi
shl    $0x8,%rdi
mov    0x10(%rbx),%rdi
xor    %edx,%edx
mov    %eax,%ecx
shl    $0x4,%rdi
mov    %rdi,%rax
div    %rcx

Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism")
Cc: Jens Axboe 
Cc: Fam Zheng 
Cc: Xiongchun Duan 
Signed-off-by: Cong Wang 
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 42aed0160f86..5157ca86574f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -337,7 +337,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 * though.
 */
rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
-   int_sqrt((rqd->scale_step + 1) << 8));
+   int_sqrt64((u64)(rqd->scale_step + 1) << 8));
} else {
/*
 * For step < 0, we don't want to increase/decrease the
-- 
2.25.1



Re: [syzbot] WARNING: refcount bug in sk_psock_get

2021-04-10 Thread Cong Wang
On Fri, Apr 9, 2021 at 12:45 PM John Fastabend  wrote:
>
> syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:9c54130c Add linux-next specific files for 20210406
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17d8d7aad0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=d125958c3995ddcd
> > dashboard link: https://syzkaller.appspot.com/bug?extid=b54a1ce86ba4a623b7f0
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1729797ed0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1190f46ad0
> >
> > The issue was bisected to:
> >
> > commit 997acaf6b4b59c6a9c259740312a69ea549cc684
> > Author: Mark Rutland 
> > Date:   Mon Jan 11 15:37:07 2021 +
> >
> > lockdep: report broken irq restoration
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11a6cc96d0
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=13a6cc96d0
> > console output: https://syzkaller.appspot.com/x/log.txt?x=15a6cc96d0
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+b54a1ce86ba4a623b...@syzkaller.appspotmail.com
> > Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")
> >
> > [ cut here ]
> > refcount_t: saturated; leaking memory.
> > WARNING: CPU: 1 PID: 8414 at lib/refcount.c:19 
> > refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> > Modules linked in:
> > CPU: 1 PID: 8414 Comm: syz-executor793 Not tainted 
> > 5.12.0-rc6-next-20210406-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> > Code: 1d 69 0c e6 09 31 ff 89 de e8 c8 b4 a6 fd 84 db 75 ab e8 0f ae a6 fd 
> > 48 c7 c7 e0 52 c2 89 c6 05 49 0c e6 09 01 e8 91 0f 00 05 <0f> 0b eb 8f e8 
> > f3 ad a6 fd 0f b6 1d 33 0c e6 09 31 ff 89 de e8 93
> > RSP: 0018:c9eef388 EFLAGS: 00010282
> > RAX:  RBX:  RCX: 
> > RDX: 88801bbdd580 RSI: 815c2e05 RDI: f520001dde63
> > RBP:  R08:  R09: 
> > R10: 815bcc6e R11:  R12: 1920001dde74
> > R13: 90200301 R14: 888026e0 R15: c9eef3c0
> > FS:  01422300() GS:8880b9d0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 2000 CR3: 12b3b000 CR4: 001506e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  __refcount_add_not_zero include/linux/refcount.h:163 [inline]
> >  __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
> >  refcount_inc_not_zero include/linux/refcount.h:245 [inline]
> >  sk_psock_get+0x3b0/0x400 include/linux/skmsg.h:435
> >  bpf_exec_tx_verdict+0x11e/0x11a0 net/tls/tls_sw.c:799
> >  tls_sw_sendmsg+0xa41/0x1800 net/tls/tls_sw.c:1013
> >  inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821
>
> [...]
>
> This is likely a problem with latest round of sockmap patches I'll
> take a look.

I bet this has nothing to do with my sockmap patches, as the
reproducer does not involve BPF at all. The reproducer actually
creates an SMC socket, which coincidentally uses sk_user_data too,
and therefore triggers this bug.

I think we should just prohibit TCP_ULP on SMC socket, something
like this:

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 47340b3b514f..0d4d6d28f20c 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2162,6 +2162,9 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
struct smc_sock *smc;
int val, rc;

+   if (optname == TCP_ULP)
+   return -EOPNOTSUPP;
+
smc = smc_sk(sk);

/* generic setsockopts reaching us here always apply to the
@@ -2186,7 +2189,6 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
if (rc || smc->use_fallback)
goto out;
switch (optname) {
-   case TCP_ULP:
case TCP_FASTOPEN:
case TCP_FASTOPEN_CONNECT:
case TCP_FASTOPEN_KEY:


Re: [External] linux-next: manual merge of the net-next tree with the bpf tree

2021-04-07 Thread Cong Wang
On Wed, Apr 7, 2021 at 8:11 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   net/core/skmsg.c
>
> between commit:
>
>   144748eb0c44 ("bpf, sockmap: Fix incorrect fwd_alloc accounting")
>
> from the bpf tree and commit:
>
>   e3526bb92a20 ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
>
> from the net-next tree.
>
> I fixed it up (I think - see below) and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging.  You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

Looks good from my quick glance.

Thanks!


Re: [External] linux-next: manual merge of the net-next tree with the bpf tree

2021-04-07 Thread Cong Wang
On Wed, Apr 7, 2021 at 8:02 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   include/linux/skmsg.h
>
> between commit:
>
>   1c84b33101c8 ("bpf, sockmap: Fix sk->prot unhash op reset")
>
> from the bpf tree and commit:
>
>   8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
>
> from the net-next tree.
>
> I didn't know how to fix it up so I just used the latter version for
> today - a better solution would be appreciated. This is now fixed as
> far as linux-next is concerned, but any non trivial conflicts should be
> mentioned to your upstream maintainer when your tree is submitted for
> merging.  You may also want to consider cooperating with the maintainer
> of the conflicting tree to minimise any particularly complex conflicts.

The right way to resolve this is to move the lines added in commit
1c84b33101c8 to the similar place in tcp_bpf_update_proto().

Thanks.


Re: [syzbot] KASAN: use-after-free Write in sk_psock_stop

2021-04-06 Thread Cong Wang
On Tue, Apr 6, 2021 at 6:01 AM syzbot
 wrote:
> ==
> BUG: KASAN: use-after-free in __lock_acquire+0x3e6f/0x54c0 
> kernel/locking/lockdep.c:4770
> Read of size 8 at addr 888024f66238 by task syz-executor.1/14202
>
> CPU: 0 PID: 14202 Comm: syz-executor.1 Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
>  __kasan_report mm/kasan/report.c:399 [inline]
>  kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
>  __lock_acquire+0x3e6f/0x54c0 kernel/locking/lockdep.c:4770
>  lock_acquire kernel/locking/lockdep.c:5510 [inline]
>  lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
>  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:175
>  spin_lock_bh include/linux/spinlock.h:359 [inline]
>  sk_psock_stop+0x2f/0x4d0 net/core/skmsg.c:750
>  sock_map_close+0x172/0x390 net/core/sock_map.c:1534
>  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
>  __sock_release+0xcd/0x280 net/socket.c:599
>  sock_close+0x18/0x20 net/socket.c:1258
>  __fput+0x288/0x920 fs/file_table.c:280
>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
>  tracehook_notify_resume include/linux/tracehook.h:189 [inline]
>  exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
>  exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:208
>  __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
>  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:301
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x466459
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7f1bde3a3188 EFLAGS: 0246 ORIG_RAX: 0003
> RAX:  RBX: 0056bf60 RCX: 00466459
> RDX:  RSI:  RDI: 0005
> RBP: 004bf9fb R08:  R09: 
> R10:  R11: 0246 R12: 0056bf60
> R13: 7ffe6eb13bbf R14: 7f1bde3a3300 R15: 00022000
[...]
> Second to last potentially related work creation:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:345
>  __call_rcu kernel/rcu/tree.c:3039 [inline]
>  call_rcu+0xb1/0x740 kernel/rcu/tree.c:3114
>  queue_rcu_work+0x82/0xa0 kernel/workqueue.c:1753
>  sk_psock_put include/linux/skmsg.h:446 [inline]
>  sock_map_unref+0x109/0x190 net/core/sock_map.c:182
>  sock_hash_delete_from_link net/core/sock_map.c:918 [inline]
>  sock_map_unlink net/core/sock_map.c:1480 [inline]
>  sock_map_remove_links+0x389/0x530 net/core/sock_map.c:1492
>  sock_map_close+0x12f/0x390 net/core/sock_map.c:1532

Looks like the last refcnt can be gone before sk_psock_stop().

Technically, we can call sk_psock_stop() before
sock_map_remove_links(); the only obstacle is the RCU read lock
there. Let me see if we can get rid of that RCU read lock.

Thanks.


Re: [syzbot] WARNING: suspicious RCU usage in tcp_bpf_update_proto

2021-04-06 Thread Cong Wang
On Tue, Apr 6, 2021 at 2:44 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:514e1150 net: x25: Queue received packets in the drivers i..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=112a8831d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7eff0f22b8563a5f
> dashboard link: https://syzkaller.appspot.com/bug?extid=320a3bc8d80f478c37e4
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1532d711d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15f44c5ed0
>
> The issue was bisected to:
>
> commit 4dfe6bd94959222e18d512bdf15f6bf9edb9c27c
> Author: Rustam Kovhaev 
> Date:   Wed Feb 24 20:00:30 2021 +
>
> ntfs: check for valid standard information attribute

This is caused by one of my sockmap patches.

>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=16207a81d0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=15207a81d0
> console output: https://syzkaller.appspot.com/x/log.txt?x=11207a81d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+320a3bc8d80f478c3...@syzkaller.appspotmail.com
> Fixes: 4dfe6bd94959 ("ntfs: check for valid standard information attribute")
>
> =
> WARNING: suspicious RCU usage
> 5.12.0-rc4-syzkaller #0 Not tainted
> -
> include/linux/skmsg.h:286 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor383/8454:
>  #0: 888013a99b48 (clock-AF_INET){++..}-{2:2}, at: 
> sk_psock_drop+0x2c/0x460 net/core/skmsg.c:788
>
> stack backtrace:
> CPU: 1 PID: 8454 Comm: syz-executor383 Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  sk_psock include/linux/skmsg.h:286 [inline]
>  tcp_bpf_update_proto+0x530/0x5f0 net/ipv4/tcp_bpf.c:504
>  sk_psock_restore_proto include/linux/skmsg.h:408 [inline]
>  sk_psock_drop+0xdf/0x460 net/core/skmsg.c:789
>  sk_psock_put include/linux/skmsg.h:446 [inline]
>  tcp_bpf_recvmsg+0x42d/0x480 net/ipv4/tcp_bpf.c:208
>  inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:852

Oddly, I have all the relevant Kconfig options enabled but never saw this
warning when running selftests for hours.

Anyway, let me see how this should be fixed.

Thanks!


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2021-04-05 Thread Cong Wang
On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina  wrote:
>
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> coming days. If it works, then we can consider proceeding with it,
> otherwise I am all for reverting the whole NOLOCK stuff.
>
> [1] 
> https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsh...@huawei.com/T/#u

I personally prefer to just revert that bit, as it brings more trouble
than gain. Even with Yunsheng's patch, there are still some issues.
Essentially, I think the core qdisc scheduling code is not ready to go
lockless; just look at those NOLOCK checks in sch_generic.c. :-/

Thanks.


Re: [PATCH v2 1/1] net: sched: bump refcount for new action in ACT replace mode

2021-03-30 Thread Cong Wang
On Mon, Mar 29, 2021 at 3:55 PM Kumar Kartikeya Dwivedi
 wrote:
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index b919826939e0..43cceb924976 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1042,6 +1042,9 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
> if (err != ACT_P_CREATED)
> module_put(a_o->owner);
>
> +   if (!bind && ovr && err == ACT_P_CREATED)
> +   refcount_set(&a->tcfa_refcnt, 2);
> +

Hmm, if we set the refcnt to 2 here, how could tcf_action_destroy()
destroy it when we roll back from a failure in the middle of the loop
in tcf_action_init()?

Thanks.


Re: [PATCH net v2] net: sched: fix packet stuck problem for lockless qdisc

2021-03-24 Thread Cong Wang
On Tue, Mar 23, 2021 at 7:24 PM Yunsheng Lin  wrote:
> @@ -176,8 +207,23 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
> write_seqcount_end(&qdisc->running);
> -   if (qdisc->flags & TCQ_F_NOLOCK)
> +   if (qdisc->flags & TCQ_F_NOLOCK) {
> spin_unlock(&qdisc->seqlock);
> +
> +   /* qdisc_run_end() is protected by RCU lock, and
> +* qdisc reset will do a synchronize_net() after
> +* setting __QDISC_STATE_DEACTIVATED, so testing
> +* the below two bits separately should be fine.

Hmm, why is it fine to do synchronize_net() after setting this bit? It
could still be flipped right after you test the RESCHEDULE bit.


> +* For qdisc_run() in net_tx_action() case, we
> +* really should provide rcu protection explicitly
> +* for document purposes or PREEMPT_RCU.
> +*/
> +   if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> + &qdisc->state) &&
> +!test_bit(__QDISC_STATE_DEACTIVATED,
> +  &qdisc->state)))

Why do you want to test the __QDISC_STATE_DEACTIVATED bit at all?
dev_deactivate_many() will wait for qdiscs that are scheduled but being
deactivated, so what's the problem with scheduling it even with this bit
set?

Thanks.


Re: [Linuxarm] Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-23 Thread Cong Wang
On Sun, Mar 21, 2021 at 5:55 PM Yunsheng Lin  wrote:
>
> On 2021/3/20 2:15, Cong Wang wrote:
> > On Thu, Mar 18, 2021 at 12:33 AM Yunsheng Lin  
> > wrote:
> >>
> >> On 2021/3/17 21:45, Jason A. Donenfeld wrote:
> >>> On 3/17/21, Toke Høiland-Jørgensen  wrote:
> >>>> Cong Wang  writes:
> >>>>
> >>>>> On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
> >>>>>>
> >>>>>> I thought pfifo was supposed to be "lockless" and this change
> >>>>>> re-introduces a lock between producer and consumer, no?
> >>>>>
> >>>>> It has never been truly lockless, it uses two spinlocks in the ring
> >>>>> buffer
> >>>>> implementation, and it introduced a q->seqlock recently, with this patch
> >>>>> now we have priv->lock, 4 locks in total. So our "lockless" qdisc ends
> >>>>> up having more locks than others. ;) I don't think we are going to a
> >>>>> right direction...
> >>>>
> >>>> Just a thought, have you guys considered adopting the lockless MSPC ring
> >>>> buffer recently introduced into Wireguard in commit:
> >>>>
> >>>> 8b5553ace83c ("wireguard: queueing: get rid of per-peer ring buffers")
> >>>>
> >>>> Jason indicated he was willing to work on generalising it into a
> >>>> reusable library if there was a use case for it. I haven't quite though
> >>>> through the details of whether this would be such a use case, but
> >>>> figured I'd at least mention it :)
> >>>
> >>> That offer definitely still stands. Generalization sounds like a lot of 
> >>> fun.
> >>>
> >>> Keep in mind though that it's an eventually consistent queue, not an
> >>> immediately consistent one, so that might not match all use cases. It
> >>> works with wg because we always trigger the reader thread anew when it
> >>> finishes, but that doesn't apply to everyone's queueing setup.
> >>
> >> Thanks for mentioning this.
> >>
> >> "multi-producer, single-consumer" seems to match the lockless qdisc's
> >> paradigm too, for now concurrent enqueuing/dequeuing to the pfifo_fast's
> >> queues() is not allowed, it is protected by producer_lock or consumer_lock.
> >>
> >> So it would be good to has lockless concurrent enqueuing, while dequeuing
> >> can be protected by qdisc_lock() or q->seqlock, which meets the 
> >> "multi-producer,
> >> single-consumer" paradigm.
> >
> > I don't think so. Usually we have one queue for each CPU so we can expect
> > each CPU has a lockless qdisc assigned, but we can not assume this in
> > the code, so we still have to deal with multiple CPU's sharing a lockless 
> > qdisc,
> > and we usually enqueue and dequeue in process context, so it means we could
> > have multiple producers and multiple consumers.
>
> For lockless qdisc, dequeuing is always within the qdisc_run_begin() and
> qdisc_run_end(), so multiple consumers is protected with each other by
> q->seqlock .

So are you saying you will never go lockless for the lockless qdisc? I
thought you really wanted to go lockless with Jason's proposal of the
MPMC ring buffer code.

>
> For enqueuing, multiple consumers is protected by producer_lock, see
> pfifo_fast_enqueue() -> skb_array_produce() -> ptr_ring_produce().

I think you seriously misunderstand how we classify MPMC vs. MPSC: it
is not about how we lock them, but about whether we truly have a single
consumer or multiple consumers regardless of the locks used, because
the goal is to go lockless.

> I am not sure if lockless MSPC can work with the process context, but
> even if not, the enqueuing is also protected by rcu_read_lock_bh(),
> which provides some kind of atomicity, so that producer_lock can be
> removed when lockless MSPC is used.


Not sure I even understand what you are saying here. Jason's code only
disables preemption with a busy wait; I can't see why it cannot be used
in process context.

Thanks.


Re: net/dev: fix information leak to userspace

2021-03-21 Thread Cong Wang
On Sun, Mar 21, 2021 at 9:34 AM Pavel Machek  wrote:
>
> dev_get_mac_address() does not always initialize whole
> structure. Unfortunately, other code copies such structure to
> userspace, leaking information. Fix it.

Well, most callers already initialize it with a memset() or copy_from_user(),
for example, __tun_chr_ioctl():

        if (cmd == TUNSETIFF || cmd == TUNSETQUEUE ||
            (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) {
                if (copy_from_user(&ifr, argp, ifreq_len))
                        return -EFAULT;
        } else {
                memset(&ifr, 0, sizeof(ifr));
        }

Except tap_ioctl(), but we can just initialize 'sa' there instead of doing
it in dev_get_mac_address().

Thanks.


Re: [Linuxarm] [PATCH net] net: sched: fix packet stuck problem for lockless qdisc

2021-03-19 Thread Cong Wang
On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin  wrote:
> I had done some performance test to see if there is value to
> fix the packet stuck problem and support lockless qdisc bypass,
> here is some result using pktgen in 'queue_xmit' mode on a dummy
> device as Paolo Abeni had done in [1], and using pfifo_fast qdisc:
>
> threads  vanillalocked-qdiscvanilla+this_patch
>1 2.6Mpps  2.9Mpps2.5Mpps
>2 3.9Mpps  4.8Mpps3.6Mpps
>4 5.6Mpps  3.0Mpps4.7Mpps
>8 2.7Mpps  1.6Mpps2.8Mpps
>162.2Mpps  1.3Mpps2.3Mpps
>
> locked-qdisc: test by removing the "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

I read this as: this patch somehow introduces a performance regression
for -net, given that the lockless bypass patch you submitted is for
net-next.

Thanks.


Re: [PATCH net] net: sched: fix packet stuck problem for lockless qdisc

2021-03-19 Thread Cong Wang
On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin  wrote:
>
> Lockless qdisc has below concurrent problem:
> cpu0                 cpu1
>      .                  .
> q->enqueue              .
>      .                  .
> qdisc_run_begin()       .
>      .                  .
> dequeue_skb()           .
>      .                  .
> sch_direct_xmit()       .
>      .                  .
>      .             q->enqueue
>      .             qdisc_run_begin()
>      .             return and do nothing
>      .                  .
> qdisc_run_end()         .
>
> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() return false
> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
> enqueue the skb after cpu0 calling dequeue_skb() and before
> cpu0 calling qdisc_run_end().
>
> Lockless qdisc has another concurrent problem when tx_action
> is involved:
>
> cpu0(serving tx_action)     cpu1            cpu2
>           .                   .               .
>           .              q->enqueue           .
>           .              qdisc_run_begin()    .
>           .              dequeue_skb()        .
>           .                   .           q->enqueue
>           .                   .               .
>           .              sch_direct_xmit()    .
>           .                   .          qdisc_run_begin()
>           .                   .          return and do nothing
>           .                   .               .
> clear __QDISC_STATE_SCHED     .               .
> qdisc_run_begin()             .               .
> return and do nothing         .               .
>           .                   .               .
>           .              qdisc_run_begin()    .
>
> This patch fixes the above data race by:
> 1. Set a flag after spin_trylock() return false.
> 2. Retry a spin_trylock() in case other CPU may not see the
>new flag after it releases the lock.
> 3. reschedule if the flag is set after the lock is released
>at the end of qdisc_run_end().
>
> For tx_action case, the flags is also set when cpu1 is at the
> end if qdisc_run_begin(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
>
> Also clear the flag before dequeuing in order to reduce the
> overhead of the above process, and avoid doing the heavy
> test_and_clear_bit() at the end of qdisc_run_end().
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin 
> ---
> For those who have not been following the qdisc scheduling
> discussion: there is a packet-stuck problem for lockless qdisc,
> see [1], and I have done some cleanup and added some enhanced
> features too, see [2] [3].
> While I was doing the optimization for lockless qdisc, it
> occurred to me that these optimizations are useless if there is
> still a basic bug in lockless qdisc, even if the bug is not easily
> reproducible. So looking through [1] again, I found the data
> race for tx action mentioned by Michael, thought deeply about
> it, and came up with this patch trying to fix it.
>
> So I would really appreciate it if someone who still has the
> reproducer could try this patch and report back.
>
> 1. 
> https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd...@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. 
> https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsh...@huawei.com/
> 3. 
> https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsh...@huawei.com/
> ---
>  include/net/sch_generic.h | 23 ++++++++++++++++++++---
>  net/sched/sch_generic.c   |  1 +
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..4220eab 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
> __QDISC_STATE_SCHED,
> __QDISC_STATE_DEACTIVATED,
> +   __QDISC_STATE_NEED_RESCHEDULE,
>  };
>
>  struct qdisc_size_table {
> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
> if (qdisc->flags & TCQ_F_NOLOCK) {
> -   if (!spin_trylock(&qdisc->seqlock))
> -   return false;
> +   if (!spin_trylock(&qdisc->seqlock)) {
> +   set_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +   &qdisc->state);

Why do we need another bit? I mean why not just call __netif_schedule()?

> +
> +   /* Retry again in case other CPU may not see the
> +* new flags after it releases the lock at the
> +* end 

Re: [Linuxarm] Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-19 Thread Cong Wang
On Thu, Mar 18, 2021 at 12:33 AM Yunsheng Lin  wrote:
>
> On 2021/3/17 21:45, Jason A. Donenfeld wrote:
> > On 3/17/21, Toke Høiland-Jørgensen  wrote:
> >> Cong Wang  writes:
> >>
> >>> On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
> >>>>
> >>>> I thought pfifo was supposed to be "lockless" and this change
> >>>> re-introduces a lock between producer and consumer, no?
> >>>
> >>> It has never been truly lockless, it uses two spinlocks in the ring
> >>> buffer
> >>> implementation, and it introduced a q->seqlock recently, with this patch
> >>> now we have priv->lock, 4 locks in total. So our "lockless" qdisc ends
> >>> up having more locks than others. ;) I don't think we are going to a
> >>> right direction...
> >>
> >> Just a thought, have you guys considered adopting the lockless MSPC ring
> >> buffer recently introduced into Wireguard in commit:
> >>
> >> 8b5553ace83c ("wireguard: queueing: get rid of per-peer ring buffers")
> >>
> >> Jason indicated he was willing to work on generalising it into a
> >> reusable library if there was a use case for it. I haven't quite though
> >> through the details of whether this would be such a use case, but
> >> figured I'd at least mention it :)
> >
> > That offer definitely still stands. Generalization sounds like a lot of fun.
> >
> > Keep in mind though that it's an eventually consistent queue, not an
> > immediately consistent one, so that might not match all use cases. It
> > works with wg because we always trigger the reader thread anew when it
> > finishes, but that doesn't apply to everyone's queueing setup.
>
> Thanks for mentioning this.
>
> "multi-producer, single-consumer" seems to match the lockless qdisc's
> paradigm too, for now concurrent enqueuing/dequeuing to the pfifo_fast's
> queues() is not allowed, it is protected by producer_lock or consumer_lock.
>
> So it would be good to has lockless concurrent enqueuing, while dequeuing
> can be protected by qdisc_lock() or q->seqlock, which meets the 
> "multi-producer,
> single-consumer" paradigm.

I don't think so. Usually we have one queue for each CPU, so we can
expect each CPU to have a lockless qdisc assigned, but we cannot assume
this in the code, so we still have to deal with multiple CPUs sharing a
lockless qdisc. And since we usually enqueue and dequeue in process
context, we can have multiple producers and multiple consumers.

On the other hand, I don't think the problems we have been fixing are
in the ring buffer implementation itself; they are about the high-level
qdisc state transitions.

Thanks.


Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
>
> I thought pfifo was supposed to be "lockless" and this change
> re-introduces a lock between producer and consumer, no?

It has never been truly lockless: it uses two spinlocks in the ring
buffer implementation, and it recently introduced q->seqlock. With this
patch we now have priv->lock, 4 locks in total. So our "lockless" qdisc
ends up having more locks than others. ;) I don't think we are going in
the right direction...

Thanks.


Re: [PATCH net-next] net: sched: remove unnecessary lock protection for skb_bad_txq/gso_skb

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 4:42 PM David Miller  wrote:
>
> From: Yunsheng Lin 
> Date: Mon, 15 Mar 2021 17:30:10 +0800
>
> > Currently qdisc_lock(q) is taken before enqueuing and dequeuing
> > for lockless qdisc's skb_bad_txq/gso_skb queue, qdisc->seqlock is
> > also taken, which can provide the same protection as qdisc_lock(q).
> >
> > This patch removes the unnecessary qdisc_lock(q) protection for
> > lockless qdisc's skb_bad_txq/gso_skb queue.
> >
> > And dev_reset_queue() takes the qdisc->seqlock for lockless qdisc
> > besides taking the qdisc_lock(q) when doing the qdisc reset,
> > some_qdisc_is_busy() takes both qdisc->seqlock and qdisc_lock(q)
> > when checking qdisc status. It is unnecessary to take both lock
> > while the fast path only take one lock, so this patch also changes
> > it to only take qdisc_lock(q) for locked qdisc, and only take
> > qdisc->seqlock for lockless qdisc.
> >
> > Since qdisc->seqlock is taken for lockless qdisc when calling
> > qdisc_is_running() in some_qdisc_is_busy(), use qdisc->running
> > to decide if the lockless qdisc is running.
> >
> > Signed-off-by: Yunsheng Lin 
>
> What about other things protected by this lock, such as statistics and qlen?
>
> This change looks too risky to me.

They are per-cpu for pfifo_fast, which sets TCQ_F_CPUSTATS too.

Thanks.


Re: [PATCH net-next] net: sched: remove unnecessary lock protection for skb_bad_txq/gso_skb

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 2:29 AM Yunsheng Lin  wrote:
>
> Currently qdisc_lock(q) is taken before enqueuing and dequeuing
> for lockless qdisc's skb_bad_txq/gso_skb queue, qdisc->seqlock is
> also taken, which can provide the same protection as qdisc_lock(q).
>
> This patch removes the unnecessary qdisc_lock(q) protection for
> lockless qdisc's skb_bad_txq/gso_skb queue.
>
> And dev_reset_queue() takes the qdisc->seqlock for lockless qdisc
> besides taking the qdisc_lock(q) when doing the qdisc reset,
> some_qdisc_is_busy() takes both qdisc->seqlock and qdisc_lock(q)
> when checking qdisc status. It is unnecessary to take both lock
> while the fast path only take one lock, so this patch also changes
> it to only take qdisc_lock(q) for locked qdisc, and only take
> qdisc->seqlock for lockless qdisc.
>
> Since qdisc->seqlock is taken for lockless qdisc when calling
> qdisc_is_running() in some_qdisc_is_busy(), use qdisc->running
> to decide if the lockless qdisc is running.

What's the benefit here? Since qdisc->q.lock is also per-qdisc, there
is no actual contention in taking it when we already hold q->seqlock,
right?

Also, is ->seqlock supposed to be used for protecting skb_bad_txq
etc.? From my understanding, it was introduced merely for replacing
__QDISC_STATE_RUNNING. If you want to extend it, you probably
have to rename it too.

Thanks.


Re: [PATCH] net: sched: avoid duplicates in classes dump

2021-03-04 Thread Cong Wang
On Thu, Mar 4, 2021 at 6:44 AM Maximilian Heyne  wrote:
>
> This is a follow up of commit ea3274695353 ("net: sched: avoid
> duplicates in qdisc dump") which has fixed the issue only for the qdisc
> dump.
>
> The duplicate printing also occurs when dumping the classes via
>   tc class show dev eth0
>
> Fixes: 59cc1f61f09c ("net: sched: convert qdisc linked list to hashtable")
> Signed-off-by: Maximilian Heyne 

Seems reasonable. If you can show the difference in the tc class dump
before and after this patch in your changelog, it would help in
understanding the bug here.

Thanks.


Re: [PATCH] net: bridge: Fix jump_label config

2021-02-26 Thread Cong Wang
On Thu, Feb 25, 2021 at 5:39 PM Kefeng Wang  wrote:
>
>
> On 2021/2/26 5:22, Cong Wang wrote:
> > On Wed, Feb 24, 2021 at 8:03 AM Kefeng Wang  
> > wrote:
> >> HAVE_JUMP_LABEL was removed by commit e9666d10a567 ("jump_label: move
> >> 'asm goto' support test to Kconfig"); use CONFIG_JUMP_LABEL instead
> >> of HAVE_JUMP_LABEL.
> >>
> >> Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge 
> >> input path")
> >> Signed-off-by: Kefeng Wang 
> > Hmm, why do we have to use a macro here? static_key_false() is defined
> > in both cases, CONFIG_JUMP_LABEL=y or CONFIG_JUMP_LABEL=n.
>
> It seems that all nf_hooks_needed-related code uses the macro,
>
> see net/netfilter/core.c and include/linux/netfilter.h,
>
>#ifdef CONFIG_JUMP_LABEL
>struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
> EXPORT_SYMBOL(nf_hooks_needed);
> #endif
>
>nf_static_key_inc()/nf_static_key_dec()

Same question: why? Clearly struct static_key is defined in both cases:

#else
struct static_key {
        atomic_t enabled;
};
#endif  /* CONFIG_JUMP_LABEL */

Thanks.


Re: [PATCH/v3] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-02-26 Thread Cong Wang
On Thu, Feb 25, 2021 at 7:59 PM Xuesen Huang  wrote:
> v3:
> - Fix the code format.
>
> v2:
> Suggested-by: Willem de Bruijn 
> - Add a new flag to specify the type of the inner packet.

These need to be moved after '---'; otherwise they would be merged
into the final git log.

>
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Zhiyong Cheng 
> Signed-off-by: Li Wang 
> ---
>  include/uapi/linux/bpf.h   |  5 +++++
>  net/core/filter.c  | 11 ++++++++++-
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  3 files changed, 20 insertions(+), 1 deletion(-)

As a good practice, please add a test case for this in
tools/testing/selftests/bpf/progs/test_tc_tunnel.c.

Thanks.


Re: [PATCH] net: bridge: Fix jump_label config

2021-02-25 Thread Cong Wang
On Wed, Feb 24, 2021 at 8:03 AM Kefeng Wang  wrote:
>
> HAVE_JUMP_LABEL was removed by commit e9666d10a567 ("jump_label: move
> 'asm goto' support test to Kconfig"); use CONFIG_JUMP_LABEL instead
> of HAVE_JUMP_LABEL.
>
> Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge 
> input path")
> Signed-off-by: Kefeng Wang 

Hmm, why do we have to use a macro here? static_key_false() is defined
in both cases, CONFIG_JUMP_LABEL=y or CONFIG_JUMP_LABEL=n.

Thanks.


Re: general protection fault in xfrm_user_rcv_msg_compat

2021-02-23 Thread Cong Wang
On Tue, Feb 23, 2021 at 6:55 AM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11a6fccad0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7a875029a795d230
> dashboard link: https://syzkaller.appspot.com/bug?extid=5078fc2d7cf37d71de1c
> userspace arch: i386
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=167c1832d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10214f12d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+5078fc2d7cf37d71d...@syzkaller.appspotmail.com
>
> general protection fault, probably for non-canonical address 
> 0xe51af2c1f2c7bd20:  [#1] PREEMPT SMP KASAN
> KASAN: maybe wild-memory-access in range 
> [0x28d7b60f963de900-0x28d7b60f963de907]
> CPU: 1 PID: 8357 Comm: syz-executor113 Not tainted 5.11.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:nla_type include/net/netlink.h:1130 [inline]
> RIP: 0010:xfrm_xlate32_attr net/xfrm/xfrm_compat.c:404 [inline]
> RIP: 0010:xfrm_xlate32 net/xfrm/xfrm_compat.c:526 [inline]
> RIP: 0010:xfrm_user_rcv_msg_compat+0x5e5/0x1070 net/xfrm/xfrm_compat.c:571

Looks like we have to initialize the pointer array to NULLs.

diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index d8e8a11ca845..56fb32f90799 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -537,7 +537,7 @@ static struct nlmsghdr
*xfrm_user_rcv_msg_compat(const struct nlmsghdr *h32,
 {
/* netlink_rcv_skb() checks if a message has full (struct nlmsghdr) */
u16 type = h32->nlmsg_type - XFRM_MSG_BASE;
-   struct nlattr *attrs[XFRMA_MAX+1];
+   struct nlattr *attrs[XFRMA_MAX+1] = {0};
struct nlmsghdr *h64;
size_t len;
int err;
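To illustrate why the `= {0}` matters (userspace sketch; XFRMA_MAX is stubbed to an arbitrary value here), the initializer zeroes every slot of the array, so attributes the parser never filled in stay NULL instead of holding stack garbage:

```c
#include <assert.h>
#include <stddef.h>

#define XFRMA_MAX 31	/* stand-in; the real value lives in the uapi headers */

/* Returns 1 iff every slot in arr[0..n-1] is NULL. */
static int all_null(void *const *arr, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (arr[i])
			return 0;
	return 1;
}

static int demo(void)
{
	void *attrs[XFRMA_MAX + 1] = {0};	/* the fixed form */

	return all_null(attrs, XFRMA_MAX + 1);
}
```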


Re: [PATCH 1/8] af_unix: take address assignment/hash insertion into a new helper

2021-02-20 Thread Cong Wang
On Sat, Feb 20, 2021 at 11:32 AM Al Viro  wrote:
>
> On Sat, Feb 20, 2021 at 11:12:33AM -0800, Cong Wang wrote:
> > On Thu, Feb 18, 2021 at 8:22 PM Al Viro  wrote:
> > >
> > > Duplicated logics in all bind variants (autobind, bind-to-path,
> > > bind-to-abstract) gets taken into a common helper.
> > >
> > > Signed-off-by: Al Viro 
> > > ---
> > >  net/unix/af_unix.c | 30 +++---
> > >  1 file changed, 15 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > > index 41c3303c3357..179b4fe837e6 100644
> > > --- a/net/unix/af_unix.c
> > > +++ b/net/unix/af_unix.c
> > > @@ -262,6 +262,16 @@ static void __unix_insert_socket(struct hlist_head 
> > > *list, struct sock *sk)
> > > sk_add_node(sk, list);
> > >  }
> > >
> > > +static void __unix_set_addr(struct sock *sk, struct unix_address *addr,
> > > +   unsigned hash)
> > > +   __releases(&unix_table_lock)
> > > +{
> > > +   __unix_remove_socket(sk);
> > > +   smp_store_release(&unix_sk(sk)->addr, addr);
> > > +   __unix_insert_socket(&unix_socket_table[hash], sk);
> > > +   spin_unlock(&unix_table_lock);
> >
> > Please take the unlock out, it is clearly an anti-pattern.
>
> Why?  "Insert into locked and unlock" is fairly common...

Because the helper unlocks a lock it did not take itself. Just compare:

lock();
__unix_set_addr();
unlock();

to:

lock();
__unix_set_addr();

Clearly the former is more readable and less error-prone. Even
if you really want to do the unlock inside the helper, pick a name
which says so explicitly, for example, __unix_set_addr_unlock().
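To make the point concrete, here is a toy userspace sketch (the names and the lock itself are made up, purely to illustrate the calling convention) contrasting the two shapes:

```c
#include <assert.h>

/* Toy lock, just to make the pairing visible. */
static int table_lock;	/* 0 = unlocked, 1 = locked */
static int addr;

static void lock(void)   { assert(table_lock == 0); table_lock = 1; }
static void unlock(void) { assert(table_lock == 1); table_lock = 0; }

/* Balanced: the caller sees lock() and unlock() as a pair. */
static void set_addr(int v)
{
	lock();
	addr = v;
	unlock();
}

/* The reviewed shape: the unlock is buried in the helper.  If a helper
 * must do this, a name like set_addr_unlock() at least warns the caller. */
static void set_addr_unlock(int v)
{
	addr = v;
	unlock();
}
```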

Thanks.


Re: [PATCH 1/8] af_unix: take address assignment/hash insertion into a new helper

2021-02-20 Thread Cong Wang
On Thu, Feb 18, 2021 at 8:22 PM Al Viro  wrote:
>
> Duplicated logics in all bind variants (autobind, bind-to-path,
> bind-to-abstract) gets taken into a common helper.
>
> Signed-off-by: Al Viro 
> ---
>  net/unix/af_unix.c | 30 +++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 41c3303c3357..179b4fe837e6 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -262,6 +262,16 @@ static void __unix_insert_socket(struct hlist_head 
> *list, struct sock *sk)
> sk_add_node(sk, list);
>  }
>
> +static void __unix_set_addr(struct sock *sk, struct unix_address *addr,
> +   unsigned hash)
> +   __releases(&unix_table_lock)
> +{
> +   __unix_remove_socket(sk);
> +   smp_store_release(&unix_sk(sk)->addr, addr);
> +   __unix_insert_socket(&unix_socket_table[hash], sk);
> +   spin_unlock(&unix_table_lock);

Please take the unlock out, it is clearly an anti-pattern.

And please Cc netdev for networking changes.

Thanks.


Re: [PATCH net] net: sched: fix police ext initialization

2021-02-16 Thread Cong Wang
On Tue, Feb 16, 2021 at 8:22 AM Vlad Buslov  wrote:
>
> When police action is created by cls API tcf_exts_validate() first
> conditional that calls tcf_action_init_1() directly, the action idr is not
> updated according to latest changes in action API that require caller to
> commit newly created action to idr with tcf_idr_insert_many(). This results
> such action not being accessible through act API and causes crash reported
> by syzbot:

Good catch!

This certainly makes sense to me, and I fed it to syzbot too; it is happy
with this patch, so:

Reported-and-tested-by: syzbot+151e3e714d34ae4ce...@syzkaller.appspotmail.com
Reviewed-by: Cong Wang 

Thanks.


Re: KASAN: null-ptr-deref Read in tcf_idrinfo_destroy

2021-02-15 Thread Cong Wang
On Wed, Feb 10, 2021 at 9:53 PM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:291009f6 Merge tag 'pm-5.11-rc8' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=14470d18d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a53fd47f16f22f8c
> dashboard link: https://syzkaller.appspot.com/bug?extid=151e3e714d34ae4ce7e8
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f45814d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15f4aff8d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+151e3e714d34ae4ce...@syzkaller.appspotmail.com
>
> ==
> BUG: KASAN: null-ptr-deref in instrument_atomic_read 
> include/linux/instrumented.h:71 [inline]
> BUG: KASAN: null-ptr-deref in atomic_read 
> include/asm-generic/atomic-instrumented.h:27 [inline]
> BUG: KASAN: null-ptr-deref in __tcf_idr_release net/sched/act_api.c:178 
> [inline]
> BUG: KASAN: null-ptr-deref in tcf_idrinfo_destroy+0x129/0x1d0 
> net/sched/act_api.c:598
> Read of size 4 at addr 0010 by task kworker/u4:5/204
>
> CPU: 0 PID: 204 Comm: kworker/u4:5 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  __kasan_report mm/kasan/report.c:400 [inline]
>  kasan_report.cold+0x5f/0xd5 mm/kasan/report.c:413
>  check_memory_region_inline mm/kasan/generic.c:179 [inline]
>  check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
>  instrument_atomic_read include/linux/instrumented.h:71 [inline]
>  atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
>  __tcf_idr_release net/sched/act_api.c:178 [inline]
>  tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598
>  tc_action_net_exit include/net/act_api.h:151 [inline]
>  police_exit_net+0x168/0x360 net/sched/act_police.c:390

This is really strange. It seems we still left some -EBUSY placeholders
in the idr; however, we actually call tcf_action_destroy() on the error
path to clean up everything, including the -EBUSY ones.

What do you think, Vlad?

Thanks.


Re: BUG: Incorrect MTU on GRE device if remote is unspecified

2021-01-27 Thread Cong Wang
On Wed, Jan 27, 2021 at 4:56 PM Jakub Kicinski  wrote:
>
> On Mon, 25 Jan 2021 22:10:10 +0200 Slava Bacherikov wrote:
> > Hi, I'd like to report a regression. Currently, if you create GRE
> > interface on the latest stable or LTS kernel (5.4 branch) with
> > unspecified remote destination it's MTU will be adjusted for header size
> > twice. For example:
> >
> > $ ip link add name test type gre local 127.0.0.32
> > $ ip link show test | grep mtu
> > 27: test@NONE:  mtu 1452 qdisc noop state DOWN mode DEFAULT group
> > default qlen 1000
> >
> > or with FOU
> >
> > $ ip link add name test2   type gre local 127.0.0.32 encap fou
> > encap-sport auto encap-dport 
> > $ ip link show test2 | grep mtu
> > 28: test2@NONE:  mtu 1436 qdisc noop state DOWN mode DEFAULT
> > group default qlen 1000
> >
> > The same happens with GUE too (MTU is 1428 instead of 1464).
> > As you can see that MTU in first case is 1452 (1500 - 24 - 24) and with
> > FOU it's 1436 (1500 - 32 - 32), GUE 1428 (1500 - 36 - 36). If remote
> > address is specified MTU is correct.
> >
> > This regression caused by fdafed459998e2be0e877e6189b24cb7a0183224 commit.
>
> Cong is this one on your radar?

Yeah, I guess ipgre_link_update() somehow gets called twice,
but I will need to look into it.
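The reported numbers are consistent with the header overhead being subtracted twice; a quick sketch of the arithmetic using the sizes from the report:

```c
#include <assert.h>

#define ETH_DATA_LEN 1500

/* Header overheads quoted in the report: GRE 24, GRE+FOU 32, GRE+GUE 36. */
static int adjusted_mtu(int hdr_len, int times)
{
	return ETH_DATA_LEN - times * hdr_len;
}
```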

Thanks.


Re: BPF: unbounded bpf_map_free_deferred problem

2021-01-22 Thread Cong Wang
On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa
 wrote:
>
> Hello, BPF developers.
>
> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
> unbounded bpf_map_free_deferred work. Use of WQ_MEM_RECLAIM | WQ_HIGHPRI | 
> WQ_UNBOUND
> workqueue did not solve this problem. Is it possible that a refcount leak 
> somewhere
> preventing bpf_map_free_deferred from completing? Please see
> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+j0ek...@mail.gmail.com
>  .
>

Which map does the reproducer create? And where exactly do
those work items block?

Different map->ops->map_free() implementations wait for different reasons;
for example, htab_map_free() waits for in-flight htab_elem_free_rcu()
callbacks. I can't immediately see how they could wait for each other,
if that is what you meant above.

Thanks.


Re: KMSAN: kernel-infoleak in move_addr_to_user (4)

2021-01-11 Thread Cong Wang
On Mon, Jan 11, 2021 at 11:33 AM Jakub Kicinski  wrote:
>
> Looks like a AF_CAN socket:
>
> r0 = socket(0x1d, 0x2, 0x6)
> getsockname$packet(r0, &(0x7f000100)={0x11, 0x0, 0x0, 0x1, 0x0, 0x6, 
> @broadcast}, &(0x7f00)=0x14)
>

Right, it seems we need a memset(0) in isotp_getname().
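The usual pattern in a getname() handler is to zero the whole address struct before filling the fields, so padding and unused members never carry stale kernel stack bytes out to userspace. A userspace sketch of the idea (the struct here is illustrative, not the real struct sockaddr_can):

```c
#include <assert.h>
#include <string.h>

/* Illustrative address struct with unused space that would otherwise
 * leak whatever happened to be on the stack. */
struct demo_addr {
	unsigned short family;
	int ifindex;
	unsigned char pad[8];
};

static void fill_addr(struct demo_addr *a)
{
	memset(a, 0, sizeof(*a));	/* the missing memset(0) */
	a->family = 29;			/* AF_CAN */
	a->ifindex = 1;
}
```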

Thanks.


Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-05 Thread Cong Wang
On Mon, Jan 4, 2021 at 7:29 AM Alan Maguire  wrote:
>
> BPF Type Format (BTF) provides a description of kernel data structures
> and of the types kernel functions utilize as arguments and return values.
>
> A helper was recently added - bpf_snprintf_btf() - that uses that
> description to create a string representation of the data provided,
> using the BTF id of its type.  For example to create a string
> representation of a "struct sk_buff", the pointer to the skb
> is provided along with the type id of "struct sk_buff".
>
> Here that functionality is utilized to support tracing kernel
> function entry and return using k[ret]probes.  The "struct pt_regs"
> context can be used to derive arguments and return values, and
> when the user supplies a function name we
>
> - look it up in /proc/kallsyms to find its address/module
> - look it up in the BTF kernel data to get types of arguments
>   and return value
> - store a map representation of the trace information, keyed by
>   instruction pointer
> - on function entry/return we look up the map to retrieve the BTF
>   ids of the arguments/return values and can call bpf_snprintf_btf()
>   with these argument/return values along with the type ids to store
>   a string representation in the map.
> - this is then sent via perf event to userspace where it can be
>   displayed.
>
> ksnoop can be used to show function signatures; for example:

This is definitely quite useful!

Is it possible to integrate this with bpftrace? That would save people
from learning yet another tool. ;)

Thanks.


Re: inconsistent lock state in ila_xlat_nl_cmd_add_mapping

2020-12-29 Thread Cong Wang
On Tue, Dec 29, 2020 at 5:39 PM Jakub Kicinski  wrote:
>
> On Mon, 13 Aug 2018 21:40:03 -0700 syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:78cbac647e61 Merge branch 'ip-faster-in-order-IP-fragments'
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14df482840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=9100338df26ab75
> > dashboard link: https://syzkaller.appspot.com/bug?extid=eaaf6c4a6a8cb1869d86
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=13069ad240
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+eaaf6c4a6a8cb1869...@syzkaller.appspotmail.com
>
> #syz invalid
>
> Hard to track down what fixed this, but the lockdep splat is mixing up
> locks from two different hashtables, so there was never a real issue
> here.

This one is probably fixed by:

commit ff93bca769925a2d8fd7f910cdf543d992e17f07
Author: Cong Wang 
Date:   Tue Aug 14 15:21:31 2018 -0700

ila: make lockdep happy again

given the timing of the last reproduction...

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Wed, Nov 4, 2020 at 10:04 PM Cong Wang  wrote:
>
> On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> > >> From my understanding, we can do anything about the old qdisc (including
> > >> destorying the old qdisc) after some_qdisc_is_busy() return false.
> > >
> > > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
> >
> > If lock is taken when doing reset, it does not matter if the reset is
> > before some_qdisc_is_busy(), right?
>
> Why not? How about the following scenario?
>
> CPU0:   CPU1:
> dev_reset_queue()
> net_tx_action()
>  -> sch_direct_xmit()
>-> dev_requeue_skb()
> some_qdisc_is_busy()
> // waiting for TX action on CPU1
> // now some packets are requeued

Never mind, the skb_bad_txq is also cleared by dev_reset_queue().
TX action after resetting should get NULL.

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> >> From my understanding, we can do anything about the old qdisc (including
> >> destorying the old qdisc) after some_qdisc_is_busy() return false.
> >
> > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
>
> If lock is taken when doing reset, it does not matter if the reset is
> before some_qdisc_is_busy(), right?

Why not? How about the following scenario?

CPU0:   CPU1:
dev_reset_queue()
net_tx_action()
 -> sch_direct_xmit()
   -> dev_requeue_skb()
some_qdisc_is_busy()
// waiting for TX action on CPU1
// now some packets are requeued


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-02 Thread Cong Wang
On Fri, Oct 30, 2020 at 12:38 AM Yunsheng Lin  wrote:
>
> On 2020/10/30 3:05, Cong Wang wrote:
> >
> > I do not see how and why it should. synchronize_net() is merely an optimized
> > version of synchronize_rcu(), it should wait for RCU readers, softirqs are 
> > not
> > necessarily RCU readers, net_tx_action() does not take RCU read lock either.
>
> Ok, make sense.
>
> Taking RCU read lock in net_tx_action() does not seems to solve the problem,
> what about the time window between __netif_reschedule() and net_tx_action()?
>
> It seems we need to re-dereference the qdisc whenever RCU read lock is 
> released
> and qdisc is still in sd->output_queue or wait for the sd->output_queue to 
> drain?

I am not suggesting you take the RCU read lock. We already wait for the
TX action with a sleep loop. To me, the only thing missing is moving the
reset after that wait.


> >>>> If we do any additional reset that is not related to qdisc in 
> >>>> dev_reset_queue(), we
> >>>> can move it after some_qdisc_is_busy() checking.
> >>>
> >>> I am not suggesting to do an additional reset, I am suggesting to move
> >>> your reset after the busy waiting.
> >>
> >> There maybe a deadlock here if we reset the qdisc after the 
> >> some_qdisc_is_busy() checking,
> >> because some_qdisc_is_busy() may require the qdisc reset to clear the skb, 
> >> so that
> >
> > some_qdisc_is_busy() checks the status of qdisc, not the skb queue.
>
> Is there any reason why we do not check the skb queue in the dqisc?
> It seems there may be skb left when netdev is deactivated, maybe at least warn
> about that when there is still skb left when netdev is deactivated?
> Is that why we call qdisc_reset() to clear the leftover skb in 
> qdisc_destroy()?
>
> >
> >
> >> some_qdisc_is_busy() can return false. I am not sure this is really a 
> >> problem, but
> >> sch_direct_xmit() may requeue the skb when dev_hard_start_xmit return 
> >> TX_BUSY.
> >
> > Sounds like another reason we should move the reset as late as possible?
>
> Why?

You said "sch_direct_xmit() may requeue the skb", and I agree. I assume you
mean net_tx_action() calls sch_direct_xmit(), which does the requeue and then
races with the reset. No?


>
> There current netdev down order is mainly below:
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> dev_reset_queue()
>
> some_qdisc_is_busy()
>
>
> You suggest to change it to below order, right?
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> some_qdisc_is_busy()
>
> dev_reset_queue()

Yes.

>
>
> What is the semantics of some_qdisc_is_busy()?

Waiting for any in-flight TX action.

> From my understanding, we can do anything about the old qdisc (including
> destorying the old qdisc) after some_qdisc_is_busy() return false.

But the current code does the reset _before_ some_qdisc_is_busy(). ;)

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-29 Thread Cong Wang
On Wed, Oct 28, 2020 at 7:54 PM Yunsheng Lin  wrote:
>
> On 2020/9/18 3:26, Cong Wang wrote:
> > On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/11 4:07, Cong Wang wrote:
> >>> On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>
> >>>> Reused the existing synchronize_net() in dev_deactivate_many() to
> >>>> make sure skb with larger queue_mapping enqueued to old qdisc(which
> >>>> is saved in dev_queue->qdisc_sleeping) will always be reset when
> >>>> dev_reset_queue() is called.
> >>>>
> >>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >>>> Signed-off-by: Yunsheng Lin 
> >>>> ---
> >>>> ChangeLog V2:
> >>>> Reuse existing synchronize_net().
> >>>> ---
> >>>>  net/sched/sch_generic.c | 48 
> >>>> +---
> >>>>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >>>> index 265a61d..54c4172 100644
> >>>> --- a/net/sched/sch_generic.c
> >>>> +++ b/net/sched/sch_generic.c
> >>>> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>>>
> >>>>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>>>  {
> >>>> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> -
> >>>> if (qdisc->flags & TCQ_F_BUILTIN)
> >>>> return;
> >>>> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> >>>> -   return;
> >>>> -
> >>>> -   if (nolock)
> >>>> -   spin_lock_bh(&qdisc->seqlock);
> >>>> -   spin_lock_bh(qdisc_lock(qdisc));
> >>>>
> >>>> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> >>>> -
> >>>> -   qdisc_reset(qdisc);
> >>>> -
> >>>> -   spin_unlock_bh(qdisc_lock(qdisc));
> >>>> -   if (nolock)
> >>>> -   spin_unlock_bh(&qdisc->seqlock);
> >>>>  }
> >>>>
> >>>>  static void dev_deactivate_queue(struct net_device *dev,
> >>>> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct 
> >>>> net_device *dev,
> >>>> }
> >>>>  }
> >>>>
> >>>> +static void dev_reset_queue(struct net_device *dev,
> >>>> +   struct netdev_queue *dev_queue,
> >>>> +   void *_unused)
> >>>> +{
> >>>> +   struct Qdisc *qdisc;
> >>>> +   bool nolock;
> >>>> +
> >>>> +   qdisc = dev_queue->qdisc_sleeping;
> >>>> +   if (!qdisc)
> >>>> +   return;
> >>>> +
> >>>> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> +
> >>>> +   if (nolock)
> >>>> +   spin_lock_bh(&qdisc->seqlock);
> >>>> +   spin_lock_bh(qdisc_lock(qdisc));
> >>>
> >>>
> >>> I think you do not need this lock for lockless one.
> >>
> >> It seems so.
> >> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> >> qdisc?
> >
> > Yeah, but not sure if we still want this lockless qdisc any more,
> > it brings more troubles than gains.
> >
> >>
> >>
> >>>
> >>>> +
> >>>> +   qdisc_reset(qdisc);
> >>>> +
> >>>> +   spin_unloc

Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 8:37 AM Pai, Vishwanath  wrote:
> Hi,
>
> We noticed some problems when testing the latest 5.4 LTS kernel and traced it
> back to this commit using git bisect. When running our tests the machine stops
> responding to all traffic and the only way to recover is a reboot. I do not 
> see
> a stack trace on the console.

Do you mean the machine is still running fine and just the network is down?

If so, can you dump your tc config with stats when the problem is happening?
(You can use `tc -s -d qd show ...`.)

>
> This can be reproduced using the packetdrill test below, it should be run a
> few times or in a loop. You should hit this issue within a few tries but
> sometimes might take up to 15-20 tries.
...
> I can reproduce the issue easily on v5.4.68, and after reverting this commit 
> it
> does not happen anymore.

This is odd. The patch in this thread touches the netdev reset path; if
packetdrill is the only thing you use to trigger the bug (that is, the netdev
is always active), I cannot connect the two.

Thanks.


Re: [PATCH] net: cls_api: remove unneeded local variable in tc_dump_chain()

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 6:59 AM Tom Rix  wrote:
>
>
> On 10/28/20 4:35 AM, Lukas Bulwahn wrote:
> > @@ -2971,13 +2963,11 @@ static int tc_dump_chain(struct sk_buff *skb, 
> > struct netlink_callback *cb)
> >   if (!dev)
> >   return skb->len;
> >
> > - parent = tcm->tcm_parent;
> > - if (!parent) {
> > + if (!tcm->tcm_parent)
> >   q = dev->qdisc;
> > - parent = q->handle;
>
> This looks like a an unused error handler.
>
> and the later call to
>
> if (TC_H_MIN(tcm->tcm_parent)
>
> maybe should be
>
> if (TC_H_MIN(parent))

When tcm->tcm_parent is 0, TC_H_MIN(tcm->tcm_parent) is also 0,
so we will not hit that if branch.

So, I think Lukas' patch is correct.
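This is easy to check against the handle macros (copied here from the uapi pkt_sched.h): TC_H_MIN() only masks the low 16 bits, so a zero tcm_parent yields a zero minor and the branch is never taken.

```c
#include <assert.h>

/* From include/uapi/linux/pkt_sched.h */
#define TC_H_MAJ_MASK (0xFFFF0000U)
#define TC_H_MIN_MASK (0x0000FFFFU)
#define TC_H_MAJ(h) ((h) & TC_H_MAJ_MASK)
#define TC_H_MIN(h) ((h) & TC_H_MIN_MASK)
```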

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Mon, Oct 12, 2020 at 2:53 AM Muchun Song  wrote:
> We are not complaining about TCP using too much memory, but how do
> we know that TCP uses a lot of memory. When I firstly face this problem,
> I do not know who uses the 25GB memory and it is not shown in the 
> /proc/meminfo.
> If we can know the amount memory of the socket buffer via /proc/meminfo, we
> may not need to spend a lot of time troubleshooting this problem. Not everyone
> knows that a lot of memory may be used here. But I believe many people
> should know /proc/meminfo to confirm memory users.

Well, I'd bet networking people know `ss -m` better than /proc/meminfo,
generally speaking.

The practice here is that if you want some networking-specific counters,
add it to where networking people know better, that is, `ss -m` or /proc/net/...

Or maybe the problem you described is not specific to networking at all,
there must be some other places where pages are allocated but not charged.
If so, adding a general mm counter in /proc/meminfo makes sense, but
it won't be specific to networking.

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Sun, Oct 11, 2020 at 9:22 PM Muchun Song  wrote:
>
> On Mon, Oct 12, 2020 at 2:39 AM Cong Wang  wrote:
> >
> > On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  
> > wrote:
> > >
> > > The amount of memory allocated to sockets buffer can become significant.
> > > However, we do not display the amount of memory consumed by sockets
> > > buffer. In this case, knowing where the memory is consumed by the kernel
> >
> > We do it via `ss -m`. Is it not sufficient? And if not, why not adding it 
> > there
> > rather than /proc/meminfo?
>
> If the system has little free memory, we can know where the memory is via
> /proc/meminfo. If a lot of memory is consumed by socket buffer, we cannot
> know it when the Sock is not shown in the /proc/meminfo. If the unaware user
> can't think of the socket buffer, naturally they will not `ss -m`. The
> end result

Interesting, we already have a few counters related to socket buffers,
are you saying these are not accounted in /proc/meminfo either?
If yes, why are page frags so special here? If not, they are more
important than page frags, so you probably want to deal with them
first.


> is that we still don’t know where the memory is consumed. And we add the
> Sock to the /proc/meminfo just like the memcg does('sock' item in the cgroup
> v2 memory.stat). So I think that adding to /proc/meminfo is sufficient.

It looks like actually the socket page frag is already accounted,
for example, the tcp_sendmsg_locked():

copy = min_t(int, copy, pfrag->size - pfrag->offset);

if (!sk_wmem_schedule(sk, copy))
goto wait_for_memory;


>
> >
> > >  static inline void __skb_frag_unref(skb_frag_t *frag)
> > >  {
> > > -   put_page(skb_frag_page(frag));
> > > +   struct page *page = skb_frag_page(frag);
> > > +
> > > +   if (put_page_testzero(page)) {
> > > +   dec_sock_node_page_state(page);
> > > +   __put_page(page);
> > > +   }
> > >  }
> >
> > You mix socket page frag with skb frag at least, not sure this is exactly
> > what you want, because clearly skb page frags are frequently used
> > by network drivers rather than sockets.
> >
> > Also, which one matches this dec_sock_node_page_state()? Clearly
> > not skb_fill_page_desc() or __skb_frag_ref().
>
> Yeah, we call inc_sock_node_page_state() in the skb_page_frag_refill().

How is skb_page_frag_refill() possibly paired with __skb_frag_unref()?

> So if someone gets the page returned by skb_page_frag_refill(), it must
> put the page via __skb_frag_unref()/skb_frag_unref(). We use PG_private
> to indicate that we need to dec the node page state when the refcount of
> page reaches zero.

skb_page_frag_refill() is called on frags not within an skb, for instance,
sk_page_frag_refill() uses it for a per-socket or per-process page frag.
But, __skb_frag_unref() is specifically used for skb frags, which are
supposed to be filled by skb_fill_page_desc() (page is allocated by driver).

They are different things and you are mixing them up, which looks clearly
wrong, or at least misleading.

Thanks.


Re: BUG: unable to handle kernel paging request in tcf_action_dump_terse

2020-10-11 Thread Cong Wang
#syz fix: net_sched: check error pointer in tcf_dump_walker()


Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-11 Thread Cong Wang
On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  wrote:
>
> The amount of memory allocated to sockets buffer can become significant.
> However, we do not display the amount of memory consumed by sockets
> buffer. In this case, knowing where the memory is consumed by the kernel

We do it via `ss -m`. Is it not sufficient? And if not, why not adding it there
rather than /proc/meminfo?

>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> -   put_page(skb_frag_page(frag));
> +   struct page *page = skb_frag_page(frag);
> +
> +   if (put_page_testzero(page)) {
> +   dec_sock_node_page_state(page);
> +   __put_page(page);
> +   }
>  }

You are mixing socket page frags with skb frags, at least; I am not sure
this is exactly what you want, because clearly skb page frags are
frequently used by network drivers rather than by sockets.

Also, which one matches this dec_sock_node_page_state()? Clearly
not skb_fill_page_desc() or __skb_frag_ref().

Thanks.




Re: INFO: task hung in tcf_action_init_1

2020-09-30 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: general protection fault in tcf_generic_walker

2020-09-30 Thread Cong Wang
On Wed, Sep 30, 2020 at 10:42 AM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: 2b3e981a Merge branch 'mptcp-Fix-for-32-bit-DATA_FIN'
> git tree:   net
> console output: https://syzkaller.appspot.com/x/log.txt?x=1653724790
> kernel config:  https://syzkaller.appspot.com/x/.config?x=99a7c78965c75e07
> dashboard link: https://syzkaller.appspot.com/bug?extid=b47bc4f247856fb4d9e1
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1412a5a790
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b47bc4f247856fb4d...@syzkaller.appspotmail.com
>
> general protection fault, probably for non-canonical address 
> 0xdc04:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0020-0x0027]
> CPU: 0 PID: 8855 Comm: syz-executor.1 Not tainted 5.9.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:tcf_dump_walker net/sched/act_api.c:240 [inline]
> RIP: 0010:tcf_generic_walker+0x367/0xba0 net/sched/act_api.c:343
> Code: 24 31 ff 48 89 de e8 c8 55 eb fa 48 85 db 74 3f e8 3e 59 eb fa 48 8d 7d 
> 30 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 26 
> 07 00 00 48 8b 5d 30 31 ff 48 2b 1c 24 48 89
> RSP: 0018:c9000b6ff3a8 EFLAGS: 00010202
> RAX: 0004 RBX: c000aae4 RCX: dc00
> RDX: 8880a82aa140 RSI: 868ae502 RDI: 0020
> RBP: fff0 R08:  R09: 8880a8c41e07
> R10:  R11:  R12: 88809f226340
> R13:  R14:  R15: 
> FS:  7f156f7fa700() GS:8880ae40() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55d25128b348 CR3: a7d3d000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  tc_dump_action+0x6d5/0xe60 net/sched/act_api.c:1609


Probably just need another IS_ERR() check on the dump path.
I will take a second look.

Thanks.


Re: KASAN: use-after-free Read in tcf_action_init

2020-09-28 Thread Cong Wang
#syz fix: net_sched: commit action insertions together


Re: [PATCH 0/2] iommu/iova: Solve longterm IOVA issue

2020-09-25 Thread Cong Wang
On Fri, Sep 25, 2020 at 2:56 AM John Garry  wrote:
>
> This series contains a patch to solve the longterm IOVA issue which
> leizhen originally tried to address at [0].
>
> I also included the small optimisation from Cong Wang, which never seems
> to have been accepted [1]. There was some debate of the other patches
> in that series, but this one is quite straightforward.
>
> @Cong Wang, Please resend your series if you prefer I didn't upstream your
> patch.

Thanks for letting me know. But I still don't think it is worth any effort,
given it is hard to work with Robin. Users who care about latency here
should just disable the IOMMU; it is really hard to optimize IOVA cache
performance to catch up with the !IOMMU case.

So, please feel free to carry it on your own, I have no problem with it.

Thanks.




Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-17 Thread Cong Wang
On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:07, Cong Wang wrote:
> > On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >>
> >> Reused the existing synchronize_net() in dev_deactivate_many() to
> >> make sure skb with larger queue_mapping enqueued to old qdisc(which
> >> is saved in dev_queue->qdisc_sleeping) will always be reset when
> >> dev_reset_queue() is called.
> >>
> >> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >> Signed-off-by: Yunsheng Lin 
> >> ---
> >> ChangeLog V2:
> >> Reuse existing synchronize_net().
> >> ---
> >>  net/sched/sch_generic.c | 48 
> >> +---
> >>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >> index 265a61d..54c4172 100644
> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>
> >>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>  {
> >> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> -
> >> if (qdisc->flags & TCQ_F_BUILTIN)
> >> return;
> >> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> >> -   return;
> >> -
> >> -   if (nolock)
> >> -   spin_lock_bh(&qdisc->seqlock);
> >> -   spin_lock_bh(qdisc_lock(qdisc));
> >>
> >> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> >> -
> >> -   qdisc_reset(qdisc);
> >> -
> >> -   spin_unlock_bh(qdisc_lock(qdisc));
> >> -   if (nolock)
> >> -   spin_unlock_bh(&qdisc->seqlock);
> >>  }
> >>
> >>  static void dev_deactivate_queue(struct net_device *dev,
> >> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> >> *dev,
> >> }
> >>  }
> >>
> >> +static void dev_reset_queue(struct net_device *dev,
> >> +   struct netdev_queue *dev_queue,
> >> +   void *_unused)
> >> +{
> >> +   struct Qdisc *qdisc;
> >> +   bool nolock;
> >> +
> >> +   qdisc = dev_queue->qdisc_sleeping;
> >> +   if (!qdisc)
> >> +   return;
> >> +
> >> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> +
> >> +   if (nolock)
> >> +   spin_lock_bh(&qdisc->seqlock);
> >> +   spin_lock_bh(qdisc_lock(qdisc));
> >
> >
> > I think you do not need this lock for lockless one.
>
> It seems so.
> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> qdisc?

Yeah, but I am not sure we still want this lockless qdisc anymore;
it brings more troubles than gains.

>
>
> >
> >> +
> >> +   qdisc_reset(qdisc);
> >> +
> >> +   spin_unlock_bh(qdisc_lock(qdisc));
> >> +   if (nolock)
> >> +   spin_unlock_bh(&qdisc->seqlock);
> >> +}
> >> +
> >>  static bool some_qdisc_is_busy(struct net_device *dev)
> >>  {
> >> unsigned int i;
> >> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> >> dev_watchdog_down(dev);
> >> }
> >>
> >> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> >> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> >> +* outstanding qdisc enqueuing calls.
> >>  * This is avoided if all devices are in dismantle phase :
> >>  * Caller will call synchronize_net() for us
> >>  */
> >> synchronize_net();
> >>
> >> +

Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-17 Thread Cong Wang
On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:19, Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> >> I also tried Cong's patch (shown below on my tree) and it could avoid
> >> the issue (stressing for 30 minutus for three times and not jitter
> >> observed).
> >
> > Thanks for verifying it!
> >
> >>
> >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> >> @@ -127,8 +127,7 @@
> >>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >>  {
> >>   if (qdisc->flags & TCQ_F_NOLOCK) {
> >> - if (!spin_trylock(&qdisc->seqlock))
> >> - return false;
> >> + spin_lock(&qdisc->seqlock);
> >>   } else if (qdisc_is_running(qdisc)) {
> >>   return false;
> >>   }
> >>
> >> I do not actually know what you are discussing above. It seems to me
> >> that Cong's patch is similar to disabling the lockless feature.
> >
> > From performance's perspective, yeah. Did you see any performance
> > downgrade with my patch applied? It would be great if you can compare
> > it with removing NOLOCK. And if the performance is as bad as no
> > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > for now.
>
> It seems the lockless qdisc may have below concurrent problem:
>   cpu0:   cpu1:
> q->enqueue  .
> qdisc_run_begin(q)  .
> __qdisc_run(q) ->qdisc_restart() -> dequeue_skb()   .
>  -> sch_direct_xmit()   .
> .
> q->enqueue
>  
> qdisc_run_begin(q)
> qdisc_run_end(q)
>
>
> cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).

This is the same problem that my patch fixes; I do not know
why you are suggesting another patch despite quoting mine.
Please read the whole thread if you want to participate.

Thanks.


Re: [PATCH] [PATCH] net/sched : cbs : fix calculation error of idleslope credits

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 11:05 PM Xiaoyong Yan  wrote:
>
> in the function cbs_dequeue_soft, when q->credits < 0, [now - q->last]
> should be accounted for sendslope, not idleslope.
>
> so the solution as follows: when q->credits is less than 0, directly
> calculate delay time, activate hrtimer and when hrtimer fires,
> calculate idleslope credits and update it to q->credits.
>
> because of the lack of self-defined qdisc_watchdog_func, so in the
> generic sch_api, add qdisc_watchdog_init_func and
> qdisc_watchdog_init_clockid_func, so sch_cbs can use it to define its
> own process.

You do not have to define them as a generic API; you can just use the
hrtimer API directly in sch_cbs, as it is the only user within net/sched/.
Hopefully this would reduce the size of your patch too.
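
To sketch what that could look like, here is an illustrative fragment only, not the actual patch: sch_cbs embedding its own hrtimer instead of extending the generic qdisc_watchdog helpers. The field placement and the names cbs_timer_func()/cbs_timer_setup() are assumptions:

```c
/* Illustrative sketch -- assumed names, not the real sch_cbs layout. */
struct cbs_sched_data {
	/* ... existing sch_cbs fields ... */
	struct Qdisc *qdisc;
	struct hrtimer timer;	/* private timer, no generic wrapper needed */
};

static enum hrtimer_restart cbs_timer_func(struct hrtimer *timer)
{
	struct cbs_sched_data *q = container_of(timer, struct cbs_sched_data,
						timer);

	/* refill idleslope credits here, then kick the qdisc */
	rcu_read_lock();
	__netif_schedule(qdisc_root(q->qdisc));
	rcu_read_unlock();
	return HRTIMER_NORESTART;
}

static void cbs_timer_setup(struct cbs_sched_data *q)
{
	hrtimer_init(&q->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
	q->timer.function = cbs_timer_func;
}
```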


>
> the patch is changed based on v5.4.42,and the test result as follows:
> the NIC is 100Mb/s ,full duplex.
>
> step1:
> tc qdisc add dev ens33 root cbs idleslope 75 sendslope -25 hicredit 100 
> locredit -100 offload 0
> step2:
> root@ubuntu:/home/yxy/kernel/linux-stable# iperf -c 192.168.1.114 -i 1
> 
> Client connecting to 192.168.1.114, TCP port 5001
> TCP window size:  246 KByte (default)
> 
> [  3] local 192.168.1.120 port 42004 connected with 192.168.1.114 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0- 1.0 sec  9.00 MBytes  75.5 Mbits/sec
> [  3]  1.0- 2.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  2.0- 3.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  3.0- 4.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  4.0- 5.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  5.0- 6.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  6.0- 7.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  7.0- 8.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  8.0- 9.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  9.0-10.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>
> Signed-off-by: Xiaoyong Yan 
> ---
>  include/net/pkt_sched.h |  7 +++
>  net/sched/sch_api.c | 20 ++--
>  net/sched/sch_cbs.c | 41 +
>  3 files changed, 50 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..5fec23b15e1c 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
> #define DEFAULT_TX_QUEUE_LEN1000
>
> @@ -72,6 +73,12 @@ struct qdisc_watchdog {
> struct Qdisc*qdisc;
>  };
>
> +typedef enum hrtimer_restart (*qdisc_watchdog_func_t)(struct hrtimer *timer);
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd, struct 
> Qdisc *qdisc,
> +clockid_t clockid,qdisc_watchdog_func_t func);
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func);
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer);
> +
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid);
>  void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 50794125bf02..fea08d10c811 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -22,7 +22,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>
> @@ -591,7 +590,7 @@ void qdisc_warn_nonwc(const char *txt, struct Qdisc 
> *qdisc)
>  }
>  EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> -static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
>  {
> struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog,
>  timer);
> @@ -602,7 +601,24 @@ static enum hrtimer_restart qdisc_watchdog(struct 
> hrtimer *timer)
>
> return HRTIMER_NORESTART;
>  }
> +EXPORT_SYMBOL(qdisc_watchdog);
>
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,clockid_t clockid,qdisc_watchdog_func_t func)
> +{
> +   hrtimer_init(>timer,clockid,HRTIMER_MODE_ABS_PINNED);
> +   if(!func)
> +   wd->timer.function = qdisc_watchdog;
> +   else
> +   wd->timer.function = func;
> +   wd->qdisc = qdisc;
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_clockid_func);
> +
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func)
> +{
> +   qdisc_watchdog_init_clockid_func(wd,qdisc,CLOCK_MONOTONIC,func);
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_func);
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid)
>  {
> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
> index 

Re: [PATCH net-next] genetlink: Remove unused function genl_err_attr()

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 9:33 AM YueHaibing  wrote:
>
> It is never used, so can remove it.

This is a bit confusing: it was actually used before; see commit
ab0d76f6823cc3a4e2.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> I also tried Cong's patch (shown below on my tree) and it could avoid
> the issue (stressing for 30 minutus for three times and not jitter
> observed).

Thanks for verifying it!

>
> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> @@ -127,8 +127,7 @@
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>   if (qdisc->flags & TCQ_F_NOLOCK) {
> - if (!spin_trylock(&qdisc->seqlock))
> - return false;
> + spin_lock(&qdisc->seqlock);
>   } else if (qdisc_is_running(qdisc)) {
>   return false;
>   }
>
> I do not actually know what you are discussing above. It seems to me
> that Cong's patch is similar to disabling the lockless feature.

From performance's perspective, yeah. Did you see any performance
downgrade with my patch applied? It would be great if you can compare
it with removing NOLOCK. And if the performance is as bad as no
NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
for now.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 10:08 PM John Fastabend  wrote:
> Maybe this would unlock us,
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7df6c9617321..9b09429103f1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
> struct Qdisc *q,
>
> if (q->flags & TCQ_F_NOLOCK) {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -   qdisc_run(q);
> +   __qdisc_run(q);
>
> if (unlikely(to_free))
> kfree_skb_list(to_free);
>
>
> Per other thread we also need the state deactivated check added
> back.

I guess no, because pfifo_dequeue() seems to require q->seqlock,
according to comments in qdisc_run(), so we can not just get rid of
qdisc_run_begin()/qdisc_run_end() here.

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-10 Thread Cong Wang
On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.
>
> Reused the existing synchronize_net() in dev_deactivate_many() to
> make sure skb with larger queue_mapping enqueued to old qdisc(which
> is saved in dev_queue->qdisc_sleeping) will always be reset when
> dev_reset_queue() is called.
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin 
> ---
> ChangeLog V2:
> Reuse existing synchronize_net().
> ---
>  net/sched/sch_generic.c | 48 +---
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 265a61d..54c4172 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
>
>  static void qdisc_deactivate(struct Qdisc *qdisc)
>  {
> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> -
> if (qdisc->flags & TCQ_F_BUILTIN)
> return;
> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> -   return;
> -
> -   if (nolock)
> -   spin_lock_bh(&qdisc->seqlock);
> -   spin_lock_bh(qdisc_lock(qdisc));
>
> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> -
> -   qdisc_reset(qdisc);
> -
> -   spin_unlock_bh(qdisc_lock(qdisc));
> -   if (nolock)
> -   spin_unlock_bh(&qdisc->seqlock);
>  }
>
>  static void dev_deactivate_queue(struct net_device *dev,
> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> *dev,
> }
>  }
>
> +static void dev_reset_queue(struct net_device *dev,
> +   struct netdev_queue *dev_queue,
> +   void *_unused)
> +{
> +   struct Qdisc *qdisc;
> +   bool nolock;
> +
> +   qdisc = dev_queue->qdisc_sleeping;
> +   if (!qdisc)
> +   return;
> +
> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> +
> +   if (nolock)
> +   spin_lock_bh(&qdisc->seqlock);
> +   spin_lock_bh(qdisc_lock(qdisc));


I think you do not need this lock for lockless one.

> +
> +   qdisc_reset(qdisc);
> +
> +   spin_unlock_bh(qdisc_lock(qdisc));
> +   if (nolock)
> +   spin_unlock_bh(&qdisc->seqlock);
> +}
> +
>  static bool some_qdisc_is_busy(struct net_device *dev)
>  {
> unsigned int i;
> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> dev_watchdog_down(dev);
> }
>
> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> +* outstanding qdisc enqueuing calls.
>  * This is avoided if all devices are in dismantle phase :
>  * Caller will call synchronize_net() for us
>  */
> synchronize_net();
>
> +   list_for_each_entry(dev, head, close_list) {
> +   netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
> +
> +   if (dev_ingress_queue(dev))
> +   dev_reset_queue(dev, dev_ingress_queue(dev), NULL);
> +   }
> +
> /* Wait for outstanding qdisc_run calls. */
> list_for_each_entry(dev, head, close_list) {
> while (some_qdisc_is_busy(dev)) {

Do you want to reset before waiting for TX action?

I think it is safer to do it after; at least prior to commit 759ae57f1b
we did it after.

Thanks.


Re: INFO: task hung in tcf_ife_init

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 2:47 AM Eric Dumazet  wrote:
> This commit might be related :
>
> commit 4e407ff5cd67ec76a1deec227b7982dc7f66
> Author: Cong Wang 
> Date:   Sun Aug 19 12:22:12 2018 -0700
>
> act_ife: move tcfa_lock down to where necessary

It does not look like my commit's fault. From my _quick_ understanding
of this problem, we somehow have a "deadlock" situation:

Thread A allocates an IDR entry with a specific index and calls
populate_metalist() to re-acquire the RTNL lock.

Thread B gets the RTNL lock right after thread A releases it, then it waits
for thread A to finally commit the IDR change.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni  wrote:
>
> On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > Can you test the attached one-line fix? I think we are overthinking,
> > probably all
> > we need here is a busy wait.
>
> I think that will solve, but I also think that will kill NOLOCK
> performances due to really increased contention.

Yeah, we somehow end up with more locks (seqlock, skb array lock)
for lockless qdisc. What an irony... ;)

>
> At this point I fear we could consider reverting the NOLOCK stuff.
> I personally would hate doing so, but it looks like NOLOCK benefits are
> outweighed by its issues.

I agree, NOLOCK brings more pains than gains. There are many race
conditions hidden in the generic qdisc layer; another one is enqueue vs.
reset, which is being discussed in another thread.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-02 Thread Cong Wang
Hello, Kehuan

Can you test the attached one-line fix? I think we are overthinking,
probably all
we need here is a busy wait.

Thanks.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c39d60c..fc1bacdb102b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -156,8 +156,7 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
-		if (!spin_trylock(&qdisc->seqlock))
-			return false;
+		spin_lock(&qdisc->seqlock);
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 9:48, Cong Wang wrote:
> > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/3 8:35, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 12:41, Cong Wang wrote:
> >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>>>
> >>>>>>> Can you be more specific here? Which call path requests a smaller
> >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>>>> we already have a synchronize_net() there.
> >>>>>>
> >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>>>> do the correct work, as below:
> >>>>>>
> >>>>>> CPU 0:   CPU1:
> >>>>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>>>> rcu_read_lock_bh();
> >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>>>> .
> >>>>>> .   dev->real_num_tx_queues = txq;
> >>>>>> .   .
> >>>>>> .   .
> >>>>>> .   synchronize_net();
> >>>>>> .   .
> >>>>>> q->enqueue().
> >>>>>> .   .
> >>>>>> rcu_read_unlock_bh().
> >>>>>> qdisc_reset_all_tx_gt
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>
> >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a 
> >>>>>> problem
> >>>>>> too.
> >>>>>>
> >>>>>> The problem we hit is as below:
> >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is 
> >>>>>> called
> >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the 
> >>>>>> function
> >>>>>> netif_set_real_num_tx_queues() does not help here.
> >>>>>
> >>>>> How could qdisc still be running after deactivating the device?
> >>>>
> >>>> qdisc could be running during the device deactivating process.
> >>>>
> >>>> The main process of changing channel number is as below:
> >>>>
> >>>> 1. dev_deactivate()
> >>>> 2. hns3 handware related setup
> >>>> 3. netif_set_real_num_tx_queues()
> >>>> 4. netif_tx_wake_all_queues()
> >>>> 5. dev_activate()
> >>>>
> >>>> During step 1, qdisc could be running while qdisc is resetting, so
> >>>> there could be skb left in the old qdisc(which will be restored back to
&g

Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 8:35, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 12:41, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>
> >>>>> Can you be more specific here? Which call path requests a smaller
> >>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>> we already have a synchronize_net() there.
> >>>>
> >>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>> do the correct work, as below:
> >>>>
> >>>> CPU 0:   CPU1:
> >>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>> rcu_read_lock_bh();
> >>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>> .
> >>>> .   dev->real_num_tx_queues = txq;
> >>>> .   .
> >>>> .   .
> >>>> .   synchronize_net();
> >>>> .   .
> >>>> q->enqueue().
> >>>> .   .
> >>>> rcu_read_unlock_bh().
> >>>> qdisc_reset_all_tx_gt
> >>>>
> >>>>
> >>>
> >>> Right.
> >>>
> >>>
> >>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>> too.
> >>>>
> >>>> The problem we hit is as below:
> >>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>> netif_set_real_num_tx_queues() does not help here.
> >>>
> >>> How could qdisc still be running after deactivating the device?
> >>
> >> qdisc could be running during the device deactivating process.
> >>
> >> The main process of changing channel number is as below:
> >>
> >> 1. dev_deactivate()
> >> 2. hns3 handware related setup
> >> 3. netif_set_real_num_tx_queues()
> >> 4. netif_tx_wake_all_queues()
> >> 5. dev_activate()
> >>
> >> During step 1, qdisc could be running while qdisc is resetting, so
> >> there could be skb left in the old qdisc(which will be restored back to
> >> txq->qdisc during dev_activate()), as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit():  dev_deactivate_many():
> >> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> >> q = rcu_dereference_bh(txq->qdisc); .
> >> netdev_core_pick_tx(dev, skb, sb_dev);  .
> >> .
> >> .   
> >> rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >> .   .
> >> .   .
> >> .   .
> >> .   .
> >> q->enqueue().
> >
> >
> > Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > there would be no packet queued after qdisc_deactivate().
>
> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> otherwise there is still window between setting and testing the deactivated 
> bit.

Can you be more specific here? Why is testing or setting a bit not atomic?

AFAIU, qdisc->seqlock is an optimization to replace
__QDISC_STATE_RUNNING, which has nothing to do with deactivate bit.

Thanks.


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 12:41, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 2:24, Cong Wang wrote:
> >>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>
> >>> Can you be more specific here? Which call path requests a smaller
> >>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>> we already have a synchronize_net() there.
> >>
> >> When the netdevice is in active state, the synchronize_net() seems to
> >> do the correct work, as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >> rcu_read_lock_bh();
> >> netdev_core_pick_tx(dev, skb, sb_dev);
> >> .
> >> .   dev->real_num_tx_queues = txq;
> >> .   .
> >> .   .
> >> .   synchronize_net();
> >> .   .
> >> q->enqueue().
> >> .   .
> >> rcu_read_unlock_bh().
> >> qdisc_reset_all_tx_gt
> >>
> >>
> >
> > Right.
> >
> >
> >> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >> too.
> >>
> >> The problem we hit is as below:
> >> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >> to deactive the netdevice when user requested a smaller queue num, and
> >> txq->qdisc is already changed to noop_qdisc when calling
> >> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >> netif_set_real_num_tx_queues() does not help here.
> >
> > How could qdisc still be running after deactivating the device?
>
> qdisc could be running during the device deactivating process.
>
> The main process of changing channel number is as below:
>
> 1. dev_deactivate()
> 2. hns3 handware related setup
> 3. netif_set_real_num_tx_queues()
> 4. netif_tx_wake_all_queues()
> 5. dev_activate()
>
> During step 1, qdisc could be running while qdisc is resetting, so
> there could be skb left in the old qdisc(which will be restored back to
> txq->qdisc during dev_activate()), as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit():  dev_deactivate_many():
> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> q = rcu_dereference_bh(txq->qdisc); .
> netdev_core_pick_tx(dev, skb, sb_dev);  .
> .
> .                                       rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> .   .
> .   .
> .   .
> .   .
> q->enqueue().


Well, like I said, if the deactivated bit were tested before ->enqueue(),
there would be no packet queued after qdisc_deactivate().


> .   .
> rcu_read_unlock_bh().
>
> And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
> only reset the noop_qdisc, but not the actual qdisc, which is stored in
> txq->qdisc_sleeping, so the actual qdisc may still have skb.
>
> When hns3_link_status_change() call step 4 and 5, it will restore all queue's
> qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
> The skb enqueued in step 1 may be dequeued and run, whi

Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 2:24, Cong Wang wrote:
> > On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >
> > Can you be more specific here? Which call path requests a smaller
> > tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > we already have a synchronize_net() there.
>
> When the netdevice is in active state, the synchronize_net() seems to
> do the correct work, as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> rcu_read_lock_bh();
> netdev_core_pick_tx(dev, skb, sb_dev);
> .
> .   dev->real_num_tx_queues = txq;
> .   .
> .   .
> .   synchronize_net();
> .   .
> q->enqueue().
> .   .
> rcu_read_unlock_bh().
> qdisc_reset_all_tx_gt
>
>

Right.


> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> too.
>
> The problem we hit is as below:
> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> to deactive the netdevice when user requested a smaller queue num, and
> txq->qdisc is already changed to noop_qdisc when calling
> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> netif_set_real_num_tx_queues() does not help here.

How could qdisc still be running after deactivating the device?


>
> >
> >>
> >> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> >> after assigning new qdisc to dev_queue->qdisc and before calling
> >> qdisc_deactivate() to make sure skb with larger queue_mapping
> >> enqueued to old qdisc will always be reset when qdisc_deactivate()
> >> is called.
> >
> > Like Eric said, it is not nice to call such a blocking function when
> > we have a large number of TX queues. Possibly we just need to
> > add a synchronize_net() as in netif_set_real_num_tx_queues(),
> > if it is missing.
>
> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
> to work when netdevice is in active state, but does not work when in
> deactive.

Please explain why a deactivated device still has its qdisc running?

At least before commit 379349e9bc3b4, we always tested the deactivated
bit before enqueueing. Are you complaining about that commit?
That commit is indeed suspicious; at least it does not precisely revert
commit ba27b4cdaaa66561aaedb21 as it claims.


>
> And we do not want skb left in the old qdisc when netdevice is deactived,
> right?

Yes, and more importantly, qdisc should not be running after deactivation.

Thanks.


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.

Can you be more specific here? Which call path requests a smaller
tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
we already have a synchronize_net() there.

>
> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> after assigning new qdisc to dev_queue->qdisc and before calling
> qdisc_deactivate() to make sure skb with larger queue_mapping
> enqueued to old qdisc will always be reset when qdisc_deactivate()
> is called.

Like Eric said, it is not nice to call such a blocking function when
we have a large number of TX queues. Possibly we just need to
add a synchronize_net() as in netif_set_real_num_tx_queues(),
if it is missing.

Thanks.


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-17 Thread Cong Wang
On Sun, Aug 16, 2020 at 10:45 PM Christoph Hellwig  wrote:
>
> On Sun, Aug 16, 2020 at 10:55:09AM -0700, Cong Wang wrote:
> > On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
> > >
> > > The original problem was from nvme-over-tcp code, who mistakenly uses
> > > kernel_sendpage() to send pages allocated by __get_free_pages() without
> > > __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
> > > tail pages, sending them by kernel_sendpage() may trigger a kernel panic
> > > from a corrupted kernel heap, because these pages are incorrectly freed
> > > in network stack as page_count 0 pages.
> > >
> > > This patch introduces a helper sendpage_ok(), it returns true if the
> > > checking page,
> > > - is not slab page: PageSlab(page) is false.
> > > - has page refcount: page_count(page) is not zero
> > >
> > > All drivers who want to send page to remote end by kernel_sendpage()
> > > may use this helper to check whether the page is OK. If the helper does
> > > not return true, the driver should try other non sendpage method (e.g.
> > > sock_no_sendpage()) to handle the page.
> >
> > Can we leave this helper to mm subsystem?
> >
> > I know it is for sendpage, but its implementation is all about some
> > mm details and its two callers do not belong to net subsystem either.
> >
> > Think this in another way: who would fix it if it is buggy? I bet mm people
> > should. ;)
>
> No.  This is all about a really unusual imitation in sendpage, which

So netdev people will have to understand and support PageSlab() or
page_count()?

If it is unusual even for mm people, how are netdev people supposed
to understand this unusual mm bug? At least not any better.

> is pretty much unexpected.  In fact the best thing would be to make
> sock_sendpage do the right thing and call sock_no_sendpage based
> on this condition, so that driver writers don't have to worry at all.

Agreed, but kernel_sendpage() still relies on mm to provide a helper
to make the decision and ensure this helper is always up-to-date.

In short, it is all about ownership.

Thanks.


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-16 Thread Cong Wang
On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
>
> The original problem was from nvme-over-tcp code, who mistakenly uses
> kernel_sendpage() to send pages allocated by __get_free_pages() without
> __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
> tail pages, sending them by kernel_sendpage() may trigger a kernel panic
> from a corrupted kernel heap, because these pages are incorrectly freed
> in network stack as page_count 0 pages.
>
> This patch introduces a helper sendpage_ok(), it returns true if the
> checking page,
> - is not slab page: PageSlab(page) is false.
> - has page refcount: page_count(page) is not zero
>
> All drivers who want to send page to remote end by kernel_sendpage()
> may use this helper to check whether the page is OK. If the helper does
> not return true, the driver should try other non sendpage method (e.g.
> sock_no_sendpage()) to handle the page.

Can we leave this helper to mm subsystem?

I know it is for sendpage, but its implementation is all about some
mm details and its two callers do not belong to net subsystem either.

Think of it another way: who would fix it if it is buggy? I bet mm people
should. ;)

Thanks.


Re: [ovs-dev] [PATCH v2] net: openvswitch: introduce common code for flushing flows

2020-08-13 Thread Cong Wang
On Wed, Aug 12, 2020 at 2:59 AM  wrote:
>
> From: Tonghao Zhang 
>
> To avoid some issues, for example RCU usage warning and double free,
> we should flush the flows under ovs_lock. This patch refactors
> table_instance_destroy and introduces table_instance_flow_flush
> which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
>
> Fixes: 50b0e61b32ee ("net: openvswitch: fix possible memleak on destroy 
> flow-table")
> Reported-by: Johan Knöös 
> Reported-at: 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050489.html
> Signed-off-by: Tonghao Zhang 

Reviewed-by: Cong Wang 

Thanks.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-12 Thread Cong Wang
On Wed, Aug 12, 2020 at 2:21 AM linmiaohe  wrote:
>
> Hi all:
> David Miller  wrote:
> >From: Cong Wang 
> >Date: Tue, 11 Aug 2020 16:02:51 -0700
> >
> >>> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net,
> >>> int val)  }  #endif
> >>>
> >>> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot) {
> >>> +   if (!twsk_prot)
> >>> +   return;
> >>> +   kfree(twsk_prot->twsk_slab_name);
> >>> +   twsk_prot->twsk_slab_name = NULL;
> >>> +   kmem_cache_destroy(twsk_prot->twsk_slab);
> >>
> >> Hmm, are you sure you can free the kmem cache name before
> >> kmem_cache_destroy()? To me, it seems kmem_cache_destroy() frees the
> >> name via slab_kmem_cache_release() via kfree_const().
> >> With your patch, we have a double-free on the name?
> >>
> >> Or am I missing anything?
> >
> >Yep, there is a double free here.
> >
> >Please fix this.
>
> Many thanks for both of you to point this issue out. But I'am not really 
> understand, could you please explain it more?
> As far as I can see, the double free path is:
> 1. kfree(twsk_prot->twsk_slab_name)
> 2. kmem_cache_destroy
> --> shutdown_memcg_caches
> --> shutdown_cache
> --> slab_kmem_cache_release
> --> kfree_const(s->name)
> But twsk_prot->twsk_slab_name is allocated from kasprintf via 
> kmalloc_track_caller while twsk_prot->twsk_slab->name is allocated
> via kstrdup_const. So I think twsk_prot->twsk_slab_name and 
> twsk_prot->twsk_slab->name point to different memory, and there is no double 
> free.
>

You are right. Since it is duplicated, there is no need to keep
->twsk_slab_name; we can just use twsk_slab->name. I will send
a patch to get rid of it.

> Or am I missing anything?
>
> By the way, req_prot_cleanup() do the same things, i.e. free the slab_name 
> before involve kmem_cache_destroy(). If there is a double
> free, so as here?

Ditto. Can be just removed.

Thanks.


Re: [ovs-dev] [PATCH] net: openvswitch: introduce common code for flushing flows

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 6:14 PM  wrote:
>
> From: Tonghao Zhang 
>
> To avoid some issues, for example RCU usage warning, we should
> flush the flows under ovs_lock. This patch refactors
> table_instance_destroy and introduces table_instance_flow_flush
> which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
>
> Signed-off-by: Tonghao Zhang 

Please add a Fixes tag here; I think it is probably your memory leak fix
which introduced this issue. And a Reported-by, to give credit to the bug
reporters.

Plus one minor issue below:

> -static void table_instance_destroy(struct flow_table *table,
> -  struct table_instance *ti,
> -  struct table_instance *ufid_ti,
> -  bool deferred)
> +/* Must be called with OVS mutex held. */
> +void table_instance_flow_flush(struct flow_table *table,
> +  struct table_instance *ti,
> +  struct table_instance *ufid_ti)
>  {
> int i;
>
> -   if (!ti)
> -   return;
> -
> -   BUG_ON(!ufid_ti);
> if (ti->keep_flows)
> -   goto skip_flows;
> +   return;
>
> for (i = 0; i < ti->n_buckets; i++) {
> -   struct sw_flow *flow;
> struct hlist_head *head = &ti->buckets[i];
> struct hlist_node *n;
> +   struct sw_flow *flow;

This is at most a coding style change, please do not mix
coding style changes into bug fixes. You can always push coding
style changes separately when net-next is open.

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 10:59 PM Tonghao Zhang  wrote:
>
> On Tue, Aug 11, 2020 at 12:08 PM Cong Wang  wrote:
> >
> > On Mon, Aug 10, 2020 at 8:27 PM Tonghao Zhang  
> > wrote:
> > >
> > > On Tue, Aug 11, 2020 at 10:24 AM Cong Wang  
> > > wrote:
> > > >
> > > > On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang 
> > > >  wrote:
> > > > > Hi all, I send a patch to fix this. The rcu warnings disappear. I
> > > > > don't reproduce the double free issue.
> > > > > But I guess this patch may address this issue.
> > > > >
> > > > > http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/
> > > >
> > > > I don't see how your patch address the double-free, as we still
> > > > free mask array twice after your patch: once in tbl_mask_array_realloc()
> > > > and once in ovs_flow_tbl_destroy().
> > > Hi Cong.
> > > Before my patch, we use the ovsl_dereference
> > > (rcu_dereference_protected) in the rcu callback.
> > > ovs_flow_tbl_destroy
> > > ->table_instance_destroy
> > > ->table_instance_flow_free
> > > ->flow_mask_remove
> > > ASSERT_OVSL(will print warning)
> > > ->tbl_mask_array_del_mask
> > > ovsl_dereference(rcu usage warning)
> > >
> >
> > I understand how your patch addresses the RCU annotation issue,
> > which is different from double-free.
> >
> >
> > > so we should invoke the table_instance_destroy or others under
> > > ovs_lock to avoid (ASSERT_OVSL and rcu usage warning).
> >
> > Of course... I never doubt it.
> >
> >
> > > with this patch, we reallocate the mask_array under ovs_lock, and free
> > > it in the rcu callback. Without it, we  reallocate and free it in the
> > > rcu callback.
> > > I think we may fix it with this patch.
> >
> > Does it matter which context tbl_mask_array_realloc() is called?
> > Even with ovs_lock, we can still double free:
> >
> > ovs_lock()
> > tbl_mask_array_realloc()
> >  => call_rcu(&ma->rcu, mask_array_rcu_cb);
> > ovs_unlock()
> > ...
> > ovs_flow_tbl_destroy()
> >  => call_rcu(&ma->rcu, mask_array_rcu_cb);
> >
> > So still twice, right? To fix the double-free, we have to eliminate one
> > of them, don't we? ;)
> No
> Without my patch: in rcu callback:
> ovs_flow_tbl_destroy
> ->call_rcu(&ma->rcu, mask_array_rcu_cb);
> ->table_instance_destroy
> ->tbl_mask_array_realloc(Shrink the mask array if necessary)
> ->call_rcu(&ma->rcu, mask_array_rcu_cb);
>
> With the patch:
> ovs_lock
> table_instance_flow_flush (free the flow)
> tbl_mask_array_realloc(shrink the mask array if necessary, will free
> mask_array in rcu(mask_array_rcu_cb) and rcu_assign_pointer new
> mask_array)
> ovs_unlock
>
> in rcu callback:
> ovs_flow_tbl_destroy
> call_rcu(&ma->rcu, mask_array_rcu_cb); (that is the new mask_array)

Ah, I see. I thought the mask array was cached in the caller and passed along;
it is in fact refetched via &dp->table.

Thanks.
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 5:19 AM Miaohe Lin  wrote:
>
> If we failed to assign proto idx, we free the twsk_slab_name but forget to
> free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
> together and also use this helper function in proto_unregister().
>
> Fixes: b45ce32135d1 ("sock: fix potential memory leak in proto_register()")
> Signed-off-by: Miaohe Lin 
> ---
>  net/core/sock.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 49cd5ffe673e..c9083ad44ea1 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net, int val)
>  }
>  #endif
>
> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
> +{
> +   if (!twsk_prot)
> +   return;
> +   kfree(twsk_prot->twsk_slab_name);
> +   twsk_prot->twsk_slab_name = NULL;
> +   kmem_cache_destroy(twsk_prot->twsk_slab);

Hmm, are you sure you can free the kmem cache name before
kmem_cache_destroy()? To me, it seems kmem_cache_destroy()
frees the name via slab_kmem_cache_release() via kfree_const().
With your patch, we have a double-free on the name?

Or am I missing anything?

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 8:27 PM Tonghao Zhang  wrote:
>
> On Tue, Aug 11, 2020 at 10:24 AM Cong Wang  wrote:
> >
> > On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang  
> > wrote:
> > > Hi all, I send a patch to fix this. The rcu warnings disappear. I
> > > don't reproduce the double free issue.
> > > But I guess this patch may address this issue.
> > >
> > > http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/
> >
> > I don't see how your patch address the double-free, as we still
> > free mask array twice after your patch: once in tbl_mask_array_realloc()
> > and once in ovs_flow_tbl_destroy().
> Hi Cong.
> Before my patch, we use the ovsl_dereference
> (rcu_dereference_protected) in the rcu callback.
> ovs_flow_tbl_destroy
> ->table_instance_destroy
> ->table_instance_flow_free
> ->flow_mask_remove
> ASSERT_OVSL(will print warning)
> ->tbl_mask_array_del_mask
> ovsl_dereference(rcu usage warning)
>

I understand how your patch addresses the RCU annotation issue,
which is different from double-free.


> so we should invoke the table_instance_destroy or others under
> ovs_lock to avoid (ASSERT_OVSL and rcu usage warning).

Of course... I never doubt it.


> with this patch, we reallocate the mask_array under ovs_lock, and free
> it in the rcu callback. Without it, we  reallocate and free it in the
> rcu callback.
> I think we may fix it with this patch.

Does it matter which context tbl_mask_array_realloc() is called?
Even with ovs_lock, we can still double free:

ovs_lock()
tbl_mask_array_realloc()
 => call_rcu(&ma->rcu, mask_array_rcu_cb);
ovs_unlock()
...
ovs_flow_tbl_destroy()
 => call_rcu(&ma->rcu, mask_array_rcu_cb);

So still twice, right? To fix the double-free, we have to eliminate one
of them, don't we? ;)


>
> > Have you tried my patch which is supposed to address this double-free?
> I don't reproduce it. but your patch does not avoid ruc usage warning
> and ASSERT_OVSL.

Sure, I never intended to fix anything but the double-free. The $subject is
about the double free; I double checked. ;)

Thanks.


Re: [Linux-kernel-mentees] [PATCH net] ipvs: Fix uninit-value in do_ip_vs_set_ctl()

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 3:10 PM Peilin Ye  wrote:
>
> do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
> zero. Fix it.

Which exact 'cmd' is it here?

I _guess_ it is one of those left uninitialized in set_arglen[], which is 0.
But if that is the case, shouldn't it be initialized to
sizeof(struct ip_vs_service_user) instead, because ip_vs_copy_usvc_compat()
is called anyway? Or maybe we should just ban the len==0 case.

In either case, it does not look like you fix it correctly.

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang  wrote:
> Hi all, I send a patch to fix this. The rcu warnings disappear. I
> don't reproduce the double free issue.
> But I guess this patch may address this issue.
>
> http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/

I don't see how your patch address the double-free, as we still
free mask array twice after your patch: once in tbl_mask_array_realloc()
and once in ovs_flow_tbl_destroy().

Have you tried my patch which is supposed to address this double-free?
It simply skips the reallocation as it makes no sense to trigger reallocation
when destroying it.

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-08-08 Thread Cong Wang
On Sat, Aug 8, 2020 at 10:07 AM Gaurav Singh  wrote:
>
> This PR fixes a possible segmentation violation.
>
> In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL
> unconditionally (regardless sk being  NULL or not).
>
> In include/linux/ipv6.h:
>
> static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
> {
> return NULL;
> }
>

Tell us who will use ip6_autoflowlabel() when CONFIG_IPV6
is disabled. :)


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-07 Thread Cong Wang
On Fri, Aug 7, 2020 at 8:33 AM Johan Knöös  wrote:
>
> On Tue, Aug 4, 2020 at 8:52 AM Gregory Rose  wrote:
> >
> >
> >
> > On 8/3/2020 12:01 PM, Johan Knöös via discuss wrote:
> > > Hi Open vSwitch contributors,
> > >
> > > We have found openvswitch is causing double-freeing of memory. The
> > > issue was not present in kernel version 5.5.17 but is present in
> > > 5.6.14 and newer kernels.
> > >
> > > After reverting the RCU commits below for debugging, enabling
> > > slub_debug, lockdep, and KASAN, we see the warnings at the end of this
> > > email in the kernel log (the last one shows the double-free). When I
> > > revert 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 ("net: openvswitch:
> > > fix possible memleak on destroy flow-table"), the symptoms disappear.
> > > While I have a reliable way to reproduce the issue, I unfortunately
> > > don't yet have a process that's amenable to sharing. Please take a
> > > look.
> > >
> > > 189a6883dcf7 rcu: Remove kfree_call_rcu_nobatch()
> > > 77a40f97030b rcu: Remove kfree_rcu() special casing and lazy-callback 
> > > handling
> > > e99637becb2e rcu: Add support for debug_objects debugging for kfree_rcu()
> > > 0392bebebf26 rcu: Add multiple in-flight batches of kfree_rcu() work
> > > 569d767087ef rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
> > > a35d16905efc rcu: Add basic support for kfree_rcu() batching
> > >
> > > Thanks,
> > > Johan Knöös
> >
> > Let's add the author of the patch you reverted and the Linux netdev
> > mailing list.
> >
> > - Greg
>
> I found we also sometimes get warnings from
> https://elixir.bootlin.com/linux/v5.5.17/source/kernel/rcu/tree.c#L2239
> under similar conditions even on kernel 5.5.17, which I believe may be
> related. However, it's much rarer and I don't have a reliable way of
> reproducing it. Perhaps 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 only
> increases the frequency of a pre-existing bug.

It seems clear we have a double free on table->mask_array when
the reallocation is triggered on the destroy path.

Are you able to test the attached patch (compile tested only)?
Also note: it is generated against the latest net tree, it may not be
applied cleanly to any earlier stable release.

Thanks!
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 8c12675cbb67..cc7859db445a 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -294,7 +294,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
 }
 
 static void tbl_mask_array_del_mask(struct flow_table *tbl,
-struct sw_flow_mask *mask)
+struct sw_flow_mask *mask, bool destroy)
 {
 	struct mask_array *ma = ovsl_dereference(tbl->mask_array);
 	int i, ma_count = READ_ONCE(ma->count);
@@ -314,6 +314,11 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
 	rcu_assign_pointer(ma->masks[i], ma->masks[ma_count -1]);
 	RCU_INIT_POINTER(ma->masks[ma_count -1], NULL);
 
+	if (destroy) {
+		kfree(mask);
+		return;
+	}
+
 	kfree_rcu(mask, rcu);
 
 	/* Shrink the mask array if necessary. */
@@ -326,7 +331,8 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
 }
 
 /* Remove 'mask' from the mask list, if it is not needed any more. */
-static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
+static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask,
+			 bool destroy)
 {
 	if (mask) {
 		/* ovs-lock is required to protect mask-refcount and
@@ -337,7 +343,7 @@ static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
 		mask->ref_count--;
 
 		if (!mask->ref_count)
-			tbl_mask_array_del_mask(tbl, mask);
+			tbl_mask_array_del_mask(tbl, mask, destroy);
 	}
 }
 
@@ -470,7 +476,7 @@ static void table_instance_flow_free(struct flow_table *table,
 			table->ufid_count--;
 	}
 
-	flow_mask_remove(table, flow->mask);
+	flow_mask_remove(table, flow->mask, !count);
 }
 
 static void table_instance_destroy(struct flow_table *table,
@@ -521,9 +527,9 @@ void ovs_flow_tbl_destroy(struct flow_table *table)
 	struct mask_cache *mc = rcu_dereference_raw(table->mask_cache);
 	struct mask_array *ma = rcu_dereference_raw(table->mask_array);
 
+	table_instance_destroy(table, ti, ufid_ti, false);
	call_rcu(&mc->rcu, mask_cache_rcu_cb);
	call_rcu(&ma->rcu, mask_array_rcu_cb);
-	table_instance_destroy(table, ti, ufid_ti, false);
 }
 
 struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,


Re: [PATCHv2 net-next] dropwatch: Support monitoring of dropped frames

2020-08-04 Thread Cong Wang
On Tue, Aug 4, 2020 at 9:14 AM  wrote:
>
> From: Izabela Bakollari 
>
> Dropwatch is a utility that monitors dropped frames by having userspace
> record them over the dropwatch protocol over a file. This augument
> allows live monitoring of dropped frames using tools like tcpdump.
>
> With this feature, dropwatch allows two additional commands (start and
> stop interface) which allows the assignment of a net_device to the
> dropwatch protocol. When assinged, dropwatch will clone dropped frames,
> and receive them on the assigned interface, allowing tools like tcpdump
> to monitor for them.
>
> With this feature, create a dummy ethernet interface (ip link add dev
> dummy0 type dummy), assign it to the dropwatch kernel subsystem, by using
> these new commands, and then monitor dropped frames in real time by
> running tcpdump -i dummy0.

drop monitor is already able to send dropped packets to user-space,
and wireshark has already caught up with this feature:

https://code.wireshark.org/review/gitweb?p=wireshark.git;a=commitdiff;h=a94a860c0644ec3b8a129fd243674a2e376ce1c8

So what you propose here seems pretty much a duplicate?

Thanks.


Re: [PATCH] drivers/net/wan/lapbether: Use needed_headroom instead of hard_header_len

2020-07-27 Thread Cong Wang
Hello,

On Mon, Jul 27, 2020 at 12:41 PM Xie He  wrote:
>
> Hi Cong Wang,
>
> I'm wishing to change a driver from using "hard_header_len" to using
> "needed_headroom" to declare its needed headroom. I submitted a patch
> and it is decided it needs to be reviewed. I see you participated in
> "hard_header_len vs needed_headroom" discussions in the past. Can you
> help me review this patch? Thanks!
>
> The patch is at:
> http://patchwork.ozlabs.org/project/netdev/patch/20200726110524.151957-1-xie.he.0...@gmail.com/
>
> In my understanding, hard_header_len should be the length of the header
> created by dev_hard_header. Any additional headroom needed should be
> declared in needed_headroom instead of hard_header_len. I came to this
> conclusion by examining the logic of net/packet/af_packet.c:packet_snd.

I am not familiar with this WAN driver, but I suggest you look at
the following commit, which provides a lot of useful information:

commit 9454f7a895b822dd8fb4588fc55fda7c96728869
Author: Brian Norris 
Date:   Wed Feb 26 16:05:11 2020 -0800

mwifiex: set needed_headroom, not hard_header_len

hard_header_len provides limitations for things like AF_PACKET, such
that we don't allow transmitting packets smaller than this.

needed_headroom provides a suggested minimum headroom for SKBs, so that
we can trivally add our headers to the front.

The latter is the correct field to use in this case, while the former
mostly just prevents sending small AF_PACKET frames.

In any case, mwifiex already does its own bounce buffering [1] if we
don't have enough headroom, so hints (not hard limits) are all that are
needed.

This is the essentially the same bug (and fix) that brcmfmac had, fixed
in commit cb39288fd6bb ("brcmfmac: use ndev->needed_headroom to reserve
additional header space").

[1] mwifiex_hard_start_xmit():
if (skb_headroom(skb) < MWIFIEX_MIN_DATA_HEADER_LEN) {
[...]
/* Insufficient skb headroom - allocate a new skb */

Hope this helps.

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Mon, Jul 27, 2020 at 7:14 PM Gaurav Singh  wrote:
>
> Add return to fix build issue. Haven't reproduced this issue at
> my end.
>
> My hypothesis is this: In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL.
>
> Further down the function, there's a check:
> if (np) hlimit = np->hop_limit;

This check exists before git history, at that time 'sk' could be NULL,
hence 'np', so it does not mean it is still necessary now.

I looked at all callers of ip6_xmit(), I don't see how it is called with
a non-full socket, neither 'sk' could be NULL after
commit b30bd282cbf5c46247a279a2e8d2aae027d9f1bf
("[IPV6]: ip6_xmit: remove unnecessary NULL ptr check").

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Sun, Jul 26, 2020 at 8:39 PM Gaurav Singh  wrote:
>
> ipv6_pinfo is initialized by inet6_sk() which returns NULL.

Why? It only returns NULL for timewait or request sock, but
I don't see how ip6_autoflowlabel() could be called on these
sockets. So please explain.

> Hence it can cause segmentation fault. Fix this by adding a
> NULL check.

Which exact call path? Do you have a full stack trace?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-26 Thread Cong Wang
On Sat, Jul 25, 2020 at 11:12 PM B K Karthik  wrote:
>
> On Sun, Jul 26, 2020 at 11:05 AM Cong Wang  wrote:
> >
> > On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> > > @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net 
> > > *net, u32 spi)
> > >  {
> > > struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> > > struct xfrm6_tunnel_spi *x6spi;
> > > -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> > > +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t 
> > > *)spi);
> > >
> > > hlist_for_each_entry(x6spi,
> > > -&xfrm6_tn->spi_byspi[index],
> > > +&xfrm6_tn->spi_byaddr[index],
> > >  list_byspi) {
> > > if (x6spi->spi == spi)
> >
> > How did you convince yourself this is correct? This lookup is still
> > using spi. :)
>
> I'm sorry, but my intention behind writing this patch was not to fix
> the UAF, but to fix a slab-out-of-bound.

Odd, your $subject is clearly UAF, so is the stack trace in your changelog.
:)


> If required, I can definitely change the subject line and resend the
> patch, but I figured this was correct for
> https://syzkaller.appspot.com/bug?id=058d05f470583ab2843b1d6785fa8d0658ef66ae
> . since that particular report did not have a reproducer,
> Dmitry Vyukov  suggested that I test this patch on
> other reports for xfrm/spi .

You have to change it to avoid misleading.

>
> Forgive me if this was the wrong way to send a patch for that
> particular report, but I guessed since the reproducer did not trigger
> the crash
> for UAF, I would leave the subject line as 'fix UAF' :)
>
> xfrm6_spi_hash_by_hash seemed more convincing because I had to prevent
> a slab-out-of-bounds because it uses ipv6_addr_hash.
> It would be of great help if you could help me understand how this was
> able to fix a UAF.

Sure, you just avoid a pointer deref, which of course can fix the UAF,
but I still don't think it is correct in any aspect.

Even if it is a OOB, you still have to explain why it happened. Once
again, I can't see how it could happen either.

>
> >
> > More importantly, can you explain how UAF happens? Apparently
> > the syzbot stack traces you quote make no sense at all. I also
> > looked at other similar reports, none of them makes sense to me.
>
> Forgive me, but I do not understand what you mean by the stack traces
> (this or other similar reports) "make no sense".

Because the stack trace in your changelog clearly shows it is allocated
in tomoyo_init_log(), which is a buffer in struct tomoyo_query, but
none of xfrm paths uses it. Or do you see anything otherwise?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-25 Thread Cong Wang
On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net *net, 
> u32 spi)
>  {
> struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> struct xfrm6_tunnel_spi *x6spi;
> -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t *)spi);
>
> hlist_for_each_entry(x6spi,
> -&xfrm6_tn->spi_byspi[index],
> +&xfrm6_tn->spi_byaddr[index],
>  list_byspi) {
> if (x6spi->spi == spi)

How did you convince yourself this is correct? This lookup is still
using spi. :)

More importantly, can you explain how UAF happens? Apparently
the syzbot stack traces you quote make no sense at all. I also
looked at other similar reports, none of them makes sense to me.

Thanks.


Re: KASAN: use-after-free Read in vlan_dev_get_iflink

2020-07-23 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net

