Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
On Dec 1, 2016 at 10:48, wangyunjian wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Wednesday, November 30, 2016 9:41 PM
> > To: wangyunjian
> > Cc: jasow...@redhat.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; caihe
> > Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> > On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> > > When we meet an error(err=-EBADFD) recvmsg,
> >
> > How do you get EBADFD? Won't vhost_net_rx_peek_head_len return 0 in
> > this case, breaking the loop?
>
> We started many guest VMs while attaching/detaching some virtio-net
> nics in a loop. The soft lockup might happen. The err is -EBADFD.

How did you do the attaching/detaching? AFAIK, -EBADFD can only happen
when you delete the tun device during vhost_net transmission.

> message log:
> kernel:[609608.510180] BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> call trace:
> [] vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [] handle_rx+0x226/0x810 [vhost_net]
> [] handle_rx_net+0x15/0x20 [vhost_net]
> [] vhost_worker+0xfb/0x1e0 [vhost]
> [] ? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [] kthread+0xcf/0xe0
> [] ? kthread_create_on_node+0x140/0x140
> [] ret_from_fork+0x58/0x90
> [] ? kthread_create_on_node+0x140/0x140
>
> the error handling in vhost handle_rx() will continue. This will cause
> a soft CPU lockup in the vhost thread.
>
> Signed-off-by: Yunjian Wang
> ---
>  drivers/vhost/net.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 5dc128a..edc470b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>  			pr_debug("Discarded rx packet: "
>  				 " len %d, expected %zd\n", err, sock_len);
>  			vhost_discard_vq_desc(vq, headcount);
> +			/* Don't continue to do, when meet errors. */
> +			if (err < 0)
> +				goto out;

You might get e.g. EAGAIN and I think you need to retry in this case.

>  			continue;
>  		}
>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> --
> 1.9.5.msysgit.1
Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx
On Dec 1, 2016 at 17:34, Andrey Konovalov wrote:
> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received (if CONFIG_4KSTACKS is not enabled, to avoid
> stack exhaustion). The difference between the two is that netif_rx
> queues the packet into the backlog, and netif_receive_skb processes the
> packet in the current context.
>
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths when a packet is received through tun (syzkaller collects
> coverage per process in the process context).
>
> As mentioned by Eric, this change also speeds up tun/tap. As measured by
> Peter, it speeds up his closed-loop single-stream tap/OVS benchmark by
> about 23%, from 700k packets/second to 867k packets/second.
>
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for
> cgroups to shape network traffic based on the original process) and
> decided not to go further with it. The main concern back then was about
> possible stack exhaustion with 4K stacks.
>
> [1] https://github.com/google/syzkaller
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> [3] https://www.spinics.net/lists/netdev/msg130570.html
>
> Signed-off-by: Andrey Konovalov <andreyk...@google.com>
> ---
>
> Changes since v1:
> - incorporate Eric's note about speed improvements in commit description
> - use netif_receive_skb only when CONFIG_4KSTACKS is not enabled
>
>  drivers/net/tun.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8093e39..d310b13 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1304,7 +1304,13 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  	skb_probe_transport_header(skb, 0);
>
>  	rxhash = skb_get_hash(skb);
> +#ifndef CONFIG_4KSTACKS
> +	local_bh_disable();
> +	netif_receive_skb(skb);
> +	local_bh_enable();
> +#else
>  	netif_rx_ni(skb);
> +#endif
>
>  	stats = get_cpu_ptr(tun->pcpu_stats);
>  	u64_stats_update_begin(&stats->syncp);

I get +20% tx pps from guest with this patch.

Acked-by: Jason Wang <jasow...@redhat.com>
Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
On Dec 1, 2016 at 11:21, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 02:48:59AM, wangyunjian wrote:
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:m...@redhat.com]
>>> Sent: Wednesday, November 30, 2016 9:41 PM
>>> To: wangyunjian
>>> Cc: jasow...@redhat.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; caihe
>>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>
>>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>> When we meet an error(err=-EBADFD) recvmsg,
>>>
>>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len return 0 in
>>> this case, breaking the loop?
>>
>> We started many guest VMs while attaching/detaching some virtio-net
>> nics in a loop. The soft lockup might happen. The err is -EBADFD.
>
> OK, I'd like to figure out what happened here. Why don't we get 0 when
> we peek at the head? EBADFD is from here:
>
> 	struct tun_struct *tun = __tun_get(tfile);
> 	...
> 	if (!tun)
> 		return -EBADFD;
>
> but then:
>
> 	static int tun_peek_len(struct socket *sock)
> 	{
> 		...
> 		struct tun_struct *tun;
> 		...
> 		tun = __tun_get(tfile);
> 		if (!tun)
> 			return 0;
>
> so peek len should return 0, and then the while loop will exit:
>
> 	while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> 		...

Consider this case: the user does "ip link del link tap0" before
recvmsg() but after tun_peek_len()?

>> message log:
>> kernel:[609608.510180] BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
>> call trace:
>> [] vhost_get_vq_desc+0x1e7/0x984 [vhost]
>> [] handle_rx+0x226/0x810 [vhost_net]
>> [] handle_rx_net+0x15/0x20 [vhost_net]
>> [] vhost_worker+0xfb/0x1e0 [vhost]
>> [] ? vhost_dev_reset_owner+0x50/0x50 [vhost]
>> [] kthread+0xcf/0xe0
>> [] ? kthread_create_on_node+0x140/0x140
>> [] ret_from_fork+0x58/0x90
>> [] ? kthread_create_on_node+0x140/0x140
>
> So somehow you keep seeing something in tun when we peek. IMO this is
> the problem we should try to fix. Could you try debugging the root
> cause here?
>
>> the error handling in vhost handle_rx() will continue. This will cause
>> a soft CPU lockup in the vhost thread.
>>
>> Signed-off-by: Yunjian Wang
>> ---
>>  drivers/vhost/net.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>  			pr_debug("Discarded rx packet: "
>>  				 " len %d, expected %zd\n", err, sock_len);
>>  			vhost_discard_vq_desc(vq, headcount);
>> +			/* Don't continue to do, when meet errors. */
>> +			if (err < 0)
>> +				goto out;
>
> You might get e.g. EAGAIN and I think you need to retry in this case.
>
>>  			continue;
>>  		}
>>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> --
>> 1.9.5.msysgit.1
Re: [RFC PATCH] virtio_net: XDP support for adjust_head
On Jan 3, 2017 at 03:44, John Fastabend wrote:
> Add support for XDP adjust head by allocating a 256B header region that
> XDP programs can grow into. This is only enabled when a XDP program is
> loaded.
>
> In order to ensure that we do not have to unwind queue headroom, push
> queue setup below bpf_prog_add. It reads better to do a prog ref unwind
> vs another queue setup call.
>
> : There is a problem with this patch as is. When the xdp prog is loaded
> the old buffers without the 256B headers need to be flushed so that the
> bpf prog has the necessary headroom. This patch does this by calling
> virtqueue_detach_unused_buf() followed by a virtnet_set_queues() call to
> reinitialize the buffers. However, I don't believe this is safe: per the
> comment in virtio_ring, this API is not valid on an active queue, and
> the only thing we have done here is napi_disable/napi_enable wrappers,
> which don't do anything to the emulation layer.
>
> So the RFC is really to find the best solution to this problem. A couple
> of things come to mind: (a) always allocate the necessary headroom, but
> this is a bit of a waste; (b) add some bit somewhere to check if the
> buffer has headroom, but this would mean XDP programs would be broken
> for a cycle through the ring; (c) figure out how to deactivate a queue,
> free the buffers and finally reallocate. I think (c) is the best choice
> for now, but I'm not seeing the API to do this, so virtio/qemu experts:
> anyone know off-hand how to make this work? I started looking into the
> PCI callbacks reset() and virtio_device_ready(), or possibly hitting the
> right set of bits with vp_set_status(), but my first attempt just hung
> the device.

Hi John:

AFAIK, disabling a specific queue is supported only by virtio 1.0,
through the queue_enable field in the PCI common cfg. But unfortunately,
qemu does not emulate this at all, and legacy devices do not even
support it. So the safe way is probably to reset the device and redo the
initialization here.
> Signed-off-by: John Fastabend
> ---
>  drivers/net/virtio_net.c | 106 ++++++++++++++++++++++++++++++---------
>  1 file changed, 80 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5deeda6..fcc5bd7 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -159,6 +159,9 @@ struct virtnet_info {
>  	/* Ethtool settings */
>  	u8 duplex;
>  	u32 speed;
> +
> +	/* Headroom allocated in RX Queue */
> +	unsigned int headroom;
>  };
>
>  struct padded_vnet_hdr {
> @@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
>  	}
>
>  	if (vi->mergeable_rx_bufs) {
> +		xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  		/* Zero header and leave csum up to XDP layers */
>  		hdr = xdp->data;
>  		memset(hdr, 0, vi->hdr_len);
> @@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
>  		num_sg = 2;
>  		sg_init_table(sq->sg, 2);
>  		sg_set_buf(sq->sg, hdr, vi->hdr_len);
> -		skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
> +		skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - xdp->data);

vi->headroom looks suspicious, should it be xdp->data - xdp->data_hard_start?

>  	}
>  	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, data, GFP_ATOMIC);
> @@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
>  			struct bpf_prog *xdp_prog,
>  			void *data, int len)
>  {
> -	int hdr_padded_len;
>  	struct xdp_buff xdp;
> -	void *buf;
>  	unsigned int qp;
>  	u32 act;
>
> +
>  	if (vi->mergeable_rx_bufs) {
> -		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> -		xdp.data = data + hdr_padded_len;
> +		int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +		/* Allow consuming headroom but reserve enough space to push
> +		 * the descriptor on if we get an XDP_TX return code.
> +		 */
> +		xdp.data_hard_start = data - vi->headroom + desc_room;
> +		xdp.data = data + desc_room;
>  		xdp.data_end = xdp.data + (len - vi->hdr_len);
> -		buf = data;
>  	} else { /* small buffers */
>  		struct sk_buff *skb = data;
>
> -		xdp.data = skb->data;
> +		xdp.data_hard_start = skb->data;
> +		xdp.data = skb->data + vi->headroom;
>  		xdp.data_end = xdp.data + len;
> -		buf = skb->data;
>  	}
>
>  	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>  	switch (act) {
>  	case XDP_PASS:
> +		if (!vi->mergeable_rx_bufs)
> +			__skb_pull((struct sk_buff *) data,
> +				   xdp.data - xdp.data_hard_start);

Instead of doing things here and in virtnet_xdp_xmit(), how about always
making skb->data point to the buffer head like:

1) reserve headroom in add_recvbuf_small()
2) skb_push(xdp->data - xdp->data_hard_start, skb) if we detect
   xdp->data was modified after bpf_prog_run_xdp()

Then there's no special code in either XDP_PASS or XDP_TX?
Re: [net PATCH] net: virtio: cap mtu when XDP programs are running
On Jan 3, 2017 at 06:30, John Fastabend wrote:
> XDP programs can not consume multiple pages so we cap the MTU to avoid
> this case. Virtio-net however only checks the MTU at XDP program load
> and does not block MTU changes after the program has loaded.
>
> This patch sets/clears the max_mtu value at XDP load/unload time.
>
> Signed-off-by: John Fastabend
> ---
>  drivers/net/virtio_net.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5deeda6..783e842 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
>  	.set_settings = virtnet_set_settings,
>  };
>
> +#define MIN_MTU ETH_MIN_MTU
> +#define MAX_MTU ETH_MAX_MTU
> +
>  static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>  {
>  	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
> @@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>  			virtnet_set_queues(vi, curr_qp);
>  			return PTR_ERR(prog);
>  		}
> +		dev->max_mtu = max_sz;
> +	} else {
> +		dev->max_mtu = ETH_MAX_MTU;

Or use ETH_DATA_LEN here, considering we only allocate a size of
GOOD_PACKET_LEN for each small buffer?

Thanks

>  	}
>
>  	vi->xdp_queue_pairs = xdp_qp;
> @@ -2133,9 +2139,6 @@ static bool virtnet_validate_features(struct virtio_device *vdev)
>  	return true;
>  }
>
> -#define MIN_MTU ETH_MIN_MTU
> -#define MAX_MTU ETH_MAX_MTU
> -
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>  	int i, err;
Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
On Jan 3, 2017 at 06:43, John Fastabend wrote:
> On 16-12-23 06:37 AM, Jason Wang wrote:
>> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
>> small receive buffer untouched. This will confuse the user who wants
>> to set XDP but uses small buffers. Rather than forbid XDP in small
>> buffer mode, let's make it work. XDP can then only work at skb->data
>> since virtio-net creates skbs during refill; this is suboptimal and
>> could be optimized in the future.
>>
>> Cc: John Fastabend <john.r.fastab...@intel.com>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>> ---
>>  drivers/net/virtio_net.c | 112 ++++++++++++++++++++++++++++----------
>>  1 file changed, 87 insertions(+), 25 deletions(-)
>
> Hi Jason,
>
> I was doing some more testing on this. What do you think about doing
> this so that free_unused_bufs() handles the buffer free with
> dev_kfree_skb() instead of put_page in small receive mode? Seems more
> correct to me.
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 783e842..27ff76c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info *vi)
>
>  static bool is_xdp_queue(struct virtnet_info *vi, int q)
>  {
> +	/* For small receive mode always use kfree_skb variants */
> +	if (!vi->mergeable_rx_bufs)
> +		return false;
> +
>  	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
>  		return false;
>  	else if (q < vi->curr_queue_pairs)
>
> patch is untested, just spotted doing code review.
>
> Thanks,
> John

We probably need a better name for this function.

Acked-by: Jason Wang <jasow...@redhat.com>
Re: [RFC PATCH] virtio_net: XDP support for adjust_head
On Jan 5, 2017 at 02:58, John Fastabend wrote:
[...]
>>> @@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
>>>  			struct bpf_prog *xdp_prog,
>>>  			void *data, int len)
>>>  {
>>> -	int hdr_padded_len;
>>>  	struct xdp_buff xdp;
>>> -	void *buf;
>>>  	unsigned int qp;
>>>  	u32 act;
>>>
>>> +
>>>  	if (vi->mergeable_rx_bufs) {
>>> -		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> -		xdp.data = data + hdr_padded_len;
>>> +		int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> +
>>> +		/* Allow consuming headroom but reserve enough space to push
>>> +		 * the descriptor on if we get an XDP_TX return code.
>>> +		 */
>>> +		xdp.data_hard_start = data - vi->headroom + desc_room;
>>> +		xdp.data = data + desc_room;
>>>  		xdp.data_end = xdp.data + (len - vi->hdr_len);
>>> -		buf = data;
>>>  	} else { /* small buffers */
>>>  		struct sk_buff *skb = data;
>>>
>>> -		xdp.data = skb->data;
>>> +		xdp.data_hard_start = skb->data;
>>> +		xdp.data = skb->data + vi->headroom;
>>>  		xdp.data_end = xdp.data + len;
>>> -		buf = skb->data;
>>>  	}
>>>
>>>  	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>>  	switch (act) {
>>>  	case XDP_PASS:
>>> +		if (!vi->mergeable_rx_bufs)
>>> +			__skb_pull((struct sk_buff *) data,
>>> +				   xdp.data - xdp.data_hard_start);
>>
>> Instead of doing things here and in virtnet_xdp_xmit(), how about
>> always making skb->data point to the buffer head like:
>>
>> 1) reserve headroom in add_recvbuf_small()
>> 2) skb_push(xdp->data - xdp->data_hard_start, skb) if we detect
>>    xdp->data was modified after bpf_prog_run_xdp()
>>
>> Then there's no special code in either XDP_PASS or XDP_TX?
>
> Alternatively, moving the pull into the receive_small XDP handler also
> removes the special case. I'll submit a patch shortly, let me know what
> you think.

I'm ok with this.

Thanks
Re: [net PATCH 1/2] virtio_net: cap mtu when XDP programs are running
On Jan 5, 2017 at 11:18, Michael S. Tsirkin wrote:
> On Wed, Jan 04, 2017 at 07:11:18PM -0800, John Fastabend wrote:
>> XDP programs can not consume multiple pages so we cap the MTU to avoid
>> this case. Virtio-net however only checks the MTU at XDP program load
>> and does not block MTU changes after the program has loaded.
>
> Do drivers really have to tweak max mtu all the time? Seems strange, I
> would say drivers just report device caps and net core enforces rules.
> Can't net core do these checks?

I think this needs host co-operation. At least this patch prevents the
user from misconfiguring the mtu in the guest.
Re: [net PATCH] net: virtio: cap mtu when XDP programs are running
On Jan 5, 2017 at 02:57, John Fastabend wrote:
[...]
>>>>> XDP programs can not consume multiple pages so we cap the MTU to
>>>>> avoid this case. Virtio-net however only checks the MTU at XDP
>>>>> program load and does not block MTU changes after the program has
>>>>> loaded.
>>>>>
>>>>> This patch sets/clears the max_mtu value at XDP load/unload time.
>>>>>
>>>>> Signed-off-by: John Fastabend
[...]
>>> OK so this logic is a bit too simple. When it resets the max_mtu I
>>> guess it needs to read the mtu via virtio_cread16(vdev, ...) or we
>>> may break the negotiated mtu.
>>
>> Yes, this is a problem (even using ETH_MAX_MTU). We may need a method
>> to notify the device about the mtu in this case, which is not
>> supported by virtio now.
>
> Note this is not really a XDP specific problem. The guest can change
> the MTU after init time even without XDP, which I assume should ideally
> result in a notification if the MTU is negotiated.

Yes. Michael, do you think we need to add some mechanism to notify the
host about MTU changes in this case?

Thanks
Re: [PATCH net-next V2 3/3] tun: rx batching
On Jan 3, 2017 at 21:33, Stefan Hajnoczi wrote:
> On Wed, Dec 28, 2016 at 04:09:31PM +0800, Jason Wang wrote:
>> +static int tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
>> +			  int more)
>> +{
>> +	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
>> +	struct sk_buff_head process_queue;
>> +	int qlen;
>> +	bool rcv = false;
>> +
>> +	spin_lock(&queue->lock);
>
> Should this be spin_lock_bh()? Below and in tun_get_user() there are
> explicit local_bh_disable() calls, so I guess BHs can interrupt us here
> and this would deadlock.

sk_write_queue is accessed only in this function, which runs under
process context, so there is no need for spin_lock_bh() here.
Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
On Jan 4, 2017 at 00:40, John Fastabend wrote:
> On 17-01-02 10:16 PM, Jason Wang wrote:
>> On Jan 3, 2017 at 06:43, John Fastabend wrote:
>>> On 16-12-23 06:37 AM, Jason Wang wrote:
>>>> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case
>>>> of small receive buffer untouched. This will confuse the user who
>>>> wants to set XDP but uses small buffers. Rather than forbid XDP in
>>>> small buffer mode, let's make it work. XDP can then only work at
>>>> skb->data since virtio-net creates skbs during refill; this is
>>>> suboptimal and could be optimized in the future.
>>>>
>>>> Cc: John Fastabend <john.r.fastab...@intel.com>
>>>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>>>
>>> Hi Jason,
>>>
>>> I was doing some more testing on this. What do you think about doing
>>> this so that free_unused_bufs() handles the buffer free with
>>> dev_kfree_skb() instead of put_page in small receive mode? Seems more
>>> correct to me.
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 783e842..27ff76c 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info *vi)
>>>
>>>  static bool is_xdp_queue(struct virtnet_info *vi, int q)
>>>  {
>>> +	/* For small receive mode always use kfree_skb variants */
>>> +	if (!vi->mergeable_rx_bufs)
>>> +		return false;
>>> +
>>>  	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
>>>  		return false;
>>>  	else if (q < vi->curr_queue_pairs)
>>>
>>> patch is untested, just spotted doing code review.
>>>
>>> Thanks,
>>> John
>>
>> We probably need a better name for this function.
>>
>> Acked-by: Jason Wang <jasow...@redhat.com>
>
> How about is_xdp_raw_buffer_queue()? I'll submit a proper patch today.

Sounds good, thanks.
Re: [net PATCH] net: virtio: cap mtu when XDP programs are running
On Jan 4, 2017 at 00:48, John Fastabend wrote:
> On 17-01-02 10:14 PM, Jason Wang wrote:
>> On Jan 3, 2017 at 06:30, John Fastabend wrote:
>>> XDP programs can not consume multiple pages so we cap the MTU to
>>> avoid this case. Virtio-net however only checks the MTU at XDP
>>> program load and does not block MTU changes after the program has
>>> loaded.
>>>
>>> This patch sets/clears the max_mtu value at XDP load/unload time.
>>>
>>> Signed-off-by: John Fastabend
>>> ---
>>>  drivers/net/virtio_net.c |    9 ++++++---
>>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 5deeda6..783e842 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
[...]
>>> +		dev->max_mtu = max_sz;
>>> +	} else {
>>> +		dev->max_mtu = ETH_MAX_MTU;
>>
>> Or use ETH_DATA_LEN here, considering we only allocate a size of
>> GOOD_PACKET_LEN for each small buffer?
>>
>> Thanks
>
> OK so this logic is a bit too simple. When it resets the max_mtu I
> guess it needs to read the mtu via virtio_cread16(vdev, ...) or we may
> break the negotiated mtu.

Yes, this is a problem (even using ETH_MAX_MTU). We may need a method to
notify the device about the mtu in this case, which is not supported by
virtio now.

> As for capping it at GOOD_PACKET_LEN, this has the nice benefit of
> avoiding any underestimates in EWMA predictions, because it appears min
> estimates are capped at GOOD_PACKET_LEN via get_mergeable_buf_len().

There seems to be some misunderstanding here: I meant only using
GOOD_PACKET_LEN for small buffers (which do not use EWMA).

Thanks

> Thanks,
> John
Re: [RFC PATCH] virtio_net: XDP support for adjust_head
On Jan 4, 2017 at 00:54, John Fastabend wrote:
>>> +/* Changing the headroom in buffers is a disruptive operation because
>>> + * existing buffers must be flushed and reallocated. This will happen
>>> + * when a xdp program is initially added or xdp is disabled by removing
>>> + * the xdp program.
>>> + */
>>
>> We probably need to reset the device here, but maybe Michael has more
>> ideas. And if we do this, another interesting thing to do is to
>> disable EWMA and always use a single page for each packet; this could
>> almost eliminate linearizing.
>
> Well with a normal MTU 1500 size we should not hit the linearizing case
> right?

My reply may not have been clear: by 1500 I mean for small buffers only.

Thanks

> The question is should we cap the MTU at GOOD_PACKET_LEN vs the
> current cap of (PAGE_SIZE - overhead).
Re: [RFC PATCH] virtio_net: XDP support for adjust_head
On Jan 4, 2017 at 00:57, John Fastabend wrote:
>>>> +/* Changing the headroom in buffers is a disruptive operation because
>>>> + * existing buffers must be flushed and reallocated. This will happen
>>>> + * when a xdp program is initially added or xdp is disabled by removing
>>>> + * the xdp program.
>>>> + */
>>>
>>> We probably need to reset the device here, but maybe Michael has more
>>> ideas. And if we do this, another interesting thing to do is to
>>> disable EWMA and always use a single page for each packet; this could
>>> almost eliminate linearizing.
>>
>> Well with a normal MTU 1500 size we should not hit the linearizing
>> case right? The question is should we cap the MTU at GOOD_PACKET_LEN
>> vs the current cap of (PAGE_SIZE - overhead).
>
> Sorry, responding to my own post with a bit more detail. I don't really
> like going to a page for each packet because we end up with double the
> pages in use for the "normal" 1500 MTU case. We could make the xdp
> allocation scheme smarter and allocate a page per packet when MTU is
> greater than 2k instead of using the EWMA, but I would push those types
> of things at net-next and live with the linearizing behavior for now,
> or capping the MTU.

Yes, agree.

Thanks
[PATCH net-next V3 0/3] vhost_net tx batching
Hi:

This series tries to implement tx batching support for vhost. This was
done by using MSG_MORE as a hint for the underlying socket. The backend
(e.g. tap) can then batch the packets temporarily in a list and submit
them all once the number of batched packets exceeds a limit.

Tests show obvious improvement on guest pktgen over mlx4 (noqueue) on
host:

               Mpps  -+%
rx_batched=0   0.90  +0%
rx_batched=4   0.97  +7.8%
rx_batched=8   0.97  +7.8%
rx_batched=16  0.98  +8.9%
rx_batched=32  1.03  +14.4%
rx_batched=48  1.09  +21.1%
rx_batched=64  1.02  +13.3%

Changes from V2:
- remove useless queue limitation check (and we don't drop any packet now)

Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceed max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c     | 50 ++++++++++++++++++++++++++++++++++++++------
 drivers/vhost/net.c   | 23 +++++++++++++++++---
 drivers/vhost/vhost.c |  8 ++++++--
 3 files changed, 70 insertions(+), 11 deletions(-)

--
2.7.4
[PATCH net-next V3 3/3] tun: rx batching
We can only process one packet at a time during sendmsg(). This often
leads to bad cache utilization under heavy load. So this patch tries to
do some batching during rx before submitting packets to the host network
stack. This is done by accepting MSG_MORE as a hint from the sendmsg()
caller: if it is set, batch the packet temporarily in a linked list and
submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4 (noqueue) on
host:

               Mpps  -+%
rx_batched=0   0.90  +0%
rx_batched=4   0.97  +7.8%
rx_batched=8   0.97  +7.8%
rx_batched=16  0.98  +8.9%
rx_batched=32  1.03  +14.4%
rx_batched=48  1.09  +21.1%
rx_batched=64  1.02  +13.3%

The maximum number of batched packets is specified through a module
parameter.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 50 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..a268ed9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
 #include

+static int rx_batched;
+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */

@@ -522,6 +526,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
 		kfree_skb(skb);

+	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }

@@ -1140,10 +1145,36 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
 	return skb;
 }

+static void tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
+			   int more)
+{
+	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+	struct sk_buff_head process_queue;
+	int qlen;
+	bool rcv = false;
+
+	spin_lock(&queue->lock);
+	qlen = skb_queue_len(queue);
+	__skb_queue_tail(queue, skb);
+	if (!more || qlen == rx_batched) {
+		__skb_queue_head_init(&process_queue);
+		skb_queue_splice_tail_init(queue, &process_queue);
+		rcv = true;
+	}
+	spin_unlock(&queue->lock);
+
+	if (rcv) {
+		local_bh_disable();
+		while ((skb = __skb_dequeue(&process_queue)))
+			netif_receive_skb(skb);
+		local_bh_enable();
+	}
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
-			    int noblock)
+			    int noblock, bool more)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
 	struct sk_buff *skb;
@@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_probe_transport_header(skb, 0);

 	rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-	local_bh_disable();
-	netif_receive_skb(skb);
-	local_bh_enable();
+	if (!rx_batched) {
+		local_bh_disable();
+		netif_receive_skb(skb);
+		local_bh_enable();
+	} else {
+		tun_rx_batched(tfile, skb, more);
+	}
 #else
 	netif_rx_ni(skb);
 #endif
@@ -1312,7 +1348,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!tun)
 		return -EBADFD;

-	result = tun_get_user(tun, tfile, NULL, from, file->f_flags & O_NONBLOCK);
+	result = tun_get_user(tun, tfile, NULL, from,
+			      file->f_flags & O_NONBLOCK, false);

 	tun_put(tun);
 	return result;
@@ -1570,7 +1607,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -EBADFD;

 	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-			   m->msg_flags & MSG_DONTWAIT,
+			   m->msg_flags & MSG_DONTWAIT,
+			   m->msg_flags & MSG_MORE);
 	tun_put(tun);
 	return ret;
 }
--
2.7.4
[PATCH net-next V3 1/3] vhost: better detection of available buffers
This patch tries several tweaks on vhost_vq_avail_empty() for better
performance:

- check the cached avail index first, which could avoid a userspace
  memory access.
- use unlikely() for the failure of userspace access.
- check vq->last_avail_idx against the cached avail index as the last
  step.

This patch is needed for batching support, which needs to peek at
whether or not there are still available buffers in the ring.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;

+	if (vq->avail_idx != vq->last_avail_idx)
+		return false;
+
 	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-	if (r)
+	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
--
2.7.4
[PATCH net-next V3 2/3] vhost_net: tx batching
This patch tries to utilize tuntap rx batching by peeking at the tx
virtqueue during transmission; if there are more available buffers in
the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g.
tuntap) to batch the packets.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 	return r;
 }

+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+
+	return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+	       == nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
 		/* If more outstanding DMAs, queue the work.
 		 * Handle upend_idx wrap around
 		 */
-		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
-			     % UIO_MAXIOV == nvq->done_idx))
+		if (unlikely(vhost_exceeds_maxpend(net)))
 			break;

 		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
 			msg.msg_control = NULL;
 			ubufs = NULL;
 		}
+
+		total_len += len;
+		if (total_len < VHOST_NET_WEIGHT &&
+		    !vhost_vq_avail_empty(&net->dev, vq) &&
+		    likely(!vhost_exceeds_maxpend(net))) {
+			msg.msg_flags |= MSG_MORE;
+		} else {
+			msg.msg_flags &= ~MSG_MORE;
+		}
+
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
-		total_len += len;
 		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
--
2.7.4
Re: [PATCH net-next V2 3/3] tun: rx batching
On 2016-12-30 00:35, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Wed, 28 Dec 2016 16:09:31 +0800

+	spin_lock(&queue->lock);
+	qlen = skb_queue_len(queue);
+	if (qlen > rx_batched)
+		goto drop;
+	__skb_queue_tail(queue, skb);
+	if (!more || qlen + 1 > rx_batched) {
+		__skb_queue_head_init(&process_queue);
+		skb_queue_splice_tail_init(queue, &process_queue);
+		rcv = true;
+	}
+	spin_unlock(&queue->lock);

Since you always clear the 'queue' when you insert the skb that hits the limit, I don't see how the "goto drop" path can possibly be taken.

True, will fix this.

Thanks
Re: [PATCH net-next V3 3/3] tun: rx batching
On 2017-01-01 01:31, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Fri, 30 Dec 2016 13:20:51 +0800

@@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_probe_transport_header(skb, 0);
 
 	rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-	local_bh_disable();
-	netif_receive_skb(skb);
-	local_bh_enable();
+	if (!rx_batched) {
+		local_bh_disable();
+		netif_receive_skb(skb);
+		local_bh_enable();
+	} else {
+		tun_rx_batched(tfile, skb, more);
+	}
 #else
 	netif_rx_ni(skb);
 #endif

If rx_batched has been set, and we are talking to clients not using this new MSG_MORE facility (or such clients don't have multiple TX packets to send to you, thus MSG_MORE is often clear), you are doing a lot more work per packet than the existing code. You take the queue lock, you test state, you splice into a local queue on the stack, then you walk that local stack queue to submit just one SKB to netif_receive_skb(). I think you want to streamline this sequence in such cases so that the cost before and after is similar, if not equivalent.

Yes, so I will do a skb_queue_empty() check if !MSG_MORE and call netif_receive_skb() immediately in this case. This avoids the wasted effort.

Thanks
Re: [PATCH net-next V3 3/3] tun: rx batching
On 2017-01-01 05:03, Stephen Hemminger wrote:

On Fri, 30 Dec 2016 13:20:51 +0800, Jason Wang <jasow...@redhat.com> wrote:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..a268ed9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
 #include
 
+static int rx_batched;
+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
 /* Uncomment to enable debugging */

I like the concept of rx batching. But controlling it via a module parameter is one of the worst API choices. Ethtool would be better to use, because that is how other network devices control batching. If you do ethtool, you could even extend it to have a number of packets and a max latency value.

Right, this is better (I believe you mean rx-frames). For rx-usecs, we could do it on top in the future.

Thanks
Re: [PATCH V4 net-next 3/3] tun: rx batching
On 2017-01-07 03:47, Michael S. Tsirkin wrote:

+static int tun_get_coalesce(struct net_device *dev,
+			    struct ethtool_coalesce *ec)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	ec->rx_max_coalesced_frames = tun->rx_batched;
+
+	return 0;
+}
+
+static int tun_set_coalesce(struct net_device *dev,
+			    struct ethtool_coalesce *ec)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	if (ec->rx_max_coalesced_frames > NAPI_POLL_WEIGHT)
+		return -EINVAL;

So what should userspace do? Keep trying until it succeeds? I think it's better to just use NAPI_POLL_WEIGHT instead and DTRT here.

Well, looking at how set_coalesce is implemented in other drivers, -EINVAL is usually used when the user gives a value that exceeds the limit. For tuntap, what is missing here is probably just documentation for coalescing in tuntap.txt (or extending ethtool to return the max value). This seems much better than silently reducing the value to the limit.

Thanks
Re: [PATCH V4 net-next 1/3] vhost: better detection of available buffers
On 2017-01-07 03:55, Michael S. Tsirkin wrote:

On Fri, Jan 06, 2017 at 10:13:15AM +0800, Jason Wang wrote:

This patch tries to do several tweaks on vhost_vq_avail_empty() for better performance:

- check the cached avail index first, which could avoid a userspace memory access.
- use unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of the cached avail index as the last step.

This patch is needed for the batching support, which needs to peek whether or not there are still available buffers in the ring.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
+	if (vq->avail_idx != vq->last_avail_idx)
+		return false;
+
 	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-	if (r)
+	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

So again, this did not address the issue I pointed out in v1: if we have 1 buffer in the RX queue and that is not enough to store the whole packet, vhost_vq_avail_empty returns false, then we re-read the descriptors again and again. You have saved a single index access but not the more expensive descriptor access.

It doesn't look so: if I understand the code correctly, in this case get_rx_bufs() will return zero, and we will try to enable the rx kick and exit the loop.

Thanks
[PATCH V4 net-next 3/3] tun: rx batching
We can only process one packet at a time during sendmsg(). This often leads to bad cache utilization under heavy load. So this patch tries to do some batching during rx before submitting packets to the host network stack. This is done by accepting MSG_MORE as a hint from the sendmsg() caller: if it is set, batch the packet temporarily in a linked list and submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

                              Mpps   +/-%
rx-frames = 0                 0.91   +0%
rx-frames = 4                 1.00   +9.8%
rx-frames = 8                 1.00   +9.8%
rx-frames = 16                1.01   +10.9%
rx-frames = 32                1.07   +17.5%
rx-frames = 48                1.07   +17.5%
rx-frames = 64                1.08   +18.6%
rx-frames = 64 (no MSG_MORE)  0.91   +0%

Users are allowed to change the number of per-device batched packets through ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as the upper limit to prevent bh from being disabled for too long.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..6c93926 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -218,6 +218,7 @@ struct tun_struct {
 	struct list_head disabled;
 	void *security;
 	u32 flow_count;
+	u32 rx_batched;
 	struct tun_pcpu_stats __percpu *pcpu_stats;
 };
 
@@ -522,6 +523,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
 		kfree_skb(skb);
 
+	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1142,45 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
 	return skb;
 }
 
+static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
+			   struct sk_buff *skb, int more)
+{
+	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+	struct sk_buff_head process_queue;
+	u32 rx_batched = tun->rx_batched;
+	bool rcv = false;
+
+	if (!rx_batched || (!more && skb_queue_empty(queue))) {
+		local_bh_disable();
+		netif_receive_skb(skb);
+		local_bh_enable();
+		return;
+	}
+
+	spin_lock(&queue->lock);
+	if (!more || skb_queue_len(queue) == rx_batched) {
+		__skb_queue_head_init(&process_queue);
+		skb_queue_splice_tail_init(queue, &process_queue);
+		rcv = true;
+	} else {
+		__skb_queue_tail(queue, skb);
+	}
+	spin_unlock(&queue->lock);
+
+	if (rcv) {
+		struct sk_buff *nskb;
+
+		local_bh_disable();
+		while ((nskb = __skb_dequeue(&process_queue)))
+			netif_receive_skb(nskb);
+		netif_receive_skb(skb);
+		local_bh_enable();
+	}
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
-			    int noblock)
+			    int noblock, bool more)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
 	struct sk_buff *skb;
@@ -1283,10 +1320,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_probe_transport_header(skb, 0);
 
 	rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-	local_bh_disable();
-	netif_receive_skb(skb);
-	local_bh_enable();
+	tun_rx_batched(tun, tfile, skb, more);
 #else
 	netif_rx_ni(skb);
 #endif
@@ -1312,7 +1348,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!tun)
 		return -EBADFD;
 
-	result = tun_get_user(tun, tfile, NULL, from, file->f_flags & O_NONBLOCK);
+	result = tun_get_user(tun, tfile, NULL, from,
+			      file->f_flags & O_NONBLOCK, false);
 
 	tun_put(tun);
 	return result;
@@ -1570,7 +1607,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -EBADFD;
 
 	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-			   m->msg_flags & MSG_DONTWAIT);
+			   m->msg_flags & MSG_DONTWAIT,
+			   m->msg_flags & MSG_MORE);
 	tun_put(tun);
 	return ret;
 }
@@ -1771,6 +1809,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun->align = NET_SKB_PAD;
 		tun->filter_attached = false;
 		tun->sndbuf = tfile->socket.sk->sk_sndbuf;
+
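The decision tun_rx_batched() makes — queue while the sender signals MSG_MORE, flush when the hint is absent or the batch limit is reached — can be illustrated with a small user-space sketch. This is only an illustration of the control flow; the function name and flat-argument interface below are assumptions for the sketch, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of tun_rx_batched()'s flush decision: rx_batched == 0 disables
 * batching entirely (every packet is delivered at once), otherwise the
 * queue is flushed to the stack when MSG_MORE is absent or when the
 * queue already holds rx_batched packets.
 */
static bool should_flush(unsigned int queue_len, bool more,
			 unsigned int rx_batched)
{
	if (!rx_batched)
		return true;	/* batching disabled: deliver immediately */
	return !more || queue_len == rx_batched;
}
```

With rx-frames set to 4, three packets carrying MSG_MORE are queued, and either a fourth queued packet or a packet without the hint triggers the flush.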
[PATCH V4 net-next 1/3] vhost: better detection of available buffers
This patch tries to do several tweaks on vhost_vq_avail_empty() for better performance:

- check the cached avail index first, which could avoid a userspace memory access.
- use unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of the cached avail index as the last step.

This patch is needed for the batching support, which needs to peek whether or not there are still available buffers in the ring.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
+	if (vq->avail_idx != vq->last_avail_idx)
+		return false;
+
 	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-	if (r)
+	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
-- 
2.7.4
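The check-cache-first logic above can be sketched in user space. This is a minimal model, not the kernel code: the struct and the `user_avail_idx` field stand in for the guest-memory read that `vhost_get_user()` performs, and endianness conversion and the failure path are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vq_sim {
	uint16_t avail_idx;	/* cached copy of the guest's avail index */
	uint16_t last_avail_idx;	/* next index vhost will consume */
	uint16_t user_avail_idx;	/* stand-in for the guest-memory avail->idx */
};

/*
 * Sketch of the patched vhost_vq_avail_empty(): consult the cached index
 * first and touch (simulated) guest memory only when the cache says the
 * ring looks empty; the freshly read value is cached for the next call.
 */
static bool vq_avail_empty(struct vq_sim *vq)
{
	if (vq->avail_idx != vq->last_avail_idx)
		return false;	/* cache hit: no guest-memory access needed */
	vq->avail_idx = vq->user_avail_idx;
	return vq->avail_idx == vq->last_avail_idx;
}
```

The point of the reordering is visible in the first case below: when the cached index already differs from last_avail_idx, the (expensive) guest-memory read is skipped entirely.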
[PATCH V4 net-next 0/3] vhost_net tx batching
Hi:

This series tries to implement tx batching support for vhost. This was done by using MSG_MORE as a hint for the underlying socket. The backend (e.g. tap) can then batch the packets temporarily in a list and submit them all once the number of batched packets exceeds a limit.

Tests show an obvious improvement with guest pktgen over mlx4(noqueue) on the host:

                              Mpps   +/-%
rx-frames = 0                 0.91   +0%
rx-frames = 4                 1.00   +9.8%
rx-frames = 8                 1.00   +9.8%
rx-frames = 16                1.01   +10.9%
rx-frames = 32                1.07   +17.5%
rx-frames = 48                1.07   +17.5%
rx-frames = 64                1.08   +18.6%
rx-frames = 64 (no MSG_MORE)  0.91   +0%

Changes from V3:
- use ethtool instead of a module parameter to control the maximum number of batched packets
- avoid overhead when MSG_MORE is not set and no packet is queued

Changes from V2:
- remove useless queue limitation check (and we don't drop any packet now)

Changes from V1:
- drop the NAPI handler since we don't use NAPI now
- fix the issues that may exceed the max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c     | 76 ++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/vhost/net.c   | 23 ++++++++++++---
 drivers/vhost/vhost.c |  8 ++++--
 3 files changed, 96 insertions(+), 11 deletions(-)

-- 
2.7.4
[PATCH V4 net-next 2/3] vhost_net: tx batching
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission; if there are more available buffers in the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap) to batch the packets.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 +++++++++++++++++++----
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 	return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+
+	return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+		== nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
 		/* If more outstanding DMAs, queue the work.
 		 * Handle upend_idx wrap around
 		 */
-		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
-			      % UIO_MAXIOV == nvq->done_idx))
+		if (unlikely(vhost_exceeds_maxpend(net)))
 			break;
 
 		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
 			msg.msg_control = NULL;
 			ubufs = NULL;
 		}
+
+		total_len += len;
+		if (total_len < VHOST_NET_WEIGHT &&
+		    !vhost_vq_avail_empty(&net->dev, vq) &&
+		    likely(!vhost_exceeds_maxpend(net))) {
+			msg.msg_flags |= MSG_MORE;
+		} else {
+			msg.msg_flags &= ~MSG_MORE;
+		}
+
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
-		total_len += len;
 		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
-- 
2.7.4
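The wrap-around comparison that vhost_exceeds_maxpend() factors out can be illustrated with a small user-space sketch. This only models the modular arithmetic under assumed constants — SKETCH_MAXIOV and SKETCH_MAX_PEND mirror UIO_MAXIOV and VHOST_MAX_PEND, but the values and function name here are illustrative, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for UIO_MAXIOV and VHOST_MAX_PEND. */
#define SKETCH_MAXIOV   1024u
#define SKETCH_MAX_PEND  128u

/*
 * Sketch of vhost_exceeds_maxpend(): upend_idx (producer) and done_idx
 * (consumer) index a circular array of SKETCH_MAXIOV zerocopy slots.
 * This mirrors the kernel's "too many outstanding DMAs" comparison,
 * which handle_tx() uses to stop queueing more zerocopy buffers.
 */
static bool exceeds_maxpend(unsigned int upend_idx, unsigned int done_idx,
			    unsigned int vq_num)
{
	return (upend_idx + vq_num - SKETCH_MAX_PEND) % SKETCH_MAXIOV
		== done_idx;
}
```

For a 256-entry ring with done_idx at 0, the check fires exactly when upend_idx has advanced to 896 = (0 - 128 + 256) mod 1024, i.e. when the pending region plus another ring's worth of descriptors would wrap onto the completion index.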
[PATCH net 9/9] virtio-net: XDP support for small buffers
Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of small receive buffers untouched. This will confuse users who want to set XDP but use small buffers. Rather than forbidding XDP in small buffer mode, let's make it work. XDP can then only work at skb->data, since virtio-net creates skbs during refill; this is suboptimal and could be optimized in the future.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 112 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 87 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e53365a8..5deeda6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -333,9 +333,9 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 static void virtnet_xdp_xmit(struct virtnet_info *vi,
 			     struct receive_queue *rq,
 			     struct send_queue *sq,
-			     struct xdp_buff *xdp)
+			     struct xdp_buff *xdp,
+			     void *data)
 {
-	struct page *page = virt_to_head_page(xdp->data);
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
 	unsigned int num_sg, len;
 	void *xdp_sent;
@@ -343,20 +343,45 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		struct page *sent_page = virt_to_head_page(xdp_sent);
-		put_page(sent_page);
+		if (vi->mergeable_rx_bufs) {
+			struct page *sent_page = virt_to_head_page(xdp_sent);
+
+			put_page(sent_page);
+		} else { /* small buffer */
+			struct sk_buff *skb = xdp_sent;
+
+			kfree_skb(skb);
+		}
 	}
 
-	/* Zero header and leave csum up to XDP layers */
-	hdr = xdp->data;
-	memset(hdr, 0, vi->hdr_len);
+	if (vi->mergeable_rx_bufs) {
+		/* Zero header and leave csum up to XDP layers */
+		hdr = xdp->data;
+		memset(hdr, 0, vi->hdr_len);
+
+		num_sg = 1;
+		sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	} else { /* small buffer */
+		struct sk_buff *skb = data;
 
-	num_sg = 1;
-	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+		/* Zero header and leave csum up to XDP layers */
+		hdr = skb_vnet_hdr(skb);
+		memset(hdr, 0, vi->hdr_len);
+
+		num_sg = 2;
+		sg_init_table(sq->sg, 2);
+		sg_set_buf(sq->sg, hdr, vi->hdr_len);
+		skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+	}
 	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
-				   xdp->data, GFP_ATOMIC);
+				   data, GFP_ATOMIC);
 	if (unlikely(err)) {
-		put_page(page);
+		if (vi->mergeable_rx_bufs) {
+			struct page *page = virt_to_head_page(xdp->data);
+
+			put_page(page);
+		} else /* small buffer */
+			kfree_skb(data);
 		return; // On error abort to avoid unnecessary kick
 	}
 
@@ -366,23 +391,26 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 static u32 do_xdp_prog(struct virtnet_info *vi,
 		       struct receive_queue *rq,
 		       struct bpf_prog *xdp_prog,
-		       struct page *page, int offset, int len)
+		       void *data, int len)
 {
 	int hdr_padded_len;
 	struct xdp_buff xdp;
+	void *buf;
 	unsigned int qp;
 	u32 act;
-	u8 *buf;
-
-	buf = page_address(page) + offset;
 
-	if (vi->mergeable_rx_bufs)
+	if (vi->mergeable_rx_bufs) {
 		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-	else
-		hdr_padded_len = sizeof(struct padded_vnet_hdr);
+		xdp.data = data + hdr_padded_len;
+		xdp.data_end = xdp.data + (len - vi->hdr_len);
+		buf = data;
+	} else { /* small buffers */
+		struct sk_buff *skb = data;
 
-	xdp.data = buf + hdr_padded_len;
-	xdp.data_end = xdp.data + (len - vi->hdr_len);
+		xdp.data = skb->data;
+		xdp.data_end = xdp.data + len;
+		buf = skb->data;
+	}
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
@@ -392,8 +420,8 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
 		qp = vi->curr_queue_pairs - vi
[PATCH net 5/9] virtio-net: unbreak csumed packets for XDP_PASS
We drop csumed packets when doing XDP. This breaks XDP_PASS when GUEST_CSUM is supported. Fix this by allowing the csum flag to be set. With this patch, simple TCP works with XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 470293e..0778dc8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -440,7 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
 	struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
 	u32 act;
 
-	if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+	if (unlikely(hdr->hdr.gso_type))
 		goto err_xdp;
 	act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
 	switch (act) {
@@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	 * the receive path after XDP is loaded. In practice I
 	 * was not able to create this condition.
 	 */
-	if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+	if (unlikely(hdr->hdr.gso_type))
 		goto err_xdp;
 
 	act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
-- 
2.7.4
[PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is supported
When VIRTIO_NET_F_GUEST_UFO is negotiated, the host could still send a UFO packet that exceeds a single page, which cannot be handled correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is supported. While at it, also forbid XDP for ECN (which comes only from GRO) to prevent users from misconfiguration.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 77ae358..c1f66d8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	int i, err;
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
-	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
 		netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
 		return -EOPNOTSUPP;
 	}
-- 
2.7.4
[PATCH net 8/9] virtio-net: remove big packet XDP code
Now that we in fact don't allow XDP for big packets, remove its code.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 44 +++-----------------------------------------
 1 file changed, 3 insertions(+), 41 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c1f66d8..e53365a8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -344,11 +344,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
 		struct page *sent_page = virt_to_head_page(xdp_sent);
-
-		if (vi->mergeable_rx_bufs)
-			put_page(sent_page);
-		else
-			give_pages(rq, sent_page);
+		put_page(sent_page);
 	}
 
 	/* Zero header and leave csum up to XDP layers */
@@ -360,15 +356,8 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
 				   xdp->data, GFP_ATOMIC);
 	if (unlikely(err)) {
-		if (vi->mergeable_rx_bufs)
-			put_page(page);
-		else
-			give_pages(rq, page);
+		put_page(page);
 		return; // On error abort to avoid unnecessary kick
-	} else if (!vi->mergeable_rx_bufs) {
-		/* If not mergeable bufs must be big packets so cleanup pages */
-		give_pages(rq, (struct page *)page->private);
-		page->private = 0;
 	}
 
 	virtqueue_kick(sq->vq);
@@ -430,44 +419,17 @@ static struct sk_buff *receive_big(struct net_device *dev,
 				   void *buf,
 				   unsigned int len)
 {
-	struct bpf_prog *xdp_prog;
 	struct page *page = buf;
-	struct sk_buff *skb;
-
-	rcu_read_lock();
-	xdp_prog = rcu_dereference(rq->xdp_prog);
-	if (xdp_prog) {
-		struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
-		u32 act;
-
-		if (unlikely(hdr->hdr.gso_type))
-			goto err_xdp;
-		act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
-		switch (act) {
-		case XDP_PASS:
-			break;
-		case XDP_TX:
-			rcu_read_unlock();
-			goto xdp_xmit;
-		case XDP_DROP:
-		default:
-			goto err_xdp;
-		}
-	}
-	rcu_read_unlock();
+	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
 
-	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
 	if (unlikely(!skb))
 		goto err;
 
 	return skb;
 
-err_xdp:
-	rcu_read_unlock();
 err:
 	dev->stats.rx_dropped++;
 	give_pages(rq, page);
-xdp_xmit:
 	return NULL;
 }
-- 
2.7.4
[PATCH net 1/9] virtio-net: remove the warning before XDP linearizing
We use EWMA to estimate the size of the rx buffer. When the rx buffer size is underestimated, it is usual to have a packet that spans more than one buffer. Considering this is not a bug, remove the warning and correct the comment before XDP linearizing.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 08327e0..1067253 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -552,14 +552,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		struct page *xdp_page;
 		u32 act;
 
-		/* No known backend devices should send packets with
-		 * more than a single buffer when XDP conditions are
-		 * met. However it is not strictly illegal so the case
-		 * is handled as an exception and a warning is thrown.
-		 */
+		/* This happens when rx buffer size is underestimated */
 		if (unlikely(num_buf > 1)) {
-			bpf_warn_invalid_xdp_buffer();
-
 			/* linearize data for XDP */
 			xdp_page = xdp_linearize_page(rq, num_buf,
 						      page, offset, &len);
-- 
2.7.4
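Why an underestimated EWMA leads to multi-buffer packets can be seen from the update rule itself. The sketch below is an illustrative integer moving average in the spirit of the `linux/average.h` helpers the driver uses — the weight and function name are assumptions for the sketch, not the driver's actual parameters (the kernel implementation uses shift-based fixed-point arithmetic).

```c
#include <assert.h>

/*
 * Illustrative integer EWMA update: avg <- avg + (sample - avg) / weight.
 * A large weight makes the estimator adapt slowly, so after a burst of
 * large packets the estimated buffer size lags far behind the real packet
 * size -- exactly the "underestimated rx buffer" case the commit message
 * describes, where one packet then spans several posted buffers.
 */
static unsigned int ewma_update(unsigned int avg, unsigned int sample,
				unsigned int weight)
{
	return avg + (unsigned int)(((int)sample - (int)avg) / (int)weight);
}
```

One step with weight 64 moves a 100-byte estimate only 23 bytes toward a 1600-byte sample, so several packets in a row can exceed the posted buffer size before the estimate catches up.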
[PATCH net 6/9] virtio-net: make rx buf size estimation work for XDP
We don't update the ewma rx buffer size in the case of XDP. This will lead to underestimation of the rx buffer size, which causes the host to produce more than one buffer. This will greatly increase the possibility of XDP page linearization.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0778dc8..77ae358 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -584,10 +584,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 				put_page(page);
 				head_skb = page_to_skb(vi, rq, xdp_page,
 						       0, len, PAGE_SIZE);
+				ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
 				return head_skb;
 			}
 			break;
 		case XDP_TX:
+			ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
 			if (unlikely(xdp_page != page))
 				goto err_xdp;
 			rcu_read_unlock();
@@ -596,6 +598,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		default:
 			if (unlikely(xdp_page != page))
 				__free_pages(xdp_page, 0);
+			ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
 			goto err_xdp;
 		}
 	}
-- 
2.7.4
[PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing
We don't put the page during linearizing; this would cause a page leak when xmitting through XDP_TX or when the packet exceeds PAGE_SIZE. Fix this by putting the page accordingly. Also decrement the number of buffers through the pointer during linearizing to make sure the caller can free the buffers correctly when the packet exceeds PAGE_SIZE. With this patch, we won't get OOM after linearizing a huge number of packets.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fe4562d..58ad40e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -483,7 +483,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
  * anymore.
  */
 static struct page *xdp_linearize_page(struct receive_queue *rq,
-				       u16 num_buf,
+				       u16 *num_buf,
 				       struct page *p,
 				       int offset,
 				       unsigned int *len)
@@ -497,7 +497,7 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 	memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
 	page_off += *len;
 
-	while (--num_buf) {
+	while (--*num_buf) {
 		unsigned int buflen;
 		unsigned long ctx;
 		void *buf;
@@ -507,19 +507,22 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 		if (unlikely(!ctx))
 			goto err_buf;
 
+		buf = mergeable_ctx_to_buf_address(ctx);
+		p = virt_to_head_page(buf);
+		off = buf - page_address(p);
+
 		/* guard against a misconfigured or uncooperative backend that
 		 * is sending packet larger than the MTU.
 		 */
-		if ((page_off + buflen) > PAGE_SIZE)
+		if ((page_off + buflen) > PAGE_SIZE) {
+			put_page(p);
 			goto err_buf;
-
-		buf = mergeable_ctx_to_buf_address(ctx);
-		p = virt_to_head_page(buf);
-		off = buf - page_address(p);
+		}
 
 		memcpy(page_address(page) + page_off,
 		       page_address(p) + off, buflen);
 		page_off += buflen;
+		put_page(p);
 	}
 
 	*len = page_off;
@@ -555,7 +558,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		/* This happens when rx buffer size is underestimated */
 		if (unlikely(num_buf > 1)) {
 			/* linearize data for XDP */
-			xdp_page = xdp_linearize_page(rq, num_buf,
+			xdp_page = xdp_linearize_page(rq, &num_buf,
 						      page, offset, &len);
 			if (!xdp_page)
 				goto err_xdp;
-- 
2.7.4
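The accounting this patch fixes — release every fetched buffer, and report back through the pointer how many buffers the caller still owns on overflow — can be modeled in a small user-space sketch. Names and the flat-array interface below are illustrative assumptions, not the kernel API; `puts` counts the releases that stand in for put_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/*
 * Sketch of the fixed linearization accounting: each source buffer is
 * copied into a single page and released ("put") as soon as it is
 * consumed; *num_buf is decremented through the pointer so that on
 * overflow the caller knows how many buffers remain its responsibility.
 */
static bool linearize(char *page, size_t *page_off,
		      const char *const bufs[], const size_t lens[],
		      unsigned int *num_buf, unsigned int *puts)
{
	unsigned int i;

	for (i = 0; *num_buf; i++) {
		/* guard against a backend sending more than one page */
		if (*page_off + lens[i] > SKETCH_PAGE_SIZE) {
			(*puts)++;	/* release the buffer just fetched */
			return false;
		}
		memcpy(page + *page_off, bufs[i], lens[i]);
		*page_off += lens[i];
		(*puts)++;		/* release after copying (the leak fix) */
		(*num_buf)--;
	}
	return true;
}
```

In the overflow case below, both fetched buffers are released (puts == 2) while one buffer is still reported as unconsumed (num_buf == 1), which is what lets the caller clean up correctly.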
[PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets
When XDP_PASS is determined for linearized packets, we try to get new buffers from the virtqueue and build skbs from them. This is wrong; we should create skbs based on the existing buffers instead. Fix this by creating the skb based on xdp_page. With this patch, "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 58ad40e..470293e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
 		switch (act) {
 		case XDP_PASS:
-			if (unlikely(xdp_page != page))
-				__free_pages(xdp_page, 0);
+			/* We can only create skb based on xdp_page. */
+			if (unlikely(xdp_page != page)) {
+				rcu_read_unlock();
+				put_page(page);
+				head_skb = page_to_skb(vi, rq, xdp_page,
+						       0, len, PAGE_SIZE);
+				return head_skb;
+			}
 			break;
 		case XDP_TX:
 			if (unlikely(xdp_page != page))
-- 
2.7.4
[PATCH net 2/9] virtio-net: correctly xmit linearized page on XDP_TX
After we linearize the page, we should xmit this page instead of the page of the first buffer, which may lead to unexpected results. With this patch, we can see correct packets during XDP_TX.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1067253..fe4562d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
 		goto err_xdp;
 
-	act = do_xdp_prog(vi, rq, xdp_prog, page, offset, len);
+	act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
 	switch (act) {
 	case XDP_PASS:
 		if (unlikely(xdp_page != page))
-- 
2.7.4
[PATCH net 0/9] several fixups for virtio-net XDP
Merry Xmas and a Happy New Year to all:

This series tries to fix several issues for virtio-net XDP, which could be categorized into several parts:

- fix several issues during XDP linearizing
- allow csumed packets to work for XDP_PASS
- make EWMA rxbuf size estimation work for XDP
- forbid XDP when GUEST_UFO is supported
- remove big packet XDP support
- add XDP support for small buffers

Please see individual patches for details.

Thanks

Jason Wang (9):
  virtio-net: remove the warning before XDP linearizing
  virtio-net: correctly xmit linearized page on XDP_TX
  virtio-net: fix page miscount during XDP linearizing
  virtio-net: correctly handle XDP_PASS for linearized packets
  virtio-net: unbreak csumed packets for XDP_PASS
  virtio-net: make rx buf size estimation work for XDP
  virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is supported
  virtio-net: remove big packet XDP code
  virtio-net: XDP support for small buffers

 drivers/net/virtio_net.c | 172 ++++++++++++++++++++++++++---------------------
 1 file changed, 102 insertions(+), 70 deletions(-)

-- 
2.7.4
[PATCH net-next V2 1/3] vhost: better detection of available buffers
This patch tries to do several tweaks on vhost_vq_avail_empty() for better performance:

- check the cached avail index first, which could avoid a userspace memory access.
- use unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of the cached avail index as the last step.

This patch is needed for the batching support, which needs to peek whether or not there are still available buffers in the ring.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
+	if (vq->avail_idx != vq->last_avail_idx)
+		return false;
+
 	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-	if (r)
+	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
-- 
2.7.4
[PATCH net-next V2 3/3] tun: rx batching
We can only process one packet at a time during sendmsg(). This often leads to bad cache utilization under heavy load. So this patch does some batching during rx before submitting packets to the host network stack. This is done by accepting MSG_MORE as a hint from the sendmsg() caller: if it is set, batch the packet temporarily in a linked list and submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

               Mpps  -+%
rx_batched=0   0.90  +0%
rx_batched=4   0.97  +7.8%
rx_batched=8   0.97  +7.8%
rx_batched=16  0.98  +8.9%
rx_batched=32  1.03  +14.4%
rx_batched=48  1.09  +21.1%
rx_batched=64  1.02  +13.3%

The maximum number of batched packets is specified through a module parameter.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 66 ---
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..6ea5d6d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
 #include
 
+static int rx_batched;
+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */
 
@@ -522,6 +526,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
 		kfree_skb(skb);
 
+	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1145,44 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
 	return skb;
 }
 
+static int tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
+			  int more)
+{
+	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+	struct sk_buff_head process_queue;
+	int qlen;
+	bool rcv = false;
+
+	spin_lock(&queue->lock);
+	qlen = skb_queue_len(queue);
+	if (qlen > rx_batched)
+		goto drop;
+	__skb_queue_tail(queue, skb);
+	if (!more || qlen + 1 > rx_batched) {
+		__skb_queue_head_init(&process_queue);
+		skb_queue_splice_tail_init(queue, &process_queue);
+		rcv = true;
+	}
+	spin_unlock(&queue->lock);
+
+	if (rcv) {
+		local_bh_disable();
+		while ((skb = __skb_dequeue(&process_queue)))
+			netif_receive_skb(skb);
+		local_bh_enable();
+	}
+
+	return 0;
+drop:
+	spin_unlock(&queue->lock);
+	kfree_skb(skb);
+	return -EFAULT;
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
-			    int noblock)
+			    int noblock, bool more)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
 	struct sk_buff *skb;
@@ -1283,18 +1322,27 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_probe_transport_header(skb, 0);
 
 	rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-	local_bh_disable();
-	netif_receive_skb(skb);
-	local_bh_enable();
+	if (!rx_batched) {
+		local_bh_disable();
+		netif_receive_skb(skb);
+		local_bh_enable();
+	} else {
+		err = tun_rx_batched(tfile, skb, more);
+	}
 #else
 	netif_rx_ni(skb);
 #endif
 
 	stats = get_cpu_ptr(tun->pcpu_stats);
 	u64_stats_update_begin(&stats->syncp);
-	stats->rx_packets++;
-	stats->rx_bytes += len;
+	if (err) {
+		stats->rx_dropped++;
+	} else {
+		stats->rx_packets++;
+		stats->rx_bytes += len;
+	}
 	u64_stats_update_end(&stats->syncp);
 	put_cpu_ptr(stats);
 
@@ -1312,7 +1360,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!tun)
 		return -EBADFD;
 
-	result = tun_get_user(tun, tfile, NULL, from, file->f_flags & O_NONBLOCK);
+	result = tun_get_user(tun, tfile, NULL, from,
+			      file->f_flags & O_NONBLOCK, false);
 
 	tun_put(tun);
 	return result;
@@ -1570,7 +1619,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -EBADFD;
 
 	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-			   m->msg_flags & MSG_DONTWAIT);
+			   m->msg_flags & MSG_DONTWAIT,
+			   m->msg_flags & MSG_MORE);
 	tun_put(tun);
 	return ret;
 }
-- 
2.7.4
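The tun_rx_batched() flow (stage under a lock, flush when the MSG_MORE hint clears or the batch limit is hit, drop when the staging queue is full) can be modelled in plain userspace C. The names, the fixed batch limit and the callback standing in for netif_receive_skb() are all illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace sketch of the tun_rx_batched() idea: packets are staged in a
 * fixed array and flushed to a "receive" callback either when the sender
 * clears the MSG_MORE hint or when the batch limit is reached. */
#define RX_BATCH 4

struct rx_batcher {
    void *queue[RX_BATCH];
    int qlen;
    void (*receive)(void *pkt);  /* stands in for netif_receive_skb() */
};

/* Returns 0 on success, -1 when the queue is full (packet dropped),
 * mirroring the driver's goto drop path. */
static int rx_batched(struct rx_batcher *b, void *pkt, bool more)
{
    if (b->qlen >= RX_BATCH)
        return -1;
    b->queue[b->qlen++] = pkt;
    if (!more || b->qlen == RX_BATCH) {
        for (int i = 0; i < b->qlen; i++)
            b->receive(b->queue[i]);
        b->qlen = 0;
    }
    return 0;
}
```

This is why the last sendmsg() of a burst matters: it arrives without MSG_MORE and forces the flush, so packets are never stranded in the staging queue waiting for a timer.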
[PATCH net-next V2 0/3] vhost net tx batching
Hi:

This series tries to implement tx batching support for vhost. This is done by using MSG_MORE as a hint for the underlying socket. The backend (e.g. tap) can then batch the packets temporarily in a list and submit them all once the number of batched packets exceeds a limit.

Tests show an obvious improvement for guest pktgen over mlx4(noqueue) on host:

               Mpps  -+%
rx_batched=0   0.90  +0%
rx_batched=4   0.97  +7.8%
rx_batched=8   0.97  +7.8%
rx_batched=16  0.98  +8.9%
rx_batched=32  1.03  +14.4%
rx_batched=48  1.09  +21.1%
rx_batched=64  1.02  +13.3%

Changes from V1:
- drop the NAPI handler since we don't use NAPI now
- fix the issue that may exceed the max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c     | 66 ---
 drivers/vhost/net.c   | 23 +++---
 drivers/vhost/vhost.c |  8 +--
 3 files changed, 84 insertions(+), 13 deletions(-)

-- 
2.7.4
[PATCH net-next V2 2/3] vhost_net: tx batching
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission: if there are more available buffers in the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap) to batch the packets.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 	return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+
+	return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+		== nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
 		/* If more outstanding DMAs, queue the work.
 		 * Handle upend_idx wrap around
 		 */
-		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
-			     % UIO_MAXIOV == nvq->done_idx))
+		if (unlikely(vhost_exceeds_maxpend(net)))
 			break;
 
 		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
 			msg.msg_control = NULL;
 			ubufs = NULL;
 		}
+
+		total_len += len;
+		if (total_len < VHOST_NET_WEIGHT &&
+		    !vhost_vq_avail_empty(&net->dev, vq) &&
+		    likely(!vhost_exceeds_maxpend(net))) {
+			msg.msg_flags |= MSG_MORE;
+		} else {
+			msg.msg_flags &= ~MSG_MORE;
+		}
+
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
-		total_len += len;
 		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
-- 
2.7.4
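The decision the patch adds before each sendmsg() boils down to a three-way predicate: keep MSG_MORE set only while the byte quota is not exhausted, the virtqueue still has buffers, and zerocopy is not about to hit its pending limit. A minimal sketch, with NET_WEIGHT standing in for the kernel's VHOST_NET_WEIGHT:

```c
#include <stdbool.h>

/* Illustrative model of the MSG_MORE hint computed in handle_tx(): the
 * threshold and parameter names are stand-ins, not the kernel's. */
#define NET_WEIGHT 0x80000L

static bool want_msg_more(long total_len, bool vq_empty, bool exceeds_maxpend)
{
    return total_len < NET_WEIGHT && !vq_empty && !exceeds_maxpend;
}
```

Because the hint is recomputed for every packet, the last buffer of a burst (when the virtqueue runs empty) naturally goes out without MSG_MORE, which is exactly what triggers the backend's batch flush.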
Re: [PATCH net 1/9] virtio-net: remove the warning before XDP linearizing
On 2016年12月24日 03:31, Daniel Borkmann wrote:

Hi Jason,

On 12/23/2016 03:37 PM, Jason Wang wrote:

Since we use EWMA to estimate the size of the rx buffer, it is usual to have a packet spanning more than one buffer when the rx buffer size is underestimated. Considering this is not a bug, remove the warning and correct the comment before XDP linearizing.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 08327e0..1067253 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -552,14 +552,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		struct page *xdp_page;
 		u32 act;
 
-		/* No known backend devices should send packets with
-		 * more than a single buffer when XDP conditions are
-		 * met. However it is not strictly illegal so the case
-		 * is handled as an exception and a warning is thrown.
-		 */
+		/* This happens when rx buffer size is underestimated */
 		if (unlikely(num_buf > 1)) {
-			bpf_warn_invalid_xdp_buffer();

Could you also remove bpf_warn_invalid_xdp_buffer(), which got added just for this? Thanks.

Posted.

Thanks
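The EWMA underestimation discussed here is easy to see in isolation: virtio-net sizes rx buffers from an exponentially weighted moving average of recent packet lengths, so right after a run of small packets a large frame gets split across several buffers (num_buf > 1). A one-line model of the estimator, with an illustrative weight (the driver's actual weight and fixed-point representation differ):

```c
/* Sketch of an EWMA packet-length estimator: `weight` in (0,1) is the
 * fraction of the old average kept at each step. */
static double ewma_update(double avg, double sample, double weight)
{
    return avg * weight + sample * (1.0 - weight);
}
```

With weight close to 1 the average reacts slowly, which is precisely why a sudden large packet is "underestimated" and lands in multiple buffers rather than triggering a warning-worthy condition.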
[PATCH net] net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
After commit 73b62bd085f4737679ea9afc7867fa5f99ba7d1b ("virtio-net: remove the warning before XDP linearizing"), there's no users for bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert for commit f23bc46c30ca5ef58b8549434899fcbac41b2cfc. Cc: Daniel Borkmann <dan...@iogearbox.net> Cc: John Fastabend <john.r.fastab...@intel.com> Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/filter.h | 1 - net/core/filter.c | 6 -- 2 files changed, 7 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 7023142..a0934e6 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -610,7 +610,6 @@ bool bpf_helper_changes_pkt_data(void *func); struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len); void bpf_warn_invalid_xdp_action(u32 act); -void bpf_warn_invalid_xdp_buffer(void); #ifdef CONFIG_BPF_JIT extern int bpf_jit_enable; diff --git a/net/core/filter.c b/net/core/filter.c index 7190bd6..b146170 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2972,12 +2972,6 @@ void bpf_warn_invalid_xdp_action(u32 act) } EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action); -void bpf_warn_invalid_xdp_buffer(void) -{ - WARN_ONCE(1, "Illegal XDP buffer encountered, expect throughput degradation\n"); -} -EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer); - static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg, int src_reg, int ctx_off, struct bpf_insn *insn_buf, -- 2.7.4
Re: [PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing
On 2016年12月23日 23:54, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

We don't put the page during linearizing; this would cause a leak when we xmit through XDP_TX or when the packet exceeds PAGE_SIZE. Fix this by putting the page accordingly. Also decrease the number of buffers during linearizing to make sure the caller can free buffers correctly when a packet exceeds PAGE_SIZE. With this patch, we won't get OOM after linearizing a huge number of packets.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---

Thanks! Looks good. By the way, do you happen to have any actual configuration where this path is hit? I obviously didn't test this very long, other than a quick test with my hacked vhost driver.

Acked-by: John Fastabend <john.r.fastab...@intel.com>

Yes, I have. Just increase the MTU above 1500 for both virtio and tap and produce some traffic with a size that will lead to underestimation of the rx buffer.

Thanks
Re: [PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets
On 2016年12月23日 23:57, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

When XDP_PASS is determined for linearized packets, we try to get new buffers in the virtqueue and build skbs from them. This is wrong; we should create skbs based on the existing buffers instead. Fix this by creating the skb based on xdp_page. With this patch, "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 58ad40e..470293e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
 		switch (act) {
 		case XDP_PASS:
-			if (unlikely(xdp_page != page))
-				__free_pages(xdp_page, 0);
+			/* We can only create skb based on xdp_page. */
+			if (unlikely(xdp_page != page)) {
+				rcu_read_unlock();
+				put_page(page);
+				head_skb = page_to_skb(vi, rq, xdp_page,
+						       0, len, PAGE_SIZE);
+				return head_skb;
+			}
 			break;
 		case XDP_TX:
 			if (unlikely(xdp_page != page))

Great, thanks. This was likely working before because of the memory leak fixed in 3/9.

It was not; without this and 3/9, the code would try to get buffers and build an skb for a new packet instead of the existing buffers.

Thanks

Acked-by: John Fastabend <john.r.fastab...@intel.com>
Re: [PATCH net 0/9] several fixups for virtio-net XDP
On 2016年12月24日 01:10, John Fastabend wrote: On 16-12-23 06:37 AM, Jason Wang wrote: Merry Xmas and a Happy New year to all: This series tries to fixes several issues for virtio-net XDP which could be categorized into several parts: - fix several issues during XDP linearizing - allow csumed packet to work for XDP_PASS - make EWMA rxbuf size estimation works for XDP - forbid XDP when GUEST_UFO is support - remove big packet XDP support - add XDP support or small buffer Please see individual patches for details. Thanks Jason Wang (9): virtio-net: remove the warning before XDP linearizing virtio-net: correctly xmit linearized page on XDP_TX virtio-net: fix page miscount during XDP linearizing virtio-net: correctly handle XDP_PASS for linearized packets virtio-net: unbreak csumed packets for XDP_PASS virtio-net: make rx buf size estimation works for XDP virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support virtio-net: remove big packet XDP codes virtio-net: XDP support for small buffers drivers/net/virtio_net.c | 172 --- 1 file changed, 102 insertions(+), 70 deletions(-) Thanks a lot Jason. The last piece that is needed is support to complete XDP support is to get the adjust_head part correct. I'll send out a patch in a bit but will need to merge it on top of this set. .John Yes, glad to see the your patch. Thanks.
Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
On 2016年12月24日 00:10, John Fastabend wrote: On 16-12-23 08:02 AM, John Fastabend wrote: On 16-12-23 06:37 AM, Jason Wang wrote: When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO packet that exceeds a single page which could not be handled correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is supported. While at it, forbid XDP for ECN (which comes only from GRO) too to prevent user from misconfiguration. Is sending packets greater than single page though normal in this case? Yes, when NETIF_F_UFO was enabled for tap, it won't segment UFO packet and will send it directly to guest. (This could be reproduced with UDP_STREAM between two guests or host to guest). Thanks I don't have any need to support big packet mode other than MST asked for it. And I wasn't seeing this in my tests. MTU is capped at 4k - hdr when XDP is enabled. .John Cc: John Fastabend <john.r.fastab...@intel.com> Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/virtio_net.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 77ae358..c1f66d8 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog) int i, err; if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) || - virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) { + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) || + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) || + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) { netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n"); return -EOPNOTSUPP; } Acked-by: John Fastabend <john.r.fastab...@intel.com>
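The guard the patch adds is a plain feature-bit test: refuse XDP while any negotiated guest offload can deliver frames larger than a single page (TSO4/TSO6/ECN/UFO). A userspace sketch of that check; the bit numbers are the virtio-net feature bits, while the function names are illustrative:

```c
#include <stdbool.h>

/* virtio-net guest offload feature bits (per the virtio specification). */
enum {
    F_GUEST_TSO4 = 7,
    F_GUEST_TSO6 = 8,
    F_GUEST_ECN  = 9,
    F_GUEST_UFO  = 10,
};

static bool has_feature(unsigned long long features, int bit)
{
    return (features & (1ULL << bit)) != 0;
}

/* Mirrors the virtnet_xdp_set() guard: XDP is only allowed when no
 * multi-buffer-capable guest offload was negotiated. */
static bool xdp_allowed(unsigned long long features)
{
    return !(has_feature(features, F_GUEST_TSO4) ||
             has_feature(features, F_GUEST_TSO6) ||
             has_feature(features, F_GUEST_ECN)  ||
             has_feature(features, F_GUEST_UFO));
}
```

The UFO case is the subtle one discussed above: tap with NETIF_F_UFO forwards an unsegmented UDP frame straight to the guest, so even a page-capped MTU does not bound the receive size.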
Re: [net PATCH v4 6/6] virtio_net: XDP support for adjust_head
On 2017年01月16日 08:01, John Fastabend wrote: Add support for XDP adjust head by allocating a 256B header region that XDP programs can grow into. This is only enabled when a XDP program is loaded. In order to ensure that we do not have to unwind queue headroom push queue setup below bpf_prog_add. It reads better to do a prog ref unwind vs another queue setup call. At the moment this code must do a full reset to ensure old buffers without headroom on program add or with headroom on program removal are not used incorrectly in the datapath. Ideally we would only have to disable/enable the RX queues being updated but there is no API to do this at the moment in virtio so use the big hammer. In practice it is likely not that big of a problem as this will only happen when XDP is enabled/disabled changing programs does not require the reset. There is some risk that the driver may either have an allocation failure or for some reason fail to correctly negotiate with the underlying backend in this case the driver will be left uninitialized. I have not seen this ever happen on my test systems and for what its worth this same failure case can occur from probe and other contexts in virtio framework. Signed-off-by: John Fastabend--- drivers/net/virtio_net.c | 110 ++ 1 file changed, 82 insertions(+), 28 deletions(-) [...] - vi->xdp_queue_pairs = xdp_qp; netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp); for (i = 0; i < vi->max_queue_pairs; i++) { @@ -1746,6 +1789,21 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog) } return 0; + +virtio_reset_err: + /* On reset error do our best to unwind XDP changes inflight and return +* error up to user space for resolution. The underlying PCI reset hung +* on us so not much we can do here. It should work with other transport, so let's remove "PCI" here. 
+*/ + dev_warn(>dev, "XDP reset failure and queues unstable\n"); + vi->xdp_queue_pairs = oxdp_qp; +virtio_queue_err: + /* On queue set error we can unwind bpf ref count and user space can +* retry this is most likely an allocation failure. +*/ + if (prog) + bpf_prog_sub(prog, vi->max_queue_pairs - 1); + return err; } static bool virtnet_xdp_query(struct net_device *dev) @@ -2373,7 +2431,6 @@ static void virtnet_remove(struct virtio_device *vdev) free_netdev(vi->dev); } -#ifdef CONFIG_PM_SLEEP static int virtnet_freeze(struct virtio_device *vdev) { struct virtnet_info *vi = vdev->priv; @@ -2430,7 +2487,6 @@ static int virtnet_restore(struct virtio_device *vdev) return 0; } -#endif If you do want to use virtio_device_reset(), it's better to squash this into patch 5/6. Other looks good. Thanks
Re: [net PATCH v4 0/6] virtio_net XDP fixes and adjust_header support
On 2017年01月16日 07:59, John Fastabend wrote: This has a fix to handle small buffer free logic correctly and then also adds adjust head support. I pushed adjust head at net (even though its rc3) to avoid having to push another exception case into virtio_net to catch if the program uses adjust_head and then block it. If there are any strong objections to this we can push it at net-next and use a patch from Jakub to add the exception handling but then user space has to deal with it either via try/fail logic or via kernel version checks. Granted we already have some cases that need to be configured to enable XDP but I don't see any reason to have yet another one when we can fix it now vs delaying a kernel version. v2: fix spelling error, convert unsigned -> unsigned int v3: v2 git crashed during send so retrying sorry for the noise v4: changed layout of rtnl_lock fixes (Stephen) moved reset logic into virtio core with new patch (MST) fixed up linearize and some code cleanup (Jason) Otherwise did some generic code cleanup so might be a bit cleaner this time at least that is the hope. Thanks everyone for the v3 review. Thanks, looks good to me overall, just few nits.
Re: [net PATCH v4 5/6] virtio: add pci_down/pci_up configuration
On 2017年01月16日 08:01, John Fastabend wrote:

In virtio_net we need to do a full reset of the device to support queue reconfiguration, and we can also trigger this via ethtool commands. So instead of open coding this in the net driver, push it into generic code in virtio. This also avoids exporting a handful of internal virtio routines.

Looks like this is not pci-specific stuff. And there are some drivers left (e.g. scsi and block). In fact, I'm not sure touching other drivers is really needed. Maybe we can just:

- move virtio_device_freeze(), virtio_device_restore() and .freeze/.restore in virtio_driver out of CONFIG_PM_SLEEP
- move virtnet_freeze() and virtnet_restore() out of CONFIG_PM_SLEEP
- introduce virtio_net_reset() and call virtio_device_freeze()/virtio_device_restore() there

Another possible issue for sleep/hibernation is that the xdp_prog is not restored; if this is not intended, we'd better fix it.

Thanks

[...]
Re: [PATCH net-next 1/8] ptr_ring: introduce batch dequeuing
On 2017年03月22日 21:43, Michael S. Tsirkin wrote: On Tue, Mar 21, 2017 at 12:04:40PM +0800, Jason Wang wrote: Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/ptr_ring.h | 65 1 file changed, 65 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6c70444..4771ded 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r) return ptr; } +static inline int __ptr_ring_consume_batched(struct ptr_ring *r, +void **array, int n) +{ + void *ptr; + int i = 0; + + while (i < n) { + ptr = __ptr_ring_consume(r); + if (!ptr) + break; + array[i++] = ptr; + } + + return i; +} + /* * Note: resize (below) nests producer lock within consumer lock, so if you * call this in interrupt or BH context, you must disable interrupts/BH when This ignores the comment above that function: /* Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). */ Yes, __ptr_ring_swap_queue() ignores this too. Also - it looks like it shouldn't matter if reads are reordered but I wonder. Thoughts? Including some reasoning about it in commit log would be nice. Yes, I think it doesn't matter in this case, it matters only for batched producing. 
Thanks @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r) return ptr; } +static inline int ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock(>consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock(>consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_irq(>consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irq(>consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r, + void **array, int n) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(>consumer_lock, flags); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irqrestore(>consumer_lock, flags); + + return ret; +} + +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_bh(>consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_bh(>consumer_lock); + + return ret; +} + /* Cast to structure type and call a function without discarding from FIFO. * Function must return a value. * Callers must take consumer_lock. -- 2.7.4
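Stripped of the kernel's locking variants, __ptr_ring_consume_batched() is just "take single entries until either n pointers were gathered or the ring runs dry, and report how many you got". A userspace model with a plain NULL-marked circular buffer standing in for the kernel's ptr_ring (all names here are illustrative):

```c
#include <stddef.h>

/* Minimal model of a ptr_ring consumer: a slot holds NULL when free. */
struct ring_model {
    void **slots;
    int size;
    int consumer;
};

static void *ring_consume(struct ring_model *r)
{
    void *ptr = r->slots[r->consumer];
    if (ptr) {
        r->slots[r->consumer] = NULL;             /* mark the slot free */
        r->consumer = (r->consumer + 1) % r->size;
    }
    return ptr;
}

/* Model of __ptr_ring_consume_batched(): dequeue up to n pointers into
 * array and return how many were actually taken. */
static int ring_consume_batched(struct ring_model *r, void **array, int n)
{
    void *ptr;
    int i;

    for (i = 0; i < n; i++) {
        ptr = ring_consume(r);
        if (!ptr)
            break;
        array[i] = ptr;
    }
    return i;
}
```

The win over n single dequeues is not in this loop itself but in the callers shown above: the spin lock is taken once per batch instead of once per pointer.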
Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array
On 2017年03月22日 22:16, Michael S. Tsirkin wrote: On Tue, Mar 21, 2017 at 12:04:46PM +0800, Jason Wang wrote: We used to dequeue one skb during recvmsg() from skb_array, this could be inefficient because of the bad cache utilization and spinlock touching for each packet. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and pass the skb back through msg_control for underlayer socket to finish the userspace copying. Tests were done by XDP1: - small buffer: Before: 1.88Mpps After : 2.25Mpps (+19.6%) - mergeable buffer: Before: 1.83Mpps After : 2.10Mpps (+14.7%) Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 64 + 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b51989..53f09f2 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -28,6 +28,8 @@ #include #include #include +#include +#include #include @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref { struct vhost_virtqueue *vq; }; +#define VHOST_RX_BATCH 64 struct vhost_net_virtqueue { struct vhost_virtqueue vq; size_t vhost_hlen; @@ -99,6 +102,10 @@ struct vhost_net_virtqueue { /* Reference counting for outstanding ubufs. * Protected by vq mutex. Writers must also take device mutex. */ struct vhost_net_ubuf_ref *ubufs; + struct skb_array *rx_array; + void *rxq[VHOST_RX_BATCH]; + int rt; + int rh; }; struct vhost_net { @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n) n->vqs[i].ubufs = NULL; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } } @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net) mutex_unlock(>mutex); } -static int peek_head_len(struct sock *sk) +static int peek_head_len_batched(struct vhost_net_virtqueue *rvq) Pls rename to say what it actually does: fetch skbs Ok. 
+{ + if (rvq->rh != rvq->rt) + goto out; + + rvq->rh = rvq->rt = 0; + rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq, + VHOST_RX_BATCH); A comment explaining why is is -bh would be helpful. Ok. Thanks
[PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
We used to dequeue one skb during recvmsg() from skb_array, this could be inefficient because of the bad cache utilization and spinlock touching for each packet. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and pass the skb back through msg_control for underlayer socket to finish the userspace copying. Tests were done by XDP1: - small buffer: Before: 1.88Mpps After : 2.25Mpps (+19.6%) - mergeable buffer: Before: 1.83Mpps After : 2.10Mpps (+14.7%) Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 64 + 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b51989..ffa78c6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -28,6 +28,8 @@ #include #include #include +#include +#include #include @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref { struct vhost_virtqueue *vq; }; +#define VHOST_RX_BATCH 64 struct vhost_net_virtqueue { struct vhost_virtqueue vq; size_t vhost_hlen; @@ -99,6 +102,10 @@ struct vhost_net_virtqueue { /* Reference counting for outstanding ubufs. * Protected by vq mutex. Writers must also take device mutex. 
*/ struct vhost_net_ubuf_ref *ubufs; + struct skb_array *rx_array; + void *rxq[VHOST_RX_BATCH]; + int rt; + int rh; }; struct vhost_net { @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n) n->vqs[i].ubufs = NULL; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } } @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net) mutex_unlock(>mutex); } -static int peek_head_len(struct sock *sk) +static int fetch_skbs(struct vhost_net_virtqueue *rvq) +{ + if (rvq->rh != rvq->rt) + goto out; + + rvq->rh = rvq->rt = 0; + rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq, + VHOST_RX_BATCH); + if (!rvq->rt) + return 0; +out: + return __skb_array_len_with_tag(rvq->rxq[rvq->rh]); +} + +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; struct sk_buff *head; int len = 0; unsigned long flags; + if (rvq->rx_array) + return fetch_skbs(rvq); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk) return skb_queue_empty(>sk_receive_queue); } -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) +static int vhost_net_rx_peek_head_len(struct vhost_net *net, + struct sock *sk) { + struct vhost_net_virtqueue *rvq = >vqs[VHOST_NET_VQ_RX]; struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX]; struct vhost_virtqueue *vq = >vq; unsigned long uninitialized_var(endtime); - int len = peek_head_len(sk); + int len = peek_head_len(rvq, sk); if (!len && vq->busyloop_timeout) { /* Both tx vq and rx socket were polled here */ @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) vhost_poll_queue(>poll); mutex_unlock(>mutex); - len = peek_head_len(sk); + len = peek_head_len(rvq, sk); } return len; @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until the next kick. 
*/ if (unlikely(headcount < 0)) goto out; + if (nvq->rx_array) + msg.msg_control = nvq->rxq[nvq->rh++]; /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(_iter, READ, vq->iov, 1, 1); @@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f) n->vqs[i].done_idx = 0; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); @@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n, struct vhost_virtqueue *vq) { struct socket *sock; + struct vhost_net_virtqueue *nvq = + container_of(vq, struct vhost_net_virtqueue, vq); mutex_lock(>mutex);
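The rt/rh bookkeeping in the patch above is a small consumer-side cache: batch-dequeue into a local array only when it is empty, then serve peeks and pops from the array until it drains. A userspace sketch, with `batch_fill` standing in for skb_array_consume_batched() and all names illustrative:

```c
#include <stddef.h>

#define BATCH 4

/* rt = number of cached entries, rh = index of the next one to hand out,
 * mirroring the rt/rh fields added to vhost_net_virtqueue. */
struct rx_cache {
    void *pkts[BATCH];
    int rt, rh;
    int (*batch_fill)(void **arr, int n, void *ctx);
    void *ctx;
};

/* Return the next packet without consuming it, or NULL if none;
 * models fetch_skbs() feeding peek_head_len(). */
static void *cache_peek(struct rx_cache *c)
{
    if (c->rh == c->rt) {            /* cache empty: refill in one go */
        c->rh = 0;
        c->rt = c->batch_fill(c->pkts, BATCH, c->ctx);
        if (!c->rt)
            return NULL;
    }
    return c->pkts[c->rh];
}

/* Consume the next packet, as handle_rx() does via msg_control/rh++. */
static void *cache_pop(struct rx_cache *c)
{
    void *p = cache_peek(c);
    if (p)
        c->rh++;
    return p;
}
```

The reported +14% to +20% Mpps comes from exactly this shape: one locked batch dequeue amortized over up to BATCH packets, instead of one lock round-trip per recvmsg().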
[PATCH V2 net-next 6/7] tap: support receiving skb from msg_control
This patch makes tap_recvmsg() able to receive an skb from its caller through msg_control. vhost_net will be the first user.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
 	DEFINE_WAIT(wait);
-	struct sk_buff *skb;
 	ssize_t ret = 0;
 
 	if (!iov_iter_count(to))
 		return 0;
 
+	if (skb)
+		goto done;
+
 	while (1) {
 		if (!noblock)
 			prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 	if (!noblock)
 		finish_wait(sk_sleep(&q->sk), &wait);
 
+done:
 	if (skb) {
 		ret = tap_put_user(q, skb, to);
 		if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct tap_queue *q = file->private_data;
 	ssize_t len = iov_iter_count(to), ret;
 
-	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4
[PATCH V2 net-next 3/7] tun: export skb_array
This patch exports skb_array through tun_get_skb_array(). Caller can then manipulate skb array directly. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tun.c | 13 + include/linux/if_tun.h | 5 + 2 files changed, 18 insertions(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index c418f0a..70dd9ec 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file) } EXPORT_SYMBOL_GPL(tun_get_socket); +struct skb_array *tun_get_skb_array(struct file *file) +{ + struct tun_file *tfile; + + if (file->f_op != _fops) + return ERR_PTR(-EINVAL); + tfile = file->private_data; + if (!tfile) + return ERR_PTR(-EBADFD); + return >tx_array; +} +EXPORT_SYMBOL_GPL(tun_get_skb_array); + module_init(tun_init); module_exit(tun_cleanup); MODULE_DESCRIPTION(DRV_DESCRIPTION); diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index ed6da2e..bf9bdf4 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -19,6 +19,7 @@ #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) struct socket *tun_get_socket(struct file *); +struct skb_array *tun_get_skb_array(struct file *file); #else #include #include @@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f) { return ERR_PTR(-EINVAL); } +static inline struct skb_array *tun_get_skb_array(struct file *f) +{ + return ERR_PTR(-EINVAL); +} #endif /* CONFIG_TUN */ #endif /* __IF_TUN_H */ -- 2.7.4
[PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
This patch introduces a batched version of consuming: the consumer can dequeue more than one pointer from the ring at a time. We don't care about read reordering here, so no compiler barrier is needed. Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/ptr_ring.h | 65 1 file changed, 65 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6c70444..2be0f350 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r) return ptr; } +static inline int __ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) +{ + void *ptr; + int i; + + for (i = 0; i < n; i++) { + ptr = __ptr_ring_consume(r); + if (!ptr) + break; + array[i] = ptr; + } + + return i; +} + /* * Note: resize (below) nests producer lock within consumer lock, so if you * call this in interrupt or BH context, you must disable interrupts/BH when @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r) return ptr; } +static inline int ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock(&r->consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_irq(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irq(&r->consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r, + void **array, int n) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&r->consumer_lock, flags); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irqrestore(&r->consumer_lock, flags); + + return ret; +} + +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_bh(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_bh(&r->consumer_lock); + + return ret; +} + /* Cast to structure type and call a function without discarding from FIFO. * Function must return a value. * Callers must take consumer_lock. -- 2.7.4
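The batching helper's contract — dequeue up to n pointers, stop at the first empty slot, return the count actually dequeued — can be exercised outside the kernel. The sketch below models it on a toy NULL-terminated ring; `struct toy_ring` and both function names are invented for this demo, and only the batching loop mirrors __ptr_ring_consume_batched() above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-consumer pointer ring, illustrative only: like ptr_ring, a
 * NULL slot at the consumer index means the ring is empty. */
struct toy_ring {
    void **queue;
    int size;
    int consumer; /* next slot to read */
};

/* Single-element consume: return the slot's pointer and clear it, or
 * NULL when the ring is empty. */
static void *toy_consume(struct toy_ring *r)
{
    void *ptr = r->queue[r->consumer];

    if (ptr) {
        r->queue[r->consumer] = NULL;
        if (++r->consumer >= r->size)
            r->consumer = 0;
    }
    return ptr;
}

/* Batched consume, mirroring __ptr_ring_consume_batched(): call the
 * single-element helper up to n times, stop at the first NULL, and
 * return how many pointers were dequeued into array[]. */
static int toy_consume_batched(struct toy_ring *r, void **array, int n)
{
    void *ptr;
    int i;

    for (i = 0; i < n; i++) {
        ptr = toy_consume(r);
        if (!ptr)
            break;
        array[i] = ptr;
    }
    return i;
}
```

Note the return value is a count, not an errno: an empty ring simply yields 0, which is why vhost's caller can treat "no packets" and "batch exhausted" uniformly.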
[PATCH V2 net-next 2/7] skb_array: introduce batch dequeuing
Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/skb_array.h | 25 + 1 file changed, 25 insertions(+) diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h index f4dfade..90e44b9 100644 --- a/include/linux/skb_array.h +++ b/include/linux/skb_array.h @@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct skb_array *a) return ptr_ring_consume(&a->ring); } +static inline int skb_array_consume_batched(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched(&a->ring, array, n); +} + static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a) { return ptr_ring_consume_irq(&a->ring); } +static inline int skb_array_consume_batched_irq(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_irq(&a->ring, array, n); +} + static inline struct sk_buff *skb_array_consume_any(struct skb_array *a) { return ptr_ring_consume_any(&a->ring); } +static inline int skb_array_consume_batched_any(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_any(&a->ring, array, n); +} + + static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a) { return ptr_ring_consume_bh(&a->ring); } +static inline int skb_array_consume_batched_bh(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_bh(&a->ring, array, n); +} + static inline int __skb_array_len_with_tag(struct sk_buff *skb) { if (likely(skb)) { -- 2.7.4
[PATCH V2 net-next 0/7] vhost-net rx batching
Hi all: This series tries to implement rx batching for vhost-net. This is done by batching the dequeuing from the skb_array which was exported by the underlying socket, and passing the skb back through msg_control to finish the userspace copying. Tests show at most a 19% improvement on rx pps. Please review. Thanks Changes from V1: - switch to use for() in __ptr_ring_consume_batched() - rename peek_head_len_batched() to fetch_skbs() - use skb_array_consume_batched() instead of skb_array_consume_batched_bh() since no consumer runs in bh - drop the lockless peeking patch since skb_array could be resized, so it's not safe to call the lockless one Jason Wang (7): ptr_ring: introduce batch dequeuing skb_array: introduce batch dequeuing tun: export skb_array tap: export skb_array tun: support receiving skb through msg_control tap: support receiving skb from msg_control vhost_net: try batch dequing from skb array drivers/net/tap.c | 25 +++--- drivers/net/tun.c | 31 -- drivers/vhost/net.c | 64 +++--- include/linux/if_tap.h| 5 include/linux/if_tun.h| 5 include/linux/ptr_ring.h | 65 +++ include/linux/skb_array.h | 25 ++ 7 files changed, 204 insertions(+), 16 deletions(-) -- 2.7.4
[PATCH V2 net-next 5/7] tun: support receiving skb through msg_control
This patch allows tun_recvmsg() to receive an skb from its caller through msg_control. Vhost_net will be the first user. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tun.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 70dd9ec..a82bced 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock, static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, struct iov_iter *to, - int noblock) + int noblock, struct sk_buff *skb) { - struct sk_buff *skb; ssize_t ret; int err; @@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, if (!iov_iter_count(to)) return 0; - /* Read frames from ring */ - skb = tun_ring_recv(tfile, noblock, &err); - if (!skb) - return err; + if (!skb) { + /* Read frames from ring */ + skb = tun_ring_recv(tfile, noblock, &err); + if (!skb) + return err; + } ret = tun_put_user(tun, tfile, skb, to); if (unlikely(ret < 0)) @@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) if (!tun) return -EBADFD; - ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK); + ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL); ret = min_t(ssize_t, ret, len); if (ret > 0) iocb->ki_pos = ret; @@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, SOL_PACKET, TUN_TX_TIMESTAMP); goto out; } - ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT); + ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, + m->msg_control); if (ret > (ssize_t)total_len) { m->msg_flags |= MSG_TRUNC; ret = flags & MSG_TRUNC ? ret : total_len; -- 2.7.4
[PATCH V2 net-next 4/7] tap: export skb_array
This patch exports skb_array through tap_get_skb_array(). The caller can then manipulate the skb array directly. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tap.c | 13 + include/linux/if_tap.h | 5 + 2 files changed, 18 insertions(+) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index 4d4173d..abdaf86 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file) } EXPORT_SYMBOL_GPL(tap_get_socket); +struct skb_array *tap_get_skb_array(struct file *file) +{ + struct tap_queue *q; + + if (file->f_op != &tap_fops) + return ERR_PTR(-EINVAL); + q = file->private_data; + if (!q) + return ERR_PTR(-EBADFD); + return &q->skb_array; +} +EXPORT_SYMBOL_GPL(tap_get_skb_array); + int tap_queue_resize(struct tap_dev *tap) { struct net_device *dev = tap->dev; diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h index 3482c3c..4837157 100644 --- a/include/linux/if_tap.h +++ b/include/linux/if_tap.h @@ -3,6 +3,7 @@ #if IS_ENABLED(CONFIG_TAP) struct socket *tap_get_socket(struct file *); +struct skb_array *tap_get_skb_array(struct file *file); #else #include <linux/err.h> #include <linux/errno.h> @@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f) { return ERR_PTR(-EINVAL); } +static inline struct skb_array *tap_get_skb_array(struct file *f) +{ + return ERR_PTR(-EINVAL); +} #endif /* CONFIG_TAP */ #include -- 2.7.4
Re: [PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling
On 2017年03月29日 20:07, Michael S. Tsirkin wrote: On Tue, Mar 21, 2017 at 12:04:47PM +0800, Jason Wang wrote: For the socket that exports its skb array, we can use lockless polling to avoid touching spinlock during busy polling. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 53f09f2..41153a3 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) return len; } -static int sk_has_rx_data(struct sock *sk) +static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; + if (rvq->rx_array) + return !__skb_array_empty(rvq->rx_array); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); I don't see which patch adds __skb_array_empty. This is not something new, it was introduced by ad69f35d1dc0a ("skb_array: array based FIFO for skbs"). Thanks @@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, endtime = busy_clock() + vq->busyloop_timeout; while (vhost_can_busy_poll(>dev, endtime) && - !sk_has_rx_data(sk) && + !sk_has_rx_data(rvq, sk) && vhost_vq_avail_empty(>dev, vq)) cpu_relax(); -- 2.7.4
Re: [PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling
On 2017年03月30日 10:33, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 10:16:15AM +0800, Jason Wang wrote: On 2017年03月29日 20:07, Michael S. Tsirkin wrote: On Tue, Mar 21, 2017 at 12:04:47PM +0800, Jason Wang wrote: For the socket that exports its skb array, we can use lockless polling to avoid touching spinlock during busy polling. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 53f09f2..41153a3 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) return len; } -static int sk_has_rx_data(struct sock *sk) +static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; + if (rvq->rx_array) + return !__skb_array_empty(rvq->rx_array); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); I don't see which patch adds __skb_array_empty. This is not something new, it was introduced by ad69f35d1dc0a ("skb_array: array based FIFO for skbs"). Thanks Same comment about a compiler barrier applies then. Ok, rethink about this, since skb_array could be resized, using lockless version seems wrong. For the comment of using compiler barrier, caller (vhost_net_rx_peek_head_len) uses cpu_relax(). But I haven't figured out why a compiler barrier is needed here. Could you please explain? Thanks @@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, endtime = busy_clock() + vq->busyloop_timeout; while (vhost_can_busy_poll(>dev, endtime) && - !sk_has_rx_data(sk) && + !sk_has_rx_data(rvq, sk) && vhost_vq_avail_empty(>dev, vq)) cpu_relax(); -- 2.7.4
Re: [PATCH] virtio_net: fix mergeable bufs error handling
On 2017年03月29日 20:37, Michael S. Tsirkin wrote: On xdp error we try to free head_skb without having initialized it, that's clearly bogus. Fixes: f600b6905015 ("virtio_net: Add XDP support") Cc: John Fastabend Signed-off-by: Michael S. Tsirkin --- drivers/net/virtio_net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 11773d6..e0fb3707 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -570,7 +570,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, u16 num_buf; struct page *page; int offset; - struct sk_buff *head_skb, *curr_skb; + struct sk_buff *head_skb = NULL, *curr_skb; struct bpf_prog *xdp_prog; unsigned int truesize; My tree (net.git HEAD is 8f1f7eeb22c16a197159cf7b35d1350695193ead) has: head_skb = NULL; just after the above code. Thanks
Re: [PATCH] virtio_net: enable big packets for large MTU values
On 2017年03月29日 20:38, Michael S. Tsirkin wrote: If one enables e.g. jumbo frames without mergeable buffers, packets won't fit in the 1500 byte buffers we use. Switch to big packet mode instead. TODO: make sizing more exact, possibly extend small packet mode to use larger pages. Signed-off-by: Michael S. Tsirkin --- drivers/net/virtio_net.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index e0fb3707..9dc31dc 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2428,6 +2428,10 @@ static int virtnet_probe(struct virtio_device *vdev) dev->mtu = mtu; dev->max_mtu = mtu; } + + /* TODO: size buffers correctly in this case. */ + if (dev->mtu > ETH_DATA_LEN) + vi->big_packets = true; } if (vi->any_header_sg) Ok, I think maybe we should fail the probe in this case instead, and decrease max_mtu to ETH_DATA_LEN when small buffers are used (since users are allowed to increase the MTU). Thanks
Re: [PATCH v2] virtio_net: fix support for small rings
On 2017年03月30日 01:42, Michael S. Tsirkin wrote: When ring size is small (<32 entries) making buffers smaller means a full ring might not be able to hold enough buffers to fit a single large packet. Make sure a ring full of buffers is large enough to allow at least one packet of max size. Fixes: 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag allocators") Signed-off-by: Michael S. Tsirkin--- changes from v1: typo fix drivers/net/virtio_net.c | 30 ++ 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 9dc31dc..e5f6e34 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -29,6 +29,7 @@ #include #include #include +#include This seems unnecessary. static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -98,6 +99,9 @@ struct receive_queue { /* RX: fragments + linear part + virtio header */ struct scatterlist sg[MAX_SKB_FRAGS + 2]; + /* Min single buffer size for mergeable buffers case. 
*/ + unsigned int min_buf_len; + /* Name of this receive queue: input.$index */ char name[40]; }; @@ -894,13 +898,14 @@ static int add_recvbuf_big(struct virtnet_info *vi, struct receive_queue *rq, return err; } -static unsigned int get_mergeable_buf_len(struct ewma_pkt_len *avg_pkt_len) +static unsigned int get_mergeable_buf_len(struct receive_queue *rq, + struct ewma_pkt_len *avg_pkt_len) { const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf); unsigned int len; len = hdr_len + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len), - GOOD_PACKET_LEN, PAGE_SIZE - hdr_len); + rq->min_buf_len - hdr_len, PAGE_SIZE - hdr_len); return ALIGN(len, L1_CACHE_BYTES); } @@ -914,7 +919,7 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, int err; unsigned int len, hole; - len = get_mergeable_buf_len(>mrg_avg_pkt_len); + len = get_mergeable_buf_len(rq, >mrg_avg_pkt_len); if (unlikely(!skb_page_frag_refill(len + headroom, alloc_frag, gfp))) return -ENOMEM; @@ -2086,6 +2091,21 @@ static void virtnet_del_vqs(struct virtnet_info *vi) virtnet_free_queues(vi); } +/* How large should a single buffer be so a queue full of these can fit at + * least one full packet? + * Logic below assumes the mergeable buffer header is used. + */ +static unsigned int mergeable_min_buf_len(struct virtnet_info *vi, struct virtqueue *vq) +{ + const unsigned int hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf); + unsigned int rq_size = virtqueue_get_vring_size(vq); + unsigned int packet_len = vi->big_packets ? IP_MAX_MTU : vi->dev->max_mtu; + unsigned int buf_len = hdr_len + ETH_HLEN + VLAN_HLEN + packet_len; + unsigned int min_buf_len = DIV_ROUND_UP(buf_len, rq_size); + + return max(min_buf_len, hdr_len); +} Consider rq_size may be large, I think this should be max(min_buf_len, GOOD_PACKET_LEN)? 
+ static int virtnet_find_vqs(struct virtnet_info *vi) { vq_callback_t **callbacks; @@ -2151,6 +2171,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi) for (i = 0; i < vi->max_queue_pairs; i++) { vi->rq[i].vq = vqs[rxq2vq(i)]; + vi->rq[i].min_buf_len = mergeable_min_buf_len(vi, vi->rq[i].vq); vi->sq[i].vq = vqs[txq2vq(i)]; } @@ -2237,7 +2258,8 @@ static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue, BUG_ON(queue_index >= vi->max_queue_pairs); avg = >rq[queue_index].mrg_avg_pkt_len; - return sprintf(buf, "%u\n", get_mergeable_buf_len(avg)); + return sprintf(buf, "%u\n", + get_mergeable_buf_len(>rq[queue_index], avg)); } static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
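The sizing logic under discussion — divide the worst-case packet length across a full ring, rounding up — can be checked with plain arithmetic. The sketch below folds in the reviewer's suggested clamp to GOOD_PACKET_LEN (the patch as posted clamps only to hdr_len); constants mirror the kernel values, but min_buf_len() here is a simplified stand-in, not the driver function.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Constants mirroring the kernel values used in the patch. */
#define HDR_LEN 12          /* sizeof(struct virtio_net_hdr_mrg_rxbuf) */
#define ETH_HLEN 14
#define VLAN_HLEN 4
#define GOOD_PACKET_LEN 1536
#define IP_MAX_MTU 65535u

/* How large must a single buffer be so that a ring full of rq_size such
 * buffers can hold one packet of packet_len bytes?  Clamped below by
 * GOOD_PACKET_LEN per the review comment, so a huge ring doesn't force
 * uselessly tiny buffers. */
static unsigned int min_buf_len(unsigned int rq_size, unsigned int packet_len)
{
    unsigned int buf_len = HDR_LEN + ETH_HLEN + VLAN_HLEN + packet_len;

    return MAX(DIV_ROUND_UP(buf_len, rq_size), (unsigned int)GOOD_PACKET_LEN);
}
```

With a 16-entry ring and a 64 KiB packet each buffer must carry about 4 KiB, while a 1024-entry ring falls back to the GOOD_PACKET_LEN floor rather than ~65-byte buffers.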
Re: [PATCH net-next] virtio_net: don't reset twice on XDP on/off
On 2017年03月30日 04:14, Michael S. Tsirkin wrote: We already do a reset once in remove_vq_common - there appears to be no point in doing another one when we add/remove XDP. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- drivers/net/virtio_net.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index de42e9a..ed8f548 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1810,7 +1810,6 @@ static int virtnet_reset(struct virtnet_info *vi, int curr_qp, int xdp_qp) virtnet_freeze_down(dev); _remove_vq_common(vi); - dev->config->reset(dev); virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE); virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER); Acked-by: Jason Wang <jasow...@redhat.com>
Re: [PATCH 1/6] virtio: wrap find_vqs
On 2017年03月30日 04:48, Michael S. Tsirkin wrote: We are going to add more parameters to find_vqs, let's wrap the call so we don't need to tweak all drivers every time. Signed-off-by: Michael S. Tsirkin --- A quick glance and it looks OK, but what's the benefit of this series? Is it required by other changes? Thanks
Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array
On 2017年03月29日 18:46, Pankaj Gupta wrote: Hi Jason, On 2017年03月23日 13:34, Jason Wang wrote: +{ +if (rvq->rh != rvq->rt) +goto out; + +rvq->rh = rvq->rt = 0; +rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq, +VHOST_RX_BATCH); A comment explaining why is is -bh would be helpful. Ok. Thanks Rethink about this. It looks like -bh is not needed in this case since no consumer run in bh. In that case do we need other variants of "ptr_ring_consume_batched_*()" functions. Are we planning to use them in future? I think we'd better keep them, since it serves as helpers. You can see that not all the helpers in ptr_ring has real users, but they were prepared for the future use. Thanks Thanks
Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array
On 2017年03月23日 13:34, Jason Wang wrote: +{ +if (rvq->rh != rvq->rt) +goto out; + +rvq->rh = rvq->rt = 0; +rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq, +VHOST_RX_BATCH); A comment explaining why is is -bh would be helpful. Ok. Thanks Rethink about this. It looks like -bh is not needed in this case since no consumer run in bh. Thanks
Re: [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control
On 2017年03月30日 23:06, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:28PM +0800, Jason Wang wrote: This patch makes tun_recvmsg() can receive from skb from its caller through msg_control. Vhost_net will be the first user. Signed-off-by: Jason Wang<jasow...@redhat.com> Do we need to bother with tun? I didn't realize one can even use that with vhost. What would be the point of all the virtio header stuff dealing with checksums etc? Even if you see a use-case is it worth optimizing? It's for tap in fact. I use "tun" just because we have already had a tap.c which is used by macvtap. Thanks
[PATCH net-next 2/8] skb_array: introduce batch dequeuing
Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/skb_array.h | 25 + 1 file changed, 25 insertions(+) diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h index f4dfade..90e44b9 100644 --- a/include/linux/skb_array.h +++ b/include/linux/skb_array.h @@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct skb_array *a) return ptr_ring_consume(&a->ring); } +static inline int skb_array_consume_batched(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched(&a->ring, array, n); +} + static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a) { return ptr_ring_consume_irq(&a->ring); } +static inline int skb_array_consume_batched_irq(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_irq(&a->ring, array, n); +} + static inline struct sk_buff *skb_array_consume_any(struct skb_array *a) { return ptr_ring_consume_any(&a->ring); } +static inline int skb_array_consume_batched_any(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_any(&a->ring, array, n); +} + + static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a) { return ptr_ring_consume_bh(&a->ring); } +static inline int skb_array_consume_batched_bh(struct skb_array *a, + void **array, int n) +{ + return ptr_ring_consume_batched_bh(&a->ring, array, n); +} + static inline int __skb_array_len_with_tag(struct sk_buff *skb) { if (likely(skb)) { -- 2.7.4
[PATCH net-next 5/8] tun: support receiving skb through msg_control
This patch allows tun_recvmsg() to receive an skb from its caller through msg_control. Vhost_net will be the first user. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tun.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 70dd9ec..a82bced 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock, static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, struct iov_iter *to, - int noblock) + int noblock, struct sk_buff *skb) { - struct sk_buff *skb; ssize_t ret; int err; @@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, if (!iov_iter_count(to)) return 0; - /* Read frames from ring */ - skb = tun_ring_recv(tfile, noblock, &err); - if (!skb) - return err; + if (!skb) { + /* Read frames from ring */ + skb = tun_ring_recv(tfile, noblock, &err); + if (!skb) + return err; + } ret = tun_put_user(tun, tfile, skb, to); if (unlikely(ret < 0)) @@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) if (!tun) return -EBADFD; - ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK); + ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL); ret = min_t(ssize_t, ret, len); if (ret > 0) iocb->ki_pos = ret; @@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, SOL_PACKET, TUN_TX_TIMESTAMP); goto out; } - ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT); + ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, + m->msg_control); if (ret > (ssize_t)total_len) { m->msg_flags |= MSG_TRUNC; ret = flags & MSG_TRUNC ? ret : total_len; -- 2.7.4
[PATCH net-next 7/8] vhost_net: try batch dequing from skb array
We used to dequeue one skb during recvmsg() from skb_array, this could be inefficient because of the bad cache utilization and spinlock touching for each packet. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and pass the skb back through msg_control for underlayer socket to finish the userspace copying. Tests were done by XDP1: - small buffer: Before: 1.88Mpps After : 2.25Mpps (+19.6%) - mergeable buffer: Before: 1.83Mpps After : 2.10Mpps (+14.7%) Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 64 + 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b51989..53f09f2 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -28,6 +28,8 @@ #include #include #include +#include +#include #include @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref { struct vhost_virtqueue *vq; }; +#define VHOST_RX_BATCH 64 struct vhost_net_virtqueue { struct vhost_virtqueue vq; size_t vhost_hlen; @@ -99,6 +102,10 @@ struct vhost_net_virtqueue { /* Reference counting for outstanding ubufs. * Protected by vq mutex. Writers must also take device mutex. 
*/ struct vhost_net_ubuf_ref *ubufs; + struct skb_array *rx_array; + void *rxq[VHOST_RX_BATCH]; + int rt; + int rh; }; struct vhost_net { @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n) n->vqs[i].ubufs = NULL; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } } @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net) mutex_unlock(>mutex); } -static int peek_head_len(struct sock *sk) +static int peek_head_len_batched(struct vhost_net_virtqueue *rvq) +{ + if (rvq->rh != rvq->rt) + goto out; + + rvq->rh = rvq->rt = 0; + rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq, + VHOST_RX_BATCH); + if (!rvq->rt) + return 0; +out: + return __skb_array_len_with_tag(rvq->rxq[rvq->rh]); +} + +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; struct sk_buff *head; int len = 0; unsigned long flags; + if (rvq->rx_array) + return peek_head_len_batched(rvq); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk) return skb_queue_empty(>sk_receive_queue); } -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) +static int vhost_net_rx_peek_head_len(struct vhost_net *net, + struct sock *sk) { + struct vhost_net_virtqueue *rvq = >vqs[VHOST_NET_VQ_RX]; struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX]; struct vhost_virtqueue *vq = >vq; unsigned long uninitialized_var(endtime); - int len = peek_head_len(sk); + int len = peek_head_len(rvq, sk); if (!len && vq->busyloop_timeout) { /* Both tx vq and rx socket were polled here */ @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) vhost_poll_queue(>poll); mutex_unlock(>mutex); - len = peek_head_len(sk); + len = peek_head_len(rvq, sk); } return len; @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until 
the next kick. */ if (unlikely(headcount < 0)) goto out; + if (nvq->rx_array) + msg.msg_control = nvq->rxq[nvq->rh++]; /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(_iter, READ, vq->iov, 1, 1); @@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f) n->vqs[i].done_idx = 0; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); @@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n, struct vhost_virtqueue *vq) { struct socket *sock; + struct vhost_net_virtqueue *nvq = + container_of(vq, struct vhost_net_virtqueue, vq); mu
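The rh/rt bookkeeping in this patch — fetch a batch once, then hand packets out one at a time from the cache — can be condensed into a small self-contained model. `struct rx_cache`, `peek_next()` and `demo_fetch()` below are invented names for illustration; `fetch` stands in for skb_array_consume_batched(), and peek_next() also advances rh (in the patch, handle_rx() advances it when assigning msg_control).

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 64

/* Consumer-side cache mirroring vhost_net's rxq/rt/rh fields: rt is the
 * number of entries fetched in the last batch, rh the next one to use. */
struct rx_cache {
    void *rxq[BATCH];
    int rt; /* tail: number of cached entries */
    int rh; /* head: next cached entry to consume */
};

/* Return the next packet, refilling the cache with one batched dequeue
 * only when it runs empty -- one lock acquisition per BATCH packets
 * instead of one per packet. */
static void *peek_next(struct rx_cache *c, int (*fetch)(void **array, int n))
{
    if (c->rh != c->rt)
        goto out;

    c->rh = c->rt = 0;
    c->rt = fetch(c->rxq, BATCH);
    if (!c->rt)
        return NULL;
out:
    return c->rxq[c->rh++];
}

/* Demo source: yields three fake packets on the first call, then nothing. */
static int demo_fetch(void **array, int n)
{
    static int done;
    static int pkts[3];
    int i;

    if (done)
        return 0;
    done = 1;
    for (i = 0; i < 3 && i < n; i++)
        array[i] = &pkts[i];
    return i;
}
```

The first call triggers exactly one batch fetch; the next two are served from the cache, and a fourth call correctly reports an empty source.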
[PATCH net-next 3/8] tun: export skb_array
This patch exports skb_array through tun_get_skb_array(). The caller can then manipulate the skb array directly. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tun.c | 13 + include/linux/if_tun.h | 5 + 2 files changed, 18 insertions(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index c418f0a..70dd9ec 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file) } EXPORT_SYMBOL_GPL(tun_get_socket); +struct skb_array *tun_get_skb_array(struct file *file) +{ + struct tun_file *tfile; + + if (file->f_op != &tun_fops) + return ERR_PTR(-EINVAL); + tfile = file->private_data; + if (!tfile) + return ERR_PTR(-EBADFD); + return &tfile->tx_array; +} +EXPORT_SYMBOL_GPL(tun_get_skb_array); + module_init(tun_init); module_exit(tun_cleanup); MODULE_DESCRIPTION(DRV_DESCRIPTION); diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index ed6da2e..bf9bdf4 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -19,6 +19,7 @@ #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) struct socket *tun_get_socket(struct file *); +struct skb_array *tun_get_skb_array(struct file *file); #else #include <linux/err.h> #include <linux/errno.h> @@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f) { return ERR_PTR(-EINVAL); } +static inline struct skb_array *tun_get_skb_array(struct file *f) +{ + return ERR_PTR(-EINVAL); +} #endif /* CONFIG_TUN */ #endif /* __IF_TUN_H */ -- 2.7.4
[PATCH net-next 6/8] tap: support receiving skb from msg_control
This patch allows tap_recvmsg() to receive an skb from its caller through msg_control. Vhost_net will be the first user. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tap.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index abdaf86..07d9174 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q, static ssize_t tap_do_read(struct tap_queue *q, struct iov_iter *to, - int noblock) + int noblock, struct sk_buff *skb) { DEFINE_WAIT(wait); - struct sk_buff *skb; ssize_t ret = 0; if (!iov_iter_count(to)) return 0; + if (skb) + goto done; + while (1) { if (!noblock) prepare_to_wait(sk_sleep(&q->sk), &wait, @@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q, if (!noblock) finish_wait(sk_sleep(&q->sk), &wait); +done: if (skb) { ret = tap_put_user(q, skb, to); if (unlikely(ret < 0)) @@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to) struct tap_queue *q = file->private_data; ssize_t len = iov_iter_count(to), ret; - ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK); + ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL); ret = min_t(ssize_t, ret, len); if (ret > 0) iocb->ki_pos = ret; @@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m, int ret; if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) return -EINVAL; - ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT); + ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, + m->msg_control); if (ret > total_len) { m->msg_flags |= MSG_TRUNC; ret = flags & MSG_TRUNC ? ret : total_len; -- 2.7.4
[PATCH net-next 4/8] tap: export skb_array
This patch exports skb_array through tap_get_skb_array(). The caller can then manipulate the skb array directly. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tap.c | 13 + include/linux/if_tap.h | 5 + 2 files changed, 18 insertions(+) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index 4d4173d..abdaf86 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file) } EXPORT_SYMBOL_GPL(tap_get_socket); +struct skb_array *tap_get_skb_array(struct file *file) +{ + struct tap_queue *q; + + if (file->f_op != &tap_fops) + return ERR_PTR(-EINVAL); + q = file->private_data; + if (!q) + return ERR_PTR(-EBADFD); + return &q->skb_array; +} +EXPORT_SYMBOL_GPL(tap_get_skb_array); + int tap_queue_resize(struct tap_dev *tap) { struct net_device *dev = tap->dev; diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h index 3482c3c..4837157 100644 --- a/include/linux/if_tap.h +++ b/include/linux/if_tap.h @@ -3,6 +3,7 @@ #if IS_ENABLED(CONFIG_TAP) struct socket *tap_get_socket(struct file *); +struct skb_array *tap_get_skb_array(struct file *file); #else #include <linux/err.h> #include <linux/errno.h> @@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f) { return ERR_PTR(-EINVAL); } +static inline struct skb_array *tap_get_skb_array(struct file *f) +{ + return ERR_PTR(-EINVAL); +} #endif /* CONFIG_TAP */ #include -- 2.7.4
[PATCH net-next 0/8] vhost-net rx batching
Hi all: This series tries to implement rx batching for vhost-net. This is done by batching the dequeuing from the skb_array which is exported by the underlying socket, and passing the skb back through msg_control to finish the userspace copying. Tests show at most a 19% improvement on rx pps. Please review. Thanks Jason Wang (8): ptr_ring: introduce batch dequeuing skb_array: introduce batch dequeuing tun: export skb_array tap: export skb_array tun: support receiving skb through msg_control tap: support receiving skb from msg_control vhost_net: try batch dequing from skb array vhost_net: use lockless peeking for skb array during busy polling drivers/net/tap.c | 25 ++--- drivers/net/tun.c | 31 +++-- drivers/vhost/net.c | 71 +++ include/linux/if_tap.h | 5 + include/linux/if_tun.h | 5 + include/linux/ptr_ring.h | 65 +++ include/linux/skb_array.h | 25 + 7 files changed, 209 insertions(+), 18 deletions(-) -- 2.7.4
[PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling
For a socket that exports its skb array, we can use lockless peeking to avoid touching the spinlock during busy polling. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/vhost/net.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 53f09f2..41153a3 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) return len; } -static int sk_has_rx_data(struct sock *sk) +static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; + if (rvq->rx_array) + return !__skb_array_empty(rvq->rx_array); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); @@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, endtime = busy_clock() + vq->busyloop_timeout; while (vhost_can_busy_poll(&net->dev, endtime) && - !sk_has_rx_data(sk) && + !sk_has_rx_data(rvq, sk) && vhost_vq_avail_empty(&net->dev, vq)) cpu_relax(); -- 2.7.4
Re: [PATCH net-next 1/8] ptr_ring: introduce batch dequeuing
On 2017年03月21日 18:25, Sergei Shtylyov wrote: Hello! On 3/21/2017 7:04 AM, Jason Wang wrote: Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/ptr_ring.h | 65 + 1 file changed, 65 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6c70444..4771ded 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r) return ptr; } +static inline int __ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) +{ + void *ptr; + int i = 0; + + while (i < n) { Hm, why not *for*? Yes, it may be better; if there are other comments on the series, I will change it in the next version. Thanks
Re: [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control
On 2017年03月30日 23:03, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:29PM +0800, Jason Wang wrote: This patch allows tap_recvmsg() to receive an skb from its caller through msg_control. Vhost_net will be the first user. Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/tap.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index abdaf86..07d9174 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q, static ssize_t tap_do_read(struct tap_queue *q, struct iov_iter *to, - int noblock) + int noblock, struct sk_buff *skb) { DEFINE_WAIT(wait); - struct sk_buff *skb; ssize_t ret = 0; if (!iov_iter_count(to)) return 0; + if (skb) + goto done; + while (1) { if (!noblock) prepare_to_wait(sk_sleep(&q->sk), &wait, TASK_INTERRUPTIBLE); @@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q, if (!noblock) finish_wait(sk_sleep(&q->sk), &wait); +done: Please just use an if {} block here. goto on error is ok, but we are far from done here and goto done is misleading. Ok. Thanks.
Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
On 2017年03月31日 12:02, Jason Wang wrote: On 2017年03月30日 22:21, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote: We used to dequeue one skb during recvmsg() from skb_array, this could be inefficient because of the bad cache utilization which cache does this refer to btw? Both icache and dcache more or less. and spinlock touching for each packet. Do you mean the effect of extra two atomics here? In fact four, packet length peeking needs another two. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and pass the skb back through msg_control for underlayer socket to finish the userspace copying. Tests were done by XDP1: - small buffer: Before: 1.88Mpps After : 2.25Mpps (+19.6%) - mergeable buffer: Before: 1.83Mpps After : 2.10Mpps (+14.7%) Signed-off-by: Jason Wang <jasow...@redhat.com> Looks like I misread the code previously. More comments below, sorry about not asking these questions earlier. --- drivers/vhost/net.c | 64 + 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b51989..ffa78c6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -28,6 +28,8 @@ #include #include #include +#include +#include #include @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref { struct vhost_virtqueue *vq; }; +#define VHOST_RX_BATCH 64 struct vhost_net_virtqueue { struct vhost_virtqueue vq; size_t vhost_hlen; Could you please try playing with batch size and see what the effect is? Ok. I tried 32 which seems slower than 64 but still faster than no batching. Ok, result is: no batching 1.88Mpps RX_BATCH=11.93Mpps RX_BATCH=42.11Mpps RX_BATCH=16 2.14Mpps RX_BATCH=64 2.25Mpps RX_BATCH=256 2.18Mpps @@ -99,6 +102,10 @@ struct vhost_net_virtqueue { /* Reference counting for outstanding ubufs. * Protected by vq mutex. Writers must also take device mutex. 
*/ struct vhost_net_ubuf_ref *ubufs; +struct skb_array *rx_array; +void *rxq[VHOST_RX_BATCH]; +int rt; +int rh; }; struct vhost_net { @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n) n->vqs[i].ubufs = NULL; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; +n->vqs[i].rt = 0; +n->vqs[i].rh = 0; } } @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net) mutex_unlock(>mutex); } -static int peek_head_len(struct sock *sk) +static int fetch_skbs(struct vhost_net_virtqueue *rvq) +{ +if (rvq->rh != rvq->rt) +goto out; + +rvq->rh = rvq->rt = 0; +rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq, +VHOST_RX_BATCH); +if (!rvq->rt) +return 0; +out: +return __skb_array_len_with_tag(rvq->rxq[rvq->rh]); +} + +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; struct sk_buff *head; int len = 0; unsigned long flags; +if (rvq->rx_array) +return fetch_skbs(rvq); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk) return skb_queue_empty(>sk_receive_queue); } -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) +static int vhost_net_rx_peek_head_len(struct vhost_net *net, + struct sock *sk) { +struct vhost_net_virtqueue *rvq = >vqs[VHOST_NET_VQ_RX]; struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX]; struct vhost_virtqueue *vq = >vq; unsigned long uninitialized_var(endtime); -int len = peek_head_len(sk); +int len = peek_head_len(rvq, sk); if (!len && vq->busyloop_timeout) { /* Both tx vq and rx socket were polled here */ @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) vhost_poll_queue(>poll); mutex_unlock(>mutex); -len = peek_head_len(sk); +len = peek_head_len(rvq, sk); } return len; @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until the next kick. 
*/ if (unlikely(headcount < 0)) goto out; +if (nvq->rx_array) +msg.msg_control = nvq->rxq[nvq->rh++]; /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(_iter, READ, vq->iov, 1, 1); So there's a bit of a mystery here. vhost code isn't batched, all we are batching is the fetch from the tun ring. I've already had vhost batching code
Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
On 2017年03月31日 22:31, Michael S. Tsirkin wrote: On Fri, Mar 31, 2017 at 11:52:24AM +0800, Jason Wang wrote: On 2017年03月30日 21:53, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote: This patch introduces a batched version of consuming; the consumer can dequeue more than one pointer from the ring at a time. We don't care about the reordering of reads here, so no compiler barrier is needed. Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/ptr_ring.h | 65 + 1 file changed, 65 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6c70444..2be0f350 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r) return ptr; } +static inline int __ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) Can we use a shorter name? ptr_ring_consume_batch? Ok, but at least we need to keep the prefix since there's a locked version. +{ + void *ptr; + int i; + + for (i = 0; i < n; i++) { + ptr = __ptr_ring_consume(r); + if (!ptr) + break; + array[i] = ptr; + } + + return i; +} + /* * Note: resize (below) nests producer lock within consumer lock, so if you * call this in interrupt or BH context, you must disable interrupts/BH when I'd like to add a code comment here explaining why we don't care about cpu or compiler reordering. And I think the reason is in the way you use this API: in vhost it does not matter if you get fewer entries than are present in the ring. That's ok but needs to be noted in a code comment so people use this function correctly. Interesting, but I still think it's not necessary. If the consumer is doing busy polling, it will eventually get the entries. If the consumer needs notification from the producer, it should drain the queue, which means it needs to enable notification before the last consume attempt; otherwise it is a bug.
The batch consuming function in this patch can guarantee returning at least one pointer if there are many; this looks sufficient for correctness. Thanks You ask for N entries but get N-1. This seems to imply the ring is now empty. Do we guarantee this? I think the consumer cannot assume the ring is empty, considering the producer can produce at the same time. It needs to enable notification and do another poll in this case. Thanks
Re: [PATCH net-next v2 2/5] virtio-net: transmit napi
On 2017年04月20日 21:58, Willem de Bruijn wrote: On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang <jasow...@redhat.com> wrote: On 2017年04月19日 04:21, Willem de Bruijn wrote: +static void virtnet_napi_tx_enable(struct virtnet_info *vi, + struct virtqueue *vq, + struct napi_struct *napi) +{ + if (!napi->weight) + return; + + if (!vi->affinity_hint_set) { + napi->weight = 0; + return; + } + + return virtnet_napi_enable(vq, napi); +} + static void refill_work(struct work_struct *work) Maybe I was wrong, but according to Michael's comment it looks like he want check affinity_hint_set just for speculative tx polling on rx napi instead of disabling it at all. And I'm not convinced this is really needed, driver only provide affinity hint instead of affinity, so it's not guaranteed that tx and rx interrupt are in the same vcpus. You're right. I made the restriction broader than the request, to really err on the side of caution for the initial merge of napi tx. And enabling the optimization is always a win over keeping it off, even without irq affinity. The cycle cost is significant without affinity regardless of whether the optimization is used. Yes, I noticed this in the past too. Though this is not limited to napi-tx, it is more pronounced in that mode than without napi. 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}: upstream: 1,1,1: 28985 Mbps, 278 Gcyc 1,0,2: 30067 Mbps, 402 Gcyc napi tx: 1,1,1: 34492 Mbps, 269 Gcyc 1,0,2: 36527 Mbps, 537 Gcyc (!) 1,0,1: 36269 Mbps, 394 Gcyc 1,0,0: 34674 Mbps, 402 Gcyc This is a particularly strong example. It is also representative of most RR tests. It is less pronounced in other streaming tests. 10x TCP_RR, for instance: upstream: 1,1,1: 42267 Mbps, 301 Gcyc 1,0,2: 40663 Mbps, 445 Gcyc napi tx: 1,1,1: 42420 Mbps, 303 Gcyc 1,0,2: 42267 Mbps, 431 Gcyc These numbers were obtained with the virtqueue_enable_cb_delayed optimization after xmit_skb, btw. 
It turns out that moving that before increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing 100x TCP_RR a bit. I see, so I think we can leave the affinity hint optimization/check for future investigation: - to avoid endless optimization (e.g we may want to share a single vector/napi for tx/rx queue pairs in the future) for this series. - tx napi is disabled by default which means we can do optimization on top. Thanks
Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions
On 2017年04月20日 23:34, Vlad Yasevich wrote: On 04/17/2017 11:01 PM, Jason Wang wrote: On 2017年04月16日 00:38, Vladislav Yasevich wrote: Curreclty virtion net header is fixed size and adding things to it is rather difficult to do. This series attempt to add the infrastructure as well as some extensions that try to resolve some deficiencies we currently have. First, vnet header only has space for 16 flags. This may not be enough in the future. The extensions will provide space for 32 possbile extension flags and 32 possible extensions. These flags will be carried in the first pseudo extension header, the presense of which will be determined by the flag in the virtio net header. The extensions themselves will immidiately follow the extension header itself. They will be added to the packet in the same order as they appear in the extension flags. No padding is placed between the extensions and any extensions negotiated, but not used need by a given packet will convert to trailing padding. Do we need a explicit padding (e.g an extension) which could be controlled by each side? I don't think so. The size of the vnet header is set based on the extensions negotiated. The one part I am not crazy about is that in the case of packet not using any extensions, the data is still placed after the entire vnet header, which essentially adds a lot of padding. However, that's really no different then if we simply grew the vnet header. The other thing I've tried before is putting extensions into their own sg buffer, but that made it slower.h Yes. For example: | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data | Just some rough thoughts: - Is this better to use TLV instead of bitmap here? One advantage of TLV is that the length is not limited by the length of bitmap. but the disadvantage is that we add at least 4 bytes per extension of just TL data. That makes this thing even longer. Yes, and it looks like the length is still limited by e.g the length of T. 
- For 1.1, do we really want something like vnet header? AFAIK, it was not used by modern NICs, is this better to pack all meta-data into descriptor itself? This may need a some changes in tun/macvtap, but looks more PCIE friendly. That would really be ideal and I've looked at this. There are small issues of exposing the 'net metadata' of the descriptor to taps so they can be filled in. The alternative is to use a different control structure for tap->qemu|vhost channel (that can be implementation specific) and have qemu|vhost populate the 'net metadata' of the descriptor. Yes, this needs some thought. For vhost, things looks a little bit easier, we can probably use msg_control. Thanks Thanks -vlad Thanks Extensions proposed in this series are: - IPv6 fragment id extension * Currently, the guest generated fragment id is discarded and the host generates an IPv6 fragment id if the packet has to be fragmented. The code attempts to add time based perturbation to id generation to make it harder to guess the next fragment id to be used. However, doing this on the host may result is less perturbation (due to differnet timing) and might make id guessing easier. Ideally, the ids generated by the guest should be used. One could also argue that we a "violating" the IPv6 protocol in the if the _strict_ interpretation of the spec. - VLAN header acceleration * Currently virtio doesn't not do vlan header acceleration and instead uses software tagging. One of the first things that the host will do is strip the vlan header out. When passing the packet the a guest the vlan header is re-inserted in to the packet. We can skip all that work if we can pass the vlan data in accelearted format. Then the host will not do any extra work. However, so far, this yeilded a very small perf bump (only ~1%). I am still looking into this. 
- UDP tunnel offload * Similar to vlan acceleration, with this extension we can pass additional data to host for support GSO with udp tunnel and possible other encapsulations. This yeilds a significant perfromance improvement (still testing remote checksum code). An addition extension that is unfinished (due to still testing for any side-effects) is checksum passthrough to support drivers that set CHECKSUM_COMPLETE. This would eliminate the need for guests to compute the software checksum. This series only takes care of virtio net. I have addition patches for the host side (vhost and tap/macvtap as well as qemu), but wanted to get feedback on the general approach first. Vladislav Yasevich (6): virtio-net: Remove the use the padded vnet_header structure virtio-net: make header length handling uniform virtio_net: Add basic skeleton for handling vnet header extensions. virtio-net: Add support for IPv6 fragment i
Re: [PATCH RFC (resend) net-next 3/6] virtio_net: Add basic skeleton for handling vnet header extensions.
On 2017年04月16日 00:38, Vladislav Yasevich wrote: This is the basic sceleton which will be fleshed out by individiual extensions. Signed-off-by: Vladislav Yasevich--- drivers/net/virtio_net.c| 21 + include/linux/virtio_net.h | 12 include/uapi/linux/virtio_net.h | 11 +++ 3 files changed, 44 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 5ad6ee6..08e2709 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -145,6 +145,10 @@ struct virtnet_info { /* Packet virtio header size */ u8 hdr_len; + /* Header extensions were negotiated */ + bool hdr_ext; + u32 ext_mask; + /* Active statistics */ struct virtnet_stats __percpu *stats; @@ -174,6 +178,11 @@ struct virtnet_info { u32 speed; }; +struct virtio_net_hdr_max { + struct virtio_net_hdr_mrg_rxbuf hdr; + struct virtio_net_ext_hdr ext_hdr; +}; + static inline u8 padded_vnet_hdr(struct virtnet_info *vi) { u8 hdr_len = vi->hdr_len; @@ -214,6 +223,7 @@ static int rxq2vq(int rxq) static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb) { + BUILD_BUG_ON(sizeof(struct virtio_net_hdr_max) > sizeof(skb->cb)); return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb; } @@ -767,6 +777,12 @@ static int receive_buf(struct virtnet_info *vi, struct receive_queue *rq, goto frame_err; } + if (vi->hdr_ext && + virtio_net_ext_to_skb(skb, + (struct virtio_net_ext_hdr *)(hdr + 1))) { + goto frame_err; + } + skb->protocol = eth_type_trans(skb, dev); pr_debug("Receiving skb proto 0x%04x len %i type %i\n", ntohs(skb->protocol), skb->len, skb->pkt_type); @@ -1106,6 +1122,11 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) if (vi->mergeable_rx_bufs) hdr->num_buffers = 0; + if (vi->hdr_ext && + virtio_net_ext_from_skb(skb, (struct virtio_net_ext_hdr *)(hdr + 1), + vi->ext_mask)) + BUG(); + sg_init_table(sq->sg, skb_shinfo(skb)->nr_frags + (can_push ? 
1 : 2)); if (can_push) { __skb_push(skb, hdr_len); diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 5209b5e..eaa524f 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -100,4 +100,16 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, return 0; } +static inline int virtio_net_ext_to_skb(struct sk_buff *skb, + struct virtio_net_ext_hdr *ext) +{ + return 0; +} + +static inline int virtio_net_ext_from_skb(const struct sk_buff *skb, + struct virtio_net_ext_hdr *ext, + __u32 ext_mask) +{ + return 0; +} #endif /* _LINUX_VIRTIO_NET_H */ diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index fc353b5..0039b72 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -88,6 +88,7 @@ struct virtio_net_config { struct virtio_net_hdr_v1 { #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */ #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */ +#define VIRTIO_NET_HDR_F_VNET_EXT 4 /* Vnet extensions present */ __u8 flags; #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */ #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */ @@ -102,6 +103,16 @@ struct virtio_net_hdr_v1 { __virtio16 num_buffers; /* Number of merged rx buffers */ }; +/* If IRTIO_NET_HDR_F_VNET_EXT flags is set, this header immediately + * follows the virtio_net_hdr. The flags in this header will indicate + * which extension will follow. The extnsion data will immidiately follow s/extnsion/extension/ + * this header. + */ +struct virtio_net_ext_hdr { + __u32 flags; + __u8 extensions[]; +}; + #ifndef VIRTIO_NET_NO_LEGACY /* This header comes first in the scatter-gather list. * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must
Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume
On 2017年04月17日 07:19, Michael S. Tsirkin wrote: Applications that consume a batch of entries in one go can benefit from the ability to return some of them back into the ring. Add an API for that - assuming there's space. If there's no space naturally we can't do this and have to drop entries, but this implies the ring is full so we'd likely drop some anyway. Signed-off-by: Michael S. Tsirkin --- Jason, in my mind the biggest issue with your batching patchset is the packet drops on disconnect. This API will help avoid that in the common case. Ok, I will rebase the series on top of this. (Though I don't think we care about the packet loss). I would still prefer that we understand what's going on, I try to reply in another thread, does it make sense? and I would like to know what's the smallest batch size that's still helpful, Yes, I've replied in another thread; the result is: no batching 1.88Mpps RX_BATCH=1 1.93Mpps RX_BATCH=4 2.11Mpps RX_BATCH=16 2.14Mpps RX_BATCH=64 2.25Mpps RX_BATCH=256 2.18Mpps but I'm not going to block the patch on these grounds assuming packet drops are fixed. Thanks a lot. Lightly tested - this is on top of consumer batching patches. Thanks! include/linux/ptr_ring.h | 57 + 1 file changed, 57 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 783e7f5..5fbeab4 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp) return 0; } +/* + * Return entries into ring. Destroy entries that don't fit. + * + * Note: this is expected to be a rare slow path operation. + * + * Note: producer lock is nested within consumer lock, so if you + * resize you must make sure all uses nest correctly. + * In particular if you consume ring in interrupt or BH context, you must + * disable interrupts/BH when doing so. 
+ */ +static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n, + void (*destroy)(void *)) +{ + unsigned long flags; + int head; + + spin_lock_irqsave(&(r)->consumer_lock, flags); + spin_lock(&(r)->producer_lock); + + if (!r->size) + goto done; + + /* +* Clean out buffered entries (for simplicity). This way following code +* can test entries for NULL and if not assume they are valid. +*/ + head = r->consumer_head - 1; + while (likely(head >= r->consumer_tail)) + r->queue[head--] = NULL; + r->consumer_tail = r->consumer_head; + + /* +* Go over entries in batch, start moving head back and copy entries. +* Stop when we run into previously unconsumed entries. +*/ + while (n--) { + head = r->consumer_head - 1; + if (head < 0) + head = r->size - 1; + if (r->queue[head]) { + /* This batch entry will have to be destroyed. */ + ++n; + goto done; + } + r->queue[head] = batch[n]; + r->consumer_tail = r->consumer_head = head; + } + +done: + /* Destroy all entries left in the batch. */ + while (n--) { + destroy(batch[n]); + } + spin_unlock(&(r)->producer_lock); + spin_unlock_irqrestore(&(r)->consumer_lock, flags); +} + static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue, int size, gfp_t gfp, void (*destroy)(void *))
Re: [PATCH net-next v2 2/5] virtio-net: transmit napi
On 2017年04月19日 04:21, Willem de Bruijn wrote: From: Willem de Bruijn <will...@google.com> Convert virtio-net to a standard napi tx completion path. This enables better TCP pacing using TCP small queues and increases single stream throughput. The virtio-net driver currently cleans tx descriptors on transmission of new packets in ndo_start_xmit. Latency depends on new traffic, so is unbounded. To avoid deadlock when a socket reaches its snd limit, packets are orphaned on tranmission. This breaks socket backpressure, including TSQ. Napi increases the number of interrupts generated compared to the current model, which keeps interrupts disabled as long as the ring has enough free descriptors. Keep tx napi optional and disabled for now. Follow-on patches will reduce the interrupt cost. Signed-off-by: Willem de Bruijn <will...@google.com> Signed-off-by: Jason Wang <jasow...@redhat.com> --- drivers/net/virtio_net.c | 77 +--- 1 file changed, 67 insertions(+), 10 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b9c1df29892c..c173e85dc7b8 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -33,9 +33,10 @@ static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); -static bool csum = true, gso = true; +static bool csum = true, gso = true, napi_tx; module_param(csum, bool, 0444); module_param(gso, bool, 0444); +module_param(napi_tx, bool, 0644); /* FIXME: MTU in config. */ #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN) @@ -86,6 +87,8 @@ struct send_queue { /* Name of the send queue: output.$index */ char name[40]; + + struct napi_struct napi; }; /* Internal representation of a receive virtqueue */ @@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct *napi, static void skb_xmit_done(struct virtqueue *vq) { struct virtnet_info *vi = vq->vdev->priv; + struct napi_struct *napi = >sq[vq2txq(vq)].napi; /* Suppress further interrupts. 
*/ virtqueue_disable_cb(vq); - /* We were probably waiting for more output buffers. */ - netif_wake_subqueue(vi->dev, vq2txq(vq)); + if (napi->weight) + virtqueue_napi_schedule(napi, vq); + else + /* We were probably waiting for more output buffers. */ + netif_wake_subqueue(vi->dev, vq2txq(vq)); } static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx) @@ -972,6 +979,21 @@ static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi) local_bh_enable(); } +static void virtnet_napi_tx_enable(struct virtnet_info *vi, + struct virtqueue *vq, + struct napi_struct *napi) +{ + if (!napi->weight) + return; + + if (!vi->affinity_hint_set) { + napi->weight = 0; + return; + } + + return virtnet_napi_enable(vq, napi); +} + static void refill_work(struct work_struct *work) { struct virtnet_info *vi = @@ -1046,6 +1068,7 @@ static int virtnet_open(struct net_device *dev) if (!try_fill_recv(vi, >rq[i], GFP_KERNEL)) schedule_delayed_work(>refill, 0); virtnet_napi_enable(vi->rq[i].vq, >rq[i].napi); + virtnet_napi_tx_enable(vi, vi->sq[i].vq, >sq[i].napi); } return 0; @@ -1081,6 +1104,25 @@ static void free_old_xmit_skbs(struct send_queue *sq) u64_stats_update_end(>tx_syncp); } +static int virtnet_poll_tx(struct napi_struct *napi, int budget) +{ + struct send_queue *sq = container_of(napi, struct send_queue, napi); + struct virtnet_info *vi = sq->vq->vdev->priv; + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq)); + + if (__netif_tx_trylock(txq)) { + free_old_xmit_skbs(sq); + __netif_tx_unlock(txq); + } + + virtqueue_napi_complete(napi, sq->vq, 0); + + if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) + netif_tx_wake_queue(txq); + + return 0; +} + static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) { struct virtio_net_hdr_mrg_rxbuf *hdr; @@ -1130,9 +1172,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) int err; struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum); bool kick = 
!skb->xmit_more; + bool use_napi = sq->napi.weight; /* Free up any pending old buffers before queueing new ones. */ - free_old_xmit_skbs(sq); + if (!use_napi) + free_old_xmit_skbs(sq); I'm not sure this is best or even correct. Consider we clean xmit packets speculatively in virtnet_poll_t
Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick
On 2017年04月19日 04:21, Willem de Bruijn wrote: From: Willem de BruijnTx napi mode increases the rate of transmit interrupts. Suppress some by masking interrupts while more packets are expected. The interrupts will be reenabled before the last packet is sent. This optimization reduces the througput drop with tx napi for unidirectional flows such as UDP_STREAM that do not benefit from cleaning tx completions in the the receive napi handler. Signed-off-by: Willem de Bruijn --- drivers/net/virtio_net.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b14c82ce0032..b107ae011632 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1196,8 +1196,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) bool use_napi = sq->napi.weight; /* Free up any pending old buffers before queueing new ones. */ - if (!use_napi) + if (use_napi) { + if (kick) + virtqueue_enable_cb_delayed(sq->vq); + else + virtqueue_disable_cb(sq->vq); Since virtqueue_disable_cb() do nothing for event idx. I wonder whether or not just calling enable_cb_dealyed() is ok here. Btw, it does not disable interrupt at all, I propose a patch in the past which can do more than this: https://patchwork.kernel.org/patch/6472601/ Thanks + } else { free_old_xmit_skbs(sq); + } /* timestamp packet in software */ skb_tx_timestamp(skb);
Re: [PATCH net-next v2 2/5] virtio-net: transmit napi
On 2017年04月19日 04:21, Willem de Bruijn wrote: +static void virtnet_napi_tx_enable(struct virtnet_info *vi, + struct virtqueue *vq, + struct napi_struct *napi) +{ + if (!napi->weight) + return; + + if (!vi->affinity_hint_set) { + napi->weight = 0; + return; + } + + return virtnet_napi_enable(vq, napi); +} + static void refill_work(struct work_struct *work) Maybe I was wrong, but according to Michael's comment it looks like he wants to check affinity_hint_set just for speculative tx polling on rx napi instead of disabling it altogether. And I'm not convinced this is really needed; the driver only provides an affinity hint instead of affinity, so it's not guaranteed that tx and rx interrupts are on the same vcpus. Thanks
Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions
On 2017年04月21日 21:08, Vlad Yasevich wrote: On 04/21/2017 12:05 AM, Jason Wang wrote: On 2017年04月20日 23:34, Vlad Yasevich wrote: On 04/17/2017 11:01 PM, Jason Wang wrote: On 2017年04月16日 00:38, Vladislav Yasevich wrote: Curreclty virtion net header is fixed size and adding things to it is rather difficult to do. This series attempt to add the infrastructure as well as some extensions that try to resolve some deficiencies we currently have. First, vnet header only has space for 16 flags. This may not be enough in the future. The extensions will provide space for 32 possbile extension flags and 32 possible extensions. These flags will be carried in the first pseudo extension header, the presense of which will be determined by the flag in the virtio net header. The extensions themselves will immidiately follow the extension header itself. They will be added to the packet in the same order as they appear in the extension flags. No padding is placed between the extensions and any extensions negotiated, but not used need by a given packet will convert to trailing padding. Do we need a explicit padding (e.g an extension) which could be controlled by each side? I don't think so. The size of the vnet header is set based on the extensions negotiated. The one part I am not crazy about is that in the case of packet not using any extensions, the data is still placed after the entire vnet header, which essentially adds a lot of padding. However, that's really no different then if we simply grew the vnet header. The other thing I've tried before is putting extensions into their own sg buffer, but that made it slower.h Yes. For example: | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data | Just some rough thoughts: - Is this better to use TLV instead of bitmap here? One advantage of TLV is that the length is not limited by the length of bitmap. but the disadvantage is that we add at least 4 bytes per extension of just TL data. 
That makes this thing even longer.

Yes, and it looks like the length is still limited by e.g. the length of T.

Not only that, but it is also limited by the skb->cb as a whole. So putting extensions into a TLV style means we have fewer extensions for now, until we get rid of skb->cb usage.

- For 1.1, do we really want something like the vnet header? AFAIK, it is not used by modern NICs; is it better to pack all metadata into the descriptor itself? This may need some changes in tun/macvtap, but looks more PCIe friendly.

That would really be ideal, and I've looked at this. There are small issues of exposing the 'net metadata' of the descriptor to taps so they can be filled in. The alternative is to use a different control structure for the tap->qemu|vhost channel (which can be implementation specific) and have qemu|vhost populate the 'net metadata' of the descriptor.

Yes, this needs some thought. For vhost, things look a little bit easier; we can probably use msg_control.

We can use msg_control in qemu as well, can't we?

AFAIK, it needs some changes since we don't export the socket to userspace.

It really is a question of who is doing the work and the number of copies. I can take a closer look at how it would look if we extend the descriptor with type specific data. I don't know if other users of virtio would benefit from it?

Not sure, but we can have a common descriptor header followed by device specific metadata. This probably needs some prototype benchmarking to see the benefits first. Thanks
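The layout being discussed — extensions packed back to back in ascending flag-bit order, with negotiated-but-unused ones becoming trailing padding — implies a simple offset rule: the offset of an extension is the sum of the sizes of all lower-numbered extensions present in the flags word. A minimal sketch, with purely illustrative extension names and sizes (not the ones from the actual RFC patches):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-extension payload sizes, indexed by bit position in the
 * 32-bit extension flags word. Names and sizes are illustrative only. */
static const uint32_t ext_size[32] = {
    [0] = 4, /* e.g. IPv6 fragment id: one __be32 */
    [1] = 4, /* e.g. vlan: __be16 tci + __be16 proto */
    [4] = 8, /* e.g. some tunnel metadata */
};

/* Offset of extension 'bit' from the start of the extension area, assuming
 * extensions are laid out back to back in ascending bit order with no
 * padding between them, as the cover letter describes. */
static uint32_t ext_offset(uint32_t flags, int bit)
{
    uint32_t off = 0;

    for (int i = 0; i < bit; i++)
        if (flags & (1u << i))
            off += ext_size[i];
    return off;
}
```

With flags 0x13 (extensions 0, 1 and 4 present), extension 4 lands at offset 8, matching the `| ext 1 | ext 2 | ext 5 |` picture above. This is also the property the TLV alternative would trade away in exchange for not being limited by the bitmap width.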
Re: [PATCH net-next RFC 2/4] virtio-net: transmit napi
On March 3, 2017 22:39, Willem de Bruijn wrote: From: Willem de Bruijn. Convert virtio-net to a standard napi tx completion path. This enables better TCP pacing using TCP small queues and increases single stream throughput. The virtio-net driver currently cleans tx descriptors on transmission of new packets in ndo_start_xmit. Latency depends on new traffic, so is unbounded. To avoid deadlock when a socket reaches its snd limit, packets are orphaned on transmission. This breaks socket backpressure, including TSQ. Napi increases the number of interrupts generated compared to the current model, which keeps interrupts disabled as long as the ring has enough free descriptors. Keep tx napi optional for now. Follow-on patches will reduce the interrupt cost. Signed-off-by: Willem de Bruijn --- drivers/net/virtio_net.c | 73 1 file changed, 61 insertions(+), 12 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 8c21e9a4adc7..9a9031640179 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -33,6 +33,8 @@ static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); +static int napi_tx_weight = NAPI_POLL_WEIGHT; + Maybe we should use module_param for this? Or, in the future, use tx-frames-irq for a per-device configuration. Thanks
Re: [PATCH net-next RFC 3/4] vhost: interrupt coalescing support
On March 3, 2017 22:39, Willem de Bruijn wrote: +void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq); +static enum hrtimer_restart vhost_coalesce_timer(struct hrtimer *timer) +{ + struct vhost_virtqueue *vq = + container_of(timer, struct vhost_virtqueue, ctimer); + + if (mutex_trylock(&vq->mutex)) { + vq->coalesce_frames = vq->max_coalesce_frames; + vhost_signal(vq->dev, vq); + mutex_unlock(&vq->mutex); + } + + /* TODO: restart if lock failed and not held by handle_tx */ + return HRTIMER_NORESTART; +} + Then we may lose an interrupt forever if there is no new tx request? I believe we need e.g. vhost_poll_queue() here. Thanks
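The lost-interrupt scenario Jason points out can be modeled in a few lines of userspace C: the timer fires while handle_tx holds the vq mutex, the trylock fails, and since the timer returns HRTIMER_NORESTART nothing ever signals the coalesced frames. All names below are illustrative; this mimics only the locking shape, not real vhost:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the coalescing race: pending frames are only signalled by
 * the timer if the trylock succeeds, or by handle_tx if it rechecks after
 * dropping the lock (the suggested fix). */
struct vq_model {
    bool locked;    /* vq->mutex held by handle_tx */
    int  pending;   /* coalesced but unsignalled frames */
    int  signalled; /* interrupts delivered to the guest */
};

static void timer_fire(struct vq_model *vq)
{
    if (!vq->locked) {          /* mutex_trylock succeeds */
        vq->signalled += vq->pending;
        vq->pending = 0;
    }                           /* else: HRTIMER_NORESTART, nothing rearms */
}

static void handle_tx_done(struct vq_model *vq, bool recheck_after_unlock)
{
    vq->locked = false;
    if (recheck_after_unlock && vq->pending) { /* the proposed fix */
        vq->signalled += vq->pending;
        vq->pending = 0;
    }
}
```

Without the recheck (or an equivalent vhost_poll_queue()), the pending frames stay unsignalled forever if no new tx request arrives to kick the worker again.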
Re: [PATCH net-next RFC 4/4] virtio-net: clean tx descriptors from rx napi
On March 3, 2017 22:39, Willem de Bruijn wrote: From: Willem de Bruijn. Amortize the cost of virtual interrupts by doing both rx and tx work on reception of a receive interrupt. Together with VIRTIO_F_EVENT_IDX and vhost interrupt moderation, this suppresses most explicit tx completion interrupts for bidirectional workloads. Signed-off-by: Willem de Bruijn --- drivers/net/virtio_net.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 9a9031640179..21c575127d50 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1031,6 +1031,23 @@ static int virtnet_receive(struct receive_queue *rq, int budget) return received; } +static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget); + +static void virtnet_poll_cleantx(struct receive_queue *rq) +{ + struct virtnet_info *vi = rq->vq->vdev->priv; + unsigned int index = vq2rxq(rq->vq); + struct send_queue *sq = &vi->sq[index]; + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index); + + __netif_tx_lock(txq, smp_processor_id()); + free_old_xmit_skbs(sq, sq->napi.weight); + __netif_tx_unlock(txq); Should we check tx napi weight here? Or was this treated as an independent optimization? + + if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) + netif_wake_subqueue(vi->dev, vq2txq(sq->vq)); +} + static int virtnet_poll(struct napi_struct *napi, int budget) { struct receive_queue *rq = @@ -1039,6 +1056,8 @@ static int virtnet_poll(struct napi_struct *napi, int budget) received = virtnet_receive(rq, budget); + virtnet_poll_cleantx(rq); + Better to do this before virtnet_receive(), considering refill may allocate memory for rx buffers. Btw, if this is proved to be more efficient, in the future we may consider: 1) using a single interrupt for both rx and tx 2) using a single napi to handle both rx and tx. Thanks /* Out of packets? */ if (received < budget) virtqueue_napi_complete(napi, rq->vq, received);
Re: [PATCH net-next RFC 3/4] vhost: interrupt coalescing support
On March 7, 2017 01:31, Willem de Bruijn wrote: On Mon, Mar 6, 2017 at 4:28 AM, Jason Wang <jasow...@redhat.com> wrote: On March 3, 2017 22:39, Willem de Bruijn wrote: +void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq); +static enum hrtimer_restart vhost_coalesce_timer(struct hrtimer *timer) +{ + struct vhost_virtqueue *vq = + container_of(timer, struct vhost_virtqueue, ctimer); + + if (mutex_trylock(&vq->mutex)) { + vq->coalesce_frames = vq->max_coalesce_frames; + vhost_signal(vq->dev, vq); + mutex_unlock(&vq->mutex); + } + + /* TODO: restart if lock failed and not held by handle_tx */ + return HRTIMER_NORESTART; +} + Then we may lose an interrupt forever if there is no new tx request? I believe we need e.g. vhost_poll_queue() here. Absolutely, I need to fix this. The common case for failing to grab the lock is competition with handle_tx. With careful coding we can probably avoid scheduling another run with vhost_poll_queue in the common case. Yes, probably add some checking after releasing the mutex in handle_tx(). Thanks Your patch v7 cancels the pending hrtimer at the start of handle_tx. I need to reintroduce that, and also only schedule a timer at the end of handle_tx, not immediately when vq->coalesce_frames becomes non-zero.
Re: [PATCH 1/6] virtio: wrap find_vqs
On March 30, 2017 22:32, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote: On March 30, 2017 04:48, Michael S. Tsirkin wrote: We are going to add more parameters to find_vqs, let's wrap the call so we don't need to tweak all drivers every time. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- A quick glance and it looks ok, but what's the benefit of this series? Is it required by other changes? Thanks Yes - to avoid touching all devices when doing the rest of the patchset. Maybe I'm not clear. I meant the benefit of this series, not this single patch. I guess it may be used by your proposal that avoids reset when setting XDP? If yes, do we really want to drop some packets after XDP is set? Thanks
Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
On March 30, 2017 21:53, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote: This patch introduces a batched version of consuming; the consumer can dequeue more than one pointer from the ring at a time. We don't care about the reorder of reading here, so no need for a compiler barrier. Signed-off-by: Jason Wang <jasow...@redhat.com> --- include/linux/ptr_ring.h | 65 1 file changed, 65 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6c70444..2be0f350 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r) return ptr; } +static inline int __ptr_ring_consume_batched(struct ptr_ring *r, +void **array, int n) Can we use a shorter name? ptr_ring_consume_batch? Ok, but at least we need to keep the prefix since there's a locked version. +{ + void *ptr; + int i; + + for (i = 0; i < n; i++) { + ptr = __ptr_ring_consume(r); + if (!ptr) + break; + array[i] = ptr; + } + + return i; +} + /* * Note: resize (below) nests producer lock within consumer lock, so if you * call this in interrupt or BH context, you must disable interrupts/BH when I'd like to add a code comment here explaining why we don't care about cpu or compiler reordering. And I think the reason is in the way you use this API: in vhost it does not matter if you get fewer entries than present in the ring. That's ok but needs to be noted in a code comment so people use this function correctly. Interesting, but I still think it's not necessary. If the consumer is doing busy polling, it will eventually get the entries. If the consumer needs notification from the producer, it should drain the queue, which means it needs to enable notification before the last consuming call; otherwise it is a bug. The batch consuming function in this patch can guarantee returning at least one pointer if there are many, which looks sufficient for correctness?
Thanks Also, I think you need to repeat the comment about cpu_relax near this function: if someone uses it in a loop, a compiler barrier is needed to prevent the compiler from optimizing it out. I note that ptr_ring_consume currently lacks any of these comments, so I'm ok with merging as is, and I'll add documentation on top. Like this perhaps? /* Consume up to n entries and return the number of entries consumed * or 0 on ring empty. * Note: this might return early with fewer entries than present in the * ring. * Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). Callers must take consumer_lock * if the ring is ever resized - see e.g. ptr_ring_consume_batch. */ @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r) return ptr; } +static inline int ptr_ring_consume_batched(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock(&r->consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_irq(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irq(&r->consumer_lock); + + return ret; +} + +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r, + void **array, int n) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&r->consumer_lock, flags); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_irqrestore(&r->consumer_lock, flags); + + return ret; +} + +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, + void **array, int n) +{ + int ret; + + spin_lock_bh(&r->consumer_lock); + ret = __ptr_ring_consume_batched(r, array, n); + spin_unlock_bh(&r->consumer_lock); + + return ret; +} + /* Cast to structure type and call a function without discarding from FIFO. * Function must return a value. * Callers must take consumer_lock. -- 2.7.4
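The semantics being debated — a batched consume that simply stops at the first empty slot and may therefore return fewer than n entries — can be shown with a tiny userspace model of the ring. This is illustrative only; the real ptr_ring additionally handles wraparound via a size mask, locking variants, and resize:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8

/* Minimal model of __ptr_ring_consume_batched(): slots hold pointers,
 * NULL means empty, and the batched consumer stops at the first empty
 * slot, so the return value may be less than n. */
struct toy_ring {
    void *queue[RING_SIZE];
    int consumer;
};

static void *toy_consume(struct toy_ring *r)
{
    void *ptr = r->queue[r->consumer];

    if (ptr) {
        r->queue[r->consumer] = NULL;
        r->consumer = (r->consumer + 1) % RING_SIZE;
    }
    return ptr;
}

static int toy_consume_batched(struct toy_ring *r, void **array, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        void *ptr = toy_consume(r);
        if (!ptr)
            break;
        array[i] = ptr;
    }
    return i;
}
```

This is exactly why Jason argues no extra barrier is needed for correctness: a busy-polling consumer will pick up any missed entries on the next call, and a consumer relying on producer notification must drain (and re-enable notification) before its final call anyway.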
Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
On March 30, 2017 22:21, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote: We used to dequeue one skb during recvmsg() from skb_array; this could be inefficient because of the bad cache utilization. Which cache does this refer to btw? Both icache and dcache, more or less. And spinlock touching for each packet. Do you mean the effect of the extra two atomics here? In fact four; packet length peeking needs another two. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and passing the skb back through msg_control for the underlying socket to finish the userspace copying. Tests were done by XDP1: - small buffer: Before: 1.88Mpps After : 2.25Mpps (+19.6%) - mergeable buffer: Before: 1.83Mpps After : 2.10Mpps (+14.7%) Signed-off-by: Jason Wang <jasow...@redhat.com> Looks like I misread the code previously. More comments below, sorry about not asking these questions earlier. --- drivers/vhost/net.c | 64 + 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b51989..ffa78c6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -28,6 +28,8 @@ #include #include #include +#include +#include #include @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref { struct vhost_virtqueue *vq; }; +#define VHOST_RX_BATCH 64 struct vhost_net_virtqueue { struct vhost_virtqueue vq; size_t vhost_hlen; Could you please try playing with the batch size and see what the effect is? Ok. I tried 32, which seems slower than 64 but still faster than no batching. @@ -99,6 +102,10 @@ struct vhost_net_virtqueue { /* Reference counting for outstanding ubufs. * Protected by vq mutex. Writers must also take device mutex.
*/ struct vhost_net_ubuf_ref *ubufs; + struct skb_array *rx_array; + void *rxq[VHOST_RX_BATCH]; + int rt; + int rh; }; struct vhost_net { @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n) n->vqs[i].ubufs = NULL; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rt = 0; + n->vqs[i].rh = 0; } } @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net) mutex_unlock(&vq->mutex); } -static int peek_head_len(struct sock *sk) +static int fetch_skbs(struct vhost_net_virtqueue *rvq) +{ + if (rvq->rh != rvq->rt) + goto out; + + rvq->rh = rvq->rt = 0; + rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq, + VHOST_RX_BATCH); + if (!rvq->rt) + return 0; +out: + return __skb_array_len_with_tag(rvq->rxq[rvq->rh]); +} + +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk) { struct socket *sock = sk->sk_socket; struct sk_buff *head; int len = 0; unsigned long flags; + if (rvq->rx_array) + return fetch_skbs(rvq); + if (sock->ops->peek_len) return sock->ops->peek_len(sock); @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk) return skb_queue_empty(&sk->sk_receive_queue); } -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) +static int vhost_net_rx_peek_head_len(struct vhost_net *net, + struct sock *sk) { + struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX]; struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX]; struct vhost_virtqueue *vq = &nvq->vq; unsigned long uninitialized_var(endtime); - int len = peek_head_len(sk); + int len = peek_head_len(rvq, sk); if (!len && vq->busyloop_timeout) { /* Both tx vq and rx socket were polled here */ @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) vhost_poll_queue(&vq->poll); mutex_unlock(&vq->mutex); - len = peek_head_len(sk); + len = peek_head_len(rvq, sk); } return len; @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until the next kick.
*/ if (unlikely(headcount < 0)) goto out; + if (nvq->rx_array) + msg.msg_control = nvq->rxq[nvq->rh++]; /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); So there's a bit of a mystery here. vhost code isn't batched, all we are batching is the fetch from
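The rh/rt caching that fetch_skbs() implements — refill a small local array in one batched grab, then serve subsequent length peeks from the cache — can be illustrated with a userspace model. Names are illustrative, and refill() stands in for skb_array_consume_batched(); here it just hands out increasing integer "lengths":

```c
#include <assert.h>

#define BATCH 4

/* Userspace model of the fetch_skbs() pattern: rh is the next entry to
 * consume, rt is how many entries the last batched refill produced. Most
 * peek calls hit the cache without touching the shared ring. */
struct rx_cache {
    int rxq[BATCH];
    int rh, rt;     /* head (next to consume) and tail (filled count) */
    int next_len;   /* stand-in for the backing ring's contents */
    int remaining;  /* how many entries the backing ring still holds */
};

static int refill(struct rx_cache *c)
{
    int n = 0;

    while (n < BATCH && c->remaining > 0) {
        c->rxq[n++] = c->next_len++;
        c->remaining--;
    }
    return n;
}

static int peek_head_len(struct rx_cache *c)
{
    if (c->rh == c->rt) {   /* cache empty: batch-refill it */
        c->rh = 0;
        c->rt = refill(c);
        if (!c->rt)
            return 0;
    }
    return c->rxq[c->rh];   /* length of the cached head entry */
}
```

This is also where the "batching only the fetch" observation comes in: the descriptor handling per packet is unchanged; only the ring access (and its two atomics per peek plus two per dequeue) is amortized over the batch.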
Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions
On April 16, 2017 00:38, Vladislav Yasevich wrote: Currently the virtio net header is fixed size, and adding things to it is rather difficult to do. This series attempts to add the infrastructure as well as some extensions that try to resolve some deficiencies we currently have. First, the vnet header only has space for 16 flags. This may not be enough in the future. The extensions will provide space for 32 possible extension flags and 32 possible extensions. These flags will be carried in the first pseudo extension header, the presence of which will be determined by a flag in the virtio net header. The extensions themselves will immediately follow the extension header itself. They will be added to the packet in the same order as they appear in the extension flags. No padding is placed between the extensions, and any extensions negotiated but not used by a given packet will convert to trailing padding. Do we need an explicit padding (e.g. an extension) which could be controlled by each side? For example: | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data | Just some rough thoughts: - Is it better to use TLV instead of a bitmap here? One advantage of TLV is that the length is not limited by the length of the bitmap. - For 1.1, do we really want something like the vnet header? AFAIK, it is not used by modern NICs; is it better to pack all metadata into the descriptor itself? This may need some changes in tun/macvtap, but looks more PCIe friendly. Thanks Extensions proposed in this series are: - IPv6 fragment id extension * Currently, the guest generated fragment id is discarded and the host generates an IPv6 fragment id if the packet has to be fragmented. The code attempts to add time based perturbation to id generation to make it harder to guess the next fragment id to be used. However, doing this on the host may result in less perturbation (due to different timing) and might make id guessing easier. Ideally, the ids generated by the guest should be used.
One could also argue that we are "violating" the IPv6 protocol under a _strict_ interpretation of the spec. - VLAN header acceleration * Currently virtio does not do vlan header acceleration and instead uses software tagging. One of the first things that the host will do is strip the vlan header out. When passing the packet to a guest, the vlan header is re-inserted into the packet. We can skip all that work if we can pass the vlan data in accelerated format. Then the host will not do any extra work. However, so far, this yielded a very small perf bump (only ~1%). I am still looking into this. - UDP tunnel offload * Similar to vlan acceleration, with this extension we can pass additional data to the host to support GSO with udp tunnel and possibly other encapsulations. This yields a significant performance improvement (still testing remote checksum code). An additional extension that is unfinished (due to still testing for any side-effects) is checksum passthrough to support drivers that set CHECKSUM_COMPLETE. This would eliminate the need for guests to compute the software checksum. This series only takes care of virtio net. I have additional patches for the host side (vhost and tap/macvtap as well as qemu), but wanted to get feedback on the general approach first. Vladislav Yasevich (6): virtio-net: Remove the use the padded vnet_header structure virtio-net: make header length handling uniform virtio_net: Add basic skeleton for handling vnet header extensions. virtio-net: Add support for IPv6 fragment id vnet header extension. virtio-net: Add support for vlan acceleration vnet header extension. virtio-net: Add support for UDP tunnel offload and extension. drivers/net/virtio_net.c| 132 +--- include/linux/skbuff.h | 5 ++ include/linux/virtio_net.h | 91 ++- include/uapi/linux/virtio_net.h | 38 4 files changed, 242 insertions(+), 24 deletions(-)
Re: [PATCH RFC (resend) net-next 5/6] virtio-net: Add support for vlan acceleration vnet header extension.
On April 16, 2017 00:38, Vladislav Yasevich wrote: This extension allows us to pass vlan ID and vlan protocol data to the host hypervisor as part of the vnet header and lets us take advantage of HW accelerated vlan tagging in the host. It requires support in the host to negotiate the feature. When the extension is enabled, the virtio device will enable HW accelerated vlan features. Signed-off-by: Vladislav Yasevich --- drivers/net/virtio_net.c| 17 - include/linux/virtio_net.h | 17 + include/uapi/linux/virtio_net.h | 7 +++ 3 files changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 18eb0dd..696ef4a 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -182,6 +182,7 @@ struct virtio_net_hdr_max { struct virtio_net_hdr_mrg_rxbuf hdr; struct virtio_net_ext_hdr ext_hdr; struct virtio_net_ext_ip6frag ip6f_ext; + struct virtio_net_ext_vlan vlan_ext; }; static inline u8 padded_vnet_hdr(struct virtnet_info *vi) @@ -2276,6 +2277,11 @@ static void virtnet_init_extensions(struct virtio_device *vdev) vi->hdr_len += sizeof(u32); vi->ext_mask |= VIRTIO_NET_EXT_F_IP6FRAG; } + + if (virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD)) { + vi->hdr_len += sizeof(struct virtio_net_ext_vlan); + vi->ext_mask |= VIRTIO_NET_EXT_F_VLAN; + } } #define MIN_MTU ETH_MIN_MTU @@ -2352,6 +2358,14 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM)) dev->features |= NETIF_F_RXCSUM; + if (virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD)) { + dev->features |= NETIF_F_HW_VLAN_CTAG_TX | +NETIF_F_HW_VLAN_CTAG_RX | +NETIF_F_HW_VLAN_STAG_TX | +NETIF_F_HW_VLAN_STAG_RX; + } + + dev->vlan_features = dev->features; /* MTU range: 68 - 65535 */ @@ -2395,7 +2409,8 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) vi->mergeable_rx_bufs = true; - if (virtio_has_feature(vdev, VIRTIO_NET_F_IP6_FRAGID)) + if
(virtio_has_feature(vdev, VIRTIO_NET_F_IP6_FRAGID) || + virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD)) vi->hdr_ext = true; if (vi->hdr_ext) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 3b259dc..e790191 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -113,6 +113,14 @@ static inline int virtio_net_ext_to_skb(struct sk_buff *skb, ptr += sizeof(struct virtio_net_ext_ip6frag); } + if (ext->flags & VIRTIO_NET_EXT_F_VLAN) { + struct virtio_net_ext_vlan *vhdr = + (struct virtio_net_ext_vlan *)ptr; + + __vlan_hwaccel_put_tag(skb, vhdr->vlan_proto, vhdr->vlan_tci); + ptr += sizeof(struct virtio_net_ext_vlan); + } + return 0; } @@ -130,6 +138,15 @@ static inline int virtio_net_ext_from_skb(const struct sk_buff *skb, ext->flags |= VIRTIO_NET_EXT_F_IP6FRAG; Looks like you need to advance ptr here? } + if (ext_mask & VIRTIO_NET_EXT_F_VLAN && skb_vlan_tag_present(skb)) { + struct virtio_net_ext_vlan *vhdr = + (struct virtio_net_ext_vlan *)ptr; + + vlan_get_tag(skb, &vhdr->vlan_tci); + vhdr->vlan_proto = skb->vlan_proto; + ext->flags |= VIRTIO_NET_EXT_F_VLAN; And here?
Thanks + } + return 0; } #endif /* _LINUX_VIRTIO_NET_H */ diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index eac8d94..6125de7 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -57,6 +57,7 @@ * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ #define VIRTIO_NET_F_IP6_FRAGID 24 /* Host supports VLAN acceleration */ +#define VIRTIO_NET_F_VLAN_OFFLOAD 25 /* Host supports VLAN acceleration */ #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ @@ -111,6 +112,7 @@ struct virtio_net_hdr_v1 { */ struct virtio_net_ext_hdr { #define VIRTIO_NET_EXT_F_IP6FRAG (1<<0) +#define VIRTIO_NET_EXT_F_VLAN (1<<1) __u32 flags; __u8 extensions[]; }; @@ -120,6 +122,11 @@ struct virtio_net_ext_ip6frag { __be32 frag_id; }; +struct virtio_net_ext_vlan { + __be16 vlan_tci; + __be16 vlan_proto; +}; + #ifndef VIRTIO_NET_NO_LEGACY /* This header comes first in the scatter-gather list. * For legacy virtio, if
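The bug Jason flags in virtio_net_ext_from_skb() — the write cursor is never advanced after emitting an extension, so a later extension would overwrite an earlier one — is easy to demonstrate in userspace. The structures below mirror the shape of the patch but are local toys in host byte order (the real fields are __be16/__be32):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct ext_ip6frag { uint32_t frag_id; };
struct ext_vlan   { uint16_t vlan_tci; uint16_t vlan_proto; };

#define EXT_F_IP6FRAG (1u << 0)
#define EXT_F_VLAN    (1u << 1)

/* Pack the requested extensions back to back into buf, advancing the
 * cursor by the size of each extension actually emitted - the step the
 * review points out is missing. Returns total bytes written. */
static size_t pack_exts(uint8_t *buf, uint32_t mask,
                        uint32_t frag_id, uint16_t tci, uint16_t proto)
{
    uint8_t *ptr = buf;

    if (mask & EXT_F_IP6FRAG) {
        struct ext_ip6frag e = { .frag_id = frag_id };
        memcpy(ptr, &e, sizeof(e));
        ptr += sizeof(e);           /* without this, vlan lands at offset 0 */
    }
    if (mask & EXT_F_VLAN) {
        struct ext_vlan e = { .vlan_tci = tci, .vlan_proto = proto };
        memcpy(ptr, &e, sizeof(e));
        ptr += sizeof(e);
    }
    return (size_t)(ptr - buf);
}
```

With the advances in place, the vlan extension lands after the 4-byte fragment id, matching the back-to-back layout the cover letter describes; dropping either `ptr +=` reproduces the overwrite.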
Re: [PATCH net-next V2 1/3] tap: use build_skb() for small packet
On August 16, 2017 11:55, Michael S. Tsirkin wrote: On Tue, Aug 15, 2017 at 08:45:20PM -0700, Eric Dumazet wrote: On Fri, 2017-08-11 at 19:41 +0800, Jason Wang wrote: In the past we used tun_alloc_skb(), which calls sock_alloc_send_pskb(), to allocate skbs. This socket based method is not suitable for high speed userspace like virtualization, which usually: - ignores sk_sndbuf (INT_MAX) and expects to receive the packet as fast as possible - doesn't want to be blocked at sendmsg() To eliminate the above overheads, this patch tries to use build_skb() for small packets. We will do this only when the following conditions are all met: - TAP instead of TUN - sk_sndbuf is INT_MAX - caller doesn't want to be blocked - zerocopy is not used - packet size is small enough to use build_skb() Pktgen from guest to host shows ~11% improvement for rx pps of tap: Before: ~1.70Mpps After : ~1.88Mpps What's more important, this makes it possible to implement XDP for tap before creating skbs. Well well well. You do realize that tun_build_skb() is not thread safe? The issue is the alloc frag, isn't it? I guess for now we can limit this to XDP mode only, and just allocate full pages in that mode. Limiting this to XDP mode only does not prevent users from sending packets to the same queue in parallel, I think? Thanks