Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

2016-11-30 Thread Jason Wang



On 2016-12-01 10:48, wangyunjian wrote:

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, November 30, 2016 9:41 PM
To: wangyunjian
Cc: jasow...@redhat.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; 
caihe
Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when 
meet errors

On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:

When we meet an error (err = -EBADFD) from recvmsg(),

How do you get EBADFD? Won't vhost_net_rx_peek_head_len
return 0 in this case, breaking the loop?

We started many guest VMs while attaching/detaching some virtio-net NICs in a
loop.
The soft lockup might happen. The err is -EBADFD.


How did you do the attaching/detaching? AFAIK, -EBADFD can only
happen when you delete the tun device during vhost_net transmission.




message log:
kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! 
[vhost-60898:126093]
call trace:
[]vhost_get_vq_desc+0x1e7/0x984 [vhost]
[]handle_rx+0x226/0x810 [vhost_net]
[]handle_rx_net+0x15/0x20 [vhost_net]
[]vhost_worker+0xfb/0x1e0 [vhost]
[]? vhost_dev_reset_owner+0x50/0x50 [vhost]
[]kthread+0xcf/0xe0
[]? kthread_create_on_node+0x140/0x140
[]ret_from_fork+0x58/0x90
[]? kthread_create_on_node+0x140/0x140


the error handling in vhost
handle_rx() will continue. This will cause a soft CPU lockup in the vhost thread.

Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc128a..edc470b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
pr_debug("Discarded rx packet: "
 " len %d, expected %zd\n", err, sock_len);
vhost_discard_vq_desc(vq, headcount);
+   /* Don't continue to do, when meet errors. */
+   if (err < 0)
+   goto out;

You might get e.g. EAGAIN and I think you need to retry
in this case.


continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
--
1.9.5.msysgit.1
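
(For reference, a minimal sketch of the distinction Michael asks for above; the helper name is hypothetical and this is not the fix that was eventually merged. The point is that a transient error such as -EAGAIN should be retried, while a fatal one such as -EBADFD, seen after the tun device is gone, should break out of the rx loop.)

#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether handle_rx() should stop polling
 * the socket (fatal error such as -EBADFD) or simply retry later
 * (transient error such as -EAGAIN).
 */
static bool vhost_rx_error_is_fatal(int err)
{
	return err < 0 && err != -EAGAIN;
}

In the discard path above, something like "if (vhost_rx_error_is_fatal(err)) goto out;" would then break the loop only for fatal errors instead of for every error.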





Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx

2016-12-01 Thread Jason Wang



On 2016-12-01 17:34, Andrey Konovalov wrote:

This patch changes tun.c to call netif_receive_skb instead of netif_rx
when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
stack exhaustion). The difference between the two is that netif_rx queues
the packet into the backlog, and netif_receive_skb processes the packet
in the current context.

This patch is required for syzkaller [1] to collect coverage from packet
receive paths when a packet is being received through tun (syzkaller collects
coverage per process in the process context).

As mentioned by Eric this change also speeds up tun/tap. As measured by
Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
about 23%, from 700k packets/second to 867k packets/second.

A similar patch was introduced back in 2010 [2, 3], but the author found
out that the patch doesn't help with the task he had in mind (for cgroups
to shape network traffic based on the original process) and decided not to
go further with it. The main concern back then was about possible stack
exhaustion with 4K stacks.

[1] https://github.com/google/syzkaller

[2] https://www.spinics.net/lists/netdev/thrd440.html#130570

[3] https://www.spinics.net/lists/netdev/msg130570.html

Signed-off-by: Andrey Konovalov <andreyk...@google.com>
---

Changes since v1:
- incorporate Eric's note about speed improvements in commit description
- use netif_receive_skb if CONFIG_4KSTACKS is not enabled

  drivers/net/tun.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8093e39..d310b13 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1304,7 +1304,13 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
  
  	rxhash = skb_get_hash(skb);

+#ifndef CONFIG_4KSTACKS
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+#else
netif_rx_ni(skb);
+#endif
  
  	stats = get_cpu_ptr(tun->pcpu_stats);

	u64_stats_update_begin(&stats->syncp);


I get +20% tx pps from guest with this patch.

Acked-by: Jason Wang <jasow...@redhat.com>



Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

2016-11-30 Thread Jason Wang



On 2016-12-01 11:21, Michael S. Tsirkin wrote:

On Thu, Dec 01, 2016 at 02:48:59AM +, wangyunjian wrote:

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, November 30, 2016 9:41 PM
To: wangyunjian
Cc: jasow...@redhat.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; 
caihe
Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when 
meet errors

On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:

When we meet an error (err = -EBADFD) from recvmsg(),

How do you get EBADFD? Won't vhost_net_rx_peek_head_len
return 0 in this case, breaking the loop?

We started many guest VMs while attaching/detaching some virtio-net NICs in a
loop.
The soft lockup might happen. The err is -EBADFD.


OK, I'd like to figure out what happened here. Why don't
we get 0 when we peek at the head?

EBADFD is from here:
 struct tun_struct *tun = __tun_get(tfile);
...
 if (!tun)
 return -EBADFD;

but then:
static int tun_peek_len(struct socket *sock)
{

...

 struct tun_struct *tun;
...
 
 tun = __tun_get(tfile);

 if (!tun)
 return 0;


so peek len should return 0.

then while will exit:
 while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
...



Consider this case: the user does ip link del link tap0 before recvmsg() but
after tun_peek_len()?



message log:
kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! 
[vhost-60898:126093]
call trace:
[]vhost_get_vq_desc+0x1e7/0x984 [vhost]
[]handle_rx+0x226/0x810 [vhost_net]
[]handle_rx_net+0x15/0x20 [vhost_net]
[]vhost_worker+0xfb/0x1e0 [vhost]
[]? vhost_dev_reset_owner+0x50/0x50 [vhost]
[]kthread+0xcf/0xe0
[]? kthread_create_on_node+0x140/0x140
[]ret_from_fork+0x58/0x90
[]? kthread_create_on_node+0x140/0x140

So somehow you keep seeing something in tun when we peek.
IMO this is the problem we should try to fix.
Could you try debugging the root cause here?


the error handling in vhost
handle_rx() will continue. This will cause a soft CPU lockup in the vhost thread.

Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc128a..edc470b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
pr_debug("Discarded rx packet: "
 " len %d, expected %zd\n", err, sock_len);
vhost_discard_vq_desc(vq, headcount);
+   /* Don't continue to do, when meet errors. */
+   if (err < 0)
+   goto out;

You might get e.g. EAGAIN and I think you need to retry
in this case.


continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
--
1.9.5.msysgit.1





Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-02 Thread Jason Wang



On 2017-01-03 03:44, John Fastabend wrote:

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when an XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

There is a problem with this patch as is. When the xdp prog is loaded
   the old buffers without the 256B headers need to be flushed so that
   the bpf prog has the necessary headroom. This patch does this by
   calling the virtqueue_detach_unused_buf() and followed by the
   virtnet_set_queues() call to reinitialize the buffers. However I
   don't believe this is safe per comment in virtio_ring this API
   is not valid on an active queue and the only thing we have done
   here is napi_disable/napi_enable wrappers which doesn't do anything
   to the emulation layer.

   So the RFC is really to find the best solution to this problem.
   A couple things come to mind, (a) always allocate the necessary
   headroom but this is a bit of a waste (b) add some bit somewhere
   to check if the buffer has headroom but this would mean XDP programs
   would be broke for a cycle through the ring, (c) figure out how
   to deactivate a queue, free the buffers and finally reallocate.
   I think (c) is the best choice for now but I'm not seeing the
   API to do this so virtio/qemu experts anyone know off-hand
   how to make this work? I started looking into the PCI callbacks
   reset() and virtio_device_ready() or possibly hitting the right
   set of bits with vp_set_status() but my first attempt just hung
   the device.


Hi John:

AFAIK, disabling a specific queue was supported only by virtio 1.0
through the queue_enable field in the PCI common cfg. But unfortunately, qemu
does not emulate this at all and legacy devices do not even support
this. So the safe way is probably to reset the device and redo the
initialization here.
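
(A rough sketch of that reset-and-reinit flow, just to illustrate the idea. virtnet_free_unused_bufs() and virtnet_refill_with_headroom() are hypothetical stand-ins for the driver's existing teardown/refill paths; only config->reset(), virtio_device_ready() and the netif_device_* helpers are core kernel APIs.)

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/netdevice.h>

/* Hypothetical helpers, standing in for existing virtio_net.c paths. */
static void virtnet_free_unused_bufs(struct virtio_device *vdev);
static void virtnet_refill_with_headroom(struct virtio_device *vdev);

/* Sketch: reset the device so it forgets all virtqueues, drop the old
 * headroom-less rx buffers, refill with XDP headroom, then bring the
 * device back to DRIVER_OK.
 */
static void virtnet_reset_for_xdp_headroom(struct virtio_device *vdev,
					   struct net_device *dev)
{
	netif_device_detach(dev);
	vdev->config->reset(vdev);

	virtnet_free_unused_bufs(vdev);		/* hypothetical */
	virtnet_refill_with_headroom(vdev);	/* hypothetical */

	virtio_device_ready(vdev);
	netif_device_attach(dev);
}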




Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |  106 +++---
  1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..fcc5bd7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -159,6 +159,9 @@ struct virtnet_info {
/* Ethtool settings */
u8 duplex;
u32 speed;
+
+   /* Headroom allocated in RX Queue */
+   unsigned int headroom;
  };
  
  struct padded_vnet_hdr {

@@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
}
  
  	if (vi->mergeable_rx_bufs) {

+   xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
/* Zero header and leave csum up to XDP layers */
hdr = xdp->data;
memset(hdr, 0, vi->hdr_len);
@@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
num_sg = 2;
sg_init_table(sq->sg, 2);
sg_set_buf(sq->sg, hdr, vi->hdr_len);
-   skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - 
xdp->data);


vi->headroom looks suspicious, should it be xdp->data - xdp->data_hard_start?


}
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
   data, GFP_ATOMIC);
@@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
   struct bpf_prog *xdp_prog,
   void *data, int len)
  {
-   int hdr_padded_len;
struct xdp_buff xdp;
-   void *buf;
unsigned int qp;
u32 act;
  
+

if (vi->mergeable_rx_bufs) {
-   hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-   xdp.data = data + hdr_padded_len;
+   int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+   /* Allow consuming headroom but reserve enough space to push
+* the descriptor on if we get an XDP_TX return code.
+*/
+   xdp.data_hard_start = data - vi->headroom + desc_room;
+   xdp.data = data + desc_room;
xdp.data_end = xdp.data + (len - vi->hdr_len);
-   buf = data;
} else { /* small buffers */
struct sk_buff *skb = data;
  
-		xdp.data = skb->data;

+   xdp.data_hard_start = skb->data;
+   xdp.data = skb->data + vi->headroom;
xdp.data_end = xdp.data + len;
-   buf = skb->data;
}
  
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);

switch (act) {
case XDP_PASS:
+   if (!vi->mergeable_rx_bufs)
+   __skb_pull((struct sk_buff *) data,
+  xdp.data - xdp.data_hard_start);


Instead of doing things here and virtnet_xdp_xmit(). How about 

Re: [net PATCH] net: virtio: cap mtu when XDP programs are running

2017-01-02 Thread Jason Wang



On 2017-01-03 06:30, John Fastabend wrote:

XDP programs can not consume multiple pages so we cap the MTU to
avoid this case. Virtio-net however only checks the MTU at XDP
program load and does not block MTU changes after the program
has loaded.

This patch sets/clears the max_mtu value at XDP load/unload time.

Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..783e842 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
  };
  
+#define MIN_MTU ETH_MIN_MTU

+#define MAX_MTU ETH_MAX_MTU
+
  static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
  {
unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
@@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
}
+   dev->max_mtu = max_sz;
+   } else {
+   dev->max_mtu = ETH_MAX_MTU;


Or use ETH_DATA_LEN here, considering we only allocate a size of
GOOD_PACKET_LEN for each small buffer?


Thanks


}
  
  	vi->xdp_queue_pairs = xdp_qp;

@@ -2133,9 +2139,6 @@ static bool virtnet_validate_features(struct 
virtio_device *vdev)
return true;
  }
  
-#define MIN_MTU ETH_MIN_MTU

-#define MAX_MTU ETH_MAX_MTU
-
  static int virtnet_probe(struct virtio_device *vdev)
  {
int i, err;





Re: [PATCH net 9/9] virtio-net: XDP support for small buffers

2017-01-02 Thread Jason Wang



On 2017-01-03 06:43, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
small receive buffers untouched. This will confuse the user who wants to
set XDP but uses small buffers. Rather than forbidding XDP in small buffer
mode, let's make it work. XDP then can only work at skb->data since
virtio-net creates skbs during refill; this is suboptimal and could
be optimized in the future.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 112 ---
  1 file changed, 87 insertions(+), 25 deletions(-)


Hi Jason,

I was doing some more testing on this. What do you think about doing this
so that free_unused_bufs() handles the buffer free with dev_kfree_skb()
instead of put_page() in small receive mode? Seems more correct to me.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 783e842..27ff76c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info 
*vi)

  static bool is_xdp_queue(struct virtnet_info *vi, int q)
  {
+   /* For small receive mode always use kfree_skb variants */
+   if (!vi->mergeable_rx_bufs)
+   return false;
+
 if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
 return false;
 else if (q < vi->curr_queue_pairs)


patch is untested just spotted doing code review.

Thanks,
John


We probably need a better name for this function.

Acked-by: Jason Wang <jasow...@redhat.com>



Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-04 Thread Jason Wang



On 2017-01-05 02:58, John Fastabend wrote:

[...]


@@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
		       struct bpf_prog *xdp_prog,
		       void *data, int len)
 {
-	int hdr_padded_len;
 	struct xdp_buff xdp;
-	void *buf;
 	unsigned int qp;
 	u32 act;
 
+
 	if (vi->mergeable_rx_bufs) {
-		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-		xdp.data = data + hdr_padded_len;
+		int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+		/* Allow consuming headroom but reserve enough space to push
+		 * the descriptor on if we get an XDP_TX return code.
+		 */
+		xdp.data_hard_start = data - vi->headroom + desc_room;
+		xdp.data = data + desc_room;
 		xdp.data_end = xdp.data + (len - vi->hdr_len);
-		buf = data;
 	} else { /* small buffers */
 		struct sk_buff *skb = data;
 
-		xdp.data = skb->data;
+		xdp.data_hard_start = skb->data;
+		xdp.data = skb->data + vi->headroom;
 		xdp.data_end = xdp.data + len;
-		buf = skb->data;
 	}
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
+		if (!vi->mergeable_rx_bufs)
+			__skb_pull((struct sk_buff *) data,
+				   xdp.data - xdp.data_hard_start);

Instead of doing things here and in virtnet_xdp_xmit(), how about always making
skb->data point to the buffer head like:

1) reserve headroom in add_recvbuf_small()
2) skb_push(xdp->data - xdp->data_hard_start, skb) if we detect xdp->data was
modified after bpf_prog_run_xdp()

Then there's no special code in either XDP_PASS or XDP_TX?


Alternatively moving the pull into the receive_small XDP handler also
removes the special case. I'll submit a patch shortly; let me know what
you think.



I'm ok with this.

Thanks
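
(A minimal sketch of the idea above, under the assumption that add_recvbuf_small() reserved vi->headroom in front of skb->data so that xdp.data_hard_start == skb->data before the program runs; the helper name is made up and this is not the code that was merged.)

#include <linux/skbuff.h>
#include <linux/filter.h>

/* After bpf_prog_run_xdp(), re-align skb->data to wherever the program
 * left xdp->data, so neither XDP_PASS nor XDP_TX needs a special case.
 */
static void virtnet_realign_skb_after_xdp(struct sk_buff *skb,
					  struct xdp_buff *xdp)
{
	/* skb->data started at xdp->data_hard_start; pull it forward by
	 * the (possibly adjusted) offset of the packet start.
	 */
	__skb_pull(skb, xdp->data - xdp->data_hard_start);
}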


Re: [net PATCH 1/2] virtio_net: cap mtu when XDP programs are running

2017-01-05 Thread Jason Wang



On 2017-01-05 11:18, Michael S. Tsirkin wrote:

On Wed, Jan 04, 2017 at 07:11:18PM -0800, John Fastabend wrote:

XDP programs can not consume multiple pages so we cap the MTU to
avoid this case. Virtio-net however only checks the MTU at XDP
program load and does not block MTU changes after the program
has loaded.

Do drivers really have to tweak max mtu all the time?
Seems strange, I would say drivers just report device caps
and net core enforces rules.
Can't net core do these checks?


I think this needs host co-operation; at least this patch prevents the user
from misconfiguring the MTU in the guest.


Re: [net PATCH] net: virtio: cap mtu when XDP programs are running

2017-01-04 Thread Jason Wang



On 2017-01-05 02:57, John Fastabend wrote:

[...]


On 2017-01-04 00:48, John Fastabend wrote:

On 17-01-02 10:14 PM, Jason Wang wrote:

On 2017-01-03 06:30, John Fastabend wrote:

XDP programs can not consume multiple pages so we cap the MTU to
avoid this case. Virtio-net however only checks the MTU at XDP
program load and does not block MTU changes after the program
has loaded.

This patch sets/clears the max_mtu value at XDP load/unload time.

Signed-off-by: John Fastabend <john.r.fastab...@intel.com>
---

[...]


OK so this logic is a bit too simple. When it resets the max_mtu I guess it
needs to read the mtu via

 virtio_cread16(vdev, ...)

or we may break the negotiated mtu.

Yes, this is a problem (even using ETH_MAX_MTU). We may need a method to notify
the device about the MTU in this case, which is not supported by virtio now.

Note this is not really an XDP-specific problem. The guest can change the MTU
after init time even without XDP, which I assume should ideally result in a
notification if the MTU is negotiated.


Yes, Michael, do you think we need to add some mechanism to notify the host
about MTU changes in this case?


Thanks


Re: [PATCH net-next V2 3/3] tun: rx batching

2017-01-03 Thread Jason Wang



On 2017-01-03 21:33, Stefan Hajnoczi wrote:

On Wed, Dec 28, 2016 at 04:09:31PM +0800, Jason Wang wrote:

+static int tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
+ int more)
+{
+   struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+   struct sk_buff_head process_queue;
+   int qlen;
+   bool rcv = false;
+
+   spin_lock(&queue->lock);

Should this be spin_lock_bh()?  Below and in tun_get_user() there are
explicit local_bh_disable() calls so I guess BHs can interrupt us here
and this would deadlock.


sk_write_queue is accessed only in this function, which runs in
process context, so no need for spin_lock_bh() here.


Re: [PATCH net 9/9] virtio-net: XDP support for small buffers

2017-01-03 Thread Jason Wang



On 2017-01-04 00:40, John Fastabend wrote:

On 17-01-02 10:16 PM, Jason Wang wrote:


On 2017-01-03 06:43, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
small receive buffers untouched. This will confuse the user who wants to
set XDP but uses small buffers. Rather than forbidding XDP in small buffer
mode, let's make it work. XDP then can only work at skb->data since
virtio-net creates skbs during refill; this is suboptimal and could
be optimized in the future.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
   drivers/net/virtio_net.c | 112 
---
   1 file changed, 87 insertions(+), 25 deletions(-)


Hi Jason,

I was doing some more testing on this. What do you think about doing this
so that free_unused_bufs() handles the buffer free with dev_kfree_skb()
instead of put_page() in small receive mode? Seems more correct to me.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 783e842..27ff76c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info
*vi)

   static bool is_xdp_queue(struct virtnet_info *vi, int q)
   {
+   /* For small receive mode always use kfree_skb variants */
+   if (!vi->mergeable_rx_bufs)
+   return false;
+
  if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
  return false;
  else if (q < vi->curr_queue_pairs)


patch is untested just spotted doing code review.

Thanks,
John

We probably need a better name for this function.

Acked-by: Jason Wang <jasow...@redhat.com>


How about is_xdp_raw_buffer_queue()?

I'll submit a proper patch today.


Sounds good, thanks.



Re: [net PATCH] net: virtio: cap mtu when XDP programs are running

2017-01-03 Thread Jason Wang




On 2017-01-04 00:48, John Fastabend wrote:

On 17-01-02 10:14 PM, Jason Wang wrote:


On 2017-01-03 06:30, John Fastabend wrote:

XDP programs can not consume multiple pages so we cap the MTU to
avoid this case. Virtio-net however only checks the MTU at XDP
program load and does not block MTU changes after the program
has loaded.

This patch sets/clears the max_mtu value at XDP load/unload time.

Signed-off-by: John Fastabend <john.r.fastab...@intel.com>
---
   drivers/net/virtio_net.c |9 ++---
   1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..783e842 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
	.set_settings = virtnet_set_settings,
 };
 
+#define MIN_MTU ETH_MIN_MTU
+#define MAX_MTU ETH_MAX_MTU
+
 static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
@@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		virtnet_set_queues(vi, curr_qp);
 		return PTR_ERR(prog);
 	}
+	dev->max_mtu = max_sz;
+	} else {
+		dev->max_mtu = ETH_MAX_MTU;

Or use ETH_DATA_LEN here, considering we only allocate a size of GOOD_PACKET_LEN
for each small buffer?

Thanks

OK so this logic is a bit too simple. When it resets the max_mtu I guess it
needs to read the mtu via

virtio_cread16(vdev, ...)

or we may break the negotiated mtu.


Yes, this is a problem (even using ETH_MAX_MTU). We may need a method to
notify the device about the MTU in this case, which is not supported by
virtio now.


As for capping it at GOOD_PACKET_LEN this has the nice benefit of avoiding any
underestimates in EWMA predictions because it appears min estimates are capped
at GOOD_PACKET_LEN via get_mergeable_buf_len().


There seems to be some misunderstanding here; I meant only using
GOOD_PACKET_LEN for small buffers (which do not use EWMA).


Thanks



Thanks,
John





Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-03 Thread Jason Wang



On 2017-01-04 00:54, John Fastabend wrote:

+/* Changing the headroom in buffers is a disruptive operation because
+ * existing buffers must be flushed and reallocated. This will happen
+ * when a xdp program is initially added or xdp is disabled by removing
+ * the xdp program.
+ */

We probably need to reset the device here, but maybe Michael has more ideas. And if
we do this, another interesting thing to do is to disable EWMA and always use a
single page for each packet; this could almost eliminate linearizing.

Well, with a normal 1500 MTU we should not hit the linearizing case, right?


My reply may not have been clear; for 1500 I mean for small buffers only.

Thanks


  The
question is should we cap the MTU at GOOD_PACKET_LEN vs the current cap of
(PAGE_SIZE - overhead).





Re: [RFC PATCH] virtio_net: XDP support for adjust_head

2017-01-03 Thread Jason Wang



On 2017-01-04 00:57, John Fastabend wrote:

+/* Changing the headroom in buffers is a disruptive operation because
+ * existing buffers must be flushed and reallocated. This will happen
+ * when a xdp program is initially added or xdp is disabled by removing
+ * the xdp program.
+ */

We probably need to reset the device here, but maybe Michael has more ideas. And if
we do this, another interesting thing to do is to disable EWMA and always use a
single page for each packet; this could almost eliminate linearizing.

Well, with a normal 1500 MTU we should not hit the linearizing case, right? The
question is should we cap the MTU at GOOD_PACKET_LEN vs the current cap of
(PAGE_SIZE - overhead).

Sorry for responding to my own post with a bit more detail. I don't really like
going to a page for each packet because we end up with double the pages in use
for the "normal" 1500 MTU case. We could make the xdp allocation scheme smarter
and allocate a page per packet when the MTU is greater than 2k instead of using the
EWMA, but I would push those types of things to net-next and live with the
linearizing behavior for now, or cap the MTU.



Yes, agree.

Thanks


[PATCH net-next V3 0/3] vhost_net tx batching

2016-12-29 Thread Jason Wang
Hi:

This series tries to implement tx batching support for vhost. This was
done by using MSG_MORE as a hint for the underlying socket. The backend
(e.g. tap) can then batch the packets temporarily in a list and
submit them all once the number of batched packets exceeds a limit.

Tests show obvious improvement on guest pktgen over
mlx4(noqueue) on host:

Mpps  -+%
rx_batched=0  0.90  +0%
rx_batched=4  0.97  +7.8%
rx_batched=8  0.97  +7.8%
rx_batched=16 0.98  +8.9%
rx_batched=32 1.03  +14.4%
rx_batched=48 1.09  +21.1%
rx_batched=64 1.02  +13.3%

Changes from V2:
- remove useless queue limitation check (and we don't drop any packet now)

Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceeds max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c | 50 --
 drivers/vhost/net.c   | 23 ---
 drivers/vhost/vhost.c |  8 ++--
 3 files changed, 70 insertions(+), 11 deletions(-)

-- 
2.7.4



[PATCH net-next V3 3/3] tun: rx batching

2016-12-29 Thread Jason Wang
We can only process one packet at a time during sendmsg(). This often
leads to bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting packets to the host network
stack. This is done by accepting MSG_MORE as a hint from the
sendmsg() caller: if it is set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

  Mpps  -+%
rx_batched=0  0.90  +0%
rx_batched=4  0.97  +7.8%
rx_batched=8  0.97  +7.8%
rx_batched=16 0.98  +8.9%
rx_batched=32 1.03  +14.4%
rx_batched=48 1.09  +21.1%
rx_batched=64 1.02  +13.3%

The maximum number of batched packets is specified through a module
parameter.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 50 --
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..a268ed9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
 
 #include 
 
+static int rx_batched;
+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */
 
@@ -522,6 +526,7 @@ static void tun_queue_purge(struct tun_file *tfile)
	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
kfree_skb(skb);
 
+   skb_queue_purge(&tfile->sk.sk_write_queue);
	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1145,36 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
*tfile,
return skb;
 }
 
+static void tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
+  int more)
+{
+   struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+   struct sk_buff_head process_queue;
+   int qlen;
+   bool rcv = false;
+
+   spin_lock(&queue->lock);
+   qlen = skb_queue_len(queue);
+   __skb_queue_tail(queue, skb);
+   if (!more || qlen == rx_batched) {
+   __skb_queue_head_init(&process_queue);
+   skb_queue_splice_tail_init(queue, &process_queue);
+   rcv = true;
+   }
+   spin_unlock(&queue->lock);
+
+   if (rcv) {
+   local_bh_disable();
+   while ((skb = __skb_dequeue(&process_queue)))
+   netif_receive_skb(skb);
+   local_bh_enable();
+   }
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
-   int noblock)
+   int noblock, bool more)
 {
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
@@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
 
rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-   local_bh_disable();
-   netif_receive_skb(skb);
-   local_bh_enable();
+   if (!rx_batched) {
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+   } else {
+   tun_rx_batched(tfile, skb, more);
+   }
 #else
netif_rx_ni(skb);
 #endif
@@ -1312,7 +1348,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
if (!tun)
return -EBADFD;
 
-   result = tun_get_user(tun, tfile, NULL, from, file->f_flags & 
O_NONBLOCK);
+   result = tun_get_user(tun, tfile, NULL, from,
+ file->f_flags & O_NONBLOCK, false);
 
tun_put(tun);
return result;
@@ -1570,7 +1607,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr 
*m, size_t total_len)
return -EBADFD;
 
	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-  m->msg_flags & MSG_DONTWAIT);
+  m->msg_flags & MSG_DONTWAIT,
+  m->msg_flags & MSG_MORE);
tun_put(tun);
return ret;
 }
-- 
2.7.4



[PATCH net-next V3 1/3] vhost: better detection of available buffers

2016-12-29 Thread Jason Wang
This patch tries to do several tweaks on vhost_vq_avail_empty() for
better performance:

- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
  step.

This patch is needed for batching support, which needs to peek whether
or not there are still available buffers in the ring.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
__virtio16 avail_idx;
int r;
 
+   if (vq->avail_idx != vq->last_avail_idx)
+   return false;
+
	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-   if (r)
+   if (unlikely(r))
return false;
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+   return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.7.4



[PATCH net-next V3 2/3] vhost_net: tx batching

2016-12-29 Thread Jason Wang
This patch tries to utilize tuntap rx batching by peeking at the tx
virtqueue during transmission: if there are more available buffers in
the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap)
to batch the packets.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+
+   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+   == nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
/* If more outstanding DMAs, queue the work.
 * Handle upend_idx wrap around
 */
-   if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
- % UIO_MAXIOV == nvq->done_idx))
+   if (unlikely(vhost_exceeds_maxpend(net)))
break;
 
head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
+
+   total_len += len;
+   if (total_len < VHOST_NET_WEIGHT &&
+   !vhost_vq_avail_empty(&net->dev, vq) &&
+   likely(!vhost_exceeds_maxpend(net))) {
+   msg.msg_flags |= MSG_MORE;
+   } else {
+   msg.msg_flags &= ~MSG_MORE;
+   }
+
/* TODO: Check specific error and bomb out unless ENOBUFS? */
	err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
	vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
vhost_zerocopy_signal_used(net, vq);
-   total_len += len;
vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
	vhost_poll_queue(&vq->poll);
-- 
2.7.4



Re: [PATCH net-next V2 3/3] tun: rx batching

2016-12-29 Thread Jason Wang



On 2016-12-30 00:35, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Wed, 28 Dec 2016 16:09:31 +0800


+   spin_lock(&queue->lock);
+   qlen = skb_queue_len(queue);
+   if (qlen > rx_batched)
+   goto drop;
+   __skb_queue_tail(queue, skb);
+   if (!more || qlen + 1 > rx_batched) {
+   __skb_queue_head_init(&process_queue);
+   skb_queue_splice_tail_init(queue, &process_queue);
+   rcv = true;
+   }
+   spin_unlock(&queue->lock);

Since you always clear the 'queue' when you insert the skb that hits
the limit, I don't see how the "goto drop" path can be possibly taken.


True, will fix this.

Thanks


Re: [PATCH net-next V3 3/3] tun: rx batching

2017-01-02 Thread Jason Wang



On 2017-01-01 01:31, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Fri, 30 Dec 2016 13:20:51 +0800


@@ -1283,10 +1314,15 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
  
  	rxhash = skb_get_hash(skb);

+
  #ifndef CONFIG_4KSTACKS
-   local_bh_disable();
-   netif_receive_skb(skb);
-   local_bh_enable();
+   if (!rx_batched) {
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+   } else {
+   tun_rx_batched(tfile, skb, more);
+   }
  #else
netif_rx_ni(skb);
  #endif

If rx_batched has been set, and we are talking to clients not using
this new MSG_MORE facility (or such clients don't have multiple TX
packets to send to you, thus MSG_MORE is often clear), you are doing a
lot more work per-packet than the existing code.

You take the queue lock, you test state, you splice into a local queue
on the stack, then you walk that local stack queue to submit just one
SKB to netif_receive_skb().

I think you want to streamline this sequence in such cases so that the
cost before and after is similar if not equivalent.


Yes, so I will do a skb_queue_empty() check if !MSG_MORE and call 
netif_receive_skb() immediately in this case. This can save the wasted 
effort.


Thanks


Re: [PATCH net-next V3 3/3] tun: rx batching

2017-01-02 Thread Jason Wang



On 2017-01-01 05:03, Stephen Hemminger wrote:

On Fri, 30 Dec 2016 13:20:51 +0800
Jason Wang <jasow...@redhat.com> wrote:


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..a268ed9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
  
  #include 
  
+static int rx_batched;

+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
  /* Uncomment to enable debugging */

I like the concept of rx batching. But controlling it via a module parameter
is one of the worst API choices.  Ethtool would be better to use because that is
how other network devices control batching.

If you do ethtool, you could even extend it to have an number of packets
and max latency value.


Right, this is better (I believe you mean rx-frames). For rx-usecs, we 
could do it on top in the future.


Thanks


Re: [PATCH V4 net-next 3/3] tun: rx batching

2017-01-08 Thread Jason Wang



On 2017-01-07 03:47, Michael S. Tsirkin wrote:

+static int tun_get_coalesce(struct net_device *dev,
+   struct ethtool_coalesce *ec)
+{
+   struct tun_struct *tun = netdev_priv(dev);
+
+   ec->rx_max_coalesced_frames = tun->rx_batched;
+
+   return 0;
+}
+
+static int tun_set_coalesce(struct net_device *dev,
+   struct ethtool_coalesce *ec)
+{
+   struct tun_struct *tun = netdev_priv(dev);
+
+   if (ec->rx_max_coalesced_frames > NAPI_POLL_WEIGHT)
+   return -EINVAL;

So what should userspace do? Keep trying until it succeeds?
I think it's better to just use NAPI_POLL_WEIGHT instead and DTRT here.



Well, looking at how set_coalesce is implemented in other drivers,
-EINVAL is usually used when the user gives a value that exceeds the
limit. For tuntap, what is missing here is probably just
documentation for coalescing in tuntap.txt (or extending ethtool to return
the max value). This seems much better than silently reducing the value to
the limit.


Thanks


Re: [PATCH V4 net-next 1/3] vhost: better detection of available buffers

2017-01-08 Thread Jason Wang



On 2017-01-07 03:55, Michael S. Tsirkin wrote:

On Fri, Jan 06, 2017 at 10:13:15AM +0800, Jason Wang wrote:

This patch tries to do several tweaks on vhost_vq_avail_empty() for
better performance:

- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
   step.

This patch is needed for batching support, which needs to peek whether
or not there are still available buffers in the ring.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/vhost.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
__virtio16 avail_idx;
int r;
  
+	if (vq->avail_idx != vq->last_avail_idx)

+   return false;
+
	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-   if (r)
+   if (unlikely(r))
return false;
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
  
-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;

+   return vq->avail_idx == vq->last_avail_idx;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

So again, this did not address the issue I pointed out in v1:
if we have 1 buffer in RX queue and
that is not enough to store the whole packet,
vhost_vq_avail_empty returns false, then we re-read
the descriptors again and again.

You have saved a single index access but not the more expensive
descriptor access.


It seems not; if I understand the code correctly, in this case
get_rx_bufs() will return zero, and we will try to enable the rx kick and
exit the loop.


Thanks



[PATCH V4 net-next 3/3] tun: rx batching

2017-01-05 Thread Jason Wang
We can only process one packet at a time during sendmsg(). This often
leads to bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting packets to the host network
stack. This is done by accepting MSG_MORE as a hint from the
sendmsg() caller: if it is set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

 Mpps  -+%
rx-frames = 0    0.91  +0%
rx-frames = 4    1.00  +9.8%
rx-frames = 8    1.00  +9.8%
rx-frames = 16   1.01  +10.9%
rx-frames = 32   1.07  +17.5%
rx-frames = 48   1.07  +17.5%
rx-frames = 64   1.08  +18.6%
rx-frames = 64 (no MSG_MORE) 0.91  +0%

Users are allowed to change the per-device number of batched packets through
ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as an upper limit
to prevent bh from being disabled for too long.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 76 ++-
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..6c93926 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -218,6 +218,7 @@ struct tun_struct {
struct list_head disabled;
void *security;
u32 flow_count;
+   u32 rx_batched;
struct tun_pcpu_stats __percpu *pcpu_stats;
 };
 
@@ -522,6 +523,7 @@ static void tun_queue_purge(struct tun_file *tfile)
	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
kfree_skb(skb);
 
+   skb_queue_purge(&tfile->sk.sk_write_queue);
	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1142,45 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
*tfile,
return skb;
 }
 
+static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
+  struct sk_buff *skb, int more)
+{
+   struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+   struct sk_buff_head process_queue;
+   u32 rx_batched = tun->rx_batched;
+   bool rcv = false;
+
+   if (!rx_batched || (!more && skb_queue_empty(queue))) {
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+   return;
+   }
+
+   spin_lock(&queue->lock);
+   if (!more || skb_queue_len(queue) == rx_batched) {
+   __skb_queue_head_init(&process_queue);
+   skb_queue_splice_tail_init(queue, &process_queue);
+   rcv = true;
+   } else {
+   __skb_queue_tail(queue, skb);
+   }
+   spin_unlock(&queue->lock);
+
+   if (rcv) {
+   struct sk_buff *nskb;
+   local_bh_disable();
+   while ((nskb = __skb_dequeue(&process_queue)))
+   netif_receive_skb(nskb);
+   netif_receive_skb(skb);
+   local_bh_enable();
+   }
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
-   int noblock)
+   int noblock, bool more)
 {
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
@@ -1283,10 +1320,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
 
rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-   local_bh_disable();
-   netif_receive_skb(skb);
-   local_bh_enable();
+   tun_rx_batched(tun, tfile, skb, more);
 #else
netif_rx_ni(skb);
 #endif
@@ -1312,7 +1348,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
if (!tun)
return -EBADFD;
 
-   result = tun_get_user(tun, tfile, NULL, from, file->f_flags & 
O_NONBLOCK);
+   result = tun_get_user(tun, tfile, NULL, from,
+ file->f_flags & O_NONBLOCK, false);
 
tun_put(tun);
return result;
@@ -1570,7 +1607,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr 
*m, size_t total_len)
return -EBADFD;
 
	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-  m->msg_flags & MSG_DONTWAIT);
+  m->msg_flags & MSG_DONTWAIT,
+  m->msg_flags & MSG_MORE);
tun_put(tun);
return ret;
 }
@@ -1771,6 +1809,7 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
tun->align = NET_SKB_PAD;
tun->filter_attached = false;
tun->sndbuf = tfile->socket.sk->sk_sndbuf;
+   

[PATCH V4 net-next 1/3] vhost: better detection of available buffers

2017-01-05 Thread Jason Wang
This patch tries to do several tweaks on vhost_vq_avail_empty() for
better performance:

- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
  step.

This patch is needed for batching support, which needs to peek whether
or not there are still available buffers in the ring.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
__virtio16 avail_idx;
int r;
 
+   if (vq->avail_idx != vq->last_avail_idx)
+   return false;
+
	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-   if (r)
+   if (unlikely(r))
return false;
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+   return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.7.4



[PATCH V4 net-next 0/3] vhost_net tx batching

2017-01-05 Thread Jason Wang
Hi:

This series tries to implement tx batching support for vhost. This was
done by using MSG_MORE as a hint for the underlying socket. The backend
(e.g. tap) can then batch the packets temporarily in a list and
submit them all once the number of batched packets exceeds a limit.

Tests show obvious improvement on guest pktgen over
mlx4(noqueue) on host:

 Mpps  -+%
rx-frames = 0    0.91  +0%
rx-frames = 4    1.00  +9.8%
rx-frames = 8    1.00  +9.8%
rx-frames = 16   1.01  +10.9%
rx-frames = 32   1.07  +17.5%
rx-frames = 48   1.07  +17.5%
rx-frames = 64   1.08  +18.6%
rx-frames = 64 (no MSG_MORE) 0.91  +0%

Changes from V3:
- use ethtool instead of module parameter to control the maximum
  number of batched packets
- avoid overhead when MSG_MORE were not set and no packet queued

Changes from V2:
- remove useless queue limitation check (and we don't drop any packet now)

Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceeds max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c | 76 +++
 drivers/vhost/net.c   | 23 ++--
 drivers/vhost/vhost.c |  8 --
 3 files changed, 96 insertions(+), 11 deletions(-)

-- 
2.7.4



[PATCH V4 net-next 2/3] vhost_net: tx batching

2017-01-05 Thread Jason Wang
This patch tries to utilize tuntap rx batching by peeking at the tx
virtqueue during transmission: if there are more available buffers in
the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap)
to batch the packets.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+
+   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+   == nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
/* If more outstanding DMAs, queue the work.
 * Handle upend_idx wrap around
 */
-   if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
- % UIO_MAXIOV == nvq->done_idx))
+   if (unlikely(vhost_exceeds_maxpend(net)))
break;
 
head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
+
+   total_len += len;
+   if (total_len < VHOST_NET_WEIGHT &&
+   !vhost_vq_avail_empty(&net->dev, vq) &&
+   likely(!vhost_exceeds_maxpend(net))) {
+   msg.msg_flags |= MSG_MORE;
+   } else {
+   msg.msg_flags &= ~MSG_MORE;
+   }
+
/* TODO: Check specific error and bomb out unless ENOBUFS? */
	err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
	vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
vhost_zerocopy_signal_used(net, vq);
-   total_len += len;
vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
	vhost_poll_queue(&vq->poll);
-- 
2.7.4



[PATCH net 9/9] virtio-net: XDP support for small buffers

2016-12-23 Thread Jason Wang
Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
small receive buffers untouched. This will confuse the user who wants to
set XDP but uses small buffers. Rather than forbidding XDP in small buffer
mode, let's make it work. XDP then can only work at skb->data since
virtio-net creates skbs during refill; this is suboptimal and could
be optimized in the future.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 112 ---
 1 file changed, 87 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e53365a8..5deeda6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -333,9 +333,9 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 static void virtnet_xdp_xmit(struct virtnet_info *vi,
 struct receive_queue *rq,
 struct send_queue *sq,
-struct xdp_buff *xdp)
+struct xdp_buff *xdp,
+void *data)
 {
-   struct page *page = virt_to_head_page(xdp->data);
struct virtio_net_hdr_mrg_rxbuf *hdr;
unsigned int num_sg, len;
void *xdp_sent;
@@ -343,20 +343,45 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 
/* Free up any pending old buffers before queueing new ones. */
	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   struct page *sent_page = virt_to_head_page(xdp_sent);
-   put_page(sent_page);
+   if (vi->mergeable_rx_bufs) {
+   struct page *sent_page = virt_to_head_page(xdp_sent);
+
+   put_page(sent_page);
+   } else { /* small buffer */
+   struct sk_buff *skb = xdp_sent;
+
+   kfree_skb(skb);
+   }
}
 
-   /* Zero header and leave csum up to XDP layers */
-   hdr = xdp->data;
-   memset(hdr, 0, vi->hdr_len);
+   if (vi->mergeable_rx_bufs) {
+   /* Zero header and leave csum up to XDP layers */
+   hdr = xdp->data;
+   memset(hdr, 0, vi->hdr_len);
+
+   num_sg = 1;
+   sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+   } else { /* small buffer */
+   struct sk_buff *skb = data;
 
-   num_sg = 1;
-   sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+   /* Zero header and leave csum up to XDP layers */
+   hdr = skb_vnet_hdr(skb);
+   memset(hdr, 0, vi->hdr_len);
+
+   num_sg = 2;
+   sg_init_table(sq->sg, 2);
+   sg_set_buf(sq->sg, hdr, vi->hdr_len);
+   skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   }
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
-  xdp->data, GFP_ATOMIC);
+  data, GFP_ATOMIC);
if (unlikely(err)) {
-   put_page(page);
+   if (vi->mergeable_rx_bufs) {
+   struct page *page = virt_to_head_page(xdp->data);
+
+   put_page(page);
+   } else /* small buffer */
+   kfree_skb(data);
return; // On error abort to avoid unnecessary kick
}
 
@@ -366,23 +391,26 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 static u32 do_xdp_prog(struct virtnet_info *vi,
   struct receive_queue *rq,
   struct bpf_prog *xdp_prog,
-  struct page *page, int offset, int len)
+  void *data, int len)
 {
int hdr_padded_len;
struct xdp_buff xdp;
+   void *buf;
unsigned int qp;
u32 act;
-   u8 *buf;
-
-   buf = page_address(page) + offset;
 
-   if (vi->mergeable_rx_bufs)
+   if (vi->mergeable_rx_bufs) {
hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-   else
-   hdr_padded_len = sizeof(struct padded_vnet_hdr);
+   xdp.data = data + hdr_padded_len;
+   xdp.data_end = xdp.data + (len - vi->hdr_len);
+   buf = data;
+   } else { /* small buffers */
+   struct sk_buff *skb = data;
 
-   xdp.data = buf + hdr_padded_len;
-   xdp.data_end = xdp.data + (len - vi->hdr_len);
+   xdp.data = skb->data;
+   xdp.data_end = xdp.data + len;
+   buf = skb->data;
+   }
 
	act = bpf_prog_run_xdp(xdp_prog, &xdp);
switch (act) {
@@ -392,8 +420,8 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
qp = vi->curr_queue_pairs -
vi

[PATCH net 5/9] virtio-net: unbreak csumed packets for XDP_PASS

2016-12-23 Thread Jason Wang
We drop csumed packets when doing XDP. This breaks
XDP_PASS when GUEST_CSUM is supported. Fix this by allowing the csum flag
to be set. With this patch, simple TCP works for XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 470293e..0778dc8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -440,7 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
u32 act;
 
-   if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+   if (unlikely(hdr->hdr.gso_type))
goto err_xdp;
act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
switch (act) {
@@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 * the receive path after XDP is loaded. In practice I
 * was not able to create this condition.
 */
-   if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+   if (unlikely(hdr->hdr.gso_type))
goto err_xdp;
 
act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
-- 
2.7.4



[PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support

2016-12-23 Thread Jason Wang
When VIRTIO_NET_F_GUEST_UFO is negotiated, the host could still send UFO
packets that exceed a single page, which cannot be handled
correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
supported. While at it, forbid XDP for ECN (which comes only from GRO)
too, to prevent the user from misconfiguration.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 77ae358..c1f66d8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
int i, err;
 
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
-   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
netdev_warn(dev, "can't set XDP while host is implementing LRO, 
disable LRO first\n");
return -EOPNOTSUPP;
}
-- 
2.7.4



[PATCH net 8/9] virtio-net: remove big packet XDP codes

2016-12-23 Thread Jason Wang
Now that we in fact don't allow XDP for big packets, remove its code.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 44 +++-
 1 file changed, 3 insertions(+), 41 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c1f66d8..e53365a8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -344,11 +344,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
/* Free up any pending old buffers before queueing new ones. */
	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
struct page *sent_page = virt_to_head_page(xdp_sent);
-
-   if (vi->mergeable_rx_bufs)
-   put_page(sent_page);
-   else
-   give_pages(rq, sent_page);
+   put_page(sent_page);
}
 
/* Zero header and leave csum up to XDP layers */
@@ -360,15 +356,8 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
   xdp->data, GFP_ATOMIC);
if (unlikely(err)) {
-   if (vi->mergeable_rx_bufs)
-   put_page(page);
-   else
-   give_pages(rq, page);
+   put_page(page);
return; // On error abort to avoid unnecessary kick
-   } else if (!vi->mergeable_rx_bufs) {
-   /* If not mergeable bufs must be big packets so cleanup pages */
-   give_pages(rq, (struct page *)page->private);
-   page->private = 0;
}
 
virtqueue_kick(sq->vq);
@@ -430,44 +419,17 @@ static struct sk_buff *receive_big(struct net_device *dev,
   void *buf,
   unsigned int len)
 {
-   struct bpf_prog *xdp_prog;
struct page *page = buf;
-   struct sk_buff *skb;
-
-   rcu_read_lock();
-   xdp_prog = rcu_dereference(rq->xdp_prog);
-   if (xdp_prog) {
-   struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
-   u32 act;
-
-   if (unlikely(hdr->hdr.gso_type))
-   goto err_xdp;
-   act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
-   switch (act) {
-   case XDP_PASS:
-   break;
-   case XDP_TX:
-   rcu_read_unlock();
-   goto xdp_xmit;
-   case XDP_DROP:
-   default:
-   goto err_xdp;
-   }
-   }
-   rcu_read_unlock();
+   struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
 
-   skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
if (unlikely(!skb))
goto err;
 
return skb;
 
-err_xdp:
-   rcu_read_unlock();
 err:
dev->stats.rx_dropped++;
give_pages(rq, page);
-xdp_xmit:
return NULL;
 }
 
-- 
2.7.4



[PATCH net 1/9] virtio-net: remove the warning before XDP linearizing

2016-12-23 Thread Jason Wang
We use EWMA to estimate the size of the rx buffer. When the rx buffer
size is underestimated, it is usual to see a packet spanning more than
one buffer. Since this is not a bug, remove the warning and correct
the comment before XDP linearizing.
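
For illustration only, a minimal userspace sketch of the effect described
above; the EWMA weight, the clamp bound and the packet sizes are arbitrary
assumptions, not the driver's values:

#include <stdio.h>

/* Toy EWMA with weight 1/64, only loosely in the spirit of the driver's
 * packet-length average. */
static unsigned int ewma_add(unsigned int avg, unsigned int sample)
{
        return avg - (avg >> 6) + (sample >> 6);
}

int main(void)
{
        unsigned int avg = 1500;
        unsigned int buf;
        int i;

        /* A long run of small packets drags the estimate down... */
        for (i = 0; i < 500; i++)
                avg = ewma_add(avg, 256);

        /* ...so a sudden 4000-byte packet no longer fits in one buffer
         * sized from the estimate and spans several buffers instead. */
        buf = avg < 256 ? 256 : avg;
        printf("estimated buffer size: %u\n", buf);
        printf("buffers needed for a 4000-byte packet: %u\n",
               (4000 + buf - 1) / buf);
        return 0;
}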

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 08327e0..1067253 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -552,14 +552,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
struct page *xdp_page;
u32 act;
 
-   /* No known backend devices should send packets with
-* more than a single buffer when XDP conditions are
-* met. However it is not strictly illegal so the case
-* is handled as an exception and a warning is thrown.
-*/
+   /* This happens when rx buffer size is underestimated */
if (unlikely(num_buf > 1)) {
-   bpf_warn_invalid_xdp_buffer();
-
/* linearize data for XDP */
xdp_page = xdp_linearize_page(rq, num_buf,
  page, offset, &len);
-- 
2.7.4



[PATCH net 6/9] virtio-net: make rx buf size estimation works for XDP

2016-12-23 Thread Jason Wang
We don't update the EWMA rx buf size estimate in the XDP case. This leads
to underestimation of the rx buf size, which causes the host to produce
more than one buffer per packet and greatly increases the possibility of
XDP page linearization.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0778dc8..77ae358 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -584,10 +584,12 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
put_page(page);
head_skb = page_to_skb(vi, rq, xdp_page,
   0, len, PAGE_SIZE);
+   ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
return head_skb;
}
break;
case XDP_TX:
+   ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
if (unlikely(xdp_page != page))
goto err_xdp;
rcu_read_unlock();
@@ -596,6 +598,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
default:
if (unlikely(xdp_page != page))
__free_pages(xdp_page, 0);
+   ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
goto err_xdp;
}
}
-- 
2.7.4



[PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing

2016-12-23 Thread Jason Wang
We don't put pages during linearizing, which would cause leaks when
xmitting through XDP_TX or when the packet exceeds PAGE_SIZE. Fix this by
putting the pages accordingly. Also decrease the number of buffers during
linearizing to make sure the caller can free buffers correctly when the
packet exceeds PAGE_SIZE. With this patch, we no longer get OOM after
linearizing a huge number of packets.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fe4562d..58ad40e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -483,7 +483,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
  * anymore.
  */
 static struct page *xdp_linearize_page(struct receive_queue *rq,
-  u16 num_buf,
+  u16 *num_buf,
   struct page *p,
   int offset,
   unsigned int *len)
@@ -497,7 +497,7 @@ static struct page *xdp_linearize_page(struct receive_queue 
*rq,
memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
page_off += *len;
 
-   while (--num_buf) {
+   while (--*num_buf) {
unsigned int buflen;
unsigned long ctx;
void *buf;
@@ -507,19 +507,22 @@ static struct page *xdp_linearize_page(struct 
receive_queue *rq,
if (unlikely(!ctx))
goto err_buf;
 
+   buf = mergeable_ctx_to_buf_address(ctx);
+   p = virt_to_head_page(buf);
+   off = buf - page_address(p);
+
/* guard against a misconfigured or uncooperative backend that
 * is sending packet larger than the MTU.
 */
-   if ((page_off + buflen) > PAGE_SIZE)
+   if ((page_off + buflen) > PAGE_SIZE) {
+   put_page(p);
goto err_buf;
-
-   buf = mergeable_ctx_to_buf_address(ctx);
-   p = virt_to_head_page(buf);
-   off = buf - page_address(p);
+   }
 
memcpy(page_address(page) + page_off,
   page_address(p) + off, buflen);
page_off += buflen;
+   put_page(p);
}
 
*len = page_off;
@@ -555,7 +558,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
/* This happens when rx buffer size is underestimated */
if (unlikely(num_buf > 1)) {
/* linearize data for XDP */
-   xdp_page = xdp_linearize_page(rq, num_buf,
+   xdp_page = xdp_linearize_page(rq, &num_buf,
  page, offset, &len);
if (!xdp_page)
goto err_xdp;
-- 
2.7.4



[PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets

2016-12-23 Thread Jason Wang
When XDP_PASS is returned for linearized packets, we try to get
new buffers from the virtqueue and build skbs from them. This is wrong;
we should create skbs based on the existing buffers instead. Fix this by
creating the skb based on xdp_page.

With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 58ad40e..470293e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
switch (act) {
case XDP_PASS:
-   if (unlikely(xdp_page != page))
-   __free_pages(xdp_page, 0);
+   /* We can only create skb based on xdp_page. */
+   if (unlikely(xdp_page != page)) {
+   rcu_read_unlock();
+   put_page(page);
+   head_skb = page_to_skb(vi, rq, xdp_page,
+  0, len, PAGE_SIZE);
+   return head_skb;
+   }
break;
case XDP_TX:
if (unlikely(xdp_page != page))
-- 
2.7.4



[PATCH net 2/9] virtio-net: correctly xmit linearized page on XDP_TX

2016-12-23 Thread Jason Wang
After we linearize the page, we should xmit that page instead of the page
of the first buffer, which may lead to unexpected results. With this patch,
we see correct packets during XDP_TX.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1067253..fe4562d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
goto err_xdp;
 
-   act = do_xdp_prog(vi, rq, xdp_prog, page, offset, len);
+   act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
switch (act) {
case XDP_PASS:
if (unlikely(xdp_page != page))
-- 
2.7.4



[PATCH net 0/9] several fixups for virtio-net XDP

2016-12-23 Thread Jason Wang
Merry Xmas and a Happy New year to all:

This series tries to fixes several issues for virtio-net XDP which
could be categorized into several parts:

- fix several issues during XDP linearizing
- allow csumed packets to work for XDP_PASS
- make EWMA rxbuf size estimation work for XDP
- forbid XDP when GUEST_UFO is supported
- remove big packet XDP support
- add XDP support for small buffers

Please see individual patches for details.

Thanks

Jason Wang (9):
  virtio-net: remove the warning before XDP linearizing
  virtio-net: correctly xmit linearized page on XDP_TX
  virtio-net: fix page miscount during XDP linearizing
  virtio-net: correctly handle XDP_PASS for linearized packets
  virtio-net: unbreak csumed packets for XDP_PASS
  virtio-net: make rx buf size estimation works for XDP
  virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
  virtio-net: remove big packet XDP codes
  virtio-net: XDP support for small buffers

 drivers/net/virtio_net.c | 172 ---
 1 file changed, 102 insertions(+), 70 deletions(-)

-- 
2.7.4



[PATCH net-next V2 1/3] vhost: better detection of available buffers

2016-12-28 Thread Jason Wang
This patch tries to do several tweaks on vhost_vq_avail_empty() for a
better performance:

- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
  step.

This patch is needed for batching support, which needs to peek whether
or not there are still available buffers in the ring.
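
For illustration, a self-contained userspace model of the check order used
here: consult the cached producer index first and only re-read the shared
index (the expensive userspace access in vhost) when the cache says the ring
looks empty. The structure and names below are stand-ins, not vhost's:

#include <stdio.h>
#include <stdint.h>

struct toy_vq {
        uint16_t shared_avail_idx;      /* expensive to read in real vhost */
        uint16_t cached_avail_idx;      /* last value we fetched from it */
        uint16_t last_avail_idx;        /* next entry we will consume */
};

static int toy_avail_empty(struct toy_vq *vq)
{
        if (vq->cached_avail_idx != vq->last_avail_idx)
                return 0;               /* cheap path: known pending work */

        vq->cached_avail_idx = vq->shared_avail_idx;    /* refresh cache */
        return vq->cached_avail_idx == vq->last_avail_idx;
}

int main(void)
{
        struct toy_vq vq = {
                .shared_avail_idx = 3,
                .cached_avail_idx = 1,
                .last_avail_idx = 1,
        };

        printf("empty? %d\n", toy_avail_empty(&vq));    /* 0: refresh finds new entries */
        vq.last_avail_idx = 3;
        printf("empty? %d\n", toy_avail_empty(&vq));    /* 1: everything consumed */
        return 0;
}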

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
__virtio16 avail_idx;
int r;
 
+   if (vq->avail_idx != vq->last_avail_idx)
+   return false;
+
r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-   if (r)
+   if (unlikely(r))
return false;
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+   return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.7.4



[PATCH net-next V2 3/3] tun: rx batching

2016-12-28 Thread Jason Wang
We can only process one packet at a time during sendmsg(). This often
leads to bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting the packets to the host network
stack. This is done by accepting MSG_MORE as a hint from the
sendmsg() caller: if it is set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE is cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

  Mpps  -+%
rx_batched=0  0.90  +0%
rx_batched=4  0.97  +7.8%
rx_batched=8  0.97  +7.8%
rx_batched=16 0.98  +8.9%
rx_batched=32 1.03  +14.4%
rx_batched=48 1.09  +21.1%
rx_batched=64 1.02  +13.3%

The maximum number of batched packets is specified through a module
parameter.
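
As a rough illustration of the flush decision, here is a minimal userspace
model: keep queueing while the MSG_MORE hint is set and the backlog is below
the limit, flush otherwise. The limit value is only an example standing in
for the module parameter:

#include <stdio.h>
#include <stdbool.h>

#define RX_BATCHED 16   /* example limit, analogous to the module parameter */

static int queued;

/* Returns true when the accumulated batch is handed to the stack. */
static bool rx_queue_or_flush(bool more_coming)
{
        queued++;
        if (!more_coming || queued >= RX_BATCHED) {
                printf("flush %d packet(s)\n", queued);
                queued = 0;
                return true;
        }
        return false;
}

int main(void)
{
        int i;

        /* 40 packets; the sender signals "more" on all but the last one. */
        for (i = 0; i < 40; i++)
                rx_queue_or_flush(i != 39);
        return 0;
}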

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 66 ---
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..6ea5d6d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,10 @@
 
 #include 
 
+static int rx_batched;
+module_param(rx_batched, int, 0444);
+MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
+
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */
 
@@ -522,6 +526,7 @@ static void tun_queue_purge(struct tun_file *tfile)
while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
kfree_skb(skb);
 
+   skb_queue_purge(&tfile->sk.sk_write_queue);
skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1145,44 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
*tfile,
return skb;
 }
 
+static int tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
+ int more)
+{
+   struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+   struct sk_buff_head process_queue;
+   int qlen;
+   bool rcv = false;
+
+   spin_lock(&queue->lock);
+   qlen = skb_queue_len(queue);
+   if (qlen > rx_batched)
+   goto drop;
+   __skb_queue_tail(queue, skb);
+   if (!more || qlen + 1 > rx_batched) {
+   __skb_queue_head_init(&process_queue);
+   skb_queue_splice_tail_init(queue, &process_queue);
+   rcv = true;
+   }
+   spin_unlock(&queue->lock);
+
+   if (rcv) {
+   local_bh_disable();
+   while ((skb = __skb_dequeue(&process_queue)))
+   netif_receive_skb(skb);
+   local_bh_enable();
+   }
+
+   return 0;
+drop:
+   spin_unlock(&queue->lock);
+   kfree_skb(skb);
+   return -EFAULT;
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
-   int noblock)
+   int noblock, bool more)
 {
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
@@ -1283,18 +1322,27 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_probe_transport_header(skb, 0);
 
rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-   local_bh_disable();
-   netif_receive_skb(skb);
-   local_bh_enable();
+   if (!rx_batched) {
+   local_bh_disable();
+   netif_receive_skb(skb);
+   local_bh_enable();
+   } else {
+   err = tun_rx_batched(tfile, skb, more);
+   }
 #else
netif_rx_ni(skb);
 #endif
 
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
-   stats->rx_packets++;
-   stats->rx_bytes += len;
+   if (err) {
+   stats->rx_dropped++;
+   } else {
+   stats->rx_packets++;
+   stats->rx_bytes += len;
+   }
u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
 
@@ -1312,7 +1360,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
if (!tun)
return -EBADFD;
 
-   result = tun_get_user(tun, tfile, NULL, from, file->f_flags & 
O_NONBLOCK);
+   result = tun_get_user(tun, tfile, NULL, from,
+ file->f_flags & O_NONBLOCK, false);
 
tun_put(tun);
return result;
@@ -1570,7 +1619,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr 
*m, size_t total_len)
return -EBADFD;
 
ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-  m->msg_flags & MSG_DONTWAIT);
+  m->msg_flags & MSG_DONTWAIT,
+  m->msg_flags & MSG_MORE);
tun_put(tun);
return ret;
 }
-- 
2.7.4



[PATCH net-next V2 0/3] vhost net tx batching

2016-12-28 Thread Jason Wang
Hi:

This series tries to implement tx batching support for vhost. This is
done by using MSG_MORE as a hint for the underlying socket. The backend
(e.g. tap) can then batch the packets temporarily in a list and
submit them all once the number of batched packets exceeds a limit.

Tests show an obvious improvement with guest pktgen over
mlx4 (noqueue) on the host:

Mpps  -+%
rx_batched=0  0.90  +0%
rx_batched=4  0.97  +7.8%
rx_batched=8  0.97  +7.8%
rx_batched=16 0.98  +8.9%
rx_batched=32 1.03  +14.4%
rx_batched=48 1.09  +21.1%
rx_batched=64 1.02  +13.3%

Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceeds max pending of zerocopy
- more improvement on available buffer detection
- move the limit on batched packets from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c | 66 ---
 drivers/vhost/net.c   | 23 +++---
 drivers/vhost/vhost.c |  8 +--
 3 files changed, 84 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH net-next V2 2/3] vhost_net: tx batching

2016-12-28 Thread Jason Wang
This patch tries to utilize tuntap rx batching by peeking at the tx
virtqueue during transmission: if there are more available buffers in
the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap)
to batch the packets.
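
For illustration, a standalone sketch of the hint decision described above;
the names and the byte budget are stand-ins rather than the vhost symbols:

#include <stdio.h>
#include <stdbool.h>

#define MORE_HINT  1            /* stands in for MSG_MORE */
#define TX_BUDGET  0x80000      /* stands in for the tx weight limit */

/* Set the "more" hint only when another buffer is already queued and we
 * have neither hit the byte budget nor the pending zerocopy limit. */
static int tx_hint(unsigned long total_len, bool ring_has_more, bool maxpend_hit)
{
        if (total_len < TX_BUDGET && ring_has_more && !maxpend_hit)
                return MORE_HINT;
        return 0;
}

int main(void)
{
        printf("more buffers queued:  hint=%d\n", tx_hint(4096, true, false));
        printf("ring drained:         hint=%d\n", tx_hint(4096, false, false));
        printf("too many pending DMA: hint=%d\n", tx_hint(4096, true, true));
        return 0;
}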

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+
+   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+   == nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
/* If more outstanding DMAs, queue the work.
 * Handle upend_idx wrap around
 */
-   if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
- % UIO_MAXIOV == nvq->done_idx))
+   if (unlikely(vhost_exceeds_maxpend(net)))
break;
 
head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
+
+   total_len += len;
+   if (total_len < VHOST_NET_WEIGHT &&
+   !vhost_vq_avail_empty(&net->dev, vq) &&
+   likely(!vhost_exceeds_maxpend(net))) {
+   msg.msg_flags |= MSG_MORE;
+   } else {
+   msg.msg_flags &= ~MSG_MORE;
+   }
+
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
vhost_zerocopy_signal_used(net, vq);
-   total_len += len;
vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
-- 
2.7.4



Re: [PATCH net 1/9] virtio-net: remove the warning before XDP linearizing

2016-12-26 Thread Jason Wang



On 2016年12月24日 03:31, Daniel Borkmann wrote:

Hi Jason,

On 12/23/2016 03:37 PM, Jason Wang wrote:

Since we use EWMA to estimate the size of rx buffer. When rx buffer
size is underestimated, it's usual to have a packet with more than one
buffers. Consider this is not a bug, remove the warning and correct
the comment before XDP linearizing.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 08327e0..1067253 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -552,14 +552,8 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,

  struct page *xdp_page;
  u32 act;

-/* No known backend devices should send packets with
- * more than a single buffer when XDP conditions are
- * met. However it is not strictly illegal so the case
- * is handled as an exception and a warning is thrown.
- */
+/* This happens when rx buffer size is underestimated */
  if (unlikely(num_buf > 1)) {
-bpf_warn_invalid_xdp_buffer();


Could you also remove the bpf_warn_invalid_xdp_buffer(), which got added
just for this?

Thanks. 


Posted.

Thanks


[PATCH net] net: xdp: remove unused bfp_warn_invalid_xdp_buffer()

2016-12-26 Thread Jason Wang
After commit 73b62bd085f4737679ea9afc7867fa5f99ba7d1b ("virtio-net:
remove the warning before XDP linearizing"), there's no users for
bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert for
commit f23bc46c30ca5ef58b8549434899fcbac41b2cfc.

Cc: Daniel Borkmann <dan...@iogearbox.net>
Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/linux/filter.h | 1 -
 net/core/filter.c  | 6 --
 2 files changed, 7 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7023142..a0934e6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -610,7 +610,6 @@ bool bpf_helper_changes_pkt_data(void *func);
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 void bpf_warn_invalid_xdp_action(u32 act);
-void bpf_warn_invalid_xdp_buffer(void);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7190bd6..b146170 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2972,12 +2972,6 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-void bpf_warn_invalid_xdp_buffer(void)
-{
-   WARN_ONCE(1, "Illegal XDP buffer encountered, expect throughput 
degradation\n");
-}
-EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
-
 static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
int src_reg, int ctx_off,
struct bpf_insn *insn_buf,
-- 
2.7.4



Re: [PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing

2016-12-25 Thread Jason Wang



On 2016年12月23日 23:54, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

We don't put page during linearizing, the would cause leaking when
xmit through XDP_TX or the packet exceeds PAGE_SIZE. Fix them by
put page accordingly. Also decrease the number of buffers during
linearizing to make sure caller can free buffers correctly when packet
exceeds PAGE_SIZE. With this patch, we won't get OOM after linearize
huge number of packets.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---

Thanks! looks good. By the way do you happen to have any actual
configuration where this path is hit? I obviously didn't test this
very long other than a quick test with my hacked vhost driver.

Acked-by: John Fastabend <john.r.fastab...@intel.com>


Yes, I have. Just increase the MTU above 1500 for both virtio and tap,
and produce some traffic whose size will lead to an underestimated rx buffer.


Thanks


Re: [PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets

2016-12-25 Thread Jason Wang



On 2016年12月23日 23:57, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

When XDP_PASS were determined for linearized packets, we try to get
new buffers in the virtqueue and build skbs from them. This is wrong,
we should create skbs based on existed buffers instead. Fixing them by
creating skb based on xdp_page.

With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 58ad40e..470293e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
switch (act) {
case XDP_PASS:
-   if (unlikely(xdp_page != page))
-   __free_pages(xdp_page, 0);
+   /* We can only create skb based on xdp_page. */
+   if (unlikely(xdp_page != page)) {
+   rcu_read_unlock();
+   put_page(page);
+   head_skb = page_to_skb(vi, rq, xdp_page,
+  0, len, PAGE_SIZE);
+   return head_skb;
+   }
break;
case XDP_TX:
if (unlikely(xdp_page != page))


Great thanks. This was likely working before because of the memory
leak fixed in 3/9.


It looks not; without this and 3/9, the code will try to get buffers and
build an skb for a new packet instead of using the existing buffers.


Thanks



Acked-by: John Fastabend <john.r.fastab...@intel.com>




Re: [PATCH net 0/9] several fixups for virtio-net XDP

2016-12-25 Thread Jason Wang



On 2016年12月24日 01:10, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

Merry Xmas and a Happy New year to all:

This series tries to fixes several issues for virtio-net XDP which
could be categorized into several parts:

- fix several issues during XDP linearizing
- allow csumed packet to work for XDP_PASS
- make EWMA rxbuf size estimation works for XDP
- forbid XDP when GUEST_UFO is support
- remove big packet XDP support
- add XDP support or small buffer

Please see individual patches for details.

Thanks

Jason Wang (9):
   virtio-net: remove the warning before XDP linearizing
   virtio-net: correctly xmit linearized page on XDP_TX
   virtio-net: fix page miscount during XDP linearizing
   virtio-net: correctly handle XDP_PASS for linearized packets
   virtio-net: unbreak csumed packets for XDP_PASS
   virtio-net: make rx buf size estimation works for XDP
   virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
   virtio-net: remove big packet XDP codes
   virtio-net: XDP support for small buffers

  drivers/net/virtio_net.c | 172 ---
  1 file changed, 102 insertions(+), 70 deletions(-)


Thanks a lot Jason. The last piece needed to complete XDP support is to
get the adjust_head part correct. I'll send out a patch in a bit, but it
will need to be merged on top of this set.

.John


Yes, glad to see the your patch.

Thanks.


Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support

2016-12-25 Thread Jason Wang



On 2016年12月24日 00:10, John Fastabend wrote:

On 16-12-23 08:02 AM, John Fastabend wrote:

On 16-12-23 06:37 AM, Jason Wang wrote:

When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
packet that exceeds a single page which could not be handled
correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
supported. While at it, forbid XDP for ECN (which comes only from GRO)
too to prevent user from misconfiguration.


Is sending packets greater than single page though normal in this case?


Yes, when NETIF_F_UFO is enabled for tap, it won't segment UFO packets
and will send them directly to the guest. (This can be reproduced with
UDP_STREAM between two guests or from host to guest.)


Thanks


I don't have any need to support big packet mode other than MST asked
for it. And I wasn't seeing this in my tests. MTU is capped at 4k - hdr
when XDP is enabled.

.John


Cc: John Fastabend <john.r.fastab...@intel.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 77ae358..c1f66d8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
int i, err;
  
  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||

-   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
netdev_warn(dev, "can't set XDP while host is implementing LRO, 
disable LRO first\n");
return -EOPNOTSUPP;
}


Acked-by: John Fastabend <john.r.fastab...@intel.com>





Re: [net PATCH v4 6/6] virtio_net: XDP support for adjust_head

2017-01-15 Thread Jason Wang



On 2017年01月16日 08:01, John Fastabend wrote:

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

At the moment this code must do a full reset to ensure old buffers
without headroom on program add or with headroom on program removal
are not used incorrectly in the datapath. Ideally we would only
have to disable/enable the RX queues being updated but there is no
API to do this at the moment in virtio so use the big hammer. In
practice it is likely not that big of a problem as this will only
happen when XDP is enabled/disabled changing programs does not
require the reset. There is some risk that the driver may either
have an allocation failure or for some reason fail to correctly
negotiate with the underlying backend in this case the driver will
be left uninitialized. I have not seen this ever happen on my test
systems and for what its worth this same failure case can occur
from probe and other contexts in virtio framework.

Signed-off-by: John Fastabend 
---
  drivers/net/virtio_net.c |  110 ++
  1 file changed, 82 insertions(+), 28 deletions(-)



[...]


-   vi->xdp_queue_pairs = xdp_qp;
netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
  
  	for (i = 0; i < vi->max_queue_pairs; i++) {

@@ -1746,6 +1789,21 @@ static int virtnet_xdp_set(struct net_device *dev, 
struct bpf_prog *prog)
}
  
  	return 0;

+
+virtio_reset_err:
+   /* On reset error do our best to unwind XDP changes inflight and return
+* error up to user space for resolution. The underlying PCI reset hung
+* on us so not much we can do here.


It should work with other transports too, so let's remove "PCI" here.


+*/
+   dev_warn(>dev, "XDP reset failure and queues unstable\n");
+   vi->xdp_queue_pairs = oxdp_qp;
+virtio_queue_err:
+   /* On queue set error we can unwind bpf ref count and user space can
+* retry this is most likely an allocation failure.
+*/
+   if (prog)
+   bpf_prog_sub(prog, vi->max_queue_pairs - 1);
+   return err;
  }
  
  static bool virtnet_xdp_query(struct net_device *dev)

@@ -2373,7 +2431,6 @@ static void virtnet_remove(struct virtio_device *vdev)
free_netdev(vi->dev);
  }
  
-#ifdef CONFIG_PM_SLEEP

  static int virtnet_freeze(struct virtio_device *vdev)
  {
struct virtnet_info *vi = vdev->priv;
@@ -2430,7 +2487,6 @@ static int virtnet_restore(struct virtio_device *vdev)
  
  	return 0;

  }
-#endif


If you do want to use virtio_device_reset(), it's better to squash this 
into patch 5/6.


Other looks good.

Thanks


Re: [net PATCH v4 0/6] virtio_net XDP fixes and adjust_header support

2017-01-15 Thread Jason Wang



On 2017年01月16日 07:59, John Fastabend wrote:

This has a fix to handle small buffer free logic correctly and then
also adds adjust head support.

I pushed adjust head at net (even though its rc3) to avoid having
to push another exception case into virtio_net to catch if the
program uses adjust_head and then block it. If there are any strong
objections to this we can push it at net-next and use a patch from
Jakub to add the exception handling but then user space has to deal
with it either via try/fail logic or via kernel version checks. Granted
we already have some cases that need to be configured to enable XDP
but I don't see any reason to have yet another one when we can fix it
now vs delaying a kernel version.


v2: fix spelling error, convert unsigned -> unsigned int
v3: v2 git crashed during send so retrying sorry for the noise
v4: changed layout of rtnl_lock fixes (Stephen)
 moved reset logic into virtio core with new patch (MST)
 fixed up linearize and some code cleanup (Jason)

 Otherwise did some generic code cleanup so might be a bit
 cleaner this time at least that is the hope.

Thanks everyone for the v3 review.


Thanks, looks good to me overall, just few nits.


Re: [net PATCH v4 5/6] virtio: add pci_down/pci_up configuration

2017-01-15 Thread Jason Wang



On 2017年01月16日 08:01, John Fastabend wrote:

In virtio_net we need to do a full reset of the device to support
queue reconfiguration and also we can trigger this via ethtool
commands. So instead of open coding this in net driver push this
into generic code in virtio. This also avoid exporting a handful
of internal virtio routines.


Looks like this is not PCI-specific stuff. And there are some drivers
left untouched (e.g. scsi and block).


In fact, I'm not sure touching other drivers is really needed. Maybe we 
can just:
- move virtio_device_freeze(), virtio_device_restore() and 
.freeze/.restore in virtio_driver out of CONFIG_PM_SLEEP

- move virtnet_freeze() and virtnet_restore() out of CONFIG_PM_SLEEP
- introduce virtio_net_reset() and call 
virtio_device_freeze()/virtio_device_restore() there


Another possible issue for sleep/hibernation is that xdp_prog is not
restored; if this is not intended for XDP, we'd better fix it.


Thanks

[...]


Re: [PATCH net-next 1/8] ptr_ring: introduce batch dequeuing

2017-03-23 Thread Jason Wang



On 2017年03月22日 21:43, Michael S. Tsirkin wrote:

On Tue, Mar 21, 2017 at 12:04:40PM +0800, Jason Wang wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  include/linux/ptr_ring.h | 65 
  1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..4771ded 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
return ptr;
  }
  
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,

+void **array, int n)
+{
+   void *ptr;
+   int i = 0;
+
+   while (i < n) {
+   ptr = __ptr_ring_consume(r);
+   if (!ptr)
+   break;
+   array[i++] = ptr;
+   }
+
+   return i;
+}
+
  /*
   * Note: resize (below) nests producer lock within consumer lock, so if you
   * call this in interrupt or BH context, you must disable interrupts/BH when


This ignores the comment above that function:

/* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax().
  */


Yes, __ptr_ring_swap_queue() ignores this too.



Also - it looks like it shouldn't matter if reads are reordered but I wonder.
Thoughts? Including some reasoning about it in commit log would be nice.


Yes, I think it doesn't matter in this case, it matters only for batched 
producing.


Thanks




@@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
return ptr;
  }
  
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,

+  void **array, int n)
+{
+   int ret;
+
+   spin_lock(>consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock(>consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock_irq(>consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irq(>consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+  void **array, int n)
+{
+   unsigned long flags;
+   int ret;
+
+   spin_lock_irqsave(>consumer_lock, flags);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irqrestore(>consumer_lock, flags);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+ void **array, int n)
+{
+   int ret;
+
+   spin_lock_bh(>consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_bh(>consumer_lock);
+
+   return ret;
+}
+
  /* Cast to structure type and call a function without discarding from FIFO.
   * Function must return a value.
   * Callers must take consumer_lock.
--
2.7.4




Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array

2017-03-23 Thread Jason Wang



On 2017年03月22日 22:16, Michael S. Tsirkin wrote:

On Tue, Mar 21, 2017 at 12:04:46PM +0800, Jason Wang wrote:

We used to dequeue one skb during recvmsg() from skb_array, this could
be inefficient because of the bad cache utilization and spinlock
touching for each packet. This patch tries to batch them by calling
batch dequeuing helpers explicitly on the exported skb array and pass
the skb back through msg_control for underlayer socket to finish the
userspace copying.

Tests were done by XDP1:
- small buffer:
   Before: 1.88Mpps
   After : 2.25Mpps (+19.6%)
- mergeable buffer:
   Before: 1.83Mpps
   After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 64 +
  1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..53f09f2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  
  #include 
  
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {

struct vhost_virtqueue *vq;
  };
  
+#define VHOST_RX_BATCH 64

  struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;
@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   void *rxq[VHOST_RX_BATCH];
+   int rt;
+   int rh;
  };
  
  struct vhost_net {

@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
  
  }

@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
mutex_unlock(>mutex);
  }
  
-static int peek_head_len(struct sock *sk)

+static int peek_head_len_batched(struct vhost_net_virtqueue *rvq)

Pls rename to say what it actually does: fetch skbs


Ok.




+{
+   if (rvq->rh != rvq->rt)
+   goto out;
+
+   rvq->rh = rvq->rt = 0;
+   rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq,
+   VHOST_RX_BATCH);

A comment explaining why it is -bh would be helpful.


Ok.

Thanks


[PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array

2017-03-30 Thread Jason Wang
We used to dequeue one skb from the skb_array during recvmsg(); this could
be inefficient because of the bad cache utilization and the spinlock
touched for each packet. This patch tries to batch them by calling the
batch dequeuing helpers explicitly on the exported skb array and passing
the skb back through msg_control for the underlying socket to finish the
userspace copying.

Tests were done by XDP1:
- small buffer:
  Before: 1.88Mpps
  After : 2.25Mpps (+19.6%)
- mergeable buffer:
  Before: 1.83Mpps
  After : 2.10Mpps (+14.7%)
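
As an illustration of the rh/rt bookkeeping, here is a self-contained
userspace model: refill a local array only when it has been drained, then
hand entries out one by one. The batch size and the fake producer are
arbitrary stand-ins:

#include <stdio.h>

#define BATCH 4

static int pending = 11;        /* pretend 11 items are waiting in the ring */
static int batch[BATCH], head, tail;

/* Stand-in for skb_array_consume_batched(): pull up to n items at once. */
static int consume_batched(int *array, int n)
{
        int i;

        for (i = 0; i < n && pending > 0; i++, pending--)
                array[i] = pending;
        return i;
}

static int fetch_one(void)
{
        if (head == tail) {             /* local cache drained: refill it */
                head = 0;
                tail = consume_batched(batch, BATCH);
                if (!tail)
                        return -1;      /* ring empty */
        }
        return batch[head++];           /* serve from the cached batch */
}

int main(void)
{
        int v;

        while ((v = fetch_one()) >= 0)
                printf("got item %d\n", v);
        return 0;
}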

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 64 +
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..ffa78c6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
 struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;
@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   void *rxq[VHOST_RX_BATCH];
+   int rt;
+   int rh;
 };
 
 struct vhost_net {
@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
 
 }
@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int fetch_skbs(struct vhost_net_virtqueue *rvq)
+{
+   if (rvq->rh != rvq->rt)
+   goto out;
+
+   rvq->rh = rvq->rt = 0;
+   rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
+   VHOST_RX_BATCH);
+   if (!rvq->rt)
+   return 0;
+out:
+   return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
struct socket *sock = sk->sk_socket;
struct sk_buff *head;
int len = 0;
unsigned long flags;
 
+   if (rvq->rx_array)
+   return fetch_skbs(rvq);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
 
@@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
return skb_queue_empty(&sk->sk_receive_queue);
 }
 
-static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+ struct sock *sk)
 {
+   struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
-   int len = peek_head_len(sk);
+   int len = peek_head_len(rvq, sk);
 
if (!len && vq->busyloop_timeout) {
/* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
vhost_poll_queue(&vq->poll);
mutex_unlock(&vq->mutex);
 
-   len = peek_head_len(sk);
+   len = peek_head_len(rvq, sk);
}
 
return len;
@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
+   if (nvq->rx_array)
+   msg.msg_control = nvq->rxq[nvq->rh++];
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net 
*n,
struct vhost_virtqueue *vq)
 {
struct socket *sock;
+   struct vhost_net_virtqueue *nvq =
+   container_of(vq, struct vhost_net_virtqueue, vq);
 
mutex_lock(&vq->mutex);

[PATCH V2 net-next 6/7] tap: support receiving skb from msg_control

2017-03-30 Thread Jason Wang
This patch makes tap_recvmsg() able to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
DEFINE_WAIT(wait);
-   struct sk_buff *skb;
ssize_t ret = 0;
 
if (!iov_iter_count(to))
return 0;
 
+   if (skb)
+   goto done;
+
while (1) {
if (!noblock)
prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
if (!noblock)
finish_wait(sk_sleep(&q->sk), &wait);
 
+done:
if (skb) {
ret = tap_put_user(q, skb, to);
if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
struct tap_queue *q = file->private_data;
ssize_t len = iov_iter_count(to), ret;
 
-   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr 
*m,
int ret;
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
return -EINVAL;
-   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4



[PATCH V2 net-next 3/7] tun: export skb_array

2017-03-30 Thread Jason Wang
This patch exports the skb_array through tun_get_skb_array(). Callers can
then manipulate the skb array directly.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c  | 13 +
 include/linux/if_tun.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c418f0a..70dd9ec 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+   struct tun_file *tfile;
+
+   if (file->f_op != &tun_fops)
+   return ERR_PTR(-EINVAL);
+   tfile = file->private_data;
+   if (!tfile)
+   return ERR_PTR(-EBADFD);
+   return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4



[PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing

2017-03-30 Thread Jason Wang
This patch introduces a batched version of consuming: the consumer can
dequeue more than one pointer from the ring at a time. We don't care
about reordering of reads here, so no compiler barrier is needed.
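
For illustration, a self-contained toy ring loosely modelled on ptr_ring's
"NULL slot means empty" convention, showing that the batched consume is just
the single-slot consume repeated until an empty slot is hit or the caller's
array is full. Sizes and names are made up:

#include <stdio.h>

#define RING_SIZE 8

static void *ring[RING_SIZE];
static int prod, cons;

static int toy_produce(void *ptr)
{
        if (ring[prod])                 /* slot still occupied: ring full */
                return -1;
        ring[prod] = ptr;
        prod = (prod + 1) % RING_SIZE;
        return 0;
}

static void *toy_consume(void)
{
        void *ptr = ring[cons];

        if (ptr) {                      /* NULL means nothing to consume */
                ring[cons] = NULL;
                cons = (cons + 1) % RING_SIZE;
        }
        return ptr;
}

/* Batched variant: repeat until empty or n entries have been filled in. */
static int toy_consume_batched(void **array, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                array[i] = toy_consume();
                if (!array[i])
                        break;
        }
        return i;
}

int main(void)
{
        static int items[5] = { 1, 2, 3, 4, 5 };
        void *batch[4];
        int i;

        for (i = 0; i < 5; i++)
                toy_produce(&items[i]);

        printf("first batch:  %d entries\n", toy_consume_batched(batch, 4));
        printf("second batch: %d entries\n", toy_consume_batched(batch, 4));
        return 0;
}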

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/linux/ptr_ring.h | 65 
 1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..2be0f350 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
return ptr;
 }
 
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+void **array, int n)
+{
+   void *ptr;
+   int i;
+
+   for (i = 0; i < n; i++) {
+   ptr = __ptr_ring_consume(r);
+   if (!ptr)
+   break;
+   array[i] = ptr;
+   }
+
+   return i;
+}
+
 /*
  * Note: resize (below) nests producer lock within consumer lock, so if you
  * call this in interrupt or BH context, you must disable interrupts/BH when
@@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
return ptr;
 }
 
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock_irq(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irq(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+  void **array, int n)
+{
+   unsigned long flags;
+   int ret;
+
+   spin_lock_irqsave(&r->consumer_lock, flags);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+ void **array, int n)
+{
+   int ret;
+
+   spin_lock_bh(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_bh(&r->consumer_lock);
+
+   return ret;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.7.4



[PATCH V2 net-next 2/7] skb_array: introduce batch dequeuing

2017-03-30 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/linux/skb_array.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..90e44b9 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct 
skb_array *a)
return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_irq(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_any(&a->ring, array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+  void **array, int n)
+{
+   return ptr_ring_consume_batched_bh(&a->ring, array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {
-- 
2.7.4



[PATCH V2 net-next 0/7] vhost-net rx batching

2017-03-30 Thread Jason Wang
Hi all:

This series tries to implement rx batching for vhost-net. This is done
by batching the dequeuing from the skb_array exported by the
underlying socket and passing the skb back through msg_control to finish
the userspace copying.

Tests show up to a 19% improvement in rx pps.

Please review.

Thanks

Changes from V1:
- switch to use for() in __ptr_ring_consume_batched()
- rename peek_head_len_batched() to fetch_skbs()
- use skb_array_consume_batched() instead of
  skb_array_consume_batched_bh() since no consumer runs in bh
- drop the lockless peeking patch since the skb_array could be resized, so
  it's not safe to call the lockless variant

Jason Wang (7):
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array

 drivers/net/tap.c | 25 +++---
 drivers/net/tun.c | 31 --
 drivers/vhost/net.c   | 64 +++---
 include/linux/if_tap.h|  5 
 include/linux/if_tun.h|  5 
 include/linux/ptr_ring.h  | 65 +++
 include/linux/skb_array.h | 25 ++
 7 files changed, 204 insertions(+), 16 deletions(-)

-- 
2.7.4



[PATCH V2 net-next 5/7] tun: support receiving skb through msg_control

2017-03-30 Thread Jason Wang
This patch makes tun_recvmsg() able to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 70dd9ec..a82bced 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file 
*tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
-   struct sk_buff *skb;
ssize_t ret;
int err;
 
@@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, 
struct tun_file *tfile,
if (!iov_iter_count(to))
return 0;
 
-   /* Read frames from ring */
-   skb = tun_ring_recv(tfile, noblock, &err);
-   if (!skb)
-   return err;
+   if (!skb) {
+   /* Read frames from ring */
+   skb = tun_ring_recv(tfile, noblock, &err);
+   if (!skb)
+   return err;
+   }
 
ret = tun_put_user(tun, tfile, skb, to);
if (unlikely(ret < 0))
@@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
 
if (!tun)
return -EBADFD;
-   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr 
*m, size_t total_len,
 SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
-   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > (ssize_t)total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4



[PATCH V2 net-next 4/7] tap: export skb_array

2017-03-30 Thread Jason Wang
This patch exports the skb_array through tap_get_skb_array(). Callers can
then manipulate the skb array directly.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tap.c  | 13 +
 include/linux/if_tap.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+   struct tap_queue *q;
+
+   if (file->f_op != &tap_fops)
+   return ERR_PTR(-EINVAL);
+   q = file->private_data;
+   if (!q)
+   return ERR_PTR(-EBADFD);
+   return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include 
-- 
2.7.4



Re: [PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling

2017-03-29 Thread Jason Wang



On 2017年03月29日 20:07, Michael S. Tsirkin wrote:

On Tue, Mar 21, 2017 at 12:04:47PM +0800, Jason Wang wrote:

For the socket that exports its skb array, we can use lockless polling
to avoid touching spinlock during busy polling.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53f09f2..41153a3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, 
struct sock *sk)
return len;
  }
  
-static int sk_has_rx_data(struct sock *sk)

+static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk)
  {
struct socket *sock = sk->sk_socket;
  
+	if (rvq->rx_array)

+   return !__skb_array_empty(rvq->rx_array);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
  

I don't see which patch adds __skb_array_empty.


This is not something new, it was introduced by ad69f35d1dc0a 
("skb_array: array based FIFO for skbs").


Thanks




@@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net,
endtime = busy_clock() + vq->busyloop_timeout;
  
  		while (vhost_can_busy_poll(>dev, endtime) &&

-  !sk_has_rx_data(sk) &&
+  !sk_has_rx_data(rvq, sk) &&
   vhost_vq_avail_empty(>dev, vq))
cpu_relax();
  
--

2.7.4




Re: [PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling

2017-03-29 Thread Jason Wang



On 2017年03月30日 10:33, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 10:16:15AM +0800, Jason Wang wrote:


On 2017年03月29日 20:07, Michael S. Tsirkin wrote:

On Tue, Mar 21, 2017 at 12:04:47PM +0800, Jason Wang wrote:

For the socket that exports its skb array, we can use lockless polling
to avoid touching spinlock during busy polling.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
   drivers/vhost/net.c | 7 +--
   1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53f09f2..41153a3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, 
struct sock *sk)
return len;
   }
-static int sk_has_rx_data(struct sock *sk)
+static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk)
   {
struct socket *sock = sk->sk_socket;
+   if (rvq->rx_array)
+   return !__skb_array_empty(rvq->rx_array);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);

I don't see which patch adds __skb_array_empty.

This is not something new, it was introduced by ad69f35d1dc0a ("skb_array:
array based FIFO for skbs").

Thanks

Same comment about a compiler barrier applies then.


OK, rethinking this: since the skb_array can be resized, using the lockless
version seems wrong.


Regarding the comment about using a compiler barrier: the caller
(vhost_net_rx_peek_head_len) uses cpu_relax(). But I haven't figured out
why a compiler barrier is needed here. Could you please explain?


Thanks




@@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net,
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(>dev, endtime) &&
-  !sk_has_rx_data(sk) &&
+  !sk_has_rx_data(rvq, sk) &&
   vhost_vq_avail_empty(>dev, vq))
cpu_relax();
--
2.7.4




Re: [PATCH] virtio_net: fix mergeable bufs error handling

2017-03-29 Thread Jason Wang



On 2017年03月29日 20:37, Michael S. Tsirkin wrote:

On xdp error we try to free head_skb without having
initialized it, that's clearly bogus.

Fixes: f600b6905015 ("virtio_net: Add XDP support")
Cc: John Fastabend 
Signed-off-by: Michael S. Tsirkin 
---
  drivers/net/virtio_net.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 11773d6..e0fb3707 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -570,7 +570,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
u16 num_buf;
struct page *page;
int offset;
-   struct sk_buff *head_skb, *curr_skb;
+   struct sk_buff *head_skb = NULL, *curr_skb;
struct bpf_prog *xdp_prog;
unsigned int truesize;
  


My tree (net.git HEAD is 8f1f7eeb22c16a197159cf7b35d1350695193ead) has:

head_skb = NULL;

just after the code above.

Thanks


Re: [PATCH] virtio_net: enable big packets for large MTU values

2017-03-29 Thread Jason Wang



On 2017年03月29日 20:38, Michael S. Tsirkin wrote:

If one enables e.g. jumbo frames without mergeable
buffers, packets won't fit in 1500 byte buffers
we use. Switch to big packet mode instead.
TODO: make sizing more exact, possibly extend small
packet mode to use larger pages.

Signed-off-by: Michael S. Tsirkin 
---
  drivers/net/virtio_net.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e0fb3707..9dc31dc 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2428,6 +2428,10 @@ static int virtnet_probe(struct virtio_device *vdev)
dev->mtu = mtu;
dev->max_mtu = mtu;
}
+
+   /* TODO: size buffers correctly in this case. */
+   if (dev->mtu > ETH_DATA_LEN)
+   vi->big_packets = true;
}
  
  	if (vi->any_header_sg)


OK, I think maybe we should fail the probe in this case when small buffers 
are in use, and decrease max_mtu to ETH_DATA_LEN whenever small buffers are 
used (since the user is allowed to increase the MTU later).
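
A rough sketch of that alternative, assuming the hunk above is dropped and the check lives in virtnet_probe() (the "free" error label and the err variable are the probe's existing ones; this is hypothetical, not the merged fix):

	/* hypothetical: with small buffers we cannot receive jumbo frames,
	 * so refuse an over-sized device-supplied MTU and cap the MTU the
	 * user may configure later
	 */
	if (!vi->mergeable_rx_bufs && !vi->big_packets) {
		if (dev->mtu > ETH_DATA_LEN) {
			err = -EINVAL;
			goto free;		/* fail the probe */
		}
		dev->max_mtu = ETH_DATA_LEN;
	}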


Thanks


Re: [PATCH v2] virtio_net: fix support for small rings

2017-03-29 Thread Jason Wang



On 2017年03月30日 01:42, Michael S. Tsirkin wrote:

When ring size is small (<32 entries) making buffers smaller means a
full ring might not be able to hold enough buffers to fit a single large
packet.

Make sure a ring full of buffers is large enough to allow at least one
packet of max size.

Fixes: 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag 
allocators")
Signed-off-by: Michael S. Tsirkin 
---

changes from v1:
typo fix

  drivers/net/virtio_net.c | 30 ++
  1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9dc31dc..e5f6e34 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -29,6 +29,7 @@
  #include 
  #include 
  #include 
+#include 


This seems unnecessary.

  
  static int napi_weight = NAPI_POLL_WEIGHT;

  module_param(napi_weight, int, 0444);
@@ -98,6 +99,9 @@ struct receive_queue {
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
  
+	/* Min single buffer size for mergeable buffers case. */

+   unsigned int min_buf_len;
+
/* Name of this receive queue: input.$index */
char name[40];
  };
@@ -894,13 +898,14 @@ static int add_recvbuf_big(struct virtnet_info *vi, 
struct receive_queue *rq,
return err;
  }
  
-static unsigned int get_mergeable_buf_len(struct ewma_pkt_len *avg_pkt_len)

+static unsigned int get_mergeable_buf_len(struct receive_queue *rq,
+ struct ewma_pkt_len *avg_pkt_len)
  {
const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
unsigned int len;
  
  	len = hdr_len + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len),

-   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   rq->min_buf_len - hdr_len, PAGE_SIZE - hdr_len);
return ALIGN(len, L1_CACHE_BYTES);
  }
  
@@ -914,7 +919,7 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,

int err;
unsigned int len, hole;
  
-	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);

+   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len);
if (unlikely(!skb_page_frag_refill(len + headroom, alloc_frag, gfp)))
return -ENOMEM;
  
@@ -2086,6 +2091,21 @@ static void virtnet_del_vqs(struct virtnet_info *vi)

virtnet_free_queues(vi);
  }
  
+/* How large should a single buffer be so a queue full of these can fit at

+ * least one full packet?
+ * Logic below assumes the mergeable buffer header is used.
+ */
+static unsigned int mergeable_min_buf_len(struct virtnet_info *vi, struct 
virtqueue *vq)
+{
+   const unsigned int hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int rq_size = virtqueue_get_vring_size(vq);
+   unsigned int packet_len = vi->big_packets ? IP_MAX_MTU : 
vi->dev->max_mtu;
+   unsigned int buf_len = hdr_len + ETH_HLEN + VLAN_HLEN + packet_len;
+   unsigned int min_buf_len = DIV_ROUND_UP(buf_len, rq_size);
+
+   return max(min_buf_len, hdr_len);
+}


Considering that rq_size may be large, I think this should be max(min_buf_len, 
GOOD_PACKET_LEN)?
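
Roughly, that would be the following tweak to mergeable_min_buf_len() (a sketch of the suggestion, untested):

	/* clamp from below by GOOD_PACKET_LEN rather than hdr_len, so that a
	 * very large ring cannot shrink the per-buffer size to a useless value
	 */
	return max_t(unsigned int, min_buf_len, GOOD_PACKET_LEN);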



+
  static int virtnet_find_vqs(struct virtnet_info *vi)
  {
vq_callback_t **callbacks;
@@ -2151,6 +2171,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
  
  	for (i = 0; i < vi->max_queue_pairs; i++) {

vi->rq[i].vq = vqs[rxq2vq(i)];
+   vi->rq[i].min_buf_len = mergeable_min_buf_len(vi, vi->rq[i].vq);
vi->sq[i].vq = vqs[txq2vq(i)];
}
  
@@ -2237,7 +2258,8 @@ static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
  
  	BUG_ON(queue_index >= vi->max_queue_pairs);

	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
-   return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+   return sprintf(buf, "%u\n",
+  get_mergeable_buf_len(&vi->rq[queue_index], avg));
  }
  
  static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =




Re: [PATCH net-next] virtio_net: don't reset twice on XDP on/off

2017-03-29 Thread Jason Wang



On 2017年03月30日 04:14, Michael S. Tsirkin wrote:

We already do a reset once in remove_vq_common -
there appears to be no point in doing another one
when we add/remove XDP.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
  drivers/net/virtio_net.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index de42e9a..ed8f548 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1810,7 +1810,6 @@ static int virtnet_reset(struct virtnet_info *vi, int 
curr_qp, int xdp_qp)
virtnet_freeze_down(dev);
_remove_vq_common(vi);
  
-	dev->config->reset(dev);

virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER);
  


Acked-by: Jason Wang <jasow...@redhat.com>


Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Jason Wang



On 2017年03月30日 04:48, Michael S. Tsirkin wrote:

We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin
---


A quick glance and it looks OK, but what is the benefit of this series? Is 
it required by other changes?


Thanks


Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array

2017-03-29 Thread Jason Wang



On 2017年03月29日 18:46, Pankaj Gupta wrote:

Hi Jason,


On 2017年03月23日 13:34, Jason Wang wrote:



+{
+if (rvq->rh != rvq->rt)
+goto out;
+
+rvq->rh = rvq->rt = 0;
+rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq,
+VHOST_RX_BATCH);

A comment explaining why it is -bh would be helpful.

Ok.

Thanks

Rethinking this, it looks like -bh is not needed in this case since
no consumer runs in bh.

In that case, do we need the other variants of the "ptr_ring_consume_batched_*()"
functions? Are we planning to use them in the future?


I think we'd better keep them, since they serve as helpers. You can see 
that not all the helpers in ptr_ring have real users; they were 
prepared for future use.


Thanks




Thanks





Re: [PATCH net-next 7/8] vhost_net: try batch dequing from skb array

2017-03-29 Thread Jason Wang



On 2017年03月23日 13:34, Jason Wang wrote:






+{
+if (rvq->rh != rvq->rt)
+goto out;
+
+rvq->rh = rvq->rt = 0;
+rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq,
+VHOST_RX_BATCH);

A comment explaining why it is -bh would be helpful.


Ok.

Thanks 


Rethinking this, it looks like -bh is not needed in this case since 
no consumer runs in bh.


Thanks


Re: [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control

2017-03-30 Thread Jason Wang



On 2017年03月30日 23:06, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:28PM +0800, Jason Wang wrote:

This patch makes tun_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang<jasow...@redhat.com>

Do we need to bother with tun? I didn't realize one
can even use that with vhost. What would be the point of
all the virtio header stuff dealing with checksums etc?

Even if you see a use-case is it worth optimizing?




It's for tap, in fact. I used "tun" just because we already have a 
tap.c, which is used by macvtap.


Thanks


[PATCH net-next 2/8] skb_array: introduce batch dequeuing

2017-03-20 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/linux/skb_array.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..90e44b9 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct 
skb_array *a)
	return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
	return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_irq(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
	return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+   void **array, int n)
+{
+   return ptr_ring_consume_batched_any(&a->ring, array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
	return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+  void **array, int n)
+{
+   return ptr_ring_consume_batched_bh(&a->ring, array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {
-- 
2.7.4



[PATCH net-next 5/8] tun: support receiving skb through msg_control

2017-03-20 Thread Jason Wang
This patch makes tun_recvmsg() able to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 70dd9ec..a82bced 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file 
*tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
-   struct sk_buff *skb;
ssize_t ret;
int err;
 
@@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, 
struct tun_file *tfile,
if (!iov_iter_count(to))
return 0;
 
-   /* Read frames from ring */
-   skb = tun_ring_recv(tfile, noblock, &err);
-   if (!skb)
-   return err;
+   if (!skb) {
+   /* Read frames from ring */
+   skb = tun_ring_recv(tfile, noblock, &err);
+   if (!skb)
+   return err;
+   }
 
ret = tun_put_user(tun, tfile, skb, to);
if (unlikely(ret < 0))
@@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
 
if (!tun)
return -EBADFD;
-   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+   ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr 
*m, size_t total_len,
 SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
-   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > (ssize_t)total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4



[PATCH net-next 7/8] vhost_net: try batch dequing from skb array

2017-03-20 Thread Jason Wang
We used to dequeue one skb at a time from the skb_array during recvmsg();
this could be inefficient because of the poor cache utilization and the
spinlock touched for each packet. This patch tries to batch the dequeuing
by calling the batch dequeuing helpers explicitly on the exported skb
array and passing each skb back through msg_control for the underlying
socket to finish the userspace copy.

Tests were done by XDP1:
- small buffer:
  Before: 1.88Mpps
  After : 2.25Mpps (+19.6%)
- mergeable buffer:
  Before: 1.83Mpps
  After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 64 +
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..53f09f2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
 struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;
@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   void *rxq[VHOST_RX_BATCH];
+   int rt;
+   int rh;
 };
 
 struct vhost_net {
@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
 
 }
@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
	mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int peek_head_len_batched(struct vhost_net_virtqueue *rvq)
+{
+   if (rvq->rh != rvq->rt)
+   goto out;
+
+   rvq->rh = rvq->rt = 0;
+   rvq->rt = skb_array_consume_batched_bh(rvq->rx_array, rvq->rxq,
+   VHOST_RX_BATCH);
+   if (!rvq->rt)
+   return 0;
+out:
+   return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
struct socket *sock = sk->sk_socket;
struct sk_buff *head;
int len = 0;
unsigned long flags;
 
+   if (rvq->rx_array)
+   return peek_head_len_batched(rvq);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
 
@@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
	return skb_queue_empty(&sk->sk_receive_queue);
 }
 
-static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+ struct sock *sk)
 {
+   struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
	struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
-   int len = peek_head_len(sk);
+   int len = peek_head_len(rvq, sk);
 
if (!len && vq->busyloop_timeout) {
/* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
	vhost_poll_queue(&vq->poll);
	mutex_unlock(&vq->mutex);
 
-   len = peek_head_len(sk);
+   len = peek_head_len(rvq, sk);
}
 
return len;
@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
+   if (nvq->rx_array)
+   msg.msg_control = nvq->rxq[nvq->rh++];
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
	iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net 
*n,
struct vhost_virtqueue *vq)
 {
struct socket *sock;
+   struct vhost_net_virtqueue *nvq =
+   container_of(vq, struct vhost_net_virtqueue, vq);
 
mu

[PATCH net-next 3/8] tun: export skb_array

2017-03-20 Thread Jason Wang
This patch exports skb_array through tun_get_skb_array(). Caller can
then manipulate skb array directly.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c  | 13 +
 include/linux/if_tun.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c418f0a..70dd9ec 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+   struct tun_file *tfile;
+
+   if (file->f_op != &tun_fops)
+   return ERR_PTR(-EINVAL);
+   tfile = file->private_data;
+   if (!tfile)
+   return ERR_PTR(-EBADFD);
+   return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4



[PATCH net-next 6/8] tap: support receiving skb from msg_control

2017-03-20 Thread Jason Wang
This patch makes tap_recvmsg() able to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
 {
DEFINE_WAIT(wait);
-   struct sk_buff *skb;
ssize_t ret = 0;
 
if (!iov_iter_count(to))
return 0;
 
+   if (skb)
+   goto done;
+
while (1) {
if (!noblock)
	prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
if (!noblock)
	finish_wait(sk_sleep(&q->sk), &wait);
 
+done:
if (skb) {
ret = tap_put_user(q, skb, to);
if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
struct tap_queue *q = file->private_data;
ssize_t len = iov_iter_count(to), ret;
 
-   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+   ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr 
*m,
int ret;
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
return -EINVAL;
-   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+   ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+ m->msg_control);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4



[PATCH net-next 4/8] tap: export skb_array

2017-03-20 Thread Jason Wang
This patch exports skb_array through tap_get_skb_array(). Caller can
then manipulate skb array directly.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tap.c  | 13 +
 include/linux/if_tap.h |  5 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+   struct tap_queue *q;
+
+   if (file->f_op != &tap_fops)
+   return ERR_PTR(-EINVAL);
+   q = file->private_data;
+   if (!q)
+   return ERR_PTR(-EBADFD);
+   return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include 
 #include 
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include 
-- 
2.7.4



[PATCH net-next 0/8] vhost-net rx batching

2017-03-20 Thread Jason Wang
Hi all:

This series tries to implement rx batching for vhost-net. This is done
by batching the dequeuing from the skb_array exported by the underlying
socket and passing the skb back through msg_control to finish the
userspace copying.

Tests show up to a 19% improvement in rx pps.

Please review.

Thanks

Jason Wang (8):
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array
  vhost_net: use lockless peeking for skb array during busy polling

 drivers/net/tap.c | 25 ++---
 drivers/net/tun.c | 31 +++--
 drivers/vhost/net.c   | 71 +++
 include/linux/if_tap.h|  5 
 include/linux/if_tun.h|  5 
 include/linux/ptr_ring.h  | 65 +++
 include/linux/skb_array.h | 25 +
 7 files changed, 209 insertions(+), 18 deletions(-)

-- 
2.7.4



[PATCH net-next 8/8] vhost_net: use lockless peeking for skb array during busy polling

2017-03-20 Thread Jason Wang
For the socket that exports its skb array, we can use lockless polling
to avoid touching spinlock during busy polling.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53f09f2..41153a3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -551,10 +551,13 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, 
struct sock *sk)
return len;
 }
 
-static int sk_has_rx_data(struct sock *sk)
+static int sk_has_rx_data(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
struct socket *sock = sk->sk_socket;
 
+   if (rvq->rx_array)
+   return !__skb_array_empty(rvq->rx_array);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
 
@@ -579,7 +582,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net,
endtime = busy_clock() + vq->busyloop_timeout;
 
	while (vhost_can_busy_poll(&net->dev, endtime) &&
-  !sk_has_rx_data(sk) &&
+  !sk_has_rx_data(rvq, sk) &&
   vhost_vq_avail_empty(&net->dev, vq))
cpu_relax();
 
-- 
2.7.4



Re: [PATCH net-next 1/8] ptr_ring: introduce batch dequeuing

2017-03-21 Thread Jason Wang



On 2017年03月21日 18:25, Sergei Shtylyov wrote:

Hello!

On 3/21/2017 7:04 AM, Jason Wang wrote:


Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 include/linux/ptr_ring.h | 65 


 1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..4771ded 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct 
ptr_ring *r)

 return ptr;
 }

+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+ void **array, int n)
+{
+void *ptr;
+int i = 0;
+
+while (i < n) {


   Hm, why not *for*?


Yes, it may be better. If there are other comments on the series, I will 
change it in the next version.
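
For reference, the for-loop form would look like this; it is also the form the v2 posting later in this thread ended up using:

static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
					     void **array, int n)
{
	void *ptr;
	int i;

	for (i = 0; i < n; i++) {
		ptr = __ptr_ring_consume(r);
		if (!ptr)
			break;
		array[i] = ptr;
	}

	return i;
}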


Thanks



Re: [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control

2017-03-30 Thread Jason Wang



On 2017年03月30日 23:03, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:29PM +0800, Jason Wang wrote:

This patch makes tap_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang<jasow...@redhat.com>
---
  drivers/net/tap.c | 12 
  1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
  
  static ssize_t tap_do_read(struct tap_queue *q,

   struct iov_iter *to,
-  int noblock)
+  int noblock, struct sk_buff *skb)
  {
DEFINE_WAIT(wait);
-   struct sk_buff *skb;
ssize_t ret = 0;
  
  	if (!iov_iter_count(to))

return 0;
  
+	if (skb)

+   goto done;
+
while (1) {
if (!noblock)
prepare_to_wait(sk_sleep(>sk), ,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
if (!noblock)
finish_wait(sk_sleep(>sk), );
  
+done:

Please just use an if {} block here. goto on error is ok,
but we are far from done here and goto done is misleading.




Ok.
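
E.g. something like the following (a sketch only; the body of the wait loop is assumed to match the mainline tap_do_read() of this period, the only change being that it is wrapped in an if (!skb) block):

	if (!iov_iter_count(to))
		return 0;

	if (!skb) {
		/* the existing blocking receive loop, unchanged, now runs
		 * only when no skb was handed in via msg_control
		 */
		while (1) {
			if (!noblock)
				prepare_to_wait(sk_sleep(&q->sk), &wait,
						TASK_INTERRUPTIBLE);
			skb = skb_array_consume(&q->skb_array);
			if (skb)
				break;
			if (noblock) {
				ret = -EAGAIN;
				break;
			}
			if (signal_pending(current)) {
				ret = -ERESTARTSYS;
				break;
			}
			schedule();
		}
		if (!noblock)
			finish_wait(sk_sleep(&q->sk), &wait);
	}

	if (skb) {
		ret = tap_put_user(q, skb, to);
		if (unlikely(ret < 0))
			kfree_skb(skb);
		else
			consume_skb(skb);
	}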

Thanks.


Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array

2017-03-31 Thread Jason Wang



On 2017年03月31日 12:02, Jason Wang wrote:



On 2017年03月30日 22:21, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote:

We used to dequeue one skb during recvmsg() from skb_array, this could
be inefficient because of the bad cache utilization

which cache does this refer to btw?


Both icache and dcache more or less.




and spinlock
touching for each packet.

Do you mean the effect of extra two atomics here?


In fact four; packet length peeking needs another two.




This patch tries to batch them by calling
batch dequeuing helpers explicitly on the exported skb array and pass
the skb back through msg_control for underlayer socket to finish the
userspace copying.

Tests were done by XDP1:
- small buffer:
   Before: 1.88Mpps
   After : 2.25Mpps (+19.6%)
- mergeable buffer:
   Before: 1.83Mpps
   After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang <jasow...@redhat.com>

Looks like I misread the code previously. More comments below,
sorry about not asking these questions earlier.


---
  drivers/vhost/net.c | 64 
+

  1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..ffa78c6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
#include 
  @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
  struct vhost_virtqueue *vq;
  };
  +#define VHOST_RX_BATCH 64
  struct vhost_net_virtqueue {
  struct vhost_virtqueue vq;
  size_t vhost_hlen;

Could you please try playing with batch size and see
what the effect is?


OK. I tried 32, which seems slower than 64 but still faster than no 
batching.


OK, the results are:

no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps







@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
  /* Reference counting for outstanding ubufs.
   * Protected by vq mutex. Writers must also take device mutex. */
  struct vhost_net_ubuf_ref *ubufs;
+struct skb_array *rx_array;
+void *rxq[VHOST_RX_BATCH];
+int rt;
+int rh;
  };
struct vhost_net {
@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
  n->vqs[i].ubufs = NULL;
  n->vqs[i].vhost_hlen = 0;
  n->vqs[i].sock_hlen = 0;
+n->vqs[i].rt = 0;
+n->vqs[i].rh = 0;
  }
}
@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
  mutex_unlock(>mutex);
  }
  -static int peek_head_len(struct sock *sk)
+static int fetch_skbs(struct vhost_net_virtqueue *rvq)
+{
+if (rvq->rh != rvq->rt)
+goto out;
+
+rvq->rh = rvq->rt = 0;
+rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
+VHOST_RX_BATCH);
+if (!rvq->rt)
+return 0;
+out:
+return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct 
sock *sk)

  {
  struct socket *sock = sk->sk_socket;
  struct sk_buff *head;
  int len = 0;
  unsigned long flags;
  +if (rvq->rx_array)
+return fetch_skbs(rvq);
+
  if (sock->ops->peek_len)
  return sock->ops->peek_len(sock);
  @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
  return skb_queue_empty(>sk_receive_queue);
  }
  -static int vhost_net_rx_peek_head_len(struct vhost_net *net, 
struct sock *sk)

+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+  struct sock *sk)
  {
+struct vhost_net_virtqueue *rvq = >vqs[VHOST_NET_VQ_RX];
  struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX];
  struct vhost_virtqueue *vq = >vq;
  unsigned long uninitialized_var(endtime);
-int len = peek_head_len(sk);
+int len = peek_head_len(rvq, sk);
if (!len && vq->busyloop_timeout) {
  /* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct 
vhost_net *net, struct sock *sk)

  vhost_poll_queue(>poll);
  mutex_unlock(>mutex);
  -len = peek_head_len(sk);
+len = peek_head_len(rvq, sk);
  }
return len;
@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
  /* On error, stop handling until the next kick. */
  if (unlikely(headcount < 0))
  goto out;
+if (nvq->rx_array)
+msg.msg_control = nvq->rxq[nvq->rh++];
  /* On overrun, truncate and discard */
  if (unlikely(headcount > UIO_MAXIOV)) {
  iov_iter_init(_iter, READ, vq->iov, 1, 1);

So there's a bit of a mystery here. vhost code isn't
batched, all we are batching is the fetch from the tun ring.


I've already had vhost batching code

Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing

2017-03-31 Thread Jason Wang



On 2017年03月31日 22:31, Michael S. Tsirkin wrote:

On Fri, Mar 31, 2017 at 11:52:24AM +0800, Jason Wang wrote:

On 2017年03月30日 21:53, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:

This patch introduces a batched version of consuming: the consumer can
dequeue more than one pointer from the ring at a time. We don't care
about the reordering of reads here, so no compiler barrier is needed.

Signed-off-by: Jason Wang<jasow...@redhat.com>
---
   include/linux/ptr_ring.h | 65 

   1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..2be0f350 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
return ptr;
   }
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+void **array, int n)

Can we use a shorter name? ptr_ring_consume_batch?

Ok, but at least we need to keep the prefix since there's a locked version.




+{
+   void *ptr;
+   int i;
+
+   for (i = 0; i < n; i++) {
+   ptr = __ptr_ring_consume(r);
+   if (!ptr)
+   break;
+   array[i] = ptr;
+   }
+
+   return i;
+}
+
   /*
* Note: resize (below) nests producer lock within consumer lock, so if you
* call this in interrupt or BH context, you must disable interrupts/BH when

I'd like to add a code comment here explaining why we don't
care about cpu or compiler reordering. And I think the reason is
in the way you use this API: in vhost it does not matter
if you get less entries than present in the ring.
That's ok but needs to be noted
in a code comment so people use this function correctly.

Interesting, but I still think it's not necessary.

If the consumer is doing busy polling, it will eventually get the entries. If
the consumer needs a notification from the producer, it should drain the queue,
which means it needs to enable notification before the last consume attempt;
otherwise that is a bug. The batch consuming function in this patch guarantees
returning at least one pointer if any are present; doesn't that look sufficient
for correctness?

Thanks

You ask for N entries but get N-1. This seems to imply the
ring is now empty. Do we guarantee this?


I think the consumer cannot assume the ring is empty, considering the 
producer can be producing at the same time. It needs to enable notification 
and do another poll in this case.
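
In other words, the consumer-side pattern being described is roughly the following (the notification and process helpers are made-up names for illustration):

	for (;;) {
		int n = __ptr_ring_consume_batched(r, batch, BATCH);

		if (n) {
			process(batch, n);		/* hypothetical consumer work */
			continue;
		}
		/* The ring looked empty, but a producer may be filling it
		 * right now: re-enable notification, then poll once more
		 * before waiting.
		 */
		enable_producer_notification(r);	/* hypothetical */
		n = __ptr_ring_consume_batched(r, batch, BATCH);
		if (!n)
			break;				/* really empty: wait for a kick */
		disable_producer_notification(r);	/* hypothetical */
		process(batch, n);			/* hypothetical */
	}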


Thanks


Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-20 Thread Jason Wang



On 2017年04月20日 21:58, Willem de Bruijn wrote:

On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang <jasow...@redhat.com> wrote:


On 2017年04月19日 04:21, Willem de Bruijn wrote:

+static void virtnet_napi_tx_enable(struct virtnet_info *vi,
+  struct virtqueue *vq,
+  struct napi_struct *napi)
+{
+   if (!napi->weight)
+   return;
+
+   if (!vi->affinity_hint_set) {
+   napi->weight = 0;
+   return;
+   }
+
+   return virtnet_napi_enable(vq, napi);
+}
+
   static void refill_work(struct work_struct *work)


Maybe I was wrong, but according to Michael's comment it looks like he wants
to check affinity_hint_set only for the speculative tx polling in rx napi,
instead of disabling it entirely.

And I'm not convinced this is really needed: the driver only provides an
affinity hint rather than an affinity, so it's not guaranteed that the tx
and rx interrupts are on the same vcpus.

You're right. I made the restriction broader than the request, to really err
on the side of caution for the initial merge of napi tx. And enabling
the optimization is always a win over keeping it off, even without irq
affinity.

The cycle cost is significant without affinity regardless of whether the
optimization is used.


Yes, I noticed this in the past too.


Though this is not limited to napi-tx, it is more
pronounced in that mode than without napi.

1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:

upstream:

1,1,1: 28985 Mbps, 278 Gcyc
1,0,2: 30067 Mbps, 402 Gcyc

napi tx:

1,1,1: 34492 Mbps, 269 Gcyc
1,0,2: 36527 Mbps, 537 Gcyc (!)
1,0,1: 36269 Mbps, 394 Gcyc
1,0,0: 34674 Mbps, 402 Gcyc

This is a particularly strong example. It is also representative
of most RR tests. It is less pronounced in other streaming tests.
10x TCP_RR, for instance:

upstream:

1,1,1: 42267 Mbps, 301 Gcyc
1,0,2: 40663 Mbps, 445 Gcyc

napi tx:

1,1,1: 42420 Mbps, 303 Gcyc
1,0,2:  42267 Mbps, 431 Gcyc

These numbers were obtained with the virtqueue_enable_cb_delayed
optimization after xmit_skb, btw. It turns out that moving that before
increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
100x TCP_RR a bit.


I see, so I think we can leave the affinity hint optimization/check for 
future investigation:

- to avoid endless optimization for this series (e.g. we may want to share a 
single vector/napi for tx/rx queue pairs in the future).

- tx napi is disabled by default, which means we can do the optimization on top.

Thanks


Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-20 Thread Jason Wang



On 2017年04月20日 23:34, Vlad Yasevich wrote:

On 04/17/2017 11:01 PM, Jason Wang wrote:


On 2017年04月16日 00:38, Vladislav Yasevich wrote:

Currently the virtio net header is fixed size and adding things to it is rather
difficult to do.  This series attempts to add the infrastructure as well as some
extensions that try to resolve some deficiencies we currently have.

First, the vnet header only has space for 16 flags.  This may not be enough
in the future.  The extensions will provide space for 32 possible extension
flags and 32 possible extensions.  These flags will be carried in the
first pseudo extension header, the presence of which will be determined by
the flag in the virtio net header.

The extensions themselves will immediately follow the extension header itself.
They will be added to the packet in the same order as they appear in the
extension flags.  No padding is placed between the extensions, and any
extensions negotiated but not used by a given packet will convert to
trailing padding.

Do we need an explicit padding (e.g. an extension) that could be controlled by 
each side?

I don't think so.  The size of the vnet header is set based on the extensions 
negotiated.
The one part I am not crazy about is that when a packet does not use any 
extensions,
the data is still placed after the entire vnet header, which essentially adds a 
lot
of padding.  However, that's really no different than if we simply grew the 
vnet header.

The other thing I've tried before is putting the extensions into their own sg 
buffer, but that
made it slower.


Yes.




For example:
   | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data |

Just some rough thoughts:

- Would it be better to use TLV instead of a bitmap here? One advantage of TLV is 
that the
length is not limited by the length of the bitmap.

But the disadvantage is that we add at least 4 bytes of pure TL 
data per extension.  That
makes this thing even longer.


Yes, and it looks like the length is still limited by e.g the length of T.




- For 1.1, do we really want something like the vnet header? AFAIK, it is not used 
by modern
NICs; would it be better to pack all the metadata into the descriptor itself? This may 
need some
changes in tun/macvtap, but looks more PCIe friendly.

That would really be ideal and I've looked at this.  There are small issues around 
exposing
the 'net metadata' of the descriptor to taps so they can fill it in.  The 
alternative
is to use a different control structure for the tap->qemu|vhost channel (which can be
implementation specific) and have qemu|vhost populate the 'net metadata' of the 
descriptor.


Yes, this needs some thought. For vhost, things look a little bit 
easier; we can probably use msg_control.
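
For context, the rx batching series elsewhere in this archive already hands an skb over this way; a simplified view of its handle_rx() change (the recvmsg() call shape is recalled from the surrounding vhost code rather than quoted above):

	/* vhost dequeues the skb itself and passes it to tap/tun through
	 * msg_control, so the socket only performs the userspace copy
	 */
	if (nvq->rx_array)
		msg.msg_control = nvq->rxq[nvq->rh++];
	err = sock->ops->recvmsg(sock, &msg, sock_len,
				 MSG_DONTWAIT | MSG_TRUNC);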


Thanks


Thanks
-vlad


Thanks


Extensions proposed in this series are:
   - IPv6 fragment id extension
 * Currently, the guest generated fragment id is discarded and the host
   generates an IPv6 fragment id if the packet has to be fragmented.  The
   code attempts to add time based perturbation to id generation to make
   it harder to guess the next fragment id to be used.  However, doing this
   on the host may result is less perturbation (due to differnet timing)
   and might make id guessing easier.  Ideally, the ids generated by the
   guest should be used.  One could also argue that we a "violating" the
   IPv6 protocol in the if the _strict_ interpretation of the spec.

   - VLAN header acceleration
 * Currently virtio does not do vlan header acceleration and instead
   uses software tagging.  One of the first things that the host will do is
   strip the vlan header out.  When passing the packet to a guest, the
   vlan header is re-inserted into the packet.  We can skip all that work
   if we can pass the vlan data in accelerated format.  Then the host will
   not do any extra work.  However, so far, this has yielded a very small
   perf bump (only ~1%).  I am still looking into this.

   - UDP tunnel offload
 * Similar to vlan acceleration, with this extension we can pass additional
   data to the host to support GSO with udp tunnel and possibly other
   encapsulations.  This yields a significant performance improvement
   (still testing remote checksum code).

An additional extension that is unfinished (still testing for any
side-effects) is checksum passthrough to support drivers that set
CHECKSUM_COMPLETE.  This would eliminate the need for guests to compute
the software checksum.

This series only takes care of virtio net.  I have additional patches for the
host side (vhost and tap/macvtap as well as qemu), but wanted to get feedback
on the general approach first.

Vladislav Yasevich (6):
virtio-net: Remove the use the padded vnet_header structure
virtio-net: make header length handling uniform
virtio_net: Add basic skeleton for handling vnet header extensions.
virtio-net: Add support for IPv6 fragment i

Re: [PATCH RFC (resend) net-next 3/6] virtio_net: Add basic skeleton for handling vnet header extensions.

2017-04-17 Thread Jason Wang



On 2017年04月16日 00:38, Vladislav Yasevich wrote:

This is the basic skeleton which will be fleshed out by individual
extensions.

Signed-off-by: Vladislav Yasevich 
---
  drivers/net/virtio_net.c| 21 +
  include/linux/virtio_net.h  | 12 
  include/uapi/linux/virtio_net.h | 11 +++
  3 files changed, 44 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ad6ee6..08e2709 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -145,6 +145,10 @@ struct virtnet_info {
/* Packet virtio header size */
u8 hdr_len;
  
+	/* Header extensions were negotiated */

+   bool hdr_ext;
+   u32 ext_mask;
+
/* Active statistics */
struct virtnet_stats __percpu *stats;
  
@@ -174,6 +178,11 @@ struct virtnet_info {

u32 speed;
  };
  
+struct virtio_net_hdr_max {

+   struct virtio_net_hdr_mrg_rxbuf hdr;
+   struct virtio_net_ext_hdr ext_hdr;
+};
+
  static inline u8 padded_vnet_hdr(struct virtnet_info *vi)
  {
u8 hdr_len = vi->hdr_len;
@@ -214,6 +223,7 @@ static int rxq2vq(int rxq)
  
  static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)

  {
+   BUILD_BUG_ON(sizeof(struct virtio_net_hdr_max) > sizeof(skb->cb));
return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
  }
  
@@ -767,6 +777,12 @@ static int receive_buf(struct virtnet_info *vi, struct receive_queue *rq,

goto frame_err;
}
  
+	if (vi->hdr_ext &&

+   virtio_net_ext_to_skb(skb,
+ (struct virtio_net_ext_hdr *)(hdr + 1))) {
+   goto frame_err;
+   }
+
skb->protocol = eth_type_trans(skb, dev);
pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
 ntohs(skb->protocol), skb->len, skb->pkt_type);
@@ -1106,6 +1122,11 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
if (vi->mergeable_rx_bufs)
hdr->num_buffers = 0;
  
+	if (vi->hdr_ext &&

+   virtio_net_ext_from_skb(skb, (struct virtio_net_ext_hdr *)(hdr + 1),
+   vi->ext_mask))
+   BUG();
+
sg_init_table(sq->sg, skb_shinfo(skb)->nr_frags + (can_push ? 1 : 2));
if (can_push) {
__skb_push(skb, hdr_len);
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 5209b5e..eaa524f 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -100,4 +100,16 @@ static inline int virtio_net_hdr_from_skb(const struct 
sk_buff *skb,
return 0;
  }
  
+static inline int virtio_net_ext_to_skb(struct sk_buff *skb,

+   struct virtio_net_ext_hdr *ext)
+{
+   return 0;
+}
+
+static inline int virtio_net_ext_from_skb(const struct sk_buff *skb,
+ struct virtio_net_ext_hdr *ext,
+ __u32 ext_mask)
+{
+   return 0;
+}
  #endif /* _LINUX_VIRTIO_NET_H */
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..0039b72 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -88,6 +88,7 @@ struct virtio_net_config {
  struct virtio_net_hdr_v1 {
  #define VIRTIO_NET_HDR_F_NEEDS_CSUM   1   /* Use csum_start, csum_offset 
*/
  #define VIRTIO_NET_HDR_F_DATA_VALID   2   /* Csum is valid */
+#define VIRTIO_NET_HDR_F_VNET_EXT  4   /* Vnet extensions present */
__u8 flags;
  #define VIRTIO_NET_HDR_GSO_NONE   0   /* Not a GSO frame */
  #define VIRTIO_NET_HDR_GSO_TCPV4  1   /* GSO frame, IPv4 TCP (TSO) */
@@ -102,6 +103,16 @@ struct virtio_net_hdr_v1 {
__virtio16 num_buffers; /* Number of merged rx buffers */
  };
  
+/* If IRTIO_NET_HDR_F_VNET_EXT flags is set, this header immediately

+ * follows the virtio_net_hdr.  The flags in this header will indicate
+ * which extension will follow.  The extnsion data will immidiately follow


s/extnsion/extension/


+ * this header.
+ */
+struct virtio_net_ext_hdr {
+   __u32 flags;
+   __u8 extensions[];
+};
+
  #ifndef VIRTIO_NET_NO_LEGACY
  /* This header comes first in the scatter-gather list.
   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must




Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-17 Thread Jason Wang



On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
---

Jason, in my mind the biggest issue with your batching patchset is the
packet drops on disconnect.  This API will help avoid that in the common
case.


OK, I will rebase the series on top of this. (Though I don't think we 
care about the packet loss.)




I would still prefer that we understand what's going on,


I tried to reply in another thread; does that make sense?


  and I would
like to know what's the smallest batch size that's still helpful,


Yes, I've replied in another thread; the results are:


no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps


  but
I'm not going to block the patch on these grounds assuming packet drops
are fixed.


Thanks a lot.



Lightly tested - this is on top of consumer batching patches.

Thanks!

  include/linux/ptr_ring.h | 57 
  1 file changed, 57 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 783e7f5..5fbeab4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
  }
  
+/*

+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&(r)->consumer_lock, flags);
+   spin_lock(&(r)->producer_lock);
+
+   if (!r->size)
+   goto done;
+
+   /*
+* Clean out buffered entries (for simplicity). This way following code
+* can test entries for NULL and if not assume they are valid.
+*/
+   head = r->consumer_head - 1;
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+
+   /*
+* Go over entries in batch, start moving head back and copy entries.
+* Stop when we run into previously unconsumed entries.
+*/
+   while (n--) {
+   head = r->consumer_head - 1;
+   if (head < 0)
+   head = r->size - 1;
+   if (r->queue[head]) {
+   /* This batch entry will have to be destroyed. */
+   ++n;
+   goto done;
+   }
+   r->queue[head] = batch[n];
+   r->consumer_tail = r->consumer_head = head;
+   }
+
+done:
+   /* Destroy all entries left in the batch. */
+   while (n--) {
+   destroy(batch[n]);
+   }
+   spin_unlock(&(r)->producer_lock);
+   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
+}
+
  static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
   int size, gfp_t gfp,
   void (*destroy)(void *))




Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-20 Thread Jason Wang



On 2017年04月19日 04:21, Willem de Bruijn wrote:

From: Willem de Bruijn <will...@google.com>

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on transmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 77 +---
  1 file changed, 67 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9c1df29892c..c173e85dc7b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,9 +33,10 @@
  static int napi_weight = NAPI_POLL_WEIGHT;
  module_param(napi_weight, int, 0444);
  
-static bool csum = true, gso = true;

+static bool csum = true, gso = true, napi_tx;
  module_param(csum, bool, 0444);
  module_param(gso, bool, 0444);
+module_param(napi_tx, bool, 0644);
  
  /* FIXME: MTU in config. */

  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -86,6 +87,8 @@ struct send_queue {
  
  	/* Name of the send queue: output.$index */

char name[40];
+
+   struct napi_struct napi;
  };
  
  /* Internal representation of a receive virtqueue */

@@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct 
*napi,
  static void skb_xmit_done(struct virtqueue *vq)
  {
struct virtnet_info *vi = vq->vdev->priv;
+   struct napi_struct *napi = >sq[vq2txq(vq)].napi;
  
  	/* Suppress further interrupts. */

virtqueue_disable_cb(vq);
  
-	/* We were probably waiting for more output buffers. */

-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi->weight)
+   virtqueue_napi_schedule(napi, vq);
+   else
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, vq2txq(vq));
  }
  
  static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)

@@ -972,6 +979,21 @@ static void virtnet_napi_enable(struct virtqueue *vq, 
struct napi_struct *napi)
local_bh_enable();
  }
  
+static void virtnet_napi_tx_enable(struct virtnet_info *vi,

+  struct virtqueue *vq,
+  struct napi_struct *napi)
+{
+   if (!napi->weight)
+   return;
+
+   if (!vi->affinity_hint_set) {
+   napi->weight = 0;
+   return;
+   }
+
+   return virtnet_napi_enable(vq, napi);
+}
+
  static void refill_work(struct work_struct *work)
  {
struct virtnet_info *vi =
@@ -1046,6 +1068,7 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(vi, >rq[i], GFP_KERNEL))
schedule_delayed_work(>refill, 0);
virtnet_napi_enable(vi->rq[i].vq, >rq[i].napi);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq, >sq[i].napi);
}
  
  	return 0;

@@ -1081,6 +1104,25 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(>tx_syncp);
  }
  
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)

+{
+   struct send_queue *sq = container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+
+   if (__netif_tx_trylock(txq)) {
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+   }
+
+   virtqueue_napi_complete(napi, sq->vq, 0);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_tx_wake_queue(txq);
+
+   return 0;
+}
+
  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
  {
struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,9 +1172,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+   bool use_napi = sq->napi.weight;
  
  	/* Free up any pending old buffers before queueing new ones. */

-   free_old_xmit_skbs(sq);
+   if (!use_napi)
+   free_old_xmit_skbs(sq);


I'm not sure this is best or even correct. Consider that we clean xmit 
packets speculatively in virtnet_poll_tx

Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-20 Thread Jason Wang



On 2017年04月19日 04:21, Willem de Bruijn wrote:

From: Willem de Bruijn 

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.

Signed-off-by: Willem de Bruijn 
---
  drivers/net/virtio_net.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b14c82ce0032..b107ae011632 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1196,8 +1196,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
struct net_device *dev)
bool use_napi = sq->napi.weight;
  
  	/* Free up any pending old buffers before queueing new ones. */

-   if (!use_napi)
+   if (use_napi) {
+   if (kick)
+   virtqueue_enable_cb_delayed(sq->vq);
+   else
+   virtqueue_disable_cb(sq->vq);


Since virtqueue_disable_cb() does nothing for event idx, I wonder whether 
just calling virtqueue_enable_cb_delayed() is OK here.


Btw, it does not disable the interrupt at all. I proposed a patch in the past 
which can do more than this:


https://patchwork.kernel.org/patch/6472601/
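
For reference, the simplification being asked about would make the start_xmit() hunk look roughly like this (an untested sketch; with event idx the disable call is a no-op anyway):

	/* Free up any pending old buffers before queueing new ones. */
	if (use_napi) {
		if (kick)
			virtqueue_enable_cb_delayed(sq->vq);
	} else {
		free_old_xmit_skbs(sq);
	}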

Thanks




+   } else {
free_old_xmit_skbs(sq);
+   }
  
  	/* timestamp packet in software */

skb_tx_timestamp(skb);




Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-20 Thread Jason Wang



On 2017年04月19日 04:21, Willem de Bruijn wrote:

+static void virtnet_napi_tx_enable(struct virtnet_info *vi,
+  struct virtqueue *vq,
+  struct napi_struct *napi)
+{
+   if (!napi->weight)
+   return;
+
+   if (!vi->affinity_hint_set) {
+   napi->weight = 0;
+   return;
+   }
+
+   return virtnet_napi_enable(vq, napi);
+}
+
  static void refill_work(struct work_struct *work)


Maybe I was wrong, but according to Michael's comment it looks like he 
wants to check affinity_hint_set only for the speculative tx polling in 
rx napi, instead of disabling it entirely.

And I'm not convinced this is really needed: the driver only provides an 
affinity hint rather than an affinity, so it's not guaranteed that the tx 
and rx interrupts are on the same vcpus.


Thanks


Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-23 Thread Jason Wang



On 2017年04月21日 21:08, Vlad Yasevich wrote:

On 04/21/2017 12:05 AM, Jason Wang wrote:

On 2017年04月20日 23:34, Vlad Yasevich wrote:

On 04/17/2017 11:01 PM, Jason Wang wrote:

On 2017年04月16日 00:38, Vladislav Yasevich wrote:

Curreclty virtion net header is fixed size and adding things to it is rather
difficult to do.  This series attempt to add the infrastructure as well as some
extensions that try to resolve some deficiencies we currently have.

First, vnet header only has space for 16 flags.  This may not be enough
in the future.  The extensions will provide space for 32 possbile extension
flags and 32 possible extensions.   These flags will be carried in the
first pseudo extension header, the presense of which will be determined by
the flag in the virtio net header.

The extensions themselves will immidiately follow the extension header itself.
They will be added to the packet in the same order as they appear in the
extension flags.  No padding is placed between the extensions and any
extensions negotiated, but not used need by a given packet will convert to
trailing padding.

Do we need a explicit padding (e.g an extension) which could be controlled by 
each side?

I don't think so.  The size of the vnet header is set based on the extensions 
negotiated.
The one part I am not crazy about is that in the case of packet not using any 
extensions,
the data is still placed after the entire vnet header, which essentially adds a 
lot
of padding.  However, that's really no different then if we simply grew the 
vnet header.

The other thing I've tried before is putting extensions into their own sg 
buffer, but that
made it slower.h

Yes.


For example:
| vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data |

Just some rough thoughts:

- Is this better to use TLV instead of bitmap here? One advantage of TLV is 
that the
length is not limited by the length of bitmap.

but the disadvantage is that we add at least 4 bytes per extension of just TL
data.  That makes this thing even longer.

Yes, and it looks like the length is still limited by e.g. the size of the T field.

Not only that, but it is also limited by skb->cb as a whole.  So putting
extensions into a TLV style means we have fewer extensions for now, until we
get rid of the skb->cb usage.
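
To make the trade-off concrete, here is a hypothetical side-by-side of the two
layouts being debated; neither struct is from the posted series:

/* Bitmap style, as in the RFC: one flags word says which extensions
 * follow, packed in flag order, with no per-extension length.
 */
struct ext_hdr_bitmap {
	__u32 flags;		/* VIRTIO_NET_EXT_F_* bits */
	__u8  extensions[];	/* packed in flag order */
};

/* TLV style: each extension is self-describing, so the set is not
 * capped by the width of a bitmap, at the cost of at least four bytes
 * of type/length per extension (and, today, extra pressure on skb->cb).
 */
struct ext_tlv {
	__le16 type;
	__le16 len;		/* length of value[] in bytes */
	__u8   value[];
};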


- For 1.1, do we really want something like the vnet header? AFAIK, nothing
like it is used by modern NICs; would it be better to pack all metadata into
the descriptor itself? This may need some changes in tun/macvtap, but looks
more PCIe friendly.

That would really be ideal and I've looked at this.  There are small issues of 
exposing
the 'net metadata' of the descriptor to taps so they can be filled in.  The 
alternative
is to use a different control structure for tap->qemu|vhost channel (that can be
implementation specific) and have qemu|vhost populate the 'net metadata' of the 
descriptor.

Yes, this needs some thought. For vhost, things look a little bit easier; we
can probably use msg_control.


We can use msg_control in qemu as well, can't we?


AFAIK, it needs some changes since we don't export the socket to userspace.


  It really is a question of who is doing
the work and the number of copies.

I can take a closer look at how it would look if we extend the descriptor with
type-specific data.  I don't know whether other users of virtio would benefit
from it.


Not sure, but we can have a common descriptor header followed by
device-specific metadata. This probably needs some prototype benchmarking to
see the benefits first.


Thanks
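
A purely hypothetical illustration of the "common descriptor header followed
by device-specific metadata" idea; it is not part of any posted spec or patch:

struct vring_desc_meta {
	__le64 addr;
	__le32 len;
	__le16 flags;
	__le16 id;
	/* hypothetical common metadata header */
	__le16 meta_type;	/* which device class the metadata is for */
	__le16 meta_len;	/* valid bytes in meta[] */
	__u8   meta[12];	/* device-specific area, e.g. csum/GSO hints */
};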


Re: [PATCH net-next RFC 2/4] virtio-net: transmit napi

2017-03-06 Thread Jason Wang



On 2017年03月03日 22:39, Willem de Bruijn wrote:

From: Willem de Bruijn 

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on transmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional for now. Follow-on
patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn 
---
  drivers/net/virtio_net.c | 73 
  1 file changed, 61 insertions(+), 12 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8c21e9a4adc7..9a9031640179 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,6 +33,8 @@
  static int napi_weight = NAPI_POLL_WEIGHT;
  module_param(napi_weight, int, 0444);
  
+static int napi_tx_weight = NAPI_POLL_WEIGHT;

+


Maybe we should use module_param for this? Or in the future, use 
tx-frames-irq for a per-device configuration.


Thanks
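
A minimal sketch of the module_param option, mirroring how napi_weight is
already exposed in the quoted file; the ethtool route would instead wire
tx-frames-irq through .get_coalesce/.set_coalesce:

static int napi_tx_weight = NAPI_POLL_WEIGHT;
module_param(napi_tx_weight, int, 0444);
MODULE_PARM_DESC(napi_tx_weight, "Tx napi poll weight (0 disables tx napi)");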


Re: [PATCH net-next RFC 3/4] vhost: interrupt coalescing support

2017-03-06 Thread Jason Wang



On 2017年03月03日 22:39, Willem de Bruijn wrote:

+void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq);
+static enum hrtimer_restart vhost_coalesce_timer(struct hrtimer *timer)
+{
+   struct vhost_virtqueue *vq =
+   container_of(timer, struct vhost_virtqueue, ctimer);
+
+   if (mutex_trylock(&vq->mutex)) {
+   vq->coalesce_frames = vq->max_coalesce_frames;
+   vhost_signal(vq->dev, vq);
+   mutex_unlock(&vq->mutex);
+   }
+
+   /* TODO: restart if lock failed and not held by handle_tx */
+   return HRTIMER_NORESTART;
+}
+


Then we may lose an interrupt forever if there is no new tx request? I believe
we need e.g. vhost_poll_queue() here.


Thanks
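
A sketch of the fix suggested above: when the vq mutex cannot be taken
(typically because handle_tx holds it), requeue the work instead of silently
dropping the pending interrupt. It reuses the names from the quoted RFC and
is illustrative only:

static enum hrtimer_restart vhost_coalesce_timer(struct hrtimer *timer)
{
	struct vhost_virtqueue *vq =
		container_of(timer, struct vhost_virtqueue, ctimer);

	if (mutex_trylock(&vq->mutex)) {
		vq->coalesce_frames = vq->max_coalesce_frames;
		vhost_signal(vq->dev, vq);
		mutex_unlock(&vq->mutex);
	} else {
		/* Lock is busy: let the worker signal the guest on its
		 * next run instead of losing the interrupt.
		 */
		vhost_poll_queue(&vq->poll);
	}

	return HRTIMER_NORESTART;
}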


Re: [PATCH net-next RFC 4/4] virtio-net: clean tx descriptors from rx napi

2017-03-06 Thread Jason Wang



On 2017年03月03日 22:39, Willem de Bruijn wrote:

From: Willem de Bruijn 

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt. Together VIRTIO_F_EVENT_IDX and
vhost interrupt moderation, this suppresses most explicit tx
completion interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn 
---
  drivers/net/virtio_net.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9a9031640179..21c575127d50 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1031,6 +1031,23 @@ static int virtnet_receive(struct receive_queue *rq, int 
budget)
return received;
  }
  
+static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget);

+
+static void virtnet_poll_cleantx(struct receive_queue *rq)
+{
+   struct virtnet_info *vi = rq->vq->vdev->priv;
+   unsigned int index = vq2rxq(rq->vq);
+   struct send_queue *sq = &vi->sq[index];
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
+
+   __netif_tx_lock(txq, smp_processor_id());
+   free_old_xmit_skbs(sq, sq->napi.weight);
+   __netif_tx_unlock(txq);


Should we check the tx napi weight here? Or is this treated as an
independent optimization?



+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+}
+
  static int virtnet_poll(struct napi_struct *napi, int budget)
  {
struct receive_queue *rq =
@@ -1039,6 +1056,8 @@ static int virtnet_poll(struct napi_struct *napi, int 
budget)
  
  	received = virtnet_receive(rq, budget);
  
+	virtnet_poll_cleantx(rq);

+


Better to do this before virtnet_receive(), considering the refill may
allocate memory for rx buffers.


Btw, if this is proved to be more efficient, in the future we may
consider:


1) use a single interrupt for both rx and tx
2) use a single napi to handle both rx and tx

Thanks


/* Out of packets? */
if (received < budget)
virtqueue_napi_complete(napi, rq->vq, received);
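
For reference, a sketch of the ordering suggested above, with the tx cleaning
done before virtnet_receive() so the refill path can reuse the freed memory;
it is based on the quoted RFC, not merged code:

static int virtnet_poll(struct napi_struct *napi, int budget)
{
	struct receive_queue *rq =
		container_of(napi, struct receive_queue, napi);
	unsigned int received;

	/* Reclaim tx buffers first so the receive/refill path below can
	 * reuse the freed memory.
	 */
	virtnet_poll_cleantx(rq);

	received = virtnet_receive(rq, budget);

	/* Out of packets? */
	if (received < budget)
		virtqueue_napi_complete(napi, rq->vq, received);

	return received;
}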




Re: [PATCH net-next RFC 3/4] vhost: interrupt coalescing support

2017-03-07 Thread Jason Wang



On 2017年03月07日 01:31, Willem de Bruijn wrote:

On Mon, Mar 6, 2017 at 4:28 AM, Jason Wang <jasow...@redhat.com> wrote:


On 2017年03月03日 22:39, Willem de Bruijn wrote:

+void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq);
+static enum hrtimer_restart vhost_coalesce_timer(struct hrtimer *timer)
+{
+   struct vhost_virtqueue *vq =
+   container_of(timer, struct vhost_virtqueue, ctimer);
+
+   if (mutex_trylock(&vq->mutex)) {
+   vq->coalesce_frames = vq->max_coalesce_frames;
+   vhost_signal(vq->dev, vq);
+   mutex_unlock(&vq->mutex);
+   }
+
+   /* TODO: restart if lock failed and not held by handle_tx */
+   return HRTIMER_NORESTART;
+}
+


Then we may lose an interrupt forever if there is no new tx request? I believe
we need e.g. vhost_poll_queue() here.

Absolutely, I need to fix this. The common case for failing to grab
the lock is competition with handle_tx. With careful coding we can
probably avoid scheduling another run with vhost_poll_queue in
the common case.


Yes, probably add some checking after releasing the mutex in
handle_tx().


Thanks



Your patch v7 cancels the pending hrtimer at the start of handle_tx.
I need to reintroduce that, and also only schedule a timer at the end
of handle_tx, not immediately when vq->coalesce_frames becomes
non-zero.





Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Jason Wang



On 2017年03月30日 22:32, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote:


On 2017年03月30日 04:48, Michael S. Tsirkin wrote:

We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin<m...@redhat.com>
---

A quick glance and it looks ok, but what's the benefit of this series? Is it
required by other changes?

Thanks

Yes - to avoid touching all devices when doing the rest of
the patchset.


Maybe I'm not being clear. I mean the benefit of this series, not this single
patch. I guess it may be used by your proposal that avoids the reset when
setting XDP? If yes, do we really want to drop some packets after XDP is set?


Thanks


Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing

2017-03-30 Thread Jason Wang



On 2017年03月30日 21:53, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:

This patch introduces a batched version of consuming: the consumer can
dequeue more than one pointer from the ring at a time. We don't care
about the reordering of reads here, so no compiler barrier is needed.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  include/linux/ptr_ring.h | 65 
  1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..2be0f350 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
return ptr;
  }
  
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,

+void **array, int n)

Can we use a shorter name? ptr_ring_consume_batch?


Ok, but at least we need to keep the prefix since there's a locked version.






+{
+   void *ptr;
+   int i;
+
+   for (i = 0; i < n; i++) {
+   ptr = __ptr_ring_consume(r);
+   if (!ptr)
+   break;
+   array[i] = ptr;
+   }
+
+   return i;
+}
+
  /*
   * Note: resize (below) nests producer lock within consumer lock, so if you
   * call this in interrupt or BH context, you must disable interrupts/BH when

I'd like to add a code comment here explaining why we don't
care about cpu or compiler reordering. And I think the reason is
in the way you use this API: in vhost it does not matter
if you get fewer entries than are present in the ring.
That's ok but needs to be noted
in a code comment so people use this function correctly.


Interesting, but I still think it's not necessary.

If the consumer is doing busy polling, it will eventually get the entries.
If the consumer needs notification from the producer, it should drain the
queue, which means it needs to enable notification before the last try of the
consuming call; otherwise it would be a bug. The batch consuming function in
this patch guarantees returning at least one pointer if there are many,
which looks sufficient for correctness?


Thanks
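
A usage sketch of the rule described above: a consumer that depends on
producer notification re-enables notification before its final drain attempt.
The notification helpers are hypothetical placeholders, not part of the
ptr_ring API:

static void drain_ring(struct ptr_ring *r, void (*process)(void *))
{
	void *batch[16];
	int n, i;

	for (;;) {
		n = ptr_ring_consume_batched(r, batch, ARRAY_SIZE(batch));
		if (!n) {
			/* Hypothetical helpers: re-arm notification, then
			 * try once more to close the race with the producer.
			 */
			enable_producer_notification(r);
			n = ptr_ring_consume_batched(r, batch,
						     ARRAY_SIZE(batch));
			if (!n)
				break;
			disable_producer_notification(r);
		}
		for (i = 0; i < n; i++)
			process(batch[i]);
	}
}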



Also, I think you need to repeat the comment about cpu_relax
near this function: if someone uses it in a loop,
a compiler barrier is needed to prevent compiler from
optimizing it out.

I note that ptr_ring_consume currently lacks any of these
comments so I'm ok with merging as is, and I'll add
documentation on top.
Like this perhaps?

/* Consume up to n entries and return the number of entries consumed
  * or 0 on ring empty.
  * Note: this might return early with less entries than present in the
  * ring.
  * Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax(). Callers must take consumer_lock
  * if the ring is ever resized - see e.g. ptr_ring_consume_batch.
  */




@@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
return ptr;
  }
  
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,

+  void **array, int n)
+{
+   int ret;
+
+   spin_lock(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+  void **array, int n)
+{
+   int ret;
+
+   spin_lock_irq(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irq(&r->consumer_lock);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+  void **array, int n)
+{
+   unsigned long flags;
+   int ret;
+
+   spin_lock_irqsave(&r->consumer_lock, flags);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+   return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+ void **array, int n)
+{
+   int ret;
+
+   spin_lock_bh(&r->consumer_lock);
+   ret = __ptr_ring_consume_batched(r, array, n);
+   spin_unlock_bh(&r->consumer_lock);
+
+   return ret;
+}
+
  /* Cast to structure type and call a function without discarding from FIFO.
   * Function must return a value.
   * Callers must take consumer_lock.
--
2.7.4




Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array

2017-03-30 Thread Jason Wang



On 2017年03月30日 22:21, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote:

We used to dequeue one skb during recvmsg() from the skb_array; this could
be inefficient because of bad cache utilization

which cache does this refer to btw?


Both icache and dcache more or less.




and spinlock
touching for each packet.

Do you mean the effect of extra two atomics here?


In fact four, packet length peeking needs another two.




This patch tries to batch them by calling
batch dequeuing helpers explicitly on the exported skb array and passing
the skb back through msg_control for the underlying socket to finish the
userspace copying.

Tests were done by XDP1:
- small buffer:
   Before: 1.88Mpps
   After : 2.25Mpps (+19.6%)
- mergeable buffer:
   Before: 1.83Mpps
   After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang <jasow...@redhat.com>

Looks like I misread the code previously. More comments below,
sorry about not asking these questions earlier.


---
  drivers/vhost/net.c | 64 +
  1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..ffa78c6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  
  #include 
  
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {

struct vhost_virtqueue *vq;
  };
  
+#define VHOST_RX_BATCH 64

  struct vhost_net_virtqueue {
struct vhost_virtqueue vq;
size_t vhost_hlen;

Could you please try playing with batch size and see
what the effect is?


Ok. I tried 32 which seems slower than 64 but still faster than no batching.
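
A throwaway sketch for that kind of experiment: expose the batch size as a
module parameter (clamped to the VHOST_RX_BATCH array size) so 16/32/64 can
be compared without rebuilding. This is hypothetical and not part of the
posted series:

static int rx_batch = VHOST_RX_BATCH;
module_param(rx_batch, int, 0444);
MODULE_PARM_DESC(rx_batch, "skbs fetched from the skb_array per batch");

	/* in fetch_skbs(), clamp the request to the rxq[] array size */
	rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
					    min(rx_batch, VHOST_RX_BATCH));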




@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
 * Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+   struct skb_array *rx_array;
+   void *rxq[VHOST_RX_BATCH];
+   int rt;
+   int rh;
  };
  
  struct vhost_net {

@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rt = 0;
+   n->vqs[i].rh = 0;
}
  
  }

@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
	mutex_unlock(&vq->mutex);
  }
  
-static int peek_head_len(struct sock *sk)

+static int fetch_skbs(struct vhost_net_virtqueue *rvq)
+{
+   if (rvq->rh != rvq->rt)
+   goto out;
+
+   rvq->rh = rvq->rt = 0;
+   rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
+   VHOST_RX_BATCH);
+   if (!rvq->rt)
+   return 0;
+out:
+   return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
  {
struct socket *sock = sk->sk_socket;
struct sk_buff *head;
int len = 0;
unsigned long flags;
  
+	if (rvq->rx_array)

+   return fetch_skbs(rvq);
+
if (sock->ops->peek_len)
return sock->ops->peek_len(sock);
  
@@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)

	return skb_queue_empty(&sk->sk_receive_queue);
  }
  
-static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)

+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+ struct sock *sk)
  {
+   struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
	struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
-   int len = peek_head_len(sk);
+   int len = peek_head_len(rvq, sk);
  
  	if (!len && vq->busyloop_timeout) {

/* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
			vhost_poll_queue(&vq->poll);
		mutex_unlock(&vq->mutex);
  
-		len = peek_head_len(sk);

+   len = peek_head_len(rvq, sk);
}
  
  	return len;

@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
+   if (nvq->rx_array)
+   msg.msg_control = nvq->rxq[nvq->rh++];
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);

So there's a bit of a mystery here. vhost code isn't
batched, all we are batching is the fetch from

Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-17 Thread Jason Wang



On 2017年04月16日 00:38, Vladislav Yasevich wrote:

Currently the virtio net header is fixed size and adding things to it is rather
difficult to do.  This series attempts to add the infrastructure as well as some
extensions that try to resolve some deficiencies we currently have.

First, the vnet header only has space for 16 flags.  This may not be enough
in the future.  The extensions will provide space for 32 possible extension
flags and 32 possible extensions.  These flags will be carried in the
first pseudo extension header, the presence of which will be determined by
a flag in the virtio net header.

The extensions themselves will immediately follow the extension header itself.
They will be added to the packet in the same order as they appear in the
extension flags.  No padding is placed between the extensions, and any
extensions negotiated but not used by a given packet will turn into
trailing padding.


Do we need an explicit padding (e.g. an extension) which could be
controlled by each side?




For example:
  | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data |


Just some rough thoughts:

- Would it be better to use TLV instead of a bitmap here? One advantage of TLV
is that the number of extensions is not limited by the width of the bitmap.
- For 1.1, do we really want something like the vnet header? AFAIK, nothing
like it is used by modern NICs; would it be better to pack all metadata into
the descriptor itself? This may need some changes in tun/macvtap, but looks
more PCIe friendly.


Thanks



Extensions proposed in this series are:
  - IPv6 fragment id extension
* Currently, the guest-generated fragment id is discarded and the host
  generates an IPv6 fragment id if the packet has to be fragmented.  The
  code attempts to add time-based perturbation to id generation to make
  it harder to guess the next fragment id to be used.  However, doing this
  on the host may result in less perturbation (due to different timing)
  and might make id guessing easier.  Ideally, the ids generated by the
  guest should be used.  One could also argue that we are "violating" the
  IPv6 protocol under a _strict_ interpretation of the spec.

  - VLAN header acceleration
* Currently virtio doesn't do vlan header acceleration and instead
  uses software tagging.  One of the first things that the host will do is
  strip the vlan header out.  When passing the packet to a guest, the
  vlan header is re-inserted into the packet.  We can skip all that work
  if we can pass the vlan data in accelerated format.  Then the host will
  not do any extra work.  However, so far, this yielded a very small
  perf bump (only ~1%).  I am still looking into this.

  - UDP tunnel offload
* Similar to vlan acceleration, with this extension we can pass additional
  data to the host to support GSO with udp tunnel and possibly other
  encapsulations.  This yields a significant performance improvement
  (still testing remote checksum code).

An additional extension that is unfinished (due to still testing for any
side-effects) is checksum passthrough to support drivers that set
CHECKSUM_COMPLETE.  This would eliminate the need for guests to compute
the software checksum.

This series only takes care of virtio net.  I have additional patches for the
host side (vhost and tap/macvtap as well as qemu), but wanted to get feedback
on the general approach first.

Vladislav Yasevich (6):
   virtio-net: Remove the use the padded vnet_header structure
   virtio-net: make header length handling uniform
   virtio_net: Add basic skeleton for handling vnet header extensions.
   virtio-net: Add support for IPv6 fragment id vnet header extension.
   virtio-net: Add support for vlan acceleration vnet header extension.
   virtio-net: Add support for UDP tunnel offload and extension.

  drivers/net/virtio_net.c| 132 +---
  include/linux/skbuff.h  |   5 ++
  include/linux/virtio_net.h  |  91 ++-
  include/uapi/linux/virtio_net.h |  38 
  4 files changed, 242 insertions(+), 24 deletions(-)





Re: [PATCH RFC (resend) net-next 5/6] virtio-net: Add support for vlan acceleration vnet header extension.

2017-04-17 Thread Jason Wang



On 2017年04月16日 00:38, Vladislav Yasevich wrote:

This extension allows us to pass vlan ID and vlan protocol data to the
host hypervisor as part of the vnet header and lets us take advantage
of HW accelerated vlan tagging in the host.  It requires support in the
host to negotiate the feature.  When the extension is enabled, the
virtio device will enable HW accelerated vlan features.

Signed-off-by: Vladislav Yasevich 
---
  drivers/net/virtio_net.c| 17 -
  include/linux/virtio_net.h  | 17 +
  include/uapi/linux/virtio_net.h |  7 +++
  3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 18eb0dd..696ef4a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -182,6 +182,7 @@ struct virtio_net_hdr_max {
struct virtio_net_hdr_mrg_rxbuf hdr;
struct virtio_net_ext_hdr ext_hdr;
struct virtio_net_ext_ip6frag ip6f_ext;
+   struct virtio_net_ext_vlan vlan_ext;
  };
  
  static inline u8 padded_vnet_hdr(struct virtnet_info *vi)

@@ -2276,6 +2277,11 @@ static void virtnet_init_extensions(struct virtio_device 
*vdev)
vi->hdr_len += sizeof(u32);
vi->ext_mask |= VIRTIO_NET_EXT_F_IP6FRAG;
}
+
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD)) {
+   vi->hdr_len += sizeof(struct virtio_net_ext_vlan);
+   vi->ext_mask |= VIRTIO_NET_EXT_F_VLAN;
+   }
  }
  
  #define MIN_MTU ETH_MIN_MTU

@@ -2352,6 +2358,14 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
dev->features |= NETIF_F_RXCSUM;
  
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD)) {

+   dev->features |= NETIF_F_HW_VLAN_CTAG_TX |
+NETIF_F_HW_VLAN_CTAG_RX |
+NETIF_F_HW_VLAN_STAG_TX |
+NETIF_F_HW_VLAN_STAG_RX;
+   }
+
+
dev->vlan_features = dev->features;
  
  	/* MTU range: 68 - 65535 */

@@ -2395,7 +2409,8 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
vi->mergeable_rx_bufs = true;
  
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_IP6_FRAGID))

+   if (virtio_has_feature(vdev, VIRTIO_NET_F_IP6_FRAGID) ||
+   virtio_has_feature(vdev, VIRTIO_NET_F_VLAN_OFFLOAD))
vi->hdr_ext = true;
  
  	if (vi->hdr_ext)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 3b259dc..e790191 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -113,6 +113,14 @@ static inline int virtio_net_ext_to_skb(struct sk_buff 
*skb,
ptr += sizeof(struct virtio_net_ext_ip6frag);
}
  
+	if (ext->flags & VIRTIO_NET_EXT_F_VLAN) {

+   struct virtio_net_ext_vlan *vhdr =
+   (struct virtio_net_ext_vlan *)ptr;
+
+   __vlan_hwaccel_put_tag(skb, vhdr->vlan_proto, vhdr->vlan_tci);
+   ptr += sizeof(struct virtio_net_ext_vlan);
+   }
+
return 0;
  }
  
@@ -130,6 +138,15 @@ static inline int virtio_net_ext_from_skb(const struct sk_buff *skb,

ext->flags |= VIRTIO_NET_EXT_F_IP6FRAG;


Looks like you need to advance ptr here?


}
  
+	if (ext_mask & VIRTIO_NET_EXT_F_VLAN && skb_vlan_tag_present(skb)) {

+   struct virtio_net_ext_vlan *vhdr =
+   (struct virtio_net_ext_vlan *)ptr;
+
+   vlan_get_tag(skb, &vhdr->vlan_tci);
+   vhdr->vlan_proto = skb->vlan_proto;
+   ext->flags |= VIRTIO_NET_EXT_F_VLAN;


And here?

Thanks

+   }
+
return 0;
  }
  #endif /* _LINUX_VIRTIO_NET_H */
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index eac8d94..6125de7 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,7 @@
 * Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
  #define VIRTIO_NET_F_IP6_FRAGID    24  /* Host supports IPv6 fragment id */
+#define VIRTIO_NET_F_VLAN_OFFLOAD 25   /* Host supports VLAN acceleration */
  
  #ifndef VIRTIO_NET_NO_LEGACY

  #define VIRTIO_NET_F_GSO  6   /* Host handles pkts w/ any GSO type */
@@ -111,6 +112,7 @@ struct virtio_net_hdr_v1 {
   */
  struct virtio_net_ext_hdr {
  #define VIRTIO_NET_EXT_F_IP6FRAG  (1<<0)
+#define VIRTIO_NET_EXT_F_VLAN  (1<<1)
__u32 flags;
__u8 extensions[];
  };
@@ -120,6 +122,11 @@ struct virtio_net_ext_ip6frag {
__be32 frag_id;
  };
  
+struct virtio_net_ext_vlan {

+   __be16 vlan_tci;
+   __be16 vlan_proto;
+};
+
  #ifndef VIRTIO_NET_NO_LEGACY
  /* This header comes first in the scatter-gather list.
   * For legacy virtio, if 
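
A sketch of the fix implied by the review comments above: each extension
writer advances ptr past what it wrote so that a later extension does not
overwrite an earlier one. Illustrative only:

	if (ext_mask & VIRTIO_NET_EXT_F_VLAN && skb_vlan_tag_present(skb)) {
		struct virtio_net_ext_vlan *vhdr =
			(struct virtio_net_ext_vlan *)ptr;

		vlan_get_tag(skb, &vhdr->vlan_tci);
		vhdr->vlan_proto = skb->vlan_proto;
		ext->flags |= VIRTIO_NET_EXT_F_VLAN;
		ptr += sizeof(struct virtio_net_ext_vlan);	/* the missing advance */
	}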

Re: [PATCH net-next V2 1/3] tap: use build_skb() for small packet

2017-08-15 Thread Jason Wang



On 2017年08月16日 11:55, Michael S. Tsirkin wrote:

On Tue, Aug 15, 2017 at 08:45:20PM -0700, Eric Dumazet wrote:

On Fri, 2017-08-11 at 19:41 +0800, Jason Wang wrote:

In the past we used tun_alloc_skb(), which calls sock_alloc_send_pskb(), to
allocate the skb. This socket-based method is not suitable for high-speed
userspace like virtualization, which usually:

- ignores sk_sndbuf (INT_MAX) and expects to receive the packet as fast as
  possible
- doesn't want to be blocked at sendmsg()

To eliminate the above overheads, this patch tries to use build_skb()
for small packets. We will do this only when the following conditions
are all met:

- TAP instead of TUN
- sk_sndbuf is INT_MAX
- the caller doesn't want to be blocked
- zerocopy is not used
- the packet size is small enough to use build_skb()

Pktgen from guest to host shows ~11% improvement for rx pps of tap:

Before: ~1.70Mpps
After : ~1.88Mpps

What's more important, this makes it possible to implement XDP for tap
before creating skbs.

Well well well.

You do realize that tun_build_skb() is not thread safe ?

The issue is alloc frag, isn't it?
I guess for now we can limit this to XDP mode only, and
just allocate full pages in that mode.




Limiting this to XDP mode only does not prevent users from sending packets
to the same queue in parallel, I think?


Thanks

