Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-08 Thread Michael S. Tsirkin
On Wed, Apr 07, 2010 at 02:07:18PM -0700, David Stevens wrote:
 kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:
 
  On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:

Thanks!
There's some whitespace damage, are you sending with your new
sendmail setup? It seems to have worked for qemu patches ...
   
   Yes, I saw some line wraps in what I received, but I checked
   the original draft to be sure and they weren't there. Possibly from
   the relay; Sigh.
   
   
 @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 	/* TODO: Check specific error and bomb out unless ENOBUFS? */
 	err = sock->ops->sendmsg(NULL, sock, &msg, len);
 	if (unlikely(err < 0)) {
 -		vhost_discard_vq_desc(vq);
 -		tx_poll_start(net, sock);
 +		if (err == -EAGAIN) {
 +			vhost_discard_desc(vq, 1);
 +			tx_poll_start(net, sock);
 +		} else {
 +			vq_err(vq, "sendmsg: errno %d\n", -err);
 +			/* drop packet; do not discard/resend */
 +			vhost_add_used_and_signal(&net->dev, vq, head,
 +						  0);

vhost does not currently have a consistent error handling strategy:
if we drop packets, we need to think about which other errors should
cause packet drops.  I prefer to just call vq_err for now,
and have us look at handling segfaults etc. in a consistent way
separately.
   
   I had to add this to avoid an infinite loop when I wrote a bad
   packet on the socket. I agree error handling needs a better look,
   but retrying a bad packet continuously while dumping in the log
   is what it was doing when I hit an error before this code. Isn't
   this better, at least until a second look?
   
  
  Hmm, what do you mean 'continuously'? Don't we only try again
  on next kick?
 
 If the packet is corrupt (in my case, a missing vnet header
 during testing), every send will fail and we never make progress.
 I had thousands of error messages in the log (for the same packet)
 before I added this code.

Hmm, we do not want a buggy guest to be able to fill
host logs. This is only if debugging is enabled though, right?
We can also rate-limit the errors.
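The rate-limiting idea can be sketched in user space. This is a hypothetical model only: the kernel would simply use net_ratelimit() or printk_ratelimited() around vq_err, while here a small token budget per time window stands in for that machinery (all names are illustrative).

```c
#include <assert.h>
#include <stdbool.h>

/* Allow at most `burst` messages per `interval` ticks, so a buggy
 * guest cannot flood the host log with repeated vq_err output. */
struct ratelimit {
    long interval;      /* window length in ticks */
    int burst;          /* max messages per window */
    long window_start;  /* tick at which the current window began */
    int printed;        /* messages emitted in the current window */
};

static bool ratelimit_ok(struct ratelimit *rl, long now)
{
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;   /* new window: reset the budget */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;             /* suppress this message */
    rl->printed++;
    return true;
}
```

A caller would wrap each log statement in `if (ratelimit_ok(...))`; repeated failures for the same corrupt packet then produce a bounded number of log lines per window instead of thousands.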

 If the problem is with the packet,
 retrying the same one as the original code does will never recover.
 This isn't required for mergeable rx buffer support, so I
 can certainly remove it from this patch, but I think the original
 error handling doesn't handle a single corrupted packet very
 gracefully.
 

An approach I considered was to have qemu poll vq_err fd
and stop the device when an error is seen. My concern with
dropping a tx packet is that it would make debugging
very difficult.
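The qemu-polls-an-error-fd idea maps naturally onto an eventfd. The sketch below is a hypothetical user-space model, not qemu or vhost code: the device side would signal the fd on a fatal error, and the management side polls it and decides to stop the device.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Return 1 if the error eventfd has been signaled (stop the device),
 * 0 otherwise.  timeout_ms = 0 gives a non-blocking check. */
static int error_seen(int efd, int timeout_ms)
{
    struct pollfd p = { .fd = efd, .events = POLLIN };

    if (poll(&p, 1, timeout_ms) <= 0)
        return 0;                        /* no error reported */

    uint64_t n;
    ssize_t r = read(efd, &n, sizeof(n)); /* drain the counter */
    (void)r;
    return 1;
}
```

Keeping the packet in place and stopping the device this way preserves the failing state for debugging, which is the concern raised about silently dropping tx packets.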


 @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
 -	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 ARRAY_SIZE(vq->iov),
 -					 &out, &in,
 -					 vq_log, &log);
 +	while ((datalen = vhost_head_len(sock->sk))) {
 +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
 +					     vq_log, &log);

This looks like a bug, I think we need to pass
datalen + header size to vhost_get_desc_n.
Not sure how we know the header size that backend will use though.
Maybe just look at our features.
   
   Yes; we have hdr_size, so I can add it here. It'll be 0 for
   the cases where the backend and guest both have vnet header (either
   the regular or larger mergeable buffers one), but should be added
   in for future raw socket support.
  
  So hdr_size is the wrong thing to add then.
  We need to add non-zero value for tap now.
 
 datalen includes the vnet_hdr in the tap case, so we don't need
 a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
 and that length is what we're getting from vhost_head_len().

I only see vhost_head_len returning skb->len. Are you sure skb->len
includes the vnet_hdr for tap rx?

  

/* OK, now we need to know about added descriptors. */
 -		if (head == vq->num) {
 -			if (unlikely(vhost_enable_notify(vq))) {
 +		if (!headcount) {
 +			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
 				vhost_disable_notify(vq);
 +				retries++;
 				continue;
 			}

Hmm. The reason we have the code at all, as the comment says, is because
the guest could have added more buffers between the time we read the last
index and the time we enabled notification. So if we just break like this
the race still exists. We could remember the
last head value we observed, and have vhost_enable_notify check
against this value?

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread Michael S. Tsirkin
On Tue, Apr 06, 2010 at 01:32:53PM -0700, David L Stevens wrote:
 
 This patch adds support for the Mergeable Receive Buffers feature to
 vhost_net.
 
   +-DLS
 
 Changes from previous revision:
 1) renamed:
 	vhost_discard_vq_desc -> vhost_discard_desc
 	vhost_get_heads -> vhost_get_desc_n
 	vhost_get_vq_desc -> vhost_get_desc
 2) added heads as argument to vhost_get_desc_n
 3) changed vq->heads from iovec to vring_used_elem, removed casts
 4) changed vhost_add_used to do multiple elements in a single
 copy_to_user, or two when we wrap the ring.
 5) removed rxmaxheadcount and available-buffer checks in favor of
 running until an allocation failure, but making sure we break the
 loop if we get two in a row, indicating we have at least 1 buffer,
 but not enough for the current receive packet
 6) restore non-vnet header handling
 
 Signed-Off-By: David L Stevens dlstev...@us.ibm.com

Thanks!
There's some whitespace damage, are you sending with your new
sendmail setup? It seems to have worked for qemu patches ...

 diff -ruNp net-next-p0/drivers/vhost/net.c
 net-next-v3/drivers/vhost/net.c
 --- net-next-p0/drivers/vhost/net.c   2010-03-22 12:04:38.0 -0700
 +++ net-next-v3/drivers/vhost/net.c   2010-04-06 12:54:56.0 -0700
 @@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
 	hdr_size = vq->hdr_size;
 
 	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 ARRAY_SIZE(vq->iov),
 -					 &out, &in,
 +		head = vhost_get_desc(&net->dev, vq, vq->iov,
 +				      ARRAY_SIZE(vq->iov), &out, &in,
 				      NULL, NULL);
 	/* Nothing new?  Wait for eventfd to tell us they refilled. */
 	if (head == vq->num) {
 @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 	/* TODO: Check specific error and bomb out unless ENOBUFS? */
 	err = sock->ops->sendmsg(NULL, sock, &msg, len);
 	if (unlikely(err < 0)) {
 -		vhost_discard_vq_desc(vq);
 -		tx_poll_start(net, sock);
 +		if (err == -EAGAIN) {
 +			vhost_discard_desc(vq, 1);
 +			tx_poll_start(net, sock);
 +		} else {
 +			vq_err(vq, "sendmsg: errno %d\n", -err);
 +			/* drop packet; do not discard/resend */
 +			vhost_add_used_and_signal(&net->dev, vq, head,
 +						  0);

vhost does not currently have a consistent error handling strategy:
if we drop packets, we need to think about which other errors should
cause packet drops.  I prefer to just call vq_err for now,
and have us look at handling segfaults etc. in a consistent way
separately.

 + }
   break;
   }
   if (err != len)
 @@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
 	unuse_mm(net->dev.mm);
  }
  
 +static int vhost_head_len(struct sock *sk)
 +{
 +	struct sk_buff *head;
 +	int len = 0;
 +
 +	lock_sock(sk);
 +	head = skb_peek(&sk->sk_receive_queue);
 +	if (head)
 +		len = head->len;
 +	release_sock(sk);
 +	return len;
 +}
 +

I wonder whether it makes sense to check
skb_queue_empty(&sk->sk_receive_queue)
outside the lock, to reduce the cost of this call
on an empty queue (we know that it happens at least once
each time we exit the loop on rx)?
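The suggested optimization can be modeled in user space. This is only a sketch: the kernel version would call skb_queue_empty() outside lock_sock(), whereas here a plain struct and a counter stand in for the socket and its lock (all names are made up for illustration).

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int len; };

struct rxq {
    int locked;          /* stand-in for the socket lock */
    int lock_takes;      /* how often we actually locked (for testing) */
    struct pkt *head;    /* NULL when the queue is empty */
};

/* Peek the length of the head packet; skip locking when empty. */
static int rxq_head_len(struct rxq *q)
{
    int len = 0;

    /* Racy but safe fast path: a stale NULL just means we report 0
     * and try again later, exactly as on a genuinely empty queue. */
    if (q->head == NULL)
        return 0;

    q->locked = 1;       /* lock_sock(sk) in the real code */
    q->lock_takes++;
    if (q->head)         /* re-check under the lock */
        len = q->head->len;
    q->locked = 0;       /* release_sock(sk) */
    return len;
}
```

The point of the fast path is that the empty case, which handle_rx hits at least once per exit from its loop, pays no locking cost at all; the re-check under the lock keeps the non-empty path correct.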

  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_rx(struct vhost_net *net)
  {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 - unsigned head, out, in, log, s;
 + unsigned in, log, s;
   struct vhost_log *vq_log;
   struct msghdr msg = {
   .msg_name = NULL,
 @@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
   .msg_flags = MSG_DONTWAIT,
   };
  
 - struct virtio_net_hdr hdr = {
 - .flags = 0,
 - .gso_type = VIRTIO_NET_HDR_GSO_NONE
 + struct virtio_net_hdr_mrg_rxbuf hdr = {
 + .hdr.flags = 0,
 + .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
   };
  
 + int retries = 0;
   size_t len, total_len = 0;
 - int err;
 + int err, headcount, datalen;
   size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
  
 -	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread Michael S. Tsirkin
Some corrections:

On Wed, Apr 07, 2010 at 01:59:10PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 06, 2010 at 01:32:53PM -0700, David L Stevens wrote:
  
  This patch adds support for the Mergeable Receive Buffers feature to
  vhost_net.
  
  +-DLS
  
  Changes from previous revision:
  1) renamed:
  	vhost_discard_vq_desc -> vhost_discard_desc
  	vhost_get_heads -> vhost_get_desc_n
  	vhost_get_vq_desc -> vhost_get_desc
  2) added heads as argument to vhost_get_desc_n
  3) changed vq->heads from iovec to vring_used_elem, removed casts
  4) changed vhost_add_used to do multiple elements in a single
  copy_to_user, or two when we wrap the ring.
  5) removed rxmaxheadcount and available-buffer checks in favor of
  running until an allocation failure, but making sure we break the
  loop if we get two in a row, indicating we have at least 1 buffer,
  but not enough for the current receive packet
  6) restore non-vnet header handling
  
  Signed-Off-By: David L Stevens dlstev...@us.ibm.com
 
 Thanks!
 There's some whitespace damage, are you sending with your new
 sendmail setup? It seems to have worked for qemu patches ...
 
  diff -ruNp net-next-p0/drivers/vhost/net.c
  net-next-v3/drivers/vhost/net.c
  --- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700
  +++ net-next-v3/drivers/vhost/net.c 2010-04-06 12:54:56.0 -0700
  @@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
  	hdr_size = vq->hdr_size;
  
  	for (;;) {
  -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -					 ARRAY_SIZE(vq->iov),
  -					 &out, &in,
  +		head = vhost_get_desc(&net->dev, vq, vq->iov,
  +				      ARRAY_SIZE(vq->iov), &out, &in,
  				      NULL, NULL);
  	/* Nothing new?  Wait for eventfd to tell us they refilled. */
  	if (head == vq->num) {
  @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
  	/* TODO: Check specific error and bomb out unless ENOBUFS? */
  	err = sock->ops->sendmsg(NULL, sock, &msg, len);
  	if (unlikely(err < 0)) {
  -		vhost_discard_vq_desc(vq);
  -		tx_poll_start(net, sock);
  +		if (err == -EAGAIN) {
  +			vhost_discard_desc(vq, 1);
  +			tx_poll_start(net, sock);
  +		} else {
  +			vq_err(vq, "sendmsg: errno %d\n", -err);
  +			/* drop packet; do not discard/resend */
  +			vhost_add_used_and_signal(&net->dev, vq, head,
  +						  0);
 
 vhost does not currently have a consistent error handling strategy:
 if we drop packets, we need to think about which other errors should
 cause packet drops.  I prefer to just call vq_err for now,
 and have us look at handling segfaults etc. in a consistent way
 separately.
 
  +   }
  break;
  }
  if (err != len)
  @@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
  	unuse_mm(net->dev.mm);
   }
   
  +static int vhost_head_len(struct sock *sk)
  +{
  +	struct sk_buff *head;
  +	int len = 0;
  +
  +	lock_sock(sk);
  +	head = skb_peek(&sk->sk_receive_queue);
  +	if (head)
  +		len = head->len;
  +	release_sock(sk);
  +	return len;
  +}
  +
 
 I wonder whether it makes sense to check
 skb_queue_empty(&sk->sk_receive_queue)
 outside the lock, to reduce the cost of this call
 on an empty queue (we know that it happens at least once
 each time we exit the loop on rx)?
 
   /* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
   static void handle_rx(struct vhost_net *net)
   {
  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
  -   unsigned head, out, in, log, s;
  +   unsigned in, log, s;
  struct vhost_log *vq_log;
  struct msghdr msg = {
  .msg_name = NULL,
  @@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
  .msg_flags = MSG_DONTWAIT,
  };
   
  -   struct virtio_net_hdr hdr = {
  -   .flags = 0,
  -   .gso_type = VIRTIO_NET_HDR_GSO_NONE
  +   struct virtio_net_hdr_mrg_rxbuf hdr = {
  +   .hdr.flags = 0,
  +   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
  };
   
  +   int retries = 0;
  size_t len, total_len = 0;
  -   int err;
  +   int err, headcount, datalen;
  size_t hdr_size;
  	struct socket *sock = rcu_dereference(vq->private_data);
  	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
  		vq->log : NULL;
   
  -	for (;;) {
  -		head = 

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
 
 Thanks!
 There's some whitespace damage, are you sending with your new
 sendmail setup? It seems to have worked for qemu patches ...

Yes, I saw some line wraps in what I received, but I checked
the original draft to be sure and they weren't there. Possibly from
the relay; Sigh.


  @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 	/* TODO: Check specific error and bomb out unless ENOBUFS? */
 	err = sock->ops->sendmsg(NULL, sock, &msg, len);
 	if (unlikely(err < 0)) {
  -		vhost_discard_vq_desc(vq);
  -		tx_poll_start(net, sock);
  +		if (err == -EAGAIN) {
  +			vhost_discard_desc(vq, 1);
  +			tx_poll_start(net, sock);
  +		} else {
  +			vq_err(vq, "sendmsg: errno %d\n", -err);
  +			/* drop packet; do not discard/resend */
  +			vhost_add_used_and_signal(&net->dev, vq, head,
  +						  0);
 
 vhost does not currently have a consistent error handling strategy:
 if we drop packets, we need to think about which other errors should
 cause packet drops.  I prefer to just call vq_err for now,
 and have us look at handling segfaults etc. in a consistent way
 separately.

I had to add this to avoid an infinite loop when I wrote a bad
packet on the socket. I agree error handling needs a better look,
but retrying a bad packet continuously while dumping in the log
is what it was doing when I hit an error before this code. Isn't
this better, at least until a second look?


  +}
  +
 
 I wonder whether it makes sense to check
 skb_queue_empty(&sk->sk_receive_queue)
 outside the lock, to reduce the cost of this call
 on an empty queue (we know that it happens at least once
 each time we exit the loop on rx)?

I was looking at alternatives to adding the lock in the
first place, but I found I couldn't measure a difference in the
cost with and without the lock.


  +   int retries = 0;
  size_t len, total_len = 0;
  -   int err;
  +   int err, headcount, datalen;
  size_t hdr_size;
  	struct socket *sock = rcu_dereference(vq->private_data);
  	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
  
  -	for (;;) {
  -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -					 ARRAY_SIZE(vq->iov),
  -					 &out, &in,
  -					 vq_log, &log);
  +	while ((datalen = vhost_head_len(sock->sk))) {
  +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
  +					     vq_log, &log);
 
 This looks like a bug, I think we need to pass
 datalen + header size to vhost_get_desc_n.
 Not sure how we know the header size that backend will use though.
 Maybe just look at our features.

Yes; we have hdr_size, so I can add it here. It'll be 0 for
the cases where the backend and guest both have vnet header (either
the regular or larger mergeable buffers one), but should be added
in for future raw socket support.

 
 /* OK, now we need to know about added descriptors. */
  -		if (head == vq->num) {
  -			if (unlikely(vhost_enable_notify(vq))) {
  +		if (!headcount) {
  +			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
  				/* They have slipped one in as we were
  				 * doing that: check again. */
  				vhost_disable_notify(vq);
  +				retries++;
  				continue;
  			}
 
 Hmm. The reason we have the code at all, as the comment says, is because
 guest could have added more buffers between the time we read last index
 and the time we enabled notification. So if we just break like this
 the race still exists. We could remember the
 last head value we observed, and have vhost_enable_notify check
 against this value?

This is to prevent a spin loop in the case where we have some
buffers available, but not enough for the current packet (i.e., this
is the replacement code for the rxmaxheadcount business). If they
actually added something new, retrying once should see it, but
vhost_enable_notify() returns non-zero not on new buffers, but
on a non-empty ring. In the case mentioned, we aren't empty, so
vhost_enable_notify() returns nonzero every time, but the guest hasn't
given us enough buffers to proceed, so we continuously retry; this
code breaks the spin loop until we've really got new buffers from
the guest.
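The behavior being described can be sketched as a toy model. This is not the vhost code: each step just supplies the headcount that vhost_get_desc_n would return (0 = not enough buffers) and whether the avail ring is non-empty (what vhost_enable_notify's return models), and the function counts loop iterations.

```c
#include <assert.h>
#include <stdbool.h>

/* With the retries guard, a ring that stays non-empty but is too
 * small for the packet is retried exactly once, then we stop and
 * wait for the guest's next kick instead of spinning. */
static int rx_iterations(const int headcount[], const bool nonempty[], int n)
{
    int retries = 0, iters = 0;

    for (int i = 0; i < n; i++) {
        iters++;
        if (headcount[i] == 0) {
            if (retries == 0 && nonempty[i]) {
                retries++;      /* one re-check after enabling notify */
                continue;
            }
            break;              /* give up until the next kick */
        }
        retries = 0;            /* got buffers: made progress */
    }
    return iters;
}
```

Without the retries counter the first scenario below would iterate forever, since the non-empty ring keeps reporting "more work" while never providing enough descriptors.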

 
 Need to think about it.
 
 Another concern here is that on retries vhost_get_desc_n
 is doing extra work, rescanning the same descriptor
 again and again. Not sure how common this is, might be
 worthwhile to add a TODO to consider this at least.

I had a printk in there to test the code and with the
retries counter, it happens when we fill the ring (once,
because of the retries checks), and then proceeds as
desired when the guest gives us more buffers. Without the
check, it spews until we 

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:

 On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:
   
   Thanks!
   There's some whitespace damage, are you sending with your new
   sendmail setup? It seems to have worked for qemu patches ...
  
  Yes, I saw some line wraps in what I received, but I checked
  the original draft to be sure and they weren't there. Possibly from
  the relay; Sigh.
  
  
 @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 	/* TODO: Check specific error and bomb out unless ENOBUFS? */
 	err = sock->ops->sendmsg(NULL, sock, &msg, len);
 	if (unlikely(err < 0)) {
 -		vhost_discard_vq_desc(vq);
 -		tx_poll_start(net, sock);
 +		if (err == -EAGAIN) {
 +			vhost_discard_desc(vq, 1);
 +			tx_poll_start(net, sock);
 +		} else {
 +			vq_err(vq, "sendmsg: errno %d\n", -err);
 +			/* drop packet; do not discard/resend */
 +			vhost_add_used_and_signal(&net->dev, vq, head,
 +						  0);
   
   vhost does not currently have a consistent error handling strategy:
   if we drop packets, we need to think about which other errors should
   cause packet drops.  I prefer to just call vq_err for now,
   and have us look at handling segfaults etc. in a consistent way
   separately.
  
  I had to add this to avoid an infinite loop when I wrote a bad
  packet on the socket. I agree error handling needs a better look,
  but retrying a bad packet continuously while dumping in the log
  is what it was doing when I hit an error before this code. Isn't
  this better, at least until a second look?
  
 
 Hmm, what do you mean 'continuously'? Don't we only try again
 on next kick?

If the packet is corrupt (in my case, a missing vnet header
during testing), every send will fail and we never make progress.
I had thousands of error messages in the log (for the same packet)
before I added this code. If the problem is with the packet,
retrying the same one as the original code does will never recover.
This isn't required for mergeable rx buffer support, so I
can certainly remove it from this patch, but I think the original
error handling doesn't handle a single corrupted packet very
gracefully.


 @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
 -	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 ARRAY_SIZE(vq->iov),
 -					 &out, &in,
 -					 vq_log, &log);
 +	while ((datalen = vhost_head_len(sock->sk))) {
 +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
 +					     vq_log, &log);
   
   This looks like a bug, I think we need to pass
   datalen + header size to vhost_get_desc_n.
   Not sure how we know the header size that backend will use though.
   Maybe just look at our features.
  
  Yes; we have hdr_size, so I can add it here. It'll be 0 for
  the cases where the backend and guest both have vnet header (either
  the regular or larger mergeable buffers one), but should be added
  in for future raw socket support.
 
 So hdr_size is the wrong thing to add then.
 We need to add non-zero value for tap now.

datalen includes the vnet_hdr in the tap case, so we don't need
a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
and that length is what we're getting from vhost_head_len().

 
   
   /* OK, now we need to know about added descriptors. */
 -		if (head == vq->num) {
 -			if (unlikely(vhost_enable_notify(vq))) {
 +		if (!headcount) {
 +			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
 				vhost_disable_notify(vq);
 +				retries++;
 				continue;
 			}
   
   Hmm. The reason we have the code at all, as the comment says, is because
   the guest could have added more buffers between the time we read the last
   index and the time we enabled notification. So if we just break like this
   the race still exists. We could remember the
   last head value we observed, and have vhost_enable_notify check
   against this value?
  
  This is to prevent a spin loop in the case where we have some
  buffers available, but not enough for the current packet (i.e., this
  is the replacement code for the rxmaxheadcount business). If they
  actually added something new, retrying once should see it, but
  vhost_enable_notify() returns non-zero not on new buffers, but
  on a non-empty ring. In the case mentioned, we aren't empty, so
  vhost_enable_notify() returns nonzero every time, but the guest hasn't
  given us enough buffers to proceed, so we continuously retry; this
  code breaks the spin loop until we've really got new buffers from
  the guest.

[PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-06 Thread David L Stevens

This patch adds support for the Mergeable Receive Buffers feature to
vhost_net.

+-DLS

Changes from previous revision:
1) renamed:
	vhost_discard_vq_desc -> vhost_discard_desc
	vhost_get_heads -> vhost_get_desc_n
	vhost_get_vq_desc -> vhost_get_desc
2) added heads as argument to vhost_get_desc_n
3) changed vq->heads from iovec to vring_used_elem, removed casts
4) changed vhost_add_used to do multiple elements in a single
copy_to_user, or two when we wrap the ring.
5) removed rxmaxheadcount and available-buffer checks in favor of
running until an allocation failure, but making sure we break the
loop if we get two in a row, indicating we have at least 1 buffer,
but not enough for the current receive packet
6) restore non-vnet header handling

Signed-Off-By: David L Stevens dlstev...@us.ibm.com

diff -ruNp net-next-p0/drivers/vhost/net.c
net-next-v3/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700
+++ net-next-v3/drivers/vhost/net.c 2010-04-06 12:54:56.0 -0700
@@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
	hdr_size = vq->hdr_size;
 
	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
+		head = vhost_get_desc(&net->dev, vq, vq->iov,
+				      ARRAY_SIZE(vq->iov), &out, &in,
				      NULL, NULL);
	/* Nothing new?  Wait for eventfd to tell us they refilled. */
	if (head == vq->num) {
@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
	/* TODO: Check specific error and bomb out unless ENOBUFS? */
	err = sock->ops->sendmsg(NULL, sock, &msg, len);
	if (unlikely(err < 0)) {
-		vhost_discard_vq_desc(vq);
-		tx_poll_start(net, sock);
+		if (err == -EAGAIN) {
+			vhost_discard_desc(vq, 1);
+			tx_poll_start(net, sock);
+		} else {
+			vq_err(vq, "sendmsg: errno %d\n", -err);
+			/* drop packet; do not discard/resend */
+			vhost_add_used_and_signal(&net->dev, vq, head,
+						  0);
+		}
break;
}
if (err != len)
@@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
	unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct sock *sk)
+{
+	struct sk_buff *head;
+	int len = 0;
+
+	lock_sock(sk);
+	head = skb_peek(&sk->sk_receive_queue);
+	if (head)
+		len = head->len;
+	release_sock(sk);
+	return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   struct virtio_net_hdr_mrg_rxbuf hdr = {
+   .hdr.flags = 0,
+   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
};
 
+   int retries = 0;
size_t len, total_len = 0;
-   int err;
+   int err, headcount, datalen;
size_t hdr_size;
	struct socket *sock = rcu_dereference(vq->private_data);
	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
		vq->log : NULL;
 
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+					     vq_log, &log);
	/* OK, now we need to know about added descriptors. */
-	if (head == vq->num) {
-		if (unlikely(vhost_enable_notify(vq))) {
+	if (!headcount) {
+		if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
/* They have