Re: [PATCH][VHOST] fix race with guest on multi-buffer used buffer updates

2010-05-24 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2010 03:17:10 AM:

 On Thu, May 20, 2010 at 09:58:06AM -0700, David L Stevens wrote:
  [for Michael Tsirkin's vhost development git tree]
  
  This patch fixes a race between guest and host when
  adding used buffers wraps the ring. Without it, guests
  can see partial packets before num_buffers is set in
  the vnet header.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Could you please explain what the race is?

Sure. The pre-patch code in the ring-wrap case
does this:

add part1 bufs
update used index
add part2 bufs
update used index

After we update the used index for part1, the part1
buffers are available to the guest. If the guest is
consuming at that point, it can process the partial
packet before the rest of the packet is there. In that
case, num_buffers will be greater than the number of
buffers available to the guest and it'll drop the
packet with a framing error. I was seeing 2 or 3 framing
errors every 100 million packets or so pre-patch, none
post-patch.
Actually, the second sentence is incorrect in the
original description-- num_buffers is up to date when
the guest sees it, but the used index is not.
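[Illustrative aside: the ordering being argued here can be modeled outside the kernel. This is not the vhost code itself; the ring size, names, and two-part copy below are made up for illustration. The point is that both parts of a wrapping update are written first, and the index the guest polls is published exactly once at the end, so the guest sees all of a packet's buffers or none:]

```c
#include <string.h>

#define RING 8

struct used_ring {
    unsigned entries[RING];     /* used descriptor heads */
    unsigned published_idx;     /* free-running index the guest reads */
};

/* Number of entries the guest may consume right now. */
static unsigned guest_visible(const struct used_ring *r, unsigned guest_idx)
{
    return r->published_idx - guest_idx;
}

/* Add 'count' heads starting at 'last_used', wrapping if needed.
 * Mirrors the patched flow: both chunks are written before the single
 * index store (where the real code issues smp_wmb() and then one
 * put_user of used->idx). */
static unsigned add_used_wrap(struct used_ring *r, unsigned last_used,
                              const unsigned *heads, unsigned count)
{
    unsigned start = last_used % RING;
    unsigned part1 = count;
    if (start + count > RING)
        part1 = RING - start;              /* first chunk, up to ring end */
    memcpy(&r->entries[start], heads, part1 * sizeof *heads);
    memcpy(&r->entries[0], heads + part1,  /* second chunk, from slot 0 */
           (count - part1) * sizeof *heads);
    /* single publish: no window where only part1 is visible */
    r->published_idx = last_used + count;
    return r->published_idx;
}
```

[With the pre-patch flow, the index would be stored after each memcpy, opening the window described above.]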

+-DLS


 
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 7f2568d..74790ab 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -1065,14 +1065,6 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
  		vq_err(vq, "Failed to write used");
  		return -EFAULT;
  	}
  -	/* Make sure buffer is written before we update index. */
  -	smp_wmb();
  -	if (put_user(vq->last_used_idx + count, &vq->used->idx)) {
  -		vq_err(vq, "Failed to increment used idx");
  -		return -EFAULT;
  -	}
  -	if (unlikely(vq->log_used))
  -		vhost_log_used(vq, used);
  	vq->last_used_idx += count;
  	return 0;
   }
  @@ -1093,7 +1085,17 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
  		heads += n;
  		count -= n;
  	}
  -	return __vhost_add_used_n(vq, heads, count);
  +	r = __vhost_add_used_n(vq, heads, count);
  +
  +	/* Make sure buffer is written before we update index. */
  +	smp_wmb();
  +	if (put_user(vq->last_used_idx, &vq->used->idx)) {
  +		vq_err(vq, "Failed to increment used idx");
  +		return -EFAULT;
  +	}
  +	if (unlikely(vq->log_used))
  +		vhost_log_used(vq, vq->used->ring + start);
  +	return r;
   }
  
   /* This actually signals the guest, using eventfd. */
  

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][VHOST] fix race with guest on multi-buffer used buffer updates

2010-05-24 Thread David Stevens
netdev-ow...@vger.kernel.org wrote on 05/24/2010 09:13:51 AM:

 On Mon, May 24, 2010 at 08:52:40AM -0700, David Stevens wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 05/24/2010 03:17:10 AM:
  
   On Thu, May 20, 2010 at 09:58:06AM -0700, David L Stevens wrote:
[for Michael Tsirkin's vhost development git tree]

This patch fixes a race between guest and host when
adding used buffers wraps the ring. Without it, guests
can see partial packets before num_buffers is set in
the vnet header.

Signed-off-by: David L Stevens dlstev...@us.ibm.com
   
   Could you please explain what the race is?
  
  Sure. The pre-patch code in the ring-wrap case
  does this:
  
  add part1 bufs
  update used index
  add part2 bufs
  update used index
  
  After we update the used index for part1, the part1
  buffers are available to the guest. If the guest is
  consuming at that point, it can process the partial
  packet before the rest of the packet is there. In that
  case, num_buffers will be greater than the number of
  buffers available to the guest and it'll drop the
  packet with a framing error. I was seeing 2 or 3 framing
  errors every 100 million packets or so pre-patch, none
  post-patch.
  Actually, the second sentence is incorrect in the
  original description-- num_buffers is up to date when
  the guest sees it, but the used index is not.
  
  +-DLS
 
 so this happens always - what does wrap-around refer to?

The 2-part update only happens when a packet spans
the end/beginning of the vring (the wrap). The framing error only
happens if the guest sees the vring-wrapping packets
before the second used-index write (the race).
So, the framing error doesn't happen always--it's
pretty rare. But with the patch, it never happens.

+-DLS



Re: [PATCH][VHOST] fix race with guest on multi-buffer used buffer updates

2010-05-24 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2010 09:42:05 AM:
 
 I see. The logging is still buggy though, I think.

Possibly; migration isn't working for me under load even
without mergeable buffers (investigating), so I haven't
yet been able to test wrap w/ logging, but did you see
something specific that's wrong?

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens
Since datalen carries the difference and will be negative by that amount
from the original loop, what about just adding something like:

        }
        if (headcount)
                heads[headcount-1].len += datalen;

[and really, headcount > 0 since datalen < 0, so just:

        heads[headcount-1].len += datalen;]
+-DLS
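[As a sanity check, the adjustment can be modeled in isolation. Names here are hypothetical stand-ins: `lens` plays the role of heads[].len and `datalen` is the leftover from the gathering loop, which is negative by the overshoot amount:]

```c
/* Model of the fix: the descriptor-gathering loop overshoots the packet
 * size, leaving datalen negative by the overshoot; trimming the last
 * head's len by that amount makes the recorded lengths sum to the
 * actual packet length. */
static int trim_last_head(unsigned *lens, int headcount, int datalen)
{
    if (headcount)                        /* headcount > 0 whenever datalen < 0 */
        lens[headcount - 1] += datalen;   /* datalen <= 0 here */
    return headcount;
}

static unsigned total_len(const unsigned *lens, int headcount)
{
    unsigned sum = 0;
    for (int i = 0; i < headcount; i++)
        sum += lens[i];
    return sum;
}
```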


kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
  	use_mm(net->dev.mm);
  	mutex_lock(&vq->mutex);
  	vhost_disable_notify(vq);
  -	hdr_size = vq->hdr_size;
  +	vhost_hlen = vq->vhost_hlen;
  
  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
  		vq->log : NULL;
  
  -	for (;;) {
  -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -					 ARRAY_SIZE(vq->iov),
  -					 &out, &in,
  -					 vq_log, &log);
  +	while ((datalen = vhost_head_len(vq, sock->sk))) {
  +		headcount = vhost_get_desc_n(vq, vq->heads,
  +					     datalen + vhost_hlen,
  +					     &in, vq_log, &log);
  +		if (headcount < 0)
  +			break;
  		/* OK, now we need to know about added descriptors. */
  -		if (head == vq->num) {
  +		if (!headcount) {
  			if (unlikely(vhost_enable_notify(vq))) {
  				/* They have slipped one in as we were
  				 * doing that: check again. */
  @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
  			break;
  		}
  		/* We don't need to be notified again. */
  -		if (out) {
  -			vq_err(vq, "Unexpected descriptor format for RX: "
  -			       "out %d, int %d\n",
  -			       out, in);
  -			break;
  -		}
  -		/* Skip header. TODO: support TSO/mergeable rx buffers. */
  -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
  +		if (vhost_hlen)
  +			/* Skip header. TODO: support TSO. */
  +			s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
  +		else
  +			s = copy_iovec_hdr(vq->iov, vq->hdr, vq->sock_hlen, in);
  		msg.msg_iovlen = in;
  		len = iov_length(vq->iov, in);
  		/* Sanity check */
  		if (!len) {
  			vq_err(vq, "Unexpected header len for RX: "
  			       "%zd expected %zd\n",
  -			       iov_length(vq->hdr, s), hdr_size);
  +			       iov_length(vq->hdr, s), vhost_hlen);
  			break;
  		}
  		err = sock->ops->recvmsg(NULL, sock, &msg,
  					 len, MSG_DONTWAIT | MSG_TRUNC);
  		/* TODO: Check specific error and bomb out unless EAGAIN? */
  		if (err < 0) {
  -			vhost_discard_vq_desc(vq);
  +			vhost_discard_desc(vq, headcount);
  			break;
  		}
  -		/* TODO: Should check and handle checksum. */
  -		if (err > len) {
  -			pr_err("Discarded truncated rx packet: "
  -			       " len %d > %zd\n", err, len);
  -			vhost_discard_vq_desc(vq);
  +		if (err != datalen) {
  +			pr_err("Discarded rx packet: "
  +			       " len %d, expected %zd\n", err, datalen);
  +			vhost_discard_desc(vq, headcount);
  			continue;
  		}
  		len = err;
  -		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
  -		if (err) {
  -			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
  -			       vq->iov->iov_base, err);
  +		if (vhost_hlen &&
  +		    memcpy_toiovecend(vq->hdr, (unsigned char *)&hdr, 0,
  +				      vhost_hlen)) {
  +			vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
  +			       vq->iov->iov_base);
  			break;
  		}
  -		len += hdr_size;
  -		vhost_add_used_and_signal(&net->dev, vq, head, len);
  +		/* TODO: Should check and handle checksum. */
  +		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
  +		    memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
  +				      offsetof(typeof(hdr), num_buffers),
  +				      sizeof(hdr.num_buffers))) {
  +			vq_err(vq, "Failed num_buffers write");
  +			vhost_discard_desc(vq, headcount);
  +			break;
  +		}
  +		len += vhost_hlen;
  +		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
  +					    headcount);
  		if (unlikely(vq_log))
  			vhost_log_write(vq, vq_log, log, len);
  		total_len += len;
 
 OK I think I see the bug here: vhost_add_used_and_signal_n
 does not get the actual length, it gets the iovec length from vhost.
 Guest virtio uses this as packet length, with bad results.
 
 So I have applied the follows and it seems to have fixed the problem:
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index c16db02..9d7496d 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
  @@ -219,7 +219,7 @@ static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
   /* This is a multi-buffer 

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens
netdev-ow...@vger.kernel.org wrote on 05/10/2010 10:25:57 AM:

 On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote:
  Since datalen carries the difference and will be negative by that amount
  from the original loop, what about just adding something like:
  
          }
          if (headcount)
                  heads[headcount-1].len += datalen;
  
  [and really, headcount > 0 since datalen < 0, so just:
  
          heads[headcount-1].len += datalen;]
  
  +-DLS
 
 This works too, just does more checks and comparisons.
 I am still surprised that you were unable to reproduce the problem.
 

I'm sure it happened, and probably had a performance
penalty on my systems too, but not as much as yours.
I didn't see any obvious performance difference running
with the patch, though; not sure why. I'll instrument to
see how often it's happening, I think.
But fixed now, good catch!

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  This patch adds mergeable receive buffer support to vhost_net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 I've been doing some more testing before sending out a pull
 request, and I see a drastic performance degradation in guest-to-host
 traffic when this is applied but mergeable buffers are not in use
 by userspace (existing qemu-kvm userspace).

Actually, I wouldn't expect it to work at all; the qemu-kvm
patch (particularly the feature bit setting bug fix) is required.
Without it, I think the existing code will tell the guest to use
mergeable buffers while turning it off in vhost. That was completely
non-functional for me -- what version of qemu-kvm are you using?
What I did to test w/o mergeable buffers is turn off the
bit in VHOST_FEATURES. I'll recheck these, but qemu-kvm definitely
must be updated; the original doesn't correctly handle feature bits.

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 08:56:14 AM:

 On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:
  
   On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com
   
   I've been doing some more testing before sending out a pull
   request, and I see a drastic performance degradation in guest-to-host
   traffic when this is applied but mergeable buffers are not in use
   by userspace (existing qemu-kvm userspace).
  
  Actually, I wouldn't expect it to work at all;
  the qemu-kvm
  patch (particularly the feature bit setting bug fix) is required.
 
 Which bugfix is that?

Actually, I see you put that upstream already--
commit dc14a397812b91dd0d48b03d1b8f66a251542369  in
Avi's tree is the one I was talking about.
I'll look further.

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-04-30 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/29/2010 06:45:15 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  This patch adds mergeable receive buffer support to vhost_net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 I have applied this, thanks very much!
 I have also applied some tweaks on top,
 please take a look.
 
 Thanks,
 MSt
 

Looks fine to me.

Acked-by: David L Stevens dlstev...@us.ibm.com

 commit 2809e94f5f26d89dc5232aaec753ffda95c4d95e
 Author: Michael S. Tsirkin m...@redhat.com
 Date:   Thu Apr 29 16:18:08 2010 +0300
 
 vhost-net: minor tweaks in mergeable buffer code
 
 Applies the following tweaks on top of mergeable buffers patch:
  1. vhost_get_desc_n assumes that all descriptors are 'in' only.
     It's also unlikely to be useful for any vhost frontend
     besides vhost_net, so just move it to net.c, and rename it to
     get_rx_bufs for brevity.
 
 2. for rx, we called iov_length within vhost_get_desc_n
(now get_rx_bufs) already, so we don't
need an extra call to iov_length to avoid overflow anymore.
Accordingly, copy_iovec_hdr can return void now.
 
  3. for rx, do some further code tweaks:
     do not assign len = err as we check that err == len
     handle data length in a way similar to how we handle
     header length: datalen -> sock_len, len -> vhost_len.
     add sock_hlen as a local variable, for symmetry with vhost_hlen.
 
 4. add some likely/unlikely annotations
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index d61d945..85519b4 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
  @@ -74,9 +74,9 @@ static int move_iovec_hdr(struct iovec *from, struct iovec *to,
  	}
  	return seg;
   }
  -/* Copy iovec entries for len bytes from iovec. Return segments used. */
  -static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  -			  size_t len, int iovcount)
  +/* Copy iovec entries for len bytes from iovec. */
  +static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  +			   size_t len, int iovcount)
   {
  	int seg = 0;
  	size_t size;
  @@ -89,7 +89,6 @@ static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  		++to;
  		++seg;
  	}
  -	return seg;
   }
 
  /* Caller must have TX VQ lock */
 @@ -204,7 +203,7 @@ static void handle_tx(struct vhost_net *net)
 unuse_mm(net-dev.mm);
  }
 
  -static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
  +static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
   {
  	struct sk_buff *head;
  	int len = 0;
  @@ -212,17 +211,70 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
  	lock_sock(sk);
  	head = skb_peek(&sk->sk_receive_queue);
  	if (head)
  -		len = head->len + vq->sock_hlen;
  +		len = head->len;
  	release_sock(sk);
  	return len;
   }
 
  +/* This is a multi-buffer version of vhost_get_desc, that works if
  + *	vq has read descriptors only.
  + * @vq		- the relevant virtqueue
  + * @datalen	- data length we'll be reading
  + * @iovcount	- returned count of io vectors we fill
  + * @log	- vhost log
  + * @log_num	- log offset
  + *	returns number of buffer heads allocated, negative on error
  + */
  +static int get_rx_bufs(struct vhost_virtqueue *vq,
  +		       struct vring_used_elem *heads,
  +		       int datalen,
  +		       unsigned *iovcount,
  +		       struct vhost_log *log,
  +		       unsigned *log_num)
  +{
  +	unsigned int out, in;
  +	int seg = 0;
  +	int headcount = 0;
  +	unsigned d;
  +	int r;
  +
  +	while (datalen > 0) {
  +		if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
  +			r = -ENOBUFS;
  +			goto err;
  +		}
  +		d = vhost_get_desc(vq->dev, vq, vq->iov + seg,
  +				   ARRAY_SIZE(vq->iov) - seg, &out,
  +				   &in, log, log_num);
  +		if (d == vq->num) {
  +			r = 0;
  +			goto err;
  +		}
  +		if (unlikely(out || in <= 0)) {
  +			vq_err(vq, "unexpected descriptor format for RX: "
  +			       "out %d, in %d\n", out, in);
  +			r = -EINVAL;
  +			goto err;
  +		}
  +		heads[headcount].id = d;
  +		heads[headcount].len = iov_length(vq->iov + seg, in);
  +		datalen -= heads[headcount].len;
  +		++headcount;
  +		seg += in;
  +	}
  +	*iovcount = seg;
  +	return headcount;
  +err:
  +	vhost_discard_desc(vq, headcount);
  +	return r;
  +}
  +
   /* Expects to be always run from workqueue - which acts as
    * read-size critical section for our kind of RCU. */
   static void handle_rx(struct vhost_net *net)
   {
  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
  -	unsigned in, log, s;
  +	unsigned uninitialized_var(in), log;
  	struct vhost_log *vq_log;
  	struct msghdr msg = {
  		.msg_name = NULL,
 @@ -238,9 +290,10 @@ static void handle_rx(struct vhost_net *net)
   

Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:

 On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
  This patch adds the mergeable RX buffers feature to vhost.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Looks pretty clean to me.
 Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must
still be getting whitespace mangling if it didn't for you. (sigh)
Here it is as an attachment:



 We should also check performance implications.
 Do you have any data?

I'm getting on the order of 10-20% improvement in
throughput over stock vhost guest to host, but I do see
a lot of variability in the results, even with no KVM
and just over loopback. I don't know where that's coming
from, but I'll do some runs and post.

Thanks for all the reviews!

+-DLS



MRXBv4.patch
Description: Binary data


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 10:43:49 AM:

 On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
  This patch adds the mergeable RX buffers feature to vhost.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 BTW, which userspace should I use for testing this?

I use the qemu-kvm patch to your git tree that I posted
a while back (also attached). And it needs your TUNSETVNETHDRSIZE
ioctl in the kernel as well. Also, I added the TUN ioctl
defines to /usr/include for this to pick up (it compiles, but
won't do the ioctl's without that); will get back
to that patch per your comments next.

[also correction: I said 10-20% guest to host; I meant host-to-guest
(ie, the receiver side). g2h appears improved too, but not as
much.]

+-DLS


qemu-MRXB-2.patch
Description: Binary data


Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
 
 Thanks!
 There's some whitespace damage, are you sending with your new
 sendmail setup? It seems to have worked for qemu patches ...

Yes, I saw some line wraps in what I received, but I checked
the original draft to be sure and they weren't there. Possibly from
the relay; Sigh.


  @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
  		if (unlikely(err < 0)) {
  -			vhost_discard_vq_desc(vq);
  -			tx_poll_start(net, sock);
  +			if (err == -EAGAIN) {
  +				vhost_discard_desc(vq, 1);
  +				tx_poll_start(net, sock);
  +			} else {
  +				vq_err(vq, "sendmsg: errno %d\n", -err);
  +				/* drop packet; do not discard/resend */
  +				vhost_add_used_and_signal(&net->dev, vq, head,
  +							  0);
 
 vhost does not currently have a consistent error handling strategy:
 if we drop packets, need to think which other errors should cause
 packet drops.  I prefer to just call vq_err for now,
 and have us look at handling segfaults etc in a consistent way
 separately.

I had to add this to avoid an infinite loop when I wrote a bad
packet on the socket. I agree error handling needs a better look,
but retrying a bad packet continuously while dumping in the log
is what it was doing when I hit an error before this code. Isn't
this better, at least until a second look?
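[To illustrate the two policies being debated, here is a toy transmit loop, not the vhost code: `pkt_ok` stands in for whether sendmsg succeeds on a given packet, and the drop policy consumes the bad buffer (the vhost_add_used_and_signal(..., head, 0) above) instead of resending it. With retry-forever semantics, the same loop would never terminate on a permanently bad packet:]

```c
/* Toy model: drain a queue of packets where some packet always fails.
 * Drop policy: a permanent error consumes the buffer (reported with
 * length 0), so the loop makes progress past the bad packet. */
static int send_queue_with_drop(const int *pkt_ok, int npkts, int *sent,
                                int *dropped)
{
    int iterations = 0;
    *sent = *dropped = 0;
    for (int i = 0; i < npkts; ) {
        iterations++;
        if (pkt_ok[i]) {
            (*sent)++;       /* success: buffer consumed normally */
            i++;
        } else {
            (*dropped)++;    /* permanent error: drop, don't resend */
            i++;
        }
    }
    return iterations;       /* bounded, unlike retrying the bad packet */
}
```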


  +}
  +
 
 I wonder whether it makes sense to check
  skb_queue_empty(&sk->sk_receive_queue)
 outside the lock, to reduce the cost of this call
 on an empty queue (we know that it happens at least once
 each time we exit the loop on rx)?

I was looking at alternatives to adding the lock in the
first place, but I found I couldn't measure a difference in the
cost with and without the lock.


  +	int retries = 0;
  	size_t len, total_len = 0;
  -	int err;
  +	int err, headcount, datalen;
  	size_t hdr_size;
  	struct socket *sock = rcu_dereference(vq->private_data);
  	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
  		vq->log : NULL;
  
  -	for (;;) {
  -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -					 ARRAY_SIZE(vq->iov),
  -					 &out, &in,
  -					 vq_log, &log);
  +	while ((datalen = vhost_head_len(sock->sk))) {
  +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
  +					     vq_log, &log);
 
 This looks like a bug, I think we need to pass
 datalen + header size to vhost_get_desc_n.
 Not sure how we know the header size that backend will use though.
 Maybe just look at our features.

Yes; we have hdr_size, so I can add it here. It'll be 0 for
the cases where the backend and guest both have vnet header (either
the regular or larger mergeable buffers one), but should be added
in for future raw socket support.

 
  	/* OK, now we need to know about added descriptors. */
  -	if (head == vq->num) {
  -		if (unlikely(vhost_enable_notify(vq))) {
  +	if (!headcount) {
  +		if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
  			/* They have slipped one in as we were
  			 * doing that: check again. */
  			vhost_disable_notify(vq);
  +			retries++;
  			continue;
  		}
 
 Hmm. The reason we have the code at all, as the comment says, is because
 guest could have added more buffers between the time we read last index
 and the time we enabled notification. So if we just break like this
 the race still exists. We could remember the
 last head value we observed, and have vhost_enable_notify check
 against this value?

This is to prevent a spin loop in the case where we have some
buffers available, but not enough for the current packet (ie, this
is the replacement code for the rxmaxheadcount business). If they
actually added something new, retrying once should see it, but what
vhost_enable_notify() returns non-zero on is not new buffers but
rather not empty. In the case mentioned, we aren't empty, so
vhost_enable_notify() returns nonzero every time, but the guest hasn't
given us enough buffers to proceed, so we continuously retry; this
code breaks the spin loop until we've really got new buffers from
the guest.
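[A stripped-down model of that retry logic may make the argument easier to follow. The helper names are hypothetical: `ring_nonempty` plays the role of vhost_enable_notify()'s "ring not empty" return, and `rx_pass` models one invocation of the handler. Without the retries cap the inner loop spins forever whenever the guest has some but not enough buffers; with the cap it rechecks once and then breaks to wait for a real guest kick:]

```c
/* Model: the guest has bufs_avail buffers queued, the current packet
 * needs bufs_needed. Returns the headcount used (0 = gave up and will
 * wait for notification); loop_count counts loop iterations. */
static int rx_pass(int bufs_avail, int bufs_needed, int *loop_count)
{
    int retries = 0;
    for (;;) {
        (*loop_count)++;
        int headcount = (bufs_avail >= bufs_needed) ? bufs_needed : 0;
        if (!headcount) {
            /* "not empty" is true on every pass here, which is what
             * made the uncapped version spin */
            int ring_nonempty = bufs_avail > 0;
            if (retries == 0 && ring_nonempty) {
                retries++;       /* recheck exactly once */
                continue;
            }
            return 0;            /* break out; wait for a kick */
        }
        return headcount;        /* enough buffers: proceed */
    }
}
```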

 
 Need to think about it.
 
 Another concern here is that on retries vhost_get_desc_n
 is doing extra work, rescanning the same descriptor
 again and again. Not sure how common this is, might be
 worthwhile to add a TODO to consider this at least.

I had a printk in there to test the code and with the
retries counter, it happens when we fill the ring (once,
because of the retries checks), and then proceeds as
desired when the guest gives us more buffers. Without the
check, it spews until we 

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:

 On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:
   
   Thanks!
   There's some whitespace damage, are you sending with your new
   sendmail setup? It seems to have worked for qemu patches ...
  
  Yes, I saw some line wraps in what I received, but I checked
  the original draft to be sure and they weren't there. Possibly from
  the relay; Sigh.
  
  
    @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
    		/* TODO: Check specific error and bomb out unless ENOBUFS? */
    		err = sock->ops->sendmsg(NULL, sock, &msg, len);
    		if (unlikely(err < 0)) {
    -			vhost_discard_vq_desc(vq);
    -			tx_poll_start(net, sock);
    +			if (err == -EAGAIN) {
    +				vhost_discard_desc(vq, 1);
    +				tx_poll_start(net, sock);
    +			} else {
    +				vq_err(vq, "sendmsg: errno %d\n", -err);
    +				/* drop packet; do not discard/resend */
    +				vhost_add_used_and_signal(&net->dev, vq, head,
    +							  0);
   
   vhost does not currently have a consistent error handling strategy:
   if we drop packets, need to think which other errors should cause
   packet drops.  I prefer to just call vq_err for now,
   and have us look at handling segfaults etc in a consistent way
   separately.
  
  I had to add this to avoid an infinite loop when I wrote a bad
  packet on the socket. I agree error handling needs a better look,
  but retrying a bad packet continuously while dumping in the log
  is what it was doing when I hit an error before this code. Isn't
  this better, at least until a second look?
  
 
 Hmm, what do you mean 'continuously'? Don't we only try again
 on next kick?

If the packet is corrupt (in my case, a missing vnet header
during testing), every send will fail and we never make progress.
I had thousands of error messages in the log (for the same packet)
before I added this code. If the problem is with the packet,
retrying the same one as the original code does will never recover.
This isn't required for mergeable rx buffer support, so I
can certainly remove it from this patch, but I think the original
error handling doesn't handle a single corrupted packet very
gracefully.


   @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
   	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
   		vq->log : NULL;
   
   -	for (;;) {
   -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
   -					 ARRAY_SIZE(vq->iov),
   -					 &out, &in,
   -					 vq_log, &log);
   +	while ((datalen = vhost_head_len(sock->sk))) {
   +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
   +					     vq_log, &log);
   
   This looks like a bug, I think we need to pass
   datalen + header size to vhost_get_desc_n.
   Not sure how we know the header size that backend will use though.
   Maybe just look at our features.
  
  Yes; we have hdr_size, so I can add it here. It'll be 0 for
  the cases where the backend and guest both have vnet header (either
  the regular or larger mergeable buffers one), but should be added
  in for future raw socket support.
 
 So hdr_size is the wrong thing to add then.
 We need to add non-zero value for tap now.

datalen includes the vnet_hdr in the tap case, so we don't need
a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
and that length is what we're getting from vhost_head_len().
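[The length accounting in that exchange can be written out explicitly. The numbers below are illustrative: a tap socket queues vnet_hdr + payload, so sock_hlen is already inside what vhost_head_len() reports, and vhost only adds its own header length, which is zero when guest and backend header layouts match:]

```c
/* How many descriptor bytes must be gathered for one packet:
 * the skb length already includes the backend vnet header (tap case),
 * and vhost prepends vhost_hlen more, zero when the headers match. */
static int bytes_to_gather(int payload, int sock_hlen, int vhost_hlen)
{
    int datalen = payload + sock_hlen;   /* what vhost_head_len() reports */
    return datalen + vhost_hlen;
}
```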

 
   
   	/* OK, now we need to know about added descriptors. */
   -	if (head == vq->num) {
   -		if (unlikely(vhost_enable_notify(vq))) {
   +	if (!headcount) {
   +		if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
   			/* They have slipped one in as we were
   			 * doing that: check again. */
   			vhost_disable_notify(vq);
   +			retries++;
   			continue;
   		}
   
   Hmm. The reason we have the code at all, as the comment says, is 
because
   guest could have added more buffers between the time we read last 
index
   and the time we enabled notification. So if we just break like this
   the race still exists. We could remember the
   last head value we observed, and have vhost_enable_notify check
   against this value?
  
  This is to prevent a spin loop in the case where we have some
  buffers available, but not enough for the current packet (ie, this
  is the replacement code for the rxmaxheadcount business). If they
  actually added something new, retrying once should see it, but what
  vhost_enable_notify() returns non-zero on is not new buffers but
  rather not empty. In the case mentioned, we aren't empty, so
  vhost_enable_notify() returns nonzero every time, but the guest hasn't
  given us enough buffers to proceed, so we continuously retry; this
  code breaks

Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-04-01 Thread David Stevens
kvm-ow...@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:

 On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:

  
  +			head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
  +				vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
   
   I find this casting confusing.
   Is it really expensive to add an array of heads so that
   we do not need to cast?
  
  It needs the heads and the lengths, which looks a lot
  like an iovec. I was trying to resist adding a new
  struct XXX { unsigned head; unsigned len; } just for this,
  but I could make these parallel arrays, one with head index and
  the other with length.

Michael, on this one, if I add vq-heads as an argument to
vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
Would you rather this 1) remain an iovec (and a single arg added) but
cast still there, 2) 2 arrays (head and length) and 2 args added, or
3) a new struct type of {unsigned,int} to carry for the heads+len
instead of iovec?
My preference would be 1). I agree the casts are ugly, but
it is essentially an iovec the way we use it; it's just that the
base isn't a pointer but a descriptor index instead.
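[The options under discussion can be compared concretely. This is a sketch only; the field and type names below are made up for illustration (the pair type in option 3 is roughly what vring_used_elem already provides with its id and len fields):]

```c
#include <stddef.h>

/* Option 1: reuse an iovec-shaped struct, storing the descriptor
 * index in iov_base via a cast -- compact, but the cast obscures
 * that the "pointer" is really an index. */
struct iovec_like { void *iov_base; size_t iov_len; };

static void opt1_store(struct iovec_like *h, unsigned head, size_t len)
{
    h->iov_base = (void *)(size_t)head;    /* descriptor index, not a pointer */
    h->iov_len = len;
}

static unsigned opt1_head(const struct iovec_like *h)
{
    return (unsigned)(size_t)h->iov_base;  /* cast back on the way out */
}

/* Option 3: a dedicated pair type -- same information, no casts. */
struct head_len { unsigned head; unsigned len; };

static void opt3_store(struct head_len *h, unsigned head, unsigned len)
{
    h->head = head;
    h->len = len;
}
```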

  
  EAGAIN is not possible after the change, because we don't
  even enter the loop unless we have an skb on the read queue; the
  other cases bomb out, so I figured the comment for future work is
  now done. :-)
 
 Guest could be buggy so we'll get EFAULT.
 If skb is taken off the rx queue (as below), we might get EAGAIN.

We break on any error. If we get EAGAIN because someone read
on the socket, this code would break the loop, but EAGAIN is a more
serious problem if it changed since we peeked (because it means
someone else is reading the socket).
But I don't understand -- are you suggesting that the error
handling be different than that, or that the comment is still
relevant? My intention here is to do the TODO from the comment
so that it can be removed, by handling all error cases. I think
because of the peek, EAGAIN isn't something to be ignored anymore,
but the effect is the same whether we break out of the loop or
not, since we retry the packet next time around. Essentially, we
ignore every error since we will redo it with the same packet the
next time around. Maybe we should print something here, but since
we'll be retrying the packet that's still on the socket, a permanent
error would spew continuously. Maybe we should shut down entirely
if we get any negative return value here (including EAGAIN, since
that tells us someone messed with the socket when we don't want them
to).
If you want the comment still there, ok, but I do think EAGAIN
isn't a special case per the comment anymore, and is handled as all
other errors are: by exiting the loop and retrying next time.

+-DLS



Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-03-31 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM:
 
 attached patch seems to be whitespace damaged as well.
 Does the original pass checkpatch.pl for you?

Yes, but I probably broke it in the transfer -- will be more
careful with the next revision.



  +   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
  +           vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
 
 I find this casting confusing.
 Is it really expensive to add an array of heads so that
 we do not need to cast?

It needs the heads and the lengths, which looks a lot
like an iovec. I was trying to resist adding a new
struct XXX { unsigned head; unsigned len; } just for this,
but I could make these parallel arrays, one with head index and
the other with length.

  +   if (vq->rxmaxheadcount < headcount)
  +           vq->rxmaxheadcount = headcount;
 
 This seems the only place where we set the rxmaxheadcount
 value. Maybe it can be moved out of vhost.c to net.c?
 If vhost library needs this it can get it as function
 parameter.

I can move it to vhost_get_heads(), sure.
 
  +   /* Skip header. TODO: support TSO. */
 
 You seem to have removed the code that skips the header.
 Won't this break non-GSO backends such as raw?

From our prior discussion, what I had in mind was that
we'd remove all special-header processing by using the ioctl
you added for TUN and later, an equivalent ioctl for raw (when
supported in qemu-kvm). Qemu would arrange headers with tap
(and later raw) to get what the guest expects, and vhost then
just passes all data as-is between the socket and the ring.
That not only removes the different-header-len code
for mergeable buffers, but also eliminates making a copy of the
header for every packet as before. Vhost has no need to do
anything with the header, or even know its length. It also
doesn't have to care what the type of the backend is-- raw or
tap.
I think that simplifies everything, but it does mean that
raw socket support requires a header ioctl for the different
vnethdr sizes if the guest wants a vnethdr with and without
mergeable buffers. Actually, I thought this was your idea and
the point of the TUNSETVNETHDRSZ ioctl, but I guess you had
something else in mind?

  -   /* TODO: Check specific error and bomb out unless EAGAIN? */
 
 Do you think it's not a good idea?

EAGAIN is not possible after the change, because we don't
even enter the loop unless we have an skb on the read queue; the
other cases bomb out, so I figured the comment for future work is
now done. :-)

 
  if (err < 0) {
  -   vhost_discard_vq_desc(vq);
  +   vhost_discard_vq_desc(vq, headcount);
  break;
  }
 
 I think we should detect and discard truncated messages,
 since len might not be reliable if userspace pulls
 a packet from under us. Also, if new packet is
 shorter than the old one, there's no truncation but
 headcount is wrong.
 
 So simplest fix IMO would be to compare err with expected len.
 If there's a difference, we hit the race and so
 we would discard the packet.

Sure.

 
 
  /* TODO: Should check and handle checksum. */
  +   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
  +           struct virtio_net_hdr_mrg_rxbuf *vhdr =
  +                   (struct virtio_net_hdr_mrg_rxbuf *)
  +                   vq->iov[0].iov_base;
  +           /* add num_bufs */
  +           if (put_user(headcount, &vhdr->num_buffers)) {
  +                   vq_err(vq, "Failed to write num_buffers");
  +                   vhost_discard_vq_desc(vq, headcount);
 
 Let's do memcpy_to_iovecend etc so that we do not assume layout.
 This is also why we need move_iovec: sendmsg might modify the iovec.
 It would also be nice not to corrupt memory and
 get a reasonable error if buffer size
 that we get is smaller than expected header size.

I wanted to avoid the header copy here, with the reasonable 
restriction
that the guest gives you a buffer at least big enough for the vnet_hdr. A
length check alone (on iov[0].iov_len) could enforce that without copying
headers back and forth to support silly cases like 8-byte buffers or
num_buffers spanning multiple iovecs, and I think paying the price of the
copy on every packet to support guests that broken isn't worth it.
So, my preference here would be to add:

if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
        struct virtio_net_hdr_mrg_rxbuf *vhdr...

        if (vq->iov[0].iov_len < sizeof(*vhdr)) {
                vq_err(vq, "tiny buffers (len %zd < %zd) are not supported",
                       vq->iov[0].iov_len, sizeof(*vhdr));
                break;
        }

[PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-03-30 Thread David Stevens
This patch adds support for the Mergeable Receive Buffers feature to
vhost_net.

Changes:
1) generalize descriptor allocation functions to allow multiple
descriptors per packet
2) add socket peek to know datalen at buffer allocation time
3) change notification to enable a multi-buffer max packet, rather
than the single-buffer run until completely empty

Changes from previous revision:
1) incorporate review comments from Michael Tsirkin
2) assume use of TUNSETVNETHDRSZ ioctl by qemu, which simplifies vnet 
header
processing
3) fixed notification code to only affect the receive side

Signed-Off-By: David L Stevens dlstev...@us.ibm.com

[in-line for review, attached for applying w/o whitespace mangling]

diff -ruNp net-next-p0/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700
+++ net-next-p3/drivers/vhost/net.c 2010-03-30 12:50:57.0 -0700
@@ -54,26 +54,6 @@ struct vhost_net {
enum vhost_net_poll_state tx_poll_state;
 };
 
-/* Pop first len bytes from iovec. Return number of segments used. */
-static int move_iovec_hdr(struct iovec *from, struct iovec *to,
- size_t len, int iov_count)
-{
-   int seg = 0;
-   size_t size;
-   while (len && seg < iov_count) {
-           size = min(from->iov_len, len);
-           to->iov_base = from->iov_base;
-           to->iov_len = size;
-           from->iov_len -= size;
-           from->iov_base += size;
-   len -= size;
-   ++from;
-   ++to;
-   ++seg;
-   }
-   return seg;
-}
-
 /* Caller must have TX VQ lock */
 static void tx_poll_stop(struct vhost_net *net)
 {
@@ -97,7 +77,8 @@ static void tx_poll_start(struct vhost_n
 static void handle_tx(struct vhost_net *net)
 {
    struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-   unsigned head, out, in, s;
+   unsigned out, in;
+   struct iovec head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -108,8 +89,8 @@ static void handle_tx(struct vhost_net *
};
size_t len, total_len = 0;
int err, wmem;
-   size_t hdr_size;
    struct socket *sock = rcu_dereference(vq->private_data);
+
if (!sock)
return;
 
@@ -127,22 +108,19 @@ static void handle_tx(struct vhost_net *
 
    if (wmem < sock->sk->sk_sndbuf / 2)
            tx_poll_stop(net);
-   hdr_size = vq->hdr_size;
 
for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-                ARRAY_SIZE(vq->iov),
-                &out, &in,
-                NULL, NULL);
+   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
+           vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
    /* Nothing new?  Wait for eventfd to tell us they refilled. */
-   if (head == vq->num) {
+   if (head.iov_base == (void *)vq->num) {
            wmem = atomic_read(&sock->sk->sk_wmem_alloc);
            if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
                    tx_poll_start(net, sock);
                    set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
                    break;
            }
break;
}
-   if (unlikely(vhost_enable_notify(vq))) {
+   if (unlikely(vhost_enable_notify(vq, 0))) {
vhost_disable_notify(vq);
continue;
}
@@ -154,27 +132,30 @@ static void handle_tx(struct vhost_net *
break;
}
/* Skip header. TODO: support TSO. */
-   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
    msg.msg_iovlen = out;
-   len = iov_length(vq->iov, out);
+   head.iov_len = len = iov_length(vq->iov, out);
+
/* Sanity check */
if (!len) {
-   vq_err(vq, "Unexpected header len for TX: "
-          "%zd expected %zd\n",
-          iov_length(vq->hdr, s), hdr_size);
+   vq_err(vq, "Unexpected buffer len for TX: %zd\n", len);
    break;
    }
-   /* TODO: Check specific error and bomb out unless ENOBUFS? */
    err = sock->ops->sendmsg(NULL, sock, &msg, len);
    if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq);
-   tx_poll_start(net, sock);
+   if (err == -EAGAIN) {
+   vhost_discard_vq_desc(vq, 1);
+   tx_poll_start(net, sock);
+   } else {
+  

Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-10 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 11:45:33 PM:

+static int skb_head_len(struct sk_buff_head *skq)
+{
+   struct sk_buff *head;
+
+   head = skb_peek(skq);
+   if (head)
+   return head->len;
+   return 0;
+}
+
   
   This is done without locking, which I think can crash
   if skb is consumed after we peek it but before we read the
   length.
  
  This thread is the only legitimate consumer, right? But
  qemu has the file descriptor and I guess we shouldn't trust
  that it won't give it to someone else; it'd break vhost, but
  a crash would be inappropriate, of course. I'd like to avoid
  the lock, but I have another idea here, so will investigate.

I looked at this some more and actually, I'm not sure I
see a crash here.
First, without qemu, or something it calls, being broken
as root, nothing else should ever read from the socket, in which
case the length will be exactly right for the next packet we
read. No problem.
But if by some error this skb is freed, we'll have valid
memory address that isn't the length field of the next packet
we'll read.
If the length is negative or more than available in the
vring, we'll fail the buffer allocation, exit the loop, and
get the new head length of the receive queue the next time
around -- no problem.
If the length field is 0, we'll exit the loop even
though we have data to read, but will get that packet the
next time we get in here, again, with the right length.
No problem.
If the length field is big enough to allocate buffer
space for it, but smaller than the new head we have to read,
the recvmsg will fail with EMSGSIZE, drop the packet, exit
the loop and be back in business with the next packet. No
problem.
Otherwise, the packet will fit and be delivered.

I don't much like the notion of using skb-head when
it's garbage, but that can only happen if qemu has broken,
and I don't see a crash unless the skb is not only freed
but no longer a valid memory address for reading at all,
and all within the race window.
Since the code handles other failure cases (not
enough ring buffers or packet not fitting in the allocated
buffers), the actual length value only matters in the
sense that it prevents us from using buffers unnecessarily--
something that isn't all that relevant if it's hosed enough
to have unauthorized readers on the socket.

Is this case worth the performance penalty we'll no
doubt pay for either locking the socket or always allocating
for a max-sized packet? I'll experiment with a couple
solutions here, but unless I've missed something, we might
be better off just leaving it as-is.

+-DLS




Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 07:31:30 AM:

 On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote:
  This patch generalizes buffer handling functions to
  NULL, NULL);
  +   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
  +           vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
 
 Should type for head be changed so that we do not need to cast?
 
 I also prefer aligning descendants on the opening (.

Yes, that's probably better; the indentation with the cast would
still wrap a lot, but I'll see what I can do here.


  err = sock->ops->sendmsg(NULL, sock, &msg, len);
  if (unlikely(err < 0)) {
  -   vhost_discard_vq_desc(vq);
  +   vhost_discard(vq, 1);
 
 Isn't the original name a bit more descriptive?

During development, I had both and I generally like
shorter names, but I can change it back.

  +static int skb_head_len(struct sk_buff_head *skq)
  +{
  +   struct sk_buff *head;
  +
  +   head = skb_peek(skq);
  +   if (head)
  +   return head->len;
  +   return 0;
  +}
  +
 
 This is done without locking, which I think can crash
 if skb is consumed after we peek it but before we read the
 length.

This thread is the only legitimate consumer, right? But
qemu has the file descriptor and I guess we shouldn't trust
that it won't give it to someone else; it'd break vhost, but
a crash would be inappropriate, of course. I'd like to avoid
the lock, but I have another idea here, so will investigate.

 
 
   /* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
   static void handle_rx(struct vhost_net *net)
   {
  struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
  -   unsigned head, out, in, log, s;
  +   unsigned in, log, s;
  struct vhost_log *vq_log;
  struct msghdr msg = {
  .msg_name = NULL,
  @@ -204,10 +213,11 @@
  };
  
  size_t len, total_len = 0;
  -   int err;
  +   int err, headcount, datalen;
  size_t hdr_size;
  struct socket *sock = rcu_dereference(vq->private_data);
  -   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  +
  +   if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
  return;
  
 
 Isn't this equivalent? Do you expect zero len skbs in socket?
 If yes, maybe we should discard these, not stop processing ...

A zero return means no skb's. They are equivalent; I
changed this call just to make it identical to the loop check,
but I don't have a strong attachment to this.


  vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
          vq->log : NULL;
  
  -   for (;;) {
  -   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -                ARRAY_SIZE(vq->iov),
  -                &out, &in,
  -                vq_log, &log);
  +   while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) {
 
 This peeks in the queue to figure out how much data we have.
 It's a neat trick, but it does introduce new failure modes
 where an skb is consumed after we did the peek:
 - new skb could be shorter, we should return the unconsumed part
 - new skb could be longer, this will drop a packet.
   maybe this last is not an issue as the race should be rare in practice

As before, this loop is the only legitimate consumer of skb's; if
anyone else is reading the socket, it's broken. But since the file
descriptor is available to qemu, it's probably trusting qemu too much.
I don't believe there is a race at all on a working system; the head
can't change until we read it after this test, but I'll see if I can
come up with something better here. Closing the fd for qemu so vhost
has exclusive access might do it, but otherwise we could just get a
max-sized packet worth of buffers and then return them after the read.
That'd force us to wait with small packets when the ring is nearly
full, where we don't have to now, but I expect locking to read the head
length will hurt performance in all cases. Will try these ideas out.

 
  +   headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
  /* OK, now we need to know about added descriptors. */
  -   if (head == vq->num) {
  +   if (!headcount) {
  if (unlikely(vhost_enable_notify(vq))) {
  /* They have slipped one in as we were
   * doing that: check again. */
  @@ -235,13 +242,6 @@
   * they refilled. */
  break;
  }
  -   /* We don't need to be notified again. */
 
 you find this comment unhelpful

Re: [RFC][ PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:12:29 AM:

 On Tue, Mar 02, 2010 at 05:20:26PM -0700, David Stevens wrote:
  This patch adds vnet_hdr processing for mergeable buffer
  support to vhost-net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
  
  diff -ruN net-next-p1/drivers/vhost/net.c 
net-next-p2/drivers/vhost/net.c
 
 Could you please add -p to diff flags so that it's easier
 to figure out which function is changes?

Sure.


  @@ -148,25 +146,45 @@
             "out %d, int %d\n", out, in);
  break;
  }
  +   if (vq->guest_hlen > vq->sock_hlen) {
  +           if (msg.msg_iov[0].iov_len == vq->guest_hlen)
  +                   msg.msg_iov[0].iov_len = vq->sock_hlen;
  +           else if (out == ARRAY_SIZE(vq->iov))
  +                   vq_err(vq, "handle_tx iov overflow!");
  +           else {
  +                   int i;
  +
  +                   /* give header its own iov */
  +                   for (i = out; i > 0; --i)
  +                           msg.msg_iov[i+1] = msg.msg_iov[i];
  +                   msg.msg_iov[0].iov_len = vq->sock_hlen;
  +                   msg.msg_iov[1].iov_base += vq->guest_hlen;
  +                   msg.msg_iov[1].iov_len -= vq->guest_hlen;
  +                   out++;
 
 We seem to spend a fair bit of code here and elsewhere trying to cover
 the diff between header size in guest and host.  In hindsight, it was
 not a good idea to put new padding between data and the header:
 virtio-net should have added it *before*. But it is what it is.
 
 Wouldn't it be easier to just add an ioctl to tap so that
 vnet header size is made bigger by 4 bytes?

I'd be ok with that, but if we support raw sockets, we have
some of the issues (easier, since it's 0). num_buffers is only
16 bits, so just add 2 bytes, but yeah-- pushing it to tap would
be easier for vhost.


  /* TODO: Check specific error and bomb out unless ENOBUFS? */
  err = sock->ops->sendmsg(NULL, sock, &msg, len);
  if (unlikely(err < 0)) {
  -   vhost_discard(vq, 1);
  -   tx_poll_start(net, sock);
  +   if (err == -EAGAIN) {
 
 The comment mentions ENOBUFS. Are you sure it's EAGAIN?

The comment's from the original -- the code changes should do
the TODO, so probably should remove the comment.
 
  +   tx_poll_start(net, sock);
 
 Don't we need to call discard to move the last avail header back?

Yes, I think you're right. We might consider dropping the packet
in the EAGAIN case here instead, though. It's not clear retrying the
same packet with a delay here is better than getting new ones when the
socket buffer is full (esp. with queues on both sides already). Maybe
we should fold EAGAIN into the other error cases? Otherwise, you're
right, it looks like it should have a discard here.

 
  +   } else {
  +           vq_err(vq, "sendmsg: errno %d\n", -err);
  +           /* drop packet; do not discard/resend */
  +           vhost_add_used_and_signal(&net->dev, vq, &head, 1);
  +   }
  break;
  -   }
  -   if (err != len)
  +   } else if (err != len)
 
 
 Previous if ends with break so no need for else here.

Yes.


  @@ -232,25 +243,18 @@
  headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
  /* OK, now we need to know about added descriptors. */
  if (!headcount) {
  -   if (unlikely(vhost_enable_notify(vq))) {
  -   /* They have slipped one in as we were
  -* doing that: check again. */
  -   vhost_disable_notify(vq);
  -   continue;
  -   }
  -   /* Nothing new?  Wait for eventfd to tell us
  -* they refilled. */
  +   vhost_enable_notify(vq);
 
 
 You don't think this race can happen?

The notify case has to allow for multiple heads, not just the one,
so if he added just one head, it doesn't mean we're ok to continue-- the
packet may require multiple. Instead, the notifier now checks for enough
to handle a max-sized packet (whether or not NONOTIFY is set). I'll think
about this some more, but I concluded it wasn't needed at the time. :-)


  @@ -519,6 +524,20 @@
  
  vhost_net_disable_vq(n, vq);
  rcu_assign_pointer(vq->private_data, sock);
  +
  +   if (sock && sock->sk) {
  +           if (!vhost_sock_is_raw(sock) ||
 
 I dislike this backend-specific code, it ties us

Re: [RFC][ PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:26:33 AM:

 On Tue, Mar 02, 2010 at 05:20:34PM -0700, David Stevens wrote:
  This patch glues them all together and makes sure we
  notify whenever we don't have enough buffers to receive
  a max-sized packet, and adds the feature bit.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Maybe split this up?

I can. I was looking mostly at size (and this is the smallest
of the bunch). But the feature requires all of them together, of course.
This last one is just everything left over from the other two.

  @@ -110,6 +90,7 @@
  size_t len, total_len = 0;
  int err, wmem;
   struct socket *sock = rcu_dereference(vq->private_data);
  +
 
 I tend not to add empty lines if line below it is already short.

This leaves no blank line between the declarations and the start
of code. It's habit for me-- not sure whether the kernel coding standards
address that or not, but I don't think I've seen it done differently anywhere else.

 
  if (!sock)
  return;
  
  @@ -166,11 +147,11 @@
  /* Skip header. TODO: support TSO. */
  msg.msg_iovlen = out;
  head.iov_len = len = iov_length(vq-iov, out);
  +
 
 I tend not to add empty lines if line below it is a comment.

I added this to separate the logical skip header block from
the next, unrelated piece. Not important to me, though.

 
  /* Sanity check */
  if (!len) {
  vq_err(vq, "Unexpected header len for TX: "
  -          "%zd expected %zd\n",
  -          len, vq->guest_hlen);
  +          "%zd expected %zd\n", len, vq->guest_hlen);
  break;
  }
  /* TODO: Check specific error and bomb out unless ENOBUFS? */


  /* TODO: Should check and handle checksum. */
  +   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
  +           struct virtio_net_hdr_mrg_rxbuf *vhdr =
  +                   (struct virtio_net_hdr_mrg_rxbuf *)
  +                   vq->iov[0].iov_base;
  +           /* add num_bufs */
  +           vq->iov[0].iov_len = vq->guest_hlen;
  +           vhdr->num_buffers = headcount;
 
 I don't understand this. iov_base is a userspace pointer, isn't it.
 How can you assign values to it like that?
 Rusty also commented earlier that it's not a good idea to assume
 specific layout, such as first chunk being large enough to
 include virtio_net_hdr_mrg_rxbuf.
 
 I think we need to use memcpy to/from iovec etc.

I guess you mean put_user() or copy_to_user(); yes, I suppose
it could be paged since we read it.
The code doesn't assume that it'll fit so much as arranges for
it to fit. We allocate guest_hlen bytes in the buffer, but set the
iovec to the (smaller) sock_hlen; do the read, then this code adds
back the 2 bytes in the middle that we didn't read into (where
num_buffers goes). But the allocator does require that guest_hlen
will fit in a single buffer (and reports error if it doesn't). The
alternative is significantly more complicated, and only fails if
the guest doesn't give us at least the buffer size the guest header
requires (a truly lame guest). I'm not sure it's worth a lot of
complexity in vhost to support the guest giving us 12 byte buffers;
those guests don't exist now and maybe they never should?


   /* This actually signals the guest, using eventfd. */
   void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
   {
  __u16 flags = 0;
  +
 
 I tend not to add empty lines if a line above it is already short.

Again, separating declarations from code-- never seen different
in any other kernel code.

 
  if (get_user(flags, &vq->avail->flags)) {
          vq_err(vq, "Failed to get flags");
  return;
  @@ -1125,7 +1140,7 @@
  
  /* If they don't want an interrupt, don't signal, unless 
empty. */
  if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
  -   (vq->avail_idx != vq->last_avail_idx ||
  +   (vhost_available(vq) > vq->maxheadcount ||
 
 I don't understand this change. It seems to make
 code not match the comments.

It redefines empty. Without mergeable buffers, we can empty
the ring down to nothing before we require notification. With
mergeable buffers, if the packet requires, say, 3 buffers, and we
have only 2 left, we are empty and require notification and new
buffers to read anything. In both cases, we notify when we can't
read another packet without more buffers.
I can think about changing the comment to reflect this more
clearly.

 
   !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY)))
  return;
  
  diff -ruN net-next-p2/drivers/vhost/vhost.h net-next-p3/drivers/vhost/vhost.h

Re: [RFC][ PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-03 Thread David Stevens
 Interesting. Since the feature in question is billed first of all a
 performance optimization...

By whom? Although I see some improved performance, I think its real
benefit is improving memory utilization on the guest. Instead of using
75K for an ARP packet, mergeable RX buffers only uses 4K. :-)

 Since the patches affect code paths when mergeable RX buffers are
 disabled as well, I guess the most important point would be to verify
 whether there's increase in latency and/or CPU utilization, or bandwidth
 cost when the feature bit is *disabled*.

Actually, when the feature bit is disabled, it'll only get a single
head, doesn't use the special vnet_hdr, and the codepath reduces
essentially to the original. But the answer is no; I saw no regressions
when using it without the feature bit. The only substantive difference in 
that case
is that the new code avoids copying the vnet header as the original
does, so it should actually be faster, but I don't think that's measurable
above the variability I already see.

 
  2 notes: I have a modified version of qemu to get the VHOST_FEATURES
  flags, including the mergeable RX bufs flag, passed to the guest; I'll
  be working with your current qemu git trees next, if any changes are
  needed to support it there.
 
 This feature also seems to conflict with zero-copy rx patches from Xin
 Xiaohui (subject: Provide a zero-copy method on KVM virtio-net) these
 are not in a mergeable shape yet, so this is not a blocker, but I wonder
 what your thoughts on the subject are: how will we do feature
 negotiation if some backends don't support some features?

The qemu code I have basically sends the set features and get
features all the way to vhost (ie, it's the guest negotiating with
vhost), except, of course, for the magic qemu-only bits. I think that's
the right model. I'll definitely take a look at the patch you mention
and maybe comment further.

+-DLS



[RFC][ PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-02 Thread David Stevens
These patches add support for mergeable receive buffers to
vhost-net, allowing it to use multiple virtio buffer heads for a single
receive packet.
+-DLS


Signed-off-by: David L Stevens dlstev...@us.ibm.com


[RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-02 Thread David Stevens
This patch generalizes buffer handling functions to
support multiple buffer heads.

In-line for viewing, attached for applying.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p0/drivers/vhost/net.c net-next-p1/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c 2010-02-24 12:59:24.0 -0800
+++ net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 -0800
@@ -97,7 +97,8 @@
 static void handle_tx(struct vhost_net *net)
 {
    struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-   unsigned head, out, in, s;
+   unsigned out, in, s;
+   struct iovec head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -126,12 +127,10 @@
    hdr_size = vq->hdr_size;
 
for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-                ARRAY_SIZE(vq->iov),
-                &out, &in,
-                NULL, NULL);
+   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
+           vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
    /* Nothing new?  Wait for eventfd to tell us they refilled. */
-   if (head == vq->num) {
+   if (head.iov_base == (void *)vq->num) {
            wmem = atomic_read(&sock->sk->sk_wmem_alloc);
            if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
                    tx_poll_start(net, sock);
tx_poll_start(net, sock);
@@ -152,7 +151,7 @@
/* Skip header. TODO: support TSO. */
    s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
    msg.msg_iovlen = out;
-   len = iov_length(vq->iov, out);
+   head.iov_len = len = iov_length(vq->iov, out);
/* Sanity check */
if (!len) {
    vq_err(vq, "Unexpected header len for TX: "
@@ -163,14 +162,14 @@
    /* TODO: Check specific error and bomb out unless ENOBUFS? */
    err = sock->ops->sendmsg(NULL, sock, &msg, len);
    if (unlikely(err < 0)) {
-           vhost_discard_vq_desc(vq);
+           vhost_discard(vq, 1);
            tx_poll_start(net, sock);
            break;
    }
    if (err != len)
            pr_err("Truncated TX packet: "
                   " len %d != %zd\n", err, len);
-   vhost_add_used_and_signal(&net->dev, vq, head, 0);
+   vhost_add_used_and_signal(&net->dev, vq, &head, 1);
total_len += len;
if (unlikely(total_len = VHOST_NET_WEIGHT)) {
vhost_poll_queue(vq-poll);
@@ -182,12 +181,22 @@
    unuse_mm(net->dev.mm);
 }
 
+static int skb_head_len(struct sk_buff_head *skq)
+{
+   struct sk_buff *head;
+
+   head = skb_peek(skq);
+   if (head)
+   return head->len;
+   return 0;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
    struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -204,10 +213,11 @@
};
 
size_t len, total_len = 0;
-   int err;
+   int err, headcount, datalen;
size_t hdr_size;
    struct socket *sock = rcu_dereference(vq->private_data);
-   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+
+   if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
return;
 
    use_mm(net->dev.mm);
@@ -218,13 +228,10 @@
    vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
            vq->log : NULL;
 
-   for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-                ARRAY_SIZE(vq->iov),
-                &out, &in,
-                vq_log, &log);
+   while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) {
+   headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
/* OK, now we need to know about added descriptors. */
-   if (head == vq->num) {
+   if (!headcount) {
if (unlikely(vhost_enable_notify(vq))) {
/* They have slipped one in as we were
 * doing that: check again. */
@@ -235,13 +242,6 @@
 * they refilled. */
break;
}
-   /* We don't need to be notified again. */
-   if (out) {
-   vq_err(vq, "Unexpected descriptor format for RX: "
-          "out %d, int 

[RFC][ PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF

2010-03-02 Thread David Stevens
This patch adds vnet_hdr processing for mergeable buffer
support to vhost-net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p1/drivers/vhost/net.c net-next-p2/drivers/vhost/net.c
--- net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 -0800
+++ net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 -0800
@@ -109,7 +109,6 @@
};
size_t len, total_len = 0;
int err, wmem;
-   size_t hdr_size;
    struct socket *sock = rcu_dereference(vq->private_data);
if (!sock)
return;
@@ -124,7 +123,6 @@
 
    if (wmem < sock->sk->sk_sndbuf * 2)
            tx_poll_stop(net);
-   hdr_size = vq->hdr_size;
 
for (;;) {
    head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
@@ -148,25 +146,45 @@
               "out %d, int %d\n", out, in);
break;
}
+   if (vq->guest_hlen > vq->sock_hlen) {
+           if (msg.msg_iov[0].iov_len == vq->guest_hlen)
+                   msg.msg_iov[0].iov_len = vq->sock_hlen;
+           else if (out == ARRAY_SIZE(vq->iov))
+                   vq_err(vq, "handle_tx iov overflow!");
+           else {
+                   int i;
+
+                   /* give header its own iov */
+                   for (i = out; i > 0; --i)
+                           msg.msg_iov[i+1] = msg.msg_iov[i];
+                   msg.msg_iov[0].iov_len = vq->sock_hlen;
+                   msg.msg_iov[1].iov_base += vq->guest_hlen;
+                   msg.msg_iov[1].iov_len -= vq->guest_hlen;
+                   out++;
+           }
+   }
/* Skip header. TODO: support TSO. */
-   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
msg.msg_iovlen = out;
head.iov_len = len = iov_length(vq-iov, out);
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for TX: "
   "%zd expected %zd\n",
-  iov_length(vq->hdr, s), hdr_size);
+  len, vq->guest_hlen);
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
-   vhost_discard(vq, 1);
-   tx_poll_start(net, sock);
+   if (err == -EAGAIN) {
+   tx_poll_start(net, sock);
+   } else {
+   vq_err(vq, "sendmsg: errno %d\n", -err);
+   /* drop packet; do not discard/resend */
+   vhost_add_used_and_signal(&net->dev, vq, head, 1);
+   }
break;
-   }
-   if (err != len)
+   } else if (err != len)
pr_err(Truncated TX packet: 
len %d != %zd\n, err, len);
vhost_add_used_and_signal(&net->dev, vq, head, 1);
@@ -207,14 +225,8 @@
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
-   };
-
size_t len, total_len = 0;
int err, headcount, datalen;
-   size_t hdr_size;
struct socket *sock = rcu_dereference(vq->private_data);
 
if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
@@ -223,7 +235,6 @@
use_mm(net->dev.mm);
mutex_lock(&vq->mutex);
vhost_disable_notify(vq);
-   hdr_size = vq->hdr_size;
 
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
@@ -232,25 +243,18 @@
headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
/* OK, now we need to know about added descriptors. */
if (!headcount) {
-   if (unlikely(vhost_enable_notify(vq))) {
-   /* They have slipped one in as we were
-* doing that: check again. */
-   vhost_disable_notify(vq);
-   continue;
-   }
-   /* Nothing new?  Wait for eventfd to tell us
-* they refilled. */
+   vhost_enable_notify(vq);
break;
}
/* Skip header. TODO: support TSO/mergeable rx buffers. */
-   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq->iov, in);
+
/* Sanity check */
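
The TX hunk earlier in this patch shuffles the iovec array so the vnet header
ends up in its own segment when guest and socket header sizes differ. Here is
a standalone userspace sketch of that shuffle, simplified to splitting a
header of hlen bytes out of the first segment; split_header() is a
hypothetical helper, and only struct iovec comes from <sys/uio.h>:

```c
#include <sys/uio.h>

/* Open up a slot so the first hlen bytes get an iovec of their own:
 * shift segments 1..nseg-1 up by one, then point the new slot 1 at
 * the data that followed the header in the old first segment. */
static int split_header(struct iovec *iov, int *nseg, int max, size_t hlen)
{
	int i;

	if (iov[0].iov_len == hlen)
		return 0;		/* header already has its own iov */
	if (*nseg == max)
		return -1;		/* no free slot: iov overflow */
	for (i = *nseg - 1; i > 0; --i)	/* shift data segments up */
		iov[i + 1] = iov[i];
	iov[1].iov_base = (char *)iov[0].iov_base + hlen;
	iov[1].iov_len = iov[0].iov_len - hlen;
	iov[0].iov_len = hlen;		/* slot 0 now holds only the header */
	(*nseg)++;
	return 0;
}
```

The -1 return mirrors the "handle_tx iov overflow!" case in the patch, where
no spare slot remains in the fixed-size iovec array.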
 

[RFC][ PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-02 Thread David Stevens
This patch glues them all together and makes sure we
notify whenever we don't have enough buffers to receive
a max-sized packet, and adds the feature bit.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p2/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c
--- net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 -0800
+++ net-next-p3/drivers/vhost/net.c 2010-03-02 15:25:15.0 -0800
@@ -54,26 +54,6 @@
enum vhost_net_poll_state tx_poll_state;
 };
 
-/* Pop first len bytes from iovec. Return number of segments used. */
-static int move_iovec_hdr(struct iovec *from, struct iovec *to,
- size_t len, int iov_count)
-{
-   int seg = 0;
-   size_t size;
-   while (len && seg < iov_count) {
-   size = min(from->iov_len, len);
-   to->iov_base = from->iov_base;
-   to->iov_len = size;
-   from->iov_len -= size;
-   from->iov_base += size;
-   len -= size;
-   ++from;
-   ++to;
-   ++seg;
-   }
-   return seg;
-}
-
 /* Caller must have TX VQ lock */
 static void tx_poll_stop(struct vhost_net *net)
 {
@@ -97,7 +77,7 @@
 static void handle_tx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-   unsigned out, in, s;
+   unsigned out, in;
struct iovec head;
struct msghdr msg = {
.msg_name = NULL,
@@ -110,6 +90,7 @@
size_t len, total_len = 0;
int err, wmem;
struct socket *sock = rcu_dereference(vq->private_data);
+
if (!sock)
return;
 
@@ -166,11 +147,11 @@
/* Skip header. TODO: support TSO. */
msg.msg_iovlen = out;
head.iov_len = len = iov_length(vq->iov, out);
+
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for TX: "
-  "%zd expected %zd\n",
-  len, vq->guest_hlen);
+  "%zd expected %zd\n", len, vq->guest_hlen);
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
@@ -214,7 +195,7 @@
 static void handle_rx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned in, log, s;
+   unsigned in, log;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -245,30 +226,36 @@
if (!headcount) {
vhost_enable_notify(vq);
break;
-   }
+   } else if (vq->maxheadcount < headcount)
+   vq->maxheadcount = headcount;
/* Skip header. TODO: support TSO/mergeable rx buffers. */
msg.msg_iovlen = in;
len = iov_length(vq->iov, in);
-
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for RX: "
-  "%zd expected %zd\n",
-  len, vq->guest_hlen);
+  "%zd expected %zd\n", len, vq->guest_hlen);
break;
}
err = sock->ops->recvmsg(NULL, sock, &msg,
   len, MSG_DONTWAIT | MSG_TRUNC);
-   /* TODO: Check specific error and bomb out unless EAGAIN? */
if (err < 0) {
-   vhost_discard(vq, 1);
+   vhost_discard(vq, headcount);
break;
}
/* TODO: Should check and handle checksum. */
+   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
+   struct virtio_net_hdr_mrg_rxbuf *vhdr =
+   (struct virtio_net_hdr_mrg_rxbuf *)
+   vq->iov[0].iov_base;
+   /* add num_bufs */
+   vq->iov[0].iov_len = vq->guest_hlen;
+   vhdr->num_buffers = headcount;
+   }
if (err > len) {
pr_err("Discarded truncated rx packet: "
"len %d > %zd\n", err, len);
-   vhost_discard(vq, 1);
+   vhost_discard(vq, headcount);
continue;
}
len = err;
@@ -573,8 +560,6 @@
 
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
-   size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
-   sizeof(struct virtio_net_hdr) : 0;
int i;
mutex_lock(n-dev.mutex);
if ((features & (1 << VHOST_F_LOG_ALL)) &&
diff -ruN net-next-p2/drivers/vhost/vhost.c net-next-p3/drivers/vhost/vhost.c
--- net-next-p2/drivers/vhost/vhost.c   
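
The key RX change in this patch is writing num_buffers back into the
virtio-net header once recvmsg() has landed the packet across headcount
descriptors. A minimal userspace sketch of that fix-up; vnet_hdr and
vnet_hdr_mrg_rxbuf here are stand-ins mirroring the virtio-net header layout,
and finish_mrg_rx() is an illustrative helper, not the patch's code:

```c
#include <stdint.h>

/* Stand-in for the virtio-net header layout (cf. struct virtio_net_hdr
 * and struct virtio_net_hdr_mrg_rxbuf in the virtio spec). */
struct vnet_hdr {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

struct vnet_hdr_mrg_rxbuf {
	struct vnet_hdr hdr;
	uint16_t num_buffers;	/* how many descriptors this packet spans */
};

/* After the packet is received into 'headcount' guest buffers, patch
 * num_buffers into the header at the front of the first buffer so the
 * guest knows how many used entries belong to this packet. */
static void finish_mrg_rx(struct vnet_hdr_mrg_rxbuf *vhdr, int headcount)
{
	vhdr->num_buffers = (uint16_t)headcount;
}
```

This is exactly the value the guest compares against the used entries it can
see, which is why the used-index ordering fix in the first message of this
thread matters.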

Re: [PATCH] fix vhost ioctl handling for 32-bit

2009-12-21 Thread David Stevens
Michael,
There is a 64-bit put_user(), so GET actually works already, but
there just isn't a 64-bit get_user().

It doesn't hurt to make them symmetric, though. The changes look
fine to me, if you want to do both.

I'm looking at adding mergeable rx buffer support and doing the
development work on a Centrino vPro. I've run into other problems
with feature bits setting from your qemu git tree; will send a patch
for that when I get it straightened out (unless someone else gets to
it first, while I'm on vacation).

+-DLS

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix vhost ioctl handling for 32-bit

2009-12-17 Thread David Stevens
VHOST_GET_FEATURES returns high-order garbage on 32-bit
machines. This patch fixes it to use 64 bits throughout.

+-DLS

[in-line for viewing, attached to avoid whitespace mangling]

Signed-off-by: David L Stevens dlstev...@us.ibm.com

--- a/drivers/vhost/net.c   2009-11-17 22:51:56.0 -0800
+++ b/drivers/vhost/net.c   2009-12-17 11:31:51.0 -0800
@@ -563,7 +563,7 @@
 {
struct vhost_net *n = f-private_data;
void __user *argp = (void __user *)arg;
-   u32 __user *featurep = argp;
+   u64 __user *featurep = (u64 __user *)argp;
struct vhost_vring_file backend;
u64 features;
int r;
@@ -577,7 +577,7 @@
features = VHOST_FEATURES;
return put_user(features, featurep);
case VHOST_SET_FEATURES:
-   r = get_user(features, featurep);
+   r = copy_from_user(&features, featurep, sizeof features);
/* No features for now */
if (r < 0)
return r;
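
The bug being fixed is easy to demonstrate in plain C: reading a 64-bit
feature word through a 32-bit pointer silently drops half of it. A hedged
userspace sketch (get_features_bad/get_features_good are illustrative names,
not the vhost ioctl handlers):

```c
#include <stdint.h>
#include <string.h>

/* Broken: dereferencing the user pointer as u32 keeps only one half
 * of the 64-bit feature word. */
static uint64_t get_features_bad(const void *userp)
{
	return *(const uint32_t *)userp;	/* truncates to 32 bits */
}

/* Fixed: copy the full 64 bits, as the patch does with a u64 pointer
 * and copy_from_user(). */
static uint64_t get_features_good(const void *userp)
{
	uint64_t v;

	memcpy(&v, userp, sizeof v);		/* full 64-bit copy */
	return v;
}
```

Any feature bit at position 32 or above is lost by the broken version, which
is the "high-order garbage" described in the original patch posting below.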





Re: [PATCH 4/5] virtio_net: Add a MAC filter table

2009-01-28 Thread David Stevens
I haven't been following this closely, so apologies if the point's been
made, or if you're talking about unicast addresses here too, but just to
be clear:

For multicasting, false positives are ok, false negatives are not
(non-functional), and if the fixed-size address filter is exceeded,
_multicast_promiscuous_ (but not all unicasts, so not promiscuous mode)
is the good thing to do. So best effort still shouldn't lead to an
address you previously joined not being passed because a new one is
added.

Also, if you can't keep all the MAC multicast addresses (ie,
the limit is memory and not look-up speed), then getting
out of multicast-promiscuous mode correctly isn't easy
since you don't know what groups you forgot. You could
rebuild from the protocol memberships, if you know when
you've left enough groups to fit, but otherwise the MAC
multicast addresses you didn't keep of course won't work if you
leave multicast-promiscuous mode and the filter doesn't
have them.

So, if you're talking about not being able to fit all
the address (vs. not wanting to search that many), then
I'd suggest either staying in MP mode until ifdown, or
making the join a hard failure at the limit in the first
place.

+-DLS
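
The policy argued for above -- overflow the fixed-size filter toward
multicast-promiscuous, never toward dropping a joined group -- can be
sketched as follows; struct mac_filter, the slot count, and both helpers are
hypothetical, not the virtio_net driver's actual table:

```c
#include <stdbool.h>
#include <string.h>

#define FILTER_SLOTS 4	/* illustrative fixed table size */

struct mac_filter {
	unsigned char addr[FILTER_SLOTS][6];
	int used;
	bool allmulti;	/* sticky until ifdown, per the suggestion above */
};

/* Adding past the table's capacity fails toward accepting everything
 * (allmulti), so a previously joined group is never dropped. */
static void filter_add(struct mac_filter *f, const unsigned char mac[6])
{
	if (f->used == FILTER_SLOTS) {
		f->allmulti = true;
		return;
	}
	memcpy(f->addr[f->used++], mac, 6);
}

static bool filter_pass(const struct mac_filter *f, const unsigned char mac[6])
{
	int i;

	if (f->allmulti)
		return true;	/* false positives are OK */
	for (i = 0; i < f->used; i++)
		if (!memcmp(f->addr[i], mac, 6))
			return true;
	return false;		/* only never-joined groups are filtered */
}
```

Leaving allmulti set until ifdown sidesteps the problem described above: once
addresses have been forgotten, there is no safe way to know when the table is
complete enough to exit multicast-promiscuous mode.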
