Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-04-04 Thread Michael S. Tsirkin
On Thu, Apr 01, 2010 at 11:22:37AM -0700, David Stevens wrote:
 kvm-ow...@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:
 
  On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:
 
   
 +   head.iov_base = (void 
 *)vhost_get_vq_desc(net-dev, 
   vq,
 +   vq-iov, ARRAY_SIZE(vq-iov), out, in, 
 NULL, 
   
 NULL);

I this casting confusing.
Is it really expensive to add an array of heads so that
we do not need to cast?
   
   It needs the heads and the lengths, which looks a lot
   like an iovec. I was trying to resist adding a new
   struct XXX { unsigned head; unsigned len; } just for this,
   but I could make these parallel arrays, one with head index and
   the other with length.
 
 Michael, on this one, if I add vq-heads as an argument to
 vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
 Would you rather this 1) remain an iovec (and a single arg added) but
 cast still there, 2) 2 arrays (head and length) and 2 args added, or
 3) a new struct type of {unsigned,int} to carry for the heads+len
 instead of iovec?
 My preference would be 1). I agree the casts are ugly, but
 it is essentially an iovec the way we use it; it's just that the
 base isn't a pointer but a descriptor index instead.

I prefer 2 or 3. If you prefer 1 strongly, I think we should
add a detailed comment near the iovec, and
a couple of inline wrappers to store/get data in the iovec.

   
   EAGAIN is not possible after the change, because we don't
   even enter the loop unless we have an skb on the read queue; the
   other cases bomb out, so I figured the comment for future work is
   now done. :-)
  
  Guest could be buggy so we'll get EFAULT.
  If skb is taken off the rx queue (as below), we might get EAGAIN.
 
 We break on any error. If we get EAGAIN because someone read
 on the socket, this code would break the loop, but EAGAIN is a more
 serious problem if it changed since we peeked (because it means
 someone else is reading the socket).
 But I don't understand -- are you suggesting that the error
 handling be different than that, or that the comment is still
 relevant?
 My intention here is to do the TODO from the comment
 so that it can be removed, by handling all error cases. I think
 because of the peek, EAGAIN isn't something to be ignored anymore,
 but the effect is the same whether we break out of the loop or
 not, since we retry the packet next time around. Essentially, we
 ignore every error since we will redo it with the same packet the
 next time around. Maybe we should print something here, but since
 we'll be retrying the packet that's still on the socket, a permanent
 error would spew continuously. Maybe we should shut down entirely
 if we get any negative return value here (including EAGAIN, since
 that tells us someone messed with the socket when we don't want them
 to).
 If you want the comment still there, ok, but I do think EAGAIN
 isn't a special case per the comment anymore, and is handled as all
 other errors are: by exiting the loop and retrying next time.
 
 +-DLS

Yes, I just think some comment should stay, as you say, because
otherwise we simply retry continuously. Maybe we should trigger vq_err.
It needs to be given some thought which I have not given it yet.

Thinking aloud, EAGAIN means someone reads the socket
together with us, I prefer that this condition is made a fatal
error, we should make sure we are polling the socket
so we see packets if more appear.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-04-01 Thread Michael S. Tsirkin
On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM:
  
  attached patch seems to be whiespace damaged as well.
  Does the origin pass checkpatch.pl for you?
 
 Yes, but I probably broke it in the transfer -- will be more
 careful with the next revision.
 
 
 
   +   head.iov_base = (void *)vhost_get_vq_desc(net-dev, 
 vq,
   +   vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, 
 
   NULL);
  
  I this casting confusing.
  Is it really expensive to add an array of heads so that
  we do not need to cast?
 
 It needs the heads and the lengths, which looks a lot
 like an iovec. I was trying to resist adding a new
 struct XXX { unsigned head; unsigned len; } just for this,
 but I could make these parallel arrays, one with head index and
 the other with length.
 
   +   if (vq-rxmaxheadcount  headcount)
   +   vq-rxmaxheadcount = headcount;
  
  This seems the only place where we set the rxmaxheadcount
  value. Maybe it can be moved out of vhost.c to net.c?
  If vhost library needs this it can get it as function
  parameter.
 
 I can move it to vhost_get_heads(), sure.
  
   +   /* Skip header. TODO: support TSO. */
  
  You seem to have removed the code that skips the header.
  Won't this break non-GSO backends such as raw?
 
 From our prior discussion, what I had in mind was that
 we'd remove all special-header processing by using the ioctl
 you added for TUN and later, an equivalent ioctl for raw (when
 supported in qemu-kvm). Qemu would arrange headers with tap
 (and later raw) to get what the guest expects, and vhost then
 just passes all data as-is between the socket and the ring.
 That not only removes the different-header-len code
 for mergeable buffers, but also eliminates making a copy of the
 header for every packet as before.

Yes but please note that in practice if this is what we do,
then vhost header size is 0 and then there is no actual copy.

 Vhost has no need to do
 anything with the header, or even know its length. It also
 doesn't have to care what the type of the backend is-- raw or
 tap.
 I think that simplifies everything, but it does mean that
 raw socket support requires a header ioctl for the different
 vnethdr sizes if the guest wants a vnethdr with and without
 mergeable buffers. Actually, I thought this was your idea and
 the point of the TUNSETVNETHDRSZ ioctl, but I guess you had
 something else in mind?

Yes. However, we also have an ioctl that sets vhost header size, and it
allows supporting simple backends which do not support an (equivalent
of) TUNSETVNETHDRSZ. We have two of these now: packet and macvtap.
So I thought that unless there are gains in code simplicity and/or
performance we can keep supporting these just as well.

   -   /* TODO: Check specific error and bomb out unless 
 EAGAIN? 
   */
  
  Do you think it's not a good idea?
 
 EAGAIN is not possible after the change, because we don't
 even enter the loop unless we have an skb on the read queue; the
 other cases bomb out, so I figured the comment for future work is
 now done. :-)

Guest could be buggy so we'll get EFAULT.
If skb is taken off the rx queue (as below), we might get EAGAIN.

  
   if (err  0) {
   -   vhost_discard_vq_desc(vq);
   +   vhost_discard_vq_desc(vq, headcount);
   break;
   }
  
  I think we should detect and discard truncated messages,
  since len might not be reliable if userspace pulls
  a packet from under us. Also, if new packet is
  shorter than the old one, there's no truncation but
  headcount is wrong.
  
  So simplest fix IMO would be to compare err with expected len.
  If there's a difference, we hit the race and so
  we would discard the packet.
 
 Sure.
 
  
  
   /* TODO: Should check and handle checksum. */
   +   if (vhost_has_feature(net-dev, 
 VIRTIO_NET_F_MRG_RXBUF)) 
   {
   +   struct virtio_net_hdr_mrg_rxbuf *vhdr =
   +   (struct virtio_net_hdr_mrg_rxbuf *)
   +   vq-iov[0].iov_base;
   +   /* add num_bufs */
   +   if (put_user(headcount, vhdr-num_buffers)) {
   +   vq_err(vq, Failed to write 
 num_buffers);
   +   vhost_discard_vq_desc(vq, headcount);
  
  Let's do memcpy_to_iovecend etc so that we do not assume layout.
  This is also why we need move_iovec: sendmsg might modify the iovec.
  It would also be nice not to corrupt memory and
  get a reasonable error if buffer size
  that we get is smaller than expected header size.
 
 I wanted to avoid the header copy here, with the reasonable 
 restriction
 that the guest gives you a buffer at least big 

Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-04-01 Thread David Stevens
kvm-ow...@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:

 On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:

  
+   head.iov_base = (void 
*)vhost_get_vq_desc(net-dev, 
  vq,
+   vq-iov, ARRAY_SIZE(vq-iov), out, in, 
NULL, 
  
NULL);
   
   I this casting confusing.
   Is it really expensive to add an array of heads so that
   we do not need to cast?
  
  It needs the heads and the lengths, which looks a lot
  like an iovec. I was trying to resist adding a new
  struct XXX { unsigned head; unsigned len; } just for this,
  but I could make these parallel arrays, one with head index and
  the other with length.

Michael, on this one, if I add vq-heads as an argument to
vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
Would you rather this 1) remain an iovec (and a single arg added) but
cast still there, 2) 2 arrays (head and length) and 2 args added, or
3) a new struct type of {unsigned,int} to carry for the heads+len
instead of iovec?
My preference would be 1). I agree the casts are ugly, but
it is essentially an iovec the way we use it; it's just that the
base isn't a pointer but a descriptor index instead.

  
  EAGAIN is not possible after the change, because we don't
  even enter the loop unless we have an skb on the read queue; the
  other cases bomb out, so I figured the comment for future work is
  now done. :-)
 
 Guest could be buggy so we'll get EFAULT.
 If skb is taken off the rx queue (as below), we might get EAGAIN.

We break on any error. If we get EAGAIN because someone read
on the socket, this code would break the loop, but EAGAIN is a more
serious problem if it changed since we peeked (because it means
someone else is reading the socket).
But I don't understand -- are you suggesting that the error
handling be different than that, or that the comment is still
relevant? My intention here is to do the TODO from the comment
so that it can be removed, by handling all error cases. I think
because of the peek, EAGAIN isn't something to be ignored anymore,
but the effect is the same whether we break out of the loop or
not, since we retry the packet next time around. Essentially, we
ignore every error since we will redo it with the same packet the
next time around. Maybe we should print something here, but since
we'll be retrying the packet that's still on the socket, a permanent
error would spew continuously. Maybe we should shut down entirely
if we get any negative return value here (including EAGAIN, since
that tells us someone messed with the socket when we don't want them
to).
If you want the comment still there, ok, but I do think EAGAIN
isn't a special case per the comment anymore, and is handled as all
other errors are: by exiting the loop and retrying next time.

+-DLS

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-03-31 Thread Michael S. Tsirkin
On Tue, Mar 30, 2010 at 07:23:48PM -0600, David Stevens wrote:
 This patch adds support for the Mergeable Receive Buffers feature to
 vhost_net.
 
 Changes:
 1) generalize descriptor allocation functions to allow multiple
 descriptors per packet
 2) add socket peek to know datalen at buffer allocation time
 3) change notification to enable a multi-buffer max packet, rather
 than the single-buffer run until completely empty
 
 Changes from previous revision:
 1) incorporate review comments from Michael Tsirkin
 2) assume use of TUNSETVNETHDRSZ ioctl by qemu, which simplifies vnet 
 header
 processing
 3) fixed notification code to only affect the receive side
 
 Signed-Off-By: David L Stevens dlstev...@us.ibm.com
 
 [in-line for review, attached for applying w/o whitespace mangling]

attached patch seems to be whiespace damaged as well.
Does the origin pass checkpatch.pl for you?

 diff -ruNp net-next-p0/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c
 --- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 
 -0700
 +++ net-next-p3/drivers/vhost/net.c 2010-03-30 12:50:57.0 
 -0700
 @@ -54,26 +54,6 @@ struct vhost_net {
 enum vhost_net_poll_state tx_poll_state;
  };
  
 -/* Pop first len bytes from iovec. Return number of segments used. */
 -static int move_iovec_hdr(struct iovec *from, struct iovec *to,
 - size_t len, int iov_count)
 -{
 -   int seg = 0;
 -   size_t size;
 -   while (len  seg  iov_count) {
 -   size = min(from-iov_len, len);
 -   to-iov_base = from-iov_base;
 -   to-iov_len = size;
 -   from-iov_len -= size;
 -   from-iov_base += size;
 -   len -= size;
 -   ++from;
 -   ++to;
 -   ++seg;
 -   }
 -   return seg;
 -}
 -
  /* Caller must have TX VQ lock */
  static void tx_poll_stop(struct vhost_net *net)
  {
 @@ -97,7 +77,8 @@ static void tx_poll_start(struct vhost_n
  static void handle_tx(struct vhost_net *net)
  {
 struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_TX];
 -   unsigned head, out, in, s;
 +   unsigned out, in;
 +   struct iovec head;
 struct msghdr msg = {
 .msg_name = NULL,
 .msg_namelen = 0,
 @@ -108,8 +89,8 @@ static void handle_tx(struct vhost_net *
 };
 size_t len, total_len = 0;
 int err, wmem;
 -   size_t hdr_size;
 struct socket *sock = rcu_dereference(vq-private_data);
 +
 if (!sock)
 return;
  
 @@ -127,22 +108,19 @@ static void handle_tx(struct vhost_net *
  
 if (wmem  sock-sk-sk_sndbuf / 2)
 tx_poll_stop(net);
 -   hdr_size = vq-hdr_size;
  
 for (;;) {
 -   head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 -ARRAY_SIZE(vq-iov),
 -out, in,
 -NULL, NULL);
 +   head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq,
 +   vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, 
 NULL);

I this casting confusing.
Is it really expensive to add an array of heads so that
we do not need to cast?


 /* Nothing new?  Wait for eventfd to tell us they 
 refilled. */
 -   if (head == vq-num) {
 +   if (head.iov_base == (void *)vq-num) {
 wmem = atomic_read(sock-sk-sk_wmem_alloc);
 if (wmem = sock-sk-sk_sndbuf * 3 / 4) {
 tx_poll_start(net, sock);
 set_bit(SOCK_ASYNC_NOSPACE, sock-flags);
 break;
 }
 -   if (unlikely(vhost_enable_notify(vq))) {
 +   if (unlikely(vhost_enable_notify(vq, 0))) {
 vhost_disable_notify(vq);
 continue;
 }
 @@ -154,27 +132,30 @@ static void handle_tx(struct vhost_net *
 break;
 }
 /* Skip header. TODO: support TSO. */
 -   s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
 msg.msg_iovlen = out;
 -   len = iov_length(vq-iov, out);
 +   head.iov_len = len = iov_length(vq-iov, out);
 +
 /* Sanity check */
 if (!len) {
 -   vq_err(vq, Unexpected header len for TX: 
 -  %zd expected %zd\n,
 -  iov_length(vq-hdr, s), hdr_size);
 +   vq_err(vq, Unexpected buffer len for TX: %zd , 
 len);
 break;
 }
 -   /* TODO: Check specific error and bomb out unless ENOBUFS? 
 */
 err = 

Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-03-31 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM:
 
 attached patch seems to be whiespace damaged as well.
 Does the origin pass checkpatch.pl for you?

Yes, but I probably broke it in the transfer -- will be more
careful with the next revision.



  +   head.iov_base = (void *)vhost_get_vq_desc(net-dev, 
vq,
  +   vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, 

  NULL);
 
 I this casting confusing.
 Is it really expensive to add an array of heads so that
 we do not need to cast?

It needs the heads and the lengths, which looks a lot
like an iovec. I was trying to resist adding a new
struct XXX { unsigned head; unsigned len; } just for this,
but I could make these parallel arrays, one with head index and
the other with length.

  +   if (vq-rxmaxheadcount  headcount)
  +   vq-rxmaxheadcount = headcount;
 
 This seems the only place where we set the rxmaxheadcount
 value. Maybe it can be moved out of vhost.c to net.c?
 If vhost library needs this it can get it as function
 parameter.

I can move it to vhost_get_heads(), sure.
 
  +   /* Skip header. TODO: support TSO. */
 
 You seem to have removed the code that skips the header.
 Won't this break non-GSO backends such as raw?

From our prior discussion, what I had in mind was that
we'd remove all special-header processing by using the ioctl
you added for TUN and later, an equivalent ioctl for raw (when
supported in qemu-kvm). Qemu would arrange headers with tap
(and later raw) to get what the guest expects, and vhost then
just passes all data as-is between the socket and the ring.
That not only removes the different-header-len code
for mergeable buffers, but also eliminates making a copy of the
header for every packet as before. Vhost has no need to do
anything with the header, or even know its length. It also
doesn't have to care what the type of the backend is-- raw or
tap.
I think that simplifies everything, but it does mean that
raw socket support requires a header ioctl for the different
vnethdr sizes if the guest wants a vnethdr with and without
mergeable buffers. Actually, I thought this was your idea and
the point of the TUNSETVNETHDRSZ ioctl, but I guess you had
something else in mind?

  -   /* TODO: Check specific error and bomb out unless 
EAGAIN? 
  */
 
 Do you think it's not a good idea?

EAGAIN is not possible after the change, because we don't
even enter the loop unless we have an skb on the read queue; the
other cases bomb out, so I figured the comment for future work is
now done. :-)

 
  if (err  0) {
  -   vhost_discard_vq_desc(vq);
  +   vhost_discard_vq_desc(vq, headcount);
  break;
  }
 
 I think we should detect and discard truncated messages,
 since len might not be reliable if userspace pulls
 a packet from under us. Also, if new packet is
 shorter than the old one, there's no truncation but
 headcount is wrong.
 
 So simplest fix IMO would be to compare err with expected len.
 If there's a difference, we hit the race and so
 we would discard the packet.

Sure.

 
 
  /* TODO: Should check and handle checksum. */
  +   if (vhost_has_feature(net-dev, 
VIRTIO_NET_F_MRG_RXBUF)) 
  {
  +   struct virtio_net_hdr_mrg_rxbuf *vhdr =
  +   (struct virtio_net_hdr_mrg_rxbuf *)
  +   vq-iov[0].iov_base;
  +   /* add num_bufs */
  +   if (put_user(headcount, vhdr-num_buffers)) {
  +   vq_err(vq, Failed to write 
num_buffers);
  +   vhost_discard_vq_desc(vq, headcount);
 
 Let's do memcpy_to_iovecend etc so that we do not assume layout.
 This is also why we need move_iovec: sendmsg might modify the iovec.
 It would also be nice not to corrupt memory and
 get a reasonable error if buffer size
 that we get is smaller than expected header size.

I wanted to avoid the header copy here, with the reasonable 
restriction
that the guest gives you a buffer at least big enough for the vnet_hdr. A
length check alone (on iov[0].iov_len) could enforce that without copying
headers back and forth to support silly cases like 8-byte buffers or
num_buffers spanning multiple iovecs, and I think paying the price of the
copy on every packet to support guests that broken isn't worth it.
So, my preference here would be to add:

if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF))
struct virtio_net_mrg_rxbuf *vhdr...

if (vq-iov[0].iov_len  sizeof(*vhdr)) {
vq_err(vq, tiny buffers (len %d  %d) are not 
supported,
vq-iov[0].iov_len, sizeof(*vhdr));
break;
}