Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
On Thu, Apr 01, 2010 at 11:22:37AM -0700, David Stevens wrote: kvm-ow...@vger.kernel.org wrote on 04/01/2010 03:54:15 AM: On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote: + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); I this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast? It needs the heads and the lengths, which looks a lot like an iovec. I was trying to resist adding a new struct XXX { unsigned head; unsigned len; } just for this, but I could make these parallel arrays, one with head index and the other with length. Michael, on this one, if I add vq-heads as an argument to vhost_get_heads (aka vhost_get_desc_n), I'd need the length too. Would you rather this 1) remain an iovec (and a single arg added) but cast still there, 2) 2 arrays (head and length) and 2 args added, or 3) a new struct type of {unsigned,int} to carry for the heads+len instead of iovec? My preference would be 1). I agree the casts are ugly, but it is essentially an iovec the way we use it; it's just that the base isn't a pointer but a descriptor index instead. I prefer 2 or 3. If you prefer 1 strongly, I think we should add a detailed comment near the iovec, and a couple of inline wrappers to store/get data in the iovec. EAGAIN is not possible after the change, because we don't even enter the loop unless we have an skb on the read queue; the other cases bomb out, so I figured the comment for future work is now done. :-) Guest could be buggy so we'll get EFAULT. If skb is taken off the rx queue (as below), we might get EAGAIN. We break on any error. If we get EAGAIN because someone read on the socket, this code would break the loop, but EAGAIN is a more serious problem if it changed since we peeked (because it means someone else is reading the socket). But I don't understand -- are you suggesting that the error handling be different than that, or that the comment is still relevant? My intention here is to do the TODO from the comment so that it can be removed, by handling all error cases. I think because of the peek, EAGAIN isn't something to be ignored anymore, but the effect is the same whether we break out of the loop or not, since we retry the packet next time around. Essentially, we ignore every error since we will redo it with the same packet the next time around. Maybe we should print something here, but since we'll be retrying the packet that's still on the socket, a permanent error would spew continuously. Maybe we should shut down entirely if we get any negative return value here (including EAGAIN, since that tells us someone messed with the socket when we don't want them to). If you want the comment still there, ok, but I do think EAGAIN isn't a special case per the comment anymore, and is handled as all other errors are: by exiting the loop and retrying next time. +-DLS Yes, I just think some comment should stay, as you say, because otherwise we simply retry continuously. Maybe we should trigger vq_err. It needs to be given some thought which I have not given it yet. Thinking aloud, EAGAIN means someone reads the socket together with us, I prefer that this condition is made a fatal error, we should make sure we are polling the socket so we see packets if more appear. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote: Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM: attached patch seems to be whiespace damaged as well. Does the origin pass checkpatch.pl for you? Yes, but I probably broke it in the transfer -- will be more careful with the next revision. + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); I this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast? It needs the heads and the lengths, which looks a lot like an iovec. I was trying to resist adding a new struct XXX { unsigned head; unsigned len; } just for this, but I could make these parallel arrays, one with head index and the other with length. + if (vq-rxmaxheadcount headcount) + vq-rxmaxheadcount = headcount; This seems the only place where we set the rxmaxheadcount value. Maybe it can be moved out of vhost.c to net.c? If vhost library needs this it can get it as function parameter. I can move it to vhost_get_heads(), sure. + /* Skip header. TODO: support TSO. */ You seem to have removed the code that skips the header. Won't this break non-GSO backends such as raw? From our prior discussion, what I had in mind was that we'd remove all special-header processing by using the ioctl you added for TUN and later, an equivalent ioctl for raw (when supported in qemu-kvm). Qemu would arrange headers with tap (and later raw) to get what the guest expects, and vhost then just passes all data as-is between the socket and the ring. That not only removes the different-header-len code for mergeable buffers, but also eliminates making a copy of the header for every packet as before. Yes but please note that in practice if this is what we do, then vhost header size is 0 and then there is no actual copy. Vhost has no need to do anything with the header, or even know its length. It also doesn't have to care what the type of the backend is-- raw or tap. I think that simplifies everything, but it does mean that raw socket support requires a header ioctl for the different vnethdr sizes if the guest wants a vnethdr with and without mergeable buffers. Actually, I thought this was your idea and the point of the TUNSETVNETHDRSZ ioctl, but I guess you had something else in mind? Yes. However, we also have an ioctl that sets vhost header size, and it allows supporting simple backends which do not support an (equivalent of) TUNSETVNETHDRSZ. We have two of these now: packet and macvtap. So I thought that unless there are gains in code simplicity and/or performance we can keep supporting these just as well. - /* TODO: Check specific error and bomb out unless EAGAIN? */ Do you think it's not a good idea? EAGAIN is not possible after the change, because we don't even enter the loop unless we have an skb on the read queue; the other cases bomb out, so I figured the comment for future work is now done. :-) Guest could be buggy so we'll get EFAULT. If skb is taken off the rx queue (as below), we might get EAGAIN. if (err 0) { - vhost_discard_vq_desc(vq); + vhost_discard_vq_desc(vq, headcount); break; } I think we should detect and discard truncated messages, since len might not be reliable if userspace pulls a packet from under us. Also, if new packet is shorter than the old one, there's no truncation but headcount is wrong. So simplest fix IMO would be to compare err with expected len. If there's a difference, we hit the race and so we would discard the packet. Sure. /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF)) { + struct virtio_net_hdr_mrg_rxbuf *vhdr = + (struct virtio_net_hdr_mrg_rxbuf *) + vq-iov[0].iov_base; + /* add num_bufs */ + if (put_user(headcount, vhdr-num_buffers)) { + vq_err(vq, Failed to write num_buffers); + vhost_discard_vq_desc(vq, headcount); Let's do memcpy_to_iovecend etc so that we do not assume layout. This is also why we need move_iovec: sendmsg might modify the iovec. It would also be nice not to corrupt memory and get a reasonable error if buffer size that we get is smaller than expected header size. I wanted to avoid the header copy here, with the reasonable restriction that the guest gives you a buffer at least big
Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
kvm-ow...@vger.kernel.org wrote on 04/01/2010 03:54:15 AM: On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote: + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); I this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast? It needs the heads and the lengths, which looks a lot like an iovec. I was trying to resist adding a new struct XXX { unsigned head; unsigned len; } just for this, but I could make these parallel arrays, one with head index and the other with length. Michael, on this one, if I add vq-heads as an argument to vhost_get_heads (aka vhost_get_desc_n), I'd need the length too. Would you rather this 1) remain an iovec (and a single arg added) but cast still there, 2) 2 arrays (head and length) and 2 args added, or 3) a new struct type of {unsigned,int} to carry for the heads+len instead of iovec? My preference would be 1). I agree the casts are ugly, but it is essentially an iovec the way we use it; it's just that the base isn't a pointer but a descriptor index instead. EAGAIN is not possible after the change, because we don't even enter the loop unless we have an skb on the read queue; the other cases bomb out, so I figured the comment for future work is now done. :-) Guest could be buggy so we'll get EFAULT. If skb is taken off the rx queue (as below), we might get EAGAIN. We break on any error. If we get EAGAIN because someone read on the socket, this code would break the loop, but EAGAIN is a more serious problem if it changed since we peeked (because it means someone else is reading the socket). But I don't understand -- are you suggesting that the error handling be different than that, or that the comment is still relevant? My intention here is to do the TODO from the comment so that it can be removed, by handling all error cases. I think because of the peek, EAGAIN isn't something to be ignored anymore, but the effect is the same whether we break out of the loop or not, since we retry the packet next time around. Essentially, we ignore every error since we will redo it with the same packet the next time around. Maybe we should print something here, but since we'll be retrying the packet that's still on the socket, a permanent error would spew continuously. Maybe we should shut down entirely if we get any negative return value here (including EAGAIN, since that tells us someone messed with the socket when we don't want them to). If you want the comment still there, ok, but I do think EAGAIN isn't a special case per the comment anymore, and is handled as all other errors are: by exiting the loop and retrying next time. +-DLS -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
On Tue, Mar 30, 2010 at 07:23:48PM -0600, David Stevens wrote: This patch adds support for the Mergeable Receive Buffers feature to vhost_net. Changes: 1) generalize descriptor allocation functions to allow multiple descriptors per packet 2) add socket peek to know datalen at buffer allocation time 3) change notification to enable a multi-buffer max packet, rather than the single-buffer run until completely empty Changes from previous revision: 1) incorporate review comments from Michael Tsirkin 2) assume use of TUNSETVNETHDRSZ ioctl by qemu, which simplifies vnet header processing 3) fixed notification code to only affect the receive side Signed-Off-By: David L Stevens dlstev...@us.ibm.com [in-line for review, attached for applying w/o whitespace mangling] attached patch seems to be whiespace damaged as well. Does the origin pass checkpatch.pl for you? diff -ruNp net-next-p0/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c --- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700 +++ net-next-p3/drivers/vhost/net.c 2010-03-30 12:50:57.0 -0700 @@ -54,26 +54,6 @@ struct vhost_net { enum vhost_net_poll_state tx_poll_state; }; -/* Pop first len bytes from iovec. Return number of segments used. */ -static int move_iovec_hdr(struct iovec *from, struct iovec *to, - size_t len, int iov_count) -{ - int seg = 0; - size_t size; - while (len seg iov_count) { - size = min(from-iov_len, len); - to-iov_base = from-iov_base; - to-iov_len = size; - from-iov_len -= size; - from-iov_base += size; - len -= size; - ++from; - ++to; - ++seg; - } - return seg; -} - /* Caller must have TX VQ lock */ static void tx_poll_stop(struct vhost_net *net) { @@ -97,7 +77,8 @@ static void tx_poll_start(struct vhost_n static void handle_tx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_TX]; - unsigned head, out, in, s; + unsigned out, in; + struct iovec head; struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0, @@ -108,8 +89,8 @@ static void handle_tx(struct vhost_net * }; size_t len, total_len = 0; int err, wmem; - size_t hdr_size; struct socket *sock = rcu_dereference(vq-private_data); + if (!sock) return; @@ -127,22 +108,19 @@ static void handle_tx(struct vhost_net * if (wmem sock-sk-sk_sndbuf / 2) tx_poll_stop(net); - hdr_size = vq-hdr_size; for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -NULL, NULL); + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); I this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast? /* Nothing new? Wait for eventfd to tell us they refilled. */ - if (head == vq-num) { + if (head.iov_base == (void *)vq-num) { wmem = atomic_read(sock-sk-sk_wmem_alloc); if (wmem = sock-sk-sk_sndbuf * 3 / 4) { tx_poll_start(net, sock); set_bit(SOCK_ASYNC_NOSPACE, sock-flags); break; } - if (unlikely(vhost_enable_notify(vq))) { + if (unlikely(vhost_enable_notify(vq, 0))) { vhost_disable_notify(vq); continue; } @@ -154,27 +132,30 @@ static void handle_tx(struct vhost_net * break; } /* Skip header. TODO: support TSO. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out); msg.msg_iovlen = out; - len = iov_length(vq-iov, out); + head.iov_len = len = iov_length(vq-iov, out); + /* Sanity check */ if (!len) { - vq_err(vq, Unexpected header len for TX: - %zd expected %zd\n, - iov_length(vq-hdr, s), hdr_size); + vq_err(vq, Unexpected buffer len for TX: %zd , len); break; } - /* TODO: Check specific error and bomb out unless ENOBUFS? */ err =
Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM: attached patch seems to be whiespace damaged as well. Does the origin pass checkpatch.pl for you? Yes, but I probably broke it in the transfer -- will be more careful with the next revision. + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); I this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast? It needs the heads and the lengths, which looks a lot like an iovec. I was trying to resist adding a new struct XXX { unsigned head; unsigned len; } just for this, but I could make these parallel arrays, one with head index and the other with length. + if (vq-rxmaxheadcount headcount) + vq-rxmaxheadcount = headcount; This seems the only place where we set the rxmaxheadcount value. Maybe it can be moved out of vhost.c to net.c? If vhost library needs this it can get it as function parameter. I can move it to vhost_get_heads(), sure. + /* Skip header. TODO: support TSO. */ You seem to have removed the code that skips the header. Won't this break non-GSO backends such as raw? From our prior discussion, what I had in mind was that we'd remove all special-header processing by using the ioctl you added for TUN and later, an equivalent ioctl for raw (when supported in qemu-kvm). Qemu would arrange headers with tap (and later raw) to get what the guest expects, and vhost then just passes all data as-is between the socket and the ring. That not only removes the different-header-len code for mergeable buffers, but also eliminates making a copy of the header for every packet as before. Vhost has no need to do anything with the header, or even know its length. It also doesn't have to care what the type of the backend is-- raw or tap. I think that simplifies everything, but it does mean that raw socket support requires a header ioctl for the different vnethdr sizes if the guest wants a vnethdr with and without mergeable buffers. Actually, I thought this was your idea and the point of the TUNSETVNETHDRSZ ioctl, but I guess you had something else in mind? - /* TODO: Check specific error and bomb out unless EAGAIN? */ Do you think it's not a good idea? EAGAIN is not possible after the change, because we don't even enter the loop unless we have an skb on the read queue; the other cases bomb out, so I figured the comment for future work is now done. :-) if (err 0) { - vhost_discard_vq_desc(vq); + vhost_discard_vq_desc(vq, headcount); break; } I think we should detect and discard truncated messages, since len might not be reliable if userspace pulls a packet from under us. Also, if new packet is shorter than the old one, there's no truncation but headcount is wrong. So simplest fix IMO would be to compare err with expected len. If there's a difference, we hit the race and so we would discard the packet. Sure. /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF)) { + struct virtio_net_hdr_mrg_rxbuf *vhdr = + (struct virtio_net_hdr_mrg_rxbuf *) + vq-iov[0].iov_base; + /* add num_bufs */ + if (put_user(headcount, vhdr-num_buffers)) { + vq_err(vq, Failed to write num_buffers); + vhost_discard_vq_desc(vq, headcount); Let's do memcpy_to_iovecend etc so that we do not assume layout. This is also why we need move_iovec: sendmsg might modify the iovec. It would also be nice not to corrupt memory and get a reasonable error if buffer size that we get is smaller than expected header size. I wanted to avoid the header copy here, with the reasonable restriction that the guest gives you a buffer at least big enough for the vnet_hdr. A length check alone (on iov[0].iov_len) could enforce that without copying headers back and forth to support silly cases like 8-byte buffers or num_buffers spanning multiple iovecs, and I think paying the price of the copy on every packet to support guests that broken isn't worth it. So, my preference here would be to add: if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF)) struct virtio_net_mrg_rxbuf *vhdr... if (vq-iov[0].iov_len sizeof(*vhdr)) { vq_err(vq, tiny buffers (len %d %d) are not supported, vq-iov[0].iov_len, sizeof(*vhdr)); break; }