Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, Nov 09, 2009 at 04:47:29PM +1030, Rusty Russell wrote: Actually, this looks wrong to me: + case VHOST_SET_VRING_BASE: ... + vq-avail_idx = vq-last_avail_idx = s.num; The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. I remembered another reason for caching head in avail_idx. Basically, avail index could change between when I poll for descriptors and when I want to notify guest. So we could have: - poll descriptors until empty - notify detects not empty so does not notify And the way to solve it would be to return flag from notify telling us to restart the polling loop. But, this will be more code, on data path, than what happens today where I simply keep state from descriptor polling and use that to notify. I also suspect that somehow this race in practice can not create deadlocks ... but I prefer to avoid it, these things are very tricky: if I see an empty ring, and stop processing descriptors, I want to trigger notify on empty. So if we want to avoid keeping empty state, IMO the best way would be to pass a flag to vhost_signal that tells it that ring is empty. Makes sense? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, 9 Nov 2009 05:40:32 pm Michael S. Tsirkin wrote: On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote: There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? Hmm. Are you happy with the ioctl name? If yes I think being consistent with that is important. I think in my original comments I noted that I preferred GET / SET, rather than GET/ACK. Actually, this looks wrong to me: + case VHOST_SET_VRING_BASE: ... + vq-avail_idx = vq-last_avail_idx = s.num; The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. Exactly. That's what VHOST_GET/SET_VRING_BASE does. avail_idx is just a cached value for notify on empty, so what this does is clear the value. Ah, you actually refresh it every time anyway. Hmm, could you do my poor brain a favor and either just get_user it in vhost_trigger_irq(), or call it 'cached_avail_idx' or something? The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. Yes, I remember that patch. There seems to be little point though, at this stage. Well, it avoids this ioctl, by exposing all the state. We may well need it later, to expand the ring in other ways. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. avail_idx is there for notify on empty: I had this thought that it's better to leave the avail cache line alone when we are triggering interrupt to avoid bouncing it around if guest is updating it meanwhile on another CPU, and I think my testing showed that it helped performance, but could be a mistake. You don't believe this can help? I believe it could help, but this is YA case where it would have been nice to have a dumb basic patch and this as a patch on top. But I am going to ask you to re-run that measurement, see if it stacks up (because it's an interesting lesson if it does..) Thanks! Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote: On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Hi Michael, Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing. Thanks for the review! I'll add more polishing and repost. Answers to some questions below. +/* Caller must have TX VQ lock */ +static void tx_poll_stop(struct vhost_net *net) +{ + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED)) + return; likely? Really? Hmm ... yes. tx poll stop is called on each packet (as long as we do not fill up 1/2 backend queue), the first call will stop polling the rest checks state and does nothing. This is because we normally do not care when the message has left the queue in backend device: we tell backend to send it and forget. We only start polling when backend tx queue fills up. Makes sense? + for (;;) { + head = vhost_get_vq_desc(net-dev, vq, vq-iov, out, in, +NULL, NULL); Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below: In fact, I think this is not a bug. This happens to work correctly (even with malicious guests) because vhost_get_vq_desc is hard-coded to check VHOST_NET_MAX_SG, so in fact no overflow is possible. I agree that it's mich nicer to pass iovec size to vhost_get_vq_desc. + /* Skip header. TODO: support TSO. */ + s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out); ... + + use_mm(net-dev.mm); + mutex_lock(vq-mutex); + vhost_no_notify(vq); I prefer a name like vhost_disable_notify(). Good idea. + /* OK, now we need to know about added descriptors. */ + if (head == vq-num vhost_notify(vq)) + /* They could have slipped one in as we were doing that: +* check again. */ + continue; + /* Nothing new? Wait for eventfd to tell us they refilled. */ + if (head == vq-num) + break; + /* We don't need to be notified again. */ + vhost_no_notify(vq); Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! Good idea. In particular, this code would be neater as: if (head == vq-num) { if (vhost_enable_notify(vq)) { /* Try again, they could have slipped one in. */ continue; } /* Nothing more to do. */ break; } vhost_disable_notify(vq); Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. OKay, I'll look into this. It kind of annoys me that we would do get_user for the same value twice: once in vhost_enable_notify and once in vhost_get_vq_desc. OTOH, all the loop does is call vhost_get_vq_desc again. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. I don't think it can, if we enabled notification and then see more descriptors in queue, we want to disable notification again. But it can be (It could be under an if (unlikely(enabled)), not sure if it's worth it). if (unlikely(vhost_enable_notify(vq))) { /* Try again, they have slipped one in. */ vhost_disable_notify(vq); continue; } + len = err; + err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size); That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort. Sounds good. +static int vhost_net_open(struct inode *inode, struct file *f) +{ + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL); + int r; + if (!n) + return -ENOMEM; + f-private_data = n; + n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick; + n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick; I have a personal dislike of calloc for structures. You mean zalloc? In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it. OK. +static void vhost_net_enable_vq(struct vhost_net *n, int index) +{ + struct socket *sock = n-vqs[index].private_data; OK, I can't help but this that presenting the vqs as an
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote: On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote: +/* Caller must have TX VQ lock */ +static void tx_poll_stop(struct vhost_net *net) +{ + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED)) + return; likely? Really? Hmm ... yes. tx poll stop is called on each packet (as long as we do not fill up 1/2 backend queue), the first call will stop polling the rest checks state and does nothing. This is because we normally do not care when the message has left the queue in backend device: we tell backend to send it and forget. We only start polling when backend tx queue fills up. OK, good. +static void vhost_net_set_features(struct vhost_net *n, u64 features) +{ + size_t hdr_size = features (1 VHOST_NET_F_VIRTIO_NET_HDR) ? + sizeof(struct virtio_net_hdr) : 0; + int i; + mutex_lock(n-dev.mutex); + n-dev.acked_features = features; Why is this called acked_features? Not just features? I expected to see code which exposed these back to userspace, and didn't. Not sure how do you mean. Userspace sets them, why does it want to get them exposed back? There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? set_features matches your ioctl names, but it sounds like a fn name :( It's marginal. And 'features' is shorter than both. + switch (ioctl) { + case VHOST_SET_VRING_NUM: I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, Not by much though, right? but not sure if it'd be simpler to use? The problem is with VHOST_SET_VRING_BASE as well. I want it to be separate because I want to make it possible to relocate e.g. used ring to another address while ring is running. This would be a good debugging tool (you look at kernel's used ring, check descriptor, then update guest's used ring) and also possibly an extra way to do migration. And it's nicer to have vring size separate as well, because it is initialized by host and never changed, right? Actually, this looks wrong to me: + case VHOST_SET_VRING_BASE: ... + vq-avail_idx = vq-last_avail_idx = s.num; The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. We could merge DESC, AVAIL, USED, and it will reduce the amount of code in userspace. With both base, size and fds separate, it seemed a bit more symmetrical to have desc/avail/used separate as well. What's your opinion? Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT. For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. Yes. It's not too late to split it out though: should I do it yet? Only if you're feeling enthused. It's lightly reviewed now. Cheers, Rusty. virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/virtio/virtio_ring.c | 23 +++ include/linux/virtio_ring.h | 12 +++- 2 files changed, 26 insertions(+), 9 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -71,9 +71,6 @@ struct vring_virtqueue /* Number we've added since last sync. */ unsigned int num_added; - /* Last
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote: +static void vhost_net_set_features(struct vhost_net *n, u64 features) +{ + size_t hdr_size = features (1 VHOST_NET_F_VIRTIO_NET_HDR) ? + sizeof(struct virtio_net_hdr) : 0; + int i; + mutex_lock(n-dev.mutex); + n-dev.acked_features = features; Why is this called acked_features? Not just features? I expected to see code which exposed these back to userspace, and didn't. Not sure how do you mean. Userspace sets them, why does it want to get them exposed back? There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? Hmm. Are you happy with the ioctl name? If yes I think being consistent with that is important. set_features matches your ioctl names, but it sounds like a fn name :( It's marginal. And 'features' is shorter than both. I started with this but I was always getting confused whether this includes all features or just acked features. I'll go with enabled_features. + switch (ioctl) { + case VHOST_SET_VRING_NUM: I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, Not by much though, right? but not sure if it'd be simpler to use? The problem is with VHOST_SET_VRING_BASE as well. I want it to be separate because I want to make it possible to relocate e.g. used ring to another address while ring is running. This would be a good debugging tool (you look at kernel's used ring, check descriptor, then update guest's used ring) and also possibly an extra way to do migration. And it's nicer to have vring size separate as well, because it is initialized by host and never changed, right? Actually, this looks wrong to me: + case VHOST_SET_VRING_BASE: ... + vq-avail_idx = vq-last_avail_idx = s.num; The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. Exactly. That's what VHOST_GET/SET_VRING_BASE does. avail_idx is just a cached value for notify on empty, so what this does is clear the value. What exactly do you refer to when you say this looks wrong? This could trigger an extra notification if I ever called trigger_irq without get first. As I don't, it in fact has no effect. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. Yes, I remember that patch. There seems to be little point though, at this stage. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. avail_idx is there for notify on empty: I had this thought that it's better to leave the avail cache line alone when we are triggering interrupt to avoid bouncing it around if guest is updating it meanwhile on another CPU, and I think my testing showed that it helped performance, but could be a mistake. You don't believe this can help? We could merge DESC, AVAIL, USED, and it will reduce the amount of code in userspace. With both base, size and fds separate, it seemed a bit more symmetrical to have desc/avail/used separate as well. What's your opinion? Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT. Will do. For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. Yes. It's not too late to split it out though: should I do it yet? Only if you're feeling enthused. It's lightly reviewed now. Not really :) I'll keep this in mind for the future. Thanks! Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, Nov 09, 2009 at 04:47:29PM +1030, Rusty Russell wrote: On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote: On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote: +/* Caller must have TX VQ lock */ +static void tx_poll_stop(struct vhost_net *net) +{ + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED)) + return; likely? Really? Hmm ... yes. tx poll stop is called on each packet (as long as we do not fill up 1/2 backend queue), the first call will stop polling the rest checks state and does nothing. This is because we normally do not care when the message has left the queue in backend device: we tell backend to send it and forget. We only start polling when backend tx queue fills up. OK, good. +static void vhost_net_set_features(struct vhost_net *n, u64 features) +{ + size_t hdr_size = features (1 VHOST_NET_F_VIRTIO_NET_HDR) ? + sizeof(struct virtio_net_hdr) : 0; + int i; + mutex_lock(n-dev.mutex); + n-dev.acked_features = features; Why is this called acked_features? Not just features? I expected to see code which exposed these back to userspace, and didn't. Not sure how do you mean. Userspace sets them, why does it want to get them exposed back? There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? set_features matches your ioctl names, but it sounds like a fn name :( Hmm. Are you happy with the ioctl name? If yes I think being consistent with that is important. It's marginal. And 'features' is shorter than both. I started with this but I was always getting confused whether this includes all features or just acked features. I'll go with enabled_features. + switch (ioctl) { + case VHOST_SET_VRING_NUM: I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, Not by much though, right? but not sure if it'd be simpler to use? The problem is with VHOST_SET_VRING_BASE as well. I want it to be separate because I want to make it possible to relocate e.g. used ring to another address while ring is running. This would be a good debugging tool (you look at kernel's used ring, check descriptor, then update guest's used ring) and also possibly an extra way to do migration. And it's nicer to have vring size separate as well, because it is initialized by host and never changed, right? Actually, this looks wrong to me: + case VHOST_SET_VRING_BASE: ... + vq-avail_idx = vq-last_avail_idx = s.num; The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. Exactly. That's what VHOST_GET/SET_VRING_BASE does. avail_idx is just a cached value for notify on empty, so what this does is clear the value. What exactly do you refer to when you say this looks wrong? This could trigger an extra notification if I ever called trigger_irq without get first. As I don't, it in fact has no effect. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. Yes, I remember that patch. There seems to be little point though, at this stage. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. avail_idx is there for notify on empty: I had this thought that it's better to leave the avail cache line alone when we are triggering interrupt to avoid bouncing it around if guest is updating it meanwhile on another CPU, and I think my testing showed that it helped performance, but could be a mistake. You don't believe this can help? We could merge DESC, AVAIL, USED, and it will reduce the amount of code in userspace. With both base, size and fds separate, it seemed a bit more symmetrical to have desc/avail/used separate as well. What's your opinion? Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT. OK, I'll do this. For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. Yes. It's not too late to split it out though: should I do it yet? Only if you're feeling enthused. It's lightly reviewed now. Not really :) I'll keep this in mind for the
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Hi Michael, Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing. +/* Caller must have TX VQ lock */ +static void tx_poll_stop(struct vhost_net *net) +{ + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED)) + return; likely? Really? + for (;;) { + head = vhost_get_vq_desc(net-dev, vq, vq-iov, out, in, + NULL, NULL); Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below: + /* Skip header. TODO: support TSO. */ + s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out); ... + + use_mm(net-dev.mm); + mutex_lock(vq-mutex); + vhost_no_notify(vq); I prefer a name like vhost_disable_notify(). + /* OK, now we need to know about added descriptors. */ + if (head == vq-num vhost_notify(vq)) + /* They could have slipped one in as we were doing that: + * check again. */ + continue; + /* Nothing new? Wait for eventfd to tell us they refilled. */ + if (head == vq-num) + break; + /* We don't need to be notified again. */ + vhost_no_notify(vq); Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! In particular, this code would be neater as: if (head == vq-num) { if (vhost_enable_notify(vq)) { /* Try again, they could have slipped one in. */ continue; } /* Nothing more to do. */ break; } vhost_disable_notify(vq); Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. (It could be under an if (unlikely(enabled)), not sure if it's worth it). + len = err; + err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size); That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort. +static int vhost_net_open(struct inode *inode, struct file *f) +{ + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL); + int r; + if (!n) + return -ENOMEM; + f-private_data = n; + n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick; + n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick; I have a personal dislike of calloc for structures. In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it. +static void vhost_net_enable_vq(struct vhost_net *n, int index) +{ + struct socket *sock = n-vqs[index].private_data; OK, I can't help but this that presenting the vqs as an array doesn't buy us very much. Esp. if you change vhost_dev_init to take a NULL-terminated varargs. I think readability would improve. It means passing a vq around rather than an index. Not completely sure it'll be a win tho. +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) +{ + struct socket *sock, *oldsock = NULL; ... + sock = get_socket(fd); + if (IS_ERR(sock)) { + r = PTR_ERR(sock); + goto done; + } + + /* start polling new socket */ + oldsock = vq-private_data; ... +done: + mutex_unlock(n-dev.mutex); + if (oldsock) { + vhost_net_flush_vq(n, index); + fput(oldsock-file); I dislike this style; I prefer multiple different goto points, one for when oldsock is set, and one for when it's not. That way, gcc warns us about uninitialized variables if we get it wrong. +static long vhost_net_reset_owner(struct vhost_net *n) +{ + struct socket *tx_sock = NULL; + struct socket *rx_sock = NULL; + long r; This should be called err, since that's what it is. +static void vhost_net_set_features(struct vhost_net *n, u64 features) +{ + size_t hdr_size = features (1 VHOST_NET_F_VIRTIO_NET_HDR) ? + sizeof(struct virtio_net_hdr) : 0; + int i; + mutex_lock(n-dev.mutex); + n-dev.acked_features = features; Why is this called acked_features? Not just features? I expected to see
[PATCHv8 3/3] vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration, bug work-arounds in userspace) - write logging is supported (good for migration) - support memory table and not just an offset (needed for kvm) common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part : this supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: Basically, we connect virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tap device. Backend is also configured by userspace, including vlan/mac etc. Status: This works for me, and I haven't see any crashes. Compared to userspace, people reported improved latency (as I save up to 4 system calls per packet), as well as better bandwidth and CPU utilization. Features that I plan to look at in the future: - mergeable buffers - zero copy - scalability tuning: figure out the best threading model to use Note on RCU usage (this is also documented in vhost.h, near private_pointer which is the value protected by this variant of RCU): what is happening is that the rcu_dereference() is being used in a workqueue item. The role of rcu_read_lock() is taken on by the start of execution of the workqueue item, of rcu_read_unlock() by the end of execution of the workqueue item, and of synchronize_rcu() by flush_workqueue()/flush_work(). Acked-by: Arnd Bergmann a...@arndb.de Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Paul, I'm Cc you so that you can verify that nothing significant changed in vhost/net.c since you approved it the previous time. Could you please Ack use of rcu_dereference/rcu_assign_pointer there? Eric Dumazet proposed that I open-code these, he suggested that this is an abuse of these macros and not enough like RCU. What is your opinion? MAINTAINERS|9 + arch/x86/kvm/Kconfig |1 + drivers/Makefile |1 + drivers/vhost/Kconfig | 11 + drivers/vhost/Makefile |2 + drivers/vhost/net.c| 633 + drivers/vhost/vhost.c | 970 drivers/vhost/vhost.h | 158 +++ include/linux/Kbuild |1 + include/linux/miscdevice.h |1 + include/linux/vhost.h | 126 ++ 11 files changed, 1913 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/Kconfig create mode 100644 drivers/vhost/Makefile create mode 100644 drivers/vhost/net.c create mode 100644 drivers/vhost/vhost.c create mode 100644 drivers/vhost/vhost.h create mode 100644 include/linux/vhost.h diff --git a/MAINTAINERS b/MAINTAINERS index 8824115..980a69b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5619,6 +5619,15 @@ S: Maintained F: Documentation/filesystems/vfat.txt F: fs/fat/ +VIRTIO HOST (VHOST) +M: Michael S. Tsirkin m...@redhat.com +L: kvm@vger.kernel.org +L: virtualizat...@lists.osdl.org +L: net...@vger.kernel.org +S: Maintained +F: drivers/vhost/ +F: include/linux/vhost.h + VIA RHINE NETWORK DRIVER M: Roger Luethi r...@hellgate.ch S: Maintained diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index b84e571..94f44d9 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -64,6 +64,7 @@ config KVM_AMD # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. +source drivers/vhost/Kconfig source drivers/lguest/Kconfig source drivers/virtio/Kconfig diff --git a/drivers/Makefile b/drivers/Makefile index 6ee53c7..81e3659 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -106,6 +106,7 @@ obj-$(CONFIG_HID) += hid/ obj-$(CONFIG_PPC_PS3) += ps3/ obj-$(CONFIG_OF) += of/ obj-$(CONFIG_SSB) += ssb/ +obj-$(CONFIG_VHOST_NET)+= vhost/ obj-$(CONFIG_VIRTIO) += virtio/ obj-$(CONFIG_VLYNQ)+= vlynq/ obj-$(CONFIG_STAGING) += staging/ diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig new file mode 100644 index 000..9f409f4 --- /dev/null +++ b/drivers/vhost/Kconfig @@ -0,0 +1,11 @@ +config VHOST_NET + tristate Host kernel accelerator for virtio net (EXPERIMENTAL) + depends on NET EVENTFD EXPERIMENTAL + ---help--- + This kernel module