Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-09 Thread Michael S. Tsirkin
On Mon, Nov 09, 2009 at 04:47:29PM +1030, Rusty Russell wrote:
 Actually, this looks wrong to me:
 
 + case VHOST_SET_VRING_BASE:
 ...
 + vq-avail_idx = vq-last_avail_idx = s.num;
 
 The last_avail_idx is part of the state of the driver.  It needs to be saved
 and restored over susp/resume.  The only reason it's not in the ring itself
 is because I figured the other side doesn't need to see it (which is true, but
 missed debugging opportunities as well as man-in-the-middle issues like this
 one).  I had a patch which put this field at the end of the ring, I might
 resurrect it to avoid this problem.  This is backwards compatible with all
 implementations.  See patch at end.
 
 I would drop avail_idx altogether: get_user is basically free, and simplifies
 a lot.  As most state is in the ring, all you need is an ioctl to save/restore
 the last_avail_idx.

I remembered another reason for caching head in avail_idx.  Basically,
avail index could change between when I poll for descriptors and when I
want to notify guest.

So we could have:
- poll descriptors until empty
- notify
detects not empty so does not notify

And the way to solve it would be to return flag from
notify telling us to restart the polling loop.

But, this will be more code, on data path, than
what happens today where I simply keep state
from descriptor polling and use that to notify.

I also suspect that somehow this race in practice can not create
deadlocks ... but I prefer to avoid it, these things are very tricky: if
I see an empty ring, and stop processing descriptors, I want to trigger
notify on empty.

So if we want to avoid keeping empty state, IMO the best way would be
to pass a flag to vhost_signal that tells it that ring is empty.
Makes sense?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-09 Thread Rusty Russell
On Mon, 9 Nov 2009 05:40:32 pm Michael S. Tsirkin wrote:
 On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote:
  There's something about the 'acked' which rubs me the wrong way.
  enabled_features is perhaps a better term than acked_features; acked
  seems more a user point-of-view, enabled seems more driver POV?
 
 Hmm. Are you happy with the ioctl name? If yes I think being consistent
 with that is important.

I think in my original comments I noted that I preferred GET / SET, rather
than GET/ACK.

  Actually, this looks wrong to me:
 
  +   case VHOST_SET_VRING_BASE:
  ...
  +   vq-avail_idx = vq-last_avail_idx = s.num;
 
  The last_avail_idx is part of the state of the driver.  It needs to be saved
  and restored over susp/resume.
 
 
 Exactly. That's what VHOST_GET/SET_VRING_BASE does.  avail_idx is just a
 cached value for notify on empty, so what this does is clear the value.

Ah, you actually refresh it every time anyway.  Hmm, could you do my poor
brain a favor and either just get_user it in vhost_trigger_irq(), or call
it 'cached_avail_idx' or something?

   The only reason it's not in the ring itself
  is because I figured the other side doesn't need to see it (which is true, 
  but
  missed debugging opportunities as well as man-in-the-middle issues like this
  one).  I had a patch which put this field at the end of the ring, I might
  resurrect it to avoid this problem.  This is backwards compatible with all
  implementations.  See patch at end.
 
 Yes, I remember that patch. There seems to be little point though, at
 this stage.

Well, it avoids this ioctl, by exposing all the state.  We may well need it
later, to expand the ring in other ways.

  I would drop avail_idx altogether: get_user is basically free, and 
  simplifies
  a lot.  As most state is in the ring, all you need is an ioctl to 
  save/restore
  the last_avail_idx.
 
 avail_idx is there for notify on empty: I had this thought that it's
 better to leave the avail cache line alone when we are triggering
 interrupt to avoid bouncing it around if guest is updating it meanwhile
 on another CPU, and I think my testing showed that it helped
 performance, but could be a mistake.  You don't believe this can help?

I believe it could help, but this is YA case where it would have been nice to
have a dumb basic patch and this as a patch on top.  But I am going to ask
you to re-run that measurement, see if it stacks up (because it's an
interesting lesson if it does..)

Thanks!
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
 On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
  What it is: vhost net is a character device that can be used to reduce
  the number of system calls involved in virtio networking.
 
 Hi Michael,
 
Now everyone else has finally kicked all the tires and it seems to pass,
 I've done a fairly complete review.  Generally, it's really nice; just one
 bug and a few minor suggestions for polishing.

Thanks for the review! I'll add more polishing and repost.
Answers to some questions below.

  +/* Caller must have TX VQ lock */
  +static void tx_poll_stop(struct vhost_net *net)
  +{
  +   if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
  +   return;
 
 likely?  Really?

Hmm ... yes. tx poll stop is called on each packet (as long as we do not
fill up 1/2 backend queue), the first call will stop polling
the rest checks state and does nothing.

This is because we normally do not care when the message has left the
queue in backend device: we tell backend to send it and forget. We only
start polling when backend tx queue fills up.

Makes sense?

  +   for (;;) {
  +   head = vhost_get_vq_desc(net-dev, vq, vq-iov, out, in,
  +NULL, NULL);
 
 Danger!  You need an arg to vhost_get_vq_desc to tell it the max desc size
 you can handle.  Otherwise, it's only limited by ring size, and a malicious
 guest can overflow you here, and below:

In fact, I think this is not a bug.  This happens to work correctly
(even with malicious guests) because vhost_get_vq_desc is hard-coded to
check VHOST_NET_MAX_SG, so in fact no overflow is possible.  I agree
that it's mich nicer to pass iovec size to vhost_get_vq_desc.

 
  +   /* Skip header. TODO: support TSO. */
  +   s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
 ...
  +
  +   use_mm(net-dev.mm);
  +   mutex_lock(vq-mutex);
  +   vhost_no_notify(vq);
 
 I prefer a name like vhost_disable_notify().

Good idea.

  +   /* OK, now we need to know about added descriptors. */
  +   if (head == vq-num  vhost_notify(vq))
  +   /* They could have slipped one in as we were doing that:
  +* check again. */
  +   continue;
  +   /* Nothing new?  Wait for eventfd to tell us they refilled. */
  +   if (head == vq-num)
  +   break;
  +   /* We don't need to be notified again. */
  +   vhost_no_notify(vq);
 
 Similarly, vhost_enable_notify.  This one is particularly misleading since
 it doesn't actually notify anything!

Good idea.


 
 In particular, this code would be neater as:
 
   if (head == vq-num) {
   if (vhost_enable_notify(vq)) {
   /* Try again, they could have slipped one in. */
   continue;
   }
   /* Nothing more to do. */
   break;
   }
   vhost_disable_notify(vq);
 
 Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
 return true only when there's more pending.  Saves a loop around here most
 of the time.

OKay, I'll look into this. It kind of annoys me that we would do
get_user for the same value twice: once in vhost_enable_notify and once
in vhost_get_vq_desc.  OTOH, all the loop does is call vhost_get_vq_desc
again.

  Also, the vhost_no_notify/vhost_disable_notify() can be moved
 out of the loop entirely.

I don't think it can, if we enabled notification and then see more
descriptors in queue, we want to disable notification again. But it can
be
  (It could be under an if (unlikely(enabled)), not
 sure if it's worth it).

if (unlikely(vhost_enable_notify(vq))) {
/* Try again, they have slipped one in. */
vhost_disable_notify(vq);
continue;
}

 
  +   len = err;
  +   err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);
 
 That unsigned char * arg to memcpy_toiovec is annoying.  A patch might be
 nice, separate from this effort.

Sounds good.

  +static int vhost_net_open(struct inode *inode, struct file *f)
  +{
  +   struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
  +   int r;
  +   if (!n)
  +   return -ENOMEM;
  +   f-private_data = n;
  +   n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
  +   n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
 
 I have a personal dislike of calloc for structures.

You mean zalloc?

 In userspace, it's because valgrind can't spot uninitialized fields.
 These days a similar argument applies in the kernel, because we have
 KMEMCHECK now.  If someone adds a field to the struct and forgets to
 initialize it, we can spot it.

OK.

  +static void vhost_net_enable_vq(struct vhost_net *n, int index)
  +{
  +   struct socket *sock = n-vqs[index].private_data;
 
 OK, I can't help but this that presenting the vqs as an 

Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Rusty Russell
On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote:
 On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
   +/* Caller must have TX VQ lock */
   +static void tx_poll_stop(struct vhost_net *net)
   +{
   + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
   + return;
  
  likely?  Really?
 
 Hmm ... yes. tx poll stop is called on each packet (as long as we do not
 fill up 1/2 backend queue), the first call will stop polling
 the rest checks state and does nothing.
 
 This is because we normally do not care when the message has left the
 queue in backend device: we tell backend to send it and forget. We only
 start polling when backend tx queue fills up.

OK, good.

   +static void vhost_net_set_features(struct vhost_net *n, u64 features)
   +{
   + size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
   + sizeof(struct virtio_net_hdr) : 0;
   + int i;
   + mutex_lock(n-dev.mutex);
   + n-dev.acked_features = features;
  
  Why is this called acked_features?  Not just features?  I expected
  to see code which exposed these back to userspace, and didn't.
 
 Not sure how do you mean. Userspace sets them, why
 does it want to get them exposed back?

There's something about the 'acked' which rubs me the wrong way.
enabled_features is perhaps a better term than acked_features; acked
seems more a user point-of-view, enabled seems more driver POV?

set_features matches your ioctl names, but it sounds like a fn name :(

It's marginal.  And 'features' is shorter than both.

   + switch (ioctl) {
   + case VHOST_SET_VRING_NUM:
  
  I haven't looked at your userspace implementation, but does a generic
  VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
  sense?  It'd be simpler here,
 
 Not by much though, right?
 
  but not sure if it'd be simpler to use?
 
 The problem is with VHOST_SET_VRING_BASE as well. I want it to be
 separate because I want to make it possible to relocate e.g. used ring
 to another address while ring is running. This would be a good debugging
 tool (you look at kernel's used ring, check descriptor, then update
 guest's used ring) and also possibly an extra way to do migration.  And
 it's nicer to have vring size separate as well, because it is
 initialized by host and never changed, right?

Actually, this looks wrong to me:

+   case VHOST_SET_VRING_BASE:
...
+   vq-avail_idx = vq-last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver.  It needs to be saved
and restored over susp/resume.  The only reason it's not in the ring itself
is because I figured the other side doesn't need to see it (which is true, but
missed debugging opportunities as well as man-in-the-middle issues like this
one).  I had a patch which put this field at the end of the ring, I might
resurrect it to avoid this problem.  This is backwards compatible with all
implementations.  See patch at end.

I would drop avail_idx altogether: get_user is basically free, and simplifies
a lot.  As most state is in the ring, all you need is an ioctl to save/restore
the last_avail_idx.

 We could merge DESC, AVAIL, USED, and it will reduce the amount of code
 in userspace. With both base, size and fds separate, it seemed a bit
 more symmetrical to have desc/avail/used separate as well.
 What's your opinion?

Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.

  For future reference, this is *exactly* the kind of thing which would have
  been nice as a followup patch.  Easy to separate, easy to review, not 
  critical
  to the core.
 
 Yes. It's not too late to split it out though: should I do it yet?

Only if you're feeling enthused.  It's lightly reviewed now.

Cheers,
Rusty.

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where
you're up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're
not the consumer because the kernel is using it directly.

Fortunately, we have room to expand: the ring is always a whole number
of pages and there's hundreds of bytes of padding after the avail ring
and the used ring, whatever the number of descriptors (which must be a
power of 2).

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++
 include/linux/virtio_ring.h  |   12 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
/* Number we've added since last sync. */
unsigned int num_added;
 
-   /* Last 

Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote:
   +static void vhost_net_set_features(struct vhost_net *n, u64 features)
   +{
   + size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
   +         sizeof(struct virtio_net_hdr) : 0;
   + int i;
   + mutex_lock(n-dev.mutex);
   + n-dev.acked_features = features;
 
  Why is this called acked_features?  Not just features?  I expected
  to see code which exposed these back to userspace, and didn't.

 Not sure how do you mean. Userspace sets them, why
 does it want to get them exposed back?

 There's something about the 'acked' which rubs me the wrong way.
 enabled_features is perhaps a better term than acked_features; acked
 seems more a user point-of-view, enabled seems more driver POV?


Hmm. Are you happy with the ioctl name? If yes I think being consistent
with that is important.


 set_features matches your ioctl names, but it sounds like a fn name :(

 It's marginal.  And 'features' is shorter than both.

I started with this but I was always getting confused whether this
includes all features or just acked features.  I'll go with
enabled_features.



   + switch (ioctl) {
   + case VHOST_SET_VRING_NUM:
 
  I haven't looked at your userspace implementation, but does a generic
  VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
  sense?  It'd be simpler here,

 Not by much though, right?

  but not sure if it'd be simpler to use?

 The problem is with VHOST_SET_VRING_BASE as well. I want it to be
 separate because I want to make it possible to relocate e.g. used ring
 to another address while ring is running. This would be a good debugging
 tool (you look at kernel's used ring, check descriptor, then update
 guest's used ring) and also possibly an extra way to do migration.  And
 it's nicer to have vring size separate as well, because it is
 initialized by host and never changed, right?

 Actually, this looks wrong to me:

 +       case VHOST_SET_VRING_BASE:
 ...
 +               vq-avail_idx = vq-last_avail_idx = s.num;

 The last_avail_idx is part of the state of the driver.  It needs to be saved
 and restored over susp/resume.


Exactly. That's what VHOST_GET/SET_VRING_BASE does.  avail_idx is just a
cached value for notify on empty, so what this does is clear the value.
What exactly do you refer to when you say this looks wrong?
This could trigger an extra notification if I ever called
trigger_irq without get first. As I don't, it in fact has no effect.

  The only reason it's not in the ring itself
 is because I figured the other side doesn't need to see it (which is true, but
 missed debugging opportunities as well as man-in-the-middle issues like this
 one).  I had a patch which put this field at the end of the ring, I might
 resurrect it to avoid this problem.  This is backwards compatible with all
 implementations.  See patch at end.

Yes, I remember that patch. There seems to be little point though, at
this stage.



 I would drop avail_idx altogether: get_user is basically free, and simplifies
 a lot.  As most state is in the ring, all you need is an ioctl to save/restore
 the last_avail_idx.


avail_idx is there for notify on empty: I had this thought that it's
better to leave the avail cache line alone when we are triggering
interrupt to avoid bouncing it around if guest is updating it meanwhile
on another CPU, and I think my testing showed that it helped
performance, but could be a mistake.  You don't believe this can help?




 We could merge DESC, AVAIL, USED, and it will reduce the amount of code
 in userspace. With both base, size and fds separate, it seemed a bit
 more symmetrical to have desc/avail/used separate as well.
 What's your opinion?

 Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.


Will do.


  For future reference, this is *exactly* the kind of thing which would have
  been nice as a followup patch.  Easy to separate, easy to review, not 
  critical
  to the core.

 Yes. It's not too late to split it out though: should I do it yet?

 Only if you're feeling enthused.  It's lightly reviewed now.


Not really :) I'll keep this in mind for the future.
Thanks!




 Cheers,
 Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Mon, Nov 09, 2009 at 04:47:29PM +1030, Rusty Russell wrote:
 On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote:
  On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+   if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
+   return;
   
   likely?  Really?
  
  Hmm ... yes. tx poll stop is called on each packet (as long as we do not
  fill up 1/2 backend queue), the first call will stop polling
  the rest checks state and does nothing.
  
  This is because we normally do not care when the message has left the
  queue in backend device: we tell backend to send it and forget. We only
  start polling when backend tx queue fills up.
 
 OK, good.
 
+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+   size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
+   sizeof(struct virtio_net_hdr) : 0;
+   int i;
+   mutex_lock(n-dev.mutex);
+   n-dev.acked_features = features;
   
   Why is this called acked_features?  Not just features?  I expected
   to see code which exposed these back to userspace, and didn't.
  
  Not sure how do you mean. Userspace sets them, why
  does it want to get them exposed back?
 
 There's something about the 'acked' which rubs me the wrong way.
 enabled_features is perhaps a better term than acked_features; acked
 seems more a user point-of-view, enabled seems more driver POV?
 
 set_features matches your ioctl names, but it sounds like a fn name :(

Hmm. Are you happy with the ioctl name? If yes I think being consistent
with that is important.

 It's marginal.  And 'features' is shorter than both.

I started with this but I was always getting confused whether this
includes all features or just acked features.  I'll go with
enabled_features.

+   switch (ioctl) {
+   case VHOST_SET_VRING_NUM:
   
   I haven't looked at your userspace implementation, but does a generic
   VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
   sense?  It'd be simpler here,
  
  Not by much though, right?
  
   but not sure if it'd be simpler to use?
  
  The problem is with VHOST_SET_VRING_BASE as well. I want it to be
  separate because I want to make it possible to relocate e.g. used ring
  to another address while ring is running. This would be a good debugging
  tool (you look at kernel's used ring, check descriptor, then update
  guest's used ring) and also possibly an extra way to do migration.  And
  it's nicer to have vring size separate as well, because it is
  initialized by host and never changed, right?
 
 Actually, this looks wrong to me:
 
 + case VHOST_SET_VRING_BASE:
 ...
 + vq-avail_idx = vq-last_avail_idx = s.num;
 
 The last_avail_idx is part of the state of the driver.  It needs to be saved
 and restored over susp/resume.

Exactly. That's what VHOST_GET/SET_VRING_BASE does.  avail_idx is just a
cached value for notify on empty, so what this does is clear the value.
What exactly do you refer to when you say this looks wrong?
This could trigger an extra notification if I ever called
trigger_irq without get first. As I don't, it in fact has no effect.

  The only reason it's not in the ring itself
 is because I figured the other side doesn't need to see it (which is true, but
 missed debugging opportunities as well as man-in-the-middle issues like this
 one).  I had a patch which put this field at the end of the ring, I might
 resurrect it to avoid this problem.  This is backwards compatible with all
 implementations.  See patch at end.

Yes, I remember that patch. There seems to be little point though, at
this stage.

 
 I would drop avail_idx altogether: get_user is basically free, and
 simplifies a lot.  As most state is in the ring, all you need is an
 ioctl to save/restore the last_avail_idx.

avail_idx is there for notify on empty: I had this thought that it's
better to leave the avail cache line alone when we are triggering
interrupt to avoid bouncing it around if guest is updating it meanwhile
on another CPU, and I think my testing showed that it helped
performance, but could be a mistake.  You don't believe this can help?

  We could merge DESC, AVAIL, USED, and it will reduce the amount of code
  in userspace. With both base, size and fds separate, it seemed a bit
  more symmetrical to have desc/avail/used separate as well.
  What's your opinion?
 
 Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.

OK, I'll do this.

   For future reference, this is *exactly* the kind of thing which would have
   been nice as a followup patch.  Easy to separate, easy to review, not 
   critical
   to the core.
  
  Yes. It's not too late to split it out though: should I do it yet?
 
 Only if you're feeling enthused.  It's lightly reviewed now.

Not really :) I'll keep this in mind for the 

Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-05 Thread Rusty Russell
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.

Hi Michael,

   Now everyone else has finally kicked all the tires and it seems to pass,
I've done a fairly complete review.  Generally, it's really nice; just one
bug and a few minor suggestions for polishing.

 +/* Caller must have TX VQ lock */
 +static void tx_poll_stop(struct vhost_net *net)
 +{
 + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
 + return;

likely?  Really?

 + for (;;) {
 + head = vhost_get_vq_desc(net-dev, vq, vq-iov, out, in,
 +  NULL, NULL);

Danger!  You need an arg to vhost_get_vq_desc to tell it the max desc size
you can handle.  Otherwise, it's only limited by ring size, and a malicious
guest can overflow you here, and below:

 + /* Skip header. TODO: support TSO. */
 + s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
...
 +
 + use_mm(net-dev.mm);
 + mutex_lock(vq-mutex);
 + vhost_no_notify(vq);

I prefer a name like vhost_disable_notify().

 + /* OK, now we need to know about added descriptors. */
 + if (head == vq-num  vhost_notify(vq))
 + /* They could have slipped one in as we were doing that:
 +  * check again. */
 + continue;
 + /* Nothing new?  Wait for eventfd to tell us they refilled. */
 + if (head == vq-num)
 + break;
 + /* We don't need to be notified again. */
 + vhost_no_notify(vq);

Similarly, vhost_enable_notify.  This one is particularly misleading since
it doesn't actually notify anything!

In particular, this code would be neater as:

if (head == vq-num) {
if (vhost_enable_notify(vq)) {
/* Try again, they could have slipped one in. */
continue;
}
/* Nothing more to do. */
break;
}
vhost_disable_notify(vq);

Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
return true only when there's more pending.  Saves a loop around here most
of the time.  Also, the vhost_no_notify/vhost_disable_notify() can be moved
out of the loop entirely.  (It could be under an if (unlikely(enabled)), not
sure if it's worth it).

 + len = err;
 + err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);

That unsigned char * arg to memcpy_toiovec is annoying.  A patch might be
nice, separate from this effort.

 +static int vhost_net_open(struct inode *inode, struct file *f)
 +{
 + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
 + int r;
 + if (!n)
 + return -ENOMEM;
 + f-private_data = n;
 + n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 + n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;

I have a personal dislike of calloc for structures.  In userspace, it's
because valgrind can't spot uninitialized fields.  These days a similar
argument applies in the kernel, because we have KMEMCHECK now.  If someone
adds a field to the struct and forgets to initialize it, we can spot it.

 +static void vhost_net_enable_vq(struct vhost_net *n, int index)
 +{
 + struct socket *sock = n-vqs[index].private_data;

OK, I can't help but this that presenting the vqs as an array doesn't buy
us very much.  Esp. if you change vhost_dev_init to take a NULL-terminated
varargs.  I think readability would improve.  It means passing a vq around
rather than an index.

Not completely sure it'll be a win tho.

 +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int 
 fd)
 +{
 + struct socket *sock, *oldsock = NULL;
...
 + sock = get_socket(fd);
 + if (IS_ERR(sock)) {
 + r = PTR_ERR(sock);
 + goto done;
 + }
 +
 + /* start polling new socket */
 + oldsock = vq-private_data;
...
 +done:
 + mutex_unlock(n-dev.mutex);
 + if (oldsock) {
 + vhost_net_flush_vq(n, index);
 + fput(oldsock-file);

I dislike this style; I prefer multiple different goto points, one for when
oldsock is set, and one for when it's not.

That way, gcc warns us about uninitialized variables if we get it wrong.

 +static long vhost_net_reset_owner(struct vhost_net *n)
 +{
 + struct socket *tx_sock = NULL;
 + struct socket *rx_sock = NULL;
 + long r;

This should be called err, since that's what it is.

 +static void vhost_net_set_features(struct vhost_net *n, u64 features)
 +{
 + size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
 + sizeof(struct virtio_net_hdr) : 0;
 + int i;
 + mutex_lock(n-dev.mutex);
 + n-dev.acked_features = features;

Why is this called acked_features?  Not just features?  I expected
to see 

[PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used
in a workqueue item.  The role of rcu_read_lock() is taken on by the
start of execution of the workqueue item, of rcu_read_unlock() by the
end of execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work().

Acked-by: Arnd Bergmann a...@arndb.de
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Paul, I'm Cc you so that you can verify that nothing significant changed
in vhost/net.c since you approved it the previous time.  Could you
please Ack use of rcu_dereference/rcu_assign_pointer there?  Eric
Dumazet proposed that I open-code these, he suggested that this is an
abuse of these macros and not enough like RCU. What is your opinion?

 MAINTAINERS|9 +
 arch/x86/kvm/Kconfig   |1 +
 drivers/Makefile   |1 +
 drivers/vhost/Kconfig  |   11 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/net.c|  633 +
 drivers/vhost/vhost.c  |  970 
 drivers/vhost/vhost.h  |  158 +++
 include/linux/Kbuild   |1 +
 include/linux/miscdevice.h |1 +
 include/linux/vhost.h  |  126 ++
 11 files changed, 1913 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 8824115..980a69b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5619,6 +5619,15 @@ S:   Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
 
+VIRTIO HOST (VHOST)
+M: Michael S. Tsirkin m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index 6ee53c7..81e3659 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_HID)   += hid/
 obj-$(CONFIG_PPC_PS3)  += ps3/
 obj-$(CONFIG_OF)   += of/
 obj-$(CONFIG_SSB)  += ssb/
+obj-$(CONFIG_VHOST_NET)+= vhost/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_VLYNQ)+= vlynq/
 obj-$(CONFIG_STAGING)  += staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..9f409f4
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+   tristate Host kernel accelerator for virtio net (EXPERIMENTAL)
+   depends on NET  EVENTFD  EXPERIMENTAL
+   ---help---
+ This kernel module