Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-02-14 Thread Sjur Brændeland
Hi Rusty,

On Wed, Feb 13, 2013 at 11:25 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Sjur Brændeland sjurb...@gmail.com writes:
 On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote:
 Sjur Brændeland sjurb...@gmail.com writes:
 ...
 I guess my vringh related stuff should go into your tree together with your
 vringh patches...
 Would you be willing to take this via your tree, provided that I get acks
 from the right people?

 Yes please.  Now I have set up a test rig for vhost (unfortunately not
 with 10Ge) I can make progress on vhost adaption.


 I have one question, are you planning to include vringh in your
 virtio-next branch
 for the 3.9 merge window or some time later?

 If you're eager with the CAIF stuff, I'll merge it for this merge window
 (we may still rework stuff, but I don't want to block you).

 Otherwise, it'll probably wait.

Yes, I would prefer to get CAIF-virtio merged now unless it creates a
a lot of trouble for you. But I'm not sure I'll manage to get ack from Ohad
in time though. He said he would be busy for the next week. I need
an ack from David Miller as well, I'll resend the patch to him if you
decide to go ahead and merge this.

Thanks,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-02-13 Thread Rusty Russell
Sjur Brændeland sjurb...@gmail.com writes:
 Hi Rusty,

 On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote:
 Sjur Brændeland sjurb...@gmail.com writes:
 ...
 I guess my vringh related stuff should go into your tree together with your
 vringh patches...
 Would you be willing to take this via your tree, provided that I get acks
 from the right people?

 Yes please.  Now I have set up a test rig for vhost (unfortunately not
 with 10Ge) I can make progress on vhost adaption.


 I have one question, are you planning to include vringh in your
 virtio-next branch
 for the 3.9 merge window or some time later?

If you're eager with the CAIF stuff, I'll merge it for this merge window
(we may still rework stuff, but I don't want to block you).

Otherwise, it'll probably wait.

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-02-12 Thread Sjur Brændeland
Hi Rusty,

On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote:
 Sjur Brændeland sjurb...@gmail.com writes:
...
 I guess my vringh related stuff should go into your tree together with your
 vringh patches...
 Would you be willing to take this via your tree, provided that I get acks
 from the right people?

 Yes please.  Now I have set up a test rig for vhost (unfortunately not
 with 10Ge) I can make progress on vhost adaption.


I have one question, are you planning to include vringh in your
virtio-next branch
for the 3.9 merge window or some time later?

Regards,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-02-04 Thread Sjur Brændeland
Hi Rusty,

On Thu, Jan 17, 2013 at 11:29 AM, Rusty Russell ru...@rustcorp.com.auwrote:

 Getting use of virtio rings correct is tricky, and a recent patch saw
 an implementation of in-kernel rings (as separate from userspace).

 This patch attempts to abstract the business of dealing with the
 virtio ring layout from the access (userspace or direct); to do this,
 we use function pointers, which gcc inlines correctly.


I have been using your patches for a while in my test setup without any
issues with vringh. The only thing I'm missing is export of symbols.
My current caif_virtio driver expects vringh to be a module exporting
symbols. Is this something you are planning to add?

I guess my vringh related stuff should go into your tree together with your
vringh patches...
Would you be willing to take this via your tree, provided that I get acks
from the right people?

Regards,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-22 Thread Asias He
On 01/17/2013 06:29 PM, Rusty Russell wrote:
 Getting use of virtio rings correct is tricky, and a recent patch saw
 an implementation of in-kernel rings (as separate from userspace).
 
 This patch attempts to abstract the business of dealing with the
 virtio ring layout from the access (userspace or direct); to do this,
 we use function pointers, which gcc inlines correctly.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/Makefile |2 +-
  drivers/vhost/Kconfig|8 +
  drivers/vhost/Kconfig.tcm|1 +
  drivers/vhost/Makefile   |2 +
  drivers/vhost/vringh.c   |  818 
 ++
  drivers/virtio/virtio_ring.c |   33 +-
  include/linux/virtio_ring.h  |   57 +++


Why vringh_notify_enable_user() and vringh_notify_disable_user() are not
declared in include/linux/virtio_ring.h? Missed that?


  include/linux/vringh.h   |  115 ++
  8 files changed, 1008 insertions(+), 28 deletions(-)
  create mode 100644 drivers/vhost/vringh.c
  create mode 100644 include/linux/vringh.h
 
 diff --git a/drivers/Makefile b/drivers/Makefile
 index 7863b9f..351a34f 100644
 --- a/drivers/Makefile
 +++ b/drivers/Makefile
 @@ -123,7 +123,7 @@ obj-$(CONFIG_PPC_PS3) += ps3/
  obj-$(CONFIG_OF) += of/
  obj-$(CONFIG_SSB)+= ssb/
  obj-$(CONFIG_BCMA)   += bcma/
 -obj-$(CONFIG_VHOST_NET)  += vhost/
 +obj-$(CONFIG_VHOST_RING) += vhost/
  obj-$(CONFIG_VLYNQ)  += vlynq/
  obj-$(CONFIG_STAGING)+= staging/
  obj-y+= platform/
 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 index 202bba6..613b074 100644
 --- a/drivers/vhost/Kconfig
 +++ b/drivers/vhost/Kconfig
 @@ -1,6 +1,7 @@
  config VHOST_NET
   tristate Host kernel accelerator for virtio net (EXPERIMENTAL)
   depends on NET  EVENTFD  (TUN || !TUN)  (MACVTAP || !MACVTAP)  
 EXPERIMENTAL
 + select VHOST_RING
   ---help---
 This kernel module can be loaded in host kernel to accelerate
 guest networking with virtio_net. Not to be confused with virtio_net
 @@ -12,3 +13,10 @@ config VHOST_NET
  if STAGING
  source drivers/vhost/Kconfig.tcm
  endif
 +
 +config VHOST_RING
 + tristate
 + ---help---
 +   This option is selected by any driver which needs to access
 +   the host side of a virtio ring.
 +
 diff --git a/drivers/vhost/Kconfig.tcm b/drivers/vhost/Kconfig.tcm
 index a9c6f76..0218f77 100644
 --- a/drivers/vhost/Kconfig.tcm
 +++ b/drivers/vhost/Kconfig.tcm
 @@ -1,6 +1,7 @@
  config TCM_VHOST
   tristate TCM_VHOST fabric module (EXPERIMENTAL)
   depends on TARGET_CORE  EVENTFD  EXPERIMENTAL  m
 + select VHOST_RING
   default n
   ---help---
   Say M here to enable the TCM_VHOST fabric module for use with 
 virtio-scsi guests
 diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
 index a27b053..1d37f5e 100644
 --- a/drivers/vhost/Makefile
 +++ b/drivers/vhost/Makefile
 @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
  vhost_net-y := vhost.o net.o
  
  obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
 +
 +obj-$(CONFIG_VHOST_RING) += vringh.o
 diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
 new file mode 100644
 index 000..b28670f
 --- /dev/null
 +++ b/drivers/vhost/vringh.c
 @@ -0,0 +1,818 @@
 +/*
 + * Helpers for the host side of a virtio ring.
 + *
 + * Since these may be in userspace, we use (inline) accessors.
 + */
 +#include linux/vringh.h
 +#include linux/virtio_ring.h
 +#include linux/kernel.h
 +#include linux/ratelimit.h
 +#include linux/uaccess.h
 +#include linux/slab.h
 +
 +static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 +{
 + static DEFINE_RATELIMIT_STATE(vringh_rs,
 +   DEFAULT_RATELIMIT_INTERVAL,
 +   DEFAULT_RATELIMIT_BURST);
 + if (__ratelimit(vringh_rs)) {
 + va_list ap;
 + va_start(ap, fmt);
 + printk(KERN_NOTICE vringh:);
 + vprintk(fmt, ap);
 + va_end(ap);
 + }
 +}
 +
 +/* Returns vring-num if empty, -ve on error. */
 +static inline int __vringh_get_head(const struct vringh *vrh,
 + int (*getu16)(u16 *val, const u16 *p),
 + u16 *last_avail_idx)
 +{
 + u16 avail_idx, i, head;
 + int err;
 +
 + err = getu16(avail_idx, vrh-vring.avail-idx);
 + if (err) {
 + vringh_bad(Failed to access avail idx at %p,
 +vrh-vring.avail-idx);
 + return err;
 + }
 +
 + if (*last_avail_idx == avail_idx)
 + return vrh-vring.num;
 +
 + /* Only get avail ring entries after they have been exposed by guest. */
 + virtio_rmb(vrh-weak_barriers);
 +
 + i = *last_avail_idx  (vrh-vring.num - 1);
 +
 + err = getu16(head, vrh-vring.avail-ring[i]);
 + if (err) {
 +   

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-22 Thread Sjur Brændeland
On Mon, Jan 21, 2013 at 3:36 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Sjur Brændeland sjur.brandel...@stericsson.com writes:
 On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote:
 in otgher words, we might need to split a single desc to multiple
 iov entries.

 Splitting descriptors doesn't work for me. As described earlier, I
 receive one IP-frame for each descriptor. So I need to keep
 the original boundaries.

 Yes, but the _kern helpers don't do range checking or offsetting at the
 moment, so they'll be preserved.

 If you want to do range checking, I can add that...

I don't think I need that. Initially I can just check if the received address
is readable.

Regards,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-22 Thread Rusty Russell
Asias He as...@redhat.com writes:

 On 01/17/2013 06:29 PM, Rusty Russell wrote:
 Getting use of virtio rings correct is tricky, and a recent patch saw
 an implementation of in-kernel rings (as separate from userspace).
 
 This patch attempts to abstract the business of dealing with the
 virtio ring layout from the access (userspace or direct); to do this,
 we use function pointers, which gcc inlines correctly.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/Makefile |2 +-
  drivers/vhost/Kconfig|8 +
  drivers/vhost/Kconfig.tcm|1 +
  drivers/vhost/Makefile   |2 +
  drivers/vhost/vringh.c   |  818 
 ++
  drivers/virtio/virtio_ring.c |   33 +-
  include/linux/virtio_ring.h  |   57 +++


 Why vringh_notify_enable_user() and vringh_notify_disable_user() are not
 declared in include/linux/virtio_ring.h? Missed that?

Yes, I did... I've fixed it in my vringh branch, where I'm doing
development from now on.

Thanks!
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Michael S. Tsirkin
On Mon, Jan 21, 2013 at 01:04:47PM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote:
  +/* Returns vring-num if empty, -ve on error. */
  +static inline int __vringh_get_head(const struct vringh *vrh,
  +  int (*getu16)(u16 *val, const u16 *p),
 
  I think int (*getu16)(const u16 *p) would be cleaner
  than returning through a pointer, then
  callers check that value  0 for error.
 
 I disagree: I dislike overloading the error code, and I like the
 symmetry with other operations (getdesc, putu16).
 
  +  /* Make sure it's OK, and get offset. */
  +  if (!check_range(desc.addr, desc.len, range, getrange)) {
  +  err = -EINVAL;
  +  goto fail;
  +  }
  +  addr = (void *)(long)desc.addr + range.offset;
 
  Should probably be (void *)(long)(desc.addr + range.offset).
  Otherwise we risk signed integer overflow.
 
 Well, it's a noop.  Either a pointer and long are 64 bit (no overflow),
 or they're not (we truncate anyway when we assign to addr).

For readability purposes, I think it's better to do all
math in 64 bit then cast as appropriate.
This also relies on -fno-strict-overflow - below you say it's best
not to rely on specific compiler flags, and I agree.

  +  iov-iov[iov-i].iov_base = (__force __user void *)addr;
  +  iov-iov[iov-i].iov_len = desc.len;
 
  The following comment from the previous version still applies:
   This looks like it won't do the right thing if desc.len spans multiple
   ranges. I don't know if this happens in practice but this is something
   vhost supports ATM.
  in otgher words, we might need to split a single desc to multiple
  iov entries.
 
 Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
 don't do that, but qemu is rarely sane.  I'll fix it.
 
  +  err = putu16(vrh-vring.used-idx, vrh-last_used_idx);
  +  if (err) {
  +  vringh_bad(Failed to update used index at %p,
  + vrh-vring.used-idx);
  +  return err;
 
 
  One thing vhost does is roll back everything on error,
  so you can for example have an invalid range
  of memory and handle writes there in userspace.
  I think it's worth preserving though this is
  currently unused.
 
 Indeed, that's a nice feature.  So is distinguishing a single bad
 descriptor (which can be dropped, for vhost net) from a corrupt ring
 (which means the device is useless).
 
  +  /* They could have slipped one in as we were doing that: make
  +   * sure it's written, then check again. */
  +  virtio_mb(vrh-weak_barriers);
  +
  +  if (getu16(avail, vrh-vring.avail-idx) != 0) {
 
  Hmm above has implicit != 0 why not here?
 
 I didn't see the one above, but it's a clear nod that it doesn't return
 a bool (yeah, it's nasty that we don't return the error in this case,
 but in practice it's a tiny corner).
 
  +static inline int getdesc_user(struct vring_desc *dst,
  + const struct vring_desc *src)
  +{
  +  return copy_from_user(dst, (__force void *)src, sizeof(*dst)) == 0 ? 0 :
  +  -EFAULT;
 
  confused about __force above. Shouldn't it cast to __user?
  I have not tried does this patch pass the checker?
 
 You're right, I haven't run sparse across it yet...
 
  +  vrh-vring.desc = (__force struct vring_desc *)desc;
  +  vrh-vring.avail = (__force struct vring_avail *)avail;
  +  vrh-vring.used = (__force struct vring_used *)used;
 
  I counted 3 separate chunks that do __force casts.
  Let's try to isolate them and comment why it's safe.
 
 Yes, I want to look at using a union of kvec and iovec internally, but
 I worry about breaking gcc's aliasing detection (the kernel compiles
 with -fno-strict-aliasing but I hate relying on this).
 
 Thanks,
 Rusty.

Right. I think tagging it __user to mean can be userspace
and then removing tag where we know it's safe is a decent compromize.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Michael S. Tsirkin
On Mon, Jan 21, 2013 at 02:24:31PM +0200, Michael S. Tsirkin wrote:
 On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote:
  Rusty Russell ru...@rustcorp.com.au writes:
   Michael S. Tsirkin m...@redhat.com writes:
   +   iov-iov[iov-i].iov_base = (__force __user void *)addr;
   +   iov-iov[iov-i].iov_len = desc.len;
  
   The following comment from the previous version still applies:
 This looks like it won't do the right thing if desc.len spans multiple
 ranges. I don't know if this happens in practice but this is something
 vhost supports ATM.
   in otgher words, we might need to split a single desc to multiple
   iov entries.
  
   Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
   don't do that, but qemu is rarely sane.  I'll fix it.
  
  Actually, you make the same assumption for vhost, with your use of
  getuser and putuser for accessing the ring.
 
 There's no requirement that ring is mapped directly into guest
 memory. If a ring is not contigious qemu can allocate
 it's own virtuall contigious rings and copy data back and forth.
 
  The complexity and cost of handling this is significant,
 
 Why is it? Just add a while loop, increment desc.addr
 and decrement desc.len.
 
  and the error
  if it ever did happen is distinctive.  Does qemu ever create such
  things?
  
  Thanks,
  Rusty.
 
 I think qemu does create ranges that are consequitive in guest pa
 space but not in qemu va ranges. I'm sick today, will try to dig out
 some examples later.

It seems it could happen if e.g. one region is ROM and another one is
RAM.  This means it won't be used for the ring (which is R/W) but could
be for the data.

 -- 
 MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Michael S. Tsirkin
On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote:
 Rusty Russell ru...@rustcorp.com.au writes:
  Michael S. Tsirkin m...@redhat.com writes:
  + iov-iov[iov-i].iov_base = (__force __user void *)addr;
  + iov-iov[iov-i].iov_len = desc.len;
 
  The following comment from the previous version still applies:
  This looks like it won't do the right thing if desc.len spans multiple
  ranges. I don't know if this happens in practice but this is something
  vhost supports ATM.
  in otgher words, we might need to split a single desc to multiple
  iov entries.
 
  Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
  don't do that, but qemu is rarely sane.  I'll fix it.
 
 Actually, you make the same assumption for vhost, with your use of
 getuser and putuser for accessing the ring.

There's no requirement that ring is mapped directly into guest
memory. If a ring is not contigious qemu can allocate
it's own virtuall contigious rings and copy data back and forth.

 The complexity and cost of handling this is significant,

Why is it? Just add a while loop, increment desc.addr
and decrement desc.len.

 and the error
 if it ever did happen is distinctive.  Does qemu ever create such
 things?
 
 Thanks,
 Rusty.

I think qemu does create ranges that are consequitive in guest pa
space but not in qemu va ranges. I'm sick today, will try to dig out
some examples later.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote:
 Rusty Russell ru...@rustcorp.com.au writes:
  Michael S. Tsirkin m...@redhat.com writes:
  +iov-iov[iov-i].iov_base = (__force __user void *)addr;
  +iov-iov[iov-i].iov_len = desc.len;
 
  The following comment from the previous version still applies:
 This looks like it won't do the right thing if desc.len spans multiple
 ranges. I don't know if this happens in practice but this is something
 vhost supports ATM.
  in otgher words, we might need to split a single desc to multiple
  iov entries.
 
  Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
  don't do that, but qemu is rarely sane.  I'll fix it.
 
 Actually, you make the same assumption for vhost, with your use of
 getuser and putuser for accessing the ring.

 There's no requirement that ring is mapped directly into guest
 memory. If a ring is not contigious qemu can allocate
 it's own virtuall contigious rings and copy data back and forth.

True, but it's the guest which allocates the ring.  If QEMU sets up a
guest with a offset-discontiguous mapping, vhost would be unreliable
today.

 The complexity and cost of handling this is significant,

 Why is it? Just add a while loop, increment desc.addr
 and decrement desc.len.

Indirect handling now needs to re-check the boundaries for every
descriptor it fetches, and handle the case where a single descriptor
overlaps the boundary.

The ability to verify the indirect descriptor table all at once makes
the code fast and neat.

From your other mail:

 It seems it could happen if e.g. one region is ROM and another one is
 RAM.  This means it won't be used for the ring (which is R/W) but could
 be for the data.

That case wouldn't be so bad, I think, since it's a simple loop.

I'll create some patches and see if it's too ugly to live...

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Rusty Russell
Rusty Russell ru...@rustcorp.com.au writes:
 Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote:
 Rusty Russell ru...@rustcorp.com.au writes:
  Michael S. Tsirkin m...@redhat.com writes:
  +   iov-iov[iov-i].iov_base = (__force __user void *)addr;
  +   iov-iov[iov-i].iov_len = desc.len;
 
  The following comment from the previous version still applies:
This looks like it won't do the right thing if desc.len spans multiple
ranges. I don't know if this happens in practice but this is something
vhost supports ATM.
  in otgher words, we might need to split a single desc to multiple
  iov entries.
 
  Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
  don't do that, but qemu is rarely sane.  I'll fix it.
 
 Actually, you make the same assumption for vhost, with your use of
 getuser and putuser for accessing the ring.

 There's no requirement that ring is mapped directly into guest
 memory. If a ring is not contigious qemu can allocate
 it's own virtuall contigious rings and copy data back and forth.

 True, but it's the guest which allocates the ring.  If QEMU sets up a
 guest with a offset-discontiguous mapping, vhost would be unreliable
 today.

My mistake: the ring addresses handed through the ioctl already
translated, so you can't specify such a thing.

I struck this when I tried to clean it up: this is an asymmetry between
the toplevel descriptor table and any indirect ones.

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-21 Thread Rusty Russell
Rusty Russell ru...@rustcorp.com.au writes:
 I'll create some patches and see if it's too ugly to live...

Hmm, with rough userspace testing I managed to get the speed penalty
pretty low.

Here are the two patches, inline:

vringh: handle case where data goes across multiple ranges.

QEMU seems to only use separate memory ranges for ROM vs RAM, but in theory
a single scatterlist element referred to by a vring could cross two ranges,
and thus need to be split into two separate iovec entries.

This causes no measurable slowdown:

Before: (average of 20 runs)
./vringh_test --indirect --eventidx --parallel
real0m3.236s

After: (average of 20 runs)
./vringh_test --indirect --eventidx --parallel
real0m3.223s

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 2ba087d..50d37a7 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -91,33 +91,37 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov 
*iov,
return done;
 }
 
-static inline bool check_range(u64 addr, u32 len,
+/* May reduce *len if range is shorter. */
+static inline bool check_range(u64 addr, u32 *len,
   struct vringh_range *range,
   bool (*getrange)(u64, struct vringh_range *))
 {
if (addr  range-start || addr  range-end_incl) {
if (!getrange(addr, range))
-   goto bad;
+   return false;
}
BUG_ON(addr  range-start || addr  range-end_incl);
 
/* To end of memory? */
-   if (unlikely(addr + len == 0)) {
+   if (unlikely(addr + *len == 0)) {
if (range-end_incl == -1ULL)
return true;
-   goto bad;
+   goto truncate;
}
 
/* Otherwise, don't wrap. */
-   if (unlikely(addr + len  addr))
-   goto bad;
-   if (unlikely(addr + len - 1  range-end_incl))
-   goto bad;
+   if (addr + *len  addr) {
+   vringh_bad(Wrapping descriptor %u@0x%llx, *len, addr);
+   return false;
+   }
+
+   if (unlikely(addr + *len - 1  range-end_incl))
+   goto truncate;
return true;
 
-bad:
-   vringh_bad(Malformed descriptor address %u@0x%llx, len, addr);
-   return false;
+truncate:
+   *len = range-end_incl + 1 - addr;
+   return true;
 }
 
 /* No reason for this code to be inline. */
@@ -205,19 +209,30 @@ __vringh_iov(struct vringh *vrh, u16 i,
for (;;) {
void *addr;
struct vringh_kiov *iov;
+   u32 len;
 
err = getdesc(desc, descs[i]);
if (unlikely(err))
goto fail;
 
-   /* Make sure it's OK, and get offset. */
-   if (!check_range(desc.addr, desc.len, range, getrange)) {
-   err = -EINVAL;
-   goto fail;
-   }
-   addr = (void *)(long)desc.addr + range.offset;
-
if (unlikely(desc.flags  VRING_DESC_F_INDIRECT)) {
+   /* Make sure it's OK, and get offset. */
+   len = desc.len;
+   if (!check_range(desc.addr, len, range, getrange)) {
+   err = -EINVAL;
+   goto fail;
+   }
+
+   if (len != desc.len) {
+   vringh_bad(Indirect descriptor table across
+   ranges: %u@%#llx vs %#llx-%#llx,
+  desc.len, desc.addr, range.start,
+  range.end_incl);
+   err = -EINVAL;
+   goto fail;
+   }
+   addr = (void *)(long)(desc.addr + range.offset);
+
err = move_to_indirect(up_next, i, addr, desc,
   descs, desc_max);
if (err)
@@ -243,6 +258,15 @@ __vringh_iov(struct vringh *vrh, u16 i,
}
}
 
+   again:
+   /* Make sure it's OK, and get offset. */
+   len = desc.len;
+   if (!check_range(desc.addr, len, range, getrange)) {
+   err = -EINVAL;
+   goto fail;
+   }
+   addr = (void *)(unsigned long)(desc.addr + range.offset);
+
if (unlikely(iov-i == iov-max)) {
err = resize_iovec(iov, gfp);
if (err)
@@ -250,9 +274,15 @@ __vringh_iov(struct vringh *vrh, u16 i,
}
 
iov-iov[iov-i].iov_base = addr;
-   iov-iov[iov-i].iov_len = desc.len;
+   iov-iov[iov-i].iov_len = len;

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-20 Thread Rusty Russell
Sjur Brændeland sjur.brandel...@stericsson.com writes:
 On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote:
 in otgher words, we might need to split a single desc to multiple
 iov entries.

 Splitting descriptors doesn't work for me. As described earlier, I
 receive one IP-frame for each descriptor. So I need to keep
 the original boundaries.

Yes, but the _kern helpers don't do range checking or offsetting at the
moment, so they'll be preserved.

If you want to do range checking, I can add that...

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-20 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote:
 +/* Returns vring-num if empty, -ve on error. */
 +static inline int __vringh_get_head(const struct vringh *vrh,
 +int (*getu16)(u16 *val, const u16 *p),

 I think int (*getu16)(const u16 *p) would be cleaner
 than returning through a pointer, then
 callers check that value  0 for error.

I disagree: I dislike overloading the error code, and I like the
symmetry with other operations (getdesc, putu16).

 +/* Make sure it's OK, and get offset. */
 +if (!check_range(desc.addr, desc.len, range, getrange)) {
 +err = -EINVAL;
 +goto fail;
 +}
 +addr = (void *)(long)desc.addr + range.offset;

 Should probably be (void *)(long)(desc.addr + range.offset).
 Otherwise we risk signed integer overflow.

Well, it's a noop.  Either a pointer and long are 64 bit (no overflow),
or they're not (we truncate anyway when we assign to addr).

 +iov-iov[iov-i].iov_base = (__force __user void *)addr;
 +iov-iov[iov-i].iov_len = desc.len;

 The following comment from the previous version still applies:
This looks like it won't do the right thing if desc.len spans multiple
ranges. I don't know if this happens in practice but this is something
vhost supports ATM.
 in otgher words, we might need to split a single desc to multiple
 iov entries.

Ah, separate offsets for consecutive ranges, right.  I'd prefer to say
don't do that, but qemu is rarely sane.  I'll fix it.

 +err = putu16(vrh-vring.used-idx, vrh-last_used_idx);
 +if (err) {
 +vringh_bad(Failed to update used index at %p,
 +   vrh-vring.used-idx);
 +return err;


 One thing vhost does is roll back everything on error,
 so you can for example have an invalid range
 of memory and handle writes there in userspace.
 I think it's worth preserving though this is
 currently unused.

Indeed, that's a nice feature.  So is distinguishing a single bad
descriptor (which can be dropped, for vhost net) from a corrupt ring
(which means the device is useless).

 +/* They could have slipped one in as we were doing that: make
 + * sure it's written, then check again. */
 +virtio_mb(vrh-weak_barriers);
 +
 +if (getu16(avail, vrh-vring.avail-idx) != 0) {

 Hmm above has implicit != 0 why not here?

I didn't see the one above, but it's a clear nod that it doesn't return
a bool (yeah, it's nasty that we don't return the error in this case,
but in practice it's a tiny corner).

 +static inline int getdesc_user(struct vring_desc *dst,
 +   const struct vring_desc *src)
 +{
 +return copy_from_user(dst, (__force void *)src, sizeof(*dst)) == 0 ? 0 :
 +-EFAULT;

 confused about __force above. Shouldn't it cast to __user?
 I have not tried does this patch pass the checker?

You're right, I haven't run sparse across it yet...

 +vrh-vring.desc = (__force struct vring_desc *)desc;
 +vrh-vring.avail = (__force struct vring_avail *)avail;
 +vrh-vring.used = (__force struct vring_used *)used;

 I counted 3 separate chunks that do __force casts.
 Let's try to isolate them and comment why it's safe.

Yes, I want to look at using a union of kvec and iovec internally, but
I worry about breaking gcc's aliasing detection (the kernel compiles
with -fno-strict-aliasing but I hate relying on this).

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-17 Thread Michael S. Tsirkin
On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote:
 Getting use of virtio rings correct is tricky, and a recent patch saw
 an implementation of in-kernel rings (as separate from userspace).
 
 This patch attempts to abstract the business of dealing with the
 virtio ring layout from the access (userspace or direct); to do this,
 we use function pointers, which gcc inlines correctly.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/Makefile |2 +-
  drivers/vhost/Kconfig|8 +
  drivers/vhost/Kconfig.tcm|1 +
  drivers/vhost/Makefile   |2 +
  drivers/vhost/vringh.c   |  818 
 ++
  drivers/virtio/virtio_ring.c |   33 +-
  include/linux/virtio_ring.h  |   57 +++
  include/linux/vringh.h   |  115 ++
  8 files changed, 1008 insertions(+), 28 deletions(-)
  create mode 100644 drivers/vhost/vringh.c
  create mode 100644 include/linux/vringh.h
 
 diff --git a/drivers/Makefile b/drivers/Makefile
 index 7863b9f..351a34f 100644
 --- a/drivers/Makefile
 +++ b/drivers/Makefile
 @@ -123,7 +123,7 @@ obj-$(CONFIG_PPC_PS3) += ps3/
  obj-$(CONFIG_OF) += of/
  obj-$(CONFIG_SSB)+= ssb/
  obj-$(CONFIG_BCMA)   += bcma/
 -obj-$(CONFIG_VHOST_NET)  += vhost/
 +obj-$(CONFIG_VHOST_RING) += vhost/
  obj-$(CONFIG_VLYNQ)  += vlynq/
  obj-$(CONFIG_STAGING)+= staging/
  obj-y+= platform/
 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 index 202bba6..613b074 100644
 --- a/drivers/vhost/Kconfig
 +++ b/drivers/vhost/Kconfig
 @@ -1,6 +1,7 @@
  config VHOST_NET
   tristate Host kernel accelerator for virtio net (EXPERIMENTAL)
   depends on NET  EVENTFD  (TUN || !TUN)  (MACVTAP || !MACVTAP)  
 EXPERIMENTAL
 + select VHOST_RING
   ---help---
 This kernel module can be loaded in host kernel to accelerate
 guest networking with virtio_net. Not to be confused with virtio_net
 @@ -12,3 +13,10 @@ config VHOST_NET
  if STAGING
  source drivers/vhost/Kconfig.tcm
  endif
 +
 +config VHOST_RING
 + tristate
 + ---help---
 +   This option is selected by any driver which needs to access
 +   the host side of a virtio ring.
 +
 diff --git a/drivers/vhost/Kconfig.tcm b/drivers/vhost/Kconfig.tcm
 index a9c6f76..0218f77 100644
 --- a/drivers/vhost/Kconfig.tcm
 +++ b/drivers/vhost/Kconfig.tcm
 @@ -1,6 +1,7 @@
  config TCM_VHOST
   tristate TCM_VHOST fabric module (EXPERIMENTAL)
   depends on TARGET_CORE  EVENTFD  EXPERIMENTAL  m
 + select VHOST_RING
   default n
   ---help---
   Say M here to enable the TCM_VHOST fabric module for use with 
 virtio-scsi guests
 diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
 index a27b053..1d37f5e 100644
 --- a/drivers/vhost/Makefile
 +++ b/drivers/vhost/Makefile
 @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
  vhost_net-y := vhost.o net.o
  
  obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
 +
 +obj-$(CONFIG_VHOST_RING) += vringh.o
 diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
 new file mode 100644
 index 000..b28670f
 --- /dev/null
 +++ b/drivers/vhost/vringh.c
 @@ -0,0 +1,818 @@
 +/*
 + * Helpers for the host side of a virtio ring.
 + *
 + * Since these may be in userspace, we use (inline) accessors.
 + */
 +#include linux/vringh.h
 +#include linux/virtio_ring.h
 +#include linux/kernel.h
 +#include linux/ratelimit.h
 +#include linux/uaccess.h
 +#include linux/slab.h
 +
 +static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 +{
 + static DEFINE_RATELIMIT_STATE(vringh_rs,
 +   DEFAULT_RATELIMIT_INTERVAL,
 +   DEFAULT_RATELIMIT_BURST);
 + if (__ratelimit(vringh_rs)) {
 + va_list ap;
 + va_start(ap, fmt);
 + printk(KERN_NOTICE vringh:);
 + vprintk(fmt, ap);
 + va_end(ap);
 + }
 +}
 +
 +/* Returns vring-num if empty, -ve on error. */
 +static inline int __vringh_get_head(const struct vringh *vrh,
 + int (*getu16)(u16 *val, const u16 *p),

I think int (*getu16)(const u16 *p) would be cleaner
than returning through a pointer, then
callers check that value  0 for error.

 + u16 *last_avail_idx)
 +{
 + u16 avail_idx, i, head;
 + int err;
 +
 + err = getu16(avail_idx, vrh-vring.avail-idx);
 + if (err) {
 + vringh_bad(Failed to access avail idx at %p,
 +vrh-vring.avail-idx);
 + return err;
 + }
 +
 + if (*last_avail_idx == avail_idx)
 + return vrh-vring.num;
 +
 + /* Only get avail ring entries after they have been exposed by guest. */
 + virtio_rmb(vrh-weak_barriers);
 +
 + i = *last_avail_idx  (vrh-vring.num - 1);
 +
 + err = getu16(head, vrh-vring.avail-ring[i]);
 + if 

Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-17 Thread Sjur Brændeland
On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote:
...
 +static inline int
 +__vringh_iov(struct vringh *vrh, u16 i,
 +  struct vringh_iov *riov,
 +  struct vringh_iov *wiov,
 +  bool (*getrange)(u64 addr, struct vringh_range *r),
 +  gfp_t gfp,
 +  int (*getdesc)(struct vring_desc *dst, const struct vring_desc 
 *s))
 +{
 + int err, count = 0, up_next, desc_max;
 + struct vring_desc desc, *descs;
 + struct vringh_range range = { -1ULL, 0 };
 +
 + /* We start traversing vring's descriptor table. */
 + descs = vrh-vring.desc;
 + desc_max = vrh-vring.num;
 + up_next = -1;
 +
 + riov-i = wiov-i = 0;
 + for (;;) {
 + void *addr;
 + struct vringh_iov *iov;
 +
 + err = getdesc(desc, descs[i]);
 + if (unlikely(err))
 + goto fail;
 +
 + /* Make sure it's OK, and get offset. */
 + if (!check_range(desc.addr, desc.len, range, getrange)) {
 + err = -EINVAL;
 + goto fail;
 + }
 + addr = (void *)(long)desc.addr + range.offset;

 Should probably be (void *)(long)(desc.addr + range.offset).
 Otherwise we risk signed integer overflow.

 +
 + if (unlikely(desc.flags  VRING_DESC_F_INDIRECT)) {
 + err = move_to_indirect(up_next, i, addr, desc,
 +descs, desc_max);
 + if (err)
 + goto fail;
 + continue;
 + }
 +
 + if (count++ == vrh-vring.num) {
 + vringh_bad(Descriptor loop in %p, descs);
 + err = -ELOOP;
 + goto fail;
 + }
 +
 + if (desc.flags  VRING_DESC_F_WRITE)
 + iov = wiov;
 + else {
 + iov = riov;
 + if (unlikely(wiov-i)) {
 + vringh_bad(Readable desc %p after writable,
 +descs[i]);
 + err = -EINVAL;
 + goto fail;
 + }
 + }
 +
 + if (unlikely(iov-i == iov-max)) {
 + err = resize_iovec(iov, gfp);
 + if (err)
 + goto fail;
 + }
 +
 + iov-iov[iov-i].iov_base = (__force __user void *)addr;
 + iov-iov[iov-i].iov_len = desc.len;

 The following comment from the previous version still applies:
  This looks like it won't do the right thing if desc.len spans 
 multiple
  ranges. I don't know if this happens in practice but this is 
 something
  vhost supports ATM.
 in otgher words, we might need to split a single desc to multiple
 iov entries.

Splitting descriptors doesn't work for me. As described earlier, I
receive one IP-frame for each descriptor. So I need to keep
the original boundaries.

Regards,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.

2013-01-17 Thread Michael S. Tsirkin
On Thu, Jan 17, 2013 at 12:49:25PM +0100, Sjur Brændeland wrote:
 On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote:
  On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote:
 ...
  +static inline int
  +__vringh_iov(struct vringh *vrh, u16 i,
  +  struct vringh_iov *riov,
  +  struct vringh_iov *wiov,
  +  bool (*getrange)(u64 addr, struct vringh_range *r),
  +  gfp_t gfp,
  +  int (*getdesc)(struct vring_desc *dst, const struct vring_desc 
  *s))
  +{
  + int err, count = 0, up_next, desc_max;
  + struct vring_desc desc, *descs;
  + struct vringh_range range = { -1ULL, 0 };
  +
  + /* We start traversing vring's descriptor table. */
  + descs = vrh-vring.desc;
  + desc_max = vrh-vring.num;
  + up_next = -1;
  +
  + riov-i = wiov-i = 0;
  + for (;;) {
  + void *addr;
  + struct vringh_iov *iov;
  +
  + err = getdesc(desc, descs[i]);
  + if (unlikely(err))
  + goto fail;
  +
  + /* Make sure it's OK, and get offset. */
  + if (!check_range(desc.addr, desc.len, range, getrange)) {
  + err = -EINVAL;
  + goto fail;
  + }
  + addr = (void *)(long)desc.addr + range.offset;
 
  Should probably be (void *)(long)(desc.addr + range.offset).
  Otherwise we risk signed integer overflow.
 
  +
  + if (unlikely(desc.flags  VRING_DESC_F_INDIRECT)) {
  + err = move_to_indirect(up_next, i, addr, desc,
  +descs, desc_max);
  + if (err)
  + goto fail;
  + continue;
  + }
  +
  + if (count++ == vrh-vring.num) {
  + vringh_bad(Descriptor loop in %p, descs);
  + err = -ELOOP;
  + goto fail;
  + }
  +
  + if (desc.flags  VRING_DESC_F_WRITE)
  + iov = wiov;
  + else {
  + iov = riov;
  + if (unlikely(wiov-i)) {
  + vringh_bad(Readable desc %p after writable,
  +descs[i]);
  + err = -EINVAL;
  + goto fail;
  + }
  + }
  +
  + if (unlikely(iov-i == iov-max)) {
  + err = resize_iovec(iov, gfp);
  + if (err)
  + goto fail;
  + }
  +
  + iov-iov[iov-i].iov_base = (__force __user void *)addr;
  + iov-iov[iov-i].iov_len = desc.len;
 
  The following comment from the previous version still applies:
   This looks like it won't do the right thing if desc.len spans 
  multiple
   ranges. I don't know if this happens in practice but this is 
  something
   vhost supports ATM.
  in otgher words, we might need to split a single desc to multiple
  iov entries.
 
 Splitting descriptors doesn't work for me. As described earlier, I
 receive one IP-frame for each descriptor. So I need to keep
 the original boundaries.
 
 Regards,
 Sjur

Hmm this kind of conflicts with the description we were
going into with virtio generally.
Other side should be free to lay out request any way it
wants to.
Anyway, in this case this is splitting iovecs really not descriptors.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization