Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Hi Rusty, On Wed, Feb 13, 2013 at 11:25 AM, Rusty Russell ru...@rustcorp.com.au wrote: Sjur Brændeland sjurb...@gmail.com writes: On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote: Sjur Brændeland sjurb...@gmail.com writes: ... I guess my vringh related stuff should go into your tree together with your vringh patches... Would you be willing to take this via your tree, provided that I get acks from the right people? Yes please. Now I have set up a test rig for vhost (unfortunately not with 10Ge) I can make progress on vhost adaption. I have one question, are you planning to include vringh in your virtio-next branch for the 3.9 merge window or some time later? If you're eager with the CAIF stuff, I'll merge it for this merge window (we may still rework stuff, but I don't want to block you). Otherwise, it'll probably wait. Yes, I would prefer to get CAIF-virtio merged now unless it creates a a lot of trouble for you. But I'm not sure I'll manage to get ack from Ohad in time though. He said he would be busy for the next week. I need an ack from David Miller as well, I'll resend the patch to him if you decide to go ahead and merge this. Thanks, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Sjur Brændeland sjurb...@gmail.com writes: Hi Rusty, On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote: Sjur Brændeland sjurb...@gmail.com writes: ... I guess my vringh related stuff should go into your tree together with your vringh patches... Would you be willing to take this via your tree, provided that I get acks from the right people? Yes please. Now I have set up a test rig for vhost (unfortunately not with 10Ge) I can make progress on vhost adaption. I have one question, are you planning to include vringh in your virtio-next branch for the 3.9 merge window or some time later? If you're eager with the CAIF stuff, I'll merge it for this merge window (we may still rework stuff, but I don't want to block you). Otherwise, it'll probably wait. Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Hi Rusty, On Mon, Feb 4, 2013 at 10:44 PM, Rusty Russell ru...@rustcorp.com.au wrote: Sjur Brændeland sjurb...@gmail.com writes: ... I guess my vringh related stuff should go into your tree together with your vringh patches... Would you be willing to take this via your tree, provided that I get acks from the right people? Yes please. Now I have set up a test rig for vhost (unfortunately not with 10Ge) I can make progress on vhost adaption. I have one question, are you planning to include vringh in your virtio-next branch for the 3.9 merge window or some time later? Regards, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Hi Rusty, On Thu, Jan 17, 2013 at 11:29 AM, Rusty Russell ru...@rustcorp.com.auwrote: Getting use of virtio rings correct is tricky, and a recent patch saw an implementation of in-kernel rings (as separate from userspace). This patch attempts to abstract the business of dealing with the virtio ring layout from the access (userspace or direct); to do this, we use function pointers, which gcc inlines correctly. I have been using your patches for a while in my test setup without any issues with vringh. The only thing I'm missing is export of symbols. My current caif_virtio driver expects vringh to be a module exporting symbols. Is this something you are planning to add? I guess my vringh related stuff should go into your tree together with your vringh patches... Would you be willing to take this via your tree, provided that I get acks from the right people? Regards, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On 01/17/2013 06:29 PM, Rusty Russell wrote: Getting use of virtio rings correct is tricky, and a recent patch saw an implementation of in-kernel rings (as separate from userspace). This patch attempts to abstract the business of dealing with the virtio ring layout from the access (userspace or direct); to do this, we use function pointers, which gcc inlines correctly. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/Makefile |2 +- drivers/vhost/Kconfig|8 + drivers/vhost/Kconfig.tcm|1 + drivers/vhost/Makefile |2 + drivers/vhost/vringh.c | 818 ++ drivers/virtio/virtio_ring.c | 33 +- include/linux/virtio_ring.h | 57 +++ Why vringh_notify_enable_user() and vringh_notify_disable_user() are not declared in include/linux/virtio_ring.h? Missed that? include/linux/vringh.h | 115 ++ 8 files changed, 1008 insertions(+), 28 deletions(-) create mode 100644 drivers/vhost/vringh.c create mode 100644 include/linux/vringh.h diff --git a/drivers/Makefile b/drivers/Makefile index 7863b9f..351a34f 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -123,7 +123,7 @@ obj-$(CONFIG_PPC_PS3) += ps3/ obj-$(CONFIG_OF) += of/ obj-$(CONFIG_SSB)+= ssb/ obj-$(CONFIG_BCMA) += bcma/ -obj-$(CONFIG_VHOST_NET) += vhost/ +obj-$(CONFIG_VHOST_RING) += vhost/ obj-$(CONFIG_VLYNQ) += vlynq/ obj-$(CONFIG_STAGING)+= staging/ obj-y+= platform/ diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 202bba6..613b074 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -1,6 +1,7 @@ config VHOST_NET tristate Host kernel accelerator for virtio net (EXPERIMENTAL) depends on NET EVENTFD (TUN || !TUN) (MACVTAP || !MACVTAP) EXPERIMENTAL + select VHOST_RING ---help--- This kernel module can be loaded in host kernel to accelerate guest networking with virtio_net. Not to be confused with virtio_net @@ -12,3 +13,10 @@ config VHOST_NET if STAGING source drivers/vhost/Kconfig.tcm endif + +config VHOST_RING + tristate + ---help--- + This option is selected by any driver which needs to access + the host side of a virtio ring. + diff --git a/drivers/vhost/Kconfig.tcm b/drivers/vhost/Kconfig.tcm index a9c6f76..0218f77 100644 --- a/drivers/vhost/Kconfig.tcm +++ b/drivers/vhost/Kconfig.tcm @@ -1,6 +1,7 @@ config TCM_VHOST tristate TCM_VHOST fabric module (EXPERIMENTAL) depends on TARGET_CORE EVENTFD EXPERIMENTAL m + select VHOST_RING default n ---help--- Say M here to enable the TCM_VHOST fabric module for use with virtio-scsi guests diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index a27b053..1d37f5e 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o + +obj-$(CONFIG_VHOST_RING) += vringh.o diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c new file mode 100644 index 000..b28670f --- /dev/null +++ b/drivers/vhost/vringh.c @@ -0,0 +1,818 @@ +/* + * Helpers for the host side of a virtio ring. + * + * Since these may be in userspace, we use (inline) accessors. + */ +#include linux/vringh.h +#include linux/virtio_ring.h +#include linux/kernel.h +#include linux/ratelimit.h +#include linux/uaccess.h +#include linux/slab.h + +static __printf(1,2) __cold void vringh_bad(const char *fmt, ...) +{ + static DEFINE_RATELIMIT_STATE(vringh_rs, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + if (__ratelimit(vringh_rs)) { + va_list ap; + va_start(ap, fmt); + printk(KERN_NOTICE vringh:); + vprintk(fmt, ap); + va_end(ap); + } +} + +/* Returns vring-num if empty, -ve on error. */ +static inline int __vringh_get_head(const struct vringh *vrh, + int (*getu16)(u16 *val, const u16 *p), + u16 *last_avail_idx) +{ + u16 avail_idx, i, head; + int err; + + err = getu16(avail_idx, vrh-vring.avail-idx); + if (err) { + vringh_bad(Failed to access avail idx at %p, +vrh-vring.avail-idx); + return err; + } + + if (*last_avail_idx == avail_idx) + return vrh-vring.num; + + /* Only get avail ring entries after they have been exposed by guest. */ + virtio_rmb(vrh-weak_barriers); + + i = *last_avail_idx (vrh-vring.num - 1); + + err = getu16(head, vrh-vring.avail-ring[i]); + if (err) { +
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Mon, Jan 21, 2013 at 3:36 AM, Rusty Russell ru...@rustcorp.com.au wrote: Sjur Brændeland sjur.brandel...@stericsson.com writes: On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote: in otgher words, we might need to split a single desc to multiple iov entries. Splitting descriptors doesn't work for me. As described earlier, I receive one IP-frame for each descriptor. So I need to keep the original boundaries. Yes, but the _kern helpers don't do range checking or offsetting at the moment, so they'll be preserved. If you want to do range checking, I can add that... I don't think I need that. Initially I can just check if the received address is readable. Regards, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Asias He as...@redhat.com writes: On 01/17/2013 06:29 PM, Rusty Russell wrote: Getting use of virtio rings correct is tricky, and a recent patch saw an implementation of in-kernel rings (as separate from userspace). This patch attempts to abstract the business of dealing with the virtio ring layout from the access (userspace or direct); to do this, we use function pointers, which gcc inlines correctly. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/Makefile |2 +- drivers/vhost/Kconfig|8 + drivers/vhost/Kconfig.tcm|1 + drivers/vhost/Makefile |2 + drivers/vhost/vringh.c | 818 ++ drivers/virtio/virtio_ring.c | 33 +- include/linux/virtio_ring.h | 57 +++ Why vringh_notify_enable_user() and vringh_notify_disable_user() are not declared in include/linux/virtio_ring.h? Missed that? Yes, I did... I've fixed it in my vringh branch, where I'm doing development from now on. Thanks! Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Mon, Jan 21, 2013 at 01:04:47PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote: +/* Returns vring-num if empty, -ve on error. */ +static inline int __vringh_get_head(const struct vringh *vrh, + int (*getu16)(u16 *val, const u16 *p), I think int (*getu16)(const u16 *p) would be cleaner than returning through a pointer, then callers check that value 0 for error. I disagree: I dislike overloading the error code, and I like the symmetry with other operations (getdesc, putu16). + /* Make sure it's OK, and get offset. */ + if (!check_range(desc.addr, desc.len, range, getrange)) { + err = -EINVAL; + goto fail; + } + addr = (void *)(long)desc.addr + range.offset; Should probably be (void *)(long)(desc.addr + range.offset). Otherwise we risk signed integer overflow. Well, it's a noop. Either a pointer and long are 64 bit (no overflow), or they're not (we truncate anyway when we assign to addr). For readability purposes, I think it's better to do all math in 64 bit then cast as appropriate. This also relies on -fno-strict-overflow - below you say it's best not to rely on specific compiler flags, and I agree. + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. + err = putu16(vrh-vring.used-idx, vrh-last_used_idx); + if (err) { + vringh_bad(Failed to update used index at %p, + vrh-vring.used-idx); + return err; One thing vhost does is roll back everything on error, so you can for example have an invalid range of memory and handle writes there in userspace. I think it's worth preserving though this is currently unused. Indeed, that's a nice feature. So is distinguishing a single bad descriptor (which can be dropped, for vhost net) from a corrupt ring (which means the device is useless). + /* They could have slipped one in as we were doing that: make + * sure it's written, then check again. */ + virtio_mb(vrh-weak_barriers); + + if (getu16(avail, vrh-vring.avail-idx) != 0) { Hmm above has implicit != 0 why not here? I didn't see the one above, but it's a clear nod that it doesn't return a bool (yeah, it's nasty that we don't return the error in this case, but in practice it's a tiny corner). +static inline int getdesc_user(struct vring_desc *dst, + const struct vring_desc *src) +{ + return copy_from_user(dst, (__force void *)src, sizeof(*dst)) == 0 ? 0 : + -EFAULT; confused about __force above. Shouldn't it cast to __user? I have not tried does this patch pass the checker? You're right, I haven't run sparse across it yet... + vrh-vring.desc = (__force struct vring_desc *)desc; + vrh-vring.avail = (__force struct vring_avail *)avail; + vrh-vring.used = (__force struct vring_used *)used; I counted 3 separate chunks that do __force casts. Let's try to isolate them and comment why it's safe. Yes, I want to look at using a union of kvec and iovec internally, but I worry about breaking gcc's aliasing detection (the kernel compiles with -fno-strict-aliasing but I hate relying on this). Thanks, Rusty. Right. I think tagging it __user to mean can be userspace and then removing tag where we know it's safe is a decent compromize. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Mon, Jan 21, 2013 at 02:24:31PM +0200, Michael S. Tsirkin wrote: On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote: Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. Actually, you make the same assumption for vhost, with your use of getuser and putuser for accessing the ring. There's no requirement that ring is mapped directly into guest memory. If a ring is not contigious qemu can allocate it's own virtuall contigious rings and copy data back and forth. The complexity and cost of handling this is significant, Why is it? Just add a while loop, increment desc.addr and decrement desc.len. and the error if it ever did happen is distinctive. Does qemu ever create such things? Thanks, Rusty. I think qemu does create ranges that are consequitive in guest pa space but not in qemu va ranges. I'm sick today, will try to dig out some examples later. It seems it could happen if e.g. one region is ROM and another one is RAM. This means it won't be used for the ring (which is R/W) but could be for the data. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote: Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. Actually, you make the same assumption for vhost, with your use of getuser and putuser for accessing the ring. There's no requirement that ring is mapped directly into guest memory. If a ring is not contigious qemu can allocate it's own virtuall contigious rings and copy data back and forth. The complexity and cost of handling this is significant, Why is it? Just add a while loop, increment desc.addr and decrement desc.len. and the error if it ever did happen is distinctive. Does qemu ever create such things? Thanks, Rusty. I think qemu does create ranges that are consequitive in guest pa space but not in qemu va ranges. I'm sick today, will try to dig out some examples later. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Michael S. Tsirkin m...@redhat.com writes: On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote: Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: +iov-iov[iov-i].iov_base = (__force __user void *)addr; +iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. Actually, you make the same assumption for vhost, with your use of getuser and putuser for accessing the ring. There's no requirement that ring is mapped directly into guest memory. If a ring is not contigious qemu can allocate it's own virtuall contigious rings and copy data back and forth. True, but it's the guest which allocates the ring. If QEMU sets up a guest with a offset-discontiguous mapping, vhost would be unreliable today. The complexity and cost of handling this is significant, Why is it? Just add a while loop, increment desc.addr and decrement desc.len. Indirect handling now needs to re-check the boundaries for every descriptor it fetches, and handle the case where a single descriptor overlaps the boundary. The ability to verify the indirect descriptor table all at once makes the code fast and neat. From your other mail: It seems it could happen if e.g. one region is ROM and another one is RAM. This means it won't be used for the ring (which is R/W) but could be for the data. That case wouldn't be so bad, I think, since it's a simple loop. I'll create some patches and see if it's too ugly to live... Thanks, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: On Mon, Jan 21, 2013 at 10:22:03PM +1030, Rusty Russell wrote: Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. Actually, you make the same assumption for vhost, with your use of getuser and putuser for accessing the ring. There's no requirement that ring is mapped directly into guest memory. If a ring is not contigious qemu can allocate it's own virtuall contigious rings and copy data back and forth. True, but it's the guest which allocates the ring. If QEMU sets up a guest with a offset-discontiguous mapping, vhost would be unreliable today. My mistake: the ring addresses handed through the ioctl already translated, so you can't specify such a thing. I struck this when I tried to clean it up: this is an asymmetry between the toplevel descriptor table and any indirect ones. Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Rusty Russell ru...@rustcorp.com.au writes: I'll create some patches and see if it's too ugly to live... Hmm, with rough userspace testing I managed to get the speed penalty pretty low. Here are the two patches, inline: vringh: handle case where data goes across multiple ranges. QEMU seems to only use separate memory ranges for ROM vs RAM, but in theory a single scatterlist element referred to by a vring could cross two ranges, and thus need to be split into two separate iovec entries. This causes no measurable slowdown: Before: (average of 20 runs) ./vringh_test --indirect --eventidx --parallel real0m3.236s After: (average of 20 runs) ./vringh_test --indirect --eventidx --parallel real0m3.223s Signed-off-by: Rusty Russell ru...@rustcorp.com.au diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c index 2ba087d..50d37a7 100644 --- a/drivers/vhost/vringh.c +++ b/drivers/vhost/vringh.c @@ -91,33 +91,37 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov, return done; } -static inline bool check_range(u64 addr, u32 len, +/* May reduce *len if range is shorter. */ +static inline bool check_range(u64 addr, u32 *len, struct vringh_range *range, bool (*getrange)(u64, struct vringh_range *)) { if (addr range-start || addr range-end_incl) { if (!getrange(addr, range)) - goto bad; + return false; } BUG_ON(addr range-start || addr range-end_incl); /* To end of memory? */ - if (unlikely(addr + len == 0)) { + if (unlikely(addr + *len == 0)) { if (range-end_incl == -1ULL) return true; - goto bad; + goto truncate; } /* Otherwise, don't wrap. */ - if (unlikely(addr + len addr)) - goto bad; - if (unlikely(addr + len - 1 range-end_incl)) - goto bad; + if (addr + *len addr) { + vringh_bad(Wrapping descriptor %u@0x%llx, *len, addr); + return false; + } + + if (unlikely(addr + *len - 1 range-end_incl)) + goto truncate; return true; -bad: - vringh_bad(Malformed descriptor address %u@0x%llx, len, addr); - return false; +truncate: + *len = range-end_incl + 1 - addr; + return true; } /* No reason for this code to be inline. */ @@ -205,19 +209,30 @@ __vringh_iov(struct vringh *vrh, u16 i, for (;;) { void *addr; struct vringh_kiov *iov; + u32 len; err = getdesc(desc, descs[i]); if (unlikely(err)) goto fail; - /* Make sure it's OK, and get offset. */ - if (!check_range(desc.addr, desc.len, range, getrange)) { - err = -EINVAL; - goto fail; - } - addr = (void *)(long)desc.addr + range.offset; - if (unlikely(desc.flags VRING_DESC_F_INDIRECT)) { + /* Make sure it's OK, and get offset. */ + len = desc.len; + if (!check_range(desc.addr, len, range, getrange)) { + err = -EINVAL; + goto fail; + } + + if (len != desc.len) { + vringh_bad(Indirect descriptor table across + ranges: %u@%#llx vs %#llx-%#llx, + desc.len, desc.addr, range.start, + range.end_incl); + err = -EINVAL; + goto fail; + } + addr = (void *)(long)(desc.addr + range.offset); + err = move_to_indirect(up_next, i, addr, desc, descs, desc_max); if (err) @@ -243,6 +258,15 @@ __vringh_iov(struct vringh *vrh, u16 i, } } + again: + /* Make sure it's OK, and get offset. */ + len = desc.len; + if (!check_range(desc.addr, len, range, getrange)) { + err = -EINVAL; + goto fail; + } + addr = (void *)(unsigned long)(desc.addr + range.offset); + if (unlikely(iov-i == iov-max)) { err = resize_iovec(iov, gfp); if (err) @@ -250,9 +274,15 @@ __vringh_iov(struct vringh *vrh, u16 i, } iov-iov[iov-i].iov_base = addr; - iov-iov[iov-i].iov_len = desc.len; + iov-iov[iov-i].iov_len = len;
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Sjur Brændeland sjur.brandel...@stericsson.com writes: On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote: in otgher words, we might need to split a single desc to multiple iov entries. Splitting descriptors doesn't work for me. As described earlier, I receive one IP-frame for each descriptor. So I need to keep the original boundaries. Yes, but the _kern helpers don't do range checking or offsetting at the moment, so they'll be preserved. If you want to do range checking, I can add that... Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
Michael S. Tsirkin m...@redhat.com writes: On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote: +/* Returns vring-num if empty, -ve on error. */ +static inline int __vringh_get_head(const struct vringh *vrh, +int (*getu16)(u16 *val, const u16 *p), I think int (*getu16)(const u16 *p) would be cleaner than returning through a pointer, then callers check that value 0 for error. I disagree: I dislike overloading the error code, and I like the symmetry with other operations (getdesc, putu16). +/* Make sure it's OK, and get offset. */ +if (!check_range(desc.addr, desc.len, range, getrange)) { +err = -EINVAL; +goto fail; +} +addr = (void *)(long)desc.addr + range.offset; Should probably be (void *)(long)(desc.addr + range.offset). Otherwise we risk signed integer overflow. Well, it's a noop. Either a pointer and long are 64 bit (no overflow), or they're not (we truncate anyway when we assign to addr). +iov-iov[iov-i].iov_base = (__force __user void *)addr; +iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Ah, separate offsets for consecutive ranges, right. I'd prefer to say don't do that, but qemu is rarely sane. I'll fix it. +err = putu16(vrh-vring.used-idx, vrh-last_used_idx); +if (err) { +vringh_bad(Failed to update used index at %p, + vrh-vring.used-idx); +return err; One thing vhost does is roll back everything on error, so you can for example have an invalid range of memory and handle writes there in userspace. I think it's worth preserving though this is currently unused. Indeed, that's a nice feature. So is distinguishing a single bad descriptor (which can be dropped, for vhost net) from a corrupt ring (which means the device is useless). +/* They could have slipped one in as we were doing that: make + * sure it's written, then check again. */ +virtio_mb(vrh-weak_barriers); + +if (getu16(avail, vrh-vring.avail-idx) != 0) { Hmm above has implicit != 0 why not here? I didn't see the one above, but it's a clear nod that it doesn't return a bool (yeah, it's nasty that we don't return the error in this case, but in practice it's a tiny corner). +static inline int getdesc_user(struct vring_desc *dst, + const struct vring_desc *src) +{ +return copy_from_user(dst, (__force void *)src, sizeof(*dst)) == 0 ? 0 : +-EFAULT; confused about __force above. Shouldn't it cast to __user? I have not tried does this patch pass the checker? You're right, I haven't run sparse across it yet... +vrh-vring.desc = (__force struct vring_desc *)desc; +vrh-vring.avail = (__force struct vring_avail *)avail; +vrh-vring.used = (__force struct vring_used *)used; I counted 3 separate chunks that do __force casts. Let's try to isolate them and comment why it's safe. Yes, I want to look at using a union of kvec and iovec internally, but I worry about breaking gcc's aliasing detection (the kernel compiles with -fno-strict-aliasing but I hate relying on this). Thanks, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote: Getting use of virtio rings correct is tricky, and a recent patch saw an implementation of in-kernel rings (as separate from userspace). This patch attempts to abstract the business of dealing with the virtio ring layout from the access (userspace or direct); to do this, we use function pointers, which gcc inlines correctly. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/Makefile |2 +- drivers/vhost/Kconfig|8 + drivers/vhost/Kconfig.tcm|1 + drivers/vhost/Makefile |2 + drivers/vhost/vringh.c | 818 ++ drivers/virtio/virtio_ring.c | 33 +- include/linux/virtio_ring.h | 57 +++ include/linux/vringh.h | 115 ++ 8 files changed, 1008 insertions(+), 28 deletions(-) create mode 100644 drivers/vhost/vringh.c create mode 100644 include/linux/vringh.h diff --git a/drivers/Makefile b/drivers/Makefile index 7863b9f..351a34f 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -123,7 +123,7 @@ obj-$(CONFIG_PPC_PS3) += ps3/ obj-$(CONFIG_OF) += of/ obj-$(CONFIG_SSB)+= ssb/ obj-$(CONFIG_BCMA) += bcma/ -obj-$(CONFIG_VHOST_NET) += vhost/ +obj-$(CONFIG_VHOST_RING) += vhost/ obj-$(CONFIG_VLYNQ) += vlynq/ obj-$(CONFIG_STAGING)+= staging/ obj-y+= platform/ diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 202bba6..613b074 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -1,6 +1,7 @@ config VHOST_NET tristate Host kernel accelerator for virtio net (EXPERIMENTAL) depends on NET EVENTFD (TUN || !TUN) (MACVTAP || !MACVTAP) EXPERIMENTAL + select VHOST_RING ---help--- This kernel module can be loaded in host kernel to accelerate guest networking with virtio_net. Not to be confused with virtio_net @@ -12,3 +13,10 @@ config VHOST_NET if STAGING source drivers/vhost/Kconfig.tcm endif + +config VHOST_RING + tristate + ---help--- + This option is selected by any driver which needs to access + the host side of a virtio ring. + diff --git a/drivers/vhost/Kconfig.tcm b/drivers/vhost/Kconfig.tcm index a9c6f76..0218f77 100644 --- a/drivers/vhost/Kconfig.tcm +++ b/drivers/vhost/Kconfig.tcm @@ -1,6 +1,7 @@ config TCM_VHOST tristate TCM_VHOST fabric module (EXPERIMENTAL) depends on TARGET_CORE EVENTFD EXPERIMENTAL m + select VHOST_RING default n ---help--- Say M here to enable the TCM_VHOST fabric module for use with virtio-scsi guests diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index a27b053..1d37f5e 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o + +obj-$(CONFIG_VHOST_RING) += vringh.o diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c new file mode 100644 index 000..b28670f --- /dev/null +++ b/drivers/vhost/vringh.c @@ -0,0 +1,818 @@ +/* + * Helpers for the host side of a virtio ring. + * + * Since these may be in userspace, we use (inline) accessors. + */ +#include linux/vringh.h +#include linux/virtio_ring.h +#include linux/kernel.h +#include linux/ratelimit.h +#include linux/uaccess.h +#include linux/slab.h + +static __printf(1,2) __cold void vringh_bad(const char *fmt, ...) +{ + static DEFINE_RATELIMIT_STATE(vringh_rs, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + if (__ratelimit(vringh_rs)) { + va_list ap; + va_start(ap, fmt); + printk(KERN_NOTICE vringh:); + vprintk(fmt, ap); + va_end(ap); + } +} + +/* Returns vring-num if empty, -ve on error. */ +static inline int __vringh_get_head(const struct vringh *vrh, + int (*getu16)(u16 *val, const u16 *p), I think int (*getu16)(const u16 *p) would be cleaner than returning through a pointer, then callers check that value 0 for error. + u16 *last_avail_idx) +{ + u16 avail_idx, i, head; + int err; + + err = getu16(avail_idx, vrh-vring.avail-idx); + if (err) { + vringh_bad(Failed to access avail idx at %p, +vrh-vring.avail-idx); + return err; + } + + if (*last_avail_idx == avail_idx) + return vrh-vring.num; + + /* Only get avail ring entries after they have been exposed by guest. */ + virtio_rmb(vrh-weak_barriers); + + i = *last_avail_idx (vrh-vring.num - 1); + + err = getu16(head, vrh-vring.avail-ring[i]); + if
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote: ... +static inline int +__vringh_iov(struct vringh *vrh, u16 i, + struct vringh_iov *riov, + struct vringh_iov *wiov, + bool (*getrange)(u64 addr, struct vringh_range *r), + gfp_t gfp, + int (*getdesc)(struct vring_desc *dst, const struct vring_desc *s)) +{ + int err, count = 0, up_next, desc_max; + struct vring_desc desc, *descs; + struct vringh_range range = { -1ULL, 0 }; + + /* We start traversing vring's descriptor table. */ + descs = vrh-vring.desc; + desc_max = vrh-vring.num; + up_next = -1; + + riov-i = wiov-i = 0; + for (;;) { + void *addr; + struct vringh_iov *iov; + + err = getdesc(desc, descs[i]); + if (unlikely(err)) + goto fail; + + /* Make sure it's OK, and get offset. */ + if (!check_range(desc.addr, desc.len, range, getrange)) { + err = -EINVAL; + goto fail; + } + addr = (void *)(long)desc.addr + range.offset; Should probably be (void *)(long)(desc.addr + range.offset). Otherwise we risk signed integer overflow. + + if (unlikely(desc.flags VRING_DESC_F_INDIRECT)) { + err = move_to_indirect(up_next, i, addr, desc, +descs, desc_max); + if (err) + goto fail; + continue; + } + + if (count++ == vrh-vring.num) { + vringh_bad(Descriptor loop in %p, descs); + err = -ELOOP; + goto fail; + } + + if (desc.flags VRING_DESC_F_WRITE) + iov = wiov; + else { + iov = riov; + if (unlikely(wiov-i)) { + vringh_bad(Readable desc %p after writable, +descs[i]); + err = -EINVAL; + goto fail; + } + } + + if (unlikely(iov-i == iov-max)) { + err = resize_iovec(iov, gfp); + if (err) + goto fail; + } + + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Splitting descriptors doesn't work for me. As described earlier, I receive one IP-frame for each descriptor. So I need to keep the original boundaries. Regards, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] virtio_host: host-side implementation of virtio rings.
On Thu, Jan 17, 2013 at 12:49:25PM +0100, Sjur Brændeland wrote: On Thu, Jan 17, 2013 at 12:23 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Jan 17, 2013 at 08:59:38PM +1030, Rusty Russell wrote: ... +static inline int +__vringh_iov(struct vringh *vrh, u16 i, + struct vringh_iov *riov, + struct vringh_iov *wiov, + bool (*getrange)(u64 addr, struct vringh_range *r), + gfp_t gfp, + int (*getdesc)(struct vring_desc *dst, const struct vring_desc *s)) +{ + int err, count = 0, up_next, desc_max; + struct vring_desc desc, *descs; + struct vringh_range range = { -1ULL, 0 }; + + /* We start traversing vring's descriptor table. */ + descs = vrh-vring.desc; + desc_max = vrh-vring.num; + up_next = -1; + + riov-i = wiov-i = 0; + for (;;) { + void *addr; + struct vringh_iov *iov; + + err = getdesc(desc, descs[i]); + if (unlikely(err)) + goto fail; + + /* Make sure it's OK, and get offset. */ + if (!check_range(desc.addr, desc.len, range, getrange)) { + err = -EINVAL; + goto fail; + } + addr = (void *)(long)desc.addr + range.offset; Should probably be (void *)(long)(desc.addr + range.offset). Otherwise we risk signed integer overflow. + + if (unlikely(desc.flags VRING_DESC_F_INDIRECT)) { + err = move_to_indirect(up_next, i, addr, desc, +descs, desc_max); + if (err) + goto fail; + continue; + } + + if (count++ == vrh-vring.num) { + vringh_bad(Descriptor loop in %p, descs); + err = -ELOOP; + goto fail; + } + + if (desc.flags VRING_DESC_F_WRITE) + iov = wiov; + else { + iov = riov; + if (unlikely(wiov-i)) { + vringh_bad(Readable desc %p after writable, +descs[i]); + err = -EINVAL; + goto fail; + } + } + + if (unlikely(iov-i == iov-max)) { + err = resize_iovec(iov, gfp); + if (err) + goto fail; + } + + iov-iov[iov-i].iov_base = (__force __user void *)addr; + iov-iov[iov-i].iov_len = desc.len; The following comment from the previous version still applies: This looks like it won't do the right thing if desc.len spans multiple ranges. I don't know if this happens in practice but this is something vhost supports ATM. in otgher words, we might need to split a single desc to multiple iov entries. Splitting descriptors doesn't work for me. As described earlier, I receive one IP-frame for each descriptor. So I need to keep the original boundaries. Regards, Sjur Hmm this kind of conflicts with the description we were going into with virtio generally. Other side should be free to lay out request any way it wants to. Anyway, in this case this is splitting iovecs really not descriptors. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization