RE: [PATCH v16 14/17] Add a kconfig entry and make entry for mp device.
-----Original Message-----
From: Randy Dunlap [mailto:randy.dun...@oracle.com]
Sent: Thursday, December 02, 2010 1:54 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: Re: [PATCH v16 14/17] Add a kconfig entry and make entry for mp device.

> On Wed, 1 Dec 2010 16:08:25 +0800 xiaohui@intel.com wrote:
>
>> From: Xin Xiaohui <xiaohui@intel.com>
>>
>> Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
>> Reviewed-by: Jeff Dike <jd...@linux.intel.com>
>> ---
>>  drivers/vhost/Kconfig  |   10 ++++++++++
>>  drivers/vhost/Makefile |    2 ++
>>  2 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
>> index e4e2fd1..a6b8cbf 100644
>> --- a/drivers/vhost/Kconfig
>> +++ b/drivers/vhost/Kconfig
>> @@ -9,3 +9,13 @@ config VHOST_NET
>>  	  To compile this driver as a module, choose M here: the module will
>>  	  be called vhost_net.
>>
>> +config MEDIATE_PASSTHRU
>> +	tristate "mediate passthru network driver (EXPERIMENTAL)"
>> +	depends on VHOST_NET
>> +	---help---
>> +	  zerocopy network I/O support, we call it as mediate passthru to
>
>            support; we call it mediate passthru to
>
>> +	  be distiguish with hardare passthru.
>
>            distinguish it from hardware passthru.

Thanks. I will modify that.

Thanks
Xiaohui

>> +
>> +	  To compile this driver as a module, choose M here: the module will
>> +	  be called mpassthru.
>> +

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
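Folding in Randy's wording, the revised help text would presumably read as below. This is a sketch of the follow-up entry, not the version as eventually posted:

```
config MEDIATE_PASSTHRU
	tristate "mediate passthru network driver (EXPERIMENTAL)"
	depends on VHOST_NET
	---help---
	  zerocopy network I/O support; we call it mediate passthru to
	  distinguish it from hardware passthru.

	  To compile this driver as a module, choose M here: the module will
	  be called mpassthru.
```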
RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Xin, Xiaohui
Sent: Thursday, November 11, 2010 4:28 PM
To: David Miller
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

> -----Original Message-----
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, November 11, 2010 1:47 AM
> Subject: Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
>
> From: xiaohui@intel.com
> Date: Wed, 10 Nov 2010 17:23:28 +0800
>> From: Xin Xiaohui <xiaohui@intel.com>
>
>>> 2) The idea to key off of skb->dev in skb_release_data() is
>>>    fundamentally flawed since many actions can change skb->dev on
>>>    you, which will end up causing a leak of your external data areas.
>>
>> How about this one? If the destructor_arg is not a good candidate,
>> then I have to add an apparent field in shinfo.
>
> If destructor_arg is actually a net_device pointer or similar, you
> will need to take a reference count on it or similar.

Do you mean destructor_arg will be consumed by another user? In that case, may I add a new structure member in shinfo? Then only zero-copy would use it, and there would be no need for the reference count. How about this?

It really needs somewhere to track the external data area, and if something goes wrong with it, we can also release the data area. We think skb_release_data() is the right place to deal with it.

If I understood right, destructor_arg will be used by someone else, which is why a reference count is needed; then how about adding a new structure member in shinfo?

Thanks
Xiaohui

> Which means -- good bye performance especially on SMP.
> You're going to be adding new serialization points and at least two
> new atomics per packet.
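For context, the cost David is pointing at can be seen in miniature below: tracking an external data area with a shared reference count means at least one atomic increment when a packet takes a reference and one atomic decrement when it is released, on every packet. A minimal userspace sketch, with hypothetical names and C11 atomics standing in for the kernel's atomic_t:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical external-buffer descriptor: the shared refcount is what
 * would add two atomic RMW ops (get + put) to every packet fast path. */
struct ext_area {
    atomic_int refcnt;   /* stands in for kernel atomic_t */
    bool released;       /* set when the last reference is dropped */
};

static void ext_area_init(struct ext_area *a)
{
    atomic_init(&a->refcnt, 1);
    a->released = false;
}

/* Called when a packet takes a reference: one atomic op. */
static void ext_area_get(struct ext_area *a)
{
    atomic_fetch_add(&a->refcnt, 1);
}

/* Called from the skb release path: a second atomic op per packet. */
static void ext_area_put(struct ext_area *a)
{
    if (atomic_fetch_sub(&a->refcnt, 1) == 1)
        a->released = true;  /* last user: free the external data area */
}
```

On SMP, the cache line holding the refcount bounces between the CPUs processing packets, which is the serialization point the thread is objecting to.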
RE: Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
-----Original Message-----
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Monday, November 08, 2010 4:25 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

> Le lundi 08 novembre 2010 à 16:03 +0800, xiaohui@intel.com a écrit :
>> From: Xin Xiaohui <xiaohui@intel.com>
>
> Hmm, I suggest you read the comment two lines above. If destructor_arg
> is now cleared each time we allocate a new skb, then, please move it
> before dataref in the shinfo structure, so that the following memset()
> does the job efficiently... Something like:
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index e6ba898..2dca504 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -195,6 +195,9 @@ struct skb_shared_info {
>  	__be32		ip6_frag_id;
>  	__u8		tx_flags;
>  	struct sk_buff	*frag_list;
> +	/* Intermediate layers must ensure that destructor_arg
> +	 * remains valid until skb destructor */
> +	void		*destructor_arg;
>  	struct skb_shared_hwtstamps hwtstamps;
>
>  	/*
> @@ -202,9 +205,6 @@ struct skb_shared_info {
>  	 */
>  	atomic_t	dataref;
>
> -	/* Intermediate layers must ensure that destructor_arg
> -	 * remains valid until skb destructor */
> -	void		*destructor_arg;
> -
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };

>> Will that affect the cache line?
>
> What do you mean?

Or, we can move the line that clears destructor_arg to the end of __alloc_skb(). It looks like the following; which one do you prefer?
Thanks
Xiaohui

---
 net/core/skbuff.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..df852f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	shinfo->destructor_arg = NULL;
 out:
 	return skb;
 nodata:

> I dont understand why you want to do this. This adds an instruction,
> makes code bigger, and no obvious gain for me, at the memory
> transactions side. If integrated in the existing memset(), the cost is
> an extra iteration to perform the clear of this field.

Ok. Thanks for this explanation; I will update with your solution.

Thanks
Xiaohui
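Eric's point can be illustrated with a mock layout: the skb allocation path clears everything up to (but not including) dataref with a single memset of offsetof(..., dataref), so any field placed before dataref is zeroed for free. A userspace sketch with a stand-in struct, not the real skb_shared_info:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for skb_shared_info: fields placed before 'dataref' are
 * cleared by the one existing memset in the allocation path. */
struct mock_shinfo {
    unsigned short nr_frags;
    void *frag_list;
    void *destructor_arg;   /* moved before dataref, per Eric's diff */
    int dataref;            /* first field NOT covered by the memset */
    char frags[16];         /* stays uninitialized, like frags[] */
};

/* Mimics the single-memset clear done in __alloc_skb(). */
static void mock_alloc_clear(struct mock_shinfo *s)
{
    memset(s, 0, offsetof(struct mock_shinfo, dataref));
    s->dataref = 1;
}
```

With this ordering, a separate `shinfo->destructor_arg = NULL;` statement becomes unnecessary, which is exactly the extra instruction Eric objects to.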
RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
I have addressed this issue in the v14 patch set.

Thanks
Xiaohui

-----Original Message-----
From: David Miller [mailto:da...@davemloft.net]
Sent: Saturday, October 30, 2010 4:29 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.

> From: "Xin, Xiaohui" <xiaohui@intel.com>
> Date: Wed, 27 Oct 2010 09:33:12 +0800
>
>> Somehow, it seems not a trivial work to support it now. Can we
>> support it later and as a todo with our current work?
>
> I would prefer the feature work properly, rather than only in specific
> cases, before being integrated.
RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
-----Original Message-----
From: David Miller [mailto:da...@davemloft.net]
Sent: Tuesday, October 19, 2010 11:24 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.

> From: xiaohui@intel.com
> Date: Fri, 15 Oct 2010 17:12:11 +0800
>
>> @@ -2891,6 +2922,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  ncls:
>>  #endif
>>
>> +	/* To intercept mediate passthru(zero-copy) packets here */
>> +	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
>> +	if (!skb)
>> +		goto out;
>> +
>>  	/* Handle special case of bridge or macvlan */
>>  	rx_handler = rcu_dereference(skb->dev->rx_handler);
>>  	if (rx_handler) {
>
> If you consume the packet here, devices in passthru mode cannot be
> used with bonding. But there is nothing that prevents a bond being
> created with such a device. So we have to either prevent such
> configurations (bad) or make it work somehow (good) :-)

The big picture may look like this:

To prevent such configurations, we should add checks in both the mp and bonding drivers. If a NIC is in zero-copy mode, a bond can't be made with it; and if a NIC is a bond slave, we can't bind the device to do zero-copy.

If we want to support such configurations, there are also constraints. If the bond is created first, we need code to check whether all the slaves support zero-copy mode; if yes, all the slaves should be assigned the same page_ctor(), and all received packets should be intercepted through the master NIC. If not, it fails. If zero-copy is enabled first, creating a bond with that device should fail.

Somehow, it seems not a trivial work to support it now. Can we support it later and as a todo with our current work?

Thanks
Xiaohui
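The "prevent such configurations" option discussed above amounts to two symmetric checks. A hypothetical userspace model of that policy, with invented flag names for illustration, not the patch's actual API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state bits for the mutual-exclusion check. */
#define DEV_ZEROCOPY   0x1  /* bound to the mp device (zero-copy mode) */
#define DEV_BOND_SLAVE 0x2  /* already enslaved to a bond */

/* Bonding side: refuse to enslave a device already doing zero-copy. */
static bool bond_can_enslave(unsigned flags)
{
    return !(flags & DEV_ZEROCOPY);
}

/* mp side: refuse to bind for zero-copy a device that is a bond slave. */
static bool mp_can_bind(unsigned flags)
{
    return !(flags & DEV_BOND_SLAVE);
}
```

Each subsystem only needs to consult the other's flag at configuration time, so whichever mode is enabled first wins and the conflicting setup is rejected cleanly.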
RE: [PATCH v12 12/17] Add mp(mediate passthru) device.
-----Original Message-----
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Sunday, October 03, 2010 9:13 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: Re: [PATCH v12 12/17] Add mp(mediate passthru) device.

> On Thu, Sep 30, 2010 at 10:04:30PM +0800, xiaohui@intel.com wrote:
>> From: Xin Xiaohui <xiaohui@intel.com>
>>
>> The patch adds the mp (mediate passthru) device, which is now based
>> on the vhost-net backend driver and provides proto_ops to
>> send/receive guest buffer data from/to the guest virtio-net driver.
>>
>> Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
>> Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
>> Reviewed-by: Jeff Dike <jd...@linux.intel.com>
>
> So you plan to rewrite all this to make this code part of macvtap?

V13 exports functions which can be reused by macvtap to get zero-copy. Currently, I'm adding code in macvtap and trying to make it work.

Thanks
Xiaohui
RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
-----Original Message-----
From: David Miller [mailto:da...@davemloft.net]
Sent: Friday, October 01, 2010 3:15 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

> From: xiaohui@intel.com
> Date: Thu, 30 Sep 2010 22:04:23 +0800
>
>> @@ -197,10 +197,11 @@ struct skb_shared_info {
>>  	union skb_shared_tx tx_flags;
>>  	struct sk_buff	*frag_list;
>>  	struct skb_shared_hwtstamps hwtstamps;
>> -	skb_frag_t	frags[MAX_SKB_FRAGS];
>>  	/* Intermediate layers must ensure that destructor_arg
>>  	 * remains valid until skb destructor */
>>  	void		*destructor_arg;
>> +
>> +	skb_frag_t	frags[MAX_SKB_FRAGS];
>>  };
>>
>>  /* The structure is for a skb which pages may point to
>
> Why are you moving frags[] to the end like this?

That's to avoid the new cache miss caused by using destructor_arg in the data path, such as in skb_release_data(). That's based on the comment from Eric Dumazet on the v7 patches.

Thanks
Xiaohui
RE: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.
-----Original Message-----
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Thursday, September 30, 2010 10:22 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.

> Le jeudi 30 septembre 2010 à 22:04 +0800, xiaohui@intel.com a écrit :
>> From: Xin Xiaohui <xiaohui@intel.com>
>>
>> The hook is called in netif_receive_skb().
>>
>> Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
>> Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
>> Reviewed-by: Jeff Dike <jd...@linux.intel.com>
>> ---
>>  net/core/dev.c |   35 +++++++++++++++++++++++++++++++++++
>>  1 files changed, 35 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index c11e32c..83172b8 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2517,6 +2517,37 @@ err:
>>  EXPORT_SYMBOL(netdev_mp_port_prep);
>>  #endif
>>
>> +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
>> +/* Add a hook to intercept mediate passthru(zero-copy) packets,
>> + * and insert it to the socket queue owned by mp_port specially.
>> + */
>> +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
>> +					       struct packet_type **pt_prev,
>> +					       int *ret,
>> +					       struct net_device *orig_dev)
>> +{
>> +	struct mp_port *mp_port = NULL;
>> +	struct sock *sk = NULL;
>> +
>> +	if (!dev_is_mpassthru(skb->dev))
>> +		return skb;
>> +	mp_port = skb->dev->mp_port;
>> +
>> +	if (*pt_prev) {
>> +		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>> +		*pt_prev = NULL;
>> +	}
>> +
>> +	sk = mp_port->sock->sk;
>> +	skb_queue_tail(&sk->sk_receive_queue, skb);
>> +	sk->sk_state_change(sk);
>> +
>> +	return NULL;
>> +}
>> +#else
>> +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
>> +#endif
>> +
>>  /**
>>   *	netif_receive_skb - process receive buffer from network
>>   *	@skb: buffer to process
>> @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
>>  ncls:
>>  #endif
>>
>> +	/* To intercept mediate passthru(zero-copy) packets here */
>> +	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
>> +	if (!skb)
>> +		goto out;
>>  	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>>  	if (!skb)
>>  		goto out;
>
> This does not apply to current net-next-2.6. We now have
> dev->rx_handler (currently for bridge or macvlan).

Ok. Thanks, will rebase to fix that.

Thanks
Xiaohui

> commit ab95bfe01f9872459c8678572ccadbf646badad0
> Author: Jiri Pirko <jpi...@redhat.com>
> Date:   Tue Jun 1 21:52:08 2010 +0000
>
>     net: replace hooks in __netif_receive_skb V5
>
>     What this patch does is it removes two receive frame hooks (for
>     bridge and for macvlan) from __netif_receive_skb. These are
>     replaced them with a single hook for both. It only supports one
>     hook per device because it makes no sense to do bridging and
>     macvlan on the same device.
>
>     Then a network driver (of virtual netdev like macvlan or bridge)
>     can register an rx_handler for needed net device.
>
>     Signed-off-by: Jiri Pirko <jpi...@redhat.com>
>     Signed-off-by: Stephen Hemminger <shemmin...@vyatta.com>
>     Signed-off-by: David S. Miller <da...@davemloft.net>
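The rx_handler scheme Eric points to replaces per-feature hooks with a single per-device function pointer consulted in the receive path. A minimal userspace model of that dispatch, with simplified stand-in types rather than the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int consumed; };

/* One handler slot per device, as with dev->rx_handler upstream. */
typedef struct pkt *(*rx_handler_t)(struct pkt *);

struct netdev { rx_handler_t rx_handler; };

/* Receive path: if a handler is registered and consumes the packet
 * (returns NULL), normal delivery is skipped. */
static int mock_receive(struct netdev *dev, struct pkt *p)
{
    if (dev->rx_handler) {
        p = dev->rx_handler(p);
        if (!p)
            return 1;  /* consumed (e.g. queued to the mp socket) */
    }
    return 0;          /* continue normal stack delivery */
}

/* Example handler in the style of the mp/macvlan intercepts. */
static struct pkt *consume_all(struct pkt *p)
{
    p->consumed = 1;
    return NULL;
}
```

Under this model, mpassthru would register its intercept as the device's handler instead of adding another `handle_mpassthru()` call site inside `__netif_receive_skb()`.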
RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
-----Original Message-----
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Monday, October 11, 2010 11:42 PM
To: David Miller
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

> Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
>> From: "Xin, Xiaohui" <xiaohui@intel.com>
>> Date: Mon, 11 Oct 2010 15:06:05 +0800
>>
>>> That's to avoid the new cache miss caused by using destructor_arg
>>> in the data path, like skb_release_data(). That's based on the
>>> comment from Eric Dumazet on the v7 patches.
>>
>> Thanks for the explanation.
>
> Anyway, frags[] must be the last field of struct skb_shared_info
> since commit fed66381 (net: pskb_expand_head() optimization). It
> seems Xin worked on a quite old tree.

I will rebase soon.

Thanks
Xiaohui
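The constraint Eric cites exists because the pskb_expand_head() optimization copies only the used part of the shared info when frags[] is last: the copy length can stop after the last in-use frag slot instead of spanning the whole structure. A sketch with a mock struct, not the real skb_shared_info:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 17

struct mock_frag { void *page; int off, size; };

/* frags[] must stay last so a partial copy covers all live data. */
struct mock_shinfo {
    unsigned nr_frags;
    void *destructor_arg;
    struct mock_frag frags[MAX_FRAGS];
};

/* Copy only the bytes actually in use, in the spirit of the
 * pskb_expand_head() optimization; returns the copied length. */
static size_t shinfo_copy(struct mock_shinfo *dst,
                          const struct mock_shinfo *src)
{
    size_t len = offsetof(struct mock_shinfo, frags)
               + src->nr_frags * sizeof(struct mock_frag);
    memcpy(dst, src, len);
    return len;
}
```

If any field were placed after frags[], this partial copy would silently drop it, which is why moving frags[] away from the end breaks newer trees.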
RE: [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.
Will be on leave during 10/01 ~ 10/07, with slow or no response to the comments.

Thanks
Xiaohui

-----Original Message-----
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of xiaohui@intel.com
Sent: Thursday, September 30, 2010 10:04 PM
To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method by which the driver side may get external buffers to DMA into. Here "external" means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to directly DMA into it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as an interface to manipulate external buffers.
patch 14:    for vhost-net.
patch 15:    an example of modifying a NIC driver to use napi_gro_frags().
patch 16:    an example of how to get guest buffers based on a driver using napi_gro_frags().
patch 17:    a patch to address comments from Michael S. Tsirkin to add 2 new ioctls in the mp device. We split it out here to make review easier. Needs revision.

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done.

For read, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked. This means NICs can allocate user buffers from a page constructor.
We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

What we have not done yet:
  Performance tuning

what we have done in v1:
  polish the RCU usage
  deal with write logging in asynchronous mode in vhost
  add notifier block for mp device
  rename page_ctor to mp_port in netdevice.h to make it look generic
  add mp_dev_change_flags() for mp device to change NIC state
  add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
  a small fix for missing dev_put when fail
  using dynamic minor instead of static minor number
  a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
  remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation.
  Replace the struct vhost_notifier with struct kiocb. Let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg.
  use get_user_pages_fast() and set_page_dirty_lock() when read.
  Add some comments for netdev_mp_port_prep() and handle_mpassthru().
what we have done in v3:
  the async write logging is rewritten
  a drafted synchronous write function for qemu live migration
  a limit for locked pages from get_user_pages_fast() to prevent DoS, using RLIMIT_MEMLOCK

what we have done in v4:
  add an iocb completion callback from vhost-net to queue iocbs in the mp device
  replace vq->receiver with mp_sock_data_ready()
  remove stuff in the mp device which accessed structures from vhost-net
  modify skb_reserve() to ignore host NIC driver reserved space
  rebase to the latest vhost tree
  split large patches into small pieces, especially for the net core part.

what we have done in v5:
  address Arnd Bergmann's comments
    - remove the IFF_MPASSTHRU_EXCL flag in the mp device
    - Add CONFIG_COMPAT macro
    - remove mp_release ops
  move dev_is_mpassthru() to an inline func
  fix a bug in memory relinquish
  Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
  move create_iocb() out of page_dtor, which may happen in interrupt context
    - This removes the potential issues with a lock taken in interrupt context
  make
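The v3 item about capping get_user_pages_fast() with RLIMIT_MEMLOCK comes down to simple accounting: refuse to pin pages once the pinned total would exceed the limit. A hypothetical userspace sketch of that check, with invented names:

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_PAGE_SIZE 4096UL

/* Per-user pinned-page accounting, as a DoS guard on page pinning. */
struct pin_account {
    unsigned long locked_pages;  /* pages currently pinned */
    unsigned long limit_bytes;   /* e.g. RLIMIT_MEMLOCK rlim_cur */
};

/* Returns true and charges the account if npages more may be pinned. */
static bool charge_locked_pages(struct pin_account *a, unsigned long npages)
{
    unsigned long limit_pages = a->limit_bytes / MOCK_PAGE_SIZE;

    if (a->locked_pages + npages > limit_pages)
        return false;            /* would exceed the memlock limit */
    a->locked_pages += npages;
    return true;
}

/* Called when pinned pages are released (e.g. after set_page_dirty_lock
 * and put_page in the completion path). */
static void uncharge_locked_pages(struct pin_account *a, unsigned long npages)
{
    a->locked_pages -= npages;
}
```

Without such a cap, a guest could keep submitting rx buffers and pin an unbounded amount of host memory, which is the DoS the changelog mentions.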
RE: [PATCH v11 13/17] Add mp(mediate passthru) device.
-----Original Message-----
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Tuesday, September 28, 2010 5:24 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

> On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
>> From: Xin Xiaohui <xiaohui@intel.com>
>>
>> The patch adds the mp (mediate passthru) device, which is now based
>> on the vhost-net backend driver and provides proto_ops to
>> send/receive guest buffer data from/to the guest virtio-net driver.
>>
>> Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
>> Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
>> Reviewed-by: Jeff Dike <jd...@linux.intel.com>
>> ---
>>  drivers/vhost/mpassthru.c | 1407 +++++++++++++++++++++++++++++++++++
>>  1 files changed, 1407 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vhost/mpassthru.c
>>
>> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
>> new file mode 100644
>> index 000..d86d94c
>> --- /dev/null
>> +++ b/drivers/vhost/mpassthru.c
> [...]
>> +/* #define MPASSTHRU_DEBUG 1 */
>> +
>> +#ifdef MPASSTHRU_DEBUG
>> +static int debug;
>> +
>> +#define DBG  if (mp->debug) printk
>> +#define DBG1 if (debug == 2) printk
>
> This is unsafe; consider this usage:
>
> 	if (foo)
> 		DBG("bar\n");
> 	else
> 		baz();
>
> You should use the standard pr_debug() or dev_dbg() instead.

Ok. Move to pr_debug().

> [...]
>> +struct page_ctor {
>> +	struct list_head	readq;
>> +	int			wq_len;
>> +	int			rq_len;
>> +	spinlock_t		read_lock;
>
> Only one queue?!

This readq is for the mp device to store the rx guest buffers coming from vhost. For the tx side, we don't need that, since tx buffers are sent out to the device immediately.

> I would have appreciated some introductory comments on these
> structures. I still don't have any sort of clear picture of how this
> is all supposed to work.

Basically, struct page_info{} is used to store the info of each of the guest buffers and some of the vhost ring descriptor info corresponding to those buffers.
struct page_ctor{} is used to manipulate multiple struct page_info{}. Each device that can do zero-copy corresponds to one struct page_ctor{}. Anyway, I will add comments on the fields of these structures.

> [...]
>> +/* The main function to allocate external buffers */
>> +static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
>> +				      struct sk_buff *skb, int npages)
>> +{
>> +	int i;
>> +	unsigned long flags;
>> +	struct page_ctor *ctor;
>> +	struct page_info *info = NULL;
>> +
>> +	ctor = container_of(port, struct page_ctor, port);
>> +
>> +	spin_lock_irqsave(&ctor->read_lock, flags);
>> +	if (!list_empty(&ctor->readq)) {
>> +		info = list_first_entry(&ctor->readq, struct page_info, list);
>> +		list_del(&info->list);
>> +		ctor->rq_len--;
>> +	}
>> +	spin_unlock_irqrestore(&ctor->read_lock, flags);
>> +	if (!info)
>> +		return NULL;
>> +
>> +	for (i = 0; i < info->pnum; i++)
>> +		get_page(info->pages[i]);
>> +	info->skb = skb;
>> +	return &info->ext_page;
>> +}
>
> Why isn't the npages parameter used?

Sorry, somehow forgot that. Currently it only allocates one page at a time; we want to use npages to see if we break the limit.

> [...]
>> +static void relinquish_resource(struct page_ctor *ctor)
>> +{
>> +	if (!(ctor->dev->flags & IFF_UP) &&
>> +	    !(ctor->wq_len + ctor->rq_len))
>> +		printk(KERN_INFO "relinquish_resource\n");
>> +}
>
> Looks like something's missing here.

This function is for debug, as is the rq_len stuff; I'll remove them later.

>> +static void mp_ki_dtor(struct kiocb *iocb)
>> +{
>> +	struct page_info *info = (struct page_info *)(iocb->private);
>> +	int i;
>> +
>> +	if (info->flags == INFO_READ) {
>> +		for (i = 0; i < info->pnum; i++) {
>> +			if (info->pages[i]) {
>> +				set_page_dirty_lock(info->pages[i]);
>> +				put_page(info->pages[i]);
>> +			}
>> +		}
>> +		mp_hash_delete(info->ctor, info);
>> +		if (info->skb) {
>> +			info->skb->destructor = NULL;
>> +			kfree_skb(info->skb);
>> +		}
>> +		info->ctor->rq_len--;
>
> Doesn't rq_len represent the number of buffers queued between the
> guest and the driver? It is already decremented in page_ctor() so it
> seems like it gets decremented twice for each buffer. Also don't you
> need to hold the read_lock when updating rq_len?
>> +	} else
>> +		info->ctor->wq_len--;
>
> Maybe you should define rq_len and wq_len both as atomic_t.
>
> [...]
>> +static void __mp_detach(struct mp_struct *mp)
>> +{
>> +	mp->mfile = NULL;
>> +
>> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
>> +	page_ctor_detach(mp);
>> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
>
> This is racy; you should hold rtnl_lock over all
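Ben's atomic_t suggestion for rq_len/wq_len can be sketched with C11 atomics in userspace; the kernel equivalent would be atomic_inc()/atomic_dec() on atomic_t fields, which avoids taking read_lock just to keep the counters consistent:

```c
#include <assert.h>
#include <stdatomic.h>

/* Queue-length counters kept as atomics so the submit path and the
 * iocb destructor can update them without holding read_lock. */
struct mock_ctor {
    atomic_int rq_len;  /* rx buffers queued from the guest */
    atomic_int wq_len;  /* tx buffers in flight */
};

static void rq_inc(struct mock_ctor *c) { atomic_fetch_add(&c->rq_len, 1); }
static void rq_dec(struct mock_ctor *c) { atomic_fetch_sub(&c->rq_len, 1); }
static void wq_dec(struct mock_ctor *c) { atomic_fetch_sub(&c->wq_len, 1); }

static int rq_read(const struct mock_ctor *c)
{
    return atomic_load(&((struct mock_ctor *)c)->rq_len);
}
```

Note this only makes each counter update safe on its own; any invariant spanning both the counter and the readq list still needs the spinlock.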
RE: [PATCH v11 12/17] Add a kconfig entry and make entry for mp device.
Thanks, I will move this into the right order in the next version.

Thanks
Xiaohui

-----Original Message-----
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Monday, September 27, 2010 9:57 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: Re: [PATCH v11 12/17] Add a kconfig entry and make entry for mp device.

> This patch is in the wrong position in the sequence. It needs to be
> applied after mpassthru.c is created, not before.
>
> Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
RE: [PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
Ok, I will comment on it above the structure.

Thanks
Xiaohui

-----Original Message-----
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Monday, September 27, 2010 9:43 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: Re: [PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.

> On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
>> From: Xin Xiaohui <xiaohui@intel.com>
>>
>> If the driver wants to allocate external buffers, then it can export
>> its capability, such as the skb buffer header length, the page length
>> that can be DMAed, etc. The external buffer owner may utilize this.
> [...]
>
> This information needs to be included in the comment above struct
> net_device_ops, not just in the commit message.
>
> Ben.
RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.
-----Original Message-----
From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Xin, Xiaohui
Sent: Monday, September 27, 2010 8:45 AM
To: Michael S. Tsirkin
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com
Subject: RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Monday, September 27, 2010 1:02 AM
> Subject: Re: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.
>
> On Sat, Sep 25, 2010 at 12:27:18PM +0800, xiaohui@intel.com wrote:
>> We provide a zero-copy method by which the driver side may get
>> external buffers to DMA into. Here "external" means the driver
>> doesn't use kernel space to allocate skb buffers. Currently the
>> external buffers can come from the guest virtio-net driver.
>>
>> The idea is simple: just pin the guest VM user space and then let
>> the host NIC driver have the chance to directly DMA into it. The
>> patches are based on the vhost-net backend driver. We add a device
>> which provides proto_ops such as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. A KVM guest that uses the
>> vhost-net backend may bind any ethX interface on the host side to
>> get copyless data transfer through the guest virtio-net frontend.
>>
>> patch 01-10: net core and kernel changes.
>> patch 11-13: new device as an interface to manipulate external buffers.
>> patch 14:    for vhost-net.
>> patch 15:    an example of modifying a NIC driver to use napi_gro_frags().
>> patch 16:    an example of how to get guest buffers based on a driver using napi_gro_frags().
>> patch 17:    a patch to address comments from Michael S. Tsirkin to
>>              add 2 new ioctls in the mp device. We split it out here
>>              to make review easier.
> I commented on how to avoid the mm semaphore on the data path
> separately, and since you didn't have time to review that yet, I
> won't repeat that here.

>> I think I did avoid using the mm semaphore on the data path. I
>> stored the value in mp structures and check against that.

Hi Michael,

Do you think I have addressed your comments about avoiding the mm semaphore on the data path or not? I stored the value in mp structures and don't use mmap_sem. Or do you still have some concerns with it?

> At this point what are the plans on macvtap integration? You
> indicated this is the interface you intend to use longterm.

I'm now working on that. I'm trying to export some functions from the mp device and let macvtap use them for rx zero-copy.

Thanks
Xiaohui
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Sunday, September 26, 2010 7:50 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Thu, Sep 23, 2010 at 08:56:33PM +0800, Xin, Xiaohui wrote: -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 22, 2010 7:55 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote: -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Tuesday, September 21, 2010 9:14 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Monday, September 20, 2010 7:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com --- Michael, I have moved the ioctl to configure the locked memory to vhost It's ok to move this to vhost but vhost does not know how much memory is needed by the backend. I think the backend here you mean is the mp device. Actually, the memory needed is related to vq->num to run zero-copy smoothly.
That means the mp device did not know it but vhost did. Well, this might be so if you insist on locking all posted buffers immediately. However, let's assume I have a very large ring and prepost a ton of RX buffers: there's no need to lock all of them directly: if we have buffers A and B, we can lock A, pass it to hardware, and when A is consumed unlock A, lock B and pass it to hardware. It's not really critical. But note we can always have userspace tell the MP device all it wants to know, after all. Ok. Here are the two values we have mentioned: one is how much memory the user application wants to lock, and one is how much locked memory is needed to run smoothly. When the net backend is set up, we first need an ioctl to get how much memory is needed to lock, and then we call another ioctl to set how much it wants to lock. Is that what's in your mind? That's fine. And the rlimit stuff is per process; we use the current pointer to set and check the rlimit, so the operations should be in the same process. Well no, the ring is handled from the kernel thread: we switch the mm to point to the owner task so copy from/to user and friends work, but you can't access the rlimit etc. Yes, the userspace and the vhost kernel thread are not the same process. But we can record the task pointer as mm. So you will have to store mm and do device->mm, not current->mm. Anyway, better not touch mm on data path. Now the check operations are in the vhost process, as mp_recvmsg() or mp_sendmsg() are called by vhost. Hmm, what do you mean by the check operations? send/recv are data path operations, they shouldn't do any checks, should they?
As you mentioned, what the infiniband driver does is: down_write(&current->mm->mmap_sem); locked = npages + current->mm->locked_vm; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; goto out; } cur_base = addr & PAGE_MASK; ret = 0; while (npages) { ret = get_user_pages(current, current->mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), 1, !umem->writable, page_list, vma_list); I think it's a data path too. in infiniband this is used to 'register memory' which is not data path. We do the check because get_user_pages() really pins and locks the memory. Don't do this. Performance will be bad. Do the check once in ioctl and increment locked_vm by the max amount you will use. On data path just make sure you do not exceed what userspace told you to. What's in my mind is that in the ioctl which
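MST's suggestion above — do the rlimit check once in an ioctl, increment locked_vm by the maximum up front, and keep only a private counter on the data path — can be sketched as follows. This is only a userspace model, not the actual mp-device code; the struct and function names are invented for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-device accounting: the ceiling is computed once at
 * ioctl time (under mmap_sem, against RLIMIT_MEMLOCK); the data path
 * only touches this cached counter and never the mm. */
struct mp_account {
    long limit_pages;   /* set once in the ioctl */
    long locked_pages;  /* maintained on the data path */
};

/* Data-path admission: fail fast when the cached ceiling is exceeded. */
int mp_charge_pages(struct mp_account *acct, long npages)
{
    if (acct->locked_pages + npages > acct->limit_pages)
        return -ENOMEM;
    acct->locked_pages += npages;
    return 0;
}

/* Called when the buffer is released (e.g. from the skb destructor). */
void mp_uncharge_pages(struct mp_account *acct, long npages)
{
    acct->locked_pages -= npages;
}
```

A real implementation would need this counter to be atomic or lock-protected if several threads can charge it; a single vhost worker thread sidesteps that.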
RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Monday, September 27, 2010 1:02 AM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net. On Sat, Sep 25, 2010 at 12:27:18PM +0800, xiaohui@intel.com wrote: We provide a zero-copy method by which the driver side may get external buffers to DMA into. Here external means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers come from the guest virtio-net driver. The idea is simple: just pin the guest VM user space and then give the host NIC driver the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend. patch 01-10: net core and kernel changes. patch 11-13: a new device as the interface to manipulate external buffers. patch 14: for vhost-net. patch 15: an example of modifying a NIC driver to use napi_gro_frags(). patch 16: an example of how to get guest buffers based on a driver using napi_gro_frags(). patch 17: a patch to address comments from Michael S. Tsirkin, adding 2 new ioctls to the mp device. We split it out here to make review easier. I commented on how to avoid mm semaphore on data path separately, and since you didn't have time to review that yet, I won't repeat that here. I think I did avoid taking the mm semaphore on the data path: I stored the value in the mp structures and check against that. At this point what are the plans on macvtap integration? You indicated this is the interface you intend to use longterm. I'm now working on that.
Thanks Xiaohui
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
-Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 22, 2010 7:55 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote: -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Tuesday, September 21, 2010 9:14 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Monday, September 20, 2010 7:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com --- Michael, I have moved the ioctl to configure the locked memory to vhost It's ok to move this to vhost but vhost does not know how much memory is needed by the backend. I think the backend here you mean is the mp device. Actually, the memory needed is related to vq->num to run zero-copy smoothly. That means the mp device did not know it but vhost did. Well, this might be so if you insist on locking all posted buffers immediately. However, let's assume I have a very large ring and prepost a ton of RX buffers: there's no need to lock all of them directly: if we have buffers A and B, we can lock A, pass it to hardware, and when A is consumed unlock A, lock B and pass it to hardware.
It's not really critical. But note we can always have userspace tell the MP device all it wants to know, after all. Ok. Here are the two values we have mentioned: one is how much memory the user application wants to lock, and one is how much locked memory is needed to run smoothly. When the net backend is set up, we first need an ioctl to get how much memory is needed to lock, and then we call another ioctl to set how much it wants to lock. Is that what's in your mind? That's fine. And the rlimit stuff is per process; we use the current pointer to set and check the rlimit, so the operations should be in the same process. Well no, the ring is handled from the kernel thread: we switch the mm to point to the owner task so copy from/to user and friends work, but you can't access the rlimit etc. Yes, the userspace and the vhost kernel thread are not the same process. But we can record the task pointer as mm. So you will have to store mm and do device->mm, not current->mm. Anyway, better not touch mm on data path. Now the check operations are in the vhost process, as mp_recvmsg() or mp_sendmsg() are called by vhost. Hmm, what do you mean by the check operations? send/recv are data path operations, they shouldn't do any checks, should they? As you mentioned, what the infiniband driver does is: down_write(&current->mm->mmap_sem); locked = npages + current->mm->locked_vm; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; goto out; } cur_base = addr & PAGE_MASK; ret = 0; while (npages) { ret = get_user_pages(current, current->mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), 1, !umem->writable, page_list, vma_list); I think it's a data path too. in infiniband this is used to 'register memory' which is not data path. We do the check because get_user_pages() really pins and locks the memory. Don't do this. Performance will be bad. Do the check once in ioctl and increment locked_vm by the max amount you will use.
On data path just make sure you do not exceed what userspace told you to. What's in my mind is that in the ioctl which gets the locked memory needed to run smoothly, it just returns a value for how much memory is needed by the mp device. And then in the ioctl which sets the memory locked by user space, it checks the rlimit and increments locked_vm by what the user wants. But I'm not sure how to make sure we do not exceed what userspace told us to. If we don't check locked_vm, what do we use to check? And is that another kind of check on the data path? So set operations should be in the vhost process too, it's natural. So I think we'll need another ioctl
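The two-ioctl handshake discussed above — one call to query how much memory zero-copy needs, one to let userspace grant a ceiling — might look like the sketch below. The ioctl roles, the struct, and the sizing heuristic are all invented for illustration; the real mp device would derive its requirement from vq->num in some driver-specific way:

```c
#include <assert.h>
#include <errno.h>

struct mp_dev {
    long vq_num;       /* ring size vhost told the backend about */
    long limit_pages;  /* ceiling granted by userspace, 0 = not set */
};

/* GET-style ioctl: report how many locked pages this ring size needs.
 * The factor of two is a placeholder, not taken from the real patches. */
long mp_mem_needed(const struct mp_dev *mp)
{
    return mp->vq_num * 2;
}

/* SET-style ioctl: accept the grant only if it covers what is needed. */
int mp_set_mem_locked(struct mp_dev *mp, long pages)
{
    if (pages < mp_mem_needed(mp))
        return -EINVAL;
    mp->limit_pages = pages;
    return 0;
}
```

With this split, the rlimit is consulted only inside the SET path, in the caller's process context, and never on the send/receive path.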
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
-Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Tuesday, September 21, 2010 9:14 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Monday, September 20, 2010 7:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com --- Michael, I have moved the ioctl to configure the locked memory to vhost It's ok to move this to vhost but vhost does not know how much memory is needed by the backend. I think the backend here you mean is the mp device. Actually, the memory needed is related to vq->num to run zero-copy smoothly. That means the mp device did not know it but vhost did. Well, this might be so if you insist on locking all posted buffers immediately. However, let's assume I have a very large ring and prepost a ton of RX buffers: there's no need to lock all of them directly: if we have buffers A and B, we can lock A, pass it to hardware, and when A is consumed unlock A, lock B and pass it to hardware. It's not really critical. But note we can always have userspace tell the MP device all it wants to know, after all. Ok. Here are the two values we have mentioned: one is how much memory the user application wants to lock, and one is how much locked memory is needed to run smoothly. When the net backend is set up, we first need an ioctl to get how much memory is needed to lock, and then we call another ioctl to set how much it wants to lock.
Is that what's in your mind? And the rlimit stuff is per process; we use the current pointer to set and check the rlimit, so the operations should be in the same process. Well no, the ring is handled from the kernel thread: we switch the mm to point to the owner task so copy from/to user and friends work, but you can't access the rlimit etc. Yes, the userspace and the vhost kernel thread are not the same process. But we can record the task pointer as mm. Now the check operations are in the vhost process, as mp_recvmsg() or mp_sendmsg() are called by vhost. Hmm, what do you mean by the check operations? send/recv are data path operations, they shouldn't do any checks, should they? As you mentioned, what the infiniband driver does is: down_write(&current->mm->mmap_sem); locked = npages + current->mm->locked_vm; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; goto out; } cur_base = addr & PAGE_MASK; ret = 0; while (npages) { ret = get_user_pages(current, current->mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), 1, !umem->writable, page_list, vma_list); I think it's a data path too. We do the check because get_user_pages() really pins and locks the memory. So set operations should be in the vhost process too, it's natural. So I think we'll need another ioctl in the backend to tell userspace how much memory is needed? Unless vhost tells it to the mp device, mp does not know how much memory is needed to run zero-copy smoothly. Is userspace interested about the memory mp is needed? Couldn't parse this last question. I think userspace generally does want control over how much memory we'll lock. We should not just lock as much as we can. -- MST
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Monday, September 20, 2010 7:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com --- Michael, I have moved the ioctl to configure the locked memory to vhost It's ok to move this to vhost but vhost does not know how much memory is needed by the backend. I think the backend here you mean is the mp device. Actually, the memory needed is related to vq->num to run zero-copy smoothly. That means the mp device did not know it but vhost did. And the rlimit stuff is per process; we use the current pointer to set and check the rlimit, so the operations should be in the same process. Now the check operations are in the vhost process, as mp_recvmsg() or mp_sendmsg() are called by vhost. So set operations should be in the vhost process too, it's natural. So I think we'll need another ioctl in the backend to tell userspace how much memory is needed? Unless vhost tells it to the mp device, mp does not know how much memory is needed to run zero-copy smoothly. Is userspace interested about the memory mp is needed? It seems a bit cleaner as a backend ioctl as vhost does not lock memory itself, but I am not principally opposed. and check the limit with mm->locked_vm. Please have a look.
Thanks Xiaohui

 drivers/vhost/mpassthru.c | 74 +--
 drivers/vhost/net.c       | 78 ++--
 include/linux/vhost.h     |  3 ++
 3 files changed, 85 insertions(+), 70 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index d86d94c..fd3827b 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -109,9 +109,6 @@ struct page_ctor {
 	int wq_len;
 	int rq_len;
 	spinlock_t read_lock;
-	/* record the locked pages */
-	int lock_pages;
-	struct rlimit o_rlim;
 	struct net_device *dev;
 	struct mpassthru_port port;
 	struct page_info	**hash_table;
@@ -231,7 +228,6 @@ static int page_ctor_attach(struct mp_struct *mp)
 	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->port.hash = mp_lookup;
-	ctor->lock_pages = 0;
 	/* locked by mp_mutex */
 	dev->mp_port = &ctor->port;
@@ -264,37 +260,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
 	return info;
 }
 
-static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
-			      unsigned long cur, unsigned long max)
-{
-	struct rlimit new_rlim, *old_rlim;
-	int retval;
-
-	if (resource != RLIMIT_MEMLOCK)
-		return -EINVAL;
-	new_rlim.rlim_cur = cur;
-	new_rlim.rlim_max = max;
-
-	old_rlim = current->signal->rlim + resource;
-
-	/* remember the old rlimit value when backend enabled */
-	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
-	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
-
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-	    !capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	retval = security_task_setrlimit(resource, &new_rlim);
-	if (retval)
-		return retval;
-
-	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
-	task_unlock(current->group_leader);
-	return 0;
-}
-
 static void relinquish_resource(struct page_ctor *ctor)
 {
 	if (!(ctor->dev->flags & IFF_UP) &&
@@ -322,8 +287,6 @@ static void mp_ki_dtor(struct kiocb *iocb)
 		info->ctor->rq_len--;
 	} else
 		info->ctor->wq_len--;
-	/* Decrement the number of locked pages */
-	info->ctor->lock_pages -= info->pnum;
 	kmem_cache_free(ext_page_info_cache, info);
 	relinquish_resource(info->ctor);
@@ -349,7 +312,7 @@ static struct kiocb *create_iocb(struct page_info *info, int size)
 		iocb->ki_dtor(iocb);
 	iocb->private = (void *)info;
 	iocb->ki_dtor = mp_ki_dtor;
-
+	iocb->ki_user_data = info->pnum;
 	return iocb;
 }
@@ -375,10 +338,6 @@ static int page_ctor_detach(struct mp_struct *mp)
 	relinquish_resource(ctor);
 
-	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-			   ctor->o_rlim.rlim_cur,
-			   ctor->o_rlim.rlim_max);
-
 	/* locked by mp_mutex */
 	ctor->dev->mp_port = NULL;
 	dev_put(ctor->dev);
@@ -565,21 +524,23 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
 	int rc;
 	int i, j, n = 0;
 	int len;
-	unsigned long base, lock_limit;
+	unsigned long base, lock_limit, locked;
 	struct page_info *info = NULL;
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 5:59 PM To: Xin, Xiaohui Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? I have trouble parsing the above question. At some point Arnd suggested that the mp device functionality would fit nicely as part of the macvtap driver. It seems to make sense superficially, the advantage if it worked would be that we would get zero copy (mostly) transparently. Do you agree with this goal? I would say yes. Earlier, Shirley Ma suggested moving the zero-copy functionality into the tun/tap device or the macvtap device. What do you think about that?
I suspect there will be a lot of duplicate code in those three drivers unless we can extract the zero-copy code into kernel APIs and vhost APIs. tap would be very hard at this point as it does not bind to a device. macvtap might work, we mainly need to figure out a way to detect that a device can do zero copy so the right mode is used. I think a first step could be to simply link the mp code into the macvtap module, pass the necessary ioctls on, then move some code around as necessary. This might get rid of code duplication nicely. I'll look into this to see how much effort it would be. Do you think that's worth doing, and would it help the current process, which has been blocked longer than I expected? I think it's nice to have. And if done hopefully this will get the folk working on the macvtap driver to review the code, which will help find all issues faster. I think if you post some performance numbers, this will also help get people excited and looking at the code. The performance data I posted before was compared with a raw socket on vhost-net. But currently, the raw socket backend is removed from the qemu side. So I can only compare with tap on vhost-net. But unfortunately, I missed something, so I can't even bring it up. I was blocked by this for a time. I also don't see the process as completely blocked, each review round points out more issues: we aren't going back and forth changing the same lines again and again, are we? One thing that might help is to increase the frequency of updates, try sending them out sooner. On the other hand 10 new patches each revision is a lot: if there is a part of the patchset that has stabilised you can split it out, post it once and keep posting the changing part separately. I hope these suggestions help. Thanks, Michael! -- MST
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 7:28 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Wed, Sep 15, 2010 at 11:13:44AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Sunday, September 12, 2010 9:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote: Playing with rlimit on data path, transparently to the application in this way looks strange to me, I suspect this has unexpected security implications. Further, applications may have other uses for locked memory besides mpassthru - you should not just take it because it's there. Can we have an ioctl that lets userspace configure how much memory to lock? This ioctl will decrement the rlimit and store the data in the device structure so we can do accounting internally. Put it back on close or on another ioctl. Yes, we can decrement the rlimit in the ioctl in one go to avoid the data path. Need to be careful about when this operation gets called again with 0 or another small value while we have locked memory - maybe just fail with EBUSY? or wait until it gets unlocked? Maybe 0 can be special-cased to deactivate zero-copy?. How about we don't use a new ioctl, but just check the rlimit in the one MPASSTHRU_BINDDEV ioctl? If we find the mp device breaks the rlimit, then we fail the bind ioctl, and thus can't do zero copy any more. Yes, and not just check, but decrement as well. I think we should give userspace control over how much memory we can lock and subtract from the rlimit.
It's OK to add this as a parameter to MPASSTHRU_BINDDEV. Then increment the rlimit back on unbind and on close? This opens up an interesting condition: process 1 calls bind, process 2 calls unbind or close. This will increment the rlimit for process 2. Not sure how to fix this properly. I can't either; can we do any synchronized operations on the rlimit stuff? I rather doubt it. -- MST Here's what infiniband does: simply pass the amount of memory userspace wants you to lock on an ioctl, and verify that either you have CAP_IPC_LOCK or this number does not exceed the current rlimit. (must be on ioctl, not on open, as we likely want the fd passed around between processes), but do not decrement the rlimit. Use this on the following operations. Be careful if this can be changed while operations are in progress. This does mean that the effective amount of memory that userspace can lock is doubled, but at least it is not unlimited, and we sidestep all other issues such as userspace running out of lockable memory simply by virtue of using the driver. What I have done in the mp device is almost the same as that. The difference is that I do not check the capability, and I use my own parameter ctor->pages instead of mm->locked_vm. So currently I will: 1) add the capability check 2) use mm->locked_vm 3) add an ioctl for userspace to configure how much memory can be locked. -- MST
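The infiniband-style check MST describes — verify CAP_IPC_LOCK or compare the request against RLIMIT_MEMLOCK at ioctl time, without ever decrementing the rlimit — reduces to a predicate like the one below. The capability lookup is stubbed as a flag, since this is only a sketch of the policy, not kernel code:

```c
#include <assert.h>

/* Admission check at bind-ioctl time: privileged tasks may lock any
 * amount; others are bounded by their RLIMIT_MEMLOCK, expressed in
 * pages. The rlimit itself is only read, never decremented. */
int mp_may_lock(long want_pages, long rlimit_pages, int has_cap_ipc_lock)
{
    if (has_cap_ipc_lock)
        return 1;
    return want_pages <= rlimit_pages;
}
```

As the thread notes, the trade-off of not decrementing is that the effective lockable amount can double (mlock and the driver each get the full limit), but the unbind/close accounting race disappears.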
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Shirley Ma [mailto:mashi...@us.ibm.com] Sent: Tuesday, September 14, 2010 11:05 PM To: Avi Kivity Cc: David Miller; a...@arndb.de; m...@redhat.com; Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: +base = (unsigned long)from->iov_base + offset1; +size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT; +num_pages = get_user_pages_fast(base, size, 0, &page[i]); +if ((num_pages != size) || +(num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) +/* put_page is in skb free */ +return -EFAULT; What keeps the user from writing to these pages in its address space after the write call returns? A write() return of success means "I wrote what you gave to me", not "I wrote what you gave to me, oh and BTW don't touch these pages for a while." In fact "a while" isn't even defined in any way, as there is no way for the write() invoker to know when the networking card is done with those pages. That's what io_submit() is for. Then io_getevents() tells you what "a while" actually was. This macvtap zero copy uses iov buffers from the vhost ring, which is allocated from the guest kernel. In the host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handles these buffers. I need to look at the vhost code (qemu) first to address the questions here. Thanks Shirley I think what David said is what we had thought about before in the mp device. Since we are not sure of the exact time the tx buffer was written through the DMA operation, the deadline is when the tx buffer is freed. So we only notify the vhost stuff about the write when the tx buffer is freed. But that deadline may be too late for performance.
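The "notify on free" idea above — treating the buffer's release as the only safe signal that DMA has finished — can be modeled as a refcounted buffer that fires a completion callback on its last put, much as the mp patches hang a callback off the skb via destructor_arg. All names here are illustrative, not the actual patch API:

```c
#include <assert.h>
#include <stddef.h>

/* A zero-copy tx buffer: the callback stands in for signalling the
 * vhost used ring (e.g. vhost_add_used_and_signal) once the data is
 * certainly no longer referenced by the NIC. */
struct zc_buf {
    int refcnt;
    void (*complete)(void *ctx);
    void *ctx;
};

void zc_get(struct zc_buf *b) { b->refcnt++; }

void zc_put(struct zc_buf *b)
{
    if (--b->refcnt == 0 && b->complete)
        b->complete(b->ctx);  /* the write deadline: buffer freed */
}

/* Tiny demonstration callback: counts completions in an int. */
void zc_count_completion(void *ctx) { (*(int *)ctx)++; }
```

The performance concern in the thread falls out of this model directly: the guest's tx descriptor cannot be recycled until the last put, which may be long after the packet actually left the wire.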
Thanks Xiaohui
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Arnd Bergmann [mailto:a...@arndb.de] Sent: Tuesday, September 14, 2010 11:21 PM To: Shirley Ma Cc: Avi Kivity; David Miller; m...@redhat.com; Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tuesday 14 September 2010, Shirley Ma wrote: On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: That's what io_submit() is for. Then io_getevents() tells you what "a while" actually was. This macvtap zero copy uses iov buffers from the vhost ring, which is allocated from the guest kernel. In the host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handles these buffers. I need to look at the vhost code (qemu) first to address the questions here. I guess the best solution would be to make macvtap_aio_write return -EIOCBQUEUED when a packet gets passed down to the adapter, and call aio_complete when the adapter is done with it. This would change the regular behavior of macvtap into a model where every write on the file blocks until the packet has left the machine, which gives us better flow control, but does slow down the traffic when we only put one packet at a time into the queue. It also allows the user to call io_submit instead of write in order to do an asynchronous submission as Avi was suggesting. But currently, this patch communicates with vhost-net, which is almost entirely on the kernel side. If it used the aio stuff, it would have to communicate with a user-space backend. Arnd
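From userspace, the model Arnd describes — the write is queued (-EIOCBQUEUED) and aio_complete() fires later — is just the asynchronous-I/O submit/complete pattern. A minimal POSIX-aio sketch follows, using a regular file as a stand-in for a macvtap fd (this example does not touch macvtap or vhost at all):

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Submit one asynchronous write, then wait for its completion event,
 * instead of blocking inside write() itself. */
ssize_t async_write_once(const char *path, const char *msg)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)msg;
    cb.aio_nbytes = strlen(msg);

    if (aio_write(&cb) != 0) {          /* request queued, like -EIOCBQUEUED */
        close(fd);
        return -1;
    }

    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);     /* the "aio_complete" moment */

    ssize_t n = aio_return(&cb);
    close(fd);
    return n;
}
```

The point of the pattern for zero-copy tx is that "done" arrives as a separate completion event, so the caller knows exactly when the pages may be reused.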
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? Earlier, Shirley Ma suggested moving the zero-copy functionality into the tun/tap device or the macvtap device. What do you think about that? I suspect there will be a lot of duplicate code in those three drivers unless we can extract the zero-copy code into kernel APIs and vhost APIs. Do you think that's worth doing, and would it help the current process, which has been blocked longer than I expected? -- MST
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Shirley Ma [mailto:mashi...@us.ibm.com] Sent: Wednesday, September 15, 2010 10:41 AM To: Xin, Xiaohui Cc: Avi Kivity; David Miller; a...@arndb.de; m...@redhat.com; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote: I think what David said is what we have thought before in mp device. Since we are not sure the exact time the tx buffer was wrote though DMA operation. But the deadline is when the tx buffer was freed. So we only notify the vhost stuff about the write when tx buffer freed. But the deadline is maybe too late for performance. Have you tried it? If so what's the performance penalty you have seen by notifying vhost when tx buffer freed? We did not try it before, as we cared RX side more. I am thinking to have a callback in skb destructor, vhost_add_used_and_signal gets updated when skb is actually freed, vhost vq head need to be passed to the callback. This might requires vhost ring size is at least as big as the lower device driver. That's almost the same what we have done except we use destructor_arg and another callback.. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
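The scheme both sides describe above (stash a completion callback plus the vhost vq head in the skb's destructor_arg, and signal vhost only when the skb is actually freed, i.e. after the NIC has finished the DMA) can be sketched as a small user-space model. All names here are illustrative stand-ins for the kernel structures, not the real API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* What the zero-copy tx path would hang off skb_shinfo(skb)->destructor_arg:
 * a completion callback plus the vhost ring head for this packet. */
struct tx_completion {
    void (*complete)(unsigned int head, void *vq); /* e.g. add-used + signal */
    unsigned int head;   /* vhost vq descriptor head for this packet */
    void *vq;
};

struct fake_skb {
    struct tx_completion *destructor_arg;
};

/* What the skb destructor would do on the final kfree_skb(): only here,
 * when the NIC is provably done with the buffer, is vhost signalled. */
static void fake_kfree_skb(struct fake_skb *skb)
{
    struct tx_completion *c = skb->destructor_arg;
    if (c)
        c->complete(c->head, c->vq);
    free(skb);
}

/* Toy "vhost": record completions so the effect is observable. */
static unsigned int completed_head;
static int signals;
static void toy_vhost_signal(unsigned int head, void *vq)
{
    (void)vq;
    completed_head = head;
    signals++;
}
```

Note the constraint Shirley mentions: since descriptors stay outstanding until the driver frees the skb, the vhost ring must be at least as deep as the lower device's tx queue, or the ring fills up while packets are still in flight.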
RE: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.
Herbert, Any comments on the modifications of the net core and driver side of this patch? Thanks Xiaohui -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of xiaohui@intel.com Sent: Saturday, September 11, 2010 5:53 PM To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external buffers to DMA into. Here "external" means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver. The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA directly into it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net, to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as interface to manipulate external buffers.
patch 14: for vhost-net.
patch 15: an example of modifying a NIC driver to use napi_gro_frags().
patch 16: an example of how to get guest buffers, based on a driver which uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done. For read, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.
For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

What we have not done yet: performance tuning.

what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
using a dynamic minor instead of a static minor number
a __KERNEL__ protect for mp_get_sock()

what we have done in v2:
remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation.
Replace the struct vhost_notifier with struct kiocb. Let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg.
use get_user_pages_fast() and set_page_dirty_lock() when read.
Add some comments for netdev_mp_port_prep() and handle_mpassthru().
what we have done in v3:
the async write logging is rewritten
a drafted synchronous write function for qemu live migration
a limit for locked pages from get_user_pages_fast() to prevent DoS, by using RLIMIT_MEMLOCK

what we have done in v4:
add iocb completion callback from vhost-net to queue iocbs in the mp device
replace vq->receiver by mp_sock_data_ready()
remove stuff in the mp device which accesses structures from vhost-net
modify skb_reserve() to ignore host NIC driver reserved space
rebase to the latest vhost tree
split large patches into small pieces, especially for the net core part.

what we have done in v5:
address Arnd Bergmann's comments
-remove IFF_MPASSTHRU_EXCL flag in mp device
-Add CONFIG_COMPAT macro
-remove mp_release ops
move dev_is_mpassthru() as an inline func
fix a bug in memory relinquish
Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
move create_iocb() out of page_dtor, which may happen in interrupt context
-This removes the potential issue of a lock taken in interrupt context
make the caches used by mp and vhost static, created/destroyed during the modules' init/exit functions.
-This allows multiple mp guests to be created at the same time.

what we have done
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Sunday, September 12, 2010 9:37 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device. On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:

Playing with the rlimit on the data path, transparently to the application, looks strange to me this way; I suspect this has unexpected security implications. Further, applications may have other uses for locked memory besides mpassthru - you should not just take it because it's there. Can we have an ioctl that lets userspace configure how much memory to lock? This ioctl will decrement the rlimit and store the data in the device structure so we can do accounting internally. Put it back on close or on another ioctl.

Yes, we can decrement the rlimit in the ioctl at one time, to keep it off the data path.

Need to be careful for when this operation gets called again with 0 or another small value while we have locked memory - maybe just fail with EBUSY? or wait until it gets unlocked? Maybe 0 can be special-cased and deactivate zero-copy?

How about we don't use a new ioctl, but just check the rlimit in the one MPASSTHRU_BINDDEV ioctl? If we find the mp device breaks the rlimit, then we fail the bind ioctl, and thus can't do zero copy any more.

Yes, and not just check, but decrement as well. I think we should give userspace control over how much memory we can lock and subtract from the rlimit. It's OK to add this as a parameter to MPASSTHRU_BINDDEV.

Then increment the rlimit back on unbind and on close?

This opens up an interesting condition: process 1 calls bind, process 2 calls unbind or close. This will increment the rlimit for process 2. Not sure how to fix this properly.

I can't either. Can we do any synchronous operations on the rlimit stuff? I rather doubt it.
-- MST
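The accounting scheme being negotiated above (take a page budget out of the rlimit once at MPASSTHRU_BINDDEV, fail a re-bind with EBUSY while pages are pinned, special-case 0 to deactivate zero-copy, and return the budget on unbind) can be sketched as a user-space model. Every name here is hypothetical; this only models the bookkeeping, not the kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of bind-time lock accounting: the ioctl reserves a
 * page budget from the process rlimit once, instead of consulting the
 * rlimit on every data-path allocation. */
struct mp_ctor {
    long rlimit_pages;   /* stand-in for RLIMIT_MEMLOCK, in pages */
    long budget;         /* pages reserved for zero-copy at bind time */
    long locked;         /* pages currently pinned */
};

/* MPASSTHRU_BINDDEV with a "how many pages may we lock" parameter. */
static int mp_bind(struct mp_ctor *c, long want_pages)
{
    if (c->locked > 0)
        return -EBUSY;             /* re-bind while pages are pinned */
    if (want_pages == 0)
        return 0;                  /* 0 deactivates zero-copy */
    if (want_pages > c->rlimit_pages)
        return -ENOMEM;
    c->rlimit_pages -= want_pages; /* subtract from the rlimit once */
    c->budget = want_pages;
    return 0;
}

/* Data path only checks the pre-reserved budget, never the rlimit. */
static int mp_lock_pages(struct mp_ctor *c, long n)
{
    if (c->locked + n > c->budget)
        return -ENOMEM;
    c->locked += n;
    return 0;
}

/* Unbind/close: give the reservation back. */
static void mp_unbind(struct mp_ctor *c)
{
    c->rlimit_pages += c->budget;
    c->budget = 0;
}
```

The cross-process hazard MST raises still shows through the model: if a different task runs `mp_unbind`, the budget is credited to whatever rlimit that task sees, which is exactly the unsolved part of the discussion.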
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
Playing with the rlimit on the data path, transparently to the application, looks strange to me this way; I suspect this has unexpected security implications. Further, applications may have other uses for locked memory besides mpassthru - you should not just take it because it's there. Can we have an ioctl that lets userspace configure how much memory to lock? This ioctl will decrement the rlimit and store the data in the device structure so we can do accounting internally. Put it back on close or on another ioctl.

Yes, we can decrement the rlimit in the ioctl at one time, to keep it off the data path.

Need to be careful for when this operation gets called again with 0 or another small value while we have locked memory - maybe just fail with EBUSY? or wait until it gets unlocked? Maybe 0 can be special-cased and deactivate zero-copy?

How about we don't use a new ioctl, but just check the rlimit in the one MPASSTHRU_BINDDEV ioctl? If we find the mp device breaks the rlimit, then we fail the bind ioctl, and thus can't do zero copy any more. In fact, if we choose RLIMIT_MEMLOCK to limit the locked memory, the default value is only 16 pages. That's too small for the device to work, so we always have to configure it with a large value. I think, if the rlimit value after the decrement is 0, then deactivating zero-copy is better. 0 may be ok.

+
+	if (ctor->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, let me check this.

+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,

npages controls whether this is a write? Why?

We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with npages = 0.
+					 &info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif

Are non-highdma devices worth bothering with? If yes - are there other limitations devices might have that we need to handle? E.g. what about non-s/g devices or no checksum offloading?

Basically I think there are no limitations for either, but let me have a check.

+	skb_push(skb, ETH_HLEN);
+
+	if (skb_is_gso(skb)) {
+		hdr.hdr.hdr_len = skb_headlen(skb);
+		hdr.hdr.gso_size = shinfo->gso_size;
+		if (shinfo->gso_type & SKB_GSO_TCPV4)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+		else if (shinfo->gso_type & SKB_GSO_TCPV6)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+		else if (shinfo->gso_type & SKB_GSO_UDP)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+		else
+			BUG();
+		if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+			hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+	} else
+		hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+		hdr.hdr.csum_start =
+			skb->csum_start - skb_headroom(skb);
+		hdr.hdr.csum_offset = skb->csum_offset;
+	}

We have this code in tun, macvtap and packet socket already. Could this be a good time to move these into the networking core? I'm not asking you to do this right now, but could this generic virtio-net-to-skb stuff be encapsulated in functions?

It seems reasonable.
-- MST
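The frag-filling arithmetic reviewed above is self-contained enough to model in user space: count how many pages an iovec element spans with `((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT`, then split the range into page-sized frags exactly as the loop over the get_user_pages_fast() result does. This is an illustrative re-implementation of that math only (4 KiB pages assumed), not the mp device code:

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK  (~(MODEL_PAGE_SIZE - 1))

struct model_frag { unsigned long offset; unsigned long size; };

/* Same expression as the patch: ceil((offset_in_page + len) / PAGE_SIZE). */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
    return ((base & ~MODEL_PAGE_MASK) + len + ~MODEL_PAGE_MASK)
           >> MODEL_PAGE_SHIFT;
}

/* Mirror of the while (n--) loop: one frag per page touched, the first
 * starting at the in-page offset, the rest page-aligned. Returns the
 * number of frags produced. */
static int fill_frags(unsigned long base, unsigned long len,
                      struct model_frag *frags)
{
    int j = 0;
    unsigned long n = pages_spanned(base, len);
    while (n--) {
        unsigned long room;
        frags[j].offset = base & ~MODEL_PAGE_MASK;
        room = MODEL_PAGE_SIZE - frags[j].offset;
        frags[j].size = len < room ? len : room;
        len  -= frags[j].size;
        base += frags[j].size;
        j++;
    }
    return j;
}
```

For example, a 0x300-byte buffer starting 0xF00 bytes into a page straddles the page boundary and yields two frags: 0x100 bytes at offset 0xF00 and 0x200 bytes at offset 0.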
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
+
+	if (ctor->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, let me check this. It's mainly for info->next and info->prev; these two fields will be used in hash functions. But you are right, since most fields will be refilled. The new version includes the fix.

Thanks
Xiaohui
RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
Michael, Sorry to reply to the mail late.

So - does this driver help reduce service demand significantly?

I'm looking at the performance now.

Some comments from looking at the code: On Fri, Aug 06, 2010 at 05:23:41PM +0800, xiaohui@intel.com wrote:

+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base, lock_limit;
+	struct page_info *info = NULL;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;

Playing with the rlimit on the data path, transparently to the application, looks strange to me this way; I suspect this has unexpected security implications. Further, applications may have other uses for locked memory besides mpassthru - you should not just take it because it's there. Can we have an ioctl that lets userspace configure how much memory to lock? This ioctl will decrement the rlimit and store the data in the device structure so we can do accounting internally. Put it back on close or on another ioctl.

Yes, we can decrement the rlimit in the ioctl at one time, to keep it off the data path.

Need to be careful for when this operation gets called again with 0 or another small value while we have locked memory - maybe just fail with EBUSY? or wait until it gets unlocked? Maybe 0 can be special-cased and deactivate zero-copy?

In fact, if we choose RLIMIT_MEMLOCK to limit the locked memory, the default value is only 16 pages. That's too small for the device to work, so we always have to configure it with a large value. I think, if the rlimit value after the decrement is 0, then deactivating zero-copy is better. 0 may be ok.

+
+	if (ctor->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, let me check this.
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,

npages controls whether this is a write? Why?

We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with npages = 0.

+					 &info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif

Are non-highdma devices worth bothering with? If yes - are there other limitations devices might have that we need to handle? E.g. what about non-s/g devices or no checksum offloading?

Basically I think there are no limitations for either, but let me have a check.

+	skb_push(skb, ETH_HLEN);
+
+	if (skb_is_gso(skb)) {
+		hdr.hdr.hdr_len = skb_headlen(skb);
+		hdr.hdr.gso_size = shinfo->gso_size;
+		if (shinfo->gso_type & SKB_GSO_TCPV4)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+		else if (shinfo->gso_type & SKB_GSO_TCPV6)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+		else if (shinfo->gso_type & SKB_GSO_UDP)
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+		else
+			BUG();
+		if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+			hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+	} else
+		hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+		hdr.hdr.csum_start =
+			skb->csum_start - skb_headroom(skb);
+		hdr.hdr.csum_offset = skb->csum_offset;
+	}

We have this code in tun, macvtap and packet socket already. Could this be a good time to move these into the networking core? I'm not asking you to do this right now, but could
RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
Herbert, The v8 patches are modified mostly based on your comments about the napi_gro_frags interface. What do you think about the net core system part of the patches? We know there are currently some comments about the mp device, such as supporting zero-copy for tun/tap and macvtap, and there isn't a decision yet about it. Could you give comments on the net core part first, since this part is the same for any zero-copy approach? Thanks Xiaohui -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of xiaohui@intel.com Sent: Thursday, July 29, 2010 7:15 PM To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external buffers to DMA into. Here "external" means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver. The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA directly into it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net, to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as interface to manipulate external buffers.
patch 14: for vhost-net.
patch 15: an example of modifying a NIC driver to use napi_gro_frags().
patch 16: an example of how to get guest buffers, based on a driver which uses napi_gro_frags().

The guest virtio-net driver submits multiple requests thru the vhost-net backend driver to the kernel.
And the requests are queued and then completed after the corresponding actions in h/w are done. For read, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

What we have not done yet: performance tuning.

what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
using a dynamic minor instead of a static minor number
a __KERNEL__ protect for mp_get_sock()

what we have done in v2:
remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation.
Replace the struct vhost_notifier with struct kiocb. Let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg.
use get_user_pages_fast() and set_page_dirty_lock() when read.
Add some comments for netdev_mp_port_prep() and handle_mpassthru().
what we have done in v3:
the async write logging is rewritten
a drafted synchronous write function for qemu live migration
a limit for locked pages from get_user_pages_fast() to prevent DoS, by using RLIMIT_MEMLOCK

what we have done in v4:
add iocb completion callback from vhost-net to queue iocbs in the mp device
replace vq->receiver by mp_sock_data_ready()
remove stuff in the mp device which accesses structures from vhost-net
modify skb_reserve() to ignore host NIC driver reserved space
rebase to the latest vhost tree
split large patches into small pieces, especially for the net core part.

what we have done in v5:
address Arnd Bergmann's comments
-remove IFF_MPASSTHRU_EXCL flag in mp device
-Add CONFIG_COMPAT macro
-remove mp_release ops
move dev_is_mpassthru() as an inline func
fix a bug in memory relinquish
Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
move create_iocb() out of
RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
-Original Message- From: Shirley Ma [mailto:mashi...@us.ibm.com] Sent: Friday, July 30, 2010 6:31 AM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net. Hello Xiaohui, On Thu, 2010-07-29 at 19:14 +0800, xiaohui@intel.com wrote:

The idea is simple, just to pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

Since vhost-net already supports macvtap/tun backends, do you think it's better to implement zero copy in macvtap/tun than to introduce a new media passthrough device here?

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

I did some vhost performance measurement over 10Gb ixgbe, and found that in order to get consistent BW results, the netperf/netserver, qemu, and vhost threads' smp affinities are required. Looking forward to these results for the small-message-size comparison. For large message sizes, 10Gb ixgbe BW is already reached by setting vhost smp affinity with offloading support; we will see how much the CPU utilization can be reduced. Please provide latency results as well. I did some experiments on macvtap zero-copy sendmsg, and what I found is that get_user_pages latency is pretty high.

Could you share your performance results (including BW and latency) on vhost-net, and how you got them (your configuration, especially the affinity settings)?
Thanks
Xiaohui

Thanks
Shirley
RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
Hello Xiaohui, On Thu, 2010-07-29 at 19:14 +0800, xiaohui@intel.com wrote:

The idea is simple, just to pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

Since vhost-net already supports macvtap/tun backends, do you think it's better to implement zero copy in macvtap/tun than to introduce a new media passthrough device here?

I'm not sure if there will be more duplicated code in the kernel.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.

I did some vhost performance measurement over 10Gb ixgbe, and found that in order to get consistent BW results, the netperf/netserver, qemu, and vhost threads' smp affinities are required. Looking forward to these results for the small-message-size comparison. For large message sizes, 10Gb ixgbe BW is already reached by setting vhost smp affinity with offloading support; we will see how much the CPU utilization can be reduced. Please provide latency results as well. I did some experiments on macvtap zero-copy sendmsg, and what I found is that get_user_pages latency is pretty high.

Ok, I will try that.

Thanks
Shirley
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Sunday, June 27, 2010 2:15 PM To: Dong, Eddie Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:

In the current patch, each SKB for the assigned device (SRIOV VF, or NIC, or a complete queue pair) uses the buffer from the guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additional place to store the buffer per skb (which may cause a copy later on), we can do the copy later on, or re-post the buffer to the assigned NIC driver later on. But that may be not very clean either :(

OK, if I understand you correctly then I don't think we have a problem. With your current patch-set you have exactly the same situation when skb->data is reallocated as a kernel buffer.

When will skb->data be reallocated? Could you point me to the code path?

This is OK because, as you correctly argue, it is a rare situation. With my proposal you will need to get this extra external buffer in even fewer cases, because you'd only need to do it if the skb head grows, which only happens if it becomes encapsulated. So let me explain it in a bit more detail: Our packet starts out as a purely non-linear skb, i.e., skb->head contains nothing and all the page frags come from the guest. During host processing we may pull data into skb->head, but the first frag will remain unless we pull all of it. If we did do that, then you would have a free external buffer anyway. Now in the common case the header may be modified or pulled, but it very rarely grows. So you can just copy the header back into the first frag just before we give it to the guest.

Since the data is still there, recomputing the page offset and size is ok, right?
Only in the case where the packet header grows (e.g., encapsulation) would you need to get an extra external buffer. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
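The offset/size bookkeeping Herbert and Xiaohui agree on above (headers pulled out of the first guest frag into skb->head during host processing, then copied back in front of the remaining data before handing the buffer to the guest) amounts to two small adjustments on the frag descriptor. The sketch below is an illustrative model of just that arithmetic, not kernel code:

```c
#include <assert.h>

/* Model of a page frag descriptor: where in the (guest) page the data
 * starts, and how many bytes it covers. */
struct model_frag { unsigned long offset; unsigned long size; };

/* Host processing pulls `pulled` header bytes out of the frag into
 * skb->head: the frag shrinks from the front. */
static void pull_header(struct model_frag *f, unsigned long pulled)
{
    f->offset += pulled;
    f->size   -= pulled;
}

/* Before the buffer goes to the guest, the (possibly modified,
 * same-length) header is copied back just below the remaining data:
 * the frag simply grows back toward its old start. Only a growing
 * header (encapsulation) would need an extra external buffer. */
static void push_header_back(struct model_frag *f, unsigned long hdrlen)
{
    f->offset -= hdrlen;
    f->size   += hdrlen;
}
```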
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
Herbert, That's why I have sent you the patch for the guest virtio-net driver. I reserved 512 bytes in each page, so I can always have the space to copy into and avoid the issue of the backend memory being used up. Thanks Xiaohui -Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Thursday, June 24, 2010 6:09 PM To: Dong, Eddie Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:

I mean once the frontend side driver posts the buffers to the backend driver, the backend driver will immediately use those buffers to compose skbs or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver receives a packet from the NIC that requires a copy, it may be unable to find an additional free guest buffer, because all of them are already used by the NIC driver. We have to reserve some guest buffers for the possible copy, even if the buffer address is not identified by the original skb :(

OK, I see what you mean. Can you tell me how Xiaohui's previous patch-set deals with this problem? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Friday, June 18, 2010 1:59 PM To: Xin, Xiaohui Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com; Rusty Russell Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:

Herbert, I have questions about the idea above:

1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(), we don't need napi_gro_frags() any more; the driver's original receiving function is ok. Right?

Well, I was actually thinking about converting all drivers that need this to napi_gro_frags. But now that you mention it, yes, we could still keep the old interface to minimise the work.

2) Is napi_gro_frags() only suitable for TCP protocol packets? I have done a small test with the ixgbe driver, letting it only allocate paged buffers, and found the kernel hangs when napi_gro_frags() receives an ARP packet.

It should work with any packet. In fact, I'm pretty sure the other drivers (e.g., cxgb3) use that interface for all packets.

Thanks for the verification. By the way, does that mean that nearly all drivers can use the same napi_gro_frags() to receive buffers, though currently each driver has its own receiving function?

3) As I have mentioned above, with this idea netdev_alloc_skb() will allocate as usual, and the data pointed to by skb->data will be copied into the first guest buffer. That means we should reserve sufficient room in the guest buffer. For drivers supporting PS mode (for example ixgbe), the room will be more than 128 bytes. After the 128 bytes, we will put the first frag data. Look into virtio-net.c at the functions page_to_skb() and receive_mergeable(): that means we should modify the guest virtio-net driver to compute the offset as the parameter for skb_set_frag(). What do you think about this?
Attached is a patch showing how to modify the guest driver. I reserve 512 bytes as an example, and transfer the header length of the skb in hdr->hdr_len. Expanding the buffer size to 512 bytes to accommodate PS mode looks reasonable to me. However, I don't think we should increase the copy threshold to 512 bytes at the same time. I don't have any figures myself, but I think if we are to make such a change it should be a separate one and come with supporting numbers. Let me have a look to see if I can retain the copy threshold at 128 bytes and still copy the header data safely. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Xin, Xiaohui Sent: Saturday, June 12, 2010 5:31 PM To: Herbert Xu Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. -Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Friday, June 11, 2010 1:21 PM To: Xin, Xiaohui Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote: I'm not sure I understand your way correctly: 1) Does the way only deal with drivers with the SG feature? Since the packet is non-linear... No, the hardware doesn't have to support SG. You just need to place the entire packet contents in a page instead of skb->head. 2) Is skb->data still pointing to guest user buffers? If yes, how do we avoid modifying the skb handling in the net core? skb->data would not point to guest user buffers. In the common case the packet is not modified on its way to the guest, so this is not an issue. In the rare case where it is modified, you only have to copy the bits which are modified, and the cost of that is inconsequential since you have to write to that memory anyway. 3) In our way only parts of the drivers need to be modified to support zero-copy; and here, need we modify all the drivers? If you're asking about the portion of each driver supporting zero-copy that needs to be modified, then AFAICS this doesn't change that very much at all. I think making skb->head empty at first will cause more effort to pass the checks on the skb header. Have I missed something here?
I really make skb->head NULL just before kfree(skb) in skb_release_data(); it's done by the callback we have made for the skb. No, I'm not suggesting you set it to NULL. It should have some memory allocated, but skb_headlen(skb) should be zero. Please have a look at how the napi_gro_frags interface works (e.g., in drivers/net/cxgb3/sge.c). This is exactly the model that I am suggesting. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt Herbert, I have explained here what I think is the thought in your mind; please clarify if something is missed. 1) Modify the driver from netdev_alloc_skb() to allocate user pages if the device is zero-copied. If the driver supports PS mode, then modify alloc_page() too. 2) Add napi_gro_frags() in the driver to receive the user pages instead of the driver's own receiving function. 3) napi_gro_frags() will allocate a small skb and pull the header data from the first page to skb->data. Is the above the way you have suggested? I have thought about the way in some detail. 1) The first page will have an offset after the header is copied into the allocated kernel skb. The offset should be recalculated when the user page data is transferred to the guest. This may modify some of the gro code. 2) napi_gro_frags() may remove a page when its data is totally pulled, but we cannot put a user page as we normally would. This may modify the gro code too. 3) When the user buffers are returned to the guest, some of them need a vnet header prepended. That means for some pages, the vnet header room should be reserved when allocated. But we cannot know which one will be used as the first page when allocated. If we reserved a vnet header for each page, then since set_skb_frag() in the guest driver only uses offset 0 for the second and later pages, the page data would be wrong.
4) Since the user buffer pages need to be released, we still need a dtor callback to do that, and then I still need a place to hold it. What do you think about putting it in skb_shinfo? Currently I can only think of this. What do you think? Thanks Xiaohui Herbert, In this way, I think we should create at least 3 functions in drivers: to allocate the rx buffers, to receive the rx buffers, and to clean the rx buffers. We can also go another way here. We can provide a function to only substitute alloc_page(), and a function to release the pages when cleaning the rx buffers. The skb for the rx buffer can be allocated in the original way, and when pushing the data to the guest, the header data will be copied to the guest buffer. In this way, we should reserve sufficient room for the header in the first guest user buffer. That needs modifications to the guest virtio-net kernel driver. And this way is only suitable for drivers supporting PS mode. Considering that advanced drivers mostly have PS mode, it should not be a critical issue. Thanks Xiaohui
RE: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.
-Original Message- From: Eric Dumazet [mailto:eric.duma...@gmail.com] Sent: Saturday, June 05, 2010 10:56 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially. On Saturday, 05 June 2010 at 18:14 +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com If the buffer is external, then use the callback to destruct buffers. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- net/core/skbuff.c | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37587f0..418457c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)

 static void skb_release_data(struct sk_buff *skb)
 {
+	/* check if the skb has external buffers, we have used destructor_arg
+	 * here to indicate that
+	 */
+	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+

Oh well. This is v7 of your series, and nobody complained yet? This is a new cache miss on a _critical_ path. Ok, I will remove the declaration here to avoid the new cache miss.

 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);

+		/* if the skb has external buffers, use the destructor here,
+		 * since after this skb->head will be kfreed; if skb->head is
+		 * from an external buffer it cannot be destroyed with kfree.
+		 */

Why not defer the access to skb_shinfo(skb)->destructor_arg here? And reference skb_shinfo(skb)->destructor_arg here.
+		if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+			ext_page->dtor(ext_page);
 		kfree(skb->head);
 	}
 }

	if (dev_is_mpassthru(skb->dev)) {
		struct skb_external_page *ext_page =
			skb_shinfo(skb)->destructor_arg;
		if (ext_page && ext_page->dtor)
			ext_page->dtor(ext_page);
	}

destructor_arg should be moved before frags[] if you really want to use it. Thanks for the patch. But why is destructor_arg before frags[] better than after frags[]? skb_release_data() will reference both of them.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;

-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };

 /* We divide dataref into two halves. The higher 16 bits hold references
RE: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.
-Original Message- From: Eric Dumazet [mailto:eric.duma...@gmail.com] Sent: Saturday, June 05, 2010 10:53 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer. On Saturday, 05 June 2010 at 18:14 +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com

 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	/* Record the external buffer info in this field. It's not so good,
+	 * but we cannot find another place easily.
+	 */
+	shinfo->destructor_arg = ext_page;
+

Yes, this is a big problem; it's basically using a cache line that was not touched before. Did your patch which moves destructor_arg before frags[] also fix this? Thanks Xiaohui
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: Stephen Hemminger [mailto:shemmin...@vyatta.com] Sent: Monday, June 07, 2010 7:14 AM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. Still not sure this is a good idea for a couple of reasons: 1. We already have lots of special cases with skb's (frags and fraglist), and skb's travel through a lot of different parts of the kernel. So any new change like this creates lots of exposed points for new bugs. Look at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec and ppp and ... Yes, I agree with your concern to some extent. But the skbs which use external pages in our case have a short journey: they start in the NIC driver and are forwarded to the guest immediately. They will not travel into the host kernel stack or any filters, which avoids many of the problems you have mentioned here. Another point is that we try to make the solution generic across different NIC drivers, so many drivers may use this way without modifications. 2. SKB's can have infinite lifetime in the kernel. If these buffers come from a fixed size pool in an external device, they can easily all get tied up if you have a slow listener. What happens then? The pool is not fixed size; its size is the number of usable buffers submitted by the guest virtio-net driver. The guest virtio-net driver will check how many buffers are filled and resubmit accordingly. What does a slow listener mean here? A slow NIC driver? Thanks Xiaohui
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Andi Kleen Sent: Monday, June 07, 2010 3:51 PM To: Stephen Hemminger Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. Stephen Hemminger shemmin...@vyatta.com writes: Still not sure this is a good idea for a couple of reasons: 1. We already have lots of special cases with skb's (frags and fraglist), and skb's travel through a lot of different parts of the kernel. So any new change like this creates lots of exposed points for new bugs. Look at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec and ppp and ... 2. SKB's can have infinite lifetime in the kernel. If these buffers come from a fixed size pool in an external device, they can easily all get tied up if you have a slow listener. What happens then? 3. If they come from an internal pool, what happens when the kernel runs low on memory? How is that pool balanced against other kernel memory users? The size of the pool is limited by the virtqueue capacity now. If the virtqueue is really huge, then how to balance the memory is a problem. I have not yet thought through clearly how to tune it dynamically... -Andi -- a...@linux.intel.com -- Speaking for myself only.
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: Mitchell Erblich [mailto:erbli...@earthlink.net] Sent: Monday, June 07, 2010 4:17 PM To: Andi Kleen Cc: Stephen Hemminger; Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote: Stephen Hemminger shemmin...@vyatta.com writes: Still not sure this is a good idea for a couple of reasons: 1. We already have lots of special cases with skb's (frags and fraglist), and skb's travel through a lot of different parts of the kernel. So any new change like this creates lots of exposed points for new bugs. Look at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec and ppp and ... 2. SKB's can have infinite lifetime in the kernel. If these buffers come from a fixed size pool in an external device, they can easily all get tied up if you have a slow listener. What happens then? 3. If they come from an internal pool, what happens when the kernel runs low on memory? How is that pool balanced against other kernel memory users? -Andi -- a...@linux.intel.com -- Speaking for myself only. In general, when an internal pool is created/used, there SHOULD be a reason. Maybe to keep allocation latency to a minimum, OR ... The internal pool here is a collection of user buffers submitted by the guest virtio-net driver. The guest puts buffers into it and the driver gets buffers from it. If the guest submits more buffers than the driver needs, we need somewhere to put the buffers; that is what the internal pool deals with.
Now IMO, internal pool objects should have a ref count, and if that count is 0, then under memory pressure and/or when the number of objects is above a high-water mark, they are freed; OR, if there is a last-reference age field, the object is cleaned if dirty, then freed. Else, the pool is allowed to grow if the number of objects in the pool is below a set max (max COULD equal infinity). Thanks for the thoughts. Basically, the size of the internal pool is not decided by the pool itself, and adding/deleting the objects in the pool is not a task of the pool itself either. It's decided by both the guest virtio-net driver and the vhost-net driver, and by the guest receive speed and submit speed. The max size of the pool is limited by the number of virtqueue buffers. Thanks Xiaohui Mitchell Erblich
RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Tuesday, June 08, 2010 1:28 PM To: Stephen Hemminger Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external. On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote: Still not sure this is a good idea for a couple of reasons: 1. We already have lots of special cases with skb's (frags and fraglist), and skb's travel through a lot of different parts of the kernel. So any new change like this creates lots of exposed points for new bugs. Look at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec and ppp and ... 2. SKB's can have infinite lifetime in the kernel. If these buffers come from a fixed size pool in an external device, they can easily all get tied up if you have a slow listener. What happens then? I agree with Stephen on this. FWIW I don't think we even need the external pages concept in order to implement zero-copy receive (which I gather is the intent here). Here is one way to do it: simply construct a completely non-linear packet in the driver, as you would if you were using the GRO frags interface (grep for napi_gro_frags under drivers/net for examples). I'm not sure I understand your way correctly: 1) Does the way only deal with drivers with the SG feature? Since the packet is non-linear... 2) Is skb->data still pointing to guest user buffers? If yes, how do we avoid modifying the skb handling in the net core? 3) In our way only parts of the drivers need to be modified to support zero-copy; and here, need we modify all the drivers? If I missed your idea, could you explain it in more detail? This way you can transfer the entire contents of the packet without copying through to the other side, provided that the host stack does not modify the packet.
If the host side did modify the packet then we have to incur the memory cost anyway. IOW I think the only feature provided by the external pages construct is allowing the skb->head area to be shared without copying. I'm claiming that this can be done by simply making skb->head empty. I think making skb->head empty at first will cause more effort to pass the checks on the skb header. Have I missed something here? I really make skb->head NULL just before kfree(skb) in skb_release_data(); it's done by the callback we have made for the skb. Thanks Xiaohui Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: issue about virtio-net driver to support vhost mergeable buffer with zero-copy to support PS mode
Michael, What you have suggested could avoid tainting the guest virtio-net driver. Really, thanks! Of the two alternatives, the first would do more tricks with the driver or net-core stuff. So currently, I prefer to try the second one. Anyway, let me have a good think about it. Thanks! Thanks Xiaohui -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin Sent: Thursday, May 27, 2010 4:20 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Herbert Xu Subject: Re: issue about virtio-net driver to support vhost mergeable buffer with zero-copy to support PS mode On Thu, May 27, 2010 at 09:21:02AM +0800, Xin, Xiaohui wrote: Michael, I'm now looking into the vhost mergeable buffer, and I tried to use it to support PS mode with zero-copy. And I found an issue there which forces me to modify the guest virtio-net driver. When the guest virtio-net driver submits mergeable buffers, it submits multiple pages. In the zero-copy case, vhost cannot know which page is used to put the header and which page is used to put the payload. Then vhost can only reserve 12 bytes in each page. That means the page offset of the payload DMAed into the guest buffer is always 12 bytes. But the guest virtio-net driver always uses offset 0 to put the data (see receive_mergeable()). That's where zero-copy with mergeable buffers must make a modification. Have I missed something here? And what do you think about it? Thanks Xiaohui Maybe you can teach the hardware to skip the first 12 bytes: qemu will call an ioctl telling the hardware what the virtio header size is. This is how we plan to do it for tap. Alternatively, buffers can be used in any order. So we can have the hardware use N buffers for the packet, and then have vhost put the header in buffer N+1.
-- MST
RE: [RFC][PATCH v4 00/18] Provide a zero-copy method on KVM virtio-net.
Michael, Sorry, somehow I missed this mail. :-( Here, we have considered 2 ways to utilize the page constructor API to dispense the user buffers. One: Modify the __alloc_skb() function a bit: it allocates only the sk_buff structure, and the data pointer points to a user buffer which comes from the page constructor API. The shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way. Pros: We can avoid any copy here. Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers do the same as the guest here and are ok with this. But some of the latest NIC drivers reserve special room in the skb head. To deal with it, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy. Then we ask the guest to do so. Is that reasonable? Do you still do this? Currently, we still use the first way. But we now ignore the room which host skb_reserve() requires when the device is doing zero-copy. Now we don't taint the guest virtio-net driver with a new method this way. Two: Modify the driver to get user buffers allocated from a page constructor API (substituting alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb, and h/w fills it. After that, the data filled into the host skb header will be copied into the guest header buffer, which is submitted together with the payload buffer. Pros: We care less about how the guest or host allocates their buffers. Cons: We still need a bit of copying here for the skb header. We are not sure which way is better here.
This is the first thing we want to get comments on from the community. We wish the modification to the network part to be generic, not used only by the vhost-net backend: a user application may use it as well when the zero-copy device provides async read/write operations later. I commented on this in the past. Do you still want comments? Now we continue with the first way and try to push it. But any comments about the two methods are still welcome. That's nice. The thing to do is probably to enable GSO/TSO and see what we get this way. Also, mergeable buffer support was recently posted and I hope to merge it for 2.6.35. You might want to take a look. I'm looking at the mergeable buffers. I think GSO/GRO support with zero-copy also needs them. Currently, is GSO/TSO still not supported by vhost-net? -- MST
RE: [RFC][PATCH v4 00/18] Provide a zero-copy method on KVM virtio-net.
The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. Isn't it much easier to map the RX ring of the network device into the guest's address space, have DMA map calls translate guest addresses to physical/DMA addresses as well as do all of this crazy page pinning stuff, and provide the translations and protections via the IOMMU? This means the guest would need to know how the specific network device works. So we won't be able to, for example, move a guest between different hosts. There are other problems: many physical systems do not have an IOMMU, some guest OSes do not support DMA map calls, and doing a VM exit on each DMA map call might turn out to be very slow. And so on. This solution is what we can now think of to implement zero-copy. Some modifications are made to the net core to try to avoid network device driver changes. The major change is to __alloc_skb(), in which we added a dev parameter to indicate whether the device will DMA to/from a guest/user buffer pointed to by the host skb->data. We also modify skb_release_data() and skb_reserve(). We have made it work with the ixgbe driver with PS mode disabled, and got some performance data with it. Using netperf with GSO/TSO disabled, a 10G NIC, and packet split mode disabled, compared to vhost with a raw socket: bandwidth goes from 1.1 Gbps to 1.7 Gbps, and CPU % from 120%-140% to 140%-160%. We are now trying to get decent performance data with advanced features. Do you have any other concerns with this solution? What's being proposed here looks a bit over-engineered. This is an attempt to reduce overhead for virtio (paravirtualization). 'Don't use PV' is kind of an alternative, but I do not think it's a simpler one. -- MST
RE: [RFC][PATCH v4 05/18] Add a function to indicate if device use external buffer.
+static int dev_is_mpassthru(struct net_device *dev) A bool return value would be better here. -- Regards, Changli Gao (xiao...@gmail.com) Thanks, will fix that. Thanks Xiaohui
RE: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
Michael, Sorry, it's based on the suggestion to hook an iocb completion callback to handle the iocb list in vhost-net. Thanks Xiaohui -Original Message- From: Xin, Xiaohui Sent: Thursday, April 22, 2010 4:24 PM To: m...@redhat.com Cc: a...@arndb.de; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com; Xin, Xiaohui Subject: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net. From: Xin Xiaohui xiaohui@intel.com Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops as sendmsg/recvmsg to vhost-net. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzha...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- Michael, Thanks. I have updated the patch with your suggestion. It looks much cleaner now. Please have a review. Thanks Xiaohui drivers/vhost/Kconfig | 10 + drivers/vhost/Makefile|2 + drivers/vhost/mpassthru.c | 1239 + include/linux/mpassthru.h | 29 + 4 files changed, 1280 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/mpassthru.c create mode 100644 include/linux/mpassthru.h diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 9f409f4..91806b1 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -9,3 +9,13 @@ config VHOST_NET To compile this driver as a module, choose M here: the module will be called vhost_net. +config MEDIATE_PASSTHRU + tristate "mediate passthru network driver (EXPERIMENTAL)" + depends on VHOST_NET + ---help--- + zero-copy network I/O support; we call it mediate passthru to + distinguish it from hardware passthru. + + To compile this driver as a module, choose M here: the module will + be called mpassthru. 
+ diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index 72dd020..c18b9fc 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -1,2 +1,4 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o + +obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c new file mode 100644 index 000..cc99b14 --- /dev/null +++ b/drivers/vhost/mpassthru.c @@ -0,0 +1,1239 @@ +/* + * MPASSTHRU - Mediate passthrough device. + * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#define DRV_NAME "mpassthru" +#define DRV_DESCRIPTION "Mediate passthru device driver" +#define DRV_COPYRIGHT "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G" + +#include <linux/module.h> +#include <linux/errno.h> +#include <linux/kernel.h> +#include <linux/major.h> +#include <linux/slab.h> +#include <linux/smp_lock.h> +#include <linux/poll.h> +#include <linux/fcntl.h> +#include <linux/init.h> +#include <linux/aio.h> + +#include <linux/skbuff.h> +#include <linux/netdevice.h> +#include <linux/etherdevice.h> +#include <linux/miscdevice.h> +#include <linux/ethtool.h> +#include <linux/rtnetlink.h> +#include <linux/if.h> +#include <linux/if_arp.h> +#include <linux/if_ether.h> +#include <linux/crc32.h> +#include <linux/nsproxy.h> +#include <linux/uaccess.h> +#include <linux/virtio_net.h> +#include <linux/mpassthru.h> +#include <net/net_namespace.h> +#include <net/netns/generic.h> +#include <net/rtnetlink.h> +#include <net/sock.h> + +#include <asm/system.h> + +/* Uncomment to enable debugging */ +/* #define MPASSTHRU_DEBUG 1 */ + +#ifdef MPASSTHRU_DEBUG +static int debug; + +#define DBG if (mp->debug) printk +#define DBG1 if (debug == 2) printk +#else +#define DBG(a...) +#define DBG1(a...) +#endif + +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4) +#define COPY_HDR_LEN (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES) + +struct frag { + u16 offset; + u16 size; +}; + +struct page_ctor { + struct list_head readq; + int w_len; + int r_len; + spinlock_t read_lock; + struct kmem_cache *cache; + /* record the locked pages */ + int lock_pages; + struct rlimit o_rlim; + struct net_device *dev; + struct mpassthru_port port; +}; + +struct page_info { + struct list_head list; + int header; + /* indicate the actual
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Michael, Yes, I think this packet split mode probably maps well to mergeable buffer support. Note that 1. Not all devices support large packets in this way; others might map to indirect buffers better. Are the indirect buffers meant to deal with skb->frag_list? So we have to figure out how migration is going to work. Yes, different guest virtio-net drivers may contain different features. Does qemu migration work with different features supported by the virtio-net driver now? 2. It's up to the guest driver whether to enable features such as mergeable buffers and indirect buffers. So we have to figure out how to notify the guest which mode is optimal for a given device. Yes. When a device is bound, the mp device may query the capabilities from the driver. Actually, there is a structure in the mp device now that can do this; we can add some fields to support more. 3. We don't want to depend on jumbo frames for decent performance. So we probably should support GSO/GRO. GSO is for the tx side, right? I think the driver can handle it itself. For GRO, I'm not sure whether it's easy or not. Basically, the mp device we support now is doing what a raw socket does. The packets are not going to the host stack. -- MST
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Michael, The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest who uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer thru the guest virtio-net frontend. The scenario is like this: The guest virtio-net driver submits multiple requests thru the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done. For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. Here, we have considered two ways to utilize the page constructor API to dispense the user buffers. One: Modify the __alloc_skb() function a bit so it only allocates a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is the way we have done it. Pros: We can avoid any copy here. Cons: The guest virtio-net driver needs to allocate skbs in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers are the same as the guest's and are OK with this. But some of the latest NIC drivers reserve special room in the skb head. 
To deal with it, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in when we know which device we have bound for zero-copy. Then we ask the guest to do so. Is that reasonable? Unfortunately, this would break compatibility with existing virtio. This also complicates migration. You mean any modification to the guest virtio-net driver will break the compatibility? We tried to enlarge virtio_net_config to contain the two parameters and add a VIRTIO_NET_F_PASSTHRU flag; virtnet_probe() will check the feature flag and get the parameters, and the virtio-net driver then uses them to allocate buffers. How about this? This means that we can't, for example, live-migrate between different systems without flushing outstanding buffers. Ok. What we have thought about now is to do something with skb_reserve(). If the device is bound by mp, then skb_reserve() will do nothing with it. What is the room in the skb head used for? I'm not sure, but the latest ixgbe driver does this; it reserves 32 bytes beyond NET_IP_ALIGN. Looking at the code, this seems to do with alignment - it could just be a performance optimization. Two: Modify the driver to get user buffers allocated from a page constructor API (to substitute alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and let h/w fill it. After that, the data filled in the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer. Pros: We care less about how the guest or host allocates their buffers. Cons: We still need a bit of copying here for the skb header. We are not sure which way is better here. The obvious question would be whether you see any speed difference with the two approaches. If no, then the second approach would be better. 
I remember the second approach is a bit slower at 1500 MTU, but we did not test it too much. Well, that's an important datapoint. By the way, you'll need the header copy to activate LRO in the host, so that's a good reason to go with option 2 as well. This is the first thing we want to get comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later. Please give comments, especially on the network part modifications. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage.
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Michael, What we have not done yet: packet split support What does this mean, exactly? We can support 1500 MTU, but for jumbo frames, since the vhost driver did not support mergeable buffers before, we cannot try it with multiple sg. I do not see why; vhost currently supports 64K buffers with indirect descriptors. The receive_skb() in the guest virtio-net driver will merge the multiple sg into skb frags; how can indirect descriptors do that? See add_recvbuf_big. I don't mean this; that's for buffer submission. I mean when a packet is received, in receive_buf(), the mergeable-buffer path knows which received pages can be hooked into the skb frags; it's receive_mergeable() which does this. When a NIC driver supports packet split mode, each ring descriptor contains an skb and a page. When a packet is received, if the status is not EOP, the page of the next descriptor is hooked to the previous skb. We don't know how many frags belong to one skb. So when the guest submits buffers, it should submit multiple pages, and on receive, the guest should know which pages belong to one skb and hook them together. I think receive_mergeable() can do this, but I don't see how big packets handle this. Am I missing something here? Thanks Xiaohui
RE: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
Arnd, From: Xin Xiaohui xiaohui@intel.com Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops as sendmsg/recvmsg to vhost-net. Sorry for taking so long before finding the time to look at your code in more detail. It seems that you are duplicating a lot of functionality that is already in macvtap. I've asked about this before but then didn't look at your newer versions. Can you explain the value of introducing another interface to user land? I'm still planning to add zero-copy support to macvtap, hopefully reusing parts of your code, but do you think there is value in having both? I have not looked into your macvtap code in detail before. Are the two interfaces exactly the same? We just want to create a simple way to do zero-copy. Now it can only support vhost, but in the future we also want it to support direct read/write operations from user space too. Basically, compared to the interface, I'm more worried about the modifications to the net core we have made to implement zero-copy. If this hardest part can be done, then any user space interface modifications or integrations are easier to do after that. diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c new file mode 100644 index 000..86d2525 --- /dev/null +++ b/drivers/vhost/mpassthru.c @@ -0,0 +1,1264 @@ + +#ifdef MPASSTHRU_DEBUG +static int debug; + +#define DBG if (mp->debug) printk +#define DBG1 if (debug == 2) printk +#else +#define DBG(a...) +#define DBG1(a...) +#endif This should probably just use the existing dev_dbg/pr_debug infrastructure. Thanks. Will try that. [... skipping buffer management code for now] +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock, +struct msghdr *m, size_t total_len) +{ [...] 
This function looks like we should be able to easily include it into macvtap and get zero-copy transmits without introducing the new user-level interface. +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock, +struct msghdr *m, size_t total_len, +int flags) +{ +struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; +struct page_ctor *ctor; +struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private); It smells like a layering violation to look at the iocb->private field from a lower-level driver. I would have hoped that it's possible to implement this without having this driver know about the higher-level vhost driver internals. Can you explain why this is needed? I don't like this either, but the kiocb is maintained by vhost with a list_head, and the mp device is responsible for collecting the kiocbs into that list_head, so we need something known by both vhost and mp. +spin_lock_irqsave(&ctor->read_lock, flag); +list_add_tail(&info->list, &ctor->readq); +spin_unlock_irqrestore(&ctor->read_lock, flag); + +if (!vq->receiver) { +vq->receiver = mp_recvmsg_notify; +set_memlock_rlimit(ctor, RLIMIT_MEMLOCK, + vq->num * 4096, + vq->num * 4096); +} + +return 0; +} Not sure what I'm missing, but who calls vq->receiver? This seems to be neither in the upstream version of vhost nor introduced by your patch. See patch v3 2/3 I have sent out; it is called by handle_rx() in vhost. 
+static void __mp_detach(struct mp_struct *mp) +{ +mp->mfile = NULL; + +mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP); +page_ctor_detach(mp); +mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP); + +/* Drop the extra count on the net device */ +dev_put(mp->dev); +} + +static DEFINE_MUTEX(mp_mutex); + +static void mp_detach(struct mp_struct *mp) +{ +mutex_lock(&mp_mutex); +__mp_detach(mp); +mutex_unlock(&mp_mutex); +} + +static void mp_put(struct mp_file *mfile) +{ +if (atomic_dec_and_test(&mfile->count)) +mp_detach(mfile->mp); +} + +static int mp_release(struct socket *sock) +{ +struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; +struct mp_file *mfile = mp->mfile; + +mp_put(mfile); +sock_put(mp->socket.sk); +put_net(mfile->net); + +return 0; +} Doesn't this prevent the underlying interface from going away while the chardev is open? You also have logic to handle that case, so why do you keep the extra reference on the netdev? Let me think. +/* Ops structure to mimic raw sockets with mp device */ +static const struct proto_ops mp_socket_ops = { +.sendmsg = mp_sendmsg, +.recvmsg = mp_recvmsg, +.release = mp_release, +}; +static int mp_chr_open(struct inode *inode, struct file * file) +{ +struct mp_file *mfile; + cycle_kernel_lock(); I don't think you really want to use the BKL here, just kill
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Michael, The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest who uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer thru the guest virtio-net frontend. The scenario is like this: The guest virtio-net driver submits multiple requests thru the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done. For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. Here, we have considered two ways to utilize the page constructor API to dispense the user buffers. One: Modify the __alloc_skb() function a bit so it only allocates a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is the way we have done it. Pros: We can avoid any copy here. Cons: The guest virtio-net driver needs to allocate skbs in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers are the same as the guest's and are OK with this. But some of the latest NIC drivers reserve special room in the skb head. 
To deal with it, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in when we know which device we have bound for zero-copy. Then we ask the guest to do so. Is that reasonable? Unfortunately, this would break compatibility with existing virtio. This also complicates migration. You mean any modification to the guest virtio-net driver will break the compatibility? We tried to enlarge virtio_net_config to contain the two parameters and add a VIRTIO_NET_F_PASSTHRU flag; virtnet_probe() will check the feature flag and get the parameters, and the virtio-net driver then uses them to allocate buffers. How about this? What is the room in the skb head used for? I'm not sure, but the latest ixgbe driver does this; it reserves 32 bytes beyond NET_IP_ALIGN. Two: Modify the driver to get user buffers allocated from a page constructor API (to substitute alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and let h/w fill it. After that, the data filled in the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer. Pros: We care less about how the guest or host allocates their buffers. Cons: We still need a bit of copying here for the skb header. We are not sure which way is better here. The obvious question would be whether you see any speed difference with the two approaches. If no, then the second approach would be better. 
Please give comments, especially on the network part modifications. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later. But in a simple test with netperf, we found both bandwidth and CPU % go up, though the bandwidth increase ratio is much larger than the CPU % increase ratio. What we have not done yet: packet split support What does this mean, exactly? We can support 1500 MTU, but for jumbo frames, since the vhost driver did not support mergeable buffers before, we cannot try it with multiple sg. A jumbo frame will be split into 5 frags, hooked one per descriptor, so the user buffer allocation is greatly dependent on how guest
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Sridhar, The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest who uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer thru the guest virtio-net frontend. What is the advantage of this approach compared to PCI-passthrough of the host NIC to the guest? PCI-passthrough needs hardware support: a kind of IOMMU engine is required to translate guest physical addresses to host physical addresses. And currently, a PCI-passthrough device cannot survive live migration. The zero-copy approach is a pure software solution; it doesn't need special hardware support, and in theory it can survive live migration. Does this require pinning of the entire guest memory? Or only the send/receive buffers? We only need to pin the send/receive buffers. Thanks Xiaohui Thanks Sridhar The scenario is like this: The guest virtio-net driver submits multiple requests thru the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done. For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. Here, we have considered two ways to utilize the page constructor API to dispense the user buffers. 
One: Modify the __alloc_skb() function a bit so it only allocates a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is the way we have done it. Pros: We can avoid any copy here. Cons: The guest virtio-net driver needs to allocate skbs in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers are the same as the guest's and are OK with this. But some of the latest NIC drivers reserve special room in the skb head. To deal with it, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in when we know which device we have bound for zero-copy. Then we ask the guest to do so. Is that reasonable? Two: Modify the driver to get user buffers allocated from a page constructor API (to substitute alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and let h/w fill it. After that, the data filled in the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer. Pros: We care less about how the guest or host allocates their buffers. Cons: We still need a bit of copying here for the skb header. We are not sure which way is better here. This is the first thing we want to get comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later. Please give comments, especially on the network part modifications. 
We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later. But in a simple test with netperf, we found both bandwidth and CPU % go up, though the bandwidth increase ratio is much larger than the CPU % increase ratio. What we have not done yet: packet split support To support GRO Performance tuning What we have done in v1: polish the RCU usage deal with write logging in asynchronous mode in vhost add a notifier block for the mp device rename page_ctor to mp_port in netdevice.h to make it look generic add mp_dev_change_flags() for the mp device to change NIC state add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded a small fix for a missing dev_put on failure use a dynamic minor
RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Xin Xiaohui xiaohui@intel.com The patch lets the host NIC driver receive user space skbs, so the driver has the chance to DMA directly to guest user space buffers thru a single ethX interface. We want it to be more generic, as a zero-copy framework. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzha...@gmail.com Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org --- We considered two ways to utilize the user buffers, but are not sure which one is better. Please give any comments. One: Modify the __alloc_skb() function a bit so it only allocates a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is the way we have done it. Pros: We can avoid any copy here. Cons: The guest virtio-net driver needs to allocate skbs in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers are the same as the guest's and are OK with this. But some of the latest NIC drivers reserve special room in the skb head. To deal with it, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in when we know which device we have bound for zero-copy. Then we ask the guest to do so. Is that reasonable? Two: Modify the driver to get user buffers allocated from a page constructor API (to substitute alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and let h/w fill it. After that, the data filled in the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer. Pros: We care less about how the guest or host allocates their buffers. 
Cons: We still need a bit of copying here for the skb header. We are not sure which way is better here. This is the first thing we want to get comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later. Thanks Xiaohui How do you deal with the DoS problem of a hostile user space app posting a huge number of receives and never getting anything? That's a problem we are trying to deal with; it's critical for the long term. Currently, we have tried to limit the pages it can pin, but we are not sure how much is reasonable. For now, the buffers submitted come from the guest virtio-net driver, so it's safe to some extent just for now. Thanks Xiaohui
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
Michael, For the write logging, do you have a function at hand with which we can recompute the log? If so, I think I can use it to recompute the log info when the logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers that were submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors; waiting for them to finish may take some time. I think it is also reasonable that when the logging ioctl changes, the logging changes just after that. The key point is that after the logging ioctl returns, any subsequent change to memory must be logged. It does not matter when the request was submitted; otherwise we will get memory corruption on migration. The change to memory happens in vhost_add_used_and_signal(), right? So after the ioctl returns, just recomputing the log info for the events in the async queue is OK, since the ioctl and write-log operations are all protected by vq->mutex. Thanks Xiaohui Yes, I think this will work. Thanks, so do you have the function to recompute the log info at hand that I can use? I vaguely remember that you mentioned it some time ago. Doesn't just rerunning vhost_get_vq_desc work? Am I missing something here? vhost_get_vq_desc() looks into the vq, finds the first available buffers, and converts them to an iovec. I think the first available buffer is not among the buffers in the async queue, so I think rerunning vhost_get_vq_desc() cannot work. 
Thanks Xiaohui drivers/vhost/net.c | 189 +++-- drivers/vhost/vhost.h | 10 +++ 2 files changed, 192 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..2aafd90 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include <linux/workqueue.h> #include <linux/rcupdate.h> #include <linux/file.h> +#include <linux/aio.h> #include <linux/net.h> #include <linux/if_packet.h> #include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/mpassthru.h> #include <net/sock.h> @@ -47,6 +49,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. */ @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net->tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(&vq->notify_lock, flags); + if (!list_empty(&vq->notifier)) { + iocb = list_first_entry(&vq->notifier, + struct kiocb, ki_list); + list_del(&iocb->ki_list); + } + spin_unlock_irqrestore(&vq->notify_lock, flags); + return iocb; +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + int log, size; + + if (vq->link_state != VHOST_VQ_LINK_ASYNC) + return; + + if (vq->receiver) + vq->receiver(vq); + + vq_log = unlikely(vhost_has_feature( + &net->dev, VHOST_F_LOG_ALL)) ? 
vq-log : NULL; + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + log = (int)iocb-ki_user_data; + size = iocb-ki_nbytes; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + if (unlikely(vq_log)) + vhost_log_write(vq, vq_log, log, size); + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } +} + +static void handle_async_tx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + int tx_total_len = 0; + + if (vq-link_state != VHOST_VQ_LINK_ASYNC) + return; + + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, 0); + tx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + + kmem_cache_free(net-cache, iocb); + if (unlikely(tx_total_len =
RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Michael, Qemu needs a userspace write; is that a synchronous one or an asynchronous one?

It's a synchronous non-blocking write.

Sorry, why does qemu live migration need the device to have a userspace write? How does the write operation work? And why is a read operation not of concern here?

Thanks Xiaohui

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Michael, For the DoS issue, I'm not sure what a reasonable limit on how much get_user_pages() can pin would be; should we compute it from the bandwidth?

There's a ulimit for locked memory. Can we use this, decreasing the value in the rlimit array? We can do this when the backend is enabled and re-increment it when the backend is disabled.

I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found the initial value for it is 0x1; after a right shift by PAGE_SHIFT it's only 16 pages we can lock, which seems too small, since the guest virtio-net driver may submit a lot of requests at one time.

Thanks Xiaohui
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
Michael, For the write logging, do you have a function in hand that we can use to recompute the log? If so, I think I can use it to recompute the log info when logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors; waiting for them to finish may take some time. I think it is also reasonable that when the logging ioctl changes, the logging changes right after it.

The key point is that after the logging ioctl returns, any subsequent change to memory must be logged. It does not matter when the request was submitted; otherwise we will get memory corruption on migration.

The change to memory happens in vhost_add_used_and_signal(), right? So after the ioctl returns, just recomputing the log info for the events in the async queue is OK, since the ioctl and the write-log operations are all protected by vq->mutex.

Thanks Xiaohui

Yes, I think this will work.

Thanks. So do you have the function to recompute the log info in hand that I can use? I vaguely remember that you mentioned it some time ago.

Thanks Xiaohui
Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
Michael, Now I made vhost alloc/destroy the kiocb and transfer it through sendmsg/recvmsg. I did not remove vq->receiver, since what the callback does is related to structures owned by the mp device, and I think keeping them isolated from vhost is a good thing for us all. It will not prevent the mp device from being independent of vhost in the future. Later, when the mp device can be a real device which provides asynchronous read/write operations, rather than just exposing proto_ops, it will use another callback function which is not related to vhost at all.

For the write logging, do you have a function in hand that we can use to recompute the log? If so, I think I can use it to recompute the log info when logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors; waiting for them to finish may take some time. I think it is also reasonable that when the logging ioctl changes, the logging changes right after it.
Thanks Xiaohui

 drivers/vhost/net.c   | 189 +++--
 drivers/vhost/vhost.h |  10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 #include <net/sock.h>
@@ -47,6 +49,7 @@ struct vhost_net {
        struct vhost_dev dev;
        struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
        struct vhost_poll poll[VHOST_NET_VQ_MAX];
+       struct kmem_cache *cache;
        /* Tells us whether we are polling a socket for TX.
         * We only do this when socket buffer fills up.
         * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
        net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+       struct kiocb *iocb = NULL;
+       unsigned long flags;
+
+       spin_lock_irqsave(&vq->notify_lock, flags);
+       if (!list_empty(&vq->notifier)) {
+               iocb = list_first_entry(&vq->notifier,
+                               struct kiocb, ki_list);
+               list_del(&iocb->ki_list);
+       }
+       spin_unlock_irqrestore(&vq->notify_lock, flags);
+       return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+                                         struct vhost_virtqueue *vq)
+{
+       struct kiocb *iocb = NULL;
+       struct vhost_log *vq_log = NULL;
+       int rx_total_len = 0;
+       int log, size;
+
+       if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+               return;
+
+       if (vq->receiver)
+               vq->receiver(vq);
+
+       vq_log = unlikely(vhost_has_feature(
+                       &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+       while ((iocb = notify_dequeue(vq)) != NULL) {
+               vhost_add_used_and_signal(&net->dev, vq,
+                               iocb->ki_pos, iocb->ki_nbytes);
+               log = (int)iocb->ki_user_data;
+               size = iocb->ki_nbytes;
+               rx_total_len += iocb->ki_nbytes;
+
+               if (iocb->ki_dtor)
+                       iocb->ki_dtor(iocb);
+               kmem_cache_free(net->cache, iocb);
+
+               if (unlikely(vq_log))
+                       vhost_log_write(vq, vq_log, log, size);
+               if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+                       vhost_poll_queue(&vq->poll);
+                       break;
+               }
+       }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+                                         struct vhost_virtqueue *vq)
+{
+       struct kiocb *iocb = NULL;
+       int tx_total_len = 0;
+
+       if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+               return;
+
+       while ((iocb = notify_dequeue(vq)) != NULL) {
+               vhost_add_used_and_signal(&net->dev, vq,
+                               iocb->ki_pos, 0);
+               tx_total_len += iocb->ki_nbytes;
+
+               if (iocb->ki_dtor)
+                       iocb->ki_dtor(iocb);
+
+               kmem_cache_free(net->cache, iocb);
+               if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+                       vhost_poll_queue(&vq->poll);
+                       break;
+               }
+       }
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
Re:[PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Add a device to utilize the vhost-net backend driver for copy-less data transfer between the guest FE and the host NIC. It pins the guest user space to the host memory and provides proto_ops such as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
Michael, Sorry, I did not resolve all your comments this time. I did not move the device out of the vhost directory because I have not implemented real asynchronous read/write operations for the mp device yet; we wish to do this after the network code is checked in.

For the DoS issue, I'm not sure what a reasonable limit on how much get_user_pages() can pin would be; should we compute it from the bandwidth?

We use get_user_pages_fast() and set_page_dirty_lock(). We removed read_rcu_lock()/unlock(), since the ctor pointer is only changed by the BIND/UNBIND ioctls, and during that time the NIC is always stopped and all outstanding requests are done, so the ctor pointer cannot race into a wrong state.

Qemu needs a userspace write; is that a synchronous one or an asynchronous one?

Thanks Xiaohui

 drivers/vhost/Kconfig     |    5 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1162 +
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
          To compile this driver as a module, choose M here: the module will
          be called vhost_net.
+config VHOST_PASSTHRU
+       tristate "Zerocopy network driver (EXPERIMENTAL)"
+       depends on VHOST_NET
+       ---help---
+         zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+       u16     offset;
+       u16     size;
+};
+
+struct page_ctor {
+       struct list_head        readq;
+       int                     w_len;
+       int                     r_len;
+       spinlock_t              read_lock;
+       struct kmem_cache       *cache;
+       struct net_device       *dev;
+       struct mpassthru_port   port;
+};
+
+struct page_info {
+       void                    *ctrl;
+       struct list_head        list;
+       int                     header;
+       /* indicate the actual length of bytes
+        * send/recv in the user space buffers
+        */
+       int                     total;
+       int
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
For the write logging, do you have a function in hand that we can use to recompute the log? If so, I think I can use it to recompute the log info when logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors; waiting for them to finish may take some time. I think it is also reasonable that when the logging ioctl changes, the logging changes right after it.

The key point is that after the logging ioctl returns, any subsequent change to memory must be logged. It does not matter when the request was submitted; otherwise we will get memory corruption on migration.

The change to memory happens in vhost_add_used_and_signal(), right? So after the ioctl returns, just recomputing the log info for the events in the async queue is OK, since the ioctl and the write-log operations are all protected by vq->mutex.

Thanks Xiaohui

 static void handle_tx(struct vhost_net *net)
 {
        struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+       struct kiocb *iocb = NULL;
        unsigned head, out, in, s;
        struct msghdr msg = {
                .msg_name = NULL,
@@ -124,6 +204,8 @@ static void handle_tx(struct
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
Michael, I don't use the kiocb that comes from sendmsg/recvmsg, since I have embedded the kiocb in the page_info structure and allocate it when the page_info is allocated.

So what I suggested was that vhost allocates and tracks the iocbs, and passes them to your device with the sendmsg/recvmsg calls. This way your device won't need to share structures and a locking strategy with vhost: you get an iocb, handle it, and invoke a callback to notify vhost about completion. This also gets rid of the 'receiver' callback.

I'm not sure the receiver callback can be removed here. The patch describes a work flow like this: netif_receive_skb() gets the packet; it does nothing but queue the skb and wake up the handle_rx() of vhost. handle_rx() then calls the receiver callback to deal with the skb, collecting the necessary notify info into a list; vhost owns the list and, in the same handle_rx() context, uses it to complete the requests.

We use the receiver callback here because only handle_rx() is woken up from netif_receive_skb(), and we need mp device context to deal with the skb and the notify info attached to it. We also take some locks in the callback function. If I removed the receiver callback, I could only deal with the skb and notify info in netif_receive_skb(), but that function runs in interrupt context, where I think such locks are not allowed, and I cannot remove the locking.

Please have a review, and thanks for the instruction on replying to email, which helps me a lot.
Thanks,
Xiaohui

 drivers/vhost/net.c   | 159 +++--
 drivers/vhost/vhost.h |  12
 2 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..5483848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 #include <net/sock.h>
@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
        net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+               struct vhost_virtqueue *vq);
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+               struct vhost_virtqueue *vq);
+

A couple of style comments:

- It's better to arrange functions in such an order that forward declarations aren't necessary. Since we don't have recursion, this should always be possible.
- Continuation lines should be indented at least to the position of '(' on the previous line.

Thanks. I'd correct that.

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
                tx_poll_stop(net);
        hdr_size = vq->hdr_size;
 
+       handle_async_tx_events_notify(net, vq);
+
        for (;;) {
                head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
                                         ARRAY_SIZE(vq->iov),
@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
                /* Skip header. TODO: support TSO. */
                s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
                msg.msg_iovlen = out;
+
+               if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+                       vq->head = head;
+                       msg.msg_control = (void *)vq;

So here the device gets a pointer to the vhost_virtqueue structure. If it got an iocb and invoked a callback, it would not need to care about vhost internals.

+               }
+
                len = iov_length(vq->iov, out);
                /* Sanity check */
                if (!len) {
@@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
                        tx_poll_start(net, sock);
                        break;
                }
+
+               if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+                       continue;
+
                if (err != len)
                        pr_err("Truncated TX packet: len %d != %zd\n",
                               err, len);
@@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
                }
        }
 
+       handle_async_tx_events_notify(net, vq);
+
        mutex_unlock(&vq->mutex);
        unuse_mm(net->dev.mm);
 }
@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
        int err;
        size_t hdr_size;
        struct socket *sock = rcu_dereference(vq->private_data);
-       if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+       if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+                     vq->link_state == VHOST_VQ_LINK_SYNC))
                return;
 
        use_mm(net->dev.mm);
@@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net
Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
Michael, I don't use the kiocb that comes from sendmsg/recvmsg, since I have embedded the kiocb in the page_info structure and allocate it when the page_info is allocated. Please have a review, and thanks for the instruction on replying to email, which helps me a lot.

Thanks,
Xiaohui

 drivers/vhost/net.c   | 159 +++--
 drivers/vhost/vhost.h |  12
 2 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..5483848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 #include <net/sock.h>
@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
        net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+               struct vhost_virtqueue *vq);
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+               struct vhost_virtqueue *vq);
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
                tx_poll_stop(net);
        hdr_size = vq->hdr_size;
 
+       handle_async_tx_events_notify(net, vq);
+
        for (;;) {
                head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
                                         ARRAY_SIZE(vq->iov),
@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
                /* Skip header. TODO: support TSO. */
                s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
                msg.msg_iovlen = out;
+
+               if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+                       vq->head = head;
+                       msg.msg_control = (void *)vq;
+               }
+
                len = iov_length(vq->iov, out);
                /* Sanity check */
                if (!len) {
@@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
                        tx_poll_start(net, sock);
                        break;
                }
+
+               if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+                       continue;
+
                if (err != len)
                        pr_err("Truncated TX packet: len %d != %zd\n",
                               err, len);
@@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
                }
        }
 
+       handle_async_tx_events_notify(net, vq);
+
        mutex_unlock(&vq->mutex);
        unuse_mm(net->dev.mm);
 }
@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
        int err;
        size_t hdr_size;
        struct socket *sock = rcu_dereference(vq->private_data);
-       if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+       if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+                     vq->link_state == VHOST_VQ_LINK_SYNC))
                return;
 
        use_mm(net->dev.mm);
@@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
        vhost_disable_notify(vq);
        hdr_size = vq->hdr_size;
 
-       vq_log = unlikely(vhost_has_feature(net->dev, VHOST_F_LOG_ALL)) ?
+       /* In async cases, for write logging, the simple way is to get
+        * the log info always, and really logging is decided later.
+        * Thus, when logging enabled, we can get log, and when logging
+        * disabled, we can get log disabled accordingly.
+        */
+       vq_log = unlikely(vhost_has_feature(net->dev, VHOST_F_LOG_ALL)) |
+                (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
                vq->log : NULL;
 
+       handle_async_rx_events_notify(net, vq);
+
        for (;;) {
                head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
                                         ARRAY_SIZE(vq->iov),
@@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
                s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
                msg.msg_iovlen = in;
                len = iov_length(vq->iov, in);
+               if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+                       vq->head = head;
+                       vq->_log = log;
+                       msg.msg_control = (void *)vq;
+               }
                /* Sanity check */
                if (!len) {
                        vq_err(vq, "Unexpected header len for RX: "
@@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
+/* The structure to notify the virtqueue for async socket */
+struct vhost_notifier {
+       struct list_head list;
+       struct vhost_virtqueue *vq;
+       int head;
+       int size;
+       int log;
+       void *ctrl;
+       void (*dtor)(struct vhost_notifier *);
+};
+

So IMO this is not the best interface between vhost and your driver, exposing them to each other unnecessarily. If you think about it, your driver should not care about this structure. It could get e.g. a kiocb (sendmsg already gets one) and call ki_dtor on completion. vhost could save its state in ki_user_data. If your driver needs to add more data for tracking, I think it can put an skb pointer in the private pointer.

Then if I remove struct vhost_notifier and just use struct kiocb (not the one from sendmsg or recvmsg, but one allocated within the page_info structure), and don't implement any aio logic related to it, is that OK? Sorry, I made a patch but don't know how to reply to mail with a well-formatted patch here.

Thanks Xiaohui
RE: [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net.
On Sat, Mar 06, 2010 at 05:38:35PM +0800, xiaohui@intel.com wrote:

The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend. We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later, but in a simple test with netperf we found both bandwidth and CPU % go up, with the bandwidth increase much larger than the CPU % increase.

What we have not done yet: packet split support, GRO support, performance tuning.

Am I right to say that the NIC driver needs changes for these patches to work? If so, please publish the NIC driver patches as well.

For drivers that do not support packet-split mode, the NIC drivers don't need to change. For packet-split-capable drivers, we plan to add the driver API in updated versions. For now, if packet-split mode is simply disabled on such drivers, it also works.

What we have done in v1: polish the RCU usage; deal with write logging in asynchronous mode in vhost; add a notifier block for the mp device; rename page_ctor to mp_port in netdevice.h to make it look generic; add mp_dev_change_flags() for the mp device to change NIC state; add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded; a small fix for a missing dev_put on failure; use a dynamic minor instead of a static minor number; a __KERNEL__ guard for mp_get_sock().

Performance: using netperf with GSO/TSO disabled on a 10G NIC with packet-split mode disabled, the raw socket case compared to vhost: bandwidth goes from 1.1Gbps to 1.7Gbps, and CPU % from 120%-140% to 140%-160%.

That's pretty low for a 10Gb NIC. Are you hitting some other bottleneck, like a high interrupt rate? Also, GSO support and performance tuning for raw are incomplete. Try comparing with e.g. tap with GSO.

I'm curious too. I first tested vhost-net without the zero-copy patch in the raw socket case with the ixgbe driver; with that driver the GRO feature is enabled by default, and the netperf numbers were extremely low. After disabling GRO I can get more than 1Gbps. So I thought I had missed something there, but I had sent two emails to you about this before and got no reply. Have you got any perf data for the raw socket case with vhost-net? The data I have got from your web page is always the tap-with-GSO case. If GSO is not supported, I think the data cannot be compared with the tap-with-GSO case at 1500 MTU. Maybe mergeable buffers would help performance in the raw socket case?

Thanks Xiaohui
RE: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
Will be on vacation during 2/13~2/20, so email may be very slow or your comments may go unanswered for a while. But please don't hesitate to comment more, and I will address them after the vacation. :-) Thanks Xiaohui -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Xin Xiaohui Sent: Wednesday, February 10, 2010 7:49 PM To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; m...@redhat.com; jd...@c2.user-mode-linux.org Subject: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net. The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer thru the guest virtio-net frontend. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later. But for a simple test with netperf, we found both bandwidth and CPU % go up, but the bandwidth up ratio is much more than the CPU % up ratio. What we have not done yet: To support GRO; Performance tuning
[PATCH 3/3] Let host NIC driver to DMA to guest user space.
The patch lets the host NIC driver receive user space skbs, so the driver has a chance to directly DMA to guest user space buffers thru a single ethX interface.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
 include/linux/netdevice.h |   72 +
 include/linux/skbuff.h    |   32 ++--
 net/core/dev.c            |   27 +
 net/core/skbuff.c         |   62 +++
 4 files changed, 184 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..0de8688 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -486,6 +486,16 @@ struct netdev_queue {
 } ____cacheline_aligned_in_smp;

+struct netdev_page_ctor {
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	struct skb_user_page	*(*ctor)(struct netdev_page_ctor *,
+				struct sk_buff *, int);
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -636,6 +646,8 @@ struct net_device_ops {
 	int	(*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid);
 #endif
+	int	(*ndo_page_ctor_prep)(struct net_device *dev,
+				struct netdev_page_ctor *ctor);
 };

@@ -916,6 +928,7 @@ struct net_device
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int	fcoe_ddp_xid;
 #endif
+	struct netdev_page_ctor	*page_ctor;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)

@@ -2013,6 +2026,65 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
 		return 0;
 	return dev->ethtool_ops->get_flags(dev);
 }
+
+static inline int netdev_page_ctor_prep(struct net_device *dev,
+				struct netdev_page_ctor *ctor)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	/* needed by packet split */
+	if (ops->ndo_page_ctor_prep) {
+		rc = ops->ndo_page_ctor_prep(dev, ctor);
+		if (rc)
+			return rc;
+	} else { /* should be temp */
+		ctor->hdr_len = 128;
+		ctor->data_len = 2048;
+		ctor->npages = 1;
+	}
+
+	if (ctor->hdr_len <= 0)
+		goto err;
+
+	npages = ctor->npages;
+	data_len = ctor->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+	    (data_len < PAGE_SIZE * (npages - 1) ||
+	     data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+
+static inline int netdev_page_ctor_attach(struct net_device *dev,
+				struct netdev_page_ctor *ctor)
+{
+	if (dev->flags & IFF_UP)
+		return -EBUSY;
+
+	if (rcu_dereference(dev->page_ctor))
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->page_ctor, ctor);
+
+	return 0;
+}
+
+static inline void netdev_page_ctor_detach(struct net_device *dev)
+{
+	if (!rcu_dereference(dev->page_ctor))
+		return;
+
+	rcu_assign_pointer(dev->page_ctor, NULL);
+	synchronize_rcu();
+}
+
 #endif /* __KERNEL__ */
 #endif	/* _LINUX_NETDEVICE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..c77837e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -209,6 +209,13 @@ struct skb_shared_info {
 	void		*destructor_arg;
 };

+struct skb_user_page {
+	u8		*start;
+	int		size;
+	struct skb_frag_struct	*frags;
+	struct skb_shared_info	*ushinfo;
+	void		(*dtor)(struct skb_user_page *);
+};
 /* We divide dataref into two halves. The higher 16 bits hold references
  * to the payload part of skb->data. The lower 16 bits hold references to
  * the entire skb->data. A clone of a headerless skb holds the length of
@@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void	consume_skb(struct sk_buff *skb);
 extern void	__kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int fclone,
+				   int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority
[PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
 drivers/vhost/Kconfig      |    5 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/mpassthru.c  | 1178 
 include/linux/miscdevice.h |    1 +
 include/linux/mpassthru.h  |   17 +
 5 files changed, 1203 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
+config VHOST_PASSTHRU
+	tristate "Zerocopy network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..d8d153f
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1178 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+struct page_ctor {
+	struct list_head	readq;
+	int			w_len;
+	int			r_len;
+	spinlock_t		read_lock;
+	atomic_t		refcnt;
+	struct kmem_cache	*cache;
+	struct net_device	*dev;
+	struct netdev_page_ctor	ctor;
+	void			*sendctrl;
+	void			*recvctrl;
+};
+
+struct page_info {
+	struct list_head	list;
+	int			header;
+	/* indicate the actual length of bytes
+	 * send/recv in the user space buffers
+	 */
+	int			total;
+	int			offset;
+	struct page		*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff		*skb;
+	struct page_ctor	*ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a user space allocated skb or kernel
+	 */
+	struct skb_user_page	user;
+	struct skb_shared_info	ushinfo;
+
+#define INFO_READ	0
+#define INFO_WRITE	1
+	unsigned		flags;
+	unsigned		pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t
[PATCH 2/3] Provides multiple submits and asynchronous notifications.
The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  145 +++--
 drivers/vhost/vhost.h |   23 
 2 files changed, 164 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..8a85227 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -22,6 +22,7 @@
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 #include <net/sock.h>
@@ -91,6 +92,10 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +129,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -151,6 +158,12 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO.
 */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			msg.msg_control = (void *)vq;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -166,6 +179,10 @@ static void handle_tx(struct vhost_net *net)
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: len %d != %zd\n", err, len);
@@ -177,6 +194,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -206,7 +225,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+		      vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 	use_mm(net->dev.mm);
@@ -217,6 +237,8 @@ static void handle_rx(struct vhost_net *net)
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
+	handle_async_rx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -245,6 +267,11 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			vq->_log = log;
+			msg.msg_control = (void *)vq;
+		}
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX:
@@ -259,6 +286,10 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		/* TODO: Should check and handle checksum.
 */
 		if (err < len) {
 			pr_err("Discarded truncated rx packet:
@@ -284,10 +315,83 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
+	handle_async_rx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
+struct vhost_notifier *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct vhost_notifier *vnotify = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		vnotify = list_first_entry(&vq->notifier
RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Eric, Thanks. I will look into that. But don't stop there. Please comment more. :-) Thanks Xiaohui -Original Message- From: Eric Dumazet [mailto:eric.duma...@gmail.com] Sent: Wednesday, February 10, 2010 11:18 PM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; m...@redhat.com; jd...@c2.user-mode-linux.org; Zhao Yu Subject: Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net. Le mercredi 10 février 2010 à 19:48 +0800, Xin Xiaohui a écrit : Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops as sendmsg/recvmsg to vhost-net. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzha...@gmail.com Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org

+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	rcu_read_lock();
+	if (rcu_dereference(mp->ctor)) {
+		rcu_read_unlock();
+		return -EBUSY;
+	}
+	rcu_read_unlock();

Strange read locking here, for an obvious writer role. What do you really want to do? If writers are serialized by mp_mutex, you don't need this rcu_read_lock()/rcu_read_unlock() stuff.

+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_page_ctor_prep(dev, &ctor->ctor);
+	if (rc)
+		goto fail;
+
+	ctor->cache = kmem_cache_create("skb_page_info",
+			sizeof(struct page_info), 0,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

SLAB_PANIC here means: crash the whole system in case of error. This is not what you want in a driver.
+
+	if (!ctor->cache)
+		goto cache_fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+
+	ctor->w_len = 0;
+	ctor->r_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->ctor.ctor = page_ctor;
+	ctor->ctor.sock = &mp->socket;
+	atomic_set(&ctor->refcnt, 1);
+
+	rc = netdev_page_ctor_attach(dev, &ctor->ctor);
+	if (rc)
+		goto fail;
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, ctor);
+
+	/* XXX:Need we do set_offload here ? */
+
+	return 0;
+
+fail:
+	kmem_cache_destroy(ctor->cache);
+cache_fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+
+static inline void get_page_ctor(struct page_ctor *ctor)
+{
+	atomic_inc(&ctor->refcnt);
+}
+
+static inline void put_page_ctor(struct page_ctor *ctor)
+{
+	if (atomic_dec_and_test(&ctor->refcnt))
+		kfree(ctor);

Are you sure a RCU grace period is not needed before freeing?

+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	int i;
+
+	rcu_read_lock();
+	ctor = rcu_dereference(mp->ctor);
+	rcu_read_unlock();

Strange locking again here

+
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	kmem_cache_destroy(ctor->cache);
+	netdev_page_ctor_detach(ctor->dev);
+	dev_put(ctor->dev);
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, NULL);
+	synchronize_rcu();
+
+	put_page_ctor(ctor);
+
+	return 0;
+}
+
+/* For small user space buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+					       int total)
+{
+	struct page_info *info = kmem_cache_alloc(ctor->cache, GFP_KERNEL);

kmem_cache_zalloc() ?
+
+	if (!info)
+		return NULL;
+	memset(info, 0, sizeof(struct page_info));
+	memset(info->pages, 0, sizeof(info->pages));

redundant memset(), whole structure already cleared one line above

+
+	info->header = 0;

already cleared

+	info->total = total;
+	info->skb = NULL;

already cleared

+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->pnum = 0;

already cleared

+	return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the user space address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+		struct iovec *iov, int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info
RE: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
On Wednesday 10 February 2010, Xin Xiaohui wrote: The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer thru the guest virtio-net frontend. We provide multiple submits and asynchronous notification to vhost-net too. This does a lot of things that I had planned for macvtap. It's great to hear that you have made this much progress. However, I'd hope that we could combine this with the macvtap driver, which would give us zero-copy transfer capability both with and without vhost, as well as (tx at least) when using multiple guests on a macvlan setup. You mean the zero-copy can work with the macvtap driver without vhost. May you give me some detailed info about your macvtap driver and the relationship between vhost and macvtap, to give me a clear picture? For transmit, it should be fairly straightforward to hook up your zero-copy method and the vhost-net interface into the macvtap driver. You have simplified the receive path significantly by assuming that the entire netdev can receive into a single guest, right? Yes. I'm assuming that the idea is to allow VMDq adapters to simply show up as separate adapters and have the driver handle this in a hardware specific way. Does the VMDq driver do so now? My plan for this was to instead move support for VMDq into the macvlan driver so we can transparently use VMDq on hardware where available, including zero-copy receives, but fall back to software operation on non-VMDq hardware. Arnd
RE: [ANNOUNCE] KVM developer conference call
Hi, Chris We are interested in joining the conference, since we are now working on the zero copy patch based on vhost-net. Thanks Xiaohui -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Chris Wright Sent: Tuesday, January 12, 2010 10:16 AM To: kvm@vger.kernel.org Cc: qemu-de...@nongnu.org Subject: [ANNOUNCE] KVM developer conference call The KVM project is announcing a regular conference call focused on KVM development issues. The call will take place weekly on Tuesdays at 15:00 UTC for 1 hour. The purpose of the call is to discuss relevant development issues in a high bandwidth forum, certainly _not_ meant to supplant discussion on list. I will send out a request for agenda items each week the day before the call, and post minutes after the call concludes. If you are interested in joining please send me an email. thanks, -chris
RE: vhost-net patches
Hi, Michael, What's your deferring skb allocation patch mentioned here? May you elaborate on it in a little more detail? Thanks Xiaohui -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin Sent: Tuesday, November 03, 2009 3:47 AM To: Shirley Ma Cc: Sridhar Samudrala; Shirley Ma; David Stevens; kvm@vger.kernel.org; s...@linux.vnet.ibm.com; mashi...@linux.vnet.ibm.com Subject: Re: vhost-net patches On Thu, Oct 29, 2009 at 12:11:33AM -0700, Shirley Ma wrote: Hello Michael, I am able to get 63xxMb/s throughput with 10% less cpu utilization when I apply the deferring skb patch on top of your most recent vhost patch. The userspace TCP_STREAM BW used to be 3xxxMb/s from the upstream git tree. After applying your recent vhost patch, it goes up to 53xxMb/s. Thanks for the report! I have pushed this patch (after cleaning it up and fixing raw sockets), and updated the numbers in the wiki here http://www.linux-kvm.org/page/VhostNet -- MST
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
The irqfd/ioeventfd patches are part of Avi's kvm.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git I expect them to be merged by 2.6.32-rc1 - right, Avi? Michael, I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed the qemu side patch for irqfd and ioeventfd. I met a compile error when I compiled the virtio-pci.c file in qemu-kvm, like this: /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function) /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function) Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd? Thanks Xiaohui -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Sunday, September 13, 2009 1:46 PM To: Xin, Xiaohui Cc: Ira W. Snyder; net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty Russell; s.he...@linux-ag.com; a...@redhat.com Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote: Michael, We are very interested in your patch and want to have a try with it. I have collected your 3 patches on the kernel side and 4 patches on the qemu side. The patches are listed here: PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm qemu.
But it seems some more patches are needed on current qemu, at least the irqfd and ioeventfd patches. I cannot create a kvm guest with -net nic,model=virtio,vhost=vethX. May you kindly advise us of the exact patch list to make it work? Thanks a lot. :-) Thanks Xiaohui The irqfd/ioeventfd patches are part of Avi's kvm.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git I expect them to be merged by 2.6.32-rc1 - right, Avi? -- MST
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael, We are very interested in your patch and want to have a try with it. I have collected your 3 patches on the kernel side and 4 patches on the qemu side. The patches are listed here: PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm qemu. But it seems some more patches are needed on current qemu, at least the irqfd and ioeventfd patches. I cannot create a kvm guest with -net nic,model=virtio,vhost=vethX. May you kindly advise us of the exact patch list to make it work? Thanks a lot. :-) Thanks Xiaohui -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin Sent: Wednesday, September 09, 2009 4:14 AM To: Ira W. Snyder Cc: net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty Russell; s.he...@linux-ag.com Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote: On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration) - supports a memory table and not just an offset (needed for kvm) common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part : this supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: Basically, we connect virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support raw socket as a backend, which can be bound to e.g. SR IOV, or to macvlan device. Backend is also configured by userspace, including vlan/mac etc. Status: This works for me, and I haven't seen any crashes. I have done some light benchmarking (with v4), compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the ping benchmark (where there's no TSO) throughput is also improved. Features that I plan to look at in the future: - tap support - TSO - interrupt mitigation - zero copy Hello Michael, I've started looking at vhost with the intention of using it over PCI to connect physical machines together. The part that I am struggling with the most is figuring out which parts of the rings are in the host's memory, and which parts are in the guest's memory. All rings are in guest's memory, to match existing virtio code. Ok, this makes sense. vhost assumes that the memory space of the hypervisor userspace process covers the whole of guest memory. Is this necessary? Why?
Because with virtio ring can give us arbitrary guest addresses. If guest was limited to using a subset of addresses, hypervisor would only have to map these. The assumption seems very wrong when you're doing data transport between two physical systems via PCI. I know vhost has not been designed for this specific situation, but it is good to be looking toward other possible uses. And there's a translation table. Ring addresses are userspace addresses, they do not undergo translation. If I understand everything correctly, the rings are all userspace addresses, which means that they can be moved around in physical memory, and get pushed out to swap. Unless they are locked, yes. AFAIK, this is impossible to handle when connecting two physical systems, you'd need the rings available in IO memory (PCI memory), so you can ioreadXX() them instead.
RE: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
* Code is easier to review than bullet points. Yes. We'll send the code soon. * Direct I/O has to be safe when a page is shared by multiple threads, and has to be non-blocking since network I/O can take indeterminately long (think big queues, tunneling, ...). In this situation, one queue pair of the NIC is assigned to only one guest; the pages are locked and the KVM guest will not be swapped out. * In the past, attempts at Direct I/O on the network have always had SMP TLB issues. The page has to be flipped or marked as COW on all CPUs, and the cost of the Inter Processor Interrupt to steal the page has been slower than copying. That may be; we have not thought about this more. Thanks. Thanks Xiaohui -Original Message- From: Stephen Hemminger [mailto:shemmin...@vyatta.com] Sent: Wednesday, September 02, 2009 12:05 AM To: Xin, Xiaohui Cc: m...@redhat.com; net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com Subject: Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM On Tue, 1 Sep 2009 14:58:19 +0800 Xin, Xiaohui xiaohui@intel.com wrote: [RFC] Virtual Machine Device Queues (VMDq) support on KVM. A network adapter with VMDq technology presents multiple pairs of tx/rx queues, and renders a network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework, in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy. Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly thru a tx/rx queue pair in a network adapter. We use the virtio-net architecture to illustrate the framework.
[Architecture diagram, reflowed from the original ASCII art:

  Qemu process (Virtio-net backend service)      Guest kernel (Virtio-net driver)
           <------- TX: pop / add_buf -------
           -------- RX: push / get_buf ------>

  The backend service reaches the host kernel through AIO (read/write)
  combined with Direct I/O, which substitutes for synced file operations:
    read:  copy-less; user space directly mapped to kernel, payload DMAed
           straight into user space
    write: copy-less; user space directly mapped to kernel, payload hooked
           directly onto an skb

  Host kernel:  NIC driver -- TUN/TAP driver  (e.g. one queue-pair instance),
  backed by a traditional adapter or a tx/rx queue pair.]

The basic idea is to utilize kernel Asynchronous I/O combined with Direct I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to the kernel; we can still see them in the SCSI tape driver. With traditional file operations, a copy of the payload contents from/to the kernel DMA address to/from a user buffer is needed. That's the copying we want to save.

The proposed framework is like this: a TUN/TAP device is bound to a traditional NIC adapter or to a tx/rx queue pair on the host side. The KVM virtio-net backend service, the user space program, submits asynchronous read/write I/O requests to the host kernel through the TUN/TAP device. The requests correspond to the vqueue elements and include both transmit and receive. They can be queued in one AIO request, and later the completion will be notified through the underlying packet tx/rx processing of the rx/tx queue pair.

Detailed path: to the guest Virtio-net driver, packet receives correspond to asynchronous read I/O requests of the backend service.

1) Guest Virtio-net driver provides header
[RFC] Virtual Machine Device Queues (VMDq) support on KVM
[RFC] Virtual Machine Device Queues (VMDq) support on KVM

A network adapter with VMDq technology presents multiple pairs of tx/rx queues, and provides a network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy.

Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair, and applications using the same user/kernel interface can utilize it to send/receive network traffic directly through a tx/rx queue pair in a network adapter. We use the virtio-net architecture to illustrate the framework.

[Architecture diagram, reflowed from the original ASCII art:

  Qemu process (Virtio-net backend service)      Guest kernel (Virtio-net driver)
           <------- TX: pop / add_buf -------
           -------- RX: push / get_buf ------>

  The backend service reaches the host kernel through AIO (read/write)
  combined with Direct I/O, which substitutes for synced file operations:
    read:  copy-less; user space directly mapped to kernel, payload DMAed
           straight into user space
    write: copy-less; user space directly mapped to kernel, payload hooked
           directly onto an skb

  Host kernel:  NIC driver -- TUN/TAP driver  (e.g. one queue-pair instance),
  backed by a traditional adapter or a tx/rx queue pair.]

The basic idea is to utilize kernel Asynchronous I/O combined with Direct I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to the kernel; we can still see them in the SCSI tape driver. With traditional file operations, a copy of the payload contents from/to the kernel DMA address to/from a user buffer is needed. That's the copying we want to save.

The proposed framework is like this: a TUN/TAP device is bound to a traditional NIC adapter or to a tx/rx queue pair on the host side.
The KVM virtio-net backend service, the user space program, submits asynchronous read/write I/O requests to the host kernel through the TUN/TAP device. The requests correspond to the vqueue elements and include both transmit and receive. They can be queued in one AIO request, and later the completion will be notified through the underlying packet tx/rx processing of the rx/tx queue pair.

Detailed path: to the guest Virtio-net driver, packet receives correspond to asynchronous read I/O requests of the backend service.

1) The guest Virtio-net driver provides header and payload addresses through the receive vqueue to the Virtio-net backend service.
2) The Virtio-net backend service encapsulates multiple vqueue elements into multiple AIO control blocks and composes them into one AIO read request.
3) The Virtio-net backend service uses the io_submit() syscall to pass the request to the TUN/TAP device.
4) The Virtio-net backend service uses the io_getevents() syscall to check the completion of the request.
5) The TUN/TAP driver receives packets from the queue pair of the NIC and prepares for Direct I/O. A modified NIC driver may render an skb whose header is allocated in the host kernel but whose payload buffer is directly mapped from the user space buffers rendered through the AIO request by the backend service; get_user_pages() may do this. For one AIO read request, the TUN/TAP driver maintains a list of the directly mapped buffers, and the NIC driver tries to use these buffers as payload buffers to compose the new skbs. Of course, if getting the buffers fails, kernel allocated buffers are used instead.
6) Modern NIC cards now mostly have the header split feature. The NIC queue pair may then directly DMA the payload into the user space mapped payload buffers. Thus a zero-copy of the payload is implemented in packet receiving.
7) The TUN/TAP driver manually copies the host header to the user mapped space.
8)
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> I don't think we should do that with the tun/tap driver. By design, tun/tap
> is a way to interact with the networking stack as if coming from a device.
> The only way this connects to an external adapter is through a bridge or
> through IP routing, which means that it does not correspond to a specific
> NIC.
>
> I have worked on a driver I called 'macvtap', in lack of a better name, to
> add a new tap frontend to the 'macvlan' driver. Since macvlan lets you add
> slaves to a single NIC device, this gives you a direct connection between
> one or multiple tap devices and an external NIC, which works a lot better
> than when you have a bridge in between. There is also work underway to add
> a bridging capability to macvlan, so you can communicate directly between
> guests like you can do with a bridge. Michael's vhost_net can plug into the
> same macvlan infrastructure, so the work is complementary.

We use the TUN/TAP device to implement the prototype, and agree that it's not the only choice here. We'd like to compare the two if possible. What we care more about are the modifications in the kernel, i.e. the changes to the net_dev and skb structures. Thanks.

Thanks
Xiaohui

-Original Message-
From: Arnd Bergmann [mailto:a...@arndb.de]
Sent: Monday, August 31, 2009 11:24 PM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Monday 31 August 2009, Xin, Xiaohui wrote:

> Hi, Michael,
>
> That's a great job. We are now working on supporting VMDq on KVM, and since
> the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags,
> our target is to implement a zero-copy solution using VMDq.

I'm also interested in helping there, please include me in the discussions.

> We started from the virtio-net architecture.
> What we want to propose is to use AIO combined with direct I/O:
> 1) Modify the virtio-net backend service in Qemu to submit AIO requests
>    composed from the virtqueue.

right, that sounds useful.

> 2) Modify the TUN/TAP device to support AIO operations and map the user
>    space buffer directly into the host kernel.
> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.

I don't think we should do that with the tun/tap driver. By design, tun/tap is a way to interact with the networking stack as if coming from a device. The only way this connects to an external adapter is through a bridge or through IP routing, which means that it does not correspond to a specific NIC.

I have worked on a driver I called 'macvtap', in lack of a better name, to add a new tap frontend to the 'macvlan' driver. Since macvlan lets you add slaves to a single NIC device, this gives you a direct connection between one or multiple tap devices and an external NIC, which works a lot better than when you have a bridge in between. There is also work underway to add a bridging capability to macvlan, so you can communicate directly between guests like you can do with a bridge. Michael's vhost_net can plug into the same macvlan infrastructure, so the work is complementary.

> 4) Modify the net_dev and skb structures to permit an allocated skb to use
>    a directly mapped user space payload buffer address rather than a kernel
>    allocated one.

yes.

> As zero copy is also your goal, we are interested in what's in your mind,
> and would like to collaborate with you if possible. BTW, we will send our
> VMDq write-up very soon.

Ok, cool.

	Arnd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> It may be possible to make vmdq appear like an sr-iov capable device from
> userspace. sr-iov provides the userspace interfaces to allocate interfaces
> and assign mac addresses. To make it useful, you would have to handle tx
> multiplexing in the driver, but that would be much easier to consume for
> kvm.

What we have thought of is to support multiple net_dev structures, one per queue pair of a vmdq adapter, and to present multiple MAC addresses in user space, so that each MAC can be used by one guest. What exactly does the tx multiplexing in the driver mean?

Thanks
Xiaohui

-Original Message-
From: Anthony Liguori [mailto:anth...@codemonkey.ws]
Sent: Tuesday, September 01, 2009 5:57 AM
To: Avi Kivity
Cc: Xin, Xiaohui; m...@redhat.com; net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

Avi Kivity wrote:
On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:

> Hi, Michael,
>
> That's a great job. We are now working on supporting VMDq on KVM, and since
> the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags,
> our target is to implement a zero-copy solution using VMDq. We started from
> the virtio-net architecture. What we want to propose is to use AIO combined
> with direct I/O:
> 1) Modify the virtio-net backend service in Qemu to submit AIO requests
>    composed from the virtqueue.
> 2) Modify the TUN/TAP device to support AIO operations and map the user
>    space buffer directly into the host kernel.
> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
> 4) Modify the net_dev and skb structures to permit an allocated skb to use
>    a directly mapped user space payload buffer address rather than a kernel
>    allocated one.
> As zero copy is also your goal, we are interested in what's in your mind,
> and would like to collaborate with you if possible.
One way to share the effort is to make vmdq queues available as normal kernel interfaces. It may be possible to make vmdq appear like an sr-iov capable device from userspace. sr-iov provides the userspace interfaces to allocate interfaces and assign mac addresses. To make it useful, you would have to handle tx multiplexing in the driver, but that would be much easier to consume for kvm.

Regards,

Anthony Liguori
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Hi, Michael,

That's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a zero-copy solution using VMDq. We started from the virtio-net architecture. What we want to propose is to use AIO combined with direct I/O:

1) Modify the virtio-net backend service in Qemu to submit AIO requests composed from the virtqueue.
2) Modify the TUN/TAP device to support AIO operations and map the user space buffer directly into the host kernel.
3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
4) Modify the net_dev and skb structures to permit an allocated skb to use a directly mapped user space payload buffer address rather than a kernel allocated one.

As zero copy is also your goal, we are interested in what's in your mind, and would like to collaborate with you if possible. BTW, we will send our VMDq write-up very soon.

Thanks
Xiaohui

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin
Sent: Wednesday, August 19, 2009 11:03 PM
To: net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: [PATCHv4 2/2] vhost_net: a kernel-level virtio server

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio related code has been put in a separate file, vhost.c, and can be made into a separate module if/when more backends appear.
I used Rusty's lguest.c as the source for developing this part: it supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have not run any benchmarks yet; compared to userspace, I expect to see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Acked-by: Arnd Bergmann a...@arndb.de
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  429 ++++++++
 drivers/vhost/vhost.c      |  664 ++++++++++++
 drivers/vhost/vhost.h      |  108 +++
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  100 +++
 11 files changed, 1328 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@ S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
+VIRTIO HOST (VHOST)
+P:	Michael S. Tsirkin
+M:	m...@redhat.com
+L:	kvm@vger.kernel.org
+L:	virtualizat...@lists.osdl.org
+L:	net...@vger.kernel.org
+S:	Maintained
+F:	drivers/vhost/
+F:	include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M:	Roger Luethi r...@hellgate.ch
 S:	Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source "drivers/vhost/Kconfig"
 source "drivers/lguest/Kconfig"
 source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID)		+= hid/
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> One way to share the effort is to make vmdq queues available as normal
> kernel interfaces. It would take quite a bit of work, but the end result is
> that no other components need to be changed, and it makes vmdq useful
> outside kvm. It also greatly reduces the amount of integration work needed
> throughout the stack (kvm/qemu/libvirt).

Yes. The common queue-pair interface we want to present will also apply to normal hardware, and we will try to leave the other components unaware of it.

Thanks
Xiaohui

-Original Message-
From: Avi Kivity [mailto:a...@redhat.com]
Sent: Tuesday, September 01, 2009 1:52 AM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:

> Hi, Michael,
>
> That's a great job. We are now working on supporting VMDq on KVM, and since
> the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags,
> our target is to implement a zero-copy solution using VMDq. We started from
> the virtio-net architecture. What we want to propose is to use AIO combined
> with direct I/O:
> 1) Modify the virtio-net backend service in Qemu to submit AIO requests
>    composed from the virtqueue.
> 2) Modify the TUN/TAP device to support AIO operations and map the user
>    space buffer directly into the host kernel.
> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
> 4) Modify the net_dev and skb structures to permit an allocated skb to use
>    a directly mapped user space payload buffer address rather than a kernel
>    allocated one.
> As zero copy is also your goal, we are interested in what's in your mind,
> and would like to collaborate with you if possible.

One way to share the effort is to make vmdq queues available as normal kernel interfaces.
It would take quite a bit of work, but the end result is that no other components need to be changed, and it makes vmdq useful outside kvm. It also greatly reduces the amount of integration work needed throughout the stack (kvm/qemu/libvirt).

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.