RE: [PATCH v16 14/17]Add a kconfig entry and make entry for mp device.

2010-12-01 Thread Xin, Xiaohui
-Original Message-
From: Randy Dunlap [mailto:randy.dun...@oracle.com]
Sent: Thursday, December 02, 2010 1:54 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v16 14/17]Add a kconfig entry and make entry for mp device.

On Wed,  1 Dec 2010 16:08:25 +0800 xiaohui@intel.com wrote:

 From: Xin Xiaohui xiaohui@intel.com

 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  drivers/vhost/Kconfig  |   10 ++
  drivers/vhost/Makefile |2 ++
  2 files changed, 12 insertions(+), 0 deletions(-)

 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 index e4e2fd1..a6b8cbf 100644
 --- a/drivers/vhost/Kconfig
 +++ b/drivers/vhost/Kconfig
 @@ -9,3 +9,13 @@ config VHOST_NET
To compile this driver as a module, choose M here: the module will
be called vhost_net.

 +config MEDIATE_PASSTHRU
 +tristate "mediate passthru network driver (EXPERIMENTAL)"
 +depends on VHOST_NET
 +---help---
 +  zerocopy network I/O support, we call it as mediate passthru to

  support; we call it mediate passthru to

 +  be distiguish with hardare passthru.

 distinguish it from hardware passthru.


Thanks. I will modify that.

Thanks
Xiaohui

 +
 +  To compile this driver as a module, choose M here: the module will
 +  be called mpassthru.
 +
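
For reference, with Randy's fixes folded in, the entry would read roughly as
follows (final wording is of course up to the author):

config MEDIATE_PASSTHRU
	tristate "mediate passthru network driver (EXPERIMENTAL)"
	depends on VHOST_NET
	---help---
	  zerocopy network I/O support; we call it mediate passthru to
	  distinguish it from hardware passthru.

	  To compile this driver as a module, choose M here: the module will
	  be called mpassthru.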


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

2010-11-17 Thread Xin, Xiaohui
-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
Of Xin,
Xiaohui
Sent: Thursday, November 11, 2010 4:28 PM
To: David Miller
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Thursday, November 11, 2010 1:47 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

From: xiaohui@intel.com
Date: Wed, 10 Nov 2010 17:23:28 +0800

 From: Xin Xiaohui xiaohui@intel.com

2) The idea to key off of skb->dev in skb_release_data() is
   fundamentally flawed since many actions can change skb->dev on you,
   which will end up causing a leak of your external data areas.

 How about this one? If the destructor_arg is not a good candidate,
 then I have to add an apparent field in shinfo.

If destructor_arg is actually a net_device pointer or similar,
you will need to take a reference count on it or similar.

Do you mean destructor_arg may be consumed by another user?
In that case, may I add a new structure member to shinfo?
Then only zero-copy would use it, and there would be no need for the reference count.

How about this? We really need somewhere to track the external data area,
and if something goes wrong with it, we can also release the data area. We think
skb_release_data() is the right place to deal with it. If I understood correctly
that destructor_arg will be used by someone else, which is why a reference count
is needed, then how about adding a new structure member to shinfo?

Thanks
Xiaohui 

Which means -- good bye performance especially on SMP.

You're going to be adding new serialization points and at
least two new atomics per packet.
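
For context, "taking a reference count" on the device would look roughly like
the hypothetical sketch below; dev_hold()/dev_put() are atomic refcount
helpers, which is exactly the per-packet cost being objected to here. The
struct and helper names are illustrative, not part of the patchset.

struct my_ext_area {				/* hypothetical descriptor */
	struct net_device *dev;			/* device owning the buffers */
	/* ... pinned pages, guest cookies, etc. ... */
};

static void ext_area_attach(struct sk_buff *skb, struct my_ext_area *area)
{
	area->dev = skb->dev;
	dev_hold(area->dev);			/* atomic inc on dev refcount */
	skb_shinfo(skb)->destructor_arg = area;
}

static void ext_area_release(struct my_ext_area *area)
{
	dev_put(area->dev);			/* atomic dec on dev refcount */
}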


RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

2010-11-11 Thread Xin, Xiaohui
-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Thursday, November 11, 2010 1:47 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.

From: xiaohui@intel.com
Date: Wed, 10 Nov 2010 17:23:28 +0800

 From: Xin Xiaohui xiaohui@intel.com

2) The idea to key off of skb->dev in skb_release_data() is
   fundamentally flawed since many actions can change skb->dev on you,
   which will end up causing a leak of your external data areas.

 How about this one? If the destructor_arg is not a good candidate,
 then I have to add an apparent field in shinfo.

If destructor_arg is actually a net_device pointer or similar,
you will need to take a reference count on it or similar.

Do you mean destructor_arg may be consumed by another user?
In that case, may I add a new structure member to shinfo?
Then only zero-copy would use it, and there would be no need for the reference count.

Which means -- good bye performance especially on SMP.

You're going to be adding new serialization points and at
least two new atomics per packet.


RE: Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

2010-11-08 Thread Xin, Xiaohui
-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Monday, November 08, 2010 4:25 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() 
specially.

On Monday, November 08, 2010 at 16:03 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

  Hmm, I suggest you read the comment two lines above.
 
  If destructor_arg is now cleared each time we allocate a new skb, then,
  please move it before dataref in shinfo structure, so that the following
  memset() does the job efficiently...
 
 
 Something like :
 
 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 index e6ba898..2dca504 100644
 --- a/include/linux/skbuff.h
 +++ b/include/linux/skbuff.h
 @@ -195,6 +195,9 @@ struct skb_shared_info {
 __be32  ip6_frag_id;
 __u8    tx_flags;
 struct sk_buff  *frag_list;
 +   /* Intermediate layers must ensure that destructor_arg
 +* remains valid until skb destructor */
 +   void*destructor_arg;
 struct skb_shared_hwtstamps hwtstamps;
 
 /*
 @@ -202,9 +205,6 @@ struct skb_shared_info {
  */
 atomic_tdataref;
 
 -   /* Intermediate layers must ensure that destructor_arg
 -* remains valid until skb destructor */
 -   void *  destructor_arg;
 /* must be last field, see pskb_expand_head() */
 skb_frag_t  frags[MAX_SKB_FRAGS];
  };
 
 

 Will that affect the cache line?

What do you mean ?

 Or, we can move the line to clear destructor_arg to the end of __alloc_skb().
 It looks like the following; which one do you prefer?

 Thanks
 Xiaohui

 ---
  net/core/skbuff.c |8 
  1 files changed, 8 insertions(+), 0 deletions(-)

 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index c83b421..df852f2 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
 gfp_mask,

  child->fclone = SKB_FCLONE_UNAVAILABLE;
  }
 +shinfo->destructor_arg = NULL;
  out:
  return skb;
  nodata:

I don't understand why you want to do this.

This adds an instruction and makes the code bigger, with no obvious gain
on the memory-transaction side.

If integrated into the existing memset(), the cost is an extra iteration to
perform the clear of this field.
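
For reference, __alloc_skb() initializes shinfo roughly like this (paraphrased
from net/core/skbuff.c of that era), which is why any field placed before
dataref gets cleared by the existing memset() at no extra memory cost:

	shinfo = skb_shinfo(skb);
	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
	atomic_set(&shinfo->dataref, 1);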

Ok. Thanks for this explanation; I will update with your solution.

Thanks
Xiaohui




RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.

2010-11-07 Thread Xin, Xiaohui
I have addressed this issue in v14 patch set.

Thanks
Xiaohui

-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Saturday, October 30, 2010 4:29 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from 
NIC driver.

From: Xin, Xiaohui xiaohui@intel.com
Date: Wed, 27 Oct 2010 09:33:12 +0800

 Somehow, it seems not a trivial work to support it now. Can we support it
 later and as a todo with our current work?

I would prefer the feature work properly, rather than only in specific
cases, before being integrated.


RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.

2010-10-26 Thread Xin, Xiaohui
-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Tuesday, October 19, 2010 11:24 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from 
NIC driver.

From: xiaohui@intel.com
Date: Fri, 15 Oct 2010 17:12:11 +0800

 @@ -2891,6 +2922,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
  ncls:
  #endif

 +/* To intercept mediate passthru(zero-copy) packets here */
 +skb = handle_mpassthru(skb, pt_prev, ret, orig_dev);
 +if (!skb)
 +goto out;
 +
  /* Handle special case of bridge or macvlan */
  rx_handler = rcu_dereference(skb->dev->rx_handler);
  if (rx_handler) {

If you consume the packet here, devices in passthru mode cannot
be use with bonding.

But there is nothing that prevents a bond being created with such
a device.

So we have to either prevent such configurations (bad) or make
it work somehow (good) :-)

The big picture may look like this:
To prevent such configurations, we should add checks in both the
mp and the bonding driver. If a NIC is in zero-copy mode, bonding can't
be set up with it, and if a NIC is a bonding slave, we can't bind the device
to do zero-copy. (A rough sketch of such checks follows below.)

If we want to support such configurations, there are also some constraints.
If the bond is created first, we need code to check whether all the slaves
support zero-copy mode; if yes, all the slaves should be assigned the same
page_ctor(), and all received packets should be intercepted via the master
NIC. If not, it fails.
If zero-copy is enabled first, creating a bond with that device should fail.

Still, it does not seem to be trivial work to support this now. Can we
support it later, as a todo on top of our current work?
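
A rough sketch of such mutual-exclusion checks (dev_is_mpassthru() comes from
this patchset; the exact placement in bond_enslave() and the use of IFF_SLAVE
on the mp side are assumptions):

	/* bonding side, e.g. in bond_enslave(): refuse a zero-copy device */
	if (dev_is_mpassthru(slave_dev)) {
		pr_err("%s: %s is in mediate passthru mode, cannot enslave it\n",
		       bond_dev->name, slave_dev->name);
		return -EBUSY;
	}

	/* mp side, at bind time: refuse a device that is already a bond slave */
	if (dev->flags & IFF_SLAVE)
		return -EBUSY;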

Thanks
Xiaohui






RE: [PATCH v12 12/17] Add mp(mediate passthru) device.

2010-10-18 Thread Xin, Xiaohui
-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Sunday, October 03, 2010 9:13 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v12 12/17] Add mp(mediate passthru) device.

On Thu, Sep 30, 2010 at 10:04:30PM +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 The patch adds the mp (mediate passthru) device, which is now
 based on the vhost-net backend driver and provides proto_ops
 to send/receive guest buffer data from/to the guest virtio-net
 driver.

 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com

So you plan to rewrite all this to make this code part of macvtap?

V13 exports functions which can be reused by macvtap to get zero-copy.
Currently, I'm adding code in macvtap and trying to make it work.

Thanks
Xiaohui


RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Xin, Xiaohui
-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Friday, October 01, 2010 3:15 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() 
specially.

From: xiaohui@intel.com
Date: Thu, 30 Sep 2010 22:04:23 +0800

 @@ -197,10 +197,11 @@ struct skb_shared_info {
  union skb_shared_tx tx_flags;
  struct sk_buff  *frag_list;
  struct skb_shared_hwtstamps hwtstamps;
 -skb_frag_t  frags[MAX_SKB_FRAGS];
  /* Intermediate layers must ensure that destructor_arg
   * remains valid until skb destructor */
  void *  destructor_arg;
 +
 +skb_frag_t  frags[MAX_SKB_FRAGS];
  };

  /* The structure is for a skb which pages may point to

Why are you moving frags[] to the end like this?

That's to avoid the new cache miss caused by using destructor_arg in the data
path, e.g. in skb_release_data().
That's based on a comment from Eric Dumazet on the v7 patches.

Thanks
Xiaohui


RE: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.

2010-10-11 Thread Xin, Xiaohui
-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Thursday, September 30, 2010 10:22 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [PATCH v12 10/17] Add a hook to intercept external buffers from 
NIC driver.

On Thursday, September 30, 2010 at 22:04 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 The hook is called in netif_receive_skb().
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  net/core/dev.c |   35 +++
  1 files changed, 35 insertions(+), 0 deletions(-)

 diff --git a/net/core/dev.c b/net/core/dev.c
 index c11e32c..83172b8 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -2517,6 +2517,37 @@ err:
  EXPORT_SYMBOL(netdev_mp_port_prep);
  #endif

 +#if defined(CONFIG_MEDIATE_PASSTHRU) ||
defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
 +/* Add a hook to intercept mediate passthru(zero-copy) packets,
 + * and insert it to the socket queue owned by mp_port specially.
 + */
 +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
 +   struct packet_type **pt_prev,
 +   int *ret,
 +   struct net_device *orig_dev)
 +{
 +struct mp_port *mp_port = NULL;
 +struct sock *sk = NULL;
 +
 +if (!dev_is_mpassthru(skb->dev))
 +return skb;
 +mp_port = skb->dev->mp_port;
 +
 +if (*pt_prev) {
 +*ret = deliver_skb(skb, *pt_prev, orig_dev);
 +*pt_prev = NULL;
 +}
 +
 +sk = mp_port->sock->sk;
 +skb_queue_tail(&sk->sk_receive_queue, skb);
 +sk->sk_state_change(sk);
 +
 +return NULL;
 +}
 +#else
 +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
 +#endif
 +
  /**
   *  netif_receive_skb - process receive buffer from network
   *  @skb: buffer to process
 @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
  ncls:
  #endif

 +/* To intercept mediate passthru(zero-copy) packets here */
 +skb = handle_mpassthru(skb, pt_prev, ret, orig_dev);
 +if (!skb)
 +goto out;
  skb = handle_bridge(skb, pt_prev, ret, orig_dev);
  if (!skb)
  goto out;

This does not apply to current net-next-2.6

We now have dev-rx_handler (currently for bridge or macvlan)


Ok. Thanks, will rebase to fix that.

Thanks
Xiaohui

commit ab95bfe01f9872459c8678572ccadbf646badad0
Author: Jiri Pirko jpi...@redhat.com
Date:   Tue Jun 1 21:52:08 2010 +

net: replace hooks in __netif_receive_skb V5

What this patch does is it removes two receive frame hooks (for bridge and 
 for
macvlan) from __netif_receive_skb. These are replaced them with a single
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.

Then a network driver (of virtual netdev like macvlan or bridge) can 
 register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko jpi...@redhat.com
Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
Signed-off-by: David S. Miller da...@davemloft.net
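
In other words, the interception could be reworked to use that hook instead of
a new one in __netif_receive_skb(). A rough sketch (mp_rx_handler() and its
registration point are assumptions; the rx_handler prototype shown is the one
from this era and has changed in later kernels):

	static struct sk_buff *mp_rx_handler(struct sk_buff *skb)
	{
		struct mp_port *port = skb->dev->mp_port;
		struct sock *sk = port->sock->sk;

		skb_queue_tail(&sk->sk_receive_queue, skb);
		sk->sk_state_change(sk);
		return NULL;			/* packet consumed */
	}

	/* at zero-copy bind time; argument list varies by kernel version */
	err = netdev_rx_handler_register(dev, mp_rx_handler, NULL);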





RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Xin, Xiaohui
-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Monday, October 11, 2010 11:42 PM
To: David Miller
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu;
herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() 
specially.

On Monday, October 11, 2010 at 08:27 -0700, David Miller wrote:
 From: Xin, Xiaohui xiaohui@intel.com
 Date: Mon, 11 Oct 2010 15:06:05 +0800

  That's to avoid the new cache miss caused by using destructor_arg in data 
  path
  like skb_release_data().
  That's based on the comment from Eric Dumazet on v7 patches.

 Thanks for the explanation.

Anyway, frags[] must be the last field of struct skb_shared_info
since commit fed66381 (net: pskb_expand_head() optimization)

It seems Xin worked on a quite old tree.


I will rebase soon.

Thanks
Xiaohui


RE: [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.

2010-09-30 Thread Xin, Xiaohui
I will be on leave during 10/01 ~ 10/07, so expect slow or no response to comments during that time.

Thanks
Xiaohui

-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
Behalf Of
xiaohui@intel.com
Sent: Thursday, September 30, 2010 10:04 PM
To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external
buffers to DMA into. Here "external" means the driver doesn't use kernel space
to allocate skb buffers. Currently the external buffers come from
the guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then
let the host NIC driver have the chance to DMA directly into it.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-10:   net core and kernel changes.
patch 11-13:   new device as the interface to manipulate external buffers.
patch 14:  for vhost-net.
patch 15:  an example of modifying a NIC driver to use napi_gro_frags().
patch 16:  an example of how to get guest buffers in a driver
   that uses napi_gro_frags().
patch 17:  a patch to address comments from Michael S. Tsirkin,
   adding 2 new ioctls to the mp device.
   We split it out here to make review easier.
   Needs revision.

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel, and the requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload into skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
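
A rough sketch of that tx idea (illustrative only: hdr_len, guest_hdr,
guest_pages[], offsets[], lens[], npages and payload_len stand for values
taken from the guest descriptors, and error handling is omitted):

	/* header: small copy into the linear part of a fresh host skb */
	skb = alloc_skb(hdr_len + NET_IP_ALIGN, GFP_KERNEL);
	if (!skb)
		return -ENOMEM;
	skb_reserve(skb, NET_IP_ALIGN);
	memcpy(skb_put(skb, hdr_len), guest_hdr, hdr_len);

	/* payload: attach the pinned guest pages as frags, no data copy */
	for (i = 0; i < npages; i++)
		skb_fill_page_desc(skb, i, guest_pages[i], offsets[i], lens[i]);

	/* skb_fill_page_desc() does not touch len/data_len/truesize */
	skb->data_len  = payload_len;
	skb->len      += payload_len;
	skb->truesize += payload_len;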

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
   Performance tuning

what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchronous mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it look generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
   a small fix for a missing dev_put on failure
   use a dynamic minor instead of a static minor number
   a __KERNEL__ guard for mp_get_sock()

what we have done in v2:

   remove most of the RCU usage, since the ctor pointer is only
   changed by BIND/UNBIND ioctl, and during that time, NIC will be
   stopped to get good cleanup(all outstanding requests are finished),
   so the ctor pointer cannot be raced into wrong situation.

   Remove the struct vhost_notifier with struct kiocb.
   Let vhost-net backend to alloc/free the kiocb and transfer them
   via sendmsg/recvmsg.

   use get_user_pages_fast() and set_page_dirty_lock() when read.

   Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
   the async write logging is rewritten
   a drafted synchronous write function for qemu live migration
   a limit for locked pages from get_user_pages_fast() to prevent DoS
   by using RLIMIT_MEMLOCK


what we have done in v4:
   add iocb completion callback from vhost-net to queue iocb in mp device
   replace vq->receiver by mp_sock_data_ready()
   remove stuff in mp device which access structures from vhost-net
   modify skb_reserve() to ignore host NIC driver reserved space
   rebase to the latest vhost tree
   split large patches into small pieces, especially for net core part.


what we have done in v5:
   address Arnd Bergmann's comments
   -remove IFF_MPASSTHRU_EXCL flag in mp device
   -Add CONFIG_COMPAT macro
   -remove mp_release ops
   move dev_is_mpassthru() as inline func
   fix a bug in memory relinquish
   Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
   move create_iocb() out of page_dtor, which may happen in interrupt
 context
   -This removes the potential issue of a lock being taken in interrupt context
   make 

RE: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-29 Thread Xin, Xiaohui
-Original Message-
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Tuesday, September 28, 2010 5:24 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 The patch adds the mp (mediate passthru) device, which is now
 based on the vhost-net backend driver and provides proto_ops
 to send/receive guest buffer data from/to the guest virtio-net
 driver.

 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  drivers/vhost/mpassthru.c | 1407
+
  1 files changed, 1407 insertions(+), 0 deletions(-)
  create mode 100644 drivers/vhost/mpassthru.c

 diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
 new file mode 100644
 index 000..d86d94c
 --- /dev/null
 +++ b/drivers/vhost/mpassthru.c
[...]
 +/* #define MPASSTHRU_DEBUG 1 */
 +
 +#ifdef MPASSTHRU_DEBUG
 +static int debug;
 +
 +#define DBG  if (mp->debug) printk
 +#define DBG1 if (debug == 2) printk

This is unsafe; consider this usage:

   if (foo)
   DBG(bar\n);
   else
   baz();

You should use the standard pr_debug() or dev_dbg() instead.

Ok. Move to pr_debug().
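
For the record, the danger is that DBG expands to a bare `if', so an `else'
following a DBG() call silently binds to the hidden `if (mp->debug)'. A sketch
of a pr_debug()-based replacement (the mp_dbg() name and the use of
mp->dev->name are assumptions):

	#define mp_dbg(mp, fmt, ...) \
		pr_debug("mpassthru %s: " fmt, (mp)->dev->name, ##__VA_ARGS__)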

[...]
 +struct page_ctor {
 +   struct list_head    readq;
 +   int wq_len;
 +   int rq_len;
 +   spinlock_t  read_lock;

Only one queue?!

This readq is for the mp device to store the rx guest buffers coming from vhost.
For the tx side, we don't need that, since tx buffers are sent out to the device
immediately.

I would have appreciated some introductory comments on these structures.
I still don't have any sort of clear picture of how this is all supposed
to work.

Basically, struct page_info{} is used to store the info of each guest buffer
and some of the vhost ring descriptor info corresponding to those buffers.
struct page_ctor{} is used to manage multiple struct page_info{} instances.
Each device that can do zero-copy has one struct page_ctor{}.

Anyway, I will add comments on the fields of these structures.
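
For illustration, a commented sketch of the two structures (the field list is
taken from the snippets quoted in this thread; the array sizing and the
comments are interpretation, not the final code):

	struct page_info {
		struct list_head	list;	/* linked on page_ctor->readq */
		struct page	*pages[MAX_SKB_FRAGS];	/* pinned guest pages */
		int			pnum;	/* number of pinned pages */
		struct sk_buff		*skb;	/* host skb using these pages */
		/* ... vhost ring descriptor info to complete the request ... */
	};

	struct page_ctor {			/* one per zero-copy device */
		struct list_head	readq;	/* posted rx guest buffers */
		int			wq_len;	/* in-flight tx buffers */
		int			rq_len;	/* queued rx buffers */
		spinlock_t		read_lock;	/* protects readq/rq_len */
		struct net_device	*dev;	/* the bound NIC */
		struct mpassthru_port	port;	/* hooks handed to the NIC */
	};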

[...]
 +/* The main function to allocate external buffers */
 +static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
 +struct sk_buff *skb, int npages)
 +{
 +int i;
 +unsigned long flags;
 +struct page_ctor *ctor;
 +struct page_info *info = NULL;
 +
 +ctor = container_of(port, struct page_ctor, port);
 +
 +spin_lock_irqsave(&ctor->read_lock, flags);
 +if (!list_empty(&ctor->readq)) {
 +info = list_first_entry(&ctor->readq, struct page_info, list);
 +list_del(&info->list);
 +ctor->rq_len--;
 +}
 +spin_unlock_irqrestore(&ctor->read_lock, flags);
 +if (!info)
 +return NULL;
 +
 +for (i = 0; i < info->pnum; i++)
 +get_page(info->pages[i]);
 +info->skb = skb;
 +return info->ext_page;
 +}

Why isn't the npages parameter used?


Sorry, I somehow forgot that. Currently it only allocates one page at a time;
we want to use npages to check whether we exceed the limit.

[...]
 +static void relinquish_resource(struct page_ctor *ctor)
 +{
 +if (!(ctor->dev->flags & IFF_UP) &&
 +!(ctor->wq_len + ctor->rq_len))
 +printk(KERN_INFO "relinquish_resource\n");
 +}

Looks like something's missing here.


This function is for debugging, as is the rq_len stuff; I'll remove them later.

 +static void mp_ki_dtor(struct kiocb *iocb)
 +{
 +struct page_info *info = (struct page_info *)(iocb->private);
 +int i;
 +
 +if (info->flags == INFO_READ) {
 +for (i = 0; i < info->pnum; i++) {
 +if (info->pages[i]) {
 +set_page_dirty_lock(info->pages[i]);
 +put_page(info->pages[i]);
 +}
 +}
 +mp_hash_delete(info->ctor, info);
 +if (info->skb) {
 +info->skb->destructor = NULL;
 +kfree_skb(info->skb);
 +}
 +info->ctor->rq_len--;

Doesn't rq_len represent the number of buffers queued between the guest
and the driver?  It is already decremented in page_ctor() so it seems
like it gets decremented twice for each buffer.  Also don't you need to
hold the read_lock when updating rq_len?

 +} else
 +info->ctor->wq_len--;

Maybe you should define rq_len and wq_len both as atomic_t.

[...]
 +static void __mp_detach(struct mp_struct *mp)
 +{
 +mp->mfile = NULL;
 +
 +mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
 +page_ctor_detach(mp);
 +mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);

This is racy; you should hold rtnl_lock over all

RE: [PATCH v11 12/17] Add a kconfig entry and make entry for mp device.

2010-09-29 Thread Xin, Xiaohui
Thanks, I will reorder this in the next version.

Thanks
Xiaohui

-Original Message-
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Monday, September 27, 2010 9:57 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v11 12/17] Add a kconfig entry and make entry for mp 
device.

This patch is in the wrong position in the sequence.  It needs to be
applied after mpassthru.c is created, not before.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



RE: [PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-09-29 Thread Xin, Xiaohui
Ok, I will comment it above the structure.

Thanks
Xiaohui

-Original Message-
From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
Sent: Monday, September 27, 2010 9:43 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v11 03/17] Add a ndo_mp_port_prep pointer to 
net_device_ops.

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 If the driver wants to allocate external buffers,
 then it can export its capabilities, such as the skb
 buffer header length, the page length that can be DMA'd, etc.
 The owner of the external buffers may utilize this.
[...]

This information needs to be included in the comment above struct
net_device_ops, not just in the commit message.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

2010-09-27 Thread Xin, Xiaohui
-Original Message-
From: linux-kernel-ow...@vger.kernel.org 
[mailto:linux-kernel-ow...@vger.kernel.org] On
Behalf Of Xin, Xiaohui
Sent: Monday, September 27, 2010 8:45 AM
To: Michael S. Tsirkin
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Monday, September 27, 2010 1:02 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

On Sat, Sep 25, 2010 at 12:27:18PM +0800, xiaohui@intel.com wrote:
 We provide an zero-copy method which driver side may get external
 buffers to DMA. Here external means driver don't use kernel space
 to allocate skb buffers. Currently the external buffer can be from
 guest virtio-net driver.

 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it.
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.
 patch 01-10:net core and kernel changes.
 patch 11-13:new device as interface to mantpulate external buffers.
 patch 14:   for vhost-net.
 patch 15:   An example on modifying NIC driver to using napi_gro_frags().
 patch 16:   An example how to get guest buffers based on driver
 who using napi_gro_frags().
 patch 17:   It's a patch to address comments from Michael S. Thirkin
 to add 2 new ioctls in mp device.
 We split it out here to make easier reiewer.



I commented on how to avoid mm semaphore on data path separately, and
since you didn't have time to review that yet, I won't repeat that here.


I think I did avoid using the mm semaphore in the data path. I stored the value
in the mp structures and check against that.

Hi Michael,
Do you think I have addressed your comments about avoiding the mm semaphore on
the data path? I stored the value in the mp structures and don't use
mmap_sem.
Or do you still have some concerns about it?

At this point what are the plans on macvtap integration?
You indicated this is the interface you intend to use longterm.

I'm now working on that.

I'm now trying to export some functions from the mp device, and let macvtap use
them for rx zero-copy.

Thanks
Xiaohui


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-26 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Sunday, September 26, 2010 7:50 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Thu, Sep 23, 2010 at 08:56:33PM +0800, Xin, Xiaohui wrote:
 -Original Message-
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Wednesday, September 22, 2010 7:55 PM
 To: Xin, Xiaohui
 Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
 linux-ker...@vger.kernel.org;
 mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
 jd...@linux.intel.com
 Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
 
 On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
  -Original Message-
  From: Michael S. Tsirkin [mailto:m...@redhat.com]
  Sent: Tuesday, September 21, 2010 9:14 PM
  To: Xin, Xiaohui
  Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
  linux-ker...@vger.kernel.org;
  mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
  jd...@linux.intel.com
  Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  
  On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
   From: Michael S. Tsirkin [mailto:m...@redhat.com]
   Sent: Monday, September 20, 2010 7:37 PM
   To: Xin, Xiaohui
   Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
   linux-ker...@vger.kernel.org;
   mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
   jd...@linux.intel.com
   Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
   
   On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote:
From: Xin Xiaohui xiaohui@intel.com
   
---
Michael,
I have move the ioctl to configure the locked memory to vhost
   
   It's ok to move this to vhost but vhost does not
   know how much memory is needed by the backend.
  
   I think the backend here you mean is mp device.
    Actually, the memory needed is related to vq->num to run zero-copy
   smoothly.
   That means mp device did not know it but vhost did.
  
  Well, this might be so if you insist on locking
  all posted buffers immediately. However, let's assume I have a
  very large ring and prepost a ton of RX buffers:
  there's no need to lock all of them directly:
  
  if we have buffers A and B, we can lock A, pass it
  to hardware, and when A is consumed unlock A, lock B
  and pass it to hardware.
  
  
  It's not really critical. But note we can always have userspace
  tell MP device all it wants to know, after all.
  
  Ok. Here are two values we have mentioned, one is how much memory
  user application wants to lock, and one is how much memory locked
  is needed to run smoothly. When net backend is setup, we first need
  an ioctl to get how much memory is needed to lock, and then we call
  another ioctl to set how much it want to lock. Is that what's in your 
  mind?
 
 That's fine.
 
   And the rlimt stuff is per process, we use current pointer to set
   and check the rlimit, the operations should be in the same process.
  
  Well no, the ring is handled from the kernel thread: we switch the mm to
  point to the owner task so copy from/to user and friends work, but you
  can't access the rlimit etc.
  
  Yes, the userspace and vhost kernel is not the same process. But we can
  record the task pointer as mm.
 
  So you will have to store mm and do device->mm, not current->mm.
 Anyway, better not touch mm on data path.
 
   Now the check operations are in vhost process, as mp_recvmsg() or
   mp_sendmsg() are called by vhost.
  
  Hmm, what do you mean by the check operations?
  send/recv are data path operations, they shouldn't
  do any checks, should they?
  
  As you mentioned what infiniband driver done:
   down_write(&current->mm->mmap_sem);
  
   locked = npages + current->mm->locked_vm;
   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
  
   if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
   ret = -ENOMEM;
   goto out;
   }
  
   cur_base = addr & PAGE_MASK;
  
   ret = 0;
   while (npages) {
   ret = get_user_pages(current, current->mm, cur_base,
    min_t(unsigned long, npages,
          PAGE_SIZE / sizeof (struct page *)),
    1, !umem->writable, page_list, vma_list);
 
  I think it's a data path too.
 
 in infiniband this is used to 'register memory' which is not data path.
 
  We do the check because get_user_pages() really pin and locked
  the memory.
 
 Don't do this. Performance will be bad.
 Do the check once in ioctl and increment locked_vm by max amount you will 
 use.
 On data path just make sure you do not exceed what userspace told you
 to.

 What's in my mind is that in the ioctl which

RE: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

2010-09-26 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Monday, September 27, 2010 1:02 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

On Sat, Sep 25, 2010 at 12:27:18PM +0800, xiaohui@intel.com wrote:
 We provide an zero-copy method which driver side may get external
 buffers to DMA. Here external means driver don't use kernel space
 to allocate skb buffers. Currently the external buffer can be from
 guest virtio-net driver.

 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it.
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.
 patch 01-10: net core and kernel changes.
 patch 11-13: new device as interface to mantpulate external buffers.
 patch 14:for vhost-net.
 patch 15:An example on modifying NIC driver to using napi_gro_frags().
 patch 16:An example how to get guest buffers based on driver
  who using napi_gro_frags().
 patch 17:It's a patch to address comments from Michael S. Thirkin
  to add 2 new ioctls in mp device.
  We split it out here to make easier reiewer.



I commented on how to avoid mm semaphore on data path separately, and
since you didn't have time to review that yet, I won't repeat that here.


I think I did avoid using the mm semaphore in the data path. I stored the value
in the mp structures and check against that.

At this point what are the plans on macvtap integration?
You indicated this is the interface you intend to use longterm.

I'm now working on that.


Thanks
Xiaohui


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-23 Thread Xin, Xiaohui
-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 22, 2010 7:55 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
 -Original Message-
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Tuesday, September 21, 2010 9:14 PM
 To: Xin, Xiaohui
 Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
 linux-ker...@vger.kernel.org;
 mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
 jd...@linux.intel.com
 Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
 
 On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
  From: Michael S. Tsirkin [mailto:m...@redhat.com]
  Sent: Monday, September 20, 2010 7:37 PM
  To: Xin, Xiaohui
  Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
  linux-ker...@vger.kernel.org;
  mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
  jd...@linux.intel.com
  Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  
  On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote:
   From: Xin Xiaohui xiaohui@intel.com
  
   ---
   Michael,
   I have move the ioctl to configure the locked memory to vhost
  
  It's ok to move this to vhost but vhost does not
  know how much memory is needed by the backend.
 
  I think the backend here you mean is mp device.
   Actually, the memory needed is related to vq->num to run zero-copy
  smoothly.
  That means mp device did not know it but vhost did.
 
 Well, this might be so if you insist on locking
 all posted buffers immediately. However, let's assume I have a
 very large ring and prepost a ton of RX buffers:
 there's no need to lock all of them directly:
 
 if we have buffers A and B, we can lock A, pass it
 to hardware, and when A is consumed unlock A, lock B
 and pass it to hardware.
 
 
 It's not really critical. But note we can always have userspace
 tell MP device all it wants to know, after all.
 
 Ok. Here are two values we have mentioned, one is how much memory
 user application wants to lock, and one is how much memory locked
 is needed to run smoothly. When net backend is setup, we first need
 an ioctl to get how much memory is needed to lock, and then we call
 another ioctl to set how much it want to lock. Is that what's in your mind?

That's fine.

  And the rlimt stuff is per process, we use current pointer to set
  and check the rlimit, the operations should be in the same process.
 
 Well no, the ring is handled from the kernel thread: we switch the mm to
 point to the owner task so copy from/to user and friends work, but you
 can't access the rlimit etc.
 
 Yes, the userspace and vhost kernel is not the same process. But we can
 record the task pointer as mm.

 So you will have to store mm and do device->mm, not current->mm.
Anyway, better not touch mm on data path.

  Now the check operations are in vhost process, as mp_recvmsg() or
  mp_sendmsg() are called by vhost.
 
 Hmm, what do you mean by the check operations?
 send/recv are data path operations, they shouldn't
 do any checks, should they?
 
 As you mentioned what infiniband driver done:
  down_write(&current->mm->mmap_sem);
 
  locked = npages + current->mm->locked_vm;
  lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
  if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
  ret = -ENOMEM;
  goto out;
  }
 
  cur_base = addr & PAGE_MASK;
 
  ret = 0;
  while (npages) {
  ret = get_user_pages(current, current->mm, cur_base,
   min_t(unsigned long, npages,
         PAGE_SIZE / sizeof (struct page *)),
   1, !umem->writable, page_list, vma_list);

 I think it's a data path too.

in infiniband this is used to 'register memory' which is not data path.

 We do the check because get_user_pages() really pin and locked
 the memory.

Don't do this. Performance will be bad.
Do the check once in ioctl and increment locked_vm by max amount you will use.
On data path just make sure you do not exceed what userspace told you
to.

What's in my mind is that the ioctl which gets the amount of locked memory
needed to run smoothly just returns how much memory the mp device needs.
Then the ioctl which sets the memory locked by user space checks the rlimit
and increments locked_vm by the amount userspace wants. But I'm not sure how
to make sure we do not exceed what userspace told us. If we don't check
locked_vm, what do we use to check? And is that another kind of check on the
data path?
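
To make that concrete, a sketch of the split Michael describes (the mp->mm,
mp->locked_max and mp->locked_now members and the ioctl name are illustrative,
not the final interface):

	/* ioctl(MP_SET_LOCKED_MEM): check the rlimit and charge the whole
	 * budget once, outside the data path */
	static int mp_set_locked_mem(struct mp_struct *mp, unsigned long npages)
	{
		struct mm_struct *mm = mp->mm;	/* owner mm recorded at setup */
		unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		int ret = 0;

		down_write(&mm->mmap_sem);
		if (mm->locked_vm + npages > lock_limit && !capable(CAP_IPC_LOCK))
			ret = -ENOMEM;
		else {
			mm->locked_vm += npages;
			mp->locked_max = npages;	/* data-path budget */
		}
		up_write(&mm->mmap_sem);
		return ret;
	}

	/* data path: no rlimit, no mmap_sem -- just stay inside the budget */
	if (atomic_read(&mp->locked_now) + pnum > mp->locked_max)
		return -ENOBUFS;
	atomic_add(pnum, &mp->locked_now);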



  So set operations should be in
  vhost process too, it's natural.
 
  So I think we'll need another ioctl

RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-22 Thread Xin, Xiaohui
-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Tuesday, September 21, 2010 9:14 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Monday, September 20, 2010 7:37 PM
 To: Xin, Xiaohui
 Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
 linux-ker...@vger.kernel.org;
 mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
 jd...@linux.intel.com
 Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
 
 On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote:
  From: Xin Xiaohui xiaohui@intel.com
 
  ---
  Michael,
  I have move the ioctl to configure the locked memory to vhost
 
 It's ok to move this to vhost but vhost does not
 know how much memory is needed by the backend.

 I think the backend here you mean is mp device.
  Actually, the memory needed is related to vq->num to run zero-copy
 smoothly.
 That means mp device did not know it but vhost did.

Well, this might be so if you insist on locking
all posted buffers immediately. However, let's assume I have a
very large ring and prepost a ton of RX buffers:
there's no need to lock all of them directly:

if we have buffers A and B, we can lock A, pass it
to hardware, and when A is consumed unlock A, lock B
and pass it to hardware.


It's not really critical. But note we can always have userspace
tell MP device all it wants to know, after all.

Ok. Here are the two values we have mentioned: one is how much memory the
user application wants to lock, and one is how much locked memory
is needed to run smoothly. When the net backend is set up, we first need
an ioctl to get how much memory needs to be locked, and then we call
another ioctl to set how much it wants to lock. Is that what's in your mind?

 And the rlimt stuff is per process, we use current pointer to set
 and check the rlimit, the operations should be in the same process.

Well no, the ring is handled from the kernel thread: we switch the mm to
point to the owner task so copy from/to user and friends work, but you
can't access the rlimit etc.

Yes, the userspace and vhost kernel is not the same process. But we can
record the task pointer as mm.

 Now the check operations are in vhost process, as mp_recvmsg() or
 mp_sendmsg() are called by vhost.

Hmm, what do you mean by the check operations?
send/recv are data path operations, they shouldn't
do any checks, should they?

As you mentioned what infiniband driver done:
down_write(&current->mm->mmap_sem);

locked = npages + current->mm->locked_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
goto out;
}

cur_base = addr & PAGE_MASK;

ret = 0;
while (npages) {
ret = get_user_pages(current, current->mm, cur_base,
 min_t(unsigned long, npages,
       PAGE_SIZE / sizeof (struct page *)),
 1, !umem->writable, page_list, vma_list);

I think it's a data path too. We do the check because get_user_pages() really
pins and locks the memory.

 So set operations should be in
 vhost process too, it's natural.

 So I think we'll need another ioctl in the backend
 to tell userspace how much memory is needed?
 
 Except vhost tells it to mp device, mp did not know
 how much memory is needed to run zero-copy smoothly.
 Is userspace interested about the memory mp is needed?

Couldn't parse this last question.
I think userspace generally does want control over
how much memory we'll lock. We should not just lock
as much as we can.

--
MST


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-20 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Monday, September 20, 2010 7:37 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 ---
 Michael,
 I have move the ioctl to configure the locked memory to vhost

It's ok to move this to vhost but vhost does not
know how much memory is needed by the backend.

I think the backend here you mean is mp device.
Actually, the memory needed is related to vq-num to run zero-copy
smoothly. That means mp device did not know it but vhost did.
And the rlimt stuff is per process, we use current pointer to set
and check the rlimit, the operations should be in the same process.
Now the check operations are in vhost process, as mp_recvmsg() or
mp_sendmsg() are called by vhost. So set operations should be in
vhost process too, it's natural.

So I think we'll need another ioctl in the backend
to tell userspace how much memory is needed?

Except vhost tells it to mp device, mp did not know
how much memory is needed to run zero-copy smoothly.
Is userspace interested about the memory mp is needed?

It seems a bit cleaner as a backend ioctl as vhost
does not lock memory itself, but I am not
principally opposed.

 and
 check the limit with mm->locked_vm. Please have a look.

 Thanks
 Xiaohui

  drivers/vhost/mpassthru.c |   74 +--
  drivers/vhost/net.c   |   78
++--
  include/linux/vhost.h |3 ++
  3 files changed, 85 insertions(+), 70 deletions(-)

 diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
 index d86d94c..fd3827b 100644
 --- a/drivers/vhost/mpassthru.c
 +++ b/drivers/vhost/mpassthru.c
 @@ -109,9 +109,6 @@ struct page_ctor {
  int wq_len;
  int rq_len;
  spinlock_t  read_lock;
 -/* record the locked pages */
 -int lock_pages;
 -struct rlimit   o_rlim;
  struct net_device   *dev;
  struct mpassthru_port   port;
  struct page_info    **hash_table;
 @@ -231,7 +228,6 @@ static int page_ctor_attach(struct mp_struct *mp)
  ctor->port.ctor = page_ctor;
  ctor->port.sock = &mp->socket;
  ctor->port.hash = mp_lookup;
 -ctor->lock_pages = 0;

  /* locked by mp_mutex */
  dev->mp_port = &ctor->port;
 @@ -264,37 +260,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
  return info;
  }

 -static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
 -  unsigned long cur, unsigned long max)
 -{
 -struct rlimit new_rlim, *old_rlim;
 -int retval;
 -
 -if (resource != RLIMIT_MEMLOCK)
 -return -EINVAL;
 -new_rlim.rlim_cur = cur;
 -new_rlim.rlim_max = max;
 -
 -old_rlim = current->signal->rlim + resource;
 -
 -/* remember the old rlimit value when backend enabled */
 -ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
 -ctor->o_rlim.rlim_max = old_rlim->rlim_max;
 -
 -if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
 -!capable(CAP_SYS_RESOURCE))
 -return -EPERM;
 -
 -retval = security_task_setrlimit(resource, new_rlim);
 -if (retval)
 -return retval;
 -
 -task_lock(current->group_leader);
 -*old_rlim = new_rlim;
 -task_unlock(current->group_leader);
 -return 0;
 -}
 -
  static void relinquish_resource(struct page_ctor *ctor)
  {
  if (!(ctor->dev->flags & IFF_UP) &&
 @@ -322,8 +287,6 @@ static void mp_ki_dtor(struct kiocb *iocb)
  info->ctor->rq_len--;
  } else
  info->ctor->wq_len--;
 -/* Decrement the number of locked pages */
 -info->ctor->lock_pages -= info->pnum;
  kmem_cache_free(ext_page_info_cache, info);
  relinquish_resource(info->ctor);

 @@ -349,7 +312,7 @@ static struct kiocb *create_iocb(struct page_info *info, 
 int size)
  iocb->ki_dtor(iocb);
  iocb->private = (void *)info;
  iocb->ki_dtor = mp_ki_dtor;
 -
 +iocb->ki_user_data = info->pnum;
  return iocb;
  }

 @@ -375,10 +338,6 @@ static int page_ctor_detach(struct mp_struct *mp)

  relinquish_resource(ctor);

 -set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
 -   ctor->o_rlim.rlim_cur,
 -   ctor->o_rlim.rlim_max);
 -
  /* locked by mp_mutex */
  ctor->dev->mp_port = NULL;
  dev_put(ctor->dev);
 @@ -565,21 +524,23 @@ static struct page_info *alloc_page_info(struct 
 page_ctor *ctor,
  int rc;
  int i, j, n = 0;
  int len;
 -unsigned long base, lock_limit;
 +unsigned long base, lock_limit, locked;
  struct page_info *info = NULL;

 -lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur

RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-16 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 15, 2010 5:59 PM
To: Xin, Xiaohui
Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Wednesday, September 15, 2010 12:30 AM
 To: Shirley Ma
 Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
 net...@vger.kernel.org;
 kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
 kernel
 
 On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
  On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
   I would expect this to hurt performance significantly.
   We could do this for asynchronous requests only to avoid the
   slowdown.
 
  Is kiocb in sendmsg helpful here? It is not used now.
 
  Shirley
 
 Precisely. This is what the patch from Xin Xiaohui does.  That code
 already seems to do most of what you are trying to do, right?
 
 The main thing missing seems to be macvtap integration, so that we can fall 
 back
 on data copy if zero copy is unavailable?
 How hard would it be to basically link the mp and macvtap modules
 together to get us this functionality? Anyone?
 
 Michael,
 Is to support macvtap with zero-copy through mp device the functionality
 you mentioned above?

I have trouble parsing the above question.  At some point Arnd suggested
that the mp device functionality would fit nicely as part of the macvtap
driver.  It seems to make sense superficially, the advantage if it
worked would be that we would get zero copy (mostly) transparently.

Do you agree with this goal?


I would say yes.

Earlier, Shirley Ma suggested moving the zero-copy functionality into the
tun/tap device or the macvtap device. What do you think about that?
I suspect
there will be a lot of duplicate code across those three drivers unless we can
extract the zero-copy code into kernel APIs and vhost APIs.


tap would be very hard at this point as it does not bind to a device.
macvtap might work, we mainly need to figure out a way to detect that
device can do zero copy so the right mode is used.  I think a first step
could be to simply link mp code into macvtap module, pass necessary
ioctls on, then move some code around as necessary.  This might get rid
of code duplication nicely.

I'll look into this to see how much effort it would take.



 Do you think that's worth doing, and would it help the current process, which has
 been blocked longer than I expected?

I think it's nice to have.

And if done hopefully this will get the folk working on the macvtap
driver to review the code, which will help find all issues faster.

I think if you post some performance numbers,
this will also help get people excited and looking at the code.


The performance data I posted before was compared against the raw socket backend
on vhost-net. But the raw socket backend has since been removed on the qemu side,
so I can only compare against tap on vhost-net. Unfortunately, I am missing
something and can't even bring that setup up; I was blocked by this for a while.

I also don't see the process as completely blocked, each review round points
out more issues: we aren't going back and forth changing
same lines again and again, are we?

One thing that might help is increase the frequency of updates,
try sending them out sooner.
On the other hand 10 new patches each revision is a lot:
if there is a part of patchset that has stabilised you can split it out,
post once and keep posting the changing part separately.

I hope these suggestions help.

Thanks, Michael!


 
 --
 MST


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-16 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 15, 2010 7:28 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Wed, Sep 15, 2010 at 11:13:44AM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Sunday, September 12, 2010 9:37 PM
 To: Xin, Xiaohui
 Cc: net...@vger.kernel.org; kvm@vger.kernel.org; 
 linux-ker...@vger.kernel.org;
 mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
 jd...@linux.intel.com
 Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
 
 On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
  Playing with rlimit on data path, transparently to the application in 
  this way
  looks strange to me, I suspect this has unexpected security 
  implications.
  Further, applications may have other uses for locked memory
  besides mpassthru - you should not just take it because it's there.
  
  Can we have an ioctl that lets userspace configure how much
  memory to lock? This ioctl will decrement the rlimit and store
  the data in the device structure so we can do accounting
  internally. Put it back on close or on another ioctl.
  Yes, we can decrement the rlimit in ioctl in one time to avoid
  data path.
  
  Need to be careful for when this operation gets called
  again with 0 or another small value while we have locked memory -
  maybe just fail with EBUSY?  or wait until it gets unlocked?
  Maybe 0 can be special-cased and deactivate zero-copy?.
  
 
  How about we don't use a new ioctl, but just check the rlimit
  in one MPASSTHRU_BINDDEV ioctl? If we find mp device
  break the rlimit, then we fail the bind ioctl, and thus can't do
  zero copy any more.
 
 Yes, and not just check, but decrement as well.
 I think we should give userspace control over
 how much memory we can lock and subtract from the rlimit.
 It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
 Then increment the rlimit back on unbind and on close?
 
 This opens up an interesting condition: process 1
 calls bind, process 2 calls unbind or close.
 This will increment rlimit for process 2.
 Not sure how to fix this properly.
 
 I can't too, can we do any synchronous operations on rlimit stuff?
 I quite suspect in it.

 --
 MST

Here's what infiniband does: simply pass the amount of memory userspace
wants you to lock on an ioctl, and verify that either you have
CAP_IPC_LOCK or this number does not exceed the current rlimit.  (must
be on ioctl, not on open, as we likely want the fd passed around between
processes), but do not decrement rlimit.  Use this on following
operations.  Be careful if this can be changed while operations are in
progress.

This does mean that the effective amount of memory that userspace can
lock is doubled, but at least it is not unlimited, and we sidestep all
other issues such as userspace running out of lockable memory simply by
virtue of using the driver.


What I have done in the mp device is almost the same as that. The difference is that
I do not check the capability, and I use my own counter ctor->pages instead
of mm->locked_vm.

So the plan now is to: 1) add the capability check, 2) use mm->locked_vm, and 3) add
an ioctl for userspace to configure how much memory it can lock.
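
For illustration, a minimal sketch of what such an ioctl-time check could look
like, assuming a hypothetical mp_set_locked_pages() helper and a locked_pages
field in the device structure (the names are illustrative, not the final patch):

/* Hypothetical sketch: validate a userspace-requested lock budget at ioctl
 * time, infiniband-style, so the data path never touches the rlimit itself.
 */
static int mp_set_locked_pages(struct mp_struct *mp, unsigned long npages)
{
	unsigned long lock_limit;

	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;

	/* allow it if the caller may lock arbitrary memory, or if the
	 * request fits inside the current RLIMIT_MEMLOCK */
	if (!capable(CAP_IPC_LOCK) && npages > lock_limit)
		return -ENOMEM;

	/* remember the budget; the data path accounts against this number
	 * only and never modifies the rlimit */
	mp->locked_pages = npages;
	return 0;
}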
 
--
MST


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Shirley Ma [mailto:mashi...@us.ibm.com]
Sent: Tuesday, September 14, 2010 11:05 PM
To: Avi Kivity
Cc: David Miller; a...@arndb.de; m...@redhat.com; Xin, Xiaohui; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
  +base = (unsigned long)from->iov_base + offset1;
  +size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
  +num_pages = get_user_pages_fast(base, size, 0, &page[i]);
  +if ((num_pages != size) ||
  +(num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
  +/* put_page is in skb free */
  +return -EFAULT;
  What keeps the user from writing to these pages in it's address
 space
  after the write call returns?
 
  A write() return of success means:
 
I wrote what you gave to me
 
  not
 
I wrote what you gave to me, oh and BTW don't touch these
pages for a while.
 
  In fact a while isn't even defined in any way, as there is no way
  for the write() invoker to know when the networking card is done
 with
  those pages.

 That's what io_submit() is for.  Then io_getevents() tells you what
 a
 while actually was.

This macvtap zero copy uses iov buffers from vhost ring, which is
allocated from guest kernel. In host kernel, vhost calls macvtap
sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
pages for zero copy.

The patch is relying on how vhost handle these buffers. I need to look
at vhost code (qemu) first for addressing the questions here.

Thanks
Shirley

I think what David said is what we had already considered for the mp device.
We cannot know the exact time the tx buffer is written by the DMA operation,
but the deadline is when the tx buffer is freed. So we only notify the vhost
side about the write when the tx buffer is freed. That deadline may be too late
for performance, though.

Thanks
Xiaohui 



RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Arnd Bergmann [mailto:a...@arndb.de]
Sent: Tuesday, September 14, 2010 11:21 PM
To: Shirley Ma
Cc: Avi Kivity; David Miller; m...@redhat.com; Xin, Xiaohui; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tuesday 14 September 2010, Shirley Ma wrote:
 On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:

  That's what io_submit() is for.  Then io_getevents() tells you what
  a
  while actually was.

 This macvtap zero copy uses iov buffers from vhost ring, which is
 allocated from guest kernel. In host kernel, vhost calls macvtap
 sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
 pages for zero copy.

 The patch is relying on how vhost handle these buffers. I need to look
 at vhost code (qemu) first for addressing the questions here.

I guess the best solution would be to make macvtap_aio_write return
-EIOCBQUEUED when a packet gets passed down to the adapter, and
call aio_complete when the adapter is done with it.

This would change the regular behavior of macvtap into a model where
every write on the file blocks until the packet has left the machine,
which gives us better flow control, but does slow down the traffic
when we only put one packet at a time into the queue.

It also allows the user to call io_submit instead of write in order
to do an asynchronous submission as Avi was suggesting.


But currently this patch communicates with vhost-net, which lives almost entirely
on the kernel side. If it used the aio interfaces, it would have to communicate
with a user-space backend instead.
 
   Arnd


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 15, 2010 12:30 AM
To: Shirley Ma
Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
  I would expect this to hurt performance significantly.
  We could do this for asynchronous requests only to avoid the
  slowdown.

 Is kiocb in sendmsg helpful here? It is not used now.

 Shirley

Precisely. This is what the patch from Xin Xiaohui does.  That code
already seems to do most of what you are trying to do, right?

The main thing missing seems to be macvtap integration, so that we can fall 
back
on data copy if zero copy is unavailable?
How hard would it be to basically link the mp and macvtap modules
together to get us this functionality? Anyone?

Michael,
Is supporting macvtap with zero-copy through the mp device the functionality
you mentioned above?
Earlier, Shirley Ma suggested moving the zero-copy functionality into the
tun/tap device or the macvtap device. What do you think about that? I suspect
there would be a lot of duplicated code across those three drivers unless we
extract the zero-copy code into kernel APIs and vhost APIs.
Do you think that's worth doing, and would it help the current process, which has
been blocked longer than I expected?


--
MST


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Shirley Ma [mailto:mashi...@us.ibm.com]
Sent: Wednesday, September 15, 2010 10:41 AM
To: Xin, Xiaohui
Cc: Avi Kivity; David Miller; a...@arndb.de; m...@redhat.com; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
 I think what David said is what we had already considered for the mp device.
 We cannot know the exact time the tx buffer is written by the DMA operation,
 but the deadline is when the tx buffer is freed. So we only notify the vhost
 side about the write when the tx buffer is freed. That deadline may be too
 late for performance, though.

Have you tried it? If so what's the performance penalty you have seen by
notifying vhost when tx buffer freed?


We did not try it before, as we cared more about the RX side.

I am thinking of having a callback in the skb destructor:
vhost_add_used_and_signal gets called when the skb is actually freed, and the
vhost vq and head need to be passed to the callback. This might require the
vhost ring size to be at least as big as the lower device driver's ring.


That's almost the same as what we have done, except that we use destructor_arg and
another callback.
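
As an illustration only, a rough sketch of how such a completion callback could
hang off skb_shinfo(skb)->destructor_arg; the structure and function names below
are made up for the example and do not match either patchset exactly:

/* Rough sketch: carry enough context in destructor_arg so the skb
 * destructor can signal vhost once the tx buffer is really freed.
 * tx_done_ctx/tx_done_cb are illustrative names.
 */
struct tx_done_ctx {
	void (*dtor)(struct tx_done_ctx *ctx);	/* invoked from skb destructor */
	struct vhost_virtqueue *vq;		/* ring to signal */
	unsigned int head;			/* descriptor head to mark used */
};

static void tx_done_cb(struct tx_done_ctx *ctx)
{
	/* tell the guest the buffer may be reused, then drop the context */
	vhost_add_used_and_signal(ctx->vq->dev, ctx->vq, ctx->head, 0);
	kfree(ctx);
}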

Thanks
Shirley



RE: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.

2010-09-14 Thread Xin, Xiaohui
Herbert,
Any comments on the modifications of the net core and driver side of this patch?

Thanks
Xiaohui

-Original Message-
From: linux-kernel-ow...@vger.kernel.org 
[mailto:linux-kernel-ow...@vger.kernel.org] On
Behalf Of xiaohui@intel.com
Sent: Saturday, September 11, 2010 5:53 PM
To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external
buffers to DMA into. Here external means the driver doesn't use kernel space
to allocate skb buffers. Currently the external buffers come from the
guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then
let the host NIC driver have the chance to DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-10:   net core and kernel changes.
patch 11-13:   new device as interface to manipulate external buffers.
patch 14:  for vhost-net.
patch 15:  An example of modifying a NIC driver to use napi_gro_frags().
patch 16:  An example of how to get guest buffers based on a driver
   that uses napi_gro_frags().

The guest virtio-net driver submits multiple requests thru the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; this means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
   Performance tuning

what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchronous mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it looks generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
   a small fix for missing dev_put when fail
   using dynamic minor instead of static minor number
   a __KERNEL__ protect to mp_get_sock()

what we have done in v2:

   remove most of the RCU usage, since the ctor pointer is only
   changed by BIND/UNBIND ioctl, and during that time, NIC will be
   stopped to get good cleanup(all outstanding requests are finished),
   so the ctor pointer cannot be raced into wrong situation.

   Remove the struct vhost_notifier with struct kiocb.
   Let vhost-net backend to alloc/free the kiocb and transfer them
   via sendmsg/recvmsg.

   use get_user_pages_fast() and set_page_dirty_lock() when read.

   Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
   the async write logging is rewritten
   a drafted synchronous write function for qemu live migration
   a limit for locked pages from get_user_pages_fast() to prevent Dos
   by using RLIMIT_MEMLOCK


what we have done in v4:
   add iocb completion callback from vhost-net to queue iocb in mp device
   replace vq-receiver by mp_sock_data_ready()
   remove stuff in mp device which access structures from vhost-net
   modify skb_reserve() to ignore host NIC driver reserved space
   rebase to the latest vhost tree
   split large patches into small pieces, especially for net core part.


what we have done in v5:
   address Arnd Bergmann's comments
   -remove IFF_MPASSTHRU_EXCL flag in mp device
   -Add CONFIG_COMPAT macro
   -remove mp_release ops
   move dev_is_mpassthru() as inline func
   fix a bug in memory relinquish
   Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
   move create_iocb() out of page_dtor, which may happen in interrupt context
   -This removes the potential issue of locks being taken in interrupt context
   make the caches used by mp and vhost static, created/destroyed during the
   modules' init/exit functions.
   -This allows multiple mp guests to be created at the same time.

what we have done 

RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-14 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Sunday, September 12, 2010 9:37 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
 Playing with rlimit on data path, transparently to the application in this 
 way
 looks strange to me, I suspect this has unexpected security implications.
 Further, applications may have other uses for locked memory
 besides mpassthru - you should not just take it because it's there.
 
 Can we have an ioctl that lets userspace configure how much
 memory to lock? This ioctl will decrement the rlimit and store
 the data in the device structure so we can do accounting
 internally. Put it back on close or on another ioctl.
 Yes, we can decrement the rlimit in ioctl in one time to avoid
 data path.
 
 Need to be careful for when this operation gets called
 again with 0 or another small value while we have locked memory -
 maybe just fail with EBUSY?  or wait until it gets unlocked?
 Maybe 0 can be special-cased and deactivate zero-copy?.
 

 How about we don't use a new ioctl, but just check the rlimit
 in one MPASSTHRU_BINDDEV ioctl? If we find mp device
 break the rlimit, then we fail the bind ioctl, and thus can't do
 zero copy any more.

Yes, and not just check, but decrement as well.
I think we should give userspace control over
how much memory we can lock and subtract from the rlimit.
It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
Then increment the rlimit back on unbind and on close?

This opens up an interesting condition: process 1
calls bind, process 2 calls unbind or close.
This will increment rlimit for process 2.
Not sure how to fix this properly.

I can't too, can we do any synchronous operations on rlimit stuff?
I quite suspect in it.
 
--
MST


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-11 Thread Xin, Xiaohui
Playing with rlimit on data path, transparently to the application in this way
looks strange to me, I suspect this has unexpected security implications.
Further, applications may have other uses for locked memory
besides mpassthru - you should not just take it because it's there.

Can we have an ioctl that lets userspace configure how much
memory to lock? This ioctl will decrement the rlimit and store
the data in the device structure so we can do accounting
internally. Put it back on close or on another ioctl.
Yes, we can decrement the rlimit in ioctl in one time to avoid
data path.

Need to be careful for when this operation gets called
again with 0 or another small value while we have locked memory -
maybe just fail with EBUSY?  or wait until it gets unlocked?
Maybe 0 can be special-cased and deactivate zero-copy?.


How about we don't use a new ioctl, but just check the rlimit 
in one MPASSTHRU_BINDDEV ioctl? If we find mp device
break the rlimit, then we fail the bind ioctl, and thus can't do 
zero copy any more.

In fact, if we choose RLIMIT_MEMLOCK to limit the locked memory,
the default value is only 16 pages. That is too small for the device to
work, so we always have to configure it with a large value.
I think that if the rlimit value after the decrement would be < 0, then deactivating
zero-copy is better. Special-casing 0 may be OK.


 +
 +   if (ctor->lock_pages + count > lock_limit && npages) {
 +   printk(KERN_INFO "exceed the locked memory rlimit.");
 +   return NULL;
 +   }
 +
 +   info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, Let me check this.


 +
 +   if (!info)
 +   return NULL;
 +
 +   for (i = j = 0; i < count; i++) {
 +   base = (unsigned long)iov[i].iov_base;
 +   len = iov[i].iov_len;
 +
 +   if (!len)
 +   continue;
 +   n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
 +
 +   rc = get_user_pages_fast(base, n, npages ? 1 : 0,

npages controls whether this is a write? Why?

We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() 
with npages =
0.


 +   &info->pages[j]);
 +   if (rc != n)
 +   goto failed;
 +
 +   while (n--) {
 +   frags[j].offset = base & ~PAGE_MASK;
 +   frags[j].size = min_t(int, len,
 +   PAGE_SIZE - frags[j].offset);
 +   len -= frags[j].size;
 +   base += frags[j].size;
 +   j++;
 +   }
 +   }
 +
 +#ifdef CONFIG_HIGHMEM
 +   if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
 +   for (i = 0; i < j; i++) {
 +   if (PageHighMem(info->pages[i]))
 +   goto failed;
 +   }
 +   }
 +#endif

Are non-highdma devices worth bothering with? If yes -
are there other limitations devices might have that we need to handle?
E.g. what about non-s/g devices or no checksum offloading?.

Basically I think there is no limitations for both, but let me have a check.


 +   skb_push(skb, ETH_HLEN);
 +
 +   if (skb_is_gso(skb)) {
 +   hdr.hdr.hdr_len = skb_headlen(skb);
 +   hdr.hdr.gso_size = shinfo->gso_size;
 +   if (shinfo->gso_type & SKB_GSO_TCPV4)
 +   hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 +   else if (shinfo->gso_type & SKB_GSO_TCPV6)
 +   hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
 +   else if (shinfo->gso_type & SKB_GSO_UDP)
 +   hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
 +   else
 +   BUG();
 +   if (shinfo->gso_type & SKB_GSO_TCP_ECN)
 +   hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 +
 +   } else
 +   hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
 +
 +   if (skb->ip_summed == CHECKSUM_PARTIAL) {
 +   hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
 +   hdr.hdr.csum_start =
 +   skb->csum_start - skb_headroom(skb);
 +   hdr.hdr.csum_offset = skb->csum_offset;
 +   }

We have this code in tun, macvtap and packet socket already.
Could this be a good time to move these into networking core?
I'm not asking you to do this right now, but could this generic
virtio-net to skb stuff be encapsulated in functions?

It seems reasonable.
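
For reference, a hedged sketch of what such a shared helper might look like if the
duplicated code were pulled out of tun/macvtap/af_packet; the function name is
invented here and this was not an existing core API at the time:

/* Illustrative only: a common "fill virtio_net_hdr from skb" helper. */
static int skb_to_virtio_net_hdr(struct sk_buff *skb, struct virtio_net_hdr *hdr)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);

	if (skb_is_gso(skb)) {
		hdr->hdr_len = skb_headlen(skb);
		hdr->gso_size = shinfo->gso_size;
		if (shinfo->gso_type & SKB_GSO_TCPV4)
			hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
		else if (shinfo->gso_type & SKB_GSO_TCPV6)
			hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
		else if (shinfo->gso_type & SKB_GSO_UDP)
			hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP;
		else
			return -EINVAL;
		if (shinfo->gso_type & SKB_GSO_TCP_ECN)
			hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
	} else {
		hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
	}

	if (skb->ip_summed == CHECKSUM_PARTIAL) {
		hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
		hdr->csum_start = skb->csum_start - skb_headroom(skb);
		hdr->csum_offset = skb->csum_offset;
	}
	return 0;
}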


--
MST


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-11 Thread Xin, Xiaohui
 +
 +   if (ctor->lock_pages + count > lock_limit && npages) {
 +   printk(KERN_INFO "exceed the locked memory rlimit.");
 +   return NULL;
 +   }
 +
 +   info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, Let me check this.

It's mainly for info->next and info->prev; these two fields will be used in
hash functions.
But you are right, since most fields will be refilled anyway. The new version
includes the fix.

Thanks
Xiaohui


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-10 Thread Xin, Xiaohui
Michael,
Sorry for the late reply.

So - does this driver help reduce service demand signifiantly?

I'm looking at the performance now.

Some comments from looking at the code:

On Fri, Aug 06, 2010 at 05:23:41PM +0800, xiaohui@intel.com wrote:
 +static struct page_info *alloc_page_info(struct page_ctor *ctor,
 +struct kiocb *iocb, struct iovec *iov,
 +int count, struct frag *frags,
 +int npages, int total)
 +{
 +int rc;
 +int i, j, n = 0;
 +int len;
 +unsigned long base, lock_limit;
 +struct page_info *info = NULL;
 +
 +lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
 +lock_limit >>= PAGE_SHIFT;

Playing with rlimit on data path, transparently to the application in this way
looks strange to me, I suspect this has unexpected security implications.
Further, applications may have other uses for locked memory
besides mpassthru - you should not just take it because it's there.

Can we have an ioctl that lets userspace configure how much
memory to lock? This ioctl will decrement the rlimit and store
the data in the device structure so we can do accounting
internally. Put it back on close or on another ioctl.

Yes, we can decrement the rlimit in ioctl in one time to avoid
data path.

Need to be careful for when this operation gets called
again with 0 or another small value while we have locked memory -
maybe just fail with EBUSY?  or wait until it gets unlocked?
Maybe 0 can be special-cased and deactivate zero-copy?.

In fact, if we choose RLIMIT_MEMLOCK to limit the locked memory,
the default value is only 16 pages. That is too small for the device to
work, so we always have to configure it with a large value.
I think that if the rlimit value after the decrement would be < 0, then deactivating
zero-copy is better. Special-casing 0 may be OK.

 +
 +if (ctor->lock_pages + count > lock_limit && npages) {
 +printk(KERN_INFO "exceed the locked memory rlimit.");
 +return NULL;
 +}
 +
 +info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

Ok, Let me check this.


 +
 +if (!info)
 +return NULL;
 +
 +for (i = j = 0; i < count; i++) {
 +base = (unsigned long)iov[i].iov_base;
 +len = iov[i].iov_len;
 +
 +if (!len)
 +continue;
 +n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
 +
 +rc = get_user_pages_fast(base, n, npages ? 1 : 0,

npages controls whether this is a write? Why?

We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with 
npages = 0.


 +&info->pages[j]);
 +if (rc != n)
 +goto failed;
 +
 +while (n--) {
 +frags[j].offset = base & ~PAGE_MASK;
 +frags[j].size = min_t(int, len,
 +PAGE_SIZE - frags[j].offset);
 +len -= frags[j].size;
 +base += frags[j].size;
 +j++;
 +}
 +}
 +
 +#ifdef CONFIG_HIGHMEM
 +if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
 +for (i = 0; i < j; i++) {
 +if (PageHighMem(info->pages[i]))
 +goto failed;
 +}
 +}
 +#endif

Are non-highdma devices worth bothering with? If yes -
are there other limitations devices might have that we need to handle?
E.g. what about non-s/g devices or no checksum offloading?.

Basically I think there is no limitations for both, but let me have a check.


 +skb_push(skb, ETH_HLEN);
 +
 +if (skb_is_gso(skb)) {
 +hdr.hdr.hdr_len = skb_headlen(skb);
 +hdr.hdr.gso_size = shinfo->gso_size;
 +if (shinfo->gso_type & SKB_GSO_TCPV4)
 +hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 +else if (shinfo->gso_type & SKB_GSO_TCPV6)
 +hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
 +else if (shinfo->gso_type & SKB_GSO_UDP)
 +hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
 +else
 +BUG();
 +if (shinfo->gso_type & SKB_GSO_TCP_ECN)
 +hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 +
 +} else
 +hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
 +
 +if (skb->ip_summed == CHECKSUM_PARTIAL) {
 +hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
 +hdr.hdr.csum_start =
 +skb->csum_start - skb_headroom(skb);
 +hdr.hdr.csum_offset = skb->csum_offset;
 +}

We have this code in tun, macvtap and packet socket already.
Could this be a good time to move these into networking core?
I'm not asking you to do this right now, but could 

RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

2010-08-05 Thread Xin, Xiaohui
Herbert,
The v8 patches are modified mostly based on your comments about the
napi_gro_frags interface. What do you think about the patches for the
net core part?
We know there are currently some comments about the mp device,
such as supporting zero-copy for tun/tap and macvtap, and there
isn't a decision about that yet. Could you comment on the
net core part first, since that part is the same for zero-copy either way?

Thanks
Xiaohui

-Original Message-
From: linux-kernel-ow...@vger.kernel.org 
[mailto:linux-kernel-ow...@vger.kernel.org] On
Behalf Of xiaohui@intel.com
Sent: Thursday, July 29, 2010 7:15 PM
To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external
buffers to DMA into. Here external means the driver doesn't use kernel space
to allocate skb buffers. Currently the external buffers come from the
guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then
let the host NIC driver have the chance to DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-10:   net core and kernel changes.
patch 11-13:   new device as interface to manipulate external buffers.
patch 14:  for vhost-net.
patch 15:  An example of modifying a NIC driver to use napi_gro_frags().
patch 16:  An example of how to get guest buffers based on a driver
   that uses napi_gro_frags().

The guest virtio-net driver submits multiple requests thru the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; this means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
   Performance tuning

what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchronous mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it looks generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
   a small fix for missing dev_put when fail
   using dynamic minor instead of static minor number
   a __KERNEL__ protect to mp_get_sock()

what we have done in v2:

   remove most of the RCU usage, since the ctor pointer is only
   changed by BIND/UNBIND ioctl, and during that time, NIC will be
   stopped to get good cleanup(all outstanding requests are finished),
   so the ctor pointer cannot be raced into wrong situation.

   Remove the struct vhost_notifier with struct kiocb.
   Let vhost-net backend to alloc/free the kiocb and transfer them
   via sendmsg/recvmsg.

   use get_user_pages_fast() and set_page_dirty_lock() when read.

   Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
   the async write logging is rewritten
   a drafted synchronous write function for qemu live migration
   a limit for locked pages from get_user_pages_fast() to prevent Dos
   by using RLIMIT_MEMLOCK


what we have done in v4:
   add iocb completion callback from vhost-net to queue iocb in mp device
   replace vq-receiver by mp_sock_data_ready()
   remove stuff in mp device which access structures from vhost-net
   modify skb_reserve() to ignore host NIC driver reserved space
   rebase to the latest vhost tree
   split large patches into small pieces, especially for net core part.


what we have done in v5:
   address Arnd Bergmann's comments
   -remove IFF_MPASSTHRU_EXCL flag in mp device
   -Add CONFIG_COMPAT macro
   -remove mp_release ops
   move dev_is_mpassthru() as inline func
   fix a bug in memory relinquish
   Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
   move create_iocb() out of 

RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

2010-08-03 Thread Xin, Xiaohui
-Original Message-
From: Shirley Ma [mailto:mashi...@us.ibm.com]
Sent: Friday, July 30, 2010 6:31 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

Hello Xiaohui,

On Thu, 2010-07-29 at 19:14 +0800, xiaohui@intel.com wrote:
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it.
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.

Since vhost-net already supports macvtap/tun backends, do you think
it would be better to implement zero copy in macvtap/tun rather than introducing
a new media passthrough device here?

 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later.

I did some vhost performance measurement over 10Gb ixgbe, and found that
in order to get consistent BW results, netperf/netserver, qemu, vhost
threads smp affinities are required.

Looking forward to these results for small message size comparison. For
large message size 10Gb ixgbe BW already reached by doing vhost smp
affinity w/i offloading support, we will see how much CPU utilization it
can be reduced.

Please provide latency results as well. I did some experimental on
macvtap zero copy sendmsg, what I have found that get_user_pages latency
pretty high.

Could you share your performance results (including BW and latency) on
vhost-net and how you got them (your configuration, and especially the
affinity settings)?

Thanks
Xiaohui

Thanks
Shirley






RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.

2010-07-30 Thread Xin, Xiaohui
Hello Xiaohui,

On Thu, 2010-07-29 at 19:14 +0800, xiaohui@intel.com wrote:
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it.
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.

Since vhost-net already supports macvtap/tun backends, do you think
it would be better to implement zero copy in macvtap/tun rather than introducing
a new media passthrough device here?


I'm not sure if there will be more duplicated code in the kernel.

 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later.

I did some vhost performance measurement over 10Gb ixgbe, and found that
in order to get consistent BW results, netperf/netserver, qemu, vhost
threads smp affinities are required.

Looking forward to these results for small message size comparison. For
large message size 10Gb ixgbe BW already reached by doing vhost smp
affinity w/i offloading support, we will see how much CPU utilization it
can be reduced.

Please provide latency results as well. I did some experimental on
macvtap zero copy sendmsg, what I have found that get_user_pages latency
pretty high.

Ok, I will try that.

Thanks
Shirley






RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-28 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Sunday, June 27, 2010 2:15 PM
To: Dong, Eddie
Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:

 In current patch, each SKB for the assigned device (SRIOV VF or NIC or a 
 complete
queue pairs) uses the buffer from guest, so it eliminates copy completely in 
software and
requires hardware to do so. If we can have an additional place to store the 
buffer per skb (may
cause copy later on), we can do copy later on or re-post the buffer to 
assigned NIC driver
later on. But that may be not very clean either :(

OK, if I understand you correctly then I don't think have a
problem.  With your current patch-set you have exactly the same
situation when the skb-data is reallocated as a kernel buffer.


When will skb->data be reallocated? Could you point me to the code path?

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even less cases, because you'd only need to do it if the skb
head grows, which only happens if it becomes encapsulated.
So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb->head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb->head but the
first frag will remain unless we pull all of it.  If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows.  So you can just copy the header back into
the first frag just before we give it to the guest.

Since the data is still there, recomputing the page offset and size is OK,
right?
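
As a small sketch of that recomputation (assuming the pulled header of length
hlen is still present in the page in front of the first frag, and using the
2.6.3x skb_frag_t field names):

	skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
	unsigned int hlen = skb_headlen(skb);	/* bytes pulled into skb->head */

	/* the header bytes are still in the page, so just grow the frag
	 * back over them before handing the buffer to the guest
	 * (if the header was modified, memcpy it back first, kmap'ing
	 * the page when it is a highmem one) */
	frag->page_offset -= hlen;
	frag->size += hlen;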

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
--
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Xin, Xiaohui
Herbert,
That's why I have sent you the patch for the guest virtio-net driver. I reserved
512 bytes in each page, so I always have space to copy into and can avoid the
issue of the backend memory being used up.

Thanks
Xiaohui

-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Thursday, June 24, 2010 6:09 PM
To: Dong, Eddie
Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:

 I mean once the frontend side driver post the buffers to the backend driver, 
 the backend
driver will immediately use those buffers to compose skbs or gro_frags and 
post them to the
assigned host NIC driver as receive buffers. In that case, if the backend 
driver receives a
packet from the NIC that requires a copy, it may be unable to find 
additional free guest
buffer because all of them are already used by the NIC driver. We have to 
reserve some guest
buffers for the possible copy even if the buffer address is not identified by 
original skb :(

OK I see what you mean.  Can you tell me how does Xiaohui's
previous patch-set deal with this problem?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 18, 2010 1:59 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com; Rusty Russell
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:

 Herbert,
 I have questions about the idea above:
 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
 then we don't need napi_gro_frags() any more, the driver's original receiving
 function is ok. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags.  But now that you mention it, yes
we could still keep the old interface to minimise the work.

 2) Is napi_gro_frags() only suitable for TCP protocol packet?
 I have done a small test for ixgbe driver to let it only allocate paged 
 buffers
 and found kernel hangs when napi_gro_frags() receives an ARP packet.

It should work with any packet.  In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

Thanks for the verification. By the way, does that mean that nearly all drivers 
can use the 
same napi_gro_frags() to receive buffers even though currently each driver has its 
own receiving 
function?

 3) As I have mentioned above, with this idea, netdev_alloc_skb() will 
 allocate
as usual, and the data pointed to by skb->data will be copied into the first guest 
 buffer.
 That means we should reserve sufficient room in guest buffer. For PS mode
 supported driver (for example ixgbe), the room will be more than 128. After 
 128bytes,
 we will put the first frag data. Look into virtio-net.c the function 
 page_to_skb()
 and receive_mergeable(), that means we should modify guest virtio-net driver 
 to
 compute the offset as the parameter for skb_set_frag().

 How do you think about this? Attached is a patch to how to modify the guest 
 driver.
 I reserve 512 bytes as an example, and transfer the header len of the skb in 
  hdr->hdr_len.

Expanding the buffer size to 512 bytes to accomodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time.  I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.

Let me have a look to see if I can retain the copy threshold as 128 bytes 
and copy the header data safely.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-13 Thread Xin, Xiaohui
-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
Behalf Of
Xin, Xiaohui
Sent: Saturday, June 12, 2010 5:31 PM
To: Herbert Xu
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 11, 2010 1:21 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:

 I'm not sure if I understand your way correctly:
 1) Does the way only deal with driver with SG feature? Since packet
 is non-linear...

No the hardware doesn't have to support SG.  You just need to
place the entire packet contents in a page instead of skb->head.

 2) Is skb->data still pointing to guest user buffers?
 If yes, how to avoid the modifications to net core change to skb?

skb->data would not point to guest user buffers.  In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

 3) In our way only parts of drivers need be modified to support zero-copy.
 and here, need we modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

 I think to make skb->head empty at first will cause more effort to pass the 
 check with
 skb header. Have I missed something here? I really make the skb->head NULL
 just before kfree(skb) in skb_release_data(), it's done by callback we have 
 made for skb.

No I'm not suggesting you set it to NULL.  It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
suggesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I have laid out what I think you have in mind here; please clarify if
I missed something.

1) Modify the driver to allocate user pages instead of using netdev_alloc_skb()
if the dev is zero-copied.
  If the driver supports PS mode, then modify alloc_page() too.
2) Use napi_gro_frags() in the driver to receive the user pages instead of the
driver's own receive function.
3) napi_gro_frags() will allocate a small skb and pull the header data from
the first page into skb->data.

Is above the way what you have suggested?
I have thought something in detail about the way.

1) The first page will have an offset after the header is copied into 
allocated kernel skb.
The offset should be recalculated when the user page data is transferred to 
guest. This
may modify some of the gro code.

2) napi_gro_frags() may remove a page when its data has been totally pulled, but
we cannot put a user page back the way we normally would. This may require
modifying the gro code too.

3) When the user buffers are returned to the guest, some of them need a vnet
header prepended. That means the vnet header room should be reserved for some
pages when they are allocated. But we cannot know at allocation time which one
will be used as the first page. If we reserved vnet header room for every page,
then since set_skb_frag() in the guest driver only uses offset 0 for the second
and later pages, the page data would be wrong.

4) Since the user buffer pages have to be released, we still need a dtor
callback to do that, and I still need a place to hold it. What do you think
about putting it in skb_shinfo? (A sketch follows below.)
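
To make point 4 concrete, here is a minimal sketch of the kind of descriptor
that could be parked in skb_shinfo(skb)->destructor_arg; only the dtor member
mirrors what the later patches actually do, the other fields and names are
assumptions for the example:

/* Minimal sketch of an external-buffer descriptor reachable from
 * skb_shinfo(skb)->destructor_arg.
 */
struct skb_external_page {
	struct page	*pages[MAX_SKB_FRAGS];	/* pinned guest pages */
	int		pnum;			/* how many of them */
	void		(*dtor)(struct skb_external_page *ext_page);
};

/* called from skb_release_data() when the skb is finally freed */
static void mp_ext_page_dtor(struct skb_external_page *ext_page)
{
	int i;

	for (i = 0; i < ext_page->pnum; i++)
		put_page(ext_page->pages[i]);	/* undo get_user_pages_fast() */
}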

Currently I can only think of this.
How do you think about then?

Thanks
Xiaohui

Herbert,
In this way, I think we need at least 3 functions in drivers: one to
allocate rx buffers, one to receive the rx buffers, and one to clean them up.

We can also have another way here. We can provide a function that only substitutes
alloc_page(), and a function to release the pages when cleaning the rx buffers.
The skb for the rx buffer can be allocated in the original way, and when pushing
the data to the guest, the header data will be copied into the guest buffer. In this
way, we should reserve sufficient room for the header in the first guest user buffer.
That needs modifications to the guest virtio-net kernel, and this way is only suitable
for drivers that support PS mode. Considering that most advanced drivers have PS mode,
it should not be a critical issue.

Thanks
Xiaohui

RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-12 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 11, 2010 1:21 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:

 I'm not sure if I understand your way correctly:
 1) Does the way only deal with driver with SG feature? Since packet
 is non-linear...

No the hardware doesn't have to support SG.  You just need to
place the entire packet contents in a page instead of skb->head.

 2) Is skb->data still pointing to guest user buffers?
 If yes, how to avoid the modifications to net core change to skb?

skb->data would not point to guest user buffers.  In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

 3) In our way only parts of drivers need be modified to support zero-copy.
 and here, need we modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

 I think to make skb->head empty at first will cause more effort to pass the 
 check with
 skb header. Have I missed something here? I really make the skb->head NULL
 just before kfree(skb) in skb_release_data(), it's done by callback we have 
 made for skb.

No I'm not suggesting you set it to NULL.  It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
suggesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I explained what I think the thought in your mind here, please clarify if 
something missed.

1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is 
zero-copyed.
  If the driver support PS mode, then modify alloc_page() too.
2) Add napi_gro_frags() in driver to receive the user pages instead of driver's 
receiving 
function.
3) napi_gro_frags() will allocate small skb and pull the header data from 
the first page into skb->data.

Is above the way what you have suggested?
I have thought something in detail about the way.

1) The first page will have an offset after the header is copied into allocated 
kernel skb. 
The offset should be recalculated when the user page data is transferred to 
guest. This 
may modify some of the gro code.

2) napi_gro_frags() may remove a page when it's data is totally be pulled, but 
we cannot 
put a user page as normally. This may modify the gro code too.

3) When the user buffer returned to guest, some of them need to be appended a 
vnet header.
That means for some pages, the vnet header room should be reserved when 
allocated.
But we cannot know which one will be used as the first page when allocated. If 
we reserved vnet header for each page, since the set_skb_frag() in guest driver 
only use the offset 0 for second pages, then page data will be wrong.

4) Since the user buffer pages should be released, so we still need a dtor 
callback to do that, and then I still need a place to hold it. How do you think 
about to put it in skb_shinfo?

Currently I can only think of this.
How do you think about then?

Thanks
Xiaohui 





RE: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Saturday, June 05, 2010 10:56 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() 
specially.

On Saturday, June 5, 2010 at 18:14 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com

 If buffer is external, then use the callback to destruct
 buffers.

 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  net/core/skbuff.c |   11 +++
  1 files changed, 11 insertions(+), 0 deletions(-)

 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index 37587f0..418457c 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)

  static void skb_release_data(struct sk_buff *skb)
  {
 +/* check if the skb has external buffers, we have use destructor_arg
 + * here to indicate
 + */
 +struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
 +

Oh well. This is v7 of your series, and nobody complained yet ?

This is a new cache miss on a _critical_ path.

Ok, I would remove the declaration here to avoid the new cache miss.



  if (!skb->cloned ||
      !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
			 &skb_shinfo(skb)->dataref)) {
 @@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
  if (skb_has_frags(skb))
  skb_drop_fraglist(skb);

 +	/* if the skb has external buffers, use the destructor here,
 +	 * since after this skb->head will be kfreed; an skb->head that
 +	 * comes from an external buffer cannot be destroyed with kfree.
 +	 */

Why not defer the access to skb_shinfo(skb)->destructor_arg to here?

And reference skb_shinfo(skb)->destructor_arg here.


 +	if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
 +		ext_page->dtor(ext_page);
  	kfree(skb->head);
  }
  }

if (dev_is_mpassthru(skb->dev)) {
	struct skb_external_page *ext_page =
		skb_shinfo(skb)->destructor_arg;
	if (ext_page && ext_page->dtor)
		ext_page->dtor(ext_page);
}

destructor_arg should be moved before frags[] if you really want to use it.

Thanks for the patch. But why is destructor_arg before frags[] better than 
after frags[]?
skb_release_data() will reference both of them.


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
 */
	atomic_t	dataref;

-   skb_frag_t  frags[MAX_SKB_FRAGS];
/* Intermediate layers must ensure that destructor_arg
 * remains valid until skb destructor */
void *  destructor_arg;
+
+   skb_frag_t  frags[MAX_SKB_FRAGS];
 };

 /* We divide dataref into two halves.  The higher 16 bits hold references





RE: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
Sent: Saturday, June 05, 2010 10:53 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 +	/* Record the external buffer info in this field. It's not so good,
 +	 * but we cannot find another place easily.
 +	 */
 +	shinfo->destructor_arg = ext_page;
 +


Yes this is a big problem, it's basically using a cache line that was not
touched before.


Did your patch which moves destructor_arg before frags[] also fix this?

Thanks
Xiaohui


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Stephen Hemminger [mailto:shemmin...@vyatta.com]
Sent: Monday, June 07, 2010 7:14 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...


Yes, I agree with your concern to some extent. But in our case the skbs which use 
external pages have a short travel: they start from the NIC driver and are 
forwarded to the guest immediately. They do not travel into the host kernel stack 
or any filters, which avoids many of the problems you have mentioned here.

Another point is that we try to make the solution more generic across different NIC 
drivers, so many drivers may use this way without modifications.
2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?

The pool is not fixed size; it is the set of usable buffers submitted by the guest 
virtio-net driver. Guest virtio-net will check how many buffers are filled and 
resubmit. What does a slow listener mean? A slow NIC driver?

Thanks
Xiaohui


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
Of Andi
Kleen
Sent: Monday, June 07, 2010 3:51 PM
To: Stephen Hemminger
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

3. If they come from an internal pool what happens when the kernel runs
low on memory? How is that pool balanced against other kernel
memory users?

The size of the pool is limited by the virtqueue capacity now.
If the virtqueue is really huge, then how to balance the memory is a problem.
I have not yet thought clearly about how to tune it dynamically...

-Andi

--
a...@linux.intel.com -- Speaking for myself only.


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Mitchell Erblich [mailto:erbli...@earthlink.net]
Sent: Monday, June 07, 2010 4:17 PM
To: Andi Kleen
Cc: Stephen Hemminger; Xin, Xiaohui; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.


On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

 Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come 
 from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?

 3. If they come from an internal pool what happens when the kernel runs
 low on memory? How is that pool balanced against other kernel
 memory users?

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only.

In general,

When an internal pool is created/used, their SHOULD be a reason.
Maybe, to keep allocation latency to a min, OR ...

The internal pool here is a collection of user buffers submitted
by the guest virtio-net driver. The guest puts buffers there and the driver gets
buffers from it. If the guest submits more buffers than the driver needs,
we need somewhere to hold them; that is what the internal pool here
is for.

Now IMO,

internal pool objects should have a ref count and
if that count is 0, then under memory pressure and/or num
of objects are above a high water mark, then they are freed,

OR if there is a last reference age field, then the object is to be
cleaned if dirty, then freed,

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).

Thanks for the thoughts.

Basically, the size of the internal pool is not decided by the pool itself,
and adding/deleting the objects in the pool is not a task of the pool itself either.
It is decided by both the guest virtio-net driver and the vhost-net driver, and
by the guest receive speed and submit speed.
The max size of the pool is limited by the number of virtqueue buffers.
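To illustrate the kind of pool I mean, a small sketch reusing the readq/page_info
pattern from the mp patches (the names here are illustrative only):

struct buf_pool {
	struct list_head	readq;
	spinlock_t		lock;
};

/* guest-submitted buffer descriptors are queued ... */
static void pool_put(struct buf_pool *p, struct page_info *info)
{
	spin_lock_bh(&p->lock);
	list_add_tail(&info->list, &p->readq);
	spin_unlock_bh(&p->lock);
}

/* ... and the NIC driver dequeues them when it needs rx buffers */
static struct page_info *pool_get(struct buf_pool *p)
{
	struct page_info *info = NULL;

	spin_lock_bh(&p->lock);
	if (!list_empty(&p->readq)) {
		info = list_first_entry(&p->readq, struct page_info, list);
		list_del(&info->list);
	}
	spin_unlock_bh(&p->lock);
	return info;
}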

Thanks
Xiaohui




Mitchell Erblich


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Tuesday, June 08, 2010 1:28 PM
To: Stephen Hemminger
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

I'm not sure if I understand your way correctly:
1) Does this way only deal with drivers that have the SG feature? Since the packet 
is non-linear...

2) Is skb->data still pointing to guest user buffers?
If yes, how can we avoid the modifications to the net core for such an skb?

3) In our way only parts of the drivers need to be modified to support zero-copy;
here, would we need to modify all the drivers?

If I missed your idea, could you explain it in more detail?

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.


If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb-head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb-head empty.

I think making skb->head empty at first will take more effort to pass the checks 
on the skb header. Have I missed something here? I actually make skb->head NULL
just before the kfree(skb->head) in skb_release_data(); it's done by the callback 
we have made for the skb.
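Roughly, the release path then looks like this (a sketch of the intent, not the
exact patch):

	if (dev_is_mpassthru(skb->dev)) {
		struct skb_external_page *ext_page =
			skb_shinfo(skb)->destructor_arg;

		if (ext_page && ext_page->dtor)
			ext_page->dtor(ext_page);
		skb->head = NULL;	/* head was not kmalloc'ed */
	}
	kfree(skb->head);		/* kfree(NULL) is a no-op */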

Thanks
Xiaohui

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: issue about virtio-net driver to suppoprt vhost mergeable buffer with zero-copy to support PS mode

2010-05-27 Thread Xin, Xiaohui
Michael,
What you have suggested could avoid tainting the guest virtio-net driver. 
Really thanks!
Of the two alternatives, the first involves more tricks in the driver or net-core 
stuff. So currently, I prefer to try the second one. Anyway, let me give it a good 
think. Thanks!

Thanks
Xiaohui

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of
Michael S. Tsirkin
Sent: Thursday, May 27, 2010 4:20 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; 
Herbert
Xu
Subject: Re: issue about virtio-net driver to suppoprt vhost mergeable buffer 
with zero-copy
to support PS mode

On Thu, May 27, 2010 at 09:21:02AM +0800, Xin, Xiaohui wrote:
 Michael,
 I'm now looking into the vhost mergeable buffer, and I tried to use it to 
 support PS mode
with zero-copy. And I found an issue there that I have to modify the guest 
virito-net driver.

 When guest virtio-net driver submits mergeable buffers, it submits multiple 
 pages outside.
In zero-copy case, vhost cannot know which page is used to put header, and 
which page is
used to put payload. Then vhost can only reserves 12 bytes for each page. That 
means, the
page_offset of the payload DMAed into the guest buffer is always 12 bytes. But 
guest
virtio-net driver always use offset 0 to put the data (See 
receive_mergeable()). That's where
the zero-copy use mergeable buffer must modify.

 Have I missed something here? And how do you think about it?

 Thanks
 Xiaohui

Maybe you can teach the hardware skip the first 12 bytes: qemu will
call an ioctl telling hardware what the virtio header size is.
This is how we plan to do it for tap.

Alternatively, buffers can be used in any order.
So we can have hardware use N buffers for the packet, and then
have vhost put the header in buffer N+1.

--
MST


issue about virtio-net driver to suppoprt vhost mergeable buffer with zero-copy to support PS mode

2010-05-26 Thread Xin, Xiaohui
Michael,
I'm now looking into the vhost mergeable buffer, and I tried to use it to 
support PS mode with zero-copy. And I found an issue there: I have to 
modify the guest virtio-net driver.

When the guest virtio-net driver submits mergeable buffers, it submits multiple 
pages outside. In the zero-copy case, vhost cannot know which page is used to put 
the header and which page is used to put the payload. Then vhost can only reserve 12 
bytes for each page. That means the page_offset of the payload DMAed into the 
guest buffer is always 12 bytes. But the guest virtio-net driver always uses offset 
0 to put the data (see receive_mergeable()). That's where zero-copy with the 
mergeable buffer must make modifications.

Have I missed something here? And what do you think about it?

Thanks 
Xiaohui 


RE: [RFC][PATCH v4 00/18] Provide a zero-copy method on KVM virtio-net.

2010-05-08 Thread Xin, Xiaohui
Michael,
Sorry, somehow I missed this mail. :-(

 Here, we have ever considered 2 ways to utilize the page constructor
 API to dispense the user buffers.
 
 One: Modify __alloc_skb() function a bit, it can only allocate a 
  structure of sk_buff, and the data pointer is pointing to a 
  user buffer which is coming from a page constructor API.
  Then the shinfo of the skb is also from guest.
  When packet is received from hardware, the skb-data is filled
  directly by h/w. What we have done is in this way.
 
  Pros:   We can avoid any copy here.
  Cons:   Guest virtio-net driver needs to allocate skb as almost
  the same method with the host NIC drivers, say the size
  of netdev_alloc_skb() and the same reserved space in the
  head of skb. Many NIC drivers are the same with guest and
  ok for this. But some lastest NIC drivers reserves special
  room in skb head. To deal with it, we suggest to provide
  a method in guest virtio-net driver to ask for parameter
  we interest from the NIC driver when we know which device 
  we have bind to do zero-copy. Then we ask guest to do so.
  Is that reasonable?

Do you still do this?

Currently, we still use the first way. But we now ignore the room which the host's 
skb_reserve() required when the device is doing zero-copy. This way we no longer 
taint the guest virtio-net driver with a new method.
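Roughly, the intent is the following (a sketch only, not the actual patch; whether
skb->dev is already set at this point is an assumption):

static inline void skb_reserve(struct sk_buff *skb, int len)
{
	/* mp-bound device: the data area is a guest buffer, keep its
	 * layout as the guest submitted it */
	if (skb->dev && dev_is_mpassthru(skb->dev))
		return;
	skb->data += len;
	skb->tail += len;
}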

 Two: Modify driver to get user buffer allocated from a page constructor
  API(to substitute alloc_page()), the user buffer are used as payload
  buffers and filled by h/w directly when packet is received. Driver
  should associate the pages with skb (skb_shinfo(skb)-frags). For 
  the head buffer side, let host allocates skb, and h/w fills it. 
  After that, the data filled in host skb header will be copied into
  guest header buffer which is submitted together with the payload buffer.
 
  Pros:   We could less care the way how guest or host allocates their
  buffers.
  Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.

I commented on this in the past. Do you still want comments?

Now we continue with the first way and try to push it. But any comments about 
the two methods are still welcome.

That's nice. The thing to do is probably to enable GSO/TSO
and see what we get this way. Also, mergeable buffer support
was recently posted and I hope to merge it for 2.6.35.
You might want to take a look.

I'm looking at the mergeable buffer. I think GSO/GRO support with zero-copy 
also needs it.
Is GSO/TSO still not supported by vhost-net currently?
-- 
MST


RE: [RFC][PATCH v4 00/18] Provide a zero-copy method on KVM virtio-net.

2010-04-28 Thread Xin, Xiaohui
  The idea is simple, just to pin the guest VM user space and then let
  host NIC driver has the chance to directly DMA to it.
 
 Isn't it much easier to map the RX ring of the network device into the
 guest's address space, have DMA map calls translate guest addresses to
 physical/DMA addresses as well as do all of this crazy page pinning
 stuff, and provide the translations and protections via the IOMMU?

This means we need guest know how the specific network device works.
So we won't be able to, for example, move guest between different hosts.
There are other problems: many physical systems do not have an iommu,
some guest OS-es do not support DMA map calls, doing VM exit
on each DMA map call might turn out to be very slow. And so on.

This solution is what we can think of now to implement zero-copy. Some
modifications are made to the net core to try to avoid network device driver
changes.  The major change is to __alloc_skb(), in which we added a dev
parameter to indicate whether the device will DMA to/from a guest/user buffer
pointed to by the host skb->data. We also modify skb_release_data() and
skb_reserve(). We have made it work with the ixgbe driver with PS mode disabled,
and got some performance data with it.
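Sketched very roughly, the allocation decision is as follows (dev_is_mpassthru()
is from the mp patches; ext_buffer_alloc() and the 'start' member are hypothetical
stand-ins for the page constructor hook, not the actual __alloc_skb() change):

static u8 *alloc_skb_data(struct net_device *dev, unsigned int size,
			  gfp_t gfp_mask, struct skb_external_page **extp)
{
	*extp = NULL;
	if (dev && dev_is_mpassthru(dev)) {
		*extp = ext_buffer_alloc(dev, size);	/* pinned guest buffer */
		return *extp ? (*extp)->start : NULL;
	}
	return kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
}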
  
Using netperf with GSO/TSO disabled, a 10G NIC, 
packet split mode disabled, with the raw socket case compared to vhost:

bandwidth goes from 1.1Gbps to 1.7Gbps
CPU % from 120%-140% to 140%-160%

We are now trying to get decent performance data with advanced features.
Do you have any other concerns with this solution? 

 What's being proposed here looks a bit over-engineered.

This is an attempt to reduce overhead for virtio (paravirtualization).
'Don't use PV' is kind of an alternative, but I do not
think it's a simpler one.

-- 
MST


RE: [RFC][PATCH v4 05/18] Add a function to indicate if device use external buffer.

2010-04-28 Thread Xin, Xiaohui
 +static int dev_is_mpassthru(struct net_device *dev)

bool return value should be better here.

-- 
Regards,
Changli Gao(xiao...@gmail.com)

Thanks, I will fix that.

Thanks
Xiaohui


RE: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,
Sorry, it's based on the suggestion to hook an iocb completion callback
to handle the iocb list in vhost-net.

Thanks
Xiaohui

-Original Message-
From: Xin, Xiaohui
Sent: Thursday, April 22, 2010 4:24 PM
To: m...@redhat.com
Cc: a...@arndb.de; net...@vger.kernel.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; 
jd...@linux.intel.com; Xin, Xiaohui
Subject: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

From: Xin Xiaohui xiaohui@intel.com

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space to the host memory and
provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

Michael,
Thanks. I have updated the patch with your suggestion.
It looks much clean now. Please have a review.

Thanks
Xiaohui

 drivers/vhost/Kconfig |   10 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1239 +
 include/linux/mpassthru.h |   29 +
 4 files changed, 1280 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..91806b1 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.

+config MEDIATE_PASSTHRU
+   tristate "mediate passthru network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support, we call it as mediate passthru to
+ be distiguish with hardare passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..cc99b14
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1239 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   /* record the locked pages */
+   int lock_pages;
+   struct rlimit   o_rlim;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   struct list_headlist;
+   int header;
+   /* indicate the actual

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,

Yes, I think this packet split mode probably maps well to mergeable buffer
support. Note that
1. Not all devices support large packets in this way, others might map
   to indirect buffers better

Are the indirect buffers meant to deal with skb->frag_list?

   So we have to figure out how migration is going to work
Yes, different guest virtio-net drivers may contain different features.
Does qemu migration currently work with different features supported by the
virtio-net driver?

2. It's up to guest driver whether to enable features such as
   mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode
   is optimal for a given device
Yes. When a device is bound, the mp device may query the capabilities from the 
driver.
Actually, there is a structure in the mp device now that can do this; we can add 
some fields
to support more.

3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO
GSO is for the tx side, right? I think the driver can handle it itself.
For GRO, I'm not sure whether it's easy or not. Basically, the mp device
we now support is doing what a raw socket does. The packets do not go to 
the host stack.
-- 
MST


RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-19 Thread Xin, Xiaohui
 Michael,
  The idea is simple, just to pin the guest VM user space and then
  let host NIC driver has the chance to directly DMA to it. 
  The patches are based on vhost-net backend driver. We add a device
  which provides proto_ops as sendmsg/recvmsg to vhost-net to
  send/recv directly to/from the NIC driver. KVM guest who use the
  vhost-net backend may bind any ethX interface in the host side to
  get copyless data transfer thru guest virtio-net frontend.
  
  The scenario is like this:
  
  The guest virtio-net driver submits multiple requests thru vhost-net
  backend driver to the kernel. And the requests are queued and then
  completed after corresponding actions in h/w are done.
  
  For read, user space buffers are dispensed to NIC driver for rx when
  a page constructor API is invoked. Means NICs can allocate user buffers
  from a page constructor. We add a hook in netif_receive_skb() function
  to intercept the incoming packets, and notify the zero-copy device.
  
  For write, the zero-copy deivce may allocates a new host skb and puts
  payload on the skb_shinfo(skb)-frags, and copied the header to skb-data.
  The request remains pending until the skb is transmitted by h/w.
  
  Here, we have ever considered 2 ways to utilize the page constructor
  API to dispense the user buffers.
  
  One:  Modify __alloc_skb() function a bit, it can only allocate a 
structure of sk_buff, and the data pointer is pointing to a 
user buffer which is coming from a page constructor API.
Then the shinfo of the skb is also from guest.
When packet is received from hardware, the skb-data is filled
directly by h/w. What we have done is in this way.
  
Pros:   We can avoid any copy here.
Cons:   Guest virtio-net driver needs to allocate skb as almost
the same method with the host NIC drivers, say the size
of netdev_alloc_skb() and the same reserved space in the
head of skb. Many NIC drivers are the same with guest and
ok for this. But some lastest NIC drivers reserves special
room in skb head. To deal with it, we suggest to provide
a method in guest virtio-net driver to ask for parameter
we interest from the NIC driver when we know which device 
we have bind to do zero-copy. Then we ask guest to do so.
Is that reasonable?
 Unfortunately, this would break compatibility with existing virtio.
 This also complicates migration.  
 You mean any modification to the guest virtio-net driver will break the
 compatibility? We tried to enlarge the virtio_net_config to contains the
 2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
 will check the feature flag, and get the parameters, then virtio-net driver 
 use
 it to allocate buffers. How about this?

This means that we can't, for example, live-migrate between different systems
without flushing outstanding buffers.

Ok. What we have thought about now is to do something with skb_reserve().
If the device is bound by mp, then skb_reserve() will do nothing for it.

 What is the room in skb head used for?
 I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes 
 compared to
 NET_IP_ALIGN.

Looking at code, this seems to do with alignment - could just be
a performance optimization.

  Two:  Modify driver to get user buffer allocated from a page 
  constructor
API(to substitute alloc_page()), the user buffer are used as payload
buffers and filled by h/w directly when packet is received. Driver
should associate the pages with skb (skb_shinfo(skb)-frags). For 
the head buffer side, let host allocates skb, and h/w fills it. 
After that, the data filled in host skb header will be copied into
guest header buffer which is submitted together with the payload buffer.
  
Pros:   We could less care the way how guest or host allocates their
buffers.
Cons:   We still need a bit copy here for the skb header.
  
  We are not sure which way is the better here. 
 The obvious question would be whether you see any speed difference
 with the two approaches. If no, then the second approach would be
 better.
 
 I remember the second approach is a bit slower in 1500MTU. 
 But we did not tested too much.

Well, that's an important datapoint. By the way, you'll need
header copy to activate LRO in host, so that's a good
reason to go with option 2 as well.


  This is the first thing we want
  to get comments from the community. We wish the modification to the 
  network
  part will be generic which not used by vhost-net backend only, but a user
  application may use it as well when the zero-copy device may provides 
  async
  read/write operations later.
  
  Please give comments especially for the network part modifications.
  
  
  We provide multiple submits and asynchronous notifiicaton to 
 vhost-net too.
  
  Our goal is to improve the bandwidth and reduce the CPU usage.
  

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-19 Thread Xin, Xiaohui
Michael,

 What we have not done yet:
  packet split support
 
What does this mean, exactly?
 We can support 1500MTU, but for jumbo frame, since vhost driver before 
 don't 
support mergeable buffer, we cannot try it for multiple sg.
 
I do not see why, vhost currently supports 64K buffers with indirect
descriptors.
 
 The receive_skb() in guest virtio-net driver will merge the multiple sg to 
 skb frags, how can indirect descriptors to that?

See add_recvbuf_big.

I don't mean this; that is for buffer submission. I mean that when a packet is 
received, in receive_buf(), the mergeable buffer path knows which received pages 
can be hooked into the skb frags; it's receive_mergeable() which does this.

When a NIC driver supports packet split mode, each ring descriptor 
contains an skb and a page. When a packet is received, if the status is not EOP, 
the page of the next descriptor is hooked onto the previous skb. We don't know how 
many frags belong to one skb. So when the guest submits buffers, it should submit 
multiple pages, and on receive, the guest should know which pages belong 
to one skb and hook them together. I think receive_mergeable() can do this, but 
I don't see how the big-packets path handles it. Have I missed something here?
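For reference, the packet-split pattern I mean looks roughly like this (a generic
sketch with hypothetical ring helpers such as desc_done()/desc_len()/desc_is_eop()
and build_head_skb(); this is not actual ixgbe code):

static void ps_mode_rx(struct example_ring *ring, struct napi_struct *napi)
{
	struct sk_buff *skb = NULL;
	unsigned int i = ring->next_to_clean;

	while (desc_done(ring, i)) {
		struct page *page = ring->buf[i].page;
		unsigned int len = desc_len(ring, i);

		if (!skb) {
			/* first descriptor: headers land in skb->data */
			skb = build_head_skb(ring, i);
		} else {
			/* non-EOP continuation: chain the page as a frag */
			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
					   page, 0, len);
			skb->len += len;
			skb->data_len += len;
			skb->truesize += PAGE_SIZE;
		}

		if (desc_is_eop(ring, i)) {
			napi_gro_receive(napi, skb);
			skb = NULL;
		}
		i = (i + 1) % ring->count;
	}
	ring->next_to_clean = i;
}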

Thanks
Xiaohui 



RE: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-15 Thread Xin, Xiaohui
Arnd,
 From: Xin Xiaohui xiaohui@intel.com
 
 Add a device to utilize the vhost-net backend driver for
 copy-less data transfer between guest FE and host NIC.
 It pins the guest user space to the host memory and
 provides proto_ops as sendmsg/recvmsg to vhost-net.

Sorry for taking so long before finding the time to look
at your code in more detail.

It seems that you are duplicating a lot of functionality that
is already in macvtap. I've asked about this before but then
didn't look at your newer versions. Can you explain the value
of introducing another interface to user land?

I'm still planning to add zero-copy support to macvtap,
hopefully reusing parts of your code, but do you think there
is value in having both?

I have not looked into your macvtap code in detail before.
Are the two interfaces exactly the same? We just want to create a simple
way to do zero-copy. Now it can only support vhost, but in the future
we also want it to support direct read/write operations from user space too.

Basically, compared to the interface, I'm more worried about the modifications
to the net core we have made to implement zero-copy now. If this hardest part
can be done, then any user space interface modifications or integrations are 
more easily done after that.

 diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
 new file mode 100644
 index 000..86d2525
 --- /dev/null
 +++ b/drivers/vhost/mpassthru.c
 @@ -0,0 +1,1264 @@
 +
 +#ifdef MPASSTHRU_DEBUG
 +static int debug;
 +
 +#define DBG  if (mp->debug) printk
 +#define DBG1 if (debug == 2) printk
 +#else
 +#define DBG(a...)
 +#define DBG1(a...)
 +#endif

This should probably just use the existing dev_dbg/pr_debug infrastructure.

Thanks. Will try that.
 [... skipping buffer management code for now]

 +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
 +struct msghdr *m, size_t total_len)
 +{
 [...]

This function looks like we should be able to easily include it into
macvtap and get zero-copy transmits without introducing the new
user-level interface.

 +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
 +struct msghdr *m, size_t total_len,
 +int flags)
 +{
 +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
 +	struct page_ctor *ctor;
 +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

It smells like a layering violation to look at the iocb->private field
from a lower-level driver. I would have hoped that it's possible to implement
this without having this driver know about the higher-level vhost driver
internals. Can you explain why this is needed?

I don't like this either, but the kiocb is maintained by vhost with a 
list_head, and the mp device is responsible for collecting the kiocb into that 
list_head, so we need something known by both vhost and mp.
 
 +	spin_lock_irqsave(&ctor->read_lock, flag);
 +	list_add_tail(&info->list, &ctor->readq);
 +	spin_unlock_irqrestore(&ctor->read_lock, flag);
 +
 +	if (!vq->receiver) {
 +		vq->receiver = mp_recvmsg_notify;
 +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
 +				   vq->num * 4096,
 +				   vq->num * 4096);
 +	}
 +
 +	return 0;
 +}

Not sure what I'm missing, but who calls the vq->receiver? This seems
to be neither in the upstream version of vhost nor introduced by your
patch.

See patch v3 2/3 that I have sent out; it is called by handle_rx() in vhost.

 +static void __mp_detach(struct mp_struct *mp)
 +{
 +	mp->mfile = NULL;
 +
 +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
 +	page_ctor_detach(mp);
 +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
 +
 +	/* Drop the extra count on the net device */
 +	dev_put(mp->dev);
 +}
 +
 +static DEFINE_MUTEX(mp_mutex);
 +
 +static void mp_detach(struct mp_struct *mp)
 +{
 +	mutex_lock(&mp_mutex);
 +	__mp_detach(mp);
 +	mutex_unlock(&mp_mutex);
 +}
 +
 +static void mp_put(struct mp_file *mfile)
 +{
 +	if (atomic_dec_and_test(&mfile->count))
 +		mp_detach(mfile->mp);
 +}
 +
 +static int mp_release(struct socket *sock)
 +{
 +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
 +	struct mp_file *mfile = mp->mfile;
 +
 +	mp_put(mfile);
 +	sock_put(mp->socket.sk);
 +	put_net(mfile->net);
 +
 +	return 0;
 +}

Doesn't this prevent the underlying interface from going away while the chardev
is open? You also have logic to handle that case, so why do you keep the extra
reference on the netdev?

Let me think.

 +/* Ops structure to mimic raw sockets with mp device */
 +static const struct proto_ops mp_socket_ops = {
 +.sendmsg = mp_sendmsg,
 +.recvmsg = mp_recvmsg,
 +.release = mp_release,
 +};

 +static int mp_chr_open(struct inode *inode, struct file * file)
 +{
 +struct mp_file *mfile;
+ cycle_kernel_lock();

I don't think you really want to use the BKL here, just kill

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-15 Thread Xin, Xiaohui

Michael,
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.
 
 The scenario is like this:
 
 The guest virtio-net driver submits multiple requests thru vhost-net
 backend driver to the kernel. And the requests are queued and then
 completed after corresponding actions in h/w are done.
 
 For read, user space buffers are dispensed to NIC driver for rx when
 a page constructor API is invoked. Means NICs can allocate user buffers
 from a page constructor. We add a hook in netif_receive_skb() function
 to intercept the incoming packets, and notify the zero-copy device.
 
 For write, the zero-copy deivce may allocates a new host skb and puts
 payload on the skb_shinfo(skb)-frags, and copied the header to skb-data.
 The request remains pending until the skb is transmitted by h/w.
 
 Here, we have ever considered 2 ways to utilize the page constructor
 API to dispense the user buffers.
 
 One: Modify __alloc_skb() function a bit, it can only allocate a 
  structure of sk_buff, and the data pointer is pointing to a 
  user buffer which is coming from a page constructor API.
  Then the shinfo of the skb is also from guest.
  When packet is received from hardware, the skb-data is filled
  directly by h/w. What we have done is in this way.
 
  Pros:   We can avoid any copy here.
  Cons:   Guest virtio-net driver needs to allocate skb as almost
  the same method with the host NIC drivers, say the size
  of netdev_alloc_skb() and the same reserved space in the
  head of skb. Many NIC drivers are the same with guest and
  ok for this. But some lastest NIC drivers reserves special
  room in skb head. To deal with it, we suggest to provide
  a method in guest virtio-net driver to ask for parameter
  we interest from the NIC driver when we know which device 
  we have bind to do zero-copy. Then we ask guest to do so.
  Is that reasonable?

Unfortunately, this would break compatibility with existing virtio.
This also complicates migration. 

You mean any modification to the guest virtio-net driver will break the
compatibility? We tried to enlarge virtio_net_config to contain the
2 parameters, and add a VIRTIO_NET_F_PASSTHRU flag; virtionet_probe()
will check the feature flag and get the parameters, then the virtio-net driver uses
them to allocate buffers. How about this?

What is the room in skb head used for?
I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes 
compared to
NET_IP_ALIGN.

 Two: Modify driver to get user buffer allocated from a page constructor
  API(to substitute alloc_page()), the user buffer are used as payload
  buffers and filled by h/w directly when packet is received. Driver
  should associate the pages with skb (skb_shinfo(skb)-frags). For 
  the head buffer side, let host allocates skb, and h/w fills it. 
  After that, the data filled in host skb header will be copied into
  guest header buffer which is submitted together with the payload buffer.
 
  Pros:   We could less care the way how guest or host allocates their
  buffers.
  Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here.

The obvious question would be whether you see any speed difference
with the two approaches. If no, then the second approach would be
better.

I remember the second approach is a bit slower with 1500 MTU. 
But we did not test it much.

 This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 Please give comments especially for the network part modifications.
 
 
 We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.
 
 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later. But for simple
 test with netperf, we found bindwidth up and CPU % up too,
 but the bindwidth up ratio is much more than CPU % up ratio.
 
 What we have not done yet:
  packet split support

What does this mean, exactly?
We can support 1500 MTU, but for jumbo frames, since the vhost driver did not 
support mergeable buffers before, we cannot try it with multiple sg. A jumbo frame 
will be split into 5 frags and hooked one per descriptor, so the user buffer 
allocation depends greatly
on how the guest 

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-06 Thread Xin, Xiaohui
Sridhar,

 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?

PCI-passthrough needs hardware support: a kind of IOMMU engine that
helps to translate guest physical addresses to host physical addresses.
And currently, a PCI-passthrough device cannot survive live migration.

The zero-copy approach is a pure software solution. It doesn't need special hardware 
support.
In theory, it can survive live migration.
 
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

We only need to pin the send/receive buffers.
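A minimal sketch of the pinning step (not the mp device code; uaddr/len would
come from the descriptor the guest posted, and the caller releases the pages
when the dtor runs):

static int pin_guest_buffer(unsigned long uaddr, size_t len,
			    struct page **pages, int max_pages)
{
	int npages = (offset_in_page(uaddr) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	int got;

	if (npages > max_pages)
		return -EINVAL;

	down_read(&current->mm->mmap_sem);
	got = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
			     npages, 1 /* write */, 0 /* force */,
			     pages, NULL);
	up_read(&current->mm->mmap_sem);
	return got;
}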

Thanks
Xiaohui

Thanks
Sridhar
 
 The scenario is like this:
 
 The guest virtio-net driver submits multiple requests thru vhost-net
 backend driver to the kernel. And the requests are queued and then
 completed after corresponding actions in h/w are done.
 
 For read, user space buffers are dispensed to NIC driver for rx when
 a page constructor API is invoked. Means NICs can allocate user buffers
 from a page constructor. We add a hook in netif_receive_skb() function
 to intercept the incoming packets, and notify the zero-copy device.
 
 For write, the zero-copy deivce may allocates a new host skb and puts
 payload on the skb_shinfo(skb)-frags, and copied the header to skb-data.
 The request remains pending until the skb is transmitted by h/w.
 
 Here, we have ever considered 2 ways to utilize the page constructor
 API to dispense the user buffers.
 
 One:  Modify __alloc_skb() function a bit, it can only allocate a 
   structure of sk_buff, and the data pointer is pointing to a 
   user buffer which is coming from a page constructor API.
   Then the shinfo of the skb is also from guest.
   When packet is received from hardware, the skb-data is filled
   directly by h/w. What we have done is in this way.
 
   Pros:   We can avoid any copy here.
   Cons:   Guest virtio-net driver needs to allocate skb as almost
   the same method with the host NIC drivers, say the size
   of netdev_alloc_skb() and the same reserved space in the
   head of skb. Many NIC drivers are the same with guest and
   ok for this. But some lastest NIC drivers reserves special
   room in skb head. To deal with it, we suggest to provide
   a method in guest virtio-net driver to ask for parameter
   we interest from the NIC driver when we know which device 
   we have bind to do zero-copy. Then we ask guest to do so.
   Is that reasonable?
 
 Two:  Modify driver to get user buffer allocated from a page constructor
   API(to substitute alloc_page()), the user buffer are used as payload
   buffers and filled by h/w directly when packet is received. Driver
   should associate the pages with skb (skb_shinfo(skb)-frags). For 
   the head buffer side, let host allocates skb, and h/w fills it. 
   After that, the data filled in host skb header will be copied into
   guest header buffer which is submitted together with the payload buffer.
 
   Pros:   We could less care the way how guest or host allocates their
   buffers.
   Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 Please give comments especially for the network part modifications.
 
 
 We provide multiple submits and asynchronous notifiicaton to 
 vhost-net too.
 
 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later. But for simple
 test with netperf, we found bindwidth up and CPU % up too,
 but the bindwidth up ratio is much more than CPU % up ratio.
 
 What we have not done yet:
   packet split support
   To support GRO
   Performance tuning
 
 what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchroush mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it looks generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
   a small fix for missing dev_put when fail
   using dynamic minor 

RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.

2010-04-06 Thread Xin, Xiaohui

 From: Xin Xiaohui xiaohui@intel.com
 
 The patch let host NIC driver to receive user space skb,
 then the driver has chance to directly DMA to guest user
 space buffers thru single ethX interface.
 We want it to be more generic as a zero copy framework.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzha...@gmail.com
 Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org
 ---
 
 We consider 2 way to utilize the user buffres, but not sure which one
 is better. Please give any comments.
 
 One:Modify __alloc_skb() function a bit, it can only allocate a
 structure of sk_buff, and the data pointer is pointing to a
 user buffer which is coming from a page constructor API.
 Then the shinfo of the skb is also from guest.
 When packet is received from hardware, the skb-data is filled
 directly by h/w. What we have done is in this way.
 
 Pros:   We can avoid any copy here.
 Cons:   Guest virtio-net driver needs to allocate skb as almost
 the same method with the host NIC drivers, say the size
 of netdev_alloc_skb() and the same reserved space in the
 head of skb. Many NIC drivers are the same with guest and
 ok for this. But some lastest NIC drivers reserves special
 room in skb head. To deal with it, we suggest to provide
 a method in guest virtio-net driver to ask for parameter
 we interest from the NIC driver when we know which device
 we have bind to do zero-copy. Then we ask guest to do so.
 Is that reasonable?
 
 Two:Modify driver to get user buffer allocated from a page constructor
 API(to substitute alloc_page()), the user buffer are used as payload
 buffers and filled by h/w directly when packet is received. Driver
 should associate the pages with skb (skb_shinfo(skb)-frags). For
 the head buffer side, let host allocates skb, and h/w fills it.
 After that, the data filled in host skb header will be copied into
 guest header buffer which is submitted together with the payload 
 buffer.
 
 Pros:   We could less care the way how guest or host allocates their
 buffers.
 Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 
 Thanks
 Xiaohui

How do you deal with the DoS problem of hostile user space app posting huge
number of receives and never getting anything. 

That's a problem we are trying to deal with. It's critical for the long term.
Currently, we try to limit the number of pages it can pin, but we are not sure how 
much is reasonable.
For now, the buffers submitted come from the guest virtio-net driver, so it's safe 
to some extent
just for now.

Thanks
Xiaohui


RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-06 Thread Xin, Xiaohui
Michael,
   For the write logging, do you have a function in hand that we can
   recompute the log? If that, I think I can use it to recompute the
  log info when the logging is suddenly enabled.
   For the outstanding requests, do you mean all the user buffers have
  submitted before the logging ioctl changed? That may be a lot, and
   some of them are still in NIC ring descriptors. Waiting them to be
  finished may be need some time. I think when logging ioctl changed,
   then the logging is changed just after that is also reasonable.

  The key point is that after loggin ioctl returns, any
  subsequent change to memory must be logged. It does not
  matter when was the request submitted, otherwise we will
  get memory corruption on migration.

  The change to memory happens when vhost_add_used_and_signal(), right?
  So after ioctl returns, just recompute the log info to the events in the 
  async queue,
  is ok. Since the ioctl and write log operations are all protected by 
  vq->mutex.

  Thanks
  Xiaohui

 Yes, I think this will work.

 Thanks, so do you have the function to recompute the log info in your hand 
 that I can
use? I have weakly remembered that you have noticed it before some time.

Doesn't just rerunning vhost_get_vq_desc work?

Am I missing something here?
vhost_get_vq_desc() looks into the vq, finds the first available buffer, and 
converts it
to an iovec. I think the first available buffer is not one of the buffers in the 
async queue, so I
think rerunning vhost_get_vq_desc() cannot work.

Thanks
Xiaohui

   Thanks
   Xiaohui
  
drivers/vhost/net.c   |  189 
   +++--
drivers/vhost/vhost.h |   10 +++
2 files changed, 192 insertions(+), 7 deletions(-)
  
   diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
   index 22d5fef..2aafd90 100644
   --- a/drivers/vhost/net.c
   +++ b/drivers/vhost/net.c
   @@ -17,11 +17,13 @@
    #include <linux/workqueue.h>
    #include <linux/rcupdate.h>
    #include <linux/file.h>
   +#include <linux/aio.h>
  
    #include <linux/net.h>
    #include <linux/if_packet.h>
    #include <linux/if_arp.h>
    #include <linux/if_tun.h>
   +#include <linux/mpassthru.h>
  
    #include <net/sock.h>
  
   @@ -47,6 +49,7 @@ struct vhost_net {
 struct vhost_dev dev;
 struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 struct vhost_poll poll[VHOST_NET_VQ_MAX];
   + struct kmem_cache   *cache;
 /* Tells us whether we are polling a socket for TX.
  * We only do this when socket buffer fills up.
  * Protected by tx vq lock. */
   @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, 
   struct socket *sock)
  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
}
  
   +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
   +{
   +	struct kiocb *iocb = NULL;
   +	unsigned long flags;
   +
   +	spin_lock_irqsave(&vq->notify_lock, flags);
   +	if (!list_empty(&vq->notifier)) {
   +		iocb = list_first_entry(&vq->notifier,
   +				struct kiocb, ki_list);
   +		list_del(&iocb->ki_list);
   +	}
   +	spin_unlock_irqrestore(&vq->notify_lock, flags);
   +	return iocb;
   +}
   +
   +static void handle_async_rx_events_notify(struct vhost_net *net,
   +		struct vhost_virtqueue *vq)
   +{
   +	struct kiocb *iocb = NULL;
   +	struct vhost_log *vq_log = NULL;
   +	int rx_total_len = 0;
   +	int log, size;
   +
   +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
   +		return;
   +
   +	if (vq->receiver)
   +		vq->receiver(vq);
   +
   +	vq_log = unlikely(vhost_has_feature(
   +			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
   +	while ((iocb = notify_dequeue(vq)) != NULL) {
   +		vhost_add_used_and_signal(&net->dev, vq,
   +				iocb->ki_pos, iocb->ki_nbytes);
   +		log = (int)iocb->ki_user_data;
   +		size = iocb->ki_nbytes;
   +		rx_total_len += iocb->ki_nbytes;
   +
   +		if (iocb->ki_dtor)
   +			iocb->ki_dtor(iocb);
   +		kmem_cache_free(net->cache, iocb);
   +
   +		if (unlikely(vq_log))
   +			vhost_log_write(vq, vq_log, log, size);
   +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
   +			vhost_poll_queue(&vq->poll);
   +			break;
   +		}
   +	}
   +}
   +
   +static void handle_async_tx_events_notify(struct vhost_net *net,
   +		struct vhost_virtqueue *vq)
   +{
   +	struct kiocb *iocb = NULL;
   +	int tx_total_len = 0;
   +
   +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
   +		return;
   +
   +	while ((iocb = notify_dequeue(vq)) != NULL) {
   +		vhost_add_used_and_signal(&net->dev, vq,
   +				iocb->ki_pos, 0);
   +		tx_total_len += iocb->ki_nbytes;
   +
   +		if (iocb->ki_dtor)
   +			iocb->ki_dtor(iocb);
   +
   +		kmem_cache_free(net->cache, iocb);
   +		if (unlikely(tx_total_len >= 

RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-06 Thread Xin, Xiaohui
Michael,
 
 Qemu needs a userspace write, is that a synchronous one or
asynchronous one?

It's a synchronous non-blocking write.
Sorry, why does Qemu live migration need the device to have a userspace write?
How does the write operation work? And why is a read operation not of concern here?

Thanks
Xiaohui


RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-05 Thread Xin, Xiaohui
Michael,
 
For the DoS issue, I'm not sure what a reasonable limit on how much
get_user_pages() can pin would be; should we compute it from the bandwidth?

There's a ulimit for locked memory. Can we use this, decreasing
the value for rlimit array? We can do this when backend is
enabled and re-increment when backend is disabled.

I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
the initial value for it is 0x10000; after a right shift by PAGE_SHIFT
that is only 16 pages we can lock, which seems too small, since the
guest virtio-net driver may submit a lot of requests at one time.
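
For what it's worth, the usual way other pinning drivers handle this is to charge mm->locked_vm against RLIMIT_MEMLOCK rather than deriving a limit from bandwidth; a rough sketch (hypothetical helper name, not code from this patch set):

static int sketch_account_pinned_pages(struct mm_struct *mm,
                                       unsigned long npages)
{
        unsigned long locked, lock_limit;
        int ret = 0;

        down_write(&mm->mmap_sem);
        locked = mm->locked_vm + npages;
        lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
                        >> PAGE_SHIFT;
        if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                ret = -ENOMEM;          /* over the ulimit, refuse to pin */
        else
                mm->locked_vm = locked; /* charge now, uncharge on release */
        up_write(&mm->mmap_sem);

        return ret;
}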


Thanks
Xiaohui



RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-05 Thread Xin, Xiaohui
Michael,
  For the write logging, do you have a function in hand that we can
  recompute the log? If that, I think I can use it to recompute the
 log info when the logging is suddenly enabled.
  For the outstanding requests, do you mean all the user buffers have
 submitted before the logging ioctl changed? That may be a lot, and
  some of them are still in NIC ring descriptors. Waiting them to be
 finished may be need some time. I think when logging ioctl changed,
  then the logging is changed just after that is also reasonable.
 
 The key point is that after loggin ioctl returns, any
 subsequent change to memory must be logged. It does not
 matter when was the request submitted, otherwise we will
 get memory corruption on migration.

 The change to memory happens when vhost_add_used_and_signal(), right?
 So after ioctl returns, just recompute the log info to the events in the 
 async queue,
 is ok. Since the ioctl and write log operations are all protected by 
 vq-mutex.
 
 Thanks
 Xiaohui

Yes, I think this will work.

Thanks. So do you have the function to recompute the log info at hand that I
can use? I vaguely remember that you mentioned it some time ago.

  Thanks
  Xiaohui
  
   drivers/vhost/net.c   |  189 
  +++--
   drivers/vhost/vhost.h |   10 +++
   2 files changed, 192 insertions(+), 7 deletions(-)
  
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index 22d5fef..2aafd90 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -17,11 +17,13 @@
   #include linux/workqueue.h
   #include linux/rcupdate.h
   #include linux/file.h
  +#include linux/aio.h
   
   #include linux/net.h
   #include linux/if_packet.h
   #include linux/if_arp.h
   #include linux/if_tun.h
  +#include linux/mpassthru.h
   
   #include net/sock.h
   
  @@ -47,6 +49,7 @@ struct vhost_net {
  struct vhost_dev dev;
  struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
  struct vhost_poll poll[VHOST_NET_VQ_MAX];
  +   struct kmem_cache   *cache;
  /* Tells us whether we are polling a socket for TX.
   * We only do this when socket buffer fills up.
   * Protected by tx vq lock. */
  @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct 
  socket *sock)
  net-tx_poll_state = VHOST_NET_POLL_STARTED;
   }
   
  +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
  +{
  +   struct kiocb *iocb = NULL;
  +   unsigned long flags;
  +
  +   spin_lock_irqsave(vq-notify_lock, flags);
  +   if (!list_empty(vq-notifier)) {
  +   iocb = list_first_entry(vq-notifier,
  +   struct kiocb, ki_list);
  +   list_del(iocb-ki_list);
  +   }
  +   spin_unlock_irqrestore(vq-notify_lock, flags);
  +   return iocb;
  +}
  +
  +static void handle_async_rx_events_notify(struct vhost_net *net,
  +   struct vhost_virtqueue *vq)
  +{
  +   struct kiocb *iocb = NULL;
  +   struct vhost_log *vq_log = NULL;
  +   int rx_total_len = 0;
  +   int log, size;
  +
  +   if (vq-link_state != VHOST_VQ_LINK_ASYNC)
  +   return;
  +
  +   if (vq-receiver)
  +   vq-receiver(vq);
  +
  +   vq_log = unlikely(vhost_has_feature(
  +   net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL;
  +   while ((iocb = notify_dequeue(vq)) != NULL) {
  +   vhost_add_used_and_signal(net-dev, vq,
  +   iocb-ki_pos, iocb-ki_nbytes);
  +   log = (int)iocb-ki_user_data;
  +   size = iocb-ki_nbytes;
  +   rx_total_len += iocb-ki_nbytes;
  +
  +   if (iocb-ki_dtor)
  +   iocb-ki_dtor(iocb);
  +   kmem_cache_free(net-cache, iocb);
  +
  +   if (unlikely(vq_log))
  +   vhost_log_write(vq, vq_log, log, size);
  +   if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) {
  +   vhost_poll_queue(vq-poll);
  +   break;
  +   }
  +   }
  +}
  +
  +static void handle_async_tx_events_notify(struct vhost_net *net,
  +   struct vhost_virtqueue *vq)
  +{
  +   struct kiocb *iocb = NULL;
  +   int tx_total_len = 0;
  +
  +   if (vq-link_state != VHOST_VQ_LINK_ASYNC)
  +   return;
  +
  +   while ((iocb = notify_dequeue(vq)) != NULL) {
  +   vhost_add_used_and_signal(net-dev, vq,
  +   iocb-ki_pos, 0);
  +   tx_total_len += iocb-ki_nbytes;
  +
  +   if (iocb-ki_dtor)
  +   iocb-ki_dtor(iocb);
  +
  +   kmem_cache_free(net-cache, iocb);
  +   if (unlikely(tx_total_len = VHOST_NET_WEIGHT)) {
  +   vhost_poll_queue(vq-poll);
  +   break;
  +   }
  +   }
  +}
  +
   /* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
   static void handle_tx(struct vhost_net *net)
   {
  struct vhost_virtqueue *vq = 

Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-01 Thread Xin Xiaohui
The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

Michael,
Now I have made vhost alloc/destroy the kiocb and transfer it via
sendmsg/recvmsg. I did not remove vq->receiver, since what the
callback does is related to structures owned by the mp device,
and I think keeping them isolated from vhost is a good thing for us all.
It also will not prevent the mp device from becoming independent of vhost
in the future. Later, when the mp device can be a real device which
provides asynchronous read/write operations rather than just exposing
proto_ops, it will use another callback function which is not
related to vhost at all.

For the write logging, do you have a function at hand that we can use to
recompute the log? If so, I think I can use it to recompute the
log info when logging is suddenly enabled.
For the outstanding requests, do you mean all the user buffers that have
been submitted before the logging ioctl changed? That may be a lot, and
some of them are still in NIC ring descriptors. Waiting for them to
finish may need some time. I think it is also reasonable that when the
logging ioctl changes, logging takes effect just after that.

Thanks
Xiaohui

 drivers/vhost/net.c   |  189 +++--
 drivers/vhost/vhost.h |   10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include linux/workqueue.h
 #include linux/rcupdate.h
 #include linux/file.h
+#include linux/aio.h
 
 #include linux/net.h
 #include linux/if_packet.h
 #include linux/if_arp.h
 #include linux/if_tun.h
+#include linux/mpassthru.h
 
 #include net/sock.h
 
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net-tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(vq-notify_lock, flags);
+   if (!list_empty(vq-notifier)) {
+   iocb = list_first_entry(vq-notifier,
+   struct kiocb, ki_list);
+   list_del(iocb-ki_list);
+   }
+   spin_unlock_irqrestore(vq-notify_lock, flags);
+   return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   int log, size;
+
+   if (vq-link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   if (vq-receiver)
+   vq-receiver(vq);
+
+   vq_log = unlikely(vhost_has_feature(
+   net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL;
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(net-dev, vq,
+   iocb-ki_pos, iocb-ki_nbytes);
+   log = (int)iocb-ki_user_data;
+   size = iocb-ki_nbytes;
+   rx_total_len += iocb-ki_nbytes;
+
+   if (iocb-ki_dtor)
+   iocb-ki_dtor(iocb);
+   kmem_cache_free(net-cache, iocb);
+
+   if (unlikely(vq_log))
+   vhost_log_write(vq, vq_log, log, size);
+   if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(vq-poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (vq-link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(net-dev, vq,
+   iocb-ki_pos, 0);
+   tx_total_len += iocb-ki_nbytes;
+
+   if (iocb-ki_dtor)
+   iocb-ki_dtor(iocb);
+
+   kmem_cache_free(net-cache, iocb);
+   if (unlikely(tx_total_len = VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(vq-poll);
+   break;
+   }
+   }
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind

Re:[PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-01 Thread Xin Xiaohui
Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space to the host memory and
provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---

Michael,
Sorry, I did not resolve all your comments this time.
I did not move the device out of the vhost directory because I
have not implemented real asynchronous read/write operations
for the mp device yet; we hope to do this after the network
code is checked in.

For the DoS issue, I'm not sure what a reasonable limit on how much
get_user_pages() can pin would be; should we compute it from the bandwidth?

We use get_user_pages_fast() and use set_page_dirty_lock().
We removed rcu_read_lock()/unlock(), since the ctor pointer is
only changed by the BIND/UNBIND ioctls, and during that time
the NIC is always stopped and all outstanding requests are done,
so the ctor pointer cannot race into a wrong condition.
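
For completeness, the completion side this implies looks roughly like the following (a sketch with a made-up helper name, not the posted code): pages that the NIC wrote into must be marked dirty before they are released.

static void sketch_release_pinned_pages(struct page **pages, int npages,
                                        bool written_by_dma)
{
        int i;

        for (i = 0; i < npages; i++) {
                if (written_by_dma)
                        set_page_dirty_lock(pages[i]);  /* guest data landed here */
                put_page(pages[i]);     /* drop the get_user_pages_fast() reference */
        }
}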

Qemu needs a userspace write, is that a synchronous one or
asynchronous one?

Thanks
Xiaohui

 drivers/vhost/Kconfig |5 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1162 +
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config VHOST_PASSTHRU
+   tristate "Zerocopy network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include linux/module.h
+#include linux/errno.h
+#include linux/kernel.h
+#include linux/major.h
+#include linux/slab.h
+#include linux/smp_lock.h
+#include linux/poll.h
+#include linux/fcntl.h
+#include linux/init.h
+#include linux/aio.h
+
+#include linux/skbuff.h
+#include linux/netdevice.h
+#include linux/etherdevice.h
+#include linux/miscdevice.h
+#include linux/ethtool.h
+#include linux/rtnetlink.h
+#include linux/if.h
+#include linux/if_arp.h
+#include linux/if_ether.h
+#include linux/crc32.h
+#include linux/nsproxy.h
+#include linux/uaccess.h
+#include linux/virtio_net.h
+#include linux/mpassthru.h
+#include net/net_namespace.h
+#include net/netns/generic.h
+#include net/rtnetlink.h
+#include net/sock.h
+
+#include asm/system.h
+
+#include vhost.h
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   void*ctrl;
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int

RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-01 Thread Xin, Xiaohui

 For the write logging, do you have a function in hand that we can
 recompute the log? If that, I think I can use it to recompute the
log info when the logging is suddenly enabled.
 For the outstanding requests, do you mean all the user buffers have
submitted before the logging ioctl changed? That may be a lot, and
 some of them are still in NIC ring descriptors. Waiting them to be
finished may be need some time. I think when logging ioctl changed,
 then the logging is changed just after that is also reasonable.

The key point is that after loggin ioctl returns, any
subsequent change to memory must be logged. It does not
matter when was the request submitted, otherwise we will
get memory corruption on migration.

The change to memory happens in vhost_add_used_and_signal(), right?
So after the ioctl returns, it is OK to just recompute the log info for the
events in the async queue, since the ioctl and the write-log operations are
all protected by vq->mutex.
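
To make that concrete, a hedged sketch of the completion path (it mirrors handle_async_rx_events_notify() from the patch; the sketch_ name is made up): the per-request log info is stashed in the iocb at submit time, and whether to actually write the dirty log is decided here, under vq->mutex, i.e. after any logging ioctl has fully returned.

static void sketch_complete_async_rx(struct vhost_net *net,
                                     struct vhost_virtqueue *vq,
                                     struct kiocb *iocb)
{
        int log = (int)iocb->ki_user_data;      /* log entries captured at submit */
        int size = iocb->ki_nbytes;

        vhost_add_used_and_signal(&net->dev, vq, iocb->ki_pos, size);

        /* decided now, not at submit time: protected by vq->mutex, so a
         * concurrent ioctl enabling VHOST_F_LOG_ALL has either completed
         * before us or will only run after we drop the mutex */
        if (unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)))
                vhost_log_write(vq, vq->log, log, size);
}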

Thanks
Xiaohui

 Thanks
 Xiaohui
 
  drivers/vhost/net.c   |  189 
 +++--
  drivers/vhost/vhost.h |   10 +++
  2 files changed, 192 insertions(+), 7 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 22d5fef..2aafd90 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -17,11 +17,13 @@
  #include linux/workqueue.h
  #include linux/rcupdate.h
  #include linux/file.h
 +#include linux/aio.h
  
  #include linux/net.h
  #include linux/if_packet.h
  #include linux/if_arp.h
  #include linux/if_tun.h
 +#include linux/mpassthru.h
  
  #include net/sock.h
  
 @@ -47,6 +49,7 @@ struct vhost_net {
   struct vhost_dev dev;
   struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
   struct vhost_poll poll[VHOST_NET_VQ_MAX];
 + struct kmem_cache   *cache;
   /* Tells us whether we are polling a socket for TX.
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
 @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct 
 socket *sock)
   net-tx_poll_state = VHOST_NET_POLL_STARTED;
  }
  
 +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
 +{
 + struct kiocb *iocb = NULL;
 + unsigned long flags;
 +
 + spin_lock_irqsave(vq-notify_lock, flags);
 + if (!list_empty(vq-notifier)) {
 + iocb = list_first_entry(vq-notifier,
 + struct kiocb, ki_list);
 + list_del(iocb-ki_list);
 + }
 + spin_unlock_irqrestore(vq-notify_lock, flags);
 + return iocb;
 +}
 +
 +static void handle_async_rx_events_notify(struct vhost_net *net,
 + struct vhost_virtqueue *vq)
 +{
 + struct kiocb *iocb = NULL;
 + struct vhost_log *vq_log = NULL;
 + int rx_total_len = 0;
 + int log, size;
 +
 + if (vq-link_state != VHOST_VQ_LINK_ASYNC)
 + return;
 +
 + if (vq-receiver)
 + vq-receiver(vq);
 +
 + vq_log = unlikely(vhost_has_feature(
 + net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL;
 + while ((iocb = notify_dequeue(vq)) != NULL) {
 + vhost_add_used_and_signal(net-dev, vq,
 + iocb-ki_pos, iocb-ki_nbytes);
 + log = (int)iocb-ki_user_data;
 + size = iocb-ki_nbytes;
 + rx_total_len += iocb-ki_nbytes;
 +
 + if (iocb-ki_dtor)
 + iocb-ki_dtor(iocb);
 + kmem_cache_free(net-cache, iocb);
 +
 + if (unlikely(vq_log))
 + vhost_log_write(vq, vq_log, log, size);
 + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) {
 + vhost_poll_queue(vq-poll);
 + break;
 + }
 + }
 +}
 +
 +static void handle_async_tx_events_notify(struct vhost_net *net,
 + struct vhost_virtqueue *vq)
 +{
 + struct kiocb *iocb = NULL;
 + int tx_total_len = 0;
 +
 + if (vq-link_state != VHOST_VQ_LINK_ASYNC)
 + return;
 +
 + while ((iocb = notify_dequeue(vq)) != NULL) {
 + vhost_add_used_and_signal(net-dev, vq,
 + iocb-ki_pos, 0);
 + tx_total_len += iocb-ki_nbytes;
 +
 + if (iocb-ki_dtor)
 + iocb-ki_dtor(iocb);
 +
 + kmem_cache_free(net-cache, iocb);
 + if (unlikely(tx_total_len = VHOST_NET_WEIGHT)) {
 + vhost_poll_queue(vq-poll);
 + break;
 + }
 + }
 +}
 +
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_tx(struct vhost_net *net)
  {
   struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_TX];
 + struct kiocb *iocb = NULL;
   unsigned head, out, in, s;
   struct msghdr msg = {
   .msg_name = NULL,
 @@ -124,6 +204,8 @@ static void handle_tx(struct 

RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-03-17 Thread Xin, Xiaohui
 Michael,
 I don't use the kiocb comes from the sendmsg/recvmsg,
 since I have embeded the kiocb in page_info structure,
 and allocate it when page_info allocated.

So what I suggested was that vhost allocates and tracks the iocbs, and
passes them to your device with sendmsg/ recvmsg calls. This way your
device won't need to share structures and locking strategy with vhost:
you get an iocb, handle it, invoke a callback to notify vhost about
completion.

This also gets rid of the 'receiver' callback

I'm not sure the receiver callback can be removed here.
The patch describes a work flow like this:
netif_receive_skb() gets the packet; it does nothing but queue the skb
and wake up vhost's handle_rx(). handle_rx() then calls the receiver callback
to deal with the skb and collect the necessary notify info into a list; vhost
owns the list and uses it to complete, in the same handle_rx() context.

We use a receiver callback here because only handle_rx() is woken up from
netif_receive_skb(), and we need mp device context to deal with the skb and
the notify info attached to it. We also take some locks in the callback function.

If I remove the receiver callback, I can only deal with the skb and notify
info in netif_receive_skb(), but that function runs in interrupt context,
where I think taking those locks is not allowed. And I cannot remove the locks.
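
A rough sketch of that flow, with hypothetical sketch_ names (the real work of turning a completed skb into notify info is mp-device specific and is passed in here as finish_dma): the softirq path only queues and wakes vhost, and everything that needs the mp device's locks runs in handle_rx()'s process context.

static void sketch_receiver(struct vhost_virtqueue *vq, struct sock *sk,
                            struct kiocb *(*finish_dma)(struct sk_buff *))
{
        struct sk_buff *skb;
        unsigned long flags;

        /* drain the skbs that netif_receive_skb() queued for us */
        while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
                struct kiocb *iocb = finish_dma(skb);   /* mp-device specific */

                /* hand the completion over to vhost's notifier list; it is
                 * consumed by handle_async_rx_events_notify() in the same
                 * handle_rx() run */
                spin_lock_irqsave(&vq->notify_lock, flags);
                list_add_tail(&iocb->ki_list, &vq->notifier);
                spin_unlock_irqrestore(&vq->notify_lock, flags);
        }
}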


 Please have a review and thanks for the instruction
 for replying email which helps me a lot.
 
 Thanks,
 Xiaohui
 
  drivers/vhost/net.c   |  159 
  +++--
  drivers/vhost/vhost.h |   12 
  2 files changed, 166 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 22d5fef..5483848 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -17,11 +17,13 @@
  #include linux/workqueue.h
  #include linux/rcupdate.h
  #include linux/file.h
 +#include linux/aio.h
  
  #include linux/net.h
  #include linux/if_packet.h
  #include linux/if_arp.h
  #include linux/if_tun.h
 +#include linux/mpassthru.h
  
  #include net/sock.h
  
 @@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct 
 socket *sock)
  net-tx_poll_state = VHOST_NET_POLL_STARTED;
  }
  
 +static void handle_async_rx_events_notify(struct vhost_net *net,
 +struct vhost_virtqueue *vq);
 +
 +static void handle_async_tx_events_notify(struct vhost_net *net,
 +struct vhost_virtqueue *vq);
 +

A couple of style comments:

- It's better to arrange functions in such order that forward declarations
aren't necessary.  Since we don't have recursion, this should always be
possible.

- continuation lines should be idented at least at the position of '('
on the previous line.

Thanks. I'd correct that.

  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_tx(struct vhost_net *net)
 @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
  tx_poll_stop(net);
  hdr_size = vq-hdr_size;
  
 +handle_async_tx_events_notify(net, vq);
 +
  for (;;) {
  head = vhost_get_vq_desc(net-dev, vq, vq-iov,
   ARRAY_SIZE(vq-iov),
 @@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
  /* Skip header. TODO: support TSO. */
  s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
  msg.msg_iovlen = out;
 +
 +if (vq-link_state == VHOST_VQ_LINK_ASYNC) {
 +vq-head = head;
 +msg.msg_control = (void *)vq;

So here a device gets a pointer to vhost_virtqueue structure. If it gets
an iocb and invokes a callback, it would not care about vhost internals.

 +}
 +
  len = iov_length(vq-iov, out);
  /* Sanity check */
  if (!len) {
 @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
  tx_poll_start(net, sock);
  break;
  }
 +
 +if (vq-link_state == VHOST_VQ_LINK_ASYNC)
 +continue;
+
  if (err != len)
  pr_err(Truncated TX packet: 
  len %d != %zd\n, err, len);
 @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
  }
  }
  
 +handle_async_tx_events_notify(net, vq);
 +
  mutex_unlock(vq-mutex);
  unuse_mm(net-dev.mm);
  }
@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
  int err;
  size_t hdr_size;
  struct socket *sock = rcu_dereference(vq-private_data);
 -if (!sock || skb_queue_empty(sock-sk-sk_receive_queue))
 +if (!sock || (skb_queue_empty(sock-sk-sk_receive_queue) 
 +vq-link_state == VHOST_VQ_LINK_SYNC))
  return;
  
  use_mm(net-dev.mm);
 @@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net 

Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-03-16 Thread Xin Xiaohui
The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

Michael,
I don't use the kiocb that comes from sendmsg/recvmsg,
since I have embedded the kiocb in the page_info structure
and allocate it when the page_info is allocated.
Please have a review, and thanks for the instructions
on replying to email, which helped me a lot.

Thanks,
Xiaohui

 drivers/vhost/net.c   |  159 +++--
 drivers/vhost/vhost.h |   12 
 2 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..5483848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include linux/workqueue.h
 #include linux/rcupdate.h
 #include linux/file.h
+#include linux/aio.h
 
 #include linux/net.h
 #include linux/if_packet.h
 #include linux/if_arp.h
 #include linux/if_tun.h
+#include linux/mpassthru.h
 
 #include net/sock.h
 
@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net-tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq);
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq);
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq-hdr_size;
 
+   handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 ARRAY_SIZE(vq-iov),
@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC) {
+   vq-head = head;
+   msg.msg_control = (void *)vq;
+   }
+
len = iov_length(vq-iov, out);
/* Sanity check */
if (!len) {
@@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
tx_poll_start(net, sock);
break;
}
+
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC)
+   continue;
+
if (err != len)
pr_err(Truncated TX packet: 
len %d != %zd\n, err, len);
@@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
}
}
 
+   handle_async_tx_events_notify(net, vq);
+
mutex_unlock(vq-mutex);
unuse_mm(net-dev.mm);
 }
@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
int err;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq-private_data);
-   if (!sock || skb_queue_empty(sock-sk-sk_receive_queue))
+   if (!sock || (skb_queue_empty(sock-sk-sk_receive_queue) 
+   vq-link_state == VHOST_VQ_LINK_SYNC))
return;
 
use_mm(net-dev.mm);
@@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
vhost_disable_notify(vq);
hdr_size = vq-hdr_size;
 
-   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
+   /* In async cases, for write logging, the simple way is to get
+* the log info always, and really logging is decided later.
+* Thus, when logging enabled, we can get log, and when logging
+* disabled, we can get log disabled accordingly.
+*/
+
+   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) |
+   (vq-link_state == VHOST_VQ_LINK_ASYNC) ?
vq-log : NULL;
 
+   handle_async_rx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 ARRAY_SIZE(vq-iov),
@@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq-iov, in);
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC) {
+   vq-head = head;
+   vq-_log = log;
+   msg.msg_control = (void *)vq;
+   }
/* Sanity check */
if (!len) {
vq_err(vq, Unexpected header len for RX: 
@@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net

RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-03-15 Thread Xin, Xiaohui
 +/* The structure to notify the virtqueue for async socket */
 +struct vhost_notifier {
 +struct list_head list;
 +struct vhost_virtqueue *vq;
 +int head;
 +int size;
 +int log;
 +void *ctrl;
 +void (*dtor)(struct vhost_notifier *);
 +};
 +

So IMO, this is not the best interface between vhost
and your driver, exposing them to each other unnecessarily.

If you think about it, your driver should not care about this structure.
It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor
on completion.  vhost could save its state in ki_user_data.  If your
driver needs to add more data to do more tracking, I think it can put
the skb pointer in the private pointer.

Then if I remove struct vhost_notifier and just use struct kiocb (not the one
that comes from sendmsg or recvmsg, but one allocated within the page_info
structure), and don't implement any aio logic related to it, is that OK?

Sorry, I made a patch, but I don't know how to reply to mail with a
well-formatted patch here.
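
For what it's worth, a minimal sketch of the kiocb-based hand-off being discussed (assumptions only, not the posted code; the sketch_ names are made up): vhost owns the iocb and primes ki_pos/ki_user_data/ki_dtor, and the mp device only fills in the length and invokes ki_dtor on completion.

/* vhost side: allocate and prime an iocb before calling sendmsg()/recvmsg() */
static struct kiocb *sketch_prep_iocb(struct kmem_cache *cache,
                                      unsigned int head, int log,
                                      void (*complete)(struct kiocb *))
{
        struct kiocb *iocb = kmem_cache_zalloc(cache, GFP_KERNEL);

        if (!iocb)
                return NULL;
        iocb->ki_pos = head;            /* descriptor head to flush later */
        iocb->ki_user_data = log;       /* dirty-log cookie */
        iocb->ki_dtor = complete;       /* invoked by the mp device */
        INIT_LIST_HEAD(&iocb->ki_list);
        return iocb;
}

/* mp device side: on DMA completion, record the length and notify vhost */
static void sketch_dma_done(struct kiocb *iocb, int len)
{
        iocb->ki_nbytes = len;
        if (iocb->ki_dtor)
                iocb->ki_dtor(iocb);    /* vhost queues it for the used ring */
}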

Thanks
Xiaohui


RE: [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net.

2010-03-08 Thread Xin, Xiaohui
On Sat, Mar 06, 2010 at 05:38:35PM +0800, xiaohui@intel.com wrote:
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.
 
 We provide multiple submits and asynchronous notification to
 vhost-net too.

 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later. But for a simple
 test with netperf, we found bandwidth up and CPU % up too,
 but the bandwidth increase ratio is much larger than the CPU % increase ratio.
 
 What we have not done yet:
  packet split support
  To support GRO
  Performance tuning
 
Am I right to say that nic driver needs changes for these patches
to work? If so, please publish nic driver patches as well.

For drivers that do not support packet split mode, the NIC drivers don't need to change.
For drivers that support packet split, we plan to add the driver API in updated
versions.
For now, for PS-capable drivers, just disable PS mode; it also works.

  what we have done in v1:
  polish the RCU usage
  deal with write logging in asynchronous mode in vhost
  add a notifier block for the mp device
  rename page_ctor to mp_port in netdevice.h to make it look generic
  add mp_dev_change_flags() for the mp device to change NIC state
  add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
  a small fix for a missing dev_put on failure
  use a dynamic minor instead of a static minor number
  a __KERNEL__ guard for mp_get_sock()
 
 performance:
  using netperf with GSO/TSO disabled, 10G NIC, 
  disabled packet split mode, with raw socket case compared to vhost.
 
  bandwidth goes from 1.1Gbps to 1.7Gbps
  CPU % from 120%-140% to 140%-160%

 That's pretty low for a 10Gb nic. Are you hitting some other bottleneck,
 like high interrupt rate? Also, GSO support and performance tuning
 for raw are incomplete. Try comparing with e.g. tap with GSO.

I'm curious too.
I first tested vhost-net without the zero-copy patch in the RAW socket case
with the ixgbe driver. With that driver the GRO feature is enabled by default,
and the netperf numbers were extremely low; after disabling GRO I can get more
than 1Gbps. So I thought I had missed something there, but I had sent 2 emails
to you about this before and got no reply from you.
Have you got any perf data for the raw socket case with vhost-net?
The data I have got from your web page is always for the tap-with-GSO case.

If GSO is not supported, I think the data cannot be compared with the
tap-with-GSO case at 1500 MTU.
Maybe mergeable buffers would help the performance in the raw socket case?
Thanks Xiaohui


RE: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.

2010-02-11 Thread Xin, Xiaohui
I will be on vacation during 2/13~2/20, so email replies to your comments may
be very slow or missing. But please don't hesitate to comment more, and I will
address them after the vacation. :-)

Thanks
Xiaohui
-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Xin Xiaohui
Sent: Wednesday, February 10, 2010 7:49 PM
To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; 
mi...@elte.hu; m...@redhat.com; jd...@c2.user-mode-linux.org
Subject: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.

The idea is simple: just pin the guest VM user space and then
let the host NIC driver have the chance to DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest who uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer thru the guest virtio-net frontend.
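
To illustrate the pinning step (a simplified sketch, not the actual mp device code; the helper name is made up and error unwinding of already-pinned pages is omitted):

static int sketch_pin_guest_buffer(void __user *base, size_t len,
                                   struct page **pages, int max_pages,
                                   struct skb_frag_struct *frags)
{
        unsigned long addr = (unsigned long)base;
        int offset = addr & ~PAGE_MASK;
        int npages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        int i, got;

        if (npages > max_pages)
                return -EINVAL;

        /* pin the guest buffer; the NIC will DMA straight into these pages */
        got = get_user_pages_fast(addr & PAGE_MASK, npages, 1, pages);
        if (got != npages)
                return -EFAULT;

        /* describe the pinned pages as skb frags for the driver */
        for (i = 0; i < npages; i++) {
                frags[i].page = pages[i];
                frags[i].page_offset = i ? 0 : offset;
                frags[i].size = min_t(size_t, len,
                                      PAGE_SIZE - frags[i].page_offset);
                len -= frags[i].size;
        }
        return npages;
}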

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But for a simple
test with netperf, we found bandwidth up and CPU % up too,
but the bandwidth increase ratio is much larger than the CPU % increase ratio.

What we have not done yet:
To support GRO
Performance tuning


[PATCH 3/3] Let host NIC driver to DMA to guest user space.

2010-02-10 Thread Xin Xiaohui
The patch lets the host NIC driver receive user-space skbs,
so the driver has a chance to DMA directly into guest user-space
buffers through a single ethX interface.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
 include/linux/netdevice.h |   72 +
 include/linux/skbuff.h|   32 ++--
 net/core/dev.c|   27 +
 net/core/skbuff.c |   62 +++
 4 files changed, 184 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..0de8688 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -486,6 +486,16 @@ struct netdev_queue {
 } cacheline_aligned_in_smp;
 
 
+struct netdev_page_ctor{
+   int hdr_len;
+   int data_len;
+   int npages;
+   unsignedflags;
+   struct socket   *sock;
+   struct skb_user_page*(*ctor)(struct netdev_page_ctor *,
+   struct sk_buff *, int);
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -636,6 +646,8 @@ struct net_device_ops {
int (*ndo_fcoe_ddp_done)(struct net_device *dev,
 u16 xid);
 #endif
+   int (*ndo_page_ctor_prep)(struct net_device *dev,
+   struct netdev_page_ctor *ctor);
 };
 
 /*
@@ -916,6 +928,7 @@ struct net_device
/* max exchange id for FCoE LRO by ddp */
unsigned intfcoe_ddp_xid;
 #endif
+   struct netdev_page_ctor *page_ctor;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -2013,6 +2026,65 @@ static inline u32 dev_ethtool_get_flags(struct 
net_device *dev)
return 0;
return dev-ethtool_ops-get_flags(dev);
 }
+
+static inline int netdev_page_ctor_prep(struct net_device *dev,
+               struct netdev_page_ctor *ctor)
+{
+       int rc;
+       int npages, data_len;
+       const struct net_device_ops *ops = dev->netdev_ops;
+
+       /* needed by packet split */
+       if (ops->ndo_page_ctor_prep) {
+               rc = ops->ndo_page_ctor_prep(dev, ctor);
+               if (rc)
+                       return rc;
+       } else {  /* should be temp */
+               ctor->hdr_len = 128;
+               ctor->data_len = 2048;
+               ctor->npages = 1;
+       }
+
+       if (ctor->hdr_len <= 0)
+               goto err;
+
+       npages = ctor->npages;
+       data_len = ctor->data_len;
+       if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+           (data_len < PAGE_SIZE * (npages - 1) ||
+            data_len > PAGE_SIZE * npages))
+               goto err;
+
+       return 0;
+err:
+       dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+       return -EINVAL;
+}
+
+static inline int netdev_page_ctor_attach(struct net_device *dev,
+               struct netdev_page_ctor *ctor)
+{
+       if (dev->flags & IFF_UP)
+               return -EBUSY;
+
+       if (rcu_dereference(dev->page_ctor))
+               return -EBUSY;
+
+       rcu_assign_pointer(dev->page_ctor, ctor);
+
+       return 0;
+}
+
+static inline void netdev_page_ctor_detach(struct net_device *dev)
+{
+       if (!rcu_dereference(dev->page_ctor))
+               return;
+
+       rcu_assign_pointer(dev->page_ctor, NULL);
+       synchronize_rcu();
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_NETDEVICE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..c77837e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -209,6 +209,13 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+struct skb_user_page {
+   u8  *start;
+   int size;
+   struct skb_frag_struct *frags;
+   struct skb_shared_info *ushinfo;
+   void(*dtor)(struct skb_user_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb-data.  The lower 16 bits hold references to
  * the entire skb-data.  A clone of a headerless skb holds the length of
@@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int fclone,
+  int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority

[PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-02-10 Thread Xin Xiaohui
Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space to the host memory and
provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
 drivers/vhost/Kconfig  |5 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/mpassthru.c  | 1178 
 include/linux/miscdevice.h |1 +
 include/linux/mpassthru.h  |   17 +
 5 files changed, 1203 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config VHOST_PASSTHRU
+   tristate "Zerocopy network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..d8d153f
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1178 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include linux/module.h
+#include linux/errno.h
+#include linux/kernel.h
+#include linux/major.h
+#include linux/slab.h
+#include linux/smp_lock.h
+#include linux/poll.h
+#include linux/fcntl.h
+#include linux/init.h
+#include linux/skbuff.h
+#include linux/netdevice.h
+#include linux/etherdevice.h
+#include linux/miscdevice.h
+#include linux/ethtool.h
+#include linux/rtnetlink.h
+#include linux/if.h
+#include linux/if_arp.h
+#include linux/if_ether.h
+#include linux/crc32.h
+#include linux/nsproxy.h
+#include linux/uaccess.h
+#include linux/virtio_net.h
+#include linux/mpassthru.h
+#include net/net_namespace.h
+#include net/netns/generic.h
+#include net/rtnetlink.h
+#include net/sock.h
+
+#include asm/system.h
+
+#include vhost.h
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   atomic_trefcnt;
+   struct kmem_cache   *cache;
+   struct net_device   *dev;
+   struct netdev_page_ctor ctor;
+   void*sendctrl;
+   void*recvctrl;
+};
+
+struct page_info {
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a user space allocated skb or kernel
+*/
+   struct skb_user_pageuser;
+   struct skb_shared_info  ushinfo;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+   unsignedflags;
+   unsignedpnum;
+
+   /* It's meaningful for receive, means
+* the max length allowed
+*/
+   size_t

[PATCH 2/3] Provides multiple submits and asynchronous notifications.

2010-02-10 Thread Xin Xiaohui
The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  145 +++--
 drivers/vhost/vhost.h |   23 
 2 files changed, 164 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..8a85227 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -22,6 +22,7 @@
 #include linux/if_packet.h
 #include linux/if_arp.h
 #include linux/if_tun.h
+#include linux/mpassthru.h
 
 #include net/sock.h
 
@@ -91,6 +92,10 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net-tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq);
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq);
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +129,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq-hdr_size;
 
+   handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 ARRAY_SIZE(vq-iov),
@@ -151,6 +158,12 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC) {
+   vq-head = head;
+   msg.msg_control = (void *)vq;
+   }
+
len = iov_length(vq-iov, out);
/* Sanity check */
if (!len) {
@@ -166,6 +179,10 @@ static void handle_tx(struct vhost_net *net)
tx_poll_start(net, sock);
break;
}
+
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC)
+   continue;
+
if (err != len)
pr_err(Truncated TX packet: 
len %d != %zd\n, err, len);
@@ -177,6 +194,8 @@ static void handle_tx(struct vhost_net *net)
}
}
 
+   handle_async_tx_events_notify(net, vq);
+
mutex_unlock(vq-mutex);
unuse_mm(net-dev.mm);
 }
@@ -206,7 +225,8 @@ static void handle_rx(struct vhost_net *net)
int err;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq-private_data);
-   if (!sock || skb_queue_empty(sock-sk-sk_receive_queue))
+   if (!sock || (skb_queue_empty(sock-sk-sk_receive_queue) 
+   vq-link_state == VHOST_VQ_LINK_SYNC))
return;
 
use_mm(net-dev.mm);
@@ -217,6 +237,8 @@ static void handle_rx(struct vhost_net *net)
vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
vq-log : NULL;
 
+   handle_async_rx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 ARRAY_SIZE(vq-iov),
@@ -245,6 +267,11 @@ static void handle_rx(struct vhost_net *net)
s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq-iov, in);
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC) {
+   vq-head = head;
+   vq-_log = log;
+   msg.msg_control = (void *)vq;
+   }
/* Sanity check */
if (!len) {
vq_err(vq, Unexpected header len for RX: 
@@ -259,6 +286,10 @@ static void handle_rx(struct vhost_net *net)
vhost_discard_vq_desc(vq);
break;
}
+
+   if (vq-link_state == VHOST_VQ_LINK_ASYNC)
+   continue;
+
/* TODO: Should check and handle checksum. */
if (err  len) {
pr_err(Discarded truncated rx packet: 
@@ -284,10 +315,83 @@ static void handle_rx(struct vhost_net *net)
}
}
 
+   handle_async_rx_events_notify(net, vq);
+
mutex_unlock(vq-mutex);
unuse_mm(net-dev.mm);
 }
 
+struct vhost_notifier *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct vhost_notifier *vnotify = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(vq-notify_lock, flags);
+   if (!list_empty(vq-notifier)) {
+   vnotify = list_first_entry(vq-notifier

RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-02-10 Thread Xin, Xiaohui
Eric,
Thanks. I will look into that. But don't stop there.
Please comment more. :-)

Thanks
Xiaohui

-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com] 
Sent: Wednesday, February 10, 2010 11:18 PM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; 
mi...@elte.hu; m...@redhat.com; jd...@c2.user-mode-linux.org; Zhao Yu
Subject: Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

On Wednesday, 10 February 2010 at 19:48 +0800, Xin Xiaohui wrote:
 Add a device to utilize the vhost-net backend driver for
 copy-less data transfer between guest FE and host NIC.
 It pins the guest user space to the host memory and
 provides proto_ops as sendmsg/recvmsg to vhost-net.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzha...@gmail.com
 Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org


 +static int page_ctor_attach(struct mp_struct *mp)
 +{
 + int rc;
 + struct page_ctor *ctor;
 + struct net_device *dev = mp-dev;
 +
 + rcu_read_lock();
 + if (rcu_dereference(mp-ctor)) {
 + rcu_read_unlock();
 + return -EBUSY;
 + }
 + rcu_read_unlock();

Strange read locking here, for an obvious writer role.
What do you really want to do ?
If writers are serialized by mp_mutex, you don't need this
rcu_read_lock()/rcu_read_unlock() stuff.

 +
 + ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
 + if (!ctor)
 + return -ENOMEM;
 + rc = netdev_page_ctor_prep(dev, ctor-ctor);
 + if (rc)
 + goto fail;
 +
 + ctor-cache = kmem_cache_create(skb_page_info,
 + sizeof(struct page_info), 0,
 + SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);


SLAB_PANIC here means : crash whole system in case of error.
This is not what you want in a driver.

 +
 + if (!ctor-cache)
 + goto cache_fail;
 +
 + INIT_LIST_HEAD(ctor-readq);
 + spin_lock_init(ctor-read_lock);
 +
 + ctor-w_len = 0;
 + ctor-r_len = 0;
 +
 + dev_hold(dev);
 + ctor-dev = dev;
 + ctor-ctor.ctor = page_ctor;
 + ctor-ctor.sock = mp-socket;
 + atomic_set(ctor-refcnt, 1);
 +
 + rc = netdev_page_ctor_attach(dev, ctor-ctor);
 + if (rc)
 + goto fail;
 +
 + /* locked by mp_mutex */
 + rcu_assign_pointer(mp-ctor, ctor);
 +
 + /* XXX:Need we do set_offload here ? */
 +
 + return 0;
 +
 +fail:
 + kmem_cache_destroy(ctor-cache);
 +cache_fail:
 + kfree(ctor);
 + dev_put(dev);
 +
 + return rc;
 +}
 +
 +
 +static inline void get_page_ctor(struct page_ctor *ctor)
 +{
 +   atomic_inc(ctor-refcnt);
 +}
 +
 +static inline void put_page_ctor(struct page_ctor *ctor)
 +{
 + if (atomic_dec_and_test(ctor-refcnt))
 + kfree(ctor);

Are you sure an RCU grace period is not needed before freeing?


 +
 +static int page_ctor_detach(struct mp_struct *mp)
 +{
 + struct page_ctor *ctor;
 + struct page_info *info;
 + int i;
 +
 + rcu_read_lock();
 + ctor = rcu_dereference(mp-ctor);
 + rcu_read_unlock();

Strange locking again here


 +
 + if (!ctor)
 + return -ENODEV;
 +
 + while ((info = info_dequeue(ctor))) {
 + for (i = 0; i  info-pnum; i++)
 + if (info-pages[i])
 + put_page(info-pages[i]);
 + kmem_cache_free(ctor-cache, info);
 + }
 + kmem_cache_destroy(ctor-cache);
 + netdev_page_ctor_detach(ctor-dev);
 + dev_put(ctor-dev);
 +
 + /* locked by mp_mutex */
 + rcu_assign_pointer(mp-ctor, NULL);
 + synchronize_rcu();
 +
 + put_page_ctor(ctor);
 +
 + return 0;
 +}
 +
 +/* For small user space buffers transmit, we don't need to call
 + * get_user_pages().
 + */
 +static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
 + int total)
 +{
 + struct page_info *info = kmem_cache_alloc(ctor-cache, GFP_KERNEL);
kmem_cache_zalloc() ?
 +
 + if (!info)
 + return NULL;
 + memset(info, 0, sizeof(struct page_info));
 + memset(info-pages, 0, sizeof(info-pages));

redundant memset(); the whole structure was already cleared one line above

 +
 + info-header = 0;
already cleared
 + info-total = total;
 + info-skb = NULL;
already cleared 
 
 + info-user.dtor = page_dtor;
 + info-ctor = ctor;
 + info-flags = INFO_WRITE;
 + info-pnum = 0;
already cleared 
 
 + return info;
 +}
 +
 +/* The main function to transform the guest user space address
 + * to host kernel address via get_user_pages(). Thus the hardware
 + * can do DMA directly to the user space address.
 + */
 +static struct page_info *alloc_page_info(struct page_ctor *ctor,
 + struct iovec *iov, int count, struct frag *frags,
 + int npages, int total)
 +{
 + int rc;
 + int i, j, n = 0;
 + int len;
 + unsigned long base;
 + struct page_info *info
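
A condensed alloc_small_page_info() along the lines Eric suggests above would look roughly like this (a hypothetical rewrite, not the posted code): kmem_cache_zalloc() already zeroes the object, so only the non-zero fields need to be set.

static struct page_info *sketch_alloc_small_page_info(struct page_ctor *ctor,
                                                      int total)
{
        struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);

        if (!info)
                return NULL;
        info->total = total;            /* bytes in the user buffers */
        info->user.dtor = page_dtor;
        info->ctor = ctor;
        info->flags = INFO_WRITE;
        return info;
}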

RE: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.

2010-02-10 Thread Xin, Xiaohui
On Wednesday 10 February 2010, Xin Xiaohui wrote:
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.
 
 We provide multiple submits and asynchronous notifiicaton to 
 vhost-net too.

This does a lot of things that I had planned for macvtap. It's
great to hear that you have made this much progress.

However, I'd hope that we could combine this with the macvtap driver,
which would give us zero-copy transfer capability both with and
without vhost, as well as (tx at least) when using multiple guests
on a macvlan setup.

You mean the zero-copy can work with the macvtap driver without vhost?
Could you give me some detailed info about your macvtap driver and the
relationship between vhost and macvtap, so that I can have a clear picture?

For transmit, it should be fairly straightforward to hook up
your zero-copy method and the vhost-net interface into the
macvtap driver.

You have simplified the receive path significantly by assuming
that the entire netdev can receive into a single guest, right?

Yes.

I'm assuming that the idea is to allow VMDq adapters to simply
show up as separate adapters and have the driver handle this
in a hardware specific way.

Does the VMDq driver do so now?

My plan for this was to instead move support for VMDq into the
macvlan driver so we can transparently use VMDq on hardware where
available, including zero-copy receives, but fall back to software
operation on non-VMDq hardware.

Arnd


RE: [ANNOUNCE] KVM developer conference call

2010-01-11 Thread Xin, Xiaohui
Hi, Chris
We are interested in joining the conference, since we are now working on the 
zero copy patch based on vhost-net.

Thanks
Xiaohui

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Chris Wright
Sent: Tuesday, January 12, 2010 10:16 AM
To: kvm@vger.kernel.org
Cc: qemu-de...@nongnu.org
Subject: [ANNOUNCE] KVM developer conference call

The KVM project is announcing a regular conference call focused on KVM
development issues.  The call will take place weekly on Tuesdays at
15:00 UTC for 1 hour.  The purpose of the call is to discuss relevant
development issues in a high bandwidth forum, certainly _not_ meant
to supplant discussion on list.  I will send out a request for agenda
items each week the day before the call, and post minutes after the
call concludes.

If you are interested in joining please send me an email.

thanks,
-chris


RE: vhost-net patches

2009-11-02 Thread Xin, Xiaohui
Hi, Michael,
What is the deferred skb allocation patch mentioned here? Could you elaborate
on it in a little more detail?

Thanks
Xiaohui

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Michael S. Tsirkin
Sent: Tuesday, November 03, 2009 3:47 AM
To: Shirley Ma
Cc: Sridhar Samudrala; Shirley Ma; David Stevens; kvm@vger.kernel.org; 
s...@linux.vnet.ibm.com; mashi...@linux.vnet.ibm.com
Subject: Re: vhost-net patches

On Thu, Oct 29, 2009 at 12:11:33AM -0700, Shirley Ma wrote:
 Hello Michael,
 
 I am able to get 63xxMb/s throughput with 10% less cpu utilization when
 I apply deferring skb patch on top of your most recent vhost patch. The
 userspace TCP_STREAM BW used to be 3xxxMb/s from upper stream git tree. 
 
 After applying your recent vhost patch, it goes up to 53xxMb/s.

Thanks for the report!  I have pushed this patch (after cleaning it up
and fixing raw sockets), and updated the numbers in the wiki here
http://www.linux-kvm.org/page/VhostNet

-- 
MST


RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-13 Thread Xin, Xiaohui
The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git

I expect them to be merged by 2.6.32-rc1 - right, Avi?

Michael,

I think I have the kernel patches for kvm_irqfd and kvm_ioeventfd, but I am
missing the qemu-side patches for irqfd and ioeventfd.

I hit the following compile errors when compiling the virtio-pci.c file in qemu-kvm:

/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` 
undeclared (first use in this function)
/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` 
undeclared (first use in this function)

Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd?

Thanks
Xiaohui

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com] 
Sent: Sunday, September 13, 2009 1:46 PM
To: Xin, Xiaohui
Cc: Ira W. Snyder; net...@vger.kernel.org; 
virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; 
a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty 
Russell; s.he...@linux-ag.com; a...@redhat.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote:
 Michael,
 We are very interested in your patch and want to have a try with it.
 I have collected your 3 kernel-side patches and 4 qemu-side patches.
 The patches are listed here:
 
 PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
 PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
 PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch
 
 PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
 PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
 PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
 PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch
 
 I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest 
 kvm qemu.
 But it seems some additional patches are needed, at least the irqfd and ioeventfd 
 patches for the
 current qemu. I cannot create a kvm guest with -net 
 nic,model=virtio,vhost=vethX.
 
 Could you kindly advise us of the exact patch list needed to make it work?
 Thanks a lot. :-)
 
 Thanks
 Xiaohui


The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git

I expect them to be merged by 2.6.32-rc1 - right, Avi?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-11 Thread Xin, Xiaohui
Michael,
We are very interested in your patch and want to have a try with it.
I have collected your 3 kernel-side patches and 4 qemu-side patches.
The patches are listed here:

PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch

PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch

I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm 
qemu.
But it seems some additional patches are needed, at least the irqfd and ioeventfd 
patches for the
current qemu. I cannot create a kvm guest with -net 
nic,model=virtio,vhost=vethX.

Could you kindly advise us of the exact patch list needed to make it work?
Thanks a lot. :-)

Thanks
Xiaohui
-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Michael S. Tsirkin
Sent: Wednesday, September 09, 2009 4:14 AM
To: Ira W. Snyder
Cc: net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; 
kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; 
linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; 
gregory.hask...@gmail.com; Rusty Russell; s.he...@linux-ag.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote:
 On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
  On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
   On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for 
migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
also configured by userspace, including vlan/mac etc.

Status:
This works for me, and I haven't seen any crashes.
I have done some light benchmarking (with v4); compared to userspace, I
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
the ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

   
   Hello Michael,
   
   I've started looking at vhost with the intention of using it over PCI to
   connect physical machines together.
   
   The part that I am struggling with the most is figuring out which parts
   of the rings are in the host's memory, and which parts are in the
   guest's memory.
  
  All rings are in guest's memory, to match existing virtio code.
 
 Ok, this makes sense.
 
  vhost
  assumes that the memory space of the hypervisor userspace process covers
  the whole of guest memory.
 
 Is this necessary? Why?

Because with virtio the ring can give us arbitrary guest addresses.  If the
guest were limited to using a subset of addresses, the hypervisor would only
have to map these.
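
[Illustrative sketch, not part of the original mail: the guest-physical to user-space
translation table vhost uses for this. The struct layout follows include/linux/vhost.h,
but the lookup helper below is a simplified assumption, not the actual vhost code.]

/* One entry per guest memory region registered by the hypervisor process. */
struct vhost_memory_region {
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 flags_padding;    /* currently unused */
};

struct vhost_memory {
        __u32 nregions;
        __u32 padding;
        struct vhost_memory_region regions[0];
};

/* Simplified: translate a guest-physical address to a user-space address by
 * scanning the regions; real code must also check that the whole access fits
 * inside the region. */
static void __user *gpa_to_hva(struct vhost_memory *mem, __u64 gpa)
{
        unsigned i;

        for (i = 0; i < mem->nregions; i++) {
                struct vhost_memory_region *r = &mem->regions[i];

                if (gpa >= r->guest_phys_addr &&
                    gpa < r->guest_phys_addr + r->memory_size)
                        return (void __user *)(unsigned long)
                                (r->userspace_addr + (gpa - r->guest_phys_addr));
        }
        return NULL;
}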

 The assumption seems very wrong when you're
 doing data transport between two physical systems via PCI.
 I know vhost has not been designed for this specific situation, but it
 is good to be looking toward other possible uses.
 
  And there's a translation table.
  Ring addresses are userspace addresses, they do not undergo translation.
  
   If I understand everything correctly, the rings are all userspace
   addresses, which means that they can be moved around in physical memory,
   and get pushed out to swap.
  
  Unless they are locked, yes.
  
   AFAIK, this is impossible to handle when
   connecting two physical systems, you'd need the rings available in IO
   memory (PCI memory), so you can ioreadXX() them instead. 

RE: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

2009-09-02 Thread Xin, Xiaohui
* Code is easier to review than bullet points.

Yes. We'd send the code soon.

* Direct I/O has to be safe when a page is shared by multiple threads,
 and has to be non-blocking since network I/O can take indeterminately
 long (think big queues, tunneling, ...)

In this situation, one queue-pair NIC is assigned to only one guest; the pages 
are locked and the KVM guest will not be swapped out.


* In the past, attempts at Direct I/O on the network have always had SMP
 TLB issues. The page has to be flipped or marked as COW on all CPUs,
 and the cost of the Inter Processor Interrupt to steal the page has
 been slower than copying

It may be; we have not thought about this in depth yet. Thanks.

Thanks
Xiaohui

-Original Message-
From: Stephen Hemminger [mailto:shemmin...@vyatta.com] 
Sent: Wednesday, September 02, 2009 12:05 AM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; 
virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; 
a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

On Tue, 1 Sep 2009 14:58:19 +0800
Xin, Xiaohui xiaohui@intel.com wrote:

   [RFC] Virtual Machine Device Queues (VMDq) support on KVM
 
 A network adapter with VMDq technology presents multiple pairs of tx/rx queues,
 and provides a network L2 sorting mechanism based on MAC addresses and VLAN tags
 for each tx/rx queue pair. Here we present a generic framework in which 
 network
 traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
 any software copy.
 
 Actually, this framework can also apply to traditional network adapters which have
 just one tx/rx queue pair. And applications using the same user/kernel 
 interface
 can utilize this framework to send/receive network traffic directly through a 
 tx/rx
 queue pair in a network adapter.
 
 We use virtio-net architecture to illustrate the framework.
 
 
 [Architecture diagram; the same figure appears in full in the original RFC posting below.]
 
 The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
 I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to
 the kernel; we can still see them in the SCSI tape driver.
 
 With traditional file operations, a copy of the payload contents from/to the
 kernel DMA address to/from a user buffer is needed. That's the copy we
 want to save.
 
 The proposed framework is like this:
 A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair on
 the host side. The KVM virtio-net backend service, a user-space program, submits
 asynchronous read/write I/O requests to the host kernel through the TUN/TAP
 device.
 The requests correspond to the vqueue elements and include both transmission
 and receive. They can be queued in one AIO request and later, the completion
 will be notified through the underlying packet tx/rx processing of the rx/tx queue
 pair.
 
 Detailed path:
 
 For the guest Virtio-net driver, packet receive corresponds to the asynchronous read
 I/O requests of the backend service.
 
 1) Guest Virtio-net driver provides header

[RFC] Virtual Machine Device Queues(VMDq) support on KVM

2009-09-01 Thread Xin, Xiaohui
  [RFC] Virtual Machine Device Queues (VMDq) support on KVM

A network adapter with VMDq technology presents multiple pairs of tx/rx queues,
and provides a network L2 sorting mechanism based on MAC addresses and VLAN tags
for each tx/rx queue pair. Here we present a generic framework in which network
traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
any software copy.

Actually, this framework can also apply to traditional network adapters which have
just one tx/rx queue pair. And applications using the same user/kernel interface
can utilize this framework to send/receive network traffic directly through a tx/rx
queue pair in a network adapter.

We use virtio-net architecture to illustrate the framework.


 +---------------------+                          +----------------+
 |                     |    pop        add_buf    |                |
 |    Qemu process     |  --------- TX ---------  |  Guest Kernel  |
 |                     |                          |                |
 |     Virtio-net      |   push        get_buf    |   Virtio-net   |
 |  (Backend service)  |  --------- RX ---------  |     driver     |
 +---------------------+                          +----------------+
            |
            |  AIO (read & write) combined with Direct I/O
            |  (which substitute synced file operations)
 +----------------------------------------------------------------+
 |  Host kernel  (a likely queue pair instance)                    |
 |                                                                 |
 |   read:  copy-less with directly mapped user space to kernel,   |
 |          payload directly DMAed into user space                 |
 |   write: copy-less with directly mapped user space to kernel,   |
 |          payload directly hooked to a skb                       |
 |                                                                 |
 |   NIC driver  <------------------------------>  TUN/TAP driver  |
 +----------------------------------------------------------------+
            |
            |
        traditional adapter or a tx/rx queue pair

The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to
the kernel; we can still see them in the SCSI tape driver.

With traditional file operations, a copy of the payload contents from/to the
kernel DMA address to/from a user buffer is needed. That's the copy we
want to save.

The proposed framework is like this:
A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair on
the host side. The KVM virtio-net backend service, a user-space program, submits
asynchronous read/write I/O requests to the host kernel through the TUN/TAP device.
The requests correspond to the vqueue elements and include both transmission
and receive. They can be queued in one AIO request and later, the completion will
be notified through the underlying packet tx/rx processing of the rx/tx queue
pair.

Detailed path:

For the guest Virtio-net driver, packet receive corresponds to the asynchronous read
I/O requests of the backend service.

1) The guest Virtio-net driver provides header and payload addresses through the
receive vqueue to the Virtio-net backend service.

2) Virtio-net backend service encapsulates multiple vqueue elements into
multiple AIO control blocks and composes them into one AIO read request.

3) Virtio-net backend service uses io_submit() syscall to pass the request to
the TUN/TAP device.

4) Virtio-net backend service uses io_getevents() syscall to check the
completion of the request.
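
[Illustrative sketch, not part of the original mail: how a user-space backend might
batch several guest buffers into one io_submit() against the TUN/TAP fd and reap
completions with io_getevents(), as in steps 2)-4). Uses libaio; the function names,
buffer arrays and the batch limit are placeholders.]

#include <libaio.h>
#include <stddef.h>

/* Post one iocb per vqueue element; the kernel copies each iocb at submit
 * time, so only the data buffers must stay alive until completion.
 * The io_context_t is assumed to have been created earlier with io_setup(). */
static int post_rx_buffers(io_context_t ctx, int tap_fd,
                           void **bufs, size_t *lens, int nbufs)
{
        struct iocb iocbs[64], *iocbps[64];
        int i;

        if (nbufs > 64)
                nbufs = 64;
        for (i = 0; i < nbufs; i++) {
                io_prep_pread(&iocbs[i], tap_fd, bufs[i], lens[i], 0);
                iocbps[i] = &iocbs[i];
        }
        return io_submit(ctx, nbufs, iocbps);   /* number submitted, or < 0 */
}

/* Reap completed receives; each io_event carries the number of bytes read. */
static int reap_rx(io_context_t ctx, struct io_event *events, int max)
{
        return io_getevents(ctx, 1, max, events, NULL);
}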

5) The TUN/TAP driver receives packets from the queue pair of the NIC, and prepares
for Direct I/O.
   A modified NIC driver may render an skb whose header is allocated in the host
kernel, but whose payload buffer is directly mapped from the user-space buffers
provided through the AIO request by the backend service. get_user_pages()
may do this. For one AIO read request, the TUN/TAP driver maintains a list of
the directly mapped buffers, and the NIC driver tries to use those buffers as
payload buffers to compose the new skbs. Of course, if getting the buffers
fails, kernel-allocated buffers are used.
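
[Illustrative sketch, not part of the original mail: one way step 5) could pin the
user-space payload buffer and hook its pages into an skb as fragments.
get_user_pages(), skb_fill_page_desc() and the mmap_sem locking follow the kernel
API of that era (~2.6.31); the function name and error handling are placeholder
assumptions, and the pages must later be released with put_page().]

static int attach_user_payload(struct sk_buff *skb,
                               unsigned long uaddr, size_t len)
{
        int off = offset_in_page(uaddr);
        int npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        struct page *pages[MAX_SKB_FRAGS];
        size_t total = len, chunk;
        int i, got;

        if (npages > MAX_SKB_FRAGS)
                return -EINVAL;

        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr & PAGE_MASK, npages,
                             1 /* write */, 0 /* force */, pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (got != npages) {
                for (i = 0; i < got; i++)
                        put_page(pages[i]);
                return -EFAULT;
        }

        /* Hook each pinned page into the skb as a paged fragment. */
        for (i = 0; i < npages; i++) {
                chunk = min_t(size_t, len, PAGE_SIZE - off);
                skb_fill_page_desc(skb, i, pages[i], off, chunk);
                len -= chunk;
                off = 0;
        }
        skb->len      += total;
        skb->data_len += total;
        skb->truesize += total;
        return 0;
}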

6) Modern NIC cards now mostly have a header-split feature. The NIC queue
pair may then directly DMA the payload into the user-space mapped payload
buffers.
Thus zero-copy for the payload is achieved on packet receive.

7) The TUN/TAP driver manually copies the host-allocated header into the user-mapped space.

8) 

RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-01 Thread Xin, Xiaohui
I don't think we should do that with the tun/tap driver. By design, tun/tap is 
a way to interact with the
networking stack as if coming from a device. The only way this connects to an 
external adapter is through
a bridge or through IP routing, which means that it does not correspond to a 
specific NIC.
I have worked on a driver I called 'macvtap' in lack of a better name, to add 
a new tap frontend to
the 'macvlan' driver. Since macvlan lets you add slaves to a single NIC 
device, this gives you a direct
connection between one or multiple tap devices to an external NIC, which works 
a lot better than when
you have a bridge inbetween. There is also work underway to add a bridging 
capability to macvlan, so
you can communicate directly between guests like you can do with a bridge.
Michael's vhost_net can plug into the same macvlan infrastructure, so the work 
is complementary.

We used the TUN/TAP device to implement the prototype, and agree that it's not the 
only
choice here. We'd like to compare the two if possible.
And what we care more about are the kernel modifications, such as the changes to the 
net_dev and
skb structures. Thanks.

Thanks
Xiaohui

-Original Message-
From: Arnd Bergmann [mailto:a...@arndb.de] 
Sent: Monday, August 31, 2009 11:24 PM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; 
virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; 
a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Monday 31 August 2009, Xin, Xiaohui wrote:
 
 Hi, Michael
 That's a great job. We are now working on support VMDq on KVM, and since the 
 VMDq hardware presents L2 sorting
 based on MAC addresses and VLAN tags, our target is to implement a zero copy 
 solution using VMDq.

I'm also interested in helping there, please include me in the discussions.

 We stared
 from the virtio-net architecture. What we want to proposal is to use AIO 
 combined with direct I/O:
 1) Modify virtio-net Backend service in Qemu to submit aio requests composed 
 from virtqueue.

right, that sounds useful.

 2) Modify TUN/TAP device to support aio operations and the user space buffer 
 directly mapping into the host kernel.
 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.

I don't think we should do that with the tun/tap driver. By design, tun/tap is 
a way to interact with the
networking stack as if coming from a device. The only way this connects to an 
external adapter is through
a bridge or through IP routing, which means that it does not correspond to a 
specific NIC.

I have worked on a driver I called 'macvtap' in lack of a better name, to add a 
new tap frontend to
the 'macvlan' driver. Since macvlan lets you add slaves to a single NIC device, 
this gives you a direct
connection between one or multiple tap devices to an external NIC, which works 
a lot better than when
you have a bridge inbetween. There is also work underway to add a bridging 
capability to macvlan, so
you can communicate directly between guests like you can do with a bridge.

Michael's vhost_net can plug into the same macvlan infrastructure, so the work 
is complementary.

 4) Modify the net_dev and skb structures to permit an allocated skb to use a user 
 space directly mapped payload
 buffer address rather than a kernel-allocated one.

yes.

 As zero copy is also your goal, we are interested in what's in your mind, and 
 would like to collaborate with you if possible.
 BTW, we will send our VMDq write-up very soon.

Ok, cool.

Arnd 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-01 Thread Xin, Xiaohui

It may be possible to make vmdq appear like an sr-iov capable device 
from userspace.  sr-iov provides the userspace interfaces to allocate 
interfaces and assign mac addresses.  To make it useful, you would have 
to handle tx multiplexing in the driver but that would be much easier to 
consume for kvm

What we have in mind is to support multiple net_dev structures 
corresponding to the multiple queue pairs of one vmdq adapter, and to present
multiple mac addresses to user space, so that each mac can be used 
by a guest. 
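
[Illustrative sketch, not part of the original mail: the multiple per-queue net_device
idea above, roughly. alloc_etherdev(), netdev_priv() and register_netdev() are standard
kernel API; struct vmdq_adapter and vmdq_queue_netdev_ops are hypothetical names.]

/* Expose each VMDq tx/rx queue pair as its own net_device with its own MAC,
 * so that one guest can be bound to exactly one queue pair. */
struct vmdq_queue_priv {
        struct vmdq_adapter *adapter;     /* hypothetical parent adapter */
        int                  queue_index; /* hw tx/rx queue pair index */
};

static int vmdq_register_queue(struct vmdq_adapter *adapter, int qidx,
                               const u8 *mac)
{
        struct net_device *dev;
        struct vmdq_queue_priv *priv;

        dev = alloc_etherdev(sizeof(*priv));
        if (!dev)
                return -ENOMEM;

        priv = netdev_priv(dev);
        priv->adapter = adapter;
        priv->queue_index = qidx;

        memcpy(dev->dev_addr, mac, ETH_ALEN);      /* per-queue MAC used for L2 sorting */
        dev->netdev_ops = &vmdq_queue_netdev_ops;  /* hypothetical ops table */

        return register_netdev(dev);
}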
What exactly does the tx multiplexing in the driver mean?

Thanks
Xiaohui

-Original Message-
From: Anthony Liguori [mailto:anth...@codemonkey.ws] 
Sent: Tuesday, September 01, 2009 5:57 AM
To: Avi Kivity
Cc: Xin, Xiaohui; m...@redhat.com; net...@vger.kernel.org; 
virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; 
a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

Avi Kivity wrote:
 On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:
 Hi, Michael
 That's a great job. We are now working on supporting VMDq on KVM, and 
 since the VMDq hardware presents L2 sorting based on MAC addresses 
 and VLAN tags, our target is to implement a zero-copy solution using 
 VMDq. We started from the virtio-net architecture. What we want to 
 propose is to use AIO combined with direct I/O:
 1) Modify the virtio-net Backend service in Qemu to submit aio requests 
 composed from the virtqueue.
 2) Modify the TUN/TAP device to support aio operations and the user space 
 buffer directly mapped into the host kernel.
 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
 4) Modify the net_dev and skb structures to permit an allocated skb to 
 use a user-space directly mapped payload buffer address rather than 
 a kernel-allocated one.

 As zero copy is also your goal, we are interested in what's in your 
 mind, and would like to collaborate with you if possible.


 One way to share the effort is to make vmdq queues available as normal 
 kernel interfaces.

It may be possible to make vmdq appear like an sr-iov capable device 
from userspace.  sr-iov provides the userspace interfaces to allocate 
interfaces and assign mac addresses.  To make it useful, you would have 
to handle tx multiplexing in the driver but that would be much easier to 
consume for kvm.

Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-08-31 Thread Xin, Xiaohui
Hi, Michael
That's a great job. We are now working on support VMDq on KVM, and since the 
VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags, our 
target is to implement a zero copy solution using VMDq. We stared from the 
virtio-net architecture. What we want to proposal is to use AIO combined with 
direct I/O:
1) Modify virtio-net Backend service in Qemu to submit aio requests composed 
from virtqueue.
2) Modify TUN/TAP device to support aio operations and the user space buffer 
directly mapping into the host kernel.
3) Let a TUN/TAP device binds to single rx/tx queue from the NIC.
4) Modify the net_dev and skb structure to permit allocated skb to use user 
space directly mapped payload buffer address rather then kernel allocated.

As zero-copy is also your goal, we are interested in what's on your mind, and 
would like to collaborate with you if possible.
BTW, we will send our VMDq write-up very soon.

Thanks
Xiaohui

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Michael S. Tsirkin
Sent: Wednesday, August 19, 2009 11:03 PM
To: net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; 
kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; 
linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; 
gregory.hask...@gmail.com
Subject: [PATCHv4 2/2] vhost_net: a kernel-level virtio server

What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
also configured by userspace, including vlan/mac etc.
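
[Illustrative sketch, not part of the original mail: how user space might open the
raw-socket backend bound to one device (e.g. an SR-IOV VF or a macvlan interface)
before attaching it to vhost; the interface name is a placeholder.]

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

/* Create an AF_PACKET socket and bind it to a single interface. */
static int open_raw_backend(const char *ifname)
{
        struct sockaddr_ll sll;
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0)
                return -1;

        memset(&sll, 0, sizeof(sll));
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex  = if_nametoindex(ifname);    /* e.g. "macvlan0" */

        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
                return -1;

        return fd;    /* this fd is then handed to vhost as the backend */
}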

Status:
This works for me, and I haven't seen any crashes.
I have not run any benchmarks yet; compared to userspace, I expect to
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Acked-by: Arnd Bergmann a...@arndb.de
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS|   10 +
 arch/x86/kvm/Kconfig   |1 +
 drivers/Makefile   |1 +
 drivers/vhost/Kconfig  |   11 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/net.c|  429 
 drivers/vhost/vhost.c  |  664 
 drivers/vhost/vhost.h  |  108 +++
 include/linux/Kbuild   |1 +
 include/linux/miscdevice.h |1 +
 include/linux/vhost.h  |  100 +++
 11 files changed, 1328 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@ S:   Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/

+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD

 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig

diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID)   += hid/
 

RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-08-31 Thread Xin, Xiaohui
 One way to share the effort is to make vmdq queues available as normal 
kernel interfaces.  It would take quite a bit of work, but the end 
result is that no other components need to be changed, and it makes vmdq 
useful outside kvm.  It also greatly reduces the amount of integration 
work needed throughout the stack (kvm/qemu/libvirt).

Yes. The common queue-pair interface which we want to present will also apply 
to normal hardware, and we will try to leave the other components untouched.

Thanks
Xiaohui

-Original Message-
From: Avi Kivity [mailto:a...@redhat.com] 
Sent: Tuesday, September 01, 2009 1:52 AM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; 
virtualizat...@lists.linux-foundation.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; 
a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:
 Hi, Michael
 That's a great job. We are now working on supporting VMDq on KVM, and since the 
 VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags, our 
 target is to implement a zero-copy solution using VMDq. We started from the 
 virtio-net architecture. What we want to propose is to use AIO combined with 
 direct I/O:
 1) Modify the virtio-net Backend service in Qemu to submit aio requests composed 
 from the virtqueue.
 2) Modify the TUN/TAP device to support aio operations and the user space buffer 
 directly mapped into the host kernel.
 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
 4) Modify the net_dev and skb structures to permit an allocated skb to use a user 
 space directly mapped payload buffer address rather than a kernel-allocated one.

 As zero copy is also your goal, we are interested in what's in your mind, and 
 would like to collaborate with you if possible.


One way to share the effort is to make vmdq queues available as normal 
kernel interfaces.  It would take quite a bit of work, but the end 
result is that no other components need to be changed, and it makes vmdq 
useful outside kvm.  It also greatly reduces the amount of integration 
work needed throughout the stack (kvm/qemu/libvirt).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html