Re: KVM Page Fault Question

2010-04-02 Thread Avi Kivity

On 04/02/2010 07:41 AM, Marek Olszewski wrote:
When a guest OS writes to a shadowed (and therefore page protected) 
guest page table, does the resulting page fault get handled in 
paging_tmpl.h:xxx_page_fault or does it call some rmap related code 
directly? 


Page faults are dispatched to the mmu's page_fault callback.
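
For orientation, a hedged and simplified sketch of that dispatch path as it looks in the 2.6.3x sources (illustrative only, not verbatim kernel code):

/* Simplified sketch: a write fault on a write-protected guest page
 * table goes through the generic fault path into the per-vcpu mmu
 * callback, which paging_tmpl.h instantiates as FNAME(page_fault). */
#include <linux/kvm_host.h>

static int dispatch_guest_fault(struct kvm_vcpu *vcpu, gva_t cr2,
				u32 error_code)
{
	/* for shadow paging this lands in paging_tmpl.h:FNAME(page_fault),
	 * which may emulate the write and unsync/zap the shadow page */
	return vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
}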


Also, what does the direct mmu page role mean?



It means that the page maps the linear range (gfn << 12)..((gfn + (1 << 
(level*9))) << 12) instead of shadowing a guest page table at gfn.  
Useful for real mode, large pages, and tdp.
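
To make the range concrete, a small stand-alone illustration of that formula (gfn and level are arbitrary example values, not taken from a real guest):

/* Illustration of the range formula above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t gfn = 0x12340;	/* example guest frame number */
	int level = 1;		/* a last-level table: 512 * 4K = 2MB */

	uint64_t start = gfn << 12;
	uint64_t end = (gfn + (1ULL << (level * 9))) << 12;

	printf("direct page covers [%#llx, %#llx)\n",
	       (unsigned long long)start, (unsigned long long)end);
	return 0;
}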


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-02 Thread Avi Kivity

On 04/02/2010 12:27 AM, Tom Lyon wrote:



kvm really wants the event counter to be an eventfd, that allows hooking
it directly to kvm (which can inject an interrupt on an eventfd_signal),
can you adapt your patch to do this?
 

I looked further into eventfds - they seem the perfect solution for the
MSI/MSI-X interrupts. Will include in V2.
   


They are indeed.  Thanks.
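
For readers unfamiliar with the mechanism being discussed, a minimal user-space sketch of eventfd semantics: one side bumps the counter (as a device's interrupt path would via eventfd_signal()), the other side drains it. Illustrative only, not the uio/kvm wiring itself.

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	if (efd < 0) {
		perror("eventfd");
		return 1;
	}

	uint64_t one = 1;
	write(efd, &one, sizeof(one));		/* "interrupt": counter += 1 */

	uint64_t count;
	read(efd, &count, sizeof(count));	/* consumer drains the counter */
	printf("events seen: %llu\n", (unsigned long long)count);

	close(efd);
	return 0;
}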

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



[RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-02 Thread xiaohui . xin
The idea is simple: pin the guest VM's user-space buffers and then give
the host NIC driver the chance to DMA directly into them.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net so it can
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

The scenario is like this:

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user-space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; that is, the NIC can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device, as sketched below.
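
A rough illustration of that hook (simplified; notify_zero_copy_device() is a hypothetical helper, and the real patch's naming and return handling may differ):

/* Hedged sketch of the netif_receive_skb() interception described above.
 * dev->mp_port is the binding added by this series (patch 3/3); the
 * notify helper below is hypothetical. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void notify_zero_copy_device(struct mpassthru_port *port,
				    struct sk_buff *skb);	/* hypothetical */

static int mp_intercept_rx(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;

	if (!dev->mp_port)		/* no zero-copy device bound */
		return 0;		/* continue down the normal stack */

	/* the skb's fragments are guest user pages filled by the NIC;
	 * hand it to the zero-copy device to complete the pending read */
	notify_zero_copy_device(dev->mp_port, skb);
	return 1;			/* consumed */
}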

For write, the zero-copy device allocates a new host skb, puts the payload
on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We have considered two ways to utilize the page constructor
API to dispense the user buffers.

One:    Modify the __alloc_skb() function a bit so it allocates only the
        sk_buff structure, with the data pointer pointing to a user buffer
        that comes from a page constructor API.
        The shinfo of the skb then also comes from the guest.
        When a packet is received from hardware, skb->data is filled
        directly by h/w. This is the approach we have implemented.

Pros:   We can avoid any copy here.
Cons:   The guest virtio-net driver needs to allocate the skb in almost
        the same way as the host NIC drivers do, i.e. with the size used
        by netdev_alloc_skb() and the same reserved space at the head of
        the skb. Many NIC drivers match the guest and are fine with this,
        but some of the latest NIC drivers reserve special room in the
        skb head. To deal with that, we suggest providing a method in the
        guest virtio-net driver to ask the NIC driver for the parameters
        we are interested in, once we know which device we have bound for
        zero-copy; the guest then allocates accordingly.
        Is that reasonable?

Two:    Modify the NIC driver to get user buffers from a page constructor
        API (substituting for alloc_page()); the user buffers are used as
        payload buffers and are filled by h/w directly when a packet is
        received. The driver should associate the pages with the skb
        (skb_shinfo(skb)->frags). For the head buffer, let the host
        allocate the skb and let h/w fill it. After that, the data in the
        host skb header is copied into the guest header buffer, which is
        submitted together with the payload buffer (see the sketch after
        this list).

Pros:   We care less about how the guest or the host allocates its buffers.
Cons:   We still need a small copy here for the skb header.
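
A minimal sketch of what option Two's receive side could look like (names are illustrative and the pages are assumed to be already-pinned guest buffers; this is not the patch's actual code):

/* Hedged sketch of option Two: the host allocates a small skb for the
 * header, and guest-provided (already pinned) pages are attached as
 * payload fragments for the h/w to fill. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *mp_build_rx_skb(struct net_device *dev,
				       struct page **guest_pages,
				       int npages, int frag_size)
{
	struct sk_buff *skb;
	int i;

	skb = netdev_alloc_skb(dev, 128);	/* room for the copied header */
	if (!skb)
		return NULL;

	for (i = 0; i < npages; i++)
		/* payload lands in guest pages, filled directly by h/w */
		skb_add_rx_frag(skb, i, guest_pages[i], 0, frag_size);

	return skb;
}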

We are not sure which way is better here. This is the first thing we want
to get comments on from the community. We would like the modification to the
network part to be generic, not used by the vhost-net backend only: a user
application could use it as well once the zero-copy device provides async
read/write operations later.

Please give comments, especially on the network part modifications.


We provide multiple submits and asynchronous notification to 
vhost-net too.

Our goal is to improve bandwidth and reduce CPU usage.
Exact performance data will be provided later. In a simple
test with netperf, we found that bandwidth went up and CPU % went up too,
but the bandwidth increase was much larger than the CPU % increase.

What we have not done yet:
packet split support
GRO support
Performance tuning

what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
add mp_dev_change_flags() for mp device to change NIC state
a small fix for a missing dev_put on failure
use a dynamic minor instead of a static minor number
add a __KERNEL__ guard to mp_get_sock()

what we have done in v2:

remove most of the RCU usage, since the ctor pointer is only
changed by the BIND/UNBIND ioctls, and during that time the NIC is
stopped to get a clean state (all outstanding requests are finished),
so the ctor pointer cannot race into a wrong situation.

Replace struct vhost_notifier with struct kiocb.
Let the vhost-net backend alloc/free the kiocbs and transfer them
via sendmsg/recvmsg.

use get_user_pages_fast() and set_page_dirty_lock() 
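
For reference, the usual shape of that pin/dirty/release pattern (a generic sketch, not the code from this series; write=1 because the NIC DMAs into the pages):

#include <linux/mm.h>

/* generic sketch of pinning a guest buffer for device DMA */
static int pin_guest_buffer(unsigned long uaddr, int npages,
			    struct page **pages)
{
	return get_user_pages_fast(uaddr, npages, 1, pages);
}

static void release_guest_buffer(struct page **pages, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		set_page_dirty_lock(pages[i]);	/* data was written by h/w */
		put_page(pages[i]);
	}
}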

[RFC] [PATCH v2 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-02 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Add a device that uses the vhost-net backend driver for
copy-less data transfer between the guest FE and the host NIC.
It pins the guest user-space buffers into host memory and
provides proto_ops (sendmsg/recvmsg) to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---

 drivers/vhost/Kconfig |5 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1162 +
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config VHOST_PASSTHRU
+   tristate "Zerocopy network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   void*ctrl;
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a user space allocated skb or kernel
+*/
+   struct skb_user_pageuser;
+   struct skb_shared_info  ushinfo;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+   unsignedflags;
+   unsignedpnum;
+
+   /* It's meaningful for receive, means
+* the max length allowed
+*/
+   size_t  len;
+
+   /* The fields after that is for 

[RFC] [PATCH v2 2/3] Provides multiple submits and asynchronous notifications.

2010-04-02 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

 drivers/vhost/net.c   |  189 +++--
 drivers/vhost/vhost.h |   10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   int log, size;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   if (vq->receiver)
+   vq->receiver(vq);
+
+   vq_log = unlikely(vhost_has_feature(
+   &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   log = (int)iocb->ki_user_data;
+   size = iocb->ki_nbytes;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   if (unlikely(vq_log))
+   vhost_log_write(vq, vq_log, log, size);
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, 0);
+   tx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+
+   kmem_cache_free(net->cache, iocb);
+   if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+   struct kiocb *iocb = NULL;
unsigned head, out, in, s;
struct msghdr msg = {
.msg_name = NULL,
@@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq->hdr_size;
 
+   handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 ARRAY_SIZE(vq->iov),
@@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+   if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+   iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+   if (!iocb)
+   break;
+ 

[RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.

2010-04-02 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The patch lets the host NIC driver receive user-space skbs,
so the driver has the chance to DMA directly into guest user-space
buffers through a single ethX interface.
We want it to be generic enough to serve as a zero-copy framework.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---

We considered two ways to utilize the user buffers, but are not sure which
one is better. Please give any comments.

One:    Modify the __alloc_skb() function a bit so it allocates only the
        sk_buff structure, with the data pointer pointing to a user buffer
        that comes from a page constructor API.
        The shinfo of the skb then also comes from the guest.
        When a packet is received from hardware, skb->data is filled
        directly by h/w. This is the approach we have implemented.

Pros:   We can avoid any copy here.
Cons:   The guest virtio-net driver needs to allocate the skb in almost
        the same way as the host NIC drivers do, i.e. with the size used
        by netdev_alloc_skb() and the same reserved space at the head of
        the skb. Many NIC drivers match the guest and are fine with this,
        but some of the latest NIC drivers reserve special room in the
        skb head. To deal with that, we suggest providing a method in the
        guest virtio-net driver to ask the NIC driver for the parameters
        we are interested in, once we know which device we have bound for
        zero-copy; the guest then allocates accordingly.
        Is that reasonable?

Two:    Modify the NIC driver to get user buffers from a page constructor
        API (substituting for alloc_page()); the user buffers are used as
        payload buffers and are filled by h/w directly when a packet is
        received. The driver should associate the pages with the skb
        (skb_shinfo(skb)->frags). For the head buffer, let the host
        allocate the skb and let h/w fill it. After that, the data in the
        host skb header is copied into the guest header buffer, which is
        submitted together with the payload buffer.

Pros:   We care less about how the guest or the host allocates its buffers.
Cons:   We still need a small copy here for the skb header.

We are not sure which way is better here. This is the first thing we want
to get comments on from the community. We would like the modification to the
network part to be generic, not used by the vhost-net backend only: a user
application could use it as well once the zero-copy device provides async
read/write operations later.


Thanks
Xiaohui

 include/linux/netdevice.h |   69 -
 include/linux/skbuff.h|   30 --
 net/core/dev.c|   63 ++
 net/core/skbuff.c |   74 
 4 files changed, 224 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..ba48eb0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -485,6 +485,17 @@ struct netdev_queue {
unsigned long   tx_dropped;
 } cacheline_aligned_in_smp;
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct mpassthru_port  {
+   int hdr_len;
+   int data_len;
+   int npages;
+   unsignedflags;
+   struct socket   *sock;
+   struct skb_user_page*(*ctor)(struct mpassthru_port *,
+   struct sk_buff *, int);
+};
+#endif
 
 /*
  * This structure defines the management hooks for network devices.
@@ -636,6 +647,10 @@ struct net_device_ops {
int (*ndo_fcoe_ddp_done)(struct net_device *dev,
 u16 xid);
 #endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+   int (*ndo_mp_port_prep)(struct net_device *dev,
+   struct mpassthru_port *port);
+#endif
 };
 
 /*
@@ -891,7 +906,8 @@ struct net_device
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port*garp_port;
-
+   /* mpassthru */
+   struct mpassthru_port   *mp_port;
/* class/net/name entry */
struct device   dev;
/* space for optional statistics and wireless sysfs groups */
@@ -2013,6 +2029,55 @@ static inline u32 dev_ethtool_get_flags(struct 
net_device *dev)
return 0;
return dev->ethtool_ops->get_flags(dev);
 }
-#endif /* __KERNEL__ */
 
+/* To support zero-copy between user space application and NIC driver,
+ * we'd better ask NIC driver for the capability it can provide, especially
+ * for packet split mode, now we only ask for the header size, and the
+ * 

Re: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.

2010-04-02 Thread Stephen Hemminger
On Fri,  2 Apr 2010 15:30:10 +0800
xiaohui@intel.com wrote:

 From: Xin Xiaohui xiaohui@intel.com
 
 The patch let host NIC driver to receive user space skb,
 then the driver has chance to directly DMA to guest user
 space buffers thru single ethX interface.
 We want it to be more generic as a zero copy framework.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzha...@gmail.com
 Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org
 ---
 
 We consider 2 way to utilize the user buffres, but not sure which one
 is better. Please give any comments.
 
 One:Modify __alloc_skb() function a bit, it can only allocate a
 structure of sk_buff, and the data pointer is pointing to a
 user buffer which is coming from a page constructor API.
 Then the shinfo of the skb is also from guest.
 When packet is received from hardware, the skb-data is filled
 directly by h/w. What we have done is in this way.
 
 Pros:   We can avoid any copy here.
 Cons:   Guest virtio-net driver needs to allocate skb as almost
 the same method with the host NIC drivers, say the size
 of netdev_alloc_skb() and the same reserved space in the
 head of skb. Many NIC drivers are the same with guest and
 ok for this. But some lastest NIC drivers reserves special
 room in skb head. To deal with it, we suggest to provide
 a method in guest virtio-net driver to ask for parameter
 we interest from the NIC driver when we know which device
 we have bind to do zero-copy. Then we ask guest to do so.
 Is that reasonable?
 
 Two:Modify driver to get user buffer allocated from a page constructor
 API(to substitute alloc_page()), the user buffer are used as payload
 buffers and filled by h/w directly when packet is received. Driver
 should associate the pages with skb (skb_shinfo(skb)-frags). For
 the head buffer side, let host allocates skb, and h/w fills it.
 After that, the data filled in host skb header will be copied into
 guest header buffer which is submitted together with the payload 
 buffer.
 
 Pros:   We could less care the way how guest or host allocates their
 buffers.
 Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 
 Thanks
 Xiaohui

How do you deal with the DoS problem of a hostile user-space app posting a
huge number of receives and never getting anything?


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-02 Thread Greg KH
On Fri, Apr 02, 2010 at 09:43:35AM +0300, Avi Kivity wrote:
 On 04/01/2010 10:24 PM, Tom Lyon wrote:

 But there are multiple msi-x interrupts, how do you know which one
 triggered?
  
 You don't. This would suck for KVM, I guess, but we'd need major rework of 
 the
 generic UIO stuff to have a separate event channel for each MSI-X.


 Doesn't it suck for non-kvm in the same way?  Multiple vectors are there 
 for a reason.  For example, if you have a multiqueue NIC, you'd have to 
 process all queues instead of just the one that triggered.

 For my purposes, collapsing all the MSI-Xs into one MSI-look-alike is fine,
 because I'd be using MSI anyways if I could. The weird Intel 82599 VF only
 supports MSI-X.

 So one big question is - do we expand the whole UIO framework for KVM
 requirements, or do we split off either KVM or non-VM into a separate driver?
 Hans or Greg - care to opine?


 Currently kvm does device assignment with its own code, I'd like to unify 
 it with uio, not split it off.

 Separate notifications for msi-x interrupts are just as useful for uio as 
 they are for kvm.

I agree, there should not be a difference here for KVM vs. the normal
version.

thanks,

greg k-h


[PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-02 Thread Sridhar Samudrala
Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

I am seeing better aggregate throughput/latency when running netperf
across multiple guests or multiple interfaces in a guest in parallel
with this patch.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
return r;
}
 
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+   &n->dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+   &n->dev);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-   int r = vhost_init();
-   if (r)
-   goto err_init;
-   r = misc_register(&vhost_net_misc);
-   if (r)
-   goto err_reg;
-   return 0;
-err_reg:
-   vhost_cleanup();
-err_init:
-   return r;
-
+   return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
-   vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned 
mode, int sync,
if (!((unsigned long)key & poll->mask))
return 0;
 
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-unsigned long mask)
+unsigned long mask, struct vhost_dev *dev)
 {
INIT_WORK(&poll->work, func);
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
+   poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue *vqs, int nvqs)
 {
int i;
+
+   dev->wq = create_singlethread_workqueue("vhost");
+   if (!dev->wq)
+   return -ENOMEM;
+
dev->vqs = vqs;
dev->nvqs = nvqs;
mutex_init(&dev->mutex);
@@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
if (dev->vqs[i].handle_kick)
vhost_poll_init(&dev->vqs[i].poll,
dev->vqs[i].handle_kick,
-   POLLIN);
+   POLLIN, dev);
}
return 0;
 }
@@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
+
+   destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
   &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-   vhost_workqueue = create_singlethread_workqueue("vhost");
-   if (!vhost_workqueue)
-   return -ENOMEM;
-   return 0;
-}
-
-void vhost_cleanup(void)
-{
-   destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
/* struct which will handle all actual work. */
struct work_struct  work;
unsigned long mask;
+   struct vhost_dev *dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-unsigned long mask);
+unsigned long mask, struct vhost_dev *dev);
 

[GSoC 2010] Completing Nested VMX

2010-04-02 Thread Mohammed Gamal
Hello All,
I'm interested in adding nested VMX support to KVM in GSoC 2010 (among
other things). I see that Orit Wasserman has done some work in this
area, but it hasn't been merged yet. The last patches were a few months
ago and I have not seen any substantial progress on that front ever
since.

I wonder whether the previous work can be used as a starting ground
for any future effort? What is missing from it? What are the current
limitations of that implementation? And how can it be extended?

And within the scope of GSoC, what do you think the achievements of
such a project should be?

Regards,
Mohammed


Re: Setting nx bit in virtual CPU

2010-04-02 Thread Richard Simpson
Nope, both kernels are 64-bit.

uname -a Host: Linux gordon 2.6.27-gentoo-r8 #5 Sat Mar 14 18:01:59 GMT
2009 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

uname -a Guest: Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31
GMT 2010 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

As you can see, both kernels are a little old, and I have been wondering
if that might be part of the problem.  The Guest one is old because that
is the latest stable hardened version in Gentoo.  The host one is old
because of:

(gordon:~) rs10% uptime
 22:01:37 up 374 days, 23:29,  1 user,  load average: 1.09, 0.42, 0.18

Now that I have managed to smash the psychologically important 1 year
uptime for the first time ever (Woo!) I shall probably upgrade the host
kernel in the near future.  Of course, it is important to remember that
with the --no-kvm switch it works just fine (only slowly) with exactly
the same two kernels.

Thanks

On 01/04/10 09:43, Avi Kivity wrote:
 On 03/30/2010 01:16 AM, Richard Simpson wrote:
 Hello,

 Summary: How can I have a virtual CPU with the nx bit set whilst
 enjoying KVM acceleration?

 My Host - AMD Athlon(tm) 64 Processor 3200+ running Gentoo
 My VM - KVM running hardened Gentoo
 My KVM version - 0.12.3
 My Task - Implement restricted secure VM to handle services exposed to
 internet.
 My Command - kvm -hda /dev/mapper/vols-andrew -kernel ./bzImage -append
 root=/dev/hda2 -cpu host -runas xxx -net nic -net user -m 256 -k en-gb
 -vnc :1 -monitor stdio


 
 
 Are you running a 32-bit non-pae host kernel?  In that case, nx is
 disabled both for the guest and host.  Switch to a pae (or 64-bit)
 kernel and all should be well.
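
As a quick check, you can verify whether NX is actually visible inside the guest via CPUID leaf 0x80000001, EDX bit 20. A small user-space sketch, not specific to this thread's setup:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
		printf("extended CPUID leaf not available\n");
		return 1;
	}
	printf("NX %s\n", (edx & (1u << 20)) ? "reported" : "not reported");
	return 0;
}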
 



Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-02 Thread Sridhar Samudrala
On Fri, 2010-04-02 at 15:25 +0800, xiaohui@intel.com wrote:
 The idea is simple, just to pin the guest VM user space and then
 let host NIC driver has the chance to directly DMA to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

Thanks
Sridhar
 
 The scenario is like this:
 
 The guest virtio-net driver submits multiple requests thru vhost-net
 backend driver to the kernel. And the requests are queued and then
 completed after corresponding actions in h/w are done.
 
 For read, user space buffers are dispensed to NIC driver for rx when
 a page constructor API is invoked. Means NICs can allocate user buffers
 from a page constructor. We add a hook in netif_receive_skb() function
 to intercept the incoming packets, and notify the zero-copy device.
 
 For write, the zero-copy deivce may allocates a new host skb and puts
 payload on the skb_shinfo(skb)-frags, and copied the header to skb-data.
 The request remains pending until the skb is transmitted by h/w.
 
 Here, we have ever considered 2 ways to utilize the page constructor
 API to dispense the user buffers.
 
 One:  Modify __alloc_skb() function a bit, it can only allocate a 
   structure of sk_buff, and the data pointer is pointing to a 
   user buffer which is coming from a page constructor API.
   Then the shinfo of the skb is also from guest.
   When packet is received from hardware, the skb-data is filled
   directly by h/w. What we have done is in this way.
 
   Pros:   We can avoid any copy here.
   Cons:   Guest virtio-net driver needs to allocate skb as almost
   the same method with the host NIC drivers, say the size
   of netdev_alloc_skb() and the same reserved space in the
   head of skb. Many NIC drivers are the same with guest and
   ok for this. But some lastest NIC drivers reserves special
   room in skb head. To deal with it, we suggest to provide
   a method in guest virtio-net driver to ask for parameter
   we interest from the NIC driver when we know which device 
   we have bind to do zero-copy. Then we ask guest to do so.
   Is that reasonable?
 
 Two:  Modify driver to get user buffer allocated from a page constructor
   API(to substitute alloc_page()), the user buffer are used as payload
   buffers and filled by h/w directly when packet is received. Driver
   should associate the pages with skb (skb_shinfo(skb)-frags). For 
   the head buffer side, let host allocates skb, and h/w fills it. 
   After that, the data filled in host skb header will be copied into
   guest header buffer which is submitted together with the payload buffer.
 
   Pros:   We could less care the way how guest or host allocates their
   buffers.
   Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 Please give comments especially for the network part modifications.
 
 
 We provide multiple submits and asynchronous notifiicaton to 
 vhost-net too.
 
 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later. But for simple
 test with netperf, we found bindwidth up and CPU % up too,
 but the bindwidth up ratio is much more than CPU % up ratio.
 
 What we have not done yet:
   packet split support
   To support GRO
   Performance tuning
 
 what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchroush mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it looks generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
   a small fix for missing dev_put when fail
   using dynamic minor instead of static minor number
   a __KERNEL__ protect to mp_get_sock()
 
 what we have done in v2:
   
   remove most of the RCU usage, since the ctor pointer is only
   changed by BIND/UNBIND ioctl, and during that time, NIC will be
   stopped to get good cleanup(all outstanding requests are finished),
   so the