Re: [PATCH 1/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Joerg Roedel
Btw. This patch posting is broken. It suffers from line-wraps which make
it impossible to apply as-is. I was able to fix it but please consider
this in your next posting.

On Wed, Mar 31, 2010 at 05:12:35PM -0700, Tom Lyon wrote:

 --- linux-2.6.33/drivers/uio/uio_pci_generic.c2010-02-24 
 10:52:17.0 -0800
^
Unexpected line-wrap.

I also got some whitespace warnings when trying to apply it. Please make
sure you fix this in the next version too.

Thanks,

Joerg



Re: [Autotest] [PATCH] KVM test: Memory ballooning test for KVM guest

2010-04-09 Thread pradeep


Hi Lucas

Thanks for your comments.
Please find the patch, with suggested changes.

Thanks
Pradeep


Signed-off-by: Pradeep Kumar Surisetty psuri...@linux.vnet.ibm.com
---
diff -uprN autotest-old/client/tests/kvm/tests/balloon_check.py autotest/client/tests/kvm/tests/balloon_check.py
--- autotest-old/client/tests/kvm/tests/balloon_check.py	1969-12-31 19:00:00.0 -0500
+++ autotest/client/tests/kvm/tests/balloon_check.py	2010-04-09 12:33:34.0 -0400
@@ -0,0 +1,47 @@
+import re, string, logging, random, time
+from autotest_lib.client.common_lib import error
+import kvm_test_utils, kvm_utils
+
+def run_balloon_check(test, params, env):
+    """
+    Check Memory ballooning:
+    1) Boot a guest
+    2) Increase and decrease the memory of guest using balloon command from monitor
+    3) check memory info
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm)
+    fail = 0
+
+    # Check memory size
+    logging.info("Memory size check")
+    expected_mem = int(params.get("mem"))
+    actual_mem = vm.get_memory_size()
+    if actual_mem != expected_mem:
+        logging.error("Memory size mismatch:")
+        logging.error("Assigned to VM: %s" % expected_mem)
+        logging.error("Reported by OS: %s" % actual_mem)
+
+    # Change memory to a random size between 60% and 95% of actual memory
+    percent = random.uniform(0.6, 0.95)
+    new_mem = int(percent * expected_mem)
+    vm.send_monitor_cmd("balloon %s" % new_mem)
+    time.sleep(20)
+    status, output = vm.send_monitor_cmd("info balloon")
+    if status != 0:
+        logging.error("qemu monitor command failed: info balloon")
+
+    balloon_cmd_mem = int(re.findall("\d+", output)[0])
+    if balloon_cmd_mem != new_mem:
+        logging.error("memory ballooning failed while changing memory to %s" % balloon_cmd_mem)
+        fail += 1
+
+    # Checking for test result
+    if fail != 0:
+        raise error.TestFail("Memory ballooning test failed ")
+    session.close()
diff -uprN autotest-old/client/tests/kvm/tests_base.cfg.sample autotest/client/tests/kvm/tests_base.cfg.sample
--- autotest-old/client/tests/kvm/tests_base.cfg.sample	2010-04-09 12:32:50.0 -0400
+++ autotest/client/tests/kvm/tests_base.cfg.sample	2010-04-09 12:53:27.0 -0400
@@ -185,6 +185,10 @@ variants:
                 drift_threshold = 10
                 drift_threshold_single = 3
 
+    - balloon_check:  install setup unattended_install boot
+        type = balloon_check
+        extra_params += " -balloon virtio"
+
     - stress_boot:  install setup unattended_install
         type = stress_boot
         max_vms = 5
---


[PATCH RFC 0/5] KVM: Moving dirty bitmaps to userspace: double buffering approach

2010-04-09 Thread Takuya Yoshikawa
Hi, this is the first version!


We've first implemented the x86 specific parts without introducing
new APIs: so this code works with current qemu-kvm.

Although we have many things to do, we'd like to get some comments
to see whether we are going in the right direction.


Note: we are still testing this, and we are starting to think we may
  be able to improve performance, especially for migration, graphics,
  etc.; sorry, this is not confirmed yet.

Thanks in advance,
  Takuya


[PATCH RFC 2/5] KVM: use a wrapper function to calculate the sizes of dirty bitmaps

2010-04-09 Thread Takuya Yoshikawa
We will use this later in other parts.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
---
 arch/powerpc/kvm/book3s.c |2 +-
 arch/x86/kvm/x86.c|2 +-
 include/linux/kvm_host.h  |5 +
 virt/kvm/kvm_main.c   |4 ++--
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index a7ab2ea..3ca857b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1136,7 +1136,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
kvm_for_each_vcpu(n, vcpu, kvm)
kvmppc_mmu_pte_pflush(vcpu, ga, ga_end);
 
-   n = ALIGN(memslot->npages, BITS_PER_LONG) / 8;
+   n = kvm_dirty_bitmap_bytes(memslot);
memset(memslot->dirty_bitmap, 0, n);
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd5c3d3..450ecfe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2664,7 +2664,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
if (!memslot->dirty_bitmap)
goto out;
 
-   n = ALIGN(memslot->npages, BITS_PER_LONG) / 8;
+   n = kvm_dirty_bitmap_bytes(memslot);
 
r = -ENOMEM;
dirty_bitmap = vmalloc(n);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8e91fa7..dd6bcf4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,11 @@ struct kvm_memory_slot {
int user_alloc;
 };
 
+static inline int kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
+{
+   return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
+}
+
 struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9379533..5ab581e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -645,7 +645,7 @@ skip_lpage:
 
/* Allocate page dirty bitmap if needed */
if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
-   unsigned dirty_bytes = ALIGN(npages, BITS_PER_LONG) / 8;
+   int dirty_bytes = kvm_dirty_bitmap_bytes(&new);
 
new.dirty_bitmap = vmalloc(dirty_bytes);
if (!new.dirty_bitmap)
@@ -777,7 +777,7 @@ int kvm_get_dirty_log(struct kvm *kvm,
if (!memslot-dirty_bitmap)
goto out;
 
-   n = ALIGN(memslot->npages, BITS_PER_LONG) / 8;
+   n = kvm_dirty_bitmap_bytes(memslot);
 
for (i = 0; !any && i < n/sizeof(long); ++i)
any = memslot->dirty_bitmap[i];
-- 
1.6.3.3



[PATCH RFC 3/5] KVM: Use wrapper functions to create and destroy dirty bitmaps

2010-04-09 Thread Takuya Yoshikawa
For x86, we will change the allocation and free parts to do_mmap() and
do_munmap(). This patch makes it cleaner.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
---
 virt/kvm/kvm_main.c |   27 ---
 1 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5ab581e..f919bd1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -431,6 +431,12 @@ out_err_nodisable:
return ERR_PTR(r);
 }
 
+static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
+{
+   vfree(memslot->dirty_bitmap);
+   memslot->dirty_bitmap = NULL;
+}
+
 /*
  * Free any memory in @free but not in @dont.
  */
@@ -443,7 +449,7 @@ static void kvm_free_physmem_slot(struct kvm_memory_slot *free,
vfree(free->rmap);
 
if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
-   vfree(free->dirty_bitmap);
+   kvm_destroy_dirty_bitmap(free);
 
 
for (i = 0; i < KVM_NR_PAGE_SIZES - 1; ++i) {
@@ -454,7 +460,6 @@ static void kvm_free_physmem_slot(struct kvm_memory_slot *free,
}
 
free->npages = 0;
-   free->dirty_bitmap = NULL;
free->rmap = NULL;
 }
 
@@ -516,6 +521,18 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
return 0;
 }
 
+static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
+{
+   int dirty_bytes = kvm_dirty_bitmap_bytes(memslot);
+
+   memslot->dirty_bitmap = vmalloc(dirty_bytes);
+   if (!memslot->dirty_bitmap)
+   return -ENOMEM;
+
+   memset(memslot->dirty_bitmap, 0, dirty_bytes);
+   return 0;
+}
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -645,12 +662,8 @@ skip_lpage:
 
/* Allocate page dirty bitmap if needed */
if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
-   int dirty_bytes = kvm_dirty_bitmap_bytes(&new);
-
-   new.dirty_bitmap = vmalloc(dirty_bytes);
-   if (!new.dirty_bitmap)
+   if (kvm_create_dirty_bitmap(&new) < 0)
goto out_free;
-   memset(new.dirty_bitmap, 0, dirty_bytes);
/* destroy any largepage mappings for dirty tracking */
if (old.npages)
flush_shadow = 1;
-- 
1.6.3.3



[PATCH RFC 4/5] KVM: add new members to the memory slot for double buffering of bitmaps

2010-04-09 Thread Takuya Yoshikawa
Currently, x86 vmalloc()s a dirty bitmap every time we switch
to the next dirty bitmap. To avoid this, we use the double buffering
technique: we also move the bitmaps to userspace, so that the extra
bitmaps will not use precious kernel resources.

This idea is based on Avi's suggestion.
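
For readers who want the gist of the buffering scheme, here is a rough
user-space sketch of the double-buffering pattern (an illustration only,
not the kernel code in this series; names like get_dirty_log() are made
up for the sketch):

/*
 * Two bitmaps: one actively collects dirty bits while the other holds
 * the snapshot handed to the caller.  Fetching the log just swaps the
 * pointers and clears the now-active bitmap, so no allocation happens
 * per call.
 */
#include <stdio.h>
#include <string.h>

#define BITMAP_LONGS 4

static unsigned long bitmap_a[BITMAP_LONGS];
static unsigned long bitmap_b[BITMAP_LONGS];
static unsigned long *dirty_bitmap = bitmap_a;     /* currently written */
static unsigned long *dirty_bitmap_old = bitmap_b; /* handed to caller  */

static unsigned long *get_dirty_log(void)
{
    unsigned long *tmp = dirty_bitmap_old;

    dirty_bitmap_old = dirty_bitmap; /* snapshot for the caller */
    dirty_bitmap = tmp;              /* logging continues here  */
    memset(dirty_bitmap, 0, BITMAP_LONGS * sizeof(unsigned long));
    return dirty_bitmap_old;
}

int main(void)
{
    dirty_bitmap[0] = 0x5; /* pretend pages 0 and 2 were dirtied */
    printf("first snapshot:  %lx\n", get_dirty_log()[0]); /* prints 5 */
    printf("second snapshot: %lx\n", get_dirty_log()[0]); /* prints 0 */
    return 0;
}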

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
---
 arch/x86/include/asm/kvm_host.h |3 +++
 include/linux/kvm_host.h|6 ++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0c49c88..b502bca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -25,6 +25,9 @@
 #include <asm/mtrr.h>
 #include <asm/msr-index.h>
 
+/* Select x86 specific features in linux/kvm_host.h */
+#define __KVM_HAVE_USER_DIRTYBITMAP
+
 #define KVM_MAX_VCPUS 64
 #define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dd6bcf4..07092d6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -110,7 +110,13 @@ struct kvm_memory_slot {
unsigned long npages;
unsigned long flags;
unsigned long *rmap;
+#ifndef __KVM_HAVE_USER_DIRTYBITMAP
unsigned long *dirty_bitmap;
+#else
+   unsigned long __user *dirty_bitmap;
+   unsigned long __user *dirty_bitmap_old;
+   bool is_dirty;
+#endif
struct {
unsigned long rmap_pde;
int write_count;
-- 
1.6.3.3



[PATCH RFC 5/5] KVM: This is the main part of the moving dirty bitmaps to user space

2010-04-09 Thread Takuya Yoshikawa
With this patch, bitmap allocation is replaced with do_mmap() and
bitmap manipulation is replaced with *_user() functions.

Note that this does not change the APIs between kernel and user space.
To get more advantage from this hack, we need to add a new interface
for triggering the bitmap switch and getting the bitmap addresses: the
addresses are in user space and we can export them to qemu.

TODO:
1. We want to use copy_in_user() for the 32-bit case too.
   Note that this is only for compatibility: in the future, we hope,
   qemu will not need to use this ioctl.
2. We have to implement test_bit_user() to avoid an extra set_bit (one
   possible shape is sketched below).
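
For TODO item 2, one possible shape for such a helper, sketched here only
as an illustration (untested; it simply assumes the usual get_user()
semantics, reading the long that contains the bit and testing it so the
caller can skip a redundant write when the bit is already set):

static int test_bit_user(int nr, const unsigned long __user *addr)
{
	unsigned long word;

	/* fetch the word holding bit nr from the user-space bitmap */
	if (get_user(word, addr + nr / BITS_PER_LONG))
		return -EFAULT;

	return (word >> (nr % BITS_PER_LONG)) & 1;
}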

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c   |  118 +
 include/linux/kvm_host.h |4 ++
 virt/kvm/kvm_main.c  |   30 +++-
 3 files changed, 130 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 450ecfe..995b970 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2642,16 +2642,99 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
return 0;
 }
 
+int kvm_arch_create_dirty_bitmap(struct kvm_memory_slot *memslot)
+{
+   unsigned long user_addr1;
+   unsigned long user_addr2;
+   int dirty_bytes = kvm_dirty_bitmap_bytes(memslot);
+
+   down_write(&current->mm->mmap_sem);
+   user_addr1 = do_mmap(NULL, 0, dirty_bytes,
+PROT_READ | PROT_WRITE,
+MAP_PRIVATE | MAP_ANONYMOUS, 0);
+   if (IS_ERR((void *)user_addr1)) {
+   up_write(&current->mm->mmap_sem);
+   return PTR_ERR((void *)user_addr1);
+   }
+   user_addr2 = do_mmap(NULL, 0, dirty_bytes,
+PROT_READ | PROT_WRITE,
+MAP_PRIVATE | MAP_ANONYMOUS, 0);
+   if (IS_ERR((void *)user_addr2)) {
+   do_munmap(current->mm, user_addr1, dirty_bytes);
+   up_write(&current->mm->mmap_sem);
+   return PTR_ERR((void *)user_addr2);
+   }
+   up_write(&current->mm->mmap_sem);
+
+   memslot->dirty_bitmap = (unsigned long __user *)user_addr1;
+   memslot->dirty_bitmap_old = (unsigned long __user *)user_addr2;
+   clear_user(memslot->dirty_bitmap, dirty_bytes);
+   clear_user(memslot->dirty_bitmap_old, dirty_bytes);
+
+   return 0;
+}
+
+void kvm_arch_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
+{
+   int n = kvm_dirty_bitmap_bytes(memslot);
+
+   if (!memslot->dirty_bitmap)
+   return;
+
+   down_write(&current->mm->mmap_sem);
+   do_munmap(current->mm, (unsigned long)memslot->dirty_bitmap, n);
+   do_munmap(current->mm, (unsigned long)memslot->dirty_bitmap_old, n);
+   up_write(&current->mm->mmap_sem);
+
+   memslot->dirty_bitmap = NULL;
+   memslot->dirty_bitmap_old = NULL;
+}
+
+static int kvm_copy_dirty_bitmap(unsigned long __user *to,
+const unsigned long __user *from, int n)
+{
+#ifdef CONFIG_X86_64
+   if (copy_in_user(to, from, n) < 0) {
+   printk(KERN_WARNING "%s: copy_in_user failed\n", __func__);
+   return -EFAULT;
+   }
+   return 0;
+#else
+   int ret = 0;
+   void *p = vmalloc(n);
+
+   if (!p) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   if (copy_from_user(p, from, n) < 0) {
+   printk(KERN_WARNING "%s: copy_from_user failed\n", __func__);
+   ret = -EFAULT;
+   goto out_free;
+   }
+   if (copy_to_user(to, p, n) < 0) {
+   printk(KERN_WARNING "%s: copy_to_user failed\n", __func__);
+   ret = -EFAULT;
+   goto out_free;
+   }
+
+out_free:
+   vfree(p);
+out:
+   return ret;
+#endif
+}
+
 /*
  * Get (and clear) the dirty memory log for a memory slot.
  */
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
  struct kvm_dirty_log *log)
 {
-   int r, n, i;
+   int r, n;
struct kvm_memory_slot *memslot;
-   unsigned long is_dirty = 0;
-   unsigned long *dirty_bitmap = NULL;
+   unsigned long __user *dirty_bitmap;
+   unsigned long __user *dirty_bitmap_old;
 
mutex_lock(&kvm->slots_lock);
 
@@ -2664,44 +2747,37 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
if (!memslot->dirty_bitmap)
goto out;
 
-   n = kvm_dirty_bitmap_bytes(memslot);
-
-   r = -ENOMEM;
-   dirty_bitmap = vmalloc(n);
-   if (!dirty_bitmap)
-   goto out;
-   memset(dirty_bitmap, 0, n);
+   dirty_bitmap = memslot->dirty_bitmap;
+   dirty_bitmap_old = memslot->dirty_bitmap_old;
 
-   for (i = 0; !is_dirty && i < n/sizeof(long); i++)
-   is_dirty = memslot->dirty_bitmap[i];
+   n = kvm_dirty_bitmap_bytes(memslot);
+   clear_user(dirty_bitmap_old, n);
 

[RFC][PATCH v3 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-09 Thread xiaohui . xin
The idea is simple: just pin the guest VM user space and then
let the host NIC driver have the chance to DMA directly to it.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

The scenario is like this:

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user
buffers from a page constructor. We add a hook in the netif_receive_skb()
function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

Here, we have considered two ways to utilize the page constructor
API to dispense the user buffers.

One:    Modify the __alloc_skb() function a bit so that it only allocates
        the sk_buff structure, with the data pointer pointing to a
        user buffer which comes from the page constructor API.
        Then the shinfo of the skb also comes from the guest.
        When a packet is received from hardware, skb->data is filled
        directly by h/w. This is the way we have implemented so far.

Pros:   We can avoid any copy here.
Cons:   The guest virtio-net driver needs to allocate the skb in almost
        the same way as the host NIC drivers, i.e. with the size used by
        netdev_alloc_skb() and the same reserved space in the head of the
        skb. Many NIC drivers match the guest and are OK with this. But
        some of the latest NIC drivers reserve special room in the skb
        head. To deal with this, we suggest providing a method in the
        guest virtio-net driver to ask the NIC driver for the parameters
        we are interested in, once we know which device has been bound
        for zero-copy, and then have the guest follow them.
        Is that reasonable?

Two:    Modify the driver to get user buffers allocated from the page
        constructor API (substituting alloc_page()); the user buffers are
        used as payload buffers and filled by h/w directly when a packet
        is received. The driver should associate the pages with the skb
        (skb_shinfo(skb)->frags). For the head buffer side, let the host
        allocate the skb and h/w fill it. After that, the data filled in
        the host skb header is copied into the guest header buffer, which
        is submitted together with the payload buffer.

Pros:   We care less about how the guest or the host allocates its
        buffers.
Cons:   We still need a small copy here for the skb header.

We are not sure which way is better here. This is the first thing we want
to get comments on from the community. We hope the modification to the
network part will be generic, not used only by the vhost-net backend, so
that a user application may use it as well once the zero-copy device
provides async read/write operations later.

Please give comments especially for the network part modifications.


We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But in a simple
test with netperf, we found that bandwidth goes up and CPU % goes
up too, though the bandwidth increase is much larger than the CPU %
increase.

What we have not done yet:
packet split support
To support GRO
Performance tuning

What we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add a notifier block for the mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for the mp device to change the NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
use a dynamic minor instead of a static minor number
a __KERNEL__ guard for mp_get_sock()

What we have done in v2:

remove most of the RCU usage, since the ctor pointer is only
changed by the BIND/UNBIND ioctl, and during that time the NIC is
stopped to get a clean state (all outstanding requests are finished),
so the ctor pointer cannot race into a wrong situation.

Replace the struct vhost_notifier with struct kiocb.
Let the vhost-net backend alloc/free the kiocb and transfer them
via sendmsg/recvmsg.

use get_user_pages_fast() and set_page_dirty_lock()

[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-09 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space into host memory and
provides proto_ops such as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

memory leak fixed,
kconfig made, 
do_unbind() made,
mp_chr_ioctl() cleaned up and
some other cleanups made
 
by Jeff Dike jd...@linux.intel.com

 drivers/vhost/Kconfig |5 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1264 +
 include/linux/mpassthru.h |   29 +
 4 files changed, 1300 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config VHOST_PASSTHRU
+   tristate "Zerocopy network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..86d2525
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1264 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   /* record the locked pages */
+   int lock_pages;
+   struct rlimit   o_rlim;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   void*ctrl;
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a user space allocated skb or kernel
+*/
+   struct skb_user_pageuser;
+   struct skb_shared_info  ushinfo;
+
+#define INFO_READ

[RFC][PATCH v3 3/3] Let host NIC driver to DMA to guest user space.

2010-04-09 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The patch lets the host NIC driver receive user space skbs, so
the driver has a chance to DMA directly to guest user
space buffers through a single ethX interface.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

alloc_skb() cleaned up by Jeff Dike jd...@linux.intel.com

 include/linux/netdevice.h |   69 -
 include/linux/skbuff.h|   30 --
 net/core/dev.c|   63 ++
 net/core/skbuff.c |   74 
 4 files changed, 224 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..ba48eb0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -485,6 +485,17 @@ struct netdev_queue {
unsigned long   tx_dropped;
 } cacheline_aligned_in_smp;
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct mpassthru_port  {
+   int hdr_len;
+   int data_len;
+   int npages;
+   unsignedflags;
+   struct socket   *sock;
+   struct skb_user_page*(*ctor)(struct mpassthru_port *,
+   struct sk_buff *, int);
+};
+#endif
 
 /*
  * This structure defines the management hooks for network devices.
@@ -636,6 +647,10 @@ struct net_device_ops {
int (*ndo_fcoe_ddp_done)(struct net_device *dev,
 u16 xid);
 #endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+   int (*ndo_mp_port_prep)(struct net_device *dev,
+   struct mpassthru_port *port);
+#endif
 };
 
 /*
@@ -891,7 +906,8 @@ struct net_device
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port*garp_port;
-
+   /* mpassthru */
+   struct mpassthru_port   *mp_port;
/* class/net/name entry */
struct device   dev;
/* space for optional statistics and wireless sysfs groups */
@@ -2013,6 +2029,55 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
return 0;
return dev->ethtool_ops->get_flags(dev);
 }
-#endif /* __KERNEL__ */
 
+/* To support zero-copy between user space application and NIC driver,
+ * we'd better ask NIC driver for the capability it can provide, especially
+ * for packet split mode, now we only ask for the header size, and the
+ * payload once a descriptor may carry.
+ */
+
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+static inline int netdev_mp_port_prep(struct net_device *dev,
+   struct mpassthru_port *port)
+{
+   int rc;
+   int npages, data_len;
+   const struct net_device_ops *ops = dev->netdev_ops;
+
+   /* needed by packet split */
+   if (ops->ndo_mp_port_prep) {
+   rc = ops->ndo_mp_port_prep(dev, port);
+   if (rc)
+   return rc;
+   } else {
+   /* If the NIC driver did not report this,
+* then we try to use it as igb driver.
+*/
+   port->hdr_len = 128;
+   port->data_len = 2048;
+   port->npages = 1;
+   }
+
+   if (port->hdr_len <= 0)
+   goto err;
+
+   npages = port->npages;
+   data_len = port->data_len;
+   if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+   (data_len < PAGE_SIZE * (npages - 1) ||
+data_len > PAGE_SIZE * npages))
+   goto err;
+
+   return 0;
+err:
+   dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+   return -EINVAL;
+}
+
+extern int netdev_mp_port_attach(struct net_device *dev,
+   struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
 #endif /* _LINUX_NETDEVICE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..e59fa57 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -209,6 +209,13 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+struct skb_user_page {
+   u8  *start;
+   int size;
+   struct skb_frag_struct *frags;
+   struct skb_shared_info *ushinfo;
+   void(*dtor)(struct skb_user_page *);
+};
/* We divide dataref into two halves.  The higher 16 bits hold references
 * to the payload part of skb->data.  The lower 16 bits hold references to
 * the entire skb->data.  A clone of a headerless skb holds the length of
@@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void 

[RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-09 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  203 +++--
 drivers/vhost/vhost.c |  115 
 drivers/vhost/vhost.h |   15 
 3 files changed, 278 insertions(+), 55 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..d3fb3fc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -91,11 +94,100 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   if (vq->receiver)
+   vq->receiver(vq);
+
+   vq_log = unlikely(vhost_has_feature(
+   &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   log = (int)iocb->ki_user_data;
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in, vq_log,
+   &log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, 0);
+   tx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+
+   kmem_cache_free(net->cache, iocb);
+   if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+   struct kiocb *iocb = NULL;
unsigned head, out, in, s;
struct msghdr msg = {
.msg_name = NULL,
@@ -124,6 +216,8 @@ static void handle_tx(struct vhost_net *net)

Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Avi Kivity

On 04/02/2010 08:05 PM, Greg KH wrote:



Currently kvm does device assignment with its own code, I'd like to unify
it with uio, not split it off.

Separate notifications for msi-x interrupts are just as useful for uio as
they are for kvm.
 

I agree, there should not be a difference here for KVM vs. the normal
version.
   


Just so you know what you got into, here are the kvm requirements:

- msi interrupts delivered via eventfd (these allow us to inject 
interrupts from uio to a guest without going through userspace)
- nonlinear iommu mapping (i.e. map discontiguous ranges of the device 
address space into ranges of the virtual address space)

- dynamic iommu mapping (support guest memory hotplug)
- unprivileged operation once an admin has assigned a device (my 
preferred implementation is to have all operations go through an fd, 
which can be passed via SCM_RIGHTS from a privileged application that 
opens the file)
- access to all config space, but BARs must be translated so userspace 
cannot attack the host
- some mechanism which allows us to affine device interrupts with their 
target vcpus (eventually, this is vague)

- anything mst might add
- a pony

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Autotest] [PATCH] KVM test: Memory ballooning test for KVM guest

2010-04-09 Thread sudhir kumar
On Fri, Apr 9, 2010 at 2:40 PM, pradeep psuri...@linux.vnet.ibm.com wrote:

 Hi Lucas

 Thanks for your comments.
 Please find the patch, with suggested changes.

 Thanks
 Pradeep



 Signed-off-by: Pradeep Kumar Surisetty psuri...@linux.vnet.ibm.com
 ---
 diff -uprN autotest-old/client/tests/kvm/tests/balloon_check.py
 autotest/client/tests/kvm/tests/balloon_check.py
 --- autotest-old/client/tests/kvm/tests/balloon_check.py        1969-12-31
 19:00:00.0 -0500
 +++ autotest/client/tests/kvm/tests/balloon_check.py    2010-04-09
 12:33:34.0 -0400
 @@ -0,0 +1,47 @@
 +import re, string, logging, random, time
 +from autotest_lib.client.common_lib import error
 +import kvm_test_utils, kvm_utils
 +
 +def run_balloon_check(test, params, env):
 +    
 +    Check Memory ballooning:
 +    1) Boot a guest
 +    2) Increase and decrease the memory of guest using balloon command from
 monitor
Better replace this description with "Change the guest memory between X
and Y values".
Also, instead of using 0.6 and 0.95 below, better to use two variables and
take their values from the config file. This will give the user the
flexibility to narrow or widen the ballooning range.

 +    3) check memory info
 +
 +    @param test: kvm test object
 +    @param params: Dictionary with the test parameters
 +    @param env: Dictionary with test environment.
 +    
 +
 +    vm = kvm_test_utils.get_living_vm(env, params.get(main_vm))
 +    session = kvm_test_utils.wait_for_login(vm)
 +    fail = 0
 +
 +    # Check memory size
 +    logging.info(Memory size check)
 +    expected_mem = int(params.get(mem))
 +    actual_mem = vm.get_memory_size()
 +    if actual_mem != expected_mem:
 +        logging.error(Memory size mismatch:)
 +        logging.error(Assigned to VM: %s % expected_mem)
 +        logging.error(Reported by OS: %s % actual_mem)
 +
 +    #change memory to random size between 60% to 95% of actual memory
 +    percent = random.uniform(0.6, 0.95)
 +    new_mem = int(percent*expected_mem)
 +    vm.send_monitor_cmd(balloon %s %new_mem)

You may want to check if the command passed/failed. Older versions
might not support ballooning.

 +    time.sleep(20)
why 20 second sleep and why the magic number?

 +    status, output = vm.send_monitor_cmd(info balloon)
You might want to put this check before changing the memory.

 +    if status != 0:
 +        logging.error(qemu monitor command failed: info balloon)
 +
 +    balloon_cmd_mem = int(re.findall(\d+,output)[0])
A better variable name I can think of is ballooned_mem

 +    if balloon_cmd_mem != new_mem:
 +        logging.error(memory ballooning failed while changing memory to
 %s %balloon_cmd_mem)
 +       fail += 1
 +
 +    #Checking for test result
 +    if fail != 0:
In case you are running multiple iterations and the 2nd iteration
fails you will always miss this condition.

 +        raise error.TestFail(Memory ballooning test failed )
 +    session.close()
 diff -uprN autotest-old/client/tests/kvm/tests_base.cfg.sample
 autotest/client/tests/kvm/tests_base.cfg.sample
 --- autotest-old/client/tests/kvm/tests_base.cfg.sample 2010-04-09
 12:32:50.0 -0400
 +++ autotest/client/tests/kvm/tests_base.cfg.sample     2010-04-09
 12:53:27.0 -0400
 @@ -185,6 +185,10 @@ variants:
                 drift_threshold = 10
                 drift_threshold_single = 3

 +    - balloon_check:  install setup unattended_install boot
 +        type = balloon_check
 +        extra_params +=  -balloon virtio
 +
     - stress_boot:  install setup unattended_install
         type = stress_boot
         max_vms = 5
 ---

Rest all looks good






-- 
Regards
Sudhir Kumar


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Luiz Capitulino
On Thu, 8 Apr 2010 18:01:01 +0200
Mohammed Gamal m.gamal...@gmail.com wrote:

 Hi,
 Now that Cam is almost done with his ivshmem patches, I was thinking
 of another idea for GSoC which is improving the pass-though
 filesystems.
 I've got some questions on that:
 
 1- What does the community prefer to use and improve? CIFS, 9p, or
 both? And which is better taken up for GSoC.
 
 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?
 
 3- In addition, I see the idea mentions that some Windows code needs
 to be written to use network shares on a special interface. What's
 that interface? And what's the nature of that Windows code? (a driver
 a la guest additions?)

 CC'ing Aneesh as he's working on that.


Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-09 Thread Sridhar Samudrala
On Thu, 2010-04-08 at 17:14 -0700, Rick Jones wrote:
  Here are the results with netperf TCP_STREAM 64K guest to host on a
  8-cpu Nehalem system.
 
 I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?

Yes. It is a 2 socket quad-core Nehalem, so I guess it is an 8 core
Nehalem-EP.
 
 Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a 
 lot!-) 
 but I find it incomplete and also like to run things like single-instance 
 TCP_RR 
 and multiple-instance, multiple transaction (./configure --enable-burst) 
 TCP_RR tests, particularly when concerned with scaling issues.

Can we run multiple instance and multiple transaction tests with a
single netperf commandline?

Is there any easy way to get consolidated throughput when a netserver on
the host is servicing netperf clients from multiple guests?

Thanks
Sridhar

 
 happy benchmarking,
 
 rick jones
 
  It shows cumulative bandwidth in Mbps and host 
  CPU utilization.
  
  Current default single vhost thread
  ---
  1 guest:  12500  37%
  2 guests: 12800  46%
  3 guests: 12600  47%
  4 guests: 12200  47%
  5 guests: 12000  47%
  6 guests: 11700  47%
  7 guests: 11340  47%
  8 guests: 11200  48%
  
  vhost thread per cpu
  
  1 guest:   4900 25%
  2 guests: 10800 49%
  3 guests: 17100 67%
  4 guests: 20400 84%
  5 guests: 21000 90%
  6 guests: 22500 92%
  7 guests: 23500 96%
  8 guests: 24500 99%
  
  vhost thread per guest interface
  
  1 guest:  12500 37%
  2 guests: 21000 72%
  3 guests: 21600 79%
  4 guests: 21600 85%
  5 guests: 22500 89%
  6 guests: 22800 94%
  7 guests: 24500 98%
  8 guests: 26400 99%
  
  Thanks
  Sridhar
  
  


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Tom Lyon
On Friday 09 April 2010 02:58:19 am Avi Kivity wrote:
 On 04/02/2010 08:05 PM, Greg KH wrote:
 
  Currently kvm does device assignment with its own code, I'd like to unify
  it with uio, not split it off.
 
  Separate notifications for msi-x interrupts are just as useful for uio as
  they are for kvm.
   
  I agree, there should not be a difference here for KVM vs. the normal
  version.
 
 
 Just so you know what you got into, here are the kvm requirements:
 
 - msi interrupts delivered via eventfd (these allow us to inject 
 interrupts from uio to a guest without going through userspace)
Check.
 - nonlinear iommu mapping (i.e. map discontiguous ranges of the device 
 address space into ranges of the virtual address space)
Check.
 - dynamic iommu mapping (support guest memory hotplug)
Check.
 - unprivileged operation once an admin has assigned a device (my 
 preferred implementation is to have all operations go through an fd, 
 which can be passed via SCM_RIGHTS from a privileged application that 
 opens the file)
Check.
 - access to all config space, but BARs must be translated so userspace 
 cannot attack the host
Please elaborate. All of PCI config? All of PCIe config? Seems like a huge mess.
 - some mechanism which allows us to affine device interrupts with their 
 target vcpus (eventually, this is vague)
Do-able.
 - anything mst might add
mst?
 - a pony
Rainbow or glitter?

The 'check' items are already done, not fully tested; probably available
next week.
Can we leave the others for future patches? Please? And I definitely need
help with the PCI config stuff.


Re: [PATCH 1/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Tom Lyon
Mea culpa. 

On Friday 09 April 2010 02:08:55 am Joerg Roedel wrote:
 Btw. This patch posting is broken. It suffers from line-wraps which make
 it impossible to apply as-is. I was able to fix it but please consider
 this in your next posting.
 
 On Wed, Mar 31, 2010 at 05:12:35PM -0700, Tom Lyon wrote:
 
  --- linux-2.6.33/drivers/uio/uio_pci_generic.c  2010-02-24 
  10:52:17.0 -0800
 ^
 Unexpected line-wrap.
 
 I also got some whitespace warnings when trying to apply it. Please make
 sure you fix this in the next version too.
 
 Thanks,
 
   Joerg
 
 




Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for?non-privileged processes

2010-04-09 Thread Joerg Roedel
On Fri, Apr 09, 2010 at 09:34:16AM -0700, Tom Lyon wrote:
 The 'check' items are already done, not fully tested; probably available next 
 week.
 Can we leave the others for future patches? Please? And I definitely need 
 help with 
 the PCI config stuff.

Yeah, go in small steps forward. Just post again what you have next
week. We can add more functionality step by step.

Joerg



Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread jvrao
Luiz Capitulino wrote:
 On Thu, 8 Apr 2010 18:01:01 +0200
 Mohammed Gamal m.gamal...@gmail.com wrote:
 
 Hi,
 Now that Cam is almost done with his ivshmem patches, I was thinking
 of another idea for GSoC which is improving the pass-though
 filesystems.
 I've got some questions on that:

 1- What does the community prefer to use and improve? CIFS, 9p, or
 both? And which is better taken up for GSoC.

Please look at our recent set of patches.
We are developing a 9P server for QEMU, and the client is already part of
mainline Linux.
Our goal is to optimize it for the virtualization environment; it will work
as an FS pass-through mechanism between the host and the guest.

Here is the latest set of patches..

http://www.mail-archive.com/qemu-de...@nongnu.org/msg29267.html

Please let us know if you are interested ... we can coordinate.

Thanks,
JV


 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?

 3- In addition, I see the idea mentions that some Windows code needs
 to be written to use network shares on a special interface. What's
 that interface? And what's the nature of that Windows code? (a driver
 a la guest additions?)
 
  CC'ing Aneesh as he's working on that.
 
 




Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-09 Thread Rick Jones

Sridhar Samudrala wrote:

On Thu, 2010-04-08 at 17:14 -0700, Rick Jones wrote:


Here are the results with netperf TCP_STREAM 64K guest to host on a
8-cpu Nehalem system.


I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?



Yes. It is a 2 socket quad-core Nehalem. so i guess it is a 8 core
Nehalem-EP.

Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a lot!-) 
but I find it incomplete and also like to run things like single-instance TCP_RR 
and multiple-instance, multiple transaction (./configure --enable-burst) 
TCP_RR tests, particularly when concerned with scaling issues.



Can we run multiple instance and multiple transaction tests with a
single netperf commandline?


Do you count a shell for loop as a single command line?


Is there any easy way to get consolidated throughput when a netserver on
the host is servicing netperf clients from multiple guests?


I tend to use a script such as:

ftp://ftp.netperf.org/netperf/misc/runemomniagg2.sh

which presumes that netperf/netserver have been built with:

./configure --enable-omni --enable-burst ...

and uses the CSV output format of the omni tests.  When I want sums I then turn 
to a spreadsheet, or I suppose I could turn to awk etc.


The TCP_RR test can be flipped around, request size for response size etc., so
when I have a single system under test, I initiate the netperf commands on it,
targeting netservers on the clients.  If I want inbound bulk throughput I use
the TCP_MAERTS test rather than the TCP_STREAM test.


happy benchmarking,

rick jones


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Mohammed Gamal
On Fri, Apr 9, 2010 at 7:11 PM, jvrao jv...@linux.vnet.ibm.com wrote:
 Luiz Capitulino wrote:
 On Thu, 8 Apr 2010 18:01:01 +0200
 Mohammed Gamal m.gamal...@gmail.com wrote:

 Hi,
 Now that Cam is almost done with his ivshmem patches, I was thinking
 of another idea for GSoC which is improving the pass-though
 filesystems.
 I've got some questions on that:

 1- What does the community prefer to use and improve? CIFS, 9p, or
 both? And which is better taken up for GSoC.

 Please look at our recent set of patches.
 We are developing a 9P server for QEMU and client is already part of mainline 
 Linux.
 Our goal is to optimize it for virualization environment and will work as FS 
 pass-through
 mechanism between host and the guest.

 Here is the latest set of patches..

 http://www.mail-archive.com/qemu-de...@nongnu.org/msg29267.html

 Please let us know if you are interested ... we can coordinate.

 Thanks,
 JV


I'd be interested indeed.


 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?

 3- In addition, I see the idea mentions that some Windows code needs
 to be written to use network shares on a special interface. What's
 that interface? And what's the nature of that Windows code? (a driver
 a la guest additions?)

  CC'ing Aneesh as he's working on that.







Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Avi Kivity

On 04/09/2010 07:34 PM, Tom Lyon wrote:

- access to all config space, but BARs must be translated so userspace
cannot attack the host
 

Please elaborate. All of PCI config? All of PCIe config? Seems like a huge mess.
   


Yes.  Anything a guest's device driver may want to access.


The 'check' items are already done, not fully tested; probably available next 
week.
Can we leave the others for future patches? Please?


Hey, I was expecting we'd have to do all of this.  The requirements list 
was to get the uio maintainers confirmation that this is going in an 
acceptable direction.


We can definitely proceed incrementally.


And I definitely need help with
the PCI config stuff.
   


Sure.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hugetlbfs and KSM

2010-04-09 Thread Chris Wright
* Bernhard Schmidt (be...@birkenwald.de) wrote:
 * KSM seems to be largely ineffective (100MB saved -> 1.3MB saved)
 
 Am I doing something wrong? Is this a bug? Is this generally impossible
 with large pages (which might explain the lower load on the host, if
 large pages are not scanned)? Or is it just way less likely to have
 identical pages at that size?

KSM only scans and merges 4k pages.
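
For context, KSM only looks at anonymous memory that the application has
explicitly marked mergeable with madvise(); hugetlbfs-backed guest memory
is never registered that way, so there is nothing for ksmd to scan.  A
minimal user-space sketch of that registration step (illustration only,
using the standard madvise(2) interface):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 1;

    /* opt this (4k-backed) anonymous region into KSM scanning */
    if (madvise(p, len, MADV_MERGEABLE))
        perror("madvise(MADV_MERGEABLE)");

    memset(p, 0x55, len);  /* identical pages, good merge candidates */
    getchar();             /* wait; watch /sys/kernel/mm/ksm/pages_sharing */
    return 0;
}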


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 On 04/02/2010 08:05 PM, Greg KH wrote:
 - access to all config space, but BARs must be translated so
 userspace cannot attack the host

Specifically, intermediated access to config space.  For example, need
to know about MSI/MSI-X updates in config space.

thanks,
-chris


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for?non-privileged processes

2010-04-09 Thread Chris Wright
* Tom Lyon (p...@lyon-about.com) wrote:
 On Friday 09 April 2010 02:58:19 am Avi Kivity wrote:
  - access to all config space, but BARs must be translated so userspace 
  cannot attack the host
 Please elaborate. All of PCI config? All of PCIe config? Seems like a huge 
 mess.

All of config space, but not raw access to all bits.  So the MSI/MSI-X
capability writes need to be intermediated.  There are bits in the
header too.  And it's not just PCI, it's extended config space as well;
drivers may care about finding their whizzybang PCIe capability and doing
something with it (and worse... they are allowed to put device specific
registers there, and worse yet... they do!).

thanks,
-chris


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Jamie Lokier
Mohammed Gamal wrote:
 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?

One problem with TCP ports is it only works when the guest's network
is up :) You can't boot from that.  It also makes things fragile or
difficult if the guest work you are doing involves fiddling with the
network settings.

Doing it over virtio-serial would have many benefits.

On the other hand, Samba+TCP+CIFS does have the advantage of working
with virtually all guest OSes, including Linux / BSDs / Windows /
MacOSX / Solaris etc.  9P only works with Linux as far as I know.

A big problem with Samba at the moment is that it's not possible to
instantiate multiple instances of Samba any more, nor as a
non-root user.  That's because it contains some hard-coded paths to
directories of run-time state, at least on Debian/Ubuntu hosts where I
have tried and failed to use qemu's smb option, and there is no config
file option to disable that or even change all the paths.

Patching Samba to make per-user instantiations possible again would go
a long way to making it useful for filesystem passthrough.  Patching
it so you can turn off all the fancy features and have it _just_ serve
a filesystem with the most basic necessary authentication would be
even better.

-- Jamie


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Mohammed Gamal
On Fri, Apr 9, 2010 at 11:22 PM, Jamie Lokier ja...@shareable.org wrote:
 Mohammed Gamal wrote:
 2- With respect to CIFS. I wonder how the shares are supposed to be
 exposed to the guest. Should the Samba server be modified to be able
 to use unix domain sockets instead of TCP ports and then QEMU
 communicating on these sockets. With that approach, how should the
 guest be able to see the exposed share? And what is the problem of
 using Samba with TCP ports?

 One problem with TCP ports is it only works when the guest's network
 is up :) You can't boot from that.  It also makes things fragile or
 difficult if the guest work you are doing involves fiddling with the
 network settings.

 Doing it over virtio-serial would have many benefits.

 On the other hand, Samba+TCP+CIFS does have the advantage of working
 with virtually all guest OSes, including Linux / BSDs / Windows /
 MacOSX / Solaris etc.  9P only works with Linux as far as I know.

 I big problem with Samba at the moment is it's not possible to
 instantiate multiple instances of Samba any more, and not as a
 non-root user.  That's because it contains some hard-coded paths to
 directories of run-time state, at least on Debian/Ubuntu hosts where I
 have tried and failed to use qemu's smb option, and there is no config
 file option to disable that or even change all the paths.

 Patching Samba to make per-user instantiations possible again would go
 a long way to making it useful for filesystem passthrough.  Patching
 it so you can turn off all the fancy features and have it _just_ serve
 a filesystem with the most basic necessary authentication would be
 even better.

 -- Jamie


Hi Jamie,

Thanks for your input.

That's all good and well. The question now is which direction would
the community prefer to go. Would everyone be just happy with
virtio-9p passthrough? Would it support multiple OSs (Windows comes to
mind here)? Or would we eventually need to patch Samba for passthrough
filesystems?

Regards,
Mohammed


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Javier Guerra Giraldez
On Fri, Apr 9, 2010 at 5:17 PM, Mohammed Gamal m.gamal...@gmail.com wrote:
 That's all good and well. The question now is which direction would
 the community prefer to go. Would everyone be just happy with
 virtio-9p passthrough? Would it support multiple OSs (Windows comes to
 mind here)? Or would we eventually need to patch Samba for passthrough
 filesystems?

found this:

http://code.google.com/p/ninefs/

it's a BSD-licensed 9p client for Windows. I have no idea how
stable / complete / trustworthy it is, but it might be a start.


-- 
Javier


Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.

2010-04-09 Thread Mohammed Gamal
On Sat, Apr 10, 2010 at 12:22 AM, Javier Guerra Giraldez
jav...@guerrag.com wrote:
 On Fri, Apr 9, 2010 at 5:17 PM, Mohammed Gamal m.gamal...@gmail.com wrote:
 That's all good and well. The question now is which direction would
 the community prefer to go. Would everyone be just happy with
 virtio-9p passthrough? Would it support multiple OSs (Windows comes to
 mind here)? Or would we eventually need to patch Samba for passthrough
 filesystems?

 found this:

 http://code.google.com/p/ninefs/

 it's a BSD-licensed 9p client for windows i have no idea of how
 stable / complete / trustable it is; but might be some start


 --
 Javier


Hi Javier,
Thanks for the link. However, I'm still concerned with
interoperability with other operating systems, including non-Windows
ones. I am not sure of how many operating systems actually support 9p,
but I'm almost certain that CIFS would be more widely-supported.
I am still a newbie as far as all this is concerned, so if anyone has
any arguments as to whether which approach should be taken, I'd be
enlightened to hear them.

Regards,
Mohammed


Re: Setting nx bit in virtual CPU

2010-04-09 Thread Andre Przywara

Richard Simpson wrote:

On 08/04/10 09:52, Andre Przywara wrote:


Can you try to boot the attached multiboot kernel, which just outputs
a brief CPUID dump?
$ qemu-kvm -kernel cpuid_mb -vnc :0
(Unfortunately I have no serial console support in there yet, so you
either have to write the values down or screenshot it).
In the 4th line from the bottom it should print NX (after SYSCALL).


OK, that was fun!  Resulting screen shots are attached.

...default.png  With command line above.
...cpu_host.png With -cpu host option added.
...no_kvm.png   With -no-kvm option added.

I hope that helps!


OK, AFAIK there are several flags missing. I dimly remember there was a
bug with masking the CPUID bits in older kernels, so I guess you have to
celebrate your uptime for the last time and then give the machine a reboot
with a more up-to-date host kernel.
(I also rebooted my desktop after it made the one-year mark and have now
gone green by turning it off over night ;-)
Maybe you can get around this by rebuilding fixed versions of kvm.ko and
kvm_amd.ko; I can provide a fix for you if you wish (please point me to
a way to get the actual kernel source you use).

Was the userspace up-to-date (qemu-kvm 0.12.3)?
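
For reference, the NX check itself is tiny; a minimal user-space sketch
(assuming GCC's <cpuid.h> helper) that can be compiled and run inside the
guest to see whether the virtual CPU advertises NX:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* NX is reported in extended leaf 0x80000001, EDX bit 20 */
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        printf("extended CPUID leaf not available\n");
        return 1;
    }
    printf("NX: %s\n", (edx & (1u << 20)) ? "present" : "missing");
    return 0;
}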

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
