Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com

   I think we discussed the need for external to guest testing
   over 10G. For large messages we should not see any change
   but you should be able to get better numbers for small messages
   assuming a MQ NIC card.
 
  For external host, there is a contention among different
  queues (vhosts) when packets are processed in tun/bridge,
  unless I implement MQ TX for macvtap (tun/bridge?).  So
  my testing shows a small improvement (1 to 1.5% average)
  in BW and a rise in SD (between 10-15%).  For remote host,
  I think tun/macvtap needs MQ TX support?

 Confused. I thought this *is* with a multiqueue tun/macvtap?
 bridge does not do any queueing AFAIK ...
 I think we need to fix the contention. With migration what was guest to
 host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?
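
(For reference, the txq selection mentioned above works by hashing the flow so
all packets of one flow stay on one queue. A toy model of that idea, purely
illustrative; pick_txq and the flow tuple are invented, not kernel code:)

```python
def pick_txq(flow_tuple, num_txqs):
    # Hash the flow so every packet of one flow maps to the same tx
    # queue (avoids reordering across queues). The kernel uses
    # skb-based hashes; hash() here is just for illustration.
    return hash(flow_tuple) % num_txqs

flow = ("10.0.0.1", "10.0.0.2", 5001, 80)   # src, dst, sport, dport
queues = {pick_txq(flow, 8) for _ in range(100)}
print(len(queues))  # 1: a flow's queue mapping is stable
```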

Thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com
 
I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming a MQ NIC card.
  
   For external host, there is a contention among different
   queues (vhosts) when packets are processed in tun/bridge,
   unless I implement MQ TX for macvtap (tun/bridge?).  So
   my testing shows a small improvement (1 to 1.5% average)
   in BW and a rise in SD (between 10-15%).  For remote host,
   I think tun/macvtap needs MQ TX support?
 
  Confused. I thought this *is* with a multiqueue tun/macvtap?
  bridge does not do any queueing AFAIK ...
  I think we need to fix the contention. With migration what was guest to
  host a minute ago might become guest to external now ...
 
 Macvtap RX is MQ but not TX. I don't think MQ TX support is
 required for macvtap, though. Is it enough for existing
 macvtap sendmsg to work, since it calls dev_queue_xmit
 which selects the txq for the outgoing device?
 
 Thanks,
 
 - KK

I think there would be an issue with using a single poll notifier and
contention on send buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST


Re: [PATCH 4/8] KVM: avoid unnecessary wait for a async pf

2010-10-28 Thread Xiao Guangrong
On 10/27/2010 06:42 PM, Gleb Natapov wrote:
 On Wed, Oct 27, 2010 at 05:04:58PM +0800, Xiao Guangrong wrote:
 In current code, it checks async pf completion out of the wait context,
 like this:

 if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
     !vcpu->arch.apf.halted)
         r = vcpu_enter_guest(vcpu);
 else {
         ...
         kvm_vcpu_block(vcpu)
          ^- waiting until 'async_pf.done' is not empty
 }
  
 kvm_check_async_pf_completion(vcpu)
  ^- delete list from async_pf.done

 So, if we check async pf completion first, it can be blocked at
 kvm_vcpu_block

 Correct, but it can be fixed by adding vcpu->arch.apf.halted = false; to
 kvm_arch_async_page_present(), no?
 Adding kvm_check_async_pf_completion() to the arch-independent kvm_vcpu_block()
 constrains how other archs may implement async pf support, IMO.
  

Um, I think it's reasonable; I will fix it to address your comment.


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:

 Results for UDP BW tests (unidirectional, sum across
 3 iterations, each iteration of 45 seconds, default
 netperf, vhosts bound to cpus 0-3; no other tuning):
   
Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?
  
   Nothing drastic, I remember BW% and SD% both improved a
   bit as a result of binding.
 
  If there's a significant improvement this would mean that
  we need to rethink the vhost-net interaction with the scheduler.

 I will get a test run with and without binding and post the
 results later today.

Correction: the result with binding is much better for
SD/CPU compared to without binding:

____________________________________________________
 numtxqs=8, vhosts=5, Bind vs No-bind
#     BW%     CPU%     RCPU%    SD%      RSD%
____________________________________________________
1     11.25   10.77    1.89     0        -6.06
2     18.66   7.20     7.20     -14.28   -7.40
4     4.24    -1.27    1.56     -2.70    -.98
8     14.91   -3.79    5.46     -12.19   -3.76
16    12.32   -8.67    4.63     -35.97   -26.66
24    11.68   -7.83    5.10     -40.73   -32.37
32    13.09   -10.51   6.57     -51.52   -42.28
40    11.04   -4.12    11.23    -50.69   -42.81
48    8.61    -10.30   6.04     -62.38   -55.54
64    7.55    -6.05    6.41     -61.20   -56.04
80    8.74    -11.45   6.29     -72.65   -67.17
96    9.84    -6.01    9.87     -69.89   -64.78
128   5.57    -6.23    8.99     -75.03   -70.97
____________________________________________________
BW: 10.4%,  CPU/RCPU: -7.4%, 7.7%,  SD/RSD: -70.5%, -65.7%
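
(The rows can also be read programmatically as a quick sanity check. The
summary line uses the poster's own aggregation, so this only checks direction,
not the exact averages; the tuples are copied from the table above:)

```python
# (#sessions, BW%, CPU%, RCPU%, SD%, RSD%), copied from the table above
rows = [
    (1, 11.25, 10.77, 1.89, 0.00, -6.06),
    (2, 18.66, 7.20, 7.20, -14.28, -7.40),
    (4, 4.24, -1.27, 1.56, -2.70, -0.98),
    (8, 14.91, -3.79, 5.46, -12.19, -3.76),
    (16, 12.32, -8.67, 4.63, -35.97, -26.66),
    (24, 11.68, -7.83, 5.10, -40.73, -32.37),
    (32, 13.09, -10.51, 6.57, -51.52, -42.28),
    (40, 11.04, -4.12, 11.23, -50.69, -42.81),
    (48, 8.61, -10.30, 6.04, -62.38, -55.54),
    (64, 7.55, -6.05, 6.41, -61.20, -56.04),
    (80, 8.74, -11.45, 6.29, -72.65, -67.17),
    (96, 9.84, -6.01, 9.87, -69.89, -64.78),
    (128, 5.57, -6.23, 8.99, -75.03, -70.97),
]
bw = [r[1] for r in rows]
sd = [r[4] for r in rows]
print(min(bw) > 0)       # True: binding never hurts bandwidth here
print(sd[-1] < sd[0])    # True: SD saving is largest at high session counts
```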

Notes:
1.  All my earlier test results were with vhost bound
    to cpus 0-3 for both the original and new kernels.
2.  I am not using MST's use_mq patch, only the mainline
    kernel. However, I reported earlier that I got
    better results with that patch. The result for
    MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK



Re: [PATCH 5/8] KVM: don't touch vcpu stat after async pf is complete

2010-10-28 Thread Xiao Guangrong
On 10/27/2010 06:44 PM, Gleb Natapov wrote:
 On Wed, Oct 27, 2010 at 05:05:57PM +0800, Xiao Guangrong wrote:
 Don't make a KVM_REQ_UNHALT request after async pf is completed since it
 can break guest's 'halt' instruction.

 Why is it a problem? CPU may be unhalted by different events so OS
 shouldn't depend on it.
 

We don't know how the guest OS handles it after the HLT instruction completes.
According to the x86 spec, only NMI/INTR/RESET/INIT/SMI can break the halt
state; it violates the hardware behavior if we allow other events to break
this state. Your opinion? :-)


[PATCH] KVM test: Add a subtest kdump

2010-10-28 Thread Jason Wang
Add a new subtest to check whether kdump works correctly in a guest. This test
just tries to trigger a crash on each vcpu and then verifies it by checking the vmcore.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 client/tests/kvm/tests/kdump.py  |   79 ++
 client/tests/kvm/tests_base.cfg.sample   |   11 
 client/tests/kvm/unattended/RHEL-5-series.ks |1 
 3 files changed, 91 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/kdump.py

diff --git a/client/tests/kvm/tests/kdump.py b/client/tests/kvm/tests/kdump.py
new file mode 100644
index 000..8fa3cca
--- /dev/null
+++ b/client/tests/kvm/tests/kdump.py
@@ -0,0 +1,79 @@
+import logging, time
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+
+def run_kdump(test, params, env):
+    """
+    KVM kdump test:
+    1) Log into a guest
+    2) Check and enable kdump
+    3) For each vcpu, trigger a crash and check the vmcore
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    timeout = float(params.get("login_timeout", 240))
+    crash_timeout = float(params.get("crash_timeout", 360))
+    session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
+    def_kernel_param_cmd = "grubby --update-kernel=`grubby --default-kernel` \
+                            --args=crashkernel=1...@64m"
+    kernel_param_cmd = params.get("kernel_param_cmd", def_kernel_param_cmd)
+    def_kdump_enable_cmd = "chkconfig kdump on && service kdump start"
+    kdump_enable_cmd = params.get("kdump_enable_cmd", def_kdump_enable_cmd)
+
+    def crash_test(vcpu):
+        """
+        Trigger a crash dump through sysrq-trigger
+
+        @param vcpu: vcpu which is used to trigger a crash
+        """
+        session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
+        session.get_command_status("rm -rf /var/crash/*")
+
+        logging.info("Triggering crash on vcpu %d ...", vcpu)
+        crash_cmd = "taskset -c %d echo c > /proc/sysrq-trigger" % vcpu
+        session.sendline(crash_cmd)
+
+        if not kvm_utils.wait_for(lambda: not session.is_responsive(), 240, 0,
+                                  1):
+            raise error.TestFail("Could not trigger crash on vcpu %d" % vcpu)
+
+        logging.info("Waiting for the completion of dumping")
+        session = kvm_test_utils.wait_for_login(vm, 0, crash_timeout, 0, 2)
+
+        logging.info("Probing vmcore file ...")
+        s = session.get_command_status("ls -R /var/crash | grep vmcore")
+        if s != 0:
+            raise error.TestFail("Could not find the generated vmcore file!")
+        else:
+            logging.info("Found vmcore.")
+
+        session.get_command_status("rm -rf /var/crash/*")
+
+    try:
+        logging.info("Check the existence of crash kernel ...")
+        prob_cmd = "grep -q 1 /sys/kernel/kexec_crash_loaded"
+        s = session.get_command_status(prob_cmd)
+        if s != 0:
+            logging.info("Crash kernel is not loaded. Try to load it.")
+            # We need to set up the kernel params
+            s, o = session.get_command_status_output(kernel_param_cmd)
+            if s != 0:
+                raise error.TestFail("Could not add crashkernel params to"
+                                     " kernel")
+            session = kvm_test_utils.reboot(vm, session, timeout=timeout)
+
+        logging.info("Enable kdump service ...")
+        # the initrd may be rebuilt here so we need to wait a little more
+        s, o = session.get_command_status_output(kdump_enable_cmd, timeout=120)
+        if s != 0:
+            raise error.TestFail("Could not enable kdump service: %s" % o)
+
+        nvcpu = int(params.get("smp", 1))
+        [crash_test(i) for i in range(nvcpu)]
+
+    finally:
+        session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index fe3563c..25ad688 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -665,6 +665,15 @@ variants:
 image_name_snapshot1 = sn1
 image_name_snapshot2 = sn2
 
+    - kdump:
+        type = kdump
+        # time waited for the completion of crash dump
+        # crash_timeout = 360
+        # command to add the crashkerne...@y to kernel cmd line
+        # kernel_param_cmd = grubby --update-kernel=`grubby --default-kernel` --args=crashkernel=1...@64m
+        # command to enable kdump service
+        # kdump_enable_cmd = chkconfig kdump on && service kdump start
+
 # system_powerdown, system_reset and shutdown *must* be the last ones
 # defined (in this order), since the effect of such tests can leave
 # the VM on a bad state.
@@ -1924,6 +1933,8 @@ virtio_net|virtio_blk|e1000|balloon_check:
 only Fedora.11 Fedora.12 Fedora.13 RHEL.5 OpenSUSE.11 SLES.11 Ubuntu-8.10-server
 # 

Re: [PATCH 5/8] KVM: don't touch vcpu stat after async pf is complete

2010-10-28 Thread Gleb Natapov
On Thu, Oct 28, 2010 at 03:35:13PM +0800, Xiao Guangrong wrote:
 On 10/27/2010 06:44 PM, Gleb Natapov wrote:
  On Wed, Oct 27, 2010 at 05:05:57PM +0800, Xiao Guangrong wrote:
  Don't make a KVM_REQ_UNHALT request after async pf is completed since it
  can break guest's 'halt' instruction.
 
  Why is it a problem? CPU may be unhalted by different events so OS
  shouldn't depend on it.
  
 
 We don't know how the guest OS handles it after the HLT instruction completes.
 According to the x86 spec, only NMI/INTR/RESET/INIT/SMI can break the halt
 state; it violates the hardware behavior if we allow other events to break
 this state. Your opinion? :-)
I agree in principle, but since SMI (which is completely out of guest OS
control) can cause the CPU to exit halt, in practice an OS can't rely on the
CPU being unhalted only by events controlled by the OS itself. In the past we
had a bug where any timer event unhalted the vcpu even when the timer
interrupt was masked. The only practical problem it caused was that a vcpu
that executed a cli; 1: hlt; jmp 1b sequence still consumed host cpu time.
That said, I am not against fixing it if the fix is easy. Your current fix,
though, relies on patch 4, which I have a problem with.

--
Gleb.


Re: [PATCH 6/8] KVM: simply wakup async pf

2010-10-28 Thread Xiao Guangrong
On 10/27/2010 06:50 PM, Gleb Natapov wrote:
 On Wed, Oct 27, 2010 at 05:07:32PM +0800, Xiao Guangrong wrote:
 The current way is to queue a completed async pf with:
  async_pf.page = bad_page
  async_pf.arch.gfn = 0

 It has two problems while kvm_check_async_pf_completion handles this
 async_pf:
 - since !async_pf.page, it can retry a pseudo #PF
 kvm_arch_async_page_ready checks for is_error_page()
 
 - it can delete gfn 0 from vcpu->arch.apf.gfns[]
 kvm_arch_async_page_present() checks for is_error_page() too and,
 in case of a PV guest, injects a special token if it is true. 
 

Ah, sorry for my stupid questions.

 After your patch special token will not be injected and migration will
 not work.
 
 Actually, we can simply record this wakeup request and let
 kvm_check_async_pf_completion break the wait

 Maybe the wakeup_all function naming is misleading. It means wake up all PV
 guest processes by sending a broadcast async pf notification. It is not
 about waking the host vcpu thread.
 

I'm not familiar with the KVM PV way; I'll dig into it. Please ignore this
patch, thanks.


Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace

2010-10-28 Thread Sheng Yang
On Sunday 24 October 2010 20:23:20 Michael S. Tsirkin wrote:
 On Sun, Oct 24, 2010 at 08:19:09PM +0800, Sheng Yang wrote:
   You need a guarantee that MSIX per-vector mask is used for
   disable_irq/enable_irq, right? I can't see how this provides it.
  
  This one is meant to directly operate on the mask/unmask bit of the MSI-X
  table, to emulate the mask/unmask behavior that the guest wants. In the
  previous patch I used enable_irq()/disable_irq(), but they won't
  directly operate on the MSI-X table unless it's necessary, and Michael wants
  to read the table in userspace, so he prefers using mask/unmask
  directly.
 
 As I said, the main problem was really that the implementation
 proposed only works for interrupts used by assigned devices.
 I would like for it to work for irqfd as well.

I think we can't let QEmu access the mask or pending bit directly. It must
communicate with the kernel to get the info if the kernel owns the mask.

That's because the mask/unmask operation the guest intends has, in fact,
nothing to do with what the host would do. Maybe we can emulate it by doing
the same thing on the device, but it's two layers in fact. Also, we know the
host kernel does disabling/enabling according to its own mechanism, e.g. it
may disable an interrupt temporarily if there are too many interrupts. What
the host does should be transparent to the guest. Directly accessing the data
from the device should be prohibited.

And the pending bit case is the same. In fact the kernel knows which IRQ is
pending; we can check the IRQ_PENDING bit of the desc, though we don't have
such an interface now. But we can do it in the future if it's necessary.

I'm proposing a new interface like kvm_get_msix_entry, to return the mask bit
of a specific entry. Pending bit support can be added in the future if it's
needed. But we can't directly access the MSI-X table/PBA in theory.
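
(The split being proposed, where the kernel owns the real mask bit and
userspace asks for the guest-visible state through an interface like the
suggested kvm_get_msix_entry, can be modeled roughly. This is a toy sketch
with invented names, not the actual KVM interface:)

```python
class KernelMsixState:
    """Toy model: the kernel tracks both the guest's intended mask and
    its own transient disables; userspace never reads the real MSI-X
    table, it asks the kernel for the guest-visible bit."""
    def __init__(self, nentries):
        self.guest_masked = [False] * nentries
        self.host_disabled = [False] * nentries  # e.g. irq throttling

    def hw_mask_bit(self, entry):
        # What is actually in the MSI-X table: host OR guest mask.
        return self.guest_masked[entry] or self.host_disabled[entry]

    def get_msix_entry_mask(self, entry):
        # What userspace/guest should see: guest state only.
        return self.guest_masked[entry]

s = KernelMsixState(4)
s.host_disabled[0] = True   # host throttles the interrupt
print(s.hw_mask_bit(0), s.get_msix_entry_mask(0))  # True False
```

The point of the sketch: the host's own disabling stays invisible to the
guest, which is why reading the table directly would be wrong.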

--
regards
Yang, Sheng


Re: Page Eviction Algorithm

2010-10-28 Thread Avi Kivity

 On 10/26/2010 03:31 PM, Prasad Joshi wrote:

On Tue, Oct 26, 2010 at 2:07 PM, Avi Kivity a...@redhat.com wrote:
On 10/26/2010 12:42 PM, Prasad Joshi wrote:

  Thanks a lot for your reply.

  On Tue, Oct 26, 2010 at 11:31 AM, Avi Kivity a...@redhat.com wrote:
On 10/26/2010 11:19 AM, Prasad Joshi wrote:
  
  Hi All,
  
  I was just going over TODO list on KVM page. In MMU related TODO I saw
  only page eviction algorithm currently implemented is FIFO.
  
  Is it really the case?
  
  Yes.
  
  If yes I would like to work on it. Can someone
  let me know the place where the FIFO code is implemented?
  
  Look at the code that touches mmu_active_list.
  
  FWIW improving the algorithm is not critically important.  It's rare that
  mmu shadow pages need to be evicted.

  I would be doing a University project on Virtualization. I would like
  to work on Linux kernel and KVM. I was looking over the TODO list on
  KVM wiki.

  Can you please suggest me something that would add value to KVM?


  O(1) write protection (on the TODO page) is interesting and important.  It's
  difficult, so you may want to start with O(1) invalidation.

I am not sure I understand what exactly an MMU invalidation is.
Is it cache invalidation or TLB invalidation? Can you please
elaborate? I am really sorry if I am asking a silly question.


Invalidation of all shadow page tables.  The current code which does 
this is in kvm_mmu_zap_all().
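
(One well-known way to make such invalidation O(1), rather than walking every
shadow page the way a zap-all does, is a generation number. The sketch below
is an illustration of that general idea, an assumption about the approach, not
KVM's actual code:)

```python
class ShadowMMU:
    """Sketch: each shadow page records the generation it was created
    in; invalidating everything is a single counter bump, and stale
    pages are discarded lazily when next looked up."""
    def __init__(self):
        self.generation = 0
        self.pages = {}             # gfn -> creation generation

    def shadow(self, gfn):
        self.pages[gfn] = self.generation

    def invalidate_all(self):       # O(1): no walk over the pages
        self.generation += 1

    def is_valid(self, gfn):
        return self.pages.get(gfn) == self.generation

mmu = ShadowMMU()
mmu.shadow(0x1000)
mmu.invalidate_all()
print(mmu.is_valid(0x1000))  # False: dropped lazily, not eagerly
```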


--
error compiling committee.c: too many arguments to function



Re: Page Eviction Algorithm

2010-10-28 Thread Avi Kivity

 On 10/26/2010 05:08 PM, Prasad Joshi wrote:


  Can you please suggest me something that would add value to KVM?


  O(1) write protection (on the TODO page) is interesting and important.  It's
  difficult, so you may want to start with O(1) invalidation.

  I am not sure I understand what exactly an MMU invalidation is.
  Is it cache invalidation or TLB invalidation? Can you please
  elaborate? I am really sorry if I am asking a silly question.

Does this MMU invalidation have something to do with EPT (Extended
Page Tables)


No


and the INVEPT instruction?


No, (though INVEPT has to be run as part of this operation, via 
kvm_flush_remote_tlbs).


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance

2010-10-28 Thread Stefan Hajnoczi
On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma mashi...@us.ibm.com wrote:
 This patch changes vhost TX used buffer signaling to the guest from one by
 one to batches of up to 3/4 of the vring size. This change improves vhost TX
 performance for message sizes from 256 to 8K, in both bandwidth and CPU
 utilization, without inducing any regression.

Any concerns about introducing latency or does the guest not care when
TX completions come in?
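
(The batching idea under review can be sketched outside the kernel. This is a
toy model of the scheme, not the vhost code; TxRing and its fields are
invented for illustration:)

```python
class TxRing:
    """Collect used descriptors and notify the guest only once 3/4 of
    the ring (num - (num >> 2), as in the patch) has accumulated."""
    def __init__(self, num):
        self.num = num
        self.pend = []       # used descriptors not yet signaled
        self.signals = 0     # guest notifications issued

    def add_used(self, head):
        self.pend.append(head)
        if len(self.pend) == self.num - (self.num >> 2):
            self.signals += 1
            self.pend.clear()

ring = TxRing(256)
for head in range(1000):
    ring.add_used(head)
print(ring.signals, len(ring.pend))  # 5 40: 5 signals instead of 1000
```

It also makes the latency question above concrete: the last 40 completions sit
unsignaled until enough further traffic arrives to fill the next batch.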

 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---

  drivers/vhost/net.c   |   19 ++-
  drivers/vhost/vhost.c |   31 +++
  drivers/vhost/vhost.h |    3 +++
  3 files changed, 52 insertions(+), 1 deletions(-)

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 4b4da5b..bd1ba71 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
                if (err != len)
                        pr_debug("Truncated TX packet: "
                                 " len %d != %zd\n", err, len);
 -               vhost_add_used_and_signal(net->dev, vq, head, 0);
 +               /*
 +                * If no pending buffer is allocated, signal used buffers
 +                * one by one; otherwise, signal used buffers when reaching
 +                * 3/4 ring size to reduce CPU utilization.
 +                */
 +               if (unlikely(vq->pend))
 +                       vhost_add_used_and_signal(net->dev, vq, head, 0);
 +               else {
 +                       vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to
vq->pend[vq->num_pend].

 +                       vq->pend[vq->num_pend].len = 0;
 +                       ++vq->num_pend;
 +                       if (vq->num_pend == (vq->num - (vq->num >> 2))) {
 +                               vhost_add_used_and_signal_n(net->dev, vq,
 +                                                           vq->pend,
 +                                                           vq->num_pend);
 +                               vq->num_pend = 0;
 +                       }
 +               }
                total_len += len;
                if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
                        vhost_poll_queue(&vq->poll);
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 94701ff..47696d2 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
        vq->call_ctx = NULL;
        vq->call = NULL;
        vq->log_ctx = NULL;
 +       /* signal pending used buffers */
 +       if (vq->pend) {
 +               if (vq->num_pend != 0) {
 +                       vhost_add_used_and_signal_n(dev, vq, vq->pend,
 +                                                   vq->num_pend);
 +                       vq->num_pend = 0;
 +               }
 +               kfree(vq->pend);
 +       }
 +       vq->pend = NULL;
  }

  static int vhost_worker(void *data)
 @@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
                dev->vqs[i].heads = NULL;
                dev->vqs[i].dev = dev;
                mutex_init(&dev->vqs[i].mutex);
 +               dev->vqs[i].num_pend = 0;
 +               dev->vqs[i].pend = NULL;
                vhost_vq_reset(dev, dev->vqs + i);
 +               /* signal 3/4 of ring size used buffers */
 +               dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
 +                                          (dev->vqs[i].num >> 2)) *
 +                                          sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested?  vq->peed?

Stefan


Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance

2010-10-28 Thread Stefan Hajnoczi
On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
Just read the patch 1/1 discussion and it looks like you're already on
it.  Sorry for the noise.

Stefan


Re: [PATCH 7/8] KVM: make async_pf work queue lockless

2010-10-28 Thread Xiao Guangrong
On 10/27/2010 07:41 PM, Gleb Natapov wrote:
 On Wed, Oct 27, 2010 at 05:09:41PM +0800, Xiao Guangrong wrote:
 The number of async_pfs is very small, since only a pending interrupt can
 let it re-enter guest mode.

 During my test (Host 4 CPU + 4G, Guest 4 VCPU + 6G), there were no
 more than 10 requests in the system.

 So, we can only increase the completion counter in the work queue
 context, and walk the vcpu->async_pf.queue list to get all completed
 async_pf

 That depends on the load. I used memory cgroups to create very big
 memory pressure and I saw hundreds of apfs per second. We shouldn't
 optimize for very low numbers. With vcpu->async_pf.queue having more
 than one element I am not sure your patch is beneficial.
 

Maybe we need a new lock-free way to record the completed apfs; I'll reproduce
your test environment and improve it.

 +
 +               list_del(&work->queue);
 +               vcpu->async_pf.queued--;
 +               kmem_cache_free(async_pf_cache, work);
 +               if (atomic_dec_and_test(&vcpu->async_pf.done))
 +                       break;
 You should do atomic_dec() and always break. We cannot inject two apfs during
 one vcpu entry.
 

Sorry, I'm a little confused.

Why should 'atomic_dec_and_test(&vcpu->async_pf.done)' always break?
async_pf.done is used to record the completed apfs, and many apfs may be
completed when the vcpu enters guest mode (it means vcpu->async_pf.done > 1).

Look at the current code:

void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
..
spin_lock(&vcpu->async_pf.lock);
work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
list_del(&work->link);
spin_unlock(&vcpu->async_pf.lock);
..
}

You only handle one completed apf; why would we inject them all at once? Did I
miss something? :-(
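
(The one-injection-per-entry constraint discussed in this thread can be
modeled with a toy queue; the names here are invented, not KVM code:)

```python
from collections import deque

def check_completion(done):
    # Handle exactly one completed async_pf per guest entry: for a PV
    # guest, only one 'page ready' exception can be injected at a time.
    if done:
        return done.popleft()
    return None

done = deque(["apf-%d" % i for i in range(3)])   # 3 completed apfs
entries = 0
while done:
    check_completion(done)
    entries += 1
print(entries)  # 3: draining 3 completions takes 3 guest entries
```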



Re: [PATCH 8/8] KVM: add debugfs file to show the number of async pf

2010-10-28 Thread Xiao Guangrong
On 10/27/2010 06:58 PM, Gleb Natapov wrote:
 On Wed, Oct 27, 2010 at 05:10:51PM +0800, Xiao Guangrong wrote:
 It can help us to see the state of async pf

 I have patch to add three async pf statistics:
 apf_not_present
 apf_present
 apf_doublefault
 
 But Avi now wants to deprecate debugfs interface completely and move
 towards ftrace, so I had to drop it.
 

OK, let's ignore this patch, thanks :-) 


Re: [PATCH 7/8] KVM: make async_pf work queue lockless

2010-10-28 Thread Gleb Natapov
On Thu, Oct 28, 2010 at 05:08:58PM +0800, Xiao Guangrong wrote:
 On 10/27/2010 07:41 PM, Gleb Natapov wrote:
  On Wed, Oct 27, 2010 at 05:09:41PM +0800, Xiao Guangrong wrote:
  The number of async_pfs is very small, since only a pending interrupt can
  let it re-enter guest mode.
 
  During my test (Host 4 CPU + 4G, Guest 4 VCPU + 6G), there were no
  more than 10 requests in the system.
 
  So, we can only increase the completion counter in the work queue
  context, and walk the vcpu->async_pf.queue list to get all completed
  async_pf
 
  That depends on the load. I used memory cgroups to create very big
  memory pressure and I saw hundreds of apfs per second. We shouldn't
  optimize for very low numbers. With vcpu->async_pf.queue having more
  than one element I am not sure your patch is beneficial.
  
 
 Maybe we need a new no-lock way to record the complete apfs, i'll reproduce
 your test environment and improve it.
 
That is always welcomed :)

  +
  +               list_del(&work->queue);
  +               vcpu->async_pf.queued--;
  +               kmem_cache_free(async_pf_cache, work);
  +               if (atomic_dec_and_test(&vcpu->async_pf.done))
  +                       break;
  You should do atomic_dec() and always break. We cannot inject two apfs
  during one vcpu entry.
  
 
 Sorry, I'm a little confused.
 
 Why should 'atomic_dec_and_test(&vcpu->async_pf.done)' always break?
 async_pf.done is used to
In your code it is not, but it should (at least if the guest is PV, read
below).

 record the completed apfs, and many apfs may be completed when the vcpu
 enters guest mode (it means vcpu->async_pf.done > 1)
 
Correct, but only one apf should be handled on each vcpu entry in case
of a PV guest. Look at kvm_arch_async_page_present(vcpu, work); that is called
in a loop in your code. If vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED
is not null it injects an exception into the guest. You can't inject more
than one exception on each guest entry. If the guest is not PV you are
correct that we can loop here until vcpu->async_pf.done == 0.

 Look at the current code:
 
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 {
   ..
  spin_lock(&vcpu->async_pf.lock);
  work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
  list_del(&work->link);
  spin_unlock(&vcpu->async_pf.lock);
   ..
 }
 
 You only handle one completed apf; why would we inject them all at once? Did
 I miss something? :-(

--
Gleb.


Re: [PATCH] KVM test: Add a subtest kdump

2010-10-28 Thread Lucas Meneghel Rodrigues
On Thu, 2010-10-28 at 15:36 +0800, Jason Wang wrote:
 Add a new subtest to check whether kdump works correctly in a guest. This
 test just tries to trigger a crash on each vcpu and then verifies it by
 checking the vmcore.

Nice test Jason, some comments below:

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  client/tests/kvm/tests/kdump.py  |   79 
 ++
  client/tests/kvm/tests_base.cfg.sample   |   11 
  client/tests/kvm/unattended/RHEL-5-series.ks |1 
  3 files changed, 91 insertions(+), 0 deletions(-)
  create mode 100644 client/tests/kvm/tests/kdump.py
 
 diff --git a/client/tests/kvm/tests/kdump.py b/client/tests/kvm/tests/kdump.py
 new file mode 100644
 index 000..8fa3cca
 --- /dev/null
 +++ b/client/tests/kvm/tests/kdump.py
 @@ -0,0 +1,79 @@
 +import logging, time
 +from autotest_lib.client.common_lib import error
 +import kvm_subprocess, kvm_test_utils, kvm_utils
 +
 +
 +def run_kdump(test, params, env):
 +    """
 +    KVM kdump test:
 +    1) Log into a guest
 +    2) Check and enable kdump
 +    3) For each vcpu, trigger a crash and check the vmcore
 +
 +    @param test: kvm test object
 +    @param params: Dictionary with the test parameters
 +    @param env: Dictionary with test environment.
 +    """
 +    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
 +    timeout = float(params.get("login_timeout", 240))
 +    crash_timeout = float(params.get("crash_timeout", 360))
 +    session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
 +    def_kernel_param_cmd = "grubby --update-kernel=`grubby --default-kernel` \
 +                            --args=crashkernel=1...@64m"

^ Implicit line continuation is better here

def_kernel_param_cmd = ("command param1 param2 ..."
                        "param8 param9")

 +    kernel_param_cmd = params.get("kernel_param_cmd", def_kernel_param_cmd)
 +    def_kdump_enable_cmd = "chkconfig kdump on && service kdump start"
 +    kdump_enable_cmd = params.get("kdump_enable_cmd", def_kdump_enable_cmd)
 +
 +    def crash_test(vcpu):
 +        """
 +        Trigger a crash dump through sysrq-trigger
 +
 +        @param vcpu: vcpu which is used to trigger a crash
 +        """
 +        session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2)
 +        session.get_command_status("rm -rf /var/crash/*")
 +
 +        logging.info("Triggering crash on vcpu %d ...", vcpu)
 +        crash_cmd = "taskset -c %d echo c > /proc/sysrq-trigger" % vcpu
 +        session.sendline(crash_cmd)
 +
 +        if not kvm_utils.wait_for(lambda: not session.is_responsive(), 240,
 +                                  0, 1):
 +            raise error.TestFail("Could not trigger crash on vcpu %d" % vcpu)
 +
 +        logging.info("Waiting for the completion of dumping")

^ "Waiting for kernel crash dump to complete" would be better

 +        session = kvm_test_utils.wait_for_login(vm, 0, crash_timeout, 0, 2)
 +
 +        logging.info("Probing vmcore file ...")
 +        s = session.get_command_status("ls -R /var/crash | grep vmcore")
 +        if s != 0:
 +            raise error.TestFail("Could not find the generated vmcore file!")
 +        else:
 +            logging.info("Found vmcore.")
 +
 +        session.get_command_status("rm -rf /var/crash/*")
 +
 +    try:
 +        logging.info("Check the existence of crash kernel ...")
 +        prob_cmd = "grep -q 1 /sys/kernel/kexec_crash_loaded"
 +        s = session.get_command_status(prob_cmd)
 +        if s != 0:
 +            logging.info("Crash kernel is not loaded. Try to load it.")
 +            # We need to set up the kernel params
 +            s, o = session.get_command_status_output(kernel_param_cmd)
 +            if s != 0:
 +                raise error.TestFail("Could not add crashkernel params to"
 +                                     " kernel")
 +            session = kvm_test_utils.reboot(vm, session, timeout=timeout)
 +
 +        logging.info("Enable kdump service ...")
 +        # the initrd may be rebuilt here so we need to wait a little more
 +        s, o = session.get_command_status_output(kdump_enable_cmd,
 +                                                 timeout=120)

^ I remember an initrd build usually takes longer than 2 minutes on most
machines; does this work fine on both Fedora and RHEL?

 +        if s != 0:
 +            raise error.TestFail("Could not enable kdump service: %s" % o)
 +
 +        nvcpu = int(params.get("smp", 1))
 +        [crash_test(i) for i in range(nvcpu)]

^ Although the list comprehension is indeed very cool, since we're not going
to do anything with this list, I'd rather use the good old for
loop.
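
^ Something like this (a sketch; `crash_test` and `nvcpu` come from the patch
above and are stubbed here so the snippet stands alone):

```python
calls = []

def crash_test(vcpu):
    # stand-in for the real crash_test(vcpu) defined earlier in this patch
    calls.append(vcpu)

nvcpu = 4  # in the test this is int(params.get("smp", 1))

# plain for loop: clearer than a list comprehension whose result is discarded
for i in range(nvcpu):
    crash_test(i)
```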

 +    finally:
 +        session.close()
 diff --git a/client/tests/kvm/tests_base.cfg.sample 
 b/client/tests/kvm/tests_base.cfg.sample
 index fe3563c..25ad688 100644
 --- a/client/tests/kvm/tests_base.cfg.sample
 +++ b/client/tests/kvm/tests_base.cfg.sample
 @@ -665,6 +665,15 @@ variants:
  image_name_snapshot1 = sn1
  image_name_snapshot2 = sn2
  
 +- kdump:
 +type = kdump
 +# time waited for 

Re: Page Eviction Algorithm

2010-10-28 Thread Prasad Joshi
On Thu, Oct 28, 2010 at 1:45 AM, Avi Kivity a...@redhat.com wrote:
 Does this MMU invalidation have something to do with the EPT (Extended
 Page Table)

 No

 and instruction INVEPT?

 No, (though INVEPT has to be run as part of this operation, via
 kvm_flush_remote_tlbs).

Thanks a lot, Avi, for your help. I will look at the code and study it,
and I will ask for clarification whenever I need help.


 --
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] msix: Allow msix_init on a device with existing MSI-X capability

2010-10-28 Thread Avi Kivity

 On 10/23/2010 06:55 PM, Alex Williamson wrote:

On Sat, 2010-10-23 at 18:18 +0200, Michael S. Tsirkin wrote:
  On Fri, Oct 22, 2010 at 02:40:31PM -0600, Alex Williamson wrote:
To enable common msix support to be used with pass through devices,
don't attempt to change the BAR if the device already has an
MSI-X capability.  This also means we want to pay closer attention
to the size when we map the msix table page, as it isn't necessarily
covering the entire end of the BAR.
  
Signed-off-by: Alex Williamsonalex.william...@redhat.com
---
  
 hw/msix.c |   67 
+++--
 1 files changed, 38 insertions(+), 29 deletions(-)
  
diff --git a/hw/msix.c b/hw/msix.c
index 43efbd2..4122395 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -167,35 +167,43 @@ static int msix_add_config(struct PCIDevice *pdev, 
unsigned short nentries,
 {
 int config_offset;
 uint8_t *config;
-uint32_t new_size;
  
 -if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
 -return -EINVAL;
 -if (bar_size > 0x8000)
 -return -ENOSPC;
-
-/* Add space for MSI-X structures */
-if (!bar_size) {
-new_size = MSIX_PAGE_SIZE;
 -} else if (bar_size < MSIX_PAGE_SIZE) {
-bar_size = MSIX_PAGE_SIZE;
-new_size = MSIX_PAGE_SIZE * 2;
-} else {
-new_size = bar_size * 2;
-}
-
 -pdev->msix_bar_size = new_size;
 -config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, 
 MSIX_CAP_LENGTH);
 -if (config_offset < 0)
 -return config_offset;
 -config = pdev->config + config_offset;
-
-pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
-/* Table on top of BAR */
-pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
-/* Pending bits on top of that */
-pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + 
MSIX_PAGE_PENDING) |
- bar_nr);
 +pdev->msix_bar_size = bar_size;
+
+config_offset = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+
+if (!config_offset) {
+uint32_t new_size;
+
 +if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
 +return -EINVAL;
 +if (bar_size > 0x8000)
 +return -ENOSPC;
+
+/* Add space for MSI-X structures */
+if (!bar_size) {
+new_size = MSIX_PAGE_SIZE;
 +} else if (bar_size < MSIX_PAGE_SIZE) {
+bar_size = MSIX_PAGE_SIZE;
+new_size = MSIX_PAGE_SIZE * 2;
+} else {
+new_size = bar_size * 2;
+}
+
 +pdev->msix_bar_size = new_size;
+config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX,
+   MSIX_CAP_LENGTH);
 +if (config_offset < 0)
 +return config_offset;
 +config = pdev->config + config_offset;
+
+pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
+/* Table on top of BAR */
+pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
+/* Pending bits on top of that */
+pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + 
MSIX_PAGE_PENDING) |
+ bar_nr);
+}
  pdev->msix_cap = config_offset;
  /* Make flags bit writeable. */
  pdev->wmask[config_offset + MSIX_CONTROL_OFFSET] |= MSIX_ENABLE_MASK |
@@ -337,7 +345,8 @@ void msix_mmio_map(PCIDevice *d, int region_num,
 return;
 if (size <= offset)
 return;
-cpu_register_physical_memory(addr + offset, size - offset,
+cpu_register_physical_memory(addr + offset,
+ MIN(size - offset, MSIX_PAGE_SIZE),

  This is wrong I think, the table might not fit in a single page.
  You would need to read table size out of from device config.

That's true, but I was hoping to save that for later since we don't seem
to be running into that problem yet.  Current device assignment code
assumes a single page, and I haven't heard of anyone with a vector table
that exceeds that yet.  Thanks,



Ok; applied.  Please add some warning if the condition happens so the 
breakage is at least not silent.
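
Michael's point about reading the table size out of config space can be
illustrated with a small sketch (the constants follow the PCI spec: the table
size is the low 11 bits of the MSI-X Message Control word, N-1 encoded, with
16 bytes per table entry; the helper name here is invented):

```python
PCI_MSIX_FLAGS_QSIZE = 0x7FF  # table-size field of Message Control
MSIX_ENTRY_SIZE = 16          # bytes per MSI-X table entry
MSIX_PAGE_SIZE = 4096

def msix_table_bytes(msg_ctrl):
    """Size in bytes of the MSI-X table described by a Message Control value."""
    nentries = (msg_ctrl & PCI_MSIX_FLAGS_QSIZE) + 1
    return nentries * MSIX_ENTRY_SIZE

# 256 entries (encoded as 255) exactly fill one 4K page; anything larger
# spills into a second page, so mapping MIN(size - offset, MSIX_PAGE_SIZE)
# would silently truncate the table.
assert msix_table_bytes(255) == MSIX_PAGE_SIZE
assert msix_table_bytes(511) > MSIX_PAGE_SIZE
```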


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Shirley Ma
On Thu, 2010-10-28 at 07:20 +0200, Michael S. Tsirkin wrote:
 My concern is this can delay signalling for unlimited time.
 Could you pls test this with guests that do not have
 2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb
 b0c39dbdc204006ef3558a66716ff09797619778
 that is 2.6.31 and older?

I will test it out.

 This seems to be slighltly out of spec, even though
 for TX, signals are less important.
 Two ideas:
 1. How about writing out used, just delaying the signal?
This way we don't have to queue separately.
 2. How about flushing out queued stuff before we exit
the handle_tx loop? That would address most of
the spec issue. 
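
The two ideas above can be sketched abstractly as follows (a toy Python model
for illustration only; the real code is C inside vhost's handle_tx, and every
name here is invented):

```python
class TxRing:
    """Toy model: publish used buffers immediately, coalesce guest signals."""

    def __init__(self, batch=32):
        self.used = []           # used entries already visible to the guest
        self.pending = 0         # completions since the last signal
        self.batch = batch
        self.signals = 0         # how many times the guest was signalled

    def complete(self, desc):
        self.used.append(desc)   # idea 1: write out used right away
        self.pending += 1
        if self.pending >= self.batch:
            self.kick()

    def kick(self):
        if self.pending:
            self.signals += 1    # one guest notification per batch
            self.pending = 0

ring = TxRing(batch=4)
for desc in range(10):
    ring.complete(desc)
ring.kick()  # idea 2: flush the queued signal before leaving the loop
```

With a batch of 4 and 10 completions, the guest sees every used entry as soon
as it completes but is only signalled three times instead of ten.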

I will modify the patch to test the performance of both approaches 1 and
2.

Thanks
Shirley





Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Shirley Ma
On Thu, 2010-10-28 at 07:20 +0200, Michael S. Tsirkin wrote:
 My concern is this can delay signalling for unlimited time.
 Could you pls test this with guests that do not have
 2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb
 b0c39dbdc204006ef3558a66716ff09797619778
 that is 2.6.31 and older? 

The patch only delays signaling for an unlimited time when there are no
TX packets to transmit. I thought TX signaling only notifies the guest to
release the used buffers; is there anything else besides this?

I tested a RHEL 5.5 guest (2.6.18 kernel) and it works fine. I checked the
logs of the two commits; I don't think this patch could cause any issue
without those two patches.

Also, I found a big TX regression between old and new guests. With an old
guest, I am able to get almost 11 Gb/s for 2K message size, but with the
new guest kernel, I can only get 3.5 Gb/s with the patch and the same host.
I will dig into why.

thanks
Shirley



[PATCH 0/4] VFIO V5: Non-privileged user level PCI drivers

2010-10-28 Thread Tom Lyon
[Rebased to 2.6.36]
Just in time for Halloween - be afraid!

This version adds support for PCIe extended capabilities, including Advanced
Error Reporting.  All of the config table initialization has been rewritten to
be much more readable. All config accesses are byte-at-a-time and endian issues
have been resolved. Reading of PCI ROMs is now handled properly. Problems with
usage of pci_iomap and pci_request_regions have now been resolved.  Devices are
now reset upon first open. Races in rmMemlock rlimit accounting have been
cleaned up. Lots of other tweaks and comments.

Blurb from version 4:

After a long summer break, it's tanned, it's rested, and it's ready to rumble!

In this version:*** REBASE to 2.6.35 ***

There's new code using generic netlink messages which allows the kernel
to notify the user level of weird events and allows the user level to 
respond. This is currently used to handle device removal (whether software
or hardware driven), PCI error events, and system suspend & hibernate.

The driver now supports devices which use multiple MSI interrupts, reflecting
the actual number of interrupts allocated by the system to the user level.

PCI config accesses are now done through the pci_user_{read,write)_config
routines from drivers/pci/access.c.

Numerous other tweaks and cleanups.

Blurb from version 3:

There are lots of bug fixes and cleanups in this version, but the main
change is to check to make sure that the IOMMU has interrupt remapping
enabled, which is necessary to prevent user level code from triggering
spurious interrupts for other devices.  Since most platforms today
do not have the necessary hardware and/or software for this, a module
option can override this check, thus making vfio useful (but not safe)
on many more platforms.

In the next version I plan to add kernel to user messaging using the
generic netlink mechanism to allow the user driver to react to hot add
and remove, and power management requests.

Blurb from version 2:

This version now requires an IOMMU domain to be set before any access to
device registers is granted (except that config space may be read).  In
addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
which does not have sufficient controls around IOMMU usage.  The IOMMU domain
is obtained from the 'uiommu' driver which is included in this patch.

Various locking, security, and documentation issues have also been fixed.

Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?

Blurb from version 1:

This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.

[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine what's safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not.  Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.

All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices.  I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters
- the Cisco Palo enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release).  This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.

Tom Lyon (4):
  VFIO V5: export pci_user_{read,write}_config
  VFIO V5: additions to include/linux/pci_regs.h
  VFIO V5: uiommu driver - allow user progs to manipulate iommu domains
  VFIO V5: Non-privileged user level PCI drivers

 Documentation/ioctl/ioctl-number.txt |1 +
 Documentation/vfio.txt   |  182 ++
 MAINTAINERS  |8 +
 drivers/Kconfig  |2 +
 drivers/Makefile |2 +
 drivers/pci/access.c |6 +-
 drivers/pci/pci.h|7 -
 drivers/vfio/Kconfig |   18 +
 drivers/vfio/Makefile|   11 +
 drivers/vfio/uiommu.c|  126 
 drivers/vfio/vfio_dma.c  |  400 
 

[PATCH 1/4] VFIO V5: export pci_user_{read,write}_config

2010-10-28 Thread Tom Lyon

Acked-by:   Jesse Barnes jbar...@virtuousgeek.org
Signed-off-by: Tom Lyon p...@cisco.com
---
 drivers/pci/access.c |6 --
 drivers/pci/pci.h|7 ---
 include/linux/pci.h  |8 
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 531bc69..96ed449 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -157,7 +157,8 @@ int pci_user_read_config_##size 
\
 raw_spin_unlock_irq(&pci_lock); \
*val = (type)data;  \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
 
 #define PCI_USER_WRITE_CONFIG(size,type)   \
 int pci_user_write_config_##size   \
@@ -171,7 +172,8 @@ int pci_user_write_config_##size
\
pos, sizeof(type), val);\
 raw_spin_unlock_irq(&pci_lock); \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);
 
 PCI_USER_READ_CONFIG(byte, u8)
 PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6beb11b..e1db481 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -77,13 +77,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
 	return !!(pci_dev->subordinate);
 }
 
-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
-
 struct pci_vpd_ops {
ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void 
*buf);
ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const 
void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c8d95e3..7f22c8a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -756,6 +756,14 @@ static inline int pci_write_config_dword(struct pci_dev 
*dev, int where,
 	return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
 
+/* user-space driven config access */
+extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
+extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
+
 int __must_check pci_enable_device(struct pci_dev *dev);
 int __must_check pci_enable_device_io(struct pci_dev *dev);
 int __must_check pci_enable_device_mem(struct pci_dev *dev);
-- 
1.6.0.2



[PATCH 3/4] VFIO V5: uiommu driver - allow user progs to manipulate iommu domains

2010-10-28 Thread Tom Lyon

Signed-off-by: Tom Lyon p...@cisco.com
---
 drivers/Kconfig|2 +
 drivers/Makefile   |1 +
 drivers/vfio/Kconfig   |8 +++
 drivers/vfio/Makefile  |1 +
 drivers/vfio/uiommu.c  |  126 
 include/linux/uiommu.h |   76 +
 6 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/uiommu.c
 create mode 100644 include/linux/uiommu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..711c1cb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index a2aea53..c445440 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-y  += firewire/
 obj-y  += ieee1394/
 obj-$(CONFIG_UIO)  += uio/
+obj-$(CONFIG_UIOMMU)   += vfio/
 obj-y  += cdrom/
 obj-y  += auxdisplay/
 obj-$(CONFIG_PCCARD)   += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 000..3ab9af3
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig UIOMMU
+	tristate "User level manipulation of IOMMU"
+   help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains.
+
+ If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 000..556f3c1
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UIOMMU) += uiommu.o
diff --git a/drivers/vfio/uiommu.c b/drivers/vfio/uiommu.c
new file mode 100644
index 000..5c17c5a
--- /dev/null
+++ b/drivers/vfio/uiommu.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, p...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon p...@cisco.com");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+   struct iommu_domain *domain;
+   struct uiommu_domain *udomain;
+
+   domain = iommu_domain_alloc();
+   if (!domain)
+   return NULL;
+   udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+   if (!udomain) {
+   iommu_domain_free(domain);
+   return NULL;
+   }
+   udomain->domain = domain;
+   atomic_inc(&udomain->refcnt);
+   return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = uiommu_domain_alloc();
+   if (!udomain)
+   return -ENOMEM;
+   file->private_data = udomain;
+   return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = file->private_data;
+   uiommu_put(udomain);
+   return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+   .owner  = THIS_MODULE,
+   .open   = uiommu_open,
+   .release= uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+   .name   = uiommu,
+   .minor  = MISC_DYNAMIC_MINOR,
+   .fops   = uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+   struct file *file;
+   struct uiommu_domain *udomain;
+
+   file = fget(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);
+   if (file->f_op != &uiommu_fops) {
+   fput(file);
+   return ERR_PTR(-EINVAL);
+   }
+   udomain = file->private_data;
+   atomic_inc(&udomain->refcnt);
+   return 

[PATCH 2/4] VFIO V5: additions to include/linux/pci_regs.h

2010-10-28 Thread Tom Lyon

Signed-off-by: Tom Lyon p...@cisco.com
---
 include/linux/pci_regs.h |  107 ++
 1 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..70addc9 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
  */
+#define PCI_STD_HEADER_SIZEOF	64
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
 #define PCI_COMMAND0x04/* 16 bits */
@@ -209,9 +210,12 @@
 #define  PCI_CAP_ID_SHPC   0x0C/* PCI Standard Hot-Plug Controller */
 #define  PCI_CAP_ID_SSVID  0x0D/* Bridge subsystem vendor/device ID */
 #define  PCI_CAP_ID_AGP3   0x0E/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV 0x0F/* Secure Device */
 #define  PCI_CAP_ID_EXP0x10/* PCI Express */
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
+#define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAX		PCI_CAP_ID_AF
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -276,6 +280,7 @@
 #define  PCI_VPD_ADDR_MASK 0x7fff  /* Address mask */
#define  PCI_VPD_ADDR_F		0x8000	/* Write 0, 1 indicates completion */
 #define PCI_VPD_DATA   4   /* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF	8
 
 /* Slot Identification */
 
@@ -297,8 +302,10 @@
 #define PCI_MSI_ADDRESS_HI 8   /* Upper 32 bits (if 
PCI_MSI_FLAGS_64BIT set) */
 #define PCI_MSI_DATA_328   /* 16 bits of data for 32-bit 
devices */
 #define PCI_MSI_MASK_3212  /* Mask bits register for 
32-bit devices */
+#define PCI_MSI_PENDING_32 16  /* Pending intrs for 32-bit devices */
 #define PCI_MSI_DATA_6412  /* 16 bits of data for 64-bit 
devices */
 #define PCI_MSI_MASK_6416  /* Mask bits register for 
64-bit devices */
+#define PCI_MSI_PENDING_64 20  /* Pending intrs for 64-bit devices */
 
 /* MSI-X registers (these are at offset PCI_MSIX_FLAGS) */
 #define PCI_MSIX_FLAGS 2
@@ -306,6 +313,7 @@
#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15)
#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
#define PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
+#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX registers */
 
 /* CompactPCI Hotswap Register */
 
@@ -328,6 +336,7 @@
 #define  PCI_AF_CTRL_FLR   0x01
 #define PCI_AF_STATUS  5
 #define  PCI_AF_STATUS_TP  0x01
+#define PCI_CAP_AF_SIZEOF	6	/* size of AF registers */
 
 /* PCI-X registers */
 
@@ -364,6 +373,9 @@
 #define  PCI_X_STATUS_SPL_ERR  0x2000  /* Rcvd Split Completion Error 
Msg */
 #define  PCI_X_STATUS_266MHZ   0x4000  /* 266 MHz capable */
 #define  PCI_X_STATUS_533MHZ   0x8000  /* 533 MHz capable */
+#define PCI_X_ECC_CSR		8	/* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0	8	/* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12	24	/* size for Version 1 & 2 */
 
 /* PCI Bridge Subsystem ID registers */
 
@@ -451,6 +463,7 @@
 #define  PCI_EXP_LNKSTA_DLLLA  0x2000  /* Data Link Layer Link Active */
 #define  PCI_EXP_LNKSTA_LBMS   0x4000  /* Link Bandwidth Management Status */
 #define  PCI_EXP_LNKSTA_LABS   0x8000  /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20  /* v1 endpoints end here */
 #define PCI_EXP_SLTCAP 20  /* Slot Capabilities */
 #define  PCI_EXP_SLTCAP_ABP0x0001 /* Attention Button Present */
 #define  PCI_EXP_SLTCAP_PCP0x0002 /* Power Controller Present */
@@ -498,6 +511,7 @@
 #define  PCI_EXP_DEVCAP2_ARI   0x20/* Alternative Routing-ID */
 #define PCI_EXP_DEVCTL240  /* Device Control 2 */
 #define  PCI_EXP_DEVCTL2_ARI   0x20/* Alternative Routing-ID */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44  /* v2 endpoints end here */
 #define PCI_EXP_LNKCTL248  /* Link Control 2 */
 #define PCI_EXP_SLTCTL256  /* Slot Control 2 */
 
@@ -506,20 +520,42 @@
#define PCI_EXT_CAP_VER(header)		((header >> 16) & 0xf)
#define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)
 
-#define PCI_EXT_CAP_ID_ERR 1
-#define PCI_EXT_CAP_ID_VC  2
-#define PCI_EXT_CAP_ID_DSN 3
-#define PCI_EXT_CAP_ID_PWR 4
-#define PCI_EXT_CAP_ID_VNDR11
-#define PCI_EXT_CAP_ID_ACS 13
-#define PCI_EXT_CAP_ID_ARI 14
-#define PCI_EXT_CAP_ID_ATS 15
-#define PCI_EXT_CAP_ID_SRIOV   

Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Shirley Ma
On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote:
 Also I found a big TX regression for old guest and new guest. For old
 guest, I am able to get almost 11Gb/s for 2K message size, but for the
 new guest kernel, I can only get 3.5 Gb/s with the patch and same
 host.
 I will dig it why. 

The regression is in the guest kernel, not in this patch. I tested the
2.6.31 kernel; its performance is already less than 2 Gb/s for 2K message
size. I will resubmit the patch for review.

I will start testing from the 2.6.30 kernel to figure out when the TX
regression was introduced in virtio_net. Any suggestion which guest kernels
I should test to track down this regression?

Thanks
Shirley



Re: [Qemu-devel] Hitting 29 NIC limit (+Intel VT-c)

2010-10-28 Thread linux_kvm
On Thu, 14 Oct 2010 14:07 +0200, Avi Kivity a...@redhat.com wrote:
   On 10/14/2010 12:54 AM, Anthony Liguori wrote:
  On 10/13/2010 05:32 PM, Anjali Kulkarni wrote:

 What's the motivation for such a huge number of interfaces?

Ultimately to bring multiple 10Gb bonds into a Vyatta guest.

---

 BTW, I don't think it's possible to hot-add physical functions.  I 
 believe I know of a card that supports dynamic add of physical functions 
 (pre-dating SR-IOV)

I don't know what you're talking about, but it seems you have a better
handle than I on this VT-c stuff, so perhaps misguidedly I'll direct my
next question to you.

Is additional configuration required to make use of SR-IOV & VTq?
I don't immediately understand how the queueing knows who is who in the
absence of eth.vlan - or if I need to, for that matter.

My hope is that this is something like plug n play as long as kernel,
host & driver versions are right, but I haven't yet found documentation
to confirm it.

For the sake of future queries, I've come across these references so
far:

http://download.intel.com/design/network/applnots/321211.pdf
http://www.linux-kvm.org/wiki/images/6/6a/KvmForum2008%24kdf2008_7.pdf
http://www.mail-archive.com/kvm@vger.kernel.org/msg27860.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg22721.html
http://thread.gmane.org/gmane.linux.kernel.mm/38508
http://ark.intel.com/Product.aspx?id=36918


Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Sridhar Samudrala
On Thu, 2010-10-28 at 13:13 -0700, Shirley Ma wrote:
 On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote:
  Also I found a big TX regression for old guest and new guest. For old
  guest, I am able to get almost 11Gb/s for 2K message size, but for the
  new guest kernel, I can only get 3.5 Gb/s with the patch and same
  host.
  I will dig it why. 
 
 The regression is from guest kernel, not from this patch. Tested 2.6.31
 kernel, it's performance is less than 2Gb/s for 2K message size already.
 I will resubmit the patch for review. 
 
 I will start to test from 2.6.30 kernel to figure it when TX regression
 induced in virtio_net. Any suggestion which guest kernel I should test
 to figure out this regression?

It would be some change in virtio-net driver that may have improved the
latency of small messages which in turn would have reduced the bandwidth
as TCP could not accumulate and send large packets.

Thanks
Sridhar



Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Shirley Ma
On Thu, 2010-10-28 at 14:04 -0700, Sridhar Samudrala wrote:
 It would be some change in virtio-net driver that may have improved
 the
 latency of small messages which in turn would have reduced the
 bandwidth
 as TCP could not accumulate and send large packets.

I will check out any latency improvement patches in virtio_net. If that's
the case, would it be good to have a tunable parameter to benefit both BW
and latency workloads?

Shirley 



[RFC][PATCH 0/3] KVM page cache optimization (v3)

2010-10-28 Thread Balbir Singh
This is version 3 of the page cache control patches

From: Balbir Singh bal...@linux.vnet.ibm.com

This series has three patches, the first controls
the amount of unmapped page cache usage via a boot
parameter and sysctl. The second patch controls page
and slab cache via the balloon driver. Both the patches
make heavy use of the zone_reclaim() functionality
already present in the kernel.

The last patch in the series is against QEmu to make
the ballooning hint optional.

V2 was posted a long time back (see http://lwn.net/Articles/391293/)
One of the review suggestions was to make the hint optional
(discussed in the community call as well).

I'd appreciate any test results with the patches.

TODO

1. libvirt exploits for optional hint

page-cache-control
balloon-page-cache
provide-memory-hint-during-ballooning

---
 b/balloon.c   |   18 +++-
 b/balloon.h   |4
 b/drivers/virtio/virtio_balloon.c |   17 +++
 b/hmp-commands.hx |7 +
 b/hw/virtio-balloon.c |   14 ++-
 b/hw/virtio-balloon.h |3
 b/include/linux/gfp.h |8 +
 b/include/linux/mmzone.h  |2
 b/include/linux/swap.h|3
 b/include/linux/virtio_balloon.h  |3
 b/mm/page_alloc.c |9 +-
 b/mm/vmscan.c |  162 --
 b/qmp-commands.hx |7 -
 include/linux/swap.h  |9 --
 mm/page_alloc.c   |3
 mm/vmscan.c   |2
 16 files changed, 202 insertions(+), 69 deletions(-)


-- 
Three Cheers,
Balbir


[RFC][PATCH 1/3] Linux/Guest unmapped page cache control

2010-10-28 Thread Balbir Singh
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache
too aggressively or too frequently, while still providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
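As a sketch of the administrator workflow described above (the boot-parameter name is taken from this description; the grub and sysctl values are hypothetical examples, not part of the patch):

```shell
# /etc/default/grub inside the guest -- enable unmapped page cache control
# via the boot option mentioned above (assumed spelling):
GRUB_CMDLINE_LINUX="console=ttyS0 unmapped_page_control"

# /etc/sysctl.conf -- bound how aggressively unmapped pages are reclaimed
# (min_unmapped_ratio is moved out of CONFIG_NUMA by this patch):
# vm.min_unmapped_ratio = 1
```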

Host Usage without boot parameter (memory in KB)

MemFree Cached Time
19900   292912 137
17540   296196 139
17900   296124 141
19356   296660 141

Host usage:  (memory in KB)

RSS Cache   mapped  swap
2788664 781884  3780359536

Guest Usage with boot parameter (memory in KB)
-
Memfree Cached   Time
244824  74828   144
237840  81764   143
235880  83044   138
239312  80092   148

Host usage: (memory in KB)

RSS Cache   mapped  swap
2700184 958012  334848  398412

TODOS
-
1. Balance slab cache as well

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 include/linux/mmzone.h |2 -
 include/linux/swap.h   |3 +
 mm/page_alloc.c|9 ++-
 mm/vmscan.c|  162 
 4 files changed, 132 insertions(+), 44 deletions(-)


diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..a591a7a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -300,12 +300,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
+   unsigned long   min_unmapped_pages;
 #ifdef CONFIG_NUMA
int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
-   unsigned long   min_unmapped_pages;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7cdd633..5d29097 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
 
+extern int sysctl_min_unmapped_ratio;
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..d8fe29f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1642,6 +1642,9 @@ zonelist_scan:
unsigned long mark;
int ret;
 
+   if (should_balance_unmapped_pages(zone))
+   wakeup_kswapd(zone, order);
+
 	mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4101,10 +4104,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	zone->spanned_pages = size;
 	zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-   

[RFC][PATCH 2/3] Linux/Guest cooperative unmapped page cache control

2010-10-28 Thread Balbir Singh
Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped
page cache pages first. It looks for low-hanging fruit, trying to reclaim
clean unmapped pages before anything else.

This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA
and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed
in the gfp_mask. The virtio balloon driver has been changed to use
__GFP_FREE_CACHE. During fill_balloon(), the driver looks for hints
provided by the hypervisor to reclaim cached memory. By default the hint
is off and can be turned on by passing an argument that specifies that
we intend to reclaim cached memory.

Tests:

Test 1
--
I ran a simple filter function that repeatedly ballooned a single VM
running kernbench. The VM was configured with 2GB of memory and 2 VCPUs.
The filter was a triangular wave function that continuously ballooned
the VM under study between 500MB and 1500MB. The run times of the VM
with and without the changes are shown below; they showed no significant
impact from the changes.

With changes

Elapsed Time 223.86 (1.52822)
User Time 191.01 (0.65395)
System Time 199.468 (2.43616)
Percent CPU 174 (1)
Context Switches 103182 (595.05)
Sleeps 39107.6 (1505.67)

Without changes

Elapsed Time 225.526 (2.93102)
User Time 193.53 (3.53626)
System Time 199.832 (3.26281)
Percent CPU 173.6 (1.14018)
Context Switches 103744 (1311.53)
Sleeps 39383.2 (831.865)

The key advantage was lower RSS usage in the host and higher cached
usage, indicating that the caching had been pushed towards
the host. Cached memory usage in the guest was lower, and free memory in
the guest was also higher.

Test 2
--
I ran kernbench under the memory overcommit manager (6 VMs with 2 vCPUs and 2GB
each) with KSM and ksmtuned enabled. Memory overcommit manager details are at
http://github.com/aglitke/mom/wiki. The command line for kernbench was
kernbench -M.

The tests showed the following

With changes

Elapsed Time 842.936 (12.2247)
Elapsed Time 844.266 (25.8047)
Elapsed Time 844.696 (11.2433)
Elapsed Time 846.08 (14.0249)
Elapsed Time 838.58 (7.44609)
Elapsed Time 842.362 (4.37463)

Without changes

Elapsed Time 837.604 (14.1311)
Elapsed Time 839.322 (17.1772)
Elapsed Time 843.744 (9.21541)
Elapsed Time 842.592 (7.48622)
Elapsed Time 844.272 (25.486)
Elapsed Time 838.858 (7.5044)

General observations

1. Free memory in each of the guests was higher with the changes;
   the additional free memory was of the order of 120MB per VM
2. Cached memory in each guest was lower with the changes
3. Host free memory was almost constant (independent of the
   changes)
4. Host anonymous memory usage was lower with the changes

The goal of this patch is to free up memory locked up in
duplicated cache contents, and (1) above shows that we are
able to free it up successfully.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 drivers/virtio/virtio_balloon.c |   17 +++--
 include/linux/gfp.h |8 +++-
 include/linux/swap.h|9 +++--
 include/linux/virtio_balloon.h  |3 +++
 mm/page_alloc.c |3 ++-
 mm/vmscan.c |2 +-
 6 files changed, 31 insertions(+), 11 deletions(-)


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0f1da45..70f97ea 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -99,12 +99,24 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+   u32 reclaim_cache_first;
+   int err;
+   gfp_t mask = GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC |
+   __GFP_NOWARN;
+
+	err = virtio_config_val(vb->vdev, VIRTIO_BALLOON_F_BALLOON_HINT,
+   offsetof(struct virtio_balloon_config,
+   reclaim_cache_first),
+   reclaim_cache_first);
+
+	if (!err && reclaim_cache_first)
+   mask |= __GFP_FREE_CACHE;
+
/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   struct page *page = alloc_page(mask);
if (!page) {
if (printk_ratelimit())
 			dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -358,6 +370,7 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_BALLOON_HINT,
 };
 
 static struct 

[RFC][PATCH 3/3] QEmu changes to provide balloon hint

2010-10-28 Thread Balbir Singh
Provide memory hint during ballooning

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch adds an optional hint to the qemu monitor balloon
command. The hint tells the guest operating system to consider
a class of memory during reclaim. Currently the supported
hint is cached memory. The design is generic and can be extended
to provide other hints in the future if required.
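As a usage sketch, the extended HMP command would look like this hypothetical monitor session (not runnable outside a QEMU monitor):

```
(qemu) balloon 512 cache    # target 512 MB; hint the guest to drop page cache first
(qemu) balloon 1024         # no hint; existing behaviour
```

The optional `cache` token maps to the new `hint:s?` entry in `args_type`, so omitting it keeps the command backward compatible.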

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 balloon.c   |   18 ++
 balloon.h   |4 +++-
 hmp-commands.hx |7 +--
 hw/virtio-balloon.c |   15 +++
 hw/virtio-balloon.h |3 +++
 qmp-commands.hx |7 ---
 6 files changed, 40 insertions(+), 14 deletions(-)


diff --git a/balloon.c b/balloon.c
index 0021fef..b2bdda5 100644
--- a/balloon.c
+++ b/balloon.c
@@ -41,11 +41,13 @@ void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque)
 qemu_balloon_event_opaque = opaque;
 }
 
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque)
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+ MonitorCompletion cb, void *opaque)
 {
 if (qemu_balloon_event) {
 trace_balloon_event(qemu_balloon_event_opaque, target);
-qemu_balloon_event(qemu_balloon_event_opaque, target, cb, opaque);
+qemu_balloon_event(qemu_balloon_event_opaque, target,
+   reclaim_cache_first, cb, opaque);
 return 1;
 } else {
 return 0;
@@ -55,7 +57,7 @@ int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque)
 int qemu_balloon_status(MonitorCompletion cb, void *opaque)
 {
 if (qemu_balloon_event) {
-qemu_balloon_event(qemu_balloon_event_opaque, 0, cb, opaque);
+qemu_balloon_event(qemu_balloon_event_opaque, 0, 0, cb, opaque);
 return 1;
 } else {
 return 0;
@@ -131,13 +133,21 @@ int do_balloon(Monitor *mon, const QDict *params,
   MonitorCompletion cb, void *opaque)
 {
 int ret;
+int val;
+const char *cache_hint;
+int reclaim_cache_first = 0;
 
     if (kvm_enabled() && !kvm_has_sync_mmu()) {
         qerror_report(QERR_KVM_MISSING_CAP, "synchronous MMU", "balloon");
 return -1;
 }
 
-    ret = qemu_balloon(qdict_get_int(params, "value"), cb, opaque);
+    val = qdict_get_int(params, "value");
+    cache_hint = qdict_get_try_str(params, "hint");
+    if (cache_hint)
+        reclaim_cache_first = 1;
+
+    ret = qemu_balloon(val, reclaim_cache_first, cb, opaque);
 if (ret == 0) {
         qerror_report(QERR_DEVICE_NOT_ACTIVE, "balloon");
 return -1;
diff --git a/balloon.h b/balloon.h
index d478e28..65d68c1 100644
--- a/balloon.h
+++ b/balloon.h
@@ -17,11 +17,13 @@
 #include "monitor.h"
 
 typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target,
+bool reclaim_cache_first,
 MonitorCompletion cb, void *cb_data);
 
 void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque);
 
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque);
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+ MonitorCompletion cb, void *opaque);
 
 int qemu_balloon_status(MonitorCompletion cb, void *opaque);
 
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 81999aa..80e42aa 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -925,8 +925,8 @@ ETEXI
 
 {
         .name       = "balloon",
-        .args_type  = "value:M",
-        .params     = "target",
+        .args_type  = "value:M,hint:s?",
+        .params     = "target [cache]",
         .help       = "request VM to change its memory allocation (in MB)",
 .user_print = monitor_user_noop,
 .mhandler.cmd_async = do_balloon,
@@ -937,6 +937,9 @@ STEXI
 @item balloon @var{value}
 @findex balloon
 Request VM to change its memory allocation to @var{value} (in MB).
+An optional @var{hint} can be specified to indicate if the guest
+should reclaim from the cached memory in the guest first. The
+@var{hint} may be ignored by the guest.
 ETEXI
 
 {
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 8adddea..e363507 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -44,6 +44,7 @@ typedef struct VirtIOBalloon
 size_t stats_vq_offset;
 MonitorCompletion *stats_callback;
 void *stats_opaque_callback_data;
+uint32_t reclaim_cache_first;
 } VirtIOBalloon;
 
 static VirtIOBalloon *to_virtio_balloon(VirtIODevice *vdev)
@@ -181,8 +182,11 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
 
     config.num_pages = cpu_to_le32(dev->num_pages);
     config.actual = cpu_to_le32(dev->actual);
-
-    memcpy(config_data, &config, 8);
+    if (vdev->guest_features & (1 << VIRTIO_BALLOON_F_BALLOON_HINT)) {
+        config.reclaim_cache_first = cpu_to_le32(dev->reclaim_cache_first);
+        memcpy(config_data, &config, 12);
+    } else
+        memcpy(config_data, &config, 8);
 

[PATCH iproute2] Support 'mode' parameter when creating macvtap device

2010-10-28 Thread Sridhar Samudrala
Add support for the 'mode' parameter when creating a macvtap device.
This allows a macvtap device to be created in bridge, private, or
the default vepa mode.
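For illustration, a hypothetical invocation with this patch applied (device and interface names are examples only):

```
# ip link add link eth0 name macvtap0 type macvtap mode vepa
# ip link add link eth0 name macvtap1 type macvtap mode bridge
```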

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

diff --git a/ip/Makefile b/ip/Makefile
index 2f223ca..6054e8a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
 ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
 ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
 iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-iplink_macvlan.o
+iplink_macvlan.o iplink_macvtap.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
new file mode 100644
index 000..35199b1
--- /dev/null
+++ b/ip/iplink_macvtap.c
@@ -0,0 +1,90 @@
+/*
+ * iplink_macvtap.c	macvtap device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+		"Usage: ... macvtap mode { private | vepa | bridge }\n"
+	);
+}
+
+static int mode_arg(void)
+{
+	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
+		"\"vepa\" or \"bridge\"\n");
+return -1;
+}
+
+static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
+ struct nlmsghdr *n)
+{
+   while (argc  0) {
+		if (matches(*argv, "mode") == 0) {
+   __u32 mode = 0;
+   NEXT_ARG();
+
+			if (strcmp(*argv, "private") == 0)
+				mode = MACVLAN_MODE_PRIVATE;
+			else if (strcmp(*argv, "vepa") == 0)
+				mode = MACVLAN_MODE_VEPA;
+			else if (strcmp(*argv, "bridge") == 0)
+				mode = MACVLAN_MODE_BRIDGE;
+   else
+   return mode_arg();
+
+   addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+		} else if (matches(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+   __u32 mode;
+
+   if (!tb)
+   return;
+
+   if (!tb[IFLA_MACVLAN_MODE] ||
+	    RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
+   return;
+
+	mode = *(__u32 *)RTA_DATA(tb[IFLA_MACVLAN_MODE]);
+	fprintf(f, " mode %s ",
+		  mode == MACVLAN_MODE_PRIVATE ? "private"
+		: mode == MACVLAN_MODE_VEPA    ? "vepa"
+		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: "unknown");
+}
+
+struct link_util macvtap_link_util = {
+	.id		= "macvtap",
+   .maxattr= IFLA_MACVLAN_MAX,
+   .parse_opt  = macvtap_parse_opt,
+   .print_opt  = macvtap_print_opt,
+};




[PATCH iproute2] macvlan/macvtap: support 'passthru' mode

2010-10-28 Thread Sridhar Samudrala
Add support for 'passthru' mode when creating a macvlan/macvtap device.
This mode allows taking over the underlying device and passing it to a
KVM guest using virtio with a macvtap backend.

Only one macvlan device is allowed in passthru mode. It inherits the
MAC address from the underlying device and puts that device in
promiscuous mode to receive and forward all packets.
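A hypothetical usage sketch (interface names are examples; note that the underlying device is taken over, so only one such macvtap can exist on it):

```
# ip link add link eth0 name macvtap0 type macvtap mode passthru
```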

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f5bb2dc..23de79e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -230,6 +230,7 @@ enum macvlan_mode {
MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
MACVLAN_MODE_VEPA= 2, /* talk to other ports through ext bridge */
MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+   MACVLAN_MODE_PASSTHRU  = 8, /* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */
diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
index a3c78bd..15022aa 100644
--- a/ip/iplink_macvlan.c
+++ b/ip/iplink_macvlan.c
@@ -23,14 +23,14 @@
 static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvlan mode { private | vepa | bridge }\n"
+		"Usage: ... macvlan mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 return -1;
 }
 
@@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
else
return mode_arg();
 
@@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
 		  mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA    ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
 		: "unknown");
 }
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
index 35199b1..5665b6d 100644
--- a/ip/iplink_macvtap.c
+++ b/ip/iplink_macvtap.c
@@ -20,14 +20,14 @@
 static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvtap mode { private | vepa | bridge }\n"
+		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 return -1;
 }
 
@@ -45,6 +45,8 @@ static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
else
return mode_arg();
 
@@ -79,6 +81,7 @@ static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
 		  mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA    ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
 		: "unknown");
 }
 




Re: KVM devices assignment; PCIe AER?

2010-10-28 Thread Etienne Martineau


On Wed, 27 Oct 2010, Alex Williamson wrote:

KVM already has an internal IRQ ACK notifier (which is what current
device assignment uses to do the same thing), it's just a matter of
adding a callback that does a kvm_register_irq_ack_notifier that sends
off the eventfd signal.  I've got this working and will probably send
out the KVM patch this week.  For now the eventfd goes to userspace, but
this is where I imagine we could steal some of the irqfd code to make
VFIO consume the irqfd signal directly.  Thanks,


Thanks for the clarification. I must admit I was somewhat confused about 
the irqfd mechanism until I realized that all it does is consume an 
eventfd from kernel context (as you pointed out earlier).
So from userspace, I guess this means the same eventfd is going to be 
assigned to both VFIO and KVM, right?


Going back to the original discussion, I think that device assignment 
over VFIO is a great way to support PCIe AER for assigned devices. I'm 
going to spend some time in that direction for sure. In the meantime I'll 
send some patches (shortly) that address the problem without any major 
surgery to the current implementation.


thanks,
-Etienne








KVM Test report, kernel 1414115... qemu 013ddf74...

2010-10-28 Thread Hao, Xudong
Hi, all,
This is the KVM test result against kvm.git 
1414115b34b9ae69d260a2e4e5d2fd6e956b64b9 and qemu-kvm.git 
013ddf74dd9ac698d0206effdf268c8768959099.

Currently qemu-kvm has a build failure on RHEL5 systems; this issue has 
existed for about a month, so we build qemu-kvm on RHEL5u1 with a workaround 
patch (see the attached mail).
The slow Linux guest boot issue got fixed. However, we found a regression: 
a 32-bit PAE Windows guest cannot boot up with ACPI disabled.

Fixed issue:
1. [KVM] Linux guest is too slow to boot up
https://bugzilla.kernel.org/show_bug.cgi?id=17882

New issue
1. [KVM] Noacpi Windows guest can not boot up on 32bit KVM host
https://bugzilla.kernel.org/show_bug.cgi?id=21402


Four old Issues:

1. ltp diotest running time is 2.54 times longer than before
https://sourceforge.net/tracker/?func=detailaid=2723366group_id=180599atid=893831
2. 32bits Rhel5/FC6 guest may fail to reboot after installation
https://sourceforge.net/tracker/?func=detailatid=893831aid=1991647group_id=180599
3. perfctr wrmsr warning when booting 64bit RHEl5.3
https://sourceforge.net/tracker/?func=detailaid=2721640group_id=180599atid=893831
4. [SR] qemu takes a long time to return from the migrate command
https://sourceforge.net/tracker/?func=detailaid=2942079group_id=180599atid=893831


Test environment

Platform  A
Westmere-EP
CPU 8
Memory size 12G

=========================================================
           Summary Test Report of Last Session
=========================================================
                          Total  Pass  Fail  NoResult  Crash
=========================================================
control_panel_ept_vpid      12    12     0      0        0
control_panel_vpid           3     3     0      0        0
control_panel                3     3     0      0        0
control_panel_ept            4     4     0      0        0
gtest_vpid                   1     1     0      0        0
gtest_ept                    1     1     0      0        0
gtest                        3     3     0      0        0
vtd_ept_vpid                 8     8     0      0        0
gtest_ept_vpid              11    11     0      0        0
sriov_ept_vpid               5     5     0      0        0
=========================================================
control_panel_ept_vpid      12    12     0      0        0
 :KVM_LM_Continuity_64_g3    1     1     0      0        0
 :KVM_four_dguest_64_g32e    1     1     0      0        0
 :KVM_1500M_guest_64_gPAE    1     1     0      0        0
 :KVM_LM_SMP_64_g32e         1     1     0      0        0
 :KVM_SR_SMP_64_g32e         1     1     0      0        0
 :KVM_linux_win_64_g32e      1     1     0      0        0
 :KVM_1500M_guest_64_g32e    1     1     0      0        0
 :KVM_two_winxp_64_g32e      1     1     0      0        0
 :KVM_256M_guest_64_gPAE     1     1     0      0        0
 :KVM_SR_Continuity_64_g3    1     1     0      0        0
 :KVM_256M_guest_64_g32e     1     1     0      0        0
 :KVM_four_sguest_64_g32e    1     1     0      0        0
control_panel_vpid           3     3     0      0        0
 :KVM_linux_win_64_g32e      1     1     0      0        0
 :KVM_1500M_guest_64_g32e    1     1     0      0        0
 :KVM_1500M_guest_64_gPAE    1     1     0      0        0
control_panel                3     3     0      0        0
 :KVM_1500M_guest_64_g32e    1     1     0      0        0
 :KVM_1500M_guest_64_gPAE    1     1     0      0        0
 :KVM_LM_SMP_64_g32e         1     1     0      0        0
control_panel_ept            4     4     0      0        0
 :KVM_linux_win_64_g32e      1     1     0      0        0
 :KVM_1500M_guest_64_g32e    1     1     0      0        0
 :KVM_1500M_guest_64_gPAE    1     1     0      0        0
 :KVM_LM_SMP_64_g32e         1     1     0      0        0
gtest_vpid                   1     1     0      0        0
 :boot_smp_win7_ent_64_g3    1     1     0      0        0
gtest_ept                    1     1     0      0        0
 :boot_smp_win7_ent_64_g3    1     1     0      0        0
gtest                        3     3     0      0        0
 :boot_smp_win2008_64_g32    1     1     0      0        0
 :boot_smp_win7_ent_64_gP    1     1     0      0        0
 :boot_smp_vista_64_g32e     1     1     0      0        0
vtd_ept_vpid                 8     8     0      0        0
 :one_pcie_up_64_g32e        1     1     0      0        0
 :hp_pcie_smp_nomsi_64_g3    1     1     0      0        0
 :lm_pcie_smp_64_g32e        1     1     0      0        0