Re: pci_enable_msix() fails with ENOMEM/EINVAL

2012-11-29 Thread Alex Lyakas

Hi Alex,
did I understand correctly that the vector value with which
request_threaded_irq() is called is *not* supposed to be zero? Because in my
case it is always zero, and still the failure I observe is not happening
always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each
attached PCI device of each KVM VM.

I will try to repro this like you suggested and let you know.

Thanks for your help,
Alex.


-----Original Message-----
From: Alex Williamson

Sent: 26 November, 2012 10:04 PM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote:

Hi Alex,
thanks for your response.

I printed out the vector and entry values of dev->host_msix_entries[i]
within assigned_device_enable_host_msix() before the call to
request_threaded_irq(). I see that they are all 0s:
kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924
kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix()
assigned_dev_id=924 #0: [v=0 e=0]
kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix()
assigned_dev_id=924 #1: [v=0 e=1]
kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix()
assigned_dev_id=924 #2: [v=0 e=2]

So I don't really understand how they all ask for irq=0; I must be missing
something. Is there any other explanation for request_threaded_irq()
returning EBUSY? From the code I don't see that there is.


The vectors all being zero sounds like an indication that
pci_enable_msix() didn't actually work.  Each of those should be a
unique vector.   Does booting the host with nointremap perhaps make a
difference?  Maybe we can isolate the problem to the interrupt remapper
code.


This issue is reproducible and is not going to go away by itself. Working
around it is also problematic. We thought to check whether all IRQs are
properly attached after QEMU sets the VM state to running. However, the VM
state is set to running before the IRQ attachments are performed; we
debugged this and found out that they are done from a different thread,
from a stack trace like this:
kvm_assign_irq()
assigned_dev_update_msix()
assigned_dev_pci_write_config()
pci_host_config_write_common()
pci_data_write()
pci_host_data_write()
memory_region_write_accessor()
access_with_adjusted_size()
memory_region_iorange_write()
ioport_writew_thunk()
ioport_write()
cpu_outw()
kvm_handle_io()
kvm_cpu_exec()
qemu_kvm_cpu_thread_fn()

So it looks like this is performed on demand (on first I/O), so there is no
reliable point at which to check that IRQs are attached properly.


Correct, MSI-X is setup when the guest enables MSI-X on the device,
which is likely a long way into guest boot.  There's no guarantee that
the guest ever enables MSI-X, so there's no association to whether the
guest is running.


Another issue is that in KVM code the return value of
pci_host_config_write_common() is not checked, so there is no way to
report a failure.


A common problem in qemu, imho


Is there any way you think you can help me debug this further?


It seems like pci_enable_msix is still failing, but perhaps silently
without irqbalance.  We need to figure out where and why.  Isolating it
to the interrupt remapper with nointremap might give us some clues
(this is an Intel VT-d system, right?).  Thanks,

Alex


-----Original Message-----
From: Alex Williamson

Sent: 22 November, 2012 12:25 AM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote:
 Hi,
 I was advised to turn off irqbalance and reproduced this issue, but
 the failure is in a different place now. Now request_threaded_irq()
 fails with EBUSY.
 According to the code, this can only happen on the path:
 request_threaded_irq() -> __setup_irq()
 Now in __setup_irq(), the only place where EBUSY can show up for us is here:
 ...
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	old_ptr = &desc->action;
 	old = *old_ptr;
 	if (old) {
 		/*
 		 * Can't share interrupts unless both agree to and are
 		 * the same type (level, edge, polarity). So both flag
 		 * fields must have IRQF_SHARED set and the bits which
 		 * set the trigger type must match. Also all must
 		 * agree on ONESHOT.
 		 */
 		if (!((old->flags & new->flags) & IRQF_SHARED) ||
 		    ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) ||
 		    ((old->flags ^ new->flags) & IRQF_ONESHOT)) {
 			old_name = old->name;
 			goto mismatch;
 		}
 
 		/* All handlers must agree on per-cpuness */
 		if ((old->flags & IRQF_PERCPU) !=
 		    (new->flags & IRQF_PERCPU))
 			goto mismatch;

 KVM calls request_threaded_irq() with flags==0, so can it be that
 different KVM processes request the same IRQ?

Shouldn't be possible, irqs are allocated from a bitmap protected by a
mutex, see __irq_alloc_descs

  How different KVM
 processes spawned simultaneously agree between them on IRQ numbers?

They don't, MSI/X vectors are not currently share-able.  Can you show
that you're actually getting duplicate irq vectors?  Thanks,

Alex

--

Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios

2012-11-29 Thread Raghavendra K T

On 11/29/2012 07:37 AM, Chegu Vinod wrote:

On 11/26/2012 4:07 AM, Raghavendra K T wrote:

  In some special scenarios like #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to, burning CPU.

  The first patch optimizes all the yield_to by bailing out when there
  is no need to continue in yield_to (i.e., when there is only one task
  in source and target rq).

  Second patch uses that in the PLE handler. Further, when a yield_to fails
  we do not immediately go out of the PLE handler; instead we try thrice
  to have a better statistical possibility of a false return. Otherwise that
  would affect moderate overcommit cases.
  Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
  around 51% for dbench 1x with a 32-core PLE machine and a 32-vcpu guest.


base = 3.7.0-rc6
machine: 32 core mx3850 x5 PLE mc

--+---+---+---+---+
 ebizzy (rec/sec, higher is better)
--+---+---+---+---+
         base       stdev     patched       stdev    %improve
--+---+---+---+---+
1x  2511.3000     21.5409   6051.8000    170.2592   140.98276
2x  2679.4000    332.4482   2692.3000    251.4005     0.48145
3x  2253.5000    266.4243   2192.1667    178.9753    -2.72169
4x  1784.3750    102.2699   2018.7500    187.5723    13.13485
--+---+---+---+---+

--+---+---+---+---+
 dbench (throughput in MB/sec, higher is better)
--+---+---+---+---+
         base       stdev     patched       stdev    %improve
--+---+---+---+---+
1x  6677.4080    638.5048  10098.0060   3449.7026    51.22643
2x  2012.6760     64.7642   2019.0440     62.6702     0.31639
3x  1302.0783     40.8336   1292.7517     27.0515    -0.71629
4x  3043.1725   3243.7281   4664.4662   5946.5741    53.27643
--+---+---+---+---+

Here is the reference no-PLE result.
  ebizzy-1x_nople 7592.6000 rec/sec
  dbench_1x_nople 7853.6960 MB/sec

The result says we can still improve by 60% for ebizzy, but overall we are
getting impressive performance with the patches.

  Changes since V2:
  - Dropped the global measures usage patch (Peter Zijlstra)
  - Do not bail out on first failure (Avi Kivity)
  - Try thrice on yield_to failure to get statistically more correct
    behaviour.

  Changes since V1:
  - Discard the idea of exporting nr_running and optimize in the core
    scheduler (Peter)
  - Use yield() instead of schedule() in overcommit scenarios (Rik)
  - Use loadavg knowledge to detect undercommit/overcommit

  Peter Zijlstra (1):
   Bail out of yield_to when source and target runqueue has one task

  Raghavendra K T (1):
   Handle yield_to failure return for potential undercommit case

  Please let me know your comments and suggestions.

  Link for V2:
  https://lkml.org/lkml/2012/10/29/287

  Link for V1:
  https://lkml.org/lkml/2012/9/21/168

  kernel/sched/core.c | 25 +++--
  virt/kvm/kvm_main.c | 26 --
  2 files changed, 35 insertions(+), 16 deletions(-)



Tested-by: Chegu Vinod chegu_vi...@hp.com



Thanks for testing..

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance issue

2012-11-29 Thread Vadim Rozenfeld
On Wednesday, November 28, 2012 09:09:29 PM George-Cristian Bîrzan wrote:
 On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com 
wrote:
  On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote:
  On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com
  
  wrote:
   I have some code which does both reference time and invariant TSC, but it
   will not work after migration. I will send it later today.
  
  Do you mean migrating guests? This is not an issue for us.
  
  OK, but don't say I didn't warn you :)
  
   There are two patches, one for kvm and another one for qemu.
   You will probably need to rebase them.
   Add the hv_tsc cpu parameter to activate this feature.
   You will probably need to deactivate hpet by adding the -no-hpet
   parameter as well.
 
 I've also added +hv_relaxed since then, but this is the command I'm

I would suggest activating relaxed timing for all W2K8R2/Win7 guests.

 using now and there's no change:
 
 /usr/bin/qemu-kvm -name b691546e-79f8-49c6-a293-81067503a6ad -S -M
 pc-1.2 -enable-kvm -m 16384 -smp 9,sockets=1,cores=9,threads=1 -uuid
 b691546e-79f8-49c6-a293-81067503a6ad -no-user-config -nodefaults
 -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/b691546e-79f8-49c6-a293-8
 1067503a6ad.monitor,server,nowait -mon
 chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-hpet -no-shutdown -device
 piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 file=/var/lib/libvirt/images/dis-magnetics-2-223101/d8b233c6-8424-4de9-ae3c
 -7c9a60288514,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,ai
 o=native -device
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=vir
 tio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36
 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=22:2e:fb:a2:36:be,bus=pci.0,addr
 =0x3 -netdev tap,fd=40,id=hostnet1,vhost=on,vhostfd=41 -device
 virtio-net-pci,netdev=hostnet1,id=net1,mac=22:94:44:5a:cb:24,bus=pci.0,addr
 =0x4 -vnc 127.0.0.1:0,password -vga cirrus -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -cpu host,hv_tsc
 
 I compiled qemu-1.2.0-24 after applying your patch, used the head for
 KVM, and I see no difference. I've tried setting windows'
 useplatformclock on and off, no change either.
 
 
 Other than that, was looking into a profiling trace of the software
 running and a lot of time (60%?) is spent calling two functions from
 hal.dll, HalpGetPmTimerSleepModePerfCounter when I disable HPET, and
 HalpHPETProgramRolloverTimer which do point at something related to
 the timers.
 
It means that hyper-v time stamp source was not activated.
 Any other thing I can try?
 
 
 --
 George-Cristian Bîrzan


Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-29 Thread Gleb Natapov
On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
 On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
 
 Don't understand the reasoning behind why 3 is a good choice.
 
 Here is where I came from. (explaining from scratch for
 completeness, forgive me :))
 In moderate overcommits, we can falsely exit from ple handler even when
 we have preempted task of same VM waiting on other cpus. To reduce this
 problem, we try few times before exiting.
 The problem boils down to:
 what is the probability that we exit ple handler even when we have more
 than 1 task in other cpus. The theoretical worst case should be around 1.5x
 overcommit (as also pointed out by Andrew Theurer). [But the practical
 worst case may be around 2x-3x overcommit, as indicated by the results
 for the patch series.]
 
 So if p is the probability of finding rq length one on a particular cpu,
 and if we do n tries, then probability of exiting ple handler is:
 
  p^(n+1) [ because we would have come across one source with rq length
 1 and n target cpu rqs  with length 1 ]
 
 so
 num tries: probability of aborting ple handler (1.5x overcommit)
  1 1/4
  2 1/8
  3 1/16
 
 We can increase this probability with more tries, but the problem is
 the overhead.
IIRC Avi (again) had an idea to track vcpu preemption. When a vcpu thread
is preempted we do kvm->preempted_vcpus++, when it runs again we do
kvm->preempted_vcpus--. The PLE handler can try harder if kvm->preempted_vcpus
is big, or not try at all if it is zero.

 Also, if we have tried three times, that means we would have iterated
 over 3 good eligible vcpus along with many non-eligible candidates. In
 the worst case, if we iterate over all the vcpus, we reduce 1x performance
 and overcommit performance gets hit. [ as in results ]
 
 I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
 concluded 3 is enough.
 
 In fact, I have also run kernbench and hackbench, which are giving 5-20%
 improvement.
 
 [ As a side note , I also thought how about having num_tries = f(n) =
 ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
 overhead and also there is no point in probably making it dependent on
 online cpus ]
 
 Please let me know if you are happy with this rationale, or correct me
 if you foresee some problem. (In fact, Avi's and Rik's concern about false
 exiting made me arrive at the 'try' logic, which I did not have earlier.)
 
 I am currently trying out the result for 1.5x overcommit will post the
 result.
 
 
 On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 yield_to returns -ESRCH when both the source and target run queue
 lengths are one. When we see three successive failures of yield_to we
 assume we are in a potential undercommit case and abort from the PLE
 handler. The assumption is backed by the low probability of a wrong
 decision even for worst-case scenarios such as an average runqueue
 length between 1 and 2.
 
 Note that we do not update the last boosted vcpu in failure cases.
 Thanks to Avi for raising the question about aborting after the first
 failure of yield_to.
 
 Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 [...]

--
Gleb.


Re: Performance issue

2012-11-29 Thread George-Cristian Bîrzan
On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com wrote:
 I've also added +hv_relaxed since then, but this is the command I'm

 I would suggest activating relaxed timing for all W2K8R2/Win7 guests.

Is there any place I can read up on the downsides of this for Linux,
or is it just better?

 Other than that, was looking into a profiling trace of the software
 running and a lot of time (60%?) is spent calling two functions from
 hal.dll, HalpGetPmTimerSleepModePerfCounter when I disable HPET, and
 HalpHPETProgramRolloverTimer which do point at something related to
 the timers.

 It means that hyper-v time stamp source was not activated.

I recompiled the whole kernel, with your patch, and while I cannot
check at 70Mbps now, a test stream of 20 seems to do better. Also, now
I don't see any of those functions, which used to account ~60% of the
time spent by the program. I'm waiting for the customer to come back
and start the 'real' stream, but from my tests, time spent in hal.dll
is now an order of magnitude smaller.

--
George-Cristian Bîrzan


Re: Performance issue

2012-11-29 Thread Gleb Natapov
On Thu, Nov 29, 2012 at 03:45:52PM +0200, George-Cristian Bîrzan wrote:
 On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com wrote:
  I've also added +hv_relaxed since then, but this is the command I'm
 
  I would suggest activating relaxed timing for all W2K8R2/Win7 guests.
 
  Is there any place I can read up on the downsides of this for Linux,
  or is it just better?
 
You shouldn't use hyper-v flags for Linux guests. In theory Linux should
just ignore them; in practice there may be bugs that prevent Linux
from detecting that it runs as a guest, disabling those optimizations.

  Other than that, was looking into a profiling trace of the software
  running and a lot of time (60%?) is spent calling two functions from
  hal.dll, HalpGetPmTimerSleepModePerfCounter when I disable HPET, and
  HalpHPETProgramRolloverTimer which do point at something related to
  the timers.
 
  It means that hyper-v time stamp source was not activated.
 
 I recompiled the whole kernel, with your patch, and while I cannot
 check at 70Mbps now, a test stream of 20 seems to do better. Also, now
 I don't see any of those functions, which used to account ~60% of the
 time spent by the program. I'm waiting for the customer to come back
 and start the 'real' stream, but from my tests, time spent in hal.dll
 is now an order of magnitude smaller.
 
 --
 George-Cristian Bîrzan

--
Gleb.


KVM VMX: register state after reset violates spec

2012-11-29 Thread Julian Stecklina
Hello,

we have noticed that, at least on 3.6.8 with VMX, after a VCPU has been
reset via the INIT-SIPI-SIPI sequence, its register state violates
Intel's specification.

Specifically for our case we see at the end of vmx_vcpu_reset the
following vcpu state:

regs_avail=ffef regs_dirty=00010010
EIP= EAX=06e8 EBX=0001 ECX=8001 EDX=0600
ESI=d238 EDI= EBP= ESP=

although EAX, EBX, ECX, ESI, EDI, EBP, ESP should _all_ be zero. See
http://download.intel.com/products/processor/manual/253668.pdf section
9.1.1 (page 9-2).

Shouldn't vmx_vcpu_reset actively clear those registers? And from a
quick glance at the SVM code the problem might exist there, too.

A workaround is to use qemu-kvm with -no-kvm-irqchip.

Julian



Re: Networking latency - what to expect?

2012-11-29 Thread Julian Stecklina
Thus spake David Mohr damaili...@mcbf.net:

 * vm->vm (same host)   22k

This number is in the same ballpark as what I am seeing on pretty much
the same hardware.

AFAICS, there is little you can do to the current virtio-virtio code
path that would make this substantially faster.

Julian



[PATCH] KVM VMX: Make register state after reset conform to specification.

2012-11-29 Thread Julian Stecklina
Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
---
 arch/x86/kvm/vmx.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ff66a3b..ec5a3b3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3948,7 +3948,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	vmx->soft_vnmi_blocked = 0;
 
-	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(&vmx->vcpu, 0);
 	msr = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&vmx->vcpu))
@@ -3999,6 +3998,14 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
kvm_rip_write(vcpu, 0xfff0);
else
kvm_rip_write(vcpu, 0);
+
+   kvm_register_write(vcpu, VCPU_REGS_RAX, 0);
+   kvm_register_write(vcpu, VCPU_REGS_RBX, 0);
+   kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
+   kvm_register_write(vcpu, VCPU_REGS_RDX, get_rdx_init_val());
+   kvm_register_write(vcpu, VCPU_REGS_RSI, 0);
+   kvm_register_write(vcpu, VCPU_REGS_RDI, 0);
+   kvm_register_write(vcpu, VCPU_REGS_RBP, 0);
kvm_register_write(vcpu, VCPU_REGS_RSP, 0);
 
vmcs_writel(GUEST_DR7, 0x400);
-- 
1.7.11.7



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #4 from liyi yiliker...@gmail.com  2012-11-29 15:38:10 ---
1: I have disabled irqbalance on the host, and the IRQ affinity on the host
is OK; you can see step 4 in comment 0.

2: As you can see in comment 0, we have set the IRQ affinity on the host to
pcpu1 to reduce IPI interrupts, so we want to set the IRQ affinity to vcpu2
on the guest OS according to <vcpupin vcpu='2' cpuset='1'/>, but setting the
IRQ affinity on the guest OS fails when using MSI-X.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.




Re: Networking latency - what to expect?

2012-11-29 Thread David Mohr

On 2012-11-29 07:48, Julian Stecklina wrote:

Thus spake David Mohr damaili...@mcbf.net:


 * vm->vm (same host)   22k


This number is in the same ballpark as what I am seeing on pretty much
the same hardware.

AFAICS, there is little you can do to the current virtio-virtio code
path that would make this substantially faster.


Thanks for the feedback. Considering that it's better than the hardware
network performance, my main issue is actually the latency of
communication between VMs on different hosts:

 * vm->vm (diff. hosts)   7k

It is obvious that there is a lot more going on compared to same-host
communication, but only ~30% of the performance when the network
hardware should not be slowing it down (too) much?


Thanks,
~David


Re: pci_enable_msix() fails with ENOMEM/EINVAL

2012-11-29 Thread Alex Williamson
On Thu, 2012-11-29 at 10:42 +0200, Alex Lyakas wrote:
 Hi Alex,
 did I understand correctly that the vector value with which
 request_threaded_irq() is called is *not* supposed to be zero? Because in my
 case it is always zero, and still the failure I observe is not happening
 always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each
 attached PCI device of each KVM VM.
 I will try to repro this like you suggested and let you know.

pci_enable_msix should be setting vectors for all the host_msix_entries.
That non-zero vector is the first parameter to request_threaded_irq.  If
I put a printk in the loop installing the interrupt handler, I get:

assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 2, vector 103

So we enable MSI-X with only one vector enabled, then a second vector
gets enabled and we disable and re-enable with two, and so on with
three.  This is for an assigned e1000e device.  Thanks,

Alex



Re: [PATCH v4 01/14] ARM: Add page table and page defines needed by KVM

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:14 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:42:17PM +, Christoffer Dall wrote:
 KVM uses the stage-2 page tables and the Hyp page table format,
 so we define the fields and page protection flags needed by KVM.

 The nomenclature is this:
  - page_hyp:PL2 code/data mappings
  - page_hyp_device: PL2 device mappings (vgic access)
  - page_s2: Stage-2 code/data page mappings
  - page_s2_device:  Stage-2 device mappings (vgic access)

 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 [...]

 diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
 index 941dfb9..087d949 100644
 --- a/arch/arm/mm/mmu.c
 +++ b/arch/arm/mm/mmu.c
 @@ -57,43 +57,61 @@ static unsigned int cachepolicy __initdata = 
 CPOLICY_WRITEBACK;
  static unsigned int ecc_mask __initdata = 0;
  pgprot_t pgprot_user;
  pgprot_t pgprot_kernel;
 +pgprot_t pgprot_hyp_device;
 +pgprot_t pgprot_s2;
 +pgprot_t pgprot_s2_device;

  EXPORT_SYMBOL(pgprot_user);
  EXPORT_SYMBOL(pgprot_kernel);
 +EXPORT_SYMBOL(pgprot_hyp_device);
 +EXPORT_SYMBOL(pgprot_s2);
 +EXPORT_SYMBOL(pgprot_s2_device);

 Do we still need these?


the exports, nope.

  struct cachepolicy {
 	const char	policy[16];
 	unsigned int	cr_mask;
 	pmdval_t	pmd;
 	pteval_t	pte;
 +	pteval_t	pte_s2;
  };

 +#ifdef CONFIG_ARM_LPAE
 +#define s2_policy(policy)	policy
 +#else
 +#define s2_policy(policy)	0
 +#endif

 Put the macro in pgtable-{2,3}level.h?


I think that's weird, defining something far away from where it's used
will only make it harder to read, pgtable.h is not included in this
file, and there are other #ifdef CONFIG_ARM_LPAE in that file.

Here's the fix from above:

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 087d949..46ca62b 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -63,9 +63,6 @@ pgprot_t pgprot_s2_device;

 EXPORT_SYMBOL(pgprot_user);
 EXPORT_SYMBOL(pgprot_kernel);
-EXPORT_SYMBOL(pgprot_hyp_device);
-EXPORT_SYMBOL(pgprot_s2);
-EXPORT_SYMBOL(pgprot_s2_device);

 struct cachepolicy {
const char  policy[16];


-Christoffer


[PATCH] kvm-tool: ARM: set interrupt priority mask in secondary boot path

2012-11-29 Thread Marc Zyngier
A bug in the KVM GIC init code set the priority mask to the
highest possible value, while the reset value should be zero.

Now that the kernel bug is fixed, kvm-tool must properly configure
its GIC CPU interface in order to receive the boot IPI.  Just set
the GICC_PMR register to the maximum value (0xff), and it just works.

Cc: Will Deacon will.dea...@arm.com
Cc: Pekka Enberg penb...@kernel.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 tools/kvm/arm/aarch32/smp-pen.S        | 4 ++++
 tools/kvm/arm/include/arm-common/gic.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/tools/kvm/arm/aarch32/smp-pen.S b/tools/kvm/arm/aarch32/smp-pen.S
index 0861171..cee1d3b8 100644
--- a/tools/kvm/arm/aarch32/smp-pen.S
+++ b/tools/kvm/arm/aarch32/smp-pen.S
@@ -17,6 +17,10 @@ smp_pen_start:
mov r1, #GIC_CPUI_CTLR_EN
str r1, [r0]
 
+   @ Set the priority mask to accept any interrupt
+   mov r1, #GIC_CPUI_DEFAULT_PMR
+   str r1, [r0, #4]
+
@ Now wait for the primary to poke us
adr r0, smp_jump_addr
ldr r1, =AARCH32_SMP_BAD_MAGIC
diff --git a/tools/kvm/arm/include/arm-common/gic.h 
b/tools/kvm/arm/include/arm-common/gic.h
index d534174..6fa806f 100644
--- a/tools/kvm/arm/include/arm-common/gic.h
+++ b/tools/kvm/arm/include/arm-common/gic.h
@@ -20,6 +20,8 @@
 
 #define GIC_CPUI_CTLR_EN	(1 << 0)
 
+#define GIC_CPUI_DEFAULT_PMR   0xff
+
 #define GIC_MAX_CPUS	8
 #define GIC_MAX_IRQ	255
 
-- 
1.8.0.1




Re: [kvmarm] [PATCH] kvm-tool: ARM: set interrupt priority mask in secondary boot path

2012-11-29 Thread Peter Maydell
On 29 November 2012 15:58, Marc Zyngier marc.zyng...@arm.com wrote:
 A bug in the KVM GIC init code set the priority mask to the
 highest possible value, while the reset value should be zero.

 Now that the kernel bug is fixed, kvm-tool must properly configure
 its GIC CPU interface in order to receive the boot IPI.  Just set
 the GICC_PMR register to the maximum value (0xff), and it just works.

 --- a/tools/kvm/arm/include/arm-common/gic.h
 +++ b/tools/kvm/arm/include/arm-common/gic.h
 @@ -20,6 +20,8 @@

  #define GIC_CPUI_CTLR_EN	(1 << 0)

 +#define GIC_CPUI_DEFAULT_PMR   0xff

This seems an odd name, since 0xff is not the default, which is
why we have to write it into the register. Also the register
name should presumably go after CPUI, to match CTLR above.
GIC_CPUI_PMR_MINIMUM_PRIORITY ?

-- PMM


[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #5 from Alex Williamson alex.william...@redhat.com  2012-11-29 
16:10:55 ---
(In reply to comment #4)
 1: I have disabled irqbalance on the host, and the IRQ affinity on the host
 is OK; you can see step 4 in comment 0.
 
 2: As you can see in comment 0, we have set the IRQ affinity on the host to
 pcpu1 to reduce IPI interrupts, so we want to set the IRQ affinity to vcpu2
 on the guest OS according to <vcpupin vcpu='2' cpuset='1'/>, but setting the
 IRQ affinity on the guest OS fails when using MSI-X.

How does it fail?  When I test this, I look at /proc/interrupts on the guest
and the interrupt count on the row with the pinned IRQ only increases in the
column under the target CPU.  Perhaps if you included some logs and exact
commands used I could better understand where things are going wrong.  Thanks.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.


[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #6 from liyi yiliker...@gmail.com  2012-11-29 16:21:16 ---
1: I have set the IRQ (associated with the passthrough device) to vcpu2 on the
guest OS, but looking at /proc/interrupts on the guest, the interrupt count
for the pinned IRQ increases on many vcpus (everything except vcpu0).

2: My qemu-kvm version is 1.2, and I have not added the patch
http://www.spinics.net/lists/kvm/msg83109.html; maybe it will help me.
Are you sure that your qemu-kvm is from
git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git?



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #7 from Alex Williamson alex.william...@redhat.com  2012-11-29 
16:33:38 ---
(In reply to comment #6)
 1: I have set the IRQ (associated with the passthrough device) to vcpu2 on the
 guest OS, but looking at /proc/interrupts on the guest, the interrupt count
 for the pinned IRQ increases on many vcpus (everything except vcpu0).

Perhaps double check that irqbalance is not running the in guest, ps aux | grep
irqbalance.

 2: My qemu-kvm version is 1.2, and I have not added the patch
 http://www.spinics.net/lists/kvm/msg83109.html; maybe it will help me.
 Are you sure that your qemu-kvm is from
 git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git?

Those patches are against git://git.qemu.org/qemu.git  I believe the BCM5716
will try to use MSI-X interrupts, not MSI.  Please provide lspci -vvv of the
device in the guest and we can verify what interrupt mode it's working in. 
What is the guest operating system?  Thanks.



Re: Networking latency - what to expect?

2012-11-29 Thread Julian Stecklina
Thus spake David Mohr damaili...@mcbf.net:

 On 2012-11-29 07:48, Julian Stecklina wrote:
 Thus spake David Mohr damaili...@mcbf.net:

 * vm->vm (same host)    22k

 This number is in the same ballpark as what I am seeing on pretty
 much
 the same hardware.

 AFAICS, there is little you can do to the current virtio-virtio code
 path that would make this substantially faster.

 Thanks for the feedback. Considering that it's better than the
 hardware network performance my main issue is actually the latency of
 communication between VMs on different hosts:
 * vm->vm (diff. hosts)   7k

 It is obvious that there is a lot more going on compared to same host
 communication, but only ~30% of the performance when the network
 hardware should not be slowing it down (too) much?

You are probably better off using SR-IOV NICs with PCI passthrough in
this case.

Maybe someone can comment on whether virtual interrupt delivery and
posted interrupts[1] are already usable. The first should help in
both the virtio and SR-IOV scenarios; the latter only applies to
SR-IOV (and PCI passthrough in general).

Julian

[1] http://www.spinics.net/lists/kvm/msg82762.html


Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-29 Thread Michael Wolf

On 11/28/2012 02:55 PM, Glauber Costa wrote:

On 11/28/2012 10:43 PM, Michael Wolf wrote:

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable but not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

In the capped case, the time that the guest spends waiting due to having
used its full allotment of time shows up as steal time.  As my patchset
currently stands, you would set up the bandwidth control and would have to
pass it a matching value from qemu.  In the future, it would be possible to
have something parse the bandwidth setting and automatically adjust the
setting in the host used for steal time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Yes, in the context above I'm referring to the cfs bandwidth control.


Some time has passed since I last looked into it, but IIRC, after you
are out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.
So I looked at something like this in the past.  To make sure things 
haven't changed
I set up a cgroup on my test server running a kernel built from the 
latest tip tree.


[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a
script that will hog the cpu I see the following in top on the guest:
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 
49.8%st


So the steal time here is in line with the bandwidth control settings.


In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:
About a year ago I was playing with this patch.  It is out of date now
but will give you an idea of what I was looking at.

 kernel/sched_fair.c  |4 ++--
 kernel/sched_stats.h |7 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
struct sched_entity *se)

 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct
cfs_rq *cfs_rq)
 }

 /* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 return cfs_rq->throttle_count;
 }
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void sched_info_queued(struct
task_struct *t)
  * sched_info_queued() to mark that it has now again started waiting on
  * the runqueue.
  */
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static inline void sched_info_depart(struct task_struct *t)
 {
+struct task_group *tg = task_group(t);
+struct cfs_rq *cfs_rq;
 unsigned long long delta = task_rq(t)->clock -
 t->sched_info.last_arrival;

+cfs_rq = tg->cfs_rq[smp_processor_id()];
 rq_sched_info_depart(task_rq(t), delta);

-if (t->state == TASK_RUNNING)
+
+if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
 sched_info_queued(t);
 }


So then the steal time did not show on the guest.  You have no value
that needs to be passed around.  What I did not like about this approach was:
* only works for cfs bandwidth control.  If another type of hard limit
was added to the kernel, the code would potentially need to change.
* This approach doesn't help if the limits are set by

[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #8 from liyi yiliker...@gmail.com  2012-11-29 17:52:47 ---
OK, I am sure that irqbalance is not running in the guest.
You are right, the BCM5716S uses MSI-X interrupts by default.

1: Setting smp_affinity does not work correctly when using MSI-X, but the
test is OK after loading bnx2 with disable_msi=1.

2:  lspci -vvv:
00:06.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit
Ethernet (rev 20)
Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 11
Region 0: Memory at f600 (32-bit, non-prefetchable) [size=32M]
Capabilities: [50] Vital Product Data <?>
Capabilities: [ac] Express (v2) Endpoint, MSI 00
DevCap:MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1
<64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl:Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta:CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap:Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0
4us, L1 4us
ClockPM- Suprise- LLActRep- BwNot-
LnkCtl:ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta:Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive-
BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
 Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9
Vector table: BAR=0 offset=c000
PBA: BAR=0 offset=e000
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Count=1/16
Enable-
Address:   Data: 
Kernel driver in use: bnx2
Kernel modules: bnx2

00:06.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit
Ethernet (rev 20)
Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 11
Region 0: Memory at f800 (32-bit, non-prefetchable) [size=32M]
Capabilities: [50] Vital Product Data <?>
Capabilities: [ac] Express (v2) Endpoint, MSI 00
DevCap:MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1
<64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl:Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta:CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap:Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0
4us, L1 4us
ClockPM- Suprise- LLActRep- BwNot-
LnkCtl:ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta:Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive-
BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
 Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9
Vector table: BAR=0 offset=c000
PBA: BAR=0 offset=e000
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Count=1/16
Enable-
Address:   Data: 
Kernel driver in use: bnx2
Kernel modules: bnx2

the guest os: 3.7-rc6 (the same as host)


[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #9 from Alex Williamson alex.william...@redhat.com  2012-11-29 
18:08:44 ---
(In reply to comment #8)
 OK, I am sure that irqbalance is not running in the guest.
 You are right, the BCM5716S uses MSI-X interrupts by default.
 
 1: Setting smp_affinity does not work correctly when using MSI-X, but the
 test is OK after loading bnx2 with disable_msi=1.
 
 2:  lspci -vvv:
 00:06.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit
 Ethernet (rev 20)
...
 00:06.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit
 Ethernet (rev 20)

Where's the BCM5716?



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #10 from liyi yiliker...@gmail.com  2012-11-29 18:18:21 ---
Sorry, I find that the test on the BCM5709 is OK.
Right now I cannot access my physical machines (with BCM5709S or Intel 82599);
I will give the debug info ASAP.

Thanks



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #11 from liyi yiliker...@gmail.com  2012-11-29 18:20:59 ---
Please omit the BCM5709S.

I have BCM5709, BCM5716S, and Intel 82599 environments, but right now I cannot
access the BCM5716S and the Intel 82599.

The test on the BCM5709 is OK now, but the earlier tests on the BCM5716S and
the Intel 82599 failed.



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #12 from Alex Williamson alex.william...@redhat.com  2012-11-29 
18:55:46 ---
(In reply to comment #11)
 Please omit the BCM5709S.
 
 I have BCM5709, BCM5716S, and Intel 82599 environments, but right now I cannot
 access the BCM5716S and the Intel 82599.
 
 The test on the BCM5709 is OK now, but the earlier tests on the BCM5716S and
 the Intel 82599 failed.

Are you saying MSI-X affinity works as expected on the BCM5709?



Re: [PATCH v4 02/14] ARM: Section based HYP idmap

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:16 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:42:24PM +, Christoffer Dall wrote:
 Add a method (hyp_idmap_setup) to populate a hyp pgd with an
 identity mapping of the code contained in the .hyp.idmap.text
 section.

 Offer a method to drop this identity mapping through
 hyp_idmap_teardown.

 Make all the above depend on CONFIG_ARM_VIRT_EXT and CONFIG_ARM_LPAE.

 Cc: Will Deacon will.dea...@arm.com
 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
 ---
  arch/arm/include/asm/idmap.h|5 ++
  arch/arm/include/asm/pgtable-3level-hwdef.h |1
  arch/arm/kernel/vmlinux.lds.S   |6 ++
  arch/arm/mm/idmap.c |   74 
 +++
  4 files changed, 73 insertions(+), 13 deletions(-)

 diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h
 index bf863ed..36708ba 100644
 --- a/arch/arm/include/asm/idmap.h
 +++ b/arch/arm/include/asm/idmap.h
 @@ -11,4 +11,9 @@ extern pgd_t *idmap_pgd;

  void setup_mm_for_reboot(void);

 +#ifdef CONFIG_ARM_VIRT_EXT
 +void hyp_idmap_teardown(pgd_t *hyp_pgd);
 +void hyp_idmap_setup(pgd_t *hyp_pgd);
 +#endif

 I wonder whether the dependency is quite right here. Functionally, we're
 only dependent on LPAE but it doesn't make sense if !ARM_VIRT_EXT. In
 fact, I start to question whether CONFIG_ARM_VIRT_EXT is worthwhile at
 all as it stands. Maybe it would be better to add the LPAE dependency
 there and make the hyp_stub stuff that's currently in mainline
 unconditional?


perhaps it would be cleaner, but it's not something that would
break if we fix it later, and in that sense not something this patch series
depends on. If the hyp_stub stuff is truly compatible with all
legacy stuff then yes, we could remove that define, but if not, then I
guess it's necessary.

For now, I'll simply remove these ifdef's as Rob pointed out.

 +/*
 + * This version actually frees the underlying pmds for all pgds in range and
 + * clear the pgds themselves afterwards.
 + */
 +void hyp_idmap_teardown(pgd_t *hyp_pgd)
 +{
 + unsigned long addr, end;
 + unsigned long next;
 + pgd_t *pgd = hyp_pgd;
 +
 + addr = virt_to_phys(__hyp_idmap_text_start);
 + end = virt_to_phys(__hyp_idmap_text_end);
 +
 + pgd += pgd_index(addr);
 + do {
 + next = pgd_addr_end(addr, end);
 + if (!pgd_none_or_clear_bad(pgd))
 + hyp_idmap_del_pmd(pgd, addr);
 + } while (pgd++, addr = next, addr < end);
 +}
 +EXPORT_SYMBOL_GPL(hyp_idmap_teardown);
 +
 +void hyp_idmap_setup(pgd_t *hyp_pgd)
 +{
 + identity_mapping_add(hyp_pgd, __hyp_idmap_text_start,
 +  __hyp_idmap_text_end, PMD_SECT_AP1);
 +}
 +EXPORT_SYMBOL_GPL(hyp_idmap_setup);
 +#endif

 Again, do we need these exports?

no we don't


 I also think it might be cleaner to declare the hyp_pgd next to the
 idmap_pgd then add the hyp_idmap_setup code to init_static_idmap, so
 that we don't add new entry points to this file. The teardown can all be
 moved into the kvm/ code as it doesn't need any of the existing idmap
 functions.


hmm, we had some attempts at this earlier, but it wasn't all that
nice. Allocating the hyp_pgd inside idmap.c requires some more logic,
and the #ifdefs inside init_static_idmap are also not pretty.
Additionally, other potential users of Hyp mode might have a separate
hyp_pgd, in theory.

While KVM is the only current user of hyp_idmap_teardown, it's not
really KVM specific, and there could theoretically be other users of hyp
idmaps. Also, the externs __hyp_idmap_text_start|end would have to
be either moved to a header file or duplicated inside KVM, which is
also not that pretty.

I see how you would like to clean this up, but it's not really hard to
read or understand, imho:

The ifdef cleanup patch:

From 9cbe2830b74072b4d7167254562fc06c08f724e1 Mon Sep 17 00:00:00 2001
From: Christoffer Dall c.d...@virtualopensystems.com
Date: Thu, 29 Nov 2012 13:56:03 -0500
Subject: [PATCH] KVM: ARM: Remove exports and ifdefs in hyp idmap code

The exports are no longer needed as KVM/ARM cannot be compiled as a
module, and the #ifdef is not required around a declaration.

Cc: Will Deacon will.dea...@arm.com
Cc: Rob Herring robherri...@gmail.com
Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
---
 arch/arm/include/asm/idmap.h |2 --
 arch/arm/mm/idmap.c  |2 --
 2 files changed, 4 deletions(-)

diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h
index 36708ba..6ddb707 100644
--- a/arch/arm/include/asm/idmap.h
+++ b/arch/arm/include/asm/idmap.h
@@ -11,9 +11,7 @@ extern pgd_t *idmap_pgd;

 void setup_mm_for_reboot(void);

-#ifdef CONFIG_ARM_VIRT_EXT
 void hyp_idmap_teardown(pgd_t *hyp_pgd);
 void hyp_idmap_setup(pgd_t *hyp_pgd);
-#endif

 #endif /* __ASM_IDMAP_H */

[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #13 from liyi yiliker...@gmail.com  2012-11-29 19:12:47 ---
yes.



[PATCH] kvm-tool: fix zombie reaping in guest/init.c

2012-11-29 Thread Marc Zyngier
init.c is not very kind to processes that get reparented to it when
their own parent dies, leaving them hanging around. Looking at the
code, it only seems to care about its own flesh and blood. Bad init.

Teach it some basic zombie reaping skills.

Cc: Pekka Enberg penb...@kernel.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 tools/kvm/guest/init.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/guest/init.c b/tools/kvm/guest/init.c
index ece48fd..8c49a03 100644
--- a/tools/kvm/guest/init.c
+++ b/tools/kvm/guest/init.c
@@ -61,7 +61,11 @@ int main(int argc, char *argv[])
else
run_process("/bin/sh");
} else {
-   waitpid(child, &status, 0);
+   pid_t corpse;
+
+   do {
+   corpse = waitpid(-1, &status, 0);
+   } while (corpse != child);
}
 
reboot(LINUX_REBOOT_CMD_RESTART);
-- 
1.8.0.1




Re: Performance issue

2012-11-29 Thread Vadim Rozenfeld
On Thursday, November 29, 2012 03:56:10 PM Gleb Natapov wrote:
 On Thu, Nov 29, 2012 at 03:45:52PM +0200, George-Cristian Bîrzan wrote:
  On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com 
wrote:
   I've also added +hv_relaxed since then, but this is the command I'm
   
   I would suggest activating relaxed timing for all W2K8R2/Win7 guests.
  
  Is there any place I can read up on the downsides of this for Linux,
  or is Just Better?
 
 You shouldn't use hyper-v flags for Linux guests. In theory Linux should
 just ignore them, in practice there may be bugs that will prevent Linux
 from detecting that it runs as a guest and disable optimizations.
 
As Gleb said, hyper-v flags are relevant to Windows guests only.
IIRC spinlocks and vapic should work for Vista and higher. Relaxed timing and
partition reference time work for Win7/W2K8R2.
   Other than that, was looking into a profiling trace of the software
   
   running and a lot of time (60%?) is spent calling two functions from
   hal.dll, HalpGetPmTimerSleepModePerfCounter when I disable HPET, and
   HalpHPETProgramRolloverTimer which do point at something related to
   the timers.
   
   It means that hyper-v time stamp source was not activated.
  
  I recompiled the whole kernel, with your patch, and while I cannot
  check at 70Mbps now, a test stream of 20 seems to do better. Also, now
  I don't see any of those functions, which used to account ~60% of the
  time spent by the program. I'm waiting for the customer to come back
  and start the 'real' stream, but from my tests, time spent in hal.dll
  is now an order of magnitude smaller.
  
  --
  George-Cristian Bîrzan
 
 --
   Gleb.


[PATCH V6 0/2] Enable guest use of TSC_ADJUST functionality

2012-11-29 Thread Will Auld
I have rebased this patch set (V6) onto kvm.git's queue branch.

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 14 ---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 ++
 arch/x86/kvm/svm.c| 22 ++---
 arch/x86/kvm/vmx.c| 27 +++-
 arch/x86/kvm/x86.c| 52 +--
 arch/x86/kvm/x86.h|  2 +-
 9 files changed, 108 insertions(+), 21 deletions(-)

-- 
1.8.0.rc0





[PATCH V6 1/2] Add code to track call origin for msr assignment.

2012-11-29 Thread Will Auld
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c  | 15 +++
 arch/x86/kvm/vmx.c  | 18 --
 arch/x86/kvm/x86.c  | 29 ++---
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9fb6d8d..56c5dca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -620,6 +620,12 @@ struct kvm_vcpu_stat {
 
 struct x86_instruction_info;
 
+struct msr_data {
+   bool host_initiated;
+   u32 index;
+   u64 data;
+};
+
 struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void);  /* __init */
int (*disabled_by_bios)(void); /* __init */
@@ -642,7 +648,7 @@ struct kvm_x86_ops {
 
void (*update_db_bp_intercept)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -793,7 +799,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 struct x86_emulate_ctxt;
 
@@ -820,7 +826,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 161a5fa..a8442e8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3127,13 +3127,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
+   u32 ecx = msr->index;
+   u64 data = msr->data;
switch (ecx) {
case MSR_IA32_TSC:
-   kvm_write_tsc(vcpu, data);
+   kvm_write_tsc(vcpu, msr);
break;
case MSR_STAR:
svm->vmcb->save.star = data;
@@ -3188,20 +3190,25 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", 
ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, msr);
}
return 0;
 }
 
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+   struct msr_data msr;
u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
| ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
 
+   msr.data = data;
+   msr.index = ecx;
+   msr.host_initiated = false;
 
svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
-   if (svm_set_msr(&svm->vcpu, ecx, data)) {
+   if (svm_set_msr(&svm->vcpu, &msr)) {
+
trace_kvm_msr_write_ex(ecx, data);
kvm_inject_gp(&svm->vcpu, 0);
} else {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bb18923..a166558 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2201,15 +2201,17 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct shared_msr_entry *msr;
int ret = 0;
u32 msr_index = msr_info->index;

[PATCH V6 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-29 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC offsetting" VM-execution control is
enabled, as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmcs tsc_offset for a guest process when it does an rdtsc (with the correct
settings), the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.

The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:

The hypervisor sets the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed, the semantics
as seen by the guest do not, and hence this will not cause a problem.
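The store-in-tsc_offset scheme described above amounts to simple offset arithmetic. A minimal userspace sketch (the struct and helper names here are hypothetical illustrations, not the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu state mirroring the fields the patch touches. */
struct vcpu_state {
	int64_t ia32_tsc_adjust_msr;	/* emulated MSR value */
	int64_t tsc_offset;		/* vmcs tsc_offset */
};

/*
 * On a guest WRMSR to IA32_TSC_ADJUST, fold the delta between the new
 * and old values into the vmcs tsc_offset, so that guest rdtsc (which
 * returns host TSC + tsc_offset) reflects the adjustment.
 */
static inline void emulate_tsc_adjust_write(struct vcpu_state *v, int64_t data)
{
	v->tsc_offset += data - v->ia32_tsc_adjust_msr;
	v->ia32_tsc_adjust_msr = data;
}

/* What the guest observes via rdtsc for a given host TSC. */
static inline uint64_t guest_rdtsc(const struct vcpu_state *v, uint64_t host_tsc)
{
	return host_tsc + (uint64_t)v->tsc_offset;
}
```

A RDMSR of the emulated register then just returns ia32_tsc_adjust_msr, so the guest sees consistent values without a save/restore on every transition.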

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   |  2 ++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 
 arch/x86/kvm/svm.c|  7 +++
 arch/x86/kvm/vmx.c|  9 +
 arch/x86/kvm/x86.c| 23 +++
 8 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 8c297aa..602c476 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -202,6 +202,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST (9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation extensions */
 #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56c5dca..dc87b65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,6 +444,7 @@ struct kvm_vcpu_arch {
s8 virtual_tsc_shift;
u32 virtual_tsc_mult;
u32 virtual_tsc_khz;
+   s64 ia32_tsc_adjust_msr;
 
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
@@ -711,6 +712,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
 
	void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale);
+   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..c2dea36 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -236,6 +236,7 @@
 #define MSR_IA32_EBL_CR_POWERON0x002a
 #define MSR_EBC_FREQUENCY_ID   0x002c
 #define MSR_IA32_FEATURE_CONTROL0x003a
+#define MSR_IA32_TSC_ADJUST 0x003b
 
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ec79e77..ee4e76c 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (index == 0) {
entry->ebx &= kvm_supported_word9_x86_features;
cpuid_mask(&entry->ebx, 9);
+   /* TSC_ADJUST is emulated */
+   entry->ebx |= F(TSC_ADJUST);
} else
entry->ebx = 0;
entry->eax = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a10e460..3a8b504 100644
--- a/arch/x86/kvm/cpuid.h
+++ 

Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-29 Thread Marcelo Tosatti
On Thu, Nov 29, 2012 at 07:21:28PM +, Auld, Will wrote:
 Marcelo,
 
 The behavior on reset is to return the TSC_ADJUST MSR value to 0x0. I am 
 currently initializing this emulated msr in kvm_arch_vcpu_init(). 

Will,

Reset is handled by QEMU. kvm_arch_vcpu_init is only called during vcpu
allocation (ie vm creation). See x86_cpu_reset in QEMU's
target-i386/cpu.c file.

 
 - Behaviour on reset: what is the behaviour on RESET?
 
 I am testing the rebase now. I would like to get any needed changes for this 
 initialization into my next patch set version (v6) if possible.

For the kernel patch, everything is fine. Apparently for the QEMU patch
it is necessary to set the value to zero on reset.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: Fix user memslot overlap check

2012-11-29 Thread Alex Williamson
Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org
---
 virt/kvm/kvm_main.c |   12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..cac294d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,7 +710,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
gfn_t base_gfn;
unsigned long npages;
unsigned long i;
-   struct kvm_memory_slot *memslot;
+   struct kvm_memory_slot *memslot, *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots, *old_memslots;
 
@@ -761,13 +761,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* Check for overlaps */
r = -EEXIST;
-   for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-   struct kvm_memory_slot *s = kvm->memslots->memslots[i];
-
-   if (s == memslot || !s->npages)
+   kvm_for_each_memslot(slot, kvm->memslots) {
+   if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
continue;
-   if (!((base_gfn + npages <= s->base_gfn) ||
- (base_gfn >= s->base_gfn + s->npages)))
+   if (!((base_gfn + npages <= slot->base_gfn) ||
+ (base_gfn >= slot->base_gfn + slot->npages)))
goto out_free;
}
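The overlap predicate in this hunk is the standard interval-disjointness test on gfn ranges; stripped of the KVM types it reads as follows (hypothetical helper name, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Two gfn ranges [base1, base1 + npages1) and [base2, base2 + npages2)
 * overlap unless one ends at or before the other begins -- the same
 * predicate the loop above applies against each user slot.
 */
static int ranges_overlap(gfn_t base1, unsigned long npages1,
			  gfn_t base2, unsigned long npages2)
{
	return !((base1 + npages1 <= base2) || (base1 >= base2 + npages2));
}
```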
 



[PATCH v2] kvm: Fix user memslot overlap check

2012-11-29 Thread Alex Williamson
Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org
---

v2: Remove unused variable i

 virt/kvm/kvm_main.c |   13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..6e8fa7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -709,8 +709,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int r;
gfn_t base_gfn;
unsigned long npages;
-   unsigned long i;
-   struct kvm_memory_slot *memslot;
+   struct kvm_memory_slot *memslot, *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots, *old_memslots;
 
@@ -761,13 +760,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* Check for overlaps */
r = -EEXIST;
-   for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-   struct kvm_memory_slot *s = kvm->memslots->memslots[i];
-
-   if (s == memslot || !s->npages)
+   kvm_for_each_memslot(slot, kvm->memslots) {
+   if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
continue;
-   if (!((base_gfn + npages <= s->base_gfn) ||
- (base_gfn >= s->base_gfn + s->npages)))
+   if (!((base_gfn + npages <= slot->base_gfn) ||
+ (base_gfn >= slot->base_gfn + slot->npages)))
goto out_free;
}
 



Re: [PATCH v4 03/14] ARM: Factor out cpuid implementor and part number

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:21 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:42:31PM +, Christoffer Dall wrote:
 Decoding the implementor and part number of the CPU id in the CPU ID
 register is needed by KVM, so we factor it out to share the code.

 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 [...]

 diff --git a/arch/arm/include/asm/cputype.h b/arch/arm/include/asm/cputype.h
 index cb47d28..306fb2c 100644
 --- a/arch/arm/include/asm/cputype.h
 +++ b/arch/arm/include/asm/cputype.h
 @@ -51,6 +51,22 @@ extern unsigned int processor_id;
  #define read_cpuid_ext(reg) 0
  #endif

 +#define IMPLEMENTOR_ARM  0x41
 +#define IMPLEMENTOR_INTEL0x69
 +
 +#define PART_NUMBER_ARM1136  0xB360
 +#define PART_NUMBER_ARM1156  0xB560
 +#define PART_NUMBER_ARM1176  0xB760
 +#define PART_NUMBER_ARM11MPCORE  0xB020
 +#define PART_NUMBER_CORTEX_A80xC080
 +#define PART_NUMBER_CORTEX_A90xC090
 +#define PART_NUMBER_CORTEX_A50xC050
 +#define PART_NUMBER_CORTEX_A15   0xC0F0
 +#define PART_NUMBER_CORTEX_A70xC070
 +
 +#define PART_NUMBER_XSCALE1  0x1
 +#define PART_NUMBER_XSCALE2  0x2

 We should probably prefix these with ARM_CPU_ and make the current names
 shorter to compensate. e.g. ARM_CPU_PART_1136, ARM_CPU_IMP_ARM ?

  /*
   * The CPU ID never changes at run time, so we might as well tell the
   * compiler that it's constant.  Use this function to read the CPU ID
 @@ -61,6 +77,16 @@ static inline unsigned int __attribute_const__ 
 read_cpuid_id(void)
   return read_cpuid(CPUID_ID);
  }

 +static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
 +{
  + return (read_cpuid_id() & 0xFF000000) >> 24;
 +}
 +
 +static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
 +{
  + return (read_cpuid_id() & 0xFFF0);
 +}

 Perhaps this should take the implementor as an argument, given that the
 part number is described differently between implementors. The xscale
 stuff can then move in here (we'll need to check the xscale docs in case
 perf is using a subfield -- I can't remember off-hand).

  static inline unsigned int __attribute_const__ read_cpuid_cachetype(void)
  {
   return read_cpuid(CPUID_CACHETYPE);
 diff --git a/arch/arm/kernel/perf_event_cpu.c 
 b/arch/arm/kernel/perf_event_cpu.c
 index 8d7d8d4..ff18566 100644
 --- a/arch/arm/kernel/perf_event_cpu.c
 +++ b/arch/arm/kernel/perf_event_cpu.c
 @@ -200,46 +200,46 @@ static struct arm_pmu *__devinit 
 probe_current_pmu(void)
   struct arm_pmu *pmu = NULL;
   int cpu = get_cpu();
   unsigned long cpuid = read_cpuid_id();
  - unsigned long implementor = (cpuid & 0xFF000000) >> 24;
  - unsigned long part_number = (cpuid & 0xFFF0);
 + unsigned long implementor = read_cpuid_implementor();
 + unsigned long part_number = read_cpuid_part_number();

   pr_info(probing PMU on CPU %d\n, cpu);

   /* ARM Ltd CPUs. */
 - if (0x41 == implementor) {
 + if (implementor == IMPLEMENTOR_ARM) {
   switch (part_number) {
 - case 0xB360:/* ARM1136 */
 - case 0xB560:/* ARM1156 */
 - case 0xB760:/* ARM1176 */
 + case PART_NUMBER_ARM1136:
 + case PART_NUMBER_ARM1156:
 + case PART_NUMBER_ARM1176:
   pmu = armv6pmu_init();
   break;
 - case 0xB020:/* ARM11mpcore */
 + case PART_NUMBER_ARM11MPCORE:
   pmu = armv6mpcore_pmu_init();
   break;
 - case 0xC080:/* Cortex-A8 */
 + case PART_NUMBER_CORTEX_A8:
   pmu = armv7_a8_pmu_init();
   break;
 - case 0xC090:/* Cortex-A9 */
 + case PART_NUMBER_CORTEX_A9:
   pmu = armv7_a9_pmu_init();
   break;
 - case 0xC050:/* Cortex-A5 */
 + case PART_NUMBER_CORTEX_A5:
   pmu = armv7_a5_pmu_init();
   break;
 - case 0xC0F0:/* Cortex-A15 */
 + case PART_NUMBER_CORTEX_A15:
   pmu = armv7_a15_pmu_init();
   break;
 - case 0xC070:/* Cortex-A7 */
 + case PART_NUMBER_CORTEX_A7:
   pmu = armv7_a7_pmu_init();
   break;
   }
   /* Intel CPUs [xscale]. */
 - } else if (0x69 == implementor) {
 + } else if (implementor == IMPLEMENTOR_INTEL) {
    part_number = (cpuid >> 13) & 0x7;
   switch (part_number) {
 - case 1:
 + case PART_NUMBER_XSCALE1:
   pmu = xscale1pmu_init();
   break;
 - case 2:
 + case PART_NUMBER_XSCALE2:
   pmu = xscale2pmu_init();
   

Re: [PATCH v4 04/14] KVM: ARM: Initial skeleton to compile KVM support

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:41 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:42:38PM +, Christoffer Dall wrote:
 Targets KVM support for Cortex A-15 processors.

 Contains all the framework components, make files, header files, some
 tracing functionality, and basic user space API.

 Only supported core is Cortex-A15 for now.

 Most functionality is in arch/arm/kvm/* or arch/arm/include/asm/kvm_*.h.

 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Rusty Russell rusty.russ...@linaro.org
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 [...]

  4.69 KVM_GET_ONE_REG

  Capability: KVM_CAP_ONE_REG
 @@ -1791,6 +1800,7 @@ The list of registers accessible using this interface 
 is identical to the
  list in 4.68.


 +

 Whitespace.

  4.70 KVM_KVMCLOCK_CTRL

  Capability: KVM_CAP_KVMCLOCK_CTRL
 @@ -2072,6 +2082,46 @@ KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external 
 call; source cpu in parm
  Note that the vcpu ioctl is asynchronous to vcpu execution.


 +4.77 KVM_ARM_VCPU_INIT
 +
 +Capability: basic
 +Architectures: arm
 +Type: vcpu ioctl
  +Parameters: struct kvm_vcpu_init (in)
 +Returns: 0 on success; -1 on error
 +Errors:
  +  EINVAL: the target is unknown, or the combination of features is invalid.
  +  ENOENT: a features bit specified is unknown.
 +
 +This tells KVM what type of CPU to present to the guest, and what
 +optional features it should have.  This will cause a reset of the cpu
 +registers to their initial values.  If this is not called, KVM_RUN will
 +return ENOEXEC for that vcpu.
 +
 +Note that because some registers reflect machine topology, all vcpus
 +should be created before this ioctl is invoked.
 +
 +
 +4.78 KVM_GET_REG_LIST
 +
 +Capability: basic
 +Architectures: arm
 +Type: vcpu ioctl
 +Parameters: struct kvm_reg_list (in/out)
 +Returns: 0 on success; -1 on error
 +Errors:
 +  E2BIG: the reg index list is too big to fit in the array specified by
 + the user (the number required will be written into n).
 +
 +struct kvm_reg_list {
 +   __u64 n; /* number of registers in reg[] */
 +   __u64 reg[0];
 +};
 +
 +This ioctl returns the guest registers that are supported for the
 +KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
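The E2BIG contract above implies the usual two-call pattern in userspace: probe with n = 0, then retry with a correctly sized buffer. A sketch with a mocked-out ioctl (NUM_REGS, the register indices, and the helper names are all made up for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct kvm_reg_list {
	uint64_t n;	/* number of registers in reg[] */
	uint64_t reg[];	/* flexible array of register indices */
};

#define NUM_REGS 4	/* stand-in for whatever the vcpu supports */

/* Mock of the KVM_GET_REG_LIST ioctl: fails with E2BIG (writing the
 * required count into n) when the caller's array is too small. */
static int mock_get_reg_list(struct kvm_reg_list *list)
{
	if (list->n < NUM_REGS) {
		list->n = NUM_REGS;
		return -E2BIG;
	}
	list->n = NUM_REGS;
	for (uint64_t i = 0; i < NUM_REGS; i++)
		list->reg[i] = 0x6000000000000000ULL + i; /* made-up indices */
	return 0;
}

/* The two-call pattern: probe with n = 0, then retry with a sized buffer. */
static struct kvm_reg_list *fetch_reg_list(void)
{
	struct kvm_reg_list probe = { .n = 0 };
	int r = mock_get_reg_list(&probe);
	assert(r == -E2BIG);	/* probe tells us the required count */

	struct kvm_reg_list *list =
		malloc(sizeof(*list) + probe.n * sizeof(uint64_t));
	list->n = probe.n;
	r = mock_get_reg_list(list);
	assert(r == 0);
	return list;
}
```

Against a real vcpu fd the call is `ioctl(vcpu_fd, KVM_GET_REG_LIST, list)` with errno set to E2BIG on the probe; the mock above only demonstrates the retry logic.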
 +
 +
  5. The kvm_run structure
  

 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
 index ade7e92..43a997a 100644
 --- a/arch/arm/Kconfig
 +++ b/arch/arm/Kconfig
 @@ -2311,3 +2311,5 @@ source security/Kconfig
  source crypto/Kconfig

  source lib/Kconfig
 +
 +source arch/arm/kvm/Kconfig
 diff --git a/arch/arm/Makefile b/arch/arm/Makefile
 index 5f914fc..433becd 100644
 --- a/arch/arm/Makefile
 +++ b/arch/arm/Makefile
 @@ -257,6 +257,7 @@ core-$(CONFIG_XEN)  += arch/arm/xen/
  core-y += arch/arm/kernel/ arch/arm/mm/ 
 arch/arm/common/
  core-y += arch/arm/net/
  core-y += arch/arm/crypto/
 +core-y += arch/arm/kvm/

 Predicate this on CONFIG_KVM_ARM_HOST, then add some obj-y to the
 kvm/Makefile.

  core-y += $(machdirs) $(platdirs)

  drivers-$(CONFIG_OPROFILE)  += arch/arm/oprofile/
 diff --git a/arch/arm/include/asm/kvm_host.h 
 b/arch/arm/include/asm/kvm_host.h
 new file mode 100644
 index 000..67826b2
 --- /dev/null
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -0,0 +1,114 @@
 +/*
 + * Copyright (C) 2012 - Virtual Open Systems and Columbia University
 + * Author: Christoffer Dall c.d...@virtualopensystems.com
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License, version 2, as
 + * published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software
 + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
 + */
 +
 +#ifndef __ARM_KVM_HOST_H__
 +#define __ARM_KVM_HOST_H__
 +
 +#include asm/kvm.h
 +#include asm/kvm_asm.h
 +
 +#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS

 One thing I noticed when hacking kvmtool is that we don't support the
 KVM_CAP_{MAX,NR}_VCPUS capabilities. Why is this? Since we have a
 maximum of 8 due to limitations of the GIC, it might be nice to put that
 somewhere visible to userspace.


indeed

 +#define KVM_MEMORY_SLOTS 32
 +#define KVM_PRIVATE_MEM_SLOTS 4
 +#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 +
 +#define KVM_VCPU_MAX_FEATURES 0
 +
 +/* We don't currently support large pages. */
 +#define KVM_HPAGE_GFN_SHIFT(x) 0
 +#define KVM_NR_PAGE_SIZES  1
 +#define 

Re: Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-29 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 07:53:47PM +0900, Yoshihiro YUNOMAE wrote:
 Hi Marcelo,
 
 (2012/11/27 8:16), Marcelo Tosatti wrote:
 On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:
 500h. event tsc_write tsc_offset=-3000
 
 Then a guest trace containing events with a TSC timestamp.
 Which tsc_offset to use?
 
 (that is the problem, which unless i am mistaken can only be solved
 easily if the guest can convert RDTSC -> TSC of host).
 
 There are three following cases of changing TSC offset:
   1. Reset TSC at guest boot time
   2. Adjust TSC offset due to some host's problems
   3. Write TSC on guests
 The scenario which you mentioned is case 3, so we'll discuss this case.
 Here, we assume that a guest is allocated single CPU for the sake of
 ease.
 
 If a guest executes write_tsc, TSC values jumps to forward or backward.
 For the forward case, trace data are as follows:
 
 host guest   
 cyclesevents   cycles   events
   3000   tsc_offset=-2950
   3001   kvm_enter
   53 eventX
   
  100 (write_tsc=+900)
   3060   kvm_exit
   3075   tsc_offset=-2050
   3080   kvm_enter
 1050 event1
 1055 event2
   ...
 
 
 This case is simple. The guest TSC of the first kvm_enter is calculated
 as follows:
 
(host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
 
 Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
 events between 51 and 130, that is, 53 eventX is inserted between the
 first pair of kvm_enter and kvm_exit. To insert events of the guests
 between 51 and 130, we convert the guest TSC to the host TSC using TSC
 offset 2950.
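The conversions in this example are plain offset arithmetic, guest = host + tsc_offset; as a standalone sketch (helper names are mine, not from the proposed patches):

```c
#include <assert.h>
#include <stdint.h>

/* Guest TSC = host TSC + tsc_offset (the offset may be negative). */
static inline uint64_t host_to_guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
	return host_tsc + (uint64_t)tsc_offset;
}

/* Inverse mapping, used to place guest events on the host timeline. */
static inline uint64_t guest_to_host_tsc(uint64_t guest_tsc, int64_t tsc_offset)
{
	return guest_tsc - (uint64_t)tsc_offset;
}
```

With the numbers above: host 3001 with offset -2950 maps to guest 51, and guest eventX at 53 maps back to host 3003, between the first kvm_enter/kvm_exit pair.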
 
 For the backward case, trace data are as follows:
 
 host guest   
 cyclesevents   cycles   events
   3000   tsc_offset=-2950
   3001   kvm_enter
   53 eventX
   
  100 (write_tsc=-50)
   3060   kvm_exit
   3075   tsc_offset=-2050
   3080   kvm_enter
   90 event1
   95 event2
   ...
 
  3400  100 (write_tsc=-50)
 
 90 event3
 95 event4
 
 As you say, in this case, the previous method is invalid. When we
 calculate the guest TSC value for the tsc_offset=-3000 event, the value
 is 75 on the guest. This seems like prior event of write_tsc=-50 event.
 So, we need to consider more.
 
 In this case, it is important that we can understand where the guest
 executes write_tsc or the host rewrites the TSC offset. write_tsc on
 the guest equals wrmsr 0x0010, so this instruction induces vm_exit.
 This implies that the guest does not operate when the host changes TSC
 offset on the cpu. In other words, the guest cannot use new TSC before
 the host rewrites the new TSC offset. So, if timestamp on the guest is
 not monotonically increased, we can understand the guest executes
 write_tsc. Moreover, in the region where timestamp is decreasing, we
 can understand when the host rewrote the TSC offset in the guest trace
 data. Therefore, we can sort trace data in chronological order.
 
 This requires an entire trace of events. That is, to be able
 to reconstruct timeline you require the entire trace from the moment
 guest starts. So that you can correlate wrmsr-to-tsc on the guest with
 vmexit-due-to-tsc-write on the host.
 
 Which means that running out of space for trace buffer equals losing
 ability to order events.
 
 Is that desirable? It seems cumbersome to me.
 
 As you say, tracing events can overwrite important events like
 kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is
 needed by this feature. Normal events which often hit record the buffer
 A, and important events which rarely hit record the buffer B. In our
 case, the important event is write_tsc_offset.
 Also the need to correlate each write_tsc event in the guest trace
 with a corresponding tsc_offset write in the host trace means that it
 is _necessary_ for the guest and host to enable tracing simultaneously.
 Correct?
 
 Also, there are WRMSR executions in the guest for which there is
 no event in the trace buffer. From SeaBIOS, during boot.
 In that case, there is no explicit event in the guest trace which you
 can correlate with tsc_offset changes in the host side.
 
 I understand what you want to say, but we don't correlate between
 write_tsc event and write_tsc_offset event directly. This is because
 the write_tsc tracepoint (also WRMSR instruction) is not prepared in
 the current kernel. So, in the previous mail
 (https://lkml.org/lkml/2012/11/22/53), I suggested the method which we
 don't need to prepare 

RE: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-29 Thread Auld, Will
Marcelo, 

My latest QEMU patch ([PATCH V5] target-i386: Enabling IA32_TSC_ADJUST for QEMU 
KVM guest VMs) seems to be OK with regards to reset. The tsc_adjust variable is 
being zeroed by the memset() in x86_cpu_reset(). Later code seems to write 
these values through kvm_put_msr() by way of cpu_synchronize_all_post_reset(). 

Andreas said: I'm expecting this to go through Marcello's queue unless I'm told 
otherwise.

Does this meet your expectations as well?

Thanks,

Will

 -Original Message-
 From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
 Sent: Thursday, November 29, 2012 12:44 PM
 To: Auld, Will
 Cc: kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang,
 Xiantao; a...@redhat.com; qemu-devel; Gleb
 Subject: Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM
 support
 
 On Thu, Nov 29, 2012 at 07:21:28PM +, Auld, Will wrote:
  Marcelo,
 
   The behavior on reset is to return the TSC_ADJUST MSR value to 0x0. I
 am currently initializing this emulated msr in kvm_arch_vcpu_init().
 
 Will,
 
 Reset is handled by QEMU. kvm_arch_vcpu_init is only called during vcpu
 allocation (ie vm creation). See x86_cpu_reset in QEMU's target-
 i386/cpu.c file.
 
 
  - Behaviour on reset: what is the behaviour on RESET?
 
  I am testing the rebase now. I would like to get any needed changes
 for this initialization into my next patch set version (v6) if
 possible.
 
 For the kernel patch, everything is fine. Apparently for the QEMU patch
  it is necessary to set the value to zero on reset.



Re: Re: Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-29 Thread Yoshihiro YUNOMAE

Hi Marcelo,


Can you please write a succint but complete description of the method
so it can be verified?


Sure.

- Prerequisite
1. the host TSC is synchronized and stable.
2. kvm_write_tsc_offset events include previous and next TSC offset
values.
3. Every event's trace_clock is TSC.

- Assumption for the sake of ease
1. One VCPU
2. The guest executes no write_tsc or write_tsc only once.

- Configuration
1. On the host, kvm_exit/entry events are recorded in the buffer A and
kvm_write_tsc_offset events are recorded in the buffer B.
2. Boot a guest

- Sort method
1.
  Confirm whether any kvm_write_tsc_offset events are recorded other than
at boot. Note that a vcpu thread writes the TSC offset at boot as an
initial operation.

1-1.
  If not recorded, it means that the guest did not execute write_tsc.
So, we convert the guest TSC to the host TSC using TSC offset of boot.
= END

1-2.
  If recorded, it means that the guest executed write_tsc.
So, we use new kvm_write_tsc_offset event information.

2.
We convert the host TSC(Th) of the kvm_write_tsc_offset event to
the guest TSC(Tg) using previous_tsc_offset value:
 Tg = Th + previous_tsc_offset

3.
To search the point where the guest executed write_tsc, we find n
which satisfies the condition Tn < Tg < Tn+1 or Tn+1 < Tn < Tg from
older events of the guest.
The former condition means trace data of
the guest were recorded monotonically. On the other hand, the latter
condition means trace data of the guest moved backward.
4.
We convert the guest TSC of trace data to the host TSC using
previous_tsc_offset value before n and using next_tsc_offset value
after n+1.
= END

- Note
We assumed one vcpu and no write_tsc, or write_tsc only once, for the
sake of ease. For other conditions, we will use a similar method.
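Step 3 — locating the write_tsc point in the guest event stream — can be sketched as a scan for either the converted write timestamp Tg or a backward jump; the exact comparisons below are my reading of the description, so treat them as an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan guest timestamps ts[0..len) for the first index n where either
 * Tn < Tg <= Tn+1 (timestamps stayed monotonic across the write) or
 * Tn+1 < Tn (timestamps jumped backward). Events up to n are then
 * converted with previous_tsc_offset, events after n with
 * next_tsc_offset. Returns len when no such point exists (the guest
 * executed no write_tsc).
 */
static size_t find_write_point(const uint64_t *ts, size_t len, uint64_t tg)
{
	for (size_t i = 0; i + 1 < len; i++) {
		if (ts[i + 1] < ts[i])		/* backward jump */
			return i;
		if (ts[i] < tg && tg <= ts[i + 1])
			return i;
	}
	return len;
}
```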

Thanks,


There is no certainty. Consider the following information available

guest trace      host trace
                 100: guest_tsc_write (tsc_offset=-100 => guest_tsc = 0)
                 104: guest_tsc_write (tsc_offset=-104 => guest_tsc = 0)
                 108: guest_tsc_write (tsc_offset=-108 => guest_tsc = 0)
1: eventA
2: eventB
3: eventC
1: eventD
2: eventE
3: eventF

How can you tell which tsc_offset to use for eventD ? It could be either
-104 or -108. The notion of next_tsc_offset is subject to such
issue.

I suppose its fine to restrict the interface as follows: the tool will
accept one trace of events with monotonic timestamps and the user is
responsible for correlating that to a host trace.


OK, I'll add the restriction, which trace data must have monotonic
timestamps to use this feature. I think guests seldom will execute
write_tsc, so in many cases, timestamps will be monotonically recorded
in trace data.


That is, you can't feed distinct instances of guest kernel trace.


I'm not clear on distinct instances. Is this about SMP or multiple
guests? Would you explain this?

Thanks,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com




Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-29 Thread Raghavendra K T

On 11/29/2012 05:46 PM, Gleb Natapov wrote:

On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:

On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:


Don't understand the reasoning behind why 3 is a good choice.


Here is where I came from. (explaining from scratch for
completeness, forgive me :))
In moderate overcommits, we can falsely exit from the ple handler even when
we have a preempted task of the same VM waiting on other cpus. To reduce this
problem, we try a few times before exiting.
The problem boils down to:
what is the probability that we exit ple handler even when we have more
than 1 task in other cpus. Theoretical worst case should be around 1.5x
overcommit (As also pointed by Andrew Theurer). [But practical
worstcase may be around 2x,3x overcommits as indicated by the results
for the patch series]

So if p is the probability of finding rq length one on a particular cpu,
and if we do n tries, then probability of exiting ple handler is:

  p^(n+1) [because we would have come across one source with rq length
1 and n target cpu rqs with length 1]

so
num tries: probability of aborting ple handler (1.5x overcommit)
  1 1/4
  2 1/8
  3 1/16

We can increase this probability with more tries, but the problem is
the overhead.
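The table above is just p^(n+1); a quick standalone check with p = 1/2 (the 1.5x-overcommit worst case the table assumes):

```c
#include <assert.h>

/*
 * Probability of falsely aborting the PLE handler: p is the chance of
 * finding run-queue length 1 on one cpu; with n tries we abort only
 * after n+1 such observations, i.e. with probability p^(n+1).
 */
static double ple_abort_probability(double p, int tries)
{
	double prob = p;		/* the source cpu observation */
	for (int i = 0; i < tries; i++)
		prob *= p;		/* each target cpu observation */
	return prob;
}
```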

IIRC Avi (again) had an idea to track vcpu preemption. When vcpu thread
is preempted we do kvm->preempted_vcpus++, when it runs again we do
kvm->preempted_vcpus--. PLE handler can try harder if kvm->preempted_vcpus
is big or do not try at all if it is zero.


Thanks for the reply Gleb.

Yes.. It was next on my TODO, as you know, and it makes sense to weigh all
these approaches (undercommit patches/throttled yield/preempt
notifier/pvspinlock and their combinations) to a good extent before going
further. I am happy if these patches are now in 'good shape to compare'

state. (same reason I had posted the dynamic PLE approach too).






Re: [PATCH v4 05/14] KVM: ARM: Hypervisor inititalization

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:51 AM, Will Deacon will.dea...@arm.com wrote:

 Typo in subject, use one of initiali[sz]ation instead.

 On Sat, Nov 10, 2012 at 03:42:45PM +, Christoffer Dall wrote:
 Sets up KVM code to handle all exceptions taken to Hyp mode.

 When the kernel is booted in Hyp mode, calling "hvc #0xff" with r0 pointing to
 the new vectors, the HVBAR is changed to the vector pointers.  This allows
 subsystems (like KVM here) to execute code in Hyp-mode with the MMU disabled.

 We initialize other Hyp-mode registers and enables the MMU for Hyp-mode from
 the id-mapped hyp initialization code. Afterwards, the HVBAR is changed to
 point to KVM Hyp vectors used to catch guest faults and to switch to Hyp mode
 to perform a world-switch into a KVM guest.

 If the KVM module is unloaded we call "hvc #0xff" once more to disable the MMU
 in Hyp mode again and install a vector handler to change the HVBAR for a
 subsequent reload of KVM or another hypervisor.

 0xff might be a bit too simple. I notice Xen use 0xEA1, which is
 probably less likely to conflict with anything else. We should probably
 also put these numbers in the same header file so that any conflicts
 become immediately apparent.


actually we don't even do this anymore, the commit text is dated, so
this is a non issue.


 Also provides memory mapping code to map required code pages, data 
 structures,
 and I/O regions  accessed in Hyp mode at the same virtual address as the host
 kernel virtual addresses, but which conforms to the architectural 
 requirements
 for translations in Hyp mode. This interface is added in 
 arch/arm/kvm/arm_mmu.c
 and comprises:
  - create_hyp_mappings(from, to);
  - create_hyp_io_mappings(from, to, phys_addr);
  - free_hyp_pmds();

 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 [...]

 diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
 index c196a22..f6e8f6f 100644
 --- a/arch/arm/include/asm/kvm_arm.h
 +++ b/arch/arm/include/asm/kvm_arm.h
 @@ -21,4 +21,111 @@

  #include <asm/types.h>

 +/* Hyp Configuration Register (HCR) bits */
 +/* Hyp Configuration Register (HCR) bits */
 +#define HCR_TGE		(1 << 27)
 +#define HCR_TVM		(1 << 26)
 +#define HCR_TTLB	(1 << 25)
 +#define HCR_TPU		(1 << 24)
 +#define HCR_TPC		(1 << 23)
 +#define HCR_TSW		(1 << 22)
 +#define HCR_TAC		(1 << 21)
 +#define HCR_TIDCP	(1 << 20)
 +#define HCR_TSC		(1 << 19)
 +#define HCR_TID3	(1 << 18)
 +#define HCR_TID2	(1 << 17)
 +#define HCR_TID1	(1 << 16)
 +#define HCR_TID0	(1 << 15)
 +#define HCR_TWE		(1 << 14)
 +#define HCR_TWI		(1 << 13)
 +#define HCR_DC		(1 << 12)
 +#define HCR_BSU		(3 << 10)
 +#define HCR_BSU_IS	(1 << 10)
 +#define HCR_FB		(1 << 9)
 +#define HCR_VA		(1 << 8)
 +#define HCR_VI		(1 << 7)
 +#define HCR_VF		(1 << 6)
 +#define HCR_AMO		(1 << 5)
 +#define HCR_IMO		(1 << 4)
 +#define HCR_FMO		(1 << 3)
 +#define HCR_PTW		(1 << 2)
 +#define HCR_SWIO	(1 << 1)
 +#define HCR_VM		1
 +
 +/*
 + * The bits we set in HCR:
 + * TAC:	Trap ACTLR
 + * TSC:	Trap SMC
 + * TSW:	Trap cache operations by set/way
 + * TWI:	Trap WFI
 + * TIDCP:	Trap L2CTLR/L2ECTLR
 + * BSU_IS:	Upgrade barriers to the inner shareable domain
 + * FB:		Force broadcast of all maintenance operations
 + * AMO:	Override CPSR.A and enable signaling with VA
 + * IMO:	Override CPSR.I and enable signaling with VI
 + * FMO:	Override CPSR.F and enable signaling with VF
 + * SWIO:	Turn set/way invalidates into set/way clean+invalidate
 + */
 +#define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 +			HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
 +			HCR_SWIO | HCR_TIDCP)
 +
 +/* Hyp System Control Register (HSCTLR) bits */
 +#define HSCTLR_TE	(1 << 30)
 +#define HSCTLR_EE	(1 << 25)
 +#define HSCTLR_FI	(1 << 21)
 +#define HSCTLR_WXN	(1 << 19)
 +#define HSCTLR_I	(1 << 12)
 +#define HSCTLR_C	(1 << 2)
 +#define HSCTLR_A	(1 << 1)
 +#define HSCTLR_M	1
 +#define HSCTLR_MASK	(HSCTLR_M | HSCTLR_A | HSCTLR_C | HSCTLR_I | \
 +			 HSCTLR_WXN | HSCTLR_FI | HSCTLR_EE | HSCTLR_TE)
 +
 +/* TTBCR and HTCR Registers bits */
 +#define TTBCR_EAE	(1 << 31)
 +#define TTBCR_IMP	(1 << 30)
 +#define TTBCR_SH1	(3 << 28)
 +#define TTBCR_ORGN1	(3 << 26)
 +#define TTBCR_IRGN1	(3 << 24)
 +#define TTBCR_EPD1	(1 << 23)
 +#define TTBCR_A1	(1 << 22)
 +#define TTBCR_T1SZ	(3 << 16)
 +#define TTBCR_SH0	(3 << 12)
 +#define TTBCR_ORGN0	(3 << 10)
 +#define TTBCR_IRGN0	(3 << 8)
 +#define TTBCR_EPD0	(1 << 7)
 +#define 

Re: [PATCH v4 07/14] KVM: ARM: Inject IRQs and FIQs from userspace

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 11:21 AM, Will Deacon will.dea...@arm.com wrote:
 On Mon, Nov 19, 2012 at 04:09:06PM +, Christoffer Dall wrote:
 On Mon, Nov 19, 2012 at 10:26 AM, Will Deacon will.dea...@arm.com wrote:
  On Mon, Nov 19, 2012 at 03:04:38PM +, Christoffer Dall wrote:
 In all seriousness, I will unite myself with an alcoholic beverage one
 of these evenings and try to see what I can do about it, maybe split
 it up somehow, I'll give it a shot.

 So this might be to do with the way you've split up the patches (likely I'll
 have similar complaints against the vGIC code, but at least it will make
 sense there!)...

  +int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
  +{
  +	u32 irq = irq_level->irq;
  +	unsigned int irq_type, vcpu_idx, irq_num;
  +	int nrcpus = atomic_read(&kvm->online_vcpus);
  +	struct kvm_vcpu *vcpu = NULL;
  +	bool level = irq_level->level;
  +
  +	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
  +	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
  +	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;
  +
  +	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);
  +
  +	if (irq_type == KVM_ARM_IRQ_TYPE_CPU ||
  +	    irq_type == KVM_ARM_IRQ_TYPE_PPI) {
  +		if (vcpu_idx >= nrcpus)
  +			return -EINVAL;
  +
  +		vcpu = kvm_get_vcpu(kvm, vcpu_idx);
  +		if (!vcpu)
  +			return -EINVAL;
  +	}
  +
  +	switch (irq_type) {
  +	case KVM_ARM_IRQ_TYPE_CPU:
  +		if (irq_num > KVM_ARM_IRQ_CPU_FIQ)
  +			return -EINVAL;
  +
  +		return vcpu_interrupt_line(vcpu, irq_num, level);
  +	}
  +
  +	return -EINVAL;
  +}

 This obviously doesn't handle PPIs, which is where the confusion lies. You
 can just as easily write it as:

 int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
 {
 	u32 irq = irq_level->irq;
 	unsigned int irq_type, vcpu_idx, irq_num;
 	int nrcpus = atomic_read(&kvm->online_vcpus);
 	struct kvm_vcpu *vcpu = NULL;
 	bool level = irq_level->level;

 	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
 	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
 	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;

 	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);

 	if (irq_type != KVM_ARM_IRQ_TYPE_CPU)
 		return -EINVAL;

 	if (vcpu_idx >= nrcpus)
 		return -EINVAL;

 	vcpu = kvm_get_vcpu(kvm, vcpu_idx);
 	if (!vcpu)
 		return -EINVAL;

 	if (irq_num > KVM_ARM_IRQ_CPU_FIQ)
 		return -EINVAL;

 	return vcpu_interrupt_line(vcpu, irq_num, level);
 }

 Then add all the IRQ complexity in a later patch!

that was pretty constructive, I did just that. Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 08/14] KVM: ARM: World-switch implementation

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 9:57 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:43:06PM +, Christoffer Dall wrote:
 Provides complete world-switch implementation to switch to other guests
 running in non-secure modes. Includes Hyp exception handlers that
 capture necessary exception information and store the information on
 the VCPU and KVM structures.

 [...]

 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 15e2ab1..d8f8c60 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -40,6 +40,7 @@
  #include <asm/kvm_arm.h>
  #include <asm/kvm_asm.h>
  #include <asm/kvm_mmu.h>
 +#include <asm/kvm_emulate.h>

  #ifdef REQUIRES_VIRT
  __asm__(".arch_extension	virt");
 @@ -49,6 +50,10 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
  static struct vfp_hard_struct __percpu *kvm_host_vfp_state;
  static unsigned long hyp_default_vectors;

 +/* The VMID used in the VTTBR */
 +static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 +static u8 kvm_next_vmid;
 +static DEFINE_SPINLOCK(kvm_vmid_lock);

  int kvm_arch_hardware_enable(void *garbage)
  {
 @@ -264,6 +269,8 @@ int __attribute_const__ kvm_target_cpu(void)

  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
  {
 +   /* Force users to call KVM_ARM_VCPU_INIT */
 +	vcpu->arch.target = -1;
 return 0;
  }

 @@ -274,6 +281,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
  {
 	vcpu->cpu = cpu;
 +	vcpu->arch.vfp_host = this_cpu_ptr(kvm_host_vfp_state);
  }

  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 @@ -306,12 +314,168 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)

  int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v)
  {
 +	return v->mode == IN_GUEST_MODE;
 +}
 +
 +static void reset_vm_context(void *info)
 +{
 +   kvm_call_hyp(__kvm_flush_vm_context);
 +}
 +
 +/**
 + * need_new_vmid_gen - check that the VMID is still valid
 + * @kvm: The VM's VMID to check
 + *
 + * return true if there is a new generation of VMIDs being used
 + *
 + * The hardware supports only 256 values with the value zero reserved for the
 + * host, so we check if an assigned value belongs to a previous generation,
 + * which requires us to assign a new value. If we're the first to use a
 + * VMID for the new generation, we must flush necessary caches and TLBs on all
 + * CPUs.
 + */
 +static bool need_new_vmid_gen(struct kvm *kvm)
 +{
 +	return unlikely(kvm->arch.vmid_gen != atomic64_read(&kvm_vmid_gen));
 +}
 +
 +/**
 + * update_vttbr - Update the VTTBR with a valid VMID before the guest runs
 + * @kvm	The guest that we are about to run
 + *
 + * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure the
 + * VM has a valid VMID, otherwise assigns a new one and flushes corresponding
 + * caches and TLBs.
 + */
 +static void update_vttbr(struct kvm *kvm)
 +{
 +   phys_addr_t pgd_phys;
 +   u64 vmid;
 +
 +	if (!need_new_vmid_gen(kvm))
 +		return;
 +
 +	spin_lock(&kvm_vmid_lock);
 +
 +	/* First user of a new VMID generation? */
 +	if (unlikely(kvm_next_vmid == 0)) {
 +		atomic64_inc(&kvm_vmid_gen);
 +		kvm_next_vmid = 1;
 +
 +		/*
 +		 * On SMP we know no other CPUs can use this CPU's or
 +		 * each other's VMID since the kvm_vmid_lock blocks
 +		 * them from reentry to the guest.
 +		 */
 +		on_each_cpu(reset_vm_context, NULL, 1);
 +	}
 +
 +	kvm->arch.vmid_gen = atomic64_read(&kvm_vmid_gen);
 +	kvm->arch.vmid = kvm_next_vmid;
 +	kvm_next_vmid++;
 +
 +	/* update vttbr to be used with the new vmid */
 +	pgd_phys = virt_to_phys(kvm->arch.pgd);
 +	vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK;
 +	kvm->arch.vttbr = pgd_phys & VTTBR_BADDR_MASK;
 +	kvm->arch.vttbr |= vmid;
 +
 +	spin_unlock(&kvm_vmid_lock);
 +}

 I must be missing something here: how do you ensure that a guest running
 on multiple CPUs continues to have the same VMID across them after a
 rollover?


when a rollover occurs, there's no problem until someone comes along
that doesn't have a valid vmid (need_new_vmid_gen will return true).

In this case, to assign a vmid, we need to start a new generation of
IDs to assign one, and must ensure that all old vmids are no longer
used. So how do we ensure that?

Well, we increment the kvm_vmid_gen, causing all other cpus who try to
run a VM to hit the spin_lock if they exit the VMs. We reserve the
vmid 1 for the new cpu, and we call on_each_cpu, which causes an ipi
to all other physical cpus, and waits until the other physical cpus
actually complete reset_vm_context.

At this point, once on_each_cpu(reset_vm_context) returns, all other
physical CPUs have cleared their data structures for occurrences of old
vmids, and the kvm_vmid_gen has been incremented, so no other vcpus

Re: [PATCH v4 10/14] KVM: ARM: User space API for getting/setting co-proc registers

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 10:02 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:43:21PM +, Christoffer Dall wrote:
 The following three ioctls are implemented:
  -  KVM_GET_REG_LIST
  -  KVM_GET_ONE_REG
  -  KVM_SET_ONE_REG

 [...]

 diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
 index 845ceda..9671cd2 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1798,6 +1798,11 @@ is the register group type, or coprocessor number:
  ARM core registers have the following id bit patterns:
   0x4002  0010 <index into the kvm_regs struct:16>

 +ARM 32-bit CP15 registers have the following id bit patterns:
 +  0x4002  000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
 +
 +ARM 64-bit CP15 registers have the following id bit patterns:
 +  0x4003  000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>


  4.69 KVM_GET_ONE_REG
 @@ -2139,6 +2144,45 @@ This ioctl returns the guest registers that are supported for the
  KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.


 +4.77 KVM_ARM_VCPU_INIT
 +
 +Capability: basic
 +Architectures: arm
 +Type: vcpu ioctl
 +Parameters: struct kvm_vcpu_init (in)
 +Returns: 0 on success; -1 on error
 +Errors:
 +  EINVAL: the target is unknown, or the combination of features is invalid.
 +  ENOENT: a features bit specified is unknown.
 +
 +This tells KVM what type of CPU to present to the guest, and what
 +optional features it should have.  This will cause a reset of the cpu
 +registers to their initial values.  If this is not called, KVM_RUN will
 +return ENOEXEC for that vcpu.
 +
 +Note that because some registers reflect machine topology, all vcpus
 +should be created before this ioctl is invoked.
 +
 +4.78 KVM_GET_REG_LIST
 +
 +Capability: basic
 +Architectures: arm
 +Type: vcpu ioctl
 +Parameters: struct kvm_reg_list (in/out)
 +Returns: 0 on success; -1 on error
 +Errors:
 +  E2BIG: the reg index list is too big to fit in the array specified by
 + the user (the number required will be written into n).
 +
 +struct kvm_reg_list {
 +   __u64 n; /* number of registers in reg[] */
 +   __u64 reg[0];
 +};
 +
 +This ioctl returns the guest registers that are supported for the
 +KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
 +
 +

 You already added this hunk earlier (and looking at the final result,
 you do end up with two entries for 4.77 and 4.78).

duh, thanks.


Re: [PATCH v4 11/14] KVM: ARM: Demux CCSIDR in the userspace API

2012-11-29 Thread Christoffer Dall
On Mon, Nov 19, 2012 at 10:03 AM, Will Deacon will.dea...@arm.com wrote:
 On Sat, Nov 10, 2012 at 03:43:28PM +, Christoffer Dall wrote:
 The Cache Size Selection Register (CSSELR) selects the current Cache
 Size ID Register (CCSIDR).  You write which cache you are interested
 in to CSSELR, and read the information out of CCSIDR.

 Which cache numbers are valid is known by reading the Cache Level ID
 Register (CLIDR).

 To export this state to userspace, we add a KVM_REG_ARM_DEMUX
 numberspace (17), which uses 8 bits to represent which register is
 being demultiplexed (0 for CCSIDR), and the lower 8 bits to represent
 this demultiplexing (in our case, the CSSELR value, which is 4 bits).

 Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Rusty Russell rusty.russ...@linaro.org

 [...]

  Documentation/virtual/kvm/api.txt |2
  arch/arm/include/uapi/asm/kvm.h   |9 ++
   arch/arm/kvm/coproc.c |  163 -
  3 files changed, 171 insertions(+), 3 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
 index 9671cd2..bf8f99c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1804,6 +1804,8 @@ ARM 32-bit CP15 registers have the following id bit patterns:
  ARM 64-bit CP15 registers have the following id bit patterns:
   0x4003  000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>

 +ARM CCSIDR registers are demultiplexed by CSSELR value:
 +  0x4002  0011 00 <csselr:8>

  4.69 KVM_GET_ONE_REG

 diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
 index a807d9a..78a629c 100644
 --- a/arch/arm/include/uapi/asm/kvm.h
 +++ b/arch/arm/include/uapi/asm/kvm.h
 @@ -80,6 +80,15 @@ struct kvm_arch_memory_slot {
  #define KVM_REG_ARM_CORE	(0x0010 << KVM_REG_ARM_COPROC_SHIFT)
  #define KVM_REG_ARM_CORE_REG(name)   (offsetof(struct kvm_regs, name) / 4)

 +/* Some registers need more space to represent values. */
 +#define KVM_REG_ARM_DEMUX		(0x0011 << KVM_REG_ARM_COPROC_SHIFT)
 +#define KVM_REG_ARM_DEMUX_ID_MASK	0xFF00
 +#define KVM_REG_ARM_DEMUX_ID_SHIFT	8
 +#define KVM_REG_ARM_DEMUX_ID_CCSIDR	(0x00 << KVM_REG_ARM_DEMUX_ID_SHIFT)
 +#define KVM_REG_ARM_DEMUX_VAL_MASK	0x00FF
 +#define KVM_REG_ARM_DEMUX_VAL_SHIFT	0
 +
 +
  /* KVM_IRQ_LINE irq field index values */
  #define KVM_ARM_IRQ_TYPE_SHIFT   24
  #define KVM_ARM_IRQ_TYPE_MASK	0xff
 diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
 index 95a0f5e..9ce5861 100644
 --- a/arch/arm/kvm/coproc.c
 +++ b/arch/arm/kvm/coproc.c
 @@ -35,6 +35,12 @@
   * Co-processor emulation
   
 */

 +/* 3 bits per cache level, as per CLIDR, but non-existent caches always 0 */
 +static u32 cache_levels;
 +
 +/* CSSELR values; used to index KVM_REG_ARM_DEMUX_ID_CCSIDR */
 +#define CSSELR_MAX 12
 +
  int kvm_handle_cp10_id(struct kvm_vcpu *vcpu, struct kvm_run *run)
  {
   kvm_inject_undefined(vcpu);
 @@ -548,11 +554,112 @@ static int set_invariant_cp15(u64 id, void __user *uaddr)
   return 0;
  }

 +static bool is_valid_cache(u32 val)
 +{
 +	u32 level, ctype;
 +
 +	if (val >= CSSELR_MAX)
 +		return -ENOENT;
 +
 +	/* Bottom bit is Instruction or Data bit.  Next 3 bits are level. */
 +	level = (val >> 1);
 +	ctype = (cache_levels >> (level * 3)) & 7;
 +
 +	switch (ctype) {
 +	case 0: /* No cache */
 +		return false;
 +	case 1: /* Instruction cache only */
 +		return (val & 1);
 +	case 2: /* Data cache only */
 +	case 4: /* Unified cache */
 +		return !(val & 1);
 +	case 3: /* Separate instruction and data caches */
 +		return true;
 +	default: /* Reserved: we can't know instruction or data. */
 +		return false;
 +	}
 +}
 +
 +/* Which cache CCSIDR represents depends on CSSELR value. */
 +static u32 get_ccsidr(u32 csselr)
 +{
 +	u32 ccsidr;
 +
 +	/* Make sure no one else changes CSSELR during this! */
 +	local_irq_disable();
 +	/* Put value into CSSELR */
 +	asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r" (csselr));

 You need an isb() here, otherwise you might still address the wrong
 cache (the two accesses don't hazard in the CPU).

that was just totally obvious ;)