Re: pci_enable_msix() fails with ENOMEM/EINVAL
Hi Alex, did I understand correctly that the vector value with which the call to request_threaded_irq() is made is *not* supposed to be zero? Because in my case it is always zero, and still the failure I observe does not happen every time. Usually, 3 unique (non-zero) IRQ numbers are assigned to each attached PCI device of each KVM VM. I will try to repro this like you suggested and let you know. Thanks for your help, Alex.

-----Original Message-----
From: Alex Williamson
Sent: 26 November, 2012 10:04 PM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote:

Hi Alex, thanks for your response. I printed out the vector and entry values of dev->host_msix_entries[i] within assigned_device_enable_host_msix() before the call to request_threaded_irq(). I see that they are all 0s:

kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924
kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #0: [v=0 e=0]
kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #1: [v=0 e=1]
kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #2: [v=0 e=2]

So I don't really understand how they all ask for irq=0; I must be missing something. Is there any other explanation for request_threaded_irq() returning EBUSY? From the code I don't see that there is.

The vectors all being zero sounds like an indication that pci_enable_msix() didn't actually work. Each of those should be a unique vector. Does booting the host with nointremap perhaps make a difference? Maybe we can isolate the problem to the interrupt remapper code.

This issue is reproducible and is not going to go away by itself. Working around it is also problematic. We thought to check whether all IRQs are properly attached after QEMU sets the VM state to running.
However, the VM state is set to running before the IRQ attachments are performed; we debugged this and found out that they are done from a different thread, from a stack trace like this:

kvm_assign_irq()
assigned_dev_update_msix()
assigned_dev_pci_write_config()
pci_host_config_write_common()
pci_data_write()
pci_host_data_write()
memory_region_write_accessor()
access_with_adjusted_size()
memory_region_iorange_write()
ioport_writew_thunk()
ioport_write()
cpu_outw()
kvm_handle_io()
kvm_cpu_exec()
qemu_kvm_cpu_thread_fn()

So it looks like this is performed on demand (on first I/O), so there is no reliable point at which to check that IRQs are attached properly.

Correct, MSI-X is set up when the guest enables MSI-X on the device, which is likely a long way into guest boot. There's no guarantee that the guest ever enables MSI-X, so there's no association with whether the guest is running.

Another issue is that in the KVM code the return value of pci_host_config_write_common() is not checked, so there is no way to report a failure.

A common problem in qemu, imho.

Is there any way you think you can help me debug this further?

It seems like pci_enable_msix is still failing, but perhaps silently without irqbalance. We need to figure out where and why. Isolating it to the interrupt remapper with nointremap might give us some clues (this is an Intel VT-d system, right?). Thanks, Alex

-----Original Message-----
From: Alex Williamson
Sent: 22 November, 2012 12:25 AM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote:

Hi, I was advised to turn off irqbalance and reproduced this issue, but the failure is in a different place now. Now request_threaded_irq() fails with EBUSY. According to the code, this can only happen on the path request_threaded_irq() -> __setup_irq(). Now in __setup_irq(), the only place where EBUSY can show up for us is here: ...
	raw_spin_lock_irqsave(&desc->lock, flags);
	old_ptr = &desc->action;
	old = *old_ptr;
	if (old) {
		/*
		 * Can't share interrupts unless both agree to and are
		 * the same type (level, edge, polarity). So both flag
		 * fields must have IRQF_SHARED set and the bits which
		 * set the trigger type must match. Also all must
		 * agree on ONESHOT.
		 */
		if (!((old->flags & new->flags) & IRQF_SHARED) ||
		    ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) ||
		    ((old->flags ^ new->flags) & IRQF_ONESHOT)) {
			old_name = old->name;
			goto mismatch;
		}

		/* All handlers must agree on per-cpuness */
		if ((old->flags & IRQF_PERCPU) != (new->flags & IRQF_PERCPU))
			goto mismatch;

KVM calls request_threaded_irq() with flags==0, so can it be that different KVM processes request the same IRQ?

Shouldn't be possible, irqs are allocated from a bitmap protected by a mutex, see __irq_alloc_descs.

How do different KVM processes spawned simultaneously agree between them on IRQ numbers?

They don't, MSI/X vectors are not currently share-able. Can you show that you're actually getting duplicate irq vectors? Thanks, Alex

--
Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
On 11/29/2012 07:37 AM, Chegu Vinod wrote: On 11/26/2012 4:07 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu = #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. The first patch optimizes yield_to by bailing out when there is no need to continue (i.e., when there is only one task in both the source and target rq). The second patch uses that in the PLE handler. Further, when a yield_to fails we do not immediately go out of the PLE handler; instead we try thrice, to have a better statistical possibility of a false return. Otherwise that would affect moderate overcommit cases.

The result on a 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and around 51% for dbench 1x with a 32-core PLE machine and a 32-vcpu guest.

base = 3.7.0-rc6
machine: 32 core mx3850 x5 PLE mc

ebizzy (rec/sec, higher is better)
      base       stdev     patched     stdev     %improve
1x    2511.3000   21.5409   6051.8000  170.2592  140.98276
2x    2679.4000  332.4482   2692.3000  251.4005    0.48145
3x    2253.5000  266.4243   2192.1667  178.9753   -2.72169
4x    1784.3750  102.2699   2018.7500  187.5723   13.13485

dbench (throughput in MB/sec, higher is better)
      base       stdev     patched      stdev      %improve
1x    6677.4080  638.5048  10098.0060  3449.7026   51.22643
2x    2012.6760   64.7642   2019.0440    62.6702    0.31639
3x    1302.0783   40.8336   1292.7517    27.0515   -0.71629
4x    3043.1725 3243.7281   4664.4662  5946.5741   53.27643

Here is the reference no-PLE result:
ebizzy_1x_nople  7592.6000 rec/sec
dbench_1x_nople  7853.6960 MB/sec

The result says we can still improve by 60% for ebizzy, but overall we are getting impressive performance with the patches.
Changes since V2:
- Dropped the global measures usage patch (Peter Zijlstra)
- Do not bail out on first failure (Avi Kivity)
- Try thrice on yield_to failure to get statistically more correct behaviour

Changes since V1:
- Discard the idea of exporting nr_running and optimize in the core scheduler (Peter)
- Use yield() instead of schedule in overcommit scenarios (Rik)
- Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1): Bail out of yield_to when source and target runqueue has one task
Raghavendra K T (1): Handle yield_to failure return for potential undercommit case

Please let me know your comments and suggestions.
Link for V2: https://lkml.org/lkml/2012/10/29/287
Link for V1: https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++--
 virt/kvm/kvm_main.c | 26 --
 2 files changed, 35 insertions(+), 16 deletions(-)

Tested-by: Chegu Vinod chegu_vi...@hp.com

Thanks for testing..

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance issue
On Wednesday, November 28, 2012 09:09:29 PM George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote: On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote:

I have some code which does both reference time and invariant TSC, but it will not work after migration. I will send it later today.

Do you mean migrating guests? This is not an issue for us.

OK, but don't say I didn't warn you :) There are two patches, one for kvm and another one for qemu; you will probably need to rebase them. Add the hv_tsc cpu parameter to activate this feature. You will probably need to deactivate hpet by adding the -no-hpet parameter as well.

I would suggest activating relaxed timing for all W2K8R2/Win7 guests.

I've also added +hv_relaxed since then, but this is the command I'm using now and there's no change:

/usr/bin/qemu-kvm -name b691546e-79f8-49c6-a293-81067503a6ad -S -M pc-1.2 -enable-kvm -m 16384 -smp 9,sockets=1,cores=9,threads=1 -uuid b691546e-79f8-49c6-a293-81067503a6ad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/b691546e-79f8-49c6-a293-81067503a6ad.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-hpet -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/dis-magnetics-2-223101/d8b233c6-8424-4de9-ae3c-7c9a60288514,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:2e:fb:a2:36:be,bus=pci.0,addr=0x3 -netdev tap,fd=40,id=hostnet1,vhost=on,vhostfd=41 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=22:94:44:5a:cb:24,bus=pci.0,addr=0x4 -vnc 127.0.0.1:0,password -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -cpu host,hv_tsc

I compiled qemu-1.2.0-24 after applying your patch, used the head for KVM, and I see no difference. I've tried setting Windows' useplatformclock on and off, no change either.

Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer, which do point at something related to the timers.

It means that the hyper-v timestamp source was not activated.

Any other thing I can try?

--
George-Cristian Bîrzan
Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote: On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:

Don't understand the reasoning behind why 3 is a good choice.

Here is where I came from. (Explaining from scratch for completeness, forgive me :)) In moderate overcommits, we can falsely exit from the PLE handler even when we have a preempted task of the same VM waiting on other cpus. To reduce this problem, we try a few times before exiting. The problem boils down to: what is the probability that we exit the PLE handler even when we have more than 1 task on other cpus? The theoretical worst case should be around 1.5x overcommit (as also pointed out by Andrew Theurer). [But the practical worst case may be around 2x-3x overcommit, as indicated by the results for the patch series.]

So if p is the probability of finding rq length one on a particular cpu, and we do n tries, then the probability of exiting the PLE handler is p^(n+1) [because we would have come across one source with rq length 1 and n target cpu rqs with length 1].

So, num tries vs. probability of aborting the PLE handler (1.5x overcommit):

 num_tries | probability of aborting PLE handler
     1     | 1/4
     2     | 1/8
     3     | 1/16

We can increase this probability with more tries, but the problem is the overhead.

IIRC Avi (again) had an idea to track vcpu preemption. When a vcpu thread is preempted we do kvm->preempted_vcpus++, when it runs again we do kvm->preempted_vcpus--. The PLE handler can try harder if kvm->preempted_vcpus is big, or not try at all if it is zero.

Also, if we have tried three times, that means we would have iterated over 3 good eligible vcpus along with many non-eligible candidates. In the worst case, if we iterate over all the vcpus, we reduce 1x performance and overcommit performance gets hit [as in the results]. I have tried num_tries = 1, 2, 3 and n already (not 4 yet). So I concluded 3 is enough. In fact I have also run kernbench and hackbench, which are giving 5-20% improvement.
[ As a side note, I also thought: how about having num_tries = f(n) = ceil(log(num_online_cpus)/2)? But I thought the calculation is too much overhead, and also there is probably no point in making it dependent on online cpus. ]

Please let me know if you are happy with this rationale, or correct me if you foresee some problem. (In fact Avi's and Rik's concern about false exiting made me arrive at the 'try' logic, which I did not have earlier.) I am currently trying out the 1.5x overcommit case and will post the result.

On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote: From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

yield_to returns -ESRCH when the run queue length of both the source and target of yield_to is one. When we see three successive failures of yield_to, we assume we are in a potential undercommit case and abort from the PLE handler. The assumption is backed by the low probability of a wrong decision even for worst-case scenarios such as an average runqueue length between 1 and 2. Note that we do not update the last boosted vcpu in failure cases. Thanks to Avi for raising the question about aborting after the first failure of yield_to.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
[...]

--
Gleb.
Re: Performance issue
On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com wrote:

I've also added +hv_relaxed since then.

I would suggest activating relaxed timing for all W2K8R2/Win7 guests.

Is there any place I can read up on the downsides of this for Linux, or is it just better?

Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer, which do point at something related to the timers.

It means that the hyper-v timestamp source was not activated.

I recompiled the whole kernel with your patch, and while I cannot check at 70Mbps now, a test stream of 20 seems to do better. Also, now I don't see any of those functions, which used to account for ~60% of the time spent by the program. I'm waiting for the customer to come back and start the 'real' stream, but from my tests, time spent in hal.dll is now an order of magnitude smaller.

--
George-Cristian Bîrzan
Re: Performance issue
On Thu, Nov 29, 2012 at 03:45:52PM +0200, George-Cristian Bîrzan wrote: On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com wrote:

I would suggest activating relaxed timing for all W2K8R2/Win7 guests.

Is there any place I can read up on the downsides of this for Linux, or is it just better?

You shouldn't use hyper-v flags for Linux guests. In theory Linux should just ignore them; in practice there may be bugs that prevent Linux from detecting that it runs as a guest and disable optimizations.

Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer, which do point at something related to the timers.

It means that the hyper-v timestamp source was not activated.

I recompiled the whole kernel with your patch, and while I cannot check at 70Mbps now, a test stream of 20 seems to do better. Also, now I don't see any of those functions, which used to account for ~60% of the time spent by the program. I'm waiting for the customer to come back and start the 'real' stream, but from my tests, time spent in hal.dll is now an order of magnitude smaller.

--
George-Cristian Bîrzan

--
Gleb.
KVM VMX: register state after reset violates spec
Hello, we have noticed that, at least on 3.6.8 with VMX, after a VCPU has been reset via the INIT-SIPI-SIPI sequence its register state violates Intel's specification. Specifically for our case, we see the following vcpu state at the end of vmx_vcpu_reset:

regs_avail=ffef regs_dirty=00010010
EIP= EAX=06e8 EBX=0001 ECX=8001 EDX=0600
ESI=d238 EDI= EBP= ESP=

although EAX, EBX, ECX, ESI, EDI, EBP, and ESP should _all_ be zero. See http://download.intel.com/products/processor/manual/253668.pdf section 9.1.1 (page 9-2). Shouldn't vmx_vcpu_reset actively clear those registers? And from a quick glance at the SVM code, the problem might exist there, too. A workaround is to use qemu-kvm with -kvm-no-irqchip.

Julian
Re: Networking latency - what to expect?
Thus spake David Mohr damaili...@mcbf.net:

* vm-vm (same host) 22k

This number is in the same ballpark as what I am seeing on pretty much the same hardware. AFAICS, there is little you can do to the current virtio-virtio code path that would make this substantially faster.

Julian
[PATCH] KVM VMX: Make register state after reset conform to specification.
Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
---
 arch/x86/kvm/vmx.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ff66a3b..ec5a3b3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3948,7 +3948,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmx->soft_vnmi_blocked = 0;
 
-	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(&vmx->vcpu, 0);
 	msr = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&vmx->vcpu))
@@ -3999,6 +3998,14 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 		kvm_rip_write(vcpu, 0xfff0);
 	else
 		kvm_rip_write(vcpu, 0);
+
+	kvm_register_write(vcpu, VCPU_REGS_RAX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RBX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, get_rdx_init_val());
+	kvm_register_write(vcpu, VCPU_REGS_RSI, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RDI, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RBP, 0);
 	kvm_register_write(vcpu, VCPU_REGS_RSP, 0);
 	vmcs_writel(GUEST_DR7, 0x400);
--
1.7.11.7
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #4 from liyi yiliker...@gmail.com 2012-11-29 15:38:10 ---

1: I have disabled irqbalance on the host, and the IRQ affinity on the host is OK; see step 4 in comment 0.
2: As you can see in comment 0, we have set the IRQ affinity on the host to pcpu1 to reduce IPI interrupts, so we want to set the IRQ affinity to vcpu2 on the guest OS according to vcpupin vcpu='2' cpuset='1'/, but setting the IRQ affinity on the guest OS fails when using msi-x.

--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.
Re: Networking latency - what to expect?
On 2012-11-29 07:48, Julian Stecklina wrote: Thus spake David Mohr damaili...@mcbf.net:

* vm-vm (same host) 22k

This number is in the same ballpark as what I am seeing on pretty much the same hardware. AFAICS, there is little you can do to the current virtio-virtio code path that would make this substantially faster.

Thanks for the feedback. Considering that it's better than the hardware network performance, my main issue is actually the latency of communication between VMs on different hosts:

* vm-vm (diff. hosts) 7k

It is obvious that there is a lot more going on compared to same-host communication, but only ~30% of the performance when the network hardware should not be slowing it down (too) much?

Thanks,
~David
Re: pci_enable_msix() fails with ENOMEM/EINVAL
On Thu, 2012-11-29 at 10:42 +0200, Alex Lyakas wrote:

Hi Alex, did I understand correctly that the vector value with which the call to request_threaded_irq() is made is *not* supposed to be zero? Because in my case it is always zero, and still the failure I observe does not happen every time. Usually, 3 unique (non-zero) IRQ numbers are assigned to each attached PCI device of each KVM VM. I will try to repro this like you suggested and let you know.

pci_enable_msix should be setting vectors for all the host_msix_entries. That non-zero vector is the first parameter to request_threaded_irq. If I put a printk in the loop installing the interrupt handler, I get:

assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 2, vector 103

So we enable MSI-X with only one vector enabled, then a second vector gets enabled and we disable and re-enable with two, and so on with three. This is for an assigned e1000e device.

Thanks,
Alex
Re: [PATCH v4 01/14] ARM: Add page table and page defines needed by KVM
On Mon, Nov 19, 2012 at 9:14 AM, Will Deacon will.dea...@arm.com wrote: On Sat, Nov 10, 2012 at 03:42:17PM +0000, Christoffer Dall wrote:

KVM uses the stage-2 page tables and the Hyp page table format, so we define the fields and page protection flags needed by KVM. The nomenclature is this:
- page_hyp: PL2 code/data mappings
- page_hyp_device: PL2 device mappings (vgic access)
- page_s2: Stage-2 code/data page mappings
- page_s2_device: Stage-2 device mappings (vgic access)

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Christoffer Dall c.d...@virtualopensystems.com
[...]

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 941dfb9..087d949 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -57,43 +57,61 @@ static unsigned int cachepolicy __initdata = CPOLICY_WRITEBACK;
 static unsigned int ecc_mask __initdata = 0;
 pgprot_t pgprot_user;
 pgprot_t pgprot_kernel;
+pgprot_t pgprot_hyp_device;
+pgprot_t pgprot_s2;
+pgprot_t pgprot_s2_device;
 EXPORT_SYMBOL(pgprot_user);
 EXPORT_SYMBOL(pgprot_kernel);
+EXPORT_SYMBOL(pgprot_hyp_device);
+EXPORT_SYMBOL(pgprot_s2);
+EXPORT_SYMBOL(pgprot_s2_device);

Do we still need these?

the exports, nope.

 struct cachepolicy {
 	const char	policy[16];
 	unsigned int	cr_mask;
 	pmdval_t	pmd;
 	pteval_t	pte;
+	pteval_t	pte_s2;
 };

+#ifdef CONFIG_ARM_LPAE
+#define s2_policy(policy)	policy
+#else
+#define s2_policy(policy)	0
+#endif

Put the macro in pgtable-{2,3}level.h?

I think that's weird; defining something far away from where it's used will only make it harder to read, pgtable.h is not included in this file, and there are other #ifdef CONFIG_ARM_LPAE in that file.
Here's the fix from above:

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 087d949..46ca62b 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -63,9 +63,6 @@ pgprot_t pgprot_s2_device;
 EXPORT_SYMBOL(pgprot_user);
 EXPORT_SYMBOL(pgprot_kernel);
-EXPORT_SYMBOL(pgprot_hyp_device);
-EXPORT_SYMBOL(pgprot_s2);
-EXPORT_SYMBOL(pgprot_s2_device);

 struct cachepolicy {
 	const char	policy[16];

-Christoffer
[PATCH] kvm-tool: ARM: set interrupt priority mask in secondary boot path
A bug in the KVM GIC init code set the priority mask to the highest possible value, while the reset value should be zero. Now that the kernel bug is fixed, kvm-tool must properly configure its GIC CPU interface in order to receive the boot IPI. Just set the GICC_PMR register to the maximum value (0xff), and it just works.

Cc: Will Deacon will.dea...@arm.com
Cc: Pekka Enberg penb...@kernel.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 tools/kvm/arm/aarch32/smp-pen.S        | 4 ++++
 tools/kvm/arm/include/arm-common/gic.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/tools/kvm/arm/aarch32/smp-pen.S b/tools/kvm/arm/aarch32/smp-pen.S
index 0861171..cee1d3b8 100644
--- a/tools/kvm/arm/aarch32/smp-pen.S
+++ b/tools/kvm/arm/aarch32/smp-pen.S
@@ -17,6 +17,10 @@ smp_pen_start:
 	mov	r1, #GIC_CPUI_CTLR_EN
 	str	r1, [r0]
 
+	@ Set the priority mask to accept any interrupt
+	mov	r1, #GIC_CPUI_DEFAULT_PMR
+	str	r1, [r0, #4]
+
 	@ Now wait for the primary to poke us
 	adr	r0, smp_jump_addr
 	ldr	r1, =AARCH32_SMP_BAD_MAGIC
diff --git a/tools/kvm/arm/include/arm-common/gic.h b/tools/kvm/arm/include/arm-common/gic.h
index d534174..6fa806f 100644
--- a/tools/kvm/arm/include/arm-common/gic.h
+++ b/tools/kvm/arm/include/arm-common/gic.h
@@ -20,6 +20,8 @@
 #define GIC_CPUI_CTLR_EN	(1 << 0)
 
+#define GIC_CPUI_DEFAULT_PMR	0xff
+
 #define GIC_MAX_CPUS	8
 #define GIC_MAX_IRQ	255
--
1.8.0.1
Re: [kvmarm] [PATCH] kvm-tool: ARM: set interrupt priority mask in secondary boot path
On 29 November 2012 15:58, Marc Zyngier marc.zyng...@arm.com wrote:

A bug in the KVM GIC init code set the priority mask to the highest possible value, while the reset value should be zero. Now that the kernel bug is fixed, kvm-tool must properly configure its GIC CPU interface in order to receive the boot IPI. Just set the GICC_PMR register to the maximum value (0xff), and it just works.

--- a/tools/kvm/arm/include/arm-common/gic.h
+++ b/tools/kvm/arm/include/arm-common/gic.h
@@ -20,6 +20,8 @@
 #define GIC_CPUI_CTLR_EN	(1 << 0)
 
+#define GIC_CPUI_DEFAULT_PMR	0xff

This seems an odd name, since 0xff is not the default, which is why we have to write it into the register. Also the register name should presumably go after CPUI, to match CTLR above. GIC_CPUI_PMR_MINIMUM_PRIORITY?

-- PMM
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #5 from Alex Williamson alex.william...@redhat.com 2012-11-29 16:10:55 ---

(In reply to comment #4)

1: i have disable the irqbalance in the host, and the irq affinity on the host is ok.you can see the 4 step in comment 0. 2: you see in comment 0, we have set irq affinity on the host is pcpu1. and so reduce the ipi interrupt, so we want to set the irq affinity at the vcpu2 on the guest os according vcpupin vcpu='2' cpuset='1'/, but the irq affinity on the guest os when using msi-x is failed.

How does it fail? When I test this, I look at /proc/interrupts on the guest and the interrupt count on the row with the pinned IRQ only increases in the column under the target CPU. Perhaps if you included some logs and exact commands used I could better understand where things are going wrong. Thanks.
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #6 from liyi yiliker...@gmail.com 2012-11-29 16:21:16 ---

1: I have set the IRQ (associated with the passthrough device) to vcpu2 on the guest OS, but looking at /proc/interrupts on the guest, the interrupt count for the pinned IRQ increases on many vcpus, everything except vcpu0.
2: My qemu-kvm version is 1.2, and I have not added the patch http://www.spinics.net/lists/kvm/msg83109.html; maybe it will help me. Are you sure that your qemu-kvm is from git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git?
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #7 from Alex Williamson alex.william...@redhat.com 2012-11-29 16:33:38 ---

(In reply to comment #6)

1: i have set the irq (associate the passthrough device) to the vcpu2 on the guest os.but look at /proc/interrupts on the guest and the interrupt count with the pinned IRQ increases a lot of vcpus except the vcpu0.

Perhaps double check that irqbalance is not running in the guest: ps aux | grep irqbalance.

2: my qemu kvm version is 1.2, and have not add the patch http://www.spinics.net/lists/kvm/msg83109.html, maybe it will help me. are you sure that your qemu-kvm is from the git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git?

Those patches are against git://git.qemu.org/qemu.git. I believe the BCM5716 will try to use MSI-X interrupts, not MSI. Please provide lspci -vvv of the device in the guest and we can verify what interrupt mode it's working in. What is the guest operating system? Thanks.
Re: Networking latency - what to expect?
Thus spake David Mohr damaili...@mcbf.net:

> On 2012-11-29 07:48, Julian Stecklina wrote:
>> Thus spake David Mohr damaili...@mcbf.net:
>>> * vm-vm (same host)    22k
>> This number is in the same ballpark as what I am seeing on pretty much the same hardware. AFAICS, there is little you can do to the current virtio-virtio code path that would make this substantially faster.
>
> Thanks for the feedback. Considering that it's better than the hardware network performance, my main issue is actually the latency of communication between VMs on different hosts:
> * vm-vm (diff. hosts)  7k
> It is obvious that there is a lot more going on compared to same-host communication, but only ~30% of the performance when the network hardware should not be slowing it down (too) much?

You are probably better off using SR-IOV NICs with PCI passthrough in this case. Maybe someone can comment on whether virtual interrupt delivery and posted interrupts [1] are already usable. The first one should help in both the virtio and SR-IOV scenarios; the latter only applies to SR-IOV (and PCI passthrough in general).

Julian

[1] http://www.spinics.net/lists/kvm/msg82762.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:55 PM, Glauber Costa wrote:
> On 11/28/2012 10:43 PM, Michael Wolf wrote:
>> On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>>>> In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat.
>>> The definition of stolen time is 'time during which the virtual CPU is runnable but not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case?
>> In the capped case, the time that the guest spends waiting due to it having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting.
> Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right?

Yes, in the context above I'm referring to the cfs bandwidth control.

> Some time has passed since I last looked into it, but IIRC, after you are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time:
>
> arch/x86/kvm/x86.c:
>	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
>	vcpu->arch.st.last_steal = current->sched_info.run_delay;
>	vcpu->arch.st.accum_steal = delta;
>
> include/linux/sched.h:
>	unsigned long long run_delay; /* time spent waiting on a runqueue */
>
> So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing.

So I looked at something like this in the past.
To make sure things haven't changed I set up a cgroup on my test server running a kernel built from the latest tip tree.

[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks. When I start a script that will hog the cpu I see the following in top on the guest:

Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si, 49.8%st

So the steal time here is in line with the bandwidth control settings.

In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do is to just subtract it from run_delay in kvm/x86.c before we pass it on.

In summary: About a year ago I was playing with this patch. It is out of date now but will give you an idea of what I was looking at.

 kernel/sched_fair.c  | 4 ++--
 kernel/sched_stats.h | 7 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq, int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 }
 /* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->throttle_count;
 }
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void
sched_info_queued(struct task_struct *t)
 * sched_info_queued() to mark that it has now again started waiting on
 * the runqueue.
 */
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static inline void sched_info_depart(struct task_struct *t)
 {
+	struct task_group *tg = task_group(t);
+	struct cfs_rq *cfs_rq;
 	unsigned long long delta = task_rq(t)->clock - t->sched_info.last_arrival;
+	cfs_rq = tg->cfs_rq[smp_processor_id()];
 	rq_sched_info_depart(task_rq(t), delta);
-	if (t->state == TASK_RUNNING)
+
+	if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
 		sched_info_queued(t);
 }

So then the steal time did not show on the guest. You have no value that needs to be passed around. What I did not like about this approach was:
* It only works for cfs bandwidth control. If another type of hard limit was added to the kernel the code would potentially need to change.
* This approach doesn't help if the limits are set by
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891 --- Comment #8 from liyi yiliker...@gmail.com 2012-11-29 17:52:47 --- ok,i am sure the irqbalance is not running in the guest. you are right, BCM5716S using the MSI-X interrupts default. 1:i have set the smp_affinity cannot work correctly using the MSI-X. but the test is ok after insmod the bnx2 with disable_msi=1. 2: lspci -vvv: 00:06.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 11 Region 0: Memory at f600 (32-bit, non-prefetchable) [size=32M] Capabilities: [50] Vital Product Data ? Capabilities: [ac] Express (v2) Endpoint, MSI 00 DevCap:MaxPayload 512 bytes, PhantFunc 0, Latency L0s 4us, L1 64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl:Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta:CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap:Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 4us, L1 4us ClockPM- Suprise- LLActRep- BwNot- LnkCtl:ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta:Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1- D2- 
AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9 Vector table: BAR=0 offset=c000 PBA: BAR=0 offset=e000 Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Count=1/16 Enable- Address: Data: Kernel driver in use: bnx2 Kernel modules: bnx2 00:06.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) Subsystem: Dell PowerEdge R710 BCM5709 Gigabit Ethernet Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx- Latency: 0 Interrupt: pin B routed to IRQ 11 Region 0: Memory at f800 (32-bit, non-prefetchable) [size=32M] Capabilities: [50] Vital Product Data ? Capabilities: [ac] Express (v2) Endpoint, MSI 00 DevCap:MaxPayload 512 bytes, PhantFunc 0, Latency L0s 4us, L1 64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl:Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta:CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap:Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 4us, L1 4us ClockPM- Suprise- LLActRep- BwNot- LnkCtl:ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta:Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9
 Vector table: BAR=0 offset=c000
 PBA: BAR=0 offset=e000
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Count=1/16 Enable-
 Address:  Data:
Kernel driver in use: bnx2
Kernel modules: bnx2

The guest OS: 3.7-rc6 (the same as host)
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #9 from Alex Williamson alex.william...@redhat.com 2012-11-29 18:08:44 ---

(In reply to comment #8)
> ok, i am sure the irqbalance is not running in the guest. you are right, BCM5716S using the MSI-X interrupts default.
> 1: i have set the smp_affinity cannot work correctly using the MSI-X. but the test is ok after insmod the bnx2 with disable_msi=1.
> 2: lspci -vvv:
> 00:06.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
...
> 00:06.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

Where's the BCM5716?
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #10 from liyi yiliker...@gmail.com 2012-11-29 18:18:21 ---

Sorry, I find the test on the BCM5709 is OK. Right now I cannot access my physical machine (with the BCM5709S or Intel 82599); I will give the debug info ASAP. Thanks
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #11 from liyi yiliker...@gmail.com 2012-11-29 18:20:59 ---

Please omit the BCM5709S. I have a BCM5709 / BCM5716S / Intel 82599 environment, but right now I cannot access the BCM5716S and Intel 82599. The test on the BCM5709 is OK now, but the earlier tests on the BCM5716S and Intel 82599 failed.
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #12 from Alex Williamson alex.william...@redhat.com 2012-11-29 18:55:46 ---

(In reply to comment #11)
> please omit the BCM5709S. i have BCM5709 BCM5716S intel82599 environment, but now, cannot access the BCM5716S and intel82599. now, the test on BCM5709 is ok, but the test on BCM5716S intel82599 before is failed.

Are you saying MSI-X affinity works as expected on the BCM5709?
Re: [PATCH v4 02/14] ARM: Section based HYP idmap
On Mon, Nov 19, 2012 at 9:16 AM, Will Deacon will.dea...@arm.com wrote: On Sat, Nov 10, 2012 at 03:42:24PM +, Christoffer Dall wrote: Add a method (hyp_idmap_setup) to populate a hyp pgd with an identity mapping of the code contained in the .hyp.idmap.text section. Offer a method to drop this identity mapping through hyp_idmap_teardown. Make all the above depend on CONFIG_ARM_VIRT_EXT and CONFIG_ARM_LPAE. Cc: Will Deacon will.dea...@arm.com Reviewed-by: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/idmap.h|5 ++ arch/arm/include/asm/pgtable-3level-hwdef.h |1 arch/arm/kernel/vmlinux.lds.S |6 ++ arch/arm/mm/idmap.c | 74 +++ 4 files changed, 73 insertions(+), 13 deletions(-) diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h index bf863ed..36708ba 100644 --- a/arch/arm/include/asm/idmap.h +++ b/arch/arm/include/asm/idmap.h @@ -11,4 +11,9 @@ extern pgd_t *idmap_pgd; void setup_mm_for_reboot(void); +#ifdef CONFIG_ARM_VIRT_EXT +void hyp_idmap_teardown(pgd_t *hyp_pgd); +void hyp_idmap_setup(pgd_t *hyp_pgd); +#endif I wonder whether the dependency is quite right here. Functionally, we're only dependent on LPAE but it doesn't make sense if !ARM_VIRT_EXT. In fact, I start to question whether CONFIG_ARM_VIRT_EXT is worthwhile at all as it stands. Maybe it would be better to add the LPAE dependency there and make the hyp_stub stuff that's currently in mainline unconditional? perhaps it would be more clean, but it's not something that would break to fix later, and not in that sense something this patch series would depend on. If the hyp_stub stuff is truly compatible with all legacy stuff then yes, we could remove that define, but if not, then I guess it's necessary. For now, I'll simply remove these ifdef's as Rob pointed out. 
+/*
+ * This version actually frees the underlying pmds for all pgds in range and
+ * clear the pgds themselves afterwards.
+ */
+void hyp_idmap_teardown(pgd_t *hyp_pgd)
+{
+	unsigned long addr, end;
+	unsigned long next;
+	pgd_t *pgd = hyp_pgd;
+
+	addr = virt_to_phys(__hyp_idmap_text_start);
+	end = virt_to_phys(__hyp_idmap_text_end);
+
+	pgd += pgd_index(addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (!pgd_none_or_clear_bad(pgd))
+			hyp_idmap_del_pmd(pgd, addr);
+	} while (pgd++, addr = next, addr < end);
+}
+EXPORT_SYMBOL_GPL(hyp_idmap_teardown);
+
+void hyp_idmap_setup(pgd_t *hyp_pgd)
+{
+	identity_mapping_add(hyp_pgd, __hyp_idmap_text_start,
+			     __hyp_idmap_text_end, PMD_SECT_AP1);
+}
+EXPORT_SYMBOL_GPL(hyp_idmap_setup);
+#endif

> Again, do we need these exports?

no we don't

> I also think it might be cleaner to declare the hyp_pgd next to the idmap_pgd then add the hyp_idmap_setup code to init_static_idmap, so that we don't add new entry points to this file. The teardown can all be moved into the kvm/ code as it doesn't need any of the existing idmap functions.

hmm, we had some attempts at this earlier, but it wasn't all that nice. Allocating the hyp_pgd inside idmap.c requires some more logic, and the #ifdefs inside init_static_idmap are also not pretty. Additionally, other potential users of Hyp mode might have a separate hyp_pgd, in theory. While KVM is the only current user of hyp_idmap_teardown, it's not really KVM specific, and there could theoretically be other users of hyp idmaps. Also, the externs __hyp_idmap_text_start|end would have to be either moved to a header file or duplicated inside KVM, which is also not that pretty.
I see how you would like to clean this up, but it's not really hard to read or understand, imho. The ifdef cleanup patch:

From 9cbe2830b74072b4d7167254562fc06c08f724e1 Mon Sep 17 00:00:00 2001
From: Christoffer Dall c.d...@virtualopensystems.com
Date: Thu, 29 Nov 2012 13:56:03 -0500
Subject: [PATCH] KVM: ARM: Remove exports and ifdefs in hyp idmap code

The exports are no longer needed as KVM/ARM cannot be compiled as a module, and the #ifdef is not required around a declaration.

Cc: Will Deacon will.dea...@arm.com
Cc: Rob Herring robherri...@gmail.com
Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
---
 arch/arm/include/asm/idmap.h | 2 --
 arch/arm/mm/idmap.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h
index 36708ba..6ddb707 100644
--- a/arch/arm/include/asm/idmap.h
+++ b/arch/arm/include/asm/idmap.h
@@ -11,9 +11,7 @@ extern pgd_t *idmap_pgd;
 void setup_mm_for_reboot(void);
-#ifdef CONFIG_ARM_VIRT_EXT
 void hyp_idmap_teardown(pgd_t *hyp_pgd);
 void hyp_idmap_setup(pgd_t *hyp_pgd);
-#endif
 #endif /* __ASM_IDMAP_H */
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891

--- Comment #13 from liyi yiliker...@gmail.com 2012-11-29 19:12:47 ---

yes.
[PATCH] kvm-tool: fix zombie reaping in guest/init.c
init.c is not very kind with processes that get reparented when their own parent dies, leaving them hanging around. Looking at the code, it only seems to care about its own flesh and blood. Bad init. Teach it some basic zombie-reaping skills.

Cc: Pekka Enberg penb...@kernel.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 tools/kvm/guest/init.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/guest/init.c b/tools/kvm/guest/init.c
index ece48fd..8c49a03 100644
--- a/tools/kvm/guest/init.c
+++ b/tools/kvm/guest/init.c
@@ -61,7 +61,11 @@ int main(int argc, char *argv[])
 		else
 			run_process("/bin/sh");
 	} else {
-		waitpid(child, &status, 0);
+		pid_t corpse;
+
+		do {
+			corpse = waitpid(-1, &status, 0);
+		} while (corpse != child);
 	}
 	reboot(LINUX_REBOOT_CMD_RESTART);
--
1.8.0.1
Re: Performance issue
On Thursday, November 29, 2012 03:56:10 PM Gleb Natapov wrote:
> On Thu, Nov 29, 2012 at 03:45:52PM +0200, George-Cristian Bîrzan wrote:
>> On Thu, Nov 29, 2012 at 1:56 PM, Vadim Rozenfeld vroze...@redhat.com wrote:
>> I've also added +hv_relaxed since then, but this is the command I'm
> I would suggest activating relaxed timing for all W2K8R2/Win7 guests.
>> Is there any place I can read up on the downsides of this for Linux, or is it Just Better?
> You shouldn't use hyper-v flags for Linux guests. In theory Linux should just ignore them; in practice there may be bugs that prevent Linux from detecting that it runs as a guest and disabling optimizations.

As Gleb said, hyper-v flags are relevant to Windows guests only. IIRC spinlocks and vapic should work for Vista and higher; relaxed timing and partition reference time work for Win7/W2K8R2.

>> Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer, which do point at something related to the timers.
> It means that the hyper-v time stamp source was not activated.

I recompiled the whole kernel, with your patch, and while I cannot check at 70Mbps now, a test stream of 20 seems to do better. Also, now I don't see any of those functions, which used to account for ~60% of the time spent by the program. I'm waiting for the customer to come back and start the 'real' stream, but from my tests, time spent in hal.dll is now an order of magnitude smaller.

--
George-Cristian Bîrzan
[PATCH V6 0/2] Enable guest use of TSC_ADJUST functionality
I have re-based this patch set version (V6) to kvm.git's queue branch.

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/include/asm/kvm_host.h | 14 ---
 arch/x86/include/asm/msr-index.h | 1 +
 arch/x86/kvm/cpuid.c | 2 ++
 arch/x86/kvm/cpuid.h | 8 ++
 arch/x86/kvm/svm.c | 22 ++---
 arch/x86/kvm/vmx.c | 27 +++-
 arch/x86/kvm/x86.c | 52 +--
 arch/x86/kvm/x86.h | 2 +-
 9 files changed, 108 insertions(+), 21 deletions(-)

--
1.8.0.rc0
[PATCH V6 1/2] Add code to track call origin for msr assignment.
In order to track who initiated the call (host or guest) to modify an MSR value, I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is updating the IA32_TSC_ADJUST MSR while setting the TSC value. It is anticipated that this capability will be useful for other tasks.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c | 15 +++
 arch/x86/kvm/vmx.c | 18 --
 arch/x86/kvm/x86.c | 29 ++---
 arch/x86/kvm/x86.h | 2 +-
 5 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9fb6d8d..56c5dca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -620,6 +620,12 @@ struct kvm_vcpu_stat {
 struct x86_instruction_info;
+struct msr_data {
+	bool host_initiated;
+	u32 index;
+	u64 data;
+};
+
 struct kvm_x86_ops {
 	int (*cpu_has_kvm_support)(void); /* __init */
 	int (*disabled_by_bios)(void); /* __init */
@@ -642,7 +648,7 @@ struct kvm_x86_ops {
 	void (*update_db_bp_intercept)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-	int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
 	u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
 	void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
@@ -793,7 +799,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 struct x86_emulate_ctxt;
@@ -820,7 +826,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 161a5fa..a8442e8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3127,13 +3127,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data)
 	return 0;
 }
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 ecx = msr->index;
+	u64 data = msr->data;
 	switch (ecx) {
 	case MSR_IA32_TSC:
-		kvm_write_tsc(vcpu, data);
+		kvm_write_tsc(vcpu, msr);
 		break;
 	case MSR_STAR:
 		svm->vmcb->save.star = data;
@@ -3188,20 +3190,25 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
 		break;
 	default:
-		return kvm_set_msr_common(vcpu, ecx, data);
+		return kvm_set_msr_common(vcpu, msr);
 	}
 	return 0;
 }
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+	struct msr_data msr;
 	u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
 	u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u) | ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
+	msr.data = data;
+	msr.index = ecx;
+	msr.host_initiated = false;
 	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
-	if (svm_set_msr(&svm->vcpu, ecx, data)) {
+	if (svm_set_msr(&svm->vcpu, &msr)) {
+		trace_kvm_msr_write_ex(ecx, data);
 		kvm_inject_gp(&svm->vcpu, 0);
 	} else {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bb18923..a166558 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2201,15 +2201,17 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 * Returns 0 on success, non-0 otherwise.
 * Assumes vcpu_load() was already called.
 */
-static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct shared_msr_entry *msr;
 	int ret = 0;
+	u32 msr_index = msr_info->index;
[PATCH V6 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support
CPUID.7.0.EBX[1]=1 indicates the IA32_TSC_ADJUST MSR 0x3b is supported.

The basic design is to emulate the MSR by allowing reads and writes to a guest-vcpu-specific location that stores the value of the emulated MSR, while adding that value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will be included in all reads of the TSC MSR, whether through rdmsr or rdtsc. This is of course as long as the "use TSC counter offsetting" VM-execution control is enabled, as well as the IA32_TSC_ADJUST control.

However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs tsc_offset for a guest process when it does an rdtsc (with the correct settings), the value of our virtualized IA32_TSC_ADJUST must be stored in one of these three locations. The argument against storing it in the actual MSR is performance: it is likely to be seldom used, while the save/restore would be required on every transition. IA32_TSC_ADJUST was created as a way to solve some issues with writing the TSC itself, so that is not an option either. The remaining option, defined above as our solution, has the problem of returning incorrect vmcs tsc_offset values (unless we intercept and fix them, not done here) as mentioned above.

However, more problematic is that storing the data in the vmcs tsc_offset will have a different semantic effect on the system than using the actual MSR. This is illustrated in the following example: the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a guest process performs an rdtsc. In this case the guest process will get TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest. While the total system semantics change, the semantics as seen by the guest do not, and hence this will not cause a problem.
Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/include/asm/msr-index.h | 1 +
 arch/x86/kvm/cpuid.c | 2 ++
 arch/x86/kvm/cpuid.h | 8 ++++++++
 arch/x86/kvm/svm.c | 7 +++++++
 arch/x86/kvm/vmx.c | 9 +++++++++
 arch/x86/kvm/x86.c | 23 +++++++++++++++++++++++
 8 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 8c297aa..602c476 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -202,6 +202,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE	(9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST	(9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1	(9*32+ 3) /* 1st group bit manipulation extensions */
 #define X86_FEATURE_HLE	(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2	(9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56c5dca..dc87b65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,6 +444,7 @@ struct kvm_vcpu_arch {
 	s8 virtual_tsc_shift;
 	u32 virtual_tsc_mult;
 	u32 virtual_tsc_khz;
+	s64 ia32_tsc_adjust_msr;
 	atomic_t nmi_queued; /* unprocessed asynchronous NMIs */
 	unsigned nmi_pending; /* NMI queued after currently running handler */
@@ -711,6 +712,7 @@ struct kvm_x86_ops {
 	bool (*has_wbinvd_exit)(void);
 	void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale);
+	u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..c2dea36 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -236,6 +236,7 @@
 #define MSR_IA32_EBL_CR_POWERON	0x0000002a
 #define MSR_EBC_FREQUENCY_ID	0x0000002c
 #define MSR_IA32_FEATURE_CONTROL	0x0000003a
+#define MSR_IA32_TSC_ADJUST	0x0000003b
 #define FEATURE_CONTROL_LOCKED	(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ec79e77..ee4e76c 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	if (index == 0) {
 		entry->ebx = kvm_supported_word9_x86_features;
 		cpuid_mask(&entry->ebx, 9);
+		// TSC_ADJUST is emulated
+		entry->ebx |= F(TSC_ADJUST);
 	} else
 		entry->ebx = 0;
 	entry->eax = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a10e460..3a8b504 100644
--- a/arch/x86/kvm/cpuid.h
+++
Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support
On Thu, Nov 29, 2012 at 07:21:28PM +, Auld, Will wrote:

Marcelo,

The behavior on reset is to return the TSC_ADJUST msr value to 0x0. I am currently initializing this emulated msr in kvm_arch_vcpu_init().

Will,

Reset is handled by QEMU. kvm_arch_vcpu_init is only called during vcpu allocation (ie vm creation). See x86_cpu_reset in QEMU's target-i386/cpu.c file.

- Behaviour on reset: what is the behaviour on RESET?

I am testing the rebase now. I would like to get any needed changes for this initialization into my next patch set version (v6) if possible.

For the kernel patch, everything is fine. Apparently for the QEMU patch it is necessary to set the value to zero on reset.

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: Fix user memslot overlap check
Prior to memory slot sorting this loop compared all of the user memory slots for overlap with new entries. With memory slot sorting, we're just checking some number of entries in the array that may or may not be user slots. Instead, walk all the slots with kvm_for_each_memslot, which has the added benefit of terminating early when we hit the first empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org
---
 virt/kvm/kvm_main.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..cac294d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,7 +710,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long i;
-	struct kvm_memory_slot *memslot;
+	struct kvm_memory_slot *memslot, *slot;
 	struct kvm_memory_slot old, new;
 	struct kvm_memslots *slots, *old_memslots;
@@ -761,13 +761,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	/* Check for overlaps */
 	r = -EEXIST;
-	for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-		struct kvm_memory_slot *s = &kvm->memslots->memslots[i];
-
-		if (s == memslot || !s->npages)
+	kvm_for_each_memslot(slot, kvm->memslots) {
+		if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
 			continue;
-		if (!((base_gfn + npages <= s->base_gfn) ||
-		      (base_gfn >= s->base_gfn + s->npages)))
+		if (!((base_gfn + npages <= slot->base_gfn) ||
+		      (base_gfn >= slot->base_gfn + slot->npages)))
 			goto out_free;
 	}
[PATCH v2] kvm: Fix user memslot overlap check
Prior to memory slot sorting this loop compared all of the user memory slots for overlap with new entries. With memory slot sorting, we're just checking some number of entries in the array that may or may not be user slots. Instead, walk all the slots with kvm_for_each_memslot, which has the added benefit of terminating early when we hit the first empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org
---
v2: Remove unused variable i

 virt/kvm/kvm_main.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..6e8fa7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -709,8 +709,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int r;
 	gfn_t base_gfn;
 	unsigned long npages;
-	unsigned long i;
-	struct kvm_memory_slot *memslot;
+	struct kvm_memory_slot *memslot, *slot;
 	struct kvm_memory_slot old, new;
 	struct kvm_memslots *slots, *old_memslots;
@@ -761,13 +760,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	/* Check for overlaps */
 	r = -EEXIST;
-	for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-		struct kvm_memory_slot *s = &kvm->memslots->memslots[i];
-
-		if (s == memslot || !s->npages)
+	kvm_for_each_memslot(slot, kvm->memslots) {
+		if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
 			continue;
-		if (!((base_gfn + npages <= s->base_gfn) ||
-		      (base_gfn >= s->base_gfn + s->npages)))
+		if (!((base_gfn + npages <= slot->base_gfn) ||
+		      (base_gfn >= slot->base_gfn + slot->npages)))
 			goto out_free;
 	}
Re: [PATCH v4 03/14] ARM: Factor out cpuid implementor and part number
On Mon, Nov 19, 2012 at 9:21 AM, Will Deacon will.dea...@arm.com wrote:
On Sat, Nov 10, 2012 at 03:42:31PM +, Christoffer Dall wrote:

Decoding the implementor and part number of the CPU id in the CPU ID register is needed by KVM, so we factor it out to share the code.

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

[...]

diff --git a/arch/arm/include/asm/cputype.h b/arch/arm/include/asm/cputype.h
index cb47d28..306fb2c 100644
--- a/arch/arm/include/asm/cputype.h
+++ b/arch/arm/include/asm/cputype.h
@@ -51,6 +51,22 @@ extern unsigned int processor_id;
 #define read_cpuid_ext(reg) 0
 #endif

+#define IMPLEMENTOR_ARM		0x41
+#define IMPLEMENTOR_INTEL	0x69
+
+#define PART_NUMBER_ARM1136	0xB360
+#define PART_NUMBER_ARM1156	0xB560
+#define PART_NUMBER_ARM1176	0xB760
+#define PART_NUMBER_ARM11MPCORE	0xB020
+#define PART_NUMBER_CORTEX_A8	0xC080
+#define PART_NUMBER_CORTEX_A9	0xC090
+#define PART_NUMBER_CORTEX_A5	0xC050
+#define PART_NUMBER_CORTEX_A15	0xC0F0
+#define PART_NUMBER_CORTEX_A7	0xC070
+
+#define PART_NUMBER_XSCALE1	0x1
+#define PART_NUMBER_XSCALE2	0x2

We should probably prefix these with ARM_CPU_ and make the current names shorter to compensate, e.g. ARM_CPU_PART_1136, ARM_CPU_IMP_ARM?

 /*
  * The CPU ID never changes at run time, so we might as well tell the
  * compiler that it's constant. Use this function to read the CPU ID
@@ -61,6 +77,16 @@ static inline unsigned int __attribute_const__ read_cpuid_id(void)
 	return read_cpuid(CPUID_ID);
 }

+static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
+{
+	return (read_cpuid_id() & 0xFF000000) >> 24;
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
+{
+	return read_cpuid_id() & 0xFFF0;
+}

Perhaps this should take the implementor as an argument, given that the part number is described differently between implementors. The xscale stuff can then move in here (we'll need to check the xscale docs in case perf is using a subfield -- I can't remember off-hand).

 static inline unsigned int __attribute_const__ read_cpuid_cachetype(void)
 {
 	return read_cpuid(CPUID_CACHETYPE);

diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 8d7d8d4..ff18566 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -200,46 +200,46 @@ static struct arm_pmu *__devinit probe_current_pmu(void)
 	struct arm_pmu *pmu = NULL;
 	int cpu = get_cpu();
 	unsigned long cpuid = read_cpuid_id();
-	unsigned long implementor = (cpuid & 0xFF000000) >> 24;
-	unsigned long part_number = (cpuid & 0xFFF0);
+	unsigned long implementor = read_cpuid_implementor();
+	unsigned long part_number = read_cpuid_part_number();

 	pr_info("probing PMU on CPU %d\n", cpu);

 	/* ARM Ltd CPUs. */
-	if (0x41 == implementor) {
+	if (implementor == IMPLEMENTOR_ARM) {
 		switch (part_number) {
-		case 0xB360:	/* ARM1136 */
-		case 0xB560:	/* ARM1156 */
-		case 0xB760:	/* ARM1176 */
+		case PART_NUMBER_ARM1136:
+		case PART_NUMBER_ARM1156:
+		case PART_NUMBER_ARM1176:
 			pmu = armv6pmu_init();
 			break;
-		case 0xB020:	/* ARM11mpcore */
+		case PART_NUMBER_ARM11MPCORE:
 			pmu = armv6mpcore_pmu_init();
 			break;
-		case 0xC080:	/* Cortex-A8 */
+		case PART_NUMBER_CORTEX_A8:
 			pmu = armv7_a8_pmu_init();
 			break;
-		case 0xC090:	/* Cortex-A9 */
+		case PART_NUMBER_CORTEX_A9:
 			pmu = armv7_a9_pmu_init();
 			break;
-		case 0xC050:	/* Cortex-A5 */
+		case PART_NUMBER_CORTEX_A5:
 			pmu = armv7_a5_pmu_init();
 			break;
-		case 0xC0F0:	/* Cortex-A15 */
+		case PART_NUMBER_CORTEX_A15:
 			pmu = armv7_a15_pmu_init();
 			break;
-		case 0xC070:	/* Cortex-A7 */
+		case PART_NUMBER_CORTEX_A7:
 			pmu = armv7_a7_pmu_init();
 			break;
 		}
 	/* Intel CPUs [xscale]. */
-	} else if (0x69 == implementor) {
+	} else if (implementor == IMPLEMENTOR_INTEL) {
 		part_number = (cpuid >> 13) & 0x7;
 		switch (part_number) {
-		case 1:
+		case PART_NUMBER_XSCALE1:
 			pmu = xscale1pmu_init();
 			break;
-		case 2:
+		case PART_NUMBER_XSCALE2:
 			pmu = xscale2pmu_init();
Re: [PATCH v4 04/14] KVM: ARM: Initial skeleton to compile KVM support
On Mon, Nov 19, 2012 at 9:41 AM, Will Deacon will.dea...@arm.com wrote:
On Sat, Nov 10, 2012 at 03:42:38PM +, Christoffer Dall wrote:

Targets KVM support for Cortex A-15 processors. Contains all the framework components, make files, header files, some tracing functionality, and basic user space API. Only supported core is Cortex-A15 for now. Most functionality is in arch/arm/kvm/* or arch/arm/include/asm/kvm_*.h.

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Rusty Russell rusty.russ...@linaro.org
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

[...]

 4.69 KVM_GET_ONE_REG

 Capability: KVM_CAP_ONE_REG
@@ -1791,6 +1800,7 @@
 The list of registers accessible using this interface is identical to the
 list in 4.68.
+

Whitespace.

 4.70 KVM_KVMCLOCK_CTRL

 Capability: KVM_CAP_KVMCLOCK_CTRL
@@ -2072,6 +2082,46 @@
 KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external call; source cpu in parm

 Note that the vcpu ioctl is asynchronous to vcpu execution.

+4.77 KVM_ARM_VCPU_INIT
+
+Capability: basic
+Architectures: arm
+Type: vcpu ioctl
+Parameters: struct kvm_vcpu_init (in)
+Returns: 0 on success; -1 on error
+Errors:
+  EINVAL: the target is unknown, or the combination of features is invalid.
+  ENOENT: a features bit specified is unknown.
+
+This tells KVM what type of CPU to present to the guest, and what
+optional features it should have. This will cause a reset of the cpu
+registers to their initial values. If this is not called, KVM_RUN will
+return ENOEXEC for that vcpu.
+
+Note that because some registers reflect machine topology, all vcpus
+should be created before this ioctl is invoked.
+
+4.78 KVM_GET_REG_LIST
+
+Capability: basic
+Architectures: arm
+Type: vcpu ioctl
+Parameters: struct kvm_reg_list (in/out)
+Returns: 0 on success; -1 on error
+Errors:
+  E2BIG: the reg index list is too big to fit in the array specified by
+         the user (the number required will be written into n).
+
+struct kvm_reg_list {
+	__u64 n; /* number of registers in reg[] */
+	__u64 reg[0];
+};
+
+This ioctl returns the guest registers that are supported for the
+KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.

 5. The kvm_run structure

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..43a997a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -2311,3 +2311,5 @@
 source security/Kconfig
 source crypto/Kconfig
 source lib/Kconfig
+
+source arch/arm/kvm/Kconfig
diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index 5f914fc..433becd 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -257,6 +257,7 @@
 core-$(CONFIG_XEN) += arch/arm/xen/
 core-y += arch/arm/kernel/ arch/arm/mm/ arch/arm/common/
 core-y += arch/arm/net/
 core-y += arch/arm/crypto/
+core-y += arch/arm/kvm/

Predicate this on CONFIG_KVM_ARM_HOST, then add some obj-y to the kvm/Makefile.

 core-y += $(machdirs) $(platdirs)
 drivers-$(CONFIG_OPROFILE) += arch/arm/oprofile/
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
new file mode 100644
index 000..67826b2
--- /dev/null
+++ b/arch/arm/include/asm/kvm_host.h
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2012 - Virtual Open Systems and Columbia University
+ * Author: Christoffer Dall c.d...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __ARM_KVM_HOST_H__
+#define __ARM_KVM_HOST_H__
+
+#include <asm/kvm.h>
+#include <asm/kvm_asm.h>
+
+#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS

One thing I noticed when hacking kvmtool is that we don't support the KVM_CAP_{MAX,NR}_VCPUS capabilities. Why is this? Since we have a maximum of 8 due to limitations of the GIC, it might be nice to put that somewhere visible to userspace.

indeed

+#define KVM_MEMORY_SLOTS 32
+#define KVM_PRIVATE_MEM_SLOTS 4
+#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+
+#define KVM_VCPU_MAX_FEATURES 0
+
+/* We don't currently support large pages. */
+#define KVM_HPAGE_GFN_SHIFT(x) 0
+#define KVM_NR_PAGE_SIZES 1
+#define
Re: Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset
On Tue, Nov 27, 2012 at 07:53:47PM +0900, Yoshihiro YUNOMAE wrote:

Hi Marcelo,

(2012/11/27 8:16), Marcelo Tosatti wrote:
On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:

500h. event tsc_write tsc_offset=-3000

Then a guest trace containing events with a TSC timestamp. Which tsc_offset to use? (That is the problem, which unless I am mistaken can only be solved easily if the guest can convert RDTSC -> TSC of host.)

There are three following cases of changing TSC offset:
1. Reset TSC at guest boot time
2. Adjust TSC offset due to some host's problems
3. Write TSC on guests

The scenario which you mentioned is case 3, so we'll discuss this case. Here, we assume that a guest is allocated a single CPU for the sake of ease.

If a guest executes write_tsc, TSC values jump forward or backward. For the forward case, trace data are as follows:

      host                        guest
  cycles  events              cycles  events
    3000  tsc_offset=-2950
    3001  kvm_enter
                                  53  eventX
                                 100  (write_tsc=+900)
    3060  kvm_exit
    3075  tsc_offset=-2050
    3080  kvm_enter
                                1050  event1
                                1055  event2
    ...

This case is simple. The guest TSC of the first kvm_enter is calculated as follows:

  (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51

Similarly, the guest TSC of the second kvm_enter is 130. So the guest events between 51 and 130, that is, "53 eventX", are inserted between the first pair of kvm_enter and kvm_exit. To insert guest events between 51 and 130, we convert the guest TSC to the host TSC using TSC offset 2950.

For the backward case, trace data are as follows:

      host                        guest
  cycles  events              cycles  events
    3000  tsc_offset=-2950
    3001  kvm_enter
                                  53  eventX
                                 100  (write_tsc=-50)
    3060  kvm_exit
    3075  tsc_offset=-3000
    3080  kvm_enter
                                  90  event1
                                  95  event2
    ...
    3400
                                 100  (write_tsc=-50)
                                  90  event3
                                  95  event4

As you say, in this case, the previous method is invalid. When we calculate the guest TSC value for the tsc_offset=-3000 event, the value is 75 on the guest. This seems like a prior event of the write_tsc=-50 event.
So, we need to consider more. In this case, it is important that we can understand where the guest executes write_tsc or the host rewrites the TSC offset. write_tsc on the guest equals wrmsr 0x0010, so this instruction induces vm_exit. This implies that the guest does not operate when the host changes TSC offset on the cpu. In other words, the guest cannot use new TSC before the host rewrites the new TSC offset. So, if timestamp on the guest is not monotonically increased, we can understand the guest executes write_tsc. Moreover, in the region where timestamp is decreasing, we can understand when the host rewrote the TSC offset in the guest trace data. Therefore, we can sort trace data in chronological order. This requires an entire trace of events. That is, to be able to reconstruct timeline you require the entire trace from the moment guest starts. So that you can correlate wrmsr-to-tsc on the guest with vmexit-due-to-tsc-write on the host. Which means that running out of space for trace buffer equals losing ability to order events. Is that desirable? It seems cumbersome to me. As you say, tracing events can overwrite important events like kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is needed by this feature. Normal events which often hit record the buffer A, and important events which rarely hit record the buffer B. In our case, the important event is write_tsc_offset. Also the need to correlate each write_tsc event in the guest trace with a corresponding tsc_offset write in the host trace means that it is _necessary_ for the guest and host to enable tracing simultaneously. Correct? Also, there are WRMSR executions in the guest for which there is no event in the trace buffer. From SeaBIOS, during boot. In that case, there is no explicit event in the guest trace which you can correlate with tsc_offset changes in the host side. 
I understand that you want to say, but we don't correlate between write_tsc event and write_tsc_offset event directly. This is because the write_tsc tracepoint (also WRMSR instruction) is not prepared in the current kernel. So, in the previous mail (https://lkml.org/lkml/2012/11/22/53), I suggested the method which we don't need to prepare
RE: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support
Marcelo,

My latest QEMU patch ([PATCH V5] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs) seems to be OK with regards to reset. The tsc_adjust variable is being zeroed by the memset() in x86_cpu_reset(). Later code seems to write these values through kvm_put_msr() by way of cpu_synchronize_all_post_reset().

Andreas said: I'm expecting this to go through Marcello's queue unless I'm told otherwise.

Does this meet your expectations as well?

Thanks,

Will

-----Original Message-----
From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
Sent: Thursday, November 29, 2012 12:44 PM
To: Auld, Will
Cc: kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com; qemu-devel; Gleb
Subject: Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

On Thu, Nov 29, 2012 at 07:21:28PM +, Auld, Will wrote:

Marcelo,

The behavior on reset is to return the TSC_ADJUST msr value to 0x0. I am currently initializing this emulated msr in kvm_arch_vcpu_init().

Will,

Reset is handled by QEMU. kvm_arch_vcpu_init is only called during vcpu allocation (ie vm creation). See x86_cpu_reset in QEMU's target-i386/cpu.c file.

- Behaviour on reset: what is the behaviour on RESET?

I am testing the rebase now. I would like to get any needed changes for this initialization into my next patch set version (v6) if possible.

For the kernel patch, everything is fine. Apparently for the QEMU patch it is necessary to set the value to zero on reset.
Re: Re: Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset
Hi Marcelo,

Can you please write a succinct but complete description of the method so it can be verified?

Sure.

- Prerequisite
1. The host TSC is synchronized and stable.
2. kvm_write_tsc_offset events include previous and next TSC offset values.
3. Every event's trace_clock is TSC.

- Assumption for the sake of ease
1. One VCPU
2. The guest executes no write_tsc, or write_tsc only once.

- Configuration
1. On the host, kvm_exit/entry events are recorded in buffer A and kvm_write_tsc_offset events are recorded in buffer B.
2. Boot a guest

- Sort method
1. Confirm whether kvm_write_tsc_offset events are recorded except for boot. Note that a vcpu thread writes the TSC offset at boot as an initial operation.
1-1. If not recorded, it means that the guest did not execute write_tsc. So, we convert the guest TSC to the host TSC using the TSC offset of boot. => END
1-2. If recorded, it means that the guest executed write_tsc. So, we use the new kvm_write_tsc_offset event information.
2. We convert the host TSC (Th) of the kvm_write_tsc_offset event to the guest TSC (Tg) using the previous_tsc_offset value: Tg = Th + previous_tsc_offset
3. To search the point where the guest executed write_tsc, we find n which satisfies the condition Tn < Tg < Tn+1 or Tn+1 < Tn < Tg from older events of the guest. The former condition means trace data of the guest were recorded monotonically. On the other hand, the latter condition means trace data of the guest moved backward.
4. We convert the guest TSC of trace data to the host TSC using the previous_tsc_offset value before n and the next_tsc_offset value after n+1. => END

- Note
We assumed one vcpu and no write_tsc (or write_tsc only once) for the sake of ease. For other conditions, we will use a similar method.

Thanks,

There is no certainty.
Consider the following information available:

  guest trace        host trace
                     100: guest_tsc_write (tsc_offset=-100 => guest_tsc = 0)
                     104: guest_tsc_write (tsc_offset=-104 => guest_tsc = 0)
                     108: guest_tsc_write (tsc_offset=-108 => guest_tsc = 0)
  1: eventA
  2: eventB
  3: eventC
  1: eventD
  2: eventE
  3: eventF

How can you tell which tsc_offset to use for eventD? It could be either -104 or -108. The notion of next_tsc_offset is subject to such issue.

I suppose its fine to restrict the interface as follows: the tool will accept one trace of events with monotonic timestamps and the user is responsible for correlating that to a host trace.

OK, I'll add the restriction, which trace data must have monotonic timestamps to use this feature. I think guests seldom will execute write_tsc, so in many cases, timestamps will be monotonically recorded in trace data.

That is, you can't feed distinct instances of guest kernel trace.

I'm not clear for distinct instances. Is this about SMP or multiple guests? Would you explain about this?

Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
On 11/29/2012 05:46 PM, Gleb Natapov wrote:
On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:

Don't understand the reasoning behind why 3 is a good choice.

Here is where I came from. (Explaining from scratch for completeness, forgive me :))

In moderate overcommits, we can falsely exit from the PLE handler even when we have a preempted task of the same VM waiting on other cpus. To reduce this problem, we try a few times before exiting. The problem boils down to: what is the probability that we exit the PLE handler even when we have more than 1 task on other cpus? The theoretical worst case should be around 1.5x overcommit (as also pointed out by Andrew Theurer). [But the practical worst case may be around 2x, 3x overcommit, as indicated by the results for the patch series.]

So if p is the probability of finding rq length one on a particular cpu, and if we do n tries, then the probability of exiting the PLE handler is: p^(n+1) [because we would have come across one source with rq length 1 and n target cpu rqs with length 1].

So:

  num tries   probability of aborting ple handler (1.5x overcommit)
  1           1/4
  2           1/8
  3           1/16

We can increase this probability with more tries, but the problem is the overhead.

IIRC Avi (again) had an idea to track vcpu preemption. When a vcpu thread is preempted we do kvm->preempted_vcpus++, when it runs again we do kvm->preempted_vcpus--. The PLE handler can try harder if kvm->preempted_vcpus is big, or not try at all if it is zero.

Thanks for the reply, Gleb. Yes, it was on my next TODO as you know, and it makes sense to weigh all these approaches (undercommit patches/throttled yield/preempt notifier/pvspinlock and their combination) to a good extent before going further. I am happy if these patches are now in a 'good shape to compare' state. (Same reason I had posted the dynamic PLE approach too.)
Re: [PATCH v4 05/14] KVM: ARM: Hypervisor inititalization
On Mon, Nov 19, 2012 at 9:51 AM, Will Deacon will.dea...@arm.com wrote:

Typo in subject, use one of initiali[sz]ation instead.

On Sat, Nov 10, 2012 at 03:42:45PM +, Christoffer Dall wrote:

Sets up KVM code to handle all exceptions taken to Hyp mode. When the kernel is booted in Hyp mode, calling hvc #0xff with r0 pointing to the new vectors, the HVBAR is changed to the vector pointers. This allows subsystems (like KVM here) to execute code in Hyp mode with the MMU disabled. We initialize other Hyp-mode registers and enable the MMU for Hyp mode from the id-mapped hyp initialization code. Afterwards, the HVBAR is changed to point to KVM Hyp vectors used to catch guest faults and to switch to Hyp mode to perform a world-switch into a KVM guest. If the KVM module is unloaded we call hvc #0xff once more to disable the MMU in Hyp mode again and install a vector handler to change the HVBAR for a subsequent reload of KVM or another hypervisor.

0xff might be a bit too simple. I notice Xen uses 0xEA1, which is probably less likely to conflict with anything else. We should probably also put these numbers in the same header file so that any conflicts become immediately apparent.

actually we don't even do this anymore, the commit text is dated, so this is a non-issue.

Also provides memory mapping code to map required code pages, data structures, and I/O regions accessed in Hyp mode at the same virtual address as the host kernel virtual addresses, but which conforms to the architectural requirements for translations in Hyp mode. This interface is added in arch/arm/kvm/arm_mmu.c and comprises:
- create_hyp_mappings(from, to);
- create_hyp_io_mappings(from, to, phys_addr);
- free_hyp_pmds();

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Marc Zyngier marc.zyng...@arm.com
Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

[...]
diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index c196a22..f6e8f6f 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -21,4 +21,111 @@

 #include <asm/types.h>

+/* Hyp Configuration Register (HCR) bits */
+#define HCR_TGE		(1 << 27)
+#define HCR_TVM		(1 << 26)
+#define HCR_TTLB	(1 << 25)
+#define HCR_TPU		(1 << 24)
+#define HCR_TPC		(1 << 23)
+#define HCR_TSW		(1 << 22)
+#define HCR_TAC		(1 << 21)
+#define HCR_TIDCP	(1 << 20)
+#define HCR_TSC		(1 << 19)
+#define HCR_TID3	(1 << 18)
+#define HCR_TID2	(1 << 17)
+#define HCR_TID1	(1 << 16)
+#define HCR_TID0	(1 << 15)
+#define HCR_TWE		(1 << 14)
+#define HCR_TWI		(1 << 13)
+#define HCR_DC		(1 << 12)
+#define HCR_BSU		(3 << 10)
+#define HCR_BSU_IS	(1 << 10)
+#define HCR_FB		(1 << 9)
+#define HCR_VA		(1 << 8)
+#define HCR_VI		(1 << 7)
+#define HCR_VF		(1 << 6)
+#define HCR_AMO		(1 << 5)
+#define HCR_IMO		(1 << 4)
+#define HCR_FMO		(1 << 3)
+#define HCR_PTW		(1 << 2)
+#define HCR_SWIO	(1 << 1)
+#define HCR_VM		1
+
+/*
+ * The bits we set in HCR:
+ * TAC:		Trap ACTLR
+ * TSC:		Trap SMC
+ * TSW:		Trap cache operations by set/way
+ * TWI:		Trap WFI
+ * TIDCP:	Trap L2CTLR/L2ECTLR
+ * BSU_IS:	Upgrade barriers to the inner shareable domain
+ * FB:		Force broadcast of all maintenance operations
+ * AMO:		Override CPSR.A and enable signaling with VA
+ * IMO:		Override CPSR.I and enable signaling with VI
+ * FMO:		Override CPSR.F and enable signaling with VF
+ * SWIO:	Turn set/way invalidates into set/way clean+invalidate
+ */
+#define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
+			HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
+			HCR_SWIO | HCR_TIDCP)
+
+/* Hyp System Control Register (HSCTLR) bits */
+#define HSCTLR_TE	(1 << 30)
+#define HSCTLR_EE	(1 << 25)
+#define HSCTLR_FI	(1 << 21)
+#define HSCTLR_WXN	(1 << 19)
+#define HSCTLR_I	(1 << 12)
+#define HSCTLR_C	(1 << 2)
+#define HSCTLR_A	(1 << 1)
+#define HSCTLR_M	1
+#define HSCTLR_MASK	(HSCTLR_M | HSCTLR_A | HSCTLR_C | HSCTLR_I | \
+			 HSCTLR_WXN | HSCTLR_FI | HSCTLR_EE | HSCTLR_TE)
+
+/* TTBCR and HTCR Registers bits */
+#define TTBCR_EAE	(1 << 31)
+#define TTBCR_IMP	(1 << 30)
+#define TTBCR_SH1	(3 << 28)
+#define TTBCR_ORGN1	(3 << 26)
+#define TTBCR_IRGN1	(3 << 24)
+#define TTBCR_EPD1	(1 << 23)
+#define TTBCR_A1	(1 << 22)
+#define TTBCR_T1SZ	(3 << 16)
+#define TTBCR_SH0	(3 << 12)
+#define TTBCR_ORGN0	(3 << 10)
+#define TTBCR_IRGN0	(3 << 8)
+#define TTBCR_EPD0	(1 << 7)
+#define
Re: [PATCH v4 07/14] KVM: ARM: Inject IRQs and FIQs from userspace
On Mon, Nov 19, 2012 at 11:21 AM, Will Deacon will.dea...@arm.com wrote:
On Mon, Nov 19, 2012 at 04:09:06PM +, Christoffer Dall wrote:
On Mon, Nov 19, 2012 at 10:26 AM, Will Deacon will.dea...@arm.com wrote:
On Mon, Nov 19, 2012 at 03:04:38PM +, Christoffer Dall wrote:

In all seriousness, I will unite myself with an alcoholic beverage one of these evenings and try to see what I can do about it, maybe split it up somehow, I'll give it a shot.

So this might be to do with the way you've split up the patches (likely I'll have similar complaints against the vGIC code, but at least it will make sense there!)...

+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
+{
+	u32 irq = irq_level->irq;
+	unsigned int irq_type, vcpu_idx, irq_num;
+	int nrcpus = atomic_read(&kvm->online_vcpus);
+	struct kvm_vcpu *vcpu = NULL;
+	bool level = irq_level->level;
+
+	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
+	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
+	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;
+
+	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);
+
+	if (irq_type == KVM_ARM_IRQ_TYPE_CPU ||
+	    irq_type == KVM_ARM_IRQ_TYPE_PPI) {
+		if (vcpu_idx >= nrcpus)
+			return -EINVAL;
+
+		vcpu = kvm_get_vcpu(kvm, vcpu_idx);
+		if (!vcpu)
+			return -EINVAL;
+	}
+
+	switch (irq_type) {
+	case KVM_ARM_IRQ_TYPE_CPU:
+		if (irq_num > KVM_ARM_IRQ_CPU_FIQ)
+			return -EINVAL;
+
+		return vcpu_interrupt_line(vcpu, irq_num, level);
+	}
+
+	return -EINVAL;
+}

This obviously doesn't handle PPIs, which is where the confusion lies. You can just as easily write it as:

int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
{
	u32 irq = irq_level->irq;
	unsigned int irq_type, vcpu_idx, irq_num;
	int nrcpus = atomic_read(&kvm->online_vcpus);
	struct kvm_vcpu *vcpu = NULL;
	bool level = irq_level->level;

	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;

	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);

	if (irq_type != KVM_ARM_IRQ_TYPE_CPU)
		return -EINVAL;

	if (vcpu_idx >= nrcpus)
		return -EINVAL;

	vcpu = kvm_get_vcpu(kvm, vcpu_idx);
	if (!vcpu)
		return -EINVAL;

	if (irq_num > KVM_ARM_IRQ_CPU_FIQ)
		return -EINVAL;

	return vcpu_interrupt_line(vcpu, irq_num, level);
}

Then add all the IRQ complexity in a later patch!

that was pretty constructive, I did just that. Thanks.
Re: [PATCH v4 08/14] KVM: ARM: World-switch implementation
On Mon, Nov 19, 2012 at 9:57 AM, Will Deacon will.dea...@arm.com wrote: On Sat, Nov 10, 2012 at 03:43:06PM +, Christoffer Dall wrote: Provides complete world-switch implementation to switch to other guests running in non-secure modes. Includes Hyp exception handlers that capture necessary exception information and stores the information on the VCPU and KVM structures. [...] diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 15e2ab1..d8f8c60 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -40,6 +40,7 @@ #include asm/kvm_arm.h #include asm/kvm_asm.h #include asm/kvm_mmu.h +#include asm/kvm_emulate.h #ifdef REQUIRES_VIRT __asm__(.arch_extension virt); @@ -49,6 +50,10 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); static struct vfp_hard_struct __percpu *kvm_host_vfp_state; static unsigned long hyp_default_vectors; +/* The VMID used in the VTTBR */ +static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); +static u8 kvm_next_vmid; +static DEFINE_SPINLOCK(kvm_vmid_lock); int kvm_arch_hardware_enable(void *garbage) { @@ -264,6 +269,8 @@ int __attribute_const__ kvm_target_cpu(void) int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { + /* Force users to call KVM_ARM_VCPU_INIT */ + vcpu-arch.target = -1; return 0; } @@ -274,6 +281,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu-cpu = cpu; + vcpu-arch.vfp_host = this_cpu_ptr(kvm_host_vfp_state); } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) @@ -306,12 +314,168 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) { + return v-mode == IN_GUEST_MODE; +} + +static void reset_vm_context(void *info) +{ + kvm_call_hyp(__kvm_flush_vm_context); +} + +/** + * need_new_vmid_gen - check that the VMID is still valid + * @kvm: The VM's VMID to checkt + * + * return true if there is a new generation of VMIDs being used + * + * The hardware supports only 256 values with the value zero reserved 
for the + * host, so we check if an assigned value belongs to a previous generation, + * which which requires us to assign a new value. If we're the first to use a + * VMID for the new generation, we must flush necessary caches and TLBs on all + * CPUs. + */ +static bool need_new_vmid_gen(struct kvm *kvm) +{ + return unlikely(kvm-arch.vmid_gen != atomic64_read(kvm_vmid_gen)); +} + +/** + * update_vttbr - Update the VTTBR with a valid VMID before the guest runs + * @kvmThe guest that we are about to run + * + * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure the + * VM has a valid VMID, otherwise assigns a new one and flushes corresponding + * caches and TLBs. + */ +static void update_vttbr(struct kvm *kvm) +{ + phys_addr_t pgd_phys; + u64 vmid; + + if (!need_new_vmid_gen(kvm)) + return; + + spin_lock(kvm_vmid_lock); + + /* First user of a new VMID generation? */ + if (unlikely(kvm_next_vmid == 0)) { + atomic64_inc(kvm_vmid_gen); + kvm_next_vmid = 1; + + /* +* On SMP we know no other CPUs can use this CPU's or +* each other's VMID since the kvm_vmid_lock blocks +* them from reentry to the guest. +*/ + on_each_cpu(reset_vm_context, NULL, 1); + } + + kvm-arch.vmid_gen = atomic64_read(kvm_vmid_gen); + kvm-arch.vmid = kvm_next_vmid; + kvm_next_vmid++; + + /* update vttbr to be used with the new vmid */ + pgd_phys = virt_to_phys(kvm-arch.pgd); + vmid = ((u64)(kvm-arch.vmid) VTTBR_VMID_SHIFT) VTTBR_VMID_MASK; + kvm-arch.vttbr = pgd_phys VTTBR_BADDR_MASK; + kvm-arch.vttbr |= vmid; + + spin_unlock(kvm_vmid_lock); +} I must be missing something here: how do you ensure that a guest running on multiple CPUs continues to have the same VMID across them after a rollover? when a roll over occurs, there's no problem until someone comes along that doesn't have a valid vmid (need_new_vmid_gen will return true). In this case, to assign a vmid, we need to start a new generation of id's to assign one, and must ensure that all old vmid's are no longer used. 
So how do we ensure that? Well, we increment kvm_vmid_gen, causing all other CPUs that try to run a VM to hit the spin_lock when they exit their VMs. We reserve the vmid 1 for the new CPU and call on_each_cpu, which sends an IPI to all other physical CPUs and waits until they have actually completed reset_vm_context. At that point, once on_each_cpu(reset_vm_context) returns, all other physical CPUs have cleared their data structures of occurrences of old vmids, and kvm_vmid_gen has been incremented, so no other vcpus
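The generation-rollover scheme discussed above can be modeled in a few lines of user-space C. This is an illustrative sketch only: the real kernel code uses an atomic64, a spinlock, and an IPI (on_each_cpu) to flush stale VMIDs from the TLBs, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the generation-based VMID allocator. Names mirror
 * the patch, but the locking/flush machinery is intentionally omitted. */
static uint64_t kvm_vmid_gen = 1;  /* current VMID generation */
static uint8_t kvm_next_vmid = 1;  /* next VMID to hand out; 0 is reserved */

struct vm {
	uint64_t vmid_gen; /* generation this VM's vmid belongs to */
	uint8_t vmid;
};

static bool need_new_vmid_gen(const struct vm *vm)
{
	return vm->vmid_gen != kvm_vmid_gen;
}

static void update_vmid(struct vm *vm)
{
	if (!need_new_vmid_gen(vm))
		return;
	if (kvm_next_vmid == 0) {	/* 8-bit counter wrapped: rollover */
		kvm_vmid_gen++;		/* invalidates every existing vmid */
		kvm_next_vmid = 1;
		/* real code: on_each_cpu(reset_vm_context) flushes TLBs here */
	}
	vm->vmid_gen = kvm_vmid_gen;
	vm->vmid = kvm_next_vmid++;
}
```

After 255 allocations the 8-bit counter wraps, the generation bumps, and every VM tagged with the old generation fails need_new_vmid_gen() on its next run, forcing it to take the slow path and pick up a fresh VMID.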
Re: [PATCH v4 10/14] KVM: ARM: User space API for getting/setting co-proc registers
On Mon, Nov 19, 2012 at 10:02 AM, Will Deacon will.dea...@arm.com wrote:
On Sat, Nov 10, 2012 at 03:43:21PM +, Christoffer Dall wrote:

The following three ioctls are implemented:
- KVM_GET_REG_LIST
- KVM_GET_ONE_REG
- KVM_SET_ONE_REG

[...]

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 845ceda..9671cd2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1798,6 +1798,11 @@ is the register group type, or coprocessor number:
 ARM core registers have the following id bit patterns:
   0x4020 0000 0010 <index into the kvm_regs struct:16>

+ARM 32-bit CP15 registers have the following id bit patterns:
+  0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
+
+ARM 64-bit CP15 registers have the following id bit patterns:
+  0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>

 4.69 KVM_GET_ONE_REG
@@ -2139,6 +2144,45 @@ This ioctl returns the guest registers that are supported for the
 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.

+4.77 KVM_ARM_VCPU_INIT
+
+Capability: basic
+Architectures: arm
+Type: vcpu ioctl
+Parameters: struct kvm_vcpu_init (in)
+Returns: 0 on success; -1 on error
+Errors:
+  EINVAL: the target is unknown, or the combination of features is invalid.
+  ENOENT: a features bit specified is unknown.
+
+This tells KVM what type of CPU to present to the guest, and what
+optional features it should have. This will cause a reset of the cpu
+registers to their initial values. If this is not called, KVM_RUN will
+return ENOEXEC for that vcpu.
+
+Note that because some registers reflect machine topology, all vcpus
+should be created before this ioctl is invoked.
+
+4.78 KVM_GET_REG_LIST
+
+Capability: basic
+Architectures: arm
+Type: vcpu ioctl
+Parameters: struct kvm_reg_list (in/out)
+Returns: 0 on success; -1 on error
+Errors:
+  E2BIG: the reg index list is too big to fit in the array specified by
+  the user (the number required will be written into n).
+
+struct kvm_reg_list {
+	__u64 n; /* number of registers in reg[] */
+	__u64 reg[0];
+};
+
+This ioctl returns the guest registers that are supported for the
+KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
+

You already added this hunk earlier (and looking at the final result, you do end up with two entries for 4.77 and 4.78).

duh, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
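The id bit patterns documented in the hunk above compose a 64-bit register identifier out of the architecture, size, coprocessor, and CP15 address fields. A minimal sketch of the 32-bit CP15 encoding follows; the constant values mirror what the ARM KVM uapi headers define (KVM_REG_ARM, KVM_REG_SIZE_U32, and the crn/crm/opc1/opc2 field shifts), but treat them as illustrative rather than authoritative.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout per the documented pattern:
 *   0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
 * i.e. KVM_REG_ARM | KVM_REG_SIZE_U32 | (15 << coproc shift) | CP15 address.
 */
#define KVM_REG_ARM		0x4000000000000000ULL
#define KVM_REG_SIZE_U32	0x0020000000000000ULL
#define KVM_REG_ARM_COPROC_SHIFT	16
#define KVM_REG_ARM_32_CRN_SHIFT	11
#define KVM_REG_ARM_CRM_SHIFT		7
#define KVM_REG_ARM_OPC1_SHIFT		3
#define KVM_REG_ARM_32_OPC2_SHIFT	0

/* Build the KVM_GET_ONE_REG/KVM_SET_ONE_REG id for a 32-bit CP15 register */
static uint64_t cp15_reg32_id(uint32_t crn, uint32_t crm,
			      uint32_t opc1, uint32_t opc2)
{
	return KVM_REG_ARM | KVM_REG_SIZE_U32 |
	       (15ULL << KVM_REG_ARM_COPROC_SHIFT) |
	       ((uint64_t)crn  << KVM_REG_ARM_32_CRN_SHIFT) |
	       ((uint64_t)crm  << KVM_REG_ARM_CRM_SHIFT) |
	       ((uint64_t)opc1 << KVM_REG_ARM_OPC1_SHIFT) |
	       ((uint64_t)opc2 << KVM_REG_ARM_32_OPC2_SHIFT);
}
```

For example, SCTLR lives at c1, c0, opc1=0, opc2=0, so cp15_reg32_id(1, 0, 0, 0) yields an id of the form 0x4020 0000 000F 0800, matching the documented pattern.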
Re: [PATCH v4 11/14] KVM: ARM: Demux CCSIDR in the userspace API
On Mon, Nov 19, 2012 at 10:03 AM, Will Deacon will.dea...@arm.com wrote:
On Sat, Nov 10, 2012 at 03:43:28PM +, Christoffer Dall wrote:

The Cache Size Selection Register (CSSELR) selects the current Cache Size ID Register (CCSIDR). You write which cache you are interested in to CSSELR, and read the information out of CCSIDR.

Which cache numbers are valid is known by reading the Cache Level ID Register (CLIDR).

To export this state to userspace, we add a KVM_REG_ARM_DEMUX numberspace (17), which uses 8 bits to represent which register is being demultiplexed (0 for CCSIDR), and the lower 8 bits to represent this demultiplexing (in our case, the CSSELR value, which is 4 bits).

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Rusty Russell rusty.russ...@linaro.org

[...]

 Documentation/virtual/kvm/api.txt |   2
 arch/arm/include/uapi/asm/kvm.h   |   9
 arch/arm/kvm/coproc.c             | 163
 3 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 9671cd2..bf8f99c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1804,6 +1804,8 @@ ARM 32-bit CP15 registers have the following id bit patterns:
 ARM 64-bit CP15 registers have the following id bit patterns:
   0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>

+ARM CCSIDR registers are demultiplexed by CSSELR value:
+  0x4020 0000 0011 00 <csselr:8>

 4.69 KVM_GET_ONE_REG

diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index a807d9a..78a629c 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -80,6 +80,15 @@ struct kvm_arch_memory_slot {
 #define KVM_REG_ARM_CORE		(0x0010 << KVM_REG_ARM_COPROC_SHIFT)
 #define KVM_REG_ARM_CORE_REG(name)	(offsetof(struct kvm_regs, name) / 4)

+/* Some registers need more space to represent values.
 */
+#define KVM_REG_ARM_DEMUX		(0x0011 << KVM_REG_ARM_COPROC_SHIFT)
+#define KVM_REG_ARM_DEMUX_ID_MASK	0x0000FF00
+#define KVM_REG_ARM_DEMUX_ID_SHIFT	8
+#define KVM_REG_ARM_DEMUX_ID_CCSIDR	(0x00 << KVM_REG_ARM_DEMUX_ID_SHIFT)
+#define KVM_REG_ARM_DEMUX_VAL_MASK	0x000000FF
+#define KVM_REG_ARM_DEMUX_VAL_SHIFT	0
+
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT		24
 #define KVM_ARM_IRQ_TYPE_MASK		0xff

diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 95a0f5e..9ce5861 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -35,6 +35,12 @@
  * Co-processor emulation
  */

+/* 3 bits per cache level, as per CLIDR, but non-existent caches always 0 */
+static u32 cache_levels;
+
+/* CSSELR values; used to index KVM_REG_ARM_DEMUX_ID_CCSIDR */
+#define CSSELR_MAX 12
+
 int kvm_handle_cp10_id(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	kvm_inject_undefined(vcpu);
@@ -548,11 +554,112 @@ static int set_invariant_cp15(u64 id, void __user *uaddr)
 	return 0;
 }

+static bool is_valid_cache(u32 val)
+{
+	u32 level, ctype;
+
+	if (val >= CSSELR_MAX)
+		return false;
+
+	/* Bottom bit is Instruction or Data bit. Next 3 bits are level. */
+	level = (val >> 1);
+	ctype = (cache_levels >> (level * 3)) & 7;
+
+	switch (ctype) {
+	case 0: /* No cache */
+		return false;
+	case 1: /* Instruction cache only */
+		return (val & 1);
+	case 2: /* Data cache only */
+	case 4: /* Unified cache */
+		return !(val & 1);
+	case 3: /* Separate instruction and data caches */
+		return true;
+	default: /* Reserved: we can't know instruction or data. */
+		return false;
+	}
+}
+
+/* Which cache CCSIDR represents depends on CSSELR value. */
+static u32 get_ccsidr(u32 csselr)
+{
+	u32 ccsidr;
+
+	/* Make sure no one else changes CSSELR during this! */
+	local_irq_disable();
+	/* Put value into CSSELR */
+	asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r" (csselr));

You need an isb() here, otherwise you might still address the wrong cache (the two accesses don't hazard in the CPU).
that was just totally obvious ;)
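The is_valid_cache() logic quoted above is easy to exercise outside the kernel. The sketch below reproduces it in user space against an assumed Cortex-A15-like CLIDR packing (separate I and D caches at L1, a unified L2, nothing above); the cache_levels value is a hypothetical example, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CSSELR_MAX 12

/* 3 bits of CLIDR cache-type per level. 0x23 = ctype 3 (separate I+D) at
 * L1, ctype 4 (unified) at L2, no caches beyond that. Assumed value for
 * illustration only. */
static const uint32_t cache_levels = 0x23;

/* Same decision table as the patch: CSSELR bit 0 selects instruction (1)
 * or data/unified (0); the bits above it select the cache level. */
static bool is_valid_cache(uint32_t val)
{
	if (val >= CSSELR_MAX)
		return false;

	uint32_t level = val >> 1;
	uint32_t ctype = (cache_levels >> (level * 3)) & 7;

	switch (ctype) {
	case 0:  return false;		/* no cache at this level */
	case 1:  return (val & 1);	/* instruction cache only */
	case 2:				/* data cache only */
	case 4:  return !(val & 1);	/* unified: data view only */
	case 3:  return true;		/* separate I and D caches */
	default: return false;		/* reserved encodings */
	}
}
```

With this CLIDR, CSSELR values 0, 1, and 2 (L1 D, L1 I, L2 unified) are valid, while 3 (an instruction view of the unified L2) and anything at L3 or above are rejected; those are exactly the CSSELR values the demux interface would enumerate.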