Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion, this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between the last version and this version, so we could focus on that. Yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset, which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how ppc handles this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. 
Yes, but we still get questions from users asking "what is steal time? why am I seeing this?" Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC, qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated, "The guest shouldn't need to know its entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was; the host should take care of it and just report the already-adjusted data to the guest. So in this version of the code the host uses a set period for a timer and is passed essentially a number of ticks of expected steal time. The host then uses the timer to break out the steal time into consigned and steal buckets, which are reported to the guest. Both the consigned and the steal time are reported via /proc/stat, so anyone needing to see total time away can add the two fields together. The user, however, when using tools like top or vmstat, would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism you are using to enforce the entitlement? Is it the cgroup cpu controller, or something else? 
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm:queue 23/25] arch/s390/kvm/kvm-s390.c:358:6: error: conflicting types for 'kvm_arch_vcpu_postcreate'
On Tue, Nov 27, 2012 at 11:51:19PM -0200, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 10:56:50AM +0800, Fengguang Wu wrote: Hi Marcelo, FYI, kernel build failed on tree: git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue head: fc1ddea318fa2c1ac3d496d8653ca4bc9b66e679 commit: 438d76a60e7be59a558f8712a47565fa8258d17d [23/25] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization config: make ARCH=s390 allmodconfig Fengguang Wu, Thanks. Repository has been updated, it would be good if you can rerun the tests. Marcelo, the build log shows that the new kvm/queue head d98d07ca7e0347d712d54a865af323c4aee04bc2 builds fine. :) Thanks, Fengguang
Re: [PATCHv4 0/2] kvm: direct msix injection
On Tue, Nov 27, 2012 at 09:34:57PM -0700, Alex Williamson wrote: On Wed, 2012-11-21 at 21:26 +0200, Michael S. Tsirkin wrote: On Wed, Oct 17, 2012 at 06:05:53PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSIX, from atomic context. Here's an untested patch to do this (compiled only). Changes from v2: Don't inject broadcast interrupts directly Changes from v1: Tried to address comments from v1, except unifying with kvm_set_irq: passing flags to it looks too ugly. Added a comment. Jan, you said you can test this? I have tested this with some networking workloads and this patchset seems to work fine. My setup isn't a good fit for benchmarking device assignment though. Alex, could you please verify that this solves the latency issue that you sometimes observe? With this patchset device assignment latency should be as fast as vfio. Yep, that seems to cover the gap. My environment is too noisy to declare an absolute winner, but pci-assign and vfio-pci are now very, very similar under netperf TCP_RR with these patches. Thanks, Alex Thanks very much for the report. Gleb, Marcelo, ACK? Also please add Tested-by: Alex Williamson alex.william...@redhat.com -- MST
Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
On Tue, Nov 27, 2012 at 10:29:08PM -0200, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area, use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change VMCS ID? 3. Can't it be kept somewhere else other than a list? Current scheme allows guest to allocate unlimited amounts of host memory. 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine. According to Dongxiao they are slowing down nested guest by 4%. -- Gleb.
Re: [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC
On Wed, Nov 21, 2012 at 04:09:34PM +0800, Yang Zhang wrote: When PIT connects to IOAPIC, it routes to pin 2, not pin 0.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 virt/kvm/ioapic.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index cfb7e4d..166c450 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -181,7 +181,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 #ifdef CONFIG_X86
 /* Always delivery PIT interrupt to vcpu 0 */
-	if (irq == 0) {
+	if (irq == 2) {

Hmm, this means all this time the code didn't work correctly, which makes me wonder if we need this hack at all.

 irqe.dest_mode = 0; /* Physical mode. */
 /* need to read apic_id from apic regiest since
  * it can be rewritten */
--
1.7.1

--
Gleb.
Re: Performance issue
On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote: On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote: I have some code which does both reference time and invariant TSC, but it will not work after migration. I will send it later today. Do you mean migrating guests? This is not an issue for us. OK, but don't say I didn't warn you :) There are two patches, one for kvm and another one for qemu. You will probably need to rebase them. Add the hv_tsc cpu parameter to activate this feature. You will probably need to deactivate hpet by adding the -no-hpet parameter as well. Best regards, Vadim. Also, it would be much appreciated! -- George-Cristian Bîrzan

diff --git a/arch/x86/include/asm/hyperv.h b/arch/x86/include/asm/hyperv.h
index b80420b..9c5ffef 100644
--- a/arch/x86/include/asm/hyperv.h
+++ b/arch/x86/include/asm/hyperv.h
@@ -136,6 +136,9 @@
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT		0x40000020

+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /* Define the virtual APIC registers */
 #define HV_X64_MSR_EOI				0x40000070
 #define HV_X64_MSR_ICR				0x40000071
@@ -179,6 +182,10 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
		(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))

+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
+
+
 #define HV_PROCESSOR_POWER_STATE_C0		0
 #define HV_PROCESSOR_POWER_STATE_C1		1
 #define HV_PROCESSOR_POWER_STATE_C2		2
@@ -191,4 +198,11 @@
 #define HV_STATUS_INVALID_ALIGNMENT		4
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19

+typedef struct _HV_REFERENCE_TSC_PAGE {
+	uint32_t TscSequence;
+	uint32_t Reserved1;
+	uint64_t TscScale;
+	int64_t  TscOffset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
 #endif

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..63ee09e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -565,6 +565,8 @@ struct kvm_arch {
	/* fields used by HYPER-V emulation */
	u64 hv_guest_os_id;
	u64 hv_hypercall;
+	u64 hv_ref_count;
+	u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
	int audit_point;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f76417..4538295 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -813,7 +813,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
 static u32 msrs_to_save[] = {
	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
-	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_REFERENCE_TSC,
	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
	MSR_KVM_PV_EOI_EN,
	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1428,6 +1428,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
	switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_TIME_REF_COUNT:
+	case HV_X64_MSR_REFERENCE_TSC:
		r = true;
		break;
	}
@@ -1438,6 +1440,7 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
 static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
	struct kvm *kvm = vcpu->kvm;
+	unsigned long addr;

	switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
@@ -1467,6 +1470,27 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		if (__copy_to_user((void __user *)addr, instructions, 4))
			return 1;
		kvm->arch.hv_hypercall = data;
+		kvm->arch.hv_ref_count = get_kernel_ns();
		break;
	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		tsc_ref.TscSequence =
+			boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;
+		tsc_ref.TscScale =
+			((1LL << 32) / vcpu->arch.virtual_tsc_khz) << 32;
+		tsc_ref.TscOffset = 0;
+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
+			kvm->arch.hv_tsc_page = data;
+			break;
+		}
+		addr = gfn_to_hva(vcpu->kvm, data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
+		if (kvm_is_error_hva(addr))
+			return 1;
+		if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))
+			return 1;
+		kvm->arch.hv_tsc_page = data;
		break;
	}
	default:
@@ -1881,6 +1905,13 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
	case HV_X64_MSR_HYPERCALL:
		data = kvm->arch.hv_hypercall;
		break;
+	case HV_X64_MSR_TIME_REF_COUNT:
+		data = get_kernel_ns() - kvm->arch.hv_ref_count;
+		do_div(data, 100);
+		break;
+	case HV_X64_MSR_REFERENCE_TSC:
+		data = kvm->arch.hv_tsc_page;
+		break;
	default:
		vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
		return 1;

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index f3708e6..ad77b72 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1250,6 +1250,8 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
	hyperv_enable_relaxed_timing(true);
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for MSI. This reduces the pressure on the scheduler in the case where host and guest irqs share a host cpu.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 virt/kvm/assigned-dev.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 23a41a9..3642239 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
 }

 #ifdef __KVM_HAVE_MSI
+static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+				       assigned_dev->irq_source_id,
+				       assigned_dev->guest_irq, 1);

Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch?

+	return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED;
+}
+
 static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
 {
	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
 #endif

 #ifdef __KVM_HAVE_MSIX
+static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int index = find_index_from_host_irq(assigned_dev, irq);
+	u32 vector;
+	int ret = 0;
+
+	if (index >= 0) {
+		vector = assigned_dev->guest_msix_entries[index].vector;
+		ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+					   assigned_dev->irq_source_id,
+					   vector, 1);
+	}
+
+	return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED;
+}
+
 static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
 {
	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -334,11 +360,6 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 }

 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msi(struct kvm *kvm,
					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -363,11 +384,6 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 #endif

 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msix(struct kvm *kvm,
					     struct kvm_assigned_dev_kernel *dev)
 {
--
MST

--
Gleb.
Re: [PATCH] KVM: MMU: lazily drop large spte
On Wed, Nov 28, 2012 at 01:27:38PM +0800, Xiao Guangrong wrote: On 11/18/2012 11:00 AM, Marcelo Tosatti wrote: map gfn 4? See corrected step 7 above. Ah, this is a real bug, and unfortunately, it exists in current code. I will make a separate patchset to fix it. Thank you, Marcelo! Is it? Hum.. Anyway, it would be great if you can write a testcase (should be similar in size to rmap_chain). Marcelo, is this patch acceptable? Yes, can we get the reexecute_instruction fix first? (which should then be able to handle the case where a large read-only spte is created). I'll merge the testcase later today.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for MSI. This reduces the pressure on the scheduler in the case where host and guest irqs share a host cpu.

+static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+				       assigned_dev->irq_source_id,
+				       assigned_dev->guest_irq, 1);

Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch?

kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Further, the guest irq might not be an MSI: a host MSI can cause guest intx injection, I think; we need to bounce it to the thread as we did earlier.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level.

Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too?

Further, the guest irq might not be an MSI: a host MSI can cause guest intx injection, I think; we need to bounce it to the thread as we did earlier.

Ah, so msi in kvm_assigned_dev_msi() is about the host msi? Can the host be intx but the guest msi? You seem not to handle this case. Also, injection of intx via the ioapic is the same as injecting an MSI. The format and the capability of the irq message are essentially the same.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too?

I think long term we should convert irqfd to this too.

Ah, so msi in kvm_assigned_dev_msi() is about the host msi?

Yes.

Can the host be intx but the guest msi?

No.

You seem not to handle this case. Also, injection of intx via the ioapic is the same as injecting an MSI. The format and the capability of the irq message are essentially the same.

Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in the case where one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch, I think.
Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
On 11/28/2012 02:29 AM, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area, use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change VMCS ID? 3. Can't it be kept somewhere else other than a list? Current scheme allows guest to allocate unlimited amounts of host memory. I agree with Marcelo: you have to limit the number of VMCSs in the list, otherwise it will be easy to attack a host with nested guests :) 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too.

VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Also, why long term? What are the complications?

Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in the case where one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch, I think.

I prefer to figure that out before introducing the interface. Hmm, we can get rid of the vcpu loop in the pic (this should be easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sort the irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Re: [PATCH v4 02/13] ARM: KVM: Keep track of currently running vcpus
Just a bunch of typos in this one :) On Sat, Nov 10, 2012 at 03:44:30PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com When an interrupt occurs for the guest, it is sometimes necessary to find out which vcpu was running at that point. Keep track of which vcpu is being tun in kvm_arch_vcpu_ioctl_run(), run and allow the data to be retrived using either: retrieved - kvm_arm_get_running_vcpu(): returns the vcpu running at this point on the current CPU. Can only be used in a non-preemptable context. preemptible - kvm_arm_get_running_vcpus(): returns the per-CPU variable holding the the running vcpus, useable for per-CPU interrupts. -the usable Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h | 10 ++ arch/arm/kvm/arm.c | 30 ++ 2 files changed, 40 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index e7fc249..e66cd56 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -154,4 +154,14 @@ static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) { return 0; } + +struct kvm_vcpu *kvm_arm_get_running_vcpu(void); +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void); DECLARE_PER_CPU? 
+int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); +unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu); +struct kvm_one_reg; +int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); +int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 2cdc07b..60b119a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -53,11 +53,38 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); static struct vfp_hard_struct __percpu *kvm_host_vfp_state; static unsigned long hyp_default_vectors; +/* Per-CPU variable containing the currently running vcpu. */ +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu); + /* The VMID used in the VTTBR */ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) +{ + BUG_ON(preemptible()); + __get_cpu_var(kvm_arm_running_vcpu) = vcpu; +} + +/** + * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU. + * Must be called from non-preemptible context + */ +struct kvm_vcpu *kvm_arm_get_running_vcpu(void) +{ + BUG_ON(preemptible()); + return __get_cpu_var(kvm_arm_running_vcpu); +} + +/** + * kvm_arm_get_running_vcpus - get the per-CPU array on currently running vcpus. + */ s/on/of/ ? Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the basic framework code for VGIC support. Nothing to enable yet. Again, not sure how useful this patch is. Might as well merge it with code that actually does something. Couple of comments inline anyway... Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h |7 arch/arm/include/asm/kvm_vgic.h | 70 +++ arch/arm/kvm/arm.c | 21 +++- arch/arm/kvm/interrupts.S |4 ++ arch/arm/kvm/mmio.c |3 ++ virt/kvm/kvm_main.c |5 ++- 6 files changed, 107 insertions(+), 3 deletions(-) create mode 100644 arch/arm/include/asm/kvm_vgic.h [...] diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 60b119a..426828a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -183,6 +183,9 @@ int kvm_dev_ioctl_check_extension(long ext) { int r; switch (ext) { +#ifdef CONFIG_KVM_ARM_VGIC + case KVM_CAP_IRQCHIP: +#endif case KVM_CAP_USER_MEMORY: case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: case KVM_CAP_ONE_REG: @@ -304,6 +307,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { /* Force users to call KVM_ARM_VCPU_INIT */ vcpu-arch.target = -1; + + /* Set up VGIC */ + kvm_vgic_vcpu_init(vcpu); + return 0; } @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { - return !!v-arch.irq_lines; + return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? 
int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); + kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); + kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... @@ -683,6 +693,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) * Back from guest */ + kvm_vgic_sync_from_cpu(vcpu); Likewise. ret = handle_exit(vcpu, run, ret); } @@ -965,6 +977,13 @@ static int init_hyp_mode(void) } } + /* + * Init HYP view of VGIC + */ + err = kvm_vgic_hyp_init(); + if (err) + goto out_free_mappings; + return 0; out_free_vfp: free_percpu(kvm_host_vfp_state); [...] diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2fb7319..665af96 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,12 +1880,13 @@ static long kvm_vcpu_ioctl(struct file *filp, if (vcpu-kvm-mm != current-mm) return -EIO; -#if defined(CONFIG_S390) || defined(CONFIG_PPC) +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_ARM) /* * Special cases: vcpu ioctls that are asynchronous to vcpu execution, * so vcpu_load() would break it. */ - if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT) + if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT || + ioctl == KVM_IRQ_LINE) return kvm_arch_vcpu_ioctl(filp, ioctl, arg); #endif Separate patch? 
Will
[PATCH 1/2] KVM: VMX: fix invalid cpu passed to smp_call_function_single
In loaded_vmcs_clear, loaded_vmcs->cpu is the first parameter passed to smp_call_function_single. If the target cpu is going down (cpu hot remove), loaded_vmcs->cpu can become -1, and then -1 is passeded to smp_call_function_single.

It can be triggered when a vcpu is being destroyed, since loaded_vmcs_clear is called in a preemptible context.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6599e45..29e8f42 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1007,9 +1007,11 @@ static void __loaded_vmcs_clear(void *arg)
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 {
-	if (loaded_vmcs->cpu != -1)
-		smp_call_function_single(
-			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
+	int cpu = loaded_vmcs->cpu;
+
+	if (cpu != -1)
+		smp_call_function_single(cpu,
+				__loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
-- 
1.7.7.6
[PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
vmcs->cpu indicates whether the vmcs is loaded on the target cpu; -1 means the vmcs does not exist on any cpu.

If a vcpu loads a vmcs with vmcs.cpu == -1, it can be directly added to that cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list links before reading vmcs->cpu. Correspondingly, we should remove the vmcs from the list before making vmcs->cpu == -1 visible.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c | 17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 29e8f42..6056d88 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg)
 	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
 	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
+
+	/*
+	 * We should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link
+	 * happens before setting loaded_vmcs->cpu to -1, which is done in
+	 * loaded_vmcs_init. Otherwise, another cpu can see cpu == -1 first
+	 * and then add the vmcs to its percpu list before it is deleted.
+	 */
+	smp_wmb();
+
 	loaded_vmcs_init(loaded_vmcs);
 }
 
@@ -1537,6 +1546,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
+
+		/*
+		 * The read of loaded_vmcs->cpu should be before fetching
+		 * loaded_vmcs->loaded_vmcss_on_cpu_link.
+		 * See the comments in __loaded_vmcs_clear().
+		 */
+		smp_rmb();
+
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 			 &per_cpu(loaded_vmcss_on_cpu, cpu));
 		local_irq_enable();
-- 
1.7.7.6
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On 28/11/12 12:49, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the basic framework code for VGIC support. Nothing to enable yet. Again, not sure how useful this patch is. Might as well merge it with code that actually does something. Couple of comments inline anyway... Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h |7 arch/arm/include/asm/kvm_vgic.h | 70 +++ arch/arm/kvm/arm.c | 21 +++- arch/arm/kvm/interrupts.S |4 ++ arch/arm/kvm/mmio.c |3 ++ virt/kvm/kvm_main.c |5 ++- 6 files changed, 107 insertions(+), 3 deletions(-) create mode 100644 arch/arm/include/asm/kvm_vgic.h [...] diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 60b119a..426828a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -183,6 +183,9 @@ int kvm_dev_ioctl_check_extension(long ext) { int r; switch (ext) { +#ifdef CONFIG_KVM_ARM_VGIC +case KVM_CAP_IRQCHIP: +#endif case KVM_CAP_USER_MEMORY: case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: case KVM_CAP_ONE_REG: @@ -304,6 +307,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { /* Force users to call KVM_ARM_VCPU_INIT */ vcpu-arch.target = -1; + +/* Set up VGIC */ +kvm_vgic_vcpu_init(vcpu); + return 0; } @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { -return !!v-arch.irq_lines; +return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? They serve very different purposes: - irq_lines directly controls the IRQ and FIQ lines (it is or-ed into the HCR register before entering the guest) - irq_pending_on_cpu deals with the CPU interface, and only that. 
Plus, it is a kernel only thing. What triggers the interrupt on the guest is the presence of list registers with a pending state. You signal interrupts one way or the other. int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); +kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); +kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Which one maps to which? Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... Updating *only* irq_pending_on_cpu doesn't require the lock (set_bit() should be safe, and I think the smp_mb() is a leftover of some debugging hack). kvm_vgic_to_cpu() does a lot more (it picks interrupt from the distributor, hence requires the lock to be taken). @@ -683,6 +693,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) * Back from guest */ +kvm_vgic_sync_from_cpu(vcpu); Likewise. ret = handle_exit(vcpu, run, ret); } @@ -965,6 +977,13 @@ static int init_hyp_mode(void) } } +/* + * Init HYP view of VGIC + */ +err = kvm_vgic_hyp_init(); +if (err) +goto out_free_mappings; + return 0; out_free_vfp: free_percpu(kvm_host_vfp_state); [...] 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2fb7319..665af96 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,12 +1880,13 @@ static long kvm_vcpu_ioctl(struct file *filp, if (vcpu-kvm-mm != current-mm) return -EIO; -#if defined(CONFIG_S390) || defined(CONFIG_PPC) +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_ARM) /* * Special cases: vcpu ioctls that are asynchronous to vcpu execution, * so vcpu_load() would break it. */ -if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT) +if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT || +ioctl == KVM_IRQ_LINE)
Re: [PATCH v4 04/13] ARM: KVM: Initial VGIC MMIO support code
On Sat, Nov 10, 2012 at 03:44:44PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the initial in-kernel MMIO support code for the VGIC, used for the distributor emulation. [...] diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c new file mode 100644 index 000..26ada3b --- /dev/null +++ b/arch/arm/kvm/vgic.c @@ -0,0 +1,138 @@ +/* + * Copyright (C) 2012 ARM Ltd. + * Author: Marc Zyngier marc.zyng...@arm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/interrupt.h +#include linux/io.h +#include asm/kvm_emulate.h + +#define ACCESS_READ_VALUE(1 0) +#define ACCESS_READ_RAZ (0 0) +#define ACCESS_READ_MASK(x) ((x) (1 0)) +#define ACCESS_WRITE_IGNORED (0 1) +#define ACCESS_WRITE_SETBIT (1 1) +#define ACCESS_WRITE_CLEARBIT(2 1) +#define ACCESS_WRITE_VALUE (3 1) +#define ACCESS_WRITE_MASK(x) ((x) (3 1)) + +/** + * vgic_reg_access - access vgic register + * @mmio: pointer to the data describing the mmio access + * @reg:pointer to the virtual backing of the vgic distributor struct + * @offset: least significant 2 bits used for word offset + * @mode: ACCESS_ mode (see defines above) + * + * Helper to make vgic register access easier using one of the access + * modes defined for vgic register access + * (read,raz,write-ignored,setbit,clearbit,write) + */ +static void 
vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, + u32 offset, int mode) +{ + int word_offset = offset 3; You can get rid of this variable. + int shift = word_offset * 8; shift = (offset 3) 3; + u32 mask; + u32 regval; + + /* + * Any alignment fault should have been delivered to the guest + * directly (ARM ARM B3.12.7 Prioritization of aborts). + */ + + mask = (~0U) (word_offset * 8); then use shift here. + if (reg) + regval = *reg; + else { Use braces for the if clause. + BUG_ON(mode != (ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED)); + regval = 0; + } + + if (mmio-is_write) { + u32 data = (*((u32 *)mmio-data) mask) shift; + switch (ACCESS_WRITE_MASK(mode)) { + case ACCESS_WRITE_IGNORED: + return; + + case ACCESS_WRITE_SETBIT: + regval |= data; + break; + + case ACCESS_WRITE_CLEARBIT: + regval = ~data; + break; + + case ACCESS_WRITE_VALUE: + regval = (regval ~(mask shift)) | data; + break; + } + *reg = regval; + } else { + switch (ACCESS_READ_MASK(mode)) { + case ACCESS_READ_RAZ: + regval = 0; + /* fall through */ + + case ACCESS_READ_VALUE: + *((u32 *)mmio-data) = (regval shift) mask; + } + } +} It might be a good idea to have some port accessors for mmio-data otherwise you'll likely get endianness issues creeping in. +/* All this should be handled by kvm_bus_io_*()... FIXME!!! */ I don't follow this comment :) Can you either make it clearer (and less alarming!) or just drop it please? +struct mmio_range { + unsigned long base; + unsigned long len; + bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, + u32 offset); +}; Why not make offset a phys_addr_t? +static const struct mmio_range vgic_ranges[] = { + {} +}; + +static const +struct mmio_range *find_matching_range(const struct mmio_range *ranges, +struct kvm_exit_mmio *mmio, +unsigned long base) +{ + const struct mmio_range *r = ranges; + unsigned long addr = mmio-phys_addr - base; Same here, I don't think we want to truncate everything to unsigned long. 
+	while (r->len) {
+		if (addr >= r->base &&
+		    (addr + mmio->len) <= (r->base + r->len))
+			return r;
+		r++;
+	}
Hmm, does this work correctly for adjacent mmio devices where addr sits on
Re: [PATCH v4 05/13] ARM: KVM: VGIC accept vcpu and dist base addresses from user space
On Sat, Nov 10, 2012 at 03:44:51PM +, Christoffer Dall wrote: User space defines the model to emulate to a guest and should therefore decide which addresses are used for both the virtual CPU interface directly mapped in the guest physical address space and for the emulated distributor interface, which is mapped in software by the in-kernel VGIC support. Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_mmu.h |2 + arch/arm/include/asm/kvm_vgic.h |9 ++ arch/arm/kvm/arm.c | 16 ++ arch/arm/kvm/vgic.c | 61 +++ 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 9bd0508..0800531 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -26,6 +26,8 @@ * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA * for now, but remember that the level-1 table must be aligned to its size. */ +#define KVM_PHYS_SHIFT (38) Seems a bit small... 
+#define KVM_PHYS_MASK((1ULL KVM_PHYS_SHIFT) - 1) #define PTRS_PER_PGD2512 #define PGD2_ORDER get_order(PTRS_PER_PGD2 * sizeof(pgd_t)) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index b444ecf..9ca8d21 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -20,6 +20,9 @@ #define __ASM_ARM_KVM_VGIC_H struct vgic_dist { + /* Distributor and vcpu interface mapping in the guest */ + phys_addr_t vgic_dist_base; + phys_addr_t vgic_cpu_base; }; struct vgic_cpu { @@ -31,6 +34,7 @@ struct kvm_run; struct kvm_exit_mmio; #ifdef CONFIG_KVM_ARM_VGIC +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr); bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exit_mmio *mmio); @@ -40,6 +44,11 @@ static inline int kvm_vgic_hyp_init(void) return 0; } +static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ + return 0; +} + static inline int kvm_vgic_init(struct kvm *kvm) { return 0; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 426828a..3ac1aab 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static bool vgic_present; + static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) { BUG_ON(preemptible()); @@ -825,7 +827,19 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) static int kvm_vm_ioctl_set_device_address(struct kvm *kvm, struct kvm_device_address *dev_addr) { - return -ENODEV; + unsigned long dev_id, type; + + dev_id = (dev_addr-id KVM_DEVICE_ID_MASK) KVM_DEVICE_ID_SHIFT; + type = (dev_addr-id KVM_DEVICE_TYPE_MASK) KVM_DEVICE_TYPE_SHIFT; + + switch (dev_id) { + case KVM_ARM_DEVICE_VGIC_V2: + if (!vgic_present) + return -ENXIO; + return kvm_vgic_set_addr(kvm, type, dev_addr-addr); + default: + return -ENODEV; + } } long kvm_arch_vm_ioctl(struct file *filp, diff 
--git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c index 26ada3b..f85b275 100644 --- a/arch/arm/kvm/vgic.c +++ b/arch/arm/kvm/vgic.c @@ -22,6 +22,13 @@ #include linux/io.h #include asm/kvm_emulate.h +#define VGIC_ADDR_UNDEF (-1) +#define IS_VGIC_ADDR_UNDEF(_x) ((_x) == (typeof(_x))VGIC_ADDR_UNDEF) + +#define VGIC_DIST_SIZE 0x1000 +#define VGIC_CPU_SIZE0x2000 These defines might be useful to userspace so that they don't request the distributor and the cpu interface to be place too close together (been there, done that :). + + #define ACCESS_READ_VALUE(1 0) #define ACCESS_READ_RAZ (0 0) #define ACCESS_READ_MASK(x) ((x) (1 0)) @@ -136,3 +143,57 @@ bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exi { return KVM_EXIT_MMIO; } + +static bool vgic_ioaddr_overlap(struct kvm *kvm) +{ + phys_addr_t dist = kvm-arch.vgic.vgic_dist_base; + phys_addr_t cpu = kvm-arch.vgic.vgic_cpu_base; + + if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu)) + return false; + if ((dist = cpu dist + VGIC_DIST_SIZE cpu) || + (cpu = dist cpu + VGIC_CPU_SIZE dist)) + return true; + return false; Just return the predicate that you're testing. +} + +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ + int r = 0; + struct vgic_dist *vgic = kvm-arch.vgic; + + if (addr ~KVM_PHYS_MASK)
Re: [PATCH v4 02/13] ARM: KVM: Keep track of currently running vcpus
On 28/11/12 12:47, Will Deacon wrote: Just a bunch of typos in this one :) Typos? me? ;-) On Sat, Nov 10, 2012 at 03:44:30PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com When an interrupt occurs for the guest, it is sometimes necessary to find out which vcpu was running at that point. Keep track of which vcpu is being tun in kvm_arch_vcpu_ioctl_run(), run and allow the data to be retrived using either: retrieved - kvm_arm_get_running_vcpu(): returns the vcpu running at this point on the current CPU. Can only be used in a non-preemptable context. preemptible - kvm_arm_get_running_vcpus(): returns the per-CPU variable holding the the running vcpus, useable for per-CPU interrupts. -the usable Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h | 10 ++ arch/arm/kvm/arm.c | 30 ++ 2 files changed, 40 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index e7fc249..e66cd56 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -154,4 +154,14 @@ static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) { return 0; } + +struct kvm_vcpu *kvm_arm_get_running_vcpu(void); +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void); DECLARE_PER_CPU? Ah, nice one. I didn't even now it existed! 
+int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); +unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu); +struct kvm_one_reg; +int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); +int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 2cdc07b..60b119a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -53,11 +53,38 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); static struct vfp_hard_struct __percpu *kvm_host_vfp_state; static unsigned long hyp_default_vectors; +/* Per-CPU variable containing the currently running vcpu. */ +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu); + /* The VMID used in the VTTBR */ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) +{ +BUG_ON(preemptible()); +__get_cpu_var(kvm_arm_running_vcpu) = vcpu; +} + +/** + * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU. + * Must be called from non-preemptible context + */ +struct kvm_vcpu *kvm_arm_get_running_vcpu(void) +{ +BUG_ON(preemptible()); +return __get_cpu_var(kvm_arm_running_vcpu); +} + +/** + * kvm_arm_get_running_vcpus - get the per-CPU array on currently running vcpus. + */ s/on/of/ ? Indeed. Thanks, M. -- Jazz is not dead. It just smells funny...
[PATCH] KVM: use is_idle_task() instead of idle_cpu() to decide when to halt in async_pf
As Frederic pointed out, idle_cpu() may return false even if the async fault happened in the idle task, namely when a wake up is pending. In this case the code will try to put the idle task to sleep. Fix this by using is_idle_task() to check for the idle task.

Reported-by: Frederic Weisbecker fweis...@gmail.com
Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a91c6b4..08b973f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -120,11 +120,6 @@ void kvm_async_pf_task_wait(u32 token)
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
 	struct kvm_task_sleep_node n, *e;
 	DEFINE_WAIT(wait);
-	int cpu, idle;
-
-	cpu = get_cpu();
-	idle = idle_cpu(cpu);
-	put_cpu();
 
 	spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
@@ -138,7 +133,7 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = idle || preempt_count() > 1;
+	n.halted = is_idle_task(current) || preempt_count() > 1;
 	init_waitqueue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	spin_unlock(&b->lock);
--
	Gleb.
Re: [PATCH v4 06/13] ARM: KVM: VGIC distributor handling
On Sat, Nov 10, 2012 at 03:44:58PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Add the GIC distributor emulation code. A number of the GIC features are simply ignored as they are not required to boot a Linux guest. Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_vgic.h | 167 ++ arch/arm/kvm/vgic.c | 471 +++ 2 files changed, 637 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index 9ca8d21..9e60b1d 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -19,10 +19,177 @@ #ifndef __ASM_ARM_KVM_VGIC_H #define __ASM_ARM_KVM_VGIC_H +#include linux/kernel.h +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/irqreturn.h +#include linux/spinlock.h +#include linux/types.h + +#define VGIC_NR_IRQS 128 #define VGIC_NR_PRIVATE_IRQS32? +#define VGIC_NR_SHARED_IRQS(VGIC_NR_IRQS - 32) then subtract it here +#define VGIC_MAX_CPUS NR_CPUS We already have KVM_MAX_VCPUS, why do we need another? + +/* Sanity checks... */ +#if (VGIC_MAX_CPUS 8) +#error Invalid number of CPU interfaces +#endif + +#if (VGIC_NR_IRQS 31) +#error VGIC_NR_IRQS must be a multiple of 32 +#endif + +#if (VGIC_NR_IRQS 1024) +#error VGIC_NR_IRQS must be = 1024 +#endif Maybe put each check directly below the #define being tested, to make it super-obvious to people thinking of changing the constants? +/* + * The GIC distributor registers describing interrupts have two parts: + * - 32 per-CPU interrupts (SGI + PPI) + * - a bunch of shared interrups (SPI) interrupts + */ +struct vgic_bitmap { + union { + u32 reg[1]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 32]; + unsigned long reg_ul[0]; + } shared; +}; Whoa, this is nasty! Firstly, let's replace the `32' with sizeof(u32) for fun. 
Secondly, can we make the reg_ul arrays sized using the BITS_TO_LONGS macro? + +static inline u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, + int cpuid, u32 offset) +{ + offset = 2; + BUG_ON(offset (VGIC_NR_IRQS / 32)); Hmm, where is offset sanity-checked before here? Do you just rely on all trapped accesses being valid? + if (!offset) + return x-percpu[cpuid].reg; + else + return x-shared.reg + offset - 1; An alternative to this would be to have a single array, with the per-cpu interrupts all laid out at the start and a macro to convert an offset to an index. Might make the code more readable and the struct definition more concise. +} + +static inline int vgic_bitmap_get_irq_val(struct vgic_bitmap *x, +int cpuid, int irq) +{ + if (irq 32) VGIC_NR_PRIVATE_IRQS (inless you go with the suggestion above) + return test_bit(irq, x-percpu[cpuid].reg_ul); + + return test_bit(irq - 32, x-shared.reg_ul); +} + +static inline void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, + int cpuid, int irq, int val) +{ + unsigned long *reg; + + if (irq 32) + reg = x-percpu[cpuid].reg_ul; + else { + reg = x-shared.reg_ul; + irq -= 32; + } Likewise. + + if (val) + set_bit(irq, reg); + else + clear_bit(irq, reg); +} + +static inline unsigned long *vgic_bitmap_get_cpu_map(struct vgic_bitmap *x, +int cpuid) +{ + if (unlikely(cpuid = VGIC_MAX_CPUS)) + return NULL; + return x-percpu[cpuid].reg_ul; +} + +static inline unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) +{ + return x-shared.reg_ul; +} + +struct vgic_bytemap { + union { + u32 reg[8]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 4]; + unsigned long reg_ul[0]; + } shared; +}; Argh, it's another one! 
:) + +static inline u32 *vgic_bytemap_get_reg(struct vgic_bytemap *x, + int cpuid, u32 offset) +{ + offset = 2; + BUG_ON(offset (VGIC_NR_IRQS / 4)); + if (offset 4) + return x-percpu[cpuid].reg + offset; + else + return x-shared.reg + offset - 8; +} + +static inline int vgic_bytemap_get_irq_val(struct vgic_bytemap *x, + int cpuid, int irq) +{ + u32 *reg, shift; + shift =
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic, to implement an irq handler for msi. This reduces the pressure on scheduler in case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev-kvm, + assigned_dev-irq_source_id, + assigned_dev-guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW is this interface will be used only for legacy assigned device or there will be other users too? I think long term we should convert irqfd to this too. VIFO uses irqfd, no? So, why legacy device assignment needs that code to achieve parity with VFIO? Clarification: there are two issues: 1. legacy assignment has bad worst case latency this is because we bounce all ainterrupts through threads this patch fixes this 2. 
irqfd injects all MSIs from an atomic context this patch does not fix this, but it can be fixed on top of this patch Also why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, guest irq might not be an MSI: host MSI can cause guest intx injection I think, we need to bounce it to thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can host be intx but guest msi? No. You seems to not handle this case. Also injection of intx via ioapic is the same as injecting MSI. The format and the capability of irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context, another can't. There's likely an easy way but can be a follow up patch I think. I prefer to figure that out before introducing the interface. Ow come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. Hmm, we can get rid of vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping global extint vcpu) and then sorting irq routing entries so that ioapic entry is first since only ioapic injection can fail. Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? + return unlikely(ret == -EWOULDBLOCK) ? 
IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = dev_id; @@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) #endif #ifdef __KVM_HAVE_MSIX +static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int index = find_index_from_host_irq(assigned_dev, irq); + u32 vector; + int ret = 0; + + if (index >= 0) { + vector = assigned_dev->guest_msix_entries[index].vector; + ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + vector, 1); + } + +
Re: [kvmarm] [PATCH v4 05/13] ARM: KVM: VGIC accept vcpu and dist base addresses from user space
On 28/11/12 13:11, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:51PM +, Christoffer Dall wrote: User space defines the model to emulate to a guest and should therefore decide which addresses are used for both the virtual CPU interface directly mapped in the guest physical address space and for the emulated distributor interface, which is mapped in software by the in-kernel VGIC support. Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_mmu.h |2 + arch/arm/include/asm/kvm_vgic.h |9 ++ arch/arm/kvm/arm.c | 16 ++ arch/arm/kvm/vgic.c | 61 +++ 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 9bd0508..0800531 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -26,6 +26,8 @@ * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA * for now, but remember that the level-1 table must be aligned to its size. */ +#define KVM_PHYS_SHIFT (38) Seems a bit small... It's now been fixed to be 40 bits. 
+#define KVM_PHYS_MASK((1ULL KVM_PHYS_SHIFT) - 1) #define PTRS_PER_PGD2 512 #define PGD2_ORDER get_order(PTRS_PER_PGD2 * sizeof(pgd_t)) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index b444ecf..9ca8d21 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -20,6 +20,9 @@ #define __ASM_ARM_KVM_VGIC_H struct vgic_dist { +/* Distributor and vcpu interface mapping in the guest */ +phys_addr_t vgic_dist_base; +phys_addr_t vgic_cpu_base; }; struct vgic_cpu { @@ -31,6 +34,7 @@ struct kvm_run; struct kvm_exit_mmio; #ifdef CONFIG_KVM_ARM_VGIC +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr); bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exit_mmio *mmio); @@ -40,6 +44,11 @@ static inline int kvm_vgic_hyp_init(void) return 0; } +static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ +return 0; +} + static inline int kvm_vgic_init(struct kvm *kvm) { return 0; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 426828a..3ac1aab 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static bool vgic_present; + static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) { BUG_ON(preemptible()); @@ -825,7 +827,19 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) static int kvm_vm_ioctl_set_device_address(struct kvm *kvm, struct kvm_device_address *dev_addr) { -return -ENODEV; +unsigned long dev_id, type; + +dev_id = (dev_addr-id KVM_DEVICE_ID_MASK) KVM_DEVICE_ID_SHIFT; +type = (dev_addr-id KVM_DEVICE_TYPE_MASK) KVM_DEVICE_TYPE_SHIFT; + +switch (dev_id) { +case KVM_ARM_DEVICE_VGIC_V2: +if (!vgic_present) +return -ENXIO; +return kvm_vgic_set_addr(kvm, type, dev_addr-addr); +default: +return -ENODEV; +} } long kvm_arch_vm_ioctl(struct file *filp, diff --git 
a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c index 26ada3b..f85b275 100644 --- a/arch/arm/kvm/vgic.c +++ b/arch/arm/kvm/vgic.c @@ -22,6 +22,13 @@ #include <linux/io.h> #include <asm/kvm_emulate.h> +#define VGIC_ADDR_UNDEF (-1) +#define IS_VGIC_ADDR_UNDEF(_x) ((_x) == (typeof(_x))VGIC_ADDR_UNDEF) + +#define VGIC_DIST_SIZE 0x1000 +#define VGIC_CPU_SIZE 0x2000 These defines might be useful to userspace so that they don't request the distributor and the cpu interface to be placed too close together (been there, done that :). Fair enough. + + #define ACCESS_READ_VALUE (1 << 0) #define ACCESS_READ_RAZ (0 << 0) #define ACCESS_READ_MASK(x) ((x) & (1 << 0)) @@ -136,3 +143,57 @@ bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exi { return KVM_EXIT_MMIO; } + +static bool vgic_ioaddr_overlap(struct kvm *kvm) +{ +phys_addr_t dist = kvm->arch.vgic.vgic_dist_base; +phys_addr_t cpu = kvm->arch.vgic.vgic_cpu_base; + +if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu)) +return false; +if ((dist <= cpu && dist + VGIC_DIST_SIZE > cpu) || +(cpu <= dist && cpu + VGIC_CPU_SIZE > dist)) +return true; +return false; Just return the predicate that you're testing. +} + +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ +int r = 0; +struct vgic_dist
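The "just return the predicate" suggestion for the overlap check can be sketched in isolation. This is a hedged userspace version: the constants match the quoted patch, but taking the two base addresses as parameters (rather than a struct kvm) is an illustrative simplification.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

#define VGIC_DIST_SIZE 0x1000
#define VGIC_CPU_SIZE  0x2000

/* Return the overlap predicate directly instead of branching to an
 * explicit true/false, as the review suggests. */
static bool vgic_ioaddr_overlap(phys_addr_t dist, phys_addr_t cpu)
{
	return (dist <= cpu && dist + VGIC_DIST_SIZE > cpu) ||
	       (cpu <= dist && cpu + VGIC_CPU_SIZE > dist);
}
```

The two symmetric clauses cover the distributor starting below the cpu interface and vice versa; the undefined-address early-out from the original would stay in the caller.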
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic, to implement an irq handler for msi. This reduces the pressure on scheduler in case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev-kvm, +assigned_dev-irq_source_id, +assigned_dev-guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW is this interface will be used only for legacy assigned device or there will be other users too? I think long term we should convert irqfd to this too. VIFO uses irqfd, no? So, why legacy device assignment needs that code to achieve parity with VFIO? Clarification: there are two issues: 1. legacy assignment has bad worst case latency this is because we bounce all ainterrupts through threads this patch fixes this 2. 
irqfd injects all MSIs from an atomic context this patch does not fix this, but it can be fixed on top of this patch Thanks for clarification. Also why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, guest irq might not be an MSI: host MSI can cause guest intx injection I think, we need to bounce it to thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can host be intx but guest msi? No. You seems to not handle this case. Also injection of intx via ioapic is the same as injecting MSI. The format and the capability of irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context, another can't. There's likely an easy way but can be a follow up patch I think. I prefer to figure that out before introducing the interface. Ow come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is if it can be done at all or not. If it cannot then it does not matter that interface is internal, but fortunately looks like it is possible, so I am fine with proposed implementation for now. Hmm, we can get rid of vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping global extint vcpu) and then sorting irq routing entries so that ioapic entry is first since only ioapic injection can fail. Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. 
+ return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = dev_id; @@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) #endif #ifdef __KVM_HAVE_MSIX +static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
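The control flow being reviewed — inject from hard-irq context when possible, and wake the threaded handler only when atomic injection would have to sleep — can be sketched in plain userspace C. All names here are stand-ins for the kernel's irqreturn_t machinery, not the actual KVM code.

```c
#include <errno.h>

enum irqreturn { IRQ_HANDLED = 1, IRQ_WAKE_THREAD = 2 };

/* Stand-in for kvm_set_irq_inatomic(): returns -EWOULDBLOCK when the
 * injection cannot be completed without sleeping. */
static int set_irq_inatomic(int must_sleep)
{
	return must_sleep ? -EWOULDBLOCK : 0;
}

/* Hard-irq handler: the fast path injects directly; the slow path
 * defers to the threaded handler, mirroring the patch's return
 * statement `ret == -EWOULDBLOCK ? IRQ_WAKE_THREAD : IRQ_HANDLED`. */
static enum irqreturn msi_hard_handler(int must_sleep)
{
	int ret = set_irq_inatomic(must_sleep);
	return ret == -EWOULDBLOCK ? IRQ_WAKE_THREAD : IRQ_HANDLED;
}
```

This is why the threaded handler is still registered: it remains the fallback whenever the atomic path reports it would block.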
Re: [PATCH v4 04/13] ARM: KVM: Initial VGIC MMIO support code
On 28/11/12 13:09, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:44PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the initial in-kernel MMIO support code for the VGIC, used for the distributor emulation. [...] diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c new file mode 100644 index 000..26ada3b --- /dev/null +++ b/arch/arm/kvm/vgic.c @@ -0,0 +1,138 @@ +/* + * Copyright (C) 2012 ARM Ltd. + * Author: Marc Zyngier marc.zyng...@arm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/interrupt.h +#include linux/io.h +#include asm/kvm_emulate.h + +#define ACCESS_READ_VALUE (1 0) +#define ACCESS_READ_RAZ (0 0) +#define ACCESS_READ_MASK(x) ((x) (1 0)) +#define ACCESS_WRITE_IGNORED(0 1) +#define ACCESS_WRITE_SETBIT (1 1) +#define ACCESS_WRITE_CLEARBIT (2 1) +#define ACCESS_WRITE_VALUE (3 1) +#define ACCESS_WRITE_MASK(x)((x) (3 1)) + +/** + * vgic_reg_access - access vgic register + * @mmio: pointer to the data describing the mmio access + * @reg:pointer to the virtual backing of the vgic distributor struct + * @offset: least significant 2 bits used for word offset + * @mode: ACCESS_ mode (see defines above) + * + * Helper to make vgic register access easier using one of the access + * modes defined for vgic register access + * 
(read,raz,write-ignored,setbit,clearbit,write) + */ +static void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, +u32 offset, int mode) +{ +int word_offset = offset & 3; You can get rid of this variable. +int shift = word_offset * 8; shift = (offset & 3) << 3; +u32 mask; +u32 regval; + +/* + * Any alignment fault should have been delivered to the guest + * directly (ARM ARM B3.12.7 Prioritization of aborts). + */ + +mask = (~0U) >> (word_offset * 8); then use shift here. Sure. +if (reg) +regval = *reg; +else { Use braces for the if clause. Indeed. +BUG_ON(mode != (ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED)); +regval = 0; +} + +if (mmio->is_write) { +u32 data = (*((u32 *)mmio->data) & mask) << shift; +switch (ACCESS_WRITE_MASK(mode)) { +case ACCESS_WRITE_IGNORED: +return; + +case ACCESS_WRITE_SETBIT: +regval |= data; +break; + +case ACCESS_WRITE_CLEARBIT: +regval &= ~data; +break; + +case ACCESS_WRITE_VALUE: +regval = (regval & ~(mask << shift)) | data; +break; +} +*reg = regval; +} else { +switch (ACCESS_READ_MASK(mode)) { +case ACCESS_READ_RAZ: +regval = 0; +/* fall through */ + +case ACCESS_READ_VALUE: +*((u32 *)mmio->data) = (regval >> shift) & mask; +} +} +} It might be a good idea to have some port accessors for mmio->data otherwise you'll likely get endianness issues creeping in. Aarrgh! This depends on the relative endianness of host and guest. I'm feeling slightly sick... +/* All this should be handled by kvm_bus_io_*()... FIXME!!! */ I don't follow this comment :) Can you either make it clearer (and less alarming!) or just drop it please? The non-alarming version should read: /* * I would have liked to use the kvm_bus_io_*() API instead, but * it cannot cope with banked registers (only the VM pointer is * passed around, and we need the vcpu). One of these days, someone * please fix it! */ +struct mmio_range { +unsigned long base; +unsigned long len; +bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, +u32 offset); +}; Why not make offset a phys_addr_t?
Very good point. +static const struct mmio_range vgic_ranges[] = { +{} +}; + +static const +struct mmio_range *find_matching_range(const struct mmio_range *ranges, + struct kvm_exit_mmio *mmio, + unsigned long base) +{ +
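The word-offset arithmetic under review — a shift of `(offset & 3) << 3` and a mask covering the remaining bytes — is easy to check standalone. The helpers below are a hedged reconstruction of that math only, not the kernel's vgic_reg_access() itself; names are mine.

```c
#include <stdint.h>

/* Bit position of the byte lane selected by the low two bits of the
 * MMIO offset. */
static uint32_t lane_shift(uint32_t offset)
{
	return (offset & 3) << 3;	/* same as (offset & 3) * 8 */
}

/* Mask for the bytes still addressable from that lane, matching
 * mask = (~0U) >> (word_offset * 8) in the quoted code. */
static uint32_t lane_mask(uint32_t offset)
{
	return ~0U >> lane_shift(offset);
}

/* Read path as in the quoted code: (regval >> shift) & mask. */
static uint32_t lane_read(uint32_t regval, uint32_t offset)
{
	return (regval >> lane_shift(offset)) & lane_mask(offset);
}
```

Will's point about getting rid of `word_offset` is just that both the shift and the mask can be derived from `offset` directly, as above.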
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { -gpa_t gpa; +gpa_t gpa = cr2; pfn_t pfn; -if (tdp_enabled) +if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it is only changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid the compiler optimizing the load away. Please switch to mmu_lock protection; there is no reason to access this variable locklessly - it is not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages == 0 while there are still shadowed pages), and after entering the guest we go into reexecute_instruction again, which then removes the shadow pages. Can't the same scenario happen even with mmu lock held around the indirect_shadow_pages access? But I do not have a strong opinion on it; I respect your view! :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu cannot fix the mapping and we should report the error; otherwise it is fine to return to the guest and let it re-execute the instruction. page_fault_count is used to avoid racing with other vcpus: after we unprotect the target page, another cpu can enter the page fault path and have the page write-protected again. This lets us detect all the cases in which the mmu cannot be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? -- Gleb.
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On Wed, Nov 28, 2012 at 01:09:37PM +, Marc Zyngier wrote: On 28/11/12 12:49, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { - return !!v-arch.irq_lines; + return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? They serve very different purposes: - irq_lines directly controls the IRQ and FIQ lines (it is or-ed into the HCR register before entering the guest) - irq_pending_on_cpu deals with the CPU interface, and only that. Plus, it is a kernel only thing. What triggers the interrupt on the guest is the presence of list registers with a pending state. You signal interrupts one way or the other. Ok, thanks for the explanation. I suspect that we could use (another) cosmetic change then. How about cpui_irq_pending and hcr_irq_pending? int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); + kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); + kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Which one maps to which? sync: hardware - data structure flush: data structure - hardware Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... 
Updating *only* irq_pending_on_cpu doesn't require the lock (set_bit() should be safe, and I think the smp_mb() is a leftover of some debugging hack). kvm_vgic_to_cpu() does a lot more (it picks interrupts from the distributor, and hence requires the lock to be taken). Ok, if the barrier is just a hangover from something else and you don't have any races with test/clear operations then you should be alright. Will
Re: [PATCH v4 06/13] ARM: KVM: VGIC distributor handling
On 28/11/12 13:21, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:58PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Add the GIC distributor emulation code. A number of the GIC features are simply ignored as they are not required to boot a Linux guest. Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_vgic.h | 167 ++ arch/arm/kvm/vgic.c | 471 +++ 2 files changed, 637 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index 9ca8d21..9e60b1d 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -19,10 +19,177 @@ #ifndef __ASM_ARM_KVM_VGIC_H #define __ASM_ARM_KVM_VGIC_H +#include <linux/kernel.h> +#include <linux/kvm.h> +#include <linux/kvm_host.h> +#include <linux/irqreturn.h> +#include <linux/spinlock.h> +#include <linux/types.h> + +#define VGIC_NR_IRQS 128 #define VGIC_NR_PRIVATE_IRQS 32? +#define VGIC_NR_SHARED_IRQS (VGIC_NR_IRQS - 32) then subtract it here Sure. +#define VGIC_MAX_CPUS NR_CPUS We already have KVM_MAX_VCPUS, why do we need another? They really should be the same, and this NR_CPUS is a bug that has already been fixed. + +/* Sanity checks... */ +#if (VGIC_MAX_CPUS > 8) +#error Invalid number of CPU interfaces +#endif + +#if (VGIC_NR_IRQS & 31) +#error VGIC_NR_IRQS must be a multiple of 32 +#endif + +#if (VGIC_NR_IRQS > 1024) +#error VGIC_NR_IRQS must be <= 1024 +#endif Maybe put each check directly below the #define being tested, to make it super-obvious to people thinking of changing the constants? OK.
+/* + * The GIC distributor registers describing interrupts have two parts: + * - 32 per-CPU interrupts (SGI + PPI) + * - a bunch of shared interrupts (SPI) + */ +struct vgic_bitmap { + union { + u32 reg[1]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 32]; + unsigned long reg_ul[0]; + } shared; +}; Whoa, this is nasty! Firstly, let's replace the `32' with sizeof(u32) for fun. Secondly, can we make the reg_ul arrays sized using the BITS_TO_LONGS macro? This has already been replaced with: struct vgic_bitmap { union { u32 reg[1]; DECLARE_BITMAP(reg_ul, 32); } percpu[VGIC_MAX_CPUS]; union { u32 reg[VGIC_NR_SHARED_IRQS / 32]; DECLARE_BITMAP(reg_ul, VGIC_NR_SHARED_IRQS); } shared; }; which should address most of your concerns. As for the sizeof(u32), I think assuming that u32 has a grand total of 32 bits is safe enough ;-) + +static inline u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, + int cpuid, u32 offset) +{ + offset >>= 2; + BUG_ON(offset >= (VGIC_NR_IRQS / 32)); Hmm, where is offset sanity-checked before here? Do you just rely on all trapped accesses being valid? You've already validated the access being in a valid range. Offset is just derived from the access address. The BUG_ON() is a leftover from early debugging and should go away now. + if (!offset) + return x->percpu[cpuid].reg; + else + return x->shared.reg + offset - 1; An alternative to this would be to have a single array, with the per-cpu interrupts all laid out at the start and a macro to convert an offset to an index. Might make the code more readable and the struct definition more concise. I'll try and see if this makes the code more palatable. +} + +static inline int vgic_bitmap_get_irq_val(struct vgic_bitmap *x, +int cpuid, int irq) +{ + if (irq < 32) VGIC_NR_PRIVATE_IRQS (unless you go with the suggestion above) yep.
+ return test_bit(irq, x->percpu[cpuid].reg_ul); + + return test_bit(irq - 32, x->shared.reg_ul); +} + +static inline void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, + int cpuid, int irq, int val) +{ + unsigned long *reg; + + if (irq < 32) + reg = x->percpu[cpuid].reg_ul; + else { + reg = x->shared.reg_ul; + irq -= 32; + } Likewise. + + if (val) + set_bit(irq, reg); + else + clear_bit(irq, reg); +} + +static inline unsigned long *vgic_bitmap_get_cpu_map(struct vgic_bitmap *x, +int cpuid) +{ + if (unlikely(cpuid >= VGIC_MAX_CPUS)) + return NULL; + return x->percpu[cpuid].reg_ul; +} + +static inline unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) +{ +
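The DECLARE_BITMAP fix discussed above can be sanity-checked in userspace. In this sketch the constants are illustrative, and BITS_TO_LONGS/DECLARE_BITMAP are re-derived from their kernel definitions so the example is self-contained.

```c
#include <limits.h>

#define NR_SHARED_IRQS	96	/* illustrative: VGIC_NR_IRQS - 32 */
#define MAX_CPUS	8

#define BITS_PER_LONG	(CHAR_BIT * sizeof(long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

/* The fixed layout quoted in the thread: the unsigned-long view is now
 * sized to cover at least as many bits as the u32 view, instead of the
 * zero-length-array trick of the original patch. */
struct vgic_bitmap {
	union {
		unsigned int reg[1];
		DECLARE_BITMAP(reg_ul, 32);
	} percpu[MAX_CPUS];
	union {
		unsigned int reg[NR_SHARED_IRQS / 32];
		DECLARE_BITMAP(reg_ul, NR_SHARED_IRQS);
	} shared;
};
```

The point of the fix is exactly the property asserted below: each bitmap view spans at least the number of interrupt bits it is supposed to track, regardless of sizeof(long).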
kvm: let's add HYPERV capabilities (was Re: [PATCH 01/20] hyper-v: introduce Hyper-V support infrastructure.)
On Fri, Jan 20, 2012 at 03:26:27PM -0200, Marcelo Tosatti wrote: From: Vadim Rozenfeld vroze...@redhat.com [Jan: fix build with CONFIG_USER_ONLY] Signed-off-by: Marcelo Tosatti mtosa...@redhat.com --- Makefile.target |2 + target-i386/cpuid.c | 14 +++ target-i386/hyperv.c | 64 ++ target-i386/hyperv.h | 43 + 4 files changed, 123 insertions(+), 0 deletions(-) create mode 100644 target-i386/hyperv.c create mode 100644 target-i386/hyperv.h diff --git a/Makefile.target b/Makefile.target index 06d79b8..798dd30 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,8 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o obj-y += memory.o savevm.o LIBS+=-lz +obj-i386-y +=hyperv.o + QEMU_CFLAGS += $(VNC_TLS_CFLAGS) QEMU_CFLAGS += $(VNC_SASL_CFLAGS) QEMU_CFLAGS += $(VNC_JPEG_CFLAGS) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index 91a104b..b9bfeaf 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -27,6 +27,8 @@ #include qemu-option.h #include qemu-config.h +#include hyperv.h + /* feature flags taken from Intel Processor Identification and the CPUID * Instruction and AMD's CPUID Specification. In cases of disagreement * between feature naming conventions, aliases may be added. 
@@ -716,6 +718,14 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) goto error; } x86_cpu_def-tsc_khz = tsc_freq / 1000; +} else if (!strcmp(featurestr, hv_spinlocks)) { +char *err; +numvalue = strtoul(val, err, 0); +if (!*val || *err) { +fprintf(stderr, bad numerical value %s\n, val); +goto error; +} +hyperv_set_spinlock_retries(numvalue); } else { fprintf(stderr, unrecognized feature %s\n, featurestr); goto error; @@ -724,6 +734,10 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) check_cpuid = 1; } else if (!strcmp(featurestr, enforce)) { check_cpuid = enforce_cpuid = 1; +} else if (!strcmp(featurestr, hv_relaxed)) { +hyperv_enable_relaxed_timing(true); +} else if (!strcmp(featurestr, hv_vapic)) { +hyperv_enable_vapic_recommended(true); } else { fprintf(stderr, feature string `%s' not in format (+feature|-feature|feature=xyz)\n, featurestr); goto error; diff --git a/target-i386/hyperv.c b/target-i386/hyperv.c new file mode 100644 index 000..f284e99 --- /dev/null +++ b/target-i386/hyperv.c @@ -0,0 +1,64 @@ +/* + * QEMU Hyper-V support + * + * Copyright Red Hat, Inc. 2011 + * + * Author: Vadim Rozenfeld vroze...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ * + */ + +#include "hyperv.h" + +static bool hyperv_vapic; +static bool hyperv_relaxed_timing; +static int hyperv_spinlock_attempts = HYPERV_SPINLOCK_NEVER_RETRY; + +void hyperv_enable_vapic_recommended(bool val) +{ +hyperv_vapic = val; +} + +void hyperv_enable_relaxed_timing(bool val) +{ +hyperv_relaxed_timing = val; +} + +void hyperv_set_spinlock_retries(int val) +{ +hyperv_spinlock_attempts = val; +if (hyperv_spinlock_attempts < 0xFFF) { +hyperv_spinlock_attempts = 0xFFF; +} +} + +bool hyperv_enabled(void) +{ +return hyperv_hypercall_available() || hyperv_relaxed_timing_enabled(); +} + +bool hyperv_hypercall_available(void) +{ +if (hyperv_vapic || +(hyperv_spinlock_attempts != HYPERV_SPINLOCK_NEVER_RETRY)) { + return true; +} +return false; +} + +bool hyperv_vapic_recommended(void) +{ +return hyperv_vapic; +} + +bool hyperv_relaxed_timing_enabled(void) +{ +return hyperv_relaxed_timing; +} + +int hyperv_get_spinlock_retries(void) +{ +return hyperv_spinlock_attempts; +} diff --git a/target-i386/hyperv.h b/target-i386/hyperv.h new file mode 100644 index 000..15467bf --- /dev/null +++ b/target-i386/hyperv.h @@ -0,0 +1,43 @@ +/* + * QEMU Hyper-V support + * + * Copyright Red Hat, Inc. 2011 + * + * Author: Vadim Rozenfeld vroze...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#ifndef QEMU_HW_HYPERV_H +#define QEMU_HW_HYPERV_H 1 + +#include "qemu-common.h" +#include <asm/hyperv.h> + +#ifndef HYPERV_SPINLOCK_NEVER_RETRY +#define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF +#endif + +#ifndef KVM_CPUID_SIGNATURE_NEXT +#define KVM_CPUID_SIGNATURE_NEXT 0x40000100 +#endif
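The hv_spinlocks hunk in cpuid.c parses its value with strtoul plus an empty-string and trailing-garbage check. A minimal standalone version of that idiom (the function name is mine, not QEMU's):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Parse a numeric feature value, rejecting empty strings and trailing
 * garbage, mirroring the `!*val || *err` check in the quoted hunk. */
static bool parse_feature_value(const char *val, unsigned long *out)
{
	char *err;

	if (!val || !*val)
		return false;		/* the "bad numerical value" case */
	*out = strtoul(val, &err, 0);
	return *err == '\0';
}
```

Base 0 lets strtoul accept decimal, octal, and 0x-prefixed hex, which is why `hv_spinlocks=0xFFF` works on the command line.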
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On 11/28/2012 10:01 PM, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { -gpa_t gpa; +gpa_t gpa = cr2; pfn_t pfn; -if (tdp_enabled) +if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it is only changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid the compiler optimizing the load away. Please switch to mmu_lock protection; there is no reason to access this variable locklessly - it is not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages == 0 while there are still shadowed pages), and after entering the guest we go into reexecute_instruction again, which then removes the shadow pages. Can't the same scenario happen even with mmu lock held around the indirect_shadow_pages access? Hmm, I also think it is no different. Even holding mmu-lock, we cannot prevent the target pfn from being write-protected later. Marcelo?
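For readers unfamiliar with the ACCESS_ONCE() in the hunk above: it forces the compiler to emit a fresh load on every evaluation, classically implemented as a volatile cast. This is a userspace sketch of the read-side idiom being debated; the variable and function names are illustrative stand-ins, not the KVM code.

```c
/* Classic ACCESS_ONCE() definition: a volatile access the compiler may
 * neither cache in a register nor duplicate. (__typeof__ is a GNU C
 * extension, as in the kernel.) */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int indirect_shadow_pages;	/* stand-in for the kvm_arch field */

/* Lockless read: a stale value is tolerable here because, as argued in
 * the thread, the caller simply retries after re-entering the guest. */
static int fast_path_allowed(void)
{
	return ACCESS_ONCE(indirect_shadow_pages) == 0;
}
```

Note that ACCESS_ONCE() constrains only the compiler; it provides no memory-ordering guarantees between CPUs, which is exactly why the write side still updates the counter under mmu-lock.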
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu cannot fix the mapping and we should report the error; otherwise it is fine to return to the guest and let it re-execute the instruction. page_fault_count is used to avoid races with other vcpus: after we unprotect the target page, another cpu can enter the page fault path and cause the page to be write-protected again. This way we can detect all the cases where the mmu cannot be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 03:38:40PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for msi. This reduces the pressure on the scheduler in the case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + assigned_dev->guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from the previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too. VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Clarification: there are two issues: 1.
legacy assignment has bad worst-case latency, because we bounce all interrupts through threads; this patch fixes this. 2. irqfd injects all MSIs from an atomic context; this patch does not fix this, but it can be fixed on top of this patch. Thanks for the clarification. Also, why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, the guest irq might not be an MSI: host MSI can cause guest intx injection, I think; we need to bounce it to a thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can the host be intx but the guest msi? No. You seem to not handle this case. Also, injection of intx via ioapic is the same as injecting MSI. The format and the capability of the irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch I think. I prefer to figure that out before introducing the interface. Oh come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is whether it can be done at all or not. If it cannot, then it does not matter that the interface is internal, but fortunately it looks like it is possible, so I am fine with the proposed implementation for now. Hmm, we can get rid of the vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sorting irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. Hmm, you are saying PIC injections in atomic context always succeed? + return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote: On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote: Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) So it's u64 here. { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); But not u64 here? Any particular reason? It should be void everywhere; thanks for catching that, I will change the code. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 05:25:17PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 03:38:40PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for msi. This reduces the pressure on the scheduler in the case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + assigned_dev->guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from the previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too. VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Clarification: there are two issues: 1.
legacy assignment has bad worst-case latency, because we bounce all interrupts through threads; this patch fixes this. 2. irqfd injects all MSIs from an atomic context; this patch does not fix this, but it can be fixed on top of this patch. Thanks for the clarification. Also, why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, the guest irq might not be an MSI: host MSI can cause guest intx injection, I think; we need to bounce it to a thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can the host be intx but the guest msi? No. You seem to not handle this case. Also, injection of intx via ioapic is the same as injecting MSI. The format and the capability of the irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch I think. I prefer to figure that out before introducing the interface. Oh come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is whether it can be done at all or not. If it cannot, then it does not matter that the interface is internal, but fortunately it looks like it is possible, so I am fine with the proposed implementation for now. Hmm, we can get rid of the vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sorting irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. Hmm, you are saying PIC injections in atomic context always succeed? No, I am saying the vcpu loop is not hit.
Re: [PATCH] vhost-blk: Add vhost-blk support v5
From: Michael S. Tsirkin m...@redhat.com Date: Mon, 26 Nov 2012 17:14:16 +0200 On Mon, Nov 19, 2012 at 10:26:41PM +0200, Michael S. Tsirkin wrote: Userspace bits: 1) LKVM The latest vhost-blk userspace bits for kvm tool can be found here: g...@github.com:asias/linux-kvm.git blk.vhost-blk 2) QEMU The latest vhost-blk userspace prototype for QEMU can be found here: g...@github.com:asias/qemu.git blk.vhost-blk Changes in v5: - Do not assume the buffer layout - Fix wakeup race Changes in v4: - Mark req->status as userspace pointer - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() - Add if (need_resched()) schedule() in blk thread - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() - Use vq_err() instead of pr_warn() - Fail unsupported requests - Add flush in vhost_blk_set_features() Changes in v3: - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! - Check file passed by user is a raw block device file Signed-off-by: Asias He as...@redhat.com Since there are files shared by this and vhost net it's easiest for me to merge this all through the vhost tree. Hi Dave, are you ok with this proposal? I have no problems with this, for networking parts: Acked-by: David S. Miller da...@davemloft.net -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
PCI device pass through support
Hi Alex, I am trying to pass through a PCI device to the guest to compare the MSI interrupt latency with normal device pass through and pass through using the VFIO framework. I used the following script: for dev in $(ls /sys/bus/pci/devices/:06:00.0/iommu_group/devices); do vendor=$(cat /sys/bus/pci/devices/$dev/vendor) device=$(cat /sys/bus/pci/devices/$dev/device) if [ -e /sys/bus/pci/devices/$dev/driver ]; then echo $dev > /sys/bus/pci/devices/$dev/driver/unbind fi echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id done and added -device vfio-pci,host,host=:06:00.0 to my qemu command line. I am getting this error - qemu-system-x86_64: -device vfio-pci,host,host=:06:00.0: Property 'vfio-pci.host' doesn't take value 'on' -- when I use the qemu from your git repo and also the latest v1.3.0-rc1 qemu. I am using kernel 3.6.7-rt18. Am I missing a step? Krishna -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI device pass through support
On Wed, 2012-11-28 at 16:32 +0000, Krishna J wrote: Hi Alex, I am trying to pass through a PCI device to the guest to compare the MSI interrupt latency with normal device pass through and pass through using the VFIO framework. I used the following script: for dev in $(ls /sys/bus/pci/devices/:06:00.0/iommu_group/devices); do vendor=$(cat /sys/bus/pci/devices/$dev/vendor) device=$(cat /sys/bus/pci/devices/$dev/device) if [ -e /sys/bus/pci/devices/$dev/driver ]; then echo $dev > /sys/bus/pci/devices/$dev/driver/unbind fi echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id done and added -device vfio-pci,host,host=:06:00.0 to my qemu command line. ^ I am getting this error - qemu-system-x86_64: -device vfio-pci,host,host=:06:00.0: Property 'vfio-pci.host' doesn't take value 'on' -- when I use the qemu from your git repo and also the latest v1.3.0-rc1 qemu. I am using kernel 3.6.7-rt18. Am I missing a step? Hi Krishna, There's an extra "host," in the option above, is that the problem? Please let me know if I've mis-documented it anywhere. I hope you can share your results when you get it working. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: powerpc: convert spapr pci to new device iterators
Hi Will, On 2012-11-27 13:00, Will Deacon wrote: Commit 8d35d32d0148 (kvm tools: add generic device registration mechanism) introduced a tree-based device lookup-by-bus mechanism as well as iterators to enumerate the devices on a particular bus. Whilst both x86 and ppc were converted by the original patch, the spapr pci changes were incomplete, so include the required changes here. Compile-tested only on ppc64 970mp. Note that I had to hack the Makefile in order to build guest_init.o with a toolchain defaulting to ppc: $(GUEST_INIT): guest/init.c $(E) LINK $@ - $(Q) $(CC) -static guest/init.c -o $@ - $(Q) $(LD) -r -b binary -o guest/guest_init.o $(GUEST_INIT) + $(Q) $(CC) -m64 -static guest/init.c -o $@ + $(Q) $(LD) -m elf64ppc -r -b binary -o guest/guest_init.o $(GUEST_INIT) $(DEPS): Cc: Matt Evans m...@ozlabs.org Signed-off-by: Will Deacon will.dea...@arm.com I lack PPC But this patch appears too good To be left aside Acked-by: Matt Evans m...@ozlabs.org Cheers, Matt --- tools/kvm/powerpc/spapr_pci.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/kvm/powerpc/spapr_pci.c b/tools/kvm/powerpc/spapr_pci.c index 5f4016c..ed4b9ab 100644 --- a/tools/kvm/powerpc/spapr_pci.c +++ b/tools/kvm/powerpc/spapr_pci.c @@ -15,6 +15,7 @@ #include "spapr.h" #include "spapr_pci.h" +#include "kvm/devices.h" #include "kvm/fdt.h" #include "kvm/util.h" #include "kvm/pci.h" @@ -248,6 +249,7 @@ int spapr_populate_pci_devices(struct kvm *kvm, void *fdt) { int bus_off, node_off = 0, devid, fn, i, n, devices; + struct device_header *dev_hdr; char nodename[256]; struct { uint32_t hi; @@ -301,14 +303,15 @@ int spapr_populate_pci_devices(struct kvm *kvm, /* Populate PCI devices and allocate IRQs */ devices = 0; - - for (devid = 0; devid < KVM_MAX_DEVICES; devid++) { + dev_hdr = device__first_dev(DEVICE_BUS_PCI); + while (dev_hdr) { uint32_t *irqmap = interrupt_map[devices]; - struct pci_device_header *hdr = pci__find_dev(devid); + struct pci_device_header *hdr = dev_hdr->data;
if (!hdr) continue; + devid = dev_hdr->dev_num; fn = 0; /* kvmtool doesn't yet do multifunction devices */ sprintf(nodename, "pci@%u,%u", devid, fn); @@ -413,6 +416,7 @@ int spapr_populate_pci_devices(struct kvm *kvm, /* We don't set ibm,dma-window property as we don't have an IOMMU. */ ++devices; + dev_hdr = device__next_dev(dev_hdr); } /* Write interrupt map */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable but not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host.
arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:45 AM, Glauber Costa wrote: On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. 
Yes, but we still get questions from users asking "what is steal time?" and "why am I seeing this?" Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC, qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know its entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets, which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism you are using to enforce the entitlement? Is it the cgroup cpu controller, or something else? It is set up using cpu overcommit.
But the request was for something that would work both in the overcommit environment and when hard capping is being used. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance issue
On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote: On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote: I have some code which do both reference time and invariant TSC but it will not work after migration. I will send it later today. Do you mean migrating guests? This is not an issue for us. OK, but don't say I didn't warn you :) There are two patches, one for kvm and another one for qemu. you will probably need to rebase them. Add hv_tsc cpu parameter to activate this feature. you will probably need to deactivate hpet by adding -no-hpet parameter as well. I've also added +hv_relaxed since then, but this is the command I'm using now and there's no change: /usr/bin/qemu-kvm -name b691546e-79f8-49c6-a293-81067503a6ad -S -M pc-1.2 -enable-kvm -m 16384 -smp 9,sockets=1,cores=9,threads=1 -uuid b691546e-79f8-49c6-a293-81067503a6ad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/b691546e-79f8-49c6-a293-81067503a6ad.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-hpet -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/dis-magnetics-2-223101/d8b233c6-8424-4de9-ae3c-7c9a60288514,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:2e:fb:a2:36:be,bus=pci.0,addr=0x3 -netdev tap,fd=40,id=hostnet1,vhost=on,vhostfd=41 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=22:94:44:5a:cb:24,bus=pci.0,addr=0x4 -vnc 127.0.0.1:0,password -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -cpu host,hv_tsc I compiled qemu-1.2.0-24 after applying your patch, used the head for KVM, and I see no difference. 
I've tried setting windows' useplatformclock on and off, no change either. Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer otherwise, which does point at something related to the timers. Any other thing I can try? -- George-Cristian Bîrzan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
Glauber Costa glom...@parallels.com writes: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. This is only helpful for static entitlements. But if we allow dynamic entitlements--which is a very useful feature, think of buying an online upgrade in a cloud environment--then you need to account for entitlement loss at the same place where you do the rest of the accounting: in /proc/stat. Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective.
This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. I'll bring you a nice bottle of scotch at the next KVM Forum if you can find me one user that can accurately describe what steal time is. The semantics are so incredibly subtle that I have a hard time believing anyone actually understands what it means today. Regards, Anthony Liguori --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance issue
On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu -- George-Cristian Bîrzan
Re: Performance issue
On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. -- Gleb.
Re: Performance issue
On Wed, Nov 28, 2012 at 9:56 PM, Gleb Natapov g...@redhat.com wrote: On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. Do I have to recompile the kernel, or just the module? I followed the instructions at http://www.linux-kvm.org/page/Code#building_an_external_module_with_older_kernels but I guess I can do the whole kernel, if it might help. -- George-Cristian Bîrzan
Re: Performance issue
On Wed, Nov 28, 2012 at 10:01:04PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 9:56 PM, Gleb Natapov g...@redhat.com wrote: On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. Do I have to recompile the kernel, or just the module? I followed the instructions at http://www.linux-kvm.org/page/Code#building_an_external_module_with_older_kernels but I guess I can do the whole kernel, if it might help. Module is enough, but kvm-kmod is not what you want. Just rebuild the whole kernel if you do not know how to rebuild only the module for your distribution's kernel. -- Gleb.
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 10:43 PM, Michael Wolf wrote: On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable but not running'. Overcommit is the main scenario in which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right? Some time has passed since I last looked into it, but IIRC, after you are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time: arch/x86/kvm/x86.c: delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; vcpu->arch.st.accum_steal = delta; include/linux/sched.h: unsigned long long run_delay; /* time spent waiting on a runqueue */ So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do is just subtract it from run_delay in kvm/x86.c before we pass it on. In summary: Alter the amount of steal time reported by the guest.
Maybe this should go away.
Expand the steal time msr to also contain the consigned time.
Maybe this should go away.
Add the code to send the consigned time from the host to the guest.
This definitely should be heavily modified.
Add a timer to allow the separation of consigned from steal time.
Maybe this should go away.
Add an ioctl to communicate the consign limit to the host.
This definitely should go away.
More specifically, *whatever* way we use to cap the processor, the host system will have all the information at all times.
Re: [PATCH] vfio powerpc: implemented IOMMU driver for VFIO
On Wed, 2012-11-28 at 18:21 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch implements an IOMMU driver for VFIO which does mapping/unmapping pages for the guest IO and provides information about the DMA window (required by a POWERPC guest). The counterpart in QEMU is required to support this functionality. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru ---
 drivers/vfio/Kconfig                |   6 +
 drivers/vfio/Makefile               |   1 +
 drivers/vfio/vfio_iommu_spapr_tce.c | 332 +++++++++++++++++++++++++++
 include/linux/vfio.h                |  33 +++
 4 files changed, 372 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
 	depends on VFIO
 	default n
 
+config VFIO_IOMMU_SPAPR_TCE
+	tristate
+	depends on VFIO && SPAPR_TCE_IOMMU
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
+	select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 000..b98770e
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,332 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Author: Alexey Kardashevskiy a...@ozlabs.ru
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson alex.william...@redhat.com
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "a...@ozlabs.ru"
+#define DRIVER_DESC    "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ */
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/*
+	 * Couldn't get mmap_sem lock, so must setup to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness
+	 */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}

This looks familiar, should we split it out to a common file instead of duplicating it?

+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the
[PATCH V2] Added tests for ia32_tsc_adjust functionality.
Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com ---
 config-x86-common.mak |  5 +-
 x86/tsc_adjust.c      | 60 +++++++++++++++++++++++++++++++++
 x86/vmexit.c          | 13 +++++++
 3 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 x86/tsc_adjust.c

diff --git a/config-x86-common.mak b/config-x86-common.mak
index c76cd11..8f909f7 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -34,7 +34,8 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
 	$(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
 	$(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
 	$(TEST_DIR)/kvmclock_test.flat $(TEST_DIR)/eventinj.flat \
-	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat $(TEST_DIR)/asyncpf.flat
+	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
+	$(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat
 
 ifdef API
 tests-common += api/api-sample
@@ -64,6 +65,8 @@
 $(TEST_DIR)/port80.elf: $(cstart.o) $(TEST_DIR)/port80.o
 
 $(TEST_DIR)/tsc.elf: $(cstart.o) $(TEST_DIR)/tsc.o
 
+$(TEST_DIR)/tsc_adjust.elf: $(cstart.o) $(TEST_DIR)/tsc_adjust.o
+
 $(TEST_DIR)/apic.elf: $(cstart.o) $(TEST_DIR)/apic.o
 
 $(TEST_DIR)/realmode.elf: $(TEST_DIR)/realmode.o

diff --git a/x86/tsc_adjust.c b/x86/tsc_adjust.c
new file mode 100644
index 000..05cc5d9
--- /dev/null
+++ b/x86/tsc_adjust.c
@@ -0,0 +1,60 @@
+#include "libcflat.h"
+#include "processor.h"
+
+#define IA32_TSC_ADJUST 0x3b
+
+int main()
+{
+	u64 t1, t2, t3, t4, t5;
+	u64 est_delta_time;
+	bool pass = true;
+
+	if (cpuid(7).b & (1 << 1)) { // IA32_TSC_ADJUST Feature is enabled?
+		if (rdmsr(IA32_TSC_ADJUST) != 0x0) {
+			printf("failure: IA32_TSC_ADJUST msr was incorrectly initialized\n");
+			pass = false;
+		}
+		t3 = 1000ull;
+		t1 = rdtsc();
+		wrmsr(IA32_TSC_ADJUST, t3);
+		t2 = rdtsc();
+		if (rdmsr(IA32_TSC_ADJUST) != t3) {
+			printf("failure: IA32_TSC_ADJUST msr read / write incorrect\n");
+			pass = false;
+		}
+		if (t2 - t1 < t3) {
+			printf("failure: TSC did not adjust for IA32_TSC_ADJUST value\n");
+			pass = false;
+		}
+		t3 = 0x0;
+		wrmsr(IA32_TSC_ADJUST, t3);
+		if (rdmsr(IA32_TSC_ADJUST) != t3) {
+			printf("failure: IA32_TSC_ADJUST msr read / write incorrect\n");
+			pass = false;
+		}
+		t4 = 1000ull;
+		t1 = rdtsc();
+		wrtsc(t4);
+		t2 = rdtsc();
+		t5 = rdmsr(IA32_TSC_ADJUST);
+		// est of time between reading tsc and writing tsc,
+		// (based on IA32_TSC_ADJUST msr value) should be small
+		est_delta_time = t4 - t5 - t1;
+		if (est_delta_time > 2 * (t2 - t4)) {
+			// arbitrary 2x latency (wrtsc-rdtsc) threshold
+			printf("failure: IA32_TSC_ADJUST msr incorrectly adjusted on tsc write\n");
+			pass = false;
+		}
+		if (pass)
+			printf("success: IA32_TSC_ADJUST enabled and working correctly\n");
+	} else {
+		printf("success: IA32_TSC_ADJUST feature not enabled\n");
+	}
+	return pass ? 0 : 1;
+}

diff --git a/x86/vmexit.c b/x86/vmexit.c
index ad8ab55..99ff964 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -34,6 +34,7 @@ static void vmcall(void)
 	asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
 }
 
+#define MSR_TSC_ADJUST 0x3b
 #define MSR_EFER 0xc0000080
 #define EFER_NX_MASK (1ull << 11)
 
@@ -103,6 +104,16 @@ static void ple_round_robin(void)
 	++counters[you].n1;
 }
 
+static void rd_tsc_adjust_msr(void)
+{
+	rdmsr(MSR_TSC_ADJUST);
+}
+
+static void wr_tsc_adjust_msr(void)
+{
+	wrmsr(MSR_TSC_ADJUST, 0x0);
+}
+
 static struct test {
 	void (*func)(void);
 	const char *name;
@@ -119,6 +130,8 @@
 	{ ipi, "ipi", is_smp, .parallel = 0, },
 	{ ipi_halt, "ipi+halt", is_smp, .parallel = 0, },
 	{ ple_round_robin, "ple-round-robin", .parallel = 1 },
+	{ wr_tsc_adjust_msr, "wr_tsc_adjust_msr", .parallel = 1 },
+	{ rd_tsc_adjust_msr, "rd_tsc_adjust_msr", .parallel = 1 },
 };
 
 unsigned iterations;
-- 
1.8.0.rc0
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On Wed, Nov 28, 2012 at 10:55:26PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:01 PM, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { - gpa_t gpa; + gpa_t gpa = cr2; pfn_t pfn; - if (tdp_enabled) + if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it can only be changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid magic optimization from the compiler. Please switch to mmu_lock protection, there is no reason to have access to this variable locklessly - not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages = 0, but there are still shadow pages being shadowed); after entering the guest, we will go into reexecute_instruction again, and then it will remove the shadow pages. Isn't the same scenario possible even with mmu lock around the indirect_shadow_pages access? Hmm..., i also think it is no different. Even using mmu-lock, we can not prevent the target pfn from being write-protected later. Marcelo? In this particular case, it appears to be harmless (unsure if the kvm_mmu_pte_write one is safe). The point is, lockless access should not be special. Lockless access must be carefully documented (access protocol to variables documented, all possible cases listed), and done when necessary due to performance. Otherwise, don't do it. In this case, it's not necessary.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes.
Re: [PATCH V2] Added tests for ia32_tsc_adjust functionality.
On Wed, Nov 28, 2012 at 01:32:09PM -0800, Will Auld wrote: Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com --- config-x86-common.mak | 5 +- x86/tsc_adjust.c | 60 +++ x86/vmexit.c | 13 +++ 3 files changed, 77 insertions(+), 1 deletion(-) create mode 100644 x86/tsc_adjust.c Applied, thanks.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch? The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/29/2012 06:40 AM, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch? The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0.
It is the same as without the read-only large spte. Ah, wait, there is a case: the large page is located at 0-2M, its 0-4K range is used as a page table (e.g. a PDE), and the guest wants to write memory located at 5K, which should be freely writable. This patch can return 0 for both the current code and the read-only large spte. I need to think about it more.
RE: [PATCH V2] Added tests for ia32_tsc_adjust functionality.
Thanks Marcelo! -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Wednesday, November 28, 2012 2:14 PM To: Auld, Will Cc: kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com; Gleb Subject: Re: [PATCH V2] Added tests for ia32_tsc_adjust functionality. On Wed, Nov 28, 2012 at 01:32:09PM -0800, Will Auld wrote: Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com --- config-x86-common.mak | 5 +- x86/tsc_adjust.c | 60 +++ x86/vmexit.c | 13 +++ 3 files changed, 77 insertions(+), 1 deletion(-) create mode 100644 x86/tsc_adjust.c Applied, thanks.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Thu, Nov 29, 2012 at 06:40:51AM +0800, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch?
The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte. https://lkml.org/lkml/2012/11/17/75 Does unshadowing work with large sptes at reexecute_instruction? That is, do we nuke any large read-only sptes that might be causing a certain gfn to be read-only? That is, following the sequence there, is the large read-only spte properly destroyed?
Re: [PATCH 1/2] KVM: VMX: fix invalid cpu passed to smp_call_function_single
On Wed, Nov 28, 2012 at 08:53:15PM +0800, Xiao Guangrong wrote: In loaded_vmcs_clear, loaded_vmcs->cpu is the first parameter passed to smp_call_function_single; if the target cpu is going down (doing cpu hot remove), loaded_vmcs->cpu can become -1, and then -1 is passed to smp_call_function_single. It can be triggered when a vcpu is being destroyed, since loaded_vmcs_clear is called in a preemptible context. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 8 +++--- 1 files changed, 5 insertions(+), 3 deletions(-) Applied, thanks.
Re: [PATCH] KVM: use is_idle_task() instead of idle_cpu() to decide when to halt in async_pf
On Wed, Nov 28, 2012 at 03:19:08PM +0200, Gleb Natapov wrote: As Frederic pointed out, idle_cpu() may return false even if the async fault happened in the idle task, if a wake up is pending. In this case the code will try to put the idle task to sleep. Fix this by using is_idle_task() to check for the idle task. Reported-by: Frederic Weisbecker fweis...@gmail.com Signed-off-by: Gleb Natapov g...@redhat.com Applied, thanks.
Re: [PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
On Wed, Nov 28, 2012 at 08:54:14PM +0800, Xiao Guangrong wrote: vmcs->cpu indicates whether it exists on the target cpu; -1 means the vmcs does not exist on any vcpu. If a vcpu loads a vmcs with vmcs.cpu = -1, it can be directly added to the cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list before reading vmcs->cpu. Meanwhile, we should remove the vmcs from the list before making vmcs->vcpu == -1 visible. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 29e8f42..6056d88 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg) if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs) per_cpu(current_vmcs, cpu) = NULL; list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link); + + /* + * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link + * is before setting loaded_vmcs->vcpu to -1 which is done in + * loaded_vmcs_init. Otherwise, other cpu can see vcpu = -1 first + * then adds the vmcs into percpu list before it is deleted. + */ + smp_wmb(); + Neither loads nor stores are reordered with like operations (see section 8.2.3.2 of Intel's volume 3). This behaviour makes the barrier unnecessary. However, I agree access to loaded_vmcs is not obviously safe. I can't tell whether it's safe with vmm_exclusive = 0 (where vcpu->cpu can change at any time).
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Thu, Nov 29, 2012 at 07:16:50AM +0800, Xiao Guangrong wrote: On 11/29/2012 06:40 AM, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch?
The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte. Ah, wait, there is a case: the large page is located at 0-2M, its 0-4K range is used as a page table (e.g. a PDE), and the guest wants to write memory located at 5K, which should be freely writable. This patch can return 0 for both the current code and the read-only large spte. Yes, we should remove the read-only large spte if any write to 0-2M fails (I said 'unshadow' in the previous email, but the correct term is 'remove large sptes in range'). I need to think about it more.
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On Wed, 2012-11-28 at 14:26 -0800, Chegu Vinod wrote: On 11/27/2012 6:23 AM, Chegu Vinod wrote: On 11/27/2012 2:30 AM, Raghavendra K T wrote: On 11/26/2012 07:05 PM, Andrew Jones wrote: On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote: From: Peter Zijlstra pet...@infradead.org In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding.
+ */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4310,13 +4322,13 @@ again: } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4329,11 +4341,12 @@ again: resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded > 0) schedule(); return yielded; Acked-by: Andrew Jones drjo...@redhat.com Thank you Drew. Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches.. Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler. Thanks Raghu. Will try to get this latest patch set evaluated and get back to you. Hi Raghu, Here is some preliminary data with your latest set of PLE patches (also with Andrew's throttled yield_to() change). Ran a single guest on an 80 core Westmere platform. [Note: Host and Guest had the latest kernel from kvm.git and were also using the latest qemu from qemu.git as of yesterday morning]. The guest was running an AIM7 high_systime workload. (Note: high_systime is a kernel-intensive micro-benchmark, but in this case it was run just as a workload in the guest to trigger spinlock etc. contention in the guest OS and hence PLE; i.e. this is not a real benchmark run). I have run this workload with a constant # (i.e. 2000) of users with 100 jobs per user. The numbers below represent the # of jobs per minute (JPM) - the higher the better.

                              40VCPU  60VCPU  80VCPU
a) 3.7.0-rc6+ w/ ple_gap=0    ~102K   ~88K    ~81K
b) 3.7.0-rc6+                 ~53K    ~25K    ~18-20K
c) 3.7.0-rc6+ w/ PLE patches  ~100K   ~81K    ~48K-69K (lot of variation from run to run)
Re: [PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
On 11/29/2012 08:04 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 08:54:14PM +0800, Xiao Guangrong wrote: vmcs->cpu indicates whether the vmcs exists on the target cpu; -1 means the vmcs does not exist on any cpu. If a vcpu loads a vmcs with vmcs.cpu = -1, it can be directly added to the cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list before reading vmcs->cpu. Meanwhile, we should remove the vmcs from the list before making vmcs->cpu == -1 visible. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 29e8f42..6056d88 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg) if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs) per_cpu(current_vmcs, cpu) = NULL; list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link); + + /* + * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link + * is before setting loaded_vmcs->cpu to -1 which is done in + * loaded_vmcs_init. Otherwise, other cpus can see cpu = -1 first + * and then add the vmcs into the percpu list before it is deleted. + */ + smp_wmb(); + Neither loads nor stores are reordered with like operations (see section 8.2.3.2 of Intel's volume 3). This behaviour makes the barrier unnecessary. Ouch, yes, you are right. My memory is wrong. It seems only a later read can be reordered with an earlier write. But if 'read vs read' and 'write vs write' ordering is guaranteed by the CPU, smp_wmb() and smp_rmb() should only be a compiler barrier, so I think we can add the barriers to improve readability and portability. And anyway, the current code is missing a compiler barrier. However, I agree that access to loaded_vmcs is not obviously safe. I can't tell it's safe with vmm_exclusive = 0 (where vcpu->cpu can change at any time). If vmm_exclusive = 0, the vmcs can be removed from the percpu list when the vcpu is scheduled out.
The list is not broken.
RE: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
-----Original Message----- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Wednesday, November 28, 2012 7:28 PM To: Marcelo Tosatti Cc: Xu, Dongxiao; kvm@vger.kernel.org Subject: Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM On Tue, Nov 27, 2012 at 10:29:08PM -0200, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area; use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change the VMCS ID? 3. Can't it be kept somewhere else other than a list? The current scheme allows the guest to allocate unlimited amounts of host memory. 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine. According to Dongxiao they slow down nested guests by 4%. For that version, it did introduce a certain performance downgrade. Actually, in my new patch I simplified the vmcs12_read() and vmcs12_write() functions and there is no obvious performance downgrade. Thanks, Dongxiao -- Gleb.
Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
On 11/26/2012 4:07 AM, Raghavendra K T wrote: In some special scenarios like #vcpu = #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. The first patch optimizes all the yield_to calls by bailing out when there is no need to continue in yield_to (i.e., when there is only one task in source and target rq). The second patch uses that in the PLE handler. Further, when a yield_to fails we do not immediately go out of the PLE handler; instead we try thrice, for a better statistical chance that the failure was genuine. Otherwise that would affect moderate overcommit cases. Results on a 3.7.0-rc6 kernel show around 140% improvement for ebizzy 1x and around 51% for dbench 1x on a 32 core PLE machine with a 32 vcpu guest. base = 3.7.0-rc6, machine: 32 core mx3850 x5 PLE machine

ebizzy (rec/sec, higher is better)
     base        stdev      patched     stdev      %improve
1x   2511.3000    21.5409   6051.8000   170.2592   140.98276
2x   2679.4000   332.4482   2692.3000   251.4005     0.48145
3x   2253.5000   266.4243   2192.1667   178.9753    -2.72169
4x   1784.3750   102.2699   2018.7500   187.5723    13.13485

dbench (throughput in MB/sec, higher is better)
     base        stdev      patched     stdev      %improve
1x   6677.4080   638.5048  10098.0060  3449.7026    51.22643
2x   2012.6760    64.7642   2019.0440    62.6702     0.31639
3x   1302.0783    40.8336   1292.7517    27.0515    -0.71629
4x   3043.1725  3243.7281   4664.4662  5946.5741    53.27643

Here is the reference no-PLE result: ebizzy-1x_nople 7592.6000 rec/sec, dbench_1x_nople 7853.6960 MB/sec. The result says we can still improve by 60% for ebizzy, but overall we are getting impressive performance with the patches. Changes since V2: - Dropped global measures usage patch (Peter Zijlstra) - Do not bail out on first failure (Avi Kivity) - Try thrice on failure of yield_to to get statistically more correct behaviour.
Changes since V1: - Discard the idea of exporting nr_running and optimize in the core scheduler (Peter) - Use yield() instead of schedule in overcommit scenarios (Rik) - Use loadavg knowledge to detect undercommit/overcommit Peter Zijlstra (1): Bail out of yield_to when source and target runqueue has one task Raghavendra K T (1): Handle yield_to failure return for potential undercommit case Please let me know your comments and suggestions. Link for V2: https://lkml.org/lkml/2012/10/29/287 Link for V1: https://lkml.org/lkml/2012/9/21/168 kernel/sched/core.c | 25 +++-- virt/kvm/kvm_main.c | 26 -- 2 files changed, 35 insertions(+), 16 deletions(-) . Tested-by: Chegu Vinod chegu_vi...@hp.com
RE: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
-----Original Message----- From: Orit Wasserman [mailto:owass...@redhat.com] Sent: Wednesday, November 28, 2012 8:30 PM To: Marcelo Tosatti Cc: Xu, Dongxiao; kvm@vger.kernel.org; g...@redhat.com Subject: Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM On 11/28/2012 02:29 AM, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area; use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change the VMCS ID? 3. Can't it be kept somewhere else other than a list? The current scheme allows the guest to allocate unlimited amounts of host memory. I agree with Marcelo; you have to limit the number of VMCSs in the list, otherwise it will be easy to attack a host with nested :) Yes, that is a good point. I will add a limit on the number of VMCSs for the guest VMM. Thanks, Dongxiao 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine.
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On 11/28/2012 5:09 PM, Chegu Vinod wrote: On 11/27/2012 6:23 AM, Chegu Vinod wrote: On 11/27/2012 2:30 AM, Raghavendra K T wrote: On 11/26/2012 07:05 PM, Andrew Jones wrote: On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote: From: Peter Zijlstra pet...@infradead.org In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding.
+ */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4310,13 +4322,13 @@ again: } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4329,11 +4341,12 @@ again: resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded > 0) schedule(); return yielded; Acked-by: Andrew Jones drjo...@redhat.com Thank you Drew. Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches.. Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler. Thanks Raghu. Will try to get this latest patch set evaluated and get back to you. Vinod Resending as prev. email to the kvm and lkml email aliases bounced twice... Apologies for any repeats! Hi Raghu, Here is some preliminary data with your latest set of PLE patches (also with Andrew's throttled yield_to() change). Ran a single guest on an 80 core Westmere platform. [Note: Host and Guest had the latest kernel from kvm.git and were also using the latest qemu from qemu.git as of yesterday morning]. The guest was running an AIM7 high_systime workload. (Note: high_systime is a kernel-intensive micro-benchmark, but in this case it was run just as a workload in the guest to trigger spinlock etc. contention in the guest OS and hence PLE; i.e. this is not a real benchmark run). I have run this workload with a constant # (i.e. 2000) of users with 100 jobs per user. The numbers below represent the # of jobs per minute (JPM) - the higher the better.
                                              40VCPU  60VCPU  80VCPU
a) 3.7.0-rc6+ w/ ple_gap=0                    ~102K   ~88K    ~81K
b) 3.7.0-rc6+                                 ~53K    ~25K    ~18-20K
c) 3.7.0-rc6+ w/ PLE patches                  ~100K   ~81K    ~48K-69K (lot of variation from run to run)
d) 3.7.0-rc6+ w/ throttled yield_to() change  ~101K   ~87K    ~78K

The PLE patches case (c) does show improvements in this non-overcommit large guest case when compared to case (b). However, at 80-way I started to observe quite a bit of variation from run to run, and the JPM was lower when compared with the throttled yield_to() change, case (d). For this 80-way configuration in case (c) I also noticed that the average time spent in the PLE exit (as reported by small samplings from perf kvm stat) varied quite a bit and was at times much greater than in the throttled yield_to() change case (d). More details are
Re: [PATCH] vfio powerpc: implemented IOMMU driver for VFIO
On 29/11/12 08:01, Alex Williamson wrote: On Wed, 2012-11-28 at 18:21 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch implements an IOMMU driver for VFIO which does mapping/unmapping of pages for the guest IO and provides information about the DMA window (required by a POWERPC guest). The counterpart in QEMU is required to support this functionality. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/Kconfig | 6 + drivers/vfio/Makefile | 1 + drivers/vfio/vfio_iommu_spapr_tce.c | 332 +++ include/linux/vfio.h | 33 4 files changed, 372 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 7cd5dec..b464687 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1 depends on VFIO default n +config VFIO_IOMMU_SPAPR_TCE + tristate + depends on VFIO && SPAPR_TCE_IOMMU + default n + menuconfig VFIO tristate "VFIO Non-Privileged userspace driver framework" depends on IOMMU_API select VFIO_IOMMU_TYPE1 if X86 + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV help VFIO provides a framework for secure userspace device drivers. See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index 2398d4a..72bfabc 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_VFIO) += vfio.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_PCI) += pci/ diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c new file mode 100644 index 000..b98770e --- /dev/null +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -0,0 +1,332 @@ +/* + * VFIO: IOMMU DMA mapping support for TCE on POWER + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Author: Alexey Kardashevskiy a...@ozlabs.ru + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * Derived from original vfio_iommu_type1.c: + * Copyright (C) 2012 Red Hat, Inc. All rights reserved. 
+ * Author: Alex Williamson alex.william...@redhat.com + */ + +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/slab.h> +#include <linux/uaccess.h> +#include <linux/err.h> +#include <linux/vfio.h> +#include <asm/iommu.h> + +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "a...@ozlabs.ru" +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE" + +static void tce_iommu_detach_group(void *iommu_data, + struct iommu_group *iommu_group); + +/* + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation + */ + +/* + * This code handles mapping and unmapping of user data buffers + * into DMA'ble space using the IOMMU + */ + +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT) + +struct vwork { + struct mm_struct *mm; + long npage; + struct work_struct work; +}; + +/* delayed decrement/increment for locked_vm */ +static void lock_acct_bg(struct work_struct *work) +{ + struct vwork *vwork = container_of(work, struct vwork, work); + struct mm_struct *mm; + + mm = vwork->mm; + down_write(&mm->mmap_sem); + mm->locked_vm += vwork->npage; + up_write(&mm->mmap_sem); + mmput(mm); + kfree(vwork); +} + +static void lock_acct(long npage) +{ + struct vwork *vwork; + struct mm_struct *mm; + + if (!current->mm) + return; /* process exited */ + + if (down_write_trylock(&current->mm->mmap_sem)) { + current->mm->locked_vm += npage; + up_write(&current->mm->mmap_sem); + return; + } + + /* + * Couldn't get mmap_sem lock, so must setup to update + * mm->locked_vm later. If locked_vm were atomic, we + * wouldn't need this silliness + */ + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL); + if (!vwork) + return; + mm = get_task_mm(current); + if (!mm) { + kfree(vwork); + return; + } + INIT_WORK(&vwork->work, lock_acct_bg); + vwork->mm = mm; + vwork->npage = npage; + schedule_work(&vwork->work); +} This looks familiar, should we split it out to a common file instead of duplicating it? It is a simple cut-n-paste from the type1 driver :) Moving it to a separate file is up to you but it is
Re: [RFC PATCH 0/4] MSI affinity for assigned devices
On 2012-11-28 00:21, Alex Williamson wrote: On Wed, 2012-11-28 at 00:08 +0100, Jan Kiszka wrote: On 2012-11-27 23:00, Alex Williamson wrote: This is post-1.3 material, so I'll just post it as an RFC for now. MSI routing updates aren't currently handled by pci-assign or vfio-pci (when using KVM acceleration), which means that trying to set interrupt SMP affinity in the guest has no effect unless MSI is completely disabled and re-enabled. This series fixes this for both device assignment backends using similar schemes. We store the last MSIMessage programmed to KVM and do updates to the MSI route when it changes. pci-assign takes a little bit of refactoring to make this happen cleanly. Thanks, This should rather be done by implementing vector notifiers for MSI as well. That way the device model no longer has to track reasons for vector changes in an open-coded fashion, just like we already do for MSI-X. Was on my todo list for a long time, but I never reached this item. MSI masking is optional and not many devices seem to support it. What I see with a Linux guest is that it just overwrites the address/data while MSI is enabled. What were you thinking for notifiers? mask, unmask, update? I'm not sure I'm interested enough in this to add MSI vector notifiers. Thanks, We didn't merge mask notifiers but a more generic concept. Our notifiers are supposed to fire when the configuration changes effectively. So, for MSI, a use event also has to trigger on change, i.e. when the configuration is overwritten while the vector is already active. Alternatively, you could also add such an update notifier, using the MSIVectorUseNotifier signature if it makes the user side simpler. Jan
Re: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian
Jimmy Crossley jcrossley at CoNetrix.com writes: -----Original Message----- From: kvm-owner at vger.kernel.org [mailto:kvm-owner at vger.kernel.org] On Behalf Of Jimmy Crossley Sent: Friday, June 08, 2012 17:29 Subject: Garbled audio - Windows 7 64 bit guest on Debian I am experiencing garbled sound on a Windows 7 64 bit guest running under 64 bit Debian. I have searched many discussion groups, etc. on the net and could find nothing useful, so I thought I would post this here, hoping someone with a deeper understanding could help out. ** snip If I connect to the machine using remote desktop (mstsc.exe, rdesktop, xfreerdp), the sound gets redirected to the local machine and sounds perfect. The sound is only garbled when using SDL. The same audio problems exist if I start up the machine and connect to it with vnc. I have mostly solved this issue. The sound works much, much better, but is still not as good as on my host machine. I installed PulseAudio and used it instead of ALSA. In order to get kvm to use it, I set the environment variable QEMU_AUDIO_DRV=pa. I had been using sudo to start the VM, and that kept this environment variable from being used. I did a sudo setcap cap_net_admin+ep /usr/bin/kvm in order to be able to run kvm under a normal user account. Now the sound works quite well. I've pretty much duplicated this setup (Gentoo host, qemu-kvm 1.2.1, PulseAudio, SDL, emulated HDA audio, Windows 7 64 bit SP1 guest) for the same reasons and with the same result. However, as jcrossley mentioned, it's not as crisp as the host; there's still a lot of crackling and popping. Based on jcrossley's investigation using RDP instead (and seeing no ill effects with respect to sound), wouldn't the issue point at the HDA emulation? How would one go about getting involved in terms of helping out with the HDA emulator?