Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer
On Sat, Sep 14, 2013 at 3:44 PM, Jan Kiszka jan.kis...@web.de wrote:
> On 2013-09-13 19:15, Paolo Bonzini wrote:
>> Il 06/09/2013 04:04, Arthur Chunqi Li ha scritto:
>>> +	preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
>>> +	if (preempt_val_l2 <= preempt_val_l1)
>>> +		preempt_val_l2 = 0;
>>> +	else
>>> +		preempt_val_l2 -= preempt_val_l1;
>>> +	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
>>
>> Did you test that a value of 0 triggers an immediate exit, rather than
>> counting down by 2^32? Perhaps it's safer to limit the value to 1
>> instead of 0.
>
> In my experience, 0 triggers immediate exits when the preemption timer
> is enabled.
>
> Jan

Yes, L2 VM will exit immediately when the value is 0 with my patch.

Arthur
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: VMX: set blocked by NMI flag if EPT violation happens during IRET from NMI
Set the "blocked by NMI" flag if an EPT violation happens during IRET
from NMI, otherwise the NMI handler can be called recursively, causing
stack corruption.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1f1da43..fa1984c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5339,6 +5339,15 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
+	/*
+	 * EPT violation happened while executing iret from NMI,
+	 * "blocked by NMI" bit has to be set before next VM entry.
+	 * There are errata that may cause this bit to not be set:
+	 * AAK134, BY25.
+	 */
+	if (exit_qualification & INTR_INFO_UNBLOCK_NMI)
+		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
+
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
--
			Gleb.
[PATCH kvm-unit-tests] Test fault during IRET from NMI.
This test checks that the NMI window opens only after IRET from NMI is
executed without a fault.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index e46d8d0..de1dc47 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -62,6 +62,13 @@ static inline u16 read_gs(void)
     return val;
 }
 
+static inline unsigned long read_rflags(void)
+{
+	unsigned long f;
+	asm ("pushf; pop %0\n\t" : "=rm"(f));
+	return f;
+}
+
 static inline void write_ds(unsigned val)
 {
     asm ("mov %0, %%ds" : : "rm"(val) : "memory");
diff --git a/x86/eventinj.c b/x86/eventinj.c
index 7bed7c5..c171e30 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -125,6 +125,42 @@ static void nmi_isr(struct ex_regs *r)
 	printf("After nested NMI to itself\n");
 }
 
+unsigned long after_iret_addr;
+
+static void nested_nmi_iret_isr(struct ex_regs *r)
+{
+	printf("Nested NMI isr running rip=%x\n", r->rip);
+
+	if (r->rip == after_iret_addr)
+		test_count++;
+}
+
+static void nmi_iret_isr(struct ex_regs *r)
+{
+	unsigned long *s = alloc_page();
+	test_count++;
+	printf("NMI isr running %p stack %p\n", after_iret, s);
+	handle_exception(2, nested_nmi_iret_isr);
+	printf("Try send nested NMI to itself\n");
+	apic_self_nmi();
+	printf("After nested NMI to itself\n");
+	s[4] = read_ss();
+	s[3] = 0; /* rsp */
+	s[2] = read_rflags();
+	s[1] = read_cs();
+	s[0] = after_iret_addr = (unsigned long)after_iret;
+	asm ("mov %%rsp, %0\n\t"
+	     "mov %1, %%rsp\n\t"
+	     "outl %2, $0xe4\n\t" /* flush stack page */
+#ifdef __x86_64__
+	     "iretq\n\t"
+#else
+	     "iretl\n\t"
+#endif
+	     : "=m"(s[3]) : "rm"(s[0]), "a"((unsigned int)virt_to_phys(s)) : "memory");
+after_iret:
+	printf("After iret\n");
+}
+
 static void tirq0(isr_regs_t *r)
 {
 	printf("irq0 running\n");
@@ -300,6 +336,20 @@ int main()
 	irq_disable();
 	report("NMI", test_count == 2);
 
+	/* generate NMI that will fault on IRET */
+	printf("Before NMI IRET test\n");
+	test_count = 0;
+	handle_exception(2, nmi_iret_isr);
+	printf("Try send NMI to itself\n");
+	apic_self_nmi();
+	/* this is needed on VMX without NMI window notificatoin.
+	   Interrupt windows is used instead, so let pending NMI
+	   to be injected */
+	irq_enable();
+	asm volatile ("nop");
+	irq_disable();
+	printf("After NMI to itself\n");
+	report("NMI", test_count == 2);
 #ifndef __x86_64__
 	stack_phys = (ulong)virt_to_phys(alloc_page());
 	stack_va = alloc_vpage();
--
			Gleb.
Re: [PATCH] [RFC] x86: kvm: remove KVM_SOFT_MAX_VCPUS
On Sat, Sep 14, 2013 at 02:16:51PM +0200, Andrew Jones wrote:
> This patch removes KVM_SOFT_MAX_VCPUS and uses num_online_cpus() for
> KVM_CAP_NR_VCPUS instead, as ARM does. While the API doc simply says
> KVM_CAP_NR_VCPUS should return the recommended maximum number of vcpus,
> it has been returning KVM_SOFT_MAX_VCPUS, which was defined as the
> maximum tested number of vcpus. As that concept could be
> distro-specific, this patch uses the other recommended maximum, the
> number of physical cpus, as we never recommend configuring a guest that
> has more vcpus than the host has pcpus. Of course a guest can always
> still be configured with up to KVM_CAP_MAX_VCPUS though anyway.
>
> I've put RFC on this patch because I'm not sure if there are any
> gotchas lurking with this change. The change now means hosts no longer
> return the same number for KVM_CAP_NR_VCPUS, and that number is likely
> going to generally be quite a bit less than what KVM_SOFT_MAX_VCPUS
> was (160). I can't think of anything other than generating more
> warnings[1] from qemu with guests that configure more vcpus than pcpus
> though.
>
Another gotcha is that on a host with more than 160 cpus the recommended
value will grow, which is not a good idea without appropriate testing.

> [1] Actually, until 972fc544b6034a in uq/master is merged there won't
> be any warnings either.
>
> Signed-off-by: Andrew Jones drjo...@redhat.com
> ---
>  arch/x86/include/asm/kvm_host.h | 1 -
>  arch/x86/kvm/x86.c              | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74a98f2e..9236c63315a9b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -32,7 +32,6 @@
>  #include <asm/asm.h>
>
>  #define KVM_MAX_VCPUS 255
> -#define KVM_SOFT_MAX_VCPUS 160
>  #define KVM_USER_MEM_SLOTS 125
>  /* memory slots that are not exposed to userspace */
>  #define KVM_PRIVATE_MEM_SLOTS 3
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e5ca72a5cdb6d..d9d3e2ed68ee9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2604,7 +2604,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = !kvm_x86_ops->cpu_has_accelerated_tpr();
>  		break;
>  	case KVM_CAP_NR_VCPUS:
> -		r = KVM_SOFT_MAX_VCPUS;
> +		r = min(num_online_cpus(), KVM_MAX_VCPUS);
s/KVM_MAX_VCPUS/KVM_SOFT_MAX_VCPUS/. Also what about hotplug cpus?

>  		break;
>  	case KVM_CAP_MAX_VCPUS:
>  		r = KVM_MAX_VCPUS;
> --
> 1.8.1.4

--
			Gleb.
Re: [PATCH] x86: kvm: introduce CONFIG_KVM_MAX_VCPUS
On Sat, Sep 14, 2013 at 02:18:49PM +0200, Andrew Jones wrote:
> Take CONFIG_KVM_MAX_VCPUS from arm32, but set the default to 255.
>
> Signed-off-by: Andrew Jones drjo...@redhat.com
> ---
>  arch/x86/include/asm/kvm_host.h |  5 +++--
>  arch/x86/kvm/Kconfig            | 10 ++++++++++
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74a98f2e..e7e9b523a8f7e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -31,8 +31,9 @@
>  #include <asm/msr-index.h>
>  #include <asm/asm.h>
>
> -#define KVM_MAX_VCPUS 255
> -#define KVM_SOFT_MAX_VCPUS 160
> +#define KVM_MAX_VCPUS CONFIG_KVM_MAX_VCPUS
> +#define KVM_SOFT_MAX_VCPUS min(160, KVM_MAX_VCPUS)
> +
>  #define KVM_USER_MEM_SLOTS 125
>  /* memory slots that are not exposed to userspace */
>  #define KVM_PRIVATE_MEM_SLOTS 3
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a47a3e54b964b..e9532c33527ee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -52,6 +52,16 @@ config KVM
>
>  	  If unsure, say N.
>
> +config KVM_MAX_VCPUS
> +	int "Number maximum supported virtual CPUs per VM"
> +	depends on KVM
> +	default 255
> +	help
> +	  Static number of max supported virtual CPUs per VM.
> +
> +	  Set to a lower number to save some resources. Set to a higher
> +	  number to test scalability.
> +
The maximum this can save is around 2K per VM, which is pretty
insignificant considering the overall memory footprint even the
smallest VM has.

>  config KVM_INTEL
>  	tristate "KVM for Intel processors support"
>  	depends on KVM
> --
> 1.8.1.4

--
			Gleb.
Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest
Alexander Graf ag...@suse.de writes:

> Am 14.09.2013 um 13:33 schrieb Aneesh Kumar K.V
> aneesh.ku...@linux.vnet.ibm.com:
>
>> Benjamin Herrenschmidt b...@kernel.crashing.org writes:
>>
>>> On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:
>>>> Aneesh and I are currently investigating an alternative approach,
>>>> which is much more like the x86 way of doing things. We are looking
>>>> at splitting the code into three modules: a kvm_pr.ko module with
>>>> the PR-specific bits, a kvm_hv.ko module with the HV-specific bits,
>>>> and a core kvm.ko module with the generic bits (basically book3s.c,
>>>> powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV
>>>> use).
>>>>
>>>> Basically the core module would have a pointer to a struct full of
>>>> function pointers for the various ops that book3s_pr.c and
>>>> book3s_hv.c both provide. You would only be able to have one of
>>>> kvm_pr and kvm_hv loaded at any one time. If they were built in, you
>>>> could have them both built in but only one could register its
>>>> function pointer struct with the core. Obviously the kvm_hv module
>>>> would only load and register its struct on a machine that had
>>>> hypervisor mode available. If they were both built in I would think
>>>> we would give HV the first chance to register itself, and let PR
>>>> register if we can't do HV. How does that sound?
>>>
>>> As long as we can force-load the PR one on a machine that normally
>>> runs HV for the sake of testing ...
>>
>> This is what I currently have:
>>
>> [root@llmp24l02 kvm]# insmod ./kvm-hv.ko
>> [root@llmp24l02 kvm]# insmod ./kvm-pr.ko
>> insmod: ERROR: could not insert module ./kvm-pr.ko: File exists
>
> The reason this model makes sense for x86 is that you never have SVM
> and VMX in the cpu at the same time. Either it is an AMD chip or an
> Intel chip. PR and HV however are not mutually exclusive in hardware.
>
> What you really want is
>
>  1) distro can force HV/PR
>  2) admin can force HV/PR
>  3) user can force HV/PR
>  4) by default things just work
>
> 1 can be done through kernel config options.
> 2 can be done through modules that get loaded or not.
> 3 can be done through a vm ioctl.
> 4 only works if you allow hv and pr to be available at the same time.
>
> I can assume who you talked to about this to make these design
> decisions, but it definitely was not me.

I haven't had much discussion around the design with anybody yet. What
you saw above was me changing/moving code around madly to get something
working in a day. I was hoping to get something that I could post as an
RFC early and let others comment. Good to get the feedback early.

-aneesh
Re: [PATCH kvm-unit-tests] Test fault during IRET from NMI.
Il 15/09/2013 10:17, Gleb Natapov ha scritto:
> This test checks that the NMI window opens only after IRET from NMI is
> executed without a fault.
>
> Signed-off-by: Gleb Natapov g...@redhat.com

Reviewed-by: Paolo Bonzini pbonz...@redhat.com

... with a couple of English nits (see inline).

> diff --git a/lib/x86/processor.h b/lib/x86/processor.h
> index e46d8d0..de1dc47 100644

[...]

> +static void nmi_iret_isr(struct ex_regs *r)
> +{
> +	unsigned long *s = alloc_page();
> +	test_count++;
> +	printf("NMI isr running %p stack %p\n", after_iret, s);
> +	handle_exception(2, nested_nmi_iret_isr);
> +	printf("Try send nested NMI to itself\n");

s/Try send/Sending/

> +	apic_self_nmi();
> +	printf("After nested NMI to itself\n");

[...]

> +	/* generate NMI that will fault on IRET */
> +	printf("Before NMI IRET test\n");
> +	test_count = 0;
> +	handle_exception(2, nmi_iret_isr);
> +	printf("Try send NMI to itself\n");

s/Try send/Sending/

> +	apic_self_nmi();
> +	/* this is needed on VMX without NMI window notificatoin.

s/notificatoin/notification/

> +	   Interrupt windows is used instead, so let pending NMI
> +	   to be injected */
> +	irq_enable();

[...]

> --
> 			Gleb.
Re: [PATCH kvm-unit-tests] Test fault during IRET from NMI.
On Sun, Sep 15, 2013 at 11:54:15AM +0200, Paolo Bonzini wrote:
> Il 15/09/2013 10:17, Gleb Natapov ha scritto:
>> This test checks that the NMI window opens only after IRET from NMI
>> is executed without a fault.
>>
>> Signed-off-by: Gleb Natapov g...@redhat.com
>
> Reviewed-by: Paolo Bonzini pbonz...@redhat.com
>
> ... with a couple of English nits (see inline).

Yeah, those are copy/paste from other places. The file is full of it :)

>> [...]
>> +	printf("Try send nested NMI to itself\n");
>
> s/Try send/Sending/
>
>> [...]
>> +	printf("Try send NMI to itself\n");
>
> s/Try send/Sending/
>
>> +	apic_self_nmi();
>> +	/* this is needed on VMX without NMI window notificatoin.
>
> s/notificatoin/notification/
>
>> [...]

--
			Gleb.
Re: [PATCH 2/3] arm32: kvm: rename CONFIG_KVM_ARM_MAX_VCPUS
On Sat, Sep 14, 2013 at 02:10:55PM +0200, Andrew Jones wrote:
> Drop the _ARM_ part of the name. We can then introduce a config option
> like this to aarch64 and other arches using the same name - allowing
> grep to show them all. Also update the help text to describe the
> option more completely.
>
> Signed-off-by: Andrew Jones drjo...@redhat.com
> ---
>  arch/arm/include/asm/kvm_host.h | 4 ++--
>  arch/arm/kvm/Kconfig            | 8 ++++----
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 7d22517d80711..c614d3eb176c6 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -25,8 +25,8 @@
>  #include <asm/fpstate.h>
>  #include <kvm/arm_arch_timer.h>
>
> -#if defined(CONFIG_KVM_ARM_MAX_VCPUS)
> -#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS
> +#if defined(CONFIG_KVM_MAX_VCPUS)
> +#define KVM_MAX_VCPUS CONFIG_KVM_MAX_VCPUS
>  #else
>  #define KVM_MAX_VCPUS 0
>  #endif
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index ebf5015508b52..de63bfccb3eb5 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -40,16 +40,16 @@ config KVM_ARM_HOST
>  	---help---
>  	  Provides host support for ARM processors.
>
> -config KVM_ARM_MAX_VCPUS
> +config KVM_MAX_VCPUS
>  	int "Number maximum supported virtual CPUs per VM"
>  	depends on KVM_ARM_HOST
>  	default 4
>  	help
>  	  Static number of max supported virtual CPUs per VM.
>
> -	  If you choose a high number, the vcpu structures will be quite
> -	  large, so only choose a reasonable number that you expect to
> -	  actually use.
I do not see why on ARM the vcpu structure size depends on
KVM_ARM_MAX_VCPUS. Can somebody point me to it?

> +	  The default is set to the highest number of vcpus that
> +	  current hardware supports. Set to a lower number to save
> +	  some resources. Set to a higher number to test scalability.
>
>  config KVM_ARM_VGIC
>  	bool "KVM support for Virtual GIC"
> --
> 1.8.1.4

--
			Gleb.
Re: [PATCH v2 00/15] KVM: MMU: locklessly write-protect
Marcelo, do you feel your comments are addressed in patches 3 and 5 of
this series?

On Thu, Sep 05, 2013 at 06:29:03PM +0800, Xiao Guangrong wrote:
> Changelog v2:
>
> - the changes from Gleb's review:
>   1) fix calculating the number of spte in the pte_list_add()
>   2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
>      rmap_get_next()
>   3) fix hlist corruption due to accessing sp->hlist out of mmu_lock
>   4) use rcu functions to access the rcu protected pointer
>   5) spte will be missed in lockless walker if the spte is moved in a
>      desc (remove a spte from the rmap using only one desc). Fix it by
>      bottom-up walking the desc
>
> - the changes from Paolo's review:
>   1) make the order and memory barriers between update spte / add spte
>      into rmap and dirty-log more clear
>
> - the changes from Marcelo's review:
>   1) let fast page fault only fix the spts on the last level (level = 1)
>   2) improve some changelogs and comments
>
> - the changes from Takuya's review:
>   move the patch "flush tlb if the spte can be locklessly modified"
>   forward to make it more easily merged
>
> Thank all of you very much for your time and patience on this patchset!
>
> Since we use rcu_assign_pointer() to update the pointers in desc even
> if dirty log is disabled, I have measured the performance:
>
> Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 36G memory
>
> - migrate-perf (benchmark the time of get-dirty-log)
>   before: Run 10 times, Avg time: 9009483 ns.
>   after:  Run 10 times, Avg time: 4807343 ns.
>
> - kernbench
>   Guest: 12 VCPUs + 8G memory
>
>   before:
>   EPT is enabled:
>   # cat 09-05-origin-ept | grep real
>   real 85.58
>   real 83.47
>   real 82.95
>
>   EPT is disabled:
>   # cat 09-05-origin-shadow | grep real
>   real 138.77
>   real 138.99
>   real 139.55
>
>   after:
>   EPT is enabled:
>   # cat 09-05-lockless-ept | grep real
>   real 83.40
>   real 82.81
>   real 83.39
>
>   EPT is disabled:
>   # cat 09-05-lockless-shadow | grep real
>   real 138.91
>   real 139.71
>   real 138.94
>
> No performance regression!
>
> Background
> ==========
> Currently, when we mark a memslot dirty-logged or get its dirty pages,
> we need to write-protect large amounts of guest memory. This is heavy
> work, especially since we need to hold mmu_lock, which is also required
> by vcpus to fix their page-table faults and by mmu-notifiers when host
> pages are being changed. In guests with extreme cpu / memory usage,
> this becomes a scalability issue.
>
> This patchset introduces a way to locklessly write-protect guest
> memory.
>
> Idea
> ====
> These are the challenges we meet and the ideas to resolve them.
>
> 1) How to locklessly walk the rmap?
> The first idea we had to prevent a desc being freed while we are
> walking the rmap was to use RCU. But when a vcpu runs in shadow page
> mode or nested mmu mode, it updates the rmap really frequently. So we
> use SLAB_DESTROY_BY_RCU to manage descs instead, which allows the
> object to be reused more quickly. We also store a nulls value in the
> last desc (desc->more) which helps us detect whether the desc has been
> moved to another rmap, in which case we re-walk the rmap. I learned
> this idea from nulls-lists.
>
> Another issue is that when a spte is deleted from a desc, another spte
> in the last desc is moved to this position to replace the deleted one.
> If the deleted one has been accessed and we do not access the
> replacement, the replacement is missed during the lockless walk. To
> fix this case, we do not move the spte backward; instead, we move the
> entry forward: when a spte is deleted, we move the entry in the first
> desc to that position.
>
> 2) How to locklessly access the shadow page table?
> It is easy if the handler is in vcpu context; in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(),
> which disable interrupts to stop shadow pages from being freed. But we
> are in ioctl context and the paths we are optimizing for have a heavy
> workload; disabling interrupts is not good for system performance.
>
> We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page),
> then use call_rcu() to free the shadow page if that indicator is set.
> Setting/clearing the indicator is protected by slots_lock, so it need
> not be atomic and does not hurt performance or scalability.
>
> 3) How to locklessly write-protect guest memory?
> Currently there are two behaviors when we write-protect guest memory:
> one is clearing the Writable bit on the spte, and the other is dropping
> the spte when it points to a large page. The former is easy - we only
> need to atomically clear a bit - but the latter is hard, since we need
> to remove the spte from the rmap. So we unify these two behaviors and
> only make the spte readonly. Making a large spte readonly instead of
> nonpresent is also good for reducing jitter.
>
> And we need to pay more attention to the order of making a spte
> writable, adding the spte into the rmap, and setting the corresponding
> bit in the dirty bitmap, since kvm_vm_ioctl_get_dirty_log()
> write-protects the spte based on the dirty bitmap, we
Re: [PATCH v2 0/2] KVM: s390: add floating irq controller
On Fri, Sep 06, 2013 at 03:30:38PM +0200, Christian Borntraeger wrote:
> On 06/09/13 14:19, Jens Freimann wrote:
>> This series adds a kvm_device that acts as an irq controller for
>> floating interrupts. As a first step it implements functionality to
>> retrieve and inject interrupts for the purpose of migration and for
>> hardening the reset code by allowing user space to explicitly remove
>> all pending floating interrupts.
>>
>> PFAULT patches will also use this device for enabling/disabling
>> pfault, therefore the pfault patch series will be reworked to use
>> this device.
>>
>> * Patch 1/2 adds a new data structure to hold interrupt information.
>>   The current one (struct kvm_s390_interrupt) does not allow to
>>   inject every kind of interrupt, e.g. some data for program
>>   interrupts and machine check interruptions were missing.
>> * Patch 2/2 adds a kvm_device which supports getting/setting
>>   currently pending floating interrupts as well as deleting all
>>   currently pending interrupts
>>
>> Jens Freimann (2):
>>   KVM: s390: add and extend interrupt information data structs
>>   KVM: s390: add floating irq controller
>>
>>  Documentation/virtual/kvm/devices/s390_flic.txt |  36
>>  arch/s390/include/asm/kvm_host.h                |  35
>>  arch/s390/include/uapi/asm/kvm.h                |   5
>>  arch/s390/kvm/interrupt.c                       | 304
>>  arch/s390/kvm/kvm-s390.c                        |   1 +
>>  include/linux/kvm_host.h                        |   1 +
>>  include/uapi/linux/kvm.h                        |  65 +
>>  virt/kvm/kvm_main.c                             |   5 +
>>  8 files changed, 368 insertions(+), 84 deletions(-)
>>  create mode 100644 Documentation/virtual/kvm/devices/s390_flic.txt
>
> Gleb, Paolo,
>
> since the qemu part relies on a kernel header file, it makes sense to
> not only let the kernel part go via the kvm tree, but also the qemu
> part.

I want Alex to Ack the interface, and if he agrees then I am fine with
applying the whole series. Still waiting for Alex's ACK. If nothing
else comes up, feel free to apply the small change request from Peter
yourself or ask Jens for a resend:

--snip--
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -908,7 +908,7 @@ struct kvm_device_attr {
 #define KVM_DEV_TYPE_FSL_MPIC_20	1
 #define KVM_DEV_TYPE_FSL_MPIC_42	2
 #define KVM_DEV_TYPE_XICS		3
-#define KVM_DEV_TYPE_FLIC		4
+#define KVM_DEV_TYPE_FLIC		5
--snip--

--
			Gleb.
Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Tue, Sep 10, 2013 at 09:14:14PM +0800, Arthur Chunqi Li wrote:
> On Mon, Sep 2, 2013 at 4:21 PM, Gleb Natapov g...@redhat.com wrote:
>> On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
>>> Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
>>
>> Not a typo :) That is what Avi asked for during the initial nested
>> VMX review: http://markmail.org/message/hhidqyhbo2mrgxxc
>>
>> But there is at least one transition check that kvm_set_cr0() does
>> that should not be done during vmexit emulation, namely the CS.L bit
>> check, so I tend to agree that kvm_set_cr0() is not appropriate here,
>> at least not as it is. But can we skip the other checks kvm_set_cr0()
>> does? For instance, what prevents us from loading CR0.PG = 1,
>> EFER.LME = 1 and CR4.PAE = 0 during nested vmexit? What _should_
>> prevent it is the vmentry check from 26.2.4:
>>
>>   If the "host address-space size" VM-exit control is 1, the
>>   following must hold:
>>   - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
>
> Hi Jan and Gleb,
>
> Our nested VMX testing framework may not support such testing modes.
> Here we need to catch the failed result (ZF flag) close to vmresume,
> but vmresume/vmlaunch is well encapsulated in our framework. If we
> simply write a vmresume inline function, the VMX will act unexpectedly
> when it doesn't cause a vmresume failure. Do you have any ideas about
> this?

I am not sure what you mean. The framework does capture failed vmentry
flags, but it handles the failure internally in vmx_run(). If you want
the framework to be able to provide a vmentry failure handler, do what
Paolo suggests.

--
			Gleb.
[Bug 60850] BUG: Bad page state in process libvirtd pfn:76000
https://bugzilla.kernel.org/show_bug.cgi?id=60850

Alexander Y. Tiurin alexande...@gmail.com changed:

           What      |Removed          |Added
----------------------------------------------------------------
           Version   |unspecified      |2.5
           Product   |Virtualization   |Drivers
           Component |kvm              |Network

--
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest
Am 15.09.2013 um 04:16 schrieb Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com:

> Alexander Graf ag...@suse.de writes:
>
>> Am 14.09.2013 um 13:33 schrieb Aneesh Kumar K.V
>> aneesh.ku...@linux.vnet.ibm.com:
>>
>>> Benjamin Herrenschmidt b...@kernel.crashing.org writes:
>>>
>>>> On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:
>>>>> Aneesh and I are currently investigating an alternative approach,
>>>>> which is much more like the x86 way of doing things. We are
>>>>> looking at splitting the code into three modules: a kvm_pr.ko
>>>>> module with the PR-specific bits, a kvm_hv.ko module with the
>>>>> HV-specific bits, and a core kvm.ko module with the generic bits
>>>>> (basically book3s.c, powerpc.c, stuff from virt/kvm/, plus the
>>>>> stuff that both PR and HV use).
>>>>>
>>>>> Basically the core module would have a pointer to a struct full
>>>>> of function pointers for the various ops that book3s_pr.c and
>>>>> book3s_hv.c both provide. You would only be able to have one of
>>>>> kvm_pr and kvm_hv loaded at any one time. If they were built in,
>>>>> you could have them both built in but only one could register its
>>>>> function pointer struct with the core. Obviously the kvm_hv
>>>>> module would only load and register its struct on a machine that
>>>>> had hypervisor mode available. If they were both built in I would
>>>>> think we would give HV the first chance to register itself, and
>>>>> let PR register if we can't do HV. How does that sound?
>>>>
>>>> As long as we can force-load the PR one on a machine that normally
>>>> runs HV for the sake of testing ...
>>>
>>> This is what I currently have:
>>>
>>> [root@llmp24l02 kvm]# insmod ./kvm-hv.ko
>>> [root@llmp24l02 kvm]# insmod ./kvm-pr.ko
>>> insmod: ERROR: could not insert module ./kvm-pr.ko: File exists
>>
>> The reason this model makes sense for x86 is that you never have SVM
>> and VMX in the cpu at the same time. Either it is an AMD chip or an
>> Intel chip. PR and HV however are not mutually exclusive in hardware.
>>
>> What you really want is
>>
>>  1) distro can force HV/PR
>>  2) admin can force HV/PR
>>  3) user can force HV/PR
>>  4) by default things just work
>>
>> 1 can be done through kernel config options.
>> 2 can be done through modules that get loaded or not.
>> 3 can be done through a vm ioctl.
>> 4 only works if you allow hv and pr to be available at the same time.
>>
>> I can assume who you talked to about this to make these design
>> decisions, but it definitely was not me.
>
> I haven't had much discussion around the design with anybody yet. What
> you saw above was me changing/moving code around madly to get
> something working in a day. I was hoping to get something that I could
> post as an RFC early and let others comment. Good to get the feedback
> early.
>
> -aneesh

Heh, ok :). I think we want to be flexible here unless complexity grows
into too much of a maintenance burden and/or slows things down.

Alex
Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer
On Fri, Sep 06, 2013 at 10:04:51AM +0800, Arthur Chunqi Li wrote: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If a vmexit L2->L0 occurs for a reason not emulated by L1, the preemption timer value should be saved on such exits. 2. Add support of the Save VMX-preemption timer value VM-Exit control to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- ChangeLog to v3: Move nested_adjust_preemption_timer to the latest place, just before vmenter. Some minor changes. arch/x86/include/uapi/asm/msr-index.h | 1 + arch/x86/kvm/vmx.c | 49 +++-- 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..f364d16 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -374,6 +374,8 @@ struct nested_vmx { */ struct page *apic_access_page; u64 msr_ia32_feature_control; + /* Set if vmexit is L2->L1 */ + bool nested_vmx_exit; I do not see why it is needed; see below.
}; #define POSTED_INTR_ON 0 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) || + !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { + nested_vmx_exit_ctls_high &= + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + nested_vmx_pinbased_ctls_high &= + (~PIN_BASED_VMX_PREEMPTION_TIMER); + } nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) +{ + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, + native_read_tsc()) - vcpu->arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; + if (preempt_val_l2 <= preempt_val_l1) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); +} + /* * The guest has exited. See if we can fix it or if we need userspace * assistance.
@@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) vmx->nested.nested_run_pending = 0; if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) { + vmx->nested.nested_vmx_exit = true; nested_vmx_vmexit(vcpu); return 1; } + vmx->nested.nested_vmx_exit = false; if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) { vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY; @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) debugctlmsr = get_debugctlmsr(); vmx->__launched = vmx->loaded_vmcs->launched; + if (is_guest_mode(vcpu) && !(vmx->nested.nested_vmx_exit)) How can is_guest_mode() and nested_vmx_exit both be true? The only place nested_vmx_exit is set to true is just before the call to nested_vmx_vmexit(). The first thing nested_vmx_vmexit() does is make is_guest_mode() false. To enter guest mode again at least one other vmexit from L1 to L0 is needed, at which point nested_vmx_exit will be reset to false again. If you want to avoid calling nested_adjust_preemption_timer() during vmlaunch/vmresume emulation (and it looks like this is what you are trying to achieve here) you can check nested_run_pending. + nested_adjust_preemption_timer(vcpu); asm( /* Store host registers */ "push %%" _ASM_DX "; push %%
Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency
On 09/14/2013 02:25 AM, Alex Williamson wrote: On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote: On 09/13/2013 07:23 AM, Alex Williamson wrote: So far we've succeeded at making KVM and VFIO mostly unaware of each other, but there's an important point where that breaks down. Intel VT-d hardware may or may not support snoop control. When snoop control is available, intel-iommu promotes No-Snoop transactions on PCIe to be cache coherent. That allows KVM to handle things like the x86 WBINVD opcode as a nop. When the hardware does not support this, KVM must implement a hardware-visible WBINVD for the guest. We could simply let userspace tell KVM how to handle WBINVD, but it's privileged for a reason. Allowing an arbitrary user to enable physical WBINVD gives them more access to the hardware. Previously, this has only been enabled for guests supporting legacy PCI device assignment. In such cases it's necessary for proper guest execution. We therefore create a new KVM-VFIO virtual device. The user can add and remove VFIO groups to this device via file descriptors. KVM makes use of the VFIO external user interface to validate that the user has access to physical hardware and gets the coherency state of the IOMMU from VFIO. This provides equivalent functionality to legacy KVM assignment, while keeping (nearly) all the bits isolated. The one intrusion is the resulting flag indicating the coherency state. For this RFC it's placed on the x86 kvm_arch struct, however I know POWER has interest in using the VFIO external user interface, and I'm hoping we can share a common KVM-VFIO device. Perhaps they care about No-Snoop handling as well, or the code can be #ifdef'd. POWER (at least book3s - server; not sure about others) does not support this cache-non-coherent stuff at all.
Then it's easy for your IOMMU API interface to always return cache coherent or never cache coherent or whatever ;) Regarding reusing this device with an external API for POWER - I posted a patch which introduces a KVM device to link KVM with the IOMMU, but besides the list of groups registered in KVM, it also provides a way to find a group by LIOBN (logical bus number), which is used in the DMA map/unmap hypercalls. So in my case the kvm_vfio_group struct needs a LIOBN, and it would be nice to have window_size there too (for a quick boundary check). I am not sure we want to mix everything here. It is in [PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel handling if you are interested (kvmppc_spapr_tce_iommu_device). Yes, I stole the code to get the vfio symbols from your code. The convergence I was hoping to achieve is that KVM doesn't really want to know about VFIO and vice versa. We can therefore at least limit the intrusion by sharing a common device. Obviously for you it will need some extra interfaces to associate an LIOBN with a group, but we keep both the kernel and userspace cleaner by avoiding duplication where we can. Is this really not extensible to your usage? Well, I do not have a good taste for this kind of stuff so I cannot tell for sure. I can reuse this device and hack it to do whatever I need, but how? There are 2 things I need from the KVM device: 1. associate an IOMMU group with a LIOBN 2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table pointer, which is the IOMMU data of an IOMMU group, so we could take a shortcut here). There are 3 possible interfaces to use: A. set/get attributes B. ioctl C. additional API What I posted a week ago uses A for 1 and C for 2. Now we move this to virt/kvm/vfio.c. A for 1 is more or less ok, but how exactly? Yet another attribute? A platform-specific bus ID? In your patch attr->addr is not really an address (to be accessed with get_user()) but just an fd.
For 2 - there are already A and B interfaces, so we do not want C, right? What kind of a good device has a backdoor? :) But A and B do not have access control to prevent userspace from receiving an IOMMU group pointer (which it would not be able to use anyway, but still). Do we care about this (I do not)? And using B (ioctls) within the kernel - I cannot say I have seen many usages of this. I am pretty sure we will spend some time (not much) arguing about these things, and when we agree on something, some of the KVM maintainers will step in and say that there is a 1001st way of doing this and - goto start. And after all, this still won't be a device, as it does not emulate anything present in real hardware; this is just yet another interface to KVM, and some ways of using it won't be natural for somebody. /me sighs. -- Alexey
Re: [PATCH 20/23] KVM: PPC: Book3S PR: Better handling of host-side read-only pages
On Sat, Sep 14, 2013 at 03:23:53PM -0500, Alexander Graf wrote: On 14.09.2013 at 00:24, Paul Mackerras pau...@samba.org wrote: Bootup (F19 guest, 3 runs): Without the patch: average 20.12 seconds, st. dev. 0.17 seconds With the patch: 20.47 seconds, st. dev. 0.19 seconds Delta: 0.35 seconds, or 1.7%. time for i in $(seq 1000); do /bin/echo $i > /dev/null; done: Without the patch: average 7.27 seconds, st. dev. 0.23 seconds With the patch: average 7.55 seconds, st. dev. 0.39 seconds Delta: 0.28 seconds, or 3.8%. So there appears to be a small effect, of a few percent. So in the normal case it slows us down, but allows KSM to be effective. Do we actually want this change then? I was a bit puzzled why there was a measurable slowdown until I remembered that this patch was intended to go along with the patch "powerpc: Implement __get_user_pages_fast()", which Ben took and which is now upstream in Linus' tree (1f7bf028). So, I applied that patch on top of this Better handling of host-side read-only pages patch, and did the same measurements. The results were: Bootup (F19 guest, 3 runs): average 20.05 seconds, st. dev. 0.53s 1000 /bin/echo (4 runs): average 7.27 seconds, st. dev. 0.032s So with both patches applied there is no slowdown at all, and KSM works properly. I think we want this patch. Paul.
Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer
On Sat, Sep 14, 2013 at 1:15 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/09/2013 04:04, Arthur Chunqi Li wrote: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If a vmexit L2->L0 occurs for a reason not emulated by L1, the preemption timer value should be saved on such exits. 2. Add support of the Save VMX-preemption timer value VM-Exit control to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- ChangeLog to v3: Move nested_adjust_preemption_timer to the latest place, just before vmenter. Some minor changes. arch/x86/include/uapi/asm/msr-index.h | 1 + arch/x86/kvm/vmx.c | 49 +++-- 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..f364d16 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -374,6 +374,8 @@ struct nested_vmx { */ struct page *apic_access_page; u64 msr_ia32_feature_control; + /* Set if vmexit is L2->L1 */ + bool nested_vmx_exit; }; #define POSTED_INTR_ON 0 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) || + !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { Align this under the other !.
Also, I prefer to have one long line for the whole !(... ...) || (and likewise below), but I don't know if Gleb agrees + nested_vmx_exit_ctls_high &= + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); Please remove parentheses around ~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, and likewise elsewhere in the patch. + nested_vmx_pinbased_ctls_high &= + (~PIN_BASED_VMX_PREEMPTION_TIMER); + } nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) +{ + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; Should this exit immediately if the preemption timer pin-based control is disabled? Hi Paolo, How can I get the pin-based control here from struct kvm_vcpu *vcpu? Arthur + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, + native_read_tsc()) - vcpu->arch.last_guest_tsc; Please format this like: delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc()) - vcpu->arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; + if (preempt_val_l2 <= preempt_val_l1) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); Did you test that a value of 0 triggers an immediate exit, rather than counting down by 2^32? Perhaps it's safer to limit the value to 1 instead of 0. +} + /* * The guest has exited. See if we can fix it or if we need userspace * assistance.
@@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) vmx-nested.nested_run_pending = 0; if (is_guest_mode(vcpu) nested_vmx_exit_handled(vcpu)) { + vmx-nested.nested_vmx_exit = true; I think this assignment should be in nested_vmx_vmexit, since it is called from other places as well. nested_vmx_vmexit(vcpu); return 1; } + vmx-nested.nested_vmx_exit = false; if (exit_reason VMX_EXIT_REASONS_FAILED_VMENTRY) { vcpu-run-exit_reason = KVM_EXIT_FAIL_ENTRY; @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer
On Sun, Sep 15, 2013 at 8:31 PM, Gleb Natapov g...@redhat.com wrote: On Fri, Sep 06, 2013 at 10:04:51AM +0800, Arthur Chunqi Li wrote: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If a vmexit L2->L0 occurs for a reason not emulated by L1, the preemption timer value should be saved on such exits. 2. Add support of the Save VMX-preemption timer value VM-Exit control to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- ChangeLog to v3: Move nested_adjust_preemption_timer to the latest place, just before vmenter. Some minor changes. arch/x86/include/uapi/asm/msr-index.h | 1 + arch/x86/kvm/vmx.c | 49 +++-- 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..f364d16 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -374,6 +374,8 @@ struct nested_vmx { */ struct page *apic_access_page; u64 msr_ia32_feature_control; + /* Set if vmexit is L2->L1 */ + bool nested_vmx_exit; I do not see why it is needed; see below.
}; #define POSTED_INTR_ON 0 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) || + !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { + nested_vmx_exit_ctls_high &= + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + nested_vmx_pinbased_ctls_high &= + (~PIN_BASED_VMX_PREEMPTION_TIMER); + } nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) +{ + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, + native_read_tsc()) - vcpu->arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; + if (preempt_val_l2 <= preempt_val_l1) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); +} + /* * The guest has exited. See if we can fix it or if we need userspace * assistance.
@@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) vmx-nested.nested_run_pending = 0; if (is_guest_mode(vcpu) nested_vmx_exit_handled(vcpu)) { + vmx-nested.nested_vmx_exit = true; nested_vmx_vmexit(vcpu); return 1; } + vmx-nested.nested_vmx_exit = false; if (exit_reason VMX_EXIT_REASONS_FAILED_VMENTRY) { vcpu-run-exit_reason = KVM_EXIT_FAIL_ENTRY; @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) debugctlmsr = get_debugctlmsr(); vmx-__launched = vmx-loaded_vmcs-launched; + if (is_guest_mode(vcpu) !(vmx-nested.nested_vmx_exit)) How is_guest_mode() and nested_vmx_exi can be both true? The only place nested_vmx_exit is set to true is just before call to nested_vmx_vmexit(). The firs thing nested_vmx_vmexit() does is makes is_guest_mode() false. To enter guest mode again at least one other vmexit from L1 to L0 is needed at which point nested_vmx_exit will be reset to false again. If you want to avoid calling nested_adjust_preemption_timer() during vmlaunch/vmresume emulation (and it looks like this is what you are trying to achieve here) you can check nested_run_pending. Besides vmlaunch/vmresume emulation, every exit from L2-L1 should not call
Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest
Alexander Graf ag...@suse.de writes: On 14.09.2013 at 13:33, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Benjamin Herrenschmidt b...@kernel.crashing.org writes: On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote: Aneesh and I are currently investigating an alternative approach, which is much more like the x86 way of doing things. We are looking at splitting the code into three modules: a kvm_pr.ko module with the PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a core kvm.ko module with the generic bits (basically book3s.c, powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV use). Basically the core module would have a pointer to a struct full of function pointers for the various ops that book3s_pr.c and book3s_hv.c both provide. You would only be able to have one of kvm_pr and kvm_hv loaded at any one time. If they were built in, you could have them both built in but only one could register its function pointer struct with the core. Obviously the kvm_hv module would only load and register its struct on a machine that had hypervisor mode available. If they were both built in I would think we would give HV the first chance to register itself, and let PR register if we can't do HV. How does that sound? As long as we can force-load the PR one on a machine that normally runs HV for the sake of testing ... This is what I currently have [root@llmp24l02 kvm]# insmod ./kvm-hv.ko [root@llmp24l02 kvm]# insmod ./kvm-pr.ko insmod: ERROR: could not insert module ./kvm-pr.ko: File exists The reason this model makes sense for x86 is that you never have SVM and VMX in the cpu at the same time. Either it is an AMD chip or an Intel chip. PR and HV however are not mutually exclusive in hardware. What you really want is 1) distro can force HV/PR 2) admin can force HV/PR 3) user can force HV/PR 4) by default things just work 1 can be done through kernel config options.
2 can be done through modules that get loaded or not 3 can be done through a vm ioctl 4 only works if you allow hv and pr to be available at the same time I can assume who you talked to about this to make these design decisions, but it definitely was not me. I didn't have much discussion around the design with anybody yet. What you saw above was me changing/moving code around madly to get something working in a day. I was hoping to get something that I could post as an RFC early and let others comment. Good to get the feedback early. -aneesh
Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest
On 15.09.2013 at 04:16, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Alexander Graf ag...@suse.de writes: On 14.09.2013 at 13:33, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Benjamin Herrenschmidt b...@kernel.crashing.org writes: On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote: Aneesh and I are currently investigating an alternative approach, which is much more like the x86 way of doing things. We are looking at splitting the code into three modules: a kvm_pr.ko module with the PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a core kvm.ko module with the generic bits (basically book3s.c, powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV use). Basically the core module would have a pointer to a struct full of function pointers for the various ops that book3s_pr.c and book3s_hv.c both provide. You would only be able to have one of kvm_pr and kvm_hv loaded at any one time. If they were built in, you could have them both built in but only one could register its function pointer struct with the core. Obviously the kvm_hv module would only load and register its struct on a machine that had hypervisor mode available. If they were both built in I would think we would give HV the first chance to register itself, and let PR register if we can't do HV. How does that sound? As long as we can force-load the PR one on a machine that normally runs HV for the sake of testing ... This is what I currently have [root@llmp24l02 kvm]# insmod ./kvm-hv.ko [root@llmp24l02 kvm]# insmod ./kvm-pr.ko insmod: ERROR: could not insert module ./kvm-pr.ko: File exists The reason this model makes sense for x86 is that you never have SVM and VMX in the cpu at the same time. Either it is an AMD chip or an Intel chip. PR and HV however are not mutually exclusive in hardware.
What you really want is 1) distro can force HV/PR 2) admin can force HV/PR 3) user can force HV/PR 4) by default things just work 1 can be done through kernel config options. 2 can be done through modules that get loaded or not 3 can be done through a vm ioctl 4 only works if you allow hv and pr to be available at the same time I can assume who you talked to about this to make these design decisions, but it definitely was not me. I didn't have much discussion around the design with anybody yet. What you saw above was me changing/moving code around madly to get something working in a day. I was hoping to get something that I could post as an RFC early and let others comment. Good to get the feedback early. Heh ok :). I think we want to be flexible here unless complexity grows into too much of a maintenance burden and/or slows things down. Alex -aneesh