Re: [PATCH 0/3] KVM: Make KVM_CHECK_EXTENSION a VM ioctl
On Mon, 14 Jul 2014 19:03:35 +0200 Alexander Graf wrote: > On PowerPC we have a small problem :). We can run both HV and PR style VMs > on the same kvm fd. While this is great, it means that anything that's > different between the two needs to have a token in form of a VM fd to find > out which one we're asking for. > > The one thing where this bites us are CAPs. We ask for them on the kvm fd, > not the vm fd. So we can only take a random guess whether the user is asking > for HV or PR capabilities. > > So far we got away with this reasonably well - most people will only load one > of the two modules and the only thing that *really* breaks is hypercall > exposure > to user space, so a PR guest will not be able to do KVM hypercalls when HV KVM > is loaded on the host, making the magic page unavailable to it. > > But this still isn't a great situation to be in. Instead, we really should > just > make the CHECK_EXTENSION ioctl available at VM level. Then we know for sure > what user space is asking for. I think this makes sense. It adds one situation we didn't have before, though: We now have the checking ioctl acting on a target where we also have the enabling ioctl. Previously, it made sense to advertise capabilities that can be enabled on the vcpus on the kvm fd; but what should happen for a capability that can be enabled on a vm if the vm fd is queried? Should it always be advertised, or only if it has been enabled? (I just noticed that we don't advertise KVM_CAP_S390_IRQCHIP, which is a capability that can be enabled per-vm...) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V6 1/2] perf ignore LBR and extra_rsp
From: Kan Liang x86, perf: Protect LBR and extra_regs against KVM lying With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: Kan Liang --- V2: Move the check code to initialization time. V3: Add flag for each extra register. Check all LBR MSRs at initialization time. V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr failed. Disable all extra msrs in creation places if check_msr failed. V5: Fix check_msr broken Don't check any more MSRs after the first fail Return error when checking fail to stop creating the event Remove the checking code path which never get V6: Move extra_msr_access to struct extra_reg Modify and move check_msr function to perf_event_intel.c Other minor code changes arch/x86/kernel/cpu/perf_event.c | 3 ++ arch/x86/kernel/cpu/perf_event.h | 20 arch/x86/kernel/cpu/perf_event_intel.c| 66 ++- arch/x86/kernel/cpu/perf_event_intel_uncore.h | 4 +- 4 files changed, 81 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 2bdfbff..2879ecd 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event) continue; if (event->attr.config1 & ~er->valid_mask) return -EINVAL; + /* Check if the extra msrs can be safely accessed*/ + if (!er->extra_msr_access) + return -ENXIO; reg->idx = er->idx; reg->config = event->attr.config1; diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h index 3b2f9bd..2e613b2 100644 --- a/arch/x86/kernel/cpu/perf_event.h +++ b/arch/x86/kernel/cpu/perf_event.h @@ -295,22 +295,24 @@ struct extra_reg { u64 config_mask; u64 valid_mask; int idx; /* per_xxx->regs[] reg index */ + boolextra_msr_access; }; -#define EVENT_EXTRA_REG(e, ms, m, vm, i) { \ - .event = (e), \ - .msr = (ms),\ - .config_mask = (m), \ - .valid_mask = (vm), \ - .idx = EXTRA_REG_##i, \ +#define EVENT_EXTRA_REG(e, ms, m, vm, i, a) { \ + .event = (e), \ + .msr = (ms),\ + .config_mask = (m), \ + .valid_mask = (vm), \ + .idx = EXTRA_REG_##i, \ + .extra_msr_access = (a),\ } #define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx) \ - EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx) + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx, true) #define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx) \ EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \ - ARCH_PERFMON_EVENTSEL_UMASK, vm, idx) + ARCH_PERFMON_EVENTSEL_UMASK, vm, idx, true) #define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \ INTEL_UEVENT_EXTRA_REG(c, \ @@ -318,7 +320,7 @@ struct extra_reg { 0x, \ LDLAT) -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0) +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0, false) union perf_capabilities { struct { diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index adb02aa..9c234d1 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2173,6 +2173,40 @@ static void intel_snb_check_microcode(void) } } +/* + * Under certain circumstances, access certain MSR may cause #GP. + * The function tests if the input MSR can be safely accessed. + */ +static bool check_msr(unsigned long msr, u64 mask) +{ + u64 val_old, val_new, val_tmp; + + /* +* Read the current value, change it and read it back to see if it +* matches, this is needed to detect certain hardware emulators +* (qemu/kvm) that don't trap on the MSR access and always return 0s. +*/ + if (
[PATCH V6 2/2] kvm: ignore LBR and extra rsp
From: Kan Liang With -cpu host KVM reports LBR and extra_regs support, so the perf driver may accesses the LBR and extra_regs MSRs. However, there is no LBR and extra_regs virtualization support yet. This could causes guest to crash. As a workaround, KVM just simply ignore the LBR and extra_regs MSRs to lie the guest. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: Andi Kleen Signed-off-by: Kan Liang --- V3: add MSR_LBR_TOS V4: add MSR_LBR_SELECT and MSR_PEBS_LD_LAT_THRESHOLD V5: set_msr should return 0 to lie the guest arch/x86/kvm/pmu.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index cbecaa9..5fd5b44 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -331,6 +331,18 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: ret = pmu->version > 1; break; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + case MSR_LBR_SELECT: + case MSR_PEBS_LD_LAT_THRESHOLD: + case MSR_LBR_TOS: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + return 1; /* to avoid crashes */ default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) @@ -358,6 +370,19 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data) case MSR_CORE_PERF_GLOBAL_OVF_CTRL: *data = pmu->global_ovf_ctrl; return 0; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + case MSR_LBR_SELECT: + case MSR_PEBS_LD_LAT_THRESHOLD: + case MSR_LBR_TOS: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + *data = 0; + return 0; default: if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) || (pmc = get_fixed_pmc(pmu, index))) { @@ -409,6 +434,19 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_OFFCORE_RSP_0: + case MSR_OFFCORE_RSP_1: + case MSR_LBR_SELECT: + case MSR_PEBS_LD_LAT_THRESHOLD: + case MSR_LBR_TOS: + /* At most 8-deep LBR for core and atom */ + case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7: + case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7: + /* 16-deep LBR for core i3/i5/i7 series processors */ + case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15: + case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15: + /* dummy for now */ + return 0; default: if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) || (pmc = get_fixed_pmc(pmu, index))) { -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd
The KVM_CHECK_EXTENSION is only available on the kvm fd today. Unfortunately on PPC some of the capabilities change depending on the way a VM was created. So instead we need a way to expose capabilities as VM ioctl, so that we can see which VM type we're using (HV or PR). To enable this, add the KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio. Signed-off-by: Alexander Graf --- v1 -> v2: - add CAP to expose the vm based CHECK_EXTENSION ioctl availability --- Documentation/virtual/kvm/api.txt | 7 +++-- include/uapi/linux/kvm.h | 1 + virt/kvm/kvm_main.c | 58 +-- 3 files changed, 37 insertions(+), 29 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6955318..235630c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -148,9 +148,9 @@ of banks, as set via the KVM_X86_SETUP_MCE ioctl. 4.4 KVM_CHECK_EXTENSION -Capability: basic +Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl Architectures: all -Type: system ioctl +Type: system ioctl, vm ioctl Parameters: extension identifier (KVM_CAP_*) Returns: 0 if unsupported; 1 (or some other positive integer) if supported @@ -160,6 +160,9 @@ receives an integer that describes the extension availability. Generally 0 means no and 1 means yes, but some extensions may report additional information in the integer return value. +Based on their initialization different VMs may have different capabilities. +It is thus encouraged to use the vm ioctl to query for capabilities (available +with KVM_CAP_CHECK_EXTENSION_VM on the vm fd) 4.5 KVM_GET_VCPU_MMAP_SIZE diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0418b74..51776ca 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -759,6 +759,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_ARM_PSCI_0_2 102 #define KVM_CAP_PPC_FIXUP_HCALL 103 #define KVM_CAP_PPC_ENABLE_HCALL 104 +#define KVM_CAP_CHECK_EXTENSION_VM 105 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e28f3ca..1b95cc9 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2324,6 +2324,34 @@ static int kvm_ioctl_create_device(struct kvm *kvm, return 0; } +static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) +{ + switch (arg) { + case KVM_CAP_USER_MEMORY: + case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: + case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS: +#ifdef CONFIG_KVM_APIC_ARCHITECTURE + case KVM_CAP_SET_BOOT_CPU_ID: +#endif + case KVM_CAP_INTERNAL_ERROR_DATA: +#ifdef CONFIG_HAVE_KVM_MSI + case KVM_CAP_SIGNAL_MSI: +#endif +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING + case KVM_CAP_IRQFD_RESAMPLE: +#endif + case KVM_CAP_CHECK_EXTENSION_VM: + return 1; +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING + case KVM_CAP_IRQ_ROUTING: + return KVM_MAX_IRQ_ROUTES; +#endif + default: + break; + } + return kvm_vm_ioctl_check_extension(kvm, arg); +} + static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2487,6 +2515,9 @@ static long kvm_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_CHECK_EXTENSION: + r = kvm_vm_ioctl_check_extension_generic(kvm, arg); + break; default: r = kvm_arch_vm_ioctl(filp, ioctl, arg); if (r == -ENOTTY) @@ -2571,33 +2602,6 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) return r; } -static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) -{ - switch (arg) { - case KVM_CAP_USER_MEMORY: - case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: - case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS: -#ifdef CONFIG_KVM_APIC_ARCHITECTURE - case KVM_CAP_SET_BOOT_CPU_ID: -#endif - case KVM_CAP_INTERNAL_ERROR_DATA: -#ifdef CONFIG_HAVE_KVM_MSI - case KVM_CAP_SIGNAL_MSI: -#endif -#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING - case KVM_CAP_IRQFD_RESAMPLE: -#endif - return 1; -#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING - case KVM_CAP_IRQ_ROUTING: - return KVM_MAX_IRQ_ROUTES; -#endif - default: - break; - } - return kvm_vm_ioctl_check_extension(kvm, arg); -} - static long kvm_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd
On 14.07.14 19:03, Alexander Graf wrote: The KVM_CHECK_EXTENSION is only available on the kvm fd today. Unfortunately on PPC some of the capabilities change depending on the way a VM was created. So instead we need a way to expose capabilities as VM ioctl, so that we can see which VM type we're using (HV or PR). To enable this, add the KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio. Signed-off-by: Alexander Graf Hrm, I guess this itself should also get a CAP :). Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd
The KVM_CHECK_EXTENSION is only available on the kvm fd today. Unfortunately on PPC some of the capabilities change depending on the way a VM was created. So instead we need a way to expose capabilities as VM ioctl, so that we can see which VM type we're using (HV or PR). To enable this, add the KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio. Signed-off-by: Alexander Graf --- Documentation/virtual/kvm/api.txt | 5 +++- virt/kvm/kvm_main.c | 57 --- 2 files changed, 34 insertions(+), 28 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6955318..f0ac03d 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -150,7 +150,7 @@ of banks, as set via the KVM_X86_SETUP_MCE ioctl. Capability: basic Architectures: all -Type: system ioctl +Type: system ioctl, vm ioctl Parameters: extension identifier (KVM_CAP_*) Returns: 0 if unsupported; 1 (or some other positive integer) if supported @@ -160,6 +160,9 @@ receives an integer that describes the extension availability. Generally 0 means no and 1 means yes, but some extensions may report additional information in the integer return value. +Based on their initialization different VMs may have different capabilities. +It is thus encouraged to use the vm ioctl to query for capabilities (available +as of 3.17). 4.5 KVM_GET_VCPU_MMAP_SIZE diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e28f3ca..6f1c1cc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2324,6 +2324,33 @@ static int kvm_ioctl_create_device(struct kvm *kvm, return 0; } +static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) +{ + switch (arg) { + case KVM_CAP_USER_MEMORY: + case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: + case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS: +#ifdef CONFIG_KVM_APIC_ARCHITECTURE + case KVM_CAP_SET_BOOT_CPU_ID: +#endif + case KVM_CAP_INTERNAL_ERROR_DATA: +#ifdef CONFIG_HAVE_KVM_MSI + case KVM_CAP_SIGNAL_MSI: +#endif +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING + case KVM_CAP_IRQFD_RESAMPLE: +#endif + return 1; +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING + case KVM_CAP_IRQ_ROUTING: + return KVM_MAX_IRQ_ROUTES; +#endif + default: + break; + } + return kvm_vm_ioctl_check_extension(kvm, arg); +} + static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2487,6 +2514,9 @@ static long kvm_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_CHECK_EXTENSION: + r = kvm_vm_ioctl_check_extension_generic(kvm, arg); + break; default: r = kvm_arch_vm_ioctl(filp, ioctl, arg); if (r == -ENOTTY) @@ -2571,33 +2601,6 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) return r; } -static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) -{ - switch (arg) { - case KVM_CAP_USER_MEMORY: - case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: - case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS: -#ifdef CONFIG_KVM_APIC_ARCHITECTURE - case KVM_CAP_SET_BOOT_CPU_ID: -#endif - case KVM_CAP_INTERNAL_ERROR_DATA: -#ifdef CONFIG_HAVE_KVM_MSI - case KVM_CAP_SIGNAL_MSI: -#endif -#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING - case KVM_CAP_IRQFD_RESAMPLE: -#endif - return 1; -#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING - case KVM_CAP_IRQ_ROUTING: - return KVM_MAX_IRQ_ROUTES; -#endif - default: - break; - } - return kvm_vm_ioctl_check_extension(kvm, arg); -} - static long kvm_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] KVM: Rename and add argument to check_extension
In preparation to make the check_extension function available to VM scope we add a struct kvm * argument to the function header and rename the function accordingly. It will still be called from the /dev/kvm fd, but with a NULL argument for struct kvm *. Signed-off-by: Alexander Graf --- arch/arm/kvm/arm.c | 2 +- arch/ia64/kvm/kvm-ia64.c | 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/kvm/powerpc.c | 2 +- arch/s390/kvm/kvm-s390.c | 2 +- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c| 6 +++--- 8 files changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 3c82b37..cb77f999 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -184,7 +184,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) } } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; switch (ext) { diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 6a4309b..0729ba6 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -190,7 +190,7 @@ void kvm_arch_check_processor_compat(void *rtn) *(int *)rtn = 0; } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 4fda672..cd71141 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -886,7 +886,7 @@ int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) return VM_FAULT_SIGBUS; } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index fe0257a..2a0e497 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -385,7 +385,7 @@ void kvm_arch_sync_events(struct kvm *kvm) { } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; /* FIXME!! diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 2f3e14f..00268ca 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -146,7 +146,7 @@ long kvm_arch_dev_ioctl(struct file *filp, return -EINVAL; } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f65c22c..e9f0a92 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2616,7 +2616,7 @@ out: return r; } -int kvm_dev_ioctl_check_extension(long ext) +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ec4e3bd..5065b95 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -602,7 +602,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf); -int kvm_dev_ioctl_check_extension(long ext); +int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext); int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log, int *is_dirty); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4b6c01b..e28f3ca 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2571,7 +2571,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) return r; } -static long kvm_dev_ioctl_check_extension_generic(long arg) +static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) { switch (arg) { case KVM_CAP_USER_MEMORY: @@ -2595,7 +2595,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg) default: break; } - return kvm_dev_ioctl_check_extension(arg); + return kvm_vm_ioctl_check_extension(kvm, arg); } static long kvm_dev_ioctl(struct file *filp, @@ -2614,7 +2614,7 @@ static long kvm_dev_ioctl(struct file *filp, r = kvm_dev_ioctl_create_vm(arg); break; case KVM_CHECK_EXTENSION: - r = kvm_dev_ioctl_check_extension_generic(arg); + r = kvm_vm_ioctl_check_extension_generic(NULL, arg); break; case KVM_GET_VCPU_MMAP_SIZE: r = -EINVAL; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] KVM: Make KVM_CHECK_EXTENSION a VM ioctl
On PowerPC we have a small problem :). We can run both HV and PR style VMs on the same kvm fd. While this is great, it means that anything that's different between the two needs to have a token in form of a VM fd to find out which one we're asking for. The one thing where this bites us are CAPs. We ask for them on the kvm fd, not the vm fd. So we can only take a random guess whether the user is asking for HV or PR capabilities. So far we got away with this reasonably well - most people will only load one of the two modules and the only thing that *really* breaks is hypercall exposure to user space, so a PR guest will not be able to do KVM hypercalls when HV KVM is loaded on the host, making the magic page unavailable to it. But this still isn't a great situation to be in. Instead, we really should just make the CHECK_EXTENSION ioctl available at VM level. Then we know for sure what user space is asking for. Alex Alexander Graf (3): KVM: Rename and add argument to check_extension KVM: Allow KVM_CHECK_EXTENSION on the vm fd KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode Documentation/virtual/kvm/api.txt | 5 +++- arch/arm/kvm/arm.c| 2 +- arch/ia64/kvm/kvm-ia64.c | 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/kvm/powerpc.c| 14 +++--- arch/s390/kvm/kvm-s390.c | 2 +- arch/x86/kvm/x86.c| 2 +- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 59 --- 9 files changed, 51 insertions(+), 39 deletions(-) -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
With Book3S KVM we can create both PR and HV VMs in parallel on the same machine. That gives us new challenges on the CAPs we return - both have different capabilities. When we get asked about CAPs on the kvm fd, there's nothing we can do. We can try to be smart and assume we're running HV if HV is available, PR otherwise. However with the newly added VM CHECK_EXTENSION we can now ask for capabilities directly on a VM which knows whether it's PR or HV. With this patch I can successfully expose KVM PVINFO data to user space in the PR case, fixing magic page mapping for PAPR guests. Signed-off-by: Alexander Graf --- arch/powerpc/kvm/powerpc.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 2a0e497..6e13fcf 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -388,11 +388,17 @@ void kvm_arch_sync_events(struct kvm *kvm) int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r; - /* FIXME!! -* Should some of this be vm ioctl ? is it possible now ? -*/ + /* Assume we're using HV mode when the HV module is loaded */ int hv_enabled = kvmppc_hv_ops ? 1 : 0; + if (kvm) { + /* +* Hooray - we know which VM type we're running on. Depend on +* that rather than the guess above. +*/ + hv_enabled = is_kvmppc_hv_enabled(kvm); + } + switch (ext) { #ifdef CONFIG_BOOKE case KVM_CAP_PPC_BOOKE_SREGS: -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
so once more; and then I'm going to route your emails to /dev/null, wrap text at 78 chars. On Mon, Jul 14, 2014 at 02:28:36PM +, Liang, Kan wrote: > > > +++ b/arch/x86/kernel/cpu/perf_event.h > > > @@ -464,6 +464,12 @@ struct x86_pmu { > > >*/ > > > struct extra_reg *extra_regs; > > > unsigned int er_flags; > > > + /* > > > + * EXTRA REG MSR can be accessed > > > + * The extra registers are completely unrelated to each other. > > > + * So it needs a flag for each extra register. > > > + */ > > > + boolextra_msr_access[EXTRA_REG_MAX]; > > > > So why not in struct extra_reg again? You didn't give a straight answer > > there. > > I think I did in the email. > You mentioned that there's still (only) 4 empty bytes at the tail of > extra_reg itself. However, the extra_reg_type may be extended in the > near future. So that may not be a reason to move to extra_reg. Well, you can always grow. Also be explicit, 'may be' is an empty statement. > Furthermore, if we move extra_msr_access to extra_reg, I guess we have > to modify all the related micros (i.e EVENT_EXTRA_REG) to initialize > the new items. That could be a big change. Nah, trivial stuff :-) > On the other side, in x86_pmu structure, there are extra_regs related > items defined under the comments "Extra registers for events". And > the bit holes are enough for current usage and future extension. > > So I guess x86_pmu should be a good place to store the availability > of the reg. It just doesn't make sense to me to have multiple arrays of the same thing. pgpllkEQRnDii.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors
On 14/07/14 15:35, Peter Maydell wrote: > On 14 July 2014 14:33, James Hogan wrote: >> On 10/07/14 13:17, Peter Maydell wrote: >>> More generally, there doesn't really seem to be provision in the >>> KVM KVM_EXIT_MMIO API for returning "this access failed". >>> I guess in theory userspace could do all the "figure out how >>> to adjust CPU state to do exception entry and then run VCPU", >>> but that seems like quite a lot of work which the kernel already >>> knows how to do; is there some way to provide a simpler API >>> that lets userspace just inform the kernel that it needs to >>> fault the access? >> >> Indeed. Paolo's idea would work well I think. A data bus error exception >> would likely be the only sensible error response required other than >> ignoring writes or returning a garbage value for a read (which the >> current KVM MMIO API already allows). > > I think we should make the API at least permit returning an > (architecture-specific) error code to the kernel, rather than just > a single boolean "failed" response. For instance on ARM the > fault status registers include a single ExT bit for classifying > the type of an external abort. (The meaning of the bit is > IMPDEF; on the Cortex-A15 it can be used to distinguish > AXI bus DECERR ("decode error", ie the interconnect doesn't > have anything attached at that address) and SLVERR ("slave > error", ie there was a device at that address but it chose to > reject the transaction for some reason, eg unsupported > transfer size, timeout, write to read-only location, FIFO > overrun or just because the device was in a bad mood.) Agreed (I wasn't suggesting a bool, just thinking out loud about how mips would use that arch specific value :-) ). Cheers James -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
CCing Jan to check my nested kvm findings below. On Mon, Jul 14, 2014 at 03:57:09PM +0800, Tang Chen wrote: > Hi Gleb, > > Thanks for the reply. Please see below. > > On 07/12/2014 04:04 PM, Gleb Natapov wrote: > >On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote: > >>apic access page is pinned in memory. As a result, it cannot be > >>migrated/hot-removed. > >>Actually, it is not necessary to be pinned. > >> > >>The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When > >>the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the > >>corresponding ept entry. This patch introduces a new vcpu request named > >>KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this > >>time, > >>and force all the vcpus exit guest, and re-enter guest till they updates > >>the VMCS > >>APIC_ACCESS_ADDR pointer to the new apic access page address, and updates > >>kvm->arch.apic_access_page to the new page. > >> > >By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism > >is not used since no MMIO access to APIC is ever done. Have you tested > >this with "-cpu modelname,-x2apic" qemu flag? > > I used the following commandline to test the patches. > > # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp > 2 > That most likely uses x2apic. > And I think the guest used APIC_ACCESS_ADDR mechanism because the previous > patch-set has some problem which will happen when the apic page is accessed. > And it did happen. > > I'll test this patch-set with "-cpu modelname,-x2apic" flag. > Replace "modelname" with one of supported model names such as nehalem of course :) > > > >>Signed-off-by: Tang Chen > >>--- > >> arch/x86/include/asm/kvm_host.h | 1 + > >> arch/x86/kvm/mmu.c | 11 +++ > >> arch/x86/kvm/svm.c | 6 ++ > >> arch/x86/kvm/vmx.c | 8 +++- > >> arch/x86/kvm/x86.c | 14 ++ > >> include/linux/kvm_host.h| 2 ++ > >> virt/kvm/kvm_main.c | 12 > >> 7 files changed, 53 insertions(+), 1 deletion(-) > >> > >>diff --git a/arch/x86/include/asm/kvm_host.h > >>b/arch/x86/include/asm/kvm_host.h > >>index 62f973e..9ce6bfd 100644 > >>--- a/arch/x86/include/asm/kvm_host.h > >>+++ b/arch/x86/include/asm/kvm_host.h > >>@@ -737,6 +737,7 @@ struct kvm_x86_ops { > >>void (*hwapic_isr_update)(struct kvm *kvm, int isr); > >>void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); > >>void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > >>+ void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); > >>void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > >>void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > >>int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > >>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > >>index 9314678..551693d 100644 > >>--- a/arch/x86/kvm/mmu.c > >>+++ b/arch/x86/kvm/mmu.c > >>@@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, > >>gva_t gpa, u32 error_code, > >> level, gfn, pfn, prefault); > >>spin_unlock(&vcpu->kvm->mmu_lock); > >> > >>+ /* > >>+* apic access page could be migrated. When the guest tries to access > >>+* the apic access page, ept violation will occur, and we can use GUP > >>+* to find the new page. > >>+* > >>+* GUP will wait till the migrate entry be replaced with the new page. > >>+*/ > >>+ if (gpa == APIC_DEFAULT_PHYS_BASE) > >>+ vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm, > >>+ APIC_DEFAULT_PHYS_BASE>> PAGE_SHIFT); > >Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here? > > I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here. > > In kvm_mmu_notifier_invalidate_page() I made the request. And the handler > called > gfn_to_page_no_pin() to get the new page, which will wait till the migration > finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So, when the > vcpus > were forced to exit the guest mode, they would wait till the VMCS > APIC_ACCESS_ADDR > pointer was updated. > > As a result, we don't need to make the request here. OK, I do not see what's the purpose of the code here then. > > > > > >>+ > >>return r; > >> > >> out_unlock: > >>diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > >>index 576b525..dc76f29 100644 > >>--- a/arch/x86/kvm/svm.c > >>+++ b/arch/x86/kvm/svm.c > >>@@ -3612,6 +3612,11 @@ static void svm_set_virtual_x2apic_mode(struct > >>kvm_vcpu *vcpu, bool set) > >>return; > >> } > >> > >>+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) > >>+{ > >>+ return; > >>+} > >>+ > >> static int svm_vm_has_apicv(struct kvm *kvm) > >> { > >>return 0; > >>@@ -4365,6 +4370,7 @@ static struct kvm_x86_ops svm_x86_ops = { > >>.enable_irq_window = enable_irq_window, >
Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors
On 14 July 2014 14:33, James Hogan wrote: > On 10/07/14 13:17, Peter Maydell wrote: >> More generally, there doesn't really seem to be provision in the >> KVM KVM_EXIT_MMIO API for returning "this access failed". >> I guess in theory userspace could do all the "figure out how >> to adjust CPU state to do exception entry and then run VCPU", >> but that seems like quite a lot of work which the kernel already >> knows how to do; is there some way to provide a simpler API >> that lets userspace just inform the kernel that it needs to >> fault the access? > > Indeed. Paolo's idea would work well I think. A data bus error exception > would likely be the only sensible error response required other than > ignoring writes or returning a garbage value for a read (which the > current KVM MMIO API already allows). I think we should make the API at least permit returning an (architecture-specific) error code to the kernel, rather than just a single boolean "failed" response. For instance on ARM the fault status registers include a single ExT bit for classifying the type of an external abort. (The meaning of the bit is IMPDEF; on the Cortex-A15 it can be used to distinguish AXI bus DECERR ("decode error", ie the interconnect doesn't have anything attached at that address) and SLVERR ("slave error", ie there was a device at that address but it chose to reject the transaction for some reason, eg unsupported transfer size, timeout, write to read-only location, FIFO overrun or just because the device was in a bad mood.) thanks -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V5 1/2] perf ignore LBR and extra_regs
> > diff --git a/arch/x86/kernel/cpu/perf_event.h > > b/arch/x86/kernel/cpu/perf_event.h > > index 3b2f9bd..992c678 100644 > > --- a/arch/x86/kernel/cpu/perf_event.h > > +++ b/arch/x86/kernel/cpu/perf_event.h > > @@ -464,6 +464,12 @@ struct x86_pmu { > > */ > > struct extra_reg *extra_regs; > > unsigned int er_flags; > > + /* > > +* EXTRA REG MSR can be accessed > > +* The extra registers are completely unrelated to each other. > > +* So it needs a flag for each extra register. > > +*/ > > + boolextra_msr_access[EXTRA_REG_MAX]; > > So why not in struct extra_reg again? You didn't give a straight answer there. I think I did in the email. You mentioned that there's still (only) 4 empty bytes at the tail of extra_reg itself. However, the extra_reg_type may be extended in the near future. So that may not be a reason to move to extra_reg. Furthermore, if we move extra_msr_access to extra_reg, I guess we have to modify all the related micros (i.e EVENT_EXTRA_REG) to initialize the new items. That could be a big change. On the other side, in x86_pmu structure, there are extra_regs related items defined under the comments "Extra registers for events". And the bit holes are enough for current usage and future extension. So I guess x86_pmu should be a good place to store the availability of the reg. /* --- cacheline 6 boundary (384 bytes) --- */ bool lbr_double_abort; /* 384 1 */ /* XXX 7 bytes hole, try to pack */ struct extra_reg * extra_regs; /* 392 8 */ unsigned int er_flags; /* 400 4 */ /* XXX 4 bytes hole, try to pack */ struct perf_guest_switch_msr * (*guest_get_msrs)(int *); /* 408 8 */ /* size: 416, cachelines: 7, members: 64 */ /* sum members: 391, holes: 6, sum holes: 25 */ /* bit holes: 1, sum bit holes: 27 bits */ /* last cacheline: 32 bytes */ > > > +/* > > + * Under certain circumstances, access certain MSR may cause #GP. > > + * The function tests if the input MSR can be safely accessed. > > + */ > > +static inline bool check_msr(unsigned long msr) { > > This reads like a generic function; > > > + u64 val_old, val_new, val_tmp; > > + > > + /* > > +* Read the current value, change it and read it back to see if it > > +* matches, this is needed to detect certain hardware emulators > > +* (qemu/kvm) that don't trap on the MSR access and always return > 0s. > > +*/ > > + if (rdmsrl_safe(msr, &val_old)) > > + goto msr_fail; > > + /* > > +* Only chagne it slightly, > > +* since the higher bits of some MSRs cannot be updated by wrmsrl. > > +* E.g. MSR_LBR_TOS > > +*/ > > + val_tmp = val_old ^ 0x3UL; > > but this is not generally true; not all MSRs can write the 2 LSB, can they? > One > option would be to extend the function with a u64 mask. Right, the function should be easily used to check all MSRs, not just for the MSRs I tested. I will pass a mask as a parameter of the function. > > > + if (wrmsrl_safe(msr, val_tmp) || > > + rdmsrl_safe(msr, &val_new)) > > + goto msr_fail; > > + > > + if (val_new != val_tmp) > > + goto msr_fail; > > + > > + /* Here it's sure that the MSR can be safely accessed. > > +* Restore the old value and return. > > +*/ > > + wrmsrl(msr, val_old); > > + > > + return true; > > + > > +msr_fail: > > + return false; > > +} > > Also, by now this function is far too large to be inline and in a header. OK. I will move it to perf_event_intel.c as a static function. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.
On Mon, Jul 14, 2014 at 05:17:04PM +0800, Tang Chen wrote: > On 07/12/2014 03:44 PM, Gleb Natapov wrote: > >On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote: > >>kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But > >>it is never used to refer to the page at all. > >> > >>In vcpu initialization, it indicates two things: > >>1. indicates if ept page is allocated > >>2. indicates if a memory slot for identity page is initialized > >> > >>Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept > >>identity pagetable is initialized. So we can remove ept_identity_pagetable. > >> > >>Signed-off-by: Tang Chen > >>--- > >> arch/x86/include/asm/kvm_host.h | 1 - > >> arch/x86/kvm/vmx.c | 25 +++-- > >> 2 files changed, 11 insertions(+), 15 deletions(-) > >> > >>diff --git a/arch/x86/include/asm/kvm_host.h > >>b/arch/x86/include/asm/kvm_host.h > >>index 4931415..62f973e 100644 > >>--- a/arch/x86/include/asm/kvm_host.h > >>+++ b/arch/x86/include/asm/kvm_host.h > >>@@ -578,7 +578,6 @@ struct kvm_arch { > >> > >>gpa_t wall_clock; > >> > >>- struct page *ept_identity_pagetable; > >>bool ept_identity_pagetable_done; > >>gpa_t ept_identity_map_addr; > >> > >>diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >>index 0918635e..fe2e5f4 100644 > >>--- a/arch/x86/kvm/vmx.c > >>+++ b/arch/x86/kvm/vmx.c > >>@@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu > >>*vcpu); > >> static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx); > >> static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx); > >> static bool vmx_mpx_supported(void); > >>+static int alloc_identity_pagetable(struct kvm *kvm); > >> > >> static DEFINE_PER_CPU(struct vmcs *, vmxarea); > >> static DEFINE_PER_CPU(struct vmcs *, current_vmcs); > >>@@ -3921,21 +3922,21 @@ out: > >> > >> static int init_rmode_identity_map(struct kvm *kvm) > >> { > >>- int i, idx, r, ret; > >>+ int i, idx, r, ret = 0; > >>pfn_t identity_map_pfn; > >>u32 tmp; > >> > >>if (!enable_ept) > >>return 1; > >>- if (unlikely(!kvm->arch.ept_identity_pagetable)) { > >>- printk(KERN_ERR "EPT: identity-mapping pagetable " > >>- "haven't been allocated!\n"); > >>- return 0; > >>- } > >>if (likely(kvm->arch.ept_identity_pagetable_done)) > >>return 1; > >>- ret = 0; > >>identity_map_pfn = kvm->arch.ept_identity_map_addr>> PAGE_SHIFT; > >>+ > >>+ mutex_lock(&kvm->slots_lock); > >Why move this out of alloc_identity_pagetable()? > > > > Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used > to protect kvm->arch.ept_identity_pagetable. If two or more threads try to > modify it at the same time, the mutex ensures that the identity table is > only > allocated once. > > Now, we dropped kvm->arch.ept_identity_pagetable. And use > kvm->arch.ept_identity_pagetable_done > to check if the identity table is allocated and initialized. So we should > protect > memory slot operation in alloc_identity_pagetable() and > kvm->arch.ept_identity_pagetable_done > with this mutex. > > Of course, I can see that the name "slots_lock" indicates that it may be > used > to protect the memory slot operation only. Maybe move it out here is not > suitable. > > If I'm wrong, please tell me. > No, you are right that besides memory slot creation slots_lock protects checking of ept_identity_pagetable here, but after you patch ept_identity_pagetable_done is tested outside of slots_lock so the allocation can happen twice, no? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] IRQFD without IRQ routing, enabled for XICS
On Mon, 30 Jun 2014 20:51:08 +1000 Paul Mackerras wrote: > I would like to see this go into 3.17. FWIW: I've given this a whirl on s390 (with a dataplane disk), and everything seems to work as before. The only thing which is I think worth mentioning is that embedding the routing entry into the irqfd struct will grow it a bit, which might be noticable on large installations with hundreds of devices. OTOH, the routing entry isn't too large, so I don't think it will become a problem. > > arch/ia64/kvm/Kconfig| 1 + > arch/powerpc/kvm/Kconfig | 3 + > arch/powerpc/kvm/book3s_hv_rm_xics.c | 5 ++ > arch/powerpc/kvm/book3s_xics.c | 55 +++--- > arch/powerpc/kvm/book3s_xics.h | 2 + > arch/powerpc/kvm/mpic.c | 4 +- > arch/s390/kvm/Kconfig| 1 + > arch/s390/kvm/interrupt.c| 3 +- > arch/x86/kvm/Kconfig | 1 + > include/linux/kvm_host.h | 43 --- > virt/kvm/Kconfig | 3 + > virt/kvm/eventfd.c | 134 > ++- > virt/kvm/irq_comm.c | 24 +++ > virt/kvm/irqchip.c | 98 ++--- > virt/kvm/kvm_main.c | 2 +- > 15 files changed, 227 insertions(+), 152 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
Il 14/07/2014 14:48, Peter Zijlstra ha scritto: In fact there's no reason why LBR cannot be virtualized (though it does need support from the processor), and it may even be possible to support OFFCORE_RSP_X in the KVM virtual PMU. But its not, so something needs to be done, right? A first thing that could be done, is to have a way for the kernel to detect absence of LBR Which is exactly what this patch does, no? A documented (and recommended) way for the kernel to detect it is superior though... , for example an all-1s setting of the LBR format field of IA32_PERF_CAPABILITIES. If Intel can tell us "all 1s will never be used", we can have KVM set the field that way. The kernel then should disable LBR. So what's wrong with testing if the MSRs that provide LBR actually work or not? That doesn't require any magic co-operation of the host/vm machinery and is therefore far more robust. ... because the difference is that whenever we hack the kernel, Windows and vTune will remain broken forever. Whenever we get Intel to make the hack part of the spec, there is some hope that Windows and vTune will eventually get fixed. The hack certainly works, I'm not disputing that. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V5 1/2] perf ignore LBR and extra_regs
> -Original Message- > From: Paolo Bonzini [mailto:pbonz...@redhat.com] > Sent: Monday, July 14, 2014 9:40 AM > To: Liang, Kan; Peter Zijlstra > Cc: a...@firstfloor.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org > Subject: Re: [PATCH V5 1/2] perf ignore LBR and extra_regs > > Il 14/07/2014 15:36, Liang, Kan ha scritto: > >> > Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage > >> > or just no events reported? > >> > > > Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will > handle the #GP. So there is no error report in guest. > > Yeah, but what's the effect on the output of perf? They can be still counted/sampled, but the results are garbage. Kan > > Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
Il 14/07/2014 15:36, Liang, Kan ha scritto: > Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage or > just no events reported? > Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will handle the #GP. So there is no error report in guest. Yeah, but what's the effect on the output of perf? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V5 1/2] perf ignore LBR and extra_regs
> >>> For reproducing the issue, please build the kernel with > CONFIG_KVM_INTEL = y (for host kernel). > >>> And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest > kernel). > >> > >> I'm not sure this is a useful patch. > >> > >> This is #GP'ing just because of a limitation in the PMU; just compile > >> the kernel with CONFIG_PARAVIRT > > > > How's that going to help? If you run kvm -host the VM is lying through > > its teeth is the kernel is going to assume all those MSRs present, > > PARAVIRT isn't going to help with this. > > > >> , or split the "rdmsr is always rdmsr_safe" > >> behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol. > > > > That's not useful either, because non of these code-paths are going to > > check the return value. > > Hmmm, I thought rdmsr_safe was going to return zero, but it just returns > whatever happened to be in edx:eax which maybe should also be fixed. > > Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage or > just no events reported? > Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will handle the #GP. So there is no error report in guest. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors
Hi Peter, On 10/07/14 13:17, Peter Maydell wrote: > On 17 June 2014 23:10, James Hogan wrote: >> The patchset depends on v4 of "target-mips: implement UserLocal >> Register". I'm aiming for QEMU 2.1, hopefully it isn't too late to get >> some final review. >> >> Thanks to everybody who has already taken part in review. >> >> This patchset implements KVM support for MIPS32 processors, using Trap & >> Emulation. > > I was looking at what happens for MMIO accesses to nonexistent > memory when we're running under KVM, and interestingly this > patchset means MIPS is now the only CPU which both (a) supports > KVM and (b) has an implementation of the do_unassigned_access() > CPU hook. Does the current code support a guest attempt to > access unassigned physical addresses? I had a look at the code > and it seems like mips_cpu_unassigned_access() will end up > calling cpu_restore_state() and cpu_loop_exit(), which I think > will probably crash if you call them when using KVM rather than > TCG... Yes, I have observed this in the past when experimentally writing to the Malta reset region from the guest (see the patch "mips/malta: allow volatile writes to reset flash" which didn't get applied but worked around the specific issue). That was because read only memory regions were treated as unassigned from the point of view of writes (which tbh seems wrong), but it could happen with any unassigned access > 0x8000 from the guest AFAICT. So yeh, until mips_cpu_unassigned_access does something more portable for KVM, conditionally setting do_unassigned_access only if !kvm_enabled() looks sensible. I'll see if I can reproduce it and submit a patch. > More generally, there doesn't really seem to be provision in the > KVM KVM_EXIT_MMIO API for returning "this access failed". > I guess in theory userspace could do all the "figure out how > to adjust CPU state to do exception entry and then run VCPU", > but that seems like quite a lot of work which the kernel already > knows how to do; is there some way to provide a simpler API > that lets userspace just inform the kernel that it needs to > fault the access? Indeed. Paolo's idea would work well I think. A data bus error exception would likely be the only sensible error response required other than ignoring writes or returning a garbage value for a read (which the current KVM MMIO API already allows). Cheers James -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Mon, Jul 14, 2014 at 02:40:33PM +0200, Paolo Bonzini wrote: > Hmmm, I thought rdmsr_safe was going to return zero, but it just returns > whatever happened to be in edx:eax which maybe should also be fixed. > > Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage or just no > events reported? Either way; its not functioning according to 'spec'. Therefore we should 'disable' things. > >>In fact there's no reason why LBR cannot be virtualized (though it does need > >>support from the processor), and it may even be possible to support > >>OFFCORE_RSP_X in the KVM virtual PMU. > > > >But its not, so something needs to be done, right? > > A first thing that could be done, is to have a way for the kernel to detect > absence of LBR Which is exactly what this patch does, no? > , for example an all-1s setting of the LBR format field of > IA32_PERF_CAPABILITIES. If Intel can tell us "all 1s will never be used", > we can have KVM set the field that way. The kernel then should disable LBR. So what's wrong with testing if the MSRs that provide LBR actually work or not? That doesn't require any magic co-operation of the host/vm machinery and is therefore far more robust. pgpBAg5uoyJMK.pgp Description: PGP signature
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
Il 14/07/2014 14:09, Peter Zijlstra ha scritto: On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote: Il 10/07/2014 12:59, kan.li...@intel.com ha scritto: From: Kan Liang x86, perf: Protect LBR and extra_regs against KVM lying With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). I'm not sure this is a useful patch. This is #GP'ing just because of a limitation in the PMU; just compile the kernel with CONFIG_PARAVIRT How's that going to help? If you run kvm -host the VM is lying through its teeth is the kernel is going to assume all those MSRs present, PARAVIRT isn't going to help with this. , or split the "rdmsr is always rdmsr_safe" behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol. That's not useful either, because non of these code-paths are going to check the return value. Hmmm, I thought rdmsr_safe was going to return zero, but it just returns whatever happened to be in edx:eax which maybe should also be fixed. Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage or just no events reported? In fact there's no reason why LBR cannot be virtualized (though it does need support from the processor), and it may even be possible to support OFFCORE_RSP_X in the KVM virtual PMU. But its not, so something needs to be done, right? A first thing that could be done, is to have a way for the kernel to detect absence of LBR, for example an all-1s setting of the LBR format field of IA32_PERF_CAPABILITIES. If Intel can tell us "all 1s will never be used", we can have KVM set the field that way. The kernel then should disable LBR. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote: > Il 10/07/2014 12:59, kan.li...@intel.com ha scritto: > >From: Kan Liang > > > >x86, perf: Protect LBR and extra_regs against KVM lying > > > >With -cpu host, KVM reports LBR and extra_regs support, if the host has > >support. > >When the guest perf driver tries to access LBR or extra_regs MSR, > >it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. > >So check the related MSRs access right once at initialization time to avoid > >the error access at runtime. > > > >For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y > >(for host kernel). > >And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). > > I'm not sure this is a useful patch. > > This is #GP'ing just because of a limitation in the PMU; just compile the > kernel with CONFIG_PARAVIRT How's that going to help? If you run kvm -host the VM is lying through its teeth is the kernel is going to assume all those MSRs present, PARAVIRT isn't going to help with this. > , or split the "rdmsr is always rdmsr_safe" > behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol. That's not useful either, because non of these code-paths are going to check the return value. > In fact there's no reason why LBR cannot be virtualized (though it does need > support from the processor), and it may even be possible to support > OFFCORE_RSP_X in the KVM virtual PMU. But its not, so something needs to be done, right? pgpmnEwGpgwpV.pgp Description: PGP signature
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
Il 10/07/2014 12:59, kan.li...@intel.com ha scritto: From: Kan Liang x86, perf: Protect LBR and extra_regs against KVM lying With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). I'm not sure this is a useful patch. This is #GP'ing just because of a limitation in the PMU; just compile the kernel with CONFIG_PARAVIRT, or split the "rdmsr is always rdmsr_safe" behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol. In fact there's no reason why LBR cannot be virtualized (though it does need support from the processor), and it may even be possible to support OFFCORE_RSP_X in the KVM virtual PMU. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/19] KVM: emulate: move around some checks
The only purpose of this patch is to make the next patch simpler to review. No semantic change. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 15453e569f3d..d79677c6056d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4388,12 +4388,15 @@ done_prefixes: ctxt->d |= opcode.flags; } + /* Unrecognised? */ + if (ctxt->d == 0) + return EMULATION_FAILED; + ctxt->execute = opcode.u.execute; ctxt->check_perm = opcode.check_perm; ctxt->intercept = opcode.intercept; - /* Unrecognised? */ - if (ctxt->d == 0 || (ctxt->d & NotImpl)) + if (ctxt->d & NotImpl) return EMULATION_FAILED; if (!(ctxt->d & EmulateOnUD) && ctxt->ud) @@ -4535,19 +4538,19 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) ctxt->mem_read.pos = 0; - if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) || - (ctxt->d & Undefined)) { + /* LOCK prefix is allowed only with some instructions */ + if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != OP_MEM)) { rc = emulate_ud(ctxt); goto done; } - /* LOCK prefix is allowed only with some instructions */ - if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != OP_MEM)) { + if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) { rc = emulate_ud(ctxt); goto done; } - if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) { + if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) || + (ctxt->d & Undefined)) { rc = emulate_ud(ctxt); goto done; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/19] KVM: x86: return all bits from get_interrupt_shadow
For the next patch we will need to know the full state of the interrupt shadow; we will then set KVM_REQ_EVENT when one bit is cleared. However, right now get_interrupt_shadow only returns the one corresponding to the emulated instruction, or an unconditional 0 if the emulated instruction does not have an interrupt shadow. This is confusing and does not allow us to check for cleared bits as mentioned above. Clean the callback up, and modify toggle_interruptibility to match the comment above the call. As a small result, the call to set_interrupt_shadow will be skipped in the common case where int_shadow == 0 && mask == 0. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm.c | 6 +++--- arch/x86/kvm/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 10 +- 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index af36f89fe67a..b8a4480176b9 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -717,7 +717,7 @@ struct kvm_x86_ops { int (*handle_exit)(struct kvm_vcpu *vcpu); void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask); - u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask); + u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu); void (*patch_hypercall)(struct kvm_vcpu *vcpu, unsigned char *hypercall_addr); void (*set_irq)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 4925a94a970a..ddf742768ecf 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -486,14 +486,14 @@ static int is_external_interrupt(u32 info) return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR); } -static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) +static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); u32 ret = 0; if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) - ret |= KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS; - return ret & mask; + ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS; + return ret; } static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5c7bbde25bbf..0c9569b994f9 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1943,7 +1943,7 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) vmcs_writel(GUEST_RFLAGS, rflags); } -static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) +static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu) { u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO); int ret = 0; @@ -1953,7 +1953,7 @@ static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) if (interruptibility & GUEST_INTR_STATE_MOV_SS) ret |= KVM_X86_SHADOW_INT_MOV_SS; - return ret & mask; + return ret; } static void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7553530e3502..a56126e6bd75 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2978,9 +2978,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft; events->interrupt.nr = vcpu->arch.interrupt.nr; events->interrupt.soft = 0; - events->interrupt.shadow = - kvm_x86_ops->get_interrupt_shadow(vcpu, - KVM_X86_SHADOW_INT_MOV_SS | KVM_X86_SHADOW_INT_STI); + events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu); events->nmi.injected = vcpu->arch.nmi_injected; events->nmi.pending = vcpu->arch.nmi_pending != 0; @@ -4860,7 +4858,7 @@ static const struct x86_emulate_ops emulate_ops = { static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) { - u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu, mask); + u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu); /* * an sti; sti; sequence only disable interrupts for the first * instruction. So, if the last instruction, be it emulated or @@ -4868,7 +4866,9 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) * means that the last instruction is an sti. We should not * leave the flag on in this case. The same goes for mov ss */ - if (!(int_shadow & mask)) + if (int_shadow & mask) + mask = 0; + if (unlikely(int_shadow || mask)) kvm_x86_ops->set_interrupt_shadow(vcpu, mask); } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe k
[PATCH 03/19] KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
Despite the provisions to emulate up to 130 consecutive instructions, in practice KVM will emulate just one before exiting handle_invalid_guest_state, because x86_emulate_instruction always sets KVM_REQ_EVENT. However, we only need to do this if an interrupt could be injected, which happens a) if an interrupt shadow bit (STI or MOV SS) has gone away; b) if the interrupt flag has just been set (other instructions than STI can set it without enabling an interrupt shadow). This cuts another 700-900 cycles from the cost of emulating an instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles before the patch on kvm-unit-tests, 925-1700 afterwards). Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a56126e6bd75..cd9316786dca 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -87,6 +87,7 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE); static void update_cr8_intercept(struct kvm_vcpu *vcpu); static void process_nmi(struct kvm_vcpu *vcpu); +static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); struct kvm_x86_ops *kvm_x86_ops; EXPORT_SYMBOL_GPL(kvm_x86_ops); @@ -4868,8 +4869,11 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) */ if (int_shadow & mask) mask = 0; - if (unlikely(int_shadow || mask)) + if (unlikely(int_shadow || mask)) { kvm_x86_ops->set_interrupt_shadow(vcpu, mask); + if (!mask) + kvm_make_request(KVM_REQ_EVENT, vcpu); + } } static void inject_emulated_exception(struct kvm_vcpu *vcpu) @@ -5095,20 +5099,18 @@ static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 type, u32 dr7, return dr6; } -static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, int *r) +static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, unsigned long rflags, int *r) { struct kvm_run *kvm_run = vcpu->run; /* -* Use the "raw" value to see if TF was passed to the processor. -* Note that the new value of the flags has not been saved yet. +* rflags is the old, "raw" value of the flags. The new value has +* not been saved yet. * * This is correct even for TF set by the guest, because "the * processor will not generate this exception after the instruction * that sets the TF flag". */ - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); - if (unlikely(rflags & X86_EFLAGS_TF)) { if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) { kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1; @@ -5275,13 +5277,22 @@ restart: r = EMULATE_DONE; if (writeback) { + unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); toggle_interruptibility(vcpu, ctxt->interruptibility); - kvm_make_request(KVM_REQ_EVENT, vcpu); vcpu->arch.emulate_regs_need_sync_to_vcpu = false; kvm_rip_write(vcpu, ctxt->eip); if (r == EMULATE_DONE) - kvm_vcpu_check_singlestep(vcpu, &r); - kvm_set_rflags(vcpu, ctxt->eflags); + kvm_vcpu_check_singlestep(vcpu, rflags, &r); + __kvm_set_rflags(vcpu, ctxt->eflags); + + /* +* For STI, interrupts are shadowed; so KVM_REQ_EVENT will +* do nothing, and it will be requested again as soon as +* the shadow expires. But we still need to check here, +* because POPF has no interrupt shadow. +*/ + if (unlikely((ctxt->eflags & ~rflags) & X86_EFLAGS_IF)) + kvm_make_request(KVM_REQ_EVENT, vcpu); } else vcpu->arch.emulate_regs_need_sync_to_vcpu = true; @@ -7406,12 +7417,17 @@ unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_get_rflags); -void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) +static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) { if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP && kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip)) rflags |= X86_EFLAGS_TF; kvm_x86_ops->set_rflags(vcpu, rflags); +} + +void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) +{ + __kvm_set_rflags(vcpu, rflags); kvm_make_request(KVM_REQ_EVENT, vcpu); } EXPORT_SYMBOL_GPL(kvm_set_rflags); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/19] KVM: emulate: speed up emulated moves
We can just blindly move all 16 bytes of ctxt->src's value to ctxt->dst. write_register_operand will take care of writing only the lower bytes. Avoiding a call to memcpy (the compiler optimizes it out) gains about 200 cycles on kvm-unit-tests for register-to-register moves, and makes them about as fast as arithmetic instructions. We could perhaps get a larger speedup by moving all instructions _except_ moves out of x86_emulate_insn, removing opcode_len, and replacing the switch statement with an inlined em_mov. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 2 +- arch/x86/kvm/emulate.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 0e0151c13b2c..432447370044 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -233,7 +233,7 @@ struct operand { union { unsigned long val; u64 val64; - char valptr[sizeof(unsigned long) + 2]; + char valptr[sizeof(sse128_t)]; sse128_t vec_val; u64 mm_val; void *data; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ea56dae3e67c..27f677ef703e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2990,7 +2990,7 @@ static int em_rdpmc(struct x86_emulate_ctxt *ctxt) static int em_mov(struct x86_emulate_ctxt *ctxt) { - memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes); + memcpy(ctxt->dst.valptr, ctxt->src.valptr, sizeof(ctxt->src.valptr)); return X86EMUL_CONTINUE; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/19] KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
There are several checks for "peculiar" aspects of instructions in both x86_decode_insn and x86_emulate_insn. Group them together, and guard them with a single "if" that lets the processor quickly skip them all. Make this more effective by adding two more flag bits that say whether the .intercept and .check_perm fields are valid. We will reuse these flags later to avoid initializing fields of the emulate_ctxt struct. This skims about 30 cycles for each emulated instructions, which is approximately a 3% improvement. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 175 ++--- 1 file changed, 94 insertions(+), 81 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d79677c6056d..ea56dae3e67c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -162,6 +162,8 @@ #define NoWrite ((u64)1 << 45) /* No writeback */ #define SrcWrite((u64)1 << 46) /* Write back src operand */ #define NoMod ((u64)1 << 47) /* Mod field is ignored */ +#define Intercept ((u64)1 << 48) /* Has valid intercept field */ +#define CheckPerm ((u64)1 << 49) /* Has valid check_perm field */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -3546,9 +3548,9 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) } #define D(_y) { .flags = (_y) } -#define DI(_y, _i) { .flags = (_y), .intercept = x86_intercept_##_i } -#define DIP(_y, _i, _p) { .flags = (_y), .intercept = x86_intercept_##_i, \ - .check_perm = (_p) } +#define DI(_y, _i) { .flags = (_y)|Intercept, .intercept = x86_intercept_##_i } +#define DIP(_y, _i, _p) { .flags = (_y)|Intercept|CheckPerm, \ + .intercept = x86_intercept_##_i, .check_perm = (_p) } #define ND(NotImpl) #define EXT(_f, _e) { .flags = ((_f) | RMExt), .u.group = (_e) } #define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) } @@ -3557,10 +3559,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) #define I(_f, _e) { .flags = (_f), .u.execute = (_e) } #define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) } #define II(_f, _e, _i) \ - { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i } + { .flags = (_f)|Intercept, .u.execute = (_e), .intercept = x86_intercept_##_i } #define IIP(_f, _e, _i, _p) \ - { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i, \ - .check_perm = (_p) } + { .flags = (_f)|Intercept|CheckPerm, .u.execute = (_e), \ + .intercept = x86_intercept_##_i, .check_perm = (_p) } #define GP(_f, _g) { .flags = ((_f) | Prefix), .u.gprefix = (_g) } #define D2bv(_f) D((_f) | ByteOp), D(_f) @@ -4393,29 +4395,37 @@ done_prefixes: return EMULATION_FAILED; ctxt->execute = opcode.u.execute; - ctxt->check_perm = opcode.check_perm; - ctxt->intercept = opcode.intercept; - if (ctxt->d & NotImpl) - return EMULATION_FAILED; + if (unlikely(ctxt->d & + (NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) { + /* +* These are copied unconditionally here, and checked unconditionally +* in x86_emulate_insn. +*/ + ctxt->check_perm = opcode.check_perm; + ctxt->intercept = opcode.intercept; - if (!(ctxt->d & EmulateOnUD) && ctxt->ud) - return EMULATION_FAILED; + if (ctxt->d & NotImpl) + return EMULATION_FAILED; - if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack)) - ctxt->op_bytes = 8; + if (!(ctxt->d & EmulateOnUD) && ctxt->ud) + return EMULATION_FAILED; - if (ctxt->d & Op3264) { - if (mode == X86EMUL_MODE_PROT64) + if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack)) ctxt->op_bytes = 8; - else - ctxt->op_bytes = 4; - } - if (ctxt->d & Sse) - ctxt->op_bytes = 16; - else if (ctxt->d & Mmx) - ctxt->op_bytes = 8; + if (ctxt->d & Op3264) { + if (mode == X86EMUL_MODE_PROT64) + ctxt->op_bytes = 8; + else + ctxt->op_bytes = 4; + } + + if (ctxt->d & Sse) + ctxt->op_bytes = 16; + else if (ctxt->d & Mmx) + ctxt->op_bytes = 8; + } /* ModRM and SIB bytes. */ if (ctxt->d & ModRM) { @@ -4549,75 +4559,78 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) goto done; } - if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) || - (ctxt->d & Undefined)) { - rc = emulate_ud(ctxt); - goto done; -
[PATCH 07/19] KVM: emulate: simplify writeback
The "if/return" checks are useless, because we return X86EMUL_CONTINUE anyway if we do not return. Reviewed-by: Marcelo Tosatti Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 28 +++- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 27f677ef703e..32d3da82da2e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1588,34 +1588,28 @@ static void write_register_operand(struct operand *op) static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) { - int rc; - switch (op->type) { case OP_REG: write_register_operand(op); break; case OP_MEM: if (ctxt->lock_prefix) - rc = segmented_cmpxchg(ctxt, + return segmented_cmpxchg(ctxt, +op->addr.mem, +&op->orig_val, +&op->val, +op->bytes); + else + return segmented_write(ctxt, op->addr.mem, - &op->orig_val, &op->val, op->bytes); - else - rc = segmented_write(ctxt, -op->addr.mem, -&op->val, -op->bytes); - if (rc != X86EMUL_CONTINUE) - return rc; break; case OP_MEM_STR: - rc = segmented_write(ctxt, - op->addr.mem, - op->data, - op->bytes * op->count); - if (rc != X86EMUL_CONTINUE) - return rc; + return segmented_write(ctxt, + op->addr.mem, + op->data, + op->bytes * op->count); break; case OP_XMM: write_sse_reg(ctxt, &op->vec_val, op->addr.xmm); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/19] KVM: emulate: cleanup decode_modrm
From: Bandan Das Remove the if conditional - that will help us avoid an "else initialize to 0" Also, rearrange operators for slightly better code. Signed-off-by: Bandan Das Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 000fc7398832..94f5f8b94ce9 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1066,19 +1066,17 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt, struct operand *op) { u8 sib; - int index_reg = 0, base_reg = 0, scale; + int index_reg, base_reg, scale; int rc = X86EMUL_CONTINUE; ulong modrm_ea = 0; - if (ctxt->rex_prefix) { - ctxt->modrm_reg = (ctxt->rex_prefix & 4) << 1; /* REX.R */ - index_reg = (ctxt->rex_prefix & 2) << 2; /* REX.X */ - ctxt->modrm_rm = base_reg = (ctxt->rex_prefix & 1) << 3; /* REG.B */ - } + ctxt->modrm_reg = ((ctxt->rex_prefix << 1) & 8); /* REX.R */ + index_reg = (ctxt->rex_prefix << 2) & 8; /* REX.X */ + base_reg = (ctxt->rex_prefix << 3) & 8; /* REX.B */ - ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6; + ctxt->modrm_mod = (ctxt->modrm & 0xc0) >> 6; ctxt->modrm_reg |= (ctxt->modrm & 0x38) >> 3; - ctxt->modrm_rm |= (ctxt->modrm & 0x07); + ctxt->modrm_rm = base_reg | (ctxt->modrm & 0x07); ctxt->modrm_seg = VCPU_SREG_DS; if (ctxt->modrm_mod == 3 || (ctxt->d & NoMod)) { -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/19] KVM: vmx: speed up emulation of invalid guest state
About 25% of the time spent in emulation of invalid guest state is wasted in checking whether emulation is required for the next instruction. However, this almost never changes except when a segment register (or TR or LDTR) changes, or when there is a mode transition (i.e. CR0 changes). In fact, vmx_set_segment and vmx_set_cr0 already modify vmx->emulation_required (except that the former for some reason uses |= instead of just an assignment). So there is no need to call guest_state_valid in the emulation loop. Emulation performance test results indicate 1650-2600 cycles for common instructions, versus 2300-3200 before this patch on a Sandy Bridge Xeon. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 8748c2e19ed6..5c7bbde25bbf 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3672,7 +3672,7 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu, vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var)); out: - vmx->emulation_required |= emulation_required(vcpu); + vmx->emulation_required = emulation_required(vcpu); } static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) @@ -5640,7 +5640,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); intr_window_requested = cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING; - while (!guest_state_valid(vcpu) && count-- != 0) { + while (vmx->emulation_required && count-- != 0) { if (intr_window_requested && vmx_interrupt_allowed(vcpu)) return handle_interrupt_window(&vmx->vcpu); @@ -5674,7 +5674,6 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) schedule(); } - vmx->emulation_required = emulation_required(vcpu); out: return ret; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/19] KVM: emulate: speed up do_insn_fetch
Hoist the common case up from do_insn_fetch_byte to do_insn_fetch, and prime the fetch_cache in x86_decode_insn. This helps a bit the compiler and the branch predictor, but above all it lays the ground for further changes in the next few patches. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 67 +++--- 1 file changed, 36 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 390400a54a9c..ea188a338afb 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -705,51 +705,51 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, } /* - * Fetch the next byte of the instruction being emulated which is pointed to - * by ctxt->_eip, then increment ctxt->_eip. - * - * Also prefetch the remaining bytes of the instruction without crossing page + * Prefetch the remaining bytes of the instruction without crossing page * boundary if they are not in fetch_cache yet. */ -static int do_insn_fetch_byte(struct x86_emulate_ctxt *ctxt, u8 *dest) +static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt) { struct fetch_cache *fc = &ctxt->fetch; int rc; int size, cur_size; - - if (ctxt->_eip == fc->end) { - unsigned long linear; - struct segmented_address addr = { .seg = VCPU_SREG_CS, - .ea = ctxt->_eip }; - cur_size = fc->end - fc->start; - size = min(15UL - cur_size, - PAGE_SIZE - offset_in_page(ctxt->_eip)); - rc = __linearize(ctxt, addr, size, false, true, &linear); - if (unlikely(rc != X86EMUL_CONTINUE)) - return rc; - rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size, - size, &ctxt->exception); - if (unlikely(rc != X86EMUL_CONTINUE)) - return rc; - fc->end += size; - } - *dest = fc->data[ctxt->_eip - fc->start]; - ctxt->_eip++; + unsigned long linear; + + struct segmented_address addr = { .seg = VCPU_SREG_CS, + .ea = fc->end }; + cur_size = fc->end - fc->start; + size = min(15UL - cur_size, + PAGE_SIZE - offset_in_page(fc->end)); + if (unlikely(size == 0)) + return X86EMUL_UNHANDLEABLE; + rc = __linearize(ctxt, addr, size, false, true, &linear); + if (unlikely(rc != X86EMUL_CONTINUE)) + return rc; + rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size, + size, &ctxt->exception); + if (unlikely(rc != X86EMUL_CONTINUE)) + return rc; + fc->end += size; return X86EMUL_CONTINUE; } static int do_insn_fetch(struct x86_emulate_ctxt *ctxt, -void *dest, unsigned size) +void *__dest, unsigned size) { int rc; + struct fetch_cache *fc = &ctxt->fetch; + u8 *dest = __dest; + u8 *src = &fc->data[ctxt->_eip - fc->start]; - /* x86 instructions are limited to 15 bytes. */ - if (unlikely(ctxt->_eip + size - ctxt->eip > 15)) - return X86EMUL_UNHANDLEABLE; while (size--) { - rc = do_insn_fetch_byte(ctxt, dest++); - if (rc != X86EMUL_CONTINUE) - return rc; + if (unlikely(ctxt->_eip == fc->end)) { + rc = do_insn_fetch_bytes(ctxt); + if (rc != X86EMUL_CONTINUE) + return rc; + } + *dest++ = *src++; + ctxt->_eip++; + continue; } return X86EMUL_CONTINUE; } @@ -4227,6 +4227,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) ctxt->opcode_len = 1; if (insn_len > 0) memcpy(ctxt->fetch.data, insn, insn_len); + else { + rc = do_insn_fetch_bytes(ctxt); + if (rc != X86EMUL_CONTINUE) + return rc; + } switch (mode) { case X86EMUL_MODE_REAL: -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/19] KVM: emulate: clean up initializations in init_decode_cache
From: Bandan Das A lot of initializations are unnecessary as they get set to appropriate values before actually being used. Optimize placement of fields in x86_emulate_ctxt Signed-off-by: Bandan Das Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 20 +++- arch/x86/kvm/emulate.c | 7 ++- 2 files changed, 13 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index c22bd9af4311..d40a10a38a80 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -288,30 +288,32 @@ struct x86_emulate_ctxt { u8 opcode_len; u8 b; u8 intercept; - u8 lock_prefix; - u8 rep_prefix; u8 op_bytes; u8 ad_bytes; - u8 rex_prefix; struct operand src; struct operand src2; struct operand dst; - bool has_seg_override; - u8 seg_override; - u64 d; int (*execute)(struct x86_emulate_ctxt *ctxt); int (*check_perm)(struct x86_emulate_ctxt *ctxt); + bool has_seg_override; + bool rip_relative; + u8 rex_prefix; + u8 lock_prefix; + u8 rep_prefix; + u8 seg_override; + /* bitmaps of registers in _regs[] that can be read */ + u32 regs_valid; + /* bitmaps of registers in _regs[] that have been written */ + u32 regs_dirty; /* modrm */ u8 modrm; u8 modrm_mod; u8 modrm_reg; u8 modrm_rm; u8 modrm_seg; - bool rip_relative; + u64 d; unsigned long _eip; struct operand memop; - u32 regs_valid; /* bitmaps of registers in _regs[] that can be read */ - u32 regs_dirty; /* bitmaps of registers in _regs[] that have been written */ /* Fields above regs are cleared together. */ unsigned long _regs[NR_VCPU_REGS]; struct operand *memopp; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 94f5f8b94ce9..3e9bbdc4c76a 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4534,14 +4534,11 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)) void init_decode_cache(struct x86_emulate_ctxt *ctxt) { - memset(&ctxt->opcode_len, 0, - (void *)&ctxt->_regs - (void *)&ctxt->opcode_len); + memset(&ctxt->has_seg_override, 0, + (void *)&ctxt->modrm - (void *)&ctxt->has_seg_override); - ctxt->fetch.start = 0; - ctxt->fetch.end = 0; ctxt->io_read.pos = 0; ctxt->io_read.end = 0; - ctxt->mem_read.pos = 0; ctxt->mem_read.end = 0; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 15/19] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
do_insn_fetch_bytes will only be called once in a given insn_fetch and insn_fetch_arr, because in fact it will only be called at most twice for any instruction and the first call is explicit in x86_decode_insn. This observation lets us hoist the call out of the memory copying loop. It does not buy performance, because most fetches are one byte long anyway, but it prepares for the next patch. The overflow check is tricky, but correct. Because do_insn_fetch_bytes has already been called once, we know that fc->end is at least 15. So it is okay to subtract the number of bytes we want to read. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ea188a338afb..ca82ec9c5ffe 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -708,7 +708,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, * Prefetch the remaining bytes of the instruction without crossing page * boundary if they are not in fetch_cache yet. */ -static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt) +static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) { struct fetch_cache *fc = &ctxt->fetch; int rc; @@ -720,7 +720,14 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt) cur_size = fc->end - fc->start; size = min(15UL - cur_size, PAGE_SIZE - offset_in_page(fc->end)); - if (unlikely(size == 0)) + + /* +* One instruction can only straddle two pages, +* and one has been loaded at the beginning of +* x86_decode_insn. So, if not enough bytes +* still, we must have hit the 15-byte boundary. +*/ + if (unlikely(size < op_size)) return X86EMUL_UNHANDLEABLE; rc = __linearize(ctxt, addr, size, false, true, &linear); if (unlikely(rc != X86EMUL_CONTINUE)) @@ -736,17 +743,18 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt) static int do_insn_fetch(struct x86_emulate_ctxt *ctxt, void *__dest, unsigned size) { - int rc; struct fetch_cache *fc = &ctxt->fetch; u8 *dest = __dest; u8 *src = &fc->data[ctxt->_eip - fc->start]; + /* We have to be careful about overflow! */ + if (unlikely(ctxt->_eip > fc->end - size)) { + int rc = do_insn_fetch_bytes(ctxt, size); + if (rc != X86EMUL_CONTINUE) + return rc; + } + while (size--) { - if (unlikely(ctxt->_eip == fc->end)) { - rc = do_insn_fetch_bytes(ctxt); - if (rc != X86EMUL_CONTINUE) - return rc; - } *dest++ = *src++; ctxt->_eip++; continue; @@ -4228,7 +4236,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) if (insn_len > 0) memcpy(ctxt->fetch.data, insn, insn_len); else { - rc = do_insn_fetch_bytes(ctxt); + rc = do_insn_fetch_bytes(ctxt, 1); if (rc != X86EMUL_CONTINUE) return rc; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/19] KVM: emulate: rework seg_override
From: Bandan Das x86_decode_insn already sets a default for seg_override, so remove it from the zeroed area. Also replace set/get functions with direct access to the field. Signed-off-by: Bandan Das Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 3 +-- arch/x86/kvm/emulate.c | 41 +++--- 2 files changed, 17 insertions(+), 27 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index d40a10a38a80..6c808408326f 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -295,12 +295,10 @@ struct x86_emulate_ctxt { struct operand dst; int (*execute)(struct x86_emulate_ctxt *ctxt); int (*check_perm)(struct x86_emulate_ctxt *ctxt); - bool has_seg_override; bool rip_relative; u8 rex_prefix; u8 lock_prefix; u8 rep_prefix; - u8 seg_override; /* bitmaps of registers in _regs[] that can be read */ u32 regs_valid; /* bitmaps of registers in _regs[] that have been written */ @@ -311,6 +309,7 @@ struct x86_emulate_ctxt { u8 modrm_reg; u8 modrm_rm; u8 modrm_seg; + u8 seg_override; u64 d; unsigned long _eip; struct operand memop; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 3e9bbdc4c76a..08badf638fb0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -514,12 +514,6 @@ static u32 desc_limit_scaled(struct desc_struct *desc) return desc->g ? (limit << 12) | 0xfff : limit; } -static void set_seg_override(struct x86_emulate_ctxt *ctxt, int seg) -{ - ctxt->has_seg_override = true; - ctxt->seg_override = seg; -} - static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg) { if (ctxt->mode == X86EMUL_MODE_PROT64 && seg < VCPU_SREG_FS) @@ -528,14 +522,6 @@ static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg) return ctxt->ops->get_cached_segment_base(ctxt, seg); } -static unsigned seg_override(struct x86_emulate_ctxt *ctxt) -{ - if (!ctxt->has_seg_override) - return 0; - - return ctxt->seg_override; -} - static int emulate_exception(struct x86_emulate_ctxt *ctxt, int vec, u32 error, bool valid) { @@ -4169,7 +4155,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes; op->addr.mem.ea = register_address(ctxt, reg_read(ctxt, VCPU_REGS_RSI)); - op->addr.mem.seg = seg_override(ctxt); + op->addr.mem.seg = ctxt->seg_override; op->val = 0; op->count = 1; break; @@ -4180,7 +4166,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, register_address(ctxt, reg_read(ctxt, VCPU_REGS_RBX) + (reg_read(ctxt, VCPU_REGS_RAX) & 0xff)); - op->addr.mem.seg = seg_override(ctxt); + op->addr.mem.seg = ctxt->seg_override; op->val = 0; break; case OpImmFAddr: @@ -4227,6 +4213,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) int mode = ctxt->mode; int def_op_bytes, def_ad_bytes, goffset, simd_prefix; bool op_prefix = false; + bool has_seg_override = false; struct opcode opcode; ctxt->memop.type = OP_NONE; @@ -4280,11 +4267,13 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) case 0x2e: /* CS override */ case 0x36: /* SS override */ case 0x3e: /* DS override */ - set_seg_override(ctxt, (ctxt->b >> 3) & 3); + has_seg_override = true; + ctxt->seg_override = (ctxt->b >> 3) & 3; break; case 0x64: /* FS override */ case 0x65: /* GS override */ - set_seg_override(ctxt, ctxt->b & 7); + has_seg_override = true; + ctxt->seg_override = ctxt->b & 7; break; case 0x40 ... 0x4f: /* REX */ if (mode != X86EMUL_MODE_PROT64) @@ -4422,17 +4411,19 @@ done_prefixes: /* ModRM and SIB bytes. */ if (ctxt->d & ModRM) { rc = decode_modrm(ctxt, &ctxt->memop); - if (!ctxt->has_seg_override) - set_seg_override(ctxt, ctxt->modrm_seg); + if (!has_seg_override) { + has_seg_override = true; + ctxt->seg_override = ctxt->modrm_seg; + } } else if (ctxt->d & MemAbs)
[PATCH 17/19] KVM: emulate: put pointers in the fetch_cache
This simplifies the code a bit, especially the overflow checks. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 4 ++-- arch/x86/kvm/emulate.c | 34 +++--- arch/x86/kvm/trace.h | 6 +++--- 3 files changed, 20 insertions(+), 24 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index fcf58cd25ebd..eb181178fe0b 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -242,8 +242,8 @@ struct operand { struct fetch_cache { u8 data[15]; - unsigned long start; - unsigned long end; + u8 *ptr; + u8 *end; }; struct read_cache { diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 02c668aca2b6..c16314807756 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -710,16 +710,15 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, */ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) { - struct fetch_cache *fc = &ctxt->fetch; int rc; - int size, cur_size; + int size; unsigned long linear; - + int cur_size = ctxt->fetch.end - ctxt->fetch.data; struct segmented_address addr = { .seg = VCPU_SREG_CS, - .ea = fc->end }; - cur_size = fc->end - fc->start; - size = min(15UL - cur_size, - PAGE_SIZE - offset_in_page(fc->end)); + .ea = ctxt->eip + cur_size }; + + size = min(15UL ^ cur_size, + PAGE_SIZE - offset_in_page(addr.ea)); /* * One instruction can only straddle two pages, @@ -732,19 +731,18 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) rc = __linearize(ctxt, addr, size, false, true, &linear); if (unlikely(rc != X86EMUL_CONTINUE)) return rc; - rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size, + rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end, size, &ctxt->exception); if (unlikely(rc != X86EMUL_CONTINUE)) return rc; - fc->end += size; + ctxt->fetch.end += size; return X86EMUL_CONTINUE; } static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, unsigned size) { - /* We have to be careful about overflow! */ - if (unlikely(ctxt->_eip > ctxt->fetch.end - size)) + if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size)) return __do_insn_fetch_bytes(ctxt, size); else return X86EMUL_CONTINUE; @@ -753,26 +751,24 @@ static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, /* Fetch next part of the instruction being emulated. */ #define insn_fetch(_type, _ctxt) \ ({ _type _x; \ - struct fetch_cache *_fc;\ \ rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \ if (rc != X86EMUL_CONTINUE) \ goto done; \ - _fc = &ctxt->fetch; \ - _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \ ctxt->_eip += sizeof(_type);\ + _x = *(_type __aligned(1) *) ctxt->fetch.ptr; \ + ctxt->fetch.ptr += sizeof(_type); \ _x; \ }) #define insn_fetch_arr(_arr, _size, _ctxt) \ ({ \ - struct fetch_cache *_fc;\ rc = do_insn_fetch_bytes(_ctxt, _size); \ if (rc != X86EMUL_CONTINUE) \ goto done; \ - _fc = &ctxt->fetch; \ - memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size); \ ctxt->_eip += (_size); \ + memcpy(_arr, ctxt->fetch.ptr, _size); \ + ctxt->fetch.ptr += (_size); \ }) /* @@ -4228,8 +4224,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) ctxt->memop.type = OP_NONE; ctxt->memopp = NULL; ctxt->_eip = ctxt->eip; - ctxt->fetch.start = ctxt->_eip; - ctxt->fetch.end = ctxt->fetch.
[PATCH 16/19] KVM: emulate: avoid per-byte copying in instruction fetches
We do not need a memory copying loop anymore in insn_fetch; we can use a byte-aligned pointer to access instruction fields directly from the fetch_cache. This eliminates 50-150 cycles (corresponding to a 5-10% improvement in performance) from each instruction. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 46 ++ 1 file changed, 22 insertions(+), 24 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ca82ec9c5ffe..02c668aca2b6 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -708,7 +708,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, * Prefetch the remaining bytes of the instruction without crossing page * boundary if they are not in fetch_cache yet. */ -static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) +static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) { struct fetch_cache *fc = &ctxt->fetch; int rc; @@ -740,41 +740,39 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) return X86EMUL_CONTINUE; } -static int do_insn_fetch(struct x86_emulate_ctxt *ctxt, -void *__dest, unsigned size) +static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, + unsigned size) { - struct fetch_cache *fc = &ctxt->fetch; - u8 *dest = __dest; - u8 *src = &fc->data[ctxt->_eip - fc->start]; - /* We have to be careful about overflow! */ - if (unlikely(ctxt->_eip > fc->end - size)) { - int rc = do_insn_fetch_bytes(ctxt, size); - if (rc != X86EMUL_CONTINUE) - return rc; - } - - while (size--) { - *dest++ = *src++; - ctxt->_eip++; - continue; - } - return X86EMUL_CONTINUE; + if (unlikely(ctxt->_eip > ctxt->fetch.end - size)) + return __do_insn_fetch_bytes(ctxt, size); + else + return X86EMUL_CONTINUE; } /* Fetch next part of the instruction being emulated. */ #define insn_fetch(_type, _ctxt) \ -({ unsigned long _x; \ - rc = do_insn_fetch(_ctxt, &_x, sizeof(_type)); \ +({ _type _x; \ + struct fetch_cache *_fc;\ + \ + rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \ if (rc != X86EMUL_CONTINUE) \ goto done; \ - (_type)_x; \ + _fc = &ctxt->fetch; \ + _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \ + ctxt->_eip += sizeof(_type);\ + _x; \ }) #define insn_fetch_arr(_arr, _size, _ctxt) \ -({ rc = do_insn_fetch(_ctxt, _arr, (_size)); \ +({ \ + struct fetch_cache *_fc;\ + rc = do_insn_fetch_bytes(_ctxt, _size); \ if (rc != X86EMUL_CONTINUE) \ goto done; \ + _fc = &ctxt->fetch; \ + memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size); \ + ctxt->_eip += (_size); \ }) /* @@ -4236,7 +4234,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) if (insn_len > 0) memcpy(ctxt->fetch.data, insn, insn_len); else { - rc = do_insn_fetch_bytes(ctxt, 1); + rc = __do_insn_fetch_bytes(ctxt, 1); if (rc != X86EMUL_CONTINUE) return rc; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 19/19] KVM: x86: use kvm_read_guest_page for emulator accesses
Emulator accesses are always done a page at a time, either by the emulator itself (for fetches) or because we need to query the MMU for address translations. Speed up these accesses by using kvm_read_guest_page and, in the case of fetches, by inlining kvm_read_guest_virt_helper and dropping the loop around kvm_read_guest_page. This final tweak saves 30-100 more clock cycles (4-10%), bringing the count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series and 925-1700 after the first two low-hanging fruit changes. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 905edf8557e7..f750b69ca443 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4085,7 +4085,8 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, if (gpa == UNMAPPED_GVA) return X86EMUL_PROPAGATE_FAULT; - ret = kvm_read_guest(vcpu->kvm, gpa, data, toread); + ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data, + offset, toread); if (ret < 0) { r = X86EMUL_IO_NEEDED; goto out; @@ -4106,10 +4107,24 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt, { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + unsigned offset; + int ret; - return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, - access | PFERR_FETCH_MASK, - exception); + /* Inline kvm_read_guest_virt_helper for speed. */ + gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK, + exception); + if (unlikely(gpa == UNMAPPED_GVA)) + return X86EMUL_PROPAGATE_FAULT; + + offset = addr & (PAGE_SIZE-1); + if (WARN_ON(offset + bytes > PAGE_SIZE)) + bytes = (unsigned)PAGE_SIZE - offset; + ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, val, + offset, bytes); + if (unlikely(ret < 0)) + return X86EMUL_IO_NEEDED; + + return X86EMUL_CONTINUE; } int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt, -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/19] KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
From: Bandan Das The same information can be gleaned from ctxt->d and avoids having to zero/NULL initialize intercept and check_perm Signed-off-by: Bandan Das Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 851315d93bf7..000fc7398832 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4599,7 +4599,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) fetch_possible_mmx_operand(ctxt, &ctxt->dst); } - if (unlikely(ctxt->guest_mode) && ctxt->intercept) { + if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) { rc = emulator_check_intercept(ctxt, ctxt->intercept, X86_ICPT_PRE_EXCEPT); if (rc != X86EMUL_CONTINUE) @@ -4619,13 +4619,13 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) } /* Do instruction specific permission checks */ - if (ctxt->check_perm) { + if (ctxt->d & CheckPerm) { rc = ctxt->check_perm(ctxt); if (rc != X86EMUL_CONTINUE) goto done; } - if (unlikely(ctxt->guest_mode) && ctxt->intercept) { + if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) { rc = emulator_check_intercept(ctxt, ctxt->intercept, X86_ICPT_POST_EXCEPT); if (rc != X86EMUL_CONTINUE) @@ -4671,7 +4671,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) special_insn: - if (unlikely(ctxt->guest_mode) && ctxt->intercept) { + if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) { rc = emulator_check_intercept(ctxt, ctxt->intercept, X86_ICPT_POST_MEMACCESS); if (rc != X86EMUL_CONTINUE) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 18/19] KVM: x86: ensure emulator fetches do not span multiple pages
When the CS base is not page-aligned, the linear address of the code could get close to the page boundary (e.g. 0x...ffe) even if the EIP value is not. So we need to first linearize the address, and only then compute the number of valid bytes that can be fetched. This happens relatively often when executing real mode code. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c16314807756..6a1d60956d63 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -711,14 +711,18 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) { int rc; - int size; + unsigned size; unsigned long linear; int cur_size = ctxt->fetch.end - ctxt->fetch.data; struct segmented_address addr = { .seg = VCPU_SREG_CS, .ea = ctxt->eip + cur_size }; - size = min(15UL ^ cur_size, - PAGE_SIZE - offset_in_page(addr.ea)); + size = 15UL ^ cur_size; + rc = __linearize(ctxt, addr, size, false, true, &linear); + if (unlikely(rc != X86EMUL_CONTINUE)) + return rc; + + size = min_t(unsigned, size, PAGE_SIZE - offset_in_page(linear)); /* * One instruction can only straddle two pages, @@ -728,9 +732,6 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size) */ if (unlikely(size < op_size)) return X86EMUL_UNHANDLEABLE; - rc = __linearize(ctxt, addr, size, false, true, &linear); - if (unlikely(rc != X86EMUL_CONTINUE)) - return rc; rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end, size, &ctxt->exception); if (unlikely(rc != X86EMUL_CONTINUE)) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend 00/19] Emulator speedups for 3.17
These are the emulator speedup patches that have survived autotest and kvm-unit-tests. I dropped the patches for direct access to memory operands because they caused a failure in vmx.flat. These patches have been in kvm/queue for a while, but I've always left them out of kvm/next because they hadn't undergone full review. Since three months have passed, I'll include them in the next push to kvm/next to give them more exposure and to prepare for 3.17. Paolo Bandan Das (6): KVM: emulate: move init_decode_cache to emulate.c KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks KVM: emulate: cleanup decode_modrm KVM: emulate: clean up initializations in init_decode_cache KVM: emulate: rework seg_override KVM: emulate: do not initialize memopp Paolo Bonzini (13): KVM: vmx: speed up emulation of invalid guest state KVM: x86: return all bits from get_interrupt_shadow KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation KVM: emulate: move around some checks KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())" KVM: emulate: speed up emulated moves KVM: emulate: simplify writeback KVM: emulate: speed up do_insn_fetch KVM: emulate: avoid repeated calls to do_insn_fetch_bytes KVM: emulate: avoid per-byte copying in instruction fetches KVM: emulate: put pointers in the fetch_cache KVM: x86: ensure emulator fetches do not span multiple pages KVM: x86: use kvm_read_guest_page for emulator accesses arch/x86/include/asm/kvm_emulate.h | 31 +-- arch/x86/include/asm/kvm_host.h| 2 +- arch/x86/kvm/emulate.c | 379 +++-- arch/x86/kvm/svm.c | 6 +- arch/x86/kvm/trace.h | 6 +- arch/x86/kvm/vmx.c | 9 +- arch/x86/kvm/x86.c | 80 +--- 7 files changed, 277 insertions(+), 236 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/19] KVM: emulate: do not initialize memopp
From: Bandan Das rip_relative is only set if decode_modrm runs, and if you have ModRM you will also have a memopp. We can then access memopp unconditionally. Note that rip_relative cannot be hoisted up to decode_modrm, or you break "mov $0, xyz(%rip)". Also, move typecast on "out of range value" of mem.ea to decode_modrm. Together, all these optimizations save about 50 cycles on each emulated instructions (4-6%). Signed-off-by: Bandan Das [Fix immediate operands with rip-relative addressing. - Paolo] Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 5 + arch/x86/kvm/emulate.c | 8 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 6c808408326f..fcf58cd25ebd 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -295,6 +295,11 @@ struct x86_emulate_ctxt { struct operand dst; int (*execute)(struct x86_emulate_ctxt *ctxt); int (*check_perm)(struct x86_emulate_ctxt *ctxt); + /* +* The following six fields are cleared together, +* the rest are initialized unconditionally in x86_decode_insn +* or elsewhere +*/ bool rip_relative; u8 rex_prefix; u8 lock_prefix; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 08badf638fb0..390400a54a9c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1177,6 +1177,9 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt, } } op->addr.mem.ea = modrm_ea; + if (ctxt->ad_bytes != 8) + ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea; + done: return rc; } @@ -4425,9 +4428,6 @@ done_prefixes: ctxt->memop.addr.mem.seg = ctxt->seg_override; - if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8) - ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea; - /* * Decode and fetch the source operand: register, memory * or immediate. @@ -4448,7 +4448,7 @@ done_prefixes: rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask); done: - if (ctxt->memopp && ctxt->memopp->type == OP_MEM && ctxt->rip_relative) + if (ctxt->rip_relative) ctxt->memopp->addr.mem.ea += ctxt->_eip; return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/19] KVM: emulate: move init_decode_cache to emulate.c
From: Bandan Das Core emulator functions all belong in emulator.c, x86 should have no knowledge of emulator internals Signed-off-by: Bandan Das Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/kvm/emulate.c | 13 + arch/x86/kvm/x86.c | 13 - 3 files changed, 14 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 432447370044..c22bd9af4311 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -409,6 +409,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt); #define EMULATION_OK 0 #define EMULATION_RESTART 1 #define EMULATION_INTERCEPTED 2 +void init_decode_cache(struct x86_emulate_ctxt *ctxt); int x86_emulate_insn(struct x86_emulate_ctxt *ctxt); int emulator_task_switch(struct x86_emulate_ctxt *ctxt, u16 tss_selector, int idt_index, int reason, diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 32d3da82da2e..851315d93bf7 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4534,6 +4534,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)) return X86EMUL_CONTINUE; } +void init_decode_cache(struct x86_emulate_ctxt *ctxt) +{ + memset(&ctxt->opcode_len, 0, + (void *)&ctxt->_regs - (void *)&ctxt->opcode_len); + + ctxt->fetch.start = 0; + ctxt->fetch.end = 0; + ctxt->io_read.pos = 0; + ctxt->io_read.end = 0; + ctxt->mem_read.pos = 0; + ctxt->mem_read.end = 0; +} + int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) { const struct x86_emulate_ops *ops = ctxt->ops; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd9316786dca..905edf8557e7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4888,19 +4888,6 @@ static void inject_emulated_exception(struct kvm_vcpu *vcpu) kvm_queue_exception(vcpu, ctxt->exception.vector); } -static void init_decode_cache(struct x86_emulate_ctxt *ctxt) -{ - memset(&ctxt->opcode_len, 0, - (void *)&ctxt->_regs - (void *)&ctxt->opcode_len); - - ctxt->fetch.start = 0; - ctxt->fetch.end = 0; - ctxt->io_read.pos = 0; - ctxt->io_read.end = 0; - ctxt->mem_read.pos = 0; - ctxt->mem_read.end = 0; -} - static void init_emulate_ctxt(struct kvm_vcpu *vcpu) { struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote: > +/* > + * Under certain circumstances, access certain MSR may cause #GP. > + * The function tests if the input MSR can be safely accessed. > + */ > +static inline bool check_msr(unsigned long msr) > +{ > + u64 val_old, val_new, val_tmp; > + > + /* > + * Read the current value, change it and read it back to see if it > + * matches, this is needed to detect certain hardware emulators > + * (qemu/kvm) that don't trap on the MSR access and always return 0s. > + */ > + if (rdmsrl_safe(msr, &val_old)) > + goto msr_fail; > + /* > + * Only chagne it slightly, > + * since the higher bits of some MSRs cannot be updated by wrmsrl. > + * E.g. MSR_LBR_TOS > + */ > + val_tmp = val_old ^ 0x3UL; > + if (wrmsrl_safe(msr, val_tmp) || > + rdmsrl_safe(msr, &val_new)) > + goto msr_fail; > + > + if (val_new != val_tmp) > + goto msr_fail; > + > + /* Here it's sure that the MSR can be safely accessed. > + * Restore the old value and return. > + */ > + wrmsrl(msr, val_old); > + > + return true; > + > +msr_fail: > + return false; > +} I don't think we need the msr_fail thing, there's no state to clean up, you can return false at all places you now have goto. pgpkp9qBX8LOU.pgp Description: PGP signature
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote: > From: Kan Liang > > x86, perf: Protect LBR and extra_regs against KVM lying > > With -cpu host, KVM reports LBR and extra_regs support, if the host has > support. > When the guest perf driver tries to access LBR or extra_regs MSR, > it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. > So check the related MSRs access right once at initialization time to avoid > the error access at runtime. > > For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y > (for host kernel). > And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). > Start the guest with -cpu host. > Run perf record with --branch-any or --branch-filter in guest to trigger LBR > #GP. > Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to > trigger offcore_rsp #GP This is still not properly wrapped at 78 chars. > Signed-off-by: Kan Liang > > V2: Move the check code to initialization time. > V3: Add flag for each extra register. > Check all LBR MSRs at initialization time. > V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr > failed. > Disable all extra msrs in creation places if check_msr failed. > V5: Fix check_msr broken > Don't check any more MSRs after the first fail > Return error when checking fail to stop creating the event > Remove the checking code path which never get These things should go below the --- so they get thrown away when applying the patch, its of no relevance once applied. > --- > arch/x86/kernel/cpu/perf_event.c | 3 +++ > arch/x86/kernel/cpu/perf_event.h | 45 > ++ > arch/x86/kernel/cpu/perf_event_intel.c | 38 +++- > 3 files changed, 85 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/perf_event.c > b/arch/x86/kernel/cpu/perf_event.c > index 2bdfbff..a7c5e4b 100644 > --- a/arch/x86/kernel/cpu/perf_event.c > +++ b/arch/x86/kernel/cpu/perf_event.c > @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct > perf_event *event) > continue; > if (event->attr.config1 & ~er->valid_mask) > return -EINVAL; > + /* Check if the extra msrs can be safely accessed*/ > + if (!x86_pmu.extra_msr_access[er->idx]) > + return -EFAULT; This is not a correct usage of -EFAULT. Event creation did not fail because we took a fault dereferencing a user provided pointer. Possibly ENXIO is appropriate. > reg->idx = er->idx; > reg->config = event->attr.config1; > diff --git a/arch/x86/kernel/cpu/perf_event.h > b/arch/x86/kernel/cpu/perf_event.h > index 3b2f9bd..992c678 100644 > --- a/arch/x86/kernel/cpu/perf_event.h > +++ b/arch/x86/kernel/cpu/perf_event.h > @@ -464,6 +464,12 @@ struct x86_pmu { >*/ > struct extra_reg *extra_regs; > unsigned int er_flags; > + /* > + * EXTRA REG MSR can be accessed > + * The extra registers are completely unrelated to each other. > + * So it needs a flag for each extra register. > + */ > + boolextra_msr_access[EXTRA_REG_MAX]; So why not in struct extra_reg again? You didn't give a straight answer there. > +/* > + * Under certain circumstances, access certain MSR may cause #GP. > + * The function tests if the input MSR can be safely accessed. > + */ > +static inline bool check_msr(unsigned long msr) > +{ This reads like a generic function; > + u64 val_old, val_new, val_tmp; > + > + /* > + * Read the current value, change it and read it back to see if it > + * matches, this is needed to detect certain hardware emulators > + * (qemu/kvm) that don't trap on the MSR access and always return 0s. > + */ > + if (rdmsrl_safe(msr, &val_old)) > + goto msr_fail; > + /* > + * Only chagne it slightly, > + * since the higher bits of some MSRs cannot be updated by wrmsrl. > + * E.g. MSR_LBR_TOS > + */ > + val_tmp = val_old ^ 0x3UL; but this is not generally true; not all MSRs can write the 2 LSB, can they? One option would be to extend the function with a u64 mask. > + if (wrmsrl_safe(msr, val_tmp) || > + rdmsrl_safe(msr, &val_new)) > + goto msr_fail; > + > + if (val_new != val_tmp) > + goto msr_fail; > + > + /* Here it's sure that the MSR can be safely accessed. > + * Restore the old value and return. > + */ > + wrmsrl(msr, val_old); > + > + return true; > + > +msr_fail: > + return false; > +} Also, by now this function is far too large to be inline and in a header. > + /* > + * Access LBR MSR may cause #GP under certain circumstances. > + * E.g. KVM doesn't support LBR MSR > + * Check all LBT MSR here. > + * Disable LBR access if any LBR MSRs can not be acc
[PATCH] kvm: ppc: bookehv: Sync srr0/1 into GSRR0/1
This patch adds missing sync of SRR0/1 in set SREGS interface. Signed-off-by: Bharat Bhushan --- arch/powerpc/kvm/booke.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index c2471ed..368b48e 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -1326,6 +1326,10 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) kvmppc_set_msr(vcpu, regs->msr); vcpu->arch.shared->srr0 = regs->srr0; vcpu->arch.shared->srr1 = regs->srr1; +#ifdef CONFIG_KVM_BOOKE_HV + mtspr(SPRN_GSRR0, regs->srr0); + mtspr(SPRN_GSRR1, regs->srr1); +#endif kvmppc_set_pid(vcpu, regs->pid); vcpu->arch.shared->sprg0 = regs->sprg0; vcpu->arch.shared->sprg1 = regs->sprg1; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.
Hi Gleb, Please see below. On 07/12/2014 03:44 PM, Gleb Natapov wrote: On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote: kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But it is never used to refer to the page at all. In vcpu initialization, it indicates two things: 1. indicates if ept page is allocated 2. indicates if a memory slot for identity page is initialized Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept identity pagetable is initialized. So we can remove ept_identity_pagetable. Signed-off-by: Tang Chen --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/vmx.c | 25 +++-- 2 files changed, 11 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4931415..62f973e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -578,7 +578,6 @@ struct kvm_arch { gpa_t wall_clock; - struct page *ept_identity_pagetable; bool ept_identity_pagetable_done; gpa_t ept_identity_map_addr; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 0918635e..fe2e5f4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu); static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx); static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx); static bool vmx_mpx_supported(void); +static int alloc_identity_pagetable(struct kvm *kvm); static DEFINE_PER_CPU(struct vmcs *, vmxarea); static DEFINE_PER_CPU(struct vmcs *, current_vmcs); @@ -3921,21 +3922,21 @@ out: static int init_rmode_identity_map(struct kvm *kvm) { - int i, idx, r, ret; + int i, idx, r, ret = 0; pfn_t identity_map_pfn; u32 tmp; if (!enable_ept) return 1; - if (unlikely(!kvm->arch.ept_identity_pagetable)) { - printk(KERN_ERR "EPT: identity-mapping pagetable " - "haven't been allocated!\n"); - return 0; - } if (likely(kvm->arch.ept_identity_pagetable_done)) return 1; - ret = 0; identity_map_pfn = kvm->arch.ept_identity_map_addr>> PAGE_SHIFT; + + mutex_lock(&kvm->slots_lock); Why move this out of alloc_identity_pagetable()? Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used to protect kvm->arch.ept_identity_pagetable. If two or more threads try to modify it at the same time, the mutex ensures that the identity table is only allocated once. Now, we dropped kvm->arch.ept_identity_pagetable. And use kvm->arch.ept_identity_pagetable_done to check if the identity table is allocated and initialized. So we should protect memory slot operation in alloc_identity_pagetable() and kvm->arch.ept_identity_pagetable_done with this mutex. Of course, I can see that the name "slots_lock" indicates that it may be used to protect the memory slot operation only. Maybe move it out here is not suitable. If I'm wrong, please tell me. + r = alloc_identity_pagetable(kvm); + if (r) + goto out2; + idx = srcu_read_lock(&kvm->srcu); r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE); if (r< 0) @@ -3953,6 +3954,9 @@ static int init_rmode_identity_map(struct kvm *kvm) ret = 1; out: srcu_read_unlock(&kvm->srcu, idx); + +out2: + mutex_unlock(&kvm->slots_lock); return ret; } @@ -4006,9 +4010,6 @@ static int alloc_identity_pagetable(struct kvm *kvm) struct kvm_userspace_memory_region kvm_userspace_mem; int r = 0; - mutex_lock(&kvm->slots_lock); - if (kvm->arch.ept_identity_pagetable) - goto out; kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT; kvm_userspace_mem.flags = 0; kvm_userspace_mem.guest_phys_addr = @@ -4025,9 +4026,7 @@ static int alloc_identity_pagetable(struct kvm *kvm) goto out; } - kvm->arch.ept_identity_pagetable = page; I think we can drop gfn_to_page() above too now. Why would we need it? Yes, will remove it in the next version. Thanks. out: - mutex_unlock(&kvm->slots_lock); return r; } @@ -7583,8 +7582,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) kvm->arch.ept_identity_map_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR; err = -ENOMEM; - if (alloc_identity_pagetable(kvm) != 0) - goto free_vmcs; if (!init_rmode_identity_map(kvm)) goto free_vmcs; } -- 1.8.3.1 -- Gleb. . -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vge
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
Hi Gleb, Thanks for the reply. Please see below. On 07/12/2014 04:04 PM, Gleb Natapov wrote: On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote: apic access page is pinned in memory. As a result, it cannot be migrated/hot-removed. Actually, it is not necessary to be pinned. The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the corresponding ept entry. This patch introduces a new vcpu request named KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this time, and force all the vcpus exit guest, and re-enter guest till they updates the VMCS APIC_ACCESS_ADDR pointer to the new apic access page address, and updates kvm->arch.apic_access_page to the new page. By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism is not used since no MMIO access to APIC is ever done. Have you tested this with "-cpu modelname,-x2apic" qemu flag? I used the following commandline to test the patches. # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp 2 And I think the guest used APIC_ACCESS_ADDR mechanism because the previous patch-set has some problem which will happen when the apic page is accessed. And it did happen. I'll test this patch-set with "-cpu modelname,-x2apic" flag. Signed-off-by: Tang Chen --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/mmu.c | 11 +++ arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/vmx.c | 8 +++- arch/x86/kvm/x86.c | 14 ++ include/linux/kvm_host.h| 2 ++ virt/kvm/kvm_main.c | 12 7 files changed, 53 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 62f973e..9ce6bfd 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -737,6 +737,7 @@ struct kvm_x86_ops { void (*hwapic_isr_update)(struct kvm *kvm, int isr); void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9314678..551693d 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, level, gfn, pfn, prefault); spin_unlock(&vcpu->kvm->mmu_lock); + /* +* apic access page could be migrated. When the guest tries to access +* the apic access page, ept violation will occur, and we can use GUP +* to find the new page. +* +* GUP will wait till the migrate entry be replaced with the new page. +*/ + if (gpa == APIC_DEFAULT_PHYS_BASE) + vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm, + APIC_DEFAULT_PHYS_BASE>> PAGE_SHIFT); Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here? I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here. In kvm_mmu_notifier_invalidate_page() I made the request. And the handler called gfn_to_page_no_pin() to get the new page, which will wait till the migration finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So, when the vcpus were forced to exit the guest mode, they would wait till the VMCS APIC_ACCESS_ADDR pointer was updated. As a result, we don't need to make the request here. + return r; out_unlock: diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 576b525..dc76f29 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3612,6 +3612,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) return; } +static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) +{ + return; +} + static int svm_vm_has_apicv(struct kvm *kvm) { return 0; @@ -4365,6 +4370,7 @@ static struct kvm_x86_ops svm_x86_ops = { .enable_irq_window = enable_irq_window, .update_cr8_intercept = update_cr8_intercept, .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, + .set_apic_access_page_addr = svm_set_apic_access_page_addr, .vm_has_apicv = svm_vm_has_apicv, .load_eoi_exitmap = svm_load_eoi_exitmap, .hwapic_isr_update = svm_hwapic_isr_update, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5532ac8..f7c6313 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3992,7 +3992,7 @@ static int alloc_apic_access_page(struct kvm *kvm) if (r) goto out; -