Re: [PATCH 0/3] KVM: Make KVM_CHECK_EXTENSION a VM ioctl

2014-07-14 Thread Cornelia Huck
On Mon, 14 Jul 2014 19:03:35 +0200
Alexander Graf  wrote:

> On PowerPC we have a small problem :). We can run both HV and PR style VMs
> on the same kvm fd. While this is great, it means that anything that's
> different between the two needs to have a token in form of a VM fd to find
> out which one we're asking for.
> 
> The one thing where this bites us are CAPs. We ask for them on the kvm fd,
> not the vm fd. So we can only take a random guess whether the user is asking
> for HV or PR capabilities.
> 
> So far we got away with this reasonably well - most people will only load one
> of the two modules and the only thing that *really* breaks is hypercall
> exposure to user space, so a PR guest will not be able to do KVM hypercalls
> when HV KVM is loaded on the host, making the magic page unavailable to it.
> 
> But this still isn't a great situation to be in. Instead, we really should
> just make the CHECK_EXTENSION ioctl available at VM level. Then we know for
> sure what user space is asking for.

I think this makes sense.

It adds one situation we didn't have before, though: We now have the
checking ioctl acting on a target where we also have the enabling
ioctl. Previously, it made sense to advertise capabilities that can be
enabled on the vcpus on the kvm fd; but what should happen for a
capability that can be enabled on a vm if the vm fd is queried? Should
it always be advertised, or only if it has been enabled?

(I just noticed that we don't advertise KVM_CAP_S390_IRQCHIP, which is a
capability that can be enabled per-vm...)



[PATCH V6 1/2] perf ignore LBR and extra_rsp

2014-07-14 Thread kan . liang
From: Kan Liang 

x86, perf: Protect LBR and extra_regs against KVM lying

With -cpu host, KVM reports LBR and extra_regs support if the host has support.
When the guest perf driver tries to access the LBR or extra_regs MSRs,
all those MSR accesses #GP, since KVM doesn't handle LBR and extra_regs support.
So check the related MSR access rights once at initialization time to avoid
erroneous accesses at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y
(for the host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for
the guest kernel).
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in the guest to trigger
the LBR #GP.
Run perf stat on offcore events (e.g. LLC-loads/LLC-load-misses ...) in the
guest to trigger the offcore_rsp #GP.

Signed-off-by: Kan Liang 

---
V2: Move the check code to initialization time.
V3: Add flag for each extra register.
Check all LBR MSRs at initialization time.
V4: Remove lbr_msr_access.
For LBR msr, simply set lbr_nr to 0 if check_msr failed.
Disable all extra msrs in creation places if check_msr failed.
V5: Fix broken check_msr
Don't check any more MSRs after the first failure
Return an error when the check fails, to stop creating the event
Remove the checking code path which never gets executed
V6: Move extra_msr_access to struct extra_reg
Modify and move check_msr function to perf_event_intel.c
Other minor code changes

 arch/x86/kernel/cpu/perf_event.c  |  3 ++
 arch/x86/kernel/cpu/perf_event.h  | 20 
 arch/x86/kernel/cpu/perf_event_intel.c| 66 ++-
 arch/x86/kernel/cpu/perf_event_intel_uncore.h |  4 +-
 4 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2bdfbff..2879ecd 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
continue;
if (event->attr.config1 & ~er->valid_mask)
return -EINVAL;
+   /* Check if the extra msrs can be safely accessed */
+   if (!er->extra_msr_access)
+   return -ENXIO;
 
reg->idx = er->idx;
reg->config = event->attr.config1;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..2e613b2 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -295,22 +295,24 @@ struct extra_reg {
u64 config_mask;
u64 valid_mask;
int idx;  /* per_xxx->regs[] reg index */
+   boolextra_msr_access;
 };
 
-#define EVENT_EXTRA_REG(e, ms, m, vm, i) { \
-   .event = (e),   \
-   .msr = (ms),\
-   .config_mask = (m), \
-   .valid_mask = (vm), \
-   .idx = EXTRA_REG_##i,   \
+#define EVENT_EXTRA_REG(e, ms, m, vm, i, a) {  \
+   .event = (e),   \
+   .msr = (ms),\
+   .config_mask = (m), \
+   .valid_mask = (vm), \
+   .idx = EXTRA_REG_##i,   \
+   .extra_msr_access = (a),\
}
 
 #define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx) \
-   EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx)
+   EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx, true)
 
 #define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx) \
EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
-   ARCH_PERFMON_EVENTSEL_UMASK, vm, idx)
+   ARCH_PERFMON_EVENTSEL_UMASK, vm, idx, true)
 
 #define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \
INTEL_UEVENT_EXTRA_REG(c, \
@@ -318,7 +320,7 @@ struct extra_reg {
   0x, \
   LDLAT)
 
-#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
+#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0, false)
 
 union perf_capabilities {
struct {
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..9c234d1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2173,6 +2173,40 @@ static void intel_snb_check_microcode(void)
}
 }
 
+/*
+ * Under certain circumstances, accessing certain MSRs may cause a #GP.
+ * The function tests if the input MSR can be safely accessed.
+ */
+static bool check_msr(unsigned long msr, u64 mask)
+{
+   u64 val_old, val_new, val_tmp;
+
+   /*
+* Read the current value, change it and read it back to see if it
+* matches, this is needed to detect certain hardware emulators
+* (qemu/kvm) that don't trap on the MSR access and always return 0s.
+*/
+   if (
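
[The archived message is truncated at this point. As a rough sketch of what
the check_msr() helper described above looks like -- reconstructed from the
V5 review discussion further down, including the mask parameter Peter asked
for; treat names and details as assumptions rather than the author's final
code:]

	static bool check_msr(unsigned long msr, u64 mask)
	{
		u64 val_old, val_new, val_tmp;

		/*
		 * Read the current value, change it and read it back to see if it
		 * matches; this detects emulators (qemu/kvm) that don't trap the
		 * MSR access and always return 0s.
		 */
		if (rdmsrl_safe(msr, &val_old))
			return false;

		/*
		 * Only flip bits covered by the mask, since the higher bits of
		 * some MSRs (e.g. MSR_LBR_TOS) cannot be updated by wrmsrl().
		 */
		val_tmp = val_old ^ mask;
		if (wrmsrl_safe(msr, val_tmp) ||
		    rdmsrl_safe(msr, &val_new))
			return false;

		if (val_new != val_tmp)
			return false;

		/* The MSR can be safely accessed; restore the old value. */
		wrmsrl(msr, val_old);
		return true;
	}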

[PATCH V6 2/2] kvm: ignore LBR and extra rsp

2014-07-14 Thread kan . liang
From: Kan Liang 

With -cpu host, KVM reports LBR and extra_regs support, so the perf driver may
access the LBR and extra_regs MSRs.
However, there is no LBR and extra_regs virtualization support yet. This could
cause the guest to crash.
As a workaround, KVM simply ignores the LBR and extra_regs MSRs to lie to the
guest.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y
(for the host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for
the guest kernel).
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in the guest to trigger
the LBR #GP.
Run perf stat on offcore events (e.g. LLC-loads/LLC-load-misses ...) in the
guest to trigger the offcore_rsp #GP.

Signed-off-by: Andi Kleen 
Signed-off-by: Kan Liang 

---
V3: add MSR_LBR_TOS
V4: add MSR_LBR_SELECT and MSR_PEBS_LD_LAT_THRESHOLD
V5: set_msr should return 0 to lie to the guest

 arch/x86/kvm/pmu.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cbecaa9..5fd5b44 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,18 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   return 1; /* to avoid crashes */
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
|| get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
@@ -358,6 +370,19 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
*data = pmu->global_ovf_ctrl;
return 0;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   *data = 0;
+   return 0;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
@@ -409,6 +434,19 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   /* dummy for now */
+   return 0;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
-- 
1.8.3.1



[PATCH v2 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd

2014-07-14 Thread Alexander Graf
The KVM_CHECK_EXTENSION ioctl is only available on the kvm fd today. Unfortunately
on PPC some of the capabilities change depending on the way a VM was created.

So instead we need a way to expose capabilities as a VM ioctl, so that we can
see which VM type we're using (HV or PR). To enable this, add the
KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - add CAP to expose the vm based CHECK_EXTENSION ioctl availability
---
 Documentation/virtual/kvm/api.txt |  7 +++--
 include/uapi/linux/kvm.h  |  1 +
 virt/kvm/kvm_main.c   | 58 +--
 3 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6955318..235630c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -148,9 +148,9 @@ of banks, as set via the KVM_X86_SETUP_MCE ioctl.
 
 4.4 KVM_CHECK_EXTENSION
 
-Capability: basic
+Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
 Architectures: all
-Type: system ioctl
+Type: system ioctl, vm ioctl
 Parameters: extension identifier (KVM_CAP_*)
 Returns: 0 if unsupported; 1 (or some other positive integer) if supported
 
@@ -160,6 +160,9 @@ receives an integer that describes the extension availability.
 Generally 0 means no and 1 means yes, but some extensions may report
 additional information in the integer return value.
 
+Based on their initialization different VMs may have different capabilities.
+It is thus encouraged to use the vm ioctl to query for capabilities (available
+with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0418b74..51776ca 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -759,6 +759,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_PSCI_0_2 102
 #define KVM_CAP_PPC_FIXUP_HCALL 103
 #define KVM_CAP_PPC_ENABLE_HCALL 104
+#define KVM_CAP_CHECK_EXTENSION_VM 105
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e28f3ca..1b95cc9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2324,6 +2324,34 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
return 0;
 }
 
+static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
+{
+   switch (arg) {
+   case KVM_CAP_USER_MEMORY:
+   case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
+   case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+#ifdef CONFIG_KVM_APIC_ARCHITECTURE
+   case KVM_CAP_SET_BOOT_CPU_ID:
+#endif
+   case KVM_CAP_INTERNAL_ERROR_DATA:
+#ifdef CONFIG_HAVE_KVM_MSI
+   case KVM_CAP_SIGNAL_MSI:
+#endif
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
+   case KVM_CAP_CHECK_EXTENSION_VM:
+   return 1;
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+   case KVM_CAP_IRQ_ROUTING:
+   return KVM_MAX_IRQ_ROUTES;
+#endif
+   default:
+   break;
+   }
+   return kvm_vm_ioctl_check_extension(kvm, arg);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -2487,6 +2515,9 @@ static long kvm_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_CHECK_EXTENSION:
+   r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
+   break;
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)
@@ -2571,33 +2602,6 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
return r;
 }
 
-static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
-{
-   switch (arg) {
-   case KVM_CAP_USER_MEMORY:
-   case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
-   case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
-#ifdef CONFIG_KVM_APIC_ARCHITECTURE
-   case KVM_CAP_SET_BOOT_CPU_ID:
-#endif
-   case KVM_CAP_INTERNAL_ERROR_DATA:
-#ifdef CONFIG_HAVE_KVM_MSI
-   case KVM_CAP_SIGNAL_MSI:
-#endif
-#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-   case KVM_CAP_IRQFD_RESAMPLE:
-#endif
-   return 1;
-#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-   case KVM_CAP_IRQ_ROUTING:
-   return KVM_MAX_IRQ_ROUTES;
-#endif
-   default:
-   break;
-   }
-   return kvm_vm_ioctl_check_extension(kvm, arg);
-}
-
 static long kvm_dev_ioctl(struct file *filp,
  unsigned int ioctl, unsigned long arg)
 {
-- 
1.8.1.4



Re: [PATCH 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd

2014-07-14 Thread Alexander Graf


On 14.07.14 19:03, Alexander Graf wrote:

The KVM_CHECK_EXTENSION ioctl is only available on the kvm fd today. Unfortunately
on PPC some of the capabilities change depending on the way a VM was created.

So instead we need a way to expose capabilities as a VM ioctl, so that we can
see which VM type we're using (HV or PR). To enable this, add the
KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio.

Signed-off-by: Alexander Graf 


Hrm, I guess this itself should also get a CAP :).


Alex



[PATCH 2/3] KVM: Allow KVM_CHECK_EXTENSION on the vm fd

2014-07-14 Thread Alexander Graf
The KVM_CHECK_EXTENSION ioctl is only available on the kvm fd today. Unfortunately
on PPC some of the capabilities change depending on the way a VM was created.

So instead we need a way to expose capabilities as a VM ioctl, so that we can
see which VM type we're using (HV or PR). To enable this, add the
KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio.

Signed-off-by: Alexander Graf 
---
 Documentation/virtual/kvm/api.txt |  5 +++-
 virt/kvm/kvm_main.c   | 57 ---
 2 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6955318..f0ac03d 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -150,7 +150,7 @@ of banks, as set via the KVM_X86_SETUP_MCE ioctl.
 
 Capability: basic
 Architectures: all
-Type: system ioctl
+Type: system ioctl, vm ioctl
 Parameters: extension identifier (KVM_CAP_*)
 Returns: 0 if unsupported; 1 (or some other positive integer) if supported
 
@@ -160,6 +160,9 @@ receives an integer that describes the extension availability.
 Generally 0 means no and 1 means yes, but some extensions may report
 additional information in the integer return value.
 
+Based on their initialization different VMs may have different capabilities.
+It is thus encouraged to use the vm ioctl to query for capabilities (available
+as of 3.17).
 
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e28f3ca..6f1c1cc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2324,6 +2324,33 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
return 0;
 }
 
+static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
+{
+   switch (arg) {
+   case KVM_CAP_USER_MEMORY:
+   case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
+   case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+#ifdef CONFIG_KVM_APIC_ARCHITECTURE
+   case KVM_CAP_SET_BOOT_CPU_ID:
+#endif
+   case KVM_CAP_INTERNAL_ERROR_DATA:
+#ifdef CONFIG_HAVE_KVM_MSI
+   case KVM_CAP_SIGNAL_MSI:
+#endif
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
+   return 1;
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+   case KVM_CAP_IRQ_ROUTING:
+   return KVM_MAX_IRQ_ROUTES;
+#endif
+   default:
+   break;
+   }
+   return kvm_vm_ioctl_check_extension(kvm, arg);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -2487,6 +2514,9 @@ static long kvm_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_CHECK_EXTENSION:
+   r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
+   break;
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)
@@ -2571,33 +2601,6 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
return r;
 }
 
-static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
-{
-   switch (arg) {
-   case KVM_CAP_USER_MEMORY:
-   case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
-   case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
-#ifdef CONFIG_KVM_APIC_ARCHITECTURE
-   case KVM_CAP_SET_BOOT_CPU_ID:
-#endif
-   case KVM_CAP_INTERNAL_ERROR_DATA:
-#ifdef CONFIG_HAVE_KVM_MSI
-   case KVM_CAP_SIGNAL_MSI:
-#endif
-#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-   case KVM_CAP_IRQFD_RESAMPLE:
-#endif
-   return 1;
-#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-   case KVM_CAP_IRQ_ROUTING:
-   return KVM_MAX_IRQ_ROUTES;
-#endif
-   default:
-   break;
-   }
-   return kvm_vm_ioctl_check_extension(kvm, arg);
-}
-
 static long kvm_dev_ioctl(struct file *filp,
  unsigned int ioctl, unsigned long arg)
 {
-- 
1.8.1.4



[PATCH 1/3] KVM: Rename and add argument to check_extension

2014-07-14 Thread Alexander Graf
In preparation for making the check_extension function available at VM scope,
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.

Signed-off-by: Alexander Graf 
---
 arch/arm/kvm/arm.c | 2 +-
 arch/ia64/kvm/kvm-ia64.c   | 2 +-
 arch/mips/kvm/mips.c   | 2 +-
 arch/powerpc/kvm/powerpc.c | 2 +-
 arch/s390/kvm/kvm-s390.c   | 2 +-
 arch/x86/kvm/x86.c | 2 +-
 include/linux/kvm_host.h   | 2 +-
 virt/kvm/kvm_main.c| 6 +++---
 8 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 3c82b37..cb77f999 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -184,7 +184,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
}
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
switch (ext) {
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 6a4309b..0729ba6 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -190,7 +190,7 @@ void kvm_arch_check_processor_compat(void *rtn)
*(int *)rtn = 0;
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
 
int r;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 4fda672..cd71141 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -886,7 +886,7 @@ int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index fe0257a..2a0e497 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -385,7 +385,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 {
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
/* FIXME!!
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f3e14f..00268ca 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -146,7 +146,7 @@ long kvm_arch_dev_ioctl(struct file *filp,
return -EINVAL;
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f65c22c..e9f0a92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2616,7 +2616,7 @@ out:
return r;
 }
 
-int kvm_dev_ioctl_check_extension(long ext)
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec4e3bd..5065b95 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -602,7 +602,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg);
 int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
 
-int kvm_dev_ioctl_check_extension(long ext);
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
 
 int kvm_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log, int *is_dirty);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..e28f3ca 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2571,7 +2571,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
return r;
 }
 
-static long kvm_dev_ioctl_check_extension_generic(long arg)
+static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
switch (arg) {
case KVM_CAP_USER_MEMORY:
@@ -2595,7 +2595,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
default:
break;
}
-   return kvm_dev_ioctl_check_extension(arg);
+   return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
 static long kvm_dev_ioctl(struct file *filp,
@@ -2614,7 +2614,7 @@ static long kvm_dev_ioctl(struct file *filp,
r = kvm_dev_ioctl_create_vm(arg);
break;
case KVM_CHECK_EXTENSION:
-   r = kvm_dev_ioctl_check_extension_generic(arg);
+   r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
break;
case KVM_GET_VCPU_MMAP_SIZE:
r = -EINVAL;
-- 
1.8.1.4



[PATCH 0/3] KVM: Make KVM_CHECK_EXTENSION a VM ioctl

2014-07-14 Thread Alexander Graf
On PowerPC we have a small problem :). We can run both HV and PR style VMs
on the same kvm fd. While this is great, it means that anything that's
different between the two needs to have a token in form of a VM fd to find
out which one we're asking for.

The one thing where this bites us are CAPs. We ask for them on the kvm fd,
not the vm fd. So we can only take a random guess whether the user is asking
for HV or PR capabilities.

So far we got away with this reasonably well - most people will only load one
of the two modules and the only thing that *really* breaks is hypercall exposure
to user space, so a PR guest will not be able to do KVM hypercalls when HV KVM
is loaded on the host, making the magic page unavailable to it.

But this still isn't a great situation to be in. Instead, we really should just
make the CHECK_EXTENSION ioctl available at VM level. Then we know for sure
what user space is asking for.
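
For userspace this boils down to querying the capability against the vm fd
instead of the system fd. A rough sketch (not part of this series; error
handling omitted, and KVM_CAP_PPC_GET_PVINFO is only used as an example
capability):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int query_pvinfo_cap(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

		/* Old way: system-wide answer, can only guess HV vs. PR. */
		int sys_has = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_GET_PVINFO);

		/* New way: ask the VM itself once the vm-level ioctl is advertised. */
		if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_CHECK_EXTENSION_VM) > 0)
			return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_GET_PVINFO);
		return sys_has;
	}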


Alex 

Alexander Graf (3):
  KVM: Rename and add argument to check_extension
  KVM: Allow KVM_CHECK_EXTENSION on the vm fd
  KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode

 Documentation/virtual/kvm/api.txt |  5 +++-
 arch/arm/kvm/arm.c|  2 +-
 arch/ia64/kvm/kvm-ia64.c  |  2 +-
 arch/mips/kvm/mips.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 14 +++---
 arch/s390/kvm/kvm-s390.c  |  2 +-
 arch/x86/kvm/x86.c|  2 +-
 include/linux/kvm_host.h  |  2 +-
 virt/kvm/kvm_main.c   | 59 ---
 9 files changed, 51 insertions(+), 39 deletions(-)

-- 
1.8.1.4



[PATCH 3/3] KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode

2014-07-14 Thread Alexander Graf
With Book3S KVM we can create both PR and HV VMs in parallel on the same
machine. That gives us new challenges on the CAPs we return - both have
different capabilities.

When we get asked about CAPs on the kvm fd, there's nothing we can do. We
can try to be smart and assume we're running HV if HV is available, PR
otherwise. However with the newly added VM CHECK_EXTENSION we can now ask
for capabilities directly on a VM which knows whether it's PR or HV.

With this patch I can successfully expose KVM PVINFO data to user space
in the PR case, fixing magic page mapping for PAPR guests.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/powerpc.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2a0e497..6e13fcf 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -388,11 +388,17 @@ void kvm_arch_sync_events(struct kvm *kvm)
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
int r;
-   /* FIXME!!
-* Should some of this be vm ioctl ? is it possible now ?
-*/
+   /* Assume we're using HV mode when the HV module is loaded */
int hv_enabled = kvmppc_hv_ops ? 1 : 0;
 
+   if (kvm) {
+   /*
+* Hooray - we know which VM type we're running on. Depend on
+* that rather than the guess above.
+*/
+   hv_enabled = is_kvmppc_hv_enabled(kvm);
+   }
+
switch (ext) {
 #ifdef CONFIG_BOOKE
case KVM_CAP_PPC_BOOKE_SREGS:
-- 
1.8.1.4



Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra

so once more; and then I'm going to route your emails to /dev/null, wrap
text at 78 chars.

On Mon, Jul 14, 2014 at 02:28:36PM +, Liang, Kan wrote:
> > > +++ b/arch/x86/kernel/cpu/perf_event.h
> > > @@ -464,6 +464,12 @@ struct x86_pmu {
> > >*/
> > >   struct extra_reg *extra_regs;
> > >   unsigned int er_flags;
> > > + /*
> > > +  * EXTRA REG MSR can be accessed
> > > +  * The extra registers are completely unrelated to each other.
> > > +  * So it needs a flag for each extra register.
> > > +  */
> > > + boolextra_msr_access[EXTRA_REG_MAX];
> > 
> > So why not in struct extra_reg again? You didn't give a straight answer 
> > there.
> 
> I think I did in the email.
> You mentioned that there's still (only) 4 empty bytes at the tail of
> extra_reg itself.  However, the extra_reg_type may be extended in the
> near future.  So that may not be a reason to move to extra_reg.

Well, you can always grow. Also be explicit, 'may be' is an empty
statement.

> Furthermore, if we move extra_msr_access to extra_reg, I guess we have
> to modify all the related micros (i.e EVENT_EXTRA_REG) to initialize
> the new items.  That could be a big change.

Nah, trivial stuff :-)

> On the other side, in x86_pmu structure, there are extra_regs related
> items defined under the comments "Extra registers for events".  And
> the bit holes are enough for current usage and future extension.
> 
> So I guess  x86_pmu should be a good place to store the availability
> of the reg.

It just doesn't make sense to me to have multiple arrays of the same
thing.





Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors

2014-07-14 Thread James Hogan
On 14/07/14 15:35, Peter Maydell wrote:
> On 14 July 2014 14:33, James Hogan  wrote:
>> On 10/07/14 13:17, Peter Maydell wrote:
>>> More generally, there doesn't really seem to be provision in the
>>> KVM KVM_EXIT_MMIO API for returning "this access failed".
>>> I guess in theory userspace could do all the "figure out how
>>> to adjust CPU state to do exception entry and then run VCPU",
>>> but that seems like quite a lot of work which the kernel already
>>> knows how to do; is there some way to provide a simpler API
>>> that lets userspace just inform the kernel that it needs to
>>> fault the access?
>>
>> Indeed. Paolo's idea would work well I think. A data bus error exception
>> would likely be the only sensible error response required other than
>> ignoring writes or returning a garbage value for a read (which the
>> current KVM MMIO API already allows).
> 
> I think we should make the API at least permit returning an
> (architecture-specific) error code to the kernel, rather than just
> a single boolean "failed" response. For instance on ARM the
> fault status registers include a single ExT bit for classifying
> the type of an external abort. (The meaning of the bit is
> IMPDEF; on the Cortex-A15 it can be used to distinguish
> AXI bus DECERR ("decode error", ie the interconnect doesn't
> have anything attached at that address) and SLVERR ("slave
> error", ie there was a device at that address but it chose to
> reject the transaction for some reason, eg unsupported
> transfer size, timeout, write to read-only location, FIFO
> overrun or just because the device was in a bad mood.)

Agreed (I wasn't suggesting a bool, just thinking out loud about how
mips would use that arch specific value :-) ).

Cheers
James


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-14 Thread Gleb Natapov
CCing Jan to check my nested kvm findings below.

On Mon, Jul 14, 2014 at 03:57:09PM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> Thanks for the reply. Please see below.
> 
> On 07/12/2014 04:04 PM, Gleb Natapov wrote:
> >On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:
> >>apic access page is pinned in memory. As a result, it cannot be 
> >>migrated/hot-removed.
> >>Actually, it is not necessary to be pinned.
> >>
> >>The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
> >>the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
> >>corresponding ept entry. This patch introduces a new vcpu request named
> >>KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this 
> >>time,
> >>and force all the vcpus exit guest, and re-enter guest till they updates 
> >>the VMCS
> >>APIC_ACCESS_ADDR pointer to the new apic access page address, and updates
> >>kvm->arch.apic_access_page to the new page.
> >>
> >By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism
> >is not used since no MMIO access to APIC is ever done. Have you tested
> >this with "-cpu modelname,-x2apic" qemu flag?
> 
> I used the following commandline to test the patches.
> 
> # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp
> 2
> 
That most likely uses x2apic.

> And I think the guest used APIC_ACCESS_ADDR mechanism because the previous
> patch-set has some problem which will happen when the apic page is accessed.
> And it did happen.
> 
> I'll test this patch-set with "-cpu modelname,-x2apic" flag.
> 
Replace "modelname" with one of supported model names such as nehalem of course 
:)

> >
> >>Signed-off-by: Tang Chen
> >>---
> >>  arch/x86/include/asm/kvm_host.h |  1 +
> >>  arch/x86/kvm/mmu.c  | 11 +++
> >>  arch/x86/kvm/svm.c  |  6 ++
> >>  arch/x86/kvm/vmx.c  |  8 +++-
> >>  arch/x86/kvm/x86.c  | 14 ++
> >>  include/linux/kvm_host.h|  2 ++
> >>  virt/kvm/kvm_main.c | 12 
> >>  7 files changed, 53 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/arch/x86/include/asm/kvm_host.h 
> >>b/arch/x86/include/asm/kvm_host.h
> >>index 62f973e..9ce6bfd 100644
> >>--- a/arch/x86/include/asm/kvm_host.h
> >>+++ b/arch/x86/include/asm/kvm_host.h
> >>@@ -737,6 +737,7 @@ struct kvm_x86_ops {
> >>void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> >>void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> >>void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> >>+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> >>void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> >>void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> >>int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>index 9314678..551693d 100644
> >>--- a/arch/x86/kvm/mmu.c
> >>+++ b/arch/x86/kvm/mmu.c
> >>@@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, 
> >>gva_t gpa, u32 error_code,
> >> level, gfn, pfn, prefault);
> >>spin_unlock(&vcpu->kvm->mmu_lock);
> >>
> >>+   /*
> >>+* apic access page could be migrated. When the guest tries to access
> >>+* the apic access page, ept violation will occur, and we can use GUP
> >>+* to find the new page.
> >>+*
> >>+* GUP will wait till the migrate entry be replaced with the new page.
> >>+*/
> >>+   if (gpa == APIC_DEFAULT_PHYS_BASE)
> >>+   vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
> >>+   APIC_DEFAULT_PHYS_BASE>>  PAGE_SHIFT);
> >Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?
> 
> I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here.
> 
> In kvm_mmu_notifier_invalidate_page() I made the request. And the handler
> called
> gfn_to_page_no_pin() to get the new page, which will wait till the migration
> finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So, when the
> vcpus
> were forced to exit the guest mode, they would wait till the VMCS
> APIC_ACCESS_ADDR
> pointer was updated.
> 
> As a result, we don't need to make the request here.
OK, I do not see what the purpose of the code here is, then.

> 
> 
> >
> >>+
> >>return r;
> >>
> >>  out_unlock:
> >>diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >>index 576b525..dc76f29 100644
> >>--- a/arch/x86/kvm/svm.c
> >>+++ b/arch/x86/kvm/svm.c
> >>@@ -3612,6 +3612,11 @@ static void svm_set_virtual_x2apic_mode(struct 
> >>kvm_vcpu *vcpu, bool set)
> >>return;
> >>  }
> >>
> >>+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> >>+{
> >>+   return;
> >>+}
> >>+
> >>  static int svm_vm_has_apicv(struct kvm *kvm)
> >>  {
> >>return 0;
> >>@@ -4365,6 +4370,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> >>.enable_irq_window = enable_irq_window,
>

Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors

2014-07-14 Thread Peter Maydell
On 14 July 2014 14:33, James Hogan  wrote:
> On 10/07/14 13:17, Peter Maydell wrote:
>> More generally, there doesn't really seem to be provision in the
>> KVM KVM_EXIT_MMIO API for returning "this access failed".
>> I guess in theory userspace could do all the "figure out how
>> to adjust CPU state to do exception entry and then run VCPU",
>> but that seems like quite a lot of work which the kernel already
>> knows how to do; is there some way to provide a simpler API
>> that lets userspace just inform the kernel that it needs to
>> fault the access?
>
> Indeed. Paolo's idea would work well I think. A data bus error exception
> would likely be the only sensible error response required other than
> ignoring writes or returning a garbage value for a read (which the
> current KVM MMIO API already allows).

I think we should make the API at least permit returning an
(architecture-specific) error code to the kernel, rather than just
a single boolean "failed" response. For instance on ARM the
fault status registers include a single ExT bit for classifying
the type of an external abort. (The meaning of the bit is
IMPDEF; on the Cortex-A15 it can be used to distinguish
AXI bus DECERR ("decode error", ie the interconnect doesn't
have anything attached at that address) and SLVERR ("slave
error", ie there was a device at that address but it chose to
reject the transaction for some reason, eg unsupported
transfer size, timeout, write to read-only location, FIFO
overrun or just because the device was in a bad mood.)

thanks
-- PMM


RE: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Liang, Kan


> > diff --git a/arch/x86/kernel/cpu/perf_event.h
> > b/arch/x86/kernel/cpu/perf_event.h
> > index 3b2f9bd..992c678 100644
> > --- a/arch/x86/kernel/cpu/perf_event.h
> > +++ b/arch/x86/kernel/cpu/perf_event.h
> > @@ -464,6 +464,12 @@ struct x86_pmu {
> >  */
> > struct extra_reg *extra_regs;
> > unsigned int er_flags;
> > +   /*
> > +* EXTRA REG MSR can be accessed
> > +* The extra registers are completely unrelated to each other.
> > +* So it needs a flag for each extra register.
> > +*/
> > +   boolextra_msr_access[EXTRA_REG_MAX];
> 
> So why not in struct extra_reg again? You didn't give a straight answer there.

I think I did in the email.
You mentioned that there's still (only) 4 empty bytes at the tail of extra_reg 
itself. 
However, the extra_reg_type may be extended in the near future.
So that may not be a reason to move to extra_reg.

Furthermore, if we move extra_msr_access to extra_reg,
I guess we have to modify all the related macros (i.e. EVENT_EXTRA_REG) to
initialize the new items.
That could be a big change.

On the other side, in the x86_pmu structure, there are extra_regs related items
defined under the comment "Extra registers for events".
And the bit holes are enough for current usage and future extension.

So I guess x86_pmu should be a good place to store the availability of the reg.

 /* --- cacheline 6 boundary (384 bytes) --- */
 bool   lbr_double_abort; /*   384 1 */
 
 /* XXX 7 bytes hole, try to pack */
 
 struct extra_reg * extra_regs;   /*   392 8 */
 unsigned int   er_flags; /*   400 4 */
 
 /* XXX 4 bytes hole, try to pack */
 
 struct perf_guest_switch_msr * (*guest_get_msrs)(int *); /*   408 8 */
 
 /* size: 416, cachelines: 7, members: 64 */
 /* sum members: 391, holes: 6, sum holes: 25 */
 /* bit holes: 1, sum bit holes: 27 bits */
 /* last cacheline: 32 bytes */

> 
> > +/*
> > + * Under certain circumstances, access certain MSR may cause #GP.
> > + * The function tests if the input MSR can be safely accessed.
> > + */
> > +static inline bool check_msr(unsigned long msr) {
> 
> This reads like a generic function;
> 
> > +   u64 val_old, val_new, val_tmp;
> > +
> > +   /*
> > +* Read the current value, change it and read it back to see if it
> > +* matches, this is needed to detect certain hardware emulators
> > +* (qemu/kvm) that don't trap on the MSR access and always return
> 0s.
> > +*/
> > +   if (rdmsrl_safe(msr, &val_old))
> > +   goto msr_fail;
> > +   /*
> > +* Only change it slightly,
> > +* since the higher bits of some MSRs cannot be updated by wrmsrl.
> > +* E.g. MSR_LBR_TOS
> > +*/
> > +   val_tmp = val_old ^ 0x3UL;
> 
> but this is not generally true; not all MSRs can write the 2 LSB, can they?
> One option would be to extend the function with a u64 mask.

Right, the function should be easily used to check all MSRs, not just for the 
MSRs I tested.
I will pass a mask as a parameter of the function.  

> 
> > +   if (wrmsrl_safe(msr, val_tmp) ||
> > +   rdmsrl_safe(msr, &val_new))
> > +   goto msr_fail;
> > +
> > +   if (val_new != val_tmp)
> > +   goto msr_fail;
> > +
> > +   /* Here it's sure that the MSR can be safely accessed.
> > +* Restore the old value and return.
> > +*/
> > +   wrmsrl(msr, val_old);
> > +
> > +   return true;
> > +
> > +msr_fail:
> > +   return false;
> > +}
> 
> Also, by now this function is far too large to be inline and in a header.

OK. I will move it to perf_event_intel.c as a static function.



Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.

2014-07-14 Thread Gleb Natapov
On Mon, Jul 14, 2014 at 05:17:04PM +0800, Tang Chen wrote:
> On 07/12/2014 03:44 PM, Gleb Natapov wrote:
> >On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:
> >>kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> >>it is never used to refer to the page at all.
> >>
> >>In vcpu initialization, it indicates two things:
> >>1. indicates if ept page is allocated
> >>2. indicates if a memory slot for identity page is initialized
> >>
> >>Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> >>identity pagetable is initialized. So we can remove ept_identity_pagetable.
> >>
> >>Signed-off-by: Tang Chen
> >>---
> >>  arch/x86/include/asm/kvm_host.h |  1 -
> >>  arch/x86/kvm/vmx.c  | 25 +++--
> >>  2 files changed, 11 insertions(+), 15 deletions(-)
> >>
> >>diff --git a/arch/x86/include/asm/kvm_host.h 
> >>b/arch/x86/include/asm/kvm_host.h
> >>index 4931415..62f973e 100644
> >>--- a/arch/x86/include/asm/kvm_host.h
> >>+++ b/arch/x86/include/asm/kvm_host.h
> >>@@ -578,7 +578,6 @@ struct kvm_arch {
> >>
> >>gpa_t wall_clock;
> >>
> >>-   struct page *ept_identity_pagetable;
> >>bool ept_identity_pagetable_done;
> >>gpa_t ept_identity_map_addr;
> >>
> >>diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>index 0918635e..fe2e5f4 100644
> >>--- a/arch/x86/kvm/vmx.c
> >>+++ b/arch/x86/kvm/vmx.c
> >>@@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu 
> >>*vcpu);
> >>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
> >>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> >>  static bool vmx_mpx_supported(void);
> >>+static int alloc_identity_pagetable(struct kvm *kvm);
> >>
> >>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> >>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> >>@@ -3921,21 +3922,21 @@ out:
> >>
> >>  static int init_rmode_identity_map(struct kvm *kvm)
> >>  {
> >>-   int i, idx, r, ret;
> >>+   int i, idx, r, ret = 0;
> >>pfn_t identity_map_pfn;
> >>u32 tmp;
> >>
> >>if (!enable_ept)
> >>return 1;
> >>-   if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> >>-   printk(KERN_ERR "EPT: identity-mapping pagetable "
> >>-   "haven't been allocated!\n");
> >>-   return 0;
> >>-   }
> >>if (likely(kvm->arch.ept_identity_pagetable_done))
> >>return 1;
> >>-   ret = 0;
> >>identity_map_pfn = kvm->arch.ept_identity_map_addr>>  PAGE_SHIFT;
> >>+
> >>+   mutex_lock(&kvm->slots_lock);
> >Why move this out of alloc_identity_pagetable()?
> >
> 
> Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used
> to protect kvm->arch.ept_identity_pagetable. If two or more threads try to
> modify it at the same time, the mutex ensures that the identity table is
> only
> allocated once.
> 
> Now, we dropped kvm->arch.ept_identity_pagetable. And use
> kvm->arch.ept_identity_pagetable_done
> to check if the identity table is allocated and initialized. So we should
> protect
> memory slot operation in alloc_identity_pagetable() and
> kvm->arch.ept_identity_pagetable_done
> with this mutex.
> 
> Of course, I can see that the name "slots_lock" indicates that it may be
> used
> to protect the memory slot operation only. Maybe move it out here is not
> suitable.
> 
> If I'm wrong, please tell me.
> 
No, you are right that besides memory slot creation slots_lock protects checking
of ept_identity_pagetable here, but after your patch ept_identity_pagetable_done
is tested outside of slots_lock, so the allocation can happen twice, no?

--
Gleb.


Re: [PATCH 0/6] IRQFD without IRQ routing, enabled for XICS

2014-07-14 Thread Cornelia Huck
On Mon, 30 Jun 2014 20:51:08 +1000
Paul Mackerras  wrote:

> I would like to see this go into 3.17.

FWIW: I've given this a whirl on s390 (with a dataplane disk), and
everything seems to work as before.

The only thing which I think is worth mentioning is that embedding the
routing entry into the irqfd struct will grow it a bit, which might be
noticeable on large installations with hundreds of devices. OTOH, the
routing entry isn't too large, so I don't think it will become a
problem.

> 
>  arch/ia64/kvm/Kconfig|   1 +
>  arch/powerpc/kvm/Kconfig |   3 +
>  arch/powerpc/kvm/book3s_hv_rm_xics.c |   5 ++
>  arch/powerpc/kvm/book3s_xics.c   |  55 +++---
>  arch/powerpc/kvm/book3s_xics.h   |   2 +
>  arch/powerpc/kvm/mpic.c  |   4 +-
>  arch/s390/kvm/Kconfig|   1 +
>  arch/s390/kvm/interrupt.c|   3 +-
>  arch/x86/kvm/Kconfig |   1 +
>  include/linux/kvm_host.h |  43 ---
>  virt/kvm/Kconfig |   3 +
>  virt/kvm/eventfd.c   | 134 ++-
>  virt/kvm/irq_comm.c  |  24 +++
>  virt/kvm/irqchip.c   |  98 ++---
>  virt/kvm/kvm_main.c  |   2 +-
>  15 files changed, 227 insertions(+), 152 deletions(-)



Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Paolo Bonzini

Il 14/07/2014 14:48, Peter Zijlstra ha scritto:

In fact there's no reason why LBR cannot be virtualized (though it does need
support from the processor), and it may even be possible to support
OFFCORE_RSP_X in the KVM virtual PMU.


But its not, so something needs to be done, right?


A first thing that could be done, is to have a way for the kernel to detect
absence of LBR


Which is exactly what this patch does, no?


A documented (and recommended) way for the kernel to detect it is 
superior though...



, for example an all-1s setting of the LBR format field of
IA32_PERF_CAPABILITIES.  If Intel can tell us "all 1s will never be used",
we can have KVM set the field that way.  The kernel then should disable LBR.


So what's wrong with testing if the MSRs that provide LBR actually work
or not? That doesn't require any magic co-operation of the host/vm
machinery and is therefore far more robust.


... because the difference is that whenever we hack the kernel, Windows 
and vTune will remain broken forever.  Whenever we get Intel to make the 
hack part of the spec, there is some hope that Windows and vTune will 
eventually get fixed.


The hack certainly works, I'm not disputing that.

Paolo



RE: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Liang, Kan


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Monday, July 14, 2014 9:40 AM
> To: Liang, Kan; Peter Zijlstra
> Cc: a...@firstfloor.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
> 
> Il 14/07/2014 15:36, Liang, Kan ha scritto:
> >> > Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage
> >> > or just no events reported?
> >> >
> > Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will
> handle the #GP. So there is no error report in guest.
> 
> Yeah, but what's the effect on the output of perf?

They can still be counted/sampled, but the results are garbage.

Kan

> 
> Paolo



Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Paolo Bonzini

Il 14/07/2014 15:36, Liang, Kan ha scritto:

> Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage or
> just no events reported?
>

Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will handle 
the #GP. So there is no error report in guest.


Yeah, but what's the effect on the output of perf?

Paolo



RE: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Liang, Kan


> >>> For reproducing the issue, please build the kernel with
> CONFIG_KVM_INTEL = y (for host kernel).
> >>> And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest
> kernel).
> >>
> >> I'm not sure this is a useful patch.
> >>
> >> This is #GP'ing just because of a limitation in the PMU; just compile
> >> the kernel with CONFIG_PARAVIRT
> >
> > How's that going to help? If you run kvm -host the VM is lying through
> > its teeth is the kernel is going to assume all those MSRs present,
> > PARAVIRT isn't going to help with this.
> >
> >> , or split the "rdmsr is always rdmsr_safe"
> >> behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.
> >
> > That's not useful either, because non of these code-paths are going to
> > check the return value.
> 
> Hmmm, I thought rdmsr_safe was going to return zero, but it just returns
> whatever happened to be in edx:eax which maybe should also be fixed.
> 
> Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage or
> just no events reported?
> 

Guest rdmsr/wrmsr will eventually call rdmsr_safe/wrmsr_safe. They will handle 
the #GP. So there is no error report in guest.




Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors

2014-07-14 Thread James Hogan
Hi Peter,

On 10/07/14 13:17, Peter Maydell wrote:
> On 17 June 2014 23:10, James Hogan  wrote:
>> The patchset depends on v4 of "target-mips: implement UserLocal
>> Register". I'm aiming for QEMU 2.1, hopefully it isn't too late to get
>> some final review.
>>
>> Thanks to everybody who has already taken part in review.
>>
>> This patchset implements KVM support for MIPS32 processors, using Trap &
>> Emulation.
> 
> I was looking at what happens for MMIO accesses to nonexistent
> memory when we're running under KVM, and interestingly this
> patchset means MIPS is now the only CPU which both (a) supports
> KVM and (b) has an implementation of the do_unassigned_access()
> CPU hook. Does the current code support a guest attempt to
> access unassigned physical addresses? I had a look at the code
> and it seems like mips_cpu_unassigned_access() will end up
> calling cpu_restore_state() and cpu_loop_exit(), which I think
> will probably crash if you call them when using KVM rather than
> TCG...

Yes, I have observed this in the past when experimentally writing to the
Malta reset region from the guest (see the patch "mips/malta: allow
volatile writes to reset flash" which didn't get applied but worked
around the specific issue). That was because read only memory regions
were treated as unassigned from the point of view of writes (which tbh
seems wrong), but it could happen with any unassigned access >
0x8000 from the guest AFAICT.

So yeh, until mips_cpu_unassigned_access does something more portable
for KVM, conditionally setting do_unassigned_access only if
!kvm_enabled() looks sensible. I'll see if I can reproduce it and submit
a patch.
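
An equivalent (maybe simpler) variant of that guard would be to bail out of
the hook itself when KVM is in use. A sketch -- the parameter list follows my
recollection of the CPUUnassignedAccess prototype, so treat the exact
signature as an assumption; the point is only that the TCG-specific unwinding
is skipped under KVM:

    void mips_cpu_unassigned_access(CPUState *cs, hwaddr addr,
                                    bool is_write, bool is_exec,
                                    int opaque, unsigned size)
    {
        MIPSCPU *cpu = MIPS_CPU(cs);
        CPUMIPSState *env = &cpu->env;

        /*
         * Under KVM there is no TCG translation state to unwind, so
         * cpu_restore_state()/cpu_loop_exit() must not be called here;
         * just ignore the access for now.
         */
        if (kvm_enabled()) {
            return;
        }

        /* TCG path: raise a bus error exception as before. */
        if (is_exec) {
            helper_raise_exception(env, EXCP_IBE);
        } else {
            helper_raise_exception(env, EXCP_DBE);
        }
    }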

> More generally, there doesn't really seem to be provision in the
> KVM KVM_EXIT_MMIO API for returning "this access failed".
> I guess in theory userspace could do all the "figure out how
> to adjust CPU state to do exception entry and then run VCPU",
> but that seems like quite a lot of work which the kernel already
> knows how to do; is there some way to provide a simpler API
> that lets userspace just inform the kernel that it needs to
> fault the access?

Indeed. Paolo's idea would work well I think. A data bus error exception
would likely be the only sensible error response required other than
ignoring writes or returning a garbage value for a read (which the
current KVM MMIO API already allows).

Cheers
James


Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 02:40:33PM +0200, Paolo Bonzini wrote:
> Hmmm, I thought rdmsr_safe was going to return zero, but it just returns
> whatever happened to be in edx:eax which maybe should also be fixed.
> 
> Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage or just no
> events reported?

Either way, it's not functioning according to 'spec'. Therefore we should
'disable' things.

> >>In fact there's no reason why LBR cannot be virtualized (though it does need
> >>support from the processor), and it may even be possible to support
> >>OFFCORE_RSP_X in the KVM virtual PMU.
> >
> >But its not, so something needs to be done, right?
> 
> A first thing that could be done, is to have a way for the kernel to detect
> absence of LBR

Which is exactly what this patch does, no?

> , for example an all-1s setting of the LBR format field of
> IA32_PERF_CAPABILITIES.  If Intel can tell us "all 1s will never be used",
> we can have KVM set the field that way.  The kernel then should disable LBR.

So what's wrong with testing if the MSRs that provide LBR actually work
or not? That doesn't require any magic co-operation of the host/vm
machinery and is therefore far more robust.




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Paolo Bonzini

Il 14/07/2014 14:09, Peter Zijlstra ha scritto:

On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote:

Il 10/07/2014 12:59, kan.li...@intel.com ha scritto:

From: Kan Liang 

x86, perf: Protect LBR and extra_regs against KVM lying

With -cpu host, KVM reports LBR and extra_regs support, if the host has support.
When the guest perf driver tries to access LBR or extra_regs MSR,
it #GPs all MSR accesses, since KVM doesn't handle LBR and extra_regs support.
So check the related MSRs access right once at initialization time to avoid the 
error access at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
(for host kernel).
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).


I'm not sure this is a useful patch.

This is #GP'ing just because of a limitation in the PMU; just compile the
kernel with CONFIG_PARAVIRT


How's that going to help? If you run kvm -host the VM is lying through
its teeth and the kernel is going to assume all those MSRs are present;
PARAVIRT isn't going to help with this.


, or split the "rdmsr is always rdmsr_safe"
behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.


That's not useful either, because none of these code-paths are going to
check the return value.


Hmmm, I thought rdmsr_safe was going to return zero, but it just returns 
whatever happened to be in edx:eax which maybe should also be fixed.


Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage or 
just no events reported?



In fact there's no reason why LBR cannot be virtualized (though it does need
support from the processor), and it may even be possible to support
OFFCORE_RSP_X in the KVM virtual PMU.


But its not, so something needs to be done, right?


A first thing that could be done, is to have a way for the kernel to 
detect absence of LBR, for example an all-1s setting of the LBR format 
field of IA32_PERF_CAPABILITIES.  If Intel can tell us "all 1s will 
never be used", we can have KVM set the field that way.  The kernel then 
should disable LBR.
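
Concretely, the kernel-side check could look something like this (a sketch
only, assuming the all-1s encoding really is reserved; the MSR and field
layout are the existing IA32_PERF_CAPABILITIES LBR format bits 5:0):

	static bool lbr_absent_per_perf_capabilities(void)
	{
		u64 caps;

		/* If the MSR itself faults, there is certainly no usable LBR. */
		if (rdmsrl_safe(MSR_IA32_PERF_CAPABILITIES, &caps))
			return true;

		/* Hypothetical convention: all-1s LBR format means "no LBR". */
		return (caps & 0x3f) == 0x3f;
	}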


Paolo


Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote:
> On 10/07/2014 12:59, kan.li...@intel.com wrote:
> >From: Kan Liang 
> >
> >x86, perf: Protect LBR and extra_regs against KVM lying
> >
> >With -cpu host, KVM reports LBR and extra_regs support, if the host has 
> >support.
> >When the guest perf driver tries to access LBR or extra_regs MSR,
> >it #GPs all MSR accesses, since KVM doesn't handle LBR and extra_regs support.
> >So check the related MSRs access right once at initialization time to avoid 
> >the error access at runtime.
> >
> >For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
> >(for host kernel).
> >And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
> 
> I'm not sure this is a useful patch.
> 
> This is #GP'ing just because of a limitation in the PMU; just compile the
> kernel with CONFIG_PARAVIRT

How's that going to help? If you run kvm -cpu host, the VM is lying through
its teeth and the kernel is going to assume all those MSRs are present;
PARAVIRT isn't going to help with this.

> , or split the "rdmsr is always rdmsr_safe"
> behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.

That's not useful either, because none of these code-paths are going to
check the return value.

> In fact there's no reason why LBR cannot be virtualized (though it does need
> support from the processor), and it may even be possible to support
> OFFCORE_RSP_X in the KVM virtual PMU.

But it's not, so something needs to be done, right?




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Paolo Bonzini

On 10/07/2014 12:59, kan.li...@intel.com wrote:

From: Kan Liang 

x86, perf: Protect LBR and extra_regs against KVM lying

With -cpu host, KVM reports LBR and extra_regs support, if the host has support.
When the guest perf driver tries to access LBR or extra_regs MSR,
it #GPs all MSR accesses, since KVM doesn't handle LBR and extra_regs support.
So check the related MSRs access right once at initialization time to avoid the 
error access at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
(for host kernel).
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).


I'm not sure this is a useful patch.

This is #GP'ing just because of a limitation in the PMU; just compile 
the kernel with CONFIG_PARAVIRT, or split the "rdmsr is always 
rdmsr_safe" behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.


In fact there's no reason why LBR cannot be virtualized (though it does 
need support from the processor), and it may even be possible to support 
OFFCORE_RSP_X in the KVM virtual PMU.


Paolo


[PATCH 04/19] KVM: emulate: move around some checks

2014-07-14 Thread Paolo Bonzini
The only purpose of this patch is to make the next patch simpler
to review.  No semantic change.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 15453e569f3d..d79677c6056d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4388,12 +4388,15 @@ done_prefixes:
ctxt->d |= opcode.flags;
}
 
+   /* Unrecognised? */
+   if (ctxt->d == 0)
+   return EMULATION_FAILED;
+
ctxt->execute = opcode.u.execute;
ctxt->check_perm = opcode.check_perm;
ctxt->intercept = opcode.intercept;
 
-   /* Unrecognised? */
-   if (ctxt->d == 0 || (ctxt->d & NotImpl))
+   if (ctxt->d & NotImpl)
return EMULATION_FAILED;
 
if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
@@ -4535,19 +4538,19 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 
ctxt->mem_read.pos = 0;
 
-   if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
-   (ctxt->d & Undefined)) {
+   /* LOCK prefix is allowed only with some instructions */
+   if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != 
OP_MEM)) {
rc = emulate_ud(ctxt);
goto done;
}
 
-   /* LOCK prefix is allowed only with some instructions */
-   if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != 
OP_MEM)) {
+   if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) {
rc = emulate_ud(ctxt);
goto done;
}
 
-   if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) {
+   if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
+   (ctxt->d & Undefined)) {
rc = emulate_ud(ctxt);
goto done;
}
-- 
1.8.3.1




[PATCH 02/19] KVM: x86: return all bits from get_interrupt_shadow

2014-07-14 Thread Paolo Bonzini
For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call.  As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/svm.c  |  6 +++---
 arch/x86/kvm/vmx.c  |  4 ++--
 arch/x86/kvm/x86.c  | 10 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index af36f89fe67a..b8a4480176b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -717,7 +717,7 @@ struct kvm_x86_ops {
int (*handle_exit)(struct kvm_vcpu *vcpu);
void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
-   u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
+   u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu);
void (*patch_hypercall)(struct kvm_vcpu *vcpu,
unsigned char *hypercall_addr);
void (*set_irq)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4925a94a970a..ddf742768ecf 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -486,14 +486,14 @@ static int is_external_interrupt(u32 info)
return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR);
 }
 
-static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u32 ret = 0;
 
if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)
-   ret |= KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
-   return ret & mask;
+   ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
+   return ret;
 }
 
 static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5c7bbde25bbf..0c9569b994f9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1943,7 +1943,7 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, 
unsigned long rflags)
vmcs_writel(GUEST_RFLAGS, rflags);
 }
 
-static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 {
u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
int ret = 0;
@@ -1953,7 +1953,7 @@ static u32 vmx_get_interrupt_shadow(struct kvm_vcpu 
*vcpu, int mask)
if (interruptibility & GUEST_INTR_STATE_MOV_SS)
ret |= KVM_X86_SHADOW_INT_MOV_SS;
 
-   return ret & mask;
+   return ret;
 }
 
 static void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7553530e3502..a56126e6bd75 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2978,9 +2978,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct 
kvm_vcpu *vcpu,
vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft;
events->interrupt.nr = vcpu->arch.interrupt.nr;
events->interrupt.soft = 0;
-   events->interrupt.shadow =
-   kvm_x86_ops->get_interrupt_shadow(vcpu,
-   KVM_X86_SHADOW_INT_MOV_SS | KVM_X86_SHADOW_INT_STI);
+   events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
 
events->nmi.injected = vcpu->arch.nmi_injected;
events->nmi.pending = vcpu->arch.nmi_pending != 0;
@@ -4860,7 +4858,7 @@ static const struct x86_emulate_ops emulate_ops = {
 
 static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
 {
-   u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu, mask);
+   u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
/*
 * an sti; sti; sequence only disable interrupts for the first
 * instruction. So, if the last instruction, be it emulated or
@@ -4868,7 +4866,9 @@ static void toggle_interruptibility(struct kvm_vcpu 
*vcpu, u32 mask)
 * means that the last instruction is an sti. We should not
 * leave the flag on in this case. The same goes for mov ss
 */
-   if (!(int_shadow & mask))
+   if (int_shadow & mask)
+   mask = 0;
+   if (unlikely(int_shadow || mask))
kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
 }
 
-- 
1.8.3.1



[PATCH 03/19] KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation

2014-07-14 Thread Paolo Bonzini
Despite the provisions to emulate up to 130 consecutive instructions, in
practice KVM will emulate just one before exiting handle_invalid_guest_state,
because x86_emulate_instruction always sets KVM_REQ_EVENT.

However, we only need to do this if an interrupt could be injected,
which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
away; b) if the interrupt flag has just been set (other instructions
than STI can set it without enabling an interrupt shadow).

This cuts another 700-900 cycles from the cost of emulating an
instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
before the patch on kvm-unit-tests, 925-1700 afterwards).

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/x86.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a56126e6bd75..cd9316786dca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -87,6 +87,7 @@ static u64 __read_mostly efer_reserved_bits = 
~((u64)EFER_SCE);
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu);
 static void process_nmi(struct kvm_vcpu *vcpu);
+static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
 
 struct kvm_x86_ops *kvm_x86_ops;
 EXPORT_SYMBOL_GPL(kvm_x86_ops);
@@ -4868,8 +4869,11 @@ static void toggle_interruptibility(struct kvm_vcpu 
*vcpu, u32 mask)
 */
if (int_shadow & mask)
mask = 0;
-   if (unlikely(int_shadow || mask))
+   if (unlikely(int_shadow || mask)) {
kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
+   if (!mask)
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+   }
 }
 
 static void inject_emulated_exception(struct kvm_vcpu *vcpu)
@@ -5095,20 +5099,18 @@ static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 
type, u32 dr7,
return dr6;
 }
 
-static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, int *r)
+static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, unsigned long 
rflags, int *r)
 {
struct kvm_run *kvm_run = vcpu->run;
 
/*
-* Use the "raw" value to see if TF was passed to the processor.
-* Note that the new value of the flags has not been saved yet.
+* rflags is the old, "raw" value of the flags.  The new value has
+* not been saved yet.
 *
 * This is correct even for TF set by the guest, because "the
 * processor will not generate this exception after the instruction
 * that sets the TF flag".
 */
-   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
-
if (unlikely(rflags & X86_EFLAGS_TF)) {
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1;
@@ -5275,13 +5277,22 @@ restart:
r = EMULATE_DONE;
 
if (writeback) {
+   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
toggle_interruptibility(vcpu, ctxt->interruptibility);
-   kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
kvm_rip_write(vcpu, ctxt->eip);
if (r == EMULATE_DONE)
-   kvm_vcpu_check_singlestep(vcpu, &r);
-   kvm_set_rflags(vcpu, ctxt->eflags);
+   kvm_vcpu_check_singlestep(vcpu, rflags, &r);
+   __kvm_set_rflags(vcpu, ctxt->eflags);
+
+   /*
+* For STI, interrupts are shadowed; so KVM_REQ_EVENT will
+* do nothing, and it will be requested again as soon as
+* the shadow expires.  But we still need to check here,
+* because POPF has no interrupt shadow.
+*/
+   if (unlikely((ctxt->eflags & ~rflags) & X86_EFLAGS_IF))
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
} else
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
 
@@ -7406,12 +7417,17 @@ unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_get_rflags);
 
-void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
rflags |= X86_EFLAGS_TF;
kvm_x86_ops->set_rflags(vcpu, rflags);
+}
+
+void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+   __kvm_set_rflags(vcpu, rflags);
kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
-- 
1.8.3.1




[PATCH 06/19] KVM: emulate: speed up emulated moves

2014-07-14 Thread Paolo Bonzini
We can just blindly move all 16 bytes of ctxt->src's value to ctxt->dst.
write_register_operand will take care of writing only the lower bytes.

Avoiding a call to memcpy (the compiler optimizes it out) gains about
200 cycles on kvm-unit-tests for register-to-register moves, and makes
them about as fast as arithmetic instructions.

We could perhaps get a larger speedup by moving all instructions _except_
moves out of x86_emulate_insn, removing opcode_len, and replacing the
switch statement with an inlined em_mov.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h | 2 +-
 arch/x86/kvm/emulate.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 0e0151c13b2c..432447370044 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -233,7 +233,7 @@ struct operand {
union {
unsigned long val;
u64 val64;
-   char valptr[sizeof(unsigned long) + 2];
+   char valptr[sizeof(sse128_t)];
sse128_t vec_val;
u64 mm_val;
void *data;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index ea56dae3e67c..27f677ef703e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2990,7 +2990,7 @@ static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
 
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
-   memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes);
+   memcpy(ctxt->dst.valptr, ctxt->src.valptr, sizeof(ctxt->src.valptr));
return X86EMUL_CONTINUE;
 }
 
-- 
1.8.3.1




[PATCH 05/19] KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"

2014-07-14 Thread Paolo Bonzini
There are several checks for "peculiar" aspects of instructions in both
x86_decode_insn and x86_emulate_insn.  Group them together, and guard
them with a single "if" that lets the processor quickly skip them all.
Make this more effective by adding two more flag bits that say whether the
.intercept and .check_perm fields are valid.  We will reuse these
flags later to avoid initializing fields of the emulate_ctxt struct.

This skims about 30 cycles for each emulated instruction, which is
approximately a 3% improvement.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 175 ++---
 1 file changed, 94 insertions(+), 81 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d79677c6056d..ea56dae3e67c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -162,6 +162,8 @@
 #define NoWrite ((u64)1 << 45)  /* No writeback */
 #define SrcWrite((u64)1 << 46)  /* Write back src operand */
 #define NoMod  ((u64)1 << 47)  /* Mod field is ignored */
+#define Intercept   ((u64)1 << 48)  /* Has valid intercept field */
+#define CheckPerm   ((u64)1 << 49)  /* Has valid check_perm field */
 
 #define DstXacc (DstAccLo | SrcAccHi | SrcWrite)
 
@@ -3546,9 +3548,9 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 }
 
 #define D(_y) { .flags = (_y) }
-#define DI(_y, _i) { .flags = (_y), .intercept = x86_intercept_##_i }
-#define DIP(_y, _i, _p) { .flags = (_y), .intercept = x86_intercept_##_i, \
- .check_perm = (_p) }
+#define DI(_y, _i) { .flags = (_y)|Intercept, .intercept = x86_intercept_##_i }
+#define DIP(_y, _i, _p) { .flags = (_y)|Intercept|CheckPerm, \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
 #define ND(NotImpl)
 #define EXT(_f, _e) { .flags = ((_f) | RMExt), .u.group = (_e) }
 #define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) }
@@ -3557,10 +3559,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 #define I(_f, _e) { .flags = (_f), .u.execute = (_e) }
 #define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) }
 #define II(_f, _e, _i) \
-   { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i }
+   { .flags = (_f)|Intercept, .u.execute = (_e), .intercept = 
x86_intercept_##_i }
 #define IIP(_f, _e, _i, _p) \
-   { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i, \
- .check_perm = (_p) }
+   { .flags = (_f)|Intercept|CheckPerm, .u.execute = (_e), \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
 #define GP(_f, _g) { .flags = ((_f) | Prefix), .u.gprefix = (_g) }
 
 #define D2bv(_f)  D((_f) | ByteOp), D(_f)
@@ -4393,29 +4395,37 @@ done_prefixes:
return EMULATION_FAILED;
 
ctxt->execute = opcode.u.execute;
-   ctxt->check_perm = opcode.check_perm;
-   ctxt->intercept = opcode.intercept;
 
-   if (ctxt->d & NotImpl)
-   return EMULATION_FAILED;
+   if (unlikely(ctxt->d &
+
(NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
+   /*
+* These are copied unconditionally here, and checked 
unconditionally
+* in x86_emulate_insn.
+*/
+   ctxt->check_perm = opcode.check_perm;
+   ctxt->intercept = opcode.intercept;
 
-   if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
-   return EMULATION_FAILED;
+   if (ctxt->d & NotImpl)
+   return EMULATION_FAILED;
 
-   if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack))
-   ctxt->op_bytes = 8;
+   if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
+   return EMULATION_FAILED;
 
-   if (ctxt->d & Op3264) {
-   if (mode == X86EMUL_MODE_PROT64)
+   if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack))
ctxt->op_bytes = 8;
-   else
-   ctxt->op_bytes = 4;
-   }
 
-   if (ctxt->d & Sse)
-   ctxt->op_bytes = 16;
-   else if (ctxt->d & Mmx)
-   ctxt->op_bytes = 8;
+   if (ctxt->d & Op3264) {
+   if (mode == X86EMUL_MODE_PROT64)
+   ctxt->op_bytes = 8;
+   else
+   ctxt->op_bytes = 4;
+   }
+
+   if (ctxt->d & Sse)
+   ctxt->op_bytes = 16;
+   else if (ctxt->d & Mmx)
+   ctxt->op_bytes = 8;
+   }
 
/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
@@ -4549,75 +4559,78 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
goto done;
}
 
-   if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
-   (ctxt->d & Undefined)) {
-   rc = emulate_ud(ctxt);
-   goto done;
- 

[PATCH 07/19] KVM: emulate: simplify writeback

2014-07-14 Thread Paolo Bonzini
The "if/return" checks are useless, because we return X86EMUL_CONTINUE
anyway if we do not return.

Reviewed-by: Marcelo Tosatti 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 27f677ef703e..32d3da82da2e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1588,34 +1588,28 @@ static void write_register_operand(struct operand *op)
 
 static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
 {
-   int rc;
-
switch (op->type) {
case OP_REG:
write_register_operand(op);
break;
case OP_MEM:
if (ctxt->lock_prefix)
-   rc = segmented_cmpxchg(ctxt,
+   return segmented_cmpxchg(ctxt,
+op->addr.mem,
+&op->orig_val,
+&op->val,
+op->bytes);
+   else
+   return segmented_write(ctxt,
   op->addr.mem,
-  &op->orig_val,
   &op->val,
   op->bytes);
-   else
-   rc = segmented_write(ctxt,
-op->addr.mem,
-&op->val,
-op->bytes);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
break;
case OP_MEM_STR:
-   rc = segmented_write(ctxt,
-   op->addr.mem,
-   op->data,
-   op->bytes * op->count);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
+   return segmented_write(ctxt,
+  op->addr.mem,
+  op->data,
+  op->bytes * op->count);
break;
case OP_XMM:
write_sse_reg(ctxt, &op->vec_val, op->addr.xmm);
-- 
1.8.3.1




[PATCH 10/19] KVM: emulate: cleanup decode_modrm

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

Remove the if conditional - that will help us avoid
an "else initialize to 0". Also, rearrange operators
for slightly better code.

Signed-off-by: Bandan Das 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 000fc7398832..94f5f8b94ce9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1066,19 +1066,17 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
struct operand *op)
 {
u8 sib;
-   int index_reg = 0, base_reg = 0, scale;
+   int index_reg, base_reg, scale;
int rc = X86EMUL_CONTINUE;
ulong modrm_ea = 0;
 
-   if (ctxt->rex_prefix) {
-   ctxt->modrm_reg = (ctxt->rex_prefix & 4) << 1;  /* REX.R */
-   index_reg = (ctxt->rex_prefix & 2) << 2; /* REX.X */
-   ctxt->modrm_rm = base_reg = (ctxt->rex_prefix & 1) << 3; /* 
REG.B */
-   }
+   ctxt->modrm_reg = ((ctxt->rex_prefix << 1) & 8); /* REX.R */
+   index_reg = (ctxt->rex_prefix << 2) & 8; /* REX.X */
+   base_reg = (ctxt->rex_prefix << 3) & 8; /* REX.B */
 
-   ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6;
+   ctxt->modrm_mod = (ctxt->modrm & 0xc0) >> 6;
ctxt->modrm_reg |= (ctxt->modrm & 0x38) >> 3;
-   ctxt->modrm_rm |= (ctxt->modrm & 0x07);
+   ctxt->modrm_rm = base_reg | (ctxt->modrm & 0x07);
ctxt->modrm_seg = VCPU_SREG_DS;
 
if (ctxt->modrm_mod == 3 || (ctxt->d & NoMod)) {
-- 
1.8.3.1




[PATCH 01/19] KVM: vmx: speed up emulation of invalid guest state

2014-07-14 Thread Paolo Bonzini
About 25% of the time spent in emulation of invalid guest state
is wasted in checking whether emulation is required for the next
instruction.  However, this almost never changes except when a
segment register (or TR or LDTR) changes, or when there is a mode
transition (i.e. CR0 changes).

In fact, vmx_set_segment and vmx_set_cr0 already modify
vmx->emulation_required (except that the former for some reason
uses |= instead of just an assignment).  So there is no need to
call guest_state_valid in the emulation loop.

Emulation performance test results indicate 1650-2600 cycles
for common instructions, versus 2300-3200 before this patch on
a Sandy Bridge Xeon.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8748c2e19ed6..5c7bbde25bbf 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3672,7 +3672,7 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
 
 out:
-   vmx->emulation_required |= emulation_required(vcpu);
+   vmx->emulation_required = emulation_required(vcpu);
 }
 
 static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
@@ -5640,7 +5640,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
*vcpu)
cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
intr_window_requested = cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING;
 
-   while (!guest_state_valid(vcpu) && count-- != 0) {
+   while (vmx->emulation_required && count-- != 0) {
if (intr_window_requested && vmx_interrupt_allowed(vcpu))
return handle_interrupt_window(&vmx->vcpu);
 
@@ -5674,7 +5674,6 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
*vcpu)
schedule();
}
 
-   vmx->emulation_required = emulation_required(vcpu);
 out:
return ret;
 }
-- 
1.8.3.1




[PATCH 14/19] KVM: emulate: speed up do_insn_fetch

2014-07-14 Thread Paolo Bonzini
Hoist the common case up from do_insn_fetch_byte to do_insn_fetch,
and prime the fetch_cache in x86_decode_insn.  This helps a bit the
compiler and the branch predictor, but above all it lays the
ground for further changes in the next few patches.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 67 +++---
 1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 390400a54a9c..ea188a338afb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -705,51 +705,51 @@ static int segmented_read_std(struct x86_emulate_ctxt 
*ctxt,
 }
 
 /*
- * Fetch the next byte of the instruction being emulated which is pointed to
- * by ctxt->_eip, then increment ctxt->_eip.
- *
- * Also prefetch the remaining bytes of the instruction without crossing page
+ * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_byte(struct x86_emulate_ctxt *ctxt, u8 *dest)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
 {
struct fetch_cache *fc = &ctxt->fetch;
int rc;
int size, cur_size;
-
-   if (ctxt->_eip == fc->end) {
-   unsigned long linear;
-   struct segmented_address addr = { .seg = VCPU_SREG_CS,
- .ea  = ctxt->_eip };
-   cur_size = fc->end - fc->start;
-   size = min(15UL - cur_size,
-  PAGE_SIZE - offset_in_page(ctxt->_eip));
-   rc = __linearize(ctxt, addr, size, false, true, &linear);
-   if (unlikely(rc != X86EMUL_CONTINUE))
-   return rc;
-   rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
- size, &ctxt->exception);
-   if (unlikely(rc != X86EMUL_CONTINUE))
-   return rc;
-   fc->end += size;
-   }
-   *dest = fc->data[ctxt->_eip - fc->start];
-   ctxt->_eip++;
+   unsigned long linear;
+
+   struct segmented_address addr = { .seg = VCPU_SREG_CS,
+ .ea  = fc->end };
+   cur_size = fc->end - fc->start;
+   size = min(15UL - cur_size,
+  PAGE_SIZE - offset_in_page(fc->end));
+   if (unlikely(size == 0))
+   return X86EMUL_UNHANDLEABLE;
+   rc = __linearize(ctxt, addr, size, false, true, &linear);
+   if (unlikely(rc != X86EMUL_CONTINUE))
+   return rc;
+   rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+ size, &ctxt->exception);
+   if (unlikely(rc != X86EMUL_CONTINUE))
+   return rc;
+   fc->end += size;
return X86EMUL_CONTINUE;
 }
 
 static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
-void *dest, unsigned size)
+void *__dest, unsigned size)
 {
int rc;
+   struct fetch_cache *fc = &ctxt->fetch;
+   u8 *dest = __dest;
+   u8 *src = &fc->data[ctxt->_eip - fc->start];
 
-   /* x86 instructions are limited to 15 bytes. */
-   if (unlikely(ctxt->_eip + size - ctxt->eip > 15))
-   return X86EMUL_UNHANDLEABLE;
while (size--) {
-   rc = do_insn_fetch_byte(ctxt, dest++);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
+   if (unlikely(ctxt->_eip == fc->end)) {
+   rc = do_insn_fetch_bytes(ctxt);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   }
+   *dest++ = *src++;
+   ctxt->_eip++;
+   continue;
}
return X86EMUL_CONTINUE;
 }
@@ -4227,6 +4227,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
ctxt->opcode_len = 1;
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
+   else {
+   rc = do_insn_fetch_bytes(ctxt);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   }
 
switch (mode) {
case X86EMUL_MODE_REAL:
-- 
1.8.3.1




[PATCH 11/19] KVM: emulate: clean up initializations in init_decode_cache

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

A lot of initializations are unnecessary as they get set to
appropriate values before actually being used. Optimize
placement of fields in x86_emulate_ctxt

Signed-off-by: Bandan Das 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h | 20 +++-
 arch/x86/kvm/emulate.c |  7 ++-
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index c22bd9af4311..d40a10a38a80 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -288,30 +288,32 @@ struct x86_emulate_ctxt {
u8 opcode_len;
u8 b;
u8 intercept;
-   u8 lock_prefix;
-   u8 rep_prefix;
u8 op_bytes;
u8 ad_bytes;
-   u8 rex_prefix;
struct operand src;
struct operand src2;
struct operand dst;
-   bool has_seg_override;
-   u8 seg_override;
-   u64 d;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+   bool has_seg_override;
+   bool rip_relative;
+   u8 rex_prefix;
+   u8 lock_prefix;
+   u8 rep_prefix;
+   u8 seg_override;
+   /* bitmaps of registers in _regs[] that can be read */
+   u32 regs_valid;
+   /* bitmaps of registers in _regs[] that have been written */
+   u32 regs_dirty;
/* modrm */
u8 modrm;
u8 modrm_mod;
u8 modrm_reg;
u8 modrm_rm;
u8 modrm_seg;
-   bool rip_relative;
+   u64 d;
unsigned long _eip;
struct operand memop;
-   u32 regs_valid;  /* bitmaps of registers in _regs[] that can be read */
-   u32 regs_dirty;  /* bitmaps of registers in _regs[] that have been 
written */
/* Fields above regs are cleared together. */
unsigned long _regs[NR_VCPU_REGS];
struct operand *memopp;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 94f5f8b94ce9..3e9bbdc4c76a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4534,14 +4534,11 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *))
 
 void init_decode_cache(struct x86_emulate_ctxt *ctxt)
 {
-   memset(&ctxt->opcode_len, 0,
-  (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
+   memset(&ctxt->has_seg_override, 0,
+  (void *)&ctxt->modrm - (void *)&ctxt->has_seg_override);
 
-   ctxt->fetch.start = 0;
-   ctxt->fetch.end = 0;
ctxt->io_read.pos = 0;
ctxt->io_read.end = 0;
-   ctxt->mem_read.pos = 0;
ctxt->mem_read.end = 0;
 }
 
-- 
1.8.3.1




[PATCH 15/19] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes

2014-07-14 Thread Paolo Bonzini
do_insn_fetch_bytes will only be called once in a given insn_fetch and
insn_fetch_arr, because in fact it will only be called at most twice
for any instruction and the first call is explicit in x86_decode_insn.
This observation lets us hoist the call out of the memory copying loop.
It does not buy performance, because most fetches are one byte long
anyway, but it prepares for the next patch.

The overflow check is tricky, but correct.  Because do_insn_fetch_bytes
has already been called once, we know that fc->end is at least 15.  So
it is okay to subtract the number of bytes we want to read.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index ea188a338afb..ca82ec9c5ffe 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -708,7 +708,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
  * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
struct fetch_cache *fc = &ctxt->fetch;
int rc;
@@ -720,7 +720,14 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt 
*ctxt)
cur_size = fc->end - fc->start;
size = min(15UL - cur_size,
   PAGE_SIZE - offset_in_page(fc->end));
-   if (unlikely(size == 0))
+
+   /*
+* One instruction can only straddle two pages,
+* and one has been loaded at the beginning of
+* x86_decode_insn.  So, if not enough bytes
+* still, we must have hit the 15-byte boundary.
+*/
+   if (unlikely(size < op_size))
return X86EMUL_UNHANDLEABLE;
rc = __linearize(ctxt, addr, size, false, true, &linear);
if (unlikely(rc != X86EMUL_CONTINUE))
@@ -736,17 +743,18 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt 
*ctxt)
 static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
 void *__dest, unsigned size)
 {
-   int rc;
struct fetch_cache *fc = &ctxt->fetch;
u8 *dest = __dest;
u8 *src = &fc->data[ctxt->_eip - fc->start];
 
+   /* We have to be careful about overflow! */
+   if (unlikely(ctxt->_eip > fc->end - size)) {
+   int rc = do_insn_fetch_bytes(ctxt, size);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   }
+
while (size--) {
-   if (unlikely(ctxt->_eip == fc->end)) {
-   rc = do_insn_fetch_bytes(ctxt);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
-   }
*dest++ = *src++;
ctxt->_eip++;
continue;
@@ -4228,7 +4236,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
else {
-   rc = do_insn_fetch_bytes(ctxt);
+   rc = do_insn_fetch_bytes(ctxt, 1);
if (rc != X86EMUL_CONTINUE)
return rc;
}
-- 
1.8.3.1




[PATCH 12/19] KVM: emulate: rework seg_override

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

x86_decode_insn already sets a default for seg_override,
so remove it from the zeroed area. Also replace set/get functions
with direct access to the field.

Signed-off-by: Bandan Das 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h |  3 +--
 arch/x86/kvm/emulate.c | 41 +++---
 2 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index d40a10a38a80..6c808408326f 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -295,12 +295,10 @@ struct x86_emulate_ctxt {
struct operand dst;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
-   bool has_seg_override;
bool rip_relative;
u8 rex_prefix;
u8 lock_prefix;
u8 rep_prefix;
-   u8 seg_override;
/* bitmaps of registers in _regs[] that can be read */
u32 regs_valid;
/* bitmaps of registers in _regs[] that have been written */
@@ -311,6 +309,7 @@ struct x86_emulate_ctxt {
u8 modrm_reg;
u8 modrm_rm;
u8 modrm_seg;
+   u8 seg_override;
u64 d;
unsigned long _eip;
struct operand memop;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 3e9bbdc4c76a..08badf638fb0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -514,12 +514,6 @@ static u32 desc_limit_scaled(struct desc_struct *desc)
return desc->g ? (limit << 12) | 0xfff : limit;
 }
 
-static void set_seg_override(struct x86_emulate_ctxt *ctxt, int seg)
-{
-   ctxt->has_seg_override = true;
-   ctxt->seg_override = seg;
-}
-
 static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg)
 {
if (ctxt->mode == X86EMUL_MODE_PROT64 && seg < VCPU_SREG_FS)
@@ -528,14 +522,6 @@ static unsigned long seg_base(struct x86_emulate_ctxt 
*ctxt, int seg)
return ctxt->ops->get_cached_segment_base(ctxt, seg);
 }
 
-static unsigned seg_override(struct x86_emulate_ctxt *ctxt)
-{
-   if (!ctxt->has_seg_override)
-   return 0;
-
-   return ctxt->seg_override;
-}
-
 static int emulate_exception(struct x86_emulate_ctxt *ctxt, int vec,
 u32 error, bool valid)
 {
@@ -4169,7 +4155,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
op->addr.mem.ea =
register_address(ctxt, reg_read(ctxt, VCPU_REGS_RSI));
-   op->addr.mem.seg = seg_override(ctxt);
+   op->addr.mem.seg = ctxt->seg_override;
op->val = 0;
op->count = 1;
break;
@@ -4180,7 +4166,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
register_address(ctxt,
reg_read(ctxt, VCPU_REGS_RBX) +
(reg_read(ctxt, VCPU_REGS_RAX) & 0xff));
-   op->addr.mem.seg = seg_override(ctxt);
+   op->addr.mem.seg = ctxt->seg_override;
op->val = 0;
break;
case OpImmFAddr:
@@ -4227,6 +4213,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
int mode = ctxt->mode;
int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
bool op_prefix = false;
+   bool has_seg_override = false;
struct opcode opcode;
 
ctxt->memop.type = OP_NONE;
@@ -4280,11 +4267,13 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
case 0x2e:  /* CS override */
case 0x36:  /* SS override */
case 0x3e:  /* DS override */
-   set_seg_override(ctxt, (ctxt->b >> 3) & 3);
+   has_seg_override = true;
+   ctxt->seg_override = (ctxt->b >> 3) & 3;
break;
case 0x64:  /* FS override */
case 0x65:  /* GS override */
-   set_seg_override(ctxt, ctxt->b & 7);
+   has_seg_override = true;
+   ctxt->seg_override = ctxt->b & 7;
break;
case 0x40 ... 0x4f: /* REX */
if (mode != X86EMUL_MODE_PROT64)
@@ -4422,17 +4411,19 @@ done_prefixes:
/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
rc = decode_modrm(ctxt, &ctxt->memop);
-   if (!ctxt->has_seg_override)
-   set_seg_override(ctxt, ctxt->modrm_seg);
+   if (!has_seg_override) {
+   has_seg_override = true;
+   ctxt->seg_override = ctxt->modrm_seg;
+   }
} else if (ctxt->d & MemAbs)

[PATCH 17/19] KVM: emulate: put pointers in the fetch_cache

2014-07-14 Thread Paolo Bonzini
This simplifies the code a bit, especially the overflow checks.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h |  4 ++--
 arch/x86/kvm/emulate.c | 34 +++---
 arch/x86/kvm/trace.h   |  6 +++---
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index fcf58cd25ebd..eb181178fe0b 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -242,8 +242,8 @@ struct operand {
 
 struct fetch_cache {
u8 data[15];
-   unsigned long start;
-   unsigned long end;
+   u8 *ptr;
+   u8 *end;
 };
 
 struct read_cache {
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 02c668aca2b6..c16314807756 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -710,16 +710,15 @@ static int segmented_read_std(struct x86_emulate_ctxt 
*ctxt,
  */
 static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
-   struct fetch_cache *fc = &ctxt->fetch;
int rc;
-   int size, cur_size;
+   int size;
unsigned long linear;
-
+   int cur_size = ctxt->fetch.end - ctxt->fetch.data;
struct segmented_address addr = { .seg = VCPU_SREG_CS,
- .ea  = fc->end };
-   cur_size = fc->end - fc->start;
-   size = min(15UL - cur_size,
-  PAGE_SIZE - offset_in_page(fc->end));
+  .ea = ctxt->eip + cur_size };
+
+   size = min(15UL ^ cur_size,
+  PAGE_SIZE - offset_in_page(addr.ea));
 
/*
 * One instruction can only straddle two pages,
@@ -732,19 +731,18 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt 
*ctxt, int op_size)
rc = __linearize(ctxt, addr, size, false, true, &linear);
if (unlikely(rc != X86EMUL_CONTINUE))
return rc;
-   rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+   rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end,
  size, &ctxt->exception);
if (unlikely(rc != X86EMUL_CONTINUE))
return rc;
-   fc->end += size;
+   ctxt->fetch.end += size;
return X86EMUL_CONTINUE;
 }
 
 static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
   unsigned size)
 {
-   /* We have to be careful about overflow! */
-   if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+   if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size))
return __do_insn_fetch_bytes(ctxt, size);
else
return X86EMUL_CONTINUE;
@@ -753,26 +751,24 @@ static __always_inline int do_insn_fetch_bytes(struct 
x86_emulate_ctxt *ctxt,
 /* Fetch next part of the instruction being emulated. */
 #define insn_fetch(_type, _ctxt)   \
 ({ _type _x;   \
-   struct fetch_cache *_fc;\
\
rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \
if (rc != X86EMUL_CONTINUE) \
goto done;  \
-   _fc = &ctxt->fetch; \
-   _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
ctxt->_eip += sizeof(_type);\
+   _x = *(_type __aligned(1) *) ctxt->fetch.ptr;   \
+   ctxt->fetch.ptr += sizeof(_type);   \
_x; \
 })
 
 #define insn_fetch_arr(_arr, _size, _ctxt) \
 ({ \
-   struct fetch_cache *_fc;\
rc = do_insn_fetch_bytes(_ctxt, _size); \
if (rc != X86EMUL_CONTINUE) \
goto done;  \
-   _fc = &ctxt->fetch; \
-   memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size);   \
ctxt->_eip += (_size);  \
+   memcpy(_arr, ctxt->fetch.ptr, _size);   \
+   ctxt->fetch.ptr += (_size); \
 })
 
 /*
@@ -4228,8 +4224,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
ctxt->memop.type = OP_NONE;
ctxt->memopp = NULL;
ctxt->_eip = ctxt->eip;
-   ctxt->fetch.start = ctxt->_eip;
-   ctxt->fetch.end = ctxt->fetch.

[PATCH 16/19] KVM: emulate: avoid per-byte copying in instruction fetches

2014-07-14 Thread Paolo Bonzini
We do not need a memory copying loop anymore in insn_fetch; we
can use a byte-aligned pointer to access instruction fields directly
from the fetch_cache.  This eliminates 50-150 cycles (corresponding to
a 5-10% improvement in performance) from each instruction.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 46 ++
 1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index ca82ec9c5ffe..02c668aca2b6 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -708,7 +708,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
  * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
+static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
struct fetch_cache *fc = &ctxt->fetch;
int rc;
@@ -740,41 +740,39 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt 
*ctxt, int op_size)
return X86EMUL_CONTINUE;
 }
 
-static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
-void *__dest, unsigned size)
+static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
+  unsigned size)
 {
-   struct fetch_cache *fc = &ctxt->fetch;
-   u8 *dest = __dest;
-   u8 *src = &fc->data[ctxt->_eip - fc->start];
-
/* We have to be careful about overflow! */
-   if (unlikely(ctxt->_eip > fc->end - size)) {
-   int rc = do_insn_fetch_bytes(ctxt, size);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
-   }
-
-   while (size--) {
-   *dest++ = *src++;
-   ctxt->_eip++;
-   continue;
-   }
-   return X86EMUL_CONTINUE;
+   if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+   return __do_insn_fetch_bytes(ctxt, size);
+   else
+   return X86EMUL_CONTINUE;
 }
 
 /* Fetch next part of the instruction being emulated. */
 #define insn_fetch(_type, _ctxt)   \
-({ unsigned long _x;   \
-   rc = do_insn_fetch(_ctxt, &_x, sizeof(_type));  \
+({ _type _x;   \
+   struct fetch_cache *_fc;\
+   \
+   rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \
if (rc != X86EMUL_CONTINUE) \
goto done;  \
-   (_type)_x;  \
+   _fc = &ctxt->fetch; \
+   _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
+   ctxt->_eip += sizeof(_type);\
+   _x; \
 })
 
 #define insn_fetch_arr(_arr, _size, _ctxt) \
-({ rc = do_insn_fetch(_ctxt, _arr, (_size));   \
+({ \
+   struct fetch_cache *_fc;\
+   rc = do_insn_fetch_bytes(_ctxt, _size); \
if (rc != X86EMUL_CONTINUE) \
goto done;  \
+   _fc = &ctxt->fetch; \
+   memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size);   \
+   ctxt->_eip += (_size);  \
 })
 
 /*
@@ -4236,7 +4234,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
else {
-   rc = do_insn_fetch_bytes(ctxt, 1);
+   rc = __do_insn_fetch_bytes(ctxt, 1);
if (rc != X86EMUL_CONTINUE)
return rc;
}
-- 
1.8.3.1




[PATCH 19/19] KVM: x86: use kvm_read_guest_page for emulator accesses

2014-07-14 Thread Paolo Bonzini
Emulator accesses are always done a page at a time, either by the emulator
itself (for fetches) or because we need to query the MMU for address
translations.  Speed up these accesses by using kvm_read_guest_page
and, in the case of fetches, by inlining kvm_read_guest_virt_helper and
dropping the loop around kvm_read_guest_page.

This final tweak saves 30-100 more clock cycles (4-10%), bringing the
count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on
a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series
and 925-1700 after the first two low-hanging fruit changes.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/x86.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 905edf8557e7..f750b69ca443 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4085,7 +4085,8 @@ static int kvm_read_guest_virt_helper(gva_t addr, void 
*val, unsigned int bytes,
 
if (gpa == UNMAPPED_GVA)
return X86EMUL_PROPAGATE_FAULT;
-   ret = kvm_read_guest(vcpu->kvm, gpa, data, toread);
+   ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data,
+ offset, toread);
if (ret < 0) {
r = X86EMUL_IO_NEEDED;
goto out;
@@ -4106,10 +4107,24 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt 
*ctxt,
 {
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+   unsigned offset;
+   int ret;
 
-   return kvm_read_guest_virt_helper(addr, val, bytes, vcpu,
- access | PFERR_FETCH_MASK,
- exception);
+   /* Inline kvm_read_guest_virt_helper for speed.  */
+   gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, 
access|PFERR_FETCH_MASK,
+   exception);
+   if (unlikely(gpa == UNMAPPED_GVA))
+   return X86EMUL_PROPAGATE_FAULT;
+
+   offset = addr & (PAGE_SIZE-1);
+   if (WARN_ON(offset + bytes > PAGE_SIZE))
+   bytes = (unsigned)PAGE_SIZE - offset;
+   ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, val,
+ offset, bytes);
+   if (unlikely(ret < 0))
+   return X86EMUL_IO_NEEDED;
+
+   return X86EMUL_CONTINUE;
 }
 
 int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
-- 
1.8.3.1



[PATCH 09/19] KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

The same information can be gleaned from ctxt->d and avoids having
to zero/NULL initialize intercept and check_perm

Signed-off-by: Bandan Das 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 851315d93bf7..000fc7398832 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4599,7 +4599,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
fetch_possible_mmx_operand(ctxt, &ctxt->dst);
}
 
-   if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+   if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
  X86_ICPT_PRE_EXCEPT);
if (rc != X86EMUL_CONTINUE)
@@ -4619,13 +4619,13 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}
 
/* Do instruction specific permission checks */
-   if (ctxt->check_perm) {
+   if (ctxt->d & CheckPerm) {
rc = ctxt->check_perm(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
}
 
-   if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+   if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
  X86_ICPT_POST_EXCEPT);
if (rc != X86EMUL_CONTINUE)
@@ -4671,7 +4671,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 
 special_insn:
 
-   if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+   if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
  X86_ICPT_POST_MEMACCESS);
if (rc != X86EMUL_CONTINUE)
-- 
1.8.3.1




[PATCH 18/19] KVM: x86: ensure emulator fetches do not span multiple pages

2014-07-14 Thread Paolo Bonzini
When the CS base is not page-aligned, the linear address of the code could
get close to the page boundary (e.g. 0x...ffe) even if the EIP value is
not.  So we need to first linearize the address, and only then compute
the number of valid bytes that can be fetched.

This happens relatively often when executing real mode code.
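
As a hypothetical illustration of the numbers involved: with CS.base = 0xffff0
and EIP = 0xe, the fetch starts at linear address 0xffffe, so only two bytes
remain before the page boundary even though EIP itself is nowhere near one;
sizing the fetch from EIP alone would overestimate what can be read.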

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/emulate.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c16314807756..6a1d60956d63 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -711,14 +711,18 @@ static int segmented_read_std(struct x86_emulate_ctxt 
*ctxt,
 static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
int rc;
-   int size;
+   unsigned size;
unsigned long linear;
int cur_size = ctxt->fetch.end - ctxt->fetch.data;
struct segmented_address addr = { .seg = VCPU_SREG_CS,
   .ea = ctxt->eip + cur_size };
 
-   size = min(15UL ^ cur_size,
-  PAGE_SIZE - offset_in_page(addr.ea));
+   size = 15UL ^ cur_size;
+   rc = __linearize(ctxt, addr, size, false, true, &linear);
+   if (unlikely(rc != X86EMUL_CONTINUE))
+   return rc;
+
+   size = min_t(unsigned, size, PAGE_SIZE - offset_in_page(linear));
 
/*
 * One instruction can only straddle two pages,
@@ -728,9 +732,6 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt 
*ctxt, int op_size)
 */
if (unlikely(size < op_size))
return X86EMUL_UNHANDLEABLE;
-   rc = __linearize(ctxt, addr, size, false, true, &linear);
-   if (unlikely(rc != X86EMUL_CONTINUE))
-   return rc;
rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end,
  size, &ctxt->exception);
if (unlikely(rc != X86EMUL_CONTINUE))
-- 
1.8.3.1




[PATCH resend 00/19] Emulator speedups for 3.17

2014-07-14 Thread Paolo Bonzini
These are the emulator speedup patches that have survived autotest
and kvm-unit-tests.  I dropped the patches for direct access to
memory operands because they caused a failure in vmx.flat.

These patches have been in kvm/queue for a while, but I've always
left them out of kvm/next because they hadn't undergone full review.
Since three months have passed, I'll include them in the next push to
kvm/next to give them more exposure and to prepare for 3.17.

Paolo

Bandan Das (6):
  KVM: emulate: move init_decode_cache to emulate.c
  KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
  KVM: emulate: cleanup decode_modrm
  KVM: emulate: clean up initializations in init_decode_cache
  KVM: emulate: rework seg_override
  KVM: emulate: do not initialize memopp

Paolo Bonzini (13):
  KVM: vmx: speed up emulation of invalid guest state
  KVM: x86: return all bits from get_interrupt_shadow
  KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
  KVM: emulate: move around some checks
  KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
  KVM: emulate: speed up emulated moves
  KVM: emulate: simplify writeback
  KVM: emulate: speed up do_insn_fetch
  KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
  KVM: emulate: avoid per-byte copying in instruction fetches
  KVM: emulate: put pointers in the fetch_cache
  KVM: x86: ensure emulator fetches do not span multiple pages
  KVM: x86: use kvm_read_guest_page for emulator accesses

 arch/x86/include/asm/kvm_emulate.h |  31 +--
 arch/x86/include/asm/kvm_host.h|   2 +-
 arch/x86/kvm/emulate.c | 379 +++--
 arch/x86/kvm/svm.c |   6 +-
 arch/x86/kvm/trace.h   |   6 +-
 arch/x86/kvm/vmx.c |   9 +-
 arch/x86/kvm/x86.c |  80 +---
 7 files changed, 277 insertions(+), 236 deletions(-)

-- 
1.8.3.1



[PATCH 13/19] KVM: emulate: do not initialize memopp

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

rip_relative is only set if decode_modrm runs, and if you have ModRM
you will also have a memopp.  We can then access memopp unconditionally.
Note that rip_relative cannot be hoisted up to decode_modrm, or you
break "mov $0, xyz(%rip)".

Also, move typecast on "out of range value" of mem.ea to decode_modrm.

Together, all these optimizations save about 50 cycles on each emulated
instructions (4-6%).

Signed-off-by: Bandan Das 
[Fix immediate operands with rip-relative addressing. - Paolo]
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h | 5 +
 arch/x86/kvm/emulate.c | 8 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 6c808408326f..fcf58cd25ebd 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -295,6 +295,11 @@ struct x86_emulate_ctxt {
struct operand dst;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+   /*
+* The following six fields are cleared together,
+* the rest are initialized unconditionally in x86_decode_insn
+* or elsewhere
+*/
bool rip_relative;
u8 rex_prefix;
u8 lock_prefix;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 08badf638fb0..390400a54a9c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1177,6 +1177,9 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
}
}
op->addr.mem.ea = modrm_ea;
+   if (ctxt->ad_bytes != 8)
+   ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
+
 done:
return rc;
 }
@@ -4425,9 +4428,6 @@ done_prefixes:
 
ctxt->memop.addr.mem.seg = ctxt->seg_override;
 
-   if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8)
-   ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
-
/*
 * Decode and fetch the source operand: register, memory
 * or immediate.
@@ -4448,7 +4448,7 @@ done_prefixes:
rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
 
 done:
-   if (ctxt->memopp && ctxt->memopp->type == OP_MEM && ctxt->rip_relative)
+   if (ctxt->rip_relative)
ctxt->memopp->addr.mem.ea += ctxt->_eip;
 
return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
-- 
1.8.3.1




[PATCH 08/19] KVM: emulate: move init_decode_cache to emulate.c

2014-07-14 Thread Paolo Bonzini
From: Bandan Das 

Core emulator functions all belong in emulate.c;
x86 should have no knowledge of emulator internals.

Signed-off-by: Bandan Das 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_emulate.h |  1 +
 arch/x86/kvm/emulate.c | 13 +
 arch/x86/kvm/x86.c | 13 -
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 432447370044..c22bd9af4311 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -409,6 +409,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt 
*ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
 u16 tss_selector, int idt_index, int reason,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 32d3da82da2e..851315d93bf7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4534,6 +4534,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *))
return X86EMUL_CONTINUE;
 }
 
+void init_decode_cache(struct x86_emulate_ctxt *ctxt)
+{
+   memset(&ctxt->opcode_len, 0,
+  (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
+
+   ctxt->fetch.start = 0;
+   ctxt->fetch.end = 0;
+   ctxt->io_read.pos = 0;
+   ctxt->io_read.end = 0;
+   ctxt->mem_read.pos = 0;
+   ctxt->mem_read.end = 0;
+}
+
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 {
const struct x86_emulate_ops *ops = ctxt->ops;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd9316786dca..905edf8557e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4888,19 +4888,6 @@ static void inject_emulated_exception(struct kvm_vcpu 
*vcpu)
kvm_queue_exception(vcpu, ctxt->exception.vector);
 }
 
-static void init_decode_cache(struct x86_emulate_ctxt *ctxt)
-{
-   memset(&ctxt->opcode_len, 0,
-  (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
-
-   ctxt->fetch.start = 0;
-   ctxt->fetch.end = 0;
-   ctxt->io_read.pos = 0;
-   ctxt->io_read.end = 0;
-   ctxt->mem_read.pos = 0;
-   ctxt->mem_read.end = 0;
-}
-
 static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
 {
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
-- 
1.8.3.1




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
> +/*
> + * Under certain circumstances, accessing certain MSRs may cause a #GP.
> + * The function tests if the input MSR can be safely accessed.
> + */
> +static inline bool check_msr(unsigned long msr)
> +{
> + u64 val_old, val_new, val_tmp;
> +
> + /*
> +  * Read the current value, change it and read it back to see if it
> +  * matches, this is needed to detect certain hardware emulators
> +  * (qemu/kvm) that don't trap on the MSR access and always return 0s.
> +  */
> + if (rdmsrl_safe(msr, &val_old))
> + goto msr_fail;
> + /*
> +  * Only change it slightly,
> +  * since the higher bits of some MSRs cannot be updated by wrmsrl.
> +  * E.g. MSR_LBR_TOS
> +  */
> + val_tmp = val_old ^ 0x3UL;
> + if (wrmsrl_safe(msr, val_tmp) ||
> + rdmsrl_safe(msr, &val_new))
> + goto msr_fail;
> +
> + if (val_new != val_tmp)
> + goto msr_fail;
> +
> + /* Here it's sure that the MSR can be safely accessed.
> +  * Restore the old value and return.
> +  */
> + wrmsrl(msr, val_old);
> +
> + return true;
> +
> +msr_fail:
> + return false;
> +}

I don't think we need the msr_fail thing, there's no state to clean up,
you can return false at all places you now have goto.
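
For illustration, a minimal sketch of that shape (same rdmsrl_safe/wrmsrl_safe
helpers as in the quoted patch, just returning false directly instead of
jumping to a label):

static bool check_msr(unsigned long msr)
{
	u64 val_old, val_new, val_tmp;

	/* Detect emulators that ignore the access and always return 0s. */
	if (rdmsrl_safe(msr, &val_old))
		return false;

	/*
	 * Only change the value slightly; the higher bits of some MSRs
	 * (e.g. MSR_LBR_TOS) cannot be updated by wrmsrl.
	 */
	val_tmp = val_old ^ 0x3UL;
	if (wrmsrl_safe(msr, val_tmp) || rdmsrl_safe(msr, &val_new))
		return false;

	if (val_new != val_tmp)
		return false;

	/* The MSR can be safely accessed; restore the old value. */
	wrmsrl(msr, val_old);
	return true;
}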




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
> From: Kan Liang 
> 
> x86, perf: Protect LBR and extra_regs against KVM lying
> 
> With -cpu host, KVM reports LBR and extra_regs support, if the host has 
> support.
> When the guest perf driver tries to access LBR or extra_regs MSR,
> it #GPs all MSR accesses, since KVM doesn't handle LBR and extra_regs support.
> So check the related MSRs access right once at initialization time to avoid 
> the error access at runtime.
> 
> For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
> (for host kernel).
> And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
> Start the guest with -cpu host.
> Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
> #GP.
> Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
> trigger offcore_rsp #GP

This is still not properly wrapped at 78 chars.

> Signed-off-by: Kan Liang 
> 
> V2: Move the check code to initialization time.
> V3: Add flag for each extra register.
> Check all LBR MSRs at initialization time.
> V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr 
> failed.
> Disable all extra msrs in creation places if check_msr failed.
> V5: Fix check_msr broken
> Don't check any more MSRs after the first fail
> Return error when checking fail to stop creating the event
> Remove the checking code path which never get

These things should go below the --- so they get thrown away when
applying the patch, its of no relevance once applied.

> ---
>  arch/x86/kernel/cpu/perf_event.c   |  3 +++
>  arch/x86/kernel/cpu/perf_event.h   | 45 
> ++
>  arch/x86/kernel/cpu/perf_event_intel.c | 38 +++-
>  3 files changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c 
> b/arch/x86/kernel/cpu/perf_event.c
> index 2bdfbff..a7c5e4b 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct 
> perf_event *event)
>   continue;
>   if (event->attr.config1 & ~er->valid_mask)
>   return -EINVAL;
> + /* Check if the extra msrs can be safely accessed*/
> + if (!x86_pmu.extra_msr_access[er->idx])
> + return -EFAULT;

This is not a correct usage of -EFAULT. Event creation did not fail
because we took a fault dereferencing a user provided pointer. Possibly
ENXIO is appropriate.
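
That is, the quoted hunk would presumably become something like:

		/* Check if the extra MSRs can be safely accessed */
		if (!x86_pmu.extra_msr_access[er->idx])
			return -ENXIO;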

>   reg->idx = er->idx;
>   reg->config = event->attr.config1;
> diff --git a/arch/x86/kernel/cpu/perf_event.h 
> b/arch/x86/kernel/cpu/perf_event.h
> index 3b2f9bd..992c678 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -464,6 +464,12 @@ struct x86_pmu {
>*/
>   struct extra_reg *extra_regs;
>   unsigned int er_flags;
> + /*
> +  * EXTRA REG MSR can be accessed
> +  * The extra registers are completely unrelated to each other.
> +  * So it needs a flag for each extra register.
> +  */
> + boolextra_msr_access[EXTRA_REG_MAX];

So why not in struct extra_reg again? You didn't give a straight answer
there.
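
A sketch of what keeping the flag next to the register it describes might look
like (struct layout as in the quoted kernel tree, with only the last field
added):

struct extra_reg {
	unsigned int		event;
	unsigned int		msr;
	u64			config_mask;
	u64			valid_mask;
	int			idx;			/* per_xxx->regs[] reg index */
	bool			extra_msr_access;	/* MSR can be safely accessed */
};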

> +/*
> + * Under certain circumstances, accessing certain MSRs may cause a #GP.
> + * The function tests if the input MSR can be safely accessed.
> + */
> +static inline bool check_msr(unsigned long msr)
> +{

This reads like a generic function;

> + u64 val_old, val_new, val_tmp;
> +
> + /*
> +  * Read the current value, change it and read it back to see if it
> +  * matches, this is needed to detect certain hardware emulators
> +  * (qemu/kvm) that don't trap on the MSR access and always return 0s.
> +  */
> + if (rdmsrl_safe(msr, &val_old))
> + goto msr_fail;
> + /*
> +  * Only change it slightly,
> +  * since the higher bits of some MSRs cannot be updated by wrmsrl.
> +  * E.g. MSR_LBR_TOS
> +  */
> + val_tmp = val_old ^ 0x3UL;

but this is not generally true; not all MSRs can write the 2 LSB, can
they? One option would be to extend the function with a u64 mask.
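
A sketch of the masked variant being suggested, where the caller decides which
bits are safe to toggle for a given MSR:

static bool check_msr(unsigned long msr, u64 mask)
{
	u64 val_old, val_new, val_tmp;

	if (rdmsrl_safe(msr, &val_old))
		return false;

	/* Only toggle the bits the caller says are writable for this MSR. */
	val_tmp = val_old ^ mask;
	if (wrmsrl_safe(msr, val_tmp) || rdmsrl_safe(msr, &val_new))
		return false;

	if (val_new != val_tmp)
		return false;

	wrmsrl(msr, val_old);
	return true;
}

The LBR probe, for example, could then be check_msr(x86_pmu.lbr_tos, 0x3UL).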

> + if (wrmsrl_safe(msr, val_tmp) ||
> + rdmsrl_safe(msr, &val_new))
> + goto msr_fail;
> +
> + if (val_new != val_tmp)
> + goto msr_fail;
> +
> + /* Here it's sure that the MSR can be safely accessed.
> +  * Restore the old value and return.
> +  */
> + wrmsrl(msr, val_old);
> +
> + return true;
> +
> +msr_fail:
> + return false;
> +}

Also, by now this function is far too large to be inline and in a
header.

> + /*
> +  * Accessing LBR MSRs may cause #GP under certain circumstances.
> +  * E.g. KVM doesn't support LBR MSRs.
> +  * Check all LBR MSRs here.
> +  * Disable LBR access if any LBR MSRs can not be accessed.

[PATCH] kvm: ppc: bookehv: Sync srr0/1 into GSRR0/1

2014-07-14 Thread Bharat Bhushan
This patch adds the missing sync of SRR0/1 in the set_regs interface.

Signed-off-by: Bharat Bhushan 
---
 arch/powerpc/kvm/booke.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index c2471ed..368b48e 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1326,6 +1326,10 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
kvmppc_set_msr(vcpu, regs->msr);
vcpu->arch.shared->srr0 = regs->srr0;
vcpu->arch.shared->srr1 = regs->srr1;
+#ifdef CONFIG_KVM_BOOKE_HV
+   mtspr(SPRN_GSRR0, regs->srr0);
+   mtspr(SPRN_GSRR1, regs->srr1);
+#endif
kvmppc_set_pid(vcpu, regs->pid);
vcpu->arch.shared->sprg0 = regs->sprg0;
vcpu->arch.shared->sprg1 = regs->sprg1;
-- 
1.9.3



Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.

2014-07-14 Thread Tang Chen

Hi Gleb,

Please see below.

On 07/12/2014 03:44 PM, Gleb Natapov wrote:

On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:

kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

Signed-off-by: Tang Chen
---
  arch/x86/include/asm/kvm_host.h |  1 -
  arch/x86/kvm/vmx.c  | 25 +++--
  2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4931415..62f973e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -578,7 +578,6 @@ struct kvm_arch {

gpa_t wall_clock;

-   struct page *ept_identity_pagetable;
bool ept_identity_pagetable_done;
gpa_t ept_identity_map_addr;

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0918635e..fe2e5f4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu 
*vcpu);
  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
  static bool vmx_mpx_supported(void);
+static int alloc_identity_pagetable(struct kvm *kvm);

  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -3921,21 +3922,21 @@ out:

  static int init_rmode_identity_map(struct kvm *kvm)
  {
-   int i, idx, r, ret;
+   int i, idx, r, ret = 0;
pfn_t identity_map_pfn;
u32 tmp;

if (!enable_ept)
return 1;
-   if (unlikely(!kvm->arch.ept_identity_pagetable)) {
-   printk(KERN_ERR "EPT: identity-mapping pagetable "
-   "haven't been allocated!\n");
-   return 0;
-   }
if (likely(kvm->arch.ept_identity_pagetable_done))
return 1;
-   ret = 0;
identity_map_pfn = kvm->arch.ept_identity_map_addr>>  PAGE_SHIFT;
+
+   mutex_lock(&kvm->slots_lock);

Why move this out of alloc_identity_pagetable()?



Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used
to protect kvm->arch.ept_identity_pagetable. If two or more threads try to
modify it at the same time, the mutex ensures that the identity table is only
allocated once.

Now we have dropped kvm->arch.ept_identity_pagetable and use
kvm->arch.ept_identity_pagetable_done to check whether the identity table is
allocated and initialized. So we should protect the memory slot operation in
alloc_identity_pagetable() and kvm->arch.ept_identity_pagetable_done with
this mutex.

Of course, I can see that the name "slots_lock" suggests it may be meant to
protect memory slot operations only, so maybe moving the lock out here is not
suitable.


If I'm wrong, please tell me.
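
In other words, roughly the pattern being described, shown only as an
illustration (the function name and the write_identity_map_entries() helper
are hypothetical; the real patch keeps the PTE-writing loop open-coded in
init_rmode_identity_map()):

static int init_rmode_identity_map_locked(struct kvm *kvm)
{
	int r = 0;

	mutex_lock(&kvm->slots_lock);
	if (kvm->arch.ept_identity_pagetable_done)
		goto out;

	/* Creates the private memslot for the identity table. */
	r = alloc_identity_pagetable(kvm);
	if (r)
		goto out;

	/* Hypothetical helper standing in for the PTE-writing loop. */
	r = write_identity_map_entries(kvm);
	if (!r)
		kvm->arch.ept_identity_pagetable_done = true;
out:
	mutex_unlock(&kvm->slots_lock);
	return r;
}

The point is simply that the "already done?" check, the memslot setup and the
flag update all happen under the same mutex, so two vcpus cannot race.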


+   r = alloc_identity_pagetable(kvm);
+   if (r)
+   goto out2;
+
idx = srcu_read_lock(&kvm->srcu);
r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
if (r<  0)
@@ -3953,6 +3954,9 @@ static int init_rmode_identity_map(struct kvm *kvm)
ret = 1;
  out:
srcu_read_unlock(&kvm->srcu, idx);
+
+out2:
+   mutex_unlock(&kvm->slots_lock);
return ret;
  }

@@ -4006,9 +4010,6 @@ static int alloc_identity_pagetable(struct kvm *kvm)
struct kvm_userspace_memory_region kvm_userspace_mem;
int r = 0;

-   mutex_lock(&kvm->slots_lock);
-   if (kvm->arch.ept_identity_pagetable)
-   goto out;
kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
kvm_userspace_mem.flags = 0;
kvm_userspace_mem.guest_phys_addr =
@@ -4025,9 +4026,7 @@ static int alloc_identity_pagetable(struct kvm *kvm)
goto out;
}

-   kvm->arch.ept_identity_pagetable = page;

I think we can drop gfn_to_page() above too now. Why would we need it?



Yes, will remove it in the next version.

Thanks.


  out:
-   mutex_unlock(&kvm->slots_lock);
return r;
  }

@@ -7583,8 +7582,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
kvm->arch.ept_identity_map_addr =
VMX_EPT_IDENTITY_PAGETABLE_ADDR;
err = -ENOMEM;
-   if (alloc_identity_pagetable(kvm) != 0)
-   goto free_vmcs;
if (!init_rmode_identity_map(kvm))
goto free_vmcs;
}
--
1.8.3.1



--
Gleb.
.



Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-14 Thread Tang Chen

Hi Gleb,

Thanks for the reply. Please see below.

On 07/12/2014 04:04 PM, Gleb Natapov wrote:

On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:

The apic access page is pinned in memory. As a result, it cannot be
migrated/hot-removed. Actually, it does not need to be pinned.

The hpa of the apic access page is stored in the VMCS APIC_ACCESS_ADDR pointer.
When the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate
the corresponding ept entry. This patch introduces a new vcpu request named
KVM_REQ_APIC_PAGE_RELOAD, makes this request to all the vcpus at that point,
and forces all the vcpus to exit the guest and re-enter it once they have
updated the VMCS APIC_ACCESS_ADDR pointer to the new apic access page address
and updated kvm->arch.apic_access_page to the new page.


By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism
is not used since no MMIO access to APIC is ever done. Have you tested
this with "-cpu modelname,-x2apic" qemu flag?


I used the following command line to test the patches.

# /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm 
-smp 2


And I think the guest used the APIC_ACCESS_ADDR mechanism, because the
previous patch-set had a problem that would show up when the apic page was
accessed, and it did happen.

I'll test this patch-set with the "-cpu modelname,-x2apic" flag.
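
For reference, that would just mean adding the flag to the command line above;
the CPU model name below is only an example:

# /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm \
	-smp 2 -cpu SandyBridge,-x2apic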




Signed-off-by: Tang Chen
---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/mmu.c  | 11 +++
  arch/x86/kvm/svm.c  |  6 ++
  arch/x86/kvm/vmx.c  |  8 +++-
  arch/x86/kvm/x86.c  | 14 ++
  include/linux/kvm_host.h|  2 ++
  virt/kvm/kvm_main.c | 12 
  7 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 62f973e..9ce6bfd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -737,6 +737,7 @@ struct kvm_x86_ops {
void (*hwapic_isr_update)(struct kvm *kvm, int isr);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..551693d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
 level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);

+   /*
+* apic access page could be migrated. When the guest tries to access
+* the apic access page, ept violation will occur, and we can use GUP
+* to find the new page.
+*
+* GUP will wait till the migrate entry be replaced with the new page.
+*/
+   if (gpa == APIC_DEFAULT_PHYS_BASE)
+   vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
+   APIC_DEFAULT_PHYS_BASE>>  PAGE_SHIFT);

Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?


I don't think we need to make the KVM_REQ_APIC_PAGE_RELOAD request here.

The request is made in kvm_mmu_notifier_invalidate_page(). The handler calls
gfn_to_page_no_pin() to get the new page, which waits until the migration has
finished, and then updates the VMCS APIC_ACCESS_ADDR pointer. So when the
vcpus are forced to exit guest mode, they wait until the VMCS
APIC_ACCESS_ADDR pointer has been updated.

As a result, we don't need to make the request here.
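
A rough sketch of that flow, pieced together only from the hunks quoted in
this mail (the function name is illustrative and the actual handler in the
patch may differ in detail):

/* Called from vcpu_enter_guest() when KVM_REQ_APIC_PAGE_RELOAD is pending. */
static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;

	/*
	 * gfn_to_page_no_pin() waits until the migration entry has been
	 * replaced with the new page and returns the page without pinning it.
	 */
	kvm->arch.apic_access_page = gfn_to_page_no_pin(kvm,
				APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);

	/* Update VMCS APIC_ACCESS_ADDR (a no-op in the SVM callback). */
	kvm_x86_ops->set_apic_access_page_addr(kvm,
				page_to_phys(kvm->arch.apic_access_page));
}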





+
return r;

  out_unlock:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 576b525..dc76f29 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3612,6 +3612,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu 
*vcpu, bool set)
return;
  }

+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+   return;
+}
+
  static int svm_vm_has_apicv(struct kvm *kvm)
  {
return 0;
@@ -4365,6 +4370,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
+   .set_apic_access_page_addr = svm_set_apic_access_page_addr,
.vm_has_apicv = svm_vm_has_apicv,
.load_eoi_exitmap = svm_load_eoi_exitmap,
.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5532ac8..f7c6313 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3992,7 +3992,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
if (r)
goto out;

-