Re: KVM call agenda for 2014-04-28

2014-04-28 Thread Michael S. Tsirkin
On Mon, Apr 28, 2014 at 05:34:34PM +0200, Markus Armbruster wrote:
> Juan Quintela  writes:
> 
> > Hi
> >
> > Please, send any topic that you are interested in covering.
> 
> [...]
> 
> I'd like to have these things settled sooner than five minutes before
> the scheduled hour, so here goes: call or no call?  Agenda?

If not too late, I'd like to discuss our security process.
Do we as a project generally agree to use a responsible disclosure policy
(http://en.wikipedia.org/wiki/Responsible_disclosure)?



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/2] target-i386: block migration and savevm if invariant tsc is exposed

2014-04-28 Thread Paolo Bonzini

On 28/04/2014 21:23, Eduardo Habkost wrote:

> Makes sense.  Basically "-cpu host,migratable=yes" is close to
> libvirt's host-model and Alex Graf's proposed "-cpu best".  Should we
> call it "-cpu best" and drop migratability of "-cpu host"?

"-cpu best" is different from the modes above. It means "use the best
existing CPU model (from the pre-defined table) that can run on this
host".


Yes, it's not exactly the same.  In practice the behavior should be close.


And it would have the same ambiguities we found with "-cpu host": if a CPU
model in the table has invtsc enabled, should it be considered a
candidate for "-cpu best", or not?


No CPU model in the table should have invtsc enabled. :)

Paolo


[PATCH v11 11/12] ARM/ARM64: KVM: Emulate PSCI v0.2 CPU_SUSPEND

2014-04-28 Thread Anup Patel
This patch adds emulation of the PSCI v0.2 CPU_SUSPEND function call for
KVM ARM/ARM64. This is a CPU-level function call which can suspend the
current CPU or the current CPU cluster. We don't have VCPU clusters in
KVM, so we only suspend the current VCPU.

The CPU_SUSPEND emulation is not tested much because currently there
is no CPUIDLE driver in the Linux kernel that uses PSCI CPU_SUSPEND. The
PSCI CPU_SUSPEND implementation in the ARM64 kernel was tested using a
simple CPUIDLE driver which is not published due to unstable DT bindings
for PSCI.
(For more info, see http://lwn.net/Articles/574950/)

For simplicity, we implement CPU_SUSPEND emulation similar to WFI
(Wait-for-interrupt) emulation, and we also treat a power-down request
the same as a stand-by request. This is consistent with sections
5.4.1 and 5.4.2 of the PSCI v0.2 specification.

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Acked-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/psci.c |   28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 1067579..09cf377 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -37,6 +37,26 @@ static unsigned long psci_affinity_mask(unsigned long affinity_level)
return 0;
 }
 
+static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
+{
+   /*
+* NOTE: For simplicity, we make VCPU suspend emulation the
+* same as WFI (Wait-for-interrupt) emulation.
+*
+* This means for KVM the wakeup events are interrupts and
+* this is consistent with intended use of StateID as described
+* in section 5.4.1 of PSCI v0.2 specification (ARM DEN 0022A).
+*
+* Further, we also treat power-down request to be same as
+* stand-by request as-per section 5.4.2 clause 3 of PSCI v0.2
+* specification (ARM DEN 0022A). This means all suspend states
+* for KVM will preserve the register state.
+*/
+   kvm_vcpu_block(vcpu);
+
+   return PSCI_RET_SUCCESS;
+}
+
 static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
 {
vcpu->arch.pause = true;
@@ -183,6 +203,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
 */
val = 2;
break;
+   case PSCI_0_2_FN_CPU_SUSPEND:
+   case PSCI_0_2_FN64_CPU_SUSPEND:
+   val = kvm_psci_vcpu_suspend(vcpu);
+   break;
case PSCI_0_2_FN_CPU_OFF:
kvm_psci_vcpu_off(vcpu);
val = PSCI_RET_SUCCESS;
@@ -235,10 +259,6 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
val = PSCI_RET_INTERNAL_FAILURE;
ret = 0;
break;
-   case PSCI_0_2_FN_CPU_SUSPEND:
-   case PSCI_0_2_FN64_CPU_SUSPEND:
-   val = PSCI_RET_NOT_SUPPORTED;
-   break;
default:
return -EINVAL;
}
-- 
1.7.9.5



[PATCH v11 10/12] ARM/ARM64: KVM: Fix CPU_ON emulation for PSCI v0.2

2014-04-28 Thread Anup Patel
As per PSCI v0.2, the source CPU provides the physical address of the
"entry point" and a "context id" for starting a target CPU. Also,
if the target CPU is already running, then we should return ALREADY_ON.

The current emulation of the CPU_ON function does not consider the
"context id" and returns INVALID_PARAMETERS if the target
CPU is already running.

This patch updates kvm_psci_vcpu_on() such that it works for both
PSCI v0.1 and PSCI v0.2.
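
The version-dependent return value described above can be sketched as a small
standalone model (an illustration only; the constant values follow the PSCI
v0.2 spec, and the helper name is invented for this sketch):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the patched CPU_ON check. The constant values
 * follow the PSCI v0.2 spec; the helper name is invented here. */
#define PSCI_RET_SUCCESS        0
#define PSCI_RET_INVALID_PARAMS (-2)
#define PSCI_RET_ALREADY_ON     (-4)

#define KVM_ARM_PSCI_0_1 1
#define KVM_ARM_PSCI_0_2 2

/* If the target VCPU is not paused (i.e. already running), PSCI v0.2
 * callers get ALREADY_ON while v0.1 callers keep the old
 * INVALID_PARAMETERS behavior. */
static long cpu_on_result(int caller_psci_version, bool target_paused)
{
	if (!target_paused) {
		if (caller_psci_version != KVM_ARM_PSCI_0_1)
			return PSCI_RET_ALREADY_ON;
		return PSCI_RET_INVALID_PARAMS;
	}
	return PSCI_RET_SUCCESS;
}
```

This keeps the old v0.1 semantics intact while giving v0.2 guests the
spec-mandated ALREADY_ON error.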

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/psci.c |   15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index cce901a..1067579 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -48,6 +48,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
struct kvm_vcpu *vcpu = NULL, *tmp;
wait_queue_head_t *wq;
unsigned long cpu_id;
+   unsigned long context_id;
unsigned long mpidr;
phys_addr_t target_pc;
int i;
@@ -68,10 +69,17 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
 * Make sure the caller requested a valid CPU and that the CPU is
 * turned off.
 */
-   if (!vcpu || !vcpu->arch.pause)
+   if (!vcpu)
return PSCI_RET_INVALID_PARAMS;
+   if (!vcpu->arch.pause) {
+   if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1)
+   return PSCI_RET_ALREADY_ON;
+   else
+   return PSCI_RET_INVALID_PARAMS;
+   }
 
target_pc = *vcpu_reg(source_vcpu, 2);
+   context_id = *vcpu_reg(source_vcpu, 3);
 
kvm_reset_vcpu(vcpu);
 
@@ -86,6 +94,11 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
kvm_vcpu_set_be(vcpu);
 
*vcpu_pc(vcpu) = target_pc;
+   /*
+* NOTE: We always update r0 (or x0) because for PSCI v0.1
+* the general purpose registers are undefined upon CPU_ON.
+*/
+   *vcpu_reg(vcpu, 0) = context_id;
vcpu->arch.pause = false;
smp_mb();   /* Make sure the above is visible */
 
-- 
1.7.9.5



[PATCH v11 12/12] ARM/ARM64: KVM: Advertise KVM_CAP_ARM_PSCI_0_2 to user space

2014-04-28 Thread Anup Patel
We have PSCI v0.2 emulation available in KVM ARM/ARM64,
hence we advertise this to user space (i.e. QEMU or KVMTOOL)
via the KVM_CHECK_EXTENSION ioctl.
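
For illustration, the user-space side of this probe can be modeled without a
real /dev/kvm by passing the ioctl as a function pointer (the capability
number is illustrative here; the real value comes from <linux/kvm.h>):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative capability number; the real value is defined as
 * KVM_CAP_ARM_PSCI_0_2 in <linux/kvm.h>. */
#define KVM_CAP_ARM_PSCI_0_2 102

/* check_ext stands in for ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap),
 * so the probing logic can be exercised without a real kernel. */
static bool psci_0_2_available(int (*check_ext)(long cap))
{
	return check_ext(KVM_CAP_ARM_PSCI_0_2) > 0;
}

/* Stubs emulating a kernel with and without the capability. */
static int check_ext_new_kernel(long cap)
{
	return cap == KVM_CAP_ARM_PSCI_0_2 ? 1 : 0;
}

static int check_ext_old_kernel(long cap)
{
	(void)cap;
	return 0;
}
```

User space would only set the KVM_ARM_VCPU_PSCI_0_2 feature bit when this
probe succeeds, keeping old kernels working unchanged.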

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Acked-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/arm.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index f0e50a0..3c82b37 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -197,6 +197,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
case KVM_CAP_ONE_REG:
case KVM_CAP_ARM_PSCI:
+   case KVM_CAP_ARM_PSCI_0_2:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
-- 
1.7.9.5



[PATCH v11 09/12] ARM/ARM64: KVM: Emulate PSCI v0.2 MIGRATE_INFO_TYPE and related functions

2014-04-28 Thread Anup Patel
This patch adds emulation of PSCI v0.2 MIGRATE, MIGRATE_INFO_TYPE, and
MIGRATE_INFO_UP_CPU function calls for KVM ARM/ARM64.

KVM ARM/ARM64 being a hypervisor (and not a Trusted OS), we cannot provide
these functions, hence we emulate them in the following way:
1. MIGRATE - returns "Not Supported"
2. MIGRATE_INFO_TYPE - returns 2, i.e. Trusted OS is not present
3. MIGRATE_INFO_UP_CPU - returns "Not Supported"
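
The three rules above amount to a tiny mapping, sketched here standalone (the
constant values are taken from the PSCI v0.2 spec; the enum and helper names
are invented for this sketch):

```c
#include <assert.h>

/* Illustrative constants; the real definitions live in uapi/linux/psci.h
 * (NOT_SUPPORTED is -1 and TOS_MP is 2 per the PSCI v0.2 spec). */
#define PSCI_RET_NOT_SUPPORTED (-1)
#define PSCI_0_2_TOS_MP        2

enum migrate_fn {
	FN_MIGRATE,
	FN_MIGRATE_INFO_TYPE,
	FN_MIGRATE_INFO_UP_CPU,
};

/* A hypervisor without a Trusted OS rejects MIGRATE and
 * MIGRATE_INFO_UP_CPU, and reports TOS_MP ("Trusted OS is MP or not
 * present, migration not required") for MIGRATE_INFO_TYPE. */
static long migrate_result(enum migrate_fn fn)
{
	switch (fn) {
	case FN_MIGRATE_INFO_TYPE:
		return PSCI_0_2_TOS_MP;
	default:
		return PSCI_RET_NOT_SUPPORTED;
	}
}
```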

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/psci.c |   21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 3b6a0cf..cce901a 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -182,6 +182,22 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
case PSCI_0_2_FN64_AFFINITY_INFO:
val = kvm_psci_vcpu_affinity_info(vcpu);
break;
+   case PSCI_0_2_FN_MIGRATE:
+   case PSCI_0_2_FN64_MIGRATE:
+   val = PSCI_RET_NOT_SUPPORTED;
+   break;
+   case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
+   /*
+* Trusted OS is MP hence does not require migration
+* or
+* Trusted OS is not present
+*/
+   val = PSCI_0_2_TOS_MP;
+   break;
+   case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
+   case PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU:
+   val = PSCI_RET_NOT_SUPPORTED;
+   break;
case PSCI_0_2_FN_SYSTEM_OFF:
kvm_psci_system_off(vcpu);
/*
@@ -207,12 +223,7 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
ret = 0;
break;
case PSCI_0_2_FN_CPU_SUSPEND:
-   case PSCI_0_2_FN_MIGRATE:
-   case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
-   case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
case PSCI_0_2_FN64_CPU_SUSPEND:
-   case PSCI_0_2_FN64_MIGRATE:
-   case PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU:
val = PSCI_RET_NOT_SUPPORTED;
break;
default:
-- 
1.7.9.5



[PATCH v11 07/12] ARM/ARM64: KVM: Emulate PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET

2014-04-28 Thread Anup Patel
The PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET functions are system-level
functions and hence cannot be fully emulated by the in-kernel PSCI
emulation code.

To tackle this, we forward PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET function
calls from the vcpu to user space (i.e. QEMU or KVMTOOL) via the kvm_run
structure using the KVM_EXIT_SYSTEM_EVENT exit reason.

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/psci.c |   46 +++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 14e6fa6..5936213 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -85,6 +85,23 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
return PSCI_RET_SUCCESS;
 }
 
+static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type)
+{
+   memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
+   vcpu->run->system_event.type = type;
+   vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+}
+
+static void kvm_psci_system_off(struct kvm_vcpu *vcpu)
+{
+   kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN);
+}
+
+static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
+{
+   kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
+}
+
 int kvm_psci_version(struct kvm_vcpu *vcpu)
 {
if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
@@ -95,6 +112,7 @@ int kvm_psci_version(struct kvm_vcpu *vcpu)
 
 static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
 {
+   int ret = 1;
unsigned long psci_fn = *vcpu_reg(vcpu, 0) & ~((u32) 0);
unsigned long val;
 
@@ -114,13 +132,35 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
case PSCI_0_2_FN64_CPU_ON:
val = kvm_psci_vcpu_on(vcpu);
break;
+   case PSCI_0_2_FN_SYSTEM_OFF:
+   kvm_psci_system_off(vcpu);
+   /*
+* We shouldn't be going back to guest VCPU after
+* receiving SYSTEM_OFF request.
+*
+* If user space accidentally/deliberately resumes
+* guest VCPU after SYSTEM_OFF request then guest
+* VCPU should see internal failure from PSCI return
+* value. To achieve this, we preload r0 (or x0) with
+* PSCI return value INTERNAL_FAILURE.
+*/
+   val = PSCI_RET_INTERNAL_FAILURE;
+   ret = 0;
+   break;
+   case PSCI_0_2_FN_SYSTEM_RESET:
+   kvm_psci_system_reset(vcpu);
+   /*
+* Same reason as SYSTEM_OFF for preloading r0 (or x0)
+* with PSCI return value INTERNAL_FAILURE.
+*/
+   val = PSCI_RET_INTERNAL_FAILURE;
+   ret = 0;
+   break;
case PSCI_0_2_FN_CPU_SUSPEND:
case PSCI_0_2_FN_AFFINITY_INFO:
case PSCI_0_2_FN_MIGRATE:
case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
-   case PSCI_0_2_FN_SYSTEM_OFF:
-   case PSCI_0_2_FN_SYSTEM_RESET:
case PSCI_0_2_FN64_CPU_SUSPEND:
case PSCI_0_2_FN64_AFFINITY_INFO:
case PSCI_0_2_FN64_MIGRATE:
@@ -132,7 +172,7 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
}
 
*vcpu_reg(vcpu, 0) = val;
-   return 1;
+   return ret;
 }
 
 static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
-- 
1.7.9.5



[PATCH v11 08/12] ARM/ARM64: KVM: Emulate PSCI v0.2 AFFINITY_INFO

2014-04-28 Thread Anup Patel
This patch adds emulation of the PSCI v0.2 AFFINITY_INFO function call
for KVM ARM/ARM64. This is a VCPU-level function call which will be
used to determine the current state of a given affinity level.
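
The mask computation at the heart of this emulation can be exercised as a
standalone snippet. This is a copy of the patch's psci_affinity_mask(),
assuming the 32-bit ARM values MPIDR_LEVEL_BITS = 8 and
MPIDR_HWID_BITMASK = 0xFFFFFF (arm64 uses wider values):

```c
#include <assert.h>

/* 32-bit ARM values, assumed for this sketch. */
#define MPIDR_LEVEL_BITS   8
#define MPIDR_HWID_BITMASK 0xFFFFFFUL

#define AFFINITY_MASK(level) ~((0x1UL << ((level) * MPIDR_LEVEL_BITS)) - 1)

/* Mask selecting the MPIDR bits above the lowest affinity level the
 * caller cares about; an out-of-range level yields an empty mask,
 * which the caller turns into INVALID_PARAMETERS. */
static unsigned long psci_affinity_mask(unsigned long affinity_level)
{
	if (affinity_level <= 3)
		return MPIDR_HWID_BITMASK & AFFINITY_MASK(affinity_level);

	return 0;
}
```

For example, level 1 keeps only the Aff1 and Aff2 fields of the MPIDR, so
all VCPUs in the same cluster compare equal under the mask.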

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/kvm/psci.c |   52 +--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 5936213..3b6a0cf 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -27,6 +27,16 @@
  * as described in ARM document number ARM DEN 0022A.
  */
 
+#define AFFINITY_MASK(level)   ~((0x1UL << ((level) * MPIDR_LEVEL_BITS)) - 1)
+
+static unsigned long psci_affinity_mask(unsigned long affinity_level)
+{
+   if (affinity_level <= 3)
+   return MPIDR_HWID_BITMASK & AFFINITY_MASK(affinity_level);
+
+   return 0;
+}
+
 static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
 {
vcpu->arch.pause = true;
@@ -85,6 +95,42 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
return PSCI_RET_SUCCESS;
 }
 
+static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
+{
+   int i;
+   unsigned long mpidr;
+   unsigned long target_affinity;
+   unsigned long target_affinity_mask;
+   unsigned long lowest_affinity_level;
+   struct kvm *kvm = vcpu->kvm;
+   struct kvm_vcpu *tmp;
+
+   target_affinity = *vcpu_reg(vcpu, 1);
+   lowest_affinity_level = *vcpu_reg(vcpu, 2);
+
+   /* Determine target affinity mask */
+   target_affinity_mask = psci_affinity_mask(lowest_affinity_level);
+   if (!target_affinity_mask)
+   return PSCI_RET_INVALID_PARAMS;
+
+   /* Ignore other bits of target affinity */
+   target_affinity &= target_affinity_mask;
+
+   /*
+* If one or more VCPU matching target affinity are running
+* then ON else OFF
+*/
+   kvm_for_each_vcpu(i, tmp, kvm) {
+   mpidr = kvm_vcpu_get_mpidr(tmp);
+   if (((mpidr & target_affinity_mask) == target_affinity) &&
+   !tmp->arch.pause) {
+   return PSCI_0_2_AFFINITY_LEVEL_ON;
+   }
+   }
+
+   return PSCI_0_2_AFFINITY_LEVEL_OFF;
+}
+
 static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type)
 {
memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
@@ -132,6 +178,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
case PSCI_0_2_FN64_CPU_ON:
val = kvm_psci_vcpu_on(vcpu);
break;
+   case PSCI_0_2_FN_AFFINITY_INFO:
+   case PSCI_0_2_FN64_AFFINITY_INFO:
+   val = kvm_psci_vcpu_affinity_info(vcpu);
+   break;
case PSCI_0_2_FN_SYSTEM_OFF:
kvm_psci_system_off(vcpu);
/*
@@ -157,12 +207,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
ret = 0;
break;
case PSCI_0_2_FN_CPU_SUSPEND:
-   case PSCI_0_2_FN_AFFINITY_INFO:
case PSCI_0_2_FN_MIGRATE:
case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
case PSCI_0_2_FN64_CPU_SUSPEND:
-   case PSCI_0_2_FN64_AFFINITY_INFO:
case PSCI_0_2_FN64_MIGRATE:
case PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU:
val = PSCI_RET_NOT_SUPPORTED;
-- 
1.7.9.5



[PATCH v11 04/12] KVM: Documentation: Add info regarding KVM_ARM_VCPU_PSCI_0_2 feature

2014-04-28 Thread Anup Patel
We have in-kernel emulation of PSCI v0.2 in KVM ARM/ARM64. To provide
the PSCI v0.2 interface to VCPUs, we have to enable the
KVM_ARM_VCPU_PSCI_0_2 feature when doing the KVM_ARM_VCPU_INIT ioctl.

This patch updates the documentation of the KVM_ARM_VCPU_INIT ioctl to
provide info regarding the KVM_ARM_VCPU_PSCI_0_2 feature.

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Acked-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 Documentation/virtual/kvm/api.txt |2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a9380ba5..6dc1db5 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2376,6 +2376,8 @@ Possible features:
  Depends on KVM_CAP_ARM_PSCI.
- KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode.
  Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
+   - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 for the CPU.
+ Depends on KVM_CAP_ARM_PSCI_0_2.
 
 
 4.83 KVM_ARM_PREFERRED_TARGET
-- 
1.7.9.5



[PATCH v11 05/12] ARM/ARM64: KVM: Make kvm_psci_call() return convention more flexible

2014-04-28 Thread Anup Patel
Currently, kvm_psci_call() returns 'true' or 'false' based on whether
the PSCI function call was handled successfully or not. This does not help
us emulate system-level PSCI functions where the actual emulation work will
be done by user space (QEMU or KVMTOOL). Examples of such system-level PSCI
functions are the PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET.

This patch updates kvm_psci_call() to return three types of values:
1) > 0 (success)
2) = 0 (success but exit to user space)
3) < 0 (errors)
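
For illustration, the caller's side of this contract (handle_hvc() in the
patch) reduces to a three-way dispatch; names below are invented for this
sketch:

```c
#include <assert.h>

/* Toy dispatcher mirroring how handle_hvc() consumes the tri-state
 * return value of kvm_psci_call(). */
enum hvc_action { INJECT_UNDEF, EXIT_TO_USERSPACE, RESUME_GUEST };

static enum hvc_action dispatch_psci_ret(int psci_ret)
{
	if (psci_ret < 0)
		return INJECT_UNDEF;      /* error: inject undefined exception */
	if (psci_ret == 0)
		return EXIT_TO_USERSPACE; /* e.g. SYSTEM_OFF / SYSTEM_RESET */
	return RESUME_GUEST;              /* handled entirely in kernel */
}
```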

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_psci.h   |2 +-
 arch/arm/kvm/handle_exit.c|   10 +++---
 arch/arm/kvm/psci.c   |   28 
 arch/arm64/include/asm/kvm_psci.h |2 +-
 arch/arm64/kvm/handle_exit.c  |   10 +++---
 5 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/arm/include/asm/kvm_psci.h b/arch/arm/include/asm/kvm_psci.h
index 4c0e3e1..6bda945 100644
--- a/arch/arm/include/asm/kvm_psci.h
+++ b/arch/arm/include/asm/kvm_psci.h
@@ -22,6 +22,6 @@
 #define KVM_ARM_PSCI_0_2   2
 
 int kvm_psci_version(struct kvm_vcpu *vcpu);
-bool kvm_psci_call(struct kvm_vcpu *vcpu);
+int kvm_psci_call(struct kvm_vcpu *vcpu);
 
 #endif /* __ARM_KVM_PSCI_H__ */
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index 0de91fc..4c979d4 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -38,14 +38,18 @@ static int handle_svc_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
+   int ret;
+
trace_kvm_hvc(*vcpu_pc(vcpu), *vcpu_reg(vcpu, 0),
  kvm_vcpu_hvc_get_imm(vcpu));
 
-   if (kvm_psci_call(vcpu))
+   ret = kvm_psci_call(vcpu);
+   if (ret < 0) {
+   kvm_inject_undefined(vcpu);
return 1;
+   }
 
-   kvm_inject_undefined(vcpu);
-   return 1;
+   return ret;
 }
 
 static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 8c42596c..14e6fa6 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -93,7 +93,7 @@ int kvm_psci_version(struct kvm_vcpu *vcpu)
return KVM_ARM_PSCI_0_1;
 }
 
-static bool kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
+static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
 {
unsigned long psci_fn = *vcpu_reg(vcpu, 0) & ~((u32) 0);
unsigned long val;
@@ -128,14 +128,14 @@ static bool kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
val = PSCI_RET_NOT_SUPPORTED;
break;
default:
-   return false;
+   return -EINVAL;
}
 
*vcpu_reg(vcpu, 0) = val;
-   return true;
+   return 1;
 }
 
-static bool kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
+static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
 {
unsigned long psci_fn = *vcpu_reg(vcpu, 0) & ~((u32) 0);
unsigned long val;
@@ -153,11 +153,11 @@ static bool kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
val = PSCI_RET_NOT_SUPPORTED;
break;
default:
-   return false;
+   return -EINVAL;
}
 
*vcpu_reg(vcpu, 0) = val;
-   return true;
+   return 1;
 }
 
 /**
@@ -165,12 +165,16 @@ static bool kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
  * @vcpu: Pointer to the VCPU struct
  *
  * Handle PSCI calls from guests through traps from HVC instructions.
- * The calling convention is similar to SMC calls to the secure world where
- * the function number is placed in r0 and this function returns true if the
- * function number specified in r0 is withing the PSCI range, and false
- * otherwise.
+ * The calling convention is similar to SMC calls to the secure world
+ * where the function number is placed in r0.
+ *
+ * This function returns: > 0 (success), 0 (success but exit to user
+ * space), and < 0 (errors)
+ *
+ * Errors:
+ * -EINVAL: Unrecognized PSCI function
  */
-bool kvm_psci_call(struct kvm_vcpu *vcpu)
+int kvm_psci_call(struct kvm_vcpu *vcpu)
 {
switch (kvm_psci_version(vcpu)) {
case KVM_ARM_PSCI_0_2:
@@ -178,6 +182,6 @@ bool kvm_psci_call(struct kvm_vcpu *vcpu)
case KVM_ARM_PSCI_0_1:
return kvm_psci_0_1_call(vcpu);
default:
-   return false;
+   return -EINVAL;
};
 }
diff --git a/arch/arm64/include/asm/kvm_psci.h b/arch/arm64/include/asm/kvm_psci.h
index e25c658..bc39e55 100644
--- a/arch/arm64/include/asm/kvm_psci.h
+++ b/arch/arm64/include/asm/kvm_psci.h
@@ -22,6 +22,6 @@
 #define KVM_ARM_PSCI_0_2   2
 
 int kvm_psci_version(struct kvm_vcpu *vcpu);
-bool kvm_psci_call(struct kvm_vcpu *vcpu);
+int kvm_psci_call(struct kvm_vcpu *vcpu);
 
 #endif /* __ARM64_KVM_PSCI_H__ */
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.

[PATCH v11 06/12] KVM: Add KVM_EXIT_SYSTEM_EVENT to user space API header

2014-04-28 Thread Anup Patel
Currently, we don't have an exit reason to notify user space about
a system-level event (e.g. system reset or shutdown) triggered
by the VCPU. This patch adds the exit reason KVM_EXIT_SYSTEM_EVENT for
this purpose. We can also inform user space about the 'type' and
architecture-specific 'flags' of a system-level event using the
kvm_run structure.

This newly added KVM_EXIT_SYSTEM_EVENT will be used by KVM ARM/ARM64
in-kernel PSCI v0.2 support to reset/shutdown VMs.
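
The user-space side of this exit can be sketched with a mock of the relevant
kvm_run fields (the real definitions live in <linux/kvm.h>; struct and enum
names here are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Values from this series' uapi changes. */
#define KVM_EXIT_SYSTEM_EVENT     24
#define KVM_SYSTEM_EVENT_SHUTDOWN 1
#define KVM_SYSTEM_EVENT_RESET    2

/* Mock of the relevant kvm_run fields. */
struct mock_run {
	uint32_t exit_reason;
	struct { uint32_t type; uint64_t flags; } system_event;
};

enum vm_action { VM_CONTINUE, VM_SHUTDOWN, VM_RESET };

/* After KVM_RUN returns, the VMM inspects the exit reason and turns
 * the system-level event into an action of its own. */
static enum vm_action handle_exit(const struct mock_run *run)
{
	if (run->exit_reason != KVM_EXIT_SYSTEM_EVENT)
		return VM_CONTINUE;
	switch (run->system_event.type) {
	case KVM_SYSTEM_EVENT_SHUTDOWN:
		return VM_SHUTDOWN;
	case KVM_SYSTEM_EVENT_RESET:
		return VM_RESET;
	default:
		return VM_CONTINUE;
	}
}
```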

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Reviewed-by: Christoffer Dall 
Reviewed-by: Marc Zyngier 
---
 Documentation/virtual/kvm/api.txt |   15 +++
 include/uapi/linux/kvm.h  |8 
 2 files changed, 23 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6dc1db5..c02d725 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2740,6 +2740,21 @@ It gets triggered whenever both KVM_CAP_PPC_EPR are enabled and an
 external interrupt has just been delivered into the guest. User space
 should put the acknowledged interrupt vector into the 'epr' field.
 
+   /* KVM_EXIT_SYSTEM_EVENT */
+   struct {
+#define KVM_SYSTEM_EVENT_SHUTDOWN   1
+#define KVM_SYSTEM_EVENT_RESET  2
+   __u32 type;
+   __u64 flags;
+   } system_event;
+
+If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
+a system-level event using some architecture specific mechanism (hypercall
+or some special instruction). In case of ARM/ARM64, this is triggered using
+HVC instruction based PSCI call from the vcpu. The 'type' field describes
+the system-level event type. The 'flags' field describes architecture
+specific flags for the system-level event.
+
/* Fix the size of the union. */
char padding[256];
};
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 01c5624..e86c36a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -171,6 +171,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_WATCHDOG 21
 #define KVM_EXIT_S390_TSCH22
 #define KVM_EXIT_EPR  23
+#define KVM_EXIT_SYSTEM_EVENT 24
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -301,6 +302,13 @@ struct kvm_run {
struct {
__u32 epr;
} epr;
+   /* KVM_EXIT_SYSTEM_EVENT */
+   struct {
+#define KVM_SYSTEM_EVENT_SHUTDOWN   1
+#define KVM_SYSTEM_EVENT_RESET  2
+   __u32 type;
+   __u64 flags;
+   } system_event;
/* Fix the size of the union. */
char padding[256];
};
-- 
1.7.9.5



[PATCH v11 03/12] ARM/ARM64: KVM: Add base for PSCI v0.2 emulation

2014-04-28 Thread Anup Patel
Currently, the in-kernel PSCI emulation provides PSCI v0.1 interface to
VCPUs. This patch extends current in-kernel PSCI emulation to provide
PSCI v0.2 interface to VCPUs.

By default, ARM/ARM64 KVM will always provide the PSCI v0.1 interface to
keep the ABI backward-compatible.

To select PSCI v0.2 interface for VCPUs, the user space (i.e. QEMU or
KVMTOOL) will have to set KVM_ARM_VCPU_PSCI_0_2 feature when doing VCPU
init using KVM_ARM_VCPU_INIT ioctl.

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Acked-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_host.h   |2 +-
 arch/arm/include/asm/kvm_psci.h   |4 ++
 arch/arm/include/uapi/asm/kvm.h   |   10 ++--
 arch/arm/kvm/psci.c   |   93 ++---
 arch/arm64/include/asm/kvm_host.h |2 +-
 arch/arm64/include/asm/kvm_psci.h |4 ++
 arch/arm64/include/uapi/asm/kvm.h |   10 ++--
 7 files changed, 99 insertions(+), 26 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 09af149..193ceaf 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -36,7 +36,7 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 #define KVM_HAVE_ONE_REG
 
-#define KVM_VCPU_MAX_FEATURES 1
+#define KVM_VCPU_MAX_FEATURES 2
 
 #include 
 
diff --git a/arch/arm/include/asm/kvm_psci.h b/arch/arm/include/asm/kvm_psci.h
index 9a83d98..4c0e3e1 100644
--- a/arch/arm/include/asm/kvm_psci.h
+++ b/arch/arm/include/asm/kvm_psci.h
@@ -18,6 +18,10 @@
 #ifndef __ARM_KVM_PSCI_H__
 #define __ARM_KVM_PSCI_H__
 
+#define KVM_ARM_PSCI_0_1   1
+#define KVM_ARM_PSCI_0_2   2
+
+int kvm_psci_version(struct kvm_vcpu *vcpu);
 bool kvm_psci_call(struct kvm_vcpu *vcpu);
 
 #endif /* __ARM_KVM_PSCI_H__ */
diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index ef0c878..e6ebdd3 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -20,6 +20,7 @@
 #define __ARM_KVM_H__
 
 #include 
+#include 
 #include 
 
 #define __KVM_HAVE_GUEST_DEBUG
@@ -83,6 +84,7 @@ struct kvm_regs {
 #define KVM_VGIC_V2_CPU_SIZE   0x2000
 
 #define KVM_ARM_VCPU_POWER_OFF 0 /* CPU is started in OFF state */
+#define KVM_ARM_VCPU_PSCI_0_2  1 /* CPU uses PSCI v0.2 */
 
 struct kvm_vcpu_init {
__u32 target;
@@ -201,9 +203,9 @@ struct kvm_arch_memory_slot {
 #define KVM_PSCI_FN_CPU_ON KVM_PSCI_FN(2)
 #define KVM_PSCI_FN_MIGRATEKVM_PSCI_FN(3)
 
-#define KVM_PSCI_RET_SUCCESS   0
-#define KVM_PSCI_RET_NI((unsigned long)-1)
-#define KVM_PSCI_RET_INVAL ((unsigned long)-2)
-#define KVM_PSCI_RET_DENIED((unsigned long)-3)
+#define KVM_PSCI_RET_SUCCESS   PSCI_RET_SUCCESS
+#define KVM_PSCI_RET_NIPSCI_RET_NOT_SUPPORTED
+#define KVM_PSCI_RET_INVAL PSCI_RET_INVALID_PARAMS
+#define KVM_PSCI_RET_DENIEDPSCI_RET_DENIED
 
 #endif /* __ARM_KVM_H__ */
diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 448f60e..8c42596c 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -59,7 +59,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
 * turned off.
 */
if (!vcpu || !vcpu->arch.pause)
-   return KVM_PSCI_RET_INVAL;
+   return PSCI_RET_INVALID_PARAMS;
 
target_pc = *vcpu_reg(source_vcpu, 2);
 
@@ -82,20 +82,60 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
wq = kvm_arch_vcpu_wq(vcpu);
wake_up_interruptible(wq);
 
-   return KVM_PSCI_RET_SUCCESS;
+   return PSCI_RET_SUCCESS;
 }
 
-/**
- * kvm_psci_call - handle PSCI call if r0 value is in range
- * @vcpu: Pointer to the VCPU struct
- *
- * Handle PSCI calls from guests through traps from HVC instructions.
- * The calling convention is similar to SMC calls to the secure world where
- * the function number is placed in r0 and this function returns true if the
- * function number specified in r0 is withing the PSCI range, and false
- * otherwise.
- */
-bool kvm_psci_call(struct kvm_vcpu *vcpu)
+int kvm_psci_version(struct kvm_vcpu *vcpu)
+{
+   if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
+   return KVM_ARM_PSCI_0_2;
+
+   return KVM_ARM_PSCI_0_1;
+}
+
+static bool kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
+{
+   unsigned long psci_fn = *vcpu_reg(vcpu, 0) & ~((u32) 0);
+   unsigned long val;
+
+   switch (psci_fn) {
+   case PSCI_0_2_FN_PSCI_VERSION:
+   /*
+* Bits[31:16] = Major Version = 0
+* Bits[15:0] = Minor Version = 2
+*/
+   val = 2;
+   break;
+   case PSCI_0_2_FN_CPU_OFF:
+   kvm_psci_vcpu_off(vcpu);
+   val = PSCI_RET_SUCCESS;
+   break;
+   case PSCI_0_2_FN_CPU_ON:
+   case PSCI_0_2_FN64_CPU_ON:

[PATCH v11 00/12] In-kernel PSCI v0.2 emulation for KVM ARM/ARM64

2014-04-28 Thread Anup Patel
Currently, KVM ARM/ARM64 only provides in-kernel emulation of Power State
and Coordination Interface (PSCI) v0.1.

This patchset aims to provide the newer PSCI v0.2 for KVM ARM/ARM64 VCPUs
such that it does not break the current KVM ARM/ARM64 ABI.

The user space tools (i.e. QEMU or KVMTOOL) will have to explicitly enable
KVM_ARM_VCPU_PSCI_0_2 feature using KVM_ARM_VCPU_INIT ioctl for providing
PSCI v0.2 to VCPUs.
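
For illustration, the opt-in step amounts to setting one bit in the features
bitmap passed to KVM_ARM_VCPU_INIT (the struct below is a mock of struct
kvm_vcpu_init; the bit number matches this series' uapi headers):

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit number as defined in this series (bit 1). */
#define KVM_ARM_VCPU_PSCI_0_2 1

/* Mock of struct kvm_vcpu_init from the arm uapi headers. */
struct mock_vcpu_init {
	uint32_t target;
	uint32_t features[7];
};

/* What QEMU/KVMTOOL would do before calling KVM_ARM_VCPU_INIT: set
 * the PSCI v0.2 feature bit in the features bitmap. */
static void enable_psci_0_2(struct mock_vcpu_init *init)
{
	init->features[KVM_ARM_VCPU_PSCI_0_2 / 32] |=
		1U << (KVM_ARM_VCPU_PSCI_0_2 % 32);
}
```

Leaving the bit clear keeps the existing PSCI v0.1 behavior, which is how
the ABI stays backward-compatible.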

Changelog:

V11:
 - Added more comments to uapi/linux/psci.h
 - Added comment about why we store INTERNAL_FAILURE in r0 (or x0)
   for SYSTEM_OFF and SYSTEM_RESET emulation

V10:
 - Updated PSCI_VERSION_xxx defines in uapi/linux/psci.h
 - Added PSCI_0_2_AFFINITY_LEVEL_ defines in uapi/linux/psci.h
 - Removed PSCI v0.1 related defines from uapi/linux/psci.h
 - Inject undefined exception for all types of errors in PSCI
   emulation (i.e. kvm_psci_call(vcpu) < 0)
 - Removed "inline" attribute of kvm_prepare_system_event()
 - Store INTERNAL_FAILURE in r0 (or x0) before exiting to userspace
 - Use MPIDR_LEVEL_BITS in AFFINITY_MASK define
 - Updated comment in kvm_psci_vcpu_suspend() as-per Marc's suggestion

V9:
 - Rename undefined PSCI_VER_xxx defines to PSCI_VERSION_xxx defines

V8:
 - Add #define for possible values of migrate type in uapi/linux/psci.h
 - Simplified psci_affinity_mask() in psci.c
 - Update comments in kvm_psci_vcpu_suspend() to indicate that for KVM
   wakeup events are interrupts.
 - Unconditionally update r0 (or x0) in kvm_psci_vcpu_on()

V7:
 - Make uapi/linux/psci.h inline with Ashwin's patch
   http://www.spinics.net/lists/arm-kernel/msg319090.html
 - Incorporate Rob's suggestions for uapi/linux/psci.h
 - Treat CPU_SUSPEND power-down request to be same as standby
   request. This further simplifies CPU_SUSPEND emulation.

V6:
 - Introduce uapi/linux/psci.h for sharing PSCI defines between
   ARM kernel, ARM64 kernel, KVM ARM/ARM64 and user space
 - Make CPU_SUSPEND emulation similar to WFI emulation

V5:
 - Have separate last patch to advertise KVM_CAP_ARM_PSCI_0_2
 - Use kvm_psci_version() in kvm_psci_vcpu_on()
 - Return ALREADY_ON for PSCI v0.2 CPU_ON if VCPU is not paused
 - Remove per-VCPU suspend context
 - As-per PSCI v0.2 spec, only current CPU can suspend itself

V4:
 - Implement all mandatory functions required by PSCI v0.2

V3:
 - Make KVM_ARM_VCPU_PSCI_0_2 feature experimental for now so that
   it fails for user space till all mandatory PSCI v0.2 functions are
   emulated by KVM ARM/ARM64
 - Have separate patch for making KVM_ARM_VCPU_PSCI_0_2 feature available
   to user space. This patch can be deferred for now

V2:
 - Don't rename PSCI return values KVM_PSCI_RET_NI and KVM_PSCI_RET_INVAL
 - Added kvm_psci_version() to get PSCI version available to VCPU
 - Fixed grammar in Documentation/virtual/kvm/api.txt

V1:
 - Initial RFC PATCH

Anup Patel (12):
  KVM: Add capability to advertise PSCI v0.2 support
  ARM/ARM64: KVM: Add common header for PSCI related defines
  ARM/ARM64: KVM: Add base for PSCI v0.2 emulation
  KVM: Documentation: Add info regarding KVM_ARM_VCPU_PSCI_0_2 feature
  ARM/ARM64: KVM: Make kvm_psci_call() return convention more flexible
  KVM: Add KVM_EXIT_SYSTEM_EVENT to user space API header
  ARM/ARM64: KVM: Emulate PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET
  ARM/ARM64: KVM: Emulate PSCI v0.2 AFFINITY_INFO
  ARM/ARM64: KVM: Emulate PSCI v0.2 MIGRATE_INFO_TYPE and related
functions
  ARM/ARM64: KVM: Fix CPU_ON emulation for PSCI v0.2
  ARM/ARM64: KVM: Emulate PSCI v0.2 CPU_SUSPEND
  ARM/ARM64: KVM: Advertise KVM_CAP_ARM_PSCI_0_2 to user space

 Documentation/virtual/kvm/api.txt |   17 +++
 arch/arm/include/asm/kvm_host.h   |2 +-
 arch/arm/include/asm/kvm_psci.h   |6 +-
 arch/arm/include/uapi/asm/kvm.h   |   10 +-
 arch/arm/kvm/arm.c|1 +
 arch/arm/kvm/handle_exit.c|   10 +-
 arch/arm/kvm/psci.c   |  235 ++---
 arch/arm64/include/asm/kvm_host.h |2 +-
 arch/arm64/include/asm/kvm_psci.h |6 +-
 arch/arm64/include/uapi/asm/kvm.h |   10 +-
 arch/arm64/kvm/handle_exit.c  |   10 +-
 include/uapi/linux/Kbuild |1 +
 include/uapi/linux/kvm.h  |9 ++
 include/uapi/linux/psci.h |   90 ++
 14 files changed, 372 insertions(+), 37 deletions(-)
 create mode 100644 include/uapi/linux/psci.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v11 01/12] KVM: Add capability to advertise PSCI v0.2 support

2014-04-28 Thread Anup Patel
User space (i.e. QEMU or KVMTOOL) should be able to check whether KVM
ARM/ARM64 supports in-kernel PSCI v0.2 emulation. For this purpose, we
define KVM_CAP_ARM_PSCI_0_2 in KVM user space interface header.

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Acked-by: Christoffer Dall 
Acked-by: Marc Zyngier 
---
 include/uapi/linux/kvm.h |1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a8f4ee5..01c5624 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -743,6 +743,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IOAPIC_POLARITY_IGNORED 97
 #define KVM_CAP_ENABLE_CAP_VM 98
 #define KVM_CAP_S390_IRQCHIP 99
+#define KVM_CAP_ARM_PSCI_0_2 100
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.9.5



[PATCH v11 02/12] ARM/ARM64: KVM: Add common header for PSCI related defines

2014-04-28 Thread Anup Patel
We need a common place to share PSCI related defines among ARM kernel,
ARM64 kernel, KVM ARM/ARM64 PSCI emulation, and user space.

We introduce uapi/linux/psci.h for this purpose. This newly added
header will be first used by KVM ARM/ARM64 in-kernel PSCI emulation
and user space (i.e. QEMU or KVMTOOL).

Signed-off-by: Anup Patel 
Signed-off-by: Pranavkumar Sawargaonkar 
Signed-off-by: Ashwin Chaugule 
---
 include/uapi/linux/Kbuild |1 +
 include/uapi/linux/psci.h |   90 +
 2 files changed, 91 insertions(+)
 create mode 100644 include/uapi/linux/psci.h

diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6929571..24e9033 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -317,6 +317,7 @@ header-y += ppp-ioctl.h
 header-y += ppp_defs.h
 header-y += pps.h
 header-y += prctl.h
+header-y += psci.h
 header-y += ptp_clock.h
 header-y += ptrace.h
 header-y += qnx4_fs.h
diff --git a/include/uapi/linux/psci.h b/include/uapi/linux/psci.h
new file mode 100644
index 0000000..310d83e
--- /dev/null
+++ b/include/uapi/linux/psci.h
@@ -0,0 +1,90 @@
+/*
+ * ARM Power State and Coordination Interface (PSCI) header
+ *
+ * This header holds common PSCI defines and macros shared
+ * by: ARM kernel, ARM64 kernel, KVM ARM/ARM64 and user space.
+ *
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Anup Patel 
+ */
+
+#ifndef _UAPI_LINUX_PSCI_H
+#define _UAPI_LINUX_PSCI_H
+
+/*
+ * PSCI v0.1 interface
+ *
+ * The PSCI v0.1 function numbers are implementation defined.
+ *
+ * Only PSCI return values such as: SUCCESS, NOT_SUPPORTED,
+ * INVALID_PARAMS, and DENIED defined below are applicable
+ * to PSCI v0.1.
+ */
+
+/* PSCI v0.2 interface */
+#define PSCI_0_2_FN_BASE   0x84000000
+#define PSCI_0_2_FN(n) (PSCI_0_2_FN_BASE + (n))
+#define PSCI_0_2_64BIT 0x40000000
+#define PSCI_0_2_FN64_BASE \
+   (PSCI_0_2_FN_BASE + PSCI_0_2_64BIT)
+#define PSCI_0_2_FN64(n)   (PSCI_0_2_FN64_BASE + (n))
+
+#define PSCI_0_2_FN_PSCI_VERSION   PSCI_0_2_FN(0)
+#define PSCI_0_2_FN_CPU_SUSPENDPSCI_0_2_FN(1)
+#define PSCI_0_2_FN_CPU_OFFPSCI_0_2_FN(2)
+#define PSCI_0_2_FN_CPU_ON PSCI_0_2_FN(3)
+#define PSCI_0_2_FN_AFFINITY_INFO  PSCI_0_2_FN(4)
+#define PSCI_0_2_FN_MIGRATEPSCI_0_2_FN(5)
+#define PSCI_0_2_FN_MIGRATE_INFO_TYPE  PSCI_0_2_FN(6)
+#define PSCI_0_2_FN_MIGRATE_INFO_UP_CPUPSCI_0_2_FN(7)
+#define PSCI_0_2_FN_SYSTEM_OFF PSCI_0_2_FN(8)
+#define PSCI_0_2_FN_SYSTEM_RESET   PSCI_0_2_FN(9)
+
+#define PSCI_0_2_FN64_CPU_SUSPEND  PSCI_0_2_FN64(1)
+#define PSCI_0_2_FN64_CPU_ON   PSCI_0_2_FN64(3)
+#define PSCI_0_2_FN64_AFFINITY_INFOPSCI_0_2_FN64(4)
+#define PSCI_0_2_FN64_MIGRATE  PSCI_0_2_FN64(5)
+#define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU  PSCI_0_2_FN64(7)
+
+/* PSCI v0.2 power state encoding for CPU_SUSPEND function */
+#define PSCI_0_2_POWER_STATE_ID_MASK   0xffff
+#define PSCI_0_2_POWER_STATE_ID_SHIFT  0
+#define PSCI_0_2_POWER_STATE_TYPE_SHIFT16
+#define PSCI_0_2_POWER_STATE_TYPE_MASK \
+   (0x1 << PSCI_0_2_POWER_STATE_TYPE_SHIFT)
+#define PSCI_0_2_POWER_STATE_AFFL_SHIFT24
+#define PSCI_0_2_POWER_STATE_AFFL_MASK \
+   (0x3 << PSCI_0_2_POWER_STATE_AFFL_SHIFT)
+
+/* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
+#define PSCI_0_2_AFFINITY_LEVEL_ON 0
+#define PSCI_0_2_AFFINITY_LEVEL_OFF1
+#define PSCI_0_2_AFFINITY_LEVEL_ON_PENDING 2
+
+/* PSCI v0.2 multicore support in Trusted OS returned by MIGRATE_INFO_TYPE */
+#define PSCI_0_2_TOS_UP_MIGRATE0
+#define PSCI_0_2_TOS_UP_NO_MIGRATE 1
+#define PSCI_0_2_TOS_MP2
+
+/* PSCI version decoding (independent of PSCI version) */
+#define PSCI_VERSION_MAJOR_SHIFT   16
+#define PSCI_VERSION_MINOR_MASK\
+   ((1U << PSCI_VERSION_MAJOR_SHIFT) - 1)
+#define PSCI_VERSION_MAJOR_MASK~PSCI_VERSION_MINOR_MASK
+#define PSCI_VERSION_MAJOR(ver)\
+   (((ver) & PSCI_VERSION_MAJOR_MASK) >> PSCI_VERSION_MAJOR_SHIFT)
+#define PSCI_VERSION_MINOR(ver)\
+   ((ver) & PSCI_VERSION_MINOR_MASK)
+
+/* PSCI return values (inclusive of all PSCI versions) */
+#define PSCI_RET_SUCCESS   0
+#define PSCI_RET_NOT_SUPPORTED -1
+#define PSCI_RET_INVALID_PARAMS-2
+#define PSCI_RET_DENIED-3
+#define PSCI_RET_ALREADY_ON-4
+#define PSCI_RET_

[PATCH v4 5/5] change update_range to handle > 4GB 2nd stage range for ARMv7

2014-04-28 Thread Mario Smarduch

This patch adds support for unmapping 2nd stage page tables for addresses >4GB
on ARMv7.

Signed-off-by: Mario Smarduch 
---
 arch/arm/kvm/mmu.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 88f5503..afbf8ba 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -176,21 +176,25 @@ static void clear_pte_entry(struct kvm *kvm, pte_t *pte, 
phys_addr_t addr)
}
 }
 
+/* Function shared between identity and 2nd stage mappings. For 2nd stage
+ * the IPA may be > 4GB on ARMv7, and page table range functions
+ * will fail. kvm_xxx_addr_end() is used to handle both cases.
+ */
 static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
-   unsigned long long start, u64 size)
+   phys_addr_t start, u64 size)
 {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
-   unsigned long long addr = start, end = start + size;
-   u64 next;
+   phys_addr_t addr = start, end = start + size;
+   phys_addr_t next;
 
while (addr < end) {
pgd = pgdp + pgd_index(addr);
pud = pud_offset(pgd, addr);
if (pud_none(*pud)) {
-   addr = pud_addr_end(addr, end);
+   addr = kvm_pud_addr_end(addr, end);
continue;
}
 
@@ -200,13 +204,13 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
 * move on.
 */
clear_pud_entry(kvm, pud, addr);
-   addr = pud_addr_end(addr, end);
+   addr = kvm_pud_addr_end(addr, end);
continue;
}
 
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   addr = pmd_addr_end(addr, end);
+   addr = kvm_pmd_addr_end(addr, end);
continue;
}
 
@@ -221,10 +225,10 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
 */
if (kvm_pmd_huge(*pmd) || page_empty(pte)) {
clear_pmd_entry(kvm, pmd, addr);
-   next = pmd_addr_end(addr, end);
+   next = kvm_pmd_addr_end(addr, end);
if (page_empty(pmd) && !page_empty(pud)) {
clear_pud_entry(kvm, pud, addr);
-   next = pud_addr_end(addr, end);
+   next = kvm_pud_addr_end(addr, end);
}
}
 
-- 
1.7.9.5





[PATCH v4 4/5] add 2nd stage page fault handling during live migration

2014-04-28 Thread Mario Smarduch
This patch adds support for handling 2nd stage page faults during migration;
it disables faulting in huge pages and splits up existing huge pages.


Signed-off-by: Mario Smarduch 
---
 arch/arm/kvm/mmu.c |   31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 3442594..88f5503 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -978,6 +978,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
struct vm_area_struct *vma;
pfn_t pfn;
+   bool migration_active;
 
write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
if (fault_status == FSC_PERM && !write_fault) {
@@ -1029,12 +1030,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
 
spin_lock(&kvm->mmu_lock);
+   /* Place inside the lock to prevent a race condition while the whole
+* VM is being write protected. Prevents a race with huge page install
+* when migration is active.
+*/
+   migration_active = vcpu->kvm->arch.migration_in_progress;
+
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
-   if (!hugetlb && !force_pte)
+
+   /* During migration don't rebuild huge pages */
+   if (!hugetlb && !force_pte && !migration_active)
hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
-   if (hugetlb) {
+   /* During migration don't install new huge pages */
+   if (hugetlb && !migration_active) {
pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
new_pmd = pmd_mkhuge(new_pmd);
if (writable) {
@@ -1046,6 +1056,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
} else {
pte_t new_pte = pfn_pte(pfn, PAGE_S2);
if (writable) {
+   /* First, convert the huge page pfn to a normal 4k page pfn
+* while migration is in progress.
+* Second, in migration mode, in the rare case where
+* splitting of huge pages fails, check if the pmd is
+* mapping a huge page; if it is, clear it so that
+* stage2_set_pte() can map in a small page.
+*/
+   if (migration_active && hugetlb) {
+   pmd_t *pmd;
+   pfn += pte_index(fault_ipa);
+   new_pte = pfn_pte(pfn, PAGE_S2);
+   pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
+   if (pmd && kvm_pmd_huge(*pmd))
+   clear_pmd_entry(kvm, pmd, fault_ipa);
+   }
kvm_set_s2pte_writable(&new_pte);
kvm_set_pfn_dirty(pfn);
}
@@ -1053,6 +1078,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
}
 
+   if (writable)
+   mark_page_dirty(kvm, gfn);
 
 out_unlock:
spin_unlock(&kvm->mmu_lock);
-- 
1.7.9.5



[PATCH v4 0/5] live migration dirty bitmap support for ARMv7

2014-04-28 Thread Mario Smarduch

Hi,
 this is the fourth iteration of live migration support, for the time being
tested on ARMv7. The patches depend on Eric Auger's patch for memory regions.

- Tested on two 4-way A15 systems, 2-way/4-way SMP guest upto 2GB memory
- Various dirty data rates tested - 2GB/1s ... 2048 pgs/5ms
- validated source/destination memory image integrity
- Issue: time skips a few seconds on the destination; the timekeeper offset
  from the last cycle appears too big, need to investigate further.

Changes since v3:
- changed pte updates to reset write bit instead of setting default 
  value for existing pte's - Steve's comment 
- In addition to PUD add 2nd stage >4GB range functions - Steve's
  suggestion
- Restructured initial memory slot write protect function for PGD, PUD, PMD
  table walking - Steve's suggestion
- Renamed variable types to resemble their use - Steve's suggestions
- Added a couple of pte helpers for 2nd stage tables - Steve's suggestion
- Updated unmap_range() that handles 2nd stage tables and identity mappings
  to handle 2nd stage addresses >4GB. Left ARMv8 unchanged.

Changes since v2:
- move initial VM write protect to memory region architecture prepare function
  (needed to make dirty logging function generic) 
- added stage2_mark_pte_ro() - to mark ptes ro - Marc's comment
- optimized initial VM memory region write protect to do fewer table lookups -
  applied Marc's comment for walking dirty bitmap mask
- added pud_addr_end() for stage2 tables, to make the walk 4-level
- added kvm_flush_remote_tlbs() to use ARM TLB invalidation, made the generic
  one weak, per Marc's comment for the generic dirty bitmap log function
- optimized walking dirty bit map mask to skip upper tables - Marc's comment
- deleted x86/arm kvm_vm_ioctl_get_dirty_log(), moved to kvm_main.c and tagged
  the function weak - Marc's comment
- changed Data Abort handler pte index handling - Marc's comment


Mario Smarduch (5):
  add ARMv7 HYP API to flush VM TLBs without address param
  live migration support for initial write protect of VM
  live migration support for VM dirty log management
  add 2nd stage page fault handling during live migration
  change update_range to handle > 4GB 2nd stage range for ARMv7

 arch/arm/include/asm/kvm_asm.h  |1 +
 arch/arm/include/asm/kvm_host.h |   13 ++
 arch/arm/include/asm/kvm_mmu.h  |   11 ++
 arch/arm/kvm/arm.c  |8 +-
 arch/arm/kvm/interrupts.S   |5 +
 arch/arm/kvm/mmu.c  |  377 +--
 arch/x86/kvm/x86.c  |   78 
 virt/kvm/kvm_main.c |   89 -
 8 files changed, 488 insertions(+), 94 deletions(-)

-- 
1.7.9.5



[PATCH v4 3/5] live migration support for VM dirty log management

2014-04-28 Thread Mario Smarduch

This patch adds support for keeping track of VM dirty pages by updating the
per-memslot dirty bitmap and write protecting the pages again.


Signed-off-by: Mario Smarduch 
---
 arch/arm/include/asm/kvm_host.h |3 ++
 arch/arm/kvm/arm.c  |5 --
 arch/arm/kvm/mmu.c  |  101 +++
 arch/x86/kvm/x86.c  |   78 --
 virt/kvm/kvm_main.c |   84 
 5 files changed, 188 insertions(+), 83 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 9f827c8..c5c27d8 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -237,5 +237,8 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 
value);
 void kvm_tlb_flush_vmid(struct kvm *kvm);
 
 int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask);
 
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index b916478..6ca3e84 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -784,11 +784,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
}
 }
 
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
-   return -EINVAL;
-}
-
 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
struct kvm_arm_device_addr *dev_addr)
 {
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 15bbca2..3442594 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -864,6 +864,107 @@ out:
return ret;
 }
 
+
+/**
+ * kvm_mmu_write_protect_pt_masked - after the migration thread write protects
+ *  the entire VM address space, iterative calls are made to get dirty pages
+ *  as the VM pages are being migrated. New dirty pages may be a subset
+ *  of the initially write-protected VM or new writes faulted in. Here we
+ *  write protect new dirty pages again in preparation for the next dirty
+ *  log read. This function is called as a result of the KVM_GET_DIRTY_LOG
+ *  ioctl, to determine what pages need to be migrated.
+ *  'kvm->mmu_lock' must be held to protect against concurrent modification
+ *  of page tables (2nd stage fault, mmu modifiers, ...)
+ *
+ * @kvm:The KVM pointer
+ * @slot:   The memory slot the dirty log is retrieved for
+ * @gfn_offset: The gfn offset in memory slot
+ * @mask:   The mask of dirty pages at offset 'gfn_offset' in this memory
+ *  slot to be write protected
+ */
+
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask)
+{
+   phys_addr_t ipa, next, offset_ipa;
+   pgd_t *pgdp = kvm->arch.pgd, *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   gfn_t gfnofst = slot->base_gfn + gfn_offset;
+   bool crosses_pmd;
+
+   ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
+   offset_ipa  = gfnofst << PAGE_SHIFT;
+   next = (gfnofst + (BITS_PER_LONG - 1)) << PAGE_SHIFT;
+
+   /* Check if the mask width crosses a 2nd level page table range, and
+* possibly 3rd, 4th. If not, skip upper table lookups. Unlikely to
+* be true: machine memory regions tend to start on at least a PMD
+* boundary, and mask is a power of 2.
+*/
+   crosses_pmd = ((offset_ipa & PMD_MASK) ^ (next & PMD_MASK)) ? true :
+   false;
+
+   /* If pgd, pud, or pmd is not present and we cross the pmd range, check
+* the next index. Unlikely that pgd and pud would not be present.
+* Between dirty page marking and now, page tables may have been altered.
+*/
+   pgd = pgdp + pgd_index(ipa);
+   if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
+   pgd = pgdp + pgd_index(next);
+   if (!pgd_present(*pgd))
+   return;
+   }
+
+   pud = pud_offset(pgd, ipa);
+   if (unlikely(crosses_pmd && !pud_present(*pud))) {
+   pud = pud_offset(pgd, next);
+   if (!pud_present(*pud))
+   return;
+   }
+
+   pmd = pmd_offset(pud, ipa);
+   if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
+   pmd = pmd_offset(pud, next);
+   if (!pmd_present(*pmd))
+   return;
+   }
+
+   for (;;) {
+   pte = pte_offset_kernel(pmd, ipa);
+   if (!pte_present(*pte))
+   goto next_ipa;
+
+   if (kvm_s2pte_readonly(pte))
+   goto next_ipa;
+   kvm_set_s2pte_readonly(pte);
+next_ipa:
+   mask &= mask - 1;
+   if (!mask)
+   break;
+
+   /* find next page */
+   ipa = (gfnofst + __f

[PATCH v4 2/5] live migration support for initial write protect of VM

2014-04-28 Thread Mario Smarduch
This patch adds support for the initial live migration split-up of huge pages
in a memory slot and write protection of all pages in the memory slot.


Signed-off-by: Mario Smarduch 
---
 arch/arm/include/asm/kvm_host.h |8 ++
 arch/arm/include/asm/kvm_mmu.h  |   11 ++
 arch/arm/kvm/arm.c  |3 +
 arch/arm/kvm/mmu.c  |  215 +++
 virt/kvm/kvm_main.c |5 +-
 5 files changed, 241 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 1e739f9..9f827c8 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -67,6 +67,12 @@ struct kvm_arch {
 
/* Interrupt controller */
struct vgic_distvgic;
+
+   /* Marks the start of migration; used to handle 2nd stage page faults
+* during migration, preventing installation of huge pages and
+* splitting huge pages into small pages.
+*/
+   int migration_in_progress;
 };
 
 #define KVM_NR_MEM_OBJS 40
@@ -230,4 +236,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 
value);
 
 void kvm_tlb_flush_vmid(struct kvm *kvm);
 
+int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index a91c863..342ae81 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -111,6 +111,17 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
pte_val(*pte) |= L_PTE_S2_RDWR;
 }
 
+static inline void kvm_set_s2pte_readonly(pte_t *pte)
+{
+   pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
+}
+
+static inline bool kvm_s2pte_readonly(pte_t *pte)
+{
+   return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
+}
+
+
 static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
 {
pmd_val(*pmd) |= L_PMD_S2_RDWR;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 9a4bc10..b916478 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -233,6 +233,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_userspace_memory_region *mem,
   enum kvm_mr_change change)
 {
+   /* Request for migration issued by user, write protect memory slot */
+   if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
+   return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
return 0;
 }
 
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7ab77f3..15bbca2 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -44,6 +44,41 @@ static phys_addr_t hyp_idmap_vector;
 
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 
+/* Used for 2nd stage and identity mappings. For stage 2 mappings
+ * instead of unsigned long, u64 is used, which won't overflow on ARMv7 for
+ * IPAs above 4GB. For ARMv8 use default functions.
+ */
+
+static phys_addr_t kvm_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+#if BITS_PER_LONG == 32
+   u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;
+   return __boundary - 1 < end - 1 ? __boundary : end;
+#else
+   return pgd_addr_end(addr, end);
+#endif
+}
+
+static phys_addr_t kvm_pud_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+#if BITS_PER_LONG == 32
+   u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK;
+   return __boundary - 1 < end - 1 ? __boundary : end;
+#else
+   return pud_addr_end(addr, end);
+#endif
+}
+
+static phys_addr_t kvm_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+#if BITS_PER_LONG == 32
+   u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK;
+   return __boundary - 1 < end - 1 ? __boundary : end;
+#else
+   return pmd_addr_end(addr, end);
+#endif
+}
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
/*
@@ -649,6 +684,186 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, 
phys_addr_t *ipap)
return false;
 }
 
+/**
+ * kvm_split_pmd - splits huge pages into small pages, required to keep a
+ * dirty log at smaller memory granules; otherwise whole huge pages would
+ * need to be migrated. In practice, an idle system has problems migrating
+ * with huge pages. Called during write protection of the entire VM address
+ * space, done initially when the migration thread issues the
+ * KVM_MEM_LOG_DIRTY_PAGES ioctl.
+ * The mmu_lock is held during splitting.
+ *
+ * @kvm:The KVM pointer
+ * @pmd:Pmd to 2nd stage huge page
+ * @addr:   Guest Physical Address
+ */
+static int kvm_split_pmd(struct kvm *kvm, pmd_t *pmd, u64 addr)
+{
+   struct page *page;
+   pfn_t pfn = pmd_pfn(*pmd);
+   pte_t *pte;
+   int i;
+
+   page = alloc_page(GFP_KERNEL);
+   if (page == NULL)
+   return -ENOMEM;
+
+   pte = page_address(page);
+   /* cycle through ptes first, use pmd pfn */
+   for (i = 0; i < PTRS_PER_PMD; i++)
+

KVM: x86: expose invariant tsc cpuid bit (v2)

2014-04-28 Thread Marcelo Tosatti

Invariant TSC is a property of the TSC; no additional
support code is necessary.

Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..333b88d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -495,6 +495,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
entry->ecx &= kvm_supported_word6_x86_features;
cpuid_mask(&entry->ecx, 6);
break;
+   case 0x80000007: /* Advanced power management */
+   /* invariant TSC is CPUID.80000007H:EDX[8] */
+   entry->edx &= (1 << 8);
+   /* mask against host */
+   entry->edx &= boot_cpu_data.x86_power;
+   entry->eax = entry->ebx = entry->ecx = 0;
+   break;
case 0x80000008: {
unsigned g_phys_as = (entry->eax >> 16) & 0xff;
unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U);
@@ -525,7 +532,6 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
case 3: /* Processor serial number */
case 5: /* MONITOR/MWAIT */
case 6: /* Thermal management */
-   case 0x80000007: /* Advanced power management */
case 0xC0000002:
case 0xC0000003:
case 0xC0000004:


Re: [PATCH] savevm: check vmsd for migratability status

2014-04-28 Thread Eduardo Habkost
On Fri, Apr 25, 2014 at 07:47:09PM -0300, Marcelo Tosatti wrote:
> 
> Check vmsd for unmigratable field, allowing migratibility status
> to be modified after vmstate_register.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> diff --git a/savevm.c b/savevm.c
> index 22123be..61a25c0 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -452,7 +452,7 @@ bool qemu_savevm_state_blocked(Error **errp)
>  SaveStateEntry *se;
>  
>  QTAILQ_FOREACH(se, &savevm_handlers, entry) {
> -if (se->no_migrate) {
> +if (se->no_migrate || (se->vmsd && se->vmsd->unmigratable)) {

The only place where se->no_migrate is set to non-zero is when a vmsd is
provided. What about just removing the field and using
(se->vmsd && se->vmsd->unmigratable) only?

-- 
Eduardo


Re: [RFC PATCH v5 02/11] ARM SMMU: Add capability IOMMU_CAP_DMA_EXEC

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 20:37 +0100, Will Deacon wrote:
> On Mon, Apr 28, 2014 at 04:52:42PM +0100, Antonios Motakis wrote:
> > The ARM SMMU can take an IOMMU_EXEC protection flag in addition to
> > IOMMU_READ and IOMMU_WRITE. Expose this as an IOMMU capability.
> 
> The other way of handling this would be to negate the capability and
> advertise a NOEXEC cap instead. That would need the IOMMU_EXEC flag to
> become IOMMU_NOEXEC and the ARM SMMU driver updating accordingly, but it
> might make more sense if people don't object to mixing positive and negative
> logic in the IOMMU_* flags.
> 
> Any thoughts?

A benefit of doing that would be that the flag becomes enforceable.  As
written in this draft, if a user does not specify EXEC, the mapping may
or may not be executable, depending on the IOMMU capability (assuming
that if EXEC is not supported that it follows READ).  If the flag
changes to NOEXEC, then all the domains in the container should support
it or else the mapping should fail.  We could also avoid the test in
vfio code when doing a mapping and just let the IOMMU driver fail the
map if NOEXEC is unsupported.  Thanks,

Alex



Re: [RFC PATCH v5 03/11] VFIO_IOMMU_TYPE1 for platform bus devices on ARM

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 20:19 +0100, Will Deacon wrote:
> Hi Alex,
> 
> On Mon, Apr 28, 2014 at 05:43:41PM +0100, Alex Williamson wrote:
> > On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> > > This allows to make use of the VFIO_IOMMU_TYPE1 driver with platform
> > > devices on ARM in addition to PCI. This is required in order to use the
> > > Exynos SMMU, or ARM SMMU driver with VFIO_IOMMU_TYPE1.
> 
> [...]
> 
> > > @@ -721,13 +722,15 @@ static int vfio_iommu_type1_attach_group(void 
> > > *iommu_data,
> > >   INIT_LIST_HEAD(&domain->group_list);
> > >   list_add(&group->next, &domain->group_list);
> > >  
> > > - if (!allow_unsafe_interrupts &&
> > > +#ifdef CONFIG_PCI
> > > + if (bus == &pci_bus_type && !allow_unsafe_interrupts &&
> > >   !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > >   pr_warn("%s: No interrupt remapping support.  Use the module 
> > > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this 
> > > platform\n",
> > >  __func__);
> > >   ret = -EPERM;
> > >   goto out_detach;
> > >   }
> > > +#endif
> > >  
> > >   if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
> > >   domain->prot |= IOMMU_CACHE;
> > 
> > This is not a PCI specific requirement.  Anything that can support MSI
> > needs an IOMMU that can provide isolation for both DMA and interrupts.
> > I think the IOMMU should still be telling us that it has this feature.
> 
> Please excuse any ignorance on part here (I'm not at all familiar with the
> Intel IOMMU), but shouldn't this really be a property of the interrupt
> controller itself? On ARM with GICv3, there is a separate block called the
> ITS (interrupt translation service) which is part of the interrupt
> controller. The ITS provides a doorbell page which the SMMU can map into a
> guest operating system to provide MSI for passthrough devices, but this
> isn't something the SMMU is aware of -- it will just see the iommu_map
> request for a non-cacheable mapping.

Hi Will,

I don't know the history of why this is an IOMMU domain capability on
x86, it's sort of a paradox.  An MSI from a device is conceptually just
a DMA write and is therefore logically co-located in the IOMMU hardware,
but x86 doesn't allow it to be mapped via the IOMMU API interfaces.  For
compatibility, interrupt remapping support is buried deep in the
request_irq interface and effectively invisible other than having this
path to query it.  Therefore this flag is effectively just saying "MSI
isolation support is present and enabled".  IOW, the host is protected
from interrupt injection attacks from malicious devices.  If there is
some property of your platform that makes this always the case, then the
IOMMU driver can always export this capability as true.

With PCI, MSI is configured via spec defined configuration space
registers, so we emulate these registers and prevent user access to them
so that we don't need to allow the user a way to setup an interrupt
remapping entry.  It's done for them via request_irq.

IIRC, the Freescale devices have a limited number of MSI pages and can
therefore create some instances with isolation while others may require
sharing.  In that case I would expect this flag to indicate whether the
domain has an exclusive or shared page.

In any case, I suspect keying on the bus_type here is not the correct
way to go.  Thanks,

Alex



Re: [RFC PATCH v5 07/11] VFIO_PLATFORM: Read and write support for the device fd

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> VFIO returns a file descriptor which we can use to manipulate the memory
> regions of the device. Since some memory regions cannot be mmapped due to
> security concerns, we also allow reading from and writing to this file
> descriptor directly.

If there are regions we cannot mmap for security, why do we provide full
read/write access to them?

> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/platform/vfio_platform.c | 110 
> +-
>  1 file changed, 107 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/platform/vfio_platform.c 
> b/drivers/vfio/platform/vfio_platform.c
> index 5430cbe..7c01ced 100644
> --- a/drivers/vfio/platform/vfio_platform.c
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -55,7 +55,8 @@ static int vfio_platform_regions_init(struct 
> vfio_platform_device *vdev)
>  
>   region.addr = res->start;
>   region.size = resource_size(res);
> - region.flags = 0;
> + region.flags = VFIO_REGION_INFO_FLAG_READ
> + | VFIO_REGION_INFO_FLAG_WRITE;
>  
>   vdev->region[i] = region;
>   }
> @@ -153,13 +154,116 @@ static long vfio_platform_ioctl(void *device_data,
>  static ssize_t vfio_platform_read(void *device_data, char __user *buf,
>size_t count, loff_t *ppos)
>  {
> - return 0;
> + struct vfio_platform_device *vdev = device_data;
> + unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos);
> + loff_t off = *ppos & VFIO_PLATFORM_OFFSET_MASK;
> + unsigned int done = 0;
> + void __iomem *io;
> +
> + if (index >= vdev->num_regions)
> + return -EINVAL;
> +
> + io = ioremap_nocache(vdev->region[index].addr,
> +  vdev->region[index].size);

I haven't looked at ioremap on arm, but if it's remotely non-trivial,
it's probably a good idea to do this only on first access, cache it, and
unmap it on region cleanup.  Thanks,

Alex

> +
> + while (count) {
> + size_t filled;
> +
> + if (count >= 4 && !(off % 4)) {
> + u32 val;
> +
> + val = ioread32(io + off);
> + if (copy_to_user(buf, &val, 4))
> + goto err;
> +
> + filled = 4;
> + } else if (count >= 2 && !(off % 2)) {
> + u16 val;
> +
> + val = ioread16(io + off);
> + if (copy_to_user(buf, &val, 2))
> + goto err;
> +
> + filled = 2;
> + } else {
> + u8 val;
> +
> + val = ioread8(io + off);
> + if (copy_to_user(buf, &val, 1))
> + goto err;
> +
> + filled = 1;
> + }
> +
> +
> + count -= filled;
> + done += filled;
> + off += filled;
> + buf += filled;
> + }
> +
> + iounmap(io);
> + return done;
> +err:
> + iounmap(io);
> + return -EFAULT;
>  }
>  
>  static ssize_t vfio_platform_write(void *device_data, const char __user *buf,
> size_t count, loff_t *ppos)
>  {
> - return 0;
> + struct vfio_platform_device *vdev = device_data;
> + unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos);
> + loff_t off = *ppos & VFIO_PLATFORM_OFFSET_MASK;
> + unsigned int done = 0;
> + void __iomem *io;
> +
> + if (index >= vdev->num_regions)
> + return -EINVAL;
> +
> + io = ioremap_nocache(vdev->region[index].addr,
> +  vdev->region[index].size);
> +
> + while (count) {
> + size_t filled;
> +
> + if (count >= 4 && !(off % 4)) {
> + u32 val;
> +
> + if (copy_from_user(&val, buf, 4))
> + goto err;
> + iowrite32(val, io + off);
> +
> + filled = 4;
> + } else if (count >= 2 && !(off % 2)) {
> + u16 val;
> +
> + if (copy_from_user(&val, buf, 2))
> + goto err;
> + iowrite16(val, io + off);
> +
> + filled = 2;
> + } else {
> + u8 val;
> +
> + if (copy_from_user(&val, buf, 1))
> + goto err;
> + iowrite8(val, io + off);
> +
> + filled = 1;
> + }
> +
> + count -= filled;
> + done += filled;
> + off += filled;
> + buf += filled;
> + }
> +
> + iounmap(io);
> + return done;
> +err:
> + iounmap(io);
> + return -EFAULT;
>  }
>  
>  static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)



--

Re: [RFC PATCH v5 11/11] VFIO_PLATFORM: Support for maskable and automasked interrupts

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> Adds support for masking interrupts, and also for automasked interrupts.
> Level-sensitive interrupts are exposed as automasked interrupts and
> are masked and disabled automatically when they fire.
> 
> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/platform/vfio_platform_irq.c | 117 
> --
>  drivers/vfio/platform/vfio_platform_private.h |   2 +
>  2 files changed, 113 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
> b/drivers/vfio/platform/vfio_platform_irq.c
> index 433edc1..e38982f 100644
> --- a/drivers/vfio/platform/vfio_platform_irq.c
> +++ b/drivers/vfio/platform/vfio_platform_irq.c
> @@ -52,9 +52,16 @@ int vfio_platform_irq_init(struct vfio_platform_device 
> *vdev)
>   struct vfio_platform_irq irq;
>   int hwirq = platform_get_irq(vdev->pdev, i);
>  
> - irq.flags = VFIO_IRQ_INFO_EVENTFD;
> + spin_lock_init(&irq.lock);
> +
> + irq.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_MASKABLE;
> +
> + if (irq_get_trigger_type(hwirq) & IRQ_TYPE_LEVEL_MASK)
> + irq.flags |= VFIO_IRQ_INFO_AUTOMASKED;
> +
>   irq.count = 1;
>   irq.hwirq = hwirq;
> + irq.masked = false;
>  
>   vdev->irq[i] = irq;
>   }
> @@ -66,19 +73,39 @@ void vfio_platform_irq_cleanup(struct 
> vfio_platform_device *vdev)
>  {
>   int i;
>  
> - for (i = 0; i < vdev->num_irqs; i++)
> + for (i = 0; i < vdev->num_irqs; i++) {
>   vfio_set_trigger(vdev, i, -1);
>  
> + if (vdev->irq[i].masked)
> + enable_irq(vdev->irq[i].hwirq);

This looks suspicious.  set_trigger(,, -1) calls free_irq() and here we
enable_irq().  Shouldn't the next user's call to request_irq() be
sufficient to re-enable it?  Thanks,

Alex

> + }
> +
>   kfree(vdev->irq);
>  }
>  
>  static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
>  {
> - struct eventfd_ctx *trigger = dev_id;
> + struct vfio_platform_irq *irq_ctx = dev_id;
> + unsigned long flags;
> + int ret = IRQ_NONE;
> +
> + spin_lock_irqsave(&irq_ctx->lock, flags);
> +
> + if (!irq_ctx->masked) {
> + ret = IRQ_HANDLED;
> +
> + if (irq_ctx->flags & VFIO_IRQ_INFO_AUTOMASKED) {
> + disable_irq_nosync(irq_ctx->hwirq);
> + irq_ctx->masked = true;
> + }
> + }
>  
> - eventfd_signal(trigger, 1);
> + spin_unlock_irqrestore(&irq_ctx->lock, flags);
>  
> - return IRQ_HANDLED;
> + if (ret == IRQ_HANDLED)
> + eventfd_signal(irq_ctx->trigger, 1);
> +
> + return ret;
>  }
>  
>  static int vfio_set_trigger(struct vfio_platform_device *vdev,
> @@ -159,6 +186,82 @@ static int vfio_platform_set_irq_trigger(struct 
> vfio_platform_device *vdev,
>   return -EFAULT;
>  }
>  
> +static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
> + unsigned index, unsigned start,
> + unsigned count, uint32_t flags, void *data)
> +{
> + uint8_t arr;
> +
> + if (start != 0 || count != 1)
> + return -EINVAL;
> +
> + switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
> + case VFIO_IRQ_SET_DATA_BOOL:
> + if (copy_from_user(&arr, data, sizeof(uint8_t)))
> + return -EFAULT;
> +
> + if (arr != 0x1)
> + return -EINVAL;
> +
> + case VFIO_IRQ_SET_DATA_NONE:
> +
> + spin_lock_irq(&vdev->irq[index].lock);
> +
> + if (vdev->irq[index].masked) {
> + enable_irq(vdev->irq[index].hwirq);
> + vdev->irq[index].masked = false;
> + }
> +
> + spin_unlock_irq(&vdev->irq[index].lock);
> +
> + return 0;
> +
> + case VFIO_IRQ_SET_DATA_EVENTFD: /* XXX not implemented yet */
> + default:
> + return -ENOTTY;
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
> + unsigned index, unsigned start,
> + unsigned count, uint32_t flags, void *data)
> +{
> + uint8_t arr;
> +
> + if (start != 0 || count != 1)
> + return -EINVAL;
> +
> + switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
> + case VFIO_IRQ_SET_DATA_BOOL:
> + if (copy_from_user(&arr, data, sizeof(uint8_t)))
> + return -EFAULT;
> +
> + if (arr != 0x1)
> + return -EINVAL;
> +
> + case VFIO_IRQ_SET_DATA_NONE:
> +
> + spin_lock_irq(&vdev->irq[index].lock);
> +
> + if (!vdev->irq[index].masked) {
> + disable_irq(vdev->irq[index].hwirq);
> + vdev->irq[index].masked = true;

Re: [RFC PATCH v5 08/11] VFIO_PLATFORM: Support MMAP of MMIO regions

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> Allow memory mapping of the MMIO regions of the device so userspace can
> directly access them.
> 
> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/platform/vfio_platform.c | 40 
> ++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/platform/vfio_platform.c 
> b/drivers/vfio/platform/vfio_platform.c
> index 7c01ced..37beff3 100644
> --- a/drivers/vfio/platform/vfio_platform.c
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -58,6 +58,11 @@ static int vfio_platform_regions_init(struct 
> vfio_platform_device *vdev)
>   region.flags = VFIO_REGION_INFO_FLAG_READ
>   | VFIO_REGION_INFO_FLAG_WRITE;
>  
> + /* Only regions addressed with PAGE granularity can be MMAPed
> +  * securely. */
> + if (!(region.addr & ~PAGE_MASK) && !(region.size & ~PAGE_MASK))
> + region.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> +

Ok, so the security is only that we need page alignment.  Maybe that
could be made more clear in the previous patch.  PCI has regions like
the MSI-X table that we need to protect from userspace access beyond
page alignment requirements.  Thanks,

Alex

>   vdev->region[i] = region;
>   }
>  
> @@ -268,7 +273,40 @@ err:
>  
>  static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)
>  {
> - return -EINVAL;
> + struct vfio_platform_device *vdev = device_data;
> + unsigned int index;
> + u64 req_len, pgoff, req_start;
> + struct vfio_platform_region region;
> +
> + index = vma->vm_pgoff >> (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT);
> +
> + if (vma->vm_end < vma->vm_start)
> + return -EINVAL;
> + if ((vma->vm_flags & VM_SHARED) == 0)
> + return -EINVAL;
> + if (index >= vdev->num_regions)
> + return -EINVAL;
> + if (vma->vm_start & ~PAGE_MASK)
> + return -EINVAL;
> + if (vma->vm_end & ~PAGE_MASK)
> + return -EINVAL;
> +
> + region = vdev->region[index];
> +
> + req_len = vma->vm_end - vma->vm_start;
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> + req_start = pgoff << PAGE_SHIFT;
> +
> + if (region.size < PAGE_SIZE || req_start + req_len > region.size)
> + return -EINVAL;
> +
> + vma->vm_private_data = vdev;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> +
> + return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> +req_len, vma->vm_page_prot);
>  }
>  
>  static const struct vfio_device_ops vfio_platform_ops = {





Re: [RFC PATCH v5 02/11] ARM SMMU: Add capability IOMMU_CAP_DMA_EXEC

2014-04-28 Thread Will Deacon
On Mon, Apr 28, 2014 at 04:52:42PM +0100, Antonios Motakis wrote:
> The ARM SMMU can take an IOMMU_EXEC protection flag in addition to
> IOMMU_READ and IOMMU_WRITE. Expose this as an IOMMU capability.

The other way of handling this would be to negate the capability and
advertise a NOEXEC cap instead. That would need the IOMMU_EXEC flag to
become IOMMU_NOEXEC and the ARM SMMU driver updating accordingly, but it
might make more sense if people don't object to mixing positive and negative
logic in the IOMMU_* flags.

Any thoughts?

Will


Re: [patch 2/2] target-i386: block migration and savevm if invariant tsc is exposed

2014-04-28 Thread Eduardo Habkost
On Mon, Apr 28, 2014 at 09:06:42PM +0200, Paolo Bonzini wrote:
> Il 28/04/2014 17:46, Eduardo Habkost ha scritto:
> >>So indeed I couldn't think of uses of "-cpu host" together with
> >>migration.  But if you're sure there is none, block it in QEMU.
> >>There's no reason why this should be specific to libvirt.
> >
> >It's not that there are no useful use cases, it is that we simply can't
> >guarantee it will work because (by design) "-cpu host" enables features
> >QEMU doesn't know about (so it doesn't know if migration is safe or
> >not). And that's the main use case for "-cpu host".
> 
> True, but in practice QEMU and KVM support is added in parallel, and
> we already have full support for Broadwell (IIRC) and are starting to
> add Skylake features.
> 
> >So, what about doing the following on QEMU 2.1:
> >
> > * "-cpu host,migratable=yes":
> >   * No invtsc
> >   * Migration not blocked
> >   * No feature flag unknown to QEMU will be enabled
> > * "-cpu host,migratable=no":
> >   * invtsc enabled
> >   * Unknown-to-QEMU features enabled
> >   * Migration blocked
> > * "-cpu host":
> >   * Same behavior as "-cpu host,migratable=yes"
> > * Release notes indicating that "-cpu host,migratable=no" is
> >   required if the user doesn't care about migration and wants new
> >   (unknown to QEMU) features to be available
> >
> >I was going to propose making "migratable=no" the default after a few
> >releases, as I expect the majority of existing users of "-cpu host" to
> >not care about migration. But I am not sure, because there's less harm
> >in _not_ getting new bleeding edge features enabled, and those users
> >(and management software) can start using "migratable=no" if they really
> >want the new unmigratable features.
> 
> Makes sense.  Basically "-cpu host,migratable=yes" is close to
> libvirt's host-model and Alex Graf's proposed "-cpu best".  Should we
> call it "-cpu best" and drop migratability of "-cpu host"?

"-cpu best" is different from the modes above. It means "use the best
existing CPU model (from the pre-defined table) that can run on this
host".

And it would have the same ambiguities we found on "-cpu host": if a CPU
model in the table has invtsc enabled, should it be considered a
candidate for "-cpu best", or not?

-- 
Eduardo


Re: [RFC PATCH v5 09/11] VFIO_PLATFORM: Return IRQ info

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> Return information for the interrupts exposed by the device.
> This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
> and enables VFIO_DEVICE_GET_IRQ_INFO
> 
> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/platform/Makefile|  2 +-
>  drivers/vfio/platform/vfio_platform.c | 35 +--
>  drivers/vfio/platform/vfio_platform_irq.c | 63 
> +++
>  drivers/vfio/platform/vfio_platform_private.h | 11 +
>  4 files changed, 106 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/vfio/platform/vfio_platform_irq.c
> 
> diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
> index df3a014..2c53327 100644
> --- a/drivers/vfio/platform/Makefile
> +++ b/drivers/vfio/platform/Makefile
> @@ -1,4 +1,4 @@
>  
> -vfio-platform-y := vfio_platform.o
> +vfio-platform-y := vfio_platform.o vfio_platform_irq.o
>  
>  obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
> diff --git a/drivers/vfio/platform/vfio_platform.c 
> b/drivers/vfio/platform/vfio_platform.c
> index 37beff3..2e16595 100644
> --- a/drivers/vfio/platform/vfio_platform.c
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -79,6 +79,7 @@ static void vfio_platform_release(void *device_data)
>   struct vfio_platform_device *vdev = device_data;
>  
>   vfio_platform_regions_cleanup(vdev);
> + vfio_platform_irq_cleanup(vdev);
>  
>   module_put(THIS_MODULE);
>  }
> @@ -92,12 +93,22 @@ static int vfio_platform_open(void *device_data)
>   if (ret)
>   return ret;
>  
> + ret = vfio_platform_irq_init(vdev);
> + if (ret)
> + goto err_irq;
> +
>   if (!try_module_get(THIS_MODULE)) {
> - vfio_platform_regions_cleanup(vdev);
> - return -ENODEV;
> + ret = -ENODEV;
> + goto err_mod;
>   }


Same comments from REGION_INFO patch apply here.  Ordering above,
paranoia about keeping num_irqs consistent with allocation, nit about
local variable use in initialization loop.  Thanks,

Alex
>  
>   return 0;
> +
> +err_mod:
> + vfio_platform_irq_cleanup(vdev);
> +err_irq:
> + vfio_platform_regions_cleanup(vdev);
> + return ret;
>  }
>  
>  static long vfio_platform_ioctl(void *device_data,
> @@ -119,7 +130,7 @@ static long vfio_platform_ioctl(void *device_data,
>  
>   info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
>   info.num_regions = vdev->num_regions;
> - info.num_irqs = 0;
> + info.num_irqs = vdev->num_irqs;
>  
>   return copy_to_user((void __user *)arg, &info, minsz);
>  
> @@ -145,7 +156,23 @@ static long vfio_platform_ioctl(void *device_data,
>   return copy_to_user((void __user *)arg, &info, minsz);
>  
>   } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
> - return -EINVAL;
> + struct vfio_irq_info info;
> +
> + minsz = offsetofend(struct vfio_irq_info, count);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + if (info.index >= vdev->num_irqs)
> + return -EINVAL;
> +
> + info.flags = vdev->irq[info.index].flags;
> + info.count = vdev->irq[info.index].count;
> +
> + return copy_to_user((void __user *)arg, &info, minsz);
>  
>   } else if (cmd == VFIO_DEVICE_SET_IRQS)
>   return -EINVAL;
> diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
> b/drivers/vfio/platform/vfio_platform_irq.c
> new file mode 100644
> index 000..075c401
> --- /dev/null
> +++ b/drivers/vfio/platform/vfio_platform_irq.c
> @@ -0,0 +1,63 @@
> +/*
> + * VFIO platform devices interrupt handling
> + *
> + * Copyright (C) 2013 - Virtual Open Systems
> + * Author: Antonios Motakis 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License, version 2, as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vfio_platform_private.h"
> +
> +int vfio_platform_irq_init(struct vfio_platform_device *vdev)
> +{
> + int cnt = 0, i;
> +
> + while (platform_get_irq(vdev->pdev, cnt) > 0)
> + cnt++;
> +
> + vdev->num_irqs = cnt;
> +
> + vdev->irq = kzalloc(sizeof(struct vfio_platform_irq) * vdev->num_irqs,
> +   

Re: [RFC PATCH v5 03/11] VFIO_IOMMU_TYPE1 for platform bus devices on ARM

2014-04-28 Thread Will Deacon
Hi Alex,

On Mon, Apr 28, 2014 at 05:43:41PM +0100, Alex Williamson wrote:
> On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> > This allows making use of the VFIO_IOMMU_TYPE1 driver with platform
> > devices on ARM in addition to PCI. This is required in order to use the
> > Exynos SMMU or ARM SMMU driver with VFIO_IOMMU_TYPE1.

[...]

> > @@ -721,13 +722,15 @@ static int vfio_iommu_type1_attach_group(void 
> > *iommu_data,
> > INIT_LIST_HEAD(&domain->group_list);
> > list_add(&group->next, &domain->group_list);
> >  
> > -   if (!allow_unsafe_interrupts &&
> > +#ifdef CONFIG_PCI
> > +   if (bus == &pci_bus_type && !allow_unsafe_interrupts &&
> > !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > pr_warn("%s: No interrupt remapping support.  Use the module 
> > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this 
> > platform\n",
> >__func__);
> > ret = -EPERM;
> > goto out_detach;
> > }
> > +#endif
> >  
> > if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
> > domain->prot |= IOMMU_CACHE;
> 
> This is not a PCI specific requirement.  Anything that can support MSI
> needs an IOMMU that can provide isolation for both DMA and interrupts.
> I think the IOMMU should still be telling us that it has this feature.

Please excuse any ignorance on my part here (I'm not at all familiar with the
Intel IOMMU), but shouldn't this really be a property of the interrupt
controller itself? On ARM with GICv3, there is a separate block called the
ITS (interrupt translation service) which is part of the interrupt
controller. The ITS provides a doorbell page which the SMMU can map into a
guest operating system to provide MSI for passthrough devices, but this
isn't something the SMMU is aware of -- it will just see the iommu_map
request for a non-cacheable mapping.

Will


Re: [RFC PATCH v5 06/11] VFIO_PLATFORM: Return info for device and its memory mapped IO regions

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> A VFIO userspace driver will start by opening the VFIO device
> that corresponds to an IOMMU group, and will use the ioctl interface
> to get the basic device info, such as number of memory regions and
> interrupts, and their properties.
> 
> This patch enables the IOCTLs:
>  - VFIO_DEVICE_GET_INFO
>  - VFIO_DEVICE_GET_REGION_INFO
> 
>  IRQ info is provided by one of the later patches.
> 
> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/platform/vfio_platform.c | 77 
> ---
>  drivers/vfio/platform/vfio_platform_private.h | 17 ++
>  2 files changed, 88 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/platform/vfio_platform.c 
> b/drivers/vfio/platform/vfio_platform.c
> index 1661746..5430cbe 100644
> --- a/drivers/vfio/platform/vfio_platform.c
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -34,15 +34,62 @@
>  #define DRIVER_AUTHOR   "Antonios Motakis "
>  #define DRIVER_DESC "VFIO for platform devices - User Level meta-driver"
>  
> +static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
> +{
> + int cnt = 0, i;
> +
> + while (platform_get_resource(vdev->pdev, IORESOURCE_MEM, cnt))
> + cnt++;
> +
> + vdev->num_regions = cnt;
> +
> + vdev->region = kzalloc(sizeof(struct vfio_platform_region) * cnt,
> + GFP_KERNEL);
> + if (!vdev->region)

Should vdev->num_regions be cleared here or set at the end to avoid
possibly walking a null pointer later?

> + return -ENOMEM;
> +
> + for (i = 0; i < cnt;  i++) {
> + struct vfio_platform_region region;
> + struct resource *res =
> + platform_get_resource(vdev->pdev, IORESOURCE_MEM, i);
> +
> + region.addr = res->start;
> + region.size = resource_size(res);
> + region.flags = 0;
> +
> + vdev->region[i] = region;

nit, the local variable with copy at the end seems rather unnecessary
here.

> + }
> +
> + return 0;
> +}
> +
> +static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev)
> +{
> + kfree(vdev->region);

Makes me nervous again that we have vdev->num_regions still set to a
value.  Maybe just paranoia.

> +}
> +
>  static void vfio_platform_release(void *device_data)
>  {
> + struct vfio_platform_device *vdev = device_data;
> +
> + vfio_platform_regions_cleanup(vdev);
> +
>   module_put(THIS_MODULE);
>  }
>  
>  static int vfio_platform_open(void *device_data)
>  {
> - if (!try_module_get(THIS_MODULE))
> + struct vfio_platform_device *vdev = device_data;
> + int ret;
> +
> + ret = vfio_platform_regions_init(vdev);
> + if (ret)
> + return ret;
> +
> + if (!try_module_get(THIS_MODULE)) {
> + vfio_platform_regions_cleanup(vdev);
>   return -ENODEV;
> + }

Getting a reference to the module seems like it should be step 1 here.
Thanks,

Alex

>  
>   return 0;
>  }
> @@ -65,18 +112,36 @@ static long vfio_platform_ioctl(void *device_data,
>   return -EINVAL;
>  
>   info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
> - info.num_regions = 0;
> + info.num_regions = vdev->num_regions;
>   info.num_irqs = 0;
>  
>   return copy_to_user((void __user *)arg, &info, minsz);
>  
> - } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
> - return -EINVAL;
> + } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> + struct vfio_region_info info;
> +
> + minsz = offsetofend(struct vfio_region_info, offset);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + if (info.index >= vdev->num_regions)
> + return -EINVAL;
> +
> + /* map offset to the physical address  */
> + info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index);
> + info.size = vdev->region[info.index].size;
> + info.flags = vdev->region[info.index].flags;
> +
> + return copy_to_user((void __user *)arg, &info, minsz);
>  
> - else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
> + } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
>   return -EINVAL;
>  
> - else if (cmd == VFIO_DEVICE_SET_IRQS)
> + } else if (cmd == VFIO_DEVICE_SET_IRQS)
>   return -EINVAL;
>  
>   else if (cmd == VFIO_DEVICE_RESET)
> diff --git a/drivers/vfio/platform/vfio_platform_private.h 
> b/drivers/vfio/platform/vfio_platform_private.h
> index 4ae88f8..3448f918 100644
> --- a/drivers/vfio/platform/vfio_platform_private.h
> +++ b/drivers/vfio/platform/vfio_platform_private.h
> @@ -15,8 +15,25 @@
>  #ifndef VFIO_PLATFORM_PRIVATE_H
>  #define VFIO_PLATFORM_PRIVATE_H

Re: [patch 2/2] target-i386: block migration and savevm if invariant tsc is exposed

2014-04-28 Thread Paolo Bonzini

Il 28/04/2014 17:46, Eduardo Habkost ha scritto:

So indeed I couldn't think of uses of "-cpu host" together with
migration.  But if you're sure there is none, block it in QEMU.
There's no reason why this should be specific to libvirt.


It's not that there are no useful use cases, it is that we simply can't
guarantee it will work because (by design) "-cpu host" enables features
QEMU doesn't know about (so it doesn't know if migration is safe or
not). And that's the main use case for "-cpu host".


True, but in practice QEMU and KVM support is added in parallel, and we 
already have full support for Broadwell (IIRC) and are starting to add 
Skylake features.



So, what about doing the following on QEMU 2.1:

 * "-cpu host,migratable=yes":
   * No invtsc
   * Migration not blocked
   * No feature flag unknown to QEMU will be enabled
 * "-cpu host,migratable=no":
   * invtsc enabled
   * Unknown-to-QEMU features enabled
   * Migration blocked
 * "-cpu host":
   * Same behavior as "-cpu host,migratable=yes"
 * Release notes indicating that "-cpu host,migratable=no" is
   required if the user doesn't care about migration and wants new
   (unknown to QEMU) features to be available

I was going to propose making "migratable=no" the default after a few
releases, as I expect the majority of existing users of "-cpu host" to
not care about migration. But I am not sure, because there's less harm
in _not_ getting new bleeding edge features enabled, and those users
(and management software) can start using "migratable=no" if they really
want the new unmigratable features.


Makes sense.  Basically "-cpu host,migratable=yes" is close to libvirt's 
host-model and Alex Graf's proposed "-cpu best".  Should we call it 
"-cpu best" and drop migratability of "-cpu host"?


Paolo




Re: [PATCH v10 02/12] ARM/ARM64: KVM: Add common header for PSCI related defines

2014-04-28 Thread Ashwin Chaugule
On 28 April 2014 13:13, Anup Patel  wrote:

>>> +#define PSCI_0_2_POWER_STATE_ID_MASK 0x
>>> +#define PSCI_0_2_POWER_STATE_ID_SHIFT0
>>> +#define PSCI_0_2_POWER_STATE_TYPE_MASK   0x1
>>
>> Shouldn't this be (0x1 << PSCI_0_2_POWER_STATE_TYPE_SHIFT)?
>>
>> That seems to be the definition of a mask in the PSCI_VERSION_MAJOR_MASK
>> below, at least be consistent in this file.
>
> This was blindly adopted from Ashwin's patchset. I kept it this way so
> that it works for him. Maybe I should change this for consistency.
>
> I hope this will be fine for Ashwin ??
>

I don't have a problem changing it for consistency reasons.

The original code has it that way:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/psci.c?id=v3.15-rc3

Thanks,
Ashwin


Re: [RFC PATCH v5 03/11] VFIO_IOMMU_TYPE1 for platform bus devices on ARM

2014-04-28 Thread Alex Williamson
On Mon, 2014-04-28 at 17:52 +0200, Antonios Motakis wrote:
> This allows making use of the VFIO_IOMMU_TYPE1 driver with platform
> devices on ARM in addition to PCI. This is required in order to use the
> Exynos SMMU or ARM SMMU driver with VFIO_IOMMU_TYPE1.
> 
> Signed-off-by: Antonios Motakis 
> ---
>  drivers/vfio/Kconfig| 2 +-
>  drivers/vfio/vfio_iommu_type1.c | 5 -
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index af7b204..3a598ed 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -11,7 +11,7 @@ config VFIO_IOMMU_SPAPR_TCE
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
> - select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_TYPE1 if X86 || ARM
>   select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
>   select ANON_INODES
>   help
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1f90344..4dc989d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/pci.h> /* pci_bus_type */
>  #include 
>  #include 
>  #include 
> @@ -721,13 +722,15 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   INIT_LIST_HEAD(&domain->group_list);
>   list_add(&group->next, &domain->group_list);
>  
> - if (!allow_unsafe_interrupts &&
> +#ifdef CONFIG_PCI
> + if (bus == &pci_bus_type && !allow_unsafe_interrupts &&
>   !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
>   pr_warn("%s: No interrupt remapping support.  Use the module 
> param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this 
> platform\n",
>  __func__);
>   ret = -EPERM;
>   goto out_detach;
>   }
> +#endif
>  
>   if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
>   domain->prot |= IOMMU_CACHE;

This is not a PCI specific requirement.  Anything that can support MSI
needs an IOMMU that can provide isolation for both DMA and interrupts.
I think the IOMMU should still be telling us that it has this feature.
Thanks,

Alex



[PATCH v4] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-04-28 Thread Paolo Bonzini
From: Christian Borntraeger 

When starting many dataplane devices, boot takes a very long time on
Christian's s390 with the irqfd patches. With larger setups he can even
trigger timeouts in some components.  It turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when multiple CPUs are present.  This is caused by synchronize_rcu
combined with s390's HZ=100.  By changing the code to use a private SRCU
instance we can speed things up.  This patch reduces the boot time until
root is mounted from 8 to 2 seconds on my s390 guest with 100 disks.

Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).

Note that we're hardly relying on the "sleepable" part of srcu.  We just
want SRCU's faster detection of grace periods.

Testing was done by Andrew Theurer using NETPERF.  The difference between
results "before" and "after" the patch has mean -0.2% and standard deviation
0.6%.  Using a paired t-test on the data points says that there is a 2.5%
probability that the patch is the cause of the performance difference
(rather than a random fluctuation).

Cc: Marcelo Tosatti 
Cc: Michael S. Tsirkin 
Signed-off-by: Christian Borntraeger 
Signed-off-by: Paolo Bonzini 
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/eventfd.c   | 25 +++--
 virt/kvm/irq_comm.c  | 17 +
 virt/kvm/irqchip.c   | 31 ---
 virt/kvm/kvm_main.c  | 16 ++--
 5 files changed, 51 insertions(+), 39 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 820fc2e1d9df..cd0df9a9352d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -368,6 +368,7 @@ struct kvm {
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots *memslots;
struct srcu_struct srcu;
+   struct srcu_struct irq_srcu;
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
u32 bsp_vcpu_id;
 #endif
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 912ec5a95e2c..20c3af7692c5 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "iodev.h"
@@ -118,19 +119,22 @@ static void
 irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
 {
struct _irqfd_resampler *resampler;
+   struct kvm *kvm;
struct _irqfd *irqfd;
+   int idx;
 
resampler = container_of(kian, struct _irqfd_resampler, notifier);
+   kvm = resampler->kvm;
 
-   kvm_set_irq(resampler->kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
+   kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
resampler->notifier.gsi, 0, false);
 
-   rcu_read_lock();
+   idx = srcu_read_lock(&kvm->irq_srcu);
 
list_for_each_entry_rcu(irqfd, &resampler->list, resampler_link)
eventfd_signal(irqfd->resamplefd, 1);
 
-   rcu_read_unlock();
+   srcu_read_unlock(&kvm->irq_srcu, idx);
 }
 
 static void
@@ -142,7 +146,7 @@ irqfd_resampler_shutdown(struct _irqfd *irqfd)
mutex_lock(&kvm->irqfds.resampler_lock);
 
list_del_rcu(&irqfd->resampler_link);
-   synchronize_rcu();
+   synchronize_srcu(&kvm->irq_srcu);
 
if (list_empty(&resampler->list)) {
list_del(&resampler->link);
@@ -221,17 +225,18 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
unsigned long flags = (unsigned long)key;
struct kvm_kernel_irq_routing_entry *irq;
struct kvm *kvm = irqfd->kvm;
+   int idx;
 
if (flags & POLLIN) {
-   rcu_read_lock();
-   irq = rcu_dereference(irqfd->irq_entry);
+   idx = srcu_read_lock(&kvm->irq_srcu);
+   irq = srcu_dereference(irqfd->irq_entry, &kvm->irq_srcu);
/* An event has been signaled, inject an interrupt */
if (irq)
kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
false);
else
schedule_work(&irqfd->inject);
-   rcu_read_unlock();
+   srcu_read_unlock(&kvm->irq_srcu, idx);
}
 
if (flags & POLLHUP) {
@@ -363,7 +368,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
}
 
list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
-   synchronize_rcu();
+   synchronize_srcu(&kvm->irq_srcu);
 
mutex_unlock(&kvm->irqfds.resampler_lock);
}
@@ -465,7 +470,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 * another thread calls kvm_irq_routing_update before
 * we flush workqueue below (we synchronize with
 



[RFC PATCH v5 02/11] ARM SMMU: Add capability IOMMU_CAP_DMA_EXEC

2014-04-28 Thread Antonios Motakis
The ARM SMMU can take an IOMMU_EXEC protection flag in addition to
IOMMU_READ and IOMMU_WRITE. Expose this as an IOMMU capability.

Signed-off-by: Antonios Motakis 
---
 drivers/iommu/arm-smmu.c | 2 ++
 include/linux/iommu.h| 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 1d9ab39..abf802f 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1509,6 +1509,8 @@ static int arm_smmu_domain_has_cap(struct iommu_domain *domain,
if (smmu_domain->root_cfg.smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
caps |= IOMMU_CAP_CACHE_COHERENCY;
 
+   caps |= IOMMU_CAP_DMA_EXEC;
+
return !!(cap & caps);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b96a5b2..4f547f3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -57,8 +57,9 @@ struct iommu_domain {
struct iommu_domain_geometry geometry;
 };
 
-#define IOMMU_CAP_CACHE_COHERENCY  0x1
-#define IOMMU_CAP_INTR_REMAP   0x2 /* isolates device intrs */
+#define IOMMU_CAP_CACHE_COHERENCY  (1 << 0)
+#define IOMMU_CAP_INTR_REMAP   (1 << 1) /* isolates device intrs */
+#define IOMMU_CAP_DMA_EXEC (1 << 2) /* EXEC protection flag */
 
 /*
  * Following constraints are specifc to FSL_PAMUV1:
-- 
1.8.3.2



[RFC PATCH v5 03/11] VFIO_IOMMU_TYPE1 for platform bus devices on ARM

2014-04-28 Thread Antonios Motakis
This allows the VFIO_IOMMU_TYPE1 driver to be used with platform
devices on ARM in addition to PCI. This is required in order to use the
Exynos SMMU or ARM SMMU driver with VFIO_IOMMU_TYPE1.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/Kconfig| 2 +-
 drivers/vfio/vfio_iommu_type1.c | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index af7b204..3a598ed 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -11,7 +11,7 @@ config VFIO_IOMMU_SPAPR_TCE
 menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
-   select VFIO_IOMMU_TYPE1 if X86
+   select VFIO_IOMMU_TYPE1 if X86 || ARM
select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
select ANON_INODES
help
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1f90344..4dc989d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <linux/pci.h> /* pci_bus_type */
 #include 
 #include 
 #include 
@@ -721,13 +722,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
INIT_LIST_HEAD(&domain->group_list);
list_add(&group->next, &domain->group_list);
 
-   if (!allow_unsafe_interrupts &&
+#ifdef CONFIG_PCI
+   if (bus == &pci_bus_type && !allow_unsafe_interrupts &&
!iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
   __func__);
ret = -EPERM;
goto out_detach;
}
+#endif
 
if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;
-- 
1.8.3.2



[RFC PATCH v5 06/11] VFIO_PLATFORM: Return info for device and its memory mapped IO regions

2014-04-28 Thread Antonios Motakis
A VFIO userspace driver will start by opening the VFIO device
that corresponds to an IOMMU group, and will use the ioctl interface
to get the basic device info, such as the number of memory regions and
interrupts, and their properties.

This patch enables the IOCTLs:
 - VFIO_DEVICE_GET_INFO
 - VFIO_DEVICE_GET_REGION_INFO

 IRQ info is provided by one of the later patches.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/vfio_platform.c | 77 ---
 drivers/vfio/platform/vfio_platform_private.h | 17 ++
 2 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
index 1661746..5430cbe 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -34,15 +34,62 @@
 #define DRIVER_AUTHOR   "Antonios Motakis "
 #define DRIVER_DESC "VFIO for platform devices - User Level meta-driver"
 
+static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
+{
+   int cnt = 0, i;
+
+   while (platform_get_resource(vdev->pdev, IORESOURCE_MEM, cnt))
+   cnt++;
+
+   vdev->num_regions = cnt;
+
+   vdev->region = kzalloc(sizeof(struct vfio_platform_region) * cnt,
+   GFP_KERNEL);
+   if (!vdev->region)
+   return -ENOMEM;
+
+   for (i = 0; i < cnt;  i++) {
+   struct vfio_platform_region region;
+   struct resource *res =
+   platform_get_resource(vdev->pdev, IORESOURCE_MEM, i);
+
+   region.addr = res->start;
+   region.size = resource_size(res);
+   region.flags = 0;
+
+   vdev->region[i] = region;
+   }
+
+   return 0;
+}
+
+static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev)
+{
+   kfree(vdev->region);
+}
+
 static void vfio_platform_release(void *device_data)
 {
+   struct vfio_platform_device *vdev = device_data;
+
+   vfio_platform_regions_cleanup(vdev);
+
module_put(THIS_MODULE);
 }
 
 static int vfio_platform_open(void *device_data)
 {
-   if (!try_module_get(THIS_MODULE))
+   struct vfio_platform_device *vdev = device_data;
+   int ret;
+
+   ret = vfio_platform_regions_init(vdev);
+   if (ret)
+   return ret;
+
+   if (!try_module_get(THIS_MODULE)) {
+   vfio_platform_regions_cleanup(vdev);
return -ENODEV;
+   }
 
return 0;
 }
@@ -65,18 +112,36 @@ static long vfio_platform_ioctl(void *device_data,
return -EINVAL;
 
info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
-   info.num_regions = 0;
+   info.num_regions = vdev->num_regions;
info.num_irqs = 0;
 
return copy_to_user((void __user *)arg, &info, minsz);
 
-   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
-   return -EINVAL;
+   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+   struct vfio_region_info info;
+
+   minsz = offsetofend(struct vfio_region_info, offset);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   if (info.index >= vdev->num_regions)
+   return -EINVAL;
+
+   /* map offset to the physical address  */
+   info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index);
+   info.size = vdev->region[info.index].size;
+   info.flags = vdev->region[info.index].flags;
+
+   return copy_to_user((void __user *)arg, &info, minsz);
 
-   else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+   } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
return -EINVAL;
 
-   else if (cmd == VFIO_DEVICE_SET_IRQS)
+   } else if (cmd == VFIO_DEVICE_SET_IRQS)
return -EINVAL;
 
else if (cmd == VFIO_DEVICE_RESET)
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index 4ae88f8..3448f918 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -15,8 +15,25 @@
 #ifndef VFIO_PLATFORM_PRIVATE_H
 #define VFIO_PLATFORM_PRIVATE_H
 
+#define VFIO_PLATFORM_OFFSET_SHIFT   40
+#define VFIO_PLATFORM_OFFSET_MASK (((u64)(1) << VFIO_PLATFORM_OFFSET_SHIFT) - 1)
+
+#define VFIO_PLATFORM_OFFSET_TO_INDEX(off) \
+   (off >> VFIO_PLATFORM_OFFSET_SHIFT)
+
+#define VFIO_PLATFORM_INDEX_TO_OFFSET(index)   \
+   ((u64)(index) << VFIO_PLATFORM_OFFSET_SHIFT)
+
+struct vfio_platform_region {
+   u64 addr;
+   resource_size_t size;
+   u32 flags;
+};
+
 struct vfio_platform_device {
struct platform_device  *pdev;
+   struct vfio_plat

[RFC PATCH v5 01/11] driver core: platform: add device binding path 'driver_override'

2014-04-28 Thread Antonios Motakis
From: Kim Phillips 

Needed by platform device drivers, such as the vfio-platform driver [1],
in order to bypass the existing OF, ACPI, id_table and name string matches,
and successfully bind to any device, like so:

echo vfio-platform > /sys/bus/platform/devices/fff51000.ethernet/driver_override
echo fff51000.ethernet > /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet > /sys/bus/platform/drivers_probe

This mimics "PCI: Introduce new device binding path using
pci_dev.driver_override" [2], which is an interface enhancement
for more deterministic PCI device binding, e.g., when in the
presence of hotplug.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1402.1/00177.html
[2] http://lists-archives.com/linux-kernel/28030441-pci-introduce-new-device-binding-path-using-pci_dev-driver_override.html

Suggested-by: Alex Williamson 
Signed-off-by: Kim Phillips 
---
 Documentation/ABI/testing/sysfs-bus-platform | 20 
 drivers/base/platform.c  | 46 
 include/linux/platform_device.h  |  1 +
 3 files changed, 67 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-platform

diff --git a/Documentation/ABI/testing/sysfs-bus-platform b/Documentation/ABI/testing/sysfs-bus-platform
new file mode 100644
index 000..5172a61
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-platform
@@ -0,0 +1,20 @@
+What:  /sys/bus/platform/devices/.../driver_override
+Date:  April 2014
+Contact:   Kim Phillips 
+Description:
+   This file allows the driver for a device to be specified which
+   will override standard OF, ACPI, ID table, and name matching.
+   When specified, only a driver with a name matching the value
+   written to driver_override will have an opportunity to bind
+   to the device.  The override is specified by writing a string
+   to the driver_override file (echo vfio-platform > \
+   driver_override) and may be cleared with an empty string
+   (echo > driver_override).  This returns the device to standard
+   matching rules binding.  Writing to driver_override does not
+   automatically unbind the device from its current driver or make
+   any attempt to automatically load the specified driver.  If no
+   driver with a matching name is currently loaded in the kernel,
+   the device will not bind to any driver.  This also allows
+   devices to opt-out of driver binding using a driver_override
+   name such as "none".  Only a single driver may be specified in
+   the override, there is no support for parsing delimiters.
diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index bc78848..6db9fc2 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -693,8 +694,49 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *a,
 }
 static DEVICE_ATTR_RO(modalias);
 
+static ssize_t driver_override_store(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct platform_device *pdev = to_platform_device(dev);
+   char *driver_override, *old = pdev->driver_override, *cp;
+
+   if (count > PATH_MAX)
+   return -EINVAL;
+
+   driver_override = kstrndup(buf, count, GFP_KERNEL);
+   if (!driver_override)
+   return -ENOMEM;
+
+   cp = strchr(driver_override, '\n');
+   if (cp)
+   *cp = '\0';
+
+   if (strlen(driver_override)) {
+   pdev->driver_override = driver_override;
+   } else {
+   kfree(driver_override);
+   pdev->driver_override = NULL;
+   }
+
+   kfree(old);
+
+   return count;
+}
+
+static ssize_t driver_override_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct platform_device *pdev = to_platform_device(dev);
+
+   return sprintf(buf, "%s\n", pdev->driver_override);
+}
+static DEVICE_ATTR_RW(driver_override);
+
+
 static struct attribute *platform_dev_attrs[] = {
&dev_attr_modalias.attr,
+   &dev_attr_driver_override.attr,
NULL,
 };
 ATTRIBUTE_GROUPS(platform_dev);
@@ -750,6 +792,10 @@ static int platform_match(struct device *dev, struct device_driver *drv)
struct platform_device *pdev = to_platform_device(dev);
struct platform_driver *pdrv = to_platform_driver(drv);
 
+   /* When driver_override is set, only bind to the matching driver */
+   if (pdev->driver_override)
+   return !strcmp(pdev->driver_override, drv->name);
+
/* Attempt an OF style match

[RFC PATCH v5 10/11] VFIO_PLATFORM: Initial interrupts support

2014-04-28 Thread Antonios Motakis
This patch allows setting an eventfd for a platform device's interrupt,
and also triggering the interrupt eventfd from userspace for testing.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/vfio_platform.c |  36 +++-
 drivers/vfio/platform/vfio_platform_irq.c | 123 +-
 drivers/vfio/platform/vfio_platform_private.h |   7 ++
 3 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
index 2e16595..8aa5d10 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -174,10 +174,40 @@ static long vfio_platform_ioctl(void *device_data,
 
return copy_to_user((void __user *)arg, &info, minsz);
 
-   } else if (cmd == VFIO_DEVICE_SET_IRQS)
-   return -EINVAL;
+   } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+   struct vfio_irq_set hdr;
+   int ret = 0;
+
+   minsz = offsetofend(struct vfio_irq_set, count);
+
+   if (copy_from_user(&hdr, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (hdr.argsz < minsz)
+   return -EINVAL;
+
+   if (hdr.index >= vdev->num_irqs)
+   return -EINVAL;
+
+   if (hdr.start != 0 || hdr.count > 1)
+   return -EINVAL;
+
+   if (hdr.count == 0 &&
+   (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE) ||
+!(hdr.flags & VFIO_IRQ_SET_ACTION_TRIGGER)))
+   return -EINVAL;
+
+   if (hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+ VFIO_IRQ_SET_ACTION_TYPE_MASK))
+   return -EINVAL;
+
+   ret = vfio_platform_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+  hdr.start, hdr.count,
+  (void *)arg+minsz);
+
+   return ret;
 
-   else if (cmd == VFIO_DEVICE_RESET)
+   } else if (cmd == VFIO_DEVICE_RESET)
return -EINVAL;
 
return -ENOTTY;
diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index 075c401..433edc1 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -31,6 +31,9 @@
 
 #include "vfio_platform_private.h"
 
+static int vfio_set_trigger(struct vfio_platform_device *vdev,
+   int index, int fd);
+
 int vfio_platform_irq_init(struct vfio_platform_device *vdev)
 {
int cnt = 0, i;
@@ -47,9 +50,11 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev)
 
for (i = 0; i < cnt; i++) {
struct vfio_platform_irq irq;
+   int hwirq = platform_get_irq(vdev->pdev, i);
 
-   irq.flags = 0;
+   irq.flags = VFIO_IRQ_INFO_EVENTFD;
irq.count = 1;
+   irq.hwirq = hwirq;
 
vdev->irq[i] = irq;
}
@@ -59,5 +64,121 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev)
 
 void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
 {
+   int i;
+
+   for (i = 0; i < vdev->num_irqs; i++)
+   vfio_set_trigger(vdev, i, -1);
+
kfree(vdev->irq);
 }
+
+static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
+{
+   struct eventfd_ctx *trigger = dev_id;
+
+   eventfd_signal(trigger, 1);
+
+   return IRQ_HANDLED;
+}
+
+static int vfio_set_trigger(struct vfio_platform_device *vdev,
+   int index, int fd)
+{
+   struct vfio_platform_irq *irq = &vdev->irq[index];
+   struct eventfd_ctx *trigger;
+   int ret;
+
+   if (irq->trigger) {
+   free_irq(irq->hwirq, irq);
+   kfree(irq->name);
+   eventfd_ctx_put(irq->trigger);
+   irq->trigger = NULL;
+   }
+
+   if (fd < 0) /* Disable only */
+   return 0;
+
+   irq->name = kasprintf(GFP_KERNEL, "vfio-irq[%d](%s)",
+   irq->hwirq, vdev->pdev->name);
+   if (!irq->name)
+   return -ENOMEM;
+
+   trigger = eventfd_ctx_fdget(fd);
+   if (IS_ERR(trigger)) {
+   kfree(irq->name);
+   return PTR_ERR(trigger);
+   }
+
+   irq->trigger = trigger;
+
+   ret = request_irq(irq->hwirq, vfio_irq_handler, 0, irq->name, irq);
+   if (ret) {
+   kfree(irq->name);
+   eventfd_ctx_put(trigger);
+   irq->trigger = NULL;
+   return ret;
+   }
+
+   return 0;
+}
+
+static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
+unsigned index, unsigned start,
+unsigned count, uint32_t flags, void *data

[RFC PATCH v5 04/11] VFIO_IOMMU_TYPE1: Introduce the VFIO_DMA_MAP_FLAG_EXEC flag

2014-04-28 Thread Antonios Motakis
The ARM SMMU driver expects the IOMMU_EXEC flag, otherwise it will
set the page tables for a device as XN (execute never). This affects
devices such as the ARM PL330 DMA Controller, which fails to operate
if the XN flag is set on the memory it tries to fetch its instructions
from.

We introduce the VFIO_DMA_MAP_FLAG_EXEC to VFIO, and use it in
VFIO_IOMMU_TYPE1 to set the IOMMU_EXEC flag. This way the user can
control whether the XN flag will be set on the requested mappings. If
the IOMMU_EXEC flag is available for at least one IOMMU of a container,
the new capability VFIO_IOMMU_PROT_EXEC will be exposed.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/vfio_iommu_type1.c | 34 +++---
 include/uapi/linux/vfio.h   |  2 ++
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4dc989d..6ce32bf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -498,12 +498,15 @@ static int map_try_harder(struct vfio_domain *domain, 
dma_addr_t iova,
 }
 
 static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
- unsigned long pfn, long npage, int prot)
+ unsigned long pfn, long npage, int prot, bool exec)
 {
struct vfio_domain *d;
int ret;
 
list_for_each_entry(d, &iommu->domain_list, next) {
+   if (exec && iommu_domain_has_cap(d->domain, IOMMU_CAP_DMA_EXEC))
+   prot |= IOMMU_EXEC;
+
ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
npage << PAGE_SHIFT, prot | d->prot);
if (ret) {
@@ -530,6 +533,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
size_t size = map->size;
long npage;
int ret = 0, prot = 0;
+   bool prot_exec = false;
uint64_t mask;
struct vfio_dma *dma;
unsigned long pfn;
@@ -543,6 +547,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
prot |= IOMMU_WRITE;
if (map->flags & VFIO_DMA_MAP_FLAG_READ)
prot |= IOMMU_READ;
+   if (map->flags & VFIO_DMA_MAP_FLAG_EXEC)
+   prot_exec = true;
 
if (!prot)
return -EINVAL; /* No READ/WRITE? */
@@ -595,7 +601,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
}
 
/* Map it! */
-   ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
+   ret = vfio_iommu_map(iommu, iova, pfn, npage, prot, prot_exec);
if (ret) {
vfio_unpin_pages(pfn, npage, prot, true);
break;
@@ -887,6 +893,23 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
return ret;
 }
 
+static int vfio_domains_have_iommu_exec(struct vfio_iommu *iommu)
+{
+   struct vfio_domain *d;
+   int ret = 0;
+
+   mutex_lock(&iommu->lock);
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   if (iommu_domain_has_cap(d->domain, IOMMU_CAP_DMA_EXEC)) {
+   ret = 1;
+   break;
+   }
+   }
+   mutex_unlock(&iommu->lock);
+
+   return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -902,6 +925,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (!iommu)
return 0;
return vfio_domains_have_iommu_cache(iommu);
+   case VFIO_IOMMU_PROT_EXEC:
+   if (!iommu)
+   return 0;
+   return vfio_domains_have_iommu_exec(iommu);
default:
return 0;
}
@@ -925,7 +952,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
} else if (cmd == VFIO_IOMMU_MAP_DMA) {
struct vfio_iommu_type1_dma_map map;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
-   VFIO_DMA_MAP_FLAG_WRITE;
+   VFIO_DMA_MAP_FLAG_WRITE |
+   VFIO_DMA_MAP_FLAG_EXEC;
 
minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..0847b29 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -29,6 +29,7 @@
  * capability is subject to change as groups are added or removed.
  */
 #define VFIO_DMA_CC_IOMMU  4
+#define VFIO_IOMMU_PROT_EXEC   5
 
 /*
  * The IOCTL interface is designed for extensibility by embedding the
@@ -398,6 +399,7 @@ struct vfio_iommu_type1_dma_map {
__u32   flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)	/* readable from device */
 #define VFIO_D

[RFC PATCH v5 07/11] VFIO_PLATFORM: Read and write support for the device fd

2014-04-28 Thread Antonios Motakis
VFIO returns a file descriptor which we can use to manipulate the memory
regions of the device. Since some memory regions cannot be mmapped due to
security concerns, we also allow reading from and writing to this file
descriptor directly.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/vfio_platform.c | 110 +-
 1 file changed, 107 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
index 5430cbe..7c01ced 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -55,7 +55,8 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
 
region.addr = res->start;
region.size = resource_size(res);
-   region.flags = 0;
+   region.flags = VFIO_REGION_INFO_FLAG_READ
+   | VFIO_REGION_INFO_FLAG_WRITE;
 
vdev->region[i] = region;
}
@@ -153,13 +154,116 @@ static long vfio_platform_ioctl(void *device_data,
 static ssize_t vfio_platform_read(void *device_data, char __user *buf,
 size_t count, loff_t *ppos)
 {
-   return 0;
+   struct vfio_platform_device *vdev = device_data;
+   unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos);
+   loff_t off = *ppos & VFIO_PLATFORM_OFFSET_MASK;
+   unsigned int done = 0;
+   void __iomem *io;
+
+   if (index >= vdev->num_regions)
+   return -EINVAL;
+
+   io = ioremap_nocache(vdev->region[index].addr,
+vdev->region[index].size);
+
+   while (count) {
+   size_t filled;
+
+   if (count >= 4 && !(off % 4)) {
+   u32 val;
+
+   val = ioread32(io + off);
+   if (copy_to_user(buf, &val, 4))
+   goto err;
+
+   filled = 4;
+   } else if (count >= 2 && !(off % 2)) {
+   u16 val;
+
+   val = ioread16(io + off);
+   if (copy_to_user(buf, &val, 2))
+   goto err;
+
+   filled = 2;
+   } else {
+   u8 val;
+
+   val = ioread8(io + off);
+   if (copy_to_user(buf, &val, 1))
+   goto err;
+
+   filled = 1;
+   }
+
+
+   count -= filled;
+   done += filled;
+   off += filled;
+   buf += filled;
+   }
+
+   iounmap(io);
+   return done;
+err:
+   iounmap(io);
+   return -EFAULT;
 }
 
 static ssize_t vfio_platform_write(void *device_data, const char __user *buf,
  size_t count, loff_t *ppos)
 {
-   return 0;
+   struct vfio_platform_device *vdev = device_data;
+   unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos);
+   loff_t off = *ppos & VFIO_PLATFORM_OFFSET_MASK;
+   unsigned int done = 0;
+   void __iomem *io;
+
+   if (index >= vdev->num_regions)
+   return -EINVAL;
+
+   io = ioremap_nocache(vdev->region[index].addr,
+vdev->region[index].size);
+
+   while (count) {
+   size_t filled;
+
+   if (count >= 4 && !(off % 4)) {
+   u32 val;
+
+   if (copy_from_user(&val, buf, 4))
+   goto err;
+   iowrite32(val, io + off);
+
+   filled = 4;
+   } else if (count >= 2 && !(off % 2)) {
+   u16 val;
+
+   if (copy_from_user(&val, buf, 2))
+   goto err;
+   iowrite16(val, io + off);
+
+   filled = 2;
+   } else {
+   u8 val;
+
+   if (copy_from_user(&val, buf, 1))
+   goto err;
+   iowrite8(val, io + off);
+
+   filled = 1;
+   }
+
+   count -= filled;
+   done += filled;
+   off += filled;
+   buf += filled;
+   }
+
+   iounmap(io);
+   return done;
+err:
+   iounmap(io);
+   return -EFAULT;
 }
 
 static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)
-- 
1.8.3.2



[RFC PATCH v5 05/11] VFIO_PLATFORM: Initial skeleton of VFIO support for platform devices

2014-04-28 Thread Antonios Motakis
This patch forms the skeleton for platform device support with VFIO.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/Kconfig  |   1 +
 drivers/vfio/Makefile |   1 +
 drivers/vfio/platform/Kconfig |   9 ++
 drivers/vfio/platform/Makefile|   4 +
 drivers/vfio/platform/vfio_platform.c | 172 ++
 drivers/vfio/platform/vfio_platform_private.h |  22 
 include/uapi/linux/vfio.h |   1 +
 7 files changed, 210 insertions(+)
 create mode 100644 drivers/vfio/platform/Kconfig
 create mode 100644 drivers/vfio/platform/Makefile
 create mode 100644 drivers/vfio/platform/vfio_platform.c
 create mode 100644 drivers/vfio/platform/vfio_platform_private.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 3a598ed..a484d72 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -21,3 +21,4 @@ menuconfig VFIO
  If you don't know what to do here, say N.
 
 source "drivers/vfio/pci/Kconfig"
+source "drivers/vfio/platform/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 72bfabc..b5e4a33 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += platform/
diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
new file mode 100644
index 000..42b0022
--- /dev/null
+++ b/drivers/vfio/platform/Kconfig
@@ -0,0 +1,9 @@
+config VFIO_PLATFORM
+   tristate "VFIO support for platform devices"
+   depends on VFIO && EVENTFD
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on the system using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
new file mode 100644
index 000..df3a014
--- /dev/null
+++ b/drivers/vfio/platform/Makefile
@@ -0,0 +1,4 @@
+
+vfio-platform-y := vfio_platform.o
+
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
new file mode 100644
index 000..1661746
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_platform_private.h"
+
+#define DRIVER_VERSION  "0.5"
+#define DRIVER_AUTHOR   "Antonios Motakis "
+#define DRIVER_DESC "VFIO for platform devices - User Level meta-driver"
+
+static void vfio_platform_release(void *device_data)
+{
+   module_put(THIS_MODULE);
+}
+
+static int vfio_platform_open(void *device_data)
+{
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   return 0;
+}
+
+static long vfio_platform_ioctl(void *device_data,
+  unsigned int cmd, unsigned long arg)
+{
+   struct vfio_platform_device *vdev = device_data;
+   unsigned long minsz;
+
+   if (cmd == VFIO_DEVICE_GET_INFO) {
+   struct vfio_device_info info;
+
+   minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
+   info.num_regions = 0;
+   info.num_irqs = 0;
+
+   return copy_to_user((void __user *)arg, &info, minsz);
+
+   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_SET_IRQS)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_RESET)
+   return -EINVAL;
+
+   return -ENOTTY;
+}
+
+static ssize_t vfio_platform_read(void *device_data, char __user *buf,
+size_t count, loff_t *ppos)
+{
+   return 0;
+}
+
+static ssize_t vfio_platform_write(void *device_data, const char __user *buf,
+

[RFC PATCH v5 09/11] VFIO_PLATFORM: Return IRQ info

2014-04-28 Thread Antonios Motakis
Return information for the interrupts exposed by the device.
This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
and enables VFIO_DEVICE_GET_IRQ_INFO.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/Makefile|  2 +-
 drivers/vfio/platform/vfio_platform.c | 35 +--
 drivers/vfio/platform/vfio_platform_irq.c | 63 +++
 drivers/vfio/platform/vfio_platform_private.h | 11 +
 4 files changed, 106 insertions(+), 5 deletions(-)
 create mode 100644 drivers/vfio/platform/vfio_platform_irq.c

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index df3a014..2c53327 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,4 +1,4 @@
 
-vfio-platform-y := vfio_platform.o
+vfio-platform-y := vfio_platform.o vfio_platform_irq.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
diff --git a/drivers/vfio/platform/vfio_platform.c 
b/drivers/vfio/platform/vfio_platform.c
index 37beff3..2e16595 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -79,6 +79,7 @@ static void vfio_platform_release(void *device_data)
struct vfio_platform_device *vdev = device_data;
 
vfio_platform_regions_cleanup(vdev);
+   vfio_platform_irq_cleanup(vdev);
 
module_put(THIS_MODULE);
 }
@@ -92,12 +93,22 @@ static int vfio_platform_open(void *device_data)
if (ret)
return ret;
 
+   ret = vfio_platform_irq_init(vdev);
+   if (ret)
+   goto err_irq;
+
if (!try_module_get(THIS_MODULE)) {
-   vfio_platform_regions_cleanup(vdev);
-   return -ENODEV;
+   ret = -ENODEV;
+   goto err_mod;
}
 
return 0;
+
+err_mod:
+   vfio_platform_irq_cleanup(vdev);
+err_irq:
+   vfio_platform_regions_cleanup(vdev);
+   return ret;
 }
 
 static long vfio_platform_ioctl(void *device_data,
@@ -119,7 +130,7 @@ static long vfio_platform_ioctl(void *device_data,
 
info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
info.num_regions = vdev->num_regions;
-   info.num_irqs = 0;
+   info.num_irqs = vdev->num_irqs;
 
return copy_to_user((void __user *)arg, &info, minsz);
 
@@ -145,7 +156,23 @@ static long vfio_platform_ioctl(void *device_data,
return copy_to_user((void __user *)arg, &info, minsz);
 
} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
-   return -EINVAL;
+   struct vfio_irq_info info;
+
+   minsz = offsetofend(struct vfio_irq_info, count);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   if (info.index >= vdev->num_irqs)
+   return -EINVAL;
+
+   info.flags = vdev->irq[info.index].flags;
+   info.count = vdev->irq[info.index].count;
+
+   return copy_to_user((void __user *)arg, &info, minsz);
 
} else if (cmd == VFIO_DEVICE_SET_IRQS)
return -EINVAL;
diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
new file mode 100644
index 000..075c401
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -0,0 +1,63 @@
+/*
+ * VFIO platform devices interrupt handling
+ *
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_platform_private.h"
+
+int vfio_platform_irq_init(struct vfio_platform_device *vdev)
+{
+   int cnt = 0, i;
+
+   while (platform_get_irq(vdev->pdev, cnt) > 0)
+   cnt++;
+
+   vdev->num_irqs = cnt;
+
+   vdev->irq = kzalloc(sizeof(struct vfio_platform_irq) * vdev->num_irqs,
+   GFP_KERNEL);
+   if (!vdev->irq)
+   return -ENOMEM;
+
+   for (i = 0; i < cnt; i++) {
+   struct vfio_platform_irq irq;
+
+   irq.flags = 0;
+   irq.count = 1;
+
+   vdev->irq[i] = irq;
+   }
+
+   return 0;
+}
+
+void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
+{
+   kfree(vdev->irq);
+}
diff --git a/drivers/vfio/platform/vfio_platform_privat

[RFC PATCH v5 08/11] VFIO_PLATFORM: Support MMAP of MMIO regions

2014-04-28 Thread Antonios Motakis
Allow memory mapping of the MMIO regions of the device so userspace can
directly access them.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/vfio_platform.c | 40 ++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/vfio_platform.c 
b/drivers/vfio/platform/vfio_platform.c
index 7c01ced..37beff3 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -58,6 +58,11 @@ static int vfio_platform_regions_init(struct 
vfio_platform_device *vdev)
region.flags = VFIO_REGION_INFO_FLAG_READ
| VFIO_REGION_INFO_FLAG_WRITE;
 
+   /* Only regions addressed with PAGE granularity can be MMAPed
+* securely. */
+   if (!(region.addr & ~PAGE_MASK) && !(region.size & ~PAGE_MASK))
+   region.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
vdev->region[i] = region;
}
 
@@ -268,7 +273,40 @@ err:
 
 static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)
 {
-   return -EINVAL;
+   struct vfio_platform_device *vdev = device_data;
+   unsigned int index;
+   u64 req_len, pgoff, req_start;
+   struct vfio_platform_region region;
+
+   index = vma->vm_pgoff >> (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT);
+
+   if (vma->vm_end < vma->vm_start)
+   return -EINVAL;
+   if ((vma->vm_flags & VM_SHARED) == 0)
+   return -EINVAL;
+   if (index >= vdev->num_regions)
+   return -EINVAL;
+   if (vma->vm_start & ~PAGE_MASK)
+   return -EINVAL;
+   if (vma->vm_end & ~PAGE_MASK)
+   return -EINVAL;
+
+   region = vdev->region[index];
+
+   req_len = vma->vm_end - vma->vm_start;
+   pgoff = vma->vm_pgoff &
+   ((1U << (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+   req_start = pgoff << PAGE_SHIFT;
+
+   if (region.size < PAGE_SIZE || req_start + req_len > region.size)
+   return -EINVAL;
+
+   vma->vm_private_data = vdev;
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
+
+   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+  req_len, vma->vm_page_prot);
 }
 
 static const struct vfio_device_ops vfio_platform_ops = {
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v5 11/11] VFIO_PLATFORM: Support for maskable and automasked interrupts

2014-04-28 Thread Antonios Motakis
Adds support to mask interrupts, and also for automasked interrupts.
Level sensitive interrupts are exposed as automasked interrupts and
are masked and disabled automatically when they fire.

Signed-off-by: Antonios Motakis 
---
 drivers/vfio/platform/vfio_platform_irq.c | 117 --
 drivers/vfio/platform/vfio_platform_private.h |   2 +
 2 files changed, 113 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
index 433edc1..e38982f 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -52,9 +52,16 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev)
struct vfio_platform_irq irq;
int hwirq = platform_get_irq(vdev->pdev, i);
 
-   irq.flags = VFIO_IRQ_INFO_EVENTFD;
+   spin_lock_init(&irq.lock);
+
+   irq.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_MASKABLE;
+
+   if (irq_get_trigger_type(hwirq) & IRQ_TYPE_LEVEL_MASK)
+   irq.flags |= VFIO_IRQ_INFO_AUTOMASKED;
+
irq.count = 1;
irq.hwirq = hwirq;
+   irq.masked = false;
 
vdev->irq[i] = irq;
}
@@ -66,19 +73,39 @@ void vfio_platform_irq_cleanup(struct vfio_platform_device 
*vdev)
 {
int i;
 
-   for (i = 0; i < vdev->num_irqs; i++)
+   for (i = 0; i < vdev->num_irqs; i++) {
vfio_set_trigger(vdev, i, -1);
 
+   if (vdev->irq[i].masked)
+   enable_irq(vdev->irq[i].hwirq);
+   }
+
kfree(vdev->irq);
 }
 
 static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
 {
-   struct eventfd_ctx *trigger = dev_id;
+   struct vfio_platform_irq *irq_ctx = dev_id;
+   unsigned long flags;
+   int ret = IRQ_NONE;
+
+   spin_lock_irqsave(&irq_ctx->lock, flags);
+
+   if (!irq_ctx->masked) {
+   ret = IRQ_HANDLED;
+
+   if (irq_ctx->flags & VFIO_IRQ_INFO_AUTOMASKED) {
+   disable_irq_nosync(irq_ctx->hwirq);
+   irq_ctx->masked = true;
+   }
+   }
 
-   eventfd_signal(trigger, 1);
+   spin_unlock_irqrestore(&irq_ctx->lock, flags);
 
-   return IRQ_HANDLED;
+   if (ret == IRQ_HANDLED)
+   eventfd_signal(irq_ctx->trigger, 1);
+
+   return ret;
 }
 
 static int vfio_set_trigger(struct vfio_platform_device *vdev,
@@ -159,6 +186,82 @@ static int vfio_platform_set_irq_trigger(struct 
vfio_platform_device *vdev,
return -EFAULT;
 }
 
+static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
+   unsigned index, unsigned start,
+   unsigned count, uint32_t flags, void *data)
+{
+   uint8_t arr;
+
+   if (start != 0 || count != 1)
+   return -EINVAL;
+
+   switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
+   case VFIO_IRQ_SET_DATA_BOOL:
+   if (copy_from_user(&arr, data, sizeof(uint8_t)))
+   return -EFAULT;
+
+   if (arr != 0x1)
+   return -EINVAL;
+
+   case VFIO_IRQ_SET_DATA_NONE:
+
+   spin_lock_irq(&vdev->irq[index].lock);
+
+   if (vdev->irq[index].masked) {
+   enable_irq(vdev->irq[index].hwirq);
+   vdev->irq[index].masked = false;
+   }
+
+   spin_unlock_irq(&vdev->irq[index].lock);
+
+   return 0;
+
+   case VFIO_IRQ_SET_DATA_EVENTFD: /* XXX not implemented yet */
+   default:
+   return -ENOTTY;
+   }
+
+   return 0;
+}
+
+static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
+   unsigned index, unsigned start,
+   unsigned count, uint32_t flags, void *data)
+{
+   uint8_t arr;
+
+   if (start != 0 || count != 1)
+   return -EINVAL;
+
+   switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
+   case VFIO_IRQ_SET_DATA_BOOL:
+   if (copy_from_user(&arr, data, sizeof(uint8_t)))
+   return -EFAULT;
+
+   if (arr != 0x1)
+   return -EINVAL;
+
+   case VFIO_IRQ_SET_DATA_NONE:
+
+   spin_lock_irq(&vdev->irq[index].lock);
+
+   if (!vdev->irq[index].masked) {
+   disable_irq(vdev->irq[index].hwirq);
+   vdev->irq[index].masked = true;
+   }
+
+   spin_unlock_irq(&vdev->irq[index].lock);
+
+   return 0;
+
+   case VFIO_IRQ_SET_DATA_EVENTFD: /* XXX not implemented yet */
+   default:
+   return -ENOTTY;
+   }
+
+   return 0;
+}
+
 int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
 uint32_t flags, u

[RFC PATCH v5 00/11] VFIO support for platform devices

2014-04-28 Thread Antonios Motakis
This patch series aims to implement VFIO support for platform devices that
reside behind an IOMMU. Examples include devices behind an ARM SMMU or a
Samsung Exynos System MMU.

This v5 of the VFIO_PLATFORM patch series includes a number of fixes and
improvements over the features already implemented in the series. The next
version v6 of the VFIO_PLATFORM patch series is planned to introduce more
complete functionality with more comprehensive eventfd support.

The API used is based on the existing VFIO API that is also used with PCI
devices. Only devices that include a basic set of IRQs and memory regions are
targeted; devices with complex relationships with other devices on a device
tree are not taken into account at this stage.

A copy with all the dependencies applied can be cloned from branch
vfio-platform-v5 at g...@github.com:virtualopensystems/linux-kvm-arm.git

For those who want to apply manually, these patches are based on Alex
Williamson's 'next' branch available from https://github.com/awilliam/linux-vfio
which includes the multi domain support patches for VFIO_IOMMU_TYPE1.
Kim Phillips' driver_override patch for platform bus devices is also
included here for convenience.

The following IOCTLs have been found to be working on FastModels with an
ARM SMMU (MMU400). Testing was based on the ARM PL330 DMA Controller featured
on those models.
 - VFIO_GET_API_VERSION
 - VFIO_CHECK_EXTENSION
 - VFIO_GROUP_GET_STATUS
 - VFIO_GROUP_SET_CONTAINER
 - VFIO_SET_IOMMU
 - VFIO_IOMMU_GET_INFO
 - VFIO_IOMMU_MAP_DMA
 For this ioctl specifically, a new flag has been added:
 VFIO_DMA_MAP_FLAG_EXEC. This flag is taken into account on systems with
 an ARM SMMU. The availability of this flag is exposed via the capability
 VFIO_IOMMU_PROT_EXEC.

The VFIO platform driver proposed here implements the following:
 - VFIO_GROUP_GET_DEVICE_FD
 - VFIO_DEVICE_GET_INFO
 - VFIO_DEVICE_GET_REGION_INFO
 - VFIO_DEVICE_GET_IRQ_INFO
 - VFIO_DEVICE_SET_IRQS
 IRQs are implemented partially using this ioctl. Handling incoming
 interrupts with an eventfd is supported, as is masking and unmasking.
 Level sensitive interrupts are automasked. What is not implemented is
 masking/unmasking via eventfd. More comprehensive support for eventfd
 functionality is planned for v6.

In addition, the VFIO platform driver implements the following through
the VFIO device file descriptor:
 - MMAPing memory regions to the virtual address space of the VFIO user.
 - Read / write of memory regions directly through the file descriptor.

What still needs to be done, includes:
 - Eventfd for masking/unmasking (IRQFD and IOEVENTFD from the KVM side)
 - Extend the driver and API for device tree metadata
 - Support ARM AMBA devices natively
 - Device specific functionality (e.g. VFIO_DEVICE_RESET)
 - IOMMUs with nested page tables (Stage 1 & 2 translation on ARM SMMUs)

Changes since v4:
 - Use static offsets for each region in the VFIO device fd
 - Include patch in the series for the ARM SMMU to expose IOMMU_EXEC
   availability via IOMMU_CAP_DMA_EXEC
 - Rebased on VFIO multi domain support:
   - IOMMU_EXEC is now available if at least one IOMMU in the container
 supports it
   - Expose IOMMU_EXEC if available via the capability VFIO_IOMMU_PROT_EXEC
 - Some bug fixes
Changes since v3:
 - Use Kim Phillips' driver_probe_device()
Changes since v2:
 - Fixed Read/Write and MMAP on device regions
 - Removed dependency on Device Tree
 - Interrupts support
 - Interrupt masking/unmasking
 - Automask level sensitive interrupts
 - Introduced VFIO_DMA_MAP_FLAG_EXEC
 - Code clean ups


Antonios Motakis (10):
  ARM SMMU: Add capability IOMMU_CAP_DMA_EXEC
  VFIO_IOMMU_TYPE1 for platform bus devices on ARM
  VFIO_IOMMU_TYPE1: Introduce the VFIO_DMA_MAP_FLAG_EXEC flag
  VFIO_PLATFORM: Initial skeleton of VFIO support for platform devices
  VFIO_PLATFORM: Return info for device and its memory mapped IO regions
  VFIO_PLATFORM: Read and write support for the device fd
  VFIO_PLATFORM: Support MMAP of MMIO regions
  VFIO_PLATFORM: Return IRQ info
  VFIO_PLATFORM: Initial interrupts support
  VFIO_PLATFORM: Support for maskable and automasked interrupts

Kim Phillips (1):
  driver core: platform: add device binding path 'driver_override'

 Documentation/ABI/testing/sysfs-bus-platform  |  20 ++
 drivers/base/platform.c   |  46 +++
 drivers/iommu/arm-smmu.c  |   2 +
 drivers/vfio/Kconfig  |   3 +-
 drivers/vfio/Makefile |   1 +
 drivers/vfio/platform/Kconfig |   9 +
 drivers/vfio/platform/Makefile|   4 +
 drivers/vfio/platform/vfio_platform.c | 436 ++
 drivers/vfio/platform/vfio_platform_irq.c | 289 +
 drivers/vfio/platform/vfio_platform_private.h |  59 
 drivers/vfio/vfio_iommu_type1.c   |  39 ++-
 include/linux/iommu.h   

Re: [patch 2/2] target-i386: block migration and savevm if invariant tsc is exposed

2014-04-28 Thread Eduardo Habkost
On Fri, Apr 25, 2014 at 11:08:00PM +0200, Paolo Bonzini wrote:
> Il 25/04/2014 01:18, Eduardo Habkost ha scritto:
> >On Fri, Apr 25, 2014 at 12:57:48AM +0200, Paolo Bonzini wrote:
> >>Il 24/04/2014 22:57, Eduardo Habkost ha scritto:
> >>>If that didn't break other use cases, I would agree.
> >>>
> >>>But "-cpu host" today covers two use cases: 1) enabling everything that
> >>>can be enabled, even if it breaks migration; 2) enabling all stuff that
> >>>can be safely enabled without breaking migration.
> >>
> >>What does it enable *now* that breaks migration?
> >
> >Every single feature it enables can break it. It breaks if you upgrade
> >to a QEMU version with new feature words. It breaks if you upgrade to a
> >kernel which supports new features.
> 
> Yes, but it doesn't break it if you have the same processor, QEMU and
> kernel versions; this is not the case for invtsc (at least without
> improvements that aren't there yet).
> 
> >A feature that doesn't let you upgrade the kernel isn't a feature I
> >expect users to be relying upon.
> 
> It lets you upgrade the kernel as long as you do it for all machines.
> Whether this is acceptable depends on what you're using virt for.
> 
> It can be fine for "cattle" VMs that you can throw away at any time.
> Though people who buy into the cattle vs. pet distinction will tell
> you that you don't migrate cattle and this would make this discussion
> moot.
> 
> It can be fine for personal VMs (e.g. Windows VMs) that people use
> for development and will likely throw away at the end of each working
> day. Though for these VMs you also won't bother with migration, doing
> maintenance at night is easier (and if you need "-cpu host" you're
> probably doing other things such as GPU assignment that prevents
> migration).
> 
> So I couldn't indeed think of uses of "-cpu host" together with
> migration.  But if you're sure there is none, block it in QEMU.
> There's no reason why this should be specific to libvirt.

It's not that there are no useful use cases, it is that we simply can't
guarantee it will work because (by design) "-cpu host" enables features
QEMU doesn't know about (so it doesn't know if migration is safe or
not). And that's the main use case for "-cpu host".


> 
> >>Or if you plan ahead.  With additional logic even invariant TSC in
> >>principle can be made to work across migration if the host clocks are
> >>synchronized well enough (PTP accuracy is in the 100-1000 TSC ticks
> >>range).
> >
> >Yes, it is possible in the future. But we never planned for it, so "-cpu
> >host" never supported migration.
> 
> If it wasn't blocked, it was supported.  You can change this, but
> this implies documenting it in release notes, and perhaps warning for
> a couple of releases to give non-libvirt users a chance to tell us
> what they're doing.
> 
> >We may try to make a reliable implementation of use case (2) some day,
> >yes. But the choice I see right now is between trying not break a
> >feature that was never declared to exist, or breaking an existing
> >interface that is required to solve existing bugs between libvirt and
> >QEMU.
> 
> I'm not sure I follow, what existing interface would be broken and how?

Management can use "-cpu host" and the "feature-words" QOM property to
check and report which features are really supported by the host, before
trying to use them. That means management will think "invtsc" is
completely unavailable if it is not present on "-cpu host" by default.

But there's also another point: management may _want_ invtsc to be
unset, because it may expect a feature to be available only if it can be
safely migrated.

So, what about doing the following on QEMU 2.1:

 * "-cpu host,migratable=yes":
   * No invtsc
   * Migration not blocked
   * No feature flag unknown to QEMU will be enabled
 * "-cpu host,migratable=no":
   * invtsc enabled
   * Unknown-to-QEMU features enabled
   * Migration blocked
 * "-cpu host":
   * Same behavior as "-cpu host,migratable=yes"
 * Release notes indicating that "-cpu host,migratable=no" is
   required if the user doesn't care about migration and wants new
   (unknown to QEMU) features to be available

I was going to propose making "migratable=no" the default after a few
releases, as I expect the majority of existing users of "-cpu host" to
not care about migration. But I am not sure, because there's less harm
in _not_ getting new bleeding edge features enabled, and those users
(and management software) can start using "migratable=no" if they really
want the new unmigratable features.

-- 
Eduardo


Re: [PATCH 14/21] MIPS: KVM: Add nanosecond count bias KVM register

2014-04-28 Thread Paolo Bonzini

Il 28/04/2014 17:17, James Hogan ha scritto:

>> If it is possible and not too hairy to use a raw value in userspace
>> (together with KVM_REG_MIPS_COUNT_HZ), please do it---my suggestions were
>> just that, a suggestion.  Otherwise, the patch looks good.
>
> Do you mean expose the raw internal offset to userland instead of the
> nanosecond one? Yeah, it should be possible and slightly simpler for both
> kernel and QEMU actually.


Yes, when I reviewed the QEMU patches I missed this consequence of 
exposing the HZ.



> QEMU could then store that value (or the Count register) straight into
> env->CP0_Count (depending on Cause.DC), then hw/mips/cputimer.c would
> pretty much continue to work accurately. cputimer.c is only really
> made use of by TCG at the moment though (reading/writing
> count/compare/Cause.DC), but it still makes sense to be consistent.


Yup.  cputimer.c would just use a HZ value stored in CPUMIPSState 
(TIMER_FREQ for TCG) instead of hardcoding TIMER_FREQ I guess.


Paolo


Re: [PATCH 1/2] KVM: async_pf: kill the unnecessary use_mm/unuse_mm async_pf_execute()

2014-04-28 Thread Paolo Bonzini

Il 28/04/2014 16:06, Andrea Arcangeli ha scritto:

>> "task" is only used to increment task_struct->xxx_flt. I don't think
>> async_pf_execute() actually needs this (current is PF_WQ_WORKER after
>> all), but I didn't dare to do another change in the code I can hardly
>> understand.
>
> Considering the faults would be randomly distributed among the kworker
> threads, my preference would also be for NULL instead of current.
>
> ptrace and uprobes tend to be the only two places that look into
> another mm with gup; ptrace knows the exact pid that it is triggering
> the fault into, so it can also specify the correct task so the fault
> goes into the right task struct. uprobes uses NULL.


KVM knows the correct task (it was in current when kvm_create_vm was 
called), and doing accounting right would be nice.  But I agree that 
NULL is less misleading than a dummy current, and I applied patch 3 too.


Paolo


Re: [Qemu-devel] KVM call agenda for 2014-04-28

2014-04-28 Thread Markus Armbruster
Juan Quintela  writes:

> Hi
>
> Please, send any topic that you are interested in covering.

[...]

I'd like to have these things settled sooner than five minutes before
the scheduled hour, so here goes: call or no call?  Agenda?


Re: [PATCH 3/2] KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)

2014-04-28 Thread Paolo Bonzini

Il 28/04/2014 17:03, Oleg Nesterov ha scritto:

> async_pf_execute() passes tsk == current to gup(); this doesn't
> hurt but is unnecessary and misleading. "tsk" is only used to account
> the number of faults, and current is a random workqueue thread.
>
> Signed-off-by: Oleg Nesterov 
> ---
>  virt/kvm/async_pf.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 0ced4f3..62f4223 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -81,7 +81,7 @@ static void async_pf_execute(struct work_struct *work)
> 	might_sleep();
>
> 	down_read(&mm->mmap_sem);
> -	get_user_pages(current, mm, addr, 1, 1, 0, NULL, NULL);
> +	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> 	up_read(&mm->mmap_sem);
> 	kvm_async_page_present_sync(vcpu, apf);




Thanks, added a Suggested-by for Andrea and applied together with 1/2 to 
kvm/queue.


(Actually, I'm back from a longish vacation and I have a pretty large 
queue, so I haven't even compile tested these for now.  Once I get round 
to at least smoke-test them, I'll really push to kvm/queue).


Paolo


[PATCH kvm-unit-tests 00/18] lib/x86 TLC, extending tests for both 32- and 64-bits, and SMAP tests

2014-04-28 Thread Paolo Bonzini
This series is structured as follows:

- patches 1 to 8 cleanup lib/x86/, remove gratuitous differences
  between 32-bit and 64-bit, and unify code that is duplicated in
  various ways

- patches 9 to 12 port some eventinj tests so that they run as
  64-bit hosts too

- patches 13 to 15 are cleanups

- patches 16 to 18 add a testcase for SMAP (tested with QEMU by me
  and, by the patch submitter, with KVM)

Paolo Bonzini (18):
  libcflat: move stringification trick to a common place
  x86: move size #defines to processor.h
  x86: desc: change set_gdt_entry argument to selector
  x86: desc: move idt_entry_t and gdt_entry_t to header
  x86: desc: reuse cstart.S GDT and TSS
  x86: taskswitch: use desc library
  x86: unify GDT format between 32-bit and 64-bit
  x86: move CR0 and CR4 constants to processor.h
  x86: desc: support ISTs for alternate stacks in 64-bit mode
  x86: eventinj: fix inline assembly and make it more robust to new compilers
  x86: eventinj: enable NP test in 64-bit mode
  x86: eventinj: port consecutive exception tests to x86-64
  x86: xsave: use cpuid functions from processor.h
  x86: svm: rename get_pte function
  x86: vm: remove dead code
  x86: vm: export get_pte and return a pointer to it
  x86: vm: mark intermediate PTEs as user-accessible
  x86: smap: new testcase

 config-x86-common.mak |   4 +-
 lib/libcflat.h|   3 +
 lib/x86/desc.c| 166 +-
 lib/x86/desc.h|  80 
 lib/x86/isr.c |   6 --
 lib/x86/processor.h   |  38 +++-
 lib/x86/vm.c  |  60 +++---
 lib/x86/vm.h  |  42 +++--
 x86/asyncpf.c |   1 -
 x86/cstart.S  |  16 +
 x86/cstart64.S|  17 --
 x86/eventinj.c|  87 --
 x86/pcid.c|   7 ---
 x86/smap.c| 156 +++
 x86/svm.c |  18 +++---
 x86/taskswitch.c  | 134 +++-
 x86/taskswitch2.c |  19 +++---
 x86/vmexit.c  |   6 --
 x86/vmx.c |  32 +-
 x86/xsave.c   |  58 ++
 20 files changed, 485 insertions(+), 465 deletions(-)
 create mode 100644 x86/smap.c

-- 
1.8.3.1



[PATCH kvm-unit-tests 01/18] libcflat: move stringification trick to a common place

2014-04-28 Thread Paolo Bonzini
Rename "str" to "xxstr" to avoid a conflict with lib/x86/processor.h.

Signed-off-by: Paolo Bonzini 
---
 lib/libcflat.h| 3 +++
 x86/taskswitch2.c | 3 ---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index f734fde..3dc3788 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -22,6 +22,9 @@
 
 #include 
 
+#define xstr(s) xxstr(s)
+#define xxstr(s) #s
+
 typedef unsigned char u8;
 typedef signed char s8;
 typedef unsigned short u16;
diff --git a/x86/taskswitch2.c b/x86/taskswitch2.c
index 3c8418e..08bcce9 100644
--- a/x86/taskswitch2.c
+++ b/x86/taskswitch2.c
@@ -9,9 +9,6 @@
 #define MAIN_TSS_INDEX (FREE_GDT_INDEX + 0)
 #define VM86_TSS_INDEX (FREE_GDT_INDEX + 1)
 
-#define xstr(s) str(s)
-#define str(s) #s
-
 static volatile int test_count;
 static volatile unsigned int test_divider;
 
-- 
1.8.3.1




[PATCH kvm-unit-tests 10/18] x86: eventinj: fix inline assembly and make it more robust to new compilers

2014-04-28 Thread Paolo Bonzini
Using assembler trampolines instead of && labels fixes the NMI IRET test.

There is another bug, however.  Returns to the same privilege level do
not pop SS:RSP on 32-bit, so the nested NMI underflowed the stack at "s".
This is fixed in the new code.  The new code doesn't set up SS:RSP and
instead leaves some space for the nested NMI handler on the alternate
stack.  The old stack pointer is kept and restored when the nested
handler returns.

Signed-off-by: Paolo Bonzini 
---
 x86/eventinj.c | 64 ++
 1 file changed, 42 insertions(+), 22 deletions(-)

diff --git a/x86/eventinj.c b/x86/eventinj.c
index 2124bdf..1df3a43 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -59,17 +59,25 @@ static volatile int test_count;
 ulong stack_phys;
 void *stack_va;
 
-static void pf_tss(void)
+void do_pf_tss(void)
 {
-start:
printf("PF running\n");
install_pte(phys_to_virt(read_cr3()), 1, stack_va,
stack_phys | PTE_PRESENT | PTE_WRITE, 0);
invlpg(stack_va);
-   asm volatile ("iret");
-   goto start;
 }
 
+extern void pf_tss(void);
+
+asm (
+"pf_tss: \n\t"
+"call do_pf_tss \n\t"
+"add $"S", %"R "sp\n\t"// discard error code
+"iret"W" \n\t"
+"jmp pf_tss\n\t"
+);
+
+
 static void of_isr(struct ex_regs *r)
 {
printf("OF isr running\n");
@@ -114,39 +122,51 @@ static void nmi_isr(struct ex_regs *r)
printf("After nested NMI to self\n");
 }
 
-unsigned long after_iret_addr;
+unsigned long *iret_stack;
 
 static void nested_nmi_iret_isr(struct ex_regs *r)
 {
printf("Nested NMI isr running rip=%x\n", r->rip);
 
-   if (r->rip == after_iret_addr)
+   if (r->rip == iret_stack[-3])
test_count++;
 }
+
+extern void do_iret(ulong phys_stack, void *virt_stack);
+
+// Return to same privilege level won't pop SS or SP, so
+// save it in RDX while we run on the nested stack
+
+asm("do_iret:"
+#ifdef __x86_64__
+   "mov %rdi, %rax \n\t"   // phys_stack
+   "mov %rsi, %rdx \n\t"   // virt_stack
+#else
+   "mov 4(%esp), %eax \n\t"// phys_stack
+   "mov 8(%esp), %edx \n\t"// virt_stack
+#endif
+   "xchg %"R "dx, %"R "sp \n\t"// point to new stack
+   "pushf"W" \n\t"
+   "mov %cs, %ecx \n\t"
+   "push"W" %"R "cx \n\t"
+   "push"W" $1f \n\t"
+   "outl %eax, $0xe4 \n\t" // flush page
+   "iret"W" \n\t"
+   "1: xchg %"R "dx, %"R "sp \n\t" // point to old stack
+   "ret\n\t"
+   );
+
 static void nmi_iret_isr(struct ex_regs *r)
 {
unsigned long *s = alloc_page();
test_count++;
-   printf("NMI isr running %p stack %p\n", &&after_iret, s);
+   printf("NMI isr running stack %p\n", s);
handle_exception(2, nested_nmi_iret_isr);
printf("Sending nested NMI to self\n");
apic_self_nmi();
printf("After nested NMI to self\n");
-   s[4] = read_ss();
-   s[3] = 0; /* rsp */
-   s[2] = read_rflags();
-   s[1] = read_cs();
-   s[0] = after_iret_addr = (unsigned long)&&after_iret;
-   asm ("mov %%" R "sp, %0\n\t"
-"mov %1, %%" R "sp\n\t"
-"outl %2, $0xe4\n\t" /* flush stack page */
-#ifdef __x86_64__
-"iretq\n\t"
-#else
-"iretl\n\t"
-#endif
-: "=m"(s[3]) : "rm"(&s[0]), "a"((unsigned int)virt_to_phys(s)) : "memory");
-after_iret:
+   iret_stack = &s[128];
+   do_iret(virt_to_phys(s), iret_stack);
printf("After iret\n");
 }
 
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 05/18] x86: desc: reuse cstart.S GDT and TSS

2014-04-28 Thread Paolo Bonzini
There is no particular reason to use a specific TSS in tests that
use task-switching.  In fact, in many cases the tests just want
a separate interrupt stack and could run on 64-bit just as well
if the task-switching is abstracted.

As a first step, remove duplicate protected mode setup from desc.c's
users.  Just leave some spare selectors in cstart.S's GDT before
the CPUs' main TSS.  Then reuse CPU 0's TSS as TSS_MAIN.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c| 98 ++-
 lib/x86/desc.h| 13 +---
 x86/asyncpf.c |  1 -
 x86/cstart.S  | 16 +
 x86/eventinj.c|  1 -
 x86/taskswitch2.c |  1 -
 6 files changed, 49 insertions(+), 81 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index 32034cf..812295c 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -193,69 +193,28 @@ unsigned exception_error_code(void)
  * 0x00 - NULL descriptor
  * 0x08 - Code segment
  * 0x10 - Data segment
- * 0x18 - Not presend code segment
- * 0x20 - Primery task
- * 0x28 - Interrupt task
- *
- * 0x30 to 0x48 - Free to use for test cases
+ * 0x18 - Not present code segment
+ * 0x20 - Interrupt task
+ * 0x28 to 0x78 - Free to use for test cases
+ * 0x80 - Primary task (CPU 0)
  */
 
-static gdt_entry_t gdt[10];
-
 void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran)
 {
int num = sel >> 3;
 
/* Setup the descriptor base address */
-   gdt[num].base_low = (base & 0xFFFF);
-   gdt[num].base_middle = (base >> 16) & 0xFF;
-   gdt[num].base_high = (base >> 24) & 0xFF;
+   gdt32[num].base_low = (base & 0xFFFF);
+   gdt32[num].base_middle = (base >> 16) & 0xFF;
+   gdt32[num].base_high = (base >> 24) & 0xFF;
 
 /* Setup the descriptor limits */
-   gdt[num].limit_low = (limit & 0xFFFF);
-   gdt[num].granularity = ((limit >> 16) & 0x0F);
+   gdt32[num].limit_low = (limit & 0xFFFF);
+   gdt32[num].granularity = ((limit >> 16) & 0x0F);
 
/* Finally, set up the granularity and access flags */
-   gdt[num].granularity |= (gran & 0xF0);
-   gdt[num].access = access;
-}
-
-void setup_gdt(void)
-{
-   struct descriptor_table_ptr gp;
-   /* Setup the GDT pointer and limit */
-   gp.limit = sizeof(gdt) - 1;
-   gp.base = (ulong)&gdt;
-
-   memset(gdt, 0, sizeof(gdt));
-
-   /* Our NULL descriptor */
-   set_gdt_entry(0, 0, 0, 0, 0);
-
-   /* The second entry is our Code Segment. The base address
-*  is 0, the limit is 4GBytes, it uses 4KByte granularity,
-*  uses 32-bit opcodes, and is a Code Segment descriptor. */
-   set_gdt_entry(KERNEL_CS, 0, 0xffffffff, 0x9A, 0xcf);
-
-   /* The third entry is our Data Segment. It's EXACTLY the
-*  same as our code segment, but the descriptor type in
-*  this entry's access byte says it's a Data Segment */
-   set_gdt_entry(KERNEL_DS, 0, 0xffffffff, 0x92, 0xcf);
-
-   /* Same as code register above but not present */
-   set_gdt_entry(NP_SEL, 0, 0xffffffff, 0x1A, 0xcf);
-
-
-   /* Flush out the old GDT and install the new changes! */
-   lgdt(&gp);
-
-   asm volatile ("mov %0, %%ds\n\t"
- "mov %0, %%es\n\t"
- "mov %0, %%fs\n\t"
- "mov %0, %%gs\n\t"
- "mov %0, %%ss\n\t"
- "jmp $" xstr(KERNEL_CS) ", $.Lflush2\n\t"
- ".Lflush2: "::"r"(0x10));
+   gdt32[num].granularity |= (gran & 0xF0);
+   gdt32[num].access = access;
 }
 
 void set_idt_task_gate(int vec, u16 sel)
@@ -276,46 +235,39 @@ void set_idt_task_gate(int vec, u16 sel)
  * 1 - interrupt task
  */
 
-static tss32_t tss[2];
-static char tss_stack[2][4096];
+static tss32_t tss_intr;
+static char tss_stack[4096];
 
 void setup_tss32(void)
 {
u16 desc_size = sizeof(tss32_t);
-   int i;
-
-   for (i = 0; i < 2; i++) {
-   tss[i].cr3 = read_cr3();
-   tss[i].ss0 = tss[i].ss1 = tss[i].ss2 = 0x10;
-   tss[i].esp = tss[i].esp0 = tss[i].esp1 = tss[i].esp2 =
-   (u32)tss_stack[i] + 4096;
-   tss[i].cs = KERNEL_CS;
-   tss[i].ds = tss[i].es = tss[i].fs = tss[i].gs =
-   tss[i].ss = KERNEL_DS;
-   tss[i].iomap_base = (u16)desc_size;
-   set_gdt_entry(TSS_MAIN + (i << 3), (u32)&tss[i],
-desc_size - 1, 0x89, 0x0f);
-   }
 
-   ltr(TSS_MAIN);
+   tss.cr3 = read_cr3();
+   tss_intr.cr3 = read_cr3();
+   tss_intr.ss0 = tss_intr.ss1 = tss_intr.ss2 = 0x10;
+   tss_intr.esp = tss_intr.esp0 = tss_intr.esp1 = tss_intr.esp2 =
+   (u32)tss_stack + 4096;
+   tss_intr.cs = 0x08;
+   tss_intr.ds = tss_intr.es = tss_intr.fs = tss_intr.gs = tss_intr.ss = 0x10;
+   tss_intr.iomap_base = (u16)desc_size;
+   set_gdt_entry(TSS_INTR, (u32)&tss_intr, desc_size - 1, 0x89, 0x

[PATCH kvm-unit-tests 04/18] x86: desc: move idt_entry_t and gdt_entry_t to header

2014-04-28 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c | 27 ---
 lib/x86/desc.h | 27 +++
 2 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index b8b5452..32034cf 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -2,33 +2,6 @@
 #include "desc.h"
 #include "processor.h"
 
-typedef struct {
-unsigned short offset0;
-unsigned short selector;
-unsigned short ist : 3;
-unsigned short : 5;
-unsigned short type : 4;
-unsigned short : 1;
-unsigned short dpl : 2;
-unsigned short p : 1;
-unsigned short offset1;
-#ifdef __x86_64__
-unsigned offset2;
-unsigned reserved;
-#endif
-} idt_entry_t;
-
-typedef struct {
-   u16 limit_low;
-   u16 base_low;
-   u8 base_middle;
-   u8 access;
-   u8 granularity;
-   u8 base_high;
-} gdt_entry_t;
-
-extern idt_entry_t boot_idt[256];
-
 void set_idt_entry(int vec, void *addr, int dpl)
 {
 idt_entry_t *e = &boot_idt[vec];
diff --git a/lib/x86/desc.h b/lib/x86/desc.h
index e3f1ea0..0614273 100644
--- a/lib/x86/desc.h
+++ b/lib/x86/desc.h
@@ -78,6 +78,33 @@ typedef struct {
 #define TSS_INTR 0x28
 #define FIRST_SPARE_SEL 0x30
 
+typedef struct {
+unsigned short offset0;
+unsigned short selector;
+unsigned short ist : 3;
+unsigned short : 5;
+unsigned short type : 4;
+unsigned short : 1;
+unsigned short dpl : 2;
+unsigned short p : 1;
+unsigned short offset1;
+#ifdef __x86_64__
+unsigned offset2;
+unsigned reserved;
+#endif
+} idt_entry_t;
+
+typedef struct {
+   u16 limit_low;
+   u16 base_low;
+   u8 base_middle;
+   u8 access;
+   u8 granularity;
+   u8 base_high;
+} gdt_entry_t;
+
+extern idt_entry_t boot_idt[256];
+
 unsigned exception_vector(void);
 unsigned exception_error_code(void);
 void set_idt_entry(int vec, void *addr, int dpl);
-- 
1.8.3.1




[PATCH kvm-unit-tests 08/18] x86: move CR0 and CR4 constants to processor.h

2014-04-28 Thread Paolo Bonzini
Move them together with the inline function that read/write the
control registers.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/processor.h | 14 ++
 lib/x86/vm.h| 12 
 x86/pcid.c  |  7 ---
 3 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index fabd480..9cc1112 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -14,6 +14,20 @@
 #  define S "4"
 #endif
 
+#define X86_CR0_PE     0x00000001
+#define X86_CR0_MP     0x00000002
+#define X86_CR0_TS     0x00000008
+#define X86_CR0_WP     0x00010000
+#define X86_CR0_PG     0x80000000
+#define X86_CR4_VMXE   0x00000001
+#define X86_CR4_TSD    0x00000004
+#define X86_CR4_DE     0x00000008
+#define X86_CR4_PSE    0x00000010
+#define X86_CR4_PAE    0x00000020
+#define X86_CR4_PCIDE  0x00020000
+
+#define X86_IA32_EFER  0xc0000080
+#define X86_EFER_LMA   (1UL << 8)
 
 struct descriptor_table_ptr {
 u16 limit;
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 03a9b4e..0b5b5c7 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -16,18 +16,6 @@
 #define PTE_USER    (1ull << 2)
 #define PTE_ADDR    (0xffffffffff000ull)
 
-#define X86_CR0_PE     0x00000001
-#define X86_CR0_MP     0x00000002
-#define X86_CR0_TS     0x00000008
-#define X86_CR0_WP     0x00010000
-#define X86_CR0_PG     0x80000000
-#define X86_CR4_VMXE   0x00000001
-#define X86_CR4_TSD    0x00000004
-#define X86_CR4_DE     0x00000008
-#define X86_CR4_PSE    0x00000010
-#define X86_CR4_PAE    0x00000020
-#define X86_CR4_PCIDE  0x00020000
-
 void setup_vm();
 
 void *vmalloc(unsigned long size);
diff --git a/x86/pcid.c b/x86/pcid.c
index 45adfd5..164e9a1 100644
--- a/x86/pcid.c
+++ b/x86/pcid.c
@@ -7,13 +7,6 @@
 #define X86_FEATURE_PCID   (1 << 17)
 #define X86_FEATURE_INVPCID(1 << 10)
 
-#define X86_CR0_PG (1 << 31)
-#define X86_CR3_PCID_MASK  0x00000fff
-#define X86_CR4_PCIDE  (1 << 17)
-
-#define X86_IA32_EFER  0xc0000080
-#define X86_EFER_LMA   (1UL << 8)
-
 struct invpcid_desc {
 unsigned long pcid : 12;
 unsigned long rsv  : 52;
-- 
1.8.3.1




[PATCH kvm-unit-tests 15/18] x86: vm: remove dead code

2014-04-28 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 lib/x86/vm.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 188bf57..93aea71 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -116,23 +116,6 @@ void install_page(unsigned long *cr3,
 }
 
 
-static inline void load_gdt(unsigned long *table, int nent)
-{
-struct descriptor_table_ptr descr;
-
-descr.limit = nent * 8 - 1;
-descr.base = (ulong)table;
-lgdt(&descr);
-}
-
-#define SEG_CS_32 8
-#define SEG_CS_64 16
-
-struct ljmp {
-void *ofs;
-unsigned short seg;
-};
-
 static void setup_mmu_range(unsigned long *cr3, unsigned long start,
unsigned long len)
 {
-- 
1.8.3.1




[PATCH kvm-unit-tests 14/18] x86: svm: rename get_pte function

2014-04-28 Thread Paolo Bonzini
We will make it public in vm.c with the next patch.

Signed-off-by: Paolo Bonzini 
---
 x86/svm.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 9d910ae..5cfd77d 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -95,7 +95,7 @@ static void setup_svm(void)
 pml4e[0] = ((u64)pdpe) | 0x27;
 }
 
-static u64 *get_pte(u64 address)
+static u64 *npt_get_pte(u64 address)
 {
 int i1, i2;
 
@@ -502,14 +502,14 @@ static void npt_nx_prepare(struct test *test)
 u64 *pte;
 
 vmcb_ident(test->vmcb);
-pte = get_pte((u64)null_test);
+pte = npt_get_pte((u64)null_test);
 
 *pte |= (1ULL << 63);
 }
 
 static bool npt_nx_check(struct test *test)
 {
-u64 *pte = get_pte((u64)null_test);
+u64 *pte = npt_get_pte((u64)null_test);
 
 *pte &= ~(1ULL << 63);
 
@@ -524,7 +524,7 @@ static void npt_us_prepare(struct test *test)
 u64 *pte;
 
 vmcb_ident(test->vmcb);
-pte = get_pte((u64)scratch_page);
+pte = npt_get_pte((u64)scratch_page);
 
 *pte &= ~(1ULL << 2);
 }
@@ -536,7 +536,7 @@ static void npt_us_test(struct test *test)
 
 static bool npt_us_check(struct test *test)
 {
-u64 *pte = get_pte((u64)scratch_page);
+u64 *pte = npt_get_pte((u64)scratch_page);
 
 *pte |= (1ULL << 2);
 
@@ -566,7 +566,7 @@ static void npt_rw_prepare(struct test *test)
 u64 *pte;
 
 vmcb_ident(test->vmcb);
-pte = get_pte(0x8);
+pte = npt_get_pte(0x8);
 
 *pte &= ~(1ULL << 1);
 }
@@ -580,7 +580,7 @@ static void npt_rw_test(struct test *test)
 
 static bool npt_rw_check(struct test *test)
 {
-u64 *pte = get_pte(0x8);
+u64 *pte = npt_get_pte(0x8);
 
 *pte |= (1ULL << 1);
 
@@ -594,14 +594,14 @@ static void npt_pfwalk_prepare(struct test *test)
 u64 *pte;
 
 vmcb_ident(test->vmcb);
-pte = get_pte(read_cr3());
+pte = npt_get_pte(read_cr3());
 
 *pte &= ~(1ULL << 1);
 }
 
 static bool npt_pfwalk_check(struct test *test)
 {
-u64 *pte = get_pte(read_cr3());
+u64 *pte = npt_get_pte(read_cr3());
 
 *pte |= (1ULL << 1);
 
-- 
1.8.3.1




[PATCH kvm-unit-tests 02/18] x86: move size #defines to processor.h

2014-04-28 Thread Paolo Bonzini
These are necessary in many testcases that include hand-written
assembly; otherwise they will only run for either 32- or 64-bit.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c  | 10 --
 lib/x86/isr.c   |  6 --
 lib/x86/processor.h | 11 +++
 x86/vmexit.c|  6 --
 4 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index f75ec1d..ac60686 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -101,16 +101,6 @@ void do_handle_exception(struct ex_regs *regs)
exit(7);
 }
 
-#ifdef __x86_64__
-#  define R "r"
-#  define W "q"
-#  define S "8"
-#else
-#  define R "e"
-#  define W "l"
-#  define S "4"
-#endif
-
 #define EX(NAME, N) extern char NAME##_fault;  \
asm (".pushsection .text \n\t"  \
 #NAME"_fault: \n\t"\
diff --git a/lib/x86/isr.c b/lib/x86/isr.c
index 9986d17..7dcd38a 100644
--- a/lib/x86/isr.c
+++ b/lib/x86/isr.c
@@ -3,12 +3,6 @@
 #include "vm.h"
 #include "desc.h"
 
-#ifdef __x86_64__
-#  define R "r"
-#else
-#  define R "e"
-#endif
-
 extern char isr_entry_point[];
 
 asm (
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index 29811d4..fabd480 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -4,6 +4,17 @@
 #include "libcflat.h"
 #include 
 
+#ifdef __x86_64__
+#  define R "r"
+#  define W "q"
+#  define S "8"
+#else
+#  define R "e"
+#  define W "l"
+#  define S "4"
+#endif
+
+
 struct descriptor_table_ptr {
 u16 limit;
 ulong base;
diff --git a/x86/vmexit.c b/x86/vmexit.c
index cc24738..3bd0c81 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -47,12 +47,6 @@ static unsigned int inl(unsigned short port)
 
 static int nr_cpus;
 
-#ifdef __x86_64__
-#  define R "r"
-#else
-#  define R "e"
-#endif
-
 static void cpuid_test(void)
 {
asm volatile ("push %%"R "bx; cpuid; pop %%"R "bx"
-- 
1.8.3.1




[PATCH kvm-unit-tests 06/18] x86: taskswitch: use desc library

2014-04-28 Thread Paolo Bonzini
The APIs in desc.c make it much simpler to understand what the test
is doing.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c   |   7 ++-
 lib/x86/desc.h   |   2 +
 x86/taskswitch.c | 134 +--
 3 files changed, 18 insertions(+), 125 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index 812295c..442c9a1 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -217,6 +217,11 @@ void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran)
gdt32[num].access = access;
 }
 
+void set_gdt_task_gate(u16 sel, u16 tss_sel)
+{
+set_gdt_entry(sel, tss_sel, 0, 0x85, 0); // task, present
+}
+
 void set_idt_task_gate(int vec, u16 sel)
 {
 idt_entry_t *e = &boot_idt[vec];
@@ -235,7 +240,7 @@ void set_idt_task_gate(int vec, u16 sel)
  * 1 - interrupt task
  */
 
-static tss32_t tss_intr;
+tss32_t tss_intr;
 static char tss_stack[4096];
 
 void setup_tss32(void)
diff --git a/lib/x86/desc.h b/lib/x86/desc.h
index e0af335..4689474 100644
--- a/lib/x86/desc.h
+++ b/lib/x86/desc.h
@@ -106,6 +106,7 @@ extern idt_entry_t boot_idt[256];
 #ifndef __x86_64__
 extern gdt_entry_t gdt32[];
 extern tss32_t tss;
+extern tss32_t tss_intr;
 #endif
 
 unsigned exception_vector(void);
@@ -113,6 +114,7 @@ unsigned exception_error_code(void);
 void set_idt_entry(int vec, void *addr, int dpl);
 void set_idt_sel(int vec, u16 sel);
 void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran);
+void set_gdt_task_gate(u16 tss_sel, u16 sel);
 void set_idt_task_gate(int vec, u16 sel);
 void set_intr_task_gate(int e, void *fn);
 void print_current_tss_info(void);
diff --git a/x86/taskswitch.c b/x86/taskswitch.c
index 8ed8a93..423f51e 100644
--- a/x86/taskswitch.c
+++ b/x86/taskswitch.c
@@ -6,152 +6,38 @@
  */
 
 #include "libcflat.h"
+#include "lib/x86/desc.h"
 
-#define FIRST_SPARE_SEL0x18
-
-struct exception_frame {
-   unsigned long error_code;
-   unsigned long ip;
-   unsigned long cs;
-   unsigned long flags;
-};
-
-struct tss32 {
-   unsigned short prev;
-   unsigned short res1;
-   unsigned long esp0;
-   unsigned short ss0;
-   unsigned short res2;
-   unsigned long esp1;
-   unsigned short ss1;
-   unsigned short res3;
-   unsigned long esp2;
-   unsigned short ss2;
-   unsigned short res4;
-   unsigned long cr3;
-   unsigned long eip;
-   unsigned long eflags;
-   unsigned long eax, ecx, edx, ebx, esp, ebp, esi, edi;
-   unsigned short es;
-   unsigned short res5;
-   unsigned short cs;
-   unsigned short res6;
-   unsigned short ss;
-   unsigned short res7;
-   unsigned short ds;
-   unsigned short res8;
-   unsigned short fs;
-   unsigned short res9;
-   unsigned short gs;
-   unsigned short res10;
-   unsigned short ldt;
-   unsigned short res11;
-   unsigned short t:1;
-   unsigned short res12:15;
-   unsigned short iomap_base;
-};
-
-static char main_stack[4096];
-static char fault_stack[4096];
-static struct tss32 main_tss;
-static struct tss32 fault_tss;
-
-static unsigned long long gdt[] __attribute__((aligned(16))) = {
-   0,
-   0x00cf9b00ull,
-   0x00cf9300ull,
-   0, 0,   /* TSS segments */
-   0,  /* task return gate */
-};
-
-static unsigned long long gdtr;
+#define TSS_RETURN (FIRST_SPARE_SEL)
 
 void fault_entry(void);
 
 static __attribute__((used, regparm(1))) void
 fault_handler(unsigned long error_code)
 {
-   unsigned short *desc;
-
-   printf("fault at %x:%x, prev task %x, error code %x\n",
-  main_tss.cs, main_tss.eip, fault_tss.prev, error_code);
+   print_current_tss_info();
+   printf("error code %x\n", error_code);
 
-   main_tss.eip += 2;
+   tss.eip += 2;
 
-   desc = (unsigned short *)&gdt[3];
-   desc[2] &= ~0x0200;
+   gdt32[TSS_MAIN / 8].access &= ~2;
 
-   desc = (unsigned short *)&gdt[5];
-   desc[0] = 0;
-   desc[1] = fault_tss.prev;
-   desc[2] = 0x8500;
-   desc[3] = 0;
+   set_gdt_task_gate(TSS_RETURN, tss_intr.prev);
 }
 
 asm (
"fault_entry:\n"
"   mov (%esp),%eax\n"
"   call fault_handler\n"
-   "   jmp $0x28, $0\n"
+   "   jmp $" xstr(TSS_RETURN) ", $0\n"
 );
 
-static void setup_tss(struct tss32 *tss, void *entry,
- void *stack_base, unsigned long stack_size)
-{
-   unsigned long cr3;
-   unsigned short cs, ds;
-
-   asm ("mov %%cr3,%0" : "=r" (cr3));
-   asm ("mov %%cs,%0" : "=r" (cs));
-   asm ("mov %%ds,%0" : "=r" (ds));
-
-   tss->ss0 = tss->ss1 = tss->ss2 = tss->ss = ds;
-   tss->esp0 = tss->esp1 = tss->esp2 = tss->esp =
-   (unsigned long)stack_base + stack_size;
-   tss->ds = tss->es = tss->fs = tss->gs = ds;
-   tss->cs = cs;
-   tss->eip = (unsigned long)entry;
-   tss->cr3 = cr3;
-}
-
-static 

[PATCH kvm-unit-tests 16/18] x86: vm: export get_pte and return a pointer to it

2014-04-28 Thread Paolo Bonzini
This lets us modify the flags for the PTE corresponding to a virtual address.
The SMAP testcase will use this to build supervisor-mode pages.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/vm.c | 41 +
 lib/x86/vm.h | 17 +
 2 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 93aea71..3e7ca97 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -54,11 +54,11 @@ static unsigned long end_of_memory;
#define PGDIR_MASK  1023
 #endif
 
-void install_pte(unsigned long *cr3,
-int pte_level,
-void *virt,
-unsigned long pte,
-unsigned long *pt_page)
+unsigned long *install_pte(unsigned long *cr3,
+  int pte_level,
+  void *virt,
+  unsigned long pte,
+  unsigned long *pt_page)
 {
 int level;
 unsigned long *pt = cr3;
@@ -79,9 +79,10 @@ void install_pte(unsigned long *cr3,
 }
    offset = ((unsigned long)virt >> ((level-1) * PGDIR_WIDTH + 12)) & PGDIR_MASK;
 pt[offset] = pte;
+return &pt[offset];
 }
 
-static unsigned long get_pte(unsigned long *cr3, void *virt)
+unsigned long *get_pte(unsigned long *cr3, void *virt)
 {
 int level;
 unsigned long *pt = cr3, pte;
@@ -91,28 +92,28 @@ static unsigned long get_pte(unsigned long *cr3, void *virt)
	offset = ((unsigned long)virt >> (((level-1) * PGDIR_WIDTH) + 12)) & PGDIR_MASK;
pte = pt[offset];
if (!(pte & PTE_PRESENT))
-   return 0;
+   return NULL;
if (level == 2 && (pte & PTE_PSE))
-   return pte;
+   return &pt[offset];
	pt = phys_to_virt(pte & 0xffffffffff000ull);
 }
    offset = ((unsigned long)virt >> (((level-1) * PGDIR_WIDTH) + 12)) & PGDIR_MASK;
-pte = pt[offset];
-return pte;
+return &pt[offset];
 }
 
-void install_large_page(unsigned long *cr3,
-  unsigned long phys,
-  void *virt)
+unsigned long *install_large_page(unsigned long *cr3,
+ unsigned long phys,
+ void *virt)
 {
-install_pte(cr3, 2, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
+return install_pte(cr3, 2, virt,
+  phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
 }
 
-void install_page(unsigned long *cr3,
-  unsigned long phys,
-  void *virt)
+unsigned long *install_page(unsigned long *cr3,
+   unsigned long phys,
+   void *virt)
 {
-install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
+return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
 }
 
 
@@ -194,7 +195,7 @@ void *vmalloc(unsigned long size)
 
 uint64_t virt_to_phys_cr3(void *mem)
 {
-return (get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR) + ((ulong)mem & (PAGE_SIZE - 1));
+return (*get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR) + ((ulong)mem & (PAGE_SIZE - 1));
 }
 
 void vfree(void *mem)
@@ -202,7 +203,7 @@ void vfree(void *mem)
 unsigned long size = ((unsigned long *)mem)[-1];
 
 while (size) {
-   free_page(phys_to_virt(get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR));
+   free_page(phys_to_virt(*get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR));
mem += PAGE_SIZE;
size -= PAGE_SIZE;
 }
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 0b5b5c7..bd73840 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -25,17 +25,18 @@ void *alloc_vpage(void);
 void *alloc_vpages(ulong nr);
 uint64_t virt_to_phys_cr3(void *mem);
 
-void install_pte(unsigned long *cr3,
-int pte_level,
-void *virt,
-unsigned long pte,
-unsigned long *pt_page);
+unsigned long *get_pte(unsigned long *cr3, void *virt);
+unsigned long *install_pte(unsigned long *cr3,
+   int pte_level,
+   void *virt,
+   unsigned long pte,
+   unsigned long *pt_page);
 
 void *alloc_page();
 
-void install_large_page(unsigned long *cr3,unsigned long phys,
-   void *virt);
-void install_page(unsigned long *cr3, unsigned long phys, void *virt);
+unsigned long *install_large_page(unsigned long *cr3,unsigned long phys,
+  void *virt);
+unsigned long *install_page(unsigned long *cr3, unsigned long phys, void *virt);
 
 static inline unsigned long virt_to_phys(const void *virt)
 {
-- 
1.8.3.1




[PATCH kvm-unit-tests 03/18] x86: desc: change set_gdt_entry argument to selector

2014-04-28 Thread Paolo Bonzini
This interface, already used in taskswitch.c, is a bit easier to use.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c| 20 +++-
 lib/x86/desc.h|  8 +---
 x86/taskswitch2.c | 15 +++
 3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index ac60686..b8b5452 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -228,10 +228,11 @@ unsigned exception_error_code(void)
  */
 
 static gdt_entry_t gdt[10];
-#define TSS_GDT_OFFSET 4
 
-void set_gdt_entry(int num, u32 base,  u32 limit, u8 access, u8 gran)
+void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran)
 {
+   int num = sel >> 3;
+
/* Setup the descriptor base address */
	gdt[num].base_low = (base & 0xFFFF);
gdt[num].base_middle = (base >> 16) & 0xFF;
@@ -261,15 +262,15 @@ void setup_gdt(void)
/* The second entry is our Code Segment. The base address
 *  is 0, the limit is 4GBytes, it uses 4KByte granularity,
 *  uses 32-bit opcodes, and is a Code Segment descriptor. */
-   set_gdt_entry(1, 0, 0xffffffff, 0x9A, 0xcf);
+   set_gdt_entry(KERNEL_CS, 0, 0xffffffff, 0x9A, 0xcf);
 
/* The third entry is our Data Segment. It's EXACTLY the
 *  same as our code segment, but the descriptor type in
 *  this entry's access byte says it's a Data Segment */
-   set_gdt_entry(2, 0, 0xffffffff, 0x92, 0xcf);
+   set_gdt_entry(KERNEL_DS, 0, 0xffffffff, 0x92, 0xcf);
 
/* Same as code register above but not present */
-   set_gdt_entry(3, 0, 0xffffffff, 0x1A, 0xcf);
+   set_gdt_entry(NP_SEL, 0, 0xffffffff, 0x1A, 0xcf);
 
 
/* Flush out the old GDT and install the new changes! */
@@ -280,7 +281,7 @@ void setup_gdt(void)
  "mov %0, %%fs\n\t"
  "mov %0, %%gs\n\t"
  "mov %0, %%ss\n\t"
- "jmp $0x08, $.Lflush2\n\t"
+ "jmp $" xstr(KERNEL_CS) ", $.Lflush2\n\t"
  ".Lflush2: "::"r"(0x10));
 }
 
@@ -315,10 +316,11 @@ void setup_tss32(void)
tss[i].ss0 = tss[i].ss1 = tss[i].ss2 = 0x10;
tss[i].esp = tss[i].esp0 = tss[i].esp1 = tss[i].esp2 =
(u32)tss_stack[i] + 4096;
-   tss[i].cs = 0x08;
-   tss[i].ds = tss[i].es = tss[i].fs = tss[i].gs = tss[i].ss = 0x10;
+   tss[i].cs = KERNEL_CS;
+   tss[i].ds = tss[i].es = tss[i].fs = tss[i].gs =
+   tss[i].ss = KERNEL_DS;
tss[i].iomap_base = (u16)desc_size;
-   set_gdt_entry(TSS_GDT_OFFSET + i, (u32)&tss[i],
+   set_gdt_entry(TSS_MAIN + (i << 3), (u32)&tss[i],
 desc_size - 1, 0x89, 0x0f);
}
 
diff --git a/lib/x86/desc.h b/lib/x86/desc.h
index b795aad..e3f1ea0 100644
--- a/lib/x86/desc.h
+++ b/lib/x86/desc.h
@@ -71,16 +71,18 @@ typedef struct {
 #define UD_VECTOR   6
 #define GP_VECTOR   13
 
+#define KERNEL_CS 0x08
+#define KERNEL_DS 0x10
+#define NP_SEL 0x18
 #define TSS_MAIN 0x20
 #define TSS_INTR 0x28
-
-#define NP_SEL 0x18
+#define FIRST_SPARE_SEL 0x30
 
 unsigned exception_vector(void);
 unsigned exception_error_code(void);
 void set_idt_entry(int vec, void *addr, int dpl);
 void set_idt_sel(int vec, u16 sel);
-void set_gdt_entry(int num, u32 base,  u32 limit, u8 access, u8 gran);
+void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran);
 void set_idt_task_gate(int vec, u16 sel);
 void set_intr_task_gate(int e, void *fn);
 void print_current_tss_info(void);
diff --git a/x86/taskswitch2.c b/x86/taskswitch2.c
index 08bcce9..de7e969 100644
--- a/x86/taskswitch2.c
+++ b/x86/taskswitch2.c
@@ -5,9 +5,8 @@
 #include "processor.h"
 #include "vm.h"
 
-#define FREE_GDT_INDEX 6
-#define MAIN_TSS_INDEX (FREE_GDT_INDEX + 0)
-#define VM86_TSS_INDEX (FREE_GDT_INDEX + 1)
+#define MAIN_TSS_SEL (FIRST_SPARE_SEL + 0)
+#define VM86_TSS_SEL (FIRST_SPARE_SEL + 8)
 
 static volatile int test_count;
 static volatile unsigned int test_divider;
@@ -217,15 +216,15 @@ void test_vm86_switch(void)
 vm86_start[1] = 0x0b;
 
 /* Main TSS */
-set_gdt_entry(MAIN_TSS_INDEX, (u32)&main_tss, sizeof(tss32_t) - 1, 0x89, 0);
-ltr(MAIN_TSS_INDEX << 3);
+set_gdt_entry(MAIN_TSS_SEL, (u32)&main_tss, sizeof(tss32_t) - 1, 0x89, 0);
+ltr(MAIN_TSS_SEL);
 main_tss = (tss32_t) {
-.prev   = VM86_TSS_INDEX << 3,
+.prev   = VM86_TSS_SEL,
 .cr3= read_cr3(),
 };
 
 /* VM86 TSS (marked as busy, so we can iret to it) */
-set_gdt_entry(VM86_TSS_INDEX, (u32)&vm86_tss, sizeof(tss32_t) - 1, 0x8b, 0);
+set_gdt_entry(VM86_TSS_SEL, (u32)&vm86_tss, sizeof(tss32_t) - 1, 0x8b, 0);
 vm86_tss = (tss32_t) {
 .eflags = 0x20002,
 .cr3= read_cr3(),
@@ -236,7 +235,7 @@ void test_vm86_switch(void)
 };
 
 /* Setup task gate to main TSS for #UD */
-set_idt_task_gate(6, MAIN_TSS_IN

[PATCH kvm-unit-tests 13/18] x86: xsave: use cpuid functions from processor.h

2014-04-28 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 x86/xsave.c | 58 ++
 1 file changed, 6 insertions(+), 52 deletions(-)

diff --git a/x86/xsave.c b/x86/xsave.c
index 057b0ff..cd2cdce 100644
--- a/x86/xsave.c
+++ b/x86/xsave.c
@@ -1,5 +1,6 @@
 #include "libcflat.h"
 #include "desc.h"
+#include "processor.h"
 
 #ifdef __x86_64__
 #define uint64_t unsigned long
@@ -7,42 +8,6 @@
 #define uint64_t unsigned long long
 #endif
 
-static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
-unsigned int *ecx, unsigned int *edx)
-{
-/* ecx is often an input as well as an output. */
-asm volatile("cpuid"
-: "=a" (*eax),
-"=b" (*ebx),
-"=c" (*ecx),
-"=d" (*edx)
-: "0" (*eax), "2" (*ecx));
-}
-
-/*
- * Generic CPUID function
- * clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx
- * resulting in stale register contents being returned.
- */
-void cpuid(unsigned int op,
-unsigned int *eax, unsigned int *ebx,
-unsigned int *ecx, unsigned int *edx)
-{
-*eax = op;
-*ecx = 0;
-__cpuid(eax, ebx, ecx, edx);
-}
-
-/* Some CPUID calls want 'count' to be placed in ecx */
-void cpuid_count(unsigned int op, int count,
-unsigned int *eax, unsigned int *ebx,
-unsigned int *ecx, unsigned int *edx)
-{
-*eax = op;
-*ecx = count;
-__cpuid(eax, ebx, ecx, edx);
-}
-
 int xgetbv_checking(u32 index, u64 *result)
 {
 u32 eax, edx;
@@ -68,13 +33,6 @@ int xsetbv_checking(u32 index, u64 value)
 return exception_vector();
 }
 
-unsigned long read_cr4(void)
-{
-unsigned long val;
-asm volatile("mov %%cr4,%0" : "=r" (val));
-return val;
-}
-
 int write_cr4_checking(unsigned long val)
 {
 asm volatile(ASM_TRY("1f")
@@ -87,20 +45,16 @@ int write_cr4_checking(unsigned long val)
 #define CPUID_1_ECX_OSXSAVE    (1 << 27)
 int check_cpuid_1_ecx(unsigned int bit)
 {
-unsigned int eax, ebx, ecx, edx;
-cpuid(1, &eax, &ebx, &ecx, &edx);
-if (ecx & bit)
-return 1;
-return 0;
+return (cpuid(1).c & bit) != 0;
 }
 
 uint64_t get_supported_xcr0(void)
 {
-unsigned int eax, ebx, ecx, edx;
-cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
+struct cpuid r;
+r = cpuid_indexed(0xd, 0);
 printf("eax %x, ebx %x, ecx %x, edx %x\n",
-eax, ebx, ecx, edx);
-return eax + ((u64)edx << 32);
+r.a, r.b, r.c, r.d);
+return r.a + ((u64)r.d << 32);
 }
 
 #define X86_CR4_OSXSAVE    0x00040000
-- 
1.8.3.1




[PATCH kvm-unit-tests 18/18] x86: smap: new testcase

2014-04-28 Thread Paolo Bonzini
Test various combinations of the AC bit and reading/writing into
user pages at CPL=0.

One notable missing test is implicit kernel reads and writes (e.g.
reading the IDT/GDT/LDT/TSS).  The interesting part of this is that
AC must be ignored in ring 3; the processor always behaves as if AC=0.
I skipped this because QEMU doesn't emulate this correctly, and because
right now there's no kvm-unit-tests infrastructure to run code in ring
3 at all.

Signed-off-by: Paolo Bonzini 
---
 config-x86-common.mak |   4 +-
 lib/x86/processor.h   |  13 -
 x86/smap.c| 156 ++
 3 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100644 x86/smap.c

diff --git a/config-x86-common.mak b/config-x86-common.mak
index aa5a439..93c9fee 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -37,7 +37,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
$(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
-   $(TEST_DIR)/init.flat
+   $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat
 
 ifdef API
 tests-common += api/api-sample
@@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 
 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
+$(TEST_DIR)/smap.elf: $(cstart.o) $(TEST_DIR)/smap.o
+
 $(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o $(TEST_DIR)/vmx_tests.o
 
 $(TEST_DIR)/debug.elf: $(cstart.o) $(TEST_DIR)/debug.o
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index 9cc1112..7fc1026 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -25,6 +25,7 @@
 #define X86_CR4_PSE0x0010
 #define X86_CR4_PAE0x0020
 #define X86_CR4_PCIDE  0x0002
+#define X86_CR4_SMAP   0x0020
 
 #define X86_IA32_EFER  0xc080
 #define X86_EFER_LMA   (1UL << 8)
@@ -39,6 +40,16 @@ static inline void barrier(void)
 asm volatile ("" : : : "memory");
 }
 
+static inline void clac(void)
+{
+asm volatile (".byte 0x0f, 0x01, 0xca" : : : "memory");
+}
+
+static inline void stac(void)
+{
+asm volatile (".byte 0x0f, 0x01, 0xcb" : : : "memory");
+}
+
 static inline u16 read_cs(void)
 {
 unsigned val;
@@ -330,7 +341,7 @@ static inline void irq_enable(void)
 asm volatile("sti");
 }
 
-static inline void invlpg(void *va)
+static inline void invlpg(volatile void *va)
 {
asm volatile("invlpg (%0)" ::"r" (va) : "memory");
 }
diff --git a/x86/smap.c b/x86/smap.c
new file mode 100644
index 000..d0b9e07
--- /dev/null
+++ b/x86/smap.c
@@ -0,0 +1,156 @@
+#include "libcflat.h"
+#include "lib/x86/desc.h"
+#include "lib/x86/processor.h"
+#include "lib/x86/vm.h"
+
+#define X86_FEATURE_SMAP   20
+#define X86_EFLAGS_AC  (1 << 18)
+
+volatile int pf_count = 0;
+volatile int save;
+volatile unsigned test;
+
+
+// When doing ring 3 tests, page fault handlers will always run on a
+// separate stack (the ring 0 stack).  Seems easier to use the alt_stack
+// mechanism for both ring 0 and ring 3.
+
+void do_pf_tss(unsigned long error_code)
+{
+   pf_count++;
+   save = test;
+
+#ifndef __x86_64__
+   tss.eflags |= X86_EFLAGS_AC;
+#endif
+}
+
+extern void pf_tss(void);
+asm ("pf_tss:\n"
+#ifdef __x86_64__
+// no task on x86_64, save/restore caller-save regs
+"push %rax; push %rcx; push %rdx; push %rsi; push %rdi\n"
+"push %r8; push %r9; push %r10; push %r11\n"
+   "mov 9*8(%rsp),%rsi\n"
+#endif
+   "call do_pf_tss\n"
+#ifdef __x86_64__
+"pop %r11; pop %r10; pop %r9; pop %r8\n"
+"pop %rdi; pop %rsi; pop %rdx; pop %rcx; pop %rax\n"
+#endif
+   "add $"S", %"R "sp\n"
+#ifdef __x86_64__
+   "orl $" xstr(X86_EFLAGS_AC) ", 2*"S"(%"R "sp)\n"  // set EFLAGS.AC and retry
+#endif
+"iret"W" \n\t"
+"jmp pf_tss\n\t");
+
+
+#define USER_BASE  (1 << 24)
+#define USER_VAR(v)(*((__typeof__(&(v))) (((unsigned long)&v) + USER_BASE)))
+
+static void init_test(int i)
+{
+   pf_count = 0;
+   if (i) {
+   invlpg(&test);
+   invlpg(&USER_VAR(test));
+   }
+}
+
+int main(int ac, char **av)
+{
+   unsigned long i;
+
+   if (!(cpuid_indexed(7, 0).b & (1 << X86_FEATURE_SMAP))) {
+   printf("SMAP not enabled, exiting\n");
+   exit(1);
+   }
+
+   setup_vm();
+   setup_alt_stack();
+   set_intr_alt_stack(14, pf_tss);
+
+   // Map first 16MB as supervisor pages
+   for (i = 0; i < USER_BASE; i += PAGE_SIZE) {
+   *get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= ~PTE_USER;
+   invlpg((void *)i);
+   }
+
+   // Present the same 16MB as user pages in the 16MB-32MB range
+   for (i = USER_BASE; i < 2 * USER_BASE; i += PAGE_SIZE) {
+   *get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= 
~USER_

[PATCH kvm-unit-tests 09/18] x86: desc: support ISTs for alternate stacks in 64-bit mode

2014-04-28 Thread Paolo Bonzini
Introduce a new API that replaces setup_tss32 and set_intr_task_gate
in tests that run in both modes.  This will enable three more tests in
eventinj to run in 64-bit mode.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c | 26 --
 lib/x86/desc.h | 34 ++
 x86/cstart64.S |  1 +
 x86/eventinj.c |  6 +++---
 4 files changed, 54 insertions(+), 13 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index ec1fda3..9a80f48 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -187,6 +187,8 @@ unsigned exception_error_code(void)
 return error_code;
 }
 
+static char intr_alt_stack[4096];
+
 #ifndef __x86_64__
 /*
  * GDT, with 6 entries:
@@ -243,7 +245,6 @@ void set_idt_task_gate(int vec, u16 sel)
  */
 
 tss32_t tss_intr;
-static char tss_stack[4096];
 
 void setup_tss32(void)
 {
@@ -253,7 +254,7 @@ void setup_tss32(void)
tss_intr.cr3 = read_cr3();
tss_intr.ss0 = tss_intr.ss1 = tss_intr.ss2 = 0x10;
tss_intr.esp = tss_intr.esp0 = tss_intr.esp1 = tss_intr.esp2 =
-   (u32)tss_stack + 4096;
+   (u32)intr_alt_stack + 4096;
tss_intr.cs = 0x08;
	tss_intr.ds = tss_intr.es = tss_intr.fs = tss_intr.gs = tss_intr.ss = 0x10;
tss_intr.iomap_base = (u16)desc_size;
@@ -266,6 +267,16 @@ void set_intr_task_gate(int e, void *fn)
set_idt_task_gate(e, TSS_INTR);
 }
 
+void setup_alt_stack(void)
+{
+   setup_tss32();
+}
+
+void set_intr_alt_stack(int e, void *fn)
+{
+   set_intr_task_gate(e, fn);
+}
+
 void print_current_tss_info(void)
 {
u16 tr = str();
@@ -276,6 +287,17 @@ void print_current_tss_info(void)
	printf("TR=%x (%s) Main TSS back link %x. Intr TSS back link %x\n",
   tr, tr ? "interrupt" : "main", tss.prev, tss_intr.prev);
 }
+#else
+void set_intr_alt_stack(int e, void *addr)
+{
+   set_idt_entry(e, addr, 0);
+   boot_idt[e].ist = 1;
+}
+
+void setup_alt_stack(void)
+{
+   tss.ist1 = (u64)intr_alt_stack + 4096;
+}
 #endif
 
 static bool exception;
diff --git a/lib/x86/desc.h b/lib/x86/desc.h
index cd41a74..33e721c 100644
--- a/lib/x86/desc.h
+++ b/lib/x86/desc.h
@@ -2,11 +2,7 @@
 #define __IDT_TEST__
 
 void setup_idt(void);
-#ifndef __x86_64__
-void setup_tss32(void);
-#else
-static inline void setup_tss32(void){}
-#endif
+void setup_alt_stack(void);
 
 struct ex_regs {
 unsigned long rax, rcx, rdx, rbx;
@@ -57,6 +53,24 @@ typedef struct {
u16 iomap_base;
 } tss32_t;
 
+typedef struct  __attribute__((packed)) {
+   u32 res1;
+   u64 rsp0;
+   u64 rsp1;
+   u64 rsp2;
+   u64 res2;
+   u64 ist1;
+   u64 ist2;
+   u64 ist3;
+   u64 ist4;
+   u64 ist5;
+   u64 ist6;
+   u64 ist7;
+   u64 res3;
+   u16 res4;
+   u16 iomap_base;
+} tss64_t;
+
 #define ASM_TRY(catch)  \
 "movl $0, %%gs:4 \n\t"  \
 ".pushsection .data.ex \n\t"\
@@ -109,6 +123,12 @@ extern idt_entry_t boot_idt[256];
 extern gdt_entry_t gdt32[];
 extern tss32_t tss;
 extern tss32_t tss_intr;
+void set_gdt_task_gate(u16 tss_sel, u16 sel);
+void set_idt_task_gate(int vec, u16 sel);
+void set_intr_task_gate(int vec, void *fn);
+void setup_tss32(void);
+#else
+extern tss64_t tss;
 #endif
 
 unsigned exception_vector(void);
@@ -116,9 +136,7 @@ unsigned exception_error_code(void);
 void set_idt_entry(int vec, void *addr, int dpl);
 void set_idt_sel(int vec, u16 sel);
 void set_gdt_entry(int sel, u32 base,  u32 limit, u8 access, u8 gran);
-void set_gdt_task_gate(u16 tss_sel, u16 sel);
-void set_idt_task_gate(int vec, u16 sel);
-void set_intr_task_gate(int e, void *fn);
+void set_intr_alt_stack(int e, void *fn);
 void print_current_tss_info(void);
 void handle_exception(u8 v, void (*func)(struct ex_regs *regs));
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 1a0c85e..7a1d79d 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -77,6 +77,7 @@ tss_descr:
 gdt64_end:
 
 i = 0
+.globl tss
 tss:
.rept max_cpus
.long 0
diff --git a/x86/eventinj.c b/x86/eventinj.c
index 900cfda..2124bdf 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -183,7 +183,7 @@ int main()
 
setup_vm();
setup_idt();
-   setup_tss32();
+   setup_alt_stack();
 
handle_irq(32, tirq0);
handle_irq(33, tirq1);
@@ -344,7 +344,7 @@ int main()
 
/* Generate DE and PF exceptions serially */
test_divider = 0;
-   set_intr_task_gate(14, pf_tss);
+   set_intr_alt_stack(14, pf_tss);
handle_exception(0, de_isr);
printf("Try to divide by 0\n");
/* install read only pte */
@@ -363,7 +363,7 @@ int main()
/* Generate NP and PF exceptions serially */
printf("Before NP test\n");
test_count = 0;
-   set_intr_task_gate(14, pf_tss);
+   set_intr_alt_stack(14, pf_tss);
handle_exception(11, np_isr);
set_idt_s

[PATCH kvm-unit-tests 12/18] x86: eventinj: port consecutive exception tests to x86-64

2014-04-28 Thread Paolo Bonzini
Use ISTs instead of tasks.

One big remaining difference between 32- and 64-bits is that tasks
on 32-bits take care of saving registers.  ISTs do not, only the
stack is affected.

For now, just save/restore caller-save registers in the assembly
wrapper, since there's just one test that uses ISTs.  Later we
can extend the existing APIs to register exception handlers, so
that they can use tasks or ISTs too.

Signed-off-by: Paolo Bonzini 
---
 x86/eventinj.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/x86/eventinj.c b/x86/eventinj.c
index 050bc8c..6547351 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -55,7 +55,6 @@ static void flush_idt_page()
 static volatile unsigned int test_divider;
 static volatile int test_count;
 
-#ifndef __x86_64__
 ulong stack_phys;
 void *stack_va;
 
@@ -69,15 +68,24 @@ void do_pf_tss(void)
 
 extern void pf_tss(void);
 
-asm (
-"pf_tss: \n\t"
+asm ("pf_tss: \n\t"
+#ifdef __x86_64__
+// no task on x86_64, save/restore caller-save regs
+"push %rax; push %rcx; push %rdx; push %rsi; push %rdi\n"
+"push %r8; push %r9; push %r10; push %r11\n"
+#endif
 "call do_pf_tss \n\t"
-"add $"S", %"R "sp\n\t"// discard error code
+#ifdef __x86_64__
+"pop %r11; pop %r10; pop %r9; pop %r8\n"
+"pop %rdi; pop %rsi; pop %rdx; pop %rcx; pop %rax\n"
+#endif
+"add $"S", %"R "sp\n\t"// discard error code
 "iret"W" \n\t"
 "jmp pf_tss\n\t"
 );
 
 
+#ifndef __x86_64__
 static void of_isr(struct ex_regs *r)
 {
printf("OF isr running\n");
@@ -356,7 +364,6 @@ int main()
irq_disable();
printf("After NMI to self\n");
report("NMI", test_count == 2);
-#ifndef __x86_64__
stack_phys = (ulong)virt_to_phys(alloc_page());
stack_va = alloc_vpage();
 
@@ -395,7 +402,6 @@ int main()
restore_stack();
printf("After int33\n");
report("NP PF exceptions", test_count == 2);
-#endif
 
pt = alloc_page();
cr3 = (void*)read_cr3();
-- 
1.8.3.1




[PATCH kvm-unit-tests 07/18] x86: unify GDT format between 32-bit and 64-bit

2014-04-28 Thread Paolo Bonzini
Except for the TSS, which is 16 bytes in 64-bit mode, we can use the same
structure and share the constants.  This will aid in porting tests
to 64-bit.

Multiple bitwidth and ring 3 selectors aren't used yet.  I couldn't
make up my mind about keeping vs. dropping them; in the end I kept the ring 3
selectors, which have a chance of being used for SMAP or paging unit tests.

With this change, vmx.c can start using desc.h's constants and those
in vm.h (why vm.h?) can be dropped.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/desc.c | 12 +++-
 lib/x86/desc.h |  6 --
 lib/x86/vm.h   | 13 -
 x86/cstart.S   |  6 +++---
 x86/cstart64.S | 16 
 x86/vmx.c  | 32 
 6 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/lib/x86/desc.c b/lib/x86/desc.c
index 442c9a1..ec1fda3 100644
--- a/lib/x86/desc.c
+++ b/lib/x86/desc.c
@@ -191,11 +191,13 @@ unsigned exception_error_code(void)
 /*
  * GDT, with 6 entries:
  * 0x00 - NULL descriptor
- * 0x08 - Code segment
- * 0x10 - Data segment
- * 0x18 - Not present code segment
- * 0x20 - Interrupt task
- * 0x28 to 0x78 - Free to use for test cases
+ * 0x08 - Code segment (ring 0)
+ * 0x10 - Data segment (ring 0)
+ * 0x18 - Not present code segment (ring 0)
+ * 0x20 - Code segment (ring 3)
+ * 0x28 - Data segment (ring 3)
+ * 0x30 - Interrupt task
+ * 0x38 to 0x78 - Free to use for test cases
  * 0x80 - Primary task (CPU 0)
  */
 
diff --git a/lib/x86/desc.h b/lib/x86/desc.h
index 4689474..cd41a74 100644
--- a/lib/x86/desc.h
+++ b/lib/x86/desc.h
@@ -72,8 +72,10 @@ typedef struct {
 #define KERNEL_CS 0x08
 #define KERNEL_DS 0x10
 #define NP_SEL 0x18
-#define TSS_INTR 0x20
-#define FIRST_SPARE_SEL 0x28
+#define USER_CS 0x20
+#define USER_DS 0x28
+#define TSS_INTR 0x30
+#define FIRST_SPARE_SEL 0x38
 #define TSS_MAIN 0x80
 
 typedef struct {
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 6e0ce2b..03a9b4e 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -28,19 +28,6 @@
 #define X86_CR4_PAE 0x0020
 #define X86_CR4_PCIDE  0x0002
 
-#ifdef __x86_64__
-#define SEL_NULL_DESC  0x0
-#define SEL_KERN_CODE_64   0x8
-#define SEL_KERN_DATA_64   0x10
-#define SEL_USER_CODE_64   0x18
-#define SEL_USER_DATA_64   0x20
-#define SEL_CODE_320x28
-#define SEL_DATA_320x30
-#define SEL_CODE_160x38
-#define SEL_DATA_160x40
-#define SEL_TSS_RUN0x48
-#endif
-
 void setup_vm();
 
 void *vmalloc(unsigned long size);
diff --git a/x86/cstart.S b/x86/cstart.S
index ffa0768..f9ba1fe 100644
--- a/x86/cstart.S
+++ b/x86/cstart.S
@@ -34,10 +34,10 @@ gdt32:
.quad 0x00cf9b00 // flat 32-bit code segment
.quad 0x00cf9300 // flat 32-bit data segment
.quad 0x00cf1b00 // flat 32-bit code segment, not present
+   .quad 0x00cffb00 // 64-bit code segment (user)
+   .quad 0x00cff300 // 64-bit data segment (user)
 
-   .quad 0  // 12 spare selectors
-   .quad 0
-   .quad 0
+   .quad 0  // 10 spare selectors
.quad 0
.quad 0
.quad 0
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 0fe76da..1a0c85e 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -54,12 +54,20 @@ gdt64:
.quad 0
.quad 0x00af9b00 // 64-bit code segment
.quad 0x00cf9300 // 64-bit data segment
+   .quad 0x00af1b00 // 64-bit code segment, not present
.quad 0x00affb00 // 64-bit code segment (user)
.quad 0x00cff300 // 64-bit data segment (user)
-   .quad 0x00cf9b00 // 32-bit code segment
-   .quad 0x00cf9200 // 32-bit code segment
-   .quad 0x008F9A00 // 16-bit code segment
-   .quad 0x008F9200 // 16-bit data segment
+
+   .quad 0  // 10 spare selectors
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
+   .quad 0
 
 tss_descr:
.rept max_cpus
diff --git a/x86/vmx.c b/x86/vmx.c
index 2278078..1d28c6f 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -346,16 +346,16 @@ static void init_vmcs_host(void)
vmcs_write(HOST_CR3, read_cr3());
vmcs_write(HOST_CR4, read_cr4());
vmcs_write(HOST_SYSENTER_EIP, (u64)(&entry_sysenter));
-   vmcs_write(HOST_SYSENTER_CS,  SEL_KERN_CODE_64);
+   vmcs_write(HOST_SYSENTER_CS,  KERNEL_CS);
 
/* 26.2.3 */
-   vmcs_write(HOST_SEL_CS, SEL_KERN_CODE_64);
-   vmcs_write(HOST_SEL_SS, SEL_KERN_DATA_64);
-   vmcs_write(HOST_SEL_DS, SEL_KERN_DATA_64);
-   vmcs_write(HOST_SEL_ES, SEL_KERN_DATA_64);
-   vmcs_write(HOST_SEL_FS, SEL_KERN_DATA_64);
-   vmcs_write(HOST_SEL_GS, SEL_KERN_DATA_64);
-   vmcs_write(HOST_SEL_TR, SEL_TSS_RUN);
+   vmcs_write(HOST_SEL_CS, KERNEL_CS);
+   vmcs_write(HOST_SEL_SS, KERNEL_DS);
+   vmcs_write(HOST_SEL_DS, 

[PATCH kvm-unit-tests 17/18] x86: vm: mark intermediate PTEs as user-accessible

2014-04-28 Thread Paolo Bonzini
It is pointless to make the leaf user-accessible if the intermediate page
tables are not.  In a real OS, what would matter is that the page tables
themselves are only accessible through a supervisor mapping.

The SMAP testcase will rely on the user bit, so fix it now.

Signed-off-by: Paolo Bonzini 
---
 lib/x86/vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 3e7ca97..b820c7d 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -73,7 +73,7 @@ unsigned long *install_pte(unsigned long *cr3,
 else
 pt_page = 0;
memset(new_pt, 0, PAGE_SIZE);
-   pt[offset] = virt_to_phys(new_pt) | PTE_PRESENT | PTE_WRITE;
+   pt[offset] = virt_to_phys(new_pt) | PTE_PRESENT | PTE_WRITE | PTE_USER;
}
pt = phys_to_virt(pt[offset] & 0xff000ull);
 }
-- 
1.8.3.1




[PATCH kvm-unit-tests 11/18] x86: eventinj: enable NP test in 64-bit mode

2014-04-28 Thread Paolo Bonzini
Sharing the GDT between 32-bit and 64-bit means that the non-present
code segment is now always there, and the test just works.

Signed-off-by: Paolo Bonzini 
---
 x86/eventinj.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/x86/eventinj.c b/x86/eventinj.c
index 1df3a43..050bc8c 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -83,6 +83,7 @@ static void of_isr(struct ex_regs *r)
printf("OF isr running\n");
test_count++;
 }
+#endif
 
 static void np_isr(struct ex_regs *r)
 {
@@ -90,7 +91,6 @@ static void np_isr(struct ex_regs *r)
set_idt_sel(33, read_cs());
test_count++;
 }
-#endif
 
 static void de_isr(struct ex_regs *r)
 {
@@ -316,7 +316,6 @@ int main()
while(test_count != 2); /* wait for second irq */
irq_disable();
 
-#ifndef __x86_64__
/* test fault durint NP delivery */
printf("Before NP test\n");
test_count = 0;
@@ -327,7 +326,6 @@ int main()
asm volatile ("int $33");
printf("After int33\n");
report("NP exception", test_count == 2);
-#endif
 
/* generate NMI that will fault on IDT */
test_count = 0;
-- 
1.8.3.1




Re: [PATCH 14/21] MIPS: KVM: Add nanosecond count bias KVM register

2014-04-28 Thread James Hogan
Hi Paolo,

On 28 April 2014 13:01, Paolo Bonzini  wrote:
> Il 25/04/2014 17:19, James Hogan ha scritto:
>
>> Expose the KVM guest CP0_Count bias (from the monotonic kernel time) to
>> userland in nanosecond units via a new KVM_REG_MIPS_COUNT_BIAS register
>> accessible with the KVM_{GET,SET}_ONE_REG ioctls. This gives userland
>> control of the bias so that it can exactly match its own monotonic time.
>>
>> The nanosecond bias is stored separately from the raw bias used
>> internally (since nanoseconds isn't a convenient or efficient unit for
>> various timer calculations), and is recalculated each time the raw count
>> bias is altered. The raw count bias used in CP0_Count determination is
>> recalculated when the nanosecond bias is altered via the KVM_SET_ONE_REG
>> ioctl.
>>
>> Signed-off-by: James Hogan 
>> Cc: Paolo Bonzini 
>> Cc: Gleb Natapov 
>> Cc: kvm@vger.kernel.org
>> Cc: Ralf Baechle 
>> Cc: linux-m...@linux-mips.org
>> Cc: David Daney 
>> Cc: Sanjay Lal 
>
>
> If it is possible and not too hairy to use a raw value in userspace
> (together with KVM_REG_MIPS_COUNT_HZ), please do it---my suggestions were
> just that, a suggestion.  Otherwise, the patch looks good.

Do you mean expose the raw internal offset to userland instead of the
nanosecond one? Yeh it should be possible & slightly simpler for both
kernel and Qemu actually.

Qemu could then store that value (or the Count register) straight into
env->CP0_Count (depending on Cause.DC), then hw/mips/cputimer.c would
pretty much continue to work accurately. cputimer.c is only really
made use of by tcg at the moment though (reading/writing
count/compare/cause.DC), but it still makes sense to be consistent.

Cheers
James


[PATCH 3/2] KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)

2014-04-28 Thread Oleg Nesterov
On 04/28, Andrea Arcangeli wrote:
>
> On Mon, Apr 28, 2014 at 01:06:05PM +0200, Paolo Bonzini wrote:
> > Patch 1 will be for 3.16 only, I'd like a review from Marcelo or Andrea
> > though (that's "KVM: async_pf: kill the unnecessary use_mm/unuse_mm
> > async_pf_execute()" for easier googling).
>
> Patch 1:
>
> Reviewed-by: Andrea Arcangeli 

Thanks,

> I think current->NULL would be better too.

OK, let me send the trivial one-liner then. I won't mind if you fold it
into 1/2, or I can resend it with this change included.

Oleg.



[PATCH 3/2] KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)

2014-04-28 Thread Oleg Nesterov
async_pf_execute() passes tsk == current to gup(). This doesn't hurt,
but it is unnecessary and misleading: "tsk" is only used to account
the number of faults, and current is a random workqueue thread.

Signed-off-by: Oleg Nesterov 
---
 virt/kvm/async_pf.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 0ced4f3..62f4223 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -81,7 +81,7 @@ static void async_pf_execute(struct work_struct *work)
might_sleep();
 
down_read(&mm->mmap_sem);
-   get_user_pages(current, mm, addr, 1, 1, 0, NULL, NULL);
+   get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
up_read(&mm->mmap_sem);
kvm_async_page_present_sync(vcpu, apf);
 
-- 
1.5.5.1




Re: [PATCH v10 05/12] ARM/ARM64: KVM: Make kvm_psci_call() return convention more flexible

2014-04-28 Thread Marc Zyngier
On Mon, Apr 28 2014 at  3:17:43 pm BST, Christoffer Dall  wrote:
> On Mon, Apr 21, 2014 at 06:29:59PM +0530, Anup Patel wrote:
>> Currently, the kvm_psci_call() returns 'true' or 'false' based on whether
>> the PSCI function call was handled successfully or not. This does not help
>> us emulate system-level PSCI functions where the actual emulation work will
>> be done by user space (QEMU or KVMTOOL). Examples of such system-level PSCI
>> functions are: PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET.
>> 
>> This patch updates kvm_psci_call() to return three types of values:
>> 1) > 0 (success)
>> 2) = 0 (success but exit to user space)
>> 3) < 0 (errors)
>> 
>> Signed-off-by: Anup Patel 
>> Signed-off-by: Pranavkumar Sawargaonkar 
>> Reviewed-by: Christoffer Dall 
>
> Marc, do you still have comments on this one?

No, this one is OK:

Acked-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v10 08/12] ARM/ARM64: KVM: Emulate PSCI v0.2 AFFINITY_INFO

2014-04-28 Thread Marc Zyngier
On Mon, Apr 21 2014 at  2:00:02 pm BST, Anup Patel  wrote:
> This patch adds emulation of PSCI v0.2 AFFINITY_INFO function call
> for KVM ARM/ARM64. This is a VCPU-level function call which will be
> used to determine current state of given affinity level.
>
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> Reviewed-by: Christoffer Dall 
> ---
>  arch/arm/kvm/psci.c |   52 +--
>  1 file changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
> index 4486d0f..122bc67 100644
> --- a/arch/arm/kvm/psci.c
> +++ b/arch/arm/kvm/psci.c
> @@ -27,6 +27,16 @@
>   * as described in ARM document number ARM DEN 0022A.
>   */
>  
> +#define AFFINITY_MASK(level) ~((0x1UL << ((level) * MPIDR_LEVEL_BITS)) - 1)
> +
> +static unsigned long psci_affinity_mask(unsigned long affinity_level)
> +{
> + if (affinity_level <= 3)
> + return MPIDR_HWID_BITMASK & AFFINITY_MASK(affinity_level);
> +
> + return 0;
> +}
> +
>  static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
>  {
>   vcpu->arch.pause = true;
> @@ -85,6 +95,42 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
>   return PSCI_RET_SUCCESS;
>  }
>  
> +static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
> +{
> + int i;
> + unsigned long mpidr;
> + unsigned long target_affinity;
> + unsigned long target_affinity_mask;
> + unsigned long lowest_affinity_level;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_vcpu *tmp;
> +
> + target_affinity = *vcpu_reg(vcpu, 1);
> + lowest_affinity_level = *vcpu_reg(vcpu, 2);
> +
> + /* Determine target affinity mask */
> + target_affinity_mask = psci_affinity_mask(lowest_affinity_level);
> + if (!target_affinity_mask)
> + return PSCI_RET_INVALID_PARAMS;
> +
> + /* Ignore other bits of target affinity */
> + target_affinity &= target_affinity_mask;
> +
> + /*
> +  * If one or more VCPU matching target affinity are running
> +  * then ON else OFF
> +  */
> + kvm_for_each_vcpu(i, tmp, kvm) {
> + mpidr = kvm_vcpu_get_mpidr(tmp);
> + if (((mpidr & target_affinity_mask) == target_affinity) &&
> + !tmp->arch.pause) {
> + return PSCI_0_2_AFFINITY_LEVEL_ON;
> + }
> + }
> +
> + return PSCI_0_2_AFFINITY_LEVEL_OFF;
> +}
> +
>  static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type)
>  {
>   memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
> @@ -132,6 +178,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   case PSCI_0_2_FN64_CPU_ON:
>   val = kvm_psci_vcpu_on(vcpu);
>   break;
> + case PSCI_0_2_FN_AFFINITY_INFO:
> + case PSCI_0_2_FN64_AFFINITY_INFO:
> + val = kvm_psci_vcpu_affinity_info(vcpu);
> + break;
>   case PSCI_0_2_FN_SYSTEM_OFF:
>   kvm_psci_system_off(vcpu);
>   val = PSCI_RET_INTERNAL_FAILURE;
> @@ -143,12 +193,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   ret = 0;
>   break;
>   case PSCI_0_2_FN_CPU_SUSPEND:
> - case PSCI_0_2_FN_AFFINITY_INFO:
>   case PSCI_0_2_FN_MIGRATE:
>   case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
>   case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
>   case PSCI_0_2_FN64_CPU_SUSPEND:
> - case PSCI_0_2_FN64_AFFINITY_INFO:
>   case PSCI_0_2_FN64_MIGRATE:
>   case PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU:
>   val = PSCI_RET_NOT_SUPPORTED;

Acked-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v10 11/12] ARM/ARM64: KVM: Emulate PSCI v0.2 CPU_SUSPEND

2014-04-28 Thread Marc Zyngier
On Mon, Apr 21 2014 at  2:00:05 pm BST, Anup Patel  wrote:
> This patch adds emulation of PSCI v0.2 CPU_SUSPEND function call for
> KVM ARM/ARM64. This is a CPU-level function call which can suspend
> current CPU or current CPU cluster. We don't have VCPU clusters in
> KVM so we only suspend the current VCPU.
>
> The CPU_SUSPEND emulation is not tested much because currently there
> is no CPUIDLE driver in Linux kernel that uses PSCI CPU_SUSPEND. The
> PSCI CPU_SUSPEND implementation in ARM64 kernel was tested using a
> Simple CPUIDLE driver which is not published due to unstable DT-bindings
> for PSCI.
> (For more info, http://lwn.net/Articles/574950/)
>
> For simplicity, we implement CPU_SUSPEND emulation similar to WFI
> (Wait-for-interrupt) emulation and we also treat power-down request
> to be same as stand-by request. This is consistent with section
> 5.4.1 and section 5.4.2 of PSCI v0.2 specification.
>
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> ---
>  arch/arm/kvm/psci.c |   28 
>  1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
> index b582e99..757e506 100644
> --- a/arch/arm/kvm/psci.c
> +++ b/arch/arm/kvm/psci.c
> @@ -37,6 +37,26 @@ static unsigned long psci_affinity_mask(unsigned long affinity_level)
>   return 0;
>  }
>  
> +static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
> +{
> + /*
> +  * NOTE: For simplicity, we make VCPU suspend emulation to be
> +  * same-as WFI (Wait-for-interrupt) emulation.
> +  *
> +  * This means for KVM the wakeup events are interrupts and
> +  * this is consistent with intended use of StateID as described
> +  * in section 5.4.1 of PSCI v0.2 specification (ARM DEN 0022A).
> +  *
> +  * Further, we also treat power-down request to be same as
> +  * stand-by request as-per section 5.4.2 clause 3 of PSCI v0.2
> +  * specification (ARM DEN 0022A). This means all suspend states
> +  * for KVM will preserve the register state.
> +  */
> + kvm_vcpu_block(vcpu);
> +
> + return PSCI_RET_SUCCESS;
> +}
> +
>  static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
>  {
>   vcpu->arch.pause = true;
> @@ -183,6 +203,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>*/
>   val = 2;
>   break;
> + case PSCI_0_2_FN_CPU_SUSPEND:
> + case PSCI_0_2_FN64_CPU_SUSPEND:
> + val = kvm_psci_vcpu_suspend(vcpu);
> + break;
>   case PSCI_0_2_FN_CPU_OFF:
>   kvm_psci_vcpu_off(vcpu);
>   val = PSCI_RET_SUCCESS;
> @@ -221,10 +245,6 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   val = PSCI_RET_INTERNAL_FAILURE;
>   ret = 0;
>   break;
> - case PSCI_0_2_FN_CPU_SUSPEND:
> - case PSCI_0_2_FN64_CPU_SUSPEND:
> - val = PSCI_RET_NOT_SUPPORTED;
> - break;
>   default:
>   return -EINVAL;
>   }

Acked-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v10 07/12] ARM/ARM64: KVM: Emulate PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET

2014-04-28 Thread Marc Zyngier
On Mon, Apr 21 2014 at  2:00:01 pm BST, Anup Patel  wrote:
> The PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET functions are system-level
> functions hence cannot be fully emulated by in-kernel PSCI emulation code.
>
> To tackle this, we forward PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET function
> calls from vcpu to user space (i.e. QEMU or KVMTOOL) via kvm_run structure
> using KVM_EXIT_SYSTEM_EVENT exit reasons.
>
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> Reviewed-by: Christoffer Dall 
> ---
>  arch/arm/kvm/psci.c |   32 +---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
> index 14e6fa6..4486d0f 100644
> --- a/arch/arm/kvm/psci.c
> +++ b/arch/arm/kvm/psci.c
> @@ -85,6 +85,23 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
>   return PSCI_RET_SUCCESS;
>  }
>  
> +static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type)
> +{
> + memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
> + vcpu->run->system_event.type = type;
> + vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
> +}
> +
> +static void kvm_psci_system_off(struct kvm_vcpu *vcpu)
> +{
> + kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN);
> +}
> +
> +static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
> +{
> + kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
> +}
> +
>  int kvm_psci_version(struct kvm_vcpu *vcpu)
>  {
>   if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
> @@ -95,6 +112,7 @@ int kvm_psci_version(struct kvm_vcpu *vcpu)
>  
>  static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>  {
> + int ret = 1;
>   unsigned long psci_fn = *vcpu_reg(vcpu, 0) & ~((u32) 0);
>   unsigned long val;
>  
> @@ -114,13 +132,21 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   case PSCI_0_2_FN64_CPU_ON:
>   val = kvm_psci_vcpu_on(vcpu);
>   break;
> + case PSCI_0_2_FN_SYSTEM_OFF:
> + kvm_psci_system_off(vcpu);
> + val = PSCI_RET_INTERNAL_FAILURE;
> + ret = 0;
> + break;
> + case PSCI_0_2_FN_SYSTEM_RESET:
> + kvm_psci_system_reset(vcpu);
> + val = PSCI_RET_INTERNAL_FAILURE;
> + ret = 0;
> + break;

Maybe add a comment about why we set INTERNAL_FAILURE here (we shouldn't
be able to come back from such a PSCI call).

>   case PSCI_0_2_FN_CPU_SUSPEND:
>   case PSCI_0_2_FN_AFFINITY_INFO:
>   case PSCI_0_2_FN_MIGRATE:
>   case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
>   case PSCI_0_2_FN_MIGRATE_INFO_UP_CPU:
> - case PSCI_0_2_FN_SYSTEM_OFF:
> - case PSCI_0_2_FN_SYSTEM_RESET:
>   case PSCI_0_2_FN64_CPU_SUSPEND:
>   case PSCI_0_2_FN64_AFFINITY_INFO:
>   case PSCI_0_2_FN64_MIGRATE:
> @@ -132,7 +158,7 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   }
>  
>   *vcpu_reg(vcpu, 0) = val;
> - return 1;
> + return ret;
>  }
>  
>  static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)

Aside from this minor comment:

Acked-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v10 11/12] ARM/ARM64: KVM: Emulate PSCI v0.2 CPU_SUSPEND

2014-04-28 Thread Christoffer Dall
On Mon, Apr 21, 2014 at 06:30:05PM +0530, Anup Patel wrote:
> This patch adds emulation of PSCI v0.2 CPU_SUSPEND function call for
> KVM ARM/ARM64. This is a CPU-level function call which can suspend
> current CPU or current CPU cluster. We don't have VCPU clusters in
> KVM so we only suspend the current VCPU.
> 
> The CPU_SUSPEND emulation is not tested much because currently there
> is no CPUIDLE driver in Linux kernel that uses PSCI CPU_SUSPEND. The
> PSCI CPU_SUSPEND implementation in ARM64 kernel was tested using a
> Simple CPUIDLE driver which is not published due to unstable DT-bindings
> for PSCI.
> (For more info, http://lwn.net/Articles/574950/)
> 
> For simplicity, we implement CPU_SUSPEND emulation similar to WFI
> (Wait-for-interrupt) emulation and we also treat power-down request
> to be same as stand-by request. This is consistent with section
> 5.4.1 and section 5.4.2 of PSCI v0.2 specification.
> 
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> ---
>  arch/arm/kvm/psci.c |   28 
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
> index b582e99..757e506 100644
> --- a/arch/arm/kvm/psci.c
> +++ b/arch/arm/kvm/psci.c
> @@ -37,6 +37,26 @@ static unsigned long psci_affinity_mask(unsigned long 
> affinity_level)
>   return 0;
>  }
>  
> +static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu)
> +{
> + /*
> +  * NOTE: For simplicity, we make VCPU suspend emulation to be
> +  * same-as WFI (Wait-for-interrupt) emulation.
> +  *
> +  * This means for KVM the wakeup events are interrupts and
> +  * this is consistent with intended use of StateID as described
> +  * in section 5.4.1 of PSCI v0.2 specification (ARM DEN 0022A).
> +  *
> +  * Further, we also treat power-down request to be same as
> +  * stand-by request as-per section 5.4.2 clause 3 of PSCI v0.2
> +  * specification (ARM DEN 0022A). This means all suspend states
> +  * for KVM will preserve the register state.
> +  */
> + kvm_vcpu_block(vcpu);
> +
> + return PSCI_RET_SUCCESS;
> +}
> +
>  static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
>  {
>   vcpu->arch.pause = true;
> @@ -183,6 +203,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>*/
>   val = 2;
>   break;
> + case PSCI_0_2_FN_CPU_SUSPEND:
> + case PSCI_0_2_FN64_CPU_SUSPEND:
> + val = kvm_psci_vcpu_suspend(vcpu);
> + break;
>   case PSCI_0_2_FN_CPU_OFF:
>   kvm_psci_vcpu_off(vcpu);
>   val = PSCI_RET_SUCCESS;
> @@ -221,10 +245,6 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
>   val = PSCI_RET_INTERNAL_FAILURE;
>   ret = 0;
>   break;
> - case PSCI_0_2_FN_CPU_SUSPEND:
> - case PSCI_0_2_FN64_CPU_SUSPEND:
> - val = PSCI_RET_NOT_SUPPORTED;
> - break;
>   default:
>   return -EINVAL;
>   }
> -- 
> 1.7.9.5
> 
Acked-by: Christoffer Dall 


Re: kvm smm mode support?

2014-04-28 Thread Paolo Bonzini

Il 28/04/2014 16:01, Kevin O'Connor ha scritto:

> OVMF probably wants to set aside some ram which can't be accessed by the
> OS, for secure boot emulation which is actually secure.  Guess we'll
> just go map/unmap some slot in the smm enter/leave vmexits?  Or there
> are better ways to do it?

Normally, the memory at 0xa0000-0xc0000 is only mapped when in SMM.


Yes, and there's also a configuration space bit that lets you show/hide 
SMRAM at 0xa0000-0xc0000.  Another configuration space bit lets 
you lock the first bit.  QEMU doesn't implement the lock, but it should 
not be hard.


For OVMF, we would certainly lock SMRAM out.  For SeaBIOS, if we can 
avoid that it would help writing testcases...  SeaBIOS is not doing 
anything security-sensitive in SMM anyway.



And, as I understand it, in a multi-cpu system only the core handling
the SMI can access that ram.  (All other cores would continue to
access IO space at 0xa0000-0xc0000.)


QEMU just grew per-CPU address spaces, but not KVM.

I don't think we need it.  For SeaBIOS's callbacks we can assume single 
processor, SeaBIOS is not thread-safe anyway.  And the only interaction 
would be with legacy VGA VRAM, so no big deal.


Paolo
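To make the show/hide and lock behavior concrete, here is a minimal C sketch of that kind of SMRAM control logic. The register layout is loosely modeled on the i440FX SMRAM register; the bit names and positions are illustrative, not QEMU's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical control bits, loosely modeled on the i440FX SMRAM
 * register; names and bit positions here are illustrative only. */
#define G_SMRAME (1u << 3)	/* SMRAM enable */
#define D_LCK    (1u << 4)	/* lock: D_OPEN becomes read-only */
#define D_OPEN   (1u << 6)	/* SMRAM visible outside SMM too */

/* Does a CPU access to 0xa0000-0xbffff hit SMRAM (vs. VGA memory)? */
static bool access_hits_smram(uint8_t reg, bool in_smm)
{
	if (!(reg & G_SMRAME))
		return false;
	return in_smm || (reg & D_OPEN);
}

/* Once D_LCK is set, writes can no longer change D_OPEN. */
static uint8_t write_smram_reg(uint8_t old_reg, uint8_t val)
{
	if (old_reg & D_LCK)
		val = (val & ~D_OPEN) | (old_reg & D_OPEN);
	return val | (old_reg & D_LCK);	/* lock is sticky until reset */
}
```

With the lock behavior above, firmware can open SMRAM, install its handler, and then set D_LCK so the OS can never flip D_OPEN back.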


Re: [PATCH v10 05/12] ARM/ARM64: KVM: Make kvm_psci_call() return convention more flexible

2014-04-28 Thread Christoffer Dall
On Mon, Apr 21, 2014 at 06:29:59PM +0530, Anup Patel wrote:
> Currently, the kvm_psci_call() returns 'true' or 'false' based on whether
> the PSCI function call was handled successfully or not. This does not help
> us emulate system-level PSCI functions where the actual emulation work will
> be done by user space (QEMU or KVMTOOL). Examples of such system-level PSCI
> functions are: PSCI v0.2 SYSTEM_OFF and SYSTEM_RESET.
> 
> This patch updates kvm_psci_call() to return three types of values:
> 1) > 0 (success)
> 2) = 0 (success but exit to user space)
> 3) < 0 (errors)
> 
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> Reviewed-by: Christoffer Dall 

Marc, do you still have comments on this one?

-Christoffer
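A minimal sketch of how a caller could act on the three return types described in the commit message; the enum and function names are invented for illustration, this is not the actual KVM exit path:

```c
#include <assert.h>

/* Illustrative names only: > 0 means handled in kernel, == 0 means
 * the exit must be completed by user space, < 0 is an error. */
enum run_action { RESUME_GUEST, EXIT_TO_USERSPACE, RETURN_ERROR };

static enum run_action handle_psci_ret(int ret)
{
	if (ret > 0)
		return RESUME_GUEST;		/* e.g. CPU_ON handled in KVM */
	if (ret == 0)
		return EXIT_TO_USERSPACE;	/* e.g. SYSTEM_OFF / SYSTEM_RESET */
	return RETURN_ERROR;			/* ret < 0, e.g. -EINVAL */
}
```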


Re: [PATCH v10 02/12] ARM/ARM64: KVM: Add common header for PSCI related defines

2014-04-28 Thread Christoffer Dall
On Mon, Apr 21, 2014 at 06:29:56PM +0530, Anup Patel wrote:
> We need a common place to share PSCI related defines among ARM kernel,
> ARM64 kernel, KVM ARM/ARM64 PSCI emulation, and user space.
> 
> We introduce uapi/linux/psci.h for this purpose. This newly added
> header will be first used by KVM ARM/ARM64 in-kernel PSCI emulation
> and user space (i.e. QEMU or KVMTOOL).
> 
> Signed-off-by: Anup Patel 
> Signed-off-by: Pranavkumar Sawargaonkar 
> Signed-off-by: Ashwin Chaugule 
> ---
>  include/uapi/linux/Kbuild |1 +
>  include/uapi/linux/psci.h |   85 
> +
>  2 files changed, 86 insertions(+)
>  create mode 100644 include/uapi/linux/psci.h
> 
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 6929571..24e9033 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -317,6 +317,7 @@ header-y += ppp-ioctl.h
>  header-y += ppp_defs.h
>  header-y += pps.h
>  header-y += prctl.h
> +header-y += psci.h
>  header-y += ptp_clock.h
>  header-y += ptrace.h
>  header-y += qnx4_fs.h
> diff --git a/include/uapi/linux/psci.h b/include/uapi/linux/psci.h
> new file mode 100644
> index 000..0d4a136
> --- /dev/null
> +++ b/include/uapi/linux/psci.h
> @@ -0,0 +1,85 @@
> +/*
> + * ARM Power State and Coordination Interface (PSCI) header
> + *
> + * This header holds common PSCI defines and macros shared
> + * by: ARM kernel, ARM64 kernel, KVM ARM/ARM64 and user space.
> + *
> + * Copyright (C) 2014 Linaro Ltd.
> + * Author: Anup Patel 
> + */
> +
> +#ifndef _UAPI_LINUX_PSCI_H
> +#define _UAPI_LINUX_PSCI_H
> +
> +/*
> + * PSCI v0.1 interface
> + *
> + * The PSCI v0.1 function numbers are implementation defined.
> + *
> + * Only PSCI return values such as: SUCCESS, NOT_SUPPORTED,
> + * INVALID_PARAMS, and DENIED defined below are applicable
> + * to PSCI v0.1.
> + */
> +
> +/* PSCI v0.2 interface */
> +#define PSCI_0_2_FN_BASE 0x84000000
> +#define PSCI_0_2_FN(n)   (PSCI_0_2_FN_BASE + (n))
> +#define PSCI_0_2_64BIT   0x40000000
> +#define PSCI_0_2_FN64_BASE   \
> + (PSCI_0_2_FN_BASE + PSCI_0_2_64BIT)
> +#define PSCI_0_2_FN64(n) (PSCI_0_2_FN64_BASE + (n))
> +
> +#define PSCI_0_2_FN_PSCI_VERSION PSCI_0_2_FN(0)
> +#define PSCI_0_2_FN_CPU_SUSPEND  PSCI_0_2_FN(1)
> +#define PSCI_0_2_FN_CPU_OFF  PSCI_0_2_FN(2)
> +#define PSCI_0_2_FN_CPU_ON   PSCI_0_2_FN(3)
> +#define PSCI_0_2_FN_AFFINITY_INFOPSCI_0_2_FN(4)
> +#define PSCI_0_2_FN_MIGRATE  PSCI_0_2_FN(5)
> +#define PSCI_0_2_FN_MIGRATE_INFO_TYPEPSCI_0_2_FN(6)
> +#define PSCI_0_2_FN_MIGRATE_INFO_UP_CPU  PSCI_0_2_FN(7)
> +#define PSCI_0_2_FN_SYSTEM_OFF   PSCI_0_2_FN(8)
> +#define PSCI_0_2_FN_SYSTEM_RESET PSCI_0_2_FN(9)
> +
> +#define PSCI_0_2_FN64_CPU_SUSPENDPSCI_0_2_FN64(1)
> +#define PSCI_0_2_FN64_CPU_ON PSCI_0_2_FN64(3)
> +#define PSCI_0_2_FN64_AFFINITY_INFO  PSCI_0_2_FN64(4)
> +#define PSCI_0_2_FN64_MIGRATEPSCI_0_2_FN64(5)
> +#define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPUPSCI_0_2_FN64(7)
> +
> +#define PSCI_0_2_POWER_STATE_ID_MASK 0xffff
> +#define PSCI_0_2_POWER_STATE_ID_SHIFT0
> +#define PSCI_0_2_POWER_STATE_TYPE_MASK   0x1

Shouldn't this be (0x1 << PSCI_0_2_POWER_STATE_TYPE_SHIFT)?

That seems to be the definition of a mask in the PSCI_VERSION_MAJOR_MASK
below, at least be consistent in this file.

> +#define PSCI_0_2_POWER_STATE_TYPE_SHIFT  16
> +#define PSCI_0_2_POWER_STATE_AFFL_MASK   0x3

same

> +#define PSCI_0_2_POWER_STATE_AFFL_SHIFT  24
> +
> +#define PSCI_0_2_AFFINITY_LEVEL_ON   0
> +#define PSCI_0_2_AFFINITY_LEVEL_OFF  1
> +#define PSCI_0_2_AFFINITY_LEVEL_ON_PENDING   2

I'm confused, what do these defines signify?  I spent 10 minutes looking
at the spec and now I think that you probably mean AFFINITY_INFO and
that these are the possible return values?  Probably warrants a comment.

> +
> +#define PSCI_0_2_TOS_UP_MIGRATE  0
> +#define PSCI_0_2_TOS_UP_NO_MIGRATE   1
> +#define PSCI_0_2_TOS_MP  2

Should probably also comment that "TOS" are return values for
MIGRATE_INFO_TYPE.

> +
> +/* PSCI version decoding (independent of PSCI version) */
> +#define PSCI_VERSION_MAJOR_SHIFT 16
> +#define PSCI_VERSION_MINOR_MASK  \
> + ((1U << PSCI_VERSION_MAJOR_SHIFT) - 1)
> +#define PSCI_VERSION_MAJOR_MASK  ~PSCI_VERSION_MINOR_MASK
> +#define PSCI_VERSION_MAJOR(ver)  \
> + (((ver) & PSCI_VERSION_MAJOR_MASK) >> PSCI_VERSION_MAJOR_SHIFT)
> +#define PSCI_VERSION_MINOR(ver)  \
> +
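For reference, the version-decoding macros quoted above can be exercised standalone. The definitions below are restated from the patch; the MINOR accessor is completed per the usual kernel header, so treat it as an assumption. The last two lines illustrate Christoffer's review point: pre-shift a field mask so mask and shift agree.

```c
#include <assert.h>

/* Restated from the quoted patch (MINOR accessor is an assumption). */
#define PSCI_VERSION_MAJOR_SHIFT 16
#define PSCI_VERSION_MINOR_MASK  ((1U << PSCI_VERSION_MAJOR_SHIFT) - 1)
#define PSCI_VERSION_MAJOR_MASK  ~PSCI_VERSION_MINOR_MASK
#define PSCI_VERSION_MAJOR(ver)  \
	(((ver) & PSCI_VERSION_MAJOR_MASK) >> PSCI_VERSION_MAJOR_SHIFT)
#define PSCI_VERSION_MINOR(ver)  ((ver) & PSCI_VERSION_MINOR_MASK)

/* A pre-shifted field mask: extraction is then (x & MASK) >> SHIFT. */
#define POWER_STATE_TYPE_SHIFT 16
#define POWER_STATE_TYPE_MASK  (0x1u << POWER_STATE_TYPE_SHIFT)
```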

Re: [PATCH 0/2] KVM: async_pf: use_mm/mm_users fixes

2014-04-28 Thread Andrea Arcangeli
On Mon, Apr 28, 2014 at 01:06:05PM +0200, Paolo Bonzini wrote:
> Patch 1 will be for 3.16 only, I'd like a review from Marcelo or Andrea 
> though (that's "KVM: async_pf: kill the unnecessary use_mm/unuse_mm 
> async_pf_execute()" for easier googling).

Patch 1:

Reviewed-by: Andrea Arcangeli 

I think current->NULL would be better too.



Re: [PATCH 1/2] KVM: async_pf: kill the unnecessary use_mm/unuse_mm async_pf_execute()

2014-04-28 Thread Andrea Arcangeli
Hi,

On Wed, Apr 23, 2014 at 09:32:28PM +0200, Oleg Nesterov wrote:
> On 04/22, Christian Borntraeger wrote:
> >
> > On 22/04/14 22:15, Christian Borntraeger wrote:
> > > On 21/04/14 15:25, Oleg Nesterov wrote:
> > >> async_pf_execute() has no reasons to adopt apf->mm, gup(current, mm)
> > >> should work just fine even if current has another or NULL ->mm.
> > >>
> >> Recently kvm_async_page_present_sync() was added inside the "use_mm"
> > >> section, but it seems that it doesn't need current->mm too.
> > >>
> > >> Signed-off-by: Oleg Nesterov 
> > >
> > > Indeed, use/unuse_mm should only be necessary for copy_to/from_user etc.
> > > This is fine for s390, but it seems that x86 
> > > kvm_arch_async_page_not_present
> > > might call apf_put_user which might call copy_to_user, so this is not ok, 
> > > I guess.
> >
> > wanted to say kvm_arch_async_page_not_present, but I have to correct myself.
> > x86 does the "page is there" in the cpu loop, not in the worker. The cpu loop
> > does have a valid mm. So this patch should be also ok.
> 
> Thanks ;)
> 
> Btw, I forgot to mention this in the changelog, but
> 
> > >> @@ -80,12 +80,10 @@ static void async_pf_execute(struct work_struct 
> > >> *work)
> > >>
> > >>  might_sleep();
> > >>
> > >> -use_mm(mm);
> > >>  down_read(&mm->mmap_sem);
> > >>  get_user_pages(current, mm, addr, 1, 1, 0, NULL, NULL);
> > >>  up_read(&mm->mmap_sem);
> > >>  kvm_async_page_present_sync(vcpu, apf);
> > >> -unuse_mm(mm);
> 
> it can actually do
> 
>   get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> 
> "task" is only used to increment task_struct->xxx_flt. I don't think
> async_pf_execute() actually needs this (current is PF_WQ_WORKER after
> all), but I didn't dare to do another change in the code I can hardly
> understand.

Considering the faults would be randomly distributed among the kworker
threads my preference would also be for NULL instead of current.

ptrace and uprobes tends to be the only two places that look into
other mm with gup, ptrace knows the exact pid that it is triggering
the fault into, so it also can specify the correct task so the fault
goes in the right task struct. uprobes uses NULL.


Re: kvm smm mode support?

2014-04-28 Thread Kevin O'Connor
On Mon, Apr 28, 2014 at 03:49:31PM +0200, Gerd Hoffmann wrote:
> On Sa, 2014-04-26 at 13:02 +0200, Paolo Bonzini wrote:
> > Il 26/04/2014 11:40, Paolo Bonzini ha scritto:
> > > Il 25/04/2014 09:39, Gerd Hoffmann ha scritto:
> > >> Anyone has plans to add smm support to kvm?
> > >
> > > No plans, but it should be a Simple Matter Of Programming...
> > 
> > Well, we need:
> > 
> > - an extra ioctl to inject an SMI (can be modeled after KVM_NMI)
> > 
> > - an extra user exit triggered when SMM is entered or left
> > 
> > - an extra ioctl (or a GET/SET_ONE_REG implementation) to read/write 
> > whether we are in SMM, used to determine whether the #UD produced by RSM 
> > should be forwarded to the guest or trigger emulation.
> 
> OVMF probably wants to set aside some ram which can't be accessed by the
> OS, for secure boot emulation which is actually secure.  Guess we'll
> just go map/unmap some slot in the smm enter/leave vmexits?  Or there
> are better ways to do it?

Normally, the memory at 0xa0000-0xc0000 is only mapped when in SMM.
And, as I understand it, in a multi-cpu system only the core handling
the SMI can access that ram.  (All other cores would continue to
access IO space at 0xa0000-0xc0000.)

-Kevin


Re: kvm smm mode support?

2014-04-28 Thread Gerd Hoffmann
On Sa, 2014-04-26 at 13:02 +0200, Paolo Bonzini wrote:
> Il 26/04/2014 11:40, Paolo Bonzini ha scritto:
> > Il 25/04/2014 09:39, Gerd Hoffmann ha scritto:
> >> Anyone has plans to add smm support to kvm?
> >
> > No plans, but it should be a Simple Matter Of Programming...
> 
> Well, we need:
> 
> - an extra ioctl to inject an SMI (can be modeled after KVM_NMI)
> 
> - an extra user exit triggered when SMM is entered or left
> 
> - an extra ioctl (or a GET/SET_ONE_REG implementation) to read/write 
> whether we are in SMM, used to determine whether the #UD produced by RSM 
> should be forwarded to the guest or trigger emulation.

OVMF probably wants to set aside some ram which can't be accessed by the
OS, for secure boot emulation which is actually secure.  Guess we'll
just go map/unmap some slot in the smm enter/leave vmexits?  Or there
are better ways to do it?

cheers,
  Gerd




Re: kvm smm mode support?

2014-04-28 Thread Laszlo Ersek
On 04/26/14 11:40, Paolo Bonzini wrote:
> Il 25/04/2014 09:39, Gerd Hoffmann ha scritto:
>> Anyone has plans to add smm support to kvm?
> 
> No plans, but it should be a Simple Matter Of Programming...
> 
> A good start would be to write unit tests for SMM that work with QEMU.

Well I don't know what behavior to expect from SMM... :)

Plus, Kevin recently posted some remarks about the SMM implementation in
qemu-tcg -- apparently it's not faithful enough to physical hardware:

http://thread.gmane.org/gmane.comp.bios.coreboot.seabios/7959
http://thread.gmane.org/gmane.comp.emulators.qemu/268909

(But I can see that you've been already discussing the second thread; I
guess I should read up on it first...)

Thanks
Laszlo


Re: [PATCH 14/21] MIPS: KVM: Add nanosecond count bias KVM register

2014-04-28 Thread Paolo Bonzini

Il 25/04/2014 17:19, James Hogan ha scritto:

Expose the KVM guest CP0_Count bias (from the monotonic kernel time) to
userland in nanosecond units via a new KVM_REG_MIPS_COUNT_BIAS register
accessible with the KVM_{GET,SET}_ONE_REG ioctls. This gives userland
control of the bias so that it can exactly match its own monotonic time.

The nanosecond bias is stored separately from the raw bias used
internally (since nanoseconds isn't a convenient or efficient unit for
various timer calculations), and is recalculated each time the raw count
bias is altered. The raw count bias used in CP0_Count determination is
recalculated when the nanosecond bias is altered via the KVM_SET_ONE_REG
ioctl.

Signed-off-by: James Hogan 
Cc: Paolo Bonzini 
Cc: Gleb Natapov 
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: David Daney 
Cc: Sanjay Lal 


If it is possible and not too hairy to use a raw value in userspace 
(together with KVM_REG_MIPS_COUNT_HZ), please do it---my suggestions 
were just that, a suggestion.  Otherwise, the patch looks good.


Paolo
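As a rough illustration of the bias conversion being discussed, scaling a nanosecond bias to raw CP0_Count ticks at a given count frequency looks like this. This is simple integer scaling with no overflow handling; the kernel's actual conversion may round differently.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Scale a nanosecond bias to raw CP0_Count ticks at count_hz.
 * Note: ns * count_hz can overflow u64 for very large inputs;
 * this sketch ignores that. */
static uint64_t ns_to_count(uint64_t ns, uint32_t count_hz)
{
	return ns * count_hz / NSEC_PER_SEC;
}
```

For example, one second of bias at a 100 MHz count frequency corresponds to 100 million raw ticks.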


Re: [PATCH] KVM: x86: Fix page-tables reserved bits

2014-04-28 Thread Nadav Amit

On Apr 28, 2014, at 1:41 PM, Paolo Bonzini  wrote:

> Il 17/04/2014 00:04, Marcelo Tosatti ha scritto:
> > >> @@ -3550,9 +3550,9 @@ static void reset_rsvds_bits_mask(struct 
> > >> kvm_vcpu *vcpu,
> > >>  break;
> > >>  case PT64_ROOT_LEVEL:
> > >>  context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
> > >> -rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
> > >> +rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7);
> > >>  context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
> > >> -rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
> > >> +rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7);
> > Bit 7 is not reserved either, for the PDPTE (its PageSize bit).
> >
> > > In long mode (IA-32e), bit 7 is definitely reserved.
> 
> It's always reserved for PML4E (rsvd_bits_mask[x][3]), while for PDPTEs it is 
> not reserved if you have 1GB pages.
> 
>> There is a separate reserved mask for PS=1, nevermind.
>> 
> 
> Yeah, but the situation for IA32e rsvd_bits_mask[0][2] is exactly the same as 
> for PAE rsvd_bits_mask[0][1], and we're not marking the bit as reserved there.
> 
> The right thing to do is to add rsvd_bits(7, 7) to both rsvd_bits_mask[0][2] 
> and rsvd_bits_mask[1][2], if 1GB pages are not supported.
> 
> As written, the patch has no effect on PDPTEs because rsvd_bits_mask[0][2] is 
> only accessed if bit 7 is zero.
> 
> Nadav, would you mind preparing a follow-up?  Also, how did you find these 
> issues and test the fixes?
I will create a follow-up as soon as possible. We encountered the issues in a 
custom testing environment.
The fixes were validated using the failing tests, but they cover additional 
cases which might not have been tested.

Regards,
Nadav
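For context, the rsvd_bits() helper used throughout the quoted hunks builds a mask covering bit s through bit e inclusive; the sketch below mirrors its usual definition in the KVM MMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits s..e (inclusive) set, mirroring KVM's rsvd_bits(). */
static uint64_t rsvd_bits(int s, int e)
{
	return ((1ULL << (e - s + 1)) - 1) << s;
}
```

So the patch's change from rsvd_bits(7, 8) to rsvd_bits(7, 7) narrows the reserved mask from bits 7-8 down to bit 7 alone.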


Re: [PATCH 00/21] MIPS: KVM: Fixes and guest timer rewrite

2014-04-28 Thread Paolo Bonzini

Il 25/04/2014 17:19, James Hogan ha scritto:

Here are a range of MIPS KVM T&E fixes for v3.16. They can also be found
on my kvm_mips_queue branch here:
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/kvm-mips.git

They originally served to allow it to work better on Ingenic XBurst
cores which have some peculiarities which break non-portable assumptions
in the MIPS KVM implementation (patches 1-3, 11).

Fixing guest CP0_Count emulation to work without a running host
CP0_Count (patch 11) however required a rewrite of the timer emulation
code to use the kernel monotonic time instead, which needed doing anyway
since basing it directly off the host CP0_Count was broken. Various bugs
were fixed in the process (patches 8-10) and improvements made thanks to
valuable feedback from Paolo Bonzini for the last QEMU MIPS/KVM patchset
(patches 4-7, 13-15).

Finally there are some misc cleanups which I did along the way (patches
16-21).

Only the first patch (fixes MIPS KVM with 4K pages) is marked for
stable. For KVM to work on XBurst it needs the timer rework which is a
fairly complex change, so there's little point marking any of the XBurst
specific changes for stable.

All feedback welcome!

Patches 1-3:
Fix KVM/MIPS with 4K pages, missing RDHWR SYNCI (XBurst),
unmoving CP0_Random (XBurst).
Patches 4-7:
Add EPC, Count, Compare guest CP0 registers to KVM register
ioctl interface.
Patches 8-10:
Fix a few potential races relating to timers.
Patches 11-12:
Rewrite guest timer emulation to use ktime_get().
Patches 13-15:
Add KVM virtual registers for controlling guest timer, including
master timer disable, nanosecond bias, and timer frequency.
Patches 16-21:
Cleanups.

James Hogan (21):
  MIPS: KVM: Allocate at least 16KB for exception handlers
  MIPS: KVM: Use local_flush_icache_range to fix RI on XBurst
  MIPS: KVM: Use tlb_write_random
  MIPS: KVM: Fix CP0_EBASE KVM register id
  MIPS: KVM: Add CP0_EPC KVM register access
  MIPS: KVM: Move KVM_{GET,SET}_ONE_REG definitions into kvm_host.h
  MIPS: KVM: Add CP0_Count/Compare KVM register access
  MIPS: KVM: Deliver guest interrupts after local_irq_disable()
  MIPS: KVM: Fix timer race modifying guest CP0_Cause
  MIPS: KVM: Migrate hrtimer to follow VCPU
  MIPS: KVM: Rewrite count/compare timer emulation
  MIPS: KVM: Override guest kernel timer frequency directly
  MIPS: KVM: Add master disable count interface
  MIPS: KVM: Add nanosecond count bias KVM register
  MIPS: KVM: Add count frequency KVM register
  MIPS: KVM: Make kvm_mips_comparecount_{func,wakeup} static
  MIPS: KVM: Whitespace fixes in kvm_mips_callbacks
  MIPS: KVM: Fix kvm_debug bit-rottage
  MIPS: KVM: Remove ifdef DEBUG around kvm_debug
  MIPS: KVM: Quieten kvm_info() logging
  MIPS: KVM: Remove redundant NULL checks before kfree()

 arch/mips/Kconfig |  12 +-
 arch/mips/include/asm/kvm_host.h  | 185 +---
 arch/mips/include/uapi/asm/kvm.h  |  40 +++
 arch/mips/kvm/kvm_locore.S|  32 --
 arch/mips/kvm/kvm_mips.c  | 127 
 arch/mips/kvm/kvm_mips_dyntrans.c |  15 +-
 arch/mips/kvm/kvm_mips_emul.c | 608 --
 arch/mips/kvm/kvm_tlb.c   |  60 ++--
 arch/mips/kvm/kvm_trap_emul.c |  89 +-
 arch/mips/mti-malta/malta-time.c  |  14 +-
 10 files changed, 951 insertions(+), 231 deletions(-)

Cc: Paolo Bonzini 
Cc: Gleb Natapov 
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: David Daney 
Cc: Sanjay Lal 



There weren't too many comments on this series, and it would be really 
nice to have it in 3.16.


Thanks,

Paolo


Re: [PATCH 0/3] Emulate VMXON region correctly

2014-04-28 Thread Jan Kiszka
On 2014-04-28 07:00, Bandan Das wrote:
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=54521
> 
> The vmxon region is unused by nvmx, but adding these checks
> are probably harmless and may detect buggy L1 hypervisors in 
> the future!

Nice and welcome! Will you provide unit tests for these cases as well?

Jan

> 
> Bandan Das (3):
>   KVM: nVMX: rearrange get_vmx_mem_address
>   KVM: nVMX: additional checks on vmxon region
>   KVM: nVMX: fail on invalid vmclear/vmptrld pointer
> 
>  arch/x86/kvm/vmx.c | 171 
> -
>  1 file changed, 118 insertions(+), 53 deletions(-)
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v5 5/5] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2014-04-28 Thread Paolo Bonzini
What about some editing of the big comment...

/*
 * Currently, shadow PTEs are write protected in two cases, 1) write protecting
 * guest page tables, 2) resetting dirty tracking after KVM_GET_DIRTY_LOG. The
 * differences between these two sorts are:
 *
 * a) only the first case clears SPTE_MMU_WRITEABLE bit.
 *
 * b) the first case requires flushing the TLB immediately to avoid corruption
 *    of the shadow page table on other VCPUs.  In order to synchronize with
 *    other VCPUs the flush is done under the MMU lock.
 *
 *    The second case instead can delay flushing of the TLB until just before
 *    the dirty bitmap is returned to userspace; this is because it
 *    only write-protects pages that are set in the bitmap, and further writes
 *    to those pages can be safely ignored until userspace examines the bitmap.
 *    We rely on this to flush the TLB outside the MMU lock.
 *
 * A problem arises when these two cases occur concurrently.  Userspace can
 * call KVM_GET_DIRTY_LOG, which write-protects pages but does not immediately
 * flush the TLB; in the meanwhile, KVM wants to write-protect a guest page
 * table, sees it's already write-protected, and the result is a corrupted TLB.
 *
 * To avoid this problem, when write protecting guest page tables we *always*
 * flush the TLB if the spte has the SPTE_MMU_WRITEABLE bit set, even if
 * the spte was already write-protected.  This works since case 2 never touches
 * SPTE_MMU_WRITEABLE bit.  In other words, whenever a spte is updated (only
 * permission and status bits are changed) we need to check whether a spte with
 * SPTE_MMU_WRITEABLE becomes readonly.  If that happens, we flush the TLB.
 * mmu_spte_update() handles this.
 *
 * The rules to use SPTE_MMU_WRITEABLE and PT_WRITABLE_MASK are as follows:
 *
 * a) if you want to see if it has a writable TLB entry, or if the spte can be
 *    writable on the mmu mapping, check SPTE_MMU_WRITEABLE.  This is the most
 *    common case, otherwise
 *
 * b) when fixing a page fault on the spte or doing write-protection for
 *    dirty logging, check PT_WRITABLE_MASK.


Is the above accurate?

>  * TODO: introduce APIs to split these two cases.

What do you mean exactly?

Paolo
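The flush rule in the proposed comment, flush whenever an spte that has SPTE_MMU_WRITEABLE loses write access, can be sketched as follows. The flag values below are placeholders, not the real SPTE bit layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values; the real SPTE bit layout differs. */
#define SPTE_MMU_WRITEABLE (1ULL << 61)
#define PT_WRITABLE_MASK   (1ULL << 1)

/* Flush if an MMU-writable spte transitions from hardware-writable
 * to read-only, as the proposed comment describes. */
static bool spte_update_needs_flush(uint64_t old_spte, uint64_t new_spte)
{
	return (old_spte & SPTE_MMU_WRITEABLE) &&
	       (old_spte & PT_WRITABLE_MASK) &&
	       !(new_spte & PT_WRITABLE_MASK);
}
```

Case 2 (dirty logging) never touches SPTE_MMU_WRITEABLE, which is why the check above only triggers for sptes that case 1 cares about.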


Re: [PATCH 2/3] KVM: nVMX: additional checks on vmxon region

2014-04-28 Thread Jan Kiszka
On 2014-04-28 07:00, Bandan Das wrote:
> Currently, the vmxon region isn't used in the nested case.
> However, according to the spec, the vmxon instruction performs
> additional sanity checks on this region and the associated
> pointer. Modify emulated vmxon to better adhere to the spec
> requirements
> 
> Signed-off-by: Bandan Das 
> ---
>  arch/x86/kvm/vmx.c | 53 +
>  1 file changed, 53 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c18fe9a4..d5342c7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -354,6 +354,7 @@ struct vmcs02_list {
>  struct nested_vmx {
>   /* Has the level1 guest done vmxon? */
>   bool vmxon;
> + gpa_t vmxon_ptr;
>  
>   /* The guest-physical address of the current VMCS L1 keeps for L2 */
>   gpa_t current_vmptr;
> @@ -5840,9 +5841,19 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
>   struct kvm_segment cs;
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   struct vmcs *shadow_vmcs;
> + gva_t gva;
> + gpa_t vmptr;
> + struct x86_exception e;
> + struct page *page;
> +
>   const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
>   | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
>  
> + /* Guest physical Address Width */
> + struct kvm_cpuid_entry2 *best =
> + kvm_find_cpuid_entry(vcpu, 0x80000008, 0);

We have cpuid_maxphyaddr().

> +
> +
>   /* The Intel VMX Instruction Reference lists a bunch of bits that
>* are prerequisite to running VMXON, most notably cr4.VMXE must be
>* set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
> @@ -5865,6 +5876,46 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
>   kvm_inject_gp(vcpu, 0);
>   return 1;
>   }
> +
> + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> + vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
> + return 1;
> +
> + if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmptr,
> + sizeof(vmptr), &e)) {
> + kvm_inject_page_fault(vcpu, &e);
> + return 1;
> + }
> +
> + /* Don't bailout if best is NULL */
> + WARN_ON(!best);
> +
> + /*
> +  * SDM 3: 24.11.5
> +  * VMXON pointer must be 4KB aligned
> +  * VMXON pointer must not set any bits beyond processor's
> +  * physical address width
> +  * The first 4 bytes of VMXON region contain the supported
> +  * VMCS revision identifier
> +  *
> +  * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
> +  * which replaces physical address width with 32
> +  */
> + if (!IS_ALIGNED(vmptr, PAGE_SIZE) || (best &&
> +   (vmptr >> (best->eax & 0xff)))) {
> + nested_vmx_failInvalid(vcpu);
> + skip_emulated_instruction(vcpu);
> + return 1;
> + }
> +
> + page = nested_get_page(vcpu, vmptr);
> + if ((page == NULL) || ((*(u32 *)kmap(page) != VMCS12_REVISION))) {

Style: you don't need braces around the comparisons.

> + nested_vmx_failInvalid(vcpu);
> + kunmap(page);
> + skip_emulated_instruction(vcpu);
> + return 1;
> + }
> +
>   if (vmx->nested.vmxon) {
>   nested_vmx_failValid(vcpu, VMXERR_VMXON_IN_VMX_ROOT_OPERATION);
>   skip_emulated_instruction(vcpu);
> @@ -5896,9 +5947,11 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
>   vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
>  
>   vmx->nested.vmxon = true;
> + vmx->nested.vmxon_ptr = vmptr;
>  
>   skip_emulated_instruction(vcpu);
>   nested_vmx_succeed(vcpu);
> + kunmap(page);

This late unmapping leaks the page in other error cases.

Jan

>   return 1;
>  }
>  
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
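The two pointer checks under discussion, 4 KB alignment and no bits beyond the guest physical address width, reduce to something like this. This is a sketch; in the real code cpuid_maxphyaddr() would supply maxphyaddr, as Jan suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* VMXON pointer sanity checks per SDM 24.11.5 (sketch). */
static bool vmxon_ptr_valid(uint64_t vmptr, int maxphyaddr)
{
	if (vmptr & (PAGE_SIZE - 1))
		return false;		/* not 4 KB aligned */
	if (vmptr >> maxphyaddr)
		return false;		/* bits beyond physical address width */
	return true;
}
```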


Re: [PATCH 0/2] KVM: async_pf: use_mm/mm_users fixes

2014-04-28 Thread Paolo Bonzini

Il 21/04/2014 15:25, Oleg Nesterov ha scritto:

Hello.

Completely untested and I know nothing about kvm ;) Please review.

But use_mm() really looks misleading, and the usage of mm_users looks
"obviously wrong". I already sent this change while we were discussing
vmacache, but it was ignored. Since then kvm_async_page_present_sync()
was added into async_pf_execute() into async_pf_execute(), but it seems
to me that use_mm() is still unnecessary.

Oleg.

 virt/kvm/async_pf.c |   10 --
 1 files changed, 4 insertions(+), 6 deletions(-)



Applying patch 2 to kvm/master (for 3.15).

Patch 1 will be for 3.16 only, I'd like a review from Marcelo or Andrea 
though (that's "KVM: async_pf: kill the unnecessary use_mm/unuse_mm 
async_pf_execute()" for easier googling).


Paolo


Re: [PATCH] KVM: x86: Fix CR3 and LDT sel should not be saved in TSS

2014-04-28 Thread Paolo Bonzini

On 16/04/2014 21:20, Marcelo Tosatti wrote:

On Mon, Apr 07, 2014 at 06:37:47PM +0300, Nadav Amit wrote:

According to Intel specifications, only general purpose registers and segment
selectors should are saved in the old TSS during 32-bit task-switch.


should be


Signed-off-by: Nadav Amit 
---
 arch/x86/kvm/emulate.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 205b17e..0dec502 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2496,7 +2496,7 @@ static int task_switch_16(struct x86_emulate_ctxt *ctxt,
 static void save_state_to_tss32(struct x86_emulate_ctxt *ctxt,
struct tss_segment_32 *tss)
 {
-   tss->cr3 = ctxt->ops->get_cr(ctxt, 3);
+   /* CR3 and ldt selector are not saved intentionally */
tss->eip = ctxt->_eip;
tss->eflags = ctxt->eflags;
tss->eax = reg_read(ctxt, VCPU_REGS_RAX);
@@ -2514,7 +2514,6 @@ static void save_state_to_tss32(struct x86_emulate_ctxt 
*ctxt,
tss->ds = get_segment_selector(ctxt, VCPU_SREG_DS);
tss->fs = get_segment_selector(ctxt, VCPU_SREG_FS);
tss->gs = get_segment_selector(ctxt, VCPU_SREG_GS);
-   tss->ldt_selector = get_segment_selector(ctxt, VCPU_SREG_LDTR);
 }

 static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,


Only this hunk is enough ?


I guess there could be a corner case where the beginning or tail of the 
TSS is in a read-only page but EIP...LDT is all writable.


Paolo


@@ -2604,6 +2603,8 @@ static int task_switch_32(struct x86_emulate_ctxt *ctxt,
struct tss_segment_32 tss_seg;
int ret;
u32 new_tss_base = get_desc_base(new_desc);
+   u32 eip_offset = offsetof(struct tss_segment_32, eip);
+   u32 ldt_sel_offset = offsetof(struct tss_segment_32, ldt_selector);

ret = ops->read_std(ctxt, old_tss_base, &tss_seg, sizeof tss_seg,
&ctxt->exception);
@@ -2613,8 +2614,9 @@ static int task_switch_32(struct x86_emulate_ctxt *ctxt,

save_state_to_tss32(ctxt, &tss_seg);

-   ret = ops->write_std(ctxt, old_tss_base, &tss_seg, sizeof tss_seg,
-&ctxt->exception);
+   /* Only GP registers and segment selectors are saved */
+   ret = ops->write_std(ctxt, old_tss_base + eip_offset, &tss_seg.eip,
+ldt_sel_offset - eip_offset, &ctxt->exception);
if (ret != X86EMUL_CONTINUE)
/* FIXME: need to provide precise fault address */
return ret;
--
1.7.10.4



Re: [PATCH] KVM: x86: Fix page-tables reserved bits

2014-04-28 Thread Paolo Bonzini

On 17/04/2014 00:04, Marcelo Tosatti wrote:

> >> @@ -3550,9 +3550,9 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu 
*vcpu,
> >>   break;
> >>   case PT64_ROOT_LEVEL:
> >>   context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
> >> - rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
> >> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7);
> >>   context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
> >> - rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
> >> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7);

> >
> > Bit 7 is not reserved either, for the PDPTE (its PageSize bit).
> >

>
> In long mode (IA-32e), bit 7 is definitely reserved.


It's always reserved for PML4E (rsvd_bits_mask[x][3]), while for PDPTEs 
it is not reserved if you have 1GB pages.



There is a separate reserved mask for PS=1, nevermind.



Yeah, but the situation for IA32e rsvd_bits_mask[0][2] is exactly the 
same as for PAE rsvd_bits_mask[0][1], and we're not marking the bit as 
reserved there.


The right thing to do is to add rsvd_bits(7, 7) to both 
rsvd_bits_mask[0][2] and rsvd_bits_mask[1][2], if 1GB pages are not 
supported.


As written, the patch has no effect on PDPTEs because 
rsvd_bits_mask[0][2] is only accessed if bit 7 is zero.


Nadav, would you mind preparing a follow-up?  Also, how did you find 
these issues and test the fixes?


Paolo


Re: [RFC PATCH v3 0/6] Emulator speedups - avoid initializations where possible

2014-04-28 Thread Paolo Bonzini

On 16/04/2014 18:46, Bandan Das wrote:

While initializing the emulation context structure, kvm memsets to 0 a
number of fields, some of which are redundant since they get set
eventually in x86_decode_insn. Clean up unnecessary initializations
and remove some fields.

This is on top of Paolo's RFC
KVM: x86: speedups for emulator memory accesses
https://lkml.org/lkml/2014/4/1/494

Here are the new realmode.flat numbers with improvement
wrt unpatched kernel -

639 cycles/emulated jump instruction (4.3%)
776 cycles/emulated move instruction (7.5%)
791 cycles/emulated arithmetic instruction (11%)
943 cycles/emulated memory load instruction (5.2%)
948 cycles/emulated memory store instruction (7.6%)
929 cycles/emulated memory RMW instruction (9.0%)

v1 numbers -
639 cycles/emulated jump instruction
786 cycles/emulated move instruction
802 cycles/emulated arithmetic instruction
936 cycles/emulated memory load instruction
970 cycles/emulated memory store instruction
1000 cycles/emulated memory RMW instruction

v3:
Minor changes as proposed in review
 - 3/6 - cleanup typos
 - 6/6 - change comment in struct x86_emulate_ctxt and add back a missing
 if in decode_modrm

v2:
All thanks and credit to Paolo!
 - 1/6 - no change
 - 2/6 - new patch, intercept and check_perm replaced with checks for bits in 
ctxt->d
 - 3/6 - new patch, remove if condition in decode_rm and rearrange bit 
operations
 - 4/6 - remove else conditions from v1 and misc cleanups
 - 5/6 - new patch, remove seg_override and related fields and functions
 - 6/6 - new patch, remove memopp and move rip_relative to a local variable in
 decode_modrm

Bandan Das (6):
  KVM: emulate: move init_decode_cache to emulate.c
  KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
  KVM: emulate: cleanup decode_modrm
  KVM: emulate: clean up initializations in init_decode_cache
  KVM: emulate: rework seg_override
  KVM: emulate: remove memopp and rip_relative

 arch/x86/include/asm/kvm_emulate.h | 26 +++-
 arch/x86/kvm/emulate.c | 85 ++
 arch/x86/kvm/x86.c | 13 --
 3 files changed, 56 insertions(+), 68 deletions(-)



Applied to a private branch collecting all the emulator speedups.  Thanks!

Paolo


  1   2   >