[PATCH] KVM: Mapping IOMMU pages after updating memslot

2013-10-23 Thread Yang Zhang
From: Yang Zhang 

kvm_iommu_map_pages() needs to know the page size, which it obtains by
calling kvm_host_page_size(). That function checks whether the target slot
is valid before returning the right page size.
Currently we map the IOMMU pages when creating a new slot, but we call
kvm_iommu_map_pages() while the new slot is still being prepared. At that
point the new slot is not yet visible to the domain, so kvm_host_page_size()
cannot return the right page size and the IOMMU super-page logic breaks.
The fix is to map the IOMMU pages after the new slot has been inserted
into the domain.
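
A standalone toy model of the dependency described above (not kernel code;
all names and sizes here are placeholders): the page-size lookup can only
report the large page size once the slot is visible, so the IOMMU mapping
has to happen after the slot is installed.

#include <stdbool.h>
#include <stdio.h>

#define SMALL_PAGE 4096UL
#define SUPER_PAGE (2 * 1024 * 1024UL)

struct slot { bool visible; };

/* Stand-in for kvm_host_page_size(): it can only return the large page
 * size once the slot is visible in the memslot array. */
static unsigned long host_page_size(const struct slot *s)
{
        return s->visible ? SUPER_PAGE : SMALL_PAGE;
}

static void iommu_map(const struct slot *s)
{
        printf("mapping with %lu-byte pages\n", host_page_size(s));
}

int main(void)
{
        struct slot new_slot = { .visible = false };

        iommu_map(&new_slot);     /* old order: slot still being prepared -> 4K */

        new_slot.visible = true;  /* slot inserted into the memslot array       */
        iommu_map(&new_slot);     /* new order: super pages can be used -> 2M   */
        return 0;
}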

Signed-off-by: Yang Zhang 
Tested-by: Patrick Lu 
---
 virt/kvm/kvm_main.c |   29 ++---
 1 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d469114..9ec60a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -873,21 +873,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out_free;
}
 
-   /*
-* IOMMU mapping:  New slots need to be mapped.  Old slots need to be
-* un-mapped and re-mapped if their base changes.  Since base change
-* unmapping is handled above with slot deletion, mapping alone is
-* needed here.  Anything else the iommu might care about for existing
-* slots (size changes, userspace addr changes and read-only flag
-* changes) is disallowed above, so any other attribute changes getting
-* here can be skipped.
-*/
-   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
-   r = kvm_iommu_map_pages(kvm, &new);
-   if (r)
-   goto out_slots;
-   }
-
/* actual memory is freed via old in kvm_free_physmem_slot below */
if (change == KVM_MR_DELETE) {
new.dirty_bitmap = NULL;
@@ -901,6 +886,20 @@ int __kvm_set_memory_region(struct kvm *kvm,
kvm_free_physmem_slot(&old, &new);
kfree(old_memslots);
 
+   /*
+* IOMMU mapping:  New slots need to be mapped.  Old slots need to be
+* un-mapped and re-mapped if their base changes.  Since base change
+* unmapping is handled above with slot deletion, mapping alone is
+* needed here.  Anything else the iommu might care about for existing
+* slots (size changes, userspace addr changes and read-only flag
+* changes) is disallowed above, so any other attribute changes getting
+* here can be skipped.
+*/
+   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+   r = kvm_iommu_map_pages(kvm, &new);
+   return r;
+   }
+
return 0;
 
 out_slots:
-- 
1.7.1



[PATCH] nVMX: Report CPU_BASED_VIRTUAL_NMI_PENDING as supported

2013-10-23 Thread Jan Kiszka
If the host supports it, we can and should expose it to the guest as
well, just like we already do with PIN_BASED_VIRTUAL_NMIS.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81ce389..6850b0f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2228,7 +2228,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
nested_vmx_procbased_ctls_low = 0;
nested_vmx_procbased_ctls_high &=
-   CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
+   CPU_BASED_VIRTUAL_INTR_PENDING |
+   CPU_BASED_VIRTUAL_NMI_PENDING | CPU_BASED_USE_TSC_OFFSETING |
CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING |
-- 
1.8.1.1.298.ge7eed54


[PATCH] nVMX: Fix pick-up of uninjected NMIs

2013-10-23 Thread Jan Kiszka
__vmx_complete_interrupts stored uninjected NMIs in arch.nmi_injected,
not arch.nmi_pending. So we actually need to check the former field in
vmcs12_save_pending_event. This fixes the eventinj unit test when run
in nested KVM.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index feef3a1..81ce389 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8078,7 +8078,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu 
*vcpu,
}
 
vmcs12->idt_vectoring_info_field = idt_vectoring;
-   } else if (vcpu->arch.nmi_pending) {
+   } else if (vcpu->arch.nmi_injected) {
vmcs12->idt_vectoring_info_field =
INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR;
} else if (vcpu->arch.interrupt.pending) {
-- 
1.8.1.1.298.ge7eed54


Re: [PATCH RESEND v2 7/8] KVM: arm-vgic: Add GICD_SPENDSGIR and GICD_CPENDSGIR handlers

2013-10-23 Thread Marc Zyngier

On 2013-10-22 10:08, Christoffer Dall wrote:
Handle MMIO accesses to the two registers which should support both the
case where the VMs want to read/write either of these registers and the
case where user space reads/writes these registers to do save/restore of
the VGIC state.

Note that the added complexity compared to simple set/clear enable
registers stems from the bookkeeping of source cpu ids.  It may be
possible to change the underlying data structure to simplify the
complexity, but since this is not in the critical path, at all, this is
left as an interesting exercise to the reader.

Signed-off-by: Christoffer Dall 
Reviewed-by: Alexander Graf 

---
Changelog[v2]:
 - Use struct kvm_exit_mmio accessors for ->data field.
---
 virt/kvm/arm/vgic.c |  114
++-
 1 file changed, 112 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index f2dc72a..4e8c3ab 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -589,18 +589,128 @@ static bool handle_mmio_sgi_reg(struct 
kvm_vcpu *vcpu,

return false;
 }

+static void read_sgi_set_clear(struct kvm_vcpu *vcpu,
+  struct kvm_exit_mmio *mmio,
+  phys_addr_t offset)


set_clear is a bit unclear. How about reset?


+{
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+   int i, sgi, cpu;
+   int min_sgi = (offset & ~0x3) * 4;
+   int max_sgi = min_sgi + 3;
+   int vcpu_id = vcpu->vcpu_id;
+   u32 lr, reg = 0;
+
+   /* Copy source SGIs from distributor side */
+   for (sgi = min_sgi; sgi <= max_sgi; sgi++) {
+   int shift = 8 * (sgi - min_sgi);
+   reg |= (u32)dist->irq_sgi_sources[vcpu_id][sgi] << shift;
+   }
+
+   /* Copy source SGIs already on LRs */
+   for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
+   lr = vgic_cpu->vgic_lr[i];
+   sgi = lr & GICH_LR_VIRTUALID;
+   cpu = (lr & GICH_LR_PHYSID_CPUID) >> GICH_LR_PHYSID_CPUID_SHIFT;


Please wrap these LR accesses into separate functions. There is quite
a bit of duplication in this patch and I wonder if we could factor
things a bit.

At least, please isolate what is emulation related from what the
underlying HW actually provides. It will help mitigate my
headache in the future... ;-)
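
For illustration, helpers along these lines would remove that duplication
(a sketch only; the names are hypothetical, and the GICH_LR_* masks are the
ones the patch already uses):

/* Hypothetical accessors wrapping the two LR field extractions above. */
static inline int vgic_lr_sgi(u32 lr)
{
        return lr & GICH_LR_VIRTUALID;
}

static inline int vgic_lr_source_cpu(u32 lr)
{
        return (lr & GICH_LR_PHYSID_CPUID) >> GICH_LR_PHYSID_CPUID_SHIFT;
}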



+   if (sgi >= min_sgi && sgi <= max_sgi) {
+   if (lr & GICH_LR_STATE)
+   reg |= (1 << cpu) << (8 * (sgi - min_sgi));
+   }
+   }
+
+   mmio_data_write(mmio, ~0, reg);
+}
+
 static bool handle_mmio_sgi_clear(struct kvm_vcpu *vcpu,
  struct kvm_exit_mmio *mmio,
  phys_addr_t offset)
 {
-   return false;
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+   int i, sgi, cpu;
+   int min_sgi = (offset & ~0x3) * 4;
+   int max_sgi = min_sgi + 3;
+   int vcpu_id = vcpu->vcpu_id;
+   u32 *lr, reg;
+   bool updated = false;
+
+   if (!mmio->is_write) {
+   read_sgi_set_clear(vcpu, mmio, offset);
+   return false;
+   }
+
+   reg = mmio_data_read(mmio, ~0);
+
+   /* Clear pending SGIs on distributor side */
+   for (sgi = min_sgi; sgi <= max_sgi; sgi++) {
+   u8 mask = reg >> (8 * (sgi - min_sgi));
+   if (dist->irq_sgi_sources[vcpu_id][sgi] & mask)
+   updated = true;
+   dist->irq_sgi_sources[vcpu_id][sgi] &= ~mask;
+   }
+
+   /* Clear SGIs already on LRs */
+   for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
+   lr = &vgic_cpu->vgic_lr[i];
+   sgi = *lr & GICH_LR_VIRTUALID;
+   cpu = (*lr & GICH_LR_PHYSID_CPUID) >> 
GICH_LR_PHYSID_CPUID_SHIFT;
+
+   if (sgi >= min_sgi && sgi <= max_sgi) {
+   if (reg & ((1 << cpu) << (8 * (sgi - min_sgi)))) {
+   if (*lr & GICH_LR_PENDING_BIT)
+   updated = true;
+   *lr &= GICH_LR_PENDING_BIT;
+   }
+   }
+   }
+
+   return updated;
 }

 static bool handle_mmio_sgi_set(struct kvm_vcpu *vcpu,
struct kvm_exit_mmio *mmio,
phys_addr_t offset)
 {
-   return false;
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+   int i, sgi, cpu;
+   int min_sgi = (offset & ~0x3) * 4;
+   int max_sgi = min_sgi + 3;
+   int vcpu_id = vcpu->vcpu_id;
+   u32 *lr, reg;
+   bool updated = false;
+
+   if (!mmio->is_write) {
+   read_sgi_set_clear(vcpu, mmio, of

Re: [PATCH RESEND v2 6/8] KVM: arm-vgic: Add vgic reg access from dev attr

2013-10-23 Thread Marc Zyngier

On 2013-10-22 10:08, Christoffer Dall wrote:

Add infrastructure to handle distributor and cpu interface register
accesses through the KVM_{GET/SET}_DEVICE_ATTR interface by adding the
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS groups
and defining the semantics of the attr field to be the MMIO offset as
specified in the GICv2 specs.

Missing register accesses or other changes in individual register access
functions to support save/restore of the VGIC state are added in
subsequent patches.

Signed-off-by: Christoffer Dall 
Reviewed-by: Alexander Graf 

---
Changelog[v2]:
 - Added implementation specific format for the GICC_APRn registers.
---
 Documentation/virtual/kvm/devices/arm-vgic.txt |   50 +
 virt/kvm/arm/vgic.c|  143

 2 files changed, 193 insertions(+)

diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt
b/Documentation/virtual/kvm/devices/arm-vgic.txt
index c9febb2..e6416f8e 100644
--- a/Documentation/virtual/kvm/devices/arm-vgic.txt
+++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
@@ -19,3 +19,53 @@ Groups:
 KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
   Base address in the guest physical address space of the GIC
virtual cpu
   interface register mappings.
+
+  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
+  Attributes:
+The attr field of kvm_device_attr encodes two values:
+bits:     | 63 ....  40 | 39 ..  32  |  31 ....  0 |
+values:   |  reserved   |   cpu id   |   offset    |
+
+All distributor regs are (rw, 32-bit)
+
+The offset is relative to the "Distributor base address" as
defined in the
+GICv2 specs.  Getting or setting such a register has the same 
effect as
+reading or writing the register on the actual hardware from the 
cpu
+specified with cpu id field.  Note that most distributor fields 
are not

+banked, but return the same value regardless of the cpu id used
to access
+the register.
+  Limitations:
+- Priorities are not implemented, and registers are RAZ/WI
+  Errors:
+- ENODEV: Getting or setting this register is not yet supported


-ENODEV?
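
For illustration, a hedged userspace sketch of how the attr layout above
could be used with KVM_{GET,SET}_DEVICE_ATTR (it assumes the arm uapi
definitions added by this patch; the device fd and offset are placeholders):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Pack cpu id (bits 39..32) and MMIO offset (bits 31..0) into attr. */
static uint64_t vgic_attr(uint32_t cpu_id, uint32_t offset)
{
        return ((uint64_t)cpu_id << 32) | offset;
}

static int vgic_get_dist_reg(int device_fd, uint32_t cpu_id,
                             uint32_t offset, uint32_t *val)
{
        struct kvm_device_attr attr = {
                .group = KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
                .attr  = vgic_attr(cpu_id, offset),
                .addr  = (uint64_t)(unsigned long)val,  /* 32-bit value in user memory */
        };

        return ioctl(device_fd, KVM_GET_DEVICE_ATTR, &attr);
}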


+  KVM_DEV_ARM_VGIC_GRP_CPU_REGS
+  Attributes:
+The attr field of kvm_device_attr encodes two values:
+bits:     | 63 ....  40 | 39 ..  32  |  31 ....  0 |
+values:   |  reserved   |   cpu id   |   offset    |
+
+All CPU regs are (rw, 32-bit)


Nit: CPU interface registers

+The offset specifies the offset from the "CPU interface base 
address" as
+defined in the GICv2 specs.  Getting or setting such a register 
has the
+same effect as reading or writing the register on the actual 
hardware.

+
+The Active Priorities Registers APRn are implementation defined,
so we set a
+fixed format for our implementation that fits with the model of 
a "GICv2
+impementation without the security extensions" which we present 
to the


implementation


+guest.  This interface always exposes four register APR[0-3]
describing the
+maximum possible 128 preemption levels.  The semantics of the 
register
+indicate if any interrupts in a given preemption level are in 
the active

+state by setting the corresponding bit.
+
+Thus, preemption level X has one or more active interrupts if
and only if:
+
+  APRn[X mod 32] == 0b1,  where n = X / 32
+
+Bits for undefined preemption levels are RAZ/WI.
+
+  Limitations:
+- Priorities are not implemented, and registers are RAZ/WI
+  Errors:
+- ENODEV: Getting or setting this register is not yet supported
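
A tiny illustrative helper (not part of the patch) expressing the APRn
formula above:

#include <stdbool.h>
#include <stdint.h>

/* True if preemption level 'x' (0..127) has one or more active
 * interrupts, per the APRn layout described above. */
static bool apr_level_active(const uint32_t apr[4], unsigned int x)
{
        return (apr[x / 32] >> (x % 32)) & 1;
}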
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 1148a2e..f2dc72a 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -589,11 +589,29 @@ static bool handle_mmio_sgi_reg(struct kvm_vcpu 
*vcpu,

return false;
 }

+static bool handle_mmio_sgi_clear(struct kvm_vcpu *vcpu,
+ struct kvm_exit_mmio *mmio,
+ phys_addr_t offset)
+{
+   return false;
+}
+
+static bool handle_mmio_sgi_set(struct kvm_vcpu *vcpu,
+   struct kvm_exit_mmio *mmio,
+   phys_addr_t offset)
+{
+   return false;
+}
+
 /*
  * I would have liked to use the kvm_bus_io_*() API instead, but it
  * cannot cope with banked registers (only the VM pointer is passed
  * around, and we need the vcpu). One of these days, someone please
  * fix it!
+ *
+ * Note that the handle_mmio implementations should not use the 
phys_addr
+ * field from the kvm_exit_mmio struct as this will not have any 
sane values

+ * when used to save/restore state from user space.


This is quite ugly... I don't think we'd ever use that field directly, 
but reusing a well known structure for that purpose is very messy. I 
believe we'd be better off creating our own structure instead of 
re-purposing an existing one.


The other possibility would be to properly fill-in the phys_addr field. 
How difficult w

Re: [PATCH RESEND v2 3/8] KVM: arm-vgic: Set base addr through device API

2013-10-23 Thread Marc Zyngier

On 2013-10-22 10:08, Christoffer Dall wrote:
Support setting the distributor and cpu interface base addresses in 
the

VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
in addition to the ARM specific API.

This has the added benefit of being able to share more code in user
space and do things in a uniform maner.


   manner?


Also deprecate the older API at the same time, but backwards
compatibility will be maintained.

Signed-off-by: Christoffer Dall 
Reviewed-by: Alexander Graf 
---
 Documentation/virtual/kvm/api.txt  |6 +-
 Documentation/virtual/kvm/devices/arm-vgic.txt |   11 +++
 arch/arm/include/uapi/asm/kvm.h|9 +++
 arch/arm/kvm/arm.c |2 +-
 include/kvm/arm_vgic.h |2 +-
 virt/kvm/arm/vgic.c|   90

 6 files changed, 105 insertions(+), 15 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt
b/Documentation/virtual/kvm/api.txt
index 858aecf..d68b6c2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2324,7 +2324,7 @@ This ioctl returns the guest registers that are
supported for the
 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.


-4.84 KVM_ARM_SET_DEVICE_ADDR
+4.84 KVM_ARM_SET_DEVICE_ADDR (deprecated)

 Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
 Architectures: arm, arm64
@@ -2362,6 +2362,10 @@ must be called after calling
KVM_CREATE_IRQCHIP, but before calling
 KVM_RUN on any of the VCPUs.  Calling this ioctl twice for any of 
the

 base addresses will return -EEXIST.

+Note, this IOCTL is deprecated and the more flexible 
SET/GET_DEVICE_ATTR API

+should be used instead.
+
+
 4.85 KVM_PPC_RTAS_DEFINE_TOKEN

 Capability: KVM_CAP_PPC_RTAS
diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt
b/Documentation/virtual/kvm/devices/arm-vgic.txt
index 38f27f7..c9febb2 100644
--- a/Documentation/virtual/kvm/devices/arm-vgic.txt
+++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
@@ -8,3 +8,14 @@ Only one VGIC instance may be instantiated through
either this API or the
 legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM
interrupt
 controller, requiring emulated user-space devices to inject
interrupts to the
 VGIC instead of directly to CPUs.
+
+Groups:
+  KVM_DEV_ARM_VGIC_GRP_ADDR
+  Attributes:
+KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
+  Base address in the guest physical address space of the GIC
distributor
+  register mappings.
+
+KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
+  Base address in the guest physical address space of the GIC
virtual cpu
+  interface register mappings.
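
For illustration, a hedged userspace sketch of setting the distributor base
address through this group (the device fd and guest address are placeholders):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

static int vgic_set_dist_base(int device_fd, uint64_t guest_phys_addr)
{
        struct kvm_device_attr attr = {
                .group = KVM_DEV_ARM_VGIC_GRP_ADDR,
                .attr  = KVM_VGIC_V2_ADDR_TYPE_DIST,
                /* addr points to the 64-bit base address in user memory */
                .addr  = (uint64_t)(unsigned long)&guest_phys_addr,
        };

        return ioctl(device_fd, KVM_SET_DEVICE_ATTR, &attr);
}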
diff --git a/arch/arm/include/uapi/asm/kvm.h
b/arch/arm/include/uapi/asm/kvm.h
index 1c85102..587f1ae 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -142,6 +142,15 @@ struct kvm_arch_memory_slot {
 #define KVM_REG_ARM_VFP_FPINST 0x1009
 #define KVM_REG_ARM_VFP_FPINST20x100A

+/* Device Control API: ARM VGIC */
+#define KVM_DEV_ARM_VGIC_GRP_ADDR  0
+#define KVM_DEV_ARM_VGIC_GRP_DIST_REGS 1
+#define KVM_DEV_ARM_VGIC_GRP_CPU_REGS  2
+#define   KVM_DEV_ARM_VGIC_CPUID_SHIFT 32
+#define   KVM_DEV_ARM_VGIC_CPUID_MASK  (0xffULL <<
KVM_DEV_ARM_VGIC_CPUID_SHIFT)
+#define   KVM_DEV_ARM_VGIC_OFFSET_SHIFT0
+#define   KVM_DEV_ARM_VGIC_OFFSET_MASK (0xULL <<
KVM_DEV_ARM_VGIC_OFFSET_SHIFT)
+
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT 24
 #define KVM_ARM_IRQ_TYPE_MASK  0xff
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index ab96af2..3ecee45 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -773,7 +773,7 @@ static int kvm_vm_ioctl_set_device_addr(struct 
kvm *kvm,

case KVM_ARM_DEVICE_VGIC_V2:
if (!vgic_present)
return -ENXIO;
-   return kvm_vgic_set_addr(kvm, type, dev_addr->addr);
+   return kvm_vgic_addr(kvm, type, &dev_addr->addr, true);
default:
return -ENODEV;
}
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 7e2d158..be85127 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -144,7 +144,7 @@ struct kvm_run;
 struct kvm_exit_mmio;

 #ifdef CONFIG_KVM_ARM_VGIC
-int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 
addr);

+int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr,
bool write);
 int kvm_vgic_hyp_init(void);
 int kvm_vgic_init(struct kvm *kvm);
 int kvm_vgic_create(struct kvm *kvm);
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 79a8bae..d9c0fc5 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1479,6 +1479,12 @@ static int vgic_ioaddr_assign(struct kvm *kvm,
phys_addr_t *ioaddr,
 {
int ret;

+   if (addr & ~KVM_PHYS_MASK)
+   return -E2BIG;
+
+   if (addr & (SZ_4K - 1))
+   return -EINVAL

Re: [PATCH RESEND v2 2/8] KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGIC

2013-10-23 Thread Marc Zyngier

Hi Christoffer,

On 2013-10-22 10:08, Christoffer Dall wrote:

Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
ioctl, which can then later be leveraged to use KVM_{GET/SET}_DEVICE_ATTR.
This is useful both for setting addresses through a more generic API than
the ARM-specific one and for save/restore of VGIC state.

Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.

Note that we change the check for creating a VGIC from bailing out if
any VCPUs were created to bailing out if any VCPUs were ever run.  This is
an important distinction that doesn't break anything, but allows
creating the VGIC after the VCPUs have been created.

Signed-off-by: Christoffer Dall 
Reviewed-by: Alexander Graf 
---
 Documentation/virtual/kvm/devices/arm-vgic.txt |   10 ++
 arch/arm/include/uapi/asm/kvm.h|1 -
 arch/arm/kvm/arm.c |1 +
 include/linux/kvm_host.h   |1 +
 include/uapi/linux/kvm.h   |1 +
 virt/kvm/arm/vgic.c|   46
++--
 virt/kvm/kvm_main.c|5 +++
 7 files changed, 62 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/arm-vgic.txt

diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt
b/Documentation/virtual/kvm/devices/arm-vgic.txt
new file mode 100644
index 000..38f27f7
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
@@ -0,0 +1,10 @@
+ARM Virtual Generic Interrupt Controller (VGIC)
+===
+
+Device types supported:
+  KVM_DEV_TYPE_ARM_VGIC_V2 ARM Generic Interrupt Controller v2.0
+
+Only one VGIC instance may be instantiated through either this API 
or the

+legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM
interrupt
+controller, requiring emulated user-space devices to inject
interrupts to the
+VGIC instead of directly to CPUs.
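
For illustration, a hedged sketch of creating the device from user space
with the type added by this patch (vm_fd is a placeholder for an existing
VM file descriptor):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int create_vgic(int vm_fd)
{
        struct kvm_create_device cd = {
                .type = KVM_DEV_TYPE_ARM_VGIC_V2,
        };

        if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
                return -1;

        /* cd.fd is the new device fd, later used with KVM_{GET,SET}_DEVICE_ATTR */
        return cd.fd;
}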
diff --git a/arch/arm/include/uapi/asm/kvm.h
b/arch/arm/include/uapi/asm/kvm.h
index c1ee007..1c85102 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -142,7 +142,6 @@ struct kvm_arch_memory_slot {
 #define KVM_REG_ARM_VFP_FPINST 0x1009
 #define KVM_REG_ARM_VFP_FPINST20x100A

-


Nit: pointless change?


 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT 24
 #define KVM_ARM_IRQ_TYPE_MASK  0xff
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2b1091a..ab96af2 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -187,6 +187,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_IRQCHIP:
r = vgic_present;
break;
+   case KVM_CAP_DEVICE_CTRL:
case KVM_CAP_USER_MEMORY:
case KVM_CAP_SYNC_MMU:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..2906b79 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct
file *filp);

 extern struct kvm_device_ops kvm_mpic_ops;
 extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvm_arm_vgic_ops;

 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..2d50233 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -843,6 +843,7 @@ struct kvm_device_attr {
 #define KVM_DEV_TYPE_FSL_MPIC_20   1
 #define KVM_DEV_TYPE_FSL_MPIC_42   2
 #define KVM_DEV_TYPE_XICS  3
+#define KVM_DEV_TYPE_ARM_VGIC_V2   4


How about calling it GIC_V2 instead of VGIC_V2? As far as the guest is 
concerned, this is a "true" GIC, and the other names don't imply any 
distinction either...



 /*
  * ioctls for VM fds
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 5ce100f..79a8bae 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1434,15 +1434,23 @@ out:

 int kvm_vgic_create(struct kvm *kvm)
 {
-   int ret = 0;
+   int i, ret = 0;
+   struct kvm_vcpu *vcpu;

mutex_lock(&kvm->lock);

-   if (atomic_read(&kvm->online_vcpus) || kvm->arch.vgic.vctrl_base) {
+   if (kvm->arch.vgic.vctrl_base) {
ret = -EEXIST;
goto out;
}

+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (vcpu->arch.has_run_once) {
+   ret = -EBUSY;
+   goto out;
+   }
+   }


Isn't this racy? What prevents anyone from starting a CPU while you're 
in this loop?



spin_lock_init(&kvm->arch.vgic.lock);
kvm->arch.vgic.vctrl_base = vgic_vctrl_base;
kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF;
@@ -1511,3 +1519,37 @@ int kvm_vgic_set_addr(struct kvm *kvm,
unsigned long type, u64 addr)
mutex_unlock(&kvm->lock);
return r;
 }
+
+static int vgic

[PATCH][kvm-unit-tests] VMX preemption timer: Make test case more robust

2013-10-23 Thread Jan Kiszka
If we both print from L2 and, on timer expiry, from L1, we risk a
deadlock in L1 on the printf lock that is then held by L2. Avoid this
by only printing from L1.

Furthermore, if the timer fails to fire in time, disable it before
continuing so that it does not fire later on in a different context.

Signed-off-by: Jan Kiszka 
---
 x86/vmx_tests.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index 8d47bcd..7893a6c 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -128,6 +128,9 @@ void preemption_timer_init()
preempt_val = 1000;
vmcs_write(PREEMPT_TIMER_VALUE, preempt_val);
preempt_scale = rdmsr(MSR_IA32_VMX_MISC) & 0x1F;
+
+   if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT))
+   printf("\tSave preemption value is not supported\n");
 }
 
 void preemption_timer_main()
@@ -137,9 +140,7 @@ void preemption_timer_main()
printf("\tPreemption timer is not supported\n");
return;
}
-   if (!(ctrl_exit_rev.clr & EXI_SAVE_PREEMPT))
-   printf("\tSave preemption value is not supported\n");
-   else {
+   if (ctrl_exit_rev.clr & EXI_SAVE_PREEMPT) {
set_stage(0);
vmcall();
if (get_stage() == 1)
@@ -148,8 +149,8 @@ void preemption_timer_main()
while (1) {
if (((rdtsc() - tsc_val) >> preempt_scale)
> 10 * preempt_val) {
-   report("Preemption timer", 0);
-   break;
+   set_stage(2);
+   vmcall();
}
}
 }
@@ -170,7 +171,7 @@ int preemption_timer_exit_handler()
report("Preemption timer", 0);
else
report("Preemption timer", 1);
-   return VMX_TEST_VMEXIT;
+   break;
case VMX_VMCALL:
switch (get_stage()) {
case 0:
@@ -182,24 +183,29 @@ int preemption_timer_exit_handler()
EXI_SAVE_PREEMPT) & ctrl_exit_rev.clr;
vmcs_write(EXI_CONTROLS, ctrl_exit);
}
-   break;
+   vmcs_write(GUEST_RIP, guest_rip + insn_len);
+   return VMX_TEST_RESUME;
case 1:
if (vmcs_read(PREEMPT_TIMER_VALUE) >= preempt_val)
report("Save preemption value", 0);
else
report("Save preemption value", 1);
+   vmcs_write(GUEST_RIP, guest_rip + insn_len);
+   return VMX_TEST_RESUME;
+   case 2:
+   report("Preemption timer", 0);
break;
default:
printf("Invalid stage.\n");
print_vmexit_info();
-   return VMX_TEST_VMEXIT;
+   break;
}
-   vmcs_write(GUEST_RIP, guest_rip + insn_len);
-   return VMX_TEST_RESUME;
+   break;
default:
printf("Unknown exit reason, %d\n", reason);
print_vmexit_info();
}
+   vmcs_write(PIN_CONTROLS, vmcs_read(PIN_CONTROLS) & ~PIN_PREEMPT);
return VMX_TEST_VMEXIT;
 }
 
-- 
1.8.1.1.298.ge7eed54


[PATCH][kvm-unit-tests] nEPT: Fix logic for testing read access

2013-10-23 Thread Jan Kiszka
We need to fail the test if MAGIC_VAL_1 cannot be found in either
data_page1 or data_page2.

Signed-off-by: Jan Kiszka 
---

BTW, this and the previous patch apply on top of the vmx queue of
kvm-unit-tests.

 x86/vmx_tests.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index a002a7a..8d47bcd 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -956,7 +956,7 @@ static void ept_main()
return;
}
set_stage(0);
-   if (*((u32 *)data_page2) != MAGIC_VAL_1 &&
+   if (*((u32 *)data_page2) != MAGIC_VAL_1 ||
*((u32 *)data_page1) != MAGIC_VAL_1)
report("EPT basic framework - read", 0);
else {
-- 
1.8.1.1.298.ge7eed54


[PATCH] KVM: nVMX: Report 2MB EPT pages as supported

2013-10-23 Thread Jan Kiszka
As long as the hardware provides us 2MB EPT pages, we can also expose
them to the guest because our shadow EPT code already supports this
feature.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06fd762..feef3a1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2261,7 +2261,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
/* nested EPT: emulate EPT also to L1 */
nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT |
-VMX_EPTP_WB_BIT | VMX_EPT_INVEPT_BIT;
+VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT |
+VMX_EPT_INVEPT_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
/*
 * Since invept is completely emulated we support both global
-- 
1.8.1.1.298.ge7eed54


[PATCH][kvm-unit-tests] nEPT: Fix test cases for 2M huge pages

2013-10-23 Thread Jan Kiszka
If 2M pages are available with EPT, the test code creates its initial
identity map with such pages. But then it tries to remap two 4K pages in
that range which fails as their level 3 PTE is set up for huge pages.

Fix this up by ensuring that install_ept_entry always creates non-large
page directory entries and by remapping the 2M area around those two
test pages in 4K chunks.
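
A standalone sketch of the range computation described above (not the test
code itself; the address is a placeholder and the loop body stands in for
the per-page remap):

#include <stdio.h>

#define PAGE_SIZE       4096UL
#define PAGE_SIZE_2M    (2 * 1024 * 1024UL)
#define PAGE_MASK_2M    (~(PAGE_SIZE_2M - 1))

int main(void)
{
        unsigned long data_page = 0x12345000UL;  /* placeholder address */
        unsigned long base = data_page & PAGE_MASK_2M;
        unsigned long addr;

        for (addr = base; addr < base + PAGE_SIZE_2M; addr += PAGE_SIZE) {
                /* a 4K mapping for 'addr' would be installed here */
        }

        printf("2M region remapped in 4K chunks: [%#lx, %#lx)\n",
               base, base + PAGE_SIZE_2M);
        return 0;
}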

Signed-off-by: Jan Kiszka 
---
 x86/vmx.c   | 3 ++-
 x86/vmx.h   | 3 ++-
 x86/vmx_tests.c | 8 
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/x86/vmx.c b/x86/vmx.c
index 9db4ef4..3e6fc37 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -173,7 +173,8 @@ void install_ept_entry(unsigned long *pml4,
memset(new_pt, 0, PAGE_SIZE);
pt[offset] = virt_to_phys(new_pt)
| EPT_RA | EPT_WA | EPT_EA;
-   }
+   } else
+   pt[offset] &= ~EPT_LARGE_PAGE;
pt = phys_to_virt(pt[offset] & 0xff000ull);
}
offset = ((unsigned long)guest_addr >> ((level-1) *
diff --git a/x86/vmx.h b/x86/vmx.h
index dc1ebdf..7d967eb 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -485,7 +485,8 @@ enum Ctrl1 {
 #defineEPT_PAGE_LEVEL  4
 #defineEPT_PGDIR_WIDTH 9
 #defineEPT_PGDIR_MASK  511
-#define PAGE_MASK (~(PAGE_SIZE-1))
+#define PAGE_MASK  (~(PAGE_SIZE-1))
+#define PAGE_MASK_2M   (~(PAGE_SIZE_2M-1))
 
 #define EPT_VLT_RD 1
 #define EPT_VLT_WR (1 << 1)
diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index 0759e10..a002a7a 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -915,6 +915,7 @@ static int setup_ept()
 
 static void ept_init()
 {
+   unsigned long base_addr1, base_addr2;
u32 ctrl_cpu[2];
 
init_fail = false;
@@ -934,6 +935,13 @@ static void ept_init()
memset(data_page2, 0x0, PAGE_SIZE);
*((u32 *)data_page1) = MAGIC_VAL_1;
*((u32 *)data_page2) = MAGIC_VAL_2;
+   base_addr1 = (unsigned long)data_page1 & PAGE_MASK_2M;
+   base_addr2 = (unsigned long)data_page2 & PAGE_MASK_2M;
+   if (setup_ept_range(pml4, base_addr1, base_addr1 + PAGE_SIZE_2M, 0, 0,
+   EPT_WA | EPT_RA | EPT_EA) ||
+   setup_ept_range(pml4, base_addr2, base_addr2 + PAGE_SIZE_2M, 0, 0,
+   EPT_WA | EPT_RA | EPT_EA))
+   init_fail = true;
install_ept(pml4, (unsigned long)data_page1, (unsigned long)data_page2,
EPT_RA | EPT_WA | EPT_EA);
 }
-- 
1.8.1.1.298.ge7eed54


[PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-10-23 Thread Xiao Guangrong
Use sp->role.level instead of @level since @level is not obtained from the
page table hierarchy.

There is no issue in the current code since the fast page fault currently only
fixes faults caused by dirty logging, which are always on the last level
(level = 1).

This patch makes the code more readable and avoids potential issues in
further development.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40772ef..d2aacc2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2798,9 +2798,9 @@ static bool page_fault_can_be_fast(u32 error_code)
 }
 
 static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+   u64 *sptep, u64 spte)
 {
-   struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
 
WARN_ON(!sp->role.direct);
@@ -2826,6 +2826,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
u32 error_code)
 {
struct kvm_shadow_walk_iterator iterator;
+   struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;
 
@@ -2846,7 +2847,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
}
 
-   if (!is_last_spte(spte, level))
+   sp = page_header(__pa(iterator.sptep));
+   if (!is_last_spte(spte, sp->role.level))
goto exit;
 
/*
@@ -2872,7 +2874,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
 */
-   ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+   ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
 exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
  spte, ret);
-- 
1.8.1.4



[PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-10-23 Thread Xiao Guangrong
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  just change the spte from writable to readonly, so we only need to care about
  the case of changing a spte from present to present (changing a spte from
  present to nonpresent will flush all the TLBs immediately); in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we check
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 18 ++
 arch/x86/kvm/x86.c |  9 +++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
 
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   kvm_flush_remote_tlbs(kvm);
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
-   }
}
}
 
-   kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+   /*
+* We can flush all the TLBs out of the mmu lock without TLB
+* corruption since we just change the spte from writable to
+* readonly so that we only need to care the case of changing
+* spte from present to present (changing the spte from present
+* to nonpresent will flush all the TLBs immediately), in other
+* words, the only case we care is mmu_spte_update() where we
+* haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+* instead of PT_WRITABLE_MASK, that means it does not depend
+* on PT_WRITABLE_MASK anymore.
+*/
+   kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
-   if (is_dirty)
-   kvm_flush_remote_tlbs(kvm);
 
spin_unlock(&kvm->mmu_lock);
 
+   /*
+* All the TLBs can be flushed out of mmu lock, see the comments in
+* kvm_mmu_slot_remove_write_access().
+*/
+   if (is_dirty)
+   kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
-- 
1.8.1.4



[PATCH v3 00/15] KVM: MMU: locklessly write-protect

2013-10-23 Thread Xiao Guangrong
Changelog v3:
- the changes from Gleb's review:
  1) drop the patch which fixed the count of spte number in rmap since it
     cannot be easily fixed and is gone after applying this patchset

- ideas from Gleb and discussions with Marcelo are also very much appreciated:
  2) change the way to locklessly access shadow page - use SLAB_DESTROY_BY_RCU
 to protect shadow page instead of conditionally using call_rcu()
  3) improve is_last_spte() that checks last spte by only using some bits on
 the spte, then it is safely used when we locklessly write-protect the
 shadow page table

Changelog v2:

- the changes from Gleb's review:
  1) fix calculating the number of spte in the pte_list_add()
  2) set iter->desc to NULL if we meet a nulls desc, to clean up the code of
     rmap_get_next()
  3) fix hlist corruption due to accessing sp->hlist out of mmu-lock
  4) use rcu functions to access the rcu protected pointer
  5) spte will be missed in lockless walker if the spte is moved in a desc
 (remove a spte from the rmap using only one desc). Fix it by bottom-up
 walking the desc

- the changes from Paolo's review
  1) make the order and memory barriers between update spte / add spte into
 rmap and dirty-log more clear
  
- the changes from Marcelo's review:
  1) let fast page fault only fix the spts on the last level (level = 1)
  2) improve some changelogs and comments

- the changes from Takuya's review:
  move the patch "flush tlb if the spte can be locklessly modified" forward
  to make it more easily merged

Thank all of you very much for your time and patience on this patchset!
  
Since we use rcu_assign_pointer() to update the pointers in the desc even if dirty
logging is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz * 12 + 36G memory

- migrate-perf (benchmark the time of get-dirty-log)
  before: Run 10 times, Avg time:9009483 ns.
  after: Run 10 times, Avg time:4807343 ns.

- kernbench
  Guest: 12 VCPUs + 8G memory
  before:
EPT is enabled:
# cat 09-05-origin-ept | grep real   
real 85.58
real 83.47
real 82.95

EPT is disabled:
# cat 09-05-origin-shadow | grep real
real 138.77
real 138.99
real 139.55

  after:
EPT is enabled:
# cat 09-05-lockless-ept | grep real
real 83.40
real 82.81
real 83.39

EPT is disabled:
# cat 09-05-lockless-shadow | grep real
real 138.91
real 139.71
real 138.94

No performance regression!



Background
==
Currently, when we mark a memslot as dirty-logged or get its dirty pages, we need
to write-protect a large amount of guest memory. This is heavy work, especially
since we need to hold the mmu-lock, which is also required by vcpus to fix their
page table faults and by the mmu-notifier when a host page is being changed. In
guests with extreme cpu / memory usage, this becomes a scalability issue.

This patchset introduces a way to locklessly write-protect guest memory.

Idea
==
There are the challenges we meet and the ideas to resolve them.

1) How to locklessly walk rmap?
The first idea we had to prevent a "desc" from being freed while we are walking
the rmap was to use RCU. But when a vcpu runs in shadow page mode or nested mmu
mode, it updates the rmap really frequently.

So we use SLAB_DESTROY_BY_RCU to manage "desc" instead, which allows the objects
to be reused more quickly. We also store a "nulls" in the last "desc"
(desc->more), which helps us detect whether the "desc" has moved to another
rmap, in which case we re-walk the rmap. I learned this idea from
nulls-list.
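
A minimal standalone sketch of the "nulls" trick just described (simplified,
not the kernel code): the terminating pointer carries the owning rmap with
its low bit set, so a lockless walker can check at the end whether the desc
chain still belongs to the rmap it started from, and restart otherwise.

#include <stdbool.h>
#include <stdint.h>

struct desc {
        struct desc *more;  /* next desc, or the tagged "nulls" terminator */
};

static inline void mark_nulls(struct desc *d, void *rmap)
{
        d->more = (struct desc *)((uintptr_t)rmap | 1UL);
}

static inline bool is_nulls(const struct desc *d)
{
        return (uintptr_t)d & 1UL;
}

static inline void *nulls_value(const struct desc *d)
{
        return (void *)((uintptr_t)d & ~1UL);
}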

Another issue is that when a spte is deleted from a "desc", another spte in the
last "desc" is moved to this position to replace the deleted one. If the
deleted one has already been accessed and we do not access the replaced one, the
replaced one is missed by the lockless walk.
To fix this case, we do not move the spte backward; instead, we move the entry
forward: when a spte is deleted, we move the entry in the first desc to that
position.

2) How to locklessly access shadow page table?
It is easy if the handler is in the vcpu context: in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in ioctl
context and the paths we are optimizing for have a heavy workload, so disabling
interrupts is not good for system performance.

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then use
call_rcu() to free the shadow page if that indicator is set. Setting/clearing the
indicator is protected by the slot lock, so it need not be atomic and does not
hurt performance or scalability.

3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit on the spte, and the other is dropping the spte when it
points to a large page. The former is easy, since we only need to atomically
clear a bit, but the latter is hard since we need to remove the spte from the
rmap. so we unify these two behaviors that only make th

[PATCH v3 02/15] KVM: MMU: lazily drop large spte

2013-10-23 Thread Xiao Guangrong
Currently, kvm zaps the large spte if write protection is needed, so a later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making it non-present, so that the page fault caused by read access
can be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version fixes the issue reported in 6b73a9606. The cause of that
issue is that fast_page_fault() directly sets a readonly large spte to
writable but only dirties the first page in the dirty-bitmap, which means
the other pages are missed. Fix it by allowing only normal sptes (on the
PT_PAGE_TABLE_LEVEL) to be fast-fixed.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 36 
 arch/x86/kvm/x86.c |  8 ++--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d2aacc2..8739208 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1176,8 +1176,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1186,10 +1185,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
u64 spte = *sptep;
 
@@ -1199,17 +1197,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
*flush, bool pt_protect)
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (__drop_large_spte(kvm, sptep)) {
-   *flush |= true;
-   return true;
-   }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
 
-   *flush |= mmu_spte_update(sptep, spte);
-   return false;
+   return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1221,11 +1213,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
-   if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-   sptep = rmap_get_first(*rmapp, &iter);
-   continue;
-   }
 
+   flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}
 
@@ -2669,6 +2658,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
break;
}
 
+   drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
 
@@ -2870,6 +2861,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
 
/*
+* Do not fix write-permission on the large spte since we only dirty
+* the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+* that means other pages are missed if its slot is dirty-logged.
+*
+* Instead, we let the slow page fault path create a normal spte to
+* fix the access.
+*
+* See the comments in kvm_arch_commit_memory_region().
+*/
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   goto exit;
+
+   /*
 * Currently, fast page fault only works for direct mapping since
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edf2a07..b3aa650 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7223,8 +7223,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
 * Write protect all pages for dirty logging.
-* Existing largepage mappings are destroyed here and new ones will
-* not be created until the end of the logging.
+*
+* All the sptes including the large sptes which point to this
+* slot are set to readonly. We can not create any new large
+* spte on this slot until the end

[PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-10-23 Thread Xiao Guangrong
Relax the tlb flush condition since we will write-protect the spte outside of the
mmu lock. Note that lockless write-protection only marks a writable spte as
readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is tested by spte_is_locklessly_modifiable).

This patch is used to avoid this kind of race:

  VCPU 0                               VCPU 1
lockless write protection:
  set spte.w = 0
                                       lock mmu-lock

                                       write protection the spte to sync shadow page,
                                       see spte.w = 0, then without flush tlb

                                       unlock mmu-lock

                                       !!! At this point, the shadow page can still be
                                           writable due to the corrupt tlb entry
  Flush all TLB

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8739208..62f18ec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 * we always atomicly update it, see the comments in
 * spte_has_volatile_bits().
 */
-   if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+   if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;
 
if (!shadow_accessed_mask)
-- 
1.8.1.4



[PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-10-23 Thread Xiao Guangrong
This is like a nulls list: we use the pte-list as the nulls value, which helps us
detect whether the "desc" has moved to another rmap, in which case we can re-walk
the rmap.

kvm->slots_lock is held when we do the lockless walking, which prevents the rmap
from being reused (freeing an rmap needs to hold that lock), so we cannot see the
same nulls value used on different rmaps.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
+{
+   unsigned long marker;
+
+   marker = (unsigned long)pte_list | 1UL;
+   desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+   return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+   return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
first_free = __find_first_free(desc);
 
-   for (desc_num = 0; desc->more; desc = desc->more)
+   for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;
 
return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+   desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
 * Only one entry existing but still use a desc to store it?
 */
-   WARN_ON(!next_desc);
+   WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Only one entry in this desc, move the entry to the head
 * then the desc can be freed.
 */
-   if (!first_desc->sptes[1] && !first_desc->more) {
+   if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
 
rmap_printk("pte_list_remove:  %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
return fn((u64 *)*pte_list);
 
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+   WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
rmap_iterator *iter)
 
iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+   WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}
 
-   iter->desc = iter->desc->more;
+   iter->desc = desc_is_a_nulls(iter->desc->more) ?
+   NULL : iter->desc->more;
 
if (iter->desc) {
iter->pos = 0;
-- 
1.8.1.4



[PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker

2013-10-23 Thread Xiao Guangrong
The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has moved to a different pte-list.

Note, we should do a bottom-up walk in the desc since we always move
the bottom entry to the deleted position.

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 57 ++
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4687329..a864140 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+   (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-   *pte_list = (unsigned long)spte;
+   rcu_assign_pte_list(pte_list, spte);
return 0;
}
 
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}
 
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+  pte_list_walk_fn fn)
+{
+   struct pte_list_desc *desc;
+   unsigned long pte_list_value;
+   int i;
+
+restart:
+   /*
+* Force the pte_list to be reloaded.
+*
+* See the comments in hlist_nulls_for_each_entry_rcu().
+*/
+   barrier();
+   pte_list_value = *rcu_dereference(pte_list);
+   if (!pte_list_value)
+   return;
+
+   if (!(pte_list_value & 1))
+   return fn((u64 *)pte_list_value);
+
+   desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+   while (!desc_is_a_nulls(desc)) {
+   /*
+* We should do top-down walk since we always use the higher
+* indices to replace the deleted entry if only one desc is
+* used in the rmap when a spte is removed. Otherwise the
+* moved entry will be missed.
+*/
+   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+   if (desc->sptes[i])
+   fn(desc->sptes[i]);
+
+   desc = rcu_dereference(desc->more);
+
+   /* It is being initialized. */
+   if (unlikely(!desc))
+   goto restart;
+   }
+
+   if (unlikely(desc_get_nulls_value(desc) != pte_list))
+   goto restart;
+}
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
 {
@@ -4615,7 +4664,7 @@ int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
-   0, 0, NULL);
+   0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list

2013-10-23 Thread Xiao Guangrong
Change the algorithm to:
1) always add a new desc as the first desc (pointed to by parent_ptes/rmap),
   which is good for implementing rcu-nulls-list-like lockless rmap
   walking

2) always move the entry in the first desc to the position we want
   to remove when deleting a spte from the parent_ptes/rmap (forward-move).
   This is good for implementing a lockless rmap walk since in the current
   code, when a spte is deleted from the "desc", another spte in the last
   "desc" is moved to this position to replace the deleted one. If the
   deleted one has been accessed and we do not access the replaced one, the
   replaced one is missed when we do the lockless walk.
   To fix this case, we do not move the spte backward; instead, we forward
   move the entry: when a spte is deleted, we move the entry in the first
   desc to that position

Both of these also reduce cache misses

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 179 -
 1 file changed, 123 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e85eed6..5cce039 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static int __find_first_free(struct pte_list_desc *desc)
+{
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   if (!desc->sptes[i])
+   break;
+   return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+   int free = __find_first_free(desc);
+
+   WARN_ON(free >= PTE_LIST_EXT);
+   return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+   int used = __find_first_free(desc) - 1;
+
+   WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+   return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+   int first_free, desc_num;
+
+   first_free = __find_first_free(desc);
+
+   for (desc_num = 0; desc->more; desc = desc->more)
+   desc_num++;
+
+   return first_free + desc_num * PTE_LIST_EXT;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -923,98 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
  *
  * Returns the number of pte entries before the spte was added or zero if
  * the spte was not added.
- *
  */
 static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
 {
struct pte_list_desc *desc;
-   int i, count = 0;
+   int free_pos;
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
-   } else if (!(*pte_list & 1)) {
+   return 0;
+   }
+
+   if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
-   ++count;
-   } else {
-   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-   desc = desc->more;
-   count += PTE_LIST_EXT;
-   }
-   if (desc->sptes[PTE_LIST_EXT-1]) {
-   desc->more = mmu_alloc_pte_list_desc(vcpu);
-   desc = desc->more;
-   }
-   for (i = 0; desc->sptes[i]; ++i)
-   ++count;
-   desc->sptes[i] = spte;
+   return 1;
}
-   return count;
+
+   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+   /* No empty entry in the desc. */
+   if (desc->sptes[PTE_LIST_EXT - 1]) {
+   struct pte_list_desc *new_desc;
+   new_desc = mmu_alloc_pte_list_desc(vcpu);
+   new_desc->more = desc;
+   desc = new_desc;
+   *pte_list = (unsigned long)desc | 1;
+   }
+
+   free_pos = find_first_free(desc);
+   desc->sptes[free_pos] = spte;
+   return count_spte_number(desc) - 1;
 }
 
 static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-  int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+  struct pte_list_desc *desc, int i)
 {
-   int j;
+   struct pte_list_d

[PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log

2013-10-23 Thread Xiao Guangrong
kvm_vm_ioctl_get_dirty_log() write-protects sptes based on its dirty
bitmap, so we must ensure that a writable spte can be found in the rmap
before the corresponding dirty bit becomes visible. Otherwise we clear the
dirty bitmap but fail to write-protect the page; the races are detailed in
the comments added by this patch
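In barrier-pairing terms, the rule being enforced looks roughly like this
(editorial sketch only -- the calls are the ones the patch orders, shown out
of context; the real code and the full race diagrams are in the hunks below):

	/* fault path (producer) */
	mmu_spte_update(sptep, spte);		/* 1) spte becomes writable        */
	rmap_add(vcpu, sptep, gfn);		/* 2) and is reachable via rmap    */
	smp_wmb();				/* publish 1) and 2) ...           */
	mark_page_dirty(vcpu->kvm, gfn);	/* ... before the dirty bit is set */

	/* kvm_vm_ioctl_get_dirty_log() (consumer) */
	mask = xchg(dirty_bitmap, 0);		/* xchg is a full barrier on x86   */
	/* every gfn set in mask now has its writable spte visible in the
	 * rmap, so the write-protection pass cannot miss it */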

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 84 ++
 arch/x86/kvm/x86.c | 10 +++
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 337d173..e85eed6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2427,6 +2427,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 {
u64 spte;
int ret = 0;
+   bool remap = is_rmap_spte(*sptep);
 
if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2488,12 +2489,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
 
-   if (pte_access & ACC_WRITE_MASK)
-   mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+   if (!remap) {
+   if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+   rmap_recycle(vcpu, sptep, gfn);
+
+   if (level > PT_PAGE_TABLE_LEVEL)
+   ++vcpu->kvm->stat.lpages;
+   }
+
+   /*
+* The orders we require are:
+* 1) set spte to writable __before__ set the dirty bitmap.
+*It makes sure that dirty-logging is not missed when do
+*live migration at the final step where kvm should stop
+*the guest and push the remaining dirty pages got from
+*dirty-bitmap to the destination. The similar cases are
+*in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+*
+* 2) add the spte into rmap __before__ set the dirty bitmap.
+*
+* They can ensure we can find the writable spte on the rmap
+* when we do lockless write-protection since
+* kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+* on its dirty-bitmap, otherwise these cases will happen:
+*
+*  CPU 0 CPU 1
+*  kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*  mask = xchg(dirty_bitmap, 0)
+*
+*  try to write-protect gfns which
+*  are set on "mask" then walk then
+*  rmap, see no spte on that rmap
+* add the spte into rmap
+*
+* !! Then the page can be freely wrote but not recorded in
+* the dirty bitmap.
+*
+* And:
+*
+*  VCPU 0CPU 1
+*kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*
+* add spte into rmap
+*mask = xchg(dirty_bitmap, 0)
+*
+*try to write-protect gfns which
+*are set on "mask" then walk then
+*rmap, see spte is on the ramp
+*but it is readonly or nonpresent
+* Mark spte writable
+*
+* !! Then the page can be freely wrote but not recorded in the
+* dirty bitmap.
+*
+* See the comments in kvm_vm_ioctl_get_dirty_log().
+*/
+   smp_wmb();
+
+   if (pte_access & ACC_WRITE_MASK)
+   mark_page_dirty(vcpu->kvm, gfn);
 done:
return ret;
 }
@@ -2503,9 +2565,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 int level, gfn_t gfn, pfn_t pfn, bool speculative,
 bool host_writable)
 {
-   int was_rmapped = 0;
-   int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 *sptep, write_fault, gfn);
 
@@ -2527,8 +2586,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
-   } else
-   was_rmapped = 1;
+   }
}
 
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2546,16 +2604,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 is_large_pte(*sptep)? "2MB" : "4kB",
 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
 *sptep, sptep);
-  

[PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level

2013-10-23 Thread Xiao Guangrong
The sptes on the middle levels obey these rules:
- they are always writable
- they never point to a process page, so SPTE_HOST_WRITEABLE is never set
  on them

So we can recognize a last-level spte from PT_WRITABLE_MASK and
SPTE_HOST_WRITEABLE alone, both of which can be read from the spte itself,
and is_last_spte() no longer needs to depend on the mapping level

This is important for implementing lockless write-protection, where only
the spte value is available
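A quick, self-contained sanity check of that rule (userspace sketch; the bit
values below merely stand in for the real PT_WRITABLE_MASK and
SPTE_HOST_WRITEABLE definitions):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_WRITABLE	(1ull << 1)	/* stands in for PT_WRITABLE_MASK    */
#define SK_HOST_WR	(1ull << 2)	/* stands in for SPTE_HOST_WRITEABLE */

/* Same shape as the new is_last_spte(): only a middle-level spte can be
 * writable while SPTE_HOST_WRITEABLE is clear. */
static bool sk_is_last_spte(uint64_t pte)
{
	return !((pte & SK_WRITABLE) && !(pte & SK_HOST_WR));
}

int main(void)
{
	assert(!sk_is_last_spte(SK_WRITABLE));			/* middle level          */
	assert(sk_is_last_spte(SK_WRITABLE | SK_HOST_WR));	/* last level, writable  */
	assert(sk_is_last_spte(0));				/* last level, read-only */
	return 0;
}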

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 25 -
 arch/x86/kvm/mmu_audit.c   |  6 +++---
 arch/x86/kvm/paging_tmpl.h |  6 ++
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5b42858..8b96d96 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -337,13 +337,13 @@ static int is_rmap_spte(u64 pte)
return is_shadow_present_pte(pte);
 }
 
-static int is_last_spte(u64 pte, int level)
+static int is_last_spte(u64 pte)
 {
-   if (level == PT_PAGE_TABLE_LEVEL)
-   return 1;
-   if (is_large_pte(pte))
-   return 1;
-   return 0;
+   /*
+* All the sptes on the middle level are writable but
+* SPTE_HOST_WRITEABLE is not set.
+*/
+   return !(is_writable_pte(pte) && !(pte & SPTE_HOST_WRITEABLE));
 }
 
 static pfn_t spte_to_pfn(u64 pte)
@@ -2203,7 +2203,7 @@ static bool shadow_walk_okay(struct 
kvm_shadow_walk_iterator *iterator)
 static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
   u64 spte)
 {
-   if (is_last_spte(spte, iterator->level)) {
+   if (is_last_spte(spte)) {
iterator->level = 0;
return;
}
@@ -2255,15 +2255,14 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
-static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
-u64 *spte)
+static bool mmu_page_zap_pte(struct kvm *kvm, u64 *spte)
 {
u64 pte;
struct kvm_mmu_page *child;
 
pte = *spte;
if (is_shadow_present_pte(pte)) {
-   if (is_last_spte(pte, sp->role.level)) {
+   if (is_last_spte(pte)) {
drop_spte(kvm, spte);
if (is_large_pte(pte))
--kvm->stat.lpages;
@@ -2286,7 +2285,7 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
unsigned i;
 
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-   mmu_page_zap_pte(kvm, sp, sp->spt + i);
+   mmu_page_zap_pte(kvm, sp->spt + i);
 }
 
 static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
@@ -3068,7 +3067,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
}
 
sp = page_header(__pa(iterator.sptep));
-   if (!is_last_spte(spte, sp->role.level))
+   if (!is_last_spte(spte))
goto exit;
 
/*
@@ -4316,7 +4315,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
local_flush = true;
while (npte--) {
entry = *spte;
-   mmu_page_zap_pte(vcpu->kvm, sp, spte);
+   mmu_page_zap_pte(vcpu->kvm, spte);
if (gentry &&
  !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
  & mask.word) && rmap_can_add(vcpu))
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index daff69e..d54e2ad 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -45,7 +45,7 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
fn(vcpu, ent + i, level);
 
if (is_shadow_present_pte(ent[i]) &&
- !is_last_spte(ent[i], level)) {
+ !is_last_spte(ent[i])) {
struct kvm_mmu_page *child;
 
child = page_header(ent[i] & PT64_BASE_ADDR_MASK);
@@ -110,7 +110,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 
*sptep, int level)
}
}
 
-   if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep, level))
+   if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep))
return;
 
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
@@ -158,7 +158,7 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 
*sptep)
 
 static void audit_sptes_have_rmaps(struct kvm_vcpu *vcpu, u64 *sptep, int 
level)
 {
-   if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep, level))
+   if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep))
inspect_spte_has_rmap(vcpu->kvm, sptep);
 }
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ad75d77..33f0216 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -809,7 +809,6 @@ stat

[PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-23 Thread Xiao Guangrong
Allocate shadow pages from a slab cache instead of the page allocator:
the frequent allocation and freeing of shadow pages can then be satisfied
from the slab cache, which is very useful for the shadow mmu

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  | 46 ++---
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5cbf316..df9ae10 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;
 
struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
+   struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
@@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t 
shadow_page)
 {
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
 
-   return (struct kvm_mmu_page *)page_private(page);
+   return (struct kvm_mmu_page *)(page->mapping);
 }
 
 static inline u16 kvm_read_ldt(void)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3ae74e6..1bcc8c8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
 __shadow_walk_next(&(_walker), spte))
 
 static struct kmem_cache *pte_list_desc_cache;
+static struct kmem_cache *mmu_shadow_page_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
@@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
   GFP_KERNEL);
if (r)
goto out;
-   r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
+
+   r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+  mmu_shadow_page_cache, 4,
+  GFP_KERNEL);
+   if (r)
+   goto out;
+
+   r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
@@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
pte_list_desc_cache);
+   mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+   mmu_shadow_page_cache);
mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
mmu_page_header_cache);
@@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
+static void set_page_header(struct kvm_mmu_page *sp)
+{
+   struct page *page = virt_to_page(sp->spt);
+
+   WARN_ON(page->mapping);
+   page->mapping = (struct address_space *)sp;
+}
+
+static void clear_page_header(struct kvm_mmu_page *sp)
+{
+   struct page *page = virt_to_page(sp->spt);
+
+   page->mapping = NULL;
+}
+
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
list_del(&sp->link);
-   free_page((unsigned long)sp->spt);
+   clear_page_header(sp);
+   kmem_cache_free(mmu_shadow_page_cache, sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
kmem_cache_free(mmu_page_header_cache, sp);
@@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
struct kvm_mmu_page *sp;
 
sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
+   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
if (!direct)
sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
-   set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+   set_page_header(sp);
 
/*
 * The active_mmu_pages list is the FIFO list, do not move the
@@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
}
 }
 
-static void init_shadow_page_table(struct kvm_mmu_page *sp)
+static void init_shadow_page_table(void *p)
 {
+   u64 *sptp = (u64 *)p;
int i;
 
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-   sp->spt[i] = 0ull;
+   sptp[i] = 0ull;
 }
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2137,7 +2164,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
account_shadowed(vcpu->kvm, gfn);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid

[PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly

2013-10-23 Thread Xiao Guangrong
Since pte_list_desc will be accessed locklessly, we need to initialize its
pointers atomically so that the lockless walker can never observe a
partially written pointer

In this patch we initialize the pointers with plain pointer assignments in
a slab constructor, which are always atomic, instead of using
kmem_cache_zalloc()
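Two editorial side notes on why this works, not taken from the changelog: a
memset(), as used by kmem_cache_zalloc(), makes no promise about the
granularity of its stores, while a plain store to a naturally aligned
pointer is a single machine-word store; and because a slab constructor runs
only when the backing slab page is allocated, not on every
kmem_cache_alloc(), objects must be handed back to the cache with their
pointers cleared again -- which is what the new NULLing in the removal path
below does.

	desc->more = NULL;		/* one aligned pointer-sized store: safe
					   to race with a lockless reader       */
	memset(desc, 0, sizeof(*desc));	/* may be split into arbitrary chunks:
					   a racing reader could see a torn
					   pointer                              */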

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a864140..f3ae74e6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu 
*vcpu)
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
 {
void *obj;
 
if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-   obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+   obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;
 
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+  GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
-  mmu_page_header_cache, 4);
+  mmu_page_header_cache, 4,
+  GFP_KERNEL | __GFP_ZERO);
 out:
return r;
 }
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void pte_list_desc_ctor(void *p)
+{
+   struct pte_list_desc *desc = p;
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   desc->sptes[i] = NULL;
+
+   desc->more = NULL;
+}
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+   first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -4663,8 +4678,8 @@ static void mmu_destroy_caches(void)
 int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-   sizeof(struct pte_list_desc),
-   0, SLAB_DESTROY_BY_RCU, NULL);
+   sizeof(struct pte_list_desc),
+   0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection

2013-10-23 Thread Xiao Guangrong
Use SLAB_DESTROY_BY_RCU so that a shadow page's memory cannot be returned
from the slab to the page allocator while RCU readers may still reference
it, allowing shadow pages to be accessed locklessly under the rcu read lock
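For context, the usual access pattern for a SLAB_DESTROY_BY_RCU cache is
sketched below: the memory cannot be returned to the page allocator while
readers sit in an RCU read-side section, but an object may still be freed
and immediately reused for another object of the same cache, so a lockless
reader must revalidate what it found (the lookup helper and the checks here
are placeholders, not this patch's code):

	rcu_read_lock();
	sp = lockless_lookup_shadow_page(gfn);		/* placeholder lookup  */
	if (sp && sp->gfn == gfn && !sp->role.invalid)	/* revalidate after the
							   dereference         */
		walk_it_locklessly(sp);			/* placeholder         */
	rcu_read_unlock();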

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1bcc8c8..5b42858 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4710,8 +4710,8 @@ int kvm_mmu_module_init(void)
goto nomem;
 
mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
- PAGE_SIZE, PAGE_SIZE, 0,
- init_shadow_page_table);
+  PAGE_SIZE, PAGE_SIZE, SLAB_DESTROY_BY_RCU,
+  init_shadow_page_table);
if (!mmu_shadow_page_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect

2013-10-23 Thread Xiao Guangrong
Now the only user of spte_write_protect() is rmap_write_protect(), which
always calls it with pt_protect = true, so drop that parameter along with
the unused parameter @kvm

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d82bbec..3e4b941 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1340,8 +1340,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 }
 
 /*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1352,25 +1351,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
 {
u64 spte = *sptep;
 
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (pt_protect)
-   spte &= ~SPTE_MMU_WRITEABLE;
-   spte = spte & ~PT_WRITABLE_MASK;
+   spte &= ~SPTE_MMU_WRITEABLE;
+   spte &= ~PT_WRITABLE_MASK;
 
return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
-bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1379,7 +1376,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
 
-   flush |= spte_write_protect(kvm, sptep, pt_protect);
+   flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}
 
@@ -1454,7 +1451,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmapp, true);
+   write_protected |= __rmap_write_protect(rmapp);
}
 
return write_protected;
-- 
1.8.1.4



[PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

2013-10-23 Thread Xiao Guangrong
Currently, when a memslot is marked dirty-logged or its dirty log is
fetched, we need to write-protect a large amount of guest memory. That is
heavy work, especially because we have to hold mmu-lock, which is also
needed by vcpus to fix their page table faults and by the mmu-notifier when
a host page is being changed. On guests with very many cpus or very much
memory this becomes a scalability issue.

This patch introduces a way to write-protect guest memory locklessly.

Lockless rmap walking, lockless shadow page table access and lockless spte
write-protection are now all in place, so it is time to implement page
write-protection out of mmu-lock (see the sketch below for how the pieces
fit together)
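The shape of the resulting lockless path, combining the pieces from the
earlier patches in the series, is roughly (editorial sketch; the iteration
helper is a placeholder, the other calls are the ones this series
introduces):

	rcu_read_lock();			/* descs and sps are SLAB_DESTROY_BY_RCU */
	for_each_dirty_gfn(slot, mask) {	/* placeholder for the mask/index loops  */
		rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
		pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
	}
	rcu_read_unlock();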

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  4 ---
 arch/x86/kvm/mmu.c  | 59 ++---
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/x86.c  | 11 
 4 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df9ae10..cdb6f29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8b96d96..d82bbec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+   u64 spte;
+
+retry:
+   /*
+* Note we may partly read the sptep on 32bit host, however, we
+* allow this case because:
+* - we do not access the page got from the sptep.
+* - cmpxchg64 can detect that case and avoid setting a wrong value
+*   to the sptep.
+*/
+   spte = *rcu_dereference(sptep);
+   if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
+   return;
+
+   if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+   return;
+
+   goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+   pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset, unsigned long mask)
 {
unsigned long *rmapp;
 
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmapp, false);
+   rmap_write_protect_lockless(rmapp);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
 {
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
 
-   spin_lock(&kvm->mmu_lock);
-
+   rcu_read_lock();
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
 
for (index = 0; index <= las

[PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer

2013-10-23 Thread Xiao Guangrong
Use rcu_assign_pointer() to update all the pointers in a desc
and rcu_dereference() to read them locklessly

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e4b941..68dac26 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -937,12 +937,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value)  \
+   rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
 
marker = (unsigned long)pte_list | 1UL;
-   desc->more = (struct pte_list_desc *)marker;
+   rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
 }
 
 static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -999,10 +1010,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
-#define rcu_assign_pte_list(pte_list_p, value) \
-   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
-   (unsigned long *)(value))
-
 /*
  * Pte mapping structures:
  *
@@ -1029,8 +1036,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
-   desc->sptes[0] = (u64 *)*pte_list;
-   desc->sptes[1] = spte;
+   rcu_assign_spte(desc->sptes[0], *pte_list);
+   rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1043,13 +1050,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
*spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
-   new_desc->more = desc;
+   rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
-   desc->sptes[free_pos] = spte;
+   rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
 }
 
@@ -1067,8 +1074,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Move the entry from the first desc to this position we want
 * to remove.
 */
-   desc->sptes[i] = first_desc->sptes[last_used];
-   first_desc->sptes[last_used] = NULL;
+   rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+   rcu_assign_spte(first_desc->sptes[last_used], NULL);
 
/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1080,7 +1087,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
-   *pte_list = (unsigned long)next_desc | 1ul;
+   rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}
 
@@ -1089,8 +1096,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * then the desc can be freed.
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
-   *pte_list = (unsigned long)first_desc->sptes[0];
-   first_desc->sptes[0] = NULL;
+   rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+   rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -1112,7 +1119,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
pr_err("pte_list_remove:  %p 1->BUG\n", spte);
BUG();
}
-   *pte_list = 0;
+   rcu_assign_pte_list(pte_list, 0);
return;
}
 
@@ -1184,9 +1191,12 @@ restart:
 * used in the rmap when a spte is removed. Otherwise the
 * moved entry will be missed.
 */
-   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
-   if (desc->sptes[i])
-  

[PATCH] kvmtools: arm/arm64: emit the MPIDR in DT instead of cpu_id

2013-10-23 Thread Marc Zyngier
kvmtools uses the virtual CPU number to emit the DT CPU nodes.
While this is correct for a flat topology, it fails on anything
else, as the guest expects to find the MPIDR there.

The fix is to ask the kernel for each vcpu's MPIDR, and emit that
instead.

Acked-by: Will Deacon 
Signed-off-by: Marc Zyngier 
---
 tools/kvm/arm/aarch32/include/kvm/kvm-cpu-arch.h |  4 
 tools/kvm/arm/aarch32/kvm-cpu.c  | 26 ++
 tools/kvm/arm/aarch64/include/kvm/kvm-cpu-arch.h |  4 
 tools/kvm/arm/aarch64/kvm-cpu.c  | 28 
 tools/kvm/arm/fdt.c  |  9 +---
 tools/kvm/arm/include/arm-common/kvm-cpu-arch.h  |  2 ++
 6 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/arm/aarch32/include/kvm/kvm-cpu-arch.h 
b/tools/kvm/arm/aarch32/include/kvm/kvm-cpu-arch.h
index b9fda07..d28ea67 100644
--- a/tools/kvm/arm/aarch32/include/kvm/kvm-cpu-arch.h
+++ b/tools/kvm/arm/aarch32/include/kvm/kvm-cpu-arch.h
@@ -9,4 +9,8 @@
[0] = (!!(cpuid) << KVM_ARM_VCPU_POWER_OFF),\
 }
 
+#define ARM_MPIDR_HWID_BITMASK 0xFF
+#define ARM_CPU_ID 0, 0, 0
+#define ARM_CPU_ID_MPIDR   5
+
 #endif /* KVM__KVM_CPU_ARCH_H */
diff --git a/tools/kvm/arm/aarch32/kvm-cpu.c b/tools/kvm/arm/aarch32/kvm-cpu.c
index 6a012db..bd71037 100644
--- a/tools/kvm/arm/aarch32/kvm-cpu.c
+++ b/tools/kvm/arm/aarch32/kvm-cpu.c
@@ -6,6 +6,32 @@
 #define ARM_CORE_REG(x)(KVM_REG_ARM | KVM_REG_SIZE_U32 | 
KVM_REG_ARM_CORE | \
 KVM_REG_ARM_CORE_REG(x))
 
+#define ARM_CP15_REG_SHIFT_MASK(x,n)   \
+   (((x) << KVM_REG_ARM_ ## n ## _SHIFT) & KVM_REG_ARM_ ## n ## _MASK)
+
+#define __ARM_CP15_REG(op1,crn,crm,op2)\
+   (KVM_REG_ARM | KVM_REG_SIZE_U32 |   \
+(15 << KVM_REG_ARM_COPROC_SHIFT)   |   \
+ARM_CP15_REG_SHIFT_MASK(op1, OPC1) |   \
+ARM_CP15_REG_SHIFT_MASK(crn, 32_CRN)   |   \
+ARM_CP15_REG_SHIFT_MASK(crm, CRM)  |   \
+ARM_CP15_REG_SHIFT_MASK(op2, 32_OPC2))
+
+#define ARM_CP15_REG(...)  __ARM_CP15_REG(__VA_ARGS__)
+
+unsigned long kvm_cpu__get_vcpu_mpidr(struct kvm_cpu *vcpu)
+{
+   struct kvm_one_reg reg;
+   u32 mpidr;
+
+   reg.id = ARM_CP15_REG(ARM_CPU_ID, ARM_CPU_ID_MPIDR);
+   reg.addr = (u64)(unsigned long)&mpidr;
+   if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
+   die("KVM_GET_ONE_REG failed (get_mpidr vcpu%ld", vcpu->cpu_id);
+
+   return mpidr;
+}
+
 void kvm_cpu__reset_vcpu(struct kvm_cpu *vcpu)
 {
struct kvm *kvm = vcpu->kvm;
diff --git a/tools/kvm/arm/aarch64/include/kvm/kvm-cpu-arch.h 
b/tools/kvm/arm/aarch64/include/kvm/kvm-cpu-arch.h
index d85c583..7d70c3b 100644
--- a/tools/kvm/arm/aarch64/include/kvm/kvm-cpu-arch.h
+++ b/tools/kvm/arm/aarch64/include/kvm/kvm-cpu-arch.h
@@ -10,4 +10,8 @@
   (!!(kvm)->cfg.arch.aarch32_guest << KVM_ARM_VCPU_EL1_32BIT)) 
\
 }
 
+#define ARM_MPIDR_HWID_BITMASK 0xFF00FFUL
+#define ARM_CPU_ID 3, 0, 0, 0
+#define ARM_CPU_ID_MPIDR   5
+
 #endif /* KVM__KVM_CPU_ARCH_H */
diff --git a/tools/kvm/arm/aarch64/kvm-cpu.c b/tools/kvm/arm/aarch64/kvm-cpu.c
index 7cdcb70..059e42c 100644
--- a/tools/kvm/arm/aarch64/kvm-cpu.c
+++ b/tools/kvm/arm/aarch64/kvm-cpu.c
@@ -10,6 +10,34 @@
 #define ARM64_CORE_REG(x)  (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
 KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))
 
+#define ARM64_SYS_REG_SHIFT_MASK(x,n)  \
+   (((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) &\
+KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
+
+#define __ARM64_SYS_REG(op0,op1,crn,crm,op2)   \
+   (KVM_REG_ARM64 | KVM_REG_SIZE_U64   |   \
+KVM_REG_ARM64_SYSREG   |   \
+ARM64_SYS_REG_SHIFT_MASK(op0, OP0) |   \
+ARM64_SYS_REG_SHIFT_MASK(op1, OP1) |   \
+ARM64_SYS_REG_SHIFT_MASK(crn, CRN) |   \
+ARM64_SYS_REG_SHIFT_MASK(crm, CRM) |   \
+ARM64_SYS_REG_SHIFT_MASK(op2, OP2))
+
+#define ARM64_SYS_REG(...) __ARM64_SYS_REG(__VA_ARGS__)
+
+unsigned long kvm_cpu__get_vcpu_mpidr(struct kvm_cpu *vcpu)
+{
+   struct kvm_one_reg reg;
+   u64 mpidr;
+
+   reg.id = ARM64_SYS_REG(ARM_CPU_ID, ARM_CPU_ID_MPIDR);
+   reg.addr = (u64)&mpidr;
+   if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
+   die("KVM_GET_ONE_REG failed (get_mpidr vcpu%ld", vcpu->cpu_id);
+
+   return mpidr;
+}
+
 static void reset_vcpu_aarch32(struct kvm_cpu *vcpu)
 {
struct kvm *kvm = vcpu->kvm;
diff --git a/tools/kvm/arm/fdt.c b/tools/kvm/arm/fdt.c
index 5e18c11..9a34d98 100644
--- a/tools/kvm/arm/fdt.c
+++ b/tools/kvm/arm/fdt.c
@@ -52,17 +52,20 @@ static void generate_cpu_nodes(void *fdt, struct kvm *kvm)
 

Re: virtio: Large number of tcp connections, vhost_net seems to be a bottleneck

2013-10-23 Thread Jason Wang
On 10/20/2013 04:04 PM, Sahid Ferdjaoui wrote:
> Hi all,
>
> I'm working on creating a large number of tcp connections on a guest;
> The environment is on OpenStack:
>
> Host (dedicated compute node):
>   OS/Kernel: Ubuntu/3.2
>   Cpus: 24
>   Mems: 128GB
>
> Guest (alone on the Host):
>   OS/Kernel: Ubuntu/3.2
>   Cpus: 4
>   Mems: 32GB
>
> Currently a guest can handle about 700 000 established connections; the cpus
> are not loaded and 12GB of memory are used.
> I'm trying to understand why I can't go any higher...
>
> On my host, after several tests with different versions of openvswitch and
> with linux bridge,
> it looks like the vhost_net process is the only process loaded to 100%, and
> it seems vhost_net cannot use more than 1 cpu.
>
> I would like to get more information about vhost_net and whether there is a
> solution to configure it to use more than 1 cpu?

You can, if you enable multiqueue support for virtio-net: vhost then uses
N threads when there are N queue pairs.

See http://www.linux-kvm.org/page/Multiqueue for more information.
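Roughly, the setup described there looks like this (treat it as a sketch --
the exact flag spellings depend on your QEMU version):

  # host: give the guest a 4-queue tap/virtio-net pair (vectors = 2*N + 2)
  qemu-system-x86_64 ... \
      -netdev tap,id=hn0,vhost=on,queues=4 \
      -device virtio-net-pci,netdev=hn0,mq=on,vectors=10

  # guest: enable the extra queue pairs
  ethtool -L eth0 combined 4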
> Thanks a lot,
> s.



[PATCH 3/4] KVM: ARM: Support hugetlbfs backed huge pages

2013-10-23 Thread Christoffer Dall
Support huge pages in KVM/ARM and KVM/ARM64.  The pud_huge checking on
the unmap path may feel a bit silly as the pud_huge check is always
defined to false, but the compiler should be smart about this.

Note: This deals only with VMAs marked as huge, i.e. those that users
allocate through hugetlbfs.  Transparent huge pages can only be detected
by looking at the underlying pages (or the page tables themselves), and
this patch so far simply maps them on a page-by-page basis in the Stage-2
page tables.

Cc: Catalin Marinas 
Cc: Russell King 
Acked-by: Catalin Marinas 
Acked-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_mmu.h |   17 +++-
 arch/arm/include/asm/pgtable-3level.h  |2 +
 arch/arm/kvm/mmu.c |  169 +---
 arch/arm64/include/asm/kvm_mmu.h   |   12 ++-
 arch/arm64/include/asm/pgtable-hwdef.h |2 +
 5 files changed, 158 insertions(+), 44 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 9b28c41..77de4a4 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
 int kvm_mmu_init(void);
 void kvm_clear_hyp_idmap(void);
 
+static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
+{
+   *pmd = new_pmd;
+   flush_pmd_entry(pmd);
+}
+
 static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
 {
*pte = new_pte;
@@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
pte_val(*pte) |= L_PTE_S2_RDWR;
 }
 
+static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
+{
+   pmd_val(*pmd) |= L_PMD_S2_RDWR;
+}
+
 struct kvm;
 
-static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
+static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
+ unsigned long size)
 {
/*
 * If we are going to insert an instruction page and the icache is
@@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm 
*kvm, gfn_t gfn)
 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
 */
if (icache_is_pipt()) {
-   unsigned long hva = gfn_to_hva(kvm, gfn);
-   __cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
+   __cpuc_coherent_user_range(hva, hva + size);
} else if (!icache_is_vivt_asid_tagged()) {
/* any kind of VIPT cache */
__flush_icache_all();
diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index 5689c18..a331d25 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -126,6 +126,8 @@
 #define L_PTE_S2_RDONLY (_AT(pteval_t, 1) << 6)   /* HAP[1]   
*/
 #define L_PTE_S2_RDWR   (_AT(pteval_t, 3) << 6)   /* HAP[2:1] */
 
+#define L_PMD_S2_RDWR   (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
+
 /*
  * Hyp-mode PL2 PTE definitions for LPAE.
  */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index b0de86b..745d8b1 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
 static unsigned long hyp_idmap_end;
 static phys_addr_t hyp_idmap_vector;
 
+#define kvm_pmd_huge(_x)   (pmd_huge(_x))
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
/*
@@ -93,19 +96,29 @@ static bool page_empty(void *ptr)
 
 static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
 {
-   pmd_t *pmd_table = pmd_offset(pud, 0);
-   pud_clear(pud);
-   kvm_tlb_flush_vmid_ipa(kvm, addr);
-   pmd_free(NULL, pmd_table);
+   if (pud_huge(*pud)) {
+   pud_clear(pud);
+   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   } else {
+   pmd_t *pmd_table = pmd_offset(pud, 0);
+   pud_clear(pud);
+   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   pmd_free(NULL, pmd_table);
+   }
put_page(virt_to_page(pud));
 }
 
 static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
 {
-   pte_t *pte_table = pte_offset_kernel(pmd, 0);
-   pmd_clear(pmd);
-   kvm_tlb_flush_vmid_ipa(kvm, addr);
-   pte_free_kernel(NULL, pte_table);
+   if (kvm_pmd_huge(*pmd)) {
+   pmd_clear(pmd);
+   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   } else {
+   pte_t *pte_table = pte_offset_kernel(pmd, 0);
+   pmd_clear(pmd);
+   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   pte_free_kernel(NULL, pte_table);
+   }
put_page(virt_to_page(pmd));
 }
 
@@ -136,18 +149,32 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
continue;
}
 
+   if (pud_huge(*pud)) {
+   

[PATCH 2/4] KVM: ARM: Update comments for kvm_handle_wfi

2013-10-23 Thread Christoffer Dall
Update comments to reflect what is really going on and add the TWE bit
to the comments in kvm_arm.h.

Also rename the function to kvm_handle_wfx, as is done on arm64, for
consistency and uber-correctness.

Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_arm.h |1 +
 arch/arm/kvm/handle_exit.c |   14 --
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index fe395b7..1d3153c 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -57,6 +57,7 @@
  * TSC:Trap SMC
  * TSW:Trap cache operations by set/way
  * TWI:Trap WFI
+ * TWE:Trap WFE
  * TIDCP:  Trap L2CTLR/L2ECTLR
  * BSU_IS: Upgrade barriers to the inner shareable domain
  * FB: Force broadcast of all maintainance operations
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index c4c496f..a920790 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -73,15 +73,17 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 }
 
 /**
- * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a 
guest
+ * kvm_handle_wfx - handle a WFI or WFE instructions trapped in guests
  * @vcpu:  the vcpu pointer
  * @run:   the kvm_run structure pointer
  *
- * Simply sets the wait_for_interrupts flag on the vcpu structure, which will
- * halt execution of world-switches and schedule other host processes until
- * there is an incoming IRQ or FIQ to the VM.
+ * WFE: Yield the CPU and come back to this vcpu when the scheduler
+ * decides to.
+ * WFI: Simply call kvm_vcpu_block(), which will halt execution of
+ * world-switches and schedule other host processes until there is an
+ * incoming IRQ or FIQ to the VM.
  */
-static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
+static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
trace_kvm_wfi(*vcpu_pc(vcpu));
if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
@@ -93,7 +95,7 @@ static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 }
 
 static exit_handle_fn arm_exit_handlers[] = {
-   [HSR_EC_WFI]= kvm_handle_wfi,
+   [HSR_EC_WFI]= kvm_handle_wfx,
[HSR_EC_CP15_32]= kvm_handle_cp15_32,
[HSR_EC_CP15_64]= kvm_handle_cp15_64,
[HSR_EC_CP14_MR]= kvm_handle_cp14_access,
-- 
1.7.10.4



[PATCH 4/4] KVM: ARM: Transparent huge page (THP) support

2013-10-23 Thread Christoffer Dall
Support transparent huge pages in KVM/ARM and KVM/ARM64.  The
transparent_hugepage_adjust code is not very pretty, but this is also how
it is solved on x86 and seems to be simply an artifact of how THPs
behave.  This should eventually be shared across architectures if
possible, but that can always be changed down the road.

Acked-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/mmu.c |   58 ++--
 1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 745d8b1..3719583 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -42,7 +42,7 @@ static unsigned long hyp_idmap_start;
 static unsigned long hyp_idmap_end;
 static phys_addr_t hyp_idmap_vector;
 
-#define kvm_pmd_huge(_x)   (pmd_huge(_x))
+#define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
@@ -576,12 +576,53 @@ out:
return ret;
 }
 
+static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
+{
+   pfn_t pfn = *pfnp;
+   gfn_t gfn = *ipap >> PAGE_SHIFT;
+
+   if (PageTransCompound(pfn_to_page(pfn))) {
+   unsigned long mask;
+   /*
+* The address we faulted on is backed by a transparent huge
+* page.  However, because we map the compound huge page and
+* not the individual tail page, we need to transfer the
+* refcount to the head page.  We have to be careful that the
+* THP doesn't start to split while we are adjusting the
+* refcounts.
+*
+* We are sure this doesn't happen, because mmu_notifier_retry
+* was successful and we are holding the mmu_lock, so if this
+* THP is trying to split, it will be blocked in the mmu
+* notifier before touching any of the pages, specifically
+* before being able to call __split_huge_page_refcount().
+*
+* We can therefore safely transfer the refcount from PG_tail
+* to PG_head and switch the pfn from a tail page to the head
+* page accordingly.
+*/
+   mask = PTRS_PER_PMD - 1;
+   VM_BUG_ON((gfn & mask) != (pfn & mask));
+   if (pfn & mask) {
+   *ipap &= PMD_MASK;
+   kvm_release_pfn_clean(pfn);
+   pfn &= ~mask;
+   kvm_get_pfn(pfn);
+   *pfnp = pfn;
+   }
+
+   return true;
+   }
+
+   return false;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot,
  unsigned long fault_status)
 {
int ret;
-   bool write_fault, writable, hugetlb = false;
+   bool write_fault, writable, hugetlb = false, force_pte = false;
unsigned long mmu_seq;
gfn_t gfn = fault_ipa >> PAGE_SHIFT;
unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
@@ -602,6 +643,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (is_vm_hugetlb_page(vma)) {
hugetlb = true;
gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
+   } else {
+   /*
+* Pages belonging to VMAs not aligned to the PMD mapping
+* granularity cannot be mapped using block descriptors even
+* if the pages belong to a THP for the process, because the
+* stage-2 block descriptor will cover more than a single THP
+* and we loose atomicity for unmapping, updates, and splits
+* of the THP or other pages in the stage-2 block range.
+*/
+   if (vma->vm_start & ~PMD_MASK)
+   force_pte = true;
}
	up_read(&current->mm->mmap_sem);
 
@@ -629,6 +681,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
spin_lock(&kvm->mmu_lock);
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
+   if (!hugetlb && !force_pte)
+   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
if (hugetlb) {
pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
-- 
1.7.10.4



[PATCH 1/4] ARM: KVM: Yield CPU when vcpu executes a WFE

2013-10-23 Thread Christoffer Dall
From: Marc Zyngier 

On an (even slightly) oversubscribed system, spinlocks quickly
become a bottleneck, as some vcpus are spinning, waiting for a
lock to be released, while the vcpu holding the lock may not be
running at all.

This creates contention, and the observed slowdown is 40x for
hackbench. No, this isn't a typo.

The solution is to trap blocking WFEs and tell KVM that we're
now spinning. This ensures that other vcpus will get a scheduling
boost, allowing the lock to be released more quickly. Also, using
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT slightly improves the performance
when the VM is severely overcommitted.

Quick test to estimate the performance: hackbench 1 process 1000

2xA15 host (baseline):  1.843s

2xA15 guest w/o patch:  2.083s
4xA15 guest w/o patch:  80.212s
8xA15 guest w/o patch:  Could not be bothered to find out

2xA15 guest w/ patch:   2.102s
4xA15 guest w/ patch:   3.205s
8xA15 guest w/ patch:   6.887s

So we go from a 40x degradation to 1.5x in the 2x overcommit case,
which is vaguely more acceptable.

Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_arm.h |4 +++-
 arch/arm/kvm/Kconfig   |1 +
 arch/arm/kvm/handle_exit.c |6 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index d556f03..fe395b7 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -67,7 +67,7 @@
  */
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-   HCR_SWIO | HCR_TIDCP)
+   HCR_TWE | HCR_SWIO | HCR_TIDCP)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* System Control Register (SCTLR) bits */
@@ -208,6 +208,8 @@
 #define HSR_EC_DABT(0x24)
 #define HSR_EC_DABT_HYP(0x25)
 
+#define HSR_WFI_IS_WFE (1U << 0)
+
 #define HSR_HVC_IMM_MASK   ((1UL << 16) - 1)
 
 #define HSR_DABT_S1PTW (1U << 7)
diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index ebf5015..466bd29 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -20,6 +20,7 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
select PREEMPT_NOTIFIERS
select ANON_INODES
+   select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
select KVM_ARM_HOST
depends on ARM_VIRT_EXT && ARM_LPAE
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index df4c82d..c4c496f 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
trace_kvm_wfi(*vcpu_pc(vcpu));
-   kvm_vcpu_block(vcpu);
+   if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
+   kvm_vcpu_on_spin(vcpu);
+   else
+   kvm_vcpu_block(vcpu);
+
return 1;
 }
 
-- 
1.7.10.4



[GIT PULL] Second round of KVM/ARM updates for 3.13

2013-10-23 Thread Christoffer Dall
The following changes since commit d570142674890fe10b3d7d86aa105e3dfce1ddfa:

  Merge tag 'kvm-arm-for-3.13-1' of 
git://git.linaro.org/people/cdall/linux-kvm-arm into next (2013-10-16 15:30:32 
+0300)

are available in the git repository at:


  git://git.linaro.org/people/cdall/linux-kvm-arm.git tags/kvm-arm-for-3.13-2

for you to fetch changes up to 9b5fdb9781f74fb15827e465bfb5aa63211953c8:

  KVM: ARM: Transparent huge page (THP) support (2013-10-17 17:06:30 -0700)


Updates for KVM/ARM, take 2 including:
 - Transparent Huge Pages and hugetlbfs support for KVM/ARM
 - Yield CPU when guest executes WFE to speed up CPU overcommit


Christoffer Dall (3):
  KVM: ARM: Update comments for kvm_handle_wfi
  KVM: ARM: Support hugetlbfs backed huge pages
  KVM: ARM: Transparent huge page (THP) support

Marc Zyngier (1):
  ARM: KVM: Yield CPU when vcpu executes a WFE

 arch/arm/include/asm/kvm_arm.h |5 +-
 arch/arm/include/asm/kvm_mmu.h |   17 ++-
 arch/arm/include/asm/pgtable-3level.h  |2 +
 arch/arm/kvm/Kconfig   |1 +
 arch/arm/kvm/handle_exit.c |   20 ++-
 arch/arm/kvm/mmu.c |  223 ++--
 arch/arm64/include/asm/kvm_mmu.h   |   12 +-
 arch/arm64/include/asm/pgtable-hwdef.h |2 +
 8 files changed, 230 insertions(+), 52 deletions(-)