[PATCH 2/3] arm/arm64: KVM: Implement Stage-2 page aging
Until now, KVM/arm didn't care much for page aging (who was swapping
anyway?), and simply provided empty hooks to the core KVM code. With
server-type systems now being available, things are quite different.

This patch implements very simple support for page aging, by clearing
the Access flag in the Stage-2 page tables. On access fault, the
current fault handling will write the PTE or PMD again, putting the
Access flag back on.

It should be possible to implement a much faster handling for Access
faults, but that's left for a later patch.

With this in place, performance in VMs is degraded much more
gracefully.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_host.h   | 13 ++---
 arch/arm/kvm/mmu.c                | 59 ++-
 arch/arm/kvm/trace.h              | 33 ++
 arch/arm64/include/asm/kvm_arm.h  |  1 +
 arch/arm64/include/asm/kvm_host.h | 13 ++---
 5 files changed, 96 insertions(+), 23 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 04b4ea0..d6b5b85 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -163,19 +163,10 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);

 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
-{
-	return 0;
-}
-
-static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
-{
-	return 0;
-}
-
 static inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
							 unsigned long address)
 {
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index e163a45..ffe89a0 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1068,6 +1068,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
 	return ret;
 }
@@ -1102,7 +1103,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)

 	/* Check the stage-2 fault is trans. fault or write fault */
 	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
-	if (fault_status != FSC_FAULT && fault_status != FSC_PERM) {
+	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
+	    fault_status != FSC_ACCESS) {
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
 			kvm_vcpu_trap_get_class(vcpu),
 			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
@@ -1237,6 +1239,61 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
 }

+static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pmd = stage2_get_pmd(kvm, NULL, gpa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		return 0;
+
+	if (kvm_pmd_huge(*pmd)) {	/* THP, HugeTLB */
+		*pmd = pmd_mkold(*pmd);
+		goto tlbi;
+	}
+
+	pte = pte_offset_kernel(pmd, gpa);
+	if (pte_none(*pte))
+		return 0;
+
+	*pte = pte_mkold(*pte);		/* Just a page... */
+tlbi:
+	kvm_tlb_flush_vmid_ipa(kvm, gpa);
+	return 1;
+}
+
+static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pmd = stage2_get_pmd(kvm, NULL, gpa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		return 0;
+
+	if (kvm_pmd_huge(*pmd))		/* THP, HugeTLB */
+		return pmd_young(*pmd);
+
+	pte = pte_offset_kernel(pmd, gpa);
+	if (!pte_none(*pte))		/* Just a page... */
+		return pte_young(*pte);
+
+	return 0;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+{
+	trace_kvm_age_hva(start, end);
+	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
+}
+
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	trace_kvm_test_age_hva(hva);
+	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
+}
+
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
index b6a6e71..364b5382 100644
--- a/arch/arm/kvm/trace.h
+++ b/arch/arm/kvm/trace.h
@@ -203,6 +203,39 @@
[PATCH 0/3] arm/arm64: KVM: Add support for page aging
So far, KVM/arm doesn't implement any support for page aging, leading
to rather bad performance when the system is swapping. This short
series implements the required hooks and fault handling to deal with
pages being marked old/young.

The three patches are fairly straightforward:

- The first patch changes the range iterator to be able to return a
  value.

- The second patch implements the actual page aging (clearing the AF
  bit in the page tables, and relying on the normal faulting code to
  set the bit again).

- The last patch optimizes the access fault path by only doing the
  minimum to satisfy the fault.

The end result is a system that behaves visibly better under load, as
VM pages don't get evicted that easily.

Based on 3.19-rc5, tested on Seattle and X-Gene.

Also at git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/mm-fixes-3.19

Marc Zyngier (3):
  arm/arm64: KVM: Allow handle_hva_to_gpa to return a value
  arm/arm64: KVM: Implement Stage-2 page aging
  arm/arm64: KVM: Optimize handling of Access Flag faults

 arch/arm/include/asm/kvm_host.h   | 13 +---
 arch/arm/kvm/mmu.c                | 128 +++---
 arch/arm/kvm/trace.h              | 48 ++
 arch/arm64/include/asm/kvm_arm.h  | 1 +
 arch/arm64/include/asm/kvm_host.h | 13 +---
 5 files changed, 171 insertions(+), 32 deletions(-)

--
2.1.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] arm/arm64: KVM: Allow handle_hva_to_gpa to return a value
So far, handle_hva_to_gpa was never required to return a value. As we
prepare to age pages at Stage-2, we need to be able to return a value
from the iterator (kvm_test_age_hva).

Adapt the code to handle this situation. No semantic change.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/mmu.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1366625..e163a45 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1146,15 +1146,16 @@ out_unlock:
 	return ret;
 }

-static void handle_hva_to_gpa(struct kvm *kvm,
-			      unsigned long start,
-			      unsigned long end,
-			      void (*handler)(struct kvm *kvm,
-					      gpa_t gpa, void *data),
-			      void *data)
+static int handle_hva_to_gpa(struct kvm *kvm,
+			     unsigned long start,
+			     unsigned long end,
+			     int (*handler)(struct kvm *kvm,
+					    gpa_t gpa, void *data),
+			     void *data)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
+	int ret = 0;

 	slots = kvm_memslots(kvm);

@@ -1178,14 +1179,17 @@ static void handle_hva_to_gpa(struct kvm *kvm,

 		for (; gfn < gfn_end; ++gfn) {
 			gpa_t gpa = gfn << PAGE_SHIFT;
-			handler(kvm, gpa, data);
+			ret |= handler(kvm, gpa, data);
 		}
 	}
+
+	return ret;
 }

-static void kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
 {
 	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
+	return 0;
 }

 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
@@ -1211,11 +1215,12 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }

-static void kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 {
 	pte_t *pte = (pte_t *)data;

 	stage2_set_pte(kvm, NULL, gpa, pte, false);
+	return 0;
 }
--
2.1.4
[PATCH 3/3] arm/arm64: KVM: Optimize handling of Access Flag faults
Now that we have page aging in Stage-2, it becomes obvious that we're
doing way too much work handling the fault.

The page is not going anywhere (it is still mapped), the page tables
are already allocated, and all we want is to flip a bit in the PMD or
PTE. Also, we can avoid any form of TLB invalidation, since a page
with the AF bit off is not allowed to be cached.

An obvious solution is to have a separate handler for FSC_ACCESS,
where we pride ourselves in doing the very minimum amount of work.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/mmu.c   | 46 ++
 arch/arm/kvm/trace.h | 15 +++
 2 files changed, 61 insertions(+)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index ffe89a0..112bae1 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1073,6 +1073,46 @@ out_unlock:
 	return ret;
 }

+/*
+ * Resolve the access fault by making the page young again.
+ * Note that because the faulting entry is guaranteed not to be
+ * cached in the TLB, we don't need to invalidate anything.
+ */
+static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+	pfn_t pfn;
+	bool pfn_valid = false;
+
+	trace_kvm_access_fault(fault_ipa);
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	pmd = stage2_get_pmd(vcpu->kvm, NULL, fault_ipa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		goto out;
+
+	if (kvm_pmd_huge(*pmd)) {	/* THP, HugeTLB */
+		*pmd = pmd_mkyoung(*pmd);
+		pfn = pmd_pfn(*pmd);
+		pfn_valid = true;
+		goto out;
+	}
+
+	pte = pte_offset_kernel(pmd, fault_ipa);
+	if (pte_none(*pte))		/* Nothing there either */
+		goto out;
+
+	*pte = pte_mkyoung(*pte);	/* Just a page... */
+	pfn = pte_pfn(*pte);
+	pfn_valid = true;
+out:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	if (pfn_valid)
+		kvm_set_pfn_accessed(pfn);
+}
+
 /**
  * kvm_handle_guest_abort - handles all 2nd stage aborts
  * @vcpu:	the VCPU pointer
@@ -1140,6 +1180,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	/* Userspace should not be able to register out-of-bounds IPAs */
 	VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);

+	if (fault_status == FSC_ACCESS) {
+		handle_access_fault(vcpu, fault_ipa);
+		ret = 1;
+		goto out_unlock;
+	}
+
 	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
 	if (ret == 0)
 		ret = 1;
diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
index 364b5382..5665a16 100644
--- a/arch/arm/kvm/trace.h
+++ b/arch/arm/kvm/trace.h
@@ -64,6 +64,21 @@ TRACE_EVENT(kvm_guest_fault,
 		  __entry->hxfar, __entry->vcpu_pc)
 );

+TRACE_EVENT(kvm_access_fault,
+	TP_PROTO(unsigned long ipa),
+	TP_ARGS(ipa),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,	ipa	)
+	),
+
+	TP_fast_assign(
+		__entry->ipa = ipa;
+	),
+
+	TP_printk("IPA: %lx", __entry->ipa)
+);
+
 TRACE_EVENT(kvm_irq_line,
	TP_PROTO(unsigned int type, int vcpu_idx, int irq_num, int level),
	TP_ARGS(type, vcpu_idx, irq_num, level),
--
2.1.4
[PATCH v3 2/3] arm/arm64: KVM: Invalidate data cache on unmap
Let's assume a guest has created an uncached mapping, and written to
that page. Let's also assume that the host uses a cache-coherent IO
subsystem. Let's finally assume that the host is under memory pressure
and starts to swap things out.

Before this uncached page is evicted, we need to make sure we
invalidate potential speculated, clean cache lines that are sitting
there, or the IO subsystem is going to swap out the cached view,
losing the data that has been written directly into memory.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 31 +++
 arch/arm/kvm/mmu.c               | 82
 arch/arm64/include/asm/kvm_mmu.h | 18 +
 3 files changed, 116 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 286644c..552c31f 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -44,6 +44,7 @@

 #ifndef __ASSEMBLY__

+#include <linux/highmem.h>
 #include <asm/cacheflush.h>
 #include <asm/pgalloc.h>

@@ -188,6 +189,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 	}
 }

+static inline void __kvm_flush_dcache_pte(pte_t pte)
+{
+	void *va = kmap_atomic(pte_page(pte));
+
+	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+	kunmap_atomic(va);
+}
+
+static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	unsigned long size = PMD_SIZE;
+	pfn_t pfn = pmd_pfn(pmd);
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+		pfn++;
+		size -= PAGE_SIZE;
+
+		kunmap_atomic(va);
+	}
+}
+
+static inline void __kvm_flush_dcache_pud(pud_t pud)
+{
+}
+
 #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))

 void kvm_set_way_flush(struct kvm_vcpu *vcpu);
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 106737e..78e68ab 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -58,6 +58,26 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }

+/*
+ * D-Cache management functions. They take the page table entries by
+ * value, as they are flushing the cache using the kernel mapping (or
+ * kmap on 32bit).
+ */
+static void kvm_flush_dcache_pte(pte_t pte)
+{
+	__kvm_flush_dcache_pte(pte);
+}
+
+static void kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	__kvm_flush_dcache_pmd(pmd);
+}
+
+static void kvm_flush_dcache_pud(pud_t pud)
+{
+	__kvm_flush_dcache_pud(pud);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
				  int min, int max)
 {
@@ -119,6 +139,26 @@ static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
 	put_page(virt_to_page(pmd));
 }

+/*
+ * Unmapping vs dcache management:
+ *
+ * If a guest maps certain memory pages as uncached, all writes will
+ * bypass the data cache and go directly to RAM. However, the CPUs
+ * can still speculate reads (not writes) and fill cache lines with
+ * data.
+ *
+ * Those cache lines will be *clean* cache lines though, so a
+ * clean+invalidate operation is equivalent to an invalidate
+ * operation, because no cache lines are marked dirty.
+ *
+ * Those clean cache lines could be filled prior to an uncached write
+ * by the guest, and the cache coherent IO subsystem would therefore
+ * end up writing old data to disk.
+ *
+ * This is why right after unmapping a page/section and invalidating
+ * the corresponding TLBs, we call kvm_flush_dcache_p*() to make sure
+ * the IO subsystem will never hit in the cache.
+ */
 static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
		       phys_addr_t addr, phys_addr_t end)
 {
@@ -128,9 +168,16 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
 	start_pte = pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
+			pte_t old_pte = *pte;
+
 			kvm_set_pte(pte, __pte(0));
-			put_page(virt_to_page(pte));
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
+
+			/* No need to invalidate the cache for device mappings */
+			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
+				kvm_flush_dcache_pte(old_pte);
+
+			put_page(virt_to_page(pte));
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);

@@ -149,8 +196,13 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
 		next = kvm_pmd_addr_end(addr, end);
 		if (!pmd_none(*pmd)) {
 			if (kvm_pmd_huge(*pmd)) {
+				pmd_t old_pmd = *pmd;
+
				pmd_clear(pmd);
[PATCH v3 0/3] arm/arm64: KVM: Random selection of cache related fixes
This small series fixes a number of issues that Christoffer and I have
been trying to nail down for a while, having to do with the host dying
under load (swapping), and also with the way we deal with caches in
general (and with set/way operations in particular):

- The first one changes the way we handle cache ops by set/way,
  basically turning them into VA ops for the whole memory. This allows
  platforms with system caches to boot a 32bit zImage, for example.

- The second one fixes a corner case that could happen if the guest
  used an uncached mapping (or had its caches off) while the host was
  swapping it out (and using a cache-coherent IO subsystem).

- Finally, the last one fixes a stability issue seen when the host was
  swapping, by using a kernel mapping for cache maintenance instead of
  the userspace one.

With these patches (and both the TLB invalidation and HCR fixes that
are on their way to mainline), the APM platform seems much more robust
than it previously was. Fingers crossed.

The first round of review generated a lot of traffic about ASID-tagged
icache management for guests, but I've decided not to address this
issue as part of this series. The code is broken already, and there
isn't any virtualization-capable, ASID-tagged icache core in the wild,
AFAIK. I'll try to revisit this in another series, once I have wrapped
my head around it (or someone beats me to it).

Based on 3.19-rc5, tested on Juno, X-Gene, TC-2 and Cubietruck.

Also at git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/mm-fixes-3.19

* From v2: [2]
  - Reworked the algorithm that tracks the state of the guest's caches,
    as there were some cases I didn't anticipate. In the end, the
    algorithm is simpler.

* From v1: [1]
  - Dropped Steve's patch after discussion with Andrea
  - Refactored set/way support to avoid code duplication, better comments
  - Much improved comments in patch #2, courtesy of Christoffer

[1]: http://www.spinics.net/lists/kvm-arm/msg13008.html
[2]: http://www.spinics.net/lists/kvm-arm/msg13161.html

Marc Zyngier (3):
  arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  arm/arm64: KVM: Invalidate data cache on unmap
  arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault

 arch/arm/include/asm/kvm_emulate.h   | 10 +++
 arch/arm/include/asm/kvm_host.h      | 3 -
 arch/arm/include/asm/kvm_mmu.h       | 77 +---
 arch/arm/kvm/arm.c                   | 10 ---
 arch/arm/kvm/coproc.c                | 64 +++---
 arch/arm/kvm/coproc_a15.c            | 2 +-
 arch/arm/kvm/coproc_a7.c             | 2 +-
 arch/arm/kvm/mmu.c                   | 164 ++-
 arch/arm/kvm/trace.h                 | 39 +
 arch/arm64/include/asm/kvm_emulate.h | 10 +++
 arch/arm64/include/asm/kvm_host.h    | 3 -
 arch/arm64/include/asm/kvm_mmu.h     | 34 ++--
 arch/arm64/kvm/sys_regs.c            | 75 +++-
 13 files changed, 321 insertions(+), 172 deletions(-)

--
2.1.4
[PATCH v3 1/3] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
Trying to emulate the behaviour of set/way cache ops is fairly
pointless, as there are too many ways we can end up missing stuff.
Also, there are some system caches out there that simply ignore
set/way operations.

So instead of trying to implement them, let's convert it to VA ops,
and use them as a way to re-enable the trapping of VM ops. That way,
we can detect the point when the MMU/caches are turned off, and do a
full VM flush (which is what the guest was trying to do anyway).

This allows a 32bit zImage to boot on the APM thingy, and will
probably help bootloaders in general.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_emulate.h   | 10 +
 arch/arm/include/asm/kvm_host.h      | 3 --
 arch/arm/include/asm/kvm_mmu.h       | 3 +-
 arch/arm/kvm/arm.c                   | 10 -
 arch/arm/kvm/coproc.c                | 64 ++
 arch/arm/kvm/coproc_a15.c            | 2 +-
 arch/arm/kvm/coproc_a7.c             | 2 +-
 arch/arm/kvm/mmu.c                   | 70 -
 arch/arm/kvm/trace.h                 | 39 +++
 arch/arm64/include/asm/kvm_emulate.h | 10 +
 arch/arm64/include/asm/kvm_host.h    | 3 --
 arch/arm64/include/asm/kvm_mmu.h     | 3 +-
 arch/arm64/kvm/sys_regs.c            | 75 +---
 13 files changed, 155 insertions(+), 139 deletions(-)

diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h
index 66ce176..7b01523 100644
--- a/arch/arm/include/asm/kvm_emulate.h
+++ b/arch/arm/include/asm/kvm_emulate.h
@@ -38,6 +38,16 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	vcpu->arch.hcr = HCR_GUEST_MASK;
 }

+static inline unsigned long vcpu_get_hcr(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.hcr;
+}
+
+static inline void vcpu_set_hcr(struct kvm_vcpu *vcpu, unsigned long hcr)
+{
+	vcpu->arch.hcr = hcr;
+}
+
 static inline bool vcpu_mode_is_32bit(struct kvm_vcpu *vcpu)
 {
 	return 1;
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 254e065..04b4ea0 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
	 * Anything that is not used directly from assembly code goes
	 * here.
	 */
-	/* dcache set/way operation pending */
-	int last_pcpu;
-	cpumask_t require_dcache_flush;

 	/* Don't run the guest on this vcpu */
 	bool pause;
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 63e0ecc..286644c 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -190,7 +190,8 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,

 #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))

-void stage2_flush_vm(struct kvm *kvm);
+void kvm_set_way_flush(struct kvm_vcpu *vcpu);
+void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);

 #endif	/* !__ASSEMBLY__ */

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2d6d910..0b0d58a 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);

-	/*
-	 * Check whether this vcpu requires the cache to be flushed on
-	 * this physical CPU. This is a consequence of doing dcache
-	 * operations by set/way on this vcpu. We do it here to be in
-	 * a non-preemptible section.
-	 */
-	if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
-		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
-
 	kvm_arm_set_running_vcpu(vcpu);
 }

@@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);

 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		vcpu->arch.last_pcpu = smp_processor_id();
 		kvm_guest_exit();
 		trace_kvm_exit(*vcpu_pc(vcpu));
 		/*
diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 7928dbd..0afcc00 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -189,82 +189,40 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
 	return true;
 }

-/* See note at ARM ARM B1.14.4 */
+/*
+ * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
+ */
 static bool access_dcsw(struct kvm_vcpu *vcpu,
			const struct coproc_params *p,
			const struct coproc_reg *r)
 {
-	unsigned long val;
-	int cpu;
-
 	if (!p->is_write)
 		return read_from_write_only(vcpu, p);

-	cpu = get_cpu();
-
-	cpumask_setall(&vcpu->arch.require_dcache_flush);
-	cpumask_clear_cpu(cpu,
[PATCH v3 3/3] arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault
When handling a fault in stage-2, we need to resync I$ and D$, just to
be sure we don't leave any old cache line behind.

That's very good, except that we do so using the *user* address. Under
heavy load (swapping like crazy), we may end up in a situation where
the page gets mapped in stage-2 while being unmapped from userspace by
another CPU.

At that point, the DC/IC instructions can generate a fault, which we
handle with kvm->mmu_lock held. The box quickly deadlocks, user is
unhappy.

Instead, perform this invalidation through the kernel mapping, which
is guaranteed to be present. The box is much happier, and so am I.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 43 +++-
 arch/arm/kvm/mmu.c               | 12 +++
 arch/arm64/include/asm/kvm_mmu.h | 13 +++-
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 552c31f..e5614c9 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -162,13 +162,10 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }

-static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
-					     unsigned long size,
-					     bool ipa_uncached)
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+					       unsigned long size,
+					       bool ipa_uncached)
 {
-	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
-		kvm_flush_dcache_to_poc((void *)hva, size);
-
 	/*
	 * If we are going to insert an instruction page and the icache is
	 * either VIPT or PIPT, there is a potential problem where the host
@@ -180,10 +177,38 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
	 *
	 * VIVT caches are tagged using both the ASID and the VMID and doesn't
	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
+	 *
+	 * We need to do this through a kernel mapping (using the
+	 * user-space mapping has proved to be the wrong
+	 * solution). For that, we need to kmap one page at a time,
+	 * and iterate over the range.
	 */
-	if (icache_is_pipt()) {
-		__cpuc_coherent_user_range(hva, hva + size);
-	} else if (!icache_is_vivt_asid_tagged()) {
+
+	bool need_flush = !vcpu_has_cache_enabled(vcpu) || ipa_uncached;
+
+	VM_BUG_ON(size & PAGE_MASK);
+
+	if (!need_flush && !icache_is_pipt())
+		goto vipt_cache;
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		if (need_flush)
+			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+		if (icache_is_pipt())
+			__cpuc_coherent_user_range((unsigned long)va,
+						   (unsigned long)va + PAGE_SIZE);
+
+		size -= PAGE_SIZE;
+		pfn++;
+
+		kunmap_atomic(va);
+	}
+
+vipt_cache:
+	if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
 	}
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 78e68ab..1366625 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -957,6 +957,12 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 	return !pfn_valid(pfn);
 }

+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+				      unsigned long size, bool uncached)
+{
+	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			  struct kvm_memory_slot *memslot, unsigned long hva,
			  unsigned long fault_status)
@@ -1046,8 +1052,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			kvm_set_s2pmd_writable(&new_pmd);
			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva & PMD_MASK, PMD_SIZE,
-					  fault_ipa_uncached);
+		coherent_cache_guest_page(vcpu, pfn, PMD_SIZE, fault_ipa_uncached);
 		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
		pte_t new_pte = pfn_pte(pfn, mem_type);
@@ -1055,8 +1060,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			kvm_set_s2pte_writable(&new_pte);
			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva,
Re: [PATCH v16 00/10] KVM/arm/arm64/x86: dirty page logging for ARMv7/8 (3.18.0-rc2)
On 01/21/2015 03:08 AM, Christoffer Dall wrote:
> On Thu, Jan 15, 2015 at 03:58:51PM -0800, Mario Smarduch wrote:
>> Patch series adds support for armv7/8 dirty page logging. As we move
>> towards generic dirty page logging interface we move some common code
>> to generic layer shared by x86, armv7 and armv8.
>>
>> armv7/8 dirty page logging implementation overview:
>> - Initially write-protects the memory region's 2nd stage page tables.
>> - Reads the dirty page log and again write-protects dirty pages for
>>   the next pass.
>> - Second stage huge pages are dissolved into normal pages to keep
>>   track of dirty memory at page granularity. Tracking at huge page
>>   granularity limits the granularity of marking dirty memory, and
>>   limits migration to a light memory load. Small page size logging
>>   supports higher memory dirty rates and enables rapid migration.
>>   armv7 supports a 2MB huge page; armv8 supports 2MB (4kb pages) and
>>   512MB (64kb pages).
>> - In the event migration is canceled, normal behavior is resumed and
>>   huge pages are rebuilt over time.
>
> Thanks, applied.
>
> -Christoffer

Thanks! And also to other folks that helped along the way in shaping
the design and reviews.

- Mario
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
2015-01-21 12:16-0200, Marcelo Tosatti:
> On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
>> 2015-01-20 15:54-0200, Marcelo Tosatti:
>>> SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
>>> and rdtsc is larger than a given threshold:
>>> [...]
>>> Disable masterclock support (which increases said delta) in case the
>>> boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
>>
>> Why do we care about 2.6.16 bugs in upstream KVM?
>
> Because people do use 2.6.16 guests.

(Those people probably won't use 3.19+ hosts ...
 Is this patch intended for stable?)

>> The code to benefit tradeoff of this patch seems bad to me ...
>
> Can you state the tradeoff and then explain why it is bad ?

Additional code needs time to understand and is a source of bugs, yet
we still include it because we want to achieve something. I meant the
tradeoff between the perceived value of something and the acceptability
of the code. (Ideally, computer programs would be a shorter version of
"Do what I want.\nEOF".)

There are three main points that made me think it is bad:

1) The bug happens because a guest expects greater precision.
   I consider that a guest problem. kvmclock never guaranteed anything,
   so unmet expectations should be a recoverable error.

2) With time, the probability that 2.6.16 is used is getting lower,
   while people looking at KVM's code appear.
   - At what point are we going to drop 2.6.16 support?
     (We shouldn't let mistakes drag us down forever ...
      Or are we dooming KVM on purpose?)

3) The patch made me ask more silly questions than it answered :)
   (Why can't other software depend on previous behavior?
    Why can't kvmclock without master clock still fail?
    Why can't we improve the master clock?)

>> MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
>> MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
>> to support old guests.
>
> What is the benefit of removing support for MSR_KVM_SYSTEM_TIME ?

The maintainability of the code increases. It would look as if we never
made the mistake with MSR_KVM_SYSTEM_TIME & MSR_KVM_WALL_CLOCK.
(I like when old code looks as if we wrote it from scratch.)

After comparing the (imperfectly evaluated) benefit of both variants,

original patch:
 + 2.6.16 SUSE guests work
 - MSR_KVM_SYSTEM_TIME guests don't use master clock
 - KVM code is worse

removal of KVM_FEATURE_CLOCKSOURCE:
 + 2.6.16 SUSE guests likely work
 + KVM code is better
 - MSR_KVM_SYSTEM_TIME guests use even worse clocksource

As KVM_FEATURE_CLOCKSOURCE2 was introduced in 2010, I found the removal
better even without waiting for the last MSR_KVM_SYSTEM_TIME guest to
perish.

> Supporting old guests is important. It comes at a price.

(Mutually exclusive goals are important as well.)
[patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
Since the lapic timer handler only wakes up a simple waitqueue, it can
be executed from hardirq context.

Also handle the case where hrtimer_start_expires fails due to -ETIME,
by injecting the interrupt to the guest immediately.

Reduces average cyclictest latency by 3us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/x86/kvm/lapic.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

Index: linux-stable-rt/arch/x86/kvm/lapic.c
===================================================================
--- linux-stable-rt.orig/arch/x86/kvm/lapic.c	2014-11-25 14:14:38.636810068 -0200
+++ linux-stable-rt/arch/x86/kvm/lapic.c	2015-01-14 14:59:17.840251874 -0200
@@ -1031,8 +1031,38 @@
 				   apic->divide_count);
 }
 
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data);
+
+static void apic_timer_expired(struct hrtimer *data)
+{
+	int ret, i = 0;
+	enum hrtimer_restart r;
+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+	r = apic_timer_fn(data);
+
+	if (r == HRTIMER_RESTART) {
+		do {
+			ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
+			if (ret == -ETIME)
+				hrtimer_add_expires_ns(&ktimer->timer,
+						       ktimer->period);
+			i++;
+		} while (ret == -ETIME && i < 10);
+
+		if (ret == -ETIME) {
+			printk_once(KERN_ERR "%s: failed to reprogram timer\n",
+				    __func__);
+			WARN_ON_ONCE(1);
+		}
+	}
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
+	int ret;
 	ktime_t now;
 
 	atomic_set(&apic->lapic_timer.pending, 0);
@@ -1062,9 +1092,11 @@
 		}
 	}
 
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		      ktime_add_ns(now, apic->lapic_timer.period),
 		      HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016" PRIx64 ","
@@ -1094,8 +1126,10 @@
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 	}
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	local_irq_restore(flags);
 }
@@ -1581,6 +1615,7 @@
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer.irqsafe = 1;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
@@ -1699,7 +1734,8 @@
 	timer = &vcpu->arch.apic->lapic_timer.timer;
 	if (hrtimer_cancel(timer))
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+		if (hrtimer_start_expires(timer, HRTIMER_MODE_ABS) == -ETIME)
+			apic_timer_expired(timer);
 }
 
 /*
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
2015-01-20 23:04+0200, Nadav Amit:
> Radim Krčmář <rkrc...@redhat.com> wrote:
> > 2015-01-14 01:27+, Wu, Feng:
> > > > > the new hardware even doesn't consider the TPR for lowest
> > > > > priority interrupts delivery.
> > > >
> > > > A bold move ... what hardware was the first to do so?
> > >
> > > I think it was starting with Nehalem.
> >
> > Thanks, (could be that QPI can't inform about TPR changes anymore ...)
> >
> > I played with Linux's TPR on Haswell and found that it has no effect.
>
> Sorry for jumping into the discussion, but doesn't it depend on
> IA32_MISC_ENABLE[23]?  This bit disables xTPR messages.  On my machine
> it is set (probably by the BIOS), but since IA32_MISC_ENABLE is not
> locked for changes, the OS can control it.

Thanks, I didn't know about it.

On Ivy Bridge EP (the only modern machine at hand), the bit was set by
default.  After clearing it, TPR still had no effect.

The most relevant mention of xTPR I found is related to FSB [1].
[2] isn't enlightening, so there might be more from the QPI era ...

---
1: Intel® E7320 Memory Controller Hub (MCH) Datasheet
   http://www.intel.com/content/dam/doc/datasheet/e7320-memory-controller-hub-datasheet.pdf
   5.2.2 System Bus Interrupts
2: Intel® Xeon® Processor E5 v2 Family: Datasheet, Vol. 2
   http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v2-datasheet-vol-2.pdf
   6.1.2 IntControl
[patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq
The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up the vcpu thread from hardirq context, thus avoiding
the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT are sleepable
locks.  Therefore, waking up a waitqueue waiter involves locking a
sleeping lock, which is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60 -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/arm/kvm/arm.c                  |    4 ++--
 arch/arm/kvm/psci.c                 |    4 ++--
 arch/mips/kvm/kvm_mips.c            |    8 ++++----
 arch/powerpc/include/asm/kvm_host.h |    4 ++--
 arch/powerpc/kvm/book3s_hv.c        |   20 ++++++++++----------
 arch/s390/include/asm/kvm_host.h    |    2 +-
 arch/s390/kvm/interrupt.c           |   22 ++++++++++------------
 arch/s390/kvm/sigp.c                |   16 ++++++++--------
 arch/x86/kvm/lapic.c                |    6 +++---
 include/linux/kvm_host.h            |    4 ++--
 virt/kvm/async_pf.c                 |    4 ++--
 virt/kvm/kvm_main.c                 |   16 ++++++++--------
 12 files changed, 54 insertions(+), 56 deletions(-)

Index: linux-stable-rt/arch/arm/kvm/arm.c
===================================================================
--- linux-stable-rt.orig/arch/arm/kvm/arm.c	2014-11-25 14:13:39.188899952 -0200
+++ linux-stable-rt/arch/arm/kvm/arm.c	2014-11-25 14:14:38.620810092 -0200
@@ -495,9 +495,9 @@
 static void vcpu_pause(struct kvm_vcpu *vcpu)
 {
-	wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
+	struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
 
-	wait_event_interruptible(*wq, !vcpu->arch.pause);
+	swait_event_interruptible(*wq, !vcpu->arch.pause);
 }
 
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
Index: linux-stable-rt/arch/arm/kvm/psci.c
===================================================================
--- linux-stable-rt.orig/arch/arm/kvm/psci.c	2014-11-25 14:13:39.189899951 -0200
+++ linux-stable-rt/arch/arm/kvm/psci.c	2014-11-25 14:14:38.620810092 -0200
@@ -36,7 +36,7 @@
 {
 	struct kvm *kvm = source_vcpu->kvm;
 	struct kvm_vcpu *vcpu = NULL, *tmp;
-	wait_queue_head_t *wq;
+	struct swait_head *wq;
 	unsigned long cpu_id;
 	unsigned long mpidr;
 	phys_addr_t target_pc;
@@ -80,7 +80,7 @@
 	smp_mb();		/* Make sure the above is visible */
 
 	wq = kvm_arch_vcpu_wq(vcpu);
-	wake_up_interruptible(wq);
+	swait_wake_interruptible(wq);
 
 	return KVM_PSCI_RET_SUCCESS;
 }
Index: linux-stable-rt/arch/mips/kvm/kvm_mips.c
===================================================================
--- linux-stable-rt.orig/arch/mips/kvm/kvm_mips.c	2014-11-25 14:13:39.191899948 -0200
+++ linux-stable-rt/arch/mips/kvm/kvm_mips.c	2014-11-25 14:14:38.621810091 -0200
@@ -464,8 +464,8 @@
 
 	dvcpu->arch.wait = 0;
 
-	if (waitqueue_active(&dvcpu->wq)) {
-		wake_up_interruptible(&dvcpu->wq);
+	if (swaitqueue_active(&dvcpu->wq)) {
+		swait_wake_interruptible(&dvcpu->wq);
 	}
 
 	return 0;
@@ -971,8 +971,8 @@
 	kvm_mips_callbacks->queue_timer_int(vcpu);
 
 	vcpu->arch.wait = 0;
-	if (waitqueue_active(&vcpu->wq)) {
-		wake_up_interruptible(&vcpu->wq);
+	if (swaitqueue_active(&vcpu->wq)) {
+		swait_wake_interruptible(&vcpu->wq);
 	}
 }
Index: linux-stable-rt/arch/powerpc/include/asm/kvm_host.h
===================================================================
--- linux-stable-rt.orig/arch/powerpc/include/asm/kvm_host.h	2014-11-25 14:13:39.193899944 -0200
+++ linux-stable-rt/arch/powerpc/include/asm/kvm_host.h	2014-11-25 14:14:38.621810091 -0200
@@ -295,7 +295,7 @@
 	u8 in_guest;
 	struct list_head runnable_threads;
 	spinlock_t lock;
-	wait_queue_head_t wq;
+	struct swait_head wq;
 	u64 stolen_tb;
 	u64 preempt_tb;
 	struct kvm_vcpu *runner;
@@ -612,7 +612,7 @@
 	u8 prodded;
 	u32 last_inst;
 
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 	struct kvmppc_vcore *vcore;
 	int ret;
 	int trap;
Index: linux-stable-rt/arch/powerpc/kvm/book3s_hv.c
===================================================================
--- linux-stable-rt.orig/arch/powerpc/kvm/book3s_hv.c	2014-11-25 14:13:39.195899942 -0200
+++ linux-stable-rt/arch/powerpc/kvm/book3s_hv.c	2014-11-25 14:14:38.625810085 -0200
@@ -74,11 +74,11 @@
 {
 	int me;
 	int cpu = vcpu->cpu;
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 
 	wqp
[patch -rt 0/2] use simple waitqueue for kvm vcpu waitqueue (v4)
Against the v3.14-rt branch of
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up the vcpu thread from hardirq context, thus avoiding
the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT are sleepable
locks.  Therefore, waking up a waitqueue waiter involves locking a
sleeping lock, which is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60 -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

v2: improve changelog (Rik van Riel)
v3: limit (once) guest triggered printk and WARN_ON (Paolo Bonzini)
v4: fix typo (Steven Rostedt)
Re: [PATCH v12 10/18] vfio/platform: trigger an interrupt via eventfd
On Wed, 2015-01-21 at 13:49 +0100, Baptiste Reynal wrote:
> From: Antonios Motakis <a.mota...@virtualopensystems.com>
>
> This patch allows to set an eventfd for a platform device's interrupt,
> and also to trigger the interrupt eventfd from userspace for testing.
>
> Level sensitive interrupts are marked as maskable and are handled in
> a later patch.  Edge triggered interrupts are not advertised as
> maskable and are implemented here using a simple and efficient IRQ
> handler.
>
> Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
> [Baptiste Reynal: fix masked interrupt initialization]
> Signed-off-by: Baptiste Reynal <b.rey...@virtualopensystems.com>
> ---
>  drivers/vfio/platform/vfio_platform_irq.c     | 98 ++++++++++++++++++-
>  drivers/vfio/platform/vfio_platform_private.h |  2 +
>  2 files changed, 98 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
> index df5c919..4b1ee22 100644
> --- a/drivers/vfio/platform/vfio_platform_irq.c
> +++ b/drivers/vfio/platform/vfio_platform_irq.c
> @@ -39,12 +39,96 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
>  	return -EINVAL;
>  }
>
> +static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
> +{
> +	struct vfio_platform_irq *irq_ctx = dev_id;
> +
> +	eventfd_signal(irq_ctx->trigger, 1);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int vfio_set_trigger(struct vfio_platform_device *vdev, int index,
> +			    int fd, irq_handler_t handler)
> +{
> +	struct vfio_platform_irq *irq = &vdev->irqs[index];
> +	struct eventfd_ctx *trigger;
> +	int ret;
> +
> +	if (irq->trigger) {
> +		free_irq(irq->hwirq, irq);
> +		kfree(irq->name);
> +		eventfd_ctx_put(irq->trigger);
> +		irq->trigger = NULL;
> +	}
> +
> +	if (fd < 0) /* Disable only */
> +		return 0;
> +
> +	irq->name = kasprintf(GFP_KERNEL, "vfio-irq[%d](%s)",
> +			      irq->hwirq, vdev->name);
> +	if (!irq->name)
> +		return -ENOMEM;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		kfree(irq->name);
> +		return PTR_ERR(trigger);
> +	}
> +
> +	irq->trigger = trigger;
> +
> +	irq_set_status_flags(irq->hwirq, IRQ_NOAUTOEN);
> +	ret = request_irq(irq->hwirq, handler, 0, irq->name, irq);
> +	if (ret) {
> +		kfree(irq->name);
> +		eventfd_ctx_put(trigger);
> +		irq->trigger = NULL;
> +		return ret;
> +	}
> +
> +	if (!irq->masked)
> +		enable_irq(irq->hwirq);

Unfortunately, irq->masked doesn't exist until the next patch.

Thanks,

Alex
Re: [PATCH v12 12/18] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
On Wed, 2015-01-21 at 13:50 +0100, Baptiste Reynal wrote:
> From: Antonios Motakis <a.mota...@virtualopensystems.com>
>
> We want to reuse virqfd functionality in multiple VFIO drivers; before
> moving these functions to core VFIO, add the vfio_ prefix to the
> virqfd_enable and virqfd_disable functions, and export them so they
> can be used from other modules.
>
> Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++++++++++++++++--------------
>  drivers/vfio/pci/vfio_pci_private.h |  4 ++--
>  2 files changed, 18 insertions(+), 16 deletions(-)

...

> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 671c17a..2e2f0ea 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
>  extern int vfio_pci_init_perm_bits(void);
>  extern void vfio_pci_uninit_perm_bits(void);
>
> -extern int vfio_pci_virqfd_init(void);
> -extern void vfio_pci_virqfd_exit(void);
> +extern int vfio_virqfd_init(void);
> +extern void vfio_virqfd_exit(void);
>
>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>  extern void vfio_config_free(struct vfio_pci_device *vdev);

This chunk is in the wrong patch, it needs to be moved to the next
patch or else the series isn't bisect-able.

Thanks,

Alex
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
On Wed, Jan 21, 2015 at 06:00:37PM +0100, Radim Krčmář wrote:
> 2015-01-21 12:16-0200, Marcelo Tosatti:
> > On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
> > > 2015-01-20 15:54-0200, Marcelo Tosatti:
> > > > SuSE's 2.6.16 kernel fails to boot if the delta between
> > > > tsc_timestamp and rdtsc is larger than a given threshold:
> > > > [...]
> > > > Disable masterclock support (which increases said delta) in case
> > > > the boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
> > >
> > > Why do we care about 2.6.16 bugs in upstream KVM?
> >
> > Because people do use 2.6.16 guests.
>
> (Those people probably won't use a 3.19+ host ... Is this patch
> intended for stable?)

Yes.

> > > The code to benefit tradeoff of this patch seems bad to me ...
> >
> > Can you state the tradeoff and then explain why it is bad?
>
> Additional code needs time to understand and is a source of bugs, yet
> we still include it because we want to achieve something.  I meant the
> tradeoff between the perceived value of something and the
> acceptability of the code.  (Ideally, computer programs would be a
> shorter version of "Do what I want.\nEOF".)
>
> There are three main points that made me think it is bad:
>
> 1) The bug happens because a guest expects greater precision.
>    I consider that a guest problem.  kvmclock never guaranteed
>    anything, so unmet expectations should be a recoverable error.

delta = pvclock_data.tsc_timestamp - RDTSC

The guest expects delta to be smaller than a given threshold.  It does
not expect greater precision.  The size of delta does not affect
precision.

> 2) With time, the probability that 2.6.16 is used is getting lower,
>    while people looking at KVM's code appear.
>    - At what point are we going to drop 2.6.16 support?
>      (We shouldn't let mistakes drag us down forever ... Or are we
>      dooming KVM on purpose?)

One of the features of virtualization is to be able to run old
operating systems?

> 3) The patch made me ask more silly questions than it answered :)
>    (Why can't other software depend on previous behavior?

Documentation/virtual/kvm/msr.txt:

  "whose data will be filled in by the hypervisor periodically.
   Only one write, or registration, is needed for each VCPU.  The
   interval between updates of this structure is arbitrary and
   implementation-dependent.  The hypervisor may update this structure
   at any time it sees fit until anything with bit0 == 0 is written to
   it."

>     Why can't kvmclock without master clock still fail?

It can, given a loaded system.

>     Why can't we improve the master clock?)

Out of context question.

> > > MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
> > > MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we
> > > want to support old guests.
> >
> > What is the benefit of removing support for MSR_KVM_SYSTEM_TIME?
>
> The maintainability of the code increases.  It would look as if we
> never made the mistake with MSR_KVM_SYSTEM_TIME & MSR_KVM_WALL_CLOCK.
> (I like when old code looks as if we wrote it from scratch.)
>
> After comparing the (imperfectly evaluated) benefit of both variants,
>
>   original patch:
>   + 2.6.16 SUSE guests work
>   - MSR_KVM_SYSTEM_TIME guests don't use master clock
>   - KVM code is worse
>
>   removal of KVM_FEATURE_CLOCKSOURCE:
>   + 2.6.16 SUSE guests likely work

All guests which depend on KVM_FEATURE_CLOCKSOURCE will timedrift.

>   + KVM code is better
>   - MSR_KVM_SYSTEM_TIME guests use an even worse clocksource
>
> As KVM_FEATURE_CLOCKSOURCE2 was introduced in 2010, I found the
> removal better even without waiting for the last MSR_KVM_SYSTEM_TIME
> guest to perish.
>
> Supporting old guests is important.  It comes at a price.
> (Mutually exclusive goals are important as well.)

This phrase is awkward.  Overlapping goals are negative, then?
(Think of a large number of totally overlapping goals.)
[RFC v3 1/2] x86/xen: add xen_is_preemptible_hypercall()
From: Luis R. Rodriguez <mcg...@suse.com>

On kernels with voluntary or no preemption, we can run into situations
where a hypercall issued through userspace will linger around as it
addresses sub-operations in kernel context (multicalls).  Such
operations can trigger soft lockup detection.

We want a way to let the kernel voluntarily preempt such calls even on
non-preempt kernels; to address this we first need to distinguish which
hypercalls fall under this category.  This implements
xen_is_preemptible_hypercall(), which lets us do just that by adding a
secondary hypercall page: calls made via the new page may be preempted.

Andrew had originally submitted a version of this work [0].

[0] http://lists.xen.org/archives/html/xen-devel/2014-02/msg01056.html

Based on original work by: Andrew Cooper <andrew.coop...@citrix.com>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Borislav Petkov <b...@suse.de>
Cc: David Vrabel <david.vra...@citrix.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
---
 arch/x86/include/asm/xen/hypercall.h | 20 ++++++++++++++++++++
 arch/x86/xen/enlighten.c             |  7 +++++++
 arch/x86/xen/xen-head.S              | 18 +++++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index ca08a27..221008e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -84,6 +84,22 @@
 extern struct { char _entry[32]; } hypercall_page[];
 
+#ifndef CONFIG_PREEMPT
+extern struct { char _entry[32]; } preemptible_hypercall_page[];
+
+static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs)
+{
+	return !user_mode_vm(regs) &&
+		regs->ip >= (unsigned long)preemptible_hypercall_page &&
+		regs->ip < (unsigned long)preemptible_hypercall_page + PAGE_SIZE;
+}
+#else
+static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs)
+{
+	return false;
+}
+#endif
+
 #define __HYPERCALL		"call hypercall_page+%c[offset]"
 #define __HYPERCALL_ENTRY(x)					\
 	[offset] "i" (__HYPERVISOR_##x * sizeof(hypercall_page[0]))
@@ -215,7 +231,11 @@ privcmd_call(unsigned call,
 	asm volatile("call *%[call]"
 		     : __HYPERCALL_5PARAM
+#ifndef CONFIG_PREEMPT
+		     : [call] "a" (&preemptible_hypercall_page[call])
+#else
 		     : [call] "a" (&hypercall_page[call])
+#endif
 		     : __HYPERCALL_CLOBBER5);
 
 	return (long)__res;
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 6bf3a13..9c01b48 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -84,6 +84,9 @@
 #include "multicalls.h"
 
 EXPORT_SYMBOL_GPL(hypercall_page);
+#ifndef CONFIG_PREEMPT
+EXPORT_SYMBOL_GPL(preemptible_hypercall_page);
+#endif
 
 /*
  * Pointer to the xen_vcpu_info structure or
@@ -1531,6 +1534,10 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 	xen_setup_machphys_mapping();
 
+#ifndef CONFIG_PREEMPT
+	copy_page(preemptible_hypercall_page, hypercall_page);
+#endif
+
 	/* Install Xen paravirt ops */
 	pv_info = xen_info;
 	pv_init_ops = xen_init_ops;
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 674b2225..6e6a9517 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -85,9 +85,18 @@ ENTRY(xen_pvh_early_cpu_init)
 	.pushsection .text
 	.balign PAGE_SIZE
 ENTRY(hypercall_page)
+
+#ifdef CONFIG_PREEMPT
+# define PREEMPT_HYPERCALL_ENTRY(x)
+#else
+# define PREEMPT_HYPERCALL_ENTRY(x) \
+	.global xen_hypercall_##x ## _p ASM_NL \
+	.set preemptible_xen_hypercall_##x, xen_hypercall_##x + PAGE_SIZE ASM_NL
+#endif
 #define NEXT_HYPERCALL(x) \
 	ENTRY(xen_hypercall_##x) \
-	.skip 32
+	.skip 32 ASM_NL \
+	PREEMPT_HYPERCALL_ENTRY(x)
 
 NEXT_HYPERCALL(set_trap_table)
 NEXT_HYPERCALL(mmu_update)
@@ -138,6 +147,13 @@ NEXT_HYPERCALL(arch_4)
 NEXT_HYPERCALL(arch_5)
 NEXT_HYPERCALL(arch_6)
 	.balign PAGE_SIZE
+
+#ifndef CONFIG_PREEMPT
+ENTRY(preemptible_hypercall_page)
+	.skip PAGE_SIZE
+#endif /* CONFIG_PREEMPT */
+
+#undef NEXT_HYPERCALL
 	.popsection
 
 ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux")
-- 
2.1.1
Re: [RFC v3 2/2] x86/xen: allow privcmd hypercalls to be preempted
On Wed, Jan 21, 2015 at 6:17 PM, Luis R. Rodriguez
<mcg...@do-not-panic.com> wrote:
> From: Luis R. Rodriguez <mcg...@suse.com>
>
> Xen has support for splitting heavy work into a series of hypercalls,
> called multicalls, and preempting them through what Xen calls
> continuation [0].  Despite this, without CONFIG_PREEMPT preemption
> won't happen, and without preemption a system can become pretty
> useless on heavy-handed hypercalls.  Such is the case, for example,
> when creating a 50 GiB HVM guest -- we can get softlockups [1] with:
>
> kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> The soft lockup triggers on the TASK_UNINTERRUPTIBLE hanger check
> (default 120 seconds); on the Xen side, in this particular case, this
> happens when the following Xen hypervisor code is used:
>
> xc_domain_set_pod_target()
>   --> do_memory_op()
>     --> arch_memory_op()
>       --> p2m_pod_set_mem_target()
>         -- long delay (real or emulated) --
>
> This happens on arch_memory_op() on the XENMEM_set_pod_target memory
> op, even though arch_memory_op() can handle continuation via
> hypercall_create_continuation(), for example.
>
> Machines with over 50 GiB of memory are in high demand and hard to
> come by, so to help replicate this sort of issue, long delays on
> select hypercalls have been emulated in order to be able to test this
> on smaller machines [2].
>
> On one hand this issue can be considered expected, given that
> CONFIG_PREEMPT=n is used; however, we have precedent for forced
> voluntary preemption in the kernel even with CONFIG_PREEMPT=n, through
> the usage of cond_resched() sprinkled in many places.  To address this
> issue with Xen hypercalls, though, we need to find a way to aid the
> scheduler in the middle of hypercalls.
>
> We are motivated to address this issue on CONFIG_PREEMPT=n as
> otherwise the system becomes rather unresponsive for long periods of
> time; in the worst case -- at least currently only by emulating long
> delays on select I/O disk-bound hypercalls -- this can lead to
> filesystem corruption if the delay happens, for example, on
> SCHEDOP_remote_shutdown (when we call 'xl domain shutdown').
>
> We can address this problem by checking whether we should schedule on
> the xen timer in the middle of a hypercall, on the return from the
> timer interrupt.  We want to be careful not to always force voluntary
> preemption, though, so we only selectively enable preemption on very
> specific xen hypercalls.
>
> This enables hypercall preemption by selectively forcing checks for
> voluntary preempting only on ioctl-initiated private hypercalls where
> we know some folks have run into reported issues [1].
>
> [0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
> [1] https://bugzilla.novell.com/show_bug.cgi?id=861093
> [2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch
>
> Based on original work by: David Vrabel <david.vra...@citrix.com>
> Suggested-by: Andy Lutomirski <l...@amacapital.net>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: David Vrabel <david.vra...@citrix.com>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: H. Peter Anvin <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: Steven Rostedt <rost...@goodmis.org>
> Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
> Cc: Jan Beulich <jbeul...@suse.com>
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
> ---
>  arch/x86/kernel/entry_32.S       |  2 ++
>  arch/x86/kernel/entry_64.S       |  2 ++
>  drivers/xen/events/events_base.c | 13 +++++++++++++
>  include/xen/events.h             |  1 +
>  4 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 000d419..b4b1f42 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -982,6 +982,8 @@ ENTRY(xen_hypervisor_callback)
>  ENTRY(xen_do_upcall)
>  1:	mov %esp, %eax
>  	call xen_evtchn_do_upcall
> +	movl %esp,%eax
> +	call xen_end_upcall
>  	jmp  ret_from_intr
>  	CFI_ENDPROC
>  ENDPROC(xen_hypervisor_callback)
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 9ebaf63..ee28733 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1198,6 +1198,8 @@ ENTRY(xen_do_hypervisor_callback)   # do_hypervisor_callback(struct *pt_regs)
>  	popq %rsp
>  	CFI_DEF_CFA_REGISTER rsp
>  	decl PER_CPU_VAR(irq_count)
> +	movq %rsp, %rdi  /* pass pt_regs as first argument */
> +	call xen_end_upcall
>  	jmp  error_exit
>  	CFI_ENDPROC
> END(xen_do_hypervisor_callback)
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index b4bca2d..23c526b 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -32,6 +32,8 @@
>  #include <linux/slab.h>
>  #include <linux/irqnr.h>
>  #include <linux/pci.h>
[RFC v3 2/2] x86/xen: allow privcmd hypercalls to be preempted
From: Luis R. Rodriguez <mcg...@suse.com>

Xen has support for splitting heavy work into a series of hypercalls,
called multicalls, and preempting them through what Xen calls
continuation [0].  Despite this, without CONFIG_PREEMPT preemption
won't happen, and without preemption a system can become pretty
useless on heavy-handed hypercalls.  Such is the case, for example,
when creating a 50 GiB HVM guest -- we can get softlockups [1] with:

kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]

The soft lockup triggers on the TASK_UNINTERRUPTIBLE hanger check
(default 120 seconds); on the Xen side, in this particular case, this
happens when the following Xen hypervisor code is used:

xc_domain_set_pod_target()
  --> do_memory_op()
    --> arch_memory_op()
      --> p2m_pod_set_mem_target()
        -- long delay (real or emulated) --

This happens on arch_memory_op() on the XENMEM_set_pod_target memory
op, even though arch_memory_op() can handle continuation via
hypercall_create_continuation(), for example.

Machines with over 50 GiB of memory are in high demand and hard to come
by, so to help replicate this sort of issue, long delays on select
hypercalls have been emulated in order to be able to test this on
smaller machines [2].

On one hand this issue can be considered expected, given that
CONFIG_PREEMPT=n is used; however, we have precedent for forced
voluntary preemption in the kernel even with CONFIG_PREEMPT=n, through
the usage of cond_resched() sprinkled in many places.  To address this
issue with Xen hypercalls, though, we need to find a way to aid the
scheduler in the middle of hypercalls.

We are motivated to address this issue on CONFIG_PREEMPT=n as otherwise
the system becomes rather unresponsive for long periods of time; in the
worst case -- at least currently only by emulating long delays on
select I/O disk-bound hypercalls -- this can lead to filesystem
corruption if the delay happens, for example, on
SCHEDOP_remote_shutdown (when we call 'xl domain shutdown').

We can address this problem by checking whether we should schedule on
the xen timer in the middle of a hypercall, on the return from the
timer interrupt.  We want to be careful not to always force voluntary
preemption, though, so we only selectively enable preemption on very
specific xen hypercalls.

This enables hypercall preemption by selectively forcing checks for
voluntary preempting only on ioctl-initiated private hypercalls where
we know some folks have run into reported issues [1].

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
[1] https://bugzilla.novell.com/show_bug.cgi?id=861093
[2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch

Based on original work by: David Vrabel <david.vra...@citrix.com>
Suggested-by: Andy Lutomirski <l...@amacapital.net>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Borislav Petkov <b...@suse.de>
Cc: David Vrabel <david.vra...@citrix.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
---
 arch/x86/kernel/entry_32.S       |  2 ++
 arch/x86/kernel/entry_64.S       |  2 ++
 drivers/xen/events/events_base.c | 13 +++++++++++++
 include/xen/events.h             |  1 +
 4 files changed, 18 insertions(+)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 000d419..b4b1f42 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -982,6 +982,8 @@ ENTRY(xen_hypervisor_callback)
 ENTRY(xen_do_upcall)
 1:	mov %esp, %eax
 	call xen_evtchn_do_upcall
+	movl %esp,%eax
+	call xen_end_upcall
 	jmp  ret_from_intr
 	CFI_ENDPROC
ENDPROC(xen_hypervisor_callback)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..ee28733 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1198,6 +1198,8 @@ ENTRY(xen_do_hypervisor_callback)   # do_hypervisor_callback(struct *pt_regs)
 	popq %rsp
 	CFI_DEF_CFA_REGISTER rsp
 	decl PER_CPU_VAR(irq_count)
+	movq %rsp, %rdi  /* pass pt_regs as first argument */
+	call xen_end_upcall
 	jmp  error_exit
 	CFI_ENDPROC
END(xen_do_hypervisor_callback)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index b4bca2d..23c526b 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -32,6 +32,8 @@
 #include <linux/slab.h>
 #include <linux/irqnr.h>
 #include <linux/pci.h>
+#include <linux/sched.h>
+#include <linux/kprobes.h>
 
 #ifdef CONFIG_X86
 #include <asm/desc.h>
@@ -1243,6 +1245,17 @@ void xen_evtchn_do_upcall(struct pt_regs *regs)
[RFC v3 0/2] x86/xen: add xen hypercall preemption
From: Luis R. Rodriguez <mcg...@suse.com>

After my last respin, Andy provided some ideas for how to skip IRQ
context hacks for preemption; this v3 spin addresses that and a bit
more.

This is based on both Andrew Cooper's and David Vrabel's work, further
modified based on ideas by Andy Lutomirski to avoid having to deal
with preemption in IRQ context.  Ian had originally suggested avoiding
the pt_regs stuff by using a CPU variable, but based on Andy's
observations it is difficult to prove we will avoid recursing or bad
nesting when dealing with preemption out of IRQ context.  This is
especially true given that after a hypercall gets preempted, the
hypercall may end up on another CPU.

This uses NOKPROBE_SYMBOL and notrace since, based on Andy's advice, I
am not confident that tracing and kprobes are safe to use in what
might be an extended RCU quiescent state (i.e. where we're outside
irq_enter and irq_exit).

I've tested this on 64-bit; some testing on 32-bit would be
appreciated.

Luis R. Rodriguez (2):
  x86/xen: add xen_is_preemptible_hypercall()
  x86/xen: allow privcmd hypercalls to be preempted

 arch/x86/include/asm/xen/hypercall.h | 20 ++++++++++++++++++++
 arch/x86/kernel/entry_32.S           |  2 ++
 arch/x86/kernel/entry_64.S           |  2 ++
 arch/x86/xen/enlighten.c             |  7 +++++++
 arch/x86/xen/xen-head.S              | 18 +++++++++++++++++-
 drivers/xen/events/events_base.c     | 13 +++++++++++++
 include/xen/events.h                 |  1 +
 7 files changed, 62 insertions(+), 1 deletion(-)

-- 
2.1.1
Re: [RFC v3 1/2] x86/xen: add xen_is_preemptible_hypercall()
On Wed, Jan 21, 2015 at 6:17 PM, Luis R. Rodriguez mcg...@do-not-panic.com wrote: From: Luis R. Rodriguez mcg...@suse.com On kernels with voluntary or no preemption we can run into situations where a hypercall issued through userspace will linger around as it addresses sub-operations in kernel context (multicalls). Such operations can trigger soft lockup detection. We want a way to let the kernel voluntarily preempt such calls even on non-preempt kernels; to do this we first need to distinguish which hypercalls fall under this category. This implements xen_is_preemptible_hypercall() which lets us do just that by adding a secondary hypercall page; calls made via the new page may be preempted. Andrew had originally submitted a version of this work [0]. [0] http://lists.xen.org/archives/html/xen-devel/2014-02/msg01056.html Based on original work by: Andrew Cooper andrew.coop...@citrix.com Cc: Andy Lutomirski l...@amacapital.net Cc: Borislav Petkov b...@suse.de Cc: David Vrabel david.vra...@citrix.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: Steven Rostedt rost...@goodmis.org Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com Cc: Jan Beulich jbeul...@suse.com Cc: linux-ker...@vger.kernel.org Signed-off-by: Luis R.
Rodriguez mcg...@suse.com --- arch/x86/include/asm/xen/hypercall.h | 20 arch/x86/xen/enlighten.c | 7 +++ arch/x86/xen/xen-head.S | 18 +- 3 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index ca08a27..221008e 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -84,6 +84,22 @@ extern struct { char _entry[32]; } hypercall_page[]; +#ifndef CONFIG_PREEMPT +extern struct { char _entry[32]; } preemptible_hypercall_page[]; A comment somewhere explaining why only non-preemptible kernels have preemptible hypercalls might be friendly to some future reader. :) + +static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs) +{ + return !user_mode_vm(regs) && + regs->ip >= (unsigned long)preemptible_hypercall_page && + regs->ip < (unsigned long)preemptible_hypercall_page + PAGE_SIZE; +} This makes it seem like the page is indeed one page long, but I don't see what actually allocates a whole page for it. What am I missing? --Andy
Re: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
On 21/01/2015 11:16, Wincy Van wrote: On Wed, Jan 21, 2015 at 4:35 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? Yes, indeed, I've not noticed that kvm handle nested msr bitmap manually, the next version will fix this. BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. Do you mean that we should split this patch from the apicv patch set? I think it's okay to keep it in the same patchset, but you can put it first. Paolo --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case 
EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. */ + return 1; The APIC_WRITE vmexit is introduced by APIC register virtualization, not virtualize x2apic mode. Move it to the next patch. Agreed, will do. Thanks, Wincy
Re: [Xen-devel] [PATCH v14 11/11] pvqspinlock, x86: Enable PV qspinlock for XEN
On 20/01/15 20:12, Waiman Long wrote: This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code. Xen is a word, please don't capitalize it. +void xen_lock_stats(int stat_types) +{ + if (stat_types PV_LOCKSTAT_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + if (stat_types PV_LOCKSTAT_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + if (stat_types PV_LOCKSTAT_KICK_NOHALT) + add_smp(kick_nohlt_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_ABORT) + add_smp(halt_abort_stats, 1); +} +PV_CALLEE_SAVE_REGS_THUNK(xen_lock_stats); This is not inlined and the 6 test-and-branch cannot be optimized away. +/* + * Halt the current CPU release it back to the host + * Return 0 if halted, -1 otherwise. + */ +int xen_halt_cpu(u8 *byte, u8 val) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + u64 start; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return -1; + + /* + * Make sure an interrupt handler can't upset things in a + * partially setup state. + */ + local_irq_save(flags); + start = spin_time_start(); + + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); It's not clear what partially setup state is being protected here. xen_clear_irq_pending() is an atomic bit clear. I think you can drop the irq save/restore here. + /* + * Don't halt if the content of the given byte address differs from + * the expected value. A read memory barrier is added to make sure that + * the latest value of the byte address is fetched. + */ + smp_rmb(); The atomic bit clear in xen_clear_irq_pending() acts as a full memory barrier. I don't think you need an additional memory barrier here, only a compiler one. I suggest using READ_ONCE(). 
+ if (*byte != val) { + xen_lock_stats(PV_LOCKSTAT_HALT_ABORT); + return -1; + } + /* + * If an interrupt happens here, it will leave the wakeup irq + * pending, which will cause xen_poll_irq() to return + * immediately. + */ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); + spin_time_accum_blocked(start); + return 0; +} +PV_CALLEE_SAVE_REGS_THUNK(xen_halt_cpu); + +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); David
Re: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
On 21/01/2015 10:23, Wincy Van wrote: Yes, moving those MSRs looks a bit ugly, but irqchip_in_kernel is per-VM, not a global setting; there would be different kernel_irqchip settings between VMs. If we use irqchip_in_kernel to check it and set different values of the ctl MSRs, I think it may be even worse than moving the MSRs, because this logic should be in an init function, and this setting should be converged. I too prefer your solution. Paolo
Re: [PATCH v16 00/10] KVM/arm/arm64/x86: dirty page logging for ARMv7/8 (3.18.0-rc2)
On Thu, Jan 15, 2015 at 03:58:51PM -0800, Mario Smarduch wrote: Patch series adds support for armv7/8 dirty page logging. As we move towards a generic dirty page logging interface we move some common code to a generic layer shared by x86, armv7 and armv8. armv7/8 dirty page logging implementation overview: - initially write-protects the memory region's 2nd stage page tables - reads the dirty page log and again write-protects dirty pages for the next pass. - second stage huge pages are dissolved into normal pages to keep track of dirty memory at page granularity. Tracking at huge page granularity limits the granularity of marking dirty memory, and limits migration to a light memory load. Small page size logging supports higher memory dirty rates and enables rapid migration. armv7 supports 2MB huge pages, and armv8 supports 2MB (4kb) and 512MB (64kb) - in the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time. Thanks, applied. -Christoffer
Re: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
On Wed, Jan 21, 2015 at 4:35 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? Yes, indeed, I've not noticed that kvm handle nested msr bitmap manually, the next version will fix this. BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. Do you mean that we should split this patch from the apicv patch set? --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. 
*/ + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. Agreed, will do. Thanks, Wincy
Re: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
On Wed, Jan 21, 2015 at 4:49 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it. Agreed, I had thought we had already set L1's IRR before this; I was wrong. BTW, I was trying to complete the nested posted intr manually if the dest vcpu is in_guest_mode but not IN_GUEST_MODE, but I found that it is difficult to set the RVI of the destination vcpu in a timely way, because we should keep RVI, PIR and ON in sync :( I think it is better to do a nested vmexit in the case above, rather than emulate it, because that case is much rarer than the hardware case. Thanks, Wincy.
[question] incremental backup a running vm
Hi, Does drive_mirror support incremental backup of a running VM? Or does another mechanism? Requirements for incremental backup of a running VM: the first time a backup is taken, all of the allocated data is mirrored to the destination, then a copied bitmap is saved to a file, and the bitmap file logs dirty state for the changed data. On the next backup, only the dirty data is mirrored to the destination. Even if the VM is shut down and started again after several days, the bitmap is loaded while starting the VM. Any ideas? Thanks, Zhang Haoyu
Re: [question] incremental backup a running vm
On 21/01/2015 11:32, Zhang Haoyu wrote: Hi, Does drive_mirror support incremental backup a running vm? Or other mechanism does? incremental backup a running vm requirements: First time backup, all of the allocated data will be mirrored to destination, then a copied bitmap will be saved to a file, then the bitmap file will log dirty for the changed data. Next time backup, only the dirty data will be mirrored to destination. Even the VM shutdown and start after several days, the bitmap will be loaded while starting vm. Any ideas? Drive-mirror is for storage migration. For backup there is another job, drive-backup. drive-backup copies a point-in-time snapshot of one or more disks corresponding to when the backup was started. Incremental backup is being worked on. You can see patches on the list. Paolo
RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
Wincy Van wrote on 2015-01-16: To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt and should set the posted intr bit in nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we cannot enable nested posted interrupt, and the posted intr bit in nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl MSR to per-vcpu. Best regards, Yang
Re: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. Thanks, Wincy
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-21: On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it. Thanks, Wincy Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-20: If vcpu has a interrupt in vmx non-root mode, we will kick that vcpu to inject interrupt timely. With posted interrupt processing, the kick intr is not needed, and interrupts are fully taken care of by hardware. In nested vmx, this feature avoids much more vmexits than non-nested vmx. This patch use L0's POSTED_INTR_NV to avoid unexpected interrupt if L1's vector is different with L0's. If vcpu is in hardware's non-root mode, we use a physical ipi to deliver posted interrupts, otherwise we will deliver that interrupt to L1 and kick that vcpu out of nested non-root mode. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 136 ++-- 1 files changed, 132 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ea56e9f..cda9133 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -215,6 +215,7 @@ struct __packed vmcs12 { u64 tsc_offset; u64 virtual_apic_page_addr; u64 apic_access_addr; + u64 posted_intr_desc_addr; u64 ept_pointer; u64 eoi_exit_bitmap0; u64 eoi_exit_bitmap1; @@ -334,6 +335,7 @@ struct __packed vmcs12 { u32 vmx_preemption_timer_value; u32 padding32[7]; /* room for future expansion */ u16 virtual_processor_id; + u16 posted_intr_nv; u16 guest_es_selector; u16 guest_cs_selector; u16 guest_ss_selector; @@ -387,6 +389,7 @@ struct nested_vmx { /* The host-usable pointer to the above */ struct page *current_vmcs12_page; struct vmcs12 *current_vmcs12; + spinlock_t vmcs12_lock; struct vmcs *current_shadow_vmcs; /* * Indicates if the shadow vmcs must be updated with the @@ -406,6 +409,8 @@ struct nested_vmx { */ struct page *apic_access_page; struct page *virtual_apic_page; + struct page *pi_desc_page; + struct pi_desc *pi_desc; u64 msr_ia32_feature_control; struct hrtimer preemption_timer; @@ -621,6 +626,7 @@ static int max_shadow_read_write_fields = static const unsigned short vmcs_field_to_offset_table[] = { FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), + 
FIELD(POSTED_INTR_NV, posted_intr_nv), FIELD(GUEST_ES_SELECTOR, guest_es_selector), FIELD(GUEST_CS_SELECTOR, guest_cs_selector), FIELD(GUEST_SS_SELECTOR, guest_ss_selector), @@ -646,6 +652,7 @@ static const unsigned short vmcs_field_to_offset_table[] = { FIELD64(TSC_OFFSET, tsc_offset), FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr), FIELD64(APIC_ACCESS_ADDR, apic_access_addr), + FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr), FIELD64(EPT_POINTER, ept_pointer), FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), @@ -798,6 +805,7 @@ static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); static bool vmx_mpx_supported(void); static bool vmx_xsaves_supported(void); +static int vmx_vm_has_apicv(struct kvm *kvm); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12) return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); } +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) { + return vmcs12-pin_based_vm_exec_control +PIN_BASED_POSTED_INTR; } + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2362,6 +2375,9 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | PIN_BASED_VMX_PREEMPTION_TIMER; + if (vmx_vm_has_apicv(vmx-vcpu.kvm)) + vmx-nested.nested_vmx_pinbased_ctls_high |= + PIN_BASED_POSTED_INTR; /* exit controls */ rdmsr(MSR_IA32_VMX_EXIT_CTLS, @@ -4267,6 +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm) return enable_apicv irqchip_in_kernel(kvm); } +static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, + int vector) { + int r = 0; + struct vmcs12 *vmcs12; + + /* +* Since posted intr delivery is async, +* 
we must aquire a spin-lock to avoid +* the race of vmcs12. +*/ + spin_lock(to_vmx(vcpu)-nested.vmcs12_lock); + vmcs12 = get_vmcs12(vcpu); + if (!is_guest_mode(vcpu) || !vmcs12) { + r = -1; + goto out; + } + if (vector == vmcs12-posted_intr_nv +
RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. */ + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. case EXIT_REASON_EPT_VIOLATION: /* * L0 always deals with the EPT violation. 
If nested EPT is @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu) ns_to_ktime(preemption_timeout), HRTIMER_MODE_REL); } +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) + return -EINVAL; + return 0; +} + +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + int r; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return 0; + + r = nested_vmx_check_virt_x2apic(vcpu, vmcs12); + if (r) + goto fail; + + /* tpr shadow is needed by all apicv features. */ + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { + r = -EINVAL; + goto fail; + } + + return 0; + +fail: + return r; +} + static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, unsigned long count_field, unsigned long addr_field, @@ -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) else vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(vmx-nested.apic_access_page)); - } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) + + (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm))) { exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; kvm_vcpu_reload_apic_access_page(vcpu); @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; } + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); + return 1; + } + if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; -- 1.7.1 Best regards, Yang
Re: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
On Wed, Jan 21, 2015 at 4:18 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt and should set the posted intr bit in nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we cannot enable nested posted interrupt, and the posted intr bit in nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl MSR to per-vcpu. Yes, moving those MSRs looks a bit ugly, but irqchip_in_kernel is per-VM, not a global setting; there would be different kernel_irqchip settings between VMs. If we use irqchip_in_kernel to check it and set different values of the ctl MSRs, I think it may be even worse than moving the MSRs, because this logic should be in an init function, and this setting should be converged. Thanks, Wincy
Re: [PATCH v14 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled
On 01/21/2015 01:42 AM, Waiman Long wrote: This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- Reviewed-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
[PATCH v12 12/18] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
From: Antonios Motakis <a.mota...@virtualopensystems.com>

We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++++++++++++++++--------------
 drivers/vfio/pci/vfio_pci_private.h |  4 ++--
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index e8d695b..0a41833d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -126,10 +126,10 @@ static void virqfd_inject(struct work_struct *work)
 	virqfd->thread(virqfd->vdev, virqfd->data);
 }
 
-static int virqfd_enable(struct vfio_pci_device *vdev,
-			 int (*handler)(struct vfio_pci_device *, void *),
-			 void (*thread)(struct vfio_pci_device *, void *),
-			 void *data, struct virqfd **pvirqfd, int fd)
+int vfio_virqfd_enable(struct vfio_pci_device *vdev,
+		       int (*handler)(struct vfio_pci_device *, void *),
+		       void (*thread)(struct vfio_pci_device *, void *),
+		       void *data, struct virqfd **pvirqfd, int fd)
 {
 	struct fd irqfd;
 	struct eventfd_ctx *ctx;
@@ -215,9 +215,9 @@ err_fd:
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_enable);
 
-static void virqfd_disable(struct vfio_pci_device *vdev,
-			   struct virqfd **pvirqfd)
+void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd)
 {
 	unsigned long flags;
 
@@ -237,6 +237,7 @@ static void virqfd_disable(struct vfio_pci_device *vdev,
 	 */
 	flush_workqueue(vfio_irqfd_cleanup_wq);
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_disable);
 
 /*
  * INTx
@@ -440,8 +441,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 static void vfio_intx_disable(struct vfio_pci_device *vdev)
 {
 	vfio_intx_set_signal(vdev, -1);
-	virqfd_disable(vdev, &vdev->ctx[0].unmask);
-	virqfd_disable(vdev, &vdev->ctx[0].mask);
+	vfio_virqfd_disable(vdev, &vdev->ctx[0].unmask);
+	vfio_virqfd_disable(vdev, &vdev->ctx[0].mask);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
 	kfree(vdev->ctx);
@@ -605,8 +606,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
 
 	for (i = 0; i < vdev->num_ctx; i++) {
-		virqfd_disable(vdev, &vdev->ctx[i].unmask);
-		virqfd_disable(vdev, &vdev->ctx[i].mask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[i].unmask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[i].mask);
 	}
 
 	if (msix) {
@@ -639,11 +640,12 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
 	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
 		int32_t fd = *(int32_t *)data;
 
 		if (fd >= 0)
-			return virqfd_enable(vdev, vfio_pci_intx_unmask_handler,
-					     vfio_send_intx_eventfd, NULL,
-					     &vdev->ctx[0].unmask, fd);
+			return vfio_virqfd_enable(vdev,
+						  vfio_pci_intx_unmask_handler,
+						  vfio_send_intx_eventfd, NULL,
+						  &vdev->ctx[0].unmask, fd);
 
-		virqfd_disable(vdev, &vdev->ctx[0].unmask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[0].unmask);
 	}
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 671c17a..2e2f0ea 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
 extern int vfio_pci_init_perm_bits(void);
 extern void vfio_pci_uninit_perm_bits(void);
 
-extern int vfio_pci_virqfd_init(void);
-extern void vfio_pci_virqfd_exit(void);
+extern int vfio_virqfd_init(void);
+extern void vfio_virqfd_exit(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
-- 
2.2.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v12 17/18] vfio: initialize the virqfd workqueue in VFIO generic code
From: Antonios Motakis <a.mota...@virtualopensystems.com>

virqfd is now completely decoupled from VFIO_PCI, so we can initialize
it from the generic VFIO code and safely use it from multiple
independent VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/Makefile       | 4 +++-
 drivers/vfio/pci/Makefile   | 3 +--
 drivers/vfio/pci/vfio_pci.c | 8 --------
 drivers/vfio/vfio.c         | 8 ++++++++
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index dadf0ca..d798b09 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,6 @@
-obj-$(CONFIG_VFIO) += vfio.o
+vfio_core-y := vfio.o virqfd.o
+
+obj-$(CONFIG_VFIO) += vfio_core.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index c7c8644..1310792 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,4 @@
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \
-	      ../virqfd.o
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fc4308c..8d156d7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,6 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
-	vfio_virqfd_exit();
 	vfio_pci_uninit_perm_bits();
 }
 
@@ -1025,11 +1024,6 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		return ret;
 
-	/* Start the virqfd cleanup handler */
-	ret = vfio_virqfd_init();
-	if (ret)
-		goto out_virqfd;
-
 	/* Register and scan for devices */
 	ret = pci_register_driver(&vfio_pci_driver);
 	if (ret)
@@ -1038,8 +1032,6 @@ static int __init vfio_pci_init(void)
 	return 0;
 
 out_driver:
-	vfio_virqfd_exit();
-out_virqfd:
 	vfio_pci_uninit_perm_bits();
 	return ret;
 }
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index f018d8d..8e84471 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1464,6 +1464,11 @@ static int __init vfio_init(void)
 	if (ret)
 		goto err_cdev_add;
 
+	/* Start the virqfd cleanup handler used by some VFIO bus drivers */
+	ret = vfio_virqfd_init();
+	if (ret)
+		goto err_virqfd;
+
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
 	/*
@@ -1476,6 +1481,8 @@ static int __init vfio_init(void)
 
 	return 0;
 
+err_virqfd:
+	cdev_del(vfio.group_cdev);
 err_cdev_add:
 	unregister_chrdev_region(vfio.group_devt, MINORMASK);
 err_alloc_chrdev:
@@ -1490,6 +1497,7 @@ static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
 
+	vfio_virqfd_exit();
 	idr_destroy(&vfio.group_idr);
 	cdev_del(vfio.group_cdev);
 	unregister_chrdev_region(vfio.group_devt, MINORMASK);
-- 
2.2.2
[PATCH v12 18/18] vfio/platform: implement IRQ masking/unmasking via an eventfd
From: Antonios Motakis <a.mota...@virtualopensystems.com>

With this patch, the VFIO user can set an eventfd that is used to mask
and unmask IRQs of platform devices.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/platform/vfio_platform_irq.c     | 47 +++++++++++++++++++++++----
 drivers/vfio/platform/vfio_platform_private.h |  2 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index e0e6388..88bba57 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -37,6 +37,15 @@ static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx)
 	spin_unlock_irqrestore(&irq_ctx->lock, flags);
 }
 
+static int vfio_platform_mask_handler(void *opaque, void *unused)
+{
+	struct vfio_platform_irq *irq_ctx = opaque;
+
+	vfio_platform_mask(irq_ctx);
+
+	return 0;
+}
+
 static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
 				      unsigned index, unsigned start,
 				      unsigned count, uint32_t flags,
@@ -48,8 +57,18 @@ static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
 	if (!(vdev->irqs[index].flags & VFIO_IRQ_INFO_MASKABLE))
 		return -EINVAL;
 
-	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
-		return -EINVAL; /* not implemented yet */
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+
+		if (fd >= 0)
+			return vfio_virqfd_enable((void *) &vdev->irqs[index],
+						  vfio_platform_mask_handler,
+						  NULL, NULL,
+						  &vdev->irqs[index].mask, fd);
+
+		vfio_virqfd_disable(&vdev->irqs[index].mask);
+		return 0;
+	}
 
 	if (flags & VFIO_IRQ_SET_DATA_NONE) {
 		vfio_platform_mask(&vdev->irqs[index]);
@@ -78,6 +97,15 @@ static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx)
 	spin_unlock_irqrestore(&irq_ctx->lock, flags);
 }
 
+static int vfio_platform_unmask_handler(void *opaque, void *unused)
+{
+	struct vfio_platform_irq *irq_ctx = opaque;
+
+	vfio_platform_unmask(irq_ctx);
+
+	return 0;
+}
+
 static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
 					unsigned index, unsigned start,
 					unsigned count, uint32_t flags,
@@ -89,8 +117,19 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
 	if (!(vdev->irqs[index].flags & VFIO_IRQ_INFO_MASKABLE))
 		return -EINVAL;
 
-	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
-		return -EINVAL; /* not implemented yet */
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+
+		if (fd >= 0)
+			return vfio_virqfd_enable((void *) &vdev->irqs[index],
+						  vfio_platform_unmask_handler,
+						  NULL, NULL,
+						  &vdev->irqs[index].unmask,
+						  fd);
+
+		vfio_virqfd_disable(&vdev->irqs[index].unmask);
+		return 0;
+	}
 
 	if (flags & VFIO_IRQ_SET_DATA_NONE) {
 		vfio_platform_unmask(&vdev->irqs[index]);
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index ff2db1d..5d31e04 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -35,6 +35,8 @@ struct vfio_platform_irq {
 	struct eventfd_ctx	*trigger;
 	bool			masked;
 	spinlock_t		lock;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
 };
 
 struct vfio_platform_region {
-- 
2.2.2
[PATCH v12 08/18] vfio/platform: return IRQ info
From: Antonios Motakis a.mota...@virtualopensystems.com Return information for the interrupts exposed by the device. This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs and enables VFIO_DEVICE_GET_IRQ_INFO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/Makefile| 2 +- drivers/vfio/platform/vfio_platform_common.c | 31 +--- drivers/vfio/platform/vfio_platform_irq.c | 51 +++ drivers/vfio/platform/vfio_platform_private.h | 10 ++ 4 files changed, 89 insertions(+), 5 deletions(-) create mode 100644 drivers/vfio/platform/vfio_platform_irq.c diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile index 279862b..c6316cc 100644 --- a/drivers/vfio/platform/Makefile +++ b/drivers/vfio/platform/Makefile @@ -1,4 +1,4 @@ -vfio-platform-y := vfio_platform.o vfio_platform_common.o +vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 6bf78ee..cf7bb08 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -100,6 +100,7 @@ static void vfio_platform_release(void *device_data) if (!(--vdev-refcnt)) { vfio_platform_regions_cleanup(vdev); + vfio_platform_irq_cleanup(vdev); } mutex_unlock(driver_lock); @@ -121,6 +122,10 @@ static int vfio_platform_open(void *device_data) ret = vfio_platform_regions_init(vdev); if (ret) goto err_reg; + + ret = vfio_platform_irq_init(vdev); + if (ret) + goto err_irq; } vdev-refcnt++; @@ -128,6 +133,8 @@ static int vfio_platform_open(void *device_data) mutex_unlock(driver_lock); return 0; +err_irq: + vfio_platform_regions_cleanup(vdev); err_reg: mutex_unlock(driver_lock); module_put(THIS_MODULE); @@ -153,7 +160,7 @@ static long vfio_platform_ioctl(void *device_data, info.flags = vdev-flags; info.num_regions = vdev-num_regions; - info.num_irqs = 0; 
+ info.num_irqs = vdev-num_irqs; return copy_to_user((void __user *)arg, info, minsz); @@ -178,10 +185,26 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) { + struct vfio_irq_info info; + + minsz = offsetofend(struct vfio_irq_info, count); + + if (copy_from_user(info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz minsz) + return -EINVAL; + + if (info.index = vdev-num_irqs) + return -EINVAL; + + info.flags = vdev-irqs[info.index].flags; + info.count = vdev-irqs[info.index].count; + + return copy_to_user((void __user *)arg, info, minsz); - else if (cmd == VFIO_DEVICE_SET_IRQS) + } else if (cmd == VFIO_DEVICE_SET_IRQS) return -EINVAL; else if (cmd == VFIO_DEVICE_RESET) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c new file mode 100644 index 000..c6c3ec1 --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -0,0 +1,51 @@ +/* + * VFIO platform devices interrupt handling + * + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */
+
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/irq.h>
+
+#include "vfio_platform_private.h"
+
+int vfio_platform_irq_init(struct vfio_platform_device *vdev)
+{
+	int cnt = 0, i;
+
+	while (vdev->get_irq(vdev, cnt) >= 0)
+		cnt++;
+
+	vdev->irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq), GFP_KERNEL);
+	if (!vdev->irqs)
+		return -ENOMEM;
+
+	for (i = 0; i < cnt; i++) {
+		vdev->irqs[i].flags = 0;
+		vdev->irqs[i].count = 1;
+	}
+
+	vdev->num_irqs = cnt;
+
[PATCH v12 10/18] vfio/platform: trigger an interrupt via eventfd
From: Antonios Motakis a.mota...@virtualopensystems.com This patch allows to set an eventfd for a platform device's interrupt, and also to trigger the interrupt eventfd from userspace for testing. Level sensitive interrupts are marked as maskable and are handled in a later patch. Edge triggered interrupts are not advertised as maskable and are implemented here using a simple and efficient IRQ handler. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com [Baptiste Reynal: fix masked interrupt initialization] Signed-off-by: Baptiste Reynal b.rey...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 98 ++- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 98 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index df5c919..4b1ee22 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -39,12 +39,96 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, return -EINVAL; } +static irqreturn_t vfio_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + + eventfd_signal(irq_ctx-trigger, 1); + + return IRQ_HANDLED; +} + +static int vfio_set_trigger(struct vfio_platform_device *vdev, int index, + int fd, irq_handler_t handler) +{ + struct vfio_platform_irq *irq = vdev-irqs[index]; + struct eventfd_ctx *trigger; + int ret; + + if (irq-trigger) { + free_irq(irq-hwirq, irq); + kfree(irq-name); + eventfd_ctx_put(irq-trigger); + irq-trigger = NULL; + } + + if (fd 0) /* Disable only */ + return 0; + + irq-name = kasprintf(GFP_KERNEL, vfio-irq[%d](%s), + irq-hwirq, vdev-name); + if (!irq-name) + return -ENOMEM; + + trigger = eventfd_ctx_fdget(fd); + if (IS_ERR(trigger)) { + kfree(irq-name); + return PTR_ERR(trigger); + } + + irq-trigger = trigger; + + irq_set_status_flags(irq-hwirq, IRQ_NOAUTOEN); + ret = request_irq(irq-hwirq, handler, 0, irq-name, 
irq); + if (ret) { + kfree(irq-name); + eventfd_ctx_put(trigger); + irq-trigger = NULL; + return ret; + } + + if (!irq-masked) + enable_irq(irq-hwirq); + + return 0; +} + static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + struct vfio_platform_irq *irq = vdev-irqs[index]; + irq_handler_t handler; + + if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) + return -EINVAL; /* not implemented */ + else + handler = vfio_irq_handler; + + if (!count (flags VFIO_IRQ_SET_DATA_NONE)) + return vfio_set_trigger(vdev, index, -1, handler); + + if (start != 0 || count != 1) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) { + int32_t fd = *(int32_t *)data; + + return vfio_set_trigger(vdev, index, fd, handler); + } + + if (flags VFIO_IRQ_SET_DATA_NONE) { + handler(irq-hwirq, irq); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t trigger = *(uint8_t *)data; + + if (trigger) + handler(irq-hwirq, irq); + } + + return 0; } int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, @@ -90,7 +174,12 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; - vdev-irqs[i].flags = 0; + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; + + if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) + vdev-irqs[i].flags |= VFIO_IRQ_INFO_MASKABLE + | VFIO_IRQ_INFO_AUTOMASKED; + vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; } @@ -105,6 +194,11 @@ err: void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_irqs; i++) + vfio_set_trigger(vdev, i, -1, NULL); + vdev-num_irqs = 0; kfree(vdev-irqs); } diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index b119a6c..aa01cc3 100644 --- a/drivers/vfio/platform/vfio_platform_private.h +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -31,6
[PATCH v12 02/18] vfio: platform: probe to devices on the platform bus
From: Antonios Motakis a.mota...@virtualopensystems.com Driver to bind to Linux platform devices, and callbacks to discover their resources to be used by the main VFIO PLATFORM code. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform.c | 103 ++ include/uapi/linux/vfio.h | 1 + 2 files changed, 104 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform.c diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c new file mode 100644 index 000..cef645c --- /dev/null +++ b/drivers/vfio/platform/vfio_platform.c @@ -0,0 +1,103 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include linux/module.h +#include linux/slab.h +#include linux/vfio.h +#include linux/platform_device.h + +#include vfio_platform_private.h + +#define DRIVER_VERSION 0.10 +#define DRIVER_AUTHOR Antonios Motakis a.mota...@virtualopensystems.com +#define DRIVER_DESC VFIO for platform devices - User Level meta-driver + +/* probing devices from the linux platform bus */ + +static struct resource *get_platform_resource(struct vfio_platform_device *vdev, + int num) +{ + struct platform_device *dev = (struct platform_device *) vdev-opaque; + int i; + + for (i = 0; i dev-num_resources; i++) { + struct resource *r = dev-resource[i]; + + if (resource_type(r) (IORESOURCE_MEM|IORESOURCE_IO)) { + if (!num) + return r; + + num--; + } + } + return NULL; +} + +static int get_platform_irq(struct vfio_platform_device *vdev, int i) +{ + struct platform_device *pdev = (struct platform_device *) vdev-opaque; + + return platform_get_irq(pdev, i); +} + +static int vfio_platform_probe(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + int ret; + + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) + return -ENOMEM; + + vdev-opaque = (void *) pdev; + vdev-name = pdev-name; + vdev-flags = VFIO_DEVICE_FLAGS_PLATFORM; + vdev-get_resource = get_platform_resource; + vdev-get_irq = get_platform_irq; + + ret = vfio_platform_probe_common(vdev, pdev-dev); + if (ret) + kfree(vdev); + + return ret; +} + +static int vfio_platform_remove(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_platform_remove_common(pdev-dev); + if (vdev) { + kfree(vdev); + return 0; + } + + return -EINVAL; +} + +static struct platform_driver vfio_platform_driver = { + .probe = vfio_platform_probe, + .remove = vfio_platform_remove, + .driver = { + .name = vfio-platform, + .owner = THIS_MODULE, + }, +}; + +module_platform_driver(vfio_platform_driver); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE(GPL v2); +MODULE_AUTHOR(DRIVER_AUTHOR); 
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9ade02b..4e93a97 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -159,6 +159,7 @@ struct vfio_device_info {
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
2.2.2
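For testing, a platform device can be handed to this meta-driver through sysfs. A hedged sketch, assuming a kernel with the platform-bus `driver_override` attribute; the device name (`fff51000.ethernet`) and IOMMU group number (`5`) are placeholders — substitute your own:

```shell
# Tell the platform bus to match this device against vfio-platform
# (device name is hypothetical; pick one from /sys/bus/platform/devices).
echo vfio-platform > /sys/bus/platform/devices/fff51000.ethernet/driver_override

# Unbind from the current driver, then rebind so vfio-platform picks it up.
echo fff51000.ethernet > /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet > /sys/bus/platform/drivers/vfio-platform/bind

# The device's IOMMU group (assumed to be 5 here) now appears under
# /dev/vfio and can be opened by a VFIO userspace driver.
ls -l /dev/vfio/5
```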
[PATCH v12 05/18] vfio/platform: return info for device memory mapped IO regions
From: Antonios Motakis a.mota...@virtualopensystems.com This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call, which allows the user to learn about the available MMIO resources of a device. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 106 +- drivers/vfio/platform/vfio_platform_private.h | 22 ++ 2 files changed, 124 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 862b43b..2a4613c 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -22,17 +22,97 @@ #include vfio_platform_private.h +static DEFINE_MUTEX(driver_lock); + +static int vfio_platform_regions_init(struct vfio_platform_device *vdev) +{ + int cnt = 0, i; + + while (vdev-get_resource(vdev, cnt)) + cnt++; + + vdev-regions = kcalloc(cnt, sizeof(struct vfio_platform_region), + GFP_KERNEL); + if (!vdev-regions) + return -ENOMEM; + + for (i = 0; i cnt; i++) { + struct resource *res = + vdev-get_resource(vdev, i); + + if (!res) + goto err; + + vdev-regions[i].addr = res-start; + vdev-regions[i].size = resource_size(res); + vdev-regions[i].flags = 0; + + switch (resource_type(res)) { + case IORESOURCE_MEM: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + break; + case IORESOURCE_IO: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; + break; + default: + goto err; + } + } + + vdev-num_regions = cnt; + + return 0; +err: + kfree(vdev-regions); + return -EINVAL; +} + +static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) +{ + vdev-num_regions = 0; + kfree(vdev-regions); +} + static void vfio_platform_release(void *device_data) { + struct vfio_platform_device *vdev = device_data; + + mutex_lock(driver_lock); + + if (!(--vdev-refcnt)) { + vfio_platform_regions_cleanup(vdev); + } + + mutex_unlock(driver_lock); + module_put(THIS_MODULE); } static int 
vfio_platform_open(void *device_data)
 {
+	struct vfio_platform_device *vdev = device_data;
+	int ret;
+
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
+	mutex_lock(&driver_lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_platform_regions_init(vdev);
+		if (ret)
+			goto err_reg;
+	}
+
+	vdev->refcnt++;
+
+	mutex_unlock(&driver_lock);
 	return 0;
+
+err_reg:
+	mutex_unlock(&driver_lock);
+	module_put(THIS_MODULE);
+	return ret;
 }
 
 static long vfio_platform_ioctl(void *device_data,
@@ -53,15 +133,33 @@ static long vfio_platform_ioctl(void *device_data,
 			return -EINVAL;
 
 		info.flags = vdev->flags;
-		info.num_regions = 0;
+		info.num_regions = vdev->num_regions;
 		info.num_irqs = 0;
 
 		return copy_to_user((void __user *)arg, &info, minsz);
 
-	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
-		return -EINVAL;
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index >= vdev->num_regions)
+			return -EINVAL;
+
+		/* map offset to the physical address */
+		info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index);
+		info.size = vdev->regions[info.index].size;
+		info.flags = vdev->regions[info.index].flags;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
 
-	else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
 		return -EINVAL;
 
 	else if (cmd == VFIO_DEVICE_SET_IRQS)
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index c046988..3551f6d 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -18,7 +18,29 @@
 #include <linux/types.h>
 #include <linux/interrupt.h>
 
+#define VFIO_PLATFORM_OFFSET_SHIFT	40
+#define VFIO_PLATFORM_OFFSET_MASK (((u64)(1) << VFIO_PLATFORM_OFFSET_SHIFT) - 1)
+
[PATCH v12 06/18] vfio/platform: read and write support for the device fd
From: Antonios Motakis a.mota...@virtualopensystems.com VFIO returns a file descriptor which we can use to manipulate the memory regions of the device. Usually, the user will mmap memory regions that are addressable on page boundaries, however for memory regions where this is not the case we cannot provide mmap functionality due to security concerns. For this reason we also allow to use read and write functions to the file descriptor pointing to the memory regions. We implement this functionality only for MMIO regions of platform devices; PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 150 ++ drivers/vfio/platform/vfio_platform_private.h | 1 + 2 files changed, 151 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 2a4613c..fda4c30 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -50,6 +50,10 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) switch (resource_type(res)) { case IORESOURCE_MEM: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_READ; + if (!(res-flags IORESOURCE_READONLY)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_WRITE; break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -69,6 +73,11 @@ err: static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_regions; i++) + iounmap(vdev-regions[i].ioaddr); + vdev-num_regions = 0; kfree(vdev-regions); } @@ -171,15 +180,156 @@ static long vfio_platform_ioctl(void *device_data, return -ENOTTY; } +static ssize_t vfio_platform_read_mmio(struct vfio_platform_region reg, + char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + 
ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t filled; + + if (count = 4 !(off % 4)) { + u32 val; + + val = ioread32(reg.ioaddr + off); + if (copy_to_user(buf, val, 4)) + goto err; + + filled = 4; + } else if (count = 2 !(off % 2)) { + u16 val; + + val = ioread16(reg.ioaddr + off); + if (copy_to_user(buf, val, 2)) + goto err; + + filled = 2; + } else { + u8 val; + + val = ioread8(reg.ioaddr + off); + if (copy_to_user(buf, val, 1)) + goto err; + + filled = 1; + } + + + count -= filled; + done += filled; + off += filled; + buf += filled; + } + + return done; +err: + return -EFAULT; +} + static ssize_t vfio_platform_read(void *device_data, char __user *buf, size_t count, loff_t *ppos) { + struct vfio_platform_device *vdev = device_data; + unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos); + loff_t off = *ppos VFIO_PLATFORM_OFFSET_MASK; + + if (index = vdev-num_regions) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ)) + return -EINVAL; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_read_mmio(vdev-regions[index], + buf, count, off); + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } +static ssize_t vfio_platform_write_mmio(struct vfio_platform_region reg, + const char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t
[PATCH v12 01/18] vfio/platform: initial skeleton of VFIO support for platform devices
From: Antonios Motakis a.mota...@virtualopensystems.com This patch forms the common skeleton code for platform devices support with VFIO. This will include the core functionality of VFIO_PLATFORM, however binding to the device and discovering the device resources will be done with the help of a separate file where any Linux platform bus specific code will reside. This will allow us to implement support for also discovering AMBA devices and their resources, but still reuse a large part of the VFIO_PLATFORM implementation. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com [Baptiste Reynal: added includes in vfio_platform_private.h] Signed-off-by: Baptiste Reynal b.rey...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 121 ++ drivers/vfio/platform/vfio_platform_private.h | 39 + 2 files changed, 160 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform_common.c create mode 100644 drivers/vfio/platform/vfio_platform_private.h diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c new file mode 100644 index 000..34d023b --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -0,0 +1,121 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include linux/device.h +#include linux/iommu.h +#include linux/module.h +#include linux/mutex.h +#include linux/slab.h +#include linux/types.h +#include linux/vfio.h + +#include vfio_platform_private.h + +static void vfio_platform_release(void *device_data) +{ + module_put(THIS_MODULE); +} + +static int vfio_platform_open(void *device_data) +{ + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + return 0; +} + +static long vfio_platform_ioctl(void *device_data, + unsigned int cmd, unsigned long arg) +{ + if (cmd == VFIO_DEVICE_GET_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_REGION_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_SET_IRQS) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_RESET) + return -EINVAL; + + return -ENOTTY; +} + +static ssize_t vfio_platform_read(void *device_data, char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static ssize_t vfio_platform_write(void *device_data, const char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) +{ + return -EINVAL; +} + +static const struct vfio_device_ops vfio_platform_ops = { + .name = vfio-platform, + .open = vfio_platform_open, + .release= vfio_platform_release, + .ioctl = vfio_platform_ioctl, + .read = vfio_platform_read, + .write = vfio_platform_write, + .mmap = vfio_platform_mmap, +}; + +int vfio_platform_probe_common(struct vfio_platform_device *vdev, + struct device *dev) +{ + struct iommu_group *group; + int ret; + + if (!vdev) + return -EINVAL; + + group = iommu_group_get(dev); + if (!group) { + pr_err(VFIO: No IOMMU group for device %s\n, vdev-name); + return -EINVAL; + } + + ret = vfio_add_group_dev(dev, vfio_platform_ops, vdev); + if (ret) { + iommu_group_put(group); + return ret; + } + + return 0; +} +EXPORT_SYMBOL_GPL(vfio_platform_probe_common); + +struct 
vfio_platform_device *vfio_platform_remove_common(struct device *dev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_del_group_dev(dev); + if (vdev) + iommu_group_put(dev-iommu_group); + + return vdev; +} +EXPORT_SYMBOL_GPL(vfio_platform_remove_common); diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h new file mode 100644 index 000..c046988 --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -0,0 +1,39 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program
[PATCH v12 04/18] vfio/platform: return info for bound device
From: Antonios Motakis a.mota...@virtualopensystems.com

A VFIO userspace driver will start by opening the VFIO device that
corresponds to an IOMMU group, and will use the ioctl interface to get
the basic device info, such as number of memory regions and interrupts,
and their properties.

This patch enables the VFIO_DEVICE_GET_INFO ioctl call.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index 34d023b..862b43b 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -38,10 +38,27 @@ static int vfio_platform_open(void *device_data)
 static long vfio_platform_ioctl(void *device_data,
 				unsigned int cmd, unsigned long arg)
 {
-	if (cmd == VFIO_DEVICE_GET_INFO)
-		return -EINVAL;
+	struct vfio_platform_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = vdev->flags;
+		info.num_regions = 0;
+		info.num_irqs = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
 
-	else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
 		return -EINVAL;
 
 	else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
-- 
2.2.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
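The argsz handshake above is easy to get wrong, so here is a minimal sketch of the check the kernel performs: userspace must claim at least enough space for every field the kernel will write back. The struct layout mirrors `struct vfio_device_info` from `<linux/vfio.h>`; `check_argsz` is a hypothetical helper for illustration, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Field layout mirrors struct vfio_device_info in <linux/vfio.h>. */
struct vfio_device_info {
	unsigned int argsz;
	unsigned int flags;
	unsigned int num_regions;
	unsigned int num_irqs;
};

/* offsetofend() as used in the patch: offset of the byte just past
 * the given member. */
#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Sketch of the kernel-side validation: reject callers whose argsz
 * is smaller than the fields we intend to fill in. */
int check_argsz(const struct vfio_device_info *info)
{
	size_t minsz = offsetofend(struct vfio_device_info, num_irqs);

	return info->argsz < minsz ? -1 : 0;
}
```

This is why extending such a struct later stays backward compatible: new fields land past `minsz`, and old userspace that passes a smaller `argsz` keeps working.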
[PATCH v12 13/18] vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
From: Antonios Motakis a.mota...@virtualopensystems.com

The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not
really PCI specific, since we plan to reuse the virqfd code with more
VFIO drivers in addition to VFIO_PCI.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci.c       | 6 +++---
 drivers/vfio/pci/vfio_pci_intrs.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 9558da3..fc4308c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,7 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_virqfd_exit();
+	vfio_virqfd_exit();
 	vfio_pci_uninit_perm_bits();
 }
 
@@ -1026,7 +1026,7 @@ static int __init vfio_pci_init(void)
 		return ret;
 
 	/* Start the virqfd cleanup handler */
-	ret = vfio_pci_virqfd_init();
+	ret = vfio_virqfd_init();
 	if (ret)
 		goto out_virqfd;
 
@@ -1038,7 +1038,7 @@ static int __init vfio_pci_init(void)
 	return 0;
 
 out_driver:
-	vfio_pci_virqfd_exit();
+	vfio_virqfd_exit();
 out_virqfd:
 	vfio_pci_uninit_perm_bits();
 	return ret;

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 0a41833d..a5378d5 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -45,7 +45,7 @@ struct virqfd {
 
 static struct workqueue_struct *vfio_irqfd_cleanup_wq;
 
-int __init vfio_pci_virqfd_init(void)
+int __init vfio_virqfd_init(void)
 {
 	vfio_irqfd_cleanup_wq =
 		create_singlethread_workqueue("vfio-irqfd-cleanup");
@@ -55,7 +55,7 @@ int __init vfio_pci_virqfd_init(void)
 	return 0;
 }
 
-void vfio_pci_virqfd_exit(void)
+void vfio_virqfd_exit(void)
 {
 	destroy_workqueue(vfio_irqfd_cleanup_wq);
 }
-- 
2.2.2
[PATCH v12 07/18] vfio/platform: support MMAP of MMIO regions
From: Antonios Motakis a.mota...@virtualopensystems.com Allow to memory map the MMIO regions of the device so userspace can directly access them. PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 65 1 file changed, 65 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index fda4c30..6bf78ee 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -54,6 +54,16 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) if (!(res-flags IORESOURCE_READONLY)) vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_WRITE; + + /* +* Only regions addressed with PAGE granularity may be +* MMAPed securely. +*/ + if (!(vdev-regions[i].addr ~PAGE_MASK) + !(vdev-regions[i].size ~PAGE_MASK)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_MMAP; + break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -333,8 +343,63 @@ static ssize_t vfio_platform_write(void *device_data, const char __user *buf, return -EINVAL; } +static int vfio_platform_mmap_mmio(struct vfio_platform_region region, + struct vm_area_struct *vma) +{ + u64 req_len, pgoff, req_start; + + req_len = vma-vm_end - vma-vm_start; + pgoff = vma-vm_pgoff + ((1U (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1); + req_start = pgoff PAGE_SHIFT; + + if (region.size PAGE_SIZE || req_start + req_len region.size) + return -EINVAL; + + vma-vm_page_prot = pgprot_noncached(vma-vm_page_prot); + vma-vm_pgoff = (region.addr PAGE_SHIFT) + pgoff; + + return remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff, + req_len, vma-vm_page_prot); +} + static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) { + struct vfio_platform_device *vdev = device_data; + unsigned int index; + + index = vma-vm_pgoff (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT); + + if 
(vma-vm_end vma-vm_start) + return -EINVAL; + if (!(vma-vm_flags VM_SHARED)) + return -EINVAL; + if (index = vdev-num_regions) + return -EINVAL; + if (vma-vm_start ~PAGE_MASK) + return -EINVAL; + if (vma-vm_end ~PAGE_MASK) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_MMAP)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ) +(vma-vm_flags VM_READ)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_WRITE) +(vma-vm_flags VM_WRITE)) + return -EINVAL; + + vma-vm_private_data = vdev; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_mmap_mmio(vdev-regions[index], vma); + + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
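The mmap offset scheme above packs the region index into the high bits of the file offset and the offset within the region into the low bits. A small sketch of that arithmetic follows; `VFIO_PLATFORM_OFFSET_SHIFT = 40` is an assumption here (it matches the value the driver eventually used, but is not stated in this patch), and the helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
/* Assumed value of VFIO_PLATFORM_OFFSET_SHIFT; see lead-in. */
#define VFIO_PLATFORM_OFFSET_SHIFT 40

/* Region index: the high bits of vma->vm_pgoff, as in
 * vfio_platform_mmap(). */
unsigned int offset_to_index(uint64_t pgoff)
{
	return (unsigned int)(pgoff >> (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT));
}

/* Page offset within the region: the low bits of vma->vm_pgoff,
 * masked as in vfio_platform_mmap_mmio(). */
uint64_t offset_within_region(uint64_t pgoff)
{
	return pgoff & ((UINT64_C(1) << (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
}
```

Userspace computes the mmap offset the same way in reverse: `offset = (uint64_t)index << VFIO_PLATFORM_OFFSET_SHIFT`, then the kernel splits it back apart as above.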
[PATCH v12 16/18] vfio: move eventfd support code for VFIO_PCI to a separate file
From: Antonios Motakis a.mota...@virtualopensystems.com The virqfd functionality that is used by VFIO_PCI to implement interrupt masking and unmasking via an eventfd, is generic enough and can be reused by another driver. Move it to a separate file in order to allow the code to be shared. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_intrs.c | 215 drivers/vfio/pci/vfio_pci_private.h | 3 - drivers/vfio/virqfd.c | 213 +++ include/linux/vfio.h| 27 + 5 files changed, 242 insertions(+), 219 deletions(-) create mode 100644 drivers/vfio/virqfd.c diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 1310792..c7c8644 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,4 +1,5 @@ -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o +vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \ + ../virqfd.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 5b5fc23..de4befc 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -19,228 +19,13 @@ #include linux/msi.h #include linux/pci.h #include linux/file.h -#include linux/poll.h #include linux/vfio.h #include linux/wait.h -#include linux/workqueue.h #include linux/slab.h #include vfio_pci_private.h /* - * IRQfd - generic - */ -struct virqfd { - void*opaque; - struct eventfd_ctx *eventfd; - int (*handler)(void *, void *); - void(*thread)(void *, void *); - void*data; - struct work_struct inject; - wait_queue_twait; - poll_table pt; - struct work_struct shutdown; - struct virqfd **pvirqfd; -}; - -static struct workqueue_struct *vfio_irqfd_cleanup_wq; -DEFINE_SPINLOCK(virqfd_lock); - -int __init vfio_virqfd_init(void) -{ - vfio_irqfd_cleanup_wq = - create_singlethread_workqueue(vfio-irqfd-cleanup); - if (!vfio_irqfd_cleanup_wq) - return -ENOMEM; - - return 
0; -} - -void vfio_virqfd_exit(void) -{ - destroy_workqueue(vfio_irqfd_cleanup_wq); -} - -static void virqfd_deactivate(struct virqfd *virqfd) -{ - queue_work(vfio_irqfd_cleanup_wq, virqfd-shutdown); -} - -static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) -{ - struct virqfd *virqfd = container_of(wait, struct virqfd, wait); - unsigned long flags = (unsigned long)key; - - if (flags POLLIN) { - /* An event has been signaled, call function */ - if ((!virqfd-handler || -virqfd-handler(virqfd-opaque, virqfd-data)) - virqfd-thread) - schedule_work(virqfd-inject); - } - - if (flags POLLHUP) { - unsigned long flags; - spin_lock_irqsave(virqfd_lock, flags); - - /* -* The eventfd is closing, if the virqfd has not yet been -* queued for release, as determined by testing whether the -* virqfd pointer to it is still valid, queue it now. As -* with kvm irqfds, we know we won't race against the virqfd -* going away because we hold the lock to get here. -*/ - if (*(virqfd-pvirqfd) == virqfd) { - *(virqfd-pvirqfd) = NULL; - virqfd_deactivate(virqfd); - } - - spin_unlock_irqrestore(virqfd_lock, flags); - } - - return 0; -} - -static void virqfd_ptable_queue_proc(struct file *file, -wait_queue_head_t *wqh, poll_table *pt) -{ - struct virqfd *virqfd = container_of(pt, struct virqfd, pt); - add_wait_queue(wqh, virqfd-wait); -} - -static void virqfd_shutdown(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, shutdown); - u64 cnt; - - eventfd_ctx_remove_wait_queue(virqfd-eventfd, virqfd-wait, cnt); - flush_work(virqfd-inject); - eventfd_ctx_put(virqfd-eventfd); - - kfree(virqfd); -} - -static void virqfd_inject(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, inject); - if (virqfd-thread) - virqfd-thread(virqfd-opaque, virqfd-data); -} - -int vfio_virqfd_enable(void *opaque, - int (*handler)(void *, void *), - void (*thread)(void *, void *), - void *data, struct virqfd **pvirqfd, 
int fd) -{ - struct fd irqfd; - struct
[PATCH v12 11/18] vfio/platform: support for level sensitive interrupts
From: Antonios Motakis a.mota...@virtualopensystems.com Level sensitive interrupts are exposed as maskable and automasked interrupts and are masked and disabled automatically when they fire. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 99 ++- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 98 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index 4b1ee22..e0e6388 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,12 +23,59 @@ #include vfio_platform_private.h +static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); +} + static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_mask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t mask = *(uint8_t *)data; + + if (mask) + vfio_platform_mask(vdev-irqs[index]); + } + + return 0; +} + +static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (irq_ctx-masked) { + enable_irq(irq_ctx-hwirq); + irq_ctx-masked = false; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); } static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, @@ -36,7 +83,50 @@ static int 
vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_unmask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t unmask = *(uint8_t *)data; + + if (unmask) + vfio_platform_unmask(vdev-irqs[index]); + } + + return 0; +} + +static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + unsigned long flags; + int ret = IRQ_NONE; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + ret = IRQ_HANDLED; + + /* automask maskable interrupts */ + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); + + if (ret == IRQ_HANDLED) + eventfd_signal(irq_ctx-trigger, 1); + + return ret; } static irqreturn_t vfio_irq_handler(int irq, void *dev_id) @@ -102,7 +192,7 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, irq_handler_t handler; if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) - return -EINVAL; /* not implemented */ + handler = vfio_automasked_irq_handler; else handler = vfio_irq_handler; @@ -174,6 +264,8 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; + spin_lock_init(vdev-irqs[i].lock); + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) @@ -182,6 +274,7 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; + vdev-irqs[i].masked = false; } vdev-num_irqs = cnt; diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index aa01cc3..ff2db1d
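The automask behaviour described above is a small state machine: a level-sensitive IRQ is disabled the moment it fires, so it cannot storm the host while userspace handles it, and it stays off until explicitly unmasked. Here is a minimal model of that logic, with plain flags standing in for `disable_irq_nosync()`, `enable_irq()` and `eventfd_signal()`; the names are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

struct irq_model {
	bool masked;          /* mirrors vfio_platform_irq.masked */
	int fires_delivered;  /* times we would have signalled the eventfd */
};

/* Models vfio_automasked_irq_handler(): deliver once, then mask. */
void automasked_handler(struct irq_model *irq)
{
	if (!irq->masked) {
		irq->masked = true;      /* stands in for disable_irq_nosync() */
		irq->fires_delivered++;  /* stands in for eventfd_signal() */
	}
}

/* Models vfio_platform_unmask(): re-enable only if currently masked. */
void irq_unmask(struct irq_model *irq)
{
	if (irq->masked)
		irq->masked = false;     /* stands in for enable_irq() */
}
```

Repeated fires while masked are swallowed, which is exactly what makes a level-triggered line safe to hand to userspace.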
[PATCH v12 15/18] vfio: pass an opaque pointer on virqfd initialization
From: Antonios Motakis a.mota...@virtualopensystems.com VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler that implements masking/unmasking of IRQs via an eventfd. We can replace it in the virqfd infrastructure with an opaque type so we can make use of the mechanism from other VFIO bus drivers. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index b35bc16..5b5fc23 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -31,10 +31,10 @@ * IRQfd - generic */ struct virqfd { - struct vfio_pci_device *vdev; + void*opaque; struct eventfd_ctx *eventfd; - int (*handler)(struct vfio_pci_device *, void *); - void(*thread)(struct vfio_pci_device *, void *); + int (*handler)(void *, void *); + void(*thread)(void *, void *); void*data; struct work_struct inject; wait_queue_twait; @@ -74,7 +74,7 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLIN) { /* An event has been signaled, call function */ if ((!virqfd-handler || -virqfd-handler(virqfd-vdev, virqfd-data)) +virqfd-handler(virqfd-opaque, virqfd-data)) virqfd-thread) schedule_work(virqfd-inject); } @@ -124,12 +124,12 @@ static void virqfd_inject(struct work_struct *work) { struct virqfd *virqfd = container_of(work, struct virqfd, inject); if (virqfd-thread) - virqfd-thread(virqfd-vdev, virqfd-data); + virqfd-thread(virqfd-opaque, virqfd-data); } -int vfio_virqfd_enable(struct vfio_pci_device *vdev, - int (*handler)(struct vfio_pci_device *, void *), - void (*thread)(struct vfio_pci_device *, void *), +int vfio_virqfd_enable(void *opaque, + int (*handler)(void *, void *), + void (*thread)(void *, void *), void *data, struct virqfd **pvirqfd, int fd) { struct fd irqfd; @@ -143,7 +143,7 @@ int vfio_virqfd_enable(struct 
vfio_pci_device *vdev, return -ENOMEM; virqfd-pvirqfd = pvirqfd; - virqfd-vdev = vdev; + virqfd-opaque = opaque; virqfd-handler = handler; virqfd-thread = thread; virqfd-data = data; @@ -196,7 +196,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * before we registered and trigger it as if we didn't miss it. */ if (events POLLIN) { - if ((!handler || handler(vdev, data)) thread) + if ((!handler || handler(opaque, data)) thread) schedule_work(virqfd-inject); } @@ -243,8 +243,10 @@ EXPORT_SYMBOL_GPL(vfio_virqfd_disable); /* * INTx */ -static void vfio_send_intx_eventfd(struct vfio_pci_device *vdev, void *unused) +static void vfio_send_intx_eventfd(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; + if (likely(is_intx(vdev) !vdev-virq_disabled)) eventfd_signal(vdev-ctx[0].trigger, 1); } @@ -287,9 +289,9 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev) * a signal is necessary, which can then be handled via a work queue * or directly depending on the caller. */ -static int vfio_pci_intx_unmask_handler(struct vfio_pci_device *vdev, - void *unused) +static int vfio_pci_intx_unmask_handler(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; struct pci_dev *pdev = vdev-pdev; unsigned long flags; int ret = 0; @@ -641,7 +643,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, } else if (flags VFIO_IRQ_SET_DATA_EVENTFD) { int32_t fd = *(int32_t *)data; if (fd = 0) - return vfio_virqfd_enable(vdev, + return vfio_virqfd_enable((void *) vdev, vfio_pci_intx_unmask_handler, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
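The refactor above is the classic C genericity trick: replace a concrete struct pointer in a callback signature with `void *opaque`, and let each caller cast it back. A tiny sketch of the pattern, with hypothetical names (only the shape matches the virqfd code):

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as struct virqfd after the patch: the callback no longer
 * knows about vfio_pci_device, only an opaque cookie. */
struct virqfd_like {
	void *opaque;
	int (*handler)(void *opaque, void *data);
};

struct fake_dev {
	int signals;
};

/* A bus-driver-specific handler: casts opaque back to its own type. */
int count_signal(void *opaque, void *data)
{
	(void)data;
	((struct fake_dev *)opaque)->signals++;
	return 1;
}

/* The generic core invokes the handler without knowing the device type. */
int fire(struct virqfd_like *v, void *data)
{
	return v->handler ? v->handler(v->opaque, data) : 1;
}
```

The cost is losing compile-time type checking on `opaque`; the gain is that the same machinery serves VFIO_PCI, VFIO_PLATFORM, and any future bus driver.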
[PATCH v12 09/18] vfio/platform: initial interrupts support code
From: Antonios Motakis a.mota...@virtualopensystems.com This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which most IRQ functionality is implemented in VFIO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 52 +-- drivers/vfio/platform/vfio_platform_irq.c | 59 +++ drivers/vfio/platform/vfio_platform_private.h | 7 3 files changed, 115 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index cf7bb08..a532a25 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -204,10 +204,54 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_SET_IRQS) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_SET_IRQS) { + struct vfio_irq_set hdr; + u8 *data = NULL; + int ret = 0; + + minsz = offsetofend(struct vfio_irq_set, count); + + if (copy_from_user(hdr, (void __user *)arg, minsz)) + return -EFAULT; + + if (hdr.argsz minsz) + return -EINVAL; + + if (hdr.index = vdev-num_irqs) + return -EINVAL; + + if (hdr.flags ~(VFIO_IRQ_SET_DATA_TYPE_MASK | + VFIO_IRQ_SET_ACTION_TYPE_MASK)) + return -EINVAL; - else if (cmd == VFIO_DEVICE_RESET) + if (!(hdr.flags VFIO_IRQ_SET_DATA_NONE)) { + size_t size; + + if (hdr.flags VFIO_IRQ_SET_DATA_BOOL) + size = sizeof(uint8_t); + else if (hdr.flags VFIO_IRQ_SET_DATA_EVENTFD) + size = sizeof(int32_t); + else + return -EINVAL; + + if (hdr.argsz - minsz size) + return -EINVAL; + + data = memdup_user((void __user *)(arg + minsz), size); + if (IS_ERR(data)) + return PTR_ERR(data); + } + + mutex_lock(vdev-igate); + + ret = vfio_platform_set_irqs_ioctl(vdev, hdr.flags, hdr.index, + hdr.start, hdr.count, data); + mutex_unlock(vdev-igate); + kfree(data); + + return ret; + + } else if (cmd == VFIO_DEVICE_RESET) return -EINVAL; return -ENOTTY; @@ -457,6 
+501,8 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev, return ret; } + mutex_init(vdev-igate); + return 0; } EXPORT_SYMBOL_GPL(vfio_platform_probe_common); diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index c6c3ec1..df5c919 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,6 +23,56 @@ #include vfio_platform_private.h +static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, +unsigned index, unsigned start, +unsigned count, uint32_t flags, +void *data) +{ + return -EINVAL; +} + +int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, +uint32_t flags, unsigned index, unsigned start, +unsigned count, void *data) +{ + int (*func)(struct vfio_platform_device *vdev, unsigned index, + unsigned start, unsigned count, uint32_t flags, + void *data) = NULL; + + switch (flags VFIO_IRQ_SET_ACTION_TYPE_MASK) { + case VFIO_IRQ_SET_ACTION_MASK: + func = vfio_platform_set_irq_mask; + break; + case VFIO_IRQ_SET_ACTION_UNMASK: + func = vfio_platform_set_irq_unmask; + break; + case VFIO_IRQ_SET_ACTION_TRIGGER: + func =
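The VFIO_DEVICE_SET_IRQS handler above sizes its payload from the DATA flags before copying it in: NONE carries no data, BOOL one `u8` per IRQ, EVENTFD one `s32` file descriptor per IRQ. A sketch of that selection logic; the flag values match `<linux/vfio.h>` to the best of my knowledge, but treat them as assumptions here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the definitions in <linux/vfio.h>. */
#define VFIO_IRQ_SET_DATA_NONE    (1 << 0)
#define VFIO_IRQ_SET_DATA_BOOL    (1 << 1)
#define VFIO_IRQ_SET_DATA_EVENTFD (1 << 2)

/* Mirrors the size selection in vfio_platform_ioctl(): pick the
 * per-element payload size, or fail on an unknown data type. */
size_t irq_set_data_size(uint32_t flags)
{
	if (flags & VFIO_IRQ_SET_DATA_NONE)
		return 0;
	if (flags & VFIO_IRQ_SET_DATA_BOOL)
		return sizeof(uint8_t);
	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
		return sizeof(int32_t);
	return (size_t)-1; /* invalid, kernel returns -EINVAL */
}
```

Only after this size check does the kernel `memdup_user()` the payload, which is why a malformed `argsz`/flags combination never reaches the per-action handlers.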
[PATCH v12 03/18] vfio: platform: add the VFIO PLATFORM module to Kconfig
From: Antonios Motakis a.mota...@virtualopensystems.com

Enable building the VFIO PLATFORM driver that allows to use Linux
platform devices with VFIO.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/Kconfig           | 1 +
 drivers/vfio/Makefile          | 1 +
 drivers/vfio/platform/Kconfig  | 9 +++++++++
 drivers/vfio/platform/Makefile | 4 ++++
 4 files changed, 15 insertions(+)
 create mode 100644 drivers/vfio/platform/Kconfig
 create mode 100644 drivers/vfio/platform/Makefile

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index a0abe04..962fb80 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -27,3 +27,4 @@ menuconfig VFIO
 	  If you don't know what to do here, say N.
 
 source "drivers/vfio/pci/Kconfig"
+source "drivers/vfio/platform/Kconfig"

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 0b035b1..dadf0ca 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += platform/

diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
new file mode 100644
index 000..c51af17
--- /dev/null
+++ b/drivers/vfio/platform/Kconfig
@@ -0,0 +1,9 @@
+config VFIO_PLATFORM
+	tristate "VFIO support for platform devices"
+	depends on VFIO && EVENTFD && ARM
+	help
+	  Support for platform devices with VFIO. This is required to make
+	  use of platform devices present on the system using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
new file mode 100644
index 000..279862b
--- /dev/null
+++ b/drivers/vfio/platform/Makefile
@@ -0,0 +1,4 @@
+
+vfio-platform-y := vfio_platform.o vfio_platform_common.o
+
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
-- 
2.2.2
[PATCH v12 14/18] vfio: add local lock for virqfd instead of depending on VFIO PCI
From: Antonios Motakis a.mota...@virtualopensystems.com The Virqfd code needs to keep accesses to any struct *virqfd safe, but this comes into play only when creating or destroying eventfds, so sharing the same spinlock with the VFIO bus driver is not necessary. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index a5378d5..b35bc16 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -44,6 +44,7 @@ struct virqfd { }; static struct workqueue_struct *vfio_irqfd_cleanup_wq; +DEFINE_SPINLOCK(virqfd_lock); int __init vfio_virqfd_init(void) { @@ -80,21 +81,21 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLHUP) { unsigned long flags; - spin_lock_irqsave(virqfd-vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); /* * The eventfd is closing, if the virqfd has not yet been * queued for release, as determined by testing whether the -* vdev pointer to it is still valid, queue it now. As +* virqfd pointer to it is still valid, queue it now. As * with kvm irqfds, we know we won't race against the virqfd -* going away because we hold wqh-lock to get here. +* going away because we hold the lock to get here. */ if (*(virqfd-pvirqfd) == virqfd) { *(virqfd-pvirqfd) = NULL; virqfd_deactivate(virqfd); } - spin_unlock_irqrestore(virqfd-vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); } return 0; @@ -170,16 +171,16 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * we update the pointer to the virqfd under lock to avoid * pushing multiple jobs to release the same virqfd. 
*/ - spin_lock_irq(vdev-irqlock); + spin_lock_irq(virqfd_lock); if (*pvirqfd) { - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); ret = -EBUSY; goto err_busy; } *pvirqfd = virqfd; - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); /* * Install our own custom wake-up handling so we are notified via @@ -217,18 +218,18 @@ err_fd: } EXPORT_SYMBOL_GPL(vfio_virqfd_enable); -void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd) +void vfio_virqfd_disable(struct virqfd **pvirqfd) { unsigned long flags; - spin_lock_irqsave(vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); if (*pvirqfd) { virqfd_deactivate(*pvirqfd); *pvirqfd = NULL; } - spin_unlock_irqrestore(vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); /* * Block until we know all outstanding shutdown jobs have completed. @@ -441,8 +442,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd) static void vfio_intx_disable(struct vfio_pci_device *vdev) { vfio_intx_set_signal(vdev, -1); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[0].mask); + vfio_virqfd_disable(vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].mask); vdev-irq_type = VFIO_PCI_NUM_IRQS; vdev-num_ctx = 0; kfree(vdev-ctx); @@ -606,8 +607,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix) vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix); for (i = 0; i vdev-num_ctx; i++) { - vfio_virqfd_disable(vdev, vdev-ctx[i].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[i].mask); + vfio_virqfd_disable(vdev-ctx[i].unmask); + vfio_virqfd_disable(vdev-ctx[i].mask); } if (msix) { @@ -645,7 +646,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].unmask); } return 0; -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the 
body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: Fix CR3_PCID_INVD type on 32-bit
On 15/01/2015 09:44, Borislav Petkov wrote:
> From: Borislav Petkov b...@suse.de
>
> arch/x86/kvm/emulate.c: In function ‘check_cr_write’:
> arch/x86/kvm/emulate.c:3552:4: warning: left shift count >= width of type
>    rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
>
> happens because sizeof(UL) on 32-bit is 4 bytes but we shift it 63 bits
> to the left.
>
> Signed-off-by: Borislav Petkov b...@suse.de
> ---
>  arch/x86/include/asm/kvm_host.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d89c6b828c96..a8d07a060136 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -51,7 +51,7 @@
>  			  | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
>  
>  #define CR3_L_MODE_RESERVED_BITS 0xFF00ULL
> -#define CR3_PCID_INVD		 (1UL << 63)
> +#define CR3_PCID_INVD		 BIT_64(63)
>  #define CR4_RESERVED_BITS \
>  	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
>  			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \

Applied, thanks.

Paolo
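The bug being fixed is a type-width one: `1UL` is only 32 bits wide on a 32-bit target, so shifting it left by 63 is undefined and gcc warns. A sketch of the fix follows; the `BIT_64()` macro is re-defined locally for illustration (the kernel's own definition lives in its bitops headers).

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT_64(): force the constant to a
 * 64-bit type so the shift is evaluated in 64 bits regardless of
 * sizeof(unsigned long). */
#define BIT_64(n) (UINT64_C(1) << (n))

/* With (1UL << 63), a 32-bit build shifts a 32-bit value by 63 bits:
 * undefined behaviour and the "left shift count >= width of type"
 * warning above. With BIT_64(63) the result is well-defined. */
uint64_t cr3_pcid_invd(void)
{
	return BIT_64(63);
}
```

On a 64-bit build both spellings happen to produce the same value, which is why the bug only surfaced as a 32-bit compile warning.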
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote:
> I'm actually wondering if we should just nuke the _interruptible()
> version of swait. As it should only be all interruptible or all not
> interruptible, the swait_wake() should just do the wake up regardless.
> In which case, swait_wake() is good enough. No need to have different
> versions where people may think it does something special.
>
> Peter?

Yeah, I think the latest thing I have sitting here on my disk only has
the swake_up() which does TASK_NORMAL, no choice there.
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
2015-01-20 15:54-0200, Marcelo Tosatti:
> SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
> and rdtsc is larger than a given threshold:
> [...]
> Disable masterclock support (which increases said delta) in case the
> boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.

Why do we care about 2.6.16 bugs in upstream KVM?

The code-to-benefit tradeoff of this patch seems bad to me ...
MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
to support old guests.
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
> 2015-01-20 15:54-0200, Marcelo Tosatti:
> > SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
> > and rdtsc is larger than a given threshold:
> > [...]
> > Disable masterclock support (which increases said delta) in case the
> > boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
>
> Why do we care about 2.6.16 bugs in upstream KVM?

Because people do use 2.6.16 guests.

> The code-to-benefit tradeoff of this patch seems bad to me ...

Can you state the tradeoff and then explain why it is bad ?

> MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
> MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
> to support old guests.

What is the benefit of removing support for MSR_KVM_SYSTEM_TIME ?
Supporting old guests is important.