[PATCH 2/3] arm/arm64: KVM: Implement Stage-2 page aging
Until now, KVM/arm didn't care much for page aging (who was swapping
anyway?), and simply provided empty hooks to the core KVM code. With
server-type systems now being available, things are quite different.

This patch implements very simple support for page aging, by clearing
the Access flag in the Stage-2 page tables. On access fault, the
current fault handling will write the PTE or PMD again, putting the
Access flag back on.

It should be possible to implement a much faster handling for Access
faults, but that's left for a later patch.

With this in place, performance in VMs is degraded much more
gracefully.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_host.h   | 13 ++---
 arch/arm/kvm/mmu.c                | 59 ++-
 arch/arm/kvm/trace.h              | 33 ++
 arch/arm64/include/asm/kvm_arm.h  |  1 +
 arch/arm64/include/asm/kvm_host.h | 13 ++---
 5 files changed, 96 insertions(+), 23 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 04b4ea0..d6b5b85 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -163,19 +163,10 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);

 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
-{
-	return 0;
-}
-
-static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
-{
-	return 0;
-}
-
 static inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
							 unsigned long address)
 {
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index e163a45..ffe89a0 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1068,6 +1068,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
 	return ret;
 }
@@ -1102,7 +1103,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)

 	/* Check the stage-2 fault is trans. fault or write fault */
 	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
-	if (fault_status != FSC_FAULT && fault_status != FSC_PERM) {
+	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
+	    fault_status != FSC_ACCESS) {
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
 			kvm_vcpu_trap_get_class(vcpu),
 			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
@@ -1237,6 +1239,61 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
 }

+static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pmd = stage2_get_pmd(kvm, NULL, gpa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		return 0;
+
+	if (kvm_pmd_huge(*pmd)) {	/* THP, HugeTLB */
+		*pmd = pmd_mkold(*pmd);
+		goto tlbi;
+	}
+
+	pte = pte_offset_kernel(pmd, gpa);
+	if (pte_none(*pte))
+		return 0;
+
+	*pte = pte_mkold(*pte);		/* Just a page... */
+tlbi:
+	kvm_tlb_flush_vmid_ipa(kvm, gpa);
+	return 1;
+}
+
+static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pmd = stage2_get_pmd(kvm, NULL, gpa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		return 0;
+
+	if (kvm_pmd_huge(*pmd))		/* THP, HugeTLB */
+		return pmd_young(*pmd);
+
+	pte = pte_offset_kernel(pmd, gpa);
+	if (!pte_none(*pte))		/* Just a page... */
+		return pte_young(*pte);
+
+	return 0;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+{
+	trace_kvm_age_hva(start, end);
+	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
+}
+
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	trace_kvm_test_age_hva(hva);
+	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
+}
+
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
index b6a6e71..364b5382 100644
--- a/arch/arm/kvm/trace.h
+++ b/arch/arm/kvm/trace.h
@@ -203,6 +203,39 @@
[PATCH 0/3] arm/arm64: KVM: Add support for page aging
So far, KVM/arm doesn't implement any support for page aging, leading
to rather bad performance when the system is swapping. This short
series implements the required hooks and fault handling to deal with
pages being marked old/young.

The three patches are fairly straightforward:

- The first patch changes the range iterator to be able to return a
  value.

- The second patch implements the actual page aging (clearing the AF
  bit in the page tables, and relying on the normal faulting code to
  set the bit again).

- The last patch optimizes the access fault path by only doing the
  minimum to satisfy the fault.

The end result is a system that behaves visibly better under load, as
VM pages don't get evicted that easily.

Based on 3.19-rc5, tested on Seattle and X-Gene.

Also at git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/mm-fixes-3.19

Marc Zyngier (3):
  arm/arm64: KVM: Allow handle_hva_to_gpa to return a value
  arm/arm64: KVM: Implement Stage-2 page aging
  arm/arm64: KVM: Optimize handling of Access Flag faults

 arch/arm/include/asm/kvm_host.h   | 13 +---
 arch/arm/kvm/mmu.c                | 128 +++---
 arch/arm/kvm/trace.h              | 48 ++
 arch/arm64/include/asm/kvm_arm.h  | 1 +
 arch/arm64/include/asm/kvm_host.h | 13 +---
 5 files changed, 171 insertions(+), 32 deletions(-)

--
2.1.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] arm/arm64: KVM: Allow handle_hva_to_gpa to return a value
So far, handle_hva_to_gpa was never required to return a value. As we
prepare to age pages at Stage-2, we need to be able to return a value
from the iterator (kvm_test_age_hva).

Adapt the code to handle this situation. No semantic change.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/mmu.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1366625..e163a45 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1146,15 +1146,16 @@ out_unlock:
 	return ret;
 }

-static void handle_hva_to_gpa(struct kvm *kvm,
-			      unsigned long start,
-			      unsigned long end,
-			      void (*handler)(struct kvm *kvm,
-					      gpa_t gpa, void *data),
-			      void *data)
+static int handle_hva_to_gpa(struct kvm *kvm,
+			     unsigned long start,
+			     unsigned long end,
+			     int (*handler)(struct kvm *kvm,
+					    gpa_t gpa, void *data),
+			     void *data)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
+	int ret = 0;

 	slots = kvm_memslots(kvm);

@@ -1178,14 +1179,17 @@ static void handle_hva_to_gpa(struct kvm *kvm,

 		for (; gfn < gfn_end; ++gfn) {
 			gpa_t gpa = gfn << PAGE_SHIFT;
-			handler(kvm, gpa, data);
+			ret |= handler(kvm, gpa, data);
 		}
 	}
+
+	return ret;
 }

-static void kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
 {
 	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
+	return 0;
 }

 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
@@ -1211,11 +1215,12 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }

-static void kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 {
 	pte_t *pte = (pte_t *)data;

 	stage2_set_pte(kvm, NULL, gpa, pte, false);
+	return 0;
 }
--
2.1.4
[PATCH 3/3] arm/arm64: KVM: Optimize handling of Access Flag faults
Now that we have page aging in Stage-2, it becomes obvious that we're
doing way too much work handling the fault.

The page is not going anywhere (it is still mapped), the page tables
are already allocated, and all we want is to flip a bit in the PMD or
PTE. Also, we can avoid any form of TLB invalidation, since a page
with the AF bit off is not allowed to be cached.

An obvious solution is to have a separate handler for FSC_ACCESS,
where we pride ourselves in doing the very minimum amount of work.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/mmu.c   | 46 ++
 arch/arm/kvm/trace.h | 15 +++
 2 files changed, 61 insertions(+)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index ffe89a0..112bae1 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1073,6 +1073,46 @@ out_unlock:
 	return ret;
 }

+/*
+ * Resolve the access fault by making the page young again.
+ * Note that because the faulting entry is guaranteed not to be
+ * cached in the TLB, we don't need to invalidate anything.
+ */
+static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+	pfn_t pfn;
+	bool pfn_valid = false;
+
+	trace_kvm_access_fault(fault_ipa);
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	pmd = stage2_get_pmd(vcpu->kvm, NULL, fault_ipa);
+	if (!pmd || pmd_none(*pmd))	/* Nothing there */
+		goto out;
+
+	if (kvm_pmd_huge(*pmd)) {	/* THP, HugeTLB */
+		*pmd = pmd_mkyoung(*pmd);
+		pfn = pmd_pfn(*pmd);
+		pfn_valid = true;
+		goto out;
+	}
+
+	pte = pte_offset_kernel(pmd, fault_ipa);
+	if (pte_none(*pte))		/* Nothing there either */
+		goto out;
+
+	*pte = pte_mkyoung(*pte);	/* Just a page... */
+	pfn = pte_pfn(*pte);
+	pfn_valid = true;
+out:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	if (pfn_valid)
+		kvm_set_pfn_accessed(pfn);
+}
+
 /**
  * kvm_handle_guest_abort - handles all 2nd stage aborts
  * @vcpu:	the VCPU pointer
@@ -1140,6 +1180,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	/* Userspace should not be able to register out-of-bounds IPAs */
 	VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);

+	if (fault_status == FSC_ACCESS) {
+		handle_access_fault(vcpu, fault_ipa);
+		ret = 1;
+		goto out_unlock;
+	}
+
 	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
 	if (ret == 0)
 		ret = 1;
diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
index 364b5382..5665a16 100644
--- a/arch/arm/kvm/trace.h
+++ b/arch/arm/kvm/trace.h
@@ -64,6 +64,21 @@ TRACE_EVENT(kvm_guest_fault,
 		  __entry->hxfar, __entry->vcpu_pc)
 );

+TRACE_EVENT(kvm_access_fault,
+	TP_PROTO(unsigned long ipa),
+	TP_ARGS(ipa),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,	ipa	)
+	),
+
+	TP_fast_assign(
+		__entry->ipa = ipa;
+	),
+
+	TP_printk("IPA: %lx", __entry->ipa)
+);
+
 TRACE_EVENT(kvm_irq_line,
	TP_PROTO(unsigned int type, int vcpu_idx, int irq_num, int level),
	TP_ARGS(type, vcpu_idx, irq_num, level),
--
2.1.4
[PATCH v3 2/3] arm/arm64: KVM: Invalidate data cache on unmap
Let's assume a guest has created an uncached mapping, and written to
that page. Let's also assume that the host uses a cache-coherent IO
subsystem. Let's finally assume that the host is under memory pressure
and starts to swap things out.

Before this uncached page is evicted, we need to make sure we
invalidate potential speculated, clean cache lines that are sitting
there, or the IO subsystem is going to swap out the cached view,
losing the data that has been written directly into memory.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 31 +++
 arch/arm/kvm/mmu.c               | 82
 arch/arm64/include/asm/kvm_mmu.h | 18 +
 3 files changed, 116 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 286644c..552c31f 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -44,6 +44,7 @@

 #ifndef __ASSEMBLY__

+#include <linux/highmem.h>
 #include <asm/cacheflush.h>
 #include <asm/pgalloc.h>

@@ -188,6 +189,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 	}
 }

+static inline void __kvm_flush_dcache_pte(pte_t pte)
+{
+	void *va = kmap_atomic(pte_page(pte));
+
+	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+	kunmap_atomic(va);
+}
+
+static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	unsigned long size = PMD_SIZE;
+	pfn_t pfn = pmd_pfn(pmd);
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+		pfn++;
+		size -= PAGE_SIZE;
+
+		kunmap_atomic(va);
+	}
+}
+
+static inline void __kvm_flush_dcache_pud(pud_t pud)
+{
+}
+
 #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))

 void kvm_set_way_flush(struct kvm_vcpu *vcpu);
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 106737e..78e68ab 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -58,6 +58,26 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }

+/*
+ * D-Cache management functions. They take the page table entries by
+ * value, as they are flushing the cache using the kernel mapping (or
+ * kmap on 32bit).
+ */
+static void kvm_flush_dcache_pte(pte_t pte)
+{
+	__kvm_flush_dcache_pte(pte);
+}
+
+static void kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	__kvm_flush_dcache_pmd(pmd);
+}
+
+static void kvm_flush_dcache_pud(pud_t pud)
+{
+	__kvm_flush_dcache_pud(pud);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
				  int min, int max)
 {
@@ -119,6 +139,26 @@ static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
 	put_page(virt_to_page(pmd));
 }

+/*
+ * Unmapping vs dcache management:
+ *
+ * If a guest maps certain memory pages as uncached, all writes will
+ * bypass the data cache and go directly to RAM. However, the CPUs
+ * can still speculate reads (not writes) and fill cache lines with
+ * data.
+ *
+ * Those cache lines will be *clean* cache lines though, so a
+ * clean+invalidate operation is equivalent to an invalidate
+ * operation, because no cache lines are marked dirty.
+ *
+ * Those clean cache lines could be filled prior to an uncached write
+ * by the guest, and the cache coherent IO subsystem would therefore
+ * end up writing old data to disk.
+ *
+ * This is why right after unmapping a page/section and invalidating
+ * the corresponding TLBs, we call kvm_flush_dcache_p*() to make sure
+ * the IO subsystem will never hit in the cache.
+ */
 static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
		       phys_addr_t addr, phys_addr_t end)
 {
@@ -128,9 +168,16 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
 	start_pte = pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
+			pte_t old_pte = *pte;
+
 			kvm_set_pte(pte, __pte(0));
-			put_page(virt_to_page(pte));
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
+
+			/* No need to invalidate the cache for device mappings */
+			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
+				kvm_flush_dcache_pte(old_pte);
+
+			put_page(virt_to_page(pte));
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);

@@ -149,8 +196,13 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
 		next = kvm_pmd_addr_end(addr, end);
 		if (!pmd_none(*pmd)) {
 			if (kvm_pmd_huge(*pmd)) {
+				pmd_t old_pmd = *pmd;
+
				pmd_clear(pmd);
[PATCH v3 0/3] arm/arm64: KVM: Random selection of cache related fixes
This small series fixes a number of issues that Christoffer and I have
been trying to nail down for a while, having to do with the host dying
under load (swapping), and also with the way we deal with caches in
general (and with set/way operations in particular):

- The first one changes the way we handle cache ops by set/way,
  basically turning them into VA ops for the whole memory. This allows
  platforms with system caches to boot a 32bit zImage, for example.

- The second one fixes a corner case that could happen if the guest
  used an uncached mapping (or had its caches off) while the host was
  swapping it out (and using a cache-coherent IO subsystem).

- Finally, the last one fixes a stability issue seen when the host was
  swapping, by using a kernel mapping for cache maintenance instead of
  the userspace one.

With these patches (and both the TLB invalidation and HCR fixes that
are on their way to mainline), the APM platform seems much more robust
than it previously was. Fingers crossed.

The first round of review generated a lot of traffic about ASID-tagged
icache management for guests, but I've decided not to address this
issue as part of this series. The code is broken already, and there
isn't any virtualization-capable, ASID-tagged icache core in the wild,
AFAIK. I'll try to revisit this in another series, once I have wrapped
my head around it (or someone beats me to it).

Based on 3.19-rc5, tested on Juno, X-Gene, TC-2 and Cubietruck.

Also at git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/mm-fixes-3.19

* From v2: [2]
  - Reworked the algorithm that tracks the state of the guest's caches,
    as there were some cases I didn't anticipate. In the end, the
    algorithm is simpler.

* From v1: [1]
  - Dropped Steve's patch after discussion with Andrea
  - Refactored set/way support to avoid code duplication, better comments
  - Much improved comments in patch #2, courtesy of Christoffer

[1]: http://www.spinics.net/lists/kvm-arm/msg13008.html
[2]: http://www.spinics.net/lists/kvm-arm/msg13161.html

Marc Zyngier (3):
  arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  arm/arm64: KVM: Invalidate data cache on unmap
  arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault

 arch/arm/include/asm/kvm_emulate.h   | 10 +++
 arch/arm/include/asm/kvm_host.h      | 3 -
 arch/arm/include/asm/kvm_mmu.h       | 77 +---
 arch/arm/kvm/arm.c                   | 10 ---
 arch/arm/kvm/coproc.c                | 64 +++---
 arch/arm/kvm/coproc_a15.c            | 2 +-
 arch/arm/kvm/coproc_a7.c             | 2 +-
 arch/arm/kvm/mmu.c                   | 164 ++-
 arch/arm/kvm/trace.h                 | 39 +
 arch/arm64/include/asm/kvm_emulate.h | 10 +++
 arch/arm64/include/asm/kvm_host.h    | 3 -
 arch/arm64/include/asm/kvm_mmu.h     | 34 ++--
 arch/arm64/kvm/sys_regs.c            | 75 +++-
 13 files changed, 321 insertions(+), 172 deletions(-)

--
2.1.4
[PATCH v3 1/3] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
Trying to emulate the behaviour of set/way cache ops is fairly
pointless, as there are too many ways we can end up missing stuff.
Also, there are some system caches out there that simply ignore
set/way operations.

So instead of trying to implement them, let's convert it to VA ops,
and use them as a way to re-enable the trapping of VM ops. That way,
we can detect the point when the MMU/caches are turned off, and do a
full VM flush (which is what the guest was trying to do anyway).

This allows a 32bit zImage to boot on the APM thingy, and will
probably help bootloaders in general.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_emulate.h   | 10 +
 arch/arm/include/asm/kvm_host.h      | 3 --
 arch/arm/include/asm/kvm_mmu.h       | 3 +-
 arch/arm/kvm/arm.c                   | 10 -
 arch/arm/kvm/coproc.c                | 64 ++
 arch/arm/kvm/coproc_a15.c            | 2 +-
 arch/arm/kvm/coproc_a7.c             | 2 +-
 arch/arm/kvm/mmu.c                   | 70 -
 arch/arm/kvm/trace.h                 | 39 +++
 arch/arm64/include/asm/kvm_emulate.h | 10 +
 arch/arm64/include/asm/kvm_host.h    | 3 --
 arch/arm64/include/asm/kvm_mmu.h     | 3 +-
 arch/arm64/kvm/sys_regs.c            | 75 +---
 13 files changed, 155 insertions(+), 139 deletions(-)

diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h
index 66ce176..7b01523 100644
--- a/arch/arm/include/asm/kvm_emulate.h
+++ b/arch/arm/include/asm/kvm_emulate.h
@@ -38,6 +38,16 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	vcpu->arch.hcr = HCR_GUEST_MASK;
 }

+static inline unsigned long vcpu_get_hcr(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.hcr;
+}
+
+static inline void vcpu_set_hcr(struct kvm_vcpu *vcpu, unsigned long hcr)
+{
+	vcpu->arch.hcr = hcr;
+}
+
 static inline bool vcpu_mode_is_32bit(struct kvm_vcpu *vcpu)
 {
 	return 1;
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 254e065..04b4ea0 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
	 * Anything that is not used directly from assembly code goes
	 * here.
	 */
-	/* dcache set/way operation pending */
-	int last_pcpu;
-	cpumask_t require_dcache_flush;

 	/* Don't run the guest on this vcpu */
 	bool pause;
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 63e0ecc..286644c 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -190,7 +190,8 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,

 #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))

-void stage2_flush_vm(struct kvm *kvm);
+void kvm_set_way_flush(struct kvm_vcpu *vcpu);
+void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);

 #endif	/* !__ASSEMBLY__ */

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2d6d910..0b0d58a 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);

-	/*
-	 * Check whether this vcpu requires the cache to be flushed on
-	 * this physical CPU. This is a consequence of doing dcache
-	 * operations by set/way on this vcpu. We do it here to be in
-	 * a non-preemptible section.
-	 */
-	if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
-		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
-
 	kvm_arm_set_running_vcpu(vcpu);
 }

@@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);

 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		vcpu->arch.last_pcpu = smp_processor_id();
 		kvm_guest_exit();
 		trace_kvm_exit(*vcpu_pc(vcpu));
 		/*
diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 7928dbd..0afcc00 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -189,82 +189,40 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
 	return true;
 }

-/* See note at ARM ARM B1.14.4 */
+/*
+ * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
+ */
 static bool access_dcsw(struct kvm_vcpu *vcpu,
			const struct coproc_params *p,
			const struct coproc_reg *r)
 {
-	unsigned long val;
-	int cpu;
-
 	if (!p->is_write)
 		return read_from_write_only(vcpu, p);

-	cpu = get_cpu();
-
-	cpumask_setall(&vcpu->arch.require_dcache_flush);
-	cpumask_clear_cpu(cpu,
[PATCH v3 3/3] arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault
When handling a fault in stage-2, we need to resync I$ and D$, just to
be sure we don't leave any old cache line behind.

That's very good, except that we do so using the *user* address. Under
heavy load (swapping like crazy), we may end up in a situation where
the page gets mapped in stage-2 while being unmapped from userspace by
another CPU.

At that point, the DC/IC instructions can generate a fault, which we
handle with kvm->mmu_lock held. The box quickly deadlocks, user is
unhappy.

Instead, perform this invalidation through the kernel mapping, which
is guaranteed to be present. The box is much happier, and so am I.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 43 +++-
 arch/arm/kvm/mmu.c               | 12 +++
 arch/arm64/include/asm/kvm_mmu.h | 13 +++-
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 552c31f..e5614c9 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -162,13 +162,10 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }

-static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
-					     unsigned long size,
-					     bool ipa_uncached)
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+					       unsigned long size,
+					       bool ipa_uncached)
 {
-	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
-		kvm_flush_dcache_to_poc((void *)hva, size);
-
 	/*
	 * If we are going to insert an instruction page and the icache is
	 * either VIPT or PIPT, there is a potential problem where the host
@@ -180,10 +177,38 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
	 *
	 * VIVT caches are tagged using both the ASID and the VMID and doesn't
	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
+	 *
+	 * We need to do this through a kernel mapping (using the
+	 * user-space mapping has proved to be the wrong
+	 * solution). For that, we need to kmap one page at a time,
+	 * and iterate over the range.
	 */
-	if (icache_is_pipt()) {
-		__cpuc_coherent_user_range(hva, hva + size);
-	} else if (!icache_is_vivt_asid_tagged()) {
+
+	bool need_flush = !vcpu_has_cache_enabled(vcpu) || ipa_uncached;
+
+	VM_BUG_ON(size & PAGE_MASK);
+
+	if (!need_flush && !icache_is_pipt())
+		goto vipt_cache;
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		if (need_flush)
+			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+		if (icache_is_pipt())
+			__cpuc_coherent_user_range((unsigned long)va,
+						   (unsigned long)va + PAGE_SIZE);
+
+		size -= PAGE_SIZE;
+		pfn++;
+
+		kunmap_atomic(va);
+	}
+
+vipt_cache:
+	if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
 	}
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 78e68ab..1366625 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -957,6 +957,12 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 	return !pfn_valid(pfn);
 }

+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+				      unsigned long size, bool uncached)
+{
+	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			  struct kvm_memory_slot *memslot, unsigned long hva,
			  unsigned long fault_status)
@@ -1046,8 +1052,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			kvm_set_s2pmd_writable(&new_pmd);
			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva & PMD_MASK, PMD_SIZE,
-					  fault_ipa_uncached);
+		coherent_cache_guest_page(vcpu, pfn, PMD_SIZE, fault_ipa_uncached);
 		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
		pte_t new_pte = pfn_pte(pfn, mem_type);
@@ -1055,8 +1060,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			kvm_set_s2pte_writable(&new_pte);
			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva,
Re: [PATCH v16 00/10] KVM/arm/arm64/x86: dirty page logging for ARMv7/8 (3.18.0-rc2)
On 01/21/2015 03:08 AM, Christoffer Dall wrote:
> On Thu, Jan 15, 2015 at 03:58:51PM -0800, Mario Smarduch wrote:
>> Patch series adds support for armv7/8 dirty page logging. As we move
>> towards generic dirty page logging interface we move some common code
>> to generic layer shared by x86, armv7 and armv8.
>>
>> armv7/8 dirty page logging implementation overview:
>> - Initially write-protects the memory region's 2nd stage page tables.
>> - Reads the dirty page log and again write-protects dirty pages for
>>   the next pass.
>> - Second stage huge pages are dissolved into normal pages to keep
>>   track of dirty memory at page granularity. Tracking at huge page
>>   granularity limits the granularity of marking dirty memory, and
>>   limits migration to a light memory load. Small page size logging
>>   supports higher memory dirty rates and enables rapid migration.
>>   armv7 supports a 2MB huge page; armv8 supports 2MB (4kb pages) and
>>   512MB (64kb pages).
>> - In the event migration is canceled, normal behavior is resumed and
>>   huge pages are rebuilt over time.
>
> Thanks, applied.
>
> -Christoffer

Thanks! And also to other folks that helped along the way in shaping
the design and reviews.

- Mario
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
2015-01-21 12:16-0200, Marcelo Tosatti:
> On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
>> 2015-01-20 15:54-0200, Marcelo Tosatti:
>>> SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
>>> and rdtsc is larger than a given threshold:
>>> [...]
>>> Disable masterclock support (which increases said delta) in case the
>>> boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
>>
>> Why do we care about 2.6.16 bugs in upstream KVM?
>
> Because people do use 2.6.16 guests.

(Those people probably won't use 3.19+ hosts ...
 Is this patch intended for stable?)

>> The code to benefit tradeoff of this patch seems bad to me ...
>
> Can you state the tradeoff and then explain why it is bad ?

Additional code needs time to understand and is a source of bugs, yet
we still include it because we want to achieve something. I meant the
tradeoff between the perceived value of something and the acceptability
of the code. (Ideally, computer programs would be a shorter version of
"Do what I want.\nEOF".)

There are three main points that made me think it is bad:

1) The bug happens because a guest expects greater precision.
   I consider that a guest problem. kvmclock never guaranteed anything,
   so unmet expectations should be a recoverable error.

2) With time, the probability that 2.6.16 is used is getting lower,
   while people looking at KVM's code appear.
   - At what point are we going to drop 2.6.16 support?
     (We shouldn't let mistakes drag us down forever ...
      Or are we dooming KVM on purpose?)

3) The patch made me ask more silly questions than it answered :)
   (Why can't other software depend on previous behavior?
    Why can't kvmclock without master clock still fail?
    Why can't we improve the master clock?)

>> MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
>> MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
>> to support old guests.
>
> What is the benefit of removing support for MSR_KVM_SYSTEM_TIME ?

The maintainability of the code increases. It would look as if we never
made the mistake with MSR_KVM_SYSTEM_TIME & MSR_KVM_WALL_CLOCK.
(I like when old code looks as if we wrote it from scratch.)

After comparing the (imperfectly evaluated) benefit of both variants,

original patch:
 + 2.6.16 SUSE guests work
 - MSR_KVM_SYSTEM_TIME guests don't use master clock
 - KVM code is worse

removal of KVM_FEATURE_CLOCKSOURCE:
 + 2.6.16 SUSE guests likely work
 + KVM code is better
 - MSR_KVM_SYSTEM_TIME guests use even worse clocksource

As KVM_FEATURE_CLOCKSOURCE2 was introduced in 2010, I found the removal
better even without waiting for the last MSR_KVM_SYSTEM_TIME guest to
perish.

> Supporting old guests is important. It comes at a price.

(Mutually exclusive goals are important as well.)
[patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
Since the lapic timer handler only wakes up a simple waitqueue, it can
be executed from hardirq context.

Also handle the case where hrtimer_start_expires fails due to -ETIME,
by injecting the interrupt to the guest immediately.

Reduces average cyclictest latency by 3us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/x86/kvm/lapic.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

Index: linux-stable-rt/arch/x86/kvm/lapic.c
===================================================================
--- linux-stable-rt.orig/arch/x86/kvm/lapic.c	2014-11-25 14:14:38.636810068 -0200
+++ linux-stable-rt/arch/x86/kvm/lapic.c	2015-01-14 14:59:17.840251874 -0200
@@ -1031,8 +1031,38 @@
 				   apic->divide_count);
 }
 
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data);
+
+static void apic_timer_expired(struct hrtimer *data)
+{
+	int ret, i = 0;
+	enum hrtimer_restart r;
+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+	r = apic_timer_fn(data);
+
+	if (r == HRTIMER_RESTART) {
+		do {
+			ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
+			if (ret == -ETIME)
+				hrtimer_add_expires_ns(&ktimer->timer,
+						       ktimer->period);
+			i++;
+		} while (ret == -ETIME && i < 10);
+
+		if (ret == -ETIME) {
+			printk_once(KERN_ERR "%s: failed to reprogram timer\n",
+				    __func__);
+			WARN_ON_ONCE(1);
+		}
+	}
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
+	int ret;
 	ktime_t now;
 
 	atomic_set(&apic->lapic_timer.pending, 0);
@@ -1062,9 +1092,11 @@
 		}
 	}
 
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		      ktime_add_ns(now, apic->lapic_timer.period),
 		      HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016" PRIx64 ","
@@ -1094,8 +1126,10 @@
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 	}
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	local_irq_restore(flags);
 }
@@ -1581,6 +1615,7 @@
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer.irqsafe = 1;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
@@ -1699,7 +1734,8 @@
 	timer = &vcpu->arch.apic->lapic_timer.timer;
 	if (hrtimer_cancel(timer))
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+		if (hrtimer_start_expires(timer, HRTIMER_MODE_ABS) == -ETIME)
+			apic_timer_expired(timer);
 }
 
 /*
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
2015-01-20 23:04+0200, Nadav Amit:
> Radim Krčmář <rkrc...@redhat.com> wrote:
> > 2015-01-14 01:27+, Wu, Feng:
> > > > > the new hardware even doesn't consider the TPR for lowest
> > > > > priority interrupts delivery.
> > > >
> > > > A bold move ... what hardware was the first to do so?
> > >
> > > I think it was starting with Nehalem.
> >
> > Thanks, (could be that QPI can't inform about TPR changes anymore ...)
> >
> > I played with Linux's TPR on Haswell and found that it has no effect.
>
> Sorry for jumping into the discussion, but doesn't it depend on
> IA32_MISC_ENABLE[23]?  This bit disables xTPR messages.  On my machine
> it is set (probably by the BIOS), but since IA32_MISC_ENABLE is not
> locked for changes, the OS can control it.

Thanks, I didn't know about it.

On Ivy Bridge EP (the only modern machine at hand), the bit was set by
default.  After clearing it, TPR still had no effect.

The most relevant mention of xTPR I found is related to FSB [1].
[2] isn't enlightening, so there might be more from the QPI era ...

---
1: Intel® E7320 Memory Controller Hub (MCH) Datasheet
   http://www.intel.com/content/dam/doc/datasheet/e7320-memory-controller-hub-datasheet.pdf
   5.2.2 System Bus Interrupts
2: Intel® Xeon® Processor E5 v2 Family: Datasheet, Vol. 2
   http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v2-datasheet-vol-2.pdf
   6.1.2 IntControl
[patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq
The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up the vcpu thread from hardirq context, thus avoiding
the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT are sleepable
locks.  Therefore, waking up a waitqueue waiter involves locking a
sleeping lock, which is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60 -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/arm/kvm/arm.c                  |    4 ++--
 arch/arm/kvm/psci.c                 |    4 ++--
 arch/mips/kvm/kvm_mips.c            |    8 ++++----
 arch/powerpc/include/asm/kvm_host.h |    4 ++--
 arch/powerpc/kvm/book3s_hv.c        |   20 ++++++++++----------
 arch/s390/include/asm/kvm_host.h    |    2 +-
 arch/s390/kvm/interrupt.c           |   22 ++++++++++------------
 arch/s390/kvm/sigp.c                |   16 ++++++++--------
 arch/x86/kvm/lapic.c                |    6 +++---
 include/linux/kvm_host.h            |    4 ++--
 virt/kvm/async_pf.c                 |    4 ++--
 virt/kvm/kvm_main.c                 |   16 ++++++++--------
 12 files changed, 54 insertions(+), 56 deletions(-)

Index: linux-stable-rt/arch/arm/kvm/arm.c
===================================================================
--- linux-stable-rt.orig/arch/arm/kvm/arm.c	2014-11-25 14:13:39.188899952 -0200
+++ linux-stable-rt/arch/arm/kvm/arm.c	2014-11-25 14:14:38.620810092 -0200
@@ -495,9 +495,9 @@
 static void vcpu_pause(struct kvm_vcpu *vcpu)
 {
-	wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
+	struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
 
-	wait_event_interruptible(*wq, !vcpu->arch.pause);
+	swait_event_interruptible(*wq, !vcpu->arch.pause);
 }
 
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
Index: linux-stable-rt/arch/arm/kvm/psci.c
===================================================================
--- linux-stable-rt.orig/arch/arm/kvm/psci.c	2014-11-25 14:13:39.189899951 -0200
+++ linux-stable-rt/arch/arm/kvm/psci.c	2014-11-25 14:14:38.620810092 -0200
@@ -36,7 +36,7 @@
 {
 	struct kvm *kvm = source_vcpu->kvm;
 	struct kvm_vcpu *vcpu = NULL, *tmp;
-	wait_queue_head_t *wq;
+	struct swait_head *wq;
 	unsigned long cpu_id;
 	unsigned long mpidr;
 	phys_addr_t target_pc;
@@ -80,7 +80,7 @@
 	smp_mb();		/* Make sure the above is visible */
 
 	wq = kvm_arch_vcpu_wq(vcpu);
-	wake_up_interruptible(wq);
+	swait_wake_interruptible(wq);
 
 	return KVM_PSCI_RET_SUCCESS;
 }
Index: linux-stable-rt/arch/mips/kvm/kvm_mips.c
===================================================================
--- linux-stable-rt.orig/arch/mips/kvm/kvm_mips.c	2014-11-25 14:13:39.191899948 -0200
+++ linux-stable-rt/arch/mips/kvm/kvm_mips.c	2014-11-25 14:14:38.621810091 -0200
@@ -464,8 +464,8 @@
 
 	dvcpu->arch.wait = 0;
 
-	if (waitqueue_active(&dvcpu->wq)) {
-		wake_up_interruptible(&dvcpu->wq);
+	if (swaitqueue_active(&dvcpu->wq)) {
+		swait_wake_interruptible(&dvcpu->wq);
 	}
 
 	return 0;
@@ -971,8 +971,8 @@
 	kvm_mips_callbacks->queue_timer_int(vcpu);
 
 	vcpu->arch.wait = 0;
-	if (waitqueue_active(&vcpu->wq)) {
-		wake_up_interruptible(&vcpu->wq);
+	if (swaitqueue_active(&vcpu->wq)) {
+		swait_wake_interruptible(&vcpu->wq);
 	}
 }
Index: linux-stable-rt/arch/powerpc/include/asm/kvm_host.h
===================================================================
--- linux-stable-rt.orig/arch/powerpc/include/asm/kvm_host.h	2014-11-25 14:13:39.193899944 -0200
+++ linux-stable-rt/arch/powerpc/include/asm/kvm_host.h	2014-11-25 14:14:38.621810091 -0200
@@ -295,7 +295,7 @@
 	u8 in_guest;
 	struct list_head runnable_threads;
 	spinlock_t lock;
-	wait_queue_head_t wq;
+	struct swait_head wq;
 	u64 stolen_tb;
 	u64 preempt_tb;
 	struct kvm_vcpu *runner;
@@ -612,7 +612,7 @@
 	u8 prodded;
 	u32 last_inst;
 
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 	struct kvmppc_vcore *vcore;
 	int ret;
 	int trap;
Index: linux-stable-rt/arch/powerpc/kvm/book3s_hv.c
===================================================================
--- linux-stable-rt.orig/arch/powerpc/kvm/book3s_hv.c	2014-11-25 14:13:39.195899942 -0200
+++ linux-stable-rt/arch/powerpc/kvm/book3s_hv.c	2014-11-25 14:14:38.625810085 -0200
@@ -74,11 +74,11 @@
 {
 	int me;
 	int cpu = vcpu->cpu;
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 
 	wqp
[patch -rt 0/2] use simple waitqueue for kvm vcpu waitqueue (v4)
Against the v3.14-rt branch of
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up the vcpu thread from hardirq context, thus avoiding
the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT are sleepable
locks.  Therefore, waking up a waitqueue waiter involves locking a
sleeping lock, which is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60 -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

v2: improve changelog (Rik van Riel)
v3: limit (once) guest triggered printk and WARN_ON (Paolo Bonzini)
v4: fix typo (Steven Rostedt)
Re: [PATCH v12 10/18] vfio/platform: trigger an interrupt via eventfd
On Wed, 2015-01-21 at 13:49 +0100, Baptiste Reynal wrote:
> From: Antonios Motakis <a.mota...@virtualopensystems.com>
>
> This patch allows to set an eventfd for a platform device's interrupt,
> and also to trigger the interrupt eventfd from userspace for testing.
>
> Level sensitive interrupts are marked as maskable and are handled in
> a later patch.  Edge triggered interrupts are not advertised as
> maskable and are implemented here using a simple and efficient IRQ
> handler.
>
> Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
> [Baptiste Reynal: fix masked interrupt initialization]
> Signed-off-by: Baptiste Reynal <b.rey...@virtualopensystems.com>
> ---
>  drivers/vfio/platform/vfio_platform_irq.c     | 98 ++++++++++++++++++-
>  drivers/vfio/platform/vfio_platform_private.h |  2 +
>  2 files changed, 98 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
> index df5c919..4b1ee22 100644
> --- a/drivers/vfio/platform/vfio_platform_irq.c
> +++ b/drivers/vfio/platform/vfio_platform_irq.c
> @@ -39,12 +39,96 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
>  	return -EINVAL;
>  }
>
> +static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
> +{
> +	struct vfio_platform_irq *irq_ctx = dev_id;
> +
> +	eventfd_signal(irq_ctx->trigger, 1);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int vfio_set_trigger(struct vfio_platform_device *vdev, int index,
> +			    int fd, irq_handler_t handler)
> +{
> +	struct vfio_platform_irq *irq = &vdev->irqs[index];
> +	struct eventfd_ctx *trigger;
> +	int ret;
> +
> +	if (irq->trigger) {
> +		free_irq(irq->hwirq, irq);
> +		kfree(irq->name);
> +		eventfd_ctx_put(irq->trigger);
> +		irq->trigger = NULL;
> +	}
> +
> +	if (fd < 0) /* Disable only */
> +		return 0;
> +
> +	irq->name = kasprintf(GFP_KERNEL, "vfio-irq[%d](%s)",
> +			      irq->hwirq, vdev->name);
> +	if (!irq->name)
> +		return -ENOMEM;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		kfree(irq->name);
> +		return PTR_ERR(trigger);
> +	}
> +
> +	irq->trigger = trigger;
> +
> +	irq_set_status_flags(irq->hwirq, IRQ_NOAUTOEN);
> +	ret = request_irq(irq->hwirq, handler, 0, irq->name, irq);
> +	if (ret) {
> +		kfree(irq->name);
> +		eventfd_ctx_put(trigger);
> +		irq->trigger = NULL;
> +		return ret;
> +	}
> +
> +	if (!irq->masked)
> +		enable_irq(irq->hwirq);

Unfortunately, irq->masked doesn't exist until the next patch.

Thanks,

Alex
Re: [PATCH v12 12/18] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
On Wed, 2015-01-21 at 13:50 +0100, Baptiste Reynal wrote:
> From: Antonios Motakis <a.mota...@virtualopensystems.com>
>
> We want to reuse virqfd functionality in multiple VFIO drivers; before
> moving these functions to core VFIO, add the vfio_ prefix to the
> virqfd_enable and virqfd_disable functions, and export them so they
> can be used from other modules.
>
> Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++++++++++++++++--------------
>  drivers/vfio/pci/vfio_pci_private.h |  4 ++--
>  2 files changed, 18 insertions(+), 16 deletions(-)

...

> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 671c17a..2e2f0ea 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
>  extern int vfio_pci_init_perm_bits(void);
>  extern void vfio_pci_uninit_perm_bits(void);
>
> -extern int vfio_pci_virqfd_init(void);
> -extern void vfio_pci_virqfd_exit(void);
> +extern int vfio_virqfd_init(void);
> +extern void vfio_virqfd_exit(void);
>
>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>  extern void vfio_config_free(struct vfio_pci_device *vdev);

This chunk is in the wrong patch, it needs to be moved to the next
patch or else the series isn't bisect-able.

Thanks,

Alex
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
On Wed, Jan 21, 2015 at 06:00:37PM +0100, Radim Krčmář wrote:
> 2015-01-21 12:16-0200, Marcelo Tosatti:
> > On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
> > > 2015-01-20 15:54-0200, Marcelo Tosatti:
> > > > SuSE's 2.6.16 kernel fails to boot if the delta between
> > > > tsc_timestamp and rdtsc is larger than a given threshold:
> > > > [...]
> > > > Disable masterclock support (which increases said delta) in case
> > > > the boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
> > >
> > > Why do we care about 2.6.16 bugs in upstream KVM?
> >
> > Because people do use 2.6.16 guests.
>
> (Those people probably won't use a 3.19+ host ... Is this patch
> intended for stable?)

Yes.

> > > The code to benefit tradeoff of this patch seems bad to me ...
> >
> > Can you state the tradeoff and then explain why it is bad?
>
> Additional code needs time to understand and is a source of bugs, yet
> we still include it because we want to achieve something.  I meant the
> tradeoff between the perceived value of something and the
> acceptability of the code.  (Ideally, computer programs would be a
> shorter version of "Do what I want.\nEOF".)
>
> There are three main points that made me think it is bad:
>
> 1) The bug happens because a guest expects greater precision.
>    I consider that a guest problem.  kvmclock never guaranteed
>    anything, so unmet expectations should be a recoverable error.

delta = pvclock_data.tsc_timestamp - RDTSC

The guest expects delta to be smaller than a given threshold.  It does
not expect greater precision.  The size of delta does not affect
precision.

> 2) With time, the probability that 2.6.16 is used is getting lower,
>    while people looking at KVM's code appear.
>    - At what point are we going to drop 2.6.16 support?
>      (We shouldn't let mistakes drag us down forever ... Or are we
>      dooming KVM on purpose?)

One of the features of virtualization is to be able to run old
operating systems?

> 3) The patch made me ask more silly questions than it answered :)
>    (Why can't other software depend on previous behavior?

Documentation/virtual/kvm/msr.txt:

  "whose data will be filled in by the hypervisor periodically.
   Only one write, or registration, is needed for each VCPU.  The
   interval between updates of this structure is arbitrary and
   implementation-dependent.  The hypervisor may update this structure
   at any time it sees fit until anything with bit0 == 0 is written to
   it."

>     Why can't kvmclock without master clock still fail?

It can, given a loaded system.

>     Why can't we improve the master clock?)

Out of context question.

> > > MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
> > > MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we
> > > want to support old guests.
> >
> > What is the benefit of removing support for MSR_KVM_SYSTEM_TIME?
>
> The maintainability of the code increases.  It would look as if we
> never made the mistake with MSR_KVM_SYSTEM_TIME & MSR_KVM_WALL_CLOCK.
> (I like when old code looks as if we wrote it from scratch.)
>
> After comparing the (imperfectly evaluated) benefit of both variants,
>
>   original patch:
>   + 2.6.16 SUSE guests work
>   - MSR_KVM_SYSTEM_TIME guests don't use master clock
>   - KVM code is worse
>
>   removal of KVM_FEATURE_CLOCKSOURCE:
>   + 2.6.16 SUSE guests likely work

All guests which depend on KVM_FEATURE_CLOCKSOURCE will timedrift.

>   + KVM code is better
>   - MSR_KVM_SYSTEM_TIME guests use an even worse clocksource
>
> As KVM_FEATURE_CLOCKSOURCE2 was introduced in 2010, I found the
> removal better even without waiting for the last MSR_KVM_SYSTEM_TIME
> guest to perish.
>
> Supporting old guests is important.  It comes at a price.
> (Mutually exclusive goals are important as well.)

This phrase is awkward.  Overlapping goals are negative, then?
(Think of a large number of totally overlapping goals.)
[RFC v3 1/2] x86/xen: add xen_is_preemptible_hypercall()
From: Luis R. Rodriguez <mcg...@suse.com>

On kernels with voluntary or no preemption, we can run into situations
where a hypercall issued through userspace will linger around as it
addresses sub-operations in kernel context (multicalls).  Such
operations can trigger soft lockup detection.

We want a way to let the kernel voluntarily preempt such calls even on
non-preempt kernels; to address this we first need to distinguish which
hypercalls fall under this category.  This implements
xen_is_preemptible_hypercall(), which lets us do just that by adding a
secondary hypercall page: calls made via the new page may be preempted.

Andrew had originally submitted a version of this work [0].

[0] http://lists.xen.org/archives/html/xen-devel/2014-02/msg01056.html

Based on original work by: Andrew Cooper <andrew.coop...@citrix.com>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Borislav Petkov <b...@suse.de>
Cc: David Vrabel <david.vra...@citrix.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
---
 arch/x86/include/asm/xen/hypercall.h | 20 ++++++++++++++++++++
 arch/x86/xen/enlighten.c             |  7 +++++++
 arch/x86/xen/xen-head.S              | 18 +++++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index ca08a27..221008e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -84,6 +84,22 @@
 extern struct { char _entry[32]; } hypercall_page[];
 
+#ifndef CONFIG_PREEMPT
+extern struct { char _entry[32]; } preemptible_hypercall_page[];
+
+static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs)
+{
+	return !user_mode_vm(regs) &&
+		regs->ip >= (unsigned long)preemptible_hypercall_page &&
+		regs->ip < (unsigned long)preemptible_hypercall_page + PAGE_SIZE;
+}
+#else
+static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs)
+{
+	return false;
+}
+#endif
+
 #define __HYPERCALL		"call hypercall_page+%c[offset]"
 #define __HYPERCALL_ENTRY(x)					\
 	[offset] "i" (__HYPERVISOR_##x * sizeof(hypercall_page[0]))
@@ -215,7 +231,11 @@ privcmd_call(unsigned call,
 	asm volatile("call *%[call]"
 		     : __HYPERCALL_5PARAM
+#ifndef CONFIG_PREEMPT
+		     : [call] "a" (&preemptible_hypercall_page[call])
+#else
 		     : [call] "a" (&hypercall_page[call])
+#endif
 		     : __HYPERCALL_CLOBBER5);
 
 	return (long)__res;
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 6bf3a13..9c01b48 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -84,6 +84,9 @@
 #include "multicalls.h"
 
 EXPORT_SYMBOL_GPL(hypercall_page);
+#ifndef CONFIG_PREEMPT
+EXPORT_SYMBOL_GPL(preemptible_hypercall_page);
+#endif
 
 /*
  * Pointer to the xen_vcpu_info structure or
@@ -1531,6 +1534,10 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 	xen_setup_machphys_mapping();
 
+#ifndef CONFIG_PREEMPT
+	copy_page(preemptible_hypercall_page, hypercall_page);
+#endif
+
 	/* Install Xen paravirt ops */
 	pv_info = xen_info;
 	pv_init_ops = xen_init_ops;
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 674b2225..6e6a9517 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -85,9 +85,18 @@ ENTRY(xen_pvh_early_cpu_init)
 	.pushsection .text
 	.balign PAGE_SIZE
 ENTRY(hypercall_page)
+
+#ifdef CONFIG_PREEMPT
+# define PREEMPT_HYPERCALL_ENTRY(x)
+#else
+# define PREEMPT_HYPERCALL_ENTRY(x) \
+	.global xen_hypercall_##x ## _p ASM_NL \
+	.set preemptible_xen_hypercall_##x, xen_hypercall_##x + PAGE_SIZE ASM_NL
+#endif
 #define NEXT_HYPERCALL(x) \
 	ENTRY(xen_hypercall_##x) \
-	.skip 32
+	.skip 32 ASM_NL \
+	PREEMPT_HYPERCALL_ENTRY(x)
 
 NEXT_HYPERCALL(set_trap_table)
 NEXT_HYPERCALL(mmu_update)
@@ -138,6 +147,13 @@ NEXT_HYPERCALL(arch_4)
 NEXT_HYPERCALL(arch_5)
 NEXT_HYPERCALL(arch_6)
 	.balign PAGE_SIZE
+
+#ifndef CONFIG_PREEMPT
+ENTRY(preemptible_hypercall_page)
+	.skip PAGE_SIZE
+#endif /* CONFIG_PREEMPT */
+
+#undef NEXT_HYPERCALL
 	.popsection
 
 ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux")
-- 
2.1.1
Re: [RFC v3 2/2] x86/xen: allow privcmd hypercalls to be preempted
On Wed, Jan 21, 2015 at 6:17 PM, Luis R. Rodriguez
<mcg...@do-not-panic.com> wrote:
> From: Luis R. Rodriguez <mcg...@suse.com>
>
> Xen has support for splitting heavy work into a series of hypercalls,
> called multicalls, and preempting them through what Xen calls
> continuation [0].  Despite this, without CONFIG_PREEMPT preemption
> won't happen, and without preemption a system can become pretty
> useless on heavy-handed hypercalls.  Such is the case, for example,
> when creating a 50 GiB HVM guest -- we can get softlockups [1] with:
>
> kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> The soft lockup triggers on the TASK_UNINTERRUPTIBLE hanger check
> (default 120 seconds); on the Xen side, in this particular case, this
> happens when the following Xen hypervisor code is used:
>
> xc_domain_set_pod_target()
>   --> do_memory_op()
>     --> arch_memory_op()
>       --> p2m_pod_set_mem_target()
>         -- long delay (real or emulated) --
>
> This happens on arch_memory_op() on the XENMEM_set_pod_target memory
> op, even though arch_memory_op() can handle continuation via
> hypercall_create_continuation(), for example.
>
> Machines with over 50 GiB of memory are in high demand and hard to
> come by, so to help replicate this sort of issue, long delays on
> select hypercalls have been emulated in order to be able to test this
> on smaller machines [2].
>
> On one hand this issue can be considered expected, given that
> CONFIG_PREEMPT=n is used; however, we have precedent for forced
> voluntary preemption in the kernel even with CONFIG_PREEMPT=n, through
> the usage of cond_resched() sprinkled in many places.  To address this
> issue with Xen hypercalls, though, we need to find a way to aid the
> scheduler in the middle of hypercalls.
>
> We are motivated to address this issue on CONFIG_PREEMPT=n as
> otherwise the system becomes rather unresponsive for long periods of
> time; in the worst case -- at least currently only by emulating long
> delays on select I/O disk-bound hypercalls -- this can lead to
> filesystem corruption if the delay happens, for example, on
> SCHEDOP_remote_shutdown (when we call 'xl domain shutdown').
>
> We can address this problem by checking whether we should schedule on
> the xen timer in the middle of a hypercall, on the return from the
> timer interrupt.  We want to be careful not to always force voluntary
> preemption, though, so we only selectively enable preemption on very
> specific xen hypercalls.
>
> This enables hypercall preemption by selectively forcing checks for
> voluntary preempting only on ioctl-initiated private hypercalls where
> we know some folks have run into reported issues [1].
>
> [0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
> [1] https://bugzilla.novell.com/show_bug.cgi?id=861093
> [2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch
>
> Based on original work by: David Vrabel <david.vra...@citrix.com>
> Suggested-by: Andy Lutomirski <l...@amacapital.net>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: David Vrabel <david.vra...@citrix.com>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: H. Peter Anvin <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: Steven Rostedt <rost...@goodmis.org>
> Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
> Cc: Jan Beulich <jbeul...@suse.com>
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
> ---
>  arch/x86/kernel/entry_32.S       |  2 ++
>  arch/x86/kernel/entry_64.S       |  2 ++
>  drivers/xen/events/events_base.c | 13 +++++++++++++
>  include/xen/events.h             |  1 +
>  4 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 000d419..b4b1f42 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -982,6 +982,8 @@ ENTRY(xen_hypervisor_callback)
>  ENTRY(xen_do_upcall)
>  1:	mov %esp, %eax
>  	call xen_evtchn_do_upcall
> +	movl %esp,%eax
> +	call xen_end_upcall
>  	jmp  ret_from_intr
>  	CFI_ENDPROC
>  ENDPROC(xen_hypervisor_callback)
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 9ebaf63..ee28733 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1198,6 +1198,8 @@ ENTRY(xen_do_hypervisor_callback)   # do_hypervisor_callback(struct *pt_regs)
>  	popq %rsp
>  	CFI_DEF_CFA_REGISTER rsp
>  	decl PER_CPU_VAR(irq_count)
> +	movq %rsp, %rdi  /* pass pt_regs as first argument */
> +	call xen_end_upcall
>  	jmp  error_exit
>  	CFI_ENDPROC
> END(xen_do_hypervisor_callback)
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index b4bca2d..23c526b 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -32,6 +32,8 @@
>  #include <linux/slab.h>
>  #include <linux/irqnr.h>
>  #include <linux/pci.h>
[RFC v3 2/2] x86/xen: allow privcmd hypercalls to be preempted
From: Luis R. Rodriguez <mcg...@suse.com>

Xen has support for splitting heavy work into a series of hypercalls,
called multicalls, and preempting them through what Xen calls
continuation [0].  Despite this, without CONFIG_PREEMPT preemption
won't happen, and without preemption a system can become pretty
useless on heavy-handed hypercalls.  Such is the case, for example,
when creating a 50 GiB HVM guest -- we can get softlockups [1] with:

kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]

The soft lockup triggers on the TASK_UNINTERRUPTIBLE hanger check
(default 120 seconds); on the Xen side, in this particular case, this
happens when the following Xen hypervisor code is used:

xc_domain_set_pod_target()
  --> do_memory_op()
    --> arch_memory_op()
      --> p2m_pod_set_mem_target()
        -- long delay (real or emulated) --

This happens on arch_memory_op() on the XENMEM_set_pod_target memory
op, even though arch_memory_op() can handle continuation via
hypercall_create_continuation(), for example.

Machines with over 50 GiB of memory are in high demand and hard to come
by, so to help replicate this sort of issue, long delays on select
hypercalls have been emulated in order to be able to test this on
smaller machines [2].

On one hand this issue can be considered expected, given that
CONFIG_PREEMPT=n is used; however, we have precedent for forced
voluntary preemption in the kernel even with CONFIG_PREEMPT=n, through
the usage of cond_resched() sprinkled in many places.  To address this
issue with Xen hypercalls, though, we need to find a way to aid the
scheduler in the middle of hypercalls.

We are motivated to address this issue on CONFIG_PREEMPT=n as otherwise
the system becomes rather unresponsive for long periods of time; in the
worst case -- at least currently only by emulating long delays on
select I/O disk-bound hypercalls -- this can lead to filesystem
corruption if the delay happens, for example, on
SCHEDOP_remote_shutdown (when we call 'xl domain shutdown').

We can address this problem by checking whether we should schedule on
the xen timer in the middle of a hypercall, on the return from the
timer interrupt.  We want to be careful not to always force voluntary
preemption, though, so we only selectively enable preemption on very
specific xen hypercalls.

This enables hypercall preemption by selectively forcing checks for
voluntary preempting only on ioctl-initiated private hypercalls where
we know some folks have run into reported issues [1].

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
[1] https://bugzilla.novell.com/show_bug.cgi?id=861093
[2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch

Based on original work by: David Vrabel <david.vra...@citrix.com>
Suggested-by: Andy Lutomirski <l...@amacapital.net>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Borislav Petkov <b...@suse.de>
Cc: David Vrabel <david.vra...@citrix.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: x...@kernel.org
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
---
 arch/x86/kernel/entry_32.S       |  2 ++
 arch/x86/kernel/entry_64.S       |  2 ++
 drivers/xen/events/events_base.c | 13 +++++++++++++
 include/xen/events.h             |  1 +
 4 files changed, 18 insertions(+)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 000d419..b4b1f42 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -982,6 +982,8 @@ ENTRY(xen_hypervisor_callback)
 ENTRY(xen_do_upcall)
 1:	mov %esp, %eax
 	call xen_evtchn_do_upcall
+	movl %esp,%eax
+	call xen_end_upcall
 	jmp  ret_from_intr
 	CFI_ENDPROC
ENDPROC(xen_hypervisor_callback)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..ee28733 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1198,6 +1198,8 @@ ENTRY(xen_do_hypervisor_callback)   # do_hypervisor_callback(struct *pt_regs)
 	popq %rsp
 	CFI_DEF_CFA_REGISTER rsp
 	decl PER_CPU_VAR(irq_count)
+	movq %rsp, %rdi  /* pass pt_regs as first argument */
+	call xen_end_upcall
 	jmp  error_exit
 	CFI_ENDPROC
END(xen_do_hypervisor_callback)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index b4bca2d..23c526b 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -32,6 +32,8 @@
 #include <linux/slab.h>
 #include <linux/irqnr.h>
 #include <linux/pci.h>
+#include <linux/sched.h>
+#include <linux/kprobes.h>
 
 #ifdef CONFIG_X86
 #include <asm/desc.h>
@@ -1243,6 +1245,17 @@ void xen_evtchn_do_upcall(struct pt_regs *regs)
[RFC v3 0/2] x86/xen: add xen hypercall preemption
From: Luis R. Rodriguez <mcg...@suse.com>

After my last respin, Andy provided some ideas for how to skip IRQ
context hacks for preemption; this v3 spin addresses that and a bit
more.

This is based on both Andrew Cooper's and David Vrabel's work, further
modified based on ideas by Andy Lutomirski to avoid having to deal
with preemption in IRQ context.  Ian had originally suggested avoiding
the pt_regs stuff by using a CPU variable, but based on Andy's
observations it is difficult to prove we will avoid recursing or bad
nesting when dealing with preemption out of IRQ context.  This is
especially true given that after a hypercall gets preempted, the
hypercall may end up on another CPU.

This uses NOKPROBE_SYMBOL and notrace since, based on Andy's advice, I
am not confident that tracing and kprobes are safe to use in what
might be an extended RCU quiescent state (i.e. where we're outside
irq_enter and irq_exit).

I've tested this on 64-bit; some testing on 32-bit would be
appreciated.

Luis R. Rodriguez (2):
  x86/xen: add xen_is_preemptible_hypercall()
  x86/xen: allow privcmd hypercalls to be preempted

 arch/x86/include/asm/xen/hypercall.h | 20 ++++++++++++++++++++
 arch/x86/kernel/entry_32.S           |  2 ++
 arch/x86/kernel/entry_64.S           |  2 ++
 arch/x86/xen/enlighten.c             |  7 +++++++
 arch/x86/xen/xen-head.S              | 18 +++++++++++++++++-
 drivers/xen/events/events_base.c     | 13 +++++++++++++
 include/xen/events.h                 |  1 +
 7 files changed, 62 insertions(+), 1 deletion(-)

-- 
2.1.1
Re: [RFC v3 1/2] x86/xen: add xen_is_preemptible_hypercall()
On Wed, Jan 21, 2015 at 6:17 PM, Luis R. Rodriguez mcg...@do-not-panic.com wrote: From: Luis R. Rodriguez mcg...@suse.com On kernels with voluntary or no preemption we can run into situations where a hypercall issued through userspace will linger around as it addresses sub-operations in kernel context (multicalls). Such operations can trigger soft lockup detection. We want a way to let the kernel voluntarily preempt such calls even on non-preempt kernels; to do this we first need to distinguish which hypercalls fall under this category. This implements xen_is_preemptible_hypercall() which lets us do just that by adding a secondary hypercall page; calls made via the new page may be preempted. Andrew had originally submitted a version of this work [0]. [0] http://lists.xen.org/archives/html/xen-devel/2014-02/msg01056.html Based on original work by: Andrew Cooper andrew.coop...@citrix.com Cc: Andy Lutomirski l...@amacapital.net Cc: Borislav Petkov b...@suse.de Cc: David Vrabel david.vra...@citrix.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: Steven Rostedt rost...@goodmis.org Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com Cc: Jan Beulich jbeul...@suse.com Cc: linux-ker...@vger.kernel.org Signed-off-by: Luis R.
Rodriguez mcg...@suse.com --- arch/x86/include/asm/xen/hypercall.h | 20 arch/x86/xen/enlighten.c | 7 +++ arch/x86/xen/xen-head.S | 18 +- 3 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index ca08a27..221008e 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -84,6 +84,22 @@ extern struct { char _entry[32]; } hypercall_page[]; +#ifndef CONFIG_PREEMPT +extern struct { char _entry[32]; } preemptible_hypercall_page[]; A comment somewhere explaining why only non-preemptible kernels have preemptible hypercalls might be friendly to some future reader. :) + +static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs) +{ + return !user_mode_vm(regs) && + regs->ip >= (unsigned long)preemptible_hypercall_page && + regs->ip < (unsigned long)preemptible_hypercall_page + PAGE_SIZE; +} This makes it seem like the page is indeed one page long, but I don't see what actually allocates a whole page for it. What am I missing? --Andy
Re: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
On 21/01/2015 11:16, Wincy Van wrote: On Wed, Jan 21, 2015 at 4:35 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? Yes, indeed, I've not noticed that kvm handle nested msr bitmap manually, the next version will fix this. BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. Do you mean that we should split this patch from the apicv patch set? I think it's okay to keep it in the same patchset, but you can put it first. Paolo --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case 
EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. */ + return 1; The APIC_WRITE vmexit is introduced by APIC register virtualization, not virtualize x2apic mode. Move it to the next patch. Agreed, will do. Thanks, Wincy
Re: [Xen-devel] [PATCH v14 11/11] pvqspinlock, x86: Enable PV qspinlock for XEN
On 20/01/15 20:12, Waiman Long wrote: This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code. Xen is a word, please don't capitalize it. +void xen_lock_stats(int stat_types) +{ + if (stat_types PV_LOCKSTAT_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + if (stat_types PV_LOCKSTAT_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + if (stat_types PV_LOCKSTAT_KICK_NOHALT) + add_smp(kick_nohlt_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + if (stat_types PV_LOCKSTAT_HALT_ABORT) + add_smp(halt_abort_stats, 1); +} +PV_CALLEE_SAVE_REGS_THUNK(xen_lock_stats); This is not inlined and the 6 test-and-branch cannot be optimized away. +/* + * Halt the current CPU release it back to the host + * Return 0 if halted, -1 otherwise. + */ +int xen_halt_cpu(u8 *byte, u8 val) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + u64 start; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return -1; + + /* + * Make sure an interrupt handler can't upset things in a + * partially setup state. + */ + local_irq_save(flags); + start = spin_time_start(); + + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); It's not clear what partially setup state is being protected here. xen_clear_irq_pending() is an atomic bit clear. I think you can drop the irq save/restore here. + /* + * Don't halt if the content of the given byte address differs from + * the expected value. A read memory barrier is added to make sure that + * the latest value of the byte address is fetched. + */ + smp_rmb(); The atomic bit clear in xen_clear_irq_pending() acts as a full memory barrier. I don't think you need an additional memory barrier here, only a compiler one. I suggest using READ_ONCE(). 
+ if (*byte != val) { + xen_lock_stats(PV_LOCKSTAT_HALT_ABORT); + return -1; + } + /* + * If an interrupt happens here, it will leave the wakeup irq + * pending, which will cause xen_poll_irq() to return + * immediately. + */ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); + spin_time_accum_blocked(start); + return 0; +} +PV_CALLEE_SAVE_REGS_THUNK(xen_halt_cpu); + +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); David
Re: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
On 21/01/2015 10:23, Wincy Van wrote: Yes, moving those MSRs looks a bit ugly, but irqchip_in_kernel is per-VM, not a global setting; there would be different kernel_irqchip settings between VMs. If we use irqchip_in_kernel to check it and set different values of the ctl MSRs, I think it may be even worse than moving the MSRs, because this logic should be in an init function, and this setting should be converged. I too prefer your solution. Paolo
Re: [PATCH v16 00/10] KVM/arm/arm64/x86: dirty page logging for ARMv7/8 (3.18.0-rc2)
On Thu, Jan 15, 2015 at 03:58:51PM -0800, Mario Smarduch wrote: Patch series adds support for armv7/8 dirty page logging. As we move towards a generic dirty page logging interface we move some common code to a generic layer shared by x86, armv7 and armv8. armv7/8 dirty page logging implementation overview: - initially write-protects the memory region's 2nd stage page tables - reads the dirty page log and again write-protects dirty pages for the next pass. - second stage huge pages are dissolved into normal pages to keep track of dirty memory at page granularity. Tracking at huge page granularity limits the granularity of marking dirty memory, and limits migration to a light memory load. Small page size logging supports higher memory dirty rates and enables rapid migration. armv7 supports 2MB huge pages, and armv8 supports 2MB (4kb) and 512MB (64kb) - in the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time. Thanks, applied. -Christoffer
Re: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
On Wed, Jan 21, 2015 at 4:35 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? Yes, indeed, I've not noticed that kvm handle nested msr bitmap manually, the next version will fix this. BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. Do you mean that we should split this patch from the apicv patch set? --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. 
*/ + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. Agreed, will do. Thanks, Wincy
Re: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
On Wed, Jan 21, 2015 at 4:49 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it. Agreed, I had thought we had already set L1's IRR before this; I was wrong. BTW, I was trying to complete the nested posted intr manually if the dest vcpu is in_guest_mode but not IN_GUEST_MODE, but I found that it is difficult to set the RVI of the destination vcpu in a timely way, because we should keep RVI, PIR and ON in sync :( I think it is better to do a nested vmexit in the case above, rather than emulate it, because that case is much rarer than the hardware case. Thanks, Wincy.
[question] incremental backup a running vm
Hi, Does drive_mirror support incremental backup of a running VM? Or does another mechanism? Requirements for incremental backup of a running VM: the first time a backup is taken, all of the allocated data is mirrored to the destination, then a copied bitmap is saved to a file, and the bitmap file logs dirty state for the changed data. On the next backup, only the dirty data is mirrored to the destination. Even if the VM is shut down and started again after several days, the bitmap is loaded while starting the VM. Any ideas? Thanks, Zhang Haoyu
Re: [question] incremental backup a running vm
On 21/01/2015 11:32, Zhang Haoyu wrote: Hi, Does drive_mirror support incremental backup a running vm? Or other mechanism does? incremental backup a running vm requirements: First time backup, all of the allocated data will be mirrored to destination, then a copied bitmap will be saved to a file, then the bitmap file will log dirty for the changed data. Next time backup, only the dirty data will be mirrored to destination. Even the VM shutdown and start after several days, the bitmap will be loaded while starting vm. Any ideas? Drive-mirror is for storage migration. For backup there is another job, drive-backup. drive-backup copies a point-in-time snapshot of one or more disks corresponding to when the backup was started. Incremental backup is being worked on. You can see patches on the list. Paolo
RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
Wincy Van wrote on 2015-01-16: To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt and should set the posted intr bit in nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we cannot enable nested posted interrupt, and the posted intr bit in nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl MSR to per-vcpu. Best regards, Yang
Re: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. Thanks, Wincy
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-21: On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12->posted_intr_nv && + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu->mode == IN_GUEST_MODE) + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* + * if posted intr is done by hardware, the + * corresponding eoi was sent to L0. Thus + * we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12->posted_intr_nv); Why is this necessary? As your comments mention, it is done by hardware, not L1, so why should L1 be aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to the LAPIC. If the posted intr is done by hardware, the processor will send the EOI to the hardware LAPIC, not L1's, just like the non-nested case (the physical interrupt is dismissed). So we should take care of L1's LAPIC and send an EOI to it. No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it. Thanks, Wincy Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-20: If vcpu has a interrupt in vmx non-root mode, we will kick that vcpu to inject interrupt timely. With posted interrupt processing, the kick intr is not needed, and interrupts are fully taken care of by hardware. In nested vmx, this feature avoids much more vmexits than non-nested vmx. This patch use L0's POSTED_INTR_NV to avoid unexpected interrupt if L1's vector is different with L0's. If vcpu is in hardware's non-root mode, we use a physical ipi to deliver posted interrupts, otherwise we will deliver that interrupt to L1 and kick that vcpu out of nested non-root mode. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 136 ++-- 1 files changed, 132 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ea56e9f..cda9133 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -215,6 +215,7 @@ struct __packed vmcs12 { u64 tsc_offset; u64 virtual_apic_page_addr; u64 apic_access_addr; + u64 posted_intr_desc_addr; u64 ept_pointer; u64 eoi_exit_bitmap0; u64 eoi_exit_bitmap1; @@ -334,6 +335,7 @@ struct __packed vmcs12 { u32 vmx_preemption_timer_value; u32 padding32[7]; /* room for future expansion */ u16 virtual_processor_id; + u16 posted_intr_nv; u16 guest_es_selector; u16 guest_cs_selector; u16 guest_ss_selector; @@ -387,6 +389,7 @@ struct nested_vmx { /* The host-usable pointer to the above */ struct page *current_vmcs12_page; struct vmcs12 *current_vmcs12; + spinlock_t vmcs12_lock; struct vmcs *current_shadow_vmcs; /* * Indicates if the shadow vmcs must be updated with the @@ -406,6 +409,8 @@ struct nested_vmx { */ struct page *apic_access_page; struct page *virtual_apic_page; + struct page *pi_desc_page; + struct pi_desc *pi_desc; u64 msr_ia32_feature_control; struct hrtimer preemption_timer; @@ -621,6 +626,7 @@ static int max_shadow_read_write_fields = static const unsigned short vmcs_field_to_offset_table[] = { FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), + 
FIELD(POSTED_INTR_NV, posted_intr_nv), FIELD(GUEST_ES_SELECTOR, guest_es_selector), FIELD(GUEST_CS_SELECTOR, guest_cs_selector), FIELD(GUEST_SS_SELECTOR, guest_ss_selector), @@ -646,6 +652,7 @@ static const unsigned short vmcs_field_to_offset_table[] = { FIELD64(TSC_OFFSET, tsc_offset), FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr), FIELD64(APIC_ACCESS_ADDR, apic_access_addr), + FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr), FIELD64(EPT_POINTER, ept_pointer), FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), @@ -798,6 +805,7 @@ static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); static bool vmx_mpx_supported(void); static bool vmx_xsaves_supported(void); +static int vmx_vm_has_apicv(struct kvm *kvm); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12) return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); } +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) { + return vmcs12-pin_based_vm_exec_control +PIN_BASED_POSTED_INTR; } + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2362,6 +2375,9 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | PIN_BASED_VMX_PREEMPTION_TIMER; + if (vmx_vm_has_apicv(vmx-vcpu.kvm)) + vmx-nested.nested_vmx_pinbased_ctls_high |= + PIN_BASED_POSTED_INTR; /* exit controls */ rdmsr(MSR_IA32_VMX_EXIT_CTLS, @@ -4267,6 +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm) return enable_apicv irqchip_in_kernel(kvm); } +static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, + int vector) { + int r = 0; + struct vmcs12 *vmcs12; + + /* +* Since posted intr delivery is async, +* 
we must aquire a spin-lock to avoid +* the race of vmcs12. +*/ + spin_lock(to_vmx(vcpu)-nested.vmcs12_lock); + vmcs12 = get_vmcs12(vcpu); + if (!is_guest_mode(vcpu) || !vmcs12) { + r = -1; + goto out; + } + if (vector == vmcs12-posted_intr_nv +
RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. */ + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. case EXIT_REASON_EPT_VIOLATION: /* * L0 always deals with the EPT violation. 
If nested EPT is @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu) ns_to_ktime(preemption_timeout), HRTIMER_MODE_REL); } +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) + return -EINVAL; + return 0; +} + +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + int r; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return 0; + + r = nested_vmx_check_virt_x2apic(vcpu, vmcs12); + if (r) + goto fail; + + /* tpr shadow is needed by all apicv features. */ + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { + r = -EINVAL; + goto fail; + } + + return 0; + +fail: + return r; +} + static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, unsigned long count_field, unsigned long addr_field, @@ -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) else vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(vmx-nested.apic_access_page)); - } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) + + (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm))) { exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; kvm_vcpu_reload_apic_access_page(vcpu); @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; } + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); + return 1; + } + if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; -- 1.7.1 Best regards, Yang
Re: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
On Wed, Jan 21, 2015 at 4:18 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-16: To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt and should set the posted intr bit in nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we cannot enable nested posted interrupt, and the posted intr bit in nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl MSR to per-vcpu. Yes, moving those MSRs looks a bit ugly, but irqchip_in_kernel is per-VM, not a global setting; there would be different kernel_irqchip settings between VMs. If we use irqchip_in_kernel to check it and set different values of the ctl MSRs, I think it may be even worse than moving the MSRs, because this logic should be in an init function, and this setting should be converged. Thanks, Wincy
Re: [PATCH v14 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled
On 01/21/2015 01:42 AM, Waiman Long wrote: This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- Reviewed-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
[PATCH v12 12/18] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
From: Antonios Motakis <a.mota...@virtualopensystems.com>

We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++++++++++++++++--------------
 drivers/vfio/pci/vfio_pci_private.h |  4 ++--
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index e8d695b..0a41833d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -126,10 +126,10 @@ static void virqfd_inject(struct work_struct *work)
 	virqfd->thread(virqfd->vdev, virqfd->data);
 }
 
-static int virqfd_enable(struct vfio_pci_device *vdev,
-			 int (*handler)(struct vfio_pci_device *, void *),
-			 void (*thread)(struct vfio_pci_device *, void *),
-			 void *data, struct virqfd **pvirqfd, int fd)
+int vfio_virqfd_enable(struct vfio_pci_device *vdev,
+		       int (*handler)(struct vfio_pci_device *, void *),
+		       void (*thread)(struct vfio_pci_device *, void *),
+		       void *data, struct virqfd **pvirqfd, int fd)
 {
 	struct fd irqfd;
 	struct eventfd_ctx *ctx;
@@ -215,9 +215,9 @@ err_fd:
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_enable);
 
-static void virqfd_disable(struct vfio_pci_device *vdev,
-			   struct virqfd **pvirqfd)
+void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd)
 {
 	unsigned long flags;
 
@@ -237,6 +237,7 @@ static void virqfd_disable(struct vfio_pci_device *vdev,
 	 */
 	flush_workqueue(vfio_irqfd_cleanup_wq);
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_disable);
 
 /*
  * INTx
@@ -440,8 +441,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 static void vfio_intx_disable(struct vfio_pci_device *vdev)
 {
 	vfio_intx_set_signal(vdev, -1);
-	virqfd_disable(vdev, &vdev->ctx[0].unmask);
-	virqfd_disable(vdev, &vdev->ctx[0].mask);
+	vfio_virqfd_disable(vdev, &vdev->ctx[0].unmask);
+	vfio_virqfd_disable(vdev, &vdev->ctx[0].mask);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
 	kfree(vdev->ctx);
@@ -605,8 +606,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
 
 	for (i = 0; i < vdev->num_ctx; i++) {
-		virqfd_disable(vdev, &vdev->ctx[i].unmask);
-		virqfd_disable(vdev, &vdev->ctx[i].mask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[i].unmask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[i].mask);
 	}
 
 	if (msix) {
@@ -639,11 +640,12 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
 	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
 		int32_t fd = *(int32_t *)data;
 
 		if (fd >= 0)
-			return virqfd_enable(vdev, vfio_pci_intx_unmask_handler,
-					     vfio_send_intx_eventfd, NULL,
-					     &vdev->ctx[0].unmask, fd);
+			return vfio_virqfd_enable(vdev,
+						  vfio_pci_intx_unmask_handler,
+						  vfio_send_intx_eventfd, NULL,
+						  &vdev->ctx[0].unmask, fd);
 
-		virqfd_disable(vdev, &vdev->ctx[0].unmask);
+		vfio_virqfd_disable(vdev, &vdev->ctx[0].unmask);
 	}
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 671c17a..2e2f0ea 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
 extern int vfio_pci_init_perm_bits(void);
 extern void vfio_pci_uninit_perm_bits(void);
 
-extern int vfio_pci_virqfd_init(void);
-extern void vfio_pci_virqfd_exit(void);
+extern int vfio_virqfd_init(void);
+extern void vfio_virqfd_exit(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
-- 
2.2.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v12 17/18] vfio: initialize the virqfd workqueue in VFIO generic code
From: Antonios Motakis <a.mota...@virtualopensystems.com>

virqfd is now completely decoupled from VFIO_PCI, so we can initialize
it from the generic VFIO code and safely use it from multiple
independent VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/Makefile       | 4 +++-
 drivers/vfio/pci/Makefile   | 3 +--
 drivers/vfio/pci/vfio_pci.c | 8 --------
 drivers/vfio/vfio.c         | 8 ++++++++
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index dadf0ca..d798b09 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,6 @@
-obj-$(CONFIG_VFIO) += vfio.o
+vfio_core-y := vfio.o virqfd.o
+
+obj-$(CONFIG_VFIO) += vfio_core.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index c7c8644..1310792 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,4 @@
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \
-	      ../virqfd.o
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fc4308c..8d156d7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,6 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
-	vfio_virqfd_exit();
 	vfio_pci_uninit_perm_bits();
 }
 
@@ -1025,11 +1024,6 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		return ret;
 
-	/* Start the virqfd cleanup handler */
-	ret = vfio_virqfd_init();
-	if (ret)
-		goto out_virqfd;
-
 	/* Register and scan for devices */
 	ret = pci_register_driver(&vfio_pci_driver);
 	if (ret)
@@ -1038,8 +1032,6 @@ static int __init vfio_pci_init(void)
 	return 0;
 
 out_driver:
-	vfio_virqfd_exit();
-out_virqfd:
 	vfio_pci_uninit_perm_bits();
 	return ret;
 }
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index f018d8d..8e84471 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1464,6 +1464,11 @@ static int __init vfio_init(void)
 	if (ret)
 		goto err_cdev_add;
 
+	/* Start the virqfd cleanup handler used by some VFIO bus drivers */
+	ret = vfio_virqfd_init();
+	if (ret)
+		goto err_virqfd;
+
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
 	/*
@@ -1476,6 +1481,8 @@ static int __init vfio_init(void)
 
 	return 0;
 
+err_virqfd:
+	cdev_del(vfio.group_cdev);
 err_cdev_add:
 	unregister_chrdev_region(vfio.group_devt, MINORMASK);
 err_alloc_chrdev:
@@ -1490,6 +1497,7 @@ static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
 
+	vfio_virqfd_exit();
 	idr_destroy(&vfio.group_idr);
 	cdev_del(vfio.group_cdev);
 	unregister_chrdev_region(vfio.group_devt, MINORMASK);
-- 
2.2.2
[PATCH v12 18/18] vfio/platform: implement IRQ masking/unmasking via an eventfd
From: Antonios Motakis <a.mota...@virtualopensystems.com>

With this patch, the VFIO user can set an eventfd that is used to mask
and unmask IRQs of platform devices.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
 drivers/vfio/platform/vfio_platform_irq.c     | 47 +++++++++++++++++++++++----
 drivers/vfio/platform/vfio_platform_private.h |  2 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index e0e6388..88bba57 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -37,6 +37,15 @@ static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx)
 	spin_unlock_irqrestore(&irq_ctx->lock, flags);
 }
 
+static int vfio_platform_mask_handler(void *opaque, void *unused)
+{
+	struct vfio_platform_irq *irq_ctx = opaque;
+
+	vfio_platform_mask(irq_ctx);
+
+	return 0;
+}
+
 static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
 				      unsigned index, unsigned start,
 				      unsigned count, uint32_t flags,
@@ -48,8 +57,18 @@ static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
 	if (!(vdev->irqs[index].flags & VFIO_IRQ_INFO_MASKABLE))
 		return -EINVAL;
 
-	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
-		return -EINVAL; /* not implemented yet */
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+
+		if (fd >= 0)
+			return vfio_virqfd_enable((void *) &vdev->irqs[index],
+						  vfio_platform_mask_handler,
+						  NULL, NULL,
+						  &vdev->irqs[index].mask, fd);
+
+		vfio_virqfd_disable(&vdev->irqs[index].mask);
+		return 0;
+	}
 
 	if (flags & VFIO_IRQ_SET_DATA_NONE) {
 		vfio_platform_mask(&vdev->irqs[index]);
@@ -78,6 +97,15 @@ static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx)
 	spin_unlock_irqrestore(&irq_ctx->lock, flags);
 }
 
+static int vfio_platform_unmask_handler(void *opaque, void *unused)
+{
+	struct vfio_platform_irq *irq_ctx = opaque;
+
+	vfio_platform_unmask(irq_ctx);
+
+	return 0;
+}
+
 static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
 					unsigned index, unsigned start,
 					unsigned count, uint32_t flags,
@@ -89,8 +117,19 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
 	if (!(vdev->irqs[index].flags & VFIO_IRQ_INFO_MASKABLE))
 		return -EINVAL;
 
-	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
-		return -EINVAL; /* not implemented yet */
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+
+		if (fd >= 0)
+			return vfio_virqfd_enable((void *) &vdev->irqs[index],
+						  vfio_platform_unmask_handler,
+						  NULL, NULL,
+						  &vdev->irqs[index].unmask,
+						  fd);
+
+		vfio_virqfd_disable(&vdev->irqs[index].unmask);
+		return 0;
+	}
 
 	if (flags & VFIO_IRQ_SET_DATA_NONE) {
 		vfio_platform_unmask(&vdev->irqs[index]);
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index ff2db1d..5d31e04 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -35,6 +35,8 @@ struct vfio_platform_irq {
 	struct eventfd_ctx	*trigger;
 	bool			masked;
 	spinlock_t		lock;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
 };
 
 struct vfio_platform_region {
-- 
2.2.2
[PATCH v12 08/18] vfio/platform: return IRQ info
From: Antonios Motakis a.mota...@virtualopensystems.com Return information for the interrupts exposed by the device. This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs and enables VFIO_DEVICE_GET_IRQ_INFO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/Makefile| 2 +- drivers/vfio/platform/vfio_platform_common.c | 31 +--- drivers/vfio/platform/vfio_platform_irq.c | 51 +++ drivers/vfio/platform/vfio_platform_private.h | 10 ++ 4 files changed, 89 insertions(+), 5 deletions(-) create mode 100644 drivers/vfio/platform/vfio_platform_irq.c diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile index 279862b..c6316cc 100644 --- a/drivers/vfio/platform/Makefile +++ b/drivers/vfio/platform/Makefile @@ -1,4 +1,4 @@ -vfio-platform-y := vfio_platform.o vfio_platform_common.o +vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 6bf78ee..cf7bb08 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -100,6 +100,7 @@ static void vfio_platform_release(void *device_data) if (!(--vdev-refcnt)) { vfio_platform_regions_cleanup(vdev); + vfio_platform_irq_cleanup(vdev); } mutex_unlock(driver_lock); @@ -121,6 +122,10 @@ static int vfio_platform_open(void *device_data) ret = vfio_platform_regions_init(vdev); if (ret) goto err_reg; + + ret = vfio_platform_irq_init(vdev); + if (ret) + goto err_irq; } vdev-refcnt++; @@ -128,6 +133,8 @@ static int vfio_platform_open(void *device_data) mutex_unlock(driver_lock); return 0; +err_irq: + vfio_platform_regions_cleanup(vdev); err_reg: mutex_unlock(driver_lock); module_put(THIS_MODULE); @@ -153,7 +160,7 @@ static long vfio_platform_ioctl(void *device_data, info.flags = vdev-flags; info.num_regions = vdev-num_regions; - info.num_irqs = 0; 
+ info.num_irqs = vdev-num_irqs; return copy_to_user((void __user *)arg, info, minsz); @@ -178,10 +185,26 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) { + struct vfio_irq_info info; + + minsz = offsetofend(struct vfio_irq_info, count); + + if (copy_from_user(info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz minsz) + return -EINVAL; + + if (info.index = vdev-num_irqs) + return -EINVAL; + + info.flags = vdev-irqs[info.index].flags; + info.count = vdev-irqs[info.index].count; + + return copy_to_user((void __user *)arg, info, minsz); - else if (cmd == VFIO_DEVICE_SET_IRQS) + } else if (cmd == VFIO_DEVICE_SET_IRQS) return -EINVAL; else if (cmd == VFIO_DEVICE_RESET) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c new file mode 100644 index 000..c6c3ec1 --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -0,0 +1,51 @@ +/* + * VFIO platform devices interrupt handling + * + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */
+
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/irq.h>
+
+#include "vfio_platform_private.h"
+
+int vfio_platform_irq_init(struct vfio_platform_device *vdev)
+{
+	int cnt = 0, i;
+
+	while (vdev->get_irq(vdev, cnt) >= 0)
+		cnt++;
+
+	vdev->irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq), GFP_KERNEL);
+	if (!vdev->irqs)
+		return -ENOMEM;
+
+	for (i = 0; i < cnt; i++) {
+		vdev->irqs[i].flags = 0;
+		vdev->irqs[i].count = 1;
+	}
+
+	vdev->num_irqs = cnt;
+
[PATCH v12 10/18] vfio/platform: trigger an interrupt via eventfd
From: Antonios Motakis a.mota...@virtualopensystems.com This patch allows to set an eventfd for a platform device's interrupt, and also to trigger the interrupt eventfd from userspace for testing. Level sensitive interrupts are marked as maskable and are handled in a later patch. Edge triggered interrupts are not advertised as maskable and are implemented here using a simple and efficient IRQ handler. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com [Baptiste Reynal: fix masked interrupt initialization] Signed-off-by: Baptiste Reynal b.rey...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 98 ++- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 98 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index df5c919..4b1ee22 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -39,12 +39,96 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, return -EINVAL; } +static irqreturn_t vfio_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + + eventfd_signal(irq_ctx-trigger, 1); + + return IRQ_HANDLED; +} + +static int vfio_set_trigger(struct vfio_platform_device *vdev, int index, + int fd, irq_handler_t handler) +{ + struct vfio_platform_irq *irq = vdev-irqs[index]; + struct eventfd_ctx *trigger; + int ret; + + if (irq-trigger) { + free_irq(irq-hwirq, irq); + kfree(irq-name); + eventfd_ctx_put(irq-trigger); + irq-trigger = NULL; + } + + if (fd 0) /* Disable only */ + return 0; + + irq-name = kasprintf(GFP_KERNEL, vfio-irq[%d](%s), + irq-hwirq, vdev-name); + if (!irq-name) + return -ENOMEM; + + trigger = eventfd_ctx_fdget(fd); + if (IS_ERR(trigger)) { + kfree(irq-name); + return PTR_ERR(trigger); + } + + irq-trigger = trigger; + + irq_set_status_flags(irq-hwirq, IRQ_NOAUTOEN); + ret = request_irq(irq-hwirq, handler, 0, irq-name, 
irq); + if (ret) { + kfree(irq-name); + eventfd_ctx_put(trigger); + irq-trigger = NULL; + return ret; + } + + if (!irq-masked) + enable_irq(irq-hwirq); + + return 0; +} + static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + struct vfio_platform_irq *irq = vdev-irqs[index]; + irq_handler_t handler; + + if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) + return -EINVAL; /* not implemented */ + else + handler = vfio_irq_handler; + + if (!count (flags VFIO_IRQ_SET_DATA_NONE)) + return vfio_set_trigger(vdev, index, -1, handler); + + if (start != 0 || count != 1) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) { + int32_t fd = *(int32_t *)data; + + return vfio_set_trigger(vdev, index, fd, handler); + } + + if (flags VFIO_IRQ_SET_DATA_NONE) { + handler(irq-hwirq, irq); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t trigger = *(uint8_t *)data; + + if (trigger) + handler(irq-hwirq, irq); + } + + return 0; } int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, @@ -90,7 +174,12 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; - vdev-irqs[i].flags = 0; + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; + + if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) + vdev-irqs[i].flags |= VFIO_IRQ_INFO_MASKABLE + | VFIO_IRQ_INFO_AUTOMASKED; + vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; } @@ -105,6 +194,11 @@ err: void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_irqs; i++) + vfio_set_trigger(vdev, i, -1, NULL); + vdev-num_irqs = 0; kfree(vdev-irqs); } diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index b119a6c..aa01cc3 100644 --- a/drivers/vfio/platform/vfio_platform_private.h +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -31,6
[PATCH v12 02/18] vfio: platform: probe to devices on the platform bus
From: Antonios Motakis a.mota...@virtualopensystems.com Driver to bind to Linux platform devices, and callbacks to discover their resources to be used by the main VFIO PLATFORM code. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform.c | 103 ++ include/uapi/linux/vfio.h | 1 + 2 files changed, 104 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform.c diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c new file mode 100644 index 000..cef645c --- /dev/null +++ b/drivers/vfio/platform/vfio_platform.c @@ -0,0 +1,103 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include linux/module.h +#include linux/slab.h +#include linux/vfio.h +#include linux/platform_device.h + +#include vfio_platform_private.h + +#define DRIVER_VERSION 0.10 +#define DRIVER_AUTHOR Antonios Motakis a.mota...@virtualopensystems.com +#define DRIVER_DESC VFIO for platform devices - User Level meta-driver + +/* probing devices from the linux platform bus */ + +static struct resource *get_platform_resource(struct vfio_platform_device *vdev, + int num) +{ + struct platform_device *dev = (struct platform_device *) vdev-opaque; + int i; + + for (i = 0; i dev-num_resources; i++) { + struct resource *r = dev-resource[i]; + + if (resource_type(r) (IORESOURCE_MEM|IORESOURCE_IO)) { + if (!num) + return r; + + num--; + } + } + return NULL; +} + +static int get_platform_irq(struct vfio_platform_device *vdev, int i) +{ + struct platform_device *pdev = (struct platform_device *) vdev-opaque; + + return platform_get_irq(pdev, i); +} + +static int vfio_platform_probe(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + int ret; + + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) + return -ENOMEM; + + vdev-opaque = (void *) pdev; + vdev-name = pdev-name; + vdev-flags = VFIO_DEVICE_FLAGS_PLATFORM; + vdev-get_resource = get_platform_resource; + vdev-get_irq = get_platform_irq; + + ret = vfio_platform_probe_common(vdev, pdev-dev); + if (ret) + kfree(vdev); + + return ret; +} + +static int vfio_platform_remove(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_platform_remove_common(pdev-dev); + if (vdev) { + kfree(vdev); + return 0; + } + + return -EINVAL; +} + +static struct platform_driver vfio_platform_driver = { + .probe = vfio_platform_probe, + .remove = vfio_platform_remove, + .driver = { + .name = vfio-platform, + .owner = THIS_MODULE, + }, +}; + +module_platform_driver(vfio_platform_driver); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE(GPL v2); +MODULE_AUTHOR(DRIVER_AUTHOR); 
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9ade02b..4e93a97 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -159,6 +159,7 @@ struct vfio_device_info {
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
2.2.2
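For testing, a platform device can be handed to this meta-driver through sysfs. A hedged sketch, assuming a kernel with the platform-bus `driver_override` attribute; the device name (`fff51000.ethernet`) and IOMMU group number (`5`) are placeholders — substitute your own:

```shell
# Tell the platform bus to match this device against vfio-platform
# (device name is hypothetical; pick one from /sys/bus/platform/devices).
echo vfio-platform > /sys/bus/platform/devices/fff51000.ethernet/driver_override

# Unbind from the current driver, then rebind so vfio-platform picks it up.
echo fff51000.ethernet > /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet > /sys/bus/platform/drivers/vfio-platform/bind

# The device's IOMMU group (assumed to be 5 here) now appears under
# /dev/vfio and can be opened by a VFIO userspace driver.
ls -l /dev/vfio/5
```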
[PATCH v12 05/18] vfio/platform: return info for device memory mapped IO regions
From: Antonios Motakis a.mota...@virtualopensystems.com This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call, which allows the user to learn about the available MMIO resources of a device. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 106 +- drivers/vfio/platform/vfio_platform_private.h | 22 ++ 2 files changed, 124 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 862b43b..2a4613c 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -22,17 +22,97 @@ #include vfio_platform_private.h +static DEFINE_MUTEX(driver_lock); + +static int vfio_platform_regions_init(struct vfio_platform_device *vdev) +{ + int cnt = 0, i; + + while (vdev-get_resource(vdev, cnt)) + cnt++; + + vdev-regions = kcalloc(cnt, sizeof(struct vfio_platform_region), + GFP_KERNEL); + if (!vdev-regions) + return -ENOMEM; + + for (i = 0; i cnt; i++) { + struct resource *res = + vdev-get_resource(vdev, i); + + if (!res) + goto err; + + vdev-regions[i].addr = res-start; + vdev-regions[i].size = resource_size(res); + vdev-regions[i].flags = 0; + + switch (resource_type(res)) { + case IORESOURCE_MEM: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + break; + case IORESOURCE_IO: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; + break; + default: + goto err; + } + } + + vdev-num_regions = cnt; + + return 0; +err: + kfree(vdev-regions); + return -EINVAL; +} + +static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) +{ + vdev-num_regions = 0; + kfree(vdev-regions); +} + static void vfio_platform_release(void *device_data) { + struct vfio_platform_device *vdev = device_data; + + mutex_lock(driver_lock); + + if (!(--vdev-refcnt)) { + vfio_platform_regions_cleanup(vdev); + } + + mutex_unlock(driver_lock); + module_put(THIS_MODULE); } static int 
vfio_platform_open(void *device_data)
 {
+	struct vfio_platform_device *vdev = device_data;
+	int ret;
+
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
+	mutex_lock(&driver_lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_platform_regions_init(vdev);
+		if (ret)
+			goto err_reg;
+	}
+
+	vdev->refcnt++;
+
+	mutex_unlock(&driver_lock);
 	return 0;
+
+err_reg:
+	mutex_unlock(&driver_lock);
+	module_put(THIS_MODULE);
+	return ret;
 }
 
 static long vfio_platform_ioctl(void *device_data,
@@ -53,15 +133,33 @@ static long vfio_platform_ioctl(void *device_data,
 			return -EINVAL;
 
 		info.flags = vdev->flags;
-		info.num_regions = 0;
+		info.num_regions = vdev->num_regions;
 		info.num_irqs = 0;
 
 		return copy_to_user((void __user *)arg, &info, minsz);
 
-	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
-		return -EINVAL;
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index >= vdev->num_regions)
+			return -EINVAL;
+
+		/* map offset to the physical address */
+		info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index);
+		info.size = vdev->regions[info.index].size;
+		info.flags = vdev->regions[info.index].flags;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
 
-	else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
 		return -EINVAL;
 
 	else if (cmd == VFIO_DEVICE_SET_IRQS)
diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h
index c046988..3551f6d 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -18,7 +18,29 @@
 #include <linux/types.h>
 #include <linux/interrupt.h>
 
+#define VFIO_PLATFORM_OFFSET_SHIFT	40
+#define VFIO_PLATFORM_OFFSET_MASK (((u64)(1) << VFIO_PLATFORM_OFFSET_SHIFT) - 1)
+
[PATCH v12 06/18] vfio/platform: read and write support for the device fd
From: Antonios Motakis a.mota...@virtualopensystems.com VFIO returns a file descriptor which we can use to manipulate the memory regions of the device. Usually, the user will mmap memory regions that are addressable on page boundaries, however for memory regions where this is not the case we cannot provide mmap functionality due to security concerns. For this reason we also allow to use read and write functions to the file descriptor pointing to the memory regions. We implement this functionality only for MMIO regions of platform devices; PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 150 ++ drivers/vfio/platform/vfio_platform_private.h | 1 + 2 files changed, 151 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 2a4613c..fda4c30 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -50,6 +50,10 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) switch (resource_type(res)) { case IORESOURCE_MEM: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_READ; + if (!(res-flags IORESOURCE_READONLY)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_WRITE; break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -69,6 +73,11 @@ err: static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_regions; i++) + iounmap(vdev-regions[i].ioaddr); + vdev-num_regions = 0; kfree(vdev-regions); } @@ -171,15 +180,156 @@ static long vfio_platform_ioctl(void *device_data, return -ENOTTY; } +static ssize_t vfio_platform_read_mmio(struct vfio_platform_region reg, + char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + 
ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t filled; + + if (count = 4 !(off % 4)) { + u32 val; + + val = ioread32(reg.ioaddr + off); + if (copy_to_user(buf, val, 4)) + goto err; + + filled = 4; + } else if (count = 2 !(off % 2)) { + u16 val; + + val = ioread16(reg.ioaddr + off); + if (copy_to_user(buf, val, 2)) + goto err; + + filled = 2; + } else { + u8 val; + + val = ioread8(reg.ioaddr + off); + if (copy_to_user(buf, val, 1)) + goto err; + + filled = 1; + } + + + count -= filled; + done += filled; + off += filled; + buf += filled; + } + + return done; +err: + return -EFAULT; +} + static ssize_t vfio_platform_read(void *device_data, char __user *buf, size_t count, loff_t *ppos) { + struct vfio_platform_device *vdev = device_data; + unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos); + loff_t off = *ppos VFIO_PLATFORM_OFFSET_MASK; + + if (index = vdev-num_regions) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ)) + return -EINVAL; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_read_mmio(vdev-regions[index], + buf, count, off); + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } +static ssize_t vfio_platform_write_mmio(struct vfio_platform_region reg, + const char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t
[PATCH v12 01/18] vfio/platform: initial skeleton of VFIO support for platform devices
From: Antonios Motakis a.mota...@virtualopensystems.com This patch forms the common skeleton code for platform devices support with VFIO. This will include the core functionality of VFIO_PLATFORM, however binding to the device and discovering the device resources will be done with the help of a separate file where any Linux platform bus specific code will reside. This will allow us to implement support for also discovering AMBA devices and their resources, but still reuse a large part of the VFIO_PLATFORM implementation. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com [Baptiste Reynal: added includes in vfio_platform_private.h] Signed-off-by: Baptiste Reynal b.rey...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 121 ++ drivers/vfio/platform/vfio_platform_private.h | 39 + 2 files changed, 160 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform_common.c create mode 100644 drivers/vfio/platform/vfio_platform_private.h diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c new file mode 100644 index 000..34d023b --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -0,0 +1,121 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include linux/device.h +#include linux/iommu.h +#include linux/module.h +#include linux/mutex.h +#include linux/slab.h +#include linux/types.h +#include linux/vfio.h + +#include vfio_platform_private.h + +static void vfio_platform_release(void *device_data) +{ + module_put(THIS_MODULE); +} + +static int vfio_platform_open(void *device_data) +{ + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + return 0; +} + +static long vfio_platform_ioctl(void *device_data, + unsigned int cmd, unsigned long arg) +{ + if (cmd == VFIO_DEVICE_GET_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_REGION_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_SET_IRQS) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_RESET) + return -EINVAL; + + return -ENOTTY; +} + +static ssize_t vfio_platform_read(void *device_data, char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static ssize_t vfio_platform_write(void *device_data, const char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) +{ + return -EINVAL; +} + +static const struct vfio_device_ops vfio_platform_ops = { + .name = vfio-platform, + .open = vfio_platform_open, + .release= vfio_platform_release, + .ioctl = vfio_platform_ioctl, + .read = vfio_platform_read, + .write = vfio_platform_write, + .mmap = vfio_platform_mmap, +}; + +int vfio_platform_probe_common(struct vfio_platform_device *vdev, + struct device *dev) +{ + struct iommu_group *group; + int ret; + + if (!vdev) + return -EINVAL; + + group = iommu_group_get(dev); + if (!group) { + pr_err(VFIO: No IOMMU group for device %s\n, vdev-name); + return -EINVAL; + } + + ret = vfio_add_group_dev(dev, vfio_platform_ops, vdev); + if (ret) { + iommu_group_put(group); + return ret; + } + + return 0; +} +EXPORT_SYMBOL_GPL(vfio_platform_probe_common); + +struct 
vfio_platform_device *vfio_platform_remove_common(struct device *dev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_del_group_dev(dev); + if (vdev) + iommu_group_put(dev-iommu_group); + + return vdev; +} +EXPORT_SYMBOL_GPL(vfio_platform_remove_common); diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h new file mode 100644 index 000..c046988 --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -0,0 +1,39 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program
[PATCH v12 04/18] vfio/platform: return info for bound device
From: Antonios Motakis a.mota...@virtualopensystems.com

A VFIO userspace driver will start by opening the VFIO device that
corresponds to an IOMMU group, and will use the ioctl interface to get
the basic device info, such as number of memory regions and interrupts,
and their properties.

This patch enables the VFIO_DEVICE_GET_INFO ioctl call.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index 34d023b..862b43b 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -38,10 +38,27 @@ static int vfio_platform_open(void *device_data)
 static long vfio_platform_ioctl(void *device_data,
 				unsigned int cmd, unsigned long arg)
 {
-	if (cmd == VFIO_DEVICE_GET_INFO)
-		return -EINVAL;
+	struct vfio_platform_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = vdev->flags;
+		info.num_regions = 0;
+		info.num_irqs = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
 
-	else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
 		return -EINVAL;
 
 	else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
-- 
2.2.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
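The argsz handshake above is easy to get wrong, so here is a minimal sketch of the check the kernel performs: userspace must claim at least enough space for every field the kernel will write back. The struct layout mirrors `struct vfio_device_info` from `<linux/vfio.h>`; `check_argsz` is a hypothetical helper for illustration, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Field layout mirrors struct vfio_device_info in <linux/vfio.h>. */
struct vfio_device_info {
	unsigned int argsz;
	unsigned int flags;
	unsigned int num_regions;
	unsigned int num_irqs;
};

/* offsetofend() as used in the patch: offset of the byte just past
 * the given member. */
#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Sketch of the kernel-side validation: reject callers whose argsz
 * is smaller than the fields we intend to fill in. */
int check_argsz(const struct vfio_device_info *info)
{
	size_t minsz = offsetofend(struct vfio_device_info, num_irqs);

	return info->argsz < minsz ? -1 : 0;
}
```

This is why extending such a struct later stays backward compatible: new fields land past `minsz`, and old userspace that passes a smaller `argsz` keeps working.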
[PATCH v12 13/18] vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
From: Antonios Motakis a.mota...@virtualopensystems.com

The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not
really PCI specific, since we plan to reuse the virqfd code with more
VFIO drivers in addition to VFIO_PCI.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci.c       | 6 +++---
 drivers/vfio/pci/vfio_pci_intrs.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 9558da3..fc4308c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,7 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_virqfd_exit();
+	vfio_virqfd_exit();
 	vfio_pci_uninit_perm_bits();
 }
 
@@ -1026,7 +1026,7 @@ static int __init vfio_pci_init(void)
 		return ret;
 
 	/* Start the virqfd cleanup handler */
-	ret = vfio_pci_virqfd_init();
+	ret = vfio_virqfd_init();
 	if (ret)
 		goto out_virqfd;
 
@@ -1038,7 +1038,7 @@ static int __init vfio_pci_init(void)
 	return 0;
 
 out_driver:
-	vfio_pci_virqfd_exit();
+	vfio_virqfd_exit();
 out_virqfd:
 	vfio_pci_uninit_perm_bits();
 	return ret;

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 0a41833d..a5378d5 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -45,7 +45,7 @@ struct virqfd {
 
 static struct workqueue_struct *vfio_irqfd_cleanup_wq;
 
-int __init vfio_pci_virqfd_init(void)
+int __init vfio_virqfd_init(void)
 {
 	vfio_irqfd_cleanup_wq =
 		create_singlethread_workqueue("vfio-irqfd-cleanup");
@@ -55,7 +55,7 @@ int __init vfio_pci_virqfd_init(void)
 	return 0;
 }
 
-void vfio_pci_virqfd_exit(void)
+void vfio_virqfd_exit(void)
 {
 	destroy_workqueue(vfio_irqfd_cleanup_wq);
 }
-- 
2.2.2
[PATCH v12 07/18] vfio/platform: support MMAP of MMIO regions
From: Antonios Motakis a.mota...@virtualopensystems.com Allow to memory map the MMIO regions of the device so userspace can directly access them. PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 65 1 file changed, 65 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index fda4c30..6bf78ee 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -54,6 +54,16 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) if (!(res-flags IORESOURCE_READONLY)) vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_WRITE; + + /* +* Only regions addressed with PAGE granularity may be +* MMAPed securely. +*/ + if (!(vdev-regions[i].addr ~PAGE_MASK) + !(vdev-regions[i].size ~PAGE_MASK)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_MMAP; + break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -333,8 +343,63 @@ static ssize_t vfio_platform_write(void *device_data, const char __user *buf, return -EINVAL; } +static int vfio_platform_mmap_mmio(struct vfio_platform_region region, + struct vm_area_struct *vma) +{ + u64 req_len, pgoff, req_start; + + req_len = vma-vm_end - vma-vm_start; + pgoff = vma-vm_pgoff + ((1U (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1); + req_start = pgoff PAGE_SHIFT; + + if (region.size PAGE_SIZE || req_start + req_len region.size) + return -EINVAL; + + vma-vm_page_prot = pgprot_noncached(vma-vm_page_prot); + vma-vm_pgoff = (region.addr PAGE_SHIFT) + pgoff; + + return remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff, + req_len, vma-vm_page_prot); +} + static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) { + struct vfio_platform_device *vdev = device_data; + unsigned int index; + + index = vma-vm_pgoff (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT); + + if 
(vma-vm_end vma-vm_start) + return -EINVAL; + if (!(vma-vm_flags VM_SHARED)) + return -EINVAL; + if (index = vdev-num_regions) + return -EINVAL; + if (vma-vm_start ~PAGE_MASK) + return -EINVAL; + if (vma-vm_end ~PAGE_MASK) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_MMAP)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ) +(vma-vm_flags VM_READ)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_WRITE) +(vma-vm_flags VM_WRITE)) + return -EINVAL; + + vma-vm_private_data = vdev; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_mmap_mmio(vdev-regions[index], vma); + + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
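The mmap offset scheme above packs the region index into the high bits of the file offset and the offset within the region into the low bits. A small sketch of that arithmetic follows; `VFIO_PLATFORM_OFFSET_SHIFT = 40` is an assumption here (it matches the value the driver eventually used, but is not stated in this patch), and the helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
/* Assumed value of VFIO_PLATFORM_OFFSET_SHIFT; see lead-in. */
#define VFIO_PLATFORM_OFFSET_SHIFT 40

/* Region index: the high bits of vma->vm_pgoff, as in
 * vfio_platform_mmap(). */
unsigned int offset_to_index(uint64_t pgoff)
{
	return (unsigned int)(pgoff >> (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT));
}

/* Page offset within the region: the low bits of vma->vm_pgoff,
 * masked as in vfio_platform_mmap_mmio(). */
uint64_t offset_within_region(uint64_t pgoff)
{
	return pgoff & ((UINT64_C(1) << (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
}
```

Userspace computes the mmap offset the same way in reverse: `offset = (uint64_t)index << VFIO_PLATFORM_OFFSET_SHIFT`, then the kernel splits it back apart as above.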
[PATCH v12 16/18] vfio: move eventfd support code for VFIO_PCI to a separate file
From: Antonios Motakis a.mota...@virtualopensystems.com The virqfd functionality that is used by VFIO_PCI to implement interrupt masking and unmasking via an eventfd, is generic enough and can be reused by another driver. Move it to a separate file in order to allow the code to be shared. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_intrs.c | 215 drivers/vfio/pci/vfio_pci_private.h | 3 - drivers/vfio/virqfd.c | 213 +++ include/linux/vfio.h| 27 + 5 files changed, 242 insertions(+), 219 deletions(-) create mode 100644 drivers/vfio/virqfd.c diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 1310792..c7c8644 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,4 +1,5 @@ -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o +vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \ + ../virqfd.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 5b5fc23..de4befc 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -19,228 +19,13 @@ #include linux/msi.h #include linux/pci.h #include linux/file.h -#include linux/poll.h #include linux/vfio.h #include linux/wait.h -#include linux/workqueue.h #include linux/slab.h #include vfio_pci_private.h /* - * IRQfd - generic - */ -struct virqfd { - void*opaque; - struct eventfd_ctx *eventfd; - int (*handler)(void *, void *); - void(*thread)(void *, void *); - void*data; - struct work_struct inject; - wait_queue_twait; - poll_table pt; - struct work_struct shutdown; - struct virqfd **pvirqfd; -}; - -static struct workqueue_struct *vfio_irqfd_cleanup_wq; -DEFINE_SPINLOCK(virqfd_lock); - -int __init vfio_virqfd_init(void) -{ - vfio_irqfd_cleanup_wq = - create_singlethread_workqueue(vfio-irqfd-cleanup); - if (!vfio_irqfd_cleanup_wq) - return -ENOMEM; - - return 
0; -} - -void vfio_virqfd_exit(void) -{ - destroy_workqueue(vfio_irqfd_cleanup_wq); -} - -static void virqfd_deactivate(struct virqfd *virqfd) -{ - queue_work(vfio_irqfd_cleanup_wq, virqfd-shutdown); -} - -static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) -{ - struct virqfd *virqfd = container_of(wait, struct virqfd, wait); - unsigned long flags = (unsigned long)key; - - if (flags POLLIN) { - /* An event has been signaled, call function */ - if ((!virqfd-handler || -virqfd-handler(virqfd-opaque, virqfd-data)) - virqfd-thread) - schedule_work(virqfd-inject); - } - - if (flags POLLHUP) { - unsigned long flags; - spin_lock_irqsave(virqfd_lock, flags); - - /* -* The eventfd is closing, if the virqfd has not yet been -* queued for release, as determined by testing whether the -* virqfd pointer to it is still valid, queue it now. As -* with kvm irqfds, we know we won't race against the virqfd -* going away because we hold the lock to get here. -*/ - if (*(virqfd-pvirqfd) == virqfd) { - *(virqfd-pvirqfd) = NULL; - virqfd_deactivate(virqfd); - } - - spin_unlock_irqrestore(virqfd_lock, flags); - } - - return 0; -} - -static void virqfd_ptable_queue_proc(struct file *file, -wait_queue_head_t *wqh, poll_table *pt) -{ - struct virqfd *virqfd = container_of(pt, struct virqfd, pt); - add_wait_queue(wqh, virqfd-wait); -} - -static void virqfd_shutdown(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, shutdown); - u64 cnt; - - eventfd_ctx_remove_wait_queue(virqfd-eventfd, virqfd-wait, cnt); - flush_work(virqfd-inject); - eventfd_ctx_put(virqfd-eventfd); - - kfree(virqfd); -} - -static void virqfd_inject(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, inject); - if (virqfd-thread) - virqfd-thread(virqfd-opaque, virqfd-data); -} - -int vfio_virqfd_enable(void *opaque, - int (*handler)(void *, void *), - void (*thread)(void *, void *), - void *data, struct virqfd **pvirqfd, 
int fd) -{ - struct fd irqfd; - struct
[PATCH v12 11/18] vfio/platform: support for level sensitive interrupts
From: Antonios Motakis a.mota...@virtualopensystems.com Level sensitive interrupts are exposed as maskable and automasked interrupts and are masked and disabled automatically when they fire. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 99 ++- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 98 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index 4b1ee22..e0e6388 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,12 +23,59 @@ #include vfio_platform_private.h +static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); +} + static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_mask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t mask = *(uint8_t *)data; + + if (mask) + vfio_platform_mask(vdev-irqs[index]); + } + + return 0; +} + +static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (irq_ctx-masked) { + enable_irq(irq_ctx-hwirq); + irq_ctx-masked = false; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); } static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, @@ -36,7 +83,50 @@ static int 
vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_unmask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t unmask = *(uint8_t *)data; + + if (unmask) + vfio_platform_unmask(vdev-irqs[index]); + } + + return 0; +} + +static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + unsigned long flags; + int ret = IRQ_NONE; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + ret = IRQ_HANDLED; + + /* automask maskable interrupts */ + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); + + if (ret == IRQ_HANDLED) + eventfd_signal(irq_ctx-trigger, 1); + + return ret; } static irqreturn_t vfio_irq_handler(int irq, void *dev_id) @@ -102,7 +192,7 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, irq_handler_t handler; if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) - return -EINVAL; /* not implemented */ + handler = vfio_automasked_irq_handler; else handler = vfio_irq_handler; @@ -174,6 +264,8 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; + spin_lock_init(vdev-irqs[i].lock); + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) @@ -182,6 +274,7 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; + vdev-irqs[i].masked = false; } vdev-num_irqs = cnt; diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index aa01cc3..ff2db1d
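The automask behaviour described above is a small state machine: a level-sensitive IRQ is disabled the moment it fires, so it cannot storm the host while userspace handles it, and it stays off until explicitly unmasked. Here is a minimal model of that logic, with plain flags standing in for `disable_irq_nosync()`, `enable_irq()` and `eventfd_signal()`; the names are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

struct irq_model {
	bool masked;          /* mirrors vfio_platform_irq.masked */
	int fires_delivered;  /* times we would have signalled the eventfd */
};

/* Models vfio_automasked_irq_handler(): deliver once, then mask. */
void automasked_handler(struct irq_model *irq)
{
	if (!irq->masked) {
		irq->masked = true;      /* stands in for disable_irq_nosync() */
		irq->fires_delivered++;  /* stands in for eventfd_signal() */
	}
}

/* Models vfio_platform_unmask(): re-enable only if currently masked. */
void irq_unmask(struct irq_model *irq)
{
	if (irq->masked)
		irq->masked = false;     /* stands in for enable_irq() */
}
```

Repeated fires while masked are swallowed, which is exactly what makes a level-triggered line safe to hand to userspace.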
[PATCH v12 15/18] vfio: pass an opaque pointer on virqfd initialization
From: Antonios Motakis a.mota...@virtualopensystems.com VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler that implements masking/unmasking of IRQs via an eventfd. We can replace it in the virqfd infrastructure with an opaque type so we can make use of the mechanism from other VFIO bus drivers. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index b35bc16..5b5fc23 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -31,10 +31,10 @@ * IRQfd - generic */ struct virqfd { - struct vfio_pci_device *vdev; + void*opaque; struct eventfd_ctx *eventfd; - int (*handler)(struct vfio_pci_device *, void *); - void(*thread)(struct vfio_pci_device *, void *); + int (*handler)(void *, void *); + void(*thread)(void *, void *); void*data; struct work_struct inject; wait_queue_twait; @@ -74,7 +74,7 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLIN) { /* An event has been signaled, call function */ if ((!virqfd-handler || -virqfd-handler(virqfd-vdev, virqfd-data)) +virqfd-handler(virqfd-opaque, virqfd-data)) virqfd-thread) schedule_work(virqfd-inject); } @@ -124,12 +124,12 @@ static void virqfd_inject(struct work_struct *work) { struct virqfd *virqfd = container_of(work, struct virqfd, inject); if (virqfd-thread) - virqfd-thread(virqfd-vdev, virqfd-data); + virqfd-thread(virqfd-opaque, virqfd-data); } -int vfio_virqfd_enable(struct vfio_pci_device *vdev, - int (*handler)(struct vfio_pci_device *, void *), - void (*thread)(struct vfio_pci_device *, void *), +int vfio_virqfd_enable(void *opaque, + int (*handler)(void *, void *), + void (*thread)(void *, void *), void *data, struct virqfd **pvirqfd, int fd) { struct fd irqfd; @@ -143,7 +143,7 @@ int vfio_virqfd_enable(struct 
vfio_pci_device *vdev, return -ENOMEM; virqfd-pvirqfd = pvirqfd; - virqfd-vdev = vdev; + virqfd-opaque = opaque; virqfd-handler = handler; virqfd-thread = thread; virqfd-data = data; @@ -196,7 +196,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * before we registered and trigger it as if we didn't miss it. */ if (events POLLIN) { - if ((!handler || handler(vdev, data)) thread) + if ((!handler || handler(opaque, data)) thread) schedule_work(virqfd-inject); } @@ -243,8 +243,10 @@ EXPORT_SYMBOL_GPL(vfio_virqfd_disable); /* * INTx */ -static void vfio_send_intx_eventfd(struct vfio_pci_device *vdev, void *unused) +static void vfio_send_intx_eventfd(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; + if (likely(is_intx(vdev) !vdev-virq_disabled)) eventfd_signal(vdev-ctx[0].trigger, 1); } @@ -287,9 +289,9 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev) * a signal is necessary, which can then be handled via a work queue * or directly depending on the caller. */ -static int vfio_pci_intx_unmask_handler(struct vfio_pci_device *vdev, - void *unused) +static int vfio_pci_intx_unmask_handler(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; struct pci_dev *pdev = vdev-pdev; unsigned long flags; int ret = 0; @@ -641,7 +643,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, } else if (flags VFIO_IRQ_SET_DATA_EVENTFD) { int32_t fd = *(int32_t *)data; if (fd = 0) - return vfio_virqfd_enable(vdev, + return vfio_virqfd_enable((void *) vdev, vfio_pci_intx_unmask_handler, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
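The refactor above is the classic C genericity trick: replace a concrete struct pointer in a callback signature with `void *opaque`, and let each caller cast it back. A tiny sketch of the pattern, with hypothetical names (only the shape matches the virqfd code):

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as struct virqfd after the patch: the callback no longer
 * knows about vfio_pci_device, only an opaque cookie. */
struct virqfd_like {
	void *opaque;
	int (*handler)(void *opaque, void *data);
};

struct fake_dev {
	int signals;
};

/* A bus-driver-specific handler: casts opaque back to its own type. */
int count_signal(void *opaque, void *data)
{
	(void)data;
	((struct fake_dev *)opaque)->signals++;
	return 1;
}

/* The generic core invokes the handler without knowing the device type. */
int fire(struct virqfd_like *v, void *data)
{
	return v->handler ? v->handler(v->opaque, data) : 1;
}
```

The cost is losing compile-time type checking on `opaque`; the gain is that the same machinery serves VFIO_PCI, VFIO_PLATFORM, and any future bus driver.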
[PATCH v12 09/18] vfio/platform: initial interrupts support code
From: Antonios Motakis a.mota...@virtualopensystems.com This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which most IRQ functionality is implemented in VFIO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 52 +-- drivers/vfio/platform/vfio_platform_irq.c | 59 +++ drivers/vfio/platform/vfio_platform_private.h | 7 3 files changed, 115 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index cf7bb08..a532a25 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -204,10 +204,54 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_SET_IRQS) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_SET_IRQS) { + struct vfio_irq_set hdr; + u8 *data = NULL; + int ret = 0; + + minsz = offsetofend(struct vfio_irq_set, count); + + if (copy_from_user(hdr, (void __user *)arg, minsz)) + return -EFAULT; + + if (hdr.argsz minsz) + return -EINVAL; + + if (hdr.index = vdev-num_irqs) + return -EINVAL; + + if (hdr.flags ~(VFIO_IRQ_SET_DATA_TYPE_MASK | + VFIO_IRQ_SET_ACTION_TYPE_MASK)) + return -EINVAL; - else if (cmd == VFIO_DEVICE_RESET) + if (!(hdr.flags VFIO_IRQ_SET_DATA_NONE)) { + size_t size; + + if (hdr.flags VFIO_IRQ_SET_DATA_BOOL) + size = sizeof(uint8_t); + else if (hdr.flags VFIO_IRQ_SET_DATA_EVENTFD) + size = sizeof(int32_t); + else + return -EINVAL; + + if (hdr.argsz - minsz size) + return -EINVAL; + + data = memdup_user((void __user *)(arg + minsz), size); + if (IS_ERR(data)) + return PTR_ERR(data); + } + + mutex_lock(vdev-igate); + + ret = vfio_platform_set_irqs_ioctl(vdev, hdr.flags, hdr.index, + hdr.start, hdr.count, data); + mutex_unlock(vdev-igate); + kfree(data); + + return ret; + + } else if (cmd == VFIO_DEVICE_RESET) return -EINVAL; return -ENOTTY; @@ -457,6 
+501,8 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev, return ret; } + mutex_init(vdev-igate); + return 0; } EXPORT_SYMBOL_GPL(vfio_platform_probe_common); diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index c6c3ec1..df5c919 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,6 +23,56 @@ #include vfio_platform_private.h +static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, +unsigned index, unsigned start, +unsigned count, uint32_t flags, +void *data) +{ + return -EINVAL; +} + +int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, +uint32_t flags, unsigned index, unsigned start, +unsigned count, void *data) +{ + int (*func)(struct vfio_platform_device *vdev, unsigned index, + unsigned start, unsigned count, uint32_t flags, + void *data) = NULL; + + switch (flags VFIO_IRQ_SET_ACTION_TYPE_MASK) { + case VFIO_IRQ_SET_ACTION_MASK: + func = vfio_platform_set_irq_mask; + break; + case VFIO_IRQ_SET_ACTION_UNMASK: + func = vfio_platform_set_irq_unmask; + break; + case VFIO_IRQ_SET_ACTION_TRIGGER: + func =
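The VFIO_DEVICE_SET_IRQS handler above sizes its payload from the DATA flags before copying it in: NONE carries no data, BOOL one `u8` per IRQ, EVENTFD one `s32` file descriptor per IRQ. A sketch of that selection logic; the flag values match `<linux/vfio.h>` to the best of my knowledge, but treat them as assumptions here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the definitions in <linux/vfio.h>. */
#define VFIO_IRQ_SET_DATA_NONE    (1 << 0)
#define VFIO_IRQ_SET_DATA_BOOL    (1 << 1)
#define VFIO_IRQ_SET_DATA_EVENTFD (1 << 2)

/* Mirrors the size selection in vfio_platform_ioctl(): pick the
 * per-element payload size, or fail on an unknown data type. */
size_t irq_set_data_size(uint32_t flags)
{
	if (flags & VFIO_IRQ_SET_DATA_NONE)
		return 0;
	if (flags & VFIO_IRQ_SET_DATA_BOOL)
		return sizeof(uint8_t);
	if (flags & VFIO_IRQ_SET_DATA_EVENTFD)
		return sizeof(int32_t);
	return (size_t)-1; /* invalid, kernel returns -EINVAL */
}
```

Only after this size check does the kernel `memdup_user()` the payload, which is why a malformed `argsz`/flags combination never reaches the per-action handlers.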
[PATCH v12 03/18] vfio: platform: add the VFIO PLATFORM module to Kconfig
From: Antonios Motakis a.mota...@virtualopensystems.com

Enable building the VFIO PLATFORM driver that allows to use Linux
platform devices with VFIO.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/Kconfig           | 1 +
 drivers/vfio/Makefile          | 1 +
 drivers/vfio/platform/Kconfig  | 9 +++++++++
 drivers/vfio/platform/Makefile | 4 ++++
 4 files changed, 15 insertions(+)
 create mode 100644 drivers/vfio/platform/Kconfig
 create mode 100644 drivers/vfio/platform/Makefile

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index a0abe04..962fb80 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -27,3 +27,4 @@ menuconfig VFIO
 	  If you don't know what to do here, say N.
 
 source "drivers/vfio/pci/Kconfig"
+source "drivers/vfio/platform/Kconfig"

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 0b035b1..dadf0ca 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += platform/

diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
new file mode 100644
index 000..c51af17
--- /dev/null
+++ b/drivers/vfio/platform/Kconfig
@@ -0,0 +1,9 @@
+config VFIO_PLATFORM
+	tristate "VFIO support for platform devices"
+	depends on VFIO && EVENTFD && ARM
+	help
+	  Support for platform devices with VFIO. This is required to make
+	  use of platform devices present on the system using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
new file mode 100644
index 000..279862b
--- /dev/null
+++ b/drivers/vfio/platform/Makefile
@@ -0,0 +1,4 @@
+
+vfio-platform-y := vfio_platform.o vfio_platform_common.o
+
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
-- 
2.2.2
[PATCH v12 14/18] vfio: add local lock for virqfd instead of depending on VFIO PCI
From: Antonios Motakis a.mota...@virtualopensystems.com The Virqfd code needs to keep accesses to any struct *virqfd safe, but this comes into play only when creating or destroying eventfds, so sharing the same spinlock with the VFIO bus driver is not necessary. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index a5378d5..b35bc16 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -44,6 +44,7 @@ struct virqfd { }; static struct workqueue_struct *vfio_irqfd_cleanup_wq; +DEFINE_SPINLOCK(virqfd_lock); int __init vfio_virqfd_init(void) { @@ -80,21 +81,21 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLHUP) { unsigned long flags; - spin_lock_irqsave(virqfd-vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); /* * The eventfd is closing, if the virqfd has not yet been * queued for release, as determined by testing whether the -* vdev pointer to it is still valid, queue it now. As +* virqfd pointer to it is still valid, queue it now. As * with kvm irqfds, we know we won't race against the virqfd -* going away because we hold wqh-lock to get here. +* going away because we hold the lock to get here. */ if (*(virqfd-pvirqfd) == virqfd) { *(virqfd-pvirqfd) = NULL; virqfd_deactivate(virqfd); } - spin_unlock_irqrestore(virqfd-vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); } return 0; @@ -170,16 +171,16 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * we update the pointer to the virqfd under lock to avoid * pushing multiple jobs to release the same virqfd. 
*/ - spin_lock_irq(vdev-irqlock); + spin_lock_irq(virqfd_lock); if (*pvirqfd) { - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); ret = -EBUSY; goto err_busy; } *pvirqfd = virqfd; - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); /* * Install our own custom wake-up handling so we are notified via @@ -217,18 +218,18 @@ err_fd: } EXPORT_SYMBOL_GPL(vfio_virqfd_enable); -void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd) +void vfio_virqfd_disable(struct virqfd **pvirqfd) { unsigned long flags; - spin_lock_irqsave(vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); if (*pvirqfd) { virqfd_deactivate(*pvirqfd); *pvirqfd = NULL; } - spin_unlock_irqrestore(vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); /* * Block until we know all outstanding shutdown jobs have completed. @@ -441,8 +442,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd) static void vfio_intx_disable(struct vfio_pci_device *vdev) { vfio_intx_set_signal(vdev, -1); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[0].mask); + vfio_virqfd_disable(vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].mask); vdev-irq_type = VFIO_PCI_NUM_IRQS; vdev-num_ctx = 0; kfree(vdev-ctx); @@ -606,8 +607,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix) vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix); for (i = 0; i vdev-num_ctx; i++) { - vfio_virqfd_disable(vdev, vdev-ctx[i].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[i].mask); + vfio_virqfd_disable(vdev-ctx[i].unmask); + vfio_virqfd_disable(vdev-ctx[i].mask); } if (msix) { @@ -645,7 +646,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].unmask); } return 0; -- 2.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the 
body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: Fix CR3_PCID_INVD type on 32-bit
On 15/01/2015 09:44, Borislav Petkov wrote:
> From: Borislav Petkov b...@suse.de
>
> arch/x86/kvm/emulate.c: In function ‘check_cr_write’:
> arch/x86/kvm/emulate.c:3552:4: warning: left shift count >= width of type
>    rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
>
> happens because sizeof(UL) on 32-bit is 4 bytes but we shift it 63 bits
> to the left.
>
> Signed-off-by: Borislav Petkov b...@suse.de
> ---
>  arch/x86/include/asm/kvm_host.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d89c6b828c96..a8d07a060136 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -51,7 +51,7 @@
>  			  | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
>  
>  #define CR3_L_MODE_RESERVED_BITS 0xFF00ULL
> -#define CR3_PCID_INVD		 (1UL << 63)
> +#define CR3_PCID_INVD		 BIT_64(63)
>  #define CR4_RESERVED_BITS \
>  	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
>  			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \

Applied, thanks.

Paolo
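The bug being fixed is a type-width one: `1UL` is only 32 bits wide on a 32-bit target, so shifting it left by 63 is undefined and gcc warns. A sketch of the fix follows; the `BIT_64()` macro is re-defined locally for illustration (the kernel's own definition lives in its bitops headers).

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT_64(): force the constant to a
 * 64-bit type so the shift is evaluated in 64 bits regardless of
 * sizeof(unsigned long). */
#define BIT_64(n) (UINT64_C(1) << (n))

/* With (1UL << 63), a 32-bit build shifts a 32-bit value by 63 bits:
 * undefined behaviour and the "left shift count >= width of type"
 * warning above. With BIT_64(63) the result is well-defined. */
uint64_t cr3_pcid_invd(void)
{
	return BIT_64(63);
}
```

On a 64-bit build both spellings happen to produce the same value, which is why the bug only surfaced as a 32-bit compile warning.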
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote:
> I'm actually wondering if we should just nuke the _interruptible()
> version of swait. As it should only be all interruptible or all not
> interruptible, the swait_wake() should just do the wake up regardless.
> In which case, swait_wake() is good enough. No need to have different
> versions where people may think it does something special.
>
> Peter?

Yeah, I think the latest thing I have sitting here on my disk only has
the swake_up() which does TASK_NORMAL, no choice there.
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
2015-01-20 15:54-0200, Marcelo Tosatti:
> SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
> and rdtsc is larger than a given threshold:
> [...]
> Disable masterclock support (which increases said delta) in case the
> boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.

Why do we care about 2.6.16 bugs in upstream KVM?

The code-to-benefit tradeoff of this patch seems bad to me ...
MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
to support old guests.
Re: KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
On Wed, Jan 21, 2015 at 03:09:27PM +0100, Radim Krčmář wrote:
> 2015-01-20 15:54-0200, Marcelo Tosatti:
> > SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
> > and rdtsc is larger than a given threshold:
> > [...]
> > Disable masterclock support (which increases said delta) in case the
> > boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
>
> Why do we care about 2.6.16 bugs in upstream KVM?

Because people do use 2.6.16 guests.

> The code-to-benefit tradeoff of this patch seems bad to me ...

Can you state the tradeoff and then explain why it is bad ?

> MSR_KVM_SYSTEM_TIME is deprecated -- we could remove it now, with
> MSR_KVM_WALL_CLOCK (after hiding KVM_FEATURE_CLOCKSOURCE) if we want
> to support old guests.

What is the benefit of removing support for MSR_KVM_SYSTEM_TIME ?
Supporting old guests is important.