[PATCH RESEND v15 07/10] KVM: arm: page logging 2nd stage fault handling
This patch adds support for 2nd stage page fault handling during dirty page
logging. On huge page faults, huge pages are dissolved to normal pages, and
rebuilding of 2nd stage huge pages is blocked. In case migration is canceled
this restriction is removed and huge pages may be rebuilt again.

This patch applies cleanly on top of the patch series posted Dec. 15th:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-December/012826.html
Patch #11 has been dropped, and should not be applied.

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---

Change Log since last RESEND v1 -> v2:
- Disallow dirty page logging of IO region - fail for initial write protect
  and disable logging code in 2nd stage page fault handler.
- Fixed auto spell correction errors

Change Log RESEND v0 -> v1:
- fixed bug exposed by new generic __get_user_pages_fast(), when region is
  writable, prevent write protection of pte on read fault
- Removed marking entire huge page dirty on initial access
- don't dissolve huge pages of non-writable regions
- Made updates based on Christoffer's comments
  - renamed logging status function to memslot_is_logging()
  - changed few values to bool from longs
  - streamlined user_mem_abort() to eliminate extra conditional checks
---
 arch/arm/kvm/mmu.c | 113
 1 file changed, 105 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 73d506f..b878236 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -47,6 +47,18 @@ static phys_addr_t hyp_idmap_vector;
 #define kvm_pmd_huge(_x)	(pmd_huge(_x) || pmd_trans_huge(_x))
 #define kvm_pud_huge(_x)	pud_huge(_x)
 
+#define KVM_S2PTE_FLAG_IS_IOMAP		(1UL << 0)
+#define KVM_S2PTE_FLAG_LOGGING_ACTIVE	(1UL << 1)
+
+static bool memslot_is_logging(struct kvm_memory_slot *memslot)
+{
+#ifdef CONFIG_ARM
+	return !!memslot->dirty_bitmap;
+#else
+	return false;
+#endif
+}
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
 	/*
@@ -59,6 +71,25 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }
 
+/**
+ * stage2_dissolve_pmd() - clear and flush huge PMD entry
+ * @kvm:	pointer to kvm structure.
+ * @addr:	IPA
+ * @pmd:	pmd pointer for IPA
+ *
+ * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all
+ * pages in the range dirty.
+ */
+static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
+{
+	if (!kvm_pmd_huge(*pmd))
+		return;
+
+	pmd_clear(pmd);
+	kvm_tlb_flush_vmid_ipa(kvm, addr);
+	put_page(virt_to_page(pmd));
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 				  int min, int max)
 {
@@ -703,10 +734,13 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 }
 
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
-			  phys_addr_t addr, const pte_t *new_pte, bool iomap)
+			  phys_addr_t addr, const pte_t *new_pte,
+			  unsigned long flags)
 {
 	pmd_t *pmd;
 	pte_t *pte, old_pte;
+	bool iomap = flags & KVM_S2PTE_FLAG_IS_IOMAP;
+	bool logging_active = flags & KVM_S2PTE_FLAG_LOGGING_ACTIVE;
 
 	/* Create stage-2 page table mapping - Levels 0 and 1 */
 	pmd = stage2_get_pmd(kvm, cache, addr);
@@ -718,6 +752,13 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 		return 0;
 	}
 
+	/*
+	 * While dirty page logging - dissolve huge PMD, then continue on to
+	 * allocate page.
+	 */
+	if (logging_active)
+		stage2_dissolve_pmd(kvm, addr, pmd);
+
 	/* Create stage-2 page mappings - Level 2 */
 	if (pmd_none(*pmd)) {
 		if (!cache)
@@ -774,7 +815,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		if (ret)
 			goto out;
 		spin_lock(&kvm->mmu_lock);
-		ret = stage2_set_pte(kvm, &cache, addr, &pte, true);
+		ret = stage2_set_pte(kvm, &cache, addr, &pte,
+				     KVM_S2PTE_FLAG_IS_IOMAP);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			goto out;
@@ -1002,6 +1044,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
 	bool fault_ipa_uncached;
+	bool can_set_pte_rw = true;
+	unsigned long set_pte_flags = 0;
 
 	write_fault = kvm_is_write_fault(vcpu);
 	if (fault_status == FSC_PERM && !write_fault) {
@@ -1009,6 +1053,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
+
Re: [RESEND PATCH v15 07/11] KVM: arm: page logging 2nd stage fault handling
On 01/09/2015 02:24 AM, Christoffer Dall wrote:
On Thu, Jan 08, 2015 at 08:28:46AM -0800, Mario Smarduch wrote:
On 01/08/2015 02:45 AM, Christoffer Dall wrote:
On Wed, Jan 07, 2015 at 05:43:18PM -0800, Mario Smarduch wrote:

Hi Christoffer,

before going through your comments, I discovered that in 3.18.0-rc2 a
generic __get_user_pages_fast() was implemented; now ARM picks this up.
This causes gfn_to_pfn_prot() to return a meaningful 'writable' value
for a read fault, provided the region is writable. Prior to that the
weak version returned 0, and 'writable' had no optimization effect to
set the pte/pmd RW on a read fault. As a consequence dirty logging broke
in 3.18, I was seeing

Correction on this: proper __get_user_pages_fast() behavior exposed a
bug in the page logging code.

weird but very intermittent issues. I just put in the additional few
lines to fix it, to prevent pte RW (only R) on read faults while logging
a writable region.

On 01/07/2015 04:38 AM, Christoffer Dall wrote:
On Wed, Dec 17, 2014 at 06:07:29PM -0800, Mario Smarduch wrote:

This patch is a follow-up to the v15 patch series, with the following
changes:
- When clearing/dissolving a huge PMD, mark the huge page range dirty,
  since the state of the whole range is unknown. After the huge page is
  dissolved, dirty page logging is at page granularity.

What is the sequence of events where you could have dirtied another
page within the PMD range after the user initially requested dirty page
logging?

No, there is none. My issue was the start point for tracking dirty
pages, and that would be the second call to dirty log read - not the
first call after the initial write protect, where any page in the range
can be assumed dirty. I'll remove this; not sure if there would be any
use case to call dirty log only once.

Calling dirty log once cannot give you anything meaningful, right? You
must assume all memory is 'dirty' at this point, no?

There is the interval between KVM_MEM_LOG_DIRTY_PAGES and the first
call to KVM_GET_DIRTY_LOG. Not sure of any use case - maybe enable
logging, wait a while, do a dirty log read, disable logging - to get an
accumulated snapshot of dirty page activity.

ok, so from the time the user calls KVM_MEM_LOG_DIRTY_PAGES, any fault
on any huge page will dissolve that huge page into pages, and each
dirty page will be logged accordingly for the first call to
KVM_GET_DIRTY_LOG, right? What am I missing here?

Yes, that's correct; this may or may not be meaningful in itself. The
original point was first-time access to a huge page (on the first or
some later call) and whether we consider the whole range dirty. Keeping
track at page granularity plus the original image provides everything
needed to reconstruct the source, so it should not matter. I think I
convoluted this issue a bit.

- Mario

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in the
body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote:

We used to optimize rescheduling and audit on syscall exit. Now that the
full slow path is reasonably fast, remove these optimizations. This adds
something like 10ns to the previously optimized paths on my computer,
presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions along
the lines of the syscall entry hooks.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/entry_64.S | 52 +-
 1 file changed, 5 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index a5afdf0f7fa4..222dc5c45ac3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -427,15 +427,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	movl $_TIF_ALLWORK_MASK,%edi
-	/* edi:	flagmask */
-sysret_check:
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go to the slow path */
+
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
-	andl %edi,%edx
-	jnz  sysret_careful
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts:
@@ -449,49 +446,10 @@ sysret_check:
 	USERGS_SYSRET64
 
 	CFI_RESTORE_STATE
-	/* Handle reschedules */
-	/* edx:	work, edi: workmask */
-sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
 
-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
+int_ret_from_sys_call_fixup:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
-	/*
-	 * Return fast path for syscall audit. Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi		/* is it < -MAX_ERRNO? */
-	setbe %al			/* 1 if so, 0 if not */
-	movzbl %al,%edi			/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
-#endif	/* CONFIG_AUDITSYSCALL */
+	jmp int_ret_from_sys_call

Ok, dumb question: what is replacing that audit functionality now? Or
are we remaining with the single path down syscall_trace_leave()?
AFAICT, the intention is to direct everything to int_ret_from_sys_call
after the syscall has been done so that the paths can get simplified.
Which is a good thing, AFAICT.

Acked-by: Borislav Petkov b...@suse.de

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
[RFC PATCH] kvmtool: remove 8250 IRQ line reset on device_init
Currently we reset the KVM interrupt line on initializing the 8250
serial device emulation. For ARM this creates a problem where we use the
in-kernel IRQ chip before having fully initialized it. But with the new
kernel interface we cannot finish the GIC initialization before we know
the number of used IRQs, so we have to wait until all devices have been
created and initialized. Since the in-kernel GIC emulation resets the
IRQ line anyway, and QEMU also gets away without resetting it, the
easiest solution is to drop the IRQ line reset.

Signed-off-by: Andre Przywara andre.przyw...@arm.com
---
Hi Pekka,

this patch is an easy fix for our problems with Marc's kvmtool tree and
3.19-rc (setting the number of IRQs after the GIC has been used
already). I see that this basically reverts:

commit 2ca9e37193ca5f5df5726e0061dc749829295435
Author: Pekka Enberg penb...@kernel.org
Date:   Sun Jan 23 11:49:39 2011 +0200

    kvm,8250: Fix device initial state

    This patch fixes 8250 device initial state for registers and IRQ
    based on what Qemu does.

Do you (or does anyone) know of the issue that this patch fixed? It is
four years old, and from what I see QEMU no longer does the mentioned
IRQ reset(?). Reworking kvmtool to avoid this issue sounds rather
painful.

Cheers,
Andre.

 tools/kvm/hw/serial.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/kvm/hw/serial.c b/tools/kvm/hw/serial.c
index 270e6182..2f19ba8 100644
--- a/tools/kvm/hw/serial.c
+++ b/tools/kvm/hw/serial.c
@@ -406,7 +406,6 @@ static int serial8250__device_init(struct kvm *kvm, struct serial8250_device *de
 	ioport__map_irq(&dev->irq);
 	r = ioport__register(kvm, dev->iobase, &serial8250_ops, 8, dev);
-	kvm__irq_line(kvm, dev->irq, 0);
 
 	return r;
 }
--
1.7.9.5
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On 9 January 2015 at 14:16, Marc Zyngier marc.zyng...@arm.com wrote:
On 09/01/15 13:03, Peter Maydell wrote:

When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't mean
we get a new VMID for it, though, does it? I thought that what causes
the icache flush to happen for the reset guest is that we unmap all of
stage 2 and then fault it back in, via this code. That works for PIPT
(we flush the range) and for VIPT (we do a full icache flush), but at
the moment for VIVT ASID-tagged we assume we can do nothing, and I
don't think that's right for this case (though it is right for "faulted
because the page was swapped out", and OK for "page was written by
DMA").

When we reset the guest, we also turn its Icache off. Before turning it
back on, the guest has to invalidate it (the ARM ARM doesn't seem to
define the state of the cache out of reset).

But implementations are allowed to hit in the cache even when the cache
is disabled. In particular, setting the guest SCTLR_EL1.I to 0 means
iside accesses are Normal Noncacheable for stage 1 attributes, and (v8
ARM ARM D3.4.6) these can be held in the instruction cache. So the
guest is required to do an icache-invalidate for any instructions that
it writes *itself* (or DMAs in) even with the icache off. But it cannot
possibly do so for its own initial startup code, and it must be the job
of KVM to do that for it. (You can think of this as "the VCPU provided
by KVM always invalidates the icache on reset and does not require an
impdef magic cache-init routine as described by D3.4.4", if you like.)

-- PMM
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
On 09/01/2015 16:12, Radim Krčmář wrote:

The chipset doesn't support it. :(

I meant that we need to recompute PI entries for lowest priority
interrupts every time the guest's TPR changes. Luckily, Linux doesn't
use TPR, but other OSes might be a reason to drop lowest priority from
PI optimizations. (Or make it more complicated.)

Doing vector hashing is a possibility as well. I would like to know
what existing chipsets do in practice, then we can mimic it.

Paolo
Re: [PATCH] Flush TLB when D bit is manually changed.
On 09/01/2015 09:44, Kai Huang wrote:

When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately.
We should flush it to guarantee the consistency of the D bit between
the TLB and the MMU page table in memory. This is required if some
specific hardware feature uses D-bit status to do specific things.

Sanity test was done on my machine with an Intel processor.

Signed-off-by: Kai Huang kai.hu...@linux.intel.com
---
 arch/x86/kvm/mmu.c | 12
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 978f402..1feac0c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -547,6 +547,11 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
 	return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }
 
+static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+	return (old_spte & bit_mask) != (new_spte & bit_mask);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -597,6 +602,13 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	if (!shadow_accessed_mask)
 		return ret;
 
+	/*
+	 * We also need to flush the TLB when the D-bit is changed by
+	 * software to guarantee D-bit consistency between the TLB and the
+	 * MMU page table.
+	 */
+	if (spte_is_bit_changed(old_spte, new_spte, shadow_dirty_mask))

I think shadow_accessed_mask needs to be checked too. I made the change
and applied the patch.

Paolo

+		ret = true;
+
 	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
Re: [RESEND PATCH v15 07/11] KVM: arm: page logging 2nd stage fault handling
On Thu, Jan 08, 2015 at 08:28:46AM -0800, Mario Smarduch wrote:
On 01/08/2015 02:45 AM, Christoffer Dall wrote:
On Wed, Jan 07, 2015 at 05:43:18PM -0800, Mario Smarduch wrote:

Hi Christoffer,

before going through your comments, I discovered that in 3.18.0-rc2 a
generic __get_user_pages_fast() was implemented; now ARM picks this up.
This causes gfn_to_pfn_prot() to return a meaningful 'writable' value
for a read fault, provided the region is writable. Prior to that the
weak version returned 0, and 'writable' had no optimization effect to
set the pte/pmd RW on a read fault. As a consequence dirty logging broke
in 3.18, I was seeing

Correction on this: proper __get_user_pages_fast() behavior exposed a
bug in the page logging code.

weird but very intermittent issues. I just put in the additional few
lines to fix it, to prevent pte RW (only R) on read faults while logging
a writable region.

On 01/07/2015 04:38 AM, Christoffer Dall wrote:
On Wed, Dec 17, 2014 at 06:07:29PM -0800, Mario Smarduch wrote:

This patch is a follow-up to the v15 patch series, with the following
changes:
- When clearing/dissolving a huge PMD, mark the huge page range dirty,
  since the state of the whole range is unknown. After the huge page is
  dissolved, dirty page logging is at page granularity.

What is the sequence of events where you could have dirtied another
page within the PMD range after the user initially requested dirty page
logging?

No, there is none. My issue was the start point for tracking dirty
pages, and that would be the second call to dirty log read - not the
first call after the initial write protect, where any page in the range
can be assumed dirty. I'll remove this; not sure if there would be any
use case to call dirty log only once.

Calling dirty log once cannot give you anything meaningful, right? You
must assume all memory is 'dirty' at this point, no?

There is the interval between KVM_MEM_LOG_DIRTY_PAGES and the first
call to KVM_GET_DIRTY_LOG. Not sure of any use case - maybe enable
logging, wait a while, do a dirty log read, disable logging - to get an
accumulated snapshot of dirty page activity.

ok, so from the time the user calls KVM_MEM_LOG_DIRTY_PAGES, any fault
on any huge page will dissolve that huge page into pages, and each
dirty page will be logged accordingly for the first call to
KVM_GET_DIRTY_LOG, right? What am I missing here?

-Christoffer
Re: [PATCH v15 11/11] KVM: arm/arm64: Add support to dissolve huge PUD
On Thu, Jan 08, 2015 at 08:41:15AM -0800, Mario Smarduch wrote:

[...]

I'm just thinking here, why do we need to check if we get a valid pud
back here, but we don't need the equivalent check in dissolve_pmd from
patch 7?

kvm_pud_huge() doesn't check bit 0 for an invalid entry, but pud_none()
is not the right way to check either - maybe pud_bad() first. Nothing
is done in patch 7 since the pmd is retrieved from stage2_get_pmd().

hmmm, but stage2_get_pmd() can return a NULL pointer if you have the
IOMAP flag set...

I think the rationale is that it should never happen, because we never
call these functions with the logging and iomap flags at the same
time...

I'm a little lost here, not sure how it's related to the above. But I
think a VFIO device will have a memslot and it would be possible to
enable logging. But to what end, I'm not sure.

As I said above, if you call the set_s2pte function with the IOMAP and
LOGGING flags set, then you'll end up in a situation where you can get
a NULL pointer back from stage2_get_pmd() but you're never checking
against that.

I see what you're saying now. Now, this raises an interesting point: we
have now added code that prevents faults from ever happening on device
maps, but we are introducing a path here where the user can set logging
on a memslot with device memory regions, which introduces write faults
on such regions. My gut feeling is that we should avoid that from ever
happening, and not allow this function to be called with both flags
set.

Maybe kvm_arch_prepare_memory_region() can check if
KVM_MEM_LOG_DIRTY_PAGES is being enabled for an IO region, and not
allow it.

Yeah, I think we need to add a check for that somewhere as part of this
series (patch 7 perhaps?).

-Christoffer
Re: [PATCH] Flush TLB when D bit is manually changed.
On 01/09/2015 04:44 PM, Kai Huang wrote:

When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately.
We should flush it to guarantee the consistency of the D bit between
the TLB and the MMU page table in memory. This is required if some
specific hardware feature uses D-bit status to do specific things.

Currently, A/D is lazily synced, and it does not hurt anything, since
we have marked the page dirty after clearing the bit and the
mmu-notifier can keep the coherence on the host (it is the guest's
responsibility to sync TLBs when changing the bit). So I am just
curious what "some specific hardware" is. :)
[PATCH] Flush TLB when D bit is manually changed.
When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately.
We should flush it to guarantee the consistency of the D bit between
the TLB and the MMU page table in memory. This is required if some
specific hardware feature uses D-bit status to do specific things.

Sanity test was done on my machine with an Intel processor.

Signed-off-by: Kai Huang kai.hu...@linux.intel.com
---
 arch/x86/kvm/mmu.c | 12
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 978f402..1feac0c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -547,6 +547,11 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
 	return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }
 
+static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+	return (old_spte & bit_mask) != (new_spte & bit_mask);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -597,6 +602,13 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	if (!shadow_accessed_mask)
 		return ret;
 
+	/*
+	 * We also need to flush the TLB when the D-bit is changed by
+	 * software to guarantee D-bit consistency between the TLB and the
+	 * MMU page table.
+	 */
+	if (spte_is_bit_changed(old_spte, new_spte, shadow_dirty_mask))
+		ret = true;
+
 	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
--
2.1.0
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On 09/01/15 15:28, Peter Maydell wrote:
On 9 January 2015 at 14:16, Marc Zyngier marc.zyng...@arm.com wrote:
On 09/01/15 13:03, Peter Maydell wrote:

[...]

So the guest is required to do an icache-invalidate for any
instructions that it writes *itself* (or DMAs in) even with the icache
off. But it cannot possibly do so for its own initial startup code, and
it must be the job of KVM to do that for it. (You can think of this as
"the VCPU provided by KVM always invalidates the icache on reset and
does not require an impdef magic cache-init routine as described by
D3.4.4", if you like.)

Right. I think I finally see your point. We've been safe so far in that
no ASID-tagged VIVT icache platform is virtualisation capable, AFAIK.

Probably best to just nuke the icache always for non-PIPT caches, then.
I'll fold that in when I respin the series.

Thanks,

	M.
--
Jazz is not dead. It just smells funny...
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
2015-01-09 15:56+0100, Paolo Bonzini:
On 09/01/2015 15:54, Radim Krčmář wrote:

There are two points relevant to this patch in new KVM's implementation
(KVM: x86: amend APIC lowest priority arbitration,
https://lkml.org/lkml/2015/1/9/362):
1) lowest priority depends on TPR
2) there is no need for balancing

(1) has to be considered with PI as well.

The chipset doesn't support it. :(

I meant that we need to recompute PI entries for lowest priority
interrupts every time the guest's TPR changes. Luckily, Linux doesn't
use TPR, but other OSes might be a reason to drop lowest priority from
PI optimizations. (Or make it more complicated.)

I kept (2) to avoid whining from people building on that behaviour, but
lowest priority backed by PI could be transparent without it. The patch
below removes the balancing, but I am not sure this is a price we are
allowed to pay ... what are your opinions?

I wouldn't mind, but it requires a lot of benchmarking.

(I was afraid it would come to that.)
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
2015-01-09 16:18+0100, Paolo Bonzini:
On 09/01/2015 16:12, Radim Krčmář wrote:

The chipset doesn't support it. :(

I meant that we need to recompute PI entries for lowest priority
interrupts every time the guest's TPR changes. Luckily, Linux doesn't
use TPR, but other OSes might be a reason to drop lowest priority from
PI optimizations. (Or make it more complicated.)

Doing vector hashing is a possibility as well. I would like to know
what existing chipsets do in practice, then we can mimic it.

When looking at /proc/interrupts from time to time, I have only seen
interrupts landing on the first CPU of the set. We could also
distinguish between AMD and Intel ... AMD should deliver to the highest
APIC ID. (If we still need to decide after focus processor and APR
checks.) I'll try to check using a more trustworthy approach.
Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
On Fri, Jan 9, 2015 at 7:53 AM, Borislav Petkov b...@alien8.de wrote: On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote: We used to optimize rescheduling and audit on syscall exit. Now that the full slow path is reasonably fast, remove these optimizations. This adds something like 10ns to the previously optimized paths on my computer, presumably due mostly to SAVE_REST / RESTORE_REST. I think that we should eventually replace both the syscall and non-paranoid interrupt exit slow paths with a pair of C functions along the lines of the syscall entry hooks. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/entry_64.S | 52 +- 1 file changed, 5 insertions(+), 47 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index a5afdf0f7fa4..222dc5c45ac3 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -427,15 +427,12 @@ system_call_fastpath: * Has incomplete stack frame and undefined top of stack. */ ret_from_sys_call: - movl $_TIF_ALLWORK_MASK,%edi - /* edi: flagmask */ -sysret_check: + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) + jnz int_ret_from_sys_call_fixup /* Go the the slow path */ + LOCKDEP_SYS_EXIT DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF - movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx - andl %edi,%edx - jnz sysret_careful CFI_REMEMBER_STATE /* * sysretq will re-enable interrupts: @@ -449,49 +446,10 @@ sysret_check: USERGS_SYSRET64 CFI_RESTORE_STATE - /* Handle reschedules */ - /* edx: work, edi: workmask */ -sysret_careful: - bt $TIF_NEED_RESCHED,%edx - jnc sysret_signal - TRACE_IRQS_ON - ENABLE_INTERRUPTS(CLBR_NONE) - pushq_cfi %rdi - SCHEDULE_USER - popq_cfi %rdi - jmp sysret_check - /* Handle a signal */ -sysret_signal: - TRACE_IRQS_ON - ENABLE_INTERRUPTS(CLBR_NONE) -#ifdef CONFIG_AUDITSYSCALL - bt $TIF_SYSCALL_AUDIT,%edx - jc sysret_audit -#endif - /* - * We have a signal, or exit tracing or single-step. 
- * These all wind up with the iret return path anyway, - * so just join that path right now. - */ +int_ret_from_sys_call_fixup: FIXUP_TOP_OF_STACK %r11, -ARGOFFSET - jmp int_check_syscall_exit_work - -#ifdef CONFIG_AUDITSYSCALL - /* - * Return fast path for syscall audit. Call __audit_syscall_exit() - * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT - * masked off. - */ -sysret_audit: - movq RAX-ARGOFFSET(%rsp),%rsi /* second arg, syscall return value */ - cmpq $-MAX_ERRNO,%rsi /* is it -MAX_ERRNO? */ - setbe %al /* 1 if so, 0 if not */ - movzbl %al,%edi /* zero-extend that into %edi */ - call __audit_syscall_exit - movl $(_TIF_ALLWORK_MASK ~_TIF_SYSCALL_AUDIT),%edi - jmp sysret_check -#endif /* CONFIG_AUDITSYSCALL */ + jmp int_ret_from_sys_call Ok, dumb question: what is replacing that audit functionality now? Or are we remaining with the single path down syscall_trace_leave()? AFAICT, the intention is to direct everything to int_ret_from_sys_call after the syscall has been done so that paths can get simplified. Which is a good thing, AFAICT. Exactly. int_ret_from_sys_call calls into the audit code in syscall_trace_leave. IIRC the current code sometimes calls the audit exit hook twice, which seems like a bug to me. --Andy Acked-by: Borislav Petkov b...@suse.de -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
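The whole simplification above boils down to a single mask test at syscall exit: if any work flag is set, take the unified slow path (which also runs the audit hook via syscall_trace_leave()); otherwise sysret directly. A user-space sketch of that decision — the flag values below are invented for illustration, the real `_TIF_*` bits live in arch/x86/include/asm/thread_info.h:

```c
#include <assert.h>

/* Model of the simplified ret_from_sys_call test in the patch: one
 * TIF_ALLWORK mask test decides between the sysret fast path and the
 * unified int_ret_from_sys_call slow path.  Flag values are made up. */
#define TIF_NEED_RESCHED   (1u << 0)
#define TIF_SIGPENDING     (1u << 1)
#define TIF_SYSCALL_AUDIT  (1u << 2)
#define TIF_ALLWORK_MASK   (TIF_NEED_RESCHED | TIF_SIGPENDING | TIF_SYSCALL_AUDIT)

enum exit_path { FAST_SYSRET, SLOW_IRET };

static enum exit_path syscall_exit_path(unsigned int ti_flags)
{
    /* testl $_TIF_ALLWORK_MASK, TI_flags ; jnz int_ret_from_sys_call_fixup */
    return (ti_flags & TIF_ALLWORK_MASK) ? SLOW_IRET : FAST_SYSRET;
}
```

With this shape, audit, signals, and rescheduling all funnel through one path, which is why the dedicated sysret_audit/sysret_careful labels can go away.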
[PATCH] kvm: update_memslots: clean flags for invalid memslots
Indeed, any invalid memslots should be new->npages = 0, new->base_gfn = 0 and new->flags = 0 at the same time. Signed-off-by: Tiejun Chen tiejun.c...@intel.com --- Paolo, This is just a small cleanup to follow up commit efbeec7098ee, "fix sorting of memslots with base_gfn == 0". Tiejun virt/kvm/kvm_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1cc6e2e..369c759 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -673,6 +673,7 @@ static void update_memslots(struct kvm_memslots *slots, if (!new->npages) { WARN_ON(!mslots[i].npages); new->base_gfn = 0; + new->flags = 0; if (mslots[i].npages) slots->used_slots--; } else { -- 1.9.1
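The invariant the one-liner enforces is easiest to see in isolation. Below is a simplified user-space sketch — `struct memslot` and `invalidate_slot()` are stand-ins for the real `struct kvm_memory_slot` and the `update_memslots()` branch being patched, not the actual KVM code:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified view of a KVM memslot; the real struct kvm_memory_slot
 * has more fields.  The patch's point: a deleted ("invalid") slot must
 * have npages, base_gfn AND flags zeroed together, not just the first
 * two, so stale flags can't leak into later comparisons. */
struct memslot {
    uint64_t base_gfn;
    uint64_t npages;
    uint32_t flags;
};

static void invalidate_slot(struct memslot *s)
{
    s->npages = 0;
    s->base_gfn = 0;
    s->flags = 0;   /* the line the patch adds */
}
```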
Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
On 09/01/15 11:19, Christoffer Dall wrote: On Thu, Jan 08, 2015 at 11:59:07AM +, Marc Zyngier wrote: Trying to emulate the behaviour of set/way cache ops is fairly pointless, as there are too many ways we can end-up missing stuff. Also, there is some system caches out there that simply ignore set/way operations. So instead of trying to implement them, let's convert it to VA ops, and use them as a way to re-enable the trapping of VM ops. That way, we can detect the point when the MMU/caches are turned off, and do a full VM flush (which is what the guest was trying to do anyway). This allows a 32bit zImage to boot on the APM thingy, and will probably help bootloaders in general. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/kvm_host.h | 3 -- arch/arm/kvm/arm.c| 10 arch/arm/kvm/coproc.c | 90 + arch/arm64/include/asm/kvm_host.h | 3 -- arch/arm64/kvm/sys_regs.c | 102 ++ 5 files changed, 116 insertions(+), 92 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 254e065..04b4ea0 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -125,9 +125,6 @@ struct kvm_vcpu_arch { * Anything that is not used directly from assembly code goes * here. */ - /* dcache set/way operation pending */ - int last_pcpu; - cpumask_t require_dcache_flush; /* Don't run the guest on this vcpu */ bool pause; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 2d6d910..0b0d58a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) vcpu-cpu = cpu; vcpu-arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state); - /* - * Check whether this vcpu requires the cache to be flushed on - * this physical CPU. This is a consequence of doing dcache - * operations by set/way on this vcpu. We do it here to be in - * a non-preemptible section. 
- */ - if (cpumask_test_and_clear_cpu(cpu, vcpu-arch.require_dcache_flush)) - flush_cache_all(); /* We'd really want v7_flush_dcache_all() */ - kvm_arm_set_running_vcpu(vcpu); } @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) ret = kvm_call_hyp(__kvm_vcpu_run, vcpu); vcpu-mode = OUTSIDE_GUEST_MODE; - vcpu-arch.last_pcpu = smp_processor_id(); kvm_guest_exit(); trace_kvm_exit(*vcpu_pc(vcpu)); /* diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c index 7928dbd..3d352a5 100644 --- a/arch/arm/kvm/coproc.c +++ b/arch/arm/kvm/coproc.c @@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu, return true; } -/* See note at ARM ARM B1.14.4 */ +/* + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized). + * + * Main problems: + * - S/W ops are local to a CPU (not broadcast) + * - We have line migration behind our back (speculation) + * - System caches don't support S/W at all (damn!) + * + * In the face of the above, the best we can do is to try and convert + * S/W ops to VA ops. Because the guest is not allowed to infer the + * S/W to PA mapping, it can only use S/W to nuke the whole cache, + * which is a rather good thing for us. + * + * Also, it is only used when turning caches on/off (The expected + * usage of the cache maintenance instructions that operate by set/way + * is associated with the cache maintenance instructions associated + * with the powerdown and powerup of caches, if this is required by + * the implementation.). + * + * We use the following policy: + * + * - If we trap a S/W operation, we enable VM trapping to detect + * caches being turned on/off. + * + * - If the caches have already been turned off when doing the S/W op, + * we nuke the whole VM cache. + * + * - We flush the cache on both caches being turned on and off. + * + * - Once the caches are enabled, we stop trapping VM ops. 
+ */ static bool access_dcsw(struct kvm_vcpu *vcpu, const struct coproc_params *p, const struct coproc_reg *r) { - unsigned long val; - int cpu; - if (!p-is_write) return read_from_write_only(vcpu, p); - cpu = get_cpu(); - - cpumask_setall(vcpu-arch.require_dcache_flush); - cpumask_clear_cpu(cpu, vcpu-arch.require_dcache_flush); - - /* If we were already preempted, take the long way around */ - if (cpu != vcpu-arch.last_pcpu) { - flush_cache_all(); - goto done; - } - - val = *vcpu_reg(vcpu, p-Rt1); - - switch (p-CRm) { - case 6: /* Upgrade DCISW
Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
On Fri, Jan 09, 2015 at 11:38:19AM +, Marc Zyngier wrote: [...] @@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu, const struct coproc_params *p, const struct coproc_reg *r) { + bool was_enabled = vcpu_has_cache_enabled(vcpu); + bool now_enabled; + access_vm_reg(vcpu, p, r); - if (vcpu_has_cache_enabled(vcpu)) { /* MMU+Caches enabled? */ - vcpu-arch.hcr = ~HCR_TVM; + now_enabled = vcpu_has_cache_enabled(vcpu); + + /* + * If switching the MMU+caches on, need to invalidate the caches. + * If switching it off, need to clean the caches. + * Clean + invalidate does the trick always. couldn't a clean here then be writing bogus to RAM on an initial boot of the guest? It shouldn't. We're supposed to start a VM with a fully invalid cache (what's why we call coherent_cache_guest_page). And this is no different from what we had before, as an invalidate by set/way was upgraded to clean+invalidate. right, that makes sense. Thanks, -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
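The was_enabled/now_enabled dance quoted above can be modeled in a few lines. This is a hedged user-space sketch of the policy, not the real access_sctlr(): `flush_vm()` and the bit layout stand in for KVM's stage2_flush_vm() and the actual SCTLR encoding:

```c
#include <assert.h>
#include <stdbool.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* D-cache enable */

static bool cache_enabled(unsigned int sctlr)
{
    return (sctlr & (SCTLR_M | SCTLR_C)) == (SCTLR_M | SCTLR_C);
}

static int flushes;
static void flush_vm(void) { flushes++; }  /* stands in for a full VM clean+invalidate */

/* Emulate a trapped SCTLR write; returns whether VM-op trapping
 * should stay enabled (i.e. until the caches are on). */
static bool handle_sctlr_write(unsigned int *sctlr, unsigned int new_val)
{
    bool was_enabled = cache_enabled(*sctlr);
    *sctlr = new_val;
    bool now_enabled = cache_enabled(*sctlr);

    /* Flush on any on<->off transition: clean+invalidate covers both
     * directions, as the comment in the patch argues. */
    if (was_enabled != now_enabled)
        flush_vm();

    return !now_enabled;
}
```

This mirrors the stated policy: trap while caches are off, flush on every transition, and stop trapping once the guest runs with caches enabled.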
Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
On Thu, Jan 08, 2015 at 11:59:08AM +, Marc Zyngier wrote: Let's assume a guest has created an uncached mapping, and written to that page. Let's also assume that the host uses a cache-coherent IO subsystem. Let's finally assume that the host is under memory pressure and starts to swap things out. Before this uncached page is evicted, we need to make sure it gets invalidated, or the IO subsystem is going to swap out the cached view, loosing the data that has been written there. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/kvm_mmu.h | 31 +++ arch/arm/kvm/mmu.c | 46 +++- arch/arm64/include/asm/kvm_mmu.h | 18 3 files changed, 80 insertions(+), 15 deletions(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 63e0ecc..7ceb836 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -44,6 +44,7 @@ #ifndef __ASSEMBLY__ +#include linux/highmem.h #include asm/cacheflush.h #include asm/pgalloc.h @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva, #define kvm_virt_to_phys(x) virt_to_idmap((unsigned long)(x)) +static inline void __kvm_flush_dcache_pte(pte_t pte) +{ + void *va = kmap_atomic(pte_page(pte)); + + kvm_flush_dcache_to_poc(va, PAGE_SIZE); + + kunmap_atomic(va); +} + +static inline void __kvm_flush_dcache_pmd(pmd_t pmd) +{ + unsigned long size = PMD_SIZE; + pfn_t pfn = pmd_pfn(pmd); + + while (size) { + void *va = kmap_atomic_pfn(pfn); + + kvm_flush_dcache_to_poc(va, PAGE_SIZE); + + pfn++; + size -= PAGE_SIZE; + + kunmap_atomic(va); + } +} + +static inline void __kvm_flush_dcache_pud(pud_t pud) +{ +} + void stage2_flush_vm(struct kvm *kvm); #endif /* !__ASSEMBLY__ */ diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 1dc9778..1f5b793 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa); } 
+static void kvm_flush_dcache_pte(pte_t pte) +{ + __kvm_flush_dcache_pte(pte); +} + +static void kvm_flush_dcache_pmd(pmd_t pmd) +{ + __kvm_flush_dcache_pmd(pmd); +} + +static void kvm_flush_dcache_pud(pud_t pud) +{ + __kvm_flush_dcache_pud(pud); +} + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, int min, int max) { @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd, start_pte = pte = pte_offset_kernel(pmd, addr); do { if (!pte_none(*pte)) { + pte_t old_pte = *pte; kvm_set_pte(pte, __pte(0)); - put_page(virt_to_page(pte)); was this a bug beforehand that we released the page before we flushed the TLB? kvm_tlb_flush_vmid_ipa(kvm, addr); + if ((pte_val(old_pte) PAGE_S2_DEVICE) != PAGE_S2_DEVICE) + kvm_flush_dcache_pte(old_pte); this is confusing me: We are only flushing the cache for cached stage-2 mappings? Weren't we trying to flush the cache for uncached mappings? (we obviously also need flush a cached stage-2 mapping but where the guest is mapping it as uncached, but we don't know that easily). Am I missing something completely here? In any case we probably need a comment here explaining this. + put_page(virt_to_page(pte)); } } while (pte++, addr += PAGE_SIZE, addr != end); @@ -149,8 +167,10 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud, next = kvm_pmd_addr_end(addr, end); if (!pmd_none(*pmd)) { if (kvm_pmd_huge(*pmd)) { + pmd_t old_pmd = *pmd; pmd_clear(pmd); kvm_tlb_flush_vmid_ipa(kvm, addr); + kvm_flush_dcache_pmd(old_pmd); put_page(virt_to_page(pmd)); } else { unmap_ptes(kvm, pmd, addr, next); @@ -173,8 +193,10 @@ static void unmap_puds(struct kvm *kvm, pgd_t *pgd, next = kvm_pud_addr_end(addr, end); if (!pud_none(*pud)) { if (pud_huge(*pud)) { + pud_t old_pud = *pud; pud_clear(pud); kvm_tlb_flush_vmid_ipa(kvm, addr); + kvm_flush_dcache_pud(old_pud); put_page(virt_to_page(pud)); } else {
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On 09/01/15 13:03, Peter Maydell wrote: On 9 January 2015 at 12:50, Christoffer Dall christoffer.d...@linaro.org wrote: On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote: If this is the first instruction in the guest (ie we've just (warm) reset the VM and are running the kernel as loaded into the guest by QEMU/kvmtool) then the guest can't have invalidated the icache, and QEMU can't do the invalidate because it doesn't have the vaddr and VMID of the guest. The guest must clean its icache before turning on the MMU, no? Yes, but to execute the clean icache insn inside the guest, the guest is executing instructions, so we'd better not have stale info for that code... But we never start a guest with caches on. It has to enable them on its own. Whenever we reuse a VMID (rollover), we flush the entire icache for that vmid. When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't mean we get a new VMID for it, though, does it? I thought that what causes the icache flush to happen for the reset guest is that we unmap all of stage 2 and then fault it back in, via this code. That works for PIPT (we flush the range) and for VIPT (we do a full icache flush), but at the moment for VIVT ASID tagged we assume we can do nothing, and I don't think that's right for this case (though it is right for faulted because page was swapped out and OK for page was written by DMA). When we reset the guest, we also turn both its Icache off. Before turning it back on, the guest has to invalidate it (the ARM ARM doesn't seem to define the state of the cache out of reset). M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
On 09/01/15 12:30, Christoffer Dall wrote: On Thu, Jan 08, 2015 at 11:59:08AM +, Marc Zyngier wrote: Let's assume a guest has created an uncached mapping, and written to that page. Let's also assume that the host uses a cache-coherent IO subsystem. Let's finally assume that the host is under memory pressure and starts to swap things out. Before this uncached page is evicted, we need to make sure it gets invalidated, or the IO subsystem is going to swap out the cached view, loosing the data that has been written there. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/kvm_mmu.h | 31 +++ arch/arm/kvm/mmu.c | 46 +++- arch/arm64/include/asm/kvm_mmu.h | 18 3 files changed, 80 insertions(+), 15 deletions(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 63e0ecc..7ceb836 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -44,6 +44,7 @@ #ifndef __ASSEMBLY__ +#include linux/highmem.h #include asm/cacheflush.h #include asm/pgalloc.h @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva, #define kvm_virt_to_phys(x) virt_to_idmap((unsigned long)(x)) +static inline void __kvm_flush_dcache_pte(pte_t pte) +{ +void *va = kmap_atomic(pte_page(pte)); + +kvm_flush_dcache_to_poc(va, PAGE_SIZE); + +kunmap_atomic(va); +} + +static inline void __kvm_flush_dcache_pmd(pmd_t pmd) +{ +unsigned long size = PMD_SIZE; +pfn_t pfn = pmd_pfn(pmd); + +while (size) { +void *va = kmap_atomic_pfn(pfn); + +kvm_flush_dcache_to_poc(va, PAGE_SIZE); + +pfn++; +size -= PAGE_SIZE; + +kunmap_atomic(va); +} +} + +static inline void __kvm_flush_dcache_pud(pud_t pud) +{ +} + void stage2_flush_vm(struct kvm *kvm); #endif /* !__ASSEMBLY__ */ diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 1dc9778..1f5b793 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) 
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa); } +static void kvm_flush_dcache_pte(pte_t pte) +{ +__kvm_flush_dcache_pte(pte); +} + +static void kvm_flush_dcache_pmd(pmd_t pmd) +{ +__kvm_flush_dcache_pmd(pmd); +} + +static void kvm_flush_dcache_pud(pud_t pud) +{ +__kvm_flush_dcache_pud(pud); +} + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, int min, int max) { @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd, start_pte = pte = pte_offset_kernel(pmd, addr); do { if (!pte_none(*pte)) { +pte_t old_pte = *pte; kvm_set_pte(pte, __pte(0)); -put_page(virt_to_page(pte)); was this a bug beforehand that we released the page before we flushed the TLB? I don't think so. TLB maintenance doesn't require the mapping to exist in the page tables (while the cache maintenance do). kvm_tlb_flush_vmid_ipa(kvm, addr); +if ((pte_val(old_pte) PAGE_S2_DEVICE) != PAGE_S2_DEVICE) +kvm_flush_dcache_pte(old_pte); this is confusing me: We are only flushing the cache for cached stage-2 mappings? Weren't we trying to flush the cache for uncached mappings? (we obviously also need flush a cached stage-2 mapping but where the guest is mapping it as uncached, but we don't know that easily). Am I missing something completely here? I think you must be misreading something: - we want to invalidate mappings because the guest may have created an uncached mapping - as we cannot know about the guest's uncached mappings, we flush things unconditionally (basically anything that is RAM). - we avoid flushing devices because it is pointless (and the kernel doesn't have a linear mapping for those). Is that clearer? Or is it me that is misunderstanding your concerns (entirely possible, I'm not at my best right now...). Thanks, M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
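The ordering being debated in unmap_ptes() — clear the PTE, flush the TLB, then flush the D-cache (skipping device mappings, which have no kernel linear mapping), and only then put_page() — can be captured as an event trace. A hedged sketch; the flag value and helpers are stand-ins for the real kernel symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_S2_DEVICE 0x4u   /* made-up flag bit for this sketch */

enum { CLEAR, TLB_FLUSH, DCACHE_FLUSH, PUT_PAGE };
static int trace[16], ntrace;

/* Model of the per-PTE teardown order in the patch. */
static void unmap_pte(uint64_t pte_val, bool present)
{
    if (!present)
        return;
    trace[ntrace++] = CLEAR;            /* kvm_set_pte(pte, __pte(0)) */
    trace[ntrace++] = TLB_FLUSH;        /* kvm_tlb_flush_vmid_ipa(): TLB
                                         * maintenance doesn't need the
                                         * mapping, so it can go first */
    if ((pte_val & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
        trace[ntrace++] = DCACHE_FLUSH; /* kvm_flush_dcache_pte(old_pte):
                                         * RAM only, devices skipped */
    trace[ntrace++] = PUT_PAGE;         /* put_page() moved after the flush */
}
```

The trace makes Marc's answer concrete: the cache flush needs the page to still be held (hence put_page() last), while the TLB flush does not.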
[PATCH] KVM: x86: amend APIC lowest priority arbitration
Lowest priority should take the task priority into account. SDM 10.6.2.4 Lowest Priority Delivery Mode. (Too long to quote; second and last paragraphs are relevant.) Before this patch, we strived to have the same amount of handled lowest-priority interrupts on all VCPUs. This is only a complication, but kept for compatibility. Real modern Intels can't send lowest priority IPIs and the chipset directs external ones using processors' TPR. AMD still has rough edges. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/lapic.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index a688fbffb34e..5b9d8c589bba 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -833,7 +833,15 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2) { - return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio; + /* XXX: AMD (2:16.6.2 Lowest Priority Messages and Arbitration) +* - uses the APR register (which also considers ISR and IRR), +* - chooses the highest APIC ID when APRs are identical, +* - and allows a focus processor. +* XXX: pseudo-balancing with apic_arb_prio is a KVM-specific feature +*/ + int tpr = kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) - + kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI); + return tpr ? : vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio; } static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector) -- 2.2.0
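The semantics of the patched comparator — TPR decides first, and the KVM-specific balancing counter only breaks ties — can be sketched outside the kernel. This is an illustrative model, not the real kvm_apic_compare_prio():

```c
#include <assert.h>

/* Hypothetical model of the patched comparator: a negative result
 * means vcpu1 has the lower task priority and so receives the
 * lowest-priority interrupt. */
struct vcpu {
    int tpr;            /* value of the APIC_TASKPRI register */
    int apic_arb_prio;  /* KVM's pseudo-balancing counter */
};

static int compare_prio(const struct vcpu *v1, const struct vcpu *v2)
{
    int tpr = v1->tpr - v2->tpr;
    return tpr ? tpr : v1->apic_arb_prio - v2->apic_arb_prio;
}
```

The key behavioral change: before the patch, the counter alone decided, so a vCPU with a high TPR (asking not to be interrupted) could still win the arbitration.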
RE: [v3 00/26] Add VT-d Posted-Interrupts support
-----Original Message----- From: j...@8bytes.org [mailto:j...@8bytes.org] Sent: Friday, January 09, 2015 8:46 PM To: Wu, Feng Cc: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; alex.william...@redhat.com; jiang@linux.intel.com; eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org Subject: Re: [v3 00/26] Add VT-d Posted-Interrupts support Hi Feng, On Tue, Jan 06, 2015 at 01:10:19AM +, Wu, Feng wrote: Ping... Hi Joerg David, Could you please have a look at the IOMMU part of this series (patch 02 - 04, patch 06 - 09 , patch 26)? Hi Thomas, Ingo, Peter, Could you please have a look at this series, especially for patch 01, 05, 21? I fear this conflicts somewhat with the irq-domain patches from Jiang Liu. Once we worked this out (should happen soon) I'll have a look at the VT-d posted interrupt patches. Regards, Joerg Thanks a lot for your feedback on this. In fact, my patches are based on Jiang Liu's irq-domain patches and have been discussed with Jiang offline. So I don't think they will conflict with Jiang's work. Anyway, any comments are welcome! Thanks, Feng
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On Thu, Jan 08, 2015 at 11:59:09AM +, Marc Zyngier wrote: When handling a fault in stage-2, we need to resync I$ and D$, just to be sure we don't leave any old cache line behind. That's very good, except that we do so using the *user* address. Under heavy load (swapping like crazy), we may end up in a situation where the page gets mapped in stage-2 while being unmapped from userspace by another CPU. At that point, the DC/IC instructions can generate a fault, which we handle with kvm->mmu_lock held. The box quickly deadlocks, user is unhappy. Instead, perform this invalidation through the kernel mapping, which is guaranteed to be present. The box is much happier, and so am I. Signed-off-by: Marc Zyngier marc.zyng...@arm.com This looks good to me! Thanks, -Christoffer
[virtio] dummy device
Hi, I'm trying to add a new virtio device. I've managed to make qemu connect my virtio device to the PCI bus. I'm now trying to write a dummy driver in the guest's kernel. My lspci binary isn't very verbose; for example, the output of lspci and lspci -n is the same:

00:00.0 Class 0604: 1957:0030
00:01.0 Class 0200: 1af4:1000
00:02.0 Class 0b40: 1af4:100d

(That's all I get from them.) My device is on the last line. I have also tried to check whether any driver is doing something with my device (at least probing it):

ls /sys/bus/pci/devices/\:00\:02.0/driver/module/drivers/

It outputs: pci:virtio-pci. Should I see the name of my device/driver module here, or is this what I would see for any device on an emulated machine with a virtio-PCI-based bus? I currently just defined the struct virtio_driver with its related things, and called module_virtio_driver() and MODULE_DEVICE_TABLE(). Is there anything left to do in order to have a functional dummy device? Is there a better way to see the name of the module associated with a PCI device (virtio device)?
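For context on why the sysfs path shows virtio-pci rather than the dummy driver: the virtio-pci transport owns the PCI device, and it then creates a virtio device that is matched against each registered virtio driver's id_table. A hedged user-space model of that matching step — the structs and IDs here are illustrative stand-ins for the kernel's struct virtio_device_id machinery, not the questioner's real device:

```c
#include <assert.h>
#include <stdint.h>

struct virtio_device_id { uint32_t device; uint32_t vendor; };
#define VIRTIO_DEV_ANY_ID 0xffffffffu

struct virtio_driver {
    const char *name;
    const struct virtio_device_id *id_table; /* zero-terminated */
};

/* Model of the bus match that register_virtio_driver() (via
 * module_virtio_driver) hooks the driver into: the driver claims a
 * virtio device when an id_table entry matches. */
static int virtio_id_match(const struct virtio_device_id *dev,
                           const struct virtio_driver *drv)
{
    const struct virtio_device_id *id;

    for (id = drv->id_table; id->device; id++)
        if (id->device == dev->device &&
            (id->vendor == VIRTIO_DEV_ANY_ID || id->vendor == dev->vendor))
            return 1;
    return 0;
}
```

So a dummy driver additionally needs an id_table whose device ID matches the virtio device ID the QEMU device exposes; otherwise the probe callback is never called even though virtio-pci has claimed the PCI function.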
Re: [v3 00/26] Add VT-d Posted-Interrupts support
Hi Feng, On Tue, Jan 06, 2015 at 01:10:19AM +, Wu, Feng wrote: Ping... Hi Joerg David, Could you please have a look at the IOMMU part of this series (patch 02 - 04, patch 06 - 09 , patch 26)? Hi Thomas, Ingo, Peter, Could you please have a look at this series, especially for patch 01, 05, 21? I fear this conflicts somewhat with the irq-domain patches from Jiang Liu. Once we worked this out (should happen soon) I'll have a look at the VT-d posted interrupt patches. Regards, Joerg
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote: On 8 January 2015 at 15:06, Marc Zyngier marc.zyng...@arm.com wrote: On 08/01/15 13:16, Peter Maydell wrote: ASID cached VIVT icaches are also VMID tagged. It is thus impossible for stale cache lines to come with a new page. And if by synchronizing the caches you obtain a different instruction stream, it means you've restored the wrong page. ...is that true even if the dirty data in the dcache comes from the userspace process doing DMA or writing the initial boot image or whatever? We perform this on a page that is being brought into stage-2. Two cases: - This is a page mapped for the first time: the icache should be invalid for this page (the guest should have invalidated it in the first place), If this is the first instruction in the guest (ie we've just (warm) reset the VM and are running the kernel as loaded into the guest by QEMU/kvmtool) then the guest can't have invalidated the icache, and QEMU can't do the invalidate because it doesn't have the vaddr and VMID of the guest. The guest must clean its icache before turning on the MMU, no? Whenever we reuse a VMID (rollover), we flush the entire icache for that vmid. -Christoffer
Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
On 9 January 2015 at 12:50, Christoffer Dall christoffer.d...@linaro.org wrote: On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote: If this is the first instruction in the guest (ie we've just (warm) reset the VM and are running the kernel as loaded into the guest by QEMU/kvmtool) then the guest can't have invalidated the icache, and QEMU can't do the invalidate because it doesn't have the vaddr and VMID of the guest. The guest must clean its icache before turning on the MMU, no? Yes, but to execute the clean icache insn inside the guest, the guest is executing instructions, so we'd better not have stale info for that code... Whenever we reuse a VMID (rollover), we flush the entire icache for that vmid. When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't mean we get a new VMID for it, though, does it? I thought that what causes the icache flush to happen for the reset guest is that we unmap all of stage 2 and then fault it back in, via this code. That works for PIPT (we flush the range) and for VIPT (we do a full icache flush), but at the moment for VIVT ASID tagged we assume we can do nothing, and I don't think that's right for this case (though it is right for faulted because page was swapped out and OK for page was written by DMA). -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote: + /* + * Try to use SYSRET instead of IRET if we're returning to + * a completely clean 64-bit userspace context. + */ + movq (RCX-R11)(%rsp), %rcx + cmpq %rcx,(RIP-R11)(%rsp) /* RCX == RIP */ + jne opportunistic_sysret_failed + + /* + * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP + * in kernel space. This essentially lets the user take over + * the kernel, since userspace controls RSP. It's not worth + * testing for canonicalness exactly -- this check detects any + * of the 17 high bits set, which is true for non-canonical + * or kernel addresses. (This will pessimize vsyscall=native. + * Big deal.) + */ + shr $47, %rcx + jnz opportunistic_sysret_failed + + cmpq $__USER_CS,(CS-R11)(%rsp) /* CS must match SYSRET */ + jne opportunistic_sysret_failed + + movq (R11-R11)(%rsp), %r11 + cmpq %r11,(EFLAGS-R11)(%rsp)/* R11 == RFLAGS */ + jne opportunistic_sysret_failed + + testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */ + jnz opportunistic_sysret_failed + + /* nothing to check for RSP */ + + cmpq $__USER_DS,(SS-R11)(%rsp) /* SS must match SYSRET */ + jne opportunistic_sysret_failed Btw, Denys' R11-ARGOFFSET fix makes sense here too - using ARGOFFSET instead of R11 would make this here clearer. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
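The "shr $47" trick quoted above deserves a concrete restatement: a canonical x86_64 user address has bits 63..47 all clear, so shifting right by 47 must leave zero; any residue means a kernel or non-canonical RIP, for which SYSRET would be unsafe. A small model of just that check (a sketch, not the actual entry code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the canonicalness test: sysret is allowed only if the
 * return RIP has none of the 17 high bits set.  This deliberately
 * rejects the kernel half of the address space too, which is exactly
 * what the entry-code comment says it wants. */
static int sysret_allowed_rip(uint64_t rip)
{
    return (rip >> 47) == 0;
}
```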
Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
On Thu, Jan 08, 2015 at 11:59:07AM +, Marc Zyngier wrote:

Trying to emulate the behaviour of set/way cache ops is fairly
pointless, as there are too many ways we can end up missing stuff.
Also, there are some system caches out there that simply ignore
set/way operations.

So instead of trying to implement them, let's convert it to VA ops,
and use them as a way to re-enable the trapping of VM ops. That way,
we can detect the point when the MMU/caches are turned off, and do a
full VM flush (which is what the guest was trying to do anyway).

This allows a 32bit zImage to boot on the APM thingy, and will
probably help bootloaders in general.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/kvm_host.h   |   3 --
 arch/arm/kvm/arm.c                |  10 ----
 arch/arm/kvm/coproc.c             |  90 ++++++----
 arch/arm64/include/asm/kvm_host.h |   3 --
 arch/arm64/kvm/sys_regs.c         | 102 ++++++----
 5 files changed, 116 insertions(+), 92 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 254e065..04b4ea0 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
 	 * Anything that is not used directly from assembly code goes
 	 * here.
 	 */
-	/* dcache set/way operation pending */
-	int last_pcpu;
-	cpumask_t require_dcache_flush;
 
 	/* Don't run the guest on this vcpu */
 	bool pause;

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2d6d910..0b0d58a 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
 
-	/*
-	 * Check whether this vcpu requires the cache to be flushed on
-	 * this physical CPU. This is a consequence of doing dcache
-	 * operations by set/way on this vcpu. We do it here to be in
-	 * a non-preemptible section.
-	 */
-	if (cpumask_test_and_clear_cpu(cpu, vcpu->arch.require_dcache_flush))
-		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
-
 	kvm_arm_set_running_vcpu(vcpu);
 }

@@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		vcpu->arch.last_pcpu = smp_processor_id();
 		kvm_guest_exit();
 		trace_kvm_exit(*vcpu_pc(vcpu));
 		/*

diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 7928dbd..3d352a5 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-/* See note at ARM ARM B1.14.4 */
+/*
+ * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
+ *
+ * Main problems:
+ * - S/W ops are local to a CPU (not broadcast)
+ * - We have line migration behind our back (speculation)
+ * - System caches don't support S/W at all (damn!)
+ *
+ * In the face of the above, the best we can do is to try and convert
+ * S/W ops to VA ops. Because the guest is not allowed to infer the
+ * S/W to PA mapping, it can only use S/W to nuke the whole cache,
+ * which is a rather good thing for us.
+ *
+ * Also, it is only used when turning caches on/off ("The expected
+ * usage of the cache maintenance instructions that operate by set/way
+ * is associated with the cache maintenance instructions associated
+ * with the powerdown and powerup of caches, if this is required by
+ * the implementation.").
+ *
+ * We use the following policy:
+ *
+ * - If we trap a S/W operation, we enable VM trapping to detect
+ *   caches being turned on/off.
+ *
+ * - If the caches have already been turned off when doing the S/W op,
+ *   we nuke the whole VM cache.
+ *
+ * - We flush the cache on both caches being turned on and off.
+ *
+ * - Once the caches are enabled, we stop trapping VM ops.
+ */
 static bool access_dcsw(struct kvm_vcpu *vcpu,
 			const struct coproc_params *p,
 			const struct coproc_reg *r)
 {
-	unsigned long val;
-	int cpu;
-
 	if (!p->is_write)
 		return read_from_write_only(vcpu, p);
 
-	cpu = get_cpu();
-
-	cpumask_setall(vcpu->arch.require_dcache_flush);
-	cpumask_clear_cpu(cpu, vcpu->arch.require_dcache_flush);
-
-	/* If we were already preempted, take the long way around */
-	if (cpu != vcpu->arch.last_pcpu) {
-		flush_cache_all();
-		goto done;
-	}
-
-	val = *vcpu_reg(vcpu, p->Rt1);
-
-	switch (p->CRm) {
-	case 6:			/* Upgrade DCISW to DCCISW, as per HCR.SWIO */
-
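The trapping policy described in the comment above can be sketched as a small state machine. This is a hypothetical, self-contained model for illustration only, not the kernel code: the names `vm_state`, `handle_set_way_op`, `handle_sctlr_write`, and `flush_whole_vm_cache` are all invented, and the real flush is a VA-based walk over the stage-2 page tables.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VM state modelling the policy described above. */
struct vm_state {
	bool caches_enabled;	/* guest MMU/caches currently on?         */
	bool trap_vm_ops;	/* are we trapping VM register writes?    */
	int full_flushes;	/* whole-VM cache flushes performed so far */
};

static void flush_whole_vm_cache(struct vm_state *vm)
{
	vm->full_flushes++;	/* stand-in for the real VA-based flush */
}

/* Guest executed a set/way cache op (DC ISW/CSW/CISW). */
static void handle_set_way_op(struct vm_state *vm)
{
	/* Re-enable trapping of VM ops so we can see caches toggling. */
	vm->trap_vm_ops = true;

	/* Caches already off: the guest wanted everything clean anyway. */
	if (!vm->caches_enabled)
		flush_whole_vm_cache(vm);
}

/* Guest wrote SCTLR; trapped because trap_vm_ops is set. */
static void handle_sctlr_write(struct vm_state *vm, bool caches_on)
{
	/* Flush on both the on->off and the off->on transition. */
	if (caches_on != vm->caches_enabled)
		flush_whole_vm_cache(vm);

	vm->caches_enabled = caches_on;

	/* Once caches are back on, stop trapping VM ops. */
	if (caches_on)
		vm->trap_vm_ops = false;
}
```

The point of the policy is visible in the model: the hypervisor only pays for full flushes at cache on/off transitions, and runs with no VM-op trapping at all in the common case of caches staying enabled.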
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
2014-12-12 23:14+0800, Feng Wu:
> This patch defines a new interface kvm_find_dest_vcpu() for VT-d PI,
> which can return the destination vCPU of an interrupt for guests.
>
> Since VT-d PI cannot handle broadcast/multicast interrupts, here we
> only handle Fixed and Lowest priority interrupts.
>
> The current method of handling guest lowest priority interrupts is to
> use a counter 'apic_arb_prio' for each vCPU: we choose the vCPU with
> the smallest 'apic_arb_prio' and then increase it by 1. However, for
> VT-d PI, we cannot re-use this, since we no longer have control over
> 'apic_arb_prio' with posted interrupt direct delivery by hardware.
>
> Here, we introduce a similar way with 'apic_arb_prio' to handle guest
> lowest priority interrupts when VT-d PI is used. Here is the idea:
> - Each vCPU has a counter 'round_robin_counter'.
> - When a guest sets an interrupt to lowest priority, we choose the
>   vCPU with the smallest 'round_robin_counter' as the destination,
>   then increase it.

There are two points relevant to this patch in new KVM's implementation
(KVM: x86: amend APIC lowest priority arbitration,
https://lkml.org/lkml/2015/1/9/362):
1) lowest priority depends on TPR
2) there is no need for balancing

(1) has to be considered with PI as well.
I kept (2) to avoid whining from people building on that behaviour, but
lowest priority backed by PI could be transparent without it.

Patch below removes the balancing, but I am not sure this is a price we
allowed ourselves to pay ... what are your opinions?

---8<---
KVM: x86: don't balance lowest priority interrupts

Balancing is not mandated by the specification and real hardware most
likely doesn't do it. We break backward compatibility to allow
optimizations. (Posted interrupts can deliver to only one fixed
destination.)
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/lapic.c            | 8 ++------
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 97a5dd0222c8..aa4bd8286232 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -370,7 +370,6 @@ struct kvm_vcpu_arch {
 	u64 apic_base;
 	struct kvm_lapic *apic;		/* kernel irqchip context */
 	unsigned long apic_attention;
-	int32_t apic_arb_prio;
 	int mp_state;
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5b9d8c589bba..eb85af8e8fc0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -749,7 +749,6 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 				  trig_mode, vector);
 	switch (delivery_mode) {
 	case APIC_DM_LOWEST:
-		vcpu->arch.apic_arb_prio++;
 	case APIC_DM_FIXED:
 		/* FIXME add logic for vcpu on reset */
 		if (unlikely(!apic_enabled(apic)))
@@ -837,11 +836,9 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	 *  - uses the APR register (which also considers ISR and IRR),
 	 *  - chooses the highest APIC ID when APRs are identical,
 	 *  - and allows a focus processor.
-	 * XXX: pseudo-balancing with apic_arb_prio is a KVM-specific feature
 	 */
-	int tpr = kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
-		  kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
-	return tpr ? : vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+	return kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
+	       kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
 }
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
@@ -1595,7 +1592,6 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_eoi.msr_val = 0;
 	apic_update_ppr(apic);
 
-	vcpu->arch.apic_arb_prio = 0;
 	vcpu->arch.apic_attention = 0;
 	apic_debug("%s: vcpu=%p, id=%d, base_msr=
-- 
2.2.0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
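The round-robin selection described in the quoted proposal (pick the destination-set vCPU with the smallest 'round_robin_counter', then bump it) can be sketched as a standalone function. This is an illustrative model with invented names (`struct vcpu`, `find_dest_vcpu`, `in_dest_mask`), not the actual KVM or VT-d PI code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state; the field name mirrors the proposal. */
struct vcpu {
	int id;
	int round_robin_counter;
};

/*
 * Pick the destination for a lowest-priority interrupt: among the vCPUs
 * in the destination mask, choose the one with the smallest counter and
 * increment it, so repeated interrupts rotate across the set.
 */
static struct vcpu *find_dest_vcpu(struct vcpu *vcpus, size_t n,
				   const bool *in_dest_mask)
{
	struct vcpu *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!in_dest_mask[i])
			continue;
		/* Ties go to the lowest-indexed vCPU in the mask. */
		if (!best || vcpus[i].round_robin_counter <
			     best->round_robin_counter)
			best = &vcpus[i];
	}
	if (best)
		best->round_robin_counter++;
	return best;
}
```

Note how this differs from the 'apic_arb_prio' scheme it replaces: the counter is maintained purely in software at injection time, so it still works when the hardware delivers posted interrupts directly and KVM never sees the arbitration.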
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
On 09/01/2015 15:54, Radim Krčmář wrote:
> There are two points relevant to this patch in new KVM's implementation
> (KVM: x86: amend APIC lowest priority arbitration,
> https://lkml.org/lkml/2015/1/9/362):
> 1) lowest priority depends on TPR
> 2) there is no need for balancing
>
> (1) has to be considered with PI as well.

The chipset doesn't support it. :(

> I kept (2) to avoid whining from people building on that behaviour, but
> lowest priority backed by PI could be transparent without it.
>
> Patch below removes the balancing, but I am not sure this is a price we
> allowed ourselves to pay ... what are your opinions?

I wouldn't mind, but it requires a lot of benchmarking.

Paolo