Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels < 3.15
On 2014-05-06 20:35, gso...@gmail.com wrote:
> Signed-off-by: Gabriel Somlo <so...@cmu.edu>
> ---
> Jan,
>
> After today's pull from kvm, I also need this to build against my
> Fedora 20 kernel (3.13.10-200.fc20.x86_64).

Which version did you pull? Neither next nor queue contain this change yet.

> Thanks,
> Gabriel
>
>  x86/external-module-compat.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/x86/external-module-compat.h b/x86/external-module-compat.h
> index dde9463..e62486b 100644
> --- a/x86/external-module-compat.h
> +++ b/x86/external-module-compat.h
> @@ -1480,6 +1480,11 @@ static inline int __register_hotcpu_notifier(struct notifier_block *nb)
>  {
>  	return 0;
>  }
> +static inline int pci_enable_msix_exact(struct pci_dev *dev,
> +					struct msix_entry *entries, int nvec)
> +{
> +	return -ENOSYS;
> +}

First, this is not x86-specific, thus should go to external-module-compat-comm.h. Second, pci_enable_msix_exact was introduced with 3.14. And then I think we can (and should) provide the original content of this wrapper once it is needed, not a stub.

Right now I'm seeing different build problems:

http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio

Didn't look into details yet.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels < 3.15
On 07/05/2014 08:29, Jan Kiszka wrote:
>> Signed-off-by: Gabriel Somlo <so...@cmu.edu>
>> ---
>> Jan,
>>
>> After today's pull from kvm, I also need this to build against my
>> Fedora 20 kernel (3.13.10-200.fc20.x86_64).
>
> Which version did you pull? Neither next nor queue contain this change yet.

It went in via kvm/master. It was submitted before 3.15, but I didn't send it during the merge window because:

1) we had to wait for the function itself to get in via the PCI tree;
2) Marcelo didn't know about this patch.

Paolo
Re: KVM exit on UD interception
On 06/05/2014 22:11, Alexandru Duţu wrote:
> What is puzzling though is the fact that even if there is an instruction
> that can't be emulated by KVM, re-entering virtualized mode is still
> attempted. Re-entering will produce a double fault,

That would be a bug. It should inject the #UD into the guest.

Paolo
Re: [PATCH 1/6] KVM: PPC: Book3S PR: Ignore PMU SPRs
On Fri, May 02, 2014 at 10:35:09AM +0200, Alexander Graf wrote:
> On 05/01/2014 12:12 AM, Paul Mackerras wrote:
>> On Tue, Apr 29, 2014 at 06:17:37PM +0200, Alexander Graf wrote:
>>> When we expose a POWER8 CPU into the guest, it will start accessing
>>> PMU SPRs that we don't emulate. Just ignore accesses to them.
>>>
>>> Signed-off-by: Alexander Graf <ag...@suse.de>
>>
>> This patch is OK as it stands, but in fact the architecture says that
>> kernel accesses to unimplemented SPRs are mostly supposed to be no-ops
>> rather than causing a trap (mostly == excluding mtspr to 0 or mfspr
>> from 0, 4, 5 or 6). I have a patch to implement that, which I'll post.
>
> I think what we want is a flag similar to x86 where we can force ignore
> unknown SPRs, but leave it at triggering an interrupt as default.
>
> We usually have to be at least aware of unknown SPRs and check that not
> implementing them is ok for the guest. Debugging a program interrupt
> because of an unknown SPR is usually a lot easier than debugging a
> breaking guest because it was using the SPR as storage and we didn't
> back it by anything.

That has not been my experience, for accesses by the Linux kernel early in the boot process; usually we end up in a loop of ISI interrupts because the HPT isn't set up yet, with the original interrupt cause (and PC) lost long ago.

The Power ISA was changed in version 2.05 (POWER6) to specify that accesses to unimplemented SPRs by privileged code must be no-ops on server processors. Before that, the architecture allowed either an illegal instruction interrupt or boundedly undefined behaviour (which would include a no-op). So, if we're emulating POWERx for x >= 6, to be correct we need to do the no-op behaviour, even if we retain the option of making them trap for debugging purposes. Of course, at the moment we basically never look at what specific CPU we're emulating, but maybe now we have to.

Regards,
Paul.
Re: [PATCH] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
Paul Mackerras <pau...@samba.org> writes:
> On Sun, May 04, 2014 at 10:56:08PM +0530, Aneesh Kumar K.V wrote:
>> With debug option sleep inside atomic section checking enabled we get
>> the below WARN_ON during a PR KVM boot. This is because upstream now
>> has PREEMPT_COUNT enabled even if we have preempt disabled. Fix the
>> warning by adding preempt_disable/enable around floating point and
>> altivec enable.
>
> This worries me a bit. In this code:
>
>> 	if (msr & MSR_FP) {
>> +		preempt_disable();
>> 		enable_kernel_fp();
>> 		load_fp_state(&vcpu->arch.fp);
>> 		t->fp_save_area = &vcpu->arch.fp;
>> +		preempt_enable();
>
> What would happen if we actually did get preempted at this point?
> Wouldn't we lose the FP state we just loaded?

I was not sure we have got CONFIG_PREEMPT working with kvm. So I was not looking at the preempted case. But yes, if we have PREEMPT enabled it would break as per the current code.

> In other words, how come we're not already preempt-disabled at this point?

I don't see us disabling preempt in the code path. Are you suggesting that we should be preempt-disabled for the whole duration of kvmppc_handle_ext?

-aneesh
Re: [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch
On 07/05/2014 04:30, Bandan Das wrote:
>> +	if (unlikely(ctxt->_eip == fc->end)) {
>
> Is this really going to be unlikely?

Yes, it happens at most once per instruction and only for instructions that cross pages.

Paolo
Re: [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
On 07/05/2014 06:21, Bandan Das wrote:
>> +		if (rc != X86EMUL_CONTINUE)
>> +			goto done;
>> +	}
>> +
>> 	while (size--) {
>> -		if (unlikely(ctxt->_eip == fc->end)) {
>> -			rc = do_insn_fetch_bytes(ctxt);
>> -			if (rc != X86EMUL_CONTINUE)
>> -				return rc;
>> -		}
>> 		*dest++ = *src++;
>> 		ctxt->_eip++;
>> 		continue;
>>
>> @@ -4273,7 +4282,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>> 	if (insn_len > 0)
>> 		memcpy(ctxt->fetch.data, insn, insn_len);
>> 	else {
>> -		rc = do_insn_fetch_bytes(ctxt);
>> +		rc = do_insn_fetch_bytes(ctxt, 1);
>
> Is this saying that if the cache is full, then we fetch one more byte?

No, it is saying that if the instruction is being executed for the first time (we can execute it multiple times if we reenter a repeated instruction after a userspace exit) we try to get at least one byte from RIP. Most of the time, do_insn_fetch_bytes will fetch 15 bytes, which is the maximum length of an instruction. Passing op_size == 1 matches this change in do_insn_fetch_bytes:

-	if (unlikely(size == 0))
+	if (unlikely(size < op_size))

Paolo
Re: [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches
On 07/05/2014 06:36, Bandan Das wrote:
>> +	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
>
> For my own understanding, how does the __aligned help here?

Except for 16-byte SSE accesses, x86 doesn't distinguish aligned and unaligned accesses. You can read 4 bytes at 0x2345 and the processor will do the right thing. Still, it's better to tell the compiler what we're doing.

> Wouldn't that result in unaligned accesses that will actually impact
> performance?

These accesses *can* and will in fact be unaligned. For example, say you have

	mov ax,0x1234

which is 0xb8 0x34 0x12. When you read it into the fetch cache, you will have data[0] == 0xb8, data[1] == 0x34, data[2] == 0x12. Fetching the 16-bit immediate from data[1] will then be an unaligned access.

Paolo
Re: KVM Nested L2 guest startup problems
On 04/05/2014 18:33, Hu Yaohui wrote:
> I experienced a similar problem that was related to nested code having
> some bugs related to apicv and other new vmx features. For example, the
> code enabled posted interrupts to run L2 even when the feature was not
> exposed to L1 and L1 didn't use it. Try changing prepare_vmcs02 to force
> disabling posted_interrupts; code should look like:
>
> 	exec_control = vmcs12->pin_based_vm_exec_control;
> 	exec_control |= vmcs_config.pin_based_exec_ctrl;
> 	exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER |
> 			  PIN_BASED_POSTED_INTR);
> 	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);
>
> ... and also ...
>
> 	exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> 			  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
> 			  SECONDARY_EXEC_APIC_REGISTER_VIRT |
> 			  SECONDARY_EXEC_PAUSE_LOOP_EXITING);

PLE should be left enabled, I think. Apart from that, I'll change the suggestion into a patch. Thanks!

Paolo
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
Kim, Christoffer,

On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall <christoffer.d...@linaro.org> wrote:
> On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
>> Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
>>
>> Signed-off-by: Kim Phillips <kim.phill...@linaro.org>
>> ---
>>  arch/arm/kvm/mmu.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 7789857..a354610 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -652,6 +652,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>  	struct vm_area_struct *vma;
>>  	pfn_t pfn;
>> +	pgprot_t mem_type = PAGE_S2;
>>
>>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>>  	if (fault_status == FSC_PERM && !write_fault) {
>> @@ -702,6 +703,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	if (is_error_pfn(pfn))
>>  		return -EFAULT;
>>
>> +	if (kvm_is_mmio_pfn(pfn))
>> +		mem_type = PAGE_S2_DEVICE;
>> +
>>  	spin_lock(&kvm->mmu_lock);
>>  	if (mmu_notifier_retry(kvm, mmu_seq))
>>  		goto out_unlock;
>> @@ -709,7 +713,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>
>>  	hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>>  	if (hugetlb) {
>> -		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
>> +		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>>  		new_pmd = pmd_mkhuge(new_pmd);
>>  		if (writable) {
>>  			kvm_set_s2pmd_writable(&new_pmd);
>>  			kvm_set_pfn_dirty(pfn);
>> @@ -718,13 +722,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		}
>>  		coherent_icache_guest_page(kvm, hva & PMD_MASK, PMD_SIZE);
>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>  	} else {
>> -		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
>> +		pte_t new_pte = pfn_pte(pfn, mem_type);
>>  		if (writable) {
>>  			kvm_set_s2pte_writable(&new_pte);
>>  			kvm_set_pfn_dirty(pfn);
>>  		}
>>  		coherent_icache_guest_page(kvm, hva, PAGE_SIZE);
>> -		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>> +		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte,
>> +				     mem_type == PAGE_S2_DEVICE);
>>  	}
>> --
>> 1.9.1
>
> I think this looks reasonable.
>
> Acked-by: Christoffer Dall <christoffer.d...@linaro.org>

I feel like I'm missing some context here, and the commit message is way too terse for me to make sense of it. So far, we can only get into user_mem_abort on a Stage-2 fault (translation or permission) for memory. How can we suddenly get here for a *device* fault? Do we get a special kind of memslot?

I'm not saying the patch does anything wrong, but I'd like to understand the rationale behind it. On its own, it doesn't make much sense.

Thanks,

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 6 May 2014 19:38, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 6 May 2014 18:25, Marc Zyngier <marc.zyng...@arm.com> wrote:
>> On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon <will.dea...@arm.com> wrote:
>>> On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote:
>>>> +	reg.addr = (u64)&data;
>>>> +	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>>>> +		die("KVM_GET_ONE_REG failed (SCTLR_EL1)");
>>>> +
>>>> +	return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE;
>>>
>>> This rules out guests where userspace and kernelspace can run with
>>> different endianness. Whilst Linux doesn't currently do this, can we
>>> support it here? It all gets a bit hairy if the guest is using a
>>> stage-1 SMMU to let userspace play with a virtio device...
>>
>> Yeah, I suppose we could check either EE or E0 depending on the mode
>> when the access was made. We already have all the information, just
>> need to handle the case. I'll respin the series.
>
> How virtio implementations should determine their endianness is a spec
> question, I think; at any rate QEMU and kvmtool ought to agree on how
> it's done. I think the most recent suggestion on the QEMU mailing list
> (for PPC) is that we should care about the guest kernel endianness,
> but I don't know if anybody thought of the pass-through-to-userspace
> usecase...

Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers.

thanks
-- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07.05.2014 11:34, Peter Maydell <peter.mayd...@linaro.org> wrote:
> [...]
> Current opinion on the qemu-devel thread seems to be that we should just
> define that the endianness of the virtio device is the endianness of the
> guest kernel at the point where the guest triggers a reset of the virtio
> device by writing zero to the QueuePFN or Status registers.

Virtio by design has full access to guest physical memory. It doesn't route DMA via PCI. So user space drivers simply don't make sense here.

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell <peter.mayd...@linaro.org> wrote:
> [...]
> Current opinion on the qemu-devel thread seems to be that we should just
> define that the endianness of the virtio device is the endianness of the
> guest kernel at the point where the guest triggers a reset of the virtio
> device by writing zero to the QueuePFN or Status registers.

On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace?

On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access?

I'm inclined to think of userspace access as a valid use case.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07.05.2014 11:52, Marc Zyngier <marc.zyng...@arm.com> wrote:
> [...]
> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
> going to simply explode if the access comes from userspace?
>
> On AArch64, we can either select the kernel endianness, or userspace
> endianness. Are we going to go a different route just for the sake of
> enforcing kernel access?
>
> I'm inclined to think of userspace access as a valid use case.

It's not for virtio-legacy. It'll be much more productive to influence virtio-1.0 to not redo the same mistakes than enabling even more hackery with the legacy one.

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:42:54 am BST, Alexander Graf <ag...@suse.de> wrote:
> [...]
> Virtio by design has full access to guest physical memory. It doesn't
> route DMA via PCI. So user space drivers simply don't make sense here.

Huh? What if my guest has userspace using an idmap, with Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone?

Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 05/07/2014 11:57 AM, Marc Zyngier wrote:
> [...]
> Huh? What if my guest has userspace using an idmap, with Stage-1 MMU for
> isolation only (much like an MPU)? R-class guests, anyone?
>
> Agreed, this is not the general use case, but that doesn't seem to be
> completely unrealistic either.

Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really.

Of course, alternatively we can continue bikeshedding about this until everything becomes moot because we switched to virtio 1.0 ;).

Rusty / Michael, virtio 1.0 does go via normal DMA channels, right?

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 10:52, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell
> <peter.mayd...@linaro.org> wrote:
>> Current opinion on the qemu-devel thread seems to be that we should just
>> define that the endianness of the virtio device is the endianness of the
>> guest kernel at the point where the guest triggers a reset of the virtio
>> device by writing zero to the QueuePFN or Status registers.
>
> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
> going to simply explode if the access comes from userspace?

There's SCTLR.EE in AArch32, right?

> On AArch64, we can either select the kernel endianness, or userspace
> endianness. Are we going to go a different route just for the sake of
> enforcing kernel access? I'm inclined to think of userspace access as a
> valid use case.

I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions.

thanks
-- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:55:45 am BST, Alexander Graf <ag...@suse.de> wrote:
> [...]
>> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
>> going to simply explode if the access comes from userspace?
>>
>> On AArch64, we can either select the kernel endianness, or userspace
>> endianness. Are we going to go a different route just for the sake of
>> enforcing kernel access? I'm inclined to think of userspace access as a
>> valid use case.
>
> It's not for virtio-legacy. It'll be much more productive to influence
> virtio-1.0 to not redo the same mistakes than enabling even more hackery
> with the legacy one.

Are you saying I shouldn't improve an existing code base and implement a useful feature, and should instead work on some new fancy stuff for which there is no platform support, no kernel support, and not an official spec either?

Watch me.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07, 2014 at 12:11:13PM +0200, Alexander Graf wrote: On 05/07/2014 11:57 AM, Marc Zyngier wrote: On Wed, May 07 2014 at 10:42:54 am BST, Alexander Graf ag...@suse.de wrote: Am 07.05.2014 um 11:34 schrieb Peter Maydell peter.mayd...@linaro.org: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: +reg.addr = (u64)data; +if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg) 0) +die(KVM_GET_ONE_REG failed (SCTLR_EL1)); + +return (data SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endinness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace usecase... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero the QueuePFN or Status registers. Virtio by design has full access to guest physical memory. It doesn't route DMA via PCI. So user space drivers simply don't make sense here. Huh? 
What if my guest has userspace using an idmap, with a Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone? Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either. Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really. Of course alternatively we can continue bikeshedding about this until everything becomes moot because we switched to virtio 1.0 ;). Rusty / Michael, virtio 1.0 does go via normal DMA channels, right? Alex By default it doesn't at the moment; in particular IOMMUs really seem to hurt performance when enabled, and many guests seem to be dumb and enable it everywhere if present. By design IOMMUs can protect you from malicious devices, which is relevant if you assign a device but of course isn't relevant for qemu as virtio is part of qemu atm. In virtio 1.0 it's possible for a device to have required features which drivers must ack. So we'll be able to have the hypervisor tell the guest that it requires DMA for some VFs. You would then be able to mix a fast virtio PF bypassing the IOMMU and handled in the kernel, and a slow virtio VF going through an IOMMU and handled in userspace. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvmclock: Ensure time in migration never goes backward
Hi all, On 06/05/14 08:16, Alexander Graf wrote: On 06.05.14 01:23, Marcelo Tosatti wrote: 1) By what algorithm do you retrieve and compare time in the kvmclock guest structure and KVM_GET_CLOCK? What are the results of the comparison? And was any backwards time visible in the guest? I've managed to get my hands on a broken migration stream from Nick. There I looked at the curr_clocksource structure and saw that the last seen time on the kvmclock clock source was greater than the value that the kvmclock device migrated. We've been seeing live migration failures where the guest sees time go backwards (= massive forward leap to the kernel, apparently) for a while now, affecting perhaps 5-10% of migrations we'd do (usually a large proportion of the migrations on a few hosts, rather than an even spread); initially in December, when we tried an upgrade to QEMU 1.7.1 and a 3.mumble (3.10?) kernel, from 1.5.0 and Debian's 3.2. My testing at the time seemed to indicate that either upgrade - qemu or kernel - caused the problems to show up. Guest symptoms are that the kernel enters a tight loop in __run_timers and stays there. In the end, I gave up and downgraded us again without any clear idea of what was happening, or why. In April, we finally got together a fairly reliable test case. This patch resolves the guest hangs in that test, and I've also been able to conduct 1000 migrations of production guests without seeing the issue recur. So, Tested-by: Nick Thomas n...@bytemark.co.uk /Nick
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:11:13 am BST, Alexander Graf ag...@suse.de wrote: On 05/07/2014 11:57 AM, Marc Zyngier wrote: Huh? What if my guest has userspace using an idmap, with a Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone? Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either. Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. And how different is that from the kernel suddenly deciding to use VAs instead of PAs? Just as broken. Are we going to prevent the kernel from using virtio? It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really. Again, what is virtio 1.0 doing here? M. -- Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, 07 May 2014 10:52:01 +0100 Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: + reg.addr = (u64)&data; + if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) + die("KVM_GET_ONE_REG failed (SCTLR_EL1)"); + + return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endianness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace use case... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access?
I'm inclined to think of userspace access as a valid use case. M. All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Cheers. -- Gregory Kurz kurzg...@fr.ibm.com gk...@linux.vnet.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)562 165 496 Anarchy is about taking complete responsibility for yourself. Alan Moore.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Thanks, M. -- Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: On Wed, 07 May 2014 10:52:01 +0100 Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: + reg.addr = (u64)&data; + if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) + die("KVM_GET_ONE_REG failed (SCTLR_EL1)"); + + return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endianness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace use case... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? On AArch64, we can either select the kernel endianness, or userspace endianness.
Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. M. Hi Gregory, All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? Cheers, M. -- Jazz is not dead. It just smells funny.
Re: KVM Nested L2 guest startup problems
On Wed, May 7, 2014 at 11:58 AM, Paolo Bonzini pbonz...@redhat.com wrote: Il 04/05/2014 18:33, Hu Yaohui ha scritto: I experienced a similar problem that was related to the nested code having some bugs related to apicv and other new vmx features. For example, the code enabled posted interrupts to run L2 even when the feature was not exposed to L1 and L1 didn't use it. Try changing prepare_vmcs02 to force disabling posted_interrupts; the code should look like: exec_control = vmcs12->pin_based_vm_exec_control; exec_control |= vmcs_config.pin_based_exec_ctrl; exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER | PIN_BASED_POSTED_INTR); vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control); ... and also ... exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | SECONDARY_EXEC_APIC_REGISTER_VIRT | SECONDARY_EXEC_PAUSE_LOOP_EXITING); PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. So it seems PLE should never be exposed to L1, or appropriate nested handling needs to be implemented. Note the handling may become complex because in some cases a PLE exit from L2 should be handled directly by L0 and not passed to L1... remember nested preemption timer support :) ? Apart from that, I'll change the suggestion into a patch. Great! Thanks! Paolo
Re: [PATCH] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
On 05/07/2014 07:56 AM, Paul Mackerras wrote: On Sun, May 04, 2014 at 10:56:08PM +0530, Aneesh Kumar K.V wrote: With the debug option for sleep-inside-atomic-section checking enabled, we get the below WARN_ON during a PR KVM boot. This is because upstream now has PREEMPT_COUNT enabled even if we have preempt disabled. Fix the warning by adding preempt_disable/enable around floating point and altivec enable. This worries me a bit. In this code: if (msr & MSR_FP) { + preempt_disable(); enable_kernel_fp(); load_fp_state(&vcpu->arch.fp); t->fp_save_area = &vcpu->arch.fp; + preempt_enable(); What would happen if we actually did get preempted at this point? Wouldn't we lose the FP state we just loaded? In other words, how come we're not already preempt-disabled at this point? This is probably because we're trying to confuse Linux :). The entry path happens with interrupts hard disabled, but preempt enabled so that Linux doesn't consider the guest time as non-preemptible. That's the only call I could find where preempt is logically enabled (though it really isn't). Alex
Re: KVM Nested L2 guest startup problems
Il 07/05/2014 13:16, Abel Gordon ha scritto: PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. That's correct. We should leave PLE enabled while running L2, but hide the feature altogether from L1. Paolo So it seems PLE should never be exposed to L1, or appropriate nested handling needs to be implemented. Note the handling may become complex because in some cases a PLE exit from L2 should be handled directly by L0 and not passed to L1... remember nested preemption timer support :) ?
Re: KVM Nested L2 guest startup problems
Il 07/05/2014 13:37, Paolo Bonzini ha scritto: Il 07/05/2014 13:16, Abel Gordon ha scritto: PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. That's correct. We should leave PLE enabled while running L2, but hide the feature altogether from L1. ... which we already do. The only secondary execution controls we allow are APIC page, unrestricted guest, WBINVD exits, and of course EPT. Paolo
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 05/07/2014 12:46 PM, Marc Zyngier wrote: On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Yes, because it can't be isolated as per the current spec. User space has no business in physical addresses. And since so far I haven't heard of a single case where people on ARM are either a) nesting virtualization or b) running different endian user space, I don't think this point is valid. Virtio 1.0 is defined to be little endian only, so we don't need all that messy magic logic anymore. By the time people will do nesting or different endian user space we will most likely be in virtio 1.0 land. Shoehorning in anything in between is just a waste of time. If you'd like to see a constructed case where your logic falls apart, I can easily give you one too (because the whole thing is just insanely fragile). Imagine you have nesting. Your L1 guest passes its virtio device into the L2 guest with idmap. The L1 guest wants to trace MMIO accesses, so it traps on every access and delivers it on its own. L2 is LE, L1 is BE. Virtio gets initialized BE even though the guest that really wants to access it is LE. Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07/05/14 12:49, Alexander Graf wrote: On 05/07/2014 12:46 PM, Marc Zyngier wrote: On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Yes, because it can't be isolated as per the current spec. User space has no business in physical addresses. And since so far I haven't heard of a single case where people on ARM are either a) nesting virtualization or b) running different endian user space, I don't think this point is valid. Virtio 1.0 is defined to be little endian only, so we don't need all that messy magic logic anymore. Alex, please read my lips: at the moment, I don't care about virtio-1.0. At all. Doesn't register. And hammering it on and on won't change a thing (yes, I've rewritten this sentence at least five times to remove all the fscking swear words). By the time people will do nesting or different endian user space we will most likely be in virtio 1.0 land. Shoehorning in anything in between is just a waste of time. If you don't want to support it on your pet platform/environment, fine. If you'd like to see a constructed case where your logic falls apart, I can easily give you one too (because the whole thing is just insanely fragile). Imagine you have nesting. Your L1 guest passes its virtio device into the L2 guest with idmap. The L1 guest wants to trace MMIO accesses, so it traps on every access and delivers it on its own. L2 is LE, L1 is BE. Virtio gets initialized BE even though the guest that really wants to access it is LE. Then it is a bug in your L1 that doesn't properly emulate accesses it traps. Not that I care, really. That being said, I'm going to stop replying to this thread, and instead go back to writing code, posting it, and getting on with my life in virtio-legacy land. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? thanks -- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 13:16, Marc Zyngier marc.zyng...@arm.com wrote: That being said, I'm going to stop replying to this thread, and instead go back to writing code, posting it, and getting on with my life in virtio-legacy land. Some of us are trying to have a conversation in this thread about virtio-legacy behaviour :-) thanks -- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07/05/14 13:17, Peter Maydell wrote: On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? This definition looks pretty good to me. Simple and to the point. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, 7 May 2014 13:17:51 +0100 Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? thanks -- PMM I am convinced... and feeling a bit guilty for all the noise ;) I'll come back with a new virtio patch set for QEMU that does just what you say. -- Gregory Kurz kurzg...@fr.ibm.com gk...@linux.vnet.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)562 165 496 Anarchy is about taking complete responsibility for yourself. Alan Moore.
[PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand size. Note that when a 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, &ctxt->_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt->mode == X86EMUL_MODE_PROT64 ? 8 : ctxt->op_bytes; + masked_increment(&ctxt->_eip, op_bytes, rel); } static u32 desc_limit_scaled(struct desc_struct *desc) -- 1.9.1
[PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
In long mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As the Intel SDM says: In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e8a5840..743e8e3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, u16 sel; unsigned cpl; - la = seg_base(ctxt, addr.seg) + addr.ea; + la = seg_base(ctxt, addr.seg) + + (ctxt->ad_bytes == 8 ? addr.ea : (u32)addr.ea); switch (ctxt->mode) { case X86EMUL_MODE_PROT64: if (((signed long)la << 16) >> 16 != la) @@ -678,7 +679,7 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, } break; } - if (fetch ? ctxt->mode != X86EMUL_MODE_PROT64 : ctxt->ad_bytes != 8) + if (ctxt->mode != X86EMUL_MODE_PROT64) la &= (u32)-1; if (insn_aligned(ctxt, size) && ((la & (size - 1)) != 0)) return emulate_gp(ctxt, 0); -- 1.9.1
[PATCH 0/5] KVM: x86: Fix exit handler and emulation bugs
This series of patches fixes various scenarios in which KVM does not follow the x86 specification. Patches #4 and #5 are related; they reflect a new revision of the previously submitted patch that dealt with the wrong masking of registers in long-mode. Patch #3 is a follow-up to the previously submitted patch that fixed the wrong reserved page table masks. Patches #3 and #5 were not tested in a manner that actually checks the modified behavior. Not all the paths in patch #4 were tested. Thanks for reviewing the patches. Nadav Amit (5): KVM: x86: Emulator does not calculate address correctly KVM: vmx: handle_dr does not handle RSP correctly KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support KVM: x86: Wrong register masking in 64-bit mode KVM: x86: Fix wrong masking on relative jump/call arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/emulate.c | 47 +-- arch/x86/kvm/mmu.c | 8 ++-- arch/x86/kvm/vmx.c | 2 +- 4 files changed, 43 insertions(+), 21 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 743e8e3..6833b41 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -434,9 +434,21 @@ static int emulator_check_intercept(struct x86_emulate_ctxt *ctxt, return ctxt-ops-intercept(ctxt, info, stage); } -static void assign_masked(ulong *dest, ulong src, ulong mask) +static void assign_masked(ulong *dest, ulong src, int bytes) { - *dest = (*dest ~mask) | (src mask); + switch (bytes) { + case 2: + *dest = (u16)src | (*dest ~0xul); + break; + case 4: + *dest = (u32)src; + break; + case 8: + *dest = src; + break; + default: + BUG(); + } } static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt) @@ -476,26 +488,20 @@ register_address(struct x86_emulate_ctxt *ctxt, unsigned long reg) return address_mask(ctxt, reg); } -static void masked_increment(ulong *reg, ulong mask, int inc) +static void masked_increment(ulong *reg, int size, int inc) { - assign_masked(reg, *reg + inc, mask); + assign_masked(reg, *reg + inc, size); } static inline void register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc) { - ulong mask; - - if (ctxt-ad_bytes == sizeof(unsigned long)) - mask = ~0UL; - else - mask = ad_mask(ctxt); - masked_increment(reg, mask, inc); + masked_increment(reg, ctxt-ad_bytes, inc); } static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) { - masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_mask(ctxt), 
inc); + masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_size(ctxt), inc); } static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) @@ -1712,17 +1718,17 @@ static int em_enter(struct x86_emulate_ctxt *ctxt) if (rc != X86EMUL_CONTINUE) return rc; assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP), - stack_mask(ctxt)); + stack_size(ctxt)); assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RSP) - frame_size, - stack_mask(ctxt)); + stack_size(ctxt)); return X86EMUL_CONTINUE; } static int em_leave(struct x86_emulate_ctxt *ctxt) { assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RBP), - stack_mask(ctxt)); + stack_size(ctxt)); return emulate_pop(ctxt, reg_rmw(ctxt, VCPU_REGS_RBP), ctxt-op_bytes); } @@ -4570,6 +4576,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (ctxt-rep_prefix (ctxt-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) { + assign_masked(reg_write(ctxt, VCPU_REGS_RCX), 0, + ctxt-ad_bytes); ctxt-eip = ctxt-_eip; goto done; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] KVM: vmx: handle_dr does not handle RSP correctly
The RSP register is not automatically cached, causing a mov DR instruction that uses RSP to fail. Instead, the regular register accessor interface should be used. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 72b8012..0e2793f 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5142,7 +5142,7 @@ static int handle_dr(struct kvm_vcpu *vcpu) return 1; kvm_register_write(vcpu, reg, val); } else - if (kvm_set_dr(vcpu, dr, vcpu->arch.regs[reg])) + if (kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg))) return 1; skip_emulated_instruction(vcpu); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support
In long-mode, bit 7 in the PDPTE is reserved only if 1GB pages are not supported by the CPU. Currently the bit is considered by KVM as always reserved. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/mmu.c | 8 ++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index eeecbed..f908731 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -88,4 +88,11 @@ static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu) return best && (best->ecx & bit(X86_FEATURE_X2APIC)); } +static inline bool guest_cpuid_has_gbpages(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0); + return best && (best->edx & bit(X86_FEATURE_GBPAGES)); +} #endif diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 65f2400..9314678 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -22,6 +22,7 @@ #include "mmu.h" #include "x86.h" #include "kvm_cache_regs.h" +#include "cpuid.h" #include <linux/kvm_host.h> #include <linux/types.h> @@ -3516,11 +3517,14 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, { int maxphyaddr = cpuid_maxphyaddr(vcpu); u64 exb_bit_rsvd = 0; + u64 gbpages_bit_rsvd = 0; context->bad_mt_xwr = 0; if (!context->nx) exb_bit_rsvd = rsvd_bits(63, 63); + if (!guest_cpuid_has_gbpages(vcpu)) + gbpages_bit_rsvd = rsvd_bits(7, 7); switch (context->root_level) { case PT32_ROOT_LEVEL: /* no rsvd bits for 2 level 4K page table entries */ @@ -3557,14 +3561,14 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, context->rsvd_bits_mask[0][3] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7); context->rsvd_bits_mask[0][2] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7); + gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[0][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[0][0] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3]; context->rsvd_bits_mask[1][2] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51) | + gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 29); context->rsvd_bits_mask[1][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51) | -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels 3.15
On Wed, May 07, 2014 at 08:29:19AM +0200, Jan Kiszka wrote: On 2014-05-06 20:35, gso...@gmail.com wrote: Signed-off-by: Gabriel Somlo so...@cmu.edu --- Jan, After today's pull from kvm, I also need this to build against my Fedora 20 kernel (3.13.10-200.fc20.x86_64). Which version did you pull? Neither next nor queue contain this change yet. I have 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd (Linux 3.15-rc4) as the top commit. Right now I'm seeing different build problems: http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio Didn't look into details yet. That's kernel 3.11, afaict, so entirely different problem :) x86/external-module-compat.h | 5 + 1 file changed, 5 insertions(+) diff --git a/x86/external-module-compat.h b/x86/external-module-compat.h index dde9463..e62486b 100644 --- a/x86/external-module-compat.h +++ b/x86/external-module-compat.h @@ -1480,6 +1480,11 @@ static inline int __register_hotcpu_notifier(struct notifier_block *nb) { return 0; } +static inline int pci_enable_msix_exact(struct pci_dev *dev, + struct msix_entry *entries, int nvec) +{ + return -ENOSYS; +} First, this is not x86-specific, thus should go to external-module-compat-comm.h. Second, pci_enable_msix_exact was introduced with 3.14. And then I think we can (and should) provide the original content of this wrapper once it is needed, not a stub. The patch I sent fixed building for 3.13.10-200.fc20.x86_64; as soon as I upgraded to 3.14.2-200.fc20.x86_64 (about half hour later), I could build for it just fine, no errors. 
So I don't really *need* this fix anymore, but given what you said and that it works fine on 3.14 without a patch (at least for me), the patch *should* then probably be: diff --git a/external-module-compat-comm.h b/external-module-compat-comm.h index 086127d..94b18cc 100644 --- a/external-module-compat-comm.h +++ b/external-module-compat-comm.h @@ -1401,3 +1401,11 @@ static inline void guest_exit(void) #if LINUX_VERSION_CODE < KERNEL_VERSION(3,13,0) static inline void smp_mb__after_srcu_read_unlock(void) {} #endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) +static inline int pci_enable_msix_exact(struct pci_dev *dev, + struct msix_entry *entries, int nvec) +{ + return -ENOSYS; +} +#endif -- except I can't test it anymore without rebooting to 3.13, which I can't do right away :) Regards, --G -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
Il 07/05/2014 14:32, Nadav Amit ha scritto: In long-mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As Intel SDM says: In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e8a5840..743e8e3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, u16 sel; unsigned cpl; - la = seg_base(ctxt, addr.seg) + addr.ea; + la = seg_base(ctxt, addr.seg) + + (ctxt-ad_bytes == 8 ? addr.ea : (u32)addr.ea); I think you need fetch || ctxt-ad_bytes == 8 here. Paolo switch (ctxt-mode) { case X86EMUL_MODE_PROT64: if (((signed long)la 16) 16 != la) @@ -678,7 +679,7 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, } break; } - if (fetch ? ctxt-mode != X86EMUL_MODE_PROT64 : ctxt-ad_bytes != 8) + if (ctxt-mode != X86EMUL_MODE_PROT64) la = (u32)-1; if (insn_aligned(ctxt, size) ((la (size - 1)) != 0)) return emulate_gp(ctxt, 0); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm/x86: implement hv EOI assist
It seems that it's easy to implement the EOI assist on top of the PV EOI feature: simply convert the page address to the format expected by PV EOI. Notes: -No EOI required is set only if interrupt injected is edge triggered; this is true because level interrupts are going through IOAPIC which disables PV EOI. In any case, if guest triggers EOI the bit will get cleared on exit. -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE seems sufficient In any case, bit is cleared on exit so worst case it's never re-enabled -no handling of PV EOI data is performed at HV_X64_MSR_EOI write; HV_X64_MSR_EOI is a separate optimization - it's an X2APIC replacement that lets you do EOI with an MSR and not IO. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- patch is unchanged from RFC except for documenting the design points in the commit log, do address Marcelo's comments. This passed basic testing. Migration was not tested. I think at this point it's a good idea to merge this so it can get tested and used by more people. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8ae1ff5..d84d750fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1890,6 +1890,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (!(data HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE)) { vcpu-arch.hv_vapic = data; + if (kvm_lapic_enable_pv_eoi(vcpu, 0)) + return 1; break; } gfn = data HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT; @@ -1900,6 +1902,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 1; vcpu-arch.hv_vapic = data; mark_page_dirty(vcpu-kvm, gfn); + if (kvm_lapic_enable_pv_eoi(vcpu, gfn_to_gpa(gfn) | KVM_MSR_ENABLED)) + return 1; break; } case HV_X64_MSR_EOI: -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
Nadav Amit na...@cs.technion.ac.il writes: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, ctxt-_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt-mode == X86EMUL_MODE_PROT64 ? 8 : ctxt-op_bytes; Just a nit, probably break this up for readability ? + masked_increment(ctxt-eip, op_bytes, rel); } static u32 desc_limit_scaled(struct desc_struct *desc) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
Nadav Amit na...@cs.technion.ac.il writes: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix It would be helpful to have a pointer in the SDM especially for cases that are not well-documented. is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 743e8e3..6833b41 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -434,9 +434,21 @@ static int emulator_check_intercept(struct x86_emulate_ctxt *ctxt, return ctxt-ops-intercept(ctxt, info, stage); } -static void assign_masked(ulong *dest, ulong src, ulong mask) +static void assign_masked(ulong *dest, ulong src, int bytes) { - *dest = (*dest ~mask) | (src mask); + switch (bytes) { + case 2: + *dest = (u16)src | (*dest ~0xul); + break; + case 4: + *dest = (u32)src; + break; + case 8: + *dest = src; + break; + default: + BUG(); IIRC, Paolo mentioned that a WARN() is preferable. But I see a lot other places where BUG() is called, maybe, he can confirm. 
+ } } static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt) @@ -476,26 +488,20 @@ register_address(struct x86_emulate_ctxt *ctxt, unsigned long reg) return address_mask(ctxt, reg); } -static void masked_increment(ulong *reg, ulong mask, int inc) +static void masked_increment(ulong *reg, int size, int inc) { - assign_masked(reg, *reg + inc, mask); + assign_masked(reg, *reg + inc, size); } static inline void register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc) { - ulong mask; - - if (ctxt-ad_bytes == sizeof(unsigned long)) - mask = ~0UL; - else - mask = ad_mask(ctxt); - masked_increment(reg, mask, inc); + masked_increment(reg, ctxt-ad_bytes, inc); } static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) { - masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_mask(ctxt), inc); + masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_size(ctxt), inc); } static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) @@ -1712,17 +1718,17 @@ static int em_enter(struct x86_emulate_ctxt *ctxt) if (rc != X86EMUL_CONTINUE) return rc; assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP), - stack_mask(ctxt)); + stack_size(ctxt)); assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RSP) - frame_size, - stack_mask(ctxt)); + stack_size(ctxt)); return X86EMUL_CONTINUE; } static int em_leave(struct x86_emulate_ctxt *ctxt) { assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RBP), - stack_mask(ctxt)); + stack_size(ctxt)); return emulate_pop(ctxt, reg_rmw(ctxt, VCPU_REGS_RBP), ctxt-op_bytes); } @@ -4570,6 +4576,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (ctxt-rep_prefix (ctxt-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) { + assign_masked(reg_write(ctxt, VCPU_REGS_RCX), 0, + ctxt-ad_bytes); ctxt-eip = ctxt-_eip; goto done; } -- To unsubscribe from this list: send 
the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote: Kim, Christoffer, On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall christoffer.d...@linaro.org wrote: On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote: Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE. Signed-off-by: Kim Phillips kim.phill...@linaro.org --- arch/arm/kvm/mmu.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 7789857..a354610 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -652,6 +652,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache; struct vm_area_struct *vma; pfn_t pfn; + pgprot_t mem_type = PAGE_S2; write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); if (fault_status == FSC_PERM !write_fault) { @@ -702,6 +703,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (is_error_pfn(pfn)) return -EFAULT; + if (kvm_is_mmio_pfn(pfn)) + mem_type = PAGE_S2_DEVICE; + spin_lock(kvm-mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; @@ -709,7 +713,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, hugetlb = transparent_hugepage_adjust(pfn, fault_ipa); if (hugetlb) { - pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2); + pmd_t new_pmd = pfn_pmd(pfn, mem_type); new_pmd = pmd_mkhuge(new_pmd); if (writable) { kvm_set_s2pmd_writable(new_pmd); @@ -718,13 +722,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, coherent_icache_guest_page(kvm, hva PMD_MASK, PMD_SIZE); ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, new_pmd); } else { - pte_t new_pte = pfn_pte(pfn, PAGE_S2); + pte_t new_pte = pfn_pte(pfn, mem_type); if (writable) { kvm_set_s2pte_writable(new_pte); kvm_set_pfn_dirty(pfn); } coherent_icache_guest_page(kvm, hva, PAGE_SIZE); - ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false); + ret = 
stage2_set_pte(kvm, memcache, fault_ipa, new_pte, + mem_type == PAGE_S2_DEVICE); } -- 1.9.1 I think this looks reasonable. Acked-by: Christoffer Dall christoffer.d...@linaro.org I feel like I'm missing some context here, and the commit message is way too terse for me to make sense of it. So far, we can only get into user_mem_abort on a Stage-2 fault (translation or permission) for memory. How can we suddenly get here for a *device* fault? Do we get a special kind of memslot? I'm not saying the patch does anything wrong, but I'd like to understand the rationale behind it. On its own, it doesn't make much sense. Think device passthrough. There's nothing preventing user space from setting up a memory region to point to device memory (through VFIO or /dev/mem). If that's done, we should enforce device memory properties so writes don't linger around in the cache to be written some time later when that device memory potentially doesn't belong to the VM anymore. This is just one tiny piece of all of them to make device passthrough work, and we could hold off with this patch until we have something more complete. On the other hand, we need to start somewhere, and this is hardly intrusive and is functionally correct even though you don't have a full device passthrough setup. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support
On 04/27/2014 02:09 PM, Raghavendra K T wrote: For kvm part feel free to add: Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com V9 testing has shown no hangs. I was able to do some performance testing; here are the results. Overall we are seeing good improvement for the pv-unfair version.

System: 32 cpu Sandy Bridge with HT on (4 node machine with 32 GB each). Guest: 8GB with 16 vcpu/VM. Average was taken over 8-10 data points.

Base = 3.15-rc2 with PARAVIRT_SPINLOCKS = y
A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = y, PARAVIRT_UNFAIR_LOCKS = y (unfair lock)
B = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = n, PARAVIRT_UNFAIR_LOCKS = n (queue spinlock without paravirt)
C = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = y, PARAVIRT_UNFAIR_LOCKS = n (queue spinlock with paravirt)

Ebizzy % improvements
overcommit        A         B         C
0.5x         4.4265    2.0611    1.5824
1.0x         0.9015   -7.7828    4.5443
1.5x        46.1162   -2.9845   -3.5046
2.0x        99.8150   -2.7116    4.7461

Dbench % improvements
overcommit        A         B         C
0.5x         3.2617    3.5436    2.5676
1.0x         0.6302    2.2342    5.2201
1.5x         5.0027    4.8275    3.8375
2.0x        23.8242    4.5782   12.6067

Absolute values of base results (overcommit, value, stdev):
Ebizzy (records/sec with 120 sec run)
0.5x  20941.8750 (2%)
1.0x  17623.8750 (5%)
1.5x   5874.7778 (15%)
2.0x   3581.8750 (7%)
Dbench (throughput in MB/sec)
0.5x  10009.6610 (5%)
1.0x   6583.0538 (1%)
1.5x   3991.9622 (4%)
2.0x   2527.0613 (2.5%)

Thanks for the testing. I will include your Tested-by tag in the next version. -Longman -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 02/19] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 29 + arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 39 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 25d2c6f..95c9c4e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -29,6 +29,7 @@ config X86 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_INT128 if X86_64 select ARCH_WANTS_PROT_NUMA_PROT_NONE + select ARCH_USE_QUEUE_SPINLOCK select HAVE_IDE select HAVE_OPROFILE select HAVE_PCSPKR_PLATFORM diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..e4a4f5d --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,29 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#if !defined(CONFIG_X86_OOSTORE) !defined(CONFIG_X86_PPRO_FENCE) + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * No special memory barrier other than a compiler one is needed for the + * x86 
architecture. A compiler barrier is added at the end to make sure + * that the clearing the lock bit is done ASAP without artificial delay + * due to compiler optimization. + */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + barrier(); + ACCESS_ONCE(*(u8 *)lock) = 0; + barrier(); +} + +#endif /* !CONFIG_X86_OOSTORE !CONFIG_X86_PPRO_FENCE */ + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 0f62f54..958d20f 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, { arch_spin_lock(lock); } +#endif /* CONFIG_QUEUE_SPINLOCK */ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 4f1bea1..7960268 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm/rwlock.h -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 19/19] pvqspinlock, x86: Enable PV qspinlock for XEN
This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/xen/spinlock.c | 147 +-- kernel/Kconfig.locks|2 +- 2 files changed, 143 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index d1b6a32..2a259bb 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,12 @@ #include xen-ops.h #include debugfs.h +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifndef CONFIG_QUEUE_SPINLOCK + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +106,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } +#else /* CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_XEN_DEBUG_FS +static u32 kick_nohlt_stats; /* Kick but not halt count */ +static u32 halt_qhead_stats; /* Queue head halting count */ +static u32 halt_qnode_stats; /* Queue node halting count */ +static u32 halt_abort_stats; /* Halting abort count */ +static u32 wake_kick_stats;/* Wakeup by kicking count */ +static u32 wake_spur_stats;/* Spurious wakeup count*/ +static u64 time_blocked; /* Total blocking time */ + +static inline void xen_halt_stats(enum pv_lock_stats type) +{ + if (type == PV_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + else if (type == PV_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + else /* type == PV_HALT_ABORT */ + add_smp(halt_abort_stats, 1); +} + +static inline void 
xen_lock_stats(enum pv_lock_stats type) +{ + if (type == PV_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + else if (type == PV_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + else /* type == PV_KICK_NOHALT */ + add_smp(kick_nohlt_stats, 1); +} + +static inline u64 spin_time_start(void) +{ + return sched_clock(); +} + +static inline void spin_time_accum_blocked(u64 start) +{ + u64 delta; + + delta = sched_clock() - start; + add_smp(time_blocked, delta); +} +#else /* CONFIG_XEN_DEBUG_FS */ +static inline void xen_halt_stats(enum pv_lock_stats type) +{ +} + +static inline void xen_lock_stats(enum pv_lock_stats type) +{ +} + +static inline u64 spin_time_start(void) +{ + return 0; +} + +static inline void spin_time_accum_blocked(u64 start) +{ +} +#endif /* CONFIG_XEN_DEBUG_FS */ + +static void xen_kick_cpu(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); +} + +/* + * Halt the current CPU release it back to the host + */ +static void xen_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + u64 start; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* +* Make sure an interrupt handler can't upset things in a +* partially setup state. +*/ + local_irq_save(flags); + start = spin_time_start(); + + xen_halt_stats(type); + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); + /* +* Don't halt if the CPU state has been changed. +*/ + if (ACCESS_ONCE(*state) != sval) { + xen_halt_stats(PV_HALT_ABORT); + return; + } + /* +* If an interrupt happens here, it will leave the wakeup irq +* pending, which will cause xen_poll_irq() to return +* immediately. 
+*/ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); + spin_time_accum_blocked(start); +} +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); @@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -275,8 +389,15 @@ void __init xen_init_spinlocks(void) return; } printk(KERN_DEBUG
[PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM-specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Two KVM guests of 20 CPU cores (2 nodes) were created for performance testing in one of the following three configurations:
1) Only 1 VM is active
2) Both VMs are active and they share the same 20 physical CPUs (200% overcommit)
3) Both VMs are active and they share 30 physical CPUs (10 dedicated and 10 shared - 133% overcommit)
The tests run included the disk workload of the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a 3.15-rc1 based kernel. The ebizzy -m test was also run and its performance data were recorded. With two VMs running, the idle=poll kernel option was added to simulate a busy guest. The entry "unfair + PV qspinlock" below means that both the unfair lock and PV spinlock configuration options were turned on.

AIM7 XFS Disk Test (no overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            2489626     7.23      101.08      5.30
  qspinlock                2531646     7.11      100.75      5.43
  PV qspinlock             2507.20                101.94      5.40
  unfair qspinlock         2549575     7.06       99.81      5.35
  unfair + PV qspinlock    2486188     7.24      101.55      5.51

AIM7 XFS Disk Test (133% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            1114551    16.15      220.17     10.75
  qspinlock                1159047    15.53      216.60     10.24
  PV qspinlock             1170351    15.38      216.16     11.03
  unfair qspinlock         1188119    15.15      209.37     10.82
  unfair + PV qspinlock    1178782    15.27      211.37     11.25

AIM7 XFS Disk Test (200% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             587467    30.64      444.95     11.92
  qspinlock                 593276    30.34      439.39     14.59
  PV qspinlock              601403    29.93      426.04     14.49
  unfair qspinlock          654070    27.52      400.82     10.86
  unfair + PV qspinlock     614334    29.30      393.38     28.56

AIM7 EXT4 Disk Test (no overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            2002225     9.07      105.62      5.43
  qspinlock                2006689     8.97      105.65      5.26
  PV qspinlock             2002225     8.99      103.19      5.19
  unfair qspinlock         1988950     9.05      103.81      5.03
  unfair + PV qspinlock    1993355     9.03      107.99      5.68

AIM7 EXT4 Disk Test (133% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             987383    18.23      221.63      8.89
  qspinlock                1050788    17.13      206.87      8.35
  PV qspinlock             1058823    17.00      205.22      9.18
  unfair qspinlock         1161290    15.50      184.22      8.84
  unfair + PV qspinlock    1122894    16.03      195.86      9.34

AIM7 EXT4 Disk Test (200% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             420757    42.78      565.96      5.84
  qspinlock                 427452    42.11      543.08     11.12
  PV qspinlock              420659    42.79      548.30     10.56
  unfair qspinlock          504909    35.65      466.71      5.38
  unfair + PV qspinlock     500974    35.93      469.02      6.77

EBIZZY-M Test (no overcommit)
  kernel                   Rec/s    Real Time   Sys Time   Usr Time
  ------                   -----    ---------   --------   --------
  PV ticketlock            1230       10.00       88.34      1.42
  qspinlock                1212       10.00       68.25      1.47
  PV qspinlock             1265       10.00       91.50      1.41
  unfair qspinlock         1304       10.00       77.94      1.49
  unfair + PV qspinlock    1445       10.00       75.45      1.68

EBIZZY-M Test (133% overcommit)
  kernel                   Rec/s    Real Time   Sys Time   Usr Time
  ------                   -----    ---------   --------   --------
  PV ticketlock             467       10.00       88.16      0.73
  qspinlock                 463       10.00       89.44      0.78
  PV qspinlock              441       10.00       95.10      0.74
  unfair qspinlock         1233       10.00       35.76      1.76
  unfair + PV qspinlock    1555       10.00       32.12      1.96

EBIZZY-M Test (200% overcommit)
  kernel                   Rec/s    Real Time   Sys Time
[PATCH v10 14/19] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/spinlock.h |4 ++-- arch/x86/kernel/kvm.c|2 +- arch/x86/kernel/paravirt-spinlocks.c |4 ++-- arch/x86/xen/spinlock.c |2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 958d20f..428d0d1 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG - static_key_false(paravirt_ticketlocks_enabled)) { + static_key_false(paravirt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 0331cb3..7ab8ab3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -817,7 +817,7 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); printk(KERN_INFO KVM setup paravirtual spinlock\n); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 7dfd02d..6d36731 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; 
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); #endif #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 0ba5f3b..d1b6a32 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_domain()) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 15/19] pvqspinlock, x86: Add PV data structure methods
This patch modifies the para-virtualization (PV) infrastructure code of the x86-64 architecture to support the PV queue spinlock. Three new virtual methods are added to support PV qspinlock: 1) kick_cpu - schedule in a virtual CPU 2) halt_cpu - schedule out a virtual CPU 3) lockstat - update statistical data for debugfs Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/paravirt.h | 18 +- arch/x86/include/asm/paravirt_types.h | 17 + arch/x86/kernel/paravirt-spinlocks.c |6 ++ 3 files changed, 40 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..d71e123 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -711,7 +711,23 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, } #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK +static __always_inline void __queue_kick_cpu(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu); +} + +static __always_inline void +__queue_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + PVOP_VCALL3(pv_lock_ops.halt_cpu, type, state, sval); +} +static __always_inline void __queue_lockstat(enum pv_lock_stats type) +{ + PVOP_VCALL1(pv_lock_ops.lockstat, type); +} +#else static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -723,7 +739,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } - +#endif #endif #ifdef CONFIG_X86_32 diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..549b3a0 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,26 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +#ifdef CONFIG_QUEUE_SPINLOCK +enum pv_lock_stats { + PV_HALT_QHEAD, /* Queue head halting */ + PV_HALT_QNODE, /* Other 
queue node halting */ + PV_HALT_ABORT, /* Halting aborted */ + PV_WAKE_KICKED, /* Wakeup by kicking*/ + PV_WAKE_SPURIOUS, /* Spurious wakeup */ + PV_KICK_NOHALT /* Kick but CPU not halted */ +}; +#endif + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*kick_cpu)(int cpu); + void (*halt_cpu)(enum pv_lock_stats type, s8 *state, s8 sval); + void (*lockstat)(enum pv_lock_stats type); +#else struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6d36731..8d15bea 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -11,9 +11,15 @@ #ifdef CONFIG_PARAVIRT_SPINLOCKS struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP +#ifdef CONFIG_QUEUE_SPINLOCK + .kick_cpu = paravirt_nop, + .halt_cpu = paravirt_nop, + .lockstat = paravirt_nop, +#else .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif +#endif }; EXPORT_SYMBOL(pv_lock_ops); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 16/19] pvqspinlock: Enable coexistence with the unfair lock
This patch enables the coexistence of both the PV qspinlock and unfair lock. When both are enabled, however, only the lock fastpath will perform lock stealing whereas the slowpath will have that disabled to get the best of both features. We also need to transition a CPU spinning too long in the pending bit code path back to the regular queuing code path so that it can be properly halted by the PV qspinlock code. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 74 ++-- 1 files changed, 64 insertions(+), 10 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0c86a6f..fb05e64 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -72,6 +72,30 @@ struct qnode { #define qhead mcs.locked /* The queue head flag */ /* + * Allow spinning loop count only if either PV spinlock or unfair lock is + * configured. + */ +#if defined(CONFIG_PARAVIRT_UNFAIR_LOCKS) || defined(CONFIG_PARAVIRT_SPINLOCKS) +#defineDEF_LOOP_CNT(c) int c = 0 +#defineINC_LOOP_CNT(c) (c)++ +#defineLOOP_CNT(c) c +#else +#defineDEF_LOOP_CNT(c) +#defineINC_LOOP_CNT(c) +#defineLOOP_CNT(c) 0 +#endif + +/* + * Check the pending bit spinning threshold only if PV qspinlock is enabled + */ +#define PSPIN_THRESHOLD(1 10) +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define pv_qspinlock_enabled() static_key_false(paravirt_spinlocks_enabled) +#else +#define pv_qspinlock_enabled() false +#endif + +/* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * @@ -302,9 +326,6 @@ cmpxchg_tail(struct qspinlock *lock, u32 old, u32 new) * starvation. 
*/ #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS -#define DEF_LOOP_CNT(c)int c = 0 -#define INC_LOOP_CNT(c)(c)++ -#define LOOP_CNT(c)c #define LSTEAL_MIN (1 3) #define LSTEAL_MAX (1 10) #define LSTEAL_MIN_MASK(LSTEAL_MIN - 1) @@ -330,7 +351,11 @@ static inline void unfair_init_vars(struct qnode *node) static inline void unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) { - if (!static_key_false(paravirt_unfairlocks_enabled)) + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled)) return; node-qprev = prev; @@ -356,7 +381,11 @@ unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) */ static inline int unfair_check_and_clear_tail(struct qspinlock *lock, u32 tail) { - if (!static_key_false(paravirt_unfairlocks_enabled)) + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled)) return false; /* @@ -385,7 +414,11 @@ unfair_get_lock(struct qspinlock *lock, struct qnode *node, u32 tail, int count) int isqhead; struct qnode *next; - if (!static_key_false(paravirt_unfairlocks_enabled) || + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled) || ((count node-lsteal_mask) != node-lsteal_mask)) return false; @@ -463,9 +496,6 @@ unfair_get_lock(struct qspinlock *lock, struct qnode *node, u32 tail, int count) } #else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ -#defineDEF_LOOP_CNT(c) -#defineINC_LOOP_CNT(c) -#defineLOOP_CNT(c) 0 static void unfair_init_vars(struct qnode *node) {} static void unfair_set_vars(struct qnode *node, struct qnode *prev, @@ -582,9 +612,28 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * store-release that clears the locked bit and create lock * sequentiality; this because not all clear_pending_set_locked() * implementations imply full 
barriers. +* +* When PV qspinlock is enabled, exit the pending bit code path and +* go back to the regular queuing path if the lock isn't available +* within a certain threshold. */ - while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) + if (pv_qspinlock_enabled()) + retry = PSPIN_THRESHOLD; + while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) { + if (pv_qspinlock_enabled() (--retry == 0)) { + /* +* Clear the pending bit and exit +*/ + for (;;) { + new = val ~_Q_PENDING_MASK; + old =
[PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest
Locking is always an issue in a virtualized environment because of 2 different types of problems:
1) Lock holder preemption
2) Lock waiter preemption
One solution to the lock waiter preemption problem is to allow unfair locks in a virtualized environment. In this case, a new lock acquirer can come and steal the lock if the next-in-line CPU to get the lock is scheduled out. A simple unfair lock is the test-and-set byte lock where a lock acquirer constantly spins on the lock word and attempts to grab it when the lock is freed. This simple unfair lock has 2 main problems:
1) The constant spinning on the lock word puts a lot of cacheline contention traffic on the affected cacheline, thus slowing tasks that need to access the cacheline.
2) Lock starvation is a real possibility, especially if the number of virtual CPUs is large.
A simple unfair queue spinlock can be implemented by allowing lock stealing in the fast path. The slowpath will still be the same as before and all the pending lock acquirers will have to wait in the queue in FIFO order. This cannot completely solve the lock waiter preemption problem, but it does help to alleviate the impact of this problem. To illustrate the performance impact of the various approaches, the disk workload of the AIM7 benchmark and the ebizzy test were run on a 4-socket 40-core Westmere-EX system (bare metal, HT off, ramdisk) on a 3.14 based kernel. The table below shows the performance of the different kernel flavors.
AIM7 XFS Disk Test
  kernel                    JPM      Real Time   Sys Time   Usr Time
  ------                    ---      ---------   --------   --------
  ticketlock                5678233     3.17       96.61      5.81
  qspinlock                 5750799     3.13       94.83      5.97
  simple test-and-set       5625000     3.20       98.29      5.93
  simple unfair qspinlock   5750799     3.13       95.91      5.98

AIM7 EXT4 Disk Test
  kernel                    JPM      Real Time   Sys Time   Usr Time
  ------                    ---      ---------   --------   --------
  ticketlock                1114551    16.15      509.72      7.11
  qspinlock                 2184466     8.24      232.99      6.01
  simple test-and-set        593081    30.35      967.55      9.00
  simple unfair qspinlock   2292994     7.85      222.84      5.89

Ebizzy -m test
  kernel                    records/s  Real Time   Sys Time   Usr Time
  ------                    ---------  ---------   --------   --------
  ticketlock                  2075       10.00      216.35      3.49
  qspinlock                   3023       10.00      198.20      4.80
  simple test-and-set         1667       10.00      198.93      2.89
  simple unfair qspinlock     2915       10.00      165.68      4.31

The disk-xfs workload spent only about 2.88% of CPU time in _raw_spin_lock() whereas the disk-ext4 workload spent 57.8% of CPU time in _raw_spin_lock(). It can be seen that there wasn't too much difference in performance with low spinlock contention in the disk-xfs workload. With heavy spinlock contention, the performance of the simple test-and-set lock can plummet when compared with the ticket and queue spinlocks.

An unfair lock in a native environment is generally not a good idea as there is a possibility of lock starvation for a heavily contended lock. This patch adds a new configuration option for the x86 architecture to enable the use of the unfair queue spinlock (PARAVIRT_UNFAIR_LOCKS) in a para-virtualized guest. A jump label (paravirt_unfairlocks_enabled) is used to switch between a fair and an unfair version of the spinlock code. This jump label will only be enabled in a virtual guest where the X86_FEATURE_HYPERVISOR feature bit is set.

Enabling this configuration feature causes a slight decrease in the performance of an uncontended lock-unlock operation of about 1-2%, mainly due to the use of a static key. However, uncontended lock-unlock operations are really just a tiny percentage of a real workload. So there should be no noticeable change in application performance.
With the unfair locking activated on bare metal 4-socket Westmere-EX box, the execution times (in ms) of a spinlock micro-benchmark were as follows: # ofTicket Fair Unfair simpleUnfair taskslock queue lockqueue lockbyte lock -- --- ----- 1 135135 137 137 2 890 1082 421 718 3 1932 2248 708 1263 4 2829 2819 1030 1916 5 3834 3522 1323 2327 6 4963 4173 1723 2938 7 6299 4875 2067 3292 8 7691 5563 2360 3768 Executing one task per node, the performance data were: # ofTicket Fair Unfair simpleUnfair nodeslock queue lockqueue
[PATCH v10 12/19] unfair qspinlock: Variable frequency lock stealing mechanism
In order to fully resolve the lock waiter preemption problem in virtual guests, it is necessary to enable lock stealing in the lock waiters. A simple test-and-set lock, however, has 2 main problems:
1) The constant spinning on the lock word puts a lot of cacheline contention traffic on the affected cacheline, thus slowing tasks that need to access the cacheline.
2) Lock starvation is a real possibility, especially if the number of virtual CPUs is large.
To alleviate these problems, this patch implements a variable frequency (from 1/8 to 1/1024) lock stealing mechanism for the lock waiters in the queue. The node next to the queue head tries to steal the lock once every 8 iterations of the pause loop. The next one in the queue has half the lock stealing frequency (once every 16 iterations) and so on until it reaches a maximum of once every 1024 iterations. This mechanism reduces the cacheline contention problem on the lock word while trying to maintain as much of a FIFO order as possible. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 147 +++- 1 files changed, 146 insertions(+), 1 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a14241e..06dd486 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -63,6 +63,11 @@ */ struct qnode { struct mcs_spinlock mcs; +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS + int lsteal_mask;/* Lock stealing frequency mask */ + u32 prev_tail; /* Tail code of previous node */ + struct qnode *qprev; /* Previous queue node addr */ +#endif }; #define qhead mcs.locked /* The queue head flag */ @@ -215,6 +220,139 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) } #endif /* _Q_PENDING_BITS == 8 */ +/* + + * Inline functions for supporting unfair queue lock * + + */ +/* + * Unfair lock support in a virtualized guest + * + * An unfair lock can be implemented using a simple test-and-set lock like + * what is being done in a read-write lock. 
This simple scheme has 2 major + * problems: + * 1) It needs constant reading and occasionally writing to the lock word + * thus putting a lot of cacheline contention traffic on the affected + * cacheline. + * 2) Lock starvation is a real possibility especially if the number of + * virtual CPUs is large. + * + * To reduce the undesirable side effects of an unfair lock, the queue + * unfair spinlock implements a more elaborate scheme. Lock stealing is + * allowed in the following places: + * 1) In the spin_lock and spin_trylock fastpaths + * 2) When spinning in the waiter queue before becoming the queue head + * + * A lock acquirer has only one chance of stealing the lock in the spin_lock + * and spin_trylock fastpath. If the attempt fails for spin_lock, the task + * will be queued in the wait queue. + * + * Even in the wait queue, the task can still attempt to steal the lock + * periodically at a frequency about inversely and logarithmically proportional + * to its distance from the queue head. In other word, the closer it is to + * the queue head, the higher a chance it has of stealing the lock. This + * scheme reduces the load on the lock cacheline while trying to maintain + * a somewhat FIFO way of getting the lock so as to reduce the chance of lock + * starvation. 
+ */ +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +#define DEF_LOOP_CNT(c)int c = 0 +#define INC_LOOP_CNT(c)(c)++ +#define LOOP_CNT(c)c +#define LSTEAL_MIN (1 3) +#define LSTEAL_MAX (1 10) +#define LSTEAL_MIN_MASK(LSTEAL_MIN - 1) +#define LSTEAL_MAX_MASK(LSTEAL_MAX - 1) + +/** + * unfair_init_vars - initialize unfair relevant fields in queue node structure + * @node: Current queue node address + */ +static inline void unfair_init_vars(struct qnode *node) +{ + node-qprev = NULL; + node-prev_tail = 0; + node-lsteal_mask = LSTEAL_MIN_MASK; +} + +/** + * unfair_set_vars - set unfair related fields in the queue node structure + * @node : Current queue node address + * @prev : Previous queue node address + * @prev_tail: Previous tail code + */ +static inline void +unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) +{ + if (!static_key_false(paravirt_unfairlocks_enabled)) + return; + + node-qprev = prev; + node-prev_tail = prev_tail; + /* +* This node will spin double the number of time of the previous node +* before attempting to steal the lock until it reaches a maximum. +*/ + node-lsteal_mask = prev-qhead ? LSTEAL_MIN_MASK : +
[PATCH v10 13/19] unfair qspinlock: Enable lock stealing in lock waiters
The simple unfair queue lock cannot completely solve the lock waiter preemption problem as a preempted CPU at the front of the queue will block forward progress in all the other CPUs behind it in the queue. To allow those CPUs to move forward, it is necessary to enable lock stealing for those lock waiters as well. Enabling those lock waiters to try to steal the lock increases the cacheline pressure on the lock word. As a result, performance can suffer on a workload with heavy spinlock contention. The tables below shows the the performance (jobs/minutes) of other kernel flavors of a 3.14-based kernel at 3000 users of the AIM7 disk workload on a 4-socket Westmere-EX bare-metal system. The ebizzy test was run. AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 simple test-and-set 56250003.20 98.29 5.93 simple unfair 57507993.13 95.91 5.98 qspinlock complex unfair56782333.17 97.40 5.93 qspinlock AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 simple test-and-set593081 30.35 967.55 9.00 simple unfair 22929947.85 222.84 5.89 qspinlock complex unfair 972447 18.51 589.88 6.65 qspinlock Ebizzy -m test kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 simple test-and-set1667 10.00 198.93 2.89 simple unfair 2915 10.00 165.68 4.31 qspinlock complex unfair 1965 10.00 191.96 3.17 qspinlock With heavy spinlock contention, the complex unfair lock is faster than the simple test-and-set lock, but it is still slower than the baseline ticketlock. The table below shows the execution times (in ms) of a spinlock micro-benchmark on the same 4-socket Westmere-EX system. 
# ofTicket Fair Unfair simple Unfair complex taskslock queue lockqueue lock queue lock -- --- ---- - 1 135135 137137 2 890 1082 421663 3 1932 2248 708 1263 4 2829 2819 1030 1806 5 3834 3522 1323 2315 6 4963 4173 1723 2831 7 6299 4875 2067 2878 8 7691 5563 2360 3256 Executing one task per node, the performance data were: # ofTicket Fair Unfair simple Unfair complex nodeslock queue lockqueue lock queue lock -- --- ---- - 1135135 137137 2 4603 1034 670888 3 10940 12087 1389 2041 4 21555 10507 1869 4307 Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 160 ++-- 1 files changed, 154 insertions(+), 6 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 06dd486..0c86a6f 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -166,6 +166,23 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; } +/* + * cmpxchg_tail - Put in the new tail code if it matches the old one + * @lock : Pointer to queue spinlock structure + * @old : The old tail code value + * @new : The new tail code value + * Return: true if operation succeeds, fail otherwise + */ +static __always_inline int +cmpxchg_tail(struct qspinlock *lock, u32 old, u32 new) +{ + struct __qspinlock *l = (void *)lock; + + old = _Q_TAIL_OFFSET; + new = _Q_TAIL_OFFSET; + return cmpxchg(l-tail, old, new) == old; +} + #else /* _Q_PENDING_BITS == 8 */ /** @@ -218,6 +235,35 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) *pval = new; return old; } + +/* + * cmpxchg_tail - Put in the new tail code if it matches the old one + * @lock : Pointer to queue spinlock structure + * @old : The old tail code value + * @new : The new tail code value + *
[PATCH v10 17/19] pvqspinlock: Add qspinlock para-virtualization support
This patch adds base para-virtualization support to the queue spinlock in the same way as was done in the PV ticket lock code. In essence, the lock waiters will spin for a specified number of times (QSPIN_THRESHOLD = 2^14) and then halt themselves. The queue head waiter, unlike the other waiters, will spin 2*QSPIN_THRESHOLD times before halting itself. Before being halted, the queue head waiter will set a flag (_Q_LOCKED_SLOWPATH) in the lock byte to indicate that the unlock slowpath has to be invoked. In the unlock slowpath, the current lock holder will find the queue head by following the previous node pointer links stored in the queue node structure until it finds one that has the qhead flag turned on. It then attempts to kick the CPU of the queue head. After the queue head acquires the lock, it will also check the status of the next node and set the _Q_LOCKED_SLOWPATH flag if it has been halted. Enabling the PV code does have a performance impact on spinlock acquisitions and releases. The following table shows the execution time (in ms) of a spinlock micro-benchmark that does lock/unlock operations 5M times for each task versus the number of contending tasks on a Westmere-EX system.

  # of        Ticket lock              Queue lock
  tasks   PV off/PV on/%Change     PV off/PV on/%Change
  -----   --------------------     --------------------
    1        135/  179/+33%           137/  168/+23%
    2       1045/ 1103/ +6%          1161/ 1248/ +7%
    3       1827/ 2683/+47%          2357/ 2600/+10%
    4       2689/ 4191/+56%          2882/ 3115/ +8%
    5       3736/ 5830/+56%          3493/ 3571/ +2%
    6       4942/ 7609/+54%          4239/ 4198/ -1%
    7       6304/ 9570/+52%          4931/ 4895/ -1%
    8       7736/11323/+46%          5632/ 5588/ -1%

It can be seen that the ticket lock PV code has a fairly big decrease in performance when there are 3 or more contending tasks. The queue spinlock PV code, on the other hand, only has a relatively minor drop in performance with 1-4 contending tasks. With 5 or more contending tasks, there is practically no difference in performance. When coupled with the unfair lock, the queue spinlock can be much faster than the PV ticket lock. 
When both the unfair lock and PV spinlock features is turned on, lock stealing will still be allowed in the fastpath, but not in the slowpath. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/pvqspinlock.h | 306 arch/x86/include/asm/qspinlock.h | 33 kernel/locking/qspinlock.c | 91 ++- 3 files changed, 427 insertions(+), 3 deletions(-) create mode 100644 arch/x86/include/asm/pvqspinlock.h diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h new file mode 100644 index 000..fea21aa --- /dev/null +++ b/arch/x86/include/asm/pvqspinlock.h @@ -0,0 +1,306 @@ +#ifndef _ASM_X86_PVQSPINLOCK_H +#define _ASM_X86_PVQSPINLOCK_H + +/* + * Queue Spinlock Para-Virtualization (PV) Support + * + * +--++-+ next ++ + * | Lock ||Queue|---|Next| + * |Holder|---|Head |---|Node| + * +--+ prev_tail +-+ prev_tail ++ + * + * The PV support code for queue spinlock is roughly the same as that + * of the ticket spinlock. Each CPU waiting for the lock will spin until it + * reaches a threshold. When that happens, it will put itself to halt so + * that the hypervisor can reuse the CPU cycles in some other guests as well + * as returning other hold-up CPUs faster. + * + * A major difference between the two versions of PV spinlock is the fact + * that the spin threshold of the queue spinlock is half of that of the + * ticket spinlock. However, the queue head will spin twice as long as the + * other nodes before it puts itself to halt. The reason for that is to + * increase halting chance of heavily contended locks to favor lightly + * contended locks (queue depth of 1 or less). + * + * There are 2 places where races can happen: + * 1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU + * kicking by the lock holder in the unlock path (in pv_kick_node). + * 2) Halting of the queue node CPU (in pv_queue_spin_check) and the + * the status check by the previous queue head (in pv_halt_check). 
+ * See the comments on those functions to see how the races are being + * addressed. + */ + +/* + * Spin threshold for queue spinlock + */ +#defineQSPIN_THRESHOLD (1U14) +#define MAYHALT_THRESHOLD (QSPIN_THRESHOLD - 0x10) + +/* + * CPU state flags + */ +#define PV_CPU_ACTIVE 1 /* This CPU is active*/ +#define PV_CPU_KICKED 2 /* This CPU is being kicked */ +#define PV_CPU_HALTED -1 /* This CPU is halted*/ + +/* + * Additional fields to be added to the qnode structure + */ +#if CONFIG_NR_CPUS = (1 16) +#define _cpuid_t u32 +#else +#define _cpuid_t u16
[PATCH v10 11/19] qspinlock: Split the MCS queuing code into a separate slowerpath
With the pending addition of more codes to support unfair lock and PV spinlock, the complexity of the slowpath function increases to the point that the number of scratch-pad registers in the x86-64 architecture is not enough and so those additional non-scratch-pad registers will need to be used. This has the downside of requiring saving and restoring of those registers in the prolog and epilog of the slowpath function slowing down the nominally faster pending bit and trylock code path at the beginning of the slowpath function. This patch separates out the actual MCS queuing code into a slowerpath function. This avoids the slow down of the pending bit and trylock code path at the expense of a little bit of additional overhead to the MCS queuing code path. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 120 +-- 1 files changed, 70 insertions(+), 50 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 10e87e1..a14241e 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -335,57 +335,23 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) } /** - * queue_spin_lock_slowpath - acquire the queue spinlock + * queue_spin_lock_slowerpath - a slower path for acquiring queue spinlock * @lock: Pointer to queue spinlock structure - * @val: Current value of the queue spinlock 32-bit word - * - * (queue tail, pending bit, lock bit) - * - * fast :slow :unlock - * : : - * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) - * : | ^.--. / : - * : v \ \| : - * pending :(0,1,1) +-- (0,1,0) \ | : - * : | ^--' | | : - * : v | | : - * uncontended :(n,x,y) +-- (n,0,0) --' | : - * queue : | ^--' | : - * : v | : - * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : - * queue : ^--' : - * - * The pending bit processing is in the trylock_pending() function - * whereas the uncontended and contended queue processing is in the - * queue_spin_lock_slowpath() function. 
+ * @val : Current value of the queue spinlock 32-bit word + * @node: Pointer to the queue node + * @tail: The tail code * + * The reason for splitting a slowerpath from slowpath is to avoid the + * unnecessary overhead of non-scratch pad register pushing and popping + * due to increased complexity with unfair and PV spinlock from slowing + * down the nominally faster pending bit and trylock code path. So this + * function is not inlined. */ -void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock, + u32 val, struct qnode *node, u32 tail) { - struct qnode *prev, *next, *node; - u32 old, tail; - int idx; - - BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); - - if (trylock_pending(lock, val)) - return; /* Lock acquired */ - - node = this_cpu_ptr(qnodes[0]); - idx = node-mcs.count++; - tail = encode_tail(smp_processor_id(), idx); - - node += idx; - node-qhead = 0; - node-mcs.next = NULL; - - /* -* We touched a (possibly) cold cacheline in the per-cpu queue node; -* attempt the trylock once more in the hope someone let go while we -* weren't watching. -*/ - if (queue_spin_trylock(lock)) - goto release; + struct qnode *prev, *next; + u32 old; /* * we already touched the queueing cacheline; don't bother with pending @@ -442,7 +408,7 @@ retry_queue_wait: } old = atomic_cmpxchg(lock-val, val, _Q_LOCKED_VAL); if (old == val) - goto release; /* No contention */ + return; /* No contention */ else if (old _Q_LOCKED_MASK) goto retry_queue_wait; @@ -450,14 +416,68 @@ retry_queue_wait: } /* -* contended path; wait for next, release. +* contended path; wait for next, return. */ while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); arch_mcs_spin_unlock_contended(next-qhead); +} + +/** + * queue_spin_lock_slowpath - acquire the queue spinlock + * @lock: Pointer to queue spinlock structure + * @val: Current value of the queue spinlock
[PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 kernel/locking/qspinlock.c| 107 +--- 2 files changed, 110 insertions(+), 10 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index ed5d89a..4914abe 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -38,6 +38,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -50,7 +58,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -61,6 +73,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK 
(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a49b82b..3e908f7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -48,6 +49,9 @@ * We can further change the first spinner to spin on a bit in the lock word * instead of its node; whereby avoiding the need to carry a node from lock to * unlock, and preserving API. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. */ #include mcs_spinlock.h @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. 
+ * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * @pval : Pointer to current value of the queue spinlock 32-bit word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 +xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ +
[PATCH v10 09/19] qspinlock: Prepare for unfair lock support
If unfair lock is supported, the lock acquisition loop at the end of the
queue_spin_lock_slowpath() function may need to detect the fact that the
lock can be stolen. Code is added for the stolen lock detection.

A new qhead macro is also defined as a shorthand for mcs.locked.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   26 +++++++++++++++++++++---------
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e98d7d4..9e7659e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -64,6 +64,7 @@
 struct qnode {
 	struct mcs_spinlock mcs;
 };
+#define qhead	mcs.locked	/* The queue head flag */
 
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
@@ -216,18 +217,20 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval)
 
 /**
  * get_qlock - Set the lock bit and own the lock
- * @lock: Pointer to queue spinlock structure
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 otherwise
  *
  * This routine should only be called when the caller is the only one
  * entitled to acquire the lock.
*/ -static __always_inline void get_qlock(struct qspinlock *lock) +static __always_inline int get_qlock(struct qspinlock *lock) { struct __qspinlock *l = (void *)lock; barrier(); ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; barrier(); + return 1; } /** @@ -365,7 +368,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) tail = encode_tail(smp_processor_id(), idx); node += idx; - node-mcs.locked = 0; + node-qhead = 0; node-mcs.next = NULL; /* @@ -391,7 +394,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) prev = decode_tail(old); ACCESS_ONCE(prev-mcs.next) = (struct mcs_spinlock *)node; - while (!smp_load_acquire(node-mcs.locked)) + while (!smp_load_acquire(node-qhead)) arch_mutex_cpu_relax(); } @@ -403,6 +406,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,x,y - *,0,0 */ +retry_queue_wait: while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) arch_mutex_cpu_relax(); @@ -419,12 +423,20 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ for (;;) { if (val != tail) { - get_qlock(lock); - break; + /* +* The get_qlock function will only failed if the +* lock was stolen. +*/ + if (get_qlock(lock)) + break; + else + goto retry_queue_wait; } old = atomic_cmpxchg(lock-val, val, _Q_LOCKED_VAL); if (old == val) goto release; /* No contention */ + else if (old _Q_LOCKED_MASK) + goto retry_queue_wait; val = old; } @@ -435,7 +447,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(next-mcs.locked); + arch_mcs_spin_unlock_contended(next-qhead); release: /* -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 04/19] qspinlock: Extract out the exchange of tail code word
This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 61 + 2 files changed, 41 insertions(+), 22 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index bd25081..ed5d89a 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -61,6 +61,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 6467bfc..a49b82b 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -86,6 +86,34 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * @pval : Pointer to current value of the queue spinlock 32-bit word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 +xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) +{ + u32 old, new, val = *pval; + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + *pval = new; + return old; +} + +/** * trylock_pending - try to acquire queue spinlock using the pending bit * @lock : Pointer to queue spinlock structure * @pval : Pointer to value of the queue spinlock 32-bit word @@ 
-196,36 +224,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) node-next = NULL; /* -* we already touched the queueing cacheline; don't bother with pending -* stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; trylock -* p,y,x - n,y,x ; prev = xchg(lock, node) +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* we already touched the queueing cacheline; don't bother with pending +* stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail, val); /* * if there was a previous node; link it and wait. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev = decode_tail(old); ACCESS_ONCE(prev-next) = node; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 01/19] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair. It has about the same speed in the single-thread case and can be much faster under high contention, especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention, where the average queue depth is around 1-3, will this queue spinlock potentially be a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores, as the chance of spinlock contention is much higher on those machines. The cost of contention is also higher because of slower inter-node memory traffic. Because spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending on one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI contexts, a CPU can have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit value (comprising CPU number and queue node index), leaving one byte for the lock. Please note that the queue node is only needed while waiting for the lock. Once the lock is acquired, the queue node can be released for later reuse. 
Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock.h | 118 include/asm-generic/qspinlock_types.h | 61 ++ kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 197 + 6 files changed, 385 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..e8a7ae8 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,118 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. 
Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32 val; + +
[PATCH v10 03/19] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 121 +++-- 2 files changed, 110 insertions(+), 23 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index f66f845..bd25081 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -39,8 +39,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -48,7 +49,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -57,5 +62,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b97a1ad..6467bfc 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -83,23 +83,97 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + +/** + * trylock_pending - try 
to acquire queue spinlock using the pending bit + * @lock : Pointer to queue spinlock structure + * @pval : Pointer to value of the queue spinlock 32-bit word + * Return: 1 if lock acquired, 0 otherwise + */ +static inline int trylock_pending(struct qspinlock *lock, u32 *pval) +{ + u32 old, new, val = *pval; + + /* +* trylock || pending +* +* 0,0,0 - 0,0,1 ; trylock +* 0,0,1 - 0,1,1 ; pending +*/ + for (;;) { + /* +* If we observe any contention; queue. +*/ + if (val ~_Q_LOCKED_MASK) + return 0; + + new = _Q_LOCKED_VAL; + if (val == new) + new |= _Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + *pval = val = old; + } + + /* +* we won the trylock +*/ + if (new == _Q_LOCKED_VAL) + return 1; + + /* +* we're pending, wait for the owner to go away. +* +* *,1,1 - *,1,0 +*/ + while ((val = atomic_read(lock-val)) _Q_LOCKED_MASK) + arch_mutex_cpu_relax(); + + /* +* take ownership and clear the pending bit. +* +* *,1,0 - *,0,1 +*/ + for (;;) { + new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return 1; +} + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock bit) + * (queue tail, pending bit, lock bit) + * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -'
[PATCH v10 08/19] qspinlock: Make a new qnode structure to support virtualization
In order to support additional virtualization features like unfair lock and para-virtualized spinlock, it is necessary to store additional CPU specific data into the queue node structure. As a result, a new qnode structure is created and the mcs_spinlock structure is now part of the new structure. It is also necessary to expand arch_mcs_spin_lock_contended() to the underlying while loop as additional code will need to be inserted into the loop. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 36 +++- 1 files changed, 23 insertions(+), 13 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0ee1a23..e98d7d4 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -57,12 +57,21 @@ #include mcs_spinlock.h /* + * To have additional features for better virtualization support, it is + * necessary to store additional data in the queue node structure. So + * a new queue node structure will have to be defined and used here. + */ +struct qnode { + struct mcs_spinlock mcs; +}; + +/* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one cacheline. 
*/ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[4]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -79,12 +88,12 @@ static inline u32 encode_tail(int cpu, int idx) return tail; } -static inline struct mcs_spinlock *decode_tail(u32 tail) +static inline struct qnode *decode_tail(u32 tail) { int cpu = (tail _Q_TAIL_CPU_OFFSET) - 1; int idx = (tail _Q_TAIL_IDX_MASK) _Q_TAIL_IDX_OFFSET; - return per_cpu_ptr(mcs_nodes[idx], cpu); + return per_cpu_ptr(qnodes[idx], cpu); } #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) @@ -342,7 +351,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { - struct mcs_spinlock *prev, *next, *node; + struct qnode *prev, *next, *node; u32 old, tail; int idx; @@ -351,13 +360,13 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) if (trylock_pending(lock, val)) return; /* Lock acquired */ - node = this_cpu_ptr(mcs_nodes[0]); - idx = node-count++; + node = this_cpu_ptr(qnodes[0]); + idx = node-mcs.count++; tail = encode_tail(smp_processor_id(), idx); node += idx; - node-locked = 0; - node-next = NULL; + node-mcs.locked = 0; + node-mcs.next = NULL; /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -380,9 +389,10 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ if (old _Q_TAIL_MASK) { prev = decode_tail(old); - ACCESS_ONCE(prev-next) = node; + ACCESS_ONCE(prev-mcs.next) = (struct mcs_spinlock *)node; - arch_mcs_spin_lock_contended(node-locked); + while (!smp_load_acquire(node-mcs.locked)) + arch_mutex_cpu_relax(); } /* @@ -422,15 +432,15 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) /* * contended path; wait for next, release. 
*/ - while (!(next = ACCESS_ONCE(node-next))) + while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(next-locked); + arch_mcs_spin_unlock_contended(next-mcs.locked); release: /* * release the node */ - this_cpu_dec(mcs_nodes[0].count); + this_cpu_dec(qnodes[0].mcs.count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 06/19] qspinlock: prolong the stay in the pending bit path
There is a problem in the current trylock_pending() function. When the lock is free, but the pending bit holder hasn't grabbed the lock and cleared the pending bit yet, the trylock_pending() function will fail. As a result, the regular queuing code path will be used most of the time, even when there are only 2 tasks contending for the lock. Assuming that the pending bit holder is going to get the lock and clear the pending bit soon, it is actually better to wait than to be queued up, which has a higher overhead.

This patch modifies the trylock_pending() function to wait until the pending bit holder gets the lock and clears the pending bit. In case both the lock and pending bits are set, the new code will also wait a bit to see if either one is cleared. If there is no change, it will quit and be queued.

The following tables show the before-patch execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of loads - standalone (lock and protected data in different cachelines) and embedded (lock and protected data in the same cacheline).

  [Standalone/Embedded - same node]
  # of tasks   Ticket lock   Queue lock    %Change
  ----------   -----------   ----------    -------
       1        135/ 111      135/ 101       0%/  -9%
       2        890/ 779     1885/1990    +112%/+156%
       3       1932/1859     2333/2341     +21%/ +26%
       4       2829/2726     2900/2923      +3%/  +7%
       5       3834/3761     3655/3648      -5%/  -3%
       6       4963/4976     4336/4326     -13%/ -13%
       7       6299/6269     5057/5064     -20%/ -19%
       8       7691/7569     5786/5798     -25%/ -23%

With 1 task per NUMA node, the execution times are:

  [Standalone - different nodes]
  # of nodes   Ticket lock   Queue lock   %Change
  ----------   -----------   ----------   -------
       1           135           135         0%
       2          4604          5087       +10%
       3         10940         12224       +12%
       4         21555         10555       -51%

It can be seen that the queue spinlock is slower than the ticket spinlock when there are 2 or 3 contending tasks. In all the other cases, the queue spinlock is either equal to or faster than the ticket spinlock. 
With this patch, the performance data for 2 contending tasks are: [Standalone/Embedded] # of tasksTicket lock Queue lock %Change ----- -- --- 2 890/779984/871+11%/+12% [Standalone - different nodes] # of nodesTicket lock Queue lock %Change ----- -- --- 2 4604 1364 -70% It can be seen that the queue spinlock performance for 2 contending tasks is now comparable to ticket spinlock on the same node, but much faster when in different nodes. With 3 contending tasks, however, the ticket spinlock is still quite a bit faster. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 31 +-- 1 files changed, 29 insertions(+), 2 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 3e908f7..e734acb 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -212,6 +212,7 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) static inline int trylock_pending(struct qspinlock *lock, u32 *pval) { u32 old, new, val = *pval; + int retry = 1; /* * trylock || pending @@ -221,11 +222,37 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) */ for (;;) { /* -* If we observe any contention; queue. +* If we observe that the queue is not empty, +* return and be queued. */ - if (val ~_Q_LOCKED_MASK) + if (val _Q_TAIL_MASK) return 0; + if (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)) { + /* +* If both the lock and pending bits are set, we wait +* a while to see if that either bit will be cleared. +* If that is no change, we return and be queued. +*/ + if (!retry) + return 0; + retry--; + cpu_relax(); + cpu_relax(); + *pval = val = atomic_read(lock-val); + continue; + } else if (val == _Q_PENDING_VAL) { + /* +* Pending bit is set, but not the lock bit. +* Assuming
[PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough, as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set.

With that change, there is some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westmere-EX machine, as shown in the tables below.

  [Standalone/Embedded - same node]
  # of tasks   Before patch   After patch   %Change
  ----------   ------------   -----------   -------
       3        2324/2321      2248/2265    -3%/-2%
       4        2890/2896      2819/2831    -2%/-2%
       5        3611/3595      3522/3512    -2%/-2%
       6        4281/4276      4173/4160    -3%/-3%
       7        5018/5001      4875/4861    -3%/-3%
       8        5759/5750      5563/5568    -3%/-3%

  [Standalone/Embedded - different nodes]
  # of tasks   Before patch   After patch   %Change
  ----------   ------------   -----------   -------
       3       12242/12237    12087/12093   -1%/-1%
       4       10688/10696    10507/10521   -2%/-2%

It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and essentially closed the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were:

  AIM7 XFS Disk Test
  kernel         JPM       Real Time   Sys Time   Usr Time
  ----------   -------     ---------   --------   --------
  ticketlock   5678233       3.17        96.61      5.81
  qspinlock    5750799       3.13        94.83      5.97

  AIM7 EXT4 Disk Test
  kernel         JPM       Real Time   Sys Time   Usr Time
  ----------   -------     ---------   --------   --------
  ticketlock   1114551      16.15       509.72      7.11
  qspinlock    2184466       8.24       232.99      6.01

The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. 
The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 61 +++ 1 files changed, 44 insertions(+), 17 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index e734acb..0ee1a23 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,23 +94,29 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * can allow better optimization of the lock acquisition for the pending * bit holder. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -200,6 +206,22 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) #endif /* _Q_PENDING_BITS == 8 */ /** + * get_qlock - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * This routine should only be called when the caller is the only one + * entitled to acquire the lock. + */ +static __always_inline void get_qlock(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + barrier(); + ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; + barrier(); +} + +/** * trylock_pending - try to acquire queue spinlock using the pending bit * @lock : Pointer to queue spinlock structure * @pval : Pointer to value of the queue spinlock 32-bit word @@ -321,7 +343,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
[PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support
v9->v10:
 - Make some minor changes to qspinlock.c to accommodate review feedback.
 - Change author to PeterZ for 2 of the patches.
 - Include Raghavendra KT's test results in patch 18.

v8->v9:
 - Integrate PeterZ's version of the queue spinlock patch with some
   modifications: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
 - Break the more complex patches into smaller ones to ease review effort.
 - Fix a race condition in the PV qspinlock code.

v7->v8:
 - Remove one unneeded atomic operation from the slowpath, thus
   improving performance.
 - Simplify some of the code and add more comments.
 - Test for the X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
   the unfair lock.
 - Reduce the unfair lock slowpath lock stealing frequency depending on
   its distance from the queue head.
 - Add performance data for the IvyBridge-EX CPU.

v6->v7:
 - Remove an atomic operation from the 2-task contending code.
 - Shorten the names of some macros.
 - Make the queue waiter attempt to steal the lock when unfair lock is
   enabled.
 - Remove the lock holder kick from the PV code and fix a race condition.
 - Run the unfair lock PV code on overcommitted KVM guests to collect
   performance data.

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV support
   code.

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.
v3->v4:
 - Remove debugging code and fix a configuration error.
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better.
 - Add an x86 version of asm/qspinlock.h for holding x86-specific
   optimizations.
 - Add an optimized x86 code path for 2 contending tasks to improve low
   contention performance.

v2->v3:
 - Simplify the code by using numerous mode only without an unfair
   option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs.
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of the queue head CPU at unlock time for the
   non-numerous CPU mode.

This patch set has 3 different sections:
 1) Patches 1-7: Introduces a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the size
    of the spinlock data structure. As a result, critical kernel data
    structures that embed spinlock won't increase in size and break data
    alignments.
 2) Patches 8-13: Enables the use of unfair queue spinlock in a virtual
    guest. This can resolve some of the locking-related performance
    issues due to the fact that the next CPU to get the lock may have
    been scheduled out for a period of time.
 3) Patches 14-19: Enable qspinlock para-virtualization support by
    halting the waiting CPUs after spinning for a certain amount of
    time. The unlock code will detect a sleeping waiter and wake it up.
    This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket spinlock in the uncontended case. Its performance can be much better with moderate to heavy contention.
This patch set has the potential of improving the performance of all workloads that have moderate to heavy spinlock contention. The queue spinlock is especially suitable for NUMA machines with at least 2 sockets, though a noticeable performance benefit probably won't show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock contention problem. Those need to be solved by refactoring the code to make more efficient use of the lock, or by using locks of finer granularity. The main purpose is to make the lock contention problems more tolerable until someone can spend the time and effort to fix them.

Peter Zijlstra (2):
  qspinlock: Add pending bit
  qspinlock: Optimize for smaller NR_CPUS

Waiman Long (17):
  qspinlock: A simple generic 4-byte queue spinlock
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: Extract out the exchange of tail code word
  qspinlock: prolong the stay in the pending bit path
  qspinlock:
[PATCH 2/9] arm64: move DBG_MDSCR_* to asm/debug-monitors.h
In order to be able to use the DBG_MDSCR_* macros from the KVM code, move the relevant definitions to the obvious include file. Also move the debug_el enum to a portion of the file that is guarded by #ifndef __ASSEMBLY__, in order to be able to use that file from assembly code.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/include/asm/debug-monitors.h | 19 ++-
 arch/arm64/kernel/debug-monitors.c      |  9 -
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/debug-monitors.h b/arch/arm64/include/asm/debug-monitors.h
index 6e9b5b3..7fb3437 100644
--- a/arch/arm64/include/asm/debug-monitors.h
+++ b/arch/arm64/include/asm/debug-monitors.h
@@ -18,6 +18,15 @@

 #ifdef __KERNEL__

+/* Low-level stepping controls. */
+#define DBG_MDSCR_SS		(1 << 0)
+#define DBG_SPSR_SS		(1 << 21)
+
+/* MDSCR_EL1 enabling bits */
+#define DBG_MDSCR_KDE		(1 << 13)
+#define DBG_MDSCR_MDE		(1 << 15)
+#define DBG_MDSCR_MASK		~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
+
 #define	DBG_ESR_EVT(x)		(((x) >> 27) & 0x7)

 /* AArch64 */
@@ -73,11 +82,6 @@
 #define CACHE_FLUSH_IS_SAFE	1

-enum debug_el {
-	DBG_ACTIVE_EL0 = 0,
-	DBG_ACTIVE_EL1,
-};
-
 /* AArch32 */
 #define DBG_ESR_EVT_BKPT	0x4
 #define DBG_ESR_EVT_VECC	0x5
@@ -115,6 +119,11 @@ void unregister_break_hook(struct break_hook *hook);

 u8 debug_monitors_arch(void);

+enum debug_el {
+	DBG_ACTIVE_EL0 = 0,
+	DBG_ACTIVE_EL1,
+};
+
 void enable_debug_monitors(enum debug_el el);
 void disable_debug_monitors(enum debug_el el);

diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index a7fb874..e022f87 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -30,15 +30,6 @@
 #include <asm/cputype.h>
 #include <asm/system_misc.h>

-/* Low-level stepping controls. */
-#define DBG_MDSCR_SS		(1 << 0)
-#define DBG_SPSR_SS		(1 << 21)
-
-/* MDSCR_EL1 enabling bits */
-#define DBG_MDSCR_KDE		(1 << 13)
-#define DBG_MDSCR_MDE		(1 << 15)
-#define DBG_MDSCR_MASK		~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
-
 /* Determine debug architecture. */
 u8 debug_monitors_arch(void)
 {
--
1.8.3.4
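For readers unfamiliar with these masks, here is a small sketch of how the moved DBG_MDSCR_* definitions are typically combined. The `mdscr_disable_debug` helper is hypothetical, not part of the patch; only the macro definitions come from it.

```c
#include <stdint.h>

/* Definitions as moved into asm/debug-monitors.h by this patch. */
#define DBG_MDSCR_SS   (1 << 0)
#define DBG_MDSCR_KDE  (1 << 13)
#define DBG_MDSCR_MDE  (1 << 15)
#define DBG_MDSCR_MASK ~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)

/* Hypothetical helper: clear the monitor/kernel debug enable bits
 * while preserving the rest of an MDSCR_EL1 value (e.g. the SS bit). */
static uint32_t mdscr_disable_debug(uint32_t mdscr)
{
    return mdscr & DBG_MDSCR_MASK;
}
```

Because DBG_MDSCR_MASK expands to `~(KDE | MDE)`, ANDing with it removes exactly those two enable bits and nothing else, which is why KVM wants the macros visible outside debug-monitors.c.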
[PATCH 1/9] arm64: KVM: rename pm_fake handler to trap_wi_raz
pm_fake doesn't quite describe what the handler does (ignoring writes and returning 0 for reads). As we're about to use it (a lot) in a different context, rename it with an (admittedly cryptic) name that makes sense for all users.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/sys_regs.c | 83 ---
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 0324458..fc8d4e3 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -163,18 +163,9 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
 	return true;
 }

-/*
- * We could trap ID_DFR0 and tell the guest we don't support performance
- * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was
- * NAKed, so it will read the PMCR anyway.
- *
- * Therefore we tell the guest we have 0 counters. Unfortunately, we
- * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
- * all PM registers, which doesn't crash the guest kernel at least.
- */
-static bool pm_fake(struct kvm_vcpu *vcpu,
-		    const struct sys_reg_params *p,
-		    const struct sys_reg_desc *r)
+static bool trap_wi_raz(struct kvm_vcpu *vcpu,
+			const struct sys_reg_params *p,
+			const struct sys_reg_desc *r)
 {
 	if (p->is_write)
 		return ignore_write(vcpu, p);
@@ -201,6 +192,17 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 /*
  * Architected system registers.
  * Important: Must be sorted ascending by Op0, Op1, CRn, CRm, Op2
+ *
+ * We could trap ID_DFR0 and tell the guest we don't support performance
+ * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was
+ * NAKed, so it will read the PMCR anyway.
+ *
+ * Therefore we tell the guest we have 0 counters. Unfortunately, we
+ * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
+ * all PM registers, which doesn't crash the guest kernel at least.
+ *
+ * Same goes for the whole debug infrastructure, which probably breaks
+ * some guest functionality. This should be fixed.
  */
 static const struct sys_reg_desc sys_reg_descs[] = {
 	/* DC ISW */
@@ -260,10 +262,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	/* PMINTENSET_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMINTENCLR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },

 	/* MAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
@@ -292,43 +294,43 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	/* PMCR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCNTENSET_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCNTENCLR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMOVSCLR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b011),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMSWINC_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b100),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMSELR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCEID0_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCEID1_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b111),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCCNTR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMXEVTYPER_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMXEVCNTR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMUSERENR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/*
PMOVSSET_EL0 */ { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b011), - pm_fake }, + trap_wi_raz }, /* TPIDR_EL0 */ { Op0(0b11), Op1(0b011), CRn(0b1101), CRm(0b), Op2(0b010), @@ -374,19 +376,20 @@ static const struct sys_reg_desc cp15_regs[] = { { Op1( 0), CRn( 7), CRm(10), Op2( 2), access_dcsw }, { Op1( 0), CRn( 7), CRm(14), Op2( 2), access_dcsw
[PATCH 9/9] arm64: KVM: enable trapping of all debug registers
Enable trapping of the debug registers, preventing guests from messing with the host state (and allowing guests to use the debug infrastructure as well).

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/hyp.S | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index f9d5a1d..037d5d9 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -753,6 +753,14 @@ __kvm_hyp_code_start:
 	mrs	x2, mdcr_el2
 	and	x2, x2, #MDCR_EL2_HPMN_MASK
 	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+	orr	x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
+
+	// Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
+	// if not dirty.
+	ldr	x3, [x0, #VCPU_DEBUG_FLAGS]
+	tbnz	x3, #KVM_ARM64_DEBUG_DIRTY_SHIFT, 1f
+	orr	x2, x2, #MDCR_EL2_TDA
+1:
 	msr	mdcr_el2, x2
 .endm
--
1.8.3.4
[PATCH 4/9] arm64: KVM: common infrastructure for handling AArch32 CP14/CP15
As we're about to trap a bunch of CP14 registers, let's rework the CP15 handling so it can be generalized and work with multiple tables. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h| 2 +- arch/arm64/include/asm/kvm_coproc.h | 3 +- arch/arm64/include/asm/kvm_host.h | 9 ++- arch/arm64/kvm/handle_exit.c| 4 +- arch/arm64/kvm/sys_regs.c | 121 +--- 5 files changed, 111 insertions(+), 28 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index e6b159a..12f9dd7 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -93,7 +93,7 @@ #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */ #define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */ #define c14_CNTKCTL(CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */ -#define NR_CP15_REGS (NR_SYS_REGS * 2) +#define NR_COPRO_REGS (NR_SYS_REGS * 2) #define ARM_EXCEPTION_IRQ0 #define ARM_EXCEPTION_TRAP 1 diff --git a/arch/arm64/include/asm/kvm_coproc.h b/arch/arm64/include/asm/kvm_coproc.h index 9a59301..0b52377 100644 --- a/arch/arm64/include/asm/kvm_coproc.h +++ b/arch/arm64/include/asm/kvm_coproc.h @@ -39,7 +39,8 @@ void kvm_register_target_sys_reg_table(unsigned int target, struct kvm_sys_reg_target_table *table); int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run); -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 4737961..31cff7a 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ 
b/arch/arm64/include/asm/kvm_host.h @@ -86,7 +86,7 @@ struct kvm_cpu_context { struct kvm_regs gp_regs; union { u64 sys_regs[NR_SYS_REGS]; - u32 cp15[NR_CP15_REGS]; + u32 copro[NR_COPRO_REGS]; }; }; @@ -141,7 +141,12 @@ struct kvm_vcpu_arch { #define vcpu_gp_regs(v)((v)-arch.ctxt.gp_regs) #define vcpu_sys_reg(v,r) ((v)-arch.ctxt.sys_regs[(r)]) -#define vcpu_cp15(v,r) ((v)-arch.ctxt.cp15[(r)]) +/* + * CP14 and CP15 live in the same array, as they are backed by the + * same system registers. + */ +#define vcpu_cp14(v,r) ((v)-arch.ctxt.copro[(r)]) +#define vcpu_cp15(v,r) ((v)-arch.ctxt.copro[(r)]) struct kvm_vm_stat { u32 remote_tlb_flush; diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index 7bc41ea..f0ca49f 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -69,9 +69,9 @@ static exit_handle_fn arm_exit_handlers[] = { [ESR_EL2_EC_WFI]= kvm_handle_wfx, [ESR_EL2_EC_CP15_32]= kvm_handle_cp15_32, [ESR_EL2_EC_CP15_64]= kvm_handle_cp15_64, - [ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_access, + [ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_32, [ESR_EL2_EC_CP14_LS]= kvm_handle_cp14_load_store, - [ESR_EL2_EC_CP14_64]= kvm_handle_cp14_access, + [ESR_EL2_EC_CP14_64]= kvm_handle_cp14_64, [ESR_EL2_EC_HVC32] = handle_hvc, [ESR_EL2_EC_SMC32] = handle_smc, [ESR_EL2_EC_HVC64] = handle_hvc, diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 618d4fb..feafd8d 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -474,6 +474,10 @@ static const struct sys_reg_desc sys_reg_descs[] = { NULL, reset_val, FPEXC32_EL2, 0x70 }, }; +/* Trapped cp14 registers */ +static const struct sys_reg_desc cp14_regs[] = { +}; + /* * Trapped cp15 registers. 
TTBR0/TTBR1 get a double encoding, * depending on the way they are accessed (as a 32bit or a 64bit @@ -581,26 +585,19 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run) return 1; } -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run) -{ - kvm_inject_undefined(vcpu); - return 1; -} - -static void emulate_cp15(struct kvm_vcpu *vcpu, -const struct sys_reg_params *params) +static int emulate_cp(struct kvm_vcpu *vcpu, + const struct sys_reg_params *params, + const struct sys_reg_desc *table, + size_t num) { - size_t num; - const struct sys_reg_desc *table, *r; + const struct sys_reg_desc *r; - table = get_target_table(vcpu-arch.target, false, num); + if (!table) +
[PATCH 8/9] arm64: KVM: implement lazy world switch for debug registers
Implement switching of the debug registers. While the number of registers is massive, CPUs usually don't implement them all (A57 has 6 breakpoints and 4 watchpoints, which gives us a total of 22 registers only). Also, we only save/restore them when MDSCR_EL1 has debug enabled, or when we've flagged the debug registers as dirty. It means that most of the time, we only save/restore MDSCR_EL1. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/kernel/asm-offsets.c | 1 + arch/arm64/kvm/hyp.S| 449 +++- 2 files changed, 444 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c index 646f888..ae73a83 100644 --- a/arch/arm64/kernel/asm-offsets.c +++ b/arch/arm64/kernel/asm-offsets.c @@ -120,6 +120,7 @@ int main(void) DEFINE(VCPU_ESR_EL2, offsetof(struct kvm_vcpu, arch.fault.esr_el2)); DEFINE(VCPU_FAR_EL2, offsetof(struct kvm_vcpu, arch.fault.far_el2)); DEFINE(VCPU_HPFAR_EL2, offsetof(struct kvm_vcpu, arch.fault.hpfar_el2)); + DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags)); DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2)); DEFINE(VCPU_IRQ_LINES, offsetof(struct kvm_vcpu, arch.irq_lines)); DEFINE(VCPU_HOST_CONTEXT,offsetof(struct kvm_vcpu, arch.host_cpu_context)); diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index 2c56012..f9d5a1d 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -21,6 +21,7 @@ #include asm/assembler.h #include asm/memory.h #include asm/asm-offsets.h +#include asm/debug-monitors.h #include asm/fpsimdmacros.h #include asm/kvm.h #include asm/kvm_asm.h @@ -215,6 +216,7 @@ __kvm_hyp_code_start: mrs x22,amair_el1 mrs x23,cntkctl_el1 mrs x24,par_el1 + mrs x25,mdscr_el1 stp x4, x5, [x3] stp x6, x7, [x3, #16] @@ -226,7 +228,202 @@ __kvm_hyp_code_start: stp x18, x19, [x3, #112] stp x20, x21, [x3, #128] stp x22, x23, [x3, #144] - str x24, [x3, #160] + stp x24, x25, [x3, #160] +.endm + +.macro save_debug + // x2: base address for cpu context 
+ // x3: tmp register + + mrs x26, id_aa64dfr0_el1 + ubfxx24, x26, #12, #4 // Extract BRPs + ubfxx25, x26, #20, #4 // Extract WRPs + mov w26, #15 + sub w24, w26, w24 // How many BPs to skip + sub w25, w26, w25 // How many WPs to skip + + add x3, x2, #CPU_SYSREG_OFFSET(DBGBCR0_EL1) + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 +1: + mrs x20, dbgbcr15_el1 + mrs x19, dbgbcr14_el1 + mrs x18, dbgbcr13_el1 + mrs x17, dbgbcr12_el1 + mrs x16, dbgbcr11_el1 + mrs x15, dbgbcr10_el1 + mrs x14, dbgbcr9_el1 + mrs x13, dbgbcr8_el1 + mrs x12, dbgbcr7_el1 + mrs x11, dbgbcr6_el1 + mrs x10, dbgbcr5_el1 + mrs x9, dbgbcr4_el1 + mrs x8, dbgbcr3_el1 + mrs x7, dbgbcr2_el1 + mrs x6, dbgbcr1_el1 + mrs x5, dbgbcr0_el1 + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 + +1: + str x20, [x3, #(15 * 8)] + str x19, [x3, #(14 * 8)] + str x18, [x3, #(13 * 8)] + str x17, [x3, #(12 * 8)] + str x16, [x3, #(11 * 8)] + str x15, [x3, #(10 * 8)] + str x14, [x3, #(9 * 8)] + str x13, [x3, #(8 * 8)] + str x12, [x3, #(7 * 8)] + str x11, [x3, #(6 * 8)] + str x10, [x3, #(5 * 8)] + str x9, [x3, #(4 * 8)] + str x8, [x3, #(3 * 8)] + str x7, [x3, #(2 * 8)] + str x6, [x3, #(1 * 8)] + str x5, [x3, #(0 * 8)] + + add x3, x2, #CPU_SYSREG_OFFSET(DBGBVR0_EL1) + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 +1: + mrs x20, dbgbvr15_el1 + mrs x19, dbgbvr14_el1 + mrs x18, dbgbvr13_el1 + mrs x17, dbgbvr12_el1 + mrs x16, dbgbvr11_el1 + mrs x15, dbgbvr10_el1 + mrs x14, dbgbvr9_el1 + mrs x13, dbgbvr8_el1 + mrs x12, dbgbvr7_el1 + mrs x11, dbgbvr6_el1 + mrs x10, dbgbvr5_el1 + mrs x9, dbgbvr4_el1 + mrs x8, dbgbvr3_el1 + mrs x7, dbgbvr2_el1 + mrs x6, dbgbvr1_el1 + mrs x5, dbgbvr0_el1 + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 + +1: + str x20, [x3, #(15 * 8)] + str x19, [x3, #(14 * 8)] + str x18, [x3, #(13 * 8)] + str x17, [x3, #(12 * 8)] + str x16, [x3, #(11 * 8)] + str x15, [x3, #(10 * 8)] + str x14, [x3, #(9 * 8)] + str x13,
[PATCH 6/9] arm64: KVM: check ordering of all system register tables
We now have multiple tables for the various system registers we trap. Make sure we check the order of all of them, as it is critical that we get the order right (been there, done that...).

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/sys_regs.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 91ca0e4..c27a5cd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1280,14 +1280,32 @@ int kvm_arm_copy_sys_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
 	return write_demux_regids(uindices);
 }

+static int check_sysreg_table(const struct sys_reg_desc *table, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 1; i < n; i++) {
+		if (cmp_sys_reg(&table[i-1], &table[i]) >= 0) {
+			kvm_err("sys_reg table %p out of order (%d)\n", table, i - 1);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 void kvm_sys_reg_table_init(void)
 {
 	unsigned int i;
 	struct sys_reg_desc clidr;

 	/* Make sure tables are unique and in order. */
-	for (i = 1; i < ARRAY_SIZE(sys_reg_descs); i++)
-		BUG_ON(cmp_sys_reg(&sys_reg_descs[i-1], &sys_reg_descs[i]) >= 0);
+	BUG_ON(check_sysreg_table(sys_reg_descs, ARRAY_SIZE(sys_reg_descs)));
+	BUG_ON(check_sysreg_table(cp14_regs, ARRAY_SIZE(cp14_regs)));
+	BUG_ON(check_sysreg_table(cp14_64_regs, ARRAY_SIZE(cp14_64_regs)));
+	BUG_ON(check_sysreg_table(cp15_regs, ARRAY_SIZE(cp15_regs)));
+	BUG_ON(check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs)));
+	BUG_ON(check_sysreg_table(invariant_sys_regs, ARRAY_SIZE(invariant_sys_regs)));

 	/* We abuse the reset function to overwrite the table itself. */
 	for (i = 0; i < ARRAY_SIZE(invariant_sys_regs); i++)
--
1.8.3.4
[PATCH 5/9] arm64: KVM: use separate tables for AArch32 32 and 64bit traps
An interesting feature of the CP14 encoding is that there is an overlap between 32 and 64bit registers, meaning they cannot live in the same table as we did for CP15. Create separate tables for 64bit CP14 and CP15 registers, and let the top level handler use the right one. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/kvm/sys_regs.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index feafd8d..91ca0e4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -478,13 +478,16 @@ static const struct sys_reg_desc sys_reg_descs[] = { static const struct sys_reg_desc cp14_regs[] = { }; +/* Trapped cp14 64bit registers */ +static const struct sys_reg_desc cp14_64_regs[] = { +}; + /* * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding, * depending on the way they are accessed (as a 32bit or a 64bit * register). */ static const struct sys_reg_desc cp15_regs[] = { - { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 0), CRn( 1), CRm( 0), Op2( 0), access_sctlr, NULL, c1_SCTLR }, { Op1( 0), CRn( 2), CRm( 0), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 0), CRn( 2), CRm( 0), Op2( 1), access_vm_reg, NULL, c2_TTBR1 }, @@ -525,6 +528,10 @@ static const struct sys_reg_desc cp15_regs[] = { { Op1( 0), CRn(10), CRm( 3), Op2( 1), access_vm_reg, NULL, c10_AMAIR1 }, { Op1( 0), CRn(13), CRm( 0), Op2( 1), access_vm_reg, NULL, c13_CID }, +}; + +static const struct sys_reg_desc cp15_64_regs[] = { + { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR1 }, }; @@ -740,7 +747,7 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run) target_specific = get_target_table(vcpu-arch.target, false, num); return kvm_handle_cp_64(vcpu, - cp15_regs, ARRAY_SIZE(cp15_regs), + cp15_64_regs, ARRAY_SIZE(cp15_64_regs), target_specific, num); } @@ -758,7 +765,7 @@ 
int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run)

 int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	return kvm_handle_cp_64(vcpu,
-				cp14_regs, ARRAY_SIZE(cp14_regs),
+				cp14_64_regs, ARRAY_SIZE(cp14_64_regs),
 				NULL, 0);
 }
--
1.8.3.4
Re: [PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
On 5/7/14, 4:57 PM, Paolo Bonzini wrote: Il 07/05/2014 14:32, Nadav Amit ha scritto: In long-mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As the Intel SDM says: "In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base."

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e8a5840..743e8e3 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 	u16 sel;
 	unsigned cpl;

-	la = seg_base(ctxt, addr.seg) + addr.ea;
+	la = seg_base(ctxt, addr.seg) +
+	     (ctxt->ad_bytes == 8 ? addr.ea : (u32)addr.ea);

I think you need fetch || ctxt->ad_bytes == 8 here.

Paolo

Yes. I did not test the fetch scenario. I intend to do so soon to avoid such mistakes.

Nadav
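The fix, with Paolo's amendment for fetches, can be sketched outside the emulator as follows. `linearize` here is a simplified, hypothetical model of `__linearize` that ignores segment limits and canonical checks; it only demonstrates the truncation order.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* With a 32-bit address size in long mode, the effective address (ea)
 * is truncated to 32 bits *before* the segment base is added.
 * Instruction fetches always use the full effective address. */
static u64 linearize(u64 seg_base, u64 ea, int ad_bytes, bool fetch)
{
    if (ad_bytes != 8 && !fetch)
        ea = (u32)ea;          /* truncate the offset, not the sum */
    return seg_base + ea;
}
```

The buggy behavior was equivalent to `(u32)(seg_base + ea)`, i.e. truncating the final linear address, which gives a different result whenever the segment base is non-zero.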
[PATCH 3/9] arm64: KVM: add trap handlers for AArch64 debug registers
Add handlers for all the AArch64 debug registers that are accessible from EL0 or EL1. The trapping code keeps track of the state of the debug registers, allowing for the switch code to implement a lazy switching strategy. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h | 28 ++-- arch/arm64/include/asm/kvm_host.h | 3 + arch/arm64/kvm/sys_regs.c | 130 +- 3 files changed, 151 insertions(+), 10 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 9fcd54b..e6b159a 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -43,14 +43,25 @@ #defineAMAIR_EL1 19 /* Aux Memory Attribute Indirection Register */ #defineCNTKCTL_EL1 20 /* Timer Control Register (EL1) */ #definePAR_EL1 21 /* Physical Address Register */ +#define MDSCR_EL1 22 /* Monitor Debug System Control Register */ +#define DBGBCR0_EL123 /* Debug Breakpoint Control Registers (0-15) */ +#define DBGBCR15_EL1 38 +#define DBGBVR0_EL139 /* Debug Breakpoint Value Registers (0-15) */ +#define DBGBVR15_EL1 54 +#define DBGWCR0_EL155 /* Debug Watchpoint Control Registers (0-15) */ +#define DBGWCR15_EL1 70 +#define DBGWVR0_EL171 /* Debug Watchpoint Value Registers (0-15) */ +#define DBGWVR15_EL1 86 +#define MDCCINT_EL187 /* Monitor Debug Comms Channel Interrupt Enable Reg */ + /* 32bit specific registers. 
Keep them at the end of the range */ -#defineDACR32_EL2 22 /* Domain Access Control Register */ -#defineIFSR32_EL2 23 /* Instruction Fault Status Register */ -#defineFPEXC32_EL2 24 /* Floating-Point Exception Control Register */ -#defineDBGVCR32_EL225 /* Debug Vector Catch Register */ -#defineTEECR32_EL1 26 /* ThumbEE Configuration Register */ -#defineTEEHBR32_EL127 /* ThumbEE Handler Base Register */ -#defineNR_SYS_REGS 28 +#defineDACR32_EL2 88 /* Domain Access Control Register */ +#defineIFSR32_EL2 89 /* Instruction Fault Status Register */ +#defineFPEXC32_EL2 90 /* Floating-Point Exception Control Register */ +#defineDBGVCR32_EL291 /* Debug Vector Catch Register */ +#defineTEECR32_EL1 92 /* ThumbEE Configuration Register */ +#defineTEEHBR32_EL193 /* ThumbEE Handler Base Register */ +#defineNR_SYS_REGS 94 /* 32bit mapping */ #define c0_MPIDR (MPIDR_EL1 * 2) /* MultiProcessor ID Register */ @@ -87,6 +98,9 @@ #define ARM_EXCEPTION_IRQ0 #define ARM_EXCEPTION_TRAP 1 +#define KVM_ARM64_DEBUG_DIRTY_SHIFT0 +#define KVM_ARM64_DEBUG_DIRTY (1 KVM_ARM64_DEBUG_DIRTY_SHIFT) + #ifndef __ASSEMBLY__ struct kvm; struct kvm_vcpu; diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 0a1d697..4737961 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -101,6 +101,9 @@ struct kvm_vcpu_arch { /* Exception Information */ struct kvm_vcpu_fault_info fault; + /* Debug state */ + u64 debug_flags; + /* Pointer to host CPU context */ kvm_cpu_context_t *host_cpu_context; diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index fc8d4e3..618d4fb 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -30,6 +30,7 @@ #include asm/kvm_mmu.h #include asm/cacheflush.h #include asm/cputype.h +#include asm/debug-monitors.h #include trace/events/kvm.h #include sys_regs.h @@ -173,6 +174,58 @@ static bool trap_wi_raz(struct kvm_vcpu *vcpu, return read_zero(vcpu, p); } +static bool 
trap_oslsr_el1(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p-is_write) { + return ignore_write(vcpu, p); + } else { + *vcpu_reg(vcpu, p-Rt) = (1 3); + return true; + } +} + +static bool trap_dbgauthstatus_el1(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p-is_write) { + return ignore_write(vcpu, p); + } else { + *vcpu_reg(vcpu, p-Rt) = 0x; /* Implemented and disabled */ + return true; + } +} + +/* + * Trap handler for DBG[BW][CV]Rn_EL1 and MDSCR_EL1. We track the + * dirtiness of the registers. + */ +static bool trap_debug_regs(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + /* +* The best thing to do would be to trap MDSCR_EL1
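The lazy strategy these trap handlers feed can be modeled in a few lines of user-space C. This is a sketch only: `struct vcpu_model`, `trap_debug_write`, and `world_switch` are hypothetical stand-ins for the kernel's `kvm_vcpu_arch`, `trap_debug_regs`, and the hyp switch code; the flag names come from the patch.

```c
#include <stdint.h>

#define KVM_ARM64_DEBUG_DIRTY_SHIFT 0
#define KVM_ARM64_DEBUG_DIRTY (1 << KVM_ARM64_DEBUG_DIRTY_SHIFT)

struct vcpu_model {           /* hypothetical stand-in for kvm_vcpu_arch */
    uint64_t debug_flags;
    int full_switches;        /* counts expensive full save/restore passes */
};

/* A trapped write to any debug register marks the state dirty. */
static void trap_debug_write(struct vcpu_model *v)
{
    v->debug_flags |= KVM_ARM64_DEBUG_DIRTY;
}

/* World switch: only pay for the full debug register file when dirty.
 * MDSCR_EL1 alone would be saved/restored unconditionally. */
static void world_switch(struct vcpu_model *v)
{
    if (v->debug_flags & KVM_ARM64_DEBUG_DIRTY)
        v->full_switches++;   /* save/restore all DBG[BW][CV]Rn here */
}
```

The point of the pattern is that a guest which never touches its debug registers never triggers the expensive path, which is what makes the per-switch cost of patch 8/9 acceptable.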
[PATCH 7/9] arm64: KVM: add trap handlers for AArch32 debug registers
Add handlers for all the AArch32 debug registers that are accessible from EL0 or EL1. The code follows the same strategy as the AArch64 counterpart with regards to tracking the dirty state of the debug registers. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h | 9 +++ arch/arm64/kvm/sys_regs.c | 137 ++- 2 files changed, 145 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 12f9dd7..993a7db 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -93,6 +93,15 @@ #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */ #define c10_AMAIR1 (c10_AMAIR0 + 1) /* Aux Memory Attr Indirection Reg */ #define c14_CNTKCTL (CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */ + +#define cp14_DBGDSCRext (MDSCR_EL1 * 2) +#define cp14_DBGBCR0 (DBGBCR0_EL1 * 2) +#define cp14_DBGBVR0 (DBGBVR0_EL1 * 2) +#define cp14_DBGBXVR0 (cp14_DBGBVR0 + 1) +#define cp14_DBGWCR0 (DBGWCR0_EL1 * 2) +#define cp14_DBGWVR0 (DBGWVR0_EL1 * 2) +#define cp14_DBGDCCINT (MDCCINT_EL1 * 2) + #define NR_COPRO_REGS (NR_SYS_REGS * 2) #define ARM_EXCEPTION_IRQ 0 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index c27a5cd..4861fe4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -474,12 +474,148 @@ static const struct sys_reg_desc sys_reg_descs[] = { NULL, reset_val, FPEXC32_EL2, 0x70 }, }; +static bool trap_dbgidr(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) { + return ignore_write(vcpu, p); + } else { + u64 dfr = read_cpuid(ID_AA64DFR0_EL1); + u64 pfr = read_cpuid(ID_AA64PFR0_EL1); + u32 el3 = !!((pfr >> 12) & 0xf); + + *vcpu_reg(vcpu, p->Rt) = ((((dfr >> 20) & 0xf) << 28) | + (((dfr >> 12) & 0xf) << 24) | + (((dfr >> 28) & 0xf) << 20) | + (6 << 16) | (el3 << 14) | (el3 << 12)); + return true; + } +} + +static bool trap_debug32(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) { + vcpu_cp14(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt); + vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY; + } else { + *vcpu_reg(vcpu, p->Rt) = vcpu_cp14(vcpu, r->reg); + } + + return true; +} + +#define DBG_BCR_BVR_WCR_WVR(n) \ + /* DBGBCRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 4), trap_debug32, \ + NULL, (cp14_DBGBCR0 + (n) * 2) }, \ + /* DBGBVRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 5), trap_debug32, \ + NULL, (cp14_DBGBVR0 + (n) * 2) }, \ + /* DBGWVRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 6), trap_debug32, \ + NULL, (cp14_DBGWVR0 + (n) * 2) }, \ + /* DBGWCRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 7), trap_debug32, \ + NULL, (cp14_DBGWCR0 + (n) * 2) } + +#define DBGBXVR(n) \ + { Op1( 0), CRn( 1), CRm((n)), Op2( 1), trap_debug32, \ + NULL, cp14_DBGBXVR0 + n * 2 } + /* Trapped cp14 registers */ static const struct sys_reg_desc cp14_regs[] = { + /* DBGIDR */ + { Op1( 0), CRn( 0), CRm( 0), Op2( 0), trap_dbgidr }, + /* DBGDTRRXext */ + { Op1( 0), CRn( 0), CRm( 0), Op2( 2), trap_wi_raz }, + + DBG_BCR_BVR_WCR_WVR(0), + /* DBGDSCRint */ + { Op1( 0), CRn( 0), CRm( 1), Op2( 0), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(1), + /* DBGDCCINT */ + { Op1( 0), CRn( 0), CRm( 2), Op2( 0), trap_debug32 }, + /* DBGDSCRext */ + { Op1( 0), CRn( 0), CRm( 2), Op2( 2), trap_debug32 }, + DBG_BCR_BVR_WCR_WVR(2), + /* DBGDTRTXext */ + { Op1( 0), CRn( 0), CRm( 3), Op2( 2), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(3), + DBG_BCR_BVR_WCR_WVR(4), + DBG_BCR_BVR_WCR_WVR(5), + /* DBGWFAR */ + { Op1( 0), CRn( 0), CRm( 6), Op2( 0), trap_wi_raz }, + /* DBGOSECCR */ + { Op1( 0), CRn( 0), CRm( 6), Op2( 2), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(6), + /* DBGVCR */ + { Op1( 0), CRn( 0), CRm( 7), Op2( 0), trap_debug32 }, + DBG_BCR_BVR_WCR_WVR(7), + DBG_BCR_BVR_WCR_WVR(8), + DBG_BCR_BVR_WCR_WVR(9), + DBG_BCR_BVR_WCR_WVR(10), + DBG_BCR_BVR_WCR_WVR(11), +
Re: KVM Nested L2 guest startup problems
On 07/05/2014 17:30, Abel Gordon wrote: ... which we already do. The only secondary execution controls we allow are APIC page, unrestricted guest, WBINVD exits, and of course EPT. But we don't verify if L1 tries to enable the feature for L1 (even if it's not exposed)... Or do we ? Yes, we do: if (!vmx_control_verify(vmcs12->cpu_based_vm_exec_control, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high) || !vmx_control_verify(vmcs12->secondary_vm_exec_control, nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high) || !vmx_control_verify(vmcs12->pin_based_vm_exec_control, nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high) || !vmx_control_verify(vmcs12->vm_exit_controls, nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high) || !vmx_control_verify(vmcs12->vm_entry_controls, nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high)) { nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; } Paolo
Re: [PATCH 0/9] arm64: KVM: debug infrastructure support
On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: This patch series adds debug support, a key feature missing from the KVM/arm64 port. The main idea is to keep track of whether the debug registers are dirty (changed by the guest) or not. In this case, perform the usual save/restore dance, for one run only. It means we only have a penalty if a guest is actually using the debug registers. The huge amount of registers is properly frightening, but CPUs actually only implement a subset of them. Also, there is a number of registers we don't bother emulating (things having to do with external debug and OSlock). Presumably these registers now appear in the userspace interface too, yes? Did you check that they all cope with the "migration reads all register values on the source and then writes them on the destination in arbitrary order" semantics without further fiddling? (I have a note that says that at least OSLAR_EL1/OSLSR_EL1 won't work that way, for instance.) thanks -- PMM
Re: [PATCH 1/9] arm64: KVM: rename pm_fake handler to trap_wi_raz
On 07/05/14 16:34, Peter Maydell wrote: On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: pm_fake doesn't quite describe what the handler does (ignoring writes and returning 0 for reads). As we're about to use it (a lot) in a different context, rename it with a (admittedly cryptic) name that makes sense for all users. -/* - * We could trap ID_DFR0 and tell the guest we don't support performance - * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was - * NAKed, so it will read the PMCR anyway. - * - * Therefore we tell the guest we have 0 counters. Unfortunately, we - * must always support PMCCNTR (the cycle counter): we just RAZ/WI for - * all PM registers, which doesn't crash the guest kernel at least. - */ -static bool pm_fake(struct kvm_vcpu *vcpu, - const struct sys_reg_params *p, - const struct sys_reg_desc *r) +static bool trap_wi_raz(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) The standard term for this is RAZ/WI, not WI/RAZ, so why not trap_raz_wi ? Good point. I'll update it. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
On 07/05/2014 14:32, Nadav Amit wrote: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Is this true even for REPZ and ZF=0 or REPNZ and ZF=1? Paolo
Re: [PATCH 0/9] arm64: KVM: debug infrastructure support
On 07/05/14 16:42, Peter Maydell wrote: On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: This patch series adds debug support, a key feature missing from the KVM/arm64 port. The main idea is to keep track of whether the debug registers are dirty (changed by the guest) or not. In this case, perform the usual save/restore dance, for one run only. It means we only have a penalty if a guest is actually using the debug registers. The huge amount of registers is properly frightening, but CPUs actually only implement a subset of them. Also, there is a number of registers we don't bother emulating (things having to do with external debug and OSlock). Presumably these registers now appear in the userspace interface too, yes? Did you check that they all cope with the "migration reads all register values on the source and then writes them on the destination in arbitrary order" semantics without further fiddling? (I have a note that says that at least OSLAR_EL1/OSLSR_EL1 won't work that way, for instance.) The only registers that are exported to userspace are MDSCR_EL1, DBG{B,W}{C,V}Rn_EL1 and MDCCINT_EL1 (and their 32bit counterparts). They should be fine being saved/restored in any order, as long as you're not running the vcpu in between. The OSL*_EL1 registers are just ignored so far. So far, it is unclear how they would be supported in a guest, and I don't have any test environment for them. Should the need arise, support can be added on top of what we have now. M. -- Jazz is not dead. It just smells funny...
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
On 07/05/2014 14:32, Nadav Amit wrote: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. The SDM says "If the operand-size attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum instruction pointer size of 16 bits." I'm not sure whether that should also imply that 63:32 are _not_ unmodified (because you do a 32-bit write not a 16-bit one), but in any case it looks like masked_increment is not the right function. Paolo
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
On 5/7/14, 5:43 PM, Bandan Das wrote: Nadav Amit na...@cs.technion.ac.il writes: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, &ctxt->_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt->mode == X86EMUL_MODE_PROT64 ? 8 : ctxt->op_bytes; Just a nit, probably break this up for readability ? I will make it more readable on the next version. Thanks, Nadav
Re: [PATCH 11/11] kvm tools: Modify term_putc to write more than one char
Hello. On 06-05-2014 19:51, Andreas Herrmann wrote: From: David Daney david.da...@cavium.com It is a performance enhancement. When running in a simulator, each system call to write a character takes a lot of time. Batching them up decreases the overhead (in the root kernel) of each virtio console write. Signed-off-by: David Daney david.da...@cavium.com Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com --- tools/kvm/term.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tools/kvm/term.c b/tools/kvm/term.c index 3de410b..b153eed 100644 --- a/tools/kvm/term.c +++ b/tools/kvm/term.c @@ -52,11 +52,14 @@ int term_getc(struct kvm *kvm, int term) int term_putc(char *addr, int cnt, int term) { int ret; + int num_remaining = cnt; - while (cnt--) { - ret = write(term_fds[term][TERM_FD_OUT], addr++, 1); + while (num_remaining) { + ret = write(term_fds[term][TERM_FD_OUT], addr, num_remaining); if (ret < 0) return 0; Perhaps 'return cnt - num_remaining' instead? + num_remaining -= ret; + addr += ret; } return cnt; WBR, Sergei
Re: [PATCH 0/5] KVM: x86: Fix exit handler and emulation bugs
On 07/05/2014 14:32, Nadav Amit wrote: This series of patches fixes various scenarios in which KVM does not follow x86 specifications. Patches #4 and #5 are related; they reflect a new revision of the previously submitted patch that dealt with the wrong masking of registers in long-mode. Patch #3 is a follow-up to the previously submitted patch that fixed the wrong reserved page table masks. Patches #3 and #5 were not tested in a manner that actually checks the modified behavior. Not all the paths in patch #4 were tested. Thanks for reviewing the patches. Nadav Amit (5): KVM: x86: Emulator does not calculate address correctly KVM: vmx: handle_dr does not handle RSP correctly KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support KVM: x86: Wrong register masking in 64-bit mode KVM: x86: Fix wrong masking on relative jump/call arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/emulate.c | 47 +-- arch/x86/kvm/mmu.c | 8 ++-- arch/x86/kvm/vmx.c | 2 +- 4 files changed, 43 insertions(+), 21 deletions(-) I'll apply patches 2 and 3. Thanks for your work! Paolo
Re: [PATCH] kvm/x86: implement hv EOI assist
On 07/05/2014 15:29, Michael S. Tsirkin wrote: It seems that it's easy to implement the EOI assist on top of the PV EOI feature: simply convert the page address to the format expected by PV EOI. Notes: -No EOI required is set only if interrupt injected is edge triggered; this is true because level interrupts are going through IOAPIC which disables PV EOI. In any case, if guest triggers EOI the bit will get cleared on exit. -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE seems sufficient In any case, bit is cleared on exit so worst case it's never re-enabled -no handling of PV EOI data is performed at HV_X64_MSR_EOI write; HV_X64_MSR_EOI is a separate optimization - it's an X2APIC replacement that lets you do EOI with an MSR and not IO. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- patch is unchanged from RFC except for documenting the design points in the commit log, to address Marcelo's comments. This passed basic testing. Migration was not tested. I think at this point it's a good idea to merge this so it can get tested and used by more people. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8ae1ff5..d84d750fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1890,6 +1890,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (!(data & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE)) { vcpu->arch.hv_vapic = data; + if (kvm_lapic_enable_pv_eoi(vcpu, 0)) + return 1; break; } gfn = data >> HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT; @@ -1900,6 +1902,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 1; vcpu->arch.hv_vapic = data; mark_page_dirty(vcpu->kvm, gfn); + if (kvm_lapic_enable_pv_eoi(vcpu, gfn_to_gpa(gfn) | KVM_MSR_ENABLED)) + return 1; break; } case HV_X64_MSR_EOI: Applying to kvm/queue, thanks.
Paolo
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
On 5/7/14, 5:50 PM, Bandan Das wrote: Nadav Amit na...@cs.technion.ac.il writes: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix It would be helpful to have a pointer in the SDM especially for cases that are not well-documented. is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il First, as for the different masking/zero-extending behavior in different modes, I guess this behavior is documented in SDM Volume 1, Section 3.3.7: "Address Calculations in 64-Bit Mode". It appears (from experiments on bare-metal) that it also regards ESI/EDI/ECX on REP-prefix instructions and I presume it regards ESP on stack operations and ECX on loop operations. I will ensure it soon. Second, the behavior of zero-extending RCX even if ECX is initially zero was experienced on native environment (Intel SandyBridge), and we found no erratum regarding it. I actually found no documentation that describes this behavior. The term "not well-documented" is probably misleading, and will be clarified on v2. Thanks, Nadav
[Bug 73721] KVM hv-time
https://bugzilla.kernel.org/show_bug.cgi?id=73721 Paolo Bonzini bonz...@gnu.org changed: What|Removed |Added Status|NEW |RESOLVED CC||bonz...@gnu.org Resolution|--- |CODE_FIX --- Comment #3 from Paolo Bonzini bonz...@gnu.org --- Already fixed by commit b351c39cc9e0151cee9b8d52a1e714928faabb38, but Marcelo forgot to CC stable. I'll propose the patch myself.
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels 3.15
On 07/05/2014 15:19, Gabriel L. Somlo wrote: On Wed, May 07, 2014 at 08:29:19AM +0200, Jan Kiszka wrote: On 2014-05-06 20:35, gso...@gmail.com wrote: Signed-off-by: Gabriel Somlo so...@cmu.edu --- Jan, After today's pull from kvm, I also need this to build against my Fedora 20 kernel (3.13.10-200.fc20.x86_64). Which version did you pull? Neither next nor queue contain this change yet. I have 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd (Linux 3.15-rc4) as the top commit. Right now I'm seeing different build problems: http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio Didn't look into details yet. That's kernel 3.11, afaict, so entirely different problem :) Probably time to kill srcu.c altogether, since it's available on 2.6.19. Paolo
Re: [PATCH 2/9] arm64: move DBG_MDSCR_* to asm/debug-monitors.h
On Wed, May 07, 2014 at 04:20:47PM +0100, Marc Zyngier wrote: In order to be able to use the DBG_MDSCR_* macros from the KVM code, move the relevant definitions to the obvious include file. Also move the debug_el enum to a portion of the file that is guarded by #ifndef __ASSEMBLY__ in order to use that file from assembly code. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/debug-monitors.h | 19 ++- arch/arm64/kernel/debug-monitors.c | 9 - 2 files changed, 14 insertions(+), 14 deletions(-) Acked-by: Will Deacon will.dea...@arm.com Will
[PATCH] kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything. 2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the "return 1;" statement in the monitor/mwait functions to shut up gcc, or ???
:) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, +* but *not* advertised to guests via CPUID ! */ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]= mwait_interception, [SVM_EXIT_XSETBV] = xsetbv_interception,
[SVM_EXIT_NPF] = pf_interception, }; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 33e8c02..060b384 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5669,6 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu) return 1; } +static int handle_nop(struct kvm_vcpu *vcpu) +{ + skip_emulated_instruction(vcpu); + return 1; +} + +static int handle_mwait(struct kvm_vcpu *vcpu) +{ + printk_once(KERN_WARNING kvm: MWAIT instruction emulated as
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On Wed, May 07, 2014 at 02:10:59PM -0400, Gabriel L. Somlo wrote: Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought. --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything. 2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c).
Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the "return 1;" statement in the monitor/mwait functions to shut up gcc, or ??? :) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, + * but *not* advertised to guests via CPUID ! */ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
[SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]= mwait_interception, [SVM_EXIT_XSETBV] = xsetbv_interception, [SVM_EXIT_NPF] = pf_interception, }; diff --git a/arch/x86/kvm/vmx.c
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote: On Wed, May 07, 2014 at 02:10:59PM -0400, Gabriel L. Somlo wrote: Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought. Since we don't trap on non-exposed other instructions (new SSE and whatdoiknow) I don't think it's really bad to just expose MONITOR/MWAIT as nops. Alex --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the Joerg, mind to take a quick look here? Alex skip_emulated_instruction() call in my own version of nop_interception() is necessary.
If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns returns 1 without skipping anything. 2. I get defined but not used warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the return 1; statement in the monitor/mwait functions to shut up gcc, or ??? :) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, +* but *not* advertised to guests via CPUID ! 
*/ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]
Re: [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support
On Wed, May 07, 2014 at 11:01:28AM -0400, Waiman Long wrote:

v9->v10:
- Make some minor changes to qspinlock.c to accommodate review feedback.
- Change author to PeterZ for 2 of the patches.
- Include Raghavendra KT's test results in patch 18.

Any chance you can post these on a git tree? Thanks.
Re: [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock for KVM
Raghavendra KT had done some performance testing on this patch with the following results:

Overall we are seeing good improvement for pv-unfair version.

System: 32 cpu sandybridge with HT on (4 node with 32 GB each)
Guest : 8GB with 16 vcpu/VM. Average was taken over 8-10 data points.

Base = 3.15-rc2 with PRAVIRT_SPINLOCK = y

A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = y (unfair lock)

B = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = n PARAVIRT_UNFAIR_LOCKS = n
    (queue spinlock without paravirt)

C = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = n
    (queue spinlock with paravirt)

Could you do s/PRAVIRT/PARAVIRT/ please?

Ebizzy %improvements
overcommit        A         B         C
0.5x           4.4265    2.0611    1.5824
1.0x           0.9015   -7.7828    4.5443
1.5x          46.1162   -2.9845   -3.5046
2.0x          99.8150   -2.7116    4.7461

Considering B sucks

Dbench %improvements
overcommit        A         B         C
0.5x           3.2617    3.5436    2.5676
1.0x           0.6302    2.2342    5.2201
1.5x           5.0027    4.8275    3.8375
2.0x          23.8242    4.5782   12.6067

Absolute values of base results: (overcommit, value, stdev)
Ebizzy (records / sec with 120 sec run)
0.5x  20941.8750 (2%)
1.0x  17623.8750 (5%)
1.5x   5874.7778 (15%)
2.0x   3581.8750 (7%)
Dbench (throughput in MB/sec)
0.5x  10009.6610 (5%)
1.0x   6583.0538 (1%)
1.5x   3991.9622 (4%)
2.0x   2527.0613 (2.5%)

Signed-off-by: Waiman Long waiman.l...@hp.com
Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/kernel/kvm.c | 135 +
 kernel/Kconfig.locks  |   2 +-
 2 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7ab8ab3..eef427b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock,
 			   __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;	/* Kick but not halt count	*/
+static u32 halt_qhead_stats;	/* Queue head halting count	*/
+static u32 halt_qnode_stats;	/* Queue node halting count	*/
+static u32 halt_abort_stats;	/* Halting abort count		*/
+static u32 wake_kick_stats;	/* Wakeup by kicking count	*/
+static u32 wake_spur_stats;	/* Spurious wakeup count	*/
+static u64 time_blocked;	/* Total blocking time		*/
+
+static int __init kvm_spinlock_debugfs(void)
+{
+	d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
+	if (!d_kvm_debug) {
+		printk(KERN_WARNING
+		       "Could not create 'kvm' debugfs directory\n");
+		return -ENOMEM;
+	}
+	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
+
+	debugfs_create_u32("kick_nohlt_stats",
+			   0644, d_spin_debug, &kick_nohlt_stats);
+	debugfs_create_u32("halt_qhead_stats",
+			   0644, d_spin_debug, &halt_qhead_stats);
+	debugfs_create_u32("halt_qnode_stats",
+			   0644, d_spin_debug, &halt_qnode_stats);
+	debugfs_create_u32("halt_abort_stats",
+			   0644, d_spin_debug, &halt_abort_stats);
+	debugfs_create_u32("wake_kick_stats",
+			   0644, d_spin_debug, &wake_kick_stats);
+	debugfs_create_u32("wake_spur_stats",
+			   0644, d_spin_debug, &wake_spur_stats);
+	debugfs_create_u64("time_blocked",
+			   0644, d_spin_debug, &time_blocked);
+	return 0;
+}
+
+static inline void kvm_halt_stats(enum pv_lock_stats type)
+{
+	if (type == PV_HALT_QHEAD)
+		add_smp(&halt_qhead_stats, 1);
+	else if (type == PV_HALT_QNODE)
+		add_smp(&halt_qnode_stats, 1);
+	else /* type == PV_HALT_ABORT */
+		add_smp(&halt_abort_stats, 1);
+}
+
+static inline void kvm_lock_stats(enum pv_lock_stats type)
+{
+	if (type == PV_WAKE_KICKED)
+		add_smp(&wake_kick_stats, 1);
+	else if (type == PV_WAKE_SPURIOUS)
+		add_smp(&wake_spur_stats, 1);
+	else /* type == PV_KICK_NOHALT */
+		add_smp(&kick_nohlt_stats, 1);
+}
+
+static inline u64 spin_time_start(void)
+{
+	return sched_clock();
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+	u64 delta;
+
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On 07/05/2014 20:10, Gabriel L. Somlo wrote:

1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything.

Yes, it's necessary.

2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the return 1; statement in the monitor/mwait functions to shut up gcc, or ??? :)

Nuke it. :)

Thanks for working on OS X virtualization, weird guests are always the source of interesting quirks!

Paolo
[PATCH v2] kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from crashing after they fail to check for monitor/mwait availability via cpuid.

Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, to prevent compliant guests from using them inadvertently.

Signed-off-by: Gabriel L. Somlo so...@cmu.edu
---
New in v2: remove invalid_op handler functions which were only used to handle exits caused by monitor and mwait

On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:

If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought.

Since we don't trap on non-exposed other instructions (new SSE and whatdoiknow) I don't think it's really bad to just expose MONITOR/MWAIT as nops.

So AFAICT, linux prefers to use mwait for idling if cpuid tells it that it's available. If we keep telling everyone that we do NOT have monitor and mwait available, compliant guests will never end up using them, and this hack would remain completely invisible to them, which is good (better to use hlt-based idle loops when you're a vm guest, that would actually allow the host to relax while you're halted :)

So the only time anyone would be able to tell we have this hack would be when they're about to receive an invalid opcode for using monitor/mwait in violation of what CPUID (would have) told them. That's what happens to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs began to seriously think about their OS running as a vm guest (on fusion and parallels)...
Instead of killing the misbehaving guest with an invalid opcode, we'd allow them to peg the host CPU with their monitor == mwait == nop idle loop instead, which, at least on OS X, should be tolerable long enough to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext' and reboot the guest, after which things would settle down by reverting the guest to a hlt-based idle loop.

The only reason I can think of to add functionality for enabling/disabling this hack would be to protect against a malicious guest which would use mwait *on purpose* to peg the host CPU. But a malicious guest could just run for(;;); in ring 0 and accomplish the same goal, so we wouldn't really gain anything in exchange for the added complexity...

Thanks, Gabriel

 arch/x86/kvm/cpuid.c | 2 ++
 arch/x86/kvm/svm.c   | 28 
 arch/x86/kvm/vmx.c   | 20 
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..d094fc6 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
+		/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
+		 * but *not* advertised to guests via CPUID ! */
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7f4f9c2..0b7d58d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
 	return 1;
 }
-static int invalid_op_interception(struct vcpu_svm *svm)
-{
-	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int task_switch_interception(struct vcpu_svm *svm)
 {
 	u16 tss_selector;
@@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
 	return 1;
 }
+static int nop_interception(struct vcpu_svm *svm)
+{
+	skip_emulated_instruction(&(svm->vcpu));
+	return 1;
+}
+
+static int monitor_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
+static int mwait_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
Re: [PATCH] kvmclock: Ensure time in migration never goes backward
On Tue, May 06, 2014 at 09:18:27AM +0200, Alexander Graf wrote:
On 06.05.14 01:31, Marcelo Tosatti wrote:
On Mon, May 05, 2014 at 08:23:43PM -0300, Marcelo Tosatti wrote:
Hi Alexander,
On Mon, May 05, 2014 at 03:51:22PM +0200, Alexander Graf wrote:

When we migrate we ask the kernel about its current belief on what the guest time would be, via KVM_GET_CLOCK, which returns the time in struct kvm_clock_data. However, I've seen cases where the kvmclock guest structure indicates a time more recent than the kvm returned time.

This should not happen because the value returned by KVM_GET_CLOCK (get_kernel_ns() + kvmclock_offset) should be relatively in sync with what is seen in the guest via kvmclock read.

Yes, and it isn't. Any ideas why it's not? This patch really just uses the guest visible kvmclock time rather than the host view of it on migration.

Effective frequency of TSC clock could be higher than NTP frequency. So NTP correction would slow down the host clock.

There is definitely something very broken on the host's side since it does return a smaller time than the guest exposed interface indicates.

Alex

Can you please retrieve the values of system_timestamp, tsc_timestamp and the host clock on the trace you have? Nothing forbids backwards time even without migration, in case the problem is the difference in frequency between TSC and NTP corrected host clock (point is, should figure out what is happening).
Re: [Qemu-devel] [PATCH] kvmclock: Ensure time in migration never goes backward
On Tue, May 06, 2014 at 09:54:35PM +0200, Marcin Gibuła wrote:

Yes, and it isn't. Any ideas why it's not? This patch really just uses the guest visible kvmclock time rather than the host view of it on migration. There is definitely something very broken on the host's side since it does return a smaller time than the guest exposed interface indicates.

Don't know if it helps, but here are example values of time_at_migration and s->clock from your patch. Tested on 5 restores of a saved VM that (used to) hang:

s->clock         time_at_migration
157082235125698  157113284546655
157082235125698  157113298196976
157082235125698  157113284615117
157082235125698  157113284486601
157082235125698  157113284479740

Now, when I compare system time on guest with and without patch:
On unpatched qemu vm restores with date: Apr 18 06:56:36
On patched qemu it says: Apr 18 06:57:06

-- mg

Marcin,

Can you turn off the TSC clock on the host and attempt to reproduce the issue, without Alexander's patch? Just

echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource

(or any other clock except tsc if you don't have hpet). TIA.