Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse wrote:
>
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few
> > useful things. We generally did a parallel 200 thread TCP_RR test,
> > as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
> real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>
> All else is software overhead.

Agreed!

> (I'm deliberately ignoring the stupid chipsets where DMA page tables
> aren't cache coherent and we need a clflush too. They make me too sad.)

How much does Linux need to care about such chipsets? Can we argue that
they get very sad performance and so be it?

> > This study preceded the recent patch to break the locks into pools
> > ("Break up monolithic iommu table/lock into finer graularity pools
> > and lock"). There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper
> > (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
> > has some promising approaches. The magazine avoids the RB tree issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the
> allocator in lib/iommu-common.c (and thus getting the benefit of the
> finer granularity pools).

Sounds promising! Is 4 parallel arenas enough? We'll try to play with
that here. I think lazy_flush leaves dangling references.

> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> > page table cleanup algorithm could do well.
> When you say 'dynamic 1:1 mapping', is that the same thing that's been
> suggested elsewhere — avoiding the IOVA allocator completely by using a
> virtual address which *matches* the physical address, if that virtual
> address is available?

Yes. We munge in 2 upper address bits in the IOVA to encode read and
write permissions as well.

> Simply cmpxchg on the PTE itself, and if it was already set *then* we
> fall back to the allocator, obviously configured to allocate from a
> range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in
> the PTE for a use count so a given page could be in more than one
> mapping before we fall back to the IOVA allocator.

Aren't bits 63:52 available at all levels?

> > There are correctness fixes and optimizations left in the
> > invalidation path: I want strict-ish semantics (a page doesn't go
> > back into the freelist until the last IOTLB/IOMMU TLB entry is
> > invalidated) with good performance, and that seems to imply that an
> > additional page reference should be gotten at dma_map time and put
> > back at the completion of the IOMMU flush routine. (This is worthy
> > of much discussion.)
>
> We already do something like this for page table pages which are freed
> by an unmap, FWIW.

As I understood the code when I last looked, this was true only if a
single unmap operation covered an entire table's worth (2MByte, or
1GByte, etc.) of mappings. The caffeine hasn't hit yet, though, so I
can't even begin to dive into the new calls into mm.c.

> > Additionally, we can find ways to optimize the flush routine by
> > realizing that if we have frequent maps and unmaps, it may be because
> > the device creates and destroys buffers a lot; these kinds of
> > workloads use an IOVA for one event and then never come back. Maybe
> > TLBs don't do much good and we could just flush the entire IOMMU TLB
> > [and ATS cache] for that BDF.
> That would be a very interesting thing to look at. Although it would be
> nice if we had a better way to measure the performance impact of IOTLB
> misses — currently we don't have a lot of visibility at all.

All benchmarks are lies. But we intend to run internal workloads as
well as well-agreed loads and see how things go.

> > 1: We verified that the IOMMU costs are almost entirely software
> > overheads by forcing software 1:1 mode, where we create page tables
> > for all physical addresses. We tested using leaf nodes of size 4KB,
> > of 2MB, and of 1GB. In all cases, there is zero runtime maintenance
> > of the page tables, and no IOMMU invalidations. We did piles of DMA
> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> > addresses. At 4KB page size, we could see some bandwidth slowdown,
> > but at 2MB and 1GB, there was < 1% performance loss as compared with
> > IOMMU off.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
We looked into Intel IOMMU performance a while ago and learned a few
useful things. We generally did a parallel 200 thread TCP_RR test, as
this churns through mappings quickly and uses all available cores.

First, the main bottleneck was software performance[1]. This study
preceded the recent patch to break the locks into pools ("Break up
monolithic iommu table/lock into finer graularity pools and lock").
There were several points of lock contention:

- the RB tree is per device (and in the network test, there's one
  device). Every dma_map and unmap holds the lock.

- the RB tree lock is held during invalidations as well. There's a
  250-entry queue for invalidations that doesn't do any batching
  intelligence (for example, promote to larger-range invalidations,
  flush the entire device, etc.). RB tree locks may be held while
  waiting for invalidation drains. Invalidations have even worse
  behavior with ATS enabled for a given device.

- the RB tree has one entry per dma_map call (that entry is deleted by
  the corresponding dma_unmap). If we had merged all adjacent entries
  when we could, we would not have lost any information that's actually
  used by the code today. (There could be a check that a dma_unmap
  actually covers the entire region that was mapped, but there isn't.)
  At boot (without network traffic), two common NICs' drivers show tens
  of thousands of static dma_maps that never go away; this means the RB
  tree is ~14-16 levels deep. An rbtree walk (holding that big lock) is
  a 14-16 level pointer chase through mostly cache-cold entries.

I wrote a modification to the RB tree handling that merges nodes
representing abutting IOVA ranges (and unmerges them on dma_unmap), and
the same drivers created around 7 unique entries. Steady state grew to
a few hundred, maybe a thousand, entries, but the fragmentation didn't
get worse than that. This optimization got about a third of the
performance back.
Omer's paper
(https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
has some promising approaches. The magazine avoids the RB tree issue.

I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
page table cleanup algorithm could do well.

There are correctness fixes and optimizations left in the invalidation
path: I want strict-ish semantics (a page doesn't go back into the
freelist until the last IOTLB/IOMMU TLB entry is invalidated) with good
performance, and that seems to imply that an additional page reference
should be gotten at dma_map time and put back at the completion of the
IOMMU flush routine. (This is worthy of much discussion.)

Additionally, we can find ways to optimize the flush routine by
realizing that if we have frequent maps and unmaps, it may be because
the device creates and destroys buffers a lot; these kinds of workloads
use an IOVA for one event and then never come back. Maybe TLBs don't do
much good and we could just flush the entire IOMMU TLB [and ATS cache]
for that BDF.

We'll try to get free time to do some of these things soon.

Ben

1: We verified that the IOMMU costs are almost entirely software
overheads by forcing software 1:1 mode, where we create page tables for
all physical addresses. We tested using leaf nodes of size 4KB, of 2MB,
and of 1GB. In all cases, there is zero runtime maintenance of the page
tables, and no IOMMU invalidations. We did piles of DMA maximizing x16
PCIe bandwidth on multiple lanes, to random DRAM addresses. At 4KB page
size, we could see some bandwidth slowdown, but at 2MB and 1GB, there
was < 1% performance loss as compared with IOMMU off.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: VMX: Preserve host CR4.MCE value while in guest mode.
Thanks, Paolo, for the document reference and the fixup.

Greg: Sorry for the standard newbie gaffes.

On Fri, Apr 17, 2015 at 3:16 AM, Paolo Bonzini wrote:
>
> On 17/04/2015 07:10, Wanpeng Li wrote:
> > > > Before the change, if guest CR4.MCE==0, then the machine check is
> > > > escalated to Catastrophic Error (CATERR) and the machine dies.
> >
> > Could you point out which section of SDM describes that the machine
> > check is escalated to a CATERR if CR4.MCE==0?
>
> It's under the description of "Interrupt 18--Machine-Check Exception (#MC)":
>
>    The machine-check mechanism is enabled by setting the MCE flag in
>    control register CR4. [...] If the machine-check mechanism is not
>    enabled (the MCE flag in control register CR4 is clear), a
>    machine-check exception causes the processor to enter the shutdown
>    state.
>
> This of course also applies whenever the non-root mode CR4 (*not* the
> shadow CR4 in the VMCS!) has the MCE flag cleared.
>
> Paolo
Re: Fwd: [RFC] Keeping host value of CR4.MCE (and other bits) live during guest execution
Thanks very much, Andy and Paolo. I've brought my patch out of the dark
age and just re-sent it under a new subject line.

On Thu, Apr 16, 2015 at 2:23 AM, Paolo Bonzini wrote:
>
> On 16/04/2015 01:52, Benjamin Serebrin wrote:
> > There's a bug in kvm/vmx.c: if the host enabled machine check
> > (CR4.MCE==1), the value gets zeroed while the CPU is running in
> > guest context. If a machine check event arrives while the CPU is in
> > guest context and effective CR4.MCE is zero, the machine raises
> > CATERR and crashes.
> >
> > We should make sure the host value of CR4.MCE is always active.
> > There are read and write shadows for the guest to think it wrote its
> > own value.
> >
> > For discussion: there's new complexity with CR4 shadowing
> > (1e02ce4cccdcb9688386e5b8d2c9fa4660b45389). I measure CR4 reads at
> > 24 cycles on haswell and 36 on sandybridge, which compares well with
> > L2 miss costs. Is the shadowing worth the complexity? CR4 is also
> > cached (with no real consistency mechanism) in the VMCS at the time
> > of guest VCPU creation. If there is ever a change in CR4 value over
> > time, or if CR4 is different on different CPUs in the system, all
> > this logic gets broken.
>
> Good catch. Please resubmit with your Signed-off-by, and rebased to
> Linus's tree.
>
> Paolo
>
> > Thanks,
> > Ben
> >
> > ---
> >
> > The host's decision to enable machine check exceptions should remain
> > in force during non-root mode. KVM was writing 0 to cr4 on VCPU
> > reset and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
> >
> > Tested: Inject machine check while a guest is spinning.
> > Before the change, if guest CR4.MCE==0, then the machine check is
> > escalated to Catastrophic Error (CATERR) and the machine dies.
> > If guest CR4.MCE==1, then the machine check causes VMEXIT and is
> > handled normally by host Linux. After the change, injecting a
> > machine check causes normal Linux machine check handling.
> >
> > ---
> >  arch/x86/kvm/vmx.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index a214104..44c8d24 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -3456,8 +3456,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> >
> >  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> >  {
> > -	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
> > -		     KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
> > +	/*
> > +	 * Pass through host's Machine Check Enable value to hw_cr4, which
> > +	 * is in force while we are in guest mode. Do not let guests control
> > +	 * this bit, even if host CR4.MCE == 0.
> > +	 */
> > +	unsigned long hw_cr4 =
> > +		(read_cr4() & X86_CR4_MCE) |
> > +		(cr4 & ~X86_CR4_MCE) |
> > +		(to_vmx(vcpu)->rmode.vm86_active ?
> > +		 KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
> >
> >  	if (cr4 & X86_CR4_VMXE) {
> >  		/*
[PATCH] vt-x: Preserve host CR4.MCE value while in guest mode.
The host's decision to enable machine check exceptions should remain
in force during non-root mode. KVM was writing 0 to cr4 on VCPU reset
and passed a slightly-modified 0 to the vmcs.guest_cr4 value.

Tested: Built. On earlier version, tested by injecting machine check
while a guest is spinning. Before the change, if guest CR4.MCE==0, then
the machine check is escalated to Catastrophic Error (CATERR) and the
machine dies. If guest CR4.MCE==1, then the machine check causes VMEXIT
and is handled normally by host Linux. After the change, injecting a
machine check causes normal Linux machine check handling.

Signed-off-by: Ben Serebrin
---
 arch/x86/kvm/vmx.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f5e8dce..f7b6168 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3622,8 +3622,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 
 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
-	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
-		     KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
+	/*
+	 * Pass through host's Machine Check Enable value to hw_cr4, which
+	 * is in force while we are in guest mode. Do not let guests control
+	 * this bit, even if host CR4.MCE == 0.
+	 */
+	unsigned long hw_cr4 =
+		(cr4_read_shadow() & X86_CR4_MCE) |
+		(cr4 & ~X86_CR4_MCE) |
+		(to_vmx(vcpu)->rmode.vm86_active ?
+		 KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
 	if (cr4 & X86_CR4_VMXE) {
 		/*
Fwd: [RFC] Keeping host value of CR4.MCE (and other bits) live during guest execution
There's a bug in kvm/vmx.c: if the host enabled machine check
(CR4.MCE==1), the value gets zeroed while the CPU is running in guest
context. If a machine check event arrives while the CPU is in guest
context and effective CR4.MCE is zero, the machine raises CATERR and
crashes.

We should make sure the host value of CR4.MCE is always active. There
are read and write shadows for the guest to think it wrote its own
value.

For discussion: there's new complexity with CR4 shadowing
(1e02ce4cccdcb9688386e5b8d2c9fa4660b45389). I measure CR4 reads at 24
cycles on haswell and 36 on sandybridge, which compares well with L2
miss costs. Is the shadowing worth the complexity? CR4 is also cached
(with no real consistency mechanism) in the VMCS at the time of guest
VCPU creation. If there is ever a change in CR4 value over time, or if
CR4 is different on different CPUs in the system, all this logic gets
broken.

Thanks,
Ben

---

The host's decision to enable machine check exceptions should remain
in force during non-root mode. KVM was writing 0 to cr4 on VCPU reset
and passed a slightly-modified 0 to the vmcs.guest_cr4 value.

Tested: Inject machine check while a guest is spinning.
Before the change, if guest CR4.MCE==0, then the machine check is
escalated to Catastrophic Error (CATERR) and the machine dies.
If guest CR4.MCE==1, then the machine check causes VMEXIT and is
handled normally by host Linux. After the change, injecting a machine
check causes normal Linux machine check handling.

---
 arch/x86/kvm/vmx.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a214104..44c8d24 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3456,8 +3456,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 
 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
-	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
-		     KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
+	/*
+	 * Pass through host's Machine Check Enable value to hw_cr4, which
+	 * is in force while we are in guest mode. Do not let guests control
+	 * this bit, even if host CR4.MCE == 0.
+	 */
+	unsigned long hw_cr4 =
+		(read_cr4() & X86_CR4_MCE) |
+		(cr4 & ~X86_CR4_MCE) |
+		(to_vmx(vcpu)->rmode.vm86_active ?
+		 KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
 	if (cr4 & X86_CR4_VMXE) {
 		/*