Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-16 Thread Benjamin Serebrin
On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse <dw...@infradead.org> wrote:
>
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few
> > useful things.  We generally did a parallel 200 thread TCP_RR test,
> > as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
> real hardware (i.e not an IOMMU emulated by qemu for a VM), we don't
> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>
> All else is software overhead.
>

Agreed!

>
> (I'm deliberately ignoring the stupid chipsets where DMA page tables
> aren't cache coherent and we need a clflush too. They make me too sad.)



How much does Linux need to care about such chipsets?  Can we argue
that they get very sad performance and so be it?

>
>
>
> >   This study preceded the recent patch to break the locks into pools
> > ("Break up monolithic iommu table/lock into finer graularity pools
> > and lock").  There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> > c15-paper-peleg.pdf) has some promising approaches.  The magazine
> > avoids the RB tree issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the
> allocator in lib/iommu-common.c (and thus getting the benefit of the
> finer granularity pools).


Sounds promising!  Is 4 parallel arenas enough?  We'll try to play
with that here.
I think lazy_flush leaves dangling references.
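For concreteness, here's a userspace toy of the pooled-arena shape we're
discussing (this is not the real lib/iommu-common.c code; the pool count,
layout, and names are all made up):

#include <pthread.h>
#include <stdint.h>

#define NPOOLS 4

struct iova_pool {
        pthread_mutex_t lock;
        uint64_t next;                /* trivial bump allocator per pool */
};

static struct iova_pool pools[NPOOLS] = {
        [0 ... NPOOLS - 1] = { PTHREAD_MUTEX_INITIALIZER, 0 }
};

/* Pick a pool by CPU so concurrent dma_map-style calls usually take
 * different locks instead of all fighting over one big one. */
static uint64_t alloc_iova(unsigned int cpu, uint64_t npages)
{
        struct iova_pool *p = &pools[cpu % NPOOLS];
        uint64_t iova;

        pthread_mutex_lock(&p->lock);
        /* Give each pool its own disjoint IOVA region. */
        iova = ((uint64_t)(cpu % NPOOLS) << 32) | p->next;
        p->next += npages;            /* real code handles wrap and exhaustion */
        pthread_mutex_unlock(&p->lock);
        return iova;
}

Whether 4 pools is enough presumably depends on how many CPUs are hammering
the same device at once.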

>
>
> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> > page table cleanup algorithm could do well.
>
> When you say 'dynamic 1:1 mapping', is that the same thing that's been
> suggested elsewhere — avoiding the IOVA allocator completely by using a
> virtual address which *matches* the physical address, if that virtual
> address is available?


Yes.  We munge in 2 upper address bits in the IOVA to encode read and
write permissions as well.
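Roughly like this, as a sketch (the bit positions below are placeholders,
not the ones we actually use):

#include <stdint.h>
#include <stdbool.h>

#define IOVA_READ_BIT  (1ULL << 62)   /* illustrative positions only */
#define IOVA_WRITE_BIT (1ULL << 61)

/* IOVA is the physical address plus permission bits munged into the top. */
static inline uint64_t phys_to_iova(uint64_t phys, bool read, bool write)
{
        uint64_t iova = phys;

        if (read)
                iova |= IOVA_READ_BIT;
        if (write)
                iova |= IOVA_WRITE_BIT;
        return iova;
}

static inline uint64_t iova_to_phys(uint64_t iova)
{
        return iova & ~(IOVA_READ_BIT | IOVA_WRITE_BIT);
}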

>
> Simply cmpxchg on the PTE itself, and if it was
> already set *then* we fall back to the allocator, obviously configured
> to allocate from a range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in
> the PTE for a use count so a given page could be in more than one
> mapping before we call back to the IOVA allocator.


Aren't bits 63:52 available at all levels?
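For the fast path you describe, a self-contained toy looks something like
this (a flat array stands in for the real page tables, and the names are
invented):

#include <stdint.h>
#include <stdatomic.h>
#include <stdio.h>

#define NPAGES 16
static _Atomic uint64_t pte[NPAGES];          /* one "PTE" per page frame */
static uint64_t next_high_iova = NPAGES;      /* allocator above phys memory */

static uint64_t map_page(uint64_t pfn, uint64_t prot)
{
        uint64_t expected = 0;
        uint64_t newpte = (pfn << 12) | prot; /* toy PTE: frame plus flags */

        /* Try to claim the identity slot: IOVA == physical frame. */
        if (atomic_compare_exchange_strong(&pte[pfn], &expected, newpte))
                return pfn;

        /* Slot already taken: fall back to an IOVA above the top of
         * physical memory, so the two ranges can never collide.
         * (A real allocator would also install a PTE there.) */
        return next_high_iova++;
}

int main(void)
{
        printf("first map:  IOVA %llu\n", (unsigned long long)map_page(3, 1));
        printf("second map: IOVA %llu\n", (unsigned long long)map_page(3, 1));
        return 0;
}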

>
>
>
> > There are correctness fixes and optimizations left in the
> > invalidation path: I want strict-ish semantics (a page doesn't go
> > back into the freelist until the last IOTLB/IOMMU TLB entry is
> > invalidated) with good performance, and that seems to imply that an
> > additional page reference should be gotten at dma_map time and put
> > back at the completion of the IOMMU flush routine.  (This is worthy
> > of much discussion.)
>
> We already do something like this for page table pages which are freed
> by an unmap, FWIW.


As I understood the code when I last looked, this was true only if a
single unmap operation covered an entire table's worth (2MByte, or
1GByte, etc) of mappings.  The caffeine hasn't hit yet, though, so I
can't even begin to dive into the new calls into mm.c.


>
>
> > Additionally, we can find ways to optimize the flush routine by
> > realizing that if we have frequent maps and unmaps, it may be because
> > the device creates and destroys buffers a lot; these kinds of
> > workloads use an IOVA for one event and then never come back.  Maybe
> > TLBs don't do much good and we could just flush the entire IOMMU TLB
> > [and ATS cache] for that BDF.
>
> That would be a very interesting thing to look at. Although it would be
> nice if we had a better way to measure the performance impact of IOTLB
> misses — currently we don't have a lot of visibility at all.


All benchmarks are lies.  But we intend to run internal workloads as
well as commonly agreed-upon loads and see how things go.

>
>
> > 1: We verified that the IOMMU costs are almost entirely software
> > overheads by forcing software 1:1 mode, where we create page tables
> > for all physical addresses.  We tested using leaf nodes of size 4KB,
> > of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> > of the page tables, and no IOMMU invalidations.  We did piles of DMA
> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> > addresses.  At 4KB page size, we could see some bandwidth slowdown,
> > but at 2MB and 1GB, there was < 1% performance loss as compared with
> > IOMMU off.

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-15 Thread Benjamin Serebrin
We looked into Intel IOMMU performance a while ago and learned a few
useful things.  We generally did a parallel 200 thread TCP_RR test, as
this churns through mappings quickly and uses all available cores.

First, the main bottleneck was software performance[1].  This study
preceded the recent patch to break the locks into pools ("Break up
monolithic iommu table/lock into finer graularity pools and lock").
There were several points of lock contention:
- the RB tree is per device (and in the network test, there's one
device).  Every dma_map and unmap holds the lock.
- the RB tree lock is held during invalidations as well.  There's a
250-entry queue for invalidations that doesn't do any batching
intelligence (for example, promote to larger-range invalidations,
flush entire device, etc).  RB tree locks may be held while waiting
for invalidation drains.  Invalidations have even worse behavior with
ATS enabled for a given device.
- the RB tree has one entry per dma_map call (that entry is deleted by
the corresponding dma_unmap).  If we had merged all adjacent entries
when we could, we would have not lost any information that's actually
used by the code today.  (There could be a check that a dma_unmap
actually covers the entire region that was mapped, but there isn't.)
At boot (without network traffic), two common NICs' drivers show tens
of thousands of static dma_maps that never go away; this means the RB
tree is ~14-16 levels deep.  A rbtree walk (holding that big lock) has
a 14-16 level pointer chase through mostly cache-cold entries.  I
wrote a modification to the RB tree handling that merges nodes that
represent abutting IOVA ranges (and unmerges them on dma_unmap), and
the same drivers created around 7 unique entries.  Steady state grew
to a few hundreds and maybe a thousand, but the fragmentation didn't
get worse than that.  This optimization got about a third of the
performance back.
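To make the merge idea concrete, a self-contained toy (a plain array stands
in for the RB tree, and it ignores splitting back apart on unmap):

#include <stdint.h>
#include <stdio.h>

struct range { uint64_t start, end; };        /* [start, end), in pages */

static struct range nodes[64];
static int nnodes;

/* If a new IOVA range abuts an existing node, extend that node instead
 * of inserting a new one; otherwise insert as usual. */
static void map_range(uint64_t start, uint64_t end)
{
        for (int i = 0; i < nnodes; i++) {
                if (nodes[i].end == start) {  /* abuts on the right */
                        nodes[i].end = end;
                        return;
                }
                if (nodes[i].start == end) {  /* abuts on the left */
                        nodes[i].start = start;
                        return;
                }
        }
        nodes[nnodes++] = (struct range){ start, end };
}

int main(void)
{
        map_range(0, 4);
        map_range(4, 8);                      /* merges into [0, 8) */
        map_range(16, 20);
        printf("%d unique entries\n", nnodes); /* prints 2, not 3 */
        return 0;
}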

Omer's paper 
(https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
has some promising approaches.  The magazine avoids the RB tree issue.

I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
page table cleanup algorithm could do well.

There are correctness fixes and optimizations left in the invalidation
path: I want strict-ish semantics (a page doesn't go back into the
freelist until the last IOTLB/IOMMU TLB entry is invalidated) with
good performance, and that seems to imply that an additional page
reference should be gotten at dma_map time and put back at the
completion of the IOMMU flush routine.  (This is worthy of much
discussion.)
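As a strawman for that discussion, a toy model of the lifetime rule
(get_page/put_page here are local stand-ins, not the real mm calls):

#include <stdatomic.h>
#include <stdio.h>

struct page { _Atomic int refcount; int on_freelist; };

static struct page pg = { 1, 0 };             /* caller holds one reference */
static struct page *pending[16];              /* unmapped, flush not done yet */
static int npending;

static void get_page(struct page *p) { atomic_fetch_add(&p->refcount, 1); }
static void put_page(struct page *p)
{
        if (atomic_fetch_sub(&p->refcount, 1) == 1)
                p->on_freelist = 1;           /* last reference: safe to reuse */
}

static void dma_map_page(struct page *p)   { get_page(p); }
static void dma_unmap_page(struct page *p) { pending[npending++] = p; }

static void iommu_flush_complete(void)        /* IOTLB is now clean */
{
        while (npending)
                put_page(pending[--npending]);
}

int main(void)
{
        dma_map_page(&pg);
        dma_unmap_page(&pg);
        put_page(&pg);                        /* caller drops its own reference */
        printf("before flush: freelist=%d\n", pg.on_freelist);  /* 0 */
        iommu_flush_complete();
        printf("after flush:  freelist=%d\n", pg.on_freelist);  /* 1 */
        return 0;
}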

Additionally, we can find ways to optimize the flush routine by
realizing that if we have frequent maps and unmaps, it may be because
the device creates and destroys buffers a lot; these kinds of workloads
use an IOVA for one event and then never come back.  Maybe TLBs don't
do much good and we could just flush the entire IOMMU TLB [and ATS
cache] for that BDF.
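A sketch of that heuristic, with the threshold and structure invented purely
for illustration:

#include <stdint.h>
#include <stdbool.h>

#define UNMAPS_PER_FLUSH_THRESHOLD 64   /* made-up number */

struct dev_flush_state {
        uint64_t pending_unmaps;        /* unmaps since the last flush */
};

/* If the device burns through IOVAs in a map-once/unmap-once pattern,
 * most of its IOTLB entries are dead anyway; one whole-BDF flush is
 * then cheaper than queueing an invalidation per IOVA. */
static bool unmap_should_flush_whole_bdf(struct dev_flush_state *s)
{
        if (++s->pending_unmaps < UNMAPS_PER_FLUSH_THRESHOLD)
                return false;           /* keep per-IOVA invalidations */
        s->pending_unmaps = 0;
        return true;                    /* flush everything for this BDF */
}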

We'll try to get free time to do some of these things soon.

Ben


1: We verified that the IOMMU costs are almost entirely software
overheads by forcing software 1:1 mode, where we create page tables
for all physical addresses.  We tested using leaf nodes of size 4KB,
of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
of the page tables, and no IOMMU invalidations.  We did piles of DMA
maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
addresses.  At 4KB page size, we could see some bandwidth slowdown,
but at 2MB and 1GB, there was < 1% performance loss as compared with
IOMMU off.
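For a sense of scale, a quick back-of-the-envelope program (256GB of RAM is
assumed purely for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        const uint64_t ram_bytes = 256ULL << 30;          /* assume 256 GB */
        const uint64_t leaf[] = { 4096, 2ULL << 20, 1ULL << 30 };
        const char *name[] = { "4KB", "2MB", "1GB" };

        for (int i = 0; i < 3; i++) {
                uint64_t entries = ram_bytes / leaf[i];
                /* Each leaf entry maps IOVA == phys for its whole chunk,
                 * so there is zero runtime page-table maintenance. */
                printf("%s leaves: %llu identity entries\n",
                       name[i], (unsigned long long)entries);
        }
        return 0;
}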
--


Re: [PATCH] KVM: VMX: Preserve host CR4.MCE value while in guest mode.

2015-04-17 Thread Benjamin Serebrin
Thanks, Paolo, for the document reference and the fixup.

Greg: Sorry for the standard newbie gaffes.

On Fri, Apr 17, 2015 at 3:16 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 17/04/2015 07:10, Wanpeng Li wrote:
>>> Before the change, if guest CR4.MCE==0, then the machine check is
>>> escalated to Catastrophic Error (CATERR) and the machine dies.
>> Could you point out which section of SDM describes that the machine check
>> is escalated to a CATERR if CR4.MCE==0?
>
> It's under the description of Interrupt 18--Machine-Check Exception (#MC):
>
>     The machine-check mechanism is enabled by setting the MCE flag in
>     control register CR4.  [...] If the machine-check mechanism is not
>     enabled (the MCE flag in control register CR4 is clear), a
>     machine-check exception causes the processor to enter the shutdown
>     state.
>
> This of course also applies whenever the non-root mode CR4 (*not* the
> shadow CR4 in the VMCS!) has the MCE flag cleared.
>
> Paolo
--


Re: Fwd: [RFC] Keeping host value of CR4.MCE (and other bits) live during guest execution

2015-04-16 Thread Benjamin Serebrin
Thanks very much, Andy and Paolo.  I've brought my patch out of the dark age
and just re-sent it under a new subject line.


On Thu, Apr 16, 2015 at 2:23 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 16/04/2015 01:52, Benjamin Serebrin wrote:
>> There's a bug in kvm/vmx.c: if the host enabled machine check (CR4.MCE==1),
>> the value gets zeroed while the CPU is running in guest context.
>> If a machine check event arrives while the CPU is in guest context and
>> effective CR4.MCE is zero, the machine raises CATERR and crashes.
>>
>> We should make sure the host value of CR4.MCE is always active.  There
>> are read and write shadows for the guest to think it wrote its own value.
>>
>> For discussion: there's new complexity with CR4 shadowing
>> (1e02ce4cccdcb9688386e5b8d2c9fa4660b45389).  I measure CR4 reads at 24
>> cycles on haswell and 36 on sandybridge, which compares well with
>> L2 miss costs.  Is the shadowing worth the complexity?  CR4 is also
>> cached (with no real consistency mechanism) in the VMCS at the time
>> of guest VCPU creation.  If there is ever a change in CR4 value
>> over time, or if CR4 is different on different CPUs in the system, all this
>> logic gets broken.
>
> Good catch.  Please resubmit with your Signed-off-by, and rebased to
> Linus's tree.
>
> Paolo
>
>> Thanks,
>> Ben
>>
>>
>> ---
>>
>> The host's decision to enable machine check exceptions should remain
>> in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
>> and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
>>
>> Tested: Inject machine check while a guest is spinning.
>> Before the change, if guest CR4.MCE==0, then the machine check is
>> escalated to Catastrophic Error (CATERR) and the machine dies.
>> If guest CR4.MCE==1, then the machine check causes VMEXIT and is
>> handled normally by host Linux. After the change, injecting a machine
>> check causes normal Linux machine check handling.
>>
>>
>> ---
>>  arch/x86/kvm/vmx.c | 12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index a214104..44c8d24 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -3456,8 +3456,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu,
>> unsigned long cr3)
>>
>>  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>  {
>> -   unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
>> -   KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
>> +   /*
>> +    * Pass through host's Machine Check Enable value to hw_cr4, which
>> +    * is in force while we are in guest mode.  Do not let guests control
>> +    * this bit, even if host CR4.MCE == 0.
>> +    */
>> +   unsigned long hw_cr4 =
>> +   (read_cr4() & X86_CR4_MCE) |
>> +   (cr4 & ~X86_CR4_MCE) |
>> +   (to_vmx(vcpu)->rmode.vm86_active ?
>> +    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
>>
>>  if (cr4 & X86_CR4_VMXE) {
>>  /*
>> --

--


[PATCH] vt-x: Preserve host CR4.MCE value while in guest mode.

2015-04-16 Thread Benjamin Serebrin
The host's decision to enable machine check exceptions should remain
in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
and passed a slightly-modified 0 to the vmcs.guest_cr4 value.

Tested: Built.
On earlier version, tested by injecting machine check while a guest is spinning.
Before the change, if guest CR4.MCE==0, then the machine check is
escalated to Catastrophic Error (CATERR) and the machine dies.
If guest CR4.MCE==1, then the machine check causes VMEXIT and is
handled normally by host Linux. After the change, injecting a machine
check causes normal Linux machine check handling.

Signed-off-by: Ben Serebrin <sereb...@google.com>
---
 arch/x86/kvm/vmx.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f5e8dce..f7b6168 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3622,8 +3622,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu,
unsigned long cr3)

 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
- unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
-KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
+ /*
+ * Pass through host's Machine Check Enable value to hw_cr4, which
+ * is in force while we are in guest mode.  Do not let guests control
+ * this bit, even if host CR4.MCE == 0.
+ */
+ unsigned long hw_cr4 =
+ (cr4_read_shadow() & X86_CR4_MCE) |
+ (cr4 & ~X86_CR4_MCE) |
+ (to_vmx(vcpu)->rmode.vm86_active ?
+ KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);

  if (cr4 & X86_CR4_VMXE) {
  /*
--


Fwd: [RFC] Keeping host value of CR4.MCE (and other bits) live during guest execution

2015-04-15 Thread Benjamin Serebrin
There's a bug in kvm/vmx.c: if the host enabled machine check (CR4.MCE==1),
the value gets zeroed while the CPU is running in guest context.
If a machine check event arrives while the CPU is in guest context and
effective CR4.MCE is zero, the machine raises CATERR and crashes.

We should make sure the host value of CR4.MCE is always active.  There
are read and write shadows for the guest to think it wrote its own value.

For discussion: there's new complexity with CR4 shadowing
(1e02ce4cccdcb9688386e5b8d2c9fa4660b45389).  I measure CR4 reads at 24
cycles on haswell and 36 on sandybridge, which compares well with
L2 miss costs.  Is the shadowing worth the complexity?  CR4 is also
cached (with no real consistency mechanism) in the VMCS at the time
of guest VCPU creation.  If there is ever a change in CR4 value
over time, or if CR4 is different on different CPUs in the system, all this
logic gets broken.
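For reference, a rough sketch of how such a measurement could be taken from a
throwaway kernel module (this is not the actual harness behind those numbers):

#include <linux/module.h>
#include <asm/special_insns.h>
#include <asm/msr.h>

#define CR4_READ_ITERS 100000

static int __init cr4_timing_init(void)
{
        unsigned long cr4 = 0;
        u64 t0, t1;
        int i;

        /* Average over many reads; a single read mostly measures rdtsc. */
        t0 = rdtsc_ordered();
        for (i = 0; i < CR4_READ_ITERS; i++)
                cr4 |= native_read_cr4();
        t1 = rdtsc_ordered();
        pr_info("cr4=%lx, ~%llu cycles per read\n",
                cr4, (t1 - t0) / CR4_READ_ITERS);
        return 0;
}

static void __exit cr4_timing_exit(void) { }

module_init(cr4_timing_init);
module_exit(cr4_timing_exit);
MODULE_LICENSE("GPL");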

Thanks,
Ben


---

The host's decision to enable machine check exceptions should remain
in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
and passed a slightly-modified 0 to the vmcs.guest_cr4 value.

Tested: Inject machine check while a guest is spinning.
Before the change, if guest CR4.MCE==0, then the machine check is
escalated to Catastrophic Error (CATERR) and the machine dies.
If guest CR4.MCE==1, then the machine check causes VMEXIT and is
handled normally by host Linux. After the change, injecting a machine
check causes normal Linux machine check handling.


---
 arch/x86/kvm/vmx.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a214104..44c8d24 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3456,8 +3456,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu,
unsigned long cr3)

 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
-   unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
-   KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
+   /*
+    * Pass through host's Machine Check Enable value to hw_cr4, which
+    * is in force while we are in guest mode.  Do not let guests control
+    * this bit, even if host CR4.MCE == 0.
+    */
+   unsigned long hw_cr4 =
+   (read_cr4() & X86_CR4_MCE) |
+   (cr4 & ~X86_CR4_MCE) |
+   (to_vmx(vcpu)->rmode.vm86_active ?
+    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);

if (cr4 & X86_CR4_VMXE) {
/*
--
--