Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
On Tue, Apr 17, 2018 at 09:07:01AM +0200, Jesper Dangaard Brouer wrote:
> > > number should improve more).  
> > 
> > What is the number for the otherwise comparable setup without retpolines?
> 
> Approx 12 Mpps.
> 
> You forgot to handle the dma_direct_mapping_error() case, which still
> used the retpoline in the above 8.9 Mpps measurement. I fixed it up and
> performance increased to 9.6 Mpps.
> 
> Notice that in this test there are still two retpoline/indirect calls
> left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF
> prog.

But that seems like a pretty clear indicator that we want the fast-path
direct mapping.  I'll try to find some time over the next few weeks to
do a cleaner version of it.


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Jesper Dangaard Brouer
On Mon, 16 Apr 2018 23:15:50 -0700
Christoph Hellwig  wrote:

> On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> > On x86, swiotlb falls back (via get_dma_ops -> get_arch_dma_ops) to using
> > x86_swiotlb_dma_ops instead of swiotlb_dma_ops.  I also included that
> > in the fix patch below.  
> 
> x86_swiotlb_dma_ops should not exist any more, and x86 now uses
> dma_direct_ops.  Looks like you are applying it to an old kernel :)
> 
> > Performance improved to 8.9 Mpps from approx 6.5Mpps.
> > 
> > (This was without my bulking for net_device->ndo_xdp_xmit, so that
> > number should improve more).  
> 
> What is the number for the otherwise comparable setup without retpolines?

Approx 12 Mpps.

You forgot to handle the dma_direct_mapping_error() case, which still
used the retpoline in the above 8.9 Mpps measurement. I fixed it up and
performance increased to 9.6 Mpps.

Notice that in this test there are still two retpoline/indirect calls
left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF
prog.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
> I'm not sure if I am really a fan of trying to solve this in this way.
> It seems like this is going to be optimizing the paths for one case at
> the detriment of others. Historically, mapping and unmapping have always
> been expensive, especially in IOMMU-enabled environments.
> I would much rather see us focus on having swiotlb_dma_ops replaced
> with dma_direct_ops in the cases where the device can access all of
> physical memory.

I am definitely not a fan, but IFF indirect calls are such an overhead
it makes sense to avoid them for the common and simple case.  And the
direct mapping is a common case present on just about every
architecture, and it is a very simple fast path that just adds an offset
to the physical address.  So if we want to speed something up, this is
it.
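
Concretely, the whole fast path boils down to something like the sketch
below (the helper name is made up for illustration, and the dma_mask
bounds check that lib/dma-direct.c does is elided):

	/* sketch: the direct-mapping fast path is just an address translation */
	static inline dma_addr_t sketch_dma_direct_map(struct device *dev,
			struct page *page, unsigned long offset)
	{
		/* phys_to_dma() applies any per-device/architecture offset */
		return phys_to_dma(dev, page_to_phys(page)) + offset;
	}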

> > -   if (ops->unmap_page)
> > +   if (!dev->is_dma_direct && ops->unmap_page)
> 
> If I understand correctly this is only needed for the swiotlb case and
> not the dma_direct case. It would make much more sense to just
> overwrite the dev->dma_ops pointer with dma_direct_ops to address all
> of the sync and unmap cases.

Yes.

> > +   if (dev->dma_ops == &dma_direct_ops ||
> > +   (dev->dma_ops == &swiotlb_dma_ops &&
> > +mask == DMA_BIT_MASK(64)))
> > +   dev->is_dma_direct = true;
> > +   else
> > +   dev->is_dma_direct = false;
> 
> So I am not sure this will work on x86. If I am not mistaken I believe
> dev->dma_ops is normally not set and instead the default dma
> operations are pulled via get_arch_dma_ops which returns the global
> dma_ops pointer.

True, for x86 we'd need to check get_arch_dma_ops().

> What you may want to consider as an alternative would be to look at
> modifying drivers that are using the swiotlb so that you could just
> overwrite the dev->dma_ops with the dma_direct_ops in the cases where
> the hardware can support accessing all of physical memory and where
> we aren't forcing the use of the bounce buffers in the swiotlb.
> 
> Then for the above code you only have to worry about the map calls,
> and you could just do a check against the dma_direct_ops pointer
> instead of having to add a new flag.

That would be the long term plan IFF we go down this route.  For now I
just wanted a quick hack for performance testing.
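
For reference, the kind of per-device override being discussed would be
roughly the following (a sketch only; the helper name is made up, and the
condition simply mirrors the 64-bit dma_mask / swiotlb_force checks seen
elsewhere in this thread rather than a final design):

	/*
	 * Sketch: point a fully 64-bit capable device at the direct ops so
	 * its dma_map/dma_unmap/sync calls skip the retpolined indirect
	 * call into swiotlb.  Not a tested patch.
	 */
	static void sketch_use_direct_ops(struct device *dev, u64 mask)
	{
		if (mask == DMA_BIT_MASK(64) && swiotlb_force != SWIOTLB_FORCE)
			dev->dma_ops = &dma_direct_ops;
	}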


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> On x86, swiotlb falls back (via get_dma_ops -> get_arch_dma_ops) to using
> x86_swiotlb_dma_ops instead of swiotlb_dma_ops.  I also included that
> in the fix patch below.

x86_swiotlb_dma_ops should not exist any more, and x86 now uses
dma_direct_ops.  Looks like you are applying it to an old kernel :)

> Performance improved to 8.9 Mpps from approx 6.5Mpps.
> 
> (This was without my bulking for net_device->ndo_xdp_xmit, so that
> number should improve more).

What is the number for the otherwise comparable setup without retpolines?


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Jesper Dangaard Brouer
On Mon, 16 Apr 2018 05:27:06 -0700
Christoph Hellwig  wrote:

> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
> 
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Mon, 16 Apr 2018 14:18:14 +0200
> Subject: dma-mapping: bypass dma_ops for direct mappings
> 
> Reportedly the retpoline mitigation for spectre causes huge penalties
> for indirect function calls.  This hack bypasses the dma_ops mechanism
> for simple direct mappings.

I did the below to get it compiling and working...

On x86, swiotlb falls back (via get_dma_ops -> get_arch_dma_ops) to using
x86_swiotlb_dma_ops instead of swiotlb_dma_ops.  I also included that
in the fix patch below.

Performance improved to 8.9 Mpps from approx 6.5Mpps.

(This was without my bulking for net_device->ndo_xdp_xmit, so that
number should improve more).

---
[PATCH RFC] fixups for Hellwig's DMA avoid retpoline overhead patch

From: Jesper Dangaard Brouer 

Performance improved to 8.9 Mpps
8917613 pkt/s

it was around 6.5 Mpps before.
---
 arch/x86/kernel/pci-swiotlb.c |3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |1 +
 include/linux/dma-mapping.h   |   14 +-
 lib/Kconfig   |2 +-
 lib/Makefile  |1 +
 lib/dma-direct.c  |2 ++
 lib/swiotlb.c |1 +
 7 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 0ee0f8f34251..46207e288587 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -48,7 +48,7 @@ void x86_swiotlb_free_coherent(struct device *dev, size_t size,
dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
 }
 
-static const struct dma_map_ops x86_swiotlb_dma_ops = {
+const struct dma_map_ops x86_swiotlb_dma_ops = {
.mapping_error = swiotlb_dma_mapping_error,
.alloc = x86_swiotlb_alloc_coherent,
.free = x86_swiotlb_free_coherent,
@@ -62,6 +62,7 @@ static const struct dma_map_ops x86_swiotlb_dma_ops = {
.unmap_page = swiotlb_unmap_page,
.dma_supported = NULL,
 };
+EXPORT_SYMBOL(x86_swiotlb_dma_ops);
 
 /*
  * pci_swiotlb_detect_override - set swiotlb to 1 if necessary
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0daccaf72a30..6d2e3f75febc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10297,6 +10297,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
return err;
 
if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
+   pr_info("XXX %s() dma_set_mask_and_coherent\n", __func__);
pci_using_dac = 1;
} else {
err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f2fb5aec7626..7fa92664ebfd 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -622,6 +622,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }
 
 extern const struct dma_map_ops swiotlb_dma_ops;
+extern const struct dma_map_ops x86_swiotlb_dma_ops;
 
 #ifndef HAVE_ARCH_DMA_SET_MASK
 static inline int dma_set_mask(struct device *dev, u64 mask)
@@ -632,12 +633,23 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
dma_check_mask(dev, mask);
 
*dev->dma_mask = mask;
+#ifdef CONFIG_DMA_DIRECT_OPS
if (dev->dma_ops == &dma_direct_ops ||
+# ifdef CONFIG_SWIOTLB
(dev->dma_ops == &swiotlb_dma_ops &&
-mask == DMA_BIT_MASK(64)))
+mask == DMA_BIT_MASK(64)) ||
+#  ifdef CONFIG_X86
+   (get_dma_ops(dev) == &x86_swiotlb_dma_ops &&
+mask == DMA_BIT_MASK(64))
+#  endif /* CONFIG_X86 */
+# endif /* CONFIG_SWIOTLB */
+  )
dev->is_dma_direct = true;
else
+#endif /* CONFIG_DMA_DIRECT_OPS */
dev->is_dma_direct = false;
+
+   pr_info("XXX: %s() DMA is direct: %d\n", __func__, dev->is_dma_direct);
return 0;
 }
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index e96089499371..6eba2bcf468a 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -416,7 +416,7 @@ config SGL_ALLOC
 config DMA_DIRECT_OPS
bool
depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
-   default n
+   default y
 
 config DMA_VIRT_OPS
bool
diff --git a/lib/Makefile b/lib/Makefile
index a90d4fcd748f..df4885eabf9c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -29,6 +29,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
 

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Alexander Duyck
On Mon, Apr 16, 2018 at 5:27 AM, Christoph Hellwig  wrote:
> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
>
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Mon, 16 Apr 2018 14:18:14 +0200
> Subject: dma-mapping: bypass dma_ops for direct mappings
>
> Reportedly the retpoline mitigation for spectre causes huge penalties
> for indirect function calls.  This hack bypasses the dma_ops mechanism
> for simple direct mappings.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/device.h  |  1 +
>  include/linux/dma-mapping.h | 53 +++--
>  lib/dma-direct.c|  4 +--
>  3 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 0059b99e1f25..725eec4c6653 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -990,6 +990,7 @@ struct device {
> booloffline_disabled:1;
> booloffline:1;
> boolof_node_reused:1;
> +   boolis_dma_direct:1;
>  };
>
>  static inline struct device *kobj_to_dev(struct kobject *kobj)
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index f8ab1c0f589e..c5d384ae25d6 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -223,6 +223,13 @@ static inline const struct dma_map_ops 
> *get_dma_ops(struct device *dev)
>  }
>  #endif
>
> +/* do not use directly! */
> +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> +   unsigned long offset, size_t size, enum dma_data_direction 
> dir,
> +   unsigned long attrs);
> +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
> +   int nents, enum dma_data_direction dir, unsigned long attrs);
> +
>  static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>   size_t size,
>   enum dma_data_direction dir,
> @@ -232,9 +239,13 @@ static inline dma_addr_t dma_map_single_attrs(struct 
> device *dev, void *ptr,
> dma_addr_t addr;
>
> BUG_ON(!valid_dma_direction(dir));
> -   addr = ops->map_page(dev, virt_to_page(ptr),
> -offset_in_page(ptr), size,
> -dir, attrs);
> +   if (dev->is_dma_direct) {
> +   addr = dma_direct_map_page(dev, virt_to_page(ptr),
> +   offset_in_page(ptr), size, dir, attrs);
> +   } else {
> +   addr = ops->map_page(dev, virt_to_page(ptr),
> +   offset_in_page(ptr), size, dir, attrs);
> +   }
> debug_dma_map_page(dev, virt_to_page(ptr),
>offset_in_page(ptr), size,
>dir, addr, true);

I'm not sure if I am really a fan of trying to solve this in this way.
It seems like this is going to be optimizing the paths for one case at
the detriment of others. Historically, mapping and unmapping have always
been expensive, especially in IOMMU-enabled environments.
I would much rather see us focus on having swiotlb_dma_ops replaced
with dma_direct_ops in the cases where the device can access all of
physical memory.

> @@ -249,7 +260,7 @@ static inline void dma_unmap_single_attrs(struct device 
> *dev, dma_addr_t addr,
> const struct dma_map_ops *ops = get_dma_ops(dev);
>
> BUG_ON(!valid_dma_direction(dir));
> -   if (ops->unmap_page)
> +   if (!dev->is_dma_direct && ops->unmap_page)

If I understand correctly this is only needed for the swiotlb case and
not the dma_direct case. It would make much more sense to just
overwrite the dev->dma_ops pointer with dma_direct_ops to address all
of the sync and unmap cases.

> ops->unmap_page(dev, addr, size, dir, attrs);
> debug_dma_unmap_page(dev, addr, size, dir, true);
>  }
> @@ -266,7 +277,10 @@ static inline int dma_map_sg_attrs(struct device *dev, 
> struct scatterlist *sg,
> int ents;
>
> BUG_ON(!valid_dma_direction(dir));
> -   ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +   if (dev->is_dma_direct)
> +   ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
> +   else
> +   ents = ops->map_sg(dev, sg, nents, dir, attrs);
> BUG_ON(ents < 0);
> debug_dma_map_sg(dev, sg, nents, ents, dir);
>
> @@ -281,7 +295,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
> struct scatterlist *sg
>
> BUG_ON(!valid_dma_direction(dir));
> debug_dma_unmap_sg(dev, sg, nents, dir);
> -   if (ops->unmap_sg)
> +   if (!dev->is_dma_direct && ops->unmap_sg)
> ops->unmap_sg(dev, sg, 

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Christoph Hellwig
Can you try the following hack which avoids indirect calls entirely
for the fast path direct mapping case?

---
From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Mon, 16 Apr 2018 14:18:14 +0200
Subject: dma-mapping: bypass dma_ops for direct mappings

Reportedly the retpoline mitigation for spectre causes huge penalties
for indirect function calls.  This hack bypasses the dma_ops mechanism
for simple direct mappings.

Signed-off-by: Christoph Hellwig 
---
 include/linux/device.h  |  1 +
 include/linux/dma-mapping.h | 53 +++--
 lib/dma-direct.c|  4 +--
 3 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 0059b99e1f25..725eec4c6653 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -990,6 +990,7 @@ struct device {
booloffline_disabled:1;
booloffline:1;
boolof_node_reused:1;
+   boolis_dma_direct:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..c5d384ae25d6 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -223,6 +223,13 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+/* do not use directly! */
+dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
+   unsigned long offset, size_t size, enum dma_data_direction dir,
+   unsigned long attrs);
+int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
+   int nents, enum dma_data_direction dir, unsigned long attrs);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
  size_t size,
  enum dma_data_direction dir,
@@ -232,9 +239,13 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
dma_addr_t addr;
 
BUG_ON(!valid_dma_direction(dir));
-   addr = ops->map_page(dev, virt_to_page(ptr),
-offset_in_page(ptr), size,
-dir, attrs);
+   if (dev->is_dma_direct) {
+   addr = dma_direct_map_page(dev, virt_to_page(ptr),
+   offset_in_page(ptr), size, dir, attrs);
+   } else {
+   addr = ops->map_page(dev, virt_to_page(ptr),
+   offset_in_page(ptr), size, dir, attrs);
+   }
debug_dma_map_page(dev, virt_to_page(ptr),
   offset_in_page(ptr), size,
   dir, addr, true);
@@ -249,7 +260,7 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (ops->unmap_page)
+   if (!dev->is_dma_direct && ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir, true);
 }
@@ -266,7 +277,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int ents;
 
BUG_ON(!valid_dma_direction(dir));
-   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+   if (dev->is_dma_direct)
+   ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
+   else
+   ents = ops->map_sg(dev, sg, nents, dir, attrs);
BUG_ON(ents < 0);
debug_dma_map_sg(dev, sg, nents, ents, dir);
 
@@ -281,7 +295,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (ops->unmap_sg)
+   if (!dev->is_dma_direct && ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
@@ -295,7 +309,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev,
dma_addr_t addr;
 
BUG_ON(!valid_dma_direction(dir));
-   addr = ops->map_page(dev, page, offset, size, dir, attrs);
+   if (dev->is_dma_direct)
+   addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
+   else
+   addr = ops->map_page(dev, page, offset, size, dir, attrs);
debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
return addr;
@@ -309,7 +326,7 @@ static inline void dma_unmap_page_attrs(struct device *dev,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (ops->unmap_page)
+   if (!dev->is_dma_direct && ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir, false);
 }
@@ 

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Jesper Dangaard Brouer
On Sat, 14 Apr 2018 21:29:26 +0200
David Woodhouse  wrote:

> On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> > On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:  
> > > I guess there is nothing we need to do!
> > >
> > > On x86, in case of no intel iommu or iommu is disabled, you end up in
> > > swiotlb for DMA API calls when system has 4G memory.
> > > However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> > > use bounce buffer until and unless you have swiotlb=force specified in
> > > kernel commandline.  
> > 
> > Sure.  But that means every sync_*_to_device and sync_*_to_cpu now
> > involves an indirect call to do exactly nothing, which in the workload
> > Jesper is looking at is causing a huge performance degradation due to
> > retpolines.  

Yes, exactly.

> 
> We should look at using the
> 
>  if (dma_ops == swiotlb_dma_ops)
>     swiotlb_map_page()
>  else
>     dma_ops->map_page()
> 
> trick for this. Perhaps with alternatives so that when an Intel or AMD
> IOMMU is detected, it's *that* which is checked for as the special
> case.

Yes, this trick is basically what I'm asking for :-)

It did sound like Hellwig wanted to first avoid/fix that x86 ends up
defaulting to swiotlb.  Thus, we just have to do the same trick with
the new default fall-through dma_ops.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-14 Thread David Woodhouse


On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> > I guess there is nothing we need to do!
> >
> > On x86, in case of no intel iommu or iommu is disabled, you end up in
> > swiotlb for DMA API calls when system has 4G memory.
> > However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> > use bounce buffer until and unless you have swiotlb=force specified in
> > kernel commandline.
> 
> Sure.  But that means every sync_*_to_device and sync_*_to_cpu now
> involves an indirect call to do exactly nothing, which in the workload
> Jesper is looking at is causing a huge performance degradation due to
> retpolines.

We should look at using the

 if (dma_ops == swiotlb_dma_ops)
    swiotlb_map_page()
 else
    dma_ops->map_page()

trick for this. Perhaps with alternatives so that when an Intel or AMD
IOMMU is detected, it's *that* which is checked for as the special
case.
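
For illustration, that kind of "compare against the known-common ops"
check could look like the sketch below, using a static key in place of
the alternatives mechanism mentioned above (names are made up, attrs
dropped; a sketch, not a proposed patch):

	DEFINE_STATIC_KEY_TRUE(sketch_dma_is_swiotlb);

	/* sketch: take the direct swiotlb call when the common ops are in use,
	 * and only fall back to the retpolined indirect call otherwise */
	static inline dma_addr_t sketch_map_page(struct device *dev,
			struct page *page, unsigned long offset, size_t size,
			enum dma_data_direction dir)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (static_branch_likely(&sketch_dma_is_swiotlb) &&
		    ops == &swiotlb_dma_ops)
			return swiotlb_map_page(dev, page, offset, size, dir, 0);

		return ops->map_page(dev, page, offset, size, dir, 0);
	}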



Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, in case of no intel iommu or iommu is disabled, you end up in
> swiotlb for DMA API calls when system has 4G memory.
> However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> use bounce buffer until and unless you have swiotlb=force specified in
> kernel commandline.

Sure.  But that means every sync_*_to_device and sync_*_to_cpu now
involves an indirect call to do exactly nothing, which in the workload
Jesper is looking at is causing a huge performance degradation due to
retpolines.


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Tushar Dave



On 04/12/2018 07:56 AM, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > > ---
> > > Implement support for keeping the DMA mapping through the XDP return
> > > call, to remove RX map/unmap calls.  Implement bulking for XDP
> > > ndo_xdp_xmit and the XDP return frame API.  Bulking allows performing DMA
> > > bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> > > map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> > > to mitigate (via the bulk technique). Ask the DMA maintainer for a
> > > common-case direct call for the swiotlb DMA sync ;-)
> > 
> > Why do you even end up in swiotlb code?  Once you bounce buffer your
> > performance is toast anyway..
> 
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. That should be solvable fairly easily
> with the per-device dma ops, though.


I guess there is nothing we need to do!

On x86, in case of no intel iommu or iommu is disabled, you end up in
swiotlb for DMA API calls when system has 4G memory.
However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
use bounce buffer until and unless you have swiotlb=force specified in
kernel commandline.

e.g. here is the snip:
dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
unsigned long attrs)
{
phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = phys_to_dma(dev, phys);

BUG_ON(dir == DMA_NONE);
/*
 * If the address happens to be in the device's DMA window,
 * we can safely return the device addr and not worry about bounce
 * buffering it.
 */
	if (dma_capable(dev, dev_addr, size) && swiotlb_force != SWIOTLB_FORCE)
		return dev_addr;


-Tushar




Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory. 
> 
> I was also confused why I ended up using SWIOTLB (SoftWare IO-TLB),
> that might explain it. And I'm not hitting the bounce-buffer case.
> 
> How do I control which DMA engine I use? (So, I can play a little)

At the lowest level you control it by:

 (1) setting the dma_ops pointer in struct device
 (2) if that is NULL by choosing what is returned from
 get_arch_dma_ops()

> 
> 
> > That should be solvable fairly easily with the per-device dma ops,
> > though.
> 
> I didn't understand this part.

What I mean by that is that we can start out setting dma_ops
to dma_direct_ops for everyone on x86 (that is, assuming
we don't have an iommu), and only switch to swiotlb_dma_ops when
actually required by either a dma_mask that can't address all memory,
or some other special case like SEV or broken bridges.
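
A rough sketch of that selection (the function name and the exact
conditions are illustrative only, not the eventual implementation):

	/*
	 * Sketch: default to the direct ops, fall back to swiotlb only when
	 * bounce buffering is forced or the mask can't cover all memory.
	 * Special cases like SEV are ignored here.
	 */
	static const struct dma_map_ops *sketch_pick_dma_ops(u64 mask)
	{
		if (swiotlb_force == SWIOTLB_FORCE || mask != DMA_BIT_MASK(64))
			return &swiotlb_dma_ops;

		return &dma_direct_ops;		/* plain offset-based mapping */
	}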

> I wanted to ask your opinion on a hackish idea I have...
> Which is how to detect if I can reuse the RX-DMA map address for a TX-DMA
> operation on another device (still/only calling sync_single_for_device).
> 
> With XDP_REDIRECT we are redirecting between net_device's. Usually
> we keep the RX-DMA mapping as we recycle the page. On the redirect to
> TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
> question is how to avoid this mapping(?).  In some cases, with some DMA
> engines (or lack of) I guess the DMA address is actually the same as
> the RX-DMA mapping dma_addr_t already known, right?  For those cases,
> would it be possible to just (re)use that address for TX?

You can't in any sensible way without breaking a lot of abstractions.
For dma direct ops that mapping will be the same unless the devices
have different dma_offsets in their struct device, or the architecture
overrides phys_to_dma entirely, in which case all bets are off.
If you have an iommu it depends on which devices are behind the same
iommu.
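
To make the constraint concrete, the check such reuse would implicitly
rely on is essentially this (a sketch, ignoring IOMMUs entirely; the
helper name is made up):

	/* sketch: the RX dma_addr_t is only reusable for TX on another device
	 * if both devices translate the physical address identically */
	static bool sketch_same_direct_mapping(struct device *rx_dev,
					       struct device *tx_dev,
					       phys_addr_t phys)
	{
		return phys_to_dma(rx_dev, phys) == phys_to_dma(tx_dev, phys);
	}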


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
On Thu, 12 Apr 2018 16:56:53 +0200 Christoph Hellwig  wrote:

> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:  
> > > ---
> > > Implement support for keeping the DMA mapping through the XDP return
> > > call, to remove RX map/unmap calls.  Implement bulking for XDP
> > > ndo_xdp_xmit and the XDP return frame API.  Bulking allows performing DMA
> > > bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> > > map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> > > to mitigate (via the bulk technique). Ask the DMA maintainer for a
> > > common-case direct call for the swiotlb DMA sync ;-)  
> > 
> > Why do you even end up in swiotlb code?  Once you bounce buffer your
> > performance is toast anyway..  
> 
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. 

I was also confused why I ended up using SWIOTLB (SoftWare IO-TLB),
that might explain it. And I'm not hitting the bounce-buffer case.

How do I control which DMA engine I use? (So, I can play a little)


> That should be solvable fairly easily with the per-device dma ops,
> though.

I didn't understand this part.

I wanted to ask your opinion on a hackish idea I have...
Which is how to detect if I can reuse the RX-DMA map address for a TX-DMA
operation on another device (still/only calling sync_single_for_device).

With XDP_REDIRECT we are redirecting between net_device's. Usually
we keep the RX-DMA mapping as we recycle the page. On the redirect to
TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
question is how to avoid this mapping(?).  In some cases, with some DMA
engines (or lack of) I guess the DMA address is actually the same as
the RX-DMA mapping dma_addr_t already known, right?  For those cases,
would it be possible to just (re)use that address for TX?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > ---
> > Implement support for keeping the DMA mapping through the XDP return
> > call, to remove RX map/unmap calls.  Implement bulking for XDP
> > ndo_xdp_xmit and the XDP return frame API.  Bulking allows performing DMA
> > bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> > map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> > to mitigate (via the bulk technique). Ask the DMA maintainer for a
> > common-case direct call for the swiotlb DMA sync ;-)
> 
> Why do you even end up in swiotlb code?  Once you bounce buffer your
> performance is toast anyway..

I guess that is because x86 selects it as the default as soon as
we have more than 4G memory. That should be solvable fairly easily
with the per-device dma ops, though.


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> ---
> Implement support for keeping the DMA mapping through the XDP return
> call, to remove RX map/unmap calls.  Implement bulking for XDP
> ndo_xdp_xmit and the XDP return frame API.  Bulking allows performing DMA
> bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> to mitigate (via the bulk technique). Ask the DMA maintainer for a
> common-case direct call for the swiotlb DMA sync ;-)

Why do you even end up in swiotlb code?  Once you bounce buffer your
performance is toast anyway..


XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
Heads-up XDP performance nerds!

I got an unpleasant surprise when I updated my GCC compiler (to support
the option -mindirect-branch=thunk-extern).  My XDP redirect
performance numbers were cut in half, from approx 13 Mpps to 6 Mpps
(single CPU core).  I've identified the issue, which is caused by
kernel CONFIG_RETPOLINE, which only has an effect when the GCC compiler
supports it.  This is the mitigation of Spectre variant 2 (CVE-2017-5715)
related to indirect (function call) branches.

XDP_REDIRECT itself only has two primary (per-packet) indirect
function calls, ndo_xdp_xmit and invoking the bpf_prog, plus any
map_lookup_elem calls in the bpf_prog.  I implemented a PoC of bulking for
ndo_xdp_xmit, which helped, but not enough. The real root cause is all
the DMA API calls, which use function pointers extensively.


Mitigation plan
---
Implement support for keeping the DMA mapping through the XDP return
call, to remove RX map/unmap calls.  Implement bulking for XDP
ndo_xdp_xmit and the XDP return frame API.  Bulking allows performing DMA
bulking via scatter-gather DMA calls; XDP TX needs it for DMA
map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
to mitigate (via the bulk technique). Ask the DMA maintainer for a
common-case direct call for the swiotlb DMA sync ;-)

Root-cause verification
---
I have verified that indirect DMA calls are the root cause by
removing the DMA sync calls from the code (as for swiotlb they do
nothing), and manually inlining the DMA map calls (basically calling
phys_to_dma(dev, page_to_phys(page)) + offset). For my ixgbe test,
performance "returned" to 11 Mpps.

Perf reports

It is not easy to diagnose via the perf event tool. I'm coordinating with
ACME to make it easier to pinpoint the hotspots.  Look out for symbols:
__x86_indirect_thunk_r10, __indirect_thunk_start, __x86_indirect_thunk_rdx
etc.  Be aware that they might not be super high in perf top, but they
stop CPU speculation.  Thus, use perf stat instead and see the
negative effect on 'insn per cycle'.


If you want to understand retpoline at the ASM level, read this:
 https://support.google.com/faqs/answer/7625886

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer