Re: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-13 Thread Christoph Hellwig
I've pulled v2 with the ia64 fix into dma-mapping for-next.  This should
give us a little more than a week in linux-next to sort out any
issues.


Re: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-11 Thread Christoph Hellwig
On Tue, Dec 11, 2018 at 05:13:30PM +, Luck, Tony wrote:
> > But that might not be your fault. My ancient system is getting flaky. A v4.19
> > build that has booted before is also resetting :-(
> 
> After a power-cycle (and some time to let the machine cool off), the system
> now boots with your patch series plus the __phys_to_pfn() #define.
> 
> So if you can figure out the right way to fix that, you are good to go.
> 
> Tested-by: Tony Luck 

Thanks.  I'll just replace the __phys_to_pfn with PHYS_PFN for now,
and see if I can find time to get the whole kernel to agree on
one version of this macro eventually.
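
For illustration, the change boils down to swapping the missing macro for its
generic equivalent in the ia64 arch_sync_dma_for_cpu() helper.  This is a hedged
reconstruction based on the build log elsewhere in this thread, not the actual
hunk; the PG_arch_1 lazy icache-flush loop is assumed:

void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
		size_t size, enum dma_data_direction dir)
{
	/* PHYS_PFN() is the generic phys-to-pfn conversion; it replaces the
	 * __phys_to_pfn() call that ia64 never provided (assumed call site). */
	unsigned long pfn = PHYS_PFN(paddr);

	do {
		set_bit(PG_arch_1, &pfn_to_page(pfn)->flags);
	} while (++pfn <= PHYS_PFN(paddr + size - 1));
}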


RE: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-11 Thread Luck, Tony
> But that might not be your fault. My ancient system is getting flaky. A v4.19
> build that has booted before is also resetting :-(

After a power-cycle (and some time to let the machine cool off), the system
now boots with your patch series plus the __phys_to_pfn() #define.

So if you can figure out the right way to fix that, you are good to go.

Tested-by: Tony Luck 


RE: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-11 Thread Luck, Tony
> This should fix it:
...
> +#include 

Not quite. There is still an issue with __phys_to_pfn(paddr).

Trying to #include  gave me a raft of redefined
macros. So I just added

#define __phys_to_pfn(paddr)	PHYS_PFN(paddr)

to arch/ia64/mm/init.c

That made the build work. But boot spontaneously resets after:

mptsas: ioc1: attaching ssp device: fw_channel 0, fw_id 6, phy 6, sas_addr 0x5000c5000ecada69
scsi 5:0:0:0: Direct-Access SEAGATE  ST9146802SS  0003 PQ: 0 ANSI: 5
EFI Variables Facility v0.08 2004-May-17
sd 5:0:0:0: [sdb] 286749488 512-byte logical blocks: (147 GB/137 GiB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sdb: sdb1 sdb2
sd 5:0:0:0: [sdb] Attached SCSI disk

But that might not be your fault. My ancient system is getting flaky. A v4.19
build that has booted before is also resetting :-(

-Tony




Re: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-10 Thread Christoph Hellwig
On Mon, Dec 10, 2018 at 01:51:13PM -0800, Luck, Tony wrote:
> But the ia64 build fails with:

Yes, I just got the same complaint from the buildbot.  Unfortunately
I don't have a good ia64 cross compiler locally, given that Debian
is lacking one and the one provided by the buildbot doesn't build on
Debian either.

This should fix it:

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 2c51733f1dfd..a007afaa556c 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 


Re: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-10 Thread Luck, Tony
On Fri, Dec 07, 2018 at 11:07:05AM -0800, Christoph Hellwig wrote:
> This work is based on the dma-mapping tree, so you probably want to
> use this git tree for testing:
> 
> git://git.infradead.org/users/hch/misc.git dma-direct-calls.2

Pulled this tree. Got HEAD

33b9fc015171 ("dma-mapping: bypass indirect calls for dma-direct")

But the ia64 build fails with:

arch/ia64/mm/init.c:75:21: warning: 'enum dma_data_direction' declared inside parameter list [enabled by default]
arch/ia64/mm/init.c:75:21: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
arch/ia64/mm/init.c:75:40: error: parameter 4 ('dir') has incomplete type
arch/ia64/mm/init.c:74:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
arch/ia64/mm/init.c: In function 'arch_sync_dma_for_cpu':
arch/ia64/mm/init.c:77:2: error: implicit declaration of function '__phys_to_pfn' [-Werror=implicit-function-declaration]
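
For context, the first diagnostics are standard C behaviour: naming an enum in a
parameter list before any declaration of it is visible creates a new, incomplete
enum type scoped to that prototype.  A minimal standalone illustration with
made-up identifiers (not the kernel code):

/* enum-warn.c -- compile with: gcc -c -Wall enum-warn.c
 * Reproduces the "declared inside parameter list" warning and the
 * "parameter has incomplete type" error seen above.  The fix is to make
 * the enum visible first -- in the kernel, by including the header that
 * defines enum dma_data_direction.
 */
void sync_for_cpu(unsigned long paddr, unsigned long size,
		  enum dma_data_direction dir)
{
}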

-Tony


Re: [RFC] avoid indirect calls for DMA direct mappings v2

2018-12-08 Thread Jesper Dangaard Brouer
On Fri,  7 Dec 2018 11:07:05 -0800
Christoph Hellwig  wrote:

> Hi all,
> 
> a while ago Jesper reported major performance regressions due to the
> spectre v2 mitigations in his XDP forwarding workloads.  A large part
> of that is due to the DMA mapping API indirect calls.
> 
> It turns out that the most common implementation of the DMA API is the
> direct mapping case, and now that we have merged almost all duplicate
> implementations of that into a single generic one, it is easily feasible to
> use direct calls for this fast path.
> 
> This series consolidates the DMA mapping code by merging the
> swiotlb case into the dma direct case, and then treats NULL dma_ops
> as an indicator that we should directly call the direct mapping
> case.  This recovers a large part of the retpoline-induced XDP slowdown.
> 
> This work is based on the dma-mapping tree, so you probably want to
> use this git tree for testing:
> 
> git://git.infradead.org/users/hch/misc.git dma-direct-calls.2
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2

You can add my:
 Tested-by: Jesper Dangaard Brouer 
or
 Acked-by: Jesper Dangaard Brouer 

I'm very happy that you are working on this.  And I've done micro-benchmark
testing of the patchset (and branch dma-direct-calls), which I've made
available here:

https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org

My XDP performance is back, minus the BPF indirect call, the
net_rx_action napi->poll call, and the net_device->ndo_xdp_xmit call.  I
verified that manually disabling retpoline for these remaining netstack
retpoline calls restores the performance fully (well, minus 1.5 nanosec).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-07 Thread Jesper Dangaard Brouer
On Fri, 7 Dec 2018 16:44:35 +0100
Jesper Dangaard Brouer  wrote:

> On Fri, 7 Dec 2018 02:21:42 +0100
> Christoph Hellwig  wrote:
> 
> > On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote:  
> > > On 06/12/2018 20:00, Christoph Hellwig wrote:
> > >> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote:
> > >>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at
> > >>> the point we detected the ACPI properties are wrong - that shouldn't be 
> > >>> too
> > >>> much of a headache to go back to.
> > >>
> > >> Ok.  I've cooked up a patch to use NULL as the go direct marker.
> > >> This cleans up a few things nicely, but also means we now need to
> > >> do the bypass scheme for all ops, not just the fast path.  But we
> > >> probably should just move the slow path ops out of line anyway,
> > >> so I'm not worried about it.  This has survived some very basic
> > >> testing on x86, and really needs to be cleaned up and split into
> > >> multiple patches..
> > >
> > > I've also just finished hacking something up to keep the arm64 status quo 
> > > - 
> > > I'll need to actually test it tomorrow, but the overall diff looks like 
> > > the 
> > > below.
> > 
> > Nice.  I created a branch that picked up your bits and also the ideas
> > from Linus, and the result looks really nice.  I'll still need a signoff
> > for your bits, though.
> > 
> > Jesper, can you give this a spin if it changes the number even further?
> > 
> >   git://git.infradead.org/users/hch/misc.git dma-direct-calls.2
> > 
> >   
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2
> >   
> 
> I'll test it soon...
> 
> I looked at my perf stat recording on my existing tests[1] and there
> seems to be significantly more I-cache usage.

The I-cache pressure seems to be lower with the new branch, and
performance improved by 4.5 nanosec.

 
> Copy-paste from my summary[1]:
> [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org#summary-of-results

Updated:

* Summary of results

Using XDP_REDIRECT between drivers, RX on ixgbe(10G) redirected to TX on
i40e(40G), via BPF devmap (used samples/bpf/xdp_redirect_map).  (Note: choose a
higher TX link-speed to ensure that we don't have a TX bottleneck.)
The baseline kernel is at commit
[[https://git.kernel.org/torvalds/c/ef78e5ec9214][ef78e5ec9214]], which is the
commit just before Hellwig's changes in this tree.

Performance numbers in packets/sec (XDP_REDIRECT ixgbe -> i40e):
 - 11913154 (11,913,154) pps - baseline compiled without retpoline
 -  7438283  (7,438,283) pps - regression due to CONFIG_RETPOLINE
 -  9610088  (9,610,088) pps - mitigation via Hellwig dma-direct-calls
 - 10049223 (10,049,223) pps - Hellwig branch dma-direct-calls.2

Do notice that at these extreme speeds the pps numbers change rapidly with
small changes; e.g. the difference to the new branch is:
 - (1/9610088 - 1/10049223) * 10^9 = 4.54 nanosec faster

From the insn-per-cycle numbers, it is clear that retpolines are stalling the CPU
pipeline:

| pps        | insn per cycle |
|------------+----------------|
| 11,913,154 |           2.39 |
|  7,438,283 |           1.54 |
|  9,610,088 |           2.04 |
| 10,049,223 |           1.99 |


Strangely the Instruction-Cache is also under heavier pressure:

| pps        | l2_rqsts.all_code_rd | l2_rqsts.code_rd_hit | l2_rqsts.code_rd_miss |
|------------+----------------------+----------------------+-----------------------|
| 11,913,154 |              874,547 |              742,335 |               132,198 |
|  7,438,283 |              649,513 |              547,581 |               101,945 |
|  9,610,088 |            2,568,064 |            2,001,369 |               566,683 |
| 10,049,223 |            1,232,818 |            1,152,514 |                80,299 |

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-07 Thread Jesper Dangaard Brouer
On Fri, 7 Dec 2018 02:21:42 +0100
Christoph Hellwig  wrote:

> On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote:
> > On 06/12/2018 20:00, Christoph Hellwig wrote:  
> >> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote:  
> >>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at
> >>> the point we detected the ACPI properties are wrong - that shouldn't be 
> >>> too
> >>> much of a headache to go back to.  
> >>
> >> Ok.  I've cooked up a patch to use NULL as the go direct marker.
> >> This cleans up a few things nicely, but also means we now need to
> >> do the bypass scheme for all ops, not just the fast path.  But we
> >> probably should just move the slow path ops out of line anyway,
> >> so I'm not worried about it.  This has survived some very basic
> >> testing on x86, and really needs to be cleaned up and split into
> >> multiple patches..  
> >
> > I've also just finished hacking something up to keep the arm64 status quo - 
> > I'll need to actually test it tomorrow, but the overall diff looks like the 
> > below.  
> 
> Nice.  I created a branch that picked up your bits and also the ideas
> from Linus, and the result looks really nice.  I'll still need a signoff
> for your bits, though.
> 
> Jesper, can you give this a spin if it changes the number even further?
> 
>   git://git.infradead.org/users/hch/misc.git dma-direct-calls.2
> 
>   
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2

I'll test it soon...

I looked at my perf stat recording on my existing tests[1] and there
seems to be significantly more I-cache usage.

Copy-paste from my summary[1]:
 [1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org#summary-of-results

* Summary of results

Using XDP_REDIRECT between drivers, RX on ixgbe(10G) redirected to TX on
i40e(40G), via BPF devmap (used samples/bpf/xdp_redirect_map).  (Note: choose a
higher TX link-speed to ensure that we don't have a TX bottleneck.)
The baseline kernel is at commit https://git.kernel.org/torvalds/c/ef78e5ec9214,
which is the commit just before Hellwig's changes in this tree.

Performance numbers in packets/sec (XDP_REDIRECT ixgbe -> i40e):
 - 11913154 (11,913,154) pps - baseline compiled without retpoline
 -  7438283  (7,438,283) pps - regression due to CONFIG_RETPOLINE
 -  9610088  (9,610,088) pps - mitigation via Hellwig dma-direct-calls

From the insn-per-cycle numbers, it is clear that retpolines are stalling the CPU
pipeline:

| pps        | insn per cycle |
|------------+----------------|
| 11,913,154 |           2.39 |
|  7,438,283 |           1.54 |
|  9,610,088 |           2.04 |


Strangely the Instruction-Cache is also under heavier pressure:

| pps        | l2_rqsts.all_code_rd | l2_rqsts.code_rd_hit | l2_rqsts.code_rd_miss |
|------------+----------------------+----------------------+-----------------------|
| 11,913,154 |              874,547 |              742,335 |               132,198 |
|  7,438,283 |              649,513 |              547,581 |               101,945 |
|  9,610,088 |            2,568,064 |            2,001,369 |               566,683 |


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote:
> On 06/12/2018 20:00, Christoph Hellwig wrote:
>> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote:
>>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at
>>> the point we detected the ACPI properties are wrong - that shouldn't be too
>>> much of a headache to go back to.
>>
>> Ok.  I've cooked up a patch to use NULL as the go direct marker.
>> This cleans up a few things nicely, but also means we now need to
>> do the bypass scheme for all ops, not just the fast path.  But we
>> probably should just move the slow path ops out of line anyway,
>> so I'm not worried about it.  This has survived some very basic
>> testing on x86, and really needs to be cleaned up and split into
>> multiple patches..
>
> I've also just finished hacking something up to keep the arm64 status quo - 
> I'll need to actually test it tomorrow, but the overall diff looks like the 
> below.

Nice.  I created a branch that picked up your bits and also the ideas
from Linus, and the result looks really nice.  I'll still need a signoff
for your bits, though.

Jesper, can you give this a spin if it changes the number even further?

  git://git.infradead.org/users/hch/misc.git dma-direct-calls.2

  
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Robin Murphy

On 06/12/2018 20:00, Christoph Hellwig wrote:

On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote:

I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at
the point we detected the ACPI properties are wrong - that shouldn't be too
much of a headache to go back to.


Ok.  I've cooked up a patch to use NULL as the go direct marker.
This cleans up a few things nicely, but also means we now need to
do the bypass scheme for all ops, not just the fast path.  But we
probably should just move the slow path ops out of line anyway,
so I'm not worried about it.  This has survived some very basic
testing on x86, and really needs to be cleaned up and split into
multiple patches..


I've also just finished hacking something up to keep the arm64 status 
quo - I'll need to actually test it tomorrow, but the overall diff looks 
like the below.


Robin.

----->8-----
diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index 0626c3a93730..fe6287f68adb 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -19,15 +19,13 @@
 #include 
 #include 

-extern const struct dma_map_ops dummy_dma_ops;
-
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 {
/*
 * We expect no ISA devices, and all other DMA masters are expected to
 * have someone call arch_setup_dma_ops at device creation time.
 */
-   return &dummy_dma_ops;
+   return &dma_dummy_ops;
 }

 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 17f8fc784de2..574aee2d04e7 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -255,98 +255,6 @@ static int __init atomic_pool_init(void)
return -ENOMEM;
 }

-/*******************************************
- * The following APIs are for dummy DMA ops *
- *******************************************/
-
-static void *__dummy_alloc(struct device *dev, size_t size,
-  dma_addr_t *dma_handle, gfp_t flags,
-  unsigned long attrs)
-{
-   return NULL;
-}
-
-static void __dummy_free(struct device *dev, size_t size,
-void *vaddr, dma_addr_t dma_handle,
-unsigned long attrs)
-{
-}
-
-static int __dummy_mmap(struct device *dev,
-   struct vm_area_struct *vma,
-   void *cpu_addr, dma_addr_t dma_addr, size_t size,
-   unsigned long attrs)
-{
-   return -ENXIO;
-}
-
-static dma_addr_t __dummy_map_page(struct device *dev, struct page *page,
-  unsigned long offset, size_t size,
-  enum dma_data_direction dir,
-  unsigned long attrs)
-{
-   return 0;
-}
-
-static void __dummy_unmap_page(struct device *dev, dma_addr_t dev_addr,
-  size_t size, enum dma_data_direction dir,
-  unsigned long attrs)
-{
-}
-
-static int __dummy_map_sg(struct device *dev, struct scatterlist *sgl,
- int nelems, enum dma_data_direction dir,
- unsigned long attrs)
-{
-   return 0;
-}
-
-static void __dummy_unmap_sg(struct device *dev,
-struct scatterlist *sgl, int nelems,
-enum dma_data_direction dir,
-unsigned long attrs)
-{
-}
-
-static void __dummy_sync_single(struct device *dev,
-   dma_addr_t dev_addr, size_t size,
-   enum dma_data_direction dir)
-{
-}
-
-static void __dummy_sync_sg(struct device *dev,
-   struct scatterlist *sgl, int nelems,
-   enum dma_data_direction dir)
-{
-}
-
-static int __dummy_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
-{
-   return 1;
-}
-
-static int __dummy_dma_supported(struct device *hwdev, u64 mask)
-{
-   return 0;
-}
-
-const struct dma_map_ops dummy_dma_ops = {
-   .alloc  = __dummy_alloc,
-   .free   = __dummy_free,
-   .mmap   = __dummy_mmap,
-   .map_page   = __dummy_map_page,
-   .unmap_page = __dummy_unmap_page,
-   .map_sg = __dummy_map_sg,
-   .unmap_sg   = __dummy_unmap_sg,
-   .sync_single_for_cpu= __dummy_sync_single,
-   .sync_single_for_device = __dummy_sync_single,
-   .sync_sg_for_cpu= __dummy_sync_sg,
-   .sync_sg_for_device = __dummy_sync_sg,
-   .mapping_error  = __dummy_mapping_error,
-   .dma_supported  = __dummy_dma_supported,
-};
-EXPORT_SYMBOL(dummy_dma_ops);
-
 static int __init arm64_dma_init(void)
 {
WARN_TAINT(ARCH_DMA_MINALIGN < cache_line_size(),
diff --git 

Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote:
> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at 
> the point we detected the ACPI properties are wrong - that shouldn't be too 
> much of a headache to go back to.

Ok.  I've cooked up a patch to use NULL as the go direct marker.
This cleans up a few things nicely, but also means we now need to
do the bypass scheme for all ops, not just the fast path.  But we
probably should just move the slow path ops out of line anyway,
so I'm not worried about it.  This has survived some very basic
testing on x86, and really needs to be cleaned up and split into
multiple patches..

The other nice thing this would allow is removing dma_direct_ops
entirely, which means we can simplify a few things even further.

This patch is relative to what I sent out before and the git tree,
and survives some very basic x86 testing.
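
For reference, with NULL as the bypass indicator the fast-path test in the
dma-mapping wrappers reduces to a plain pointer check.  A minimal sketch of the
helper (name as in the earlier patch; the exact form in the tree may differ):

static inline bool dma_is_direct(const struct dma_map_ops *ops)
{
	/* no dma_map_ops installed means "use the direct mapping" */
	return likely(!ops);
}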

---
From 9318145675791833f752cba22484a0b4b4387f1b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Thu, 6 Dec 2018 10:52:35 -0800
Subject: dma-direct: use NULL as the bypass indicator

Signed-off-by: Christoph Hellwig 
---
 arch/alpha/include/asm/dma-mapping.h |   2 +-
 arch/arm/include/asm/dma-mapping.h   |   2 +-
 arch/arm/mm/dma-mapping-nommu.c  |  12 +--
 arch/arm64/include/asm/dma-mapping.h |   8 +-
 arch/arm64/mm/dma-mapping.c  |  89 ---
 arch/ia64/hp/common/hwsw_iommu.c |   2 +-
 arch/ia64/hp/common/sba_iommu.c  |   4 +-
 arch/ia64/kernel/dma-mapping.c   |   1 -
 arch/ia64/kernel/pci-dma.c   |   2 +-
 arch/mips/include/asm/dma-mapping.h  |   2 +-
 arch/parisc/kernel/setup.c   |   4 -
 arch/sparc/include/asm/dma-mapping.h |   4 +-
 arch/x86/kernel/pci-dma.c|   2 +-
 drivers/base/platform.c  |   2 +
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c  |   2 +-
 drivers/iommu/amd_iommu.c|  13 +--
 drivers/pci/controller/vmd.c |   4 +-
 include/asm-generic/dma-mapping.h|   2 +-
 include/linux/dma-direct.h   |   6 --
 include/linux/dma-mapping.h  | 125 ++-
 include/linux/dma-noncoherent.h  |   7 --
 kernel/dma/Kconfig   |   2 +-
 kernel/dma/direct.c  |   3 +
 23 files changed, 92 insertions(+), 208 deletions(-)

diff --git a/arch/alpha/include/asm/dma-mapping.h b/arch/alpha/include/asm/dma-mapping.h
index 8beeafd4f68e..c9203468d4fe 100644
--- a/arch/alpha/include/asm/dma-mapping.h
+++ b/arch/alpha/include/asm/dma-mapping.h
@@ -7,7 +7,7 @@ extern const struct dma_map_ops alpha_pci_ops;
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 {
 #ifdef CONFIG_ALPHA_JENSEN
-   return &dma_direct_ops;
+   return NULL; /* use direct mapping */
 #else
   return &alpha_pci_ops;
 #endif
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 965b7c846ecb..31d3b96f0f4b 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -18,7 +18,7 @@ extern const struct dma_map_ops arm_coherent_dma_ops;
 
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 {
-   return IS_ENABLED(CONFIG_MMU) ? &arm_dma_ops : &dma_direct_ops;
+   return IS_ENABLED(CONFIG_MMU) ? &arm_dma_ops : NULL;
 }
 
 #ifdef __arch_page_to_dma
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 712416ecd8e6..378763003079 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -209,16 +209,9 @@ const struct dma_map_ops arm_nommu_dma_ops = {
 };
 EXPORT_SYMBOL(arm_nommu_dma_ops);
 
-static const struct dma_map_ops *arm_nommu_get_dma_map_ops(bool coherent)
-{
-   return coherent ? &dma_direct_ops : &arm_nommu_dma_ops;
-}
-
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
const struct iommu_ops *iommu, bool coherent)
 {
-   const struct dma_map_ops *dma_ops;
-
if (IS_ENABLED(CONFIG_CPU_V7M)) {
/*
 * Cache support for v7m is optional, so can be treated as
@@ -234,7 +227,6 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
dev->archdata.dma_coherent = (get_cr() & CR_M) ? coherent : true;
}
 
-   dma_ops = arm_nommu_get_dma_map_ops(dev->archdata.dma_coherent);
-
-   set_dma_ops(dev, dma_ops);
+   if (!dev->archdata.dma_coherent)
+   set_dma_ops(dev, &arm_nommu_dma_ops);
 }
diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index c41f3fb1446c..95dbf3ef735a 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -24,15 +24,9 @@
 #include 
 #include 
 
-extern const struct dma_map_ops dummy_dma_ops;
-
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 {
-   /*
-* We expect no ISA devices, and all other DMA masters are expected to
-* have someone call arch_setup_dma_ops at device creation time.

Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Robin Murphy

On 06/12/2018 18:43, Christoph Hellwig wrote:

On Thu, Dec 06, 2018 at 10:28:22AM -0800, Linus Torvalds wrote:

which is counterproductive because it checks for the fast case *after*
you've done a load (and a check) that is entirely pointless for the
fast case.

Put another way, you made the fast case unnecessarily slow.

If this fast case matters (and the whole point of the series is that
it *does* matter), then you should make sure that

  (a) the dma_is_direct() function/macro uses "likely()" for the test


I'm a little worried about that.  Yes, for the common case it is
likely.  But if you run a setup where you, say, always have an IOMMU,
it is not; in fact it is never called in that case, but we only know
that at runtime.


  (b) the code is organized like this instead:

+   if (dma_is_direct(ops))
+   dma_direct_unmap_page(dev, addr, size, dir, attrs);
+   else if (ops->unmap_page)
+   ops->unmap_page(dev, addr, size, dir, attrs);

because now you avoid that unnecessary conditional for the common
case, and you hopefully never even access the pointer (because you
know what's behind it: the direct ops cases that you're
short-circuiting).


Agreed on that, I've fixed it up now (current patch attached below).


In fact, as a further micro-optimization it might be a good idea to
just specify that the dma_is_direct() ops is a special pointer
(perhaps even just say that "NULL means it's direct"), because that
then makes the fast-case test much simpler (avoids a whole nasty
constant load, and testing for NULL in particular is often much
better).


So I wanted to do the NULL case, and it would be much nicer.  But the
arm folks went to great lengths to make sure they don't have a default
set of dma ops and require it to be explicitly set on every device
to catch cases where people don't set things up properly, and I didn't
want to piss them off.  But maybe I should just go for it and see who
screams, as the benefit is pretty obvious.


I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at 
the point we detected the ACPI properties are wrong - that shouldn't be 
too much of a headache to go back to.


Robin.



---
 From 89cc8ab03798177339ad916efabd320e792ee034 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Mon, 3 Dec 2018 16:49:29 +0100
Subject: dma-mapping: bypass indirect calls for dma-direct

Avoid expensive indirect calls in the fast path DMA mapping
operations by directly calling the dma_direct_* ops if we are using
the directly mapped DMA operations.

Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-direct.h  |  17 --
  include/linux/dma-mapping.h | 110 
  kernel/dma/direct.c |  13 +++--
  3 files changed, 106 insertions(+), 34 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 3b0a3ea3876d..b7338702592a 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -60,22 +60,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr,
  struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs);
  void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page);
-dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
-   unsigned long offset, size_t size, enum dma_data_direction dir,
-   unsigned long attrs);
-void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
-   size_t size, enum dma_data_direction dir, unsigned long attrs);
-int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
-   enum dma_data_direction dir, unsigned long attrs);
-void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void dma_direct_sync_single_for_device(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir);
-void dma_direct_sync_sg_for_device(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir);
-void dma_direct_sync_single_for_cpu(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir);
-void dma_direct_sync_sg_for_cpu(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir);
  int dma_direct_supported(struct device *dev, u64 mask);
  #endif /* _LINUX_DMA_DIRECT_H */
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 8916499d2805..98e4dd67c906 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -221,6 +221,69 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
  }
  #endif
  
+static inline bool dma_is_direct(const struct dma_map_ops *ops)
+{
+   return IS_ENABLED(CONFIG_DMA_DIRECT_OPS) && ops == &dma_direct_ops;

Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 10:43 AM Christoph Hellwig  wrote:
>
> >
> >  (a) the dma_is_direct() function/macro uses "likely()" for the test
>
> I'm a little worried about that.  Yes, for the common case it is
> likely.  But if you run a setup where you, say, always have an IOMMU,
> it is not; in fact it is never called in that case, but we only know
> that at runtime.

Note that "likely()" doesn't have any really huge overhead - it just
makes the compiler move the unlikely case out-of-line.

Compared to the overhead of the indirect branch, it's simply not a
huge deal, it's more a mispredict and cache layout issue.

So marking something "likely()" when it isn't doesn't really penalize
things too much. It's not like an exception or anything like that,
it's really just a marker for better code layout.
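
As a concrete illustration, marking the direct-mapping test likely() keeps the
straight-line path free of the indirect call.  A sketch along the lines of the
patch in this thread (the function and helper names follow that patch; this is
not the exact hunk):

static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
		size_t size, enum dma_data_direction dir)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (likely(dma_is_direct(ops)))
		/* fast case stays inline and falls straight through */
		dma_direct_sync_single_for_cpu(dev, addr, size, dir);
	else if (ops->sync_single_for_cpu)
		/* unlikely branch; the compiler may move it out of line */
		ops->sync_single_for_cpu(dev, addr, size, dir);
}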

  Linus


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 10:29:35AM -0800, Nadav Amit wrote:
> Did you happen to see my RFC for "automatic" indirect call promotion? 
> 
>   https://lkml.org/lkml/2018/10/18/175
> 
> I hope to get v1 based on Josh responses next week.

I'll take a look when you repost it.  Although I'm always a little wary
of under-the-hood magic like this.


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 10:28:22AM -0800, Linus Torvalds wrote:
> which is counterproductive because it checks for the fast case *after*
> you've done a load (and a check) that is entirely pointless for the
> fast case.
> 
> Put another way, you made the fast case unnecessarily slow.
> 
> If this fast case matters (and the whole point of the series is that
> it *does* matter), then you should make sure that
> 
>  (a) the dma_is_direct() function/macro uses "likely()" for the test

I'm a little worried about that.  Yes, for the common case it is
likely.  But if you run a setup where you, say, always have an IOMMU,
it is not; in fact it is never called in that case, but we only know
that at runtime.

>  (b) the code is organized like this instead:
> 
> +   if (dma_is_direct(ops))
> +   dma_direct_unmap_page(dev, addr, size, dir, attrs);
> +   else if (ops->unmap_page)
> +   ops->unmap_page(dev, addr, size, dir, attrs);
> 
> because now you avoid that unnecessary conditional for the common
> case, and you hopefully never even access the pointer (because you
> know what's behind it: the direct ops cases that you're
> short-circuiting).

Agreed on that, I've fixed it up now (current patch attached below).

> In fact, as a further micro-optimization it might be a good idea to
> just specify that the dma_is_direct() ops is a special pointer
> (perhaps even just say that "NULL means it's direct"), because that
> then makes the fast-case test much simpler (avoids a whole nasty
> constant load, and testing for NULL in particular is often much
> better).

So I wanted to do the NULL case, and it would be much nicer.  But the
arm folks went to great lengths to make sure they don't have a default
set of dma ops and require it to be explicitly set on every device
to catch cases where people don't set things up properly, and I didn't
want to piss them off.  But maybe I should just go for it and see who
screams, as the benefit is pretty obvious.

---
From 89cc8ab03798177339ad916efabd320e792ee034 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Mon, 3 Dec 2018 16:49:29 +0100
Subject: dma-mapping: bypass indirect calls for dma-direct

Avoid expensive indirect calls in the fast path DMA mapping
operations by directly calling the dma_direct_* ops if we are using
the directly mapped DMA operations.

Signed-off-by: Christoph Hellwig 
---
 include/linux/dma-direct.h  |  17 --
 include/linux/dma-mapping.h | 110 
 kernel/dma/direct.c |  13 +++--
 3 files changed, 106 insertions(+), 34 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 3b0a3ea3876d..b7338702592a 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -60,22 +60,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr,
 struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs);
 void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page);
-dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
-   unsigned long offset, size_t size, enum dma_data_direction dir,
-   unsigned long attrs);
-void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
-   size_t size, enum dma_data_direction dir, unsigned long attrs);
-int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
-   enum dma_data_direction dir, unsigned long attrs);
-void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void dma_direct_sync_single_for_device(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir);
-void dma_direct_sync_sg_for_device(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir);
-void dma_direct_sync_single_for_cpu(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir);
-void dma_direct_sync_sg_for_cpu(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir);
 int dma_direct_supported(struct device *dev, u64 mask);
 #endif /* _LINUX_DMA_DIRECT_H */
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 8916499d2805..98e4dd67c906 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -221,6 +221,69 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+static inline bool dma_is_direct(const struct dma_map_ops *ops)
+{
+   return IS_ENABLED(CONFIG_DMA_DIRECT_OPS) && ops == &dma_direct_ops;
+}
+
+/*
+ * All the dma_direct_* declarations are here just for the indirect call bypass,
+ * and must not be used directly drivers!
+ */
+dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
+   unsigned long offset, size_t size, enum dma_data_direction dir,
+   unsigned long attrs);

Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 10:28 AM Linus Torvalds
 wrote:
>
> Put another way, you made the fast case unnecessarily slow.

Side note: the code seems to be a bit confused about it, because
*some* cases test the fast case first, and some do it after they've
already accessed the pointer for the slow case.

So even aside from the performance and code generation issue (and
possible future "use a special bit pattern for the fast case"), it
would be good for _consistency_ to just always do the fast-case test
first.
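
A hedged illustration of the two orderings being compared (reconstructed shape,
not the exact hunks from the patch):

	/* Slow-case pointer loaded and tested even when the direct fast
	 * path will be taken: */
	if (ops->unmap_page) {
		if (dma_is_direct(ops))
			dma_direct_unmap_page(dev, addr, size, dir, attrs);
		else
			ops->unmap_page(dev, addr, size, dir, attrs);
	}

	/* Consistent fast-case-first form suggested above: */
	if (dma_is_direct(ops))
		dma_direct_unmap_page(dev, addr, size, dir, attrs);
	else if (ops->unmap_page)
		ops->unmap_page(dev, addr, size, dir, attrs);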

Linus


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Nadav Amit
> On Dec 6, 2018, at 9:43 AM, Jesper Dangaard Brouer  wrote:
> 
> On Thu,  6 Dec 2018 07:37:19 -0800
> Christoph Hellwig  wrote:
> 
>> Hi all,
>> 
>> a while ago Jesper reported major performance regressions due to the
>> spectre v2 mitigations in his XDP forwarding workloads.  A large part
>> of that is due to the DMA mapping API indirect calls.
>> 
>> It turns out that the most common implementation of the DMA API is the
>> direct mapping case, and now that we have merged almost all duplicate
>> implementations of that into a single generic one, it is easily feasible to
>> use direct calls for this fast path.
>> 
>> This patch adds a check if we are using dma_direct_ops in each fast path
>> DMA operation, and just uses a direct call instead.  For the XDP workload
>> this increases the number of packets per second from 7,438,283 to
>> 9,610,088, so it provides a very significant speedup.
> 
> Full test report available here:
> https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org
> 
> 
>> Note that the patch depends on a lot of work either queued up in the
>> DMA mapping tree, or still out on the list from review, so to actually
>> try the patch you probably want this git tree:
>> 
>> 
>>git://git.infradead.org/users/hch/misc.git dma-direct-calls
>> 
>> Gitweb:
>> 
>>
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls

Did you happen to see my RFC for "automatic" indirect call promotion? 

https://lkml.org/lkml/2018/10/18/175

I hope to get v1 based on Josh responses next week.



Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Jesper Dangaard Brouer
On Thu,  6 Dec 2018 07:37:19 -0800
Christoph Hellwig  wrote:

> Hi all,
> 
> a while ago Jesper reported major performance regressions due to the
> spectre v2 mitigations in his XDP forwarding workloads.  A large part
> of that is due to the DMA mapping API indirect calls.
> 
> It turns out that the most common implementation of the DMA API is the
> direct mapping case, and now that we have merged almost all duplicate
> implementations of that into a single generic one, it is easily feasible to
> use direct calls for this fast path.
> 
> This patch adds a check if we are using dma_direct_ops in each fast path
> DMA operation, and just uses a direct call instead.  For the XDP workload
> this increases the number of packets per second from 7,438,283 to
> 9,610,088, so it provides a very significant speedup.

Full test report available here:
 
https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org


> Note that the patch depends on a lot of work either queued up in the
> DMA mapping tree, or still out on the list from review, so to actually
> try the patch you probably want this git tree:
> 
> 
> git://git.infradead.org/users/hch/misc.git dma-direct-calls
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer