Re: [RFC] avoid indirect calls for DMA direct mappings v2
I've pulled v2 with the ia64 into dma-mapping for-next. This should give us a little more than a week in linux-next to sort out any issues. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings v2
On Tue, Dec 11, 2018 at 05:13:30PM +, Luck, Tony wrote: > > But that might not be your fault. My ancient system is getting flaky. A > > v4.19 build that > > has booted before is also resetting :-( > > After a power-cycle (and some time to let the machine cool off). System now > boots > with your patch series plus the __phys_to_pfn() #define > > So if you can figure the right way to fix that, you are good to go. > > Tested-by: Tony Luck Thanks. I'll just replace the __phys_to_pfn with PHYS_PFN for now, and see if I can find time to get the whole kernel to agree to one version of this macro eventually.. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] avoid indirect calls for DMA direct mappings v2
> But that might not be your fault. My ancient system is getting flaky. A v4.19 > build that > has booted before is also resetting :-( After a power-cycle (and some time to let the machine cool off). System now boots with your patch series plus the __phys_to_pfn() #define So if you can figure the right way to fix that, you are good to go. Tested-by: Tony Luck ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] avoid indirect calls for DMA direct mappings v2
> This should fix it: ... > +#include Not quite. Still have an issue with __phys_to_pfn(paddr) Trying ti #include gave we a raft of redefined macros. So I just added #define __phys_to_pfn(paddr)PHYS_PFN(paddr) to arch/ia64/mm/init.c That made the build work. But boot spontaneously resets after: mptsas: ioc1: attaching ssp device: fw_channel 0, fw_id 6, phy 6, sas_addr 0x5000c5000ecada69 scsi 5:0:0:0: Direct-Access SEAGATE ST9146802SS 0003 PQ: 0 ANSI: 5 EFI Variables Facility v0.08 2004-May-17 sd 5:0:0:0: [sdb] 286749488 512-byte logical blocks: (147 GB/137 GiB) sd 5:0:0:0: [sdb] Write Protect is off sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA sdb: sdb1 sdb2 sd 5:0:0:0: [sdb] Attached SCSI disk But that might not be your fault. My ancient system is getting flaky. A v4.19 build that has booted before is also resetting :-( -Tony ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings v2
On Mon, Dec 10, 2018 at 01:51:13PM -0800, Luck, Tony wrote: > But the ia64 build fails with: Yes, I just got the same complaint form the buildbot, unfortunately I don't have a good ia64 cross compiler locally given that Debian is lacking one, and the one provided by the buildbot doesn't build on Debian either.. This should fix it: diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index 2c51733f1dfd..a007afaa556c 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -8,6 +8,7 @@ #include #include +#include #include #include #include ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings v2
On Fri, Dec 07, 2018 at 11:07:05AM -0800, Christoph Hellwig wrote: > This works is based on the dma-mapping tree, so you probably want to > want this git tree for testing: > > git://git.infradead.org/users/hch/misc.git dma-direct-calls.2 Pulled this tree. Got HEAD 33b9fc015171 ("dma-mapping: bypass indirect calls for dma-direct") But the ia64 build fails with: arch/ia64/mm/init.c:75:21: warning: 'enum dma_data_direction' declared inside parameter list [enabled by default] arch/ia64/mm/init.c:75:21: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] arch/ia64/mm/init.c:75:40: error: parameter 4 ('dir') has incomplete type arch/ia64/mm/init.c:74:6: error: function declaration isn't a prototype [-Werror=strict-prototypes] arch/ia64/mm/init.c: In function 'arch_sync_dma_for_cpu': arch/ia64/mm/init.c:77:2: error: implicit declaration of function '__phys_to_pfn' [-Werror=implicit-function-declaration] -Tony ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings v2
On Fri, 7 Dec 2018 11:07:05 -0800 Christoph Hellwig wrote: > Hi all, > > a while ago Jesper reported major performance regressions due to the > spectre v2 mitigations in his XDP forwarding workloads. A large part > of that is due to the DMA mapping API indirect calls. > > It turns out that the most common implementation of the DMA API is the > direct mapping case, and now that we have merged almost all duplicate > implementations of that into a single generic one is easily feasily to > direct calls for this fast path. > > This series adds consolidate the DMA mapping code by merging the > swiotlb case into the dma direct case, and then treats NULL dma_ops > as an indicator that that we should directly call the direct mapping > case. This recovers a large part of the retpoline induces XDP slowdown. > > This works is based on the dma-mapping tree, so you probably want to > want this git tree for testing: > > git://git.infradead.org/users/hch/misc.git dma-direct-calls.2 > > Gitweb: > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2 You can add my: Tested-by: Jesper Dangaard Brouer or Acked-by: Jesper Dangaard Brouer I'm very happy that you work on this. And I've done micro-benchmark testing of the patchset (and branch dma-direct-calls), which I've made avail here: https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org My XDP performance is back, minus the BPF-indirect call, and net_rx_action napi->poll, and net_device->ndo_xdp_xmit calls. I verified that manually disabling retpoline for these remaining netstack retpoline-calls restore the performance full (well minus 1.5 nanosec). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Fri, 7 Dec 2018 16:44:35 +0100 Jesper Dangaard Brouer wrote: > On Fri, 7 Dec 2018 02:21:42 +0100 > Christoph Hellwig wrote: > > > On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote: > > > On 06/12/2018 20:00, Christoph Hellwig wrote: > > >> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote: > > >>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at > > >>> the point we detected the ACPI properties are wrong - that shouldn't be > > >>> too > > >>> much of a headache to go back to. > > >> > > >> Ok. I've cooked up a patch to use NULL as the go direct marker. > > >> This cleans up a few things nicely, but also means we now need to > > >> do the bypass scheme for all ops, not just the fast path. But we > > >> probably should just move the slow path ops out of line anyway, > > >> so I'm not worried about it. This has survived some very basic > > >> testing on x86, and really needs to be cleaned up and split into > > >> multiple patches.. > > > > > > I've also just finished hacking something up to keep the arm64 status quo > > > - > > > I'll need to actually test it tomorrow, but the overall diff looks like > > > the > > > below. > > > > Nice. I created a branch that picked up your bits and also the ideas > > from Linus, and the result looks reall nice. I'll still need a signoff > > for your bits, though. > > > > Jesper, can you give this a spin if it changes the number even further? > > > > git://git.infradead.org/users/hch/misc.git dma-direct-calls.2 > > > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2 > > > > I'll test it soon... > > I looked at my perf stat recording on my existing tests[1] and there > seems to be significantly more I-cache usage. The I-cache pressure seems to be lower with the new branch, and performance improved with 4.5 nanosec. > Copy-paste from my summary[1]: > [1] > https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org#summary-of-results Updated: * Summary of results Using XDP_REDIRECT between drivers RX ixgbe(10G) redirect TX i40e(40G), via BPF devmap (used samples/bpf/xdp_redirect_map) . (Note choose higher TX link-speed to assure that we don't to have a TX bottleneck). The baseline-kernel is at commit [[https://git.kernel.org/torvalds/c/ef78e5ec9214][ef78e5ec9214]], which is commit just before Hellwigs changes in this tree. Performance numbers in packets/sec (XDP_REDIRECT ixgbe -> i40e): - 11913154 (11,913,154) pps - baseline compiled without retpoline - 7438283 (7,438,283) pps - regression due to CONFIG_RETPOLINE - 9610088 (9,610,088) pps - mitigation via Hellwig dma-direct-calls - 10049223 (10,049,223) pps - Hellwig branch dma-direct-calls.2 Do notice at these extreme speeds the pps number increase rabbit with small changes, e.g. difference to new branch is: - (1/9610088-1/10049223)*10^9 = 4.54 nanosec faster >From the inst per cycle, it is clear that retpolines are stalling the CPU pipeline: | pps| insn per cycle | |+| | 11,913,154 | 2.39 | | 7,438,283 | 1.54 | | 9,610,088 | 2.04 | | 10,049,223 | 1.99 | ||| Strangely the Instruction-Cache is also under heavier pressure: | pps| l2_rqsts.all_code_rd | l2_rqsts.code_rd_hit | l2_rqsts.code_rd_miss | |+--+--+---| | 11,913,154 | 874,547 | 742,335 | 132,198 | | 7,438,283 | 649,513 | 547,581 | 101,945 | | 9,610,088 | 2,568,064| 2,001,369| 566,683 | | 10,049,223 | 1,232,818| 1,152,514| 80,299 | || | | | -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Fri, 7 Dec 2018 02:21:42 +0100 Christoph Hellwig wrote: > On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote: > > On 06/12/2018 20:00, Christoph Hellwig wrote: > >> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote: > >>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at > >>> the point we detected the ACPI properties are wrong - that shouldn't be > >>> too > >>> much of a headache to go back to. > >> > >> Ok. I've cooked up a patch to use NULL as the go direct marker. > >> This cleans up a few things nicely, but also means we now need to > >> do the bypass scheme for all ops, not just the fast path. But we > >> probably should just move the slow path ops out of line anyway, > >> so I'm not worried about it. This has survived some very basic > >> testing on x86, and really needs to be cleaned up and split into > >> multiple patches.. > > > > I've also just finished hacking something up to keep the arm64 status quo - > > I'll need to actually test it tomorrow, but the overall diff looks like the > > below. > > Nice. I created a branch that picked up your bits and also the ideas > from Linus, and the result looks reall nice. I'll still need a signoff > for your bits, though. > > Jesper, can you give this a spin if it changes the number even further? > > git://git.infradead.org/users/hch/misc.git dma-direct-calls.2 > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2 I'll test it soon... I looked at my perf stat recording on my existing tests[1] and there seems to be significantly more I-cache usage. Copy-paste from my summary[1]: [1] https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org#summary-of-results * Summary of results Using XDP_REDIRECT between drivers RX ixgbe(10G) redirect TX i40e(40G), via BPF devmap (used samples/bpf/xdp_redirect_map) . (Note choose higher TX link-speed to assure that we don't to have a TX bottleneck). The baseline-kernel is at commit https://git.kernel.org/torvalds/c/ef78e5ec9214, which is commit just before Hellwigs changes in this tree. Performance numbers in packets/sec (XDP_REDIRECT ixgbe -> i40e): - 11913154 (11,913,154) pps - baseline compiled without retpoline - 7438283 (7,438,283) pps - regression due to CONFIG_RETPOLINE - 9610088 (9,610,088) pps - mitigation via Hellwig dma-direct-calls >From the inst per cycle, it is clear that retpolines are stalling the CPU pipeline: | pps| insn per cycle | |+| | 11,913,154 | 2.39 | | 7,438,283 | 1.54 | | 9,610,088 | 2.04 | Strangely the Instruction-Cache is also under heavier pressure: | pps| l2_rqsts.all_code_rd | l2_rqsts.code_rd_hit | l2_rqsts.code_rd_miss | |+--+--+---| | 11,913,154 | 874,547 | 742,335 | 132,198 | | 7,438,283 | 649,513 | 547,581 | 101,945 | | 9,610,088 | 2,568,064| 2,001,369| 566,683 | || | | | -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 06, 2018 at 08:24:38PM +, Robin Murphy wrote: > On 06/12/2018 20:00, Christoph Hellwig wrote: >> On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote: >>> I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at >>> the point we detected the ACPI properties are wrong - that shouldn't be too >>> much of a headache to go back to. >> >> Ok. I've cooked up a patch to use NULL as the go direct marker. >> This cleans up a few things nicely, but also means we now need to >> do the bypass scheme for all ops, not just the fast path. But we >> probably should just move the slow path ops out of line anyway, >> so I'm not worried about it. This has survived some very basic >> testing on x86, and really needs to be cleaned up and split into >> multiple patches.. > > I've also just finished hacking something up to keep the arm64 status quo - > I'll need to actually test it tomorrow, but the overall diff looks like the > below. Nice. I created a branch that picked up your bits and also the ideas from Linus, and the result looks reall nice. I'll still need a signoff for your bits, though. Jesper, can you give this a spin if it changes the number even further? git://git.infradead.org/users/hch/misc.git dma-direct-calls.2 http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On 06/12/2018 20:00, Christoph Hellwig wrote: On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote: I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at the point we detected the ACPI properties are wrong - that shouldn't be too much of a headache to go back to. Ok. I've cooked up a patch to use NULL as the go direct marker. This cleans up a few things nicely, but also means we now need to do the bypass scheme for all ops, not just the fast path. But we probably should just move the slow path ops out of line anyway, so I'm not worried about it. This has survived some very basic testing on x86, and really needs to be cleaned up and split into multiple patches.. I've also just finished hacking something up to keep the arm64 status quo - I'll need to actually test it tomorrow, but the overall diff looks like the below. Robin. ->8- diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index 0626c3a93730..fe6287f68adb 100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -19,15 +19,13 @@ #include #include -extern const struct dma_map_ops dummy_dma_ops; - static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { /* * We expect no ISA devices, and all other DMA masters are expected to * have someone call arch_setup_dma_ops at device creation time. */ - return _dma_ops; + return _dummy_ops; } void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 17f8fc784de2..574aee2d04e7 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -255,98 +255,6 @@ static int __init atomic_pool_init(void) return -ENOMEM; } -/ - * The following APIs are for dummy DMA ops * - / - -static void *__dummy_alloc(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flags, - unsigned long attrs) -{ - return NULL; -} - -static void __dummy_free(struct device *dev, size_t size, -void *vaddr, dma_addr_t dma_handle, -unsigned long attrs) -{ -} - -static int __dummy_mmap(struct device *dev, - struct vm_area_struct *vma, - void *cpu_addr, dma_addr_t dma_addr, size_t size, - unsigned long attrs) -{ - return -ENXIO; -} - -static dma_addr_t __dummy_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, - unsigned long attrs) -{ - return 0; -} - -static void __dummy_unmap_page(struct device *dev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) -{ -} - -static int __dummy_map_sg(struct device *dev, struct scatterlist *sgl, - int nelems, enum dma_data_direction dir, - unsigned long attrs) -{ - return 0; -} - -static void __dummy_unmap_sg(struct device *dev, -struct scatterlist *sgl, int nelems, -enum dma_data_direction dir, -unsigned long attrs) -{ -} - -static void __dummy_sync_single(struct device *dev, - dma_addr_t dev_addr, size_t size, - enum dma_data_direction dir) -{ -} - -static void __dummy_sync_sg(struct device *dev, - struct scatterlist *sgl, int nelems, - enum dma_data_direction dir) -{ -} - -static int __dummy_mapping_error(struct device *hwdev, dma_addr_t dma_addr) -{ - return 1; -} - -static int __dummy_dma_supported(struct device *hwdev, u64 mask) -{ - return 0; -} - -const struct dma_map_ops dummy_dma_ops = { - .alloc = __dummy_alloc, - .free = __dummy_free, - .mmap = __dummy_mmap, - .map_page = __dummy_map_page, - .unmap_page = __dummy_unmap_page, - .map_sg = __dummy_map_sg, - .unmap_sg = __dummy_unmap_sg, - .sync_single_for_cpu= __dummy_sync_single, - .sync_single_for_device = __dummy_sync_single, - .sync_sg_for_cpu= __dummy_sync_sg, - .sync_sg_for_device = __dummy_sync_sg, - .mapping_error = __dummy_mapping_error, - .dma_supported = __dummy_dma_supported, -}; -EXPORT_SYMBOL(dummy_dma_ops); - static int __init arm64_dma_init(void) { WARN_TAINT(ARCH_DMA_MINALIGN < cache_line_size(), diff --git
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 06, 2018 at 06:54:17PM +, Robin Murphy wrote: > I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at > the point we detected the ACPI properties are wrong - that shouldn't be too > much of a headache to go back to. Ok. I've cooked up a patch to use NULL as the go direct marker. This cleans up a few things nicely, but also means we now need to do the bypass scheme for all ops, not just the fast path. But we probably should just move the slow path ops out of line anyway, so I'm not worried about it. This has survived some very basic testing on x86, and really needs to be cleaned up and split into multiple patches.. The other nice thing this would allow is removing dma_direct_ops entirely, which means we can simplify a few things even further. This patch is relative to what I sent out before and the git tree, and survives some very basic x86 testing. --- >From 9318145675791833f752cba22484a0b4b4387f1b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 6 Dec 2018 10:52:35 -0800 Subject: dma-direct: use NULL as the bypass indicator Signed-off-by: Christoph Hellwig --- arch/alpha/include/asm/dma-mapping.h | 2 +- arch/arm/include/asm/dma-mapping.h | 2 +- arch/arm/mm/dma-mapping-nommu.c | 12 +-- arch/arm64/include/asm/dma-mapping.h | 8 +- arch/arm64/mm/dma-mapping.c | 89 --- arch/ia64/hp/common/hwsw_iommu.c | 2 +- arch/ia64/hp/common/sba_iommu.c | 4 +- arch/ia64/kernel/dma-mapping.c | 1 - arch/ia64/kernel/pci-dma.c | 2 +- arch/mips/include/asm/dma-mapping.h | 2 +- arch/parisc/kernel/setup.c | 4 - arch/sparc/include/asm/dma-mapping.h | 4 +- arch/x86/kernel/pci-dma.c| 2 +- drivers/base/platform.c | 2 + drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 2 +- drivers/iommu/amd_iommu.c| 13 +-- drivers/pci/controller/vmd.c | 4 +- include/asm-generic/dma-mapping.h| 2 +- include/linux/dma-direct.h | 6 -- include/linux/dma-mapping.h | 125 ++- include/linux/dma-noncoherent.h | 7 -- kernel/dma/Kconfig | 2 +- kernel/dma/direct.c | 3 + 23 files changed, 92 insertions(+), 208 deletions(-) diff --git a/arch/alpha/include/asm/dma-mapping.h b/arch/alpha/include/asm/dma-mapping.h index 8beeafd4f68e..c9203468d4fe 100644 --- a/arch/alpha/include/asm/dma-mapping.h +++ b/arch/alpha/include/asm/dma-mapping.h @@ -7,7 +7,7 @@ extern const struct dma_map_ops alpha_pci_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { #ifdef CONFIG_ALPHA_JENSEN - return _direct_ops; + return NULL; /* use direct mapping */ #else return _pci_ops; #endif diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index 965b7c846ecb..31d3b96f0f4b 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -18,7 +18,7 @@ extern const struct dma_map_ops arm_coherent_dma_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { - return IS_ENABLED(CONFIG_MMU) ? _dma_ops : _direct_ops; + return IS_ENABLED(CONFIG_MMU) ? _dma_ops : NULL; } #ifdef __arch_page_to_dma diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c index 712416ecd8e6..378763003079 100644 --- a/arch/arm/mm/dma-mapping-nommu.c +++ b/arch/arm/mm/dma-mapping-nommu.c @@ -209,16 +209,9 @@ const struct dma_map_ops arm_nommu_dma_ops = { }; EXPORT_SYMBOL(arm_nommu_dma_ops); -static const struct dma_map_ops *arm_nommu_get_dma_map_ops(bool coherent) -{ - return coherent ? _direct_ops : _nommu_dma_ops; -} - void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent) { - const struct dma_map_ops *dma_ops; - if (IS_ENABLED(CONFIG_CPU_V7M)) { /* * Cache support for v7m is optional, so can be treated as @@ -234,7 +227,6 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, dev->archdata.dma_coherent = (get_cr() & CR_M) ? coherent : true; } - dma_ops = arm_nommu_get_dma_map_ops(dev->archdata.dma_coherent); - - set_dma_ops(dev, dma_ops); + if (!dev->archdata.dma_coherent) + set_dma_ops(dev, _nommu_dma_ops); } diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index c41f3fb1446c..95dbf3ef735a 100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -24,15 +24,9 @@ #include #include -extern const struct dma_map_ops dummy_dma_ops; - static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { - /* -* We expect no ISA devices, and all other DMA masters are expected to -* have someone call
Re: [RFC] avoid indirect calls for DMA direct mappings
On 06/12/2018 18:43, Christoph Hellwig wrote: On Thu, Dec 06, 2018 at 10:28:22AM -0800, Linus Torvalds wrote: which is counterproductive because it checks for the fast case *after* you've done a load (and a check) that is entirely pointless for the fast case. Put another way, you made the fast case unnecessarily slow. If this fast case matters (and the whole point of the series is that it *does* matter), then you should make sure that (a) the dma_is_direct() function/macro uses "likely()" for the test I'm a little worried about that. Yes, for the common case it is likely. But if you run a setup where you say always have an iommu it is not, in fact it is never called in that case, but we only know that at runtime. (b) the code is organized like this instead: + if (dma_is_direct(ops)) + dma_direct_unmap_page(dev, addr, size, dir, attrs); + else if (ops->unmap_page) + ops->unmap_page(dev, addr, size, dir, attrs); because now you avoid that unnecessary conditional for the common case, and you hopefully never even access the pointer (because you know what's behind it: the direct ops cases that you're short-circuiting). Agreed on that, I've fixed it up now (current patch attached below). In fact, as a further micro-optimization it might be a good idea to just specify that the dma_is_direct() ops is a special pointer (perhaps even just say that "NULL means it's direct"), because that then makes the fast-case test much simpler (avoids a whole nasty constant load, and testing for NULL in particular is often much better). So I wanted to do the NULL case, and it would be much nicer. But the arm folks went to great length to make sure they don't have a default set of dma ops and require it to be explicitly set on every device to catch cases where people don't set things up properly and I didn't want to piss them off.. But maybe I should just go for it and see who screams, as the benefit is pretty obvious. I'm pretty sure we used to assign dummy_dma_ops explicitly to devices at the point we detected the ACPI properties are wrong - that shouldn't be too much of a headache to go back to. Robin. --- From 89cc8ab03798177339ad916efabd320e792ee034 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 16:49:29 +0100 Subject: dma-mapping: bypass indirect calls for dma-direct Avoid expensive indirect calls in the fast path DMA mapping operations by directly calling the dma_direct_* ops if we are using the directly mapped DMA operations. Signed-off-by: Christoph Hellwig --- include/linux/dma-direct.h | 17 -- include/linux/dma-mapping.h | 110 kernel/dma/direct.c | 13 +++-- 3 files changed, 106 insertions(+), 34 deletions(-) diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 3b0a3ea3876d..b7338702592a 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -60,22 +60,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page); -dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs); -void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, enum dma_data_direction dir, unsigned long attrs); -int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, - enum dma_data_direction dir, unsigned long attrs); -void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, unsigned long attrs); -void dma_direct_sync_single_for_device(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_device(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); -void dma_direct_sync_single_for_cpu(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_cpu(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 8916499d2805..98e4dd67c906 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -221,6 +221,69 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +static inline bool dma_is_direct(const struct dma_map_ops *ops) +{ + return IS_ENABLED(CONFIG_DMA_DIRECT_OPS) && ops ==
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 6, 2018 at 10:43 AM Christoph Hellwig wrote: > > > > > (a) the dma_is_direct() function/macro uses "likely()" for the test > > I'm a little worried about that. Yes, for the common case it is > likely. But if you run a setup where you say always have an iommu > it is not, in fact it is never called in that case, but we only know > that at runtime. Note that "likely()" doesn't have any really huge overhead - it just makes the compiler move the unlikely case out-of-line. Compared to the overhead of the indirect branch, it's simply not a huge deal, it's more a mispredict and cache layout issue. So marking something "likely()" when it isn't doesn't really penalize things too much. It's not like an exception or anything like that, it's really just a marker for better code layout. Linus ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 06, 2018 at 10:29:35AM -0800, Nadav Amit wrote: > Did you happen to see my RFC for "automatic" indirect call promotion? > > https://lkml.org/lkml/2018/10/18/175 > > I hope to get v1 based on Josh responses next week. I'll take a look when you repost it. Although I'm always a little vary of under the hood magic like this.. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 06, 2018 at 10:28:22AM -0800, Linus Torvalds wrote: > which is counterproductive because it checks for the fast case *after* > you've done a load (and a check) that is entirely pointless for the > fast case. > > Put another way, you made the fast case unnecessarily slow. > > If this fast case matters (and the whole point of the series is that > it *does* matter), then you should make sure that > > (a) the dma_is_direct() function/macro uses "likely()" for the test I'm a little worried about that. Yes, for the common case it is likely. But if you run a setup where you say always have an iommu it is not, in fact it is never called in that case, but we only know that at runtime. > (b) the code is organized like this instead: > > + if (dma_is_direct(ops)) > + dma_direct_unmap_page(dev, addr, size, dir, attrs); > + else if (ops->unmap_page) > + ops->unmap_page(dev, addr, size, dir, attrs); > > because now you avoid that unnecessary conditional for the common > case, and you hopefully never even access the pointer (because you > know what's behind it: the direct ops cases that you're > short-circuiting). Agreed on that, I've fixed it up now (current patch attached below). > In fact, as a further micro-optimization it might be a good idea to > just specify that the dma_is_direct() ops is a special pointer > (perhaps even just say that "NULL means it's direct"), because that > then makes the fast-case test much simpler (avoids a whole nasty > constant load, and testing for NULL in particular is often much > better). So I wanted to do the NULL case, and it would be much nicer. But the arm folks went to great length to make sure they don't have a default set of dma ops and require it to be explicitly set on every device to catch cases where people don't set things up properly and I didn't want to piss them off.. But maybe I should just go for it and see who screams, as the benefit is pretty obvious. --- >From 89cc8ab03798177339ad916efabd320e792ee034 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 3 Dec 2018 16:49:29 +0100 Subject: dma-mapping: bypass indirect calls for dma-direct Avoid expensive indirect calls in the fast path DMA mapping operations by directly calling the dma_direct_* ops if we are using the directly mapped DMA operations. Signed-off-by: Christoph Hellwig --- include/linux/dma-direct.h | 17 -- include/linux/dma-mapping.h | 110 kernel/dma/direct.c | 13 +++-- 3 files changed, 106 insertions(+), 34 deletions(-) diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 3b0a3ea3876d..b7338702592a 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -60,22 +60,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); void __dma_direct_free_pages(struct device *dev, size_t size, struct page *page); -dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs); -void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, enum dma_data_direction dir, unsigned long attrs); -int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, - enum dma_data_direction dir, unsigned long attrs); -void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, unsigned long attrs); -void dma_direct_sync_single_for_device(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_device(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); -void dma_direct_sync_single_for_cpu(struct device *dev, - dma_addr_t addr, size_t size, enum dma_data_direction dir); -void dma_direct_sync_sg_for_cpu(struct device *dev, - struct scatterlist *sgl, int nents, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 8916499d2805..98e4dd67c906 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -221,6 +221,69 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +static inline bool dma_is_direct(const struct dma_map_ops *ops) +{ + return IS_ENABLED(CONFIG_DMA_DIRECT_OPS) && ops == _direct_ops; +} + +/* + * All the dma_direct_* declarations are here just for the indirect call bypass, + * and must not be used directly drivers! + */ +dma_addr_t dma_direct_map_page(struct device *dev, struct page
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, Dec 6, 2018 at 10:28 AM Linus Torvalds wrote: > > Put another way, you made the fast case unnecessarily slow. Side note: the code seems to be a bit confused about it, because *some* cases test the fast case first, and some do it after they've already accessed the pointer for the slow case. So even aside from the performance and code generation issue (and possible future "use a special bit pattern for the fast case"), it would be good for _consistency_ to just always do the fast-case test first. Linus ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
> On Dec 6, 2018, at 9:43 AM, Jesper Dangaard Brouer wrote: > > On Thu, 6 Dec 2018 07:37:19 -0800 > Christoph Hellwig wrote: > >> Hi all, >> >> a while ago Jesper reported major performance regressions due to the >> spectre v2 mitigations in his XDP forwarding workloads. A large part >> of that is due to the DMA mapping API indirect calls. >> >> It turns out that the most common implementation of the DMA API is the >> direct mapping case, and now that we have merged almost all duplicate >> implementations of that into a single generic one is easily feasily to >> direct calls for this fast path. >> >> This patch adds a check if we are using dma_direct_ops in each fast path >> DMA operation, and just uses a direct call instead. For the XDP workload >> this increases the number of packets per second from 7,438,283 to >> 9,610,088, so it provides a very significant speedup. > > Full test report avail here: > https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org > > >> Note that the patch depends on a lot of work either queued up in the >> DMA mapping tree, or still out on the list from review, so to actually >> try the patch you probably want this git tree: >> >> >>git://git.infradead.org/users/hch/misc.git dma-direct-calls >> >> Gitweb: >> >> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls Did you happen to see my RFC for "automatic" indirect call promotion? https://lkml.org/lkml/2018/10/18/175 I hope to get v1 based on Josh responses next week. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] avoid indirect calls for DMA direct mappings
On Thu, 6 Dec 2018 07:37:19 -0800 Christoph Hellwig wrote: > Hi all, > > a while ago Jesper reported major performance regressions due to the > spectre v2 mitigations in his XDP forwarding workloads. A large part > of that is due to the DMA mapping API indirect calls. > > It turns out that the most common implementation of the DMA API is the > direct mapping case, and now that we have merged almost all duplicate > implementations of that into a single generic one is easily feasily to > direct calls for this fast path. > > This patch adds a check if we are using dma_direct_ops in each fast path > DMA operation, and just uses a direct call instead. For the XDP workload > this increases the number of packets per second from 7,438,283 to > 9,610,088, so it provides a very significant speedup. Full test report avail here: https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org > Note that the patch depends on a lot of work either queued up in the > DMA mapping tree, or still out on the list from review, so to actually > try the patch you probably want this git tree: > > > git://git.infradead.org/users/hch/misc.git dma-direct-calls > > Gitweb: > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu