Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
Hi Barry,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.8-rc2 next-20200625]
[cannot apply to arm64/for-next/core hch-configfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Barry-Song/make-dma_alloc_coherent-NUMA-aware-by-per-NUMA-CMA/20200625-154656
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8be3a53e18e0e1a98f288f6c7f5e9da3adbe9c49
config: x86_64-randconfig-s022-20200624 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

sparse warnings: (new ones prefixed by >>)

>> kernel/dma/contiguous.c:283:50: sparse: sparse: invalid access below 'dma_contiguous_pernuma_area' (-8 8)

# https://github.com/0day-ci/linux/commit/d6930169a3364418b985c2d19c31ecf1c4c3d4a9
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout d6930169a3364418b985c2d19c31ecf1c4c3d4a9
vim +/dma_contiguous_pernuma_area +283 kernel/dma/contiguous.c

de9e14eebf33a6 drivers/base/dma-contiguous.c Marek Szyprowski  2014-10-13  253
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  254  /**
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  255   * dma_alloc_contiguous() - allocate contiguous pages
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  256   * @dev:  Pointer to device for which the allocation is performed.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  257   * @size: Requested allocation size.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  258   * @gfp:  Allocation flags.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  259   *
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  260   * This function allocates contiguous memory buffer for specified device. It
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  261   * tries to use device specific contiguous memory area if available, or it
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  262   * tries to use per-numa cma, if the allocation fails, it will fallback to
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  263   * try default global one.
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  264   *
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  265   * Note that it bypass one-page size of allocations from the per-numa and
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  266   * global area as the addresses within one page are always contiguous, so
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  267   * there is no need to waste CMA pages for that kind; it also helps reduce
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  268   * fragmentations.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  269   */
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  270  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  271  {
90ae409f9eb3bc kernel/dma/contiguous.c       Christoph Hellwig 2019-08-20  272  	size_t count = size >> PAGE_SHIFT;
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  273  	struct page *page = NULL;
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  274  	struct cma *cma = NULL;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  275  	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  276  	bool alloc_from_pernuma = false;
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  277
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  278  	if (dev && dev->cma_area)
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  279  		cma = dev->cma_area;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  280  	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  281
RE: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Robin Murphy
> Sent: Thursday, June 25, 2020 11:11 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Jonathan
> Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport
> Subject: Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
>
> On 2020-06-25 08:43, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from the local numa node; it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency a lot. For example, without this patch, an smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent();
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs only
> > 240ns.
> > 2. When we set iommu passthrough, drivers will get memory from CMA;
> > local memory means much less latency.
> >
> > Cc: Jonathan Cameron
> > Cc: Christoph Hellwig
> > Cc: Marek Szyprowski
> > Cc: Will Deacon
> > Cc: Robin Murphy
> > Cc: Ganapatrao Kulkarni
> > Cc: Catalin Marinas
> > Cc: Nicolas Saenz Julienne
> > Cc: Steve Capper
> > Cc: Andrew Morton
> > Cc: Mike Rapoport
> > Signed-off-by: Barry Song
> > ---
> >  include/linux/dma-contiguous.h |  4 ++
> >  kernel/dma/Kconfig             | 10
> >  kernel/dma/contiguous.c        | 99 ++
> >  3 files changed, 104 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..278a80a40456 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
> >
> >  void dma_contiguous_reserve(phys_addr_t addr_limit);
> >
> > +void dma_pernuma_cma_reserve(void);
> > +
> >  int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
> > 				       phys_addr_t limit, struct cma **res_cma,
> > 				       bool fixed);
> > @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
> >
> >  static inline void dma_contiguous_reserve(phys_addr_t limit) { }
> >
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +
> >  static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
> > 				       phys_addr_t limit, struct cma **res_cma,
> > 				       bool fixed)
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index d006668c0027..aeb976b1d21c 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -104,6 +104,16 @@ config DMA_CMA
> >  if DMA_CMA
> >  comment "Default contiguous memory area size:"
> >
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +	int "Size in Mega Bytes for per-numa CMA areas"
> > +	depends on NUMA
> > +	default 16 if ARM64
> > +	default 0
> > +	help
> > +	  Defines the size (in MiB) of the per-numa memory area for Contiguous
> > +	  Memory Allocator. Every numa node will get a separate CMA with this
> > +	  size. If the size of 0 is selected, per-numa CMA is disabled.
> > +
>
> I think this needs to be cleverer than just a static config option.
> Pretty much everything else CMA-related is runtime-configurable to some
> degree, and doing any per-node setup when booting on a single-node
> system would be wasted effort.

I agree some dynamic configuration should be supported to set the size of
the per-numa CMA. It could be a kernel boot parameter, or it could leverage
an existing parameter.

For a system with NUMA enabled but with only one node (or actually
non-numa), the current dma_pernuma_cma_reserve() won't do anything:

void __init dma_pernuma_cma_reserve(void)
{
	int nid;

	if (!pernuma_size_bytes || nr_online_nodes <= 1)
		return;
}

> Since this is conceptually very similar to the existing hugetlb_cma
> implementation I'm also wondering about inconsistency with respect to
> specifying per-node vs. total sizes.
Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
On 2020-06-25 08:43, Barry Song wrote:
> This is useful for at least two scenarios:
> 1. ARM64 smmu will get memory from the local numa node; it can save its
> command queues and page tables locally. Tests show it can decrease
> dma_unmap latency a lot. For example, without this patch, an smmu on
> node2 will get memory from node0 by calling dma_alloc_coherent();
> typically, it has to wait for more than 560ns for the completion of
> CMD_SYNC in an empty command queue; with this patch, it needs only 240ns.
> 2. When we set iommu passthrough, drivers will get memory from CMA;
> local memory means much less latency.
>
> Cc: Jonathan Cameron
> Cc: Christoph Hellwig
> Cc: Marek Szyprowski
> Cc: Will Deacon
> Cc: Robin Murphy
> Cc: Ganapatrao Kulkarni
> Cc: Catalin Marinas
> Cc: Nicolas Saenz Julienne
> Cc: Steve Capper
> Cc: Andrew Morton
> Cc: Mike Rapoport
> Signed-off-by: Barry Song
> ---
>  include/linux/dma-contiguous.h |  4 ++
>  kernel/dma/Kconfig             | 10
>  kernel/dma/contiguous.c        | 99 ++
>  3 files changed, 104 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> index 03f8e98e3bcc..278a80a40456 100644
> --- a/include/linux/dma-contiguous.h
> +++ b/include/linux/dma-contiguous.h
> @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
>
>  void dma_contiguous_reserve(phys_addr_t addr_limit);
>
> +void dma_pernuma_cma_reserve(void);
> +
>  int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
> 				       phys_addr_t limit, struct cma **res_cma,
> 				       bool fixed);
> @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
>
>  static inline void dma_contiguous_reserve(phys_addr_t limit) { }
>
> +static inline void dma_pernuma_cma_reserve(void) { }
> +
>  static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
> 				       phys_addr_t limit, struct cma **res_cma,
> 				       bool fixed)
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index d006668c0027..aeb976b1d21c 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -104,6 +104,16 @@ config DMA_CMA
>  if DMA_CMA
>  comment "Default contiguous memory area size:"
>
> +config CMA_PERNUMA_SIZE_MBYTES
> +	int "Size in Mega Bytes for per-numa CMA areas"
> +	depends on NUMA
> +	default 16 if ARM64
> +	default 0
> +	help
> +	  Defines the size (in MiB) of the per-numa memory area for Contiguous
> +	  Memory Allocator. Every numa node will get a separate CMA with this
> +	  size. If the size of 0 is selected, per-numa CMA is disabled.
> +

I think this needs to be cleverer than just a static config option.
Pretty much everything else CMA-related is runtime-configurable to some
degree, and doing any per-node setup when booting on a single-node
system would be wasted effort.

Since this is conceptually very similar to the existing hugetlb_cma
implementation I'm also wondering about inconsistency with respect to
specifying per-node vs. total sizes. Another thought, though, is that
systems large enough to have multiple NUMA nodes tend not to be short on
memory, so it might not be unreasonable to base this all on whatever size
the default area is given, and simply have a binary on/off switch to
control the per-node aspect.

>  config CMA_SIZE_MBYTES
> 	int "Size in Mega Bytes"
> 	depends on !CMA_SIZE_SEL_PERCENTAGE
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index 15bc5026c485..bcbd53aead93 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -30,7 +30,14 @@
>  #define CMA_SIZE_MBYTES 0
>  #endif
>
> +#ifdef CONFIG_CMA_PERNUMA_SIZE_MBYTES
> +#define CMA_SIZE_PERNUMA_MBYTES CONFIG_CMA_PERNUMA_SIZE_MBYTES
> +#else
> +#define CMA_SIZE_PERNUMA_MBYTES 0
> +#endif
> +
>  struct cma *dma_contiguous_default_area;
> +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
>
>  /*
>   * Default global CMA area size can be defined in kernel's .config.
> @@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
>   */
>  static const phys_addr_t size_bytes __initconst =
> 	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
> +static const phys_addr_t pernuma_size_bytes __initconst =
> +	(phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
>  static phys_addr_t size_cmdline __initdata = -1;
>  static phys_addr_t base_cmdline __initdata;
>  static phys_addr_t limit_cmdline __initdata;
> @@ -96,6 +105,33 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
>  #endif
>
> +void __init dma_pernuma_cma_reserve(void)
> +{
> +	int nid;
> +
> +	if (!pernuma_size_bytes || nr_online_nodes <= 1)
> +		return;
> +
> +	for_each_node_state(nid, N_ONLINE) {

Do we need/want notifiers to handle currently-offline nodes coming
online later (I'm not sure off-hand how NUMA interacts with stuff like
"maxcpus=n")?

> +
[PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
This is useful for at least two scenarios:
1. ARM64 smmu will get memory from the local numa node; it can save its
command queues and page tables locally. Tests show it can decrease
dma_unmap latency a lot. For example, without this patch, an smmu on
node2 will get memory from node0 by calling dma_alloc_coherent();
typically, it has to wait for more than 560ns for the completion of
CMD_SYNC in an empty command queue; with this patch, it needs only 240ns.
2. When we set iommu passthrough, drivers will get memory from CMA;
local memory means much less latency.

Cc: Jonathan Cameron
Cc: Christoph Hellwig
Cc: Marek Szyprowski
Cc: Will Deacon
Cc: Robin Murphy
Cc: Ganapatrao Kulkarni
Cc: Catalin Marinas
Cc: Nicolas Saenz Julienne
Cc: Steve Capper
Cc: Andrew Morton
Cc: Mike Rapoport
Signed-off-by: Barry Song
---
 include/linux/dma-contiguous.h |  4 ++
 kernel/dma/Kconfig             | 10
 kernel/dma/contiguous.c        | 99 ++
 3 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..278a80a40456 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)

 void dma_contiguous_reserve(phys_addr_t addr_limit);

+void dma_pernuma_cma_reserve(void);
+
 int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
				       phys_addr_t limit, struct cma **res_cma,
				       bool fixed);
@@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }

 static inline void dma_contiguous_reserve(phys_addr_t limit) { }

+static inline void dma_pernuma_cma_reserve(void) { }
+
 static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
				       phys_addr_t limit, struct cma **res_cma,
				       bool fixed)
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index d006668c0027..aeb976b1d21c 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -104,6 +104,16 @@ config DMA_CMA
 if DMA_CMA
 comment "Default contiguous memory area size:"

+config CMA_PERNUMA_SIZE_MBYTES
+	int "Size in Mega Bytes for per-numa CMA areas"
+	depends on NUMA
+	default 16 if ARM64
+	default 0
+	help
+	  Defines the size (in MiB) of the per-numa memory area for Contiguous
+	  Memory Allocator. Every numa node will get a separate CMA with this
+	  size. If the size of 0 is selected, per-numa CMA is disabled.
+
 config CMA_SIZE_MBYTES
	int "Size in Mega Bytes"
	depends on !CMA_SIZE_SEL_PERCENTAGE
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 15bc5026c485..bcbd53aead93 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -30,7 +30,14 @@
 #define CMA_SIZE_MBYTES 0
 #endif

+#ifdef CONFIG_CMA_PERNUMA_SIZE_MBYTES
+#define CMA_SIZE_PERNUMA_MBYTES CONFIG_CMA_PERNUMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_PERNUMA_MBYTES 0
+#endif
+
 struct cma *dma_contiguous_default_area;
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];

 /*
  * Default global CMA area size can be defined in kernel's .config.
@@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
  */
 static const phys_addr_t size_bytes __initconst =
	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
+static const phys_addr_t pernuma_size_bytes __initconst =
+	(phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
 static phys_addr_t size_cmdline __initdata = -1;
 static phys_addr_t base_cmdline __initdata;
 static phys_addr_t limit_cmdline __initdata;
@@ -96,6 +105,33 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
 #endif

+void __init dma_pernuma_cma_reserve(void)
+{
+	int nid;
+
+	if (!pernuma_size_bytes || nr_online_nodes <= 1)
+		return;
+
+	for_each_node_state(nid, N_ONLINE) {
+		int ret;
+		char name[20];
+
+		snprintf(name, sizeof(name), "pernuma%d", nid);
+		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
+						 0, false, name,
+						 &dma_contiguous_pernuma_area[nid],
+						 nid);
+		if (ret) {
+			pr_warn("%s: reservation failed: err %d, node %d", __func__,
+				ret, nid);
+			continue;
+		}
+
+		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
+			 (unsigned long long)pernuma_size_bytes / SZ_1M, nid);
+	}
+}
+
 /**
  * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
  * @limit: End address of the reserved memory (optional, 0 for any).
@@ -222,22 +258,