Re: [PATCH] mm/dmapool: localize page allocations
On 5/17/2018 4:41 PM, Matthew Wilcox wrote: > Let's try a different example. I have a four-socket system with one > NVMe device with lots of hardware queues. Each CPU has its own queue > assigned to it. If I allocate all the PRP metadata on the socket with > the NVMe device attached to it, I'm sending a lot of coherency traffic > in the direction of that socket, in addition to the actual data. If the > PRP lists are allocated randomly on the various sockets, the traffic > is heading all over the fabric. If the PRP lists are allocated on the > local socket, the only time those lists move off this node is when the > device requests them. So.., your reasoning is that you actually want to keep the memory as close as possible to the CPU rather than the device itself. CPU would do frequent updates the buffer until the point where it hands off the buffer to the hardware. Device would fetch the memory via coherency when it needs to consume the data but this would be a one time penalty. It sounds logical to me. I was always told that you want to keep buffers as close as possible to the device. Maybe, it makes sense for things that device needs frequent access like receive buffers. If the majority user is CPU, then the buffer needs to be kept closer to the CPU. dma_alloc_coherent() is generally used for receiver buffer allocation in network adapters in general. People allocate a chunk and then create a queue that hardware owns for dumping events and data. Since DMA pool is a generic API, we should maybe request where we want to keep the buffers closer to and allocate buffers from the appropriate NUMA node based on that. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH] mm/dmapool: localize page allocations
On Thu, May 17, 2018 at 04:05:45PM -0400, Sinan Kaya wrote: > On 5/17/2018 3:46 PM, Matthew Wilcox wrote: > >> Remember that the CPU core that is running this driver is most probably on > >> the same NUMA node as the device itself. > > Umm ... says who? If my process is running on NUMA node 5 and I submit > > an I/O, it should be allocating from a pool on node 5, not from a pool > > on whichever node the device is attached to. > > OK, let's do an exercise. Maybe, I'm missing something in the big picture. Sure. > If a user process is running at node 5, it submits some work to the hardware > via block layer that is eventually invoked by syscall. > > Whatever buffer process is using, it gets copied into the kernel space as > it is crossing a userspace/kernel space boundary. > > Block layer packages a block request with the kernel pointers and makes a > request to the NVMe driver for consumption. > > Last time I checked, dma_alloc_coherent() API uses the locality information > from the device not from the CPU for allocation. Yes, it does. I wonder why that is; it doesn't actually make any sense. It'd be far more sensible to allocate it on memory local to the user than memory local to the device. > While the metadata for dma_pool is pointing to the currently running CPU core, > the DMA buffer itself is created using the device node itself today without > my patch. Umm ... dma_alloc_coherent memory is for metadata about the transfer, not for the memory used for the transaction. > I would think that you actually want to run the process at the same NUMA node > as the CPU and device itself for performance reasons. Otherwise, performance > expectations should be low. That's foolish. Consider a database appliance with four sockets, each with its own memory and I/O devices attached. You can't tell the user to shard the database into four pieces and have each socket only work on the quarter of the database that's available to each socket. They may as well buy four smaller machines. The point of buying a large NUMA machine is to use all of it. Let's try a different example. I have a four-socket system with one NVMe device with lots of hardware queues. Each CPU has its own queue assigned to it. If I allocate all the PRP metadata on the socket with the NVMe device attached to it, I'm sending a lot of coherency traffic in the direction of that socket, in addition to the actual data. If the PRP lists are allocated randomly on the various sockets, the traffic is heading all over the fabric. If the PRP lists are allocated on the local socket, the only time those lists move off this node is when the device requests them.
Re: [PATCH] mm/dmapool: localize page allocations
On 5/17/2018 3:46 PM, Matthew Wilcox wrote: >> Remember that the CPU core that is running this driver is most probably on >> the same NUMA node as the device itself. > Umm ... says who? If my process is running on NUMA node 5 and I submit > an I/O, it should be allocating from a pool on node 5, not from a pool > on whichever node the device is attached to. OK, let's do an exercise. Maybe, I'm missing something in the big picture. If a user process is running at node 5, it submits some work to the hardware via block layer that is eventually invoked by syscall. Whatever buffer process is using, it gets copied into the kernel space as it is crossing a userspace/kernel space boundary. Block layer packages a block request with the kernel pointers and makes a request to the NVMe driver for consumption. Last time I checked, dma_alloc_coherent() API uses the locality information from the device not from the CPU for allocation. While the metadata for dma_pool is pointing to the currently running CPU core, the DMA buffer itself is created using the device node itself today without my patch. I would think that you actually want to run the process at the same NUMA node as the CPU and device itself for performance reasons. Otherwise, performance expectations should be low. Even if user says please keep my process to a particular NUMA node, we keep pointing to the memory on the other node today. I don't know what is so special about memory on the default node. IMO, all memory allocations used by a driver need to follow the device. I wish I could do this in kmalloc(). devm_kmalloc() follows the device as another example not CPU. With these assumptions, even though user said please use the NUMA node from the device, we still keep pointing to the default domain for pointers. Isn't this wrong? > > If it actually makes a performance difference, then NVMe should allocate > one pool per queue, rather than one pool per device like it currently > does. > >> Also, if it was a one time init kind of thing, I'd say "yeah, leave it >> alone". >> DMA pool is used by a wide range of drivers and it is used to allocate >> fixed size buffers at runtime. > * DMA Pool allocator > * > * Copyright 2001 David Brownell > * Copyright 2007 Intel Corporation > * Author: Matthew Wilcox > > I know what it's used for. > cool, good to know. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH] mm/dmapool: localize page allocations
On Thu, May 17, 2018 at 03:37:21PM -0400, Sinan Kaya wrote: > On 5/17/2018 2:18 PM, Matthew Wilcox wrote: > > On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote: > >> Try to keep the pool closer to the device's NUMA node by changing kmalloc() > >> to kmalloc_node() and devres_alloc() to devres_alloc_node(). > > Have you measured any performance gains by doing this? The thing is that > > these allocations are for the metadata about the page, and the page is > > going to be used by CPUs in every node. So it's not clear to me that > > allocating it on the node nearest to the device is going to be any sort > > of a win. > > > > It is true that this is metadata but it is one of the things that is most > frequently used in spite of its small size. > > I don't think it makes any sense to cross a chip boundary for accessing a > pointer location on every single pool allocation. > > Remember that the CPU core that is running this driver is most probably on > the same NUMA node as the device itself. Umm ... says who? If my process is running on NUMA node 5 and I submit an I/O, it should be allocating from a pool on node 5, not from a pool on whichever node the device is attached to. If it actually makes a performance difference, then NVMe should allocate one pool per queue, rather than one pool per device like it currently does. > Also, if it was a one time init kind of thing, I'd say "yeah, leave it > alone". > DMA pool is used by a wide range of drivers and it is used to allocate > fixed size buffers at runtime. * DMA Pool allocator * * Copyright 2001 David Brownell * Copyright 2007 Intel Corporation * Author: Matthew Wilcox I know what it's used for.
Re: [PATCH] mm/dmapool: localize page allocations
On 5/17/2018 2:18 PM, Matthew Wilcox wrote: > On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote: >> Try to keep the pool closer to the device's NUMA node by changing kmalloc() >> to kmalloc_node() and devres_alloc() to devres_alloc_node(). > Have you measured any performance gains by doing this? The thing is that > these allocations are for the metadata about the page, and the page is > going to be used by CPUs in every node. So it's not clear to me that > allocating it on the node nearest to the device is going to be any sort > of a win. > It is true that this is metadata but it is one of the things that is most frequently used in spite of its small size. I don't think it makes any sense to cross a chip boundary for accessing a pointer location on every single pool allocation. Remember that the CPU core that is running this driver is most probably on the same NUMA node as the device itself. Also, if it was a one time init kind of thing, I'd say "yeah, leave it alone". DMA pool is used by a wide range of drivers and it is used to allocate fixed size buffers at runtime. Performance impact changes depending on the driver in use. This particular code is in use by network adapters as well as the NVMe driver. It does have a wide range of impact. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH] mm/dmapool: localize page allocations
On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote: > Try to keep the pool closer to the device's NUMA node by changing kmalloc() > to kmalloc_node() and devres_alloc() to devres_alloc_node(). Have you measured any performance gains by doing this? The thing is that these allocations are for the metadata about the page, and the page is going to be used by CPUs in every node. So it's not clear to me that allocating it on the node nearest to the device is going to be any sort of a win. > @@ -504,7 +504,8 @@ struct dma_pool *dmam_pool_create(const char *name, > struct device *dev, > { > struct dma_pool **ptr, *pool; > > - ptr = devres_alloc(dmam_pool_release, sizeof(*ptr), GFP_KERNEL); > + ptr = devres_alloc_node(dmam_pool_release, sizeof(*ptr), GFP_KERNEL, > + dev_to_node(dev)); > if (!ptr) > return NULL; ... are we really calling devres_alloc() for sizeof(void *)? That's sad.