Right now, the SMMU driver uses dma_alloc_coherent() to allocate memory for its queues and tables. On ARM64 servers there is typically a single default CMA located on node 0, which can be far away from node 2, node 3, etc. Placing the queues and tables on a remote node increases ARM SMMU latency significantly. For example, with the SMMU on node 2 and the default global CMA on node 0, after sending a CMD_SYNC into an empty command queue we have to wait more than 550ns for the CMD_SYNC to complete. If we place them locally instead, we only need to wait about 240ns.
With per-NUMA CMA, the SMMU gets memory from its local NUMA node for command queues and page tables, which shrinks dma_unmap latency considerably. Meanwhile, when iommu.passthrough is on, device drivers that call dma_alloc_coherent() will also get local memory and avoid cross-node traffic.

Barry Song (3):
  dma-direct: provide the ability to reserve per-numa CMA
  arm64: mm: reserve hugetlb CMA after numa_init
  arm64: mm: reserve per-numa CMA after numa_init

 arch/arm64/mm/init.c           | 12 ++++++----
 include/linux/dma-contiguous.h |  4 ++++
 kernel/dma/Kconfig             | 10 ++++++++
 kernel/dma/contiguous.c        | 43 +++++++++++++++++++++++++++++++++-
 4 files changed, 63 insertions(+), 6 deletions(-)

-- 
2.23.0
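For readers skimming the series, a minimal conceptual sketch of the approach described above. This is not the literal diff: the helper name dma_pernuma_cma_reserve() and the per-node array are assumptions inferred from the patch titles, and error handling is omitted. The key ordering constraint is that the per-node reservation must run after arm64_numa_init(), once the NUMA topology is known.

    /* Sketch only -- function names assumed, not copied from the diff. */

    /* arch/arm64/mm/init.c: conceptual ordering in bootmem_init() */
    void __init bootmem_init(void)
    {
    	...
    	arm64_numa_init();
    	/* Reserve per-numa CMA only after NUMA topology is known,
    	 * so each area actually lands on its own node. */
    	dma_pernuma_cma_reserve();
    	...
    }

    /* kernel/dma/contiguous.c: conceptual per-node reservation loop */
    void __init dma_pernuma_cma_reserve(void)
    {
    	int nid;

    	for_each_online_node(nid) {
    		/* One CMA area per node; dma_alloc_coherent() for a
    		 * device attached to this node can then stay local. */
    		cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
    					   0, false, "pernuma",
    					   &dma_contiguous_pernuma_area[nid],
    					   nid);
    	}
    }

With such a layout, an SMMU on node 2 would draw its command queue and page tables from node 2's CMA area rather than the global area on node 0, which is where the 550ns-to-240ns improvement quoted above comes from.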