On 7/14/19 10:06 AM, Nikolai Zhubr wrote:
> Hi all,
>
> After reading some (apparently contradictory) revisions of the DMA API references
> in Documentation/DMA-*.txt, some (contradictory) discussions thereof, and even
> digging through the in-tree drivers in search of a good enlightening example,
> I still have to ask for advice.
>
> I'm crafting a tiny driver (or rather, a kernel-mode helper) for a very special
> PCIe device. It actually works already, but performs differently on different
> kernels. I'm targeting x86 (i686) only (although the driver should preferably
> stay platform-neutral), and I need to support kernels 4.9+. Due to how the
> device is designed and used, very little has to be done in kernel space. The
> device has large internal memory, which accumulates some measurement data, and
> it is capable of transferring it to the host using DMA (with at least a 32-bit
> address space available). Arranging memory for DMA is pretty much the only
> thing that userspace cannot reasonably do, so this needs to be in the driver.
> My currently attempted layout is as follows:
>
> 1. In the (kernel-mode) driver, allocate a large contiguous block of physical
> memory to do DMA into. It will later be reused several times. This block does
> not need a kernel-mode virtual address because it will never be accessed from
> the driver directly. The block size is typically 128M and I use CMA=256M.
> Currently I use dma_alloc_coherent(), but I'm not convinced it really needs to
> be strictly coherent memory, for performance reasons, see below. Also, AFAICS
> on x86 dma_alloc_coherent() always creates a kernel address mapping anyway, so
> maybe I'd better simply kmalloc() with a subsequent dma_map_single()?
>
> 2. Upon DMA completion (from device to host), some sort of
> barrier/synchronization might be necessary (to be safe WRT speculative loads,
> cache, etc.), like dma_cache_sync() or dma_sync_single_for_cpu(). However, the
> latter looks like a no-op for x86 AFAICS, and the former is apparently
> flush_write_buffers(), which is not very involved either (asm lock; nop) and
> does not look useful for my case. Currently I do not use any, and it seems
> OK, maybe by pure luck. So, is it really that trivially simple on x86, or am I
> just missing something horribly big here?
>
> 3. mmap this buffer for userspace. Reading from it should be as fast as
> possible, therefore this block AFAICS should be cacheable (and prefetchable
> and whatever else for better performance), at least from the userspace
> context. It is not quite clear whether such properties depend on the block
> allocation method (in step 1 above) or only on the remapping attributes.
> Currently, for mmap I employ dma_mmap_coherent(), but it seems also possible
> to use remap_pfn_range(), and also to change vm_page_prot somewhat. I've
> already found that e.g. pgprot_noncached hurts performance quite a lot, but
> presumably without it some DMA barrier (step 2 above) is still necessary?
>
> Any hints greatly appreciated,
>
> Regards,
> Nikolai
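For reference, a minimal sketch of the coherent-allocation path described in
step 1 above. Everything here is hypothetical (struct my_dev, MY_BUF_SIZE,
my_alloc_dma_buffer) and only illustrates the API calls under discussion, not
the actual driver:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

#define MY_BUF_SIZE	(128UL * 1024 * 1024)	/* 128M capture buffer */

struct my_dev {
	struct pci_dev *pdev;
	void *buf_cpu;		/* kernel virtual address (not used directly) */
	dma_addr_t buf_dma;	/* bus address programmed into the device */
};

static int my_alloc_dma_buffer(struct my_dev *md)
{
	int ret;

	/* The device can address at least 32 bits. */
	ret = dma_set_mask_and_coherent(&md->pdev->dev, DMA_BIT_MASK(32));
	if (ret)
		return ret;

	/*
	 * On x86 this returns ordinary cached memory; a block this large
	 * is typically carved out of CMA.
	 */
	md->buf_cpu = dma_alloc_coherent(&md->pdev->dev, MY_BUF_SIZE,
					 &md->buf_dma, GFP_KERNEL);
	if (!md->buf_cpu)
		return -ENOMEM;

	return 0;
}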
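And a sketch of the streaming alternative touched on in steps 1 and 2: this
only applies if buf_cpu came from an ordinary allocation (kmalloc() or
alloc_pages()) rather than dma_alloc_coherent(). The buffer is mapped with
dma_map_single() and handed back to the CPU with dma_sync_single_for_cpu()
once the device signals completion; on cache-coherent x86 the sync is
effectively a no-op, but it keeps the driver platform-neutral. Names are
again hypothetical:

#include <linux/interrupt.h>

static int my_map_streaming(struct my_dev *md)
{
	md->buf_dma = dma_map_single(&md->pdev->dev, md->buf_cpu,
				     MY_BUF_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(&md->pdev->dev, md->buf_dma))
		return -ENOMEM;
	return 0;
}

static irqreturn_t my_dma_done_irq(int irq, void *data)
{
	struct my_dev *md = data;

	/*
	 * Hand the buffer back to the CPU after the device has finished
	 * writing, so the freshly DMA'd data is visible before userspace
	 * reads it.
	 */
	dma_sync_single_for_cpu(&md->pdev->dev, md->buf_dma,
				MY_BUF_SIZE, DMA_FROM_DEVICE);

	/* ... wake up whoever is waiting for the data ... */
	return IRQ_HANDLED;
}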
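Finally, a sketch of the mmap handler from step 3, assuming the coherent
buffer from the first sketch. dma_mmap_coherent() applies page protections
consistent with how the buffer was allocated, so on x86 the userspace mapping
stays cacheable without touching vm_page_prot by hand:

#include <linux/fs.h>
#include <linux/mm.h>

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_dev *md = filp->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > MY_BUF_SIZE)
		return -EINVAL;

	/* Export the coherent buffer to userspace; stays cacheable on x86. */
	return dma_mmap_coherent(&md->pdev->dev, vma, md->buf_cpu,
				 md->buf_dma, len);
}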
Hi,

I suggest that you try some mailing list(s) besides linux-kernel.
The MAINTAINERS file has these possibilities:

  dmaeng...@vger.kernel.org
  io...@lists.linux-foundation.org

or just try

  linux...@vger.kernel.org

-- 
~Randy