Re: [PATCH 5/6 v2] bus: fsl-mc: support dma configure for devices on fsl-mc bus
Hi Nipun,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc2 next-20180424]
[cannot apply to iommu/next glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:      https://github.com/0day-ci/linux/commits/Nipun-Gupta/Support-for-fsl-mc-bus-and-its-devices-in-SMMU/20180418-034931
config:   powerpc64-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc64

All errors (new ones prefixed by >>):

   drivers/bus/fsl-mc/fsl-mc-bus.c: In function 'fsl_mc_dma_configure':
>> drivers/bus/fsl-mc/fsl-mc-bus.c:137:9: error: too many arguments to function 'of_dma_configure'
      return of_dma_configure(dev, dma_dev->of_node, 0);
             ^~~~
   In file included from drivers/bus/fsl-mc/fsl-mc-bus.c:13:0:
   include/linux/of_device.h:58:5: note: declared here
    int of_dma_configure(struct device *dev, struct device_node *np);
        ^~~~
   drivers/bus/fsl-mc/fsl-mc-bus.c: At top level:
>> drivers/bus/fsl-mc/fsl-mc-bus.c:161:3: error: 'struct bus_type' has no member named 'dma_configure'
      .dma_configure = fsl_mc_dma_configure,
       ^

vim +/of_dma_configure +137 drivers/bus/fsl-mc/fsl-mc-bus.c

   129
   130  static int fsl_mc_dma_configure(struct device *dev)
   131  {
   132          struct device *dma_dev = dev;
   133
   134          while (dev_is_fsl_mc(dma_dev))
   135                  dma_dev = dma_dev->parent;
   136
 > 137          return of_dma_configure(dev, dma_dev->of_node, 0);
   138  }
   139
   140  static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
   141                               char *buf)
   142  {
   143          struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
   144
   145          return sprintf(buf, "fsl-mc:v%08Xd%s\n", mc_dev->obj_desc.vendor,
   146                         mc_dev->obj_desc.type);
   147  }
   148  static DEVICE_ATTR_RO(modalias);
   149
   150  static struct attribute *fsl_mc_dev_attrs[] = {
   151          &dev_attr_modalias.attr,
   152          NULL,
   153  };
   154
   155  ATTRIBUTE_GROUPS(fsl_mc_dev);
   156
   157  struct bus_type fsl_mc_bus_type = {
   158          .name = "fsl-mc",
   159          .match = fsl_mc_bus_match,
   160          .uevent = fsl_mc_bus_uevent,
 > 161          .dma_configure = fsl_mc_dma_configure,
   162          .dev_groups = fsl_mc_dev_groups,
   163  };
   164  EXPORT_SYMBOL_GPL(fsl_mc_bus_type);
   165

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all              Intel Corporation
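Both errors point at missing prerequisites in the tree the robot built against: its include/linux/of_device.h declares the two-argument of_dma_configure() shown in the note above, and its struct bus_type has no dma_configure member yet. Purely as an illustrative sketch (not a committed fix), if the patch were built against such a tree without the prerequisite series, the first error could only be resolved by matching the declared two-argument prototype; the second requires the separate patch that actually adds the bus_type::dma_configure hook:

        /* Sketch only: matches the two-argument of_dma_configure()
         * prototype this tree declares. The third argument used by the
         * original patch comes from a prerequisite series. */
        static int fsl_mc_dma_configure(struct device *dev)
        {
                struct device *dma_dev = dev;

                /* Walk up to the first non-fsl-mc ancestor, whose OF
                 * node carries the DMA configuration properties. */
                while (dev_is_fsl_mc(dma_dev))
                        dma_dev = dma_dev->parent;

                return of_dma_configure(dev, dma_dev->of_node);
        }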
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> similar problems to tegra), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

I think you completely misunderstand ARM from what you've written above, and this worries me greatly about giving DRM the level of control that is being asked for.

Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache attributes are stored in the page tables. These caches are inherently non-aliasing when there are multiple mappings (which is a great step forward compared to the previous aliasing caches.)

As the cache attributes are stored in the page tables, this in theory allows different virtual mappings of the same physical memory to have different cache attributes. However, there's a problem, and that's called speculative prefetching.

Let's say you have one mapping which is cacheable, and another that is marked as write combining. If a cache line is speculatively prefetched through the cacheable mapping of this memory, and then you read the same physical location through the write combining mapping, it is possible that you could read cached data.

So, it is generally accepted that all mappings of any particular physical bit of memory should have the same cache attributes to avoid unpredictable behaviour. This presents a problem with what is generally called "lowmem", where the memory is mapped in kernel virtual space with cacheable attributes. It can also happen with highmem if the memory is kmapped.

This is why, on ARM, you can't use something like get_free_pages() to grab some pages from the system, pass them to the GPU, map them into userspace as write-combining, etc. It _might_ work for some CPUs, but ARM CPUs vary in how much prefetching they do, and what may work for one particular CPU is in no way guaranteed to work for another ARM CPU. The official line from architecture folk is to assume that the caches infinitely speculate, are of infinite size, and can write back *dirty* data at any moment.

The way to stop things like speculative prefetches to particular physical memory is to, quite "simply", not have any cacheable mappings of that physical memory anywhere in the system.

Now, cache flushes on ARM tend to be fairly expensive for GPU buffers. If you have, say, an 8MB buffer (for a 1080p frame) and you need to do a cache operation on that buffer, you'll be iterating over it 32 or maybe 64 bytes at a time "just in case" there's a cache line present. Referring to my previous email, where I detailed the potential need for _two_ flushes, one before the GPU operation and one after, this becomes _really_ expensive. At that point, you're probably way better off using write-combine memory where you don't need to spend CPU cycles performing cache flushing - potentially across all CPUs in the system if cache operations aren't broadcast.

This isn't a simple matter of "just provide some APIs for cache operations" - there's much more that needs to be understood by all parties here, especially when we have GPU drivers that can be used with quite different CPUs.

It may well be that for some combinations of CPUs and workloads, it's better to use write-combine memory without cache flushing, but for other CPUs that tradeoff (for the same workload) could well be different.

Older ARMs get more interesting, because they have aliasing caches. That means the CPU cache aliases across different virtual space mappings in some way, which complicates (a) the mapping of memory and (b) handling the cache operations on it.

It's too late for me to go into that tonight, and I probably won't be reading mail for the next week and a half, sorry.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
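The "no conflicting cacheable alias" rule Russell describes is the reason the kernel provides write-combine allocation helpers rather than letting drivers remap get_free_pages() memory by hand. A minimal sketch of that pattern, not taken from any driver (dev and size are placeholders, and how the linear-map alias is handled underneath is per-architecture):

        #include <linux/dma-mapping.h>

        /* Sketch: allocate the buffer as write-combine via the DMA API
         * so attribute handling stays consistent; no cache flushes are
         * needed on the CPU side, matching the tradeoff discussed above. */
        static int setup_wc_buffer(struct device *dev, size_t size)
        {
                dma_addr_t dma_handle;
                void *vaddr;

                vaddr = dma_alloc_wc(dev, size, &dma_handle, GFP_KERNEL);
                if (!vaddr)
                        return -ENOMEM;

                /* fill via vaddr (writes are buffered but uncached),
                 * hand dma_handle to the device */

                dma_free_wc(dev, size, vaddr, dma_handle);
                return 0;
        }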
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > - dma api hides the cache flushing requirements from us. GPUs love
> >   non-snooped access, and worse give userspace control over that. We want
> >   a strict separation between mapping stuff and flushing stuff. With the
> >   IOMMU api we mostly have the former, but for the latter arch maintainers
> >   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that a cache flushing API entirely separate is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way. So far it is in early stages and apparently rather
> buggy unfortunately.

And if folk want a cacheable mapping with explicit cache flushing, the cache flushing must not be defined in terms of "this is what this CPU seems to need", but from the point of view of a CPU with infinite prefetching, infinite caching, and infinite capacity to perform writebacks of dirty cache lines at unexpected moments while the memory is mapped cacheable. (The reason for that is you're operating in a non-CPU-specific space, so you can't make any guarantees as to how much caching or prefetching will occur by the CPU - different CPUs will do different amounts.)

So, for example, with the sequence:

        GPU writes to memory
        CPU reads from cacheable memory

if the memory was previously dirty (iow, the CPU has written), you need to flush the dirty cache lines _before_ the GPU writes happen, but you don't know whether the CPU has speculatively prefetched, so you need to flush any prefetched cache lines before reading from the cacheable memory _after_ the GPU has finished writing.

Also note that "flush" there can be "clean the cache", "clean and invalidate the cache" or "invalidate the cache" as appropriate - some CPUs are able to perform those three operations, and the appropriate one depends not only on where in the above sequence it's being used, but also on what the operations are. So, the above sequence could be:

        CPU invalidates cache for memory (due to possible dirty cache lines)
        GPU writes to memory
        CPU invalidates cache for memory (to get rid of any speculatively prefetched lines)
        CPU reads from cacheable memory

Yes, in the above case, _two_ cache operations are required to ensure correct behaviour. However, if you know for certain that the memory was previously clean, then the first cache operation can be skipped.

What I'm pointing out is there's much more than just "I want to flush the cache" here, which is currently what DRM seems to assume with the code in drm_cache.c.

If we can agree a set of interfaces that allows _proper_ use of these facilities, one which can be used appropriately, then there shouldn't be a problem. The DMA API does that via its ideas about who owns a particular buffer (because of the above problem), and that's something which would need to be carried over to such a cache flushing API (it should be pretty obvious that having a GPU read or write memory while the cache for that memory is being cleaned will lead to unexpected results.)

Also note that things get even more interesting in an SMP environment if cache operations aren't broadcast across the SMP cluster (which means cache operations have to be IPI'd to other CPUs.)

The next issue, which I've brought up before, is that exposing cache flushing to userspace on architectures where it isn't already exposed comes with a risk. As has been shown by Google Project Zero, this risks exposing those architectures to Spectre and Meltdown exploits where they weren't at such a risk before. (I've pretty much shown here that you _do_ need to control which cache lines get flushed to make these exploits work, and flushing the cache by reading lots of data in lieu of having the ability to explicitly flush bits of cache makes it very difficult to impossible for them to work.)

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
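The buffer-ownership model Russell refers to is what the streaming DMA API already encodes. A minimal sketch of the two-maintenance-op sequence above expressed through that API (dev, buf and size are placeholders; error paths trimmed):

        #include <linux/dma-mapping.h>

        /* Sketch: GPU writes a buffer, CPU then reads it. On a
         * non-coherent ARM system the map and the for_cpu sync
         * correspond to the two cache operations described above. */
        static int gpu_readback(struct device *dev, void *buf, size_t size)
        {
                dma_addr_t addr;

                /* 1st op: clean/invalidate, so no dirty line can be
                 * written back over the GPU's data (CPU -> device). */
                addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
                if (dma_mapping_error(dev, addr))
                        return -ENOMEM;

                /* ... kick the GPU, wait for its writes to complete ... */

                /* 2nd op: discard speculatively prefetched lines before
                 * the CPU looks at the data (device -> CPU). */
                dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);

                /* the CPU may now safely read buf */

                dma_unmap_single(dev, addr, size, DMA_FROM_DEVICE);
                return 0;
        }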
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 5:33 PM, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > Coordinating the backport of a trivial helper in the arm tree is not
>> > the end of the world. Really, this cowboy attitude is a good reason
>> > why graphics folks have such a bad rep. You keep poking into random
>> > kernel internals, don't talk to anyone and then complain if people
>> > are upset. This shouldn't be surprising.
>>
>> Not really agreeing on the cowboy thing. The fundamental problem is that
>> the dma api provides abstraction that seriously gets in the way of writing
>> a gpu driver. Some examples:
>
> So talk to other people. Maybe people share your frustration. Or maybe
> other people have a way to help.
>
>> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>>   hacks to fall back to a cache of pages allocated using
>>   dma_alloc_coherent if you build a kernel with bounce buffers.
>
> get_required_mask() is supposed to tell you if you are safe. However
> we are missing lots of implementations of it for iommus so you might get
> some false negatives, improvements welcome. It's been on my list of
> things to fix in the DMA API, but it is nowhere near the top.

It hasn't come up in a while except in some fireworks, so I honestly don't remember exactly what the issues have been. But

    commit d766ef53006c2c38a7fe2bef0904105a793383f2
    Author: Chris Wilson
    Date:   Mon Dec 19 12:43:45 2016

        drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping

and the various bits of code that a

    $ git grep SWIOTLB -- drivers/gpu

turns up are what we're doing to hack around that stuff. And in general (there are some exceptions) gpus should be able to address everything, so I never fully understood where that's even coming from.

>> - dma api hides the cache flushing requirements from us. GPUs love
>>   non-snooped access, and worse give userspace control over that. We want
>>   a strict separation between mapping stuff and flushing stuff. With the
>>   IOMMU api we mostly have the former, but for the latter arch maintainers
>>   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that a cache flushing API entirely separate is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way. So far it is in early stages and apparently rather
> buggy unfortunately.

I'm assuming this stuff here?

    https://lkml.org/lkml/2018/4/20/146

Anyway, I got a bit lost in all that work; it looks really nice.

>> - dma api hides how/where memory is allocated. Kinda similar problem,
>>   except now for CMA or address limits. So either we roll our own
>>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>>   allocations because that's the only way. Which sucks, because we can't
>>   directly tell CMA how to back off if there's some way to make CMA memory
>>   available through other means (gpus love to hog all of memory, so we
>>   have shrinkers and everything).
>
> If you really care about doing explicit cache flushing anyway (see
> above) allocating your own memory and mapping it where needed is by
> far the superior solution. On cache coherent architectures
> dma_alloc_coherent is nothing but allocate memory + dma_map_single.
> For non-coherent allocations the memory might come from a special
> pool or must be used through a special virtual address mapping that
> is set up either statically or dynamically. For that case splitting
> allocation and mapping is a good idea in many ways, and I plan to move
> towards that once the number of dma mapping implementations is down
> to a reasonable number so that it can actually be done.

Yeah, the above is pretty much what we do on x86. dma-api believes everything is coherent, so dma_map_sg does the mapping we want and nothing else (minus swiotlb fun). Cache flushing, allocations - all done by the driver.

On arm that doesn't work. The iommu api seems like a good fit, except the dma-api tends to get in the way a bit (drm/msm apparently has similar problems to tegra), and if you need contiguous memory dma_alloc_coherent is the only way to get at it. There was a huge discussion years ago about that, and direct cma access was shot down because it would have exposed too much of the caching attribute mangling required (most arm platforms need wc-pages to not be in the kernel's linear map apparently).

Anything that separates these 3 things more (allocation pools, mapping through IOMMUs and flushing cpu caches) sounds like the right direction to me. Even if that throws some portability across platforms away - drivers that want to control things in this much detail aren't really portable (without some serious work) anyway.

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
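To make Christoph's get_required_mask() pointer concrete: here is a probe-time sketch (not from any driver; dev is a placeholder) of how a GPU driver could detect up front that bounce buffering is possible, instead of discovering it inside dma_map_sg later. It is subject to the false negatives he mentions for missing IOMMU implementations:

        /* Sketch: if the platform may hand out addresses the device
         * cannot reach, dma_map_sg() could fall back to bounce
         * buffers; warn about it once at probe time. */
        if (dma_get_required_mask(dev) & ~dma_get_mask(dev))
                dev_warn(dev, "DMA mask narrower than required; bounce buffering possible\n");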
Re: [PATCH v4 14/22] iommu: handle page response timeout
On Mon, 23 Apr 2018 16:36:23 +0100, Jean-Philippe Brucker wrote:
> On Mon, Apr 16, 2018 at 10:49:03PM +0100, Jacob Pan wrote:
> > When IO page faults are reported outside the IOMMU subsystem, the page
> > request handler may fail for various reasons. E.g. a guest received
> > page requests but did not have a chance to run for a long time. The
> > irresponsive behavior could hold off limited resources on the
> > pending device. There can be hardware or credit based software
> > solutions as suggested in the PCI ATS Ch-4. To provide a basic safety
> > net this patch introduces a per device deferrable timer which monitors
> > the longest pending page fault that requires a response. Proper action
> > such as sending a failure response code could be taken when the timer
> > expires, but that is not included in this patch. We need to consider
> > the life cycle of page group ID to prevent confusion with reused group
> > IDs by a device. For now, a warning message provides a clue of such
> > failure.
> >
> > Signed-off-by: Jacob Pan
> > Signed-off-by: Ashok Raj
> > ---
> >  drivers/iommu/iommu.c | 60 ++++++++++++++++++++++++++++++++++++++++--
> >  include/linux/iommu.h |  4 ++++
> >  2 files changed, 62 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 628346c..f6512692 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -799,6 +799,39 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> >
> > +/* Max time to wait for a pending page request */
> > +#define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)
> > +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> > +{
> > +	struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
> > +	struct iommu_fault_event *evt, *iter;
> > +
> > +	u64 now;
> > +
> > +	now = get_jiffies_64();
> > +
> > +	/* The goal is to ensure driver or guest page fault handler (via vfio)
> > +	 * send page response on time. Otherwise, limited queue resources
> > +	 * may be occupied by some irresponsive guests or drivers.
>
> By "limited queue resources", do you mean the PRI fault queue in the
> pIOMMU device, or something else?

I am referring to the device resource for tracking pending PRQs. The Intel IOMMU does not track pending PRQs.

> I'm still uneasy about this timeout. We don't really know if the guest
> doesn't respond because it is suspended, because it doesn't support
> PRI or because it's attempting to kill the host. In the first case,
> then receiving and responding to page requests later than 10s should
> be fine, right?

When a guest is going into system suspend, the suspend callbacks of the assigned device driver and of the vIOMMU should be called. I think the vIOMMU should propagate the iommu_suspend call to the host IOMMU driver and thereby terminate all the pending PRQs. We can make the timeout adjustable.

> Or maybe the guest is doing something weird like fetching pages from
> network storage and it occasionally hits a latency oddity. This
> wouldn't interrupt the fault queues, because other page requests for
> the same device can be serviced in parallel, but if you implement a
> PRG timeout it would still unfairly disable PRI.

The timeout here is intended as a broader, basic safety net at the per-device level. We can implement a finer-grained safety mechanism, but I am guessing it is better done in HW.

> In the other cases (unsupported PRI or rogue guest) then disabling PRI
> using a FAILURE status might be the right thing to do. However,
> assuming the device follows the PCI spec it will stop sending page
> requests once there are as many PPRs in flight as the allocated
> credit.

Agreed; here I am not taking any actions. There may be a need to drain in-flight requests.

> Even though drivers set the PPR credit number arbitrarily (because
> finding an ideal number is difficult or impossible), the device stops
> issuing faults at some point if the guest is unresponsive, and it
> won't grab any more shared resources, or use slots in shared queues.
> Resources for pending faults can be cleaned when the device is reset
> and assigned to a different guest.
>
> That's for sane endpoints that follow the spec. If on the other hand,
> we can't rely on the device implementation to respect our maximum
> credit allocation, then we should do the accounting ourselves and
> reject incoming faults with INVALID as fast as possible. Otherwise
> it's an easy way for a guest to DoS the host and I don't think a
> timeout solves this problem (the guest can wait 9 seconds before
> replying to faults and meanwhile fill all the queues). In addition
> the timeout is done on PRGs but not individual page faults, so a
> guest could overflow the queues by triggering lots of page requests
> without setting the last bit.

If there isn't any possibility of memory leak or abusing resources, I don't think it
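For readers unfamiliar with the timer API the patch relies on, here is a reduced, self-contained sketch of the per-device deferrable-timer pattern under discussion. Only IOMMU_PAGE_RESPONSE_MAXTIME and the function/struct names come from the quoted patch; the field layout and warning are simplified for illustration:

        #include <linux/timer.h>
        #include <linux/jiffies.h>
        #include <linux/printk.h>

        #define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)   /* from the patch */

        struct iommu_fault_param {
                struct timer_list timer;
                u64 oldest_pending;     /* jiffies stamp of the oldest fault */
        };

        /* Fires only as a safety net: warns when the oldest pending page
         * request has waited longer than the allowed response time. */
        static void iommu_dev_fault_timer_fn(struct timer_list *t)
        {
                struct iommu_fault_param *fparam = from_timer(fparam, t, timer);

                if (time_after64(get_jiffies_64(),
                                 fparam->oldest_pending + IOMMU_PAGE_RESPONSE_MAXTIME))
                        pr_warn("no page response within the allotted time\n");
        }

        static void fault_param_arm(struct iommu_fault_param *fparam)
        {
                /* deferrable: may fire late, fine for a coarse watchdog */
                timer_setup(&fparam->timer, iommu_dev_fault_timer_fn,
                            TIMER_DEFERRABLE);
                fparam->oldest_pending = get_jiffies_64();
                mod_timer(&fparam->timer, jiffies + IOMMU_PAGE_RESPONSE_MAXTIME);
        }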
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > Coordinating the backport of a trivial helper in the arm tree is not
> > the end of the world. Really, this cowboy attitude is a good reason
> > why graphics folks have such a bad rep. You keep poking into random
> > kernel internals, don't talk to anyone and then complain if people
> > are upset. This shouldn't be surprising.
>
> Not really agreeing on the cowboy thing. The fundamental problem is that
> the dma api provides abstraction that seriously gets in the way of writing
> a gpu driver. Some examples:

So talk to other people. Maybe people share your frustration. Or maybe other people have a way to help.

> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>   hacks to fall back to a cache of pages allocated using
>   dma_alloc_coherent if you build a kernel with bounce buffers.

get_required_mask() is supposed to tell you if you are safe. However we are missing lots of implementations of it for iommus, so you might get some false negatives; improvements welcome. It's been on my list of things to fix in the DMA API, but it is nowhere near the top.

> - dma api hides the cache flushing requirements from us. GPUs love
>   non-snooped access, and worse give userspace control over that. We want
>   a strict separation between mapping stuff and flushing stuff. With the
>   IOMMU api we mostly have the former, but for the latter arch maintainers
>   regularly tell us they won't allow that. So we have drm_clflush.c.

The problem is that a cache flushing API entirely separate is hard. That being said, if you look at my generic dma-noncoherent API series, it tries to move that way. So far it is in early stages and apparently rather buggy, unfortunately.

> - dma api hides how/where memory is allocated. Kinda similar problem,
>   except now for CMA or address limits. So either we roll our own
>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>   allocations because that's the only way. Which sucks, because we can't
>   directly tell CMA how to back off if there's some way to make CMA memory
>   available through other means (gpus love to hog all of memory, so we
>   have shrinkers and everything).

If you really care about doing explicit cache flushing anyway (see above), allocating your own memory and mapping it where needed is by far the superior solution. On cache coherent architectures dma_alloc_coherent is nothing but allocate memory + dma_map_single. For non-coherent allocations the memory might come from a special pool or must be used through a special virtual address mapping that is set up either statically or dynamically. For that case splitting allocation and mapping is a good idea in many ways, and I plan to move towards that once the number of dma mapping implementations is down to a reasonable number so that it can actually be done.
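Christoph's equivalence can be illustrated in a few lines. This sketch holds for a cache-coherent architecture only (dev and size are placeholders; on non-coherent systems dma_alloc_coherent() also does remapping that this pair does not reproduce):

        /* roughly what dma_alloc_coherent() amounts to when the device
         * is cache coherent: grab pages, then create the device mapping */
        void *buf = (void *)__get_free_pages(GFP_KERNEL, get_order(size));
        dma_addr_t handle;

        if (!buf)
                return -ENOMEM;

        handle = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, handle)) {
                free_pages((unsigned long)buf, get_order(size));
                return -ENOMEM;
        }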
Re: [PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
On Wed, Apr 25, 2018 at 12:10:47PM +0200, Thierry Reding wrote:
> From: Thierry Reding
>
> Depending on the kernel configuration, early ARM architecture setup code
> may have attached the GPU to a DMA/IOMMU mapping that transparently uses
> the IOMMU to back the DMA API. Tegra requires special handling for IOMMU
> backed buffers (a special bit in the GPU's MMU page tables indicates the
> memory path to take: via the SMMU or directly to the memory controller).
> Transparently backing DMA memory with an IOMMU prevents Nouveau from
> properly handling such memory accesses and causes memory access faults.
>
> As a side-note: buffers other than those allocated in instance memory
> don't need to be physically contiguous from the GPU's perspective since
> the GPU can map them into contiguous buffers using its own MMU. Mapping
> these buffers through the IOMMU is unnecessary and will even lead to
> performance degradation because of the additional translation.
>
> Signed-off-by: Thierry Reding
> ---
> I had already sent this out independently to fix a regression that was
> introduced in v4.16, but then Christoph pointed out that it should've
> been sent to a wider audience and should use a core API rather than
> calling into architecture code directly.
>
> I've added it to this series for easier reference and to show the need
> for the new API.

This is good stuff. I am struggling with something similar on ARM64. One problem that I wasn't able to fully solve cleanly was that for arm-smmu the SMMU HW resources are not released until the domain itself is destroyed, and I never quite figured out a way to swap the default domain cleanly.

This is a problem for the MSM GPU because not only do we run our own IOMMU as you do, we also have a hardware dependency on using context bank 0 to asynchronously switch the pagetable during rendering. I'm not sure if this is a problem you have encountered.

In any event, this code gets us a little bit further down the path, and at least there is somebody else out there in the cold dark world that understands my pain. :)

For your interest, here was my half-hearted attempt to avoid creating DMA domains in the first place, based on a blacklist, to try to spur a bit of discussion:

https://patchwork.freedesktop.org/series/41573/

Jordan

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2 3/5] ARM: dma-mapping: Implement arch_iommu_detach_device()
> +void arch_iommu_detach_device(struct device *dev)
> +{
> +#ifdef CONFIG_ARM_DMA_USE_IOMMU
> +	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
> +	const struct dma_map_ops *dma_ops;
> +
> +	if (!mapping)
> +		return;
> +
> +	arm_iommu_release_mapping(mapping);
> +	arm_iommu_detach_device(dev);
> +
> +	dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
> +	set_dma_ops(dev, dma_ops);

Why not simply:

	set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));
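For clarity, the whole function with the suggested simplification folded in would read as follows. This is a sketch derived purely from the patch under review, not a committed version:

        void arch_iommu_detach_device(struct device *dev)
        {
        #ifdef CONFIG_ARM_DMA_USE_IOMMU
                struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);

                if (!mapping)
                        return;

                /* drop the DMA-API-managed IOMMU mapping... */
                arm_iommu_release_mapping(mapping);
                arm_iommu_detach_device(dev);

                /* ...and fall back to the plain (non-IOMMU) DMA ops */
                set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));
        #endif
        }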
Re: [PATCH v2 2/5] dma-mapping: Introduce dma_iommu_detach_device() API
On Wed, Apr 25, 2018 at 12:10:48PM +0200, Thierry Reding wrote:
> From: Thierry Reding
>
> The dma_iommu_detach_device() API can be used by drivers to forcibly
> detach a device from an IOMMU that architecture code might have attached
> it to. This is useful for drivers that need explicit control over the
> IOMMU using the IOMMU API directly.

Given that no one else implements it, making it a generic API seems rather confusing. For now I'd rename it to arm_dma_iommu_detach_device() and only implement it in arm.

Once I've got the dma mapping implementations consolidated to a small enough number, we could think about something like a device quirk that tells the architecture to simply never even attach the iommu dma ops to start with.
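The quirk idea in the last paragraph might look something like the toy sketch below. To be clear, nothing here exists in the kernel; every name is invented purely to illustrate the shape of the idea:

        /* Entirely hypothetical: a per-device flag consulted before the
         * architecture installs transparent IOMMU-backed DMA ops. */
        struct toy_device {
                bool drives_iommu_directly;     /* invented field */
        };

        static void toy_arch_setup_dma_ops(struct toy_device *dev)
        {
                if (dev->drives_iommu_directly)
                        return; /* keep direct DMA ops; driver uses the IOMMU API */

                /* ... otherwise attach IOMMU-backed dma_map_ops as today ... */
        }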
Re: [PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
The series seems to be missing a cover letter. Also, I really think the original patch shouldn't be part of the series proper.
Re: [PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU
On Wed, Apr 25, 2018 at 11:25:11AM +0100, Russell King - ARM Linux wrote:
> >  config ARM_DMA_USE_IOMMU
> > -	bool
> > +	def_bool y
> >  	select ARM_HAS_SG_CHAIN
> >  	select NEED_SG_DMA_LENGTH
>
> This doesn't work - as has recently been discussed with hch, we can't
> globally enable NEED_SG_DMA_LENGTH on ARM - the ARM architecture
> pre-dates the addition of the DMA length member in the scatterlist,
> and not every machine supports the splitting of the DMA length from
> the non-DMA length member. Hence, this will cause a regression,
> sorry.

Agreed as-is. However supporting it everywhere should not be much work, I'll take a look at it.
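The member split at issue lives in struct scatterlist; paraphrased (not verbatim) from include/linux/scatterlist.h of that era, with debug fields omitted. Selecting NEED_SG_DMA_LENGTH changes which field sg_dma_len() touches, which is why it cannot simply be switched on for machines that were never audited for the split:

        struct scatterlist {
                unsigned long   page_link;
                unsigned int    offset;
                unsigned int    length;         /* CPU-side length */
                dma_addr_t      dma_address;
        #ifdef CONFIG_NEED_SG_DMA_LENGTH
                unsigned int    dma_length;     /* DMA-mapped length */
        #endif
        };

        #ifdef CONFIG_NEED_SG_DMA_LENGTH
        #define sg_dma_len(sg)  ((sg)->dma_length)
        #else
        #define sg_dma_len(sg)  ((sg)->length)
        #endif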
Re: [PATCH 01/13] iommu-common: move to arch/sparc
From: Christoph Hellwig
Date: Wed, 25 Apr 2018 07:15:27 +0200

> This code is only used by sparc, and all new iommu drivers should use the
> drivers/iommu/ framework. Also remove the unused exports.
>
> Signed-off-by: Christoph Hellwig
> Reviewed-by: Anshuman Khandual

Acked-by: David S. Miller
Re: [PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU
On Wed, Apr 25, 2018 at 12:10:51PM +0200, Thierry Reding wrote:
> From: Thierry Reding
>
> The ARM_DMA_USE_IOMMU Kconfig option has side-effects that drivers can
> not opt into but have to explicitly opt out of. This can lead to subtle
> bugs that are difficult to track down and not immediately obvious to be
> related to this Kconfig option.
>
> To avoid this confusion, always enable the option to expose any lurking
> bugs once and allow any regressions introduced by the DMA/IOMMU code to
> be caught more quickly in the future.
>
> Note that some drivers still use the Kconfig symbol to provide different
> code paths depending on what architecture the code runs on (e.g. 32-bit
> ARM vs. 64-bit ARM which have different and incompatible implementations
> of the DMA/IOMMU integration code), so leave the symbol in place to keep
> those drivers working.
>
> For the long term, it is preferable to transition everyone to the
> generic DMA/IOMMU integration code in drivers/iommu/dma-iommu.c.
>
> Signed-off-by: Thierry Reding
> ---
> Changes in v2:
> - new patch
>
>  arch/arm/Kconfig               |  2 +-
>  arch/arm/include/asm/device.h  |  6 --
>  arch/arm/mm/dma-mapping.c      | 18 --
>  drivers/iommu/Kconfig          |  7 ---
>  drivers/media/platform/Kconfig |  1 -
>  5 files changed, 1 insertion(+), 33 deletions(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index fa0b190f8a38..3c91de78535a 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -124,7 +124,7 @@ config NEED_SG_DMA_LENGTH
>  	bool
>
>  config ARM_DMA_USE_IOMMU
> -	bool
> +	def_bool y
>  	select ARM_HAS_SG_CHAIN
>  	select NEED_SG_DMA_LENGTH

This doesn't work - as has recently been discussed with hch, we can't globally enable NEED_SG_DMA_LENGTH on ARM - the ARM architecture pre-dates the addition of the DMA length member in the scatterlist, and not every machine supports the splitting of the DMA length from the non-DMA length member. Hence, this will cause a regression, sorry.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
[PATCH v2 4/5] drm/nouveau: tegra: Use dma_iommu_detach_device()
From: Thierry Reding

Use the new dma_iommu_detach_device() function to replace the open-coded equivalent.

Signed-off-by: Thierry Reding
---
 .../drm/nouveau/nvkm/engine/device/tegra.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 23428a7056e9..c0a7f3839cbb 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -20,10 +20,6 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-#include
-#endif
-
 #include
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -110,19 +106,8 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
 	unsigned long pgsize_bitmap;
 	int ret;
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-	if (dev->archdata.mapping) {
-		struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
-
-		arm_iommu_release_mapping(mapping);
-		arm_iommu_detach_device(dev);
-
-		if (dev->archdata.dma_coherent)
-			set_dma_ops(dev, &arm_coherent_dma_ops);
-		else
-			set_dma_ops(dev, &arm_dma_ops);
-	}
-#endif
+	/* make sure we can use the IOMMU exclusively */
+	dma_iommu_detach_device(dev);
 
 	if (!tdev->func->iommu_bit)
 		return;
-- 
2.17.0
[PATCH v2 3/5] ARM: dma-mapping: Implement arch_iommu_detach_device()
From: Thierry Reding

Implement this function to enable drivers to detach from any IOMMU domains that architecture code might have attached them to, so that they can take exclusive control of the IOMMU via the IOMMU API.

Signed-off-by: Thierry Reding
---
Changes in v2:
- fix compilation

 arch/arm/include/asm/dma-mapping.h |  3 +++
 arch/arm/mm/dma-mapping-nommu.c    |  4 ++++
 arch/arm/mm/dma-mapping.c          | 17 +++++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 8436f6ade57d..d6d5bd44f962 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 #define arch_teardown_dma_ops arch_teardown_dma_ops
 extern void arch_teardown_dma_ops(struct device *dev);
 
+#define arch_iommu_detach_device arch_iommu_detach_device
+extern void arch_iommu_detach_device(struct device *dev);
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 619f24a42d09..60fef97d8452 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -242,6 +242,10 @@ void arch_teardown_dma_ops(struct device *dev)
 {
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #define PREALLOC_DMA_DEBUG_ENTRIES	4096
 
 static int __init dma_debug_do_init(void)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8c398fedbbb6..cc63a25bd088 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2423,3 +2423,20 @@ void arch_teardown_dma_ops(struct device *dev)
 
 	arm_teardown_iommu_dma_ops(dev);
 }
+
+void arch_iommu_detach_device(struct device *dev)
+{
+#ifdef CONFIG_ARM_DMA_USE_IOMMU
+	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+	const struct dma_map_ops *dma_ops;
+
+	if (!mapping)
+		return;
+
+	arm_iommu_release_mapping(mapping);
+	arm_iommu_detach_device(dev);
+
+	dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
+	set_dma_ops(dev, dma_ops);
+#endif
+}
-- 
2.17.0
[PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
From: Thierry Reding

Depending on the kernel configuration, early ARM architecture setup code may have attached the GPU to a DMA/IOMMU mapping that transparently uses the IOMMU to back the DMA API. Tegra requires special handling for IOMMU backed buffers (a special bit in the GPU's MMU page tables indicates the memory path to take: via the SMMU or directly to the memory controller). Transparently backing DMA memory with an IOMMU prevents Nouveau from properly handling such memory accesses and causes memory access faults.

As a side-note: buffers other than those allocated in instance memory don't need to be physically contiguous from the GPU's perspective since the GPU can map them into contiguous buffers using its own MMU. Mapping these buffers through the IOMMU is unnecessary and will even lead to performance degradation because of the additional translation.

Signed-off-by: Thierry Reding
---
I had already sent this out independently to fix a regression that was introduced in v4.16, but then Christoph pointed out that it should've been sent to a wider audience and should use a core API rather than calling into architecture code directly.

I've added it to this series for easier reference and to show the need for the new API.

 .../drm/nouveau/nvkm/engine/device/tegra.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 78597da6313a..23428a7056e9 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -19,6 +19,11 @@
  * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
  * DEALINGS IN THE SOFTWARE.
  */
+
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+#include
+#endif
+
 #include
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -105,6 +110,20 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
 	unsigned long pgsize_bitmap;
 	int ret;
 
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+	if (dev->archdata.mapping) {
+		struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+
+		arm_iommu_release_mapping(mapping);
+		arm_iommu_detach_device(dev);
+
+		if (dev->archdata.dma_coherent)
+			set_dma_ops(dev, &arm_coherent_dma_ops);
+		else
+			set_dma_ops(dev, &arm_dma_ops);
+	}
+#endif
+
 	if (!tdev->func->iommu_bit)
 		return;
-- 
2.17.0
[PATCH v2 2/5] dma-mapping: Introduce dma_iommu_detach_device() API
From: Thierry Reding

The dma_iommu_detach_device() API can be used by drivers to forcibly detach a device from an IOMMU that architecture code might have attached it to. This is useful for drivers that need explicit control over the IOMMU using the IOMMU API directly.

Signed-off-by: Thierry Reding
---
 drivers/base/dma-mapping.c  | 8 ++++++++
 include/linux/dma-mapping.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index d82566d6e237..18ddf32b10c9 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -366,3 +366,11 @@ void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+void dma_iommu_detach_device(struct device *dev)
+{
+#ifdef arch_iommu_detach_device
+	arch_iommu_detach_device(dev);
+#endif
+}
+EXPORT_SYMBOL(dma_iommu_detach_device);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..732191a2c64e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -671,6 +671,8 @@ static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base,
 static inline void arch_teardown_dma_ops(struct device *dev) { }
 #endif
 
+extern void dma_iommu_detach_device(struct device *dev);
+
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
 {
 	if (dev->dma_parms && dev->dma_parms->max_segment_size)
-- 
2.17.0
[PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU
From: Thierry Reding

The ARM_DMA_USE_IOMMU Kconfig option has side-effects that drivers can not opt into but have to explicitly opt out of. This can lead to subtle bugs that are difficult to track down and not immediately obvious to be related to this Kconfig option.

To avoid this confusion, always enable the option to expose any lurking bugs once and allow any regressions introduced by the DMA/IOMMU code to be caught more quickly in the future.

Note that some drivers still use the Kconfig symbol to provide different code paths depending on what architecture the code runs on (e.g. 32-bit ARM vs. 64-bit ARM which have different and incompatible implementations of the DMA/IOMMU integration code), so leave the symbol in place to keep those drivers working.

For the long term, it is preferable to transition everyone to the generic DMA/IOMMU integration code in drivers/iommu/dma-iommu.c.

Signed-off-by: Thierry Reding
---
Changes in v2:
- new patch

 arch/arm/Kconfig               |  2 +-
 arch/arm/include/asm/device.h  |  6 ------
 arch/arm/mm/dma-mapping.c      | 18 ------------------
 drivers/iommu/Kconfig          |  7 -------
 drivers/media/platform/Kconfig |  1 -
 5 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index fa0b190f8a38..3c91de78535a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -124,7 +124,7 @@ config NEED_SG_DMA_LENGTH
 	bool
 
 config ARM_DMA_USE_IOMMU
-	bool
+	def_bool y
 	select ARM_HAS_SG_CHAIN
 	select NEED_SG_DMA_LENGTH
 
diff --git a/arch/arm/include/asm/device.h b/arch/arm/include/asm/device.h
index 3234fe9bba6e..c3cf38e16866 100644
--- a/arch/arm/include/asm/device.h
+++ b/arch/arm/include/asm/device.h
@@ -13,9 +13,7 @@ struct dev_archdata {
 #ifdef CONFIG_IOMMU_API
 	void *iommu; /* private IOMMU data */
 #endif
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
 	struct dma_iommu_mapping	*mapping;
-#endif
 #ifdef CONFIG_XEN
 	const struct dma_map_ops *dev_dma_ops;
 #endif
@@ -31,10 +29,6 @@ struct pdev_archdata {
 #endif
 };
 
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
 #define to_dma_iommu_mapping(dev) ((dev)->archdata.mapping)
-#else
-#define to_dma_iommu_mapping(dev) NULL
-#endif
 
 #endif
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc63a25bd088..f6c28ed5651a 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1174,8 +1174,6 @@ static int __init dma_debug_do_init(void)
 }
 core_initcall(dma_debug_do_init);
 
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
-
 static int __dma_info_to_prot(enum dma_data_direction dir, unsigned long attrs)
 {
 	int prot = 0;
@@ -2366,20 +2364,6 @@ static void arm_teardown_iommu_dma_ops(struct device *dev)
 	arm_iommu_release_mapping(mapping);
 }
 
-#else
-
-static bool arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-				    const struct iommu_ops *iommu)
-{
-	return false;
-}
-
-static void arm_teardown_iommu_dma_ops(struct device *dev) { }
-
-#define arm_get_iommu_dma_map_ops arm_get_dma_map_ops
-
-#endif /* CONFIG_ARM_DMA_USE_IOMMU */
-
 static const struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
 {
 	return coherent ? &arm_coherent_dma_ops : &arm_dma_ops;
@@ -2426,7 +2410,6 @@ void arch_teardown_dma_ops(struct device *dev)
 
 void arch_iommu_detach_device(struct device *dev)
 {
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
 	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
 	const struct dma_map_ops *dma_ops;
 
@@ -2438,5 +2421,4 @@ void arch_iommu_detach_device(struct device *dev)
 
 	dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
 	set_dma_ops(dev, dma_ops);
-#endif
 }
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index df171cb85822..7f0b3ca76a17 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -226,7 +226,6 @@ config ROCKCHIP_IOMMU
 	depends on ARM || ARM64
 	depends on ARCH_ROCKCHIP || COMPILE_TEST
 	select IOMMU_API
-	select ARM_DMA_USE_IOMMU
 	help
 	  Support for IOMMUs found on Rockchip rk32xx SOCs. These IOMMUs
 	  allow virtualization of the address space used by most
@@ -259,7 +258,6 @@ config EXYNOS_IOMMU
 	depends on ARCH_EXYNOS && MMU
 	depends on !CPU_BIG_ENDIAN # revisit driver if we can enable big-endian ptes
 	select IOMMU_API
-	select ARM_DMA_USE_IOMMU
 	help
 	  Support for the IOMMU (System MMU) of Samsung Exynos application
 	  processor family. This enables H/W multimedia accelerators to see
@@ -283,7 +281,6 @@ config IPMMU_VMSA
 	depends on ARCH_RENESAS || (COMPILE_TEST && !GENERIC_ATOMIC64)
 	select IOMMU_API
 	select IOMMU_IO_PGTABLE_LPAE
-	select ARM_DMA_USE_IOMMU
 	help
 	  Support for the Renesas VMSA-compatible IPMMU Renesas found in the
 	  R-Mobile APE6 and R-Car H2/M2 SoCs.
@@ -304,7 +301,6 @@ config ARM_SMMU
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
> and linux-arm-kernel lists, but wasn't:
> https://www.spinics.net/lists/dri-devel/msg173630.html]
>
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code. Drivers have no business poking
> > > into these details.
> >
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.
>
> The only reason the DMA ops are exported is because get_arch_dma_ops
> references (or in case of the coherent ones used to reference) them. We
> don't let drivers assign random dma ops.
>
> > > Thierry, please resend this with at least the iommu list and
> > > linux-arm-kernel in Cc to have a proper discussion on the right API.
> >
> > I'm certainly open to help with finding a correct solution, but the
> > patch above was purposefully terse because this is something that I
> > hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> > such a backport between ARM and DRM trees does not sound like something
> > that would help getting this fixed in v4.16.
>
> Coordinating the backport of a trivial helper in the arm tree is not
> the end of the world. Really, this cowboy attitude is a good reason
> why graphics folks have such a bad rep. You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> are upset. This shouldn't be surprising.

Not really agreeing on the cowboy thing. The fundamental problem is that the dma api provides abstraction that seriously gets in the way of writing a gpu driver. Some examples:

- We never want bounce buffers, ever. dma_map_sg gives us that, so there's
  hacks to fall back to a cache of pages allocated using
  dma_alloc_coherent if you build a kernel with bounce buffers.

- dma api hides the cache flushing requirements from us. GPUs love
  non-snooped access, and worse give userspace control over that. We want
  a strict separation between mapping stuff and flushing stuff. With the
  IOMMU api we mostly have the former, but for the latter arch maintainers
  regularly tell us they won't allow that. So we have drm_clflush.c.

- dma api hides how/where memory is allocated. Kinda similar problem,
  except now for CMA or address limits. So either we roll our own
  allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
  we use dma_alloc_coherent and then grab the sgt to get at the CMA
  allocations because that's the only way. Which sucks, because we can't
  directly tell CMA how to back off if there's some way to make CMA memory
  available through other means (gpus love to hog all of memory, so we
  have shrinkers and everything).

For display drivers the dma api mostly works, until you start sharing buffers with other devices. So from our perspective it fairly often looks like core folks just don't want to support gpu use-cases, so we play a bit more cowboy and get things done some other way. Since this has been going on for years now, we often don't even bother to start a discussion first, since it tended to go nowhere useful.

-Daniel

> > Granted, this issue could've been caught with a little more testing, but
> > in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> > was just enabled unconditionally if it has side-effects that platforms
> > don't opt in to but have to explicitly opt out of.
>
> Agreed on that count. Please send a patch.

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
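For reference, the drm_clflush.c escape hatch Daniel mentions is used roughly as below. This is a sketch, not code from any specific driver; on x86 drm_clflush_pages() compiles down to clflush/wbinvd, while on other architectures it may be a no-op or unsupported, which is exactly the portability gap under discussion:

        #include <drm/drm_cache.h>

        /* Sketch: explicitly flush the CPU cache for the pages backing a
         * buffer object before the GPU accesses them non-snooped. */
        static void flush_bo_pages(struct page **pages, unsigned long num_pages)
        {
                drm_clflush_pages(pages, num_pages);
        }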
Re: nouveau vs arm dma ops
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
> and linux-arm-kernel lists, but wasn't:
> https://www.spinics.net/lists/dri-devel/msg173630.html]
>
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code. Drivers have no business poking
> > > into these details.
> >
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.

EXPORT_SYMBOL* means nothing as far as whether a driver should be able to use the symbol or not - it merely means that the symbol is made available to a module. Modules cover much more than just device drivers - we have library modules, filesystem modules, and helper modules, to name a few non-driver classes of modules.

We also have symbols that are exported as part of the architecture implementation detail of a public interface. For example, the public interface "copy_from_user" is implemented as an inline function (actually several layers of inline functions) eventually calling into arm_copy_from_user(). arm_copy_from_user() is exported, but drivers (in fact, no module) are allowed to make direct reference to arm_copy_from_user() - it'd fail when software PAN is enabled.

The whole idea that "if a symbol is exported, it's fine for a driver to use it" is a complete work of fiction, always has been, and always will be. We've had this with the architecture implementation details of the DMA API before, and with the architecture implementation details of CPU cache flushing. There's only so much commentary, or so many __ prefixes, you can add to a symbol before things get ridiculous, and none of it stops people creating this abuse. The only thing that seems to prevent it is to make life hard for folk wanting to use the symbols (eg, hiding the symbol prototype in a private header, etc.)

Never, ever go under the covers of an interface. If the interface doesn't do what you want, _discuss_ it; don't just think "oh, that architecture private facility looks like what I need, I'll use that directly." If you are ever on the side of trying to maintain those implementation details that are abused in this way, you'll soon understand why this behaviour by driver authors is so annoying, and the maintainability problems it creates.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
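Russell's copy_from_user example looks roughly like this in the tree (paraphrased, not verbatim, from arch/arm/include/asm/uaccess.h of that era): the exported arm_copy_from_user() is only meant to be reached through the uaccess_save_and_enable()/uaccess_restore() bracket that the inline wrapper provides - exactly what a module calling it directly would bypass:

        /* exported, but an implementation detail of the interface */
        extern unsigned long __must_check
        arm_copy_from_user(void *to, const void __user *from, unsigned long n);

        static inline unsigned long __must_check
        raw_copy_from_user(void *to, const void __user *from, unsigned long n)
        {
                unsigned int __ua_flags;

                /* temporarily open the user window (software PAN) */
                __ua_flags = uaccess_save_and_enable();
                n = arm_copy_from_user(to, from, n);
                uaccess_restore(__ua_flags);
                return n;
        }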
Re: [PATCH 3/4] ARM: dma-mapping: Implement arch_iommu_detach_device()
On Wed, Apr 25, 2018 at 11:18:14AM +0200, Thierry Reding wrote:
[...]
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 8c398fedbbb6..1957938d8c9c 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -2366,6 +2366,21 @@ static void arm_teardown_iommu_dma_ops(struct device *dev)
>  	arm_iommu_release_mapping(mapping);
>  }
>
> +void arch_iommu_detach_device(struct device *dev)
> +{
> +	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
> +	const struct dma_map_ops *dma_ops;
> +
> +	if (!mapping)
> +		return;
> +
> +	arm_iommu_release_mapping(mapping);
> +	arm_iommu_detach_device(dev);
> +
> +	dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
> +	set_dma_ops(dev, dma_ops);
> +}

Oh, the irony! This doesn't actually build, and I failed to notice because I forgot to enable ARM_DMA_USE_IOMMU... Sorry about that, but I think we can use this as a basis to discuss whether the API is good, and I'll make sure to properly test the next revision before sending it out.

Thierry
[PATCH 3/4] ARM: dma-mapping: Implement arch_iommu_detach_device()
From: Thierry Reding

Implement this function to enable drivers to detach from any IOMMU domains that architecture code might have attached them to, so that they can take exclusive control of the IOMMU via the IOMMU API.

Signed-off-by: Thierry Reding
---
 arch/arm/include/asm/dma-mapping.h |  3 +++
 arch/arm/mm/dma-mapping-nommu.c    |  4 ++++
 arch/arm/mm/dma-mapping.c          | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 8436f6ade57d..d6d5bd44f962 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 #define arch_teardown_dma_ops arch_teardown_dma_ops
 extern void arch_teardown_dma_ops(struct device *dev);
 
+#define arch_iommu_detach_device arch_iommu_detach_device
+extern void arch_iommu_detach_device(struct device *dev);
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 619f24a42d09..60fef97d8452 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -242,6 +242,10 @@ void arch_teardown_dma_ops(struct device *dev)
 {
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #define PREALLOC_DMA_DEBUG_ENTRIES	4096
 
 static int __init dma_debug_do_init(void)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8c398fedbbb6..1957938d8c9c 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2366,6 +2366,21 @@ static void arm_teardown_iommu_dma_ops(struct device *dev)
 	arm_iommu_release_mapping(mapping);
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+	const struct dma_map_ops *dma_ops;
+
+	if (!mapping)
+		return;
+
+	arm_iommu_release_mapping(mapping);
+	arm_iommu_detach_device(dev);
+
+	dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
+	set_dma_ops(dev, dma_ops);
+}
+
 #else
 
 static bool arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
@@ -2378,6 +2393,10 @@ static void arm_teardown_iommu_dma_ops(struct device *dev) { }
 
 #define arm_get_iommu_dma_map_ops arm_get_dma_map_ops
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #endif /* CONFIG_ARM_DMA_USE_IOMMU */
 
 static const struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
-- 
2.17.0
[PATCH 2/4] dma-mapping: Introduce dma_iommu_detach_device() API
From: Thierry Reding

The dma_iommu_detach_device() API can be used by drivers to forcibly detach a device from an IOMMU that architecture code might have attached it to. This is useful for drivers that need explicit control over the IOMMU using the IOMMU API directly.

Signed-off-by: Thierry Reding
---
 drivers/base/dma-mapping.c  | 8 ++++++++
 include/linux/dma-mapping.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index d82566d6e237..18ddf32b10c9 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -366,3 +366,11 @@ void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+void dma_iommu_detach_device(struct device *dev)
+{
+#ifdef arch_iommu_detach_device
+	arch_iommu_detach_device(dev);
+#endif
+}
+EXPORT_SYMBOL(dma_iommu_detach_device);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..732191a2c64e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -671,6 +671,8 @@ static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base,
 static inline void arch_teardown_dma_ops(struct device *dev) { }
 #endif
 
+extern void dma_iommu_detach_device(struct device *dev);
+
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
 {
 	if (dev->dma_parms && dev->dma_parms->max_segment_size)
-- 
2.17.0
[PATCH 4/4] drm/nouveau: tegra: Use dma_iommu_detach_device()
From: Thierry Reding

Use the new dma_iommu_detach_device() function to replace the
open-coded equivalent.

Signed-off-by: Thierry Reding
---
 .../drm/nouveau/nvkm/engine/device/tegra.c    | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 23428a7056e9..c0a7f3839cbb 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -20,10 +20,6 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-#include <asm/dma-iommu.h>
-#endif
-
 #include <core/tegra.h>
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -110,19 +106,8 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
 	unsigned long pgsize_bitmap;
 	int ret;
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-	if (dev->archdata.mapping) {
-		struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
-
-		arm_iommu_release_mapping(mapping);
-		arm_iommu_detach_device(dev);
-
-		if (dev->archdata.dma_coherent)
-			set_dma_ops(dev, &arm_coherent_dma_ops);
-		else
-			set_dma_ops(dev, &arm_dma_ops);
-	}
-#endif
+	/* make sure we can use the IOMMU exclusively */
+	dma_iommu_detach_device(dev);
 
 	if (!tdev->func->iommu_bit)
 		return;
-- 
2.17.0
[PATCH 1/4] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
From: Thierry Reding

Depending on the kernel configuration, early ARM architecture setup code
may have attached the GPU to a DMA/IOMMU mapping that transparently uses
the IOMMU to back the DMA API. Tegra requires special handling for IOMMU
backed buffers (a special bit in the GPU's MMU page tables indicates the
memory path to take: via the SMMU or directly to the memory controller).
Transparently backing DMA memory with an IOMMU prevents Nouveau from
properly handling such memory accesses and causes memory access faults.

As a side note: buffers other than those allocated in instance memory
don't need to be physically contiguous from the GPU's perspective, since
the GPU can map them into contiguous buffers using its own MMU. Mapping
these buffers through the IOMMU is unnecessary and will even lead to
performance degradation because of the additional translation.

Signed-off-by: Thierry Reding
---
I had already sent this out independently to fix a regression that was
introduced in v4.16, but then Christoph pointed out that it should've
been sent to a wider audience and should use a core API rather than
calling into architecture code directly. I've added it to this series
for easier reference and to show the need for the new API.

 .../drm/nouveau/nvkm/engine/device/tegra.c    | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 78597da6313a..23428a7056e9 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -19,6 +19,11 @@
  * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
  * DEALINGS IN THE SOFTWARE.
  */
+
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+#include <asm/dma-iommu.h>
+#endif
+
 #include <core/tegra.h>
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -105,6 +110,20 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
 	unsigned long pgsize_bitmap;
 	int ret;
 
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+	if (dev->archdata.mapping) {
+		struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+
+		arm_iommu_release_mapping(mapping);
+		arm_iommu_detach_device(dev);
+
+		if (dev->archdata.dma_coherent)
+			set_dma_ops(dev, &arm_coherent_dma_ops);
+		else
+			set_dma_ops(dev, &arm_dma_ops);
+	}
+#endif
+
 	if (!tdev->func->iommu_bit)
 		return;
-- 
2.17.0
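To illustrate the "special bit" mentioned in the commit message: the driver
tags GPU addresses whose accesses must go through the SMMU by setting a
platform-defined high bit in the GPU virtual address. A rough sketch, assuming
an IOMMU-mapped address iova and the iommu_bit from the Tegra function
description; the exact tagging site in nouveau may differ:

    /*
     * Sketch: route an access through the SMMU by setting the platform's
     * IOMMU bit in the GPU virtual address; with the bit clear, the
     * access goes directly to the memory controller.
     * 'iova' and 'tdev' are assumed from the surrounding driver code.
     */
    u64 gpu_addr = iova | (1ULL << tdev->func->iommu_bit);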
noveau vs arm dma ops
[discussion about this patch, which should have been cced to the iommu
and linux-arm-kernel lists, but wasn't:
https://www.spinics.net/lists/dri-devel/msg173630.html]

On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > API from the iommu/dma-mapping code.  Drivers have no business poking
> > into these details.
> 
> The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> which is rather misleading if they are not meant to be used by drivers
> directly.

The only reason the DMA ops are exported is that get_arch_dma_ops
references them (or, in the case of the coherent ones, used to). We
don't let drivers assign random DMA ops.

> > Thierry, please resend this with at least the iommu list and
> > linux-arm-kernel in Cc to have a proper discussion on the right API.
> 
> I'm certainly open to help with finding a correct solution, but the
> patch above was purposefully terse because this is something that I
> hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> such a backport between ARM and DRM trees does not sound like something
> that would help getting this fixed in v4.16.

Coordinating the backport of a trivial helper in the arm tree is not
the end of the world. Really, this cowboy attitude is a good reason why
graphics folks have such a bad rep. You keep poking into random kernel
internals, don't talk to anyone, and then complain when people are
upset. This shouldn't be surprising.

> Granted, this issue could've been caught with a little more testing, but
> in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> was just enabled unconditionally if it has side-effects that platforms
> don't opt in to but have to explicitly opt out of.

Agreed on that count. Please send a patch.
Re: [GIT PULL] dma mapping fixes for 4.17-rc3
On Tue, 24 Apr 2018, Christoph Hellwig wrote:

> On Tue, Apr 24, 2018 at 11:54:26PM -0700, David Rientjes wrote:
> > Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be
> > more accurately <=?
> 
> No, it should really be <.  The exactly 32-bit case is already covered
> with GFP_DMA32.  Eventually it should be < 24-bit with a separate
> GFP_DMA32 case for > 32-bit && < 64-bit.  Takashi has been working on
> that, but it is 4.18 material.

Ack, thanks.
Re: [GIT PULL] dma mapping fixes for 4.17-rc3
On Wed, 25 Apr 2018, Christoph Hellwig wrote:

> The following changes since commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e:
> 
>   Linux 4.17-rc2 (2018-04-22 19:20:09 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-4.17-3
> 
> for you to fetch changes up to 60695be2bb6b0623f8e53bd9949d582a83c6d44a:
> 
>   dma-mapping: postpone cpu addr translation on mmap (2018-04-23 14:44:24 +0200)
> 
> A few small dma-mapping fixes for Linux 4.17-rc3:
> 
>  - don't loop to try GFP_DMA allocations if ZONE_DMA is not actually
>    enabled (regression in 4.16)

Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be more
accurately <=?
Re: [PATCH 2/5] ide: kill ide_toggle_bounce
On 4/24/18 12:16 PM, Christoph Hellwig wrote:
> ide_toggle_bounce did select various strange block bounce limits,
> including not bouncing at all as soon as an iommu is present in the
> system. Given that the dma_map routines now handle any required bounce
> buffering except for ISA DMA, and the ide code already must handle
> either ISA DMA or highmem at least for iommu equipped systems, we can
> get rid of the block layer bounce limit setting entirely.

Pretty sure I was the one to add this code, when highmem page IO was
enabled about two decades ago... Outside of DMA, the issue was that the
PIO code could not handle highmem. That's not the case anymore, so this
should be fine.

Reviewed-by: Jens Axboe

-- 
Jens Axboe
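For context, the removed helper chose the bounce limit roughly as follows.
This is a sketch reconstructed from the commit description, not the verbatim
source; the "strange" behavior is the BLK_BOUNCE_ANY branch, which disables
bouncing entirely whenever an IOMMU makes the DMA bus non-physical:

    #include <linux/blkdev.h>
    #include <linux/ide.h>
    
    /* Sketch of the removed logic: never bounce when an IOMMU is present,
     * otherwise bounce pages the device cannot address (highmem by default). */
    static void ide_toggle_bounce(ide_drive_t *drive, int on)
    {
    	u64 addr = BLK_BOUNCE_HIGH;	/* default: bounce highmem pages */
    
    	if (!PCI_DMA_BUS_IS_PHYS) {
    		addr = BLK_BOUNCE_ANY;	/* iommu present: no bouncing at all */
    	} else if (on && drive->media == ide_disk) {
    		struct device *dev = drive->hwif->dev;
    
    		if (dev && dev->dma_mask)
    			addr = *dev->dma_mask;
    	}
    
    	if (drive->queue)
    		blk_queue_bounce_limit(drive->queue, addr);
    }

As the commit message notes, the dma_map routines now perform any required
bounce buffering themselves, so this per-queue limit is redundant.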
Re: [GIT PULL] dma mapping fixes for 4.17-rc3
On Tue, Apr 24, 2018 at 11:54:26PM -0700, David Rientjes wrote:
> Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be more
> accurately <=?

No, it should really be <. The exactly 32-bit case is already covered
with GFP_DMA32. Eventually it should be < 24-bit with a separate
GFP_DMA32 case for > 32-bit && < 64-bit. Takashi has been working on
that, but it is 4.18 material.
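To spell out the comparison being discussed, here is a sketch of the zone
selection for coherent allocations. dma_alloc_gfp_flags() is a hypothetical
helper name used only for illustration; the real logic lives in the
dma-direct allocator:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    
    /* Sketch (hypothetical helper): pick a GFP zone from the device's
     * coherent DMA mask. A mask of exactly 32 bits falls through to
     * GFP_DMA32, which is why the first test uses '<' rather than '<=',
     * and GFP_DMA is only requested when ZONE_DMA actually exists. */
    static gfp_t dma_alloc_gfp_flags(struct device *dev)
    {
    	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
    	    dev->coherent_dma_mask < DMA_BIT_MASK(32))
    		return GFP_DMA;
    	if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
    	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
    		return GFP_DMA32;
    	return 0;	/* 64-bit capable: any zone will do */
    }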