Re: [PATCH 5/6 v2] bus: fsl-mc: support dma configure for devices on fsl-mc bus

2018-04-25 Thread kbuild test robot
Hi Nipun,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc2 next-20180424]
[cannot apply to iommu/next glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Nipun-Gupta/Support-for-fsl-mc-bus-and-its-devices-in-SMMU/20180418-034931
config: powerpc64-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc64 

All errors (new ones prefixed by >>):

   drivers/bus/fsl-mc/fsl-mc-bus.c: In function 'fsl_mc_dma_configure':
>> drivers/bus/fsl-mc/fsl-mc-bus.c:137:9: error: too many arguments to function 
>> 'of_dma_configure'
 return of_dma_configure(dev, dma_dev->of_node, 0);
^~~~
   In file included from drivers/bus/fsl-mc/fsl-mc-bus.c:13:0:
   include/linux/of_device.h:58:5: note: declared here
int of_dma_configure(struct device *dev, struct device_node *np);
^~~~
   drivers/bus/fsl-mc/fsl-mc-bus.c: At top level:
>> drivers/bus/fsl-mc/fsl-mc-bus.c:161:3: error: 'struct bus_type' has no 
>> member named 'dma_configure'
 .dma_configure  = fsl_mc_dma_configure,
  ^

vim +/of_dma_configure +137 drivers/bus/fsl-mc/fsl-mc-bus.c

   129  
   130  static int fsl_mc_dma_configure(struct device *dev)
   131  {
   132  struct device *dma_dev = dev;
   133  
   134  while (dev_is_fsl_mc(dma_dev))
   135  dma_dev = dma_dev->parent;
   136  
 > 137  return of_dma_configure(dev, dma_dev->of_node, 0);
   138  }
   139  
   140  static ssize_t modalias_show(struct device *dev, struct 
device_attribute *attr,
   141   char *buf)
   142  {
   143  struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
   144  
   145  return sprintf(buf, "fsl-mc:v%08Xd%s\n", 
mc_dev->obj_desc.vendor,
   146 mc_dev->obj_desc.type);
   147  }
   148  static DEVICE_ATTR_RO(modalias);
   149  
   150  static struct attribute *fsl_mc_dev_attrs[] = {
   151  &dev_attr_modalias.attr,
   152  NULL,
   153  };
   154  
   155  ATTRIBUTE_GROUPS(fsl_mc_dev);
   156  
   157  struct bus_type fsl_mc_bus_type = {
   158  .name = "fsl-mc",
   159  .match = fsl_mc_bus_match,
   160  .uevent = fsl_mc_bus_uevent,
 > 161  .dma_configure  = fsl_mc_dma_configure,
   162  .dev_groups = fsl_mc_dev_groups,
   163  };
   164  EXPORT_SYMBOL_GPL(fsl_mc_bus_type);
   165  
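
For reference, the fix the error implies is to match the two-argument
of_dma_configure() prototype quoted above from include/linux/of_device.h.
A minimal sketch of the adjusted function (an illustration, not the
author's actual respin):

static int fsl_mc_dma_configure(struct device *dev)
{
        struct device *dma_dev = dev;

        /* walk up to the first ancestor that is not on the fsl-mc bus */
        while (dev_is_fsl_mc(dma_dev))
                dma_dev = dma_dev->parent;

        /* two-argument form, as declared in include/linux/of_device.h */
        return of_dma_configure(dev, dma_dev->of_node);
}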

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: nouveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> problems similar to tegra's), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

I think you completely misunderstand ARM from what you've written above,
and this worries me greatly about giving DRM the level of control that
is being asked for.

Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
attributes are stored in the page tables.  These caches are inherently
non-aliasing when there are multiple mappings (which is a great step
forward compared to the previous aliasing caches.)

As the cache attributes are stored in the page tables, this in theory
allows different virtual mappings of the same physical memory to have
different cache attributes.  However, there's a problem, and that's
called speculative prefetching.

Let's say you have one mapping which is cacheable, and another that is
marked as write combining.  If a cache line is speculatively prefetched
through the cacheable mapping of this memory, and then you read the
same physical location through the write combining mapping, it is
possible that you could read cached data.

So, it is generally accepted that all mappings of any particular
physical bit of memory should have the same cache attributes to avoid
unpredictable behaviour.

This presents a problem with what is generally called "lowmem" where
the memory is mapped in kernel virtual space with cacheable
attributes.  It can also happen with highmem if the memory is
kmapped.

This is why, on ARM, you can't use something like get_free_pages() to
grab some pages from the system, pass it to the GPU, map it into
userspace as write-combining, etc.  It _might_ work for some CPUs,
but ARM CPUs vary in how much prefetching they do, and what may work
for one particular CPU is in no way guaranteed to work for another
ARM CPU.

The official line from architecture folk is to assume that the caches
infinitely speculate, are of infinite size, and can writeback *dirty*
data at any moment.

The way to stop things like speculative prefetches to particular
physical memory is to, quite "simply", not have any cacheable
mappings of that physical memory anywhere in the system.

Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
If you have, say, an 8MB buffer (for a 1080p frame) and you need to
do a cache operation on that buffer, you'll be iterating over it
32 or maybe 64 bytes at a time "just in case" there's a cache line
present.  Referring to my previous email, where I detailed the
potential need for _two_ flushes, one before the GPU operation and
one after, and this becomes _really_ expensive.  At that point, you're
probably way better off using write-combine memory where you don't
need to spend CPU cycles performing cache flushing - potentially
across all CPUs in the system if cache operations aren't broadcasted.
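
To illustrate that tradeoff, here is a minimal sketch of allocating
write-combine memory through the DMA API, which avoids both the
conflicting-attribute alias and the per-buffer cache maintenance
described above (assuming dma_alloc_wc() is usable for the device;
error handling trimmed):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Allocate a frame buffer as write-combine memory: no cacheable alias
 * is created for it, so no cache flushes are needed around GPU access. */
static void *alloc_wc_frame(struct device *dev, size_t size,
                            dma_addr_t *iova)
{
        return dma_alloc_wc(dev, size, iova, GFP_KERNEL);
}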

This isn't a simple matter of "just provide some APIs for cache
operations" - there's much more that needs to be understood by
all parties here, especially when we have GPU drivers that can be
used with quite different CPUs.

It may well be that for some combinations of CPUs and workloads, it's
better to use write-combine memory without cache flushing, but for
other CPUs that tradeoff (for the same workload) could well be
different.

Older ARMs get more interesting, because they have aliasing caches.
That means the CPU cache aliases across different virtual space
mappings in some way, which complicates (a) the mapping of memory
and (b) handling the cache operations on it.

It's too late for me to go into that tonight, and I probably won't
be reading mail for the next week and a half, sorry.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: nouveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > - dma api hides the cache flushing requirements from us. GPUs love
> >   non-snooped access, and worse give userspace control over that. We want
> >   a strict separation between mapping stuff and flushing stuff. With the
> >   IOMMU api we mostly have the former, but for the latter, arch maintainers
> >   regularly tell us they won't allow that. So we have drm_clflush.c.
> 
> The problem is that a cache flushing API entirely separate is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

And if folk want a cacheable mapping with explicit cache flushing, the
cache flushing must not be defined in terms of "this is what CPU seems
to need" but from the point of view of a CPU with infinite prefetching,
infinite caching and infinite capacity to perform writebacks of dirty
cache lines at unexpected moments when the memory is mapped in a
cacheable mapping.

(The reason for that is you're operating in a non-CPU specific space,
so you can't make any guarantees as to how much caching or prefetching
will occur by the CPU - different CPUs will do different amounts.)

So, for example, the sequence:

GPU writes to memory
CPU reads from cacheable memory

if the memory was previously dirty (iow, CPU has written), you need to
flush the dirty cache lines _before_ the GPU writes happen, but you
don't know whether the CPU has speculatively prefetched, so you need
to flush any prefetched cache lines before reading from the cacheable
memory _after_ the GPU has finished writing.

Also note that "flush" there can be "clean the cache", "clean and
invalidate the cache" or "invalidate the cache" as appropriate - some
CPUs are able to perform those three operations, and the appropriate
one depends on not only where in the above sequence it's being used,
but also on what the operations are.

So, the above sequence could be:

CPU invalidates cache for memory
(due to possible dirty cache lines)
GPU writes to memory
CPU invalidates cache for memory
(to get rid of any speculatively prefetched
 lines)
CPU reads from cacheable memory

Yes, in the above case, _two_ cache operations are required to ensure
correct behaviour.  However, if you know for certain that the memory was
previously clean, then the first cache operation can be skipped.
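
For concreteness, a sketch of how the DMA API's streaming ownership
calls express exactly this pair of operations for a buffer the device
writes and the CPU then reads. This is illustrative only and assumes
the buffer is already mapped with dma_map_single():

#include <linux/dma-mapping.h>

static void device_writes_then_cpu_reads(struct device *dev, void *buf,
                                         size_t size, dma_addr_t handle)
{
        /* first cache operation: deal with possibly dirty lines before
         * the device starts writing */
        dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);

        /* ... device (GPU) writes to the buffer here ... */

        /* second cache operation: discard lines the CPU speculatively
         * prefetched while the device owned the buffer */
        dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);

        /* the CPU may now safely read buf */
}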

What I'm pointing out is there's much more than just "I want to flush
the cache" here, which is currently what DRM seems to assume at the
moment with the code in drm_cache.c.

If we can agree a set of interfaces that allows _proper_ use of these
facilities, one which can be used appropriately, then there shouldn't
be a problem.  The DMA API does that via it's ideas about who owns a
particular buffer (because of the above problem) and that's something
which would need to be carried over to such a cache flushing API (it
should be pretty obvious that having a GPU read or write memory while
the cache for that memory is being cleaned will lead to unexpected
results.)

Also note that things get even more interesting in a SMP environment
if cache operations aren't broadcasted across the SMP cluster (which
means cache operations have to be IPI'd to other CPUs.)

The next issue, which I've brought up before, is what comes with
exposing cache flushing to userspace on architectures where it isn't
already exposed.  As has been shown by Google Project Zero, this risks
exposing those architectures to Spectre and Meltdown exploits where they
weren't at such a risk before.  (I've pretty much shown here that you _do_
need to control which cache lines get flushed to make these exploits
work, and flushing the cache by reading lots of data in lieu of having
the ability to explicitly flush bits of cache makes it very difficult
to impossible for them to work.)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: nouveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 5:33 PM, Christoph Hellwig  wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > Coordinating the backport of a trivial helper in the arm tree is not
>> > the end of the world.  Really, this cowboy attitude is a good reason
>> > why graphics folks have such a bad rep.  You keep poking into random
> >> kernel internals, don't talk to anyone and then complain if people
>> > are upset.  This shouldn't be surprising.
>>
>> Not really agreeing on the cowboy thing. The fundamental problem is that
>> the dma api provides abstraction that seriously gets in the way of writing
>> a gpu driver. Some examples:
>
> So talk to other people.  Maybe people share your frustration.  Or maybe
> other people have a way to help.
>
>> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>>   hacks to fall back to a cache of pages allocated using
>>   dma_alloc_coherent if you build a kernel with bounce buffers.
>
> get_required_mask() is supposed to tell you if you are safe.  However
> we are missing lots of implementations of it for iommus so you might get
> some false negatives, improvements welcome.  It's been on my list of
> things to fix in the DMA API, but it is nowhere near the top.

It hasn't come up in a while in some fireworks, so I honestly don't
remember exactly what the issues have been. But

commit d766ef53006c2c38a7fe2bef0904105a793383f2
Author: Chris Wilson 
Date:   Mon Dec 19 12:43:45 2016 +

drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping

and the various bits of code that a

$ git grep SWIOTLB -- drivers/gpu

turns up is what we're doing to hack around that stuff. And in general
(there are some exceptions) gpus should be able to address everything,
so I never fully understood where that's even coming from.

>> - dma api hides the cache flushing requirements from us. GPUs love
>>   non-snooped access, and worse give userspace control over that. We want
>>   a strict separation between mapping stuff and flushing stuff. With the
>>   IOMMU api we mostly have the former, but for the latter, arch maintainers
>>   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that a cache flushing API entirely separate is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

I'm assuming this stuff here?

https://lkml.org/lkml/2018/4/20/146

Anyway, I got a bit lost in all that work - it looks really nice.

>> - dma api hides how/where memory is allocated. Kinda similar problem,
>>   except now for CMA or address limits. So either we roll our own
>>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>>   allocations because that's the only way. Which sucks, because we can't
>>   directly tell CMA how to back off if there's some way to make CMA memory
>>   available through other means (gpus love to hog all of memory, so we
>>   have shrinkers and everything).
>
> If you really care about doing explicitly cache flushing anyway (see
> above) allocating your own memory and mapping it where needed is by
> far the superior solution.  On cache coherent architectures
> dma_alloc_coherent is nothing but allocate memory + dma_map_single.
> On non coherent allocations the memory might come through a special
> pool or must be used through a special virtual address mapping that
> is set up either statically or dynamically.  For that case splitting
> allocation and mapping is a good idea in many ways, and I plan to move
> towards that once the number of dma mapping implementations is down
> to a reasonable number so that it can actually be done.

Yeah the above is pretty much what we do on x86. dma-api believes
everything is coherent, so dma_map_sg does the mapping we want and
nothing else (minus swiotlb fun). Cache flushing, allocations, all
done by the driver.

On arm that doesn't work. The iommu api seems like a good fit, except
the dma-api tends to get in the way a bit (drm/msm apparently has
problems similar to tegra's), and if you need contiguous memory
dma_alloc_coherent is the only way to get at contiguous memory. There
was a huge discussion years ago about that, and direct cma access was
shot down because it would have exposed too much of the caching
attribute mangling required (most arm platforms need wc-pages to not
be in the kernel's linear map apparently).

Anything that separate these 3 things more (allocation pools, mapping
through IOMMUs and flushing cpu caches) sounds like the right
direction to me. Even if that throws some portability across platforms
away - drivers who want to control things in this much detail aren't
really portable (without some serious work) anyway.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [PATCH v4 14/22] iommu: handle page response timeout

2018-04-25 Thread Jacob Pan
On Mon, 23 Apr 2018 16:36:23 +0100
Jean-Philippe Brucker  wrote:

> On Mon, Apr 16, 2018 at 10:49:03PM +0100, Jacob Pan wrote:
> > When IO page faults are reported outside the IOMMU subsystem, the page
> > request handler may fail for various reasons. E.g. a guest received
> > page requests but did not have a chance to run for a long time. The
> > unresponsive behavior could tie up limited resources on the
> > pending device.
> > There can be hardware or credit-based software solutions, as
> > suggested in PCI ATS Ch-4. To provide a basic safety net, this
> > patch introduces a per-device deferrable timer which monitors the
> > longest pending page fault that requires a response. Proper action,
> > such as sending a failure response code, could be taken when the timer
> > expires, but that is not included in this patch. We need to consider the
> > life cycle of the page group ID to prevent confusion with group IDs
> > reused by a device. For now, a warning message provides a clue to such
> > failures.
> > 
> > Signed-off-by: Jacob Pan 
> > Signed-off-by: Ashok Raj 
> > ---
> >  drivers/iommu/iommu.c | 60 ++++++++++++++++++++++++++++++++++++++++++--
> >  include/linux/iommu.h |  4 ++++
> >  2 files changed, 62 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 628346c..f6512692 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -799,6 +799,39 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group, }
> >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> >  
> > +/* Max time to wait for a pending page request */
> > +#define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)
> > +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> > +{
> > +   struct iommu_fault_param *fparam = from_timer(fparam, t,
> > timer);
> > +   struct iommu_fault_event *evt, *iter;
> > +
> > +   u64 now;
> > +
> > +   now = get_jiffies_64();
> > +
> > +   /* The goal is to ensure driver or guest page fault
> > handler(via vfio)
> > +* send page response on time. Otherwise, limited queue
> > resources
> > +* may be occupied by some irresponsive guests or
> > drivers.  
> 
> By "limited queue resources", do you mean the PRI fault queue in the
> pIOMMU device, or something else?
> 
I am referring to the device resource for tracking pending PRQs. Intel
IOMMU does not track pending PRQs.
> 
> I'm still uneasy about this timeout. We don't really know if the guest
> doesn't respond because it is suspended, because it doesn't support
> PRI or because it's attempting to kill the host. In the first case,
> then receiving and responding to page requests later than 10s should
> be fine, right?
> 
When a guest is going into system suspend, the suspend callbacks of the
assigned device driver and the vIOMMU should be called. I think the vIOMMU
should propagate the iommu_suspend call to the host IOMMU driver and
thereby terminate all the pending PRQs. We can make the timeout adjustable.

> Or maybe the guest is doing something weird like fetching pages from
> network storage and it occasionally hits a latency oddity. This
> wouldn't interrupt the fault queues, because other page requests for
> the same device can be serviced in parallel, but if you implement a
> PRG timeout it would still unfairly disable PRI.
> 
The timeout here is intended as a broad, basic safety net at the
per-device level. We could implement a finer-grained safety mechanism,
but I am guessing that is better done in HW.
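
As a rough illustration of the mechanism under discussion, a sketch of
arming such a per-device deferrable timer. The struct is simplified,
the helpers are hypothetical, and only the timeout constant is taken
from the quoted patch:

#include <linux/timer.h>
#include <linux/jiffies.h>

#define IOMMU_PAGE_RESPONSE_MAXTIME     (HZ * 10)

struct iommu_fault_param {
        struct timer_list timer;  /* simplified; the real struct has more state */
};

static void iommu_dev_fault_timer_fn(struct timer_list *t)
{
        /* expiry means the longest-pending fault went unanswered */
}

static void iommu_fault_param_init(struct iommu_fault_param *fparam)
{
        /* deferrable: an idle CPU is not woken just to run this check */
        timer_setup(&fparam->timer, iommu_dev_fault_timer_fn,
                    TIMER_DEFERRABLE);
}

static void iommu_arm_fault_timer(struct iommu_fault_param *fparam)
{
        mod_timer(&fparam->timer, jiffies + IOMMU_PAGE_RESPONSE_MAXTIME);
}
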
> In the other cases (unsupported PRI or rogue guest) then disabling PRI
> using a FAILURE status might be the right thing to do. However,
> assuming the device follows the PCI spec it will stop sending page
> requests once there are as many PPRs in flight as the allocated
> credit.
> 
Agreed; here I am not taking any action. There may be a need to drain
in-flight requests.
> Even though drivers set the PPR credit number arbitrarily (because
> finding an ideal number is difficult or impossible), the device stops
> issuing faults at some point if the guest is unresponsive, and it
> won't grab any more shared resources, or use slots in shared queues.
> Resources for pending faults can be cleaned when the device is reset
> and assigned to a different guest.
> 
> 
> That's for sane endpoints that follow the spec. If on the other hand,
> we can't rely on the device implementation to respect our maximum
> credit allocation, then we should do the accounting ourselves and
> reject incoming faults with INVALID as fast as possible. Otherwise
> it's an easy way for a guest to DoS the host and I don't think a
> timeout solves this problem (The guest can wait 9 seconds before
> replying to faults and meanwhile fill all the queues). In addition
> the timeout is done on PRGs but not individual page faults, so a
> guest could overflow the queues by triggering lots of page requests
> without setting the last bit.
> 
> 
> If there isn't any possibility of memory leak or abusing resources, I
> don't think it

Re: nouveau vs arm dma ops

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > Coordinating the backport of a trivial helper in the arm tree is not
> > the end of the world.  Really, this cowboy attitude is a good reason
> > why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> > are upset.  This shouldn't be surprising.
> 
> Not really agreeing on the cowboy thing. The fundamental problem is that
> the dma api provides abstraction that seriously gets in the way of writing
> a gpu driver. Some examples:

So talk to other people.  Maybe people share your frustration.  Or maybe
other people have a way to help.

> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>   hacks to fall back to a cache of pages allocated using
>   dma_alloc_coherent if you build a kernel with bounce buffers.

get_required_mask() is supposed to tell you if you are safe.  However
we are missing lots of implementations of it for iommus so you might get
some false negatives, improvements welcome.  It's been on my list of
things to fix in the DMA API, but it is nowhere near the top.
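
A sketch of the kind of check this enables, assuming the relevant
get_required_mask() implementations are present; the helper name is
made up for illustration:

#include <linux/dma-mapping.h>

/* True if the device's DMA mask covers everything the platform can
 * hand us, i.e. dma_map_sg() should never need to bounce-buffer. */
static bool dev_dma_never_bounces(struct device *dev)
{
        u64 required = dma_get_required_mask(dev);

        return (dma_get_mask(dev) & required) == required;
}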

> - dma api hides the cache flushing requirements from us. GPUs love
>   non-snooped access, and worse give userspace control over that. We want
>   a strict separation between mapping stuff and flushing stuff. With the
>   IOMMU api we mostly have the former, but for the latter, arch maintainers
>   regularly tell us they won't allow that. So we have drm_clflush.c.

The problem is that a cache flushing API entirely separate is hard. That
being said if you look at my generic dma-noncoherent API series it tries
to move that way.  So far it is in early stages and apparently rather
buggy unfortunately.

> - dma api hides how/where memory is allocated. Kinda similar problem,
>   except now for CMA or address limits. So either we roll our own
>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>   allocations because that's the only way. Which sucks, because we can't
>   directly tell CMA how to back off if there's some way to make CMA memory
>   available through other means (gpus love to hog all of memory, so we
>   have shrinkers and everything).

If you really care about doing explicitly cache flushing anyway (see
above) allocating your own memory and mapping it where needed is by
far the superior solution.  On cache coherent architectures
dma_alloc_coherent is nothing but allocate memory + dma_map_single.
On non coherent allocations the memory might come through a special
pool or must be used through a special virtual address mapping that
is set up either statically or dynamically.  For that case splitting
allocation and mapping is a good idea in many ways, and I plan to move
towards that once the number of dma mapping implementations is down
to a reasonable number so that it can actually be done.
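
A sketch of that split on a cache-coherent architecture, where the pair
below is roughly what dma_alloc_coherent() does internally. Illustrative
only, with minimal error handling:

#include <linux/gfp.h>
#include <linux/dma-mapping.h>

static void *alloc_then_map(struct device *dev, size_t size,
                            dma_addr_t *handle)
{
        /* step 1: allocate the memory yourself */
        void *vaddr = (void *)__get_free_pages(GFP_KERNEL, get_order(size));

        if (!vaddr)
                return NULL;

        /* step 2: create the device mapping as a separate operation */
        *handle = dma_map_single(dev, vaddr, size, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, *handle)) {
                free_pages((unsigned long)vaddr, get_order(size));
                return NULL;
        }

        return vaddr;
}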


Re: [PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping

2018-04-25 Thread Jordan Crouse
On Wed, Apr 25, 2018 at 12:10:47PM +0200, Thierry Reding wrote:
> From: Thierry Reding 
> 
> Depending on the kernel configuration, early ARM architecture setup code
> may have attached the GPU to a DMA/IOMMU mapping that transparently uses
> the IOMMU to back the DMA API. Tegra requires special handling for IOMMU
> backed buffers (a special bit in the GPU's MMU page tables indicates the
> memory path to take: via the SMMU or directly to the memory controller).
> Transparently backing DMA memory with an IOMMU prevents Nouveau from
> properly handling such memory accesses and causes memory access faults.
> 
> As a side-note: buffers other than those allocated in instance memory
> don't need to be physically contiguous from the GPU's perspective since
> the GPU can map them into contiguous buffers using its own MMU. Mapping
> these buffers through the IOMMU is unnecessary and will even lead to
> performance degradation because of the additional translation.
> 
> Signed-off-by: Thierry Reding 
> ---
> I had already sent this out independently to fix a regression that was
> introduced in v4.16, but then Christoph pointed out that it should've
> been sent to a wider audience and should use a core API rather than
> calling into architecture code directly.
> 
> I've added it to this series for easier reference and to show the need
> for the new API.

This is good stuff, I am struggling with something similar on ARM64. One
problem that I wasn't able to fully solve cleanly was that for arm-smmu 
the SMMU HW resources are not released until the domain itself is destroyed
and I never quite figured out a way to swap the default domain cleanly.

This is a problem for the MSM GPU because not only do we run our own IOMMU as
you do, we also have a hardware dependency on using context bank 0 to
asynchronously switch the pagetable during rendering.

I'm not sure if this is a problem you have encountered. In any event, this code
gets us a little bit further down the path and at least there is somebody else
out there in the cold dark world that understands my pain. :)

For your interest, here was my half-hearted attempt to avoid creating DMA
domains in the first place based on a blacklist to try to spur a bit of
discussion: https://patchwork.freedesktop.org/series/41573/

Jordan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2 3/5] ARM: dma-mapping: Implement arch_iommu_detach_device()

2018-04-25 Thread Christoph Hellwig
> +void arch_iommu_detach_device(struct device *dev)
> +{
> +#ifdef CONFIG_ARM_DMA_USE_IOMMU
> + struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
> + const struct dma_map_ops *dma_ops;
> +
> + if (!mapping)
> + return;
> +
> + arm_iommu_release_mapping(mapping);
> + arm_iommu_detach_device(dev);
> +
> + dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
> + set_dma_ops(dev, dma_ops);

Why not simply:

set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));


Re: [PATCH v2 2/5] dma-mapping: Introduce dma_iommu_detach_device() API

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 12:10:48PM +0200, Thierry Reding wrote:
> From: Thierry Reding 
> 
> The dma_iommu_detach_device() API can be used by drivers to forcibly
> detach a device from an IOMMU that architecture code might have attached
> it to. This is useful for drivers that need explicit control over the IOMMU
> using the IOMMU API directly.

Given that no one else implements it, making it a generic API seems
rather confusing.  For now I'd rename it to
arm_dma_iommu_detach_device() and only implement it in arm.

Once I've got the dma mapping implementations consolidated to a small
enough number we could think about something like a device quirk that
tells the architecture to simply never even attach the iommu dma ops
to start with.
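
For illustration, a minimal sketch of the suggested ARM-only spelling,
reusing the detach logic from the patches in this thread (it would live
in arch/arm/mm/dma-mapping.c; hypothetical, not a posted patch):

void arm_dma_iommu_detach_device(struct device *dev)
{
        struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);

        if (!mapping)
                return;

        arm_iommu_release_mapping(mapping);
        arm_iommu_detach_device(dev);

        set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));
}
EXPORT_SYMBOL(arm_dma_iommu_detach_device);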


Re: [PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping

2018-04-25 Thread Christoph Hellwig
The series seems to miss a cover letter.

Also I really think this original patch shouldn't be in the series
proper.


Re: [PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 11:25:11AM +0100, Russell King - ARM Linux wrote:
> >  config ARM_DMA_USE_IOMMU
> > -   bool
> > +   def_bool y
> > select ARM_HAS_SG_CHAIN
> > select NEED_SG_DMA_LENGTH
> 
> This doesn't work - as has recently been discussed with hch, we can't
> globally enable NEED_SG_DMA_LENGTH on ARM - the ARM architecture
> pre-dates the addition of the DMA length member in the scatterlist,
> and not every machine supports the splitting of the DMA length from
> the non-DMA length member.  Hence, this will cause a regression,
> sorry.

Agreed as-is.  However, supporting it everywhere should not be much work;
I'll take a look at it.


Re: [PATCH 01/13] iommu-common: move to arch/sparc

2018-04-25 Thread David Miller
From: Christoph Hellwig 
Date: Wed, 25 Apr 2018 07:15:27 +0200

> This code is only used by sparc, and all new iommu drivers should use the
> drivers/iommu/ framework.  Also remove the unused exports.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Anshuman Khandual 

Acked-by: David S. Miller 


Re: [PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 12:10:51PM +0200, Thierry Reding wrote:
> From: Thierry Reding 
> 
> The ARM_DMA_USE_IOMMU Kconfig option has side-effects that drivers can
> not opt into but have to explicitly opt out of. This can lead to subtle
> bugs that are difficult to track down and not immediately obvious to be
> related to this Kconfig option.
> 
> To avoid this confusion, always enable the option to expose any lurking
> bugs once and allow any regressions introduced by the DMA/IOMMU code to
> be caught more quickly in the future.
> 
> Note that some drivers still use the Kconfig symbol to provide different
> code paths depending on what architecture the code runs on (e.g. 32-bit
> ARM vs. 64-bit ARM which have different and incompatible implementations
> of the DMA/IOMMU integration code), so leave the symbol in place to keep
> those drivers working.
> 
> For the long term, it is preferable to transition everyone to the
> generic DMA/IOMMU integration code in drivers/iommu/dma-iommu.c.
> 
> Signed-off-by: Thierry Reding 
> ---
> Changes in v2:
> - new patch
> 
>  arch/arm/Kconfig   |  2 +-
>  arch/arm/include/asm/device.h  |  6 --
>  arch/arm/mm/dma-mapping.c  | 18 --
>  drivers/iommu/Kconfig  |  7 ---
>  drivers/media/platform/Kconfig |  1 -
>  5 files changed, 1 insertion(+), 33 deletions(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index fa0b190f8a38..3c91de78535a 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -124,7 +124,7 @@ config NEED_SG_DMA_LENGTH
>   bool
>  
>  config ARM_DMA_USE_IOMMU
> - bool
> + def_bool y
>   select ARM_HAS_SG_CHAIN
>   select NEED_SG_DMA_LENGTH

This doesn't work - as has recently been discussed with hch, we can't
globally enable NEED_SG_DMA_LENGTH on ARM - the ARM architecture
pre-dates the addition of the DMA length member in the scatterlist,
and not every machine supports the splitting of the DMA length from
the non-DMA length member.  Hence, this will cause a regression,
sorry.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


[PATCH v2 4/5] drm/nouveau: tegra: Use dma_iommu_detach_device()

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Use the new dma_iommu_detach_device() function to replace the open-coded
equivalent.

Signed-off-by: Thierry Reding 
---
 .../drm/nouveau/nvkm/engine/device/tegra.c| 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 23428a7056e9..c0a7f3839cbb 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -20,10 +20,6 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-#include 
-#endif
-
 #include 
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -110,19 +106,8 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra 
*tdev)
unsigned long pgsize_bitmap;
int ret;
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-   if (dev->archdata.mapping) {
-   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
-
-   arm_iommu_release_mapping(mapping);
-   arm_iommu_detach_device(dev);
-
-   if (dev->archdata.dma_coherent)
-   set_dma_ops(dev, &arm_coherent_dma_ops);
-   else
-   set_dma_ops(dev, &arm_dma_ops);
-   }
-#endif
+   /* make sure we can use the IOMMU exclusively */
+   dma_iommu_detach_device(dev);
 
if (!tdev->func->iommu_bit)
return;
-- 
2.17.0



[PATCH v2 3/5] ARM: dma-mapping: Implement arch_iommu_detach_device()

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Implement this function to enable drivers to detach from any IOMMU
domains that architecture code might have attached them to, so that they
can take exclusive control of the IOMMU via the IOMMU API.

Signed-off-by: Thierry Reding 
---
Changes in v2:
- fix compilation

 arch/arm/include/asm/dma-mapping.h |  3 +++
 arch/arm/mm/dma-mapping-nommu.c|  4 
 arch/arm/mm/dma-mapping.c  | 17 +
 3 files changed, 24 insertions(+)

diff --git a/arch/arm/include/asm/dma-mapping.h 
b/arch/arm/include/asm/dma-mapping.h
index 8436f6ade57d..d6d5bd44f962 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64 
dma_base, u64 size,
 #define arch_teardown_dma_ops arch_teardown_dma_ops
 extern void arch_teardown_dma_ops(struct device *dev);
 
+#define arch_iommu_detach_device arch_iommu_detach_device
+extern void arch_iommu_detach_device(struct device *dev);
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 619f24a42d09..60fef97d8452 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -242,6 +242,10 @@ void arch_teardown_dma_ops(struct device *dev)
 {
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #define PREALLOC_DMA_DEBUG_ENTRIES 4096
 
 static int __init dma_debug_do_init(void)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8c398fedbbb6..cc63a25bd088 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2423,3 +2423,20 @@ void arch_teardown_dma_ops(struct device *dev)
 
arm_teardown_iommu_dma_ops(dev);
 }
+
+void arch_iommu_detach_device(struct device *dev)
+{
+#ifdef CONFIG_ARM_DMA_USE_IOMMU
+   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+   const struct dma_map_ops *dma_ops;
+
+   if (!mapping)
+   return;
+
+   arm_iommu_release_mapping(mapping);
+   arm_iommu_detach_device(dev);
+
+   dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
+   set_dma_ops(dev, dma_ops);
+#endif
+}
-- 
2.17.0



[PATCH v2 1/5] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Depending on the kernel configuration, early ARM architecture setup code
may have attached the GPU to a DMA/IOMMU mapping that transparently uses
the IOMMU to back the DMA API. Tegra requires special handling for IOMMU
backed buffers (a special bit in the GPU's MMU page tables indicates the
memory path to take: via the SMMU or directly to the memory controller).
Transparently backing DMA memory with an IOMMU prevents Nouveau from
properly handling such memory accesses and causes memory access faults.

As a side-note: buffers other than those allocated in instance memory
don't need to be physically contiguous from the GPU's perspective since
the GPU can map them into contiguous buffers using its own MMU. Mapping
these buffers through the IOMMU is unnecessary and will even lead to
performance degradation because of the additional translation.

Signed-off-by: Thierry Reding 
---
I had already sent this out independently to fix a regression that was
introduced in v4.16, but then Christoph pointed out that it should've
been sent to a wider audience and should use a core API rather than
calling into architecture code directly.

I've added it to this series for easier reference and to show the need
for the new API.

 .../drm/nouveau/nvkm/engine/device/tegra.c| 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 78597da6313a..23428a7056e9 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -19,6 +19,11 @@
  * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
  * DEALINGS IN THE SOFTWARE.
  */
+
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+#include 
+#endif
+
 #include 
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -105,6 +110,20 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra 
*tdev)
unsigned long pgsize_bitmap;
int ret;
 
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+   if (dev->archdata.mapping) {
+   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+
+   arm_iommu_release_mapping(mapping);
+   arm_iommu_detach_device(dev);
+
+   if (dev->archdata.dma_coherent)
+   set_dma_ops(dev, &arm_coherent_dma_ops);
+   else
+   set_dma_ops(dev, &arm_dma_ops);
+   }
+#endif
+
if (!tdev->func->iommu_bit)
return;
 
-- 
2.17.0



[PATCH v2 2/5] dma-mapping: Introduce dma_iommu_detach_device() API

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

The dma_iommu_detach_device() API can be used by drivers to forcibly
detach a device from an IOMMU that architecture code might have attached
it to. This is useful for drivers that need explicit control over the IOMMU
using the IOMMU API directly.

Signed-off-by: Thierry Reding 
---
 drivers/base/dma-mapping.c  | 8 
 include/linux/dma-mapping.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index d82566d6e237..18ddf32b10c9 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -366,3 +366,11 @@ void dma_deconfigure(struct device *dev)
of_dma_deconfigure(dev);
acpi_dma_deconfigure(dev);
 }
+
+void dma_iommu_detach_device(struct device *dev)
+{
+#ifdef arch_iommu_detach_device
+   arch_iommu_detach_device(dev);
+#endif
+}
+EXPORT_SYMBOL(dma_iommu_detach_device);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..732191a2c64e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -671,6 +671,8 @@ static inline void arch_setup_dma_ops(struct device *dev, 
u64 dma_base,
 static inline void arch_teardown_dma_ops(struct device *dev) { }
 #endif
 
+extern void dma_iommu_detach_device(struct device *dev);
+
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
 {
if (dev->dma_parms && dev->dma_parms->max_segment_size)
-- 
2.17.0



[PATCH v2 5/5] ARM: Unconditionally enable ARM_DMA_USE_IOMMU

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

The ARM_DMA_USE_IOMMU Kconfig option has side-effects that drivers can
not opt into but have to explicitly opt out of. This can lead to subtle
bugs that are difficult to track down and not immediately obvious to be
related to this Kconfig option.

To avoid this confusion, always enable the option to expose any lurking
bugs once and allow any regressions introduced by the DMA/IOMMU code to
be caught more quickly in the future.

Note that some drivers still use the Kconfig symbol to provide different
code paths depending on what architecture the code runs on (e.g. 32-bit
ARM vs. 64-bit ARM which have different and incompatible implementations
of the DMA/IOMMU integration code), so leave the symbol in place to keep
those drivers working.

For the long term, it is preferable to transition everyone to the
generic DMA/IOMMU integration code in drivers/iommu/dma-iommu.c.

Signed-off-by: Thierry Reding 
---
Changes in v2:
- new patch

 arch/arm/Kconfig   |  2 +-
 arch/arm/include/asm/device.h  |  6 --
 arch/arm/mm/dma-mapping.c  | 18 --
 drivers/iommu/Kconfig  |  7 ---
 drivers/media/platform/Kconfig |  1 -
 5 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index fa0b190f8a38..3c91de78535a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -124,7 +124,7 @@ config NEED_SG_DMA_LENGTH
bool
 
 config ARM_DMA_USE_IOMMU
-   bool
+   def_bool y
select ARM_HAS_SG_CHAIN
select NEED_SG_DMA_LENGTH
 
diff --git a/arch/arm/include/asm/device.h b/arch/arm/include/asm/device.h
index 3234fe9bba6e..c3cf38e16866 100644
--- a/arch/arm/include/asm/device.h
+++ b/arch/arm/include/asm/device.h
@@ -13,9 +13,7 @@ struct dev_archdata {
 #ifdef CONFIG_IOMMU_API
void *iommu; /* private IOMMU data */
 #endif
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
struct dma_iommu_mapping*mapping;
-#endif
 #ifdef CONFIG_XEN
const struct dma_map_ops *dev_dma_ops;
 #endif
@@ -31,10 +29,6 @@ struct pdev_archdata {
 #endif
 };
 
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
 #define to_dma_iommu_mapping(dev) ((dev)->archdata.mapping)
-#else
-#define to_dma_iommu_mapping(dev) NULL
-#endif
 
 #endif
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc63a25bd088..f6c28ed5651a 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1174,8 +1174,6 @@ static int __init dma_debug_do_init(void)
 }
 core_initcall(dma_debug_do_init);
 
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
-
 static int __dma_info_to_prot(enum dma_data_direction dir, unsigned long attrs)
 {
int prot = 0;
@@ -2366,20 +2364,6 @@ static void arm_teardown_iommu_dma_ops(struct device 
*dev)
arm_iommu_release_mapping(mapping);
 }
 
-#else
-
-static bool arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu)
-{
-   return false;
-}
-
-static void arm_teardown_iommu_dma_ops(struct device *dev) { }
-
-#define arm_get_iommu_dma_map_ops arm_get_dma_map_ops
-
-#endif /* CONFIG_ARM_DMA_USE_IOMMU */
-
 static const struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
 {
return coherent ? &arm_coherent_dma_ops : &arm_dma_ops;
@@ -2426,7 +2410,6 @@ void arch_teardown_dma_ops(struct device *dev)
 
 void arch_iommu_detach_device(struct device *dev)
 {
-#ifdef CONFIG_ARM_DMA_USE_IOMMU
struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
const struct dma_map_ops *dma_ops;
 
@@ -2438,5 +2421,4 @@ void arch_iommu_detach_device(struct device *dev)
 
dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
set_dma_ops(dev, dma_ops);
-#endif
 }
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index df171cb85822..7f0b3ca76a17 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -226,7 +226,6 @@ config ROCKCHIP_IOMMU
depends on ARM || ARM64
depends on ARCH_ROCKCHIP || COMPILE_TEST
select IOMMU_API
-   select ARM_DMA_USE_IOMMU
help
  Support for IOMMUs found on Rockchip rk32xx SOCs.
  These IOMMUs allow virtualization of the address space used by most
@@ -259,7 +258,6 @@ config EXYNOS_IOMMU
depends on ARCH_EXYNOS && MMU
depends on !CPU_BIG_ENDIAN # revisit driver if we can enable big-endian 
ptes
select IOMMU_API
-   select ARM_DMA_USE_IOMMU
help
  Support for the IOMMU (System MMU) of Samsung Exynos application
  processor family. This enables H/W multimedia accelerators to see
@@ -283,7 +281,6 @@ config IPMMU_VMSA
depends on ARCH_RENESAS || (COMPILE_TEST && !GENERIC_ATOMIC64)
select IOMMU_API
select IOMMU_IO_PGTABLE_LPAE
-   select ARM_DMA_USE_IOMMU
help
  Support for the Renesas VMSA-compatible IPMMU Renesas found in the
  R-Mobile APE6 and R-Car H2/M2 SoCs.
@@ -304,7 +301,6 @@ config ARM_SMMU
   

Re: nouveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.
> 
> The only reason the DMA ops are exported is because get_arch_dma_ops
> references (or in case of the coherent ones used to reference).  We
> don't let drivers assign random dma ops.
> 
> > 
> > > Thierry, please resend this with at least the iommu list and
> > > linux-arm-kernel in Cc to have a proper discussion on the right API.
> > 
> > I'm certainly open to help with finding a correct solution, but the
> > patch above was purposefully terse because this is something that I
> > hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> > such a backport between ARM and DRM trees does not sound like something
> > that would help getting this fixed in v4.16.
> 
> Coordinating the backport of a trivial helper in the arm tree is not
> the end of the world.  Really, this cowboy attitude is a good reason
> why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> are upset.  This shouldn't be surprising.

Not really agreeing on the cowboy thing. The fundamental problem is that
the dma api provides abstraction that seriously gets in the way of writing
a gpu driver. Some examples:

- We never want bounce buffers, ever. dma_map_sg gives us that, so there's
  hacks to fall back to a cache of pages allocated using
  dma_alloc_coherent if you build a kernel with bounce buffers.

- dma api hides the cache flushing requirements from us. GPUs love
  non-snooped access, and worse give userspace control over that. We want
  a strict separation between mapping stuff and flushing stuff. With the
  IOMMU api we mostly have the former, but for the latter, arch maintainers
  regularly tell us they won't allow that. So we have drm_clflush.c.

- dma api hides how/where memory is allocated. Kinda similar problem,
  except now for CMA or address limits. So either we roll our own
  allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
  we use dma_alloc_coherent and then grab the sgt to get at the CMA
  allocations because that's the only way. Which sucks, because we can't
  directly tell CMA how to back off if there's some way to make CMA memory
  available through other means (gpus love to hog all of memory, so we
  have shrinkers and everything).

For display drivers the dma api mostly works, until you start sharing
buffers with other devices.

So from our perspective it often looks like core folks just don't
want to support gpu use-cases, so we play it a bit more cowboy and get
things done some other way. Since this has been going on for years now,
we often don't even bother to start a discussion first, since it has
tended to go nowhere useful.
-Daniel

> > Granted, this issue could've been caught with a little more testing, but
> > in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> > was just enabled unconditionally if it has side-effects that platforms
> > don't opt in to but have to explicitly opt out of.
> 
> Agreed on that count.  Please send a patch.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: nouveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.

EXPORT_SYMBOL* means nothing as far as whether a driver should
be able to use the symbol or not - it merely means that the symbol is
made available to a module.  Modules cover much more than just device
drivers - we have library modules, filesystem modules, helper modules
to name a few non-driver classes of modules.

We also have symbols that are exported as part of the architecture
implementation detail of a public interface.  For example, the
public interface "copy_from_user" is implemented as an inline
function (actually several layers of inline functions) eventually
calling into arm_copy_from_user().  arm_copy_from_user() is exported,
but no driver (in fact no module) is allowed to make direct reference
to arm_copy_from_user() - it'd fail when software PAN is enabled.

The whole idea that "if a symbol is exported, it's fine for a driver
to use it" is a complete work of fiction, always has been, and always
will be.

We've had this with the architecture implementation details of the
DMA API before, and with the architecture implementation details of
the CPU cache flushing.  There's only so much commentary, or so many __
prefixes, you can add to a symbol before things get ridiculous, and
none of it stops people creating this abuse.  The only thing that
seems to prevent it is to make life hard for folk wanting to use
the symbols (eg, hiding the symbol prototype in a private header,
etc.)

Never, ever go under the covers of an interface.  If the interface
doesn't do what you want, _discuss_ it, don't just think "oh, that
architecture private facility looks like what I need, I'll use that
directly."

If you ever are on the side of trying to maintain those implementation
details that are abused in this way, you'll soon understand why this
behaviour by driver authors is soo annoying, and the maintainability
problems it creates.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [PATCH 3/4] ARM: dma-mapping: Implement arch_iommu_detach_device()

2018-04-25 Thread Thierry Reding
On Wed, Apr 25, 2018 at 11:18:14AM +0200, Thierry Reding wrote:
[...]
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 8c398fedbbb6..1957938d8c9c 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -2366,6 +2366,21 @@ static void arm_teardown_iommu_dma_ops(struct device 
> *dev)
>   arm_iommu_release_mapping(mapping);
>  }
>  
> +void arch_iommu_detach_device(struct device *dev)
> +{
> + struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
> + const struct dma_map_ops *dma_ops;
> +
> + if (!mapping)
> + return;
> +
> + arm_iommu_release_mapping(mapping);
> + arm_iommu_detach_device(dev);
> +
> + dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
> + set_dma_ops(dev, dma_ops);
> +}

Oh, the irony! This doesn't actually build and I failed to notice
because I forgot to enable ARM_DMA_USE_IOMMU...

Sorry about that, but I think we can use this as a basis to discuss
whether the API is good, and I'll make sure to properly test the next
revision before sending it out.

Thierry



[PATCH 3/4] ARM: dma-mapping: Implement arch_iommu_detach_device()

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Implement this function to enable drivers to detach from any IOMMU
domains that architecture code might have attached them to, so that they
can take exclusive control of the IOMMU via the IOMMU API.

Signed-off-by: Thierry Reding 
---
 arch/arm/include/asm/dma-mapping.h |  3 +++
 arch/arm/mm/dma-mapping-nommu.c|  4 
 arch/arm/mm/dma-mapping.c  | 19 +++
 3 files changed, 26 insertions(+)

diff --git a/arch/arm/include/asm/dma-mapping.h 
b/arch/arm/include/asm/dma-mapping.h
index 8436f6ade57d..d6d5bd44f962 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64 
dma_base, u64 size,
 #define arch_teardown_dma_ops arch_teardown_dma_ops
 extern void arch_teardown_dma_ops(struct device *dev);
 
+#define arch_iommu_detach_device arch_iommu_detach_device
+extern void arch_iommu_detach_device(struct device *dev);
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 619f24a42d09..60fef97d8452 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -242,6 +242,10 @@ void arch_teardown_dma_ops(struct device *dev)
 {
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #define PREALLOC_DMA_DEBUG_ENTRIES 4096
 
 static int __init dma_debug_do_init(void)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8c398fedbbb6..1957938d8c9c 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2366,6 +2366,21 @@ static void arm_teardown_iommu_dma_ops(struct device 
*dev)
arm_iommu_release_mapping(mapping);
 }
 
+void arch_iommu_detach_device(struct device *dev)
+{
+   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+   const struct dma_map_ops *dma_ops;
+
+   if (!mapping)
+   return;
+
+   arm_iommu_release_mapping(mapping);
+   arm_iommu_detach_device(dev);
+
+   dma_ops = arm_get_dma_map_ops(dev->archdata.dma_coherent);
+   set_dma_ops(dev, dma_ops);
+}
+
 #else
 
 static bool arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
@@ -2378,6 +2393,10 @@ static void arm_teardown_iommu_dma_ops(struct device 
*dev) { }
 
 #define arm_get_iommu_dma_map_ops arm_get_dma_map_ops
 
+void arch_iommu_detach_device(struct device *dev)
+{
+}
+
 #endif /* CONFIG_ARM_DMA_USE_IOMMU */
 
 static const struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
-- 
2.17.0
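
A quick aside on the header change above: the self-referential
#define arch_iommu_detach_device arch_iommu_detach_device is the usual
kernel idiom for advertising an optional architecture hook, so generic
code can probe for it with the preprocessor. A minimal sketch of the
pattern, with the names reused from this series for illustration and
assuming the generic header ends up including the arch header:

struct device;

/* arch header: defining the macro to its own name marks the hook as present */
#define arch_iommu_detach_device arch_iommu_detach_device
extern void arch_iommu_detach_device(struct device *dev);

/* generic code: compiles away to a no-op on architectures without the hook */
static inline void detach_if_supported(struct device *dev)
{
#ifdef arch_iommu_detach_device
	arch_iommu_detach_device(dev);
#endif
}

Note that the macro in the arch header, not the function body, is what
gates the generic call site; every architecture that defines it must
also supply the function, which is why the nommu variant adds an empty
stub.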



[PATCH 2/4] dma-mapping: Introduce dma_iommu_detach_device() API

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

The dma_iommu_detach_device() API can be used by drivers to forcibly
detach a device from an IOMMU that architecture code might have attached
it to. This is useful for drivers that need explicit control over the
IOMMU using the IOMMU API directly.

Signed-off-by: Thierry Reding 
---
 drivers/base/dma-mapping.c  | 8 ++++++++
 include/linux/dma-mapping.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index d82566d6e237..18ddf32b10c9 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -366,3 +366,11 @@ void dma_deconfigure(struct device *dev)
of_dma_deconfigure(dev);
acpi_dma_deconfigure(dev);
 }
+
+void dma_iommu_detach_device(struct device *dev)
+{
+#ifdef arch_iommu_detach_device
+   arch_iommu_detach_device(dev);
+#endif
+}
+EXPORT_SYMBOL(dma_iommu_detach_device);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..732191a2c64e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -671,6 +671,8 @@ static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base,
 static inline void arch_teardown_dma_ops(struct device *dev) { }
 #endif
 
+extern void dma_iommu_detach_device(struct device *dev);
+
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
 {
if (dev->dma_parms && dev->dma_parms->max_segment_size)
-- 
2.17.0
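
To illustrate the intended use, here is a hypothetical driver-side
sketch of the call flow this API enables (the function name and error
handling are illustrative assumptions, not code from this series):

#include <linux/dma-mapping.h>
#include <linux/iommu.h>

/* hypothetical probe-path helper for a driver sitting behind an IOMMU */
static int example_claim_iommu(struct device *dev, struct iommu_domain **out)
{
	struct iommu_domain *domain;
	int err;

	/* undo any transparent DMA/IOMMU mapping set up by arch code */
	dma_iommu_detach_device(dev);

	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return -ENOMEM;

	err = iommu_attach_device(domain, dev);
	if (err) {
		iommu_domain_free(domain);
		return err;
	}

	/* the driver now owns the address space via iommu_map()/iommu_unmap() */
	*out = domain;
	return 0;
}

The point of the new API is the first call: after it returns, the DMA
API no longer sits between the driver and the IOMMU.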



[PATCH 4/4] drm/nouveau: tegra: Use dma_iommu_detach_device()

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Use the new dma_iommu_detach_device() function to replace the open-coded
equivalent.

Signed-off-by: Thierry Reding 
---
 .../drm/nouveau/nvkm/engine/device/tegra.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 23428a7056e9..c0a7f3839cbb 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -20,10 +20,6 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-#include <asm/dma-iommu.h>
-#endif
-
 #include <core/tegra.h>
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -110,19 +106,8 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
unsigned long pgsize_bitmap;
int ret;
 
-#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
-   if (dev->archdata.mapping) {
-   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
-
-   arm_iommu_release_mapping(mapping);
-   arm_iommu_detach_device(dev);
-
-   if (dev->archdata.dma_coherent)
-   set_dma_ops(dev, &arm_coherent_dma_ops);
-   else
-   set_dma_ops(dev, &arm_dma_ops);
-   }
-#endif
+   /* make sure we can use the IOMMU exclusively */
+   dma_iommu_detach_device(dev);
 
if (!tdev->func->iommu_bit)
return;
-- 
2.17.0



[PATCH 1/4] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping

2018-04-25 Thread Thierry Reding
From: Thierry Reding 

Depending on the kernel configuration, early ARM architecture setup code
may have attached the GPU to a DMA/IOMMU mapping that transparently uses
the IOMMU to back the DMA API. Tegra requires special handling for IOMMU
backed buffers (a special bit in the GPU's MMU page tables indicates the
memory path to take: via the SMMU or directly to the memory controller).
Transparently backing DMA memory with an IOMMU prevents Nouveau from
properly handling such memory accesses and causes memory access faults.

As a side-note: buffers other than those allocated in instance memory
don't need to be physically contiguous from the GPU's perspective since
the GPU can map them into contiguous buffers using its own MMU. Mapping
these buffers through the IOMMU is unnecessary and will even lead to
performance degradation because of the additional translation.

Signed-off-by: Thierry Reding 
---
I had already sent this out independently to fix a regression that was
introduced in v4.16, but then Christoph pointed out that it should've
been sent to a wider audience and should use a core API rather than
calling into architecture code directly.

I've added it to this series for easier reference and to show the need
for the new API.

 .../drm/nouveau/nvkm/engine/device/tegra.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
index 78597da6313a..23428a7056e9 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c
@@ -19,6 +19,11 @@
  * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
  * DEALINGS IN THE SOFTWARE.
  */
+
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+#include <asm/dma-iommu.h>
+#endif
+
 #include <core/tegra.h>
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
 #include "priv.h"
@@ -105,6 +110,20 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev)
unsigned long pgsize_bitmap;
int ret;
 
+#if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
+   if (dev->archdata.mapping) {
+   struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+
+   arm_iommu_release_mapping(mapping);
+   arm_iommu_detach_device(dev);
+
+   if (dev->archdata.dma_coherent)
+   set_dma_ops(dev, &arm_coherent_dma_ops);
+   else
+   set_dma_ops(dev, &arm_dma_ops);
+   }
+#endif
+
if (!tdev->func->iommu_bit)
return;
 
-- 
2.17.0
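
To make the commit message's "special bit" concrete, here is a hedged
illustration; the bit position and helper names below are assumptions
for the sketch (Nouveau reads the real position from a per-SoC
iommu_bit value at runtime):

#include <linux/types.h>

#define GPU_IOMMU_BIT	34	/* assumed position of the routing bit */

/* bit set: the GPU MMU routes the access through the SMMU */
static inline u64 gpu_addr_via_smmu(u64 iova)
{
	return iova | (1ULL << GPU_IOMMU_BIT);
}

/* bit clear: the access goes straight to the memory controller */
static inline u64 gpu_addr_direct(u64 phys)
{
	return phys & ~(1ULL << GPU_IOMMU_BIT);
}

Roughly speaking, buffers handed out by an arch-installed DMA/IOMMU
mapping are already SMMU-translated, but nothing sets the routing bit
in the GPU's own page tables, so the GPU bypasses the SMMU and the
accesses fault.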



noveau vs arm dma ops

2018-04-25 Thread Christoph Hellwig
[discussion about this patch, which should have been cced to the iommu
 and linux-arm-kernel lists, but wasn't:
 https://www.spinics.net/lists/dri-devel/msg173630.html]

On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > API from the iommu/dma-mapping code.  Drivers have no business poking
> > into these details.
> 
> The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> which is rather misleading if they are not meant to be used by drivers
> directly.

The only reason the DMA ops are exported is because get_arch_dma_ops
references them (or, in the case of the coherent ones, used to).  We
don't let drivers assign random dma ops.

> 
> > Thierry, please resend this with at least the iommu list and
> > linux-arm-kernel in Cc to have a proper discussion on the right API.
> 
> I'm certainly open to help with finding a correct solution, but the
> patch above was purposefully terse because this is something that I
> hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> such a backport between ARM and DRM trees does not sound like something
> that would help getting this fixed in v4.16.

Coordinating the backport of a trivial helper in the arm tree is not
the end of the world.  Really, this cowboy attitude is a good reason
why graphics folks have such a bad rep.  You keep poking into random
kernel internals, don't talk to anyone and then complain if people
are upset.  This shouldn't be surprising.

Granted, this issue could've been caught with a little more testing, but
in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
were just enabled unconditionally, given that it has side effects that
platforms don't opt in to but have to explicitly opt out of.

Agreed on that count.  Please send a patch.
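
For reference, such a patch might amount to something like the
following Kconfig sketch; this is an assumption about the shape only,
untested (today the option is a hidden bool that individual platforms
and drivers select):

# hypothetical: enable the option whenever an IOMMU is configured,
# instead of relying on individual platforms to select it
config ARM_DMA_USE_IOMMU
	def_bool IOMMU_SUPPORT
	select NEED_SG_DMA_LENGTH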


Re: [GIT PULL] dma mapping fixes for 4.17-rc3

2018-04-25 Thread David Rientjes via iommu
On Tue, 24 Apr 2018, Christoph Hellwig wrote:

> On Tue, Apr 24, 2018 at 11:54:26PM -0700, David Rientjes wrote:
> > Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be more 
> > accurately <=? 
> 
> No, it should really be <.  The exactly 32-bit case is already covered
> with GFP_DMA32.  Eventualy it should be < 24-bit with a separate
> GFP_DMA32 case for > 32-bit && < 64-bit.  Takashi has been working on
> that, but it is 4.18 material.
> 

Ack, thanks.


Re: [GIT PULL] dma mapping fixes for 4.17-rc3

2018-04-25 Thread David Rientjes via iommu
On Wed, 25 Apr 2018, Christoph Hellwig wrote:

> The following changes since commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e:
> 
>   Linux 4.17-rc2 (2018-04-22 19:20:09 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-4.17-3
> 
> for you to fetch changes up to 60695be2bb6b0623f8e53bd9949d582a83c6d44a:
> 
>   dma-mapping: postpone cpu addr translation on mmap (2018-04-23 14:44:24 +0200)
> 
> 
> A few small dma-mapping fixes for Linux 4.17-rc3:
> 
>  - don't loop to try GFP_DMA allocations if ZONE_DMA is not actually
>enabled (regression in 4.16)

Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be more 
accurately <=? 


Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-25 Thread Jens Axboe
On 4/24/18 12:16 PM, Christoph Hellwig wrote:
> ide_toggle_bounce did select various strange block bounce limits, including
> not bouncing at all as soon as an iommu is present in the system.  Given
> that the dma_map routines now handle any required bounce buffering except
> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
> at least for iommu-equipped systems, we can get rid of the block layer
> bounce limit setting entirely.

Pretty sure I was the one to add this code, when highmem page IO was
enabled about two decades ago...

Outside of DMA, the issue was that the PIO code could not handle
highmem. That's not the case anymore, so this should be fine.

Reviewed-by: Jens Axboe 

-- 
Jens Axboe
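
For readers without the patch at hand, the helper being removed looked
roughly like this (paraphrased from memory of the old ide code, so
details may differ); note how the mere presence of an IOMMU, i.e.
PCI_DMA_BUS_IS_PHYS being 0, disabled bouncing entirely:

static void ide_toggle_bounce(ide_drive_t *drive, int on)
{
	u64 addr = BLK_BOUNCE_HIGH;	/* default: bounce all highmem pages */

	if (!PCI_DMA_BUS_IS_PHYS)
		addr = BLK_BOUNCE_ANY;	/* iommu present: never bounce */
	else if (on && drive->media == ide_disk) {
		struct device *dev = drive->hwif->dev;

		if (dev && dev->dma_mask)
			addr = *dev->dma_mask;	/* bounce above the DMA mask */
	}

	blk_queue_bounce_limit(drive->queue, addr);
}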



Re: [GIT PULL] dma mapping fixes for 4.17-rc3

2018-04-25 Thread Christoph Hellwig
On Tue, Apr 24, 2018 at 11:54:26PM -0700, David Rientjes wrote:
> Shouldn't that test for dev->coherent_dma_mask < DMA_BIT_MASK(32) be more 
> accurately <=? 

No, it should really be <.  The exactly 32-bit case is already covered
with GFP_DMA32.  Eventually it should be < 24-bit with a separate
GFP_DMA32 case for > 32-bit && < 64-bit.  Takashi has been working on
that, but it is 4.18 material.
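
Sketched out, the check under discussion looks something like this
(a simplified paraphrase, not the verbatim kernel code):

/* retry a failed coherent allocation with GFP_DMA where it can help */
static struct page *alloc_for_coherent_mask(struct device *dev, size_t size,
					    unsigned int order, gfp_t gfp)
{
	struct page *page;

again:
	page = alloc_pages(gfp, order);
	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
		__free_pages(page, order);
		page = NULL;

		/* '<', not '<=': an exactly 32-bit mask is already served
		 * by GFP_DMA32, so only strictly smaller masks gain
		 * anything from retrying in ZONE_DMA */
		if (IS_ENABLED(CONFIG_ZONE_DMA) &&
		    dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
		    !(gfp & GFP_DMA)) {
			gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
			goto again;
		}
	}

	return page;
}

With <= the retry would also fire for plain 32-bit devices, which gain
nothing from ZONE_DMA that GFP_DMA32 doesn't already provide.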