Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-09-18 Thread Alex Williamson
On Wed, 12 Sep 2018 10:46:19 -0700
"Raj, Ashok"  wrote:

> On Thu, Aug 09, 2018 at 01:44:17PM -0600, Alex Williamson wrote:
> > On Thu,  9 Aug 2018 12:37:06 -0700
> > Ashok Raj  wrote:
> >   
> > > PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual
> > > Functions.
> > > 
> > > Some SRIOV devices have some bugs in RTL and VF's end up reading 1
> > > instead of 0 for the PIN.  
> > 
> > Hi Ashok,
> > 
> > One question, can we identify which VFs are known to have this
> > issue so that users (and downstreams) can know how to prioritize
> > this patch?  
> 
> Hi Alex
> 
> Sorry it took some time to hunt this down. 
> 
> The offending VF has a device ID : 0x270C
> The corresponding PF has a device ID: 0x270B.
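For reference, a minimal userspace sketch for checking whether a given VF exhibits the
bug, by reading its Interrupt Pin register through sysfs (the BDF below is only a
placeholder, substitute the VF in question):

/*
 * Illustrative check only: read PCI_INTERRUPT_PIN (config offset 0x3D).
 * A spec-compliant SR-IOV VF must report 0 here; the buggy parts report 1.
 */
#include <stdio.h>

int main(void)
{
	const char *cfg = "/sys/bus/pci/devices/0000:3b:10.1/config"; /* placeholder BDF */
	unsigned char pin;
	FILE *f = fopen(cfg, "rb");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fseek(f, 0x3d, SEEK_SET) || fread(&pin, 1, 1, f) != 1) {
		fprintf(stderr, "failed to read config space\n");
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("INTERRUPT_PIN = %u (SR-IOV VFs must report 0)\n", pin);
	return 0;
}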

Ok, I interpret Alan's previous comment about the patch[1] to suggest a
structure a bit more like that below.  IOW, we know that 8086:270c
inspires this change, but once it's included we won't know who else
relies on it.  We can perhaps encourage better hardware validation, or
at least better tracking of who needs this with a warning and
whitelist.  Testing, especially positive and negative testing against
the warning, and reviews welcome.  Thanks,

Alex

[1]https://lkml.org/lkml/2018/8/10/462

commit d780da26a81c6f47522ae0aeff03abd4d08b89b5
Author: Alex Williamson 
Date:   Tue Sep 18 21:27:57 2018 -0600

vfio/pci: Mask buggy SR-IOV VF INTx support

The SR-IOV spec requires that VFs must report zero for the INTx pin
register as VFs are precluded from INTx support.  It's much easier for
the host kernel to understand whether a device is a VF and therefore
whether a non-zero pin register value is bogus than it is to do the
same in userspace.  Override the INTx count for such devices and
virtualize the pin register to provide a consistent view of the device
to the user.

As this is clearly a spec violation, warn about it to support hardware
validation, but also provide a known whitelist as it doesn't do much
good to continue complaining if the hardware vendor doesn't plan to
fix it.

Known devices with this issue: 8086:270c

Signed-off-by: Alex Williamson 

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cddb453a1ba5..8af3f6f35f32 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -430,14 +430,41 @@ static int vfio_pci_open(void *device_data)
return ret;
 }
 
+static const struct pci_device_id known_bogus_vf_intx_pin[] = {
+   { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x270c) },
+   {}
+};
+
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
u8 pin;
+
+   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
+   return 0;
+
pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
-   if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
-   return 1;
 
+   /*
+* Per SR-IOV spec rev 1.1, 3.4.1.18 the interrupt pin register
+* does not apply to VFs and VFs must implement this register
+* as read-only with value zero.  Userspace is not readily
+* able to identify a device as a VF and thus that the pin
+* definition on the device is bogus should a device violate
+* this requirement.  For such devices, override the bogus
+* value and provide a warning to support hardware validation
+* (or be quiet if it's known).  PCI config space emulation
+* will virtualize this register for all VFs.
+*/
+   if (pin && vdev->pdev->is_virtfn) {
+   if (!pci_match_id(known_bogus_vf_intx_pin, vdev->pdev))
+   dev_warn_once(&vdev->pdev->dev,
+ "VF reports bogus INTx pin %d\n",
+ pin);
+   return 0;
+   }
+
+   return pin ? 1 : 0;
} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
u8 pos;
u16 flags;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 62023b4a373b..25130fa6e265 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1678,7 +1678,8 @@ int vfio_config_init(struct vfio_pci_device *vdev)
*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
}
 
-   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
+   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
+   pdev->is_virtfn)
vconfig[PCI_INTERRUPT_PIN] = 0;
 
ret = vfio_cap_init(vdev);

Re: [PATCH v4] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-09-18 Thread Ganapatrao Kulkarni
Hi Joerg,

can you please pull this patch?

On Wed, Sep 5, 2018 at 9:58 AM Ganapatrao Kulkarni
 wrote:
>
> As an optimisation for PCI devices, the first allocation attempt is
> always made from the SAC (32-bit) address range. This leads to
> unnecessary repeated attempts when there are no free ranges available
> there. Fix this by tracking the size of the most recently failed
> allocation and retrying the 32-bit range only if the requested size is
> smaller than that failed size. The tracked size is reset whenever an
> IOVA is freed back into the 32-bit range. (A small standalone sketch of
> this heuristic follows the patch below.)
>
> Reviewed-by: Robin Murphy 
> Signed-off-by: Ganapatrao Kulkarni 
> ---
> v4:
> Rebased to 4.19-rc2
> v3:
> Update with comments [3] from Robin Murphy 
>
> [3] https://lkml.org/lkml/2018/8/13/116
>
> v2: update with comments [2] from Robin Murphy 
>
> [2] https://lkml.org/lkml/2018/8/7/166
>
> v1: Based on comments from Robin Murphy 
> for patch [1]
>
> [1] https://lkml.org/lkml/2018/4/19/780
>
>  drivers/iommu/iova.c | 22 +++---
>  include/linux/iova.h |  1 +
>  2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 83fe2621effe..f8d3ba247523 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
> granule,
> iovad->granule = granule;
> iovad->start_pfn = start_pfn;
> iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
> +   iovad->max32_alloc_size = iovad->dma_32bit_pfn;
> iovad->flush_cb = NULL;
> iovad->fq = NULL;
> iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
> struct iova *free)
>
> cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
> if (free->pfn_hi < iovad->dma_32bit_pfn &&
> -   free->pfn_lo >= cached_iova->pfn_lo)
> +   free->pfn_lo >= cached_iova->pfn_lo) {
> iovad->cached32_node = rb_next(&free->node);
> +   iovad->max32_alloc_size = iovad->dma_32bit_pfn;
> +   }
>
> cached_iova = rb_entry(iovad->cached_node, struct iova, node);
> if (free->pfn_lo >= cached_iova->pfn_lo)
> @@ -190,6 +193,10 @@ static int __alloc_and_insert_iova_range(struct 
> iova_domain *iovad,
>
> /* Walk the tree backwards */
> spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +   if (limit_pfn <= iovad->dma_32bit_pfn &&
> +   size >= iovad->max32_alloc_size)
> +   goto iova32_full;
> +
> curr = __get_cached_rbnode(iovad, limit_pfn);
> curr_iova = rb_entry(curr, struct iova, node);
> do {
> @@ -200,10 +207,8 @@ static int __alloc_and_insert_iova_range(struct 
> iova_domain *iovad,
> curr_iova = rb_entry(curr, struct iova, node);
> } while (curr && new_pfn <= curr_iova->pfn_hi);
>
> -   if (limit_pfn < size || new_pfn < iovad->start_pfn) {
> -   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> -   return -ENOMEM;
> -   }
> +   if (limit_pfn < size || new_pfn < iovad->start_pfn)
> +   goto iova32_full;
>
> /* pfn_lo will point to size aligned address if size_aligned is set */
> new->pfn_lo = new_pfn;
> @@ -214,9 +219,12 @@ static int __alloc_and_insert_iova_range(struct 
> iova_domain *iovad,
> __cached_rbnode_insert_update(iovad, new);
>
> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> -
> -
> return 0;
> +
> +iova32_full:
> +   iovad->max32_alloc_size = size;
> +   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +   return -ENOMEM;
>  }
>
>  static struct kmem_cache *iova_cache;
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index 928442dda565..0b93bf96693e 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -75,6 +75,7 @@ struct iova_domain {
> unsigned long   granule;/* pfn granularity for this domain */
> unsigned long   start_pfn;  /* Lower limit for this domain */
> unsigned long   dma_32bit_pfn;
> +   unsigned long   max32_alloc_size; /* Size of last failed allocation */
> struct iova anchor; /* rbtree lookup anchor */
> struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE];  /* IOVA range 
> caches */
>
> --
> 2.18.0
>
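As mentioned above, a small standalone sketch of the heuristic (plain C, not the kernel
code; all names here are invented for illustration): skip the search entirely for 32-bit
requests at least as large as the last one that failed, and forget the failure as soon
as space is returned to the 32-bit range.

#include <stdbool.h>
#include <stddef.h>

struct toy_domain {
	size_t free32;            /* free space left in the 32-bit window */
	size_t max32_alloc_size;  /* size of the last failed allocation   */
};

static void toy_init(struct toy_domain *d, size_t window)
{
	d->free32 = window;
	d->max32_alloc_size = (size_t)-1;   /* nothing has failed yet */
}

static bool toy_alloc32(struct toy_domain *d, size_t size)
{
	if (size >= d->max32_alloc_size)    /* known to fail: skip the walk */
		return false;
	if (size > d->free32) {             /* walk failed: remember the size */
		d->max32_alloc_size = size;
		return false;
	}
	d->free32 -= size;
	return true;
}

static void toy_free32(struct toy_domain *d, size_t size)
{
	d->free32 += size;
	d->max32_alloc_size = (size_t)-1;   /* replenished: allow retries again */
}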
thanks,
Ganapat


RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

2018-09-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Tuesday, September 18, 2018 11:47 PM
> 
> On 14/09/2018 22:04, Jacob Pan wrote:
> >> This example only needs to modify first-level translation, and works
> >> with SMMUv3. The kernel here could be the host, in which case
> >> second-level translation is disabled in the SMMU, or it could be the
> >> guest, in which case second-level mappings are created by QEMU and
> >> first-level translation is managed by assigning PASID tables to the
> >> guest.
> > There is a difference in case of guest SVA. VT-d v3 will bind guest
> > PASID and guest CR3 instead of the guest PASID table. Then turn on
> > nesting. In case of mdev, the second level is obtained from the aux
> > domain which was setup for the default PASID. Or in case of PCI device,
> > second level is harvested from RID2PASID.
> 
> Right, though I wasn't talking about the host managing guest SVA here,
> but a kernel binding the address space of one of its userspace drivers
> to the mdev.
> 
> >> So (2) would use iommu_sva_bind_device(),
> > We would need something different than that for guest bind, just to show
> > the two cases:>
> > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> int
> > *pasid, unsigned long flags, void *drvdata)
> >
> > (WIP)
> > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > where:
> > /**
> >  * struct gpasid_bind_data - Information about device and guest PASID
> > binding
> >  * @pasid:   Process address space ID used for the guest mm
> >  * @addr_width:  Guest address width. Paging mode can also be derived.
> >  * @gcr3:    Guest CR3 value from guest mm
> >  */
> > struct gpasid_bind_data {
> >     __u32 pasid;
> >     __u64 gcr3;
> >     __u32 addr_width;
> >     __u32 flags;
> > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > };
> > Perhaps there is room to merge with io_mm but the life cycle
> management
> > of guest PASID and host PASID will be different if you rely on mm
> > release callback than FD.

let's not call it gpasid here - that name makes sense only in the
bind_pasid_table proposal, where the pasid table (and thus the pasid space)
is managed by the guest. In the above context it is always a host pasid
(allocated system-wide), which could point to a host cr3 (user process) or
a guest cr3 (vm case).

> 
> I think gpasid management should stay separate from io_mm, since in your
> case VFIO mechanisms are used for life cycle management of the VM,
> similarly to the former bind_pasid_table proposal. For example closing
> the container fd would unbind all guest page tables. The QEMU process'
> address space lifetime seems like the wrong thing to track for gpasid.

I sort of agree (though not thinking through all the flow carefully). PASIDs
are allocated per iommu domain, thus release also happens when domain
is detached (along with container fd close).

> 
> >> but (1) needs something
> >> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
> >> domain to second-level or nested translation? It seems silly to use a
> >> different API for first-level, since the flow in userspace and VFIO
> >> is the same as your second-level case as far as MAP_DMA ioctl goes.
> >> The difference is that in your case the auxiliary domain supports an
> >> additional operation which binds first-level page tables. An
> >> auxiliary domain that only supports first-level wouldn't support this
> >> operation, but it can still implement iommu_map/unmap/etc.
> >>
> > I think the intention is that when a mdev is created, we don't
> > know whether it will be used for SVA or IOVA. So aux domain is here to
> > "hold a spot" for the default PASID such that MAP_DMA calls can work as
> > usual, which is second level only. Later, if SVA is used on the mdev
> > there will be another PASID allocated for that purpose.
> > Do we need to create an aux domain for each PASID? the translation can
> > be looked up by the combination of parent dev and pasid.
> 
> When allocating a new PASID for the guest, I suppose you need to clone
> the second-level translation config? In which case a single aux domain
> for the mdev might be easier to implement in the IOMMU driver. Entirely
> up to you since we don't have this case on SMMUv3
> 

One thing to highlight in related discussions (also mentioned in other
thread). There is not a new iommu domain type called 'aux'. 'aux' matters
only to a specific device when a domain is attached to that device which
has aux capability enabled. Same domain can be attached to other device
as normal domain. In that case multiple PASIDs allocated on same mdev
are tied to same aux domain, same bare metal SVA case, i.e. any domain
(normal or aux) can include 2nd level structure and multiple 1st level 
structures. Jean is correct - all PASIDs in same domain then share 2nd 
level translation, and there are io_mm or similar tracking structures to 
associate each PASID to a 1st level translation.

Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

2018-09-18 Thread Lu Baolu

Hi,

On 09/19/2018 07:26 AM, Tian, Kevin wrote:

From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
Sent: Tuesday, September 18, 2018 11:52 PM

On 15/09/2018 03:36, Tian, Kevin wrote:

4) Userspace opens another mdev.
-> iommu.c calls domain->ops->attach_dev(domain2, dev)


another mdev in same VFIO container or different? I assume the
latter since you mentioned a new domain2.


I was thinking a different VFIO container actually. I used domain2 to
try to make the example clearer


1)? When the container is closed, VFIO calls
iommu_detach_device(domain2, parent_dev)
-> iommu.c calls default_domain->ops->attach_dev(default_domain,

dev)

Given that auxiliary domains are attached, the IOMMU driver could

deduce

that this actually means "detach an auxiliary domain". But which one?


I didn't get this one. There is no need to stick to 1) behavior for
4), i.e. below is expected:
     domain2->ops->detach_dev(domain2, dev)


But in order to get that, the IOMMU core needs to know that domain2 is
auxiliary. Otherwise, detach_dev is never called when a default_domain
is present for the parent dev.

I guess one solution is to add an "auxiliary" attribute to iommu_domain,
so __iommu_detach_group would do something like:


this doesn't work. same domain can be also attached to another physical
device as non-aux domain (e.g. passthrough) at the same time (vfio-pci
device and vfio-mdev device in same container), then default domain
tweak is required in that case. "aux" takes effect only per-device, not
per-domain.


If we have below APIs for aux domain (the API names are just for
discussion purpose, subject to change):

iommu_query_aux_domain_capability(dev)
iommu_enable_aux_domain(dev)
iommu_disable_aux_domain(dev)
iommu_check_aux_domain_status(dev)

then, we could do this like below:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ab3d7d3b1583..3bfb652c78e8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1469,12 +1469,31 @@ static int iommu_group_do_detach_device(struct 
device *dev, void *data)

return 0;
 }

+static int iommu_group_check_aux_domain(struct device *dev, void *data)
+{
+   const struct iommu_ops *ops = dev->bus->iommu_ops;
+
+   if (ops && ops->check_auxd)
+   return !ops->check_auxd(dev);
+
+   return -EINVAL;
+}
+
+/*
+ *  Check whether devices in @group have aux domain enabled.
+ */
+static int iommu_group_aux_domain_enabled(struct iommu_group *group)
+{
+   return __iommu_group_for_each_dev(group, NULL,
+ iommu_group_check_aux_domain);
+}
+
 static void __iommu_detach_group(struct iommu_domain *domain,
 struct iommu_group *group)
 {
int ret;

-   if (!group->default_domain) {
+   if (!group->default_domain || 
iommu_group_aux_domain_enabled(group)) {

__iommu_group_for_each_dev(group, domain,
   iommu_group_do_detach_device);
group->domain = NULL;

Best regards,
Lu Baolu





diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7113fe398b70..2b3e9b91aee7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct
iommu_domain *domain,
  {
int ret;

-   if (!group->default_domain) {
+   if (!group->default_domain || domain->auxiliary) {
__iommu_group_for_each_dev(group, domain,
   iommu_group_do_detach_device);
-   group->domain = NULL;
+   if (!domain->auxiliary)
+   group->domain = NULL;
return;
}

Not sure who would set this "auxiliary" attribute... Maybe the IOMMU
driver, when attaching the domain to a device that has auxiliary mode
enabled?


why cannot ARM implement a detach_dev for aux_domain too? My
feeling is that default domain twist is only for switch between 1/2/3
in concept.


If the core actually calls it, we can implement detach_dev :) The
problem is that the core never calls detach_dev when default_domain is
present (affects any IOMMU driver that relies on default_domain,
including AMD), and even in case 4) the default_domain is present for
the parent device


Then can we change that core logic so detach_dev is invoked in all
cases? Yes, there will be some changes in vendor drivers, but I expect
this change to be trivial (especially considering the gain in IOMMU API
simplicity, as described below).




So the proposed interface doesn't seem to work as is. If we want to use
iommu_attach/detach_device for auxiliary domains, the existing

behavior

of iommu.c, and IOMMU drivers that rely on it, have to change. Any
change I can think of right now seems more daunting than introducing a
different path for auxiliary domains, like iommu_attach_aux_domain for
example.



introducing *aux* specific API will cause different VFIO code path to
handle RID-based and PASID-based mdev, since RID-based still needs
to use normal attach_domain that way.

RE: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

2018-09-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Tuesday, September 18, 2018 11:52 PM
> 
> On 15/09/2018 03:36, Tian, Kevin wrote:
> >> 4) Userspace opens another mdev.
> >> -> iommu.c calls domain->ops->attach_dev(domain2, dev)
> >
> > another mdev in same VFIO container or different? I assume the
> > latter since you mentioned a new domain2.
> 
> I was thinking a different VFIO container actually. I used domain2 to
> try to make the example clearer
> 
> >> 1)? When the container is closed, VFIO calls
> >> iommu_detach_device(domain2, parent_dev)
> >> -> iommu.c calls default_domain->ops->attach_dev(default_domain,
> dev)
> >> Given that auxiliary domains are attached, the IOMMU driver could
> deduce
> >> that this actually means "detach an auxiliary domain". But which one?
> >
> > I didn't get this one. There is no need to stick to 1) behavior for
> > 4), i.e. below is expected:
> >     domain2->ops->detach_dev(domain2, dev)
> 
> But in order to get that, the IOMMU core needs to know that domain2 is
> auxiliary. Otherwise, detach_dev is never called when a default_domain
> is present for the parent dev.
> 
> I guess one solution is to add an "auxiliary" attribute to iommu_domain,
> so __iommu_detach_group would do something like:

this doesn't work. same domain can be also attached to another physical
device as non-aux domain (e.g. passthrough) at the same time (vfio-pci
device and vfio-mdev device in same container), then default domain 
tweak is required in that case. "aux" takes effect only per-device, not 
per-domain.

> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 7113fe398b70..2b3e9b91aee7 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct
> iommu_domain *domain,
>  {
>   int ret;
> 
> - if (!group->default_domain) {
> + if (!group->default_domain || domain->auxiliary) {
>   __iommu_group_for_each_dev(group, domain,
>  iommu_group_do_detach_device);
> - group->domain = NULL;
> + if (!domain->auxiliary)
> + group->domain = NULL;
>   return;
>   }
> 
> Not sure who would set this "auxiliary" attribute... Maybe the IOMMU
> driver, when attaching the domain to a device that has auxiliary mode
> enabled?
> 
> > why cannot ARM implement a detach_dev for aux_domain too? My
> > feeling is that default domain twist is only for switch between 1/2/3
> > in concept.
> 
> If the core actually calls it, we can implement detach_dev :) The
> problem is that the core never calls detach_dev when default_domain is
> present (affects any IOMMU driver that relies on default_domain,
> including AMD), and even in case 4) the default_domain is present for
> the parent device

Then can we change that core logic so detach_dev is invoked in all
cases? Yes, there will be some changes in vendor drivers, but I expect
this change to be trivial (especially considering the gain in IOMMU API
simplicity, as described below).

> 
> >> So the proposed interface doesn't seem to work as is. If we want to use
> >> iommu_attach/detach_device for auxiliary domains, the existing
> behavior
> >> of iommu.c, and IOMMU drivers that rely on it, have to change. Any
> >> change I can think of right now seems more daunting than introducing a
> >> different path for auxiliary domains, like iommu_attach_aux_domain for
> >> example.
> >>
> >
> > introducing *aux* specific API will cause different VFIO code path to
> > handle RID-based and PASID-based mdev, since RID-based still needs
> > to use normal attach_domain that way.
> 
> The PASID-based mdev still requires a special case to retrieve the
> allocated PASID and program it in the parent device, so VFIO will need
> to know the difference between the two
> 

that retrieve/program is done by the parent driver, instead of VFIO.

Thanks
Kevin


Re: [PATCH v7 6/6] iommu/arm-smmu: Support non-strict mode

2018-09-18 Thread Robin Murphy

On 2018-09-18 6:10 PM, Will Deacon wrote:

On Fri, Sep 14, 2018 at 03:30:24PM +0100, Robin Murphy wrote:

All we need is to wire up .flush_iotlb_all properly and implement the
domain attribute, and iommu-dma and io-pgtable-arm will do the rest for
us. Rather than bother implementing it for v7s format for the highly
unlikely chance of that being relevant, we can simply hide the
non-strict flag from io-pgtable for that combination just so anyone who
does actually try it will simply get over-invalidation instead of
failure to initialise domains.

Signed-off-by: Robin Murphy 
---
  drivers/iommu/arm-smmu.c | 40 +---
  1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index fd1b80ef9490..aa5be334753b 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ struct arm_smmu_domain {
const struct iommu_gather_ops   *tlb_ops;
struct arm_smmu_cfg cfg;
enum arm_smmu_domain_stage  stage;
+   boolnon_strict;
struct mutexinit_mutex; /* Protects smmu pointer */
spinlock_t  cb_lock; /* Serialises ATS1* ops and 
TLB syncs */
struct iommu_domain domain;
@@ -863,6 +864,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
  
+	if (smmu_domain->non_strict && cfg->fmt != ARM_SMMU_CTX_FMT_AARCH32_S)

+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;


Does this mean we end up over-invalidating when using short-descriptor?
Could we not bypass the flush queue in this case instead? Ideally, we'd
just reject the domain attribute but I don't know if we know about the
page-table format early enough for that. Alternatively, we could force
long format if the attribute is set.

What do you think?


If someone manages to run an arm64 kernel on a theoretical SMMUv2 
implementation which only supports short-descriptor, *and* explicitly 
sets the command-line option, then yes, they'll get both the synchronous 
TLBIs and the periodic TLBIALLs. As implied by the commit message, my 
natural response is "don't do that".


However, it will almost certainly take more effort to argue about it or 
come up with other bodges than it will to just implement the quirk in 
the v7s code, so if you really think it's a valid concern just shout.


Robin.


Re: [PATCH v7 5/6] iommu/arm-smmu-v3: Add support for non-strict mode

2018-09-18 Thread Robin Murphy

On 2018-09-18 6:10 PM, Will Deacon wrote:

On Fri, Sep 14, 2018 at 03:30:23PM +0100, Robin Murphy wrote:

From: Zhen Lei 

Dynamically choose strict or non-strict mode for page table config based
on the iommu domain type.


It's the domain type in conjunction with the attribute that determines
whether we use lazy or strict invalidation.


Signed-off-by: Zhen Lei 
[rm: convert to domain attribute]
Signed-off-by: Robin Murphy 
---
  drivers/iommu/arm-smmu-v3.c | 30 --
  1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f10c852479fc..7bbfa5f7ce8e 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -612,6 +612,7 @@ struct arm_smmu_domain {
struct mutexinit_mutex; /* Protects smmu pointer */
  
  	struct io_pgtable_ops		*pgtbl_ops;

+   boolnon_strict;
  
  	enum arm_smmu_domain_stage	stage;

union {
@@ -1633,6 +1634,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
  
+	if (smmu_domain->non_strict)

+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -1934,13 +1938,17 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
  {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
  
-	if (domain->type != IOMMU_DOMAIN_UNMANAGED)

-   return -EINVAL;
-
switch (attr) {
case DOMAIN_ATTR_NESTING:
+   if (domain->type != IOMMU_DOMAIN_UNMANAGED)
+   return -EINVAL;
*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
return 0;
+   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
+   if (domain->type != IOMMU_DOMAIN_DMA)
+   return -EINVAL;
+   *(int *)data = smmu_domain->non_strict;
+   return 0;
default:
return -ENODEV;


Hmm, there's a change in behaviour here (and also in the set function)
which is that unknown attributes now return -ENODEV for managed domains
instead of -EINVAL. I don't know if that's a problem, but I'd be inclined
to switch on the domain type and then have a nested switch for the supported
attributes.
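
For concreteness, the restructuring being suggested might look roughly as below (a
sketch only, not code from the posted series):

static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
				    enum iommu_attr attr, void *data)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

	switch (domain->type) {
	case IOMMU_DOMAIN_UNMANAGED:
		switch (attr) {
		case DOMAIN_ATTR_NESTING:
			*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
			return 0;
		default:
			return -ENODEV;
		}
	case IOMMU_DOMAIN_DMA:
		switch (attr) {
		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
			*(int *)data = smmu_domain->non_strict;
			return 0;
		default:
			return -ENODEV;
		}
	default:
		return -EINVAL;	/* unknown attribute on any other domain type */
	}
}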


Sure, a nested switch did actually cross my mind, but I was worried it 
might be a little boilerplate-heavy since there's still only one of each 
case (and this quick'n'dirty copy-paste job didn't need any thought...)


If that's your preference too, though, I'll respin both driver patches 
that way.


Robin.


Re: [PATCH v7 4/6] iommu: Add bootup option "iommu.non_strict"

2018-09-18 Thread Robin Murphy

On 2018-09-18 6:10 PM, Will Deacon wrote:

On Fri, Sep 14, 2018 at 03:30:22PM +0100, Robin Murphy wrote:

From: Zhen Lei 

Add a boot option so that the system administrator can choose which mode
to use. The default mode is strict.

Signed-off-by: Zhen Lei 
[rm: move handling out of SMMUv3 driver]
Signed-off-by: Robin Murphy 
---
  .../admin-guide/kernel-parameters.txt | 13 ++
  drivers/iommu/iommu.c | 26 +++
  2 files changed, 39 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9871e649ffef..406b91759b62 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1749,6 +1749,19 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
  
+	iommu.non_strict=	[ARM64]

+   Format: { "0" | "1" }
+   0 - strict mode, default.
+   Release IOVAs after the related TLBs are invalid
+   completely.
+   1 - non-strict mode.
+   Put off TLBs invalidation and release memory first.
+   It's good for scatter-gather performance but lacks
+   full isolation, an untrusted device can access the
+   reused memory because the TLBs may still valid.
+   Please take full consideration before choosing this
+   mode. Note that, VFIO will always use strict mode.


This text needs help. How about something like:

0 - strict mode, default.
Invalidate the TLB of the IOMMU hardware as part of every
unmap() operation.
1 - lazy mode.
Defer TLB invalidation so that the TLB of the IOMMU hardware
is invalidated periodically, rather than as part of every
unmap() operation.

(generally, I think I'd s/non strict/lazy/ in this patch to avoid the double
negatives)


+
iommu.passthrough=
[ARM64] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c15c5980299..2cabd0c0a4f3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -41,6 +41,7 @@ static unsigned int iommu_def_domain_type = 
IOMMU_DOMAIN_IDENTITY;
  #else
  static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
  #endif
+static bool iommu_dma_non_strict __read_mostly;
  
  struct iommu_callback_data {

const struct iommu_ops *ops;
@@ -131,6 +132,24 @@ static int __init iommu_set_def_domain_type(char *str)
  }
  early_param("iommu.passthrough", iommu_set_def_domain_type);
  
+static int __init iommu_dma_setup(char *str)

+{
+   int ret;
+
+   ret = kstrtobool(str, &iommu_dma_non_strict);
+   if (ret)
+   return ret;
+
+   if (iommu_dma_non_strict) {
+   pr_warn("WARNING: iommu non-strict mode is chosen.\n"
+   "It's good for scatter-gather performance but lacks full 
isolation\n");


Hmm, not sure about this message either and tainting is probably over the
top. Maybe drop the taint and just pr_info something like "IOMMU DMA ops
using lazy TLB invalidation: unable to protect against malicious devices"


+   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+   }
+
+   return 0;
+}
+early_param("iommu.non_strict", iommu_dma_setup);
+
  static ssize_t iommu_group_attr_show(struct kobject *kobj,
 struct attribute *__attr, char *buf)
  {
@@ -1072,6 +1091,13 @@ struct iommu_group *iommu_group_get_for_dev(struct 
device *dev)
group->default_domain = dom;
if (!group->domain)
group->domain = dom;
+
+   if (dom && iommu_dma_non_strict) {
+   int attr = 1;
+   iommu_domain_set_attr(dom,
+ DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+ &attr);
+   }


Hmm, I don't think we can guarantee that we're working with the DMA domain
here. Does this all fall out in the wash for the identity domain?


Indeed so - for one, I expect drivers to reject it for anything that 
isn't their own default DMA ops domain type (as #5 and #6 do), and 
furthermore it only has any effect once iommu_dma_init_domain() reads it 
back if it stuck, and other domain types should never be getting passed 
into there anyway.
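
For illustration, the set_attr side of that rejection might look roughly like this in a
driver (a sketch modelled on the SMMUv3 patch in this series, not code that was posted):

static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
				    enum iommu_attr attr, void *data)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

	switch (attr) {
	case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
		/* Only meaningful for the driver's own DMA ops domains */
		if (domain->type != IOMMU_DOMAIN_DMA)
			return -EINVAL;
		smmu_domain->non_strict = *(int *)data;
		return 0;
	default:
		return -ENODEV;
	}
}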


Robin.


Re: [PATCH v7 2/6] iommu/dma: Add support for non-strict mode

2018-09-18 Thread Robin Murphy

On 2018-09-18 6:10 PM, Will Deacon wrote:

Hi Robin,

On Fri, Sep 14, 2018 at 03:30:20PM +0100, Robin Murphy wrote:

From: Zhen Lei 

1. Save the related domain pointer in struct iommu_dma_cookie, make iovad
capable call domain->ops->flush_iotlb_all to flush TLB.
2. During the iommu domain initialization phase, base on domain->non_strict
field to check whether non-strict mode is supported or not. If so, call
init_iova_flush_queue to register iovad->flush_cb callback.
3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap
-->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to
put off iova freeing, and omit iommu_tlb_sync operation.


Hmm, this is basically just a commentary on the code. Please could you write
it more in terms of the problem that's being solved?


Sure - I intentionally kept a light touch when it came to the 
documentation and commit messages in this rework (other than patch #1 
where I eventually remembered the original reasoning and that it wasn't 
a bug). If we're more-or-less happy with the shape of the technical side 
I'll make sure to take a final pass through v8 to tidy up all the prose.



Signed-off-by: Zhen Lei 
[rm: convert raw boolean to domain attribute]
Signed-off-by: Robin Murphy 
---
  drivers/iommu/dma-iommu.c | 29 -
  include/linux/iommu.h |  1 +
  2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 511ff9a1d6d9..092e6926dc3c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -55,6 +55,9 @@ struct iommu_dma_cookie {
};
struct list_headmsi_page_list;
spinlock_t  msi_lock;
+
+   /* Only be assigned in non-strict mode, otherwise it's NULL */
+   struct iommu_domain *domain;
  };
  
  static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)

@@ -257,6 +260,17 @@ static int iova_reserve_iommu_regions(struct device *dev,
return ret;
  }
  
+static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)

+{
+   struct iommu_dma_cookie *cookie;
+   struct iommu_domain *domain;
+
+   cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
+   domain = cookie->domain;
+
+   domain->ops->flush_iotlb_all(domain);


Can we rely on this function pointer being non-NULL? I think it would
be better to call iommu_flush_tlb_all(cookie->domain) instead.


Yeah, that's deliberate - in fact I got as far as writing that change, 
then undid it as I realised that although the attribute conversion got 
rid of the explicit ops->flush_iotlb_all check, it still makes zero 
sense for an IOMMU driver to claim to support the flush queue attribute 
without also providing the relevant callback, so I do actually want this 
to blow up rather than silently do nothing if that assumption isn't met.



+}
+
  /**
   * iommu_dma_init_domain - Initialise a DMA mapping domain
   * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
@@ -275,6 +289,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, 
dma_addr_t base,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
unsigned long order, base_pfn, end_pfn;
+   int attr = 1;


Do we actually need to initialise this?


Oops, no, that's a left-over from the turned-out-messier-than-I-thought 
v6 implementation.


Thanks,
Robin.


Re: [PATCH v7 0/6] Add non-strict mode support for iommu-dma

2018-09-18 Thread Robin Murphy

Hi Will,

On 2018-09-18 6:10 PM, Will Deacon wrote:

Hi Robin,

Thanks for turning this around so quickly.


Cheers for a pretty rapid review too :)


On Fri, Sep 14, 2018 at 03:30:18PM +0100, Robin Murphy wrote:

Since we'd like to get this polished up and merged and Leizhen has other
commitments, here's v7 of the previous series[1] wherein I address all
my own feedback :) This is a quick tweak of the v6 I sent yesterday
since I figured out slightly too late a much neater way of setting the
attribute at the appropriate time.

The principal change is that I've inverted things slightly such that
it's now a generic domain attribute controlled by iommu-dma given the
necessary support from individual IOMMU drivers. That way we can easily
enable other drivers straight away, as I've done for SMMUv2 here (which
also allowed me to give it a quick test with MMU-401s on a Juno board).
Otherwise it's really just cosmetic cleanup and rebasing onto Will's
pending SMMU queue.


I've been through and had a look, leaving some small comments on the patches
themselves. The only part I failed to figure out is how you tie the lifetime
of the flush queue to the lifetime of the domain so that the timer callback
can't fire after e.g. the DMA cookie has been freed. How does that work?


Er, to be honest I haven't looked or even considered it! Other than the 
parts I've massaged I kinda took the functionality of the previous 
series for granted. Let me cross-check the x86 code and figure it out.


Robin.


Re: Explicit IOVA management from a PCIe endpoint driver

2018-09-18 Thread Stephen Warren

On 09/18/2018 04:59 AM, Robin Murphy wrote:

Hi Stephen,

On 17/09/18 22:36, Stephen Warren wrote:

Joerg, Christoph, Marek, Robin,

I believe that the driver for our PCIe endpoint controller hardware 
will need to explicitly manage its IOVA space more than current APIs 
allow. I'd like to discuss how to make that possible.

...

One final note:

The memory controller can translate accesses to a small region of DRAM 
address space into accesses to an interrupt generation module. This 
allows devices attached to the PCIe bus to generate interrupts to 
software running on the system with the PCIe endpoint controller. Thus 
I deliberately described API 3 above as mapping a specific physical 
address into IOVA space, as opposed to mapping an existing DRAM 
allocation into IOVA space, in order to allow mapping this interrupt 
generation address space into IOVA space. If we needed separate APIs 
to map physical addresses vs. DRAM allocations into IOVA space, that 
would likely be fine too.


If that's the standard DesignWare MSI dingaling, then all you should 
need to do is ensure your IOVA is reserved in your allocator (if it can 
be entirely outside the EP BAR, even better) - AFAIK the writes get 
completely intercepted such that they never go out to the SMMU side at 
all, and thus no actual mapping is even needed.


Unfortunately it's not. We have some custom hardware module (that 
already existed for other purposes, such as interaction/synchronization 
between various graphics modules) that we will slightly repurpose as a 
plain interrupt generator for PCIe endpoint use-cases.
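
For what it's worth, a rough sketch of that kind of explicit mapping using the existing
in-kernel IOMMU API (the function name and addresses are placeholders; the eventual
endpoint-framework API may look quite different):

#include <linux/iommu.h>

/*
 * Map the interrupt-generation MMIO window at a caller-chosen IOVA so the
 * remote PCIe host can raise interrupts by writing to it. Sketch only.
 */
static int ep_map_irq_window(struct device *dev, unsigned long iova,
			     phys_addr_t irq_gen_phys, size_t size)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

	if (!domain)
		return -ENODEV;

	return iommu_map(domain, iova, irq_gen_phys, size,
			 IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO);
}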



Does this API proposal sound reasonable?


Indeed, as I say apart from using streaming DMA for coherency management 
(which I think could be added in pretty much orthogonally later), this 
sounds like something you could plumb into the endpoint framework right 
now with no dependent changes elsewhere.


Great. I'll take a look at Oza's code and see about getting this 
implemented.


Thanks.


Re: [PATCH v7 6/6] iommu/arm-smmu: Support non-strict mode

2018-09-18 Thread Will Deacon
On Fri, Sep 14, 2018 at 03:30:24PM +0100, Robin Murphy wrote:
> All we need is to wire up .flush_iotlb_all properly and implement the
> domain attribute, and iommu-dma and io-pgtable-arm will do the rest for
> us. Rather than bother implementing it for v7s format for the highly
> unlikely chance of that being relevant, we can simply hide the
> non-strict flag from io-pgtable for that combination just so anyone who
> does actually try it will simply get over-invalidation instead of
> failure to initialise domains.
> 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/arm-smmu.c | 40 +---
>  1 file changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index fd1b80ef9490..aa5be334753b 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -246,6 +246,7 @@ struct arm_smmu_domain {
>   const struct iommu_gather_ops   *tlb_ops;
>   struct arm_smmu_cfg cfg;
>   enum arm_smmu_domain_stage  stage;
> + boolnon_strict;
>   struct mutexinit_mutex; /* Protects smmu pointer */
>   spinlock_t  cb_lock; /* Serialises ATS1* ops and 
> TLB syncs */
>   struct iommu_domain domain;
> @@ -863,6 +864,9 @@ static int arm_smmu_init_domain_context(struct 
> iommu_domain *domain,
>   if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
>   pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
>  
> + if (smmu_domain->non_strict && cfg->fmt != ARM_SMMU_CTX_FMT_AARCH32_S)
> + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;

Does this mean we end up over-invalidating when using short-descriptor?
Could we not bypass the flush queue in this case instead? Ideally, we'd
just reject the domain attribute but I don't know if we know about the
page-table format early enough for that. Alternatively, we could force
long format if the attribute is set.

What do you think?

Will


Re: [PATCH v7 5/6] iommu/arm-smmu-v3: Add support for non-strict mode

2018-09-18 Thread Will Deacon
On Fri, Sep 14, 2018 at 03:30:23PM +0100, Robin Murphy wrote:
> From: Zhen Lei 
> 
> Dynamically choose strict or non-strict mode for page table config based
> on the iommu domain type.

It's the domain type in conjunction with the attribute that determines
whether we use lazy or strict invalidation.

> Signed-off-by: Zhen Lei 
> [rm: convert to domain attribute]
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/arm-smmu-v3.c | 30 --
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f10c852479fc..7bbfa5f7ce8e 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -612,6 +612,7 @@ struct arm_smmu_domain {
>   struct mutexinit_mutex; /* Protects smmu pointer */
>  
>   struct io_pgtable_ops   *pgtbl_ops;
> + boolnon_strict;
>  
>   enum arm_smmu_domain_stage  stage;
>   union {
> @@ -1633,6 +1634,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
> *domain)
>   if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
>   pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
>  
> + if (smmu_domain->non_strict)
> + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
> +
>   pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
>   if (!pgtbl_ops)
>   return -ENOMEM;
> @@ -1934,13 +1938,17 @@ static int arm_smmu_domain_get_attr(struct 
> iommu_domain *domain,
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>  
> - if (domain->type != IOMMU_DOMAIN_UNMANAGED)
> - return -EINVAL;
> -
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
> + if (domain->type != IOMMU_DOMAIN_UNMANAGED)
> + return -EINVAL;
>   *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
>   return 0;
> + case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
> + if (domain->type != IOMMU_DOMAIN_DMA)
> + return -EINVAL;
> + *(int *)data = smmu_domain->non_strict;
> + return 0;
>   default:
>   return -ENODEV;

Hmm, there's a change in behaviour here (and also in the set function)
which is that unknown attributes now return -ENODEV for managed domains
instead of -EINVAL. I don't know if that's a problem, but I'd be inclined
to switch on the domain type and then have a nested switch for the supported
attributes.

Will


Re: [PATCH v7 4/6] iommu: Add bootup option "iommu.non_strict"

2018-09-18 Thread Will Deacon
On Fri, Sep 14, 2018 at 03:30:22PM +0100, Robin Murphy wrote:
> From: Zhen Lei 
> 
> Add a boot option so that the system administrator can choose which mode
> to use. The default mode is strict.
> 
> Signed-off-by: Zhen Lei 
> [rm: move handling out of SMMUv3 driver]
> Signed-off-by: Robin Murphy 
> ---
>  .../admin-guide/kernel-parameters.txt | 13 ++
>  drivers/iommu/iommu.c | 26 +++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9871e649ffef..406b91759b62 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1749,6 +1749,19 @@
>   nobypass[PPC/POWERNV]
>   Disable IOMMU bypass, using IOMMU for PCI devices.
>  
> + iommu.non_strict=   [ARM64]
> + Format: { "0" | "1" }
> + 0 - strict mode, default.
> + Release IOVAs after the related TLBs are invalid
> + completely.
> + 1 - non-strict mode.
> + Put off TLBs invalidation and release memory first.
> + It's good for scatter-gather performance but lacks
> + full isolation, an untrusted device can access the
> + reused memory because the TLBs may still valid.
> + Please take full consideration before choosing this
> + mode. Note that, VFIO will always use strict mode.

This text needs help. How about something like:

0 - strict mode, default.
Invalidate the TLB of the IOMMU hardware as part of every
unmap() operation.
1 - lazy mode.
Defer TLB invalidation so that the TLB of the IOMMU hardware
is invalidated periodically, rather than as part of every
unmap() operation.

(generally, I think I'd s/non strict/lazy/ in this patch to avoid the double
negatives)

> +
>   iommu.passthrough=
>   [ARM64] Configure DMA to bypass the IOMMU by default.
>   Format: { "0" | "1" }
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 8c15c5980299..2cabd0c0a4f3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -41,6 +41,7 @@ static unsigned int iommu_def_domain_type = 
> IOMMU_DOMAIN_IDENTITY;
>  #else
>  static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
>  #endif
> +static bool iommu_dma_non_strict __read_mostly;
>  
>  struct iommu_callback_data {
>   const struct iommu_ops *ops;
> @@ -131,6 +132,24 @@ static int __init iommu_set_def_domain_type(char *str)
>  }
>  early_param("iommu.passthrough", iommu_set_def_domain_type);
>  
> +static int __init iommu_dma_setup(char *str)
> +{
> + int ret;
> +
> + ret = kstrtobool(str, &iommu_dma_non_strict);
> + if (ret)
> + return ret;
> +
> + if (iommu_dma_non_strict) {
> + pr_warn("WARNING: iommu non-strict mode is chosen.\n"
> + "It's good for scatter-gather performance but lacks 
> full isolation\n");

Hmm, not sure about this message either and tainting is probably over the
top. Maybe drop the taint and just pr_info something like "IOMMU DMA ops
using lazy TLB invalidation: unable to protect against malicious devices"

> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> + }
> +
> + return 0;
> +}
> +early_param("iommu.non_strict", iommu_dma_setup);
> +
>  static ssize_t iommu_group_attr_show(struct kobject *kobj,
>struct attribute *__attr, char *buf)
>  {
> @@ -1072,6 +1091,13 @@ struct iommu_group *iommu_group_get_for_dev(struct 
> device *dev)
>   group->default_domain = dom;
>   if (!group->domain)
>   group->domain = dom;
> +
> + if (dom && iommu_dma_non_strict) {
> + int attr = 1;
> + iommu_domain_set_attr(dom,
> +   DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> +   &attr);
> + }

Hmm, I don't think we can guarantee that we're working with the DMA domain
here. Does this all fall out in the wash for the identity domain?

Will


Re: [PATCH v7 2/6] iommu/dma: Add support for non-strict mode

2018-09-18 Thread Will Deacon
Hi Robin,

On Fri, Sep 14, 2018 at 03:30:20PM +0100, Robin Murphy wrote:
> From: Zhen Lei 
> 
> 1. Save the related domain pointer in struct iommu_dma_cookie, make iovad
>capable call domain->ops->flush_iotlb_all to flush TLB.
> 2. During the iommu domain initialization phase, base on domain->non_strict
>field to check whether non-strict mode is supported or not. If so, call
>init_iova_flush_queue to register iovad->flush_cb callback.
> 3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap
>-->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to
>put off iova freeing, and omit iommu_tlb_sync operation.

Hmm, this is basically just a commentary on the code. Please could you write
it more in terms of the problem that's being solved?

> Signed-off-by: Zhen Lei 
> [rm: convert raw boolean to domain attribute]
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/dma-iommu.c | 29 -
>  include/linux/iommu.h |  1 +
>  2 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 511ff9a1d6d9..092e6926dc3c 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -55,6 +55,9 @@ struct iommu_dma_cookie {
>   };
>   struct list_headmsi_page_list;
>   spinlock_t  msi_lock;
> +
> + /* Only be assigned in non-strict mode, otherwise it's NULL */
> + struct iommu_domain *domain;
>  };
>  
>  static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
> @@ -257,6 +260,17 @@ static int iova_reserve_iommu_regions(struct device *dev,
>   return ret;
>  }
>  
> +static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
> +{
> + struct iommu_dma_cookie *cookie;
> + struct iommu_domain *domain;
> +
> + cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
> + domain = cookie->domain;
> +
> + domain->ops->flush_iotlb_all(domain);

Can we rely on this function pointer being non-NULL? I think it would
be better to call iommu_flush_tlb_all(cookie->domain) instead.

> +}
> +
>  /**
>   * iommu_dma_init_domain - Initialise a DMA mapping domain
>   * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
> @@ -275,6 +289,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, 
> dma_addr_t base,
>   struct iommu_dma_cookie *cookie = domain->iova_cookie;
>   struct iova_domain *iovad = &cookie->iovad;
>   unsigned long order, base_pfn, end_pfn;
> + int attr = 1;

Do we actually need to initialise this?

Will


Re: [PATCH v7 0/6] Add non-strict mode support for iommu-dma

2018-09-18 Thread Will Deacon
Hi Robin,

Thanks for turning this around so quickly.

On Fri, Sep 14, 2018 at 03:30:18PM +0100, Robin Murphy wrote:
> Since we'd like to get this polished up and merged and Leizhen has other
> commitments, here's v7 of the previous series[1] wherein I address all
> my own feedback :) This is a quick tweak of the v6 I sent yesterday
> since I figured out slightly too late a much neater way of setting the
> attribute at the appropriate time.
> 
> The principal change is that I've inverted things slightly such that
> it's now a generic domain attribute controlled by iommu-dma given the
> necessary support from individual IOMMU drivers. That way we can easily
> enable other drivers straight away, as I've done for SMMUv2 here (which
> also allowed me to give it a quick test with MMU-401s on a Juno board).
> Otherwise it's really just cosmetic cleanup and rebasing onto Will's
> pending SMMU queue.

I've been through and had a look, leaving some small comments on the patches
themselves. The only part I failed to figure out is how you tie the lifetime
of the flush queue to the lifetime of the domain so that the timer callback
can't fire after e.g. the DMA cookie has been freed. How does that work?

Cheers,

Will


Re: move swiotlb noncoherent dma support from arm64 to generic code

2018-09-18 Thread Christoph Hellwig
On Tue, Sep 18, 2018 at 02:28:42PM +0100, Robin Murphy wrote:
> On 17/09/18 16:38, Christoph Hellwig wrote:
>> Hi all,
>>
>> this series starts with various swiotlb cleanups, then adds support for
>> non-cache coherent devices to the generic swiotlb support, and finally
>> switches arm64 to use the generic code.
>
> I think there's going to be an issue with the embedded folks' grubby hack 
> in arm64's mem_init() which skips initialising SWIOTLB at all with 
> sufficiently little DRAM. I've been waiting for 
> dma-direct-noncoherent-merge so that I could fix that case to swizzle in 
> dma_direct_ops and avoid swiotlb_dma_ops entirely.

I'm waiting for your review of dma-direct-noncoherent-merge to put it
into dma-mapping for-next.

That being said one thing I'm investigating is to eventually further
merge dma_direct_ops and swiotlb_ops - the reason for that being that
I want to remove the indirect calls for the common direct mapping case,
and if we don't merge them that will get complicated.  Note that
swiotlb will generally just work if you don't initialize the buffer
as long as we never see a physical address large enough to cause bounce
buffering.

>
>> Given that this series depends on patches in the dma-mapping tree, or
>> pending for it I've also published a git tree here:
>>
>>  git://git.infradead.org/users/hch/misc.git swiotlb-noncoherent
>
> However, upon sitting down to eagerly write that patch I've just 
> boot-tested the above branch as-is for a baseline and discovered a rather 
> more significant problem: arch_dma_alloc() is recursing back into 
> __swiotlb_alloc() and blowing the stack. Not good :(

Oops, I messed up when renaming things.  Try this patch on top:

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 83e597101c6a..c75c721eb74e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -955,7 +955,7 @@ void *__swiotlb_alloc(struct device *dev, size_t size, 
dma_addr_t *dma_handle,
 */
gfp |= __GFP_NOWARN;
 
-   vaddr = dma_direct_alloc(dev, size, dma_handle, gfp, attrs);
+   vaddr = dma_direct_alloc_pages(dev, size, dma_handle, gfp, attrs);
if (!vaddr)
vaddr = swiotlb_alloc_buffer(dev, size, dma_handle, attrs);
return vaddr;
@@ -973,7 +973,7 @@ void __swiotlb_free(struct device *dev, size_t size, void 
*vaddr,
dma_addr_t dma_addr, unsigned long attrs)
 {
if (!swiotlb_free_buffer(dev, size, dma_addr))
-   dma_direct_free(dev, size, vaddr, dma_addr, attrs);
+   dma_direct_free_pages(dev, size, vaddr, dma_addr, attrs);
 }
 
 static void swiotlb_free(struct device *dev, size_t size, void *vaddr,


Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

2018-09-18 Thread Jean-Philippe Brucker
On 15/09/2018 03:36, Tian, Kevin wrote:
>> 4) Userspace opens another mdev.
>> -> iommu.c calls domain->ops->attach_dev(domain2, dev)
> 
> another mdev in same VFIO container or different? I assume the
> latter since you mentioned a new domain2.

I was thinking a different VFIO container actually. I used domain2 to
try to make the example clearer

>> 1)? When the container is closed, VFIO calls
>> iommu_detach_device(domain2, parent_dev)
>> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
>> Given that auxiliary domains are attached, the IOMMU driver could deduce
>> that this actually means "detach an auxiliary domain". But which one?
> 
> I didn't get this one. There is no need to stick to 1) behavior for
> 4), i.e. below is expected:
>     domain2->ops->detach_dev(domain2, dev)

But in order to get that, the IOMMU core needs to know that domain2 is
auxiliary. Otherwise, detach_dev is never called when a default_domain
is present for the parent dev.

I guess one solution is to add an "auxiliary" attribute to iommu_domain,
so __iommu_detach_group would do something like:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7113fe398b70..2b3e9b91aee7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct
iommu_domain *domain,
 {
int ret;

-   if (!group->default_domain) {
+   if (!group->default_domain || domain->auxiliary) {
__iommu_group_for_each_dev(group, domain,
   iommu_group_do_detach_device);
-   group->domain = NULL;
+   if (!domain->auxiliary)
+   group->domain = NULL;
return;
}

Not sure who would set this "auxiliary" attribute... Maybe the IOMMU
driver, when attaching the domain to a device that has auxiliary mode
enabled?

> why cannot ARM implement a detach_dev for aux_domain too? My
> feeling is that default domain twist is only for switch between 1/2/3
> in concept.

If the core actually calls it, we can implement detach_dev :) The
problem is that the core never calls detach_dev when default_domain is
present (affects any IOMMU driver that relies on default_domain,
including AMD), and even in case 4) the default_domain is present for
the parent device.

>> So the proposed interface doesn't seem to work as is. If we want to use
>> iommu_attach/detach_device for auxiliary domains, the existing behavior
>> of iommu.c, and IOMMU drivers that rely on it, have to change. Any
>> change I can think of right now seems more daunting than introducing a
>> different path for auxiliary domains, like iommu_attach_aux_domain for
>> example.
>> 
> 
> introducing *aux* specific API will cause different VFIO code path to
> handle RID-based and PASID-based mdev, since RID-based still needs
> to use normal attach_domain that way.

The PASID-based mdev still requires a special case to retrieve the
allocated PASID and program it in the parent device, so VFIO will need
to know the difference between the two.
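
For what it's worth, that distinction could be as small as a branch in the
VFIO attach path. A rough sketch with invented helper names (none of these
functions exist as such in this series):

/* Hypothetical VFIO-side attach, only to illustrate the two mdev flavours. */
if (mdev_is_pasid_based(mdev)) {                        /* invented predicate */
        ret = iommu_attach_aux_domain(domain, parent_dev);  /* proposed aux path */
        if (!ret) {
                pasid = iommu_aux_get_pasid(domain, parent_dev); /* invented */
                parent_program_pasid(mdev, pasid);      /* device-specific step */
        }
} else {
        ret = iommu_attach_device(domain, parent_dev);  /* RID-based path */
}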

Thanks,
Jean

> well, this argument is not very strong
> in itself, if indeed proposed way doesn't work for ARM. But let's see
> whether it is the case with more understanding of your actual concern.
> 
> Thanks
> Kevin



Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

2018-09-18 Thread Jean-Philippe Brucker
On 14/09/2018 22:04, Jacob Pan wrote:
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
> There is a difference in case of guest SVA. VT-d v3 will bind guest
> PASID and guest CR3 instead of the guest PASID table. Then turn on
> nesting. In case of mdev, the second level is obtained from the aux
> domain which was setup for the default PASID. Or in case of PCI device,
> second level is harvested from RID2PASID.

Right, though I wasn't talking about the host managing guest SVA here,
but a kernel binding the address space of one of its userspace drivers
to the mdev.

>> So (2) would use iommu_sva_bind_device(), 
> We would need something different than that for guest bind, just to show
> the two cases:>
> int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int
> *pasid, unsigned long flags, void *drvdata)
> 
> (WIP)
> int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> where:
> /**
>  * struct gpasid_bind_data - Information about device and guest PASID
> binding
>  * @pasid:   Process address space ID used for the guest mm
>  * @addr_width:  Guest address width. Paging mode can also be derived.
>  * @gcr3:    Guest CR3 value from guest mm
>  */
> struct gpasid_bind_data {
>     __u32 pasid;
>     __u64 gcr3;
>     __u32 addr_width;
>     __u32 flags;
> #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> };
> Perhaps there is room to merge with io_mm but the life cycle management
> of guest PASID and host PASID will be different if you rely on mm
> release callback than FD.

I think gpasid management should stay separate from io_mm, since in your
case VFIO mechanisms are used for life cycle management of the VM,
similarly to the former bind_pasid_table proposal. For example closing
the container fd would unbind all guest page tables. The QEMU process'
address space lifetime seems like the wrong thing to track for gpasid.

>> but (1) needs something
>> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
>> domain to second-level or nested translation? It seems silly to use a
>> different API for first-level, since the flow in userspace and VFIO
>> is the same as your second-level case as far as MAP_DMA ioctl goes.
>> The difference is that in your case the auxiliary domain supports an
>> additional operation which binds first-level page tables. An
>> auxiliary domain that only supports first-level wouldn't support this
>> operation, but it can still implement iommu_map/unmap/etc.
>> 
> I think the intention is that when a mdev is created, we don;t
> know whether it will be used for SVA or IOVA. So aux domain is here to
> "hold a spot" for the default PASID such that MAP_DMA calls can work as
> usual, which is second level only. Later, if SVA is used on the mdev
> there will be another PASID allocated for that purpose.
> Do we need to create an aux domain for each PASID? the translation can
> be looked up by the combination of parent dev and pasid.

When allocating a new PASID for the guest, I suppose you need to clone
the second-level translation config? In which case a single aux domain
for the mdev might be easier to implement in the IOMMU driver. Entirely
up to you since we don't have this case on SMMUv3

Thanks,
Jean


[PATCH] iommu: Fix a typo

2018-09-18 Thread Rami Rosen
This patch fixes a typo in iommu.c.

Signed-off-by: Rami Rosen 
---
 drivers/iommu/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f3698006cb53..022020144f33 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1416,7 +1416,7 @@ struct iommu_domain *iommu_get_domain_for_dev(struct 
device *dev)
 EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
 
 /*
- * IOMMU groups are really the natrual working unit of the IOMMU, but
+ * IOMMU groups are really the natural working unit of the IOMMU, but
  * the IOMMU API works on domains and devices.  Bridge that gap by
  * iterating over the devices in a group.  Ideally we'd have a single
  * device which represents the requestor ID of the group, but we also
-- 
2.17.1



[RFC v2 20/20] iommu/smmuv3: Report non recoverable faults

2018-09-18 Thread Eric Auger
When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, i.e. users
who registered a fault handler.

Signed-off-by: Eric Auger 
---
 drivers/iommu/arm-smmu-v3.c | 124 
 1 file changed, 113 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d9d300ab62a5..948fc82fc4ce 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -178,6 +178,26 @@
 #define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8
 #define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc
 
+/* Events */
+#define ARM_SMMU_EVT_F_UUT 0x01
+#define ARM_SMMU_EVT_C_BAD_STREAMID0x02
+#define ARM_SMMU_EVT_F_STE_FETCH   0x03
+#define ARM_SMMU_EVT_C_BAD_STE 0x04
+#define ARM_SMMU_EVT_F_BAD_ATS_TREQ0x05
+#define ARM_SMMU_EVT_F_STREAM_DISABLED 0x06
+#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN0x07
+#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID 0x08
+#define ARM_SMMU_EVT_F_CD_FETCH0x09
+#define ARM_SMMU_EVT_C_BAD_CD  0x0a
+#define ARM_SMMU_EVT_F_WALK_EABT   0x0b
+#define ARM_SMMU_EVT_F_TRANSLATION 0x10
+#define ARM_SMMU_EVT_F_ADDR_SIZE   0x11
+#define ARM_SMMU_EVT_F_ACCESS  0x12
+#define ARM_SMMU_EVT_F_PERMISSION  0x13
+#define ARM_SMMU_EVT_F_TLB_CONFLICT0x20
+#define ARM_SMMU_EVT_F_CFG_CONFLICT0x21
+#define ARM_SMMU_EVT_E_PAGE_REQUEST0x24
+
 /* Common MSI config fields */
 #define MSI_CFG0_ADDR_MASK GENMASK_ULL(51, 2)
 #define MSI_CFG2_SHGENMASK(5, 4)
@@ -343,6 +363,11 @@
 #define EVTQ_MAX_SZ_SHIFT  7
 
 #define EVTQ_0_ID  GENMASK_ULL(7, 0)
+#define EVTQ_0_SUBSTREAMID GENMASK_ULL(31, 12)
+#define EVTQ_0_STREAMIDGENMASK_ULL(63, 32)
+#define EVTQ_1_S2  GENMASK_ULL(39, 39)
+#define EVTQ_1_CLASS   GENMASK_ULL(41, 40)
+#define EVTQ_3_FETCH_ADDR  GENMASK_ULL(51, 3)
 
 /* PRI queue */
 #define PRIQ_ENT_DWORDS2
@@ -1250,7 +1275,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device 
*smmu, u32 sid)
return 0;
 }
 
-__maybe_unused
 static struct arm_smmu_master_data *
 arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 {
@@ -1276,24 +1300,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 
sid)
return master;
 }
 
+static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt)
+{
+   u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]);
+   u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]);
+   bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]);
+   u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
+   struct arm_smmu_master_data *master;
+   struct iommu_fault_event event;
+   bool propagate = true;
+   u64 addr = evt[2];
+   int i;
+
+   master = arm_smmu_find_master(smmu, sid);
+   if (WARN_ON(!master))
+   return;
+
+   event.fault.type = IOMMU_FAULT_DMA_UNRECOV;
+
+   switch (type) {
+   case ARM_SMMU_EVT_C_BAD_STREAMID:
+   event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID;
+   break;
+   case ARM_SMMU_EVT_F_STREAM_DISABLED:
+   case ARM_SMMU_EVT_C_BAD_SUBSTREAMID:
+   event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID;
+   break;
+   case ARM_SMMU_EVT_F_CD_FETCH:
+   event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH;
+   break;
+   case ARM_SMMU_EVT_F_WALK_EABT:
+   event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT;
+   event.fault.addr = addr;
+   event.fault.fetch_addr = fetch_addr;
+   propagate = s1;
+   break;
+   case ARM_SMMU_EVT_F_TRANSLATION:
+   event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH;
+   event.fault.addr = addr;
+   event.fault.fetch_addr = fetch_addr;
+   propagate = s1;
+   break;
+   case ARM_SMMU_EVT_F_PERMISSION:
+   event.fault.reason = IOMMU_FAULT_REASON_PERMISSION;
+   event.fault.addr = addr;
+   propagate = s1;
+   break;
+   case ARM_SMMU_EVT_F_ACCESS:
+   event.fault.reason = IOMMU_FAULT_REASON_ACCESS;
+   event.fault.addr = addr;
+   propagate = s1;
+   break;
+   case ARM_SMMU_EVT_C_BAD_STE:
+   event.fault.reason = 
IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY;
+   break;
+   case ARM_SMMU_EVT_C_BAD_CD:
+   event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY;
+   break;
+   case ARM_SMMU_EVT_F_ADDR_SIZE:
+   event.fault.reason = IOMMU_FAULT_REASON_OOR_ADDRESS;
+   propagate = s1;
+   break;
+   case ARM_SMMU_EVT_F_STE_FETCH:
+   event.fault.reason = IOMMU_FAULT_REASON_DEVICE_CONTEX

[RFC v2 19/20] vfio: Document nested stage control

2018-09-18 Thread Eric Auger
New ioctls were introduced to pass information about guest stage 1
to the host through VFIO. Let's document the nested stage control.

Signed-off-by: Eric Auger 

---

fault reporting is currently missing from the picture

v1 -> v2:
- use the new ioctl names
- add doc related to fault handling
---
 Documentation/vfio.txt | 60 ++
 1 file changed, 60 insertions(+)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..d4611759d306 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -239,6 +239,66 @@ group and can access them as follows::
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+
+
+Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to
+the ARM terminology while "level" corresponds to Intel's VT-d terminology. In
+the following text we use either without distinction.
+
+This is useful when the guest is exposed to a virtual IOMMU and some
+devices are assigned to the guest through VFIO. Then the guest OS can use
+stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation
+(GPA -> HPA).
+
+The guest gets ownership of the stage 1 page tables and also owns stage 1
+configuration structures. The hypervisor owns the root configuration structure
+(for security reasons), including stage 2 configuration. This works as long as
+configuration structures and page table formats are compatible between the
+virtual IOMMU and the physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use the stage 2, leaving stage 1 available for
+guest usage.
+
+Once groups are attached to the container, the guest stage 1 translation
+configuration data can be passed to VFIO by using
+
+ioctl(container, VFIO_IOMMU_BIND_PASID_TABLE, &pasid_table_info);
+
+This allows combining the guest stage 1 configuration structure with the
+hypervisor stage 2 configuration structure. Stage 1 configuration structures
+are dependent on the IOMMU type.
+
+As the stage 1 translation is fully delegated to the HW, physical events that
+may occur (especially translation faults), need to be propagated up to
+the virtualizer and re-injected into the guest.
+
+By calling ioctl(container, VFIO_IOMMU_SET_FAULT_EVENTFD, &fault_config),
+the virtualizer can register an eventfd. The latter will be signalled
+whenever an event is detected at the physical level. The fault handler,
+executed at userspace level, has to call:
+ioctl(container, VFIO_IOMMU_GET_FAULT_EVENTS, &fault_info) to retrieve
+the pending fault events.
+
+When the guest invalidates stage 1 related caches, invalidations must be
+forwarded to the host through
+ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);
+Those invalidations can happen at various granularity levels, page, context, 
...
+
+The ARM SMMU specification introduces another challenge: MSIs are translated by
+both the virtual SMMU and the physical SMMU. To build a nested mapping for the
+IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI
+doorbell GPA binding to the host. Then the hypervisor can build a nested stage 
2
+binding eventually translating into the physical MSI doorbell.
+
+This is achieved by
+ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);
+
 VFIO User API
 ---
 
-- 
2.17.1
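
To tie the pieces together, here is roughly the userspace sequence for the
flow documented above. It is only a sketch: it assumes the uapi headers
proposed in this series, and the structure contents beyond argsz are left as
comments because their exact layout is defined elsewhere in the series.

/* Sketch of a VMM setting up nested translation for a container. */
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int setup_nested(int container)
{
        struct vfio_iommu_type1_bind_pasid_table pasid_table_info = {
                .argsz = sizeof(pasid_table_info),
                /* + guest PASID/CD table description (per-IOMMU format) */
        };
        struct vfio_iommu_type1_set_fault_eventfd fault_config = {
                .argsz = sizeof(fault_config),
                /* + eventfd and fault queue size */
        };
        struct vfio_iommu_type1_bind_guest_msi guest_binding = {
                .argsz = sizeof(guest_binding),
                /* + gIOVA/gDB doorbell binding */
        };

        /* Stage 2 stays with the host, stage 1 is left to the guest */
        if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU))
                return -1;

        /* ... attach groups, build stage 2 with VFIO_IOMMU_MAP_DMA ... */

        if (ioctl(container, VFIO_IOMMU_BIND_PASID_TABLE, &pasid_table_info))
                return -1;
        if (ioctl(container, VFIO_IOMMU_SET_FAULT_EVENTFD, &fault_config))
                return -1;
        if (ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding))
                return -1;

        return 0;
}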



[RFC v2 17/20] vfio: VFIO_IOMMU_SET_FAULT_EVENTFD

2018-09-18 Thread Eric Auger
VFIO_IOMMU_SET_FAULT_EVENTFD allows associating a fault
event handler with a container. This is useful when the
guest owns the first translation stage and the host uses
the second stage. As the guest translation is fully handled
by the physical IOMMU, any fault that happens needs
to be propagated to the guest.

The userspace passes an eventfd and a queue size. Each time
a fault occurs on the physical IOMMU side, the fault is pushed to
a kfifo and the eventfd is signalled. The userspace handler,
woken up by the eventfd signalling, then reads the populated fifo
and can inject the faults into the virtual IOMMU.

Signed-off-by: Lan Tianyu 
Signed-off-by: Eric Auger 

---

original API proposed by Lan in
[RFC PATCH 0/3] VFIO: Report IOMMU fault event to userspace
(https://www.spinics.net/lists/kvm/msg145280.html)

An issue with the current implementation of
iommu_register_device_fault_handler() is that it expects
responses to empty the list of pending faults. This looks
like overkill for unrecoverable faults, as responses are not mandated.
---
 drivers/vfio/vfio_iommu_type1.c | 150 ++--
 include/uapi/linux/vfio.h   |  16 
 2 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ffd46c26dc3..e52cbeb479c3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -41,6 +41,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -66,6 +68,9 @@ struct vfio_iommu {
struct blocking_notifier_head notifier;
boolv2;
boolnesting;
+   struct eventfd_ctx  *fault_ctx;
+   DECLARE_KFIFO_PTR(fault_fifo, struct iommu_fault);
+   struct mutexfault_lock;
 };
 
 struct vfio_domain {
@@ -114,6 +119,7 @@ struct vfio_regions {
(!list_empty(&iommu->domain_list))
 
 static int put_pfn(unsigned long pfn, int prot);
+static int vfio_iommu_teardown_fault_eventfd(struct vfio_iommu *iommu);
 
 /*
  * This code handles mapping and unmapping of user data buffers
@@ -1527,15 +1533,26 @@ static void vfio_sanity_check_pfn_list(struct 
vfio_iommu *iommu)
WARN_ON(iommu->notifier.head);
 }
 
+static inline int
+vfio_unregister_fault_handler_fn(struct device *dev, void *data)
+{
+   return iommu_unregister_device_fault_handler(dev);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
  struct iommu_group *iommu_group)
 {
struct vfio_iommu *iommu = iommu_data;
struct vfio_domain *domain;
struct vfio_group *group;
+   int ret;
 
mutex_lock(&iommu->lock);
 
+   ret = iommu_group_for_each_dev(iommu_group, NULL,
+  vfio_unregister_fault_handler_fn);
+   WARN_ON(ret);
+
if (iommu->external_domain) {
group = find_iommu_group(iommu->external_domain, iommu_group);
if (group) {
@@ -1613,6 +1630,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
INIT_LIST_HEAD(&iommu->domain_list);
iommu->dma_list = RB_ROOT;
mutex_init(&iommu->lock);
+   mutex_init(&iommu->fault_lock);
BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
 
return iommu;
@@ -1682,25 +1700,22 @@ static int vfio_cache_inv_fn(struct device *dev, void 
*data)
return iommu_cache_invalidate(d, dev, &ustruct->info);
 }
 
-static int
-vfio_cache_invalidate(struct vfio_iommu *iommu,
- struct vfio_iommu_type1_cache_invalidate *ustruct)
+/* iommu->lock must be held */
+static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu, void *data,
+  int (*fn)(struct device *, void *))
 {
struct vfio_domain *d;
struct vfio_group *g;
int ret = 0;
 
-   mutex_lock(&iommu->lock);
-
list_for_each_entry(d, &iommu->domain_list, next) {
list_for_each_entry(g, &d->group_list, next) {
-   ret = iommu_group_for_each_dev(g->iommu_group, ustruct,
-  vfio_cache_inv_fn);
+   ret = iommu_group_for_each_dev(g->iommu_group,
+  data, fn);
if (ret)
break;
}
}
-   mutex_unlock(&iommu->lock);
return ret;
 }
 
@@ -1740,6 +1755,102 @@ vfio_bind_pasid_table(struct vfio_iommu *iommu,
return ret;
 }
 
+static int
+vfio_iommu_fault_handler(struct iommu_fault_event *event, void *data)
+{
+   struct vfio_iommu *iommu = (struct vfio_iommu *)data;
+   int ret;
+
+   mutex_lock(&iommu->fault_lock);
+   if (!iommu->fault_ctx)
+   goto out;
+   ret = kfifo_put(&iommu->fault_fifo, event->fault

[RFC v2 18/20] vfio: VFIO_IOMMU_GET_FAULT_EVENTS

2018-09-18 Thread Eric Auger
Introduce an ioctl that allows userspace to consume fault
events that may have occurred. The userspace buffer gets filled
with pending faults, if any. The number of filled faults is
reported to userspace. Read faults are removed from the kfifo;
the kernel does not expect any response from userspace.

Signed-off-by: Lan Tianyu 
Signed-off-by: Eric Auger 

---

the fault_lock may not be needed as kfifo documentation says:
"Note that with only one concurrent reader and one concurrent
writer, you don't need extra locking to use these macro."
---
 drivers/vfio/vfio_iommu_type1.c | 33 -
 include/uapi/linux/vfio.h   | 10 ++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e52cbeb479c3..917bb8f9c9ae 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1769,7 +1769,7 @@ vfio_iommu_fault_handler(struct iommu_fault_event *event, 
void *data)
eventfd_signal(iommu->fault_ctx, 1);
 out:
mutex_unlock(&iommu->fault_lock);
-   return 0;
+   return ret;
 }
 
 static inline int
@@ -1981,6 +1981,37 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return -EINVAL;
 
return vfio_iommu_set_fault_eventfd(iommu, &ustruct);
+   } else if (cmd == VFIO_IOMMU_GET_FAULT_EVENTS) {
+   struct vfio_iommu_type1_get_fault_events ustruct;
+   size_t buf_size, len, elem_size;
+   int copied, max_events, ret;
+
+   minsz = offsetofend(struct vfio_iommu_type1_get_fault_events,
+   reserved);
+
+   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ustruct.argsz < minsz || ustruct.flags)
+   return -EINVAL;
+
+   elem_size = sizeof(struct iommu_fault);
+   buf_size = ustruct.argsz - minsz;
+   max_events = buf_size / elem_size;
+   len = max_events * elem_size;
+
+   mutex_lock(&iommu->fault_lock);
+
+   ret = kfifo_to_user(&iommu->fault_fifo,
+   (void __user *)(arg + minsz), len, &copied);
+
+   mutex_unlock(&iommu->fault_lock);
+   if (ret)
+   return ret;
+
+   ustruct.count = copied / elem_size;
+
+   return copy_to_user((void __user *)arg, &ustruct, minsz) ? -EFAULT : 0;
}
 
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0d62598c818a..5b9165b7db8d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -703,6 +703,16 @@ struct vfio_iommu_type1_set_fault_eventfd {
 };
 #define VFIO_IOMMU_SET_FAULT_EVENTFD  _IO(VFIO_TYPE, VFIO_BASE + 25)
 
+struct vfio_iommu_type1_get_fault_events {
+   __u32   argsz;
+   __u32   flags;
+   __u32   count; /* number of faults returned */
+   __u32   reserved;
+   struct iommu_fault events[];
+};
+
+#define VFIO_IOMMU_GET_FAULT_EVENTS_IO(VFIO_TYPE, VFIO_BASE + 26)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
2.17.1
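
On the consumer side, the handler woken by the eventfd registered with
VFIO_IOMMU_SET_FAULT_EVENTFD would drain the fifo roughly like this. This is
only a sketch assuming this series' uapi headers; buffer sizing and error
handling are simplified.

/* Sketch: read pending faults once the fault eventfd has fired. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>        /* uapi added by this series */
#include <linux/vfio.h>

#define MAX_FAULTS 32

static void drain_faults(int eventfd, int container)
{
        char buf[sizeof(struct vfio_iommu_type1_get_fault_events) +
                 MAX_FAULTS * sizeof(struct iommu_fault)];
        struct vfio_iommu_type1_get_fault_events *ev = (void *)buf;
        uint64_t signalled;
        uint32_t i;

        read(eventfd, &signalled, sizeof(signalled));   /* consume the signal */

        memset(buf, 0, sizeof(buf));
        ev->argsz = sizeof(buf);        /* argsz - minsz bounds the copy */
        if (ioctl(container, VFIO_IOMMU_GET_FAULT_EVENTS, ev))
                return;

        for (i = 0; i < ev->count; i++) {
                /* ev->events[i] is a struct iommu_fault to inject into
                 * the virtual IOMMU; injection is vIOMMU specific. */
        }
}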



[RFC v2 14/20] iommu: introduce device fault data

2018-09-18 Thread Eric Auger
From: Jacob Pan 

Device faults detected by the IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch intends to provide
generic device fault data such that device drivers can be
informed of IOMMU faults without model-specific knowledge.

The proposed format is the result of discussion at:
https://lkml.org/lkml/2017/11/10/291
Part of the code is based on Jean-Philippe Brucker's patchset
(https://patchwork.kernel.org/patch/9989315/).

The assumption is that the model-specific IOMMU driver can filter and
handle most of the internal faults if the cause is within IOMMU driver
control. Therefore, the fault reasons that can be reported are grouped
and generalized based on common specifications such as PCI ATS.

Signed-off-by: Jacob Pan 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Liu, Yi L 
Signed-off-by: Ashok Raj 
Signed-off-by: Eric Auger 
[moved part of the iommu_fault_event struct in the uapi, enriched
 the fault reasons to be able to map unrecoverable SMMUv3 errors]
---
 include/linux/iommu.h  | 55 -
 include/uapi/linux/iommu.h | 83 ++
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9bd3e63d562b..7529c14ff506 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -49,13 +49,17 @@ struct bus_type;
 struct device;
 struct iommu_domain;
 struct notifier_block;
+struct iommu_fault_event;
 
 /* iommu fault flags */
-#define IOMMU_FAULT_READ   0x0
-#define IOMMU_FAULT_WRITE  0x1
+#define IOMMU_FAULT_READ   (1 << 0)
+#define IOMMU_FAULT_WRITE  (1 << 1)
+#define IOMMU_FAULT_EXEC   (1 << 2)
+#define IOMMU_FAULT_PRIV   (1 << 3)
 
 typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
struct device *, unsigned long, int, void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
 
 struct iommu_domain_geometry {
dma_addr_t aperture_start; /* First address that can be mapped*/
@@ -262,6 +266,52 @@ struct iommu_device {
struct device *dev;
 };
 
+/**
+ * struct iommu_fault_event - Generic per device fault data
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request), information based on PCI ATS
+ * and PASID spec.
+ * - Un-recoverable faults of device interest
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @fault: fault descriptor
+ * @device_private: if present, uniquely identify device-specific
+ *  private data for an individual page request.
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ * data. Users should not modify this field before
+ * sending the fault response.
+ */
+struct iommu_fault_event {
+   struct iommu_fault fault;
+   u64 device_private;
+   u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @dev_fault_handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+   iommu_dev_fault_handler_t handler;
+   void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
+ * struct iommu_group  *iommu_group;
+ * struct iommu_fwspec *iommu_fwspec;
+ */
+struct iommu_param {
+   struct iommu_fault_param *fault_param;
+};
+
 int  iommu_device_register(struct iommu_device *iommu);
 void iommu_device_unregister(struct iommu_device *iommu);
 int  iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -429,6 +479,7 @@ struct iommu_ops {};
 struct iommu_group {};
 struct iommu_fwspec {};
 struct iommu_device {};
+struct iommu_fault_param {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 21adb2a964e5..a0fe5c2fb236 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -150,5 +150,88 @@ struct iommu_guest_msi_binding {
__u64   gpa;
__u32   granule;
 };
+
+/*  Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+   IOMMU_FAULT_DMA_UNRECOV = 1,/* unrecoverable fault */
+   IOMMU_FAULT_PAGE_REQ,   /* page request fault */
+};
+
+enum iommu_fault_reason {
+   IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+   /* IOMMU internal error, no specific reason to report out */
+   IOMMU_FAULT_REASON_INTERNAL,
+
+   /* Could not access the PASID table (fetch caused external abort) */
+   IOMMU_FAULT_REASON_PASID_FETCH,
+
+   /* could not access the device context (fetch caused external abort) */
+   IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,
+
+   /* pasid entry is invalid or has configuration error

[RFC v2 16/20] iommu: introduce device fault report API

2018-09-18 Thread Eric Auger
From: Jacob Pan 

Traditionally, device-specific faults are detected and handled within
their own device drivers. When an IOMMU is enabled, faults such as those
on DMA-related transactions are detected by the IOMMU. There is no generic
mechanism to report faults back to the in-kernel device driver or, in the
case of assigned devices, the guest OS.

Faults detected by the IOMMU are based on the transaction's source ID, so
they can be reported on a per-device basis, regardless of whether the
device is a PCI device or not.

The fault types include recoverable (e.g. page request) and
unrecoverable faults (e.g. access error). In most cases, faults can be
handled by IOMMU drivers internally. The primary use cases are as
follows:
1. A page request fault originating from an SVM-capable device that is
assigned to a guest via a vIOMMU. In this case, the first level page tables
are owned by the guest. The page request must be propagated to the guest to
let the guest OS fault in the pages and then send a page response. In this
mechanism, the direct receiver of the IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.

2. Faults that need more subtle handling by device drivers. Rather than
simply invoking a reset function, there is a need to let the device driver
handle the fault with a smaller impact.

This patchset is intended to create a generic fault report API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other other in kernel users
- DMA & IRQ remapping (TBD)
The original idea was brought up by David Woodhouse and discussions
summarized at https://lwn.net/Articles/608914/.

Signed-off-by: Jacob Pan 
Signed-off-by: Ashok Raj 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Eric Auger 
[adapt to new iommu_fault fault field, test fault_param on
 iommu_unregister_device_fault_handler]
---
 drivers/iommu/iommu.c | 153 +-
 include/linux/iommu.h |  33 -
 2 files changed, 184 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b2b8f375b169..dd4f3677a9ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -618,6 +618,13 @@ int iommu_group_add_device(struct iommu_group *group, 
struct device *dev)
goto err_free_name;
}
 
+   dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+   if (!dev->iommu_param) {
+   ret = -ENOMEM;
+   goto err_free_name;
+   }
+   mutex_init(&dev->iommu_param->lock);
+
kobject_get(group->devices_kobj);
 
dev->iommu_group = group;
@@ -648,6 +655,7 @@ int iommu_group_add_device(struct iommu_group *group, 
struct device *dev)
mutex_unlock(&group->mutex);
dev->iommu_group = NULL;
kobject_put(group->devices_kobj);
+   kfree(dev->iommu_param);
 err_free_name:
kfree(device->name);
 err_remove_link:
@@ -694,7 +702,7 @@ void iommu_group_remove_device(struct device *dev)
sysfs_remove_link(&dev->kobj, "iommu_group");
 
trace_remove_device_from_group(group->id, dev);
-
+   kfree(dev->iommu_param);
kfree(device->name);
kfree(device);
dev->iommu_group = NULL;
@@ -828,6 +836,149 @@ int iommu_group_unregister_notifier(struct iommu_group 
*group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/**
+ * iommu_register_device_fault_handler() - Register a device fault handler
+ * @dev: the device
+ * @handler: the fault handler
+ * @data: private data passed as argument to the handler
+ *
+ * When an IOMMU fault event is received, call this handler with the fault 
event
+ * and data as argument. The handler should return 0 on success. If the fault 
is
+ * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
+ * the fault by calling iommu_page_response() with one of the following
+ * response code:
+ * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
+ * - IOMMU_PAGE_RESP_INVALID: terminate the fault
+ * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
+ *   page faults if possible.
+ *
+ * Return 0 if the fault handler was installed successfully, or an error.
+ */
+int iommu_register_device_fault_handler(struct device *dev,
+   iommu_dev_fault_handler_t handler,
+   void *data)
+{
+   struct iommu_param *param = dev->iommu_param;
+   int ret = 0;
+
+   /*
+* Device iommu_param should have been allocated when device is
+* added to its iommu_group.
+*/
+   if (!param)
+   return -EINVAL;
+
+   mutex_lock(¶m->lock);
+   /* Only allow one fault handler registered for each device */
+   if (param->fault_param) {
+   ret = -EBUSY;
+   goto done_unlock;
+   }
+
+   get_device(dev);
+   param->fault_param =
+   kzalloc(sizeof(struct iommu_fault_param), G
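For context, a consumer such as VFIO or a native device driver would plug
into the registration API added above roughly as follows. The handler body
is a placeholder and the my_* names are invented; only the two
iommu_*_device_fault_handler() calls come from this patch.

/* Sketch of registering/unregistering a device fault handler. */
static int my_fault_handler(struct iommu_fault_event *evt, void *data)
{
        struct device *dev = data;      /* context passed at registration */

        dev_warn(dev, "IOMMU fault, type %d reason %d\n",
                 evt->fault.type, evt->fault.reason);
        /* queue evt->fault for userspace, inject it into a guest, ... */
        return 0;
}

static int my_probe(struct device *dev)
{
        return iommu_register_device_fault_handler(dev, my_fault_handler, dev);
}

static void my_remove(struct device *dev)
{
        iommu_unregister_device_fault_handler(dev);
}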

[RFC v2 15/20] driver core: add per device iommu param

2018-09-18 Thread Eric Auger
From: Jacob Pan 

DMA faults can be detected by the IOMMU at device level. Adding a pointer
to struct device allows the IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For a directly assigned device (or a user space driver), the guest OS holds
the responsibility to handle and respond to per-device IOMMU faults.
Therefore we need a fault reporting mechanism to propagate faults beyond
the IOMMU subsystem.

There are two other IOMMU data pointers under struct device today; here
we introduce iommu_param as a parent pointer such that all device IOMMU
data can be consolidated there. The idea was suggested by Greg KH
and Joerg. The name iommu_param is chosen since iommu_data is already used.

Suggested-by: Greg Kroah-Hartman 
Reviewed-by: Greg Kroah-Hartman 
Signed-off-by: Jacob Pan 
Link: https://lkml.org/lkml/2017/10/6/81
---
 include/linux/device.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 8f882549edee..2daf4a94bd31 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -42,6 +42,7 @@ struct iommu_ops;
 struct iommu_group;
 struct iommu_fwspec;
 struct dev_pin_info;
+struct iommu_param;
 
 struct bus_attribute {
struct attributeattr;
@@ -922,6 +923,7 @@ struct dev_links_info {
  * device (i.e. the bus driver that discovered the device).
  * @iommu_group: IOMMU group the device belongs to.
  * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
  *
  * @offline_disabled: If set, the device is permanently online.
  * @offline:   Set after successful invocation of bus type's .offline().
@@ -1012,6 +1014,7 @@ struct device {
void(*release)(struct device *dev);
struct iommu_group  *iommu_group;
struct iommu_fwspec *iommu_fwspec;
+   struct iommu_param  *iommu_param;
 
booloffline_disabled:1;
booloffline:1;
-- 
2.17.1



[RFC v2 13/20] iommu/smmuv3: Implement bind_guest_msi

2018-09-18 Thread Eric Auger
The bind_guest_msi() callback checks the domain
is NESTED and redirect to the dma-iommu implementation.

Signed-off-by: Eric Auger 
---
 drivers/iommu/arm-smmu-v3.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4054da756a41..d9d300ab62a5 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2168,6 +2168,27 @@ static void arm_smmu_put_resv_regions(struct device *dev,
kfree(entry);
 }
 
+static int arm_smmu_bind_guest_msi(struct iommu_domain *domain,
+  struct iommu_guest_msi_binding *binding)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu;
+   int ret = -EINVAL;
+
+   mutex_lock(&smmu_domain->init_mutex);
+   smmu = smmu_domain->smmu;
+   if (!smmu)
+   goto out;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   goto out;
+
+   ret = iommu_dma_bind_doorbell(domain, binding);
+out:
+   mutex_unlock(&smmu_domain->init_mutex);
+   return ret;
+}
+
 static int arm_smmu_bind_pasid_table(struct iommu_domain *domain,
 struct iommu_pasid_table_config *cfg)
 {
@@ -2311,6 +2332,7 @@ static struct iommu_ops arm_smmu_ops = {
.bind_pasid_table   = arm_smmu_bind_pasid_table,
.unbind_pasid_table = arm_smmu_unbind_pasid_table,
.cache_invalidate   = arm_smmu_cache_invalidate,
+   .bind_guest_msi = arm_smmu_bind_guest_msi,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.1



[RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-09-18 Thread Eric Auger
Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a range provided by the user.
This does not work in nested mode.

If both the host and the guest are exposed to an SMMU, each
allocates an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
 S1 S2
gIOVA-> gDB
   hIOVA->hDB

The PCI device would be programmed with hIOVA.

iommu_dma_bind_doorbell allows passing gIOVA/gDB to the host
so that gIOVA can be used by the host instead of allocating
a new IOVA. That way the host can create the following nested
mapping:

 S1   S2
gIOVA->gDB->hDB

This time, the PCI device will be programmed with the gIOVA MSI
doorbell, which is correctly mapped through the 2 stages.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- unmap stage2 on put()
---
 drivers/iommu/dma-iommu.c | 97 +--
 include/linux/dma-iommu.h | 11 +
 2 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 511ff9a1d6d9..53444c3e8f2f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -37,12 +37,14 @@
 struct iommu_dma_msi_page {
struct list_headlist;
dma_addr_t  iova;
+   dma_addr_t  ipa;
phys_addr_t phys;
 };
 
 enum iommu_dma_cookie_type {
IOMMU_DMA_IOVA_COOKIE,
IOMMU_DMA_MSI_COOKIE,
+   IOMMU_DMA_NESTED_MSI_COOKIE,
 };
 
 struct iommu_dma_cookie {
@@ -109,14 +111,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
  *
  * Users who manage their own IOVA allocation and do not want DMA API support,
  * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
  * contiguous IOVA region, starting at @base, large enough to accommodate the
  * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. The other way round is to provide
+ * usable iova pages through the iommu_dma_bind_doorbell API (nested stages
+ * use case)
  */
 int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
 {
struct iommu_dma_cookie *cookie;
+   int nesting, ret;
 
if (domain->type != IOMMU_DOMAIN_UNMANAGED)
return -EINVAL;
@@ -124,7 +129,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base)
if (domain->iova_cookie)
return -EEXIST;
 
-   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+   ret =  iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
+   if (!ret && nesting)
+   cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+   else
+   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
if (!cookie)
return -ENOMEM;
 
@@ -145,6 +155,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+   bool s2_unmap = false;
 
if (!cookie)
return;
@@ -152,7 +163,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
put_iova_domain(&cookie->iovad);
 
+   if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)
+   s2_unmap = true;
+
list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) {
+   if (s2_unmap && msi->phys) {
+   size_t size = cookie_msi_granule(cookie);
+
+   WARN_ON(iommu_unmap(domain, msi->ipa, size) != size);
+   }
list_del(&msi->list);
kfree(msi);
}
@@ -161,6 +180,50 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 }
 EXPORT_SYMBOL(iommu_put_dma_cookie);
 
+/**
+ * iommu_dma_bind_doorbell - Allows to provide a usable IOVA page
+ * @domain: domain handle
+ * @binding: IOVA/IPA binding
+ *
+ * In nested stage use case, the user can provide IOVA/IPA bindings
+ * corresponding to a guest MSI stage 1 mapping. When the host needs
+ * to map its own MSI doorbells, it can use the IPA as stage 2 input
+ * and map it onto the physical MSI doorbell.
+ */
+int iommu_dma_bind_doorbell(struct iommu_domain *domain,
+   struct iommu_guest_msi_binding *binding)
+{
+   struct iommu_dma_cookie *cookie = domain->iova_cookie;
+   struct iommu_dma_msi_page *msi;
+   dma_addr_t ipa, iova;
+   size_t size;
+
+   if (!cookie)
+   return -EINVAL;
+
+   

[RFC v2 11/20] iommu/smmuv3: Implement cache_invalidate

2018-09-18 Thread Eric Auger
Implement IOMMU_INV_TYPE_TLB invalidations. When
nr_pages is 0 we interpret this as a context
invalidation.

Signed-off-by: Eric Auger 

---

The user API needs to be refined to discriminate context
invalidations from NH_VA invalidations. Also the leaf attribute
is not yet properly handled.

v1 -> v2:
- properly pass the asid
---
 drivers/iommu/arm-smmu-v3.c | 40 +
 1 file changed, 40 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index c7e05c0b93e1..4054da756a41 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2251,6 +2251,45 @@ static void arm_smmu_unbind_pasid_table(struct 
iommu_domain *domain)
mutex_unlock(&smmu_domain->init_mutex);
 }
 
+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+ struct iommu_cache_invalidate_info *inv_info)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -EINVAL;
+
+   if (!smmu)
+   return -EINVAL;
+
+   switch (inv_info->hdr.type) {
+   case IOMMU_INV_TYPE_TLB:
+   /*
+* TODO: On context invalidation, the userspace sets nr_pages
+* to 0. Refine the API to add a dedicated flag and also
+* properly handle the leaf parameter.
+*/
+   if (!inv_info->nr_pages) {
+   smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
+   arm_smmu_tlb_inv_context(smmu_domain);
+   } else {
+   size_t granule = 1 << (inv_info->size + 12);
+   size_t size = inv_info->nr_pages * granule;
+
+   smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
+   arm_smmu_tlb_inv_range_nosync(inv_info->addr, size,
+ granule, false,
+ smmu_domain);
+   __arm_smmu_tlb_sync(smmu);
+   }
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 static struct iommu_ops arm_smmu_ops = {
.capable= arm_smmu_capable,
.domain_alloc   = arm_smmu_domain_alloc,
@@ -2271,6 +2310,7 @@ static struct iommu_ops arm_smmu_ops = {
.put_resv_regions   = arm_smmu_put_resv_regions,
.bind_pasid_table   = arm_smmu_bind_pasid_table,
.unbind_pasid_table = arm_smmu_unbind_pasid_table,
+   .cache_invalidate   = arm_smmu_cache_invalidate,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.1
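
From userspace, a trapped guest TLB invalidation would reach this callback
through VFIO_IOMMU_CACHE_INVALIDATE. A rough sketch using only the field
names visible in this series; the encoding follows the code above, where
granule = 1 << (size + 12) and nr_pages == 0 means a context invalidation.

/* Sketch: forward a trapped guest TLBI range to the host. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>        /* uapi added by this series */
#include <linux/vfio.h>

static int forward_guest_tlbi(int container, uint32_t asid, uint64_t iova,
                              uint64_t npages, uint8_t granule_shift)
{
        struct vfio_iommu_type1_cache_invalidate inv_data = {
                .argsz = sizeof(inv_data),
        };

        inv_data.info.hdr.type = IOMMU_INV_TYPE_TLB;
        inv_data.info.arch_id  = asid;                  /* guest ASID */
        inv_data.info.addr     = iova;
        inv_data.info.size     = granule_shift - 12;    /* granule = 1 << (size + 12) */
        inv_data.info.nr_pages = npages;                /* 0 => context invalidation */

        return ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);
}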



[RFC v2 10/20] iommu/smmuv3: Implement bind_pasid_table

2018-09-18 Thread Eric Auger
On bind_pasid_table() we program the STE stage 1 related info set
by the guest into the actual physical STEs. At a minimum
we need to program the context descriptor GPA and compute
whether the guest wanted to bypass stage 1 or induce
aborts for this STE.

On unbind, the STE stage 1 fields are reset.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- invalidate the STE before changing them
- hold init_mutex
- handle new fields
---
 drivers/iommu/arm-smmu-v3.c | 85 +
 1 file changed, 85 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 9749c36208f3..c7e05c0b93e1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2168,6 +2168,89 @@ static void arm_smmu_put_resv_regions(struct device *dev,
kfree(entry);
 }
 
+static int arm_smmu_bind_pasid_table(struct iommu_domain *domain,
+struct iommu_pasid_table_config *cfg)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_master_data *entry;
+   struct arm_smmu_s1_cfg *s1_cfg;
+   struct arm_smmu_device *smmu;
+   unsigned long flags;
+   int ret = -EINVAL;
+
+   if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3)
+   return -EINVAL;
+
+   mutex_lock(&smmu_domain->init_mutex);
+
+   smmu = smmu_domain->smmu;
+
+   if (!smmu)
+   goto out;
+
+   if (!((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
+ (smmu->features & ARM_SMMU_FEAT_TRANS_S2))) {
+   dev_info(smmu_domain->smmu->dev,
+"does not implement two stages\n");
+   goto out;
+   }
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   goto out;
+
+   /* we currently support a single CD. S1DSS and S1FMT are ignored */
+   if (cfg->smmuv3.s1cdmax)
+   goto out;
+
+   s1_cfg = &smmu_domain->s1_cfg;
+   s1_cfg->nested_bypass = cfg->smmuv3.bypass;
+   s1_cfg->nested_abort = cfg->smmuv3.abort;
+   s1_cfg->cdptr_dma = cfg->smmuv3.s1contextptr;
+
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_for_each_entry(entry, &smmu_domain->devices, list) {
+   entry->ste.s1_cfg = &smmu_domain->s1_cfg;
+   entry->ste.nested = true;
+   arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+   }
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+   ret = 0;
+out:
+   mutex_unlock(&smmu_domain->init_mutex);
+   return ret;
+}
+
+static void arm_smmu_unbind_pasid_table(struct iommu_domain *domain)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   struct arm_smmu_master_data *entry;
+   struct arm_smmu_s1_cfg *s1_cfg;
+   unsigned long flags;
+
+   if (!smmu)
+   return;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return;
+
+   mutex_lock(&smmu_domain->init_mutex);
+
+   s1_cfg = &smmu_domain->s1_cfg;
+
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_for_each_entry(entry, &smmu_domain->devices, list) {
+   entry->ste.s1_cfg = NULL;
+   entry->ste.nested = false;
+   arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+   }
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+   s1_cfg->nested_abort = false;
+   s1_cfg->nested_bypass = false;
+   mutex_unlock(&smmu_domain->init_mutex);
+}
+
 static struct iommu_ops arm_smmu_ops = {
.capable= arm_smmu_capable,
.domain_alloc   = arm_smmu_domain_alloc,
@@ -2186,6 +2269,8 @@ static struct iommu_ops arm_smmu_ops = {
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = arm_smmu_put_resv_regions,
+   .bind_pasid_table   = arm_smmu_bind_pasid_table,
+   .unbind_pasid_table = arm_smmu_unbind_pasid_table,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.1
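
For illustration, the stage 1 configuration a vSMMU implementation would
hand down when trapping a guest STE could be built like this. Only the
fields visible in this patch are set; how the struct travels inside the
VFIO_IOMMU_BIND_PASID_TABLE ioctl argument is not shown, and the
guest_ste_* variables are placeholders.

/* Sketch: translate a trapped guest STE into the proposed config. */
struct iommu_pasid_table_config cfg = {
        .format = IOMMU_PASID_FORMAT_SMMUV3,
};

cfg.smmuv3.s1contextptr = guest_ste_s1ctxptr;   /* GPA of the guest CD table */
cfg.smmuv3.s1cdmax      = 0;                    /* single CD: only value accepted here */
cfg.smmuv3.bypass       = guest_ste_bypasses_stage1;
cfg.smmuv3.abort        = guest_ste_aborts;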



[RFC v2 06/20] vfio: VFIO_IOMMU_BIND_MSI

2018-09-18 Thread Eric Auger
This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
passing the guest MSI binding to the host.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
---
 drivers/vfio/vfio_iommu_type1.c | 31 +++
 include/uapi/linux/vfio.h   |  7 +++
 2 files changed, 38 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 14529233f774..2ffd46c26dc3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1704,6 +1704,24 @@ vfio_cache_invalidate(struct vfio_iommu *iommu,
return ret;
 }
 
+static int
+vfio_iommu_bind_guest_msi(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_bind_guest_msi *ustruct)
+{
+   struct vfio_domain *d;
+   int ret;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_bind_guest_msi(d->domain, &ustruct->binding);
+   if (ret)
+   break;
+   }
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static int
 vfio_bind_pasid_table(struct vfio_iommu *iommu,
  struct vfio_iommu_type1_bind_pasid_table *ustruct)
@@ -1818,6 +1836,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return -EINVAL;
 
return vfio_cache_invalidate(iommu, &ustruct);
+   } else if (cmd == VFIO_IOMMU_BIND_MSI) {
+   struct vfio_iommu_type1_bind_guest_msi ustruct;
+
+   minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
+   binding);
+
+   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ustruct.argsz < minsz || ustruct.flags)
+   return -EINVAL;
+
+   return vfio_iommu_bind_guest_msi(iommu, &ustruct);
}
 
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2a38d0bca0ca..6fb5e944e73c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -680,6 +680,13 @@ struct vfio_iommu_type1_cache_invalidate {
 };
 #define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 23)
 
+struct vfio_iommu_type1_bind_guest_msi {
+   __u32   argsz;
+   __u32   flags;
+   struct iommu_guest_msi_binding binding;
+};
+#define VFIO_IOMMU_BIND_MSI  _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
2.17.1
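
Usage-wise, once the vIOMMU has trapped the guest's gIOVA -> gDB MSI mapping,
the VMM would forward it like this. A sketch only: just the binding fields
visible in this series are shown, and the guest IOVA field defined in the
series' header is left as a comment.

/* Sketch: pass the guest MSI doorbell binding down to the host. */
struct vfio_iommu_type1_bind_guest_msi guest_binding = {
        .argsz = sizeof(guest_binding),
};

/* guest_binding.binding also carries the guest IOVA of the doorbell */
guest_binding.binding.gpa     = gdb_gpa;        /* guest doorbell IPA */
guest_binding.binding.granule = granule;        /* mapping granule */

ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);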



[RFC v2 07/20] iommu/arm-smmu-v3: Link domains and devices

2018-09-18 Thread Eric Auger
From: Jean-Philippe Brucker 

When removing a mapping from a domain, we need to send an invalidation to
all devices that might have stored it in their Address Translation Cache
(ATC). In addition, with SVM, we'll need to invalidate context descriptors
of all devices attached to a live domain.

Maintain a list of devices in each domain, protected by a spinlock. It is
updated every time we attach or detach devices to and from domains.

It needs to be a spinlock because we'll invalidate ATC entries from
within hardirq-safe contexts, but it may be possible to relax the read
side with RCU later.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/arm-smmu-v3.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5059d09f3202..5d0d080f0a02 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -596,6 +596,11 @@ struct arm_smmu_device {
 struct arm_smmu_master_data {
struct arm_smmu_device  *smmu;
struct arm_smmu_strtab_ent  ste;
+
+   struct arm_smmu_domain  *domain;
+   struct list_headlist; /* domain->devices */
+
+   struct device   *dev;
 };
 
 /* SMMU private data for an IOMMU domain */
@@ -619,6 +624,9 @@ struct arm_smmu_domain {
};
 
struct iommu_domain domain;
+
+   struct list_headdevices;
+   spinlock_t  devices_lock;
 };
 
 struct arm_smmu_option_prop {
@@ -1472,6 +1480,9 @@ static struct iommu_domain 
*arm_smmu_domain_alloc(unsigned type)
}
 
mutex_init(&smmu_domain->init_mutex);
+   INIT_LIST_HEAD(&smmu_domain->devices);
+   spin_lock_init(&smmu_domain->devices_lock);
+
return &smmu_domain->domain;
 }
 
@@ -1687,7 +1698,17 @@ static void arm_smmu_install_ste_for_dev(struct 
iommu_fwspec *fwspec)
 
 static void arm_smmu_detach_dev(struct device *dev)
 {
+   unsigned long flags;
struct arm_smmu_master_data *master = dev->iommu_fwspec->iommu_priv;
+   struct arm_smmu_domain *smmu_domain = master->domain;
+
+   if (smmu_domain) {
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_del(&master->list);
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+   master->domain = NULL;
+   }
 
master->ste.assigned = false;
arm_smmu_install_ste_for_dev(dev->iommu_fwspec);
@@ -1696,6 +1717,7 @@ static void arm_smmu_detach_dev(struct device *dev)
 static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
int ret = 0;
+   unsigned long flags;
struct arm_smmu_device *smmu;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_master_data *master;
@@ -1731,6 +1753,11 @@ static int arm_smmu_attach_dev(struct iommu_domain 
*domain, struct device *dev)
}
 
ste->assigned = true;
+   master->domain = smmu_domain;
+
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_add(&master->list, &smmu_domain->devices);
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) {
ste->s1_cfg = NULL;
@@ -1849,6 +1876,7 @@ static int arm_smmu_add_device(struct device *dev)
return -ENOMEM;
 
master->smmu = smmu;
+   master->dev = dev;
fwspec->iommu_priv = master;
}
 
-- 
2.17.1



[RFC v2 08/20] iommu/arm-smmu-v3: Maintain a SID->device structure

2018-09-18 Thread Eric Auger
From: Jean-Philippe Brucker 

When handling faults from the event or PRI queue, we need to find the
struct device associated with a SID. Add an rb_tree to keep track of SIDs.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/arm-smmu-v3.c | 136 ++--
 1 file changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5d0d080f0a02..80bb93b43a2e 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -590,6 +590,16 @@ struct arm_smmu_device {
 
/* IOMMU core code handle */
struct iommu_device iommu;
+
+   struct rb_root  streams;
+   struct mutexstreams_mutex;
+
+};
+
+struct arm_smmu_stream {
+   u32 id;
+   struct arm_smmu_master_data *master;
+   struct rb_node  node;
 };
 
 /* SMMU private data for each master */
@@ -599,6 +609,7 @@ struct arm_smmu_master_data {
 
struct arm_smmu_domain  *domain;
struct list_headlist; /* domain->devices */
+   struct arm_smmu_stream  *streams;
 
struct device   *dev;
 };
@@ -1224,6 +1235,32 @@ static int arm_smmu_init_l2_strtab(struct 
arm_smmu_device *smmu, u32 sid)
return 0;
 }
 
+__maybe_unused
+static struct arm_smmu_master_data *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
+{
+   struct rb_node *node;
+   struct arm_smmu_stream *stream;
+   struct arm_smmu_master_data *master = NULL;
+
+   mutex_lock(&smmu->streams_mutex);
+   node = smmu->streams.rb_node;
+   while (node) {
+   stream = rb_entry(node, struct arm_smmu_stream, node);
+   if (stream->id < sid) {
+   node = node->rb_right;
+   } else if (stream->id > sid) {
+   node = node->rb_left;
+   } else {
+   master = stream->master;
+   break;
+   }
+   }
+   mutex_unlock(&smmu->streams_mutex);
+
+   return master;
+}
+
 /* IRQ and event handlers */
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
@@ -1847,6 +1884,71 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device 
*smmu, u32 sid)
return sid < limit;
 }
 
+static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
+ struct arm_smmu_master_data *master)
+{
+   int i;
+   int ret = 0;
+   struct arm_smmu_stream *new_stream, *cur_stream;
+   struct rb_node **new_node, *parent_node = NULL;
+   struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+   master->streams = kcalloc(fwspec->num_ids,
+ sizeof(struct arm_smmu_stream), GFP_KERNEL);
+   if (!master->streams)
+   return -ENOMEM;
+
+   mutex_lock(&smmu->streams_mutex);
+   for (i = 0; i < fwspec->num_ids && !ret; i++) {
+   new_stream = &master->streams[i];
+   new_stream->id = fwspec->ids[i];
+   new_stream->master = master;
+
+   new_node = &(smmu->streams.rb_node);
+   while (*new_node) {
+   cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
+ node);
+   parent_node = *new_node;
+   if (cur_stream->id > new_stream->id) {
+   new_node = &((*new_node)->rb_left);
+   } else if (cur_stream->id < new_stream->id) {
+   new_node = &((*new_node)->rb_right);
+   } else {
+   dev_warn(master->dev,
+"stream %u already in tree\n",
+cur_stream->id);
+   ret = -EINVAL;
+   break;
+   }
+   }
+
+   if (!ret) {
+   rb_link_node(&new_stream->node, parent_node, new_node);
+   rb_insert_color(&new_stream->node, &smmu->streams);
+   }
+   }
+   mutex_unlock(&smmu->streams_mutex);
+
+   return ret;
+}
+
+static void arm_smmu_remove_master(struct arm_smmu_device *smmu,
+  struct arm_smmu_master_data *master)
+{
+   int i;
+   struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+   if (!master->streams)
+   return;
+
+   mutex_lock(&smmu->streams_mutex);
+   for (i = 0; i < fwspec->num_ids; i++)
+   rb_erase(&master->streams[i].node, &smmu->streams);
+   mutex_unlock(&smmu->streams_mutex);
+
+   kfree(master->streams);
+}
+
 static struct iommu_ops arm_smmu_ops;
 
 static int arm_smmu_add_device(struct device *dev)
@@ -1895,13 +1997,35 @@ st

[RFC v2 09/20] iommu/smmuv3: Get prepared for nested stage support

2018-09-18 Thread Eric Auger
To allow nested stage support, we need to store both
stage 1 and stage 2 configurations (and remove the former
union).

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE.

We add a nested_bypass field to the S1 configuration as the first
stage can be bypassed. Also the guest may force the STE to abort:
this information gets stored into the nested_abort field.

Only the S2 stage is "finalized", as the host does not configure
the S1 CD; the guest does.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- invalidate the STE before moving from a live STE config to another
- add the nested_abort and nested_bypass fields
---
 drivers/iommu/arm-smmu-v3.c | 43 -
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 80bb93b43a2e..9749c36208f3 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -222,6 +222,7 @@
 #define STRTAB_STE_0_CFG_BYPASS4
 #define STRTAB_STE_0_CFG_S1_TRANS  5
 #define STRTAB_STE_0_CFG_S2_TRANS  6
+#define STRTAB_STE_0_CFG_NESTED7
 
 #define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4)
 #define STRTAB_STE_0_S1FMT_LINEAR  0
@@ -497,6 +498,10 @@ struct arm_smmu_strtab_l1_desc {
 struct arm_smmu_s1_cfg {
__le64  *cdptr;
dma_addr_t  cdptr_dma;
+   /* in nested mode, tells s1 must be bypassed */
+   boolnested_bypass;
+   /* in nested mode, abort is forced by guest */
+   boolnested_abort;
 
struct arm_smmu_ctx_desc {
u16 asid;
@@ -521,6 +526,7 @@ struct arm_smmu_strtab_ent {
 * configured according to the domain type.
 */
boolassigned;
+   boolnested;
struct arm_smmu_s1_cfg  *s1_cfg;
struct arm_smmu_s2_cfg  *s2_cfg;
 };
@@ -629,10 +635,8 @@ struct arm_smmu_domain {
struct io_pgtable_ops   *pgtbl_ops;
 
enum arm_smmu_domain_stage  stage;
-   union {
-   struct arm_smmu_s1_cfg  s1_cfg;
-   struct arm_smmu_s2_cfg  s2_cfg;
-   };
+   struct arm_smmu_s1_cfg  s1_cfg;
+   struct arm_smmu_s2_cfg  s2_cfg;
 
struct iommu_domain domain;
 
@@ -1119,10 +1123,11 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
break;
case STRTAB_STE_0_CFG_S1_TRANS:
case STRTAB_STE_0_CFG_S2_TRANS:
+   case STRTAB_STE_0_CFG_NESTED:
ste_live = true;
break;
case STRTAB_STE_0_CFG_ABORT:
-   if (disable_bypass)
+   if (disable_bypass || ste->nested)
break;
default:
BUG(); /* STE corruption */
@@ -1134,7 +1139,8 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
 
/* Bypass/fault */
if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
-   if (!ste->assigned && disable_bypass)
+   if ((!ste->assigned && disable_bypass) ||
+   (ste->s1_cfg && ste->s1_cfg->nested_abort))
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_ABORT);
else
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_BYPASS);
@@ -1152,8 +1158,17 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
return;
}
 
+   if (ste->nested && ste_live) {
+   /*
+* When enabling nested, the STE may be transitioning from
+* s2 to nested and back. Invalidate the STE before changing it.
+*/
+   dst[0] = cpu_to_le64(0);
+   arm_smmu_sync_ste_for_sid(smmu, sid);
+   val = STRTAB_STE_0_V;
+   }
+
if (ste->s1_cfg) {
-   BUG_ON(ste_live);
dst[1] = cpu_to_le64(
 FIELD_PREP(STRTAB_STE_1_S1CIR, 
STRTAB_STE_1_S1C_CACHE_WBRA) |
 FIELD_PREP(STRTAB_STE_1_S1COR, 
STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1167,12 +1182,12 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
-   val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
-   FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
+   if (!ste->s1_cfg->nested_bypass)
+   val |= (ste->s1_cfg->cdptr_dma & 
STRTAB_STE_0_S1CTXPTR_MASK) |
+   FIELD_PREP(STRTAB_STE_0_CFG,

[RFC v2 05/20] vfio: VFIO_IOMMU_CACHE_INVALIDATE

2018-09-18 Thread Eric Auger
From: "Liu, Yi L" 

When the guest "owns" the stage 1 translation structures,  the host
IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation requests are trapped and passed down to the
host.

This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl, which aims
at propagating guest stage 1 IOMMU cache invalidations to the host.
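
For illustration only, a VMM could forward one trapped guest
invalidation roughly as follows; the container file descriptor and the
contents of the info field are placeholders, and the headers are
assumed to be the uapi headers introduced by this series:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int forward_guest_inv(int container_fd,
			     const struct iommu_cache_invalidate_info *info)
{
	struct vfio_iommu_type1_cache_invalidate inv;

	memset(&inv, 0, sizeof(inv));
	inv.argsz = sizeof(inv);
	inv.flags = 0;
	inv.info  = *info;	/* invalidation data trapped from the vIOMMU */

	return ioctl(container_fd, VFIO_IOMMU_CACHE_INVALIDATE, &inv);
}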

Signed-off-by: Liu, Yi L 
Signed-off-by: Eric Auger 

---

v1 -> v2:
- s/TLB/CACHE
- remove vfio_iommu_task usage
- commit message rewording
---
 drivers/vfio/vfio_iommu_type1.c | 44 +
 include/uapi/linux/vfio.h   |  7 ++
 2 files changed, 51 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 044d3a046125..14529233f774 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1673,6 +1673,37 @@ static int vfio_domains_have_iommu_cache(struct 
vfio_iommu *iommu)
return ret;
 }
 
+static int vfio_cache_inv_fn(struct device *dev, void *data)
+{
+   struct vfio_iommu_type1_cache_invalidate *ustruct =
+   (struct vfio_iommu_type1_cache_invalidate *)data;
+   struct iommu_domain *d = iommu_get_domain_for_dev(dev);
+
+   return iommu_cache_invalidate(d, dev, &ustruct->info);
+}
+
+static int
+vfio_cache_invalidate(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_cache_invalidate *ustruct)
+{
+   struct vfio_domain *d;
+   struct vfio_group *g;
+   int ret = 0;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   list_for_each_entry(g, &d->group_list, next) {
+   ret = iommu_group_for_each_dev(g->iommu_group, ustruct,
+  vfio_cache_inv_fn);
+   if (ret)
+   break;
+   }
+   }
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static int
 vfio_bind_pasid_table(struct vfio_iommu *iommu,
  struct vfio_iommu_type1_bind_pasid_table *ustruct)
@@ -1774,6 +1805,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return -EINVAL;
 
return vfio_bind_pasid_table(iommu, &ustruct);
+   } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
+   struct vfio_iommu_type1_cache_invalidate ustruct;
+
+   minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
+   info);
+
+   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ustruct.argsz < minsz || ustruct.flags)
+   return -EINVAL;
+
+   return vfio_cache_invalidate(iommu, &ustruct);
}
 
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index f81ce0a75c01..2a38d0bca0ca 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -673,6 +673,13 @@ struct vfio_iommu_type1_bind_pasid_table {
 };
 #define VFIO_IOMMU_BIND_PASID_TABLE_IO(VFIO_TYPE, VFIO_BASE + 22)
 
+struct vfio_iommu_type1_cache_invalidate {
+   __u32   argsz;
+   __u32   flags;
+   struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 23)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
2.17.1



[RFC v2 04/20] vfio: VFIO_IOMMU_BIND_PASID_TABLE

2018-09-18 Thread Eric Auger
From: "Liu, Yi L" 

This patch adds the VFIO_IOMMU_BIND_PASID_TABLE ioctl, which aims at
passing the virtual IOMMU guest configuration through the VFIO driver
down to the IOMMU subsystem.
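
For illustration only, a userspace caller such as a VMM would invoke the
new ioctl roughly as follows; the container file descriptor is a
placeholder and the config contents are entirely model-specific,
assuming the uapi headers introduced by this series:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int bind_guest_pasid_table(int container_fd,
				  const struct iommu_pasid_table_config *cfg)
{
	struct vfio_iommu_type1_bind_pasid_table bind;

	memset(&bind, 0, sizeof(bind));
	bind.argsz  = sizeof(bind);
	bind.flags  = 0;
	bind.config = *cfg;	/* guest PASID table description from the vIOMMU */

	return ioctl(container_fd, VFIO_IOMMU_BIND_PASID_TABLE, &bind);
}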

Signed-off-by: Jacob Pan 
Signed-off-by: Liu, Yi L 
Signed-off-by: Eric Auger 

---

v1 -> v2:
- s/BIND_GUEST_STAGE/BIND_PASID_TABLE
- remove the struct device arg
---
 drivers/vfio/vfio_iommu_type1.c | 31 +++
 include/uapi/linux/vfio.h   |  8 
 2 files changed, 39 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d9fd3188615d..044d3a046125 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1673,6 +1673,24 @@ static int vfio_domains_have_iommu_cache(struct 
vfio_iommu *iommu)
return ret;
 }
 
+static int
+vfio_bind_pasid_table(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_bind_pasid_table *ustruct)
+{
+   struct vfio_domain *d;
+   int ret = 0;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_bind_pasid_table(d->domain, &ustruct->config);
+   if (ret)
+   break;
+   }
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -1743,6 +1761,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
return copy_to_user((void __user *)arg, &unmap, minsz) ?
-EFAULT : 0;
+   } else if (cmd == VFIO_IOMMU_BIND_PASID_TABLE) {
+   struct vfio_iommu_type1_bind_pasid_table ustruct;
+
+   minsz = offsetofend(struct vfio_iommu_type1_bind_pasid_table,
+   config);
+
+   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ustruct.argsz < minsz || ustruct.flags)
+   return -EINVAL;
+
+   return vfio_bind_pasid_table(iommu, &ustruct);
}
 
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 1aa7b82e8169..f81ce0a75c01 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 #define VFIO_API_VERSION   0
 
@@ -665,6 +666,13 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE  _IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
 
+struct vfio_iommu_type1_bind_pasid_table {
+   __u32   argsz;
+   __u32   flags;
+   struct iommu_pasid_table_config config;
+};
+#define VFIO_IOMMU_BIND_PASID_TABLE_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
2.17.1



[RFC v2 02/20] iommu: Introduce cache_invalidate API

2018-09-18 Thread Eric Auger
From: "Liu, Yi L" 

In any virtualization use case, when the first translation stage
is "owned" by the guest OS, the host IOMMU driver has no knowledge
of caching structure updates unless the guest invalidation activities
are trapped by the virtualizer and passed down to the host.

Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at
various layers. Therefore, a generic invalidation data format is
proposed here; model-specific IOMMU drivers need to convert it into
their own format.
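
For illustration, a model-specific driver would hook the new callback
into its iommu_ops and convert the generic descriptor into its own
invalidation commands; the foo_* names below are placeholders, not part
of this series:

static int foo_cache_invalidate(struct iommu_domain *domain,
				struct device *dev,
				struct iommu_cache_invalidate_info *inv_info)
{
	/*
	 * A real driver would first sanity-check the fields it supports,
	 * then emit its own invalidation command(s); foo_issue_inv_cmd()
	 * stands in for that model-specific conversion.
	 */
	return foo_issue_inv_cmd(domain, inv_info->arch_id, inv_info->addr);
}

static struct iommu_ops foo_iommu_ops = {
	/* ... other callbacks ... */
	.cache_invalidate	= foo_cache_invalidate,
};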

Signed-off-by: Liu, Yi L 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Jacob Pan 
Signed-off-by: Ashok Raj 
Signed-off-by: Eric Auger 

---
v1 -> v2:
- add arch_id field
- renamed tlb_invalidate into cache_invalidate as this API allows
  to invalidate context caches on top of IOTLBs

v1:
renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
header. Commit message reworded.
---
 drivers/iommu/iommu.c  | 14 ++
 include/linux/iommu.h  | 14 ++
 include/uapi/linux/iommu.h | 95 ++
 3 files changed, 123 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index db2c7c9502ae..1442a6c640af 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1381,6 +1381,20 @@ void iommu_unbind_pasid_table(struct iommu_domain 
*domain)
 }
 EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
 
+int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+  struct iommu_cache_invalidate_info *inv_info)
+{
+   int ret = 0;
+
+   if (unlikely(!domain->ops->cache_invalidate))
+   return -ENODEV;
+
+   ret = domain->ops->cache_invalidate(domain, dev, inv_info);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e56cad4863f7..3e5f6eb1f04a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -188,6 +188,7 @@ struct iommu_resv_region {
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @bind_pasid_table: bind pasid table
  * @unbind_pasid_table: unbind pasid table and restore defaults
+ * @cache_invalidate: invalidate translation caches
  */
 struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -238,6 +239,9 @@ struct iommu_ops {
struct iommu_pasid_table_config *cfg);
void (*unbind_pasid_table)(struct iommu_domain *domain);
 
+   int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
+   struct iommu_cache_invalidate_info *inv_info);
+
unsigned long pgsize_bitmap;
 };
 
@@ -302,6 +306,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
 extern int iommu_bind_pasid_table(struct iommu_domain *domain,
  struct iommu_pasid_table_config *cfg);
 extern void iommu_unbind_pasid_table(struct iommu_domain *domain);
+extern int iommu_cache_invalidate(struct iommu_domain *domain,
+   struct device *dev,
+   struct iommu_cache_invalidate_info *inv_info);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot);
@@ -704,6 +711,13 @@ static inline
 void iommu_unbind_pasid_table(struct iommu_domain *domain)
 {
 }
+static inline int
+iommu_cache_invalidate(struct iommu_domain *domain,
+  struct device *dev,
+  struct iommu_cache_invalidate_info *inv_info)
+{
+   return -ENODEV;
+}
 
 #endif /* CONFIG_IOMMU_API */
 
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index babec91ae7e1..4283e0334baf 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -49,4 +49,99 @@ struct iommu_pasid_table_config {
};
 };
 
+/**
+ * enum iommu_inv_granularity - Generic invalidation granularity
+ * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:  TLB entries or PASID caches of all
+ * PASIDs associated with a domain ID
+ * @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID cache associated
+ * with a PASID and a domain
+ * @IOMMU_INV_GRANU_PAGE_PASID:TLB entries of selected page 
range
+ * within a PASID
+ *
+ * When an invalidation request is passed down to IOMMU to flush translation
+ * caches, it may carry different granularity levels, which can be specific
+ * to certain types of translation caches.
+ * This enum is a collection of granularities for all types of translation
+ * caches. The idea is to make it easy for IOMMU model specific driver to
+ * convert f

[RFC v2 03/20] iommu: Introduce bind_guest_msi

2018-09-18 Thread Eric Auger
On ARM, MSIs are translated by the SMMU. An IOVA is allocated
for each MSI doorbell. If both the host and the guest are exposed
with SMMUs, we end up with 2 different IOVAs, one allocated by each.
The guest allocates an IOVA (gIOVA) to map onto the guest MSI
doorbell (gDB). The host allocates another IOVA (hIOVA) to map
onto the physical doorbell (hDB).

So we end up with 2 untied mappings:
         S1           S2
gIOVA  ->  gDB
              hIOVA  ->  hDB

Currently the PCI device is programmed by the host with hIOVA
as MSI doorbell. So this does not work.

This patch introduces an API to pass gIOVA/gDB to the host so
that gIOVA can be reused by the host instead of re-allocating
a new IOVA. So the goal is to create the following nested mapping:

         S1           S2
gIOVA  ->  gDB  ->  hDB

and program the PCI device with the gIOVA MSI doorbell.
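
For illustration, the host (typically from the corresponding VFIO ioctl
handler) would pass the guest-provided pair down as follows; the helper
name, the variable names and the granule encoding are assumptions, not
part of this patch:

static int foo_report_guest_msi_binding(struct iommu_domain *domain,
					u64 giova, u64 gdb_gpa)
{
	struct iommu_guest_msi_binding binding = {
		.iova    = giova,	/* gIOVA the guest chose for its doorbell */
		.gpa     = gdb_gpa,	/* guest physical address of the doorbell */
		.granule = PAGE_SHIFT,	/* assumed granule encoding */
	};

	return iommu_bind_guest_msi(domain, &binding);
}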

Signed-off-by: Eric Auger 
---
 drivers/iommu/iommu.c  | 10 ++
 include/linux/iommu.h  | 13 +
 include/uapi/linux/iommu.h |  7 +++
 3 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1442a6c640af..b2b8f375b169 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1409,6 +1409,16 @@ static void __iommu_detach_device(struct iommu_domain 
*domain,
trace_detach_device_from_domain(dev);
 }
 
+int iommu_bind_guest_msi(struct iommu_domain *domain,
+struct iommu_guest_msi_binding *binding)
+{
+   if (unlikely(!domain->ops->bind_guest_msi))
+   return -ENODEV;
+
+   return domain->ops->bind_guest_msi(domain, binding);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
+
 void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 {
struct iommu_group *group;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3e5f6eb1f04a..9bd3e63d562b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -242,6 +242,9 @@ struct iommu_ops {
int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
struct iommu_cache_invalidate_info *inv_info);
 
+   int (*bind_guest_msi)(struct iommu_domain *domain,
+ struct iommu_guest_msi_binding *binding);
+
unsigned long pgsize_bitmap;
 };
 
@@ -309,6 +312,9 @@ extern void iommu_unbind_pasid_table(struct iommu_domain 
*domain);
 extern int iommu_cache_invalidate(struct iommu_domain *domain,
struct device *dev,
struct iommu_cache_invalidate_info *inv_info);
+extern int iommu_bind_guest_msi(struct iommu_domain *domain,
+   struct iommu_guest_msi_binding *binding);
+
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot);
@@ -719,6 +725,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
return -ENODEV;
 }
 
+static inline
+int iommu_bind_guest_msi(struct iommu_domain *domain,
+struct iommu_guest_msi_binding *binding)
+{
+   return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_DEBUGFS
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4283e0334baf..21adb2a964e5 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -144,4 +144,11 @@ struct iommu_cache_invalidate_info {
__u64   arch_id;
__u64   addr;
 };
+
+struct iommu_guest_msi_binding {
+   __u64   iova;
+   __u64   gpa;
+   __u32   granule;
+};
 #endif /* _UAPI_IOMMU_H */
+
-- 
2.17.1



[RFC v2 01/20] iommu: Introduce bind_pasid_table API

2018-09-18 Thread Eric Auger
From: Jacob Pan 

In the virtualization use case, when a guest is assigned
a PCI host device protected by a virtual IOMMU on the guest,
the physical IOMMU must be programmed to be consistent with
the guest mappings. If the physical IOMMU supports two
translation stages, it makes sense to program the guest mappings
onto the first stage/level (ARM/VT-d terminology) while the host
owns stage/level 2.

In that case, guest configuration settings must be trapped
and passed down to the physical IOMMU driver.

This patch adds a new API to the iommu subsystem that allows
to bind and unbind the guest configuration data to the host.

A generic iommu_pasid_table_config struct is introduced in
a new iommu.h uapi header. This is going to be used by the VFIO
user API. We foresee at least two specializations of this struct,
for PASID table passing and ARM SMMUv3.
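
For illustration, an IOMMU driver would implement and advertise the two
callbacks roughly as below; the foo_* names are placeholders, not part
of this series:

static int foo_bind_pasid_table(struct iommu_domain *domain,
				struct iommu_pasid_table_config *cfg)
{
	/* Validate cfg and install the guest's PASID table pointer. */
	return 0;
}

static void foo_unbind_pasid_table(struct iommu_domain *domain)
{
	/* Restore the default, host-owned configuration. */
}

static struct iommu_ops foo_iommu_ops = {
	/* ... other callbacks ... */
	.bind_pasid_table	= foo_bind_pasid_table,
	.unbind_pasid_table	= foo_unbind_pasid_table,
};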

Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Liu, Yi L 
Signed-off-by: Ashok Raj 
Signed-off-by: Jacob Pan 
Signed-off-by: Eric Auger 

---

In practice, I think it would be simpler to have a single
set_pasid_table function instead of bind/unbind. The "bypass" field
tells whether stage 1 is bypassed (actually equivalent to the unbind).
In userspace we have notifications that the device context has changed.
Calling either bind or unbind requires knowing what the previous state
was and calling different notifiers. So to me the bind/unbind split
complicates the user integration while not bringing much benefit.

This patch generalizes the API introduced by Jacob & co-authors in
https://lwn.net/Articles/754331/

v1 -> v2:
- restore the original pasid table name
- remove the struct device * parameter in the API
- reworked iommu_pasid_smmuv3
---
 drivers/iommu/iommu.c  | 19 ++
 include/linux/iommu.h  | 21 +++
 include/uapi/linux/iommu.h | 52 ++
 3 files changed, 92 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c15c5980299..db2c7c9502ae 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1362,6 +1362,25 @@ int iommu_attach_device(struct iommu_domain *domain, 
struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_bind_pasid_table(struct iommu_domain *domain,
+  struct iommu_pasid_table_config *cfg)
+{
+   if (unlikely(!domain->ops->bind_pasid_table))
+   return -ENODEV;
+
+   return domain->ops->bind_pasid_table(domain, cfg);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+void iommu_unbind_pasid_table(struct iommu_domain *domain)
+{
+   if (unlikely(!domain->ops->unbind_pasid_table))
+   return;
+
+   domain->ops->unbind_pasid_table(domain);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 87994c265bf5..e56cad4863f7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IOMMU_READ (1 << 0)
 #define IOMMU_WRITE(1 << 1)
@@ -185,6 +186,8 @@ struct iommu_resv_region {
  * @domain_get_windows: Return the number of windows for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table
+ * @unbind_pasid_table: unbind pasid table and restore defaults
  */
 struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -231,6 +234,10 @@ struct iommu_ops {
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
bool (*is_attach_deferred)(struct iommu_domain *domain, struct device 
*dev);
 
+   int (*bind_pasid_table)(struct iommu_domain *domain,
+   struct iommu_pasid_table_config *cfg);
+   void (*unbind_pasid_table)(struct iommu_domain *domain);
+
unsigned long pgsize_bitmap;
 };
 
@@ -292,6 +299,9 @@ extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+ struct iommu_pasid_table_config *cfg);
+extern void iommu_unbind_pasid_table(struct iommu_domain *domain);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot);
@@ -684,6 +694,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct 
fwnode_handle *fwnode)
return NULL;
 }
 
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain,
+  struct iommu_pasid_tabl

[RFC v2 00/20] SMMUv3 Nested Stage Setup

2018-09-18 Thread Eric Auger
This series allows a virtualizer to program the nested stage mode.
This is useful when both the host and the guest are exposed with
an SMMUv3 and a PCI device is assigned to the guest using VFIO.

In this mode, the physical IOMMU must be programmed to translate
the two stages: the one set up by the guest (IOVA -> GPA) and the
one set up by the host VFIO driver as part of the assignment process
(GPA -> HPA).

On Intel, this is traditionally achieved by combining the 2 stages
into a single physical stage. However this relies on the capability
to trap on each guest translation structure update. This is possible
by using the VTD Caching Mode. Unfortunately the ARM SMMUv3 does
not offer a similar mechanism.

However, the ARM SMMUv3 architecture supports 2 physical stages! Those
were devised exactly with that use case in mind. Assuming the HW
implements both stages (optional), the guest now can use stage 1
while the host uses stage 2.

This assumes the virtualizer has means to propagate guest settings
to the host SMMUv3 driver. This series brings this VFIO/IOMMU
infrastructure.  Those services are:
- bind the guest stage 1 configuration to the stream table entry
- propagate guest TLB invalidations
- bind MSI IOVAs
- propagate faults collected at physical level up to the virtualizer

This series largely reuses the user API and infrastructure originally
devised for SVA/SVM and patches submitted by Jacob, Yi Liu, Tianyu in
[1-3] and Jean-Philippe [4].

This proof of concept also aims at illustrating how this API/infrastructure
would need to evolve to support this nested stage SMMUv3 use case.

Best Regards

Eric

This series can be found at:
https://github.com/eauger/linux/tree/v4.19-rc4-2stage-rfc-v2

This was tested on Qualcomm HW featuring SMMUv3 and with adapted QEMU
vSMMUv3.

References:
[1] [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual
Address (SVA)
https://lwn.net/Articles/754331/
[2] [RFC PATCH 0/8] Shared Virtual Memory virtualization for VT-d
(VFIO part)
https://lists.linuxfoundation.org/pipermail/iommu/2017-April/021475.html
[3] [RFC PATCH 0/3] VFIO: Report IOMMU fault event to userspace
https://www.spinics.net/lists/kvm/msg145280.html
[4] [v2,00/40] Shared Virtual Addressing for the IOMMU
https://patchwork.ozlabs.org/cover/912129/

History:

v1 -> v2:
- Added the fault reporting capability
- asid properly passed on invalidation (fix assignment of multiple
  devices)
- see individual change logs for more info


Eric Auger (11):
  iommu: Introduce bind_guest_msi
  vfio: VFIO_IOMMU_BIND_MSI
  iommu/smmuv3: Get prepared for nested stage support
  iommu/smmuv3: Implement bind_pasid_table
  iommu/smmuv3: Implement cache_invalidate
  dma-iommu: Implement NESTED_MSI cookie
  iommu/smmuv3: Implement bind_guest_msi
  vfio: VFIO_IOMMU_SET_FAULT_EVENTFD
  vfio: VFIO_IOMMU_GET_FAULT_EVENTS
  vfio: Document nested stage control
  iommu/smmuv3: Report non recoverable faults

Jacob Pan (4):
  iommu: Introduce bind_pasid_table API
  iommu: introduce device fault data
  driver core: add per device iommu param
  iommu: introduce device fault report API

Jean-Philippe Brucker (2):
  iommu/arm-smmu-v3: Link domains and devices
  iommu/arm-smmu-v3: Maintain a SID->device structure

Liu, Yi L (3):
  iommu: Introduce cache_invalidate API
  vfio: VFIO_IOMMU_BIND_PASID_TABLE
  vfio: VFIO_IOMMU_CACHE_INVALIDATE

 Documentation/vfio.txt  |  60 
 drivers/iommu/arm-smmu-v3.c | 476 ++--
 drivers/iommu/dma-iommu.c   |  97 ++-
 drivers/iommu/iommu.c   | 196 -
 drivers/vfio/vfio_iommu_type1.c | 269 ++
 include/linux/device.h  |   3 +
 include/linux/dma-iommu.h   |  11 +
 include/linux/iommu.h   | 134 -
 include/uapi/linux/iommu.h  | 237 
 include/uapi/linux/vfio.h   |  48 
 10 files changed, 1501 insertions(+), 30 deletions(-)
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.17.1



Re: Explicit IOVA management from a PCIe endpoint driver

2018-09-18 Thread poza



On 2018-09-18 03:06, Stephen Warren wrote:

Joerg, Christoph, Marek, Robin,

I believe that the driver for our PCIe endpoint controller hardware
will need to explicitly manage its IOVA space more than current APIs
allow. I'd like to discuss how to make that possible.

First some background on our hardware:

NVIDIA's Xavier SoC contains a Synopsys DesignWare PCIe controller.
This can operate in either root port or endpoint mode. I'm
particularly interested in endpoint mode.

Our particular instantiation of this controller exposes a single
function with a single software-controlled PCIe BAR to the PCIe bus
(there are also BARs for access to DMA controller registers and
outbound MSI configuration, which can both be enabled/disabled but not
used for any other purpose). When a transaction is received from the
PCIe bus, the following happens:

1) Transaction is matched against the BAR base/size (in PCIe address
space) to determine whether it "hits" this BAR or not.

2) The transaction's address is processed by the PCIe controller's ATU
(Address Translation Unit), which can re-write the address that the
transaction accesses.

Our particular instantiation of the hardware only has 2 entries in the
ATU mapping table, which gives very little flexibility in setting up a
mapping.

As an FYI, ATU entries can match PCIe transactions either:
a) Any transaction received on a particular BAR.
b) Any transaction received within a single contiguous window of PCIe
address space. This kind of mapping entry obviously has to be set up
after device enumeration is complete so that it can match the correct
PCIe address.

Each ATU entry maps a single contiguous set of PCIe addresses to a
single contiguous set of IOVAs which are passed to the IOMMU.
Transactions can pass through the ATU without being translated if
desired.

3) The transaction is passed to the IOMMU, which can again re-write
the address that the transaction accesses.

4) The transaction is passed to the memory controller and reads/writes 
DRAM.


In general, we want to be able to expose a large and dynamic set of
data buffers to the PCIe bus; certainly /far/ more than two separate
buffers (the number of ATU table entries). With current Linux APIs,
these buffers will not be located in contiguous or adjacent physical
(DRAM) or virtual (IOVA) addresses, nor in any particular window of
physical or IOVA addresses. However, the ATU's mapping from PCIe to
IOVA can only expose one or two contiguous ranges of IOVA space. These
two sets of requirements are at odds!

So, I'd like to propose some new APIs that the PCIe endpoint driver can 
use:


1) Allocate/reserve an IOVA range of specified size, but don't map
anything into the IOVA range.


I had done some work on this in the past; those patches were tested on
Broadcom HW.


https://lkml.org/lkml/2017/5/16/23,
https://lkml.org/lkml/2017/5/16/21,
https://lkml.org/lkml/2017/5/16/19

I could not pursue it further, since I do not have the same HW to test
it. Although the Qualcomm SoC now also uses a Synopsys DesignWare PCIe
controller, we don't restrict the inbound address range on our SoC.

Of course these patches can easily be ported and extended; they
basically reserve IOVA ranges based on the inbound dma-ranges DT
property.


Regards,
Oza.



2) De-allocate the IOVA range allocated in (1).

3) Map a specific set (scatter-gather list I suppose) of
already-allocated/extant physical addresses into part of an IOVA range
allocated in (1).

4) Unmap a portion of an IOVA range that was mapped by (3).

One final note:

The memory controller can translate accesses to a small region of DRAM
address space into accesses to an interrupt generation module. This
allows devices attached to the PCIe bus to generate interrupts to
software running on the system with the PCIe endpoint controller. Thus
I deliberately described API 3 above as mapping a specific physical
address into IOVA space, as opposed to mapping an existing DRAM
allocation into IOVA space, in order to allow mapping this interrupt
generation address space into IOVA space. If we needed separate APIs
to map physical addresses vs. DRAM allocations into IOVA space, that
would likely be fine too.

Does this API proposal sound reasonable?

I have heard from some NVIDIA developers that the above APIs rather go
against the principle that individual drivers should not be aware of
the presence/absence of an IOMMU, and hence direct management of IOVA
allocation/layout is deliberately avoided, and hence there hasn't been
a need/desire for this kind of API in the past. However, I think our
current hardware design and use-case rather requires it. Do you agree?



Re: [PATCH v2 0/3] iommu: Avoid DMA ops domain refcount contention

2018-09-18 Thread Tom Murphy
>Not sure how fast Tom needs the common dma-iommu code for the x86 AMD iommu 
>conversion.

I am currently busy working on something else and won't be able to
do/test the x86 AMD iommu conversion anytime soon. So I don't need the
common dma-iommu code anytime soon.

On 17 September 2018 at 14:33, Christoph Hellwig  wrote:
> On Fri, Sep 14, 2018 at 01:48:59PM +0100, Will Deacon wrote:
>> > As far as merging goes, I don't mind at all whether this goes via IOMMU,
>> > or via dma-mapping provided Joerg's happy to ack it.
>>
>> I think it makes most sense for Joerg to take this series via his tree.
>
> FYI, I have WIP patches to move the arm dma-iommu wrappers to common
> code:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-maybe-coherent
>
> which will require some synchronization of the involved trees.  In the
> end it is iommu code, so the actual iommu patches should probably
> go into the iommu tree after all, but it might have to pull in the
> dma-mapping branch for that.  Or we just punt it until the next merge
> window.  Not sure how fast Tom needs the common dma-iommu code for
> the x86 AMD iommu conversion.


Re: move swiotlb noncoherent dma support from arm64 to generic code

2018-09-18 Thread Robin Murphy

Hi Christoph,

On 17/09/18 16:38, Christoph Hellwig wrote:

Hi all,

this series starts with various swiotlb cleanups, then adds support for
non-cache coherent devices to the generic swiotlb support, and finally
switches arm64 to use the generic code.


I think there's going to be an issue with the embedded folks' grubby 
hack in arm64's mem_init() which skips initialising SWIOTLB at all with 
sufficiently little DRAM. I've been waiting for 
dma-direct-noncoherent-merge so that I could fix that case to swizzle in 
dma_direct_ops and avoid swiotlb_dma_ops entirely.



Given that this series depends on patches in the dma-mapping tree, or
pending for it I've also published a git tree here:

 git://git.infradead.org/users/hch/misc.git swiotlb-noncoherent


However, upon sitting down to eagerly write that patch I've just 
boot-tested the above branch as-is for a baseline and discovered a 
rather more significant problem: arch_dma_alloc() is recursing back into 
__swiotlb_alloc() and blowing the stack. Not good :(


Robin.

->8-
[4.032760] Insufficient stack space to handle exception!
[4.032765] ESR: 0x9647 -- DABT (current EL)
[4.042666] FAR: 0x0a937fb0
[4.046113] Task stack: [0x0a938000..0x0a93c000]
[4.052399] IRQ stack:  [0x08008000..0x0800c000]
[4.058684] Overflow stack: [0x80097ff4b290..0x80097ff4c290]
[4.064972] CPU: 1 PID: 130 Comm: kworker/1:1 Not tainted 4.19.0-rc2+ 
#681
[4.071775] Hardware name: ARM LTD ARM Juno Development Platform/ARM 
Juno Development Platform, BIOS EDK II Jul 10 2018

[4.082456] Workqueue: events deferred_probe_work_func
[4.087542] pstate: 0005 (nzcv daif -PAN -UAO)
[4.092283] pc : arch_dma_alloc+0x0/0x198
[4.096250] lr : dma_direct_alloc+0x20/0x28
[4.100385] sp : 0a938010
[4.103660] x29: 0a938010 x28: 800974e15238
[4.108918] x27: 08bf30d8 x26: 80097543d400
[4.114176] x25: 0300 x24: 80097543d400
[4.119434] x23: 800974e15238 x22: 1000
[4.124691] x21: 80097543d400 x20: 1000
[4.129948] x19: 0300 x18: 
[4.135206] x17:  x16: 08bf1b58
[4.140463] x15: 091eb688 x14: 8a93ba1d
[4.145720] x13: ff00 x12: 
[4.150977] x11: 0001 x10: ff7f7fff7fff
[4.156235] x9 :  x8 : 
[4.161492] x7 : 800974df9810 x6 : 
[4.166749] x5 :  x4 : 0300
[4.172006] x3 : 006002c0 x2 : 800974e15238
[4.177263] x1 : 1000 x0 : 80097543d400
[4.182521] Kernel panic - not syncing: kernel stack overflow
[4.188207] CPU: 1 PID: 130 Comm: kworker/1:1 Not tainted 4.19.0-rc2+ 
#681
[4.195008] Hardware name: ARM LTD ARM Juno Development Platform/ARM 
Juno Development Platform, BIOS EDK II Jul 10 2018

[4.205681] Workqueue: events deferred_probe_work_func
[4.210765] Call trace:
[4.213183]  dump_backtrace+0x0/0x1f0
[4.216805]  show_stack+0x14/0x20
[4.220084]  dump_stack+0x9c/0xbc
[4.223362]  panic+0x138/0x294
[4.226381]  __stack_chk_fail+0x0/0x18
[4.230088]  handle_bad_stack+0x11c/0x130
[4.234052]  __bad_stack+0x88/0x8c
[4.237415]  arch_dma_alloc+0x0/0x198
[4.241036]  __swiotlb_alloc+0x3c/0x178
[4.244828]  arch_dma_alloc+0xd0/0x198
[4.248534]  dma_direct_alloc+0x20/0x28
[4.252327]  __swiotlb_alloc+0x3c/0x178
[4.256119]  arch_dma_alloc+0xd0/0x198
[4.259825]  dma_direct_alloc+0x20/0x28
[4.263617]  __swiotlb_alloc+0x3c/0x178
[4.267409]  arch_dma_alloc+0xd0/0x198
[4.271115]  dma_direct_alloc+0x20/0x28
[4.274907]  __swiotlb_alloc+0x3c/0x178
[4.278700]  arch_dma_alloc+0xd0/0x198
[4.282405]  dma_direct_alloc+0x20/0x28
[4.286198]  __swiotlb_alloc+0x3c/0x178
[4.289990]  arch_dma_alloc+0xd0/0x198
[4.293696]  dma_direct_alloc+0x20/0x28
[4.297489]  __swiotlb_alloc+0x3c/0x178
[4.301281]  arch_dma_alloc+0xd0/0x198
[4.304987]  dma_direct_alloc+0x20/0x28
[4.308779]  __swiotlb_alloc+0x3c/0x178
[4.312571]  arch_dma_alloc+0xd0/0x198
[4.316277]  dma_direct_alloc+0x20/0x28
[4.320069]  __swiotlb_alloc+0x3c/0x178
[4.323861]  arch_dma_alloc+0xd0/0x198
[4.327567]  dma_direct_alloc+0x20/0x28
[4.331359]  __swiotlb_alloc+0x3c/0x178
[4.335151]  arch_dma_alloc+0xd0/0x198
[4.338857]  dma_direct_alloc+0x20/0x28
[4.342650]  __swiotlb_alloc+0x3c/0x178
[4.346442]  arch_dma_alloc+0xd0/0x198
[4.350148]  dma_direct_alloc+0x20/0x28
[4.353940]  __swiotlb_alloc+0x3c/0x178
[4.357732]  arch_dma_alloc+0xd0/0x198
[4.361438]  dma_direct_alloc+0x20/0x28
[4.365230]  __swiotlb_alloc+0x3c/0x178
[4.369022]  arch_dma_alloc+0xd0/0x198
[4.372728]  dma_direct_alloc+0x20/0x28
[4.376520]  __swiotlb_alloc+0x3c/0x178
[

Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-09-18 Thread Jerome Glisse
On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > > So i want to summarize issues i have as this threads have dig deep into
> > > > details. For this i would like to differentiate two cases first the easy
> > > > one when relying on SVA/SVM. Then the second one when there is no 
> > > > SVA/SVM.
> > > 
> > > Thank you very much for the summary.
> > > 
> > > > In both cases your objectives as i understand them:
> > > > 
> > > > [R1]- expose a common user space API that make it easy to share boiler
> > > >   plate code accross many devices (discovering devices, opening
> > > >   device, creating context, creating command queue ...).
> > > > [R2]- try to share the device as much as possible up to device limits
> > > >   (number of independant queues the device has)
> > > > [R3]- minimize syscall by allowing user space to directly schedule on 
> > > > the
> > > >   device queue without a round trip to the kernel
> > > > 
> > > > I don't think i missed any.
> > > > 
> > > > 
> > > > (1) Device with SVA/SVM
> > > > 
> > > > For that case it is easy, you do not need to be in VFIO or part of any
> > > > thing specific in the kernel. There is no security risk (modulo bug in
> > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > > to a device is just couple dozen lines of code.
> > > > 
> > > 
> > > This is right... logically. But the kernel has no clear definition of a
> > > "Device with SVA/SVM" and no boilerplate for doing so. Then VFIO may
> > > become one such boilerplate.
> > > 
> > > VFIO is one of the wrappers around the IOMMU for user space, and maybe the
> > > only one. If we add that support within VFIO, which solves most of the
> > > problems of SVA/SVM, it will save a lot of work in the future.
> > 
> > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> > all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> > would require any more than that in your case.
> > 
> > 
> > > I think this is the key conflict between us. So could Alex please say
> > > something here? If VFIO is going to take this into its scope, we can try
> > > together to solve all the problems along the way. If it is not, that is
> > > also simple: we can just go another way to fulfill this part of the
> > > requirements, even if we have to duplicate most of the code.
> > > 
> > > Another point I need to emphasize here: because we have to replace the
> > > hardware queue on fork, it won't be very simple even in the SVA/SVM case.
> > 
> > I am assuming hardware queue can only be setup by the kernel and thus
> > you are totally safe fork-wise as the queue is set up against a PASID and
> > the child does not bind to any PASID and you use VM_DONTCOPY on the
> > mmap of the hardware MMIO queue because you should really use that flag
> > for that.
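
To be concrete, the VM_DONTCOPY suggestion amounts to something like the
following in the device's queue mmap handler (all names here are
illustrative, not actual WarpDrive code):

static int foo_queue_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* The hardware queue mapping must not be inherited across fork(). */
	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_IO;

	return remap_pfn_range(vma, vma->vm_start, foo_queue_pfn(file),
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}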
> > 
> > 
> > > > (2) Device does not have SVA/SVM (or it is disabled)
> > > > 
> > > > You want to still allow device to be part of your framework. However
> > > > here i see fundamental security issues and you move the burden of
> > > > being careful to user space which i think is a bad idea. We should
> > > > never trust userspace from kernel space.
> > > > 
> > > > To keep the same API for the user space code you want a 1:1 mapping
> > > > between device physical address and process virtual address (ie if
> > > > device access device physical address A it is accessing the same
> > > > memory as what is backing the virtual address A in the process.
> > > > 
> > > > Security issues are on two things:
> > > > [I1]- fork/exec, a process who opened any such device and created an
> > > >   active queue can transfer without its knowledge control of its
> > > >   commands queue through COW. The parent map some anonymous region
> > > >   to the device as a command queue buffer but because of COW the
> > > >   parent can be the first to copy on write and thus the child can
> > > >   inherit the original pages that are mapped to the hardware.
> > > >   Here parent lose control and child gain it.
> > > 
> > > This is indeed an issue. But it remains an issue only if you continue to 
> > > use the
> > > queue and the memory after fork. We can use at_fork kinds of gadget to 
> > > fix it in
> > > user space.
> > 
> > Trusting user space is a no go from my point of view.
> 
> Can we dive deeper on this? Maybe we have different understandings of "trusting
> user space". In my understanding, "trusting user space" means "no matter what
> the user process does, it should only hurt itself and whatever was given to it,
> not the kernel or other processes".
> 
> In our case, we create a channel between a process and the hardware. The 
> proce

Re: Explicit IOVA management from a PCIe endpoint driver

2018-09-18 Thread Robin Murphy

Hi Stephen,

On 17/09/18 22:36, Stephen Warren wrote:

Joerg, Christoph, Marek, Robin,

I believe that the driver for our PCIe endpoint controller hardware will 
need to explicitly manage its IOVA space more than current APIs allow. 
I'd like to discuss how to make that possible.


First some background on our hardware:

NVIDIA's Xavier SoC contains a Synopsys DesignWare PCIe controller. This 
can operate in either root port or endpoint mode. I'm particularly 
interested in endpoint mode.


Our particular instantiation of this controller exposes a single 
function with a single software-controlled PCIe BAR to the PCIe bus 
(there are also BARs for access to DMA controller registers and outbound 
MSI configuration, which can both be enabled/disabled but not used for 
any other purpose). When a transaction is received from the PCIe bus, 
the following happens:


1) Transaction is matched against the BAR base/size (in PCIe address 
space) to determine whether it "hits" this BAR or not.


2) The transaction's address is processed by the PCIe controller's ATU 
(Address Translation Unit), which can re-write the address that the 
transaction accesses.


Our particular instantiation of the hardware only has 2 entries in the 
ATU mapping table, which gives very little flexibility in setting up a 
mapping.


As an FYI, ATU entries can match PCIe transactions either:
a) Any transaction received on a particular BAR.
b) Any transaction received within a single contiguous window of PCIe 
address space. This kind of mapping entry obviously has to be set up 
after device enumeration is complete so that it can match the correct 
PCIe address.


Each ATU entry maps a single contiguous set of PCIe addresses to a 
single contiguous set of IOVAs which are passed to the IOMMU. 
Transactions can pass through the ATU without being translated if desired.


3) The transaction is passed to the IOMMU, which can again re-write the 
address that the transaction accesses.


4) The transaction is passed to the memory controller and reads/writes 
DRAM.


In general, we want to be able to expose a large and dynamic set of data 
buffers to the PCIe bus; certainly /far/ more than two separate buffers 
(the number of ATU table entries). With current Linux APIs, these 
buffers will not be located in contiguous or adjacent physical (DRAM) or 
virtual (IOVA) addresses, nor in any particular window of physical or 
IOVA addresses. However, the ATU's mapping from PCIe to IOVA can only 
expose one or two contiguous ranges of IOVA space. These two sets of 
requirements are at odds!


So, I'd like to propose some new APIs that the PCIe endpoint driver can 
use:


1) Allocate/reserve an IOVA range of specified size, but don't map 
anything into the IOVA range.


2) De-allocate the IOVA range allocated in (1).

3) Map a specific set (scatter-gather list I suppose) of 
already-allocated/extant physical addresses into part of an IOVA range 
allocated in (1).


4) Unmap a portion of an IOVA range that was mapped by (3).


That all sounds perfectly reasonable - basically it sounds like the 
endpoint framework wants the option to do the same as VFIO or many DRM 
drivers, i.e. set up its own IOMMU domain, attach the endpoint's group, 
and explicitly manage its mappings via IOMMU API calls. Provided you can 
assume cache-coherent PCI, that should be enough to get things going - 
supporting non-coherent endpoints is a little trickier in terms of 
making sure the endpoint controller and/or device gets the right DMA ops 
to only ever perform cache maintenance once you add streaming DMA 
mappings into the mix, but that's not insurmountable (and I think it's 
something we still need to address for DRM anyway, at least on arm64)
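
As a rough sketch of that pattern (reference handling and error
unwinding trimmed; the IOVA and buffer choices are entirely up to the
endpoint driver):

static int ep_map_window(struct device *dev, unsigned long iova,
			 phys_addr_t phys, size_t size)
{
	struct iommu_group *group = iommu_group_get(dev);
	struct iommu_domain *domain = iommu_domain_alloc(dev->bus);
	int ret;

	if (!group || !domain)
		return -ENODEV;

	ret = iommu_attach_group(domain, group);
	if (ret)
		return ret;

	/* Map an already-allocated buffer at an IOVA the driver chose itself. */
	return iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
}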



One final note:

The memory controller can translate accesses to a small region of DRAM 
address space into accesses to an interrupt generation module. This 
allows devices attached to the PCIe bus to generate interrupts to 
software running on the system with the PCIe endpoint controller. Thus I 
deliberately described API 3 above as mapping a specific physical 
address into IOVA space, as opposed to mapping an existing DRAM 
allocation into IOVA space, in order to allow mapping this interrupt 
generation address space into IOVA space. If we needed separate APIs to 
map physical addresses vs. DRAM allocations into IOVA space, that would 
likely be fine too.


If that's the standard DesignWare MSI dingaling, then all you should 
need to do is ensure you IOVA is reserved in your allocator (if it can 
be entirely outside the EP BAR, even better) - AFAIK the writes get 
completely intercepted such that they never go out to the SMMU side at 
all, and thus no actual mapping is even needed.



Does this API proposal sound reasonable?


Indeed, as I say apart from using streaming DMA for coherency management 
(which I think could be added in pretty much orthogonally later), this 
sounds like something you could plumb into the endpoint framework

[PATCH] iommu/amd: return devid as alias for ACPI HID devices

2018-09-18 Thread Arindam Nath
ACPI HID devices do not actually have an alias for
them in the IVRS. But dev_data->alias is still used
for indexing into the IOMMU device table for devices
being handled by the IOMMU. So for ACPI HID devices,
we simply return the corresponding devid as an alias,
as parsed from the IVRS table.

Signed-off-by: Arindam Nath 
---
 drivers/iommu/amd_iommu.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 4e04fff23977..73e47d93e7a0 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -246,7 +246,13 @@ static u16 get_alias(struct device *dev)
 
/* The callers make sure that get_device_id() does not fail here */
devid = get_device_id(dev);
+
+   /* For ACPI HID devices, we simply return the devid as such */
+   if (!dev_is_pci(dev))
+   return devid;
+
ivrs_alias = amd_iommu_alias_table[devid];
+
pci_for_each_dma_alias(pdev, __last_alias, &pci_alias);
 
if (ivrs_alias == pci_alias)
-- 
2.17.1



Re: [PATCH] kernel/dma: Fix panic caused by passing cma to command line

2018-09-18 Thread Marek Szyprowski
Hi

On 2018-09-17 05:24, zhe...@windriver.com wrote:
> From: He Zhe 
>
> early_cma() does not check its input argument before passing it to
> simple_strtoull(). The argument would be a NULL pointer if "cma" is set
> on the command line without a value, which causes the following panic.
>
> PANIC: early exception 0xe3 IP 10:a3e9db8d error 0 cr2 0x0
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 4.19.0-rc3-yocto-standard+ #7
> [0.00] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
> ...
> [0.00] Call Trace:
> [0.00]  simple_strtoull+0x29/0x70
> [0.00]  memparse+0x26/0x90
> [0.00]  early_cma+0x17/0x6a
> [0.00]  do_early_param+0x57/0x8e
> [0.00]  parse_args+0x208/0x320
> [0.00]  ? rdinit_setup+0x30/0x30
> [0.00]  parse_early_options+0x29/0x2d
> [0.00]  ? rdinit_setup+0x30/0x30
> [0.00]  parse_early_param+0x36/0x4d
> [0.00]  setup_arch+0x336/0x99e
> [0.00]  start_kernel+0x6f/0x4e6
> [0.00]  x86_64_start_reservations+0x24/0x26
> [0.00]  x86_64_start_kernel+0x6f/0x72
> [0.00]  secondary_startup_64+0xa4/0xb0
>
> This patch adds a check to prevent the panic.
>
> Signed-off-by: He Zhe 
> Cc: sta...@vger.kernel.org
> Cc: h...@lst.de
> Cc: m.szyprow...@samsung.com
> Cc: robin.mur...@arm.com

Thanks for the fix.

Reviewed-by: Marek Szyprowski 

> ---
>   kernel/dma/contiguous.c | 6 +-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index 286d823..b2a8790 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -49,7 +49,11 @@ static phys_addr_t limit_cmdline;
>   
>   static int __init early_cma(char *p)
>   {
> - pr_debug("%s(%s)\n", __func__, p);
> + if (!p) {
> + pr_err("Config string not provided\n");
> + return -EINVAL;
> + }
> +
>   size_cmdline = memparse(p, &p);
>   if (*p != '@')
>   return 0;

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
