Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.
On Wed, 12 Sep 2018 10:46:19 -0700 "Raj, Ashok" wrote: > On Thu, Aug 09, 2018 at 01:44:17PM -0600, Alex Williamson wrote: > > On Thu, 9 Aug 2018 12:37:06 -0700 > > Ashok Raj wrote: > > > > > PCI_INTERRUPT_PIN should always read 0 for SRIOV Virtual > > > Functions. > > > > > > Some SRIOV devices have some bugs in RTL and VF's end up reading 1 > > > instead of 0 for the PIN. > > > > Hi Ashok, > > > > One question, can we identify which VFs are known to have this > > issue so that users (and downstreams) can know how to prioritize > > this patch? > > Hi Alex > > Sorry it took some time to hunt this down. > > The offending VF has a device ID : 0x270C > The corresponding PF has a device ID: 0x270B. Ok, I interpret Alan's previous comment about the patch[1] to suggest a structure a bit more like that below. IOW, we know that 8086:270c inspires this change, but once it's included we won't know who else relies on it. We can perhaps encourage better hardware validation, or at least better tracking of who needs this with a warning and whitelist. Testing, especially positive and negative testing against the warning, and reviews welcome. Thanks, Alex [1]https://lkml.org/lkml/2018/8/10/462 commit d780da26a81c6f47522ae0aeff03abd4d08b89b5 Author: Alex Williamson Date: Tue Sep 18 21:27:57 2018 -0600 vfio/pci: Mask buggy SR-IOV VF INTx support The SR-IOV spec requires that VFs must report zero for the INTx pin register as VFs are precluded from INTx support. It's much easier for the host kernel to understand whether a device is a VF and therefore whether a non-zero pin register value is bogus than it is to do the same in userspace. Override the INTx count for such devices and virtualize the pin register to provide a consistent view of the device to the user. As this is clearly a spec violation, warn about it to support hardware validation, but also provide a known whitelist as it doesn't do much good to continue complaining if the hardware vendor doesn't plan to fix it. Known devices with this issue: 8086:270c Signed-off-by: Alex Williamson diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index cddb453a1ba5..8af3f6f35f32 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -430,14 +430,41 @@ static int vfio_pci_open(void *device_data) return ret; } +static const struct pci_device_id known_bogus_vf_intx_pin[] = { + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x270c) }, + {} +}; + static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type) { if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) { u8 pin; + + if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx) + return 0; + pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin); - if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin) - return 1; + /* +* Per SR-IOV spec rev 1.1, 3.4.1.18 the interrupt pin register +* does not apply to VFs and VFs must implement this register +* as read-only with value zero. Userspace is not readily +* able to identify a device as a VF and thus that the pin +* definition on the device is bogus should a device violate +* this requirement. For such devices, override the bogus +* value and provide a warning to support hardware validation +* (or be quite if it's known). PCI config space emulation +* will virtualize this register for all VFs. +*/ + if (pin && vdev->pdev->is_virtfn) { + if (!pci_match_id(known_bogus_vf_intx_pin, vdev->pdev)) + dev_warn_once(&vdev->pdev->dev, + "VF reports bogus INTx pin %d\n", + pin); + return 0; + } + + return pin ? 
1 : 0; } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) { u8 pos; u16 flags; diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 62023b4a373b..25130fa6e265 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1678,7 +1678,8 @@ int vfio_config_init(struct vfio_pci_device *vdev) *(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device); } - if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx) + if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx || + pdev->is_virtfn) vconfig[PCI_INTERRUPT_PIN] = 0; ret = vfio_cap_init(vdev); ___ iommu mailing list iommu@lists.linux-foundation.org https://list
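For reference, a minimal userspace sketch (not part of the patch above) of how the change is visible through the VFIO ioctl interface. It assumes device_fd has already been obtained through the usual container/group setup; with the patch applied, a VF with a bogus pin register reports an interrupt count of zero for the INTx index:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Returns 1 if the device exposes INTx, 0 if not, -1 on ioctl failure. */
static int device_has_intx(int device_fd)
{
        struct vfio_irq_info irq = {
                .argsz = sizeof(irq),
                .index = VFIO_PCI_INTX_IRQ_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq))
                return -1;

        printf("INTx count: %u\n", irq.count);
        return irq.count != 0;
}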
Re: [PATCH v4] iommu/iova: Optimise attempts to allocate iova from 32bit address range
Hi Joerg, can you please pull this patch? On Wed, Sep 5, 2018 at 9:58 AM Ganapatrao Kulkarni wrote: > > As an optimisation for PCI devices, there is always first attempt > been made to allocate iova from SAC address range. This will lead > to unnecessary attempts, when there are no free ranges > available. Adding fix to track recently failed iova address size and > allow further attempts, only if requested size is lesser than a failed > size. The size is updated when any replenish happens. > > Reviewed-by: Robin Murphy > Signed-off-by: Ganapatrao Kulkarni > --- > v4: > Rebsaed to 4.19-rc2 > v3: > Update with comments [3] from Robin Murphy > > [3] https://lkml.org/lkml/2018/8/13/116 > > v2: update with comments [2] from Robin Murphy > > [2] https://lkml.org/lkml/2018/8/7/166 > > v1: Based on comments from Robin Murphy > for patch [1] > > [1] https://lkml.org/lkml/2018/4/19/780 > > drivers/iommu/iova.c | 22 +++--- > include/linux/iova.h | 1 + > 2 files changed, 16 insertions(+), 7 deletions(-) > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c > index 83fe2621effe..f8d3ba247523 100644 > --- a/drivers/iommu/iova.c > +++ b/drivers/iommu/iova.c > @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long > granule, > iovad->granule = granule; > iovad->start_pfn = start_pfn; > iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); > + iovad->max32_alloc_size = iovad->dma_32bit_pfn; > iovad->flush_cb = NULL; > iovad->fq = NULL; > iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; > @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, > struct iova *free) > > cached_iova = rb_entry(iovad->cached32_node, struct iova, node); > if (free->pfn_hi < iovad->dma_32bit_pfn && > - free->pfn_lo >= cached_iova->pfn_lo) > + free->pfn_lo >= cached_iova->pfn_lo) { > iovad->cached32_node = rb_next(&free->node); > + iovad->max32_alloc_size = iovad->dma_32bit_pfn; > + } > > cached_iova = rb_entry(iovad->cached_node, struct iova, node); > if (free->pfn_lo >= cached_iova->pfn_lo) > @@ -190,6 +193,10 @@ static int __alloc_and_insert_iova_range(struct > iova_domain *iovad, > > /* Walk the tree backwards */ > spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); > + if (limit_pfn <= iovad->dma_32bit_pfn && > + size >= iovad->max32_alloc_size) > + goto iova32_full; > + > curr = __get_cached_rbnode(iovad, limit_pfn); > curr_iova = rb_entry(curr, struct iova, node); > do { > @@ -200,10 +207,8 @@ static int __alloc_and_insert_iova_range(struct > iova_domain *iovad, > curr_iova = rb_entry(curr, struct iova, node); > } while (curr && new_pfn <= curr_iova->pfn_hi); > > - if (limit_pfn < size || new_pfn < iovad->start_pfn) { > - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); > - return -ENOMEM; > - } > + if (limit_pfn < size || new_pfn < iovad->start_pfn) > + goto iova32_full; > > /* pfn_lo will point to size aligned address if size_aligned is set */ > new->pfn_lo = new_pfn; > @@ -214,9 +219,12 @@ static int __alloc_and_insert_iova_range(struct > iova_domain *iovad, > __cached_rbnode_insert_update(iovad, new); > > spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); > - > - > return 0; > + > +iova32_full: > + iovad->max32_alloc_size = size; > + spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); > + return -ENOMEM; > } > > static struct kmem_cache *iova_cache; > diff --git a/include/linux/iova.h b/include/linux/iova.h > index 928442dda565..0b93bf96693e 100644 > --- a/include/linux/iova.h > +++ b/include/linux/iova.h > @@ -75,6 +75,7 @@ struct 
iova_domain { > unsigned long granule;/* pfn granularity for this domain */ > unsigned long start_pfn; /* Lower limit for this domain */ > unsigned long dma_32bit_pfn; > + unsigned long max32_alloc_size; /* Size of last failed allocation */ > struct iova anchor; /* rbtree lookup anchor */ > struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE]; /* IOVA range > caches */ > > -- > 2.18.0 > thanks, Ganapat ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
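As an aside, the core idea of the patch boils down to a stripped-down sketch (illustrative only, not kernel code; in the real patch the cap is lifted only when the cached 32-bit rbtree node is updated, not on every free below 4GB):

struct toy_iovad {
        unsigned long dma_32bit_pfn;    /* the 4GB boundary, in IOVA pages */
        unsigned long max32_alloc_size; /* size of the last failed 32-bit alloc */
};

static long toy_alloc(struct toy_iovad *iovad, unsigned long size,
                      unsigned long limit_pfn)
{
        /* Fail fast: a request at least as large as the last failure cannot fit. */
        if (limit_pfn <= iovad->dma_32bit_pfn &&
            size >= iovad->max32_alloc_size)
                return -1;

        /* ... walk the rbtree as before; if no hole is found, remember the size: */
        iovad->max32_alloc_size = size;
        return -1;
}

static void toy_free(struct toy_iovad *iovad, unsigned long pfn_hi)
{
        /* Freeing below the boundary replenishes space, so lift the cap again. */
        if (pfn_hi < iovad->dma_32bit_pfn)
                iovad->max32_alloc_size = iovad->dma_32bit_pfn;
}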
RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] > Sent: Tuesday, September 18, 2018 11:47 PM > > On 14/09/2018 22:04, Jacob Pan wrote: > >> This example only needs to modify first-level translation, and works > >> with SMMUv3. The kernel here could be the host, in which case > >> second-level translation is disabled in the SMMU, or it could be the > >> guest, in which case second-level mappings are created by QEMU and > >> first-level translation is managed by assigning PASID tables to the > >> guest. > > There is a difference in case of guest SVA. VT-d v3 will bind guest > > PASID and guest CR3 instead of the guest PASID table. Then turn on > > nesting. In case of mdev, the second level is obtained from the aux > > domain which was setup for the default PASID. Or in case of PCI device, > > second level is harvested from RID2PASID. > > Right, though I wasn't talking about the host managing guest SVA here, > but a kernel binding the address space of one of its userspace drivers > to the mdev. > > >> So (2) would use iommu_sva_bind_device(), > > We would need something different than that for guest bind, just to show > > the two cases:> > > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, > int > > *pasid, unsigned long flags, void *drvdata) > > > > (WIP) > > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data) > > where: > > /** > > * struct gpasid_bind_data - Information about device and guest PASID > > binding > > * @pasid: Process address space ID used for the guest mm > > * @addr_width: Guest address width. Paging mode can also be derived. > > * @gcr3: Guest CR3 value from guest mm > > */ > > struct gpasid_bind_data { > > __u32 pasid; > > __u64 gcr3; > > __u32 addr_width; > > __u32 flags; > > #define IOMMU_SVA_GPASID_SRE BIT(0) /* supervisor request */ > > }; > > Perhaps there is room to merge with io_mm but the life cycle > management > > of guest PASID and host PASID will be different if you rely on mm > > release callback than FD. let's not calling gpasid here - which makes sense only in bind_pasid_table proposal where pasid table thus pasid space is managed by guest. In above context it is always about host pasid (allocated in system-wide), which could point to a host cr3 (user process) or a guest cr3 (vm case). > > I think gpasid management should stay separate from io_mm, since in your > case VFIO mechanisms are used for life cycle management of the VM, > similarly to the former bind_pasid_table proposal. For example closing > the container fd would unbind all guest page tables. The QEMU process' > address space lifetime seems like the wrong thing to track for gpasid. I sort of agree (though not thinking through all the flow carefully). PASIDs are allocated per iommu domain, thus release also happens when domain is detached (along with container fd close). > > >> but (1) needs something > >> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary > >> domain to second-level or nested translation? It seems silly to use a > >> different API for first-level, since the flow in userspace and VFIO > >> is the same as your second-level case as far as MAP_DMA ioctl goes. > >> The difference is that in your case the auxiliary domain supports an > >> additional operation which binds first-level page tables. An > >> auxiliary domain that only supports first-level wouldn't support this > >> operation, but it can still implement iommu_map/unmap/etc. 
> >> > > I think the intention is that when a mdev is created, we don;t > > know whether it will be used for SVA or IOVA. So aux domain is here to > > "hold a spot" for the default PASID such that MAP_DMA calls can work as > > usual, which is second level only. Later, if SVA is used on the mdev > > there will be another PASID allocated for that purpose. > > Do we need to create an aux domain for each PASID? the translation can > > be looked up by the combination of parent dev and pasid. > > When allocating a new PASID for the guest, I suppose you need to clone > the second-level translation config? In which case a single aux domain > for the mdev might be easier to implement in the IOMMU driver. Entirely > up to you since we don't have this case on SMMUv3 > One thing to highlight in related discussions (also mentioned in other thread). There is not a new iommu domain type called 'aux'. 'aux' matters only to a specific device when a domain is attached to that device which has aux capability enabled. Same domain can be attached to other device as normal domain. In that case multiple PASIDs allocated on same mdev are tied to same aux domain, same bare metal SVA case, i.e. any domain (normal or aux) can include 2nd level structure and multiple 1st level structures. Jean is correct - all PASIDs in same domain then share 2nd level translation, and there are io_mm or similar tracking structures to associate each PASID to a 1st level t
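To make the two bind flavours contrasted above concrete, here is a sketch using the prototypes quoted in this thread. Both iommu_sva_bind_device() and sva_bind_gpasid() are proposals/WIP at this point rather than merged kernel API, and the two helpers below are purely illustrative:

/*
 * (1) Native SVA: bind the current process' address space to the device;
 * the kernel allocates the PASID and tracks the mm lifetime.
 */
static int bind_native_sva(struct device *dev, void *drvdata)
{
        int pasid;

        return iommu_sva_bind_device(dev, current->mm, &pasid, 0, drvdata);
}

/*
 * (2) Guest SVA: the PASID and the first-level page-table root (guest CR3)
 * come from the guest; the host only sets up nesting on top of the
 * second-level mappings it already owns.
 */
static int bind_guest_pasid(struct device *dev, u32 pasid, u64 gcr3,
                            u32 addr_width)
{
        struct gpasid_bind_data data = {
                .pasid      = pasid,
                .gcr3       = gcr3,
                .addr_width = addr_width,
        };

        return sva_bind_gpasid(dev, &data);
}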
Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers
Hi, On 09/19/2018 07:26 AM, Tian, Kevin wrote: From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] Sent: Tuesday, September 18, 2018 11:52 PM On 15/09/2018 03:36, Tian, Kevin wrote: 4) Userspace opens another mdev. -> iommu.c calls domain->ops->attach_dev(domain2, dev) another mdev in same VFIO container or different? I assume the latter since you mentioned a new domain2. I was thinking a different VFIO container actually. I used domain2 to try to make the example clearer 1)? When the container is closed, VFIO calls iommu_detach_device(domain2, parent_dev) -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev) Given that auxiliary domains are attached, the IOMMU driver could deduce that this actually means "detach an auxiliary domain". But which one? I didn't get this one. There is no need to stick to 1) behavior for 4), i.e. below is expected: domain2->ops->detach_dev(domain2, dev) But in order to get that, the IOMMU core needs to know that domain2 is auxiliary. Otherwise, detach_dev is never called when a default_domain is present for the parent dev. I guess one solution is to add an "auxiliary" attribute to iommu_domain, so __iommu_detach_group would do something like: this doesn't work. same domain can be also attached to another physical device as non-aux domain (e.g. passthrough) at the same time (vfio-pci device and vfio-mdev device in same container), then default domain tweak is required in that case. "aux" takes effect only per-device, not per-domain. If we have below APIs for aux domain (the API names are just for discussion purpose, subject to change): iommu_querry_aux_domain_capability(dev) iommu_enable_aux_domain(dev) iommu_disable_aux_domain(dev) iommu_check_aux_domain_status(dev) then, we could do this like below: diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index ab3d7d3b1583..3bfb652c78e8 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1469,12 +1469,31 @@ static int iommu_group_do_detach_device(struct device *dev, void *data) return 0; } +static int iommu_group_check_aux_domain(struct device *dev, void *data) +{ + const struct iommu_ops *ops = dev->bus->iommu_ops; + + if (ops && ops->check_auxd) + return !ops->check_auxd(dev); + + return -EINVAL; +} + +/* + * Check whether devices in @group have aux domain enabled. + */ +static int iommu_group_aux_domain_enabled(struct iommu_group *group) +{ + return __iommu_group_for_each_dev(group, NULL, + iommu_group_check_aux_domain); +} + static void __iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group) { int ret; - if (!group->default_domain) { + if (!group->default_domain || iommu_group_aux_domain_enabled(group)) { __iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device); group->domain = NULL; Best regards, Lu Baolu diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 7113fe398b70..2b3e9b91aee7 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct iommu_domain *domain, { int ret; - if (!group->default_domain) { + if (!group->default_domain || domain->auxiliary) { __iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device); - group->domain = NULL; + if (!domain->auxiliary) + group->domain = NULL; return; } Not sure who would set this "auxiliary" attribute... Maybe the IOMMU driver, when attaching the domain to a device that has auxiliary mode enabled? why cannot ARM implement a detach_dev for aux_domain too? 
My feeling is that default domain twist is only for switch between 1/2/3 in concept. If the core actually calls it, we can implement detach_dev :) The problem is that the core never calls detach_dev when default_domain is present (affects any IOMMU driver that relies on default_domain, including AMD), and even in case 4) the default_domain is present for the parent device Then can we change that core logic so detach_dev is invoked in all cases? yes there will be some changes in vendor drivers, but I expect this change trivial (especially considering the gain in IOMMU API simplicity side as described below). So the proposed interface doesn't seem to work as is. If we want to use iommu_attach/detach_device for auxiliary domains, the existing behavior of iommu.c, and IOMMU drivers that rely on it, have to change. Any change I can think of right now seems more daunting than introducing a different path for auxiliary domains, like iommu_attach_aux_domain for example. introducing *aux* specific API will cause different VFIO cod
RE: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] > Sent: Tuesday, September 18, 2018 11:52 PM > > On 15/09/2018 03:36, Tian, Kevin wrote: > >> 4) Userspace opens another mdev. > >> -> iommu.c calls domain->ops->attach_dev(domain2, dev) > > > > another mdev in same VFIO container or different? I assume the > > latter since you mentioned a new domain2. > > I was thinking a different VFIO container actually. I used domain2 to > try to make the example clearer > > >> 1)? When the container is closed, VFIO calls > >> iommu_detach_device(domain2, parent_dev) > >> -> iommu.c calls default_domain->ops->attach_dev(default_domain, > dev) > >> Given that auxiliary domains are attached, the IOMMU driver could > deduce > >> that this actually means "detach an auxiliary domain". But which one? > > > > I didn't get this one. There is no need to stick to 1) behavior for > > 4), i.e. below is expected: > > domain2->ops->detach_dev(domain2, dev) > > But in order to get that, the IOMMU core needs to know that domain2 is > auxiliary. Otherwise, detach_dev is never called when a default_domain > is present for the parent dev. > > I guess one solution is to add an "auxiliary" attribute to iommu_domain, > so __iommu_detach_group would do something like: this doesn't work. same domain can be also attached to another physical device as non-aux domain (e.g. passthrough) at the same time (vfio-pci device and vfio-mdev device in same container), then default domain tweak is required in that case. "aux" takes effect only per-device, not per-domain. > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 7113fe398b70..2b3e9b91aee7 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct > iommu_domain *domain, > { > int ret; > > - if (!group->default_domain) { > + if (!group->default_domain || domain->auxiliary) { > __iommu_group_for_each_dev(group, domain, > iommu_group_do_detach_device); > - group->domain = NULL; > + if (!domain->auxiliary) > + group->domain = NULL; > return; > } > > Not sure who would set this "auxiliary" attribute... Maybe the IOMMU > driver, when attaching the domain to a device that has auxiliary mode > enabled? > > > why cannot ARM implement a detach_dev for aux_domain too? My > > feeling is that default domain twist is only for switch between 1/2/3 > > in concept. > > If the core actually calls it, we can implement detach_dev :) The > problem is that the core never calls detach_dev when default_domain is > present (affects any IOMMU driver that relies on default_domain, > including AMD), and even in case 4) the default_domain is present for > the parent device Then can we change that core logic so detach_dev is invoked in all cases? yes there will be some changes in vendor drivers, but I expect this change trivial (especially considering the gain in IOMMU API simplicity side as described below). > > >> So the proposed interface doesn't seem to work as is. If we want to use > >> iommu_attach/detach_device for auxiliary domains, the existing > behavior > >> of iommu.c, and IOMMU drivers that rely on it, have to change. Any > >> change I can think of right now seems more daunting than introducing a > >> different path for auxiliary domains, like iommu_attach_aux_domain for > >> example. > >> > > > > introducing *aux* specific API will cause different VFIO code path to > > handle RID-based and PASID-based mdev, since RID-based still needs > > to use normal attach_domain that way. 
> > The PASID-based mdev still requires a special case to retrieve the > allocated PASID and program it in the parent device, so VFIO will need > to know the difference between the two > that retrieve/program is down by parent driver, instead of VFIO. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 6/6] iommu/arm-smmu: Support non-strict mode
On 2018-09-18 6:10 PM, Will Deacon wrote: On Fri, Sep 14, 2018 at 03:30:24PM +0100, Robin Murphy wrote: All we need is to wire up .flush_iotlb_all properly and implement the domain attribute, and iommu-dma and io-pgtable-arm will do the rest for us. Rather than bother implementing it for v7s format for the highly unlikely chance of that being relevant, we can simply hide the non-strict flag from io-pgtable for that combination just so anyone who does actually try it will simply get over-invalidation instead of failure to initialise domains. Signed-off-by: Robin Murphy --- drivers/iommu/arm-smmu.c | 40 +--- 1 file changed, 33 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index fd1b80ef9490..aa5be334753b 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -246,6 +246,7 @@ struct arm_smmu_domain { const struct iommu_gather_ops *tlb_ops; struct arm_smmu_cfg cfg; enum arm_smmu_domain_stage stage; + boolnon_strict; struct mutexinit_mutex; /* Protects smmu pointer */ spinlock_t cb_lock; /* Serialises ATS1* ops and TLB syncs */ struct iommu_domain domain; @@ -863,6 +864,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain, if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA; + if (smmu_domain->non_strict && cfg->fmt != ARM_SMMU_CTX_FMT_AARCH32_S) + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; Does this mean we end up over-invalidating when using short-descriptor? Could we not bypass the flush queue in this case instead? Ideally, we'd just reject the domain attribute but I don't know if we know about the page-table format early enough for that. Alternatively, we could force long format if the attribute is set. What do you think? If someone manages to run an arm64 kernel on a theoretical SMMUv2 implementation which only supports short-descriptor, *and* explicitly sets the command-line option, then yes, they'll get both the synchronous TLBIs and the periodic TLBIALLs. As implied by the commit message, my natural response is "don't do that". However, it will almost certainly take more effort to argue about it or come up with other bodges than it will to just implement the quirk in the v7s code, so if you really think it's a valid concern just shout. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 5/6] iommu/arm-smmu-v3: Add support for non-strict mode
On 2018-09-18 6:10 PM, Will Deacon wrote: On Fri, Sep 14, 2018 at 03:30:23PM +0100, Robin Murphy wrote: From: Zhen Lei Dynamically choose strict or non-strict mode for page table config based on the iommu domain type. It's the domain type in conjunction with the attribute that determines whether we use lazy or strict invalidation. Signed-off-by: Zhen Lei [rm: convert to domain attribute] Signed-off-by: Robin Murphy --- drivers/iommu/arm-smmu-v3.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f10c852479fc..7bbfa5f7ce8e 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -612,6 +612,7 @@ struct arm_smmu_domain { struct mutexinit_mutex; /* Protects smmu pointer */ struct io_pgtable_ops *pgtbl_ops; + boolnon_strict; enum arm_smmu_domain_stage stage; union { @@ -1633,6 +1634,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain) if (smmu->features & ARM_SMMU_FEAT_COHERENCY) pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA; + if (smmu_domain->non_strict) + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; + pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain); if (!pgtbl_ops) return -ENOMEM; @@ -1934,13 +1938,17 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain, { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - if (domain->type != IOMMU_DOMAIN_UNMANAGED) - return -EINVAL; - switch (attr) { case DOMAIN_ATTR_NESTING: + if (domain->type != IOMMU_DOMAIN_UNMANAGED) + return -EINVAL; *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); return 0; + case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: + if (domain->type != IOMMU_DOMAIN_DMA) + return -EINVAL; + *(int *)data = smmu_domain->non_strict; + return 0; default: return -ENODEV; Hmm, there's a change in behaviour here (and also in the set function) which is that unknown attributes now return -ENODEV for managed domains instead of -EINVAL. I don't know if that's a problem, but I'd be inclined to switch on the domain type and then have a nested switch for the supported attributes. Sure, a nested switch did actually cross my mind, but I was worried it might be a little boilerplate-heavy since there's still only one of each case (and this quick'n'dirty copy-paste job didn't need any thought...) If that's your preference too, though, I'll respin both driver patches that way. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 4/6] iommu: Add bootup option "iommu.non_strict"
On 2018-09-18 6:10 PM, Will Deacon wrote: On Fri, Sep 14, 2018 at 03:30:22PM +0100, Robin Murphy wrote: From: Zhen Lei Add a bootup option to make the system manager can choose which mode to be used. The default mode is strict. Signed-off-by: Zhen Lei [rm: move handling out of SMMUv3 driver] Signed-off-by: Robin Murphy --- .../admin-guide/kernel-parameters.txt | 13 ++ drivers/iommu/iommu.c | 26 +++ 2 files changed, 39 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9871e649ffef..406b91759b62 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1749,6 +1749,19 @@ nobypass[PPC/POWERNV] Disable IOMMU bypass, using IOMMU for PCI devices. + iommu.non_strict= [ARM64] + Format: { "0" | "1" } + 0 - strict mode, default. + Release IOVAs after the related TLBs are invalid + completely. + 1 - non-strict mode. + Put off TLBs invalidation and release memory first. + It's good for scatter-gather performance but lacks + full isolation, an untrusted device can access the + reused memory because the TLBs may still valid. + Please take full consideration before choosing this + mode. Note that, VFIO will always use strict mode. This text needs help. How about something like: 0 - strict mode, default. Invalidate the TLB of the IOMMU hardware as part of every unmap() operation. 1 - lazy mode. Defer TLB invalidation so that the TLB of the IOMMU hardware is invalidated periodically, rather than as part of every unmap() operation. (generally, I think I'd s/non strict/lazy/ in this patch to avoid the double negatives) + iommu.passthrough= [ARM64] Configure DMA to bypass the IOMMU by default. Format: { "0" | "1" } diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 8c15c5980299..2cabd0c0a4f3 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -41,6 +41,7 @@ static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY; #else static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA; #endif +static bool iommu_dma_non_strict __read_mostly; struct iommu_callback_data { const struct iommu_ops *ops; @@ -131,6 +132,24 @@ static int __init iommu_set_def_domain_type(char *str) } early_param("iommu.passthrough", iommu_set_def_domain_type); +static int __init iommu_dma_setup(char *str) +{ + int ret; + + ret = kstrtobool(str, &iommu_dma_non_strict); + if (ret) + return ret; + + if (iommu_dma_non_strict) { + pr_warn("WARNING: iommu non-strict mode is chosen.\n" + "It's good for scatter-gather performance but lacks full isolation\n"); Hmm, not sure about this message either and tainting is probably over the top. Maybe drop the taint and just pr_info something like "IOMMU DMA ops using lazy TLB invalidation: unable to protect against malicious devices" + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); + } + + return 0; +} +early_param("iommu.non_strict", iommu_dma_setup); + static ssize_t iommu_group_attr_show(struct kobject *kobj, struct attribute *__attr, char *buf) { @@ -1072,6 +1091,13 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev) group->default_domain = dom; if (!group->domain) group->domain = dom; + + if (dom && iommu_dma_non_strict) { + int attr = 1; + iommu_domain_set_attr(dom, + DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, + &attr); + } Hmm, I don't think we can guarantee that we're working with the DMA domain here. Does this all fall out in the wash for the identity domain? 
Indeed so - for one, I expect drivers to reject it for anything that isn't their own default DMA ops domain type (as #5 and #6 do), and furthermore it only has any effect once iommu_dma_init_domain() reads it back if it stuck, and other domain types should never be getting passed into there anyway.

Robin.
Re: [PATCH v7 2/6] iommu/dma: Add support for non-strict mode
On 2018-09-18 6:10 PM, Will Deacon wrote: Hi Robin, On Fri, Sep 14, 2018 at 03:30:20PM +0100, Robin Murphy wrote: From: Zhen Lei 1. Save the related domain pointer in struct iommu_dma_cookie, make iovad capable call domain->ops->flush_iotlb_all to flush TLB. 2. During the iommu domain initialization phase, base on domain->non_strict field to check whether non-strict mode is supported or not. If so, call init_iova_flush_queue to register iovad->flush_cb callback. 3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap -->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to put off iova freeing, and omit iommu_tlb_sync operation. Hmm, this is basically just a commentary on the code. Please could you write it more in terms of the problem that's being solved? Sure - I intentionally kept a light touch when it came to the documentation and commit messages in this rework (other than patch #1 where I eventually remembered the original reasoning and that it wasn't a bug). If we're more-or-less happy with the shape of the technical side I'll make sure to take a final pass through v8 to tidy up all the prose. Signed-off-by: Zhen Lei [rm: convert raw boolean to domain attribute] Signed-off-by: Robin Murphy --- drivers/iommu/dma-iommu.c | 29 - include/linux/iommu.h | 1 + 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 511ff9a1d6d9..092e6926dc3c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -55,6 +55,9 @@ struct iommu_dma_cookie { }; struct list_headmsi_page_list; spinlock_t msi_lock; + + /* Only be assigned in non-strict mode, otherwise it's NULL */ + struct iommu_domain *domain; }; static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie) @@ -257,6 +260,17 @@ static int iova_reserve_iommu_regions(struct device *dev, return ret; } +static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad) +{ + struct iommu_dma_cookie *cookie; + struct iommu_domain *domain; + + cookie = container_of(iovad, struct iommu_dma_cookie, iovad); + domain = cookie->domain; + + domain->ops->flush_iotlb_all(domain); Can we rely on this function pointer being non-NULL? I think it would be better to call iommu_flush_tlb_all(cookie->domain) instead. Yeah, that's deliberate - in fact got as far as writing that change, then undid it as I realised that although the attribute conversion got rid of the explicit ops->flush_iotlb_all check, it still makes zero sense for an IOMMU driver to claim to support the flush queue attribute without also providing the relevant callback, so I do actually want this to blow up rather than silently do nothing if that assumption isn't met. +} + /** * iommu_dma_init_domain - Initialise a DMA mapping domain * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie() @@ -275,6 +289,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; struct iova_domain *iovad = &cookie->iovad; unsigned long order, base_pfn, end_pfn; + int attr = 1; Do we actually need to initialise this? Oops, no, that's a left-over from the turned-out-messier-that-I-thought v6 implementation. Thanks, Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 0/6] Add non-strict mode support for iommu-dma
Hi Will, On 2018-09-18 6:10 PM, Will Deacon wrote: Hi Robin, Thanks for turning this around so quickly. Cheers for a pretty rapid review too :) On Fri, Sep 14, 2018 at 03:30:18PM +0100, Robin Murphy wrote: Since we'd like to get this polished up and merged and Leizhen has other commitments, here's v7 of the previous series[1] wherein I address all my own feedback :) This is a quick tweak of the v6 I sent yesterday since I figured out slightly too late a much neater way of setting the attribute at the appropriate time. The principal change is that I've inverted things slightly such that it's now a generic domain attribute controlled by iommu-dma given the necessary support from individual IOMMU drivers. That way we can easily enable other drivers straight away, as I've done for SMMUv2 here (which also allowed me to give it a quick test with MMU-401s on a Juno board). Otherwise it's really just cosmetic cleanup and rebasing onto Will's pending SMMU queue. I've been through and had a look, leaving some small comments on the patches themselves. The only part I failed to figure out is how you tie the lifetime of the flush queue to the lifetime of the domain so that the timer callback can't fire after e.g. the DMA cookie has been freed. How does that work? Er, to be honest I haven't looked or even considered it! Other than the parts I've massaged I kinda took the functionality of the previous series for granted. Let me cross-check the x86 code and figure it out. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: Explicit IOVA management from a PCIe endpoint driver
On 09/18/2018 04:59 AM, Robin Murphy wrote: Hi Stephen, On 17/09/18 22:36, Stephen Warren wrote: Joerg, Christoph, Marek, Robin, I believe that the driver for our PCIe endpoint controller hardware will need to explicitly manage its IOVA space more than current APIs allow. I'd like to discuss how to make that possible. ... One final note: The memory controller can translate accesses to a small region of DRAM address space into accesses to an interrupt generation module. This allows devices attached to the PCIe bus to generate interrupts to software running on the system with the PCIe endpoint controller. Thus I deliberately described API 3 above as mapping a specific physical address into IOVA space, as opposed to mapping an existing DRAM allocation into IOVA space, in order to allow mapping this interrupt generation address space into IOVA space. If we needed separate APIs to map physical addresses vs. DRAM allocations into IOVA space, that would likely be fine too. If that's the standard DesignWare MSI dingaling, then all you should need to do is ensure you IOVA is reserved in your allocator (if it can be entirely outside the EP BAR, even better) - AFAIK the writes get completely intercepted such that they never go out to the SMMU side at all, and thus no actual mapping is even needed. Unfortunately it's not. We have some custom hardware module (that already existed for other purposes, such as interaction/synchronization between various graphics modules) that we will slightly repurpose as a plain interrupt generator for PCIe endpoint use-cases. Does this API proposal sound reasonable? Indeed, as I say apart from using streaming DMA for coherency management (which I think could be added in pretty much orthogonally later), this sounds like something you could plumb into the endpoint framework right now with no dependent changes elsewhere. Great. I'll take a look at Oza's code and see about getting this implemented. Thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 6/6] iommu/arm-smmu: Support non-strict mode
On Fri, Sep 14, 2018 at 03:30:24PM +0100, Robin Murphy wrote: > All we need is to wire up .flush_iotlb_all properly and implement the > domain attribute, and iommu-dma and io-pgtable-arm will do the rest for > us. Rather than bother implementing it for v7s format for the highly > unlikely chance of that being relevant, we can simply hide the > non-strict flag from io-pgtable for that combination just so anyone who > does actually try it will simply get over-invalidation instead of > failure to initialise domains. > > Signed-off-by: Robin Murphy > --- > drivers/iommu/arm-smmu.c | 40 +--- > 1 file changed, 33 insertions(+), 7 deletions(-) > > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c > index fd1b80ef9490..aa5be334753b 100644 > --- a/drivers/iommu/arm-smmu.c > +++ b/drivers/iommu/arm-smmu.c > @@ -246,6 +246,7 @@ struct arm_smmu_domain { > const struct iommu_gather_ops *tlb_ops; > struct arm_smmu_cfg cfg; > enum arm_smmu_domain_stage stage; > + boolnon_strict; > struct mutexinit_mutex; /* Protects smmu pointer */ > spinlock_t cb_lock; /* Serialises ATS1* ops and > TLB syncs */ > struct iommu_domain domain; > @@ -863,6 +864,9 @@ static int arm_smmu_init_domain_context(struct > iommu_domain *domain, > if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) > pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA; > > + if (smmu_domain->non_strict && cfg->fmt != ARM_SMMU_CTX_FMT_AARCH32_S) > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; Does this mean we end up over-invalidating when using short-descriptor? Could we not bypass the flush queue in this case instead? Ideally, we'd just reject the domain attribute but I don't know if we know about the page-table format early enough for that. Alternatively, we could force long format if the attribute is set. What do you think? Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 5/6] iommu/arm-smmu-v3: Add support for non-strict mode
On Fri, Sep 14, 2018 at 03:30:23PM +0100, Robin Murphy wrote: > From: Zhen Lei > > Dynamically choose strict or non-strict mode for page table config based > on the iommu domain type. It's the domain type in conjunction with the attribute that determines whether we use lazy or strict invalidation. > Signed-off-by: Zhen Lei > [rm: convert to domain attribute] > Signed-off-by: Robin Murphy > --- > drivers/iommu/arm-smmu-v3.c | 30 -- > 1 file changed, 24 insertions(+), 6 deletions(-) > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > index f10c852479fc..7bbfa5f7ce8e 100644 > --- a/drivers/iommu/arm-smmu-v3.c > +++ b/drivers/iommu/arm-smmu-v3.c > @@ -612,6 +612,7 @@ struct arm_smmu_domain { > struct mutexinit_mutex; /* Protects smmu pointer */ > > struct io_pgtable_ops *pgtbl_ops; > + boolnon_strict; > > enum arm_smmu_domain_stage stage; > union { > @@ -1633,6 +1634,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain > *domain) > if (smmu->features & ARM_SMMU_FEAT_COHERENCY) > pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA; > > + if (smmu_domain->non_strict) > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; > + > pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain); > if (!pgtbl_ops) > return -ENOMEM; > @@ -1934,13 +1938,17 @@ static int arm_smmu_domain_get_attr(struct > iommu_domain *domain, > { > struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); > > - if (domain->type != IOMMU_DOMAIN_UNMANAGED) > - return -EINVAL; > - > switch (attr) { > case DOMAIN_ATTR_NESTING: > + if (domain->type != IOMMU_DOMAIN_UNMANAGED) > + return -EINVAL; > *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); > return 0; > + case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: > + if (domain->type != IOMMU_DOMAIN_DMA) > + return -EINVAL; > + *(int *)data = smmu_domain->non_strict; > + return 0; > default: > return -ENODEV; Hmm, there's a change in behaviour here (and also in the set function) which is that unknown attributes now return -ENODEV for managed domains instead of -EINVAL. I don't know if that's a problem, but I'd be inclined to switch on the domain type and then have a nested switch for the supported attributes. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
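For illustration, the nested-switch shape suggested above would look roughly like this (a sketch of the suggestion only, not the respun patch):

static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
                                    enum iommu_attr attr, void *data)
{
        struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

        switch (domain->type) {
        case IOMMU_DOMAIN_UNMANAGED:
                switch (attr) {
                case DOMAIN_ATTR_NESTING:
                        *(int *)data = (smmu_domain->stage ==
                                        ARM_SMMU_DOMAIN_NESTED);
                        return 0;
                default:
                        return -ENODEV;
                }
        case IOMMU_DOMAIN_DMA:
                switch (attr) {
                case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
                        *(int *)data = smmu_domain->non_strict;
                        return 0;
                default:
                        return -ENODEV;
                }
        default:
                return -EINVAL;
        }
}

This keeps -EINVAL for domain types that support no attributes at all, while unknown attributes on supported domain types still return -ENODEV.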
Re: [PATCH v7 4/6] iommu: Add bootup option "iommu.non_strict"
On Fri, Sep 14, 2018 at 03:30:22PM +0100, Robin Murphy wrote: > From: Zhen Lei > > Add a bootup option to make the system manager can choose which mode to > be used. The default mode is strict. > > Signed-off-by: Zhen Lei > [rm: move handling out of SMMUv3 driver] > Signed-off-by: Robin Murphy > --- > .../admin-guide/kernel-parameters.txt | 13 ++ > drivers/iommu/iommu.c | 26 +++ > 2 files changed, 39 insertions(+) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 9871e649ffef..406b91759b62 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1749,6 +1749,19 @@ > nobypass[PPC/POWERNV] > Disable IOMMU bypass, using IOMMU for PCI devices. > > + iommu.non_strict= [ARM64] > + Format: { "0" | "1" } > + 0 - strict mode, default. > + Release IOVAs after the related TLBs are invalid > + completely. > + 1 - non-strict mode. > + Put off TLBs invalidation and release memory first. > + It's good for scatter-gather performance but lacks > + full isolation, an untrusted device can access the > + reused memory because the TLBs may still valid. > + Please take full consideration before choosing this > + mode. Note that, VFIO will always use strict mode. This text needs help. How about something like: 0 - strict mode, default. Invalidate the TLB of the IOMMU hardware as part of every unmap() operation. 1 - lazy mode. Defer TLB invalidation so that the TLB of the IOMMU hardware is invalidated periodically, rather than as part of every unmap() operation. (generally, I think I'd s/non strict/lazy/ in this patch to avoid the double negatives) > + > iommu.passthrough= > [ARM64] Configure DMA to bypass the IOMMU by default. > Format: { "0" | "1" } > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 8c15c5980299..2cabd0c0a4f3 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -41,6 +41,7 @@ static unsigned int iommu_def_domain_type = > IOMMU_DOMAIN_IDENTITY; > #else > static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA; > #endif > +static bool iommu_dma_non_strict __read_mostly; > > struct iommu_callback_data { > const struct iommu_ops *ops; > @@ -131,6 +132,24 @@ static int __init iommu_set_def_domain_type(char *str) > } > early_param("iommu.passthrough", iommu_set_def_domain_type); > > +static int __init iommu_dma_setup(char *str) > +{ > + int ret; > + > + ret = kstrtobool(str, &iommu_dma_non_strict); > + if (ret) > + return ret; > + > + if (iommu_dma_non_strict) { > + pr_warn("WARNING: iommu non-strict mode is chosen.\n" > + "It's good for scatter-gather performance but lacks > full isolation\n"); Hmm, not sure about this message either and tainting is probably over the top. 
Maybe drop the taint and just pr_info something like "IOMMU DMA ops using lazy TLB invalidation: unable to protect against malicious devices" > + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); > + } > + > + return 0; > +} > +early_param("iommu.non_strict", iommu_dma_setup); > + > static ssize_t iommu_group_attr_show(struct kobject *kobj, >struct attribute *__attr, char *buf) > { > @@ -1072,6 +1091,13 @@ struct iommu_group *iommu_group_get_for_dev(struct > device *dev) > group->default_domain = dom; > if (!group->domain) > group->domain = dom; > + > + if (dom && iommu_dma_non_strict) { > + int attr = 1; > + iommu_domain_set_attr(dom, > + DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, > + &attr); > + } Hmm, I don't think we can guarantee that we're working with the DMA domain here. Does this all fall out in the wash for the identity domain? Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
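For context, the option under discussion is a kernel command-line parameter. An illustrative boot line for an arm64 system running this revision of the series (the parameter name follows this version of the patch and may change, e.g. to a "lazy" spelling, in a respin) would include:

        iommu.passthrough=0 iommu.non_strict=1

appended to the existing kernel arguments in the bootloader configuration.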
Re: [PATCH v7 2/6] iommu/dma: Add support for non-strict mode
Hi Robin, On Fri, Sep 14, 2018 at 03:30:20PM +0100, Robin Murphy wrote: > From: Zhen Lei > > 1. Save the related domain pointer in struct iommu_dma_cookie, make iovad >capable call domain->ops->flush_iotlb_all to flush TLB. > 2. During the iommu domain initialization phase, base on domain->non_strict >field to check whether non-strict mode is supported or not. If so, call >init_iova_flush_queue to register iovad->flush_cb callback. > 3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap >-->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to >put off iova freeing, and omit iommu_tlb_sync operation. Hmm, this is basically just a commentary on the code. Please could you write it more in terms of the problem that's being solved? > Signed-off-by: Zhen Lei > [rm: convert raw boolean to domain attribute] > Signed-off-by: Robin Murphy > --- > drivers/iommu/dma-iommu.c | 29 - > include/linux/iommu.h | 1 + > 2 files changed, 29 insertions(+), 1 deletion(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 511ff9a1d6d9..092e6926dc3c 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -55,6 +55,9 @@ struct iommu_dma_cookie { > }; > struct list_headmsi_page_list; > spinlock_t msi_lock; > + > + /* Only be assigned in non-strict mode, otherwise it's NULL */ > + struct iommu_domain *domain; > }; > > static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie) > @@ -257,6 +260,17 @@ static int iova_reserve_iommu_regions(struct device *dev, > return ret; > } > > +static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad) > +{ > + struct iommu_dma_cookie *cookie; > + struct iommu_domain *domain; > + > + cookie = container_of(iovad, struct iommu_dma_cookie, iovad); > + domain = cookie->domain; > + > + domain->ops->flush_iotlb_all(domain); Can we rely on this function pointer being non-NULL? I think it would be better to call iommu_flush_tlb_all(cookie->domain) instead. > +} > + > /** > * iommu_dma_init_domain - Initialise a DMA mapping domain > * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie() > @@ -275,6 +289,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, > dma_addr_t base, > struct iommu_dma_cookie *cookie = domain->iova_cookie; > struct iova_domain *iovad = &cookie->iovad; > unsigned long order, base_pfn, end_pfn; > + int attr = 1; Do we actually need to initialise this? Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
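For reference, the iommu_flush_tlb_all() helper suggested here is, as of this series, a wrapper that only invokes the callback when it is implemented, roughly (paraphrased from include/linux/iommu.h of that era):

static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
{
        if (domain->ops->flush_iotlb_all)
                domain->ops->flush_iotlb_all(domain);
}

The distinction matters for the discussion: the raw indirect call in the patch oopses if a driver advertises the flush-queue attribute without providing the callback, whereas the helper would silently do nothing.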
Re: [PATCH v7 0/6] Add non-strict mode support for iommu-dma
Hi Robin, Thanks for turning this around so quickly. On Fri, Sep 14, 2018 at 03:30:18PM +0100, Robin Murphy wrote: > Since we'd like to get this polished up and merged and Leizhen has other > commitments, here's v7 of the previous series[1] wherein I address all > my own feedback :) This is a quick tweak of the v6 I sent yesterday > since I figured out slightly too late a much neater way of setting the > attribute at the appropriate time. > > The principal change is that I've inverted things slightly such that > it's now a generic domain attribute controlled by iommu-dma given the > necessary support from individual IOMMU drivers. That way we can easily > enable other drivers straight away, as I've done for SMMUv2 here (which > also allowed me to give it a quick test with MMU-401s on a Juno board). > Otherwise it's really just cosmetic cleanup and rebasing onto Will's > pending SMMU queue. I've been through and had a look, leaving some small comments on the patches themselves. The only part I failed to figure out is how you tie the lifetime of the flush queue to the lifetime of the domain so that the timer callback can't fire after e.g. the DMA cookie has been freed. How does that work? Cheers, Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: move swiotlb noncoherent dma support from arm64 to generic code
On Tue, Sep 18, 2018 at 02:28:42PM +0100, Robin Murphy wrote: > On 17/09/18 16:38, Christoph Hellwig wrote: >> Hi all, >> >> this series starts with various swiotlb cleanups, then adds support for >> non-cache coherent devices to the generic swiotlb support, and finally >> switches arm64 to use the generic code. > > I think there's going to be an issue with the embedded folks' grubby hack > in arm64's mem_init() which skips initialising SWIOTLB at all with > sufficiently little DRAM. I've been waiting for > dma-direct-noncoherent-merge so that I could fix that case to swizzle in > dma_direct_ops and avoid swiotlb_dma_ops entirely. I wait for your review of dma-direct-noncoherent-merge to put it into dma-mapping for-next.. That being said one thing I'm investigating is to eventually further merge dma_direct_ops and swiotlb_ops - the reason for that beeing that I want to remove the indirect calls for the common direct mapping case, and if we don't merge them that will get complicated. Note that swiotlb will generally just work if you don't initialize the buffer as long as we never see a physical address large enough to cause bounce buffering. > >> Given that this series depends on patches in the dma-mapping tree, or >> pending for it I've also published a git tree here: >> >> git://git.infradead.org/users/hch/misc.git swiotlb-noncoherent > > However, upon sitting down to eagerly write that patch I've just > boot-tested the above branch as-is for a baseline and discovered a rather > more significant problem: arch_dma_alloc() is recursing back into > __swiotlb_alloc() and blowing the stack. Not good :( Oops, I messed up when renaming things. Try this patch on top: diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 83e597101c6a..c75c721eb74e 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -955,7 +955,7 @@ void *__swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, */ gfp |= __GFP_NOWARN; - vaddr = dma_direct_alloc(dev, size, dma_handle, gfp, attrs); + vaddr = dma_direct_alloc_pages(dev, size, dma_handle, gfp, attrs); if (!vaddr) vaddr = swiotlb_alloc_buffer(dev, size, dma_handle, attrs); return vaddr; @@ -973,7 +973,7 @@ void __swiotlb_free(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_addr, unsigned long attrs) { if (!swiotlb_free_buffer(dev, size, dma_addr)) - dma_direct_free(dev, size, vaddr, dma_addr, attrs); + dma_direct_free_pages(dev, size, vaddr, dma_addr, attrs); } static void swiotlb_free(struct device *dev, size_t size, void *vaddr, ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers
On 15/09/2018 03:36, Tian, Kevin wrote: >> 4) Userspace opens another mdev. >> -> iommu.c calls domain->ops->attach_dev(domain2, dev) > > another mdev in same VFIO container or different? I assume the > latter since you mentioned a new domain2. I was thinking a different VFIO container actually. I used domain2 to try to make the example clearer >> 1)? When the container is closed, VFIO calls >> iommu_detach_device(domain2, parent_dev) >> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev) >> Given that auxiliary domains are attached, the IOMMU driver could deduce >> that this actually means "detach an auxiliary domain". But which one? > > I didn't get this one. There is no need to stick to 1) behavior for > 4), i.e. below is expected: > domain2->ops->detach_dev(domain2, dev) But in order to get that, the IOMMU core needs to know that domain2 is auxiliary. Otherwise, detach_dev is never called when a default_domain is present for the parent dev. I guess one solution is to add an "auxiliary" attribute to iommu_domain, so __iommu_detach_group would do something like: diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 7113fe398b70..2b3e9b91aee7 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct iommu_domain *domain, { int ret; - if (!group->default_domain) { + if (!group->default_domain || domain->auxiliary) { __iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device); - group->domain = NULL; + if (!domain->auxiliary) + group->domain = NULL; return; } Not sure who would set this "auxiliary" attribute... Maybe the IOMMU driver, when attaching the domain to a device that has auxiliary mode enabled? > why cannot ARM implement a detach_dev for aux_domain too? My > feeling is that default domain twist is only for switch between 1/2/3 > in concept. If the core actually calls it, we can implement detach_dev :) The problem is that the core never calls detach_dev when default_domain is present (affects any IOMMU driver that relies on default_domain, including AMD), and even in case 4) the default_domain is present for the parent device >> So the proposed interface doesn't seem to work as is. If we want to use >> iommu_attach/detach_device for auxiliary domains, the existing behavior >> of iommu.c, and IOMMU drivers that rely on it, have to change. Any >> change I can think of right now seems more daunting than introducing a >> different path for auxiliary domains, like iommu_attach_aux_domain for >> example. >> > > introducing *aux* specific API will cause different VFIO code path to > handle RID-based and PASID-based mdev, since RID-based still needs > to use normal attach_domain that way. The PASID-based mdev still requires a special case to retrieve the allocated PASID and program it in the parent device, so VFIO will need to know the difference between the two Thanks, Jean > well, this argument is not very strong > in itself, if indeed proposed way doesn't work for ARM. But let's see > whether it is the case with more understanding of your actual concern. > > Thanks > Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device
On 14/09/2018 22:04, Jacob Pan wrote: >> This example only needs to modify first-level translation, and works >> with SMMUv3. The kernel here could be the host, in which case >> second-level translation is disabled in the SMMU, or it could be the >> guest, in which case second-level mappings are created by QEMU and >> first-level translation is managed by assigning PASID tables to the >> guest. > There is a difference in case of guest SVA. VT-d v3 will bind guest > PASID and guest CR3 instead of the guest PASID table. Then turn on > nesting. In case of mdev, the second level is obtained from the aux > domain which was setup for the default PASID. Or in case of PCI device, > second level is harvested from RID2PASID. Right, though I wasn't talking about the host managing guest SVA here, but a kernel binding the address space of one of its userspace drivers to the mdev. >> So (2) would use iommu_sva_bind_device(), > We would need something different than that for guest bind, just to show > the two cases:> > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int > *pasid, unsigned long flags, void *drvdata) > > (WIP) > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data) > where: > /** > * struct gpasid_bind_data - Information about device and guest PASID > binding > * @pasid: Process address space ID used for the guest mm > * @addr_width: Guest address width. Paging mode can also be derived. > * @gcr3: Guest CR3 value from guest mm > */ > struct gpasid_bind_data { > __u32 pasid; > __u64 gcr3; > __u32 addr_width; > __u32 flags; > #define IOMMU_SVA_GPASID_SRE BIT(0) /* supervisor request */ > }; > Perhaps there is room to merge with io_mm but the life cycle management > of guest PASID and host PASID will be different if you rely on mm > release callback than FD. I think gpasid management should stay separate from io_mm, since in your case VFIO mechanisms are used for life cycle management of the VM, similarly to the former bind_pasid_table proposal. For example closing the container fd would unbind all guest page tables. The QEMU process' address space lifetime seems like the wrong thing to track for gpasid. >> but (1) needs something >> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary >> domain to second-level or nested translation? It seems silly to use a >> different API for first-level, since the flow in userspace and VFIO >> is the same as your second-level case as far as MAP_DMA ioctl goes. >> The difference is that in your case the auxiliary domain supports an >> additional operation which binds first-level page tables. An >> auxiliary domain that only supports first-level wouldn't support this >> operation, but it can still implement iommu_map/unmap/etc. >> > I think the intention is that when a mdev is created, we don;t > know whether it will be used for SVA or IOVA. So aux domain is here to > "hold a spot" for the default PASID such that MAP_DMA calls can work as > usual, which is second level only. Later, if SVA is used on the mdev > there will be another PASID allocated for that purpose. > Do we need to create an aux domain for each PASID? the translation can > be looked up by the combination of parent dev and pasid. When allocating a new PASID for the guest, I suppose you need to clone the second-level translation config? In which case a single aux domain for the mdev might be easier to implement in the IOMMU driver. 
Entirely up to you since we don't have this case on SMMUv3. Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu: Fix a typo
This patch fixes a typo in iommu.c. Signed-off-by: Rami Rosen --- drivers/iommu/iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index f3698006cb53..022020144f33 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1416,7 +1416,7 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev) EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev); /* - * IOMMU groups are really the natrual working unit of the IOMMU, but + * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by * iterating over the devices in a group. Ideally we'd have a single * device which represents the requestor ID of the group, but we also -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC v2 20/20] iommu/smmuv3: Report non recoverable faults
When a stage 1 related fault event is read from the event queue, let's propagate it to potential external fault listeners, ie. users who registered a fault handler. Signed-off-by: Eric Auger --- drivers/iommu/arm-smmu-v3.c | 124 1 file changed, 113 insertions(+), 11 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index d9d300ab62a5..948fc82fc4ce 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -178,6 +178,26 @@ #define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8 #define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc +/* Events */ +#define ARM_SMMU_EVT_F_UUT 0x01 +#define ARM_SMMU_EVT_C_BAD_STREAMID0x02 +#define ARM_SMMU_EVT_F_STE_FETCH 0x03 +#define ARM_SMMU_EVT_C_BAD_STE 0x04 +#define ARM_SMMU_EVT_F_BAD_ATS_TREQ0x05 +#define ARM_SMMU_EVT_F_STREAM_DISABLED 0x06 +#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN0x07 +#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID 0x08 +#define ARM_SMMU_EVT_F_CD_FETCH0x09 +#define ARM_SMMU_EVT_C_BAD_CD 0x0a +#define ARM_SMMU_EVT_F_WALK_EABT 0x0b +#define ARM_SMMU_EVT_F_TRANSLATION 0x10 +#define ARM_SMMU_EVT_F_ADDR_SIZE 0x11 +#define ARM_SMMU_EVT_F_ACCESS 0x12 +#define ARM_SMMU_EVT_F_PERMISSION 0x13 +#define ARM_SMMU_EVT_F_TLB_CONFLICT0x20 +#define ARM_SMMU_EVT_F_CFG_CONFLICT0x21 +#define ARM_SMMU_EVT_E_PAGE_REQUEST0x24 + /* Common MSI config fields */ #define MSI_CFG0_ADDR_MASK GENMASK_ULL(51, 2) #define MSI_CFG2_SHGENMASK(5, 4) @@ -343,6 +363,11 @@ #define EVTQ_MAX_SZ_SHIFT 7 #define EVTQ_0_ID GENMASK_ULL(7, 0) +#define EVTQ_0_SUBSTREAMID GENMASK_ULL(31, 12) +#define EVTQ_0_STREAMIDGENMASK_ULL(63, 32) +#define EVTQ_1_S2 GENMASK_ULL(39, 39) +#define EVTQ_1_CLASS GENMASK_ULL(40, 41) +#define EVTQ_3_FETCH_ADDR GENMASK_ULL(51, 3) /* PRI queue */ #define PRIQ_ENT_DWORDS2 @@ -1250,7 +1275,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) return 0; } -__maybe_unused static struct arm_smmu_master_data * arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) { @@ -1276,24 +1300,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) return master; } +static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt) +{ + u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]); + u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]); + bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]); + u8 type = FIELD_GET(EVTQ_0_ID, evt[0]); + struct arm_smmu_master_data *master; + struct iommu_fault_event event; + bool propagate = true; + u64 addr = evt[2]; + int i; + + master = arm_smmu_find_master(smmu, sid); + if (WARN_ON(!master)) + return; + + event.fault.type = IOMMU_FAULT_DMA_UNRECOV; + + switch (type) { + case ARM_SMMU_EVT_C_BAD_STREAMID: + event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID; + break; + case ARM_SMMU_EVT_F_STREAM_DISABLED: + case ARM_SMMU_EVT_C_BAD_SUBSTREAMID: + event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID; + break; + case ARM_SMMU_EVT_F_CD_FETCH: + event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH; + break; + case ARM_SMMU_EVT_F_WALK_EABT: + event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT; + event.fault.addr = addr; + event.fault.fetch_addr = fetch_addr; + propagate = s1; + break; + case ARM_SMMU_EVT_F_TRANSLATION: + event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH; + event.fault.addr = addr; + event.fault.fetch_addr = fetch_addr; + propagate = s1; + break; + case ARM_SMMU_EVT_F_PERMISSION: + event.fault.reason = IOMMU_FAULT_REASON_PERMISSION; + event.fault.addr = addr; + propagate = s1; + break; + case ARM_SMMU_EVT_F_ACCESS: + event.fault.reason = IOMMU_FAULT_REASON_ACCESS; + event.fault.addr 
= addr; + propagate = s1; + break; + case ARM_SMMU_EVT_C_BAD_STE: + event.fault.reason = IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY; + break; + case ARM_SMMU_EVT_C_BAD_CD: + event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY; + break; + case ARM_SMMU_EVT_F_ADDR_SIZE: + event.fault.reason = IOMMU_FAULT_REASON_OOR_ADDRESS; + propagate = s1; + break; + case ARM_SMMU_EVT_F_STE_FETCH: + event.fault.reason = IOMMU_FAULT_REASON_DEVICE_CONTEX
[RFC v2 19/20] vfio: Document nested stage control
New ioctls were introduced to pass information about guest stage1 to the host through VFIO. Let's document the nested stage control. Signed-off-by: Eric Auger --- fault reporting is currently missing from the picture v1 -> v2: - use the new ioctl names - add doc related to fault handling --- Documentation/vfio.txt | 60 ++ 1 file changed, 60 insertions(+) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index f1a4d3c3ba0b..d4611759d306 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -239,6 +239,66 @@ group and can access them as follows:: /* Gratuitous device reset and go... */ ioctl(device, VFIO_DEVICE_RESET); +IOMMU Dual Stage Control + + +Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to +the ARM terminology while "level" corresponds to Intel's VT-d terminology. In +the following text we use either without distinction. + +This is useful when the guest is exposed to a virtual IOMMU and some +devices are assigned to the guest through VFIO. Then the guest OS can use +stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation +(GPA -> HPA). + +The guest gets ownership of the stage 1 page tables and also owns stage 1 +configuration structures. The hypervisor owns the root configuration structure +(for security reasons), including stage 2 configuration. This works as long as +the configuration structures and page table formats are compatible between the +virtual IOMMU and the physical IOMMU. + +Assuming the HW supports it, this nested mode is selected by choosing the +VFIO_TYPE1_NESTING_IOMMU type through: + +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU); + +This forces the hypervisor to use stage 2, leaving stage 1 available for +guest usage. + +Once groups are attached to the container, the guest stage 1 translation +configuration data can be passed to VFIO by using + +ioctl(container, VFIO_IOMMU_BIND_PASID_TABLE, &pasid_table_info); + +This combines the guest stage 1 configuration structure with the +hypervisor stage 2 configuration structure. Stage 1 configuration structures +are dependent on the IOMMU type. + +As the stage 1 translation is fully delegated to the HW, physical events that +may occur (especially translation faults) need to be propagated up to +the virtualizer and re-injected into the guest. + +By calling: ioctl(container, VFIO_IOMMU_SET_FAULT_EVENTFD, &fault_config), +the virtualizer can register an eventfd. The latter will be signalled +whenever an event is detected at the physical level. The fault handler, +executed at userspace level, has to call: +ioctl(container, VFIO_IOMMU_GET_FAULT_EVENTS, &fault_info) to retrieve +the pending fault events. + +When the guest invalidates stage 1 related caches, invalidations must be +forwarded to the host through +ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data); +Those invalidations can happen at various granularity levels: page, context, ... + +The ARM SMMU specification introduces another challenge: MSIs are translated by +both the virtual SMMU and the physical SMMU. To build a nested mapping for the +IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI +doorbell GPA binding to the host. Then the hypervisor can build a nested stage 2 +binding eventually translating into the physical MSI doorbell. + +This is achieved by +ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding); + VFIO User API --- -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
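To illustrate the flow documented above, here is a minimal userspace sketch of the whole sequence (illustrative only; each argument is the corresponding struct vfio_iommu_type1_* object from this series, with .argsz/.flags set and its payload filled in by the virtual IOMMU emulation):

/* 1. Select nested mode: the host keeps stage 2, stage 1 is left to the guest */
ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);

/* 2. Pass the guest stage 1 configuration (e.g. PASID table pointer) */
ioctl(container, VFIO_IOMMU_BIND_PASID_TABLE, &pasid_table_info);

/* 3. Register an eventfd so physical faults can be relayed to the guest */
ioctl(container, VFIO_IOMMU_SET_FAULT_EVENTFD, &fault_config);

/* 4. Forward each guest stage 1 invalidation trapped by the vIOMMU */
ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);

/* 5. ARM only: provide the guest IOVA/MSI doorbell GPA binding */
ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);

/* 6. Whenever the eventfd is signalled, drain the pending fault events */
ioctl(container, VFIO_IOMMU_GET_FAULT_EVENTS, &fault_info);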
[RFC v2 17/20] vfio: VFIO_IOMMU_SET_FAULT_EVENTFD
VFIO_IOMMU_SET_FAULT_EVENTFD Allows to associate a fault event handler to a container. This is useful when the guest owns the first translation stage and the host uses the second stage. As the guest translation is fully handled by the physical IOMMU, if any fault happens, this latter needs to be propagated to the guest. The userspace passes an eventfd and a queue size. Each time a fault occurs on physical IOMMU side, the fault is pushed to a kfifo and the eventfd is signalled. The userspace handler, awaken on eventfd signalling then reads the populated fifo and can inject the faults to the virtual IOMMU. Signed-off-by: Lan Tianyu Signed-off-by: Eric Auger --- original API proposed by Lan in [RFC PATCH 0/3] VFIO: Report IOMMU fault event to userspace (https://www.spinics.net/lists/kvm/msg145280.html) A current issue with the current implementation of iommu_register_device_fault_handler() is that is expects responses to void the list of pending faults. This looks overkill for unrecoverable as responses are not mandated. --- drivers/vfio/vfio_iommu_type1.c | 150 ++-- include/uapi/linux/vfio.h | 16 2 files changed, 157 insertions(+), 9 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 2ffd46c26dc3..e52cbeb479c3 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -41,6 +41,8 @@ #include #include #include +#include +#include #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " @@ -66,6 +68,9 @@ struct vfio_iommu { struct blocking_notifier_head notifier; boolv2; boolnesting; + struct eventfd_ctx *fault_ctx; + DECLARE_KFIFO_PTR(fault_fifo, struct iommu_fault); + struct mutexfault_lock; }; struct vfio_domain { @@ -114,6 +119,7 @@ struct vfio_regions { (!list_empty(&iommu->domain_list)) static int put_pfn(unsigned long pfn, int prot); +static int vfio_iommu_teardown_fault_eventfd(struct vfio_iommu *iommu); /* * This code handles mapping and unmapping of user data buffers @@ -1527,15 +1533,26 @@ static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu) WARN_ON(iommu->notifier.head); } +static inline int +vfio_unregister_fault_handler_fn(struct device *dev, void *data) +{ + return iommu_unregister_device_fault_handler(dev); +} + static void vfio_iommu_type1_detach_group(void *iommu_data, struct iommu_group *iommu_group) { struct vfio_iommu *iommu = iommu_data; struct vfio_domain *domain; struct vfio_group *group; + int ret; mutex_lock(&iommu->lock); + ret = iommu_group_for_each_dev(iommu_group, NULL, + vfio_unregister_fault_handler_fn); + WARN_ON(ret); + if (iommu->external_domain) { group = find_iommu_group(iommu->external_domain, iommu_group); if (group) { @@ -1613,6 +1630,7 @@ static void *vfio_iommu_type1_open(unsigned long arg) INIT_LIST_HEAD(&iommu->domain_list); iommu->dma_list = RB_ROOT; mutex_init(&iommu->lock); + mutex_init(&iommu->fault_lock); BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier); return iommu; @@ -1682,25 +1700,22 @@ static int vfio_cache_inv_fn(struct device *dev, void *data) return iommu_cache_invalidate(d, dev, &ustruct->info); } -static int -vfio_cache_invalidate(struct vfio_iommu *iommu, - struct vfio_iommu_type1_cache_invalidate *ustruct) +/* iommu->lock must be held */ +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu, void *data, + int (*fn)(struct device *, void *)) { struct vfio_domain *d; struct vfio_group *g; int ret = 0; - mutex_lock(&iommu->lock); - list_for_each_entry(d, &iommu->domain_list, next) { list_for_each_entry(g, &d->group_list, next) { - ret 
= iommu_group_for_each_dev(g->iommu_group, ustruct, - vfio_cache_inv_fn); + ret = iommu_group_for_each_dev(g->iommu_group, + data, fn); if (ret) break; } } - mutex_unlock(&iommu->lock); return ret; } @@ -1740,6 +1755,102 @@ vfio_bind_pasid_table(struct vfio_iommu *iommu, return ret; } +static int +vfio_iommu_fault_handler(struct iommu_fault_event *event, void *data) +{ + struct vfio_iommu *iommu = (struct vfio_iommu *)data; + int ret; + + mutex_lock(&iommu->fault_lock); + if (!iommu->fault_ctx) + goto out; + ret = kfifo_put(&iommu->fault_fifo, event->fault
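As a usage illustration of the above, a userspace fragment that creates the eventfd, registers it with the container and waits for faults (a sketch only: the exact layout of struct vfio_iommu_type1_set_fault_eventfd is not visible in this excerpt, so the eventfd/qsize member names below are assumed from the commit message):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <poll.h>

int efd = eventfd(0, EFD_CLOEXEC);

struct vfio_iommu_type1_set_fault_eventfd cfg = {
	.argsz = sizeof(cfg),
	/* .eventfd = efd, .qsize = 64,  <- assumed field names, see above */
};
ioctl(container, VFIO_IOMMU_SET_FAULT_EVENTFD, &cfg);

/* later, in the fault handling thread */
struct pollfd pfd = { .fd = efd, .events = POLLIN };
poll(&pfd, 1, -1);	/* the kernel signals the eventfd when faults are queued */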
[RFC v2 18/20] vfio: VFIO_IOMMU_GET_FAULT_EVENTS
Introduce an ioctl that allows the userspace to consume fault events that may have occurred. The userspace buffer gets filled with pending faults, if any. The number of filled faults is reported to the userspace. Read faults are removed from the kfifo. The kernel does not expect any response from the userspace. Signed-off-by: Lan Tianyu Signed-off-by: Eric Auger --- the fault_lock may not be needed as kfifo documentation says: "Note that with only one concurrent reader and one concurrent writer, you don't need extra locking to use these macro." --- drivers/vfio/vfio_iommu_type1.c | 33 - include/uapi/linux/vfio.h | 10 ++ 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index e52cbeb479c3..917bb8f9c9ae 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1769,7 +1769,7 @@ vfio_iommu_fault_handler(struct iommu_fault_event *event, void *data) eventfd_signal(iommu->fault_ctx, 1); out: mutex_unlock(&iommu->fault_lock); - return 0; + return ret; } static inline int @@ -1981,6 +1981,37 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return -EINVAL; return vfio_iommu_set_fault_eventfd(iommu, &ustruct); + } else if (cmd == VFIO_IOMMU_GET_FAULT_EVENTS) { + struct vfio_iommu_type1_get_fault_events ustruct; + size_t buf_size, len, elem_size; + int copied, max_events, ret; + + minsz = offsetofend(struct vfio_iommu_type1_get_fault_events, + reserved); + + if (copy_from_user(&ustruct, (void __user *)arg, minsz)) + return -EFAULT; + + if (ustruct.argsz < minsz || ustruct.flags) + return -EINVAL; + + elem_size = sizeof(struct iommu_fault); + buf_size = ustruct.argsz - minsz; + max_events = buf_size / elem_size; + len = max_events * elem_size; + + mutex_lock(&iommu->fault_lock); + + ret = kfifo_to_user(&iommu->fault_fifo, + (void __user *)(arg + minsz), len, &copied); + + mutex_unlock(&iommu->fault_lock); + if (ret) + return ret; + + ustruct.count = copied / elem_size; + + return copy_to_user((void __user *)arg, &ustruct, minsz); } return -ENOTTY; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 0d62598c818a..5b9165b7db8d 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -703,6 +703,16 @@ struct vfio_iommu_type1_set_fault_eventfd { }; #define VFIO_IOMMU_SET_FAULT_EVENTFD _IO(VFIO_TYPE, VFIO_BASE + 25) +struct vfio_iommu_type1_get_fault_events { + __u32 argsz; + __u32 flags; + __u32 count; /* number of faults returned */ + __u32 reserved; + struct iommu_fault events[]; +}; + +#define VFIO_IOMMU_GET_FAULT_EVENTS _IO(VFIO_TYPE, VFIO_BASE + 26) + /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */ /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
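A matching userspace fragment that drains the kfifo once the eventfd fires (illustrative; error handling omitted). The kernel copies at most (argsz - header size) / sizeof(struct iommu_fault) events after the header and reports the number actually copied in .count:

#define MAX_FAULT_EVENTS 32

size_t sz = sizeof(struct vfio_iommu_type1_get_fault_events) +
	    MAX_FAULT_EVENTS * sizeof(struct iommu_fault);
struct vfio_iommu_type1_get_fault_events *evts = calloc(1, sz);

evts->argsz = sz;
evts->flags = 0;

if (!ioctl(container, VFIO_IOMMU_GET_FAULT_EVENTS, evts)) {
	unsigned int i;

	for (i = 0; i < evts->count; i++)
		inject_fault_into_viommu(&evts->events[i]); /* hypothetical helper */
}
free(evts);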
[RFC v2 14/20] iommu: introduce device fault data
From: Jacob Pan Device faults detected by IOMMU can be reported outside IOMMU subsystem for further processing. This patch intends to provide a generic device fault data such that device drivers can be communicated with IOMMU faults without model specific knowledge. The proposed format is the result of discussion at: https://lkml.org/lkml/2017/11/10/291 Part of the code is based on Jean-Philippe Brucker's patchset (https://patchwork.kernel.org/patch/9989315/). The assumption is that model specific IOMMU driver can filter and handle most of the internal faults if the cause is within IOMMU driver control. Therefore, the fault reasons can be reported are grouped and generalized based common specifications such as PCI ATS. Signed-off-by: Jacob Pan Signed-off-by: Jean-Philippe Brucker Signed-off-by: Liu, Yi L Signed-off-by: Ashok Raj Signed-off-by: Eric Auger [moved part of the iommu_fault_event struct in the uapi, enriched the fault reasons to be able to map unrecoverable SMMUv3 errors] --- include/linux/iommu.h | 55 - include/uapi/linux/iommu.h | 83 ++ 2 files changed, 136 insertions(+), 2 deletions(-) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 9bd3e63d562b..7529c14ff506 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -49,13 +49,17 @@ struct bus_type; struct device; struct iommu_domain; struct notifier_block; +struct iommu_fault_event; /* iommu fault flags */ -#define IOMMU_FAULT_READ 0x0 -#define IOMMU_FAULT_WRITE 0x1 +#define IOMMU_FAULT_READ (1 << 0) +#define IOMMU_FAULT_WRITE (1 << 1) +#define IOMMU_FAULT_EXEC (1 << 2) +#define IOMMU_FAULT_PRIV (1 << 3) typedef int (*iommu_fault_handler_t)(struct iommu_domain *, struct device *, unsigned long, int, void *); +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *); struct iommu_domain_geometry { dma_addr_t aperture_start; /* First address that can be mapped*/ @@ -262,6 +266,52 @@ struct iommu_device { struct device *dev; }; +/** + * struct iommu_fault_event - Generic per device fault data + * + * - PCI and non-PCI devices + * - Recoverable faults (e.g. page request), information based on PCI ATS + * and PASID spec. + * - Un-recoverable faults of device interest + * - DMA remapping and IRQ remapping faults + * + * @fault: fault descriptor + * @device_private: if present, uniquely identify device-specific + * private data for an individual page request. + * @iommu_private: used by the IOMMU driver for storing fault-specific + * data. Users should not modify this field before + * sending the fault response. + */ +struct iommu_fault_event { + struct iommu_fault fault; + u64 device_private; + u64 iommu_private; +}; + +/** + * struct iommu_fault_param - per-device IOMMU fault data + * @dev_fault_handler: Callback function to handle IOMMU faults at device level + * @data: handler private data + * + */ +struct iommu_fault_param { + iommu_dev_fault_handler_t handler; + void *data; +}; + +/** + * struct iommu_param - collection of per-device IOMMU data + * + * @fault_param: IOMMU detected device fault reporting data + * + * TODO: migrate other per device data pointers under iommu_dev_data, e.g. 
+ * struct iommu_group *iommu_group; + * struct iommu_fwspec *iommu_fwspec; + */ +struct iommu_param { + struct iommu_fault_param *fault_param; +}; + int iommu_device_register(struct iommu_device *iommu); void iommu_device_unregister(struct iommu_device *iommu); int iommu_device_sysfs_add(struct iommu_device *iommu, @@ -429,6 +479,7 @@ struct iommu_ops {}; struct iommu_group {}; struct iommu_fwspec {}; struct iommu_device {}; +struct iommu_fault_param {}; static inline bool iommu_present(struct bus_type *bus) { diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index 21adb2a964e5..a0fe5c2fb236 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -150,5 +150,88 @@ struct iommu_guest_msi_binding { __u64 gpa; __u32 granule; }; + +/* Generic fault types, can be expanded IRQ remapping fault */ +enum iommu_fault_type { + IOMMU_FAULT_DMA_UNRECOV = 1,/* unrecoverable fault */ + IOMMU_FAULT_PAGE_REQ, /* page request fault */ +}; + +enum iommu_fault_reason { + IOMMU_FAULT_REASON_UNKNOWN = 0, + + /* IOMMU internal error, no specific reason to report out */ + IOMMU_FAULT_REASON_INTERNAL, + + /* Could not access the PASID table (fetch caused external abort) */ + IOMMU_FAULT_REASON_PASID_FETCH, + + /* could not access the device context (fetch caused external abort) */ + IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH, + + /* pasid entry is invalid or has configuration error
[RFC v2 16/20] iommu: introduce device fault report API
From: Jacob Pan Traditionally, device specific faults are detected and handled within their own device drivers. When an IOMMU is enabled, faults caused by DMA related transactions are detected by the IOMMU. There is no generic reporting mechanism to report faults back to the in-kernel device driver or the guest OS in case of assigned devices. Faults detected by the IOMMU are based on the transaction's source ID, which can be reported on a per-device basis, regardless of whether the device is a PCI device or not. The fault types include recoverable (e.g. page request) and unrecoverable faults (e.g. access error). In most cases, faults can be handled by IOMMU drivers internally. The primary use cases are as follows: 1. page request fault originated from an SVM capable device that is assigned to a guest via a vIOMMU. In this case, the first level page tables are owned by the guest. Page requests must be propagated to the guest to let the guest OS fault in the pages and then send a page response. In this mechanism, the direct receiver of IOMMU fault notification is VFIO, which can relay notification events to QEMU or other user space software. 2. faults that need more subtle handling by device drivers. Rather than simply invoking a reset function, there is a need to let the device driver handle the fault with a smaller impact. This patchset is intended to create a generic fault report API such that it can scale as follows: - all IOMMU types - PCI and non-PCI devices - recoverable and unrecoverable faults - VFIO and other in-kernel users - DMA & IRQ remapping (TBD) The original idea was brought up by David Woodhouse and discussions summarized at https://lwn.net/Articles/608914/. Signed-off-by: Jacob Pan Signed-off-by: Ashok Raj Signed-off-by: Jean-Philippe Brucker Signed-off-by: Eric Auger [adapt to new iommu_fault fault field, test fault_param on iommu_unregister_device_fault_handler] --- drivers/iommu/iommu.c | 153 +- include/linux/iommu.h | 33 - 2 files changed, 184 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index b2b8f375b169..dd4f3677a9ac 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -618,6 +618,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) goto err_free_name; } + dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL); + if (!dev->iommu_param) { + ret = -ENOMEM; + goto err_free_name; + } + mutex_init(&dev->iommu_param->lock); + kobject_get(group->devices_kobj); dev->iommu_group = group; @@ -648,6 +655,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) mutex_unlock(&group->mutex); dev->iommu_group = NULL; kobject_put(group->devices_kobj); + kfree(dev->iommu_param); err_free_name: kfree(device->name); err_remove_link: @@ -694,7 +702,7 @@ void iommu_group_remove_device(struct device *dev) sysfs_remove_link(&dev->kobj, "iommu_group"); trace_remove_device_from_group(group->id, dev); - + kfree(dev->iommu_param); kfree(device->name); kfree(device); dev->iommu_group = NULL; @@ -828,6 +836,149 @@ int iommu_group_unregister_notifier(struct iommu_group *group, } EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); +/** + * iommu_register_device_fault_handler() - Register a device fault handler + * @dev: the device + * @handler: the fault handler + * @data: private data passed as argument to the handler + * + * When an IOMMU fault event is received, call this handler with the fault event + * and data as argument. The handler should return 0 on success. 
If the fault is + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete + * the fault by calling iommu_page_response() with one of the following + * response code: + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation + * - IOMMU_PAGE_RESP_INVALID: terminate the fault + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting + * page faults if possible. + * + * Return 0 if the fault handler was installed successfully, or an error. + */ +int iommu_register_device_fault_handler(struct device *dev, + iommu_dev_fault_handler_t handler, + void *data) +{ + struct iommu_param *param = dev->iommu_param; + int ret = 0; + + /* +* Device iommu_param should have been allocated when device is +* added to its iommu_group. +*/ + if (!param) + return -EINVAL; + + mutex_lock(¶m->lock); + /* Only allow one fault handler registered for each device */ + if (param->fault_param) { + ret = -EBUSY; + goto done_unlock; + } + + get_device(dev); + param->fault_param = + kzalloc(sizeof(struct iommu_fault_param), G
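To show how a consumer would plug into this API, a sketch of an in-kernel driver registering a per-device handler (illustrative only; the mydrv_* names are made up). The handler receives both unrecoverable events and page requests; only the latter are expected to be completed with iommu_page_response():

static int mydrv_iommu_fault_handler(struct iommu_fault_event *evt, void *data)
{
	struct mydrv *drv = data;	/* private data passed at registration */

	if (evt->fault.type == IOMMU_FAULT_DMA_UNRECOV) {
		dev_err(drv->dev, "unrecoverable IOMMU fault, reason %d\n",
			evt->fault.reason);
		/* e.g. stop the queues, schedule a function reset, ... */
		return 0;
	}

	/* IOMMU_FAULT_PAGE_REQ: queue it and reply via iommu_page_response() */
	return mydrv_queue_page_request(drv, evt);
}

static int mydrv_enable_faults(struct device *dev, struct mydrv *drv)
{
	return iommu_register_device_fault_handler(dev,
						   mydrv_iommu_fault_handler,
						   drv);
}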
[RFC v2 15/20] driver core: add per device iommu param
From: Jacob Pan DMA faults can be detected by IOMMU at device level. Adding a pointer to struct device allows IOMMU subsystem to report relevant faults back to the device driver for further handling. For direct assigned device (or user space drivers), guest OS holds responsibility to handle and respond per device IOMMU fault. Therefore we need fault reporting mechanism to propagate faults beyond IOMMU subsystem. There are two other IOMMU data pointers under struct device today, here we introduce iommu_param as a parent pointer such that all device IOMMU data can be consolidated here. The idea was suggested here by Greg KH and Joerg. The name iommu_param is chosen here since iommu_data has been used. Suggested-by: Greg Kroah-Hartman Reviewed-by: Greg Kroah-Hartman Signed-off-by: Jacob Pan Link: https://lkml.org/lkml/2017/10/6/81 --- include/linux/device.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/device.h b/include/linux/device.h index 8f882549edee..2daf4a94bd31 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -42,6 +42,7 @@ struct iommu_ops; struct iommu_group; struct iommu_fwspec; struct dev_pin_info; +struct iommu_param; struct bus_attribute { struct attributeattr; @@ -922,6 +923,7 @@ struct dev_links_info { * device (i.e. the bus driver that discovered the device). * @iommu_group: IOMMU group the device belongs to. * @iommu_fwspec: IOMMU-specific properties supplied by firmware. + * @iommu_param: Per device generic IOMMU runtime data * * @offline_disabled: If set, the device is permanently online. * @offline: Set after successful invocation of bus type's .offline(). @@ -1012,6 +1014,7 @@ struct device { void(*release)(struct device *dev); struct iommu_group *iommu_group; struct iommu_fwspec *iommu_fwspec; + struct iommu_param *iommu_param; booloffline_disabled:1; booloffline:1; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
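A small sketch of how core code can then reach the per-device fault data through this new pointer (illustrative; the lock and fault_param members come from the other patches of this series):

static struct iommu_fault_param *dev_fault_param(struct device *dev)
{
	struct iommu_param *param = dev->iommu_param;

	if (!param)
		return NULL;	/* device not (yet) added to an IOMMU group */

	/* param->lock serializes fault handler (un)registration */
	return param->fault_param;
}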
[RFC v2 13/20] iommu/smmuv3: Implement bind_guest_msi
The bind_guest_msi() callback checks the domain is NESTED and redirect to the dma-iommu implementation. Signed-off-by: Eric Auger --- drivers/iommu/arm-smmu-v3.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4054da756a41..d9d300ab62a5 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2168,6 +2168,27 @@ static void arm_smmu_put_resv_regions(struct device *dev, kfree(entry); } +static int arm_smmu_bind_guest_msi(struct iommu_domain *domain, + struct iommu_guest_msi_binding *binding) +{ + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + struct arm_smmu_device *smmu; + int ret = -EINVAL; + + mutex_lock(&smmu_domain->init_mutex); + smmu = smmu_domain->smmu; + if (!smmu) + goto out; + + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + goto out; + + ret = iommu_dma_bind_doorbell(domain, binding); +out: + mutex_unlock(&smmu_domain->init_mutex); + return ret; +} + static int arm_smmu_bind_pasid_table(struct iommu_domain *domain, struct iommu_pasid_table_config *cfg) { @@ -2311,6 +2332,7 @@ static struct iommu_ops arm_smmu_ops = { .bind_pasid_table = arm_smmu_bind_pasid_table, .unbind_pasid_table = arm_smmu_unbind_pasid_table, .cache_invalidate = arm_smmu_cache_invalidate, + .bind_guest_msi = arm_smmu_bind_guest_msi, .pgsize_bitmap = -1UL, /* Restricted during device attach */ }; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie
Up to now, when the type was UNMANAGED, we used to allocate IOVA pages within a range provided by the user. This does not work in nested mode. If both the host and the guest are exposed with SMMUs, each would allocate an IOVA. The guest allocates an IOVA (gIOVA) to map onto the guest MSI doorbell (gDB). The Host allocates another IOVA (hIOVA) to map onto the physical doorbell (hDB). So we end up with 2 unrelated mappings, at S1 and S2: S1 S2 gIOVA-> gDB hIOVA->hDB The PCI device would be programmed with hIOVA. iommu_dma_bind_doorbell allows to pass gIOVA/gDB to the host so that gIOVA can be used by the host instead of re-allocating a new IOVA. That way the host can create the following nested mapping: S1 S2 gIOVA->gDB->hDB this time, the PCI device will be programmed with the gIOVA MSI doorbell which is correctly map through the 2 stages. Signed-off-by: Eric Auger --- v1 -> v2: - unmap stage2 on put() --- drivers/iommu/dma-iommu.c | 97 +-- include/linux/dma-iommu.h | 11 + 2 files changed, 105 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 511ff9a1d6d9..53444c3e8f2f 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -37,12 +37,14 @@ struct iommu_dma_msi_page { struct list_headlist; dma_addr_t iova; + dma_addr_t ipa; phys_addr_t phys; }; enum iommu_dma_cookie_type { IOMMU_DMA_IOVA_COOKIE, IOMMU_DMA_MSI_COOKIE, + IOMMU_DMA_NESTED_MSI_COOKIE, }; struct iommu_dma_cookie { @@ -109,14 +111,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie); * * Users who manage their own IOVA allocation and do not want DMA API support, * but would still like to take advantage of automatic MSI remapping, can use - * this to initialise their own domain appropriately. Users should reserve a + * this to initialise their own domain appropriately. Users may reserve a * contiguous IOVA region, starting at @base, large enough to accommodate the * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address - * used by the devices attached to @domain. + * used by the devices attached to @domain. 
The other way round is to provide + * usable iova pages through the iommu_dma_bind_doorbell API (nested stages + * use case) */ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) { struct iommu_dma_cookie *cookie; + int nesting, ret; if (domain->type != IOMMU_DOMAIN_UNMANAGED) return -EINVAL; @@ -124,7 +129,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) if (domain->iova_cookie) return -EEXIST; - cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE); + ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting); + if (!ret && nesting) + cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE); + else + cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE); + if (!cookie) return -ENOMEM; @@ -145,6 +155,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) { struct iommu_dma_cookie *cookie = domain->iova_cookie; struct iommu_dma_msi_page *msi, *tmp; + bool s2_unmap = false; if (!cookie) return; @@ -152,7 +163,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule) put_iova_domain(&cookie->iovad); + if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE) + s2_unmap = true; + list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) { + if (s2_unmap && msi->phys) { + size_t size = cookie_msi_granule(cookie); + + WARN_ON(iommu_unmap(domain, msi->ipa, size) != size); + } list_del(&msi->list); kfree(msi); } @@ -161,6 +180,50 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) } EXPORT_SYMBOL(iommu_put_dma_cookie); +/** + * iommu_dma_bind_doorbell - Allows to provide a usable IOVA page + * @domain: domain handle + * @binding: IOVA/IPA binding + * + * In nested stage use case, the user can provide IOVA/IPA bindings + * corresponding to a guest MSI stage 1 mapping. When the host needs + * to map its own MSI doorbells, it can use the IPA as stage 2 input + * and map it onto the physical MSI doorbell. + */ +int iommu_dma_bind_doorbell(struct iommu_domain *domain, + struct iommu_guest_msi_binding *binding) +{ + struct iommu_dma_cookie *cookie = domain->iova_cookie; + struct iommu_dma_msi_page *msi; + dma_addr_t ipa, iova; + size_t size; + + if (!cookie) + return -EINVAL; + +
[RFC v2 11/20] iommu/smmuv3: Implement cache_invalidate
Implement IOMMU_INV_TYPE_TLB invalidations. When nr_pages is null we interpret this as a context invalidation. Signed-off-by: Eric Auger --- The user API needs to be refined to discriminate context invalidations from NH_VA invalidations. Also the leaf attribute is not yet properly handled. v1 -> v2: - properly pass the asid --- drivers/iommu/arm-smmu-v3.c | 40 + 1 file changed, 40 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index c7e05c0b93e1..4054da756a41 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2251,6 +2251,45 @@ static void arm_smmu_unbind_pasid_table(struct iommu_domain *domain) mutex_unlock(&smmu_domain->init_mutex); } +static int +arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev, + struct iommu_cache_invalidate_info *inv_info) +{ + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + struct arm_smmu_device *smmu = smmu_domain->smmu; + + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + return -EINVAL; + + if (!smmu) + return -EINVAL; + + switch (inv_info->hdr.type) { + case IOMMU_INV_TYPE_TLB: + /* +* TODO: On context invalidation, the userspace sets nr_pages +* to 0. Refine the API to add a dedicated flags and also +* properly handle the leaf parameter. +*/ + if (!inv_info->nr_pages) { + smmu_domain->s1_cfg.cd.asid = inv_info->arch_id; + arm_smmu_tlb_inv_context(smmu_domain); + } else { + size_t granule = 1 << (inv_info->size + 12); + size_t size = inv_info->nr_pages * granule; + + smmu_domain->s1_cfg.cd.asid = inv_info->arch_id; + arm_smmu_tlb_inv_range_nosync(inv_info->addr, size, + granule, false, + smmu_domain); + __arm_smmu_tlb_sync(smmu); + } + return 0; + default: + return -EINVAL; + } +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2271,6 +2310,7 @@ static struct iommu_ops arm_smmu_ops = { .put_resv_regions = arm_smmu_put_resv_regions, .bind_pasid_table = arm_smmu_bind_pasid_table, .unbind_pasid_table = arm_smmu_unbind_pasid_table, + .cache_invalidate = arm_smmu_cache_invalidate, .pgsize_bitmap = -1UL, /* Restricted during device attach */ }; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
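For reference, a sketch of the in-kernel caller side that this implementation expects, based on how the fields are consumed above (illustrative; other fields of iommu_cache_invalidate_info, if any, are not visible in this excerpt, and guest_asid/iova/npages are placeholders):

struct iommu_cache_invalidate_info inv = {};

/* whole-context (ASID) invalidation: nr_pages == 0 */
inv.hdr.type = IOMMU_INV_TYPE_TLB;
inv.arch_id  = guest_asid;	/* ASID the guest programmed */
inv.nr_pages = 0;
iommu_cache_invalidate(domain, dev, &inv);

/* range invalidation: nr_pages pages of (1 << (size + 12)) bytes at addr */
inv.addr     = iova;
inv.size     = 0;		/* 4K granule */
inv.nr_pages = npages;
iommu_cache_invalidate(domain, dev, &inv);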
[RFC v2 10/20] iommu/smmuv3: Implement bind_pasid_table
On bind_pasid_table() we program STE S1 related info set by the guest into the actual physical STEs. At minimum we need to program the context descriptor GPA and compute whether the guest wanted to bypass the stage 1 or induce aborts for this STE. On unbind, the STE stage 1 fields are reset. Signed-off-by: Eric Auger --- v1 -> v2: - invalidate the STE before changing them - hold init_mutex - handle new fields --- drivers/iommu/arm-smmu-v3.c | 85 + 1 file changed, 85 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 9749c36208f3..c7e05c0b93e1 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2168,6 +2168,89 @@ static void arm_smmu_put_resv_regions(struct device *dev, kfree(entry); } +static int arm_smmu_bind_pasid_table(struct iommu_domain *domain, +struct iommu_pasid_table_config *cfg) +{ + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + struct arm_smmu_master_data *entry; + struct arm_smmu_s1_cfg *s1_cfg; + struct arm_smmu_device *smmu; + unsigned long flags; + int ret = -EINVAL; + + if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3) + return -EINVAL; + + mutex_lock(&smmu_domain->init_mutex); + + smmu = smmu_domain->smmu; + + if (!smmu) + goto out; + + if (!((smmu->features & ARM_SMMU_FEAT_TRANS_S1) && + (smmu->features & ARM_SMMU_FEAT_TRANS_S2))) { + dev_info(smmu_domain->smmu->dev, +"does not implement two stages\n"); + goto out; + } + + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + goto out; + + /* we currently support a single CD. S1DSS and S1FMT are ignored */ + if (cfg->smmuv3.s1cdmax) + goto out; + + s1_cfg = &smmu_domain->s1_cfg; + s1_cfg->nested_bypass = cfg->smmuv3.bypass; + s1_cfg->nested_abort = cfg->smmuv3.abort; + s1_cfg->cdptr_dma = cfg->smmuv3.s1contextptr; + + spin_lock_irqsave(&smmu_domain->devices_lock, flags); + list_for_each_entry(entry, &smmu_domain->devices, list) { + entry->ste.s1_cfg = &smmu_domain->s1_cfg; + entry->ste.nested = true; + arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec); + } + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); + ret = 0; +out: + mutex_unlock(&smmu_domain->init_mutex); + return ret; +} + +static void arm_smmu_unbind_pasid_table(struct iommu_domain *domain) +{ + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + struct arm_smmu_device *smmu = smmu_domain->smmu; + struct arm_smmu_master_data *entry; + struct arm_smmu_s1_cfg *s1_cfg; + unsigned long flags; + + if (!smmu) + return; + + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) + return; + + mutex_lock(&smmu_domain->init_mutex); + + s1_cfg = &smmu_domain->s1_cfg; + + spin_lock_irqsave(&smmu_domain->devices_lock, flags); + list_for_each_entry(entry, &smmu_domain->devices, list) { + entry->ste.s1_cfg = NULL; + entry->ste.nested = false; + arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec); + } + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); + + s1_cfg->nested_abort = false; + s1_cfg->nested_bypass = false; + mutex_unlock(&smmu_domain->init_mutex); +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2186,6 +2269,8 @@ static struct iommu_ops arm_smmu_ops = { .of_xlate = arm_smmu_of_xlate, .get_resv_regions = arm_smmu_get_resv_regions, .put_resv_regions = arm_smmu_put_resv_regions, + .bind_pasid_table = arm_smmu_bind_pasid_table, + .unbind_pasid_table = arm_smmu_unbind_pasid_table, .pgsize_bitmap = -1UL, /* Restricted during device attach */ }; -- 2.17.1 ___ iommu mailing list 
iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
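For reference, a sketch of the configuration this callback consumes, as a VMM/VFIO caller would build it from the guest STE stage 1 fields (illustrative only; guest_cd_gpa is a placeholder for the guest's S1ContextPtr, an IPA that stage 2 already maps):

struct iommu_pasid_table_config cfg = {
	.format = IOMMU_PASID_FORMAT_SMMUV3,
	.smmuv3 = {
		.s1contextptr = guest_cd_gpa,	/* guest CD table pointer */
		.s1cdmax      = 0,		/* single CD only, per the check above */
		.bypass       = false,		/* guest STE does not bypass stage 1 */
		.abort        = false,		/* guest STE does not force abort */
	},
};

ret = iommu_bind_pasid_table(domain, &cfg);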
[RFC v2 06/20] vfio: VFIO_IOMMU_BIND_MSI
This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at passing the guest MSI binding to the host. Signed-off-by: Eric Auger --- v1 -> v2: - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi --- drivers/vfio/vfio_iommu_type1.c | 31 +++ include/uapi/linux/vfio.h | 7 +++ 2 files changed, 38 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 14529233f774..2ffd46c26dc3 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1704,6 +1704,24 @@ vfio_cache_invalidate(struct vfio_iommu *iommu, return ret; } +static int +vfio_iommu_bind_guest_msi(struct vfio_iommu *iommu, + struct vfio_iommu_type1_bind_guest_msi *ustruct) +{ + struct vfio_domain *d; + int ret; + + mutex_lock(&iommu->lock); + + list_for_each_entry(d, &iommu->domain_list, next) { + ret = iommu_bind_guest_msi(d->domain, &ustruct->binding); + if (ret) + break; + } + mutex_unlock(&iommu->lock); + return ret; +} + static int vfio_bind_pasid_table(struct vfio_iommu *iommu, struct vfio_iommu_type1_bind_pasid_table *ustruct) @@ -1818,6 +1836,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return -EINVAL; return vfio_cache_invalidate(iommu, &ustruct); + } else if (cmd == VFIO_IOMMU_BIND_MSI) { + struct vfio_iommu_type1_bind_guest_msi ustruct; + + minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi, + binding); + + if (copy_from_user(&ustruct, (void __user *)arg, minsz)) + return -EFAULT; + + if (ustruct.argsz < minsz || ustruct.flags) + return -EINVAL; + + return vfio_iommu_bind_guest_msi(iommu, &ustruct); } return -ENOTTY; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 2a38d0bca0ca..6fb5e944e73c 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -680,6 +680,13 @@ struct vfio_iommu_type1_cache_invalidate { }; #define VFIO_IOMMU_CACHE_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 23) +struct vfio_iommu_type1_bind_guest_msi { + __u32 argsz; + __u32 flags; + struct iommu_guest_msi_binding binding; +}; +#define VFIO_IOMMU_BIND_MSI _IO(VFIO_TYPE, VFIO_BASE + 24) + /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */ /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
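A userspace sketch of the new ioctl (illustrative). The binding describes the stage 1 gIOVA -> doorbell-GPA mapping created by the guest so the host can reuse the gIOVA; only the gpa and granule members are visible in this excerpt, the iova member name is assumed:

struct vfio_iommu_type1_bind_guest_msi bind = {
	.argsz = sizeof(bind),
	.flags = 0,
};

/* bind.binding.iova = giova;	guest IOVA programmed into the device (assumed name) */
bind.binding.gpa     = gdb_gpa;	/* guest MSI doorbell GPA */
bind.binding.granule = granule;	/* mapping granule (encoding not shown here) */

ioctl(container, VFIO_IOMMU_BIND_MSI, &bind);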
[RFC v2 07/20] iommu/arm-smmu-v3: Link domains and devices
From: Jean-Philippe Brucker When removing a mapping from a domain, we need to send an invalidation to all devices that might have stored it in their Address Translation Cache (ATC). In addition with SVM, we'll need to invalidate context descriptors of all devices attached to a live domain. Maintain a list of devices in each domain, protected by a spinlock. It is updated every time we attach or detach devices to and from domains. It needs to be a spinlock because we'll invalidate ATC entries from within hardirq-safe contexts, but it may be possible to relax the read side with RCU later. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/arm-smmu-v3.c | 28 1 file changed, 28 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5059d09f3202..5d0d080f0a02 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -596,6 +596,11 @@ struct arm_smmu_device { struct arm_smmu_master_data { struct arm_smmu_device *smmu; struct arm_smmu_strtab_ent ste; + + struct arm_smmu_domain *domain; + struct list_headlist; /* domain->devices */ + + struct device *dev; }; /* SMMU private data for an IOMMU domain */ @@ -619,6 +624,9 @@ struct arm_smmu_domain { }; struct iommu_domain domain; + + struct list_headdevices; + spinlock_t devices_lock; }; struct arm_smmu_option_prop { @@ -1472,6 +1480,9 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type) } mutex_init(&smmu_domain->init_mutex); + INIT_LIST_HEAD(&smmu_domain->devices); + spin_lock_init(&smmu_domain->devices_lock); + return &smmu_domain->domain; } @@ -1687,7 +1698,17 @@ static void arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec) static void arm_smmu_detach_dev(struct device *dev) { + unsigned long flags; struct arm_smmu_master_data *master = dev->iommu_fwspec->iommu_priv; + struct arm_smmu_domain *smmu_domain = master->domain; + + if (smmu_domain) { + spin_lock_irqsave(&smmu_domain->devices_lock, flags); + list_del(&master->list); + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); + + master->domain = NULL; + } master->ste.assigned = false; arm_smmu_install_ste_for_dev(dev->iommu_fwspec); @@ -1696,6 +1717,7 @@ static void arm_smmu_detach_dev(struct device *dev) static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) { int ret = 0; + unsigned long flags; struct arm_smmu_device *smmu; struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct arm_smmu_master_data *master; @@ -1731,6 +1753,11 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) } ste->assigned = true; + master->domain = smmu_domain; + + spin_lock_irqsave(&smmu_domain->devices_lock, flags); + list_add(&master->list, &smmu_domain->devices); + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) { ste->s1_cfg = NULL; @@ -1849,6 +1876,7 @@ static int arm_smmu_add_device(struct device *dev) return -ENOMEM; master->smmu = smmu; + master->dev = dev; fwspec->iommu_priv = master; } -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC v2 08/20] iommu/arm-smmu-v3: Maintain a SID->device structure
From: Jean-Philippe Brucker When handling faults from the event or PRI queue, we need to find the struct device associated to a SID. Add a rb_tree to keep track of SIDs. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/arm-smmu-v3.c | 136 ++-- 1 file changed, 132 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5d0d080f0a02..80bb93b43a2e 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -590,6 +590,16 @@ struct arm_smmu_device { /* IOMMU core code handle */ struct iommu_device iommu; + + struct rb_root streams; + struct mutexstreams_mutex; + +}; + +struct arm_smmu_stream { + u32 id; + struct arm_smmu_master_data *master; + struct rb_node node; }; /* SMMU private data for each master */ @@ -599,6 +609,7 @@ struct arm_smmu_master_data { struct arm_smmu_domain *domain; struct list_headlist; /* domain->devices */ + struct arm_smmu_stream *streams; struct device *dev; }; @@ -1224,6 +1235,32 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) return 0; } +__maybe_unused +static struct arm_smmu_master_data * +arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) +{ + struct rb_node *node; + struct arm_smmu_stream *stream; + struct arm_smmu_master_data *master = NULL; + + mutex_lock(&smmu->streams_mutex); + node = smmu->streams.rb_node; + while (node) { + stream = rb_entry(node, struct arm_smmu_stream, node); + if (stream->id < sid) { + node = node->rb_right; + } else if (stream->id > sid) { + node = node->rb_left; + } else { + master = stream->master; + break; + } + } + mutex_unlock(&smmu->streams_mutex); + + return master; +} + /* IRQ and event handlers */ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev) { @@ -1847,6 +1884,71 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid) return sid < limit; } +static int arm_smmu_insert_master(struct arm_smmu_device *smmu, + struct arm_smmu_master_data *master) +{ + int i; + int ret = 0; + struct arm_smmu_stream *new_stream, *cur_stream; + struct rb_node **new_node, *parent_node = NULL; + struct iommu_fwspec *fwspec = master->dev->iommu_fwspec; + + master->streams = kcalloc(fwspec->num_ids, + sizeof(struct arm_smmu_stream), GFP_KERNEL); + if (!master->streams) + return -ENOMEM; + + mutex_lock(&smmu->streams_mutex); + for (i = 0; i < fwspec->num_ids && !ret; i++) { + new_stream = &master->streams[i]; + new_stream->id = fwspec->ids[i]; + new_stream->master = master; + + new_node = &(smmu->streams.rb_node); + while (*new_node) { + cur_stream = rb_entry(*new_node, struct arm_smmu_stream, + node); + parent_node = *new_node; + if (cur_stream->id > new_stream->id) { + new_node = &((*new_node)->rb_left); + } else if (cur_stream->id < new_stream->id) { + new_node = &((*new_node)->rb_right); + } else { + dev_warn(master->dev, +"stream %u already in tree\n", +cur_stream->id); + ret = -EINVAL; + break; + } + } + + if (!ret) { + rb_link_node(&new_stream->node, parent_node, new_node); + rb_insert_color(&new_stream->node, &smmu->streams); + } + } + mutex_unlock(&smmu->streams_mutex); + + return ret; +} + +static void arm_smmu_remove_master(struct arm_smmu_device *smmu, + struct arm_smmu_master_data *master) +{ + int i; + struct iommu_fwspec *fwspec = master->dev->iommu_fwspec; + + if (!master->streams) + return; + + mutex_lock(&smmu->streams_mutex); + for (i = 0; i < fwspec->num_ids; i++) + rb_erase(&master->streams[i].node, &smmu->streams); + mutex_unlock(&smmu->streams_mutex); + + kfree(master->streams); +} + 
static struct iommu_ops arm_smmu_ops; static int arm_smmu_add_device(struct device *dev) @@ -1895,13 +1997,35 @@ st
[RFC v2 09/20] iommu/smmuv3: Get prepared for nested stage support
To allow nested stage support, we need to store both stage 1 and stage 2 configurations (and remove the former union). arm_smmu_write_strtab_ent() is modified to write both stage fields in the STE. We add a nested_bypass field to the S1 configuration as the first stage can be bypassed. Also the guest may force the STE to abort: this information gets stored into the nested_abort field. Only S2 stage is "finalized" as the host does not configure S1 CD, guest does. Signed-off-by: Eric Auger --- v1 -> v2: - invalidate the STE before moving from a live STE config to another - add the nested_abort and nested_bypass fields --- drivers/iommu/arm-smmu-v3.c | 43 - 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 80bb93b43a2e..9749c36208f3 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -222,6 +222,7 @@ #define STRTAB_STE_0_CFG_BYPASS4 #define STRTAB_STE_0_CFG_S1_TRANS 5 #define STRTAB_STE_0_CFG_S2_TRANS 6 +#define STRTAB_STE_0_CFG_NESTED7 #define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4) #define STRTAB_STE_0_S1FMT_LINEAR 0 @@ -497,6 +498,10 @@ struct arm_smmu_strtab_l1_desc { struct arm_smmu_s1_cfg { __le64 *cdptr; dma_addr_t cdptr_dma; + /* in nested mode, tells s1 must be bypassed */ + boolnested_bypass; + /* in nested mode, abort is forced by guest */ + boolnested_abort; struct arm_smmu_ctx_desc { u16 asid; @@ -521,6 +526,7 @@ struct arm_smmu_strtab_ent { * configured according to the domain type. */ boolassigned; + boolnested; struct arm_smmu_s1_cfg *s1_cfg; struct arm_smmu_s2_cfg *s2_cfg; }; @@ -629,10 +635,8 @@ struct arm_smmu_domain { struct io_pgtable_ops *pgtbl_ops; enum arm_smmu_domain_stage stage; - union { - struct arm_smmu_s1_cfg s1_cfg; - struct arm_smmu_s2_cfg s2_cfg; - }; + struct arm_smmu_s1_cfg s1_cfg; + struct arm_smmu_s2_cfg s2_cfg; struct iommu_domain domain; @@ -1119,10 +1123,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, break; case STRTAB_STE_0_CFG_S1_TRANS: case STRTAB_STE_0_CFG_S2_TRANS: + case STRTAB_STE_0_CFG_NESTED: ste_live = true; break; case STRTAB_STE_0_CFG_ABORT: - if (disable_bypass) + if (disable_bypass || ste->nested) break; default: BUG(); /* STE corruption */ @@ -1134,7 +1139,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, /* Bypass/fault */ if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) { - if (!ste->assigned && disable_bypass) + if ((!ste->assigned && disable_bypass) || + (ste->s1_cfg && ste->s1_cfg->nested_abort)) val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT); else val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS); @@ -1152,8 +1158,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, return; } + if (ste->nested && ste_live) { + /* +* When enabling nested, the STE may be transitionning from +* s2 to nested and back. Invalidate the STE before changing it. 
+*/ + dst[0] = cpu_to_le64(0); + arm_smmu_sync_ste_for_sid(smmu, sid); + val = STRTAB_STE_0_V; + } + if (ste->s1_cfg) { - BUG_ON(ste_live); dst[1] = cpu_to_le64( FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) | FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) | @@ -1167,12 +1182,12 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE)) dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD); - val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) | - FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS); + if (!ste->s1_cfg->nested_bypass) + val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) | + FIELD_PREP(STRTAB_STE_0_CFG,
[RFC v2 05/20] vfio: VFIO_IOMMU_CACHE_INVALIDATE
From: "Liu, Yi L" When the guest "owns" the stage 1 translation structures, the host IOMMU driver has no knowledge of caching structure updates unless the guest invalidation requests are trapped and passed down to the host. This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims at propagating guest stage1 IOMMU cache invalidations to the host. Signed-off-by: Liu, Yi L Signed-off-by: Eric Auger --- v1 -> v2: - s/TLB/CACHE - remove vfio_iommu_task usage - commit message rewording --- drivers/vfio/vfio_iommu_type1.c | 44 + include/uapi/linux/vfio.h | 7 ++ 2 files changed, 51 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 044d3a046125..14529233f774 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1673,6 +1673,37 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu) return ret; } +static int vfio_cache_inv_fn(struct device *dev, void *data) +{ + struct vfio_iommu_type1_cache_invalidate *ustruct = + (struct vfio_iommu_type1_cache_invalidate *)data; + struct iommu_domain *d = iommu_get_domain_for_dev(dev); + + return iommu_cache_invalidate(d, dev, &ustruct->info); +} + +static int +vfio_cache_invalidate(struct vfio_iommu *iommu, + struct vfio_iommu_type1_cache_invalidate *ustruct) +{ + struct vfio_domain *d; + struct vfio_group *g; + int ret = 0; + + mutex_lock(&iommu->lock); + + list_for_each_entry(d, &iommu->domain_list, next) { + list_for_each_entry(g, &d->group_list, next) { + ret = iommu_group_for_each_dev(g->iommu_group, ustruct, + vfio_cache_inv_fn); + if (ret) + break; + } + } + mutex_unlock(&iommu->lock); + return ret; +} + static int vfio_bind_pasid_table(struct vfio_iommu *iommu, struct vfio_iommu_type1_bind_pasid_table *ustruct) @@ -1774,6 +1805,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return -EINVAL; return vfio_bind_pasid_table(iommu, &ustruct); + } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) { + struct vfio_iommu_type1_cache_invalidate ustruct; + + minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate, + info); + + if (copy_from_user(&ustruct, (void __user *)arg, minsz)) + return -EFAULT; + + if (ustruct.argsz < minsz || ustruct.flags) + return -EINVAL; + + return vfio_cache_invalidate(iommu, &ustruct); } return -ENOTTY; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index f81ce0a75c01..2a38d0bca0ca 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -673,6 +673,13 @@ struct vfio_iommu_type1_bind_pasid_table { }; #define VFIO_IOMMU_BIND_PASID_TABLE_IO(VFIO_TYPE, VFIO_BASE + 22) +struct vfio_iommu_type1_cache_invalidate { + __u32 argsz; + __u32 flags; + struct iommu_cache_invalidate_info info; +}; +#define VFIO_IOMMU_CACHE_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 23) + /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */ /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
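A userspace sketch of the new ioctl (illustrative): the VMM traps a guest stage 1 TLB invalidation and forwards it through the container. The info fields mirror the way the SMMUv3 implementation earlier in this series consumes them; guest_asid and giova are placeholders:

struct vfio_iommu_type1_cache_invalidate inv = {
	.argsz = sizeof(inv),
	.flags = 0,
};

inv.info.hdr.type = IOMMU_INV_TYPE_TLB;
inv.info.arch_id  = guest_asid;	/* ASID from the trapped guest command */
inv.info.addr     = giova;	/* guest IOVA to invalidate */
inv.info.size     = 0;		/* 4K granule, i.e. 1 << (size + 12) */
inv.info.nr_pages = 1;

ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv);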
[RFC v2 04/20] vfio: VFIO_IOMMU_BIND_PASID_TABLE
From: "Liu, Yi L" This patch adds VFIO_IOMMU_BIND_PASID_TABLE ioctl which aims at passing the virtual iommu guest configuration to the VFIO driver downto to the iommu subsystem. Signed-off-by: Jacob Pan Signed-off-by: Liu, Yi L Signed-off-by: Eric Auger --- v1 -> v2: - s/BIND_GUEST_STAGE/BIND_PASID_TABLE - remove the struct device arg --- drivers/vfio/vfio_iommu_type1.c | 31 +++ include/uapi/linux/vfio.h | 8 2 files changed, 39 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d9fd3188615d..044d3a046125 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1673,6 +1673,24 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu) return ret; } +static int +vfio_bind_pasid_table(struct vfio_iommu *iommu, + struct vfio_iommu_type1_bind_pasid_table *ustruct) +{ + struct vfio_domain *d; + int ret = 0; + + mutex_lock(&iommu->lock); + + list_for_each_entry(d, &iommu->domain_list, next) { + ret = iommu_bind_pasid_table(d->domain, &ustruct->config); + if (ret) + break; + } + mutex_unlock(&iommu->lock); + return ret; +} + static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { @@ -1743,6 +1761,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return copy_to_user((void __user *)arg, &unmap, minsz) ? -EFAULT : 0; + } else if (cmd == VFIO_IOMMU_BIND_PASID_TABLE) { + struct vfio_iommu_type1_bind_pasid_table ustruct; + + minsz = offsetofend(struct vfio_iommu_type1_bind_pasid_table, + config); + + if (copy_from_user(&ustruct, (void __user *)arg, minsz)) + return -EFAULT; + + if (ustruct.argsz < minsz || ustruct.flags) + return -EINVAL; + + return vfio_bind_pasid_table(iommu, &ustruct); } return -ENOTTY; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 1aa7b82e8169..f81ce0a75c01 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -14,6 +14,7 @@ #include #include +#include #define VFIO_API_VERSION 0 @@ -665,6 +666,13 @@ struct vfio_iommu_type1_dma_unmap { #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15) #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16) +struct vfio_iommu_type1_bind_pasid_table { + __u32 argsz; + __u32 flags; + struct iommu_pasid_table_config config; +}; +#define VFIO_IOMMU_BIND_PASID_TABLE_IO(VFIO_TYPE, VFIO_BASE + 22) + /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */ /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC v2 02/20] iommu: Introduce cache_invalidate API
From: "Liu, Yi L" In any virtualization use case, when the first translation stage is "owned" by the guest OS, the host IOMMU driver has no knowledge of caching structure updates unless the guest invalidation activities are trapped by the virtualizer and passed down to the host. Since the invalidation data is obtained from user space and will be written into the physical IOMMU, we must allow security checks at various layers. Therefore, a generic invalidation data format is proposed here; model-specific IOMMU drivers need to convert it into their own format. Signed-off-by: Liu, Yi L Signed-off-by: Jean-Philippe Brucker Signed-off-by: Jacob Pan Signed-off-by: Ashok Raj Signed-off-by: Eric Auger --- v1 -> v2: - add arch_id field - renamed tlb_invalidate into cache_invalidate as this API allows invalidating context caches on top of IOTLBs v1: renamed sva_invalidate into tlb_invalidate and added the iommu_ prefix in the header. Commit message reworded. --- drivers/iommu/iommu.c | 14 ++ include/linux/iommu.h | 14 ++ include/uapi/linux/iommu.h | 95 ++ 3 files changed, 123 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index db2c7c9502ae..1442a6c640af 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1381,6 +1381,20 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain) } EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table); +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev, + struct iommu_cache_invalidate_info *inv_info) +{ + int ret = 0; + + if (unlikely(!domain->ops->cache_invalidate)) + return -ENODEV; + + ret = domain->ops->cache_invalidate(domain, dev, inv_info); + + return ret; +} +EXPORT_SYMBOL_GPL(iommu_cache_invalidate); + static void __iommu_detach_device(struct iommu_domain *domain, struct device *dev) { diff --git a/include/linux/iommu.h b/include/linux/iommu.h index e56cad4863f7..3e5f6eb1f04a 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -188,6 +188,7 @@ struct iommu_resv_region { * @pgsize_bitmap: bitmap of all possible supported page sizes * @bind_pasid_table: bind pasid table * @unbind_pasid_table: unbind pasid table and restore defaults + * @cache_invalidate: invalidate translation caches */ struct iommu_ops { bool (*capable)(enum iommu_cap); @@ -238,6 +239,9 @@ struct iommu_ops { struct iommu_pasid_table_config *cfg); void (*unbind_pasid_table)(struct iommu_domain *domain); + int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev, + struct iommu_cache_invalidate_info *inv_info); + unsigned long pgsize_bitmap; }; @@ -302,6 +306,9 @@ extern void iommu_detach_device(struct iommu_domain *domain, extern int iommu_bind_pasid_table(struct iommu_domain *domain, struct iommu_pasid_table_config *cfg); extern void iommu_unbind_pasid_table(struct iommu_domain *domain); +extern int iommu_cache_invalidate(struct iommu_domain *domain, + struct device *dev, + struct iommu_cache_invalidate_info *inv_info); extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern int iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, size_t size, int prot); @@ -704,6 +711,13 @@ static inline void iommu_unbind_pasid_table(struct iommu_domain *domain) { } +static inline int +iommu_cache_invalidate(struct iommu_domain *domain, + struct device *dev, + struct iommu_cache_invalidate_info *inv_info) +{ + return -ENODEV; +} #endif /* CONFIG_IOMMU_API */ diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index babec91ae7e1..4283e0334baf 100644 ---
a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -49,4 +49,99 @@ struct iommu_pasid_table_config { }; }; +/** + * enum iommu_inv_granularity - Generic invalidation granularity + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID: TLB entries or PASID caches of all + * PASIDs associated with a domain ID + * @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID cache associated + * with a PASID and a domain + * @IOMMU_INV_GRANU_PAGE_PASID:TLB entries of selected page range + * within a PASID + * + * When an invalidation request is passed down to IOMMU to flush translation + * caches, it may carry different granularity levels, which can be specific + * to certain types of translation caches. + * This enum is a collection of granularities for all types of translation + * caches. The idea is to make it easy for IOMMU model specific driver to + * convert f
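To make the conversion step concrete, here is a rough sketch of what a model-specific cache_invalidate callback could look like, dispatching on the granularities listed above. The "granularity" field name and the my_flush_*() helpers are assumptions for illustration; only arch_id, addr and the three granularity values are visible in this excerpt.

/* Sketch of a model-specific hook; the flush helpers are placeholders. */
static int my_iommu_cache_invalidate(struct iommu_domain *domain,
				     struct device *dev,
				     struct iommu_cache_invalidate_info *inv_info)
{
	switch (inv_info->granularity) {	/* field name assumed */
	case IOMMU_INV_GRANU_DOMAIN_ALL_PASID:
		return my_flush_all_pasids(domain);			/* placeholder */
	case IOMMU_INV_GRANU_PASID_SEL:
		return my_flush_pasid(domain, inv_info->arch_id);	/* placeholder */
	case IOMMU_INV_GRANU_PAGE_PASID:
		return my_flush_pages(domain, inv_info->arch_id,
				      inv_info->addr);			/* placeholder */
	default:
		return -EINVAL;
	}
}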
[RFC v2 03/20] iommu: Introduce bind_guest_msi
On ARM, MSIs are translated by the SMMU. An IOVA is allocated for each MSI doorbell. If both the host and the guest are exposed with SMMUs, we end up with 2 different IOVAs allocated by each. The guest allocates an IOVA (gIOVA) to map onto the guest MSI doorbell (gDB). The host allocates another IOVA (hIOVA) to map onto the physical doorbell (hDB). So we end up with 2 untied mappings: S1: gIOVA -> gDB and S2: hIOVA -> hDB. Currently the PCI device is programmed by the host with hIOVA as MSI doorbell. So this does not work. This patch introduces an API to pass gIOVA/gDB to the host so that gIOVA can be reused by the host instead of re-allocating a new IOVA. So the goal is to create the following nested mapping: S1: gIOVA -> gDB, S2: gDB -> hDB (i.e. gIOVA -> gDB -> hDB), and program the PCI device with the gIOVA MSI doorbell. Signed-off-by: Eric Auger --- drivers/iommu/iommu.c | 10 ++ include/linux/iommu.h | 13 + include/uapi/linux/iommu.h | 7 +++ 3 files changed, 30 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 1442a6c640af..b2b8f375b169 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1409,6 +1409,16 @@ static void __iommu_detach_device(struct iommu_domain *domain, trace_detach_device_from_domain(dev); } +int iommu_bind_guest_msi(struct iommu_domain *domain, +struct iommu_guest_msi_binding *binding) +{ + if (unlikely(!domain->ops->bind_guest_msi)) + return -ENODEV; + + return domain->ops->bind_guest_msi(domain, binding); +} +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi); + void iommu_detach_device(struct iommu_domain *domain, struct device *dev) { struct iommu_group *group; diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 3e5f6eb1f04a..9bd3e63d562b 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -242,6 +242,9 @@ struct iommu_ops { int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev, struct iommu_cache_invalidate_info *inv_info); + int (*bind_guest_msi)(struct iommu_domain *domain, + struct iommu_guest_msi_binding *binding); + unsigned long pgsize_bitmap; }; @@ -309,6 +312,9 @@ extern void iommu_unbind_pasid_table(struct iommu_domain *domain); extern int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev, struct iommu_cache_invalidate_info *inv_info); +extern int iommu_bind_guest_msi(struct iommu_domain *domain, + struct iommu_guest_msi_binding *binding); + extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern int iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, size_t size, int prot); @@ -719,6 +725,13 @@ iommu_cache_invalidate(struct iommu_domain *domain, return -ENODEV; } +static inline +int iommu_bind_guest_msi(struct iommu_domain *domain, +struct iommu_guest_msi_binding *binding) +{ + return -ENODEV; +} + #endif /* CONFIG_IOMMU_API */ #ifdef CONFIG_IOMMU_DEBUGFS diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index 4283e0334baf..21adb2a964e5 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -144,4 +144,11 @@ struct iommu_cache_invalidate_info { __u64 arch_id; __u64 addr; }; + +struct iommu_guest_msi_binding { + __u64 iova; + __u64 gpa; + __u32 granule; +}; #endif /* _UAPI_IOMMU_H */ + -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
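A small sketch of how the host side (for instance the VFIO layer added later in this series) might call the new API once it has received a gIOVA/gDB pair from the virtualizer. The wrapper and the granule encoding are illustrative; the uapi struct only defines iova, gpa and granule.

/* Illustrative wrapper around the new API; not part of this patch. */
static int bind_one_guest_doorbell(struct iommu_domain *domain,
				   u64 giova, u64 gdb_gpa, u32 granule)
{
	struct iommu_guest_msi_binding binding = {
		.iova = giova,		/* IOVA the guest chose for its doorbell */
		.gpa = gdb_gpa,		/* guest-physical address of the doorbell (gDB) */
		.granule = granule,	/* mapping granule, encoding assumed */
	};

	return iommu_bind_guest_msi(domain, &binding);
}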
[RFC v2 01/20] iommu: Introduce bind_pasid_table API
From: Jacob Pan In a virtualization use case, when a guest is assigned a PCI host device, protected by a virtual IOMMU on a guest, the physical IOMMU must be programmed to be consistent with the guest mappings. If the physical IOMMU supports two translation stages it makes sense to program guest mappings onto the first stage/level (ARM/VTD terminology) while the host owns the stage/level 2. In that case, it is mandated to trap on guest configuration settings and pass those to the physical iommu driver. This patch adds a new API to the iommu subsystem that allows binding and unbinding the guest configuration data to the host. A generic iommu_pasid_table_config struct is introduced in a new iommu.h uapi header. This is going to be used by the VFIO user API. We foresee at least two specializations of this struct, for PASID table passing and ARM SMMUv3. Signed-off-by: Jean-Philippe Brucker Signed-off-by: Liu, Yi L Signed-off-by: Ashok Raj Signed-off-by: Jacob Pan Signed-off-by: Eric Auger --- In practice, I think it would be simpler to have a single set_pasid_table function instead of bind/unbind. The "bypass" field tells whether stage 1 is bypassed (equivalent to the unbind actually). In userspace we have notifications that the device context has changed. Calling either bind or unbind requires an understanding of what the previous state was and calling different notifiers. So to me the bind/unbind approach complicates the user integration while not bringing much benefit. This patch generalizes the API introduced by Jacob & co-authors in https://lwn.net/Articles/754331/ v1 -> v2: - restore the original pasid table name - remove the struct device * parameter in the API - reworked iommu_pasid_smmuv3 --- drivers/iommu/iommu.c | 19 ++ include/linux/iommu.h | 21 +++ include/uapi/linux/iommu.h | 52 ++ 3 files changed, 92 insertions(+) create mode 100644 include/uapi/linux/iommu.h diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 8c15c5980299..db2c7c9502ae 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1362,6 +1362,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev) } EXPORT_SYMBOL_GPL(iommu_attach_device); +int iommu_bind_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg) +{ + if (unlikely(!domain->ops->bind_pasid_table)) + return -ENODEV; + + return domain->ops->bind_pasid_table(domain, cfg); +} +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table); + +void iommu_unbind_pasid_table(struct iommu_domain *domain) +{ + if (unlikely(!domain->ops->unbind_pasid_table)) + return; + + domain->ops->unbind_pasid_table(domain); +} +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table); + static void __iommu_detach_device(struct iommu_domain *domain, struct device *dev) { diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 87994c265bf5..e56cad4863f7 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -25,6 +25,7 @@ #include #include #include +#include #define IOMMU_READ (1 << 0) #define IOMMU_WRITE (1 << 1) @@ -185,6 +186,8 @@ struct iommu_resv_region { * @domain_get_windows: Return the number of windows for a domain * @of_xlate: add OF master IDs to iommu grouping * @pgsize_bitmap: bitmap of all possible supported page sizes + * @bind_pasid_table: bind pasid table + * @unbind_pasid_table: unbind pasid table and restore defaults */ struct iommu_ops { bool (*capable)(enum iommu_cap); @@ -231,6 +234,10 @@ struct iommu_ops { int (*of_xlate)(struct device *dev, struct of_phandle_args *args); bool (*is_attach_deferred)(struct iommu_domain
*domain, struct device *dev); + int (*bind_pasid_table)(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg); + void (*unbind_pasid_table)(struct iommu_domain *domain); + unsigned long pgsize_bitmap; }; @@ -292,6 +299,9 @@ extern int iommu_attach_device(struct iommu_domain *domain, struct device *dev); extern void iommu_detach_device(struct iommu_domain *domain, struct device *dev); +extern int iommu_bind_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg); +extern void iommu_unbind_pasid_table(struct iommu_domain *domain); extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern int iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, size_t size, int prot); @@ -684,6 +694,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode) return NULL; } +static inline +int iommu_bind_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_tabl
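For context, a sketch of how an IOMMU driver would advertise the new callbacks in its iommu_ops; the my_smmu_*() functions are placeholders, the real SMMUv3 implementation comes later in the series. A caller such as the VFIO ioctl handler then simply invokes iommu_bind_pasid_table(domain, &cfg) and iommu_unbind_pasid_table(domain) on teardown.

static const struct iommu_ops my_smmu_ops = {
	/* ... existing callbacks (capable, attach_dev, map, unmap, ...) ... */
	.bind_pasid_table	= my_smmu_bind_pasid_table,	/* placeholder */
	.unbind_pasid_table	= my_smmu_unbind_pasid_table,	/* placeholder */
};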
[RFC v2 00/20] SMMUv3 Nested Stage Setup
This series allows a virtualizer to program the nested stage mode. This is useful when both the host and the guest are exposed with an SMMUv3 and a PCI device is assigned to the guest using VFIO. In this mode, the physical IOMMU must be programmed to translate the two stages: the one set up by the guest (IOVA -> GPA) and the one set up by the host VFIO driver as part of the assignment process (GPA -> HPA). On Intel, this is traditionally achieved by combining the 2 stages into a single physical stage. However, this relies on the capability to trap on each guest translation structure update. This is possible by using the VTD Caching Mode. Unfortunately the ARM SMMUv3 does not offer a similar mechanism. However, the ARM SMMUv3 architecture supports 2 physical stages! Those were devised exactly with that use case in mind. Assuming the HW implements both stages (optional), the guest can now use stage 1 while the host uses stage 2. This assumes the virtualizer has means to propagate guest settings to the host SMMUv3 driver. This series brings this VFIO/IOMMU infrastructure. Those services are: - bind the guest stage 1 configuration to the stream table entry - propagate guest TLB invalidations - bind MSI IOVAs - propagate faults collected at physical level up to the virtualizer This series largely reuses the user API and infrastructure originally devised for SVA/SVM and patches submitted by Jacob, Yi Liu, Tianyu in [1-3] and Jean-Philippe [4]. This proof of concept also aims at illustrating how this API/infrastructure would need to evolve to support this nested stage SMMUv3 use case. Best Regards Eric This series can be found at: https://github.com/eauger/linux/tree/v4.19-rc4-2stage-rfc-v2 This was tested on Qualcomm HW featuring SMMUv3 and with an adapted QEMU vSMMUv3.
References: [1] [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) https://lwn.net/Articles/754331/ [2] [RFC PATCH 0/8] Shared Virtual Memory virtualization for VT-d (VFIO part) https://lists.linuxfoundation.org/pipermail/iommu/2017-April/021475.html [3] [RFC PATCH 0/3] VFIO: Report IOMMU fault event to userspace https://www.spinics.net/lists/kvm/msg145280.html [4] [v2,00/40] Shared Virtual Addressing for the IOMMU https://patchwork.ozlabs.org/cover/912129/ History: v1 -> v2: - Added the fault reporting capability - asid properly passed on invalidation (fix assignment of multiple devices) - see individual change logs for more info Eric Auger (11): iommu: Introduce bind_guest_msi vfio: VFIO_IOMMU_BIND_MSI iommu/smmuv3: Get prepared for nested stage support iommu/smmuv3: Implement bind_pasid_table iommu/smmuv3: Implement cache_invalidate dma-iommu: Implement NESTED_MSI cookie iommu/smmuv3: Implement bind_guest_msi vfio: VFIO_IOMMU_SET_FAULT_EVENTFD vfio: VFIO_IOMMU_GET_FAULT_EVENTS vfio: Document nested stage control iommu/smmuv3: Report non recoverable faults Jacob Pan (4): iommu: Introduce bind_pasid_table API iommu: introduce device fault data driver core: add per device iommu param iommu: introduce device fault report API Jean-Philippe Brucker (2): iommu/arm-smmu-v3: Link domains and devices iommu/arm-smmu-v3: Maintain a SID->device structure Liu, Yi L (3): iommu: Introduce cache_invalidate API vfio: VFIO_IOMMU_BIND_PASID_TABLE vfio: VFIO_IOMMU_CACHE_INVALIDATE Documentation/vfio.txt | 60 drivers/iommu/arm-smmu-v3.c | 476 ++-- drivers/iommu/dma-iommu.c | 97 ++- drivers/iommu/iommu.c | 196 - drivers/vfio/vfio_iommu_type1.c | 269 ++ include/linux/device.h | 3 + include/linux/dma-iommu.h | 11 + include/linux/iommu.h | 134 - include/uapi/linux/iommu.h | 237 include/uapi/linux/vfio.h | 48 10 files changed, 1501 insertions(+), 30 deletions(-) create mode 100644 include/uapi/linux/iommu.h -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: Explicit IOVA management from a PCIe endpoint driver
On 2018-09-18 03:06, Stephen Warren wrote: Joerg, Christoph, Marek, Robin, I believe that the driver for our PCIe endpoint controller hardware will need to explicitly manage its IOVA space more than current APIs allow. I'd like to discuss how to make that possible. First some background on our hardware: NVIDIA's Xavier SoC contains a Synopsis Designware PCIe controller. This can operate in either root port or endpoint mode. I'm particularly interested in endpoint mode. Our particular instantiation of this controller exposes a single function with a single software-controlled PCIe BAR to the PCIe bus (there are also BARs for access to DMA controller registers and outbound MSI configuration, which can both be enabled/disabled but not used for any other purpose). When a transaction is received from the PCIe bus, the following happens: 1) Transaction is matched against the BAR base/size (in PCIe address space) to determine whether it "hits" this BAR or not. 2) The transaction's address is processed by the PCIe controller's ATU (Address Translation Unit), which can re-write the address that the transaction accesses. Our particular instantiation of the hardware only has 2 entries in the ATU mapping table, which gives very little flexibility in setting up a mapping. As an FYI, ATU entries can match PCIe transactions either: a) Any transaction received on a particular BAR. b) Any transaction received within a single contiguous window of PCIe address space. This kind of mapping entry obviously has to be set up after device enumeration is complete so that it can match the correct PCIe address. Each ATU entry maps a single contiguous set of PCIe addresses to a single contiguous set of IOVAs which are passed to the IOMMU. Transactions can pass through the ATU without being translated if desired. 3) The transaction is passed to the IOMMU, which can again re-write the address that the transaction accesses. 4) The transaction is passed to the memory controller and reads/writes DRAM. In general, we want to be able to expose a large and dynamic set of data buffers to the PCIe bus; certainly /far/ more than two separate buffers (the number of ATU table entries). With current Linux APIs, these buffers will not be located in contiguous or adjacent physical (DRAM) or virtual (IOVA) addresses, nor in any particular window of physical or IOVA addresses. However, the ATU's mapping from PCIe to IOVA can only expose one or two contiguous ranges of IOVA space. These two sets of requirements are at odds! So, I'd like to propose some new APIs that the PCIe endpoint driver can use: 1) Allocate/reserve an IOVA range of specified size, but don't map anything into the IOVA range. I had done some work on this in the past, those patches were tested on Broadcom HW. https://lkml.org/lkml/2017/5/16/23, https://lkml.org/lkml/2017/5/16/21, https://lkml.org/lkml/2017/5/16/19 I could not pursue it further, since I do not have the same HW to test it. Although now in Qualcomm SOC, we do use Synopsis Designware PCIe controller but we dont restrict inbound addresses range for our SOC. of course these patches can easily be ported, and extended. they basically reserve IOVA ranges based on inbound dma-ranges DT property. Regards, Oza. 2) De-allocate the IOVA range allocated in (1). 3) Map a specific set (scatter-gather list I suppose) of already-allocated/extant physical addresses into part of an IOVA range allocated in (1). 4) Unmap a portion of an IOVA range that was mapped by (3). 
One final note: The memory controller can translate accesses to a small region of DRAM address space into accesses to an interrupt generation module. This allows devices attached to the PCIe bus to generate interrupts to software running on the system with the PCIe endpoint controller. Thus I deliberately described API 3 above as mapping a specific physical address into IOVA space, as opposed to mapping an existing DRAM allocation into IOVA space, in order to allow mapping this interrupt generation address space into IOVA space. If we needed separate APIs to map physical addresses vs. DRAM allocations into IOVA space, that would likely be fine too. Does this API proposal sound reasonable? I have heard from some NVIDIA developers that the above APIs rather go against the principle that individual drivers should not be aware of the presence/absence of an IOMMU, and hence direct management of IOVA allocation/layout is deliberately avoided, and hence there hasn't been a need/desire for this kind of API in the past. However, I think our current hardware design and use-case rather requires it. Do you agree? ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
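To make the proposal easier to discuss, here is one purely hypothetical shape the four operations could take; none of these functions exist today and the names are invented for illustration only.

/* Hypothetical API sketch for the four operations proposed above. */
struct iova_window;	/* opaque handle for a reserved-but-unmapped IOVA range */

/* (1) reserve an IOVA range of the given size without mapping anything */
struct iova_window *iommu_dma_reserve_window(struct device *dev, size_t size);

/* (2) release a range obtained from (1) */
void iommu_dma_release_window(struct device *dev, struct iova_window *win);

/* (3) map already-existing physical pages at an offset inside the window */
int iommu_dma_map_phys_in_window(struct device *dev, struct iova_window *win,
				 size_t offset, phys_addr_t phys, size_t size,
				 int prot);

/* (4) unmap part of the window again */
void iommu_dma_unmap_in_window(struct device *dev, struct iova_window *win,
			       size_t offset, size_t size);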
Re: [PATCH v2 0/3] iommu: Avoid DMA ops domain refcount contention
>Not sure how fast Tom needs the common dma-iommu code for the x86 AMD iommu >conversion. I am currently busy working on something else and won't be able to do/test the x86 AMD iommu conversion anytime soon. So I don't need the common dma-iommu code anytime soon. On 17 September 2018 at 14:33, Christoph Hellwig wrote: > On Fri, Sep 14, 2018 at 01:48:59PM +0100, Will Deacon wrote: >> > As far as merging goes, I don't mind at all whether this goes via IOMMU, >> > or via dma-mapping provided Joerg's happy to ack it. >> >> I think it makes most sense for Joerg to take this series via his tree. > > FYI, I have WIP patches to move the arm dma-iommu wrappers to common > code: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-maybe-coherent > > which will require some synchronization of the involved trees. In the > end it is iommu code, so the actual iommu patches should probably > go into the iommu tree after all, but it might have to pull in the > dma-mapping branch for that. Or we just punt it until the next merge > window. Not sure how fast Tom needs the common dma-iommu code for > the x86 AMD iommu conversion. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: move swiotlb noncoherent dma support from arm64 to generic code
Hi Christoph, On 17/09/18 16:38, Christoph Hellwig wrote: Hi all, this series starts with various swiotlb cleanups, then adds support for non-cache coherent devices to the generic swiotlb support, and finally switches arm64 to use the generic code. I think there's going to be an issue with the embedded folks' grubby hack in arm64's mem_init() which skips initialising SWIOTLB at all with sufficiently little DRAM. I've been waiting for dma-direct-noncoherent-merge so that I could fix that case to swizzle in dma_direct_ops and avoid swiotlb_dma_ops entirely. Given that this series depends on patches in the dma-mapping tree, or pending for it I've also published a git tree here: git://git.infradead.org/users/hch/misc.git swiotlb-noncoherent However, upon sitting down to eagerly write that patch I've just boot-tested the above branch as-is for a baseline and discovered a rather more significant problem: arch_dma_alloc() is recursing back into __swiotlb_alloc() and blowing the stack. Not good :( Robin. ->8- [4.032760] Insufficient stack space to handle exception! [4.032765] ESR: 0x9647 -- DABT (current EL) [4.042666] FAR: 0x0a937fb0 [4.046113] Task stack: [0x0a938000..0x0a93c000] [4.052399] IRQ stack: [0x08008000..0x0800c000] [4.058684] Overflow stack: [0x80097ff4b290..0x80097ff4c290] [4.064972] CPU: 1 PID: 130 Comm: kworker/1:1 Not tainted 4.19.0-rc2+ #681 [4.071775] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Jul 10 2018 [4.082456] Workqueue: events deferred_probe_work_func [4.087542] pstate: 0005 (nzcv daif -PAN -UAO) [4.092283] pc : arch_dma_alloc+0x0/0x198 [4.096250] lr : dma_direct_alloc+0x20/0x28 [4.100385] sp : 0a938010 [4.103660] x29: 0a938010 x28: 800974e15238 [4.108918] x27: 08bf30d8 x26: 80097543d400 [4.114176] x25: 0300 x24: 80097543d400 [4.119434] x23: 800974e15238 x22: 1000 [4.124691] x21: 80097543d400 x20: 1000 [4.129948] x19: 0300 x18: [4.135206] x17: x16: 08bf1b58 [4.140463] x15: 091eb688 x14: 8a93ba1d [4.145720] x13: ff00 x12: [4.150977] x11: 0001 x10: ff7f7fff7fff [4.156235] x9 : x8 : [4.161492] x7 : 800974df9810 x6 : [4.166749] x5 : x4 : 0300 [4.172006] x3 : 006002c0 x2 : 800974e15238 [4.177263] x1 : 1000 x0 : 80097543d400 [4.182521] Kernel panic - not syncing: kernel stack overflow [4.188207] CPU: 1 PID: 130 Comm: kworker/1:1 Not tainted 4.19.0-rc2+ #681 [4.195008] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Jul 10 2018 [4.205681] Workqueue: events deferred_probe_work_func [4.210765] Call trace: [4.213183] dump_backtrace+0x0/0x1f0 [4.216805] show_stack+0x14/0x20 [4.220084] dump_stack+0x9c/0xbc [4.223362] panic+0x138/0x294 [4.226381] __stack_chk_fail+0x0/0x18 [4.230088] handle_bad_stack+0x11c/0x130 [4.234052] __bad_stack+0x88/0x8c [4.237415] arch_dma_alloc+0x0/0x198 [4.241036] __swiotlb_alloc+0x3c/0x178 [4.244828] arch_dma_alloc+0xd0/0x198 [4.248534] dma_direct_alloc+0x20/0x28 [4.252327] __swiotlb_alloc+0x3c/0x178 [4.256119] arch_dma_alloc+0xd0/0x198 [4.259825] dma_direct_alloc+0x20/0x28 [4.263617] __swiotlb_alloc+0x3c/0x178 [4.267409] arch_dma_alloc+0xd0/0x198 [4.271115] dma_direct_alloc+0x20/0x28 [4.274907] __swiotlb_alloc+0x3c/0x178 [4.278700] arch_dma_alloc+0xd0/0x198 [4.282405] dma_direct_alloc+0x20/0x28 [4.286198] __swiotlb_alloc+0x3c/0x178 [4.289990] arch_dma_alloc+0xd0/0x198 [4.293696] dma_direct_alloc+0x20/0x28 [4.297489] __swiotlb_alloc+0x3c/0x178 [4.301281] arch_dma_alloc+0xd0/0x198 [4.304987] dma_direct_alloc+0x20/0x28 [4.308779] __swiotlb_alloc+0x3c/0x178 
[4.312571] arch_dma_alloc+0xd0/0x198 [4.316277] dma_direct_alloc+0x20/0x28 [4.320069] __swiotlb_alloc+0x3c/0x178 [4.323861] arch_dma_alloc+0xd0/0x198 [4.327567] dma_direct_alloc+0x20/0x28 [4.331359] __swiotlb_alloc+0x3c/0x178 [4.335151] arch_dma_alloc+0xd0/0x198 [4.338857] dma_direct_alloc+0x20/0x28 [4.342650] __swiotlb_alloc+0x3c/0x178 [4.346442] arch_dma_alloc+0xd0/0x198 [4.350148] dma_direct_alloc+0x20/0x28 [4.353940] __swiotlb_alloc+0x3c/0x178 [4.357732] arch_dma_alloc+0xd0/0x198 [4.361438] dma_direct_alloc+0x20/0x28 [4.365230] __swiotlb_alloc+0x3c/0x178 [4.369022] arch_dma_alloc+0xd0/0x198 [4.372728] dma_direct_alloc+0x20/0x28 [4.376520] __swiotlb_alloc+0x3c/0x178 [
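For readers without that branch to hand, the loop in the backtrace boils down to the pattern below (function names kept, signatures and bodies heavily simplified; this is a model of the call cycle, not the actual code): arch_dma_alloc() asks swiotlb for the memory, __swiotlb_alloc() falls back to the direct allocator, and dma_direct_alloc() dispatches straight back to the arch hook.

#include <stddef.h>

struct device;

/* Simplified model of the recursion; real signatures and logic differ. */
void *arch_dma_alloc(struct device *dev, size_t size);
void *__swiotlb_alloc(struct device *dev, size_t size);
void *dma_direct_alloc(struct device *dev, size_t size);

void *arch_dma_alloc(struct device *dev, size_t size)
{
	return __swiotlb_alloc(dev, size);	/* arm64 noncoherent path hands off to swiotlb */
}

void *__swiotlb_alloc(struct device *dev, size_t size)
{
	return dma_direct_alloc(dev, size);	/* swiotlb falls back to the direct allocator */
}

void *dma_direct_alloc(struct device *dev, size_t size)
{
	return arch_dma_alloc(dev, size);	/* ...which dispatches back to the arch hook */
}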
Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote: > On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote: > > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote: > > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote: > > > > So i want to summarize issues i have as this threads have dig deep into > > > > details. For this i would like to differentiate two cases first the easy > > > > one when relying on SVA/SVM. Then the second one when there is no > > > > SVA/SVM. > > > > > > Thank you very much for the summary. > > > > > > > In both cases your objectives as i understand them: > > > > > > > > [R1]- expose a common user space API that make it easy to share boiler > > > > plate code accross many devices (discovering devices, opening > > > > device, creating context, creating command queue ...). > > > > [R2]- try to share the device as much as possible up to device limits > > > > (number of independant queues the device has) > > > > [R3]- minimize syscall by allowing user space to directly schedule on > > > > the > > > > device queue without a round trip to the kernel > > > > > > > > I don't think i missed any. > > > > > > > > > > > > (1) Device with SVA/SVM > > > > > > > > For that case it is easy, you do not need to be in VFIO or part of any > > > > thing specific in the kernel. There is no security risk (modulo bug in > > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process > > > > to a device is just couple dozen lines of code. > > > > > > > > > > This is right...logically. But the kernel has no clear definition about > > > "Device > > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one > > > of the > > > boiler plate. > > > > > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the > > > only > > > one. If we add that support within VFIO, which solve most of the problem > > > of > > > SVA/SVM, it will save a lot of work in the future. > > > > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user > > all do the SVA/SVM setup in couple dozen lines and i failed to see how it > > would require any more than that in your case. > > > > > > > I think this is the key confliction between us. So could Alex please say > > > something here? If the VFIO is going to take this into its scope, we can > > > try > > > together to solve all the problem on the way. If it it is not, it is also > > > simple, we can just go to another way to fulfill this part of > > > requirements even > > > we have to duplicate most of the code. > > > > > > Another point I need to emphasis here: because we have to replace the > > > hardware > > > queue when fork, so it won't be very simple even in SVA/SVM case. > > > > I am assuming hardware queue can only be setup by the kernel and thus > > you are totaly safe forkwise as the queue is setup against a PASID and > > the child does not bind to any PASID and you use VM_DONTCOPY on the > > mmap of the hardware MMIO queue because you should really use that flag > > for that. > > > > > > > > (2) Device does not have SVA/SVM (or it is disabled) > > > > > > > > You want to still allow device to be part of your framework. However > > > > here i see fundamentals securities issues and you move the burden of > > > > being careful to user space which i think is a bad idea. We should > > > > never trus the userspace from kernel space. 
> > > > > > > > To keep the same API for the user space code you want a 1:1 mapping > > > > between device physical address and process virtual address (ie if > > > > device access device physical address A it is accessing the same > > > > memory as what is backing the virtual address A in the process. > > > > > > > > Security issues are on two things: > > > > [I1]- fork/exec, a process who opened any such device and created an > > > > active queue can transfer without its knowledge control of its > > > > commands queue through COW. The parent map some anonymous region > > > > to the device as a command queue buffer but because of COW the > > > > parent can be the first to copy on write and thus the child can > > > > inherit the original pages that are mapped to the hardware. > > > > Here parent lose control and child gain it. > > > > > > This is indeed an issue. But it remains an issue only if you continue to > > > use the > > > queue and the memory after fork. We can use at_fork kinds of gadget to > > > fix it in > > > user space. > > > > Trusting user space is a no go from my point of view. > > Can we dive deeper on this? Maybe we have different understanding on "Trusting > user space". As my understanding, "trusting user space" means "no matter what > the user process does, it should only hurt itself and anything give to it, no > the kernel and the other process". > > In our case, we create a channel between a process and the hardware. The > proce
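For reference, the VM_DONTCOPY suggestion above would look roughly like this in the device's mmap() handler for the hardware queue; the lookup helper and the flags beyond VM_DONTCOPY are illustrative.

/* Sketch: mmap handler for the MMIO queue region; helper names are made up. */
static int my_queue_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long pfn = my_queue_mmio_pfn(file);	/* placeholder lookup */
	unsigned long size = vma->vm_end - vma->vm_start;

	/* MMIO, never copied into children on fork, never expanded */
	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND;

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}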
Re: Explicit IOVA management from a PCIe endpoint driver
Hi Stephen, On 17/09/18 22:36, Stephen Warren wrote: Joerg, Christoph, Marek, Robin, I believe that the driver for our PCIe endpoint controller hardware will need to explicitly manage its IOVA space more than current APIs allow. I'd like to discuss how to make that possible. First some background on our hardware: NVIDIA's Xavier SoC contains a Synopsis Designware PCIe controller. This can operate in either root port or endpoint mode. I'm particularly interested in endpoint mode. Our particular instantiation of this controller exposes a single function with a single software-controlled PCIe BAR to the PCIe bus (there are also BARs for access to DMA controller registers and outbound MSI configuration, which can both be enabled/disabled but not used for any other purpose). When a transaction is received from the PCIe bus, the following happens: 1) Transaction is matched against the BAR base/size (in PCIe address space) to determine whether it "hits" this BAR or not. 2) The transaction's address is processed by the PCIe controller's ATU (Address Translation Unit), which can re-write the address that the transaction accesses. Our particular instantiation of the hardware only has 2 entries in the ATU mapping table, which gives very little flexibility in setting up a mapping. As an FYI, ATU entries can match PCIe transactions either: a) Any transaction received on a particular BAR. b) Any transaction received within a single contiguous window of PCIe address space. This kind of mapping entry obviously has to be set up after device enumeration is complete so that it can match the correct PCIe address. Each ATU entry maps a single contiguous set of PCIe addresses to a single contiguous set of IOVAs which are passed to the IOMMU. Transactions can pass through the ATU without being translated if desired. 3) The transaction is passed to the IOMMU, which can again re-write the address that the transaction accesses. 4) The transaction is passed to the memory controller and reads/writes DRAM. In general, we want to be able to expose a large and dynamic set of data buffers to the PCIe bus; certainly /far/ more than two separate buffers (the number of ATU table entries). With current Linux APIs, these buffers will not be located in contiguous or adjacent physical (DRAM) or virtual (IOVA) addresses, nor in any particular window of physical or IOVA addresses. However, the ATU's mapping from PCIe to IOVA can only expose one or two contiguous ranges of IOVA space. These two sets of requirements are at odds! So, I'd like to propose some new APIs that the PCIe endpoint driver can use: 1) Allocate/reserve an IOVA range of specified size, but don't map anything into the IOVA range. 2) De-allocate the IOVA range allocated in (1). 3) Map a specific set (scatter-gather list I suppose) of already-allocated/extant physical addresses into part of an IOVA range allocated in (1). 4) Unmap a portion of an IOVA range that was mapped by (3). That all sounds perfectly reasonable - basically it sounds like the endpoint framework wants the option to do the same as VFIO or many DRM drivers, i.e. set up its own IOMMU domain, attach the endpoint's group, and explicitly manage its mappings via IOMMU API calls. 
Provided you can assume cache-coherent PCI, that should be enough to get things going - supporting non-coherent endpoints is a little trickier in terms of making sure the endpoint controller and/or device gets the right DMA ops to only ever perform cache maintenance once you add streaming DMA mappings into the mix, but that's not insurmountable (and I think it's something we still need to address for DRM anyway, at least on arm64). One final note: The memory controller can translate accesses to a small region of DRAM address space into accesses to an interrupt generation module. This allows devices attached to the PCIe bus to generate interrupts to software running on the system with the PCIe endpoint controller. Thus I deliberately described API 3 above as mapping a specific physical address into IOVA space, as opposed to mapping an existing DRAM allocation into IOVA space, in order to allow mapping this interrupt generation address space into IOVA space. If we needed separate APIs to map physical addresses vs. DRAM allocations into IOVA space, that would likely be fine too. If that's the standard DesignWare MSI dingaling, then all you should need to do is ensure your IOVA is reserved in your allocator (if it can be entirely outside the EP BAR, even better) - AFAIK the writes get completely intercepted such that they never go out to the SMMU side at all, and thus no actual mapping is even needed. Does this API proposal sound reasonable? Indeed, as I say apart from using streaming DMA for coherency management (which I think could be added in pretty much orthogonally later), this sounds like something you could plumb into the endpoint framework
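In other words, the endpoint driver would own an unmanaged domain and place buffers at IOVAs of its choosing behind the ATU window, roughly as below (error handling and the group-vs-device attach choice simplified; only long-standing IOMMU API calls are used).

/* Sketch of the "own domain" approach; simplified error handling. */
static struct iommu_domain *ep_take_ownership(struct device *dev)
{
	struct iommu_domain *domain = iommu_domain_alloc(dev->bus);

	if (!domain)
		return NULL;

	if (iommu_attach_device(domain, dev)) {	/* or attach the whole group */
		iommu_domain_free(domain);
		return NULL;
	}
	return domain;
}

/* Place a buffer at a chosen IOVA inside the range the ATU exposes. */
static int ep_map_buffer(struct iommu_domain *domain, unsigned long iova,
			 phys_addr_t phys, size_t size)
{
	return iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
}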
[PATCH] iommu/amd: return devid as alias for ACPI HID devices
ACPI HID devices do not actually have an alias for them in the IVRS. But dev_data->alias is still used for indexing into the IOMMU device table for devices being handled by the IOMMU. So for ACPI HID devices, we simply return the corresponding devid as an alias, as parsed from IVRS table. Signed-off-by: Arindam Nath --- drivers/iommu/amd_iommu.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 4e04fff23977..73e47d93e7a0 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -246,7 +246,13 @@ static u16 get_alias(struct device *dev) /* The callers make sure that get_device_id() does not fail here */ devid = get_device_id(dev); + + /* For ACPI HID devices, we simply return the devid as such */ + if (!dev_is_pci(dev)) + return devid; + ivrs_alias = amd_iommu_alias_table[devid]; + pci_for_each_dma_alias(pdev, __last_alias, &pci_alias); if (ivrs_alias == pci_alias) -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] kernel/dma: Fix panic caused by passing cma to command line
Hi On 2018-09-17 05:24, zhe...@windriver.com wrote: > From: He Zhe > > early_cma does not check input argument before passing it to > simple_strtoull. The argument would be a NULL pointer if "cma", without > its value, is set in command line and thus causes the following panic. > > PANIC: early exception 0xe3 IP 10:a3e9db8d error 0 cr2 0x0 > [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted > 4.19.0-rc3-yocto-standard+ #7 > [0.00] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70 > ... > [0.00] Call Trace: > [0.00] simple_strtoull+0x29/0x70 > [0.00] memparse+0x26/0x90 > [0.00] early_cma+0x17/0x6a > [0.00] do_early_param+0x57/0x8e > [0.00] parse_args+0x208/0x320 > [0.00] ? rdinit_setup+0x30/0x30 > [0.00] parse_early_options+0x29/0x2d > [0.00] ? rdinit_setup+0x30/0x30 > [0.00] parse_early_param+0x36/0x4d > [0.00] setup_arch+0x336/0x99e > [0.00] start_kernel+0x6f/0x4e6 > [0.00] x86_64_start_reservations+0x24/0x26 > [0.00] x86_64_start_kernel+0x6f/0x72 > [0.00] secondary_startup_64+0xa4/0xb0 > > This patch adds a check to prevent the panic. > > Signed-off-by: He Zhe > Cc: sta...@vger.kernel.org > Cc: h...@lst.de > Cc: m.szyprow...@samsung.com > Cc: robin.mur...@arm.com Thanks for the fix. Reviewed-by: Marek Szyprowski > --- > kernel/dma/contiguous.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > index 286d823..b2a8790 100644 > --- a/kernel/dma/contiguous.c > +++ b/kernel/dma/contiguous.c > @@ -49,7 +49,11 @@ static phys_addr_t limit_cmdline; > > static int __init early_cma(char *p) > { > - pr_debug("%s(%s)\n", __func__, p); > + if (!p) { > + pr_err("Config string not provided\n"); > + return -EINVAL; > + } > + > size_cmdline = memparse(p, &p); > if (*p != '@') > return 0; Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
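For completeness, an illustrative example of the distinction (the sizes are arbitrary): booting with "cma=64M" or "cma=64M@512M" parses as before, whereas a bare "cma" with no value used to reach memparse() with a NULL pointer and trigger the panic above; with this patch that case is rejected with the pr_err() message and -EINVAL instead.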