Re: [PATCH v15 07/12] iommu/smmuv3: Implement cache_invalidate
On 2021/4/11 19:12, Eric Auger wrote:
Implement domain-selective, pasid-selective and page-selective IOTLB invalidations.

Signed-off-by: Eric Auger
---
v14 -> v15:
- remove the redundant arm_smmu_cmdq_issue_sync(smmu) in IOMMU_INV_GRANU_ADDR case (Zenghui)
- if RIL is not supported by the host, make sure the granule_size that is passed by the userspace is supported or fix it (Chenxiang)
v13 -> v14:
- Add domain invalidation
- do global inval when asid is not provided with addr granularity
v7 -> v8:
- ASID based invalidation using iommu_inv_pasid_info
- check ARCHID/PASID flags in addr based invalidation
- use __arm_smmu_tlb_inv_context and __arm_smmu_tlb_inv_range_nosync
v6 -> v7:
- check the uapi version
v3 -> v4:
- adapt to changes in the uapi
- add support for leaf parameter
- do not use arm_smmu_tlb_inv_range_nosync or arm_smmu_tlb_inv_context anymore
v2 -> v3:
- replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync
v1 -> v2:
- properly pass the asid
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +
 1 file changed, 89 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 56a301fbe75a..bfc112cc0d38 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2961,6 +2961,94 @@ static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
 	mutex_unlock(&smmu_domain->init_mutex);
 }
 
+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+			  struct iommu_cache_invalidate_info *inv_info)
+{
+	struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL};
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		return -EINVAL;
+
+	if (!smmu)
+		return -EINVAL;
+
+	if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+		return -EINVAL;
+
+	if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID ||
+	    inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) {
+		return -ENOENT;
+	}
+
+	if (!(inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB))
+		return -EINVAL;
+
+	/* IOTLB invalidation */
+
+	switch (inv_info->granularity) {
+	case IOMMU_INV_GRANU_PASID:
+	{
+		struct iommu_inv_pasid_info *info =
+			&inv_info->granu.pasid_info;
+
+		if (info->flags & IOMMU_INV_PASID_FLAGS_PASID)
+			return -ENOENT;
+		if (!(info->flags & IOMMU_INV_PASID_FLAGS_ARCHID))
+			return -EINVAL;
+
+		__arm_smmu_tlb_inv_context(smmu_domain, info->archid);
+		return 0;
+	}
+	case IOMMU_INV_GRANU_ADDR:
+	{
+		struct iommu_inv_addr_info *info = &inv_info->granu.addr_info;
+		size_t granule_size = info->granule_size;
+		size_t size = info->nb_granules * info->granule_size;
+		bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+		int tg;
+
+		if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+			return -ENOENT;
+
+		if (!(info->flags & IOMMU_INV_ADDR_FLAGS_ARCHID))
+			break;
+
+		tg = __ffs(granule_size);
+		if (granule_size & ~(1 << tg))
+			return -EINVAL;

This check confirms that granule_size is a power of 2. Does granule_size
have to be a power of 2? I think this case should also be handled
correctly even when granule_size is not a power of 2.

+		/*
+		 * When RIL is not supported, make sure the granule size that is
+		 * passed is supported. In RIL mode, this is enforced in
+		 * __arm_smmu_tlb_inv_range()
+		 */
+		if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+		    !(granule_size & smmu_domain->domain.pgsize_bitmap)) {
+			tg = __ffs(smmu_domain->domain.pgsize_bitmap);
+			granule_size = 1 << tg;
+			size = size >> tg;

Why does size need to be shifted tg bits to the right?

Thanks,
Kunkun Jiang

+		}
+
+		arm_smmu_tlb_inv_range_domain(info->addr, size,
+					      granule_size, leaf,
+					      info->archid, smmu_domain);
+		return 0;
+	}
+	case IOMMU_INV_GRANU_DOMAIN:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Global S1 invalidation */
+	cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
+	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+	arm_smmu_cmdq_issue_sync(smmu);
+	return 0;
+}
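Kunkun's first question concerns what the `__ffs()`-based test actually computes. In userspace terms it is a power-of-two check: clear the lowest set bit and see whether anything remains. The sketch below is a standalone toy, with `ffs_ul` standing in for the kernel's `__ffs()`; it is not the driver code itself.

```c
#include <stddef.h>

/* Stand-in for the kernel's __ffs(): index of the lowest set bit.
 * The argument must be nonzero. */
static int ffs_ul(unsigned long x)
{
    int i = 0;

    while (!(x & 1UL)) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Mirrors the patch's check:
 *   tg = __ffs(granule_size);
 *   if (granule_size & ~(1 << tg)) return -EINVAL;
 * which rejects granule_size unless exactly one bit is set,
 * i.e. unless it is a power of two. */
static int granule_is_power_of_two(size_t granule_size)
{
    int tg;

    if (!granule_size)
        return 0;
    tg = ffs_ul(granule_size);
    return !(granule_size & ~(1UL << tg));
}
```

The same predicate is often written as `!(x & (x - 1))`; the `__ffs()` form additionally yields `tg`, the log2 of the granule, which the driver reuses to pick a fallback granule from `pgsize_bitmap`.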
Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework
On 2021/5/13 20:02, Lu Baolu wrote: > On 5/13/21 6:58 PM, Keqian Zhu wrote: >> >> >> On 2021/5/12 19:36, Lu Baolu wrote: >>> Hi keqian, >>> >>> On 5/12/21 4:44 PM, Keqian Zhu wrote: On 2021/5/12 11:20, Lu Baolu wrote: > On 5/11/21 3:40 PM, Keqian Zhu wrote: >>> For upper layers, before starting page tracking, they check the >>> dirty_page_trackable attribution of the domain and start it only it's >>> capable. Once the page tracking is switched on the vendor iommu driver >>> (or iommu core) should block further device attach/detach operations >>> until page tracking is stopped. >> But when a domain becomes capable after detaching a device, the upper >> layer >> still needs to query it and enable dirty log for it... >> >> To make things coordinated, maybe the upper layer can register a >> notifier, >> when the domain's capability change, the upper layer do not need to >> query, instead >> they just need to realize a callback, and do their specific policy in >> the callback. >> What do you think? >> > > That might be an option. But why not checking domain's attribution every > time a new tracking period is about to start? Hi Baolu, I'll add an attribution in iommu_domain, and the vendor iommu driver will update the attribution when attach/detach devices. The attribute should be protected by a lock, so the upper layer shouldn't access the attribute directly. Then the iommu_domain_support_dirty_log() still should be retained. Does this design looks good to you? >>> >>> Yes, that's what I was thinking of. But I am not sure whether it worth >>> of a lock here. It seems not to be a valid behavior for upper layer to >>> attach or detach any device while doing the dirty page tracking. >> Hi Baolu, >> >> Right, if the "detach|attach" interfaces and "dirty tracking" interfaces can >> be called concurrently, >> a lock in iommu_domain_support_dirty_log() is still not enough. I will add >> another note for the dirty >> tracking interfaces. >> >> Do you have other suggestions? 
>> I will accelerate the progress, so I plan to send out v5 next week.
>
> No further comments except the nit below:
>
> "iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking"
>
> How about splitting it into
> - iommu_start_dirty_log()
> - iommu_stop_dirty_log()

Yeah, actually this is my original version, and the "switch" style was suggested by Yi Sun. Anyway, I think both are OK, and the "switch" style can reduce some code.

Thanks,
Keqian

> Not a strong opinion anyway.
>
> Best regards,
> baolu

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
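The two API shapes being discussed (one "switch" entry point versus separate start/stop helpers) differ only at the surface: the split style is a thin wrapper over the same state machine. A toy model, with made-up names and a made-up `-1` for "already in that state":

```c
#include <stdbool.h>

/* Toy stand-in for an iommu_domain's dirty-log state. */
struct toy_domain {
    bool dirty_log_enabled;
};

/* The "switch" style: one entry point, an enable flag. */
static int toy_switch_dirty_log(struct toy_domain *d, bool enable)
{
    if (d->dirty_log_enabled == enable)
        return -1;              /* already in the requested state */
    d->dirty_log_enabled = enable;
    return 0;
}

/* The split style proposed in the review is just two wrappers,
 * which is why the "switch" style "can reduce some code". */
static int toy_start_dirty_log(struct toy_domain *d)
{
    return toy_switch_dirty_log(d, true);
}

static int toy_stop_dirty_log(struct toy_domain *d)
{
    return toy_switch_dirty_log(d, false);
}
```

The trade-off is purely one of call-site readability (`start`/`stop` read better) against a single implementation path; both enforce the same no-double-start invariant.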
Re: [RFC PATCH v5 5/7] iommu/vt-d: Fixup delivery mode of the HPET hardlockup interrupt
On Wed, May 05, 2021 at 01:03:18AM +0200, Thomas Gleixner wrote: > On Tue, May 04 2021 at 12:10, Ricardo Neri wrote: Thank you very much for your feedback, Thomas. I am sorry it took me a while to reply to your email. I needed to digest and research your comments. > > In x86 there is not an IRQF_NMI flag that can be used to indicate the > > There exists no IRQF_NMI flag at all. No architecture provides that. Thank you for the clarification. I think I meant to say that there is a request_nmi() function but AFAIK it is only used in the ARM PMU and would not work on x86. > > > delivery mode when requesting an interrupt (via request_irq()). Thus, > > there is no way for the interrupt remapping driver to know and set > > the delivery mode. > > There is no support for this today. So what? Using request_irq() plus a HPET quirk looked to me a reasonable way to use the irqdomain hierarchy to allocate an interrupt with NMI as the delivery mode. > > > Hence, when allocating an interrupt, check if such interrupt belongs to > > the HPET hardlockup detector and fixup the delivery mode accordingly. > > What? > > > + /* > > +* If we find the HPET hardlockup detector irq, fixup the > > +* delivery mode. > > +*/ > > + if (is_hpet_irq_hardlockup_detector(info)) > > + irq_cfg->delivery_mode = APIC_DELIVERY_MODE_NMI; > > Again. We are not sticking some random device checks into that > code. It's wrong and I explained it to you before. > > > https://lore.kernel.org/lkml/alpine.deb.2.21.1906161042080.1...@nanos.tec.linutronix.de/ > > But I'm happy to repeat it again: > > "No. This is horrible hackery violating all the layering which we carefully >put into place to avoid exactly this kind of sprinkling conditionals into >all code pathes. > >With some thought the existing irqdomain hierarchy can be used to achieve >the same thing without tons of extra functions and conditionals." 
> > So the outcome of thought and using the irqdomain hierarchy is: > >Replacing an hpet specific conditional in one place with an hpet >specific conditional in a different place. > > Impressive. I am sorry Thomas, I did try to make the quirk less hacky but I did not think of the solution you provide below. > > hpet_assign_irq(, bool nmi) > init_info(info) > ... > if (nmi) > info.flags |= X86_IRQ_ALLOC_AS_NMI; > >irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, ) > intel_irq_remapping_alloc(..., info) >irq_domain_alloc_irq_parents(..., info) > x86_vector_alloc_irqs(..., info) > { >if (info->flags & X86_IRQ_ALLOC_AS_NMI && nr_irqs != 1) > return -EINVAL; > >for (i = 0; i < nr_irqs; i++) { > > if (info->flags & X86_IRQ_ALLOC_AS_NMI) { > irq_cfg_setup_nmi(apicd); > continue; > } > ... > } > > irq_cfg_setup_nmi() sets irq_cfg->delivery_mode and whatever is required > and everything else just works. Of course this needs a few other minor > tweaks but none of those introduces random hpet quirks all over the > place. Not convoluted enough, right? Thanks for the detailed demonstration! It does seem cleaner than what I implemented. > > But that solves none of other problems. Let me summarize again which > options or non-options we have: > > 1) Selective IPIs from NMI context cannot work > >As explained in the other thread. > > 2) Shorthand IPI allbutself from NMI > >This should work, but that obviously does not take the watchdog >cpumask into account. > >Also this only works when IPI shorthand mode is enabled. See >apic_smt_update() for details. > > 3) Sending the IPIs from irq_work > >This would solve the problem, but if the CPU which is the NMI >target is really stuck in an interrupt disabled region then the >IPIs won't be sent. 
> >OTOH, if that's the case then the CPU which was processing the >NMI will continue to be stuck until the next NMI hits which >will detect that the CPU is stuck which is a good enough >reason to send a shorthand IPI to all CPUs ignoring the >watchdog cpumask. > >Same limitation vs. shorthand mode as #2 > > 4) Changing affinity of the HPET NMI from NMI > >As we established two years ago that cannot work with interrupt >remapping > > 5) Changing affinity of the HPET NMI from irq_work > >Same issues as #3 > > Anything else than #2 is just causing more problems than it solves, but > surely the NOHZ_FULL/isolation people might have opinions on this. > > OTOH, as this is opt-in, anything which wants a watchdog mask which is > not the full online set, has to accept that HPET has these restrictions. > > And that's exactly what I suggested two
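Thomas's sketch pushes the NMI request down the irqdomain hierarchy as an allocation flag that is validated once in the vector allocator, instead of an HPET-specific conditional in the remapping driver. The toy below models only that control flow; the `TOY_*` names are illustrative stand-ins for the proposed `X86_IRQ_ALLOC_AS_NMI` flag and `irq_cfg` fields, not real kernel API.

```c
/* Illustrative allocation flag, modelled on the sketched
 * X86_IRQ_ALLOC_AS_NMI. */
#define TOY_IRQ_ALLOC_AS_NMI 0x1

enum toy_delivery { TOY_DELIVERY_FIXED, TOY_DELIVERY_NMI };

struct toy_irq_cfg {
    enum toy_delivery delivery_mode;
};

/* Toy stand-in for x86_vector_alloc_irqs(): the NMI decision is made
 * here, once, based on a generic flag -- no device-specific check. */
static int toy_vector_alloc_irqs(struct toy_irq_cfg *cfgs, int nr_irqs,
                                 unsigned int flags)
{
    int i;

    /* NMI delivery only makes sense for a single vector. */
    if ((flags & TOY_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
        return -22;             /* -EINVAL */

    for (i = 0; i < nr_irqs; i++)
        cfgs[i].delivery_mode = (flags & TOY_IRQ_ALLOC_AS_NMI) ?
                                TOY_DELIVERY_NMI : TOY_DELIVERY_FIXED;
    return 0;
}
```

The point of the layering is visible even in the toy: the caller (HPET code) only sets a flag; every layer below is device-agnostic.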
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Jason, On Thu, 13 May 2021 19:31:22 -0300, Jason Gunthorpe wrote: > On Thu, May 13, 2021 at 01:22:51PM -0700, Jacob Pan wrote: > > Hi Tony, > > > > On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" > > wrote: > > > > > On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote: > > > > It seems there are two options: > > > > 1. Add a new IOMMU API to set up a system PASID with a *separate* > > > > IOMMU page table/domain, mark the device is PASID only with a flag. > > > > Use DMA APIs to explicit map/unmap. Based on this PASID-only flag, > > > > Vendor IOMMU driver will decide whether to use system PASID domain > > > > during map/unmap. Not clear if we also need to make IOVA==kernel VA. > > > > > > > > 2. Add a new IOMMU API to setup a system PASID which points to > > > > init_mm.pgd. This API only allows trusted device to bind with the > > > > system PASID at its own risk. There is no need for DMA API. This is > > > > the same as the current code except with an explicit API. > > > > > > > > Which option? > > > > > > Option #1 looks cleaner to me. Option #2 gives access to bits > > > of memory that the users of system PASID shouldn't ever need > > > to touch ... just map regions of memory that the kernel has > > > a "struct page" for. > > > > > > What does "use DMA APIs to explicitly map/unmap" mean? Is that > > > for the whole region? > > > > > If we map the entire kernel direct map during system PASID setup, then > > we don't need to use DMA API to map/unmap certain range. > > > > I was thinking this system PASID page table could be on-demand. The > > mapping is built by explicit use of DMA map/unmap APIs. > > Option 1 should be the PASID works exactly like a normal RID and uses > all the normal DMA APIs and IOMMU mechanisms, whatever the platform > implements. This might mean an iommu update on every operation or not. 
>>> I'm expecting that once this system PASID has been initialized,
>>> then any accelerator device with a kernel use case would use the
>>> same PASID. I.e. DSA for page clearing, IAX for ZSwap compression
>>> & decompression, etc.
>>
>> OK, sounds like we have to map the entire kernel VA with struct page as
>> you said. So we still by-pass DMA APIs, can we all agree on that?
>
> Option 2 should be the faster option, but not available in all cases.
>
> Option 1 isn't optional. DMA and IOMMU code has to be portable and
> this is the portable API.
>
> If you want to do option 1 and option 2 then give it a go, but in most
> common cases with the IOMMU in a direct map you shouldn't get a
> notable performance win.

Looks like we are converging. Let me summarize the takeaways:

1. Remove IOMMU_SVA_BIND_SUPERVISOR flag from this patch; in fact there
   will be no flags at all for iommu_sva_bind_device()
2. Remove all supervisor SVA related vt-d, idxd code.
3. Create API iommu_setup_system_pasid_direct_map(option_flag)
	if (option_flag == 1)
		iommu_domain_alloc(IOMMU_DOMAIN_DMA);
	if (option_flag == 2)
		iommu_domain_alloc(IOMMU_DOMAIN_DIRECT); /* new domain type? */
	setup IOMMU page tables mirroring the direct map
4. Create API iommu_enable_dev_direct_map(struct dev, , )
   - Drivers call this API to get the system PASID and which option is
     available on the system PASID
   - mark device as PASID only, perhaps a new flag in struct
     device->dev_iommu->pasid_only = 1
5. DMA API IOMMU vendor ops will take action based on the pasid_only
   flag to decide if the mapping is for system PASID page tables.

Does it make sense?

> Jason

Thanks,

Jacob
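Takeaway 3 distinguishes the two system-PASID options by domain type and by whether explicit DMA API map/unmap is required. The toy below only encodes that decision table; every name in it (`toy_*`, `TOY_DOMAIN_*`) is illustrative, not the kernel API being proposed.

```c
/* Toy domain types modelling option 1 (normal DMA domain, explicit
 * map/unmap) and option 2 (mirror of the kernel direct map). */
enum toy_domain_type { TOY_DOMAIN_DMA, TOY_DOMAIN_DIRECT };

struct toy_pasid_setup {
    int pasid;
    enum toy_domain_type type;
    int needs_dma_api;          /* option 1 requires dma_map/unmap */
};

/* Toy version of the summarized iommu_setup_system_pasid_direct_map():
 * picks a domain type from the option flag. */
static int toy_setup_system_pasid(int option, struct toy_pasid_setup *out)
{
    switch (option) {
    case 1:
        out->type = TOY_DOMAIN_DMA;
        out->needs_dma_api = 1;
        break;
    case 2:
        out->type = TOY_DOMAIN_DIRECT;
        out->needs_dma_api = 0;
        break;
    default:
        return -22;             /* -EINVAL */
    }
    out->pasid = 1;             /* one fixed, globally shared kernel PASID */
    return 0;
}
```

This mirrors the thread's conclusion: option 1 is the portable path (always valid, goes through the DMA API), option 2 is the fast path that is only available when the platform allows mirroring the direct map.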
[PATCH v2 05/22] PCI/P2PDMA: Print a warning if the host bridge is not in the whitelist
If the host bridge is not in the whitelist print a warning in the calc_map_type_and_dist_warn() path detailing the vendor and device IDs that would need to be added to the whitelist. Suggested-by: Don Dutile Signed-off-by: Logan Gunthorpe --- drivers/pci/p2pdma.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 175da3a9c8fb..0e0b2218eacd 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -340,7 +340,7 @@ static struct pci_dev *pci_host_bridge_dev(struct pci_host_bridge *host) } static bool __host_bridge_whitelist(struct pci_host_bridge *host, - bool same_host_bridge) + bool same_host_bridge, bool warn) { struct pci_dev *root = pci_host_bridge_dev(host); const struct pci_p2pdma_whitelist_entry *entry; @@ -361,6 +361,10 @@ static bool __host_bridge_whitelist(struct pci_host_bridge *host, return true; } + if (warn) + pci_warn(root, "Host bridge not in P2PDMA whitelist: %04x:%04x\n", +vendor, device); + return false; } @@ -368,16 +372,17 @@ static bool __host_bridge_whitelist(struct pci_host_bridge *host, * If we can't find a common upstream bridge take a look at the root * complex and compare it to a whitelist of known good hardware. 
*/ -static bool host_bridge_whitelist(struct pci_dev *a, struct pci_dev *b) +static bool host_bridge_whitelist(struct pci_dev *a, struct pci_dev *b, + bool warn) { struct pci_host_bridge *host_a = pci_find_host_bridge(a->bus); struct pci_host_bridge *host_b = pci_find_host_bridge(b->bus); if (host_a == host_b) - return __host_bridge_whitelist(host_a, true); + return __host_bridge_whitelist(host_a, true, warn); - if (__host_bridge_whitelist(host_a, false) && - __host_bridge_whitelist(host_b, false)) + if (__host_bridge_whitelist(host_a, false, warn) && + __host_bridge_whitelist(host_b, false, warn)) return true; return false; @@ -513,7 +518,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client, if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) { if (!cpu_supports_p2pdma() && - !host_bridge_whitelist(provider, client)) + !host_bridge_whitelist(provider, client, acs_redirects)) map_type = PCI_P2PDMA_MAP_NOT_SUPPORTED; } -- 2.20.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
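The behavioral change in this patch is small: the whitelist lookup itself is unchanged, but a `warn` flag now controls whether a miss reports the vendor/device IDs that would need an entry. A userspace sketch of that shape (the table contents and names here are made up, not the kernel's `pci_p2pdma_whitelist`):

```c
#include <stdio.h>

struct toy_whitelist_entry {
    unsigned short vendor, device;
};

/* Example IDs only -- not the real whitelist. */
static const struct toy_whitelist_entry toy_whitelist[] = {
    { 0x8086, 0x2030 },
    { 0x1022, 0x1480 },
};

/* Lookup with the new 'warn' parameter: on a miss, optionally print the
 * IDs that would need to be added, as calc_map_type_and_dist_warn() does. */
static int toy_host_bridge_whitelisted(unsigned short vendor,
                                       unsigned short device, int warn)
{
    size_t i;

    for (i = 0; i < sizeof(toy_whitelist) / sizeof(toy_whitelist[0]); i++)
        if (toy_whitelist[i].vendor == vendor &&
            toy_whitelist[i].device == device)
            return 1;

    if (warn)
        fprintf(stderr,
                "Host bridge not in P2PDMA whitelist: %04x:%04x\n",
                vendor, device);
    return 0;
}
```

Threading the flag through (rather than always warning) matters because the lookup is also used on probe paths where a miss is expected and a warning would be noise.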
[PATCH v2 08/22] dma-mapping: Allow map_sg() ops to return negative error codes
Allow dma_map_sgtable() to pass errors from the map_sg() ops. This will be required for returning appropriate error codes when mapping P2PDMA memory. Introduce __dma_map_sg_attrs() which will return the raw error code from the map_sg operation (whether it be negative or zero). Then add a dma_map_sg_attrs() wrapper to convert any negative errors to zero to satisfy the existing calling convention. dma_map_sgtable() will convert a zero error return for old map_sg() ops into a -EINVAL return and return any negative errors as reported. This allows map_sg implementations to start returning multiple negative error codes. Legacy map_sg implementations can continue to return zero until they are all converted. Signed-off-by: Logan Gunthorpe --- include/linux/dma-map-ops.h | 8 ++-- include/linux/dma-mapping.h | 41 - kernel/dma/mapping.c| 13 +--- 3 files changed, 47 insertions(+), 15 deletions(-) diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d53a96a3d64..eaa969be8284 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -41,8 +41,12 @@ struct dma_map_ops { size_t size, enum dma_data_direction dir, unsigned long attrs); /* -* map_sg returns 0 on error and a value > 0 on success. -* It should never return a value < 0. +* map_sg should return a negative error code on error. +* dma_map_sgtable() will return the error code returned and convert +* a zero return (for legacy implementations) into -EINVAL. +* +* dma_map_sg() will always return zero on any negative or zero +* return to satisfy its own calling convention. 
*/ int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 183e7103a66d..2b0ecf0aa4df 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -105,7 +105,7 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, unsigned long attrs); void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs); -int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, +int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, @@ -164,7 +164,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { } -static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, +static inline int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs) { return 0; @@ -343,6 +343,34 @@ static inline void dma_sync_single_range_for_device(struct device *dev, return dma_sync_single_for_device(dev, addr + offset, size, dir); } +/** + * dma_map_sg_attrs - Map the given buffer for DMA + * @dev: The device for which to perform the DMA operation + * @sg:The sg_table object describing the buffer + * @dir: DMA direction + * @attrs: Optional DMA attributes for the map operation + * + * Maps a buffer described by a scatterlist passed in the sg argument with + * nents segments for the @dir DMA operation by the @dev device. + * + * Returns the number of mapped entries (which can be less than nents) + * on success. Zero is returned for any error. 
+ * + * dma_unmap_sg_attrs() should be used to unmap the buffer with the + * original sg and original nents (not the value returned by this funciton). + */ +static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir, unsigned long attrs) +{ + int ret; + + ret = __dma_map_sg_attrs(dev, sg, nents, dir, attrs); + if (ret < 0) + ret = 0; + + return ret; +} + /** * dma_map_sgtable - Map the given buffer for DMA * @dev: The device for which to perform the DMA operation @@ -357,16 +385,19 @@ static inline void dma_sync_single_range_for_device(struct device *dev, * ownership of the buffer back to the CPU domain before touching the * buffer by the CPU. * - * Returns 0 on success or -EINVAL on error during mapping the buffer. + * Returns 0 on success or a negative error code on error */ static inline int dma_map_sgtable(struct device *dev, struct sg_table *sgt, enum dma_data_direction dir, unsigned long attrs) { int nents;
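The calling-convention split this patch introduces can be summarized in three tiny functions: the raw op result may be a negative errno, a positive count, or a legacy `0`-on-error; `dma_map_sg_attrs()` clamps negatives to `0` for old callers; `dma_map_sgtable()` converts a legacy `0` into `-EINVAL` and passes real errnos through. A userspace sketch (the `toy_*` names and the integer "op result" stand in for the real ops):

```c
#define TOY_EINVAL 22

/* Stand-in for calling ops->map_sg(): returns whatever the op returned. */
static int toy_raw_map_sg(int op_result)
{
    return op_result;
}

/* Legacy convention (dma_map_sg_attrs): never negative; any error is 0. */
static int toy_dma_map_sg_attrs(int op_result)
{
    int ret = toy_raw_map_sg(op_result);

    return ret < 0 ? 0 : ret;
}

/* New convention (dma_map_sgtable): 0 on success, negative errno on error;
 * a legacy op returning 0 is translated to -EINVAL. */
static int toy_dma_map_sgtable(int op_result)
{
    int nents = toy_raw_map_sg(op_result);

    if (nents == 0)
        return -TOY_EINVAL;     /* legacy op signalled failure with 0 */
    return nents < 0 ? nents : 0;
}
```

This is why legacy `map_sg` implementations can keep returning zero during the transition: both wrappers still map that onto their respective error conventions.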
[PATCH v2 07/22] PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagemap and device
All callers of pci_p2pdma_map_type() have a struct dev_pgmap and a struct device (of the client doing the DMA transfer). Thus move the conversion to struct pci_devs for the provider and client into this function.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4034ffa0eb06..6c2057cd3f37 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -844,14 +844,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
-		struct pci_dev *client)
+static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+		struct device *dev)
 {
+	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
 	enum pci_p2pdma_map_type ret;
+	struct pci_dev *client;
 
 	if (!provider->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
+	if (!dev_is_pci(dev))
+		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+
+	client = to_pci_dev(dev);
+
 	ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
				  map_types_idx(client)));
 	if (ret == PCI_P2PDMA_MAP_UNKNOWN)
@@ -892,14 +899,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 {
 	struct pci_p2pdma_pagemap *p2p_pgmap =
 		to_p2p_pgmap(sg_page(sg)->pgmap);
-	struct pci_dev *client;
-
-	if (WARN_ON_ONCE(!dev_is_pci(dev)))
-		return 0;
 
-	client = to_pci_dev(dev);
-
-	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
+	switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
 		return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
@@ -922,17 +923,9 @@ EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-	struct pci_p2pdma_pagemap *p2p_pgmap =
-		to_p2p_pgmap(sg_page(sg)->pgmap);
 	enum pci_p2pdma_map_type map_type;
-	struct pci_dev *client;
-
-	if (WARN_ON_ONCE(!dev_is_pci(dev)))
-		return;
-
-	client = to_pci_dev(dev);
-
-	map_type = pci_p2pdma_map_type(p2p_pgmap->provider, client);
+	map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
 
 	if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
 		dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-- 
2.20.1
[PATCH v2 06/22] PCI/P2PDMA: Attempt to set map_type if it has not been set
Attempt to find the mapping type for P2PDMA pages on the first DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of time, but if pages are to come from userspace then there's no way to ensure the path was checked ahead of time.

With this change it's no longer invalid to call pci_p2pdma_map_sg() before the mapping type is calculated, so drop the WARN_ON when that is the case.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0e0b2218eacd..4034ffa0eb06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -847,11 +847,17 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
 		struct pci_dev *client)
 {
+	enum pci_p2pdma_map_type ret;
+
 	if (!provider->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
-	return xa_to_value(xa_load(&provider->p2pdma->map_types,
-			map_types_idx(client)));
+	ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+			map_types_idx(client)));
+	if (ret == PCI_P2PDMA_MAP_UNKNOWN)
+		return calc_map_type_and_dist_warn(provider, client, NULL);
+
+	return ret;
 }
 
 static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -899,7 +905,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
 	default:
-		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
-- 
2.20.1
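The pattern this patch adds is lazy computation with caching: consult the cached map type first, and only run the (expensive) topology calculation if the cache still says "unknown". A minimal standalone sketch of that pattern (the `toy_*` names are illustrative; the real code caches per client device in an xarray):

```c
enum toy_map_type {
    TOY_MAP_UNKNOWN = 0,        /* "not calculated yet" sentinel */
    TOY_MAP_BUS_ADDR,
    TOY_MAP_NOT_SUPPORTED,
};

static int toy_calc_calls;      /* counts how often the slow path runs */

/* Stand-in for calc_map_type_and_dist_warn(): the expensive walk. */
static enum toy_map_type toy_calc_map_type(void)
{
    toy_calc_calls++;
    return TOY_MAP_BUS_ADDR;    /* pretend the topology check passed */
}

/* Stand-in for the patched pci_p2pdma_map_type(): compute on first use,
 * then serve from the cache. */
static enum toy_map_type toy_get_map_type(enum toy_map_type *cached)
{
    if (*cached == TOY_MAP_UNKNOWN)
        *cached = toy_calc_map_type();
    return *cached;
}
```

This is also why the `WARN_ON_ONCE` could be dropped: hitting the map path before the type was computed is now a supported first-use case, not a programming error.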
[PATCH v2 12/22] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
Make use of the third free LSB in scatterlist's page_link on 64bit systems. The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a given SGL segments dma_address points to a PCI bus address. dma_unmap_sg_p2pdma() will need to perform different cleanup when a segment is marked as P2PDMA. Using this bit requires adding an additional dependency on CONFIG_64BIT to CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA use cases are restricted to newer root complexes and roughly require the extra address space for memory BARs used in the transactions. Signed-off-by: Logan Gunthorpe Reviewed-by: John Hubbard --- drivers/pci/Kconfig | 2 +- include/linux/scatterlist.h | 50 ++--- 2 files changed, 47 insertions(+), 5 deletions(-) diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 0c473d75e625..90b4bddb3300 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -163,7 +163,7 @@ config PCI_PASID config PCI_P2PDMA bool "PCI peer-to-peer transfer support" - depends on ZONE_DEVICE + depends on ZONE_DEVICE && 64BIT select GENERIC_ALLOCATOR help Enableѕ drivers to do PCI peer-to-peer transactions to and from diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f70572b2938..562c0aa264f5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -58,6 +58,21 @@ struct sg_table { #define SG_CHAIN 0x01UL #define SG_END 0x02UL +/* + * bit 2 is the third free bit in the page_link on 64bit systems which + * is used by dma_unmap_sg() to determine if the dma_address is a PCI + * bus address when doing P2PDMA. + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this. 
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_PCI_P2PDMA	0x04UL
+#else
+#define SG_DMA_PCI_P2PDMA	0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -66,7 +81,9 @@ struct sg_table {
 #define sg_is_chain(sg)		((sg)->page_link & SG_CHAIN)
 #define sg_is_last(sg)		((sg)->page_link & SG_END)
 #define sg_chain_ptr(sg) \
-	((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+	((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK))
+
+#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -80,13 +97,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+	unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
 
 	/*
	 * In order for the low bit stealing approach to work, pages
	 * must be aligned at a 32-bit boundary as a minimum.
	 */
-	BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+	BUG_ON((unsigned long) page & SG_PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
@@ -120,7 +137,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+	return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
 }
 
 /**
@@ -222,6 +239,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 	sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg:		SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a PCI bus address.
+ **/
+static inline void sg_dma_mark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link |= SG_DMA_PCI_P2PDMA;
+}
+
+/**
+ * sg_dma_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg:		SG entry
+ *
+ * Description:
+ *   Clears the PCI P2PDMA mark
+ **/
+static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link &= ~SG_DMA_PCI_P2PDMA;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg:	SG entry
-- 
2.20.1
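The bit stealing this patch extends can be demonstrated entirely in userspace: a pointer aligned to at least 4 bytes has free low bits, so `CHAIN`, `END`, and (on 64-bit) the new P2PDMA marker live in `page_link` and must all be masked off when the pointer is read back. The `toy_*` model below mirrors the patch's mask arithmetic with a plain `unsigned long` in place of a `struct page *`:

```c
#define TOY_SG_CHAIN           0x01UL
#define TOY_SG_END             0x02UL
#define TOY_SG_DMA_PCI_P2PDMA  0x04UL
#define TOY_SG_PAGE_LINK_MASK  (TOY_SG_CHAIN | TOY_SG_END | TOY_SG_DMA_PCI_P2PDMA)

struct toy_sg {
    unsigned long page_link;    /* pointer bits | stolen flag bits */
};

/* As in sg_assign_page(): keep the existing flag bits, replace the
 * pointer bits. The "pointer" must be aligned past the mask. */
static void toy_sg_assign_page(struct toy_sg *sg, unsigned long page_addr)
{
    sg->page_link = page_addr | (sg->page_link & TOY_SG_PAGE_LINK_MASK);
}

/* As in sg_page(): mask off ALL stolen bits, including the new one --
 * which is exactly why the patch widens (SG_CHAIN | SG_END) to
 * SG_PAGE_LINK_MASK everywhere. */
static unsigned long toy_sg_page(const struct toy_sg *sg)
{
    return sg->page_link & ~TOY_SG_PAGE_LINK_MASK;
}

static void toy_sg_dma_mark_pci_p2pdma(struct toy_sg *sg)
{
    sg->page_link |= TOY_SG_DMA_PCI_P2PDMA;
}

static int toy_sg_is_dma_pci_p2pdma(const struct toy_sg *sg)
{
    return !!(sg->page_link & TOY_SG_DMA_PCI_P2PDMA);
}
```

The 64-bit dependency follows directly: bit 2 is only guaranteed free when page structures are at least 8-byte aligned, which the Kconfig change encodes as `depends on ZONE_DEVICE && 64BIT`.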
[PATCH v2 11/22] dma-iommu: Return error code from iommu_dma_map_sg()
Pass through appropriate error codes from iommu_dma_map_sg() now that the error code will be passed through dma_map_sgtable(). Signed-off-by: Logan Gunthorpe --- drivers/iommu/dma-iommu.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..0d69f509a95d 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -970,7 +970,7 @@ static int iommu_dma_map_sg_swiotlb(struct device *dev, struct scatterlist *sg, out_unmap: iommu_dma_unmap_sg_swiotlb(dev, sg, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); - return 0; + return -EINVAL; } /* @@ -991,11 +991,14 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, dma_addr_t iova; size_t iova_len = 0; unsigned long mask = dma_get_seg_boundary(dev); + ssize_t ret; int i; - if (static_branch_unlikely(_deferred_attach_enabled) && - iommu_deferred_attach(dev, domain)) - return 0; + if (static_branch_unlikely(_deferred_attach_enabled)) { + ret = iommu_deferred_attach(dev, domain); + if (ret) + return ret; + } if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) iommu_dma_sync_sg_for_device(dev, sg, nents, dir); @@ -1043,14 +1046,17 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, } iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev); - if (!iova) + if (!iova) { + ret = -ENOMEM; goto out_restore_sg; + } /* * We'll leave any physical concatenation to the IOMMU driver's * implementation - it knows better than we do. 
*/ - if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len) + ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot); + if (ret < iova_len) goto out_free_iova; return __finalise_sg(dev, sg, nents, iova); @@ -1059,7 +1065,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, iommu_dma_free_iova(cookie, iova, iova_len, NULL); out_restore_sg: __invalidate_sg(sg, nents); - return 0; + return ret; } static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, -- 2.20.1
[PATCH v2 09/22] dma-direct: Return appropriate error code from dma_direct_map_sg()
Now that the map_sg() op expects error codes instead of returning zero on error, convert dma_direct_map_sg() to return an error code. The only error to return presently is -EINVAL, if a page could not be mapped. Signed-off-by: Logan Gunthorpe --- kernel/dma/direct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index f737e3347059..803ee9321170 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -411,7 +411,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, out_unmap: dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); - return 0; + return -EINVAL; } dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, -- 2.20.1
[PATCH v2 10/22] iommu: Return full error code from iommu_map_sg[_atomic]()
Convert to ssize_t return code so the return code from __iommu_map() can be returned all the way down through dma_iommu_map_sg(). Signed-off-by: Logan Gunthorpe --- drivers/iommu/iommu.c | 15 +++ include/linux/iommu.h | 22 +++--- 2 files changed, 18 insertions(+), 19 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 808ab70d5df5..df428206ea6e 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2567,9 +2567,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_unmap_fast); -static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova, -struct scatterlist *sg, unsigned int nents, int prot, -gfp_t gfp) +static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova, + struct scatterlist *sg, unsigned int nents, int prot, + gfp_t gfp) { const struct iommu_ops *ops = domain->ops; size_t len = 0, mapped = 0; @@ -2610,19 +2610,18 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova, /* undo mappings already done */ iommu_unmap(domain, iova, mapped); - return 0; - + return ret; } -size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova, - struct scatterlist *sg, unsigned int nents, int prot) +ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova, +struct scatterlist *sg, unsigned int nents, int prot) { might_sleep(); return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL); } EXPORT_SYMBOL_GPL(iommu_map_sg); -size_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova, +ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova, struct scatterlist *sg, unsigned int nents, int prot) { return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 32d448050bf7..9369458ba1bd 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -414,11 +414,11 @@ extern size_t iommu_unmap(struct iommu_domain *domain, 
unsigned long iova, extern size_t iommu_unmap_fast(struct iommu_domain *domain, unsigned long iova, size_t size, struct iommu_iotlb_gather *iotlb_gather); -extern size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova, - struct scatterlist *sg,unsigned int nents, int prot); -extern size_t iommu_map_sg_atomic(struct iommu_domain *domain, - unsigned long iova, struct scatterlist *sg, - unsigned int nents, int prot); +extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova, + struct scatterlist *sg, unsigned int nents, int prot); +extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, + unsigned long iova, struct scatterlist *sg, + unsigned int nents, int prot); extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova); extern void iommu_set_fault_handler(struct iommu_domain *domain, iommu_fault_handler_t handler, void *token); @@ -679,18 +679,18 @@ static inline size_t iommu_unmap_fast(struct iommu_domain *domain, return 0; } -static inline size_t iommu_map_sg(struct iommu_domain *domain, - unsigned long iova, struct scatterlist *sg, - unsigned int nents, int prot) +static inline ssize_t iommu_map_sg(struct iommu_domain *domain, + unsigned long iova, struct scatterlist *sg, + unsigned int nents, int prot) { - return 0; + return -ENODEV; } -static inline size_t iommu_map_sg_atomic(struct iommu_domain *domain, +static inline ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova, struct scatterlist *sg, unsigned int nents, int prot) { - return 0; + return -ENODEV; } static inline void iommu_flush_iotlb_all(struct iommu_domain *domain) -- 2.20.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 13/22] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg implementation because it will need to determine the mapping type ahead of actually doing the mapping to create the actual iommu mapping. Signed-off-by: Logan Gunthorpe --- drivers/pci/p2pdma.c | 24 +- include/linux/pci-p2pdma.h | 41 ++ 2 files changed, 56 insertions(+), 9 deletions(-) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 6c2057cd3f37..0568604fd8b4 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -20,13 +20,6 @@ #include #include -enum pci_p2pdma_map_type { - PCI_P2PDMA_MAP_UNKNOWN = 0, - PCI_P2PDMA_MAP_NOT_SUPPORTED, - PCI_P2PDMA_MAP_BUS_ADDR, - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, -}; - struct pci_p2pdma { struct gen_pool *pool; bool p2pmem_published; @@ -844,8 +837,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) } EXPORT_SYMBOL_GPL(pci_p2pmem_publish); -static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, - struct device *dev) +/** + * pci_p2pdma_map_type - return the type of mapping that should be used for + * a given device and pgmap + * @pgmap: the pagemap of a page to determine the mapping type for + * @dev: device that is mapping the page + * + * Returns one of: + * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done + * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally + * using the CPU physical address (in dma-direct) or an IOVA + * mapping for the IOMMU. 
+ */ +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, +struct device *dev) { struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider; enum pci_p2pdma_map_type ret; diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 8318a97c9c61..caac2d023f8f 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -16,6 +16,40 @@ struct block_device; struct scatterlist; +enum pci_p2pdma_map_type { + /* +* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping +* type hasn't been calculated yet. Functions that return this enum +* never return this value. +*/ + PCI_P2PDMA_MAP_UNKNOWN = 0, + + /* +* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will +* traverse the host bridge and the host bridge is not in the +* whitelist. DMA mapping routines should return an error when +* this is returned. +*/ + PCI_P2PDMA_MAP_NOT_SUPPORTED, + + /* +* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to +* each other directly through a PCI switch and the transaction will +* not traverse the host bridge. Such a mapping should program +* the DMA engine with PCI bus addresses. +*/ + PCI_P2PDMA_MAP_BUS_ADDR, + + /* +* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk +* to each other, but the transaction traverses a host bridge on the +* whitelist. In this case, a normal mapping either with CPU physical +* addresses (in the case of dma-direct) or IOVA addresses (in the +* case of IOMMUs) should be used to program the DMA engine. 
+*/ + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, +}; + #ifdef CONFIG_PCI_P2PDMA int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); @@ -30,6 +64,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev, unsigned int *nents, u32 length); void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl); void pci_p2pmem_publish(struct pci_dev *pdev, bool publish); +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, +struct device *dev); int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, @@ -83,6 +119,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev, static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) { } +static inline enum pci_p2pdma_map_type +pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev) +{ + return PCI_P2PDMA_MAP_NOT_SUPPORTED; +} static inline int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs) -- 2.20.1 ___ iommu mailing list iommu@lists.linux-foundation.org
[PATCH v2 22/22] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
This interface is superseded by support in dma_map_sg() which now supports heterogeneous scatterlists. There are no longer any users, so remove it. Signed-off-by: Logan Gunthorpe --- drivers/pci/p2pdma.c | 65 -- include/linux/pci-p2pdma.h | 27 2 files changed, 92 deletions(-) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index b0d779aeade9..767122e0a43f 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -873,71 +873,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, return ret; } -static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap, - struct device *dev, struct scatterlist *sg, int nents) -{ - struct scatterlist *s; - int i; - - for_each_sg(sg, s, nents, i) { - s->dma_address = sg_phys(s) - p2p_pgmap->bus_offset; - sg_dma_len(s) = s->length; - } - - return nents; -} - -/** - * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA - * @dev: device doing the DMA request - * @sg: scatter list to map - * @nents: elements in the scatterlist - * @dir: DMA direction - * @attrs: DMA attributes passed to dma_map_sg() (if called) - * - * Scatterlists mapped with this function should be unmapped using - * pci_p2pdma_unmap_sg_attrs(). - * - * Returns the number of SG entries mapped or 0 on error. 
- */ -int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction dir, unsigned long attrs) -{ - struct pci_p2pdma_pagemap *p2p_pgmap = - to_p2p_pgmap(sg_page(sg)->pgmap); - - switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) { - case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: - return dma_map_sg_attrs(dev, sg, nents, dir, attrs); - case PCI_P2PDMA_MAP_BUS_ADDR: - return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents); - default: - return 0; - } -} -EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs); - -/** - * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was - * mapped with pci_p2pdma_map_sg() - * @dev: device doing the DMA request - * @sg: scatter list to map - * @nents: number of elements returned by pci_p2pdma_map_sg() - * @dir: DMA direction - * @attrs: DMA attributes passed to dma_unmap_sg() (if called) - */ -void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction dir, unsigned long attrs) -{ - enum pci_p2pdma_map_type map_type; - - map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev); - - if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) - dma_unmap_sg_attrs(dev, sg, nents, dir, attrs); -} -EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs); - /** * pci_p2pdma_map_segment - map an sg segment determining the mapping type * @state: State structure that should be declared outside of the for_each_sg() diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index e5a8d5bc0f51..0c33a40a86e7 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -72,10 +72,6 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl); void pci_p2pmem_publish(struct pci_dev *pdev, bool publish); enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev); -int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction dir, unsigned long attrs); 
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction dir, unsigned long attrs); enum pci_p2pdma_map_type pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev, struct scatterlist *sg); @@ -135,17 +131,6 @@ pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev) { return PCI_P2PDMA_MAP_NOT_SUPPORTED; } -static inline int pci_p2pdma_map_sg_attrs(struct device *dev, - struct scatterlist *sg, int nents, enum dma_data_direction dir, - unsigned long attrs) -{ - return 0; -} -static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev, - struct scatterlist *sg, int nents, enum dma_data_direction dir, - unsigned long attrs) -{ -} static inline enum pci_p2pdma_map_type pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev, struct scatterlist *sg) @@ -181,16 +166,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client) return pci_p2pmem_find_many(, 1); } -static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction dir) -{ - return
[PATCH v2 14/22] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg() implementations. It takes a scatterlist segment that must point to a pci_p2pdma struct page and will map it if the mapping requires a bus address. The return value indicates whether the mapping required a bus address or whether the caller still needs to map the segment normally. If the segment should not be mapped, -EREMOTEIO is returned. This helper uses a state structure to track the changes to the pgmap across calls and avoid needing to look up the xarray for every page. Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU dma_map_sg() implementations where the sg segment containing the page differs from the sg segment containing the DMA address. Signed-off-by: Logan Gunthorpe --- drivers/pci/p2pdma.c | 59 ++ include/linux/pci-p2pdma.h | 21 ++ 2 files changed, 80 insertions(+) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 0568604fd8b4..b0d779aeade9 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -938,6 +938,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, } EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs); +/** + * pci_p2pdma_map_segment - map an sg segment determining the mapping type + * @state: State structure that should be declared outside of the for_each_sg() + * loop and initialized to zero. + * @dev: DMA device that's doing the mapping operation + * @sg: scatterlist segment to map + * + * This is a helper to be used by non-iommu dma_map_sg() implementations where + * the sg segment is the same for the page_link and the dma_address. + * + * Attempt to map a single segment in an SGL with the PCI bus address. + * The segment must point to a PCI P2PDMA page and thus must be + * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check. + * + * Returns the type of mapping used and maps the page if the type is + * PCI_P2PDMA_MAP_BUS_ADDR. 
+ */ +enum pci_p2pdma_map_type +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev, + struct scatterlist *sg) +{ + if (state->pgmap != sg_page(sg)->pgmap) { + state->pgmap = sg_page(sg)->pgmap; + state->map = pci_p2pdma_map_type(state->pgmap, dev); + state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; + } + + if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) { + sg->dma_address = sg_phys(sg) + state->bus_off; + sg_dma_len(sg) = sg->length; + sg_dma_mark_pci_p2pdma(sg); + } + + return state->map; +} + +/** + * pci_p2pdma_map_bus_segment - map an sg segment predetermined to + * be mapped with PCI_P2PDMA_MAP_BUS_ADDR + * @pg_sg: scatterlist segment with the page to map + * @dma_sg: scatterlist segment to assign a dma address to + * + * This is a helper for iommu dma_map_sg() implementations when the + * segment for the dma address differs from the segment containing the + * source page. + * + * pci_p2pdma_map_type() must have already been called on the pg_sg and + * returned PCI_P2PDMA_MAP_BUS_ADDR. 
+ */ +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg, + struct scatterlist *dma_sg) +{ + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap); + + dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset; + sg_dma_len(dma_sg) = pg_sg->length; + sg_dma_mark_pci_p2pdma(dma_sg); +} + /** * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store * to enable p2pdma diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index caac2d023f8f..e5a8d5bc0f51 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -13,6 +13,12 @@ #include +struct pci_p2pdma_map_state { + struct dev_pagemap *pgmap; + int map; + u64 bus_off; +}; + struct block_device; struct scatterlist; @@ -70,6 +76,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); +enum pci_p2pdma_map_type +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev, + struct scatterlist *sg); +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg, + struct scatterlist *dma_sg); int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma); ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, @@ -135,6 +146,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev, unsigned long attrs) { } +static inline enum pci_p2pdma_map_type +pci_p2pdma_map_segment(struct
[PATCH v2 17/22] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
When a PCI P2PDMA page is seen, set the IOVA length of the segment to zero so that it is not mapped into the IOVA. Then, in finalise_sg(), apply the appropriate bus address to the segment. The IOVA is not created if the scatterlist only consists of P2PDMA pages. A P2PDMA page may have three possible outcomes when being mapped: 1) If the data path between the two devices doesn't go through the root port, then it should be mapped with a PCI bus address 2) If the data path goes through the host bridge, it should be mapped normally with an IOMMU IOVA. 3) It is not possible for the two devices to communicate and thus the mapping operation should fail (and it will return -EREMOTEIO). Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to indicate bus address segments. On unmap, P2PDMA segments are skipped over when determining the start and end IOVA addresses. With this change, the flags variable in the dma_map_ops is set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages. Signed-off-by: Logan Gunthorpe --- drivers/iommu/dma-iommu.c | 66 +++ 1 file changed, 60 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 0d69f509a95d..7cc6650e1fa5 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -891,6 +892,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents, sg_dma_address(s) = DMA_MAPPING_ERROR; sg_dma_len(s) = 0; + if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) { + if (i > 0) + cur = sg_next(cur); + + pci_p2pdma_map_bus_segment(s, cur); + count++; + cur_len = 0; + continue; + } + /* * Now fill in the real DMA data. If... 
* - there is a valid output segment to append to @@ -988,6 +999,8 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, struct iova_domain *iovad = >iovad; struct scatterlist *s, *prev = NULL; int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs); + struct dev_pagemap *pgmap = NULL; + enum pci_p2pdma_map_type map_type; dma_addr_t iova; size_t iova_len = 0; unsigned long mask = dma_get_seg_boundary(dev); @@ -1023,6 +1036,35 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, s_length = iova_align(iovad, s_length + s_iova_off); s->length = s_length; + if (is_pci_p2pdma_page(sg_page(s))) { + if (sg_page(s)->pgmap != pgmap) { + pgmap = sg_page(s)->pgmap; + map_type = pci_p2pdma_map_type(pgmap, dev); + } + + switch (map_type) { + case PCI_P2PDMA_MAP_BUS_ADDR: + /* +* A zero length will be ignored by +* iommu_map_sg() and then can be detected +* in __finalise_sg() to actually map the +* bus address. +*/ + s->length = 0; + continue; + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + /* +* Mapping through host bridge should be +* mapped with regular IOVAs, thus we +* do nothing here and continue below. +*/ + break; + default: + ret = -EREMOTEIO; + goto out_restore_sg; + } + } + /* * Due to the alignment of our single IOVA allocation, we can * depend on these assumptions about the segment boundary mask: @@ -1045,6 +1087,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, prev = s; } + if (!iova_len) + return __finalise_sg(dev, sg, nents, 0); + iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev); if (!iova) { ret = -ENOMEM; @@ -1071,7 +1116,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs) { - dma_addr_t start, end; + dma_addr_t end, start =
[PATCH v2 16/22] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
Add a flags member to the dma_map_ops structure with one flag to indicate support for PCI P2PDMA. Also, add a helper to check if a device supports PCI P2PDMA. Signed-off-by: Logan Gunthorpe --- include/linux/dma-map-ops.h | 10 ++ include/linux/dma-mapping.h | 5 + kernel/dma/mapping.c| 18 ++ 3 files changed, 33 insertions(+) diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index eaa969be8284..b7b51b89b054 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -11,7 +11,17 @@ struct cma; +/* + * Values for struct dma_map_ops.flags: + * + * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can + * handle PCI P2PDMA pages in the map_sg/unmap_sg operation. + */ +#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0) + struct dma_map_ops { + unsigned int flags; + void *(*alloc)(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 2b0ecf0aa4df..1149540e969f 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -138,6 +138,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, unsigned long attrs); bool dma_can_mmap(struct device *dev); int dma_supported(struct device *dev, u64 mask); +bool dma_pci_p2pdma_supported(struct device *dev); int dma_set_mask(struct device *dev, u64 mask); int dma_set_coherent_mask(struct device *dev, u64 mask); u64 dma_get_required_mask(struct device *dev); @@ -242,6 +243,10 @@ static inline int dma_supported(struct device *dev, u64 mask) { return 0; } +static inline bool dma_pci_p2pdma_supported(struct device *dev) +{ + return false; +} static inline int dma_set_mask(struct device *dev, u64 mask) { return -EIO; diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 700a2bb7cc9e..2940977b1030 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -658,6 +658,24 @@ int dma_supported(struct device *dev, u64 mask) } 
EXPORT_SYMBOL(dma_supported); +bool dma_pci_p2pdma_supported(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + /* if ops is not set, dma direct will be used which supports P2PDMA */ + if (!ops) + return true; + + /* +* Note: dma_ops_bypass is not checked here because P2PDMA should +* not be used with dma mapping ops that do not have support even +* if the specific device is bypassing them. +*/ + + return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED; +} +EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported); + #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK void arch_dma_set_mask(struct device *dev, u64 mask); #else -- 2.20.1
[PATCH v2 15/22] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map PCI P2PDMA pages directly without a hack in the callers. This allows for heterogeneous SGLs that contain both P2PDMA and regular pages. A P2PDMA page may have three possible outcomes when being mapped: 1) If the data path between the two devices doesn't go through the root port, then it should be mapped with a PCI bus address 2) If the data path goes through the host bridge, it should be mapped normally, as though it were a CPU physical address 3) It is not possible for the two devices to communicate and thus the mapping operation should fail (and it will return -EREMOTEIO). SGL segments that contain PCI bus addresses are marked with sg_dma_mark_pci_p2pdma() and are ignored when unmapped. Signed-off-by: Logan Gunthorpe --- kernel/dma/direct.c | 44 ++-- 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 803ee9321170..567dac942e89 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "direct.h" /* @@ -381,29 +382,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev, arch_sync_dma_for_cpu_all(); } +/* + * Unmaps segments, except for ones marked as pci_p2pdma which do not + * require any further action as they contain a bus address. 
+ */ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) { struct scatterlist *sg; int i; - for_each_sg(sgl, sg, nents, i) - dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir, -attrs); + for_each_sg(sgl, sg, nents, i) { + if (sg_is_dma_pci_p2pdma(sg)) { + sg_dma_unmark_pci_p2pdma(sg); + } else { + dma_direct_unmap_page(dev, sg->dma_address, + sg_dma_len(sg), dir, attrs); + } + } } #endif int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) { - int i; + struct pci_p2pdma_map_state p2pdma_state = {}; + enum pci_p2pdma_map_type map; struct scatterlist *sg; + int i, ret; for_each_sg(sgl, sg, nents, i) { + if (is_pci_p2pdma_page(sg_page(sg))) { + map = pci_p2pdma_map_segment(_state, dev, sg); + switch (map) { + case PCI_P2PDMA_MAP_BUS_ADDR: + continue; + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + /* +* Mapping through host bridge should be +* mapped normally, thus we do nothing +* and continue below. +*/ + break; + default: + ret = -EREMOTEIO; + goto out_unmap; + } + } + sg->dma_address = dma_direct_map_page(dev, sg_page(sg), sg->offset, sg->length, dir, attrs); - if (sg->dma_address == DMA_MAPPING_ERROR) + if (sg->dma_address == DMA_MAPPING_ERROR) { + ret = -EINVAL; goto out_unmap; + } sg_dma_len(sg) = sg->length; } @@ -411,7 +443,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, out_unmap: dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC); - return -EINVAL; + return ret; } dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, -- 2.20.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 19/22] nvme-pci: Convert to using dma_map_sgtable()
The dma_map operations now support P2PDMA pages directly. So remove the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls to dma_map_sgtable(). dma_map_sgtable() returns more complete error codes than dma_map_sg() and allows differentiating EREMOTEIO errors in case an unsupported P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET so the request isn't retried. Signed-off-by: Logan Gunthorpe --- drivers/nvme/host/pci.c | 69 + 1 file changed, 29 insertions(+), 40 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 9912291f43af..8844ef1005c3 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -229,11 +229,10 @@ struct nvme_iod { bool use_sgl; int aborted; int npages; /* In the PRP list. 0 means small pool in use */ - int nents; /* Used in scatterlist */ dma_addr_t first_dma; unsigned int dma_len; /* length of single DMA segment mapping */ dma_addr_t meta_dma; - struct scatterlist *sg; + struct sg_table sgt; }; static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev) @@ -525,7 +524,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx) static void **nvme_pci_iod_list(struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - return (void **)(iod->sg + blk_rq_nr_phys_segments(req)); + return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req)); } static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req) @@ -579,17 +578,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req) } -static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req) -{ - struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - - if (is_pci_p2pdma_page(sg_page(iod->sg))) - pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents, - rq_dma_dir(req)); - else - dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req)); -} - static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); @@ -600,9 +588,10 
@@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->nents);
+	WARN_ON_ONCE(!iod->sgt.nents);
+
+	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-	nvme_unmap_sg(dev, req);
 	if (iod->npages == 0)
 		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
 			      iod->first_dma);
@@ -610,7 +599,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 		nvme_free_sgls(dev, req);
 	else
 		nvme_free_prps(dev, req);
-	mempool_free(iod->sg, dev->iod_mempool);
+	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -633,7 +622,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
-	struct scatterlist *sg = iod->sg;
+	struct scatterlist *sg = iod->sgt.sgl;
 	int dma_len = sg_dma_len(sg);
 	u64 dma_addr = sg_dma_address(sg);
 	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -706,16 +695,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		dma_len = sg_dma_len(sg);
 	}
 done:
-	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
 	nvme_free_prps(dev, req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
-	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
 	     "Invalid SGL for payload:%d nents:%d\n",
-	     blk_rq_payload_bytes(req), iod->nents);
+	     blk_rq_payload_bytes(req), iod->sgt.nents);
 	return BLK_STS_IOERR;
 }
 
@@ -741,12 +730,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmd, int entries)
+		struct request *req, struct nvme_rw_command *cmd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sg = iod->sg;
+	struct scatterlist *sg = iod->sgt.sgl;
+	int entries = iod->sgt.nents;
 	dma_addr_t sgl_dma;
 	int i = 0;
@@ -844,7 +834,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev,
[PATCH v2 04/22] PCI/P2PDMA: Avoid pci_get_slot() which sleeps
In order to use upstream_bridge_distance_warn() from a dma_map function, it
must not sleep. However, pci_get_slot() takes the pci_bus_sem so it might
sleep.

In order to avoid this, try to get the host bridge's device from the first
element in the device list. It should be impossible for the host bridge's
device to go away while references are held on child devices, so the first
element should not be able to change and, thus, this should be safe.

Introduce a static function called pci_host_bridge_dev() to obtain the host
bridge's root device.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 09c864f193d2..175da3a9c8fb 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -308,10 +308,41 @@ static const struct pci_p2pdma_whitelist_entry {
 	{}
 };
 
+/*
+ * This lookup function tries to find the PCI device corresponding to a given
+ * host bridge.
+ *
+ * It assumes the host bridge device is the first PCI device in the
+ * bus->devices list and that the devfn is 00.0. These assumptions should hold
+ * for all the devices in the whitelist above.
+ *
+ * This function is equivalent to pci_get_slot(host->bus, 0), however it does
+ * not take the pci_bus_sem lock, since __host_bridge_whitelist() must not
+ * sleep.
+ *
+ * For this to be safe, the caller should hold a reference to a device on the
+ * bridge, which should ensure the host_bridge device will not be freed
+ * or removed from the head of the devices list.
+ */
+static struct pci_dev *pci_host_bridge_dev(struct pci_host_bridge *host)
+{
+	struct pci_dev *root;
+
+	root = list_first_entry_or_null(&host->bus->devices,
+					struct pci_dev, bus_list);
+
+	if (!root)
+		return NULL;
+	if (root->devfn != PCI_DEVFN(0, 0))
+		return NULL;
+
+	return root;
+}
+
 static bool __host_bridge_whitelist(struct pci_host_bridge *host,
 				    bool same_host_bridge)
 {
-	struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
+	struct pci_dev *root = pci_host_bridge_dev(host);
 	const struct pci_p2pdma_whitelist_entry *entry;
 	unsigned short vendor, device;
@@ -320,7 +351,6 @@ static bool __host_bridge_whitelist(struct pci_host_bridge *host,
 	vendor = root->vendor;
 	device = root->device;
-	pci_dev_put(root);
 
 	for (entry = pci_p2pdma_whitelist; entry->vendor; entry++) {
 		if (vendor != entry->vendor || device != entry->device)
-- 
2.20.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 21/22] RDMA/rw: use dma_map_sgtable()
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg() is
no longer necessary and may be dropped. Switch to the dma_map_sgtable()
interface which will allow for better error reporting if the P2PDMA pages
are unsupported.

The change to sgtable also appears to fix a couple subtle error path bugs:

 - In rdma_rw_ctx_init(), dma_unmap would be called with an sg that could
   have been incremented from the original call, as well as an nents that
   was not the original number of nents called when mapped.

 - Similarly in rdma_rw_ctx_signature_init(), both sg and prot_sg were
   unmapped with the incorrect number of nents.

Signed-off-by: Logan Gunthorpe
---
 drivers/infiniband/core/rw.c | 75 +++-
 include/rdma/ib_verbs.h      | 19 +
 2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index a588c2038479..68f2dda56138 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,26 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-			     u32 sg_cnt, enum dma_data_direction dir)
-{
-	if (is_pci_p2pdma_page(sg_page(sg)))
-		pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-	else
-		ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
-			  u32 sg_cnt, enum dma_data_direction dir)
-{
-	if (is_pci_p2pdma_page(sg_page(sg))) {
-		if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-			return 0;
-		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
-	}
-	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx: context to initialize
@@ -313,12 +293,16 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
 {
 	struct ib_device *dev = qp->pd->device;
+	struct sg_table sgt = {
+		.sgl = sg,
+		.orig_nents = sg_cnt,
+	};
 	int ret;
 
-	ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-	if (!ret)
-		return -ENOMEM;
-	sg_cnt = ret;
+	ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+	if (ret)
+		return ret;
+	sg_cnt = sgt.nents;
 
 	/*
 	 * Skip to the S/G entry that sg_offset falls into:
@@ -354,7 +338,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 	return ret;
 
 out_unmap_sg:
-	rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+	ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
 	return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -387,6 +371,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 						    qp->integrity_en);
 	struct ib_rdma_wr *rdma_wr;
 	int count = 0, ret;
+	struct sg_table sgt = {
+		.sgl = sg,
+		.orig_nents = sg_cnt,
+	};
+	struct sg_table prot_sgt = {
+		.sgl = prot_sg,
+		.orig_nents = prot_sg_cnt,
+	};
 
 	if (sg_cnt > pages_per_mr || prot_sg_cnt > pages_per_mr) {
 		pr_err("SG count too large: sg_cnt=%d, prot_sg_cnt=%d, pages_per_mr=%d\n",
@@ -394,18 +386,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		return -EINVAL;
 	}
 
-	ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-	if (!ret)
-		return -ENOMEM;
-	sg_cnt = ret;
+	ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+	if (ret)
+		return ret;
 
 	if (prot_sg_cnt) {
-		ret = rdma_rw_map_sg(dev, prot_sg, prot_sg_cnt, dir);
-		if (!ret) {
-			ret = -ENOMEM;
+		ret = ib_dma_map_sgtable(dev, &prot_sgt, dir, 0);
+		if (ret)
 			goto out_unmap_sg;
-		}
-		prot_sg_cnt = ret;
 	}
 
 	ctx->type = RDMA_RW_SIG_MR;
@@ -426,10 +414,11 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	memcpy(ctx->reg->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
 
-	ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sg_cnt, NULL, prot_sg,
-			      prot_sg_cnt, NULL, SZ_4K);
+	ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sgt.nents, NULL, prot_sg,
+			      prot_sgt.nents, NULL, SZ_4K);
 	if (unlikely(ret)) {
-		pr_err("failed to map PI sg (%d)\n", sg_cnt + prot_sg_cnt);
+		pr_err("failed to map PI sg (%d)\n",
+		       sgt.nents
[PATCH v2 20/22] RDMA/core: Introduce ib_dma_pci_p2p_dma_supported()
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check if a
given ib_device can be used in P2PDMA transfers. This ensures the ib_device
is not using virt_dma and also that the underlying dma_device supports
P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe
---
 drivers/nvme/target/rdma.c | 2 +-
 include/rdma/ib_verbs.h    | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 6c1f3ab7649c..86f0bf4dc1ca 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -414,7 +414,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
 	if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
 		goto out_free_rsp;
 
-	if (!ib_uses_virt_dma(ndev->device))
+	if (ib_dma_pci_p2p_dma_supported(ndev->device))
 		r->req.p2p_client = &ndev->device->dev;
 
 	r->send_sge.length = sizeof(*r->req.cqe);
 	r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7e2f3699b898..724e80a656f7 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3943,6 +3943,17 @@ static inline bool ib_uses_virt_dma(struct ib_device *dev)
 	return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if an IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+	if (ib_uses_virt_dma(dev))
+		return false;
+
+	return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.20.1
[PATCH v2 00/22] Add new DMA mapping operation for P2PDMA
Hi,

This patchset continues my work to add P2PDMA support to the common dma map
operations. This allows for creating SGLs that have both P2PDMA and regular
pages, which is a necessary step to allowing P2PDMA pages in userspace.

The earlier RFC[1] and v1[2] postings generated a lot of great feedback. This
version adds a bunch more cleanup at the start of the series. I'll probably
look to split the earlier patches off and get them merged independently after
a round of review, as this series has gotten quite long. I'm happy to do a
few more passes if anyone has any further feedback or better ideas.

This series is based on v5.13-rc1 and a git branch can be found here:

  https://github.com/sbates130272/linux-p2pmem/ p2pdma_map_ops_v2

Thanks,

Logan

[1] https://lore.kernel.org/linux-block/20210311233142.7900-1-log...@deltatee.com/
[2] https://lore.kernel.org/linux-block/20210408170123.8788-1-log...@deltatee.com/

Changes since v1:
 * Rebased onto v5.13-rc1
 * Add some cleanup to the existing P2PDMA code to fix up some naming
   conventions and documentation as the code has evolved a bit since the
   names were chosen. (As suggested by John)
 * Add a patch that adds a warning if a host bridge is not in the whitelist
   (as suggested by Don)
 * Change to using dma_map_sgtable() instead of creating a new interface.
   For this, a couple of .map_sg implementations were changed to return
   full error codes. (as per Christoph)
 * Renamed the scatterlist functions to include the term "dma" to indicate
   that they apply to the DMA side of the sg. (per Jason)
 * Introduce ib_dma_pci_p2p_dma_supported() helper instead of open coding
   the check (per Jason)
 * Numerous minor adjustments and documentation fixes

Changes since the RFC:
 * Added comment and fixed up the pci_get_slot patch. (per Bjorn)
 * Fixed glaring sg_phys() double offset bug. (per Robin)
 * Created a new map operation (dma_map_sg_p2pdma()) with a new calling
   convention instead of modifying the calling convention of dma_map_sg().
(per Robin) * Integrated the two similar pci_p2pdma_dma_map_type() and pci_p2pdma_map_type() functions into one (per Ira) * Reworked some of the logic in the map_sg() implementations into helpers in the p2pdma code. (per Christoph) * Dropped a bunch of unnecessary symbol exports (per Christoph) * Expanded the code in dma_pci_p2pdma_supported() for clarity. (per Ira and Christoph) * Finished off using the new dma_map_sg_p2pdma() call in rdma_rw and removed the old pci_p2pdma_[un]map_sg(). (per Jason) -- Logan Gunthorpe (22): PCI/P2PDMA: Rename upstream_bridge_distance() and rework documentation PCI/P2PDMA: Use a buffer on the stack for collecting the acs list PCI/P2PDMA: Cleanup type for return value of calc_map_type_and_dist() PCI/P2PDMA: Avoid pci_get_slot() which sleeps PCI/P2PDMA: Print a warning if the host bridge is not in the whitelist PCI/P2PDMA: Attempt to set map_type if it has not been set PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagemap and device dma-mapping: Allow map_sg() ops to return negative error codes dma-direct: Return appropriate error code from dma_direct_map_sg() iommu: Return full error code from iommu_map_sg[_atomic]() dma-iommu: Return error code from iommu_dma_map_sg() lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL PCI/P2PDMA: Make pci_p2pdma_map_type() non-static PCI/P2PDMA: Introduce helpers for dma_map_sg implementations dma-direct: Support PCI P2PDMA pages in dma-direct map_sg dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg nvme-pci: Check DMA ops when indicating support for PCI P2PDMA nvme-pci: Convert to using dma_map_sgtable() RDMA/core: Introduce ib_dma_pci_p2p_dma_supported() RDMA/rw: use dma_map_sgtable() PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg() drivers/infiniband/core/rw.c | 75 -- drivers/iommu/dma-iommu.c| 86 +-- drivers/iommu/iommu.c| 15 +- drivers/nvme/host/core.c | 3 +- drivers/nvme/host/nvme.h | 2 +- 
drivers/nvme/host/pci.c | 80 +- drivers/nvme/target/rdma.c | 2 +- drivers/pci/Kconfig | 2 +- drivers/pci/p2pdma.c | 273 +++ include/linux/dma-map-ops.h | 18 ++- include/linux/dma-mapping.h | 46 +- include/linux/iommu.h| 22 +-- include/linux/pci-p2pdma.h | 81 --- include/linux/scatterlist.h | 50 ++- include/rdma/ib_verbs.h | 30 kernel/dma/direct.c | 44 +- kernel/dma/mapping.c | 31 +++- 17 files changed, 570 insertions(+), 290 deletions(-) base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5 -- 2.20.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 01/22] PCI/P2PDMA: Rename upstream_bridge_distance() and rework documentation
The function upstream_bridge_distance() has evolved such that its name is no
longer entirely reflective of what the function does. The function not only
calculates the distance between two peers but also calculates how the DMA
addresses for those two peers should be mapped.

Thus, rename the function to calc_map_type_and_dist() and rework the
documentation some to better describe the two pieces of information the
function returns.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 63 ++--
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..6f90e9812f6e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -354,7 +354,7 @@ static bool host_bridge_whitelist(struct pci_dev *a, struct pci_dev *b)
 }
 
 static enum pci_p2pdma_map_type
-__upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
+__calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 		int *dist, bool *acs_redirects, struct seq_buf *acs_list)
 {
 	struct pci_dev *a = provider, *b = client, *bb;
@@ -433,17 +433,18 @@ static unsigned long map_types_idx(struct pci_dev *client)
 }
 
 /*
- * Find the distance through the nearest common upstream bridge between
- * two PCI devices.
+ * Calculate the P2PDMA mapping type and distance between two PCI devices.
  *
- * If the two devices are the same device then 0 will be returned.
+ * If the two devices are the same device then PCI_P2PDMA_MAP_BUS_ADDR
+ * and a distance of 0 will be returned.
  *
  * If there are two virtual functions of the same device behind the same
- * bridge port then 2 will be returned (one step down to the PCIe switch,
- * then one step back to the same device).
+ * bridge port then PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 will be
+ * returned (one step down to the PCIe switch, then one step back to the
+ * same device).
  *
  * In the case where two devices are connected to the same PCIe switch, the
- * value 4 will be returned. This corresponds to the following PCI tree:
+ * distance of 4 will be returned. This corresponds to the following PCI tree:
  *
  *     -+  Root Port
  *      \+ Switch Upstream Port
@@ -454,31 +455,31 @@ static unsigned long map_types_idx(struct pci_dev *client)
  *
  * The distance is 4 because we traverse from Device A through the downstream
  * port of the switch, to the common upstream port, back up to the second
- * downstream port and then to Device B.
- *
- * Any two devices that cannot communicate using p2pdma will return
- * PCI_P2PDMA_MAP_NOT_SUPPORTED.
+ * downstream port and then to Device B. The mapping type returned will depend
+ * on the ACS redirection setting of the bridges along the path. If ACS
+ * redirect is set on any bridge port in the path then the TLPs will go through
+ * the host bridge. Otherwise PCI_P2PDMA_MAP_BUS_ADDR is returned.
  *
  * Any two devices that have a data path that goes through the host bridge
- * will consult a whitelist. If the host bridges are on the whitelist,
- * this function will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE.
- *
- * If either bridge is not on the whitelist this function returns
- * PCI_P2PDMA_MAP_NOT_SUPPORTED.
+ * will consult a whitelist. If the host bridge is in the whitelist,
+ * this function will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the
+ * distance set to the number of ports per above. If the device is not
+ * in the whitelist, PCI_P2PDMA_MAP_NOT_SUPPORTED will be returned.
  *
- * If a bridge which has any ACS redirection bits set is in the path,
- * acs_redirects will be set to true. In this case, a list of all infringing
- * bridge addresses will be populated in acs_list (assuming it's non-null)
- * for printk purposes.
+ * If any ACS redirect bits are set, then the acs_redirects boolean will be
+ * set to true and the offending bridges' PCI device names will be appended
+ * to the acs_list seq_buf. This seq_buf is used to print a warning informing
+ * the user how to disable ACS using a command line parameter.
+ * (See calc_map_type_and_dist_warn() below)
 */
 static enum pci_p2pdma_map_type
-upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
+calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 		int *dist, bool *acs_redirects, struct seq_buf *acs_list)
 {
 	enum pci_p2pdma_map_type map_type;
 
-	map_type = __upstream_bridge_distance(provider, client, dist,
-					      acs_redirects, acs_list);
+	map_type = __calc_map_type_and_dist(provider, client, dist,
+					    acs_redirects, acs_list);
 
 	if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) {
 		if (!cpu_supports_p2pdma() &&
@@ -494,8 +495,8 @@ upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
 }
 
 static enum pci_p2pdma_map_type
[PATCH v2 18/22] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops flags can be checked for PCI P2PDMA support. Signed-off-by: Logan Gunthorpe --- drivers/nvme/host/core.c | 3 ++- drivers/nvme/host/nvme.h | 2 +- drivers/nvme/host/pci.c | 11 +-- 3 files changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 522c9b229f80..9544a8948bc4 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3682,7 +3682,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid, blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue); blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue); - if (ctrl->ops->flags & NVME_F_PCI_P2PDMA) + if (ctrl->ops->supports_pci_p2pdma && + ctrl->ops->supports_pci_p2pdma(ctrl)) blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue); ns->queue->queuedata = ns; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 05f31a2c64bb..84649ce6c206 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -487,7 +487,6 @@ struct nvme_ctrl_ops { unsigned int flags; #define NVME_F_FABRICS (1 << 0) #define NVME_F_METADATA_SUPPORTED (1 << 1) -#define NVME_F_PCI_P2PDMA (1 << 2) int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val); int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val); int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val); @@ -495,6 +494,7 @@ struct nvme_ctrl_ops { void (*submit_async_event)(struct nvme_ctrl *ctrl); void (*delete_ctrl)(struct nvme_ctrl *ctrl); int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size); + bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl); }; #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index a29b170701fc..9912291f43af 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2756,17 +2756,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
 	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+	return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+	.flags			= NVME_F_METADATA_SUPPORTED,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
 	.free_ctrl		= nvme_pci_free_ctrl,
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
+	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.20.1
[PATCH v2 02/22] PCI/P2PDMA: Use a buffer on the stack for collecting the acs list
In order to call the calc_map_type_and_dist_warn() function from a dma_map
operation, the function must not sleep. The only reason it sleeps is to
allocate memory for the seq_buf to print a verbose warning telling the user
how to disable ACS for that path.

Instead of allocating the memory with kmalloc, allocate it on the stack with
a smaller buffer. A 128B buffer is enough to print 10 pci device names. A
system with 10 bridge ports between two devices that have ACS enabled would
be unusually large, so this should still be a reasonable limit.

This also allows cleaning up the awkward (and broken) return with -ENOMEM,
which contradicts the return type and which the caller was not prepared for.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 6f90e9812f6e..3a5fb63c5f2c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -500,11 +500,10 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, struct pci_dev *client,
 {
 	struct seq_buf acs_list;
 	bool acs_redirects;
+	char buf[128];
 	int ret;
 
-	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
-	if (!acs_list.buffer)
-		return -ENOMEM;
+	seq_buf_init(&acs_list, buf, sizeof(buf));
 
 	ret = calc_map_type_and_dist(provider, client, dist, &acs_redirects,
 				     &acs_list);
@@ -522,8 +521,6 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, struct pci_dev *client,
 		pci_name(provider));
 	}
 
-	kfree(acs_list.buffer);
-
 	return ret;
 }
-- 
2.20.1
[PATCH v2 03/22] PCI/P2PDMA: Cleanup type for return value of calc_map_type_and_dist()
Instead of using an int for the return value of this function, use the
correct enum pci_p2pdma_map_type.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/p2pdma.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 3a5fb63c5f2c..09c864f193d2 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -544,11 +544,11 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, struct pci_dev *client,
 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
 			     int num_clients, bool verbose)
 {
+	enum pci_p2pdma_map_type map;
 	bool not_supported = false;
 	struct pci_dev *pci_client;
 	int total_dist = 0;
-	int distance;
-	int i, ret;
+	int i, distance;
 
 	if (num_clients == 0)
 		return -1;
@@ -563,15 +563,15 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
 	}
 
 	if (verbose)
-		ret = calc_map_type_and_dist_warn(provider, pci_client,
+		map = calc_map_type_and_dist_warn(provider, pci_client,
 						  &distance);
 	else
-		ret = calc_map_type_and_dist(provider, pci_client,
+		map = calc_map_type_and_dist(provider, pci_client,
 					     &distance, NULL, NULL);
 
 	pci_dev_put(pci_client);
 
-	if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED)
+	if (map == PCI_P2PDMA_MAP_NOT_SUPPORTED)
 		not_supported = true;
 
 	if (not_supported && !verbose)
-- 
2.20.1
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 01:22:51PM -0700, Jacob Pan wrote: > Hi Tony, > > On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" > wrote: > > > On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote: > > > It seems there are two options: > > > 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU > > > page table/domain, mark the device is PASID only with a flag. Use DMA > > > APIs to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU > > > driver will decide whether to use system PASID domain during map/unmap. > > > Not clear if we also need to make IOVA==kernel VA. > > > > > > 2. Add a new IOMMU API to setup a system PASID which points to > > > init_mm.pgd. This API only allows trusted device to bind with the > > > system PASID at its own risk. There is no need for DMA API. This is the > > > same as the current code except with an explicit API. > > > > > > Which option? > > > > Option #1 looks cleaner to me. Option #2 gives access to bits > > of memory that the users of system PASID shouldn't ever need > > to touch ... just map regions of memory that the kernel has > > a "struct page" for. > > > > What does "use DMA APIs to explicitly map/unmap" mean? Is that > > for the whole region? > > > If we map the entire kernel direct map during system PASID setup, then we > don't need to use DMA API to map/unmap certain range. > > I was thinking this system PASID page table could be on-demand. The mapping > is built by explicit use of DMA map/unmap APIs. Option 1 should be the PASID works exactly like a normal RID and uses all the normal DMA APIs and IOMMU mechanisms, whatever the platform implements. This might mean an iommu update on every operation or not. > > I'm expecting that once this system PASID has been initialized, > > then any accelerator device with a kernel use case would use the > > same PASID. I.e. DSA for page clearing, IAX for ZSwap compression > > & decompression, etc. 
> > > OK, sounds like we have to map the entire kernel VA with struct page as you > said. So we still by-pass DMA APIs, can we all agree on that? Option 2 should be the faster option, but not available in all cases. Option 1 isn't optional. DMA and IOMMU code has to be portable and this is the portable API. If you want to do option 1 and option 2 then give it a go, but in most common cases with the IOMMU in a direct map you shouldn't get a notable performance win. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Tony, On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" wrote: > On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote: > > It seems there are two options: > > 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU > > page table/domain, mark the device is PASID only with a flag. Use DMA > > APIs to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU > > driver will decide whether to use system PASID domain during map/unmap. > > Not clear if we also need to make IOVA==kernel VA. > > > > 2. Add a new IOMMU API to setup a system PASID which points to > > init_mm.pgd. This API only allows trusted device to bind with the > > system PASID at its own risk. There is no need for DMA API. This is the > > same as the current code except with an explicit API. > > > > Which option? > > Option #1 looks cleaner to me. Option #2 gives access to bits > of memory that the users of system PASID shouldn't ever need > to touch ... just map regions of memory that the kernel has > a "struct page" for. > > What does "use DMA APIs to explicitly map/unmap" mean? Is that > for the whole region? > If we map the entire kernel direct map during system PASID setup, then we don't need to use DMA API to map/unmap certain range. I was thinking this system PASID page table could be on-demand. The mapping is built by explicit use of DMA map/unmap APIs. > I'm expecting that once this system PASID has been initialized, > then any accelerator device with a kernel use case would use the > same PASID. I.e. DSA for page clearing, IAX for ZSwap compression > & decompression, etc. > OK, sounds like we have to map the entire kernel VA with struct page as you said. So we still by-pass DMA APIs, can we all agree on that? > -Tony Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote: > It seems there are two options: > 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU page > table/domain, mark the device is PASID only with a flag. Use DMA APIs > to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU driver > will decide whether to use system PASID domain during map/unmap. Not clear > if we also need to make IOVA==kernel VA. > > 2. Add a new IOMMU API to setup a system PASID which points to init_mm.pgd. > This API only allows trusted device to bind with the system PASID at its > own risk. There is no need for DMA API. This is the same as the current > code except with an explicit API. > > Which option? Option #1 looks cleaner to me. Option #2 gives access to bits of memory that the users of system PASID shouldn't ever need to touch ... just map regions of memory that the kernel has a "struct page" for. What does "use DMA APIs to explicitly map/unmap" mean? Is that for the whole region? I'm expecting that once this system PASID has been initialized, then any accelerator device with a kernel use case would use the same PASID. I.e. DSA for page clearing, IAX for ZSwap compression & decompression, etc. -Tony ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Jason, On Thu, 13 May 2021 16:20:14 -0300, Jason Gunthorpe wrote: > On Thu, May 13, 2021 at 07:14:54PM +, Luck, Tony wrote: > > > If you want this then be explicit about what it is you are making when > > > building the API. Don't try to hide it under some generic idea of > > > "kernel PCI DMA SVA" > > > > So, a special API call (that Dave can call from IDXD) to set up this > > kernel PASID. With suitable documentation to explain the scope. > > Maybe with a separate CONFIG option so it can be completely > > stubbed out (IDXD does *NOT* "select" this option ... users have > > to explicitly pick it). > > > > > I could easily see an admin option to "turn this off" entirely as > > > being too dangerous, especially if the users have no interest in > > > IDXD. > > > > And a kernel command line option to block IDXD from using that > > special API ... for users on generic kernels who want to block > > this use model (but still use IDXD in non-kernel cases). Users > > who don't want IDXD at all can block loading of the driver. > > A generic IOMMU API should not be IDXD specific, if you want to allow > some special "integrated to the SOC accelerator PASID" mode then it > should be a global IOMMU API and any security toggles for it should be > global and unrelated to IDXD. > > Concurrently it seems necessary to have a solution for secure kernel > PASID use under the DMA API and reserve this special mode for getting > higher performance. > > I think you need to come up with a proposal, and it is a bit alarming a > noteworthy security hole was added under the guise of "kernel SVA" :( > It seems there are two options: 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU page table/domain, and mark the device as PASID-only with a flag. Use the DMA APIs to explicitly map/unmap. Based on this PASID-only flag, the vendor IOMMU driver will decide whether to use the system PASID domain during map/unmap. It is not clear whether we also need to make IOVA==kernel VA. 2.
Add a new IOMMU API to set up a system PASID which points to init_mm.pgd. This API only allows a trusted device to bind with the system PASID at its own risk. There is no need for the DMA API. This is the same as the current code except with an explicit API. Which option? > Jason Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 07:14:54PM +, Luck, Tony wrote: > > If you want this then be explicit about what it is you are making when > > building the API. Don't try to hide it under some generic idea of > > "kernel PCI DMA SVA" > > So, a special API call (that Dave can call from IDXD) to set up this > kernel PASID. With suitable documentation to explain the scope. > Maybe with a separate CONFIG option so it can be completely > stubbed out (IDXD does *NOT* "select" this option ... users have > to explicitly pick it). > > > I could easily see an admin option to "turn this off" entirely as > > being too dangerous, especially if the users have no interest in IDXD. > > And a kernel command line option to block IDXD from using that > special API ... for users on generic kernels who want to block > this use model (but still use IDXD in non-kernel cases). Users > who don't want IDXD at all can block loading of the driver. A generic IOMMU API should not be IDXD specific, if you want to allow some special "integrated to the SOC accelerator PASID" mode then it should be a global IOMMU API and any security toggles for it should be global and unrelated to IDXD. Concurrently it seems necessary to have a solution for secure kernel PASID use under the DMA API and reserve this special mode for getting higher performance. I think you need to come up with a proposal, and it is a bit alarming a noteworthy security hole was added under the guise of "kernel SVA" :( Jason
RE: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
> If you want this then be explicit about what it is you are making when > building the API. Don't try to hide it under some generic idea of > "kernel PCI DMA SVA" So, a special API call (that Dave can call from IDXD) to set up this kernel PASID. With suitable documentation to explain the scope. Maybe with a separate CONFIG option so it can be completely stubbed out (IDXD does *NOT* "select" this option ... users have to explicitly pick it). > I could easily see an admin option to "turn this off" entirely as > being too dangerous, especially if the users have no interest in IDXD. And a kernel command line option to block IDXD from using that special API ... for users on generic kernels who want to block this use model (but still use IDXD in non-kernel cases). Users who don't want IDXD at all can block loading of the driver. Would that work? -Tony
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 11:53:49AM -0700, Luck, Tony wrote: > I'd like people to think of DSA as an extension to the instruction > set. It implements asynchronous instructions like "MEMFILL" and > "MEMCOPY". These can be limited in scope when executed in user > processes or guests. But when executed by the host OS ring0 code > they can have all the same access that ring0 code has when it > dereferences a pointer. If you want this then be explicit about what it is you are making when building the API. Don't try to hide it under some generic idea of "kernel PCI DMA SVA" Add appropriately safe APIs that might have a chance of making it secure, eg by confirming it is a physically trusted on-socket device. It is not just Thunderbolt devices that could trigger this, many devices with programmable firmware can spoof PCI DID/VID - having an exploit chain that compromises an in-system FW device, uses it to fake IDXD HW, and then accesses the all-memory PASID is pretty alarming if the admin thought they had protection against DMA attacks. I could easily see an admin option to "turn this off" entirely as being too dangerous, especially if the users have no interest in IDXD. Which is why being explicit about intent is really important. Jason
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 02:33:03PM -0300, Jason Gunthorpe wrote: > The page table under the kernel PASID should behave the same way that > the kernel would operate the page table assigned to a kernel RID. > > If the kernel has security off then the PASID should map to all > physical memory, just like the RID does. > > If security is on then every DMA map needs to be loaded into the > PASID's io page table no different than a RID page table. > > "kernel SVA" is, IMHO, not a desirable thing, it completely destroys > the kernel's DMA security model. > > > If people want to use an accelerator on memory allocated by vmalloc() > > things will get more complicated. But maybe we can delay solving that > > problem until someone comes up with a real use case that needs to > > do this? > > If you have a HW limitation that the device can only issue TLPs > with a PASID, even for kernel users, then I think the proper thing is > to tell the IOMMU layer that a certain 'struct device' enters > PASID-only mode and the IOMMU layer should construct an appropriate > PASID and flow the dma operations through it. > > Pretending the DMA layer doesn't exist and that PASID gets a free pass > is not OK in the kernel. I can see why a tight security model is needed to stop random devices having access to memory that they should not be able to access. Now that PCIe devices can be plugged into Thunderbolt ports on computers, nobody wants to repeat the disaster that Firewire ports created for systems over a decade ago. But I'd like to challenge the one-size-fits-all policy. There's a big difference between a random device plugged into a port (which may even lie about its VendorID/DeviceID) and a device that is part of the CPU socket. I'd like people to think of DSA as an extension to the instruction set. It implements asynchronous instructions like "MEMFILL" and "MEMCOPY". These can be limited in scope when executed in user processes or guests.
But when executed by the host OS ring0 code they can have all the same access that ring0 code has when it dereferences a pointer. -Tony
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 04:44:14PM +, Luck, Tony wrote: > > For a shared workqueue, it can only generate DMA requests with a PASID. The > > submission is done by the ENQCMDS (S for supervisor) instruction. > > > > If we were not to share page tables with init_mm, we need a system PASID > > that does the same direct mapping in the IOMMU page tables. > > Note that for the currently envisioned kernel use cases for accelerators it > would be OK for this system PASID to just provide either: > > 1) A 1:1 mapping for physical addresses. Kernel users of the accelerators > would provide physical addresses in descriptors. > 2) The same mapping that the kernel uses for its "1:1" map of all physical > memory. Users would use kernel virtual addresses in that "1:1" range > (e.g. those obtained from page_to_virt(struct page *p);) Well, no, neither of those are OK. The page table under the kernel PASID should behave the same way that the kernel would operate the page table assigned to a kernel RID. If the kernel has security off then the PASID should map to all physical memory, just like the RID does. If security is on then every DMA map needs to be loaded into the PASID's io page table no different than a RID page table. "kernel SVA" is, IMHO, not a desirable thing, it completely destroys the kernel's DMA security model. > If people want to use an accelerator on memory allocated by vmalloc() > things will get more complicated. But maybe we can delay solving that > problem until someone comes up with a real use case that needs to > do this? If you have a HW limitation that the device can only issue TLPs with a PASID, even for kernel users, then I think the proper thing is to tell the IOMMU layer that a certain 'struct device' enters PASID-only mode and the IOMMU layer should construct an appropriate PASID and flow the dma operations through it. Pretending the DMA layer doesn't exist and that PASID gets a free pass is not OK in the kernel.
Jason
RE: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
> For a shared workqueue, it can only generate DMA requests with a PASID. The > submission is done by the ENQCMDS (S for supervisor) instruction. > > If we were not to share page tables with init_mm, we need a system PASID > that does the same direct mapping in the IOMMU page tables. Note that for the currently envisioned kernel use cases for accelerators it would be OK for this system PASID to just provide either: 1) A 1:1 mapping for physical addresses. Kernel users of the accelerators would provide physical addresses in descriptors. 2) The same mapping that the kernel uses for its "1:1" map of all physical memory. Users would use kernel virtual addresses in that "1:1" range (e.g. those obtained from page_to_virt(struct page *p);) If people want to use an accelerator on memory allocated by vmalloc() things will get more complicated. But maybe we can delay solving that problem until someone comes up with a real use case that needs to do this? -Tony
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Jason, On Thu, 13 May 2021 10:38:34 -0300, Jason Gunthorpe wrote: > On Thu, May 13, 2021 at 06:00:12AM -0700, Jacob Pan wrote: > > > > If you want to do SVA PASID then it also must come with DMA APIs to > > > > manage the CPU cache coherence that are all NOP's on x86. > > > > > > Yes. And we have plenty of precedent where an IOMMU is in "bypass" > > > mode to allow access to all memory and then uses the simple > > > dma-direct case. > > I agree it is better not to expose the entire direct map. But the > > missing piece of using DMA APIs is the PASID. The caller needs the > > PASID value to do work submission once the buffer is mapped. > > You still haven't explained why the kernel driver should have a PASID at > all. > For a shared workqueue, it can only generate DMA requests with a PASID. The submission is done by the ENQCMDS (S for supervisor) instruction. If we were not to share page tables with init_mm, we need a system PASID that does the same direct mapping in the IOMMU page tables. > Jason Thanks, Jacob
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote: > On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote: > > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote: > > > Again, I don't know enough about VDPA to make sense of that. Are we > > > essentially talking non-PCI virtual devices here? In which case you > > > could define the VDPA "bus" to always have one-device groups. > > > > It is much worse than that. > > > > What these non-PCI devices need is for the kernel driver to be part of > > the IOMMU group of the underlying PCI device but tell VFIO land that > > "groups don't matter" > > I don't really see a semantic distinction between "always one-device > groups" and "groups don't matter". Really the only way you can afford > to not care about groups is if they're singletons. The kernel driver under the mdev may not be in an "always one-device" group. It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers. The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way. The issue is a software one of tightly coupling IOMMU HW groups to VFIO's API and then introducing an entire class of VFIO mdev devices that no longer care about IOMMU HW groups at all. Currently mdev tries to get around this by creating singleton groups, but it is very ugly and very tightly coupled to a specific expectation of the few existing mdev drivers. Trying to add PASID made it a lot worse. > Aside: I'm primarily using "group" to mean the underlying hardware > unit, not the vfio construct on top of it, I'm not sure that's been > clear throughout. Sure, that is obviously fixed, but I'm not interested in that.
I'm interested in having a VFIO API that makes sense for vfio-pci, which has a tight coupling to the HW notion of an IOMMU, and also vfio mdevs that have no concept of a HW IOMMU group. > So.. your model assumes that every device has a safe quiescent state > where it won't do any harm until poked, whether its group is > currently kernel owned, or owned by a userspace that doesn't know > anything about it. This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remain parked and inaccessible. > At minimum this does mean that in order to use one device in the group > you must have permission to use *all* the devices in the group - > otherwise you may be able to operate a device you don't have > permission to by DMAing to its registers from a device you do have > permission to. If the administrator configures the system with different security labels for different VFIO devices then yes, removing groups makes this more tricky, as all devices in the group should have the same label. Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote: > On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote: > > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote: > > > > > This is a good feature in general when let's say there is a linux > > > supported > > > device which has a proprietary device firmware update tool which only > > > exists > > > as an x86 binary and your hardware is not x86 - running qemu + vfio in > > > full > > > emulation would provide a way to run the tool to update a physical device. > > > > That specific use case doesn't really need a vIOMMU though, does it? > > Possibly not, but the mechanics needed to do vIOMMU on different host > IOMMU aren't really different from what you need for a no-vIOMMU > guest. For very simple vIOMMUs this might be true, but the new features of nesting PASID, migration, etc, all make the vIOMMU complicated, and emulating it completely is a lot harder. Stuffing a vfio-pci into a guest and creating a physical map using a single IOASID is comparatively trivial. Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 04:01:20PM +1000, David Gibson wrote: > But.. even if you're exposing page tables to userspace.. with hardware > that has explicit support for nesting you can probably expose the hw > tables directly which is great for the cases that works for. But > surely for older IOMMUs which don't do nesting you must have some way > of shadowing guest IO page tables to host IO page tables to translate > GPA to HPA at least? I expect this would be in qemu and would be part of the expensive emulation I suggested: converting the guest's page table structure into a sequence of map/unmaps to a non-nestable IOASID. > If you're doing that, I don't see that converting page table format > is really any harder It isn't, but it is a completely different, custom flow from the normal HW-accelerated nesting. > It might not be a theoretically complete emulation of the vIOMMU, but > it can support in-practice usage. In particular it works pretty well > if your backend has a nice big IOVA range (like x86 IOMMUS) but your > guest platform typically uses relatively small IOVA windows. PAPR on > x86 is exactly that... well.. possibly not the 64-bit window, but > because of old PAPR platforms that didn't support that, we can choose > not to advertise that and guests will cope. So maybe this multi-window thing is a generic API somehow. You'll have to check what Kevin comes up with to ensure it fits in. Jason
[PATCH v4 8/8] iommu/dma: Reserve any RMR regions associated with a dev
Get ACPI IORT RMR regions associated with a dev reserved so that there is a unity mapping for them in SMMU. Signed-off-by: Shameer Kolothum --- drivers/iommu/dma-iommu.c | 66 --- 1 file changed, 61 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 2d9caf548a32..6838caf3e8ff 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -174,22 +174,78 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) } EXPORT_SYMBOL(iommu_put_dma_cookie); +static bool iommu_dma_dev_has_rmr(struct iommu_fwspec *fwspec, + struct iommu_rmr *e) +{ + int i; + + for (i = 0; i < fwspec->num_ids; i++) { + if (e->sid == fwspec->ids[i]) + return true; + } + + return false; +} + +static void iommu_dma_get_rmr_resv_regions(struct device *dev, + struct list_head *list) +{ + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); + struct list_head rmr_list; + struct iommu_rmr *rmr; + + INIT_LIST_HEAD(_list); + if (iommu_dma_get_rmrs(fwspec->iommu_fwnode, _list)) + return; + + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus); + + if (!host->preserve_config) + return; + } + + list_for_each_entry(rmr, _list, list) { + int prot = IOMMU_READ | IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO; + struct iommu_resv_region *region; + enum iommu_resv_type type; + + if (!iommu_dma_dev_has_rmr(fwspec, rmr)) + continue; + + if (rmr->flags & IOMMU_RMR_REMAP_PERMITTED) + type = IOMMU_RESV_DIRECT_RELAXABLE; + else + type = IOMMU_RESV_DIRECT; + + region = iommu_alloc_resv_region(rmr->base_address, +rmr->length, prot, +type); + if (!region) + return; + + list_add_tail(>list, list); + } +} /** * iommu_dma_get_resv_regions - Reserved region driver helper * @dev: Device from iommu_get_resv_regions() * @list: Reserved region list from iommu_get_resv_regions() * * IOMMU drivers can use this to implement their .get_resv_regions callback - * for general non-IOMMU-specific 
reservations. Currently, this covers GICv3 - * ITS region reservation on ACPI based ARM platforms that may require HW MSI - * reservation. + * for general non-IOMMU-specific reservations. Currently this covers, + * -GICv3 ITS region reservation on ACPI based ARM platforms that may + * require HW MSI reservation. + * -Any ACPI IORT RMR memory range reservations (IORT spec rev E.b) */ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list) { - if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode)) + if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode)) { iort_iommu_msi_get_resv_regions(dev, list); - + iommu_dma_get_rmr_resv_regions(dev, list); + } } EXPORT_SYMBOL(iommu_dma_get_resv_regions); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v4 7/8] iommu/arm-smmu: Get associated RMR info and install bypass SMR
From: Jon Nettleton Check if there is any RMR info associated with the devices behind the SMMU and if any, install bypass SMRs for them. This is to keep any ongoing traffic associated with these devices alive when we enable/reset SMMU during probe(). Signed-off-by: Jon Nettleton Signed-off-by: Steven Price Signed-off-by: Shameer Kolothum --- drivers/iommu/arm/arm-smmu/arm-smmu.c | 64 +++ 1 file changed, 64 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 6f72c4d208ca..f67aeb30b5ef 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -2042,6 +2042,66 @@ err_reset_platform_ops: __maybe_unused; return err; } +static void arm_smmu_rmr_install_bypass_smr(struct arm_smmu_device *smmu) +{ + struct list_head rmr_list; + struct iommu_rmr *e; + int i, cnt = 0; + u32 smr; + u32 reg; + + INIT_LIST_HEAD(_list); + if (iommu_dma_get_rmrs(dev_fwnode(smmu->dev), _list)) + return; + + reg = arm_smmu_gr0_read(smmu, ARM_SMMU_GR0_sCR0); + + if ((reg & ARM_SMMU_sCR0_USFCFG) && !(reg & ARM_SMMU_sCR0_CLIENTPD)) { + /* +* SMMU is already enabled and disallowing bypass, so preserve +* the existing SMRs +*/ + for (i = 0; i < smmu->num_mapping_groups; i++) { + smr = arm_smmu_gr0_read(smmu, ARM_SMMU_GR0_SMR(i)); + if (!FIELD_GET(ARM_SMMU_SMR_VALID, smr)) + continue; + smmu->smrs[i].id = FIELD_GET(ARM_SMMU_SMR_ID, smr); + smmu->smrs[i].mask = FIELD_GET(ARM_SMMU_SMR_MASK, smr); + smmu->smrs[i].valid = true; + } + } + + list_for_each_entry(e, _list, list) { + u32 sid = e->sid; + + i = arm_smmu_find_sme(smmu, sid, ~0); + if (i < 0) + continue; + if (smmu->s2crs[i].count == 0) { + smmu->smrs[i].id = sid; + smmu->smrs[i].mask = ~0; + smmu->smrs[i].valid = true; + } + smmu->s2crs[i].count++; + smmu->s2crs[i].type = S2CR_TYPE_BYPASS; + smmu->s2crs[i].privcfg = S2CR_PRIVCFG_DEFAULT; + smmu->s2crs[i].cbndx = 0xff; + + cnt++; + } + + if ((reg & ARM_SMMU_sCR0_USFCFG) && !(reg & 
ARM_SMMU_sCR0_CLIENTPD)) { + /* Remove the valid bit for unused SMRs */ + for (i = 0; i < smmu->num_mapping_groups; i++) { + if (smmu->s2crs[i].count == 0) + smmu->smrs[i].valid = false; + } + } + + dev_notice(smmu->dev, "\tpreserved %d boot mapping%s\n", cnt, + cnt == 1 ? "" : "s"); +} + static int arm_smmu_device_probe(struct platform_device *pdev) { struct resource *res; @@ -2168,6 +2228,10 @@ static int arm_smmu_device_probe(struct platform_device *pdev) } platform_set_drvdata(pdev, smmu); + + /* Check for RMRs and install bypass SMRs if any */ + arm_smmu_rmr_install_bypass_smr(smmu); + arm_smmu_device_reset(smmu); arm_smmu_test_smr_masks(smmu); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v4 6/8] iommu/arm-smmu-v3: Get associated RMR info and install bypass STE
Check if there is any RMR info associated with the devices behind the SMMUv3 and if any, install bypass STEs for them. This is to keep any ongoing traffic associated with these devices alive when we enable/reset SMMUv3 during probe(). Signed-off-by: Shameer Kolothum --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 + 1 file changed, 33 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index f9195b740f48..c2d2e65b9856 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3574,6 +3574,36 @@ static void __iomem *arm_smmu_ioremap(struct device *dev, resource_size_t start, return devm_ioremap_resource(dev, ); } +static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu) +{ + struct list_head rmr_list; + struct iommu_rmr *e; + int ret; + + INIT_LIST_HEAD(_list); + if (iommu_dma_get_rmrs(dev_fwnode(smmu->dev), _list)) + return; + + /* +* Since, we don't have a mechanism to differentiate the RMR +* SIDs that has an ongoing live stream, install bypass STEs +* for all the reported ones. +*/ + list_for_each_entry(e, _list, list) { + __le64 *step; + + ret = arm_smmu_init_sid_strtab(smmu, e->sid); + if (ret) { + dev_err(smmu->dev, "RMR bypass(0x%x) failed\n", + e->sid); + continue; + } + + step = arm_smmu_get_step_for_sid(smmu, e->sid); + arm_smmu_write_strtab_ent(NULL, e->sid, step, true); + } +} + static int arm_smmu_device_probe(struct platform_device *pdev) { int irq, ret; @@ -3657,6 +3687,9 @@ static int arm_smmu_device_probe(struct platform_device *pdev) /* Record our private device structure */ platform_set_drvdata(pdev, smmu); + /* Check for RMRs and install bypass STEs if any */ + arm_smmu_rmr_install_bypass_ste(smmu); + /* Reset the device */ ret = arm_smmu_device_reset(smmu, bypass); if (ret) -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v4 5/8] iommu/arm-smmu-v3: Add bypass flag to arm_smmu_write_strtab_ent()
By default, disable_bypass is set and any dev without an iommu domain installs an STE with CFG_ABORT during arm_smmu_init_bypass_stes(). Introduce a "bypass" flag to arm_smmu_write_strtab_ent() so that we can force it to install a CFG_BYPASS STE for specific SIDs. This will be useful in a follow-up patch to install bypass for IORT RMR SIDs.

Signed-off-by: Shameer Kolothum
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 754bad6092c1..f9195b740f48 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1174,7 +1174,7 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 }

 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      __le64 *dst)
+				      __le64 *dst, bool bypass)
 {
 	/*
 	 * This is hideously complicated, but we only really care about
@@ -1245,7 +1245,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	/* Bypass/fault */
 	if (!smmu_domain || !(s1_cfg || s2_cfg)) {
-		if (!smmu_domain && disable_bypass)
+		if (!smmu_domain && disable_bypass && !bypass)
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
@@ -1320,7 +1320,7 @@ static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent)
 	unsigned int i;

 	for (i = 0; i < nent; ++i) {
-		arm_smmu_write_strtab_ent(NULL, -1, strtab);
+		arm_smmu_write_strtab_ent(NULL, -1, strtab, false);
 		strtab += STRTAB_STE_DWORDS;
 	}
 }
@@ -2097,7 +2097,7 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 		if (j < i)
 			continue;
-		arm_smmu_write_strtab_ent(master, sid, step);
+		arm_smmu_write_strtab_ent(master, sid, step, false);
 	}
 }
--
2.17.1
[PATCH v4 4/8] iommu/arm-smmu-v3: Introduce strtab init helper
Introduce a helper to check the sid range and to init the l2 strtab entries(bypass). This will be useful when we have to initialize the l2 strtab with bypass for RMR SIDs. Signed-off-by: Shameer Kolothum --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++-- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 54b2f27b81d4..754bad6092c1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2369,6 +2369,19 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid) return sid < limit; } +static int arm_smmu_init_sid_strtab(struct arm_smmu_device *smmu, u32 sid) +{ + /* Check the SIDs are in range of the SMMU and our stream table */ + if (!arm_smmu_sid_in_range(smmu, sid)) + return -ERANGE; + + /* Ensure l2 strtab is initialised */ + if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) + return arm_smmu_init_l2_strtab(smmu, sid); + + return 0; +} + static int arm_smmu_insert_master(struct arm_smmu_device *smmu, struct arm_smmu_master *master) { @@ -2392,20 +2405,9 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu, new_stream->id = sid; new_stream->master = master; - /* -* Check the SIDs are in range of the SMMU and our stream table -*/ - if (!arm_smmu_sid_in_range(smmu, sid)) { - ret = -ERANGE; + ret = arm_smmu_init_sid_strtab(smmu, sid); + if (ret) break; - } - - /* Ensure l2 strtab is initialised */ - if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) { - ret = arm_smmu_init_l2_strtab(smmu, sid); - if (ret) - break; - } /* Insert into SID tree */ new_node = &(smmu->streams.rb_node); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v4 3/8] ACPI/IORT: Add a helper to retrieve RMR memory regions
Add a helper function that retrieves RMR memory descriptors associated with a given IOMMU. This will be used by IOMMU drivers to setup necessary mappings. Now that we have this, invoke it from the generic helper interface. Signed-off-by: Shameer Kolothum --- drivers/acpi/arm64/iort.c | 40 +++ drivers/iommu/dma-iommu.c | 3 +++ include/linux/acpi_iort.h | 7 +++ 3 files changed, 50 insertions(+) diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index fea1ffaedf3b..6ca88c38987a 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -12,6 +12,7 @@ #include #include +#include #include #include #include @@ -837,6 +838,43 @@ static inline int iort_add_device_replay(struct device *dev) return err; } +/** + * iort_iommu_get_rmrs - Helper to retrieve RMR info associated with IOMMU + * @iommu: fwnode for the IOMMU + * @head: RMR list head to be populated + * + * Returns: 0 on success, <0 failure + */ +int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode, + struct list_head *head) +{ + struct iort_rmr_entry *e; + struct acpi_iort_node *iommu; + + iommu = iort_get_iort_node(iommu_fwnode); + if (!iommu) + return -EINVAL; + + list_for_each_entry(e, _rmr_list, list) { + struct acpi_iort_rmr_desc *rmr_desc; + struct iommu_rmr *rmr; + + if (e->smmu != iommu) + continue; + + rmr_desc = e->rmr_desc; + rmr = iommu_dma_alloc_rmr(rmr_desc->base_address, + rmr_desc->length, e->sid, + e->flags); + if (!rmr) + return -ENOMEM; + + list_add_tail(>list, head); + } + + return 0; +} + /** * iort_iommu_msi_get_resv_regions - Reserved region driver helper * @dev: Device from iommu_get_resv_regions() @@ -1108,6 +1146,8 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) const struct iommu_ops *iort_iommu_configure_id(struct device *dev, const u32 *input_id) { return NULL; } +int iort_iommu_get_rmrs(struct fwnode_handle *fwnode, struct list_head *head) +{ return -ENODEV; } #endif static int nc_dma_get_range(struct device *dev, 
u64 *size) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 674bd8815159..2d9caf548a32 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -203,6 +203,9 @@ EXPORT_SYMBOL(iommu_dma_get_resv_regions); int iommu_dma_get_rmrs(struct fwnode_handle *iommu_fwnode, struct list_head *list) { + if (!is_of_node(iommu_fwnode)) + return iort_iommu_get_rmrs(iommu_fwnode, list); + return -EINVAL; } EXPORT_SYMBOL(iommu_dma_get_rmrs); diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index 1a12baa58e40..e8c45fa59531 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -39,6 +39,8 @@ const struct iommu_ops *iort_iommu_configure_id(struct device *dev, const u32 *id_in); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); phys_addr_t acpi_iort_dma_get_max_cpu_address(void); +int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode, + struct list_head *list); #else static inline void acpi_iort_init(void) { } static inline u32 iort_msi_map_id(struct device *dev, u32 id) @@ -59,6 +61,11 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) static inline phys_addr_t acpi_iort_dma_get_max_cpu_address(void) { return PHYS_ADDR_MAX; } + +static inline +int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode, + struct list_head *list) +{ return -ENODEV; } #endif #endif /* __ACPI_IORT_H__ */ -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v4 2/8] iommu/dma: Introduce generic helper to retrieve RMR info
Reserved Memory Regions (RMR) associated with an IOMMU can be described
through ACPI IORT tables in systems with devices that require a unity
mapping or bypass for those regions. Introduce a generic interface so
that IOMMU drivers can retrieve and set up necessary mappings.

Signed-off-by: Shameer Kolothum
---
 drivers/iommu/dma-iommu.c | 33 +++++++++++++++++++++++++++++++++
 include/linux/dma-iommu.h | 10 ++++++++++
 include/linux/iommu.h     | 19 +++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..674bd8815159 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -193,6 +193,39 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
 }
 EXPORT_SYMBOL(iommu_dma_get_resv_regions);
 
+/**
+ * iommu_dma_get_rmrs - Retrieve Reserved Memory Regions (RMRs) associated
+ *			with a given IOMMU
+ * @iommu_fwnode: fwnode associated with IOMMU
+ * @list: RMR list to be populated
+ */
+int iommu_dma_get_rmrs(struct fwnode_handle *iommu_fwnode,
+		       struct list_head *list)
+{
+	return -EINVAL;
+}
+EXPORT_SYMBOL(iommu_dma_get_rmrs);
+
+struct iommu_rmr *iommu_dma_alloc_rmr(u64 base, u64 length, u32 sid,
+				      u32 flags)
+{
+	struct iommu_rmr *rmr;
+
+	rmr = kzalloc(sizeof(*rmr), GFP_KERNEL);
+	if (!rmr)
+		return NULL;
+
+	INIT_LIST_HEAD(&rmr->list);
+	rmr->base_address = base;
+	rmr->length = length;
+	rmr->sid = sid;
+	rmr->flags = flags;
+
+	return rmr;
+}
+EXPORT_SYMBOL(iommu_dma_alloc_rmr);
+
 static int cookie_init_hw_msi_region(struct iommu_dma_cookie *cookie,
 		phys_addr_t start, phys_addr_t end)
 {
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 6e75a2d689b4..319f332c279f 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -42,12 +42,17 @@ void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
 
 extern bool iommu_dma_forcedac;
 
+int iommu_dma_get_rmrs(struct fwnode_handle *iommu, struct list_head *list);
+struct iommu_rmr *iommu_dma_alloc_rmr(u64 base, u64 length, u32 sid,
+				      u32 flags);
+
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
 struct msi_desc;
 struct msi_msg;
 struct device;
+struct fwnode_handle;
+struct iommu_rmr;
 
 static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base,
 		u64 size)
@@ -83,5 +88,10 @@ static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
 {
 }
 
+int iommu_dma_get_rmrs(struct fwnode_handle *iommu, struct list_head *list)
+{
+	return -ENODEV;
+}
+
 #endif	/* CONFIG_IOMMU_DMA */
 #endif	/* __DMA_IOMMU_H */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..73cd2831cb45 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -555,6 +555,25 @@ struct iommu_sva {
 	struct device *dev;
 };
 
+/**
+ * struct iommu_rmr - Reserved Memory Region details per IOMMU
+ * @list: Linked list pointers to hold RMR region info
+ * @base_address: base address of Reserved Memory Region
+ * @length: length of memory region
+ * @sid: associated stream id
+ * @flags: flags that apply to the RMR node
+ */
+struct iommu_rmr {
+	struct list_head	list;
+	phys_addr_t		base_address;
+	u64			length;
+	u32			sid;
+	u32			flags;
+};
+
+/* RMR Remap permitted */
+#define IOMMU_RMR_REMAP_PERMITTED	(1 << 0)
+
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
 		      const struct iommu_ops *ops);
 void iommu_fwspec_free(struct device *dev);
-- 
2.17.1
[PATCH v4 0/8] ACPI/IORT: Support for IORT RMR node
Hi,

v3 --> v4:
- Included the SMMUv2 SMR bypass install changes suggested by Steve
  (patch #7).
- As per Robin's comments, the RMR reserve implementation is now more
  generic (patch #8); dropped v3 patches 8 and 10.
- Rebased to 5.13-rc1.

The whole series is available here:
https://github.com/hisilicon/kernel-dev/tree/private-v5.13-rc1-rmr-v4-ext

RFC v2 --> v3:
- Dropped the RFC tag as the ACPICA header changes are now ready to be
  part of 5.13[0]. But this series still has a dependency on that patch.
- Added IORT E.b related changes (node flags, _DSM function 5 checks for
  PCIe).
- Changed RMR to stream id mapping from M:N to M:1 as per the spec and
  the discussion here[1].
- The last two patches add support for SMMUv2 (thanks to Jon Nettleton!).

Sanity tested on a HiSilicon D06. Further testing and feedback is
greatly appreciated.

https://github.com/hisilicon/kernel-dev/tree/private-v5.12-rc8-rmr-v3

Thanks,
Shameer

[0] https://lore.kernel.org/linux-acpi/20210406213028.718796-22-erik.kan...@intel.com/
[1] https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-April/000150.html

RFC v1 --> v2:
- Added a generic interface for IOMMU drivers to retrieve all the RMR
  info associated with a given IOMMU.
- The SMMUv3 driver gets the RMR list during probe() and installs bypass
  STEs for all the SIDs in the RMR list. This is to keep any ongoing
  traffic alive during SMMUv3 reset, based on the suggestions received
  for v1 to take care of the EFI framebuffer use case. Only sanity
  tested for now.
- During device probe/attach, the SMMUv3 driver reserves any RMR region
  associated with the device such that there is a unity mapping for it
  in the SMMU.

---
From RFC v1:

The series adds support for IORT RMR nodes specified in IORT Revision
E - ARM DEN 0049E[0]. RMR nodes are used to describe memory ranges that
are used by endpoints and require a unity mapping in the SMMU.

We have faced issues with 3408iMR RAID controller cards, which fail to
boot when the SMMU is enabled. This is because these controllers make
use of host memory for various caching-related purposes, and when the
SMMU is enabled the iMR firmware fails to access these memory regions
as there is no mapping for them. IORT RMR provides a way for UEFI to
describe and report these memory regions so that the kernel can make a
unity mapping for them in the SMMU.

Tests:

With a UEFI that reports the RMR for the dev:

[16F0h 5872   1]                    Type : 06
[16F1h 5873   2]                  Length : 007C
[16F3h 5875   1]                Revision : 00
[1038h 0056   2]                Reserved :
[1038h 0056   2]              Identifier :
[16F8h 5880   4]           Mapping Count : 0001
[16FCh 5884   4]          Mapping Offset : 0040
[1700h 5888   4] Number of RMR Descriptors : 0002
[1704h 5892   4]   RMR Descriptor Offset : 0018
[1708h 5896   8]     Base Address of RMR : E640
[1710h 5904   8]           Length of RMR : 0010
[1718h 5912   4]                Reserved :
[171Ch 5916   8]     Base Address of RMR : 27B0
[1724h 5924   8]           Length of RMR : 00C0
[172Ch 5932   4]                Reserved :
[1730h 5936   4]              Input base :
[1734h 5940   4]                ID Count : 0001
[1738h 5944   4]             Output Base : 0003
[173Ch 5948   4]        Output Reference : 0064
[1740h 5952   4]   Flags (decoded below) : 0001
                          Single Mapping : 1
...

Without the series, the RAID controller initialization fails as below:

...
[   12.631117] megaraid_sas :03:00.0: FW supports sync cache: Yes
[   12.637360] megaraid_sas :03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
[   18.776377] megaraid_sas :03:00.0: Init cmd return status FAILED for SCSI host 0
[   23.019383] megaraid_sas :03:00.0: Waiting for FW to come to ready state
[  106.684281] megaraid_sas :03:00.0: FW in FAULT state, Fault code:0x1 subcode:0x0 func:megasas_transition_to_ready
[  106.695186] megaraid_sas :03:00.0: System Register set:
[  106.889787] megaraid_sas :03:00.0: Failed to transition controller to ready for scsi0.
[  106.910475] megaraid_sas :03:00.0: Failed from megasas_init_fw 6407
estuary:/$

With the series, the kernel now has a direct mapping for the dev as
below:

estuary:/$ cat /sys/kernel/iommu_groups/0/reserved_regions
0x0800 0x080f msi
0x27b0 0x286f direct
[PATCH v4 1/8] ACPI/IORT: Add support for RMR node parsing
Add support for parsing RMR node information from ACPI. Find the
associated stream id and SMMU node info from the RMR node and populate a
linked list with the RMR memory descriptors.

Signed-off-by: Shameer Kolothum
---
 drivers/acpi/arm64/iort.c | 104 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3912a1f6058e..fea1ffaedf3b 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -40,6 +40,19 @@ struct iort_fwnode {
 static LIST_HEAD(iort_fwnode_list);
 static DEFINE_SPINLOCK(iort_fwnode_lock);
 
+/*
+ * One entry for IORT RMR.
+ */
+struct iort_rmr_entry {
+	struct list_head list;
+	u32 sid;
+	struct acpi_iort_node *smmu;
+	struct acpi_iort_rmr_desc *rmr_desc;
+	u32 flags;
+};
+
+static LIST_HEAD(iort_rmr_list);	/* list of RMR regions from ACPI */
+
 /**
  * iort_set_fwnode() - Create iort_fwnode and use it to register
  *		       iommu data in the iort_fwnode_list
@@ -393,7 +406,8 @@ static struct acpi_iort_node *iort_node_get_id(struct acpi_iort_node *node,
 		if (node->type == ACPI_IORT_NODE_NAMED_COMPONENT ||
 		    node->type == ACPI_IORT_NODE_PCI_ROOT_COMPLEX ||
 		    node->type == ACPI_IORT_NODE_SMMU_V3 ||
-		    node->type == ACPI_IORT_NODE_PMCG) {
+		    node->type == ACPI_IORT_NODE_PMCG ||
+		    node->type == ACPI_IORT_NODE_RMR) {
 			*id_out = map->output_base;
 			return parent;
 		}
@@ -1660,6 +1674,91 @@ static void __init iort_enable_acs(struct acpi_iort_node *iort_node)
 #else
 static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { }
 #endif
 
+static int iort_rmr_desc_valid(struct acpi_iort_rmr_desc *desc, u32 count)
+{
+	int i, j;
+
+	for (i = 0; i < count; i++) {
+		u64 end, start = desc[i].base_address, length = desc[i].length;
+
+		if (!IS_ALIGNED(start, SZ_64K) || !IS_ALIGNED(length, SZ_64K))
+			return -EINVAL;
+
+		end = start + length - 1;
+
+		/* Check for address overlap */
+		for (j = i + 1; j < count; j++) {
+			u64 e_start = desc[j].base_address;
+			u64 e_end = e_start + desc[j].length - 1;
+
+			if (start <= e_end && end >= e_start)
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int __init iort_parse_rmr(struct acpi_iort_node *iort_node)
+{
+	struct acpi_iort_node *smmu;
+	struct iort_rmr_entry *e;
+	struct acpi_iort_rmr *rmr;
+	struct acpi_iort_rmr_desc *rmr_desc;
+	u32 map_count = iort_node->mapping_count;
+	u32 sid;
+	int i, ret = 0;
+
+	if (iort_node->type != ACPI_IORT_NODE_RMR)
+		return 0;
+
+	if (!iort_node->mapping_offset || map_count != 1) {
+		pr_err(FW_BUG "Invalid ID mapping, skipping RMR node %p\n",
+		       iort_node);
+		return -EINVAL;
+	}
+
+	/* Retrieve associated smmu and stream id */
+	smmu = iort_node_get_id(iort_node, &sid, 0);
+	if (!smmu) {
+		pr_err(FW_BUG "Invalid SMMU reference, skipping RMR node %p\n",
+		       iort_node);
+		return -EINVAL;
+	}
+
+	/* Retrieve RMR data */
+	rmr = (struct acpi_iort_rmr *)iort_node->node_data;
+	if (!rmr->rmr_offset || !rmr->rmr_count) {
+		pr_err(FW_BUG "Invalid RMR descriptor array, skipping RMR node %p\n",
+		       iort_node);
+		return -EINVAL;
+	}
+
+	rmr_desc = ACPI_ADD_PTR(struct acpi_iort_rmr_desc, iort_node,
+				rmr->rmr_offset);
+
+	ret = iort_rmr_desc_valid(rmr_desc, rmr->rmr_count);
+	if (ret) {
+		pr_err(FW_BUG "Invalid RMR descriptor[%d] for node %p, skipping...\n",
+		       i, iort_node);
+		return ret;
+	}
+
+	for (i = 0; i < rmr->rmr_count; i++, rmr_desc++) {
+		e = kmalloc(sizeof(*e), GFP_KERNEL);
+		if (!e)
+			return -ENOMEM;
+
+		e->sid = sid;
+		e->smmu = smmu;
+		e->rmr_desc = rmr_desc;
+		e->flags = rmr->flags;
+
+		list_add_tail(&e->list, &iort_rmr_list);
+	}
+
+	return 0;
+}
 
 static void __init iort_init_platform_devices(void)
 {
@@ -1689,6 +1788,9 @@ static void __init iort_init_platform_devices(void)
 
 		iort_enable_acs(iort_node);
 
+		if (iort_table->revision == 3)
+			iort_parse_rmr(iort_node);
+
 		ops = iort_get_dev_cfg(iort_node);
 		if (ops) {
 			fwnode = acpi_alloc_fwnode_static();
-- 
2.17.1
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
On Thu, May 13, 2021 at 06:00:12AM -0700, Jacob Pan wrote:
> > > If you want to do SVA PASID then it also must come with DMA APIs to
> > > manage the CPU cache coherence that are all NOP's on x86.
> >
> > Yes. And we have plenty of precedent where an IOMMU is in "bypass" mode
> > to allow access to all memory and then uses the simple dma-direct case.
>
> I agree it is better not to expose the entire direct map. But the missing
> piece of using DMA APIs is the PASID. The caller needs the PASID value to
> do work submission once the buffer is mapped.

You still haven't explained why the kernel driver should have a PASID at
all.

Jason
Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Christoph,

On Wed, 12 May 2021 07:37:40 +0100, Christoph Hellwig wrote:

> On Tue, May 11, 2021 at 04:47:26PM -0300, Jason Gunthorpe wrote:
> > > Let me try to break down your concerns:
> > > 1. portability - a driver using DMA APIs can function w/ and w/o an
> > > IOMMU. Is that your concern? But PASID is intrinsically tied to the
> > > IOMMU, and if the drivers are using a generic sva-lib API, why are
> > > they not portable? SVA by its definition is to avoid map/unmap every
> > > time.
> >
> > The kernel explicitly does not support this programming model. All DMA
> > is explicit and the DMA API hides platform details like the IOMMU and
> > CPU cache coherence. Just because x86 doesn't care about this doesn't
> > make any of it optional.
>
> Exactly.

Perhaps we can view these SVA-capable devices as an FPU, or a device
that can be fused onto the CPU by PASID binding. We don't require DMA
map/unmap for the FPU to use kernel virtual addresses, right?

> > If you want to do SVA PASID then it also must come with DMA APIs to
> > manage the CPU cache coherence that are all NOP's on x86.
>
> Yes. And we have plenty of precedent where an IOMMU is in "bypass" mode
> to allow access to all memory and then uses the simple dma-direct case.

I agree it is better not to expose the entire direct map. But the
missing piece of using the DMA APIs is the PASID. The caller needs the
PASID value to do work submission once the buffer is mapped. Perhaps we
can add a parameter to the DMA API to specify the address space? As
Jason suggested, the definition of IOASID, which represents a page
table. Just my quick thought; can you help us with a viable solution?

I know we are supposed to hide the IOMMU by using DMA APIs, which makes
drivers portable w/ and w/o an IOMMU. This IOASID can be optional.

Thanks,
Jacob
Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework
On 5/13/21 6:58 PM, Keqian Zhu wrote:
> On 2021/5/12 19:36, Lu Baolu wrote:
>> Hi keqian,
>>
>> On 5/12/21 4:44 PM, Keqian Zhu wrote:
>>> On 2021/5/12 11:20, Lu Baolu wrote:
>>>> On 5/11/21 3:40 PM, Keqian Zhu wrote:
>>>>>> For upper layers, before starting page tracking, they check the
>>>>>> dirty_page_trackable attribution of the domain and start it only if
>>>>>> it's capable. Once the page tracking is switched on, the vendor
>>>>>> iommu driver (or iommu core) should block further device
>>>>>> attach/detach operations until page tracking is stopped.
>>>>> But when a domain becomes capable after detaching a device, the
>>>>> upper layer still needs to query it and enable dirty log for it...
>>>>>
>>>>> To make things coordinated, maybe the upper layer can register a
>>>>> notifier, so when the domain's capability changes, the upper layer
>>>>> does not need to query; instead they just need to implement a
>>>>> callback, and do their specific policy in the callback. What do you
>>>>> think?
>>>>
>>>> That might be an option. But why not check the domain's attribution
>>>> every time a new tracking period is about to start?
>>> Hi Baolu,
>>>
>>> I'll add an attribution in iommu_domain, and the vendor iommu driver
>>> will update the attribution when attaching/detaching devices.
>>>
>>> The attribute should be protected by a lock, so the upper layer
>>> shouldn't access the attribute directly. Then
>>> iommu_domain_support_dirty_log() still should be retained. Does this
>>> design look good to you?
>>
>> Yes, that's what I was thinking of. But I am not sure whether it is
>> worth a lock here. It seems not to be valid behavior for the upper
>> layer to attach or detach any device while doing the dirty page
>> tracking.
> Hi Baolu,
>
> Right, if the "detach|attach" interfaces and "dirty tracking"
> interfaces can be called concurrently, a lock in
> iommu_domain_support_dirty_log() is still not enough. I will add
> another note for the dirty tracking interfaces. Do you have other
> suggestions?
>
> I will accelerate the progress, so I plan to send out v5 next week.
No further comments except the below nit:

"iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking"

How about splitting it into

- iommu_start_dirty_log()
- iommu_stop_dirty_log()

Not a strong opinion anyway.

Best regards,
baolu
Re: [PATCH 3/6] vfio: remove the unused mdev iommu hook
On Thu, May 13, 2021 at 03:28:52AM +, Tian, Kevin wrote:
> Are you specially concerned about this iommu_device hack which
> directly connects mdev_device to the iommu layer, or the entire removed
> logic including the aux domain concept? For the former, we are now
> following up the referred thread to find a clean way. But for the
> latter, we feel it's still necessary regardless of how the iommu
> interface is redesigned to support device connection from the upper
> level driver. The reason is that with mdev or a subdevice, one physical
> device could be attached to multiple domains now. There could be a
> primary domain with DOMAIN_DMA type for DMA_API use by the parent
> driver itself, and multiple auxiliary domains with DOMAIN_UNMANAGED
> type for subdevices assigned to different VMs.

Why do we need more domains than just the physical domain for the
parent? How does the auxdomain appear in /dev/ioasid?

I never understood what this dead code was supposed to be used for.

Jason
Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework
On 2021/5/12 19:36, Lu Baolu wrote:
> Hi keqian,
>
> On 5/12/21 4:44 PM, Keqian Zhu wrote:
>> On 2021/5/12 11:20, Lu Baolu wrote:
>>> On 5/11/21 3:40 PM, Keqian Zhu wrote:
>>>>> For upper layers, before starting page tracking, they check the
>>>>> dirty_page_trackable attribution of the domain and start it only if
>>>>> it's capable. Once the page tracking is switched on, the vendor
>>>>> iommu driver (or iommu core) should block further device
>>>>> attach/detach operations until page tracking is stopped.
>>>> But when a domain becomes capable after detaching a device, the
>>>> upper layer still needs to query it and enable dirty log for it...
>>>>
>>>> To make things coordinated, maybe the upper layer can register a
>>>> notifier, so when the domain's capability changes, the upper layer
>>>> does not need to query; instead they just need to implement a
>>>> callback, and do their specific policy in the callback. What do you
>>>> think?
>>>
>>> That might be an option. But why not check the domain's attribution
>>> every time a new tracking period is about to start?
>> Hi Baolu,
>>
>> I'll add an attribution in iommu_domain, and the vendor iommu driver
>> will update the attribution when attaching/detaching devices.
>>
>> The attribute should be protected by a lock, so the upper layer
>> shouldn't access the attribute directly. Then
>> iommu_domain_support_dirty_log() still should be retained. Does this
>> design look good to you?
>
> Yes, that's what I was thinking of. But I am not sure whether it is
> worth a lock here. It seems not to be valid behavior for the upper
> layer to attach or detach any device while doing the dirty page
> tracking.
Hi Baolu,

Right, if the "detach|attach" interfaces and "dirty tracking" interfaces
can be called concurrently, a lock in iommu_domain_support_dirty_log()
is still not enough. I will add another note for the dirty tracking
interfaces. Do you have other suggestions?

I will accelerate the progress, so I plan to send out v5 next week.

Thanks,
Keqian
[PATCH 1/1] iommu: Delete a duplicate check in iommu_change_dev_def_domain()
The function iommu_group_store_type() is the only caller of the static
function iommu_change_dev_def_domain(), and it has already performed an
"if (WARN_ON(!group))" check before calling it. So the check here is
redundant.

Signed-off-by: Zhen Lei
---
 drivers/iommu/iommu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 971068da67cb91d..8cdf6a1c4bfd773 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3059,9 +3059,6 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 	int ret, dev_def_dom;
 	struct device *dev;
 
-	if (!group)
-		return -EINVAL;
-
 	mutex_lock(&group->mutex);
 
 	if (group->default_domain != group->domain) {
-- 
2.26.0.106.g9fadedd
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: David Gibson
> Sent: Thursday, May 13, 2021 2:01 PM
>
> > But this definitely all becomes HW specific.
> >
> > For instance I want to have an ARM vIOMMU driver, it needs to do some
> >
> >   ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
> >   if (ret == -EOPNOTSUPP)
> >       ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
> >       // and do completely different and more expensive emulation
> >
> > I can get a little bit of generality, but at the end of the day the
> > IOMMU must create a specific HW layout of the nested page table; if it
> > can't, it can't.
>
> Erm.. I don't really know how your IOASID interface works here. I'm
> thinking about the VFIO interface where maps and unmaps are via
> explicit ioctl()s, which provides an obvious point to do translation
> between page table formats.

We are working on a draft uAPI proposal based on the discussions in this
thread. Expect to send it out next week.

Thanks,
Kevin
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
>
> > This is a good feature in general when, let's say, there is a
> > Linux-supported device which has a proprietary device firmware update
> > tool which only exists as an x86 binary and your hardware is not x86 -
> > running qemu + vfio in full emulation would provide a way to run the
> > tool to update a physical device.
>
> That specific use case doesn't really need a vIOMMU though, does it?

Possibly not, but the mechanics needed to do vIOMMU on a different host
IOMMU aren't really different from what you need for a no-vIOMMU guest.
With a vIOMMU you need to map guest IOVA space into the host IOVA
space. With no vIOMMU you need to map guest physical addresses into the
host IOVA space. In either case the GPA/gIOVA to userspace and userspace
to HPA mappings are basically arbitrary.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > Again, I don't know enough about VDPA to make sense of that. Are we
> > essentially talking non-PCI virtual devices here? In which case you
> > could define the VDPA "bus" to always have one-device groups.
>
> It is much worse than that.
>
> What these non-PCI devices need is for the kernel driver to be part of
> the IOMMU group of the underlying PCI device but tell VFIO land that
> "groups don't matter"

I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter". Really the only way you can afford
to not care about groups is if they're singletons.

> Today mdev tries to fake this by using singleton iommu groups, but it
> is really horrible and directly hacks up the VFIO IOMMU code to
> understand these special cases. Intel was proposing more special
> hacking in the VFIO IOMMU code to extend this to PASID.

At this stage I don't really understand why that would end up so
horrible.

> When we get to a /dev/ioasid this is all nonsense. The kernel device
> driver is going to have to tell drivers/iommu exactly what kind of
> ioasid it can accept, be it a PASID inside a kernel owned group, a SW
> emulated 'mdev' ioasid, or whatever.
>
> In these cases the "group" idea has become a fiction that just creates
> a pain.

I don't see how the group is a fiction in this instance. You can still
have devices that can't be isolated, therefore you can have
non-singleton groups.

> "Just reorganize VDPA to do something insane with the driver
> core so we can create a dummy group to satisfy an unnecessary uAPI
> restriction" is not a very compelling argument.
>
> So if the nonsensical groups go away for PASID/mdev, where does it
> leave the uAPI in other cases?
>
> > I don't think simplified-but-wrong is a good goal. The thing about
> > groups is that if they're there, you can't just "not care" about them;
> > they affect you whether you like it or not.
>
> You really can. If one thing claims the group then all the other group
> devices become locked out.

Aside: I'm primarily using "group" to mean the underlying hardware
unit, not the vfio construct on top of it; I'm not sure that's been
clear throughout.

So.. your model assumes that every device has a safe quiescent state
where it won't do any harm until poked, whether its group is currently
kernel owned, or owned by a userspace that doesn't know anything about
it.

At minimum this does mean that in order to use one device in the group
you must have permission to use *all* the devices in the group -
otherwise you may be able to operate a device you don't have permission
to by DMAing to its registers from a device you do have permission to.

Whatever scripts are managing ownership of devices also need to know
about groups, because they need to put all the devices into that
quiescent state before the group can change ownership.

> The main point to understand is that groups are NOT an application
> restriction! It is a whole system restriction that the operator needs
> to understand and deal with. This is why things like dpdk don't care
> about the group at all - there is nothing they can do with the
> information.
>
> If the operator says to run dpdk on a specific device then the
> operator is the one that has to deal with all the other devices in the
> group getting locked out.

Ok, I think I see your point there.

> At best the application can make it more obvious that the operator is
> doing something dangerous, but the current kernel API doesn't seem to
> really support that either.
>
> Jason

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 03:15:37PM -0300, Jason Gunthorpe wrote:
> On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > > There is a certain appeal to having some
> > > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of
> > > > > extra information like windows that can be optionally called by
> > > > > the viommu driver and it remains well defined and described.
> > > >
> > > > Windows really aren't ppc specific. They're absolutely there on x86
> > > > and everything else as well - it's just that people are used to
> > > > having a window at 0.. that you can often get away with treating it
> > > > sloppily.
> > >
> > > My point is this detailed control seems to go on to more than just
> > > windows. As you say the vIOMMU is emulating specific HW that needs to
> > > have kernel interfaces to match it exactly.
> >
> > It's really not that bad. The case of emulating the PAPR vIOMMU on
> > something else is relatively easy, because all updates to the IO page
> > tables go through hypercalls. So, as long as the backend IOMMU can
> > map all the IOVAs that the guest IOMMU can, then qemu's implementation
> > of those hypercalls just needs to put an equivalent mapping in the
> > backend, which it can do with a generic VFIO_DMA_MAP.
>
> So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

Well, I don't want to preclude it in the API. I'm not sure about that
specific example, but in most cases it should be possible to run the
PAPR vIOMMU on an x86 IOMMU backend. Obviously only something you'd
want to do for testing and experimentation, but it could be quite
useful for that.

> > vIOMMUs with page tables in guest memory are harder, but only really
> > in the usual ways that a vIOMMU of that type is harder (needs cache
> > mode or whatever). At whatever point you need to shadow from the
> > guest IO page tables to the host backend, you can again do that with
> > generic maps, as long as the backend supports the necessary IOVAs, and
> > has an IO page size that's equal to or a submultiple of the vIOMMU
> > page size.
>
> But this definitely all becomes HW specific.
>
> For instance I want to have an ARM vIOMMU driver, it needs to do some
>
>   ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
>   if (ret == -EOPNOTSUPP)
>       ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
>       // and do completely different and more expensive emulation
>
> I can get a little bit of generality, but at the end of the day the
> IOMMU must create a specific HW layout of the nested page table; if it
> can't, it can't.

Erm.. I don't really know how your IOASID interface works here. I'm
thinking about the VFIO interface where maps and unmaps are via
explicit ioctl()s, which provides an obvious point to do translation
between page table formats.

But.. even if you're exposing page tables to userspace.. with hardware
that has explicit support for nesting you can probably expose the hw
tables directly, which is great for the cases that works for. But
surely for older IOMMUs which don't do nesting you must have some way
of shadowing guest IO page tables to host IO page tables to translate
GPA to HPA at least? If you're doing that, I don't see that converting
the page table format is really any harder.

> > > I'm remarking that trying to unify every HW IOMMU implementation that
> > > ever has/will exist into a generic API complete enough to allow the
> > > vIOMMU to be created is likely to result in an API too complicated to
> > > understand..
> >
> > Maybe not every one, but I think we can get a pretty wide range with a
> > reasonable interface.
>
> It sounds like a reasonable guideline is if the feature is actually
> general to all IOMMUs and can be used by qemu as part of a vIOMMU
> emulation when compatible vIOMMU HW is not available.
>
> Having 'requested window' support that isn't actually implemented in
> every IOMMU is going to mean the PAPR vIOMMU emulation won't work,
> defeating the whole point of making things general?

The trick is that you don't necessarily need dynamic window support in
the backend to emulate it in the vIOMMU. If your backend has fixed
windows, then you emulate request window as:

	if (requested window is within backend windows)
		no-op;
	else
		return ERROR;

It might not be a theoretically complete emulation of the vIOMMU, but
it can support in-practice usage. In particular it works pretty well
if your backend has a nice big IOVA range (like x86 IOMMUs) but your
guest platform typically uses relatively small IOVA windows. PAPR on
x86 is exactly that... well.. possibly not the 64-bit window, but
because of old PAPR platforms that didn't support that, we can choose
not to advertise that and guests will cope.

-- 
David Gibson