Re: [PATCH v15 07/12] iommu/smmuv3: Implement cache_invalidate

2021-05-13 Thread Kunkun Jiang




On 2021/4/11 19:12, Eric Auger wrote:

Implement domain-selective, PASID-selective and page-selective
IOTLB invalidations.

Signed-off-by: Eric Auger 

---
v14 -> v15:
- remove the redundant arm_smmu_cmdq_issue_sync(smmu)
   in IOMMU_INV_GRANU_ADDR case (Zenghui)
- if RIL is not supported by the host, make sure the granule_size
   that is passed by the userspace is supported or fix it
   (Chenxiang)

v13 -> v14:
- Add domain invalidation
- do global inval when asid is not provided with addr
   granularity

v7 -> v8:
- ASID based invalidation using iommu_inv_pasid_info
- check ARCHID/PASID flags in addr based invalidation
- use __arm_smmu_tlb_inv_context and __arm_smmu_tlb_inv_range_nosync

v6 -> v7
- check the uapi version

v3 -> v4:
- adapt to changes in the uapi
- add support for leaf parameter
- do not use arm_smmu_tlb_inv_range_nosync or arm_smmu_tlb_inv_context
   anymore

v2 -> v3:
- replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync

v1 -> v2:
- properly pass the asid
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +
  1 file changed, 89 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 56a301fbe75a..bfc112cc0d38 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2961,6 +2961,94 @@ static void arm_smmu_detach_pasid_table(struct 
iommu_domain *domain)
mutex_unlock(&smmu_domain->init_mutex);
  }
  
+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+ struct iommu_cache_invalidate_info *inv_info)
+{
+   struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL};
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -EINVAL;
+
+   if (!smmu)
+   return -EINVAL;
+
+   if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+   return -EINVAL;
+
+   if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID ||
+   inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) {
+   return -ENOENT;
+   }
+
+   if (!(inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB))
+   return -EINVAL;
+
+   /* IOTLB invalidation */
+
+   switch (inv_info->granularity) {
+   case IOMMU_INV_GRANU_PASID:
+   {
+   struct iommu_inv_pasid_info *info =
+   &inv_info->granu.pasid_info;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+   if (!(info->flags & IOMMU_INV_PASID_FLAGS_ARCHID))
+   return -EINVAL;
+
+   __arm_smmu_tlb_inv_context(smmu_domain, info->archid);
+   return 0;
+   }
+   case IOMMU_INV_GRANU_ADDR:
+   {
+   struct iommu_inv_addr_info *info = &inv_info->granu.addr_info;
+   size_t granule_size  = info->granule_size;
+   size_t size = info->nb_granules * info->granule_size;
+   bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+   int tg;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+
+   if (!(info->flags & IOMMU_INV_ADDR_FLAGS_ARCHID))
+   break;
+
+   tg = __ffs(granule_size);
+   if (granule_size & ~(1 << tg))
+   return -EINVAL;

This check looks like it confirms that the granule_size is a power of 2.
Does the granule_size have to be a power of 2?
I think the case where the granule_size is not a power of 2 should also be
handled correctly.

+   /*
+* When RIL is not supported, make sure the granule size that is
+* passed is supported. In RIL mode, this is enforced in
+* __arm_smmu_tlb_inv_range()
+*/
+   if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+   !(granule_size & smmu_domain->domain.pgsize_bitmap)) {
+   tg = __ffs(smmu_domain->domain.pgsize_bitmap);
+   granule_size = 1 << tg;
+   size = size >> tg;

Why does size need to be shifted tg bits to the right?

Thanks,
Kunkun Jiang

+   }
+
+   arm_smmu_tlb_inv_range_domain(info->addr, size,
+ granule_size, leaf,
+ info->archid, smmu_domain);
+   return 0;
+   }
+   case IOMMU_INV_GRANU_DOMAIN:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* Global S1 invalidation */
+   cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
+   arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+   arm_smmu_cmdq_issue_sync(smmu);
+   return 0;
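
For reference, a small stand-alone sketch of the arithmetic behind the two
questions above, with hypothetical numbers (a 64kB granule reported by
userspace against a host that only supports 4kB pages). It only illustrates
what the quoted hunk computes; it is not driver code:

#include <stdio.h>

int main(void)
{
	unsigned long granule_size = 0x10000;		/* 64kB from userspace */
	unsigned long size = 4 * granule_size;		/* nb_granules = 4 -> 256kB */
	unsigned long pgsize_bitmap = 0x1000;		/* host supports only 4kB */
	int tg = __builtin_ffsl(granule_size) - 1;	/* like __ffs(): tg = 16 */

	/* the power-of-2 check: reject a granule with more than one bit set */
	if (granule_size & ~(1UL << tg))
		return 1;

	/* no RIL and 64kB not in pgsize_bitmap: fall back to 4kB */
	if (!(granule_size & pgsize_bitmap)) {
		tg = __builtin_ffsl(pgsize_bitmap) - 1;	/* tg = 12 */
		granule_size = 1UL << tg;		/* 4kB */
		size >>= tg;	/* 256kB >> 12 = 64, i.e. a count of 4kB
				 * granules rather than a size in bytes,
				 * which is what the question above is about */
	}

	printf("granule=%#lx size=%#lx\n", granule_size, size);
	return 0;
}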

Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework

2021-05-13 Thread Keqian Zhu



On 2021/5/13 20:02, Lu Baolu wrote:
> On 5/13/21 6:58 PM, Keqian Zhu wrote:
>>
>>
>> On 2021/5/12 19:36, Lu Baolu wrote:
>>> Hi keqian,
>>>
>>> On 5/12/21 4:44 PM, Keqian Zhu wrote:


 On 2021/5/12 11:20, Lu Baolu wrote:
> On 5/11/21 3:40 PM, Keqian Zhu wrote:
>>> For upper layers, before starting page tracking, they check the
>>> dirty_page_trackable attribute of the domain and start it only if it's
>>> capable. Once page tracking is switched on, the vendor iommu driver
>>> (or iommu core) should block further device attach/detach operations
>>> until page tracking is stopped.
>> But when a domain becomes capable after detaching a device, the upper layer
>> still needs to query it and enable dirty log for it...
>>
>> To make things coordinated, maybe the upper layer can register a notifier:
>> when the domain's capability changes, the upper layer does not need to
>> query; instead, it just needs to implement a callback and apply its
>> specific policy in the callback. What do you think?
>>
>
> That might be an option. But why not check the domain's attribute every
> time a new tracking period is about to start?
 Hi Baolu,

 I'll add an attribute in iommu_domain, and the vendor iommu driver will
 update the attribute when attaching/detaching devices.

 The attribute should be protected by a lock, so the upper layer shouldn't
 access the attribute directly. Then iommu_domain_support_dirty_log() should
 still be retained. Does this design look good to you?
>>>
>>> Yes, that's what I was thinking of. But I am not sure whether it is worth
>>> a lock here. It does not seem to be valid behavior for the upper layer to
>>> attach or detach any device while doing dirty page tracking.
>> Hi Baolu,
>>
>> Right, if the "detach|attach" interfaces and "dirty tracking" interfaces can
>> be called concurrently, a lock in iommu_domain_support_dirty_log() is still
>> not enough. I will add another note for the dirty tracking interfaces.
>>
>> Do you have other suggestions? I will accelerate the progress, so I plan to
>> send out v5 next week.
> 
> No further comments except the nit below:
> 
> "iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking"
> 
> How about splitting it into
>  - iommu_start_dirty_log()
>  - iommu_stop_dirty_log()
Yeah, actually this is my original version, and the "switch" style was
suggested by Yi Sun. Anyway, I think both are OK, and the "switch" style can
reduce some code.

Thanks,
Keqian

> 
> Not a strong opinion anyway.
> 
> Best regards,
> baolu
> .
> 
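
For illustration, a minimal sketch of the attribute-plus-lock arrangement
discussed above. All names here are made up for the example; the real RFC
helper is iommu_domain_support_dirty_log(), and the real RFC keeps the flag
inside iommu_domain with the vendor driver updating it on attach/detach:

#include <linux/mutex.h>
#include <linux/types.h>

/* Hypothetical stand-in for the state the RFC would keep in iommu_domain. */
struct example_dirty_state {
	struct mutex lock;	/* serializes attach/detach vs. this query */
	bool trackable;		/* updated by the vendor driver on (de)attach */
};

/* Upper layers query the attribute through a helper instead of reading
 * it directly, as discussed above. */
static bool example_domain_support_dirty_log(struct example_dirty_state *s)
{
	bool ret;

	mutex_lock(&s->lock);
	ret = s->trackable;
	mutex_unlock(&s->lock);

	return ret;
}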


Re: [RFC PATCH v5 5/7] iommu/vt-d: Fixup delivery mode of the HPET hardlockup interrupt

2021-05-13 Thread Ricardo Neri
On Wed, May 05, 2021 at 01:03:18AM +0200, Thomas Gleixner wrote:
> On Tue, May 04 2021 at 12:10, Ricardo Neri wrote:

Thank you very much for your feedback, Thomas. I am sorry it took me a
while to reply to your email. I needed to digest and research your
comments.

> > In x86 there is not an IRQF_NMI flag that can be used to indicate the
> 
> There exists no IRQF_NMI flag at all. No architecture provides that.

Thank you for the clarification. I think I meant to say that there is a
request_nmi() function but AFAIK it is only used in the ARM PMU and
would not work on x86.

> 
> > delivery mode when requesting an interrupt (via request_irq()). Thus,
> > there is no way for the interrupt remapping driver to know and set
> > the delivery mode.
> 
> There is no support for this today. So what?

Using request_irq() plus an HPET quirk looked to me like a reasonable
way to use the irqdomain hierarchy to allocate an interrupt with NMI as
the delivery mode.

> 
> > Hence, when allocating an interrupt, check if such interrupt belongs to
> > the HPET hardlockup detector and fixup the delivery mode accordingly.
> 
> What?
> 
> > +   /*
> > +* If we find the HPET hardlockup detector irq, fixup the
> > +* delivery mode.
> > +*/
> > +   if (is_hpet_irq_hardlockup_detector(info))
> > +   irq_cfg->delivery_mode = APIC_DELIVERY_MODE_NMI;
> 
> Again. We are not sticking some random device checks into that
> code. It's wrong and I explained it to you before.
> 
>   
> https://lore.kernel.org/lkml/alpine.deb.2.21.1906161042080.1...@nanos.tec.linutronix.de/
> 
> But I'm happy to repeat it again:
> 
>   "No. This is horrible hackery violating all the layering which we carefully
>put into place to avoid exactly this kind of sprinkling conditionals into
>all code pathes.
> 
>With some thought the existing irqdomain hierarchy can be used to achieve
>the same thing without tons of extra functions and conditionals."
> 
> So the outcome of thought and using the irqdomain hierarchy is:
> 
>Replacing an hpet specific conditional in one place with an hpet
>specific conditional in a different place.
> 
> Impressive.

I am sorry, Thomas. I did try to make the quirk less hacky but I did not
think of the solution you provided below.

> 
> hpet_assign_irq(, bool nmi)
>   init_info(info)
> ...
> if (nmi)
> info.flags |= X86_IRQ_ALLOC_AS_NMI;
>   
>irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, )
>  intel_irq_remapping_alloc(..., info)
>irq_domain_alloc_irq_parents(..., info)
>  x86_vector_alloc_irqs(..., info)
>  {   
>if (info->flags & X86_IRQ_ALLOC_AS_NMI && nr_irqs != 1)
>   return -EINVAL;
> 
>for (i = 0; i < nr_irqs; i++) {
>  
>  if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
>  irq_cfg_setup_nmi(apicd);
>  continue;
>  }
>  ...
>  }
> 
> irq_cfg_setup_nmi() sets irq_cfg->delivery_mode and whatever is required
> and everything else just works. Of course this needs a few other minor
> tweaks but none of those introduces random hpet quirks all over the
> place. Not convoluted enough, right?

Thanks for the detailed demonstration! It does seem cleaner than what I
implemented.

> 
> But that solves none of other problems. Let me summarize again which
> options or non-options we have:
> 
> 1) Selective IPIs from NMI context cannot work
> 
>As explained in the other thread.
> 
> 2) Shorthand IPI allbutself from NMI
> 
>This should work, but that obviously does not take the watchdog
>cpumask into account.
> 
>Also this only works when IPI shorthand mode is enabled. See
>apic_smt_update() for details.
> 
> 3) Sending the IPIs from irq_work
> 
>This would solve the problem, but if the CPU which is the NMI
>target is really stuck in an interrupt disabled region then the
>IPIs won't be sent.
> 
>OTOH, if that's the case then the CPU which was processing the
>NMI will continue to be stuck until the next NMI hits which
>will detect that the CPU is stuck which is a good enough
>reason to send a shorthand IPI to all CPUs ignoring the
>watchdog cpumask.
> 
>Same limitation vs. shorthand mode as #2
> 
> 4) Changing affinity of the HPET NMI from NMI
> 
>As we established two years ago that cannot work with interrupt
>remapping
> 
> 5) Changing affinity of the HPET NMI from irq_work
> 
>Same issues as #3
> 
> Anything else than #2 is just causing more problems than it solves, but
> surely the NOHZ_FULL/isolation people might have opinions on this.
> 
> OTOH, as this is opt-in, anything which wants a watchdog mask which is
> not the full online set, has to accept that HPET has these restrictions.
> 
> And that's exactly what I suggested two 

Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jacob Pan
Hi Jason,

On Thu, 13 May 2021 19:31:22 -0300, Jason Gunthorpe  wrote:

> On Thu, May 13, 2021 at 01:22:51PM -0700, Jacob Pan wrote:
> > Hi Tony,
> > 
> > On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" 
> > wrote:
> >   
> > > On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote:  
> > > > It seems there are two options:
> > > > 1. Add a new IOMMU API to set up a system PASID with a *separate*
> > > > IOMMU page table/domain, mark the device is PASID only with a flag.
> > > > Use DMA APIs to explicit map/unmap. Based on this PASID-only flag,
> > > > Vendor IOMMU driver will decide whether to use system PASID domain
> > > > during map/unmap. Not clear if we also need to make IOVA==kernel VA.
> > > > 
> > > > 2. Add a new IOMMU API to setup a system PASID which points to
> > > > init_mm.pgd. This API only allows trusted device to bind with the
> > > > system PASID at its own risk. There is no need for DMA API. This is
> > > > the same as the current code except with an explicit API.
> > > > 
> > > > Which option?
> > > 
> > > Option #1 looks cleaner to me. Option #2 gives access to bits
> > > of memory that the users of system PASID shouldn't ever need
> > > to touch ... just map regions of memory that the kernel has
> > > a "struct page" for.
> > > 
> > > What does "use DMA APIs to explicitly map/unmap" mean? Is that
> > > for the whole region?
> > >   
> > If we map the entire kernel direct map during system PASID setup, then
> > we don't need to use DMA API to map/unmap certain range.
> > 
> > I was thinking this system PASID page table could be on-demand. The
> > mapping is built by explicit use of DMA map/unmap APIs.  
> 
> Option 1 should be the PASID works exactly like a normal RID and uses
> all the normal DMA APIs and IOMMU mechanisms, whatever the platform
> implements. This might mean an iommu update on every operation or not.
>  
> > > I'm expecting that once this system PASID has been initialized,
> > > then any accelerator device with a kernel use case would use the
> > > same PASID. I.e. DSA for page clearing, IAX for ZSwap compression
> > > & decompression, etc.
> > >   
> > OK, sounds like we have to map the entire kernel VA with struct page as
> > you said. So we still by-pass DMA APIs, can we all agree on that?  
> 
> Option 2 should be the faster option, but not available in all cases.
> 
> Option 1 isn't optional. DMA and IOMMU code has to be portable and
> this is the portable API.
> 
> If you want to do option 1 and option 2 then give it a go, but in most
> common cases with the IOMMU in a direct map you shouldn't get a
> notable performance win.
> 
Looks like we are converging. Let me summarize the takeaways:
1. Remove IOMMU_SVA_BIND_SUPERVISOR flag from this patch, in fact there
will be no flags at all for iommu_sva_bind_device()
2. Remove all supervisor SVA related vt-d, idxd code.
3. Create API iommu_setup_system_pasid_direct_map(option_flag)
if (option_flag == 1)
iommu_domain_alloc(IOMMU_DOMAIN_DMA);
if (option_flag == 2)
iommu_domain_alloc(IOMMU_DOMAIN_DIRECT); //new domain type?
setup IOMMU page tables mirroring the direct map
4. Create API iommu_enable_dev_direct_map(struct dev, , )
- Drivers call this API to get the system PASID and which option is
available on the system PASID
- mark device as PASID only, perhaps a new flag in struct
device->dev_iommu->pasid_only = 1
5. DMA API IOMMU vendor ops will take action based on the pasid_only flag to
decide if the mapping is for system PASID page tables.

Does it make sense?


> Jason


Thanks,

Jacob
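
For illustration, a rough sketch of how a driver might consume option 1 from
the summary above. iommu_enable_dev_direct_map() and its argument list are
only the proposal from item 4, not an existing API, and the surrounding
helper is hypothetical:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Hypothetical driver-side flow for the proposed option 1. */
static int example_accel_dma(struct device *dev, void *buf, size_t len)
{
	dma_addr_t dma;
	u32 pasid;
	int option, ret;

	/* proposed in item 4 above: mark the device PASID-only and return
	 * the system PASID plus which option backs it */
	ret = iommu_enable_dev_direct_map(dev, &pasid, &option);
	if (ret)
		return ret;

	if (option == 1) {
		/* option 1: the system PASID behaves like a normal RID, so
		 * each buffer goes through the regular DMA API */
		dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, dma))
			return -ENOMEM;
		/* ...program the accelerator with pasid and dma... */
		dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
	} else {
		/* option 2: page tables mirror the direct map, so no
		 * per-buffer DMA API calls are needed */
	}

	return 0;
}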


[PATCH v2 05/22] PCI/P2PDMA: Print a warning if the host bridge is not in the whitelist

2021-05-13 Thread Logan Gunthorpe
If the host bridge is not in the whitelist print a warning in the
calc_map_type_and_dist_warn() path detailing the vendor and device IDs
that would need to be added to the whitelist.

Suggested-by: Don Dutile 
Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 175da3a9c8fb..0e0b2218eacd 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -340,7 +340,7 @@ static struct pci_dev *pci_host_bridge_dev(struct 
pci_host_bridge *host)
 }
 
 static bool __host_bridge_whitelist(struct pci_host_bridge *host,
-   bool same_host_bridge)
+   bool same_host_bridge, bool warn)
 {
struct pci_dev *root = pci_host_bridge_dev(host);
const struct pci_p2pdma_whitelist_entry *entry;
@@ -361,6 +361,10 @@ static bool __host_bridge_whitelist(struct pci_host_bridge 
*host,
return true;
}
 
+   if (warn)
+   pci_warn(root, "Host bridge not in P2PDMA whitelist: %04x:%04x\n",
+vendor, device);
+
return false;
 }
 
@@ -368,16 +372,17 @@ static bool __host_bridge_whitelist(struct 
pci_host_bridge *host,
  * If we can't find a common upstream bridge take a look at the root
  * complex and compare it to a whitelist of known good hardware.
  */
-static bool host_bridge_whitelist(struct pci_dev *a, struct pci_dev *b)
+static bool host_bridge_whitelist(struct pci_dev *a, struct pci_dev *b,
+ bool warn)
 {
struct pci_host_bridge *host_a = pci_find_host_bridge(a->bus);
struct pci_host_bridge *host_b = pci_find_host_bridge(b->bus);
 
if (host_a == host_b)
-   return __host_bridge_whitelist(host_a, true);
+   return __host_bridge_whitelist(host_a, true, warn);
 
-   if (__host_bridge_whitelist(host_a, false) &&
-   __host_bridge_whitelist(host_b, false))
+   if (__host_bridge_whitelist(host_a, false, warn) &&
+   __host_bridge_whitelist(host_b, false, warn))
return true;
 
return false;
@@ -513,7 +518,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct 
pci_dev *client,
 
if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) {
if (!cpu_supports_p2pdma() &&
-   !host_bridge_whitelist(provider, client))
+   !host_bridge_whitelist(provider, client, acs_redirects))
map_type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
}
 
-- 
2.20.1



[PATCH v2 08/22] dma-mapping: Allow map_sg() ops to return negative error codes

2021-05-13 Thread Logan Gunthorpe
Allow dma_map_sgtable() to pass errors from the map_sg() ops. This
will be required for returning appropriate error codes when mapping
P2PDMA memory.

Introduce __dma_map_sg_attrs() which will return the raw error code
from the map_sg operation (whether it be negative or zero). Then add a
dma_map_sg_attrs() wrapper to convert any negative errors to zero to
satisfy the existing calling convention.

dma_map_sgtable() will convert a zero error return for old map_sg() ops
into a -EINVAL return and return any negative errors as reported.

This allows map_sg implementations to start returning multiple
negative error codes. Legacy map_sg implementations can continue
to return zero until they are all converted.
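
As a sketch of what the two conventions look like from a caller's point of
view (illustrative only; the helper below is not part of the patch):

#include <linux/dma-mapping.h>

/* Illustrative caller: contrasts the sgtable and legacy sg conventions. */
static int example_map(struct device *dev, struct sg_table *sgt)
{
	int ret;

	/* dma_map_sgtable(): 0 on success, a negative errno on failure;
	 * a legacy map_sg op returning 0 is converted to -EINVAL */
	ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	if (ret)
		return ret;		/* error detail is preserved */
	dma_unmap_sgtable(dev, sgt, DMA_TO_DEVICE, 0);

	/* dma_map_sg() keeps the old convention: number of mapped entries
	 * on success and 0 on any error, with the detail thrown away */
	if (!dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE))
		return -EIO;
	dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);

	return 0;
}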

Signed-off-by: Logan Gunthorpe 
---
 include/linux/dma-map-ops.h |  8 ++--
 include/linux/dma-mapping.h | 41 -
 kernel/dma/mapping.c| 13 +---
 3 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..eaa969be8284 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -41,8 +41,12 @@ struct dma_map_ops {
size_t size, enum dma_data_direction dir,
unsigned long attrs);
/*
-* map_sg returns 0 on error and a value > 0 on success.
-* It should never return a value < 0.
+* map_sg should return a negative error code on error.
+* dma_map_sgtable() will return the error code returned and convert
+* a zero return (for legacy implementations) into -EINVAL.
+*
+* dma_map_sg() will always return zero on any negative or zero
+* return to satisfy its own calling convention.
 */
int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 183e7103a66d..2b0ecf0aa4df 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -105,7 +105,7 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct 
page *page,
unsigned long attrs);
 void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir, unsigned long attrs);
-int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
+int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
 void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
  int nents, enum dma_data_direction dir,
@@ -164,7 +164,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, 
dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
 }
-static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
+static inline int __dma_map_sg_attrs(struct device *dev, struct scatterlist 
*sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
return 0;
@@ -343,6 +343,34 @@ static inline void dma_sync_single_range_for_device(struct 
device *dev,
return dma_sync_single_for_device(dev, addr + offset, size, dir);
 }
 
+/**
+ * dma_map_sg_attrs - Map the given buffer for DMA
+ * @dev:   The device for which to perform the DMA operation
+ * @sg:The sg_table object describing the buffer
+ * @dir:   DMA direction
+ * @attrs: Optional DMA attributes for the map operation
+ *
+ * Maps a buffer described by a scatterlist passed in the sg argument with
+ * nents segments for the @dir DMA operation by the @dev device.
+ *
+ * Returns the number of mapped entries (which can be less than nents)
+ * on success. Zero is returned for any error.
+ *
+ * dma_unmap_sg_attrs() should be used to unmap the buffer with the
+ * original sg and original nents (not the value returned by this function).
+ */
+static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
+   int nents, enum dma_data_direction dir, unsigned long attrs)
+{
+   int ret;
+
+   ret = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
+   if (ret < 0)
+   ret = 0;
+
+   return ret;
+}
+
 /**
  * dma_map_sgtable - Map the given buffer for DMA
  * @dev:   The device for which to perform the DMA operation
@@ -357,16 +385,19 @@ static inline void 
dma_sync_single_range_for_device(struct device *dev,
  * ownership of the buffer back to the CPU domain before touching the
  * buffer by the CPU.
  *
- * Returns 0 on success or -EINVAL on error during mapping the buffer.
+ * Returns 0 on success or a negative error code on error
  */
 static inline int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
 {
int nents;

[PATCH v2 07/22] PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagemap and device

2021-05-13 Thread Logan Gunthorpe
All callers of pci_p2pdma_map_type() have a struct dev_pgmap and a
struct device (of the client doing the DMA transfer). Thus move the
conversion to struct pci_devs for the provider and client into this
function.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4034ffa0eb06..6c2057cd3f37 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -844,14 +844,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool 
publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
-   struct pci_dev *client)
+static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+   struct device *dev)
 {
+   struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
enum pci_p2pdma_map_type ret;
+   struct pci_dev *client;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
+   if (!dev_is_pci(dev))
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+
+   client = to_pci_dev(dev);
+
ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
  map_types_idx(client)));
if (ret == PCI_P2PDMA_MAP_UNKNOWN)
@@ -892,14 +899,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 {
struct pci_p2pdma_pagemap *p2p_pgmap =
to_p2p_pgmap(sg_page(sg)->pgmap);
-   struct pci_dev *client;
-
-   if (WARN_ON_ONCE(!dev_is_pci(dev)))
-   return 0;
 
-   client = to_pci_dev(dev);
-
-   switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
+   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
case PCI_P2PDMA_MAP_BUS_ADDR:
@@ -922,17 +923,9 @@ EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
enum pci_p2pdma_map_type map_type;
-   struct pci_dev *client;
-
-   if (WARN_ON_ONCE(!dev_is_pci(dev)))
-   return;
-
-   client = to_pci_dev(dev);
 
-   map_type = pci_p2pdma_map_type(p2p_pgmap->provider, client);
+   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
 
if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-- 
2.20.1



[PATCH v2 06/22] PCI/P2PDMA: Attempt to set map_type if it has not been set

2021-05-13 Thread Logan Gunthorpe
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change it's no longer invalid to call pci_p2pdma_map_sg()
before the mapping type is calculated so drop the WARN_ON when that
is the case.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0e0b2218eacd..4034ffa0eb06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -847,11 +847,17 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
struct pci_dev *client)
 {
+   enum pci_p2pdma_map_type ret;
+
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
-   return xa_to_value(xa_load(&provider->p2pdma->map_types,
-  map_types_idx(client)));
+   ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+ map_types_idx(client)));
+   if (ret == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist_warn(provider, client, NULL);
+
+   return ret;
 }
 
 static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -899,7 +905,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
return 0;
}
 }
-- 
2.20.1



[PATCH v2 12/22] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

2021-05-13 Thread Logan Gunthorpe
Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segments dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as P2PDMA.

Using this bit requires adding an additional dependency on CONFIG_64BIT to
CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes and roughly require the
extra address space for memory BARs used in the transactions.
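
As a hedged sketch of how the bit is meant to be consumed on the unmap side
(illustrative only; the real dma_[un]map_sg_p2pdma() changes land in later
patches of this series):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Illustrative unmap loop: bus-address segments carry the new page_link
 * bit and have nothing to tear down in the IOMMU/dma-direct layer. */
static void example_unmap_sg(struct device *dev, struct scatterlist *sgl,
			     int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (sg_is_dma_pci_p2pdma(sg)) {
			sg_dma_unmark_pci_p2pdma(sg);
			continue;
		}
		dma_unmap_page(dev, sg_dma_address(sg), sg_dma_len(sg), dir);
	}
}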

Signed-off-by: Logan Gunthorpe 
Reviewed-by: John Hubbard 
---
 drivers/pci/Kconfig |  2 +-
 include/linux/scatterlist.h | 50 ++---
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..90b4bddb3300 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -163,7 +163,7 @@ config PCI_PASID
 
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
-   depends on ZONE_DEVICE
+   depends on ZONE_DEVICE && 64BIT
select GENERIC_ALLOCATOR
help
  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f70572b2938..562c0aa264f5 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -58,6 +58,21 @@ struct sg_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a PCI
+ * bus address when doing P2PDMA.
+ * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_PCI_P2PDMA  0x04UL
+#else
+#define SG_DMA_PCI_P2PDMA  0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -66,7 +81,9 @@ struct sg_table {
 #define sg_is_chain(sg)((sg)->page_link & SG_CHAIN)
 #define sg_is_last(sg) ((sg)->page_link & SG_END)
 #define sg_chain_ptr(sg)   \
-   ((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+   ((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK))
+
+#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -80,13 +97,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-   unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+   unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
 
/*
 * In order for the low bit stealing approach to work, pages
 * must be aligned at a 32-bit boundary as a minimum.
 */
-   BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+   BUG_ON((unsigned long) page & SG_PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
 #endif
@@ -120,7 +137,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
 #endif
-   return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+   return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
 }
 
 /**
@@ -222,6 +239,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a PCI bus address.
+ **/
+static inline void sg_dma_mark_pci_p2pdma(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_PCI_P2PDMA;
+}
+
+/**
+ * sg_dma_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ *   Clears the PCI P2PDMA mark
+ **/
+static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_PCI_P2PDMA;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.20.1


[PATCH v2 11/22] dma-iommu: Return error code from iommu_dma_map_sg()

2021-05-13 Thread Logan Gunthorpe
Pass through appropriate error codes from iommu_dma_map_sg() now that
the error code will be passed through dma_map_sgtable().

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..0d69f509a95d 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -970,7 +970,7 @@ static int iommu_dma_map_sg_swiotlb(struct device *dev, 
struct scatterlist *sg,
 
 out_unmap:
iommu_dma_unmap_sg_swiotlb(dev, sg, i, dir, attrs | 
DMA_ATTR_SKIP_CPU_SYNC);
-   return 0;
+   return -EINVAL;
 }
 
 /*
@@ -991,11 +991,14 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
+   ssize_t ret;
int i;
 
-   if (static_branch_unlikely(_deferred_attach_enabled) &&
-   iommu_deferred_attach(dev, domain))
-   return 0;
+   if (static_branch_unlikely(_deferred_attach_enabled)) {
+   ret = iommu_deferred_attach(dev, domain);
+   if (ret)
+   return ret;
+   }
 
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
@@ -1043,14 +1046,17 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
}
 
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
-   if (!iova)
+   if (!iova) {
+   ret = -ENOMEM;
goto out_restore_sg;
+   }
 
/*
 * We'll leave any physical concatenation to the IOMMU driver's
 * implementation - it knows better than we do.
 */
-   if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len)
+   ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot);
+   if (ret < iova_len)
goto out_free_iova;
 
return __finalise_sg(dev, sg, nents, iova);
@@ -1059,7 +1065,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
__invalidate_sg(sg, nents);
-   return 0;
+   return ret;
 }
 
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
-- 
2.20.1



[PATCH v2 09/22] dma-direct: Return appropriate error code from dma_direct_map_sg()

2021-05-13 Thread Logan Gunthorpe
Now that the map_sg() op expects error codes instead of return zero on
error, convert dma_direct_map_sg() to return an error code. The
only error to return presently is EINVAL if a page could not
be mapped.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..803ee9321170 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -411,7 +411,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return 0;
+   return -EINVAL;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
-- 
2.20.1



[PATCH v2 10/22] iommu: Return full error code from iommu_map_sg[_atomic]()

2021-05-13 Thread Logan Gunthorpe
Convert to ssize_t return code so the return code from __iommu_map()
can be returned all the way down through dma_iommu_map_sg().

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/iommu.c | 15 +++
 include/linux/iommu.h | 22 +++---
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 808ab70d5df5..df428206ea6e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2567,9 +2567,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-struct scatterlist *sg, unsigned int nents, int 
prot,
-gfp_t gfp)
+static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+   struct scatterlist *sg, unsigned int nents, int prot,
+   gfp_t gfp)
 {
const struct iommu_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
@@ -2610,19 +2610,18 @@ static size_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
/* undo mappings already done */
iommu_unmap(domain, iova, mapped);
 
-   return 0;
-
+   return ret;
 }
 
-size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot)
+ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+struct scatterlist *sg, unsigned int nents, int prot)
 {
might_sleep();
return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(iommu_map_sg);
 
-size_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
+ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot)
 {
return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..9369458ba1bd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -414,11 +414,11 @@ extern size_t iommu_unmap(struct iommu_domain *domain, 
unsigned long iova,
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size,
   struct iommu_iotlb_gather *iotlb_gather);
-extern size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-  struct scatterlist *sg,unsigned int nents, int prot);
-extern size_t iommu_map_sg_atomic(struct iommu_domain *domain,
- unsigned long iova, struct scatterlist *sg,
- unsigned int nents, int prot);
+extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+   struct scatterlist *sg, unsigned int nents, int prot);
+extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
+  unsigned long iova, struct scatterlist *sg,
+  unsigned int nents, int prot);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t 
iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler, void *token);
@@ -679,18 +679,18 @@ static inline size_t iommu_unmap_fast(struct iommu_domain 
*domain,
return 0;
 }
 
-static inline size_t iommu_map_sg(struct iommu_domain *domain,
- unsigned long iova, struct scatterlist *sg,
- unsigned int nents, int prot)
+static inline ssize_t iommu_map_sg(struct iommu_domain *domain,
+  unsigned long iova, struct scatterlist *sg,
+  unsigned int nents, int prot)
 {
-   return 0;
+   return -ENODEV;
 }
 
-static inline size_t iommu_map_sg_atomic(struct iommu_domain *domain,
+static inline ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
  unsigned long iova, struct scatterlist *sg,
  unsigned int nents, int prot)
 {
-   return 0;
+   return -ENODEV;
 }
 
 static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
-- 
2.20.1



[PATCH v2 13/22] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static

2021-05-13 Thread Logan Gunthorpe
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of actually doing the mapping to create the actual iommu mapping.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 24 +-
 include/linux/pci-p2pdma.h | 41 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 6c2057cd3f37..0568604fd8b4 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,13 +20,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -844,8 +837,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
enum pci_p2pdma_map_type ret;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..caac2d023f8f 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,40 @@
 struct block_device;
 struct scatterlist;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* whitelist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* whitelist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
@@ -30,6 +64,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
 int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -83,6 +119,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
 static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
-- 
2.20.1


[PATCH v2 22/22] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2021-05-13 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 65 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 92 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index b0d779aeade9..767122e0a43f 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -873,71 +873,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return ret;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) - p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index e5a8d5bc0f51..0c33a40a86e7 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -72,10 +72,6 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct 
scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
   struct scatterlist *sg);
@@ -135,17 +131,6 @@ pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct 
device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
   struct scatterlist *sg)
@@ -181,16 +166,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
return pci_p2pmem_find_many(, 1);
 }
 
-static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir)
-{
-   return 

[PATCH v2 14/22] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2021-05-13 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes an scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to lookup into the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.
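
As a sketch of the calling pattern the kernel-doc below describes, for a
simple non-IOMMU dma_map_sg() implementation (the surrounding function is
illustrative, not code from this series):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

/* Illustrative map_sg() body: the state is declared outside the loop and
 * zero-initialized, as required by pci_p2pdma_map_segment(). */
static int example_map_sg(struct device *dev, struct scatterlist *sgl,
			  int nents, enum dma_data_direction dir)
{
	struct pci_p2pdma_map_state p2pdma_state = {};
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (is_pci_p2pdma_page(sg_page(sg))) {
			switch (pci_p2pdma_map_segment(&p2pdma_state, dev, sg)) {
			case PCI_P2PDMA_MAP_BUS_ADDR:
				continue;	/* dma_address already set */
			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
				break;		/* map normally below */
			default:
				return -EREMOTEIO;
			}
		}
		/* ...normal mapping of this segment goes here... */
	}

	return nents;
}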

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 59 ++
 include/linux/pci-p2pdma.h | 21 ++
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0568604fd8b4..b0d779aeade9 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -938,6 +938,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-iommu dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_pci_p2pdma(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a dma address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the dma address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_pci_p2pdma(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index caac2d023f8f..e5a8d5bc0f51 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -13,6 +13,12 @@
 
 #include 
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 struct block_device;
 struct scatterlist;
 
@@ -70,6 +76,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -135,6 +146,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device 
*dev,
unsigned long attrs)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct 

[PATCH v2 17/22] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-05-13 Thread Logan Gunthorpe
When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 66 +++
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 0d69f509a95d..7cc6650e1fa5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -891,6 +892,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -988,6 +999,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = >iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1023,6 +1036,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1045,6 +1087,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1071,7 +1116,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-   dma_addr_t start, end;
+   dma_addr_t end, start = 

[PATCH v2 16/22] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support

2021-05-13 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
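
A short usage sketch for the new helper (illustrative; not part of the
patch):

#include <linux/dma-mapping.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Illustrative: only look for P2P memory if the DMA layer of the client
 * device advertises DMA_F_PCI_P2PDMA_SUPPORTED (or uses dma-direct). */
static struct pci_dev *example_find_p2pmem(struct device *client)
{
	if (!dma_pci_p2pdma_supported(client))
		return NULL;

	return pci_p2pmem_find(client);
}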

Signed-off-by: Logan Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index eaa969be8284..b7b51b89b054 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2b0ecf0aa4df..1149540e969f 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -138,6 +138,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -242,6 +243,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 700a2bb7cc9e..2940977b1030 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -658,6 +658,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.20.1



[PATCH v2 15/22] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-05-13 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_pci_p2pdma() and are ignored when unmapped.
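
From the caller's side this looks roughly like the sketch below (the
helper name is made up for illustration; the regular buffer is assumed to
come from kmalloc() and the P2PDMA page from a pci_alloc_p2pmem()
allocation, e.g. via virt_to_page()):

    /*
     * Build a two-entry SGL mixing a regular kernel buffer with a P2PDMA
     * page and map it through the common DMA API. dma-direct decides per
     * segment whether to use a bus address or a normal mapping.
     */
    static int sketch_map_mixed(struct device *dma_dev,
                                void *kbuf, size_t klen,
                                struct page *p2p_page, size_t plen,
                                struct sg_table *sgt)
    {
        struct scatterlist *sg;
        int ret;

        ret = sg_alloc_table(sgt, 2, GFP_KERNEL);
        if (ret)
            return ret;

        sg = sgt->sgl;
        sg_set_buf(sg, kbuf, klen);             /* regular memory */
        sg = sg_next(sg);
        sg_set_page(sg, p2p_page, plen, 0);     /* P2PDMA memory */

        ret = dma_map_sgtable(dma_dev, sgt, DMA_TO_DEVICE, 0);
        if (ret)
            sg_free_table(sgt);
        return ret; /* -EREMOTEIO if the P2PDMA path is unsupported */
    }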

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 803ee9321170..567dac942e89 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "direct.h"
 
 /*
@@ -381,29 +382,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl, sg, nents, i) {
+   if (sg_is_dma_pci_p2pdma(sg)) {
+   sg_dma_unmark_pci_p2pdma(sg);
+   } else  {
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped normally, thus we do nothing
+* and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EINVAL;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -411,7 +443,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EINVAL;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
-- 
2.20.1



[PATCH v2 19/22] nvme-pci: Convert to using dma_map_sgtable()

2021-05-13 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.
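
The error translation described above boils down to the following sketch
(the function name and exact structure are illustrative; the real change
is in nvme_map_data()):

    /*
     * -EREMOTEIO means the requested P2PDMA transfer can never succeed,
     * so fail the request with BLK_STS_TARGET rather than a retryable
     * status.
     */
    static blk_status_t sketch_map_data(struct device *dma_dev,
                                        struct sg_table *sgt,
                                        enum dma_data_direction dir)
    {
        int rc = dma_map_sgtable(dma_dev, sgt, dir, DMA_ATTR_NO_WARN);

        if (rc) {
            if (rc == -EREMOTEIO)
                return BLK_STS_TARGET;
            return BLK_STS_RESOURCE;
        }
        return BLK_STS_OK;
    }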

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9912291f43af..8844ef1005c3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -229,11 +229,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -525,7 +524,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -579,17 +578,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
 
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -600,9 +588,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -610,7 +599,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -633,7 +622,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -706,16 +695,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -741,12 +730,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
+   int entries = iod->sgt.nents;
dma_addr_t sgl_dma;
int i = 0;
 
@@ -844,7 +834,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 

[PATCH v2 04/22] PCI/P2PDMA: Avoid pci_get_slot() which sleeps

2021-05-13 Thread Logan Gunthorpe
In order to use upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
might sleep.

In order to avoid this, try to get the host bridge's device from the first
element in the device list. It should be impossible for the host bridge's
device to go away while references are held on child devices, so the first
element should not be able to change and, thus, this should be safe.

Introduce a static function called pci_host_bridge_dev() to obtain the
host bridge's root device.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 09c864f193d2..175da3a9c8fb 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -308,10 +308,41 @@ static const struct pci_p2pdma_whitelist_entry {
{}
 };
 
+/*
+ * This lookup function tries to find the PCI device corresponding to a given
+ * host bridge.
+ *
+ * It assumes the host bridge device is the first PCI device in the
+ * bus->devices list and that the devfn is 00.0. These assumptions should hold
+ * for all the devices in the whitelist above.
+ *
+ * This function is equivalent to pci_get_slot(host->bus, 0), however it does
+ * not take the pci_bus_sem lock seeing __host_bridge_whitelist() must not
+ * sleep.
+ *
+ * For this to be safe, the caller should hold a reference to a device on the
+ * bridge, which should ensure the host_bridge device will not be freed
+ * or removed from the head of the devices list.
+ */
+static struct pci_dev *pci_host_bridge_dev(struct pci_host_bridge *host)
+{
+   struct pci_dev *root;
+
+   root = list_first_entry_or_null(&host->bus->devices,
+   struct pci_dev, bus_list);
+
+   if (!root)
+   return NULL;
+   if (root->devfn != PCI_DEVFN(0, 0))
+   return NULL;
+
+   return root;
+}
+
 static bool __host_bridge_whitelist(struct pci_host_bridge *host,
bool same_host_bridge)
 {
-   struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
+   struct pci_dev *root = pci_host_bridge_dev(host);
const struct pci_p2pdma_whitelist_entry *entry;
unsigned short vendor, device;
 
@@ -320,7 +351,6 @@ static bool __host_bridge_whitelist(struct pci_host_bridge 
*host,
 
vendor = root->vendor;
device = root->device;
-   pci_dev_put(root);
 
for (entry = pci_p2pdma_whitelist; entry->vendor; entry++) {
if (vendor != entry->vendor || device != entry->device)
-- 
2.20.1



[PATCH v2 21/22] RDMA/rw: use dma_map_sgtable()

2021-05-13 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped.

Switch to the dma_map_sgtable() interface which will allow for better
error reporting if the P2PDMA pages are unsupported.

The change to sgtable also appears to fix a couple subtle error path
bugs:

  - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
that could have been incremented from the original call, as
well as an nents that was not the original number of nents
called when mapped.
  - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
were unmapped with the incorrect number of nents.
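
The pattern the conversion moves to is sketched below (an illustrative
function, using the ib_dma_map_sgtable()/ib_dma_unmap_sgtable() wrappers
seen in the hunks below); because the sg_table remembers orig_nents,
unmap no longer depends on a possibly-advanced sg pointer or count:

    static int sketch_map_then_unmap(struct ib_device *dev,
                                     struct scatterlist *sg, u32 sg_cnt,
                                     enum dma_data_direction dir)
    {
        struct sg_table sgt = {
            .sgl = sg,
            .orig_nents = sg_cnt,
        };
        int ret;

        ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
        if (ret)
            return ret;

        /* ... walk sgt.nents mapped entries; 'sg' may be advanced ... */

        /* Unmap via the table, not the advanced sg/count pair. */
        ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
        return 0;
    }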

Signed-off-by: Logan Gunthorpe 
---
 drivers/infiniband/core/rw.c | 75 +++-
 include/rdma/ib_verbs.h  | 19 +
 2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index a588c2038479..68f2dda56138 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,26 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
- u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
-   }
-   return ib_dma_map_sg(dev, sg, sg_cnt, dir);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -313,12 +293,16 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp, u32 port_num,
u64 remote_addr, u32 rkey, enum dma_data_direction dir)
 {
struct ib_device *dev = qp->pd->device;
+   struct sg_table sgt = {
+   .sgl = sg,
+   .orig_nents = sg_cnt,
+   };
int ret;
 
-   ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-   if (!ret)
-   return -ENOMEM;
-   sg_cnt = ret;
+   ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+   if (ret)
+   return ret;
+   sg_cnt = sgt.nents;
 
/*
 * Skip to the S/G entry that sg_offset falls into:
@@ -354,7 +338,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+   ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -387,6 +371,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
qp->integrity_en);
struct ib_rdma_wr *rdma_wr;
int count = 0, ret;
+   struct sg_table sgt = {
+   .sgl = sg,
+   .orig_nents = sg_cnt,
+   };
+   struct sg_table prot_sgt = {
+   .sgl = prot_sg,
+   .orig_nents = prot_sg_cnt,
+   };
 
if (sg_cnt > pages_per_mr || prot_sg_cnt > pages_per_mr) {
pr_err("SG count too large: sg_cnt=%d, prot_sg_cnt=%d, 
pages_per_mr=%d\n",
@@ -394,18 +386,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-   if (!ret)
-   return -ENOMEM;
-   sg_cnt = ret;
+   ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+   if (ret)
+   return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sg(dev, prot_sg, prot_sg_cnt, dir);
-   if (!ret) {
-   ret = -ENOMEM;
+   ret = ib_dma_map_sgtable(dev, &prot_sgt, dir, 0);
+   if (ret)
goto out_unmap_sg;
-   }
-   prot_sg_cnt = ret;
}
 
ctx->type = RDMA_RW_SIG_MR;
@@ -426,10 +414,11 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
 
memcpy(ctx->reg->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
 
-   ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sg_cnt, NULL, prot_sg,
- prot_sg_cnt, NULL, SZ_4K);
+   ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sgt.nents, NULL, prot_sg,
+ prot_sgt.nents, NULL, SZ_4K);
if (unlikely(ret)) {
-   pr_err("failed to map PI sg (%d)\n", sg_cnt + prot_sg_cnt);
+   pr_err("failed to map PI sg (%d)\n",
+  sgt.nents 

[PATCH v2 20/22] RDMA/core: Introduce ib_dma_pci_p2p_dma_supported()

2021-05-13 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 6c1f3ab7649c..86f0bf4dc1ca 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -414,7 +414,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7e2f3699b898..724e80a656f7 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3943,6 +3943,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.20.1



[PATCH v2 00/22] Add new DMA mapping operation for P2PDMA

2021-05-13 Thread Logan Gunthorpe
Hi,

This patchset continues my work to add P2PDMA support to the common
dma map operations. This allows for creating SGLs that have both P2PDMA
and regular pages, which is a necessary step toward allowing P2PDMA pages in
userspace.

The earlier RFC[1] and v1[2] postings generated a lot of great feedback.
This version adds a bunch more cleanup at the start of the series. I'll
probably look to split the earlier patches off and get them merged
independently after a round of review with this series, as the series
has gotten quite long.

I'm happy to do a few more passes if anyone has any further feedback
or better ideas.

This series is based on v5.13-rc1 and a git branch can be found here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_map_ops_v2

Thanks,

Logan

[1] 
https://lore.kernel.org/linux-block/20210311233142.7900-1-log...@deltatee.com/
[2] 
https://lore.kernel.org/linux-block/20210408170123.8788-1-log...@deltatee.com/

Changes since v1:
 * Rebased onto v5.13-rc1
 * Add some cleanup to the existing P2PDMA code to fix up some naming
   conventions and documentation as the code has evolved a bit since the
   names were chosen. (As suggested by John)
 * Add a patch that adds a warning if a host bridge is not in the whitelist
   (as suggested by Don)
 * Change to using dma_map_sgtable() instead of creating a new
   interface. For this, a couple of .map_sg implementations were changed
   to return full error codes. (as per Christoph)
 * Renamed the scatterlist functions to include the term "dma" to
   indicate that they apply to the DMA side of the sg. (per Jason)
 * Introduce ib_dma_pci_p2p_dma_supported() helper instead of open
   coding the check (per Jason)
 * Numerous minor adjustments and documentation fixes

Changes since the RFC:
 * Added comment and fixed up the pci_get_slot patch. (per Bjorn)
 * Fixed glaring sg_phys() double offset bug. (per Robin)
 * Created a new map operation (dma_map_sg_p2pdma()) with a new calling
   convention instead of modifying the calling convention of
   dma_map_sg(). (per Robin)
 * Integrated the two similar pci_p2pdma_dma_map_type() and
   pci_p2pdma_map_type() functions into one (per Ira)
 * Reworked some of the logic in the map_sg() implementations into
   helpers in the p2pdma code. (per Christoph)
 * Dropped a bunch of unnecessary symbol exports (per Christoph)
 * Expanded the code in dma_pci_p2pdma_supported() for clarity. (per
   Ira and Christoph)
 * Finished off using the new dma_map_sg_p2pdma() call in rdma_rw
   and removed the old pci_p2pdma_[un]map_sg(). (per Jason)

--

Logan Gunthorpe (22):
  PCI/P2PDMA: Rename upstream_bridge_distance() and rework documentation
  PCI/P2PDMA: Use a buffer on the stack for collecting the acs list
  PCI/P2PDMA: Cleanup type for return value of calc_map_type_and_dist()
  PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  PCI/P2PDMA: Print a warning if the host bridge is not in the whitelist
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagemap and device
  dma-mapping: Allow map_sg() ops to return negative error codes
  dma-direct: Return appropriate error code from dma_direct_map_sg()
  iommu: Return full error code from iommu_map_sg[_atomic]()
  dma-iommu: Return error code from iommu_dma_map_sg()
  lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: Make pci_p2pdma_map_type() non-static
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: Convert to using dma_map_sgtable()
  RDMA/core: Introduce ib_dma_pci_p2p_dma_supported()
  RDMA/rw: use dma_map_sgtable()
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

 drivers/infiniband/core/rw.c |  75 --
 drivers/iommu/dma-iommu.c|  86 +--
 drivers/iommu/iommu.c|  15 +-
 drivers/nvme/host/core.c |   3 +-
 drivers/nvme/host/nvme.h |   2 +-
 drivers/nvme/host/pci.c  |  80 +-
 drivers/nvme/target/rdma.c   |   2 +-
 drivers/pci/Kconfig  |   2 +-
 drivers/pci/p2pdma.c | 273 +++
 include/linux/dma-map-ops.h  |  18 ++-
 include/linux/dma-mapping.h  |  46 +-
 include/linux/iommu.h|  22 +--
 include/linux/pci-p2pdma.h   |  81 ---
 include/linux/scatterlist.h  |  50 ++-
 include/rdma/ib_verbs.h  |  30 
 kernel/dma/direct.c  |  44 +-
 kernel/dma/mapping.c |  31 +++-
 17 files changed, 570 insertions(+), 290 deletions(-)


base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
--
2.20.1


[PATCH v2 01/22] PCI/P2PDMA: Rename upstream_bridge_distance() and rework documentation

2021-05-13 Thread Logan Gunthorpe
The function upstream_bridge_distance() has evolved such that its name
is no longer entirely reflective of what the function does.

The function not only calculates the distance between two peers but also
calculates how the DMA addresses for those two peers should be mapped.

Thus, rename the function to calc_map_type_and_dist() and rework the
documentation some to better describe the two pieces of information
the function returns.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 63 ++--
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..6f90e9812f6e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -354,7 +354,7 @@ static bool host_bridge_whitelist(struct pci_dev *a, struct 
pci_dev *b)
 }
 
 static enum pci_p2pdma_map_type
-__upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
+__calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
int *dist, bool *acs_redirects, struct seq_buf *acs_list)
 {
struct pci_dev *a = provider, *b = client, *bb;
@@ -433,17 +433,18 @@ static unsigned long map_types_idx(struct pci_dev *client)
 }
 
 /*
- * Find the distance through the nearest common upstream bridge between
- * two PCI devices.
+ * Calculate the P2PDMA mapping type and distance between two PCI devices.
  *
- * If the two devices are the same device then 0 will be returned.
+ * If the two devices are the same device then PCI_P2PDMA_MAP_BUS_ADDR
+ * and a distance of 0 will be returned.
  *
  * If there are two virtual functions of the same device behind the same
- * bridge port then 2 will be returned (one step down to the PCIe switch,
- * then one step back to the same device).
+ * bridge port then PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 will be
+ * returned (one step down to the PCIe switch, then one step back to the
+ * same device).
  *
  * In the case where two devices are connected to the same PCIe switch, the
- * value 4 will be returned. This corresponds to the following PCI tree:
+ * distance of 4 will be returned. This corresponds to the following PCI tree:
  *
  * -+  Root Port
  *  \+ Switch Upstream Port
@@ -454,31 +455,31 @@ static unsigned long map_types_idx(struct pci_dev *client)
  *
  * The distance is 4 because we traverse from Device A through the downstream
  * port of the switch, to the common upstream port, back up to the second
- * downstream port and then to Device B.
- *
- * Any two devices that cannot communicate using p2pdma will return
- * PCI_P2PDMA_MAP_NOT_SUPPORTED.
+ * downstream port and then to Device B. The mapping type returned will depend
+ * on the ACS redirection setting of the bridges along the path. If ACS
+ * redirect is set on any bridge port in the path then the TLPs will go through
+ * the host bridge. Otherwise PCI_P2PDMA_MAP_BUS_ADDR is returned.
  *
  * Any two devices that have a data path that goes through the host bridge
- * will consult a whitelist. If the host bridges are on the whitelist,
- * this function will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE.
- *
- * If either bridge is not on the whitelist this function returns
- * PCI_P2PDMA_MAP_NOT_SUPPORTED.
+ * will consult a whitelist. If the host bridge is in the whitelist,
+ * this function will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the
+ * distance set to the number of ports per above. If the device is not
+ * in the whitelist the type will be returned PCI_P2PDMA_MAP_NOT_SUPPORTED.
  *
- * If a bridge which has any ACS redirection bits set is in the path,
- * acs_redirects will be set to true. In this case, a list of all infringing
- * bridge addresses will be populated in acs_list (assuming it's non-null)
- * for printk purposes.
+ * If any ACS redirect bits are set, then the acs_redirects boolean will be
+ * set to true and their pci device name will be appended to the acs_list
+ * seq_buf. This seq_buf is used to print a warning informing the user
+ * how to disable ACS using a command line parameter.
+ * (See calc_map_type_and_dist_warn() below)
  */
 static enum pci_p2pdma_map_type
-upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
+calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
int *dist, bool *acs_redirects, struct seq_buf *acs_list)
 {
enum pci_p2pdma_map_type map_type;
 
-   map_type = __upstream_bridge_distance(provider, client, dist,
- acs_redirects, acs_list);
+   map_type = __calc_map_type_and_dist(provider, client, dist,
+   acs_redirects, acs_list);
 
if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) {
if (!cpu_supports_p2pdma() &&
@@ -494,8 +495,8 @@ upstream_bridge_distance(struct pci_dev *provider, struct 
pci_dev *client,
 }
 
 static enum pci_p2pdma_map_type

[PATCH v2 18/22] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA

2021-05-13 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 522c9b229f80..9544a8948bc4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3682,7 +3682,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->queue->queuedata = ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 05f31a2c64bb..84649ce6c206 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -487,7 +487,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -495,6 +494,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a29b170701fc..9912291f43af 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2756,17 +2756,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.20.1



[PATCH v2 02/22] PCI/P2PDMA: Use a buffer on the stack for collecting the acs list

2021-05-13 Thread Logan Gunthorpe
In order to call the calc_map_type_and_dist_warn() function from
a dma_map operation, the function must not sleep. The only reason
it sleeps is to allocate memory for the seq_buf to print a verbose
warning telling the user how to disable ACS for that path.

Instead of allocating the memory with kmalloc, allocate it on
the stack with a smaller buffer. A 128B buffer is enough to print
10 pci device names. A system with 10 bridge ports between two devices
that have ACS enabled would be unusually large, so this should
still be a reasonable limit.

This also allows cleaning up the awkward (and broken) -ENOMEM return,
which contradicts the return type and which the caller was not
prepared for.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 6f90e9812f6e..3a5fb63c5f2c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -500,11 +500,10 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, 
struct pci_dev *client,
 {
struct seq_buf acs_list;
bool acs_redirects;
+   char buf[128];
int ret;
 
-   seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
-   if (!acs_list.buffer)
-   return -ENOMEM;
+   seq_buf_init(&acs_list, buf, sizeof(buf));
 
ret = calc_map_type_and_dist(provider, client, dist, &acs_redirects,
 &acs_list);
@@ -522,8 +521,6 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, 
struct pci_dev *client,
 pci_name(provider));
}
 
-   kfree(acs_list.buffer);
-
return ret;
 }
 
-- 
2.20.1



[PATCH v2 03/22] PCI/P2PDMA: Cleanup type for return value of calc_map_type_and_dist()

2021-05-13 Thread Logan Gunthorpe
Instead of using an int for the return value of this function, use the
correct enum pci_p2pdma_map_type.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 3a5fb63c5f2c..09c864f193d2 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -544,11 +544,11 @@ calc_map_type_and_dist_warn(struct pci_dev *provider, 
struct pci_dev *client,
 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
 int num_clients, bool verbose)
 {
+   enum pci_p2pdma_map_type map;
bool not_supported = false;
struct pci_dev *pci_client;
int total_dist = 0;
-   int distance;
-   int i, ret;
+   int i, distance;
 
if (num_clients == 0)
return -1;
@@ -563,15 +563,15 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, 
struct device **clients,
}
 
if (verbose)
-   ret = calc_map_type_and_dist_warn(provider, pci_client,
+   map = calc_map_type_and_dist_warn(provider, pci_client,
  &distance);
else
-   ret = calc_map_type_and_dist(provider, pci_client,
+   map = calc_map_type_and_dist(provider, pci_client,
 &distance, NULL, NULL);
 
pci_dev_put(pci_client);
 
-   if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED)
+   if (map == PCI_P2PDMA_MAP_NOT_SUPPORTED)
not_supported = true;
 
if (not_supported && !verbose)
-- 
2.20.1



Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 01:22:51PM -0700, Jacob Pan wrote:
> Hi Tony,
> 
> On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" 
> wrote:
> 
> > On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote:
> > > It seems there are two options:
> > > 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU
> > > page table/domain, mark the device is PASID only with a flag. Use DMA
> > > APIs to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU
> > > driver will decide whether to use system PASID domain during map/unmap.
> > > Not clear if we also need to make IOVA==kernel VA.
> > > 
> > > 2. Add a new IOMMU API to setup a system PASID which points to
> > > init_mm.pgd. This API only allows trusted device to bind with the
> > > system PASID at its own risk. There is no need for DMA API. This is the
> > > same as the current code except with an explicit API.
> > > 
> > > Which option?  
> > 
> > Option #1 looks cleaner to me. Option #2 gives access to bits
> > of memory that the users of system PASID shouldn't ever need
> > to touch ... just map regions of memory that the kernel has
> > a "struct page" for.
> > 
> > What does "use DMA APIs to explicitly map/unmap" mean? Is that
> > for the whole region?
> > 
> If we map the entire kernel direct map during system PASID setup, then we
> don't need to use DMA API to map/unmap certain range.
> 
> I was thinking this system PASID page table could be on-demand. The mapping
> is built by explicit use of DMA map/unmap APIs.

Option 1 should be the PASID works exactly like a normal RID and uses
all the normal DMA APIs and IOMMU mechanisms, whatever the platform
implements. This might mean an iommu update on every operation or not.
 
> > I'm expecting that once this system PASID has been initialized,
> > then any accelerator device with a kernel use case would use the
> > same PASID. I.e. DSA for page clearing, IAX for ZSwap compression
> > & decompression, etc.
> > 
> OK, sounds like we have to map the entire kernel VA with struct page as you
> said. So we still by-pass DMA APIs, can we all agree on that?

Option 2 should be the faster option, but not available in all cases.

Option 1 isn't optional. DMA and IOMMU code has to be portable and
this is the portable API.

If you want to do option 1 and option 2 then give it a go, but in most
common cases with the IOMMU in a direct map you shouldn't get a
notable performance win.

Jason


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jacob Pan
Hi Tony,

On Thu, 13 May 2021 12:57:49 -0700, "Luck, Tony" 
wrote:

> On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote:
> > It seems there are two options:
> > 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU
> > page table/domain, mark the device is PASID only with a flag. Use DMA
> > APIs to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU
> > driver will decide whether to use system PASID domain during map/unmap.
> > Not clear if we also need to make IOVA==kernel VA.
> > 
> > 2. Add a new IOMMU API to setup a system PASID which points to
> > init_mm.pgd. This API only allows trusted device to bind with the
> > system PASID at its own risk. There is no need for DMA API. This is the
> > same as the current code except with an explicit API.
> > 
> > Which option?  
> 
> Option #1 looks cleaner to me. Option #2 gives access to bits
> of memory that the users of system PASID shouldn't ever need
> to touch ... just map regions of memory that the kernel has
> a "struct page" for.
> 
> What does "use DMA APIs to explicitly map/unmap" mean? Is that
> for the whole region?
> 
If we map the entire kernel direct map during system PASID setup, then we
don't need to use DMA API to map/unmap certain range.

I was thinking this system PASID page table could be on-demand. The mapping
is built by explicit use of DMA map/unmap APIs.

> I'm expecting that once this system PASID has been initialized,
> then any accelerator device with a kernel use case would use the
> same PASID. I.e. DSA for page clearing, IAX for ZSwap compression
> & decompression, etc.
> 
OK, sounds like we have to map the entire kernel VA with struct page as you
said. So we still by-pass DMA APIs, can we all agree on that?

> -Tony


Thanks,

Jacob


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Luck, Tony
On Thu, May 13, 2021 at 12:46:21PM -0700, Jacob Pan wrote:
> It seems there are two options:
> 1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU page
> table/domain, mark the device is PASID only with a flag. Use DMA APIs
> to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU driver
> will decide whether to use system PASID domain during map/unmap. Not clear
> if we also need to make IOVA==kernel VA.
> 
> 2. Add a new IOMMU API to setup a system PASID which points to init_mm.pgd.
> This API only allows trusted device to bind with the system PASID at its
> own risk. There is no need for DMA API. This is the same as the current
> code except with an explicit API.
> 
> Which option?

Option #1 looks cleaner to me. Option #2 gives access to bits
of memory that the users of system PASID shouldn't ever need
to touch ... just map regions of memory that the kernel has
a "struct page" for.

What does "use DMA APIs to explicitly map/unmap" mean? Is that
for the whole region?

I'm expecting that once this system PASID has been initialized,
then any accelerator device with a kernel use case would use the
same PASID. I.e. DSA for page clearing, IAX for ZSwap compression
& decompression, etc.

-Tony


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jacob Pan
Hi Jason,

On Thu, 13 May 2021 16:20:14 -0300, Jason Gunthorpe  wrote:

> On Thu, May 13, 2021 at 07:14:54PM +, Luck, Tony wrote:
> > > If you want this then be explicit about what it is you are making when
> > > building the API. Don't try to hide it under some generic idea of
> > > "kernel PCI DMA SVA"  
> > 
> > So, a special API call (that Dave can call from IDXD) to set up this
> > kernel PASID. With suitable documentation to explain the scope.
> > Maybe with a separate CONFIG option so it can be completely
> > stubbed out (IDXD does *NOT* "select" this option ... users have
> > to explicitly pick it).
> >   
> > > I could easily see an admin option to "turn this off" entirely as
> > > being too dangerous, especially if the users have no interest in
> > > IDXD.  
> > 
> > And a kernel command line option to block IDXD from using that
> > special API ... for users on generic kernels who want to block
> > this use model (but still use IDXD in non-kernel cases). Users
> > who don't want IDXD at all can block loading of the driver.  
> 
> A generic IOMMU API should not be IDXD specific, if you want to allow
> some special "integrated to the SOC accelerator PASID" mode then it
> should be a global IOMMU API and any security toggles for it should be
> global and unrelated to IDXD.
> 
> Concurrently it seems necessary to have a solution for secure kernel
> PASID use under the DMA API and reserve this special mode for getting
> higher performance.
> 
> I think you need to come with a proposal, and it is a bit alarming a
> noteworthy security hole was added under the guise of "kernel SVA" :(
> 
It seems there are two options:
1. Add a new IOMMU API to set up a system PASID with a *separate* IOMMU page
table/domain, mark the device is PASID only with a flag. Use DMA APIs
to explicit map/unmap. Based on this PASID-only flag, Vendor IOMMU driver
will decide whether to use system PASID domain during map/unmap. Not clear
if we also need to make IOVA==kernel VA.

2. Add a new IOMMU API to setup a system PASID which points to init_mm.pgd.
This API only allows trusted device to bind with the system PASID at its
own risk. There is no need for DMA API. This is the same as the current
code except with an explicit API.

Which option?

> Jason


Thanks,

Jacob


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 07:14:54PM +, Luck, Tony wrote:
> > If you want this then be explicit about what it is you are making when
> > building the API. Don't try to hide it under some generic idea of
> > "kernel PCI DMA SVA"
> 
> So, a special API call (that Dave can call from IDXD) to set up this
> kernel PASID. With suitable documentation to explain the scope.
> Maybe with a separate CONFIG option so it can be completely
> stubbed out (IDXD does *NOT* "select" this option ... users have
> to explicitly pick it).
> 
> > I could easily see an admin option to "turn this off" entirely as
> > being too dangerous, especially if the users have no interest in IDXD.
> 
> And a kernel command line option to block IDXD from using that
> special API ... for users on generic kernels who want to block
> this use model (but still use IDXD in non-kernel cases). Users
> who don't want IDXD at all can block loading of the driver.

A generic IOMMU API should not be IDXD specific, if you want to allow
some special "integrated to the SOC accelerator PASID" mode then it
should be a global IOMMU API and any security toggles for it should be
global and unrelated to IDXD.

Concurrently it seems necessary to have a solution for secure kernel
PASID use under the DMA API and reserve this special mode for getting
higher performance.

I think you need to come with a proposal, and it is a bit alarming a
noteworthy security hole was added under the guise of "kernel SVA" :(

Jason


RE: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Luck, Tony
> If you want this then be explicit about what it is you are making when
> building the API. Don't try to hide it under some generic idea of
> "kernel PCI DMA SVA"

So, a special API call (that Dave can call from IDXD) to set up this
kernel PASID. With suitable documentation to explain the scope.
Maybe with a separate CONFIG option so it can be completely
stubbed out (IDXD does *NOT* "select" this option ... users have
to explicitly pick it).

> I could easily see an admin option to "turn this off" entirely as
> being too dangerous, especially if the users have no interest in IDXD.

And a kernel command line option to block IDXD from using that
special API ... for users on generic kernels who want to block
this use model (but still use IDXD in non-kernel cases). Users
who don't want IDXD at all can block loading of the driver.

Would that work?

-Tony


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 11:53:49AM -0700, Luck, Tony wrote:

> I'd like people to think of DSA as an extension to the instruction
> set. It implements asynchronous instructions like "MEMFILL" and
> "MEMCOPY". These can be limited in scope when executed in user
> processes or guests. But when executed by the host OS ring0 code
> they can have all the same access that ring0 code has when it
> dereferences a pointer.

If you want this then be explicit about what it is you are making when
building the API. Don't try to hide it under some generic idea of
"kernel PCI DMA SVA"

Add appropriately safe APIs that might have a chance of making it
secure, eg by confirming it is a physically trusted on-socket device.

It is not just Thunderbolt devices that could trigger this; many
devices with programmable firmware can spoof PCI DID/VID. An
exploit chain that compromises an in-system FW device, chains it to
faking an IDXD HW, and then accesses the all-memory PASID is pretty
alarming if the admin thought they had protection against DMA attacks.

I could easily see an admin option to "turn this off" entirely as
being too dangerous, especially if the users have no interest in IDXD.

Which is why being explicit of intent is really important.

Jason


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Luck, Tony
On Thu, May 13, 2021 at 02:33:03PM -0300, Jason Gunthorpe wrote:
> The page table under the kernel PASID should behave the same way that
> the kernel would operate the page table assigned to a kernel RID.
> 
> If the kernel has security off then the PASID should map to all
> physical memory, just like the RID does.
> 
> If security is on then every DMA map needs to be loaded into the
> PASID's io page table no different than a RID page table.
> 
> "kernel SVA" is, IMHO, not a desirable thing, it completely destroys
> the kernel's DMA security model.
> 
> > If people want to use an accelerator on memory allocated by vmalloc()
> > things will get more complicated. But maybe we can delay solving that
> > problem until someone comes up with a real use case that needs to
> > do this?
> 
> If you have a HW limitation that the device can only issue TLPs
> with a PASID, even for kernel users, then I think the proper thing is
> to tell the IOMMU layer that a certain 'struct device' enters
> PASID-only mode and the IOMMU layer should construct an appropriate
> PASID and flow the dma operations through it.
> 
> Pretending the DMA layer doesn't exist and that PASID gets a free pass
> is not OK in the kernel.

I can see why a tight security model is needed to stop
random devices from having access to memory that they should
not be able to access.  Now that PCIe devices can be plugged
into Thunderbolt ports on computers, nobody wants to repeat
the disaster that Firewire ports created for systems over
a decade ago.

But I'd like to challege the one-size-fits-all policy. There's
a big difference between a random device plugged into a port
(which may even lie about its VendorID/DeviceID) and a device
that is part of the CPU socket.

I'd like people to think of DSA as an extension to the instruction
set. It implements asynchronous instructions like "MEMFILL" and
"MEMCOPY". These can be limited in scope when executed in user
processes or guests. But when executed by the host OS ring0 code
they can have all the same access that ring0 code has when it
dereferences a pointer.

-Tony


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 04:44:14PM +, Luck, Tony wrote:
> > For shared workqueue, it can only generate DMA request with PASID. The
> > submission is done by ENQCMDS (S for supervisor) instruction.
> >
> > If we were not to share page tables with init_mm, we need a system PASID
> > that doing the same direct mapping in IOMMU page tables.
> 
> Note that for the currently envisioned kernel use cases for accelerators it
> would be OK for this system PASID to just provide either:
> 
> 1) A 1:1 mapping for physical addresses.  Kernel users of the accelerators
>   would provide physical addresses in descriptors.
> 2) The same mapping that the kernel uses for its "1:1" map of all physical
> memory. Users would use kernel virtual addresses in that "1:1" range
> (e.g. those obtained from page_to_virt(struct page *p);)

Well, no, neither of those are OK.

The page table under the kernel PASID should behave the same way that
the kernel would operate the page table assigned to a kernel RID.

If the kernel has security off then the PASID should map to all
physical memory, just like the RID does.

If security is on then every DMA map needs to be loaded into the
PASID's io page table no different than a RID page table.

"kernel SVA" is, IMHO, not a desirable thing, it completely destroys
the kernel's DMA security model.

> If people want to use an accelerator on memory allocated by vmalloc()
> things will get more complicated. But maybe we can delay solving that
> problem until someone comes up with a real use case that needs to
> do this?

If you have a HW limitation that the device can only issue TLPs
with a PASID, even for kernel users, then I think the proper thing is
to tell the IOMMU layer that a certain 'struct device' enters
PASID-only mode and the IOMMU layer should construct an appropriate
PASID and flow the dma operations through it.

Pretending the DMA layer doesn't exist and that PASID gets a free pass
is not OK in the kernel.

Jason


RE: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Luck, Tony
> For shared workqueue, it can only generate DMA request with PASID. The
> submission is done by ENQCMDS (S for supervisor) instruction.
>
> If we were not to share page tables with init_mm, we need a system PASID
> that doing the same direct mapping in IOMMU page tables.

Note that for the currently envisioned kernel use cases for accelerators it
would be OK for this system PASID to just provide either:

1) A 1:1 mapping for physical addresses.  Kernel users of the accelerators
  would provide physical addresses in descriptors.
2) The same mapping that the kernel uses for its "1:1" map of all physical
memory. Users would use kernel virtual addresses in that "1:1" range
(e.g. those obtained from page_to_virt(struct page *p);)

If people want to use an accelerator on memory allocated by vmalloc()
things will get more complicated. But maybe we can delay solving that
problem until someone comes up with a real use case that needs to
do this?

-Tony




Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jacob Pan
Hi Jason,

On Thu, 13 May 2021 10:38:34 -0300, Jason Gunthorpe  wrote:

> On Thu, May 13, 2021 at 06:00:12AM -0700, Jacob Pan wrote:
> > > > If you want to do SVA PASID then it also must come with DMA APIs to
> > > > manage the CPU cache coherence that are all NOP's on x86.
> > > 
> > > Yes.  And we have plenty of precende where an IOMMU is in "bypass"
> > > mode to allow access to all memory and then uses the simple
> > > dma-direct case.  
> > I agree it is better not to expose the entire direct map. But the
> > missing piece of using DMA APIs is the PASID. The caller needs the
> > PASID value to do work submission once buffer is mapped.  
> 
> You still haven't explained why the kernel driver should have a PASID at
> all.
> 
For shared workqueue, it can only generate DMA request with PASID. The
submission is done by ENQCMDS (S for supervisor) instruction.

If we were not to share page tables with init_mm, we need a system PASID
that doing the same direct mapping in IOMMU page tables.

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > Again, I don't know enough about VDPA to make sense of that.  Are we
> > > essentially talking non-PCI virtual devices here?  In which case you
> > > could define the VDPA "bus" to always have one-device groups.
> > 
> > It is much worse than that.
> > 
> > What these non-PCI devices need is for the kernel driver to be part of
> > the IOMMU group of the underlying PCI device but tell VFIO land that
> > "groups don't matter"
> 
> I don't really see a semantic distinction between "always one-device
> groups" and "groups don't matter".  Really the only way you can afford
> to not care about groups is if they're singletons.

The kernel driver under the mdev may not be in an "always one-device"
group.

It is a kernel driver so the only thing we know and care about is that
all devices in the HW group are bound to kernel drivers.

The vfio device that spawns from this kernel driver is really a
"groups don't matter" vfio device because at the IOMMU layer it should
be riding on the physical group of the kernel driver.  At the VFIO
layer we no longer care about the group abstraction because the system
guarentees isolation in some other way.

The issue is a software one of tightly coupling IOMMU HW groups to
VFIO's API and then introducing an entire class of VFIO mdev devices
that no longer care about IOMMU HW groups at all.

Currently mdev tries to trick this by creating singleton groups, but
it is very ugly and very tightly coupled to a specific expectation of
the few existing mdev drivers. Trying to add PASID made it a lot worse.

> Aside: I'm primarily using "group" to mean the underlying hardware
> unit, not the vfio construct on top of it, I'm not sure that's been
> clear throughout.

Sure, that is obviously fixed, but I'm not interested in that.

I'm interested in having a VFIO API that makes sense for vfio-pci
which has a tight coupling to the HW notion of an IOMMU and also vfio
mdev's that have no concept of a HW IOMMU group.

> So.. your model assumes that every device has a safe quiescent state
> where it won't do any harm until poked, whether its group is
> currently kernel owned, or owned by a userspace that doesn't know
> anything about it.

This is today's model, yes. When you run dpdk on a multi-group device
vfio already ensures that all the device groups remain parked and
inaccessible.
 
> At minimum this does mean that in order to use one device in the group
> you must have permission to use *all* the devices in the group -
> otherwise you may be able to operate a device you don't have
> permission to by DMAing to its registers from a device you do have
> permission to.

If the administrator configures the system with different security
labels for different VFIO devices then yes, removing groups makes this
trickier, as all devices in the group should have the same label.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > 
> > > This is a good feature in general when let's say there is a linux 
> > > supported
> > > device which has a proprietary device firmware update tool which only 
> > > exists
> > > as an x86 binary and your hardware is not x86 - running qemu + vfio in 
> > > full
> > > emulation would provide a way to run the tool to update a physical device.
> > 
> > That specific use case doesn't really need a vIOMMU though, does it?
> 
> Possibly not, but the mechanics needed to do vIOMMU on different host
> IOMMU aren't really different from what you need for a no-vIOMMU
> guest.  

For very simple vIOMMUs this might be true, but these new features of
nesting, PASID, migration, etc. all make the vIOMMU complicated and
emulating it completely a lot harder.

Stuffing a vfio-pci into a guest and creating a physical map using a
single IOASID is comparatively trivial.
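
i.e. on the order of this, using today's type1 uAPI as a stand-in for
whatever the single-IOASID map call ends up being (guest_ram and
container_fd stand in for the VMM's state):

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(uintptr_t)guest_ram,	/* userspace VA backing guest RAM */
		.iova  = 0,				/* GPA == IOVA, no vIOMMU */
		.size  = guest_ram_size,
	};

	ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);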

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 04:01:20PM +1000, David Gibson wrote:

> But.. even if you're exposing page tables to userspace.. with hardware
> that has explicit support for nesting you can probably expose the hw
> tables directly which is great for the cases that works for.  But
> surely for older IOMMUs which don't do nesting you must have some way
> of shadowing guest IO page tables to host IO page tables to translate
> GPA to HPA at least?

I expect this would be in qemu and would be part of the expensive
emulation I suggested. Converting the guest's page table structure
into a sequence of map/unmaps to a non-nestable IOASID.

> If you're doing that, I don't see that converting page table format
> is really any harder

It isn't, but it is a completely different flow and custom from the
normal HW accelerated nesting.

> It might not be a theoretically complete emulation of the vIOMMU, but
> it can support in-practice usage.  In particular it works pretty well
> if your backend has a nice big IOVA range (like x86 IOMMUS) but your
> guest platform typically uses relatively small IOVA windows.  PAPR on
> x86 is exactly that... well.. possibly not the 64-bit window, but
> because of old PAPR platforms that didn't support that, we can choose
> not to advertise that and guests will cope.

So maybe this multi-window thing is a generic API somehow. You'll have
to check what Kevin comes up with to ensure it fits in.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 8/8] iommu/dma: Reserve any RMR regions associated with a dev

2021-05-13 Thread Shameer Kolothum
Get ACPI IORT RMR regions associated with a dev reserved
so that there is a unity mapping for them in SMMU.

Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/dma-iommu.c | 66 ---
 1 file changed, 61 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 2d9caf548a32..6838caf3e8ff 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -174,22 +174,78 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 }
 EXPORT_SYMBOL(iommu_put_dma_cookie);
 
+static bool iommu_dma_dev_has_rmr(struct iommu_fwspec *fwspec,
+ struct iommu_rmr *e)
+{
+   int i;
+
+   for (i = 0; i < fwspec->num_ids; i++) {
+   if (e->sid == fwspec->ids[i])
+   return true;
+   }
+
+   return false;
+}
+
+static void iommu_dma_get_rmr_resv_regions(struct device *dev,
+  struct list_head *list)
+{
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct list_head rmr_list;
+   struct iommu_rmr *rmr;
+
+   INIT_LIST_HEAD(&rmr_list);
+   if (iommu_dma_get_rmrs(fwspec->iommu_fwnode, &rmr_list))
+   return;
+
+   if (dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus);
+
+   if (!host->preserve_config)
+   return;
+   }
+
+   list_for_each_entry(rmr, &rmr_list, list) {
+   int prot = IOMMU_READ | IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+   struct iommu_resv_region *region;
+   enum iommu_resv_type type;
+
+   if (!iommu_dma_dev_has_rmr(fwspec, rmr))
+   continue;
+
+   if (rmr->flags & IOMMU_RMR_REMAP_PERMITTED)
+   type = IOMMU_RESV_DIRECT_RELAXABLE;
+   else
+   type = IOMMU_RESV_DIRECT;
+
+   region = iommu_alloc_resv_region(rmr->base_address,
+rmr->length, prot,
+type);
+   if (!region)
+   return;
+
+   list_add_tail(&region->list, list);
+   }
+}
 /**
  * iommu_dma_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
  * @list: Reserved region list from iommu_get_resv_regions()
  *
  * IOMMU drivers can use this to implement their .get_resv_regions callback
- * for general non-IOMMU-specific reservations. Currently, this covers GICv3
- * ITS region reservation on ACPI based ARM platforms that may require HW MSI
- * reservation.
+ * for general non-IOMMU-specific reservations. Currently this covers,
+ *  -GICv3 ITS region reservation on ACPI based ARM platforms that may
+ *   require HW MSI reservation.
+ *  -Any ACPI IORT RMR memory range reservations (IORT spec rev E.b)
  */
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
 {
 
-   if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode))
+   if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode)) {
iort_iommu_msi_get_resv_regions(dev, list);
-
+   iommu_dma_get_rmr_resv_regions(dev, list);
+   }
 }
 EXPORT_SYMBOL(iommu_dma_get_resv_regions);
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 7/8] iommu/arm-smmu: Get associated RMR info and install bypass SMR

2021-05-13 Thread Shameer Kolothum
From: Jon Nettleton 

Check if there is any RMR info associated with the devices behind
the SMMU and if any, install bypass SMRs for them. This is to
keep any ongoing traffic associated with these devices alive
when we enable/reset SMMU during probe().

Signed-off-by: Jon Nettleton 
Signed-off-by: Steven Price 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..f67aeb30b5ef 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -2042,6 +2042,66 @@ err_reset_platform_ops: __maybe_unused;
return err;
 }
 
+static void arm_smmu_rmr_install_bypass_smr(struct arm_smmu_device *smmu)
+{
+   struct list_head rmr_list;
+   struct iommu_rmr *e;
+   int i, cnt = 0;
+   u32 smr;
+   u32 reg;
+
+   INIT_LIST_HEAD(&rmr_list);
+   if (iommu_dma_get_rmrs(dev_fwnode(smmu->dev), &rmr_list))
+   return;
+
+   reg = arm_smmu_gr0_read(smmu, ARM_SMMU_GR0_sCR0);
+
+   if ((reg & ARM_SMMU_sCR0_USFCFG) && !(reg & ARM_SMMU_sCR0_CLIENTPD)) {
+   /*
+* SMMU is already enabled and disallowing bypass, so preserve
+* the existing SMRs
+*/
+   for (i = 0; i < smmu->num_mapping_groups; i++) {
+   smr = arm_smmu_gr0_read(smmu, ARM_SMMU_GR0_SMR(i));
+   if (!FIELD_GET(ARM_SMMU_SMR_VALID, smr))
+   continue;
+   smmu->smrs[i].id = FIELD_GET(ARM_SMMU_SMR_ID, smr);
+   smmu->smrs[i].mask = FIELD_GET(ARM_SMMU_SMR_MASK, smr);
+   smmu->smrs[i].valid = true;
+   }
+   }
+
+   list_for_each_entry(e, &rmr_list, list) {
+   u32 sid = e->sid;
+
+   i = arm_smmu_find_sme(smmu, sid, ~0);
+   if (i < 0)
+   continue;
+   if (smmu->s2crs[i].count == 0) {
+   smmu->smrs[i].id = sid;
+   smmu->smrs[i].mask = ~0;
+   smmu->smrs[i].valid = true;
+   }
+   smmu->s2crs[i].count++;
+   smmu->s2crs[i].type = S2CR_TYPE_BYPASS;
+   smmu->s2crs[i].privcfg = S2CR_PRIVCFG_DEFAULT;
+   smmu->s2crs[i].cbndx = 0xff;
+
+   cnt++;
+   }
+
+   if ((reg & ARM_SMMU_sCR0_USFCFG) && !(reg & ARM_SMMU_sCR0_CLIENTPD)) {
+   /* Remove the valid bit for unused SMRs */
+   for (i = 0; i < smmu->num_mapping_groups; i++) {
+   if (smmu->s2crs[i].count == 0)
+   smmu->smrs[i].valid = false;
+   }
+   }
+
+   dev_notice(smmu->dev, "\tpreserved %d boot mapping%s\n", cnt,
+  cnt == 1 ? "" : "s");
+}
+
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
struct resource *res;
@@ -2168,6 +2228,10 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
}
 
platform_set_drvdata(pdev, smmu);
+
+   /* Check for RMRs and install bypass SMRs if any */
+   arm_smmu_rmr_install_bypass_smr(smmu);
+
arm_smmu_device_reset(smmu);
arm_smmu_test_smr_masks(smmu);
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 6/8] iommu/arm-smmu-v3: Get associated RMR info and install bypass STE

2021-05-13 Thread Shameer Kolothum
Check if there is any RMR info associated with the devices behind
the SMMUv3 and if any, install bypass STEs for them. This is to
keep any ongoing traffic associated with these devices alive
when we enable/reset SMMUv3 during probe().

Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f9195b740f48..c2d2e65b9856 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3574,6 +3574,36 @@ static void __iomem *arm_smmu_ioremap(struct device 
*dev, resource_size_t start,
return devm_ioremap_resource(dev, &res);
 }
 
+static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
+{
+   struct list_head rmr_list;
+   struct iommu_rmr *e;
+   int ret;
+
+   INIT_LIST_HEAD(&rmr_list);
+   if (iommu_dma_get_rmrs(dev_fwnode(smmu->dev), &rmr_list))
+   return;
+
+   /*
+* Since we don't have a mechanism to differentiate the RMR
+* SIDs that have an ongoing live stream, install bypass STEs
+* for all the reported ones.
+*/
+   list_for_each_entry(e, &rmr_list, list) {
+   __le64 *step;
+
+   ret = arm_smmu_init_sid_strtab(smmu, e->sid);
+   if (ret) {
+   dev_err(smmu->dev, "RMR bypass(0x%x) failed\n",
+   e->sid);
+   continue;
+   }
+
+   step = arm_smmu_get_step_for_sid(smmu, e->sid);
+   arm_smmu_write_strtab_ent(NULL, e->sid, step, true);
+   }
+}
+
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
int irq, ret;
@@ -3657,6 +3687,9 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
/* Record our private device structure */
platform_set_drvdata(pdev, smmu);
 
+   /* Check for RMRs and install bypass STEs if any */
+   arm_smmu_rmr_install_bypass_ste(smmu);
+
/* Reset the device */
ret = arm_smmu_device_reset(smmu, bypass);
if (ret)
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v4 5/8] iommu/arm-smmu-v3: Add bypass flag to arm_smmu_write_strtab_ent()

2021-05-13 Thread Shameer Kolothum
By default, disable_bypass is set and any dev without an iommu domain
installs an STE with CFG_ABORT during arm_smmu_init_bypass_stes(). Introduce
a "bypass" flag to arm_smmu_write_strtab_ent() so that we can force it to
install a CFG_BYPASS STE for specific SIDs. This will be useful in a
follow-up patch to install bypass for IORT RMR SIDs.

Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 754bad6092c1..f9195b740f48 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1174,7 +1174,7 @@ static void arm_smmu_sync_ste_for_sid(struct 
arm_smmu_device *smmu, u32 sid)
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
- __le64 *dst)
+ __le64 *dst, bool bypass)
 {
/*
 * This is hideously complicated, but we only really care about
@@ -1245,7 +1245,7 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_master *master, u32 sid,
 
/* Bypass/fault */
if (!smmu_domain || !(s1_cfg || s2_cfg)) {
-   if (!smmu_domain && disable_bypass)
+   if (!smmu_domain && disable_bypass && !bypass)
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_ABORT);
else
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_BYPASS);
@@ -1320,7 +1320,7 @@ static void arm_smmu_init_bypass_stes(__le64 *strtab, 
unsigned int nent)
unsigned int i;
 
for (i = 0; i < nent; ++i) {
-   arm_smmu_write_strtab_ent(NULL, -1, strtab);
+   arm_smmu_write_strtab_ent(NULL, -1, strtab, false);
strtab += STRTAB_STE_DWORDS;
}
 }
@@ -2097,7 +2097,7 @@ static void arm_smmu_install_ste_for_dev(struct 
arm_smmu_master *master)
if (j < i)
continue;
 
-   arm_smmu_write_strtab_ent(master, sid, step);
+   arm_smmu_write_strtab_ent(master, sid, step, false);
}
 }
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v4 4/8] iommu/arm-smmu-v3: Introduce strtab init helper

2021-05-13 Thread Shameer Kolothum
Introduce a helper to check the sid range and to init the l2 strtab
entries (bypass). This will be useful when we have to initialize the
l2 strtab with bypass for RMR SIDs.

Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++--
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 54b2f27b81d4..754bad6092c1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2369,6 +2369,19 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device 
*smmu, u32 sid)
return sid < limit;
 }
 
+static int arm_smmu_init_sid_strtab(struct arm_smmu_device *smmu, u32 sid)
+{
+   /* Check the SIDs are in range of the SMMU and our stream table */
+   if (!arm_smmu_sid_in_range(smmu, sid))
+   return -ERANGE;
+
+   /* Ensure l2 strtab is initialised */
+   if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
+   return arm_smmu_init_l2_strtab(smmu, sid);
+
+   return 0;
+}
+
 static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
  struct arm_smmu_master *master)
 {
@@ -2392,20 +2405,9 @@ static int arm_smmu_insert_master(struct arm_smmu_device 
*smmu,
new_stream->id = sid;
new_stream->master = master;
 
-   /*
-* Check the SIDs are in range of the SMMU and our stream table
-*/
-   if (!arm_smmu_sid_in_range(smmu, sid)) {
-   ret = -ERANGE;
+   ret = arm_smmu_init_sid_strtab(smmu, sid);
+   if (ret)
break;
-   }
-
-   /* Ensure l2 strtab is initialised */
-   if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
-   ret = arm_smmu_init_l2_strtab(smmu, sid);
-   if (ret)
-   break;
-   }
 
/* Insert into SID tree */
new_node = &(smmu->streams.rb_node);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 3/8] ACPI/IORT: Add a helper to retrieve RMR memory regions

2021-05-13 Thread Shameer Kolothum
Add a helper function that retrieves RMR memory descriptors
associated with a given IOMMU. This will be used by IOMMU
drivers to set up necessary mappings.

Now that we have this, invoke it from the generic helper
interface.

Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 40 +++
 drivers/iommu/dma-iommu.c |  3 +++
 include/linux/acpi_iort.h |  7 +++
 3 files changed, 50 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index fea1ffaedf3b..6ca88c38987a 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -837,6 +838,43 @@ static inline int iort_add_device_replay(struct device 
*dev)
return err;
 }
 
+/**
+ * iort_iommu_get_rmrs - Helper to retrieve RMR info associated with IOMMU
+ * @iommu_fwnode: fwnode for the IOMMU
+ * @head: RMR list head to be populated
+ *
+ * Returns: 0 on success, <0 failure
+ */
+int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode,
+   struct list_head *head)
+{
+   struct iort_rmr_entry *e;
+   struct acpi_iort_node *iommu;
+
+   iommu = iort_get_iort_node(iommu_fwnode);
+   if (!iommu)
+   return -EINVAL;
+
+   list_for_each_entry(e, &iort_rmr_list, list) {
+   struct acpi_iort_rmr_desc *rmr_desc;
+   struct iommu_rmr *rmr;
+
+   if (e->smmu != iommu)
+   continue;
+
+   rmr_desc = e->rmr_desc;
+   rmr = iommu_dma_alloc_rmr(rmr_desc->base_address,
+ rmr_desc->length, e->sid,
+ e->flags);
+   if (!rmr)
+   return -ENOMEM;
+
+   list_add_tail(&rmr->list, head);
+   }
+
+   return 0;
+}
+
 /**
  * iort_iommu_msi_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
@@ -1108,6 +1146,8 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
 const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
const u32 *input_id)
 { return NULL; }
+int iort_iommu_get_rmrs(struct fwnode_handle *fwnode, struct list_head *head)
+{ return -ENODEV; }
 #endif
 
 static int nc_dma_get_range(struct device *dev, u64 *size)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 674bd8815159..2d9caf548a32 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -203,6 +203,9 @@ EXPORT_SYMBOL(iommu_dma_get_resv_regions);
 int iommu_dma_get_rmrs(struct fwnode_handle *iommu_fwnode,
   struct list_head *list)
 {
+   if (!is_of_node(iommu_fwnode))
+   return iort_iommu_get_rmrs(iommu_fwnode, list);
+
return -EINVAL;
 }
 EXPORT_SYMBOL(iommu_dma_get_rmrs);
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 1a12baa58e40..e8c45fa59531 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -39,6 +39,8 @@ const struct iommu_ops *iort_iommu_configure_id(struct device 
*dev,
const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
+int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode,
+   struct list_head *list);
 #else
 static inline void acpi_iort_init(void) { }
 static inline u32 iort_msi_map_id(struct device *dev, u32 id)
@@ -59,6 +61,11 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
 
 static inline phys_addr_t acpi_iort_dma_get_max_cpu_address(void)
 { return PHYS_ADDR_MAX; }
+
+static inline
+int iort_iommu_get_rmrs(struct fwnode_handle *iommu_fwnode,
+   struct list_head *list)
+{ return -ENODEV; }
 #endif
 
 #endif /* __ACPI_IORT_H__ */
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 2/8] iommu/dma: Introduce generic helper to retrieve RMR info

2021-05-13 Thread Shameer Kolothum
Reserved Memory Regions (RMR) associated with an IOMMU can be
described through ACPI IORT tables in systems with devices
that require a unity mapping or bypass for those
regions.

Introduce a generic interface so that IOMMU drivers can retrieve
and set up necessary mappings.

Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/dma-iommu.c | 33 +
 include/linux/dma-iommu.h | 10 ++
 include/linux/iommu.h | 19 +++
 3 files changed, 62 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..674bd8815159 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -193,6 +193,39 @@ void iommu_dma_get_resv_regions(struct device *dev, struct 
list_head *list)
 }
 EXPORT_SYMBOL(iommu_dma_get_resv_regions);
 
+/**
+ * iommu_dma_get_rmrs - Retrieve Reserved Memory Regions(RMRs) associated
+ *  with a given IOMMU
+ * @iommu_fwnode: fwnode associated with IOMMU
+ * @list: RMR list to be populated
+ *
+ */
+int iommu_dma_get_rmrs(struct fwnode_handle *iommu_fwnode,
+  struct list_head *list)
+{
+   return -EINVAL;
+}
+EXPORT_SYMBOL(iommu_dma_get_rmrs);
+
+struct iommu_rmr *iommu_dma_alloc_rmr(u64 base, u64 length, u32 sid,
+ u32 flags)
+{
+   struct iommu_rmr *rmr;
+
+   rmr = kzalloc(sizeof(*rmr), GFP_KERNEL);
+   if (!rmr)
+   return NULL;
+
+   INIT_LIST_HEAD(&rmr->list);
+   rmr->base_address = base;
+   rmr->length = length;
+   rmr->sid = sid;
+   rmr->flags = flags;
+
+   return rmr;
+}
+EXPORT_SYMBOL(iommu_dma_alloc_rmr);
+
 static int cookie_init_hw_msi_region(struct iommu_dma_cookie *cookie,
phys_addr_t start, phys_addr_t end)
 {
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 6e75a2d689b4..319f332c279f 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -42,12 +42,17 @@ void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
 
 extern bool iommu_dma_forcedac;
 
+int iommu_dma_get_rmrs(struct fwnode_handle *iommu, struct list_head *list);
+struct iommu_rmr *iommu_dma_alloc_rmr(u64 base, u64 length, u32 sid, u32 
flags);
+
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
 struct msi_desc;
 struct msi_msg;
 struct device;
+struct fwnode_handle;
+struct iommu_rmr;
 
 static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base,
u64 size)
@@ -83,5 +88,10 @@ static inline void iommu_dma_get_resv_regions(struct device 
*dev, struct list_he
 {
 }
 
+static inline int iommu_dma_get_rmrs(struct fwnode_handle *iommu, struct list_head *list)
+{
+   return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_DMA */
 #endif /* __DMA_IOMMU_H */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..73cd2831cb45 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -555,6 +555,25 @@ struct iommu_sva {
struct device   *dev;
 };
 
+/**
+ * struct iommu_rmr - Reserved Memory Region details per IOMMU
+ * @list: Linked list pointers to hold RMR region info
+ * @base_address: base address of Reserved Memory Region
+ * @length: length of memory region
+ * @sid: associated stream id
+ * @flags: flags that apply to the RMR node
+ */
+struct iommu_rmr {
+   struct list_headlist;
+   phys_addr_t base_address;
+   u64 length;
+   u32 sid;
+   u32 flags;
+};
+
+/* RMR Remap permitted */
+#define IOMMU_RMR_REMAP_PERMITTED  (1 << 0)
+
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
  const struct iommu_ops *ops);
 void iommu_fwspec_free(struct device *dev);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 0/8] ACPI/IORT: Support for IORT RMR node

2021-05-13 Thread Shameer Kolothum
Hi,

v3 -->v4
-Included the SMMUv2 SMR bypass install changes suggested by
 Steve(patch #7)
-As per Robin's comments, RMR reserve implementation is now
 more generic  (patch #8) and dropped v3 patches 8 and 10.
-Rebase to 5.13-rc1 

The whole series is available here,
https://github.com/hisilicon/kernel-dev/tree/private-v5.13-rc1-rmr-v4-ext

RFC v2 --> v3
 -Dropped RFC tag as the ACPICA header changes are now ready to be
  part of 5.13[0]. But this series still has a dependency on that patch.
 -Added IORT E.b related changes(node flags, _DSM function 5 checks for
  PCIe).
 -Changed RMR to stream id mapping from M:N to M:1 as per the spec and
  discussion here[1].
 -Last two patches add support for SMMUv2(Thanks to Jon Nettleton!) 

Sanity tested on a HiSilicon D06. Further testing and feedback is greatly
appreciated.

https://github.com/hisilicon/kernel-dev/tree/private-v5.12-rc8-rmr-v3

Thanks,
Shameer

[0] 
https://lore.kernel.org/linux-acpi/20210406213028.718796-22-erik.kan...@intel.com/
[1] 
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-April/000150.html

RFC v1 --> v2:
 - Added a generic interface for IOMMU drivers to retrieve all the 
   RMR info associated with a given IOMMU.
 - SMMUv3 driver gets the RMR list during probe() and installs
   bypass STEs for all the SIDs in the RMR list. This is to keep
   the ongoing traffic alive(if any) during SMMUv3 reset. This is
   based on the suggestions received for v1 to take care of the
   EFI framebuffer use case. Only sanity tested for now.
 - During the probe/attach device, SMMUv3 driver reserves any
   RMR region associated with the device such that there is a unity
   mapping for them in SMMU.
---    

From RFC v1:
-
The series adds support to IORT RMR nodes specified in IORT
Revision E -ARM DEN 0049E[0]. RMR nodes are used to describe memory
ranges that are used by endpoints and require a unity mapping
in SMMU.

We have faced issues with 3408iMR RAID controller cards which
fail to boot when SMMU is enabled. This is because these controllers
make use of host memory for various caching related purposes and when
SMMU is enabled the iMR firmware fails to access these memory regions
as there is no mapping for them. IORT RMR provides a way for UEFI to
describe and report these memory regions so that the kernel can make
a unity mapping for these in SMMU.

Tests:

With a UEFI, that reports the RMR for the dev,

[16F0h 5872   1] Type : 06
[16F1h 5873   2]   Length : 007C
[16F3h 5875   1] Revision : 00
[1038h 0056   2] Reserved : 
[1038h 0056   2]   Identifier : 
[16F8h 5880   4]Mapping Count : 0001
[16FCh 5884   4]   Mapping Offset : 0040

[1700h 5888   4]Number of RMR Descriptors : 0002
[1704h 5892   4]RMR Descriptor Offset : 0018

[1708h 5896   8]  Base Address of RMR : E640
[1710h 5904   8]Length of RMR : 0010
[1718h 5912   4] Reserved : 

[171Ch 5916   8]  Base Address of RMR : 27B0
[1724h 5924   8]Length of RMR : 00C0
[172Ch 5932   4] Reserved : 

[1730h 5936   4]   Input base : 
[1734h 5940   4] ID Count : 0001
[1738h 5944   4]  Output Base : 0003
[173Ch 5948   4] Output Reference : 0064
[1740h 5952   4]Flags (decoded below) : 0001
   Single Mapping : 1
...

Without the series the RAID controller initialization fails as
below,

...
[   12.631117] megaraid_sas :03:00.0: FW supports sync cache: Yes   
[   12.637360] megaraid_sas :03:00.0: megasas_disable_intr_fusion is called 
outbound_intr_mask:0x4009   
[   18.776377] megaraid_sas :03:00.0: Init cmd return status FAILED for 
SCSI host 0 

[   23.019383] megaraid_sas :03:00.0: Waiting for FW to come to ready state 
[  106.684281] megaraid_sas :03:00.0: FW in FAULT state, Fault code:0x1 
subcode:0x0 func:megasas_transition_to_ready
[  106.695186] megaraid_sas :03:00.0: System Register set:  
[  106.889787] megaraid_sas :03:00.0: Failed to transition controller to 
ready for scsi0.
   
[  106.910475] megaraid_sas :03:00.0: Failed from megasas_init_fw 6407  
estuary:/$

With the series, now the kernel has direct mapping for the dev as
below,

estuary:/$ cat /sys/kernel/iommu_groups/0/reserved_regions  
0x0800 0x080f msi   
0x27b0 0x286f direct 

[PATCH v4 1/8] ACPI/IORT: Add support for RMR node parsing

2021-05-13 Thread Shameer Kolothum
Add support for parsing RMR node information from ACPI.
Find associated stream id and smmu node info from the
RMR node and populate a linked list with RMR memory
descriptors.

Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 104 +-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3912a1f6058e..fea1ffaedf3b 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -40,6 +40,19 @@ struct iort_fwnode {
 static LIST_HEAD(iort_fwnode_list);
 static DEFINE_SPINLOCK(iort_fwnode_lock);
 
+/*
+ * One entry for IORT RMR.
+ */
+struct iort_rmr_entry {
+   struct list_head list;
+   u32 sid;
+   struct acpi_iort_node *smmu;
+   struct acpi_iort_rmr_desc *rmr_desc;
+   u32 flags;
+};
+
+static LIST_HEAD(iort_rmr_list); /* list of RMR regions from ACPI */
+
 /**
  * iort_set_fwnode() - Create iort_fwnode and use it to register
  *iommu data in the iort_fwnode_list
@@ -393,7 +406,8 @@ static struct acpi_iort_node *iort_node_get_id(struct 
acpi_iort_node *node,
if (node->type == ACPI_IORT_NODE_NAMED_COMPONENT ||
node->type == ACPI_IORT_NODE_PCI_ROOT_COMPLEX ||
node->type == ACPI_IORT_NODE_SMMU_V3 ||
-   node->type == ACPI_IORT_NODE_PMCG) {
+   node->type == ACPI_IORT_NODE_PMCG ||
+   node->type == ACPI_IORT_NODE_RMR) {
*id_out = map->output_base;
return parent;
}
@@ -1660,6 +1674,91 @@ static void __init iort_enable_acs(struct acpi_iort_node 
*iort_node)
 #else
 static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { }
 #endif
+static int iort_rmr_desc_valid(struct acpi_iort_rmr_desc *desc, u32 count)
+{
+   int i, j;
+
+   for (i = 0; i < count; i++) {
+   u64 end, start = desc[i].base_address, length = desc[i].length;
+
+   if (!IS_ALIGNED(start, SZ_64K) || !IS_ALIGNED(length, SZ_64K))
+   return -EINVAL;
+
+   end = start + length - 1;
+
+   /* Check for address overlap */
+   for (j = i + 1; j < count; j++) {
+   u64 e_start = desc[j].base_address;
+   u64 e_end = e_start + desc[j].length - 1;
+
+   if (start <= e_end && end >= e_start)
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
+static int __init iort_parse_rmr(struct acpi_iort_node *iort_node)
+{
+   struct acpi_iort_node *smmu;
+   struct iort_rmr_entry *e;
+   struct acpi_iort_rmr *rmr;
+   struct acpi_iort_rmr_desc *rmr_desc;
+   u32 map_count = iort_node->mapping_count;
+   u32  sid;
+   int i, ret = 0;
+
+   if (iort_node->type != ACPI_IORT_NODE_RMR)
+   return 0;
+
+   if (!iort_node->mapping_offset || map_count != 1) {
+   pr_err(FW_BUG "Invalid ID mapping, skipping RMR node %p\n",
+  iort_node);
+   return -EINVAL;
+   }
+
+   /* Retrieve associated smmu and stream id */
+   smmu = iort_node_get_id(iort_node, &sid, 0);
+   if (!smmu) {
+   pr_err(FW_BUG "Invalid SMMU reference, skipping RMR node %p\n",
+  iort_node);
+   return -EINVAL;
+   }
+
+   /* Retrieve RMR data */
+   rmr = (struct acpi_iort_rmr *)iort_node->node_data;
+   if (!rmr->rmr_offset || !rmr->rmr_count) {
+   pr_err(FW_BUG "Invalid RMR descriptor array, skipping RMR node 
%p\n",
+  iort_node);
+   return -EINVAL;
+   }
+
+   rmr_desc = ACPI_ADD_PTR(struct acpi_iort_rmr_desc, iort_node,
+   rmr->rmr_offset);
+
+   ret = iort_rmr_desc_valid(rmr_desc, rmr->rmr_count);
+   if (ret) {
+   pr_err(FW_BUG "Invalid RMR descriptor[%d] for node %p, 
skipping...\n",
+  i, iort_node);
+   return ret;
+   }
+
+   for (i = 0; i < rmr->rmr_count; i++, rmr_desc++) {
+   e = kmalloc(sizeof(*e), GFP_KERNEL);
+   if (!e)
+   return -ENOMEM;
+
+   e->sid = sid;
+   e->smmu = smmu;
+   e->rmr_desc = rmr_desc;
+   e->flags = rmr->flags;
+
+   list_add_tail(&e->list, &iort_rmr_list);
+   }
+
+   return 0;
+}
 
 static void __init iort_init_platform_devices(void)
 {
@@ -1689,6 +1788,9 @@ static void __init iort_init_platform_devices(void)
 
iort_enable_acs(iort_node);
 
+   if (iort_table->revision == 3)
+   iort_parse_rmr(iort_node);
+
ops = iort_get_dev_cfg(iort_node);
if (ops) {
fwnode = acpi_alloc_fwnode_static();
-- 
2.17.1


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 06:00:12AM -0700, Jacob Pan wrote:
> > > If you want to do SVA PASID then it also must come with DMA APIs to
> > > manage the CPU cache coherence that are all NOP's on x86.  
> > 
> > Yes.  And we have plenty of precedent where an IOMMU is in "bypass" mode
> > to allow access to all memory and then uses the simple dma-direct case.
> I agree it is better not to expose the entire direct map. But the missing
> piece of using DMA APIs is the PASID. The caller needs the PASID value to
> do work submission once the buffer is mapped.

You still haven't explained why the kernel driver should have a PASID at all.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-05-13 Thread Jacob Pan
Hi Christoph,

On Wed, 12 May 2021 07:37:40 +0100, Christoph Hellwig 
wrote:

> On Tue, May 11, 2021 at 04:47:26PM -0300, Jason Gunthorpe wrote:
> > > Let me try to break down your concerns:
> > > 1. portability - driver uses DMA APIs can function w/ and w/o IOMMU.
> > > is that your concern? But PASID is intrinsically tied with IOMMU and
> > > if the drivers are using a generic sva-lib API, why they are not
> > > portable? SVA by its definition is to avoid map/unmap every time.  
> > 
> > Kernel explicitly does not support this programming model. All DMA is
> > explicit and the DMA API hides platform details like IOMMU and CPU
> > cache coherences. Just because x86 doesn't care about this doesn't
> > make any of it optional.  
> 
> Exactly.
Perhaps we can view these SVA-capable devices as an FPU, or a device that
can be fused onto the CPU by PASID binding. We don't require DMA map/unmap
for the FPU to use kernel virtual addresses, right?

> 
> > If you want to do SVA PASID then it also must come with DMA APIs to
> > manage the CPU cache coherence that are all NOP's on x86.  
> 
> Yes.  And we have plenty of precedent where an IOMMU is in "bypass" mode
> to allow access to all memory and then uses the simple dma-direct case.
I agree it is better not to expose the entire direct map. But the missing
piece of using DMA APIs is the PASID. The caller needs the PASID value to
do work submission once the buffer is mapped.

Perhaps we can add a parameter to the DMA API to specify the address space?
As Jason suggested the definition of IOASID, which represents a page table.
Just my quick thought, can you help us with a viable solution?
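
For example something like the below, which is purely hypothetical and
only meant to illustrate the question (not an existing API):

	dma_addr_t dma_map_single_pasid(struct device *dev, ioasid_t pasid,
					void *cpu_addr, size_t size,
					enum dma_data_direction dir);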

I know we are supposed to hide IOMMU by using DMA APIs which makes drivers
portable w/ and w/o IOMMU. This IOASID can be optional.

Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework

2021-05-13 Thread Lu Baolu

On 5/13/21 6:58 PM, Keqian Zhu wrote:



On 2021/5/12 19:36, Lu Baolu wrote:

Hi keqian,

On 5/12/21 4:44 PM, Keqian Zhu wrote:



On 2021/5/12 11:20, Lu Baolu wrote:

On 5/11/21 3:40 PM, Keqian Zhu wrote:

For upper layers, before starting page tracking, they check the
dirty_page_trackable attribution of the domain and start it only if it's
capable. Once the page tracking is switched on, the vendor iommu driver
(or iommu core) should block further device attach/detach operations
until page tracking is stopped.

But when a domain becomes capable after detaching a device, the upper layer
still needs to query it and enable dirty log for it...

To make things coordinated, maybe the upper layer can register a notifier,
when the domain's capability change, the upper layer do not need to query, 
instead
they just need to realize a callback, and do their specific policy in the 
callback.
What do you think?



That might be an option. But why not checking domain's attribution every
time a new tracking period is about to start?

Hi Baolu,

I'll add an attribution in iommu_domain, and the vendor iommu driver will update
the attribution when attach/detach devices.

The attribute should be protected by a lock, so the upper layer shouldn't access
the attribute directly. Then the iommu_domain_support_dirty_log() still should 
be
retained. Does this design looks good to you?


Yes, that's what I was thinking of. But I am not sure whether it worth
of a lock here. It seems not to be a valid behavior for upper layer to
attach or detach any device while doing the dirty page tracking.

Hi Baolu,

Right, if the "detach|attach" interfaces and "dirty tracking" interfaces can be 
called concurrently,
a lock in iommu_domain_support_dirty_log() is still not enough. I will add 
another note for the dirty
tracking interfaces.

Do you have other suggestions? I will accelerate the progress, so I plan to 
send out v5 next week.


No further comments except the nit below:

"iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking"

How about splitting it into
 - iommu_start_dirty_log()
 - iommu_stop_dirty_log()
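
e.g. (hypothetical prototypes, just to illustrate the split; the argument
list is a guess):

	int iommu_start_dirty_log(struct iommu_domain *domain,
				  unsigned long iova, size_t size);
	int iommu_stop_dirty_log(struct iommu_domain *domain,
				 unsigned long iova, size_t size);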

Not a strong opinion anyway.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/6] vfio: remove the unused mdev iommu hook

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 03:28:52AM +, Tian, Kevin wrote:

> Are you specially concerned about this iommu_device hack which 
> directly connects mdev_device to iommu layer or the entire removed
> logic including the aux domain concept? For the former we are now 
> following up the referred thread to find a clean way. But for the latter 
> we feel it's still necessary regardless of how iommu interface is redesigned 
> to support device connection from the upper level driver. The reason is 
> that with mdev or subdevice one physical device could be attached to 
> multiple domains now. there could be a primary domain with DOMAIN_
> DMA type for DMA_API use by parent driver itself, and multiple auxiliary 
> domains with DOMAIN_UNMANAGED types for subdevices assigned to 
> different VMs.

Why do we need more domains than just the physical domain for the
parent? How does auxdomain appear in /dev/ioasid?

I never understood what this dead code was supposed to be used for.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework

2021-05-13 Thread Keqian Zhu



On 2021/5/12 19:36, Lu Baolu wrote:
> Hi keqian,
> 
> On 5/12/21 4:44 PM, Keqian Zhu wrote:
>>
>>
>> On 2021/5/12 11:20, Lu Baolu wrote:
>>> On 5/11/21 3:40 PM, Keqian Zhu wrote:
> For upper layers, before starting page tracking, they check the
> dirty_page_trackable attribution of the domain and start it only if it's
> capable. Once the page tracking is switched on, the vendor iommu driver
> (or iommu core) should block further device attach/detach operations
> until page tracking is stopped.
 But when a domain becomes capable after detaching a device, the upper layer
 still needs to query it and enable dirty log for it...

 To make things coordinated, maybe the upper layer can register a notifier,
 when the domain's capability change, the upper layer do not need to query, 
 instead
 they just need to realize a callback, and do their specific policy in the 
 callback.
 What do you think?

>>>
>>> That might be an option. But why not checking domain's attribution every
>>> time a new tracking period is about to start?
>> Hi Baolu,
>>
>> I'll add an attribution in iommu_domain, and the vendor iommu driver will 
>> update
>> the attribution when attach/detach devices.
>>
>> The attribute should be protected by a lock, so the upper layer shouldn't 
>> access
>> the attribute directly. Then the iommu_domain_support_dirty_log() still 
>> should be
>> retained. Does this design looks good to you?
> 
> Yes, that's what I was thinking of. But I am not sure whether it worth
> of a lock here. It seems not to be a valid behavior for upper layer to
> attach or detach any device while doing the dirty page tracking.
Hi Baolu,

Right, if the "detach|attach" interfaces and "dirty tracking" interfaces can be 
called concurrently,
a lock in iommu_domain_support_dirty_log() is still not enough. I will add 
another note for the dirty
tracking interfaces.

Do you have other suggestions? I will accelerate the progress, so I plan to 
send out v5 next week.

Thanks,
Keqian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/1] iommu: Delete a duplicate check in iommu_change_dev_def_domain()

2021-05-13 Thread Zhen Lei
Function iommu_group_store_type() is the only caller of the static
function iommu_change_dev_def_domain() and has already performed the
"if (WARN_ON(!group))" check before calling it, so the check here is
redundant.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/iommu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 971068da67cb91d..8cdf6a1c4bfd773 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3059,9 +3059,6 @@ static int iommu_change_dev_def_domain(struct iommu_group 
*group,
int ret, dev_def_dom;
struct device *dev;
 
-   if (!group)
-   return -EINVAL;
-
mutex_lock(&group->mutex);
 
if (group->default_domain != group->domain) {
-- 
2.26.0.106.g9fadedd


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, May 13, 2021 2:01 PM
> >
> > But this definitely all becomes HW specific.
> >
> > For instance I want to have an ARM vIOMMU driver it needs to do some
> >
> >  ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is
> ARMvXXX])
> >  if (ret == -EOPNOTSUPP)
> >  ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
> >  // and do completely different and more expensive emulation
> >
> > I can get a little bit of generality, but at the end of the day the
> > IOMMU must create a specific HW layout of the nested page table, if it
> > can't, it can't.
> 
> Erm.. I don't really know how your IOASID interface works here.  I'm
> thinking about the VFIO interface where maps and unmaps are via
> explicit ioctl()s, which provides an obvious point to do translation
> between page table formats.
> 

We are working on a draft uAPI proposal based on the discussions in this
thread. Expect to send it out next week.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> 
> > This is a good feature in general when let's say there is a linux supported
> > device which has a proprietary device firmware update tool which only exists
> > as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> > emulation would provide a way to run the tool to update a physical device.
> 
> That specific use case doesn't really need a vIOMMU though, does it?

Possibly not, but the mechanics needed to do vIOMMU on different host
IOMMU aren't really different from what you need for a no-vIOMMU
guest.  With a vIOMMU you need to map guest IOVA space into the host
IOVA space.  With no vIOMMU you need to map guest physical
addresses into the host IOVA space.  In either case the GPA/gIOVA to
userspace and userspace to HPA mappings are basically arbitrary.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > Again, I don't know enough about VDPA to make sense of that.  Are we
> > essentially talking non-PCI virtual devices here?  In which case you
> > could define the VDPA "bus" to always have one-device groups.
> 
> It is much worse than that.
> 
> What these non-PCI devices need is for the kernel driver to be part of
> the IOMMU group of the underlying PCI device but tell VFIO land that
> "groups don't matter"

I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter".  Really the only way you can afford
to not care about groups is if they're singletons.

> Today mdev tries to fake this by using singleton iommu groups, but it
> is really horrible and directly hacks up the VFIO IOMMU code to
> understand these special cases. Intel was proposing more special
> hacking in the VFIO IOMMU code to extend this to PASID.

At this stage I don't really understand why that would end up so
horrible.

> When we get to a /dev/ioasid this is all nonsense. The kernel device
> driver is going to have to tell drivers/iommu exactly what kind of
> ioasid it can accept, be it a PASID inside a kernel owned group, a SW
> emulated 'mdev' ioasid, or whatever.
> 
> In these cases the "group" idea has become a fiction that just creates
> a pain.

I don't see how the group is a fiction in this instance.  You can
still have devices that can't be isolated, therefore you can have
non-singleton groups.

> "Just reorganize VDPA to do something insane with the driver
> core so we can create a dummy group to satisfy an unnecessary uAPI
> restriction" is not a very compelling argument.
> 
> So if the nonsensical groups goes away for PASID/mdev, where does it
> leave the uAPI in other cases?
> 
> > I don't think simplified-but-wrong is a good goal.  The thing about
> > groups is that if they're there, you can't just "not care" about them,
> > they affect you whether you like it or not.
> 
> You really can. If one thing claims the group then all the other group
> devices become locked out.

Aside: I'm primarily using "group" to mean the underlying hardware
unit, not the vfio construct on top of it, I'm not sure that's been
clear throughout.

So.. your model assumes that every device has a safe quiescent state
where it won't do any harm until poked, whether its group is
currently kernel owned, or owned by a userspace that doesn't know
anything about it.

At minimum this does mean that in order to use one device in the group
you must have permission to use *all* the devices in the group -
otherwise you may be able to operate a device you don't have
permission to by DMAing to its registers from a device you do have
permission to.

Whatever scripts are managing ownership of devices also need to know
about groups, because they need to put all the devices into that
quiescent state before the group can change ownership.

> The main point to understand is that groups are NOT an application
> restriction! It is a whole system restriction that the operator needs
> to understand and deal with. This is why things like dpdk don't care
> about the group at all - there is nothing they can do with the
> information.
> 
> If the operator says to run dpdk on a specific device then the
> operator is the one that has to deal with all the other devices in the
> group getting locked out.

Ok, I think I see your point there.

> At best the application can make it more obvious that the operator is
> doing something dangerous, but the current kernel API doesn't seem to
> really support that either.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Tue, May 04, 2021 at 03:15:37PM -0300, Jason Gunthorpe wrote:
> On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > > There is a certain appeal to having some
> > > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > > > information like windows that can be optionally called by the viommu
> > > > > driver and it remains well defined and described.
> > > > 
> > > > Windows really aren't ppc specific.  They're absolutely there on x86
> > > > and everything else as well - it's just that people are used to having
> > > > a window at 0.. that you can often get away with
> > > > treating it sloppily.
> > > 
> > > My point is this detailed control seems to go on to more than just
> > > windows. As you say the vIOMMU is emulating specific HW that needs to
> > > have kernel interfaces to match it exactly.
> > 
> > It's really not that bad.  The case of emulating the PAPR vIOMMU on
> > something else is relatively easy, because all updates to the IO page
> > tables go through hypercalls.  So, as long as the backend IOMMU can
> > map all the IOVAs that the guest IOMMU can, then qemu's implementation
> > of those hypercalls just needs to put an equivalent mapping in the
> > backend, which it can do with a generic VFIO_DMA_MAP.
> 
> So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

Well, I don't want to preclude it in the API.  I'm not sure about that
specific example, but in most cases it should be possible to run the
PAPR vIOMMU on an x86 IOMMU backend.  Obviously only something you'd
want to do for testing and experimentation, but it could be quite
useful for that.

> > vIOMMUs with page tables in guest memory are harder, but only really
> > in the usual ways that a vIOMMU of that type is harder (needs cache
> > mode or whatever).  At whatever point you need to shadow from the
> > guest IO page tables to the host backend, you can again do that with
> > generic maps, as long as the backend supports the necessary IOVAs, and
> > has an IO page size that's equal to or a submultiple of the vIOMMU
> > page size.
> 
> But this definitely all becomes HW specific.
> 
> For instance I want to have an ARM vIOMMU driver it needs to do some
> 
>  ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
>  if (ret == -EOPNOTSUPP)
>  ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
>  // and do completely different and more expensive emulation
> 
> I can get a little bit of generality, but at the end of the day the
> IOMMU must create a specific HW layout of the nested page table, if it
> can't, it can't.

Erm.. I don't really know how your IOASID interface works here.  I'm
thinking about the VFIO interface where maps and unmaps are via
explicit ioctl()s, which provides an obvious point to do translation
between page table formats.

But.. even if you're exposing page tables to userspace.. with hardware
that has explicit support for nesting you can probably expose the hw
tables directly which is great for the cases that works for.  But
surely for older IOMMUs which don't do nesting you must have some way
of shadowing guest IO page tables to host IO page tables to translate
GPA to HPA at least?  If you're doing that, I don't see that
converting page table format is really any harder

> > > I'm remarking that trying to unify every HW IOMMU implementation that
> > > ever has/will exist into a generic API complete enough to allow the
> > > vIOMMU to be created is likely to result in an API too complicated to
> > > understand..
> > 
> > Maybe not every one, but I think we can get a pretty wide range with a
> > reasonable interface.  
> 
> It sounds like a reasonable guideline is if the feature is actually
> general to all IOMMUs and can be used by qemu as part of a vIOMMU
> emulation when compatible vIOMMU HW is not available.
> 
> Having 'requested window' support that isn't actually implemented in
> every IOMMU is going to mean the PAPR vIOMMU emulation won't work,
> defeating the whole point of making things general?

The trick is that you don't necessarily need dynamic window support in
the backend to emulate it in the vIOMMU.  If your backend has fixed
windows, then you emulate request window as:
    if (requested window is within backend windows)
            no-op;
    else
            return ERROR;
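
In C terms, that check is roughly (struct iova_window and backend_windows[]
are a hypothetical description of the fixed host IOVA ranges the backend
reports):

	struct iova_window { uint64_t start, end; };
	static struct iova_window backend_windows[8];
	static int nr_backend_windows;

	static int emulate_create_window(uint64_t base, uint64_t size)
	{
		int i;

		for (i = 0; i < nr_backend_windows; i++) {
			if (base >= backend_windows[i].start &&
			    base + size - 1 <= backend_windows[i].end)
				return 0;	/* no-op: already covered */
		}
		return -EINVAL;			/* reject the guest's request */
	}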

It might not be a theoretically complete emulation of the vIOMMU, but
it can support in-practice usage.  In particular it works pretty well
if your backend has a nice big IOVA range (like x86 IOMMUS) but your
guest platform typically uses relatively small IOVA windows.  PAPR on
x86 is exactly that... well.. possibly not the 64-bit window, but
because of old PAPR platforms that didn't support that, we can choose
not to advertise that and guests will cope.

-- 
David Gibson