Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Baolu Lu

On 2022/6/1 02:51, Jason Gunthorpe wrote:

Oh, I've spent the last couple of weeks hacking up horrible things
manipulating entries in init_mm, and never realised that that was actually
the special case. Oh well, live and learn.

The init_mm is sort of different, it doesn't have zap in quite the
same way, for example. I was talking about the typical process mm.

Anyhow, the right solution is to use RCU as I described before, Baolu
do you want to try?


Yes, of course.

Your discussion with Robin gave me a lot of inspiration. Much
appreciated! I want to solve this debugfs problem in a separate patch,
because it has gone beyond the original intention of this series.

Best regards,
baolu


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Baolu Lu

On 2022/5/31 23:59, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:


+    break;
+    pgtable_walk_level(m, phys_to_virt(phys_addr),


Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
gets you a valid struct page. Whether that page is direct-mapped kernel
memory or not is a different matter.


Even though this is debugfs, if the operation is sketchy like that and
can theoretically crash the kernel, the driver should test capabilities;
CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
better cap for 'userspace may crash the kernel'.


Yes. We should test both CAP_SYS_ADMIN and CAP_SYS_RAWIO.
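
For example, a minimal sketch of the check at the top of the debugfs
show routine (placing it in domain_translation_struct_show() is just my
assumption):

	/* sketch: refuse the page-table dump unless the caller holds both caps */
	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
		return -EPERM;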

Best regards,
baolu


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Baolu Lu

Hi Robin,

Thank you for the comments.

On 2022/5/31 21:52, Robin Murphy wrote:

On 2022-05-31 04:02, Baolu Lu wrote:

On 2022/5/30 20:14, Jason Gunthorpe wrote:

On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:


[--snip--]

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c

index d927ef10641b..e6f4835b8d9f 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,

  continue;

  path[level] = pde->val;
-    if (dma_pte_superpage(pde) || level == 1)
+    if (dma_pte_superpage(pde) || level == 1) {
  dump_page_info(m, start, path);
-    else
-    pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+    } else {
+    unsigned long phys_addr;
+
+    phys_addr = (unsigned long)dma_pte_addr(pde);
+    if (!pfn_valid(__phys_to_pfn(phys_addr)))


Given that pte_present(pde) passed just above, it was almost certainly a 
valid entry, so it seems unlikely that the physical address it pointed 
to could have disappeared in the meantime. If you're worried about the 
potential case where we've been preempted during this walk for long 
enough that the page has already been freed by an unmap, reallocated, 


Yes. This is exactly what I am worried about and what this patch wants
to solve.

and filled with someone else's data that happens to look like valid 
PTEs, this still isn't enough, since that data could just as well happen 
to look like valid physical addresses too.
I imagine that if you want to safely walk pagetables concurrently with 
them potentially being freed, you'd probably need to get RCU involved.


I don't want to make the map/unmap interfaces more complex or inefficient
because of a debugfs feature. I hope the debugfs and map/unmap interfaces
can stay orthogonal: like the IOMMU hardware walking the page tables, the
walk should continue as long as the physical address being accessed is
valid and accessible, and stop immediately otherwise. If we can't
achieve this, I'd rather stop supporting this debugfs node.




+    break;
+    pgtable_walk_level(m, phys_to_virt(phys_addr),


Also, obligatory reminder that pfn_valid() only means that pfn_to_page() 
gets you a valid struct page. Whether that page is direct-mapped kernel 
memory or not is a different matter.


Perhaps I can check this from the page flags?




 level - 1, start, path);
+    }
  path[level] = 0;
  }
  }

-static int show_device_domain_translation(struct device *dev, void *data)
+static int __show_device_domain_translation(struct device *dev, void *data)

  {
  struct device_domain_info *info = dev_iommu_priv_get(dev);
  struct dmar_domain *domain = info->domain;
  struct seq_file *m = data;
  u64 path[6] = { 0 };

-    if (!domain)
-    return 0;
-
  seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
 (u64)virt_to_phys(domain->pgd));
  seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
@@ -359,20 +362,27 @@ static int show_device_domain_translation(struct device *dev, void *data)

  pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
  seq_putc(m, '\n');

-    return 0;
+    return 1;
  }

-static int domain_translation_struct_show(struct seq_file *m, void *unused)
+static int show_device_domain_translation(struct device *dev, void *data)

  {
-    unsigned long flags;
-    int ret;
+    struct iommu_group *group;

-    spin_lock_irqsave(&device_domain_lock, flags);
-    ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-   show_device_domain_translation);
-    spin_unlock_irqrestore(&device_domain_lock, flags);
+    group = iommu_group_get(dev);
+    if (group) {
+    iommu_group_for_each_dev(group, data,
+ __show_device_domain_translation);


Why group_for_each_dev?


This will hold the group mutex when the callback is invoked. With the
group mutex held, the domain cannot be changed.

If there *are* multiple devices in the group 
then by definition they should be attached to the same domain, so 
dumping that domain's mappings more than once seems pointless. 
Especially given that the outer bus_for_each_dev iteration will already 
visit each individual device anyway, so this would only make the 
redundancy even worse than it already is.


__show_device_domain_translation() only dumps mappings once as it always
returns 1.
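
To spell the pattern out, the final function is intended to look roughly
like the sketch below (the iommu_group_put() and return at the end are
my assumption of how the function finishes, since the hunk above is cut
off):

	static int show_device_domain_translation(struct device *dev, void *data)
	{
		struct iommu_group *group;

		group = iommu_group_get(dev);
		if (group) {
			/*
			 * The group mutex is held across the callback, so the
			 * domain cannot change during the walk, and because
			 * __show_device_domain_translation() returns 1 the
			 * dump runs only once per group.
			 */
			iommu_group_for_each_dev(group, data,
						 __show_device_domain_translation);
			iommu_group_put(group);
		}

		return 0;
	}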

Best regards,
baolu

Re: [PATCH] dma-debug: Make things less spammy under memory pressure

2022-05-31 Thread Christoph Hellwig
On Tue, May 31, 2022 at 03:19:45PM -0700, Rob Clark wrote:
> um, quite..  tbf that was in the context of a WIP igt test for
> shrinker which was trying to cycle thru ~2x RAM size worth of GEM
> buffers on something like 72 threads.  So it could just be threads
> that had gotten past the dma_debug_disabled() check already before
> global_disable was set to true?
> 
> I guess this could be pr_err_once() instead, then?

Yes, we could use pr_err_once to reduce the chattiness while still
keeping global_disable to disable all the actual tracking.
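
i.e. a minimal sketch of that alternative in add_dma_entry():

	rc = active_cacheline_insert(entry);
	if (rc == -ENOMEM) {
		/* report once; global_disable still stops all further tracking */
		pr_err_once("cacheline tracking ENOMEM, dma-debug disabled\n");
		global_disable = true;
	}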


RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, June 1, 2022 4:44 AM
> 
> Hi Jason,
> 
> On Tue, 31 May 2022 16:05:50 -0300, Jason Gunthorpe 
> wrote:
> 
> > On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote:
> >
> > > The reason why I store PASID at IOMMU domain is for IOTLB flush within
> > > the domain. Device driver is not aware of domain level IOTLB flush. We
> > > also have iova_cookie for each domain which essentially is for
> > > RIDPASID.
> >
> > You need to make the PASID stuff work generically.
> >
> > The domain needs to hold a list of all the places it needs to flush
> > and that list needs to be maintained during attach/detach.
> >
> > A single PASID on the domain is obviously never going to work
> > generically.
> >
> I agree. The way I did it, it was really meant to be part of the
> iommu_domain's iommu_dma_cookie, not meant to be global. But for lack of
> common storage between the identity domain and the dma domain, I put it
> here as a global.
> 
> Then should we also extract RIDPASID to become part of the generic API?
> i.e. set pasid, flush IOTLB etc. Right? RIDPASID is not in group's
> pasid_array today.
> 

RIDPASID is just an alias to RID in the PASID table (similar to pasid#0
on other platforms). It's reserved and not exposed outside the
intel-iommu driver, so I don't think it should be managed via the
generic iommu API. But you can probably generalize it with the other
pasids within the intel-iommu driver if doing so simplifies the logic
there.


RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Wednesday, June 1, 2022 1:30 AM
> > >
> > > In both cases the pasid is stored in the attach data instead of the
> > > domain.
> > >
> So during IOTLB flush for the domain, do we loop through the attach data?

Yes and it's required.
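
Purely as an illustration of what I mean (the structure, list and helper
names below are hypothetical, not existing code; only list_for_each_entry
and ioasid_t are real):

	/* hypothetical per-attachment bookkeeping kept on the domain */
	struct attach_data {
		struct list_head domain_node;	/* on a hypothetical domain->attach_list */
		struct device *dev;
		ioasid_t pasid;			/* INVALID_IOASID when attached by RID only */
	};

	static void domain_flush_iotlb(struct dmar_domain *domain)
	{
		struct attach_data *attach;

		list_for_each_entry(attach, &domain->attach_list, domain_node)
			/* hypothetical helper: RID-only or RID+PASID flush */
			domain_flush_one(domain, attach->dev, attach->pasid);
	}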

> 
> > > DMA API pasid is no special from above except it needs to allow
> > > one device attached to the same domain twice (one with RID
> > > and the other with RID+PASID).
> > >
> > > for iommufd those operations are initiated by userspace via
> > > iommufd uAPI.
> >
> > My understanding is that device driver owns its PASID policy. If ENQCMD
> > is supported on the device, the PASIDs should be allocated through
> > ioasid_alloc(). Otherwise, the whole PASID pool is managed by the device
> > driver.
> >
> It seems the changes we want for this patchset are:
> 1. move ioasid_alloc() from the core to device (allocation scope will be
> based on whether ENQCMD is intended or not)

yes, and the driver can specify whether the allocation is system-wide
or per-device.

> 2. store pasid in the attach data
> 3. use the same iommufd api to attach/set pasid on its default domain

s/iommufd/iommu/


Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops

2022-05-31 Thread Stefano Stabellini
On Tue, 31 May 2022, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko 
> 
> The main purpose of this binding is to communicate Xen specific
> information using generic IOMMU device tree bindings (which is
> a good fit here) rather than introducing a custom property.
> 
> Introduce Xen specific IOMMU for the virtualized device (e.g. virtio)
> to be used by Xen grant DMA-mapping layer in the subsequent commit.
> 
> The reference to Xen specific IOMMU node using "iommus" property
> indicates that Xen grant mappings need to be enabled for the device,
> and it specifies the ID of the domain where the corresponding backend
> resides. The domid (domain ID) is used as an argument to the Xen grant
> mapping APIs.
> 
> This is needed for the option to restrict memory access using Xen grant
> mappings to work, whose primary goal is to enable using virtio devices
> in Xen guests.
> 
> Signed-off-by: Oleksandr Tyshchenko 
> ---
> Changes RFC -> V1:
>- update commit subject/description and text in description
>- move to devicetree/bindings/arm/
> 
> Changes V1 -> V2:
>- update text in description
>- change the maintainer of the binding
>- fix validation issue
>- reference xen,dev-domid.yaml schema from virtio/mmio.yaml
> 
> Change V2 -> V3:
>- Stefano already gave his Reviewed-by, I dropped it due to the changes 
> (significant)
>- use generic IOMMU device tree bindings instead of custom property
>  "xen,dev-domid"
>- change commit subject and description, was
>  "dt-bindings: Add xen,dev-domid property description for xen-grant DMA 
> ops"
> ---
>  .../devicetree/bindings/iommu/xen,grant-dma.yaml   | 49 
> ++
>  1 file changed, 49 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
> 
> diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml 
> b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
> new file mode 100644
> index ..ab5765c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
> @@ -0,0 +1,49 @@
> +# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Xen specific IOMMU for virtualized devices (e.g. virtio)
> +
> +maintainers:
> +  - Stefano Stabellini 
> +
> +description:
> +  The reference to Xen specific IOMMU node using "iommus" property indicates
> +  that Xen grant mappings need to be enabled for the device, and it specifies
> +  the ID of the domain where the corresponding backend resides.
> +  The binding is required to restrict memory access using Xen grant mappings.

I think this is OK and in line with the discussion we had on the list. I
propose the following wording instead:

"""
The Xen IOMMU represents the Xen grant table interface. Grant mappings
are to be used with devices connected to the Xen IOMMU using the
"iommus" property, which also specifies the ID of the backend domain.
The binding is required to restrict memory access using Xen grant
mappings.
"""


> +properties:
> +  compatible:
> +const: xen,grant-dma
> +
> +  '#iommu-cells':
> +const: 1
> +description:
> +  Xen specific IOMMU is multiple-master IOMMU device.
> +  The single cell describes the domid (domain ID) of the domain where
> +  the backend is running.

Here I would say:

"""
The single cell is the domid (domain ID) of the domain where the backend
is running.
"""

With the two wording improvements:

Reviewed-by: Stefano Stabellini 


> +required:
> +  - compatible
> +  - "#iommu-cells"
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +xen_iommu {
> +compatible = "xen,grant-dma";
> +#iommu-cells = <1>;
> +};
> +
> +virtio@3000 {
> +compatible = "virtio,mmio";
> +reg = <0x3000 0x100>;
> +interrupts = <41>;
> +
> +/* The backend is located in Xen domain with ID 1 */
> +iommus = <&xen_iommu 1>;
> +};
> -- 
> 2.7.4
> 
> 


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:

> There are only 3 instances where we'll free a table while the domain is
> live. The first is the one legitimate race condition, where two map requests
> targeting relatively nearby PTEs both go to fill in an intermediate level of
> table; whoever loses that race frees the table they allocated, but it was
> never visible to anyone else so that's definitely fine. The second is if
> we're mapping a block entry, and find that there's already a table entry
> there, wherein we assume the table must be empty, clear the entry,
> invalidate any walk caches, install the block entry, then free the orphaned
> table; since we're mapping the entire IOVA range covered by that table,
> there should be no other operations on that IOVA range attempting to walk
> the table at the same time, so it's fine. 

I saw these two in the Intel driver

> The third is effectively the inverse, if we get a block-sized unmap
> but find a table entry rather than a block at that point (on the
> assumption that it's de-facto allowed for a single unmap to cover
> multiple adjacent mappings as long as it does so exactly); similarly
> we assume that the table must be full, and no other operations
> should be racing because we're unmapping its whole IOVA range, so we
> remove the table entry, invalidate, and free as before.

Not sure I noticed this one though

This all makes sense though.

> Although we don't have debug dumping for io-pgtable-arm, it's good to be
> thinking about this, since it's made me realise that dirty-tracking sweeps
> per that proposal might pose a similar kind of concern, so we might still
> need to harden these corners for the sake of that.

Let's make sure Joao sees this..

It is interesting because we probably don't want the big latency
spikes that would come from using locking to block map/unmap while
dirty reading is happening - eg at the iommufd level.

From a consistency POV the only case that matters is unmap, and unmap
should already be doing a dedicated dirty read directly prior to the
unmap (as per that other long thread).

So having safe racy reading in the kernel is probably best, and so RCU
would be good here too.
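
For the debugfs dump that would look something like the sketch below
(assuming the unmap path is converted to defer freeing of table pages
via call_rcu() or an RCU-deferred freelist):

	rcu_read_lock();
	pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
	rcu_read_unlock();

i.e. the reader never blocks map/unmap, it just has to tolerate seeing
either the old or the new state of any given entry.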

> that somewhere I have some half-finished patches making io-pgtable-arm use
> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
> how it goes when I get there).

This is all very similar to how the mm's tlb gather stuff works,
especially on PPC which requires RCU protection of page tables instead
of the IPI trick.

Currently the rcu_head and the lru overlap in the struct page. To do
this I'd suggest following what was done for slab - see ca1a46d6f506
("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
like 'struct slab' for use with iommu page tables and put the
list_head and rcu_head sequentially in the struct iommu_pt_page.

Continue to use put_pages_list() just with the list_head in the new
struct not the lru.
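
Roughly this kind of layout, as a sketch only (the struct name is just
the placeholder from above, and the leading fields would have to mirror
struct page the way 'struct slab' does):

	/* hypothetical 'struct page' sub-class for IOMMU page-table pages */
	struct iommu_pt_page {
		unsigned long flags;
		struct list_head page_list;	/* for put_pages_list()-style freeing */
		struct rcu_head rcu;		/* for RCU-deferred freeing */
	};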

If we need it for dirty tracking then it makes sense to start
progressing parts at least..

Jason


Re: [PATCH] dma-debug: Make things less spammy under memory pressure

2022-05-31 Thread Rob Clark
On Tue, May 31, 2022 at 3:00 PM Robin Murphy  wrote:
>
> On 2022-05-31 22:51, Rob Clark wrote:
> > From: Rob Clark 
> >
> > Ratelimit the error msg to avoid flooding the console.
> >
> > Signed-off-by: Rob Clark 
> > ---
> >   kernel/dma/debug.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
> > index f8ff598596b8..683966f0247b 100644
> > --- a/kernel/dma/debug.c
> > +++ b/kernel/dma/debug.c
> > @@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry 
> > *entry, unsigned long attrs)
> >
> >   rc = active_cacheline_insert(entry);
> >   if (rc == -ENOMEM) {
> > - pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
> > + pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug disabled\n");
> >   global_disable = true;
>
> Given that it's supposed to disable itself entirely if it ever gets
> here, just how spammy is it exactly?

um, quite..  tbf that was in the context of a WIP igt test for
shrinker which was trying to cycle thru ~2x RAM size worth of GEM
buffers on something like 72 threads.  So it could just be threads
that had gotten past the dma_debug_disabled() check already before
global_disable was set to true?

I guess this could be pr_err_once() instead, then?

BR,
-R

> Thanks,
> Robin.
>
> >   } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> >   err_printk(entry->dev, entry,


Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free

2022-05-31 Thread Tony Battersby
On 5/31/22 17:54, Keith Busch wrote:
> On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote:
>> dma_pool_free() scales poorly when the pool contains many pages because
>> pool_find_page() does a linear scan of all allocated pages.  Improve its
>> scalability by replacing the linear scan with a red-black tree lookup.
>> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
> The improvement should say O(n) to O(log n), right?

That would be the improvement for a single call to dma_pool_alloc or
dma_pool_free, but I was going with the improvement for "n" calls
instead, which is consistent with the improvement for the example in the
cover letter for mpt3sas.  I would have to look up the convention to be
sure of the proper notation in a situation like this, but I usually
think "inserting N items takes N^2 time"; to me it makes less sense to
say "inserting 1 item takes N time", because the N seems to come out of
nowhere.
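
In other words, summed over the N calls in the example the totals are:

	\sum_{i=1}^{N} O(i)      = O(N^2)       (linear scan)
	\sum_{i=1}^{N} O(\log i) = O(N \log N)  (red-black tree)

while per individual call it is indeed O(n) vs O(log n) as you say.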

Tony


Re: [PATCH] dma-debug: Make things less spammy under memory pressure

2022-05-31 Thread Robin Murphy

On 2022-05-31 22:51, Rob Clark wrote:

From: Rob Clark 

Ratelimit the error msg to avoid flooding the console.

Signed-off-by: Rob Clark 
---
  kernel/dma/debug.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index f8ff598596b8..683966f0247b 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry *entry, 
unsigned long attrs)
  
  	rc = active_cacheline_insert(entry);

if (rc == -ENOMEM) {
-   pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
+   pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug disabled\n");
global_disable = true;


Given that it's supposed to disable itself entirely if it ever gets 
here, just how spammy is it exactly?


Thanks,
Robin.


} else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
err_printk(entry->dev, entry,



Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics

2022-05-31 Thread Robin Murphy

On 2022-05-31 20:52, Tony Battersby wrote:

On 5/31/22 15:48, Robin Murphy wrote:

On 2022-05-31 19:17, Tony Battersby wrote:


 pool->name, blocks,
-(size_t) pages *
-(pool->allocation / pool->size),
+(size_t) pages * pool->blks_per_alloc,
 pool->size, pages);
size -= temp;
next += temp;
@@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
+   retval->blks_per_alloc =
+   (allocation / boundary) * (boundary / size) +
+   (allocation % boundary) / size;

Do we really need to store this? Sure, 4 divisions (which could possibly
be fewer given the constraints on boundary) isn't the absolute cheapest
calculation, but I still can't imagine anyone would be polling sysfs
stats hard enough to even notice.


The stored value is also used in patch #5, in more performance-critical
code, although only when debug is enabled.


Ah, fair enough. On second look I think 64-bit systems could effectively 
store this for free anyway, if patch #2 moved "size" down past "dev" in 
struct dma_pool, such that blks_per_alloc then ends up padding out the 
hole again.


FWIW the mathematician in me also now can't help seeing the algebraic 
reduction to at least "(allocation + (allocation % boundary)) / size", 
but is now too tired to reason about the power-of-two constraints and 
whether the intermediate integer truncations matter...
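
As a quick sanity check with the numbers from patch 03 (size = 512,
boundary = 2048, allocation = 4096), the stored value comes out as:

	blks_per_alloc = (4096 / 2048) * (2048 / 512) + (4096 % 2048) / 512
	               = 2 * 4 + 0
	               = 8

which matches the eight address ranges listed there.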


Cheers,
Robin.


Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free

2022-05-31 Thread Keith Busch
On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote:
> dma_pool_free() scales poorly when the pool contains many pages because
> pool_find_page() does a linear scan of all allocated pages.  Improve its
> scalability by replacing the linear scan with a red-black tree lookup.
> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).

The improvement should say O(n) to O(log n), right?


[PATCH] dma-debug: Make things less spammy under memory pressure

2022-05-31 Thread Rob Clark
From: Rob Clark 

Ratelimit the error msg to avoid flooding the console.

Signed-off-by: Rob Clark 
---
 kernel/dma/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index f8ff598596b8..683966f0247b 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry *entry, 
unsigned long attrs)
 
rc = active_cacheline_insert(entry);
if (rc == -ENOMEM) {
-   pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
+   pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug disabled\n");
global_disable = true;
} else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
err_printk(entry->dev, entry,
-- 
2.36.1



Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy

2022-05-31 Thread Keith Busch
On Tue, May 31, 2022 at 02:22:21PM -0400, Tony Battersby wrote:
> +static void pool_free_page(struct dma_pool *pool,
> +struct dma_page *page,
> +bool destroying_pool)

'destroying_pool' is always true, so I don't think you need it.


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 19:51, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:


And we expect the iommu driver to be unable to free page table levels
that have IOVA boundaries in them?


I'm not entirely sure what you mean there, but in general an unmap request
is expected to match some previous map request


atomic cmpxchg is OK for inserting new page table levels but it can't
protect you against concurrent freeing of page table levels. So
without locks it means that page tables can't usually be freed. Which
seems to match what the Intel driver does - at least from a cursory
look.

This is one of the reasons the mm has the mmap/etc lock and spinlocks
because we do expect page table levels to get wiped out when VMA's are
zap'd - all the different locks provide the protection against page
tables disappearing out from under something manipulating them.

Basically every "lockless" walk in (process) MM land is actually
protected by some kind of lock that blocks zap_page_range() from
removing the page table levels themselves.


I'm not an expert in the Intel or AMD code, so I can only speak with 
confidence about what we do in io-pgtable-arm, but the main reason for 
not freeing pagetables is that it's simply not worth the bother of 
trying to work out whether a whole sub-tree is empty. Not to mention 
whether it's *still* empty by the time that we may have figured out that 
it was.


There are only 3 instances where we'll free a table while the domain is 
live. The first is the one legitimate race condition, where two map 
requests targeting relatively nearby PTEs both go to fill in an 
intermediate level of table; whoever loses that race frees the table 
they allocated, but it was never visible to anyone else so that's 
definitely fine. The second is if we're mapping a block entry, and find 
that there's already a table entry there, wherein we assume the table 
must be empty, clear the entry, invalidate any walk caches, install the 
block entry, then free the orphaned table; since we're mapping the 
entire IOVA range covered by that table, there should be no other 
operations on that IOVA range attempting to walk the table at the same 
time, so it's fine. The third is effectively the inverse, if we get a 
block-sized unmap but find a table entry rather than a block at that 
point (on the assumption that it's de-facto allowed for a single unmap 
to cover multiple adjacent mappings as long as it does so exactly); 
similarly we assume that the table must be full, and no other operations 
should be racing because we're unmapping its whole IOVA range, so we 
remove the table entry, invalidate, and free as before.


Again for efficiency reasons we don't attempt to validate those 
assumptions by inspecting the freed tables, so odd behaviour can fall 
out if the caller *does* do something bogus. For example if two calls 
race to map a block and a page in the same (unmapped) region, the block 
mapping will always succeed (and be what ends up in the final pagetable 
state), but the page mapping may or may not report failure depending on 
the exact timing.


Although we don't have debug dumping for io-pgtable-arm, it's good to be 
thinking about this, since it's made me realise that dirty-tracking 
sweeps per that proposal might pose a similar kind of concern, so we 
might still need to harden these corners for the sake of that. Which 
also reminds me that somewhere I have some half-finished patches making 
io-pgtable-arm use the iommu_iotlb_gather freelist, so maybe I'll tackle 
both concerns at once (perhaps we might even be able to RCU-ify the 
freelist generically? I'll see how it goes when I get there).


Cheers,
Robin.


Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified

2022-05-31 Thread Rob Clark
On Tue, May 31, 2022 at 9:19 AM Will Deacon  wrote:
>
> On Tue, May 31, 2022 at 09:15:22AM -0700, Rob Clark wrote:
> > On Tue, May 31, 2022 at 8:46 AM Will Deacon  wrote:
> > >
> > > On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote:
> > > > From: AngeloGioacchino Del Regno 
> > > > 
> > > >
> > > > As specified in this driver, the context banks are 0x1000 apart.
> > > > Problem is that sometimes the context number (our asid) does not
> > > > match this logic and we end up using the wrong one: this starts
> > > > being a problem in the case that we need to send TZ commands
> > > > to do anything on a specific context.
> > >
> > > I don't understand this. The ASID is a software construct, so it shouldn't
> > > matter what we use. If it does matter, then please can you explain why? 
> > > The fact that the context banks are 0x1000 apart seems unrelated.
> >
> > I think the connection is that mapping from ctx bank to ASID is 1:1
>
> But in what sense? How is the ASID used beyond a tag in the TLB? The commit
> message hints at "TZ commands" being a problem.
>
> I'm not doubting that this is needed to make the thing work, I just don't
> understand why.

(disclaimer, it has been quite a while since I've looked at the smmu
setup with earlier tz, ie. things that use qcom_iommu, but from
memory...)

We cannot actually assign the context banks ourselves, so in the dt
bindings the "ASID" is actually the context bank index.  I don't
remember exactly if this was a limitation of the tz interface, or a
result of not being able to program the smmu's global registers
ourselves.

BR,
-R


Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Jacob Pan
Hi Jason,

On Tue, 31 May 2022 16:05:50 -0300, Jason Gunthorpe  wrote:

> On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote:
> 
> > The reason why I store PASID at IOMMU domain is for IOTLB flush within
> > the domain. Device driver is not aware of domain level IOTLB flush. We
> > also have iova_cookie for each domain which essentially is for
> > RIDPASID.  
> 
> You need to make the PASID stuff work generically.
> 
> The domain needs to hold a list of all the places it needs to flush
> and that list needs to be maintained during attach/detach.
> 
> A single PASID on the domain is obviously never going to work
> generically.
> 
I agree. The way I did it, it was really meant to be part of the
iommu_domain's iommu_dma_cookie, not meant to be global. But for lack of
common storage between the identity domain and the dma domain, I put it
here as a global.

Then should we also extract RIDPASID to become part of the generic API?
i.e. set pasid, flush IOTLB etc. Right? RIDPASID is not in group's
pasid_array today.

Thanks,

Jacob


Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics

2022-05-31 Thread Tony Battersby
On 5/31/22 15:48, Robin Murphy wrote:
> On 2022-05-31 19:17, Tony Battersby wrote:
>
>>   pool->name, blocks,
>> - (size_t) pages *
>> - (pool->allocation / pool->size),
>> + (size_t) pages * pool->blks_per_alloc,
>>   pool->size, pages);
>>  size -= temp;
>>  next += temp;
>> @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, 
>> struct device *dev,
>>  retval->size = size;
>>  retval->boundary = boundary;
>>  retval->allocation = allocation;
>> +retval->blks_per_alloc =
>> +(allocation / boundary) * (boundary / size) +
>> +(allocation % boundary) / size;
> Do we really need to store this? Sure, 4 divisions (which could possibly 
> be fewer given the constraints on boundary) isn't the absolute cheapest 
> calculation, but I still can't imagine anyone would be polling sysfs 
> stats hard enough to even notice.
>
The stored value is also used in patch #5, in more performance-critical
code, although only when debug is enabled.

Tony



Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics

2022-05-31 Thread Robin Murphy

On 2022-05-31 19:17, Tony Battersby wrote:

The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect.  Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.

This depends on the patch "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.

Signed-off-by: Tony Battersby 
---
  mm/dmapool.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 782143144a32..9e30f4425dea 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -47,6 +47,7 @@ struct dma_pool { /* the pool */
struct device *dev;
unsigned int allocation;
unsigned int boundary;
+   unsigned int blks_per_alloc;
char name[32];
struct list_head pools;
  };
@@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
/* per-pool info, no real statistics yet */
temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",


Nit: if we're tinkering with this, it's probably worth updating the 
whole function to use sysfs_emit{_at}().



 pool->name, blocks,
-(size_t) pages *
-(pool->allocation / pool->size),
+(size_t) pages * pool->blks_per_alloc,
 pool->size, pages);
size -= temp;
next += temp;
@@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
+   retval->blks_per_alloc =
+   (allocation / boundary) * (boundary / size) +
+   (allocation % boundary) / size;


Do we really need to store this? Sure, 4 divisions (which could possibly 
be fewer given the constraints on boundary) isn't the absolute cheapest 
calculation, but I still can't imagine anyone would be polling sysfs 
stats hard enough to even notice.


Thanks,
Robin.

  
  	INIT_LIST_HEAD(&retval->pools);
  



Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy

2022-05-31 Thread Robin Murphy

On 2022-05-31 19:22, Tony Battersby wrote:

Remove a small amount of code duplication between dma_pool_destroy() and
pool_free_page() in preparation for adding more code without having to
duplicate it.  No functional changes.

Signed-off-by: Tony Battersby 
---
  mm/dmapool.c | 34 --
  1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 8749a9d7927e..58c11dcaa4e4 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page)
return page->in_use != 0;
  }
  
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)

+static void pool_free_page(struct dma_pool *pool,
+  struct dma_page *page,
+  bool destroying_pool)
  {
+   void *vaddr = page->vaddr;
dma_addr_t dma = page->dma;
  
+	if (destroying_pool && is_page_busy(page)) {

+   dev_err(pool->dev,
+   "dma_pool_destroy %s, %p busy\n",
+   pool->name, vaddr);
+   /* leak the still-in-use consistent memory */
+   } else {
  #ifdefDMAPOOL_DEBUG
-   memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+   memset(vaddr, POOL_POISON_FREED, pool->allocation);
  #endif
-   dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
+   dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+   }
+
list_del(&page->page_list);


If we're tearing down the whole pool, surely we can skip this as well? 
(Same for the second list in patch #9)


In fact I think it might make more sense to refactor in the opposite 
direction and just streamline this directly into dma_pool_destroy(), 
more like:


	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
		if (is_page_busy(page)) {
			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
				pool->name, page->vaddr);
		} else {
			dma_free_coherent(pool->dev, pool->allocation,
					  page->vaddr, page->dma);
		}
		kfree(page);
	}


kfree(page);
  }
@@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct 
dma_page *page)
   */
  void dma_pool_destroy(struct dma_pool *pool)
  {
-   struct dma_page *page, *tmp;
+   struct dma_page *page;


Nit: you bring this back again in patch #10, so we may as well leave the 
list_for_each_entry_safe() iterator in place until then as well, and 
save a bit of churn in this patch.



bool empty = false;
  
  	if (unlikely(!pool))

@@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool)
device_remove_file(pool->dev, &dev_attr_pools);
mutex_unlock(&pools_reg_lock);
  
-	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {

-   if (is_page_busy(page)) {
-   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-   pool->name, page->vaddr);
-   /* leak the still-in-use consistent memory */
-   list_del(&page->page_list);
-   kfree(page);
-   } else
-   pool_free_page(pool, page);
+   while ((page = list_first_entry_or_null(&pool->page_list,
+   struct dma_page,
+   page_list))) {
+   pool_free_page(pool, page, true);
}
  
  	kfree(pool);

@@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
page->offset = offset;
/*
 * Resist a temptation to do
-*if (!is_page_busy(page)) pool_free_page(pool, page);
+*if (!is_page_busy(page)) pool_free_page(pool, page, false);


Further to the above, even if we did retain a separate function, if an 
argument is hard-coded at the one single callsite, and the only 
reference to passing any other value is in a comment effectively saying 
"don't do this", do we really need to pretend it's an argument at all? ;)


FWIW I'd just reword the comment in more general terms, e.g. "Resist the 
temptation to free unused pages immediately..."


Thanks,
Robin.


 * Better have a few empty pages hang around.
 */
spin_unlock_irqrestore(&pool->lock, flags);



Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote:

> The reason why I store PASID at IOMMU domain is for IOTLB flush within the
> domain. Device driver is not aware of domain level IOTLB flush. We also
> have iova_cookie for each domain which essentially is for RIDPASID.

You need to make the PASID stuff work generically.

The domain needs to hold a list of all the places it needs to flush
and that list needs to be maintained during attach/detach.

A single PASID on the domain is obviously never going to work
generically.

Jason


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:

> > And we expect the iommu driver to be unable to free page table levels
> > that have IOVA boundaries in them?
> 
> I'm not entirely sure what you mean there, but in general an unmap request
> is expected to match some previous map request 

atomic cmpxchg is OK for inserting new page table levels but it can't
protect you against concurrent freeing of page table levels. So
without locks it means that page tables can't usually be freed. Which
seems to match what the Intel driver does - at least from a cursory
look.

This is one of the reasons the mm has the mmap/etc lock and spinlocks
because we do expect page table levels to get wiped out when VMA's are
zap'd - all the different locks provide the protection against page
tables disappearing out from under something manipulating them.

Basically every "lockless" walk in (process) MM land is actually
protected by some kind of lock that blocks zap_page_range() from
removing the page table levels themselves.

> They might either unmap the entire region originally mapped, or just
> the requested part, or might fail entirely (IIRC there was some
> nasty code in VFIO for detecting a particular behaviour).

This is something I did differently in iommufd. It always generates
unmaps that are strict supersets of the maps it issued. So it would be
a kernel bug if the driver unmaps more or less than requested.

> Oh, I've spent the last couple of weeks hacking up horrible things
> manipulating entries in init_mm, and never realised that that was actually
> the special case. Oh well, live and learn.

The init_mm is sort of different, it doesn't have zap in quite the
same way, for example. I was talking about the typical process mm.

Anyhow, the right solution is to use RCU as I described before, Baolu
do you want to try?

Jason


[PATCH 02/10] dmapool: cleanup integer types

2022-05-31 Thread Tony Battersby
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places.  Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.

Signed-off-by: Tony Battersby 
---

This puts an upper bound on 'size' of INT_MAX to avoid overflowing the
following comparison in pool_initialise_page():

unsigned int offset = 0;
unsigned int next = offset + pool->size;
if (unlikely((next + pool->size) > ...

'boundary' is passed in as a size_t but gets stored as an unsigned int.
'boundary' values >= 'allocation' do not have any effect, so clipping
'boundary' to 'allocation' keeps it within the range of unsigned int
without affecting anything else.  A few lines above (not in the diff)
you can see that if 'boundary' is passed in as 0 then it is set to
'allocation', so it is nothing new.  For reference, here is the
relevant code after being patched:

if (!boundary)
boundary = allocation;
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;

boundary = min(boundary, allocation);

 mm/dmapool.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 0f89de408cbe..d7b372248111 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -43,10 +43,10 @@
 struct dma_pool {  /* the pool */
struct list_head page_list;
spinlock_t lock;
-   size_t size;
+   unsigned int size;
struct device *dev;
-   size_t allocation;
-   size_t boundary;
+   unsigned int allocation;
+   unsigned int boundary;
char name[32];
struct list_head pools;
 };
@@ -80,7 +80,7 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
mutex_lock(&pools_lock);
list_for_each_entry(pool, &dev->dma_pools, pools) {
unsigned pages = 0;
-   unsigned blocks = 0;
+   size_t blocks = 0;
 
spin_lock_irq(&pool->lock);
list_for_each_entry(page, &pool->page_list, page_list) {
@@ -90,9 +90,10 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
spin_unlock_irq(&pool->lock);
 
/* per-pool info, no real statistics yet */
-   temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n",
+   temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 pool->name, blocks,
-pages * (pool->allocation / pool->size),
+(size_t) pages *
+(pool->allocation / pool->size),
 pool->size, pages);
size -= temp;
next += temp;
@@ -139,7 +140,7 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
else if (align & (align - 1))
return NULL;
 
-   if (size == 0)
+   if (size == 0 || size > INT_MAX)
return NULL;
else if (size < 4)
size = 4;
@@ -152,6 +153,8 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;
 
+   boundary = min(boundary, allocation);
+
retval = kmalloc(sizeof(*retval), GFP_KERNEL);
if (!retval)
return retval;
@@ -312,7 +315,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 {
unsigned long flags;
struct dma_page *page;
-   size_t offset;
+   unsigned int offset;
void *retval;
 
might_alloc(mem_flags);
-- 
2.25.1



[PATCH 00/10] mpt3sas and dmapool scalability

2022-05-31 Thread Tony Battersby
This patch series improves dmapool scalability by replacing linear scans
with red-black trees.

History:

In 2018 this patch series made it through 4 versions.  v1 used red-black
trees; v2 - v4 put the dma pool info directly into struct page and used
virt_to_page() to get at it.  v4 made a brief appearance in linux-next,
but it caused problems on non-x86 archs where virt_to_page() doesn't
work with dma_alloc_coherent, so it was reverted.  I was too busy at the
time to repost the red-black tree version, and I forgot about it until
now.  This version is based on the red-black trees of v1, but addressing
all the review comments I got at the time and with additional cleanup
patches.

Note that Keith Busch is also working on improving dmapool scalability,
so for now I would recommend not merging my scalability patches until
Keith's approach can be evaluated.  In the meantime, my patches can
serve as a benchmark comparison.  I also have a number of cleanup
patches in my series that could be useful on their own.

References:

v1
https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e91...@cybernetics.com/

v2
https://lore.kernel.org/linux-mm/ec701153-fdc9-37f3-c267-f056159b4...@cybernetics.com/

v3
https://lore.kernel.org/linux-mm/d48854ff-995d-228e-8356-54c141c32...@cybernetics.com/

v4
https://lore.kernel.org/linux-mm/88395080-efc1-4e7b-f813-bb90c86d0...@cybernetics.com/

problem caused by virt_to_page()
https://lore.kernel.org/linux-kernel/20181206013054.gi6...@atomide.com/

Keith Busch's dmapool performance enhancements
https://lore.kernel.org/linux-mm/20220428202714.17630-1-kbu...@kernel.org/

Below is my original description of the motivation for these patches.

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation.  With a LSI/Broadcom SAS 9300-8i
12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the
equivalent of:

chain_dma_pool = dma_pool_create(size = 128);
for (i = 0; i < 373959; i++)
{
dma_addr[i] = dma_pool_alloc(chain_dma_pool);
}

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

for (i = 0; i < 373959; i++)
{
dma_pool_free(chain_dma_pool, dma_addr[i]);
}
dma_pool_destroy(chain_dma_pool);

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n^2) complexity, although dma_pool_free() is much worse due to
implementation details.  On my system, the dma_pool_free() loop above
takes about 9 seconds to run.  Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take ~30
seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:

cat /sys/devices/pci:80/:80:07.0/:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool  1  21 192 1
reply_free pool 1  1  41728   1
reply pool  1  1  1335296 1
sense pool  1  1  970272  1
chain pool  373959 386048 128 12064
reply_post_free pool12 12 166528  12

The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops.  With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster.  Here are some benchmarks (of DMA alloc/free
only, not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:350 ms ( 1x)
  dmapool patches:  18 ms (19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:8901 ms (   1x)
  dmapool patches:   19 ms ( 477x)




[PATCH 01/10] dmapool: remove checks for dev == NULL

2022-05-31 Thread Tony Battersby
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device.  But nobody
ended up using dma pools without a device, so the current checks in
dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
Remove them.

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 42 +++---
 1 file changed, 11 insertions(+), 31 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index a7eb5d0eb2da..0f89de408cbe 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool)
mutex_lock(&pools_reg_lock);
mutex_lock(&pools_lock);
list_del(&pool->pools);
-   if (pool->dev && list_empty(&pool->dev->dma_pools))
+   if (list_empty(&pool->dev->dma_pools))
empty = true;
mutex_unlock(&pools_lock);
if (empty)
@@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool)
 
list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
if (is_page_busy(page)) {
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-   pool->name, page->vaddr);
-   else
-   pr_err("%s %s, %p busy\n", __func__,
-  pool->name, page->vaddr);
+   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
+   pool->name, page->vaddr);
/* leak the still-in-use consistent memory */
list_del(&page->page_list);
kfree(page);
@@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t 
mem_flags,
for (i = sizeof(page->offset); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p (corrupted)\n",
-   __func__, pool->name, retval);
-   else
-   pr_err("%s %s, %p (corrupted)\n",
-   __func__, pool->name, retval);
+   dev_err(pool->dev, "%s %s, %p (corrupted)\n",
+   __func__, pool->name, retval);
 
/*
 * Dump the first 4 bytes even if they are not
@@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
page = pool_find_page(pool, dma);
if (!page) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
-   __func__, pool->name, vaddr, &dma);
-   else
-   pr_err("%s %s, %p/%pad (bad dma)\n",
-  __func__, pool->name, vaddr, &dma);
+   dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
+   __func__, pool->name, vaddr, &dma);
return;
}
 
@@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
 #ifdef DMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
-   __func__, pool->name, vaddr, &dma);
-   else
-   pr_err("%s %s, %p (bad vaddr)/%pad\n",
-  __func__, pool->name, vaddr, &dma);
+   dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
+   __func__, pool->name, vaddr, &dma);
return;
}
{
@@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
continue;
}
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, dma %pad already 
free\n",
-   __func__, pool->name, &dma);
-   else
-   pr_err("%s %s, dma %pad already free\n",
-  __func__, pool->name, &dma);
+   dev_err(pool->dev, "%s %s, dma %pad already free\n",
+   __func__, pool->name, &dma);
return;
}
}
-- 
2.25.1



[PATCH 03/10] dmapool: fix boundary comparison

2022-05-31 Thread Tony Battersby
Fix the boundary comparison when constructing the list of free blocks
for the case that 'size' is a power of two.  Since 'boundary' is also a
power of two, that would make 'boundary' a multiple of 'size', in which
case a single block would never cross the boundary.  This bug would
cause some of the allocated memory to be wasted (but not leaked).

Example:

size   = 512
boundary   = 2048
allocation = 4096

Address range
   0 -  511
 512 - 1023
1024 - 1535
1536 - 2047 *
2048 - 2559
2560 - 3071
3072 - 3583
3584 - 4095 *

Prior to this fix, the address ranges marked with "*" would not have
been used even though they didn't cross the given boundary.
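
Tracing the free-list construction at the point where the old comparison
goes wrong, with the numbers above:

	At offset = 1024: next = 1536, next + size = 2048
	  old test: 2048 >= 2048 is true  -> the free list jumps straight to
	            the 2048 boundary and the block at 1536 is never linked
	            in (wasted)
	  new test: 2048 >  2048 is false -> the block at 1536 stays on the
	            free list; it spans 1536 - 2047 and ends exactly at the
	            boundary without crossing it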

Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index d7b372248111..782143144a32 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -210,7 +210,7 @@ static void pool_initialise_page(struct dma_pool *pool, 
struct dma_page *page)
 
do {
unsigned int next = offset + pool->size;
-   if (unlikely((next + pool->size) >= next_boundary)) {
+   if (unlikely((next + pool->size) > next_boundary)) {
next = next_boundary;
next_boundary += pool->boundary;
}
-- 
2.25.1



Re: [PATCH 01/10] dmapool: remove checks for dev == NULL

2022-05-31 Thread Robin Murphy

On 2022-05-31 19:12, Tony Battersby wrote:

dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device.  But nobody
ended up using dma pools without a device, so the current checks in
dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
Remove them.


Furthermore, dma_pool_create() already dereferences the dev argument 
unconditionally, so there's no way we could possibly get as far as these 
checks even if a caller did ever pass NULL.


Reviewed-by: Robin Murphy 


Signed-off-by: Tony Battersby 
---
  mm/dmapool.c | 42 +++---
  1 file changed, 11 insertions(+), 31 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index a7eb5d0eb2da..0f89de408cbe 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool)
mutex_lock(&pools_reg_lock);
mutex_lock(&pools_lock);
list_del(&pool->pools);
-   if (pool->dev && list_empty(&pool->dev->dma_pools))
+   if (list_empty(&pool->dev->dma_pools))
empty = true;
mutex_unlock(&pools_lock);
if (empty)
@@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool)
  
  	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {

if (is_page_busy(page)) {
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-   pool->name, page->vaddr);
-   else
-   pr_err("%s %s, %p busy\n", __func__,
-  pool->name, page->vaddr);
+   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
+   pool->name, page->vaddr);
/* leak the still-in-use consistent memory */
list_del(&page->page_list);
kfree(page);
@@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t 
mem_flags,
for (i = sizeof(page->offset); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p (corrupted)\n",
-   __func__, pool->name, retval);
-   else
-   pr_err("%s %s, %p (corrupted)\n",
-   __func__, pool->name, retval);
+   dev_err(pool->dev, "%s %s, %p (corrupted)\n",
+   __func__, pool->name, retval);
  
  			/*

 * Dump the first 4 bytes even if they are not
@@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
page = pool_find_page(pool, dma);
if (!page) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
-   __func__, pool->name, vaddr, &dma);
-   else
-   pr_err("%s %s, %p/%pad (bad dma)\n",
-  __func__, pool->name, vaddr, &dma);
+   dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
+   __func__, pool->name, vaddr, &dma);
return;
}
  
@@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)

  #ifdefDMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
-   __func__, pool->name, vaddr, &dma);
-   else
-   pr_err("%s %s, %p (bad vaddr)/%pad\n",
-  __func__, pool->name, vaddr, &dma);
+   dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
+   __func__, pool->name, vaddr, &dma);
return;
}
{
@@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
continue;
}
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "%s %s, dma %pad already 
free\n",
-   __func__, pool->name, &dma);
-   else
-   pr_err("%s %s, dma %pad already free\n",
-  __func__, pool->name, &dma);
+   dev_err(pool->dev, "%s %s, dma %pad already free\n",
+   __func__, pool->name, &dma);

[PATCH 10/10] dmapool: improve scalability of dma_pool_free

2022-05-31 Thread Tony Battersby
dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages.  Improve its
scalability by replacing the linear scan with a red-black tree lookup.
In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
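
Roughly, the insertion side follows the usual kernel rb-tree pattern
(sketch only, not necessarily the exact patch code):

        static void pool_insert_page_sketch(struct dma_pool *pool,
                                            struct dma_page *new_page)
        {
                struct rb_node **node = &pool->page_tree.rb_node, *parent = NULL;
                dma_addr_t dma = new_page->dma;

                while (*node) {
                        struct dma_page *this_page = rb_to_dma_page(*node);

                        parent = *node;
                        if (dma < this_page->dma)
                                node = &(*node)->rb_left;
                        else
                                node = &(*node)->rb_right;
                }
                rb_link_node(&new_page->page_node, parent, node);
                rb_insert_color(&new_page->page_node, &pool->page_tree);
        }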

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 128 ---
 1 file changed, 100 insertions(+), 28 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index b3dd2ace0d2a..24535483f781 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -12,11 +12,12 @@
  * Many older drivers still have their own code to do this.
  *
  * The current design of this allocator is fairly simple.  The pool is
- * represented by the 'struct dma_pool' which keeps a doubly-linked list of
- * allocated pages.  Each page in the page_list is split into blocks of at
- * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
- * list of free blocks within the page.  Used blocks aren't tracked, but we
- * keep a count of how many are currently allocated from each page.
+ * represented by the 'struct dma_pool' which keeps a red-black tree of all
+ * allocated pages, keyed by DMA address for fast lookup when freeing.
+ * Each page in the page_tree is split into blocks of at least 'size' bytes.
+ * Free blocks are tracked in an unsorted singly-linked list of free blocks
+ * within the page.  Used blocks aren't tracked, but we keep a count of how
+ * many are currently allocated from each page.
  *
  * The avail_page_list keeps track of pages that have one or more free blocks
  * available to (re)allocate.  Pages are moved in and out of avail_page_list
@@ -36,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -45,7 +47,7 @@
 #endif
 
 struct dma_pool {  /* the pool */
-   struct list_head page_list;
+   struct rb_root page_tree;
struct list_head avail_page_list;
spinlock_t lock;
unsigned int size;
@@ -58,7 +60,7 @@ struct dma_pool { /* the pool */
 };
 
 struct dma_page {  /* cacheable header for 'allocation' bytes */
-   struct list_head page_list;
+   struct rb_node page_node;
struct list_head avail_page_link;
void *vaddr;
dma_addr_t dma;
@@ -69,6 +71,11 @@ struct dma_page {/* cacheable header for 
'allocation' bytes */
 static DEFINE_MUTEX(pools_lock);
 static DEFINE_MUTEX(pools_reg_lock);
 
+static inline struct dma_page *rb_to_dma_page(struct rb_node *node)
+{
+   return rb_entry(node, struct dma_page, page_node);
+}
+
 static ssize_t pools_show(struct device *dev, struct device_attribute *attr, 
char *buf)
 {
unsigned temp;
@@ -76,6 +83,7 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
char *next;
struct dma_page *page;
struct dma_pool *pool;
+   struct rb_node *node;
 
next = buf;
size = PAGE_SIZE;
@@ -90,7 +98,10 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
size_t blocks = 0;
 
spin_lock_irq(&pool->lock);
-   list_for_each_entry(page, &pool->page_list, page_list) {
+   for (node = rb_first(&pool->page_tree);
+node;
+node = rb_next(node)) {
+   page = rb_to_dma_page(node);
pages++;
blocks += page->in_use;
}
@@ -169,7 +180,7 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
 
retval->dev = dev;
 
-   INIT_LIST_HEAD(&retval->page_list);
+   retval->page_tree = RB_ROOT;
INIT_LIST_HEAD(&retval->avail_page_list);
spin_lock_init(&retval->lock);
retval->size = size;
@@ -213,6 +224,63 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
 }
 EXPORT_SYMBOL(dma_pool_create);
 
+/*
+ * Find the dma_page that manages the given DMA address.
+ */
+static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
+{
+   struct rb_node *node = pool->page_tree.rb_node;
+
+   while (node) {
+   struct dma_page *page = rb_to_dma_page(node);
+
+   if (dma < page->dma)
+   node = node->rb_left;
+   else if ((dma - page->dma) >= pool->allocation)
+   node = node->rb_right;
+   else
+   return page;
+   }
+   return NULL;
+}
+
+/*
+ * Insert a dma_page into the page_tree.
+ */
+static int pool_insert_page(struct dma_pool *pool, struct dma_page *new_page)
+{
+   dma_addr_t dma = new_page->dma;
+   struct rb_node **node = &(pool->page_tree.rb_node), *parent = NULL;
+
+   while (*node) {
+   struct dma_page *this_page = rb_to_dma_page(*node);
+
+   parent = *node;
+   if (dma < this_page-

[PATCH 09/10] dmapool: improve scalability of dma_pool_alloc

2022-05-31 Thread Tony Battersby
dma_pool_alloc() scales poorly when allocating a large number of pages
because it does a linear scan of all previously-allocated pages before
allocating a new one.  Improve its scalability by maintaining a separate
list of pages that have free blocks ready to (re)allocate.  In big O
notation, this improves the algorithm from O(n^2) to O(n).

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 58c11dcaa4e4..b3dd2ace0d2a 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -17,6 +17,10 @@
  * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
  * list of free blocks within the page.  Used blocks aren't tracked, but we
  * keep a count of how many are currently allocated from each page.
+ *
+ * The avail_page_list keeps track of pages that have one or more free blocks
+ * available to (re)allocate.  Pages are moved in and out of avail_page_list
+ * as their blocks are allocated and freed.
  */
 
 #include 
@@ -42,6 +46,7 @@
 
 struct dma_pool {  /* the pool */
struct list_head page_list;
+   struct list_head avail_page_list;
spinlock_t lock;
unsigned int size;
struct device *dev;
@@ -54,6 +59,7 @@ struct dma_pool { /* the pool */
 
 struct dma_page {  /* cacheable header for 'allocation' bytes */
struct list_head page_list;
+   struct list_head avail_page_link;
void *vaddr;
dma_addr_t dma;
unsigned int in_use;
@@ -164,6 +170,7 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
retval->dev = dev;
 
INIT_LIST_HEAD(&retval->page_list);
+   INIT_LIST_HEAD(&retval->avail_page_list);
spin_lock_init(&retval->lock);
retval->size = size;
retval->boundary = boundary;
@@ -270,6 +277,7 @@ static void pool_free_page(struct dma_pool *pool,
}
 
list_del(&page->page_list);
+   list_del(&page->avail_page_link);
kfree(page);
 }
 
@@ -330,10 +338,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t 
mem_flags,
might_alloc(mem_flags);
 
spin_lock_irqsave(&pool->lock, flags);
-   list_for_each_entry(page, &pool->page_list, page_list) {
-   if (page->offset < pool->allocation)
-   goto ready;
-   }
+   page = list_first_entry_or_null(&pool->avail_page_list,
+   struct dma_page,
+   avail_page_link);
+   if (page)
+   goto ready;
 
/* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
spin_unlock_irqrestore(&pool->lock, flags);
@@ -345,10 +354,13 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t 
mem_flags,
spin_lock_irqsave(&pool->lock, flags);
 
list_add(&page->page_list, &pool->page_list);
+   list_add(&page->avail_page_link, &pool->avail_page_list);
  ready:
page->in_use++;
offset = page->offset;
page->offset = *(int *)(page->vaddr + offset);
+   if (page->offset >= pool->allocation)
+   list_del_init(&page->avail_page_link);
retval = offset + page->vaddr;
*handle = offset + page->dma;
 #ifdef DMAPOOL_DEBUG
@@ -470,6 +482,13 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
memset(vaddr, 0, pool->size);
 #endif
 
+   /*
+* list_empty() on the page tests if the page is already linked into
+* avail_page_list to avoid adding it more than once.
+*/
+   if (list_empty(&page->avail_page_link))
+   list_add(&page->avail_page_link, &pool->avail_page_list);
+
page->in_use--;
*(int *)vaddr = page->offset;
page->offset = offset;
-- 
2.25.1



[PATCH 08/10] dmapool: cleanup dma_pool_destroy

2022-05-31 Thread Tony Battersby
Remove a small amount of code duplication between dma_pool_destroy() and
pool_free_page() in preparation for adding more code without having to
duplicate it.  No functional changes.

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 34 --
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 8749a9d7927e..58c11dcaa4e4 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page)
return page->in_use != 0;
 }
 
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_free_page(struct dma_pool *pool,
+  struct dma_page *page,
+  bool destroying_pool)
 {
+   void *vaddr = page->vaddr;
dma_addr_t dma = page->dma;
 
+   if (destroying_pool && is_page_busy(page)) {
+   dev_err(pool->dev,
+   "dma_pool_destroy %s, %p busy\n",
+   pool->name, vaddr);
+   /* leak the still-in-use consistent memory */
+   } else {
 #ifdef DMAPOOL_DEBUG
-   memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+   memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-   dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
+   dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+   }
+
list_del(&page->page_list);
kfree(page);
 }
@@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct 
dma_page *page)
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-   struct dma_page *page, *tmp;
+   struct dma_page *page;
bool empty = false;
 
if (unlikely(!pool))
@@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool)
device_remove_file(pool->dev, &dev_attr_pools);
mutex_unlock(&pools_reg_lock);
 
-   list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
-   if (is_page_busy(page)) {
-   dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-   pool->name, page->vaddr);
-   /* leak the still-in-use consistent memory */
-   list_del(&page->page_list);
-   kfree(page);
-   } else
-   pool_free_page(pool, page);
+   while ((page = list_first_entry_or_null(&pool->page_list,
+   struct dma_page,
+   page_list))) {
+   pool_free_page(pool, page, true);
}
 
kfree(pool);
@@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
page->offset = offset;
/*
 * Resist a temptation to do
-*if (!is_page_busy(page)) pool_free_page(pool, page);
+*if (!is_page_busy(page)) pool_free_page(pool, page, false);
 * Better have a few empty pages hang around.
 */
spin_unlock_irqrestore(&pool->lock, flags);
-- 
2.25.1



[PATCH 07/10] dmapool: speedup DMAPOOL_DEBUG with init_on_alloc

2022-05-31 Thread Tony Battersby
Avoid double-memset of the same allocated memory in dma_pool_alloc()
when both DMAPOOL_DEBUG is enabled and init_on_alloc=1.

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 49019ef6dd83..8749a9d7927e 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -365,7 +365,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
break;
}
}
-   if (!(mem_flags & __GFP_ZERO))
+   if (!want_init_on_alloc(mem_flags))
memset(retval, POOL_POISON_ALLOCATED, pool->size);
 #endif
spin_unlock_irqrestore(&pool->lock, flags);
-- 
2.25.1



[PATCH 06/10] dmapool: ignore init_on_free when DMAPOOL_DEBUG enabled

2022-05-31 Thread Tony Battersby
There are two cases:

1) In the normal case that the memory is being freed correctly, then
DMAPOOL_DEBUG will memset the memory anyway, so speed things up by
avoiding a double-memset of the same memory.

2) In the abnormal case that DMAPOOL_DEBUG detects that a driver passes
incorrect parameters to dma_pool_free() (e.g. double-free, invalid
free, mismatched vaddr/dma, etc.), then that is a kernel bug, and we
don't want to clear the passed-in possibly-invalid memory pointer
because we can't be sure that the memory is really free.  So don't clear
it just because init_on_free=1.

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 7a9161d4f7a6..49019ef6dd83 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -415,8 +415,6 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
}
 
offset = vaddr - page->vaddr;
-   if (want_init_on_free())
-   memset(vaddr, 0, pool->size);
 #ifdef DMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
spin_unlock_irqrestore(&pool->lock, flags);
@@ -461,6 +459,9 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
goto freelist_corrupt;
}
memset(vaddr, POOL_POISON_FREED, pool->size);
+#else
+   if (want_init_on_free())
+   memset(vaddr, 0, pool->size);
 #endif
 
page->in_use--;
-- 
2.25.1



[PATCH 05/10] dmapool: debug: prevent endless loop in case of corruption

2022-05-31 Thread Tony Battersby
Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
driver corrupts DMA pool memory.
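
The detection below relies on the invariant that, for an uncorrupted page,

    (number of blocks on the free chain) + page->in_use == pool->blks_per_alloc

so the walk can bail out as soon as it has counted more free blocks than
that, even if corruption has turned the chain into a cycle.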

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 9e30f4425dea..7a9161d4f7a6 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -426,16 +426,39 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, 
dma_addr_t dma)
}
{
unsigned int chain = page->offset;
+   unsigned int free_blks = 0;
+
while (chain < pool->allocation) {
-   if (chain != offset) {
-   chain = *(int *)(page->vaddr + chain);
-   continue;
+   if (unlikely(chain == offset)) {
+   spin_unlock_irqrestore(&pool->lock, flags);
+   dev_err(pool->dev,
+   "%s %s, dma %pad already free\n",
+   __func__, pool->name, &dma);
+   return;
}
-   spin_unlock_irqrestore(&pool->lock, flags);
-   dev_err(pool->dev, "%s %s, dma %pad already free\n",
-   __func__, pool->name, &dma);
-   return;
+
+   /*
+* A buggy driver could corrupt the freelist by
+* use-after-free, buffer overflow, etc.  Besides
+* checking for corruption, this also prevents an
+* endless loop in case corruption causes a circular
+* loop in the freelist.
+*/
+   if (unlikely(++free_blks + page->in_use >
+pool->blks_per_alloc)) {
+ freelist_corrupt:
+   spin_unlock_irqrestore(&pool->lock, flags);
+   dev_err(pool->dev,
+   "%s %s, freelist corrupted\n",
+   __func__, pool->name);
+   return;
+   }
+
+   chain = *(int *)(page->vaddr + chain);
}
+   if (unlikely(free_blks + page->in_use !=
+pool->blks_per_alloc))
+   goto freelist_corrupt;
}
memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
-- 
2.25.1



[PATCH 04/10] dmapool: improve accuracy of debug statistics

2022-05-31 Thread Tony Battersby
The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect.  Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.

This depends on the patch "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.
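
As an illustration (with made-up values, not taken from the patch): for
size = 1536, boundary = 4096 and allocation = 8192, the old statistic
reports allocation / size = 5 blocks per allocation, while the
boundary-aware calculation gives
(8192 / 4096) * (4096 / 1536) + (8192 % 4096) / 1536 = 2 * 2 + 0 = 4,
which matches what the pool can actually hand out.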

Signed-off-by: Tony Battersby 
---
 mm/dmapool.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 782143144a32..9e30f4425dea 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -47,6 +47,7 @@ struct dma_pool { /* the pool */
struct device *dev;
unsigned int allocation;
unsigned int boundary;
+   unsigned int blks_per_alloc;
char name[32];
struct list_head pools;
 };
@@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct 
device_attribute *attr, cha
/* per-pool info, no real statistics yet */
temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 pool->name, blocks,
-(size_t) pages *
-(pool->allocation / pool->size),
+(size_t) pages * pool->blks_per_alloc,
 pool->size, pages);
size -= temp;
next += temp;
@@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct 
device *dev,
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
+   retval->blks_per_alloc =
+   (allocation / boundary) * (boundary / size) +
+   (allocation % boundary) / size;
 
INIT_LIST_HEAD(&retval->pools);
 
-- 
2.25.1



Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 17:21, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:


The DMA API doesn't need locking, partly since it can trust itself not to do
stupid things, and mostly because it's DMA API performance that's
fundamentally incompatible with serialisation anyway. Why do you think we
have a complicated per-CPU IOVA caching mechanism, if not to support big
multi-queue devices with multiple CPU threads mapping/unmapping in different
parts of the same DMA domain concurrently?


Well, per-CPU is a form of locking.


Right, but that only applies for alloc_iova_fast() itself - once the 
CPUs have each got their distinct IOVA region, they can then pile into 
iommu_map() completely unhindered (and the inverse for the unmap path).



So what are the actual locking rules here? We can call map/unmap
concurrently but not if ... ?

IOVA overlaps?


Well, I think the de-facto rule is that you technically *can* make 
overlapping requests, but one or both may fail, and the final outcome in 
terms of what ends up mapped in the domain is undefined (especially if 
they both succeed). The only real benefit of enforcing serialisation 
would be better failure behaviour in such cases, but it remains 
fundamentally nonsensical for callers to make contradictory requests 
anyway, whether concurrently or sequentially, so there doesn't seem much 
point in spending effort on improving support for nonsense.



And we expect the iommu driver to be unable to free page table levels
that have IOVA boundaries in them?


I'm not entirely sure what you mean there, but in general an unmap 
request is expected to match some previous map request - there isn't a 
defined API-level behaviour for partial unmaps. They might either unmap 
the entire region originally mapped, or just the requested part, or 
might fail entirely (IIRC there was some nasty code in VFIO for 
detecting a particular behaviour). Similarly for unmapping anything 
that's already not mapped, some drivers treat that as a no-op, others as 
an error. But again, this is even further unrelated to concurrency.



The simpler drivers already serialise on a per-domain lock internally, while
the more performance-focused ones implement lock-free atomic pagetable
management in a similar style to CPU arch code; either way it should work
fine as-is.


The mm has page table locks at every level and generally expects them
to be held for a lot of manipulations. There are some lockless cases,
but it is not as aggressive as this sounds.


Oh, I've spent the last couple of weeks hacking up horrible things 
manipulating entries in init_mm, and never realised that that was 
actually the special case. Oh well, live and learn.


Robin.


Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Jacob Pan
Hi Baolu,

On Tue, 31 May 2022 20:45:28 +0800, Baolu Lu 
wrote:

> On 2022/5/31 18:12, Tian, Kevin wrote:
>  +++ b/include/linux/iommu.h
>  @@ -105,6 +105,8 @@ struct iommu_domain {
>   enum iommu_page_response_code (*iopf_handler)(struct  
> >> iommu_fault *fault,  
> void *data);
>   void *fault_data;
>  +ioasid_t pasid; /* Used for DMA requests
>  with PASID */
>  +atomic_t pasid_users;  
> >>> These are poorly named, this is really the DMA API global PASID and
> >>> shouldn't be used for other things.
> >>>
> >>>
> >>>
> >>> Perhaps I misunderstood, do you mind explaining more?  
> >> You still haven't really explained what this is for in this patch,
> >> maybe it just needs a better commit message, or maybe something is
> >> wrong.
> >>
> >> I keep saying the DMA API usage is not special, so why do we need to
> >> create a new global pasid and refcount? Realistically this is only
> >> going to be used by IDXD, why can't we just allocate a PASID and
> >> return it to the driver every time a driver asks for DMA API on PASI
> >> mode? Why does the core need to do anything special?
> >>  
The reason I store the PASID in the IOMMU domain is for IOTLB flushes within
the domain. The device driver is not aware of domain-level IOTLB flushes. We
also have an iova_cookie for each domain, which essentially is for RIDPASID.

> > Agree. I guess it was a mistake caused by treating ENQCMD as the
> > only user although the actual semantics of the invented interfaces
> > have already evolved to be quite general.
> > 
> > This is very similar to what we have been discussing for iommufd.
> > a PASID is just an additional routing info when attaching a device
> > to an I/O address space (DMA API in this context) and by default
> > it should be a per-device resource except when ENQCMD is
> > explicitly opt in.
> > 
> > Hence it's right time for us to develop common facility working
> > for both this DMA API usage and iommufd, i.e.:
> > 
> > for normal PASID attach to a domain, driver:
> > 
> > allocates a local pasid from device local space;
> > attaches the local pasid to a domain;
> > 
> > for PASID attach in particular for ENQCMD, driver:
> > 
> > allocates a global pasid in system-wide;
> > attaches the global pasid to a domain;
> > set the global pasid in PASID_MSR;
> > 
> > In both cases the pasid is stored in the attach data instead of the
> > domain.
> > 
So during IOTLB flush for the domain, do we loop through the attach data?

> > DMA API pasid is no special from above except it needs to allow
> > one device attached to the same domain twice (one with RID
> > and the other with RID+PASID).
> > 
> > for iommufd those operations are initiated by userspace via
> > iommufd uAPI.  
> 
> My understanding is that device driver owns its PASID policy. If ENQCMD
> is supported on the device, the PASIDs should be allocated through
> ioasid_alloc(). Otherwise, the whole PASID pool is managed by the device
> driver.
> 
It seems the changes we want for this patchset are:
1. move ioasid_alloc() from the core to device (allocation scope will be
based on whether ENQCMD is intended or not)
2. store pasid in the attach data
3. use the same iommufd api to attach/set pasid on its default domain
Am I summarizing correctly?

> For kernel DMA w/ PASID, after the driver has a PASID for this purpose,
> it can just set the default domain to the PASID on device. There's no
> need for enable/disable() interfaces.
> 
> Best regards,
> baolu


Thanks,

Jacob


Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops

2022-05-31 Thread Oleksandr


On 31.05.22 14:52, Krzysztof Kozlowski wrote:

Hello Krzysztof


On 30/05/2022 23:00, Oleksandr Tyshchenko wrote:

From: Oleksandr Tyshchenko 

Thank you for your patch. There is something to discuss/improve.


diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml 
b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
new file mode 100644
index ..ab5765c
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xen specific IOMMU for virtualized devices (e.g. virtio)
+
+maintainers:
+  - Stefano Stabellini 
+
+description:
+  The reference to Xen specific IOMMU node using "iommus" property indicates
+  that Xen grant mappings need to be enabled for the device, and it specifies
+  the ID of the domain where the corresponding backend resides.
+  The binding is required to restrict memory access using Xen grant mappings.
+
+properties:
+  compatible:
+const: xen,grant-dma
+
+  '#iommu-cells':
+const: 1
+description:
+  Xen specific IOMMU is multiple-master IOMMU device.
+  The single cell describes the domid (domain ID) of the domain where
+  the backend is running.
+
+required:
+  - compatible
+  - "#iommu-cells"
+
+additionalProperties: false
+
+examples:
+  - |
+xen_iommu {

No underscores in node names, generic node names, so this looks like
"iommu".



ok, will change





+compatible = "xen,grant-dma";
+#iommu-cells = <1>;
+};
+
+virtio@3000 {
+compatible = "virtio,mmio";
+reg = <0x3000 0x100>;
+interrupts = <41>;
+
+/* The backend is located in Xen domain with ID 1 */
+iommus = <&xen_iommu 1>;

There is no need usually to give consumer examples in provider binding.
If there is nothing specific here (looks exactly like every IOMMU
consumer in Linux kernel), drop the consumer.



I got it. There is nothing specific from the device tree's perspective; 
I was trying to draw attention to the IOMMU specifier (in our case, 
the master device's ID == the backend's domain ID). But the '#iommu-cells' 
description above already clarifies that. Will drop.






+};


Best regards,
Krzysztof


--
Regards,

Oleksandr Tyshchenko


Re: [git pull] IOMMU Updates for Linux v5.19

2022-05-31 Thread pr-tracker-bot
The pull request you sent on Tue, 31 May 2022 14:37:42 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
> tags/iommu-updates-v5.19

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e1cbc3b96a9974746b2a80c3a6c8a0f7eff7b1b5

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 16:59, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:


+    break;
+    pgtable_walk_level(m, phys_to_virt(phys_addr),


Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
gets you a valid struct page. Whether that page is direct-mapped kernel
memory or not is a different matter.


Even though this is debugfs, if the operation is sketchy like that and
can theoretically crash the kernel, the driver should test capabilities;
CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
better cap for 'userspace may crash the kernel'


It shouldn't be insurmountable to make this safe, it just needs a bit 
more than pfn_valid(), which can still return true off the ends of the 
memory map if they're not perfectly section-aligned, and for random 
reserved holes in the middle.
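
For example, something of this shape in the walker (a sketch only - the
exact predicate needs more thought, and page_is_ram()/PageReserved() come
with their own caveats):

        unsigned long pfn = __phys_to_pfn(phys_addr);

        if (!pfn_valid(pfn) || !page_is_ram(pfn) ||
            PageReserved(pfn_to_page(pfn)))
                break;  /* don't phys_to_virt() something that may not be mapped RAM */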


Robin.

Re: [PATCH 2/6] iommu/qcom: Write TCR before TTBRs to fix ASID access behavior

2022-05-31 Thread Robin Murphy

On 2022-05-31 16:55, Will Deacon wrote:

On Fri, May 27, 2022 at 11:28:57PM +0200, Konrad Dybcio wrote:

From: AngeloGioacchino Del Regno 

As also stated in the arm-smmu driver, we must write the TCR before
writing the TTBRs, since the TCR determines the access behavior of
some fields.


Where is this stated in the arm-smmu driver?


In arm_smmu_write_context_bank() - IIRC it's mostly about the case where 
if you write a 16-bit ASID to TTBR before setting TCR2.AS you might end 
up losing the top 8 bits of it. However, in the context of a pantomime 
where we just have to pretend to program the "hardware" the way the 
firmware has already programmed it (on pain of getting randomly reset if 
we look at it wrong), I can't imagine it really matters.


Robin.


Signed-off-by: AngeloGioacchino Del Regno 

Signed-off-by: Marijn Suijten 
Signed-off-by: Konrad Dybcio 
---
  drivers/iommu/arm/arm-smmu/qcom_iommu.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 1728d4d7fe25..75f353866c40 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -273,18 +273,18 @@ static int qcom_iommu_init_domain(struct iommu_domain 
*domain,
ctx->secure_init = true;
}
  
-		/* TTBRs */

-   iommu_writeq(ctx, ARM_SMMU_CB_TTBR0,
-   pgtbl_cfg.arm_lpae_s1_cfg.ttbr |
-   FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid));
-   iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0);
-
/* TCR */
iommu_writel(ctx, ARM_SMMU_CB_TCR2,
arm_smmu_lpae_tcr2(&pgtbl_cfg));
iommu_writel(ctx, ARM_SMMU_CB_TCR,
 arm_smmu_lpae_tcr(&pgtbl_cfg) | ARM_SMMU_TCR_EAE);
  
+		/* TTBRs */

+   iommu_writeq(ctx, ARM_SMMU_CB_TTBR0,
+   pgtbl_cfg.arm_lpae_s1_cfg.ttbr |
+   FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid));
+   iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0);


I'd have thought that SCTLR.M would be clear here, so it shouldn't matter
what order we write these in.

Will


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:

> The DMA API doesn't need locking, partly since it can trust itself not to do
> stupid things, and mostly because it's DMA API performance that's
> fundamentally incompatible with serialisation anyway. Why do you think we
> have a complicated per-CPU IOVA caching mechanism, if not to support big
> multi-queue devices with multiple CPU threads mapping/unmapping in different
> parts of the same DMA domain concurrently?

Well, per-CPU is a form of locking.

So what are the actual locking rules here? We can call map/unmap
concurrently but not if ... ?

IOVA overlaps?

And we expect the iommu driver to be unable to free page table levels
that have IOVA boundaries in them?

> The simpler drivers already serialise on a per-domain lock internally, while
> the more performance-focused ones implement lock-free atomic pagetable
> management in a similar style to CPU arch code; either way it should work
> fine as-is.

The mm has page table locks at every level and generally expects them
to be held for a lot of manipulations. There are some lockless cases,
but it is not as aggressive as this sounds.

> The difference with debugfs is that it's a completely orthogonal
> side-channel - an iommu_domain user like VFIO or iommu-dma can make sure its
> *own* API usage is sane, but can't be aware of the user triggering some
> driver-internal introspection of that domain in a manner that could race
> more harmfully. 

The mm solution to this problem is to RCU free the page table
levels. This way something like debugfs can read a page table under
RCU completely safely, though incoherently, and there is no
performance cost on the map/unmap fast path side.

Today struct page has a rcu_head that can be used to rcu free it, so
it costs nothing.
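
Something like this, purely as a sketch (hypothetical helpers, not an
actual patch):

        static void pgtable_free_rcu_cb(struct rcu_head *head)
        {
                struct page *pg = container_of(head, struct page, rcu_head);

                __free_page(pg);
        }

        /* unmap/zap path: defer freeing a page-table page past a grace period */
        static void free_pgtable_page_rcu(struct page *pg)
        {
                call_rcu(&pg->rcu_head, pgtable_free_rcu_cb);
        }

and then the debugfs walker only needs rcu_read_lock()/rcu_read_unlock()
around the walk: the entries it reads may be stale, but the page-table
pages they point at cannot be reused until the grace period ends, so
following phys_to_virt() on them stays safe.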

Jason


Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified

2022-05-31 Thread Will Deacon
On Tue, May 31, 2022 at 09:15:22AM -0700, Rob Clark wrote:
> On Tue, May 31, 2022 at 8:46 AM Will Deacon  wrote:
> >
> > On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote:
> > > From: AngeloGioacchino Del Regno 
> > > 
> > >
> > > As specified in this driver, the context banks are 0x1000 apart.
> > > Problem is that sometimes the context number (our asid) does not
> > > match this logic and we end up using the wrong one: this starts
> > > being a problem in the case that we need to send TZ commands
> > > to do anything on a specific context.
> >
> > I don't understand this. The ASID is a software construct, so it shouldn't
> > matter what we use. If it does matter, then please can you explain why? The
> > fact that the context banks are 0x1000 apart seems unrelated.
> 
> I think the connection is that mapping from ctx bank to ASID is 1:1

But in what sense? How is the ASID used beyond a tag in the TLB? The commit
message hints at "TZ commands" being a problem.

I'm not doubting that this is needed to make the thing work, I just don't
understand why.

Will


Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified

2022-05-31 Thread Rob Clark
On Tue, May 31, 2022 at 8:46 AM Will Deacon  wrote:
>
> On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote:
> > From: AngeloGioacchino Del Regno 
> >
> > As specified in this driver, the context banks are 0x1000 apart.
> > Problem is that sometimes the context number (our asid) does not
> > match this logic and we end up using the wrong one: this starts
> > being a problem in the case that we need to send TZ commands
> > to do anything on a specific context.
>
> I don't understand this. The ASID is a software construct, so it shouldn't
> matter what we use. If it does matter, then please can you explain why? The
> fact that the context banks are 0x1000 apart seems unrelated.

I think the connection is that mapping from ctx bank to ASID is 1:1

BR,
-R


Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 08:45:28PM +0800, Baolu Lu wrote:

> My understanding is that device driver owns its PASID policy.

I'm not sure this is actually useful, the core code should provide the
allocator as I can't think of any device that actually needs a special
allocator.

> If ENQCMD is supported on the device, the PASIDs should be allocated
> through ioasid_alloc(). Otherwise, the whole PASID pool is managed
> by the device driver.

This is OK only if we know in-advance that the device needs a global
PASID. ENQCMD is one reason, maybe there are others down the road.

Better to make this core code policy.

Jason


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 16:13, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:

On 2022-05-31 15:53, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:

On 2022/5/31 21:10, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:


For case 2, it is a bit weird. I tried to add a rwsem lock to make the
iommu_unmap() and dumping tables in debugfs exclusive. This does not
work because debugfs may depend on the DMA of the devices to work. It
seems that what we can do is to allow this race, but when we traverse
the page table in debugfs, we will check the validity of the physical
address retrieved from the page table entry. Then, the worst case is to
print some useless information.


Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?


There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in iommu_map/unmap() APIs. Therefore, the race
only exists when multiple ranges share some high-level IOPTEs. The IOMMU
driver updates those IOPTEs using the compare-and-exchange atomic
operation.


Oh? Did I miss where that was documented as part of the iommu API?

Daniel posted patches for VFIO to multi-thread iommu_domin mapping.

iommufd goes out of its way to avoid this kind of serialization so
that userspace can parallel map IOVA.

I think if this is the requirement then the iommu API needs to
provide a lock around the domain for the driver..


Eww, no, we can't kill performance by forcing serialisation on the entire
API just for one silly driver-internal debugfs corner :(


I'm not worried about debugfs, I'm worried about these efforts to
speed up VFIO VM booting by parallel domain loading:

https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jor...@oracle.com/

The DMA API should maintain its own external lock, but the general
domain interface to the rest of the kernel should be safe, IMHO.


The DMA API doesn't need locking, partly since it can trust itself not 
to do stupid things, and mostly because it's DMA API performance that's 
fundamentally incompatible with serialisation anyway. Why do you think 
we have a complicated per-CPU IOVA caching mechanism, if not to support 
big multi-queue devices with multiple CPU threads mapping/unmapping in 
different parts of the same DMA domain concurrently?



Or at least it should be documented..


As far as I'm aware there's never been any restriction at the IOMMU API 
level. It should be self-evident that racing concurrent 
iommu_{map,unmap} requests against iommu_domain_free(), or against each 
other for overlapping IOVAs, is a bad idea and don't do that if you want 
a well-defined outcome. The simpler drivers already serialise on a 
per-domain lock internally, while the more performance-focused ones 
implement lock-free atomic pagetable management in a similar style to 
CPU arch code; either way it should work fine as-is. The difference with 
debugfs is that it's a completely orthogonal side-channel - an 
iommu_domain user like VFIO or iommu-dma can make sure its *own* API 
usage is sane, but can't be aware of the user triggering some 
driver-internal introspection of that domain in a manner that could race 
more harmfully. It has to be down to individual drivers to make that 
safe if they choose to expose such an interface.


Robin.


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:

> > +    break;
> > +    pgtable_walk_level(m, phys_to_virt(phys_addr),
> 
> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
> gets you a valid struct page. Whether that page is direct-mapped kernel
> memory or not is a different matter.

Even though this is debugfs, if the operation is sketchy like that and
can theoretically crash the kernel, the driver should test capabilities;
CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
better cap for 'userspace may crash the kernel'
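
i.e. something as simple as this, as a sketch:

        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

at the top of the debugfs show function, before walking anything.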

Jason

Re: [PATCH 2/6] iommu/qcom: Write TCR before TTBRs to fix ASID access behavior

2022-05-31 Thread Will Deacon
On Fri, May 27, 2022 at 11:28:57PM +0200, Konrad Dybcio wrote:
> From: AngeloGioacchino Del Regno 
> 
> As also stated in the arm-smmu driver, we must write the TCR before
> writing the TTBRs, since the TCR determines the access behavior of
> some fields.

Where is this stated in the arm-smmu driver?

> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 
> Signed-off-by: Marijn Suijten 
> Signed-off-by: Konrad Dybcio 
> ---
>  drivers/iommu/arm/arm-smmu/qcom_iommu.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
> b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
> index 1728d4d7fe25..75f353866c40 100644
> --- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
> +++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
> @@ -273,18 +273,18 @@ static int qcom_iommu_init_domain(struct iommu_domain 
> *domain,
>   ctx->secure_init = true;
>   }
>  
> - /* TTBRs */
> - iommu_writeq(ctx, ARM_SMMU_CB_TTBR0,
> - pgtbl_cfg.arm_lpae_s1_cfg.ttbr |
> - FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid));
> - iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0);
> -
>   /* TCR */
>   iommu_writel(ctx, ARM_SMMU_CB_TCR2,
>   arm_smmu_lpae_tcr2(&pgtbl_cfg));
>   iommu_writel(ctx, ARM_SMMU_CB_TCR,
>arm_smmu_lpae_tcr(&pgtbl_cfg) | ARM_SMMU_TCR_EAE);
>  
> + /* TTBRs */
> + iommu_writeq(ctx, ARM_SMMU_CB_TTBR0,
> + pgtbl_cfg.arm_lpae_s1_cfg.ttbr |
> + FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid));
> + iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0);

I'd have thought that SCTLR.M would be clear here, so it shouldn't matter
what order we write these in.

Will


Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support

2022-05-31 Thread Joao Martins
On 5/31/22 13:39, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> AMD implementation of unmap_read_dirty() is pretty simple as
>> mostly reuses unmap code with the extra addition of marshalling
>> the dirty bit into the bitmap as it walks the to-be-unmapped
>> IOPTE.
>>
>> Extra care is taken though, to switch over to cmpxchg as opposed
>> to a non-serialized store to the PTE and testing the dirty bit
>> only set until cmpxchg succeeds to set to 0.
>>
>> Signed-off-by: Joao Martins 
>> ---
>>   drivers/iommu/amd/io_pgtable.c | 44 +-
>>   drivers/iommu/amd/iommu.c  | 22 +
>>   2 files changed, 60 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
>> index 8325ef193093..1868c3b58e6d 100644
>> --- a/drivers/iommu/amd/io_pgtable.c
>> +++ b/drivers/iommu/amd/io_pgtable.c
>> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct 
>> list_head *freelist)
>>  free_sub_pt(pt, mode, freelist);
>>   }
>>   
>> +static bool free_pte_dirty(u64 *pte, u64 pteval)
> 
> Nitpick: Since we free and clearing the dirty bit, should we change
> the function name to free_clear_pte_dirty()?
> 

We free and *read* the dirty bit. It just so happens that we clear the
dirty bit, and every other bit, in the process. Just to be clear, I am not
clearing the dirty bit explicitly (like read_and_clear_dirty() does).

>> +{
>> +bool dirty = false;
>> +
>> +while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
> 
> We should use 0ULL instead of 0.
>

ack.

>> +dirty = true;
>> +
>> +return dirty;
>> +}
>> +
> 
> Actually, what do you think if we enhance the current free_clear_pte()
> to also handle the check dirty as well?
> 
See further below, about dropping this patch.

>>   /*
>>* Generic mapping functions. It maps a physical address into a DMA
>>* address space. It allocates the page table pages if necessary.
>> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops 
>> *ops, unsigned long iova,
>>  return ret;
>>   }
>>   
>> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> -  unsigned long iova,
>> -  size_t size,
>> -  struct iommu_iotlb_gather *gather)
>> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> +   unsigned long iova,
>> +   size_t size,
>> +   struct iommu_iotlb_gather *gather,
>> +   struct iommu_dirty_bitmap *dirty)
>>   {
>>  struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>  unsigned long long unmapped;
>> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct 
>> io_pgtable_ops *ops,
>>  while (unmapped < size) {
>>  pte = fetch_pte(pgtable, iova, &unmap_size);
>>  if (pte) {
>> -int i, count;
>> +unsigned long i, count;
>> +bool pte_dirty = false;
>>   
>>  count = PAGE_SIZE_PTE_COUNT(unmap_size);
>>  for (i = 0; i < count; i++)
>> -pte[i] = 0ULL;
>> +pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
>> +
> 
> Actually, what if we change the existing free_clear_pte() to 
> free_and_clear_dirty_pte(),
> and incorporate the logic for
> 
Likewise, but otherwise it would be a good idea.

>> ...
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 0a86392b2367..a8fcb6e9a684 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain 
>> *dom, unsigned long iova,
>>  return r;
>>   }
>>   
>> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
>> + unsigned long iova, size_t page_size,
>> + struct iommu_iotlb_gather *gather,
>> + struct iommu_dirty_bitmap *dirty)
>> +{
>> +struct protection_domain *domain = to_pdomain(dom);
>> +struct io_pgtable_ops *ops = &domain->iop.iop.ops;
>> +size_t r;
>> +
>> +if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
>> +(domain->iop.mode == PAGE_MODE_NONE))
>> +return 0;
>> +
>> +r = (ops->unmap_read_dirty) ?
>> +ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
>> +
>> +amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
>> +
>> +return r;
>> +}
>> +
> 
> Instead of creating a new function, what if we enhance the current
> amd_iommu_unmap() to also handle the read-dirty part
> (e.g. __amd_iommu_unmap_read_dirty()), and
> then both amd_iommu_unmap() and amd_io

Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified

2022-05-31 Thread Will Deacon
On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote:
> From: AngeloGioacchino Del Regno 
> 
> As specified in this driver, the context banks are 0x1000 apart.
> Problem is that sometimes the context number (our asid) does not
> match this logic and we end up using the wrong one: this starts
> being a problem in the case that we need to send TZ commands
> to do anything on a specific context.

I don't understand this. The ASID is a software construct, so it shouldn't
matter what we use. If it does matter, then please can you explain why? The
fact that the context banks are 0x1000 apart seems unrelated.

Will


Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-05-31 Thread Joao Martins
On 5/31/22 12:34, Suravee Suthikulpanit wrote:
> Joao,
> 
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +bool enable)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct iommu_dev_data *dev_data;
>> +bool dom_flush = false;
>> +
>> +if (!amd_iommu_had_support)
>> +return -EOPNOTSUPP;
>> +
>> +list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.
> 
Ugh, yes. Will fix.

>> +struct amd_iommu *iommu;
>> +u64 pte_root;
>> +
>> +iommu = amd_iommu_rlookup_table[dev_data->devid];
>> +pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
>> +
>> +/* No change? */
>> +if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
>> +continue;
>> +
>> +pte_root = (enable ?
>> +pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
>> +
>> +/* Flush device DTE */
>> +amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
>> +device_flush_dte(dev_data);
>> +dom_flush = true;
>> +}
>> +
>> +/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
>> +if (dom_flush) {
>> +unsigned long flags;
>> +
>> +spin_lock_irqsave(&pdomain->lock, flags);
>> +amd_iommu_domain_flush_tlb_pde(pdomain);
>> +amd_iommu_domain_flush_complete(pdomain);
>> +spin_unlock_irqrestore(&pdomain->lock, flags);
>> +}
> 
> And call spin_unlock_irqrestore(&pdomain->lock, flags); here.

ack
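
i.e. something like this (sketch only, not the actual v2 code):

        unsigned long flags;

        spin_lock_irqsave(&pdomain->lock, flags);
        list_for_each_entry(dev_data, &pdomain->dev_list, list) {
                /* update DTE_FLAG_HAD in the DTE and flush it, as before */
        }
        if (dom_flush) {
                amd_iommu_domain_flush_tlb_pde(pdomain);
                amd_iommu_domain_flush_complete(pdomain);
        }
        spin_unlock_irqrestore(&pdomain->lock, flags);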

Additionally, something I am thinking about for v2 is to add a @had bool
field in iommu_dev_data. That would align better with the rest of the AMD
IOMMU code than introducing this pattern of using the hardware location
of the PTE roots. Let me know if you disagree.

>> +
>> +return 0;
>> +}
>> +
>> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct iommu_dev_data *dev_data;
>> +u64 dte;
>> +
> 
> Also call spin_lock_irqsave(&pdomain->lock, flags) here
> 
ack
>> +list_for_each_entry(dev_data, &pdomain->dev_list, list) {
>> +dte = amd_iommu_dev_table[dev_data->devid].data[0];
>> +if (!(dte & DTE_FLAG_HAD))
>> +return false;
>> +}
>> +
> 
> And call spin_unlock_irqsave(&pdomain->lock, flags) here
> 
ack

Same comment as above: replace the @dte check so that it just checks
this new variable instead.

>> +return true;
>> +}
>> +
>> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +  unsigned long iova, size_t size,
>> +  struct iommu_dirty_bitmap *dirty)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
>> +
>> +if (!amd_iommu_get_dirty_tracking(domain))
>> +return -EOPNOTSUPP;
>> +
>> +if (!ops || !ops->read_and_clear_dirty)
>> +return -ENODEV;
> 
> We should move this check to before the amd_iommu_get_dirty_tracking() call.
> 

Yeap, better fail earlier.

> Best Regards,
> Suravee
> 
>> +
>> +return ops->read_and_clear_dirty(ops, iova, size, dirty);
>> +}
>> +
>> +
>>   static void amd_iommu_get_resv_regions(struct device *dev,
>> struct list_head *head)
>>   {
>> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>>  .flush_iotlb_all = amd_iommu_flush_iotlb_all,
>>  .iotlb_sync = amd_iommu_iotlb_sync,
>>  .free   = amd_iommu_domain_free,
>> +.set_dirty_tracking = amd_iommu_set_dirty_tracking,
>> +.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>>  }
>>   };
>>   


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:
> On 2022-05-31 15:53, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> > > On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > > > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> > > > 
> > > > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > > > work because debugfs may depend on the DMA of the devices to work. It
> > > > > seems that what we can do is to allow this race, but when we traverse
> > > > > the page table in debugfs, we will check the validity of the physical
> > > > > address retrieved from the page table entry. Then, the worst case is 
> > > > > to
> > > > > print some useless information.
> > > > 
> > > > Sounds horrible, don't you have locking around the IOPTEs of some
> > > > kind? How does updating them work reliably?
> > > 
> > > There's no locking around updating the IOPTEs. The basic assumption is
> > > that at any time, there's only a single thread manipulating the mappings
> > > of the range specified in iommu_map/unmap() APIs. Therefore, the race
> > > only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> > > driver updates those IOPTEs using the compare-and-exchange atomic
> > > operation.
> > 
> > Oh? Did I miss where that was documented as part of the iommu API?
> > 
> > Daniel posted patches for VFIO to multi-thread iommu_domin mapping.
> > 
> > iommufd goes out of its way to avoid this kind of serialization so
> > that userspace can parallel map IOVA.
> > 
> > I think if this is the requirement then the iommu API needs to
> > provide a lock around the domain for the driver..
> 
> Eww, no, we can't kill performance by forcing serialisation on the entire
> API just for one silly driver-internal debugfs corner :(

I'm not worried about debugfs, I'm worried about these efforts to
speed up VFIO VM booting by parallel domain loading:

https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jor...@oracle.com/

The DMA API should maintain its own external lock, but the general
domain interface to the rest of the kernel should be safe, IMHO.

Or at least it should be documented..

Jason


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 15:53, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:

On 2022/5/31 21:10, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:


For case 2, it is a bit weird. I tried to add a rwsem lock to make the
iommu_unmap() and dumping tables in debugfs exclusive. This does not
work because debugfs may depend on the DMA of the devices to work. It
seems that what we can do is to allow this race, but when we traverse
the page table in debugfs, we will check the validity of the physical
address retrieved from the page table entry. Then, the worst case is to
print some useless information.


Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?


There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in iommu_map/unmap() APIs. Therefore, the race
only exists when multiple ranges share some high-level IOPTEs. The IOMMU
driver updates those IOPTEs using the compare-and-exchange atomic
operation.


Oh? Did I miss where that was documented as part of the iommu API?

Daniel posted patches for VFIO to multi-thread iommu_domin mapping.

iommufd goes out of its way to avoid this kind of serialization so
that userspace can parallel map IOVA.

I think if this is the requirement then the iommu API needs to
provide a lock around the domain for the driver..


Eww, no, we can't kill performance by forcing serialisation on the 
entire API just for one silly driver-internal debugfs corner :(


Robin.


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> > 
> > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > work because debugfs may depend on the DMA of the devices to work. It
> > > seems that what we can do is to allow this race, but when we traverse
> > > the page table in debugfs, we will check the validity of the physical
> > > address retrieved from the page table entry. Then, the worst case is to
> > > print some useless information.
> > 
> > Sounds horrible, don't you have locking around the IOPTEs of some
> > kind? How does updating them work reliably?
> 
> There's no locking around updating the IOPTEs. The basic assumption is
> that at any time, there's only a single thread manipulating the mappings
> of the range specified in iommu_map/unmap() APIs. Therefore, the race
> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> driver updates those IOPTEs using the compare-and-exchange atomic
> operation.

Oh? Did I miss where that was documented as part of the iommu API?

Daniel posted patches for VFIO to multi-thread iommu_domain mapping.

iommufd goes out of its way to avoid this kind of serialization so
that userspace can parallel map IOVA.

I think if this is the requirement then the iommu API needs to
provide a lock around the domain for the driver..

Jason


RE: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-31 Thread Michael Kelley (LINUX) via iommu
From: h...@lst.de  Sent: Tuesday, May 31, 2022 12:17 AM
> 
> On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote:
> > B) The contents of the memory buffer must transition between
> > encrypted and not encrypted.  The hardware doesn't provide
> > any mechanism to do such a transition "in place".  The only
> > way to transition is for the CPU to copy the contents between
> > an encrypted area and an unencrypted area of memory.
> >
> > Because of (B), we're stuck needing bounce buffers.  There's no
> > way to avoid them with the current hardware.  Tianyu also pointed
> > out not wanting to expose uninitialized guest memory to the host,
> > so, for example, sharing a read buffer with the host requires that
> > it first be initialized to zero.
> 
> Ok, B is a deal breaker.  I just brought this in because I've received
> review comments that state bouncing is just the easiest option for
> now and we could map it into the hypervisor in the future.  But at
> least for SEV that does not seem like an option without hardware
> changes.
> 
> > We should reset and make sure we agree on the top-level approach.
> > 1) We want general scalability improvements to swiotlb.  These
> > improvements should scale to high CPUs counts (> 100) and for
> > multiple NUMA nodes.
> > 2) Drivers should not require any special knowledge of swiotlb to
> > benefit from the improvements.  No new swiotlb APIs should
> > need to be used by drivers -- the swiotlb scalability improvements
> > should be transparent.
> > 3) The scalability improvements should not be based on device
> > boundaries since a single device may have multiple channels
> > doing bounce buffering on multiple CPUs in parallel.
> 
> Agreed to all counts.
> 
> > The patch from Andi Kleen [3] (not submitted upstream, but referenced
> > by Tianyu as the basis for his patches) seems like a good starting point
> > for meeting the top-level approach.
> 
> Yes, I think doing per-cpu and/or per-node scaling sounds like the
> right general approach. Why was this never sent out?

I'll defer to Andi on what his thinking/plan is for this patch.

> 
> > Andi and Robin had some
> > back-and-forth about Andi's patch that I haven't delved into yet, but
> > getting that worked out seems like a better overall approach.  I had
> > an offline chat with Tianyu, and he would agree as well.
> 
> Where was this discussion?

I was just referring to this thread that you are already on:
https://lore.kernel.org/lkml/e7b644f0-6c90-fe99-792d-75c38505d...@arm.com/

Michael



Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Baolu Lu

On 2022/5/31 21:10, Jason Gunthorpe wrote:

On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:


For case 2, it is a bit weird. I tried to add a rwsem lock to make the
iommu_unmap() and dumping tables in debugfs exclusive. This does not
work because debugfs may depend on the DMA of the devices to work. It
seems that what we can do is to allow this race, but when we traverse
the page table in debugfs, we will check the validity of the physical
address retrieved from the page table entry. Then, the worst case is to
print some useless information.


Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?


There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in iommu_map/unmap() APIs. Therefore, the race
only exists when multiple ranges share some high-level IOPTEs. The IOMMU
driver updates those IOPTEs using the compare-and-exchange atomic
operation.
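
For illustration, that lockless install of a shared high-level IOPTE has
roughly the following shape (a sketch modelled on pfn_to_dma_pte() in the
VT-d driver; helper names are the driver's internals and cache flushing is
elided):

/*
 * Illustrative sketch only: install a freshly allocated page-table page
 * under a shared, currently non-present high-level IOPTE without any lock.
 */
static struct dma_pte *install_pgtable_level(struct dma_pte *pde,
					     struct dma_pte *new_page)
{
	u64 pteval = virt_to_phys(new_page) | DMA_PTE_READ | DMA_PTE_WRITE;

	if (cmpxchg64(&pde->val, 0ULL, pteval)) {
		/* Someone else won the race: free ours and walk theirs. */
		free_pgtable_page(new_page);
		return phys_to_virt(dma_pte_addr(pde));
	}

	return new_page;
}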



It is just debugfs, so maybe it is not the end of the world, but
still..


Fair enough. I think this is somewhat similar to the fact that the IOMMU
hardware can traverse the page table at any time without considering when
the CPUs update it. The IOMMU hardware will generate faults when it
encounters a failure during the page table traversal. Similarly, perhaps
debugfs could dump all-ones for an invalid IOPTE?

Best regards,
baolu


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Robin Murphy

On 2022-05-31 04:02, Baolu Lu wrote:

On 2022/5/30 20:14, Jason Gunthorpe wrote:

On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:


 From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
From: Lu Baolu 
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock 
usage


The domain_translation_struct debugfs node is used to dump static
mappings of PCI devices. It potentially races with setting new
domains to devices and the iommu_map/unmap() interfaces. The existing
code tries to use the global spinlock device_domain_lock to avoid the
races, but this is problematical as this lock is only used to protect
the device tracking lists of the domains.

Instead of using an immature lock to cover up the problem, it's better
to explicitly restrict the use of this debugfs node. This also makes
device_domain_lock static.


What does "explicitly restrict" mean?


I originally thought about adding restrictions on this interface to a
document. But obviously, this is a naive idea. :-) I went over the code
again. The races exist in two paths:

1. Dump the page table in use while setting a new page table to the
    device.
2. A high-level page table entry has been marked as non-present, but the
    dumping code has walked down to the low-level tables.

For case 1, we can try to solve it by dumping tables while holding the
group->mutex.

For case 2, it is a bit weird. I tried to add a rwsem lock to make the
iommu_unmap() and dumping tables in debugfs exclusive. This does not
work because debugfs may depend on the DMA of the devices to work. It
seems that what we can do is to allow this race, but when we traverse
the page table in debugfs, we will check the validity of the physical
address retrieved from the page table entry. Then, the worst case is to
print some useless information.

The real code looks like this:

 From 3feb0727f9d7095729ef75ab1967270045b3a38c Mon Sep 17 00:00:00 2001
From: Lu Baolu 
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage

The domain_translation_struct debugfs node is used to dump the DMAR page
tables for the PCI devices. It potentially races with setting domains to
devices and the iommu_unmap() interface. The existing code uses a global
spinlock device_domain_lock to avoid the races, but this is problematical
as this lock is only used to protect the device tracking lists of each
domain.

This replaces device_domain_lock with group->mutex to protect the page table
traversal from a concurrent domain change, and always checks the physical
address retrieved from a page table entry before walking down to the
next-level page table.

As a cleanup, this also makes device_domain_lock static.

Signed-off-by: Lu Baolu 
---
  drivers/iommu/intel/debugfs.c | 42 ++-
  drivers/iommu/intel/iommu.c   |  2 +-
  drivers/iommu/intel/iommu.h   |  1 -
  3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..e6f4835b8d9f 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, 
struct dma_pte *pde,

  continue;

  path[level] = pde->val;
-    if (dma_pte_superpage(pde) || level == 1)
+    if (dma_pte_superpage(pde) || level == 1) {
  dump_page_info(m, start, path);
-    else
-    pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+    } else {
+    unsigned long phys_addr;
+
+    phys_addr = (unsigned long)dma_pte_addr(pde);
+    if (!pfn_valid(__phys_to_pfn(phys_addr)))


Given that pte_present(pde) passed just above, it was almost certainly a 
valid entry, so it seems unlikely that the physical address it pointed 
to could have disappeared in the meantime. If you're worried about the 
potential case where we've been preempted during this walk for long 
enough that the page has already been freed by an unmap, reallocated, 
and filled with someone else's data that happens to look like valid 
PTEs, this still isn't enough, since that data could just as well happen 
to look like valid physical addresses too.


I imagine that if you want to safely walk pagetables concurrently with 
them potentially being freed, you'd probably need to get RCU involved.
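
A minimal sketch of that RCU idea, purely illustrative and not part of this
patch (free_pgtable_page_rcu() and its callback are hypothetical names; the
unmap path would have to use them instead of freeing page-table pages
immediately):

/* Hypothetical: defer freeing a page-table page past an RCU grace period. */
static void pgtable_page_free_rcu(struct rcu_head *head)
{
	__free_page(container_of(head, struct page, rcu_head));
}

static void free_pgtable_page_rcu(void *vaddr)
{
	call_rcu(&virt_to_page(vaddr)->rcu_head, pgtable_page_free_rcu);
}

/* The debugfs dump would then run the whole walk in a read-side section. */
static void pgtable_walk_level_rcu(struct seq_file *m, struct dma_pte *pde,
				   int level, unsigned long start, u64 *path)
{
	rcu_read_lock();
	pgtable_walk_level(m, pde, level, start, path);
	rcu_read_unlock();
}

The walk itself stays lock-free; the only guarantee added is that a page
observed under rcu_read_lock() cannot be reused until the walker leaves the
read-side critical section.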



+    break;
+    pgtable_walk_level(m, phys_to_virt(phys_addr),


Also, obligatory reminder that pfn_valid() only means that pfn_to_page() 
gets you a valid struct page. Whether that page is direct-mapped kernel 
memory or not is a different matter.



     level - 1, start, path);
+    }
  path[level] = 0;
  }
  }

-static int show_device_domain_translation(struct device *dev, void *data)
+static int __show_device_domain_translation(struct device *dev, void *data)

  {
  struct d

Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-05-31 Thread Jason Gunthorpe via iommu
On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:

> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> iommu_unmap() and dumping tables in debugfs exclusive. This does not
> work because debugfs may depend on the DMA of the devices to work. It
> seems that what we can do is to allow this race, but when we traverse
> the page table in debugfs, we will check the validity of the physical
> address retrieved from the page table entry. Then, the worst case is to
> print some useless information.

Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?

It is just debugfs, so maybe it is not the end of the world, but
still..

Jason


Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Baolu Lu

On 2022/5/31 18:12, Tian, Kevin wrote:

+++ b/include/linux/iommu.h
@@ -105,6 +105,8 @@ struct iommu_domain {
enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault, void *data);
void *fault_data;
+   ioasid_t pasid; /* Used for DMA requests with PASID */
+   atomic_t pasid_users;

These are poorly named, this is really the DMA API global PASID and
shouldn't be used for other things.



Perhaps I misunderstood, do you mind explaining more?

You still haven't really explained what this is for in this patch,
maybe it just needs a better commit message, or maybe something is
wrong.

I keep saying the DMA API usage is not special, so why do we need to
create a new global pasid and refcount? Realistically this is only
going to be used by IDXD, why can't we just allocate a PASID and
return it to the driver every time a driver asks for DMA API on PASID
mode? Why does the core need to do anything special?


Agree. I guess it was a mistake caused by treating ENQCMD as the
only user although the actual semantics of the invented interfaces
have already evolved to be quite general.

This is very similar to what we have been discussing for iommufd.
a PASID is just an additional routing info when attaching a device
to an I/O address space (DMA API in this context) and by default
it should be a per-device resource except when ENQCMD is
explicitly opt in.

Hence it's right time for us to develop common facility working
for both this DMA API usage and iommufd, i.e.:

for normal PASID attach to a domain, driver:

allocates a local pasid from device local space;
attaches the local pasid to a domain;

for PASID attach in particular for ENQCMD, driver:

allocates a global pasid in system-wide;
attaches the global pasid to a domain;
set the global pasid in PASID_MSR;

In both cases the pasid is stored in the attach data instead of the
domain.

DMA API pasid is no special from above except it needs to allow
one device attached to the same domain twice (one with RID
and the other with RID+PASID).

for iommufd those operations are initiated by userspace via
iommufd uAPI.


My understanding is that the device driver owns its PASID policy. If ENQCMD
is supported on the device, the PASIDs should be allocated through
ioasid_alloc(). Otherwise, the whole PASID pool is managed by the device
driver.

For kernel DMA w/ PASID, once the driver has a PASID for this purpose, it
can simply attach the default domain to that PASID on the device. There's
no need for enable/disable() interfaces.
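
As a rough sketch of that driver-side flow (names and the PASID bound are
hypothetical, and iommu_attach_device_pasid() does not exist in this form
yet):

#define IDXD_MAX_PASID	0xfffff	/* hypothetical bound, for illustration */

/* Hypothetical helper: enable DMA-with-PASID on the default domain. */
static int idxd_enable_dma_pasid(struct device *dev, ioasid_t *out_pasid)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
	ioasid_t pasid;
	int ret;

	/* A system-wide PASID is only needed here because of ENQCMD. */
	pasid = ioasid_alloc(NULL, 1, IDXD_MAX_PASID, dev);
	if (pasid == INVALID_IOASID)
		return -ENOSPC;

	/* Attach the existing DMA API (default) domain to this PASID. */
	ret = iommu_attach_device_pasid(domain, dev, pasid); /* hypothetical API */
	if (ret) {
		ioasid_free(pasid);
		return ret;
	}

	*out_pasid = pasid;
	return 0;
}

A plain per-device PASID would skip the global allocation entirely; only the
ENQCMD case forces a system-wide value.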

Best regards,
baolu


Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support

2022-05-31 Thread Suravee Suthikulpanit via iommu




On 4/29/22 4:09 AM, Joao Martins wrote:

The AMD implementation of unmap_read_dirty() is pretty simple as it
mostly reuses the unmap code, with the extra addition of marshalling
the dirty bit into the bitmap as it walks the to-be-unmapped
IOPTEs.

Extra care is taken, though, to switch over to cmpxchg as opposed
to a non-serialized store to the PTE, testing the dirty bit on each
attempt until cmpxchg succeeds in clearing the PTE to 0.

Signed-off-by: Joao Martins 
---
  drivers/iommu/amd/io_pgtable.c | 44 +-
  drivers/iommu/amd/iommu.c  | 22 +
  2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 8325ef193093..1868c3b58e6d 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct 
list_head *freelist)
free_sub_pt(pt, mode, freelist);
  }
  
+static bool free_pte_dirty(u64 *pte, u64 pteval)


Nitpick: Since we both free the PTE and clear the dirty bit, should we
change the function name to free_clear_pte_dirty()?


+{
+   bool dirty = false;
+
+   while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))


We should use 0ULL instead of 0.


+   dirty = true;
+
+   return dirty;
+}
+


Actually, what do you think about enhancing the current free_clear_pte()
to also handle the dirty check as well?
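
Something along these lines, perhaps (just a sketch, untested, reusing the
existing free_clear_pte() structure; IOMMU_PTE_DIRTY() is the macro this
series adds):

static bool free_clear_pte_dirty(u64 *pte, u64 pteval,
				 struct list_head *freelist)
{
	bool dirty;
	u64 *pt;
	int mode;

	/* Retry until we are the ones who actually cleared the PTE. */
	while (cmpxchg64(pte, pteval, 0ULL) != pteval) {
		pr_warn("AMD-Vi: IOMMU pte changed since we read it\n");
		pteval = *pte;
	}

	/* pteval now holds the value we swapped out, dirty bit included. */
	dirty = !!IOMMU_PTE_DIRTY(pteval);

	if (!IOMMU_PTE_PRESENT(pteval))
		return dirty;

	pt   = IOMMU_PTE_PAGE(pteval);
	mode = IOMMU_PTE_MODE(pteval);
	free_sub_pt(pt, mode, freelist);

	return dirty;
}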


  /*
   * Generic mapping functions. It maps a physical address into a DMA
   * address space. It allocates the page table pages if necessary.
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, 
unsigned long iova,
return ret;
  }
  
-static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,

- unsigned long iova,
- size_t size,
- struct iommu_iotlb_gather *gather)
+static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+  unsigned long iova,
+  size_t size,
+  struct iommu_iotlb_gather *gather,
+  struct iommu_dirty_bitmap *dirty)
  {
struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
unsigned long long unmapped;
@@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct 
io_pgtable_ops *ops,
while (unmapped < size) {
pte = fetch_pte(pgtable, iova, &unmap_size);
if (pte) {
-   int i, count;
+   unsigned long i, count;
+   bool pte_dirty = false;
  
  			count = PAGE_SIZE_PTE_COUNT(unmap_size);

for (i = 0; i < count; i++)
-   pte[i] = 0ULL;
+   pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
+


Actually, what if we change the existing free_clear_pte() to 
free_and_clear_dirty_pte(),
and incorporate the logic for


...
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0a86392b2367..a8fcb6e9a684 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
return r;
  }
  
+static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,

+unsigned long iova, size_t page_size,
+struct iommu_iotlb_gather *gather,
+struct iommu_dirty_bitmap *dirty)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
+
+   if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
+   (domain->iop.mode == PAGE_MODE_NONE))
+   return 0;
+
+   r = (ops->unmap_read_dirty) ?
+   ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
+
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
+}
+


Instead of creating a new function, what if we enhance the current
amd_iommu_unmap() to also handle the read-dirty part as well
(e.g. __amd_iommu_unmap_read_dirty()), and then both amd_iommu_unmap()
and amd_iommu_unmap_read_dirty() can call __amd_iommu_unmap_read_dirty()?
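
i.e. roughly the shape below, with __amd_iommu_unmap_read_dirty() carrying
the current unmap body (sketch only; a NULL dirty bitmap means a plain
unmap, no dirty harvesting):

static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
			      size_t page_size,
			      struct iommu_iotlb_gather *gather)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, NULL);
}

static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
					 unsigned long iova, size_t page_size,
					 struct iommu_iotlb_gather *gather,
					 struct iommu_dirty_bitmap *dirty)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, dirty);
}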

Best Regards,
Suravee


[git pull] IOMMU Updates for Linux v5.19

2022-05-31 Thread Joerg Roedel
Hi Linus,

Apologies for the late pull request, I know you prefer the main stuff in
the first week. Some vacation and a public holiday came in between here.

So here are the IOMMU updates for 5.19. Some patches are probably
already merged via the VFIO tree, namely everything from the
vfio-notifier-fix topic branch.

Also, there will be a merge conflict in MAINTAINERS and
drivers/iommu/amd/iommu.c. The latter one is resolved by removing the
function in question, for the former I attached my resolution.

With that in mind:

The following changes since commit 42226c989789d8da4af1de0c31070c96726d990c:

  Linux 5.18-rc7 (2022-05-15 18:08:58 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
tags/iommu-updates-v5.19

for you to fetch changes up to b0dacee202efbf1a5d9f5cdfd82049e8b5b085d2:

  Merge branches 'apple/dart', 'arm/mediatek', 'arm/msm', 'arm/smmu', 
'ppc/pamu', 'x86/vt-d', 'x86/amd' and 'vfio-notifier-fix' into next (2022-05-20 
12:27:17 +0200)


IOMMU Updates for Linux v5.19

Including:

- Intel VT-d driver updates
  - Domain force snooping improvement.
  - Cleanups, no intentional functional changes.

- ARM SMMU driver updates
  - Add new Qualcomm device-tree compatible strings
  - Add new Nvidia device-tree compatible string for Tegra234
  - Fix UAF in SMMUv3 shared virtual addressing code
  - Force identity-mapped domains for users of ye olde SMMU
legacy binding
  - Minor cleanups

- Patches to fix a BUG_ON in the vfio_iommu_group_notifier
  - Groundwork for upcoming iommufd framework
  - Introduction of DMA ownership so that an entire IOMMU group
is either controlled by the kernel or by user-space

- MT8195 and MT8186 support in the Mediatek IOMMU driver

- Patches to make forcing of cache-coherent DMA more coherent
  between IOMMU drivers

- Fixes for thunderbolt device DMA protection

- Various smaller fixes and cleanups


Bjorn Andersson (2):
  dt-bindings: arm-smmu: Add compatible for Qualcomm SC8280XP
  iommu/arm-smmu-qcom: Add SC8280XP support

Christophe Leroy (1):
  iommu/fsl_pamu: Prepare cleanup of powerpc's asm/prom.h

Jason Gunthorpe (5):
  vfio: Delete the unbound_list
  iommu: Introduce the domain op enforce_cache_coherency()
  vfio: Move the Intel no-snoop control off of IOMMU_CACHE
  iommu: Redefine IOMMU_CAP_CACHE_COHERENCY as the cap flag for IOMMU_CACHE
  vfio: Require that devices support DMA cache coherence

Jason Gunthorpe via iommu (1):
  iommu: iommu_group_claim_dma_owner() must always assign a domain

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3-sva: Fix mm use-after-free

Joerg Roedel (4):
  Merge tag 'arm-smmu-updates' of 
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
  iommu/amd: Increase timeout waiting for GA log enablement
  Merge tag 'v5.18-rc7' into arm/smmu
  Merge branches 'apple/dart', 'arm/mediatek', 'arm/msm', 'arm/smmu', 
'ppc/pamu', 'x86/vt-d', 'x86/amd' and 'vfio-notifier-fix' into next

Lu Baolu (17):
  iommu: Add DMA ownership management interfaces
  driver core: Add dma_cleanup callback in bus_type
  amba: Stop sharing platform_dma_configure()
  bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
  PCI: pci_stub: Set driver_managed_dma
  PCI: portdrv: Set driver_managed_dma
  vfio: Set DMA ownership for VFIO devices
  vfio: Remove use of vfio_group_viable()
  vfio: Remove iommu group notifier
  iommu: Remove iommu group changes notifier
  iommu/vt-d: Change return type of dmar_insert_one_dev_info()
  iommu/vt-d: Fold dmar_insert_one_dev_info() into its caller
  iommu/vt-d: Size Page Request Queue to avoid overflow condition
  iommu/vt-d: Block force-snoop domain attaching if no SC support
  iommu/vt-d: Check domain force_snooping against attached devices
  iommu/vt-d: Remove domain_update_iommu_snooping()
  iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

Mario Limonciello (3):
  iommu/amd: Enable swiotlb in all cases
  dma-iommu: Check that swiotlb is active before trying to use it
  iommu/amd: Indicate whether DMA remap support is enabled

Matthew Rosato (1):
  iommu/s390: Tolerate repeat attach_dev calls

Miles Chen (1):
  iommu/mediatek: Fix NULL pointer dereference when printing dev_name

Muhammad Usama Anjum (1):
  iommu/vt-d: Remove unneeded validity check on dev

Rob Herring (1):
  dt-bindings: iommu: Drop client node in examples

Robin Murphy (5):
  iommu: Introduce device_iommu_capable()
  iommu: Add capability for pre-boot DMA protection
  thunderbolt: Make iommu_dma_protection more accurate

Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-05-31 Thread Baolu Lu

Hi Suravee,

On 2022/5/31 19:34, Suravee Suthikulpanit wrote:

On 4/29/22 4:09 AM, Joao Martins wrote:

.
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+    bool enable)
+{
+    struct protection_domain *pdomain = to_pdomain(domain);
+    struct iommu_dev_data *dev_data;
+    bool dom_flush = false;
+
+    if (!amd_iommu_had_support)
+    return -EOPNOTSUPP;
+
+    list_for_each_entry(dev_data, &pdomain->dev_list, list) {


Since we iterate through device list for the domain, we would need to
call spin_lock_irqsave(&pdomain->lock, flags) here.


Not related, just out of curiosity: does it really need to disable
interrupts while holding this lock? Is there any case where this list
would be traversed in interrupt context? Perhaps I missed something?

Best regards,
baolu

Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops

2022-05-31 Thread Krzysztof Kozlowski
On 30/05/2022 23:00, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko 

Thank you for your patch. There is something to discuss/improve.

> diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml 
> b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
> new file mode 100644
> index ..ab5765c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
> @@ -0,0 +1,49 @@
> +# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Xen specific IOMMU for virtualized devices (e.g. virtio)
> +
> +maintainers:
> +  - Stefano Stabellini 
> +
> +description:
> +  The reference to Xen specific IOMMU node using "iommus" property indicates
> +  that Xen grant mappings need to be enabled for the device, and it specifies
> +  the ID of the domain where the corresponding backend resides.
> +  The binding is required to restrict memory access using Xen grant mappings.
> +
> +properties:
> +  compatible:
> +const: xen,grant-dma
> +
> +  '#iommu-cells':
> +const: 1
> +description:
> +  Xen specific IOMMU is multiple-master IOMMU device.
> +  The single cell describes the domid (domain ID) of the domain where
> +  the backend is running.
> +
> +required:
> +  - compatible
> +  - "#iommu-cells"
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +xen_iommu {

No underscores in node names, and use generic node names, so this should
simply be "iommu".

> +compatible = "xen,grant-dma";
> +#iommu-cells = <1>;
> +};
> +
> +virtio@3000 {
> +compatible = "virtio,mmio";
> +reg = <0x3000 0x100>;
> +interrupts = <41>;
> +
> +/* The backend is located in Xen domain with ID 1 */
> +iommus = <&xen_iommu 1>;

There is usually no need to give consumer examples in a provider binding.
If there is nothing specific here (it looks exactly like every other IOMMU
consumer in the Linux kernel), drop the consumer.

> +};


Best regards,
Krzysztof


Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-05-31 Thread Suravee Suthikulpanit via iommu

Joao,

On 4/29/22 4:09 AM, Joao Martins wrote:

.
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+   bool enable)
+{
+   struct protection_domain *pdomain = to_pdomain(domain);
+   struct iommu_dev_data *dev_data;
+   bool dom_flush = false;
+
+   if (!amd_iommu_had_support)
+   return -EOPNOTSUPP;
+
+   list_for_each_entry(dev_data, &pdomain->dev_list, list) {


Since we iterate through device list for the domain, we would need to
call spin_lock_irqsave(&pdomain->lock, flags) here.


+   struct amd_iommu *iommu;
+   u64 pte_root;
+
+   iommu = amd_iommu_rlookup_table[dev_data->devid];
+   pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
+
+   /* No change? */
+   if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
+   continue;
+
+   pte_root = (enable ?
+   pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
+
+   /* Flush device DTE */
+   amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
+   device_flush_dte(dev_data);
+   dom_flush = true;
+   }
+
+   /* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
+   if (dom_flush) {
+   unsigned long flags;
+
+   spin_lock_irqsave(&pdomain->lock, flags);
+   amd_iommu_domain_flush_tlb_pde(pdomain);
+   amd_iommu_domain_flush_complete(pdomain);
+   spin_unlock_irqrestore(&pdomain->lock, flags);
+   }


And call spin_unlock_irqrestore(&pdomain->lock, flags); here.

+
+   return 0;
+}
+
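
Folding the two suggested lock calls in, the function might end up looking
like this (sketch only, untested):

/* Sketch with the suggested locking: the dev_list walk, the DTE updates
 * and the final TLB flush all run under pdomain->lock.
 */
static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
					bool enable)
{
	struct protection_domain *pdomain = to_pdomain(domain);
	struct iommu_dev_data *dev_data;
	bool dom_flush = false;
	unsigned long flags;

	if (!amd_iommu_had_support)
		return -EOPNOTSUPP;

	spin_lock_irqsave(&pdomain->lock, flags);
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		u64 pte_root = amd_iommu_dev_table[dev_data->devid].data[0];

		/* No change? */
		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
			continue;

		pte_root = enable ? pte_root | DTE_FLAG_HAD :
				    pte_root & ~DTE_FLAG_HAD;

		/* Flush device DTE */
		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
		device_flush_dte(dev_data);
		dom_flush = true;
	}

	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
	if (dom_flush) {
		amd_iommu_domain_flush_tlb_pde(pdomain);
		amd_iommu_domain_flush_complete(pdomain);
	}
	spin_unlock_irqrestore(&pdomain->lock, flags);

	return 0;
}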
+static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+   struct protection_domain *pdomain = to_pdomain(domain);
+   struct iommu_dev_data *dev_data;
+   u64 dte;
+


Also call spin_lock_irqsave(&pdomain->lock, flags) here


+   list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+   dte = amd_iommu_dev_table[dev_data->devid].data[0];
+   if (!(dte & DTE_FLAG_HAD))
+   return false;
+   }
+


And call spin_unlock_irqrestore(&pdomain->lock, flags) here.


+   return true;
+}
+
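
And similarly for the read side (sketch only):

static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
{
	struct protection_domain *pdomain = to_pdomain(domain);
	struct iommu_dev_data *dev_data;
	unsigned long flags;
	bool ret = true;

	spin_lock_irqsave(&pdomain->lock, flags);
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		if (!(amd_iommu_dev_table[dev_data->devid].data[0] &
		      DTE_FLAG_HAD)) {
			ret = false;
			break;
		}
	}
	spin_unlock_irqrestore(&pdomain->lock, flags);

	return ret;
}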
+static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+ unsigned long iova, size_t size,
+ struct iommu_dirty_bitmap *dirty)
+{
+   struct protection_domain *pdomain = to_pdomain(domain);
+   struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
+
+   if (!amd_iommu_get_dirty_tracking(domain))
+   return -EOPNOTSUPP;
+
+   if (!ops || !ops->read_and_clear_dirty)
+   return -ENODEV;


We should move this check before the amd_iommu_get_dirty_tracking() call.

Best Regards,
Suravee


+
+   return ops->read_and_clear_dirty(ops, iova, size, dirty);
+}
+
+
  static void amd_iommu_get_resv_regions(struct device *dev,
   struct list_head *head)
  {
@@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.free   = amd_iommu_domain_free,
+   .set_dirty_tracking = amd_iommu_set_dirty_tracking,
+   .read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
}
  };
  



RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API

2022-05-31 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, May 30, 2022 8:23 PM
> 
> On Tue, May 24, 2022 at 08:17:27AM -0700, Jacob Pan wrote:
> > Hi Jason,
> >
> > On Tue, 24 May 2022 10:50:34 -0300, Jason Gunthorpe 
> wrote:
> >
> > > On Wed, May 18, 2022 at 11:21:15AM -0700, Jacob Pan wrote:
> > > > DMA requests tagged with PASID can target individual IOMMU domains.
> > > > Introduce a domain-wide PASID for DMA API, it will be used on the
> same
> > > > mapping as legacy DMA without PASID. Let it be IOVA or PA in case of
> > > > identity domain.
> > >
> > > Huh? I can't understand what this is trying to say or why this patch
> > > makes sense.
> > >
> > > We really should not have pasid's like this attached to the domains..
> > >
> > This is the same "DMA API global PASID" you reviewed in v3, I just
> > singled it out as a standalone patch and renamed it. Here is your previous
> > review comment.
> >
> > > +++ b/include/linux/iommu.h
> > > @@ -105,6 +105,8 @@ struct iommu_domain {
> > >   enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault, void *data);
> > >   void *fault_data;
> > > + ioasid_t pasid; /* Used for DMA requests with PASID */
> > > + atomic_t pasid_users;
> >
> > These are poorly named, this is really the DMA API global PASID and
> > shouldn't be used for other things.
> >
> >
> >
> > Perhaps I misunderstood, do you mind explaining more?
> 
> You still haven't really explained what this is for in this patch,
> maybe it just needs a better commit message, or maybe something is
> wrong.
> 
> I keep saying the DMA API usage is not special, so why do we need to
> create a new global pasid and refcount? Realistically this is only
> going to be used by IDXD, why can't we just allocate a PASID and
> return it to the driver every time a driver asks for DMA API on PASID
> mode? Why does the core need to do anything special?
> 

Agree. I guess it was a mistake caused by treating ENQCMD as the
only user although the actual semantics of the invented interfaces
have already evolved to be quite general.

This is very similar to what we have been discussing for iommufd.
a PASID is just an additional routing info when attaching a device
to an I/O address space (DMA API in this context) and by default
it should be a per-device resource except when ENQCMD is
explicitly opt in.

Hence it's right time for us to develop common facility working
for both this DMA API usage and iommufd, i.e.:

for normal PASID attach to a domain, driver:

allocates a local pasid from device local space;
attaches the local pasid to a domain;

for PASID attach in particular for ENQCMD, driver:

allocates a global pasid in system-wide;
attaches the global pasid to a domain;
set the global pasid in PASID_MSR;

In both cases the pasid is stored in the attach data instead of the
domain.

DMA API pasid is no special from above except it needs to allow
one device attached to the same domain twice (one with RID
and the other with RID+PASID).

for iommufd those operations are initiated by userspace via
iommufd uAPI.

Thanks
Kevin


Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-31 Thread h...@lst.de
On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote:
> B) The contents of the memory buffer must transition between
> encrypted and not encrypted.  The hardware doesn't provide
> any mechanism to do such a transition "in place".  The only
> way to transition is for the CPU to copy the contents between
> an encrypted area and an unencrypted area of memory.
> 
> Because of (B), we're stuck needing bounce buffers.  There's no
> way to avoid them with the current hardware.  Tianyu also pointed
> out not wanting to expose uninitialized guest memory to the host,
> so, for example, sharing a read buffer with the host requires that
> it first be initialized to zero.

Ok, B is a deal breaker.  I just brought this in because I've received
review comments that state bouncing is just the easiest option for
now and we could map it into the hypervisor in the future.  But at
least for SEV that does not seem like an option without hardware
changes.

> We should reset and make sure we agree on the top-level approach.
> 1) We want general scalability improvements to swiotlb.  These
> improvements should scale to high CPUs counts (> 100) and for
> multiple NUMA nodes.
> 2) Drivers should not require any special knowledge of swiotlb to
> benefit from the improvements.  No new swiotlb APIs should be
> need to be used by drivers -- the swiotlb scalability improvements
> should be transparent.
> 3) The scalability improvements should not be based on device
> boundaries since a single device may have multiple channels
> doing bounce buffering on multiple CPUs in parallel.

Agreed to all counts.

> The patch from Andi Kleen [3] (not submitted upstream, but referenced
> by Tianyu as the basis for his patches) seems like a good starting point
> for meeting the top-level approach.

Yes, I think doing per-cpu and/or per-node scaling sounds like the
right general approach. Why was this never sent out?
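
Purely for illustration (hypothetical types, not Andi's actual patch): the
structural shape of per-CPU scaling could be as simple as splitting the
slot space into independently locked areas, e.g.:

/*
 * Hypothetical types, for illustration only: each area owns a slice of the
 * swiotlb slots and has its own lock, so concurrent bounce-buffer mappings
 * on different CPUs rarely contend.
 */
struct io_tlb_area {
	spinlock_t lock;		/* protects this area's slots only */
	unsigned int start_slot;	/* first slot owned by this area */
	unsigned int nslots;		/* slots in this area */
	unsigned int used;		/* currently allocated slots */
};

struct io_tlb_pool {
	struct io_tlb_area *areas;	/* nareas entries */
	unsigned int nareas;		/* e.g. roundup_pow_of_two(num_possible_cpus()) */
};

/* Mapping picks the local CPU's area first and falls back to others. */
static inline struct io_tlb_area *io_tlb_this_area(struct io_tlb_pool *pool)
{
	return &pool->areas[raw_smp_processor_id() & (pool->nareas - 1)];
}

The point is only structural: per-area locks bound contention without any
driver-visible API change.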

> Andi and Robin had some
> back-and-forth about Andi's patch that I haven't delved into yet, but
> getting that worked out seems like a better overall approach.  I had
> an offline chat with Tianyu, and he would agree as well.

Where was this discussion?