Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread David Gibson
On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 28, 2022 11:11 PM
> > 
> > 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> > 
> > The main issue is that internally to the iommu subsystem the
> > iommu_domain aperture is assumed to be a single window. This kAPI will
> > have to be improved to model the PPC multi-window iommu_domain.
> > 
> 
> From the point of nesting probably each window can be a separate
> domain then the existing aperture should still work?

Maybe.  There might be several different ways to represent it, but the
vital piece is that any individual device (well, group, technically)
must atomically join/leave both windows at once.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-28 Thread David Gibson
On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is 
> > > > > the smaller
> > > > > + *   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside 
> > > > > these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of 
> > > > > iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > + __u32 size;
> > > > > + __u32 ioas_id;
> > > > > + __u32 out_num_iovas;
> > > > > + __u32 __reserved;
> > > > > + struct iommu_valid_iovas {
> > > > > + __aligned_u64 start;
> > > > > + __aligned_u64 last;
> > > > > + } out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifeime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel supplied
> > or failed outright).  Still I recognize that has its own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind? The simplified things I know
> about want to attach their devices then allocate valid IOVA, they
> don't really have a notion about what IOVA regions they are willing to
> accept, or necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attaching a device which
shrinks the windows, but still fits the requested windows within, and
you're still good to go.

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
to userspace.

One option might be to only allow IOAS_MAP (or COPY) operations after
windows are requested, but otherwise you can choose the order.  So,
things that have strict requirements for the windows (vIOMMU
emulation) would request the windows then add devices: they know the
windows they need, if the devices can't work with that, that's what
needs to fail.  A userspace driver, however, would attach the devices
it wants to use, then request a window (without specifying base
address).

A query ioctl to give the largest possible windows in the current
state could still be useful for debugging here, of course, but
wouldn't need to be used in the normal course of operation.

> What might be interesting is to have some option to 

Re: [PATCH v4 11/12] iommu: Per-domain I/O page fault handling

2022-04-28 Thread Baolu Lu

On 2022/4/28 22:57, Jean-Philippe Brucker wrote:

On Thu, Apr 21, 2022 at 01:21:20PM +0800, Lu Baolu wrote:

  static void iopf_handle_group(struct work_struct *work)
  {
struct iopf_group *group;
@@ -134,12 +78,23 @@ static void iopf_handle_group(struct work_struct *work)
group = container_of(work, struct iopf_group, work);
  
  	list_for_each_entry_safe(iopf, next, &group->faults, list) {

+   struct iommu_domain *domain;
+
+   domain = iommu_get_domain_for_dev_pasid_async(group->dev,
+   iopf->fault.prm.pasid);

Reading the PCIe spec again (v6.0 10.4.1.1 PASID Usage), all faults within
the group have the same PASID so we could move the domain fetch out of the
loop. It does deviate from the old behavior, though, so we could change
it later.


Perhaps we can add a pasid member in the struct iopf_group and do a
sanity check when a new iopf is added to the group? Here, we just fetch
the domain with group->pasid.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 03/12] iommu: Add attach/detach_dev_pasid domain ops

2022-04-28 Thread Baolu Lu

On 2022/4/28 22:53, Jean-Philippe Brucker wrote:

On Thu, Apr 21, 2022 at 01:21:12PM +0800, Lu Baolu wrote:

Attaching an IOMMU domain to a PASID of a device is a generic operation
for modern IOMMU drivers which support PASID-granular DMA address
translation. Currently visible usage scenarios include (but not limited):

  - SVA (Shared Virtual Address)
  - kernel DMA with PASID
  - hardware-assist mediated device

This adds a pair of common domain ops for this purpose and adds helpers
to attach/detach a domain to/from a {device, PASID}. Some buses, like
PCI, route packets without considering the PASID value. Thus a DMA target
address with PASID might be treated as P2P if the address falls into the
MMIO BAR of other devices in the group. To make things simple, these
interfaces only apply to devices belonging to the singleton groups, and
the singleton is immutable in fabric i.e. not affected by hotplug.

Signed-off-by: Lu Baolu 

[...]

+/*
+ * Use standard PCI bus topology, isolation features, and DMA
+ * alias quirks to check the immutable singleton attribute. If
+ * the device came from DT, assume it is static and then
+ * singleton can know from the device count in the group.
+ */
+static bool device_group_immutable_singleton(struct device *dev)
+{
+   struct iommu_group *group = iommu_group_get(dev);
+   int count;
+
+   if (!group)
+   return false;
+
+   mutex_lock(&group->mutex);
+   count = iommu_group_device_count(group);
+   mutex_unlock(&group->mutex);
+   iommu_group_put(group);
+
+   if (count != 1)
+   return false;
+
+   if (dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   /*
+* The device could be considered to be fully isolated if
+* all devices on the path from the device to the host-PCI
+* bridge are protected from peer-to-peer DMA by ACS.
+*/
+   if (!pci_acs_path_enabled(pdev, NULL, REQ_ACS_FLAGS))
+   return false;
+
+   /* Filter out devices which has any alias device. */
+   if (pci_for_each_dma_alias(pdev, has_pci_alias, pdev))
+   return false;


Aren't aliases already added to the group by pci_device_group()?  If so
it's part of the count check above


You are right. pci_device_group() has already covered pci aliases.




+
+   return true;
+   }
+
+   /*
+* If the device came from DT, assume it is static and then
+* singleton can know from the device count in the group.
+*/
+   return is_of_node(dev_fwnode(dev));


I don't think DT is relevant here because a platform device enumerated
through ACPI will also have its own group. It should be safe to stick to
what the IOMMU drivers declare with their device_group() callback. Except
for PCI those groups should be immutable so we can return true here.


Fair enough. My code is too restrict. The group singleton is immutable
as long as the fabric is static or ACS (or similar) technology is
implemented. Currently we only need to care about PCI as far as I can
see.


Thanks,
Jean



Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 10/12] iommu: Prepare IOMMU domain for IOPF

2022-04-28 Thread Baolu Lu

Hi Jean,

On 2022/4/28 22:47, Jean-Philippe Brucker wrote:

Hi Baolu,

On Thu, Apr 21, 2022 at 01:21:19PM +0800, Lu Baolu wrote:

+/*
+ * Get the attached domain for asynchronous usage, for example the I/O
+ * page fault handling framework. The caller get a reference counter
+ * of the domain automatically on a successful return and should put
+ * it with iommu_domain_put() after usage.
+ */
+struct iommu_domain *
+iommu_get_domain_for_dev_pasid_async(struct device *dev, ioasid_t pasid)
+{
+   struct iommu_domain *domain;
+   struct iommu_group *group;
+
+   if (!pasid_valid(pasid))
+   return NULL;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return NULL;
+
+   mutex_lock(&group->mutex);


There is a possible deadlock between unbind() and the fault handler:

  unbind()iopf_handle_group()
   mutex_lock(&group->mutex)
   iommu_detach_device_pasid()
iopf_queue_flush_dev() iommu_get_domain_for_dev_pasid_async()
 ... waits for IOPF workmutex_lock(&group->mutex)



Yes, really.


I was wrong in my previous review: we do have a guarantee that the SVA
domain does not go away during IOPF handling, because unbind() waits for
pending faults with iopf_queue_flush_dev() before freeing the domain (or
for Arm stall, knows that there are no pending faults). So we can just get
rid of domain->async_users and the group->mutex in IOPF, I think?


Agreed with you. The Intel code does the same thing in its unbind().

Thus, the sva domain's life cycle has already synchronized with IOPF
handling, there's no need for domain->async.

I will drop it in the next version. Thanks you!

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-04-28 Thread Tian, Kevin
> From: Joao Martins 
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for dirty bit in the IOPTEs.
> 
> Today, AMD Milan (which been out for a year now) supports it while ARM
> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
> The intended use-case is to support Live Migration with SR-IOV, with

this should not be restricted to SR-IOV.

> IOMMUs
> that support it. Yishai Hadas will be soon submiting an RFC that covers the
> PCI device dirty tracker via vfio.
> 
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
> set of a IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endporint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here) The resultant object

Just remove VMM/guest/... since iommufd is not specific to virtualization. 

> is an hw_pagetable object which represents the iommu_domain
> object that will be directly manipulated. For more background on
> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
> kAPI is then extended to provide:
> 
>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
> as the most common case of changing hardware protection domain control

didn't get what 'most common case' here tries to explain

> bits, and ARM specific case of having to enable the per-PTE DBM control
> bit. The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
> 
>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
> thus describe the IOVAs that got written by the device. While performing
> the marshalling also vendors need to clear the dirty bits from IOPTE and

s/vendors/iommu drivers/ 

> allow the kAPI caller to batch the much needed IOTLB flush.
> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that the
> IOPT walk is going to be expensive. It occured to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs.
> I haven't opted for that one, given that it would mean two lenghty IOPTE
> walks and felt counter-performant.

me too. that doesn't feel like a performant way.

> 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erronous guest (or device) DMAing to an address being unmapped at the
> same time.

an erroneous attempt like above cannot anticipate which DMAs can
succeed in that window thus the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and
actually calling unmap. From guest p.o.v. all those are black-box
hardware logic to serve a virtual iotlb invalidation request which just
cannot be completed in one cycle.

Hence in reality probably this is not required except to meet vfio
compat requirement. Just in concept returning dirty bits at unmap
is more accurate.

I'm slightly inclined to abandon it in iommufd uAPI.

> 
> [See at the end too, on general remarks, specifically the one regarding
>  probing dirty tracking via a dedicated iommufd cap ioctl]
> 
> The series is organized as follows:
> 
> * Patches 1-3: Takes care of the iommu domain operations to be added and
> extends iommufd io-pagetable to set/clear dirty tracking, as well as
> reading the dirty bits from the vendor pagetables. The idea is to abstract
> iommu vendors from any idea of how bitmaps are stored or propagated
> back to
> the caller, as well as allowing control/batching over IOTLB flush. So
> there's a data structure and an helper that only tells the upper layer that
> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking

why do we need another pinning here? any page mapped in iommu page
table is supposed to have been pinned already...

> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
> just has
> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
> vendor IOMMU implementor.
> 
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA
> got dirtied. I separated this from the rest of the set, as I am still
> questioning the need for this API and whether this race needs to be
> fundamentally be handled. I guess the thinking is that live-migration
> should be guest foolproof, but how much the race happens in pratice to
> deem this as a necessary unmap variant. Perhaps maybe it might be enough
> fetching dirty bits prior to the unmap? Feedback appreciated.

I think so as aforementioned.

Re: [PATCH v2] iommu/sva: Fix PASID use-after-free issue

2022-04-28 Thread Zhangfei Gao



On 2022/4/29 上午2:00, Fenghua Yu wrote:

The PASID is being freed too early.  It needs to stay around until after
device drivers that might be using it have had a chance to clear it out
of the hardware.

As a reminder:

mmget() /mmput()  refcount the mm's address space
mmgrab()/mmdrop() refcount the mm itself

The PASID is currently tied to the life of the mm's address space and
freed in __mmput().  This makes logical sense because the PASID can't be
used once the address space is gone.

But, this misses an important point: even after the address space is
gone, the PASID will still be programmed into a device.  Device drivers
might, for instance, still need to flush operations that are outstanding
and need to use that PASID.  They do this at file->release() time.

Device drivers call the IOMMU driver to hold a reference on the mm itself
and drop it at file->release() time.  But, the IOMMU driver holds a
reference on the mm itself, not the address space.  The address space
(and the PASID) is long gone by the time the driver tries to clean up.
This is effectively a use-after-free bug on the PASID.

To fix this, move the PASID free operation from __mmput() to __mmdrop().
This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
allocated until it drops its mm reference.

Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and free 
it on mm exit")

Reported-by: Zhangfei Gao 
Tested-by: Zhangfei Gao 


Tested-by: Zhangfei Gao 

Use the formal email, thanks


Suggested-by: Jean-Philippe Brucker 
Suggested-by: Jacob Pan 
Reviewed-by: Jean-Philippe Brucker 
Signed-off-by: Fenghua Yu 
---

v2:
- Dave Hansen rewrites the change log.
- Add Tested-by: Zhangfei Gao 
- Add Reviewed-by: Jean-Philippe Brucker 

The original patch was posted and discussed in:
https://lore.kernel.org/lkml/ymdzffx7fn586...@fyu1.sc.intel.com/

  kernel/fork.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..35a3beff140b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
mmu_notifier_subscriptions_destroy(mm);
check_mm(mm);
put_user_ns(mm->user_ns);
+   mm_pasid_drop(mm);
free_mm(mm);
  }
  EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
-   mm_pasid_drop(mm);
mmdrop(mm);
  }
  


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, April 28, 2022 11:11 PM
> 
> 
> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adpoted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.
> 
> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.
> 

>From the point of nesting probably each window can be a separate
domain then the existing aperture should still work?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm

2022-04-28 Thread Stefano Stabellini
On Thu, 28 Apr 2022, Boris Ostrovsky wrote:
> On 4/28/22 5:49 PM, Stefano Stabellini wrote:
> > On Thu, 28 Apr 2022, Christoph Hellwig wrote:
> > > On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote:
> > > > > Reported-by: Rahul Singh 
> > > > > Signed-off-by: Christoph Hellwig 
> > > > Reviewed-by: Stefano Stabellini 
> > > Do you want to take this through the Xen tree or should I pick it up?
> > > Either way I'd love to see some testing on x86 as well.
> > I agree on the x86 testing. Juergen, Boris?
> > 
> > I'd say to take this patch via the Xen tree but Juergen has just sent a
> > Xen pull request to Linux last Saturday. Juergen do you plan to send
> > another one? Do you have something else lined up? If not, it might be
> > better to let Christoph pick it up.
> 
> 
> We don't have anything pending.
> 
> 
> I can test it but at best tomorrow so not sure we can get this into rc5. Do
> you consider this an urgent fix or can we wait until 5.19? Because it's a bit
> too much IMO for rc6.

On one hand, Linux doesn't boot on a platform without this fix. On the
other hand, I totally see that this patch could introduce regressions on
x86 so I think it is fair that we are careful with it.

>From my point of view, it might be better to wait for 5.19 and mark it
as backport.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm

2022-04-28 Thread Boris Ostrovsky



On 4/28/22 5:49 PM, Stefano Stabellini wrote:

On Thu, 28 Apr 2022, Christoph Hellwig wrote:

On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote:

Reported-by: Rahul Singh 
Signed-off-by: Christoph Hellwig 

Reviewed-by: Stefano Stabellini 

Do you want to take this through the Xen tree or should I pick it up?
Either way I'd love to see some testing on x86 as well.

I agree on the x86 testing. Juergen, Boris?

I'd say to take this patch via the Xen tree but Juergen has just sent a
Xen pull request to Linux last Saturday. Juergen do you plan to send
another one? Do you have something else lined up? If not, it might be
better to let Christoph pick it up.



We don't have anything pending.


I can test it but at best tomorrow so not sure we can get this into rc5. Do you 
consider this an urgent fix or can we wait until 5.19? Because it's a bit too 
much IMO for rc6.


-boris

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm

2022-04-28 Thread Stefano Stabellini
On Thu, 28 Apr 2022, Christoph Hellwig wrote:
> On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote:
> > > Reported-by: Rahul Singh 
> > > Signed-off-by: Christoph Hellwig 
> > 
> > Reviewed-by: Stefano Stabellini 
> 
> Do you want to take this through the Xen tree or should I pick it up?
> Either way I'd love to see some testing on x86 as well.

I agree on the x86 testing. Juergen, Boris?

I'd say to take this patch via the Xen tree but Juergen has just sent a
Xen pull request to Linux last Saturday. Juergen do you plan to send
another one? Do you have something else lined up? If not, it might be
better to let Christoph pick it up.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML

2022-04-28 Thread Joao Martins
From: Kunkun Jiang 

This detects BBML feature and if SMMU supports it, transfer BBMLx
quirk to io-pgtable.

BBML1 requires still marking PTE nT prior to performing a
translation table update, while BBML2 requires neither break-before-make
nor PTE nT bit being set. For dirty tracking it needs to clear
the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
"3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
[joaomart: massage commit message with the need to have BBML quirk
 and add the Quirk io-pgtable flags]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++
 include/linux/io-pgtable.h  |  3 +++
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 14609ece4e33..4dba53bde2e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
 
+   if (smmu->features & ARM_SMMU_FEAT_BBML1)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+   else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
 
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+   switch (FIELD_GET(IDR3_BBML, reg)) {
+   case IDR3_BBML0:
+   break;
+   case IDR3_BBML1:
+   smmu->features |= ARM_SMMU_FEAT_BBML1;
+   break;
+   case IDR3_BBML2:
+   smmu->features |= ARM_SMMU_FEAT_BBML2;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+   return -ENXIO;
+   }
+
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1487a80fdf1b..e15750be1d95 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3  0xc
+#define IDR3_BBML  GENMASK(12, 11)
+#define IDR3_BBML0 0
+#define IDR3_BBML1 1
+#define IDR3_BBML2 2
 #define IDR3_RIL   (1 << 10)
 
 #define ARM_SMMU_IDR5  0x14
@@ -644,6 +648,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
 #define ARM_SMMU_FEAT_HA   (1 << 19)
 #define ARM_SMMU_FEAT_HD   (1 << 20)
+#define ARM_SMMU_FEAT_BBML1(1 << 21)
+#define ARM_SMMU_FEAT_BBML2(1 << 22)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index c2ebfe037f5d..d7626ca67dbf 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -85,6 +85,9 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_MTK_EXTBIT(3)
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
+   #define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(7)
+   #define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(8)
+
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.17.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
Similar to other IOMMUs base unmap_read_dirty out of how unmap() with
the exception to having a non-racy clear of the PTE to return whether it
was dirty or not.

Signed-off-by: Joao Martins 
---
 drivers/iommu/intel/iommu.c | 43 -
 include/linux/intel-iommu.h | 16 ++
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 92af43f27241..e80e98f5202b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1317,7 +1317,8 @@ static void dma_pte_list_pagetables(struct dmar_domain 
*domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
struct dma_pte *pte, unsigned long pfn,
unsigned long start_pfn, unsigned long last_pfn,
-   struct list_head *freelist)
+   struct list_head *freelist,
+   struct iommu_dirty_bitmap *dirty)
 {
struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1338,7 +1339,11 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
if (level > 1 && !dma_pte_superpage(pte))
dma_pte_list_pagetables(domain, level - 1, pte, 
freelist);
 
-   dma_clear_pte(pte);
+   if (dma_clear_pte_dirty(pte) && dirty)
+   iommu_dirty_bitmap_record(dirty,
+   pfn << VTD_PAGE_SHIFT,
+   level_size(level) << VTD_PAGE_SHIFT);
+
if (!first_pte)
first_pte = pte;
last_pte = pte;
@@ -1347,7 +1352,7 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
dma_pte_clear_level(domain, level - 1,
phys_to_virt(dma_pte_addr(pte)),
level_pfn, start_pfn, last_pfn,
-   freelist);
+   freelist, dirty);
}
 next:
pfn = level_pfn + level_size(level);
@@ -1362,7 +1367,8 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
the page tables, and may have cached the intermediate levels. The
pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-unsigned long last_pfn, struct list_head *freelist)
+unsigned long last_pfn, struct list_head *freelist,
+struct iommu_dirty_bitmap *dirty)
 {
BUG_ON(!domain_pfn_supported(domain, start_pfn));
BUG_ON(!domain_pfn_supported(domain, last_pfn));
@@ -1370,7 +1376,8 @@ static void domain_unmap(struct dmar_domain *domain, 
unsigned long start_pfn,
 
/* we don't need lock here; nobody else touches the iova range */
dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-   domain->pgd, 0, start_pfn, last_pfn, freelist);
+   domain->pgd, 0, start_pfn, last_pfn, freelist,
+   dirty);
 
/* free pgd */
if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2031,7 +2038,8 @@ static void domain_exit(struct dmar_domain *domain)
if (domain->pgd) {
LIST_HEAD(freelist);
 
-   domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+   domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+NULL);
put_pages_list(&freelist);
}
 
@@ -4125,7 +4133,8 @@ static int intel_iommu_memory_notifier(struct 
notifier_block *nb,
struct intel_iommu *iommu;
LIST_HEAD(freelist);
 
-   domain_unmap(si_domain, start_vpfn, last_vpfn, 
&freelist);
+   domain_unmap(si_domain, start_vpfn, last_vpfn,
+&freelist, NULL);
 
rcu_read_lock();
for_each_active_iommu(iommu, drhd)
@@ -4737,7 +4746,8 @@ static int intel_iommu_map_pages(struct iommu_domain 
*domain,
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
unsigned long iova, size_t size,
-   struct iommu_iotlb_gather *gather)
+   struct iommu_iotlb_gather *gather,
+   struct iommu_dirty_bitmap *dirty)
 {
struct dmar_domain *dmar_domain = to_dmar_domain(domain);
unsigned long start_pfn, last_pfn;
@@ -4753,7 +4763,7 @@ static size_t intel_iommu_unmap(struct iommu_domain 
*domain,
start_

[PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains

2022-04-28 Thread Joao Martins
IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
stage table, though, has not bit for advertising, unless referenced via
a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
stage table "3.7.2 Accessed and Dirty Flags".

To enable it scalable-mode for the second-stage table is required,
solimit the use of dirty-bit to scalable-mode and discarding the
first stage configured DMAR domains. To use SSADS, we set a bit in
the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
doing so, flush all iommu caches. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

Dirty bit on the PTE is located in the same location (bit 9). The IOTLB
caches some attributes when SSADE is enabled and dirty-ness information,
so we also need to flush IOTLB to make sure IOMMU attempts to set the
dirty bit again. Relevant manuals over the hardware translation is
chapter 6 with some special mention to:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Signed-off-by: Joao Martins 
---
Shouldn't probably be as aggresive as to flush all; needs
checking with hardware (and invalidations guidance) as to understand
what exactly needs flush.
---
 drivers/iommu/intel/iommu.c | 109 
 drivers/iommu/intel/pasid.c |  76 +
 drivers/iommu/intel/pasid.h |   7 +++
 include/linux/intel-iommu.h |  14 +
 4 files changed, 206 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ce33f85c72ab..92af43f27241 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct 
iommu_domain *domain,
}
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+ bool enable)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct device_domain_info *info;
+   unsigned long flags;
+   int ret = -EINVAL;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   if (list_empty(&dmar_domain->devices)) {
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   return ret;
+   }
+
+   list_for_each_entry(info, &dmar_domain->devices, link) {
+   if (!info->dev || (info->domain != dmar_domain))
+   continue;
+
+   /* Dirty tracking is second-stage level SM only */
+   if ((info->domain && domain_use_first_level(info->domain)) ||
+   !ecap_slads(info->iommu->ecap) ||
+   !sm_supported(info->iommu) || !intel_iommu_sm) {
+   ret = -EOPNOTSUPP;
+   continue;
+   }
+
+   ret = intel_pasid_setup_dirty_tracking(info->iommu, 
info->domain,
+info->dev, PASID_RID2PASID,
+enable);
+   if (ret)
+   break;
+   }
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   /*
+* We need to flush context TLB and IOTLB with any cached translations
+* to force the incoming DMA requests for have its IOTLB entries tagged
+* with A/D bits
+*/
+   intel_flush_iotlb_all(domain);
+   return ret;
+}
+
+static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct device_domain_info *info;
+   unsigned long flags;
+   int ret = 0;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   list_for_each_entry(info, &dmar_domain->devices, link) {
+   if (!info->dev || (info->domain != dmar_domain))
+   continue;
+
+   /* Dirty tracking is second-stage level SM only */
+   if ((info->domain && domain_use_first_level(info->domain)) ||
+   !ecap_slads(info->iommu->ecap) ||
+   !sm_supported(info->iommu) || !intel_iommu_sm) {
+   ret = -EOPNOTSUPP;
+   continue;
+   }
+
+   if (!intel_pasid_dirty_tracking_enabled(info->iommu, 
info->domain,
+info->dev, PASID_RID2PASID)) {
+   ret = -EINVAL;
+   break;
+   }
+   }
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   return ret;
+}
+
+static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+ 

[PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support

2022-04-28 Thread Joao Martins
.read_and_clear_dirty() IOMMU domain op takes care of
reading the dirty bits (i.e. PTE has both DBM and AP[2] set)
and marshalling into a bitmap of a given page size.

While reading the dirty bits we also clear the PTE AP[2]
bit to mark it as writable-clean.

Structure it in a way that the IOPTE walker is generic,
and so we pass a function pointer over what to do on a per-PTE
basis. This is useful for a followup patch where we supply an
io-pgtable op to enable DBM when starting/stopping dirty tracking.

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Co-developed-by: Kunkun Jiang 
Signed-off-by: Kunkun Jiang 
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++
 drivers/iommu/io-pgtable-arm.c  | 102 +++-
 2 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4dba53bde2e3..232057d20197 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct iommu_domain 
*domain)
return ret;
 }
 
+static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
+unsigned long iova, size_t size,
+struct iommu_dirty_bitmap *dirty)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   int ret;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+   !(smmu->features & ARM_SMMU_FEAT_BBML2))
+   return -ENODEV;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->read_and_clear_dirty) {
+   pr_err_once("io-pgtable don't support dirty tracking\n");
+   return -ENODEV;
+   }
+
+   ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
+
+   return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
.iova_to_phys   = arm_smmu_iova_to_phys,
.enable_nesting = arm_smmu_enable_nesting,
.free   = arm_smmu_domain_free,
+   .read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94ff319ae8ac..3c99028d315a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -75,6 +75,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -84,7 +85,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK |\
 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -93,6 +94,9 @@
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV (((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY_BIT 7
+#define ARM_LPAE_PTE_AP_WRITABLE   (ARM_LPAE_PTE_AP_RDONLY | \
+ARM_LPAE_PTE_DBM)
 #define ARM_LPAE_PTE_ATTRINDX_SHIFT2
 #define ARM_LPAE_PTE_nG(((arm_lpae_iopte)1) << 11)
 
@@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
 }
 
+static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
+  arm_lpae_iopte *ptep, void *opaque)
+{
+   struct iommu_dirty_bitmap *dirty = opaque;
+   arm_lpae_iopte pte;
+
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (pte & ARM_LPAE_PTE_AP_WRITABLE)
+   return 0;
+
+   if (!(pte & ARM_LPAE_PTE_DBM))
+   return 0;
+
+   iommu_dirty_bitmap_record(dirty, iova, size);
+   set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
+   return 0;
+}
+
+static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
+

[PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
Mostly reuses unmap existing code with the extra addition of
marshalling into a bitmap of a page size. To tackle the race,
switch away from a plain store to a cmpxchg() and check whether
IOVA was dirtied or not once it succeeds.

Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +
 drivers/iommu/io-pgtable-arm.c  | 78 +
 2 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5f728f8f20a2..d1fb757056cc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain 
*domain, unsigned long io
return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
 }
 
+static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
+ unsigned long iova, size_t pgsize,
+ size_t pgcount,
+ struct iommu_iotlb_gather *gather,
+ struct iommu_dirty_bitmap *dirty)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+   if (!ops)
+   return 0;
+
+   return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
+  gather, dirty);
+}
+
 static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
 {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
.free   = arm_smmu_domain_free,
.read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
+   .unmap_pages_read_dirty = arm_smmu_unmap_pages_read_dirty,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 361410aa836c..143ee7d73f88 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, 
struct io_pgtable_cfg *cf
__arm_lpae_sync_pte(ptep, 1, cfg);
 }
 
+static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
+  struct io_pgtable_cfg *cfg)
+{
+   arm_lpae_iopte tmp;
+   bool dirty = false;
+
+   do {
+   tmp = cmpxchg64(ptep, *ptep, 0);
+   if ((tmp & ARM_LPAE_PTE_DBM) &&
+   !(tmp & ARM_LPAE_PTE_AP_RDONLY))
+   dirty = true;
+   } while (tmp);
+
+   if (!cfg->coherent_walk)
+   __arm_lpae_sync_pte(ptep, 1, cfg);
+
+   return dirty;
+}
+
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size, size_t pgcount,
-  int lvl, arm_lpae_iopte *ptep);
+  int lvl, arm_lpae_iopte *ptep,
+  struct iommu_dirty_bitmap *dirty);
 
 static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
phys_addr_t paddr, arm_lpae_iopte prot,
@@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
*data,
size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+   /*
+* No need for dirty bitmap as arm_lpae_init_pte() is
+* only called from __arm_lpae_map()
+*/
if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
-lvl, tblp) != sz) {
+lvl, tblp, NULL) != sz) {
WARN_ON(1);
return -EINVAL;
}
@@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size,
   arm_lpae_iopte blk_pte, int lvl,
-  arm_lpae_iopte *ptep, size_t pgcount)
+  arm_lpae_iopte *ptep, size_t pgcount,
+  struct iommu_dirty_bitmap *dirty)
 {
struct io_pgtable_cfg *cfg = &data->iop.cfg;
arm_lpae_iopte pte, *tablep;
@@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
return

[PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

2022-04-28 Thread Joao Martins
From: Kunkun Jiang 

As nested mode is not upstreamed now, we just aim to support dirty
log tracking for stage1 with io-pgtable mapping (means not support
SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
CD and transfer ARM_HD quirk to io-pgtable.

We additionally filter out HD|HA if not supportted. The CD.HD bit
is not particularly useful unless we toggle the DBM bit in the PTE
entries.

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
[joaomart:Convey HD|HA bits over to the context descriptor
 and update commit message]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 include/linux/io-pgtable.h  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1ca72fcca930..5f728f8f20a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 * this substream's traffic
 */
} else { /* (1) and (2) */
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   u64 tcr = cd->tcr;
+
cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
cdptr[2] = 0;
cdptr[3] = cpu_to_le64(cd->mair);
 
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   tcr &= ~CTXDESC_CD_0_TCR_HD;
+   if (!(smmu->features & ARM_SMMU_FEAT_HA))
+   tcr &= ~CTXDESC_CD_0_TCR_HA;
+
/*
 * STE is live, and the SMMU might read dwords of this CD in any
 * order. Ensure that it observes valid values before reading
@@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain,
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
 
+   if (smmu->features & ARM_SMMU_FEAT_HD)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
if (smmu->features & ARM_SMMU_FEAT_BBML1)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
else if (smmu->features & ARM_SMMU_FEAT_BBML2)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e15750be1d95..ff32242f2fdb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -292,6 +292,9 @@
 #define CTXDESC_CD_0_TCR_IPS   GENMASK_ULL(34, 32)
 #define CTXDESC_CD_0_TCR_TBI0  (1ULL << 38)
 
+#define CTXDESC_CD_0_TCR_HA(1UL << 43)
+#define CTXDESC_CD_0_TCR_HD(1UL << 42)
+
 #define CTXDESC_CD_0_AA64  (1UL << 41)
 #define CTXDESC_CD_0_S (1UL << 44)
 #define CTXDESC_CD_0_R (1UL << 45)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index d7626ca67dbf..a11902ae9cf1 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
#define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(7)
#define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(8)
+   #define IO_PGTABLE_QUIRK_ARM_HD BIT(9)
 
unsigned long   quirks;
unsigned long   pgsize_bitmap;
-- 
2.17.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-28 Thread Joao Martins
Similar to .read_and_clear_dirty() use the page table
walker helper functions and set DBM|RDONLY bit, thus
switching the IOPTE to writeable-clean.

Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 
 drivers/iommu/io-pgtable-arm.c  | 52 +
 2 files changed, 81 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 232057d20197..1ca72fcca930 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2769,6 +2769,34 @@ static int arm_smmu_read_and_clear_dirty(struct 
iommu_domain *domain,
return ret;
 }
 
+static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  struct iommu_iotlb_gather *iotlb_gather,
+  bool enabled)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   int ret;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+   !(smmu->features & ARM_SMMU_FEAT_BBML2))
+   return -ENODEV;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->set_dirty_tracking) {
+   pr_err_once("io-pgtable don't support dirty tracking\n");
+   return -ENODEV;
+   }
+
+   ret = ops->set_dirty_tracking(ops, iova, size, enabled);
+   iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
+
+   return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
.enable_nesting = arm_smmu_enable_nesting,
.free   = arm_smmu_domain_free,
.read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
+   .set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 3c99028d315a..361410aa836c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -76,6 +76,7 @@
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
+#define ARM_LPAE_PTE_DBM_BIT   51
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct 
io_pgtable_ops *ops,
 __arm_lpae_read_and_clear_dirty, dirty);
 }
 
+static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
+arm_lpae_iopte *ptep, void *opaque)
+{
+   bool enabled = *((bool *) opaque);
+   arm_lpae_iopte pte;
+
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_RDONLY)
+   return -EINVAL;
+
+   if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
+   return 0;
+
+   pte = enabled ? pte | (ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY) :
+   pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
+
+   WRITE_ONCE(*ptep, pte);
+   return 0;
+}
+
+
+static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
+  unsigned long iova, size_t size,
+  bool enabled)
+{
+   struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+   arm_lpae_iopte *ptep = data->pgd;
+   int lvl = data->start_level;
+   long iaext = (s64)iova >> cfg->ias;
+
+   if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+   return -EINVAL;
+
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+   iaext = ~iaext;
+   if (WARN_ON(iaext))
+   return -EINVAL;
+
+   if (data->iop.fmt != ARM_64_LPAE_S1 &&
+   data->iop.fmt != ARM_32_LPAE_S1)
+   return -EINVAL;
+
+   return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+__arm_lpae_set_dirty_modifier, &enabled);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
unsigned long granule, page_sizes;
@@ -917,6 +968,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
.unmap_pages= arm_lpa

[PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU

2022-04-28 Thread Joao Martins
From: Jean-Philippe Brucker 

If the SMMU supports it and the kernel was built with HTTU support,
Probe support for Hardware Translation Table Update (HTTU) which is
essentially to enable hardware update of access and dirty flags.

Probe and set the smmu::features for Hardware Dirty and Hardware Access
bits. This is in preparation, to enable it on the context descriptors of
stage 1 format.

Signed-off-by: Jean-Philippe Brucker 
[joaomart: Change commit message to reflect the underlying changes]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 
 2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fd49282c03a3..14609ece4e33 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,6 +3424,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device 
*smmu, bool bypass)
return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+   u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | 
ARM_SMMU_FEAT_HD);
+   u32 features = 0;
+
+   switch (FIELD_GET(IDR0_HTTU, reg)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   features |= ARM_SMMU_FEAT_HA;
+   }
+
+   if (smmu->dev->of_node)
+   smmu->features |= features;
+   else if (features != fw_features)
+   /* ACPI IORT sets the HTTU bits */
+   dev_warn(smmu->dev,
+"IDR0.HTTU overridden by FW configuration (0x%x)\n",
+fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
u32 reg;
@@ -3484,6 +3506,8 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_E2H;
}
 
+   arm_smmu_get_httu(smmu, reg);
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
@@ -3669,6 +3693,14 @@ static int arm_smmu_device_acpi_probe(struct 
platform_device *pdev,
if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+   switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   smmu->features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   smmu->features |= ARM_SMMU_FEAT_HA;
+   }
+
return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..1487a80fdf1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,9 @@
 #define IDR0_ASID16(1 << 12)
 #define IDR0_ATS   (1 << 10)
 #define IDR0_HYP   (1 << 9)
+#define IDR0_HTTU  GENMASK(7, 6)
+#define IDR0_HTTU_ACCESS   1
+#define IDR0_HTTU_ACCESS_DIRTY 2
 #define IDR0_COHACC(1 << 4)
 #define IDR0_TTF   GENMASK(3, 2)
 #define IDR0_TTF_AARCH64   2
@@ -639,6 +642,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_BTM  (1 << 16)
 #define ARM_SMMU_FEAT_SVA  (1 << 17)
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
+#define ARM_SMMU_FEAT_HA   (1 << 19)
+#define ARM_SMMU_FEAT_HD   (1 << 20)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
-- 
2.17.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported

2022-04-28 Thread Joao Martins
Print the feature, much like other kernel-supported features.

One can still probe its actual hw support via sysfs, regardless
of what the kernel does.

Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/init.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 27f2cf61d0c6..c410d127eb58 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1936,6 +1936,10 @@ static void print_iommu_info(void)
 
if (iommu->features & FEATURE_GAM_VAPIC)
pr_cont(" GA_vAPIC");
+   if (iommu->features & FEATURE_HASUP)
+   pr_cont(" HASup");
+   if (iommu->features & FEATURE_HDSUP)
+   pr_cont(" HDSup");
 
pr_cont("\n");
}
-- 
2.17.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-04-28 Thread Joao Martins
IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref[0]
"1.3.8 Enhanced Support for Access and Dirty Bits"

To enable it set the DTE flag in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every Memory Request or ATS translation request. It is on the VMM side
to steer whether to enable dirty tracking or not, rather than wrongly
doing in IOMMU. Relevant AMD IOMMU SDM ref [0], "Table 7. Device Table
Entry (DTE) Field Definitions" particularly the entry "HAD".

To actually toggle on and off it's relatively simple as it's setting
2 bits on DTE and flush the device DTE cache.

To get what's dirtied use existing AMD io-pgtable support, by walking
the pagetables over each IOVA, with fetch_pte().  The IOTLB flushing is
left to the caller (much like unmap), and iommu_dirty_bitmap_record() is
the one adding page-ranges to invalidate. This allows caller to batch
the flush over a big span of IOVA space, without the iommu wondering
about when to flush.

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how IOMMU hardware updates the dirty bit see,
and expects from its consequent clearing by CPU:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page
tables. The IOMMU interlocked operations to update A and D bits must be
64-bit operations and naturally aligned on a 64-bit boundary"

.. and for the IOMMU update sequence to Dirty bit, essentially is states:

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the
atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

Access/Dirty bits readout also need to consider the default
non-page-size (aka replicated PTEs as mentined by manual), as AMD
supports all powers of two page sizes (except 512G) even though the
underlying IOTLB mappings are restricted to the same ones as supported
by the CPU (4K, 2M, 1G). It makes one wonder whether AMD_IOMMU_PGSIZES
ought to avoid advertising non-default page sizes at all, when creating
an UNMANAGED DOMAIN, or when dirty tracking is toggling in.

Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/amd_iommu.h   |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 11 +
 drivers/iommu/amd/init.c|  8 ++-
 drivers/iommu/amd/io_pgtable.c  | 56 +
 drivers/iommu/amd/iommu.c   | 77 +
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..2f16ad8f7514 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -34,6 +34,7 @@ extern int amd_iommu_reenable(int);
 extern int amd_iommu_enable_faulting(void);
 extern int amd_iommu_guest_ir;
 extern enum io_pgtable_fmt amd_iommu_pgtable;
+extern bool amd_iommu_had_support;
 
 /* IOMMUv2 specific functions */
 struct iommu_domain;
diff --git a/drivers/iommu/amd/amd_iommu_types.h 
b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..c1eba8fce4bb 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,7 +93,9 @@
 #define FEATURE_HE (1ULL<<8)
 #define FEATURE_PC (1ULL<<9)
 #define FEATURE_GAM_VAPIC  (1ULL<<21)
+#define FEATURE_HASUP  (1ULL<<49)
 #define FEATURE_EPHSUP (1ULL<<50)
+#define FEATURE_HDSUP  (1ULL<<52)
 #define FEATURE_SNP(1ULL<<63)
 
 #define FEATURE_PASID_SHIFT32
@@ -197,6 +199,7 @@
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID 0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD   0x07
 #define DEV_ENTRY_PPR   0x34
 #define DEV_ENTRY_IR0x3d
 #define DEV_ENTRY_IW0x3e
@@ -350,10 +353,16 @@
 #define PTE_LEVEL_PAGE_SIZE(level) \
(1ULL << (12 + (9 * (level
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR (1ULL << 0)
+#define IOMMU_PTE_HD (1ULL << IOMMU_PTE_HD_BIT)
 #define IOMMU_PTE_U  (1ULL << 59)
 #define IOMMU_PTE_FC (1ULL << 60)
 #define IOMMU_PTE_IR (1ULL << 61)
@@ -364,6 +373,7 @@
  */
 #

[PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
AMD implementation of unmap_read_dirty() is pretty simple as
mostly reuses unmap code with the extra addition of marshalling
the dirty bit into the bitmap as it walks the to-be-unmapped
IOPTE.

Extra care is taken though, to switch over to cmpxchg as opposed
to a non-serialized store to the PTE and testing the dirty bit
only set until cmpxchg succeeds to set to 0.

Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/io_pgtable.c | 44 +-
 drivers/iommu/amd/iommu.c  | 22 +
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 8325ef193093..1868c3b58e6d 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct 
list_head *freelist)
free_sub_pt(pt, mode, freelist);
 }
 
+static bool free_pte_dirty(u64 *pte, u64 pteval)
+{
+   bool dirty = false;
+
+   while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
+   dirty = true;
+
+   return dirty;
+}
+
 /*
  * Generic mapping functions. It maps a physical address into a DMA
  * address space. It allocates the page table pages if necessary.
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, 
unsigned long iova,
return ret;
 }
 
-static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
- unsigned long iova,
- size_t size,
- struct iommu_iotlb_gather *gather)
+static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+  unsigned long iova,
+  size_t size,
+  struct iommu_iotlb_gather *gather,
+  struct iommu_dirty_bitmap *dirty)
 {
struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
unsigned long long unmapped;
@@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct 
io_pgtable_ops *ops,
while (unmapped < size) {
pte = fetch_pte(pgtable, iova, &unmap_size);
if (pte) {
-   int i, count;
+   unsigned long i, count;
+   bool pte_dirty = false;
 
count = PAGE_SIZE_PTE_COUNT(unmap_size);
for (i = 0; i < count; i++)
-   pte[i] = 0ULL;
+   pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
+
+   if (unlikely(pte_dirty && dirty))
+   iommu_dirty_bitmap_record(dirty, iova, 
unmap_size);
}
 
iova = (iova & ~(unmap_size - 1)) + unmap_size;
@@ -461,6 +476,22 @@ static unsigned long iommu_v1_unmap_page(struct 
io_pgtable_ops *ops,
return unmapped;
 }
 
+static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+unsigned long iova,
+size_t size,
+struct iommu_iotlb_gather *gather)
+{
+   return __iommu_v1_unmap_page(ops, iova, size, gather, NULL);
+}
+
+static unsigned long iommu_v1_unmap_page_read_dirty(struct io_pgtable_ops *ops,
+   unsigned long iova, size_t size,
+   struct iommu_iotlb_gather *gather,
+   struct iommu_dirty_bitmap *dirty)
+{
+   return __iommu_v1_unmap_page(ops, iova, size, gather, dirty);
+}
+
 static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned 
long iova)
 {
struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
@@ -575,6 +606,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct 
io_pgtable_cfg *cfg, void *coo
pgtable->iop.ops.unmap= iommu_v1_unmap_page;
pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
+   pgtable->iop.ops.unmap_read_dirty = iommu_v1_unmap_page_read_dirty;
 
return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0a86392b2367..a8fcb6e9a684 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
return r;
 }
 
+static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
+unsigned long iova, size_t page_size,
+struct iommu_iotlb_gather *gather,
+struct iommu_dirty_bitmap *dirty)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = 

[PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls

2022-04-28 Thread Joao Martins
Add a new test ioctl for simulating the dirty IOVAs
in the mock domain, and implement the mock iommu domain ops
that get the dirty tracking supported.

The selftest exercises the usual main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/iommufd_test.h|   9 ++
 drivers/iommu/iommufd/selftest.c| 137 +++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h 
b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
IOMMU_TEST_OP_MD_CHECK_REFS,
IOMMU_TEST_OP_ACCESS_PAGES,
IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+   IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
struct {
__u32 limit;
} memory_limit;
+   struct {
+   __u32 flags;
+   __aligned_u64 iova;
+   __aligned_u64 length;
+   __aligned_u64 page_size;
+   __aligned_u64 uptr;
+   __aligned_u64 out_nr_dirty;
+   } dirty;
};
__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+   MOCK_DIRTY_TRACK = 1,
MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
/*
@@ -25,9 +26,11 @@ enum {
_MOCK_PFN_START = MOCK_PFN_MASK + 1,
MOCK_PFN_START_IOVA = _MOCK_PFN_START,
MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+   MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+   unsigned long flags;
struct iommu_domain domain;
struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain 
*domain,
 
for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-   WARN_ON(!ent);
+
/*
 * iommufd generates unmaps that must be a strict
 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain 
*domain,
 * passed to map_pages
 */
if (first) {
-   WARN_ON(!(xa_to_value(ent) &
+   WARN_ON(ent && !(xa_to_value(ent) &
  MOCK_PFN_START_IOVA));
first = false;
}
if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-   WARN_ON(!(xa_to_value(ent) &
+   WARN_ON(ent && !(xa_to_value(ent) &
  MOCK_PFN_LAST_IOVA));
 
iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct 
iommu_domain *domain,
return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+ bool enable)
+{
+   struct mock_iommu_domain *mock =
+   container_of(domain, struct mock_iommu_domain, domain);
+   unsigned long flags = mock->flags;
+
+   /* No change? */
+   if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+   return -EINVAL;
+
+   flags = (enable ?
+flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+   mock->flags = flags;
+   return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   struct iommu_dirty_bitmap *dirty)
+{
+   struct mock_iommu_domain *mock =
+   container_of(domain, struct mock_iommu_domain, domain);
+   unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+   void *ent, *old;
+
+   if (!(mock->flags & MOCK_DIRTY_TRACK))
+   return -EINVAL;
+
+   for (i = 0; i < max; i++) {
+   unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+   ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+   if (ent &&
+   (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)

[PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility

2022-04-28 Thread Joao Martins
Add the correspondent APIs for performing VFIO dirty tracking,
particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
* VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
 the area @dirty_bitmap
* VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
the area @dirty_bitmap
* VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
tracking is active.

Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION
whereas it gets set the domain configured page size the same as
iopt::iova_alignment and maximum dirty bitmap size same
as VFIO. Compared to VFIO type1 iommu, the perpectual dirtying is
not implemented and userspace gets -EOPNOTSUPP which is handled by
today's userspace.

Move iommufd_get_pagesizes() definition prior to unmap for
iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
size against IOPT pagesize.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/vfio_compat.c | 221 ++--
 1 file changed, 209 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommufd/vfio_compat.c 
b/drivers/iommu/iommufd/vfio_compat.c
index dbe39404a105..2802f49cc10d 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -56,6 +56,16 @@ create_compat_ioas(struct iommufd_ctx *ictx)
return ioas;
 }
 
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+   /* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+* the high bits too, and we need to decide if we should report that
+* iommufd supports less than PAGE_SIZE alignment or stick to strict
+* compatibility. qemu only cares about the first set bit.
+*/
+   return ioas->iopt.iova_alignment;
+}
+
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
 {
struct iommu_vfio_ioas *cmd = ucmd->cmd;
@@ -130,9 +140,14 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx 
*ictx, unsigned int cmd,
  void __user *arg)
 {
size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
-   u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+   u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL |
+   VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+   struct iommufd_dirty_data dirty, *dirtyp = NULL;
struct vfio_iommu_type1_dma_unmap unmap;
+   struct vfio_bitmap bitmap;
struct iommufd_ioas *ioas;
+   unsigned long pgshift;
+   size_t pgsize;
int rc;
 
if (copy_from_user(&unmap, arg, minsz))
@@ -141,14 +156,53 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx 
*ictx, unsigned int cmd,
if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
return -EINVAL;
 
+   if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+   unsigned long npages;
+
+   if (copy_from_user(&bitmap,
+  (void __user *)(arg + minsz),
+  sizeof(bitmap)))
+   return -EFAULT;
+
+   if (!access_ok((void __user *)bitmap.data, bitmap.size))
+   return -EINVAL;
+
+   pgshift = __ffs(bitmap.pgsize);
+   npages = unmap.size >> pgshift;
+
+   if (!npages || !bitmap.size ||
+   (bitmap.size > DIRTY_BITMAP_SIZE_MAX) ||
+   (bitmap.size < dirty_bitmap_bytes(npages)))
+   return -EINVAL;
+
+   dirty.iova = unmap.iova;
+   dirty.length = unmap.size;
+   dirty.data = bitmap.data;
+   dirty.page_size = 1 << pgshift;
+   dirtyp = &dirty;
+   }
+
ioas = get_compat_ioas(ictx);
if (IS_ERR(ioas))
return PTR_ERR(ioas);
 
+   pgshift = __ffs(iommufd_get_pagesizes(ioas)),
+   pgsize = (size_t)1 << pgshift;
+
+   /* When dirty tracking is enabled, allow only min supported pgsize */
+   if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+   (bitmap.pgsize != pgsize)) {
+   rc = -EINVAL;
+   goto out_put;
+   }
+
if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
rc = iopt_unmap_all(&ioas->iopt);
else
-   rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
+   rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,
+dirtyp);
+
+out_put:
iommufd_put_object(&ioas->obj);
return rc;
 }
@@ -222,16 +276,6 @@ static int iommufd_vfio_set_iommu(struct iommufd_ctx 
*ictx, unsigned long type)
return 0;
 }
 
-static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
-{
-   /* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
-* the high bits too, and we need to decide if we should report that
-* iommufd s

[PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs

2022-04-28 Thread Joao Martins
Today, the dirty state is lost and the page wouldn't be migrated to
destination potentially leading the guest into error.

Add an unmap API that reads the dirty bit and sets it in the
user passed bitmap. This unmap iommu API tackles a potentially
racy update to the dirty bit *when* doing DMA on a iova that is
being unmapped at the same time.

The new unmap_read_dirty/unmap_pages_read_dirty does not replace
the unmap pages, but rather only when explicit called with an dirty
bitmap data passed in.

It could be said that the guest is buggy and rather than a special unmap
path tackling the theoretical race ... it would suffice fetching the
dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommu.c  | 43 +++---
 include/linux/io-pgtable.h | 10 +
 include/linux/iommu.h  | 12 +++
 3 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d18b9ddbcce4..cc04263709ee 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2289,12 +2289,25 @@ EXPORT_SYMBOL_GPL(iommu_map_atomic);
 
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
  unsigned long iova, size_t size,
- struct iommu_iotlb_gather *iotlb_gather)
+ struct iommu_iotlb_gather *iotlb_gather,
+ struct iommu_dirty_bitmap *dirty)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t pgsize, count;
 
pgsize = iommu_pgsize(domain, iova, iova, size, &count);
+
+   if (dirty) {
+   if (!ops->unmap_read_dirty && !ops->unmap_pages_read_dirty)
+   return 0;
+
+   return ops->unmap_pages_read_dirty ?
+  ops->unmap_pages_read_dirty(domain, iova, pgsize,
+  count, iotlb_gather, dirty) :
+  ops->unmap_read_dirty(domain, iova, pgsize,
+iotlb_gather, dirty);
+   }
+
return ops->unmap_pages ?
   ops->unmap_pages(domain, iova, pgsize, count, iotlb_gather) :
   ops->unmap(domain, iova, pgsize, iotlb_gather);
@@ -2302,7 +2315,8 @@ static size_t __iommu_unmap_pages(struct iommu_domain 
*domain,
 
 static size_t __iommu_unmap(struct iommu_domain *domain,
unsigned long iova, size_t size,
-   struct iommu_iotlb_gather *iotlb_gather)
+   struct iommu_iotlb_gather *iotlb_gather,
+   struct iommu_dirty_bitmap *dirty)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t unmapped_page, unmapped = 0;
@@ -2337,9 +2351,8 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 * or we hit an area that isn't mapped.
 */
while (unmapped < size) {
-   unmapped_page = __iommu_unmap_pages(domain, iova,
-   size - unmapped,
-   iotlb_gather);
+   unmapped_page = __iommu_unmap_pages(domain, iova, size - 
unmapped,
+   iotlb_gather, dirty);
if (!unmapped_page)
break;
 
@@ -2361,18 +2374,34 @@ size_t iommu_unmap(struct iommu_domain *domain,
size_t ret;
 
iommu_iotlb_gather_init(&iotlb_gather);
-   ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
+   ret = __iommu_unmap(domain, iova, size, &iotlb_gather, NULL);
iommu_iotlb_sync(domain, &iotlb_gather);
 
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
+size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+ unsigned long iova, size_t size,
+ struct iommu_dirty_bitmap *dirty)
+{
+   struct iommu_iotlb_gather iotlb_gather;
+   size_t ret;
+
+   iommu_iotlb_gather_init(&iotlb_gather);
+   ret = __iommu_unmap(domain, iova, size, &iotlb_gather, dirty);
+   iommu_iotlb_sync(domain, &iotlb_gather);
+
+   return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_unmap_read_dirty);
+
 size_t iommu_unmap_fast(struct iommu_domain *domain,
unsigned long iova, size_t size,
struct iommu_iotlb_gather *iotlb_gather)
 {
-   return __iommu_unmap(domain, iova, size, iotlb_gather);
+   return __iommu_unmap(domain, iova, size, iotlb_gather, NULL);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 82b39925c21f..c2ebfe037f5d 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -171,6 +171,16 @@ struct io_pgtable_ops {
int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
   

[PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable

2022-04-28 Thread Joao Martins
Every IOMMU driver should be able to implement the needed
iommu domain ops to perform dirty tracking.

Connect a hw_pagetable to the IOMMU core dirty tracking ops.
It exposes all of the functionality for the UAPI:

- Enable/Disable dirty tracking on an IOMMU domain (hw_pagetable id)
- Read the dirtied IOVAs (which clear IOMMU domain bitmap under the hood)
- Unmap and get the dirtied IOVAs

In doing so the previously internal iommufd_dirty_data structure is
moved over as the UAPI intermediate structure for representing iommufd
dirty bitmaps.

Contrary to past incantations the IOVA range to be scanned or unmap is
tied in to the bitmap size, and thus puts the heavy lifting in the
application to make sure it passes a precisedly sized bitmap address as
opposed to allowing base_iova != iova, which simplifies things further.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/hw_pagetable.c| 79 +
 drivers/iommu/iommufd/ioas.c| 33 +++
 drivers/iommu/iommufd/iommufd_private.h | 22 ---
 drivers/iommu/iommufd/main.c|  9 +++
 include/uapi/linux/iommufd.h| 78 
 5 files changed, 214 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c 
b/drivers/iommu/iommufd/hw_pagetable.c
index bafd7d07918b..943bcc3898a4 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include 
+#include 
 
 #include "iommufd_private.h"
 
@@ -140,3 +141,81 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
}
iommufd_object_destroy_user(ictx, &hwpt->obj);
 }
+
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_hwpt_set_dirty *cmd = ucmd->cmd;
+   struct iommufd_hw_pagetable *hwpt;
+   struct iommufd_ioas *ioas;
+   int rc = -EOPNOTSUPP;
+   bool enable;
+
+   hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+   if (IS_ERR(hwpt))
+   return PTR_ERR(hwpt);
+
+   ioas = hwpt->ioas;
+   enable = cmd->flags & IOMMU_DIRTY_TRACKING_ENABLED;
+
+   rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, enable);
+
+   iommufd_put_object(&hwpt->obj);
+   return rc;
+}
+
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+struct iommufd_dirty_data *bitmap)
+{
+   unsigned long pgshift, npages;
+   size_t iommu_pgsize;
+   int rc = -EINVAL;
+   u64 bitmap_size;
+
+   pgshift = __ffs(bitmap->page_size);
+   npages = bitmap->length >> pgshift;
+   bitmap_size = dirty_bitmap_bytes(npages);
+
+   if (!npages || (bitmap_size > DIRTY_BITMAP_SIZE_MAX))
+   return rc;
+
+   if (!access_ok((void __user *) bitmap->data, bitmap_size))
+   return rc;
+
+   iommu_pgsize = 1 << __ffs(ioas->iopt.iova_alignment);
+
+   /* allow only smallest supported pgsize */
+   if (bitmap->page_size != iommu_pgsize)
+   return rc;
+
+   if (bitmap->iova & (iommu_pgsize - 1))
+   return rc;
+
+   if (!bitmap->length || bitmap->length & (iommu_pgsize - 1))
+   return rc;
+
+   return 0;
+}
+
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_hwpt_get_dirty_iova *cmd = ucmd->cmd;
+   struct iommufd_hw_pagetable *hwpt;
+   struct iommufd_ioas *ioas;
+   int rc = -EOPNOTSUPP;
+
+   hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+   if (IS_ERR(hwpt))
+   return PTR_ERR(hwpt);
+
+   ioas = hwpt->ioas;
+   rc = iommufd_check_iova_range(ioas, &cmd->bitmap);
+   if (rc)
+   goto out_put;
+
+   rc = iopt_read_and_clear_dirty_data(&ioas->iopt, hwpt->domain,
+   &cmd->bitmap);
+
+out_put:
+   iommufd_put_object(&hwpt->obj);
+   return rc;
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 19d6591aa005..50bef46bc0bb 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,6 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
rc = -EOVERFLOW;
goto out_put;
}
+
rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
}
 
@@ -250,3 +251,35 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
iommufd_put_object(&ioas->obj);
return rc;
 }
+
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_unmap_dirty *cmd = ucmd->cmd;
+   struct iommufd_dirty_data *bitmap;
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   /* The bitmaps would be gigantic */
+   bitmap = &cmd->bitmap;
+   if (bitmap->iova == 0 && bitmap->length == U

[PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking

2022-04-28 Thread Joao Martins
Add to iommu domain operations a set of callbacks to
perform dirty tracking, particulary to start and stop
tracking and finally to test and clear the dirty data.

Drivers are expected to dynamically change its hw protection
domain bits to toggle the tracking and flush some form of
control state structure that stands in the IOVA translation
path.

For reading and clearing dirty data, in all IOMMUs a transition
from any of the PTE access bits (Access, Dirty) implies flushing
the IOTLB to invalidate any stale data in the IOTLB as to whether
or not the IOMMU should update the said PTEs. The iommu core APIs
introduce a new structure for storing the dirties, albeit vendor
IOMMUs implementing .read_and_clear_dirty() just use
iommu_dirty_bitmap_record() to set the memory storing dirties.
The underlying tracking/iteration of user bitmap memory is instead
done by iommufd which takes care of initializing the dirty bitmap
*prior* to passing to the IOMMU domain op.

So far for currently/to-be-supported IOMMUs with dirty tracking
support this particularly because the tracking is part of
first stage tables and part of address translation. Below
it is mentioned how hardware deal with the hardware protection
domain control bits, to justify the added iommu core APIs.
vendor IOMMU implementation will also explain in more detail on
the dirty bit usage/clearing in the IOPTEs.

* x86 AMD:

The same thing for AMD particularly the Device Table
respectivally, followed by flushing the Device IOTLB. On AMD[1],
section "2.2.1 Updating Shared Tables", e.g.

> Each table can also have its contents cached by the IOMMU or
peripheral IOTLBs. Therefore, after
updating a table entry that can be cached, system software must
send the IOMMU an appropriate
invalidate command. Information in the peripheral IOTLBs must
also be invalidated.

There's no mention of particular bits that are cached or
not but fetching a dev entry is part of address translation
as also depicted, so invalidate the device table to make
sure the next translations fetch a DTE entry with the HD bits set.

* x86 Intel (rev3.0+):

Likewise[2] set the SSADE bit in the scalable-entry second stage table
to enable Access/Dirty bits in the second stage page table. See manual,
particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
Considerations"

> When modifying root-entries, scalable-mode root-entries,
context-entries, or scalable-mode context
entries:
> Software must serially invalidate the context-cache,
PASID-cache (if applicable), and the IOTLB.  The serialization is
required since hardware may utilize information from the
context-caches (e.g., Domain-ID) to tag new entries inserted to
the PASID-cache and IOTLB for processing in-flight requests.
Section 6.5 describe the invalidation operations.

And also the whole chapter "" Table "Table 23.  Guidance to
Software for Invalidations" in "6.5.3.3 Guidance to Software for
Invalidations" explicitly mentions

> SSADE transition from 0 to 1 in a scalable-mode PASID-table
entry with PGTT value of Second-stage or Nested

* ARM SMMUV3.2:

SMMUv3.2 needs to toggle the dirty bit descriptor
over the CD (or S2CD) for toggling and flush/invalidate
the IOMMU dev IOTLB.

Reference[0]: SMMU spec, "5.4.1 CD notes",

> The following CD fields are permitted to be cached as part of a
translation or TLB entry, and alteration requires
invalidation of any TLB entry that might have cached these
fields, in addition to CD structure cache invalidation:

...
HA, HD
...

Although, The ARM SMMUv3 case is a tad different that its x86
counterparts. Rather than changing *only* the IOMMU domain device entry to
enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
ARM instead uses a dirty-bit modifier which is separately enabled, and
changes the *existing* meaning of access bits (for ro/rw), to the point
that marking access bit read-only but with dirty-bit-modifier enabled
doesn't trigger an perm io page fault.

In pratice this means that changing iommu context isn't enough
and in fact mostly useless IIUC (and can be always enabled). Dirtying
is only really enabled when the DBM pte bit is enabled (with the
CD.HD bit as a prereq).

To capture this h/w construct an iommu core API is added which enables
dirty tracking on an IOVA range rather than a device/context entry.
iommufd picks one or the other, and IOMMUFD core will favour
device-context op followed by IOVA-range alternative.

[0] https://developer.arm.com/documentation/ihi0070/latest
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://cdrdv2.intel.com/v1/dl/getContent/671081

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommu.c  | 28 
 include/linux/io-pgtable.h |  6 +
 include/linux/iommu.h  | 52 ++
 3 files changed, 86 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0c42ece25854..d18b9ddbcce4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/i

[PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-04-28 Thread Joao Martins
Presented herewith is a series that extends IOMMUFD to have IOMMU
hardware support for dirty bit in the IOPTEs.

Today, AMD Milan (which been out for a year now) supports it while ARM
SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
The intended use-case is to support Live Migration with SR-IOV, with IOMMUs
that support it. Yishai Hadas will be soon submiting an RFC that covers the
PCI device dirty tracker via vfio.

At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
set of a IOVA ranges mapped to some physical memory composing an IO
pagetable. This is then attached to a particular device, consequently
creating the protection domain to share a common IO page table
representing the endporint DMA-addressable guest address space.
(Hopefully I am not twisting the terminology here) The resultant object
is an hw_pagetable object which represents the iommu_domain
object that will be directly manipulated. For more background on
IOMMUFD have a look at these two series[0][1] on the kernel and qemu
consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
kAPI is then extended to provide:

 1) Enabling or disabling dirty tracking on the iommu_domain. Model
as the most common case of changing hardware protection domain control
bits, and ARM specific case of having to enable the per-PTE DBM control
bit. The 'real' tracking of whether dirty tracking is enabled or not is
stored in the vendor IOMMU, hence no new fields are added to iommufd
pagetable structures.

 2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
thus describe the IOVAs that got written by the device. While performing
the marshalling also vendors need to clear the dirty bits from IOPTE and
allow the kAPI caller to batch the much needed IOTLB flush.
There's no copy of bitmaps to userspace backed memory, all is zerocopy
based. So far this is a test-and-clear kind of interface given that the
IOPT walk is going to be expensive. It occured to me to separate
the readout of dirty, and the clearing of dirty from IOPTEs.
I haven't opted for that one, given that it would mean two lenghty IOPTE
walks and felt counter-performant.

 3) Unmapping an IOVA range while returning its dirty bit prior to
unmap. This case is specific for non-nested vIOMMU case where an
erronous guest (or device) DMAing to an address being unmapped at the
same time.

[See at the end too, on general remarks, specifically the one regarding
 probing dirty tracking via a dedicated iommufd cap ioctl]

The series is organized as follows:

* Patches 1-3: Takes care of the iommu domain operations to be added and
extends iommufd io-pagetable to set/clear dirty tracking, as well as
reading the dirty bits from the vendor pagetables. The idea is to abstract
iommu vendors from any idea of how bitmaps are stored or propagated back to
the caller, as well as allowing control/batching over IOTLB flush. So
there's a data structure and an helper that only tells the upper layer that
an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking
the bitmap user memory, and kmap-ing them as needed. IOMMU vendor just has
an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
vendor IOMMU implementor.

* Patches 4-5: Adds the new unmap domain op that returns whether the IOVA
got dirtied. I separated this from the rest of the set, as I am still
questioning the need for this API and whether this race needs to be
fundamentally be handled. I guess the thinking is that live-migration
should be guest foolproof, but how much the race happens in pratice to
deem this as a necessary unmap variant. Perhaps maybe it might be enough
fetching dirty bits prior to the unmap? Feedback appreciated.

* Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
We should discuss whether to include the vfio-compat or not. Given how
vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
with the IOMMU hw support. I haven't implemented the perpectual dirtying
given his lack of usefullness over an IOMMU-backed implementation (or so
I think). The selftests, test mainly the principal workflow, still needs
to get added more corner cases.

Note: Given that there's no capability for new APIs, or page sizes or
etc, the userspace app using IOMMUFD native API would gather -EOPNOTSUPP
when dirty tracking is not supported by the IOMMU hardware.

For completeness and most importantly to make sure the new IOMMU core ops
capture the hardware blocks, all the IOMMUs that will eventually get IOMMU A/D
support were implemented. So the next half of the series presents *proof of
concept* implementations for IOMMUs:

* Patches 9-11: AMD IOMMU implementation, particularly on those having
HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
and also on a AMD Milan server IOMMU.

* Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
to do the dynamic set/clear dirty tracking, and immplicitly clearing
dirt

[PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()

2022-04-28 Thread Joao Martins
Add an argument to the kAPI that unmaps an IOVA from the attached
domains, to also receive a bitmap.

When passed an iommufd_dirty_data::bitmap we call out the special
dirty unmap (iommu_unmap_read_dirty()). The bitmap data is
iterated, similarly, like the read_and_clear_dirty() in IOVA
chunks using the previously added iommufd_dirty_iter* helper
functions.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 13 ++--
 drivers/iommu/iommufd/io_pagetable.h|  3 +-
 drivers/iommu/iommufd/ioas.c|  2 +-
 drivers/iommu/iommufd/iommufd_private.h |  4 +-
 drivers/iommu/iommufd/pages.c   | 79 +
 drivers/iommu/iommufd/vfio_compat.c |  2 +-
 6 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c 
b/drivers/iommu/iommufd/io_pagetable.c
index 835b5040fce9..6f4117c629d4 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -542,13 +542,14 @@ struct iopt_pages *iopt_get_pages(struct io_pagetable 
*iopt, unsigned long iova,
 }
 
 static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
-struct iopt_pages *pages)
+struct iopt_pages *pages,
+struct iommufd_dirty_data *bitmap)
 {
/* Drivers have to unpin on notification. */
if (WARN_ON(atomic_read(&area->num_users)))
return -EBUSY;
 
-   iopt_area_unfill_domains(area, pages);
+   iopt_area_unfill_domains(area, pages, bitmap);
WARN_ON(atomic_read(&area->num_users));
iopt_abort_area(area);
iopt_put_pages(pages);
@@ -560,12 +561,13 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, 
struct iopt_area *area,
  * @iopt: io_pagetable to act on
  * @iova: Starting iova to unmap
  * @length: Number of bytes to unmap
+ * @bitmap: Bitmap of dirtied IOVAs
  *
  * The requested range must exactly match an existing range.
  * Splitting/truncating IOVA mappings is not allowed.
  */
 int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-   unsigned long length)
+   unsigned long length, struct iommufd_dirty_data *bitmap)
 {
struct iopt_pages *pages;
struct iopt_area *area;
@@ -590,7 +592,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned 
long iova,
area->pages = NULL;
up_write(&iopt->iova_rwsem);
 
-   rc = __iopt_unmap_iova(iopt, area, pages);
+   rc = __iopt_unmap_iova(iopt, area, pages, bitmap);
+
up_read(&iopt->domains_rwsem);
return rc;
 }
@@ -614,7 +617,7 @@ int iopt_unmap_all(struct io_pagetable *iopt)
area->pages = NULL;
up_write(&iopt->iova_rwsem);
 
-   rc = __iopt_unmap_iova(iopt, area, pages);
+   rc = __iopt_unmap_iova(iopt, area, pages, NULL);
if (rc)
goto out_unlock_domains;
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h 
b/drivers/iommu/iommufd/io_pagetable.h
index c8b6a60ff24c..c8baab25ab08 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -48,7 +48,8 @@ struct iopt_area {
 };
 
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages 
*pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+ struct iommufd_dirty_data *bitmap);
 
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84b..19d6591aa005 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,7 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
rc = -EOVERFLOW;
goto out_put;
}
-   rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+   rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
}
 
 out_put:
diff --git a/drivers/iommu/iommufd/iommufd_private.h 
b/drivers/iommu/iommufd/iommufd_private.h
index 4c12b4a8f1a6..3e3a97f623a1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -47,8 +47,6 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned 
long *iova,
 int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
   unsigned long *dst_iova, unsigned long start_byte,
   unsigned long length, int iommu_prot, unsigned int flags);
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-   unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
 struct iommufd_dirty_data {
@@ -63,

[PATCH RFC 03/19] iommufd: Dirty tracking data support

2022-04-28 Thread Joao Martins
Add an IO pagetable API iopt_read_and_clear_dirty_data() that
performs the reading of dirty IOPTEs for a given IOVA range and
then copying back to userspace from each area-internal bitmap.

Underneath it uses the IOMMU equivalent API which will read the
dirty bits, as well as atomically clearing the IOPTE dirty bit
and flushing the IOTLB at the end. The dirty bitmaps pass an
iotlb_gather to allow batching the dirty-bit updates.

Most of the complexity, though, is in the handling of the user
bitmaps to avoid copies back and forth. The bitmap user addresses
need to be iterated through, pinned and then passing the pages
into iommu core. The amount of bitmap data passed at a time for a
read_and_clear_dirty() is 1 page worth of pinned base page
pointers. That equates to 16M bits, or rather 64G of data that
can be returned as 'dirtied'. The flush the IOTLB at the end of
the whole scanned IOVA range, to defer as much as possible the
potential DMA performance penalty.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 169 
 drivers/iommu/iommufd/iommufd_private.h |  44 ++
 2 files changed, 213 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c 
b/drivers/iommu/iommufd/io_pagetable.c
index f4609ef369e0..835b5040fce9 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "io_pagetable.h"
 
@@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
return ret;
 }
 
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+   struct iommufd_dirty_data *bitmap)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+   unsigned long bitmap_len;
+
+   bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
+
+   import_single_range(WRITE, bitmap->data, bitmap_len,
+   &iter->bitmap_iov, &iter->bitmap_iter);
+   iter->iova = bitmap->iova;
+
+   /* Can record up to 64G at a time */
+   dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
+
+   return !dirty->pages ? -ENOMEM : 0;
+}
+
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+   if (dirty->pages) {
+   free_page((unsigned long) dirty->pages);
+   dirty->pages = NULL;
+   }
+}
+
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
+{
+   return iov_iter_count(&iter->bitmap_iter) > 0;
+}
+
+static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter 
*iter)
+{
+   unsigned long left = iter->bitmap_iter.count - 
iter->bitmap_iter.iov_offset;
+
+   left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
+
+   return left;
+}
+
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
+{
+   unsigned long left = iommufd_dirty_iter_bytes(iter);
+
+   return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
+}
+
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
+{
+   unsigned long skip = iter->bitmap_iter.iov_offset;
+
+   return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
+}
+
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
+{
+   iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
+}
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+   if (dirty->npages)
+   unpin_user_pages(dirty->pages, dirty->npages);
+}
+
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+   unsigned long npages;
+   unsigned long ret;
+   void *addr;
+
+   addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
+   npages = iov_iter_npages(&iter->bitmap_iter,
+PAGE_SIZE / sizeof(struct page *));
+
+   ret = pin_user_pages_fast((unsigned long) addr, npages,
+ FOLL_WRITE, dirty->pages);
+   if (ret <= 0)
+   return -EINVAL;
+
+   dirty->npages = ret;
+   dirty->iova = iommufd_dirty_iova(iter);
+   dirty->start_offset = offset_in_page(addr);
+   return 0;
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+ struct iommufd_dirty_data *bitmap)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   struct iommu_iotlb_gather gather;
+   struct iommufd_dirty_iter iter;
+   int ret = 0;
+
+   if (!ops || !ops->read_and_clear_dirty)
+   return -EOPNOTSUPP;
+
+   iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
+   __ffs(bitmap->page_size), &gather);
+   ret = iommufd_dirty_iter_init(&iter, bitmap);
+   if (ret)

[PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable

2022-04-28 Thread Joao Martins
Add an io_pagetable kernel API to toggle dirty tracking:

* iopt_set_dirty_tracking(iopt, [domain], state)

It receives either NULL (which means all domains) or an
iommu_domain. The intended caller of this is via the hw_pagetable
object that is created on device attach, which passes an
iommu_domain. For now, the all-domains is left for vfio-compat.

The hw protection domain dirty control is favored over the IOVA-range
alternative. For the latter, it iterates over all IOVA areas and calls
iommu domain op to enable/disable for the range.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 71 +
 drivers/iommu/iommufd/iommufd_private.h |  3 ++
 2 files changed, 74 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c 
b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..f4609ef369e0 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt, 
unsigned long *iova,
return 0;
 }
 
+static int __set_dirty_tracking_range_locked(struct iommu_domain *domain,
+struct io_pagetable *iopt,
+bool enable)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   struct iommu_iotlb_gather gather;
+   struct iopt_area *area;
+   int ret = -EOPNOTSUPP;
+   unsigned long iova;
+   size_t size;
+
+   iommu_iotlb_gather_init(&gather);
+
+   for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+   iova = iopt_area_iova(area);
+   size = iopt_area_last_iova(area) - iova;
+
+   if (ops->set_dirty_tracking_range) {
+   ret = ops->set_dirty_tracking_range(domain, iova,
+   size, &gather,
+   enable);
+   if (ret < 0)
+   break;
+   }
+   }
+
+   iommu_iotlb_sync(domain, &gather);
+
+   return ret;
+}
+
+static int iommu_set_dirty_tracking(struct iommu_domain *domain,
+   struct io_pagetable *iopt, bool enable)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   int ret = -EOPNOTSUPP;
+
+   if (ops->set_dirty_tracking)
+   ret = ops->set_dirty_tracking(domain, enable);
+   else if (ops->set_dirty_tracking_range)
+   ret = __set_dirty_tracking_range_locked(domain, iopt,
+   enable);
+
+   return ret;
+}
+
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+   struct iommu_domain *domain, bool enable)
+{
+   struct iommu_domain *dom;
+   unsigned long index;
+   int ret = -EOPNOTSUPP;
+
+   down_write(&iopt->iova_rwsem);
+   if (!domain) {
+   down_write(&iopt->domains_rwsem);
+   xa_for_each(&iopt->domains, index, dom) {
+   ret = iommu_set_dirty_tracking(dom, iopt, enable);
+   if (ret < 0)
+   break;
+   }
+   up_write(&iopt->domains_rwsem);
+   } else {
+   ret = iommu_set_dirty_tracking(domain, iopt, enable);
+   }
+
+   up_write(&iopt->iova_rwsem);
+   return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long 
iova,
  unsigned long *start_byte,
  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h 
b/drivers/iommu/iommufd/iommufd_private.h
index f55654278ac4..d00ef3b785c5 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long 
iova,
unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+   struct iommu_domain *domain, bool enable);
+
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
  unsigned long npages, struct page **out_pages, bool 
write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
-- 
2.17.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2] iommu/sva: Fix PASID use-after-free issue

2022-04-28 Thread Fenghua Yu
The PASID is being freed too early.  It needs to stay around until after
device drivers that might be using it have had a chance to clear it out
of the hardware.

As a reminder:

mmget() /mmput()  refcount the mm's address space
mmgrab()/mmdrop() refcount the mm itself

The PASID is currently tied to the life of the mm's address space and
freed in __mmput().  This makes logical sense because the PASID can't be
used once the address space is gone.

But, this misses an important point: even after the address space is
gone, the PASID will still be programmed into a device.  Device drivers
might, for instance, still need to flush operations that are outstanding
and need to use that PASID.  They do this at file->release() time.

Device drivers call the IOMMU driver to hold a reference on the mm itself
and drop it at file->release() time.  But, the IOMMU driver holds a
reference on the mm itself, not the address space.  The address space
(and the PASID) is long gone by the time the driver tries to clean up.
This is effectively a use-after-free bug on the PASID.

To fix this, move the PASID free operation from __mmput() to __mmdrop().
This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
allocated until it drops its mm reference.

Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and 
free it on mm exit")

Reported-by: Zhangfei Gao 
Tested-by: Zhangfei Gao 
Suggested-by: Jean-Philippe Brucker 
Suggested-by: Jacob Pan 
Reviewed-by: Jean-Philippe Brucker 
Signed-off-by: Fenghua Yu 
---

v2:
- Dave Hansen rewrites the change log.
- Add Tested-by: Zhangfei Gao 
- Add Reviewed-by: Jean-Philippe Brucker 

The original patch was posted and discussed in:
https://lore.kernel.org/lkml/ymdzffx7fn586...@fyu1.sc.intel.com/

 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..35a3beff140b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
mmu_notifier_subscriptions_destroy(mm);
check_mm(mm);
put_user_ns(mm->user_ns);
+   mm_pasid_drop(mm);
free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
-   mm_pasid_drop(mm);
mmdrop(mm);
 }
 
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/amd: Set translation valid bit only when IO page tables are in used

2022-04-28 Thread Suravee Suthikulpanit via iommu




On 4/20/22 6:29 PM, Suravee Suthikulpanit wrote:

On AMD system with SNP enabled, IOMMU hardware checks the host translation
valid (TV) and guest translation valid (GV) bits in the device
table entry (DTE) before accessing the corresponded page tables.

However, current IOMMU driver sets the TV bit for all devices
regardless of whether the host page table is in used.
This results in ILLEGAL_DEV_TABLE_ENTRY event for devices, which
do not the host page table root pointer set up.

Thefore, only set TV bit when host or guest page tables are in used.

Signed-off-by: Suravee Suthikulpanit 



I found a bug in this patch. I will send out v2 with the fix.

Regards,
Suravee
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Robin Murphy

On 2022-04-28 17:02, Andi Kleen wrote:


On 4/28/2022 8:07 AM, Robin Murphy wrote:

On 2022-04-28 15:55, Andi Kleen wrote:


On 4/28/2022 7:45 AM, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
Rather than introduce this extra level of allocator complexity, how 
about
just dividing up the initial SWIOTLB allocation into multiple 
io_tlb_mem

instances?
Yeah.  We're almost done removing all knowledge of swiotlb from 
drivers,

so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.


At least for TDX need parallelism with a single device for performance.

So if you split up the io tlb mems for a device then you would need a 
new mechanism to load balance the requests for single device over 
those. I doubt it would be any simpler.


Eh, I think it would be, since the round-robin retry loop can then 
just sit around the existing io_tlb_mem-based allocator, vs. the churn 
of inserting it in the middle, plus it's then really easy to 
statically distribute different starting points across different 
devices via dev->dma_io_tlb_mem if we wanted to.


Admittedly the overall patch probably ends up about the same size, 
since it likely pushes a bit more complexity into swiotlb_init to 
compensate, but that's still a trade-off I like.


Unless you completely break the external API this will require a new 
mechanism to search a list of io_tlb_mems for the right area to free into.


If the memory area not contiguous (like in the original patch) this will 
be a O(n) operation on the number of io_tlb_mems, so it would get more 
and more expensive on larger systems. Or you merge them all together (so 
that the simple address arithmetic to look up the area works again), 
which will require even more changes in the setup. Or you add hashing or 
similar which will be even more complicated.


In the end doing it with a single io_tlb_mem is significantly simpler 
and also more natural.


Sorry if "dividing up the initial SWIOTLB allocation" somehow sounded 
like "making multiple separate SWIOTLB allocations all over the place"?


I don't see there being any *functional* difference in whether a slice 
of the overall SWIOTLB memory is represented by 
"io_tlb_default_mem->areas[i]->blah" or "io_tlb_default_mem[i]->blah", 
I'm simply advocating for not churning the already-complex allocator 
internals by pushing the new complexity out to the margins instead.


Thanks,
Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-04-28 Thread Dave Hansen
On 4/28/22 09:01, Jean-Philippe Brucker wrote:
>> But, this misses an important point: even after the address space is
>> gone, the PASID will still be programmed into a device.  Device drivers
>> might, for instance, still need to flush operations that are outstanding
>> and need to use that PASID.  They do this at ->release() time.
> It's not really clear which release() this is. For us it's file descriptor
> release() (not MMU notifier release(), which is how I initially understood
> this sentence)

OK, maybe that should be: "file->release() time" to make it more clear.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Andi Kleen


On 4/28/2022 8:07 AM, Robin Murphy wrote:

On 2022-04-28 15:55, Andi Kleen wrote:


On 4/28/2022 7:45 AM, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
Rather than introduce this extra level of allocator complexity, how 
about
just dividing up the initial SWIOTLB allocation into multiple 
io_tlb_mem

instances?
Yeah.  We're almost done removing all knowledge of swiotlb from 
drivers,

so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.


At least for TDX need parallelism with a single device for performance.

So if you split up the io tlb mems for a device then you would need a 
new mechanism to load balance the requests for single device over 
those. I doubt it would be any simpler.


Eh, I think it would be, since the round-robin retry loop can then 
just sit around the existing io_tlb_mem-based allocator, vs. the churn 
of inserting it in the middle, plus it's then really easy to 
statically distribute different starting points across different 
devices via dev->dma_io_tlb_mem if we wanted to.


Admittedly the overall patch probably ends up about the same size, 
since it likely pushes a bit more complexity into swiotlb_init to 
compensate, but that's still a trade-off I like.


Unless you completely break the external API this will require a new 
mechanism to search a list of io_tlb_mems for the right area to free into.


If the memory area not contiguous (like in the original patch) this will 
be a O(n) operation on the number of io_tlb_mems, so it would get more 
and more expensive on larger systems. Or you merge them all together (so 
that the simple address arithmetic to look up the area works again), 
which will require even more changes in the setup. Or you add hashing or 
similar which will be even more complicated.


In the end doing it with a single io_tlb_mem is significantly simpler 
and also more natural.


-Andi


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-04-28 Thread Jean-Philippe Brucker
On Thu, Apr 28, 2022 at 08:09:04AM -0700, Dave Hansen wrote:
> On 4/25/22 21:20, Fenghua Yu wrote:
> >>From 84aa68f6174439d863c40cdc2db0e1b89d620dd0 Mon Sep 17 00:00:00 2001
> > From: Fenghua Yu 
> > Date: Fri, 15 Apr 2022 00:51:33 -0700
> > Subject: [PATCH] iommu/sva: Fix PASID use-after-free issue
> > 
> > A PASID might be still used on ARM after it is freed in __mmput().
> 
> Is it really just ARM?
> 
> > process:
> > open()->sva_bind()->ioasid_alloc() = N; // Get PASID N for the mm
> > exit();
> > exit_mm()->__mmput()->mm_pasid_drop()->mm->pasid = -1; // PASID -1
> > exit_files()->release(dev)->sva_unbind()->use mm->pasid; // Failure
> > 
> > To avoid the use-after-free issue, free the PASID after no device uses it,
> > i.e. after all devices are unbound from the mm.
> > 
> > sva_bind()/sva_unbind() call mmgrab()/mmdrop() to track mm->mm_count.
> > __mmdrop() is called only after mm->mm_count is zero. So freeing the PASID
> > in __mmdrop() guarantees the PASID is safely freed only after no device
> > is bound to the mm.
> 
> Does this changelog work for everyone?
> 
> ==
> 
> tl;dr: The PASID is being freed too early.  It needs to stay around
> until after device drivers that might be using it have had a chance to
> clear it out of the hardware.
> 
> --
> 
> As a reminder:
> 
> mmget() /mmput()  refcount the mm's address space
> mmgrab()/mmdrop() refcount the mm itself
> 
> The PASID is currently tied to the life of the mm's address space and
> freed in __mmput().  This makes logical sense because the PASID can't be
> used once the address space is gone.
> 
> But, this misses an important point: even after the address space is
> gone, the PASID will still be programmed into a device.  Device drivers
> might, for instance, still need to flush operations that are outstanding
> and need to use that PASID.  They do this at ->release() time.

It's not really clear which release() this is. For us it's file descriptor
release() (not MMU notifier release(), which is how I initially understood
this sentence)

> 
> Device drivers hold a reference on the mm itself and drop it at
> ->release() time.  But, the device driver holds a reference mm itself,

"to the mm"

(To be pendantic it's the IOMMU driver that holds this reference, and
the device driver calls the IOMMU driver to release it, but the idea is
the same.)

> not the address space.  The address space (and the PASID) is long gone
> by the time the driver tries to clean up.  This is effectively a
> use-after-free bug on the PASID.
> 
> To fix this, move the PASID free operation from __mmput() to __mmdrop().
>  This ensures that the device drivers' existing mmgrab() keeps the PASID
> allocated until they drop their mm reference.

Good changelog otherwise

Thanks,
Jean

> 
> >  kernel/fork.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9796897560ab..35a3beff140b 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
> > mmu_notifier_subscriptions_destroy(mm);
> > check_mm(mm);
> > put_user_ns(mm->user_ns);
> > +   mm_pasid_drop(mm);
> > free_mm(mm);
> >  }
> >  EXPORT_SYMBOL_GPL(__mmdrop);
> > @@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
> > }
> > if (mm->binfmt)
> > module_put(mm->binfmt->module);
> > -   mm_pasid_drop(mm);
> > mmdrop(mm);
> >  }
> >  
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Tianyu Lan

On 4/28/2022 10:44 PM, Robin Murphy wrote:

On 2022-04-28 15:14, Tianyu Lan wrote:

From: Tianyu Lan 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significat lock contention on the swiotlb lock.

This patch splits the swiotlb into individual areas which have their
own lock. When there are swiotlb map/allocate request, allocate
io tlb buffer from areas averagely and free the allocation back
to the associated area. This is to prepare to resolve the overhead
of single spinlock among device's queues. Per device may have its
own io tlb mem and bounce buffer pool.

This idea from Andi Kleen 
patch(https://github.com/intel/tdx/commit/4529b578

4c141782c72ec9bd9a92df2b68cb7d45). Rework it and make it may work
for individual device's io tlb mem. The device driver may determine
area number according to device queue number.


Rather than introduce this extra level of allocator complexity, how 
about just dividing up the initial SWIOTLB allocation into multiple 
io_tlb_mem instances?


Robin.


Agree. Thanks for suggestion. That will be more generic and will update
in the next version.

Thanks.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface

2022-04-28 Thread Tianyu Lan

On 4/28/2022 10:14 PM, Tianyu Lan wrote:

From: Tianyu Lan 

In SEV/TDX Confidential VM, device DMA transaction needs use swiotlb
bounce buffer to share data with host/hypervisor. The swiotlb spinlock
introduces overhead among devices if they share io tlb mem. Avoid such
issue, introduce swiotlb_device_allocate() to allocate device bounce
buffer from default io tlb pool and set up areas according input queue
number. Device may have multi io queues and setting up the same number
of io tlb area may help to resolve spinlock overhead among queues.

Introduce IO TLB Block unit(2MB) concepts to allocate big bounce buffer
from default pool for devices. IO TLB segment(256k) is too small.


Hi Christoph and Robin Murphy:

From Christoph:
"Yeah.  We're almost done removing all knowledge of swiotlb from 
drivers, so the very last thing I want is an interface that allows a 
driver to allocate a per-device buffer."

Please have a look at this patch. This patch is to provide a API
to device driver to allocate per-device buffer. Just providing 
per-device bounce buffer is not enough. Device still may have multi queue.
The single io tlb mem just has one spin lock in current code and this 
will introuduce overhead among queues DMA transaction. So the new API 
requests queues number as the IO TLB area number and this is why we 
still need to creat area in the IO Tlb mem.

   This new API is the one mentioned in the Christoph's comment.








Signed-off-by: Tianyu Lan 
---
  include/linux/swiotlb.h |  33 
  kernel/dma/swiotlb.c| 173 +++-
  2 files changed, 203 insertions(+), 3 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 489c249da434..380bd1ce3d0f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -31,6 +31,14 @@ struct scatterlist;
  #define IO_TLB_SHIFT 11
  #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
  
+/*

+ * IO TLB BLOCK UNIT as device bounce buffer allocation unit.
+ * This allows device allocates bounce buffer from default io
+ * tlb pool.
+ */
+#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
+#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
+
  /* default to 64MB */
  #define IO_TLB_DEFAULT_SIZE (64UL<<20)
  
@@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;

   * @index:The slot index to start searching in this area for next round.
   * @lock: The lock to protect the above data structures in the map and
   *unmap calls.
+ * @block_index: The block index to start earching in this area for next round.
   */
  struct io_tlb_area {
unsigned long used;
unsigned int area_index;
unsigned int index;
+   unsigned int block_index;
spinlock_t lock;
  };
  
@@ -110,6 +120,7 @@ struct io_tlb_area {

   * @num_areas:  The area number in the pool.
   * @area_start: The area index to start searching in the next round.
   * @area_nslabs: The slot number in the area.
+ * @areas_block_number: The block number in the area.
   */
  struct io_tlb_mem {
phys_addr_t start;
@@ -126,7 +137,14 @@ struct io_tlb_mem {
unsigned int num_areas;
unsigned int area_start;
unsigned int area_nslabs;
+   unsigned int area_block_number;
+   struct io_tlb_mem *parent;
struct io_tlb_area *areas;
+   struct io_tlb_block {
+   size_t alloc_size;
+   unsigned long start_slot;
+   unsigned int list;
+   } *block;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
  size_t swiotlb_max_mapping_size(struct device *dev);
  bool is_swiotlb_active(struct device *dev);
  void __init swiotlb_adjust_size(unsigned long size);
+int swiotlb_device_allocate(struct device *dev,
+   unsigned int area_num,
+   unsigned long size);
+void swiotlb_device_free(struct device *dev);
  #else
  static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
  {
@@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
  static inline void swiotlb_adjust_size(unsigned long size)
  {
  }
+
+void swiotlb_device_free(struct device *dev)
+{
+}
+
+int swiotlb_device_allocate(struct device *dev,
+   unsigned int area_num,
+   unsigned long size)
+{
+   return -ENOMEM;
+}
  #endif /* CONFIG_SWIOTLB */
  
  extern void swiotlb_print_info(void);

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 00a16f540f20..7b95a140694a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, 
phys_addr_t start,
unsigned long nslabs, bool late_alloc)
  {
void *vaddr = phys_to_virt(start);
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i,

Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-04-28 Thread Dave Hansen
On 4/28/22 08:28, Fenghua Yu wrote:
> Do you want me to change the changlog to add both this paragraph and the
> following paragraph?

Yes, as long as everyone agrees that it captures the issue well.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-04-28 Thread Fenghua Yu
Hi, Dave,

On Thu, Apr 28, 2022 at 08:09:04AM -0700, Dave Hansen wrote:
> On 4/25/22 21:20, Fenghua Yu wrote:
> >>From 84aa68f6174439d863c40cdc2db0e1b89d620dd0 Mon Sep 17 00:00:00 2001
> > From: Fenghua Yu 
> > Date: Fri, 15 Apr 2022 00:51:33 -0700
> > Subject: [PATCH] iommu/sva: Fix PASID use-after-free issue
> > 
> > A PASID might be still used on ARM after it is freed in __mmput().
> 
> Is it really just ARM?

Actually it should happen on X86 as well. I will remove "on ARM" in the
changelog.

> 
> > process:
> > open()->sva_bind()->ioasid_alloc() = N; // Get PASID N for the mm
> > exit();
> > exit_mm()->__mmput()->mm_pasid_drop()->mm->pasid = -1; // PASID -1
> > exit_files()->release(dev)->sva_unbind()->use mm->pasid; // Failure
> > 
> > To avoid the use-after-free issue, free the PASID after no device uses it,
> > i.e. after all devices are unbound from the mm.
> > 
> > sva_bind()/sva_unbind() call mmgrab()/mmdrop() to track mm->mm_count.
> > __mmdrop() is called only after mm->mm_count is zero. So freeing the PASID
> > in __mmdrop() guarantees the PASID is safely freed only after no device
> > is bound to the mm.
> 
> Does this changelog work for everyone?
> 
> ==
> 
> tl;dr: The PASID is being freed too early.  It needs to stay around
> until after device drivers that might be using it have had a chance to
> clear it out of the hardware.
> 

Do you want me to change the changlog to add both this paragraph and the
following paragraph?

> --
> 
> As a reminder:
> 
> mmget() /mmput()  refcount the mm's address space
> mmgrab()/mmdrop() refcount the mm itself
> 
> The PASID is currently tied to the life of the mm's address space and
> freed in __mmput().  This makes logical sense because the PASID can't be
> used once the address space is gone.
> 
> But, this misses an important point: even after the address space is
> gone, the PASID will still be programmed into a device.  Device drivers
> might, for instance, still need to flush operations that are outstanding
> and need to use that PASID.  They do this at ->release() time.
> 
> Device drivers hold a reference on the mm itself and drop it at
> ->release() time.  But, the device driver holds a reference mm itself,
> not the address space.  The address space (and the PASID) is long gone
> by the time the driver tries to clean up.  This is effectively a
> use-after-free bug on the PASID.
> 
> To fix this, move the PASID free operation from __mmput() to __mmdrop().
>  This ensures that the device drivers' existing mmgrab() keeps the PASID
> allocated until they drop their mm reference.
> 

Thank you very much!

-Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 0/2] MT8186 IOMMU SUPPORT

2022-04-28 Thread Joerg Roedel
On Thu, Apr 07, 2022 at 04:32:28PM +0800, Yong Wu wrote:
> Yong Wu (2):
>   dt-bindings: mediatek: mt8186: Add binding for MM iommu
>   iommu/mediatek: Add mt8186 iommu support
> 
>  .../bindings/iommu/mediatek,iommu.yaml|   4 +
>  drivers/iommu/mtk_iommu.c |  17 ++
>  .../dt-bindings/memory/mt8186-memory-port.h   | 217 ++
>  3 files changed, 238 insertions(+)
>  create mode 100644 include/dt-bindings/memory/mt8186-memory-port.h

This patch-set seems to apply cleanly only on 'MT8195 IOMMU SUPPORT',
please re-submit it together with the former when you made the changes
Matthias requested.

Thanks,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Andi Kleen



On 4/28/2022 8:05 AM, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 07:55:39AM -0700, Andi Kleen wrote:

At least for TDX need parallelism with a single device for performance.

So find a way to make it happen without exposing details to random
drivers.



That's what the original patch (that this one is derived from) did.

It was completely transparent to everyone outside swiotlb.c

-Andi

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread Jason Gunthorpe via iommu
On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:

> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in the v2 which
> used the concept of preregistration.  IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.

I think PPC and S390 are solving the same problem here. I think S390
is going to go to a SW nested model where it has an iommu_domain
controlled by iommufd that is populated with the pinned pages, eg
stored in an xarray.

Then the performance map/unmap path is simply copying pages from the
xarray to the real IOPTEs - and this would be modeled as a nested
iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.

Perhaps this is agreeable for PPC too?

> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adpoted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.

As part of nesting iommufd will have a 'create iommu_domain using
iommu driver specific data' primitive.

The driver specific data for PPC can include a description of these
windows so the PPC specific qemu driver can issue this new ioctl
using the information provided by the guest.

The main issue is that internally to the iommu subsystem the
iommu_domain aperture is assumed to be a single window. This kAPI will
have to be improved to model the PPC multi-window iommu_domain.

If this API is not used then the PPC driver should choose some
sensible default windows that makes things like DPDK happy.

> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.

At the moment I think I would stick with not including the SPAPR
interfaces in vfio_compat, but there does seem to be a path if someone
with HW wants to build and test them?

> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.

Maybe every device gets a copy of the error notification?

ie maybe this should be part of vfio_pci and not part of iommufd to
mirror how AER works?

It feels strange to put in device error notification to iommufd, is
that connected the IOMMU?

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-04-28 Thread Dave Hansen
On 4/25/22 21:20, Fenghua Yu wrote:
>>From 84aa68f6174439d863c40cdc2db0e1b89d620dd0 Mon Sep 17 00:00:00 2001
> From: Fenghua Yu 
> Date: Fri, 15 Apr 2022 00:51:33 -0700
> Subject: [PATCH] iommu/sva: Fix PASID use-after-free issue
> 
> A PASID might be still used on ARM after it is freed in __mmput().

Is it really just ARM?

> process:
>   open()->sva_bind()->ioasid_alloc() = N; // Get PASID N for the mm
>   exit();
>   exit_mm()->__mmput()->mm_pasid_drop()->mm->pasid = -1; // PASID -1
>   exit_files()->release(dev)->sva_unbind()->use mm->pasid; // Failure
> 
> To avoid the use-after-free issue, free the PASID after no device uses it,
> i.e. after all devices are unbound from the mm.
> 
> sva_bind()/sva_unbind() call mmgrab()/mmdrop() to track mm->mm_count.
> __mmdrop() is called only after mm->mm_count is zero. So freeing the PASID
> in __mmdrop() guarantees the PASID is safely freed only after no device
> is bound to the mm.

Does this changelog work for everyone?

==

tl;dr: The PASID is being freed too early.  It needs to stay around
until after device drivers that might be using it have had a chance to
clear it out of the hardware.

--

As a reminder:

mmget() /mmput()  refcount the mm's address space
mmgrab()/mmdrop() refcount the mm itself

The PASID is currently tied to the life of the mm's address space and
freed in __mmput().  This makes logical sense because the PASID can't be
used once the address space is gone.

But, this misses an important point: even after the address space is
gone, the PASID will still be programmed into a device.  Device drivers
might, for instance, still need to flush operations that are outstanding
and need to use that PASID.  They do this at ->release() time.

Device drivers hold a reference on the mm itself and drop it at
->release() time.  But, the device driver holds a reference mm itself,
not the address space.  The address space (and the PASID) is long gone
by the time the driver tries to clean up.  This is effectively a
use-after-free bug on the PASID.

To fix this, move the PASID free operation from __mmput() to __mmdrop().
 This ensures that the device drivers' existing mmgrab() keeps the PASID
allocated until they drop their mm reference.

>  kernel/fork.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9796897560ab..35a3beff140b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
>   mmu_notifier_subscriptions_destroy(mm);
>   check_mm(mm);
>   put_user_ns(mm->user_ns);
> + mm_pasid_drop(mm);
>   free_mm(mm);
>  }
>  EXPORT_SYMBOL_GPL(__mmdrop);
> @@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
>   }
>   if (mm->binfmt)
>   module_put(mm->binfmt->module);
> - mm_pasid_drop(mm);
>   mmdrop(mm);
>  }
>  

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Robin Murphy

On 2022-04-28 15:55, Andi Kleen wrote:


On 4/28/2022 7:45 AM, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
Rather than introduce this extra level of allocator complexity, how 
about

just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
instances?

Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.


At least for TDX need parallelism with a single device for performance.

So if you split up the io tlb mems for a device then you would need a 
new mechanism to load balance the requests for single device over those. 
I doubt it would be any simpler.


Eh, I think it would be, since the round-robin retry loop can then just 
sit around the existing io_tlb_mem-based allocator, vs. the churn of 
inserting it in the middle, plus it's then really easy to statically 
distribute different starting points across different devices via 
dev->dma_io_tlb_mem if we wanted to.


Admittedly the overall patch probably ends up about the same size, since 
it likely pushes a bit more complexity into swiotlb_init to compensate, 
but that's still a trade-off I like.


Thanks,
Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Christoph Hellwig
On Thu, Apr 28, 2022 at 07:55:39AM -0700, Andi Kleen wrote:
> At least for TDX need parallelism with a single device for performance.

So find a way to make it happen without exposing details to random
drivers.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 11/12] iommu: Per-domain I/O page fault handling

2022-04-28 Thread Jean-Philippe Brucker
On Thu, Apr 21, 2022 at 01:21:20PM +0800, Lu Baolu wrote:
>  static void iopf_handle_group(struct work_struct *work)
>  {
>   struct iopf_group *group;
> @@ -134,12 +78,23 @@ static void iopf_handle_group(struct work_struct *work)
>   group = container_of(work, struct iopf_group, work);
>  
>   list_for_each_entry_safe(iopf, next, &group->faults, list) {
> + struct iommu_domain *domain;
> +
> + domain = iommu_get_domain_for_dev_pasid_async(group->dev,
> + iopf->fault.prm.pasid);

Reading the PCIe spec again (v6.0 10.4.1.1 PASID Usage), all faults within
the group have the same PASID so we could move the domain fetch out of the
loop. It does deviate from the old behavior, though, so we could change
it later.

Thanks,
Jean

> + if (!domain || !domain->iopf_handler)
> + status = IOMMU_PAGE_RESP_INVALID;
> +
>   /*
>* For the moment, errors are sticky: don't handle subsequent
>* faults in the group if there is an error.
>*/
>   if (status == IOMMU_PAGE_RESP_SUCCESS)
> - status = iopf_handle_single(iopf);
> + status = domain->iopf_handler(&iopf->fault,
> +   domain->fault_data);
> +
> + if (domain)
> + iommu_domain_put_async(domain);
>  
>   if (!(iopf->fault.prm.flags &
> IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Robin Murphy

On 2022-04-28 15:45, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:

Rather than introduce this extra level of allocator complexity, how about
just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
instances?


Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.


FWIW I'd already started thinking about having a distinct io_tlb_mem for 
non-coherent devices where vaddr is made non-cacheable to avoid the 
hassle of keeping the arch_dma_sync_* calls lined up, so I'm certainly 
in favour of bringing in a bit more flexibility at this level :)


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Andi Kleen



On 4/28/2022 7:45 AM, Christoph Hellwig wrote:

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:

Rather than introduce this extra level of allocator complexity, how about
just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
instances?

Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.


At least for TDX need parallelism with a single device for performance.

So if you split up the io tlb mems for a device then you would need a 
new mechanism to load balance the requests for single device over those. 
I doubt it would be any simpler.



-Andi


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 03/12] iommu: Add attach/detach_dev_pasid domain ops

2022-04-28 Thread Jean-Philippe Brucker
On Thu, Apr 21, 2022 at 01:21:12PM +0800, Lu Baolu wrote:
> Attaching an IOMMU domain to a PASID of a device is a generic operation
> for modern IOMMU drivers which support PASID-granular DMA address
> translation. Currently visible usage scenarios include (but not limited):
> 
>  - SVA (Shared Virtual Address)
>  - kernel DMA with PASID
>  - hardware-assist mediated device
> 
> This adds a pair of common domain ops for this purpose and adds helpers
> to attach/detach a domain to/from a {device, PASID}. Some buses, like
> PCI, route packets without considering the PASID value. Thus a DMA target
> address with PASID might be treated as P2P if the address falls into the
> MMIO BAR of other devices in the group. To make things simple, these
> interfaces only apply to devices belonging to the singleton groups, and
> the singleton is immutable in fabric i.e. not affected by hotplug.
> 
> Signed-off-by: Lu Baolu 
[...]
> +/*
> + * Use standard PCI bus topology, isolation features, and DMA
> + * alias quirks to check the immutable singleton attribute. If
> + * the device came from DT, assume it is static and then
> + * singleton can know from the device count in the group.
> + */
> +static bool device_group_immutable_singleton(struct device *dev)
> +{
> + struct iommu_group *group = iommu_group_get(dev);
> + int count;
> +
> + if (!group)
> + return false;
> +
> + mutex_lock(&group->mutex);
> + count = iommu_group_device_count(group);
> + mutex_unlock(&group->mutex);
> + iommu_group_put(group);
> +
> + if (count != 1)
> + return false;
> +
> + if (dev_is_pci(dev)) {
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + /*
> +  * The device could be considered to be fully isolated if
> +  * all devices on the path from the device to the host-PCI
> +  * bridge are protected from peer-to-peer DMA by ACS.
> +  */
> + if (!pci_acs_path_enabled(pdev, NULL, REQ_ACS_FLAGS))
> + return false;
> +
> + /* Filter out devices which has any alias device. */
> + if (pci_for_each_dma_alias(pdev, has_pci_alias, pdev))
> + return false;

Aren't aliases already added to the group by pci_device_group()?  If so
it's part of the count check above

> +
> + return true;
> + }
> +
> + /*
> +  * If the device came from DT, assume it is static and then
> +  * singleton can know from the device count in the group.
> +  */
> + return is_of_node(dev_fwnode(dev));

I don't think DT is relevant here because a platform device enumerated
through ACPI will also have its own group. It should be safe to stick to
what the IOMMU drivers declare with their device_group() callback. Except
for PCI those groups should be immutable so we can return true here.

Thanks,
Jean

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread David Gibson
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

There are a few different levels to consider for dealing with PPC.
For a suitable long term interface for ppc hosts and guests dropping
this is fine: the ppc specific iommu model was basically an
ill-conceived idea from the beginning, because none of us had
sufficiently understood what things were general and what things where
iommu model/hw specific.

..mostly.  There are several points of divergence for the ppc iommu
model.

1) Limited IOVA windows.  This one turned out to not really be ppc
specific, and is (rightly) handlded generically in the new interface.
No problem here.

2) Costly GUPs.  pseries (the most common ppc machine type) always
expects a (v)IOMMU.  That means that unlike the common x86 model of a
host with IOMMU, but guests with no-vIOMMU, guest initiated
maps/unmaps can be a hot path.  Accounting in that path can be
prohibitive (and on POWER8 in particular it prevented us from
optimizing that path the way we wanted).  We had two solutions for
that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
based on the IOVA window sizes.  That was improved in the v2 which
used the concept of preregistration.  IIUC iommufd can achieve the
same effect as preregistration using IOAS_COPY, so this one isn't
really a problem either.

3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
windows, which aren't contiguous with each other.  The base addresses
of each of these are fixed, but the size of each window, the pagesize
(i.e. granularity) of each window and the number of levels in the
IOMMU pagetable are runtime configurable.  Because it's true in the
hardware, it's also true of the vIOMMU interface defined by the IBM
hypervisor (and adpoted by KVM as well).  So, guests can request
changes in how these windows are handled.  Typical Linux guests will
use the "low" window (IOVA 0..2GiB) dynamically, and the high window
(IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
can't count on that; the guest can use them however it wants.


(3) still needs a plan for how to fit it into the /dev/iommufd model.
This is a secondary reason that in the past I advocated for the user
requesting specific DMA windows which the kernel would accept or
refuse, rather than having a query function - it connects easily to
the DDW model.  With the query-first model we'd need some sort of
extension here, not really sure what it should look like.



Then, there's handling existing qemu (or other software) that is using
the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
should be a goal or not: as others have noted, working actively to
port qemu to the new interface at the same time as making a
comprehensive in-kernel compat layer is arguably redundant work.

That said, if we did want to handle this in an in-kernel compat layer,
here's roughly what you'd need for SPAPR_TCE v2:

- VFIO_IOMMU_SPAPR_TCE_GET_INFO
I think this should be fairly straightforward; the information you
need should be in the now generic IOVA window stuff and would just
need massaging into the expected format.
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
  VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
IIUC, these could be traslated into map/unmap operations onto a
second implicit IOAS which represents the preregistered memory
areas (to which we'd never connect an actual device).  Along with
this VFIO_MAP and VFIO_UNMAP operations would need to check for
this case, verify their addresses against the preregistered space
and be translated into IOAS_COPY operations from the prereg
address space instead of raw IOAS_MAP operations.  Fiddly, but not
fundamentally hard, I think.

For SPAPR_TCE_v1 things are a bit trickier

- VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
I suspect you could get away with implementing these as no-ops.
It wouldn't be strictly correct, but I think software which is
using the interface correctly should work this way, though
possibly not optimally.  That might be good enough for this ugly
old interface.

And... then there's VFIO_EEH_PE_OP.  It's very hard to know what to do
with this because the interface was completely broken for most of its
lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
somewhat similar in concept, though not interface, t

Re: [PATCH v6 11/34] iommu/mediatek: Add a flag NON_STD_AXI

2022-04-28 Thread Matthias Brugger




On 07/04/2022 09:57, Yong Wu wrote:

Add a new flag NON_STD_AXI, All the previous SoC support this flag.
Prepare for adding infra and apu iommu which don't support this.

Signed-off-by: Yong Wu 
Reviewed-by: AngeloGioacchino Del Regno 

---
  drivers/iommu/mtk_iommu.c | 16 ++--
  1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 92f172a772d1..e7008a20ec74 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -122,6 +122,7 @@
  #define IOVA_34_ENBIT(8)
  #define SHARE_PGTABLE BIT(9) /* 2 HW share pgtable */
  #define DCM_DISABLE   BIT(10)
+#define NOT_STD_AXI_MODE   BIT(11)
  
  #define MTK_IOMMU_HAS_FLAG(pdata, _x) \

pdata)->flags) & (_x)) == (_x))
@@ -785,7 +786,8 @@ static int mtk_iommu_hw_init(const struct mtk_iommu_data 
*data)
regval = 0;
} else {
regval = readl_relaxed(data->base + REG_MMU_MISC_CTRL);
-   regval &= ~F_MMU_STANDARD_AXI_MODE_MASK;
+   if (MTK_IOMMU_HAS_FLAG(data->plat_data, NOT_STD_AXI_MODE))
+   regval &= ~F_MMU_STANDARD_AXI_MODE_MASK;


That means that for mt8195 infra we write back the very same value we read from 
REG_MMU_MISC_CTRL. Is this necessary?
Maybe we can come up with a different flag called STD_AXI_MODE and use something 
like

} else if (!MTK_IOMMU_HAS_FLAG(data->plat_data, \
   STD_AXI_MODE)) {

Reason is that it makes more sense to add a flag for one specific iommu instead 
of adding a flag to the common case (iommu is not following standard AXI protocol).


What do you think?

Regards,
Matthias


if (MTK_IOMMU_HAS_FLAG(data->plat_data, OUT_ORDER_WR_EN))
regval &= ~F_MMU_IN_ORDER_WR_EN_MASK;
}
@@ -1058,7 +1060,8 @@ static const struct dev_pm_ops mtk_iommu_pm_ops = {
  
  static const struct mtk_iommu_plat_data mt2712_data = {

.m4u_plat = M4U_MT2712,
-   .flags= HAS_4GB_MODE | HAS_BCLK | HAS_VLD_PA_RNG | 
SHARE_PGTABLE,
+   .flags= HAS_4GB_MODE | HAS_BCLK | HAS_VLD_PA_RNG | 
SHARE_PGTABLE |
+   NOT_STD_AXI_MODE,
.hw_list  = &m4ulist,
.inv_sel_reg  = REG_MMU_INV_SEL_GEN1,
.iova_region  = single_domain,
@@ -1068,7 +1071,8 @@ static const struct mtk_iommu_plat_data mt2712_data = {
  
  static const struct mtk_iommu_plat_data mt6779_data = {

.m4u_plat  = M4U_MT6779,
-   .flags = HAS_SUB_COMM | OUT_ORDER_WR_EN | WR_THROT_EN,
+   .flags = HAS_SUB_COMM | OUT_ORDER_WR_EN | WR_THROT_EN |
+NOT_STD_AXI_MODE,
.inv_sel_reg   = REG_MMU_INV_SEL_GEN2,
.iova_region   = single_domain,
.iova_region_nr = ARRAY_SIZE(single_domain),
@@ -1077,7 +1081,7 @@ static const struct mtk_iommu_plat_data mt6779_data = {
  
  static const struct mtk_iommu_plat_data mt8167_data = {

.m4u_plat = M4U_MT8167,
-   .flags= RESET_AXI | HAS_LEGACY_IVRP_PADDR,
+   .flags= RESET_AXI | HAS_LEGACY_IVRP_PADDR | NOT_STD_AXI_MODE,
.inv_sel_reg  = REG_MMU_INV_SEL_GEN1,
.iova_region  = single_domain,
.iova_region_nr = ARRAY_SIZE(single_domain),
@@ -1087,7 +1091,7 @@ static const struct mtk_iommu_plat_data mt8167_data = {
  static const struct mtk_iommu_plat_data mt8173_data = {
.m4u_plat = M4U_MT8173,
.flags= HAS_4GB_MODE | HAS_BCLK | RESET_AXI |
-   HAS_LEGACY_IVRP_PADDR,
+   HAS_LEGACY_IVRP_PADDR | NOT_STD_AXI_MODE,
.inv_sel_reg  = REG_MMU_INV_SEL_GEN1,
.iova_region  = single_domain,
.iova_region_nr = ARRAY_SIZE(single_domain),
@@ -1106,7 +1110,7 @@ static const struct mtk_iommu_plat_data mt8183_data = {
  static const struct mtk_iommu_plat_data mt8192_data = {
.m4u_plat   = M4U_MT8192,
.flags  = HAS_BCLK | HAS_SUB_COMM | OUT_ORDER_WR_EN |
- WR_THROT_EN | IOVA_34_EN,
+ WR_THROT_EN | IOVA_34_EN | NOT_STD_AXI_MODE,
.inv_sel_reg= REG_MMU_INV_SEL_GEN2,
.iova_region= mt8192_multi_dom,
.iova_region_nr = ARRAY_SIZE(mt8192_multi_dom),

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/msm: add a check for the return of kzalloc()

2022-04-28 Thread Jeffrey Hugo
On Fri, Mar 25, 2022 at 2:13 PM  wrote:
>
> From: Xiaoke Wang 
>
> kzalloc() is a memory allocation function which can return NULL when
> some internal memory errors happen. So it is better to check it to
> prevent potential wrong memory access.
>
> Signed-off-by: Xiaoke Wang 
> ---
>  drivers/iommu/msm_iommu.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
> index 3a38352..697ad63 100644
> --- a/drivers/iommu/msm_iommu.c
> +++ b/drivers/iommu/msm_iommu.c
> @@ -597,6 +597,10 @@ static void insert_iommu_master(struct device *dev,
>
> if (list_empty(&(*iommu)->ctx_list)) {
> master = kzalloc(sizeof(*master), GFP_ATOMIC);
> +   if (!master) {
> +   dev_err(dev, "Failed to allocate iommu_master\n");

How do you reconcile this with chapter 14 of the coding style document?

"These generic allocation functions all emit a stack dump on failure when used
without __GFP_NOWARN so there is no use in emitting an additional failure
message when NULL is returned."

> +   return;
> +   }
> master->of_node = dev->of_node;
> list_add(&master->list, &(*iommu)->ctx_list);
> dev_iommu_priv_set(dev, master);
> --
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 10/12] iommu: Prepare IOMMU domain for IOPF

2022-04-28 Thread Jean-Philippe Brucker
Hi Baolu,

On Thu, Apr 21, 2022 at 01:21:19PM +0800, Lu Baolu wrote:
> +/*
> + * Get the attached domain for asynchronous usage, for example the I/O
> + * page fault handling framework. The caller get a reference counter
> + * of the domain automatically on a successful return and should put
> + * it with iommu_domain_put() after usage.
> + */
> +struct iommu_domain *
> +iommu_get_domain_for_dev_pasid_async(struct device *dev, ioasid_t pasid)
> +{
> + struct iommu_domain *domain;
> + struct iommu_group *group;
> +
> + if (!pasid_valid(pasid))
> + return NULL;
> +
> + group = iommu_group_get(dev);
> + if (!group)
> + return NULL;
> +
> + mutex_lock(&group->mutex);

There is a possible deadlock between unbind() and the fault handler:

 unbind()iopf_handle_group()
  mutex_lock(&group->mutex)
  iommu_detach_device_pasid()
   iopf_queue_flush_dev() iommu_get_domain_for_dev_pasid_async()
... waits for IOPF workmutex_lock(&group->mutex)

I was wrong in my previous review: we do have a guarantee that the SVA
domain does not go away during IOPF handling, because unbind() waits for
pending faults with iopf_queue_flush_dev() before freeing the domain (or
for Arm stall, knows that there are no pending faults). So we can just get
rid of domain->async_users and the group->mutex in IOPF, I think?

Thanks,
Jean

> + domain = xa_load(&group->pasid_array, pasid);
> + if (domain)
> + refcount_inc(&domain->async_users);
> + mutex_unlock(&group->mutex);
> + iommu_group_put(group);
> +
> + return domain;
> +}
> -- 
> 2.25.1
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Christoph Hellwig
On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
> Rather than introduce this extra level of allocator complexity, how about
> just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
> instances?

Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Robin Murphy

On 2022-04-28 15:14, Tianyu Lan wrote:

From: Tianyu Lan 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significat lock contention on the swiotlb lock.

This patch splits the swiotlb into individual areas which have their
own lock. When there are swiotlb map/allocate request, allocate
io tlb buffer from areas averagely and free the allocation back
to the associated area. This is to prepare to resolve the overhead
of single spinlock among device's queues. Per device may have its
own io tlb mem and bounce buffer pool.

This idea from Andi Kleen patch(https://github.com/intel/tdx/commit/4529b578
4c141782c72ec9bd9a92df2b68cb7d45). Rework it and make it may work
for individual device's io tlb mem. The device driver may determine
area number according to device queue number.


Rather than introduce this extra level of allocator complexity, how 
about just dividing up the initial SWIOTLB allocation into multiple 
io_tlb_mem instances?


Robin.


Based-on-idea-by: Andi Kleen 
Signed-off-by: Tianyu Lan 
---
  include/linux/swiotlb.h |  25 ++
  kernel/dma/swiotlb.c| 173 +++-
  2 files changed, 162 insertions(+), 36 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..489c249da434 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,24 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
  #ifdef CONFIG_SWIOTLB
  extern enum swiotlb_force swiotlb_force;
  
+/**

+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:  The number of used IO TLB block.
+ * @area_index: The index of to tlb area.
+ * @index: The slot index to start searching in this area for next round.
+ * @lock:  The lock to protect the above data structures in the map and
+ * unmap calls.
+ */
+struct io_tlb_area {
+   unsigned long used;
+   unsigned int area_index;
+   unsigned int index;
+   spinlock_t lock;
+};
+
  /**
   * struct io_tlb_mem - IO TLB Memory Pool Descriptor
   *
@@ -89,6 +107,9 @@ extern enum swiotlb_force swiotlb_force;
   * @late_alloc:   %true if allocated using the page allocator
   * @force_bounce: %true if swiotlb bouncing is forced
   * @for_alloc:  %true if the pool is used for memory allocation
+ * @num_areas:  The area number in the pool.
+ * @area_start: The area index to start searching in the next round.
+ * @area_nslabs: The slot number in the area.
   */
  struct io_tlb_mem {
phys_addr_t start;
@@ -102,6 +123,10 @@ struct io_tlb_mem {
bool late_alloc;
bool force_bounce;
bool for_alloc;
+   unsigned int num_areas;
+   unsigned int area_start;
+   unsigned int area_nslabs;
+   struct io_tlb_area *areas;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..00a16f540f20 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
  
  #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
  
+#define NUM_AREAS_DEFAULT 1

+
  static bool swiotlb_force_bounce;
  static bool swiotlb_force_disable;
  
@@ -70,6 +72,25 @@ struct io_tlb_mem io_tlb_default_mem;

  phys_addr_t swiotlb_unencrypted_base;
  
  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;

+static unsigned long default_area_num = NUM_AREAS_DEFAULT;
+
+static int swiotlb_setup_areas(struct io_tlb_mem *mem,
+   unsigned int num_areas, unsigned long nslabs)
+{
+   if (nslabs < 1 || !is_power_of_2(num_areas)) {
+   pr_err("swiotlb: Invalid areas parameter %d.\n", num_areas);
+   return -EINVAL;
+   }
+
+   /* Round up number of slabs to the next power of 2.
+* The last area is going be smaller than the rest if default_nslabs is
+* not power of two.
+*/
+   mem->area_start = 0;
+   mem->num_areas = num_areas;
+   mem->area_nslabs = nslabs / num_areas;
+   return 0;
+}
  
  static int __init

  setup_io_tlb_npages(char *str)
@@ -114,6 +135,8 @@ void __init swiotlb_adjust_size(unsigned long size)
return;
size = ALIGN(size, IO_TLB_SIZE);
default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+   swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+   default_nslabs);
pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
  }
  
@@ -195,7 +218,8 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,

unsigned long nslabs, bool late_alloc)
  {
void *vaddr = p

Re: [PATCH v11 0/9] ACPI/IORT: Support for IORT RMR node

2022-04-28 Thread Steven Price
On 22/04/2022 17:28, Shameer Kolothum wrote:
> Hi
> 
> v9 --> v10
>  -Addressed Christoph's comments. We now have a callback to 
>   struct iommu_resv_region to free all related memory and also dropped
>   the FW specific union and now has a container struct iommu_iort_rmr_data.
>   See patches #1 & #4
>  -Added R-by from Christoph.
>  -Dropped R-by from Lorenzo for patches #4 & #5 due to the above changes.
>  -Also dropped T-by from Steve and Laurentiu. Many thanks for your test
>   efforts. I have done basic sanity testing on my platform but please
>   give it a try at your end as well.

I'm back in the office and given it a spin, it's all good:

Tested-by: Steven Price 

Thanks,

Steve

> 
> As mentioned in v10, this now has a dependency on the ACPICA header patch
> here[1]. 
> 
> Please take a look and let me know.
> 
> Thanks,
> Shameer
> [1] https://lore.kernel.org/all/44610361.fMDQidcC6G@kreacher/
> 
> From old:
> We have faced issues with 3408iMR RAID controller cards which
> fail to boot when SMMU is enabled. This is because these
> controllers make use of host memory for various caching related
> purposes and when SMMU is enabled the iMR firmware fails to
> access these memory regions as there is no mapping for them.
> IORT RMR provides a way for UEFI to describe and report these
> memory regions so that the kernel can make a unity mapping for
> these in SMMU.
> 
> Change History:
> 
> v9 --> v10
>  - Dropped patch #1 ("Add temporary RMR node flag definitions") since
>the ACPICA header updates patch is now in the mailing list
>  - Based on the suggestion from Christoph, introduced a 
>resv_region_free_fw_data() callback in struct iommu_resv_region and
>used that to free RMR specific memory allocations.
> 
> v8 --> v9
>  - Adressed comments from Robin on interfaces.
>  - Addressed comments from Lorenzo.
> 
> v7 --> v8
>   - Patch #1 has temp definitions for RMR related changes till
>     the ACPICA header changes are part of kernel.
>   - No early parsing of RMR node info and is only parsed at the
>     time of use.
>   - Changes to the RMR get/put API format compared to the
>     previous version.
>   - Support for RMR descriptor shared by multiple stream IDs.
> 
> v6 --> v7
>  -fix pointed out by Steve to the SMMUv2 SMR bypass install in patch #8.
> 
> v5 --> v6
> - Addressed comments from Robin & Lorenzo.
>   : Moved iort_parse_rmr() to acpi_iort_init() from
>     iort_init_platform_devices().
>   : Removed use of struct iort_rmr_entry during the initial
>     parse. Using struct iommu_resv_region instead.
>   : Report RMR address alignment and overlap errors, but continue.
>   : Reworked arm_smmu_init_bypass_stes() (patch # 6).
> - Updated SMMUv2 bypass SMR code. Thanks to Jon N (patch #8).
> - Set IOMMU protection flags(IOMMU_CACHE, IOMMU_MMIO) based
>   on Type of RMR region. Suggested by Jon N.
> 
> v4 --> v5
>  -Added a fw_data union to struct iommu_resv_region and removed
>   struct iommu_rmr (Based on comments from Joerg/Robin).
>  -Added iommu_put_rmrs() to release mem.
>  -Thanks to Steve for verifying on SMMUv2, but not added the Tested-by
>   yet because of the above changes.
> 
> v3 -->v4
> -Included the SMMUv2 SMR bypass install changes suggested by
>  Steve(patch #7)
> -As per Robin's comments, RMR reserve implementation is now
>  more generic  (patch #8) and dropped v3 patches 8 and 10.
> -Rebase to 5.13-rc1
> 
> RFC v2 --> v3
>  -Dropped RFC tag as the ACPICA header changes are now ready to be
>   part of 5.13[0]. But this series still has a dependency on that patch.
>  -Added IORT E.b related changes(node flags, _DSM function 5 checks for
>   PCIe).
>  -Changed RMR to stream id mapping from M:N to M:1 as per the spec and
>   discussion here[1].
>  -Last two patches add support for SMMUv2(Thanks to Jon Nettleton!)
> 
> Jon Nettleton (1):
>   iommu/arm-smmu: Get associated RMR info and install bypass SMR
> 
> Shameer Kolothum (8):
>   iommu: Introduce a callback to struct iommu_resv_region
>   ACPI/IORT: Make iort_iommu_msi_get_resv_regions() return void
>   ACPI/IORT: Provide a generic helper to retrieve reserve regions
>   ACPI/IORT: Add support to retrieve IORT RMR reserved regions
>   ACPI/IORT: Add a helper to retrieve RMR info directly
>   iommu/arm-smmu-v3: Introduce strtab init helper
>   iommu/arm-smmu-v3: Refactor arm_smmu_init_bypass_stes() to force
> bypass
>   iommu/arm-smmu-v3: Get associated RMR info and install bypass STE
> 
>  drivers/acpi/arm64/iort.c   | 359 ++--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  78 -
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   |  52 +++
>  drivers/iommu/dma-iommu.c   |   2 +-
>  drivers/iommu/iommu.c   |  16 +-
>  include/linux/acpi_iort.h   |  14 +-
>  include/linux/iommu.h   |  10 +
>  7 files changed, 485 insertions(+), 46 deletions(-)
> 

___
iommu mailing

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-28 Thread Jason Gunthorpe via iommu
On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > 
> > > > +/**
> > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > + * @ioas_id: IOAS ID to read ranges from
> > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > + * @__reserved: Must be 0
> > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is 
> > > > the smaller
> > > > + *   of out_num_iovas or the length implied by size.
> > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > + *
> > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these 
> > > > ranges is
> > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > + * size should include the allocated flex array.
> > > > + */
> > > > +struct iommu_ioas_iova_ranges {
> > > > +   __u32 size;
> > > > +   __u32 ioas_id;
> > > > +   __u32 out_num_iovas;
> > > > +   __u32 __reserved;
> > > > +   struct iommu_valid_iovas {
> > > > +   __aligned_u64 start;
> > > > +   __aligned_u64 last;
> > > > +   } out_valid_iovas[];
> > > > +};
> > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > 
> > > Is the information returned by this valid for the lifeime of the IOAS,
> > > or can it change?  If it can change, what events can change it?
> > >
> > > If it *can't* change, then how do we have enough information to
> > > determine this at ALLOC time, since we don't necessarily know which
> > > (if any) hardware IOMMU will be attached to it.
> > 
> > It is a good point worth documenting. It can change. Particularly
> > after any device attachment.
> 
> Right.. this is vital and needs to be front and centre in the
> comments/docs here.  Really, I think an interface that *doesn't* have
> magically changing status would be better (which is why I was
> advocating that the user set the constraints, and the kernel supplied
> or failed outright).  Still I recognize that has its own problems.

That is a neat idea, it could be a nice option, it lets userspace
further customize the kernel allocator.

But I don't have a use case in mind? The simplified things I know
about want to attach their devices then allocate valid IOVA, they
don't really have a notion about what IOVA regions they are willing to
accept, or necessarily do hotplug.

What might be interesting is to have some option to load in a machine
specific default ranges - ie the union of every group and and every
iommu_domain. The idea being that after such a call hotplug of a
device should be very likely to succeed.

Though I don't have a user in mind..

> > I added this:
> > 
> >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these 
> > ranges
> >  * is not allowed. out_num_iovas will be set to the total number of iovas 
> > and
> >  * the out_valid_iovas[] will be filled in as space permits. size should 
> > include
> >  * the allocated flex array.
> >  *
> >  * The allowed ranges are dependent on the HW path the DMA operation takes, 
> > and
> >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have 
> > a
> >  * full range, and each attached device will narrow the ranges based on that
> >  * devices HW restrictions.
> 
> I think you need to be even more explicit about this: which exact
> operations on the fd can invalidate exactly which items in the
> information from this call?  Can it only ever be narrowed, or can it
> be broadened with any operations?

I think "attach" is the phrase we are using for that operation - it is
not a specific IOCTL here because it happens on, say, the VFIO device FD.

Let's add "detatching a device can widen the ranges. Userspace should
query ranges after every attach/detatch to know what IOVAs are valid
for mapping."

> > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > 
> > > Since it can only copy a single mapping, what's the benefit of this
> > > over just repeating an IOAS_MAP in the new IOAS?
> > 
> > It causes the underlying pin accounting to be shared and can avoid
> > calling GUP entirely.
> 
> If that's the only purpose, then that needs to be right here in the
> comments too.  So is expected best practice to IOAS_MAP everything you
> might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> mappings you actually end up wanting into the "real" IOASes for use?

That is one possibility, yes. qemu seems to be using this to establish
a clone ioas of an existing operational one which is another usage
model.

I added this additionally:

 * This may be used

Re: [PATCH v6 15/34] iommu/mediatek: Add IOMMU_TYPE flag

2022-04-28 Thread Matthias Brugger




On 07/04/2022 09:57, Yong Wu wrote:

Add IOMMU_TYPE definition. In the mt8195, we have another IOMMU_TYPE:
infra iommu, also there will be another APU_IOMMU, thus, use 2bits for the
IOMMU_TYPE.

Signed-off-by: Yong Wu 
Reviewed-by: AngeloGioacchino Del Regno 

---
  drivers/iommu/mtk_iommu.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 84d661e0b371..642949aad47e 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -126,9 +126,17 @@
  #define SHARE_PGTABLE BIT(10) /* 2 HW share pgtable */
  #define DCM_DISABLE   BIT(11)
  #define NOT_STD_AXI_MODE  BIT(12)
+/* 2 bits: iommu type */
+#define MTK_IOMMU_TYPE_MM  (0x0 << 13)
+#define MTK_IOMMU_TYPE_INFRA   (0x1 << 13)
+#define MTK_IOMMU_TYPE_MASK(0x3 << 13)
  
-#define MTK_IOMMU_HAS_FLAG(pdata, _x) \

-   pdata)->flags) & (_x)) == (_x))
+#define MTK_IOMMU_HAS_FLAG(pdata, _x)  (!!(((pdata)->flags) & (_x)))


That could be:
MTK_IOMMU_HAS_FLAG(pdata, _x) \
MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, _x)


+
+#define MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, mask)   \
+   pdata)->flags) & (mask)) == (_x))
+#define MTK_IOMMU_IS_TYPE(pdata, _x)   MTK_IOMMU_HAS_FLAG_MASK(pdata, _x,\
+   MTK_IOMMU_TYPE_MASK)
  
  struct mtk_iommu_domain {

struct io_pgtable_cfg   cfg;

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 00/14] iommu: Retire bus_set_iommu()

2022-04-28 Thread Robin Murphy

On 2022-04-28 14:18, Robin Murphy wrote:

v1: 
https://lore.kernel.org/linux-iommu/cover.1649935679.git.robin.mur...@arm.com/

Hi all,

Just some minor updates for v2, adding a workaround to avoid changing
VT-d behaviour for now, cleaning up the extra include I missed in
virtio-iommu, and collecting all the acks so far.


Oh, and I added the proper error handling for bus_iommu_probe() since it 
turned out to be trivial, and applied Krishna's R-b to the wrong patch 
because of course what was patch #2 is now actually patch #3 :(


Apologies, juggling one too many things at the moment...

Robin.


As before, this is
based on the AMD IOMMU patch now applied, plus v2 of my
device_iommu_capable() series here:

https://lore.kernel.org/linux-iommu/cover.1650878781.git.robin.mur...@arm.com/

Cheers,
Robin.



Robin Murphy (14):
   iommu/vt-d: Temporarily reject probing non-PCI devices
   iommu: Always register bus notifiers
   iommu: Move bus setup to IOMMU device registration
   iommu/amd: Clean up bus_set_iommu()
   iommu/arm-smmu: Clean up bus_set_iommu()
   iommu/arm-smmu-v3: Clean up bus_set_iommu()
   iommu/dart: Clean up bus_set_iommu()
   iommu/exynos: Clean up bus_set_iommu()
   iommu/ipmmu-vmsa: Clean up bus_set_iommu()
   iommu/mtk: Clean up bus_set_iommu()
   iommu/omap: Clean up bus_set_iommu()
   iommu/tegra-smmu: Clean up bus_set_iommu()
   iommu/virtio: Clean up bus_set_iommu()
   iommu: Clean up bus_set_iommu()

  drivers/iommu/amd/amd_iommu.h   |   1 -
  drivers/iommu/amd/init.c|   9 +-
  drivers/iommu/amd/iommu.c   |  21 
  drivers/iommu/apple-dart.c  |  30 +-
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  53 +-
  drivers/iommu/arm/arm-smmu/arm-smmu.c   |  84 +--
  drivers/iommu/arm/arm-smmu/qcom_iommu.c |   4 -
  drivers/iommu/exynos-iommu.c|   9 --
  drivers/iommu/fsl_pamu_domain.c |   4 -
  drivers/iommu/intel/iommu.c |   5 +-
  drivers/iommu/iommu.c   | 110 +---
  drivers/iommu/ipmmu-vmsa.c  |  35 +--
  drivers/iommu/msm_iommu.c   |   2 -
  drivers/iommu/mtk_iommu.c   |  13 +--
  drivers/iommu/mtk_iommu_v1.c|  13 +--
  drivers/iommu/omap-iommu.c  |   6 --
  drivers/iommu/rockchip-iommu.c  |   2 -
  drivers/iommu/s390-iommu.c  |   6 --
  drivers/iommu/sprd-iommu.c  |   5 -
  drivers/iommu/sun50i-iommu.c|   2 -
  drivers/iommu/tegra-smmu.c  |  29 ++
  drivers/iommu/virtio-iommu.c|  25 -
  include/linux/iommu.h   |   1 -
  23 files changed, 67 insertions(+), 402 deletions(-)


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface

2022-04-28 Thread Tianyu Lan
From: Tianyu Lan 

In SEV/TDX Confidential VM, device DMA transaction needs use swiotlb
bounce buffer to share data with host/hypervisor. The swiotlb spinlock
introduces overhead among devices if they share io tlb mem. Avoid such
issue, introduce swiotlb_device_allocate() to allocate device bounce
buffer from default io tlb pool and set up areas according input queue
number. Device may have multi io queues and setting up the same number
of io tlb area may help to resolve spinlock overhead among queues.

Introduce IO TLB Block unit(2MB) concepts to allocate big bounce buffer
from default pool for devices. IO TLB segment(256k) is too small.

Signed-off-by: Tianyu Lan 
---
 include/linux/swiotlb.h |  33 
 kernel/dma/swiotlb.c| 173 +++-
 2 files changed, 203 insertions(+), 3 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 489c249da434..380bd1ce3d0f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -31,6 +31,14 @@ struct scatterlist;
 #define IO_TLB_SHIFT 11
 #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
 
+/*
+ * IO TLB BLOCK UNIT as device bounce buffer allocation unit.
+ * This allows device allocates bounce buffer from default io
+ * tlb pool.
+ */
+#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
+#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
+
 /* default to 64MB */
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
@@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
  * @index: The slot index to start searching in this area for next round.
  * @lock:  The lock to protect the above data structures in the map and
  * unmap calls.
+ * @block_index: The block index to start earching in this area for next round.
  */
 struct io_tlb_area {
unsigned long used;
unsigned int area_index;
unsigned int index;
+   unsigned int block_index;
spinlock_t lock;
 };
 
@@ -110,6 +120,7 @@ struct io_tlb_area {
  * @num_areas:  The area number in the pool.
  * @area_start: The area index to start searching in the next round.
  * @area_nslabs: The slot number in the area.
+ * @areas_block_number: The block number in the area.
  */
 struct io_tlb_mem {
phys_addr_t start;
@@ -126,7 +137,14 @@ struct io_tlb_mem {
unsigned int num_areas;
unsigned int area_start;
unsigned int area_nslabs;
+   unsigned int area_block_number;
+   struct io_tlb_mem *parent;
struct io_tlb_area *areas;
+   struct io_tlb_block {
+   size_t alloc_size;
+   unsigned long start_slot;
+   unsigned int list;
+   } *block;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+int swiotlb_device_allocate(struct device *dev,
+   unsigned int area_num,
+   unsigned long size);
+void swiotlb_device_free(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+void swiotlb_device_free(struct device *dev)
+{
+}
+
+int swiotlb_device_allocate(struct device *dev,
+   unsigned int area_num,
+   unsigned long size)
+{
+   return -ENOMEM;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 00a16f540f20..7b95a140694a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, 
phys_addr_t start,
unsigned long nslabs, bool late_alloc)
 {
void *vaddr = phys_to_virt(start);
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+   unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
unsigned int block_list;
 
mem->nslabs = nslabs;
@@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, 
phys_addr_t start,
mem->end = mem->start + bytes;
mem->index = 0;
mem->late_alloc = late_alloc;
+   mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
 
if (swiotlb_force_bounce)
mem->force_bounce = true;
@@ -233,10 +234,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
*mem, phys_addr_t start,
for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
if (!(i % mem->area_nslabs)) {
mem->areas[j].index = 0;
+   mem->areas[j].block_index = 0;
spin_lock_init(&mem->area

Re: [PATCH v6 31/34] iommu/mediatek: Get the proper bankid for multi banks

2022-04-28 Thread Matthias Brugger




On 07/04/2022 09:57, Yong Wu wrote:

We preassign some ports in a special bank via the new defined
banks_portmsk. Put it in the plat_data means it is not expected to be
adjusted dynamically.

If the iommu id in the iommu consumer's dtsi node is inside this
banks_portmsk, then we switch it to this special iommu bank, and
initialise the IOMMU bank HW.

Each a bank has the independent pgtable(4GB iova range). Each a bank
is a independent iommu domain/group. Currently we don't separate different
iova ranges inside a bank.

Signed-off-by: Yong Wu 
Reviewed-by: AngeloGioacchino Del Regno 

---
  drivers/iommu/mtk_iommu.c | 39 ---
  1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 0828cff97625..d42b3d35a36e 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -191,6 +191,7 @@ struct mtk_iommu_plat_data {
  
  	u8  banks_num;

boolbanks_enable[MTK_IOMMU_BANK_MAX];
+   unsigned intbanks_portmsk[MTK_IOMMU_BANK_MAX];
unsigned char   larbid_remap[MTK_LARB_COM_MAX][MTK_LARB_SUBCOM_MAX];
  };
  
@@ -467,6 +468,30 @@ static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)

return IRQ_HANDLED;
  }
  
+static unsigned int mtk_iommu_get_bank_id(struct device *dev,

+ const struct mtk_iommu_plat_data 
*plat_data)
+{
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   unsigned int i, portmsk = 0, bankid = 0;
+
+   if (plat_data->banks_num == 1)
+   return bankid;
+
+   for (i = 0; i < fwspec->num_ids; i++)
+   portmsk |= BIT(MTK_M4U_TO_PORT(fwspec->ids[i]));
+
+   for (i = 0; i < plat_data->banks_num && i < MTK_IOMMU_BANK_MAX; i++) {
+   if (!plat_data->banks_enable[i])
+   continue;
+
+   if (portmsk & plat_data->banks_portmsk[i]) {
+   bankid = i;
+   break;
+   }
+   }
+   return bankid; /* default is 0 */
+}
+
  static int mtk_iommu_get_iova_region_id(struct device *dev,
const struct mtk_iommu_plat_data 
*plat_data)
  {
@@ -619,13 +644,14 @@ static int mtk_iommu_attach_device(struct iommu_domain 
*domain,
struct list_head *hw_list = data->hw_list;
struct device *m4udev = data->dev;
struct mtk_iommu_bank_data *bank;
-   unsigned int bankid = 0;
+   unsigned int bankid;
int ret, region_id;
  
  	region_id = mtk_iommu_get_iova_region_id(dev, data->plat_data);

if (region_id < 0)
return region_id;
  
+	bankid = mtk_iommu_get_bank_id(dev, data->plat_data);

mutex_lock(&dom->mutex);
if (!dom->bank) {
/* Data is in the frstdata in sharing pgtable case. */
@@ -802,6 +828,7 @@ static struct iommu_group *mtk_iommu_device_group(struct 
device *dev)
struct mtk_iommu_data *c_data = dev_iommu_priv_get(dev), *data;
struct list_head *hw_list = c_data->hw_list;
struct iommu_group *group;
+   unsigned int bankid, groupid;
int regionid;
  
  	data = mtk_iommu_get_frst_data(hw_list);

@@ -812,12 +839,18 @@ static struct iommu_group *mtk_iommu_device_group(struct 
device *dev)
if (regionid < 0)
return ERR_PTR(regionid);
  
+	bankid = mtk_iommu_get_bank_id(dev, data->plat_data);


I think code readability would be improved if we add a new function like 
mtk_iommu_get_id which call mtk_iommu_get_bankid and if necessary 
mtk_iommu_get_regionid.



mutex_lock(&data->mutex);
-   group = data->m4u_group[regionid];
+   /*
+* If the bank function is enabled, each a bank is a iommu group/domain.
+* otherwise, each a iova region is a iommu group/domain.


While at it:
"If the bank function is enabled, each bank is a iommu group/domain. Otherwise, 
each iova region is a iommu group/domain."


Regards,
Matthias


+*/
+   groupid = bankid ? bankid : regionid;
+   group = data->m4u_group[groupid];
if (!group) {
group = iommu_group_alloc();
if (!IS_ERR(group))
-   data->m4u_group[regionid] = group;
+   data->m4u_group[groupid] = group;
} else {
iommu_group_ref_get(group);
}

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH 1/2] swiotlb: Split up single swiotlb lock

2022-04-28 Thread Tianyu Lan
From: Tianyu Lan 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significat lock contention on the swiotlb lock.

This patch splits the swiotlb into individual areas which have their
own lock. When there are swiotlb map/allocate request, allocate
io tlb buffer from areas averagely and free the allocation back
to the associated area. This is to prepare to resolve the overhead
of single spinlock among device's queues. Per device may have its
own io tlb mem and bounce buffer pool.

This idea from Andi Kleen patch(https://github.com/intel/tdx/commit/4529b578
4c141782c72ec9bd9a92df2b68cb7d45). Rework it and make it may work
for individual device's io tlb mem. The device driver may determine
area number according to device queue number.

Based-on-idea-by: Andi Kleen 
Signed-off-by: Tianyu Lan 
---
 include/linux/swiotlb.h |  25 ++
 kernel/dma/swiotlb.c| 173 +++-
 2 files changed, 162 insertions(+), 36 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..489c249da434 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,24 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
 
+/**
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:  The number of used IO TLB block.
+ * @area_index: The index of to tlb area.
+ * @index: The slot index to start searching in this area for next round.
+ * @lock:  The lock to protect the above data structures in the map and
+ * unmap calls.
+ */
+struct io_tlb_area {
+   unsigned long used;
+   unsigned int area_index;
+   unsigned int index;
+   spinlock_t lock;
+};
+
 /**
  * struct io_tlb_mem - IO TLB Memory Pool Descriptor
  *
@@ -89,6 +107,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @num_areas:  The area number in the pool.
+ * @area_start: The area index to start searching in the next round.
+ * @area_nslabs: The slot number in the area.
  */
 struct io_tlb_mem {
phys_addr_t start;
@@ -102,6 +123,10 @@ struct io_tlb_mem {
bool late_alloc;
bool force_bounce;
bool for_alloc;
+   unsigned int num_areas;
+   unsigned int area_start;
+   unsigned int area_nslabs;
+   struct io_tlb_area *areas;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..00a16f540f20 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
 
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
+#define NUM_AREAS_DEFAULT 1
+
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
@@ -70,6 +72,25 @@ struct io_tlb_mem io_tlb_default_mem;
 phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long default_area_num = NUM_AREAS_DEFAULT;
+
+static int swiotlb_setup_areas(struct io_tlb_mem *mem,
+   unsigned int num_areas, unsigned long nslabs)
+{
+   if (nslabs < 1 || !is_power_of_2(num_areas)) {
+   pr_err("swiotlb: Invalid areas parameter %d.\n", num_areas);
+   return -EINVAL;
+   }
+
+   /* Round up number of slabs to the next power of 2.
+* The last area is going be smaller than the rest if default_nslabs is
+* not power of two.
+*/
+   mem->area_start = 0;
+   mem->num_areas = num_areas;
+   mem->area_nslabs = nslabs / num_areas;
+   return 0;
+}
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -114,6 +135,8 @@ void __init swiotlb_adjust_size(unsigned long size)
return;
size = ALIGN(size, IO_TLB_SIZE);
default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+   swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+   default_nslabs);
pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -195,7 +218,8 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, 
phys_addr_t start,
unsigned long nslabs, bool late_alloc)
 {
void *vaddr = phys_to_virt(start);
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+   unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+   unsigned int block_list;
 
mem->nslabs = nslabs;
mem->start = start;
@@ -206,8 +230,1

[RFC PATCH 0/2] swiotlb: Introduce swiotlb device allocation function

2022-04-28 Thread Tianyu Lan
From: Tianyu Lan 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patchset splits the swiotlb into individual areas which have their
own lock. When there are swiotlb map/allocate request, allocate io tlb
buffer from areas averagely and free the allocation back to the associated
area.

Patch 2 introduces an helper function to allocate bounce buffer
from default IO tlb pool for devices with new IO TLB block unit
and set up IO TLB area for device queues to avoid spinlock overhead.
The area number is set by device driver according queue number.

The network test between traditional VM and Confidential VM.
The throughput improves from ~20Gb/s to ~34Gb/s  with this patchset.

Tianyu Lan (2):
  swiotlb: Split up single swiotlb lock
  Swiotlb: Add device bounce buffer allocation interface

 include/linux/swiotlb.h |  58 +++
 kernel/dma/swiotlb.c| 340 +++-
 2 files changed, 362 insertions(+), 36 deletions(-)

-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] Documentation: x86: rework IOMMU documentation

2022-04-28 Thread Jonathan Corbet
Joerg Roedel  writes:

> Acked-by: Joerg Roedel 
>
> Jonathan, will you merge that through the documentation tree?

Done.

Thanks,

jon
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v11 4/9] ACPI/IORT: Add support to retrieve IORT RMR reserved regions

2022-04-28 Thread Shameerali Kolothum Thodi via iommu
Hi Lorenzo,

> -Original Message-
> From: Lorenzo Pieralisi [mailto:lorenzo.pieral...@arm.com]
> Sent: 26 April 2022 16:30
> To: Shameerali Kolothum Thodi 
> Cc: linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-foundation.org; Linuxarm ;
> j...@8bytes.org; robin.mur...@arm.com; w...@kernel.org; wanghuiqiang
> ; Guohanjun (Hanjun Guo)
> ; steven.pr...@arm.com; sami.muja...@arm.com;
> j...@solid-run.com; eric.au...@redhat.com; laurentiu.tu...@nxp.com;
> h...@infradead.org
> Subject: Re: [PATCH v11 4/9] ACPI/IORT: Add support to retrieve IORT RMR
> reserved regions
> 
> On Fri, Apr 22, 2022 at 05:29:02PM +0100, Shameer Kolothum wrote:
> > Parse through the IORT RMR nodes and populate the reserve region list
> > corresponding to a given IOMMU and device(optional). Also, go through
> > the ID mappings of the RMR node and retrieve all the SIDs associated
> > with it.
> >
> > Signed-off-by: Shameer Kolothum
> 
> > ---
> >  drivers/acpi/arm64/iort.c | 290
> ++
> >  include/linux/iommu.h |   8 ++
> >  2 files changed, 298 insertions(+)
> 
> Reviewed-by: Lorenzo Pieralisi 

Thanks. I just realized that need a R-by from you for patch #2 as well.

Also for patch #5, I dropped your R-by tag as there is a minor change w.r.t
free up. If you are Ok, I can add that back.

Thanks,
Shameer
 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RESEND PATCH v8 00/11] Fix BUG_ON in vfio_iommu_group_notifier()

2022-04-28 Thread Joerg Roedel
On Thu, Apr 28, 2022 at 08:54:11AM -0300, Jason Gunthorpe wrote:
> Can we get this on a topic branch so Alex can pull it? There are
> conflicts with other VFIO patches

Right, we already discussed this. Moved the patches to a separate topic
branch. It will appear as 'vfio-notifier-fix' once I pushed the changes.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm

2022-04-28 Thread Christoph Hellwig
On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote:
> > Reported-by: Rahul Singh 
> > Signed-off-by: Christoph Hellwig 
> 
> Reviewed-by: Stefano Stabellini 

Do you want to take this through the Xen tree or should I pick it up?
Either way I'd love to see some testing on x86 as well.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 13/14] iommu/virtio: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the probe failure path accordingly.

Reviewed-by: Jean-Philippe Brucker 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/virtio-iommu.c | 25 -
 1 file changed, 25 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 25be4b822aa0..bcbd10ec4ccb 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -7,7 +7,6 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include 
 #include 
 #include 
 #include 
@@ -17,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1146,26 +1144,6 @@ static int viommu_probe(struct virtio_device *vdev)
 
iommu_device_register(&viommu->iommu, &viommu_ops, parent_dev);
 
-#ifdef CONFIG_PCI
-   if (pci_bus_type.iommu_ops != &viommu_ops) {
-   ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
-   if (ret)
-   goto err_unregister;
-   }
-#endif
-#ifdef CONFIG_ARM_AMBA
-   if (amba_bustype.iommu_ops != &viommu_ops) {
-   ret = bus_set_iommu(&amba_bustype, &viommu_ops);
-   if (ret)
-   goto err_unregister;
-   }
-#endif
-   if (platform_bus_type.iommu_ops != &viommu_ops) {
-   ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
-   if (ret)
-   goto err_unregister;
-   }
-
vdev->priv = viommu;
 
dev_info(dev, "input address: %u bits\n",
@@ -1174,9 +1152,6 @@ static int viommu_probe(struct virtio_device *vdev)
 
return 0;
 
-err_unregister:
-   iommu_device_sysfs_remove(&viommu->iommu);
-   iommu_device_unregister(&viommu->iommu);
 err_free_vqs:
vdev->config->del_vqs(vdev);
 
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 14/14] iommu: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Clean up the remaining trivial bus_set_iommu() callsites along
with the implementation. Now drivers only have to know and care
about iommu_device instances, phew!

Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |  4 
 drivers/iommu/fsl_pamu_domain.c |  4 
 drivers/iommu/intel/iommu.c |  1 -
 drivers/iommu/iommu.c   | 24 
 drivers/iommu/msm_iommu.c   |  2 --
 drivers/iommu/rockchip-iommu.c  |  2 --
 drivers/iommu/s390-iommu.c  |  6 --
 drivers/iommu/sprd-iommu.c  |  5 -
 drivers/iommu/sun50i-iommu.c|  2 --
 include/linux/iommu.h   |  1 -
 10 files changed, 51 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 4c077c38fbd6..80af00f468b4 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -845,8 +845,6 @@ static int qcom_iommu_device_probe(struct platform_device 
*pdev)
goto err_pm_disable;
}
 
-   bus_set_iommu(&platform_bus_type, &qcom_iommu_ops);
-
if (qcom_iommu->local_base) {
pm_runtime_get_sync(dev);
writel_relaxed(0x, qcom_iommu->local_base + 
SMMU_INTR_SEL_NS);
@@ -864,8 +862,6 @@ static int qcom_iommu_device_remove(struct platform_device 
*pdev)
 {
struct qcom_iommu_dev *qcom_iommu = platform_get_drvdata(pdev);
 
-   bus_set_iommu(&platform_bus_type, NULL);
-
pm_runtime_force_suspend(&pdev->dev);
platform_set_drvdata(pdev, NULL);
iommu_device_sysfs_remove(&qcom_iommu->iommu);
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 69a4a62dc3b9..7274f86b2bc4 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -480,11 +480,7 @@ int __init pamu_domain_init(void)
if (ret) {
iommu_device_sysfs_remove(&pamu_iommu);
pr_err("Can't register iommu device\n");
-   return ret;
}
 
-   bus_set_iommu(&platform_bus_type, &fsl_pamu_ops);
-   bus_set_iommu(&pci_bus_type, &fsl_pamu_ops);
-
return ret;
 }
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9507b64fdf6b..e0a31fa6a70c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4178,7 +4178,6 @@ int __init intel_iommu_init(void)
}
up_read(&dmar_global_lock);
 
-   bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
if (si_domain && !hw_pass_through)
register_memory_notifier(&intel_iommu_memory_nb);
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c89af4dc54c2..5f10e7ad04b0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1887,30 +1887,6 @@ static int iommu_bus_init(struct bus_type *bus)
return err;
 }
 
-/**
- * bus_set_iommu - set iommu-callbacks for the bus
- * @bus: bus.
- * @ops: the callbacks provided by the iommu-driver
- *
- * This function is called by an iommu driver to set the iommu methods
- * used for a particular bus. Drivers for devices on that bus can use
- * the iommu-api after these ops are registered.
- * This special function is needed because IOMMUs are usually devices on
- * the bus itself, so the iommu drivers are not initialized when the bus
- * is set up. With this function the iommu-driver can set the iommu-ops
- * afterwards.
- */
-int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops)
-{
-   if (bus->iommu_ops && ops && bus->iommu_ops != ops)
-   return -EBUSY;
-
-   bus->iommu_ops = ops;
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(bus_set_iommu);
-
 bool iommu_present(struct bus_type *bus)
 {
return bus->iommu_ops != NULL;
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 50f57624610f..5b89fb16feb8 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -789,8 +789,6 @@ static int msm_iommu_probe(struct platform_device *pdev)
goto fail;
}
 
-   bus_set_iommu(&platform_bus_type, &msm_iommu_ops);
-
pr_info("device mapped at %p, irq %d with %d ctx banks\n",
iommu->base, iommu->irq, iommu->ncb);
 
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index ab57c4b8fade..a3fc59b814ab 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -1300,8 +1300,6 @@ static int rk_iommu_probe(struct platform_device *pdev)
if (!dma_dev)
dma_dev = &pdev->dev;
 
-   bus_set_iommu(&platform_bus_type, &rk_iommu_ops);
-
pm_runtime_enable(dev);
 
for (i = 0; i < iommu->num_irq; i++) {
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 3833e86c6e7b..5f5f4bd91e6f 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -376,9 +37

[PATCH v2 12/14] iommu/tegra-smmu: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the probe failure path accordingly.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/tegra-smmu.c | 29 ++---
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 1fea68e551f1..2e4d2e4c65bb 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -1086,8 +1086,8 @@ struct tegra_smmu *tegra_smmu_probe(struct device *dev,
 
/*
 * This is a bit of a hack. Ideally we'd want to simply return this
-* value. However the IOMMU registration process will attempt to add
-* all devices to the IOMMU when bus_set_iommu() is called. In order
+* value. However iommu_device_register() will attempt to add
+* all devices to the IOMMU before we get that far. In order
 * not to rely on global variables to track the IOMMU instance, we
 * set it here so that it can be looked up from the .probe_device()
 * callback via the IOMMU device's .drvdata field.
@@ -1141,32 +1141,15 @@ struct tegra_smmu *tegra_smmu_probe(struct device *dev,
return ERR_PTR(err);
 
err = iommu_device_register(&smmu->iommu, &tegra_smmu_ops, dev);
-   if (err)
-   goto remove_sysfs;
-
-   err = bus_set_iommu(&platform_bus_type, &tegra_smmu_ops);
-   if (err < 0)
-   goto unregister;
-
-#ifdef CONFIG_PCI
-   err = bus_set_iommu(&pci_bus_type, &tegra_smmu_ops);
-   if (err < 0)
-   goto unset_platform_bus;
-#endif
+   if (err) {
+   iommu_device_sysfs_remove(&smmu->iommu);
+   return ERR_PTR(err);
+   }
 
if (IS_ENABLED(CONFIG_DEBUG_FS))
tegra_smmu_debugfs_init(smmu);
 
return smmu;
-
-unset_platform_bus: __maybe_unused;
-   bus_set_iommu(&platform_bus_type, NULL);
-unregister:
-   iommu_device_unregister(&smmu->iommu);
-remove_sysfs:
-   iommu_device_sysfs_remove(&smmu->iommu);
-
-   return ERR_PTR(err);
 }
 
 void tegra_smmu_remove(struct tegra_smmu *smmu)
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 10/14] iommu/mtk: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the probe failure paths accordingly.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/mtk_iommu.c| 13 +
 drivers/iommu/mtk_iommu_v1.c | 13 +
 2 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 6fd75a60abd6..4278d9e032ad 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -920,19 +920,11 @@ static int mtk_iommu_probe(struct platform_device *pdev)
spin_lock_init(&data->tlb_lock);
list_add_tail(&data->list, &m4ulist);
 
-   if (!iommu_present(&platform_bus_type)) {
-   ret = bus_set_iommu(&platform_bus_type, &mtk_iommu_ops);
-   if (ret)
-   goto out_list_del;
-   }
-
ret = component_master_add_with_match(dev, &mtk_iommu_com_ops, match);
if (ret)
-   goto out_bus_set_null;
+   goto out_list_del;
return ret;
 
-out_bus_set_null:
-   bus_set_iommu(&platform_bus_type, NULL);
 out_list_del:
list_del(&data->list);
iommu_device_unregister(&data->iommu);
@@ -952,9 +944,6 @@ static int mtk_iommu_remove(struct platform_device *pdev)
iommu_device_sysfs_remove(&data->iommu);
iommu_device_unregister(&data->iommu);
 
-   if (iommu_present(&platform_bus_type))
-   bus_set_iommu(&platform_bus_type, NULL);
-
clk_disable_unprepare(data->bclk);
device_link_remove(data->smicomm_dev, &pdev->dev);
pm_runtime_disable(&pdev->dev);
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index ecff800656e6..7d17d6a21803 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -660,19 +660,11 @@ static int mtk_iommu_probe(struct platform_device *pdev)
if (ret)
goto out_sysfs_remove;
 
-   if (!iommu_present(&platform_bus_type)) {
-   ret = bus_set_iommu(&platform_bus_type,  &mtk_iommu_ops);
-   if (ret)
-   goto out_dev_unreg;
-   }
-
ret = component_master_add_with_match(dev, &mtk_iommu_com_ops, match);
if (ret)
-   goto out_bus_set_null;
+   goto out_dev_unreg;
return ret;
 
-out_bus_set_null:
-   bus_set_iommu(&platform_bus_type, NULL);
 out_dev_unreg:
iommu_device_unregister(&data->iommu);
 out_sysfs_remove:
@@ -687,9 +679,6 @@ static int mtk_iommu_remove(struct platform_device *pdev)
iommu_device_sysfs_remove(&data->iommu);
iommu_device_unregister(&data->iommu);
 
-   if (iommu_present(&platform_bus_type))
-   bus_set_iommu(&platform_bus_type, NULL);
-
clk_disable_unprepare(data->bclk);
devm_free_irq(&pdev->dev, data->irq, data);
component_master_del(&pdev->dev, &mtk_iommu_com_ops);
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 11/14] iommu/omap: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the init failure path accordingly.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/omap-iommu.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index d9cf2820c02e..07ee2600113c 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1776,14 +1776,8 @@ static int __init omap_iommu_init(void)
goto fail_driver;
}
 
-   ret = bus_set_iommu(&platform_bus_type, &omap_iommu_ops);
-   if (ret)
-   goto fail_bus;
-
return 0;
 
-fail_bus:
-   platform_driver_unregister(&omap_iommu_driver);
 fail_driver:
kmem_cache_destroy(iopte_cachep);
return ret;
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 09/14] iommu/ipmmu-vmsa: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary. This also
leaves the custom initcall effectively doing nothing but register
the driver, which no longer needs to happen early either, so convert
it to builtin_platform_driver().

Signed-off-by: Robin Murphy 
---
 drivers/iommu/ipmmu-vmsa.c | 35 +--
 1 file changed, 1 insertion(+), 34 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 8fdb84b3642b..2549d32f0ddd 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -1090,11 +1090,6 @@ static int ipmmu_probe(struct platform_device *pdev)
ret = iommu_device_register(&mmu->iommu, &ipmmu_ops, 
&pdev->dev);
if (ret)
return ret;
-
-#if defined(CONFIG_IOMMU_DMA)
-   if (!iommu_present(&platform_bus_type))
-   bus_set_iommu(&platform_bus_type, &ipmmu_ops);
-#endif
}
 
/*
@@ -1168,32 +1163,4 @@ static struct platform_driver ipmmu_driver = {
.probe = ipmmu_probe,
.remove = ipmmu_remove,
 };
-
-static int __init ipmmu_init(void)
-{
-   struct device_node *np;
-   static bool setup_done;
-   int ret;
-
-   if (setup_done)
-   return 0;
-
-   np = of_find_matching_node(NULL, ipmmu_of_ids);
-   if (!np)
-   return 0;
-
-   of_node_put(np);
-
-   ret = platform_driver_register(&ipmmu_driver);
-   if (ret < 0)
-   return ret;
-
-#if defined(CONFIG_ARM) && !defined(CONFIG_IOMMU_DMA)
-   if (!iommu_present(&platform_bus_type))
-   bus_set_iommu(&platform_bus_type, &ipmmu_ops);
-#endif
-
-   setup_done = true;
-   return 0;
-}
-subsys_initcall(ipmmu_init);
+builtin_platform_driver(ipmmu_driver);
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 08/14] iommu/exynos: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the init failure path accordingly.

Tested-by: Marek Szyprowski 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/exynos-iommu.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 71f2018e23fe..359b255b3924 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -1356,16 +1356,7 @@ static int __init exynos_iommu_init(void)
goto err_zero_lv2;
}
 
-   ret = bus_set_iommu(&platform_bus_type, &exynos_iommu_ops);
-   if (ret) {
-   pr_err("%s: Failed to register exynos-iommu driver.\n",
-   __func__);
-   goto err_set_iommu;
-   }
-
return 0;
-err_set_iommu:
-   kmem_cache_free(lv2table_kmem_cache, zero_lv2_table);
 err_zero_lv2:
platform_driver_unregister(&exynos_sysmmu_driver);
 err_reg_driver:
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 07/14] iommu/dart: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the probe failure path accordingly.

Tested-by: Sven Peter 
Reviewed-by: Sven Peter 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/apple-dart.c | 30 +-
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index decafb07ad08..a679e4c02291 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -823,27 +823,6 @@ static irqreturn_t apple_dart_irq(int irq, void *dev)
return IRQ_HANDLED;
 }
 
-static int apple_dart_set_bus_ops(const struct iommu_ops *ops)
-{
-   int ret;
-
-   if (!iommu_present(&platform_bus_type)) {
-   ret = bus_set_iommu(&platform_bus_type, ops);
-   if (ret)
-   return ret;
-   }
-#ifdef CONFIG_PCI
-   if (!iommu_present(&pci_bus_type)) {
-   ret = bus_set_iommu(&pci_bus_type, ops);
-   if (ret) {
-   bus_set_iommu(&platform_bus_type, NULL);
-   return ret;
-   }
-   }
-#endif
-   return 0;
-}
-
 static int apple_dart_probe(struct platform_device *pdev)
 {
int ret;
@@ -899,14 +878,10 @@ static int apple_dart_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, dart);
 
-   ret = apple_dart_set_bus_ops(&apple_dart_iommu_ops);
-   if (ret)
-   goto err_free_irq;
-
ret = iommu_device_sysfs_add(&dart->iommu, dev, NULL, "apple-dart.%s",
 dev_name(&pdev->dev));
if (ret)
-   goto err_remove_bus_ops;
+   goto err_free_irq;
 
ret = iommu_device_register(&dart->iommu, &apple_dart_iommu_ops, dev);
if (ret)
@@ -920,8 +895,6 @@ static int apple_dart_probe(struct platform_device *pdev)
 
 err_sysfs_remove:
iommu_device_sysfs_remove(&dart->iommu);
-err_remove_bus_ops:
-   apple_dart_set_bus_ops(NULL);
 err_free_irq:
free_irq(dart->irq, dart);
 err_clk_disable:
@@ -936,7 +909,6 @@ static int apple_dart_remove(struct platform_device *pdev)
 
apple_dart_hw_reset(dart);
free_irq(dart->irq, dart);
-   apple_dart_set_bus_ops(NULL);
 
iommu_device_unregister(&dart->iommu);
iommu_device_sysfs_remove(&dart->iommu);
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 06/14] iommu/arm-smmu-v3: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and simplify
the probe failure path accordingly.

Acked-by: Will Deacon 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 53 +
 1 file changed, 2 insertions(+), 51 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 627a3ed5ee8f..73b7b1b17b77 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -28,8 +28,6 @@
 #include 
 #include 
 
-#include 
-
 #include "arm-smmu-v3.h"
 #include "../../iommu-sva-lib.h"
 
@@ -3698,43 +3696,6 @@ static unsigned long arm_smmu_resource_size(struct 
arm_smmu_device *smmu)
return SZ_128K;
 }
 
-static int arm_smmu_set_bus_ops(struct iommu_ops *ops)
-{
-   int err;
-
-#ifdef CONFIG_PCI
-   if (pci_bus_type.iommu_ops != ops) {
-   err = bus_set_iommu(&pci_bus_type, ops);
-   if (err)
-   return err;
-   }
-#endif
-#ifdef CONFIG_ARM_AMBA
-   if (amba_bustype.iommu_ops != ops) {
-   err = bus_set_iommu(&amba_bustype, ops);
-   if (err)
-   goto err_reset_pci_ops;
-   }
-#endif
-   if (platform_bus_type.iommu_ops != ops) {
-   err = bus_set_iommu(&platform_bus_type, ops);
-   if (err)
-   goto err_reset_amba_ops;
-   }
-
-   return 0;
-
-err_reset_amba_ops:
-#ifdef CONFIG_ARM_AMBA
-   bus_set_iommu(&amba_bustype, NULL);
-#endif
-err_reset_pci_ops: __maybe_unused;
-#ifdef CONFIG_PCI
-   bus_set_iommu(&pci_bus_type, NULL);
-#endif
-   return err;
-}
-
 static void __iomem *arm_smmu_ioremap(struct device *dev, resource_size_t 
start,
  resource_size_t size)
 {
@@ -3838,27 +3799,17 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
-   goto err_sysfs_remove;
+   iommu_device_sysfs_remove(&smmu->iommu);
+   return ret;
}
 
-   ret = arm_smmu_set_bus_ops(&arm_smmu_ops);
-   if (ret)
-   goto err_unregister_device;
-
return 0;
-
-err_unregister_device:
-   iommu_device_unregister(&smmu->iommu);
-err_sysfs_remove:
-   iommu_device_sysfs_remove(&smmu->iommu);
-   return ret;
 }
 
 static int arm_smmu_device_remove(struct platform_device *pdev)
 {
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
 
-   arm_smmu_set_bus_ops(NULL);
iommu_device_unregister(&smmu->iommu);
iommu_device_sysfs_remove(&smmu->iommu);
arm_smmu_device_disable(smmu);
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 05/14] iommu/arm-smmu: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary. With device
probes now replayed for every IOMMU instance registration, the whole
sorry ordering workaround for legacy DT bindings goes too, hooray!

Acked-by: Will Deacon 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 84 +--
 1 file changed, 2 insertions(+), 82 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 568cce590ccc..f0bec4a35df5 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -37,7 +37,6 @@
 #include 
 #include 
 
-#include 
 #include 
 
 #include "arm-smmu.h"
@@ -93,8 +92,6 @@ static struct platform_driver arm_smmu_driver;
 static struct iommu_ops arm_smmu_ops;
 
 #ifdef CONFIG_ARM_SMMU_LEGACY_DT_BINDINGS
-static int arm_smmu_bus_init(struct iommu_ops *ops);
-
 static struct device_node *dev_get_dev_node(struct device *dev)
 {
if (dev_is_pci(dev)) {
@@ -180,20 +177,6 @@ static int arm_smmu_register_legacy_master(struct device 
*dev,
kfree(sids);
return err;
 }
-
-/*
- * With the legacy DT binding in play, we have no guarantees about
- * probe order, but then we're also not doing default domains, so we can
- * delay setting bus ops until we're sure every possible SMMU is ready,
- * and that way ensure that no probe_device() calls get missed.
- */
-static int arm_smmu_legacy_bus_init(void)
-{
-   if (using_legacy_binding)
-   return arm_smmu_bus_init(&arm_smmu_ops);
-   return 0;
-}
-device_initcall_sync(arm_smmu_legacy_bus_init);
 #else
 static int arm_smmu_register_legacy_master(struct device *dev,
   struct arm_smmu_device **smmu)
@@ -2022,52 +2005,6 @@ static int arm_smmu_device_dt_probe(struct 
arm_smmu_device *smmu,
return 0;
 }
 
-static int arm_smmu_bus_init(struct iommu_ops *ops)
-{
-   int err;
-
-   /* Oh, for a proper bus abstraction */
-   if (!iommu_present(&platform_bus_type)) {
-   err = bus_set_iommu(&platform_bus_type, ops);
-   if (err)
-   return err;
-   }
-#ifdef CONFIG_ARM_AMBA
-   if (!iommu_present(&amba_bustype)) {
-   err = bus_set_iommu(&amba_bustype, ops);
-   if (err)
-   goto err_reset_platform_ops;
-   }
-#endif
-#ifdef CONFIG_PCI
-   if (!iommu_present(&pci_bus_type)) {
-   err = bus_set_iommu(&pci_bus_type, ops);
-   if (err)
-   goto err_reset_amba_ops;
-   }
-#endif
-#ifdef CONFIG_FSL_MC_BUS
-   if (!iommu_present(&fsl_mc_bus_type)) {
-   err = bus_set_iommu(&fsl_mc_bus_type, ops);
-   if (err)
-   goto err_reset_pci_ops;
-   }
-#endif
-   return 0;
-
-err_reset_pci_ops: __maybe_unused;
-#ifdef CONFIG_PCI
-   bus_set_iommu(&pci_bus_type, NULL);
-#endif
-err_reset_amba_ops: __maybe_unused;
-#ifdef CONFIG_ARM_AMBA
-   bus_set_iommu(&amba_bustype, NULL);
-#endif
-err_reset_platform_ops: __maybe_unused;
-   bus_set_iommu(&platform_bus_type, NULL);
-   return err;
-}
-
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
struct resource *res;
@@ -2185,7 +2122,8 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
err = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (err) {
dev_err(dev, "Failed to register iommu\n");
-   goto err_sysfs_remove;
+   iommu_device_sysfs_remove(&smmu->iommu);
+   return err;
}
 
platform_set_drvdata(pdev, smmu);
@@ -2203,24 +2141,7 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
pm_runtime_enable(dev);
}
 
-   /*
-* For ACPI and generic DT bindings, an SMMU will be probed before
-* any device which might need it, so we want the bus ops in place
-* ready to handle default domain setup as soon as any SMMU exists.
-*/
-   if (!using_legacy_binding) {
-   err = arm_smmu_bus_init(&arm_smmu_ops);
-   if (err)
-   goto err_unregister_device;
-   }
-
return 0;
-
-err_unregister_device:
-   iommu_device_unregister(&smmu->iommu);
-err_sysfs_remove:
-   iommu_device_sysfs_remove(&smmu->iommu);
-   return err;
 }
 
 static int arm_smmu_device_remove(struct platform_device *pdev)
@@ -2233,7 +2154,6 @@ static int arm_smmu_device_remove(struct platform_device 
*pdev)
if (!bitmap_empty(smmu->context_map, ARM_SMMU_MAX_CBS))
dev_notice(&pdev->dev, "disabling translation\n");
 
-   arm_smmu_bus_init(NULL);
iommu_device_unregister(&smmu->iommu);
iommu_device_sysfs_remove(&smmu->iommu);
 
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.o

[PATCH v2 04/14] iommu/amd: Clean up bus_set_iommu()

2022-04-28 Thread Robin Murphy
Stop calling bus_set_iommu() since it's now unnecessary, and
garbage-collect the last remnants of amd_iommu_init_api().

Signed-off-by: Robin Murphy 
---
 drivers/iommu/amd/amd_iommu.h |  1 -
 drivers/iommu/amd/init.c  |  9 +
 drivers/iommu/amd/iommu.c | 21 -
 3 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..384393ce57fb 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -18,7 +18,6 @@ extern void amd_iommu_restart_event_logging(struct amd_iommu 
*iommu);
 extern int amd_iommu_init_devices(void);
 extern void amd_iommu_uninit_devices(void);
 extern void amd_iommu_init_notifier(void);
-extern int amd_iommu_init_api(void);
 
 #ifdef CONFIG_AMD_IOMMU_DEBUGFS
 void amd_iommu_debugfs_setup(struct amd_iommu *iommu);
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 0467918bf7fd..1cb10d8b0df4 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1970,20 +1970,13 @@ static int __init amd_iommu_init_pci(void)
/*
 * Order is important here to make sure any unity map requirements are
 * fulfilled. The unity mappings are created and written to the device
-* table during the amd_iommu_init_api() call.
+* table during the iommu_init_pci() call.
 *
 * After that we call init_device_table_dma() to make sure any
 * uninitialized DTE will block DMA, and in the end we flush the caches
 * of all IOMMUs to make sure the changes to the device table are
 * active.
 */
-   ret = amd_iommu_init_api();
-   if (ret) {
-   pr_err("IOMMU: Failed to initialize IOMMU-API interface 
(error=%d)!\n",
-  ret);
-   goto out;
-   }
-
init_device_table_dma();
 
for_each_iommu(iommu)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 038e104b922c..d907c96ff84e 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -11,8 +11,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -1838,25 +1836,6 @@ void amd_iommu_domain_update(struct protection_domain 
*domain)
amd_iommu_domain_flush_complete(domain);
 }
 
-int __init amd_iommu_init_api(void)
-{
-   int err;
-
-   err = bus_set_iommu(&pci_bus_type, &amd_iommu_ops);
-   if (err)
-   return err;
-#ifdef CONFIG_ARM_AMBA
-   err = bus_set_iommu(&amba_bustype, &amd_iommu_ops);
-   if (err)
-   return err;
-#endif
-   err = bus_set_iommu(&platform_bus_type, &amd_iommu_ops);
-   if (err)
-   return err;
-
-   return 0;
-}
-
 /*
  *
  * The following functions belong to the exported interface of AMD IOMMU
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 03/14] iommu: Move bus setup to IOMMU device registration

2022-04-28 Thread Robin Murphy
Move the bus setup to iommu_device_register(). This should allow
bus_iommu_probe() to be correctly replayed for multiple IOMMU instances,
and leaves bus_set_iommu() as a glorified no-op to be cleaned up next.

At this point we can also handle cleanup better than just rolling back
the most-recently-touched bus upon failure - which may release devices
owned by other already-registered instances, and still leave devices on
other buses with dangling pointers to the failed instance. Now it's easy
to clean up the exact footprint of a given instance, no more, no less.

Tested-by: Marek Szyprowski 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/iommu.c | 51 +++
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6c4621afc8cf..c89af4dc54c2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -175,6 +175,14 @@ static int __init iommu_subsys_init(void)
 }
 subsys_initcall(iommu_subsys_init);
 
+static int remove_iommu_group(struct device *dev, void *data)
+{
+   if (dev->iommu && dev->iommu->iommu_dev == data)
+   iommu_release_device(dev);
+
+   return 0;
+}
+
 /**
  * iommu_device_register() - Register an IOMMU hardware instance
  * @iommu: IOMMU handle for the instance
@@ -197,12 +205,29 @@ int iommu_device_register(struct iommu_device *iommu,
spin_lock(&iommu_device_lock);
list_add_tail(&iommu->list, &iommu_device_list);
spin_unlock(&iommu_device_lock);
+
+   for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++) {
+   struct bus_type *bus = iommu_buses[i];
+   int err;
+
+   WARN_ON(bus->iommu_ops && bus->iommu_ops != ops);
+   bus->iommu_ops = ops;
+   err = bus_iommu_probe(bus);
+   if (err) {
+   iommu_device_unregister(iommu);
+   return err;
+   }
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_device_register);
 
 void iommu_device_unregister(struct iommu_device *iommu)
 {
+   for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++)
+   bus_for_each_dev(iommu_buses[i], NULL, iommu, 
remove_iommu_group);
+
spin_lock(&iommu_device_lock);
list_del(&iommu->list);
spin_unlock(&iommu_device_lock);
@@ -1655,13 +1680,6 @@ static int probe_iommu_group(struct device *dev, void 
*data)
return ret;
 }
 
-static int remove_iommu_group(struct device *dev, void *data)
-{
-   iommu_release_device(dev);
-
-   return 0;
-}
-
 static int iommu_bus_notifier(struct notifier_block *nb,
  unsigned long action, void *data)
 {
@@ -1884,27 +1902,12 @@ static int iommu_bus_init(struct bus_type *bus)
  */
 int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops)
 {
-   int err;
-
-   if (ops == NULL) {
-   bus->iommu_ops = NULL;
-   return 0;
-   }
-
-   if (bus->iommu_ops != NULL)
+   if (bus->iommu_ops && ops && bus->iommu_ops != ops)
return -EBUSY;
 
bus->iommu_ops = ops;
 
-   /* Do IOMMU specific setup for this bus-type */
-   err = bus_iommu_probe(bus);
-   if (err) {
-   /* Clean up */
-   bus_for_each_dev(bus, NULL, NULL, remove_iommu_group);
-   bus->iommu_ops = NULL;
-   }
-
-   return err;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(bus_set_iommu);
 
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 01/14] iommu/vt-d: Temporarily reject probing non-PCI devices

2022-04-28 Thread Robin Murphy
Although the driver has some support implemented for non-PCI devices via
ANDD, it only registers itself for pci_bus_type, so has never actually
seen probe_device for a non-PCI device. Once the bus details move into
the IOMMU core, it appears there may be some issues with correctly
rejecting non-ANDD platform devices, so let's temporarily enforce the
current behaviour of only considering PCI devices until that can be
investigated properly.

Reported-by: Lu Baolu 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/intel/iommu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0edf6084dc14..9507b64fdf6b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4565,6 +4565,10 @@ static struct iommu_device 
*intel_iommu_probe_device(struct device *dev)
unsigned long flags;
u8 bus, devfn;
 
+   /* ANDD platform device support needs fixing */
+   if (!pdev)
+   return ERR_PTR(-ENODEV);
+
iommu = device_to_iommu(dev, &bus, &devfn);
if (!iommu)
return ERR_PTR(-ENODEV);
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 02/14] iommu: Always register bus notifiers

2022-04-28 Thread Robin Murphy
The number of bus types that the IOMMU subsystem deals with is small and
manageable, so pull that list into core code as a first step towards
cleaning up all the boilerplate bus-awareness from drivers. Calling
iommu_probe_device() before bus->iommu_ops is set will simply return
-ENODEV and not break the notifier call chain, so there should be no
harm in proactively registering all our bus notifiers at init time.

Tested-by: Marek Szyprowski 
Reviewed-By: Krishna Reddy 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/iommu.c | 49 ---
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 00218b56a441..6c4621afc8cf 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -6,6 +6,7 @@
 
 #define pr_fmt(fmt)"iommu: " fmt
 
+#include 
 #include 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -74,6 +76,7 @@ static const char * const iommu_group_resv_type_string[] = {
 #define IOMMU_CMD_LINE_DMA_API BIT(0)
 #define IOMMU_CMD_LINE_STRICT  BIT(1)
 
+static int iommu_bus_init(struct bus_type *bus);
 static int iommu_alloc_default_domain(struct iommu_group *group,
  struct device *dev);
 static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
@@ -102,6 +105,19 @@ struct iommu_group_attribute iommu_group_attr_##_name =
\
 static LIST_HEAD(iommu_device_list);
 static DEFINE_SPINLOCK(iommu_device_lock);
 
+static struct bus_type * const iommu_buses[] = {
+   &platform_bus_type,
+#ifdef CONFIG_PCI
+   &pci_bus_type,
+#endif
+#ifdef CONFIG_ARM_AMBA
+   &amba_bustype,
+#endif
+#ifdef CONFIG_FSL_MC_BUS
+   &fsl_mc_bus_type,
+#endif
+};
+
 /*
  * Use a function instead of an array here because the domain-type is a
  * bit-field, so an array would waste memory.
@@ -151,6 +167,10 @@ static int __init iommu_subsys_init(void)
(iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
"(set via kernel command line)" : "");
 
+   /* If the system is so broken that this fails, it will WARN anyway */
+   for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++)
+   iommu_bus_init(iommu_buses[i]);
+
return 0;
 }
 subsys_initcall(iommu_subsys_init);
@@ -1832,35 +1852,19 @@ int bus_iommu_probe(struct bus_type *bus)
return ret;
 }
 
-static int iommu_bus_init(struct bus_type *bus, const struct iommu_ops *ops)
+static int iommu_bus_init(struct bus_type *bus)
 {
struct notifier_block *nb;
int err;
 
-   nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL);
+   nb = kzalloc(sizeof(*nb), GFP_KERNEL);
if (!nb)
return -ENOMEM;
 
nb->notifier_call = iommu_bus_notifier;
-
err = bus_register_notifier(bus, nb);
if (err)
-   goto out_free;
-
-   err = bus_iommu_probe(bus);
-   if (err)
-   goto out_err;
-
-
-   return 0;
-
-out_err:
-   /* Clean up */
-   bus_for_each_dev(bus, NULL, NULL, remove_iommu_group);
-   bus_unregister_notifier(bus, nb);
-
-out_free:
-   kfree(nb);
+   kfree(nb);
 
return err;
 }
@@ -1893,9 +1897,12 @@ int bus_set_iommu(struct bus_type *bus, const struct 
iommu_ops *ops)
bus->iommu_ops = ops;
 
/* Do IOMMU specific setup for this bus-type */
-   err = iommu_bus_init(bus, ops);
-   if (err)
+   err = bus_iommu_probe(bus);
+   if (err) {
+   /* Clean up */
+   bus_for_each_dev(bus, NULL, NULL, remove_iommu_group);
bus->iommu_ops = NULL;
+   }
 
return err;
 }
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 00/14] iommu: Retire bus_set_iommu()

2022-04-28 Thread Robin Murphy
v1: 
https://lore.kernel.org/linux-iommu/cover.1649935679.git.robin.mur...@arm.com/

Hi all,

Just some minor updates for v2, adding a workaround to avoid changing
VT-d behaviour for now, cleaning up the extra include I missed in
virtio-iommu, and collecting all the acks so far. As before, this is
based on the AMD IOMMU patch now applied, plus v2 of my
device_iommu_capable() series here:

https://lore.kernel.org/linux-iommu/cover.1650878781.git.robin.mur...@arm.com/

Cheers,
Robin.



Robin Murphy (14):
  iommu/vt-d: Temporarily reject probing non-PCI devices
  iommu: Always register bus notifiers
  iommu: Move bus setup to IOMMU device registration
  iommu/amd: Clean up bus_set_iommu()
  iommu/arm-smmu: Clean up bus_set_iommu()
  iommu/arm-smmu-v3: Clean up bus_set_iommu()
  iommu/dart: Clean up bus_set_iommu()
  iommu/exynos: Clean up bus_set_iommu()
  iommu/ipmmu-vmsa: Clean up bus_set_iommu()
  iommu/mtk: Clean up bus_set_iommu()
  iommu/omap: Clean up bus_set_iommu()
  iommu/tegra-smmu: Clean up bus_set_iommu()
  iommu/virtio: Clean up bus_set_iommu()
  iommu: Clean up bus_set_iommu()

 drivers/iommu/amd/amd_iommu.h   |   1 -
 drivers/iommu/amd/init.c|   9 +-
 drivers/iommu/amd/iommu.c   |  21 
 drivers/iommu/apple-dart.c  |  30 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  53 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  84 +--
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |   4 -
 drivers/iommu/exynos-iommu.c|   9 --
 drivers/iommu/fsl_pamu_domain.c |   4 -
 drivers/iommu/intel/iommu.c |   5 +-
 drivers/iommu/iommu.c   | 110 +---
 drivers/iommu/ipmmu-vmsa.c  |  35 +--
 drivers/iommu/msm_iommu.c   |   2 -
 drivers/iommu/mtk_iommu.c   |  13 +--
 drivers/iommu/mtk_iommu_v1.c|  13 +--
 drivers/iommu/omap-iommu.c  |   6 --
 drivers/iommu/rockchip-iommu.c  |   2 -
 drivers/iommu/s390-iommu.c  |   6 --
 drivers/iommu/sprd-iommu.c  |   5 -
 drivers/iommu/sun50i-iommu.c|   2 -
 drivers/iommu/tegra-smmu.c  |  29 ++
 drivers/iommu/virtio-iommu.c|  25 -
 include/linux/iommu.h   |   1 -
 23 files changed, 67 insertions(+), 402 deletions(-)

-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RESEND PATCH v8 00/11] Fix BUG_ON in vfio_iommu_group_notifier()

2022-04-28 Thread Jason Gunthorpe via iommu
On Thu, Apr 28, 2022 at 11:32:04AM +0200, Joerg Roedel wrote:
> On Mon, Apr 18, 2022 at 08:49:49AM +0800, Lu Baolu wrote:
> > Lu Baolu (10):
> >   iommu: Add DMA ownership management interfaces
> >   driver core: Add dma_cleanup callback in bus_type
> >   amba: Stop sharing platform_dma_configure()
> >   bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
> >   PCI: pci_stub: Set driver_managed_dma
> >   PCI: portdrv: Set driver_managed_dma
> >   vfio: Set DMA ownership for VFIO devices
> >   vfio: Remove use of vfio_group_viable()
> >   vfio: Remove iommu group notifier
> >   iommu: Remove iommu group changes notifier
> 
> Applied to core branch, thanks Baolu.

Can we get this on a topic branch so Alex can pull it? There are
conflicts with other VFIO patches

Thanks!
Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/5] iommu/sun50i: Ensure the IOMMU can be used for DMA

2022-04-28 Thread Robin Murphy

On 2022-04-28 02:04, Samuel Holland wrote:

So far, the driver has relied on arch/arm64/Kconfig to select IOMMU_DMA.
Unsurprisingly, this does not work on RISC-V, so the driver must select
IOMMU_DMA itself.


No, IOMMU_DMA should only be selected by the architecture code that's 
also responsible for calling iommu_setup_dma_ops(). Without that, this 
select will do nothing other than add some unused code to the kernel image.


I appreciate that the current state of the x86 IOMMU drivers being 
tightly-coupled to the x86 arch code might be confusing (which reminds 
me I'd totally forgotten about [1]). I'm about to start reworking the 
whole area anyway, but for now please just follow the existing intent.


Thanks,
Robin.

[1] 
https://lore.kernel.org/linux-iommu/9ba6f2e8568a3ff6a94fade8d99705433c44.1631536879.git.robin.mur...@arm.com/



Signed-off-by: Samuel Holland 
---

  drivers/iommu/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c79a0df090c0..70a0bfa6d907 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -223,6 +223,7 @@ config SUN50I_IOMMU
depends on ARCH_SUNXI || COMPILE_TEST
select ARM_DMA_USE_IOMMU
select IOMMU_API
+   select IOMMU_DMA
help
  Support for the IOMMU introduced in the Allwinner H6 SoCs.
  

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/msm: add a check for the return of kzalloc()

2022-04-28 Thread Joerg Roedel
On Thu, Apr 28, 2022 at 04:52:39PM +0800, xkernel.w...@foxmail.com wrote:
> From: Xiaoke Wang 
> 
> kzalloc() is a memory allocation function which can return NULL when
> some internal memory errors happen. So it is better to check it to
> prevent potential wrong memory access.
> 
> Besides, to propagate the error to the caller, the type of
> insert_iommu_master() is changed to `int`. Several instructions related
> to it are also updated.
> 
> Signed-off-by: Xiaoke Wang 
> ---
> ChangeLog:
> v1->v2 propagate the error to the caller.
>  drivers/iommu/msm_iommu.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 00/37] iommu/amd: Add multiple PCI segments support

2022-04-28 Thread Joerg Roedel
Hi Vasant, Hi Suravee,

On Mon, Apr 25, 2022 at 05:03:38PM +0530, Vasant Hegde wrote:
> Newer AMD systems can support multiple PCI segments, where each segment
> contains one or more IOMMU instances. However, an IOMMU instance can only
> support a single PCI segment.

Thanks for doing this, making the AMD IOMMU driver multi-segment aware
has been on my todo list for a while too. Overall the series looks good
to me, just some minor comments to some patches.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 37/37] iommu/amd: Update amd_iommu_fault structure to include PCI seg ID

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:04:15PM +0530, Vasant Hegde wrote:
> + seg_id = (iommu_fault->sbdf >> 16) & 0x;
> + devid = iommu_fault->sbdf & 0x;

This deserves some macros for readability.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 27/37] iommu/amd: Remove global amd_iommu_dev_table

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:04:05PM +0530, Vasant Hegde wrote:
> From: Suravee Suthikulpanit 
> 
> Replace global amd_iommu_dev_table with per PCI segment device table.
> Also remove "dev_table_size".
> 
> Co-developed-by: Vasant Hegde 
> Signed-off-by: Vasant Hegde 
> Signed-off-by: Suravee Suthikulpanit 

Patches 27-29 can be merged into one.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 11/37] iommu/amd: Introduce per PCI segment device table size

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:03:49PM +0530, Vasant Hegde wrote:
> + /* Size of the device table */
> + u32 dev_table_size;

Same here and with all other size indicators. If they are always going
to have their maximum value anyways, we can drop them.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 10/37] iommu/amd: Introduce per PCI segment last_bdf

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:03:48PM +0530, Vasant Hegde wrote:
> + /* Largest PCI device id we expect translation requests for */
> + u16 last_bdf;

How does the IVRS table look like on these systems? Do they still
enumerate the whole PCI Bus/Dev/Fn space? If so I am fine with getting
rid of last_bdf alltogether and just allocate the data structures with
their maximum size.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 01/37] iommu/amd: Update struct iommu_dev_data defination

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:03:39PM +0530, Vasant Hegde wrote:

Subject: iommu/amd: Update struct iommu_dev_data defination
 ^^ Typo

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 02/37] iommu/amd: Introduce pci segment structure

2022-04-28 Thread Joerg Roedel
Hi Vasant,

On Mon, Apr 25, 2022 at 05:03:40PM +0530, Vasant Hegde wrote:
> +/*
> + * This structure contains information about one PCI segment in the system.
> + */
> +struct amd_iommu_pci_seg {
> + struct list_head list;

The purpose of this list_head needs a comment.

> +
> + /* PCI segment number */
> + u16 id;
> +};
> +/*
> + * List with all PCI segments in the system. This list is not locked because
> + * it is only written at driver initialization time
> + */
> +extern struct list_head amd_iommu_pci_seg_list;

So there will never be hotplug of a PCI segment? Say together with
hotplugging a CPU?

> +static void __init free_pci_segment(void)

This needs plural: free_pci_segments(), as it frees all segments.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/dart: check return value after calling platform_get_resource()

2022-04-28 Thread Joerg Roedel
On Mon, Apr 25, 2022 at 05:08:26PM +0800, Yang Yingliang wrote:
> It will cause null-ptr-deref in resource_size(), if platform_get_resource()
> returns NULL, move calling resource_size() after devm_ioremap_resource() that
> will check 'res' to avoid null-ptr-deref.
> And use devm_platform_get_and_ioremap_resource() to simplify code.
> 
> Fixes: 46d1fb072e76 ("iommu/dart: Add DART iommu driver")
> Signed-off-by: Yang Yingliang 
> ---
>  drivers/iommu/apple-dart.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)

Applied to iommu/fixes, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] dt-bindings: iommu: Drop client node in examples

2022-04-28 Thread Joerg Roedel
On Fri, Apr 22, 2022 at 02:21:03PM -0500, Rob Herring wrote:
> There's no need to show consumer side in provider examples. The ones
> used here are undocumented or undocumented in schemas which results in
> warnings.
> 
> Signed-off-by: Rob Herring 

Applied, thanks Rob.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [GIT PULL] iommu/arm-smmu: Fixes for 5.18

2022-04-28 Thread Joerg Roedel
On Fri, Apr 22, 2022 at 12:22:34PM +0100, Will Deacon wrote:
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git 
> tags/arm-smmu-fixes

Pulled, thanks Will.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RESEND PATCH v8 00/11] Fix BUG_ON in vfio_iommu_group_notifier()

2022-04-28 Thread Joerg Roedel
On Mon, Apr 18, 2022 at 08:49:49AM +0800, Lu Baolu wrote:
> Lu Baolu (10):
>   iommu: Add DMA ownership management interfaces
>   driver core: Add dma_cleanup callback in bus_type
>   amba: Stop sharing platform_dma_configure()
>   bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
>   PCI: pci_stub: Set driver_managed_dma
>   PCI: portdrv: Set driver_managed_dma
>   vfio: Set DMA ownership for VFIO devices
>   vfio: Remove use of vfio_group_viable()
>   vfio: Remove iommu group notifier
>   iommu: Remove iommu group changes notifier

Applied to core branch, thanks Baolu.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 01/11] dt-bindings: iommu: arm,smmu-v3: make PRI IRQ optional

2022-04-28 Thread Krzysztof Kozlowski
On 28/04/2022 11:23, Robin Murphy wrote:
> On 2022-04-28 07:56, Krzysztof Kozlowski wrote:
>> On 27/04/2022 13:25, Andre Przywara wrote:
>>> The Page Request Interface (PRI) is an optional PCIe feature. As such, a
>>> SMMU would not need to handle it if the PCIe host bridge or the SMMU
>>> itself do not implement it. Also an SMMU could be connected to a platform
>>> device, without any PRI functionality whatsoever.
>>> In all cases there would be no SMMU PRI queue interrupt to be wired up
>>> to an interrupt controller.
>>>
>>> Relax the binding to allow specifying three interrupts, omitting the PRI
>>> IRQ. At the moment, with the "eventq,gerror,priq,cmdq-sync" order, we
>>> would need to sacrifice the command queue sync interrupt as well, which
>>> might not be desired.
>>> The Linux driver does not care about any order at all, just picks IRQs
>>> based on their names, and treats all (wired) IRQs as optional.
>>
>> The last sentence is not a good explanation for the bindings. They are
>> not about Linux and are used in other projects as well.
>>
>>>
>>> Signed-off-by: Andre Przywara 
>>> ---
>>>   .../bindings/iommu/arm,smmu-v3.yaml   | 21 ++-
>>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml 
>>> b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
>>> index e87bfbcc69135..6b3111f1f06ce 100644
>>> --- a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
>>> +++ b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
>>> @@ -37,12 +37,23 @@ properties:
>>> hardware supports just a single, combined interrupt line.
>>> If provided, then the combined interrupt will be used in 
>>> preference to
>>> any others.
>>> -  - minItems: 2
>>> +  - minItems: 1
>>>   items:
>>> -  - const: eventq # Event Queue not empty
>>> -  - const: gerror # Global Error activated
>>> -  - const: priq   # PRI Queue not empty
>>> -  - const: cmdq-sync  # CMD_SYNC complete
>>> +  - enum:
>>> +  - eventq # Event Queue not empty
>>> +  - gerror # Global Error activated
>>> +  - cmdq-sync  # CMD_SYNC complete
>>> +  - priq   # PRI Queue not empty
>>> +  - enum:
>>> +  - gerror
>>> +  - cmdq-sync
>>> +  - priq
>>> +  - enum:
>>> +  - cmdq-sync
>>> +  - priq
>>> +  - enum:
>>> +  - cmdq-sync
>>> +  - priq
>>
>> The order should be strict, so if you want the first interrupt optional,
>> then:
>> oneOf:
>>   - items:
>>  ... 4 items list
>>   - items
>>  ... 3 items list
> 
> All 4 interrupts are optional, though, since any of them could 
> potentially use an MSI instead. Do we really want to list out all 15 
> combinations?

Bah, I missed that part from commit msg. It's fine then.


Best regards,
Krzysztof
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 01/11] dt-bindings: iommu: arm,smmu-v3: make PRI IRQ optional

2022-04-28 Thread Robin Murphy

On 2022-04-28 07:56, Krzysztof Kozlowski wrote:

On 27/04/2022 13:25, Andre Przywara wrote:

The Page Request Interface (PRI) is an optional PCIe feature. As such, a
SMMU would not need to handle it if the PCIe host bridge or the SMMU
itself do not implement it. Also an SMMU could be connected to a platform
device, without any PRI functionality whatsoever.
In all cases there would be no SMMU PRI queue interrupt to be wired up
to an interrupt controller.

Relax the binding to allow specifying three interrupts, omitting the PRI
IRQ. At the moment, with the "eventq,gerror,priq,cmdq-sync" order, we
would need to sacrifice the command queue sync interrupt as well, which
might not be desired.
The Linux driver does not care about any order at all, just picks IRQs
based on their names, and treats all (wired) IRQs as optional.


The last sentence is not a good explanation for the bindings. They are
not about Linux and are used in other projects as well.



Signed-off-by: Andre Przywara 
---
  .../bindings/iommu/arm,smmu-v3.yaml   | 21 ++-
  1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml 
b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
index e87bfbcc69135..6b3111f1f06ce 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml
@@ -37,12 +37,23 @@ properties:
hardware supports just a single, combined interrupt line.
If provided, then the combined interrupt will be used in preference 
to
any others.
-  - minItems: 2
+  - minItems: 1
  items:
-  - const: eventq # Event Queue not empty
-  - const: gerror # Global Error activated
-  - const: priq   # PRI Queue not empty
-  - const: cmdq-sync  # CMD_SYNC complete
+  - enum:
+  - eventq # Event Queue not empty
+  - gerror # Global Error activated
+  - cmdq-sync  # CMD_SYNC complete
+  - priq   # PRI Queue not empty
+  - enum:
+  - gerror
+  - cmdq-sync
+  - priq
+  - enum:
+  - cmdq-sync
+  - priq
+  - enum:
+  - cmdq-sync
+  - priq


The order should be strict, so if you want the first interrupt optional,
then:
oneOf:
  - items:
 ... 4 items list
  - items
 ... 3 items list


All 4 interrupts are optional, though, since any of them could 
potentially use an MSI instead. Do we really want to list out all 15 
combinations?


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


  1   2   >