Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-30 Thread David Gibson
On Fri, Apr 29, 2022 at 09:54:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > > But I don't have a use case in mind? The simplified things I know
> > > about want to attach their devices then allocate valid IOVA, they
> > > don't really have a notion about what IOVA regions they are willing to
> > > accept, or necessarily do hotplug.
> > 
> > The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> > emulation code knows the IOVA windows that are expected of the vIOMMU
> > (because that's a property of the emulated platform), and requests
> > them of the host IOMMU.  If the host can supply that, you're good
> > (this doesn't necessarily mean the host windows match exactly, just
> > that the requested windows fit within the host windows).  If not,
> > you report an error.  This can be done at any point when the host
> > windows might change - so try to attach a device that can't support
> > the requested windows, and it will fail.  Attaching a device which
> > shrinks the windows, but still fits the requested windows within, and
> > you're still good to go.
> 
> We were just talking about this in another area - Alex said that qemu
> doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Uh.. what?  We certainly know (or, rather, choose) the IOVA ranges for
ppc.  That is to say we set up the default IOVA ranges at machine
construction (those defaults have changed with machine version a
couple of times).  If the guest uses dynamic DMA windows we then
update those ranges based on the hypercalls, but at any point we know
what the IOVA windows are supposed to be.  I don't really see how x86
or anything else could not know the IOVA ranges.  Who else *could* set
the ranges when implementing a vIOMMU in TCG mode?

For the non-vIOMMU case then IOVA==GPA, so everything qemu knows about
the GPA space it also knows about the IOVA space.  Which, come to
think of it, means memory hotplug also complicates things.

> Even if yes, qemu is able to manage this on its own - it doesn't use
> the kernel IOVA allocator, so there is not a strong reason to tell the
> kernel what the narrowed ranges are.

I don't follow.  The problem for the qemu case here is if you hotplug
a device which narrows down the range to something smaller than the
guest expects.  If qemu has told the kernel the ranges it needs, that
can just fail (which is the best you can do).  If the kernel adds the
device but narrows the ranges, then you may have just put the guest
into a situation where the vIOMMU cannot do what the guest expects it
to.  If qemu can only query the windows, not specify them, then it
won't know that adding a particular device will conflict with its
guest-side requirements until after it's already added.  That could
mess up concurrent guest-initiated map operations for existing devices
in the same guest-side domain, so I don't think reversing the hotplug
after the problem is detected is enough.

> > > That is one possibility, yes. qemu seems to be using this to establish
> > > a clone ioas of an existing operational one which is another usage
> > > model.
> > 
> > Right, for qemu (or other hypervisors) the obvious choice would be to
> > create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> > emulated bus IOASes.  For a userspace driver situation, I'm guessing
> > you'd map your relevant memory pool into an IOAS, then COPY to the
> > IOAS you need for whatever specific devices you're using.
> 
> qemu seems simpler: it already juggles multiple containers, so it literally
> just copies when it instantiates a new container and does a map in the
> multi-container case.

I don't follow you.  Are you talking about the vIOMMU or non-vIOMMU
case?  In the vIOMMU case the different containers can be for
different guest-side iommu domains with different guest-IOVA spaces,
so you can't just copy from one to another.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-29 Thread Jason Gunthorpe via iommu
On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > But I don't have a use case in mind? The simplified things I know
> > about want to attach their devices then allocate valid IOVA, they
> > don't really have a notion about what IOVA regions they are willing to
> > accept, or necessarily do hotplug.
> 
> The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> emulation code knows the IOVA windows that are expected of the vIOMMU
> (because that's a property of the emulated platform), and requests
> them of the host IOMMU.  If the host can supply that, you're good
> (this doesn't necessarily mean the host windows match exactly, just
> that the requested windows fit within the host windows).  If not,
> you report an error.  This can be done at any point when the host
> windows might change - so try to attach a device that can't support
> the requested windows, and it will fail.  Attaching a device which
> shrinks the windows, but still fits the requested windows within, and
> you're still good to go.

We were just talking about this in another area - Alex said that qemu
doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Even if yes, qemu is able to manage this on its own - it doesn't use
the kernel IOVA allocator, so there is not a strong reason to tell the
kernel what the narrowed ranges are.

> > That is one possibility, yes. qemu seems to be using this to establish
> > a clone ioas of an existing operational one which is another usage
> > model.
> 
> Right, for qemu (or other hypervisors) the obvious choice would be to
> create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> emulated bus IOASes.  For a userspace driver situation, I'm guessing
> you'd map your relevant memory pool into an IOAS, then COPY to the
> IOAS you need for whatever specific devices you're using.

qemu seems simpler: it already juggles multiple containers, so it literally
just copies when it instantiates a new container and does a map in the
multi-container case.

Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-29 Thread David Gibson
On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is 
> > > > > the smaller
> > > > > + *   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside 
> > > > > these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of 
> > > > > iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > + __u32 size;
> > > > > + __u32 ioas_id;
> > > > > + __u32 out_num_iovas;
> > > > > + __u32 __reserved;
> > > > > + struct iommu_valid_iovas {
> > > > > + __aligned_u64 start;
> > > > > + __aligned_u64 last;
> > > > > + } out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifetime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel supplied
> > or failed outright).  Still I recognize that has its own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind? The simplified things I know
> about want to attach their devices then allocate valid IOVA, they
> don't really have a notion about what IOVA regions they are willing to
> accept, or necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attaching a device which
shrinks the windows, but still fits the requested windows within, and
you're still good to go.

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
to userspace.

One option might be to only allow IOAS_MAP (or COPY) operations after
windows are requested, but otherwise you can choose the order.  So,
things that have strict requirements for the windows (vIOMMU
emulation) would request the windows then add devices: they know the
windows they need, if the devices can't work with that, that's what
needs to fail.  A userspace driver, however, would attach the devices
it wants to use, then request a window (without specifying base
address).

A query ioctl to give the largest possible windows in the current
state could still be useful for debugging here, of course, but
wouldn't need to be used in the normal course of operation.

> What might be interesting is to have some option to 

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-28 Thread Jason Gunthorpe via iommu
On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > 
> > > > +/**
> > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > + * @ioas_id: IOAS ID to read ranges from
> > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > + * @__reserved: Must be 0
> > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is 
> > > > the smaller
> > > > + *   of out_num_iovas or the length implied by size.
> > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > + *
> > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these 
> > > > ranges is
> > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > + * size should include the allocated flex array.
> > > > + */
> > > > +struct iommu_ioas_iova_ranges {
> > > > +   __u32 size;
> > > > +   __u32 ioas_id;
> > > > +   __u32 out_num_iovas;
> > > > +   __u32 __reserved;
> > > > +   struct iommu_valid_iovas {
> > > > +   __aligned_u64 start;
> > > > +   __aligned_u64 last;
> > > > +   } out_valid_iovas[];
> > > > +};
> > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > 
> > > Is the information returned by this valid for the lifetime of the IOAS,
> > > or can it change?  If it can change, what events can change it?
> > >
> > > If it *can't* change, then how do we have enough information to
> > > determine this at ALLOC time, since we don't necessarily know which
> > > (if any) hardware IOMMU will be attached to it.
> > 
> > It is a good point worth documenting. It can change. Particularly
> > after any device attachment.
> 
> Right.. this is vital and needs to be front and centre in the
> comments/docs here.  Really, I think an interface that *doesn't* have
> magically changing status would be better (which is why I was
> advocating that the user set the constraints, and the kernel supplied
> or failed outright).  Still I recognize that has its own problems.

That is a neat idea, it could be a nice option, it lets userspace
further customize the kernel allocator.

But I don't have a use case in mind? The simplified things I know
about want to attach their devices then allocate valid IOVA, they
don't really have a notion about what IOVA regions they are willing to
accept, or necessarily do hotplug.

What might be interesting is to have some option to load in machine-specific
default ranges - ie the union of every group and every
iommu_domain. The idea being that after such a call hotplug of a
device should be very likely to succeed.

Though I don't have a user in mind..

> > I added this:
> > 
> >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these 
> > ranges
> >  * is not allowed. out_num_iovas will be set to the total number of iovas 
> > and
> >  * the out_valid_iovas[] will be filled in as space permits. size should 
> > include
> >  * the allocated flex array.
> >  *
> >  * The allowed ranges are dependent on the HW path the DMA operation takes, 
> > and
> >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have 
> > a
> >  * full range, and each attached device will narrow the ranges based on that
> >  * device's HW restrictions.
> 
> I think you need to be even more explicit about this: which exact
> operations on the fd can invalidate exactly which items in the
> information from this call?  Can it only ever be narrowed, or can it
> be broadened with any operations?

I think "attach" is the phrase we are using for that operation - it is
not a specific IOCTL here because it happens on, say, the VFIO device FD.

Let's add "detatching a device can widen the ranges. Userspace should
query ranges after every attach/detatch to know what IOVAs are valid
for mapping."

> > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > 
> > > Since it can only copy a single mapping, what's the benefit of this
> > > over just repeating an IOAS_MAP in the new IOAS?
> > 
> > It causes the underlying pin accounting to be shared and can avoid
> > calling GUP entirely.
> 
> If that's the only purpose, then that needs to be right here in the
> comments too.  So is expected best practice to IOAS_MAP everything you
> might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> mappings you actually end up wanting into the "real" IOASes for use?

That is one possibility, yes. qemu seems to be using this to establish
a clone ioas of an existing operational one which is another usage
model.

I added this additionally:

 * This may be 

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-28 Thread David Gibson
On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> 
> > > +/**
> > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > + * @ioas_id: IOAS ID to read ranges from
> > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > + * @__reserved: Must be 0
> > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the 
> > > smaller
> > > + *   of out_num_iovas or the length implied by size.
> > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > + *
> > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these 
> > > ranges is
> > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > + * size should include the allocated flex array.
> > > + */
> > > +struct iommu_ioas_iova_ranges {
> > > + __u32 size;
> > > + __u32 ioas_id;
> > > + __u32 out_num_iovas;
> > > + __u32 __reserved;
> > > + struct iommu_valid_iovas {
> > > + __aligned_u64 start;
> > > + __aligned_u64 last;
> > > + } out_valid_iovas[];
> > > +};
> > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > 
> > Is the information returned by this valid for the lifetime of the IOAS,
> > or can it change?  If it can change, what events can change it?
> >
> > If it *can't* change, then how do we have enough information to
> > determine this at ALLOC time, since we don't necessarily know which
> > (if any) hardware IOMMU will be attached to it.
> 
> It is a good point worth documenting. It can change. Particularly
> after any device attachment.

Right.. this is vital and needs to be front and centre in the
comments/docs here.  Really, I think an interface that *doesn't* have
magically changing status would be better (which is why I was
advocating that the user set the constraints, and the kernel supplied
or failed outright).  Still I recognize that has its own problems.

> I added this:
> 
>  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these 
> ranges
>  * is not allowed. out_num_iovas will be set to the total number of iovas and
>  * the out_valid_iovas[] will be filled in as space permits. size should 
> include
>  * the allocated flex array.
>  *
>  * The allowed ranges are dependent on the HW path the DMA operation takes, 
> and
>  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
>  * full range, and each attached device will narrow the ranges based on that
>  * device's HW restrictions.

I think you need to be even more explicit about this: which exact
operations on the fd can invalidate exactly which items in the
information from this call?  Can it only ever be narrowed, or can it
be broadened with any operations?

> > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > 
> > Since it can only copy a single mapping, what's the benefit of this
> > over just repeating an IOAS_MAP in the new IOAS?
> 
> It causes the underlying pin accounting to be shared and can avoid
> calling GUP entirely.

If that's the only purpose, then that needs to be right here in the
comments too.  So is expected best practice to IOAS_MAP everything you
might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
mappings you actually end up wanting into the "real" IOASes for use?

Seems like it would be nicer for the interface to just figure it out
for you: I can see there being sufficient complications with that to
have this slightly awkward interface, but I think it needs a rationale
to accompany it.
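
To spell out the pattern I mean, something like this (illustrative only; the
field and flag names are from the uapi in this patch, a <linux/iommufd.h>
header is assumed, and error handling is trimmed).  Only the MAP pins the
pages; the COPY just shares that pinning into the second IOAS:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Map @buffer at IOVA == GPA into a "scratch"/staging IOAS, then mirror it
 * into a per-device IOAS at @guest_iova.  The COPY shares the pin
 * accounting of the MAP and avoids a second GUP. */
static int stage_and_copy(int iommufd, uint32_t staging_ioas_id,
			  uint32_t device_ioas_id, void *buffer,
			  uint64_t length, uint64_t gpa, uint64_t guest_iova)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE |
			 IOMMU_IOAS_MAP_FIXED_IOVA,
		.ioas_id = staging_ioas_id,
		.user_va = (uintptr_t)buffer,
		.length = length,
		.iova = gpa,
	};
	struct iommu_ioas_copy copy = {
		.size = sizeof(copy),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
		.dst_ioas_id = device_ioas_id,
		.src_ioas_id = staging_ioas_id,
		.length = length,
		.dst_iova = guest_iova,
		.src_iova = gpa,
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))	/* pages pinned here */
		return -1;
	return ioctl(iommufd, IOMMU_IOAS_COPY, &copy);	/* no extra GUP */
}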

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-01 Thread Yi Liu




On 2022/3/31 20:59, Jason Gunthorpe wrote:
> On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:
> 
> > > +/**
> > > + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> > > + * @size: sizeof(struct iommu_ioas_copy)
> > > + * @flags: Combination of enum iommufd_ioas_map_flags
> > > + * @dst_ioas_id: IOAS ID to change the mapping of
> > > + * @src_ioas_id: IOAS ID to copy from
> > 
> > So the dst and src ioas_ids are allocated via the same iommufd,
> > right? Just out of curiosity, do you think it is possible that
> > the src/dst ioas_ids are from different iommufds? In that case
> > we may need to add src/dst iommufd fields. It's not needed today, just to
> > see if there is any blocker in the kernel to supporting such a copy. :-)
> 
> Yes, all IDs in all ioctls are within the scope of a single iommufd.
> 
> There should be no reason for a single application to open multiple
> iommufds.

Then should this be documented?

--
Regards,
Yi Liu


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-31 Thread Jason Gunthorpe via iommu
On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:

> > +/**
> > + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> > + * @size: sizeof(struct iommu_ioas_copy)
> > + * @flags: Combination of enum iommufd_ioas_map_flags
> > + * @dst_ioas_id: IOAS ID to change the mapping of
> > + * @src_ioas_id: IOAS ID to copy from
> 
> So the dst and src ioas_ids are allocated via the same iommufd,
> right? Just out of curiosity, do you think it is possible that
> the src/dst ioas_ids are from different iommufds? In that case
> we may need to add src/dst iommufd fields. It's not needed today, just to
> see if there is any blocker in the kernel to supporting such a copy. :-)

Yes, all IDs in all ioctls are within the scope of a single iommufd.

There should be no reason for a single application to open multiple
iommufds.

Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-31 Thread Jason Gunthorpe via iommu
On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:

> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the 
> > smaller
> > + *   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these 
> > ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +   __u32 size;
> > +   __u32 ioas_id;
> > +   __u32 out_num_iovas;
> > +   __u32 __reserved;
> > +   struct iommu_valid_iovas {
> > +   __aligned_u64 start;
> > +   __aligned_u64 last;
> > +   } out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifetime of the IOAS,
> or can it change?  If it can change, what events can change it?
>
> If it *can't* change, then how do we have enough information to
> determine this at ALLOC time, since we don't necessarily know which
> (if any) hardware IOMMU will be attached to it.

It is a good point worth documenting. It can change. Particularly
after any device attachment.

I added this:

 * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
 * is not allowed. out_num_iovas will be set to the total number of iovas and
 * the out_valid_iovas[] will be filled in as space permits. size should include
 * the allocated flex array.
 *
 * The allowed ranges are dependent on the HW path the DMA operation takes, and
 * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
 * full range, and each attached device will narrow the ranges based on that
 * device's HW restrictions.


> > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> 
> Since it can only copy a single mapping, what's the benefit of this
> over just repeating an IOAS_MAP in the new IOAS?

It causes the underlying pin accounting to be shared and can avoid
calling GUP entirely.

Thanks,
Jason


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-30 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, March 31, 2022 12:36 PM
> > +
> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the
> smaller
> > + *   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these
> ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +   __u32 size;
> > +   __u32 ioas_id;
> > +   __u32 out_num_iovas;
> > +   __u32 __reserved;
> > +   struct iommu_valid_iovas {
> > +   __aligned_u64 start;
> > +   __aligned_u64 last;
> > +   } out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE,
> IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifetime of the IOAS,
> or can it change?  If it can change, what events can change it?
> 

It can change when a new device is attached to an ioas.

You can look at iopt_table_enforce_group_resv_regions() in patch 7,
which is called by iommufd_device_attach() in patch 10. That function
first checks whether the new reserved ranges from the attached device
are already in use; if there is no conflict it adds them to the list of
reserved ranges of this ioas.

Userspace can call this ioctl to retrieve updated IOVA range info after
attaching a device.
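
Conceptually the rule at attach time is just the following (an illustrative
sketch only, not the actual code of patch 7): reject the attach if any new
reserved range overlaps something already mapped, otherwise record it so it
disappears from the ranges reported above.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, last; };		/* inclusive */

static bool overlaps(struct range a, struct range b)
{
	return a.start <= b.last && b.start <= a.last;
}

/* New reserved ranges from an attaching device may only be added if none of
 * them overlap an existing user mapping; once added they are excluded from
 * the valid IOVAs that IOMMU_IOAS_IOVA_RANGES reports. */
static int enforce_new_reserved(const struct range *mapped, size_t nr_mapped,
				const struct range *new_resv, size_t nr_new,
				struct range *resv_list, size_t *nr_resv)
{
	for (size_t i = 0; i < nr_new; i++)
		for (size_t j = 0; j < nr_mapped; j++)
			if (overlaps(new_resv[i], mapped[j]))
				return -1;	/* conflict: the attach fails */

	for (size_t i = 0; i < nr_new; i++)
		resv_list[(*nr_resv)++] = new_resv[i];	/* narrows valid IOVAs */
	return 0;
}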

Thanks
Kevin


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-30 Thread David Gibson
On Fri, Mar 18, 2022 at 02:27:33PM -0300, Jason Gunthorpe wrote:
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>  - Universal simple support for all IOMMUs (no PPC special path)
>  - An IOVA allocator that considers the aperture and the reserved ranges
>  - io_pagetable allows any number of iommu_domains to be connected to the
>IOAS
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.


[snip]
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address 
> (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> + __u32 size;
> + __u32 flags;
> + __u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @out_num_iovas: Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the 
> smaller
> + *   of out_num_iovas or the length implied by size.
> + * @out_valid_iovas.start: First IOVA in the allowed range
> + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges 
> is
> + * not allowed. out_num_iovas will be set to the total number of iovas
> + * and the out_valid_iovas[] will be filled in as space permits.
> + * size should include the allocated flex array.
> + */
> +struct iommu_ioas_iova_ranges {
> + __u32 size;
> + __u32 ioas_id;
> + __u32 out_num_iovas;
> + __u32 __reserved;
> + struct iommu_valid_iovas {
> + __aligned_u64 start;
> + __aligned_u64 last;
> + } out_valid_iovas[];
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> IOMMUFD_CMD_IOAS_IOVA_RANGES)

Is the information returned by this valid for the lifetime of the IOAS,
or can it change?  If it can change, what events can change it?

If it *can't* change, then how do we have enough information to
determine this at ALLOC time, since we don't necessarily know which
(if any) hardware IOMMU will be attached to it.

> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an 
> appropriate
> + * IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> + IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> + IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> + IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then 
> the
> + * mapping will be established at iova, otherwise a suitable location will be
> + * automatically selected and returned in iova.
> + */
> +struct iommu_ioas_map {
> + __u32 size;
> + __u32 flags;
> + __u32 ioas_id;
> + __u32 __reserved;
> + __aligned_u64 user_va;
> + __aligned_u64 length;
> + __aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used 

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-30 Thread Yi Liu

Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:

Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple non virtual machine
monitor users, like DPDK:
  - Universal simple support for all IOMMUs (no PPC special path)
  - An IOVA allocator that considers the aperture and the reserved ranges
  - io_pagetable allows any number of iommu_domains to be connected to the
IOAS

Along with room in the design to add non-generic features to cater to
specific HW functionality.

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/iommufd/Makefile  |   1 +
  drivers/iommu/iommufd/ioas.c| 248 
  drivers/iommu/iommufd/iommufd_private.h |  27 +++
  drivers/iommu/iommufd/main.c|  17 ++
  include/uapi/linux/iommufd.h| 132 +
  5 files changed, 425 insertions(+)
  create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0-only
  iommufd-y := \
io_pagetable.o \
+   ioas.o \
main.o \
pages.o
  
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c

new file mode 100644
index 00..c530b2ba74b06b
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include 
+#include 
+#include 
+#include 
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+   struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+   int rc;
+
+   rc = iopt_unmap_all(&ioas->iopt);
+   WARN_ON(rc);
+   iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+   if (IS_ERR(ioas))
+   return ioas;
+
+   rc = iopt_init_table(&ioas->iopt);
+   if (rc)
+   goto out_abort;
+   return ioas;
+
+out_abort:
+   iommufd_object_abort(ictx, &ioas->obj);
+   return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_alloc *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   if (cmd->flags)
+   return -EOPNOTSUPP;
+
+   ioas = iommufd_ioas_alloc(ucmd->ictx);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   cmd->out_ioas_id = ioas->obj.id;
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   goto out_table;
+   iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+   return 0;
+
+out_table:
+   iommufd_ioas_destroy(&ioas->obj);
+   return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+   struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   struct interval_tree_span_iter span;
+   u32 max_iovas;
+   int rc;
+
+   if (cmd->__reserved)
+   return -EOPNOTSUPP;
+
+   max_iovas = cmd->size - sizeof(*cmd);
+   if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+   return -EINVAL;
+   max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+   ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   down_read(&ioas->iopt.iova_rwsem);
+   cmd->out_num_iovas = 0;
+   for (interval_tree_span_iter_first(
+&span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+!interval_tree_span_iter_done(&span);
+interval_tree_span_iter_next(&span)) {
+   if (!span.is_hole)
+   continue;
+   if (cmd->out_num_iovas < max_iovas) {
+   rc = put_user((u64)span.start_hole,
+ &uptr->out_valid_iovas[cmd->out_num_iovas]
+  .start);
+   if (rc)
+   goto out_put;
+   rc = put_user(
+   (u64)span.last_hole,
+   &uptr->out_valid_iovas[cmd->out_num_iovas].last);
+   if (rc)
+   goto out_put;
+   }
+   cmd->out_num_iovas++;
+   }
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-28 Thread Alex Williamson
On Mon, 28 Mar 2022 16:47:49 -0300
Jason Gunthorpe  wrote:

> On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:
> 
> > So, currently AMD and Intel have exactly the same HW feature with a
> > different kAPI..  
> 
> I fixed it like below and made the ordering changes Kevin pointed
> to. Will send next week after the merge window:
> 
> 527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
> ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
> 79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
> 02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with 
> dev_is_dma_coherent()
> 
> '79c can be avoided, we'd just drive IOMMU_CACHE off of
> dev_is_dma_coherent() - but if we do that I'd like to properly
> document the arch/iommu/platform/kvm combination that is using this..

We can try to enforce dev_is_dma_coherent(), as you note it's not going
to affect any x86 users.  arm64 is the only obviously relevant arch that
defines ARCH_HAS_SYNC_DMA_FOR_{DEVICE,CPU} but the device.dma_coherent
setting comes from ACPI/OF firmware, so someone from ARM land will need
to shout if this is an issue.  I think we'd need to back off and go
with documentation if a broken use case shows up.  Thanks,

Alex

 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device 
> *dev)
>   return 0;
>  }
>  
> +static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> + /* IOMMU_PTE_FC is always set */
> + return true;
> +}
> +
>  const struct iommu_ops amd_iommu_ops = {
>   .capable = amd_iommu_capable,
>   .domain_alloc = amd_iommu_domain_alloc,
> @@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
>   .flush_iotlb_all = amd_iommu_flush_iotlb_all,
>   .iotlb_sync = amd_iommu_iotlb_sync,
>   .free   = amd_iommu_domain_free,
> + .enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
>   }
>  };
> 
> Thanks,
> Jason
> 



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-28 Thread Jason Gunthorpe via iommu
On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:

> So, currently AMD and Intel have exactly the same HW feature with a
> different kAPI..

I fixed it like below and made the ordering changes Kevin pointed
to. Will send next week after the merge window:

527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with 
dev_is_dma_coherent()

'79c can be avoided, we'd just drive IOMMU_CACHE off of
dev_is_dma_coherent() - but if we do that I'd like to properly
document the arch/iommu/platform/kvm combination that is using this..

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
return 0;
 }
 
+static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
+{
+   /* IOMMU_PTE_FC is always set */
+   return true;
+}
+
 const struct iommu_ops amd_iommu_ops = {
.capable = amd_iommu_capable,
.domain_alloc = amd_iommu_domain_alloc,
@@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.free   = amd_iommu_domain_free,
+   .enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
}
 };

Thanks,
Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-28 Thread Jason Gunthorpe via iommu
On Mon, Mar 28, 2022 at 11:17:23AM -0600, Alex Williamson wrote:
> On Thu, 24 Mar 2022 10:46:22 -0300
> Jason Gunthorpe  wrote:
> 
> > On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:
> > 
> > > Based on that here is a quick tweak of the force-snoop part (not 
> > > compiled).  
> > 
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> > 
> > How about this:
> > 
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> > 
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> 
> I have some issues with the argument here:
> 
>   This will block device/platform/iommu combinations that do not
>   support cache coherent DMA - but these never worked anyhow as VFIO
>   did not expose any interface to perform the required cache
>   maintenance operations.
> 
> VFIO never intended to provide such operations, it only tried to make
> the coherence of the device visible to userspace such that it can
> perform operations via other means, for example via KVM.  The "never
> worked" statement here seems false.

VFIO is generic. I expect if DPDK connects to VFIO then it will work
properly. That is definitely not the case today when
dev_is_dma_coherent() is false. This is what the paragraph is talking
about.

Remember, x86 wires dev_is_dma_coherent() to true, so this above
remark is not related to anything about x86.

We don't have a way in VFIO to negotiate that 'vfio can only be used
with kvm' so I hope no cases like that really do exist :( Do you know
of any?

> Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
> vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
> where snoop-control is supported, this translates to KVM emulating
> coherency instructions everywhere except VT-d w/ snoop-control.

It seems so.

> My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
> this would trigger unnecessary wbinvd emulation on those platforms.  

I look in the AMD manual and it looks like it works the same as intel
with a dedicated IOPTE bit:

  #define IOMMU_PTE_FC (1ULL << 60)

https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf Pg 79:

 FC: Force Coherent. Software uses the FC bit in the PTE to indicate
 the source of the upstream coherent attribute state for an
 untranslated DMA transaction. 1 = the IOMMU sets the coherent attribute
 state in the upstream request. 0 = the IOMMU passes on the coherent
 attribute state from the originating request. Device internal
 address/page table translations are considered "untranslated accesses"
 by IOMMU. The FC state is returned in the ATS response to the device
 endpoint via the state of the (N)oSnoop bit.

So, currently AMD and Intel have exactly the same HW feature with a
different kAPI..

I would say it is wrong that AMD creates kernel owned domains for the
DMA-API to use that do not support snoop.

> I don't know if other archs need similar, but it seems we're changing
> polarity wrt no-snoop TLPs from "everyone is coherent except this case
> on Intel" to "everyone is non-coherent except this opposite case on
> Intel".

Yes. We should not assume no-snoop blocking is a HW feature without
explicit knowledge that it is.

From a kAPI compat perspective IOMMU_CAP_CACHE_COHERENCY
only has two impacts:
 - Only on x86 arch it controls kvm_arch_register_noncoherent_dma()
 - It triggers IOMMU_CACHE

If we look at the list of places where IOMMU_CAP_CACHE_COHERENCY is set:

 drivers/iommu/intel/iommu.c
   Must have IOMMU_CACHE set/clear to control no-snoop blocking

 drivers/iommu/amd/iommu.c
   Always sets its no-snoop block, inconsistent with Intel

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
 drivers/iommu/arm/arm-smmu/arm-smmu.c
 drivers/iommu/arm/arm-smmu/qcom_iommu.c
   Must have IOMMU_CACHE set, ARM arch has no
   kvm_arch_register_noncoherent_dma()

   From what I could tell in the manuals and the prior discussion
   SMMU doesn't block no-snoop.

   ie ARM lies about IOMMU_CAP_CACHE_COHERENCY because it needs
    IOMMU_CACHE set to work.

 drivers/iommu/fsl_pamu_domain.c
 drivers/iommu/s390-iommu.c
   Ignore IOMMU_CACHE, arch has no kvm_arch_register_noncoherent_dma()

   No idea if the HW blocks no-snoop or not, but it doesn't matter.

So other than AMD, it is OK to change the sense and makes it clearer
for future driver authors what they are expected to do with this.

Thanks,
Jason

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-28 Thread Alex Williamson
On Thu, 24 Mar 2022 10:46:22 -0300
Jason Gunthorpe  wrote:

> On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled). 
> >  
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So lets fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence

I have some issues with the argument here:

  This will block device/platform/iommu combinations that do not
  support cache coherent DMA - but these never worked anyhow as VFIO
  did not expose any interface to perform the required cache
  maintenance operations.

VFIO never intended to provide such operations, it only tried to make
the coherence of the device visible to userspace such that it can
perform operations via other means, for example via KVM.  The "never
worked" statement here seems false.

Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
where snoop-control is supported, this translates to KVM emulating
coherency instructions everywhere except VT-d w/ snoop-control.

My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
this would trigger unnecessary wbinvd emulation on those platforms.  I
don't know if other archs need similar, but it seems we're changing
polarity wrt no-snoop TLPs from "everyone is coherent except this case
on Intel" to "everyone is non-coherent except this opposite case on
Intel".  Thanks,

Alex

> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original 
> meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with 
> dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 
> Thanks,
> Jason
> 



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-27 Thread Jason Gunthorpe via iommu
On Sun, Mar 27, 2022 at 02:32:23AM +, Tian, Kevin wrote:

> > this looks good to me except that the 2nd patch (eab4b381) should be
> > the last one otherwise it affects bisect. and in that case the subject
> > would be simply about removing the capability instead of
> > restoring...

Oh because VFIO won't send IOMMU_CACHE in this case? Hmm. OK

> > let me find a box to verify it.
> 
> My colleague (Terrence) has the environment and helped verify it.
> 
> He will give his tested-by after you send out the formal series.

Okay, I can send it after the merge window

Jason


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-26 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, March 25, 2022 10:16 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Thursday, March 24, 2022 9:46 PM
> >
> > On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:
> >
> > > Based on that here is a quick tweak of the force-snoop part (not
> compiled).
> >
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> >
> > How about this:
> >
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> >
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> > IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> > eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> > original meaning
> > 2752e12bed48f6 iommu: Replace uses of
> IOMMU_CAP_CACHE_COHERENCY
> > with dev_is_dma_coherent()
> >
> > If you like it could you take it from here?
> >
> 
> this looks good to me except that the 2nd patch (eab4b381) should be
> the last one otherwise it affects bisect. and in that case the subject
> would be simply about removing the capability instead of restoring...
> 
> let me find a box to verify it.
> 

My colleague (Terrence) has the environment and helped verify it.

He will give his tested-by after you send out the formal series.

Thanks
Kevin


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 9:46 PM
> 
> On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So lets fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY
> with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 

this looks good to me except that the 2nd patch (eab4b381) should be
the last one otherwise it affects bisect. and in that case the subject 
would be simply about removing the capability instead of restoring...

let me find a box to verify it.

Thanks
Kevin


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:

> Based on that here is a quick tweak of the force-snoop part (not compiled).

I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
started out OK but got weird. So lets fix it back to the way it was.

How about this:

https://github.com/jgunthorpe/linux/commits/intel_no_snoop

b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
5263947f9d5f36 vfio: Require that device support DMA cache coherence
eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with 
dev_is_dma_coherent()

If you like it could you take it from here?

Thanks,
Jason


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 6:55 AM
> 
> On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:
> 
> > Stated another way, any platform that wires dev_is_dma_coherent() to
> > true, like all x86 does, must support IOMMU_CACHE and report
> > IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> > supports. The platform obviously declares it support this in order to
> > support the in-kernel DMA API.
> 
> That gives me a nice simple idea:
> 
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index 3c6b95ad026829..8366884df4a030 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "iommufd_private.h"
> 
> @@ -61,6 +62,10 @@ struct iommufd_device
> *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
>   struct iommu_group *group;
>   int rc;
> 
> + /* iommufd always uses IOMMU_CACHE */
> + if (!dev_is_dma_coherent(&pdev->dev))
> + return ERR_PTR(-EINVAL);
> +
>   ictx = iommufd_fget(fd);
>   if (!ictx)
>   return ERR_PTR(-EINVAL);
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 48149988c84bbc..3d6df1ffbf93e6 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
>* We provide no manual cache coherency ioctls to userspace and
> most
>* architectures make the CPU ops for cache flushing privileged.
>* Therefore we require the underlying IOMMU to support CPU
> coherent
> -  * operation.
> +  * operation. Support for IOMMU_CACHE is enforced by the
> +  * dev_is_dma_coherent() test during bind.
>*/
>   iommu_prot = IOMMU_CACHE;
>   if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> 
> Looking at it I would say all the places that test
> IOMMU_CAP_CACHE_COHERENCY can be replaced with
> dev_is_dma_coherent()
> except for the one call in VFIO that is supporting the Intel no-snoop
> behavior.
> 
> Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
> IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
> IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
> abusing IOMMU_CACHE.
> 

Based on that here is a quick tweak of the force-snoop part (not compiled).

Several notes:

- IOMMU_CAP_CACHE_COHERENCY is kept as it's checked in vfio's
  group attach interface. Removing it may require a group_is_dma_coherent();

- vdpa is not changed as force-snoop is only for integrated GPU today which
  is not managed by vdpa. But adding the snoop support is easy if necessary;

- vfio type1 reports the force-snoop fact to KVM via VFIO_DMA_CC_IOMMU. For
  iommufd the compat layer may leverage that interface, but more thought is
  required for non-compat usage on whether that can be reused or whether a
  new one is required between iommufd and kvm. Per earlier discussions
  Paolo prefers to reuse it.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5b196cf..06cca04 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5110,7 +5110,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
prot |= DMA_PTE_READ;
if (iommu_prot & IOMMU_WRITE)
prot |= DMA_PTE_WRITE;
-   if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
+   /* nothing to do for IOMMU_CACHE */
+   if ((iommu_prot & IOMMU_SNOOP) && dmar_domain->iommu_snooping)
prot |= DMA_PTE_SNP;
 
max_addr = iova + size;
@@ -5236,6 +5237,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct 
iommu_domain *domain,
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
if (cap == IOMMU_CAP_CACHE_COHERENCY)
+   return true;
+   if (cap == IOMMU_CAP_FORCE_SNOOP)
return domain_update_iommu_snooping(NULL);
if (cap == IOMMU_CAP_INTR_REMAP)
return irq_remapping_enabled == 1;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9..abc4cfe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2270,6 +2270,9 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;
 
+   if (iommu_capable(bus, IOMMU_CAP_FORCE_SNOOP))
+   domain->prot |= IOMMU_SNOOP;
+
/*
 * Try to match an existing compatible domain.  We don't want to
 * preclude an IOMMU driver supporting multiple bus_types and being
@@ -2611,14 +2614,14 @@ static void vfio_iommu_type1_release(void *iommu_data)
kfree(iommu);
 }
 
-static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
+static int vfio_domains_have_iommu_snoop(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
  

RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 4:34 AM
> 
> On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 16:34:39 -0300
> > Jason Gunthorpe  wrote:
> >
> > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > > Jason Gunthorpe  wrote:
> > > >
> > > > > +static int conv_iommu_prot(u32 map_flags)
> > > > > +{
> > > > > + int iommu_prot;
> > > > > +
> > > > > + /*
> > > > > +  * We provide no manual cache coherency ioctls to userspace
> and most
> > > > > +  * architectures make the CPU ops for cache flushing
> privileged.
> > > > > +  * Therefore we require the underlying IOMMU to support
> CPU coherent
> > > > > +  * operation.
> > > > > +  */
> > > > > + iommu_prot = IOMMU_CACHE;
> > > >
> > > > Where is this requirement enforced?  AIUI we'd need to test
> > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > > intel_iommu_map() simply drop the flag when not supported by HW.
> > >
> > > You are right, the correct thing to do is to fail device
> > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > > however we can't do that because Intel abuses the meaning of
> > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop
> behavior is
> > > supported.
> > >
> > > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent
> meaning in
> > > all iommu drivers. Once this is done vfio and iommufd should both
> > > always set IOMMU_CACHE and refuse to work without
> > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of
> an !IOMMU_CACHE
> > > arch that does in fact work today with vfio, somehow, but I don't..)
> >
> > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> > lack of snoop-control support, causing us to have mixed coherent and
> > non-coherent domains.  I don't recall if you go back far enough in VT-d
> > history if the primary IOMMU might have lacked this support.  So I
> > think there are systems we care about with IOMMUs that can't enforce
> > DMA coherency.
> >
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and
> all
> > mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all
> mappings
> > are coherent regardless of mapping protection flags?  What's the point
> > of IOMMU_CACHE at that point?
> 
> IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
> change.

One nit (as I explained in the previous v1 discussion): it is not that Intel
abuses this capability, since it was first introduced for Intel's force-snoop
requirement. It is just that when its meaning was later changed to match what
you describe below, Intel's original use was not caught and fixed properly.

> 
> It only means normal DMAs issued in a normal way are coherent with the
> CPU and do not require special cache flushing instructions. ie DMA
> issued by a kernel driver using the DMA API.
> 
> It does not mean that non-coherent DMA does not exist, or that
> platform or device specific ways to trigger non-coherence are blocked.
> 
> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it supports this in order to
> support the in-kernel DMA API.

This is a good explanation of IOMMU_CACHE. Given that, the intel-iommu
driver should always report this capability and do nothing special for
IOMMU_CACHE, since coherency is already guaranteed by the arch. Actually
this is exactly what the AMD iommu driver does today.

Then we'll introduce a new cap/prot specifically for enforcing snoop, as you
suggested below.
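Roughly, that split would mean something like the following additions to
include/linux/iommu.h (a sketch only, not compiled; the IOMMU_CAP_FORCE_SNOOP
and IOMMU_SNOOP names just follow the not-compiled tweak elsewhere in this
thread and are placeholders, not a final proposal):

/*
 * Keep IOMMU_CACHE / IOMMU_CAP_CACHE_COHERENCY with their original
 * "normal DMA is coherent" meaning and add a separate cap/prot pair
 * for the Intel force-snoop behavior.
 */
enum iommu_cap {
	IOMMU_CAP_CACHE_COHERENCY,	/* normal DMA through the IOMMU is coherent */
	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
	IOMMU_CAP_NOEXEC,		/* IOMMU supports the IOMMU_NOEXEC flag */
	IOMMU_CAP_FORCE_SNOOP,		/* new: IOMMU can upgrade no-snoop DMA to snoop */
};

#define IOMMU_READ	(1 << 0)
#define IOMMU_WRITE	(1 << 1)
#define IOMMU_CACHE	(1 << 2)	/* DMA cache coherency */
#define IOMMU_NOEXEC	(1 << 3)
#define IOMMU_MMIO	(1 << 4)
#define IOMMU_PRIV	(1 << 5)
#define IOMMU_SNOOP	(1 << 6)	/* new: force snooping for this mapping */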

> 
> Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
> blocking' should be created to cover Intel's special need. From what I
> know it is only implemented in the Intel driver and apparently only
> for some IOMMUs connected to IGD.
> 
> > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > > not working currently, I fixed it.
> >
> > Right, I see it in the comments relative to extensions, but missed in
> > the commit log.  Thanks,
> 
> Oh good, I remembered it was someplace..
> 

Thanks
Kevin

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:

> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it supports this in order to
> support the in-kernel DMA API.

That gives me a nice simple idea:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3c6b95ad026829..8366884df4a030 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-map-ops.h>
 
 #include "iommufd_private.h"
 
@@ -61,6 +62,10 @@ struct iommufd_device *iommufd_bind_pci_device(int fd, 
struct pci_dev *pdev,
struct iommu_group *group;
int rc;
 
+   /* iommufd always uses IOMMU_CACHE */
+   if (!dev_is_dma_coherent(&pdev->dev))
+   return ERR_PTR(-EINVAL);
+
ictx = iommufd_fget(fd);
if (!ictx)
return ERR_PTR(-EINVAL);
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84bbc..3d6df1ffbf93e6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
 * We provide no manual cache coherency ioctls to userspace and most
 * architectures make the CPU ops for cache flushing privileged.
 * Therefore we require the underlying IOMMU to support CPU coherent
-* operation.
+* operation. Support for IOMMU_CACHE is enforced by the
+* dev_is_dma_coherent() test during bind.
 */
iommu_prot = IOMMU_CACHE;
if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)

Looking at it I would say all the places that test
IOMMU_CAP_CACHE_COHERENCY can be replaced with dev_is_dma_coherent()
except for the one call in VFIO that is supporting the Intel no-snoop
behavior.

Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
abusing IOMMU_CACHE.

Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 16:34:39 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > Jason Gunthorpe  wrote:
> > >   
> > > > +static int conv_iommu_prot(u32 map_flags)
> > > > +{
> > > > +   int iommu_prot;
> > > > +
> > > > +   /*
> > > > +* We provide no manual cache coherency ioctls to userspace and 
> > > > most
> > > > +* architectures make the CPU ops for cache flushing privileged.
> > > > +* Therefore we require the underlying IOMMU to support CPU 
> > > > coherent
> > > > +* operation.
> > > > +*/
> > > > +   iommu_prot = IOMMU_CACHE;  
> > > 
> > > Where is this requirement enforced?  AIUI we'd need to test
> > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > intel_iommu_map() simply drop the flag when not supported by HW.  
> > 
> > You are right, the correct thing to do is to fail device
> > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > however we can't do that because Intel abuses the meaning of
> > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > supported.
> > 
> > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> > all iommu drivers. Once this is done vfio and iommufd should both
> > always set IOMMU_CACHE and refuse to work without
> > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > arch that does in fact work today with vfio, somehow, but I don't..)
> 
> IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> lack of snoop-control support, causing us to have mixed coherent and
> non-coherent domains.  I don't recall if you go back far enough in VT-d
> history if the primary IOMMU might have lacked this support.  So I
> think there are systems we care about with IOMMUs that can't enforce
> DMA coherency.
> 
> As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> are coherent regardless of mapping protection flags?  What's the point
> of IOMMU_CACHE at that point?

IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
change.

It only means normal DMAs issued in a normal way are coherent with the
CPU and do not require special cache flushing instructions, i.e. DMA
issued by a kernel driver using the DMA API.
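To make the "DMA issued by a kernel driver using the DMA API" case concrete,
this is the pattern the capability is about; a minimal sketch with a
hypothetical device and buffer, with no manual cache maintenance in the
driver (on a non-coherent platform the DMA API itself does any flushing
inside map/unmap):

#include <linux/dma-mapping.h>

static int mydev_receive(struct device *dev, void *buf, size_t len)
{
	dma_addr_t dma;

	dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... program the (hypothetical) device to DMA into 'dma' and wait ... */

	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
	/* the CPU can now read 'buf' directly; no cache flushing in the driver */
	return 0;
}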

It does not mean that non-coherent DMA does not exist, or that
platform- or device-specific ways to trigger non-coherence are blocked.

Stated another way, any platform that wires dev_is_dma_coherent() to
true, like all x86 does, must support IOMMU_CACHE and report
IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
supports. The platform obviously declares it supports this in order to
support the in-kernel DMA API.

Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
blocking' should be created to cover Intel's special need. From what I
know it is only implemented in the Intel driver and apparently only
for some IOMMUs connected to IGD.
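For background, the reason only some VT-d instances can do this is the Snoop
Control (SC) bit in the extended capability register; the IGD-dedicated IOMMU
commonly lacks it. A rough sketch of where that decision would come from,
using the existing ecap_sc_support() helper (the function name here is
hypothetical):

#include <linux/intel-iommu.h>

/* Sketch: can this VT-d unit force no-snoop transactions to be snooped? */
static bool iommu_can_force_snoop(struct intel_iommu *iommu)
{
	return ecap_sc_support(iommu->ecap);
}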

> > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > not working currently, I fixed it.
> 
> Right, I see it in the comments relative to extensions, but missed in
> the commit log.  Thanks,

Oh good, I remembered it was someplace..

Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Alex Williamson
On Wed, 23 Mar 2022 16:34:39 -0300
Jason Gunthorpe  wrote:

> On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > On Fri, 18 Mar 2022 14:27:33 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > +static int conv_iommu_prot(u32 map_flags)
> > > +{
> > > + int iommu_prot;
> > > +
> > > + /*
> > > +  * We provide no manual cache coherency ioctls to userspace and most
> > > +  * architectures make the CPU ops for cache flushing privileged.
> > > +  * Therefore we require the underlying IOMMU to support CPU coherent
> > > +  * operation.
> > > +  */
> > > + iommu_prot = IOMMU_CACHE;  
> > 
> > Where is this requirement enforced?  AIUI we'd need to test
> > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > intel_iommu_map() simply drop the flag when not supported by HW.  
> 
> You are right, the correct thing to do is to fail device
> binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> however we can't do that because Intel abuses the meaning of
> IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> supported.
> 
> I want Intel to split out their special no-snoop from IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> all iommu drivers. Once this is done vfio and iommufd should both
> always set IOMMU_CACHE and refuse to work without
> IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> arch that does in fact work today with vfio, somehow, but I don't..)

IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
lack of snoop-control support, causing us to have mixed coherent and
non-coherent domains.  I don't recall if you go back far enough in VT-d
history if the primary IOMMU might have lacked this support.  So I
think there are systems we care about with IOMMUs that can't enforce
DMA coherency.

As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
are coherent regardless of mapping protection flags?  What's the point
of IOMMU_CACHE at that point?

> I added a fixme about this.
> 
> > This also seems like an issue relative to vfio compatibility that I
> > don't see mentioned in that patch.  Thanks,  
> 
> Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> not working currently, I fixed it.

Right, I see it in the comments relative to extensions, but missed in
the commit log.  Thanks,

Alex



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:33 -0300
> Jason Gunthorpe  wrote:
> 
> > +static int conv_iommu_prot(u32 map_flags)
> > +{
> > +   int iommu_prot;
> > +
> > +   /*
> > +* We provide no manual cache coherency ioctls to userspace and most
> > +* architectures make the CPU ops for cache flushing privileged.
> > +* Therefore we require the underlying IOMMU to support CPU coherent
> > +* operation.
> > +*/
> > +   iommu_prot = IOMMU_CACHE;
> 
> Where is this requirement enforced?  AIUI we'd need to test
> IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> intel_iommu_map() simply drop the flag when not supported by HW.

You are right, the correct thing to do is to fail device
binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
however we can't do that because Intel abuses the meaning of
IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
supported.

I want Intel to split out their special no-snoop from IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
all iommu drivers. Once this is done vfio and iommufd should both
always set IOMMU_CACHE and refuse to work without
IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
arch that does in fact work today with vfio, somehow, but I don't..)
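To spell out what "refuse to work" would look like once the split is done: a
simple capability test at bind/attach time, sketched below as a hypothetical
helper using the current bus-based iommu_capable(). Where exactly it lands,
vfio attach or iommufd bind, is still open.

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/iommu.h>

/* Sketch: fail early instead of silently dropping IOMMU_CACHE at map time */
static int iommufd_check_cache_coherency(struct device *dev)
{
	if (!iommu_capable(dev->bus, IOMMU_CAP_CACHE_COHERENCY))
		return -EINVAL;
	return 0;
}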

I added a fixme about this.

> This also seems like an issue relative to vfio compatibility that I
> don't see mentioned in that patch.  Thanks,

Yes, it was missed in the notes for vfio compat that Intel no-snoop is
not working currently, I fixed it.

Thanks,
Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Alex Williamson
On Fri, 18 Mar 2022 14:27:33 -0300
Jason Gunthorpe  wrote:

> +static int conv_iommu_prot(u32 map_flags)
> +{
> + int iommu_prot;
> +
> + /*
> +  * We provide no manual cache coherency ioctls to userspace and most
> +  * architectures make the CPU ops for cache flushing privileged.
> +  * Therefore we require the underlying IOMMU to support CPU coherent
> +  * operation.
> +  */
> + iommu_prot = IOMMU_CACHE;

Where is this requirement enforced?  AIUI we'd need to test
IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
intel_iommu_map() simply drop the flag when not supported by HW.

This also seems like an issue relative to vfio compatibility that I
don't see mentioned in that patch.  Thanks,

Alex

> + if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> + iommu_prot |= IOMMU_WRITE;
> + if (map_flags & IOMMU_IOAS_MAP_READABLE)
> + iommu_prot |= IOMMU_READ;
> + return iommu_prot;
> +}




[PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-18 Thread Jason Gunthorpe via iommu
Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple, non-virtual-machine-monitor
users, like DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the reserved ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS

Along with room in the design to add non-generic features to cater to
specific HW functionality.
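To illustrate the simple-user flow, here is a rough userspace sketch (not
part of the patch; the /dev/iommu node and the IOMMU_IOAS_ALLOC ioctl name
are assumptions based on this series' uapi conventions, and device
bind/attach plus IOMMU_IOAS_MAP/IOVA_RANGES are elided):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* added by this series */

int main(void)
{
	/* all iommufd commands carry their structure size first */
	struct iommu_ioas_alloc cmd = { .size = sizeof(cmd) };
	int fd = open("/dev/iommu", O_RDWR);

	if (fd < 0)
		return 1;
	if (ioctl(fd, IOMMU_IOAS_ALLOC, &cmd))	/* create an empty IOAS */
		return 1;

	printf("allocated IOAS id %u\n", cmd.out_ioas_id);
	/*
	 * Next (elided): bind the PCI device, attach it to this IOAS, query
	 * IOMMU_IOAS_IOVA_RANGES for the usable IOVA, then IOMMU_IOAS_MAP
	 * the hugepage pool used by the userspace DMA engine.
	 */
	return 0;
}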

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/ioas.c| 248 
 drivers/iommu/iommufd/iommufd_private.h |  27 +++
 drivers/iommu/iommufd/main.c|  17 ++
 include/uapi/linux/iommufd.h| 132 +
 5 files changed, 425 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
io_pagetable.o \
+   ioas.o \
main.o \
pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00..c530b2ba74b06b
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include 
+#include 
+#include 
+#include 
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+   struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+   int rc;
+
+   rc = iopt_unmap_all(&ioas->iopt);
+   WARN_ON(rc);
+   iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+   if (IS_ERR(ioas))
+   return ioas;
+
+   rc = iopt_init_table(>iopt);
+   if (rc)
+   goto out_abort;
+   return ioas;
+
+out_abort:
+   iommufd_object_abort(ictx, &ioas->obj);
+   return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_alloc *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   if (cmd->flags)
+   return -EOPNOTSUPP;
+
+   ioas = iommufd_ioas_alloc(ucmd->ictx);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   cmd->out_ioas_id = ioas->obj.id;
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   goto out_table;
+   iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+   return 0;
+
+out_table:
+   iommufd_ioas_destroy(&ioas->obj);
+   return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+   struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+   struct iommufd_ioas *ioas;
+   struct interval_tree_span_iter span;
+   u32 max_iovas;
+   int rc;
+
+   if (cmd->__reserved)
+   return -EOPNOTSUPP;
+
+   max_iovas = cmd->size - sizeof(*cmd);
+   if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+   return -EINVAL;
+   max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+   ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+   if (IS_ERR(ioas))
+   return PTR_ERR(ioas);
+
+   down_read(&ioas->iopt.iova_rwsem);
+   cmd->out_num_iovas = 0;
+   for (interval_tree_span_iter_first(
+&span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+!interval_tree_span_iter_done(&span);
+interval_tree_span_iter_next(&span)) {
+   if (!span.is_hole)
+   continue;
+   if (cmd->out_num_iovas < max_iovas) {
+   rc = put_user((u64)span.start_hole,
+ &uptr->out_valid_iovas[cmd->out_num_iovas]
+  .start);
+   if (rc)
+   goto out_put;
+   rc = put_user(
+   (u64)span.last_hole,
+   &uptr->out_valid_iovas[cmd->out_num_iovas].last);
+   if (rc)
+   goto out_put;
+   }
+   cmd->out_num_iovas++;
+   }
+   rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+   if (rc)
+   goto out_put;
+   if (cmd->out_num_iovas > max_iovas)
+