RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-26 Thread Tian, Kevin

> From: David Gibson 
> Sent: Monday, October 25, 2021 1:05 PM
> 
> > > > For above cases a [base, max] hint can be provided by the user per
> > > > Jason's recommendation.
> > >
> > > Provided at which stage?
> >
> > IOMMU_IOASID_ALLOC
> 
> Ok.  I have mixed thoughts on this.  Doing this at ALLOC time was my
> first instinct as well.  However with Jason's suggestion that any of a
> number of things could disambiguate multiple IOAS attached to a
> device, I wonder if it makes more sense for consistency to put base
> address at attach time, as with PASID.

In that case the base address provided at attach time is used as an 
address space ID similar to PASID, which imho is orthogonal to the 
generic [base, size] info for IOAS itself. The 2nd base sort of becomes 
an offset on top of the first base in ppc case.

> >
> > Regarding live migration with vfio devices, it's still at an early stage. There
> > are tons of open compatibility-check issues to be addressed before it can
> > be widely deployed. This might just add another annoying open issue to that
> > long list...
> 
> So, yes, live migration with VFIO is limited, unfortunately this
> still affects us even if we don't (currently) have VFIO devices.  The
> problem arises from the combination of two limitations:
> 
> 1) Live migration means that we can't dynamically select guest visible
> IOVA parameters at qemu start up time.  We need to get consistent
> guest visible behaviour for a given set of qemu options, so that we
> can migrate between them.
> 
> 2) Device hotplug means that we don't know if a PCI domain will have
> VFIO devices on it when we start qemu.  So, we don't know if host
> limitations on IOVA ranges will affect the guest or not.
> 
> Together these mean that the best we can do is to define a *fixed*
> (per machine type) configuration based on qemu options only.  That is,
> defined by the guest platform we're trying to present, only, never
> host capabilities.  We can then see if that configuration is possible
> on the host and pass or fail.  It's never safe to go the other
> direction and take host capabilities and present those to the guest.
> 

That is just one userspace policy. We don't want to design a uAPI
just for a specific userspace implementation. In concept the 
userspace could:

1)  use DMA-API-like map/unmap, i.e. letting the IOVA address space be
managed by the kernel;

* suitable for simple applications e.g. dpdk.

2)  manage IOVA address space with *fixed* layout:

* fail device passthrough at MAP_DMA if conflict is detected
  between mapped range and device specific IOVA holes

* suitable for VMs where live migration is a major concern

* potential problem with vIOMMU since the guest is unaware
  of host constraints, thus undefined behavior may occur if
  guest IOVA addresses happen to overlap with host IOVA holes.

* ppc is special as you need to claim guest IOVA ranges in
  the host. But that is not the case for other emulated IOMMUs.

3)  manage IOVA address space with host constraints:

* create IOVA layout by combining qemu options and IOVA holes 
  of all boot-time passthrough devices

* reject hotplugged device if it has conflicting IOVA holes with
  the initial IOVA layout

* suitable for vIOMMU since host constraints can be further 
  reported to the guest

* suitable for VM w/o live migration requirement, e.g. in many
  client virtualization scenarios

* suboptimal for VM live migration because of compatibility limitations
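
Policy 3) can be sketched roughly as follows. This is only an illustration in Python, not part of the proposed uAPI: the function names, the inclusive (start, last) range representation and the example hole are all assumptions made for the sketch.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, last); representation is an assumption


def subtract_holes(window: Range, holes: List[Range]) -> List[Range]:
    """Return the parts of `window` that are not covered by any hole."""
    usable = []
    cursor, last = window
    for h_start, h_last in sorted(holes):
        if h_last < cursor:
            continue  # hole lies entirely below what is left of the window
        if h_start > last:
            break  # hole lies entirely above the window
        if h_start > cursor:
            usable.append((cursor, h_start - 1))
        cursor = h_last + 1
        if cursor > last:
            break
    if cursor <= last:
        usable.append((cursor, last))
    return usable


def hotplug_ok(layout: List[Range], new_holes: List[Range]) -> bool:
    """Accept a hotplugged device only if none of its holes overlaps a
    range that was already advertised in the initial layout."""
    return all(h_last < start or h_start > last
               for start, last in layout
               for h_start, h_last in new_holes)


# Initial layout: a 39-bit window from the VMM options, minus one
# boot-time hole (the value is purely illustrative).
layout = subtract_holes((0, (1 << 39) - 1), [(0xFEE00000, 0xFEEFFFFF)])
```

A hotplugged device whose holes fall inside the existing gaps passes `hotplug_ok()`. One whose hole lands inside an advertised range is rejected, which is the compatibility limitation mentioned for live migration.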

Overall the proposed uAPI will provide:

1)  a simple DMA-API-like mapping protocol for kernel managed IOVA
address space;

2)  a vfio-like mapping protocol for user managed IOVA address space,
which:

a) checks IOVA conflict in the MAP_DMA ioctl;
b) allows the user to query available IOVA ranges.

Then it's totally user policy on how it wants to utilize those ioctls.
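
As a rough model of protocol 2), the Python sketch below mimics a user managed address space that rejects conflicting mappings and lets the caller query the permitted ranges. The class and method names are hypothetical and do not correspond to real ioctls; the ranges are example values only.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start, last)


class UserManagedIOAS:
    """Toy model of a user managed IOVA space (names are hypothetical)."""

    def __init__(self, permitted: List[Range]):
        self.permitted = sorted(permitted)
        self.mappings: Dict[int, int] = {}  # iova -> size

    def query_iova_ranges(self) -> List[Range]:
        # b) let the user discover which IOVA ranges may be mapped
        return list(self.permitted)

    def map_dma(self, iova: int, size: int) -> None:
        # a) fail the mapping if it is not fully inside a permitted
        #    range, i.e. if it touches an IOVA hole
        if size <= 0:
            raise ValueError("size must be positive")
        last = iova + size - 1
        if not any(s <= iova and last <= l for s, l in self.permitted):
            raise ValueError("mapping conflicts with an IOVA hole")
        # also refuse to overlap a mapping that already exists
        for m_iova, m_size in self.mappings.items():
            if iova <= m_iova + m_size - 1 and m_iova <= last:
                raise ValueError("mapping overlaps an existing mapping")
        self.mappings[iova] = size
```

Whether the caller lays out its IOVAs from fixed guest-visible parameters or from the queried ranges is then its own policy, as said above.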

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-24 Thread David Gibson

On Thu, Oct 14, 2021 at 06:53:01AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Thursday, October 14, 2021 1:00 PM
> > 
> > On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > > > From: David Gibson
> > > > Sent: Friday, October 1, 2021 2:11 PM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > > >   is a way to claim a specific iova range from a system-wide address space.
> > > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > >   and then decide how to incorporate it in v2.
> > > >
> > > > Ok, there are several things we need for ppc.  None of which are
> > > > inherently ppc specific and some of which will I think be useful for
> > > > most platforms.  So, starting from most general to most specific
> > > > here's basically what's needed:
> > > >
> > > > 1. We need to represent the fact that the IOMMU can only translate
> > > >    *some* IOVAs, not a full 64-bit range.  You have the addr_width
> > > >    already, but I'm not entirely sure if the translatable range on ppc
> > > >    (or other platforms) is always a power-of-2 size.  It usually will
> > > >    be, of course, but I'm not sure that's a hard requirement.  So
> > > >    using a size/max rather than just a number of bits might be safer.
> > > >
> > > >    I think basically every platform will need this.  Most platforms
> > > >    don't actually implement full 64-bit translation in any case, but
> > > >    rather some smaller number of bits that fits their page table
> > > >    format.
> > > >
> > > > 2. The translatable range of IOVAs may not begin at 0.  So we need to
> > > >    advertise to userspace what the base address is, as well as the
> > > >    size.  POWER's main IOVA range begins at 2^59 (at least on the
> > > >    models I know about).
> > > >
> > > >    I think a number of platforms are likely to want this, though I
> > > >    couldn't name them apart from POWER.  Putting the translated IOVA
> > > >    window at some huge address is a pretty obvious approach to making
> > > >    an IOMMU which can translate a wide address range without colliding
> > > >    with any legacy PCI addresses down low (the IOMMU can check if this
> > > >    transaction is for it by just looking at some high bits in the
> > > >    address).
> > > >
> > > > 3. There might be multiple translatable ranges.  So, on POWER the
> > > >    IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > > >    2^59..2^59+.  The two ranges have completely separate IO
> > > >    page tables, with (usually) different layouts.  (The low range will
> > > >    nearly always be a single-level page table with 4kiB or 64kiB
> > > >    entries, the high one will be multiple levels depending on the size
> > > >    of the range and pagesize).
> > > >
> > > >    This may be less common, but I suspect POWER won't be the only
> > > >    platform to do something like this.  As above, using a high range
> > > >    is a pretty obvious approach, but clearly won't handle older
> > > >    devices which can't do 64-bit DMA.  So adding a smaller range for
> > > >    those devices is again a pretty obvious solution.  Any platform
> > > >    with an "IO hole" can be treated as having two ranges, one below
> > > >    the hole and one above it (although in that case they may well not
> > > >    have separate page tables
> > >
> > > 1-3 are common on all platforms with fixed reserved ranges. Current
> > > vfio already reports permitted iova ranges to user via
> > > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> > > maps only in those ranges. iommufd can follo

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-18 Thread Jason Gunthorpe via iommu

On Mon, Oct 18, 2021 at 02:50:54PM +1100, David Gibson wrote:

> Hrm... which makes me think... if we allow this for the common
> kernel-managed case, do we even need to have capacity in the high-level
> interface for reporting IO holes?  If the kernel can choose a non-zero
> base, it could just choose on x86 to place its advertised window
> above the IO hole.

If the high level interface is like dma_map() then, no, it doesn't need
the ability to report holes. The kernel would find and return the IOVA
from dma_map, not accept it as input.

Since dma_map is a well proven model I'm inclined to model the
simplified interface after it.

That said, if we have some ioctl 'query iova ranges' I would expect it
to work on an IOAS created by the simplified interface too.
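
A minimal sketch of that dma_map-like model, written in Python purely for illustration: the class and method names are hypothetical, the bump allocator is a simplifying assumption (no unmap or reuse), and the window values are examples rather than anything a real driver reports.

```python
from typing import List, Tuple


class KernelManagedIOAS:
    """Toy model: the 'kernel' side picks the IOVA, the caller only
    supplies a length (names and allocator are assumptions)."""

    def __init__(self, base: int, size: int, page_size: int = 4096):
        self.base = base
        self.size = size
        self.page_size = page_size
        self._next = base  # simple bump pointer, never reused

    def dma_map(self, length: int) -> int:
        if length <= 0:
            raise ValueError("length must be positive")
        # round the request up to whole pages
        pages = -(-length // self.page_size)
        span = pages * self.page_size
        if self._next + span > self.base + self.size:
            raise MemoryError("IOVA window exhausted")
        iova = self._next
        self._next += span
        return iova  # the IOVA is returned, never accepted as input

    def query_iova_ranges(self) -> List[Tuple[int, int]]:
        # the same query still works on an IOAS made by the simple interface
        return [(self.base, self.base + self.size - 1)]
```

Because the caller never names an IOVA, it has no need to know about holes: the window can start at any base the kernel side prefers.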

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-17 Thread da...@gibson.dropbear.id.au

On Thu, Oct 14, 2021 at 12:06:10PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:
> 
> > > If the HW can attach multiple non-overlapping IOAS's to the same
> > > device then the HW is routing to the correct IOAS by using the address
> > > bits. This is not much different from the prior discussion we had
> > > where we were thinking of the PASID as an 80 bit address
> > 
> > Ah... that might be a workable approach.  And it even helps me get my
> > head around multiple attachment which I was struggling with before.
> > 
> > So, the rule would be that you can attach multiple IOASes to a device,
> > as long as none of them overlap.  The non-overlapping could be because
> > each IOAS covers a disjoint address range, or it could be because
> > there's some attached information - such as a PASID - to disambiguate.
> 
> Right exactly - it is very parallel to PASID
> 
> And obviously HW support is required to have multiple page table
> pointers per RID - which sounds like PPC does (high/low pointer?)

Hardware support is required *in the IOMMU*.  Nothing (beyond regular
64-bit DMA support) is required in the endpoint devices.  That's not
true of PASID.

> > What remains a question is where the disambiguating information comes
> > from in each case: does it come from properties of the IOAS,
> > properties of the device, or from extra parameters supplied at attach
> > time.  IIUC, the current draft suggests it always comes at attach time
> > for the PASID information.  Obviously the more consistency we can have
> > here the better.
> 
> From a generic view point I'd say all are fair game. It is up to the
> IOMMU driver to take the requested set of IOAS's, the "at attachment"
> information (like PASID) and decide what to do, or fail.

Ok, that's a model that makes sense to me.

> > I can also see an additional problem in implementation, once we start
> > looking at hot-adding devices to existing address spaces.  
> 
> I won't pretend to guess how to implement this :) Just from a modeling
> perspective it is something that works logically. If the kernel
> implementation is too hard then PPC should do one of the other ideas.
> 
> Personally I'd probably try for a nice multi-domain attachment model
> like PASID and not try to create/destroy domains.

I don't really follow what you mean by that.

> As I said in my last email I think it is up to each IOMMU HW driver to
> make these decisions, the iommufd framework just provides a
> standardized API toward the attaching driver that the IOMMU HW must
> fit into.
> 
> Jason
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-17 Thread David Gibson

On Thu, Oct 14, 2021 at 11:52:08AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:
> 
> > > My feeling is that qemu should be dealing with the host != target
> > > case, not the kernel.
> > > 
> > > The kernel's job should be to expose the IOMMU HW it has, with all
> > > features accessible, to userspace.
> > 
> > See... to me this is contrary to the point we agreed on above.
> 
> I'm not thinking of these as exclusive ideas.
> 
> The IOCTL interface in iommu can quite happily expose:
>  Create IOAS generically
>  Manipulate IOAS generically
>  Create IOAS with IOMMU driver specific attributes
>  HW specific Manipulate IOAS
> 
> IOCTL commands all together.
> 
> So long as everything is focused on a generic in-kernel IOAS object it
> is fine to have multiple ways in the uAPI to create and manipulate the
> objects.
> 
> When I speak about a generic interface I mean "Create IOAS
> generically" - ie a set of IOCTLs that work on most IOMMU HW and can
> be relied upon by things like DPDK/etc to always work and be portable.
> This is why I like "hints" to provide some limited widely applicable
> micro-optimization.
> 
> When I said "expose the IOMMU HW it has with all features accessible"
> I mean also providing "Create IOAS with IOMMU driver specific
> attributes".
> 
> These other IOCTLs would allow the IOMMU driver to expose every
> configuration knob its HW has, in a natural HW centric language.
> There is no pretense of genericness here, no crazy foo=A, foo=B hidden
> device specific interface.
> 
> Think of it as a high level/low level interface to the same thing.

Ok, I see what you mean.

> > Those are certainly wrong, but they came about explicitly by *not*
> > being generic rather than by being too generic.  So I'm really
> > confused as to what you're arguing for / against.
> 
> IMHO it is not having a PPC specific interface that was the problem,
> it was making the PPC specific interface exclusive to the type 1
> interface. If type 1 continued to work on PPC then DPDK/etc would
> never have learned PPC specific code.

Ok, but the reason this happened is that the initial version of type 1
*could not* be used on PPC.  The original Type 1 implicitly promised a
"large" IOVA range beginning at IOVA 0 without any real way of
specifying or discovering how large that range was.  Since ppc could
typically only give a 2GiB range at IOVA 0, that wasn't usable.

That's why I say the problem was not making type1 generic enough.  I
believe the current version of Type1 has addressed this - at least
enough to be usable in common cases.  But by this time the ppc backend
is already out there, so no-one's had the capacity to go back and make
ppc work with Type1.

> For iommufd with the high/low interface each IOMMU HW should ask basic
> questions:
> 
>  - What should the generic high level interface do on this HW?
>    For instance what should 'Create IOAS generically' do for PPC?
>    It should not fail, it should create *something*
>    What is the best thing for DPDK?
>    I guess the 64 bit window is most broadly useful.

Right, which means the kernel must (at least in the common case) have
the capacity to choose and report a non-zero base-IOVA.

Hrm... which makes me think... if we allow this for the common
kernel-managed case, do we even need to have capacity in the high-level
interface for reporting IO holes?  If the kernel can choose a non-zero
base, it could just choose on x86 to place its advertised window
above the IO hole.
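
To illustrate the idea (this is only a sketch, with a made-up helper name and example ranges, not a description of what any driver does), choosing the advertised window could be as simple as taking the largest translatable range and reporting its base and size:

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, last)


def pick_default_window(ranges: List[Range]) -> Tuple[int, int]:
    """Pick the largest translatable range and report it as (base, size).

    Hypothetical helper: 'largest range wins' is an assumed policy.
    """
    start, last = max(ranges, key=lambda r: r[1] - r[0])
    return start, last - start + 1


# POWER-like example: a 2GiB low window plus a large window at 2^59
# (the 2^40 size of the high window is an illustrative value).
power_like = [(0, (2 << 30) - 1), (1 << 59, (1 << 59) + (1 << 40) - 1)]

# x86-like example: one hole splits the space, the window above it wins.
x86_like = [(0, 0xFEDFFFFF), (0xFEF00000, (1 << 39) - 1)]
```

With a reported non-zero base the generic interface never has to describe the hole itself.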

>  - How to accurately describe the HW in terms of standard IOAS objects
>    and where to put HW specific structs to support this.
> 
>    This is where PPC would decide how best to expose a control over
>    its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
>    wants, so long as it fits into the kernel IOAS model facing the
>    connected device driver.
> 
> QEMU would have IOMMU userspace drivers. One would be the "generic
> driver" using only the high level generic interface. It should work as
> best it can on all HW devices. This is the fallback path you talked
> of.
> 
> QEMU would also have HW specific IOMMU userspace drivers that know how
> to operate the exact HW. eg these drivers would know how to use
> userspace page tables, how to form IOPTEs and how to access the
> special features.
> 
> This is how QEMU could use an optimized path with nested page tables,
> for instance.

The concept makes sense in general.  The devil's in the details, as usual.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread Jason Gunthorpe via iommu

On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:

> > If the HW can attach multiple non-overlapping IOAS's to the same
> > device then the HW is routing to the correct IOAS by using the address
> > bits. This is not much different from the prior discussion we had
> > where we were thinking of the PASID as an 80 bit address
> 
> Ah... that might be a workable approach.  And it even helps me get my
> head around multiple attachment which I was struggling with before.
> 
> So, the rule would be that you can attach multiple IOASes to a device,
> as long as none of them overlap.  The non-overlapping could be because
> each IOAS covers a disjoint address range, or it could be because
> there's some attached information - such as a PASID - to disambiguate.

Right exactly - it is very parallel to PASID

And obviously HW support is required to have multiple page table
pointers per RID - which sounds like PPC does (high/low pointer?)
 
> What remains a question is where the disambiguating information comes
> from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> time.  IIUC, the current draft suggests it always comes at attach time
> for the PASID information.  Obviously the more consistency we can have
> here the better.

From a generic view point I'd say all are fair game. It is up to the
IOMMU driver to take the requested set of IOAS's, the "at attachment"
information (like PASID) and decide what to do, or fail.
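
The attach rule discussed above can be modelled in a few lines of Python. This is a sketch only: the `Attachment` type, the field names and the decision to treat a missing PASID as its own tag are assumptions made for illustration, and a real IOMMU driver is free to apply further restrictions or fail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attachment:
    start: int                   # first IOVA covered by the IOAS
    last: int                    # last IOVA covered (inclusive)
    pasid: Optional[int] = None  # None models untagged (RID-only) traffic


def conflicts(a: Attachment, b: Attachment) -> bool:
    # Different attach-time tags already disambiguate the two IOASes.
    if a.pasid != b.pasid:
        return False
    # Same tag: the address ranges themselves must be disjoint.
    return a.start <= b.last and b.start <= a.last


def try_attach(existing: List[Attachment], new: Attachment) -> bool:
    """Attach `new` unless it is ambiguous with an existing attachment."""
    if any(conflicts(new, old) for old in existing):
        return False  # the driver decides to fail
    existing.append(new)
    return True
```

This is the same as treating the tag and the IOVA together as one wider address: two attachments are fine as long as they never claim the same wide address.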

> I can also see an additional problem in implementation, once we start
> looking at hot-adding devices to existing address spaces.  

I won't pretend to guess how to implement this :) Just from a modeling
perspective it is something that works logically. If the kernel
implementation is too hard then PPC should do one of the other ideas.

Personally I'd probably try for a nice multi-domain attachment model
like PASID and not try to create/destroy domains.

As I said in my last email I think it is up to each IOMMU HW driver to
make these decisions, the iommufd framework just provides a
standardized API toward the attaching driver that the IOMMU HW must
fit into.

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread Jason Gunthorpe via iommu

On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:

> > My feeling is that qemu should be dealing with the host != target
> > case, not the kernel.
> > 
> > The kernel's job should be to expose the IOMMU HW it has, with all
> > features accessible, to userspace.
> 
> See... to me this is contrary to the point we agreed on above.

I'm not thinking of these as exclusive ideas.

The IOCTL interface in iommu can quite happily expose:
 Create IOAS generically
 Manipulate IOAS generically
 Create IOAS with IOMMU driver specific attributes
 HW specific Manipulate IOAS

IOCTL commands all together.

So long as everything is focused on a generic in-kernel IOAS object it
is fine to have multiple ways in the uAPI to create and manipulate the
objects.

When I speak about a generic interface I mean "Create IOAS
generically" - ie a set of IOCTLs that work on most IOMMU HW and can
be relied upon by things like DPDK/etc to always work and be portable.
This is why I like "hints" to provide some limited widely applicable
micro-optimization.

When I said "expose the IOMMU HW it has with all features accessible"
I mean also providing "Create IOAS with IOMMU driver specific
attributes".

These other IOCTLs would allow the IOMMU driver to expose every
configuration knob its HW has, in a natural HW centric language.
There is no pretense of genericness here, no crazy foo=A, foo=B hidden
device specific interface.

Think of it as a high level/low level interface to the same thing.

> Those are certainly wrong, but they came about explicitly by *not*
> being generic rather than by being too generic.  So I'm really
> confused as to what you're arguing for / against.

IMHO it is not having a PPC specific interface that was the problem,
it was making the PPC specific interface exclusive to the type 1
interface. If type 1 continued to work on PPC then DPDK/etc would
never have learned PPC specific code.

For iommufd with the high/low interface each IOMMU HW should ask basic
questions:

 - What should the generic high level interface do on this HW?
   For instance what should 'Create IOAS generically' do for PPC?
   It should not fail, it should create *something*
   What is the best thing for DPDK?
   I guess the 64 bit window is most broadly useful.

 - How to accurately describe the HW in terms of standard IOAS objects
   and where to put HW specific structs to support this.

   This is where PPC would decide how best to expose a control over
   its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
   wants, so long as it fits into the kernel IOAS model facing the
   connected device driver.

QEMU would have IOMMU userspace drivers. One would be the "generic
driver" using only the high level generic interface. It should work as
best it can on all HW devices. This is the fallback path you talked
of.

QEMU would also have HW specific IOMMU userspace drivers that know how
to operate the exact HW. eg these drivers would know how to use
userspace page tables, how to form IOPTEs and how to access the
special features.

This is how QEMU could use an optimized path with nested page tables,
for instance.

Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread Tian, Kevin

> From: David Gibson 
> Sent: Thursday, October 14, 2021 1:00 PM
> 
> On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Friday, October 1, 2021 2:11 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > Ok, there are several things we need for ppc.  None of which are
> > > inherently ppc specific and some of which will I think be useful for
> > > most platforms.  So, starting from most general to most specific
> > > here's basically what's needed:
> > >
> > > 1. We need to represent the fact that the IOMMU can only translate
> > >    *some* IOVAs, not a full 64-bit range.  You have the addr_width
> > >    already, but I'm not entirely sure if the translatable range on ppc
> > >    (or other platforms) is always a power-of-2 size.  It usually will
> > >    be, of course, but I'm not sure that's a hard requirement.  So
> > >    using a size/max rather than just a number of bits might be safer.
> > >
> > >    I think basically every platform will need this.  Most platforms
> > >    don't actually implement full 64-bit translation in any case, but
> > >    rather some smaller number of bits that fits their page table
> > >    format.
> > >
> > > 2. The translatable range of IOVAs may not begin at 0.  So we need to
> > >    advertise to userspace what the base address is, as well as the
> > >    size.  POWER's main IOVA range begins at 2^59 (at least on the
> > >    models I know about).
> > >
> > >    I think a number of platforms are likely to want this, though I
> > >    couldn't name them apart from POWER.  Putting the translated IOVA
> > >    window at some huge address is a pretty obvious approach to making
> > >    an IOMMU which can translate a wide address range without colliding
> > >    with any legacy PCI addresses down low (the IOMMU can check if this
> > >    transaction is for it by just looking at some high bits in the
> > >    address).
> > >
> > > 3. There might be multiple translatable ranges.  So, on POWER the
> > >    IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > >    2^59..2^59+.  The two ranges have completely separate IO
> > >    page tables, with (usually) different layouts.  (The low range will
> > >    nearly always be a single-level page table with 4kiB or 64kiB
> > >    entries, the high one will be multiple levels depending on the size
> > >    of the range and pagesize).
> > >
> > >    This may be less common, but I suspect POWER won't be the only
> > >    platform to do something like this.  As above, using a high range
> > >    is a pretty obvious approach, but clearly won't handle older
> > >    devices which can't do 64-bit DMA.  So adding a smaller range for
> > >    those devices is again a pretty obvious solution.  Any platform
> > >    with an "IO hole" can be treated as having two ranges, one below
> > >    the hole and one above it (although in that case they may well not
> > >    have separate page tables
> >
> > 1-3 are common on all platforms with fixed reserved ranges. Current
> > vfio already reports permitted iova ranges to user via
> > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> > maps only in those ranges. iommufd can follow the same logic for the
> > baseline uAPI.
> >
> > For above cases a [base, max] hint can be provided by the user per
> > Jason's recommendation.
> 
> Provided at which stage?

IOMMU_IOASID_ALLOC

> 
> > It is a hint as no additional restrictio

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread David Gibson

On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Friday, October 1, 2021 2:11 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > >   and then decide how to incorporate it in v2.
> >
> > Ok, there are several things we need for ppc.  None of which are
> > inherently ppc specific and some of which will I think be useful for
> > most platforms.  So, starting from most general to most specific
> > here's basically what's needed:
> >
> > 1. We need to represent the fact that the IOMMU can only translate
> >    *some* IOVAs, not a full 64-bit range.  You have the addr_width
> >    already, but I'm not entirely sure if the translatable range on ppc
> >    (or other platforms) is always a power-of-2 size.  It usually will
> >    be, of course, but I'm not sure that's a hard requirement.  So
> >    using a size/max rather than just a number of bits might be safer.
> >
> >    I think basically every platform will need this.  Most platforms
> >    don't actually implement full 64-bit translation in any case, but
> >    rather some smaller number of bits that fits their page table
> >    format.
> >
> > 2. The translatable range of IOVAs may not begin at 0.  So we need to
> >    advertise to userspace what the base address is, as well as the
> >    size.  POWER's main IOVA range begins at 2^59 (at least on the
> >    models I know about).
> >
> >    I think a number of platforms are likely to want this, though I
> >    couldn't name them apart from POWER.  Putting the translated IOVA
> >    window at some huge address is a pretty obvious approach to making
> >    an IOMMU which can translate a wide address range without colliding
> >    with any legacy PCI addresses down low (the IOMMU can check if this
> >    transaction is for it by just looking at some high bits in the
> >    address).
> >
> > 3. There might be multiple translatable ranges.  So, on POWER the
> >    IOMMU can typically translate IOVAs from 0..2GiB, and also from
> >    2^59..2^59+.  The two ranges have completely separate IO
> >    page tables, with (usually) different layouts.  (The low range will
> >    nearly always be a single-level page table with 4kiB or 64kiB
> >    entries, the high one will be multiple levels depending on the size
> >    of the range and pagesize).
> >
> >    This may be less common, but I suspect POWER won't be the only
> >    platform to do something like this.  As above, using a high range
> >    is a pretty obvious approach, but clearly won't handle older
> >    devices which can't do 64-bit DMA.  So adding a smaller range for
> >    those devices is again a pretty obvious solution.  Any platform
> >    with an "IO hole" can be treated as having two ranges, one below
> >    the hole and one above it (although in that case they may well not
> >    have separate page tables
> 
> 1-3 are common on all platforms with fixed reserved ranges. Current
> vfio already reports permitted iova ranges to user via
> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> maps only in those ranges. iommufd can follow the same logic for the
> baseline uAPI.
> 
> For above cases a [base, max] hint can be provided by the user per
> Jason's recommendation.

Provided at which stage?

> It is a hint as no additional restriction is
> imposed,

For the qemu type use case, that's not true.  In that case we
*require* the available mapping ranges to match what the guest
platform expects.

> since the kernel only cares about no violation on permitted
> ranges that it reports to the user. Underlying iommu dr

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread da...@gibson.dropbear.id.au

On Mon, Oct 11, 2021 at 02:17:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote:
> > > PASID support will already require that a device can be multi-bound to
> > > many IOAS's, couldn't PPC do the same with the windows?
> > 
> > I don't see how that would make sense.  The device has no awareness of
> > multiple windows the way it does of PASIDs.  It just sends
> > transactions over the bus with the IOVAs it's told.  If those IOVAs
> > lie within one of the windows, the IOMMU picks them up and translates
> > them.  If they don't, it doesn't.
> 
> To my mind that address centric routing is awareness.

I don't really understand that position.  A PASID capable device has
to be built to be PASID capable, and will generally have registers
into which you store PASIDs to use.

Any 64-bit DMA capable device can use the POWER IOMMU just fine - it's
up to the driver to program it with addresses that will be translated
(and in Linux the driver will get those from the DMA subsystem).

> If the HW can attach multiple non-overlapping IOAS's to the same
> device then the HW is routing to the correct IOAS by using the address
> bits. This is not much different from the prior discussion we had
> where we were thinking of the PASID as an 80 bit address

Ah... that might be a workable approach.  And it even helps me get my
head around multiple attachment which I was struggling with before.

So, the rule would be that you can attach multiple IOASes to a device,
as long as none of them overlap.  The non-overlapping could be because
each IOAS covers a disjoint address range, or it could be because
there's some attached information - such as a PASID - to disambiguate.
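
A minimal sketch of that attach rule, assuming each attachment is
described by an address range plus an optional PASID.  The `Attachment`
type and the `can_attach` helper are hypothetical names used only for
illustration; they are not part of the RFC's uAPI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attachment:
    """Hypothetical description of one IOAS attached to a device."""
    base: int                    # first IOVA covered
    size: int                    # number of bytes covered
    pasid: Optional[int] = None  # extra disambiguating tag, if any


def overlaps(a: Attachment, b: Attachment) -> bool:
    # Assumption: attachments carrying different tags (different PASIDs, or
    # a PASID versus none) are already disambiguated, so only attachments
    # with the same tag can collide on address.
    if a.pasid != b.pasid:
        return False
    return a.base < b.base + b.size and b.base < a.base + a.size


def can_attach(existing: List[Attachment], new: Attachment) -> bool:
    """True if `new` overlaps none of the already attached IOASes."""
    return not any(overlaps(new, old) for old in existing)
```

With this model a 0..2GiB window and a window at 2^59 can both be
attached to one device, while a second untagged window inside 0..2GiB
would be refused.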

What remains a question is where the disambiguating information comes
from in each case: does it come from properties of the IOAS,
properties of the device, or from extra parameters supplied at attach
time?  IIUC, the current draft suggests it always comes at attach time
for the PASID information.  Obviously the more consistency we can have
here the better.

I can also see an additional problem in implementation, once we start
looking at hot-adding devices to existing address spaces.  Suppose our
software (maybe qemu) wants to set up a single DMA view for a bunch of
devices, that has such a split window.  It can set up IOASes easily
enough for the two windows, then it needs to attach them.  Presumably,
it attaches them one at a time, which means that each device (or
group) goes through an interim state where it's attached to one, but
not the other.  That can probably be achieved by using an extra IOMMU
domain (or the local equivalent) in the hardware for that interim
state.  However it means we have to repeatedly create and destroy that
extra domain for each device after the first we add, rather than
simply adding each device to the domain which has both windows.

[I think this doesn't arise on POWER when running under PowerVM.  That
 has no concept like IOMMU domains, and instead the mapping is always
 done per "partitionable endpoint" (PE), essentially a group.  That
 means it's just a question of whether we mirror mappings on both
 windows into a given PE or just those from one IOAS.  It's not an
 unreasonable extension/combination of existing hardware quirks to
 consider, though]

> The fact the PPC HW actually has multiple page table roots and those
> roots even have different page tables layouts while still connected to
> the same device suggests this is not even an unnatural modelling
> approach...
> 
> Jason  
> 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread David Gibson

On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> 
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> > 
> > I'm not entirely sure what you mean by that.
> 
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree.  Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easily like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that.  qemu *should* provide that,
because the specific IOVAs matter to the guest.  That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition.  I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that".  But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
> 
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
> 
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me.  I see qemu as providing logic
to emulate each target IOMMU.  Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not.  But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide truly generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
> 
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic.  So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
> 
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach.  That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
> 
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> For instance, we are going to see HW with nested pag

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread David Gibson

On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> > 
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> Seems like we don't need the negotiation part?  The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). And there already is a negotiation happening,
> when the host kernel rejects MAP ioctl outside the advertised area.

The problem with the approach where the kernel advertises and
userspace selects based on that, is that it locks us into a specific
representation of what's possible.  If we get new hardware with new
weird constraints that can't be expressed with the representation we
chose, we're kind of stuffed.  Userspace will have to change to
accommodate the new extension and have any chance of working on the new
hardware.

With the model where userspace requests, and the kernel acks or nacks,
we can still support existing userspace if the only things it requests
can still be accommodated in the new constraints.  That's pretty likely
if the majority of userspaces request very simple things (say a single
IOVA block where it doesn't care about the base address).
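
To make the request/ack model concrete, here is a rough sketch under
stated assumptions: the host's permitted ranges are a list of (start,
end) pairs, and `request_window` stands in for whatever interface the
kernel would offer.  The name, the signature and the use of `None` for
"I don't care" are all hypothetical.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive


def request_window(permitted: List[Range], size: int,
                   base: Optional[int] = None) -> Optional[int]:
    """Return the granted base IOVA, or None if the request is refused."""
    for start, end in permitted:
        if base is None:
            # "Don't care": grant the first permitted range that is big enough.
            if end - start >= size:
                return start
        elif start <= base and base + size <= end:
            # Fixed base: the whole window must fit inside one permitted range.
            return base
    return None
```

A DPDK-like caller would leave `base` unset and accept whatever comes
back; a qemu-like caller would pass the guest-visible base and treat
`None` as a hard failure.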

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread Tian, Kevin

> From: Jean-Philippe Brucker 
> Sent: Tuesday, October 12, 2021 4:34 PM
> 
> On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote:
> > On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> >
> > > Seems like we don't need the negotiation part?  The host kernel
> > > communicates available IOVA ranges to userspace including holes (patch
> > > 17), and userspace can check that the ranges it needs are within the IOVA
> > > space boundaries. That part is necessary for DPDK as well since it needs
> > > > to know about holes in the IOVA space where DMA wouldn't work as expected
> > > > (MSI doorbells for example).
> >
> > I haven't looked super closely at DPDK, but the other simple VFIO app
> > I am aware of struggled to properly implement this semantic (Indeed it
> > wasn't even clear to the author this was even needed).
> >
> > It requires interval tree logic inside the application which is not a
> > trivial algorithm to implement in C.
> >
> > I do wonder if the "simple" interface should have an option more like
> > the DMA API where userspace just asks to DMA map some user memory and
> > gets back the dma_addr_t to use. Kernel manages the allocation
> > space/etc.
> 
> Agreed, it's tempting to use IOVA = VA but the two spaces aren't
> necessarily compatible. An extension that plugs into the IOVA allocator
> could be useful to userspace drivers.
> 

Makes sense. We can have a flag in IOMMUFD_MAP_DMA to tell whether
the user provides vaddr or expects the kernel to allocate and return.

Thanks
Kevin

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread Tian, Kevin

> From: Jean-Philippe Brucker 
> Sent: Monday, October 11, 2021 4:50 PM
> 
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> >
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> Seems like we don't need the negotiation part?  The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). And there already is a negotiation happening,
> when the host kernel rejects MAP ioctl outside the advertised area.
> 

Agreed. This can cover the ppc platforms with fixed reserved ranges.
It's meaningless to have the user further tell the kernel that it is only
willing to use a subset of the advertised area. For ppc platforms with
dynamic reserved ranges which are claimed by the user, we can leave them
out of the common set and handle them in a different way, either by
leveraging ioas nesting if applicable or by having a ppc specific cmd.

Thanks
Kevin

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-13 Thread Tian, Kevin

> From: David Gibson
> Sent: Friday, October 1, 2021 2:11 PM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> Ok, there are several things we need for ppc.  None of which are
> inherently ppc specific and some of which will I think be useful for
> most platforms.  So, starting from most general to most specific
> here's basically what's needed:
> 
> 1. We need to represent the fact that the IOMMU can only translate
>    *some* IOVAs, not a full 64-bit range.  You have the addr_width
>    already, but I'm not entirely sure if the translatable range on ppc
>    (or other platforms) is always a power-of-2 size.  It usually will
>    be, of course, but I'm not sure that's a hard requirement.  So
>    using a size/max rather than just a number of bits might be safer.
> 
>    I think basically every platform will need this.  Most platforms
>    don't actually implement full 64-bit translation in any case, but
>    rather some smaller number of bits that fits their page table
>    format.
> 
> 2. The translatable range of IOVAs may not begin at 0.  So we need to
>    advertise to userspace what the base address is, as well as the
>    size.  POWER's main IOVA range begins at 2^59 (at least on the
>    models I know about).
> 
>    I think a number of platforms are likely to want this, though I
>    couldn't name them apart from POWER.  Putting the translated IOVA
>    window at some huge address is a pretty obvious approach to making
>    an IOMMU which can translate a wide address range without colliding
>    with any legacy PCI addresses down low (the IOMMU can check if this
>    transaction is for it by just looking at some high bits in the
>    address).
> 
> 3. There might be multiple translatable ranges.  So, on POWER the
>    IOMMU can typically translate IOVAs from 0..2GiB, and also from
>    2^59..2^59+.  The two ranges have completely separate IO
>    page tables, with (usually) different layouts.  (The low range will
>    nearly always be a single-level page table with 4kiB or 64kiB
>    entries, the high one will be multiple levels depending on the size
>    of the range and pagesize).
> 
>    This may be less common, but I suspect POWER won't be the only
>    platform to do something like this.  As above, using a high range
>    is a pretty obvious approach, but clearly won't handle older
>    devices which can't do 64-bit DMA.  So adding a smaller range for
>    those devices is again a pretty obvious solution.  Any platform
>    with an "IO hole" can be treated as having two ranges, one below
>    the hole and one above it (although in that case they may well not
>    have separate page tables

1-3 are common on all platforms with fixed reserved ranges. Current
vfio already reports permitted iova ranges to user via
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
maps only in those ranges. iommufd can follow the same logic for the
baseline uAPI.

For above cases a [base, max] hint can be provided by the user per
Jason's recommendation. It is a hint as no additional restriction is
imposed, since the kernel only cares about no violation on permitted
ranges that it reports to the user. Underlying iommu driver may use 
this hint to optimize e.g. deciding how many levels are used for
the kernel-managed page table according to max addr.
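
As a rough illustration of how a driver might turn the max addr hint
into a level count, assume a radix page table with 4KiB pages and 9
index bits per level.  Those defaults mirror a common x86-style layout
and are assumptions, not something specified in this thread;
`levels_for_max_addr` is a hypothetical helper.

```python
def levels_for_max_addr(max_addr: int, page_shift: int = 12,
                        bits_per_level: int = 9) -> int:
    """Number of page-table levels needed to reach `max_addr` (inclusive)."""
    # Bits needed to express max_addr; always keep at least one index bit
    # above the page offset so the result is never zero levels.
    addr_bits = max(max_addr.bit_length(), page_shift + 1)
    index_bits = addr_bits - page_shift
    # Ceiling division: a partially used top level still counts as a level.
    return -(-index_bits // bits_per_level)
```

Under these assumptions a hint of 2^39-1 needs three levels and a hint
of 2^48-1 needs four, so a user that only ever maps low addresses lets
the driver build a shallower table.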

> 
> 4. The translatable ranges might not be fixed.  On ppc that 0..2GiB
>    and 2^59..whatever ranges are kernel conventions, not specified by
>    the hardware or firmware.  When running as a guest (which is the
>    normal case on POWER), there are explicit hypercalls for
>    con

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-12 Thread Jean-Philippe Brucker

On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> 
> > Seems like we don't need the negotiation part?  The host kernel
> > communicates available IOVA ranges to userspace including holes (patch
> > 17), and userspace can check that the ranges it needs are within the IOVA
> > space boundaries. That part is necessary for DPDK as well since it needs
> > to know about holes in the IOVA space where DMA wouldn't work as expected
> > (MSI doorbells for example). 
> 
> I haven't looked super closely at DPDK, but the other simple VFIO app
> I am aware of struggled to properly implement this semantic (Indeed it
> wasn't even clear to the author this was even needed).
> 
> It requires interval tree logic inside the application which is not a
> trivial algorithm to implement in C.
> 
> I do wonder if the "simple" interface should have an option more like
> the DMA API where userspace just asks to DMA map some user memory and
> gets back the dma_addr_t to use. Kernel manages the allocation
> space/etc.

Agreed, it's tempting to use IOVA = VA but the two spaces aren't
necessarily compatible. An extension that plugs into the IOVA allocator
could be useful to userspace drivers.

Thanks,
Jean

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread Jason Gunthorpe via iommu

On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:

> Seems like we don't need the negotiation part?  The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). 

I haven't looked super closely at DPDK, but the other simple VFIO app
I am aware of struggled to properly implement this semantic (Indeed it
wasn't even clear to the author this was even needed).

It requires interval tree logic inside the application which is not a
trivial algorithm to implement in C.

I do wonder if the "simple" interface should have an option more like
the DMA API where userspace just asks to DMA map some user memory and
gets back the dma_addr_t to use. Kernel manages the allocation
space/etc.
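
A sketch of what such a "simple" interface could look like from the
caller's side, assuming the allocator is handed the permitted IOVA
ranges with page-aligned starts.  `IovaAllocator` and its `map` method
are hypothetical; this is a first-fit bump allocator with no unmap
support, shown only to illustrate the idea of getting an address back.

```python
class IovaAllocator:
    """Hypothetical first-fit bump allocator over permitted IOVA ranges."""

    def __init__(self, permitted, page_size=4096):
        self.page_size = page_size
        # For each permitted (start, end) range keep [next_free, end].
        self.free = [[start, end] for start, end in permitted]

    def map(self, length):
        """Reserve an IOVA for `length` bytes; return it, or None if full."""
        # Round the length up to a whole number of pages.
        size = -(-length // self.page_size) * self.page_size
        for r in self.free:
            if r[1] - r[0] >= size:
                iova = r[0]
                r[0] += size
                return iova
        return None
```

The caller never sees the holes: it asks for a length and either gets
an address it can program into the device or is told there is no room.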

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread Jason Gunthorpe via iommu

On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:

> > This means we cannot define an input that has a magic HW specific
> > value.
> 
> I'm not entirely sure what you mean by that.

I mean if you make a general property 'foo' that userspace must
specify correctly then your API isn't general anymore. Userspace must
know if it is A or B HW to set foo=A or foo=B.

Supported IOVA ranges are easily like that as every IOMMU is
different. So DPDK shouldn't provide such specific or binding
information.

> No, I don't think that needs to be a condition.  I think it's
> perfectly reasonable for a constraint to be given, and for the host
> IOMMU to just say "no, I can't do that".  But that does mean that each
> of these values has to have an explicit way of userspace specifying "I
> don't care", so that the kernel will select a suitable value for those
> instead - that's what DPDK or other userspace would use nearly all the
> time.

My feeling is that qemu should be dealing with the host != target
case, not the kernel.

The kernel's job should be to expose the IOMMU HW it has, with all
features accessible, to userspace.

Qemu's job should be to have a userspace driver for each kernel IOMMU
and the internal infrastructure to make accelerated emulations for all
supported target IOMMUs.

In other words, it is not the kernel's job to provide target IOMMU
emulation.

The kernel should provide truly generic "works everywhere" interface
that qemu/etc can rely on to implement the least accelerated emulation
path.

So when I see proposals to have "generic" interfaces that actually
require very HW specific setup, and cannot be used by a generic qemu
userspace driver, I think it breaks this model. If qemu needs to know
it is on PPC (as it does today with VFIO's PPC specific API) then it
may as well speak PPC specific language and forget about pretending to
be generic.

This approach is grounded in 15 years of trying to build these
user/kernel split HW subsystems (particularly RDMA) where it has
become painfully obvious that the kernel is the worst place to try and
wrangle really divergent HW into a "common" uAPI.

This is because the kernel/user boundary is fixed. Introducing
anything generic here requires a lot of time, thought, arguing and
risk. Usually it ends up being done wrong (like the PPC specific
ioctls, for instance) and when this happens we can't learn and adapt,
we are stuck with stable uABI forever.

Exposing a device's native programming interface is much simpler. Each
device is fixed, defined and someone can sit down and figure out how
to expose it. Then that is it, it doesn't need revisiting, it doesn't
need harmonizing with a future slightly different device, it just
stays as is.

The cost is that there must be a userspace driver component for each
HW piece - which we are already paying here!

> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole.  When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

For instance, we are going to see HW with nested page tables, user
space owned page tables and even kernel-bypass fast IOTLB
invalidation.

In that world does it even make sense for qemu to use slow DMA_MAP
ioctls for emulation?

A userspace framework in qemu can make these optimizations and is
also necessarily HW specific as the host page table is HW specific.

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread Jason Gunthorpe via iommu

On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote:
> > PASID support will already require that a device can be multi-bound to
> > many IOAS's, couldn't PPC do the same with the windows?
> 
> I don't see how that would make sense.  The device has no awareness of
> multiple windows the way it does of PASIDs.  It just sends
> transactions over the bus with the IOVAs it's told.  If those IOVAs
> lie within one of the windows, the IOMMU picks them up and translates
> them.  If they don't, it doesn't.

To my mind that address centric routing is awareness.

If the HW can attach multiple non-overlapping IOAS's to the same
device then the HW is routing to the correct IOAS by using the address
bits. This is not much different from the prior discussion we had
where we were thinking of the PASID as an 80 bit address

The fact the PPC HW actually has multiple page table roots and those
roots even have different page tables layouts while still connected to
the same device suggests this is not even an unnatural modelling
approach...

Jason  


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread Jean-Philippe Brucker

On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> be determined) that it needs an IOAS where things can be mapped in the
> range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> 64-bit window).
> 
> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole.  When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

Seems like we don't need the negotiation part?  The host kernel
communicates available IOVA ranges to userspace including holes (patch
17), and userspace can check that the ranges it needs are within the IOVA
space boundaries. That part is necessary for DPDK as well since it needs
to know about holes in the IOVA space where DMA wouldn't work as expected
(MSI doorbells for example). And there already is a negotiation happening,
when the host kernel rejects MAP ioctl outside the advertised area.
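
The userspace-side check described here could be sketched as follows,
assuming the advertised ranges arrive as (start, end) pairs with holes
already excluded.  The `fits` helper is hypothetical, and the hole
placement in the usage note is made up for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive


def fits(advertised: List[Range], needed: List[Range]) -> bool:
    """True if every needed window lies wholly inside one advertised range."""
    return all(
        any(a_start <= start and end <= a_end for a_start, a_end in advertised)
        for start, end in needed
    )
```

For the PAPR example quoted above, a host advertising 0..2^60 with a
small hole just below 4GiB would accept both the 0..2GiB window and the
2^59..2^59+1TiB window, while a request for 0..4GiB would cross the
hole and be rejected by the check.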

Thanks,
Jean


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread David Gibson

On Fri, Oct 01, 2021 at 09:22:25AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> > On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > > 
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > > 
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > > 
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > > 
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> > > actual geometry including any holes via a query.
> > 
> > So part of the point of specifying start/end addresses is that
> > explicitly querying holes shouldn't be necessary: if the requested
> > range crosses a hole, it should fail.  If you didn't really need all
> > that range, you shouldn't have asked for it.
> > 
> > Which means these aren't really "hints" but optionally supplied
> > constraints.
> 
> We have to be very careful here, there are two very different use
> cases. When we are talking about the generic API I am mostly
> interested to see that applications like DPDK can use this API and be
> portable to any IOMMU HW the kernel supports. I view the fact that
> there is VFIO PPC specific code in DPDK as a failing of the kernel to
> provide a HW abstraction.

I would agree.  At the time we were making this, we thought there were
irreconcilable differences between what could be done with the x86 vs
ppc IOMMUs.  Turns out we just didn't think it through hard enough to
find a common model.

> This means we cannot define an input that has a magic HW specific
> value.

I'm not entirely sure what you mean by that.

> DPDK can never provide that portably. Thus all these kinds of
> inputs in the generic API need to be hints, if they exist at all.

I don't follow your reasoning.  First, note that in qemu these values
are *target* hardware specific, not *host* hardware specific.  If
those requests aren't honoured, qemu cannot faithfully emulate the
target hardware and has to fail.  That's what I mean when I say this
is a constraint, not a hint.

But when I say the constraint is optional, I mean that things which
don't have that requirement - like DPDK - shouldn't apply the
constraint.
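
To make the hint/constraint distinction concrete, here is a minimal
sketch in C.  Everything in it (struct iova_window, grant_window(), the
"strict" flag) is a hypothetical illustration, not the proposed uAPI.
It assumes the host offers a single usable window; with strict set the
request is a constraint (all or nothing), otherwise it is a hint that
the kernel may shrink or ignore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IOVA window, inclusive on both ends. */
struct iova_window {
    uint64_t base;
    uint64_t last;
};

/*
 * strict == true:  constraint. Grant exactly `want`, or fail.
 * strict == false: hint. Grant the overlap of `want` and `usable`,
 *                  or all of `usable` if they do not overlap.
 */
static bool grant_window(struct iova_window want, struct iova_window usable,
                         bool strict, struct iova_window *granted)
{
    assert(want.base <= want.last && usable.base <= usable.last);

    bool fits = want.base >= usable.base && want.last <= usable.last;

    if (strict) {
        if (!fits)
            return false;   /* qemu would have to fail here */
        *granted = want;
        return true;
    }

    uint64_t lo = want.base > usable.base ? want.base : usable.base;
    uint64_t hi = want.last < usable.last ? want.last : usable.last;

    if (lo <= hi) {
        granted->base = lo;
        granted->last = hi;
    } else {
        *granted = usable;  /* hint ignored; caller must query the result */
    }
    return true;
}
```

A DPDK-like user would pass strict = false (or no request at all) and
query what it got; a qemu-like user would pass strict = true.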

> As 'address space size hint'/'address space start hint' is both
> generic, useful, and providable by DPDK I think it is OK.

Size is certainly providable, and probably useful.  For DPDK, I don't
think start is useful.

> PPC can use
> it to pick which of the two page table formats to use for this IOAS if
> it wants.

Clarification: it's not that each window has a specific page table
format.  The two windows are independent of each other, which means
you can separately select the page table format for each one (although
the 32-bit one generally won't be big enough that there's any point
selecting something other than a 1-level TCE table).  When I say
format here, I basically mean number of levels and size of each level
- the IOPTE (a.k.a. TCE) format is the same in each case.

> The second use case is when we have a userspace driver for a specific
> HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
> We can look here for things to make general, but I would expect a
> fairly high bar. Instead, I would rather see the userspace driver
> communicate with the kernel driver in its own private language, so
> that the entire functionality of the unique HW can be used.

I don't think we actually need to do this.  Or rather, we might want
to do this for maximum performance in some cases, but I think we can
h

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-11 Thread da...@gibson.dropbear.id.au

On Sat, Oct 02, 2021 at 09:25:42AM -0300, Jason Gunthorpe wrote:
> On Sat, Oct 02, 2021 at 02:21:38PM +1000, da...@gibson.dropbear.id.au wrote:
> 
> > > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > > guest, and therefore needs to request both from the host.
> > > 
> > > As I understood your remarks each IOAS can only be one of the formats
> > > as they have a different PTE layout. So here I meant that qemu needs to
> > > be able to pick *for each IOAS* which of the two formats it is.
> > 
> > No.  Both windows are in the same IOAS.  A device could do DMA
> > simultaneously to both windows.  
> 
> Sure, but that doesn't force us to model it as one IOAS in the
> iommufd. A while back you were talking about using nesting and 3
> IOAS's, right?
> 
> 1, 2 or 3 IOAS's seems like a decision we can make.

Well, up to a point.  We can decide how such a thing should be
constructed.  However at some point there needs to exist an IOAS in
which both windows are mapped, whether it's directly or indirectly.
That's what the device will be attached to.

> PASID support will already require that a device can be multi-bound to
> many IOAS's, couldn't PPC do the same with the windows?

I don't see how that would make sense.  The device has no awareness of
multiple windows the way it does of PASIDs.  It just sends
transactions over the bus with the IOVAs it's told.  If those IOVAs
lie within one of the windows, the IOMMU picks them up and translates
them.  If they don't, it doesn't.
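
As a minimal sketch of that behaviour in C: the device only supplies an
IOVA, and the window lookup happens entirely on the IOMMU side.  The
struct dma_window type and find_dma_window() are hypothetical names for
illustration, and any window bases and sizes passed in are example
values, not something the hardware or this RFC mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA window: `size` bytes of IOVA space starting at `base`. */
struct dma_window {
    uint64_t base;
    uint64_t size;
};

/*
 * Return the index of the window containing `iova`, or -1 if the address
 * falls in no window (the IOMMU does not translate it).
 */
static int find_dma_window(const struct dma_window *w, size_t n, uint64_t iova)
{
    for (size_t i = 0; i < n; i++) {
        assert(w[i].size > 0);
        /* Written as a subtraction so base + size cannot overflow. */
        if (iova >= w[i].base && iova - w[i].base < w[i].size)
            return (int)i;
    }
    return -1;
}
```

Nothing in the transaction itself says which window it targets, which
is the difference from PASID.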

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-02 Thread Jason Gunthorpe via iommu

On Sat, Oct 02, 2021 at 02:21:38PM +1000, da...@gibson.dropbear.id.au wrote:

> > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > guest, and therefore needs to request both from the host.
> > 
> > As I understood your remarks each IOAS can only be one of the formats
> > as they have a different PTE layout. So here I meant that qemu needs to
> > be able to pick *for each IOAS* which of the two formats it is.
> 
> No.  Both windows are in the same IOAS.  A device could do DMA
> simultaneously to both windows.  

Sure, but that doesn't force us to model it as one IOAS in the
iommufd. A while back you were talking about using nesting and 3
IOAS's, right?

1, 2 or 3 IOAS's seems like a decision we can make.

PASID support will already require that a device can be multi-bound to
many IOAS's, couldn't PPC do the same with the windows?

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread da...@gibson.dropbear.id.au

On Fri, Oct 01, 2021 at 09:25:05AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote:
> > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > > > 
> > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > > format information for the target I/O page table.
> > > > > >
> > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > > semantics. For this type the user should specify the addr_width of
> > > > > > the I/O address space and whether the I/O page table is created in
> > > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > > as the false setting requires additional contract with KVM on handling
> > > > > > WBINVD emulation, which can be added later.
> > > > > >
> > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > > for what formats can be specified when allocating an IOASID.
> > > > > >
> > > > > > Open:
> > > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > > > >   is a way to claim a specific iova range from a system-wide address space.
> > > > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > > >   and then decide how to incorporate it in v2.
> > > > > 
> > > > > I think the request was to include a start/end IO address hint when
> > > > > creating the ios. When the kernel creates it then it can return the
> > > > 
> > > > is the hint single-range or could be multiple-ranges?
> > > 
> > > David explained it here:
> > > 
> > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> > 
> > Apparently not well enough.  I've attempted again in this thread.
> > 
> > > qemu needs to be able to choose if it gets the 32 bit range or 64
> > > bit range.
> > 
> > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > guest, and therefore needs to request both from the host.
> 
> As I understood your remarks each IOAS can only be one of the formats
> as they have a different PTE layout. So here I meant that qemu needs to
> be able to pick *for each IOAS* which of the two formats it is.

No.  Both windows are in the same IOAS.  A device could do DMA
simultaneously to both windows.  More realistically a 64-bit DMA
capable and a non-64-bit DMA capable device could be in the same group
and be doing DMAs to different windows simultaneously.

> > Or rather, it *might* need to supply both.  It will supply just the
> > 32-bit range by default, but the guest can request the 64-bit range
> > and/or remove and resize the 32-bit range via hypercall interfaces.
> > Vaguely recent Linux guests certainly will request the 64-bit range in
> > addition to the default 32-bit range.
> 
> And this would result in two different IOAS objects

There might be two different IOAS objects for setup, but at some point
they need to be combined into one IOAS to which the device is actually
attached.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread Jason Gunthorpe via iommu

On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote:
> On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > > 
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > > >   is a way to claim a specific iova range from a system-wide address space.
> > > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > >   and then decide how to incorporate it in v2.
> > > > 
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > > 
> > > is the hint single-range or could be multiple-ranges?
> > 
> > David explained it here:
> > 
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> 
> Apparently not well enough.  I've attempted again in this thread.
> 
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> 
> No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> guest, and therefore needs to request both from the host.

As I understood your remarks each IOAS can only be one of the formats
as they have a different PTE layout. So here I meant that qemu needs to
be able to pick *for each IOAS* which of the two formats it is.

> Or rather, it *might* need to supply both.  It will supply just the
> 32-bit range by default, but the guest can request the 64-bit range
> and/or remove and resize the 32-bit range via hypercall interfaces.
> Vaguely recent Linux guests certainly will request the 64-bit range in
> addition to the default 32-bit range.

And this would result in two different IOAS objects

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-01 Thread Jason Gunthorpe via iommu

On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > > 
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > > 
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > > 
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> > actual geometry including any holes via a query.
> 
> So part of the point of specifying start/end addresses is that
> explicitly querying holes shouldn't be necessary: if the requested
> range crosses a hole, it should fail.  If you didn't really need all
> that range, you shouldn't have asked for it.
> 
> Which means these aren't really "hints" but optionally supplied
> constraints.

We have to be very careful here, there are two very different use
cases. When we are talking about the generic API I am mostly
interested to see that applications like DPDK can use this API and be
portable to any IOMMU HW the kernel supports. I view the fact that
there is VFIO PPC specific code in DPDK as a failing of the kernel to
provide a HW abstraction.

This means we cannot define an input that has a magic HW specific
value. DPDK can never provide that portably. Thus all these kinds of
inputs in the generic API need to be hints, if they exist at all.

Since an 'address space size hint'/'address space start hint' is
generic, useful, and providable by DPDK, I think it is OK. PPC can use
it to pick which of the two page table formats to use for this IOAS if
it wants.

The second use case is when we have a userspace driver for a specific
HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
We can look here for things to make general, but I would expect a
fairly high bar. Instead, I would rather see the userspace driver
communicate with the kernel driver in its own private language, so
that the entire functionality of the unique HW can be used.

So, when it comes to providing exact ranges as an input parameter we
have to decide if that is done as some additional general data, or if
it should be part of an IOAS_FORMAT_KERNEL_PPC. In this case I suggest
the guiding factor should be if every single IOMMU implementation can
be updated to support the value.

Jason


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson

On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> > 
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> > 
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> > 
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the
> actual geometry including any holes via a query.

So part of the point of specifying start/end addresses is that
explicitly querying holes shouldn't be necessary: if the requested
range crosses a hole, it should fail.  If you didn't really need all
that range, you shouldn't have asked for it.

Which means these aren't really "hints" but optionally supplied
constraints.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread da...@gibson.dropbear.id.au

On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 10:09 PM
> > 
> > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > > >   is a way to claim a specific iova range from a system-wide address space.
> > > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > >   and then decide how to incorporate it in v2.
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> > 
> > David explained it here:
> > 
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> > 
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> > 
> > So a 'range hint' will do the job
> > 
> > David also suggested this:
> > 
> > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> > 
> > So I like this better:
> > 
> > struct iommu_ioasid_alloc {
> >     __u32   argsz;
> > 
> >     __u32   flags;
> > #define IOMMU_IOASID_ENFORCE_SNOOP  (1 << 0)
> > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
> > 
> >     __aligned_u64 max_iova_hint;
> >     __aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA
> > 
> >     // For creating nested page tables
> >     __u32 parent_ios_id;
> >     __u32 format;
> > #define IOMMU_FORMAT_KERNEL 0
> > #define IOMMU_FORMAT_PPC_XXX 2
> > #define IOMMU_FORMAT_[..]
> >     __u32 format_flags; // Layout depends on format above
> > 
> >     __aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
> > };
> > 
> > Again 'type' as an overall API indicator should not exist, feature
> > flags need to have clear narrow meanings.
> 
> currently the type is aimed to differentiate three usages:
> 
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)
> 
> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

To me "format" indicates how the IO translation information is
encoded.  We potentially have two different encodings: from userspace
to the kernel and from the kernel to the hardware.  But since this is
the userspace API, it's only the userspace to kernel one that matters
here.

In that sense, KERNEL is a "format": we encode the translation
information as a series of IOMAP operations to the kernel, rather than
as an in-memory structure.

> > This does both of David's suggestions at once. If qemu wants the 1G
> > limited region it could specify max_iova_hint = 1G, if it wants the
> > extend 64bit region with the hole it can give either the high base or
> > a large max_iova_hint. format/format_flags allows a further
> 
> Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

I need more specifics to answer that.  Are you talking from a
userspace PoV, a guest kernel's or the host kernel's?  In general I
think requiring userspace to locate and work around holes is a bad
idea.  If userspace requests a range, it should get *all* of that
range.

The ppc case is further complicated because there are multiple ranges
and each range could have separate IO page tables.  In practice
non-kernel managed IO pagetables are likely to be hard on ppc (or a

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread da...@gibson.dropbear.id.au

On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > 
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > > 
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> > 
> > is the hint single-range or could be multiple-ranges?
> 
> David explained it here:
> 
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

Apparently not well enough.  I've attempted again in this thread.

> qemu needs to be able to choose if it gets the 32 bit range or 64
> bit range.

No. qemu needs to supply *both* the 32-bit and 64-bit range to its
guest, and therefore needs to request both from the host.

Or rather, it *might* need to supply both.  It will supply just the
32-bit range by default, but the guest can request the 64-bit range
and/or remove and resize the 32-bit range via hypercall interfaces.
Vaguely recent Linux guests certainly will request the 64-bit range in
addition to the default 32-bit range.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread da...@gibson.dropbear.id.au

On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, September 23, 2021 8:07 PM
> > 
> > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > 
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> > 
> > Creating a shared ios is something that should probably be a different
> > command.
> 
> why? I didn't understand the criteria here...
> 
> > 
> > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > indicator? their difference is not about format.
> > 
> > Format should be
> > 
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> 
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> 
> > 
> > > Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> > > range information or be ok with a large range including holes (then let
> > > the kernel to figure out where the holes locate)?
> > 
> > My impression was it only needed a way to select between the two
> > different cases as they are exclusive. I'd see this API as being a
> > hint and userspace should query the exact ranges to learn what was
> > actually created.
> 
> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that 
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices.

I think certain POWER models do this, yes: there's *protection*
between DMAs from different devices, but you can't translate the same
address to different places for different devices.  I *think* that's a
firmware/hypervisor convention rather than a hardware limitation, but
I'm not entirely sure.  We don't do things this way when emulating the
POWER vIOMMU in POWER, but PowerVM might and we still have to deal
with that when running as a PowerVM guest.

> It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.
> 
> > 
> > > > device-specific escape if more specific customization is needed and is
> > > > needed to specify user space page tables anyhow.
> > >
> > > and I didn't understand the 2nd link. How does user-managed page
> > > table jump into this range claim problem? I'm getting confused...
> > 
> > PPC could also model it using a FORMAT_KERNEL_PPC_X,
> > FORMAT_KERNEL_PPC_Y
> > though it is less nice..
> 
> yes PPC can use different format, but I didn't understand why it is 
> related user-managed page table which further requires nesting. sound
> disconnected topics here...
> 
> > 
> > > > Yes, ioas_id should always be the xarray index.
> > > >
> > > > PASID needs to be called out as PASID or as a generic "hw description"
> > > > blob.
> > >
> > > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
> > 
> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
> 
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!
> 
> > 
> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> > 
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
> 
> this all makes me very confused, and completely different from what
> we agreed in previous v2 design proposal:
> 
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
> >   * vfio converts vPASID to pPASID and then call
> >     iommufd_device_attach_ioasid()
>   * the latter then installs ioas to the IOMMU with RID/PASID
> 
> > 
> > > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > > call ioasid_alloc() from ioasid.c?
> > 
> > As long as it is ioas_id and ioasid it is probably fine..
> 
> let's align with others in a few hours.
> 
> > 
> > > > kvm's API to program the vPASID translation table should probably take
> > > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > > information using an in-kernel API. Userspace shouldn't have to
> > > > shuttle it around.
> > >
> > > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > > returned to kvm. So we stil

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread da...@gibson.dropbear.id.au

On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> 
> is the hint single-range or could be multiple-ranges?
> 
> > actual geometry including any holes via a query.
> 
> I'd like to see a detail flow from David on how the uAPI works today with
> existing spapr driver and what exact changes he'd like to make on this
> proposed interface. Above info is still insufficient for us to think about the
> right solution.
> 
> > 
> > > > - Currently ioasid term has already been used in the kernel
> > > >   (drivers/iommu/ioasid.c) to represent the hardware I/O address space ID
> > > >   in the wire. It covers both PCI PASID (Process Address Space ID) and ARM
> > > >   SSID (Sub-Stream ID). We need find a way to resolve the naming conflict
> > > >   between the hardware ID and software handle. One option is to rename the
> > > >   existing ioasid to be pasid or ssid, given their full names still sound
> > > >   generic. Appreciate more thoughts on this open!
> > 
> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
> 
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index and ioasid to represent pasid/ssid?

This is probably obsoleted by Jason's other comments, but definitely
don't use "ioas_id" and "ioasid" to mean different things.  Having
meaningfully different things distinguished only by an underscore is
not a good idea.

> At this point
> the software handle and hardware id are mixed together thus need
> a clear terminology to differentiate them.
> 
> 
> Thanks
> Kevin
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

Ok, there are several things we need for ppc.  None of which are
inherently ppc specific and some of which will I think be useful for
most platforms.  So, starting from most general to most specific
here's basically what's needed:

1. We need to represent the fact that the IOMMU can only translate
   *some* IOVAs, not a full 64-bit range.  You have the addr_width
   already, but I'm not entirely sure if the translatable range on ppc
   (or other platforms) is always a power-of-2 size.  It usually will
   be, of course, but I'm not sure that's a hard requirement.  So
   using a size/max rather than just a number of bits might be safer.

   I think basically every platform will need this.  Most platforms
   don't actually implement full 64-bit translation in any case, but
   rather some smaller number of bits that fits their page table
   format.

2. The translatable range of IOVAs may not begin at 0.  So we need to
   advertise to userspace what the base address is, as well as the
   size.  POWER's main IOVA range begins at 2^59 (at least on the
   models I know about).

   I think a number of platforms are likely to want this, though I
   couldn't name them apart from POWER.  Putting the translated IOVA
   window at some huge address is a pretty obvious approach to making
   an IOMMU which can translate a wide address range without colliding
   with any legacy PCI addresses down low (the IOMMU can check if this
   transaction is for it by just looking at some high bits in the
   address).

3. There might be multiple translatable ranges.  So, on POWER the
   IOMMU can typically translate IOVAs from 0..2GiB, and also from
   2^59..2^59+.  The two ranges have completely separate IO
   page tables, with (usually) different layouts.  (The low range will
   nearly always be a single-level page table with 4kiB or 64kiB
   entries, the high one will be multiple levels depending on the size
   of the range and pagesize).

   This may be less common, but I suspect POWER won't be the only
   platform to do something like this.  As above, using a high range
   is a pretty obvious approach, but clearly won't handle older
   devices which can't do 64-bit DMA.  So adding a smaller range for
   those devices is again a pretty obvious solution.  Any platform
   with an "IO hole" can be treated as having two ranges, one below
   the hole and one above it (although in that case they may well not
   have separate page tables).

4. The translatable ranges might not be fixed.  On ppc that 0..2GiB
   and 2^59..whatever ranges are kernel conventions, not specified by
   the hardware or firmware.  When running as a guest (which is the
   normal case on POWER), there are explicit hypercalls for
   configuring the allowed IOVA windows (along with pagesize, number
   of levels etc.).  At the moment it is fixed in hardware that there
   are only 2 windows, one starting at 0 and one at 2^59 but there's
   no inherent reason those couldn't also be configurable.

   This will probably be rarer, but I wouldn't be surprised if it
   appears on another platform.  If you were designing an IOMMU ASIC
   for use in a variety of platforms, making the base address and size
   of the translatable range(s) configurable in registers would make
   sense.

Now, for (3) and (4), representing lists of windows explicitly in
ioctl()s is likely to be pretty ugly.  We might be able to avoid that,
for at least some of the interfaces, by using the nested IOAS stuff.
One way or another, though, the IOASes which are actually attached to
devices need to represent both windows.
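
As a rough illustration of points (1) to (4) above, here is a minimal Python
sketch of an IOAS geometry expressed as a list of (base, size) windows
instead of a single addr_width. The names IovaWindow and is_translatable,
and the 1 TiB size picked for the high window, are assumptions made for
this example only; they do not come from the RFC or from any existing uAPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IovaWindow:
    """One translatable IOVA range (hypothetical type, not a real uAPI struct)."""
    base: int  # first translatable IOVA; need not be 0
    size: int  # length in bytes; need not be a power of two

    @property
    def last(self) -> int:
        """Last translatable IOVA in this window (inclusive)."""
        return self.base + self.size - 1


def is_translatable(windows, iova, length=1):
    """True if [iova, iova + length) lies entirely inside a single window."""
    if length <= 0:
        return False
    end = iova + length - 1
    return any(w.base <= iova and end <= w.last for w in windows)


GIB = 1 << 30

# Layout loosely modelled on the POWER convention described above: a low
# window at 0..2GiB and a high window starting at 2^59.  The 1 TiB size of
# the high window is an arbitrary value chosen for this example.
power_like = [
    IovaWindow(base=0, size=2 * GIB),
    IovaWindow(base=1 << 59, size=1 << 40),
]
```

A mapping that straddles the end of a window, or that falls in the gap
between the two windows, is rejected, which is the behaviour a plain
addr_width cannot describe.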

e.g.
Create a "top-level" IOAS

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-29 Thread Liu, Yi L

> From: Jean-Philippe Brucker 
> Sent: Wednesday, September 22, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> 
> Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
> what it's used for or why it's mandatory. But for PPC it sounds like it
> should be an address range instead of an upper limit?

Yes, as this open describes, it may need to be a range. But I'm not sure
whether PPC requires multiple ranges or just one range. Perhaps David can
guide us there.
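
To make the relation between addr_width and a range concrete, here is a
small Python sketch. Both helper names are hypothetical and only illustrate
the [0, 2^addr_width-1] equivalence mentioned in the open; they are not part
of the proposed uAPI.

```python
def addr_width_to_range(addr_width):
    """Return the inclusive (first, last) IOVA pair implied by addr_width."""
    if not 0 < addr_width <= 64:
        raise ValueError("addr_width must be in 1..64")
    return 0, (1 << addr_width) - 1


def range_to_addr_width(first, last):
    """Return the addr_width equivalent to [first, last], or None if the
    range cannot be expressed as a plain width (non-zero base, or a size
    that is not a power of two)."""
    size = last - first + 1
    if first != 0 or size <= 0 or size & (size - 1):
        return None
    return size.bit_length() - 1
```

Every addr_width maps to a range, but not every range maps back to an
addr_width, so a range is the more general of the two descriptions.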

Regards,
Yi Liu
 
> Thanks,
> Jean
> 
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> >
> > - Currently ioasid term has already been used in the kernel
> >   (drivers/iommu/ioasid.c) to represent the hardware I/O address space ID
> >   in the wire. It covers both PCI PASID (Process Address Space ID) and ARM
> >   SSID (Sub-Stream ID). We need find a way to resolve the naming conflict
> >   between the hardware ID and software handle. One option is to rename the
> >   existing ioasid to be pasid or ssid, given their full names still sound
> >   generic. Appreciate more thoughts on this open!

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 9:31 PM
> 
> On Thu, Sep 23, 2021 at 01:20:55PM +, Tian, Kevin wrote:
> 
> > > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > > because the pasid table is in HPA space thus must be managed by host
> > > > kernel. Even no translation we still need the user to provide the pasid info.
> > >
> > > There should be no mandatory vPASID stuff in most of these flows, that
> > > is just a special thing ENQCMD virtualization needs. If userspace
> > > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > > stuff.
> >
> > No. for one, we also support SVA w/o using ENQCMD. For two, the key
> > is that the PASID table cannot be delegated to the userspace like ARM
> > or AMD. This implies that for any pasid that the userspace wants to
> > enable, it must be configured via the kernel.
> 
> Yes, configured through the kernel, but the simplified flow should
> have the kernel handle everything and just emit a PASID for userspace
> to use.
> 
> 
> > just for a short summary of PASID model from previous design RFC:
> >
> > for arm/amd:
> >   - pasid space delegated to userspace
> >   - pasid table delegated to userspace
> >   - just one call to bind pasid_table() then pasids are fully managed by user
> >
> > for intel:
> >   - pasid table is always managed by kernel
> >   - for pdev,
> >       - pasid space is delegated to userspace
> >       - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the
> >         pasid entry
> >   - for mdev,
> >       - pasid space is managed by userspace
> >       - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to
> >         ppasid. iommufd setups the ppasid entry
> >       - additional a contract to kvm for setup CPU pasid translation
> >         if enqcmd is used
> >   - to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let
> >     underlying driver to figure out whether vpasid should be translated.
> 
> All cases should support a kernel owned ioas associated with a
> PASID. This is the universal basic API that all PASID supporting
> IOMMUs need to implement.
> 
> I should not need to write generic userspace that has to know how to
> setup architecture specific nested userspace page tables just to use
> PASID!

ah, got you! I have to admit that my previous thoughts are all from
VM p.o.v, with true userspace application ignored...

> 
> All of the above is qemu accelerated vIOMMU stuff. It is a good idea
> to keep the two areas separate as it greatly informs what is general
> code and what is HW specific code.
> 

Agree. I will think more along this direction. Possibly this discussion
has deviated a lot from what this skeleton series provides. We still have
plenty of time to figure it out when starting the pasid support. For now
at least the minimal output is that PASID might be a good candidate to
be used in iommufd. 😊

Thanks
Kevin

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Jason Gunthorpe via iommu

On Thu, Sep 23, 2021 at 01:20:55PM +, Tian, Kevin wrote:

> > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > because the pasid table is in HPA space thus must be managed by host
> > > kernel. Even no translation we still need the user to provide the pasid info.
> > 
> > There should be no mandatory vPASID stuff in most of these flows, that
> > is just a special thing ENQCMD virtualization needs. If userspace
> > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > stuff.
> 
> No. for one, we also support SVA w/o using ENQCMD. For two, the key
> is that the PASID table cannot be delegated to the userspace like ARM
> or AMD. This implies that for any pasid that the userspace wants to
> enable, it must be configured via the kernel.

Yes, configured through the kernel, but the simplified flow should
have the kernel handle everything and just emit a PASID for userspace
to use.


> just for a short summary of PASID model from previous design RFC:
> 
> for arm/amd:
>   - pasid space delegated to userspace
>   - pasid table delegated to userspace
>   - just one call to bind pasid_table() then pasids are fully managed by user
>
> for intel:
>   - pasid table is always managed by kernel
>   - for pdev,
>       - pasid space is delegated to userspace
>       - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the
>         pasid entry
>   - for mdev,
>       - pasid space is managed by userspace
>       - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to
>         ppasid. iommufd setups the ppasid entry
>       - additional a contract to kvm for setup CPU pasid translation
>         if enqcmd is used
>   - to unify pdev/mdev, just always call it vpasid in attach_ioasid().
>     let underlying driver to figure out whether vpasid should be translated.

All cases should support a kernel owned ioas associated with a
PASID. This is the universal basic API that all PASID supporting
IOMMUs need to implement.

I should not need to write generic userspace that has to know how to
set up architecture-specific nested userspace page tables just to use
PASID!

All of the above is qemu accelerated vIOMMU stuff. It is a good idea
to keep the two areas separate as it greatly informs what is general
code and what is HW specific code.

Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 9:02 PM
> 
> On Thu, Sep 23, 2021 at 12:45:17PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, September 23, 2021 8:31 PM
> > >
> > > On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > > >
> > > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > > > >
> > > > > > currently the type is aimed to differentiate three usages:
> > > > > >
> > > > > > - kernel-managed I/O page table
> > > > > > - user-managed I/O page table
> > > > > > - shared I/O page table (e.g. with mm, or ept)
> > > > >
> > > > > Creating a shared ios is something that should probably be a different
> > > > > command.
> > > >
> > > > why? I didn't understand the criteria here...
> > >
> > > I suspect the input args will be very different, no?
> >
> > yes, but can't the structure be extended to incorporate it?
> 
> You need to be thoughtful, giant structures with endless combinations
> of optional fields turn out very hard. I haven't even seen what args
> this shared thing will need, but I'm guessing it is almost none, so
> maybe a new call is OK?

To judge this, it looks like we may have to do some practice on this front,
e.g. coming up with an example structure for the future intended usages and
then seeing whether one structure can fit?

> 
> If it is literally just 'give me an ioas for current mm' then it has
> no args or complexity at all.

for mm, yes, should be simple. for ept it might be more complex e.g.
requiring a handle in kvm and some other format info to match ept
page table.

> 
> > > > > > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > > > > > indicator? their difference is not about format.
> > > > > >
> > > > > > Format should be
> > > > > >
> > > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > > >
> > > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> > >
> > > So long as we are using structs we need to have values when the field
> > > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > > are not creating a userspace page table.
> > >
> > > Alternatively a userspace page table could have a different API
> >
> > I don't know. Your comments really confused me on what's the right
> > way to design the uAPI. If you still remember, the original v1 proposal
> > introduced different uAPIs for kernel/user-managed cases. Then you
> > recommended to consolidate everything related to ioas in one allocation
> > command.
> 
> This is because you had almost completely duplicated the input args
> between the two calls.
> 
> If it turns out they have very different args, then they should have
> different calls.
> 
> > > > - open iommufd
> > > > - create an ioas
> > > > - attach vfio device to ioasid, with vPASID info
> > > > * vfio converts vPASID to pPASID and then call
> > > >   iommufd_device_attach_ioasid()
> > > > * the latter then installs ioas to the IOMMU with RID/PASID
> > >
> > > This was your flow for mdev's, I've always been talking about wanting
> > > to see this supported for all use cases, including physical PCI
> > > devices w/ PASID support.
> >
> > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > because the pasid table is in HPA space thus must be managed by host
> > kernel. Even no translation we still need the user to provide the pasid info.
> 
> There should be no mandatory vPASID stuff in most of these flows, that
> is just a special thing ENQCMD virtualization needs. If userspace
> isn't doing ENQCMD virtualization it shouldn't need to touch this
> stuff.

No. for one, we also support SVA w/o using ENQCMD. For two, the key
is that the PASID table cannot be delegated to the userspace like ARM
or AMD. This implies that for any pasid that the userspace wants to
enable, it must be configured via the kernel.

> 
> > as explained earlier, on Intel platform the user always needs to provide
> > a PASID in the attaching call. whether it's directly used (for pdev)
> > or translated (for mdev) is the underlying driver thing. From kernel
> > p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> > in the uAPI.
> 
> I've always disagreed with this. There should be an option for the
> kernel to pick an appropriate PASID for portability to other IOMMUs
> and simplicity of the interface.
> 
> You need to keep it clear what is in the minimum basic path and what
> is needed for special cases, like ENQCMD virtualization.
> 
> Not every user of iommufd is doing virtualization.
> 

just for a short summary of PASID model from previous design RFC:

for arm/amd:
- pasid space delegated to userspace
- pasid table delegated to userspace
- just one call to bind pasid_table() then pasids are fully managed by user

for intel:
- pasid table is always managed by kernel
-

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Jason Gunthorpe via iommu

On Thu, Sep 23, 2021 at 12:45:17PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, September 23, 2021 8:31 PM
> > 
> > On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > >
> > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > > >
> > > > > currently the type is aimed to differentiate three usages:
> > > > >
> > > > > - kernel-managed I/O page table
> > > > > - user-managed I/O page table
> > > > > - shared I/O page table (e.g. with mm, or ept)
> > > >
> > > > Creating a shared ios is something that should probably be a different
> > > > command.
> > >
> > > why? I didn't understand the criteria here...
> > 
> > I suspect the input args will be very different, no?
> 
> yes, but can't the structure be extended to incorporate it? 

You need to be thoughtful, giant structures with endless combinations
of optional fields turn out very hard. I haven't even seen what args
this shared thing will need, but I'm guessing it is almost none, so
maybe a new call is OK?

If it is literally just 'give me an ioas for current mm' then it has
no args or complexity at all.
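
One way to picture the trade-off, as a sketch only: model each allocation
command as its own small request type, so that validation never has to
reason about fields that do not apply. The class names below are
hypothetical and the checks are limited to what this RFC states
(addr_width is given, enforce_snoop must currently be true).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllocKernelIoas:
    """Kernel-managed I/O page table; carries the geometry-related arguments."""
    addr_width: int
    enforce_snoop: bool


@dataclass(frozen=True)
class AllocSharedMmIoas:
    """'Give me an ioas for current mm': no arguments at all."""


def validate(req):
    """Return None if the request is acceptable, else an error string."""
    if isinstance(req, AllocSharedMmIoas):
        return None  # nothing to check, the request has no fields
    if isinstance(req, AllocKernelIoas):
        if not 0 < req.addr_width <= 64:
            return "bad addr_width"
        if not req.enforce_snoop:
            return "enforce_snoop=false not supported yet"
        return None
    return "unknown request"
```

With a single combined structure the same function would also have to
reject every meaningless combination of optional fields, which is where
the complexity comes from.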

> > > > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > > > indicator? their difference is not about format.
> > > >
> > > > Format should be
> > > >
> > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > >
> > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> > 
> > So long as we are using structs we need to have values when the field
> > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > are not creating a userspace page table.
> > 
> > Alternatively a userspace page table could have a different API
> 
> I don't know. Your comments really confused me on what's the right
> way to design the uAPI. If you still remember, the original v1 proposal
> introduced different uAPIs for kernel/user-managed cases. Then you
> recommended to consolidate everything related to ioas in one allocation
> command.

This is because you had almost completely duplicated the input args
between the two calls.

If it turns out they have very different args, then they should have
different calls.

> > > - open iommufd
> > > - create an ioas
> > > - attach vfio device to ioasid, with vPASID info
> > >   * vfio converts vPASID to pPASID and then call
> > >     iommufd_device_attach_ioasid()
> > >   * the latter then installs ioas to the IOMMU with RID/PASID
> > 
> > This was your flow for mdev's, I've always been talking about wanting
> > to see this supported for all use cases, including physical PCI
> > devices w/ PASID support.
> 
> this is not a flow for mdev. It's also required for pdev on Intel platform,
> because the pasid table is in HPA space thus must be managed by host 
> kernel. Even no translation we still need the user to provide the pasid info.

There should be no mandatory vPASID stuff in most of these flows, that
is just a special thing ENQCMD virtualization needs. If userspace
isn't doing ENQCMD virtualization it shouldn't need to touch this
stuff.

> as explained earlier, on Intel platform the user always needs to provide 
> a PASID in the attaching call. whether it's directly used (for pdev)
> or translated (for mdev) is the underlying driver thing. From kernel
> p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> in the uAPI.

I've always disagreed with this. There should be an option for the
kernel to pick an appropriate PASID for portability to other IOMMUs
and simplicity of the interface.

You need to keep it clear what is in the minimum basic path and what
is needed for special cases, like ENQCMD virtualization.

Not every user of iommufd is doing virtualization.
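
A toy model of that minimum basic path, written in Python purely for
illustration: the "kernel" picks the PASID and userspace only queries it.
ToyIommufd and its methods are hypothetical names, the allocation policy
(hand out 1, 2, 3, ...) is an assumption, and the only fact taken from
outside this thread is that a PCIe PASID is at most 20 bits wide.

```python
PASID_BITS = 20  # PCIe defines PASID as a value of up to 20 bits


class ToyIommufd:
    """Toy stand-in for the kernel side; every name here is hypothetical."""

    def __init__(self):
        self._next_pasid = 1    # example policy: hand out 1, 2, 3, ...
        self._next_ioas_id = 0
        self._ioas = {}         # ioas_id -> (device_id, pasid)

    def create_ioas_with_pasid(self, device_id):
        """Create an ioas for device_id and let the 'kernel' pick its PASID."""
        if self._next_pasid >= (1 << PASID_BITS):
            raise RuntimeError("PASID space exhausted")
        ioas_id = self._next_ioas_id
        self._next_ioas_id += 1
        self._ioas[ioas_id] = (device_id, self._next_pasid)
        self._next_pasid += 1
        return ioas_id

    def query_pasid(self, ioas_id):
        """Return the PCI PASID that userspace must program into the device."""
        return self._ioas[ioas_id][1]
```

Userspace in this model never supplies a PASID (or vPASID) of its own; it
creates the ioas, asks which PASID was assigned, and programs the device
with that value.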

Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 8:31 PM
> 
> On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, September 23, 2021 8:07 PM
> > >
> > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > >
> > > > currently the type is aimed to differentiate three usages:
> > > >
> > > > - kernel-managed I/O page table
> > > > - user-managed I/O page table
> > > > - shared I/O page table (e.g. with mm, or ept)
> > >
> > > Creating a shared ios is something that should probably be a different
> > > command.
> >
> > why? I didn't understand the criteria here...
> 
> I suspect the input args will be very different, no?

yes, but can't the structure be extended to incorporate it? 

> 
> > > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > > indicator? their difference is not about format.
> > >
> > > Format should be
> > >
> > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> >
> > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> 
> So long as we are using structs we need to have values when the field
> isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> are not creating a userspace page table.
> 
> Alternatively a userspace page table could have a different API

I don't know. Your comments really confused me on what's the right
way to design the uAPI. If you still remember, the original v1 proposal
introduced different uAPIs for kernel/user-managed cases. Then you
recommended to consolidate everything related to ioas in one allocation
command.

Can you help articulate the criteria first?

> 
> > yes, the user can query the permitted range using DEVICE_GET_INFO.
> > But in the end if the user wants two separate regions, I'm afraid that
> > the underlying iommu driver wants to know the exact info. iirc PPC
> > has one global system address space shared by all devices. It is possible
> > that the user may want to claim range-A and range-C, with range-B
> > in-between but claimed by another user. Then simply using one hint
> > range [A-lowend, C-highend] might not work.
> 
> I don't know, that sounds strange.. In any event hint is a hint, it
> can be ignored, the only information the kernel needs to extract is
> low/high bank?

iirc Dave said that the user needs to claim a range explicitly. 'claim'
doesn't sound like a hint to me. Possibly it's time for Dave to chime in.
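
The range-A/range-C concern quoted above can be shown with a few lines of
Python. The addresses are arbitrary example values and the helper names are
hypothetical; the point is only that the single covering range necessarily
includes range-B.

```python
def overlaps(a, b):
    """True if two inclusive (first, last) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def hull(ranges):
    """Smallest single inclusive range covering every range in the list."""
    return min(r[0] for r in ranges), max(r[1] for r in ranges)


# Arbitrary example addresses: A and C belong to one user, B to another.
range_a = (0x0000_0000, 0x0FFF_FFFF)
range_b = (0x1000_0000, 0x1FFF_FFFF)
range_c = (0x2000_0000, 0x2FFF_FFFF)

# One hint range [A-lowend, C-highend], as discussed above.
single_hint = hull([range_a, range_c])
```

Here single_hint overlaps range_b even though neither range_a nor range_c
does, so if the claim has to be exact the interface needs to accept more
than one range.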

> 
> > yes PPC can use different format, but I didn't understand why it is
> > related user-managed page table which further requires nesting. sound
> > disconnected topics here...
> 
> It is just a way to feed through more information if we get stuck
> someday.

You mean that we should define uAPI for all future possible extensions
now to minimize the frequency of changing it?

> 
> > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > > and it MUST be exposed in that format to be programmed into the PCI
> > > device itself.
> >
> > In the entire discussion in previous design RFC, I kept an impression that
> > ARM-equivalent PASID is called SSID. If we can use PASID as a general
> > term in iommufd context, definitely it's much better!
> 
> SSID is inside the chip and part of the IOMMU. PASID is part of the
> PCI spec.
> 
> iommufd should keep these things distinct.
> 
> If we are talking about a PCI TLP then the name to use is PASID.

If Jean doesn't object...

> 
> > > All of this should be able to support a userspace, like DPDK, creating
> > > a PASID on its own without any special VFIO drivers.
> > >
> > > - Open iommufd
> > > - Attach the vfio device FD
> > > - Request a PASID device id
> > > - Create an ios against the pasid device id
> > > - Query the ios for the PCI PASID #
> > > - Program the HW to issue TLPs with the PASID
> >
> > this all makes me very confused, and completely different from what
> > we agreed in previous v2 design proposal:
> >
> > - open iommufd
> > - create an ioas
> > - attach vfio device to ioasid, with vPASID info
> > * vfio converts vPASID to pPASID and then call
> >   iommufd_device_attach_ioasid()
> > * the latter then installs ioas to the IOMMU with RID/PASID
> 
> This was your flow for mdev's, I've always been talking about wanting
> to see this supported for all use cases, including physical PCI
> devices w/ PASID support.

This is not a flow just for mdev. It's also required for pdev on Intel platforms,
because the PASID table is in HPA space and thus must be managed by the host
kernel. Even with no translation we still need the user to provide the PASID info.

> 
> A normal vfio_pci userspace should be able to create PASIDs unrelated
> to the mdev stuff.
> 
> > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > > really about presenting a consistent IOMMU handle to the guest across
> > > migration, it is not the value that shows up on the PCI bus.
> >
> > It's the former. But vfio driver needs to maintain vPASID->pPAS

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Jason Gunthorpe via iommu

On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, September 23, 2021 8:07 PM
> > 
> > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > 
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> > 
> > Creating a shared ios is something that should probably be a different
> > command.
> 
> why? I didn't understand the criteria here...

I suspect the input args will be very different, no?

> > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > indicator? their difference is not about format.
> > 
> > Format should be
> > 
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> 
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

So long as we are using structs we need to have values for when the field
isn't being used. FORMAT_KERNEL is a reasonable value to have when we
are not creating a userspace page table.

Alternatively a userspace page table could have a different API.

> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that 
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices. It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.

I don't know, that sounds strange.. In any event hint is a hint, it
can be ignored, the only information the kernel needs to extract is
low/high bank?

> yes PPC can use different format, but I didn't understand why it is 
> related user-managed page table which further requires nesting. sound
> disconnected topics here...

It is just a way to feed through more information if we get stuck
someday.

> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
> 
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!

SSID is inside the chip and part of the IOMMU. PASID is part of the
PCI spec.

iommufd should keep these things distinct. 

If we are talking about a PCI TLP then the name to use is PASID.

> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> > 
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
> 
> this all makes me very confused, and completely different from what
> we agreed in previous v2 design proposal:
>
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
>   * vfio converts vPASID to pPASID and then call
>     iommufd_device_attach_ioasid()
>   * the latter then installs ioas to the IOMMU with RID/PASID

This was your flow for mdev's, I've always been talking about wanting
to see this supported for all use cases, including physical PCI
devices w/ PASID support.

A normal vfio_pci userspace should be able to create PASIDs unrelated
to the mdev stuff.

> > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > really about presenting a consistent IOMMU handle to the guest across
> > migration, it is not the value that shows up on the PCI bus.
> 
> It's the former. But vfio driver needs to maintain vPASID->pPASID
> translation in the mediation path, since what guest programs is vPASID.

The pPASID definitely is a PASID as it goes out on the PCIe wire.

Suggest you come up with a more general name for vPASID?

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 8:07 PM
> 
> On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> 
> > currently the type is aimed to differentiate three usages:
> >
> > - kernel-managed I/O page table
> > - user-managed I/O page table
> > - shared I/O page table (e.g. with mm, or ept)
> 
> Creating a shared ios is something that should probably be a different
> command.

why? I didn't understand the criteria here...

> 
> > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > indicator? their difference is not about format.
> 
> Format should be
> 
> FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

> 
> > Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> > range information or be ok with a large range including holes (then let
> > the kernel to figure out where the holes locate)?
> 
> My impression was it only needed a way to select between the two
> different cases as they are exclusive. I'd see this API as being a
> hint and userspace should query the exact ranges to learn what was
> actually created.

yes, the user can query the permitted range using DEVICE_GET_INFO.
But in the end if the user wants two separate regions, I'm afraid that 
the underlying iommu driver wants to know the exact info. iirc PPC
has one global system address space shared by all devices. It is possible
that the user may want to claim range-A and range-C, with range-B
in-between but claimed by another user. Then simply using one hint
range [A-lowend, C-highend] might not work.
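
To illustrate the concern with a single covering hint, here is a minimal sketch.
The iova_range type, the helper names and the example addresses are all
hypothetical and not part of the proposed uAPI; ranges are treated as
inclusive [start, last].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inclusive IOVA range [start, last]. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/* True if the two inclusive ranges share at least one address. */
static bool iova_ranges_overlap(struct iova_range a, struct iova_range b)
{
	assert(a.start <= a.last && b.start <= b.last);
	return a.start <= b.last && b.start <= a.last;
}

/* Smallest single range that covers both a and c: [A-lowend, C-highend]. */
static struct iova_range iova_covering_hint(struct iova_range a,
					    struct iova_range c)
{
	struct iova_range hint;

	hint.start = a.start < c.start ? a.start : c.start;
	hint.last = a.last > c.last ? a.last : c.last;
	return hint;
}
```

With A = [0x0000, 0x0fff], B = [0x1000, 0x1fff] owned by another user and
C = [0x2000, 0x2fff], neither A nor C overlaps B, but iova_covering_hint(A, C)
does. A single hint range cannot express "A and C but not B".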

> 
> > > device-specific escape if more specific customization is needed and is
> > > needed to specify user space page tables anyhow.
> >
> > and I didn't understand the 2nd link. How does user-managed page
> > table jump into this range claim problem? I'm getting confused...
> 
> PPC could also model it using a FORMAT_KERNEL_PPC_X, FORMAT_KERNEL_PPC_Y
> though it is less nice..

yes PPC can use a different format, but I didn't understand why it is
related to a user-managed page table, which further requires nesting. These
sound like disconnected topics here...

> 
> > > Yes, ioas_id should always be the xarray index.
> > >
> > > PASID needs to be called out as PASID or as a generic "hw description"
> > > blob.
> >
> > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
> 
> ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> and it MUST be exposed in that format to be programmed into the PCI
> device itself.

In the entire discussion in the previous design RFC, I had the impression that
the ARM equivalent of PASID is called SSID. If we can use PASID as a general
term in the iommufd context, that's definitely much better!

> 
> All of this should be able to support a userspace, like DPDK, creating
> a PASID on its own without any special VFIO drivers.
> 
> - Open iommufd
> - Attach the vfio device FD
> - Request a PASID device id
> - Create an ios against the pasid device id
> - Query the ios for the PCI PASID #
> - Program the HW to issue TLPs with the PASID

this all makes me very confused, and completely different from what
we agreed in previous v2 design proposal:

- open iommufd
- create an ioas
- attach vfio device to ioasid, with vPASID info
  * vfio converts vPASID to pPASID and then call
    iommufd_device_attach_ioasid()
  * the latter then installs ioas to the IOMMU with RID/PASID

> 
> > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > call ioasid_alloc() from ioasid.c?
> 
> As long as it is ioas_id and ioasid it is probably fine..

let's align with others in a few hours.

> 
> > > kvm's API to program the vPASID translation table should probably take
> > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > information using an in-kernel API. Userspace shouldn't have to
> > > shuttle it around.
> >
> > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > returned to kvm. So we still need a generic blob to represent
> > vPASID in the uAPI.
> 
> I think you have to be clear about what the value is being used
> for. Is it an IOMMU page table handle or is it a PCI PASID value?
> 
> AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> really about presenting a consistent IOMMU handle to the guest across
> migration, it is not the value that shows up on the PCI bus.
> 

It's the former. But vfio driver needs to maintain vPASID->pPASID
translation in the mediation path, since what guest programs is vPASID.
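
A minimal sketch of what such a translation table could look like. Everything
here is hypothetical (the pasid_map type, the helper names, the fixed-size
array); a real driver would presumably use an xarray or similar. The only
hard fact assumed is that a PCIe PASID is at most 20 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PASID_MAP_SLOTS	16			/* arbitrary size for the sketch */
#define PASID_MAX	((1u << 20) - 1)	/* PCIe PASIDs are at most 20 bits */
#define PASID_INVALID	UINT32_MAX

/* Hypothetical per-device table kept by the mediating driver. */
struct pasid_map {
	uint32_t vpasid[PASID_MAP_SLOTS];
	uint32_t ppasid[PASID_MAP_SLOTS];
	size_t count;
};

/* Look up the physical PASID for a guest-programmed vPASID. */
static uint32_t pasid_map_lookup(const struct pasid_map *map, uint32_t vpasid)
{
	for (size_t i = 0; i < map->count; i++)
		if (map->vpasid[i] == vpasid)
			return map->ppasid[i];
	return PASID_INVALID;
}

/* Record vpasid -> ppasid; returns 0 on success, -1 if full or duplicate. */
static int pasid_map_add(struct pasid_map *map, uint32_t vpasid,
			 uint32_t ppasid)
{
	assert(vpasid <= PASID_MAX && ppasid <= PASID_MAX);
	if (map->count == PASID_MAP_SLOTS ||
	    pasid_map_lookup(map, vpasid) != PASID_INVALID)
		return -1;
	map->vpasid[map->count] = vpasid;
	map->ppasid[map->count] = ppasid;
	map->count++;
	return 0;
}
```

In the mediation path the driver would call pasid_map_lookup() on the value
the guest programmed and use the result when touching the hardware.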

Thanks
Kevin

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Jason Gunthorpe via iommu

On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:

> currently the type is aimed to differentiate three usages:
> 
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)

Creating a shared ios is something that should probably be a different
command.

> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

Format should be

FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

> Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

My impression was it only needed a way to select between the two
different cases as they are exclusive. I'd see this API as being a
hint and userspace should query the exact ranges to learn what was
actually created.

> > device-specific escape if more specific customization is needed and is
> > needed to specify user space page tables anyhow.
> 
> and I didn't understand the 2nd link. How does user-managed page
> table jump into this range claim problem? I'm getting confused...

PPC could also model it using a FORMAT_KERNEL_PPC_X, FORMAT_KERNEL_PPC_Y
though it is less nice..

> > Yes, ioas_id should always be the xarray index.
> > 
> > PASID needs to be called out as PASID or as a generic "hw description"
> > blob.
> 
> ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
and it MUST be exposed in that format to be programmed into the PCI
device itself.

All of this should be able to support a userspace, like DPDK, creating
a PASID on its own without any special VFIO drivers.

- Open iommufd
- Attach the vfio device FD
- Request a PASID device id
- Create an ios against the pasid device id
- Query the ios for the PCI PASID #
- Program the HW to issue TLPs with the PASID

> and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> kernel. Do we want to clear this confusion? Or possibly it's fine because
> ioas_id is never used outside of iommufd and iommufd doesn't directly
> call ioasid_alloc() from ioasid.c?

As long as it is ioas_id and ioasid it is probably fine..

> > kvm's API to program the vPASID translation table should probably take
> > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > information using an in-kernel API. Userspace shouldn't have to
> > shuttle it around.
> 
> the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> when kvm calls iommufd with above tuple, vPASID->pPASID is
> returned to kvm. So we still need a generic blob to represent
> vPASID in the uAPI.

I think you have to be clear about what the value is being used
for. Is it an IOMMU page table handle or is it a PCI PASID value?

AFAICT I think it is the former in the Intel scheme as the "vPASID" is
really about presenting a consistent IOMMU handle to the guest across
migration, it is not the value that shows up on the PCI bus.

Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-23 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 10:09 PM
> 
> On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> >
> > is the hint single-range or could be multiple-ranges?
> 
> David explained it here:
> 
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> 
> qemu needs to be able to choose if it gets the 32 bit range or 64
> bit range.
> 
> So a 'range hint' will do the job
> 
> David also suggested this:
> 
> https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> 
> So I like this better:
> 
> struct iommu_ioasid_alloc {
>   __u32   argsz;
>
>   __u32   flags;
> #define IOMMU_IOASID_ENFORCE_SNOOP  (1 << 0)
> #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
>
>   __aligned_u64 max_iova_hint;
>   __aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA
>
>   // For creating nested page tables
>   __u32 parent_ios_id;
>   __u32 format;
> #define IOMMU_FORMAT_KERNEL 0
> #define IOMMU_FORMAT_PPC_XXX 2
> #define IOMMU_FORMAT_[..]
>   __u32 format_flags; // Layout depends on format above
>
>   __aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
> };
> 
> Again 'type' as an overall API indicator should not exist, feature
> flags need to have clear narrow meanings.

currently the type is aimed to differentiate three usages:

- kernel-managed I/O page table
- user-managed I/O page table
- shared I/O page table (e.g. with mm, or ept)

we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
indicator? Their difference is not about format.

> 
> This does both of David's suggestions at once. If qemu wants the 1G
> limited region it could specify max_iova_hint = 1G, if it wants the
> extend 64bit region with the hole it can give either the high base or
> a large max_iova_hint. format/format_flags allows a further

Dave's links didn't answer one puzzle for me. Does PPC need accurate
range information, or is it OK with a large range including holes (then let
the kernel figure out where the holes are located)?

> device-specific escape if more specific customization is needed and is
> needed to specify user space page tables anyhow.

and I didn't understand the 2nd link. How does user-managed page
table jump into this range claim problem? I'm getting confused...

> 
> > > ioas works well here I think. Use ioas_id to refer to the xarray
> > > index.
> >
> > What about when introducing pasid to this uAPI? Then use ioas_id
> > for the xarray index
> 
> Yes, ioas_id should always be the xarray index.
> 
> PASID needs to be called out as PASID or as a generic "hw description"
> blob.

ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
kernel. Do we want to clear this confusion? Or possibly it's fine because
ioas_id is never used outside of iommufd and iommufd doesn't directly
call ioasid_alloc() from ioasid.c?

> 
> kvm's API to program the vPASID translation table should probably take
> in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> information using an in-kernel API. Userspace shouldn't have to
> shuttle it around.

the vPASID info is carried

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Liu, Yi L

> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 9:32 PM
> 
> On Wed, Sep 22, 2021 at 12:51:38PM +, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > [...]
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > > index 641f199f2d41..4839f128b24a 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -24,6 +24,7 @@
> > > >  struct iommufd_ctx {
> > > > refcount_t refs;
> > > > struct mutex lock;
> > > > +   struct xarray ioasid_xa; /* xarray of ioasids */
> > > > struct xarray device_xa; /* xarray of bound devices */
> > > >  };
> > > >
> > > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > > u64 dev_cookie;
> > > >  };
> > > >
> > > > +/* Represent an I/O address space */
> > > > +struct iommufd_ioas {
> > > > +   int ioasid;
> > >
> > > xarray id's should consistently be u32s everywhere.
> >
> > sure. just one more check, this id is supposed to be returned to
> > userspace as the return value of ioctl(IOASID_ALLOC). That's why
> > I chose to use "int" as its prototype to make it aligned with the
> > return type of ioctl(). Based on this, do you think it's still better
> > to use "u32" here?
> 
> I suggest not using the return code from ioctl to exchange data.. The
> rest of the uAPI uses an in/out struct, everything should do
> that consistently.

got it.

Thanks,
Yi Liu

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Jason Gunthorpe via iommu

On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> 
> is the hint single-range or could be multiple-ranges?

David explained it here:

https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

qemu needs to be able to choose if it gets the 32 bit range or 64
bit range.

So a 'range hint' will do the job

David also suggested this:

https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/

So I like this better:

struct iommu_ioasid_alloc {
	__u32	argsz;

	__u32	flags;
#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1 << 1)

	__aligned_u64 max_iova_hint;
	__aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA

	// For creating nested page tables
	__u32 parent_ios_id;
	__u32 format;
#define IOMMU_FORMAT_KERNEL 0
#define IOMMU_FORMAT_PPC_XXX 2
#define IOMMU_FORMAT_[..]
	__u32 format_flags; // Layout depends on format above

	__aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
};

Again 'type' as an overall API indicator should not exist, feature
flags need to have clear narrow meanings.

This does both of David's suggestions at once. If qemu wants the 1G
limited region it could specify max_iova_hint = 1G; if it wants the
extended 64-bit region with the hole it can give either the high base or
a large max_iova_hint. format/format_flags allows a further
device-specific escape if more specific customization is needed and is
needed to specify user space page tables anyhow.
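
As a rough illustration of the two cases, here is a userspace-side sketch of
filling in the request. The struct mirrors the proposal above with <stdint.h>
types standing in for __u32/__aligned_u64; the helper names ioas_req_init()
and ioas_req_with_base() and any concrete addresses passed to them are
hypothetical. No ioctl is issued, since none is defined at this point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IOMMU_IOASID_ENFORCE_SNOOP	(1u << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1u << 1)
#define IOMMU_FORMAT_KERNEL		0

/* Userspace mirror of the proposed struct (assumption, not a real header). */
struct iommu_ioasid_alloc {
	uint32_t argsz;
	uint32_t flags;
	uint64_t max_iova_hint;
	uint64_t base_iova_hint;	/* used only with HINT_BASE_IOVA */
	uint32_t parent_ios_id;
	uint32_t format;
	uint32_t format_flags;
	uint64_t user_page_directory;	/* used if parent_ios_id != 0 */
};

/* Kernel-managed table limited by max_iova only, e.g. the 1G region. */
static struct iommu_ioasid_alloc ioas_req_init(uint64_t max_iova)
{
	struct iommu_ioasid_alloc req;

	memset(&req, 0, sizeof(req));
	req.argsz = sizeof(req);
	req.flags = IOMMU_IOASID_ENFORCE_SNOOP;
	req.format = IOMMU_FORMAT_KERNEL;
	req.max_iova_hint = max_iova;
	return req;
}

/* Same, but also hinting a high base for the extended 64-bit region. */
static struct iommu_ioasid_alloc ioas_req_with_base(uint64_t base_iova,
						    uint64_t max_iova)
{
	struct iommu_ioasid_alloc req = ioas_req_init(max_iova);

	assert(base_iova <= max_iova);
	req.flags |= IOMMU_IOASID_HINT_BASE_IOVA;
	req.base_iova_hint = base_iova;
	return req;
}
```

Since these are hints, userspace would still query the resulting geometry
afterwards rather than assume the request was honoured exactly.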

> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
> 
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index

Yes, ioas_id should always be the xarray index.

PASID needs to be called out as PASID or as a generic "hw description"
blob.

kvm's API to program the vPASID translation table should probably take
in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
information using an in-kernel API. Userspace shouldn't have to
shuttle it around.

I'm starting to feel like the struct approach for describing this uAPI
might not scale well, but let's see..

Jason

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Jean-Philippe Brucker

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.

Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
what it's used for or why it's mandatory. But for PPC it sounds like it
should be an address range instead of an upper limit?
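
To make the comparison concrete, here is a small sketch of how an addr_width
maps onto a range. The iova_window type and the helper name are hypothetical;
the point is only that an upper limit is the special case of a range whose
base is zero, whereas a PPC-style window may need a non-zero base.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inclusive IOVA window [base, last]. */
struct iova_window {
	uint64_t base;
	uint64_t last;
};

/* Express addr_width as the window [0, 2^addr_width - 1]. */
static struct iova_window window_from_addr_width(unsigned int addr_width)
{
	struct iova_window w = { 0, UINT64_MAX };

	assert(addr_width >= 1 && addr_width <= 64);
	/* Shifting a 64-bit value by 64 is undefined in C, so handle it apart. */
	if (addr_width < 64)
		w.last = (UINT64_C(1) << addr_width) - 1;
	return w;
}
```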

Thanks,
Jean

>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.
> 
> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Jason Gunthorpe via iommu

On Wed, Sep 22, 2021 at 12:51:38PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> [...]
> > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > index 641f199f2d41..4839f128b24a 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -24,6 +24,7 @@
> > >  struct iommufd_ctx {
> > >   refcount_t refs;
> > >   struct mutex lock;
> > > + struct xarray ioasid_xa; /* xarray of ioasids */
> > >   struct xarray device_xa; /* xarray of bound devices */
> > >  };
> > >
> > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > >   u64 dev_cookie;
> > >  };
> > >
> > > +/* Represent an I/O address space */
> > > +struct iommufd_ioas {
> > > + int ioasid;
> > 
> > xarray id's should consistently be u32s everywhere.
> 
> sure. just one more check, this id is supposed to be returned to
> userspace as the return value of ioctl(IOASID_ALLOC). That's why
> I chose to use "int" as its prototype to make it aligned with the
> return type of ioctl(). Based on this, do you think it's still better
> to use "u32" here?

I suggest not using the return code from ioctl to exchange data.. The
rest of the uAPI uses an in/out struct, everything should do
that consistently.
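
Roughly what I mean, as a sketch only: the struct layout, the field names and
the mock handler below are made up for illustration and are not the proposed
uAPI. The handler returns nothing but 0 or -errno, and the new ID travels back
in the struct.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in/out argument block for an allocation call. */
struct ioas_alloc_args {
	uint32_t argsz;		/* in: sizeof(struct) as seen by the caller */
	uint32_t flags;		/* in: must be 0 in this sketch */
	uint32_t out_ioas_id;	/* out: newly allocated ID */
};

static uint32_t mock_next_id = 1;

/* Stand-in for the kernel-side handler: status only in the return value. */
static int mock_ioas_alloc(struct ioas_alloc_args *args)
{
	assert(args != NULL);
	if (args->argsz < sizeof(*args) || args->flags != 0)
		return -EINVAL;
	args->out_ioas_id = mock_next_id++;
	return 0;
}
```

With this shape the ID can be a full u32, matching the xarray index type, and
the return value never has to be disambiguated from an error code.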

Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-22 Thread Liu, Yi L

> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> >  struct iommufd_ctx {
> > refcount_t refs;
> > struct mutex lock;
> > +   struct xarray ioasid_xa; /* xarray of ioasids */
> > struct xarray device_xa; /* xarray of bound devices */
> >  };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> > u64 dev_cookie;
> >  };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > +   int ioasid;
> 
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason

RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-21 Thread Tian, Kevin

> From: Jason Gunthorpe 
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the

is the hint single-range or could be multiple-ranges?

> actual geometry including any holes via a query.

I'd like to see a detailed flow from David on how the uAPI works today with
the existing spapr driver and what exact changes he'd like to make to this
proposed interface. The above info is still insufficient for us to think about
the right solution.

> 
> > - Currently ioasid term has already been used in the kernel (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> >   ID). We need find a way to resolve the naming conflict between the hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!
> 
> ioas works well here I think. Use ioas_id to refer to the xarray
> index.

What about when pasid is introduced to this uAPI? Would we then use ioas_id
for the xarray index and ioasid to represent pasid/ssid? At that point
the software handle and hardware ID are mixed together, so we need
clear terminology to differentiate them.
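
One way to keep the two kinds of ID apart, sketched here purely as an illustration, is to give the software handle and the hardware ID distinct C types so they cannot be swapped by accident. The names ioas_id_t, hw_pasid_t, struct attach_target and make_attach_target() are hypothetical and do not appear in the posted patch; the only external fact relied on is that a PCI PASID is 20 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these are in the patch. */

/* Software handle: index of an I/O address space in the per-fd xarray. */
typedef struct { uint32_t val; } ioas_id_t;

/* Hardware ID carried on the wire; a PCI PASID is 20 bits wide. */
typedef struct { uint32_t val; } hw_pasid_t;

#define HW_PASID_BITS	20u
#define HW_PASID_MAX	((1u << HW_PASID_BITS) - 1u)

static bool hw_pasid_valid(hw_pasid_t pasid)
{
	return pasid.val <= HW_PASID_MAX;
}

/* A (handle, hardware ID) pair; mixing up the two fields fails to compile. */
struct attach_target {
	ioas_id_t ioas;
	hw_pasid_t pasid;
};

/* Returns 0 and fills *out on success, -1 if the PASID is out of range. */
static int make_attach_target(ioas_id_t ioas, hw_pasid_t pasid,
			      struct attach_target *out)
{
	assert(out != NULL);
	if (!hw_pasid_valid(pasid))
		return -1;
	out->ioas = ioas;
	out->pasid = pasid;
	return 0;
}
```

Because both wrappers are separate struct types, passing a PASID where an ioas_id is expected is rejected by the compiler rather than silently accepted as another u32.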


Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-21 Thread Jason Gunthorpe via iommu

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on the PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can also be represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

I think the request was to include a start/end IO address hint when
creating the ioas. When the kernel creates it then it can return the
actual geometry including any holes via a query.
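
As a rough sketch of what "geometry including any holes" could look like, the code below derives the range implied by an address width, then carves the usable ranges out of a start/end hint given a sorted list of reserved holes. struct iova_range, range_from_width() and usable_ranges() are hypothetical helpers written for this illustration; they are not part of the proposed uAPI, and the shape of the real query is still open in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not part of the posted uAPI. */
struct iova_range {
	uint64_t start;	/* inclusive */
	uint64_t last;	/* inclusive */
};

/* Range implied by an address width: [0, 2^addr_width - 1]. */
static struct iova_range range_from_width(unsigned int addr_width)
{
	struct iova_range r = { 0, UINT64_MAX };

	assert(addr_width >= 1 && addr_width <= 64);
	if (addr_width < 64)
		r.last = (UINT64_C(1) << addr_width) - 1;
	return r;
}

/*
 * Carve the usable ranges out of a [start, last] hint.
 * holes[] must be sorted by start and must not overlap.
 * Returns the number of ranges written to out[] (at most max_out).
 */
static size_t usable_ranges(struct iova_range hint,
			    const struct iova_range *holes, size_t nr_holes,
			    struct iova_range *out, size_t max_out)
{
	uint64_t cur = hint.start;
	size_t n = 0;
	int done = 0;

	for (size_t i = 0; i < nr_holes && !done; i++) {
		if (holes[i].last < cur)
			continue;	/* hole lies entirely below the cursor */
		if (holes[i].start > hint.last)
			break;		/* hole lies entirely above the hint */
		if (holes[i].start > cur && n < max_out)
			out[n++] = (struct iova_range){ cur, holes[i].start - 1 };
		if (holes[i].last >= hint.last)
			done = 1;	/* hole covers the rest of the hint */
		else
			cur = holes[i].last + 1;
	}
	if (!done && cur <= hint.last && n < max_out)
		out[n++] = (struct iova_range){ cur, hint.last };
	return n;
}
```

For example, a 32-bit hint with one reserved window at 0xFEE00000-0xFEEFFFFF (chosen only as an illustrative hole) yields two usable ranges, one below and one above the window.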

> - The ioasid term is already used in the kernel (drivers/iommu/ioasid.c)
>   to represent the hardware I/O address space ID on the wire. It covers
>   both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream ID).
>   We need to find a way to resolve the naming conflict between the hardware
>   ID and the software handle. One option is to rename the existing ioasid
>   to pasid or ssid, given their full names still sound generic. Appreciate
>   more thoughts on this open!

ioas works well here I think. Use ioas_id to refer to the xarray
index.

> Signed-off-by: Liu Yi L 
>  drivers/iommu/iommufd/iommufd.c | 120 
>  include/linux/iommufd.h |   3 +
>  include/uapi/linux/iommu.h  |  54 ++
>  3 files changed, 177 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
>  struct iommufd_ctx {
>   refcount_t refs;
>   struct mutex lock;
> + struct xarray ioasid_xa; /* xarray of ioasids */
>   struct xarray device_xa; /* xarray of bound devices */
>  };
>  
> @@ -42,6 +43,16 @@ struct iommufd_device {
>   u64 dev_cookie;
>  };
>  
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> + int ioasid;

xarray IDs should consistently be u32s everywhere.
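
To illustrate the point about ID width, here is a small userspace stand-in, not kernel code. The kernel's xa_alloc() returns the allocated index through a u32 pointer, so storing it in an int forces a signed/unsigned conversion at every use. The toy allocator below (toy_table, toy_alloc and toy_ioas are all hypothetical names) keeps the ID as uint32_t from allocation through to the owning structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8u	/* hypothetical table size for the illustration */

struct toy_table {
	void *slot[TOY_SLOTS];
};

/*
 * Userspace stand-in for an ID allocator: store entry in the first free
 * slot within [min, max] and report its index through a uint32_t pointer.
 * Returns 0 on success, -1 if no slot in the range is free.
 */
static int toy_alloc(struct toy_table *t, uint32_t *id, void *entry,
		     uint32_t min, uint32_t max)
{
	assert(entry != NULL);
	for (uint32_t i = min; i <= max && i < TOY_SLOTS; i++) {
		if (!t->slot[i]) {
			t->slot[i] = entry;
			*id = i;
			return 0;
		}
	}
	return -1;
}

/* The object stores the ID with the same width the allocator uses. */
struct toy_ioas {
	uint32_t id;
};
```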

Many of the same prior comments repeated here

Jason

[RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-18 Thread Liu Yi L

This patch adds IOASID allocation/free interface per iommufd. When
allocating an IOASID, userspace is expected to specify the type and
format information for the target I/O page table.

This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
implying a kernel-managed I/O page table with vfio type1v2 mapping
semantics. For this type the user should specify the addr_width of
the I/O address space and whether the I/O page table is created in
an iommu enforce_snoop format. enforce_snoop must be true at this point,
as the false setting requires additional contract with KVM on handling
WBINVD emulation, which can be added later.
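
The constraints in the paragraph above can be sketched as a standalone check. This is a userspace illustration only: struct demo_ioasid_alloc assumes a field order of argsz, flags, type, addr_width, and the DEMO_* values are placeholders, since the real uAPI header is not reproduced here. The checks themselves mirror the ones in iommufd_ioasid_alloc() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout and values; the real uAPI header is not reproduced here. */
struct demo_ioasid_alloc {
	uint32_t argsz;
	uint32_t flags;
	uint32_t type;
	uint32_t addr_width;
};

#define DEMO_ENFORCE_SNOOP		(1u << 0)	/* placeholder value */
#define DEMO_TYPE_KERNEL_TYPE1V2	1u		/* placeholder value */

/* Returns 0 if the request is acceptable for this RFC, -EINVAL otherwise. */
static int demo_check_alloc(const struct demo_ioasid_alloc *req)
{
	/* Smallest structure size that still contains addr_width. */
	size_t minsz = offsetof(struct demo_ioasid_alloc, addr_width) +
		       sizeof(req->addr_width);

	assert(req != NULL);
	if (req->argsz < minsz || !req->addr_width)
		return -EINVAL;
	/* enforce_snoop must be set; no other flag is accepted yet. */
	if (req->flags != DEMO_ENFORCE_SNOOP)
		return -EINVAL;
	/* Only the kernel-managed type1v2 page table is supported. */
	if (req->type != DEMO_TYPE_KERNEL_TYPE1V2)
		return -EINVAL;
	return 0;
}
```

A request with argsz set to the structure size, the snoop flag set, the type1v2 type and a non-zero addr_width passes; clearing the flag or zeroing addr_width is rejected with -EINVAL.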

Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
for what formats can be specified when allocating an IOASID.

Open:
- Devices on the PPC platform currently use a different iommu driver in vfio.
  Per previous discussion they can also use vfio type1v2 as long as there
  is a way to claim a specific iova range from a system-wide address space.
  This requirement doesn't sound PPC specific, as addr_width for pci devices
  can also be represented by a range [0, 2^addr_width-1]. This RFC hasn't
  adopted this design yet. We hope to have formal alignment in v1 discussion
  and then decide how to incorporate it in v2.

- The ioasid term is already used in the kernel (drivers/iommu/ioasid.c)
  to represent the hardware I/O address space ID on the wire. It covers
  both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream ID).
  We need to find a way to resolve the naming conflict between the hardware
  ID and the software handle. One option is to rename the existing ioasid
  to pasid or ssid, given their full names still sound generic. Appreciate
  more thoughts on this open!

Signed-off-by: Liu Yi L 
---
 drivers/iommu/iommufd/iommufd.c | 120 
 include/linux/iommufd.h |   3 +
 include/uapi/linux/iommu.h  |  54 ++
 3 files changed, 177 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 641f199f2d41..4839f128b24a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -24,6 +24,7 @@
 struct iommufd_ctx {
 	refcount_t refs;
 	struct mutex lock;
+	struct xarray ioasid_xa; /* xarray of ioasids */
 	struct xarray device_xa; /* xarray of bound devices */
 };
 
@@ -42,6 +43,16 @@ struct iommufd_device {
 	u64 dev_cookie;
 };
 
+/* Represent an I/O address space */
+struct iommufd_ioas {
+	int ioasid;
+	u32 type;
+	u32 addr_width;
+	bool enforce_snoop;
+	struct iommufd_ctx *ictx;
+	refcount_t refs;
+};
+
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
 {
 	struct iommufd_ctx *ictx;
@@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
 
 	refcount_set(&ictx->refs, 1);
 	mutex_init(&ictx->lock);
+	xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
 	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
 	filep->private_data = ictx;
 
@@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 	if (!refcount_dec_and_test(&ictx->refs))
 		return;
 
+	WARN_ON(!xa_empty(&ictx->ioasid_xa));
 	WARN_ON(!xa_empty(&ictx->device_xa));
 	kfree(ictx);
 }
 
+/* Caller should hold ictx->lock */
+static void ioas_put_locked(struct iommufd_ioas *ioas)
+{
+	struct iommufd_ctx *ictx = ioas->ictx;
+	int ioasid = ioas->ioasid;
+
+	if (!refcount_dec_and_test(&ioas->refs))
+		return;
+
+	xa_erase(&ictx->ioasid_xa, ioasid);
+	iommufd_ctx_put(ictx);
+	kfree(ioas);
+}
+
+static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
+{
+	struct iommu_ioasid_alloc req;
+	struct iommufd_ioas *ioas;
+	unsigned long minsz;
+	int ioasid, ret;
+
+	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
+
+	if (copy_from_user(&req, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (req.argsz < minsz || !req.addr_width ||
+	    req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
+	    req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
+		return -EINVAL;
+
+	ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
+	if (!ioas)
+		return -ENOMEM;
+
+	mutex_lock(&ictx->lock);
+	ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
+		       XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
+		       GFP_KERNEL);
+	mutex_unlock(&ictx->lock);
+	if (ret) {
+		pr_err_ratelimited("Failed to alloc ioasid\n");
+		kfree(ioas);
+		return ret;
+	}
+
+	ioas->ioasid = ioasid;
+
+	/* only supports kernel managed I/O page table so far */
+	ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
+
+	ioas->addr_width = req.ad
