RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: David Gibson > Sent: Monday, October 25, 2021 1:05 PM > > > > > For above cases a [base, max] hint can be provided by the user per > > > > Jason's recommendation. > > > > > > Provided at which stage? > > > > IOMMU_IOASID_ALLOC > > Ok. I have mixed thoughts on this. Doing this at ALLOC time was my > first instict as well. However with Jason's suggestion that any of a > number of things could disambiguate multiple IOAS attached to a > device, I wonder if it makes more sense for consistency to put base > address at attach time, as with PASID. In that case the base address provided at attach time is used as an address space ID similar to PASID, which imho is orthogonal to the generic [base, size] info for IOAS itself. The 2nd base sort of becomes an offset on top of the first base in ppc case. > > > > regarding live migration with vfio devices, it's still in early stage. there > > are tons of compatibility check opens to be addressed before it can > > be widely deployed. this might just add another annoying open to that > > long list... > > So, yes, live migration with VFIO is limited, unfortunately this > still affects us even if we don't (currently) have VFIO devices. The > problem arises from the combination of two limitations: > > 1) Live migration means that we can't dynamically select guest visible > IOVA parameters at qemu start up time. We need to get consistent > guest visible behaviour for a given set of qemu options, so that we > can migrate between them. > > 2) Device hotplug means that we don't know if a PCI domain will have > VFIO devices on it when we start qemu. So, we don't know if host > limitations on IOVA ranges will affect the guest or not. > > Together these mean that the best we can do is to define a *fixed* > (per machine type) configuration based on qemu options only. That is, > defined by the guest platform we're trying to present, only, never > host capabilities. 
> We can then see if that configuration is possible
> on the host and pass or fail. It's never safe to go the other
> direction and take host capabilities and present those to the guest.
>

That is just one userspace policy. We don't want to design a uAPI
just for a specific userspace implementation. In concept the
userspace could:

1) use DMA-API-like map/unmap, i.e. let the IOVA address space be
   managed by the kernel;

   * suitable for simple applications, e.g. dpdk.

2) manage the IOVA address space with a *fixed* layout:

   * fail device passthrough at MAP_DMA if a conflict is detected
     between the mapped range and device-specific IOVA holes
   * suitable for VMs when live migration is a major concern
   * potential problem with vIOMMU, since the guest is unaware of
     host constraints and thus undefined behavior may occur if guest
     IOVA addresses happen to overlap with host IOVA holes
   * ppc is special as you need to claim guest IOVA ranges in the
     host. But that is not the case for other emulated IOMMUs.

3) manage the IOVA address space with host constraints:

   * create the IOVA layout by combining qemu options and the IOVA
     holes of all boot-time passthrough devices
   * reject a hotplugged device if it has IOVA holes that conflict
     with the initial IOVA layout
   * suitable for vIOMMU, since host constraints can be further
     reported to the guest
   * suitable for VMs w/o a live migration requirement, e.g. in many
     client virtualization scenarios
   * suboptimal for VM live migration, given the compatibility
     limitation

Overall the proposed uAPI will provide:

1) a simple DMA-API-like mapping protocol for a kernel-managed IOVA
   address space;

2) a vfio-like mapping protocol for a user-managed IOVA address space:
   a) checks for IOVA conflicts in the MAP_DMA ioctl;
   b) allows the user to query available IOVA ranges.

Then it's totally user policy on how it wants to utilize those ioctls.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
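A minimal sketch of the conflict checks behind policies 2) and 3)
above, assuming IOVA ranges are modelled as inclusive (first, last)
pairs. The function names and the example hole are illustrative
assumptions, not part of the proposed uAPI.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first, last) IOVA


def permitted_ranges(addr_width: int, holes: List[Range]) -> List[Range]:
    """Subtract reserved holes from [0, 2^addr_width - 1]."""
    limit = (1 << addr_width) - 1
    out: List[Range] = []
    cursor = 0
    for first, last in sorted(holes):
        if first > limit:
            break
        if first > cursor:
            out.append((cursor, first - 1))
        cursor = max(cursor, last + 1)
    if cursor <= limit:
        out.append((cursor, limit))
    return out


def map_allowed(permitted: List[Range], iova: int, length: int) -> bool:
    """Policy 2: a MAP_DMA-style request must fit inside one permitted range."""
    last = iova + length - 1
    return any(first <= iova and last <= end for first, end in permitted)


def hotplug_conflicts(guest_layout: List[Range], device_holes: List[Range]) -> bool:
    """Policy 3: reject a hotplugged device whose holes overlap the layout."""
    return any(gf <= hl and hf <= gl
               for gf, gl in guest_layout
               for hf, hl in device_holes)


# Illustrative hole only; a real host reports its own reserved ranges.
hole = (0xFEE00000, 0xFEEFFFFF)
permitted = permitted_ranges(39, [hole])
print(map_allowed(permitted, 0x1000, 0x1000))      # mapping below the hole
print(map_allowed(permitted, 0xFEDFF000, 0x2000))  # mapping straddling the hole
```

Which of these checks is applied, and when, remains userspace policy;
the kernel side only has to refuse mappings outside the ranges it
reports.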
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 06:53:01AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Thursday, October 14, 2021 1:00 PM > > > > On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote: > > > > From: David Gibson > > > > Sent: Friday, October 1, 2021 2:11 PM > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > format information for the target I/O page table. > > > > > > > > > > This RFC supports only one type > > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > semantics. For this type the user should specify the addr_width of > > > > > the I/O address space and whether the I/O page table is created in > > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > > point, > > > > > as the false setting requires additional contract with KVM on handling > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next > > patch) > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > Open: > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > vfio. > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > there > > > > > is a way to claim a specific iova range from a system-wide address > > space. > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > devices > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > hasn't > > > > > adopted this design yet. We hope to have formal alignment in v1 > > > > discussion > > > > > and then decide how to incorporate it in v2. > > > > > > > > Ok, there are several things we need for ppc. 
None of which are > > > > inherently ppc specific and some of which will I think be useful for > > > > most platforms. So, starting from most general to most specific > > > > here's basically what's needed: > > > > > > > > 1. We need to represent the fact that the IOMMU can only translate > > > >*some* IOVAs, not a full 64-bit range. You have the addr_width > > > >already, but I'm entirely sure if the translatable range on ppc > > > >(or other platforms) is always a power-of-2 size. It usually will > > > >be, of course, but I'm not sure that's a hard requirement. So > > > >using a size/max rather than just a number of bits might be safer. > > > > > > > >I think basically every platform will need this. Most platforms > > > >don't actually implement full 64-bit translation in any case, but > > > >rather some smaller number of bits that fits their page table > > > >format. > > > > > > > > 2. The translatable range of IOVAs may not begin at 0. So we need to > > > >advertise to userspace what the base address is, as well as the > > > >size. POWER's main IOVA range begins at 2^59 (at least on the > > > >models I know about). > > > > > > > >I think a number of platforms are likely to want this, though I > > > >couldn't name them apart from POWER. Putting the translated IOVA > > > >window at some huge address is a pretty obvious approach to making > > > >an IOMMU which can translate a wide address range without colliding > > > >with any legacy PCI addresses down low (the IOMMU can check if this > > > >transaction is for it by just looking at some high bits in the > > > >address). > > > > > > > > 3. There might be multiple translatable ranges. So, on POWER the > > > >IOMMU can typically translate IOVAs from 0..2GiB, and also from > > > >2^59..2^59+. The two ranges have completely separate IO > > > >page tables, with (usually) different layouts. 
(The low range will > > > >nearly always be a single-level page table with 4kiB or 64kiB > > > >entries, the high one will be multiple levels depending on the size > > > >of the range and pagesize). > > > > > > > >This may be less common, but I suspect POWER won't be the only > > > >platform to do something like this. As above, using a high range > > > >is a pretty obvious approach, but clearly won't handle older > > > >devices which can't do 64-bit DMA. So adding a smaller range for > > > >those devices is again a pretty obvious solution. Any platform > > > >with an "IO hole" can be treated as having two ranges, one below > > > >the hole and one above it (although in that case they may well not > > > >have separate page tables > > > > > > 1-3 are common on all platforms with fixed reserved ranges. Current > > > vfio already reports permitted iova ranges to user via VFIO_IOMMU_ > > > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct > > > maps only in those ranges. iommufd can follo
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 18, 2021 at 02:50:54PM +1100, David Gibson wrote:

> Hrm... which makes me think... if we allow this for the common
> kernel-managed case, do we even need to have capacity in the high-level
> interface for reporting IO holes? If the kernel can choose a non-zero
> base, it could just choose on x86 to place its advertised window
> above the IO hole.

If the high level interface is like dma_map() then, no, it doesn't need
the ability to report holes. The kernel would find and return the IOVA
from dma_map, not accept it as an input.

Since dma_map is a well proven model I'm inclined to model the
simplified interface after it.

That said, if we have some ioctl 'query iova ranges' I would expect it
to work on an IOAS created by the simplified interface too.

Jason
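A rough sketch of the dma_map-like model, assuming a toy first-fit
allocator: the caller passes only a length and gets an IOVA back, so
it never needs to know about holes. The class, its methods and the
window size are hypothetical stand-ins, not a proposed interface.

```python
class KernelManagedIoas:
    """Toy model of an IOAS where the kernel side chooses every IOVA."""

    def __init__(self, windows, page_size=4096):
        # windows: page-aligned, inclusive (first, last) translatable ranges
        self.windows = sorted(windows)
        self._free = list(self.windows)
        self.page_size = page_size

    def dma_map(self, length):
        """Return an IOVA chosen by the allocator; the caller supplies none."""
        if length <= 0:
            raise ValueError("length must be positive")
        size = -(-length // self.page_size) * self.page_size  # round up
        for i, (first, last) in enumerate(self._free):
            if last - first + 1 >= size:
                if first + size <= last:
                    self._free[i] = (first + size, last)
                else:
                    del self._free[i]
                return first
        raise MemoryError("no IOVA space left")

    def query_iova_ranges(self):
        """The optional 'query iova ranges' call works on this IOAS too."""
        return list(self.windows)


# Base of 2^59 follows the POWER example in the thread; the 1 GiB
# window size is an arbitrary placeholder.
base = 1 << 59
ioas = KernelManagedIoas([(base, base + (1 << 30) - 1)])
iova = ioas.dma_map(100)
print(hex(iova))
```

Because the allocator owns placement, a non-zero base or a window
above an IO hole is invisible to a caller that only uses dma_map().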
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 12:06:10PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:
>
> > > If the HW can attach multiple non-overlapping IOAS's to the same
> > > device then the HW is routing to the correct IOAS by using the address
> > > bits. This is not much different from the prior discussion we had
> > > where we were thinking of the PASID as an 80 bit address
> >
> > Ah... that might be a workable approach. And it even helps me get my
> > head around multiple attachment which I was struggling with before.
> >
> > So, the rule would be that you can attach multiple IOASes to a device,
> > as long as none of them overlap. The non-overlapping could be because
> > each IOAS covers a disjoint address range, or it could be because
> > there's some attached information - such as a PASID - to disambiguate.
>
> Right exactly - it is very parallel to PASID
>
> And obviously HW support is required to have multiple page table
> pointers per RID - which sounds like PPC does (high/low pointer?)

Hardware support is required *in the IOMMU*. Nothing (beyond regular
64-bit DMA support) is required in the endpoint devices. That's not
true of PASID.

> > What remains a question is where the disambiguating information comes
> > from in each case: does it come from properties of the IOAS,
> > properties of the device, or from extra parameters supplied at attach
> > time. IIUC, the current draft suggests it always comes at attach time
> > for the PASID information. Obviously the more consistency we can have
> > here the better.
>
> From a generic view point I'd say all are fair game. It is up to the
> IOMMU driver to take the requested set of IOAS's, the "at attachment"
> information (like PASID) and decide what to do, or fail.

Ok, that's a model that makes sense to me.

> > I can also see an additional problem in implementation, once we start
> > looking at hot-adding devices to existing address spaces.
>
> I won't pretend to guess how to implement this :) Just from a modeling
> perspective it is something that works logically. If the kernel
> implementation is too hard then PPC should do one of the other ideas.
>
> Personally I'd probably try for a nice multi-domain attachment model
> like PASID and not try to create/destroy domains.

I don't really follow what you mean by that.

> As I said in my last email I think it is up to each IOMMU HW driver to
> make these decisions, the iommufd framework just provides a
> standardized API toward the attaching driver that the IOMMU HW must
> fit into.
>
> Jason

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 11:52:08AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote: > > > > My feeling is that qemu should be dealing with the host != target > > > case, not the kernel. > > > > > > The kernel's job should be to expose the IOMMU HW it has, with all > > > features accessible, to userspace. > > > > See... to me this is contrary to the point we agreed on above. > > I'm not thinking of these as exclusive ideas. > > The IOCTL interface in iommu can quite happily expose: > Create IOAS generically > Manipulate IOAS generically > Create IOAS with IOMMU driver specific attributes > HW specific Manipulate IOAS > > IOCTL commands all together. > > So long as everything is focused on a generic in-kernel IOAS object it > is fine to have multiple ways in the uAPI to create and manipulate the > objects. > > When I speak about a generic interface I mean "Create IOAS > generically" - ie a set of IOCTLs that work on most IOMMU HW and can > be relied upon by things like DPDK/etc to always work and be portable. > This is why I like "hints" to provide some limited widely applicable > micro-optimization. > > When I said "expose the IOMMU HW it has with all features accessible" > I mean also providing "Create IOAS with IOMMU driver specific > attributes". > > These other IOCTLs would allow the IOMMU driver to expose every > configuration knob its HW has, in a natural HW centric language. > There is no pretense of genericness here, no crazy foo=A, foo=B hidden > device specific interface. > > Think of it as a high level/low level interface to the same thing. Ok, I see what you mean. > > Those are certainly wrong, but they came about explicitly by *not* > > being generic rather than by being too generic. So I'm really > > confused aso to what you're arguing for / against. > > IMHO it is not having a PPC specific interface that was the problem, > it was making the PPC specific interface exclusive to the type 1 > interface. 
If type 1 continued to work on PPC then DPDK/etc would > never learned PPC specific code. Ok, but the reason this happened is that the initial version of type 1 *could not* be used on PPC. The original Type 1 implicitly promised a "large" IOVA range beginning at IOVA 0 without any real way of specifying or discovering how large that range was. Since ppc could typically only give a 2GiB range at IOVA 0, that wasn't usable. That's why I say the problem was not making type1 generic enough. I believe the current version of Type1 has addressed this - at least enough to be usable in common cases. But by this time the ppc backend is already out there, so no-one's had the capacity to go back and make ppc work with Type1. > For iommufd with the high/low interface each IOMMU HW should ask basic > questions: > > - What should the generic high level interface do on this HW? >For instance what should 'Create IOAS generically' do for PPC? >It should not fail, it should create *something* >What is the best thing for DPDK? >I guess the 64 bit window is most broadly useful. Right, which means the kernel must (at least in the common case) have the capcity to choose and report a non-zero base-IOVA. Hrm... which makes me think... if we allow this for the common kernel-managed case, do we even need to have capcity in the high-level interface for reporting IO holes? If the kernel can choose a non-zero base, it could just choose on x86 to place it's advertised window above the IO hole. > - How to accurately describe the HW in terms of standard IOAS objects >and where to put HW specific structs to support this. > >This is where PPC would decide how best to expose a control over >its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver >wants, so long as it fits into the kernel IOAS model facing the >connected device driver. > > QEMU would have IOMMU userspace drivers. One would be the "generic > driver" using only the high level generic interface. 
> It should work as
> best it can on all HW devices. This is the fallback path you talked
> of.
>
> QEMU would also have HW specific IOMMU userspace drivers that know how
> to operate the exact HW. eg these drivers would know how to use
> userspace page tables, how to form IOPTEs and how to access the
> special features.
>
> This is how QEMU could use an optimized path with nested page tables,
> for instance.

The concept makes sense in general. The devil's in the details, as
usual.

-- 
David Gibson
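A small sketch of the point about base addresses made earlier in this
message, assuming translatable windows are described as (base, size)
pairs: the original Type 1 assumption only looks at IOVA 0, whereas a
generic create that may report a non-zero base can pick the widest
window. All names and the high window's size are hypothetical.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    base: int
    size: int


def legacy_type1_usable(windows: List[Window], needed: int) -> bool:
    """Original Type 1 style assumption: a large enough range starting at IOVA 0."""
    return any(w.base == 0 and w.size >= needed for w in windows)


def pick_generic_window(windows: List[Window]) -> Window:
    """Generic create: choose the widest window and report its base and size."""
    return max(windows, key=lambda w: w.size)


# 2 GiB at IOVA 0 and a window at 2^59 follow the thread; the high
# window's size is an arbitrary placeholder.
ppc_like = [Window(0, 2 << 30), Window(1 << 59, 1 << 40)]
print(legacy_type1_usable(ppc_like, 4 << 30))
print(hex(pick_generic_window(ppc_like).base))
```

With a reported base, a consumer that needs more than the low window
can still be served without any platform-specific code.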
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:

> > If the HW can attach multiple non-overlapping IOAS's to the same
> > device then the HW is routing to the correct IOAS by using the address
> > bits. This is not much different from the prior discussion we had
> > where we were thinking of the PASID as an 80 bit address
>
> Ah... that might be a workable approach. And it even helps me get my
> head around multiple attachment which I was struggling with before.
>
> So, the rule would be that you can attach multiple IOASes to a device,
> as long as none of them overlap. The non-overlapping could be because
> each IOAS covers a disjoint address range, or it could be because
> there's some attached information - such as a PASID - to disambiguate.

Right exactly - it is very parallel to PASID

And obviously HW support is required to have multiple page table
pointers per RID - which sounds like PPC does (high/low pointer?)

> What remains a question is where the disambiguating information comes
> from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> time. IIUC, the current draft suggests it always comes at attach time
> for the PASID information. Obviously the more consistency we can have
> here the better.

From a generic view point I'd say all are fair game. It is up to the
IOMMU driver to take the requested set of IOAS's, the "at attachment"
information (like PASID) and decide what to do, or fail.

> I can also see an additional problem in implementation, once we start
> looking at hot-adding devices to existing address spaces.

I won't pretend to guess how to implement this :) Just from a modeling
perspective it is something that works logically. If the kernel
implementation is too hard then PPC should do one of the other ideas.

Personally I'd probably try for a nice multi-domain attachment model
like PASID and not try to create/destroy domains.

As I said in my last email I think it is up to each IOMMU HW driver to
make these decisions, the iommufd framework just provides a
standardized API toward the attaching driver that the IOMMU HW must
fit into.

Jason
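A minimal sketch of the attach rule discussed in this message, assuming
each attachment is described by an address range plus an optional
PASID, and that two attachments with different PASID values (including
"no PASID") never collide. The types and names are hypothetical; a real
IOMMU driver would apply its own hardware rules and may still fail the
request.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attachment:
    base: int
    size: int
    pasid: Optional[int] = None  # None models plain RID-only traffic


def overlaps(a: Attachment, b: Attachment) -> bool:
    """Two attachments collide only if nothing disambiguates them."""
    if a.pasid != b.pasid:
        return False  # disambiguated by attached information (PASID)
    return a.base < b.base + b.size and b.base < a.base + a.size


def can_attach(existing: List[Attachment], new: Attachment) -> bool:
    """Allow multiple IOASes on one device as long as none of them overlap."""
    return not any(overlaps(new, e) for e in existing)


low = Attachment(0, 2 << 30)          # low window, as in the POWER example
high = Attachment(1 << 59, 1 << 30)   # high window; size is a placeholder
print(can_attach([low], high))                          # disjoint by address
print(can_attach([low], Attachment(1 << 30, 1 << 30)))  # collides with low
```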
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:

> > My feeling is that qemu should be dealing with the host != target
> > case, not the kernel.
> >
> > The kernel's job should be to expose the IOMMU HW it has, with all
> > features accessible, to userspace.
>
> See... to me this is contrary to the point we agreed on above.

I'm not thinking of these as exclusive ideas.

The IOCTL interface in iommu can quite happily expose:

  Create IOAS generically
  Manipulate IOAS generically
  Create IOAS with IOMMU driver specific attributes
  HW specific Manipulate IOAS

IOCTL commands all together.

So long as everything is focused on a generic in-kernel IOAS object it
is fine to have multiple ways in the uAPI to create and manipulate the
objects.

When I speak about a generic interface I mean "Create IOAS
generically" - ie a set of IOCTLs that work on most IOMMU HW and can
be relied upon by things like DPDK/etc to always work and be portable.
This is why I like "hints" to provide some limited widely applicable
micro-optimization.

When I said "expose the IOMMU HW it has with all features accessible"
I mean also providing "Create IOAS with IOMMU driver specific
attributes".

These other IOCTLs would allow the IOMMU driver to expose every
configuration knob its HW has, in a natural HW centric language.
There is no pretense of genericness here, no crazy foo=A, foo=B hidden
device specific interface.

Think of it as a high level/low level interface to the same thing.

> Those are certainly wrong, but they came about explicitly by *not*
> being generic rather than by being too generic. So I'm really
> confused as to what you're arguing for / against.

IMHO it is not having a PPC specific interface that was the problem,
it was making the PPC specific interface exclusive to the type 1
interface. If type 1 continued to work on PPC then DPDK/etc would
never have learned PPC specific code.

For iommufd with the high/low interface each IOMMU HW should ask basic
questions:

 - What should the generic high level interface do on this HW?
   For instance what should 'Create IOAS generically' do for PPC?
   It should not fail, it should create *something*
   What is the best thing for DPDK?
   I guess the 64 bit window is most broadly useful.

 - How to accurately describe the HW in terms of standard IOAS objects
   and where to put HW specific structs to support this.

   This is where PPC would decide how best to expose a control over
   its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
   wants, so long as it fits into the kernel IOAS model facing the
   connected device driver.

QEMU would have IOMMU userspace drivers. One would be the "generic
driver" using only the high level generic interface. It should work as
best it can on all HW devices. This is the fallback path you talked
of.

QEMU would also have HW specific IOMMU userspace drivers that know how
to operate the exact HW. eg these drivers would know how to use
userspace page tables, how to form IOPTEs and how to access the
special features.

This is how QEMU could use an optimized path with nested page tables,
for instance.

Jason
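A loose sketch of the userspace-driver split described in this
message, assuming a VMM keeps an ordered list of IOMMU userspace
drivers and falls back to a generic one. The class names, the
`supports` method and the hardware identifiers are hypothetical and do
not correspond to any existing QEMU code.

```python
class GenericDriver:
    """Uses only the high level generic interface; works on any host IOMMU."""
    name = "generic"

    def supports(self, host_hw: str) -> bool:
        return True


class HwSpecificDriver:
    """Knows one exact IOMMU HW (userspace page tables, IOPTE format, ...)."""

    def __init__(self, hw: str):
        self.hw = hw
        self.name = "hw:" + hw

    def supports(self, host_hw: str) -> bool:
        return host_hw == self.hw


def pick_driver(host_hw: str, drivers):
    """Prefer a HW specific driver; the generic one is the fallback path."""
    for drv in drivers:
        if drv.supports(host_hw):
            return drv
    raise LookupError("no usable IOMMU userspace driver")


# HW specific drivers are listed first, the generic fallback last.
drivers = [HwSpecificDriver("vendor-a"), GenericDriver()]
print(pick_driver("vendor-a", drivers).name)
print(pick_driver("something-else", drivers).name)
```

Both kinds of driver would end up operating on the same in-kernel IOAS
object; only the ioctls used to create and manipulate it differ.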
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: David Gibson > Sent: Thursday, October 14, 2021 1:00 PM > > On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote: > > > From: David Gibson > > > Sent: Friday, October 1, 2021 2:11 PM > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > allocating an IOASID, userspace is expected to specify the type and > > > > format information for the target I/O page table. > > > > > > > > This RFC supports only one type > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > semantics. For this type the user should specify the addr_width of > > > > the I/O address space and whether the I/O page table is created in > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > point, > > > > as the false setting requires additional contract with KVM on handling > > > > WBINVD emulation, which can be added later. > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next > patch) > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > Open: > > > > - Devices on PPC platform currently use a different iommu driver in > > > > vfio. > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > there > > > > is a way to claim a specific iova range from a system-wide address > space. > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > hasn't > > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > > and then decide how to incorporate it in v2. > > > > > > Ok, there are several things we need for ppc. None of which are > > > inherently ppc specific and some of which will I think be useful for > > > most platforms. 
So, starting from most general to most specific > > > here's basically what's needed: > > > > > > 1. We need to represent the fact that the IOMMU can only translate > > >*some* IOVAs, not a full 64-bit range. You have the addr_width > > >already, but I'm entirely sure if the translatable range on ppc > > >(or other platforms) is always a power-of-2 size. It usually will > > >be, of course, but I'm not sure that's a hard requirement. So > > >using a size/max rather than just a number of bits might be safer. > > > > > >I think basically every platform will need this. Most platforms > > >don't actually implement full 64-bit translation in any case, but > > >rather some smaller number of bits that fits their page table > > >format. > > > > > > 2. The translatable range of IOVAs may not begin at 0. So we need to > > >advertise to userspace what the base address is, as well as the > > >size. POWER's main IOVA range begins at 2^59 (at least on the > > >models I know about). > > > > > >I think a number of platforms are likely to want this, though I > > >couldn't name them apart from POWER. Putting the translated IOVA > > >window at some huge address is a pretty obvious approach to making > > >an IOMMU which can translate a wide address range without colliding > > >with any legacy PCI addresses down low (the IOMMU can check if this > > >transaction is for it by just looking at some high bits in the > > >address). > > > > > > 3. There might be multiple translatable ranges. So, on POWER the > > >IOMMU can typically translate IOVAs from 0..2GiB, and also from > > >2^59..2^59+. The two ranges have completely separate IO > > >page tables, with (usually) different layouts. (The low range will > > >nearly always be a single-level page table with 4kiB or 64kiB > > >entries, the high one will be multiple levels depending on the size > > >of the range and pagesize). > > > > > >This may be less common, but I suspect POWER won't be the only > > >platform to do something like this. 
As above, using a high range > > >is a pretty obvious approach, but clearly won't handle older > > >devices which can't do 64-bit DMA. So adding a smaller range for > > >those devices is again a pretty obvious solution. Any platform > > >with an "IO hole" can be treated as having two ranges, one below > > >the hole and one above it (although in that case they may well not > > >have separate page tables > > > > 1-3 are common on all platforms with fixed reserved ranges. Current > > vfio already reports permitted iova ranges to user via VFIO_IOMMU_ > > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct > > maps only in those ranges. iommufd can follow the same logic for the > > baseline uAPI. > > > > For above cases a [base, max] hint can be provided by the user per > > Jason's recommendation. > > Provided at which stage? IOMMU_IOASID_ALLOC > > > It is a hint as no additional restrictio
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Friday, October 1, 2021 2:11 PM > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > discussion > > > and then decide how to incorporate it in v2. > > > > Ok, there are several things we need for ppc. None of which are > > inherently ppc specific and some of which will I think be useful for > > most platforms. So, starting from most general to most specific > > here's basically what's needed: > > > > 1. We need to represent the fact that the IOMMU can only translate > >*some* IOVAs, not a full 64-bit range. 
You have the addr_width > >already, but I'm entirely sure if the translatable range on ppc > >(or other platforms) is always a power-of-2 size. It usually will > >be, of course, but I'm not sure that's a hard requirement. So > >using a size/max rather than just a number of bits might be safer. > > > >I think basically every platform will need this. Most platforms > >don't actually implement full 64-bit translation in any case, but > >rather some smaller number of bits that fits their page table > >format. > > > > 2. The translatable range of IOVAs may not begin at 0. So we need to > >advertise to userspace what the base address is, as well as the > >size. POWER's main IOVA range begins at 2^59 (at least on the > >models I know about). > > > >I think a number of platforms are likely to want this, though I > >couldn't name them apart from POWER. Putting the translated IOVA > >window at some huge address is a pretty obvious approach to making > >an IOMMU which can translate a wide address range without colliding > >with any legacy PCI addresses down low (the IOMMU can check if this > >transaction is for it by just looking at some high bits in the > >address). > > > > 3. There might be multiple translatable ranges. So, on POWER the > >IOMMU can typically translate IOVAs from 0..2GiB, and also from > >2^59..2^59+. The two ranges have completely separate IO > >page tables, with (usually) different layouts. (The low range will > >nearly always be a single-level page table with 4kiB or 64kiB > >entries, the high one will be multiple levels depending on the size > >of the range and pagesize). > > > >This may be less common, but I suspect POWER won't be the only > >platform to do something like this. As above, using a high range > >is a pretty obvious approach, but clearly won't handle older > >devices which can't do 64-bit DMA. So adding a smaller range for > >those devices is again a pretty obvious solution. 
Any platform > >with an "IO hole" can be treated as having two ranges, one below > >the hole and one above it (although in that case they may well not > >have separate page tables > > 1-3 are common on all platforms with fixed reserved ranges. Current > vfio already reports permitted iova ranges to user via VFIO_IOMMU_ > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct > maps only in those ranges. iommufd can follow the same logic for the > baseline uAPI. > > For above cases a [base, max] hint can be provided by the user per > Jason's recommendation. Provided at which stage? > It is a hint as no additional restriction is > imposed, For the qemu type use case, that's not true. In that case we *require* the available mapping ranges to match what the guest platform expects. > since the kernel only cares about no violation on permitted > ranges that it reports to the user. Underlying iommu dr
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 02:17:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote:
> > > PASID support will already require that a device can be multi-bound to
> > > many IOAS's, couldn't PPC do the same with the windows?
> >
> > I don't see how that would make sense. The device has no awareness of
> > multiple windows the way it does of PASIDs. It just sends
> > transactions over the bus with the IOVAs it's told. If those IOVAs
> > lie within one of the windows, the IOMMU picks them up and translates
> > them. If they don't, it doesn't.
>
> To my mind that address centric routing is awareness.

I don't really understand that position. A PASID capable device has
to be built to be PASID capable, and will generally have registers
into which you store PASIDs to use.

Any 64-bit DMA capable device can use the POWER IOMMU just fine - it's
up to the driver to program it with addresses that will be translated
(and in Linux the driver will get those from the DMA subsystem).

> If the HW can attach multiple non-overlapping IOAS's to the same
> device then the HW is routing to the correct IOAS by using the address
> bits. This is not much different from the prior discussion we had
> where we were thinking of the PASID as an 80 bit address

Ah... that might be a workable approach. And it even helps me get my
head around multiple attachment which I was struggling with before.

So, the rule would be that you can attach multiple IOASes to a device,
as long as none of them overlap. The non-overlapping could be because
each IOAS covers a disjoint address range, or it could be because
there's some attached information - such as a PASID - to disambiguate.

What remains a question is where the disambiguating information comes
from in each case: does it come from properties of the IOAS,
properties of the device, or from extra parameters supplied at attach
time. IIUC, the current draft suggests it always comes at attach time
for the PASID information. Obviously the more consistency we can have
here the better.

I can also see an additional problem in implementation, once we start
looking at hot-adding devices to existing address spaces. Suppose our
software (maybe qemu) wants to set up a single DMA view for a bunch of
devices, that has such a split window. It can set up IOASes easily
enough for the two windows, then it needs to attach them. Presumably,
it attaches them one at a time, which means that each device (or
group) goes through an interim state where it's attached to one, but
not the other. That can probably be achieved by using an extra IOMMU
domain (or the local equivalent) in the hardware for that interim
state. However it means we have to repeatedly create and destroy that
extra domain for each device after the first we add, rather than
simply adding each device to the domain which has both windows.

[I think this doesn't arise on POWER when running under PowerVM. That
 has no concept like IOMMU domains, and instead the mapping is always
 done per "partitionable endpoint" (PE), essentially a group. That
 means it's just a question of whether we mirror mappings on both
 windows into a given PE or just those from one IOAS. It's not an
 unreasonable extension/combination of existing hardware quirks to
 consider, though]

> The fact the PPC HW actually has multiple page table roots and those
> roots even have different page tables layouts while still connected to
> the same device suggests this is not even an unnatural modelling
> approach...
>
> Jason
>

-- 
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
 | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
>
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> >
> > I'm not entirely sure what you mean by that.
>
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree. Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easily like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that. qemu *should* provide that,
because the specific IOVAs matter to the guest. That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition. I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that". But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
>
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
>
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me. I see qemu as providing logic
to emulate each target IOMMU. Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not. But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide a truly generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
>
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic. So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
>
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach. That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
>
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> For instance, we are going to see HW with nested pag
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> >
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> Seems like we don't need the negotiation part? The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). And there already is a negotiation happening,
> when the host kernel rejects MAP ioctl outside the advertised area.

The problem with the approach where the kernel advertises and
userspace selects based on that, is that it locks us into a specific
representation of what's possible. If we get new hardware with new
weird constraints that can't be expressed with the representation we
chose, we're kind of stuffed. Userspace will have to change to
accommodate the new extension and have any chance of working on the
new hardware.

With the model where userspace requests, and the kernel acks or nacks,
we can still support existing userspace if the only things it requests
can still be accommodated in the new constraints. That's pretty likely
if the majority of userspaces request very simple things (say a single
IOVA block where it doesn't care about the base address).

-- 
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
 | _way_ _around_!
http://www.ozlabs.org/~dgibson
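The "userspace requests, kernel acks or nacks" model argued for above can be pictured with a small toy. This is only a sketch in Python, not a proposed uAPI: the helper `request_window`, its parameters and the `HOST` layout are all hypothetical names invented for illustration. Passing `base=None` stands for the "I don't care about the base address" case.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (first, last) IOVA

GiB = 1 << 30
TiB = 1 << 40


def request_window(size: int, base: Optional[int],
                   translatable: List[Range]) -> Optional[Range]:
    """Ack (return the granted window) or nack (return None) a request.

    base=None means "don't care": the kernel side picks a base itself.
    Hypothetical helper, not part of any real interface.
    """
    if size <= 0:
        return None
    for first, last in translatable:
        if base is None:
            # No placement constraint: take the first range that is big enough.
            if last - first + 1 >= size:
                return (first, first + size - 1)
        elif first <= base and base + size - 1 <= last:
            # Constrained request: it must fit exactly where it was asked for.
            return (base, base + size - 1)
    return None


# Illustrative host layout only (two windows, loosely modelled on the
# POWER example in this thread); not real hardware data.
HOST = [(0, 2 * GiB - 1), (1 << 59, (1 << 60) - 1)]
```

A DPDK-like caller would pass `base=None` and accept whatever window comes back, while a qemu-like caller would pass the base the guest platform expects and treat `None` as a hard failure.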
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jean-Philippe Brucker > Sent: Tuesday, October 12, 2021 4:34 PM > > On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote: > > On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote: > > > > > Seems like we don't need the negotiation part? The host kernel > > > communicates available IOVA ranges to userspace including holes (patch > > > 17), and userspace can check that the ranges it needs are within the IOVA > > > space boundaries. That part is necessary for DPDK as well since it needs > > > to know about holes in the IOVA space where DMA wouldn't work as > expected > > > (MSI doorbells for example). > > > > I haven't looked super closely at DPDK, but the other simple VFIO app > > I am aware of struggled to properly implement this semantic (Indeed it > > wasn't even clear to the author this was even needed). > > > > It requires interval tree logic inside the application which is not a > > trivial algorithm to implement in C. > > > > I do wonder if the "simple" interface should have an option more like > > the DMA API where userspace just asks to DMA map some user memory > and > > gets back the dma_addr_t to use. Kernel manages the allocation > > space/etc. > > Agreed, it's tempting to use IOVA = VA but the two spaces aren't > necessarily compatible. An extension that plugs into the IOVA allocator > could be useful to userspace drivers. > Make sense. We can have a flag in IOMMUFD_MAP_DMA to tell whether the user provides vaddr or expects the kernel to allocate and return. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jean-Philippe Brucker > Sent: Monday, October 11, 2021 4:50 PM > > On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote: > > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to > > be determined) that it needs an IOAS where things can be mapped in the > > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the > > 64-bit window). > > > > Ideally the host /dev/iommu will say "ok!", since both those ranges > > are within the 0..2^60 translated range of the host IOMMU, and don't > > touch the IO hole. When the guest calls the IO mapping hypercalls, > > qemu translates those into DMA_MAP operations, and since they're all > > within the previously verified windows, they should work fine. > > Seems like we don't need the negotiation part? The host kernel > communicates available IOVA ranges to userspace including holes (patch > 17), and userspace can check that the ranges it needs are within the IOVA > space boundaries. That part is necessary for DPDK as well since it needs > to know about holes in the IOVA space where DMA wouldn't work as > expected > (MSI doorbells for example). And there already is a negotiation happening, > when the host kernel rejects MAP ioctl outside the advertised area. > Agree. This can cover the ppc platforms with fixed reserved ranges. It's meaningless to have user further tell kernel that it is only willing to use a subset of advertised area. for ppc platforms with dynamic reserved ranges which are claimed by user, we can leave it out of the common set and handled in a different way, either leveraging ioas nesting if applied or having ppc specific cmd. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: David Gibson > Sent: Friday, October 1, 2021 2:11 PM > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > This patch adds IOASID allocation/free interface per iommufd. When > > allocating an IOASID, userspace is expected to specify the type and > > format information for the target I/O page table. > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > semantics. For this type the user should specify the addr_width of > > the I/O address space and whether the I/O page table is created in > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > as the false setting requires additional contract with KVM on handling > > WBINVD emulation, which can be added later. > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > for what formats can be specified when allocating an IOASID. > > > > Open: > > - Devices on PPC platform currently use a different iommu driver in vfio. > > Per previous discussion they can also use vfio type1v2 as long as there > > is a way to claim a specific iova range from a system-wide address space. > > This requirement doesn't sound PPC specific, as addr_width for pci > devices > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > adopted this design yet. We hope to have formal alignment in v1 > discussion > > and then decide how to incorporate it in v2. > > Ok, there are several things we need for ppc. None of which are > inherently ppc specific and some of which will I think be useful for > most platforms. So, starting from most general to most specific > here's basically what's needed: > > 1. We need to represent the fact that the IOMMU can only translate >*some* IOVAs, not a full 64-bit range. You have the addr_width >already, but I'm entirely sure if the translatable range on ppc >(or other platforms) is always a power-of-2 size. 
It usually will >be, of course, but I'm not sure that's a hard requirement. So >using a size/max rather than just a number of bits might be safer. > >I think basically every platform will need this. Most platforms >don't actually implement full 64-bit translation in any case, but >rather some smaller number of bits that fits their page table >format. > > 2. The translatable range of IOVAs may not begin at 0. So we need to >advertise to userspace what the base address is, as well as the >size. POWER's main IOVA range begins at 2^59 (at least on the >models I know about). > >I think a number of platforms are likely to want this, though I >couldn't name them apart from POWER. Putting the translated IOVA >window at some huge address is a pretty obvious approach to making >an IOMMU which can translate a wide address range without colliding >with any legacy PCI addresses down low (the IOMMU can check if this >transaction is for it by just looking at some high bits in the >address). > > 3. There might be multiple translatable ranges. So, on POWER the >IOMMU can typically translate IOVAs from 0..2GiB, and also from >2^59..2^59+. The two ranges have completely separate IO >page tables, with (usually) different layouts. (The low range will >nearly always be a single-level page table with 4kiB or 64kiB >entries, the high one will be multiple levels depending on the size >of the range and pagesize). > >This may be less common, but I suspect POWER won't be the only >platform to do something like this. As above, using a high range >is a pretty obvious approach, but clearly won't handle older >devices which can't do 64-bit DMA. So adding a smaller range for >those devices is again a pretty obvious solution. Any platform >with an "IO hole" can be treated as having two ranges, one below >the hole and one above it (although in that case they may well not >have separate page tables 1-3 are common on all platforms with fixed reserved ranges. 
Current vfio already reports permitted iova ranges to user via VFIO_IOMMU_ TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct maps only in those ranges. iommufd can follow the same logic for the baseline uAPI. For above cases a [base, max] hint can be provided by the user per Jason's recommendation. It is a hint as no additional restriction is imposed, since the kernel only cares about no violation on permitted ranges that it reports to the user. Underlying iommu driver may use this hint to optimize e.g. deciding how many levels are used for the kernel-managed page table according to max addr. > > 4. The translatable ranges might not be fixed. On ppc that 0..2GiB >and 2^59..whatever ranges are kernel conventions, not specified by >the hardware or firmware. When running as a guest (which is the >normal case on POWER), there are explicit hypercalls for >con
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
>
> > Seems like we don't need the negotiation part? The host kernel
> > communicates available IOVA ranges to userspace including holes (patch
> > 17), and userspace can check that the ranges it needs are within the IOVA
> > space boundaries. That part is necessary for DPDK as well since it needs
> > to know about holes in the IOVA space where DMA wouldn't work as expected
> > (MSI doorbells for example).
>
> I haven't looked super closely at DPDK, but the other simple VFIO app
> I am aware of struggled to properly implement this semantic (Indeed it
> wasn't even clear to the author this was even needed).
>
> It requires interval tree logic inside the application which is not a
> trivial algorithm to implement in C.
>
> I do wonder if the "simple" interface should have an option more like
> the DMA API where userspace just asks to DMA map some user memory and
> gets back the dma_addr_t to use. Kernel manages the allocation
> space/etc.

Agreed, it's tempting to use IOVA = VA but the two spaces aren't
necessarily compatible. An extension that plugs into the IOVA allocator
could be useful to userspace drivers.

Thanks,
Jean
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:

> Seems like we don't need the negotiation part? The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example).

I haven't looked super closely at DPDK, but the other simple VFIO app
I am aware of struggled to properly implement this semantic (Indeed it
wasn't even clear to the author this was even needed).

It requires interval tree logic inside the application which is not a
trivial algorithm to implement in C.

I do wonder if the "simple" interface should have an option more like
the DMA API where userspace just asks to DMA map some user memory and
gets back the dma_addr_t to use. Kernel manages the allocation
space/etc.

Jason
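To make the DMA-API-like option concrete, here is a toy model of an allocator that hands back an address instead of being told one. It is a sketch only: the class name `IovaAllocator`, the first-fit bump policy and the 4 KiB page size are assumptions made for illustration, and nothing here reflects how the kernel's IOVA allocator actually works (there is no freeing, for one).

```python
from typing import List, Optional, Tuple


class IovaAllocator:
    """Toy first-fit bump allocator over permitted IOVA ranges (hypothetical)."""

    def __init__(self, permitted: List[Tuple[int, int]], page_size: int = 4096):
        self.page_size = page_size
        # One [next_free, last] cursor per permitted inclusive range.
        self.cursors = [[first, last] for first, last in sorted(permitted)]

    def _round_up(self, value: int) -> int:
        # Round up to the next multiple of the page size.
        return -(-value // self.page_size) * self.page_size

    def alloc(self, length: int) -> Optional[int]:
        """Return a page-aligned IOVA for `length` bytes, or None if full."""
        if length <= 0:
            raise ValueError("length must be positive")
        size = self._round_up(length)
        for cursor in self.cursors:
            start = self._round_up(cursor[0])
            if start + size - 1 <= cursor[1]:
                cursor[0] = start + size  # bump past the new allocation
                return start
        return None
```

The caller never sees the holes: it asks for a length and either gets an address inside a permitted range or a failure, which is the property that removes the interval-tree bookkeeping from the application.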
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:

> > This means we cannot define an input that has a magic HW specific
> > value.
>
> I'm not entirely sure what you mean by that.

I mean if you make a general property 'foo' that userspace must
specify correctly then your API isn't general anymore. Userspace must
know if it is A or B HW to set foo=A or foo=B.

Supported IOVA ranges are easily like that as every IOMMU is
different. So DPDK shouldn't provide such specific or binding
information.

> No, I don't think that needs to be a condition. I think it's
> perfectly reasonable for a constraint to be given, and for the host
> IOMMU to just say "no, I can't do that". But that does mean that each
> of these values has to have an explicit way of userspace specifying "I
> don't care", so that the kernel will select a suitable value for those
> instead - that's what DPDK or other userspace would use nearly all the
> time.

My feeling is that qemu should be dealing with the host != target
case, not the kernel.

The kernel's job should be to expose the IOMMU HW it has, with all
features accessible, to userspace.

Qemu's job should be to have a userspace driver for each kernel IOMMU
and the internal infrastructure to make accelerated emulations for all
supported target IOMMUs.

In other words, it is not the kernel's job to provide target IOMMU
emulation.

The kernel should provide a truly generic "works everywhere" interface
that qemu/etc can rely on to implement the least accelerated emulation
path.

So when I see proposals to have "generic" interfaces that actually
require very HW specific setup, and cannot be used by a generic qemu
userspace driver, I think it breaks this model. If qemu needs to know
it is on PPC (as it does today with VFIO's PPC specific API) then it
may as well speak PPC specific language and forget about pretending to
be generic.

This approach is grounded in 15 years of trying to build these
user/kernel split HW subsystems (particularly RDMA) where it has
become painfully obvious that the kernel is the worst place to try and
wrangle really divergent HW into a "common" uAPI.

This is because the kernel/user boundary is fixed. Introducing
anything generic here requires a lot of time, thought, arguing and
risk. Usually it ends up being done wrong (like the PPC specific
ioctls, for instance) and when this happens we can't learn and adapt,
we are stuck with stable uABI forever.

Exposing a device's native programming interface is much simpler. Each
device is fixed, defined and someone can sit down and figure out how
to expose it. Then that is it, it doesn't need revisiting, it doesn't
need harmonizing with a future slightly different device, it just
stays as is.

The cost is that there must be a userspace driver component for each
HW piece - which we are already paying here!

> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole. When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

For instance, we are going to see HW with nested page tables, user
space owned page tables and even kernel-bypass fast IOTLB
invalidation. In that world does it even make sense for qemu to use
slow DMA_MAP ioctls for emulation? A userspace framework in qemu can
make these optimizations and is also necessarily HW specific as the
host page table is HW specific.

Jason
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote:
> > PASID support will already require that a device can be multi-bound to
> > many IOAS's, couldn't PPC do the same with the windows?
>
> I don't see how that would make sense. The device has no awareness of
> multiple windows the way it does of PASIDs. It just sends
> transactions over the bus with the IOVAs it's told. If those IOVAs
> lie within one of the windows, the IOMMU picks them up and translates
> them. If they don't, it doesn't.

To my mind that address centric routing is awareness.

If the HW can attach multiple non-overlapping IOAS's to the same
device then the HW is routing to the correct IOAS by using the address
bits. This is not much different from the prior discussion we had
where we were thinking of the PASID as an 80 bit address.

The fact the PPC HW actually has multiple page table roots and those
roots even have different page tables layouts while still connected to
the same device suggests this is not even an unnatural modelling
approach...

Jason
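The address-centric routing described above can be modelled in a few lines. This is a sketch under stated assumptions: `Window`, `Device`, `attach` and `route` are hypothetical names, each IOAS is reduced to a single inclusive address range, and PASID-style disambiguation is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional

GiB = 1 << 30
TiB = 1 << 40


@dataclass(frozen=True)
class Window:
    """One IOAS reduced to the inclusive IOVA range it covers (toy model)."""
    name: str
    first: int
    last: int


def overlaps(a: Window, b: Window) -> bool:
    # Two inclusive ranges intersect iff each starts before the other ends.
    return a.first <= b.last and b.first <= a.last


class Device:
    """Toy device that may be attached to several non-overlapping windows."""

    def __init__(self) -> None:
        self.attached: List[Window] = []

    def attach(self, window: Window) -> None:
        if any(overlaps(window, other) for other in self.attached):
            raise ValueError(f"{window.name} overlaps an attached window")
        self.attached.append(window)

    def route(self, iova: int) -> Optional[Window]:
        """Pick the window by address alone; None means 'not translated'."""
        for window in self.attached:
            if window.first <= iova <= window.last:
                return window
        return None
```

Because attachment refuses overlaps, `route` can never find two candidates for one address, which is what lets the address bits alone select the page table root.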
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> be determined) that it needs an IOAS where things can be mapped in the
> range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> 64-bit window).
>
> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole. When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

Seems like we don't need the negotiation part? The host kernel
communicates available IOVA ranges to userspace including holes (patch
17), and userspace can check that the ranges it needs are within the IOVA
space boundaries. That part is necessary for DPDK as well since it needs
to know about holes in the IOVA space where DMA wouldn't work as expected
(MSI doorbells for example). And there already is a negotiation happening,
when the host kernel rejects MAP ioctl outside the advertised area.

Thanks,
Jean
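The userspace-side check suggested above (are the ranges I need inside what the kernel advertised?) might look roughly like the following. It is a minimal sketch: `range_is_usable` is a hypothetical helper, the `host_ranges` values are placeholders rather than data from any real platform, and it assumes the advertised ranges are already coalesced, so that a wanted range spanning two separate entries really does cross a hole.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first, last) IOVA

GiB = 1 << 30
TiB = 1 << 40


def range_is_usable(wanted: Range, permitted: List[Range]) -> bool:
    """True if `wanted` lies entirely inside a single permitted range."""
    first, last = wanted
    return any(p_first <= first and last <= p_last
               for p_first, p_last in permitted)


# Placeholder report: a 0..2^60 translated space with one hole in it.
host_ranges = [(0, 0xFEDFFFFF), (0xFEF00000, (1 << 60) - 1)]

# The two windows from the PAPR example quoted above.
papr_windows = [(0, 2 * GiB - 1), (1 << 59, (1 << 59) + TiB - 1)]
all_ok = all(range_is_usable(w, host_ranges) for w in papr_windows)
```

With these placeholder values both windows fit, so `all_ok` is true; a window that straddled the hole would be rejected up front instead of failing later at map time.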
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 09:22:25AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> > On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ioas. When the kernel creates it then it can return the
> > > actual geometry including any holes via a query.
> >
> > So part of the point of specifying start/end addresses is that
> > explicitly querying holes shouldn't be necessary: if the requested
> > range crosses a hole, it should fail. If you didn't really need all
> > that range, you shouldn't have asked for it.
> >
> > Which means these aren't really "hints" but optionally supplied
> > constraints.
>
> We have to be very careful here, there are two very different use
> cases. When we are talking about the generic API I am mostly
> interested to see that applications like DPDK can use this API and be
> portable to any IOMMU HW the kernel supports. I view the fact that
> there is VFIO PPC specific code in DPDK as a failing of the kernel to
> provide a HW abstraction.

I would agree. At the time we were making this, we thought there were
irreconcilable differences between what could be done with the x86 vs
ppc IOMMUs. Turns out we just didn't think it through hard enough to
find a common model.

> This means we cannot define an input that has a magic HW specific
> value.

I'm not entirely sure what you mean by that.

> DPDK can never provide that portably. Thus all these kinds of
> inputs in the generic API need to be hints, if they exist at all.

I don't follow your reasoning. First, note that in qemu these values
are *target* hardware specific, not *host* hardware specific. If
those requests aren't honoured, qemu cannot faithfully emulate the
target hardware and has to fail. That's what I mean when I say this
is a constraint, not a hint. But when I say the constraint is
optional, I mean that things which don't have that requirement - like
DPDK - shouldn't apply the constraint.

> As 'address space size hint'/'address space start hint' is both
> generic, useful, and providable by DPDK I think it is OK.

Size is certainly providable, and probably useful. For DPDK, I don't
think start is useful.

> PPC can use
> it to pick which of the two page table formats to use for this IOAS if
> it wants.

Clarification: it's not that each window has a specific page table
format. The two windows are independent of each other, which means
you can separately select the page table format for each one (although
the 32-bit one generally won't be big enough that there's any point
selecting something other than a 1-level TCE table). When I say
format here, I basically mean number of levels and size of each level
- the IOPTE (a.k.a. TCE) format is the same in each case.

> The second use case is when we have a userspace driver for a specific
> HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
> We can look here for things to make general, but I would expect a
> fairly high bar. Instead, I would rather see the userspace driver
> communicate with the kernel driver in its own private language, so
> that the entire functionality of the unique HW can be used.

I don't think we actually need to do this. Or rather, we might want
to do this for maximum performance in some cases, but I think we can h
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Sat, Oct 02, 2021 at 09:25:42AM -0300, Jason Gunthorpe wrote:
> On Sat, Oct 02, 2021 at 02:21:38PM +1000, da...@gibson.dropbear.id.au wrote:
>
> > > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > > guest, and therefore needs to request both from the host.
> > >
> > > As I understood your remarks each IOAS can only be one of the formats
> > > as they have a different PTE layout. So here I meant that qemu needs to
> > > be able to pick *for each IOAS* which of the two formats it is.
> >
> > No. Both windows are in the same IOAS. A device could do DMA
> > simultaneously to both windows.
>
> Sure, but that doesn't force us to model it as one IOAS in the
> iommufd. A while back you were talking about using nesting and 3
> IOAS's, right?
>
> 1, 2 or 3 IOAS's seems like a decision we can make.

Well, up to a point. We can decide how such a thing should be
constructed. However at some point there needs to exist an IOAS in
which both windows are mapped, whether it's directly or indirectly.
That's what the device will be attached to.

> PASID support will already require that a device can be multi-bound to
> many IOAS's, couldn't PPC do the same with the windows?

I don't see how that would make sense. The device has no awareness of
multiple windows the way it does of PASIDs. It just sends
transactions over the bus with the IOVAs it's told. If those IOVAs
lie within one of the windows, the IOMMU picks them up and translates
them. If they don't, it doesn't.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
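As an illustration of the last point in the message above (the device only emits IOVAs, and the IOMMU translates those that fall inside one of the windows of a single IOAS), here is a minimal sketch in C. The struct and function names are hypothetical and are not part of the proposed uAPI or of any kernel code; the window placement used in the usage note is an assumption for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one DMA window inside a single IOAS. */
struct dma_window {
	uint64_t base;	/* first IOVA covered by the window */
	uint64_t size;	/* length of the window in bytes */
};

/*
 * Return the index of the window that contains iova, or -1 if the
 * address lies outside every window (the IOMMU would not translate it).
 */
static int find_window(const struct dma_window *w, size_t n, uint64_t iova)
{
	for (size_t i = 0; i < n; i++) {
		/* Subtraction form avoids overflow in base + size. */
		if (iova >= w[i].base && iova - w[i].base < w[i].size)
			return (int)i;
	}
	return -1;
}
```

With an assumed 1 GiB window at IOVA 0 and an assumed second window at a high base, a DMA to 0x1000 resolves to window 0, a DMA just above the high base resolves to window 1, and an address between the two matches nothing. Both lookups go through the same table, which mirrors the argument that both windows belong to one address space.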
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Sat, Oct 02, 2021 at 02:21:38PM +1000, da...@gibson.dropbear.id.au wrote:

> > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > guest, and therefore needs to request both from the host.
> >
> > As I understood your remarks each IOAS can only be one of the formats
> > as they have a different PTE layout. So here I meant that qemu needs to
> > be able to pick *for each IOAS* which of the two formats it is.
>
> No. Both windows are in the same IOAS. A device could do DMA
> simultaneously to both windows.

Sure, but that doesn't force us to model it as one IOAS in the
iommufd. A while back you were talking about using nesting and 3
IOAS's, right?

1, 2 or 3 IOAS's seems like a decision we can make.

PASID support will already require that a device can be multi-bound to
many IOAS's, couldn't PPC do the same with the windows?

Jason
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 09:25:05AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote: > > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote: > > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > > format information for the target I/O page table. > > > > > > > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > > semantics. For this type the user should specify the addr_width of > > > > > > the I/O address space and whether the I/O page table is created in > > > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > > > > > > point, > > > > > > as the false setting requires additional contract with KVM on > > > > > > handling > > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > > > Open: > > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > > vfio. > > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > > there > > > > > > is a way to claim a specific iova range from a system-wide > > > > > > address space. > > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > > devices > > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > > > > > hasn't > > > > > > adopted this design yet. 
We hope to have formal alignment in v1 discussion
> > > > > > and then decide how to incorporate it in v2.
> > > > >
> > > > > I think the request was to include a start/end IO address hint when
> > > > > creating the ios. When the kernel creates it then it can return the
> > > >
> > > > is the hint single-range or could be multiple-ranges?
> > >
> > > David explained it here:
> > >
> > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> >
> > Apparently not well enough. I've attempted again in this thread.
> >
> > > qemu needs to be able to choose if it gets the 32 bit range or 64
> > > bit range.
> >
> > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > guest, and therefore needs to request both from the host.
>
> As I understood your remarks each IOAS can only be one of the formats
> as they have a different PTE layout. So here I meant that qemu needs to
> be able to pick *for each IOAS* which of the two formats it is.

No. Both windows are in the same IOAS. A device could do DMA
simultaneously to both windows. More realistically a 64-bit DMA
capable and a non-64-bit DMA capable device could be in the same group
and be doing DMAs to different windows simultaneously.

> > Or rather, it *might* need to supply both. It will supply just the
> > 32-bit range by default, but the guest can request the 64-bit range
> > and/or remove and resize the 32-bit range via hypercall interfaces.
> > Vaguely recent Linux guests certainly will request the 64-bit range in
> > addition to the default 32-bit range.
>
> And this would result in two different IOAS objects

There might be two different IOAS objects for setup, but at some point
they need to be combined into one IOAS to which the device is actually
attached.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote: > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > format information for the target I/O page table. > > > > > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > semantics. For this type the user should specify the addr_width of > > > > > the I/O address space and whether the I/O page table is created in > > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > > > > > point, > > > > > as the false setting requires additional contract with KVM on handling > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > Open: > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > vfio. > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > there > > > > > is a way to claim a specific iova range from a system-wide address > > > > > space. > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > devices > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > > > > hasn't > > > > > adopted this design yet. We hope to have formal alignment in v1 > > > > discussion > > > > > and then decide how to incorporate it in v2. 
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> >
> > David explained it here:
> >
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
>
> Apparently not well enough. I've attempted again in this thread.
>
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
>
> No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> guest, and therefore needs to request both from the host.

As I understood your remarks each IOAS can only be one of the formats
as they have a different PTE layout. So here I meant that qemu needs to
be able to pick *for each IOAS* which of the two formats it is.

> Or rather, it *might* need to supply both. It will supply just the
> 32-bit range by default, but the guest can request the 64-bit range
> and/or remove and resize the 32-bit range via hypercall interfaces.
> Vaguely recent Linux guests certainly will request the 64-bit range in
> addition to the default 32-bit range.

And this would result in two different IOAS objects

Jason
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote: > On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote: > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > and then decide how to incorporate it in v2. > > > > I think the request was to include a start/end IO address hint when > > creating the ios. When the kernel creates it then it can return the > > actual geometry including any holes via a query. > > So part of the point of specifying start/end addresses is that > explicitly querying holes shouldn't be necessary: if the requested > range crosses a hole, it should fail. 
If you didn't really need all > that range, you shouldn't have asked for it. > > Which means these aren't really "hints" but optionally supplied > constraints. We have to be very careful here, there are two very different use cases. When we are talking about the generic API I am mostly interested to see that applications like DPDK can use this API and be portable to any IOMMU HW the kernel supports. I view the fact that there is VFIO PPC specific code in DPDK as a failing of the kernel to provide a HW abstraction. This means we cannot define an input that has a magic HW specific value. DPDK can never provide that portably. Thus all these kinds of inputs in the generic API need to be hints, if they exist at all. As 'address space size hint'/'address space start hint' is both generic, useful, and providable by DPDK I think it is OK. PPC can use it to pick which of the two page table formats to use for this IOAS if it wants. The second use case is when we have a userspace driver for a specific HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration. We can look here for things to make general, but I would expect a fairly high bar. Instead, I would rather see the userspace driver communicate with the kernel driver in its own private language, so that the entire functionality of the unique HW can be used. So, when it comes to providing exact ranges as an input parameter we have to decide if that is done as some additional general data, or if it should be part of a IOAS_FORMAT_KERNEL_PPC. In this case I suggest the guiding factor should be if every single IOMMU implementation can be updated to support the value. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote: > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > This patch adds IOASID allocation/free interface per iommufd. When > > allocating an IOASID, userspace is expected to specify the type and > > format information for the target I/O page table. > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > semantics. For this type the user should specify the addr_width of > > the I/O address space and whether the I/O page table is created in > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > as the false setting requires additional contract with KVM on handling > > WBINVD emulation, which can be added later. > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > for what formats can be specified when allocating an IOASID. > > > > Open: > > - Devices on PPC platform currently use a different iommu driver in vfio. > > Per previous discussion they can also use vfio type1v2 as long as there > > is a way to claim a specific iova range from a system-wide address space. > > This requirement doesn't sound PPC specific, as addr_width for pci devices > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > adopted this design yet. We hope to have formal alignment in v1 discussion > > and then decide how to incorporate it in v2. > > I think the request was to include a start/end IO address hint when > creating the ios. When the kernel creates it then it can return the > actual geometry including any holes via a query. So part of the point of specifying start/end addresses is that explicitly querying holes shouldn't be necessary: if the requested range crosses a hole, it should fail. If you didn't really need all that range, you shouldn't have asked for it. Which means these aren't really "hints" but optionally supplied constraints. 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
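The "constraint, not hint" semantics argued for in the message above can be sketched as follows: a request for an IOVA range either succeeds in full or fails, and is never silently trimmed around a hole. This is only an illustrative sketch in C; `struct iova_range` and `claim_iova_range()` are hypothetical names, not part of the RFC, and the hole used in the usage note is an assumed example value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive IOVA range [start, last]; hypothetical helper type. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/*
 * Treat the requested range as a constraint: return 0 only if no
 * reserved hole intersects it, and -1 otherwise. The caller never has
 * to query the holes to learn what it actually got.
 */
static int claim_iova_range(const struct iova_range *holes, size_t nr_holes,
			    struct iova_range req)
{
	if (req.start > req.last)
		return -1;

	for (size_t i = 0; i < nr_holes; i++) {
		/* Two inclusive ranges overlap if each starts before the other ends. */
		if (req.start <= holes[i].last && holes[i].start <= req.last)
			return -1;
	}
	return 0;
}
```

Assuming a single hole at 0xfee00000-0xfeefffff, a request for [0, 0x7fffffff] succeeds, a request for [0, 0xffffffff] fails because it crosses the hole, and a request placed entirely above 4 GiB succeeds. A caller such as DPDK that has no placement requirement would simply not supply the constraint.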
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, September 22, 2021 10:09 PM > > > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > format information for the target I/O page table. > > > > > > > > > > This RFC supports only one type > > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > semantics. For this type the user should specify the addr_width of > > > > > the I/O address space and whether the I/O page table is created in > > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > > point, > > > > > as the false setting requires additional contract with KVM on handling > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next > > patch) > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > Open: > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > vfio. > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > there > > > > > is a way to claim a specific iova range from a system-wide address > > space. > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > devices > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > hasn't > > > > > adopted this design yet. We hope to have formal alignment in v1 > > > > discussion > > > > > and then decide how to incorporate it in v2. 
> > > > > > > > I think the request was to include a start/end IO address hint when > > > > creating the ios. When the kernel creates it then it can return the > > > > > > is the hint single-range or could be multiple-ranges? > > > > David explained it here: > > > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/ > > > > qeumu needs to be able to chooose if it gets the 32 bit range or 64 > > bit range. > > > > So a 'range hint' will do the job > > > > David also suggested this: > > > > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/ > > > > So I like this better: > > > > struct iommu_ioasid_alloc { > > __u32 argsz; > > > > __u32 flags; > > #define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0) > > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1) > > > > __aligned_u64 max_iova_hint; > > __aligned_u64 base_iova_hint; // Used only if > > IOMMU_IOASID_HINT_BASE_IOVA > > > > // For creating nested page tables > > __u32 parent_ios_id; > > __u32 format; > > #define IOMMU_FORMAT_KERNEL 0 > > #define IOMMU_FORMAT_PPC_XXX 2 > > #define IOMMU_FORMAT_[..] > > u32 format_flags; // Layout depends on format above > > > > __aligned_u64 user_page_directory; // Used if parent_ios_id != 0 > > }; > > > > Again 'type' as an overall API indicator should not exist, feature > > flags need to have clear narrow meanings. > > currently the type is aimed to differentiate three usages: > > - kernel-managed I/O page table > - user-managed I/O page table > - shared I/O page table (e.g. with mm, or ept) > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > indicator? their difference is not about format. To me "format" indicates how the IO translation information is encoded. We potentially have two different encodings: from userspace to the kernel and from the kernel to the hardware. But since this is the userspace API, it's only the userspace to kernel one that matters here. 
In that sense, KERNEL is a "format": we encode the translation
information as a series of IOMAP operations to the kernel, rather than
as an in-memory structure.

> > This does both of David's suggestions at once. If qemu wants the 1G
> > limited region it could specify max_iova_hint = 1G, if it wants the
> > extend 64bit region with the hole it can give either the high base or
> > a large max_iova_hint. format/format_flags allows a further
>
> Dave's links didn't answer one puzzle from me. Does PPC need accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

I need more specifics to answer that. Are you talking from a
userspace PoV, a guest kernel's or the host kernel's? In general I
think requiring userspace to locate and work around holes is a bad
idea. If userspace requests a range, it should get *all* of that
range.

The ppc case is further complicated because there are multiple ranges
and each range could have separate IO page tables. In practice
non-kernel managed IO pagetables are likely to be hard on ppc (or a
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote: > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > allocating an IOASID, userspace is expected to specify the type and > > > > format information for the target I/O page table. > > > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > semantics. For this type the user should specify the addr_width of > > > > the I/O address space and whether the I/O page table is created in > > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > > as the false setting requires additional contract with KVM on handling > > > > WBINVD emulation, which can be added later. > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > Open: > > > > - Devices on PPC platform currently use a different iommu driver in > > > > vfio. > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > there > > > > is a way to claim a specific iova range from a system-wide address > > > > space. > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > > > hasn't > > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > > and then decide how to incorporate it in v2. > > > > > > I think the request was to include a start/end IO address hint when > > > creating the ios. When the kernel creates it then it can return the > > > > is the hint single-range or could be multiple-ranges? 
>
> David explained it here:
>
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

Apparently not well enough. I've attempted again in this thread.

> qemu needs to be able to choose if it gets the 32 bit range or 64
> bit range.

No. qemu needs to supply *both* the 32-bit and 64-bit range to its
guest, and therefore needs to request both from the host.

Or rather, it *might* need to supply both. It will supply just the
32-bit range by default, but the guest can request the 64-bit range
and/or remove and resize the 32-bit range via hypercall interfaces.
Vaguely recent Linux guests certainly will request the 64-bit range in
addition to the default 32-bit range.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, September 23, 2021 8:07 PM > > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > > > > currently the type is aimed to differentiate three usages: > > > > > > - kernel-managed I/O page table > > > - user-managed I/O page table > > > - shared I/O page table (e.g. with mm, or ept) > > > > Creating a shared ios is something that should probably be a different > > command. > > why? I didn't understand the criteria here... > > > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > > > indicator? their difference is not about format. > > > > Format should be > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format? > > > > > > Dave's links didn't answer one puzzle from me. Does PPC needs accurate > > > range information or be ok with a large range including holes (then let > > > the kernel to figure out where the holes locate)? > > > > My impression was it only needed a way to select between the two > > different cases as they are exclusive. I'd see this API as being a > > hint and userspace should query the exact ranges to learn what was > > actually created. > > yes, the user can query the permitted range using DEVICE_GET_INFO. > But in the end if the user wants two separate regions, I'm afraid that > the underlying iommu driver wants to know the exact info. iirc PPC > has one global system address space shared by all devices. I think certain POWER models do this, yes, there's *protection* between DMAs from different devices, but you can't translate the same address to different places for different devices. I *think* that's a firmware/hypervisor convention rather than a hardware limitation, but I'm not entirely sure. 
We don't do things this way when emulating the POWER vIOMMU in POWER, but PowerVM might and we still have to deal with that when running as a POWERVM guest. > It is possible > that the user may want to claim range-A and range-C, with range-B > in-between but claimed by another user. Then simply using one hint > range [A-lowend, C-highend] might not work. > > > > > > > device-specific escape if more specific customization is needed and is > > > > needed to specify user space page tables anyhow. > > > > > > and I didn't understand the 2nd link. How does user-managed page > > > table jump into this range claim problem? I'm getting confused... > > > > PPC could also model it using a FORMAT_KERNEL_PPC_X, > > FORMAT_KERNEL_PPC_Y > > though it is less nice.. > > yes PPC can use different format, but I didn't understand why it is > related user-managed page table which further requires nesting. sound > disconnected topics here... > > > > > > > Yes, ioas_id should always be the xarray index. > > > > > > > > PASID needs to be called out as PASID or as a generic "hw description" > > > > blob. > > > > > > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid? > > > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus, > > and it MUST be exposed in that format to be programmed into the PCI > > device itself. > > In the entire discussion in previous design RFC, I kept an impression that > ARM-equivalent PASID is called SSID. If we can use PASID as a general > term in iommufd context, definitely it's much better! > > > > > All of this should be able to support a userspace, like DPDK, creating > > a PASID on its own without any special VFIO drivers. 
> > > > - Open iommufd > > - Attach the vfio device FD > > - Request a PASID device id > > - Create an ios against the pasid device id > > - Query the ios for the PCI PASID # > > - Program the HW to issue TLPs with the PASID > > this all makes me very confused, and completely different from what > we agreed in previous v2 design proposal: > > - open iommufd > - create an ioas > - attach vfio device to ioasid, with vPASID info > * vfio converts vPASID to pPASID and then call > iommufd_device_attach_ioasid() > * the latter then installs ioas to the IOMMU with RID/PASID > > > > > > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the > > > kernel. Do we want to clear this confusion? Or possibly it's fine because > > > ioas_id is never used outside of iommufd and iommufd doesn't directly > > > call ioasid_alloc() from ioasid.c? > > > > As long as it is ioas_id and ioasid it is probably fine.. > > let's align with others in a few hours. > > > > > > > kvm's API to program the vPASID translation table should probably take > > > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side > > > > information using an in-kernel API. Userspace shouldn't have to > > > > shuttle it around. > > > > > > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI. > > > when kvm calls iommufd with above tuple, vPASID->pPASID is > > > returned to kvm. So we stil
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > discussion > > > and then decide how to incorporate it in v2. > > > > I think the request was to include a start/end IO address hint when > > creating the ios. When the kernel creates it then it can return the > > is the hint single-range or could be multiple-ranges? > > > actual geometry including any holes via a query. 
> > I'd like to see a detail flow from David on how the uAPI works today with > existing spapr driver and what exact changes he'd like to make on this > proposed interface. Above info is still insufficient for us to think about the > right solution. > > > > > > - Currently ioasid term has already been used in the kernel > > (drivers/iommu/ > > > ioasid.c) to represent the hardware I/O address space ID in the wire. It > > > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub- > > Stream > > > ID). We need find a way to resolve the naming conflict between the > > hardware > > > ID and software handle. One option is to rename the existing ioasid to > > > be > > > pasid or ssid, given their full names still sound generic. Appreciate > > > more > > > thoughts on this open! > > > > ioas works well here I think. Use ioas_id to refer to the xarray > > index. > > What about when introducing pasid to this uAPI? Then use ioas_id > for the xarray index and ioasid to represent pasid/ssid? This is probably obsoleted by Jason's other comments, but definitely don't use "ioas_id" and "ioasid" to mean different things. Having meaningfully different things distinguished only by an underscore is not a good idea. > At this point > the software handle and hardware id are mixed together thus need > a clear terminology to differentiate them. > > > Thanks > Kevin > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
>
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
>
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
>
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

Ok, there are several things we need for ppc. None of which are
inherently ppc specific and some of which will I think be useful for
most platforms. So, starting from most general to most specific
here's basically what's needed:

1. We need to represent the fact that the IOMMU can only translate
   *some* IOVAs, not a full 64-bit range. You have the addr_width
   already, but I'm not entirely sure if the translatable range on ppc
   (or other platforms) is always a power-of-2 size. It usually will
   be, of course, but I'm not sure that's a hard requirement. So
   using a size/max rather than just a number of bits might be safer.

   I think basically every platform will need this. Most platforms
   don't actually implement full 64-bit translation in any case, but
   rather some smaller number of bits that fits their page table
   format.

2. The translatable range of IOVAs may not begin at 0. So we need to
   advertise to userspace what the base address is, as well as the
   size. POWER's main IOVA range begins at 2^59 (at least on the
   models I know about).

   I think a number of platforms are likely to want this, though I
   couldn't name them apart from POWER. Putting the translated IOVA
   window at some huge address is a pretty obvious approach to making
   an IOMMU which can translate a wide address range without colliding
   with any legacy PCI addresses down low (the IOMMU can check if this
   transaction is for it by just looking at some high bits in the
   address).

3. There might be multiple translatable ranges. So, on POWER the
   IOMMU can typically translate IOVAs from 0..2GiB, and also from
   2^59..2^59+. The two ranges have completely separate IO
   page tables, with (usually) different layouts. (The low range will
   nearly always be a single-level page table with 4KiB or 64KiB
   entries, the high one will be multiple levels depending on the size
   of the range and pagesize).

   This may be less common, but I suspect POWER won't be the only
   platform to do something like this. As above, using a high range
   is a pretty obvious approach, but clearly won't handle older
   devices which can't do 64-bit DMA. So adding a smaller range for
   those devices is again a pretty obvious solution. Any platform
   with an "IO hole" can be treated as having two ranges, one below
   the hole and one above it (although in that case they may well not
   have separate page tables).

4. The translatable ranges might not be fixed. On ppc the 0..2GiB
   and 2^59..whatever ranges are kernel conventions, not specified by
   the hardware or firmware. When running as a guest (which is the
   normal case on POWER), there are explicit hypercalls for
   configuring the allowed IOVA windows (along with pagesize, number
   of levels etc.). At the moment it is fixed in hardware that there
   are only 2 windows, one starting at 0 and one at 2^59 but there's
   no inherent reason those couldn't also be configurable.

   This will probably be rarer, but I wouldn't be surprised if it
   appears on another platform. If you were designing an IOMMU ASIC
   for use in a variety of platforms, making the base address and size
   of the translatable range(s) configurable in registers would make
   sense.

Now, for (3) and (4), representing lists of windows explicitly in
ioctl()s is likely to be pretty ugly. We might be able to avoid that,
for at least some of the interfaces, by using the nested IOAS stuff.
One way or another, though, the IOASes which are actually attached to
devices need to represent both windows.

e.g. Create a "top-level" IOAS
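
As a rough illustration of points 1 to 4 above, the sketch below models an IOAS geometry as a list of [base, size] windows instead of a single addr_width, and checks whether a requested mapping fits entirely inside one window. It is a toy model in Python, not any kernel or iommufd interface: IovaWindow, window_from_addr_width and find_window are names made up for this example. The 0..2GiB low window and the 2^59 base come from the POWER description above; the 64GiB size of the high window is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IovaWindow:
    """One translatable IOVA range, covering [base, base + size)."""
    base: int
    size: int  # deliberately a size, so it need not be a power of two

    @property
    def last(self) -> int:
        return self.base + self.size - 1

    def contains(self, iova: int, length: int) -> bool:
        """True if [iova, iova + length) lies entirely inside this window."""
        return length > 0 and iova >= self.base and iova + length - 1 <= self.last


def window_from_addr_width(addr_width: int) -> IovaWindow:
    """addr_width is the special case of a single window [0, 2^addr_width - 1]."""
    return IovaWindow(base=0, size=1 << addr_width)


def find_window(windows: List[IovaWindow], iova: int, length: int) -> Optional[IovaWindow]:
    """Return the window that fully contains the mapping, or None if there is none."""
    for window in windows:
        if window.contains(iova, length):
            return window
    return None


GIB = 1 << 30

# POWER-like layout: a low window for 32-bit DMA and a high window at 2^59.
power_like = [
    IovaWindow(base=0, size=2 * GIB),
    IovaWindow(base=1 << 59, size=64 * GIB),  # high window size is an assumption
]
```

With this layout, a mapping at 3GiB falls in neither window and find_window() returns None, which is the case a single addr_width cannot express.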
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jean-Philippe Brucker > Sent: Wednesday, September 22, 2021 9:45 PM > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > This patch adds IOASID allocation/free interface per iommufd. When > > allocating an IOASID, userspace is expected to specify the type and > > format information for the target I/O page table. > > > > This RFC supports only one type > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > semantics. For this type the user should specify the addr_width of > > the I/O address space and whether the I/O page table is created in > > an iommu enfore_snoop format. enforce_snoop must be true at this > point, > > as the false setting requires additional contract with KVM on handling > > WBINVD emulation, which can be added later. > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > for what formats can be specified when allocating an IOASID. > > > > Open: > > - Devices on PPC platform currently use a different iommu driver in vfio. > > Per previous discussion they can also use vfio type1v2 as long as there > > is a way to claim a specific iova range from a system-wide address space. > > Is this the reason for passing addr_width to IOASID_ALLOC? I didn't get > what it's used for or why it's mandatory. But for PPC it sounds like it > should be an address range instead of an upper limit? yes, as this open described, it may need to be a range. But not sure if PPC requires multiple ranges or just one range. Perhaps, David may guide there. Regards, Yi Liu > Thanks, > Jean > > > This requirement doesn't sound PPC specific, as addr_width for pci > devices > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > adopted this design yet. We hope to have formal alignment in v1 > discussion > > and then decide how to incorporate it in v2. 
> > > > - Currently ioasid term has already been used in the kernel > (drivers/iommu/ > > ioasid.c) to represent the hardware I/O address space ID in the wire. It > > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub- > Stream > > ID). We need find a way to resolve the naming conflict between the > hardware > > ID and software handle. One option is to rename the existing ioasid to be > > pasid or ssid, given their full names still sound generic. Appreciate more > > thoughts on this open! ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Thursday, September 23, 2021 9:31 PM > > On Thu, Sep 23, 2021 at 01:20:55PM +, Tian, Kevin wrote: > > > > > this is not a flow for mdev. It's also required for pdev on Intel > > > > platform, > > > > because the pasid table is in HPA space thus must be managed by host > > > > kernel. Even no translation we still need the user to provide the pasid > info. > > > > > > There should be no mandatory vPASID stuff in most of these flows, that > > > is just a special thing ENQCMD virtualization needs. If userspace > > > isn't doing ENQCMD virtualization it shouldn't need to touch this > > > stuff. > > > > No. for one, we also support SVA w/o using ENQCMD. For two, the key > > is that the PASID table cannot be delegated to the userspace like ARM > > or AMD. This implies that for any pasid that the userspace wants to > > enable, it must be configured via the kernel. > > Yes, configured through the kernel, but the simplified flow should > have the kernel handle everything and just emit a PASID for userspace > to use. > > > > just for a short summary of PASID model from previous design RFC: > > > > for arm/amd: > > - pasid space delegated to userspace > > - pasid table delegated to userspace > > - just one call to bind pasid_table() then pasids are fully managed by > user > > > > for intel: > > - pasid table is always managed by kernel > > - for pdev, > > - pasid space is delegated to userspace > > - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the > pasid entry > > - for mdev, > > - pasid space is managed by userspace > > - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to > ppasid. iommufd setups the ppasid entry > > - additional a contract to kvm for setup CPU pasid translation > if enqcmd is used > > - to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let > underlying driver to figure out whether vpasid should be translated. > > All cases should support a kernel owned ioas associated with a > PASID. 
This is the universal basic API that all PASID supporting > IOMMUs need to implement. > > I should not need to write generic users space that has to know how to > setup architecture specific nested userspace page tables just to use > PASID! ah, got you! I have to admit that my previous thoughts are all from VM p.o.v, with true userspace application ignored... > > All of the above is qemu accelerated vIOMMU stuff. It is a good idea > to keep the two areas seperate as it greatly informs what is general > code and what is HW specific code. > Agree. will think more along this direction. possibly this discussion deviated a lot from what this skeleton series provide. We still have plenty of time to figure it out when starting the pasid support. For now at least the minimal output is that PASID might be a good candidate to be used in iommufd. 😊 Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 01:20:55PM +, Tian, Kevin wrote:
> > > this is not a flow for mdev. It's also required for pdev on Intel
> > > platform, because the pasid table is in HPA space thus must be
> > > managed by host kernel. Even no translation we still need the user
> > > to provide the pasid info.
> >
> > There should be no mandatory vPASID stuff in most of these flows, that
> > is just a special thing ENQCMD virtualization needs. If userspace
> > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > stuff.
>
> No. for one, we also support SVA w/o using ENQCMD. For two, the key
> is that the PASID table cannot be delegated to the userspace like ARM
> or AMD. This implies that for any pasid that the userspace wants to
> enable, it must be configured via the kernel.

Yes, configured through the kernel, but the simplified flow should
have the kernel handle everything and just emit a PASID for userspace
to use.

> just for a short summary of PASID model from previous design RFC:
>
> for arm/amd:
>   - pasid space delegated to userspace
>   - pasid table delegated to userspace
>   - just one call to bind pasid_table() then pasids are fully managed by
>     user
>
> for intel:
>   - pasid table is always managed by kernel
>   - for pdev,
>     - pasid space is delegated to userspace
>     - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the
>       pasid entry
>   - for mdev,
>     - pasid space is managed by userspace
>     - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to
>       ppasid. iommufd setups the ppasid entry
>     - additional a contract to kvm for setup CPU pasid translation
>       if enqcmd is used
>   - to unify pdev/mdev, just always call it vpasid in attach_ioasid().
>     let underlying driver to figure out whether vpasid should be translated.

All cases should support a kernel owned ioas associated with a
PASID. This is the universal basic API that all PASID supporting
IOMMUs need to implement.

I should not need to write generic userspace that has to know how to
set up architecture specific nested userspace page tables just to use
PASID!

All of the above is qemu accelerated vIOMMU stuff. It is a good idea
to keep the two areas separate as it greatly informs what is general
code and what is HW specific code.

Jason
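
The two positions in this exchange can be put side by side in a small toy model. This is only a sketch in Python under stated assumptions, not the iommufd uAPI or any kernel code: ToyIommu, attach_kernel_pasid and attach_user_pasid are hypothetical names, and keeping PASID 0 reserved is an assumption of the example. It shows the "kernel picks and emits a PASID" path next to the user-provided PASID path, where the value is either used directly (pdev) or translated from vPASID to pPASID (mdev).

```python
class ToyIommu:
    """Toy model of PASID attachment; not a real kernel or iommufd interface."""

    PASID_LIMIT = 1 << 20  # PCIe PASIDs are at most 20 bits wide

    def __init__(self):
        self._next_pasid = 1         # assumption: PASID 0 is kept reserved
        self.entries = {}            # (device, pPASID) -> ioas_id
        self.vpasid_to_ppasid = {}   # (device, vPASID) -> pPASID

    def _alloc_pasid(self):
        if self._next_pasid >= self.PASID_LIMIT:
            raise RuntimeError("out of PASIDs")
        pasid = self._next_pasid
        self._next_pasid += 1
        return pasid

    def _install(self, device, ppasid, ioas_id):
        """Record the PASID entry that the hardware would use for this device."""
        if (device, ppasid) in self.entries:
            raise ValueError("PASID already in use on this device")
        self.entries[(device, ppasid)] = ioas_id
        return ppasid

    def attach_kernel_pasid(self, device, ioas_id):
        """Simplified flow: the kernel picks the PASID and hands it back."""
        return self._install(device, self._alloc_pasid(), ioas_id)

    def attach_user_pasid(self, device, ioas_id, vpasid, translate):
        """User-provided PASID: used as-is (pdev) or mapped to a pPASID (mdev)."""
        if not translate:
            return self._install(device, vpasid, ioas_id)
        ppasid = self._install(device, self._alloc_pasid(), ioas_id)
        self.vpasid_to_ppasid[(device, vpasid)] = ppasid
        return ppasid


iommu = ToyIommu()
# A DPDK-like user never chooses a PASID; it only programs the returned one.
emitted = iommu.attach_kernel_pasid("vfio-dev0", ioas_id=7)
```

In this model only attach_user_pasid() with translate=True needs the vPASID table; a plain userspace driver never touches it.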
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Thursday, September 23, 2021 9:02 PM > > On Thu, Sep 23, 2021 at 12:45:17PM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Thursday, September 23, 2021 8:31 PM > > > > > > On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > > > Sent: Thursday, September 23, 2021 8:07 PM > > > > > > > > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > > > > > > > > > > currently the type is aimed to differentiate three usages: > > > > > > > > > > > > - kernel-managed I/O page table > > > > > > - user-managed I/O page table > > > > > > - shared I/O page table (e.g. with mm, or ept) > > > > > > > > > > Creating a shared ios is something that should probably be a different > > > > > command. > > > > > > > > why? I didn't understand the criteria here... > > > > > > I suspect the input args will be very different, no? > > > > yes, but can't the structure be extended to incorporate it? > > You need to be thoughtful, giant structures with endless combinations > of optional fields turn out very hard. I haven't even seen what args > this shared thing will need, but I'm guessing it is almost none, so > maybe a new call is OK? To judge this looks we may have to do some practice on this front e.g. coming up an example structure for future intended usages and then see whether one structure can fit? > > If it is literally just 'give me an ioas for current mm' then it has > no args or complexity at all. for mm, yes, should be simple. for ept it might be more complex e.g. requiring a handle in kvm and some other format info to match ept page table. > > > > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > > > > > > indicator? their difference is not about format. > > > > > > > > > > Format should be > > > > > > > > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc > > > > > > > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format? 
> > > > > > So long as we are using structs we need to have values then the field > > > isn't being used. FORMAT_KERNEL is a reasonable value to have when we > > > are not creating a userspace page table. > > > > > > Alternatively a userspace page table could have a different API > > > > I don't know. Your comments really confused me on what's the right > > way to design the uAPI. If you still remember, the original v1 proposal > > introduced different uAPIs for kernel/user-managed cases. Then you > > recommended to consolidate everything related to ioas in one allocation > > command. > > This is because you had almost completely duplicated the input args > between the two calls. > > If it turns out they have very different args, then they should have > different calls. > > > > > - open iommufd > > > > - create an ioas > > > > - attach vfio device to ioasid, with vPASID info > > > > * vfio converts vPASID to pPASID and then call > > > iommufd_device_attach_ioasid() > > > > * the latter then installs ioas to the IOMMU with RID/PASID > > > > > > This was your flow for mdev's, I've always been talking about wanting > > > to see this supported for all use cases, including physical PCI > > > devices w/ PASID support. > > > > this is not a flow for mdev. It's also required for pdev on Intel platform, > > because the pasid table is in HPA space thus must be managed by host > > kernel. Even no translation we still need the user to provide the pasid > > info. > > There should be no mandatory vPASID stuff in most of these flows, that > is just a special thing ENQCMD virtualization needs. If userspace > isn't doing ENQCMD virtualization it shouldn't need to touch this > stuff. No. for one, we also support SVA w/o using ENQCMD. For two, the key is that the PASID table cannot be delegated to the userspace like ARM or AMD. This implies that for any pasid that the userspace wants to enable, it must be configured via the kernel. 
> > > as explained earlier, on Intel platform the user always needs to provide > > a PASID in the attaching call. whether it's directly used (for pdev) > > or translated (for mdev) is the underlying driver thing. From kernel > > p.o.v, since this PASID is provided by the user, it's fine to call it vPASID > > in the uAPI. > > I've always disagreed with this. There should be an option for the > kernel to pick an appropriate PASID for portability to other IOMMUs > and simplicity of the interface. > > You need to keep it clear what is in the minimum basic path and what > is needed for special cases, like ENQCMD virtualization. > > Not every user of iommufd is doing virtualization. > just for a short summary of PASID model from previous design RFC: for arm/amd: - pasid space delegated to userspace - pasid table delegated to userspace - just one call to bind pasid_table() then pasids are fully managed by user for intel: - pasid table is always managed by kernel -
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 12:45:17PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, September 23, 2021 8:31 PM
> >
> > On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > >
> > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > > >
> > > > > currently the type is aimed to differentiate three usages:
> > > > >
> > > > > - kernel-managed I/O page table
> > > > > - user-managed I/O page table
> > > > > - shared I/O page table (e.g. with mm, or ept)
> > > >
> > > > Creating a shared ios is something that should probably be a different
> > > > command.
> > >
> > > why? I didn't understand the criteria here...
> >
> > I suspect the input args will be very different, no?
>
> yes, but can't the structure be extended to incorporate it?

You need to be thoughtful, giant structures with endless combinations
of optional fields turn out very hard. I haven't even seen what args
this shared thing will need, but I'm guessing it is almost none, so
maybe a new call is OK?

If it is literally just 'give me an ioas for current mm' then it has
no args or complexity at all.

> > > > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > > > indicator? their difference is not about format.
> > > >
> > > > Format should be
> > > >
> > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > >
> > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> >
> > So long as we are using structs we need to have values then the field
> > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > are not creating a userspace page table.
> >
> > Alternatively a userspace page table could have a different API
>
> I don't know. Your comments really confused me on what's the right
> way to design the uAPI. If you still remember, the original v1 proposal
> introduced different uAPIs for kernel/user-managed cases. Then you
> recommended to consolidate everything related to ioas in one allocation
> command.

This is because you had almost completely duplicated the input args
between the two calls.

If it turns out they have very different args, then they should have
different calls.

> > > - open iommufd
> > > - create an ioas
> > > - attach vfio device to ioasid, with vPASID info
> > >   * vfio converts vPASID to pPASID and then call
> > >     iommufd_device_attach_ioasid()
> > >   * the latter then installs ioas to the IOMMU with RID/PASID
> >
> > This was your flow for mdev's, I've always been talking about wanting
> > to see this supported for all use cases, including physical PCI
> > devices w/ PASID support.
>
> this is not a flow for mdev. It's also required for pdev on Intel platform,
> because the pasid table is in HPA space thus must be managed by host
> kernel. Even no translation we still need the user to provide the pasid info.

There should be no mandatory vPASID stuff in most of these flows, that
is just a special thing ENQCMD virtualization needs. If userspace
isn't doing ENQCMD virtualization it shouldn't need to touch this
stuff.

> as explained earlier, on Intel platform the user always needs to provide
> a PASID in the attaching call. whether it's directly used (for pdev)
> or translated (for mdev) is the underlying driver thing. From kernel
> p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> in the uAPI.

I've always disagreed with this. There should be an option for the
kernel to pick an appropriate PASID for portability to other IOMMUs
and simplicity of the interface.

You need to keep it clear what is in the minimum basic path and what
is needed for special cases, like ENQCMD virtualization.

Not every user of iommufd is doing virtualization.

Jason
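
The struct-versus-separate-call argument above can be sketched as follows. This is an illustrative Python model only, not the proposed uAPI: the Format values reuse the FORMAT_KERNEL / FORMAT_INTEL_PTE_V1 / FORMAT_INTEL_PTE_V2 names from the thread, while IoasAllocRequest, its user_pgtable_addr field, validate_alloc and alloc_shared_with_current_mm are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Format(Enum):
    KERNEL = 0        # kernel-managed I/O page table, no user table supplied
    INTEL_PTE_V1 = 1  # user-managed page table formats
    INTEL_PTE_V2 = 2


@dataclass
class IoasAllocRequest:
    """One allocation struct shared by the kernel- and user-managed cases."""
    format: Format
    user_pgtable_addr: Optional[int] = None  # hypothetical field


def validate_alloc(req: IoasAllocRequest) -> IoasAllocRequest:
    """Reject field combinations that make no sense for the chosen format."""
    if req.format is Format.KERNEL:
        if req.user_pgtable_addr is not None:
            raise ValueError("FORMAT_KERNEL takes no user page table")
    elif req.user_pgtable_addr is None:
        raise ValueError("user-managed formats need a page table address")
    return req


def alloc_shared_with_current_mm() -> str:
    """Separate entry point: 'give me an ioas for current mm' needs no arguments."""
    return "ioas-shared-with-current-mm"  # placeholder handle for the sketch
```

Every optional field added to the shared struct adds another branch to validate_alloc(), which is the cost being weighed against a dedicated call whose arguments are different or absent.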
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Thursday, September 23, 2021 8:31 PM > > On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Thursday, September 23, 2021 8:07 PM > > > > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > > > > > > currently the type is aimed to differentiate three usages: > > > > > > > > - kernel-managed I/O page table > > > > - user-managed I/O page table > > > > - shared I/O page table (e.g. with mm, or ept) > > > > > > Creating a shared ios is something that should probably be a different > > > command. > > > > why? I didn't understand the criteria here... > > I suspect the input args will be very different, no? yes, but can't the structure be extended to incorporate it? > > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > > > > indicator? their difference is not about format. > > > > > > Format should be > > > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc > > > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format? > > So long as we are using structs we need to have values then the field > isn't being used. FORMAT_KERNEL is a reasonable value to have when we > are not creating a userspace page table. > > Alternatively a userspace page table could have a different API I don't know. Your comments really confused me on what's the right way to design the uAPI. If you still remember, the original v1 proposal introduced different uAPIs for kernel/user-managed cases. Then you recommended to consolidate everything related to ioas in one allocation command. Can you help articulate the criteria first? > > > yes, the user can query the permitted range using DEVICE_GET_INFO. > > But in the end if the user wants two separate regions, I'm afraid that > > the underlying iommu driver wants to know the exact info. iirc PPC > > has one global system address space shared by all devices. 
It is possible > > that the user may want to claim range-A and range-C, with range-B > > in-between but claimed by another user. Then simply using one hint > > range [A-lowend, C-highend] might not work. > > I don't know, that sounds strange.. In any event hint is a hint, it > can be ignored, the only information the kernel needs to extract is > low/high bank? iirc Dave said that the user needs to claim a range explicitly. 'claim' sounds not a hint to me. Possibly it's time for Dave to chime in. > > > yes PPC can use different format, but I didn't understand why it is > > related user-managed page table which further requires nesting. sound > > disconnected topics here... > > It is just a way to feed through more information if we get stuck > someday. You mean that we should define uAPI for all future possible extensions now to minimize the frequency of changing it? > > > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus, > > > and it MUST be exposed in that format to be programmed into the PCI > > > device itself. > > > > In the entire discussion in previous design RFC, I kept an impression that > > ARM-equivalent PASID is called SSID. If we can use PASID as a general > > term in iommufd context, definitely it's much better! > > SSID is inside the chip and part of the IOMMU. PASID is part of the > PCI spec. > > iommufd should keep these things distinct. > > If we are talking about a PCI TLP then the name to use is PASID. If Jean doesn't object... > > > > All of this should be able to support a userspace, like DPDK, creating > > > a PASID on its own without any special VFIO drivers. 
> > > > > > - Open iommufd > > > - Attach the vfio device FD > > > - Request a PASID device id > > > - Create an ios against the pasid device id > > > - Query the ios for the PCI PASID # > > > - Program the HW to issue TLPs with the PASID > > > > this all makes me very confused, and completely different from what > > we agreed in previous v2 design proposal: > > > > - open iommufd > > - create an ioas > > - attach vfio device to ioasid, with vPASID info > > * vfio converts vPASID to pPASID and then call > iommufd_device_attach_ioasid() > > * the latter then installs ioas to the IOMMU with RID/PASID > > This was your flow for mdev's, I've always been talking about wanting > to see this supported for all use cases, including physical PCI > devices w/ PASID support. this is not a flow for mdev. It's also required for pdev on Intel platform, because the pasid table is in HPA space thus must be managed by host kernel. Even no translation we still need the user to provide the pasid info. > > A normal vfio_pci userspace should be able to create PASIDs unrelated > to the mdev stuff. > > > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is > > > really about presenting a consistent IOMMU handle to the guest across > > > migration, it is not the value that shows up on the PCI bus. > > > > It's the former. But vfio driver needs to maintain vPASID->pPAS
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, September 23, 2021 8:07 PM > > > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > > > > currently the type is aimed to differentiate three usages: > > > > > > - kernel-managed I/O page table > > > - user-managed I/O page table > > > - shared I/O page table (e.g. with mm, or ept) > > > > Creating a shared ios is something that should probably be a different > > command. > > why? I didn't understand the criteria here... I suspect the input args will be very different, no? > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > > > indicator? their difference is not about format. > > > > Format should be > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format? So long as we are using structs we need to have values then the field isn't being used. FORMAT_KERNEL is a reasonable value to have when we are not creating a userspace page table. Alternatively a userspace page table could have a different API > yes, the user can query the permitted range using DEVICE_GET_INFO. > But in the end if the user wants two separate regions, I'm afraid that > the underlying iommu driver wants to know the exact info. iirc PPC > has one global system address space shared by all devices. It is possible > that the user may want to claim range-A and range-C, with range-B > in-between but claimed by another user. Then simply using one hint > range [A-lowend, C-highend] might not work. I don't know, that sounds strange.. In any event hint is a hint, it can be ignored, the only information the kernel needs to extract is low/high bank? > yes PPC can use different format, but I didn't understand why it is > related user-managed page table which further requires nesting. sound > disconnected topics here... It is just a way to feed through more information if we get stuck someday. 
> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus, > > and it MUST be exposed in that format to be programmed into the PCI > > device itself. > > In the entire discussion in previous design RFC, I kept an impression that > ARM-equivalent PASID is called SSID. If we can use PASID as a general > term in iommufd context, definitely it's much better! SSID is inside the chip and part of the IOMMU. PASID is part of the PCI spec. iommufd should keep these things distinct. If we are talking about a PCI TLP then the name to use is PASID. > > All of this should be able to support a userspace, like DPDK, creating > > a PASID on its own without any special VFIO drivers. > > > > - Open iommufd > > - Attach the vfio device FD > > - Request a PASID device id > > - Create an ios against the pasid device id > > - Query the ios for the PCI PASID # > > - Program the HW to issue TLPs with the PASID > > this all makes me very confused, and completely different from what > we agreed in previous v2 design proposal: > > - open iommufd > - create an ioas > - attach vfio device to ioasid, with vPASID info > * vfio converts vPASID to pPASID and then call > iommufd_device_attach_ioasid() > * the latter then installs ioas to the IOMMU with RID/PASID This was your flow for mdev's, I've always been talking about wanting to see this supported for all use cases, including physical PCI devices w/ PASID support. A normal vfio_pci userspace should be able to create PASIDs unrelated to the mdev stuff. > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is > > really about presenting a consistent IOMMU handle to the guest across > > migration, it is not the value that shows up on the PCI bus. > > It's the former. But vfio driver needs to maintain vPASID->pPASID > translation in the mediation path, since what guest programs is vPASID. The pPASID definately is a PASID as it goes out on the PCIe wire Suggest you come up with a more general name for vPASID? 
Jason
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Thursday, September 23, 2021 8:07 PM > > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote: > > > currently the type is aimed to differentiate three usages: > > > > - kernel-managed I/O page table > > - user-managed I/O page table > > - shared I/O page table (e.g. with mm, or ept) > > Creating a shared ios is something that should probably be a different > command. why? I didn't understand the criteria here... > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good > > indicator? their difference is not about format. > > Format should be > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format? > > > Dave's links didn't answer one puzzle from me. Does PPC needs accurate > > range information or be ok with a large range including holes (then let > > the kernel to figure out where the holes locate)? > > My impression was it only needed a way to select between the two > different cases as they are exclusive. I'd see this API as being a > hint and userspace should query the exact ranges to learn what was > actually created. yes, the user can query the permitted range using DEVICE_GET_INFO. But in the end if the user wants two separate regions, I'm afraid that the underlying iommu driver wants to know the exact info. iirc PPC has one global system address space shared by all devices. It is possible that the user may want to claim range-A and range-C, with range-B in-between but claimed by another user. Then simply using one hint range [A-lowend, C-highend] might not work. > > > > device-specific escape if more specific customization is needed and is > > > needed to specify user space page tables anyhow. > > > > and I didn't understand the 2nd link. How does user-managed page > > table jump into this range claim problem? I'm getting confused... > > PPC could also model it using a FORMAT_KERNEL_PPC_X, > FORMAT_KERNEL_PPC_Y > though it is less nice.. 
yes PPC can use different format, but I didn't understand why it is related user-managed page table which further requires nesting. sound disconnected topics here... > > > > Yes, ioas_id should always be the xarray index. > > > > > > PASID needs to be called out as PASID or as a generic "hw description" > > > blob. > > > > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid? > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus, > and it MUST be exposed in that format to be programmed into the PCI > device itself. In the entire discussion in previous design RFC, I kept an impression that ARM-equivalent PASID is called SSID. If we can use PASID as a general term in iommufd context, definitely it's much better! > > All of this should be able to support a userspace, like DPDK, creating > a PASID on its own without any special VFIO drivers. > > - Open iommufd > - Attach the vfio device FD > - Request a PASID device id > - Create an ios against the pasid device id > - Query the ios for the PCI PASID # > - Program the HW to issue TLPs with the PASID this all makes me very confused, and completely different from what we agreed in previous v2 design proposal: - open iommufd - create an ioas - attach vfio device to ioasid, with vPASID info * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid() * the latter then installs ioas to the IOMMU with RID/PASID > > > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the > > kernel. Do we want to clear this confusion? Or possibly it's fine because > > ioas_id is never used outside of iommufd and iommufd doesn't directly > > call ioasid_alloc() from ioasid.c? > > As long as it is ioas_id and ioasid it is probably fine.. let's align with others in a few hours. > > > > kvm's API to program the vPASID translation table should probably take > > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side > > > information using an in-kernel API. 
Userspace shouldn't have to > > > shuttle it around. > > > > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI. > > when kvm calls iommufd with above tuple, vPASID->pPASID is > > returned to kvm. So we still need a generic blob to represent > > vPASID in the uAPI. > > I think you have to be clear about what the value is being used > for. Is it an IOMMU page table handle or is it a PCI PASID value? > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is > really about presenting a consistent IOMMU handle to the guest across > migration, it is not the value that shows up on the PCI bus. > It's the former. But vfio driver needs to maintain vPASID->pPASID translation in the mediation path, since what guest programs is vPASID. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
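The range-A/range-C concern raised above can be illustrated with a small sketch. It assumes inclusive IOVA ranges and uses a hypothetical `struct iova_range` plus two helper names that appear nowhere in the RFC. It only shows why a single covering hint `[A-lowend, C-highend]` necessarily includes a range-B that lies in between and may belong to another user:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inclusive IOVA range; not part of any proposed uAPI. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/* True if the two inclusive ranges share at least one address. */
static bool iova_ranges_overlap(struct iova_range a, struct iova_range b)
{
	return a.start <= b.last && b.start <= a.last;
}

/* Smallest single range covering both a and c, i.e. what one hint would have to be. */
static struct iova_range iova_range_span(struct iova_range a, struct iova_range c)
{
	struct iova_range s = {
		.start = a.start < c.start ? a.start : c.start,
		.last = a.last > c.last ? a.last : c.last,
	};
	return s;
}
```

With A and C disjoint from B, `iova_range_span(A, C)` still overlaps B, which is the case where a single hint is ambiguous and the driver would need either two hints or an exact range list.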
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:

> currently the type is aimed to differentiate three usages:
>
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)

Creating a shared ios is something that should probably be a different
command.

> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

Format should be

FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

> Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

My impression was it only needed a way to select between the two
different cases as they are exclusive. I'd see this API as being a
hint and userspace should query the exact ranges to learn what was
actually created.

> > device-specific escape if more specific customization is needed and is
> > needed to specify user space page tables anyhow.
>
> and I didn't understand the 2nd link. How does user-managed page
> table jump into this range claim problem? I'm getting confused...

PPC could also model it using a FORMAT_KERNEL_PPC_X, FORMAT_KERNEL_PPC_Y
though it is less nice..

> > Yes, ioas_id should always be the xarray index.
> >
> > PASID needs to be called out as PASID or as a generic "hw description"
> > blob.
>
> ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
and it MUST be exposed in that format to be programmed into the PCI
device itself.

All of this should be able to support a userspace, like DPDK, creating
a PASID on its own without any special VFIO drivers.

- Open iommufd
- Attach the vfio device FD
- Request a PASID device id
- Create an ios against the pasid device id
- Query the ios for the PCI PASID #
- Program the HW to issue TLPs with the PASID

> and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> kernel. Do we want to clear this confusion? Or possibly it's fine because
> ioas_id is never used outside of iommufd and iommufd doesn't directly
> call ioasid_alloc() from ioasid.c?

As long as it is ioas_id and ioasid it is probably fine..

> > kvm's API to program the vPASID translation table should probably take
> > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > information using an in-kernel API. Userspace shouldn't have to
> > shuttle it around.
>
> the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> when kvm calls iommufd with above tuple, vPASID->pPASID is
> returned to kvm. So we still need a generic blob to represent
> vPASID in the uAPI.

I think you have to be clear about what the value is being used
for. Is it an IOMMU page table handle or is it a PCI PASID value?

AFAICT I think it is the former in the Intel scheme as the "vPASID" is
really about presenting a consistent IOMMU handle to the guest across
migration, it is not the value that shows up on the PCI bus.

Jason
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Wednesday, September 22, 2021 10:09 PM > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > allocating an IOASID, userspace is expected to specify the type and > > > > format information for the target I/O page table. > > > > > > > > This RFC supports only one type > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > semantics. For this type the user should specify the addr_width of > > > > the I/O address space and whether the I/O page table is created in > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > point, > > > > as the false setting requires additional contract with KVM on handling > > > > WBINVD emulation, which can be added later. > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next > patch) > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > Open: > > > > - Devices on PPC platform currently use a different iommu driver in > > > > vfio. > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > there > > > > is a way to claim a specific iova range from a system-wide address > space. > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > hasn't > > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > > and then decide how to incorporate it in v2. > > > > > > I think the request was to include a start/end IO address hint when > > > creating the ios. When the kernel creates it then it can return the > > > > is the hint single-range or could be multiple-ranges? 
> > David explained it here: > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/ > > qeumu needs to be able to chooose if it gets the 32 bit range or 64 > bit range. > > So a 'range hint' will do the job > > David also suggested this: > > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/ > > So I like this better: > > struct iommu_ioasid_alloc { > __u32 argsz; > > __u32 flags; > #define IOMMU_IOASID_ENFORCE_SNOOP(1 << 0) > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1) > > __aligned_u64 max_iova_hint; > __aligned_u64 base_iova_hint; // Used only if > IOMMU_IOASID_HINT_BASE_IOVA > > // For creating nested page tables > __u32 parent_ios_id; > __u32 format; > #define IOMMU_FORMAT_KERNEL 0 > #define IOMMU_FORMAT_PPC_XXX 2 > #define IOMMU_FORMAT_[..] > u32 format_flags; // Layout depends on format above > > __aligned_u64 user_page_directory; // Used if parent_ios_id != 0 > }; > > Again 'type' as an overall API indicator should not exist, feature > flags need to have clear narrow meanings. currently the type is aimed to differentiate three usages: - kernel-managed I/O page table - user-managed I/O page table - shared I/O page table (e.g. with mm, or ept) we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good indicator? their difference is not about format. > > This does both of David's suggestions at once. If quemu wants the 1G > limited region it could specify max_iova_hint = 1G, if it wants the > extend 64bit region with the hole it can give either the high base or > a large max_iova_hint. format/format_flags allows a further Dave's links didn't answer one puzzle from me. Does PPC needs accurate range information or be ok with a large range including holes (then let the kernel to figure out where the holes locate)? > device-specific escape if more specific customization is needed and is > needed to specify user space page tables anyhow. and I didn't understand the 2nd link. How does user-managed page table jump into this range claim problem? 
I'm getting confused... > > > > ioas works well here I think. Use ioas_id to refer to the xarray > > > index. > > > > What about when introducing pasid to this uAPI? Then use ioas_id > > for the xarray index > > Yes, ioas_id should always be the xarray index. > > PASID needs to be called out as PASID or as a generic "hw description" > blob. ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid? and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the kernel. Do we want to clear this confusion? Or possibly it's fine because ioas_id is never used outside of iommufd and iommufd doesn't directly call ioasid_alloc() from ioasid.c? > > kvm's API to program the vPASID translation table should probably take > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side > information using an in-kernel API. Userspace shouldn't have to > shuttle it around. the vPASID info is carried
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe
> Sent: Wednesday, September 22, 2021 9:32 PM
>
> On Wed, Sep 22, 2021 at 12:51:38PM +, Liu, Yi L wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > [...]
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > > index 641f199f2d41..4839f128b24a 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -24,6 +24,7 @@
> > > >  struct iommufd_ctx {
> > > >  	refcount_t refs;
> > > >  	struct mutex lock;
> > > > +	struct xarray ioasid_xa; /* xarray of ioasids */
> > > >  	struct xarray device_xa; /* xarray of bound devices */
> > > >  };
> > > >
> > > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > >  	u64 dev_cookie;
> > > >  };
> > > >
> > > > +/* Represent an I/O address space */
> > > > +struct iommufd_ioas {
> > > > +	int ioasid;
> > >
> > > xarray id's should consistently be u32s everywhere.
> >
> > sure. just one more check, this id is supposed to be returned to
> > userspace as the return value of ioctl(IOASID_ALLOC). That's why
> > I chose to use "int" as its prototype to make it aligned with the
> > return type of ioctl(). Based on this, do you think it's still better
> > to use "u32" here?
>
> I suggest not using the return code from ioctl to exchange data.. The
> rest of the uAPI uses an in/out struct, everything should do
> that consistently.

got it.

Thanks,
Yi Liu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > discussion > > > and then decide how to incorporate it in v2. > > > > I think the request was to include a start/end IO address hint when > > creating the ios. When the kernel creates it then it can return the > > is the hint single-range or could be multiple-ranges? David explained it here: https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/ qeumu needs to be able to chooose if it gets the 32 bit range or 64 bit range. 
So a 'range hint' will do the job

David also suggested this:

 https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/

So I like this better:

struct iommu_ioasid_alloc {
	__u32 argsz;

	__u32 flags;
#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1 << 1)

	__aligned_u64 max_iova_hint;
	__aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA

	// For creating nested page tables
	__u32 parent_ios_id;
	__u32 format;
#define IOMMU_FORMAT_KERNEL 0
#define IOMMU_FORMAT_PPC_XXX 2
#define IOMMU_FORMAT_[..]
	__u32 format_flags; // Layout depends on format above

	__aligned_u64 user_page_directory; // Used if parent_ios_id != 0
};

Again 'type' as an overall API indicator should not exist, feature
flags need to have clear narrow meanings.

This does both of David's suggestions at once. If qemu wants the 1G
limited region it could specify max_iova_hint = 1G, if it wants the
extended 64bit region with the hole it can give either the high base or
a large max_iova_hint. format/format_flags allows a further
device-specific escape if more specific customization is needed and is
needed to specify user space page tables anyhow.

> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
>
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index

Yes, ioas_id should always be the xarray index.

PASID needs to be called out as PASID or as a generic "hw description"
blob.

kvm's API to program the vPASID translation table should probably take
in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
information using an in-kernel API. Userspace shouldn't have to
shuttle it around.

I'm starting to feel like the struct approach for describing this uAPI
might not scale well, but let's see..

Jason
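To make the two qemu cases in the mail above concrete, here is a minimal userspace-side sketch of filling in the proposed request. It is only an illustration: the struct is a local mirror of the layout sketched in the mail (with `uint32_t`/`uint64_t` standing in for `__u32`/`__aligned_u64`), it is not a real kernel header, and the helpers `ioas_req_low()` and `ioas_req_high()` are hypothetical names. Whether `max_iova_hint` is an inclusive last address or a size is not settled in the thread, so the value is passed through unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct proposed in the mail; NOT a real kernel header. */
#define IOMMU_IOASID_ENFORCE_SNOOP	(1u << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1u << 1)
#define IOMMU_FORMAT_KERNEL		0

struct iommu_ioasid_alloc {
	uint32_t argsz;
	uint32_t flags;
	uint64_t max_iova_hint;
	uint64_t base_iova_hint;	/* used only with IOMMU_IOASID_HINT_BASE_IOVA */
	uint32_t parent_ios_id;		/* for creating nested page tables */
	uint32_t format;
	uint32_t format_flags;		/* layout depends on format */
	uint64_t user_page_directory;	/* used if parent_ios_id != 0 */
};

/* Hypothetical helper: kernel-managed table with only an upper-limit hint. */
static struct iommu_ioasid_alloc ioas_req_low(uint64_t max_iova)
{
	struct iommu_ioasid_alloc req;

	memset(&req, 0, sizeof(req));
	req.argsz = sizeof(req);
	req.flags = IOMMU_IOASID_ENFORCE_SNOOP;
	req.max_iova_hint = max_iova;
	req.format = IOMMU_FORMAT_KERNEL;
	return req;
}

/* Hypothetical helper: same request, but with a high base address hint. */
static struct iommu_ioasid_alloc ioas_req_high(uint64_t base_iova, uint64_t max_iova)
{
	struct iommu_ioasid_alloc req = ioas_req_low(max_iova);

	req.flags |= IOMMU_IOASID_HINT_BASE_IOVA;
	req.base_iova_hint = base_iova;
	return req;
}
```

For the 1G-limited region a caller would use something like `ioas_req_low(1ULL << 30)`. For the extended 64-bit window it would use `ioas_req_high()` with whatever base the platform defines, then query the geometry that was actually created.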
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > This patch adds IOASID allocation/free interface per iommufd. When > allocating an IOASID, userspace is expected to specify the type and > format information for the target I/O page table. > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > implying a kernel-managed I/O page table with vfio type1v2 mapping > semantics. For this type the user should specify the addr_width of > the I/O address space and whether the I/O page table is created in > an iommu enfore_snoop format. enforce_snoop must be true at this point, > as the false setting requires additional contract with KVM on handling > WBINVD emulation, which can be added later. > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > for what formats can be specified when allocating an IOASID. > > Open: > - Devices on PPC platform currently use a different iommu driver in vfio. > Per previous discussion they can also use vfio type1v2 as long as there > is a way to claim a specific iova range from a system-wide address space. Is this the reason for passing addr_width to IOASID_ALLOC? I didn't get what it's used for or why it's mandatory. But for PPC it sounds like it should be an address range instead of an upper limit? Thanks, Jean > This requirement doesn't sound PPC specific, as addr_width for pci devices > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > adopted this design yet. We hope to have formal alignment in v1 discussion > and then decide how to incorporate it in v2. > > - Currently ioasid term has already been used in the kernel (drivers/iommu/ > ioasid.c) to represent the hardware I/O address space ID in the wire. It > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream > ID). We need find a way to resolve the naming conflict between the hardware > ID and software handle. 
One option is to rename the existing ioasid to be > pasid or ssid, given their full names still sound generic. Appreciate more > thoughts on this open!
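The patch description notes that addr_width can also be expressed as the range [0, 2^addr_width-1], which is what a range-based interface would carry instead of an upper limit. A small sketch of that conversion follows; the helper name `iova_last_from_width()` is hypothetical, and treating widths of 0 or above 64 as invalid is an assumption of this example rather than something the RFC states.

```c
#include <stdint.h>

/*
 * Last valid IOVA for a given address width, i.e. 2^addr_width - 1.
 * Widths of 0 or above 64 are treated as invalid here and return 0.
 */
static uint64_t iova_last_from_width(unsigned int addr_width)
{
	if (addr_width == 0 || addr_width > 64)
		return 0;
	if (addr_width == 64)
		return UINT64_MAX;	/* 1ULL << 64 would be undefined behaviour */
	return (1ULL << addr_width) - 1;
}
```

The 64-bit case is handled separately because shifting a 64-bit value by 64 is undefined in C. A range-based uAPI avoids the problem by carrying the last address directly.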
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Sep 22, 2021 at 12:51:38PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, September 22, 2021 1:45 AM
> >
> [...]
> > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > index 641f199f2d41..4839f128b24a 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -24,6 +24,7 @@
> > >  struct iommufd_ctx {
> > >  	refcount_t refs;
> > >  	struct mutex lock;
> > > +	struct xarray ioasid_xa; /* xarray of ioasids */
> > >  	struct xarray device_xa; /* xarray of bound devices */
> > >  };
> > >
> > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > >  	u64 dev_cookie;
> > >  };
> > >
> > > +/* Represent an I/O address space */
> > > +struct iommufd_ioas {
> > > +	int ioasid;
> >
> > xarray id's should consistently be u32s everywhere.
>
> sure. just one more check, this id is supposed to be returned to
> userspace as the return value of ioctl(IOASID_ALLOC). That's why
> I chose to use "int" as its prototype to make it aligned with the
> return type of ioctl(). Based on this, do you think it's still better
> to use "u32" here?

I suggest not using the return code from ioctl to exchange data.. The
rest of the uAPI uses an in/out struct, everything should do
that consistently.

Jason
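The in/out struct convention suggested above could look roughly like the following sketch. Everything here is hypothetical: `struct ioas_alloc_args`, its `out_ioas_id` field and `mock_ioas_alloc()` are illustrative stand-ins, not the RFC's definitions. The point is only that the return value carries 0 or a negative errno while the u32 ID travels in the struct, so IDs above INT_MAX stay representable.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical in/out argument: the kernel would fill out_ioas_id on success. */
struct ioas_alloc_args {
	uint32_t argsz;
	uint32_t flags;
	uint32_t out_ioas_id;
};

/* Stand-in for the ioctl handler: returns 0 or -errno, never the ID itself. */
static int mock_ioas_alloc(struct ioas_alloc_args *args, uint32_t next_id)
{
	if (args->argsz < sizeof(*args))
		return -EINVAL;
	args->out_ioas_id = next_id;
	return 0;
}
```

A caller checks the return code first and reads `out_ioas_id` only on success, the same pattern the other in/out structs in the uAPI would follow.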
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe
> Sent: Wednesday, September 22, 2021 1:45 AM
>
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> >  struct iommufd_ctx {
> >  	refcount_t refs;
> >  	struct mutex lock;
> > +	struct xarray ioasid_xa; /* xarray of ioasids */
> >  	struct xarray device_xa; /* xarray of bound devices */
> >  };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> >  	u64 dev_cookie;
> >  };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > +	int ioasid;
>
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jason Gunthorpe > Sent: Wednesday, September 22, 2021 1:45 AM > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > This patch adds IOASID allocation/free interface per iommufd. When > > allocating an IOASID, userspace is expected to specify the type and > > format information for the target I/O page table. > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > semantics. For this type the user should specify the addr_width of > > the I/O address space and whether the I/O page table is created in > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > as the false setting requires additional contract with KVM on handling > > WBINVD emulation, which can be added later. > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > for what formats can be specified when allocating an IOASID. > > > > Open: > > - Devices on PPC platform currently use a different iommu driver in vfio. > > Per previous discussion they can also use vfio type1v2 as long as there > > is a way to claim a specific iova range from a system-wide address space. > > This requirement doesn't sound PPC specific, as addr_width for pci > devices > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > adopted this design yet. We hope to have formal alignment in v1 > discussion > > and then decide how to incorporate it in v2. > > I think the request was to include a start/end IO address hint when > creating the ios. When the kernel creates it then it can return the is the hint single-range or could be multiple-ranges? > actual geometry including any holes via a query. I'd like to see a detail flow from David on how the uAPI works today with existing spapr driver and what exact changes he'd like to make on this proposed interface. Above info is still insufficient for us to think about the right solution. 
> > > - Currently ioasid term has already been used in the kernel > (drivers/iommu/ > > ioasid.c) to represent the hardware I/O address space ID in the wire. It > > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub- > Stream > > ID). We need find a way to resolve the naming conflict between the > hardware > > ID and software handle. One option is to rename the existing ioasid to be > > pasid or ssid, given their full names still sound generic. Appreciate more > > thoughts on this open! > > ioas works well here I think. Use ioas_id to refer to the xarray > index. What about when introducing pasid to this uAPI? Then use ioas_id for the xarray index and ioasid to represent pasid/ssid? At this point the software handle and hardware id are mixed together thus need a clear terminology to differentiate them. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > This patch adds IOASID allocation/free interface per iommufd. When > allocating an IOASID, userspace is expected to specify the type and > format information for the target I/O page table. > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > implying a kernel-managed I/O page table with vfio type1v2 mapping > semantics. For this type the user should specify the addr_width of > the I/O address space and whether the I/O page table is created in > an iommu enfore_snoop format. enforce_snoop must be true at this point, > as the false setting requires additional contract with KVM on handling > WBINVD emulation, which can be added later. > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > for what formats can be specified when allocating an IOASID. > > Open: > - Devices on PPC platform currently use a different iommu driver in vfio. > Per previous discussion they can also use vfio type1v2 as long as there > is a way to claim a specific iova range from a system-wide address space. > This requirement doesn't sound PPC specific, as addr_width for pci devices > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > adopted this design yet. We hope to have formal alignment in v1 discussion > and then decide how to incorporate it in v2. I think the request was to include a start/end IO address hint when creating the ios. When the kernel creates it then it can return the actual geometry including any holes via a query. > - Currently ioasid term has already been used in the kernel (drivers/iommu/ > ioasid.c) to represent the hardware I/O address space ID in the wire. It > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream > ID). We need find a way to resolve the naming conflict between the hardware > ID and software handle. One option is to rename the existing ioasid to be > pasid or ssid, given their full names still sound generic. 
Appreciate more > thoughts on this open! ioas works well here I think. Use ioas_id to refer to the xarray index. > Signed-off-by: Liu Yi L > drivers/iommu/iommufd/iommufd.c | 120 > include/linux/iommufd.h | 3 + > include/uapi/linux/iommu.h | 54 ++ > 3 files changed, 177 insertions(+) > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c > index 641f199f2d41..4839f128b24a 100644 > +++ b/drivers/iommu/iommufd/iommufd.c > @@ -24,6 +24,7 @@ > struct iommufd_ctx { > refcount_t refs; > struct mutex lock; > + struct xarray ioasid_xa; /* xarray of ioasids */ > struct xarray device_xa; /* xarray of bound devices */ > }; > > @@ -42,6 +43,16 @@ struct iommufd_device { > u64 dev_cookie; > }; > > +/* Represent an I/O address space */ > +struct iommufd_ioas { > + int ioasid; xarray id's should consistently be u32s everywhere. Many of the same prior comments repeated here Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
This patch adds IOASID allocation/free interface per iommufd. When allocating an IOASID, userspace is expected to specify the type and format information for the target I/O page table. This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), implying a kernel-managed I/O page table with vfio type1v2 mapping semantics. For this type the user should specify the addr_width of the I/O address space and whether the I/O page table is created in an iommu enfore_snoop format. enforce_snoop must be true at this point, as the false setting requires additional contract with KVM on handling WBINVD emulation, which can be added later. Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) for what formats can be specified when allocating an IOASID. Open: - Devices on PPC platform currently use a different iommu driver in vfio. Per previous discussion they can also use vfio type1v2 as long as there is a way to claim a specific iova range from a system-wide address space. This requirement doesn't sound PPC specific, as addr_width for pci devices can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't adopted this design yet. We hope to have formal alignment in v1 discussion and then decide how to incorporate it in v2. - Currently ioasid term has already been used in the kernel (drivers/iommu/ ioasid.c) to represent the hardware I/O address space ID in the wire. It covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream ID). We need find a way to resolve the naming conflict between the hardware ID and software handle. One option is to rename the existing ioasid to be pasid or ssid, given their full names still sound generic. Appreciate more thoughts on this open! 
Signed-off-by: Liu Yi L --- drivers/iommu/iommufd/iommufd.c | 120 include/linux/iommufd.h | 3 + include/uapi/linux/iommu.h | 54 ++ 3 files changed, 177 insertions(+) diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c index 641f199f2d41..4839f128b24a 100644 --- a/drivers/iommu/iommufd/iommufd.c +++ b/drivers/iommu/iommufd/iommufd.c @@ -24,6 +24,7 @@ struct iommufd_ctx { refcount_t refs; struct mutex lock; + struct xarray ioasid_xa; /* xarray of ioasids */ struct xarray device_xa; /* xarray of bound devices */ }; @@ -42,6 +43,16 @@ struct iommufd_device { u64 dev_cookie; }; +/* Represent an I/O address space */ +struct iommufd_ioas { + int ioasid; + u32 type; + u32 addr_width; + bool enforce_snoop; + struct iommufd_ctx *ictx; + refcount_t refs; +}; + static int iommufd_fops_open(struct inode *inode, struct file *filep) { struct iommufd_ctx *ictx; @@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep) refcount_set(&ictx->refs, 1); mutex_init(&ictx->lock); + xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC); xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC); filep->private_data = ictx; @@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx) if (!refcount_dec_and_test(&ictx->refs)) return; + WARN_ON(!xa_empty(&ictx->ioasid_xa)); WARN_ON(!xa_empty(&ictx->device_xa)); kfree(ictx); } +/* Caller should hold ictx->lock */ +static void ioas_put_locked(struct iommufd_ioas *ioas) +{ + struct iommufd_ctx *ictx = ioas->ictx; + int ioasid = ioas->ioasid; + + if (!refcount_dec_and_test(&ioas->refs)) + return; + + xa_erase(&ictx->ioasid_xa, ioasid); + iommufd_ctx_put(ictx); + kfree(ioas); +} + +static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg) +{ + struct iommu_ioasid_alloc req; + struct iommufd_ioas *ioas; + unsigned long minsz; + int ioasid, ret; + + minsz = offsetofend(struct iommu_ioasid_alloc, addr_width); + + if (copy_from_user(&req, (void __user *)arg, minsz)) + return 
-EFAULT; + + if (req.argsz < minsz || !req.addr_width || + req.flags != IOMMU_IOASID_ENFORCE_SNOOP || + req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2) + return -EINVAL; + + ioas = kzalloc(sizeof(*ioas), GFP_KERNEL); + if (!ioas) + return -ENOMEM; + + mutex_lock(&ictx->lock); + ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas, + XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX), + GFP_KERNEL); + mutex_unlock(&ictx->lock); + if (ret) { + pr_err_ratelimited("Failed to alloc ioasid\n"); + kfree(ioas); + return ret; + } + + ioas->ioasid = ioasid; + + /* only supports kernel managed I/O page table so far */ + ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2; + + ioas->addr_width = req.ad