Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On 5/24/22 23:25, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
>
> > Which means the guest RAM does not need to be all mapped in that base
> > IOAS suggested down this thread as that would mean all memory is pinned
> > and powervm won't be able to swap it out (yeah, it can do such a thing
> > now!). Not sure if we really want to support this or stick to a simpler
> > design.
>
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.

Correct, not optional.

> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

It does not call GUP during hypercalls because all VM pages are GUPed in advance at a special memory preregistration step, as we could not call GUP from a hypercall handler with the MMU off (often the case on POWER8, where this was developed in the first place). Things are better with POWER9 (bare metal can do all sorts of things, pretty much), but the PowerVM interface with 2 windows is still there, and this iommufd proposal is going to be ported on top of PowerVM at first.

I am just saying there is a model where not everything is mapped, and this has its use. PowerVM's swapping capability is something new and I do not really know how that works, though.

--
Alexey
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
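The preregistration scheme Alexey describes can be sketched as a toy model — illustrative Python only, not the kernel code, and the class and method names are invented for the example. The expensive pinning (GUP) happens once up front, so the per-mapping hypercall path never pins:

```python
# Toy model of POWER memory preregistration (illustrative only, not the
# kernel implementation): pages are pinned once up front; the per-mapping
# hypercall path only validates against the preregistered set, never pins.

class PreregIOMMU:
    def __init__(self):
        self.preregistered = set()   # pinned guest page frame numbers
        self.iopt = {}               # iova page -> guest page (TCE table)

    def preregister(self, pfns):
        # The expensive GUP/pin happens here, once, with the MMU on.
        self.preregistered.update(pfns)

    def h_put_tce(self, iova_page, pfn):
        # Hot-path hypercall: no GUP allowed; just check the pin cache.
        if pfn not in self.preregistered:
            return "H_TOO_HARD"      # punt to the slow path
        self.iopt[iova_page] = pfn
        return "H_SUCCESS"

iommu = PreregIOMMU()
iommu.preregister(range(0, 4))
assert iommu.h_put_tce(0x100, 2) == "H_SUCCESS"
assert iommu.h_put_tce(0x101, 99) == "H_TOO_HARD"
```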
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Tue, May 24, 2022 at 10:25:53AM -0300, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
>
> > Which means the guest RAM does not need to be all mapped in that base IOAS
> > suggested down this thread as that would mean all memory is pinned and
> > powervm won't be able to swap it out (yeah, it can do such a thing now!). Not
> > sure if we really want to support this or stick to a simpler design.
>
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.
>
> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

The current implementation does GUP during the pre-reserve. I think Alexey's talking about a new PowerVM (IBM hypervisor) feature; I don't know how that works.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> Which means the guest RAM does not need to be all mapped in that base IOAS
> suggested down this thread as that would mean all memory is pinned and
> powervm won't be able to swap it out (yeah, it can do such a thing now!). Not
> sure if we really want to support this or stick to a simpler design.

Huh? How can it swap? Calling GUP is not optional. Either you call GUP at the start and there is no swap, or you call GUP for each vIOMMU hypercall.

Since everyone says PPC doesn't call GUP during the hypercall - how is it working?

Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On 4/29/22 00:53, David Gibson wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 21:33:42 -0300 Jason Gunthorpe wrote:
> > > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > > > My overall question here would be whether we can actually achieve a
> > > > compatibility interface that has sufficient feature transparency that
> > > > we can dump vfio code in favor of this interface, or will there be
> > > > enough niche use cases that we need to keep type1 and vfio containers
> > > > around through a deprecation process?
> > >
> > > Other than SPAPR, I think we can.
> >
> > Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> > for POWER support?
>
> There are a few different levels to consider for dealing with PPC. For a
> suitable long term interface for ppc hosts and guests dropping this is
> fine: the ppc specific iommu model was basically an ill-conceived idea
> from the beginning, because none of us had sufficiently understood what
> things were general and what things were iommu model/hw specific.
>
> ..mostly. There are several points of divergence for the ppc iommu model.
>
> 1) Limited IOVA windows. This one turned out to not really be ppc
> specific, and is (rightly) handled generically in the new interface. No
> problem here.
>
> 2) Costly GUPs. pseries (the most common ppc machine type) always
> expects a (v)IOMMU. That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated maps/unmaps
> can be a hot path. Accounting in that path can be prohibitive (and on
> POWER8 in particular it prevented us from optimizing that path the way
> we wanted). We had two solutions for that: in v1 the explicit
> ENABLE/DISABLE calls, which preaccounted based on the IOVA window sizes.
> That was improved in the v2 which used the concept of preregistration.
> IIUC iommufd can achieve the same effect as preregistration using
> IOAS_COPY, so this one isn't really a problem either.

I am getting rid of those POWER8-related realmode handlers as POWER9 has the MMU enabled when hcalls are handled. The costly GUP problem is still there though (which the base IOAS should solve?).

> 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other. The base addresses of
> each of these are fixed, but the size of each window, the pagesize (i.e.
> granularity) of each window and the number of levels in the IOMMU
> pagetable are runtime configurable. Because it's true in the hardware,
> it's also true of the vIOMMU interface defined by the IBM hypervisor
> (and adopted by KVM as well). So, guests can request changes in how
> these windows are handled. Typical Linux guests will use the "low"
> window (IOVA 0..2GiB) dynamically, and the high window (IOVA 1<<60..???)
> to map all of RAM. However, as a hypervisor we can't count on that; the
> guest can use them however it wants.

The guest actually does this already. AIX has always been like that, and Linux is forced to do that for SRIOV VFs as there can be many VFs and TCEs (==IOPTEs) are a limited resource. Today's pseries IOMMU code first tries mapping 1:1 (as it has been for ages) but if there are not enough TCEs - it removes the first window (which increases the TCE budget), creates a new 64bit window (as big as possible but not necessarily enough for 1:1; 64K/2M IOMMU page sizes allowed) and does map/unmap as drivers go.

Which means the guest RAM does not need to be all mapped in that base IOAS suggested down this thread as that would mean all memory is pinned and powervm won't be able to swap it out (yeah, it can do such a thing now!). Not sure if we really want to support this or stick to a simpler design.

> (3) still needs a plan for how to fit it into the /dev/iommufd model.
> This is a secondary reason that in the past I advocated for the user
> requesting specific DMA windows which the kernel would accept or refuse,
> rather than having a query function - it connects easily to the DDW
> model. With the query-first model we'd need some sort of extension here,
> not really sure what it should look like.
>
> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces. First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to port
> qemu to the new interface at the same time as making a comprehensive
> in-kernel compat layer is arguably redundant work.
>
> That said, if we did want to handle this in an in-kernel compat layer,
> here's roughly what you'd need for SPAPR_TCE v2:
>
> - VFIO_IOMMU_SPAPR_TCE_GET_INFO
>   I think this should be fairly straightforward; the information you
>   need should be in the now generic IOVA window stuff and would just
>   need massaging into the expected format.
>
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY / VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>   IIUC, these could be translated into map/unmap operations onto a
>   second implicit IOAS which represents the preregistered memory a
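The two-window DDW layout described above can be sketched as a toy model — illustrative Python, not the PAPR interface; the class, method names, and exact high-window base are assumptions (the thread mentions both 1<<59 and 1<<60 depending on context). It shows the fixed bases, the runtime-configurable sizes, and the SRIOV-style reconfiguration where the low window is traded away for a large 64-bit one:

```python
# Toy model of POWER "dynamic DMA windows" (illustrative): two
# non-contiguous IOVA windows with fixed base addresses; the size and
# IOMMU page size of each window are runtime-configurable by the guest.

class DDW:
    LOW_BASE = 0
    HIGH_BASE = 1 << 59          # fixed by the platform; 1<<59 or 1<<60
                                 # depending on the hardware generation

    def __init__(self):
        # (base, size, iommu_page_size); start with a 2GiB low window
        self.windows = {"low": (self.LOW_BASE, 2 << 30, 4096)}

    def remove_window(self, name):
        # Dropping a window frees its TCE (IOPTE) budget.
        self.windows.pop(name)

    def create_high_window(self, size, page_size):
        assert page_size in (1 << 16, 2 << 20)   # 64K or 2M pages
        self.windows["high"] = (self.HIGH_BASE, size, page_size)

    def valid_iova(self, iova):
        return any(base <= iova < base + size
                   for base, size, _ in self.windows.values())

ddw = DDW()
assert ddw.valid_iova(1 << 20)
# SRIOV-style reconfiguration: drop the low window, create a big 64-bit one.
ddw.remove_window("low")
ddw.create_high_window(64 << 30, 2 << 20)
assert not ddw.valid_iova(1 << 20)
assert ddw.valid_iova((1 << 59) + 4096)
```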
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Wed, May 11, 2022 at 03:15:22AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, May 11, 2022 3:00 AM
> >
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplifies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s). This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> >
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
> > select the IOVA internally.
> >
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> >
> > Yes
> >
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >   - Like a regular mapping, ATTACHes that are incompatible with an
> > >     IOMAP_RESERVEed region will fail
> > >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >     mapping
> >
> > Yeah, this seems OK. I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> >
> > IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> >
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
>
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I don't think so. In fact, the way I was envisaging this, non-FIXED mappings will *never* go into the reserved ranges. This is for the benefit of any use cases that need both mappings where they don't care about the IOVA and those which do.

Essentially, reserving a region here is saying to the kernel "I want to manage this IOVA space; make sure nothing else touches it". That means both that the kernel must disallow any hw associated changes (like ATTACH) which would impinge on the reserved region, and also any IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is an interesting question. I wasn't thinking that would be necessary (just as you can use mmap() MAP_FIXED anywhere). However... much as MAP_FIXED is very dangerous to use if you don't previously reserve address space, I think IOMAP_FIXED is dangerous if you haven't previously reserved space. So maybe it would make sense to only allow FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel managed and user managed regions does make a certain amount of sense.

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.
>
> otherwise we may need another dedicated hint for that optimization.

Right. With the revised model where reserving windows is optional, not required, I don't think we can quite re-use this for optimization hints. Which is a bit unfortunate. I can't immediately see a way to tweak this which handles both more neatly, but I like the idea if we can figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> >
> > Yes, this is done already
> >
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
>
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply rep
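The reserve/fixed semantics discussed in this message can be captured in a toy allocator — hedged Python, where IOMAP_FIXED and IOMAP_RESERVE are the flag names proposed in the thread but everything else (class, method names, the naive first-fit search) is invented for illustration. The key behavior: kernel-chosen (non-FIXED) maps never land in reserved space, while FIXED maps may overmap it:

```python
# Toy model of the proposed IOAS_MAP flag semantics (hypothetical, not a
# real uAPI): IOMAP_RESERVE regions are excluded from kernel-chosen IOVA
# allocation, but IOMAP_FIXED mappings may overmap them.

IOMAP_FIXED, IOMAP_RESERVE = 1, 2

class IOAS:
    def __init__(self, aperture):
        self.aperture = aperture          # (start, end) usable IOVA
        self.reserved = []                # user-managed regions
        self.maps = []                    # established mappings

    def _overlaps(self, regions, start, end):
        return any(s < end and start < e for s, e in regions)

    def ioas_map(self, length, flags=0, iova=None):
        if flags & IOMAP_FIXED:
            start = iova                  # caller controls placement
        else:
            # Kernel chooses: naive first-fit search avoiding both
            # established maps and reserved space.
            start = self.aperture[0]
            while self._overlaps(self.maps + self.reserved,
                                 start, start + length):
                start += length
        if flags & IOMAP_RESERVE:
            self.reserved.append((start, start + length))
        else:
            self.maps.append((start, start + length))
        return start

ioas = IOAS((0, 1 << 30))
ioas.ioas_map(0x10000, flags=IOMAP_FIXED | IOMAP_RESERVE, iova=0)
# A non-FIXED map must land outside the reserved region...
assert ioas.ioas_map(0x1000) >= 0x10000
# ...but a FIXED map may overmap it.
assert ioas.ioas_map(0x1000, flags=IOMAP_FIXED, iova=0x2000) == 0x2000
```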
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Thursday, May 12, 2022 12:32 AM
>
> On Wed, May 11, 2022 at 03:15:22AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, May 11, 2022 3:00 AM
> > >
> > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > > Ok... here's a revised version of my proposal which I think addresses
> > > > your concerns and simplifies things.
> > > >
> > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > > >   will probably need matching changes)
> > > >
> > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > > >   is chosen by the kernel within the aperture(s). This is closer to
> > > >   how mmap() operates, and DPDK and similar shouldn't care about
> > > >   having specific IOVAs, even at the individual mapping level.
> > > >
> > > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > > >   for when you really do want to control the IOVA (qemu, maybe some
> > > >   special userspace driver cases)
> > >
> > > We already did both of these, the flag is called
> > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
> > > select the IOVA internally.
> > >
> > > > - ATTACH will fail if the new device would shrink the aperture to
> > > >   exclude any already established mappings (I assume this is already
> > > >   the case)
> > >
> > > Yes
> > >
> > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > > >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> > > >   MAPs won't use it, but doesn't actually put anything into the IO
> > > >   pagetables.
> > > >   - Like a regular mapping, ATTACHes that are incompatible with an
> > > >     IOMAP_RESERVEed region will fail
> > > >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > > >     mapping
> > >
> > > Yeah, this seems OK. I'm thinking a new API might make sense because
> > > you don't really want mmap replacement semantics but a permanent
> > > record of what IOVA must always be valid.
> > >
> > > IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > >
> > > struct iommu_ioas_require_iova {
> > >         __u32 size;
> > >         __u32 ioas_id;
> > >         __u32 num_iovas;
> > >         __u32 __reserved;
> > >         struct iommu_required_iovas {
> > >                 __aligned_u64 start;
> > >                 __aligned_u64 last;
> > >         } required_iovas[];
> > > };
> >
> > As a permanent record do we want to enforce that once the required
> > range list is set all FIXED and non-FIXED allocations must be within the
> > list of ranges?
>
> No, I would just use this as a guarantee that going forward any
> get_ranges will always return ranges that cover the listed required
> ranges. Ie any narrowing of the ranges will be refused.
>
> map/unmap should only be restricted to the get_ranges output.
>
> Wouldn't burn CPU cycles to nanny userspace here.

fair enough.

> > If yes we can take the end of the last range as the max size of the iova
> > address space to optimize the page table layout.
>
> I think this API should not interact with the driver. Its only job is
> to prevent devices from attaching that would narrow the ranges.
>
> If we also use it to adjust the aperture of the created iommu_domain
> then it loses its usefulness as a guard, since something like qemu
> would have to leave room for hotplug as well.
>
> I suppose optimizing the created iommu_domains should be some other
> API, with a different set of ranges and the '# of bytes of IOVA' hint
> as well.

makes sense.

> > > > For (unoptimized) qemu it would be:
> > > >
> > > > 1. Create IOAS
> > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> > vIOMMUs regular mappings are required before booting the guest and
> > reservation might be done but not mandatory (at least not what current
> > Qemu vfio can afford as it simply replays valid ranges in the CPU address
> > space).
>
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.

It could, but that would require more changes in Qemu to define required ranges in platform logic and then convey them from the Qemu address space to VFIO. I view it as an optimization hence not necessarily to be done immediately.

Thanks
Kevin
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Wed, May 11, 2022 at 03:15:22AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, May 11, 2022 3:00 AM
> >
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplifies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s). This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> >
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
> > select the IOVA internally.
> >
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> >
> > Yes
> >
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >   - Like a regular mapping, ATTACHes that are incompatible with an
> > >     IOMAP_RESERVEed region will fail
> > >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >     mapping
> >
> > Yeah, this seems OK. I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> >
> > IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> >
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
>
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I would just use this as a guarantee that going forward any get_ranges will always return ranges that cover the listed required ranges. Ie any narrowing of the ranges will be refused.

map/unmap should only be restricted to the get_ranges output.

Wouldn't burn CPU cycles to nanny userspace here.

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.

I think this API should not interact with the driver. Its only job is to prevent devices from attaching that would narrow the ranges.

If we also use it to adjust the aperture of the created iommu_domain then it loses its usefulness as a guard, since something like qemu would have to leave room for hotplug as well.

I suppose optimizing the created iommu_domains should be some other API, with a different set of ranges and the '# of bytes of IOVA' hint as well.

> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
>
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

I think qemu can always do it, it feels like it would simplify error cases around aperture mismatches.

Jason
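The "any narrowing of the ranges will be refused" guarantee discussed here can be modeled in a few lines — a hedged Python sketch, not the proposed uAPI; the class name, method names, and single-aperture simplification are all illustrative assumptions:

```python
# Toy model of the REQUIRE_IOVA guarantee (illustrative, not the proposed
# uAPI): once a required range is recorded, any device attach that would
# narrow the usable aperture below a required range is refused.

class RequiredRanges:
    def __init__(self, aperture):
        self.aperture = aperture      # (first, last) usable IOVA
        self.required = []            # ranges that must stay mappable

    def require_iova(self, start, last):
        s, l = self.aperture
        if not (s <= start and last <= l):
            raise ValueError("required range outside current aperture")
        self.required.append((start, last))

    def attach_device(self, dev_aperture):
        # The IOAS aperture becomes the intersection with the device's
        # DMA reach; refuse the attach if that narrows a required range.
        s = max(self.aperture[0], dev_aperture[0])
        l = min(self.aperture[1], dev_aperture[1])
        for rs, rl in self.required:
            if not (s <= rs and rl <= l):
                raise ValueError("attach would narrow a required range")
        self.aperture = (s, l)

ioas = RequiredRanges((0, (1 << 64) - 1))
ioas.require_iova(1 << 59, (1 << 59) + (64 << 30))
ioas.attach_device((0, (1 << 64) - 1))       # 64-bit DMA capable: OK
try:
    ioas.attach_device((0, (1 << 32) - 1))   # 32-bit DMA only: refused
    raise AssertionError("attach should have been refused")
except ValueError:
    pass
```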
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > >
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > >
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > >
> > > I'd expect to include the 32 bit window too..
> >
> > I'm not entirely sure what you mean. Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
>
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not an improvement for the DPDK cases.
> >
> > Um.. no. The idea is that userspace requests *what it needs*, not
> > anything platform specific. In the case of DPDK that would be nothing
> > more than the (minimum) aperture size. Nothing platform specific
> > about that.
>
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
>
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to cover its hugepage buffer space. So not "constant" exactly, but a value it will know at start up time. But I think we cover that more closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> >
> > Yes, hence *optionally specified* base address only.
>
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
>
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
>
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> >
> > Thinking through this...
> >
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead. So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query responses.
>
> Yes, this is what I'm thinking
>
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window. I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver. The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
>
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated guest, though.

> > So... is there any way of backing out of this gracefully. We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.
>
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context. Note that there are at least two very different users of the host IOMMU backends: one is for emulation of guest DMA (with or without a vIOMMU). In that case the details of the guest platform should let qemu know the ranges. There's also a VFIO based NVMe backend; that one's much more like a "normal" userspace driver, where it doesn't care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome.
>
> Should be guaranteed

Ok; that would need to be documented.

> > Ok.. you convinced me. As long as we have some way to handle the
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Wednesday, May 11, 2022 3:00 AM
>
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplifies things.
> >
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> >
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s). This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
>
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
> select the IOVA internally.
>
> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
>
> Yes
>
> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >   - Like a regular mapping, ATTACHes that are incompatible with an
> >     IOMAP_RESERVEed region will fail
> >   - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >     mapping
>
> Yeah, this seems OK. I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
>
> IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
>
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

As a permanent record do we want to enforce that once the required range list is set all FIXED and non-FIXED allocations must be within the list of ranges?

If yes we can take the end of the last range as the max size of the iova address space to optimize the page table layout.

Otherwise we may need another dedicated hint for that optimization.

> > So, for DPDK the sequence would be:
> >
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> >
> > (Note, not even any need for QUERY in simple cases)
>
> Yes, this is done already
>
> > For (unoptimized) qemu it would be:
> >
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest

I suppose the above is only the sample flow for PPC vIOMMU. For non-PPC vIOMMUs regular mappings are required before booting the guest and reservation might be done but not mandatory (at least not what current Qemu vfio can afford as it simply replays valid ranges in the CPU address space).

> > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> >    the reserved regions
> > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >    reserved regions)
> > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >    necessary (which might fail)
>
> OK, I will take care of it
>
> Thanks,
> Jason
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Monday, May 9, 2022 10:01 PM
>
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
>
> > Which is why I'm suggesting that the base address be an optional
> > request. DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
>
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possibly optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..
>
> Kevin could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?

It could, but it is not implemented now.
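As a rough illustration of the optimization Jason asks Kevin about — a hedged Python sketch assuming an x86-style radix layout with 4KiB pages and 9 address bits per page-table level (actual VT-d level selection is more involved): a hint about the maximum IOVA lets the driver pick the fewest table levels that still cover the space.

```python
# Toy illustration (assumptions: 4KiB pages, 9 address bits per level,
# as in x86-style radix page tables): compute the minimum number of
# page-table levels needed to cover IOVAs up to max_iova.

import math

def levels_needed(max_iova, page_shift=12, bits_per_level=9):
    # Bits of address above the page offset determine the depth.
    addr_bits = max(max_iova.bit_length(), page_shift + 1)
    return math.ceil((addr_bits - page_shift) / bits_per_level)

assert levels_needed((1 << 30) - 1) == 2   # 1GiB of IOVA: 2 levels
assert levels_needed((1 << 48) - 1) == 4   # 48-bit IOVA: 4 levels
```

A smaller advertised IOVA space means fewer levels, i.e. fewer memory accesses per IOTLB miss, which is the payoff the hint is after.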
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > > > The default iommu_domain that the iommu driver creates will be used > > > > here, it is up to the iommu driver to choose something reasonable for > > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > > x86-like aperture. > > > > > > So, using the big aperture means a very high base IOVA > > > (1<<59)... which means that it won't work at all if you want to attach > > > any devices that aren't capable of 64-bit DMA. > > > > I'd expect to include the 32 bit window too.. > > I'm not entirely sure what you mean. Are you working on the > assumption that we've extended to allowing multiple apertures, so we'd > default to advertising both a small/low aperture and a large/high > aperture? Yes > > No, this just makes it fragile in the other direction because now > > userspace has to know what platform specific things to ask for *or it > > doesn't work at all*. This is not a improvement for the DPDK cases. > > Um.. no. The idea is that userspace requests *what it needs*, not > anything platform specific. In the case of DPDK that would be nothing > more than the (minimum) aperture size. Nothing platform specific > about that. Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit platform can do more, but it varies how much more, etc. There is no constant value DPDK could stuff in this request, unless it needs a really small amount of IOVA, like 1G or something. > > It isn't like there is some hard coded value we can put into DPDK that > > will work on every platform. So kernel must pick for DPDK, IMHO. I > > don't see any feasible alternative. > > Yes, hence *optionally specified* base address only. 
Okay, so imagine we've already done this and DPDK is not optionally specifying anything :) The structs can be extended so we can add this as an input to creation when a driver can implement it. > > The ppc specific driver would be on the generic side of qemu in its > > viommu support framework. There is lots of host driver optimization > > possible here with knowledge of the underlying host iommu HW. It > > should not be connected to the qemu target. > > Thinking through this... > > So, I guess we could have basically the same logic I'm suggesting be > in the qemu backend iommu driver instead. So the target side (machine > type, strictly speaking) would request of the host side the apertures > it needs, and the host side driver would see if it can do that, based > on both specific knowledge of that driver and the query responses. Yes, this is what I'm thinking > ppc on x86 should work with that.. at least if the x86 aperture is > large enough to reach up to ppc's high window. I guess we'd have the > option here of using either the generic host driver or the > x86-specific driver. The latter would mean qemu maintaining an > x86-format shadow of the io pagetables; mildly tedious, but doable. The appeal of having userspace page tables is performance, so it is tedious to shadow, but it should run faster. > So... is there any way of backing out of this gracefully. We could > detach the device, but in the meantime ongoing DMA maps from > previous devices might have failed. This sounds like a good use case for qemu to communicate ranges - but as I mentioned before Alex said qemu didn't know the ranges.. > We could pre-attach the new device to a new IOAS and check the > apertures there - but when we move it to the final IOAS is it > guaranteed that the apertures will be (at least) the intersection of > the old and new apertures, or is that just the probable outcome. Should be guaranteed > Ok.. you convinced me. 
As long as we have some way to handle the > device hotplug case, we can work with this. I like the communicate ranges for hotplug, so long as we can actually implement it in qemu - I'm a bit unclear on that honestly. > Ok, I see. That can certainly be done. I was really hoping we could > have a working, though non-optimized, implementation using just the > generic interface. Oh, sure that should largely work as well too, this is just an additional direction people may find interesting and helps explain why qemu should have an iommu layer inside. > "holes" versus "windows". We can choose either one; I think "windows" > rather than "holes" makes more sense, but it doesn't really matter. Yes, I picked windows aka ranges for the uAPI - we translate the holes from the groups into windows and intersect them with the apertures. > > > Another approach would be to give the required apertures / pagesizes > > > in the initial creation of the domain/IOAS. In that case they would > > > be static for the IOAS, as well as the underlying iommu_domains: any > > > ATTACH which would be incompatible would fail. > > > > This is the device-specific io
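The "windows aka ranges" translation described above — carving reserved holes out of an aperture and intersecting the result across devices — is plain interval arithmetic. A minimal sketch (illustrative only, not the actual iommufd implementation):

```python
def subtract_holes(aperture, holes):
    """Split one (start, end) aperture into usable windows by removing
    reserved holes. Ends are exclusive; holes may be in any order."""
    windows, cur = [], aperture[0]
    for hs, he in sorted(holes):
        if hs > cur:
            windows.append((cur, min(hs, aperture[1])))
        cur = max(cur, he)
    if cur < aperture[1]:
        windows.append((cur, aperture[1]))
    return [(s, e) for s, e in windows if s < e]

def intersect(a, b):
    """Intersect two window lists, e.g. one device's usable windows
    against the IOAS apertures."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

# An MSI-style reserved hole splits a 4GiB aperture into two windows.
print(subtract_holes((0, 1 << 32), [(0xfee00000, 0xfef00000)]))
```

Attaching another device then just means intersecting its window list with the windows computed so far; the apertures can only shrink, never grow.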
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > The default iommu_domain that the iommu driver creates will be used > > > here, it is up to the iommu driver to choose something reasonable for > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > x86-like aperture. > > > > So, using the big aperture means a very high base IOVA > > (1<<59)... which means that it won't work at all if you want to attach > > any devices that aren't capable of 64-bit DMA. > > I'd expect to include the 32 bit window too.. I'm not entirely sure what you mean. Are you working on the assumption that we've extended to allowing multiple apertures, so we'd default to advertising both a small/low aperture and a large/high aperture? > > Using the maximum possible window size would mean we either > > potentially waste a lot of kernel memory on pagetables, or we use > > unnecessarily large number of levels to the pagetable. > > All drivers have this issue to one degree or another. We seem to be > ignoring it - in any case this is a micro optimization, not a > functional need? Ok, fair point. > > More generally, the problem with the interface advertising limitations > > and it being up to userspace to work out if those are ok or not is > > that it's fragile. It's pretty plausible that some future IOMMU model > > will have some new kind of limitation that can't be expressed in the > > query structure we invented now. > > The basic API is very simple - the driver needs to provide ranges of > IOVA and map/unmap - I don't think we have a future problem here we > need to try and guess and solve today. Well.. maybe. My experience of encountering hardware doing weird-arse stuff makes me less sanguine. > Even PPC fits this just fine, the open question for DPDK is more > around optimization, not functional. 
> > > But if userspace requests the capabilities it wants, and the kernel > > acks or nacks that, we can support the new host IOMMU with existing > > software just fine. > > No, this just makes it fragile in the other direction because now > userspace has to know what platform specific things to ask for *or it > doesn't work at all*. This is not an improvement for the DPDK cases. Um.. no. The idea is that userspace requests *what it needs*, not anything platform specific. In the case of DPDK that would be nothing more than the (minimum) aperture size. Nothing platform specific about that. > Kernel decides, using all the kernel knowledge it has and tells the > application what it can do - this is the basic simplified interface. > > > > The iommu-driver-specific struct is the "advanced" interface and > > > allows a user-space IOMMU driver to tightly control the HW with full > > > HW specific knowledge. This is where all the weird stuff that is not > > > general should go. > > > > Right, but forcing anything more complicated than "give me some IOVA > > region" to go through the advanced interface means that qemu (or any > > hypervisor where the guest platform need not identically match the > > host) has to have n^2 complexity to match each guest IOMMU model to > > each host IOMMU model. > > I wouldn't say n^2, but yes, qemu needs to have a userspace driver for > the platform IOMMU, and yes it needs this to reach optimal > behavior. We already know this is a hard requirement for using nesting > as acceleration, I don't see why apertures are so different. For one thing, because we only care about optimal behaviour on the host ~= guest KVM case. That means it's not n^2, just (roughly) one host driver for each matching guest driver. I'm considering the general X on Y case - we don't need to optimize it, but it would be nice for it to work without considering every combination separately. > > Errr.. how do you figure? 
On ppc the ranges and pagesizes are > > definitely negotiable. I'm not really familiar with other models, but > > anything which allows *any* variations in the pagetable structure will > > effectively have at least some negotiable properties. > > As above, if you ask for the wrong thing then you don't get > anything. If DPDK asks for something that works on ARM like 0 -> 4G > then PPC and x86 will always fail. How is this improving anything to > require applications to carefully ask for exactly the right platform > specific ranges? Hm, looks like I didn't sufficiently emphasize that the base address would be optional for userspace to supply. So userspace would request a range *size* only, unless it needs a specific IOVA base address. It only requests the latter if it actually needs it - so failing in that case is correct. (Qemu, with or without a vIOMMU, is the obvious case for that, though I could also imagine it for a specialized driver for some broken device which has weird limitations on what IOVA addresses it can generate on the bus). > It isn't like there is some hard coded value we c
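The "optional base address" model David describes can be sketched as: userspace asks for a size (and only optionally a base), and the kernel answers from the windows it actually has. The window numbers below are illustrative shapes, not real platform values, and `alloc_iova` is a toy stand-in for whatever allocation the kernel would do:

```python
def alloc_iova(windows, size, base=None):
    """Pick an IOVA range of `size` bytes from the advertised windows.
    If `base` is given (the caller really needs that address), succeed
    only if [base, base+size) fits in one window; otherwise pick any
    window big enough and return its start."""
    if base is not None:
        for s, e in windows:
            if s <= base and base + size <= e:
                return base
        return None   # correct failure: caller demanded the impossible
    for s, e in windows:
        if e - s >= size:
            return s
    return None

# Illustrative PPC-like shape: a low 2GiB window plus a high window.
ppc_like = [(0, 2 << 30), (1 << 59, (1 << 59) + (1 << 40))]
print(alloc_iova(ppc_like, 1 << 30))                 # size-only ask -> 0
print(alloc_iova(ppc_like, 1 << 30, base=3 << 30))   # fixed-base ask -> None
```

This is the asymmetry being argued: a size-only request is satisfiable everywhere the total space suffices, while a hard-coded 0→4G style request fails on platforms shaped like the example.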
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > The default iommu_domain that the iommu driver creates will be used > > here, it is up to the iommu driver to choose something reasonable for > > use by applications like DPDK. ie PPC should probably pick its biggest > > x86-like aperture. > > So, using the big aperture means a very high base IOVA > (1<<59)... which means that it won't work at all if you want to attach > any devices that aren't capable of 64-bit DMA. I'd expect to include the 32 bit window too.. > Using the maximum possible window size would mean we either > potentially waste a lot of kernel memory on pagetables, or we use > unnecessarily large number of levels to the pagetable. All drivers have this issue to one degree or another. We seem to be ignoring it - in any case this is a micro optimization, not a functional need? > More generally, the problem with the interface advertising limitations > and it being up to userspace to work out if those are ok or not is > that it's fragile. It's pretty plausible that some future IOMMU model > will have some new kind of limitation that can't be expressed in the > query structure we invented now. The basic API is very simple - the driver needs to provide ranges of IOVA and map/unmap - I don't think we have a future problem here we need to try and guess and solve today. Even PPC fits this just fine, the open question for DPDK is more around optimization, not functional. > But if userspace requests the capabilities it wants, and the kernel > acks or nacks that, we can support the new host IOMMU with existing > software just fine. No, this just makes it fragile in the other direction because now userspace has to know what platform specific things to ask for *or it doesn't work at all*. This is not an improvement for the DPDK cases. Kernel decides, using all the kernel knowledge it has and tells the application what it can do - this is the basic simplified interface. 
> > The iommu-driver-specific struct is the "advanced" interface and > > allows a user-space IOMMU driver to tightly control the HW with full > > HW specific knowledge. This is where all the weird stuff that is not > > general should go. > > Right, but forcing anything more complicated than "give me some IOVA > region" to go through the advanced interface means that qemu (or any > hypervisor where the guest platform need not identically match the > host) has to have n^2 complexity to match each guest IOMMU model to > each host IOMMU model. I wouldn't say n^2, but yes, qemu needs to have a userspace driver for the platform IOMMU, and yes it needs this to reach optimal behavior. We already know this is a hard requirement for using nesting as acceleration, I don't see why apertures are so different. > Errr.. how do you figure? On ppc the ranges and pagesizes are > definitely negotiable. I'm not really familiar with other models, but > anything which allows *any* variations in the pagetable structure will > effectively have at least some negotiable properties. As above, if you ask for the wrong thing then you don't get anything. If DPDK asks for something that works on ARM like 0 -> 4G then PPC and x86 will always fail. How is this improving anything to require applications to carefully ask for exactly the right platform specific ranges? It isn't like there is some hard coded value we can put into DPDK that will work on every platform. So kernel must pick for DPDK, IMHO. I don't see any feasible alternative. > Which is why I'm suggesting that the base address be an optional > request. DPDK *will* care about the size of the range, so it just > requests that and gets told a base address. We've talked about a size of IOVA address space before, strictly as a hint, to possibly optimize page table layout, or something, and I'm fine with that idea. But - we have no driver implementation today, so I'm not sure what we can really do with this right now.. 
Kevin, could Intel consume a hint on IOVA space and optimize the number of IO page table levels? > > and IMHO, qemu > > is fine to have a PPC specific userspace driver to tweak this PPC > > unique thing if the default windows are not acceptable. > > Not really, because it's the ppc *target* (guest) side which requires > the specific properties, but selecting the "advanced" interface > requires special knowledge on the *host* side. The ppc specific driver would be on the generic side of qemu in its viommu support framework. There is lots of host driver optimization possible here with knowledge of the underlying host iommu HW. It should not be connected to the qemu target. It is not so different from today where qemu has to know about ppc's special vfio interface generically even to emulate x86. > > IMHO it is no different from imagining an Intel specific userspace > > driver that is using userspace IO pagetables to optimize > > cross-platform qemu vIOMMU emulation. > > I'm not quite sure what you have in mind he
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, May 06, 2022 at 09:48:37AM -0300, Jason Gunthorpe wrote: > On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote: > > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote: > > > > When the iommu_domain is created I want to have a > > > iommu-driver-specific struct, so PPC can customize its iommu_domain > > > however it likes. > > > > This requires that the client be aware of the host side IOMMU model. > > That's true in VFIO now, and it's nasty; I was really hoping we could > > *stop* doing that. > > iommufd has two modes, the 'generic interface', which is what this patch > series shows and does not require any device specific knowledge. Right, and I'm speaking specifically to that generic interface. But I'm thinking particularly about the qemu case where we do have specific knowledge of the *guest* vIOMMU, but we want to avoid having specific knowledge of the host IOMMU, because they might not be the same. It would be good to have a way of seeing if the guest vIOMMU can be emulated on this host IOMMU without qemu having to have separate logic for every host IOMMU. > The default iommu_domain that the iommu driver creates will be used > here, it is up to the iommu driver to choose something reasonable for > use by applications like DPDK. ie PPC should probably pick its biggest > x86-like aperture. So, using the big aperture means a very high base IOVA (1<<59)... which means that it won't work at all if you want to attach any devices that aren't capable of 64-bit DMA. Using the maximum possible window size would mean we either potentially waste a lot of kernel memory on pagetables, or we use unnecessarily large number of levels to the pagetable. Basically we don't have enough information to make a good decision here. More generally, the problem with the interface advertising limitations and it being up to userspace to work out if those are ok or not is that it's fragile. 
It's pretty plausible that some future IOMMU model will have some new kind of limitation that can't be expressed in the query structure we invented now. That means that to add support for that we need some kind of gate to prevent old userspace using the new IOMMU (e.g. only allowing the new IOMMU to be used if userspace uses newly added queries to get the new limitations). That's true even if what userspace was actually doing with the IOMMU would fit just fine into those new limitations. But if userspace requests the capabilities it wants, and the kernel acks or nacks that, we can support the new host IOMMU with existing software just fine. They won't be able to use any *new* features or capabilities of the new hardware, of course, but they'll be able to use what it does that overlaps with what they needed before. ppc - or more correctly, the POWER and PAPR IOMMU models - is just acting here as an example of an IOMMU with limitations and capabilities that don't fit into the current query model. > The iommu-driver-specific struct is the "advanced" interface and > allows a user-space IOMMU driver to tightly control the HW with full > HW specific knowledge. This is where all the weird stuff that is not > general should go. Right, but forcing anything more complicated than "give me some IOVA region" to go through the advanced interface means that qemu (or any hypervisor where the guest platform need not identically match the host) has to have n^2 complexity to match each guest IOMMU model to each host IOMMU model. > > Note that I'm talking here *purely* about the non-optimized case where > > all updates to the host side IO pagetables are handled by IOAS_MAP / > > IOAS_COPY, with no direct hardware access to user or guest managed IO > > pagetables. The optimized case obviously requires end-to-end > > agreement on the pagetable format amongst other domain properties. > > Sure, this is how things are already.. 
> > > What I'm hoping is that qemu (or whatever) can use this non-optimized > > as a fallback case where it doesn't need to know the properties of > > whatever host side IOMMU models there are. It just requests what it > > needs based on the vIOMMU properties it needs to replicate and the > > host kernel either can supply it or can't. > > There aren't really any negotiable vIOMMU properties beyond the > ranges, and the ranges are not *really* negotiable. Errr.. how do you figure? On ppc the ranges and pagesizes are definitely negotiable. I'm not really familiar with other models, but anything which allows *any* variations in the pagetable structure will effectively have at least some negotiable properties. Even if any individual host IOMMU doesn't have negotiable properties (which ppc demonstrates is false), there's still a negotiation here in the context that userspace doesn't know (and doesn't care) what specific host IOMMU model it has. > There are lots of dragons with the idea we can actually negotiate > ranges - because asking for the wrong range for what the HW can do > means you don't
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, May 06, 2022 at 10:42:21AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Friday, May 6, 2022 1:25 PM > > > > > > > > When the iommu_domain is created I want to have a > > > iommu-driver-specific struct, so PPC can customize its iommu_domain > > > however it likes. > > > > This requires that the client be aware of the host side IOMMU model. > > That's true in VFIO now, and it's nasty; I was really hoping we could > > *stop* doing that. > > that model is anyway inevitable when talking about user page table, Right, but I'm explicitly not talking about the user managed page table case. I'm talking about the case where the IO pagetable is still managed by the kernel and we update it via IOAS_MAP and similar operations. > i.e. when nesting is enabled. I don't really follow the connection you're drawing between a user managed table and nesting. > > Note that I'm talking here *purely* about the non-optimized case where > > all updates to the host side IO pagetables are handled by IOAS_MAP / > > IOAS_COPY, with no direct hardware access to user or guest managed IO > > pagetables. The optimized case obviously requires end-to-end > > agreement on the pagetable format amongst other domain properties. > > > > What I'm hoping is that qemu (or whatever) can use this non-optimized > > as a fallback case where it doesn't need to know the properties of > > whatever host side IOMMU models there are. It just requests what it > > needs based on the vIOMMU properties it needs to replicate and the > > host kernel either can supply it or can't. > > In many cases it should be perfectly possible to emulate a PPC style > > vIOMMU on an x86 host, because the x86 IOMMU has such a colossal > > aperture that it will encompass wherever the ppc apertures end > > up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc > > host (currently somewhere between awkward and impossible) by placing > > the host apertures to cover guest memory. 
> > > > Admittedly those are pretty niche cases, but allowing for them gives > > us flexibility for the future. Emulating an ARM SMMUv3 guest on an > > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in > > the future, and AFAICT, ARM are much less conservative than x86 about > > maintaining similar hw interfaces over time. That's why I think > > considering these ppc cases will give a more robust interface for > > other future possibilities as well. > > It's not niche cases. We already have virtio-iommu which can work > on both ARM and x86 platforms, i.e. what current iommufd provides > is already generic enough except on PPC. > > Then IMHO the key open question here is: > > Can PPC adapt to the current iommufd proposal if it can be > refactored to fit the standard iommu domain/group concepts? Right... and I'm still trying to figure out whether it can adapt to either part of that. We absolutely need to allow for multiple IOVA apertures within a domain. If we have that I *think* we can manage (if suboptimally), but I'm trying to figure out the corner cases to make sure I haven't missed something. > If not, what is the remaining gap after PPC becomes a normal > citizen in the iommu layer and is it worth solving it in the general > interface or via an iommu-driver-specific domain (given this will > exist anyway)? > > To close that open question I'm with Jason: > >"Fundamentally PPC has to fit into the iommu standard framework of >group and domains, we can talk about modifications, but drifting too >far away is a big problem." > > Directly jumping to the iommufd layer for what changes might be > applied to all platforms sounds counter-intuitive if we haven't tried > to solve the gap in the iommu layer in the first place, as even > if there is an argument that certain changes in the iommufd layer can find > matching concepts on other platforms, it still looks somewhat redundant > since those platforms already work with the current model. I don't really follow what you're saying here. 
-- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
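David's "colossal aperture" argument above reduces, at its simplest, to a containment check: every guest aperture must lie wholly inside some host aperture for the non-optimized map/unmap emulation to be possible. A toy sketch with illustrative numbers (real feasibility also depends on pagesizes and permissions, which this ignores):

```python
def can_emulate(guest_apertures, host_apertures):
    """True if every guest (start, end) aperture lies wholly inside
    some host aperture, i.e. the host can back every guest IOVA."""
    return all(any(hs <= gs and ge <= he for hs, he in host_apertures)
               for gs, ge in guest_apertures)

# Illustrative shapes only: one huge x86-style aperture vs. two narrow
# PAPR-style windows (a low 2GiB window and a high window).
big_host = [(0, 1 << 57)]
ppc_guest = [(0, 2 << 30), (1 << 40, (1 << 40) + (1 << 39))]
print(can_emulate(ppc_guest, big_host))    # -> True
small_host = [(0, 1 << 32)]
print(can_emulate(ppc_guest, small_host))  # -> False
```

The reverse direction David mentions — an x86-style no-vIOMMU guest on a ppc host — is the same check with host apertures placed to cover guest memory.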
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote: > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote: > > When the iommu_domain is created I want to have a > > iommu-driver-specific struct, so PPC can customize its iommu_domain > > however it likes. > > This requires that the client be aware of the host side IOMMU model. > That's true in VFIO now, and it's nasty; I was really hoping we could > *stop* doing that. iommufd has two modes, the 'generic interface', which is what this patch series shows and does not require any device specific knowledge. The default iommu_domain that the iommu driver creates will be used here, it is up to the iommu driver to choose something reasonable for use by applications like DPDK. ie PPC should probably pick its biggest x86-like aperture. The iommu-driver-specific struct is the "advanced" interface and allows a user-space IOMMU driver to tightly control the HW with full HW specific knowledge. This is where all the weird stuff that is not general should go. > Note that I'm talking here *purely* about the non-optimized case where > all updates to the host side IO pagetables are handled by IOAS_MAP / > IOAS_COPY, with no direct hardware access to user or guest managed IO > pagetables. The optimized case obviously requires end-to-end > agreement on the pagetable format amongst other domain properties. Sure, this is how things are already.. > What I'm hoping is that qemu (or whatever) can use this non-optimized > as a fallback case where it doesn't need to know the properties of > whatever host side IOMMU models there are. It just requests what it > needs based on the vIOMMU properties it needs to replicate and the > host kernel either can supply it or can't. There aren't really any negotiable vIOMMU properties beyond the ranges, and the ranges are not *really* negotiable. 
There are lots of dragons with the idea we can actually negotiate ranges - because asking for the wrong range for what the HW can do means you don't get anything. Which is completely contrary to the idea of easy generic support for things like DPDK. So DPDK can't ask for ranges, it is not generic. This means we are really talking about a qemu-only API, and IMHO, qemu is fine to have a PPC specific userspace driver to tweak this PPC unique thing if the default windows are not acceptable. IMHO it is no different from imagining an Intel specific userspace driver that is using userspace IO pagetables to optimize cross-platform qemu vIOMMU emulation. We should be comfortable with the idea that accessing the full device-specific feature set requires a HW specific user space driver. > Admittedly those are pretty niche cases, but allowing for them gives > us flexibility for the future. Emulating an ARM SMMUv3 guest on an > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in > the future, and AFAICT, ARM are much less conservative than x86 about > maintaining similar hw interfaces over time. That's why I think > considering these ppc cases will give a more robust interface for > other future possibilities as well. I don't think so - PPC has just done two things that are completely divergent from everything else - having two IO page tables for the same end point, and using map/unmap hypercalls instead of a nested page table. Everyone else seems to be focused on IOPTEs that are similar to CPU PTEs, particularly to enable SVA and other tricks, and CPUs don't have either of these quirks. > You can consider that a map/unmap hypercall, but the size of the > mapping is fixed (the IO pagesize which was previously set for the > aperture). Yes, I would consider that a map/unmap hypercall vs a nested translation. > > Assuming yes, I'd expect that: > > > > The iommu_domain for nested PPC is just a log of map/unmap hypervisor > > calls to make. 
Whenever a new PE is attached to that domain it gets > > the logged maps replayed to set it up, and when a PE is detached the > > log is used to unmap everything. > > And likewise duplicate every H_PUT_TCE to all the PEs in the domain. > Sure. It means the changes won't be atomic across the domain, but I > guess that doesn't matter. I guess you could have the same thing on a > sufficiently complex x86 or ARM system, if you put two devices into > the IOAS that were sufficiently far from each other in the bus > topology that they use a different top-level host IOMMU. Yes, strict atomicity is not needed. > > It is not perfectly memory efficient - and we could perhaps talk about > > an API modification to allow re-use of the iommufd datastructure > > somehow, but I think this is a good logical starting point. > > Because the map size is fixed, a "replay log" is effectively > equivalent to a mirror of the entire IO pagetable. So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs and device attach/detach just sweeps the xarray and does the hypercalls. Very similar to what we discussed for S390. It seems OK, th
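The "log of map/unmap hypercalls" domain described above — replayed on PE attach, swept on detach, duplicated to every attached PE on map — can be sketched as a toy model. A dict stands in for the kernel xarray of mapped PFNs, and `h_put_tce` is a stub for the real hypercall (names and shapes are illustrative only):

```python
class ReplayLogDomain:
    """Toy model of the proposal: the domain is just a log of mappings;
    attaching a PE replays the log, detaching sweeps and unmaps it."""
    def __init__(self):
        self.mappings = {}   # iova -> tce value (stand-in for the xarray)
        self.pes = set()

    def h_put_tce(self, pe, iova, tce):
        # Stub for the real H_PUT_TCE hypercall; tce == 0 means unmap.
        pass

    def map(self, iova, tce):
        self.mappings[iova] = tce
        for pe in self.pes:                # duplicated to every PE; not
            self.h_put_tce(pe, iova, tce)  # atomic across the domain

    def unmap(self, iova):
        del self.mappings[iova]
        for pe in self.pes:
            self.h_put_tce(pe, iova, 0)

    def attach(self, pe):
        self.pes.add(pe)
        for iova, tce in self.mappings.items():  # replay the log
            self.h_put_tce(pe, iova, tce)

    def detach(self, pe):
        for iova in self.mappings:               # sweep and unmap
            self.h_put_tce(pe, iova, 0)
        self.pes.discard(pe)
```

Because the map size is fixed, this log is exactly a mirror of the whole IO pagetable, which is the memory-efficiency concern raised in the message above.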
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: David Gibson > Sent: Friday, May 6, 2022 1:25 PM > > > > > When the iommu_domain is created I want to have a > > iommu-driver-specific struct, so PPC can customize its iommu_domain > > however it likes. > > This requires that the client be aware of the host side IOMMU model. > That's true in VFIO now, and it's nasty; I was really hoping we could > *stop* doing that. that model is anyway inevitable when talking about user page table, i.e. when nesting is enabled. > > Note that I'm talking here *purely* about the non-optimized case where > all updates to the host side IO pagetables are handled by IOAS_MAP / > IOAS_COPY, with no direct hardware access to user or guest managed IO > pagetables. The optimized case obviously requires end-to-end > agreement on the pagetable format amongst other domain properties. > > What I'm hoping is that qemu (or whatever) can use this non-optimized > as a fallback case where it doesn't need to know the properties of > whatever host side IOMMU models there are. It just requests what it > needs based on the vIOMMU properties it needs to replicate and the > host kernel either can supply it or can't. > > In many cases it should be perfectly possible to emulate a PPC style > vIOMMU on an x86 host, because the x86 IOMMU has such a colossal > aperture that it will encompass wherever the ppc apertures end > up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc > host (currently somewhere between awkward and impossible) by placing > the host apertures to cover guest memory. > > Admittedly those are pretty niche cases, but allowing for them gives > us flexibility for the future. Emulating an ARM SMMUv3 guest on an > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in > the future, and AFAICT, ARM are much less conservative than x86 about > maintaining similar hw interfaces over time. That's why I think > considering these ppc cases will give a more robust interface for > other future possibilities as well. 
These are not niche cases. We already have virtio-iommu which can work on both ARM and x86 platforms, i.e. what current iommufd provides is already generic enough except on PPC. Then IMHO the key open question here is: Can PPC adapt to the current iommufd proposal if it can be refactored to fit the standard iommu domain/group concepts? If not, what is the remaining gap after PPC becomes a normal citizen in the iommu layer, and is it worth solving it in the general interface or via an iommu-driver-specific domain (given this will exist anyway)? To close that open question I'm with Jason: "Fundamentally PPC has to fit into the iommu standard framework of group and domains, we can talk about modifications, but drifting too far away is a big problem." Directly jumping to the iommufd layer for changes that might be applied to all platforms sounds counter-intuitive if we haven't tried to solve the gap in the iommu layer in the first place. Even if there is an argument that certain changes in the iommufd layer can find matching concepts on other platforms, it still looks somewhat redundant since those platforms already work with the current model. My two cents. Thanks Kevin
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote: > On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote: > > > > It is a bit more CPU work since maps in the lower range would have to > > > be copied over, but conceptually the model matches the HW nesting. > > > > Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have > > fixed windows, but we fake dynamic windows in the userspace > > implementation by flipping the devices over to a new IOAS with the new > > windows. Is that right? > > Yes > > > Where exactly would the windows be specified? My understanding was > > that when creating a back-end specific IOAS, that would typically be > > for the case where you're using a user / guest managed IO pagetable, > > with the backend specifying the format for that. In the ppc case we'd > > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP > > operations to manage the mappings. The PAPR vIOMMU is > > paravirtualized, so all updates come via hypercalls, so there's no > > user/guest managed data structure. > > When the iommu_domain is created I want to have a > iommu-driver-specific struct, so PPC can customize its iommu_domain > however it likes. This requires that the client be aware of the host side IOMMU model. That's true in VFIO now, and it's nasty; I was really hoping we could *stop* doing that. Note that I'm talking here *purely* about the non-optimized case where all updates to the host side IO pagetables are handled by IOAS_MAP / IOAS_COPY, with no direct hardware access to user or guest managed IO pagetables. The optimized case obviously requires end-to-end agreement on the pagetable format amongst other domain properties. What I'm hoping is that qemu (or whatever) can use this non-optimized as a fallback case where it doesn't need to know the properties of whatever host side IOMMU models there are. 
It just requests what it needs based on the vIOMMU properties it needs to replicate and the host kernel either can supply it or can't. In many cases it should be perfectly possible to emulate a PPC style vIOMMU on an x86 host, because the x86 IOMMU has such a colossal aperture that it will encompass wherever the ppc apertures end up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc host (currently somewhere between awkward and impossible) by placing the host apertures to cover guest memory. Admittedly those are pretty niche cases, but allowing for them gives us flexibility for the future. Emulating an ARM SMMUv3 guest on an ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in the future, and AFAICT, ARM are much less conservative than x86 about maintaining similar hw interfaces over time. That's why I think considering these ppc cases will give a more robust interface for other future possibilities as well. > > That should work from the point of view of the userspace and guest > > side interfaces. It might be fiddly from the point of view of the > > back end. The ppc iommu doesn't really have the notion of > > configurable domains - instead the address spaces are the hardware or > > firmware fixed PEs, so they have a fixed set of devices. At the bare > > metal level it's possible to sort of do domains by making the actual > > pagetable pointers for several PEs point to a common place. > > I'm not sure I understand this - a domain is just a storage container > for an IO page table, if the HW has IOPTEs then it should be able to > have a domain? > > Making page table pointers point to a common IOPTE tree is exactly > what iommu_domains are for - why is that "sort of" for ppc? Ok, fair enough, it's only "sort of" in the sense that the hw specs / docs don't present any equivalent concept. > > However, in the future, nested KVM under PowerVM is likely to be the > > norm.
In that situation the L1 as well as the L2 only has the > > paravirtualized interfaces, which don't have any notion of domains, > > only PEs. All updates take place via hypercalls which explicitly > > specify a PE (strictly speaking they take a "Logical IO Bus Number" > > (LIOBN), but those generally map one to one with PEs), so it can't use > > shared pointer tricks either. > > How does the paravirtualized interfaces deal with the page table? Does > it call a map/unmap hypercall instead of providing guest IOPTEs? Sort of. The main interface is H_PUT_TCE ("TCE" - Translation Control Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects which PE and aperture), an IOVA and a TCE value - which is a guest physical address plus some permission bits. There are some variants for performance that can set a batch of IOPTEs from a buffer, or clear a range of IOPTEs, but they're just faster ways of doing the same thing as a bunch of H_PUT_TCE calls. You can consider that a map/unmap hypercall, but the size of the mapping is fixed (the IO pagesize which was previously set for the aperture). > Assuming yes, I'
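For readers unfamiliar with PAPR, the H_PUT_TCE flow described above can be modeled in a few lines of userspace C. This is an illustrative toy, not the real hcall ABI: the error value, permission-bit layout and structure fields are invented for the sketch; only the shape follows the description (the LIOBN selects a window, the IOVA must be aligned to the window's fixed IO pagesize, the TCE is a guest physical address plus permission bits, and a TCE of 0 unmaps).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define H_SUCCESS   0
#define H_PARAMETER (-4)  /* illustrative error value only */

/* Toy model of one TCE window: base, size and IO pagesize are fixed. */
struct tce_window {
    uint64_t liobn;       /* logical IO bus number naming this window */
    uint64_t base;        /* fixed IOVA base of the aperture */
    uint32_t page_shift;  /* IO pagesize chosen when the window was set up */
    uint64_t nr_entries;
    uint64_t *table;      /* one TCE (guest PA | permission bits) per page */
};

/* Model of H_PUT_TCE: (un)map exactly one fixed-size IO page. */
static long h_put_tce(struct tce_window *w, uint64_t liobn,
                      uint64_t ioba, uint64_t tce)
{
    uint64_t idx;

    if (liobn != w->liobn)
        return H_PARAMETER;
    if (ioba & ((1ULL << w->page_shift) - 1))  /* must be page aligned */
        return H_PARAMETER;
    if (ioba < w->base)
        return H_PARAMETER;
    idx = (ioba - w->base) >> w->page_shift;
    if (idx >= w->nr_entries)
        return H_PARAMETER;
    w->table[idx] = tce;  /* tce == 0 means unmap */
    return H_SUCCESS;
}
```

The batched variants mentioned above would amount to loops over this per-page operation, which is why they are just faster ways of doing the same thing.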
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote: > > It is a bit more CPU work since maps in the lower range would have to > > be copied over, but conceptually the model matches the HW nesting. > > Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have > fixed windows, but we fake dynamic windows in the userspace > implementation by flipping the devices over to a new IOAS with the new > windows. Is that right? Yes > Where exactly would the windows be specified? My understanding was > that when creating a back-end specific IOAS, that would typically be > for the case where you're using a user / guest managed IO pagetable, > with the backend specifying the format for that. In the ppc case we'd > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP > operations to manage the mappings. The PAPR vIOMMU is > paravirtualized, so all updates come via hypercalls, so there's no > user/guest managed data structure. When the iommu_domain is created I want to have a iommu-driver-specific struct, so PPC can customize its iommu_domain however it likes. > That should work from the point of view of the userspace and guest > side interfaces. It might be fiddly from the point of view of the > back end. The ppc iommu doesn't really have the notion of > configurable domains - instead the address spaces are the hardware or > firmware fixed PEs, so they have a fixed set of devices. At the bare > metal level it's possible to sort of do domains by making the actual > pagetable pointers for several PEs point to a common place. I'm not sure I understand this - a domain is just a storage container for an IO page table, if the HW has IOPTEs then it should be able to have a domain? Making page table pointers point to a common IOPTE tree is exactly what iommu_domains are for - why is that "sort of" for ppc? > However, in the future, nested KVM under PowerVM is likely to be the > norm. 
In that situation the L1 as well as the L2 only has the > paravirtualized interfaces, which don't have any notion of domains, > only PEs. All updates take place via hypercalls which explicitly > specify a PE (strictly speaking they take a "Logical IO Bus Number" > (LIOBN), but those generally map one to one with PEs), so it can't use > shared pointer tricks either. How do the paravirtualized interfaces deal with the page table? Does it call a map/unmap hypercall instead of providing guest IOPTEs? Assuming yes, I'd expect that: The iommu_domain for nested PPC is just a log of map/unmap hypervisor calls to make. Whenever a new PE is attached to that domain it gets the logged maps replayed to set it up, and when a PE is detached the log is used to unmap everything. It is not perfectly memory efficient - and we could perhaps talk about an API modification to allow re-use of the iommufd datastructure somehow, but I think this is a good logical starting point. The PE would have to be modeled as an iommu_group. > So, here's an alternative set of interfaces that should work for ppc, > maybe you can tell me whether they also work for x86 and others: Fundamentally PPC has to fit into the iommu standard framework of group and domains, we can talk about modifications, but drifting too far away is a big problem. > * Each domain/IOAS has a concept of one or more IOVA windows, which > each have a base address, size, pagesize (granularity) and optionally > other flags/attributes. > * This has some bearing on hardware capabilities, but is > primarily a software notion iommu_domain has the aperture, PPC will require extending this to a list of apertures since it is currently only one window. Once a domain is created and attached to a group the aperture should be immutable.
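The "log of map/unmap calls" idea sketched above could look something like the following. All names here are invented for illustration - this is not proposed kernel API, just the replay-on-attach / unmap-on-detach shape:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged map call that is currently live in the domain. */
struct map_entry {
    uint64_t iova, pa, len;
    struct map_entry *next;
};

/* A software-nested "domain": no IOPTEs of its own, just the log. */
struct sw_domain {
    struct map_entry *log;
};

/* Hooks a PE backend would supply (e.g. issuing the hypercalls). */
struct pe_ops {
    int (*map)(void *pe, uint64_t iova, uint64_t pa, uint64_t len);
    int (*unmap)(void *pe, uint64_t iova, uint64_t len);
};

/* Attaching a PE replays every logged mapping into it. */
static int domain_attach_pe(struct sw_domain *d,
                            const struct pe_ops *ops, void *pe)
{
    for (struct map_entry *m = d->log; m; m = m->next) {
        int rc = ops->map(pe, m->iova, m->pa, m->len);
        if (rc)
            return rc;  /* real code would unwind the partial replay */
    }
    return 0;
}

/* Detaching uses the same log to unmap everything again. */
static void domain_detach_pe(struct sw_domain *d,
                             const struct pe_ops *ops, void *pe)
{
    for (struct map_entry *m = d->log; m; m = m->next)
        ops->unmap(pe, m->iova, m->len);
}

/* A tiny fake backend used only for demonstration: counts the calls. */
static int fake_maps, fake_unmaps;
static int fake_map(void *pe, uint64_t iova, uint64_t pa, uint64_t len)
{ (void)pe; (void)iova; (void)pa; (void)len; fake_maps++; return 0; }
static int fake_unmap(void *pe, uint64_t iova, uint64_t len)
{ (void)pe; (void)iova; (void)len; fake_unmaps++; return 0; }
```

The memory-efficiency concern in the message is visible here: the log duplicates what the parent IOAS already tracks, hence the suggestion to eventually reuse the iommufd data structure instead.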
> * MAP/UNMAP operations are only permitted within an existing IOVA > window (and with addresses aligned to the window's pagesize) > * This is enforced by software whether or not it is required by > the underlying hardware > * Likewise IOAS_COPY operations are only permitted if the source and > destination windows have compatible attributes Already done, domain's aperture restricts all the iommufd operations > * A newly created kernel-managed IOAS has *no* IOVA windows Already done, the iommufd IOAS has no iommu_domains inside it at creation time. > * A CREATE_WINDOW operation is added > * This takes a size, pagesize/granularity, optional base address > and optional additional attributes > * If any of the specified attributes are incompatible with the > underlying hardware, the operation fails iommu layer has nothing called a window. The closest thing is a domain. I really don't want to try to make a new iommu layer object that is so unique and special to PPC - we have to figure out how to fit PPC into the iommu_domain model with reasonable extensions. > > > > Maybe every device gets a copy of the error notification? > > > > > > Alas, it's har
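To make the CREATE_WINDOW proposal above concrete, its argument could take a shape like the following. This is purely hypothetical - no such iommufd ioctl exists, and every field name is invented; it only restates the attributes listed in the proposal (size, pagesize/granularity, optional base address, optional flags) plus the rule that a request incompatible with the underlying hardware fails:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CREATE_WINDOW argument; nothing like this exists today. */
struct ioas_create_window {
    uint32_t size;          /* sizeof(struct ioas_create_window) */
    uint32_t flags;
#define CREATE_WINDOW_FIXED_BASE (1u << 0)  /* base_iova is a requirement */
    uint32_t ioas_id;
    uint32_t page_shift;    /* pagesize/granularity of the window */
    uint64_t length;        /* size of the window in bytes */
    uint64_t base_iova;     /* consulted only if FIXED_BASE is set */
    uint32_t out_window_id; /* kernel-chosen handle on success */
    uint32_t __reserved;
};

/* Invented hardware description for the compatibility check. */
struct hw_caps {
    uint64_t supported_pagesizes;  /* bitmap of supported page shifts */
    uint64_t max_window_len;
};

/* "If any specified attribute is incompatible with the underlying
 * hardware, the operation fails." */
static int check_window(const struct hw_caps *hw,
                        const struct ioas_create_window *req)
{
    if (!(hw->supported_pagesizes & (1ULL << req->page_shift)))
        return -1;
    if (req->length > hw->max_window_len)
        return -1;
    if (req->length & ((1ULL << req->page_shift) - 1))
        return -1;  /* window must be a whole number of IO pages */
    return 0;
}
```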
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 09:48:38AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote: > > > > I think PPC and S390 are solving the same problem here. I think S390 > > > is going to go to a SW nested model where it has an iommu_domain > > > controlled by iommufd that is populated with the pinned pages, eg > > > stored in an xarray. > > > > > > Then the performance map/unmap path is simply copying pages from the > > > xarray to the real IOPTEs - and this would be modeled as a nested > > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > > > > > Perhaps this is agreeable for PPC too? > > > > Uh.. maybe? Note that I'm making these comments based on working on > > this some years ago (the initial VFIO for ppc implementation in > > particular). I'm no longer actively involved in ppc kernel work. > > OK > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 > > > > IOVA > > > > windows, which aren't contiguous with each other. The base addresses > > > > of each of these are fixed, but the size of each window, the pagesize > > > > (i.e. granularity) of each window and the number of levels in the > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > changes in how these windows are handled. Typical Linux guests will > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > can't count on that; the guest can use them however it wants. > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > iommu driver specific data' primitive. 
> > > > > > The driver specific data for PPC can include a description of these > > > windows so the PPC specific qemu driver can issue this new ioctl > > > using the information provided by the guest. > > > > Hmm.. not sure if that works. At the moment, qemu (for example) needs > > to set up the domains/containers/IOASes as it constructs the machine, > > because that's based on the virtual hardware topology. Initially they > > use the default windows (0..2GiB first window, second window > > disabled). Only once the guest kernel is up and running does it issue > > the hypercalls to set the final windows as it prefers. In theory the > > guest could change them during runtime though it's unlikely in > > practice. They could change during machine lifetime in practice, > > though, if you rebooted from one guest kernel to another that uses a > > different configuration. > > > > *Maybe* IOAS construction can be deferred somehow, though I'm not sure > > because the assigned devices need to live somewhere. > > This is a general requirement for all the nesting implementations, we > start out with some default nested page table and then later the VM > does the vIOMMU call to change it. So nesting will have to come along > with some kind of 'switch domains IOCTL' > > In this case I would guess PPC could do the same and start out with a > small (nested) iommu_domain and then create the VM's desired > iommu_domain from the hypercall, and switch to it. > > It is a bit more CPU work since maps in the lower range would have to > be copied over, but conceptually the model matches the HW nesting. Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have fixed windows, but we fake dynamic windows in the userspace implementation by flipping the devices over to a new IOAS with the new windows. Is that right? Where exactly would the windows be specified? 
My understanding was that when creating a back-end specific IOAS, that would typically be for the case where you're using a user / guest managed IO pagetable, with the backend specifying the format for that. In the ppc case we'd need to specify the windows, but we'd still need the IOAS_MAP/UNMAP operations to manage the mappings. The PAPR vIOMMU is paravirtualized, so all updates come via hypercalls, so there's no user/guest managed data structure. That should work from the point of view of the userspace and guest side interfaces. It might be fiddly from the point of view of the back end. The ppc iommu doesn't really have the notion of configurable domains - instead the address spaces are the hardware or firmware fixed PEs, so they have a fixed set of devices. At the bare metal level it's possible to sort of do domains by making the actual pagetable pointers for several PEs point to a common place. However, in the future, nested KVM under PowerVM is likely to be the norm. In that situation the L1 as well as the L2 only has the paravirtualized interfaces, which don't have any notion of domains, only PEs. All updates take place via hypercalls which explicitly specify a PE (strictly speaking they take a "Logical IO Bus Number" (
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 09:50:30AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote: > > On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > > > 2 IOVA > > > > > windows, which aren't contiguous with each other. The base addresses > > > > > of each of these are fixed, but the size of each window, the pagesize > > > > > (i.e. granularity) of each window and the number of levels in the > > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > > changes in how these windows are handled. Typical Linux guests will > > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > > can't count on that; the guest can use them however it wants. > > > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > > iommu driver specific data' primitive. > > > > > > > > The driver specific data for PPC can include a description of these > > > > windows so the PPC specific qemu driver can issue this new ioctl > > > > using the information provided by the guest. > > > > > > > > The main issue is that internally to the iommu subsystem the > > > > iommu_domain aperture is assumed to be a single window. This kAPI will > > > > have to be improved to model the PPC multi-window iommu_domain. > > > > > > > > > > From the point of nesting probably each window can be a separate > > > domain then the existing aperture should still work? > > > > Maybe. 
There might be several different ways to represent it, but the > > vital piece is that any individual device (well, group, technically) > > must atomically join/leave both windows at once. > > I'm not keen on the multi-iommu_domains because it means we have to > create the idea that a device can be attached to multiple > iommu_domains, which we don't have at all today. > > Since iommu_domain allows PPC to implement its special rules, like the > atomicness above. I tend to agree; I think extending the iommu domain concept to incorporate multiple windows makes more sense than extending to allow multiple domains per device. I'm just saying there might be other ways of representing this, and that's not a sticking point for me as long as the right properties can be preserved. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote: > On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > > 2 IOVA > > > > windows, which aren't contiguous with each other. The base addresses > > > > of each of these are fixed, but the size of each window, the pagesize > > > > (i.e. granularity) of each window and the number of levels in the > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > changes in how these windows are handled. Typical Linux guests will > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > can't count on that; the guest can use them however it wants. > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > iommu driver specific data' primitive. > > > > > > The driver specific data for PPC can include a description of these > > > windows so the PPC specific qemu driver can issue this new ioctl > > > using the information provided by the guest. > > > > > > The main issue is that internally to the iommu subsystem the > > > iommu_domain aperture is assumed to be a single window. This kAPI will > > > have to be improved to model the PPC multi-window iommu_domain. > > > > > > > From the point of nesting probably each window can be a separate > > domain then the existing aperture should still work? > > Maybe. There might be several different ways to represent it, but the > vital piece is that any individual device (well, group, technically) > must atomically join/leave both windows at once. 
I'm not keen on the multi-iommu_domains because it means we have to create the idea that a device can be attached to multiple iommu_domains, which we don't have at all today. Since iommu_domain allows PPC to implement its special rules, like the atomicness above. Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote: > > I think PPC and S390 are solving the same problem here. I think S390 > > is going to go to a SW nested model where it has an iommu_domain > > controlled by iommufd that is populated with the pinned pages, eg > > stored in an xarray. > > > > Then the performance map/unmap path is simply copying pages from the > > xarray to the real IOPTEs - and this would be modeled as a nested > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > > > Perhaps this is agreeable for PPC too? > > Uh.. maybe? Note that I'm making these comments based on working on > this some years ago (the initial VFIO for ppc implementation in > particular). I'm no longer actively involved in ppc kernel work. OK > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > > > windows, which aren't contiguous with each other. The base addresses > > > of each of these are fixed, but the size of each window, the pagesize > > > (i.e. granularity) of each window and the number of levels in the > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > changes in how these windows are handled. Typical Linux guests will > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > can't count on that; the guest can use them however it wants. > > > > As part of nesting iommufd will have a 'create iommu_domain using > > iommu driver specific data' primitive. > > > > The driver specific data for PPC can include a description of these > > windows so the PPC specific qemu driver can issue this new ioctl > > using the information provided by the guest. > > Hmm.. not sure if that works. 
At the moment, qemu (for example) needs > to set up the domains/containers/IOASes as it constructs the machine, > because that's based on the virtual hardware topology. Initially they > use the default windows (0..2GiB first window, second window > disabled). Only once the guest kernel is up and running does it issue > the hypercalls to set the final windows as it prefers. In theory the > guest could change them during runtime though it's unlikely in > practice. They could change during machine lifetime in practice, > though, if you rebooted from one guest kernel to another that uses a > different configuration. > > *Maybe* IOAS construction can be deferred somehow, though I'm not sure > because the assigned devices need to live somewhere. This is a general requirement for all the nesting implementations, we start out with some default nested page table and then later the VM does the vIOMMU call to change it. So nesting will have to come along with some kind of 'switch domains IOCTL' In this case I would guess PPC could do the same and start out with a small (nested) iommu_domain and then create the VM's desired iommu_domain from the hypercall, and switch to it. It is a bit more CPU work since maps in the lower range would have to be copied over, but conceptually the model matches the HW nesting. > > > You might be able to do this by simply failing this outright if > > > there's anything other than exactly one IOMMU group bound to the > > > container / IOAS (which I think might be what VFIO itself does now). > > > Handling that with a device centric API gets somewhat fiddlier, of > > > course. > > > > Maybe every device gets a copy of the error notification? > > Alas, it's harder than that. One of the things that can happen on an > EEH fault is that the entire PE gets suspended (blocking both DMA and > MMIO, IIRC) until the proper recovery steps are taken. 
I think qemu would have to de-duplicate the duplicated device notifications and then it can go from a device notification to the device's iommu_group to the IOAS to the vPE? A simple serial number in the event would make this pretty simple. The way back to clear the event would just forward the commands through a random device in the iommu_group to the PE? Thanks, Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote: > > > 2) Costly GUPs. pseries (the most common ppc machine type) always > > expects a (v)IOMMU. That means that unlike the common x86 model of a > > host with IOMMU, but guests with no-vIOMMU, guest initiated > > maps/unmaps can be a hot path. Accounting in that path can be > > prohibitive (and on POWER8 in particular it prevented us from > > optimizing that path the way we wanted). We had two solutions for > > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted > > based on the IOVA window sizes. That was improved in the v2 which > > used the concept of preregistration. IIUC iommufd can achieve the > > same effect as preregistration using IOAS_COPY, so this one isn't > > really a problem either. > > I think PPC and S390 are solving the same problem here. I think S390 > is going to go to a SW nested model where it has an iommu_domain > controlled by iommufd that is populated with the pinned pages, eg > stored in an xarray. > > Then the performance map/unmap path is simply copying pages from the > xarray to the real IOPTEs - and this would be modeled as a nested > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > Perhaps this is agreeable for PPC too? Uh.. maybe? Note that I'm making these comments based on working on this some years ago (the initial VFIO for ppc implementation in particular). I'm no longer actively involved in ppc kernel work. > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > > windows, which aren't contiguous with each other. The base addresses > > of each of these are fixed, but the size of each window, the pagesize > > (i.e. granularity) of each window and the number of levels in the > > IOMMU pagetable are runtime configurable. 
Because it's true in the > > hardware, it's also true of the vIOMMU interface defined by the IBM > > hypervisor (and adpoted by KVM as well). So, guests can request > > changes in how these windows are handled. Typical Linux guests will > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > can't count on that; the guest can use them however it wants. > > As part of nesting iommufd will have a 'create iommu_domain using > iommu driver specific data' primitive. > > The driver specific data for PPC can include a description of these > windows so the PPC specific qemu driver can issue this new ioctl > using the information provided by the guest. Hmm.. not sure if that works. At the moment, qemu (for example) needs to set up the domains/containers/IOASes as it constructs the machine, because that's based on the virtual hardware topology. Initially they use the default windows (0..2GiB first window, second window disabled). Only once the guest kernel is up and running does it issue the hypercalls to set the final windows as it prefers. In theory the guest could change them during runtime though it's unlikely in practice. They could change during machine lifetime in practice, though, if you rebooted from one guest kernel to another that uses a different configuration. *Maybe* IOAS construction can be deferred somehow, though I'm not sure because the assigned devices need to live somewhere. > The main issue is that internally to the iommu subsystem the > iommu_domain aperture is assumed to be a single window. This kAPI will > have to be improved to model the PPC multi-window iommu_domain. Right. > If this API is not used then the PPC driver should choose some > sensible default windows that makes things like DPDK happy. > > > Then, there's handling existing qemu (or other software) that is using > > the VFIO SPAPR_TCE interfaces. 
First, it's not entirely clear if this > > should be a goal or not: as others have noted, working actively to > > port qemu to the new interface at the same time as making a > > comprehensive in-kernel compat layer is arguably redundant work. > > At the moment I think I would stick with not including the SPAPR > interfaces in vfio_compat, but there does seem to be a path if someone > with HW wants to build and test them? > > > You might be able to do this by simply failing this outright if > > there's anything other than exactly one IOMMU group bound to the > > container / IOAS (which I think might be what VFIO itself does now). > > Handling that with a device centric API gets somewhat fiddlier, of > > course. > > Maybe every device gets a copy of the error notification? Alas, it's harder than that. One of the things that can happen on an EEH fault is that the entire PE gets suspended (blocking both DMA and MMIO, IIRC) until the proper recovery steps are taken. Since that's handled at the hardware/firmware level, it will obviously only affect the host side PE (== host iommu group). However the interfaces we have only allow
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > 2 IOVA > > > windows, which aren't contiguous with each other. The base addresses > > > of each of these are fixed, but the size of each window, the pagesize > > > (i.e. granularity) of each window and the number of levels in the > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > changes in how these windows are handled. Typical Linux guests will > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > can't count on that; the guest can use them however it wants. > > > > As part of nesting iommufd will have a 'create iommu_domain using > > iommu driver specific data' primitive. > > > > The driver specific data for PPC can include a description of these > > windows so the PPC specific qemu driver can issue this new ioctl > > using the information provided by the guest. > > > > The main issue is that internally to the iommu subsystem the > > iommu_domain aperture is assumed to be a single window. This kAPI will > > have to be improved to model the PPC multi-window iommu_domain. > > > > From the point of nesting probably each window can be a separate > domain then the existing aperture should still work? Maybe. There might be several different ways to represent it, but the vital piece is that any individual device (well, group, technically) must atomically join/leave both windows at once. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe > Sent: Thursday, April 28, 2022 11:11 PM > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > 2 IOVA > > windows, which aren't contiguous with each other. The base addresses > > of each of these are fixed, but the size of each window, the pagesize > > (i.e. granularity) of each window and the number of levels in the > > IOMMU pagetable are runtime configurable. Because it's true in the > > hardware, it's also true of the vIOMMU interface defined by the IBM > > hypervisor (and adpoted by KVM as well). So, guests can request > > changes in how these windows are handled. Typical Linux guests will > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > can't count on that; the guest can use them however it wants. > > As part of nesting iommufd will have a 'create iommu_domain using > iommu driver specific data' primitive. > > The driver specific data for PPC can include a description of these > windows so the PPC specific qemu driver can issue this new ioctl > using the information provided by the guest. > > The main issue is that internally to the iommu subsystem the > iommu_domain aperture is assumed to be a single window. This kAPI will > have to be improved to model the PPC multi-window iommu_domain. > From the point of nesting probably each window can be a separate domain then the existing aperture should still work?
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote: > 2) Costly GUPs. pseries (the most common ppc machine type) always > expects a (v)IOMMU. That means that unlike the common x86 model of a > host with IOMMU, but guests with no-vIOMMU, guest initiated > maps/unmaps can be a hot path. Accounting in that path can be > prohibitive (and on POWER8 in particular it prevented us from > optimizing that path the way we wanted). We had two solutions for > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted > based on the IOVA window sizes. That was improved in the v2 which > used the concept of preregistration. IIUC iommufd can achieve the > same effect as preregistration using IOAS_COPY, so this one isn't > really a problem either. I think PPC and S390 are solving the same problem here. I think S390 is going to go to a SW nested model where it has an iommu_domain controlled by iommufd that is populated with the pinned pages, eg stored in an xarray. Then the performance map/unmap path is simply copying pages from the xarray to the real IOPTEs - and this would be modeled as a nested iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. Perhaps this is agreeable for PPC too? > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > windows, which aren't contiguous with each other. The base addresses > of each of these are fixed, but the size of each window, the pagesize > (i.e. granularity) of each window and the number of levels in the > IOMMU pagetable are runtime configurable. Because it's true in the > hardware, it's also true of the vIOMMU interface defined by the IBM > hypervisor (and adpoted by KVM as well). So, guests can request > changes in how these windows are handled. Typical Linux guests will > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > can't count on that; the guest can use them however it wants. 
As part of nesting iommufd will have a 'create iommu_domain using iommu driver specific data' primitive. The driver specific data for PPC can include a description of these windows so the PPC specific qemu driver can issue this new ioctl using the information provided by the guest.

The main issue is that internally to the iommu subsystem the iommu_domain aperture is assumed to be a single window. This kAPI will have to be improved to model the PPC multi-window iommu_domain.

If this API is not used then the PPC driver should choose some sensible default windows that make things like DPDK happy.

> Then, there's handling existing qemu (or other software) that is using the VFIO SPAPR_TCE interfaces. First, it's not entirely clear if this should be a goal or not: as others have noted, working actively to port qemu to the new interface at the same time as making a comprehensive in-kernel compat layer is arguably redundant work.

At the moment I think I would stick with not including the SPAPR interfaces in vfio_compat, but there does seem to be a path if someone with HW wants to build and test them?

> You might be able to do this by simply failing this outright if there's anything other than exactly one IOMMU group bound to the container / IOAS (which I think might be what VFIO itself does now). Handling that with a device centric API gets somewhat fiddlier, of course. Maybe every device gets a copy of the error notification?

ie maybe this should be part of vfio_pci and not part of iommufd to mirror how AER works? It feels strange to put device error notification into iommufd - is that connected to the IOMMU?

Jason

___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
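[Editor's note] The multi-window aperture issue above is easy to picture with a small sketch. This is hypothetical user-space C, not the iommu kAPI; the bases and sizes are only the illustrative values from the thread (low window 0..2GiB, high window at 1<<60 with a runtime-configurable size, 64GiB picked arbitrarily here):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a multi-window aperture check, i.e. what a "PPC multi-window
 * iommu_domain" would need instead of the single-aperture test the iommu
 * subsystem assumes today. All values/names are illustrative.
 */
struct ddw_window {
	uint64_t base;
	uint64_t size;
};

static const struct ddw_window windows[2] = {
	{ .base = 0,          .size = 2ULL << 30 },  /* "low" window  */
	{ .base = 1ULL << 60, .size = 64ULL << 30 }, /* "high" window */
};

/* Return the index of the window fully covering [iova, iova+len), or -1. */
static int window_for_range(uint64_t iova, uint64_t len)
{
	for (int i = 0; i < 2; i++) {
		if (iova >= windows[i].base &&
		    iova - windows[i].base + len <= windows[i].size)
			return i;
	}
	return -1;
}
```

A request in the gap between the two windows, or straddling a window edge, has no valid answer - exactly what a single-aperture model cannot express.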
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:

> On Wed, 23 Mar 2022 21:33:42 -0300 Jason Gunthorpe wrote:
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > > My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process?
> > Other than SPAPR, I think we can.
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure for POWER support?

There are a few different levels to consider for dealing with PPC. For a suitable long term interface for ppc hosts and guests, dropping this is fine: the ppc specific iommu model was basically an ill-conceived idea from the beginning, because none of us had sufficiently understood what things were general and what things were iommu model/hw specific.

...mostly. There are several points of divergence for the ppc iommu model.

1) Limited IOVA windows. This one turned out to not really be ppc specific, and is (rightly) handled generically in the new interface. No problem here.

2) Costly GUPs. pseries (the most common ppc machine type) always expects a (v)IOMMU. That means that unlike the common x86 model of a host with IOMMU, but guests with no-vIOMMU, guest initiated maps/unmaps can be a hot path. Accounting in that path can be prohibitive (and on POWER8 in particular it prevented us from optimizing that path the way we wanted). We had two solutions for that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted based on the IOVA window sizes. That was improved in the v2 which used the concept of preregistration. IIUC iommufd can achieve the same effect as preregistration using IOAS_COPY, so this one isn't really a problem either.

3) "dynamic DMA windows" (DDW).
The IBM IOMMU hardware allows for 2 IOVA windows, which aren't contiguous with each other. The base addresses of each of these are fixed, but the size of each window, the pagesize (i.e. granularity) of each window and the number of levels in the IOMMU pagetable are runtime configurable. Because it's true in the hardware, it's also true of the vIOMMU interface defined by the IBM hypervisor (and adopted by KVM as well). So, guests can request changes in how these windows are handled. Typical Linux guests will use the "low" window (IOVA 0..2GiB) dynamically, and the high window (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we can't count on that; the guest can use them however it wants.

(3) still needs a plan for how to fit it into the /dev/iommufd model. This is a secondary reason that in the past I advocated for the user requesting specific DMA windows which the kernel would accept or refuse, rather than having a query function - it connects easily to the DDW model. With the query-first model we'd need some sort of extension here, not really sure what it should look like.

Then, there's handling existing qemu (or other software) that is using the VFIO SPAPR_TCE interfaces. First, it's not entirely clear if this should be a goal or not: as others have noted, working actively to port qemu to the new interface at the same time as making a comprehensive in-kernel compat layer is arguably redundant work.

That said, if we did want to handle this in an in-kernel compat layer, here's roughly what you'd need for SPAPR_TCE v2:

- VFIO_IOMMU_SPAPR_TCE_GET_INFO: I think this should be fairly straightforward; the information you need should be in the now generic IOVA window stuff and would just need massaging into the expected format.
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY / VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: IIUC, these could be translated into map/unmap operations onto a second implicit IOAS which represents the preregistered memory areas (to which we'd never connect an actual device). Along with this, VFIO_MAP and VFIO_UNMAP operations would need to check for this case, verify their addresses against the preregistered space and be translated into IOAS_COPY operations from the prereg address space instead of raw IOAS_MAP operations. Fiddly, but not fundamentally hard, I think.

For SPAPR_TCE_v1 things are a bit trickier:

- VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE: I suspect you could get away with implementing these as no-ops. It wouldn't be strictly correct, but I think software which is using the interface correctly should work this way, though possibly not optimally. That might be good enough for this ugly old interface.

And... then there's VFIO_EEH_PE_OP. It's very hard to know what to do with this because the interface was completely broken for most of its lifetime. EEH is a fancy error handling feature of IBM PCI hardware somewhat similar in concept, though not interface, t
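[Editor's note] The register-then-copy translation sketched above can be modeled in a few lines. The following is a hypothetical user-space illustration (the names are invented; this is not the vfio or iommufd uAPI): REGISTER_MEMORY records a range in a "prereg" space, and a later map request is only accepted - and would become the source of an IOAS_COPY rather than a raw IOAS_MAP - when it falls entirely inside a registered range:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the SPAPR v2 compat translation. The prereg IOAS is
 * identity-addressed here for simplicity. Everything is hypothetical
 * user-space logic, not a kernel implementation.
 */
struct prereg_range {
	uint64_t start;
	uint64_t len;
};

#define NRANGES 4
static struct prereg_range prereg[NRANGES];

/* Model of VFIO_IOMMU_SPAPR_REGISTER_MEMORY populating the prereg space. */
static int spapr_register_memory(int slot, uint64_t start, uint64_t len)
{
	if (slot < 0 || slot >= NRANGES)
		return -1;
	prereg[slot] = (struct prereg_range){ start, len };
	return 0;
}

/*
 * On success, *copy_src is the source address for the would-be IOAS_COPY.
 * Returns -1 when [vaddr, vaddr+len) was never preregistered, i.e. the
 * map must be rejected.
 */
static int compat_map_to_copy(uint64_t vaddr, uint64_t len, uint64_t *copy_src)
{
	for (int i = 0; i < NRANGES; i++) {
		if (prereg[i].len && vaddr >= prereg[i].start &&
		    vaddr - prereg[i].start + len <= prereg[i].len) {
			*copy_src = vaddr;
			return 0;
		}
	}
	return -1;
}
```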
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On 2022/3/24 06:51, Alex Williamson wrote:

On Fri, 18 Mar 2022 14:27:36 -0300 Jason Gunthorpe wrote:

iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. Doing so allows the use of iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to SET_CONTAINER using an iommufd instead of a container fd is a followup series.

Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached.

Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only features while still using the VFIO style map/unmap ioctls.

While this is enough to operate qemu, it is still a bit of a WIP with a few gaps to be resolved:

- Only the TYPE1v2 mode is supported where unmap cannot punch holes or split areas. The old mode can be implemented with a new operation to split an iopt_area into two without disturbing the iopt_pages or the domains, then unmapping a whole area as normal.

- Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit.

- VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will require some additional work to properly expose PFN lifecycle between VFIO and iommufd

- Various components of the mdev API are not completed yet

- Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not implemented.

- The 'dirty tracking' is not implemented

- A full audit for pedantic compatibility details (eg errnos, etc) has not yet been done

- powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. My hope is that SPAPR will be moved into the iommu_domain framework as a special HW specific type and would expect power to support the generic interface through a normal iommu_domain.
My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process? The locked memory differences for one seem like something that libvirt wouldn't want hidden and we have questions regarding support for vaddr hijacking and different ideas how to implement dirty page tracking, not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc. It seems like quite an endeavor to fill all these gaps, while at the same time QEMU will be working to move to use iommufd directly in order to gain all the new features.

Hi Alex,

Jason hasn't included the vfio changes for adapting to iommufd. But it's in this branch (https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6). Eric and I are working on adding the iommufd support in QEMU as well. To run the new QEMU on an old kernel, QEMU is supposed to support both the legacy group/container interface and the latest device/iommufd interface. We've got some draft code in this direction (https://github.com/luxis1999/qemu/commits/qemu-for-5.17-rc4-vm). It works for both the legacy group/container and device/iommufd paths. It's just for reference so far; Eric and I will have a further sync on it.

Where do we focus attention? Is symlinking device files our proposal to userspace and is that something achievable, or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,

I'm sure it is possible that one day the group/container interface will be removed from the kernel.
Perhaps this will happen when SPAPR is supported by iommufd. But how about QEMU - should QEMU keep backward compatibility forever? Or may QEMU one day also remove the group/container path and hence be unable to work on old kernels?

--
Regards,
Yi Liu
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On 3/24/22 23:11, Jason Gunthorpe wrote:

> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300 Jason Gunthorpe wrote:
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>> I don't think this is compatibility. No kernel today triggers qemu to use this feature as no kernel supports live migration. No existing qemu will trigger this feature with new kernels that support live migration v2. Therefore we can adjust qemu's dirty tracking at the same time we enable migration v2 in qemu.
>>
>> I guess I was assuming that enabling v2 migration in QEMU was dependent on the existing type1 dirty tracking because it's the only means we have to tell QEMU that all memory is perpetually dirty when we have a DMA device. Is that not correct?
>
> I haven't looked closely at this part in qemu, but IMHO, if qemu sees that it has VFIO migration support but does not have any DMA dirty tracking capability it should not do precopy flows.
>
> If there is a bug here we should certainly fix it before progressing the v2 patches. I'll ask Yishai & Co to take a look.

I think that's already the case. With VFIO IOMMU type1, the kernel always exports a migration capability and the page sizes it supports. On the VMM side, if it matches the page size QEMU is using (PAGE_SIZE on x86), QEMU determines it will /use/ the vfio container ioctls. Which, well, is effectively always the case when the ioctl is there, considering we dirty every page.

In QEMU, the start and stop of dirty tracking is actually unbounded (it attempts to do it without checking if the capability is there), although when syncing the dirty bits from vfio into QEMU's private tracking it does check whether dirty page tracking is supported before even trying the sync via the ioctl.
/Most importantly/, prior to all of this starting/stopping/syncing of dirty tracking, QEMU adds a live migration blocker if either the device doesn't support migration or the VFIO container doesn't support it (so migration won't even start). So I think the VMM knows how to deal with the lack of the dirty container ioctls, as far as my understanding goes.

TBH, I am not overly concerned with dirty page tracking in the vfio-compat layer -- I have been doing both in tandem (old and new). We mainly need to decide what we want to maintain in the compat layer. I can drop the IOMMU support code I have from vfio-compat, or we do the 'perpetual dirtying' that the current code does, or we don't support the dirty ioctls in vfio-compat at all. Maybe the latter makes more sense, as that might mimic more accurately what the hardware supports, and prevent VMMs from even starting migration. The second looks useful for testing, but marking all DMA-mapped memory dirty seems to be too much in a real world migration scenario :( especially as the guest size increases.
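[Editor's note] The trade-off Joao describes can be made concrete with a toy model of the two compat options - "perpetual dirtying" vs. a real tracker. Names and layout are invented for illustration; this is not the type1 implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Without any real tracker, the only safe answer a dirty-sync ioctl can
 * give is "everything is dirty", which is what makes precopy pointless
 * as guest size grows.
 */
#define NPAGES 8

struct dirty_state {
	bool have_tracker;       /* is anything actually tracking writes? */
	uint8_t bitmap[NPAGES];  /* one byte per page, for simplicity */
};

/* What a sync must report for one page. */
static bool page_dirty(const struct dirty_state *s, int pg)
{
	if (!s->have_tracker)
		return true;            /* no tracking: must assume dirty */
	return s->bitmap[pg] != 0;      /* real tracking: report actual state */
}

static int count_dirty(const struct dirty_state *s)
{
	int n = 0;

	for (int pg = 0; pg < NPAGES; pg++)
		n += page_dirty(s, pg);
	return n;
}
```

With `have_tracker` false, every sync reports all pages dirty and the whole guest is re-sent each precopy pass; with a tracker, only the pages actually written are.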
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Friday, March 25, 2022 7:12 AM
>
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > That's essentially what I'm trying to reconcile, we're racing both to round out the compatibility interface to fully support QEMU, while also updating QEMU to use iommufd directly so it won't need that full support. It's a confusing message. Thanks,
>
> The long term purpose of compatibility is to provide a config option to allow type 1 to be turned off and continue to support old user space (eg in containers) that is running old qemu/dpdk/spdk/etc.
>
> This shows that we have a plan/path to allow a distro to support only one iommu interface in their kernel should they choose without having to sacrifice uABI compatibility.
>
> As for racing, my intention is to leave the compat interface alone for a while - the more urgent things on my personal list are the RFC for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation for iommufd support.
>
> Eric and Yi are focusing on userspace page tables and qemu updates.
>
> Joao is working on implementing iommu driver dirty tracking.
>
> Lu and Jacob are working on getting PASID support infrastructure together.
>
> There is a lot going on!
>
> A question to consider is what would you consider the minimum bar for merging?

My two cents. 😊 IMHO making the compat work a task in parallel with the other work listed above is the most efficient approach to move forward. In concept they are not mutually dependent, as they use different sets of uAPIs (vfio compat vs. iommufd native). Otherwise, considering the list of TODOs, the compat work will become a single big task gating all the other work.

If agreed, this suggests we may want to prioritize Yi's vfio device uAPI [1] to integrate vfio with iommufd to get this series merged. IIRC there are fewer opens remaining from the v1 discussion compared to the list for the compat interface.
Of course it needs the Qemu change ready to use iommufd directly, but this is necessary to unblock other tasks anyway. [1] https://github.com/luxis1999/iommufd/commit/2d9278d4ecad7953b3787c98cdb650764af8a1a1 Thanks Kevin
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:

> On Wed, 23 Mar 2022 21:33:42 -0300 Jason Gunthorpe wrote:
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > > My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process?
> > Other than SPAPR, I think we can.
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure for POWER support?

Certainly initially - I have no ability to do better than that. I'm hoping someone from IBM will be willing to work on this in the long run and we can do better.

> > I don't think this is compatibility. No kernel today triggers qemu to use this feature as no kernel supports live migration. No existing qemu will trigger this feature with new kernels that support live migration v2. Therefore we can adjust qemu's dirty tracking at the same time we enable migration v2 in qemu.
> I guess I was assuming that enabling v2 migration in QEMU was dependent on the existing type1 dirty tracking because it's the only means we have to tell QEMU that all memory is perpetually dirty when we have a DMA device. Is that not correct?

I haven't looked closely at this part in qemu, but IMHO, if qemu sees that it has VFIO migration support but does not have any DMA dirty tracking capability it should not do precopy flows.

If there is a bug here we should certainly fix it before progressing the v2 patches. I'll ask Yishai & Co to take a look.

> > Intel no-snoop is simple enough, just needs some Intel cleanup parts.

Patches for this exist now.

> > mdev will come along with the final VFIO integration, all the really hard parts are done already.
> > The VFIO integration is a medium sized task overall.
> >
> > So, I'm not ready to give up yet :)
>
> Ok, that's a more promising outlook than I was inferring from the long list of missing features.

Yeah, it is just long, but they are not scary things, just priorities and patch planning.

> > I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR of course)
> >
> > For the dpdk/etcs of the world I think we are already there.
>
> That's essentially what I'm trying to reconcile, we're racing both to round out the compatibility interface to fully support QEMU, while also updating QEMU to use iommufd directly so it won't need that full support. It's a confusing message. Thanks,

The long term purpose of compatibility is to provide a config option to allow type 1 to be turned off and continue to support old user space (eg in containers) that is running old qemu/dpdk/spdk/etc.

This shows that we have a plan/path to allow a distro to support only one iommu interface in their kernel should they choose without having to sacrifice uABI compatibility.

As for racing, my intention is to leave the compat interface alone for a while - the more urgent things on my personal list are the RFC for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation for iommufd support.

Eric and Yi are focusing on userspace page tables and qemu updates.

Joao is working on implementing iommu driver dirty tracking.

Lu and Jacob are working on getting PASID support infrastructure together.

There is a lot going on!

A question to consider is what would you consider the minimum bar for merging?

Thanks, Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Wed, 23 Mar 2022 21:33:42 -0300 Jason Gunthorpe wrote: > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote: > > > My overall question here would be whether we can actually achieve a > > compatibility interface that has sufficient feature transparency that we > > can dump vfio code in favor of this interface, or will there be enough > > niche use cases that we need to keep type1 and vfio containers around > > through a deprecation process? > > Other than SPAPR, I think we can. Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure for POWER support? > > The locked memory differences for one seem like something that > > libvirt wouldn't want hidden > > I'm first interested to have an understanding how this change becomes > a real problem in practice that requires libvirt to do something > different for vfio or iommufd. We can discuss in the other thread > > If this is the make or break point then I think we can deal with it > either by going back to what vfio does now or perhaps some other > friendly compat approach.. > > > and we have questions regarding support for vaddr hijacking > > I'm not sure what vaddr hijacking is? Do you mean > VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to > implement it in a functionally compatible way without the deadlock > problem. I estimate this as a small project. > > > and different ideas how to implement dirty page tracking, > > I don't think this is compatibility. No kernel today triggers qemu to > use this feature as no kernel supports live migration. No existing > qemu will trigger this feature with new kernels that support live > migration v2. Therefore we can adjust qemu's dirty tracking at the > same time we enable migration v2 in qemu. I guess I was assuming that enabling v2 migration in QEMU was dependent on the existing type1 dirty tracking because it's the only means we have to tell QEMU that all memory is perpetually dirty when we have a DMA device. 
Is that not correct? If we don't intend to carry type1 dirty tracking into iommufd compatibility and we need it for this purpose, then our window for being able to rip it out entirely closes when QEMU gains v2 migration support.

> With Joao's work we are close to having a solid RFC to come with something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a full package of migration v2 support including the dirty tracking solution.
>
> > not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc.
>
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them requires work outside just this iommufd area, and I'd like this series to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which nicely solves the follow_pfn security problem, we want to do this for another reason already. I'm waiting for some testing feedback before posting it. Need some help from Daniel to make the DMABUF revoke semantics he and I have been talking about. In the worst case we can copy the follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really hard parts are done already. The VFIO integration is a medium sized task overall.
>
> So, I'm not ready to give up yet :)

Ok, that's a more promising outlook than I was inferring from the long list of missing features.

> > Where do we focus attention?
> > Is symlinking device files our proposal to userspace and is that something achievable, or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,
>
> symlinking device files is definitely just a suggested way to expedite testing.
>
> Things like qemu that are learning to use iommufd-only features should learn to directly open iommufd instead of vfio container to activate those features.

Which is kind of the basis for my question: QEMU is racing for native support, Eric and Yi are already working on this, so some of these compatibility interfaces might only have short term usefulness.

> Looking long down the road I don't think we want to have type 1 and iommufd code forever.

Agreed.

> So, I would like to make an option to compile out vfio container support entirely and have that option arrange for iommufd to provide the container device node itself.
>
> I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.

That'
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
Hi, On 3/24/22 1:33 AM, Jason Gunthorpe wrote: > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote: > >> My overall question here would be whether we can actually achieve a >> compatibility interface that has sufficient feature transparency that we >> can dump vfio code in favor of this interface, or will there be enough >> niche use cases that we need to keep type1 and vfio containers around >> through a deprecation process? > Other than SPAPR, I think we can. > >> The locked memory differences for one seem like something that >> libvirt wouldn't want hidden > I'm first interested to have an understanding how this change becomes > a real problem in practice that requires libvirt to do something > different for vfio or iommufd. We can discuss in the other thread > > If this is the make or break point then I think we can deal with it > either by going back to what vfio does now or perhaps some other > friendly compat approach.. > >> and we have questions regarding support for vaddr hijacking > I'm not sure what vaddr hijacking is? Do you mean > VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to > implement it in a functionally compatible way without the deadlock > problem. I estimate this as a small project. > >> and different ideas how to implement dirty page tracking, > I don't think this is compatibility. No kernel today triggers qemu to > use this feature as no kernel supports live migration. No existing > qemu will trigger this feature with new kernels that support live > migration v2. Therefore we can adjust qemu's dirty tracking at the > same time we enable migration v2 in qemu. > > With Joao's work we are close to having a solid RFC to come with > something that can be fully implemented. > > Hopefully we can agree to this soon enough that qemu can come with a > full package of migration v2 support including the dirty tracking > solution. 
> >> not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc.
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them requires work outside just this iommufd area, and I'd like this series to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which nicely solves the follow_pfn security problem, we want to do this for another reason already. I'm waiting for some testing feedback before posting it. Need some help from Daniel to make the DMABUF revoke semantics he and I have been talking about. In the worst case we can copy the follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really hard parts are done already. The VFIO integration is a medium sized task overall.
>
> So, I'm not ready to give up yet :)
>
> >> Where do we focus attention? Is symlinking device files our proposal to userspace and is that something achievable, or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,
>
> symlinking device files is definitely just a suggested way to expedite testing.
>
> Things like qemu that are learning to use iommufd-only features should learn to directly open iommufd instead of vfio container to activate those features.
>
> Looking long down the road I don't think we want to have type 1 and iommufd code forever. So, I would like to make an option to compile out vfio container support entirely and have that option arrange for iommufd to provide the container device node itself.
I am currently working on migrating the QEMU VFIO device onto the new API because, after our discussions, the compat mode cannot be used to implement nesting anyway. I hope I will be able to present something next week.

Thanks Eric

> I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.
>
> Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:

> My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process?

Other than SPAPR, I think we can.

> The locked memory differences for one seem like something that libvirt wouldn't want hidden

I'm first interested in understanding how this change becomes a real problem in practice that requires libvirt to do something different for vfio or iommufd. We can discuss it in the other thread.

If this is the make or break point then I think we can deal with it either by going back to what vfio does now or perhaps some other friendly compat approach.

> and we have questions regarding support for vaddr hijacking

I'm not sure what vaddr hijacking is? Do you mean VFIO_DMA_MAP_FLAG_VADDR? There is a comment that outlines my plan to implement it in a functionally compatible way without the deadlock problem. I estimate this as a small project.

> and different ideas how to implement dirty page tracking,

I don't think this is compatibility. No kernel today triggers qemu to use this feature as no kernel supports live migration. No existing qemu will trigger this feature with new kernels that support live migration v2. Therefore we can adjust qemu's dirty tracking at the same time we enable migration v2 in qemu.

With Joao's work we are close to having a solid RFC to come with something that can be fully implemented.

Hopefully we can agree to this soon enough that qemu can come with a full package of migration v2 support including the dirty tracking solution.

> not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc.

I consider these all mandatory things, they won't be left out.
The reason they are not in the RFC is mostly because supporting them requires work outside just this iommufd area, and I'd like this series to remain self-contained.

I've already got a draft to add DMABUF support to VFIO PCI which nicely solves the follow_pfn security problem, we want to do this for another reason already. I'm waiting for some testing feedback before posting it. Need some help from Daniel to make the DMABUF revoke semantics he and I have been talking about. In the worst case we can copy the follow_pfn approach.

Intel no-snoop is simple enough, just needs some Intel cleanup parts.

mdev will come along with the final VFIO integration, all the really hard parts are done already. The VFIO integration is a medium sized task overall.

So, I'm not ready to give up yet :)

> Where do we focus attention? Is symlinking device files our proposal to userspace and is that something achievable, or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,

symlinking device files is definitely just a suggested way to expedite testing.

Things like qemu that are learning to use iommufd-only features should learn to directly open iommufd instead of vfio container to activate those features.

Looking long down the road I don't think we want to have type 1 and iommufd code forever. So, I would like to make an option to compile out vfio container support entirely and have that option arrange for iommufd to provide the container device node itself.

I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR of course)

For the dpdk/etcs of the world I think we are already there.
Jason
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Fri, 18 Mar 2022 14:27:36 -0300 Jason Gunthorpe wrote:

> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. Doing so allows the use of iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to SET_CONTAINER using an iommufd instead of a container fd is a followup series.
>
> Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached.
>
> Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd-only features while still using the VFIO-style map/unmap ioctls.
>
> While this is enough to operate qemu, it is still a bit of a WIP with a few gaps to be resolved:
>
> - Only the TYPE1v2 mode is supported, where unmap cannot punch holes or split areas. The old mode can be implemented with a new operation to split an iopt_area into two without disturbing the iopt_pages or the domains, then unmapping a whole area as normal.
>
> - Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit.
>
> - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will require some additional work to properly expose PFN lifecycle between VFIO and iommufd.
>
> - Various components of the mdev API are not completed yet.
>
> - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not implemented.
>
> - The 'dirty tracking' is not implemented.
>
> - A full audit for pedantic compatibility details (eg errnos, etc) has not yet been done.
>
> - powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. My hope is that SPAPR will be moved into the iommu_domain framework as a special HW-specific type, and I would expect power to support the generic interface through a normal iommu_domain.
My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or will there be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process?

The locked memory differences for one seem like something that libvirt wouldn't want hidden, and we have questions regarding support for vaddr hijacking and different ideas about how to implement dirty page tracking, not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc. It seems like quite an endeavor to fill all these gaps, while at the same time QEMU will be working to move to use iommufd directly in order to gain all the new features.

Where do we focus attention? Is symlinking device files our proposal to userspace, and is that something achievable? Or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,

Alex