Re: [RFC] /dev/ioasid uAPI proposal
On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote: > On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Saturday, May 29, 2021 3:59 AM > > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > > > > 5. Use Cases and Flows > > > > > > > > Here assume VFIO will support a new model where every bound device > > > > is explicitly listed under /dev/vfio thus a device fd can be acquired > > > > w/o > > > > going through legacy container/group interface. For illustration purpose > > > > those devices are just called dev[1...N]: > > > > > > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode); > > > > > > > > As explained earlier, one IOASID fd is sufficient for all intended use > > > > cases: > > > > > > > > ioasid_fd = open("/dev/ioasid", mode); > > > > > > > > For simplicity below examples are all made for the virtualization story. > > > > They are representative and could be easily adapted to a > > > > non-virtualization > > > > scenario. > > > > > > For others, I don't think this is *strictly* necessary, we can > > > probably still get to the device_fd using the group_fd and fit in > > > /dev/ioasid. It does make the rest of this more readable though. > > > > Jason, want to confirm here. Per earlier discussion we remain an > > impression that you want VFIO to be a pure device driver thus > > container/group are used only for legacy application. > > Let me call this a "nice wish". > > If you get to a point where you hard need this, then identify the hard > requirement and let's do it, but I wouldn't bloat this already large > project unnecessarily. > > Similarly I wouldn't depend on the group fd existing in this design > so it could be changed later. I don't think presence or absence of a group fd makes a lot of difference to this design. 
Having a group fd just means we attach groups to the ioasid instead of individual devices, and we no longer need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep
> > container/group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
>
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right. I'd assume that for compatibility, creating a container would create a single IOASID under the hood with a compatibility layer translating the container operations to ioasid operations.

> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

Again, I don't think it makes much difference. The model doesn't really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on the IOASID. Basically ATTACH_GROUP would just be equivalent to attaching all the constituent devices.

> I didn't try to chart this out carefully.
>
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite reasonable to harmonize these somehow.
>
> But that is too complicated and far out for me at least to guess on at
> this point..
>
> > > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> >
> > Yes, it's completely sane that the guest links an I/O page table to
> > different vPASIDs. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#.
>
> Ok..
>
> Jason

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote: > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > 5. Use Cases and Flows > > > > Here assume VFIO will support a new model where every bound device > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o > > going through legacy container/group interface. For illustration purpose > > those devices are just called dev[1...N]: > > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode); > > > > As explained earlier, one IOASID fd is sufficient for all intended use > > cases: > > > > ioasid_fd = open("/dev/ioasid", mode); > > > > For simplicity below examples are all made for the virtualization story. > > They are representative and could be easily adapted to a non-virtualization > > scenario. > > For others, I don't think this is *strictly* necessary, we can > probably still get to the device_fd using the group_fd and fit in > /dev/ioasid. It does make the rest of this more readable though. Leaving aside whether group fds should exist, while they *do* exist binding to an IOASID should be done on the group not an individual device. [snip] > > /* if dev1 is ENQCMD-capable mdev, update CPU PASID > > * translation structure through KVM > > */ > > pa_data = { > > .ioasid_fd = ioasid_fd; > > .ioasid = gva_ioasid; > > .guest_pasid= gpasid1; > > }; > > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data); > > Make sense > > > /* Bind guest I/O page table */ > > bind_data = { > > .ioasid = gva_ioasid; > > .addr = gva_pgtable1; > > // and format information > > }; > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); > > Again I do wonder if this should just be part of alloc_ioasid. Is > there any reason to split these things? The only advantage to the > split is the device is known, but the device shouldn't impact > anything.. I'm pretty sure the device(s) could matter, although they probably won't usually. 
But it would certainly be possible for a system to have two different host bridges with two different IOMMUs with different pagetable formats. Until you know which devices (and therefore which host bridge) you're talking about, you don't know what formats of pagetable to accept. And if you have devices from *both* bridges you can't bind a page table at all - you could theoretically support a kernel managed pagetable by mirroring each MAP and UNMAP to tables in both formats, but it would be pretty reasonable not to support that.

> > 5.6. I/O page fault
> > +++
> > 
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> > 
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> >   pasid, addr};
> > 
> > - Host IOMMU driver identifies the faulting I/O page table according to
> >   information registered by IOASID fault handler;
> > 
> > - IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> >   which is saved in ioasid_data->fault_data (used for response);
> > 
> > - IOASID fault handler generates a user fault_data (ioasid, addr), links
> >   it to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

I like the idea of labelling devices when they're attached, it makes extension to non-PCI devices much more obvious than having to deal with concrete RIDs. But, remember we can only (reliably) determine rid up to the group boundary. So if you're labelling devices, all devices in a group would have to have the same label. Or you attach the label to a group not a device, which would be a reason to represent the group as an object again.
> > - Upon received event, Qemu needs to find the virtual routing information > > (v_rid + v_pasid) of the device attached to the faulting ioasid. If > > there are > > multiple, pick a random one. This should be fine since the purpose is to > > fix the I/O page table on the guest; > > The device label should fix this > > > - Qemu finds the pending fault event, converts virtual completion data > > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to > > complete the pending fault; > > > > - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved > > in > > ioasid_data->fault_data, and then calls iommu api to complete it with > > {rid, pasid, response_code}; > > So resuming a fault on an ioasid will resume all devices pending on > the fault? > > > 5.7. BIND_PASID_TABLE > > > > > > PASID table is put in the GPA space on some platform, thus must be updated > > by the guest. It is tre
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, May 28, 2021 at 08:36:49PM -0300, Jason Gunthorpe wrote: > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > 2.1. /dev/ioasid uAPI > > + > > > > /* > > * Check whether an uAPI extension is supported. > > * > > * This is for FD-level capabilities, such as locked page > > pre-registration. > > * IOASID-level capabilities are reported through IOASID_GET_INFO. > > * > > * Return: 0 if not supported, 1 if supported. > > */ > > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0) > > > > /* > > * Register user space memory where DMA is allowed. > > * > > * It pins user pages and does the locked memory accounting so sub- > > * sequent IOASID_MAP/UNMAP_DMA calls get faster. > > * > > * When this ioctl is not used, one user page might be accounted > > * multiple times when it is mapped by multiple IOASIDs which are > > * not nested together. > > * > > * Input parameters: > > * - vaddr; > > * - size; > > * > > * Return: 0 on success, -errno on failure. > > */ > > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1) > > #define IOASID_UNREGISTER_MEMORY_IO(IOASID_TYPE, IOASID_BASE + 2) > > So VA ranges are pinned and stored in a tree and later references to > those VA ranges by any other IOASID use the pin cached in the tree? > > It seems reasonable and is similar to the ioasid parent/child I > suggested for PPC. > > IMHO this should be merged with the all SW IOASID that is required for > today's mdev drivers. If this can be done while keeping this uAPI then > great, otherwise I don't think it is so bad to weakly nest a physical > IOASID under a SW one just to optimize page pinning. Right, I think we can simplify the interface by modelling the preregistration as a nesting layer. Well, mostly.. the wrinkle is that generally you can't do anything with an ioasid until you've attached devices to it, but that doesn't really make sense for the prereg layer. I expect we can find some way to deal with that, though. Actually... 
to simplify that "weak nesting" concept I wonder if we want to expand to 3 ways of specifying the pagetables for the ioasid: 1) kernel managed (MAP/UNMAP) 2) user managed (BIND/INVALIDATE) 3) pass-through (IOVA==parent address) Obviously pass-through wouldn't be allowed in all circumstances. > Either way this seems like a smart direction > > > /* > > * Allocate an IOASID. > > * > > * IOASID is the FD-local software handle representing an I/O address > > * space. Each IOASID is associated with a single I/O page table. User > > * must call this ioctl to get an IOASID for every I/O address space that > > is > > * intended to be enabled in the IOMMU. > > * > > * A newly-created IOASID doesn't accept any command before it is > > * attached to a device. Once attached, an empty I/O page table is > > * bound with the IOMMU then the user could use either DMA mapping > > * or pgtable binding commands to manage this I/O page table. > > Can the IOASID be populated before being attached? I don't think it reasonably can. Until attached, you don't actually know what hardware IOMMU will be backing it, and therefore you don't know its capabilities. You can't really allow mappings if you don't even know allowed IOVA ranges and page size. > > * Device attachment is initiated through device driver uAPI (e.g. VFIO) > > * > > * Return: allocated ioasid on success, -errno on failure. > > */ > > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3) > > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4) > > I assume alloc will include quite a big structure to satisfy the > various vendor needs? > > > /* > > * Get information about an I/O address space > > * > > * Supported capabilities: > > * - VFIO type1 map/unmap; > > * - pgtable/pasid_table binding > > * - hardware nesting vs. software nesting; > > * - ...
> > * > > * Related attributes: > > * - supported page sizes, reserved IOVA ranges (DMA mapping); > > * - vendor pgtable formats (pgtable binding); > > * - number of child IOASIDs (nesting); > > * - ... > > * > > * Above information is available only after one or more devices are > > * attached to the specified IOASID. Otherwise the IOASID is just a > > * number w/o any capability or attribute. > > This feels wrong to learn most of these attributes of the IOASID after > attaching to a device. Yes... but as above, we have no idea what the IOMMU's capabilities are until devices are attached. > The user should have some idea how it intends to use the IOASID when > it creates it and the rest of the system should match the intention. > > For instance if the user is creating a IOASID to cover the guest GPA > with the intention of making children it should indicate this during > alloc. > > If the user is intending to point a child IOASID to a guest page table > in a certain descrip
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote: > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: [snip] > > With above design /dev/ioasid uAPI is all about I/O address spaces. > > It doesn't include any device routing information, which is only > > indirectly registered to the ioasid driver through VFIO uAPI. For > > example, I/O page fault is always reported to userspace per IOASID, > > although it's physically reported per device (RID+PASID). > > I agree with Jean-Philippe - at the very least erasing this > information needs a major rationale - but I don't really see why it > must be erased? The HW reports the originating device, is it just a > matter of labeling the devices attached to the /dev/ioasid FD so it > can be reported to userspace? HW reports the originating device as far as it knows. In many cases where you have multiple devices in an IOMMU group, it's because although they're treated as separate devices at the kernel level, they have the same RID at the HW level. Which means a RID for something in the right group is the closest you can count on supplying. [snip] > > However this way significantly > > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID > > one address space any more. Device routing information (indirectly > > marking hidden I/O spaces) has to be carried in iotlb invalidation and > > page faulting uAPI to help connect vIOMMU with the underlying > > pIOMMU. This is one design choice to be confirmed with ARM guys. > > I'm confused by this rationale. > > For a vIOMMU that has IO page tables in the guest the basic > choices are: > - Do we have a hypervisor trap to bind the page table or not? (RID > and PASID may differ here) > - Do we have a hypervisor trap to invalidate the page tables or not? > > If the first is a hypervisor trap then I agree it makes sense to create a > child IOASID that points to each guest page table and manage it > directly. This should not require walking guest page tables as it is > really just informing the HW where the page table lives. HW will walk > them. > > If there are no hypervisor traps (does this exist?) then there is no > way to involve the hypervisor here and the child IOASID should simply > be a pointer to the guest's data structure that describes binding. In > this case that IOASID should claim all PASIDs when bound to a > RID. And in that case I think we should call that object something other than an IOASID, since it represents multiple address spaces. > Invalidation should be passed up to the IOMMU driver in terms of > the guest tables information and either the HW or software has to walk > to guest tables to make sense of it. > > Events from the IOMMU to userspace should be tagged with the attached > device label and the PASID/substream ID. This means there is no issue > to have an 'all PASID' IOASID. > > > Notes: > > - It might be confusing as IOASID is also used in the kernel (drivers/ > > iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need > > find a better name later to differentiate. > > +1 on Jean-Philippe's remarks > > > - PPC has not been considered yet as we haven't got time to fully understand > > its semantics. According to previous discussion there is some > > generality > > between PPC window-based scheme and VFIO type1 semantics. Let's > > first make consensus on this proposal and then further discuss how to > > extend it to cover PPC's requirement. > > From what I understood PPC is not so bad, Nesting IOASID's did its > preload feature and it needed a way to specify/query the IOVA range a > IOASID will cover. > > > - There is a protocol between vfio group and kvm. Needs to think about > > how it will be affected following this proposal. > > Ugh, I always stop looking when I reach that boundary. Can anyone > summarize what is going on there? > > Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the > right answer. Eg if ARM needs to get the VMID from KVM and set it to > ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is > reasonable. Certainly better than the symbol_get stuff we have right > now. > > I will read through the detail below in another email > > Jason > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > /dev/ioasid provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own logic > to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/ioasid and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ > > It ends up to be a long writing due to many things to be summarized and > non-trivial effort required to connect them into a complete proposal. > Hope it provides a clean base to converge. Thanks for the writeup. I'm giving this a first pass review, note that I haven't read all the existing replies in detail yet. > > TOC > > 1. Terminologies and Concepts > 2. uAPI Proposal > 2.1. /dev/ioasid uAPI > 2.2. /dev/vfio uAPI > 2.3. /dev/kvm uAPI > 3. Sample structures and helper functions > 4. PASID virtualization > 5. Use Cases and Flows > 5.1. A simple example > 5.2. Multiple IOASIDs (no nesting) > 5.3. IOASID nesting (software) > 5.4. IOASID nesting (hardware) > 5.5. Guest SVA (vSVA) > 5.6. I/O page fault > 5.7. BIND_PASID_TABLE > > > 1. Terminologies and Concepts > - > > IOASID FD is the container holding multiple I/O address spaces. User > manages those address spaces through FD operations. Multiple FD's are > allowed per process, but with this proposal one FD should be sufficient for > all intended usages. > > IOASID is the FD-local software handle representing an I/O address space. > Each IOASID is associated with a single I/O page table. 
IOASIDs can be > nested together, implying the output address from one I/O page table > (represented by child IOASID) must be further translated by another I/O > page table (represented by parent IOASID). Is there a compelling reason to have all the IOASIDs handled by one FD? Simply on the grounds that handles to kernel internal objects are usually fds, having an fd per ioasid seems like an obvious alternative. In that case plain open() would replace IOASID_ALLOC. Nested could be handled either by 1) having a CREATED_NESTED on the parent fd which spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing a SET_PARENT before doing anything else. I may be bikeshedding here.. > I/O address space can be managed through two protocols, according to > whether the corresponding I/O page table is constructed by the kernel or > the user. When kernel-managed, a dma mapping protocol (similar to > existing VFIO iommu type1) is provided for the user to explicitly specify > how the I/O address space is mapped. Otherwise, a different protocol is > provided for the user to bind an user-managed I/O page table to the > IOMMU, plus necessary commands for iotlb invalidation and I/O fault > handling. > > Pgtable binding protocol can be used only on the child IOASID's, implying > IOASID nesting must be enabled. This is because the kernel doesn't trust > userspace. Nesting allows the kernel to enforce its DMA isolation policy > through the parent IOASID. To clarify, I'm guessing that's a restriction of likely practice, rather than a fundamental API restriction.
I can see a couple of theoretical future cases where a user-managed pagetable for a "base" IOASID would be feasible: 1) On some fancy future MMU allowing free nesting, where the kernel would insert an implicit extra layer translating user addresses to physical addresses, and the userspace manages a pagetable with its own VAs being the target AS 2) For a purely software virtual device, where its virtual DMA engine can interpret user addresses fine > IOASID nesting can be implemented in two ways: hardware nesting and > software nesting. With hardware support the child and parent I/O page > tables are walked consecutively by the IOMMU to form a nested translation. > When it's implemented in software, the ioasid driver is responsible for > merging the two-level mappings into a single-level shadow I/O page table. > Software nesting requires both child/parent page tables operated through > the dma mapping protocol, so any change in either level can be captured > by the kernel to update the corresponding shadow mapping. As Jason also said, I don't think you need to restrict software nesting to only kernel managed L2 tables - you already need hooks for cache invalidation, and you can use those to trigger reshadows. > An I/O address space takes effect in the IOMMU only after it is attached > to a device. The device in the /dev/i
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On 01/06/2021 20:08, Thierry Reding wrote: > On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: >> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: >>> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: From: Thierry Reding Hi, this is a set of patches that is the result of earlier discussions regarding early identity mappings that are needed to avoid SMMU faults during early boot. The goal here is to avoid early identity mappings altogether and instead postpone the need for the identity mappings to when devices are attached to the SMMU. This works by making the SMMU driver coordinate with the memory controller driver on when to start enforcing SMMU translations. This makes Tegra behave in a more standard way and pushes the code to deal with the Tegra-specific programming into the NVIDIA SMMU implementation. Compared to the original version of these patches, I've split the preparatory work into a separate patch series because it became very large and will be mostly uninteresting for this audience. Patch 1 provides a mechanism to program SID overrides at runtime. Patch 2 updates the ARM SMMU device tree bindings to include the Tegra186 compatible string as suggested by Robin during review. Patches 3 and 4 create the fundamentals in the SMMU driver to support this and also make this functionality available on Tegra186. Patch 5 hooks the ARM SMMU up to the memory controller so that the memory client stream ID overrides can be programmed at the right time. Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of this through device tree updates. Patch 10 is included here to show how SMMU will be enabled for display controllers. However, it cannot be applied yet because the code to create identity mappings for potentially live framebuffers hasn't been merged yet. 
The end result is that various peripherals will have SMMU enabled, while the display controllers will keep using passthrough, as initially set up by firmware. Once the device tree bindings have been accepted and the SMMU driver has been updated to create identity mappings for the display controllers, they can be hooked up to the SMMU and the code in this series will automatically program the SID overrides to enable SMMU translations at the right time. Note that the series creates a compile time dependency between the memory controller and IOMMU trees. If it helps I can provide a branch for each tree, modelling the dependency, once the series has been reviewed. Changes in v2: - split off the preparatory work into a separate series (that needs to be applied first) - address review comments by Robin Thierry Thierry Reding (10): memory: tegra: Implement SID override programming dt-bindings: arm-smmu: Add Tegra186 compatible string iommu/arm-smmu: Implement ->probe_finalize() iommu/arm-smmu: tegra: Detect number of instances at runtime iommu/arm-smmu: tegra: Implement SID override programming iommu/arm-smmu: Use Tegra implementation on Tegra186 arm64: tegra: Use correct compatible string for Tegra186 SMMU arm64: tegra: Hook up memory controller to SMMU on Tegra186 arm64: tegra: Enable SMMU support on Tegra194 arm64: tegra: Enable SMMU support for display on Tegra194 .../devicetree/bindings/iommu/arm,smmu.yaml | 11 +- arch/arm64/boot/dts/nvidia/tegra186.dtsi | 4 +- arch/arm64/boot/dts/nvidia/tegra194.dtsi | 166 ++ drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 +- drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c | 90 -- drivers/iommu/arm/arm-smmu/arm-smmu.c | 13 ++ drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 + drivers/memory/tegra/mc.c | 9 + drivers/memory/tegra/tegra186.c | 72 include/soc/tegra/mc.h| 3 + 10 files changed, 349 insertions(+), 23 deletions(-) >>> >>> Will, Robin, >>> >>> do you have any more comments on the ARM SMMU bits of this series? 
If >>> not, can you guys provide an Acked-by so that Krzysztof can pick this >>> (modulo the DT patches) up into the memory-controller tree for v5.14? >>> >>> I'll send out a v3 with the bisectability fix that Krishna pointed >>> out. >> >> Probably best if I queue 3-6 on a separate branch once you send a v3, >> then Krzysztof can pull that in if he needs it. > > Patch 5 has a build-time dependency on patch 1, so they need to go in > together. The reason why I suggested Krzysztof pick these up is because > there is a restructuring series that this depends on, which will go into > Krzysztof's tree. So in order to
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On 02/06/2021 09:33, Krzysztof Kozlowski wrote: > On 01/06/2021 20:08, Thierry Reding wrote: >> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: >>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > From: Thierry Reding > >>> Probably best if I queue 3-6 on a separate branch once you send a v3, >>> then Krzysztof can pull that in if he needs it. >> >> Patch 5 has a build-time dependency on patch 1, so they need to go in >> together. The reason why I suggested Krzysztof pick these up is because >> there is a restructuring series that this depends on, which will go into >> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other >> and mostly unrelated stuff as well. > > I missed that part... what other series are needed for this one? Except > Dmitry's power management set I do not have anything in my sight for > Tegra's memory controllers. > > Anyway, I can take the memory bits and provide a stable tag with these. > Recently there was quite a lot of work around Tegra memory controllers, so > this especially makes sense if new patches appear. OK, I think I have now the patchset you talked about - "memory: tegra: Driver unification" v2, right? Best regards, Krzysztof
Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured
On 02/06/2021 05:37, Lu Baolu wrote: On 6/1/21 10:29 PM, John Garry wrote: For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1]. This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance. A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range: /sys/kernel/iommu_groups/X/max_opt_dma_size This file is updated the same as how the IOMMU group default domain type is updated, i.e. must unbind the only device in the group first. Could you explain why it requires singleton group and driver unbinding if the user only wants to increase the upper limit? I haven't dived into the details yet, sorry if this is a silly question. Hi Baolu, I did actually try increasing the range for a 'live' domain in the v1 series, but it turned out too messy. First problem is reallocating the memory to hold the rcaches. Second problem is that we need to deal with the issue that all IOVAs in the rcache need to be a pow-of-2, which is difficult to enforce for IOVAs which weren't being cached before, but now would be. So now I changed to work similar to how we change the default domain type, i.e. don't operate on a 'live' domain. Thanks, John
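Assuming the series lands in this form, the configuration flow described above would look something like the sketch below. The device address, driver name and size value are examples only, and max_opt_dma_size is the file proposed by the series, not something present in mainline.

```shell
# Sketch only: requires the proposed patches; paths and names are examples.
dev=0000:81:00.0                       # sole device in its IOMMU group
grp=$(basename "$(readlink "/sys/bus/pci/devices/$dev/iommu_group")")

echo "$dev" > /sys/bus/pci/drivers/nvme/unbind        # group must be idle
echo 262144 > "/sys/kernel/iommu_groups/$grp/max_opt_dma_size"
echo "$dev" > /sys/bus/pci/drivers/nvme/bind          # rebind with new range
```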
Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
On 6/1/21 7:42 PM, Jussi Maki wrote: Hi Robin, On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy wrote: The regression shows as a significant drop in throughput as measured with "super_netperf" [0], with measured bandwidth of ~95Gbps before and ~35Gbps after: I guess that must be the difference between using the flush queue vs. strict invalidation. On closer inspection, it seems to me that there's a subtle pre-existing bug in the AMD IOMMU driver, in that amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api() has called bus_set_iommu(). Does the patch below work? Thanks for the quick response & patch. I tried it out and indeed it does solve the issue: # uname -a Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2 95341.2 root@zh-lab-node-3:~# uname -a Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1 17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2 33989.5 Robin, probably goes without saying, but please make sure to include ... Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE") ... to your fix in [0], maybe along with another Fixes tag pointing to the original commit adding this issue. But certainly a250c23f15c2 would be good given the regression was uncovered on that one first, so that Greg et al have a chance to pick this fix up for stable kernels. Thanks everyone! [0] https://lore.kernel.org/bpf/7f048c57-423b-68ba-eede-7e194c1fe...@arm.com/
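For reference, a tag in exactly the requested shape can be generated mechanically with git's pretty-format placeholders (12-character abbreviated hash plus quoted subject). A throwaway repository stands in for the kernel tree here; against mainline you would point git show at a250c23f15c2 itself.

```shell
# Demonstrate the conventional Fixes: format on a throwaway repo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE"

# %h honours --abbrev; %s is the commit subject.
git show -s --abbrev=12 --format='Fixes: %h ("%s")' HEAD
```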
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> > On 01/06/2021 20:08, Thierry Reding wrote:
> >> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> >>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> >>>> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> >>>>> From: Thierry Reding
> >>>
> >>> Probably best if I queue 3-6 on a separate branch once you send a v3,
> >>> then Krzysztof can pull that in if he needs it.
> >>
> >> Patch 5 has a build-time dependency on patch 1, so they need to go in
> >> together. The reason why I suggested Krzysztof pick these up is because
> >> there is a restructuring series that this depends on, which will go into
> >> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
> >> and mostly unrelated stuff as well.
> >
> > I missed that part... what other series are needed for this one? Except
> > Dmitry's power management set I do not have anything in my sight for
> > Tegra's memory controllers.
> >
> > Anyway, I can take the memory bits and provide a stable tag with these.
> > Recently there was quite a lot of work around Tegra memory controllers,
> > so this makes especially good sense if new patches appear.
>
> OK, I think I have now the patchset you talked about - "memory: tegra:
> Driver unification" v2, right?

Yes, that's the one. That series is fairly self-contained, but Dmitry's power management set has dependencies that pull in the regulator, clock and ARM SoC trees.

I did a test merge of the driver unification series with a branch that has Dmitry's patches and all the dependencies and there are no conflicts, so that, fortunately, doesn't further complicate things.

Do you want me to send you a pull request with Dmitry's memory controller changes? You could then apply the unification series on top, which should allow this SMMU series to apply cleanly on top of that.
I can also carry all these changes in the Tegra tree and send a PR in a few days once this has seen a bit more testing in linux-next, which also makes sure it's got a bit more testing in our internal test farm.

Thierry
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
>> I summarized five opens here, about:
>>
>> 1) Finalizing the name to replace /dev/ioasid;
>> 2) Whether one device is allowed to bind to multiple IOASID fd's;
>> 3) Carry device information in invalidation/fault reporting uAPI;
>> 4) What should/could be specified when allocating an IOASID;
>> 5) The protocol between vfio group and kvm;
>>
>> For 1), two alternative names are mentioned: /dev/iommu and /dev/ioas.
>> I don't have a strong preference and would like to hear votes from all
>> stakeholders. /dev/iommu is slightly better imho for two reasons. First,
>> per AMD's presentation at the last KVM Forum they implement vIOMMU in
>> hardware and thus need to support user-managed domains. An iommu uAPI
>> notation might make more sense moving forward. Second, it makes later
>> uAPI naming easier as 'IOASID' can always be put as an object, e.g.
>> IOMMU_ALLOC_IOASID instead of IOASID_ALLOC_IOASID. :)
>
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid.
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects. Then we don't have naming aliasing with kernel constructs.

It looks to me that using "/dev/iommu" excludes the possibility of implementing IOASID in a device-specific way (e.g. through the co-operation of a device MMU + the platform IOMMU)? What's more, the ATS spec doesn't forbid the device #PF from being reported via a device-specific way.

Thanks
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>> We can open up to ~0U file descriptors, I don't see why we need to
>> restrict it in uAPI.
>
> There are significant problems with such large file descriptor tables.
> High FD numbers mean things like select don't work at all anymore and
> IIRC there are more complications.

I don't see how much of a difference there is between IOASID and other types of fds. People can choose to use poll or epoll.

And with the current proposal (assuming there's a N:1 ioasid to ioasid fd mapping), I wonder how select can work for a specific ioasid.

Thanks

> A huge number of FDs for typical usages should be avoided.
>
> Jason
Re: [RFC] /dev/ioasid uAPI proposal
On 31.05.21 19:37, Parav Pandit wrote:
> It appears that this is only to make the map ioctl faster apart from
> accounting. It doesn't have any ioasid handle input either.
>
> In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid? For example, a few years back such a system call, mpin(),
> was proposed in [1].

I'm very reluctant about more syscall inflation. We already have lots of syscalls that could have been easily done via devices or filesystems (yes, some of them are just old Unix relics).

Syscalls don't play well w/ modules, containers, distributed systems, etc., and need extra low-level code for most non-C languages (e.g. scripting languages).

--mtx

--
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert werden! Für eine vertrauliche Kommunikation senden Sie bitte ihren GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287
Re: [RFC] /dev/ioasid uAPI proposal
On 27.05.21 09:58, Tian, Kevin wrote:
> Hi,
>
> /dev/ioasid provides a unified interface for managing I/O page tables
> for devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA, etc.) are expected to use this interface instead of creating their
> own logic to isolate untrusted device DMAs initiated by userspace.

While I'm in favour of having generic APIs for generic tasks, as well as using FDs, I wonder whether it has to be a new and separate device. Now applications have to use multiple APIs in lockstep.

One consequence of that is that operators, as well as provisioning systems, container infrastructures, etc., always have to consider multiple devices together. You can't just say "give workload XY access to device /dev/foo" anymore. Now you have to take care of scenarios like "if someone wants /dev/foo, he also needs /dev/bar". And if that happens multiple times together ("/dev/foo and /dev/wurst both require /dev/bar"), leading to scenarios where the dev nodes are bind-mounted somewhere, you need to take care that additional devices aren't bind-mounted twice, etc.

If I understand this correctly, /dev/ioasid is a kind of "common supplier" to other APIs / devices. Why can't the fd be acquired by the consumer APIs (e.g. kvm, vfio, etc.)?

--mtx

Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:29 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
>
> For some scenarios every guest PASID will require an IOASID ID #, so
> there is a large enough demand that FDs alone are not a good fit.
>
> Further, there are global container-wide properties that are hard to
> carry over to a multi-FD model, like the attachment of devices to the
> container at startup.

So if we implement a per-fd model, the global "container" properties could be done via the parent fd, e.g. attaching the parent to the device at startup.

>> So this RFC treats fd as a container of address spaces which is each
>> tagged by an IOASID. If the container and address space is 1:1 then the
>> container seems useless.
>
> The examples at the bottom of the document show multiple IOASIDs in the
> container for a parent/child type relationship

This can also be done per fd? An fd parent can have multiple fd children.

Thanks

> Jason
Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
On Sat, May 29, 2021 at 11:17:30AM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -56,11 +56,8 @@
>  # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
>  #endif
>
> -#ifdef CONFIG_IOMMU_SUPPORT
> -# define DISABLE_ENQCMD 0
> -#else
> -# define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
> -#endif
> +/* Force disable because it's broken beyond repair */
> +#define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))

Yeah, for that to work we need:

---
From: Borislav Petkov
Date: Wed, 2 Jun 2021 12:07:52 +0200
Subject: [PATCH] dmaengine: idxd: Use cpu_feature_enabled()

When testing x86 feature bits, use cpu_feature_enabled() so that
build-disabled features can remain off, regardless of what CPUID says.

Fixes: 8e50d392652f ("dmaengine: idxd: Add shared workqueue support")
Signed-off-by: Borislav Petkov
Cc:
---
 drivers/dma/idxd/init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 2a926bef87f2..776fd44aff5f 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -745,12 +745,12 @@ static int __init idxd_init_module(void)
 	 * If the CPU does not support MOVDIR64B or ENQCMDS, there's no point in
 	 * enumerating the device. We can not utilize it.
 	 */
-	if (!boot_cpu_has(X86_FEATURE_MOVDIR64B)) {
+	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
 		pr_warn("idxd driver failed to load without MOVDIR64B.\n");
 		return -ENODEV;
 	}
 
-	if (!boot_cpu_has(X86_FEATURE_ENQCMD))
+	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
 		pr_warn("Platform does not have ENQCMD(S) support.\n");
 	else
 		support_enqcmd = true;
-- 
2.29.2

For the newly CCed parties, pls check https://lkml.kernel.org/r/87mtsd6gr9@nanos.tec.linutronix.de for more info.

Thx.

-- 
Regards/Gruss,
    Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
On Wed, Jun 02 2021 at 12:14, Borislav Petkov wrote:
> On Sat, May 29, 2021 at 11:17:30AM +0200, Thomas Gleixner wrote:
>> --- a/arch/x86/include/asm/disabled-features.h
>> +++ b/arch/x86/include/asm/disabled-features.h
>> @@ -56,11 +56,8 @@
>>  # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
>>  #endif
>>
>> -#ifdef CONFIG_IOMMU_SUPPORT
>> -# define DISABLE_ENQCMD 0
>> -#else
>> -# define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
>> -#endif
>> +/* Force disable because it's broken beyond repair */
>> +#define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
>
> Yeah, for that to work we need:
>
> ---
> From: Borislav Petkov
> Date: Wed, 2 Jun 2021 12:07:52 +0200
> Subject: [PATCH] dmaengine: idxd: Use cpu_feature_enabled()
>
> When testing x86 feature bits, use cpu_feature_enabled() so that
> build-disabled features can remain off, regardless of what CPUID says.
>
> Fixes: 8e50d392652f ("dmaengine: idxd: Add shared workqueue support")
> Signed-off-by: Borislav Petkov
> Cc:

Reviewed-by: Thomas Gleixner

Thanks for spotting this!

tglx
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On 02/06/2021 10:52, Thierry Reding wrote: > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote: >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote: >>> On 01/06/2021 20:08, Thierry Reding wrote: On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: >>> From: Thierry Reding >>> > Probably best if I queue 3-6 on a separate branch once you send a v3, > then Krzysztof can pull that in if he needs it. Patch 5 has a build-time dependency on patch 1, so they need to go in together. The reason why I suggested Krzysztof pick these up is because there is a restructuring series that this depends on, which will go into Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other and mostly unrelated stuff as well. >>> >>> I missed that part... what other series are needed for this one? Except >>> Dmitry's power management set I do not have anything in my sight for >>> Tegras memory controllers. >>> >>> Anyway, I can take the memory bits and provide a stable tag with these. >>> Recently there was quite a lot work around Tegra memory controllers, so >>> this makes especially sense if new patches appear. >> >> OK, I think I have now the patchset you talked about - "memory: tegra: >> Driver unification" v2, right? > > Yes, that's the one. That series is fairly self-contained, but Dmitry's > power management set has dependencies that pull in the regulator, clock > and ARM SoC trees. > > I did a test merge of the driver unification series with a branch that > has Dmitry's patches and all the dependencies and there are no conflicts > so that, fortunately, doesn't further complicates things. > > Do you want me to send you a pull request with Dmitry's memory > controller changes? You could then apply the unification series on top, > which should allow this SMMU series to apply cleanly on top of that. 
Makes sense and it looks quite bulletproof for future changes. Let's do it like this. I will apply your patch 1/10 from this v2 on top of it and the driver unification later.

> I can also carry all these changes in the Tegra tree and send a PR in a
> few days once this has seen a bit more testing in linux-next, which also
> makes sure it's got a bit more testing in our internal test farm.

Best regards,
Krzysztof
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote: > On 02/06/2021 10:52, Thierry Reding wrote: > > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote: > >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote: > >>> On 01/06/2021 20:08, Thierry Reding wrote: > On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: > > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: > >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > >>> From: Thierry Reding > >>> > > Probably best if I queue 3-6 on a separate branch once you send a v3, > > then Krzysztof can pull that in if he needs it. > > Patch 5 has a build-time dependency on patch 1, so they need to go in > together. The reason why I suggested Krzysztof pick these up is because > there is a restructuring series that this depends on, which will go into > Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other > and mostly unrelated stuff as well. > >>> > >>> I missed that part... what other series are needed for this one? Except > >>> Dmitry's power management set I do not have anything in my sight for > >>> Tegras memory controllers. > >>> > >>> Anyway, I can take the memory bits and provide a stable tag with these. > >>> Recently there was quite a lot work around Tegra memory controllers, so > >>> this makes especially sense if new patches appear. > >> > >> OK, I think I have now the patchset you talked about - "memory: tegra: > >> Driver unification" v2, right? > > > > Yes, that's the one. That series is fairly self-contained, but Dmitry's > > power management set has dependencies that pull in the regulator, clock > > and ARM SoC trees. > > > > I did a test merge of the driver unification series with a branch that > > has Dmitry's patches and all the dependencies and there are no conflicts > > so that, fortunately, doesn't further complicates things. 
> >
> > Do you want me to send you a pull request with Dmitry's memory
> > controller changes? You could then apply the unification series on top,
> > which should allow this SMMU series to apply cleanly on top of that.
>
> Makes sense and it looks quite bulletproof for future changes. Let's do
> like this. I will apply your patch 1/10 from this v2 on top of it and
> driver unification later.

Okey doke. Thierry -- please let me know which patches I can queue. Having the SMMU driver changes in the IOMMU tree really helps in case of conflicts with other SMMU changes. As I say, I can put stuff on a separate branch for you if it helps.

Will
RE: [RFC] /dev/ioasid uAPI proposal
> From: Enrico Weigelt, metux IT consult
> Sent: Wednesday, June 2, 2021 2:09 PM
>
> On 31.05.21 19:37, Parav Pandit wrote:
>
> > It appears that this is only to make the map ioctl faster apart from
> > accounting. It doesn't have any ioasid handle input either.
> >
> > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid? For example, a few years back such a system call, mpin(),
> > was proposed in [1].
>
> I'm very reluctant about more syscall inflation. We already have lots of
> syscalls that could have been easily done via devices or filesystems
> (yes, some of them are just old Unix relics).
>
> Syscalls don't play well w/ modules, containers, distributed systems,
> etc., and need extra low-level code for most non-C languages (e.g.
> scripting languages).

Likely, but as per my understanding, this ioctl() is a wrapper around device-agnostic code such as:

{
	atomic_inc(mm->pinned_vm);
	pin_user_pages();
}

And mm must hold the reference to it, so that these pages cannot be munmap()'ed or freed.

The second reason, I think (I could be wrong), is that the second-level page table for a PASID should be the same as what the process CR3 uses. Essentially, the IOMMU page table and the MMU page table should be pointing to the same page table entries. If they are different, then even if the guest CPU has accessed the pages, device access via the IOMMU will result in expensive page faults.

So assuming both CR3 and the PASID table entry point to the same page table, I fail to understand the need for the extra refcount and hence a driver-specific ioctl(). Though I do not have a strong objection to the ioctl(), I want to know what it will and will not do.

io_uring has a similar ioctl() doing io_sqe_buffer_register(), pinning the memory.
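[Editor's note: Parav's two-line pseudocode expands roughly to the usual long-term pinning pattern used by VFIO and io_uring: account the pages against the task's pinned_vm (checked against RLIMIT_MEMLOCK) before taking long-term pins, so the memory can neither be unmapped nor reclaimed while a device may DMA into it. A hedged kernel-style sketch, not buildable driver code — the function name and error handling here are illustrative:]

```c
/* Illustrative only: account first, then take FOLL_LONGTERM pins;
 * undo the accounting on failure. */
static int ioasid_pin_range(struct mm_struct *mm, unsigned long start,
			    unsigned long npages, struct page **pages)
{
	int ret;

	/* Charge against the memlock rlimit before pinning. */
	if (atomic64_add_return(npages, &mm->pinned_vm) >
	    rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT) {
		atomic64_sub(npages, &mm->pinned_vm);
		return -ENOMEM;
	}

	/* FOLL_LONGTERM pins survive until explicitly unpinned, so the
	 * pages cannot be migrated or reclaimed under the device. */
	ret = pin_user_pages_fast(start, npages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (ret < 0)
		atomic64_sub(npages, &mm->pinned_vm);
	return ret;
}
```

Note this pattern is orthogonal to whether the IOMMU walks the same page table as the CPU: even with shared page tables, long-term pinning is what guarantees the physical pages stay put for DMA that cannot tolerate faults.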
Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
On 2021-06-02 09:09, Daniel Borkmann wrote: On 6/1/21 7:42 PM, Jussi Maki wrote: Hi Robin, On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy wrote: The regression shows as a significant drop in throughput as measured with "super_netperf" [0], with measured bandwidth of ~95Gbps before and ~35Gbps after: I guess that must be the difference between using the flush queue vs. strict invalidation. On closer inspection, it seems to me that there's a subtle pre-existing bug in the AMD IOMMU driver, in that amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api() has called bus_set_iommu(). Does the patch below work? Thanks for the quick response & patch. I tried it out and indeed it does solve the issue: Cool, thanks Jussi. May I infer a Tested-by tag from that? # uname -a Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2 95341.2 root@zh-lab-node-3:~# uname -a Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1 17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2 33989.5 Robin, probably goes without saying, but please make sure to include ... Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE") ... to your fix in [0], maybe along with another Fixes tag pointing to the original commit adding this issue. But certainly a250c23f15c2 would be good given the regression was uncovered on that one first, so that Greg et al have a chance to pick this fix up for stable kernels. Given that the race looks to have been pretty theoretical until now, I'm not convinced it's worth the bother of digging through the long history of default domain and DMA ops movement to figure where it started, much less attempt invasive backports. 
The flush queue change which made it apparent only landed in 5.13-rc1, so as long as we can get this in as a fix in the current cycle we should be golden - in the meantime, note that booting with "iommu.strict=0" should also restore the expected behaviour.

FWIW I do still plan to resend the patch "properly" soon (in all honesty it wasn't even compile-tested!)

Cheers,
Robin.
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote: > On 02/06/2021 10:52, Thierry Reding wrote: > > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote: > >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote: > >>> On 01/06/2021 20:08, Thierry Reding wrote: > On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: > > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: > >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > >>> From: Thierry Reding > >>> > > Probably best if I queue 3-6 on a separate branch once you send a v3, > > then Krzysztof can pull that in if he needs it. > > Patch 5 has a build-time dependency on patch 1, so they need to go in > together. The reason why I suggested Krzysztof pick these up is because > there is a restructuring series that this depends on, which will go into > Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other > and mostly unrelated stuff as well. > >>> > >>> I missed that part... what other series are needed for this one? Except > >>> Dmitry's power management set I do not have anything in my sight for > >>> Tegras memory controllers. > >>> > >>> Anyway, I can take the memory bits and provide a stable tag with these. > >>> Recently there was quite a lot work around Tegra memory controllers, so > >>> this makes especially sense if new patches appear. > >> > >> OK, I think I have now the patchset you talked about - "memory: tegra: > >> Driver unification" v2, right? > > > > Yes, that's the one. That series is fairly self-contained, but Dmitry's > > power management set has dependencies that pull in the regulator, clock > > and ARM SoC trees. > > > > I did a test merge of the driver unification series with a branch that > > has Dmitry's patches and all the dependencies and there are no conflicts > > so that, fortunately, doesn't further complicates things. 
> >
> > Do you want me to send you a pull request with Dmitry's memory
> > controller changes? You could then apply the unification series on top,
> > which should allow this SMMU series to apply cleanly on top of that.
>
> Makes sense and it looks quite bulletproof for future changes. Let's do
> like this. I will apply your patch 1/10 from this v2 on top of it and
> driver unification later.

The SMMU series here depends on the unification series, so the unification series needs to go first. It'd be a fair bit of work to reverse that because the ->probe_device() callback implemented by the first patch of this SMMU series is part of the tegra_mc_ops structure that's introduced in the unification series.

Thierry
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On Wed, Jun 02, 2021 at 12:40:49PM +0100, Will Deacon wrote: > On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote: > > On 02/06/2021 10:52, Thierry Reding wrote: > > > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote: > > >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote: > > >>> On 01/06/2021 20:08, Thierry Reding wrote: > > On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: > > > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: > > >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > > >>> From: Thierry Reding > > >>> > > > Probably best if I queue 3-6 on a separate branch once you send a v3, > > > then Krzysztof can pull that in if he needs it. > > > > Patch 5 has a build-time dependency on patch 1, so they need to go in > > together. The reason why I suggested Krzysztof pick these up is because > > there is a restructuring series that this depends on, which will go > > into > > Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of > > other > > and mostly unrelated stuff as well. > > >>> > > >>> I missed that part... what other series are needed for this one? Except > > >>> Dmitry's power management set I do not have anything in my sight for > > >>> Tegras memory controllers. > > >>> > > >>> Anyway, I can take the memory bits and provide a stable tag with these. > > >>> Recently there was quite a lot work around Tegra memory controllers, so > > >>> this makes especially sense if new patches appear. > > >> > > >> OK, I think I have now the patchset you talked about - "memory: tegra: > > >> Driver unification" v2, right? > > > > > > Yes, that's the one. That series is fairly self-contained, but Dmitry's > > > power management set has dependencies that pull in the regulator, clock > > > and ARM SoC trees. 
> > > > > > I did a test merge of the driver unification series with a branch that > > > has Dmitry's patches and all the dependencies and there are no conflicts > > > so that, fortunately, doesn't further complicates things. > > > > > > Do you want me to send you a pull request with Dmitry's memory > > > controller changes? You could then apply the unification series on top, > > > which should allow this SMMU series to apply cleanly on top of that. > > > > Makes sense and it looks quite bulletproof for future changes. Let's do > > like this. I will apply your patch 1/10 from this v2 on top of it and > > driver unification later. > > Okey doke. Thierry -- please let me know which patches I can queue. Having > the SMMU driver changes in the IOMMU tree really helps in case of conflicts > with other SMMU changes. As I say, I can put stuff on a separate branch for > you if it helps. Given that the SMMU patches have a build-time dependency on the first patch in the series, and the series depends on the unification series, I think Krzysztof would have to pull the memory controller branch that I have with Dmitry's work, apply the unification series on top and then take patch 1 of this series on top of that. That's the state that you'd have to pull into the SMMU tree to get the right dependencies. The SMMU pieces are the leaf of the dependency tree, so technically no separate branch is needed, because I don't think anybody would have to pull from it. However, a separate branch might make it easier to back any of this work out if we have to. Krzysztof, I do plan on sending out a v3 of the unification series to address the two cleanups that Dmitry and you have pointed out. After that I can send out v3 of this series to fix the ordering issue that Krishna had mentioned. After all of those are applied, would you be able to provide a stable branch for Will's SMMU tree? 
Thierry
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On 02/06/2021 16:53, Thierry Reding wrote: > On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote: >> On 02/06/2021 10:52, Thierry Reding wrote: >>> On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote: On 02/06/2021 09:33, Krzysztof Kozlowski wrote: > On 01/06/2021 20:08, Thierry Reding wrote: >> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote: >>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote: On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote: > From: Thierry Reding > >>> Probably best if I queue 3-6 on a separate branch once you send a v3, >>> then Krzysztof can pull that in if he needs it. >> >> Patch 5 has a build-time dependency on patch 1, so they need to go in >> together. The reason why I suggested Krzysztof pick these up is because >> there is a restructuring series that this depends on, which will go into >> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other >> and mostly unrelated stuff as well. > > I missed that part... what other series are needed for this one? Except > Dmitry's power management set I do not have anything in my sight for > Tegras memory controllers. > > Anyway, I can take the memory bits and provide a stable tag with these. > Recently there was quite a lot work around Tegra memory controllers, so > this makes especially sense if new patches appear. OK, I think I have now the patchset you talked about - "memory: tegra: Driver unification" v2, right? >>> >>> Yes, that's the one. That series is fairly self-contained, but Dmitry's >>> power management set has dependencies that pull in the regulator, clock >>> and ARM SoC trees. >>> >>> I did a test merge of the driver unification series with a branch that >>> has Dmitry's patches and all the dependencies and there are no conflicts >>> so that, fortunately, doesn't further complicates things. >>> >>> Do you want me to send you a pull request with Dmitry's memory >>> controller changes? 
>>> You could then apply the unification series on top,
>>> which should allow this SMMU series to apply cleanly on top of that.
>>
>> Makes sense and it looks quite bulletproof for future changes. Let's do
>> like this. I will apply your patch 1/10 from this v2 on top of it and
>> driver unification later.
>
> The SMMU series here depends on the unification series, so the
> unification series needs to go first. It'd be a fair bit of work to
> reverse that because the ->probe_device() callback implemented by the
> first patch of this SMMU series is part of the tegra_mc_ops structure
> that's introduced in the unification series.

Right, you already wrote it in the first email in this thread, I just reversed the words in my head... Then, as you wrote: take Dmitry's changes and share them via a pull to me. I'll put the unification series on top, then #1 from this SMMU series, and finally I'll provide a pull request for Will so his SMMU work can go on.

Best regards,
Krzysztof
Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults
On 02/06/2021 16:58, Thierry Reding wrote:
> On Wed, Jun 02, 2021 at 12:40:49PM +0100, Will Deacon wrote:
>> On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
>>> On 02/06/2021 10:52, Thierry Reding wrote:
On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
>> On 01/06/2021 20:08, Thierry Reding wrote:
>>> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
>> From: Thierry Reding
>> Probably best if I queue 3-6 on a separate branch once you send a v3, then
>> Krzysztof can pull that in if he needs it.
>>>
>>> Patch 5 has a build-time dependency on patch 1, so they need to go in
>>> together. The reason why I suggested Krzysztof pick these up is because
>>> there is a restructuring series that this depends on, which will go into
>>> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>>> and mostly unrelated stuff as well.
>>
>> I missed that part... what other series are needed for this one? Except
>> Dmitry's power management set I do not have anything in my sight for
>> Tegra memory controllers.
>>
>> Anyway, I can take the memory bits and provide a stable tag with these.
>> Recently there was quite a lot of work around Tegra memory controllers, so
>> this especially makes sense if new patches appear.
>
> OK, I think I have now the patchset you talked about - "memory: tegra:
> Driver unification" v2, right?

Yes, that's the one. That series is fairly self-contained, but Dmitry's
power management set has dependencies that pull in the regulator, clock
and ARM SoC trees. I did a test merge of the driver unification series
with a branch that has Dmitry's patches and all the dependencies, and
there are no conflicts, so that, fortunately, doesn't further complicate
things.

Do you want me to send you a pull request with Dmitry's memory
controller changes? You could then apply the unification series on top,
which should allow this SMMU series to apply cleanly on top of that.

>>>
>>> Makes sense and it looks quite bulletproof for future changes. Let's do
>>> like this. I will apply your patch 1/10 from this v2 on top of it and
>>> driver unification later.
>>
>> Okey doke. Thierry -- please let me know which patches I can queue. Having
>> the SMMU driver changes in the IOMMU tree really helps in case of conflicts
>> with other SMMU changes. As I say, I can put stuff on a separate branch for
>> you if it helps.
>
> Given that the SMMU patches have a build-time dependency on the first
> patch in the series, and the series depends on the unification series, I
> think Krzysztof would have to pull the memory controller branch that I
> have with Dmitry's work, apply the unification series on top and then
> take patch 1 of this series on top of that. That's the state that you'd
> have to pull into the SMMU tree to get the right dependencies.
>
> The SMMU pieces are the leaf of the dependency tree, so technically no
> separate branch is needed, because I don't think anybody would have to
> pull from it. However, a separate branch might make it easier to back
> any of this work out if we have to.
>
> Krzysztof, I do plan on sending out a v3 of the unification series to
> address the two cleanups that Dmitry and you have pointed out. After
> that I can send out v3 of this series to fix the ordering issue that
> Krishna had mentioned. After all of those are applied, would you be able
> to provide a stable branch for Will's SMMU tree?

Yes, I will provide a stable branch/tag.

Best regards,
Krzysztof

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
Hi Boris:

Thanks for your review.

On 6/2/2021 9:16 AM, Boris Ostrovsky wrote:
> On 5/30/21 11:06 AM, Tianyu Lan wrote:
>> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>> EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>> IOMMU_INIT_FINISH(2,
>> - NULL,
>> + hyperv_swiotlb_detect,
>> pci_xen_swiotlb_init,
>> NULL);
>
> Could you explain this change?

Hyper-V allocates its own swiotlb bounce buffer, so the default swiotlb
buffer should not be allocated; swiotlb_init(), called from
pci_swiotlb_init(), is what allocates that default buffer. To achieve
this, hyperv_swiotlb_detect() is put as the first entry in the
iommu_table_entry list. The detect loop in pci_iommu_alloc() then exits
once hyperv_swiotlb_detect() is called in a Hyper-V VM, and the other
iommu_table_entry callbacks are not called.
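The ordering being described can be sketched as a standalone model. This is a deliberate simplification of the x86 `iommu_table_entry` detect loop; the struct layout, priorities and function bodies here are illustrative, not the real kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of x86's struct iommu_table_entry: pci_iommu_alloc()
 * walks the table and stops calling detect() callbacks as soon as one of
 * them claims the platform by returning nonzero. */
struct iommu_table_entry {
    const char *name;
    int (*detect)(void);
};

static int on_hyperv = 1; /* pretend we run in a Hyper-V isolation VM */

static int hyperv_swiotlb_detect(void) { return on_hyperv; }
static int pci_xen_swiotlb_detect(void) { return 0; } /* not a Xen PV guest */
static int pci_swiotlb_detect(void)     { return 1; } /* default fallback */

/* Detect order after the patch: Hyper-V first, then Xen, then default. */
static struct iommu_table_entry table[] = {
    { "hyperv_swiotlb",  hyperv_swiotlb_detect },
    { "xen_swiotlb",     pci_xen_swiotlb_detect },
    { "default_swiotlb", pci_swiotlb_detect },
};

/* Returns the name of the first entry whose detect() claimed the platform;
 * later entries are never consulted, so the default swiotlb buffer is
 * never set up when the Hyper-V entry wins. */
static const char *run_detect_loop(void)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].detect())
            return table[i].name;
    return "none";
}
```

The point of the loop structure is that putting `hyperv_swiotlb_detect` first makes the other entries unreachable on Hyper-V, which is exactly the effect the patch relies on.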
[PATCH v3 0/6] Add support for ACPI VIOT
Add a driver for the ACPI VIOT table, which provides topology information for para-virtual IOMMUs. Enable virtio-iommu on non-devicetree platforms, including x86.

Since v2 [1] I tried to improve commit messages and comments. More feedback and review are always welcome. Joerg offered to take this series through the IOMMU tree, which requires acks for patches 1-3.

You can find a QEMU implementation at [2], with extra support for testing all VIOT nodes including MMIO-based endpoints and IOMMU. This series is at [3].

[1] https://lore.kernel.org/linux-iommu/20210423113836.3974972-1-jean-phili...@linaro.org/
[2] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/acpi
[3] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/acpi

Jean-Philippe Brucker (6):
  ACPI: arm64: Move DMA setup operations out of IORT
  ACPI: Move IOMMU setup code out of IORT
  ACPI: Add driver for the VIOT table
  iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
  iommu/dma: Simplify calls to iommu_setup_dma_ops()
  iommu/virtio: Enable x86 support

 drivers/acpi/Kconfig         |   3 +
 drivers/iommu/Kconfig        |   4 +-
 drivers/acpi/Makefile        |   2 +
 drivers/acpi/arm64/Makefile  |   1 +
 include/acpi/acpi_bus.h      |   3 +
 include/linux/acpi.h         |   3 +
 include/linux/acpi_iort.h    |  14 +-
 include/linux/acpi_viot.h    |  19 ++
 include/linux/dma-iommu.h    |   4 +-
 arch/arm64/mm/dma-mapping.c  |   2 +-
 drivers/acpi/arm64/dma.c     |  50 +
 drivers/acpi/arm64/iort.c    | 129 ++---
 drivers/acpi/bus.c           |   2 +
 drivers/acpi/scan.c          |  60 +-
 drivers/acpi/viot.c          | 364 +++
 drivers/iommu/amd/iommu.c    |   9 +-
 drivers/iommu/dma-iommu.c    |  17 +-
 drivers/iommu/intel/iommu.c  |  10 +-
 drivers/iommu/virtio-iommu.c |   8 +
 MAINTAINERS                  |   8 +
 20 files changed, 562 insertions(+), 150 deletions(-)
 create mode 100644 include/linux/acpi_viot.h
 create mode 100644 drivers/acpi/arm64/dma.c
 create mode 100644 drivers/acpi/viot.c

--
2.31.1
[PATCH v3 1/6] ACPI: arm64: Move DMA setup operations out of IORT
Extract generic DMA setup code out of IORT, so it can be reused by VIOT. Keep it in drivers/acpi/arm64 for now, since it could break x86 platforms that haven't run this code so far, if they have invalid tables. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/arm64/Makefile | 1 + include/linux/acpi.h| 3 +++ include/linux/acpi_iort.h | 6 ++--- drivers/acpi/arm64/dma.c| 50 ++ drivers/acpi/arm64/iort.c | 54 ++--- drivers/acpi/scan.c | 2 +- 6 files changed, 66 insertions(+), 50 deletions(-) create mode 100644 drivers/acpi/arm64/dma.c diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 6ff50f4ed947..66acbe77f46e 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only obj-$(CONFIG_ACPI_IORT)+= iort.o obj-$(CONFIG_ACPI_GTDT)+= gtdt.o +obj-y += dma.o diff --git a/include/linux/acpi.h b/include/linux/acpi.h index c60745f657e9..7aaa9559cc19 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -259,9 +259,12 @@ void acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa); #ifdef CONFIG_ARM64 void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa); +void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size); #else static inline void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa) { } +static inline void +acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { } #endif int acpi_numa_memory_affinity_init (struct acpi_srat_mem_affinity *ma); diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index 1a12baa58e40..f7f054833afd 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -34,7 +34,7 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 id, void acpi_configure_pmsi_domain(struct device *dev); int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id); /* IOMMU interface */ -void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 
*size); +int iort_dma_get_ranges(struct device *dev, u64 *size); const struct iommu_ops *iort_iommu_configure_id(struct device *dev, const u32 *id_in); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); @@ -48,8 +48,8 @@ static inline struct irq_domain *iort_get_device_domain( { return NULL; } static inline void acpi_configure_pmsi_domain(struct device *dev) { } /* IOMMU interface */ -static inline void iort_dma_setup(struct device *dev, u64 *dma_addr, - u64 *size) { } +static inline int iort_dma_get_ranges(struct device *dev, u64 *size) +{ return -ENODEV; } static inline const struct iommu_ops *iort_iommu_configure_id( struct device *dev, const u32 *id_in) { return NULL; } diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c new file mode 100644 index ..f16739ad3cc0 --- /dev/null +++ b/drivers/acpi/arm64/dma.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include + +void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) +{ + int ret; + u64 end, mask; + u64 dmaaddr = 0, size = 0, offset = 0; + + /* +* If @dev is expected to be DMA-capable then the bus code that created +* it should have initialised its dma_mask pointer by this point. For +* now, we'll continue the legacy behaviour of coercing it to the +* coherent mask if not, but we'll no longer do so quietly. +*/ + if (!dev->dma_mask) { + dev_warn(dev, "DMA mask not set\n"); + dev->dma_mask = &dev->coherent_dma_mask; + } + + if (dev->coherent_dma_mask) + size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1); + else + size = 1ULL << 32; + + ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size); + if (ret == -ENODEV) + ret = iort_dma_get_ranges(dev, &size); + if (!ret) { + /* +* Limit coherent and dma mask based on size retrieved from +* firmware. 
+*/ + end = dmaaddr + size - 1; + mask = DMA_BIT_MASK(ilog2(end) + 1); + dev->bus_dma_limit = end; + dev->coherent_dma_mask = min(dev->coherent_dma_mask, mask); + *dev->dma_mask = min(*dev->dma_mask, mask); + } + + *dma_addr = dmaaddr; + *dma_size = size; + + ret = dma_direct_set_offset(dev, dmaaddr + offset, dmaaddr, size); + + dev_dbg(dev, "dma_offset(%#08llx)%s\n", offset, ret ? " failed!" : ""); +} diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 3912a1f6058e..a940be1cf2af 100644
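The mask arithmetic in acpi_arch_dma_setup() above (`end = dmaaddr + size - 1`, then `mask = DMA_BIT_MASK(ilog2(end) + 1)`) can be checked in isolation. The helpers below are userspace stand-ins for the kernel's `DMA_BIT_MASK` and `ilog2`, written only to illustrate the computation:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's DMA_BIT_MASK(n). */
static uint64_t dma_bit_mask(unsigned int n)
{
    return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}

/* Userspace stand-in for the kernel's ilog2() on a nonzero u64:
 * index of the highest set bit. */
static unsigned int ilog2_u64(uint64_t v)
{
    unsigned int l = 0;
    while (v >>= 1)
        l++;
    return l;
}

/* Smallest DMA mask covering the range [dmaaddr, dmaaddr + size - 1],
 * mirroring the arithmetic in acpi_arch_dma_setup() above. */
static uint64_t dma_mask_for_range(uint64_t dmaaddr, uint64_t size)
{
    uint64_t end = dmaaddr + size - 1;
    return dma_bit_mask(ilog2_u64(end) + 1);
}
```

For example, a 4 GiB window starting at address 0 ends at 0xFFFFFFFF, so the computed mask is the 32-bit mask; a window reaching past 4 GiB needs one more bit.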
[PATCH v3 6/6] iommu/virtio: Enable x86 support
With the VIOT support in place, x86 platforms can now use the virtio-iommu. Because the other x86 IOMMU drivers aren't yet ready to use the acpi_dma_setup() path, x86 doesn't implement arch_setup_dma_ops() at the moment. Similarly to Vt-d and AMD IOMMU, call iommu_setup_dma_ops() from probe_finalize(). Acked-by: Joerg Roedel Acked-by: Michael S. Tsirkin Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig| 3 ++- drivers/iommu/dma-iommu.c| 1 + drivers/iommu/virtio-iommu.c | 8 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index aff8a4830dd1..07b7c25cbed8 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -400,8 +400,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU tristate "Virtio IOMMU driver" depends on VIRTIO - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA select INTERVAL_TREE select ACPI_VIOT if ACPI help diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 175f8eaeb5b3..46ed43c400cf 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1332,6 +1332,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n", dev_name(dev)); } +EXPORT_SYMBOL_GPL(iommu_setup_dma_ops); static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev, phys_addr_t msi_addr, struct iommu_domain *domain) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 218fe8560e8d..77aee1207ced 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -1026,6 +1026,13 @@ static struct iommu_device *viommu_probe_device(struct device *dev) return ERR_PTR(ret); } +static void viommu_probe_finalize(struct device *dev) +{ +#ifndef CONFIG_ARCH_HAS_SETUP_DMA_OPS + iommu_setup_dma_ops(dev, 0, U64_MAX); +#endif +} + static void viommu_release_device(struct device *dev) { struct iommu_fwspec *fwspec 
= dev_iommu_fwspec_get(dev); @@ -1062,6 +1069,7 @@ static struct iommu_ops viommu_ops = { .iova_to_phys = viommu_iova_to_phys, .iotlb_sync = viommu_iotlb_sync, .probe_device = viommu_probe_device, + .probe_finalize = viommu_probe_finalize, .release_device = viommu_release_device, .device_group = viommu_device_group, .get_resv_regions = viommu_get_resv_regions, -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT
Extract the code that sets up the IOMMU infrastructure from IORT, since it can be reused by VIOT. Move it one level up into a new acpi_iommu_configure_id() function, which calls the IORT parsing function which in turn calls the acpi_iommu_fwspec_init() helper. Signed-off-by: Jean-Philippe Brucker --- include/acpi/acpi_bus.h | 3 ++ include/linux/acpi_iort.h | 8 ++--- drivers/acpi/arm64/iort.c | 75 +-- drivers/acpi/scan.c | 55 +++- 4 files changed, 69 insertions(+), 72 deletions(-) diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h index 3a82faac5767..41f092a269f6 100644 --- a/include/acpi/acpi_bus.h +++ b/include/acpi/acpi_bus.h @@ -588,6 +588,9 @@ struct acpi_pci_root { bool acpi_dma_supported(struct acpi_device *adev); enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev); +int acpi_iommu_fwspec_init(struct device *dev, u32 id, + struct fwnode_handle *fwnode, + const struct iommu_ops *ops); int acpi_dma_get_range(struct device *dev, u64 *dma_addr, u64 *offset, u64 *size); int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr, diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index f7f054833afd..f1f0842a2cb2 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -35,8 +35,7 @@ void acpi_configure_pmsi_domain(struct device *dev); int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id); /* IOMMU interface */ int iort_dma_get_ranges(struct device *dev, u64 *size); -const struct iommu_ops *iort_iommu_configure_id(struct device *dev, - const u32 *id_in); +int iort_iommu_configure_id(struct device *dev, const u32 *id_in); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); phys_addr_t acpi_iort_dma_get_max_cpu_address(void); #else @@ -50,9 +49,8 @@ static inline void acpi_configure_pmsi_domain(struct device *dev) { } /* IOMMU interface */ static inline int iort_dma_get_ranges(struct device *dev, u64 *size) { return -ENODEV; } -static inline const struct iommu_ops 
*iort_iommu_configure_id( - struct device *dev, const u32 *id_in) -{ return NULL; } +static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in) +{ return -ENODEV; } static inline int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) { return 0; } diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index a940be1cf2af..b5b021e064b6 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -806,23 +806,6 @@ static struct acpi_iort_node *iort_get_msi_resv_iommu(struct device *dev) return NULL; } -static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev) -{ - struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); - - return (fwspec && fwspec->ops) ? fwspec->ops : NULL; -} - -static inline int iort_add_device_replay(struct device *dev) -{ - int err = 0; - - if (dev->bus && !device_iommu_mapped(dev)) - err = iommu_probe_device(dev); - - return err; -} - /** * iort_iommu_msi_get_resv_regions - Reserved region driver helper * @dev: Device from iommu_get_resv_regions() @@ -900,18 +883,6 @@ static inline bool iort_iommu_driver_enabled(u8 type) } } -static int arm_smmu_iort_xlate(struct device *dev, u32 streamid, - struct fwnode_handle *fwnode, - const struct iommu_ops *ops) -{ - int ret = iommu_fwspec_init(dev, fwnode, ops); - - if (!ret) - ret = iommu_fwspec_add_ids(dev, &streamid, 1); - - return ret; -} - static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node) { struct acpi_iort_root_complex *pci_rc; @@ -946,7 +917,7 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node, return iort_iommu_driver_enabled(node->type) ? 
-EPROBE_DEFER : -ENODEV; - return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops); + return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops); } struct iort_pci_alias_info { @@ -1020,24 +991,14 @@ static int iort_nc_iommu_map_id(struct device *dev, * @dev: device to configure * @id_in: optional input id const value pointer * - * Returns: iommu_ops pointer on configuration success - * NULL on configuration failure + * Returns: 0 on success, <0 on failure */ -const struct iommu_ops *iort_iommu_configure_id(struct device *dev, - const u32 *id_in) +int iort_iommu_configure_id(struct device *dev, const u32 *id_in) { struct acpi_iort_node *node; - const struct iommu_ops *ops; + const struct iommu_ops *ops = NULL; int err = -ENODEV; - /* -
[PATCH v3 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()
dma-iommu uses the address bounds described in domain->geometry during IOVA allocation. The address size parameters of iommu_setup_dma_ops() are useful for describing additional limits set by the platform firmware, but aren't needed for drivers that call this function from probe_finalize(). The base parameter can be zero because dma-iommu already removes the first IOVA page, and the limit parameter can be U64_MAX because it's only checked against the domain geometry. Simplify calls to iommu_setup_dma_ops(). Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/amd/iommu.c | 9 + drivers/iommu/dma-iommu.c | 4 +++- drivers/iommu/intel/iommu.c | 10 +- 3 files changed, 5 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 94b96d81fcfd..d3123bc05c08 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1708,14 +1708,7 @@ static struct iommu_device *amd_iommu_probe_device(struct device *dev) static void amd_iommu_probe_finalize(struct device *dev) { - struct iommu_domain *domain; - - /* Domains are initialized for this device - have a look what we ended up with */ - domain = iommu_get_domain_for_dev(dev); - if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void amd_iommu_release_device(struct device *dev) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index c62e19bed302..175f8eaeb5b3 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) if (domain->type == IOMMU_DOMAIN_DMA) { if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev)) goto out_err; - dev->dma_ops = &iommu_dma_ops; + set_dma_ops(dev, &iommu_dma_ops); + } else { + set_dma_ops(dev, NULL); } return; diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c 
index 85f18342603c..8d866940692a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5165,15 +5165,7 @@ static void intel_iommu_release_device(struct device *dev) static void intel_iommu_probe_finalize(struct device *dev) { - dma_addr_t base = IOVA_START_PFN << VTD_PAGE_SHIFT; - struct iommu_domain *domain = iommu_get_domain_for_dev(dev); - struct dmar_domain *dmar_domain = to_dmar_domain(domain); - - if (domain && domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, base, - __DOMAIN_MAX_ADDR(dmar_domain->gaw)); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void intel_iommu_get_resv_regions(struct device *device, -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
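The commit message's argument, that probe_finalize() callers can pass base 0 and limit U64_MAX because dma-iommu clamps against `domain->geometry` anyway, can be illustrated with a toy model. `effective_window()` is a hypothetical helper, not the actual dma-iommu code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of struct iommu_domain_geometry's aperture bounds. */
struct geometry {
    uint64_t aperture_start;
    uint64_t aperture_end; /* inclusive */
};

/* The usable IOVA window is the intersection of the caller-supplied
 * [base, limit] and the domain aperture, so base = 0 / limit = UINT64_MAX
 * from probe_finalize() simply degrades to the aperture itself.
 * Returns -1 when the two ranges don't overlap, corresponding to the
 * -EFAULT case in iommu_dma_init_domain(). */
static int effective_window(const struct geometry *g,
                            uint64_t base, uint64_t limit,
                            uint64_t *lo, uint64_t *hi)
{
    if (base > g->aperture_end || limit < g->aperture_start)
        return -1;
    *lo = base > g->aperture_start ? base : g->aperture_start;
    *hi = limit < g->aperture_end ? limit : g->aperture_end;
    return 0;
}
```

With a widest-possible caller range, the result is exactly the aperture, which is why the extra per-driver base/limit bookkeeping removed by this patch was redundant.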
[PATCH v3 3/6] ACPI: Add driver for the VIOT table
The ACPI Virtual I/O Translation Table describes topology of para-virtual platforms, similarly to vendor tables DMAR, IVRS and IORT. For now it describes the relation between virtio-iommu and the endpoints it manages. Three steps are needed to configure DMA of endpoints: (1) acpi_viot_init(): parse the VIOT table, find or create the fwnode associated to each vIOMMU device. (2) When probing the vIOMMU device, the driver registers its IOMMU ops within the IOMMU subsystem. This step doesn't require any intervention from the VIOT driver. (3) viot_iommu_configure(): before binding the endpoint to a driver, find the associated IOMMU ops. Register them, along with the endpoint ID, into the device's iommu_fwspec. If step (3) happens before step (2), it is deferred until the IOMMU is initialized, then retried. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/Kconfig | 3 + drivers/iommu/Kconfig | 1 + drivers/acpi/Makefile | 2 + include/linux/acpi_viot.h | 19 ++ drivers/acpi/bus.c| 2 + drivers/acpi/scan.c | 3 + drivers/acpi/viot.c | 364 ++ MAINTAINERS | 8 + 8 files changed, 402 insertions(+) create mode 100644 include/linux/acpi_viot.h create mode 100644 drivers/acpi/viot.c diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index eedec61e3476..3758c6940ed7 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -526,6 +526,9 @@ endif source "drivers/acpi/pmic/Kconfig" +config ACPI_VIOT + bool + endif # ACPI config X86_PM_TIMER diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 1f111b399bca..aff8a4830dd1 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -403,6 +403,7 @@ config VIRTIO_IOMMU depends on ARM64 select IOMMU_API select INTERVAL_TREE + select ACPI_VIOT if ACPI help Para-virtualised IOMMU driver with virtio. 
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 700b41adf2db..a6e644c48987 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -118,3 +118,5 @@ video-objs += acpi_video.o video_detect.o obj-y += dptf/ obj-$(CONFIG_ARM64)+= arm64/ + +obj-$(CONFIG_ACPI_VIOT)+= viot.o diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h new file mode 100644 index ..1eb8ee5b0e5f --- /dev/null +++ b/include/linux/acpi_viot.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __ACPI_VIOT_H__ +#define __ACPI_VIOT_H__ + +#include + +#ifdef CONFIG_ACPI_VIOT +void __init acpi_viot_init(void); +int viot_iommu_configure(struct device *dev); +#else +static inline void acpi_viot_init(void) {} +static inline int viot_iommu_configure(struct device *dev) +{ + return -ENODEV; +} +#endif + +#endif /* __ACPI_VIOT_H__ */ diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index be7da23fad76..b835ca702ff0 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -27,6 +27,7 @@ #include #endif #include +#include #include #include #include @@ -1339,6 +1340,7 @@ static int __init acpi_init(void) pci_mmcfg_late_init(); acpi_iort_init(); acpi_scan_init(); + acpi_viot_init(); acpi_ec_init(); acpi_debugfs_init(); acpi_sleep_proc_init(); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 696b81432ae7..574d657cf4bb 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -1555,6 +1556,8 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev, return ops; err = iort_iommu_configure_id(dev, id_in); + if (err && err != -EPROBE_DEFER) + err = viot_iommu_configure(dev); /* * If we have reason to believe the IOMMU driver missed the initial diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c new file mode 100644 index ..892cd9fa7b6d --- /dev/null +++ b/drivers/acpi/viot.c @@ -0,0 +1,364 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * 
Virtual I/O topology + * + * The Virtual I/O Translation Table (VIOT) describes the topology of + * para-virtual IOMMUs and the endpoints they manage. The OS uses it to + * initialize devices in the right order, preventing endpoints from issuing DMA + * before their IOMMU is ready. + * + * When binding a driver to a device, before calling the device driver's probe() + * method, the driver infrastructure calls dma_configure(). At that point the + * VIOT driver looks for an IOMMU associated to the device in the VIOT table. + * If an IOMMU exists and has been initialized, the VIOT driver initializes the + * device's IOMMU fwspec, allowing the DMA infrastructure to invoke the IOMMU + * ops when the device driver configures DMA mappings. If an IOMMU exists a
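The deferral described in this commit message — step (3) reaching a vIOMMU whose driver has not yet completed step (2) — follows the kernel's usual -EPROBE_DEFER pattern. Below is a toy model of that control flow; the names and data structures are illustrative only, not the actual VIOT driver code:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EPROBE_DEFER 517 /* numeric value of the kernel's EPROBE_DEFER */

/* Toy model: each vIOMMU found in the VIOT table gets a fwnode; its IOMMU
 * ops become available only once the IOMMU driver has probed (step (2)). */
struct toy_viommu {
    int fwnode_id;
    int ops_registered;
};

/* Step (3): before binding an endpoint's driver, look up the vIOMMU that
 * translates it. If that vIOMMU's driver hasn't probed yet, ask the driver
 * core to retry later instead of failing outright. */
static int toy_viot_configure(struct toy_viommu *v, size_t n, int fwnode_id)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i].fwnode_id != fwnode_id)
            continue;
        if (!v[i].ops_registered)
            return -TOY_EPROBE_DEFER; /* step (2) hasn't happened yet */
        return 0; /* here the endpoint's iommu_fwspec would be filled in */
    }
    return -1; /* endpoint not translated by any vIOMMU */
}
```

Once the vIOMMU driver registers its ops, the retried configuration succeeds, so endpoints never issue DMA before their IOMMU is ready.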
[PATCH v3 4/6] iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
Passing a 64-bit address width to iommu_setup_dma_ops() is valid on virtual platforms, but isn't currently possible. The overflow check in iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass a limit address instead of a size, so callers don't have to fake a size to work around the check. Signed-off-by: Jean-Philippe Brucker --- include/linux/dma-iommu.h | 4 ++-- arch/arm64/mm/dma-mapping.c | 2 +- drivers/iommu/amd/iommu.c | 2 +- drivers/iommu/dma-iommu.c | 12 ++-- drivers/iommu/intel/iommu.c | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 6e75a2d689b4..758ca4694257 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -19,7 +19,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base); void iommu_put_dma_cookie(struct iommu_domain *domain); /* Setup call for arch DMA mapping code */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size); +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit); /* The DMA API isn't _quite_ the whole story, though... 
*/ /* @@ -50,7 +50,7 @@ struct msi_msg; struct device; static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base, - u64 size) + u64 dma_limit) { } diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 4bf1dd3eb041..7bd1d2199141 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -50,7 +50,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, dev->dma_coherent = coherent; if (iommu) - iommu_setup_dma_ops(dev, dma_base, size); + iommu_setup_dma_ops(dev, dma_base, size - dma_base - 1); #ifdef CONFIG_XEN if (xen_swiotlb_detect()) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 3ac42bbdefc6..94b96d81fcfd 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1713,7 +1713,7 @@ static void amd_iommu_probe_finalize(struct device *dev) /* Domains are initialized for this device - have a look what we ended up with */ domain = iommu_get_domain_for_dev(dev); if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0); + iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX); else set_dma_ops(dev, NULL); } diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..c62e19bed302 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -319,16 +319,16 @@ static bool dev_is_untrusted(struct device *dev) * iommu_dma_init_domain - Initialise a DMA mapping domain * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie() * @base: IOVA at which the mappable address space starts - * @size: Size of IOVA space + * @limit: Last address of the IOVA space * @dev: Device the domain is being initialised for * - * @base and @size should be exact multiples of IOMMU page granularity to + * @base and @limit + 1 should be exact multiples of IOMMU page granularity to * avoid rounding surprises. 
If necessary, we reserve the page at address 0 * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but * any change which could make prior IOVAs invalid will fail. */ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, - u64 size, struct device *dev) +dma_addr_t limit, struct device *dev) { struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; @@ -346,7 +346,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, /* Check the domain allows at least some access to the device... */ if (domain->geometry.force_aperture) { if (base > domain->geometry.aperture_end || - base + size <= domain->geometry.aperture_start) { + limit < domain->geometry.aperture_start) { pr_warn("specified DMA range outside IOMMU capability\n"); return -EFAULT; } @@ -1308,7 +1308,7 @@ static const struct dma_map_ops iommu_dma_ops = { * The IOMMU core code allocates the default DMA domain, which the underlying * IOMMU driver needs to support via the dma-iommu layer. */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) { struct iommu_domain *domain = iommu_get_domain_for_dev(dev); @@ -1320,7 +1320,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) * underlying IOMMU driver needs to support via the dma-iommu layer. */ if (domain->type == IOM
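The overflow this patch works around is easy to reproduce standalone: a full 64-bit space has size 2^64, which a u64 can only represent as 0, so the old `base + size` comparison wraps and rejects a valid window, while comparing the last usable address directly does not. An illustrative sketch, not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Toy aperture mirroring domain->geometry's inclusive bounds. */
struct aperture {
    uint64_t start;
    uint64_t end;
};

/* Old-style check: the window [base, base + size) must overlap the
 * aperture. With base = 0 and a "full" size that wrapped to 0,
 * base + size == 0 <= start, so a perfectly valid window is rejected. */
static int old_check_ok(struct aperture a, uint64_t base, uint64_t size)
{
    return !(base > a.end || base + size <= a.start);
}

/* New-style check: pass the last usable address instead; nothing wraps. */
static int new_check_ok(struct aperture a, uint64_t base, uint64_t limit)
{
    return !(base > a.end || limit < a.start);
}
```

This is why callers previously had to fake a size, and why the patch switches the parameter to a limit.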
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 02:20:15AM +, Tian, Kevin wrote:
> > From: Alex Williamson
> > Sent: Wednesday, June 2, 2021 6:22 AM
> >
> > On Tue, 1 Jun 2021 07:01:57 +
> > "Tian, Kevin" wrote:
> >
> > > I summarized five opens here, about:
> > >
> > > 1) Finalizing the name to replace /dev/ioasid;
> > > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3) Carry device information in invalidation/fault reporting uAPI;
> > > 4) What should/could be specified when allocating an IOASID;
> > > 5) The protocol between vfio group and kvm;
> > >
> > ...
> > >
> > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > original purpose of this protocol is not about I/O address space. It's
> > > for KVM to know whether any device is assigned to this VM and then
> > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> >
> > Right, the original use case was for KVM to determine whether it needs
> > to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> > present and be able to test if DMA for that device is cache
> > coherent.

Why is this such a strong linkage to VFIO and not just a 'hey kvm
emulate wbinvd' flag from qemu?

I briefly didn't see any obvious linkage in the arch code, just some
dead code:

$ git grep iommu_noncoherent
arch/x86/include/asm/kvm_host.h:bool iommu_noncoherent;

$ git grep iommu_domain arch/x86
arch/x86/include/asm/kvm_host.h:struct iommu_domain *iommu_domain;

Huh?

It kind of looks like the other main point is to generate the
VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
connect back to the kvm data.

But that seems like it would have been better handled with some IOCTL
on the vfio_device fd to import the KVM to the driver, not this
roundabout way?

> > The user, QEMU, creates a KVM "pseudo" device representing the vfio
> > group, providing the file descriptor of that group to show ownership.
> > The ugly symbol_get code is to avoid hard module dependencies, ie. the
> > kvm module should not pull in or require the vfio module, but vfio will
> > be present if attempting to register this device.
>
> so the symbol_get thing is not about the protocol itself. Whatever protocol
> is defined, as long as kvm needs to call vfio or ioasid helper function, we
> need to define a proper way to do it. Jason, what's your opinion of an
> alternative option since you dislike symbol_get?

The symbol_get was to avoid module dependencies because bringing in
vfio along with kvm is not nice. The symbol_get is not nice here, but
unless things can be truly moved to lower levels of code where module
dependencies are not a problem (eg kvm to iommu is a non-issue) I
don't see much of a solution.

Other cases like kvmgt or AP would be similarly fine to have had a
kvmgt to kvm module dependency.

> > All of these use cases are related to the IOMMU, whether DMA is
> > coherent, translating device IOVA to GPA, and an acceleration path to
> > emulate IOMMU programming in kernel... they seem pretty relevant.
>
> One open is whether kvm should hold a device reference or IOASID
> reference. For DMA coherence, it only matters whether assigned
> devices are coherent or not (not for a specific address space). For kvmgt,
> it is for recording the kvm pointer in the mdev driver to do write
> protection. For ppc, it does relate to a specific I/O page table.
>
> Then I feel only a part of the protocol will be moved to /dev/ioasid and
> something will still remain between kvm and vfio?

Honestly I would try not to touch this too much :\

Jason
Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
On 6/2/21 11:01 AM, Tianyu Lan wrote: > Hi Boris: > Thanks for your review. > > On 6/2/2021 9:16 AM, Boris Ostrovsky wrote: >> >> On 5/30/21 11:06 AM, Tianyu Lan wrote: >>> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void) >>> EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late); >>> IOMMU_INIT_FINISH(2, >>> - NULL, >>> + hyperv_swiotlb_detect, >>> pci_xen_swiotlb_init, >>> NULL); >> >> >> Could you explain this change? > > Hyper-V allocates its own swiotlb bounce buffer and the default > swiotlb buffer should not be allocated. swiotlb_init() in pci_swiotlb_init() > is to allocate the default swiotlb buffer. > To achieve this, put hyperv_swiotlb_detect() as the first entry in the > iommu_table_entry list. The detect loop in pci_iommu_alloc() will exit > once hyperv_swiotlb_detect() is called in a Hyper-V VM, and the other > iommu_table_entry callbacks will not be called. Right. But pci_xen_swiotlb_detect() will only do something for Xen PV guests, and those guests don't run on Hyper-V. It's either xen_pv_domain() (i.e. hypervisor_is_type(X86_HYPER_XEN_PV)) or hypervisor_is_type(X86_HYPER_MS_HYPERV) but never both. So I don't think there needs to be a dependency between the two callbacks. -boris
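Boris's point above can be reduced to a tiny model. This is an illustrative sketch only — the enum and function names are invented stand-ins for `hypervisor_is_type()` and the two detect hooks, not the real kernel code — showing that at most one of the two detect callbacks can ever fire for a given guest, so no ordering dependency between them is needed:

```c
#include <assert.h>

/* Invented stand-in for the kernel's x86_hyper_type */
enum hyper_type { HYPER_NONE, HYPER_XEN_PV, HYPER_MS_HYPERV };

static enum hyper_type cur_hyper = HYPER_NONE;

static int hypervisor_is(enum hyper_type t)
{
	return cur_hyper == t;
}

/* Each detect hook keys off a different, mutually exclusive type */
static int xen_pv_swiotlb_detect(void)
{
	return hypervisor_is(HYPER_XEN_PV);
}

static int hyperv_swiotlb_detect(void)
{
	return hypervisor_is(HYPER_MS_HYPERV);
}
```

Since a guest reports exactly one hypervisor type, the two predicates are never true together, which is the whole argument for dropping the dependency.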
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:52:02PM +0800, Jason Wang wrote: > > On 2021/6/2 4:28 AM, Jason Gunthorpe wrote: > > > I summarized five opens here, about: > > > > > > 1) Finalizing the name to replace /dev/ioasid; > > > 2) Whether one device is allowed to bind to multiple IOASID fd's; > > > 3) Carry device information in invalidation/fault reporting uAPI; > > > 4) What should/could be specified when allocating an IOASID; > > > 5) The protocol between vfio group and kvm; > > > > > > For 1), two alternative names are mentioned: /dev/iommu and > > > /dev/ioas. I don't have a strong preference and would like to hear > > > votes from all stakeholders. /dev/iommu is slightly better imho for > > > two reasons. First, per AMD's presentation in last KVM forum they > > > implement vIOMMU in hardware thus need to support user-managed > > > domains. An iommu uAPI notation might make more sense moving > > > forward. Second, it makes later uAPI naming easier as 'IOASID' can > > > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of > > > IOASID_ALLOC_IOASID.:) > > I think two years ago I suggested /dev/iommu and it didn't go very far > > at the time. > > > It looks to me using "/dev/iommu" excludes the possibility of implementing > IOASID in a device specific way (e.g through the co-operation with device > MMU + platform IOMMU)? This is intended to be the 'drivers/iommu' subsystem though. I don't want to see pluggability here beyond what drivers/iommu is providing. If someone wants to do something complicated through this interface then they need to make a drivers/iommu implementation. Or they need to use the mdev-esque "SW TABLE" mode when their driver attaches to the interface. Whether this is good enough for a specific device is an entirely different question though > What's more, ATS spec doesn't forbid the device #PF to be reported via a > device specific way. 
And if this is done then a kernel function indicating a page fault should be triggered on the ioasid handle that the device has. It is still drivers/iommu functionality. Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, June 2, 2021 1:42 AM > > > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Saturday, May 29, 2021 1:36 AM > > > > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > > > > > IOASID nesting can be implemented in two ways: hardware nesting and > > > > > software nesting. With hardware support the child and parent I/O page > > > > > tables are walked consecutively by the IOMMU to form a nested > > translation. > > > > > When it's implemented in software, the ioasid driver is responsible > > > > > for > > > > > merging the two-level mappings into a single-level shadow I/O page > > table. > > > > > Software nesting requires both child/parent page tables operated > > through > > > > > the dma mapping protocol, so any change in either level can be > > captured > > > > > by the kernel to update the corresponding shadow mapping. > > > > > > > > Why? A SW emulation could do this synchronization during invalidation > > > > processing if invalidation contained an IOVA range. > > > > > > In this proposal we differentiate between host-managed and user- > > > managed I/O page tables. If host-managed, the user is expected to use > > > map/unmap cmd explicitly upon any change required on the page table. > > > If user-managed, the user first binds its page table to the IOMMU and > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically > > > not required when changing a PTE from non-present to present). > > > > > > We expect user to use map+unmap and bind+invalidate respectively > > > instead of mixing them together. Following this policy, map+unmap > > > must be used in both levels for software nesting, so changes in either > > > level are captured timely to synchronize the shadow mapping. 
> > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when > > it is created. If you put two different types in a tree then each IOASID > > must continue to use its own operation mode. > > > > I don't see a reason to force all IOASIDs in a tree to be consistent?? > > only for software nesting. With hardware support the parent uses map > while the child uses bind. > > Yes, the policy is specified per IOASID. But if the policy violates the > requirement in a specific nesting mode, then nesting should fail. I don't get it. If the IOASID is a page table then it is bind/invalidate. SW or not SW doesn't matter at all. > > > > A software emulated two level page table where the leaf level is a > > bound page table in guest memory should continue to use > > bind/invalidate to maintain the guest page table IOASID even though it > > is a SW construct. > > with software nesting the leaf should be a host-managed page table > (or metadata). A bind/invalidate protocol doesn't require the user > to notify the kernel of every page table change. The purpose of invalidate is to inform the implementation that the page table has changed so it can flush the caches. If the page table is changed and invalidation is not issued then the implementation is free to ignore the changes. In this way the SW mode is the same as a HW mode with an infinite cache. The collapsed shadow page table is really just a cache. Jason
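The "infinite cache" semantics above can be demonstrated with a toy model. This is purely illustrative (all names invented): lookups go through a kernel shadow table, guest page-table changes are only picked up after an explicit invalidate, and until then the implementation is free to serve stale entries — exactly like a HW IOTLB with unbounded capacity:

```c
#include <assert.h>

#define NPTE 4

static int guest_pt[NPTE];	/* guest-owned page table entries        */
static int shadow[NPTE];	/* kernel shadow ("infinite cache")      */
static int valid[NPTE];	/* per-entry shadow-valid flag           */

/* Lookup through the shadow; only walk the guest table on a miss */
static int translate(int idx)
{
	if (!valid[idx]) {
		shadow[idx] = guest_pt[idx];
		valid[idx] = 1;
	}
	return shadow[idx];
}

/* The invalidate uAPI call: drop the cached entry */
static void invalidate(int idx)
{
	valid[idx] = 0;
}
```

A guest that rewrites a PTE without issuing invalidate keeps getting the old translation, which is why skipping invalidation is permitted to "not work" rather than being a kernel bug.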
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote: > > I agree with Jean-Philippe - at the very least erasing this > > information needs a major rationale - but I don't really see why it > > must be erased? The HW reports the originating device, is it just a > > matter of labeling the devices attached to the /dev/ioasid FD so it > > can be reported to userspace? > > HW reports the originating device as far as it knows. In many cases > where you have multiple devices in an IOMMU group, it's because > although they're treated as separate devices at the kernel level, they > have the same RID at the HW level. Which means a RID for something in > the right group is the closest you can count on supplying. Granted there may be cases where exact fidelity is not possible, but that doesn't excuse eliminating fidelity where it does exist.. > > If there are no hypervisor traps (does this exist?) then there is no > > way to involve the hypervisor here and the child IOASID should simply > > be a pointer to the guest's data structure that describes binding. In > > this case that IOASID should claim all PASIDs when bound to a > > RID. > > And in that case I think we should call that object something other > than an IOASID, since it represents multiple address spaces. Maybe.. It is certainly a special case. We can still consider it a single "address space" from the IOMMU perspective. What has happened is that the address table is not just a 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA". If we are already going in the direction of having the IOASID specify the page table format and other details, specifying that the page table format is the 80 bit "PASID, IOVA" format is a fairly small step. I wouldn't twist things into knots to create a difference, but if it is easy to do it wouldn't hurt either. Jason
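The "extended ~80 bit IOVA" idea can be pictured as addressing the IOASID with a (PASID, IOVA) pair. PCIe PASIDs are 20 bits wide, so together with a 64-bit IOVA that gives roughly the 80+ bits Jason mentions. The struct and helper below are invented for illustration only, not proposed uAPI:

```c
#include <assert.h>
#include <stdint.h>

#define PASID_BITS 20				/* PCIe PASID width */
#define PASID_MASK ((1u << PASID_BITS) - 1)

/* The "address" presented to such an IOASID is the pair, not a u64 */
struct ext_iova {
	uint32_t pasid;	/* 20-bit process address space ID       */
	uint64_t iova;	/* 64-bit per-PASID I/O virtual address  */
};

static struct ext_iova make_ext_iova(uint32_t pasid, uint64_t iova)
{
	struct ext_iova e = { pasid & PASID_MASK, iova };
	return e;
}
```

Declaring the page table format to be this pair-keyed format is then just another entry in the set of formats an IOASID can advertise.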
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote: > I don't think presence or absence of a group fd makes a lot of > difference to this design. Having a group fd just means we attach > groups to the ioasid instead of individual devices, and we no longer > need the bookkeeping of "partial" devices. Oh, I think we really don't want to attach the group to an ioasid, or at least not as a first-class idea. The fundamental problem that got us here is we now live in a world where there are many ways to attach a device to an IOASID: - A RID binding - A RID,PASID binding - A RID,PASID binding for ENQCMD - A SW TABLE binding - etc The selection of which mode to use is based on the specific driver/device operation. Ie the thing that implements the 'struct vfio_device' is the thing that has to select the binding mode. group attachment was fine when there was only one mode. As you say it is fine to just attach every group member with RID binding if RID binding is the only option. When SW TABLE binding was added the group code was hacked up - now the group logic is choosing between RID/SW TABLE in a very hacky and mdev specific way, and this is just a mess. The flow must carry the IOASID from the /dev/iommu to the vfio_device driver and the vfio_device implementation must choose which binding mode and parameters it wants based on driver and HW configuration. eg if two PCI devices are in a group then it is perfectly fine that one device uses RID binding and the other device uses RID,PASID binding. The only place I see for a "group bind" in the uAPI is some compat layer for the vfio container, and the implementation would be quite different, we'd have to call each vfio_device driver in the group and execute the IOASID attach IOCTL. > > I would say no on the container. /dev/ioasid == the container, having > > two competing objects at once in a single process is just a mess. > > Right. 
I'd assume that for compatibility, creating a container would > create a single IOASID under the hood with a compatibility layer > translating the container operations to ioasid operations. It is a nice dream for sure. /dev/vfio could be a special case of /dev/ioasid just with a different uapi and ending up with only one IOASID. They could be interchangeable from then on, which would simplify the internals of VFIO if it consistently dealt with these new ioasid objects everywhere. But last I looked it was complicated enough to best be done later on Jason
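Jason's earlier list of binding modes (RID, RID+PASID, RID+PASID for ENQCMD, SW TABLE) can be modeled as a small selection function. All names here are invented for this discussion — it only illustrates the structural point that the vfio_device implementation, not the group, picks the mode, so two devices in one group may legitimately pick differently:

```c
#include <assert.h>

/* Invented names: the ways a device can attach to an IOASID */
enum ioasid_bind_mode {
	BIND_RID,		/* whole-device (RID) binding            */
	BIND_RID_PASID,		/* RID + PASID binding                   */
	BIND_RID_PASID_ENQCMD,	/* PASID supplied via ENQCMD             */
	BIND_SW_TABLE,		/* mdev-style software table binding     */
};

struct bind_caps {
	int has_pasid;
	int is_mdev;
	int uses_enqcmd;
};

/* Each vfio_device driver chooses based on its own HW configuration */
static enum ioasid_bind_mode pick_bind_mode(struct bind_caps c)
{
	if (c.is_mdev)
		return BIND_SW_TABLE;
	if (c.has_pasid)
		return c.uses_enqcmd ? BIND_RID_PASID_ENQCMD
				     : BIND_RID_PASID;
	return BIND_RID;
}
```

A "group bind" compat layer would then just loop over the group members calling each device's own attach path with whatever mode that device selected.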
[RESEND PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling
From: Rob Clark (Resend, first attempt seems to not have entirely shown up in patchwork and had a random already merged patch tagging along because 00*patch picks up things I forgot to delete) This picks up an earlier series[1] from Jordan, and adds additional support needed to generate GPU devcore dumps on iova faults. Original description: This is a stack to add an Adreno GPU specific handler for pagefaults. The first patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds a adreno-smmu-priv function hook to capture a handful of important debugging registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the third patch to print more detailed information on page fault such as the TTBR0 for the pagetable that caused the fault and the source of the fault as determined by a combination of the FSYNR1 register and an internal GPU register. This code provides a solid base that we can expand on later for even more extensive GPU side page fault debugging capabilities. 
v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver resume translation after it has had a chance to snapshot the GPUs state v3: Always clear FSR even if the target driver is going to handle resume v2: Fix comment wording and function pointer check per Rob Clark [1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcro...@codeaurora.org/ Jordan Crouse (3): iommu/arm-smmu: Add support for driver IOMMU fault handlers iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info drm/msm: Improve the a6xx page fault handler Rob Clark (3): iommu/arm-smmu-qcom: Add stall support drm/msm: Add crashdump support for stalled SMMU drm/msm: devcoredump iommu fault support drivers/gpu/drm/msm/adreno/a2xx_gpu.c | 2 +- drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 2 +- drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 2 +- drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 9 +- drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 101 +++- drivers/gpu/drm/msm/adreno/a6xx_gpu.h | 2 +- drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 43 +++-- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 +++ drivers/gpu/drm/msm/msm_debugfs.c | 2 +- drivers/gpu/drm/msm/msm_gem.h | 1 + drivers/gpu/drm/msm/msm_gem_submit.c| 1 + drivers/gpu/drm/msm/msm_gpu.c | 55 ++- drivers/gpu/drm/msm/msm_gpu.h | 19 +++- drivers/gpu/drm/msm/msm_gpummu.c| 5 + drivers/gpu/drm/msm/msm_iommu.c | 22 - drivers/gpu/drm/msm/msm_mmu.h | 5 +- drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 50 ++ drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +- drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 + include/linux/adreno-smmu-priv.h| 38 +++- 20 files changed, 354 insertions(+), 31 deletions(-) -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RESEND PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers
From: Jordan Crouse Call report_iommu_fault() to allow upper-level drivers to register their own fault handlers. Signed-off-by: Jordan Crouse Signed-off-by: Rob Clark --- drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 6f72c4d208ca..b4b32d31fc06 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct arm_smmu_device *smmu = smmu_domain->smmu; int idx = smmu_domain->cfg.cbndx; + int ret; fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR); if (!(fsr & ARM_SMMU_FSR_FAULT)) @@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR); cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx)); - dev_err_ratelimited(smmu->dev, - "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n", + ret = report_iommu_fault(domain, NULL, iova, + fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); + + if (ret == -ENOSYS) + dev_err_ratelimited(smmu->dev, + "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n", fsr, iova, fsynr, cbfrsynra, idx); arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr); -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RESEND PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
From: Jordan Crouse Add a callback in adreno-smmu-priv to read interesting SMMU registers to provide an opportunity for a richer debug experience in the GPU driver. Signed-off-by: Jordan Crouse Signed-off-by: Rob Clark --- drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 ++ include/linux/adreno-smmu-priv.h | 31 +- 3 files changed, 49 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c index 98b3a1c2a181..b2e31ea84128 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c @@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx, arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg); } +static void qcom_adreno_smmu_get_fault_info(const void *cookie, + struct adreno_smmu_fault_info *info) +{ + struct arm_smmu_domain *smmu_domain = (void *)cookie; + struct arm_smmu_cfg *cfg = &smmu_domain->cfg; + struct arm_smmu_device *smmu = smmu_domain->smmu; + + info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR); + info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0); + info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1); + info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR); + info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx)); + info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0); + info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR); +} + #define QCOM_ADRENO_SMMU_GPU_SID 0 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev) @@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain, priv->cookie = smmu_domain; priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg; priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg; + priv->get_fault_info = qcom_adreno_smmu_get_fault_info; return 0; } diff --git 
a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h index c31a59d35c64..84c21c4b0691 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h @@ -224,6 +224,8 @@ enum arm_smmu_cbar_type { #define ARM_SMMU_CB_FSYNR0 0x68 #define ARM_SMMU_FSYNR0_WNRBIT(4) +#define ARM_SMMU_CB_FSYNR1 0x6c + #define ARM_SMMU_CB_S1_TLBIVA 0x600 #define ARM_SMMU_CB_S1_TLBIASID0x610 #define ARM_SMMU_CB_S1_TLBIVAL 0x620 diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h index a889f28afb42..53fe32fb9214 100644 --- a/include/linux/adreno-smmu-priv.h +++ b/include/linux/adreno-smmu-priv.h @@ -8,6 +8,32 @@ #include +/** + * struct adreno_smmu_fault_info - container for key fault information + * + * @far: The faulting IOVA from ARM_SMMU_CB_FAR + * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0 + * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR + * @fsr: The fault status from ARM_SMMU_CB_FSR + * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0 + * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR0 + * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx) + * + * This struct passes back key page fault information to the GPU driver + * through the get_fault_info function pointer. + * The GPU driver can use this information to print informative + * log messages and provide deeper GPU specific insight into the fault. + */ +struct adreno_smmu_fault_info { + u64 far; + u64 ttbr0; + u32 contextidr; + u32 fsr; + u32 fsynr0; + u32 fsynr1; + u32 cbfrsynra; +}; + /** * struct adreno_smmu_priv - private interface between adreno-smmu and GPU * @@ -17,6 +43,8 @@ * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank. 
A * NULL config disables TTBR0 translation, otherwise * TTBR0 translation is enabled with the specified cfg + * @get_fault_info: Called by the GPU fault handler to get information about + * the fault * * The GPU driver (drm/msm) and adreno-smmu work together for controlling * the GPU's SMMU instance. This is by necessity, as the GPU is directly @@ -31,6 +59,7 @@ struct adreno_smmu_priv { const void *cookie; const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie); int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg); +void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info); }; -#endif /* __ADRENO_SMMU_PRIV_H */ \ No newline at end of file +#endif /* __ADRENO_SMMU_PRIV_H */ -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org htt
[RESEND PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support
From: Rob Clark Add, via the adreno-smmu-priv interface, a way for the GPU to request the SMMU to stall translation on faults, and then later resume the translation, either retrying or terminating the current translation. This will be used on the GPU side to "freeze" the GPU while we snapshot useful state for devcoredump. Signed-off-by: Rob Clark --- drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++ include/linux/adreno-smmu-priv.h | 7 + 2 files changed, 40 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c index b2e31ea84128..61fc645c1325 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c @@ -13,6 +13,7 @@ struct qcom_smmu { struct arm_smmu_device smmu; bool bypass_quirk; u8 bypass_cbndx; + u32 stall_enabled; }; static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu) @@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu) static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx, u32 reg) { + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu); + /* * On the GPU device we want to process subsequent transactions after a * fault to keep the GPU from hanging */ reg |= ARM_SMMU_SCTLR_HUPCF; + if (qsmmu->stall_enabled & BIT(idx)) + reg |= ARM_SMMU_SCTLR_CFCFG; + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg); } @@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie, info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR); } +static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled) +{ + struct arm_smmu_domain *smmu_domain = (void *)cookie; + struct arm_smmu_cfg *cfg = &smmu_domain->cfg; + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu); + + if (enabled) + qsmmu->stall_enabled |= BIT(cfg->cbndx); + else + qsmmu->stall_enabled &= ~BIT(cfg->cbndx); +} + +static void qcom_adreno_smmu_resume_translation(const void 
*cookie, bool terminate) +{ + struct arm_smmu_domain *smmu_domain = (void *)cookie; + struct arm_smmu_cfg *cfg = &smmu_domain->cfg; + struct arm_smmu_device *smmu = smmu_domain->smmu; + u32 reg = 0; + + if (terminate) + reg |= ARM_SMMU_RESUME_TERMINATE; + + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg); +} + #define QCOM_ADRENO_SMMU_GPU_SID 0 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev) @@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain, priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg; priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg; priv->get_fault_info = qcom_adreno_smmu_get_fault_info; + priv->set_stall = qcom_adreno_smmu_set_stall; + priv->resume_translation = qcom_adreno_smmu_resume_translation; return 0; } diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h index 53fe32fb9214..c637e0997f6d 100644 --- a/include/linux/adreno-smmu-priv.h +++ b/include/linux/adreno-smmu-priv.h @@ -45,6 +45,11 @@ struct adreno_smmu_fault_info { * TTBR0 translation is enabled with the specified cfg * @get_fault_info: Called by the GPU fault handler to get information about * the fault + * @set_stall: Configure whether stall on fault (CFCFG) is enabled. Call + * before set_ttbr0_cfg(). If stalling on fault is enabled, + * the GPU driver must call resume_translation() + * @resume_translation: Resume translation after a fault + * * * The GPU driver (drm/msm) and adreno-smmu work together for controlling * the GPU's SMMU instance. 
This is by necessity, as the GPU is directly @@ -60,6 +65,8 @@ struct adreno_smmu_priv { const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie); int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg); void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info); +void (*set_stall)(const void *cookie, bool enabled); +void (*resume_translation)(const void *cookie, bool terminate); }; #endif /* __ADRENO_SMMU_PRIV_H */ -- 2.31.1
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote: > > > /* Bind guest I/O page table */ > > > bind_data = { > > > .ioasid = gva_ioasid; > > > .addr = gva_pgtable1; > > > // and format information > > > }; > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); > > > > Again I do wonder if this should just be part of alloc_ioasid. Is > > there any reason to split these things? The only advantage to the > > split is the device is known, but the device shouldn't impact > > anything.. > > I'm pretty sure the device(s) could matter, although they probably > won't usually. It is a bit subtle, but the /dev/iommu fd itself is connected to the devices first. This prevents wildly incompatible devices from being joined together, and allows some "get info" to report the capability union of all devices if we want to do that. The original concept was that devices joined would all have to support the same IOASID format, at least for the kernel owned map/unmap IOASID type. Supporting different page table formats maybe is reason to revisit that concept. There is a small advantage to re-using the IOASID container because of the get_user_pages caching and pinned accounting management at the FD level. I don't know if that small advantage is worth the extra complexity though. > But it would certainly be possible for a system to have two > different host bridges with two different IOMMUs with different > pagetable formats. Until you know which devices (and therefore > which host bridge) you're talking about, you don't know what formats > of pagetable to accept. And if you have devices from *both* bridges > you can't bind a page table at all - you could theoretically support > a kernel managed pagetable by mirroring each MAP and UNMAP to tables > in both formats, but it would be pretty reasonable not to support > that. 
The basic process for a user space owned pgtable mode would be: 1) qemu has to figure out what format of pgtable to use. Presumably it uses query functions using the device label. The kernel code should look at the entire device path through all the IOMMU HW to determine what is possible. Or it already knows because the VM's vIOMMU is running in some fixed page table format, or the VM's vIOMMU already told it, or something. 2) qemu creates an IOASID and, based on #1, says 'I want this format' 3) qemu binds the IOASID to the device. If qemu gets it wrong then it just fails. 4) For the next device qemu would have to figure out if it can re-use an existing IOASID based on the required properties. You pointed to the case of mixing vIOMMU's of different platforms. So it is completely reasonable for qemu to ask for an "ARM 64 bit IOMMU page table mode v2" while running on an x86 because that is what the vIOMMU is wired to work with. Presumably qemu will fall back to software emulation if this is not possible. One interesting option for software emulation is to just transform the ARM page table format to an x86 page table format in userspace and use nested bind/invalidate to synchronize with the kernel. With SW nesting I suspect this would be much faster. Jason
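The four-step flow above can be sketched as a standalone toy. Everything here is stubbed and every name is invented — the real flow would be ioctls on /dev/iommu and the device fd per this RFC — but it captures the control flow: query the supported format for a device path, allocate an IOASID in that format, and let bind simply fail on a mismatch:

```c
#include <assert.h>

/* Invented format tags */
#define FMT_ARM64_V2	1
#define FMT_X86_VTD	2

/* Stub for step 1: the kernel reports what the whole device path
 * through the IOMMU HW can support, keyed by a device label. */
static int query_pgtable_format(int device_label)
{
	return device_label == 0 ? FMT_ARM64_V2 : FMT_X86_VTD;
}

/* Stub for step 2: allocation records the requested format; the
 * returned handle here is just the format itself for simplicity. */
static int ioasid_alloc(int fmt)
{
	return fmt;
}

/* Stub for steps 3/4: bind fails (-1) if the device path cannot do
 * the IOASID's format, otherwise succeeds; re-use is the same check. */
static int ioasid_bind_device(int ioasid, int device_label)
{
	return query_pgtable_format(device_label) == ioasid ? 0 : -1;
}
```

Step 4 (re-use) falls out naturally: qemu tries the bind against an existing IOASID and allocates a new one only if that fails.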
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, 2 Jun 2021 13:01:40 -0300 Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 02:20:15AM +, Tian, Kevin wrote: > > > From: Alex Williamson > > > Sent: Wednesday, June 2, 2021 6:22 AM > > > > > > On Tue, 1 Jun 2021 07:01:57 + > > > "Tian, Kevin" wrote: > > > > > > > > I summarized five opens here, about: > > > > > > > > 1) Finalizing the name to replace /dev/ioasid; > > > > 2) Whether one device is allowed to bind to multiple IOASID fd's; > > > > 3) Carry device information in invalidation/fault reporting uAPI; > > > > 4) What should/could be specified when allocating an IOASID; > > > > 5) The protocol between vfio group and kvm; > > > > > > > ... > > > > > > > > For 5), I'd expect Alex to chime in. Per my understanding looks the > > > > original purpose of this protocol is not about I/O address space. It's > > > > for KVM to know whether any device is assigned to this VM and then > > > > do something special (e.g. posted interrupt, EPT cache attribute, > > > > etc.). > > > > > > Right, the original use case was for KVM to determine whether it needs > > > to emulate invlpg, so it needs to be aware when an assigned device is > > > > invlpg -> wbinvd :) Oops, of course. > > > present and be able to test if DMA for that device is cache > > > coherent. > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > emulate wbinvd' flag from qemu? IIRC, wbinvd has host implications, a malicious user could tell KVM to emulate wbinvd then run the op in a loop and induce a disproportionate load on the system. We therefore wanted a way that it would only be enabled when required. > I briefly didn't see any obvios linkage in the arch code, just some > dead code: > > $ git grep iommu_noncoherent > arch/x86/include/asm/kvm_host.h: bool iommu_noncoherent; > $ git grep iommu_domain arch/x86 > arch/x86/include/asm/kvm_host.h:struct iommu_domain *iommu_domain; > > Huh? Cruft from legacy KVM device assignment, I assume. 
What you're looking for is: kvm_vfio_update_coherency kvm_arch_register_noncoherent_dma atomic_inc(&kvm->arch.noncoherent_dma_count); need_emulate_wbinvd kvm_arch_has_noncoherent_dma atomic_read(&kvm->arch.noncoherent_dma_count); There are a couple other callers that I'm not as familiar with. > It kind of looks like the other main point is to generate the > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to > connect back to the kvm data > > But that seems like it would have been better handled with some IOCTL > on the vfio_device fd to import the KVM to the driver not this > roundabout way? Then QEMU would need to know which drivers require KVM knowledge? This allowed transparent backwards compatibility with userspace. Thanks, Alex
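The call chain Alex lists boils down to a reference count gating wbinvd emulation: KVM only honors the (potentially expensive) instruction while at least one non-coherent assigned device is registered. A simplified model of that gating, with the atomics elided for brevity (the real code uses an atomic_t in `kvm->arch.noncoherent_dma_count`):

```c
#include <assert.h>

/* Simplified stand-in for kvm->arch.noncoherent_dma_count */
static int noncoherent_dma_count;

/* ~ kvm_arch_register_noncoherent_dma() */
static void register_noncoherent_dma(void)
{
	noncoherent_dma_count++;
}

/* ~ kvm_arch_unregister_noncoherent_dma() */
static void unregister_noncoherent_dma(void)
{
	noncoherent_dma_count--;
}

/* ~ need_emulate_wbinvd() / kvm_arch_has_noncoherent_dma(): emulate
 * wbinvd only while a non-coherent device is actually present, which
 * limits the host-load attack Alex describes to guests that need it. */
static int need_emulate_wbinvd(void)
{
	return noncoherent_dma_count > 0;
}
```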
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote: > Is there a compelling reason to have all the IOASIDs handled by one > FD? There was an answer on this: if every PASID needs an IOASID then there are too many FDs. It is difficult to share the get_user_pages cache across FDs. There are global properties in the /dev/iommu FD, like what devices are part of it, that are important for group security operations. This becomes confused if it is split to many FDs. > > I/O address space can be managed through two protocols, according to > > whether the corresponding I/O page table is constructed by the kernel or > > the user. When kernel-managed, a dma mapping protocol (similar to > > existing VFIO iommu type1) is provided for the user to explicitly specify > > how the I/O address space is mapped. Otherwise, a different protocol is > > provided for the user to bind an user-managed I/O page table to the > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault > > handling. > > > > Pgtable binding protocol can be used only on the child IOASID's, implying > > IOASID nesting must be enabled. This is because the kernel doesn't trust > > userspace. Nesting allows the kernel to enforce its DMA isolation policy > > through the parent IOASID. > > To clarify, I'm guessing that's a restriction of likely practice, > rather than a fundamental API restriction. I can see a couple of > theoretical future cases where a user-managed pagetable for a "base" > IOASID would be feasible: > > 1) On some fancy future MMU allowing free nesting, where the kernel > would insert an implicit extra layer translating user addresses > to physical addresses, and the userspace manages a pagetable with > its own VAs being the target AS I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one where the IOVA == process VA and the kernel maintains this mapping. 
Since the uAPI is so general I do have a general expectation that the drivers/iommu implementations might need to be a bit more complicated, like if the HW can optimize certain specific graphs of IOASIDs we would still model them as graphs and the HW driver would have to "compile" the graph into the optimal hardware. This approach has worked reasonably well in other kernel areas. > 2) For a purely software virtual device, where its virtual DMA > engine can interpret user addresses fine This also sounds like an SVA IOASID. Depending on HW, if a device can really only bind to a very narrow kind of IOASID then it should ask for that (probably platform specific!) type during its attachment request to drivers/iommu. eg "I am special hardware and only know how to do PLATFORM_BLAH transactions, give me an IOASID compatible with that". If the only way to create "PLATFORM_BLAH" is with an SVA IOASID because BLAH is hardwired to the CPU ASID then that is just how it is. > I wonder if there's a way to model this using a nested AS rather than
> requiring special operations. e.g.
>
> 'prereg' IOAS
>         |
>         \- 'rid' IOAS
>                |
>                \- 'pasid' IOAS (maybe)
>
> 'prereg' would have a kernel managed pagetable into which (for > example) qemu platform code would map all guest memory (using > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's > IO mappings into the 'rid' IOAS in terms of GPA. > > This wouldn't quite work as is, because the 'prereg' IOAS would have > no devices. But we could potentially have another call to mark an > IOAS as a purely "preregistration" or pure virtual IOAS. Using that > would be an alternative to attaching devices. It is one option for sure, this is where I was thinking when we were talking in the other thread. I think the decision is best implementation driven as the datastructure to store the preregistration data should be rather purpose built. > > /* > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > * > > * Provide VFIO type1 equivalent semantics. Start with the same > > * restriction e.g. the unmap size should match those used in the > > * original mapping call. > > * > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr > > * must be already in the preregistered list. > > * > > * Input parameters: > > * - u32 ioasid; > > * - refer to vfio_iommu_type1_dma_{un}map > > * > > * Return: 0 on success, -errno on failure. > > */
> > #define IOASID_MAP_DMA   _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> > I'm assuming these would be expected to fail if a user managed > pagetable has been bound? Me too, or an SVA page table. This document would do well to have a list of imagined page table types and the set of operations that act on them. I think they are all pretty disjoint.. Your presentation of 'kernel owns the table' vs 'userspace owns the table' is a useful clarification to call out too > > 5. Use Cases and Flows > > > >
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote: > > On 2021/6/2 1:31 AM, Jason Gunthorpe wrote: > > On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote: > > > We can open up to ~0U file descriptors, I don't see why we need to > > > restrict > > > it in uAPI. > > There are significant problems with such large file descriptor > > tables. High FD numbers mean things like select don't work at all > > anymore and IIRC there are more complications. > > > I don't see how much difference for IOASID and other type of fds. People can > choose to use poll or epoll. Not really, once one thing in an application uses a large number of FDs the entire application is affected. If any open() can return 'very big number' then nothing in the process is allowed to ever use select. It is not a trivial thing to ask for. > And with the current proposal, (assuming there's an N:1 ioasid to ioasid). I > wonder how select can work for the specific ioasid. pagefault events are one thing that comes to mind. Bundling them all together into a single ring buffer is going to be necessary. Multifds just complicate this too Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 10:56:48AM +0200, Enrico Weigelt, metux IT consult wrote: > If I understand this correctly, /dev/ioasid is a kind of "common supplier" > to other APIs / devices. Why can't the fd be acquired by the > consumer APIs (eg. kvm, vfio, etc) ? /dev/ioasid would be similar to /dev/vfio, and everything already deals with exposing /dev/vfio and /dev/vfio/N together. I don't see it as a problem, just more work. Having FDs spawn other FDs is pretty ugly; it defeats the "everything is a file" model of UNIX. Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote: > > > > present and be able to test if DMA for that device is cache > > > > coherent. > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > > emulate wbinvd' flag from qemu? > > IIRC, wbinvd has host implications, a malicious user could tell KVM to > emulate wbinvd then run the op in a loop and induce a disproportionate > load on the system. We therefore wanted a way that it would only be > enabled when required. I think the non-coherentness is vfio_device specific? eg a specific device will decide if it is coherent or not? If yes I'd recast this to call kvm_arch_register_noncoherent_dma() from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device implementation and not link it through the IOMMU. If userspace is telling the vfio_device to be non-coherent or not then it can call kvm_arch_register_noncoherent_dma() or not based on that signal. > > It kind of looks like the other main point is to generate the > > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to > > connect back to the kvm data > > > > But that seems like it would have been better handled with some IOCTL > > on the vfio_device fd to import the KVM to the driver not this > > roundabout way? > > Then QEMU would need to know which drivers require KVM knowledge? This > allowed transparent backwards compatibility with userspace. Thanks, I'd just blindly fire a generic 'hey here is your KVM FD' into every VFIO device. The backwards compat angle is reasonable enough though. So those two don't sound so bad, don't know about PPC, but David seems optimistic. A basic idea is to remove the iommu stuff from the kvm connection so that the scope of the iommu related rework is contained to vfio Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, 2 Jun 2021 14:35:10 -0300 Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote: > > > > > > present and be able to test if DMA for that device is cache > > > > > coherent. > > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > > > emulate wbinvd' flag from qemu? > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to > > emulate wbinvd then run the op in a loop and induce a disproportionate > > load on the system. We therefore wanted a way that it would only be > > enabled when required. > > I think the non-coherentness is vfio_device specific? eg a specific > device will decide if it is coherent or not? No, this is specifically whether DMA is cache coherent to the processor, ie. in the case of wbinvd whether the processor needs to invalidate its cache in order to see data from DMA. > If yes I'd recast this to call kvm_arch_register_noncoherent_dma() > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device > implementation and not link it through the IOMMU. The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to IOMMU_CAP_CACHE_COHERENCY for all domains within a container. > If userspace is telling the vfio_device to be non-coherent or not then > it can call kvm_arch_register_noncoherent_dma() or not based on that > signal. Not non-coherent device memory, that would be a driver issue, cache coherence of DMA is what we're after. > > > It kind of looks like the other main point is to generate the > > > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to > > > connect back to the kvm data > > > > > > But that seems like it would have been better handled with some IOCTL > > > on the vfio_device fd to import the KVM to the driver not this > > > roundabout way? > > > > Then QEMU would need to know which drivers require KVM knowledge? This > > allowed transparent backwards compatibility with userspace. 
Thanks, > > I'd just blindly fire a generic 'hey here is your KVM FD' into every > VFIO device. Yes, QEMU could do this, but the vfio-kvm device was already there with this association and required no uAPI work. This one is the least IOMMU related of the use cases. Thanks, Alex
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote: > On Wed, 2 Jun 2021 14:35:10 -0300 > Jason Gunthorpe wrote: > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote: > > > > > > > > present and be able to test if DMA for that device is cache > > > > > > coherent. > > > > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > > > > emulate wbinvd' flag from qemu? > > > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to > > > emulate wbinvd then run the op in a loop and induce a disproportionate > > > load on the system. We therefore wanted a way that it would only be > > > enabled when required. > > > > I think the non-coherentness is vfio_device specific? eg a specific > > device will decide if it is coherent or not? > > No, this is specifically whether DMA is cache coherent to the > processor, ie. in the case of wbinvd whether the processor needs to > invalidate its cache in order to see data from DMA. I'm confused. This is x86, all DMA is cache coherent unless the device is doing something special. > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma() > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device > > implementation and not link it through the IOMMU. > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to > IOMMU_CAP_CACHE_COHERENCY for all domains within a container. And this special IOMMU mode is basically requested by the device driver, right? Because if you use this mode you have to also use special programming techniques. This smells like all the "snoop bypass" stuff from PCIE (for GPUs even) in a different guise - it is device triggered, not platform triggered behavior. Jason
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, 2 Jun 2021 15:09:25 -0300 Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote: > > On Wed, 2 Jun 2021 14:35:10 -0300 > > Jason Gunthorpe wrote: > > > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote: > > > > > > > > > > present and be able to test if DMA for that device is cache > > > > > > > coherent. > > > > > > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > > > > > emulate wbinvd' flag from qemu? > > > > > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to > > > > emulate wbinvd then run the op in a loop and induce a disproportionate > > > > load on the system. We therefore wanted a way that it would only be > > > > enabled when required. > > > > > > I think the non-coherentness is vfio_device specific? eg a specific > > > device will decide if it is coherent or not? > > > > No, this is specifically whether DMA is cache coherent to the > > processor, ie. in the case of wbinvd whether the processor needs to > > invalidate its cache in order to see data from DMA. > > I'm confused. This is x86, all DMA is cache coherent unless the device > is doing something special. > > > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma() > > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device > > > implementation and not link it through the IOMMU. > > > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to > > IOMMU_CAP_CACHE_COHERENCY for all domains within a container. > > And this special IOMMU mode is basically requested by the device > driver, right? Because if you use this mode you have to also use > special programming techniques. > > This smells like all the "snoop bypass" stuff from PCIE (for GPUs > even) in a different guise - it is device triggered, not platform > triggered behavior. 
Right, the device can generate the no-snoop transactions, but it's the IOMMU that essentially determines whether those transactions are actually still cache coherent, AIUI. I did experiment with virtually hardwiring the Enable No-Snoop bit in the Device Control Register to zero, which would be generically allowed by the PCIe spec, but then we get into subtle dependencies in the device drivers and clearing the bit again after any sort of reset and the backdoor accesses to config space which exist mostly in the class of devices that might use no-snoop transactions (yes, GPUs suck). It was much easier and more robust to ignore the device setting and rely on the IOMMU behavior. Yes, maybe we sometimes emulate wbinvd for VMs where the device doesn't support no-snoop, but it seemed like platforms were headed in this direction where no-snoop was ignored anyway. Thanks, Alex
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote: > On Wed, 2 Jun 2021 15:09:25 -0300 > Jason Gunthorpe wrote: > > > On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote: > > > On Wed, 2 Jun 2021 14:35:10 -0300 > > > Jason Gunthorpe wrote: > > > > > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote: > > > > > > > > > > > > present and be able to test if DMA for that device is cache > > > > > > > > coherent. > > > > > > > > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm > > > > > > emulate wbinvd' flag from qemu? > > > > > > > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to > > > > > emulate wbinvd then run the op in a loop and induce a disproportionate > > > > > load on the system. We therefore wanted a way that it would only be > > > > > enabled when required. > > > > > > > > I think the non-coherentness is vfio_device specific? eg a specific > > > > device will decide if it is coherent or not? > > > > > > No, this is specifically whether DMA is cache coherent to the > > > processor, ie. in the case of wbinvd whether the processor needs to > > > invalidate its cache in order to see data from DMA. > > > > I'm confused. This is x86, all DMA is cache coherent unless the device > > is doing something special. > > > > > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma() > > > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device > > > > implementation and not link it through the IOMMU. > > > > > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to > > > IOMMU_CAP_CACHE_COHERENCY for all domains within a container. > > > > And this special IOMMU mode is basically requested by the device > > driver, right? Because if you use this mode you have to also use > > special programming techniques. 
> > > > This smells like all the "snoop bypass" stuff from PCIE (for GPUs > > even) in a different guise - it is device triggered, not platform > > triggered behavior. > > Right, the device can generate the no-snoop transactions, but it's the > IOMMU that essentially determines whether those transactions are > actually still cache coherent, AIUI. Wow, this is really confusing stuff in the code. At the PCI level there is a TLP bit called no-snoop that is platform specific. The general intention is to allow devices to selectively bypass the CPU caching for DMAs. GPUs like to use this feature for performance. I assume there are some exciting security issues here. Looks like allowing cache bypass does something bad inside VMs? Looks like allowing the VM to use the cache clear instruction that is mandatory with cache bypass DMA causes some QOS issues? OK. So how does it work? What I see in the intel/iommu.c is that some domains support "snoop control" or not, based on some HW flag. This indicates if the DMA_PTE_SNP bit is supported on a page by page basis or not. Since x86 always leans toward "DMA cache coherent" I'm reading some tea leaves here: IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA transactions */ And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the snoop bit in TLPs for IOVAs that have DMA_PTE_SNP set? Further, I guess IOMMUs that don't support PTE_SNP, or have DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat and all) So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA incoherent with the CPU caches, it just means that the snoop bit in the TLP cannot be enforced. ie the device *could* do no-snoop DMA if it wants. Devices that never do no-snoop remain DMA coherent on x86, as they always have been. IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means the PCI device is blocked from using no-snoop in its TLPs. I wonder if ARM implemented this consistently? I see VDPA is confused..
I was confused. What a terrible set of names. In VFIO generic code I see it always sets IOMMU_CACHE:

	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		domain->prot |= IOMMU_CACHE;

And thus also always provides IOMMU_CACHE to iommu_map:

	ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
			npage << PAGE_SHIFT, prot | d->prot);

So when the IOMMU supports the no-snoop blocking security feature VFIO turns it on and blocks no-snoop to all pages? Ok.. But I must be missing something big because *something* in the IOVA map should work with no-snoopable DMA, right? Otherwise what is the point of exposing the invalidate instruction to the guest? I would think userspace should be relaying the DMA_PTE_SNP bit from the guest's page tables up to here?? The KVM hookup is driven by IOMMU_CACHE which is driven by IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only if the IOMMU can block the SNP bit? And then we map all the
Re: [PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT
Hi Jean-Philippe, I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on iommu/next arm64/for-next/core linus/master v5.13-rc4 next-20210602]
[If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jean-Philippe-Brucker/Add-support-for-ACPI-VIOT/20210602-235849
base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: x86_64-randconfig-a012-20210602 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project d41cb6bb2607fa5c7a9df2b3dab361353657d225)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/937da71a81108243877fb1f0f568e56a08a62c50
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jean-Philippe-Brucker/Add-support-for-ACPI-VIOT/20210602-235849
        git checkout 937da71a81108243877fb1f0f568e56a08a62c50
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All errors (new ones prefixed by >>):

>> drivers/acpi/scan.c:1540:26: error: no member named 'ops' in 'struct iommu_fwspec'
           return fwspec ? fwspec->ops : NULL;
                                 ^
>> drivers/acpi/scan.c:1564:9: error: implicit declaration of function 'iommu_probe_device' [-Werror,-Wimplicit-function-declaration]
           err = iommu_probe_device(dev);
                 ^
   2 errors generated.
vim +1540 drivers/acpi/scan.c

  1535
  1536  static inline const struct iommu_ops *acpi_iommu_fwspec_ops(struct device *dev)
  1537  {
  1538          struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
  1539
> 1540          return fwspec ? fwspec->ops : NULL;
  1541  }
  1542
  1543  static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
  1544                                                         const u32 *id_in)
  1545  {
  1546          int err;
  1547          const struct iommu_ops *ops;
  1548
  1549          /*
  1550           * If we already translated the fwspec there is nothing left to do,
  1551           * return the iommu_ops.
  1552           */
  1553          ops = acpi_iommu_fwspec_ops(dev);
  1554          if (ops)
  1555                  return ops;
  1556
  1557          err = iort_iommu_configure_id(dev, id_in);
  1558
  1559          /*
  1560           * If we have reason to believe the IOMMU driver missed the initial
  1561           * add_device callback for dev, replay it to get things in order.
  1562           */
  1563          if (!err && dev->bus && !device_iommu_mapped(dev))
> 1564                  err = iommu_probe_device(dev);
  1565
  1566          /* Ignore all other errors apart from EPROBE_DEFER */
  1567          if (err == -EPROBE_DEFER) {
  1568                  return ERR_PTR(err);
  1569          } else if (err) {
  1570                  dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
  1571                  return NULL;
  1572          }
  1573          return acpi_iommu_fwspec_ops(dev);
  1574  }
  1575

--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH v3] iommu/dma: Fix IOVA reserve dma ranges
Hi, I just ran into the exact same issue while working on the M1 DART IOMMU driver and it was fixed by this commit. Thanks! Would be great if this could be picked up.

Tested-by: Sven Peter

Best, Sven

On Mon, Sep 14, 2020, at 09:23, Srinath Mannam via iommu wrote:
> Fix IOVA reserve failure in the case when address of first memory region
> listed in dma-ranges is equal to 0x0.
>
> Fixes: aadad097cd46f ("iommu/dma: Reserve IOVA for PCIe inaccessible
> DMA address")
> Signed-off-by: Srinath Mannam
> ---
> Changes from v2:
>    Modify error message with useful information based on Bjorn's
> comments.
>
> Changes from v1:
>    Removed unnecessary changes based on Robin's review comments.
>
> drivers/iommu/dma-iommu.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 5141d49a046b..5b9791f35c5e 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -217,9 +217,11 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
> 		lo = iova_pfn(iovad, start);
> 		hi = iova_pfn(iovad, end);
> 		reserve_iova(iovad, lo, hi);
> -	} else {
> +	} else if (end < start) {
> 		/* dma_ranges list should be sorted */
> -		dev_err(&dev->dev, "Failed to reserve IOVA\n");
> +		dev_err(&dev->dev,
> +			"Failed to reserve IOVA [%#010llx-%#010llx]\n",
> +			start, end);
> 		return -EINVAL;
> 	}
>
> --
> 2.17.1
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, 2 Jun 2021 16:54:04 -0300 Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote: > > > > Right, the device can generate the no-snoop transactions, but it's the > > IOMMU that essentially determines whether those transactions are > > actually still cache coherent, AIUI. > > Wow, this is really confusing stuff in the code. > > At the PCI level there is a TLP bit called no-snoop that is platform > specific. The general intention is to allow devices to selectively > bypass the CPU caching for DMAs. GPUs like to use this feature for > performance. Yes > I assume there are some exciting security issues here. Looks like > allowing cache bypass does something bad inside VMs? Looks like > allowing the VM to use the cache clear instruction that is mandatory > with cache bypass DMA causes some QOS issues? OK. IIRC, largely a DoS issue if userspace gets to choose when to emulate wbinvd rather than it being demanded for correct operation. > So how does it work? > > What I see in the intel/iommu.c is that some domains support "snoop > control" or not, based on some HW flag. This indicates if the > DMA_PTE_SNP bit is supported on a page by page basis or not. > > Since x86 always leans toward "DMA cache coherent" I'm reading some > tea leaves here: > > IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA > transactions */ > > And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the > snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set? That's my understanding as well. > Further, I guess IOMMUs that don't support PTE_SNP, or have > DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat > and all) Yes. > So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA > incoherent with the CPU caches, it just means that the snoop bit in > the TLP cannot be enforced. ie the device *could* do no-snoop DMA > if it wants.
Devices that never do no-snoop remain DMA coherent on > x86, as they always have been. Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device DMA to be coherent via the IOMMU. > IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means > the PCI device is blocked from using no-snoop in its TLPs. > > I wonder if ARM implemented this consistently? I see VDPA is > confused.. I was confused. What a terrible set of names. > > In VFIO generic code I see it always sets IOMMU_CACHE: > > if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY)) > domain->prot |= IOMMU_CACHE; > > And thus also always provides IOMMU_CACHE to iommu_map: > > ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << > PAGE_SHIFT, > npage << PAGE_SHIFT, prot | d->prot); > > So when the IOMMU supports the no-snoop blocking security feature VFIO > turns it on and blocks no-snoop to all pages? Ok.. Yep, I'd forgotten this nuance that we need to enable it via the mapping flags. > But I must be missing something big because *something* in the IOVA > map should work with no-snoopable DMA, right? Otherwise what is the > point of exposing the invalidate instruction to the guest? > > I would think userspace should be relaying the DMA_PTE_SNP bit from > the guest's page tables up to here?? > > The KVM hookup is driven by IOMMU_CACHE which is driven by > IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only > if the IOMMU can block the SNP bit? And then we map all the pages to > block the snoop bit? Huh? Right. I don't follow where you're jumping to relaying DMA_PTE_SNP from the guest page table... what page table? We don't necessarily have a vIOMMU to expose such things, I don't think it even existed when this was added. Essentially if we can ignore no-snoop at the IOMMU, then KVM doesn't need to worry about emulating wbinvd because of an assigned device, whether that device uses it or not. Win-win.
> Your explanation makes perfect sense: Block guests from using the > dangerous cache invalidate instruction unless a device that uses > no-snoop is plugged in. Block devices from using no-snoop because > something about it is insecure. Ok. No-snoop itself is not insecure, but to support no-snoop in a VM KVM can't ignore wbinvd, which has overhead and abuse implications. > But the conditions I'm looking for "device that uses no-snoop" is: > - The device will issue no-snoop TLPs at all We can't really know this generically. We can try to set the enable bit to see if the device is capable of no-snoop, but that doesn't mean it will use no-snoop. > - The IOMMU will let no-snoop through > - The platform will honor no-snoop > > Only if all three are met we should allow the dangerous instruction in > KVM, right? We test at the IOMMU and assume that the IOMMU knowledge encompasses whether the platform honors no-snoop (note for example how amd and arm report true for IOMMU_CAP_CACHE_COHER
RE: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
>> ... so on a PASID system, your trivial reproducer would theoretically >> fire the same way and corrupt FPU state just as well. > > This is worse and you can't selftest it because the IPI can just hit in > the middle of _any_ FPU state operation and corrupt state. That sounds like we should abandon the "IPI all the other threads to force enable the PASID for them" approach. It would just be a nightmare of papering over cracks when the IPI was delivered at some inconvenient moment when the recipient was in the middle of touching xsave state. I've told Fenghua to dig out the previous iteration of this patch where the plan was to lazily fix the PASID_MSR in other threads in the #GP handler. That algorithm is very simple and easy to check. Pseudo-code:

	#GP:
		if (usermode && current->mm->pasid && rdmsr(PASID_MSR) != valid) {
			wrmsr(PASID_MSR, current->mm->pasid | PASID_VALID);
			return;
		}

Worst case is that some thread of a multi-threaded process that is using PASID takes some unrelated #GP ... this code will try to fix it by enabling the PASID_MSR. That will just #GP a second time and this test will see the MSR is already set, so fall into the usual #GP handling code. Seems like a better direction than trying to fix the IPI method. The virtualization folks will like this way more because an IPI in a guest causes a couple of VMEXITs so is somewhat expensive. -Tony
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > from the guest page table... what page table? I see my confusion now, the phrasing in your earlier remark led me to think this was about allowing the no-snoop performance enhancement in some restricted way. It is really about blocking no-snoop 100% of the time and then disabling the dangerous wbinvd when the block is successful. Didn't closely read the kvm code :\ If it was about allowing the optimization then I'd expect the guest to enable no-snoopable regions via its vIOMMU and realize them to the hypervisor and plumb the whole thing through. Hence my remark about the guest page tables.. So really the test is just 'were we able to block it'? > This support existed before mdev, IIRC we needed it for direct > assignment of NVIDIA GPUs. Probably because they ignored the disable no-snoop bits in the control block, or reset them in some insane way to "fix" broken bioses and kept using it even though by all rights qemu would have tried hard to turn it off via the config space. Processing no-snoop without a working wbinvd would be fatal. Yeesh But Ok, back to /dev/ioasid. This answers a few lingering questions I had.. 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY domains. This doesn't actually matter. If you mix them together then kvm will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP anywhere in this VM. Thus, if two IOMMUs are joined together into a single /dev/ioasid then we can just make them both pretend to be !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. 2) How to fit this part of kvm in some new /dev/ioasid world What we want to do here is iterate over every ioasid associated with the group fd that is passed into kvm. Today the group fd has a single container which specifies the single ioasid so this is being done trivially.
To reorg we want to get the ioasid from the device, not the group (see my note to David about the group vs. device rationale). This is just iterating over each vfio_device in the group and querying the ioasid it is using.

Or perhaps more directly: an op attaching the vfio_device to the kvm, and some simple helper '(un)register ioasid with kvm (kvm, ioasid)' that the vfio_device driver can call to sort this out. It is not terrible..

Jason
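The iteration Jason sketches above can be shown as a small C routine: walk every vfio_device in the group, collect the deduplicated set of IOASIDs, and hand each one to kvm. The struct layout and the `collect_group_ioasids` name are illustrative assumptions, not real kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a vfio_device: just the IOASID it uses. */
struct vfio_device {
	int ioasid;
};

/*
 * Walk each device in the group and collect the distinct IOASIDs into
 * 'out' (at most 'max' entries).  In the real flow each collected id
 * would be passed to a "register ioasid with kvm" helper.
 */
static size_t collect_group_ioasids(const struct vfio_device *devs,
				    size_t ndev, int *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < ndev; i++) {
		bool seen = false;

		for (size_t j = 0; j < n; j++)
			if (out[j] == devs[i].ioasid)
				seen = true;
		if (!seen && n < max)
			out[n++] = devs[i].ioasid;
	}
	return n;
}
```

In today's model this loop would always yield exactly one id (the group's single container), which is why the current code is trivial.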
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > > 3) and beyond?
> >
> > Yes, absolutely. The API should be flexible enough to specify the
> > creation of all future page table formats we'd want to have and all HW
> > specific details on those formats.
>
> OK, continuing along the same line.
>
> > > If so, it seems that we need some in-kernel helpers and uAPIs to
> > > support pre-installing a page table to IOASID.
> >
> > Not sure what this means..
>
> Sorry that I didn't make this clear.
>
> Let me bring back the page table types as I see them:
>
> 1) IOMMU format page table (a.k.a. iommu_domain)
> 2) user application CPU page table (SVA for example)
> 3) KVM EPT (future option)
> 4) VM guest managed page table (nesting mode)
>
> Each type of page table should be able to be associated with its IOASID.
> We have the BIND protocol for 4); we explicitly allocate an iommu_domain for
> 1). But we don't have a clear definition for 2), 3) and others. I think
> it's necessary to clearly define a time point and kAPI name between
> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
> opportunity to associate their page table with the allocated IOASID
> before attaching the page table to the real IOMMU hardware.

In my mind these are all actions of creation:

 #1 is ALLOC_IOASID 'to be compatible with these devices attached to this FD'
 #2 is ALLOC_IOASID_SVA
 #3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
 #4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
    ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have its own set of allowed operations: map/unmap is only legal on #1, invalidate is only legal on #4, etc.

How you want to split this up in the ioctl interface is a more interesting question.
I generally like more calls than giant unwieldy multiplexer structs, but some things are naturally flags and optional modifications of a single ioctl.

In any event they should have similar naming, 'ALLOC_IOASID_XXX', and then a single 'DESTROY_IOASID' that works on all of them.

> I/O page fault handling is similar. The provider of the page table
> should take the responsibility to handle the possible page faults.

For the faultable types, yes, #3 and #4 should hook in the fault handler and deal with it.

Jason
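The per-type operation rules spelled out in the two mails above (map/unmap only on #1, invalidate only on #4, fault handling hooked for #3 and #4) can be captured in a small permission table. All enum and function names here are hypothetical sketches of the proposed uAPI, not anything that exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical IOASID allocation flavours from the discussion. */
enum ioasid_type {
	IOASID_KERNEL_MANAGED,	/* #1: kernel-owned iommu_domain */
	IOASID_SVA,		/* #2: shares a user mm's CPU page table */
	IOASID_KVM,		/* #3: KVM EPT (future option) */
	IOASID_NESTED,		/* #4: user/guest-managed page table */
};

enum ioasid_op {
	IOASID_OP_MAP_UNMAP,
	IOASID_OP_INVALIDATE,
	IOASID_OP_PAGE_FAULT,
};

/* Which operation is legal on which allocation type, per the mail. */
static bool ioasid_op_allowed(enum ioasid_type t, enum ioasid_op op)
{
	switch (op) {
	case IOASID_OP_MAP_UNMAP:
		return t == IOASID_KERNEL_MANAGED;	/* only #1 */
	case IOASID_OP_INVALIDATE:
		return t == IOASID_NESTED;		/* only #4 */
	case IOASID_OP_PAGE_FAULT:
		return t == IOASID_KVM || t == IOASID_NESTED; /* #3 and #4 */
	}
	return false;
}
```

An ioctl dispatcher for the 'ALLOC_IOASID_XXX' family could reject illegal combinations with a single check like this before touching any hardware state.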
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote:
> OK, this implies that if one user inadvertently creates intended parent/
> child via different fd's then the operation will simply fail.

Remember the number space used to refer to the ioasids inside the FD is local to that instance of the FD. Each FD should have its own xarray.

You can't actually accidentally refer to an IOASID in FD A from FD B because the xarray lookup in FD B will not return 'IOASID A'.

Jason
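The per-FD handle namespace Jason describes can be demonstrated with a toy table in place of the xarray: each /dev/ioasid FD owns its own table, so the same integer handle in two different FDs names unrelated (or no) objects. The struct and lookup function here are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IOASIDS 8

/* Toy stand-in for the per-FD xarray: one table per /dev/ioasid FD. */
struct ioasid_fd {
	void *objects[MAX_IOASIDS];	/* indexed by the user-visible handle */
};

/* Lookup searches only this FD's own namespace, nothing global. */
static void *ioasid_lookup(struct ioasid_fd *fd, unsigned int handle)
{
	if (handle >= MAX_IOASIDS)
		return NULL;
	return fd->objects[handle];
}
```

So a parent/child operation mixing handles from two FDs fails naturally: the lookup in the wrong FD simply returns nothing, and no cross-FD reference is ever possible.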
Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured
On 6/2/21 3:48 PM, John Garry wrote:

On 02/06/2021 05:37, Lu Baolu wrote:

On 6/1/21 10:29 PM, John Garry wrote:

For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance. A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range:

/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first.

Could you explain why it requires a singleton group and driver unbinding if the user only wants to increase the upper limit? I haven't dived into the details yet, sorry if this is a silly question.

Hi Baolu,

I did actually try increasing the range for a 'live' domain in the v1 series, but it turned out too messy. The first problem is reallocating the memory to hold the rcaches. The second problem is that all IOVAs in the rcache need to be a power of two, which is difficult to enforce for IOVAs which weren't being cached before, but now would be.

So now I changed it to work similarly to how we change the default domain type, i.e. don't operate on a 'live' domain.

If these hard restrictions on users are just to work around messy code in the kernel, I would rather fix that messy code to achieve a better user experience. :-)

Best regards,
baolu
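The power-of-two constraint John mentions is why rcache bucketing is simple: every cached IOVA length is rounded up so it lands in one of a fixed set of size classes. A minimal sketch of that rounding (the kernel uses `roundup_pow_of_two()`; the function name below is my own):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round an allocation size (in IOVA pages) up to the next power of two,
 * so every cached length maps onto a fixed rcache "bucket".  This is the
 * property that is hard to retrofit for IOVAs that were allocated before
 * the cache range was enlarged.
 */
static size_t iova_rcache_bucket_size(size_t pages)
{
	size_t n = 1;

	while (n < pages)
		n <<= 1;	/* double until we cover the request */
	return n;
}
```

A previously uncached 3-page IOVA, for example, would have been allocated at exactly 3 pages; putting it in the cache afterwards would require it to occupy a 4-page bucket, which is why resizing a 'live' domain gets messy.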
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Thursday, June 3, 2021 12:09 AM > > On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, June 2, 2021 1:42 AM > > > > > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > > > Sent: Saturday, May 29, 2021 1:36 AM > > > > > > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > > > > > > > IOASID nesting can be implemented in two ways: hardware nesting > and > > > > > > software nesting. With hardware support the child and parent I/O > page > > > > > > tables are walked consecutively by the IOMMU to form a nested > > > translation. > > > > > > When it's implemented in software, the ioasid driver is responsible > for > > > > > > merging the two-level mappings into a single-level shadow I/O page > > > table. > > > > > > Software nesting requires both child/parent page tables operated > > > through > > > > > > the dma mapping protocol, so any change in either level can be > > > captured > > > > > > by the kernel to update the corresponding shadow mapping. > > > > > > > > > > Why? A SW emulation could do this synchronization during > invalidation > > > > > processing if invalidation contained an IOVA range. > > > > > > > > In this proposal we differentiate between host-managed and user- > > > > managed I/O page tables. If host-managed, the user is expected to use > > > > map/unmap cmd explicitly upon any change required on the page table. > > > > If user-managed, the user first binds its page table to the IOMMU and > > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically > > > > not required when changing a PTE from non-present to present). > > > > > > > > We expect user to use map+unmap and bind+invalidate respectively > > > > instead of mixing them together. 
> > > > Following this policy, map+unmap
> > > > must be used in both levels for software nesting, so changes in either
> > > > level are captured promptly to synchronize the shadow mapping.
> > >
> > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > it is created. If you put two different types in a tree then each IOASID
> > > must continue to use its own operation mode.
> > >
> > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> >
> > only for software nesting. With hardware support the parent uses map
> > while the child uses bind.
> >
> > Yes, the policy is specified per IOASID. But if the policy violates the
> > requirement in a specific nesting mode, then nesting should fail.
>
> I don't get it.
>
> If the IOASID is a page table then it is bind/invalidate. SW or not SW
> doesn't matter at all.
>
> > > A software emulated two level page table where the leaf level is a
> > > bound page table in guest memory should continue to use
> > > bind/invalidate to maintain the guest page table IOASID even though it
> > > is a SW construct.
> >
> > with software nesting the leaf should be a host-managed page table
> > (or metadata). A bind/invalidate protocol doesn't require the user
> > to notify the kernel of every page table change.
>
> The purpose of invalidate is to inform the implementation that the
> page table has changed so it can flush the caches. If the page table
> is changed and invalidation is not issued then the implementation
> is free to ignore the changes.
>
> In this way the SW mode is the same as a HW mode with an infinite
> cache.
>
> The collapsed shadow page table is really just a cache.

OK. One additional thing is that we may need a 'caching_mode' capability reported by /dev/ioasid, indicating whether invalidation is required when changing non-present to present. For hardware nesting it's not reported, as the hardware IOMMU will walk the guest page table in case of an iotlb miss.
For software nesting, caching_mode is reported, so the user must issue an invalidation upon any change in the guest page table and the kernel can update the shadow page table promptly.

Following this and your other comment with David, we will mark host-managed vs. guest-managed explicitly for the I/O page table of each IOASID. map+unmap or bind+invalidate is decided by which owner is specified by the user.

Thanks
Kevin
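The caching_mode semantics agreed above reduce to one decision: an invalidation is always needed when an existing PTE is changed or removed, and needed for a non-present-to-present change only when caching_mode is set (software nesting). A hedged sketch, with all names hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-IOASID capability: software nesting reports
 * caching_mode=1 because the kernel's shadow table behaves like an
 * infinite cache; hardware nesting reports 0 because the IOMMU walks
 * the guest page table itself on an iotlb miss.
 */
struct ioasid_info {
	bool caching_mode;
};

/* Must the user issue an invalidation after this PTE change? */
static bool needs_invalidate(const struct ioasid_info *info, bool was_present)
{
	if (was_present)
		return true;		/* modified/removed PTEs always need a flush */
	return info->caching_mode;	/* non-present -> present: SW nesting only */
}
```

This mirrors the VT-d "caching mode" idea for vIOMMUs: the capability bit tells the guest whether mappings of previously non-present entries must be flushed through to the host.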
Re: Virtio hardening for TDX
On 2021/6/3 8:41 AM, Andi Kleen wrote:

[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host anymore. The host is allowed to DOS of course, but it is not allowed to read or write any guest memory not explicitly shared with it.

This has implications for virtio. Traditionally virtio didn't assume the other side of the communication channel is malicious, and therefore didn't do any boundary checks in virtio ring data structures.

This patchkit does hardening for virtio. In a TDX like model the only host memory accesses allowed are in the virtio ring, as well as the (forced) swiotlb buffer. This patch kit does various changes to ensure there can be no access outside these two areas. It is possible for the host to break the communication, but this should result in an IO error on the guest, not memory safety violations.

virtio is quite complicated with many modes. To simplify the task we enforce that virtio is only in split mode without indirect descriptors, when running as a TDX guest. We also enforce use of the DMA API. Then these code paths are hardened against any corruptions on the ring.

This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable arch_has_restricted_memory_access for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure for TDX. This is only needed for the "am I running in TDX" part.

Note that it's probably needed by other cases as well:

1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs

We have already had discussions and some patches have been posted[2][3][4]. I think the basic idea is similar; basically, we don't trust any metadata provided by the device.
[2] is the series that uses metadata stored in private memory which can't be accessed by swiotlb; this series aims to eliminate all the possible attacks via virtqueue metadata.

[3] is one example of the used length validation.

[4] is the fix for the malicious config space.

Thanks

[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 8:41 AM, Andi Kleen wrote:

When running under TDX the virtio host is untrusted. The bulk of the kernel memory is encrypted and protected, but the virtio ring is in special shared memory that is shared with the untrusted host. This means virtio needs to be hardened against any attacks from the host through the ring. Of course it's impossible to prevent DOS (the host can choose at any time to stop doing IO), but there should be no buffer overruns or similar that might give access to any private memory in the guest.

virtio has a lot of modes, most are difficult to harden. The best for hardening seems to be split mode without indirect descriptors. This also simplifies the hardening job because it's only a single code path. Only allow split mode when in a protected guest. Followon patches harden the split mode code paths, and we don't want a malicious host to force anything else. Also disallow indirect mode for similar reasons.

Signed-off-by: Andi Kleen
---
drivers/virtio/virtio_ring.c | 12
1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #ifdef DEBUG
@@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
 	unsigned int i;
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
+
+		/*
+		 * In protected guest mode disallow packed or indirect
+		 * because they ain't hardened.
+		 */
+
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
+			if (protected_guest_has(VM_MEM_ENCRYPT))
+				goto clear;

So we will see a huge performance regression without indirect descriptors. We need to consider addressing this.
Thanks

 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
 			break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
 		case VIRTIO_F_ACCESS_PLATFORM:
 			break;
 		case VIRTIO_F_RING_PACKED:
+			if (protected_guest_has(VM_MEM_ENCRYPT))
+				goto clear;
 			break;
 		case VIRTIO_F_ORDER_PLATFORM:
 			break;
+		clear:
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks
On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote: > swiotlb currently only uses the start address of a DMA to check if something > is in the swiotlb or not. But with virtio and untrusted hosts the host > could give some DMA mapping that crosses the swiotlb boundaries, > potentially leaking or corrupting data. Add size checks to all the swiotlb > checks and reject any DMAs that cross the swiotlb buffer boundaries. I seem to be only CC-ed on this and #7, so please bear with me. But could you explain to me why please: commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12) Author: Martin Radev Date: Tue Jan 12 16:07:29 2021 +0100 swiotlb: Validate bounce size in the sync/unmap path does not solve the problem as well? > > Signed-off-by: Andi Kleen > --- > drivers/iommu/dma-iommu.c | 13 ++--- > drivers/xen/swiotlb-xen.c | 11 ++- > include/linux/dma-mapping.h | 4 ++-- > include/linux/swiotlb.h | 8 +--- > kernel/dma/direct.c | 8 > kernel/dma/direct.h | 8 > kernel/dma/mapping.c| 4 ++-- > net/xdp/xsk_buff_pool.c | 2 +- > 8 files changed, 30 insertions(+), 28 deletions(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 7bcdd1205535..7ef13198721b 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, > dma_addr_t dma_addr, > > __iommu_dma_unmap(dev, dma_addr, size); > > - if (unlikely(is_swiotlb_buffer(phys))) > + if (unlikely(is_swiotlb_buffer(phys, size))) > swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); > } > > @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device > *dev, phys_addr_t phys, > } > > iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask); > - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys)) > + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size)) > swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs); > return iova; > } > @@ 
-781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device > *dev, > if (!dev_is_dma_coherent(dev)) > arch_sync_dma_for_cpu(phys, size, dir); > > - if (is_swiotlb_buffer(phys)) > + if (is_swiotlb_buffer(phys, size)) > swiotlb_sync_single_for_cpu(dev, phys, size, dir); > } > > @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct > device *dev, > return; > > phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); > - if (is_swiotlb_buffer(phys)) > + if (is_swiotlb_buffer(phys, size)) > swiotlb_sync_single_for_device(dev, phys, size, dir); > > if (!dev_is_dma_coherent(dev)) > @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev, > if (!dev_is_dma_coherent(dev)) > arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir); > > - if (is_swiotlb_buffer(sg_phys(sg))) > + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) > swiotlb_sync_single_for_cpu(dev, sg_phys(sg), > sg->length, dir); > } > @@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device > *dev, > return; > > for_each_sg(sgl, sg, nelems, i) { > - if (is_swiotlb_buffer(sg_phys(sg))) > + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) > swiotlb_sync_single_for_device(dev, sg_phys(sg), > sg->length, dir); > - > if (!dev_is_dma_coherent(dev)) > arch_sync_dma_for_device(sg_phys(sg), sg->length, dir); > } > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c > index 24d11861ac7d..333846af8d35 100644 > --- a/drivers/xen/swiotlb-xen.c > +++ b/drivers/xen/swiotlb-xen.c > @@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t > p, size_t size) > return 0; > } > > -static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) > +static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr, > + size_t size) > { > unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr)); > unsigned long xen_pfn = bfn_to_local_pfn(bfn); > @@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct 
device *dev, > dma_addr_t dma_addr) >* in our domain. Therefore _only_ check address within our domain. >*/ > if (pfn_valid(PFN_DOWN(paddr))) > - return is_swiotlb_buffer(paddr); > + return is_swiotlb_buffer(paddr, size); > return 0; > } > > @@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device
Virtio hardening for TDX
[v1: Initial post] With confidential computing like TDX the guest doesn't trust the host anymore. The host is allowed to DOS of course, but it is not allowed to read or write any guest memory not explicitely shared with it. This has implication for virtio. Traditionally virtio didn't assume the other side of the communication channel is malicious, and therefore didn't do any boundary checks in virtio ring data structures. This patchkit does hardening for virtio. In a TDX like model the only host memory accesses allowed are in the virtio ring, as well as the (forced) swiotlb buffer. This patch kit does various changes to ensure there can be no access outside these two areas. It is possible for the host to break the communication, but this should result in a IO error on the guest, but no memory safety violations. virtio is quite complicated with many modes. To simplify the task we enforce that virtio is only in split mode without indirect descriptors, when running as a TDX guest. We also enforce use of the DMA API. Then these code paths are hardened against any corruptions on the ring. This patchkit has components in three subsystems: - Hardening changes to virtio, all in the generic virtio-ring - Hardening changes to kernel/dma swiotlb to harden swiotlb against malicious pointers. It requires an API change which needed a tree sweep. - A single x86 patch to enable the arch_has_restricted_memory_access for TDX It depends on Sathya's earlier patchkit that adds the basic infrastructure for TDX. This is only needed for the "am I running in TDX" part. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v1 6/8] dma: Add return value to dma_unmap_page
In some situations when we know swiotlb is forced and we have to deal with untrusted hosts, it's useful to know if a mapping was in the swiotlb or not. This allows us to abort any IO operation that would access memory outside the swiotlb. Otherwise it might be possible for a malicious host to inject any guest page in a read operation. While it couldn't directly access the results of the read() inside the guest, there might scenarios where data is echoed back with a write(), and that would then leak guest memory. Add a return value to dma_unmap_single/page. Most users of course will ignore it. The return value is set to EIO if we're in forced swiotlb mode and the buffer is not inside the swiotlb buffer. Otherwise it's always 0. A new callback is used to avoid changing all the IOMMU drivers. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 17 +++-- include/linux/dma-map-ops.h | 3 +++ include/linux/dma-mapping.h | 7 --- kernel/dma/mapping.c| 6 +- 4 files changed, 23 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7ef13198721b..babe46f2ae3a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr, iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist); } -static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, +static int __iommu_dma_unmap_swiotlb_check(struct device *dev, + dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, phys = iommu_iova_to_phys(domain, dma_addr); if (WARN_ON(!phys)) - return; + return -EIO; __iommu_dma_unmap(dev, dma_addr, size); if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); + else if (swiotlb_force == SWIOTLB_FORCE) + return -EIO; + return 0; } static 
dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, @@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, return dma_handle; } -static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir); - __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs); + return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir, + attrs); } /* @@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *s int i; for_each_sg(sg, s, nents, i) - __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s), + __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); } @@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = { .mmap = iommu_dma_mmap, .get_sgtable= iommu_dma_get_sgtable, .map_page = iommu_dma_map_page, - .unmap_page = iommu_dma_unmap_page, + .unmap_page_check = iommu_dma_unmap_page_check, .map_sg = iommu_dma_map_sg, .unmap_sg = iommu_dma_unmap_sg, .sync_single_for_cpu= iommu_dma_sync_single_for_cpu, diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d53a96a3d64..0ed0190f7949 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -69,6 +69,9 @@ struct dma_map_ops { u64 (*get_required_mask)(struct device *dev); size_t (*max_mapping_size)(struct device *dev); unsigned long (*get_merge_boundary)(struct device *dev); + int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle, + size_t size, enum dma_data_direction dir, + unsigned long attrs); }; #ifdef CONFIG_DMA_OPS diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 37fbd12bd4ab..25b8382f8601 100644 --- a/include/linux/dma-mapping.h +++ 
b/include/linux/dma-mapping.h @@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir, unsigned long attrs); -void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t siz
[PATCH v1 1/8] virtio: Force only split mode with protected guest
When running under TDX the virtio host is untrusted. The bulk of the kernel memory is encrypted and protected, but the virtio ring is in special shared memory that is shared with the untrusted host. This means virtio needs to be hardened against any attacks from the host through the ring. Of course it's impossible to prevent DOS (the host can chose at any time to stop doing IO), but there should be no buffer overruns or similar that might give access to any private memory in the guest. virtio has a lot of modes, most are difficult to harden. The best for hardening seems to be split mode without indirect descriptors. This also simplifies the hardening job because it's only a single code path. Only allow split mode when in a protected guest. Followon patches harden the split mode code paths, and we don't want an malicious host to force anything else. Also disallow indirect mode for similar reasons. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71e16b53e9c1..f35629fa47b1 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #ifdef DEBUG @@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev) unsigned int i; for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) { + + /* +* In protected guest mode disallow packed or indirect +* because they ain't hardened. +*/ + switch (i) { case VIRTIO_RING_F_INDIRECT_DESC: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_RING_F_EVENT_IDX: break; @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev) case VIRTIO_F_ACCESS_PLATFORM: break; case VIRTIO_F_RING_PACKED: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_F_ORDER_PLATFORM: break; + clear: default: /* We don't understand this bit. 
*/ __virtio_clear_bit(vdev, i); -- 2.25.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v1 7/8] virtio: Abort IO when descriptor points outside forced swiotlb
Now that we have a return value for unmapping DMA mappings that are outside the forced swiotlb, use that to abort the IO operation. This prevents the host from subverting a read to access some data in the guest address space, which it might then get access somehow in another IO operation. It can subvert reads to point to other reads or other writes, but since it controls IO it can do that anyways. This is only done for the split code path, which is the only one supported with confidential guests. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 1e9aa1e95e1b..244a5b62d85c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -365,29 +365,31 @@ static int vring_mapping_error(const struct vring_virtqueue *vq, * Split ring specific functions - *_split(). */ -static void vring_unmap_one_split(const struct vring_virtqueue *vq, +static int vring_unmap_one_split(const struct vring_virtqueue *vq, struct vring_desc *desc) { u16 flags; + int ret; if (!vq->use_dma_api) - return; + return 0; flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); if (flags & VRING_DESC_F_INDIRECT) { - dma_unmap_single(vring_dma_dev(vq), + ret = dma_unmap_single(vring_dma_dev(vq), virtio64_to_cpu(vq->vq.vdev, desc->addr), virtio32_to_cpu(vq->vq.vdev, desc->len), (flags & VRING_DESC_F_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE); } else { - dma_unmap_page(vring_dma_dev(vq), + ret = dma_unmap_page(vring_dma_dev(vq), virtio64_to_cpu(vq->vq.vdev, desc->addr), virtio32_to_cpu(vq->vq.vdev, desc->len), (flags & VRING_DESC_F_WRITE) ? 
DMA_FROM_DEVICE : DMA_TO_DEVICE); } + return ret; } static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, @@ -609,6 +611,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, break; if (!inside_split_ring(vq, i)) break; + /* +* Ignore unmapping errors since +* we're aborting anyways. +*/ vring_unmap_one_split(vq, &desc[i]); i = virtio16_to_cpu(_vq->vdev, desc[i].next); } @@ -671,7 +677,10 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, i = head; while (vq->split.vring.desc[i].flags & nextflag) { - vring_unmap_one_split(vq, &vq->split.vring.desc[i]); + int ret; + ret = vring_unmap_one_split(vq, &vq->split.vring.desc[i]); + if (ret) + return ret; i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next); if (!inside_split_ring(vq, i)) return -EIO; @@ -878,6 +887,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) continue; /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; + /* Ignore unmap errors because there is nothing to abort */ detach_buf_split(vq, i, NULL); /* Don't need to check for error because nothing is returned */ vq->split.avail_idx_shadow--; -- 2.25.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v1 8/8] virtio: Error out on endless free lists
Error out with a warning when the free list loops longer than the maximum size while freeing descriptors. While technically we don't care about DOS it is still better to abort it early. We ran into this problem while fuzzing the virtio interactions where the fuzzed code would get stuck for a long time. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 244a5b62d85c..96adaa4c5404 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -685,6 +685,11 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, if (!inside_split_ring(vq, i)) return -EIO; vq->vq.num_free++; + if (WARN_ONCE(vq->vq.num_free > + vq->split.queue_size_in_bytes / + sizeof(struct vring_desc), + "Virtio freelist corrupted")) + return -EIO; } vring_unmap_one_split(vq, &vq->split.vring.desc[i]); -- 2.25.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v1 3/8] virtio: Harden split buffer detachment
Harden the split buffer detachment path by adding boundary checking. Note that when this fails we may fail to unmap some swiotlb mapping, which could result in a leak and a DOS. But that's acceptable because an malicious host can DOS us anyways. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index d37ff5a0ff58..1e9aa1e95e1b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq) return needs_kick; } -static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, -void **ctx) +static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, + void **ctx) { unsigned int i, j; __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); + /* We'll leak DMA mappings when this happens, but nothing +* can be done about that. In the worst case the host +* could DOS us, but it can of course do that anyways. +*/ + if (!inside_split_ring(vq, head)) + return -EIO; + /* Clear data ptr. */ vq->split.desc_state[head].data = NULL; @@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, while (vq->split.vring.desc[i].flags & nextflag) { vring_unmap_one_split(vq, &vq->split.vring.desc[i]); i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next); + if (!inside_split_ring(vq, i)) + return -EIO; vq->vq.num_free++; } @@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, /* Free the indirect table, if any, now that it's unmapped. 
*/ if (!indir_desc) - return; + return 0; len = virtio32_to_cpu(vq->vq.vdev, vq->split.vring.desc[head].len); @@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, } else if (ctx) { *ctx = vq->split.desc_state[head].indir_desc; } + return 0; } static inline bool more_used_split(const struct vring_virtqueue *vq) @@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, void *ret; unsigned int i; u16 last_used; + int err; START_USE(vq); @@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, /* detach_buf_split clears data, so grab it now. */ ret = vq->split.desc_state[i].data; - detach_buf_split(vq, i, ctx); + err = detach_buf_split(vq, i, ctx); + if (err) { + END_USE(vq); + return NULL; + } + vq->last_used_idx++; /* If we expect an interrupt for the next entry, tell host * by writing event index and flush out the write before @@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; detach_buf_split(vq, i, NULL); + /* Don't need to check for error because nothing is returned */ vq->split.avail_idx_shadow--; vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow); -- 2.25.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v1 5/8] dma: Use size for swiotlb boundary checks
swiotlb currently only uses the start address of a DMA to check if something is in the swiotlb or not. But with virtio and untrusted hosts the host could give some DMA mapping that crosses the swiotlb boundaries, potentially leaking or corrupting data. Add size checks to all the swiotlb checks and reject any DMAs that cross the swiotlb buffer boundaries. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 13 ++--- drivers/xen/swiotlb-xen.c | 11 ++- include/linux/dma-mapping.h | 4 ++-- include/linux/swiotlb.h | 8 +--- kernel/dma/direct.c | 8 kernel/dma/direct.h | 8 kernel/dma/mapping.c| 4 ++-- net/xdp/xsk_buff_pool.c | 2 +- 8 files changed, 30 insertions(+), 28 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..7ef13198721b 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, __iommu_dma_unmap(dev, dma_addr, size); - if (unlikely(is_swiotlb_buffer(phys))) + if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys, } iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask); - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys)) + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size)) swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs); return iova; } @@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(phys, size, dir); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_cpu(dev, phys, size, dir); } @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev, return; phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); - if (is_swiotlb_buffer(phys)) + if 
(is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_device(dev, phys, size, dir); if (!dev_is_dma_coherent(dev)) @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir); - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_cpu(dev, sg_phys(sg), sg->length, dir); } @@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device *dev, return; for_each_sg(sgl, sg, nelems, i) { - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_device(dev, sg_phys(sg), sg->length, dir); - if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_device(sg_phys(sg), sg->length, dir); } diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 24d11861ac7d..333846af8d35 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t p, size_t size) return 0; } -static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) +static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr, +size_t size) { unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr)); unsigned long xen_pfn = bfn_to_local_pfn(bfn); @@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) * in our domain. Therefore _only_ check address within our domain. */ if (pfn_valid(PFN_DOWN(paddr))) - return is_swiotlb_buffer(paddr); + return is_swiotlb_buffer(paddr, size); return 0; } @@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, } /* NOTE: We use dev_addr here, not paddr! 
*/ - if (is_xen_swiotlb_buffer(hwdev, dev_addr)) + if (is_xen_swiotlb_buffer(hwdev, dev_addr, size)) swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); } @@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr, xen_dma_sync_for_cpu(dev, dma_addr, size, dir); } - if (is_xen_swiotlb_buffer(dev, dma_addr)) + if (is_xen
[PATCH v1 2/8] virtio: Add boundary checks to virtio ring
In protected guest mode we don't trust the host. This means we need to make sure the host cannot subvert us through virtio communication. In general it can corrupt our virtio data and cause a DOS, but it should not be able to access any data that is not explicitly under IO. Also add boundary checking so that the free list (which is accessible to the host) cannot point outside the virtio ring. Note it could still contain loops or similar, but these should only cause a DOS, not a memory corruption or leak. When we detect any out-of-bounds descriptor we trigger an IO error. We also use a WARN() (in case it was a software bug instead of an attack). This implies that a malicious host can flood the guest kernel log, but that's only a DOS and acceptable in the threat model. This patch only hardens the initial consumption of the free list, the freeing comes later. Any of these errors can cause DMA memory leaks, but there is nothing we can do about that and that would be just a DOS. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 46 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index f35629fa47b1..d37ff5a0ff58 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, return desc; } +/* assumes no indirect mode */ +static inline bool inside_split_ring(struct vring_virtqueue *vq, +unsigned index) +{ + return !WARN(index >= vq->split.vring.num, + "desc index %u out of bounds (%u)\n", + index, vq->split.vring.num); +} + static inline int virtqueue_add_split(struct virtqueue *_vq, struct scatterlist *sgs[], unsigned int total_sg, @@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, unsigned int i, n, avail, descs_used, prev, err_idx; int head; bool indirect; + int io_err; START_USE(vq); @@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue
*_vq, for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, } for (; n < (out_sgs + in_sgs); n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, dma_addr_t addr = vring_map_single( vq, desc, total_sg * sizeof(struct vring_desc), DMA_TO_DEVICE); + io_err = -ENOMEM; if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* We're using some buffers from the free list. */ vq->vq.num_free -= descs_used; + io_err = -EIO; + if (!inside_split_ring(vq, head)) + goto unmap_release; + /* Update free pointer */ if (indirect) vq->free_head = virtio16_to_cpu(_vq->vdev, @@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* Put entry in available array (but don't update avail->idx until they * do sync). */ avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1); + + if (avail >= vq->split.vring.num) + goto unmap_release; + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head); /* Descriptors and available array need to be set before we expose the @@ -576,6 +607,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, for (n = 0; n < total_sg; n++) { if (i == err_idx)
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
So we will see huge performance regression without indirect descriptor. We need to consider addressing this. A regression would be when some existing case would be slower. That's not the case because the behavior for the existing cases does not change. Anyways when there are performance problems they can be addressed, but the first step is to make it secure. -Andi Thanks break; case VIRTIO_RING_F_EVENT_IDX: break; @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev) case VIRTIO_F_ACCESS_PLATFORM: break; case VIRTIO_F_RING_PACKED: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_F_ORDER_PLATFORM: break; + clear: default: /* We don't understand this bit. */ __virtio_clear_bit(vdev, i);
Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks
On 6/2/2021 6:48 PM, Konrad Rzeszutek Wilk wrote: On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote: swiotlb currently only uses the start address of a DMA to check if something is in the swiotlb or not. But with virtio and untrusted hosts the host could give some DMA mapping that crosses the swiotlb boundaries, potentially leaking or corrupting data. Add size checks to all the swiotlb checks and reject any DMAs that cross the swiotlb buffer boundaries. I seem to be only CC-ed on this and #7, so please bear with me. You weren't cc'ed originally so if you get partial emails it must be through some list. But could you explain to me why please: commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12) Author: Martin Radev Date: Tue Jan 12 16:07:29 2021 +0100 swiotlb: Validate bounce size in the sync/unmap path does not solve the problem as well? Thanks. I missed that patch, race condition. One major difference of my patch is that it supports an error return, which allows virtio to error out. This is important in virtio because otherwise you'll end up with uninitialized memory on the target without any indication. This uninitialized memory could be a potential attack vector on the guest memory, e.g. if the attacker finds some way to echo it out again. But the error return could be added to your infrastructure too, which would make this patch much shorter. I'll take a look at that. -Andi
Re: Virtio hardening for TDX
Note that it's probably needed by other cases as well: 1) Other encrypted VM technology 2) VDUSE[1] 3) Smart NICs Right. I don't see any reason why these shouldn't work. You may just need to add the enable for the lockdown, but you can reuse the basic infrastructure. We have already had discussions and some patches have been posted[2][3][4]. Thanks. Yes [2] is indeed an alternative. We considered this at some point, but since we don't care about DOS in our case it seemed simpler to just harden the existing code. But yes if it's there it's useful for TDX too. FWIW I would argue that the descriptor boundary checking should be added in any case, security case or separated metadata or not, because it can catch bugs and is very cheap. Checking boundaries is good practice. [4] would be an independent issue, that's something we didn't catch. Also the swiotlb hardening implemented in this patchkit doesn't seem to be in any of the other patches. So I would say my patches are mostly orthogonal to these patches below and not conflicting, even though they address a similar problem space. -Andi
[PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX
In virtio the host decides whether the guest uses the DMA API or not using the strangely named VIRTIO_F_ACCESS_PLATFORM bit (which really indicates whether the DMA API is used or not). For hardened virtio on TDX we want to enforce that swiotlb is always used, which requires using the DMA API. While IO wouldn't really work without the swiotlb, it might be possible that an attacker forces swiotlbless IO to manipulate memory in the guest. So we want to force the DMA API (which then forces swiotlb), but without relying on the host. There is already an arch_has_restricted_virtio_memory_access hook for this, which is currently used only by s390. Enable the config option for the hook for x86 and enable it for TDX. Signed-off-by: Andi Kleen --- arch/x86/Kconfig | 1 + arch/x86/mm/mem_encrypt_common.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1531a0f905ed..3d804fce31b9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST select X86_X2APIC select SECURITY_LOCKDOWN_LSM select X86_MEM_ENCRYPT_COMMON + select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS help Provide support for running in a trusted domain on Intel processors equipped with Trusted Domain eXtensions. TDX is a new Intel diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c index 24c9117547b4..2244d1f033ab 100644 --- a/arch/x86/mm/mem_encrypt_common.c +++ b/arch/x86/mm/mem_encrypt_common.c @@ -9,6 +9,7 @@ #include #include +#include #include /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */ @@ -37,3 +38,9 @@ void __init mem_encrypt_init(void) amd_mem_encrypt_init(); } +int arch_has_restricted_virtio_memory_access(void) +{ + return is_tdx_guest(); +} +EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access); + -- 2.25.4
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Thursday, June 3, 2021 12:17 AM > [...] > > > If there are no hypervisor traps (does this exist?) then there is no > > > way to involve the hypervisor here and the child IOASID should simply > > > be a pointer to the guest's data structure that describes binding. In > > > this case that IOASID should claim all PASIDs when bound to a > > > RID. > > > > And in that case I think we should call that object something other > > than an IOASID, since it represents multiple address spaces. > > Maybe.. It is certainly a special case. > > We can still consider it a single "address space" from the IOMMU > perspective. What has happened is that the address table is not just a > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA". More accurately 64+20=84 bit IOVA 😊 > > If we are already going in the direction of having the IOASID specify > the page table format and other details, specifying that the page I'm leaning toward this direction now, after a discussion with Baolu. He reminded me that a default domain is already created for each device when it's probed by the iommu driver. So it looks workable to expose a per-device capability query uAPI to user once a device is bound to the ioasid fd. Once it's available, the user should be able to judge what format/mode should be set when creating an IOASID. > table format is the 80 bit "PASID, IOVA" format is a fairly small > step. In concept this view is true. But when designing the uAPI possibly we will not call it an 84-bit format as the PASID table itself just serves a 20-bit PASID space. Will think more about how to mark it in the next version. > > I wouldn't twist things into knots to create a difference, but if it > is easy to do it wouldn't hurt either. > Thanks Kevin
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
在 2021/6/3 上午8:41, Andi Kleen 写道: In protected guest mode we don't trust the host. This means we need to make sure the host cannot subvert us through virtio communication. In general it can corrupt our virtio data and cause a DOS, but it should not be able to access any data that is not explicitely under IO. Also boundary checking so that the free list (which is accessible to the host) cannot point outside the virtio ring. Note it could still contain loops or similar, but these should only cause an DOS, not a memory corruption or leak. When we detect any out of bounds descriptor trigger an IO error. We also use a WARN() (in case it was a software bug instead of an attack). This implies that a malicious host can flood the guest kernel log, but that's only a DOS and acceptable in the threat model. This patch only hardens the initial consumption of the free list, the freeing comes later. Any of these errors can cause DMA memory leaks, but there is nothing we can do about that and that would be just a DOS. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 46 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index f35629fa47b1..d37ff5a0ff58 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, return desc; } +/* assumes no indirect mode */ +static inline bool inside_split_ring(struct vring_virtqueue *vq, +unsigned index) +{ + return !WARN(index >= vq->split.vring.num, + "desc index %u out of bounds (%u)\n", + index, vq->split.vring.num); It's better to use BAD_RING to stop virtqueue in this case. 
+} + static inline int virtqueue_add_split(struct virtqueue *_vq, struct scatterlist *sgs[], unsigned int total_sg, @@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, unsigned int i, n, avail, descs_used, prev, err_idx; int head; bool indirect; + int io_err; START_USE(vq); @@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, } for (; n < (out_sgs + in_sgs); n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { - dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); + dma_addr_t addr; + + io_err = -EIO; + if (!inside_split_ring(vq, i)) + goto unmap_release; + io_err = -ENOMEM; + addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); if (vring_mapping_error(vq, addr)) goto unmap_release; It looks to me all the evils came from the fact that we depends on the descriptor ring. So the checks in this patch could is unnecessary if we don't even read from the descriptor ring which could be manipulated by the device. This is what my series tries to achieve: https://www.spinics.net/lists/kvm/msg241825.html Thanks @@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, dma_addr_t addr = vring_map_single( vq, desc, total_sg * sizeof(struct vring_desc), DMA_TO_DEVICE); + io_err = -ENOMEM; if (vring_mapping_error(vq, addr)) goto unmap_release; @@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* We're using some buffers from the free list. 
*/ vq->vq.num_free -= descs_used; + io_err = -EIO; + if (!inside_split_ring(vq, head)) + goto unmap_release; + /* Update free pointer */ if (indirect) vq->free_head = virtio16_to_cpu(_vq->vdev, @@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* Put entry in available array (but don't update avail->idx until they * do sync). */ avail = vq->split.avail_
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
It looks to me all the evils came from the fact that we depend on the descriptor ring. So the checks in this patch could be unnecessary if we don't even read from the descriptor ring which could be manipulated by the device. This is what my series tries to achieve: https://www.spinics.net/lists/kvm/msg241825.html I would argue that you should boundary check in any case. It was always a bug to not have boundary checks in such a data structure with multiple users, trust or not. But yes your patch series is interesting and definitely makes sense for TDX too. Best would be to have both I guess, and always check the boundaries everywhere. So what's the merge status of your series? -Andi
Re: [PATCH v1 3/8] virtio: Harden split buffer detachment
在 2021/6/3 上午8:41, Andi Kleen 写道: Harden the split buffer detachment path by adding boundary checking. Note that when this fails we may fail to unmap some swiotlb mapping, which could result in a leak and a DOS. But that's acceptable because an malicious host can DOS us anyways. Signed-off-by: Andi Kleen --- drivers/virtio/virtio_ring.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index d37ff5a0ff58..1e9aa1e95e1b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq) return needs_kick; } -static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, -void **ctx) +static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head, + void **ctx) { unsigned int i, j; __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); + /* We'll leak DMA mappings when this happens, but nothing +* can be done about that. In the worst case the host +* could DOS us, but it can of course do that anyways. +*/ + if (!inside_split_ring(vq, head)) + return -EIO; I think the caller have already did this for us with even more check on the token (virtqueue_get_buf_ctx_split()): if (unlikely(i >= vq->split.vring.num)) { BAD_RING(vq, "id %u out of range\n", i); return NULL; } if (unlikely(!vq->split.desc_state[i].data)) { BAD_RING(vq, "id %u is not a head!\n", i); return NULL; } + /* Clear data ptr. */ vq->split.desc_state[head].data = NULL; @@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, while (vq->split.vring.desc[i].flags & nextflag) { vring_unmap_one_split(vq, &vq->split.vring.desc[i]); i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next); + if (!inside_split_ring(vq, i)) + return -EIO; Similarly, if we don't depend on the metadata stored in the descriptor, we don't need this check. 
vq->vq.num_free++; } @@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, /* Free the indirect table, if any, now that it's unmapped. */ if (!indir_desc) - return; + return 0; len = virtio32_to_cpu(vq->vq.vdev, vq->split.vring.desc[head].len); @@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, } else if (ctx) { *ctx = vq->split.desc_state[head].indir_desc; } + return 0; } static inline bool more_used_split(const struct vring_virtqueue *vq) @@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, void *ret; unsigned int i; u16 last_used; + int err; START_USE(vq); @@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, /* detach_buf_split clears data, so grab it now. */ ret = vq->split.desc_state[i].data; - detach_buf_split(vq, i, ctx); + err = detach_buf_split(vq, i, ctx); + if (err) { + END_USE(vq); This reminds me that we don't use END_USE() after BAD_RING() which should be fixed. Thanks + return NULL; + } + vq->last_used_idx++; /* If we expect an interrupt for the next entry, tell host * by writing event index and flush out the write before @@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; detach_buf_split(vq, i, NULL); + /* Don't need to check for error because nothing is returned */ vq->split.avail_idx_shadow--; vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow);
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 9:48 AM, Andi Kleen wrote: So we will see huge performance regression without indirect descriptor. We need to consider addressing this. A regression would be when some existing case would be slower. That's not the case because the behavior for the existing cases does not change. Anyways when there are performance problems they can be addressed, but the first step is to make it secure. I agree, but I want to know why indirect descriptor needs to be disabled. The table can't be written by the device since it's not a coherent swiotlb mapping. Thanks -Andi Thanks break; case VIRTIO_RING_F_EVENT_IDX: break; @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev) case VIRTIO_F_ACCESS_PLATFORM: break; case VIRTIO_F_RING_PACKED: + if (protected_guest_has(VM_MEM_ENCRYPT)) + goto clear; break; case VIRTIO_F_ORDER_PLATFORM: break; + clear: default: /* We don't understand this bit. */ __virtio_clear_bit(vdev, i);
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
On 2021/6/3 10:18 AM, Andi Kleen wrote: It looks to me all the evils came from the fact that we depend on the descriptor ring. So the checks in this patch could be unnecessary if we don't even read from the descriptor ring which could be manipulated by the device. This is what my series tries to achieve: https://www.spinics.net/lists/kvm/msg241825.html I would argue that you should boundary check in any case. It was always a bug to not have boundary checks in such a data structure with multiple users, trust or not. But yes your patch series is interesting and definitely makes sense for TDX too. Best would be to have both I guess, and always check the boundaries everywhere. I agree, but some of the checks are unnecessary if we do this series on top of my series. So what's the merge status of your series? If I understand correctly from Michael, I will send a formal series and he will try to merge it for 5.14. Thanks -Andi
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Thursday, June 3, 2021 12:59 AM > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote: > > > > /* Bind guest I/O page table */ > > > > bind_data = { > > > > .ioasid = gva_ioasid; > > > > .addr = gva_pgtable1; > > > > // and format information > > > > }; > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); > > > > > > Again I do wonder if this should just be part of alloc_ioasid. Is > > > there any reason to split these things? The only advantage to the > > > split is the device is known, but the device shouldn't impact > > > anything.. > > > > I'm pretty sure the device(s) could matter, although they probably > > won't usually. > > It is a bit subtle, but the /dev/iommu fd itself is connected to the > devices first. This prevents wildly incompatible devices from being > joined together, and allows some "get info" to report the capability > union of all devices if we want to do that. I would expect the capability to be reported per-device via /dev/iommu. Incompatible devices can bind to the same fd but cannot attach to the same IOASID. This allows incompatible devices to share locked page accounting. > > The original concept was that devices joined would all have to support > the same IOASID format, at least for the kernel owned map/unmap IOASID > type. Supporting different page table formats maybe is reason to > revisit that concept. If my memory is not failing me, the original concept was that the devices attached to the same IOASID must support the same format. Otherwise they need to attach to different IOASIDs (but still within the same fd). > > There is a small advantage to re-using the IOASID container because of > the get_user_pages caching and pinned accounting management at the FD > level. With the above concept we don't need the IOASID container then. > > I don't know if that small advantage is worth the extra complexity > though.
> > > But it would certainly be possible for a system to have two > > different host bridges with two different IOMMUs with different > > pagetable formats. Until you know which devices (and therefore > > which host bridge) you're talking about, you don't know what formats > > of pagetable to accept. And if you have devices from *both* bridges > > you can't bind a page table at all - you could theoretically support > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables > > in both formats, but it would be pretty reasonable not to support > > that. > > The basic process for a user space owned pgtable mode would be: > > 1) qemu has to figure out what format of pgtable to use > > Presumably it uses query functions using the device label. The > kernel code should look at the entire device path through all the > IOMMU HW to determine what is possible. > > Or it already knows because the VM's vIOMMU is running in some > fixed page table format, or the VM's vIOMMU already told it, or > something. I'd expect both: first get the hardware format, then detect whether it's compatible with the vIOMMU format. > > 2) qemu creates an IOASID and based on #1 and says 'I want this format' Based on earlier discussion this will possibly be: struct iommu_ioasid_create_info { // if set this is a guest-managed page table, use bind+invalidate, with // info provided in struct pgtable_info; // if clear it's host-managed and use map+unmap; #define IOMMU_IOASID_FLAG_USER_PGTABLE 1 // if set it is for pasid table binding.
same implication as USER_PGTABLE // except it's for a different pgtable type #define IOMMU_IOASID_FLAG_USER_PASID_TABLE 2 int flags; // Create nesting if not INVALID_IOASID u32 parent_ioasid; // additional info about the page table union { // for user-managed page table struct { u64 user_pgd; u32 format; u32 addr_width; // and other vendor format info } pgtable_info; // for kernel-managed page table struct { // not required on x86 // for ppc, iirc the user wants to claim a window // explicitly? } map_info; }; }; then there will be no UNBIND_PGTABLE ioctl. The unbind is done automatically when the IOASID is freed. > > 3) qemu binds the IOASID to the device. let's use 'attach' for consistency. 😊 'bind' is for ioasid fd which must be completed in step 0) so format can be reported in step 1) > > If qemu gets it wrong then it just fails. > > 4) For the next device qemu would have to figure out if it can re-use > an existing IOASID based on the required properties. > > You pointed to the case of mixing vIOMMU's of different p
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, 2 Jun 2021 19:45:36 -0300 Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > > from the guest page table... what page table? > > I see my confusion now, the phrasing in your earlier remark led me > think this was about allowing the no-snoop performance enhancement in > some restricted way. > > It is really about blocking no-snoop 100% of the time and then > disabling the dangerous wbinvd when the block is successful. > > Didn't closely read the kvm code :\ > > If it was about allowing the optimization then I'd expect the guest to > enable no-snoopable regions via it's vIOMMU and realize them to the > hypervisor and plumb the whole thing through. Hence my remark about > the guest page tables.. > > So really the test is just 'were we able to block it' ? Yup. Do we really still consider that there's some performance benefit to be had by enabling a device to use no-snoop? This seems largely a legacy thing. > > This support existed before mdev, IIRC we needed it for direct > > assignment of NVIDIA GPUs. > > Probably because they ignored the disable no-snoop bits in the control > block, or reset them in some insane way to "fix" broken bioses and > kept using it even though by all rights qemu would have tried hard to > turn it off via the config space. Processing no-snoop without a > working wbinvd would be fatal. Yeesh > > But Ok, back the /dev/ioasid. This answers a few lingering questions I > had.. > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY >domains. > >This doesn't actually matter. If you mix them together then kvm >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP >anywhere in this VM. > >This if two IOMMU's are joined together into a single /dev/ioasid >then we can just make them both pretend to be >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. Yes and no. 
Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's available based on the per domain support available. That gives us the most consistent behavior, ie. we don't have VMs emulating wbinvd because they used to have a device attached where the domain required it and we can't atomically remap with new flags to perform the same as a VM that never had that device attached in the first place. > 2) How to fit this part of kvm in some new /dev/ioasid world > >What we want to do here is iterate over every ioasid associated >with the group fd that is passed into kvm. Yeah, we need some better names, binding a device to an ioasid (fd) but then attaching a device to an allocated ioasid (non-fd)... I assume you're talking about the latter ioasid. >Today the group fd has a single container which specifies the >single ioasid so this is being done trivially. > >To reorg we want to get the ioasid from the device not the >group (see my note to David about the groups vs device rationale) > >This is just iterating over each vfio_device in the group and >querying the ioasid it is using. The IOMMU API group interface is largely iommu_group_for_each_dev() anyway; we still need to account for all the RIDs and aliases of a group. >Or perhaps more directly: an op attaching the vfio_device to the >kvm and having some simple helper > '(un)register ioasid with kvm (kvm, ioasid)' >that the vfio_device driver can call that just sorts this out. We could almost eliminate the device notion altogether here, use an ioasidfd_for_each_ioasid(), but we really want a way to trigger on each change to the composition of the device set for the ioasid, which is why we currently do it on addition or removal of a group, where the group has a consistent set of IOMMU properties. Register a notifier callback via the ioasidfd?
Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/3 4:37 AM, Alex Williamson wrote: On Wed, 2 Jun 2021 16:54:04 -0300 Jason Gunthorpe wrote: On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote: Right, the device can generate the no-snoop transactions, but it's the IOMMU that essentially determines whether those transactions are actually still cache coherent, AIUI. Wow, this is really confusing stuff in the code. At the PCI level there is a TLP bit called no-snoop that is platform specific. The general intention is to allow devices to selectively bypass the CPU caching for DMAs. GPUs like to use this feature for performance. Yes I assume there are some exciting security issues here. Looks like allowing cache bypass does something bad inside VMs? Looks like allowing the VM to use the cache clear instruction that is mandatory with cache bypass DMA causes some QoS issues? OK. IIRC, largely a DoS issue if userspace gets to choose when to emulate wbinvd rather than it being demanded for correct operation. So how does it work? What I see in intel/iommu.c is that some domains support "snoop control" or not, based on some HW flag. This indicates if the DMA_PTE_SNP bit is supported on a page by page basis or not. Since x86 always leans toward "DMA cache coherent" I'm reading some tea leaves here: IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA transactions */ And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the snoop bit in TLPs for IOVAs that have DMA_PTE_SNP set? That's my understanding as well. Further, I guess IOMMUs that don't support PTE_SNP, or have DMA_PTE_SNP clear, will always honour the snoop bit. (backwards compat and all) Yes. So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA incoherent with the CPU caches, it just means that the snoop bit in the TLP cannot be enforced. ie the device *could* do no-snoop DMA if it wants. Devices that never do no-snoop remain DMA coherent on x86, as they always have been. 
Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device DMA to be coherent via the IOMMU. IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means the PCI device is blocked from using no-snoop in its TLPs. I wonder if ARM implemented this consistently? I see VDPA is confused.. Basically, we don't want to bother with a pseudo KVM device like what VFIO did. So for simplicity, we rule out IOMMUs that can't enforce coherency in vhost-vDPA if the parent purely depends on the platform IOMMU:

	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		return -ENOTSUPP;

For parents that use their own translation logic, an implicit assumption is that the hardware will always perform cache coherent DMA. Thanks
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
I agree, but I want to know why the indirect descriptor needs to be disabled. The table can't be written by the device since it's not a coherent swiotlb mapping. I had all kinds of problems with uninitialized entries in the indirect table. So I gave up on it and concluded it would be too difficult to secure. -Andi
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Thursday, June 3, 2021 1:20 AM > [...] > > I wonder if there's a way to model this using a nested AS rather than > > requiring special operations. e.g.
> >
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> >    |
> >    \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for > > example) qemu platform code would map all guest memory (using > > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's > > IO mappings into the 'rid' IOAS in terms of GPA. > > > > This wouldn't quite work as is, because the 'prereg' IOAS would have > > no devices. But we could potentially have another call to mark an > > IOAS as a purely "preregistration" or pure virtual IOAS. Using that > > would be an alternative to attaching devices. > > It is one option for sure, this is where I was thinking when we were > talking in the other thread. I think the decision is best > implementation driven as the data structure to store the > preregistration data should be rather purpose built. Yes. For now I prefer managing prereg through a separate cmd instead of special-casing it in the IOASID graph. Anyway this is sort of a per-fd thing.

> > > /*
> > >  * Map/unmap process virtual addresses to I/O virtual addresses.
> > >  *
> > >  * Provide VFIO type1 equivalent semantics. Start with the same
> > >  * restriction, e.g. the unmap size should match those used in the
> > >  * original mapping call.
> > >  *
> > >  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >  * must be already in the preregistered list.
> > >  *
> > >  * Input parameters:
> > >  * - u32 ioasid;
> > >  * - refer to vfio_iommu_type1_dma_{un}map
> > >  *
> > >  * Return: 0 on success, -errno on failure.
> > >  */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed pagetable has been bound? 
> > Me too, or a SVA page table. > > This document would do well to have a list of imagined page table > types and the set of operations that act on them. I think they are all > pretty disjoint.. > > Your presentation of 'kernel owns the table' vs 'userspace owns the > table' is a useful clarification to call out too sure, I incorporated this comment in my last reply. > > > > 5. Use Cases and Flows > > > > > > Here assume VFIO will support a new model where every bound device > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o > > > going through the legacy container/group interface. For illustration purposes > > > those devices are just called dev[1...N]: > > > > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode); > > > > Minor detail, but I'd suggest /dev/vfio/pci/:BB:SS.F for the > > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something > > for mdevs. That leaves other subdirs of /dev/vfio free for future > > non-PCI device types, and /dev/vfio itself for the legacy group > > devices. > > There are a bunch of nice options here if we go this path Yes, this part is only roughly sketched here, to focus on /dev/iommu first. In later versions it will be considered more seriously. > > > > 5.2. Multiple IOASIDs (no nesting) > > > > > > > > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially > > > both devices are attached to gpa_ioasid. > > > > Doesn't really affect your example, but note that the PAPR IOMMU does > > not have a passthrough mode, so devices will not initially be attached > > to gpa_ioasid - they will be unusable for DMA until attached to a > > gIOVA ioasid. 'initially' here is still a user-requested action. For PAPR you should do the attach only when it's necessary. Thanks Kevin
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 10:56 AM, Andi Kleen wrote: I agree, but I want to know why the indirect descriptor needs to be disabled. The table can't be written by the device since it's not a coherent swiotlb mapping. I had all kinds of problems with uninitialized entries in the indirect table. So I gave up on it and concluded it would be too difficult to secure. -Andi Ok, but what I meant is this: if we don't read from the descriptor ring, and validate all the other metadata supplied by the device (used id and len), then there should be no way for the device to suppress the dma flags to write to the indirect descriptor table. Or do you have an example of how it can do that? Thanks
RE: [RFC] /dev/ioasid uAPI proposal
> From: Alex Williamson > Sent: Thursday, June 3, 2021 10:51 AM > > On Wed, 2 Jun 2021 19:45:36 -0300 > Jason Gunthorpe wrote: > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > > > from the guest page table... what page table? > > > > I see my confusion now, the phrasing in your earlier remark led me > > think this was about allowing the no-snoop performance enhancement in > > some restricted way. > > > > It is really about blocking no-snoop 100% of the time and then > > disabling the dangerous wbinvd when the block is successful. > > > > Didn't closely read the kvm code :\ > > > > If it was about allowing the optimization then I'd expect the guest to > > enable no-snoopable regions via it's vIOMMU and realize them to the > > hypervisor and plumb the whole thing through. Hence my remark about > > the guest page tables.. > > > > So really the test is just 'were we able to block it' ? > > Yup. Do we really still consider that there's some performance benefit > to be had by enabling a device to use no-snoop? This seems largely a > legacy thing. Yes, there is indeed a performance benefit for a device to use no-snoop, e.g. 8K display and some image processing paths, etc. The problem is that the IOMMU for such devices is typically a different one from the default IOMMU for most devices. This special IOMMU may not have the ability to enforce snoop on no-snoop PCI traffic, so this fact must be understood by KVM to do proper mtrr/pat/wbinvd virtualization for such devices to work correctly.
Processing no-snoop without a > > working wbinvd would be fatal. Yeesh > > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I > > had.. > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY > and !IOMMU_CAP_CACHE_COHERENCY > >domains. > > > >This doesn't actually matter. If you mix them together then kvm > >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP > >anywhere in this VM. > > > >This if two IOMMU's are joined together into a single /dev/ioasid > >then we can just make them both pretend to be > >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then > we > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's > available based on the per domain support available. That gives us the > most consistent behavior, ie. we don't have VMs emulating wbinvd > because they used to have a device attached where the domain required > it and we can't atomically remap with new flags to perform the same as > a VM that never had that device attached in the first place. > > > 2) How to fit this part of kvm in some new /dev/ioasid world > > > >What we want to do here is iterate over every ioasid associated > >with the group fd that is passed into kvm. > > Yeah, we need some better names, binding a device to an ioasid (fd) but > then attaching a device to an allocated ioasid (non-fd)... I assume > you're talking about the latter ioasid. > > >Today the group fd has a single container which specifies the > >single ioasid so this is being done trivially. > > > >To reorg we want to get the ioasid from the device not the > >group (see my note to David about the groups vs device rational) > > > >This is just iterating over each vfio_device in the group and > >querying the ioasid it is using. > > The IOMMU API group interfaces is largely iommu_group_for_each_dev() > anyway, we still need to account for all the RIDs and aliases of a > group. 
> > >Or perhaps more directly: an op attaching the vfio_device to the >kvm and having some simple helper > > '(un)register ioasid with kvm (kvm, ioasid)' >that the vfio_device driver can call that just sorts this out. > > We could almost eliminate the device notion altogether here, use an > ioasidfd_for_each_ioasid(), but we really want a way to trigger on each > change to the composition of the device set for the ioasid, which is > why we currently do it on addition or removal of a group, where the > group has a consistent set of IOMMU properties. Register a notifier > callback via the ioasidfd? Thanks, > When discussing I/O page fault support in another thread, the consensus is that a device handle will be registered (by user) or allocated (returned to user) in /dev/ioasid when binding the device to the ioasid fd. From this angle we can register {ioasid_fd, device_handle} to KVM and then call something like ioasidfd_device_is_coherent()
Re: [PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX
On 6/2/21 5:41 PM, Andi Kleen wrote:

+int arch_has_restricted_virtio_memory_access(void)
+{
+	return is_tdx_guest();
+}
+EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
+

This function definition had to be removed from arch/x86/mm/mem_encrypt.c. Otherwise, if you enable both CONFIG_AMD_MEM_ENCRYPT and CONFIG_X86_MEM_ENCRYPT_COMMON it will generate a multiple definition error.

--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -493,9 +493,3 @@ void __init amd_mem_encrypt_init(void)
 	print_mem_encrypt_feature_info();
 }
-
-int arch_has_restricted_virtio_memory_access(void)
-{
-	return sev_active();
-}
-EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -40,7 +40,7 @@ void __init mem_encrypt_init(void)
 int arch_has_restricted_virtio_memory_access(void)
 {
-	return is_tdx_guest();
+	return (is_tdx_guest() || sev_active());
 }
 EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, 3 Jun 2021 03:22:27 + "Tian, Kevin" wrote: > > From: Alex Williamson > > Sent: Thursday, June 3, 2021 10:51 AM > > > > On Wed, 2 Jun 2021 19:45:36 -0300 > > Jason Gunthorpe wrote: > > > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > > > > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > > > > from the guest page table... what page table? > > > > > > I see my confusion now, the phrasing in your earlier remark led me > > > think this was about allowing the no-snoop performance enhancement in > > > some restricted way. > > > > > > It is really about blocking no-snoop 100% of the time and then > > > disabling the dangerous wbinvd when the block is successful. > > > > > > Didn't closely read the kvm code :\ > > > > > > If it was about allowing the optimization then I'd expect the guest to > > > enable no-snoopable regions via it's vIOMMU and realize them to the > > > hypervisor and plumb the whole thing through. Hence my remark about > > > the guest page tables.. > > > > > > So really the test is just 'were we able to block it' ? > > > > Yup. Do we really still consider that there's some performance benefit > > to be had by enabling a device to use no-snoop? This seems largely a > > legacy thing. > > Yes, there is indeed performance benefit for device to use no-snoop, > e.g. 8K display and some imaging processing path, etc. The problem is > that the IOMMU for such devices is typically a different one from the > default IOMMU for most devices. This special IOMMU may not have > the ability of enforcing snoop on no-snoop PCI traffic then this fact > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization > for such devices to work correctly. The case where the IOMMU does not support snoop-control for such a device already works fine, we can't prevent no-snoop so KVM will emulate wbinvd. The harder one is if we should opt to allow no-snoop even if the IOMMU does support snoop-control. 
> > > > This support existed before mdev, IIRC we needed it for direct > > > > assignment of NVIDIA GPUs. > > > > > > Probably because they ignored the disable no-snoop bits in the control > > > block, or reset them in some insane way to "fix" broken bioses and > > > kept using it even though by all rights qemu would have tried hard to > > > turn it off via the config space. Processing no-snoop without a > > > working wbinvd would be fatal. Yeesh > > > > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I > > > had.. > > > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY > > and !IOMMU_CAP_CACHE_COHERENCY > > >domains. > > > > > >This doesn't actually matter. If you mix them together then kvm > > >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP > > >anywhere in this VM. > > > > > >This if two IOMMU's are joined together into a single /dev/ioasid > > >then we can just make them both pretend to be > > >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. > > > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then > > we > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's > > available based on the per domain support available. That gives us the > > most consistent behavior, ie. we don't have VMs emulating wbinvd > > because they used to have a device attached where the domain required > > it and we can't atomically remap with new flags to perform the same as > > a VM that never had that device attached in the first place. > > > > > 2) How to fit this part of kvm in some new /dev/ioasid world > > > > > >What we want to do here is iterate over every ioasid associated > > >with the group fd that is passed into kvm. > > > > Yeah, we need some better names, binding a device to an ioasid (fd) but > > then attaching a device to an allocated ioasid (non-fd)... I assume > > you're talking about the latter ioasid. 
> > > > >Today the group fd has a single container which specifies the > > >single ioasid so this is being done trivially. > > > > > >To reorg we want to get the ioasid from the device not the > > >group (see my note to David about the groups vs device rational) > > > > > >This is just iterating over each vfio_device in the group and > > >querying the ioasid it is using. > > > > The IOMMU API group interfaces is largely iommu_group_for_each_dev() > > anyway, we still need to account for all the RIDs and aliases of a > > group. > > > > >Or perhaps more directly: an op attaching the vfio_device to the > > >kvm and having some simple helper > > > '(un)register ioasid with kvm (kvm, ioasid)' > > >that the vfio_device driver can call that just sorts this out. > > > > We could almost eliminate the device notion altogether here, use an > > ioasidfd_for_each_ioasid() but we really want a way to trigger on each > > change to the composition of the device set for the
RE: [RFC] /dev/ioasid uAPI proposal
> From: Alex Williamson > Sent: Thursday, June 3, 2021 12:15 PM > > On Thu, 3 Jun 2021 03:22:27 + > "Tian, Kevin" wrote: > > > > From: Alex Williamson > > > Sent: Thursday, June 3, 2021 10:51 AM > > > > > > On Wed, 2 Jun 2021 19:45:36 -0300 > > > Jason Gunthorpe wrote: > > > > > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > > > > > > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > > > > > from the guest page table... what page table? > > > > > > > > I see my confusion now, the phrasing in your earlier remark led me > > > > think this was about allowing the no-snoop performance enhancement > in > > > > some restricted way. > > > > > > > > It is really about blocking no-snoop 100% of the time and then > > > > disabling the dangerous wbinvd when the block is successful. > > > > > > > > Didn't closely read the kvm code :\ > > > > > > > > If it was about allowing the optimization then I'd expect the guest to > > > > enable no-snoopable regions via it's vIOMMU and realize them to the > > > > hypervisor and plumb the whole thing through. Hence my remark about > > > > the guest page tables.. > > > > > > > > So really the test is just 'were we able to block it' ? > > > > > > Yup. Do we really still consider that there's some performance benefit > > > to be had by enabling a device to use no-snoop? This seems largely a > > > legacy thing. > > > > Yes, there is indeed performance benefit for device to use no-snoop, > > e.g. 8K display and some imaging processing path, etc. The problem is > > that the IOMMU for such devices is typically a different one from the > > default IOMMU for most devices. This special IOMMU may not have > > the ability of enforcing snoop on no-snoop PCI traffic then this fact > > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization > > for such devices to work correctly. 
> > The case where the IOMMU does not support snoop-control for such a > device already works fine, we can't prevent no-snoop so KVM will > emulate wbinvd. The harder one is if we should opt to allow no-snoop > even if the IOMMU does support snoop-control. In other discussion we are leaning toward a per-device capability reporting scheme through /dev/ioasid (or /dev/iommu as the new name). It seems natural to also allow setting a capability, e.g. no-snoop for a device, if the underlying IOMMU driver allows it. > > > > > > This support existed before mdev, IIRC we needed it for direct > > > > > assignment of NVIDIA GPUs. > > > > > > > > Probably because they ignored the disable no-snoop bits in the control > > > > block, or reset them in some insane way to "fix" broken bioses and > > > > kept using it even though by all rights qemu would have tried hard to > > > > turn it off via the config space. Processing no-snoop without a > > > > working wbinvd would be fatal. Yeesh > > > > > > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I > > > > had.. > > > > > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY > > > and !IOMMU_CAP_CACHE_COHERENCY > > > >domains. > > > > > > > >This doesn't actually matter. If you mix them together then kvm > > > >will turn on wbinvd anyhow, so we don't need to use the > DMA_PTE_SNP > > > >anywhere in this VM. > > > > > > > >This if two IOMMU's are joined together into a single /dev/ioasid > > > >then we can just make them both pretend to be > > > >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. > > > > > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY > then > > > we > > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's > > > available based on the per domain support available. That gives us the > > > most consistent behavior, ie. 
we don't have VMs emulating wbinvd > > > because they used to have a device attached where the domain required > > > it and we can't atomically remap with new flags to perform the same as > > > a VM that never had that device attached in the first place. > > > > > > > 2) How to fit this part of kvm in some new /dev/ioasid world > > > > > > > >What we want to do here is iterate over every ioasid associated > > > >with the group fd that is passed into kvm. > > > > > > Yeah, we need some better names, binding a device to an ioasid (fd) but > > > then attaching a device to an allocated ioasid (non-fd)... I assume > > > you're talking about the latter ioasid. > > > > > > >Today the group fd has a single container which specifies the > > > >single ioasid so this is being done trivially. > > > > > > > >To reorg we want to get the ioasid from the device not the > > > >group (see my note to David about the groups vs device rational) > > > > > > > >This is just iterating over each vfio_device in the group and > > > >querying the ioasid it is using. > > > > > > The IOMMU API group interfaces is largely iommu_group_for_each_dev() > > > anyway, we still need to account for all the RIDs
Re: [RFC] /dev/ioasid uAPI proposal
On 6/3/21 7:23 AM, Jason Gunthorpe wrote: On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote: On 6/2/21 1:26 AM, Jason Gunthorpe wrote: On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote: This version only covers 1) and 4). Do you think we need to support 2), 3) and beyond? Yes absolutely. The API should be flexible enough to specify the creation of all future page table formats we'd want to have and all HW specific details on those formats. OK, staying on the same line: if so, it seems that we need some in-kernel helpers and uAPIs to support pre-installing a page table to an IOASID. Not sure what this means.. Sorry that I didn't make this clear. Let me bring back the page table types in my eyes:

1) IOMMU format page table (a.k.a. iommu_domain)
2) user application CPU page table (SVA for example)
3) KVM EPT (future option)
4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID. We have a BIND protocol for 4); we explicitly allocate an iommu_domain for 1). But we don't have a clear definition for 2), 3) and others. I think it's necessary to clearly define a time point and kAPI name between IOASID_ALLOC and IOASID_ATTACH, so that other modules have the opportunity to associate their page table with the allocated IOASID before attaching the page table to the real IOMMU hardware. In my mind these are all actions of creation..

#1 is ALLOC_IOASID 'to be compatible with these devices attached to this FD'
#2 is ALLOC_IOASID_SVA
#3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
#4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have a set of operations that are allowed - map/unmap is only legal on #1, invalidate is only legal on #4, etc. This sounds reasonable. The corresponding page table types and required callbacks are also part of it. 
How you want to split this up in the ioctl interface is a more interesting question. I generally like more calls than giant unwieldy multiplexer structs, but some things are naturally flags and optional modifications of a single ioctl. In any event they should have a similar naming 'ALLOC_IOASID_XXX' and then a single 'DESTROY_IOASID' that works on all of them. I/O page fault handling is similar. The provider of the page table should take the responsibility to handle the possible page faults. For the faultable types, yes #3 and #4 should hook in the fault handler and deal with it. Agreed. Best regards, baolu
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote: > > > I don't think presence or absence of a group fd makes a lot of > > difference to this design. Having a group fd just means we attach > > groups to the ioasid instead of individual devices, and we no longer > > need the bookkeeping of "partial" devices. > > Oh, I think we really don't want to attach the group to an ioasid, or > at least not as a first-class idea. > > The fundamental problem that got us here is we now live in a world > where there are many ways to attach a device to an IOASID: I'm not seeing that that's necessarily a problem. > - A RID binding > - A RID,PASID binding > - A RID,PASID binding for ENQCMD I have to admit I haven't fully grasped the differences between these modes. I'm hoping we can consolidate at least some of them into the same sort of binding onto different IOASIDs (which may be linked in parent/child relationships). > - A SW TABLE binding > - etc > > The selection of which mode to use is based on the specific > driver/device operation. Ie the thing that implements the 'struct > vfio_device' is the thing that has to select the binding mode. I thought userspace selected the binding mode - although not all modes will be possible for all devices. > group attachment was fine when there was only one mode. As you say it > is fine to just attach every group member with RID binding if RID > binding is the only option. > > When SW TABLE binding was added the group code was hacked up - now the > group logic is choosing between RID/SW TABLE in a very hacky and mdev > specific way, and this is just a mess. Sounds like it. What do you propose instead to handle backwards compatibility for group-based VFIO code? 
> The flow must carry the IOASID from the /dev/iommu to the vfio_device > driver and the vfio_device implementation must choose which binding > mode and parameters it wants based on driver and HW configuration. > > eg if two PCI devices are in a group then it is perfectly fine that > one device uses RID binding and the other device uses RID,PASID > binding. U... I don't see how that can be. They could well be in the same group because their RIDs cannot be distinguished from each other. > The only place I see for a "group bind" in the uAPI is some compat > layer for the vfio container, and the implementation would be quite > different, we'd have to call each vfio_device driver in the group and > execute the IOASID attach IOCTL. > > > > I would say no on the container. /dev/ioasid == the container, having > > > two competing objects at once in a single process is just a mess. > > > > Right. I'd assume that for compatibility, creating a container would > > create a single IOASID under the hood with a compatibility layer > > translating the container operations to ioasid operations. > > It is a nice dream for sure > > /dev/vfio could be a special case of /dev/ioasid just with a different > uapi and ending up with only one IOASID. They could be interchangeable > from then on, which would simplify the internals of VFIO if it > consistently dealt with these new ioasid objects everywhere. But last I > looked it was complicated enough to best be done later on > > Jason > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, Jun 03, 2021 at 01:29:58AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, June 3, 2021 12:09 AM > > > > On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, June 2, 2021 1:42 AM > > > > > > > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote: > > > > > > From: Jason Gunthorpe > > > > > > Sent: Saturday, May 29, 2021 1:36 AM > > > > > > > > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote: > > > > > > > > > > > > > IOASID nesting can be implemented in two ways: hardware nesting > > and > > > > > > > software nesting. With hardware support the child and parent I/O > > page > > > > > > > tables are walked consecutively by the IOMMU to form a nested > > > > translation. > > > > > > > When it's implemented in software, the ioasid driver is > > > > > > > responsible > > for > > > > > > > merging the two-level mappings into a single-level shadow I/O page > > > > table. > > > > > > > Software nesting requires both child/parent page tables operated > > > > through > > > > > > > the dma mapping protocol, so any change in either level can be > > > > captured > > > > > > > by the kernel to update the corresponding shadow mapping. > > > > > > > > > > > > Why? A SW emulation could do this synchronization during > > invalidation > > > > > > processing if invalidation contained an IOVA range. > > > > > > > > > > In this proposal we differentiate between host-managed and user- > > > > > managed I/O page tables. If host-managed, the user is expected to use > > > > > map/unmap cmd explicitly upon any change required on the page table. > > > > > If user-managed, the user first binds its page table to the IOMMU and > > > > > then use invalidation cmd to flush iotlb when necessary (e.g. > > > > > typically > > > > > not required when changing a PTE from non-present to present). 
> > > > > > > > > > We expect user to use map+unmap and bind+invalidate respectively > > > > > instead of mixing them together. Following this policy, map+unmap > > > > > must be used in both levels for software nesting, so changes in either > > > > > level are captured timely to synchronize the shadow mapping. > > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when > > > > it is created. If you put two different types in a tree then each IOASID > > > > must continue to use its own operation mode. > > > > > > > > I don't see a reason to force all IOASIDs in a tree to be consistent?? > > > > > > only for software nesting. With hardware support the parent uses map > > > while the child uses bind. > > > > > > Yes, the policy is specified per IOASID. But if the policy violates the > > > requirement in a specific nesting mode, then nesting should fail. > > > > I don't get it. > > > > If the IOASID is a page table then it is bind/invalidate. SW or not SW > > doesn't matter at all. > > > > > > > > > > A software emulated two level page table where the leaf level is a > > > > bound page table in guest memory should continue to use > > > > bind/invalidate to maintain the guest page table IOASID even though it > > > > is a SW construct. > > > > > > with software nesting the leaf should be a host-managed page table > > > (or metadata). A bind/invalidate protocol doesn't require the user > > > to notify the kernel of every page table change. > > > > The purpose of invalidate is to inform the implementation that the > > page table has changed so it can flush the caches. If the page table > > is changed and invalidation is not issued then the implementation > > is free to ignore the changes. > > > > In this way the SW mode is the same as a HW mode with an infinite > > cache. > > > > The collapsed shadow page table is really just a cache. > > > > OK. 
One additional thing is that we may need a 'caching_mode' > thing reported by /dev/ioasid, indicating whether invalidation is > required when changing non-present to present. For hardware > nesting it's not reported as the hardware IOMMU will walk the > guest page table in cases of iotlb miss. For software nesting > caching_mode is reported so the user must issue invalidation > upon any change in guest page table so the kernel can update > the shadow page table timely. For the first cut, I'd have the API assume that invalidates are *always* required. Some bypass to avoid them in cases where they're not needed can be an additional extension. > Following this and your other comment with David, we will mark > host-managed vs. guest-managed explicitly for I/O page table > of each IOASID. map+unmap or bind+invalidate is decided by > which owner is specified by the user. > > Thanks > Kevin
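The proposed 'caching_mode' report could look something like the sketch below from the user side: the kernel reports whether the IOASID behaves like a cached walk (software nesting, shadow table) and userspace invalidates accordingly. The flag name and struct are assumptions, not real uAPI.

```c
/* Hypothetical sketch: userspace decides whether a non-present -> present
 * PTE change needs an explicit invalidation, based on a reported flag. */
#include <stdbool.h>
#include <stdint.h>

#define IOASID_INFO_F_CACHING (1u << 0)   /* hypothetical flag */

struct ioasid_info {
	uint32_t flags;                    /* as reported by the kernel */
};

/* SW nesting: the kernel maintains a shadow table, so it must hear about
 * every update (flag set).  HW nesting: the IOMMU walks the guest table
 * itself on IOTLB miss, so non-present -> present needs no flush. */
static bool must_invalidate_on_present(const struct ioasid_info *info)
{
	return info->flags & IOASID_INFO_F_CACHING;
}
```

David's "first cut" position amounts to behaving as if the flag were always set, with the bypass added later as an extension.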
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, Jun 03, 2021 at 02:49:56AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, June 3, 2021 12:59 AM > > > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote: > > > > > /* Bind guest I/O page table */ > > > > > bind_data = { > > > > > .ioasid = gva_ioasid; > > > > > .addr = gva_pgtable1; > > > > > // and format information > > > > > }; > > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); > > > > > > > > Again I do wonder if this should just be part of alloc_ioasid. Is > > > > there any reason to split these things? The only advantage to the > > > > split is the device is known, but the device shouldn't impact > > > > anything.. > > > > > > I'm pretty sure the device(s) could matter, although they probably > > > won't usually. > > > > It is a bit subtle, but the /dev/iommu fd itself is connected to the > > devices first. This prevents wildly incompatible devices from being > > joined together, and allows some "get info" to report the capability > > union of all devices if we want to do that. > > I would expect the capability reported per-device via /dev/iommu. > Incompatible devices can bind to the same fd but cannot attach to > the same IOASID. This allows incompatible devices to share locked > page accounting. Yeah... I'm not convinced that everything relevant here can be reported per-device. I think we may have edge cases where combinations of devices have restrictions that individual devices in the set do not.
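David's edge case — that a *combination* of devices can be more restricted than any one device reports — can be illustrated with a toy model. The capability bits, the cross-IOMMU restriction, and all names below are invented purely to make the point; no real interface works this way.

```c
/* Toy illustration: the usable capability set of an IOASID is not always
 * the plain intersection of per-device capabilities.  A combination can
 * add a restriction (here: a format that cannot span two host IOMMUs)
 * that no single device reports on its own.  Entirely hypothetical. */
#include <stdint.h>

#define CAP_FMT_A (1u << 0)
#define CAP_FMT_B (1u << 1)

struct dev {
	uint32_t caps;      /* what this device reports individually */
	int iommu_id;       /* which host IOMMU it sits behind */
};

static uint32_t ioasid_caps(const struct dev *devs, int n)
{
	uint32_t caps = ~0u;

	/* Start from the per-device intersection... */
	for (int i = 0; i < n; i++)
		caps &= devs[i].caps;

	/* ...then apply a combination-only restriction: pretend format B
	 * cannot be used when devices span different IOMMUs. */
	for (int i = 1; i < n; i++)
		if (devs[i].iommu_id != devs[0].iommu_id)
			caps &= ~CAP_FMT_B;

	return caps;
}
```

With two devices that each report both formats, the combined IOASID still loses CAP_FMT_B as soon as they sit behind different IOMMUs — exactly the kind of information a purely per-device "get info" cannot express.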
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote: > > > > /* Bind guest I/O page table */ > > > > bind_data = { > > > > .ioasid = gva_ioasid; > > > > .addr = gva_pgtable1; > > > > // and format information > > > > }; > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); > > > > > > Again I do wonder if this should just be part of alloc_ioasid. Is > > > there any reason to split these things? The only advantage to the > > > split is the device is known, but the device shouldn't impact > > > anything.. > > > > I'm pretty sure the device(s) could matter, although they probably > > won't usually. > > It is a bit subtle, but the /dev/iommu fd itself is connected to the > devices first. This prevents wildly incompatible devices from being > joined together, and allows some "get info" to report the capability > union of all devices if we want to do that. Right.. but I've not been convinced that having a /dev/iommu fd instance be the boundary for these types of things actually makes sense. For example if we were doing the preregistration thing (whether by child ASes or otherwise) then that still makes sense across wildly different devices, but we couldn't share that layer if we have to open different instances for each of them. It really seems to me that it's at the granularity of the address space (including extended RID+PASID ASes) that we need to know what devices we have, and therefore what capabilities we have for that AS. > The original concept was that devices joined would all have to support > the same IOASID format, at least for the kernel owned map/unmap IOASID > type. Supporting different page table formats maybe is reason to > revisit that concept. > > There is a small advantage to re-using the IOASID container because of > the get_user_pages caching and pinned accounting management at the FD > level. 
Right, but at this stage I'm just not seeing a really clear (across platforms and device types) boundary for what things have to be per IOASID container and what have to be per IOASID, so I'm just not sure the /dev/iommu instance grouping makes any sense. > I don't know if that small advantage is worth the extra complexity > though. > > > But it would certainly be possible for a system to have two > > different host bridges with two different IOMMUs with different > > pagetable formats. Until you know which devices (and therefore > > which host bridge) you're talking about, you don't know what formats > > of pagetable to accept. And if you have devices from *both* bridges > > you can't bind a page table at all - you could theoretically support > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables > > in both formats, but it would be pretty reasonable not to support > > that. > > The basic process for a user space owned pgtable mode would be: > > 1) qemu has to figure out what format of pgtable to use > > Presumably it uses query functions using the device label. No... in the qemu case it would always select the page table format that it needs to present to the guest. That's part of the guest-visible platform that's selected by qemu's configuration. There's no negotiation here: either the kernel can supply what qemu needs to pass to the guest, or it can't. If it can't, qemu will have to either emulate in SW (if possible, probably using a kernel-managed IOASID to back it) or fail outright. > The > kernel code should look at the entire device path through all the > IOMMU HW to determine what is possible. > > Or it already knows because the VM's vIOMMU is running in some > fixed page table format, or the VM's vIOMMU already told it, or > something. Again, I think you have the order a bit backwards. The user selects the capabilities that the vIOMMU will present to the guest as part of the qemu configuration. 
Qemu then requests that of the host kernel, and either the host kernel supplies it, qemu emulates it in SW, or qemu fails to start. Guest visible properties of the platform never (or *should* never) depend implicitly on host capabilities - it's impossible to sanely support migration in such an environment. > 2) qemu creates an IOASID and based on #1 and says 'I want this format' Right. > 3) qemu binds the IOASID to the device. > > If qemu gets it wrong then it just fails. Right, though it may fall back to (partial) software emulation. In practice that would mean using a kernel-managed IOASID and walking the guest IO pagetables itself to mirror them into the host kernel. > 4) For the next device qemu would have to figure out if it can re-use > an existing IOASID based on the required properties. Nope. Again, what devices share an IO address space is a guest visible part of the platform. If the host kernel can't supply that, then qemu must not st
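David's configuration-first ordering can be condensed into a small sketch: the guest-visible page-table format is fixed first, and the host kernel either supplies it or qemu falls back to shadowing via a kernel-managed IOASID. The format names, the stub for the kernel's answer, and the result enum are all hypothetical.

```c
/* Sketch of the ordering argued above: no negotiation, only a yes/no
 * from the host for a format qemu already committed to.  Hypothetical. */
#include <stdbool.h>

enum pgtable_format { FMT_INTEL_FLPT, FMT_ARM_LPAE, FMT_PPC_TCE };

/* Stub standing in for an IOASID alloc with an explicit format request. */
static bool kernel_supports(enum pgtable_format fmt)
{
	return fmt == FMT_INTEL_FLPT;      /* pretend the host is Intel */
}

enum attach_result { HW_NESTING, SW_EMULATION };

/* guest_fmt comes from qemu's configuration, never from host probing. */
static enum attach_result attach_viommu(enum pgtable_format guest_fmt)
{
	if (kernel_supports(guest_fmt))
		return HW_NESTING;          /* bind the guest table directly */
	/* Otherwise shadow guest IO pagetables into a kernel-managed IOASID
	 * (or, in the worst case, qemu would refuse to start). */
	return SW_EMULATION;
}
```

The design consequence is that migration stays sane: the guest ABI is a function of configuration alone, and host capability only decides *how* that ABI is implemented.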
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:16:48PM -0300, Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote: > > > I agree with Jean-Philippe - at the very least erasing this > > > information needs a major rationale - but I don't really see why it > > > must be erased? The HW reports the originating device, is it just a > > > matter of labeling the devices attached to the /dev/ioasid FD so it > > > can be reported to userspace? > > > > HW reports the originating device as far as it knows. In many cases > > where you have multiple devices in an IOMMU group, it's because > > although they're treated as separate devices at the kernel level, they > > have the same RID at the HW level. Which means a RID for something in > > the right group is the closest you can count on supplying. > > Granted there may be cases where exact fidelity is not possible, but > that doesn't excuse eliminating fidelity where it does exist.. > > > > If there are no hypervisor traps (does this exist?) then there is no > > > way to involve the hypervisor here and the child IOASID should simply > > > be a pointer to the guest's data structure that describes binding. In > > > this case that IOASID should claim all PASIDs when bound to a > > > RID. > > > > And in that case I think we should call that object something other > > than an IOASID, since it represents multiple address spaces. > > Maybe.. It is certainly a special case. > > We can still consider it a single "address space" from the IOMMU > perspective. What has happened is that the address table is not just a > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA". True. This does complexify how we represent what IOVA ranges are valid, though. I'll bet you most implementations don't actually implement a full 64-bit IOVA, which means we effectively have a large number of windows from (0..max IOVA) for each valid pasid. 
This adds another reason I don't think my concept of IOVA windows is just a power specific thing. > If we are already going in the direction of having the IOASID specify > the page table format and other details, specifying that the page > table format is the 80 bit "PASID, IOVA" format is a fairly small > step. Well, rather I think userspace needs to request what page table format it wants and the kernel tells it whether it can oblige or not. > I wouldn't twist things into knots to create a difference, but if it > is easy to do it wouldn't hurt either. > > Jason
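The "extended IOVA" view discussed above — an effective address that is the pair (PASID, IOVA), with validity expressed as a per-PASID window rather than one flat 64-bit range — can be sketched as follows. The field widths and the window check are illustrative assumptions only.

```c
/* Sketch: validity of an extended address (PASID, IOVA) is a per-PASID
 * window, since few IOMMUs implement a full 64-bit IOVA.  Hypothetical. */
#include <stdbool.h>
#include <stdint.h>

struct ext_iova {
	uint32_t pasid;        /* ~20 bits in PCIe, widened for illustration */
	uint64_t iova;
};

struct iova_window {
	uint64_t start, end;   /* half-open range [start, end) */
};

/* Each valid PASID contributes its own (0..max IOVA)-style window, so the
 * caller looks up the window for a.pasid before checking the IOVA. */
static bool ext_iova_valid(struct ext_iova a,
			   const struct iova_window *win_for_pasid)
{
	return a.iova >= win_for_pasid->start && a.iova < win_for_pasid->end;
}
```

This is essentially David's point: reporting valid ranges becomes a set of windows keyed by PASID, not one interval, which is why the windows concept is not power-specific.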
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote: > On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote: > > > Is there a compelling reason to have all the IOASIDs handled by one > > FD? > > There was an answer on this, if every PASID needs an IOASID then there > are too many FDs. Too many in what regard? fd limits? Something else? It seems to me there are two different cases for PASID handling here. One is where userspace explicitly creates each valid PASID and attaches a separate pagetable for each (or handles each with MAP/UNMAP). In that case I wouldn't have expected there to be too many fds. Then there's the case where we register a whole PASID table, in which case I think you only need the one FD. We can treat that as creating an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of pagetables for each PASID). > It is difficult to share the get_user_pages cache across FDs. Ah... hrm, yes I can see that. > There are global properties in the /dev/iommu FD, like what devices > are part of it, that are important for group security operations. This > becomes confused if it is split to many FDs. I'm still not seeing those. I'm really not seeing any well-defined meaning to devices being attached to the fd, but not to a particular IOAS. > > > I/O address space can be managed through two protocols, according to > > > whether the corresponding I/O page table is constructed by the kernel or > > > the user. When kernel-managed, a dma mapping protocol (similar to > > > existing VFIO iommu type1) is provided for the user to explicitly specify > > > how the I/O address space is mapped. Otherwise, a different protocol is > > > provided for the user to bind a user-managed I/O page table to the > > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault > > > handling. > > > > > > Pgtable binding protocol can be used only on the child IOASID's, implying > > > IOASID nesting must be enabled. 
This is because the kernel doesn't trust > > > userspace. Nesting allows the kernel to enforce its DMA isolation policy > > > through the parent IOASID. > > > > To clarify, I'm guessing that's a restriction of likely practice, > > rather than a fundamental API restriction. I can see a couple of > > theoretical future cases where a user-managed pagetable for a "base" > > IOASID would be feasible: > > > > 1) On some fancy future MMU allowing free nesting, where the kernel > > would insert an implicit extra layer translating user addresses > > to physical addresses, and the userspace manages a pagetable with > > its own VAs being the target AS > > I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one > where the IOVA == process VA and the kernel maintains this mapping. That makes sense. Needs a different name to avoid Intel and PCI specificness, but having a trivial "pagetable format" which just says IOVA == user address is a nice idea. > Since the uAPI is so general I do have a general expectation that the > drivers/iommu implementations might need to be a bit more complicated, > like if the HW can optimize certain specific graphs of IOASIDs we > would still model them as graphs and the HW driver would have to > "compile" the graph into the optimal hardware. > > This approach has worked reasonably in other kernel areas. That seems sensible. > > 2) For a purely software virtual device, where its virtual DMA > > engine can interpret user addresses fine > > This also sounds like an SVA IOASID. Ok. > Depending on HW if a device can really only bind to a very narrow kind > of IOASID then it should ask for that (probably platform specific!) > type during its attachment request to drivers/iommu. > > eg "I am special hardware and only know how to do PLATFORM_BLAH > transactions, give me an IOASID compatible with that". If the only way > to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is > hardwired to the CPU ASID then that is just how it is. 
Fair enough. > > I wonder if there's a way to model this using a nested AS rather than > > requiring special operations. e.g. > > > > 'prereg' IOAS > > | > > \- 'rid' IOAS > >| > >\- 'pasid' IOAS (maybe) > > > > 'prereg' would have a kernel managed pagetable into which (for > > example) qemu platform code would map all guest memory (using > > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's > > IO mappings into the 'rid' IOAS in terms of GPA. > > > > This wouldn't quite work as is, because the 'prereg' IOAS would have > > no devices. But we could potentially have another call to mark an > > IOAS as a purely "preregistration" or pure virtual IOAS. Using that > > would be an alternative to attaching devices. > > It is one option for sure, this is where I was thinking when we were > talking in the other thread. I think the decision is best > implementation driven as the datastructure to store the > preregsitratio