Re: [RFC] /dev/ioasid uAPI proposal
On 04/06/21 19:22, Jason Gunthorpe wrote: 4) The KVM interface is the very simple enable/disable WBINVD. Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required to enable WBINVD at KVM. The KVM interface is the same kvm-vfio device that exists already. The userspace API does not need to change at all: adding one VFIO file descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD functionality (see kvm_vfio_update_coherency). Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it seems useless complication compared to just using what we have now, at least while VMs only use IOASIDs via VFIO. Either way, there should be no policy attached to the add/delete operations. KVM users want to add the VFIO (or IOASID) file descriptors to the device independent of WBINVD. If userspace wants/needs to apply its own policy on whether to enable WBINVD or not, they can do it on the VFIO/IOASID side: 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID it communicates its no-snoop configuration: - 0 enable, allow WBINVD - 1 automatic disable, block WBINVD if the platform IOMMU can police it (what we do today) - 2 force disable, do not allow BINVD ever Though, like Alex, it's also not clear to me whether force-disable is useful. Instead userspace can query the IOMMU or the device to ensure it's not enabled. Paolo ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Friday, June 4, 2021 8:34 PM > > On Fri, Jun 04, 2021 at 06:08:28AM +, Tian, Kevin wrote: > > > In Qemu case the problem is that it doesn't know the list of devices > > that will be attached to an IOASID when it's created. This is a guest- > > side knowledge which is conveyed one device at a time to Qemu > > though vIOMMU. > > At least for the guest side it is alot simpler because the vIOMMU > being emulated will define nearly everything. > > qemu will just have to ask the kernel for whatever it is the guest is > doing. If the kernel can't do it then qemu has to SW emulate. > > The no-snoop block may be the only thing that is under qemu's control > because it is transparent to the guest. > > This will probably become clearer as people start to define what the > get_info should return. > Sure. Just to clarify my comment that it is for " Perhaps creating an IOASID should pass in a list of the device labels that the IOASID will be used with". My point is that Qemu doesn't know this fact before the guest completes binding page table to all relevant devices, while IOASID must be created when the table is bound to first device. So Qemu just needs to create IOASID with format that is required for the current device. Incompatibility will be detected when attaching other devices later. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Friday, June 4, 2021 8:09 PM > > On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Thursday, June 3, 2021 9:05 PM > > > > > > > > > > > > > 3) Device accepts any PASIDs from the guest. No > > > > >vPASID/pPASID translation is possible. (classic vfio_pci) > > > > > 4) Device accepts any PASID from the guest and has an > > > > >internal vPASID/pPASID translation (enhanced vfio_pci) > > > > > > > > what is enhanced vfio_pci? In my writing this is for mdev > > > > which doesn't support ENQCMD > > > > > > This is a vfio_pci that mediates some element of the device interface > > > to communicate the vPASID/pPASID table to the device, using Max's > > > series for vfio_pci drivers to inject itself into VFIO. > > > > > > For instance a device might send a message through the PF that the VF > > > has a certain vPASID/pPASID translation table. This would be useful > > > for devices that cannot use ENQCMD but still want to support migration > > > and thus need vPASID. > > > > I still don't quite get. If it's a PCI device why is PASID translation > > required? > > Just delegate the per-RID PASID space to user as type-3 then migrating the > > vPASID space is just straightforward. > > This is only possible if we get rid of the global pPASID allocation > (honestly is my preference as it makes the HW a lot simpler) > In this proposal global vs. per-RID allocation is a per-device policy. for vfio-pci it can always use per-RID (regardless of whether the device is partially mediated or not) and no vPASID/pPASID conversion. Even for mdev if no ENQCMD we can still do per-RID conversion. only for mdev which has ENQCMD we need global pPASID allocation. I think this is the motivation you explained earlier that it's not good to have one global PASID allocator in the kernel. per-RID vs. global should be selected per device. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote: > On Fri, 4 Jun 2021 14:22:07 -0300 > Jason Gunthorpe wrote: > > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote: > > > On 04/06/21 18:03, Jason Gunthorpe wrote: > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote: > > > > > I don't want a security proof myself; I want to trust VFIO to make > > > > > the right > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device). > > > > > > > > > > Given how KVM is just a device driver inside Linux, VMs should be a > > > > > slightly > > > > > more roundabout way to do stuff that is accessible to bare metal; not > > > > > a way > > > > > to gain extra privilege. > > > > > > > > Okay, fine, lets turn the question on its head then. > > > > > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO > > > > application can make use of no-snoop optimizations. The ability of KVM > > > > to execute wbinvd should be tied to the ability of that IOCTL to run > > > > in a normal process context. > > > > > > > > So, under what conditions do we want to allow VFIO to giave a process > > > > elevated access to the CPU: > > > > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. > > > #2+#3 would be worse than what we have today), but IIUC the proposal (was > > > it > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl, > > > which then would be on VFIO and not on KVM. > > > > At the end of the day we need an ioctl with two arguments: > > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever) > > - The KVM FD to control wbinvd support on > > > > Philosophically it doesn't matter too much which subsystem that ioctl > > lives, but we have these obnoxious cross module dependencies to > > consider.. > > > > Framing the question, as you have, to be about the process, I think > > explains why KVM doesn't really care what is decided, so long as the > > process and the VM have equivalent rights. > > > > Alex, how about a more fleshed out suggestion: > > > > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID > > it communicates its no-snoop configuration: > > Communicates to whom? To the /dev/iommu FD which will have to maintain a list of devices attached to it internally. > > - 0 enable, allow WBINVD > > - 1 automatic disable, block WBINVD if the platform > >IOMMU can police it (what we do today) > > - 2 force disable, do not allow BINVD ever > > The only thing we know about the device is whether or not Enable > No-snoop is hard wired to zero, ie. it either can't generate no-snoop > TLPs ("coherent-only") or it might ("assumed non-coherent"). Here I am outlining the choice an also imagining we might want an admin knob to select the three. > If we're putting the policy decision in the hands of userspace they > should have access to wbinvd if they own a device that is assumed > non-coherent AND it's attached to an IOMMU (page table) that is not > blocking no-snoop (a "non-coherent IOASID"). There are two parts here, like Paolo was leading too. If the process has access to WBINVD and then if such an allowed process tells KVM to turn on WBINVD in the guest. If the process has a device and it has a way to create a non-coherent IOASID, then that process has access to WBINVD. For security it doesn't matter if the process actually creates the non-coherent IOASID or not. An attacker will simply do the steps that give access to WBINVD. The important detail is that access to WBINVD does not compell the process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD can still choose to create a secure guest by always using IOMMU_CACHE in its page tables and not asking KVM to enable WBINVD. This propsal shifts this policy decision from the kernel to userspace. qemu is responsible to determine if KVM should enable wbinvd or not based on if it was able to create IOASID's with IOMMU_CACHE. > Conversely, a user could create a non-coherent IOASID and attach any > device to it, regardless of IOMMU backing capabilities. Only if an > assumed non-coherent device is attached would the wbinvd be allowed. Right, this is exactly the point. Since the user gets to pick if the IOASID is coherent or not then an attacker can always reach WBINVD using only the device FD. Additional checks don't add to the security of the process. The additional checks you are describing add to the security of the guest, however qemu is capable of doing them without more help from the kernel. It is the strenth of Paolo's model that KVM should not be able to do optionally less, not more than the process itself can do. > > It is pretty simple from a /dev/ioasid perpsective, covers todays > > compat requirement, gives some future option to allow the no-snoop > > optimization, and gives a new option for qemu to totally block wbinvd > > no matte
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, 4 Jun 2021 09:13:37 -0300 Jason Gunthorpe wrote: > On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote: > > > Could you clarify "vfio_driver"? > > This is the thing providing the vfio_device_ops function pointers. > > So vfio-pci can't know anything about this (although your no-snoop > control probing idea makes sense to me) > > But vfio_mlx5_pci can know > > So can mdev_idxd > > And kvmgt A capability on VFIO_DEVICE_GET_INFO could provide a hint to userspace. Stock vfio-pci could fill it out to the extent advertising if the device is capable of non-coherent DMA based on the Enable No-snoop probing, the device specific vfio_drivers could set it based on knowledge of the device behavior. Another bit might indicate a preference to not suppress non-coherent DMA at the IOMMU. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, 4 Jun 2021 14:22:07 -0300 Jason Gunthorpe wrote: > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote: > > On 04/06/21 18:03, Jason Gunthorpe wrote: > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote: > > > > I don't want a security proof myself; I want to trust VFIO to make the > > > > right > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device). > > > > > > > > Given how KVM is just a device driver inside Linux, VMs should be a > > > > slightly > > > > more roundabout way to do stuff that is accessible to bare metal; not a > > > > way > > > > to gain extra privilege. > > > > > > Okay, fine, lets turn the question on its head then. > > > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO > > > application can make use of no-snoop optimizations. The ability of KVM > > > to execute wbinvd should be tied to the ability of that IOCTL to run > > > in a normal process context. > > > > > > So, under what conditions do we want to allow VFIO to giave a process > > > elevated access to the CPU: > > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl, > > which then would be on VFIO and not on KVM. > > At the end of the day we need an ioctl with two arguments: > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever) > - The KVM FD to control wbinvd support on > > Philosophically it doesn't matter too much which subsystem that ioctl > lives, but we have these obnoxious cross module dependencies to > consider.. > > Framing the question, as you have, to be about the process, I think > explains why KVM doesn't really care what is decided, so long as the > process and the VM have equivalent rights. > > Alex, how about a more fleshed out suggestion: > > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID > it communicates its no-snoop configuration: Communicates to whom? > - 0 enable, allow WBINVD > - 1 automatic disable, block WBINVD if the platform >IOMMU can police it (what we do today) > - 2 force disable, do not allow BINVD ever The only thing we know about the device is whether or not Enable No-snoop is hard wired to zero, ie. it either can't generate no-snoop TLPs ("coherent-only") or it might ("assumed non-coherent"). If we're putting the policy decision in the hands of userspace they should have access to wbinvd if they own a device that is assumed non-coherent AND it's attached to an IOMMU (page table) that is not blocking no-snoop (a "non-coherent IOASID"). I think that means that the IOASID needs to be created (IOASID_ALLOC) with a flag that specifies whether this address space is coherent (IOASID_GET_INFO probably needs a flag/cap to expose if the system supports this). All mappings in this IOASID would use IOMMU_CACHE and and devices attached to it would be required to be backed by an IOMMU capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise). If only these IOASIDs exist, access to wbinvd would not be provided. (How does a user provided page table work? - reserved bit set, user error?) Conversely, a user could create a non-coherent IOASID and attach any device to it, regardless of IOMMU backing capabilities. Only if an assumed non-coherent device is attached would the wbinvd be allowed. I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD and the IOASID world needs to understand the device's ability to generate non-coherent DMA. This wbinvd ioctl would be a no-op (or some known errno) unless a non-coherent IOASID exists with a potentially non-coherent device attached. > vfio_pci may want to take this from an admin configuration knob > someplace. It allows the admin to customize if they want. > > If we can figure out a way to autodetect 2 from vfio_pci, all the > better > > 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace > to access wbinvd so it can make use of the no snoop optimization. > > wbinvd is allowed when: > - A device is joined with mode #0 > - A device is joined with mode #1 and the IOMMU cannot block > no-snoop (today) > > 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD > is blocked and userspace doesn't request to block no-snoop in the > IOASID then it is a userspace error. In my model above, the IOASID is central to this. > 4) The KVM interface is the very simple enable/disable WBINVD. > Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required > to enable WBINVD at KVM. Right, and in the new world order, vfio is only a device driver, the IOASID manages the device's DMA. wbinvd is only necessary relative to non-coherent DMA, which seems like QEMU needs to bump KVM with an ioasidfd. > It is pret
Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes
> On Jun 4, 2021, at 11:53 AM, Robin Murphy wrote: > > On 2021-06-04 18:10, Nadav Amit wrote: >>> On Jun 4, 2021, at 8:38 AM, Joerg Roedel wrote: >>> >>> Hi Nadav, >>> >>> [Adding Robin] >>> >>> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote: Nadav Amit (4): iommu/amd: Fix wrong parentheses on page-specific invalidations >>> >>> This patch is already upstream in v5.13-rc4. Please rebase to that >>> version. >> I guess it would be rc5 by the time I send it. >>> iommu/amd: Selective flush on unmap iommu/amd: Do not sync on page size changes iommu/amd: Do not use flush-queue when NpCache is on >>> >>> And I think there have been objections from Robin Murphy on Patch 3, >>> have those been worked out? >> I am still waiting for Robin’s feedback on my proposed changes. If he does >> not respond soon, I will drop this patch for now. > > Apologies, it feels like I've spent most of this week fighting fires, > and a great deal of email got skimmed and mentally filed under "nothing > so wrong that I need to respond immediately"... > > FWIW I would have written the simpler patch below, but beyond that I > think it might start descending into bikeshedding - if you still prefer > your more comprehensive refactoring, or something in between, then don't > let my personal preference in style/complexity trade-offs stand in the > way of getting a useful functional change into the AMD driver. Whichever > way, though, I *am* now sold on the idea of having some kerneldoc to > clarify these things. Thanks, I appreciate your feedback. I will add kerneldoc as you indicated. I see you took some parts of the patch I did for MediaTek, but I think this is not good enough for AMD, since AMD behavior should be different than MediaTek - they have different needs: MediaTek wants as few IOTLB flushes as possible, even if it results in flushing of many irrelevant (unmodified) entries between start and end. That’s the reason it can just use iommu_iotlb_gather_update_range(). In contrast, for AMD we do not want to flush too many irrelevant entries, specifically if the the IOMMU is virtualized. When an IOTLB flush is initiated by the VM, the hypervisor needs to scan the IOMMU page-tables for changes and synchronize it with the physical IOMMU. You don’t want this range to be too big, and that is the reason I needed iommu_iotlb_gather_is_disjoint(). I will add documentation, since clearly this information was not conveyed well enough. Thanks again, Nadav signature.asc Description: Message signed with OpenPGP ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices
[+cc John, who tested 6bf6c24720d3] On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote: > From: Xingang Wang > > When booting with devicetree, the pci_request_acs() is called after the > enumeration and initialization of PCI devices, thus the ACS is not > enabled. And ACS should be enabled when IOMMU is detected for the > PCI host bridge, so add check for IOMMU before probe of PCI host and call > pci_request_acs() to make sure ACS will be enabled when enumerating PCI > devices. I'm happy to apply this, but I'm a little puzzled about 6bf6c24720d3 ("iommu/of: Request ACS from the PCI core when configuring IOMMU linkage"). It was tested and fixed a problem, but I don't understand how. 6bf6c24720d3 added the call to pci_request_acs() in of_iommu_configure() so it currently looks like this: of_iommu_configure(dev, ...) { if (dev_is_pci(dev)) pci_request_acs(); pci_request_acs() sets pci_acs_enable, which tells us to enable ACS when enumerating PCI devices in the future. But we only call pci_request_acs() if we already *have* a PCI device. So maybe 6bf6c24720d3 fixed a problem for *some* PCI devices, but not all? E.g., did we call of_iommu_configure() for one PCI device before enumerating the rest? > Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when > configuring IOMMU linkage") > Signed-off-by: Xingang Wang > --- > drivers/iommu/of_iommu.c | 1 - > drivers/pci/of.c | 8 +++- > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > index a9d2df001149..54a14da242cc 100644 > --- a/drivers/iommu/of_iommu.c > +++ b/drivers/iommu/of_iommu.c > @@ -205,7 +205,6 @@ const struct iommu_ops *of_iommu_configure(struct device > *dev, > .np = master_np, > }; > > - pci_request_acs(); > err = pci_for_each_dma_alias(to_pci_dev(dev), >of_pci_iommu_init, &info); > } else { > diff --git a/drivers/pci/of.c b/drivers/pci/of.c > index da5b414d585a..2313c3f848b0 100644 > --- a/drivers/pci/of.c > +++ b/drivers/pci/of.c > @@ -581,9 +581,15 @@ static int pci_parse_request_of_pci_ranges(struct device > *dev, > > int devm_of_pci_bridge_init(struct device *dev, struct pci_host_bridge > *bridge) > { > - if (!dev->of_node) > + struct device_node *node = dev->of_node; > + > + if (!node) > return 0; > > + /* Detect IOMMU and make sure ACS will be enabled */ > + if (of_property_read_bool(node, "iommu-map")) > + pci_request_acs(); > + > bridge->swizzle_irq = pci_common_swizzle; > bridge->map_irq = of_irq_parse_and_map_pci; > > -- > 2.19.1 > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes
On 2021-06-04 18:10, Nadav Amit wrote: On Jun 4, 2021, at 8:38 AM, Joerg Roedel wrote: Hi Nadav, [Adding Robin] On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote: Nadav Amit (4): iommu/amd: Fix wrong parentheses on page-specific invalidations This patch is already upstream in v5.13-rc4. Please rebase to that version. I guess it would be rc5 by the time I send it. iommu/amd: Selective flush on unmap iommu/amd: Do not sync on page size changes iommu/amd: Do not use flush-queue when NpCache is on And I think there have been objections from Robin Murphy on Patch 3, have those been worked out? I am still waiting for Robin’s feedback on my proposed changes. If he does not respond soon, I will drop this patch for now. Apologies, it feels like I've spent most of this week fighting fires, and a great deal of email got skimmed and mentally filed under "nothing so wrong that I need to respond immediately"... FWIW I would have written the simpler patch below, but beyond that I think it might start descending into bikeshedding - if you still prefer your more comprehensive refactoring, or something in between, then don't let my personal preference in style/complexity trade-offs stand in the way of getting a useful functional change into the AMD driver. Whichever way, though, I *am* now sold on the idea of having some kerneldoc to clarify these things. Thanks, Robin. ->8- From: Robin Murphy Subject: [PATCH] iommu: Improve iommu_iotlb_gather helpers The Mediatek driver is not the only one which might want a basic address-based gathering behaviour, so although it's arguably simple enough to open-code, let's factor it out for the sake of cleanliness. Let's also take this opportunity to document the intent of these helpers for clarity. Signed-off-by: Robin Murphy --- drivers/iommu/mtk_iommu.c | 6 +- include/linux/iommu.h | 32 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c index e06b8a0e2b56..cd457487ce81 100644 --- a/drivers/iommu/mtk_iommu.c +++ b/drivers/iommu/mtk_iommu.c @@ -521,12 +521,8 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain, struct iommu_iotlb_gather *gather) { struct mtk_iommu_domain *dom = to_mtk_domain(domain); - unsigned long end = iova + size - 1; - if (gather->start > iova) - gather->start = iova; - if (gather->end < end) - gather->end = end; + iommu_iotlb_gather_add_range(gather, iova, size); return dom->iop->unmap(dom->iop, iova, size, gather); } diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 32d448050bf7..5f036e991937 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain, iommu_iotlb_gather_init(iotlb_gather); } +/** + * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation + * @gather: TLB gather data + * @iova: start of page to invalidate + * @size: size of page to invalidate + * + * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands + * where only the address range matters, and simply minimising intermediate + * syncs is preferred. + */ +static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather, + unsigned long iova, size_t size) +{ + unsigned long end = iova + size - 1; + + if (gather->start > iova) + gather->start = iova; + if (gather->end < end) + gather->end = end; +} + +/** + * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation + * @domain: IOMMU domain to be invalidated + * @gather: TLB gather data + * @iova: start of page to invalidate + * @size: size of page to invalidate + * + * Helper for IOMMU drivers to build invalidation commands based on individual + * pages, or with page size/table level hints which cannot be gathered if they + * differ. + */ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain, struct iommu_iotlb_gather *gather, unsigned long iova, size_t size) -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
Hi Jason, On Fri, 4 Jun 2021 13:22:00 -0300, Jason Gunthorpe wrote: > > > > Yes, in that case we should support both. Give the device driver a > > chance to handle the IOPF if it can. > > Huh? > > The device driver does not "handle the IOPF" the device driver might > inject the IOPF. You are right, I got confused with the native case where device drivers can handle the fault, or do something about it. Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v8 00/15] Restricted DMA
Hi Claire, On Thu, May 27, 2021 at 08:58:30PM +0800, Claire Chang wrote: > This series implements mitigations for lack of DMA access control on > systems without an IOMMU, which could result in the DMA accessing the > system memory at unexpected times and/or unexpected addresses, possibly > leading to data leakage or corruption. > > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is > not behind an IOMMU. As PCI-e, by design, gives the device full access to > system memory, a vulnerability in the Wi-Fi firmware could easily escalate > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a > full chain of exploits; [2], [3]). > > To mitigate the security concerns, we introduce restricted DMA. Restricted > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a > specially allocated region and does memory allocation from the same region. > The feature on its own provides a basic level of protection against the DMA > overwriting buffer contents at unexpected times. However, to protect > against general data leakage and system memory corruption, the system needs > to provide a way to restrict the DMA to a predefined memory region (this is > usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]). > > [1a] > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html > [1b] > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html > [2] https://blade.tencent.com/en/advisories/qualpwn/ > [3] > https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/ > [4] > https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132 > > v8: > - Fix reserved-memory.txt and add the reg property in example. > - Fix sizeof for of_property_count_elems_of_size in > drivers/of/address.c#of_dma_set_restricted_buffer. > - Apply Will's suggestion to try the OF node having DMA configuration in > drivers/of/address.c#of_dma_set_restricted_buffer. > - Fix typo in the comment of > drivers/of/address.c#of_dma_set_restricted_buffer. > - Add error message for PageHighMem in > kernel/dma/swiotlb.c#rmem_swiotlb_device_init and move it to > rmem_swiotlb_setup. > - Fix the message string in rmem_swiotlb_setup. Thanks for the v8. It works for me out of the box on arm64 under KVM, so: Tested-by: Will Deacon Note that something seems to have gone wrong with the mail threading, so the last 5 patches ended up as a separate thread for me. Probably worth posting again with all the patches in one place, if you can. Cheers, Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 10:27:43AM -0700, Jacob Pan wrote: > Hi Jason, > > On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe wrote: > > > On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote: > > > > > I think once it binds a device to an IOASID fd, QEMU will want to probe > > > what hardware features are available before going further with the > > > vIOMMU setup (is there PASID, PRI, which page table formats are > > > supported, > > > > I think David's point was that qemu should be told what vIOMMU it is > > emulating exactly (right down to what features it has) and then > > the goal is simply to match what the vIOMMU needs with direct HW > > support via /dev/ioasid and fall back to SW emulation when not > > possible. > > > > If qemu wants to have some auto-configuration: 'pass host IOMMU > > capabilities' similar to the CPU flags then qemu should probe the > > /dev/ioasid - and maybe we should just return some highly rolled up > > "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in > > doing this. > > > There can be mixed types of physical IOMMUs on the host. So not until a > device is attached, we would not know if the vIOMMU can match the HW > support of the device's IOMMU. Perhaps, vIOMMU should check the > least common denominator features before commit. qemu has to set the vIOMMU at VM startup time, so if it is running in some "copy host" mode the only thing it can do is evaluate the VFIO devices that are present at boot and select a vIOMMU from that list. Probably would pick the most capable physical IOMMU and software emulate the reset. platforms really should avoid creating wildly divergent IOMMUs in the same system if they want to support virtualization effectively. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/amd: Tidy up DMA ops init
On 2021-06-04 16:26, Joerg Roedel wrote: On Thu, Jun 03, 2021 at 02:48:21PM +0100, Robin Murphy wrote: As discussed on the report thread, I think it makes most sense to merge this as a fix for 5.13 and not worry about any backporting. drivers/iommu/amd/amd_iommu.h | 2 -- drivers/iommu/amd/init.c | 5 - drivers/iommu/amd/iommu.c | 31 +-- 3 files changed, 13 insertions(+), 25 deletions(-) Applied for v5.13, thanks Robin et al for the quick work on this regression. Robin, do you by chance have a Fixes tag which I can add? For the sake of justifying this as "fix" rather than "cleanup", you may as well use the flush queue commit cited in the patch log - I maintain there's nothing technically wrong with that commit itself, but it is the point at which the underlying issue becomes worth fixing due to how they interact ;) Cheers, Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
Hi Jason, On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe wrote: > On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote: > > > I think once it binds a device to an IOASID fd, QEMU will want to probe > > what hardware features are available before going further with the > > vIOMMU setup (is there PASID, PRI, which page table formats are > > supported, > > I think David's point was that qemu should be told what vIOMMU it is > emulating exactly (right down to what features it has) and then > the goal is simply to match what the vIOMMU needs with direct HW > support via /dev/ioasid and fall back to SW emulation when not > possible. > > If qemu wants to have some auto-configuration: 'pass host IOMMU > capabilities' similar to the CPU flags then qemu should probe the > /dev/ioasid - and maybe we should just return some highly rolled up > "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in > doing this. > There can be mixed types of physical IOMMUs on the host. So not until a device is attached, we would not know if the vIOMMU can match the HW support of the device's IOMMU. Perhaps, vIOMMU should check the least common denominator features before commit. Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote: > On 04/06/21 18:03, Jason Gunthorpe wrote: > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote: > > > I don't want a security proof myself; I want to trust VFIO to make the > > > right > > > judgment and I'm happy to defer to it (via the KVM-VFIO device). > > > > > > Given how KVM is just a device driver inside Linux, VMs should be a > > > slightly > > > more roundabout way to do stuff that is accessible to bare metal; not a > > > way > > > to gain extra privilege. > > > > Okay, fine, lets turn the question on its head then. > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO > > application can make use of no-snoop optimizations. The ability of KVM > > to execute wbinvd should be tied to the ability of that IOCTL to run > > in a normal process context. > > > > So, under what conditions do we want to allow VFIO to giave a process > > elevated access to the CPU: > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. > #2+#3 would be worse than what we have today), but IIUC the proposal (was it > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl, > which then would be on VFIO and not on KVM. At the end of the day we need an ioctl with two arguments: - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever) - The KVM FD to control wbinvd support on Philosophically it doesn't matter too much which subsystem that ioctl lives, but we have these obnoxious cross module dependencies to consider.. Framing the question, as you have, to be about the process, I think explains why KVM doesn't really care what is decided, so long as the process and the VM have equivalent rights. Alex, how about a more fleshed out suggestion: 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID it communicates its no-snoop configuration: - 0 enable, allow WBINVD - 1 automatic disable, block WBINVD if the platform IOMMU can police it (what we do today) - 2 force disable, do not allow BINVD ever vfio_pci may want to take this from an admin configuration knob someplace. It allows the admin to customize if they want. If we can figure out a way to autodetect 2 from vfio_pci, all the better 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace to access wbinvd so it can make use of the no snoop optimization. wbinvd is allowed when: - A device is joined with mode #0 - A device is joined with mode #1 and the IOMMU cannot block no-snoop (today) 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD is blocked and userspace doesn't request to block no-snoop in the IOASID then it is a userspace error. 4) The KVM interface is the very simple enable/disable WBINVD. Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required to enable WBINVD at KVM. It is pretty simple from a /dev/ioasid perpsective, covers todays compat requirement, gives some future option to allow the no-snoop optimization, and gives a new option for qemu to totally block wbinvd no matter what. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes
> On Jun 4, 2021, at 8:38 AM, Joerg Roedel wrote: > > Hi Nadav, > > [Adding Robin] > > On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote: >> Nadav Amit (4): >> iommu/amd: Fix wrong parentheses on page-specific invalidations > > This patch is already upstream in v5.13-rc4. Please rebase to that > version. I guess it would be rc5 by the time I send it. > >> iommu/amd: Selective flush on unmap >> iommu/amd: Do not sync on page size changes >> iommu/amd: Do not use flush-queue when NpCache is on > > And I think there have been objections from Robin Murphy on Patch 3, > have those been worked out? I am still waiting for Robin’s feedback on my proposed changes. If he does not respond soon, I will drop this patch for now. Thanks, Nadav ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v8 3/4] iommu: rockchip: Add internal ops to handle variants
Add internal ops to be able to handle incoming variant v2. The goal is to keep the overall structure of the framework but to allow to add the evolution of this hardware block. The ops are global for a SoC because iommu domains are not attached to a specific devices if they are for a virtuel device like drm. Use a global variable shouldn't be since SoC usually doesn't embedded different versions of the iommu hardware block. If that happen one day a WARN_ON will be displayed at probe time. Signed-off-by: Benjamin Gaignard Reviewed-by: Heiko Stuebner --- version 7: - Set DMA mask - Add function to convert dma address to dte version 6: - Remove #include - Remove pt_address_mask field - Only use once of_device_get_match_data - Return an error if ops don't match version 5: - Use of_device_get_match_data() - Add internal ops inside the driver drivers/iommu/rockchip-iommu.c | 86 +- 1 file changed, 64 insertions(+), 22 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index 7a2932772fdf..bd2cf7f08c71 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -96,6 +96,15 @@ static const char * const rk_iommu_clocks[] = { "aclk", "iface", }; +struct rk_iommu_ops { + phys_addr_t (*pt_address)(u32 dte); + u32 (*mk_dtentries)(dma_addr_t pt_dma); + u32 (*mk_ptentries)(phys_addr_t page, int prot); + phys_addr_t (*dte_addr_phys)(u32 addr); + u32 (*dma_addr_dte)(dma_addr_t dt_dma); + u64 dma_bit_mask; +}; + struct rk_iommu { struct device *dev; void __iomem **bases; @@ -116,6 +125,7 @@ struct rk_iommudata { }; static struct device *dma_dev; +static const struct rk_iommu_ops *rk_ops; static inline void rk_table_flush(struct rk_iommu_domain *dom, dma_addr_t dma, unsigned int count) @@ -215,11 +225,6 @@ static inline u32 rk_mk_dte(dma_addr_t pt_dma) #define RK_PTE_PAGE_READABLE BIT(1) #define RK_PTE_PAGE_VALID BIT(0) -static inline phys_addr_t rk_pte_page_address(u32 pte) -{ - return (phys_addr_t)pte & RK_PTE_PAGE_ADDRESS_MASK; -} - static inline bool rk_pte_is_page_valid(u32 pte) { return pte & RK_PTE_PAGE_VALID; @@ -448,10 +453,10 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) * and verifying that upper 5 nybbles are read back. */ for (i = 0; i < iommu->num_mmu; i++) { - rk_iommu_write(iommu->bases[i], RK_MMU_DTE_ADDR, DTE_ADDR_DUMMY); + dte_addr = rk_ops->pt_address(DTE_ADDR_DUMMY); + rk_iommu_write(iommu->bases[i], RK_MMU_DTE_ADDR, dte_addr); - dte_addr = rk_iommu_read(iommu->bases[i], RK_MMU_DTE_ADDR); - if (dte_addr != (DTE_ADDR_DUMMY & RK_DTE_PT_ADDRESS_MASK)) { + if (dte_addr != rk_iommu_read(iommu->bases[i], RK_MMU_DTE_ADDR)) { dev_err(iommu->dev, "Error during raw reset. MMU_DTE_ADDR is not functioning\n"); return -EFAULT; } @@ -470,6 +475,16 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) return 0; } +static inline phys_addr_t rk_dte_addr_phys(u32 addr) +{ + return (phys_addr_t)addr; +} + +static inline u32 rk_dma_addr_dte(dma_addr_t dt_dma) +{ + return dt_dma; +} + static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova) { void __iomem *base = iommu->bases[index]; @@ -489,7 +504,7 @@ static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova) page_offset = rk_iova_page_offset(iova); mmu_dte_addr = rk_iommu_read(base, RK_MMU_DTE_ADDR); - mmu_dte_addr_phys = (phys_addr_t)mmu_dte_addr; + mmu_dte_addr_phys = rk_ops->dte_addr_phys(mmu_dte_addr); dte_addr_phys = mmu_dte_addr_phys + (4 * dte_index); dte_addr = phys_to_virt(dte_addr_phys); @@ -498,14 +513,14 @@ static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova) if (!rk_dte_is_pt_valid(dte)) goto print_it; - pte_addr_phys = rk_dte_pt_address(dte) + (pte_index * 4); + pte_addr_phys = rk_ops->pt_address(dte) + (pte_index * 4); pte_addr = phys_to_virt(pte_addr_phys); pte = *pte_addr; if (!rk_pte_is_page_valid(pte)) goto print_it; - page_addr_phys = rk_pte_page_address(pte) + page_offset; + page_addr_phys = rk_ops->pt_address(pte) + page_offset; page_flags = pte & RK_PTE_PAGE_FLAGS_MASK; print_it: @@ -601,13 +616,13 @@ static phys_addr_t rk_iommu_iova_to_phys(struct iommu_domain *domain, if (!rk_dte_is_pt_valid(dte)) goto out; - pt_phys = rk_dte_pt_address(dte); + pt_phys = rk_ops->pt_address(dte); page_table = (u32 *)phys_to_virt(pt_phys); pte = page_table[rk_iova_pte_index(iova)]; if (!rk_pte_is_page_valid(pte)) got
[PATCH v8 1/4] dt-bindings: iommu: rockchip: Convert IOMMU to DT schema
Convert Rockchip IOMMU to DT schema Signed-off-by: Benjamin Gaignard Reviewed-by: Rob Herring Reviewed-by: Heiko Stuebner --- .../bindings/iommu/rockchip,iommu.txt | 38 - .../bindings/iommu/rockchip,iommu.yaml| 80 +++ 2 files changed, 80 insertions(+), 38 deletions(-) delete mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.txt create mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt b/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt deleted file mode 100644 index 6ecefea1c6f9.. --- a/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt +++ /dev/null @@ -1,38 +0,0 @@ -Rockchip IOMMU -== - -A Rockchip DRM iommu translates io virtual addresses to physical addresses for -its master device. Each slave device is bound to a single master device, and -shares its clocks, power domain and irq. - -Required properties: -- compatible : Should be "rockchip,iommu" -- reg : Address space for the configuration registers -- interrupts : Interrupt specifier for the IOMMU instance -- interrupt-names : Interrupt name for the IOMMU instance -- #iommu-cells: Should be <0>. This indicates the iommu is a -"single-master" device, and needs no additional information -to associate with its master device. See: -Documentation/devicetree/bindings/iommu/iommu.txt -- clocks : A list of clocks required for the IOMMU to be accessible by -the host CPU. -- clock-names : Should contain the following: - "iface" - Main peripheral bus clock (PCLK/HCL) (required) - "aclk" - AXI bus clock (required) - -Optional properties: -- rockchip,disable-mmu-reset : Don't use the mmu reset operation. - Some mmu instances may produce unexpected results - when the reset operation is used. - -Example: - - vopl_mmu: iommu@ff940300 { - compatible = "rockchip,iommu"; - reg = <0xff940300 0x100>; - interrupts = ; - interrupt-names = "vopl_mmu"; - clocks = <&cru ACLK_VOP1>, <&cru HCLK_VOP1>; - clock-names = "aclk", "iface"; - #iommu-cells = <0>; - }; diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml new file mode 100644 index ..099fc2578b54 --- /dev/null +++ b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml @@ -0,0 +1,80 @@ +# SPDX-License-Identifier: GPL-2.0-only +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iommu/rockchip,iommu.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Rockchip IOMMU + +maintainers: + - Heiko Stuebner + +description: |+ + A Rockchip DRM iommu translates io virtual addresses to physical addresses for + its master device. Each slave device is bound to a single master device and + shares its clocks, power domain and irq. + + For information on assigning IOMMU controller to its peripheral devices, + see generic IOMMU bindings. + +properties: + compatible: +const: rockchip,iommu + + reg: +items: + - description: configuration registers for MMU instance 0 + - description: configuration registers for MMU instance 1 +minItems: 1 +maxItems: 2 + + interrupts: +items: + - description: interruption for MMU instance 0 + - description: interruption for MMU instance 1 +minItems: 1 +maxItems: 2 + + clocks: +items: + - description: Core clock + - description: Interface clock + + clock-names: +items: + - const: aclk + - const: iface + + "#iommu-cells": +const: 0 + + rockchip,disable-mmu-reset: +$ref: /schemas/types.yaml#/definitions/flag +description: | + Do not use the mmu reset operation. + Some mmu instances may produce unexpected results + when the reset operation is used. + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + - "#iommu-cells" + +additionalProperties: false + +examples: + - | +#include +#include + +vopl_mmu: iommu@ff940300 { + compatible = "rockchip,iommu"; + reg = <0xff940300 0x100>; + interrupts = ; + clocks = <&cru ACLK_VOP1>, <&cru HCLK_VOP1>; + clock-names = "aclk", "iface"; + #iommu-cells = <0>; +}; -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v8 4/4] iommu: rockchip: Add support for iommu v2
This second version of the hardware block has a different bits mapping for page table entries. Add the ops matching to this new mapping. Define a new compatible to distinguish it from the first version. Signed-off-by: Benjamin Gaignard Reviewed-by: Heiko Stuebner --- version 8: - Fix compilation issue version 7: - Set dma_bit_mask field. - Add rk_dma_addr_dte_v2 function version 5: - Use internal ops to support v2 hardware block - Use GENMASK macro. - Keep rk_dte_pt_address() and rk_dte_pt_address_v2() separated because I believe that is more readable like this. - Do not duplicate code. drivers/iommu/rockchip-iommu.c | 85 ++ 1 file changed, 85 insertions(+) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index bd2cf7f08c71..16dd2bf4a859 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -189,6 +189,33 @@ static inline phys_addr_t rk_dte_pt_address(u32 dte) return (phys_addr_t)dte & RK_DTE_PT_ADDRESS_MASK; } +/* + * In v2: + * 31:12 - PT address bit 31:0 + * 11: 8 - PT address bit 35:32 + * 7: 4 - PT address bit 39:36 + * 3: 1 - Reserved + * 0 - 1 if PT @ PT address is valid + */ +#define RK_DTE_PT_ADDRESS_MASK_V2 GENMASK_ULL(31, 4) +#define DTE_HI_MASK1 GENMASK(11, 8) +#define DTE_HI_MASK2 GENMASK(7, 4) +#define DTE_HI_SHIFT1 24 /* shift bit 8 to bit 32 */ +#define DTE_HI_SHIFT2 32 /* shift bit 4 to bit 36 */ +#define PAGE_DESC_HI_MASK1 GENMASK_ULL(39, 36) +#define PAGE_DESC_HI_MASK2 GENMASK_ULL(35, 32) + +static inline phys_addr_t rk_dte_pt_address_v2(u32 dte) +{ + u64 dte_v2 = dte; + + dte_v2 = ((dte_v2 & DTE_HI_MASK2) << DTE_HI_SHIFT2) | +((dte_v2 & DTE_HI_MASK1) << DTE_HI_SHIFT1) | +(dte_v2 & RK_DTE_PT_ADDRESS_MASK); + + return (phys_addr_t)dte_v2; +} + static inline bool rk_dte_is_pt_valid(u32 dte) { return dte & RK_DTE_PT_VALID; @@ -199,6 +226,15 @@ static inline u32 rk_mk_dte(dma_addr_t pt_dma) return (pt_dma & RK_DTE_PT_ADDRESS_MASK) | RK_DTE_PT_VALID; } +static inline u32 rk_mk_dte_v2(dma_addr_t pt_dma) +{ + pt_dma = (pt_dma & RK_DTE_PT_ADDRESS_MASK) | +((pt_dma & PAGE_DESC_HI_MASK1) >> DTE_HI_SHIFT1) | +(pt_dma & PAGE_DESC_HI_MASK2) >> DTE_HI_SHIFT2; + + return (pt_dma & RK_DTE_PT_ADDRESS_MASK_V2) | RK_DTE_PT_VALID; +} + /* * Each PTE has a Page address, some flags and a valid bit: * +-+---+---+-+ @@ -240,6 +276,29 @@ static u32 rk_mk_pte(phys_addr_t page, int prot) return page | flags | RK_PTE_PAGE_VALID; } +/* + * In v2: + * 31:12 - Page address bit 31:0 + * 11:9 - Page address bit 34:32 + * 8:4 - Page address bit 39:35 + * 3 - Security + * 2 - Readable + * 1 - Writable + * 0 - 1 if Page @ Page address is valid + */ +#define RK_PTE_PAGE_READABLE_V2 BIT(2) +#define RK_PTE_PAGE_WRITABLE_V2 BIT(1) + +static u32 rk_mk_pte_v2(phys_addr_t page, int prot) +{ + u32 flags = 0; + + flags |= (prot & IOMMU_READ) ? RK_PTE_PAGE_READABLE_V2 : 0; + flags |= (prot & IOMMU_WRITE) ? RK_PTE_PAGE_WRITABLE_V2 : 0; + + return rk_mk_dte_v2(page) | flags; +} + static u32 rk_mk_pte_invalid(u32 pte) { return pte & ~RK_PTE_PAGE_VALID; @@ -485,6 +544,21 @@ static inline u32 rk_dma_addr_dte(dma_addr_t dt_dma) return dt_dma; } +#define DT_HI_MASK GENMASK_ULL(39, 32) +#define DT_SHIFT 28 + +static inline phys_addr_t rk_dte_addr_phys_v2(u32 addr) +{ + return (phys_addr_t)(addr & RK_DTE_PT_ADDRESS_MASK) | + ((addr & DT_HI_MASK) << DT_SHIFT); +} + +static inline u32 rk_dma_addr_dte_v2(dma_addr_t dt_dma) +{ + return (dt_dma & RK_DTE_PT_ADDRESS_MASK) | + ((dt_dma & DT_HI_MASK) >> DT_SHIFT); +} + static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova) { void __iomem *base = iommu->bases[index]; @@ -1316,11 +1390,22 @@ static struct rk_iommu_ops iommu_data_ops_v1 = { .dma_bit_mask = DMA_BIT_MASK(32), }; +static struct rk_iommu_ops iommu_data_ops_v2 = { + .pt_address = &rk_dte_pt_address_v2, + .mk_dtentries = &rk_mk_dte_v2, + .mk_ptentries = &rk_mk_pte_v2, + .dte_addr_phys = &rk_dte_addr_phys_v2, + .dma_addr_dte = &rk_dma_addr_dte_v2, + .dma_bit_mask = DMA_BIT_MASK(40), +}; static const struct of_device_id rk_iommu_dt_ids[] = { { .compatible = "rockchip,iommu", .data = &iommu_data_ops_v1, }, + { .compatible = "rockchip,rk3568-iommu", + .data = &iommu_data_ops_v2, + }, { /* sentinel */ } }; -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v8 2/4] dt-bindings: iommu: rockchip: Add compatible for v2
Add compatible for the second version of IOMMU hardware block. RK356x IOMMU can also be link to a power domain. Signed-off-by: Benjamin Gaignard Reviewed-by: Rob Herring Reviewed-by: Heiko Stuebner --- .../devicetree/bindings/iommu/rockchip,iommu.yaml | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml index 099fc2578b54..d2e28a9e3545 100644 --- a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml +++ b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml @@ -19,7 +19,9 @@ description: |+ properties: compatible: -const: rockchip,iommu +enum: + - rockchip,iommu + - rockchip,rk3568-iommu reg: items: @@ -48,6 +50,9 @@ properties: "#iommu-cells": const: 0 + power-domains: +maxItems: 1 + rockchip,disable-mmu-reset: $ref: /schemas/types.yaml#/definitions/flag description: | -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v8 0/4] Add IOMMU driver for rk356x
This series adds the IOMMU driver for rk356x SoC. Since a new compatible is needed to distinguish this second version of IOMMU hardware block from the first one, it is an opportunity to convert the binding to DT schema. version 8: - Fix compilation issue. version 7: - Set DMA mask - Fix rk_iommu_enable() - rebased on v5.13-rc3 tag version 6: - Remove #include - Remove pt_address_mask field - Only use once of_device_get_match_data - Return an error if ops don't match version 5: - Add internal ops inside the driver to be able to add variants. - Add support of v2 variant. - Stop using 'version' field - Use GENMASK macro version 4: - Add description for reg items - Remove useless interrupt-names properties - Add description for interrupts items - Remove interrupt-names properties from DST files version 3: - Rename compatible with soc prefix - Rebase on v5.12 tag version 2: - Fix iommu-cells typo in rk322x.dtsi - Change maintainer - Change reg maxItems - Add power-domains property Benjamin Gaignard (4): dt-bindings: iommu: rockchip: Convert IOMMU to DT schema dt-bindings: iommu: rockchip: Add compatible for v2 iommu: rockchip: Add internal ops to handle variants iommu: rockchip: Add support for iommu v2 .../bindings/iommu/rockchip,iommu.txt | 38 .../bindings/iommu/rockchip,iommu.yaml| 85 + drivers/iommu/rockchip-iommu.c| 171 +++--- 3 files changed, 234 insertions(+), 60 deletions(-) delete mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.txt create mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 09:22:43AM -0700, Jacob Pan wrote: > Hi Jason, > > On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang wrote: > > > 在 2021/6/4 上午2:19, Jacob Pan 写道: > > > Hi Shenming, > > > > > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu > > > wrote: > > > > > >> On 2021/6/2 1:33, Jason Gunthorpe wrote: > > >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote: > > >>> > > The drivers register per page table fault handlers to /dev/ioasid > > which will then register itself to iommu core to listen and route > > the per- device I/O page faults. > > >>> I'm still confused why drivers need fault handlers at all? > > >> Essentially it is the userspace that needs the fault handlers, > > >> one case is to deliver the faults to the vIOMMU, and another > > >> case is to enable IOPF on the GPA address space for on-demand > > >> paging, it seems that both could be specified in/through the > > >> IOASID_ALLOC ioctl? > > >> > > > I would think IOASID_BIND_PGTABLE is where fault handler should be > > > registered. There wouldn't be any IO page fault without the binding > > > anyway. > > > > > > I also don't understand why device drivers should register the fault > > > handler, the fault is detected by the pIOMMU and injected to the > > > vIOMMU. So I think it should be the IOASID itself register the handler. > > > > > > > > > As discussed in another thread. > > > > I think the reason is that ATS doesn't forbid the #PF to be reported via > > a device specific way. > > Yes, in that case we should support both. Give the device driver a chance > to handle the IOPF if it can. Huh? The device driver does not "handle the IOPF" the device driver might inject the IOPF. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
Hi Jason, On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang wrote: > 在 2021/6/4 上午2:19, Jacob Pan 写道: > > Hi Shenming, > > > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu > > wrote: > > > >> On 2021/6/2 1:33, Jason Gunthorpe wrote: > >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote: > >>> > The drivers register per page table fault handlers to /dev/ioasid > which will then register itself to iommu core to listen and route > the per- device I/O page faults. > >>> I'm still confused why drivers need fault handlers at all? > >> Essentially it is the userspace that needs the fault handlers, > >> one case is to deliver the faults to the vIOMMU, and another > >> case is to enable IOPF on the GPA address space for on-demand > >> paging, it seems that both could be specified in/through the > >> IOASID_ALLOC ioctl? > >> > > I would think IOASID_BIND_PGTABLE is where fault handler should be > > registered. There wouldn't be any IO page fault without the binding > > anyway. > > > > I also don't understand why device drivers should register the fault > > handler, the fault is detected by the pIOMMU and injected to the > > vIOMMU. So I think it should be the IOASID itself register the handler. > > > > > As discussed in another thread. > > I think the reason is that ATS doesn't forbid the #PF to be reported via > a device specific way. > Yes, in that case we should support both. Give the device driver a chance to handle the IOPF if it can. > Thanks > > > > > >> Thanks, > >> Shenming > >> > > > > Thanks, > > > > Jacob > > > Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 04/06/21 18:03, Jason Gunthorpe wrote: On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote: I don't want a security proof myself; I want to trust VFIO to make the right judgment and I'm happy to defer to it (via the KVM-VFIO device). Given how KVM is just a device driver inside Linux, VMs should be a slightly more roundabout way to do stuff that is accessible to bare metal; not a way to gain extra privilege. Okay, fine, lets turn the question on its head then. VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO application can make use of no-snoop optimizations. The ability of KVM to execute wbinvd should be tied to the ability of that IOCTL to run in a normal process context. So, under what conditions do we want to allow VFIO to giave a process elevated access to the CPU: Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. #2+#3 would be worse than what we have today), but IIUC the proposal (was it yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl, which then would be on VFIO and not on KVM. I assumed Alex was more or less okay with it, given he included me in the discussion. If later y'all switch to "it's always okay to issue the enable/disable ioctl", I guess the rationale would be documented in the commit message. Paolo 1) User has access to a device that can issue no-snoop TLPS 2) User has access to an IOMMU that can not block no-snoop (today) 3) Require CAP_SYS_RAW_IO 4) Anyone Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote: > On 04/06/21 17:50, Jason Gunthorpe wrote: > > > Extending the scenarios where WBINVD is not a nop is not a problem for me. > > > If possible I wouldn't mind keeping the existing kvm-vfio connection via > > > the > > > device, if only because then the decision remains in the VFIO camp (whose > > > judgment I trust more than mine on this kind of issue). > > Really the question to answer is what "security proof" do you want > > before the wbinvd can be enabled > > I don't want a security proof myself; I want to trust VFIO to make the right > judgment and I'm happy to defer to it (via the KVM-VFIO device). > > Given how KVM is just a device driver inside Linux, VMs should be a slightly > more roundabout way to do stuff that is accessible to bare metal; not a way > to gain extra privilege. Okay, fine, lets turn the question on its head then. VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO application can make use of no-snoop optimizations. The ability of KVM to execute wbinvd should be tied to the ability of that IOCTL to run in a normal process context. So, under what conditions do we want to allow VFIO to giave a process elevated access to the CPU: > > 1) User has access to a device that can issue no-snoop TLPS > > 2) User has access to an IOMMU that can not block no-snoop (today) > > 3) Require CAP_SYS_RAW_IO > > 4) Anyone Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 04/06/21 17:50, Jason Gunthorpe wrote: Extending the scenarios where WBINVD is not a nop is not a problem for me. If possible I wouldn't mind keeping the existing kvm-vfio connection via the device, if only because then the decision remains in the VFIO camp (whose judgment I trust more than mine on this kind of issue). Really the question to answer is what "security proof" do you want before the wbinvd can be enabled I don't want a security proof myself; I want to trust VFIO to make the right judgment and I'm happy to defer to it (via the KVM-VFIO device). Given how KVM is just a device driver inside Linux, VMs should be a slightly more roundabout way to do stuff that is accessible to bare metal; not a way to gain extra privilege. Paolo 1) User has access to a device that can issue no-snoop TLPS 2) User has access to an IOMMU that can not block no-snoop (today) 3) Require CAP_SYS_RAW_IO 4) Anyone #1 is an improvement because it allows userspace to enable wbinvd and no-snoop optimizations based on user choice #2 is where we are today and wbinvd effectively becomes a fixed platform choice. Userspace has no say #3 is "there is a problem, but not so serious, root is powerful enough to override" ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 05:40:34PM +0200, Paolo Bonzini wrote: > On 04/06/21 17:26, Alex Williamson wrote: > > Let's make sure the KVM folks are part of this decision; a re-cap for > > them, KVM currently automatically enables wbinvd emulation when > > potentially non-coherent devices are present which is determined solely > > based on the IOMMU's (or platform's, as exposed via the IOMMU) ability > > to essentially force no-snoop transactions from a device to be cache > > coherent. This synchronization is triggered via the kvm-vfio device, > > where QEMU creates the device and adds/removes vfio group fd > > descriptors as an additionally layer to prevent the user from enabling > > wbinvd emulation on a whim. > > > > IIRC, this latter association was considered a security/DoS issue to > > prevent a malicious guest/userspace from creating a disproportionate > > system load. > > > > Where would KVM stand on allowing more direct userspace control of > > wbinvd behavior? Would arbitrary control be acceptable or should we > > continue to require it only in association to a device requiring it for > > correct operation. > > Extending the scenarios where WBINVD is not a nop is not a problem for me. > If possible I wouldn't mind keeping the existing kvm-vfio connection via the > device, if only because then the decision remains in the VFIO camp (whose > judgment I trust more than mine on this kind of issue). Really the question to answer is what "security proof" do you want before the wbinvd can be enabled 1) User has access to a device that can issue no-snoop TLPS 2) User has access to an IOMMU that can not block no-snoop (today) 3) Require CAP_SYS_RAW_IO 4) Anyone #1 is an improvement because it allows userspace to enable wbinvd and no-snoop optimizations based on user choice #2 is where we are today and wbinvd effectively becomes a fixed platform choice. Userspace has no say #3 is "there is a problem, but not so serious, root is powerful enough to override" #4 is "there is no problem here" Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] iommu: Remove unused of_get_dma_window()
On Thu, May 27, 2021 at 02:37:09PM -0500, Rob Herring wrote: > drivers/iommu/of_iommu.c | 68 > include/linux/of_iommu.h | 17 ++ > 2 files changed, 3 insertions(+), 82 deletions(-) Applied both, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: Different type iommus integrated in a SoC
On Thu, Jun 03, 2021 at 01:05:43PM +0100, Robin Murphy wrote: > Hooray! I've been forecasting this for years, but the cases we regularly hit > with internal FPGA prototyping (nor the secret unused MMU-400 I found on > RK3288) have never really been a strong enough argument to stand behind. > > Based on what I remember from looking into this a few years ago, converting > *most* of the API to per-device ops (now via dev->iommu) is trivial; the > main challenge will be getting the per-device data bootstrapped in > iommu_probe_device(), which would probably need to rely on the fwspec and/or > list of registered IOMMU instances. > > The other notable thing which will need to change is the domain allocation > interface, but in practice I think everyone who calls iommu_domain_alloc() > today is in fact doing so for a specific device, so I don't think it's as > big a problem as it might first appear. Yeah, I think for that we have to give up on the promise that a domain can be assigned to _any_ device. But this promise doesn't even hold true now when there are several IOMMU of the same type but with different feature sets in a system. So I happily review patches enabling the Multi-IOMMU SOCs :) Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 04/06/21 17:26, Alex Williamson wrote: Let's make sure the KVM folks are part of this decision; a re-cap for them, KVM currently automatically enables wbinvd emulation when potentially non-coherent devices are present which is determined solely based on the IOMMU's (or platform's, as exposed via the IOMMU) ability to essentially force no-snoop transactions from a device to be cache coherent. This synchronization is triggered via the kvm-vfio device, where QEMU creates the device and adds/removes vfio group fd descriptors as an additionally layer to prevent the user from enabling wbinvd emulation on a whim. IIRC, this latter association was considered a security/DoS issue to prevent a malicious guest/userspace from creating a disproportionate system load. Where would KVM stand on allowing more direct userspace control of wbinvd behavior? Would arbitrary control be acceptable or should we continue to require it only in association to a device requiring it for correct operation. Extending the scenarios where WBINVD is not a nop is not a problem for me. If possible I wouldn't mind keeping the existing kvm-vfio connection via the device, if only because then the decision remains in the VFIO camp (whose judgment I trust more than mine on this kind of issue). For example, would it make sense if *VFIO* (not KVM) gets an API that says "I am going to do incoherent DMA"? Then that API causes WBINVD to become not-a-nop even on otherwise coherent platforms. (Would this make sense at all without a hypervisor that indirectly lets userspace execute WBINVD? Perhaps VFIO would benefit from a WBINVD ioctl too). Paolo ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes
Hi Nadav, [Adding Robin] On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote: > Nadav Amit (4): > iommu/amd: Fix wrong parentheses on page-specific invalidations This patch is already upstream in v5.13-rc4. Please rebase to that version. > iommu/amd: Selective flush on unmap > iommu/amd: Do not sync on page size changes > iommu/amd: Do not use flush-queue when NpCache is on And I think there have been objections from Robin Murphy on Patch 3, have those been worked out? Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, 4 Jun 2021 09:19:50 + "Tian, Kevin" wrote: > > From: Alex Williamson > > Sent: Friday, June 4, 2021 4:42 AM > > > > > 'qemu --allow-no-snoop' makes more sense to me > > > > I'd be tempted to attach it to the -device vfio-pci option, it's > > specific drivers for specific devices that are going to want this and > > those devices may not be permanently attached to the VM. But I see in > > the other thread you're trying to optimize IOMMU page table sharing. > > > > There's a usability question in either case though and I'm not sure how > > to get around it other than QEMU or the kernel knowing a list of > > devices (explicit IDs or vendor+class) to select per device defaults. > > > > "-device vfio-pci" is a per-device option, which implies that the > no-snoop choice is given to the admin then no need to maintain > a fixed device list in Qemu? I think we want to look at where we put it to have the best default user experience. For example the QEMU vfio-pci device option could use on/off/auto semantics where auto is the default and QEMU maintains a list of IDs or vendor/class configurations where we've determined the "optimal" auto configuration. Management tools could provide an override, but we're imposing some pretty technical requirements for a management tool to be able to come up with good per device defaults. Seems like we should consolidate that technical decision in one place. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices
On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote: > From: Xingang Wang > > When booting with devicetree, the pci_request_acs() is called after the > enumeration and initialization of PCI devices, thus the ACS is not > enabled. And ACS should be enabled when IOMMU is detected for the > PCI host bridge, so add check for IOMMU before probe of PCI host and call > pci_request_acs() to make sure ACS will be enabled when enumerating PCI > devices. > > Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when > configuring IOMMU linkage") > Signed-off-by: Xingang Wang > --- > drivers/iommu/of_iommu.c | 1 - > drivers/pci/of.c | 8 +++- > 2 files changed, 7 insertions(+), 2 deletions(-) Should probably go through the PCI tree, so Acked-by: Joerg Roedel ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RESEND PATCH v3] iommu/iova: put free_iova_mem() outside of spinlock iova_rbtree_lock
On Mon, May 10, 2021 at 07:53:02PM +0800, chenxiang wrote: > From: Xiang Chen > > It is not necessary to put free_iova_mem() inside of spinlock/unlock > iova_rbtree_lock which only leads to more completion for the spinlock. > It has a small promote on the performance after the change. And also > rename private_free_iova() as remove_iova() because the function will not > free iova after that change. > > Signed-off-by: Xiang Chen > Reviewed-by: John Garry > --- > drivers/iommu/iova.c | 18 +++--- > 1 file changed, 11 insertions(+), 7 deletions(-) Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3] iommu/dma: Fix IOVA reserve dma ranges
On Mon, Sep 14, 2020 at 12:53:19PM +0530, Srinath Mannam wrote: > Fix IOVA reserve failure in the case when address of first memory region > listed in dma-ranges is equal to 0x0. > > Fixes: aadad097cd46f ("iommu/dma: Reserve IOVA for PCIe inaccessible DMA > address") > Signed-off-by: Srinath Mannam > --- > Changes from v2: >Modify error message with useful information based on Bjorn's comments. > > Changes from v1: >Removed unnecessary changes based on Robin's review comments. > > drivers/iommu/dma-iommu.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
[Cc +Paolo] On Fri, 4 Jun 2021 09:28:30 -0300 Jason Gunthorpe wrote: > On Fri, Jun 04, 2021 at 08:38:26AM +, Tian, Kevin wrote: > > > I think more to drive the replacement design; if we can't figure out > > > how to do something other than backwards compatibility trickery in the > > > kernel, it's probably going to bite us. Thanks, > > > > I'm a bit lost on the desired flow in your minds. Here is one flow based > > on my understanding of this discussion. Please comment whether it > > matches your thinking: > > > > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD; > > > > 1) Qemu binds dev1 to ioasid_fd; > > > > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_ > > CACHE info i.e. whether underlying IOMMU can enforce snoop; > > > > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu > > needs to figure out whether dev1 wants to do no-snoop. This might > > be based a fixed vendor/class list or specified by user; > > > > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop' > > flag is specified to decide the page table format, which is supposed > > to match dev1; > > > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this > > point, specify snoop/no-snoop again. If not supported by related > > iommu or different from what gpa_ioasid has, attach fails. > > Why do we need to specify it again? My thought as well. > If the IOASID was created with the "block no-snoop" flag then it is > blocked in that IOASID, and that blocking sets the page table format. > > The only question is if we can successfully attach a device to the > page table, or not. > > The KVM interface is a bit tricky because Alex said this is partially > security, wbinvd is only enabled if someone has a FD to a device that > can support no-snoop. > > Personally I think this got way too complicated, the KVM interface > should simply be > > ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label) > ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label) > > and let qemu sort it out based on command flags, detection, whatever. > > 'ioasidfd, device_label' is the security proof that Alex asked > for. This needs to be some device in the ioasidfd that declares it is > capabale of no-snoop. Eg vfio_pci would always declare it is capable > of no-snoop. > > No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the > IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it > is just a kernel-harmless uerspace bug. > > Then user space can decide which of the various axis's it wants to > optimize for. Let's make sure the KVM folks are part of this decision; a re-cap for them, KVM currently automatically enables wbinvd emulation when potentially non-coherent devices are present which is determined solely based on the IOMMU's (or platform's, as exposed via the IOMMU) ability to essentially force no-snoop transactions from a device to be cache coherent. This synchronization is triggered via the kvm-vfio device, where QEMU creates the device and adds/removes vfio group fd descriptors as an additionally layer to prevent the user from enabling wbinvd emulation on a whim. IIRC, this latter association was considered a security/DoS issue to prevent a malicious guest/userspace from creating a disproportionate system load. Where would KVM stand on allowing more direct userspace control of wbinvd behavior? Would arbitrary control be acceptable or should we continue to require it only in association to a device requiring it for correct operation. A wrinkle in "correct operation" is that while the IOMMU may be able to force no-snoop transactions to be coherent, in the scenario described in the previous reply, the user may intend to use non-coherent DMA regardless of the IOMMU capabilities due to their own optimization policy. There's a whole spectrum here, including aspects we can't determine around the device driver's intentions to use non-coherent transactions, the user's policy in trading hypervisor overhead for cache coherence overhead, etc. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/amd: Tidy up DMA ops init
On Thu, Jun 03, 2021 at 02:48:21PM +0100, Robin Murphy wrote: > As discussed on the report thread, I think it makes most sense to merge > this as a fix for 5.13 and not worry about any backporting. > > drivers/iommu/amd/amd_iommu.h | 2 -- > drivers/iommu/amd/init.c | 5 - > drivers/iommu/amd/iommu.c | 31 +-- > 3 files changed, 13 insertions(+), 25 deletions(-) Applied for v5.13, thanks Robin et al for the quick work on this regression. Robin, do you by chance have a Fixes tag which I can add? Thanks, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT
On Thu, Jun 03, 2021 at 09:26:39AM +0200, Jean-Philippe Brucker wrote: > These are only defined when CONFIG_IOMMU_API is set. IORT uses them inside > an #ifdef, I can do the same. Maybe moving these two functions to a new > drivers/acpi/iommu.c would be nicer, though. Not sure what the ACPI maintainers and reviewers prefer, but I would just #ifdef the functions and provide stubs in the #else path if necessary. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/rockchip: Remove redundant DMA syncs
On Fri, May 21, 2021 at 02:49:39PM +0100, Robin Murphy wrote: > When we allocate new pagetable pages, we don't modify them between the > initial dma_map_single() call and the dma_sync_single_for_device() call > via rk_iommu_flush(), so the latter is entirely pointless. > > Signed-off-by: Robin Murphy > --- > drivers/iommu/rockchip-iommu.c | 3 --- > 1 file changed, 3 deletions(-) Applied, thanks Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 0/4] Add IOMMU driver for rk356x
On Fri, Jun 04, 2021 at 04:54:58PM +0200, Joerg Roedel wrote: > On Tue, May 25, 2021 at 02:15:47PM +0200, Benjamin Gaignard wrote: > > Benjamin Gaignard (4): > > dt-bindings: iommu: rockchip: Convert IOMMU to DT schema > > dt-bindings: iommu: rockchip: Add compatible for v2 > > iommu: rockchip: Add internal ops to handle variants > > Applied patches 1-3, thanks. Hmm, no, unapplied. Your separate patch 4 came before the kbuild-bot reports. Can you fix those please and submit a v8? Thanks, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 0/4] Add IOMMU driver for rk356x
On Tue, May 25, 2021 at 02:15:47PM +0200, Benjamin Gaignard wrote: > Benjamin Gaignard (4): > dt-bindings: iommu: rockchip: Convert IOMMU to DT schema > dt-bindings: iommu: rockchip: Add compatible for v2 > iommu: rockchip: Add internal ops to handle variants Applied patches 1-3, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/amd: Remove redundant assignment of err
On Wed, May 19, 2021 at 11:37:27AM +0800, Shaokun Zhang wrote: > 'err' will be initialized and cleanup the redundant initialization. > > Cc: Joerg Roedel > Signed-off-by: Shaokun Zhang Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RESEND PATCH v2] iommu/amd: Fix extended features logging
On Tue, May 04, 2021 at 01:22:20PM +0300, Alexander Monakov wrote: > Fixes: 9a295ff0ffc9 ("iommu/amd: Print extended features in one line to fix > divergent log levels") > Link: > https://lore.kernel.org/lkml/alpine.lnx.2.20.13.2104112326460.11...@monopod.intra.ispras.ru > Signed-off-by: Alexander Monakov > Cc: Paul Menzel > Cc: Joerg Roedel > Cc: Suravee Suthikulpanit > Cc: iommu@lists.linux-foundation.org > --- > drivers/iommu/amd/init.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Applied, thanks. And sorry for the delay. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 06:08:28AM +, Tian, Kevin wrote: > In Qemu case the problem is that it doesn't know the list of devices > that will be attached to an IOASID when it's created. This is a guest- > side knowledge which is conveyed one device at a time to Qemu > though vIOMMU. At least for the guest side it is alot simpler because the vIOMMU being emulated will define nearly everything. qemu will just have to ask the kernel for whatever it is the guest is doing. If the kernel can't do it then qemu has to SW emulate. The no-snoop block may be the only thing that is under qemu's control because it is transparent to the guest. This will probably become clearer as people start to define what the get_info should return. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult wrote: > On 02.06.21 19:24, Jason Gunthorpe wrote: > > Hi, > > >> If I understand this correctly, /dev/ioasid is a kind of "common > supplier" > >> to other APIs / devices. Why can't the fd be acquired by the > >> consumer APIs (eg. kvm, vfio, etc) ? > > > > /dev/ioasid would be similar to /dev/vfio, and everything already > > deals with exposing /dev/vfio and /dev/vfio/N together > > > > I don't see it as a problem, just more work. > > One of the problems I'm seeing is in container environments: when > passing in an vfio device, we now also need to pass in /dev/ioasid, > thus increasing the complexity in container setup (or orchestration). Containers already needed to do this today. Container orchestration is hard. > And in such scenarios you usually want to pass in one specific device, > not all of the same class, and usually orchestration shall pick the > next free one. > > Can we make sure that a process having full access to /dev/ioasid > while only supposed to have to specific consumer devices, can't do > any harm (eg. influencing other containers that might use a different > consumer device) ? Yes, /dev/ioasid shouldn't do anything unless you have a device to connect it with. In this way it is probably safe to stuff it into every container. > > Having FDs spawn other FDs is pretty ugly, it defeats the "everything > > is a file" model of UNIX. > > Unfortunately, this is already defeated in many other places :( > (I'd even claim that ioctls already break it :p) I think you are reaching a bit :) > It seems your approach also breaks this, since we now need to open two > files in order to talk to one device. It is two devices, thus two files. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 08:38:26AM +, Tian, Kevin wrote: > > I think more to drive the replacement design; if we can't figure out > > how to do something other than backwards compatibility trickery in the > > kernel, it's probably going to bite us. Thanks, > > I'm a bit lost on the desired flow in your minds. Here is one flow based > on my understanding of this discussion. Please comment whether it > matches your thinking: > > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD; > > 1) Qemu binds dev1 to ioasid_fd; > > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_ > CACHE info i.e. whether underlying IOMMU can enforce snoop; > > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu > needs to figure out whether dev1 wants to do no-snoop. This might > be based a fixed vendor/class list or specified by user; > > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop' > flag is specified to decide the page table format, which is supposed > to match dev1; > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this > point, specify snoop/no-snoop again. If not supported by related > iommu or different from what gpa_ioasid has, attach fails. Why do we need to specify it again? If the IOASID was created with the "block no-snoop" flag then it is blocked in that IOASID, and that blocking sets the page table format. The only question is if we can successfully attach a device to the page table, or not. The KVM interface is a bit tricky because Alex said this is partially security, wbinvd is only enabled if someone has a FD to a device that can support no-snoop. Personally I think this got way too complicated, the KVM interface should simply be ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label) ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label) and let qemu sort it out based on command flags, detection, whatever. 'ioasidfd, device_label' is the security proof that Alex asked for. This needs to be some device in the ioasidfd that declares it is capabale of no-snoop. Eg vfio_pci would always declare it is capable of no-snoop. No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it is just a kernel-harmless uerspace bug. Then user space can decide which of the various axis's it wants to optimize for. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote: > Could you clarify "vfio_driver"? This is the thing providing the vfio_device_ops function pointers. So vfio-pci can't know anything about this (although your no-snoop control probing idea makes sense to me) But vfio_mlx5_pci can know So can mdev_idxd And kvmgt Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, June 3, 2021 9:05 PM > > > > > > > > > > 3) Device accepts any PASIDs from the guest. No > > > >vPASID/pPASID translation is possible. (classic vfio_pci) > > > > 4) Device accepts any PASID from the guest and has an > > > >internal vPASID/pPASID translation (enhanced vfio_pci) > > > > > > what is enhanced vfio_pci? In my writing this is for mdev > > > which doesn't support ENQCMD > > > > This is a vfio_pci that mediates some element of the device interface > > to communicate the vPASID/pPASID table to the device, using Max's > > series for vfio_pci drivers to inject itself into VFIO. > > > > For instance a device might send a message through the PF that the VF > > has a certain vPASID/pPASID translation table. This would be useful > > for devices that cannot use ENQCMD but still want to support migration > > and thus need vPASID. > > I still don't quite get. If it's a PCI device why is PASID translation > required? > Just delegate the per-RID PASID space to user as type-3 then migrating the > vPASID space is just straightforward. This is only possible if we get rid of the global pPASID allocation (honestly is my preference as it makes the HW a lot simpler) Without vPASID the migration would need pPASID's on the RID that are guarenteed free. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote: > I think once it binds a device to an IOASID fd, QEMU will want to probe > what hardware features are available before going further with the vIOMMU > setup (is there PASID, PRI, which page table formats are supported, I think David's point was that qemu should be told what vIOMMU it is emulating exactly (right down to what features it has) and then the goal is simply to match what the vIOMMU needs with direct HW support via /dev/ioasid and fall back to SW emulation when not possible. If qemu wants to have some auto-configuration: 'pass host IOMMU capabilities' similar to the CPU flags then qemu should probe the /dev/ioasid - and maybe we should just return some highly rolled up "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in doing this. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote: > > nor do any virtio drivers implement the required platform specific > > cache flushing to make no-snoop TLPs work. > > I don't get why virtio drivers needs to do that. I think DMA API should hide > those arch/platform specific stuffs from us. It is not arch/platform stuff. If the device uses no-snoop then a very platform specific recovery is required in the device driver. It is not part of the normal DMA API, it is side APIs like flush_agp_cache() or wbinvd() that are used by GPU drivers only. If drivers/virtio doesn't explicitly call these things it doesn't support no-snoop - hence no VDPA device can ever use no-snoop. Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has nothing to do with no-snoop. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 02.06.21 19:24, Jason Gunthorpe wrote: Hi, >> If I understand this correctly, /dev/ioasid is a kind of "common supplier" >> to other APIs / devices. Why can't the fd be acquired by the >> consumer APIs (eg. kvm, vfio, etc) ? > > /dev/ioasid would be similar to /dev/vfio, and everything already > deals with exposing /dev/vfio and /dev/vfio/N together > > I don't see it as a problem, just more work. One of the problems I'm seeing is in container environments: when passing in an vfio device, we now also need to pass in /dev/ioasid, thus increasing the complexity in container setup (or orchestration). And in such scenarios you usually want to pass in one specific device, not all of the same class, and usually orchestration shall pick the next free one. Can we make sure that a process having full access to /dev/ioasid while only supposed to have to specific consumer devices, can't do any harm (eg. influencing other containers that might use a different consumer device) ? Note that we don't have device namespaces yet (device isolation still has to be done w/ complicated bpf magic). I'm already working on that, but even "simple" things like loopdev allocation turns out to be not entirely easy. > Having FDs spawn other FDs is pretty ugly, it defeats the "everything > is a file" model of UNIX. Unfortunately, this is already defeated in many other places :( (I'd even claim that ioctls already break it :p) It seems your approach also breaks this, since we now need to open two files in order to talk to one device. By the way: my idea does keep the "everything's a file" concept - we just have a file that allows opening "sub-files". Well, it would be better if devices could also have directory semantics. --mtx --- Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren GPG/PGP-Schlüssel zu. --- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering i...@metux.net -- +49-151-27565287 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote: > > > But it would certainly be possible for a system to have two > > > different host bridges with two different IOMMUs with different > > > pagetable formats. Until you know which devices (and therefore > > > which host bridge) you're talking about, you don't know what formats > > > of pagetable to accept. And if you have devices from *both* bridges > > > you can't bind a page table at all - you could theoretically support > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables > > > in both formats, but it would be pretty reasonable not to support > > > that. > > > > The basic process for a user space owned pgtable mode would be: > > > > 1) qemu has to figure out what format of pgtable to use > > > > Presumably it uses query functions using the device label. > > No... in the qemu case it would always select the page table format > that it needs to present to the guest. That's part of the > guest-visible platform that's selected by qemu's configuration. > > There's no negotiation here: either the kernel can supply what qemu > needs to pass to the guest, or it can't. If it can't qemu, will have > to either emulate in SW (if possible, probably using a kernel-managed > IOASID to back it) or fail outright. > > > The > > kernel code should look at the entire device path through all the > > IOMMU HW to determine what is possible. > > > > Or it already knows because the VM's vIOMMU is running in some > > fixed page table format, or the VM's vIOMMU already told it, or > > something. > > Again, I think you have the order a bit backwards. The user selects > the capabilities that the vIOMMU will present to the guest as part of > the qemu configuration. Qemu then requests that of the host kernel, > and either the host kernel supplies it, qemu emulates it in SW, or > qemu fails to start. Hm, how fine a capability are we talking about? If it's just "give me VT-d capabilities" or "give me Arm capabilities" that would work, but probably isn't useful. Anything finer will be awkward because userspace will have to try combinations of capabilities to see what sticks, and supporting new hardware will drop compatibility for older one. For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that completely changes the capabilities offered to the guest (some v2 implementations support nesting page tables, but never PASID nor PRI unlike v3.) The same vIOMMU could support either, presenting different capabilities to the guest, even multiple page table formats if we wanted to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it needs to know early on what the hardware is precisely. Then some new page table format shows up and, although the vIOMMU can support that in addition to older ones, QEMU will have to pick a single one, that it assumes the guest knows how to drive? I think once it binds a device to an IOASID fd, QEMU will want to probe what hardware features are available before going further with the vIOMMU setup (is there PASID, PRI, which page table formats are supported, address size, page granule, etc). Obtaining precise information about the hardware would be less awkward than trying different configurations until one succeeds. Binding an additional device would then fail if its pIOMMU doesn't support exactly the features supported for the first device, because we don't know which ones the guest will choose. QEMU will have to open a new IOASID fd for that device. Thanks, Jean > > Guest visible properties of the platform never (or *should* never) > depend implicitly on host capabilities - it's impossible to sanely > support migration in such an environment. > > > 2) qemu creates an IOASID and based on #1 and says 'I want this format' > > Right. > > > 3) qemu binds the IOASID to the device. > > > > If qmeu gets it wrong then it just fails. > > Right, though it may be fall back to (partial) software emulation. In > practice that would mean using a kernel-managed IOASID and walking the > guest IO pagetables itself to mirror them into the host kernel. > > > 4) For the next device qemu would have to figure out if it can re-use > > an existing IOASID based on the required proeprties. > > Nope. Again, what devices share an IO address space is a guest > visible part of the platform. If the host kernel can't supply that, > then qemu must not start (or fail the hotplug if the new device is > being hotplugged). ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Alex Williamson > Sent: Friday, June 4, 2021 4:42 AM > > > 'qemu --allow-no-snoop' makes more sense to me > > I'd be tempted to attach it to the -device vfio-pci option, it's > specific drivers for specific devices that are going to want this and > those devices may not be permanently attached to the VM. But I see in > the other thread you're trying to optimize IOMMU page table sharing. > > There's a usability question in either case though and I'm not sure how > to get around it other than QEMU or the kernel knowing a list of > devices (explicit IDs or vendor+class) to select per device defaults. > "-device vfio-pci" is a per-device option, which implies that the no-snoop choice is given to the admin then no need to maintain a fixed device list in Qemu? Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jean-Philippe Brucker > Sent: Friday, June 4, 2021 4:18 PM > > On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote: > > > > This implies that VFIO_BOUND_IOASID will be extended to allow user > > > > specify a device label. This label will be recorded in /dev/iommu to > > > > serve per-device invalidation request from and report per-device > > > > fault data to the user. > > > > > > I wonder which of the user providing a 64 bit cookie or the kernel > > > returning a small IDA is the best choice here? Both have merits > > > depending on what qemu needs.. > > > > Yes, either way can work. I don't have a strong preference. Jean? > > I don't see an issue with either solution, maybe it will show up while > prototyping. First one uses IDs that do mean something for someone, and > userspace may inject faults slightly faster since it doesn't need an > ID->vRID lookup, so that's my preference. ok, will go for the first option in v2. > > > > > In addition, vPASID (if provided by user) will > > > > be also recorded in /dev/iommu so vPASID<->pPASID conversion > > > > is conducted properly. e.g. invalidation request from user carries > > > > a vPASID which must be converted into pPASID before calling iommu > > > > driver. Vice versa for raw fault data which carries pPASID while the > > > > user expects a vPASID. > > > > > > I don't think the PASID should be returned at all. It should return > > > the IOASID number in the FD and/or a u64 cookie associated with that > > > IOASID. Userspace should figure out what the IOASID & device > > > combination means. > > > > This is true for Intel. But what about ARM which has only one IOASID > > (pasid table) per device to represent all guest I/O page tables? > > In that case vPASID = pPASID though. The vPASID allocated by the guest is > the same from the vIOMMU inval to the pIOMMU inval. I don't think host > kernel or userspace need to alter it. > yes. So responding to Jason's earlier comment we do need return PASID (although no conversion is required) to userspace in this case. 😊 Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Alex Williamson > Sent: Friday, June 4, 2021 5:44 AM > > > Based on that observation we can say as soon as the user wants to use > > an IOMMU that does not support DMA_PTE_SNP in the guest we can still > > share the IO page table with IOMMUs that do support DMA_PTE_SNP. page table sharing between incompatible IOMMUs is not a critical thing. I prefer to disallowing sharing in such case as the starting point, i.e. the user needs to create separate IOASIDs for such devices. > > If your goal is to prioritize IO page table sharing, sure. But because > we cannot atomically transition from one to the other, each device is > stuck with the pages tables it has, so the history of the VM becomes a > factor in the performance characteristics. > > For example if device {A} is backed by an IOMMU capable of blocking > no-snoop and device {B} is backed by an IOMMU which cannot block > no-snoop, then booting VM1 with {A,B} and later removing device {B} > would result in ongoing wbinvd emulation versus a VM2 only booted with > {A}. > > Type1 would use separate IO page tables (domains/ioasids) for these such > that VM1 and VM2 have the same characteristics at the end. > > Does this become user defined policy in the IOASID model? There's > quite a mess of exposing sufficient GET_INFO for an IOASID for the user > to know such properties of the IOMMU, plus maybe we need mapping flags > equivalent to IOMMU_CACHE exposed to the user, preventing sharing an > IOASID that could generate IOMMU faults, etc. IOMMU_CACHE is a fixed attribute given an IOMMU. So it's better to convey this info to userspace via GET_INFO for a device_label, before creating any IOASID. But overall I agree that careful thinking is required about how to organize those info reporting (per-fd, per-device, per-ioasid) to userspace. > > > > > It doesn't solve the problem to connect kvm to AP and kvmgt though > > > > > > It does not, we'll probably need a vfio ioctl to gratuitously announce > > > the KVM fd to each device. I think some devices might currently fail > > > their open callback if that linkage isn't already available though, so > > > it's not clear when that should happen, ie. it can't currently be a > > > VFIO_DEVICE ioctl as getting the device fd requires an open, but this > > > proposal requires some availability of the vfio device fd without any > > > setup, so presumably that won't yet call the driver open callback. > > > Maybe that's part of the attach phase now... I'm not sure, it's not > > > clear when the vfio device uAPI starts being available in the process > > > of setting up the ioasid. Thanks, > > > > At a certain point we maybe just have to stick to backward compat, I > > think. Though it is useful to think about green field alternates to > > try to guide the backward compat design.. > > I think more to drive the replacement design; if we can't figure out > how to do something other than backwards compatibility trickery in the > kernel, it's probably going to bite us. Thanks, > I'm a bit lost on the desired flow in your minds. Here is one flow based on my understanding of this discussion. Please comment whether it matches your thinking: 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD; 1) Qemu binds dev1 to ioasid_fd; 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_ CACHE info i.e. whether underlying IOMMU can enforce snoop; 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu needs to figure out whether dev1 wants to do no-snoop. This might be based a fixed vendor/class list or specified by user; 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop' flag is specified to decide the page table format, which is supposed to match dev1; 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this point, specify snoop/no-snoop again. If not supported by related iommu or different from what gpa_ioasid has, attach fails. 6) call KVM to update the snoop requirement via KVM_UPADTE_IOASID_FD. this triggers ioasidfd_for_each_ioasid(); later when dev2 is attached to gpa_ioasid, same flow is followed. This implies that KVM_UPDATE_IOASID_FD is called only when new IOASID is created or existing IOASID is destroyed, because all devices under an IOASID should have the same snoop requirement. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 1/3] iommu: Enhance IOMMU default DMA mode build options
If unsure, say N here. +choice + prompt "IOMMU default DMA mode" + depends on IOMMU_API config INTEL_IOMMU depends on PCI_MSI && ACPI && (X86 || IA64) config AMD_IOMMU depends on X86_64 && PCI && ACPI && HAVE_CMPXCHG_DOUBLE config ARM_SMMU_V3 depends on ARM64 "depends on ARM_SMMU_V3 || INTEL_IOMMU || AMD_IOMMU" may need to be added, because it doesn't work on other platforms. or "depends on X86 || IA64 || X86_64 || ARM64" I suppose so. But I don't think that any other archs use the value from iommu_dma_strict anyway. + + default IOMMU_DEFAULT_STRICT + help + This option allows an IOMMU DMA mode to be chosen at build time, to + override the default DMA mode of each ARCH, removing the need to + pass in kernel parameters through command line. It is still possible + to provide ARCH-specific or common boot options to override this + option. + + If unsure, keep the default. + +config IOMMU_DEFAULT_LAZY + bool "lazy" + help + Support lazy mode, where for every IOMMU DMA unmap operation, the + flush operation of IOTLB and the free operation of IOVA are deferred. + They are only guaranteed to be done before the related IOVA will be + reused. + +config IOMMU_DEFAULT_STRICT + bool "strict" + help + For every IOMMU DMA unmap operation, the flush operation of IOTLB and + the free operation of IOVA are guaranteed to be done in the unmap + function. + + This mode is safer than the two above, but it maybe slower in some + high performace scenarios. + +endchoice + config OF_IOMMU def_bool y depends on OF && IOMMU_API diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 966426a96520..177b0dafc535 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -29,7 +29,8 @@ static struct kset *iommu_group_kset; static DEFINE_IDA(iommu_group_ida); static unsigned int iommu_def_domain_type __read_mostly; -static bool iommu_dma_strict __read_mostly = true; +static bool iommu_dma_strict __read_mostly = + IS_ENABLED(CONFIG_IOMMU_DEFAULT_STRICT); Currently, a maximum of 100 columns are allowed in a row. Sure, but some people still prefer limiting to 80, and codingstyle doc mentions a preference for 80. But if you really think I should change, then that's ok. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote: > > > This implies that VFIO_BOUND_IOASID will be extended to allow user > > > specify a device label. This label will be recorded in /dev/iommu to > > > serve per-device invalidation request from and report per-device > > > fault data to the user. > > > > I wonder which of the user providing a 64 bit cookie or the kernel > > returning a small IDA is the best choice here? Both have merits > > depending on what qemu needs.. > > Yes, either way can work. I don't have a strong preference. Jean? I don't see an issue with either solution, maybe it will show up while prototyping. First one uses IDs that do mean something for someone, and userspace may inject faults slightly faster since it doesn't need an ID->vRID lookup, so that's my preference. > > > In addition, vPASID (if provided by user) will > > > be also recorded in /dev/iommu so vPASID<->pPASID conversion > > > is conducted properly. e.g. invalidation request from user carries > > > a vPASID which must be converted into pPASID before calling iommu > > > driver. Vice versa for raw fault data which carries pPASID while the > > > user expects a vPASID. > > > > I don't think the PASID should be returned at all. It should return > > the IOASID number in the FD and/or a u64 cookie associated with that > > IOASID. Userspace should figure out what the IOASID & device > > combination means. > > This is true for Intel. But what about ARM which has only one IOASID > (pasid table) per device to represent all guest I/O page tables? In that case vPASID = pPASID though. The vPASID allocated by the guest is the same from the vIOMMU inval to the pIOMMU inval. I don't think host kernel or userspace need to alter it. Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC] /dev/ioasid uAPI proposal
> From: Jason Gunthorpe > Sent: Thursday, June 3, 2021 8:41 PM > > > When discussing I/O page fault support in another thread, the consensus > > is that an device handle will be registered (by user) or allocated (return > > to user) in /dev/ioasid when binding the device to ioasid fd. From this > > angle we can register {ioasid_fd, device_handle} to KVM and then call > > something like ioasidfd_device_is_coherent() to get the property. > > Anyway the coherency is a per-device property which is not changed > > by how many I/O page tables are attached to it. > > It is not device specific, it is driver specific > > As I said before, the question is if the IOASID itself can enforce > snoop, or not. AND if the device will issue no-snoop or not. Sure. My earlier comment was based on the assumption that all IOASIDs attached to a device should inherit the same snoop/no-snoop fact. But looks it doesn't prevent a device driver from setting PTE_SNP only for selected I/O page tables, according to whether isoch agents are involved. An user space driver could figure out per-IOASID requirements itself. A guest device driver can indirectly convey this information through vIOMMU. Registering {IOASID_FD, IOASID} to KVM has another merit, as we also need it to update CPU PASID mapping for ENQCMD. We can define one interface for both requirements. 😊 Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu