Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > > 
> > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired
> > > > w/o going through the legacy container/group interface. For
> > > > illustration purposes those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use 
> > > > cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity the examples below are all made for the virtualization
> > > > story. They are representative and could be easily adapted to a
> > > > non-virtualization scenario.
> > > 
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> > 
> > Jason, want to confirm here. Per earlier discussion we were left with
> > the impression that you want VFIO to be a pure device driver, with
> > container/group used only by legacy applications.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 
> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

I don't think presence or absence of a group fd makes a lot of
difference to this design.  Having a group fd just means we attach
groups to the ioasid instead of individual devices, and we no longer
need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep the
> > container/group concepts and the user just deprecates the use of the
> > vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if a partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right.  I'd assume that for compatibility, creating a container would
create a single IOASID under the hood with a compatibility layer
translating the container operations to ioasid operations.
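For concreteness, a rough sketch of what that shim could look like (the
function and field names here are invented for illustration, not part of
the proposal):

	/* Hypothetical compat layer: legacy container ioctls become thin
	 * wrappers over a hidden IOASID created with the container. */
	long vfio_container_compat_ioctl(struct vfio_container *container,
					 unsigned int cmd, unsigned long arg)
	{
		switch (cmd) {
		case VFIO_IOMMU_MAP_DMA:	/* forwarded as IOASID_MAP_DMA */
			return ioasid_map_dma(container->compat_ioasid, arg);
		case VFIO_IOMMU_UNMAP_DMA:	/* forwarded as IOASID_UNMAP_DMA */
			return ioasid_unmap_dma(container->compat_ioasid, arg);
		default:
			return -ENOTTY;
		}
	}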

> Keeping the group fd requires charting a path through the ioctls where
> the container is not used and /dev/ioasid is substituted in, using the
> same device-FD-specific ioctls you show here.

Again, I don't think it makes much difference.  The model doesn't
really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on
the IOASID.  Basically ATTACH_GROUP would just be equivalent to
attaching all the constituent devices.
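Something like the following sketch, where ATTACH_GROUP is just sugar
over the per-device attach (all names invented; the iommu group device
list is private to the iommu core, so a real version would go through
proper iommu APIs):

	/* Hypothetical: attach every device in the group to the IOASID. */
	static int ioasid_attach_group(struct ioasid_data *ioasid,
				       struct iommu_group *group)
	{
		struct group_device *gdev;
		int ret;

		list_for_each_entry(gdev, &group->devices, list) {
			ret = ioasid_attach_device(ioasid, gdev->dev);
			if (ret)
				return ret;	/* a real version would unwind */
		}
		return 0;
	}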

> I didn't try to chart this out carefully.
> 
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface,
> so it would be quite reasonable to harmonize these somehow.
> 
> But that is too complicated and far out for me at least to guess on at
> this point..
> 
> > > Still a little unsure why the vPASID is here rather than on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> > 
> > Yes, it's completely sane that the guest links an I/O page table to
> > different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#. 
> 
> Ok..
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > 
> > 5. Use Cases and Flows
> > 
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through the legacy container/group interface. For illustration
> > purposes those devices are just called dev[1...N]:
> > 
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > 
> > As explained earlier, one IOASID fd is sufficient for all intended use 
> > cases:
> > 
> > ioasid_fd = open("/dev/ioasid", mode);
> > 
> > For simplicity the examples below are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Leaving aside whether group fds should exist, while they *do* exist,
binding to an IOASID should be done on the group, not an individual
device.

[snip]
> > /* if dev1 is ENQCMD-capable mdev, update CPU PASID 
> >   * translation structure through KVM
> >   */
> > pa_data = {
> > .ioasid_fd   = ioasid_fd,
> > .ioasid  = gva_ioasid,
> > .guest_pasid = gpasid1,
> > };
> > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Make sense
> 
> > /* Bind guest I/O page table  */
> > bind_data = {
> > .ioasid = gva_ioasid,
> > .addr   = gva_pgtable1,
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I'm pretty sure the device(s) could matter, although they probably
won't usually.  But it would certainly be possible for a system to
have two different host bridges with two different IOMMUs with
different pagetable formats.  Until you know which devices (and
therefore which host bridge) you're talking about, you don't know what
formats of pagetable to accept.  And if you have devices from *both*
bridges you can't bind a page table at all - you could theoretically
support a kernel managed pagetable by mirroring each MAP and UNMAP to
tables in both formats, but it would be pretty reasonable not to
support that.
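As a sketch of what that check could look like at bind time (all names
here are invented, just to illustrate the idea):

	/* Hypothetical: reject BIND_PGTABLE unless every attached device's
	 * IOMMU accepts the requested pagetable format, e.g. when devices
	 * span two host bridges with different IOMMUs. */
	static int ioasid_check_pgtable_format(struct ioasid_data *ioasid,
					       u32 format)
	{
		struct ioasid_attachment *a;

		list_for_each_entry(a, &ioasid->attachments, next)
			if (!(a->iommu_supported_formats & format))
				return -EINVAL;
		return 0;
	}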

> > 5.6. I/O page fault
> > +++
> > 
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> > 
> > -   Host IOMMU driver receives a page request with raw fault_data {rid, 
> > pasid, addr};
> > 
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> > information registered by IOASID fault handler;
> > 
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> > which is saved in ioasid_data->fault_data (used for response);
> > 
> > -   IOASID fault handler generates a user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers eventfd to userspace;
> 
> Here the rid should be translated to a labeled device, returning the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device was
> bound, the label might match a rid or a (rid, pasid)

I like the idea of labelling devices when they're attached; it makes
extension to non-PCI devices much more obvious than having to deal
with concrete RIDs.

But, remember we can only (reliably) determine rid up to the group
boundary.  So if you're labelling devices, all devices in a group
would have to have the same label.  Or you attach the label to a group
not a device, which would be a reason to represent the group as an
object again.
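To make the labelling idea concrete, the user-visible fault record might
look something like this (field names are invented for illustration; the
uAPI here is explicitly TBD):

	/* Hypothetical ring-buffer entry: the kernel translates the raw
	 * (rid, pasid) into the label supplied at bind time, so userspace
	 * never sees raw RIDs. With a multi-device group the label would
	 * effectively be per group. */
	struct ioasid_fault_event {
		__u32 ioasid;
		__u32 device_label;	/* from VFIO_BIND_IOASID_FD */
		__u64 addr;		/* faulting I/O virtual address */
		__u32 pasid;		/* substream, if label covers a whole RID */
		__u32 flags;
	};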

> > -   Upon receiving the event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be fine
> > since the purpose is to fix the I/O page table on the guest;
> 
> The device label should fix this
>  
> > -   Qemu finds the pending fault event, converts virtual completion data 
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to 
> > complete the pending fault;
> > 
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved
> > in ioasid_data->fault_data, and then calls iommu api to complete it with
> > {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > 
> > 
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is tre

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Fri, May 28, 2021 at 08:36:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +
> > 
> > /*
> >   * Check whether a uAPI extension is supported.
> >   *
> >   * This is for FD-level capabilities, such as locked page 
> > pre-registration. 
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION  _IO(IOASID_TYPE, IOASID_BASE + 0)
> 
>  
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so
> >   * subsequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   * - vaddr;
> >   * - size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY  _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all-SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.

Right, I think we can simplify the interface by modelling the
preregistration as a nesting layer.  Well, mostly.. the wrinkle is
that generally you can't do anything with an ioasid until you've
attached devices to it, but that doesn't really make sense for the
prereg layer.  I expect we can find some way to deal with that,
though.

Actually... to simplify that "weak nesting" concept I wonder if we
want to expand to 3 ways of specifying the pagetables for the ioasid:
  1) kernel managed (MAP/UNMAP)
  2) user managed (BIND/INVALIDATE)
  3) pass-through (IOVA==parent address)

Obviously pass-through wouldn't be allowed in all circumstances.
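A sketch of how that could surface in the alloc request (these names are
invented for illustration, not part of the proposal):

	/* Hypothetical: pick the pagetable management mode at alloc time. */
	enum ioasid_pgtable_mode {
		IOASID_PGTABLE_KERNEL,		/* 1) MAP/UNMAP */
		IOASID_PGTABLE_USER,		/* 2) BIND/INVALIDATE */
		IOASID_PGTABLE_PASSTHROUGH,	/* 3) IOVA == parent address */
	};

	struct ioasid_alloc_request {
		__u32 parent;		/* 0 if this is a "base" IOASID */
		__u32 pgtable_mode;	/* enum ioasid_pgtable_mode */
	};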

> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID. 
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address 
> >   * space. Each IOASID is associated with a single I/O page table. User 
> >   * must call this ioctl to get an IOASID for every I/O address space that 
> > is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is 
> >   * attached to a device. Once attached, an empty I/O page table is 
> >   * bound with the IOMMU then the user could use either DMA mapping 
> >   * or pgtable binding commands to manage this I/O page table.
> 
> Can the IOASID can be populated before being attached?

I don't think it reasonably can.  Until attached, you don't actually
know what hardware IOMMU will be backing it, and therefore you don't
know its capabilities.  You can't really allow mappings if you don't
even know the allowed IOVA ranges and page sizes.

> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
> 
> > /*
> >   * Get information about an I/O address space
> >   *
> >   * Supported capabilities:
> >   * - VFIO type1 map/unmap;
> >   * - pgtable/pasid_table binding
> >   * - hardware nesting vs. software nesting;
> >   * - ...
> >   *
> >   * Related attributes:
> >   * - supported page sizes, reserved IOVA ranges (DMA mapping);
> >   * - vendor pgtable formats (pgtable binding);
> >   * - number of child IOASIDs (nesting);
> >   * - ...
> >   *
> >   * Above information is available only after one or more devices are
> >   * attached to the specified IOASID. Otherwise the IOASID is just a
> >   * number w/o any capability or attribute.
> 
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

Yes... but as above, we have no idea what the IOMMU's capabilities are
until devices are attached.
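So the usable flow ends up being alloc, attach, then query, roughly
along these lines (the VFIO attach ioctl name and the struct layouts are
assumptions for illustration; only the /dev/ioasid ioctl names come from
the proposal):

	int gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	/* Attachment goes through the device driver uAPI, e.g. VFIO. */
	struct vfio_attach_ioasid at = {
		.ioasid_fd = ioasid_fd,
		.ioasid    = gpa_ioasid,
	};
	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at);

	/* Only now does GET_INFO reflect the real IOMMU capabilities. */
	struct ioasid_info info = { .ioasid = gpa_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);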

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descrip

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
[snip]
> > With above design /dev/ioasid uAPI is all about I/O address spaces. 
> > It doesn't include any device routing information, which is only 
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). 
> 
> I agree with Jean-Philippe - at the very least erasing this
> information needs a major rationale - but I don't really see why it
> must be erased? The HW reports the originating device, is it just a
> matter of labeling the devices attached to the /dev/ioasid FD so it
> can be reported to userspace?

HW reports the originating device as far as it knows.  In many cases
where you have multiple devices in an IOMMU group, it's because
although they're treated as separate devices at the kernel level, they
have the same RID at the HW level.  Which means a RID for something in
the right group is the closest you can count on supplying.

[snip]
> > However this way significantly 
> > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> > one address space any more. Device routing information (indirectly 
> > marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> > page faulting uAPI to help connect vIOMMU with the underlying 
> > pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> I'm confused by this rationale.
> 
> For a vIOMMU that has IO page tables in the guest the basic
> choices are:
>  - Do we have a hypervisor trap to bind the page table or not? (RID
>and PASID may differ here)
>  - Do we have a hypervisor trap to invalidate the page tables or not?
> 
> If the first is a hypervisor trap then I agree it makes sense to create a
> child IOASID that points to each guest page table and manage it
> directly. This should not require walking guest page tables as it is
> really just informing the HW where the page table lives. HW will walk
> them.
> 
> If there are no hypervisor traps (does this exist?) then there is no
> way to involve the hypervisor here and the child IOASID should simply
> be a pointer to the guest's data structure that describes binding. In
> this case that IOASID should claim all PASIDs when bound to a
> RID. 

And in that case I think we should call that object something other
than an IOASID, since it represents multiple address spaces.

> Invalidation should be passed up the to the IOMMU driver in terms of
> the guest tables information and either the HW or software has to walk
> to guest tables to make sense of it.
> 
> Events from the IOMMU to userspace should be tagged with the attached
> device label and the PASID/substream ID. This means there is no issue
> to have an 'all PASID' IOASID.
> 
> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> > iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> > to find a better name later to differentiate.
> 
> +1 on Jean-Philippe's remarks
> 
> > -   PPC has not been considered yet as we haven't got time to fully
> > understand its semantics. According to previous discussion there is
> > some generality between the PPC window-based scheme and VFIO type1
> > semantics. Let's first make consensus on this proposal and then further
> > discuss how to extend it to cover PPC's requirement.
> 
> From what I understood PPC is not so bad: nesting IOASIDs covers its
> preload feature, and it needs a way to specify/query the IOVA range an
> IOASID will cover.
> 
> > -   There is a protocol between vfio group and kvm. Needs to think about
> > how it will be affected following this proposal.
> 
> Ugh, I always stop looking when I reach that boundary. Can anyone
> summarize what is going on there?
> 
> Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the
> right answer. Eg if ARM needs to get the VMID from KVM and set it to
> ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
> reasonable. Certainly better than the symbol_get stuff we have right
> now.
> 
> I will read through the detail below in another email
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> /dev/ioasid provides a unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as an example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
>   
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 
> 
> It ends up being a long write-up due to the many things to be summarized
> and the non-trivial effort required to connect them into a complete
> proposal. Hope it provides a clean base to converge on.

Thanks for the writeup.  I'm giving this a first pass review, note
that I haven't read all the existing replies in detail yet.

> 
> TOC
> 
> 1. Terminologies and Concepts
> 2. uAPI Proposal
> 2.1. /dev/ioasid uAPI
> 2.2. /dev/vfio uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
> 5.1. A simple example
> 5.2. Multiple IOASIDs (no nesting)
> 5.3. IOASID nesting (software)
> 5.4. IOASID nesting (hardware)
> 5.5. Guest SVA (vSVA)
> 5.6. I/O page fault
> 5.7. BIND_PASID_TABLE
> 
> 
> 1. Terminologies and Concepts
> -
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD?  Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC.  Nesting could be
handled either by 1) having a CREATE_NESTED on the parent fd which
spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..
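For illustration, option 2 might look like this from userspace (the
SET_PARENT ioctl is hypothetical, sketched only to show the shape of the
alternative):

	int parent = open("/dev/ioasid", O_RDWR);
	int child  = open("/dev/ioasid", O_RDWR);

	/* Must precede any other operation on the child fd. */
	ioctl(child, IOASID_SET_PARENT, parent);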

> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind a user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASIDs, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction.  I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

  1) On some fancy future MMU allowing free nesting, where the kernel
 would insert an implicit extra layer translating user addresses
 to physical addresses, and the userspace manages a pagetable with
 its own VAs being the target AS
  2) For a purely software virtual device, where its virtual DMA
     engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.
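Something like this shape, in other words (everything below is an
invented sketch of the idea, not proposed uAPI or internals):

	/* Hypothetical: for a software-nested, user-managed child table,
	 * the child's INVALIDATE handler re-walks the affected range and
	 * rebuilds the merged single-level shadow mappings. */
	static int ioasid_sw_nest_invalidate(struct ioasid_data *child,
					     u64 iova, u64 size)
	{
		shadow_unmap(child->shadow_domain, iova, size);
		return shadow_remap_range(child, iova, size);
	}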

> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/i

Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Krzysztof Kozlowski
On 01/06/2021 20:08, Thierry Reding wrote:
> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
>>> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
 From: Thierry Reding 

 Hi,

 this is a set of patches that is the result of earlier discussions
 regarding early identity mappings that are needed to avoid SMMU faults
 during early boot.

 The goal here is to avoid early identity mappings altogether and instead
 postpone the need for the identity mappings to when devices are attached
 to the SMMU. This works by making the SMMU driver coordinate with the
 memory controller driver on when to start enforcing SMMU translations.
 This makes Tegra behave in a more standard way and pushes the code to
 deal with the Tegra-specific programming into the NVIDIA SMMU
 implementation.

 Compared to the original version of these patches, I've split the
 preparatory work into a separate patch series because it became very
 large and will be mostly uninteresting for this audience.

 Patch 1 provides a mechanism to program SID overrides at runtime. Patch
 2 updates the ARM SMMU device tree bindings to include the Tegra186
 compatible string as suggested by Robin during review.

 Patches 3 and 4 create the fundamentals in the SMMU driver to support
 this and also make this functionality available on Tegra186. Patch 5
 hooks the ARM SMMU up to the memory controller so that the memory client
 stream ID overrides can be programmed at the right time.

 Patch 6 extends this mechanism to Tegra186 and patches 7-9 enable all of
 this through device tree updates. Patch 10 is included here to show how
 SMMU will be enabled for display controllers. However, it cannot be
 applied yet because the code to create identity mappings for potentially
 live framebuffers hasn't been merged yet.

 The end result is that various peripherals will have SMMU enabled, while
 the display controllers will keep using passthrough, as initially set up
 by firmware. Once the device tree bindings have been accepted and the
 SMMU driver has been updated to create identity mappings for the display
 controllers, they can be hooked up to the SMMU and the code in this
 series will automatically program the SID overrides to enable SMMU
 translations at the right time.

 Note that the series creates a compile time dependency between the
 memory controller and IOMMU trees. If it helps I can provide a branch
 for each tree, modelling the dependency, once the series has been
 reviewed.

 Changes in v2:
 - split off the preparatory work into a separate series (that needs to
   be applied first)
 - address review comments by Robin

 Thierry

 Thierry Reding (10):
   memory: tegra: Implement SID override programming
   dt-bindings: arm-smmu: Add Tegra186 compatible string
   iommu/arm-smmu: Implement ->probe_finalize()
   iommu/arm-smmu: tegra: Detect number of instances at runtime
   iommu/arm-smmu: tegra: Implement SID override programming
   iommu/arm-smmu: Use Tegra implementation on Tegra186
   arm64: tegra: Use correct compatible string for Tegra186 SMMU
   arm64: tegra: Hook up memory controller to SMMU on Tegra186
   arm64: tegra: Enable SMMU support on Tegra194
   arm64: tegra: Enable SMMU support for display on Tegra194

  .../devicetree/bindings/iommu/arm,smmu.yaml   |  11 +-
  arch/arm64/boot/dts/nvidia/tegra186.dtsi  |   4 +-
  arch/arm64/boot/dts/nvidia/tegra194.dtsi  | 166 ++
  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|   3 +-
  drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c  |  90 --
  drivers/iommu/arm/arm-smmu/arm-smmu.c |  13 ++
  drivers/iommu/arm/arm-smmu/arm-smmu.h |   1 +
  drivers/memory/tegra/mc.c |   9 +
  drivers/memory/tegra/tegra186.c   |  72 
  include/soc/tegra/mc.h|   3 +
  10 files changed, 349 insertions(+), 23 deletions(-)
>>>
>>> Will, Robin,
>>>
>>> do you have any more comments on the ARM SMMU bits of this series? If
>>> not, can you guys provide an Acked-by so that Krzysztof can pick this
>>> (modulo the DT patches) up into the memory-controller tree for v5.14?
>>>
>>> I'll send out a v3 with the bisectibilitiy fix that Krishna pointed
>>> out.
>>
>> Probably best if I queue 3-6 on a separate branch once you send a v3,
>> then Krzysztof can pull that in if he needs it.
> 
> Patch 5 has a build-time dependency on patch 1, so they need to go in
> together. The reason why I suggested Krzysztof pick these up is because
> there is a restructuring series that this depends on, which will go into
> Krzysztof's tree. So in order to 

Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Krzysztof Kozlowski
On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> On 01/06/2021 20:08, Thierry Reding wrote:
>> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
>>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
 On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> From: Thierry Reding 
>
>>> Probably best if I queue 3-6 on a separate branch once you send a v3,
>>> then Krzysztof can pull that in if he needs it.
>>
>> Patch 5 has a build-time dependency on patch 1, so they need to go in
>> together. The reason why I suggested Krzysztof pick these up is because
>> there is a restructuring series that this depends on, which will go into
>> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>> and mostly unrelated stuff as well.
> 
> I missed that part... what other series are needed for this one? Except
> Dmitry's power management set I do not have anything in my sight for
> Tegras memory controllers.
> 
> Anyway, I can take the memory bits and provide a stable tag with these.
> Recently there has been quite a lot of work around Tegra memory
> controllers, so this makes sense especially if new patches appear.

OK, I think I now have the patchset you talked about - "memory: tegra:
Driver unification" v2, right?


Best regards,
Krzysztof


Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured

2021-06-02 Thread John Garry

On 02/06/2021 05:37, Lu Baolu wrote:

On 6/1/21 10:29 PM, John Garry wrote:
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced.

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA ageing issue,
as discussed at [1].

This series allows the IOVA rcache range be configured, so that we may
cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used
indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated the same way as the IOMMU group default domain type
is updated, i.e. one must unbind the only device in the group first.


Could you explain why it requires a singleton group and driver unbinding
if the user only wants to increase the upper limit? I haven't dived into
the details yet, sorry if this is a silly question.


Hi Baolu,

I did actually try increasing the range for a 'live' domain in the v1
series, but it turned out too messy. The first problem is reallocating
the memory to hold the rcaches. The second problem is that we need to
deal with the issue that all IOVAs in the rcache need to be a power-of-2
in size, which is difficult to enforce for IOVAs which weren't being
cached before but now would be.


So now I changed it to work similarly to how we change the default domain
type, i.e. we don't operate on a 'live' domain.


Thanks,
John


Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

2021-06-02 Thread Daniel Borkmann

On 6/1/21 7:42 PM, Jussi Maki wrote:

Hi Robin,

On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy  wrote:

The regression shows as a significant drop in throughput as measured
with "super_netperf" [0],
with measured bandwidth of ~95Gbps before and ~35Gbps after:


I guess that must be the difference between using the flush queue
vs. strict invalidation. On closer inspection, it seems to me that
there's a subtle pre-existing bug in the AMD IOMMU driver, in that
amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
has called bus_set_iommu(). Does the patch below work?


Thanks for the quick response & patch. I tried it out and indeed it
does solve the issue:

# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
95341.2

root@zh-lab-node-3:~# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1
17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
33989.5


Robin, probably goes without saying, but please make sure to include ...

Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE")

... to your fix in [0], maybe along with another Fixes tag pointing to the
original commit adding this issue. But certainly a250c23f15c2 would be good
given the regression was uncovered on that one first, so that Greg et al
have a chance to pick this fix up for stable kernels.

Thanks everyone!

  [0] https://lore.kernel.org/bpf/7f048c57-423b-68ba-eede-7e194c1fe...@arm.com/


Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Thierry Reding
On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> > On 01/06/2021 20:08, Thierry Reding wrote:
> >> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> >>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
>  On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> > From: Thierry Reding 
> >
> >>> Probably best if I queue 3-6 on a separate branch once you send a v3,
> >>> then Krzysztof can pull that in if he needs it.
> >>
> >> Patch 5 has a build-time dependency on patch 1, so they need to go in
> >> together. The reason why I suggested Krzysztof pick these up is because
> >> there is a restructuring series that this depends on, which will go into
> >> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
> >> and mostly unrelated stuff as well.
> > 
> > I missed that part... what other series are needed for this one? Except
> > Dmitry's power management set I do not have anything in my sight for
> > Tegras memory controllers.
> > 
> > Anyway, I can take the memory bits and provide a stable tag with these.
> > Recently there has been quite a lot of work around Tegra memory
> > controllers, so this makes sense especially if new patches appear.
> 
> OK, I think I have now the patchset you talked about - "memory: tegra:
> Driver unification" v2, right?

Yes, that's the one. That series is fairly self-contained, but Dmitry's
power management set has dependencies that pull in the regulator, clock
and ARM SoC trees.

I did a test merge of the driver unification series with a branch that
has Dmitry's patches and all the dependencies and there are no conflicts
so that, fortunately, doesn't further complicate things.

Do you want me to send you a pull request with Dmitry's memory
controller changes? You could then apply the unification series on top,
which should allow this SMMU series to apply cleanly on top of that.

I can also carry all these changes in the Tegra tree and send a PR in a
few days once this has seen a bit more testing in linux-next, which also
makes sure it's got a bit more testing in our internal test farm.

Thierry



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:

I summarized five opens here, about:

1)  Finalizing the name to replace /dev/ioasid;
2)  Whether one device is allowed to bind to multiple IOASID fd's;
3)  Carry device information in invalidation/fault reporting uAPI;
4)  What should/could be specified when allocating an IOASID;
5)  The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for
two reasons. First, per AMD's presentation in last KVM forum they
implement vIOMMU in hardware thus need to support user-managed
domains. An iommu uAPI notation might make more sense moving
forward. Second, it makes later uAPI naming easier as 'IOASID' can
always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time.



It looks to me like using "/dev/iommu" excludes the possibility of
implementing IOASID in a device-specific way (e.g. through the
co-operation of a device MMU with the platform IOMMU)?


What's more, the ATS spec doesn't forbid device page faults (#PF) from
being reported in a device-specific way.


Thanks



We've also talked about this as /dev/sva for a while and
now /dev/ioasid

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.
  



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
  

We can open up to ~0U file descriptors, I don't see why we need to restrict
it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select() don't work at all
anymore, and IIRC there are more complications.



I don't see much difference between IOASID fds and other types of fds.
People can choose to use poll or epoll.
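E.g. a minimal sketch of waiting on an ioasid fd with epoll, which
unlike select() is not capped by FD_SETSIZE (typically 1024):

	#include <sys/epoll.h>

	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = ioasid_fd };

	epoll_ctl(ep, EPOLL_CTL_ADD, ioasid_fd, &ev);
	epoll_wait(ep, &ev, 1, -1);	/* wakes when a fault event is pending */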


And with the current proposal (assuming there's an N:1 mapping from
ioasids to the ioasid fd), I wonder how select can work for a specific
ioasid.


Thanks




A huge number of FDs for typical usages should be avoided.

Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Enrico Weigelt, metux IT consult

On 31.05.21 19:37, Parav Pandit wrote:


It appears that this is only to make map ioctl faster apart from accounting.
It doesn't have any ioasid handle input either.

In that case, can it be a new system call? Why does it have to be under
/dev/ioasid? For example, a few years back such a system call, mpin(),
was proposed in [1].


I'm very reluctant about more syscall inflation. We already have lots of
syscalls that could have been easily done via devices or filesystems
(yes, some of them are just old Unix relics).

Syscalls don't play well w/ modules, containers, distributed systems,
etc, and need extra low-level code for most non-C languages (eg.
scripting languages).


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Enrico Weigelt, metux IT consult

On 27.05.21 09:58, Tian, Kevin wrote:

Hi,


/dev/ioasid provides an unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.


While I'm in favour of having generic APIs for generic tasks, as well as
using FDs, I wonder whether it has to be a new and separate device.

Now applications have to use multiple APIs in lockstep. One consequence
of that is operators, as well as provisioning systems, container
infrastructures, etc, always have to consider multiple devices together.

You can't just say "give workload XY access to device /dev/foo" anymore.
Now you have to take care of scenarios like "if someone wants /dev/foo,
he also needs /dev/bar". And if that happens multiple times together
("/dev/foo and /dev/wurst both require /dev/bar"), leading to scenarios
where the dev nodes are bind-mounted somewhere, you need to take care
that additional devices aren't bind-mounted twice, etc ...

If I understand this correctly, /dev/ioasid is a kind of "common
supplier" to other APIs / devices. Why can't the fd be acquired by the
consumer APIs (e.g. kvm, vfio, etc)?


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 1:29 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:


For the case of 1M, I would like to know what's the use case for a single
process to handle 1M+ address spaces?

For some scenarios every guest PASID will require a IOASID ID # so
there is a large enough demand that FDs alone are not a good fit.

Further there are global container wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at the startup.



So if we implement a per-fd model, the global "container" properties could
be done via the parent fd, e.g. attaching the parent to the device at
startup.






So this RFC treats fd as a container of address spaces which is each
tagged by an IOASID.

If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship



This can also be done per fd? An fd parent can have multiple fd children.

Thanks




Jason




Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()

2021-06-02 Thread Borislav Petkov
On Sat, May 29, 2021 at 11:17:30AM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -56,11 +56,8 @@
>  # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
>  #endif
>  
> -#ifdef CONFIG_IOMMU_SUPPORT
> -# define DISABLE_ENQCMD  0
> -#else
> -# define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
> -#endif
> +/* Force disable because it's broken beyond repair */
> +#define DISABLE_ENQCMD   (1 << (X86_FEATURE_ENQCMD & 31))

Yeah, for that to work we need:

---
From: Borislav Petkov 
Date: Wed, 2 Jun 2021 12:07:52 +0200
Subject: [PATCH] dmaengine: idxd: Use cpu_feature_enabled()

When testing x86 feature bits, use cpu_feature_enabled() so that
build-disabled features can remain off, regardless of what CPUID says.

Fixes: 8e50d392652f ("dmaengine: idxd: Add shared workqueue support")
Signed-off-by: Borislav Petkov 
Cc: 
---
 drivers/dma/idxd/init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 2a926bef87f2..776fd44aff5f 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -745,12 +745,12 @@ static int __init idxd_init_module(void)
 	 * If the CPU does not support MOVDIR64B or ENQCMDS, there's no point in
 	 * enumerating the device. We can not utilize it.
 	 */
-	if (!boot_cpu_has(X86_FEATURE_MOVDIR64B)) {
+	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
 		pr_warn("idxd driver failed to load without MOVDIR64B.\n");
 		return -ENODEV;
 	}
 
-	if (!boot_cpu_has(X86_FEATURE_ENQCMD))
+	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
 		pr_warn("Platform does not have ENQCMD(S) support.\n");
 	else
 		support_enqcmd = true;
-- 
2.29.2

For the newly CCed parties, pls check

https://lkml.kernel.org/r/87mtsd6gr9@nanos.tec.linutronix.de

for more info.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()

2021-06-02 Thread Thomas Gleixner
On Wed, Jun 02 2021 at 12:14, Borislav Petkov wrote:

> On Sat, May 29, 2021 at 11:17:30AM +0200, Thomas Gleixner wrote:
>> --- a/arch/x86/include/asm/disabled-features.h
>> +++ b/arch/x86/include/asm/disabled-features.h
>> @@ -56,11 +56,8 @@
>>  # define DISABLE_PTI(1 << (X86_FEATURE_PTI & 31))
>>  #endif
>>  
>> -#ifdef CONFIG_IOMMU_SUPPORT
>> -# define DISABLE_ENQCMD 0
>> -#else
>> -# define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
>> -#endif
>> +/* Force disable because it's broken beyond repair */
>> +#define DISABLE_ENQCMD  (1 << (X86_FEATURE_ENQCMD & 31))
>
> Yeah, for that to work we need:
>
> ---
> From: Borislav Petkov 
> Date: Wed, 2 Jun 2021 12:07:52 +0200
> Subject: [PATCH] dmaengine: idxd: Use cpu_feature_enabled()
>
> When testing x86 feature bits, use cpu_feature_enabled() so that
> build-disabled features can remain off, regardless of what CPUID says.
>
> Fixes: 8e50d392652f ("dmaengine: idxd: Add shared workqueue support")
> Signed-off-by: Borislav Petkov 
> Cc: 

Reviewed-by: Thomas Gleixner 

Thanks for spotting this!

   tglx



Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Krzysztof Kozlowski
On 02/06/2021 10:52, Thierry Reding wrote:
> On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
>> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
>>> On 01/06/2021 20:08, Thierry Reding wrote:
 On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
>> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
>>> From: Thierry Reding 
>>>
> Probably best if I queue 3-6 on a separate branch once you send a v3,
> then Krzysztof can pull that in if he needs it.

 Patch 5 has a build-time dependency on patch 1, so they need to go in
 together. The reason why I suggested Krzysztof pick these up is because
 there is a restructuring series that this depends on, which will go into
 Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
 and mostly unrelated stuff as well.
>>>
>>> I missed that part... what other series are needed for this one? Except
>>> Dmitry's power management set I do not have anything in my sight for
>>> Tegras memory controllers.
>>>
>>> Anyway, I can take the memory bits and provide a stable tag with these.
> >>> Recently there has been quite a lot of work around Tegra memory
> >>> controllers, so this makes sense especially if new patches appear.
>>
>> OK, I think I have now the patchset you talked about - "memory: tegra:
>> Driver unification" v2, right?
> 
> Yes, that's the one. That series is fairly self-contained, but Dmitry's
> power management set has dependencies that pull in the regulator, clock
> and ARM SoC trees.
> 
> I did a test merge of the driver unification series with a branch that
> has Dmitry's patches and all the dependencies and there are no conflicts
> so that, fortunately, doesn't further complicate things.
> 
> Do you want me to send you a pull request with Dmitry's memory
> controller changes? You could then apply the unification series on top,
> which should allow this SMMU series to apply cleanly on top of that.

Makes sense and it looks quite bulletproof for future changes. Let's do
it like this. I will apply your patch 1/10 from this v2 on top of it and
the driver unification later.

> I can also carry all these changes in the Tegra tree and send a PR in a
> few days once this has seen a bit more testing in linux-next, which also
> makes sure it's got a bit more testing in our internal test farm.
> 


Best regards,
Krzysztof


Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Will Deacon
On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 10:52, Thierry Reding wrote:
> > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> >>> On 01/06/2021 20:08, Thierry Reding wrote:
>  On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> >>> From: Thierry Reding 
> >>>
> > Probably best if I queue 3-6 on a separate branch once you send a v3,
> > then Krzysztof can pull that in if he needs it.
> 
>  Patch 5 has a build-time dependency on patch 1, so they need to go in
>  together. The reason why I suggested Krzysztof pick these up is because
>  there is a restructuring series that this depends on, which will go into
>  Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>  and mostly unrelated stuff as well.
> >>>
> >>> I missed that part... what other series are needed for this one? Except
> >>> Dmitry's power management set I do not have anything in my sight for
> >>> Tegras memory controllers.
> >>>
> >>> Anyway, I can take the memory bits and provide a stable tag with these.
> >>> Recently there has been quite a lot of work around Tegra memory
> >>> controllers, so this makes sense especially if new patches appear.
> >>
> >> OK, I think I have now the patchset you talked about - "memory: tegra:
> >> Driver unification" v2, right?
> > 
> > Yes, that's the one. That series is fairly self-contained, but Dmitry's
> > power management set has dependencies that pull in the regulator, clock
> > and ARM SoC trees.
> > 
> > I did a test merge of the driver unification series with a branch that
> > has Dmitry's patches and all the dependencies and there are no conflicts
> > so that, fortunately, doesn't further complicate things.
> > 
> > Do you want me to send you a pull request with Dmitry's memory
> > controller changes? You could then apply the unification series on top,
> > which should allow this SMMU series to apply cleanly on top of that.
> 
> Makes sense and it looks quite bulletproof for future changes. Let's do
> it like this. I will apply your patch 1/10 from this v2 on top of it and
> the driver unification later.

Okey doke. Thierry -- please let me know which patches I can queue. Having
the SMMU driver changes in the IOMMU tree really helps in case of conflicts
with other SMMU changes. As I say, I can put stuff on a separate branch for
you if it helps.

Will


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Parav Pandit


> From: Enrico Weigelt, metux IT consult 
> Sent: Wednesday, June 2, 2021 2:09 PM
> 
> On 31.05.21 19:37, Parav Pandit wrote:
> 
> > It appears that this is only to make map ioctl faster apart from accounting.
> > It doesn't have any ioasid handle input either.
> >
> > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid? For example, a few years back such a system call, mpin(),
> > was proposed in [1].
> 
> I'm very reluctant about more syscall inflation. We already have lots of syscalls
> that could have been easily done via devices or filesystems (yes, some of
> them are just old Unix relics).
> 
> Syscalls don't play well w/ modules, containers, distributed systems, etc, and
> need extra low-level code for most non-C languages (eg.
> scripting languages).

Likely, but as per my understanding, this ioctl() is a wrapper around
device-agnostic code along the lines of:

{
	atomic64_add(npages, &mm->pinned_vm);
	pin_user_pages(...);
}

And the mm must hold a reference to them, so that these pages cannot be
munmap()ed or freed.

And the second reason, I think (I could be wrong), is that the second-level
page table for a PASID should be the same one that the process CR3 uses.
Essentially the IOMMU page table and the MMU page table should point to the
same page table entries. If they are different, then even if the guest CPU
has accessed the pages, device access via the IOMMU will result in expensive
page faults.

So assuming both CR3 and the PASID table entry point to the same page table,
I fail to understand the need for the extra refcount and hence the
driver-specific ioctl(). Though I do not have a strong objection to the
ioctl(), I want to know what it will and will not do.
io_uring has a similar ioctl() doing io_sqe_buffer_register(), pinning the
memory.


Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

2021-06-02 Thread Robin Murphy

On 2021-06-02 09:09, Daniel Borkmann wrote:

On 6/1/21 7:42 PM, Jussi Maki wrote:

Hi Robin,

On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy  wrote:

The regression shows as a significant drop in throughput as measured
with "super_netperf" [0],
with measured bandwidth of ~95Gbps before and ~35Gbps after:


I guess that must be the difference between using the flush queue
vs. strict invalidation. On closer inspection, it seems to me that
there's a subtle pre-existing bug in the AMD IOMMU driver, in that
amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
has called bus_set_iommu(). Does the patch below work?


Thanks for the quick response & patch. I tried it out and indeed it
does solve the issue:


Cool, thanks Jussi. May I infer a Tested-by tag from that?


# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
95341.2

root@zh-lab-node-3:~# uname -a
Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1
17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@zh-lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
33989.5


Robin, probably goes without saying, but please make sure to include ...

Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE")

... to your fix in [0], maybe along with another Fixes tag pointing to
the original commit adding this issue. But certainly a250c23f15c2 would
be good given the regression was uncovered on that one first, so that
Greg et al have a chance to pick this fix up for stable kernels.


Given that the race looks to have been pretty theoretical until now, I'm 
not convinced it's worth the bother of digging through the long history 
of default domain and DMA ops movement to figure out where it started, much 
less attempt invasive backports. The flush queue change which made it 
apparent only landed in 5.13-rc1, so as long as we can get this in as a 
fix in the current cycle we should be golden - in the meantime, note 
that booting with "iommu.strict=0" should also restore the expected 
behaviour.


FWIW I do still plan to resend the patch "properly" soon (in all honesty 
it wasn't even compile-tested!)


Cheers,
Robin.


Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Thierry Reding
On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 10:52, Thierry Reding wrote:
> > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> >>> On 01/06/2021 20:08, Thierry Reding wrote:
>  On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> >>> From: Thierry Reding 
> >>>
> > Probably best if I queue 3-6 on a separate branch once you send a v3,
> > then Krzysztof can pull that in if he needs it.
> 
>  Patch 5 has a build-time dependency on patch 1, so they need to go in
>  together. The reason why I suggested Krzysztof pick these up is because
>  there is a restructuring series that this depends on, which will go into
>  Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>  and mostly unrelated stuff as well.
> >>>
> >>> I missed that part... what other series are needed for this one? Other
> >>> than Dmitry's power management set I do not have anything in sight for
> >>> Tegra's memory controllers.
> >>>
> >>> Anyway, I can take the memory bits and provide a stable tag with these.
> >>> Recently there was quite a lot of work around Tegra memory controllers, so
> >>> this makes particular sense if new patches appear.
> >>
> >> OK, I think I have now the patchset you talked about - "memory: tegra:
> >> Driver unification" v2, right?
> > 
> > Yes, that's the one. That series is fairly self-contained, but Dmitry's
> > power management set has dependencies that pull in the regulator, clock
> > and ARM SoC trees.
> > 
> > I did a test merge of the driver unification series with a branch that
> > has Dmitry's patches and all the dependencies and there are no conflicts
> > so that, fortunately, doesn't further complicate things.
> > 
> > Do you want me to send you a pull request with Dmitry's memory
> > controller changes? You could then apply the unification series on top,
> > which should allow this SMMU series to apply cleanly on top of that.
> 
> Makes sense and it looks quite bulletproof for future changes. Let's do it
> like this. I will apply your patch 1/10 from this v2 on top of it and the
> driver unification series later.

The SMMU series here depends on the unification series, so the
unification series needs to go first. It'd be a fair bit of work to
reverse that because the ->probe_device() callback implemented by the
first patch of this SMMU series is part of the tegra_mc_ops structure
that's introduced in the unification series.

Thierry



Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Thierry Reding
On Wed, Jun 02, 2021 at 12:40:49PM +0100, Will Deacon wrote:
> On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
> > On 02/06/2021 10:52, Thierry Reding wrote:
> > > On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> > >> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> > >>> On 01/06/2021 20:08, Thierry Reding wrote:
> >  On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
> > > On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> > >> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> > >>> From: Thierry Reding 
> > >>>
> > > Probably best if I queue 3-6 on a separate branch once you send a v3,
> > > then Krzysztof can pull that in if he needs it.
> > 
> >  Patch 5 has a build-time dependency on patch 1, so they need to go in
> >  together. The reason why I suggested Krzysztof pick these up is because
> >  there is a restructuring series that this depends on, which will go into
> >  Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
> >  and mostly unrelated stuff as well.
> > >>>
> > >>> I missed that part... what other series are needed for this one? Other
> > >>> than Dmitry's power management set I do not have anything in sight for
> > >>> Tegra's memory controllers.
> > >>>
> > >>> Anyway, I can take the memory bits and provide a stable tag with these.
> > >>> Recently there was quite a lot of work around Tegra memory controllers, so
> > >>> this makes particular sense if new patches appear.
> > >>
> > >> OK, I think I have now the patchset you talked about - "memory: tegra:
> > >> Driver unification" v2, right?
> > > 
> > > Yes, that's the one. That series is fairly self-contained, but Dmitry's
> > > power management set has dependencies that pull in the regulator, clock
> > > and ARM SoC trees.
> > > 
> > > I did a test merge of the driver unification series with a branch that
> > > has Dmitry's patches and all the dependencies and there are no conflicts
> > > so that, fortunately, doesn't further complicate things.
> > > 
> > > Do you want me to send you a pull request with Dmitry's memory
> > > controller changes? You could then apply the unification series on top,
> > > which should allow this SMMU series to apply cleanly on top of that.
> > 
> > Makes sense and it looks quite bulletproof for future changes. Let's do it
> > like this. I will apply your patch 1/10 from this v2 on top of it and the
> > driver unification series later.
> 
> Okey doke. Thierry -- please let me know which patches I can queue. Having
> the SMMU driver changes in the IOMMU tree really helps in case of conflicts
> with other SMMU changes. As I say, I can put stuff on a separate branch for
> you if it helps.

Given that the SMMU patches have a build-time dependency on the first
patch in the series, and the series depends on the unification series, I
think Krzysztof would have to pull the memory controller branch that I
have with Dmitry's work, apply the unification series on top and then
take patch 1 of this series on top of that. That's the state that you'd
have to pull into the SMMU tree to get the right dependencies.

The SMMU pieces are the leaf of the dependency tree, so technically no
separate branch is needed, because I don't think anybody would have to
pull from it. However, a separate branch might make it easier to back
any of this work out if we have to.

Krzysztof, I do plan on sending out a v3 of the unification series to
address the two cleanups that Dmitry and you have pointed out. After
that I can send out v3 of this series to fix the ordering issue that
Krishna had mentioned. After all of those are applied, would you be able
to provide a stable branch for Will's SMMU tree?

Thierry



Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Krzysztof Kozlowski
On 02/06/2021 16:53, Thierry Reding wrote:
> On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
>> On 02/06/2021 10:52, Thierry Reding wrote:
>>> On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
 On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
> On 01/06/2021 20:08, Thierry Reding wrote:
>> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
>>> On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
 On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
> From: Thierry Reding 
>
>>> Probably best if I queue 3-6 on a separate branch once you send a v3,
>>> then Krzysztof can pull that in if he needs it.
>>
>> Patch 5 has a build-time dependency on patch 1, so they need to go in
>> together. The reason why I suggested Krzysztof pick these up is because
>> there is a restructuring series that this depends on, which will go into
>> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>> and mostly unrelated stuff as well.
>
> I missed that part... what other series are needed for this one? Other than
> Dmitry's power management set I do not have anything in sight for
> Tegra's memory controllers.
>
> Anyway, I can take the memory bits and provide a stable tag with these.
> Recently there was quite a lot of work around Tegra memory controllers, so
> this makes particular sense if new patches appear.

 OK, I think I have now the patchset you talked about - "memory: tegra:
 Driver unification" v2, right?
>>>
>>> Yes, that's the one. That series is fairly self-contained, but Dmitry's
>>> power management set has dependencies that pull in the regulator, clock
>>> and ARM SoC trees.
>>>
>>> I did a test merge of the driver unification series with a branch that
>>> has Dmitry's patches and all the dependencies and there are no conflicts
>>> so that, fortunately, doesn't further complicate things.
>>>
>>> Do you want me to send you a pull request with Dmitry's memory
>>> controller changes? You could then apply the unification series on top,
>>> which should allow this SMMU series to apply cleanly on top of that.
>>
>> Makes sense and it looks quite bulletproof for future changes. Let's do it
>> like this. I will apply your patch 1/10 from this v2 on top of it and the
>> driver unification series later.
> 
> The SMMU series here depends on the unification series, so the
> unification series needs to go first. It'd be a fair bit of work to
> reverse that because the ->probe_device() callback implemented by the
> first patch of this SMMU series is part of the tegra_mc_ops structure
> that's introduced in the unification series.

Right, you already wrote that in the first email in this thread; I just
reversed the words in my head... Then, as you wrote: take Dmitry's changes
and share them with me via a pull request. I'll put the unification series
on top, then patch #1 from this SMMU series, and finally I'll provide a pull
request for Will so his SMMU work can go ahead.

Best regards,
Krzysztof


Re: [PATCH v2 00/10] arm64: tegra: Prevent early SMMU faults

2021-06-02 Thread Krzysztof Kozlowski
On 02/06/2021 16:58, Thierry Reding wrote:
> On Wed, Jun 02, 2021 at 12:40:49PM +0100, Will Deacon wrote:
>> On Wed, Jun 02, 2021 at 12:44:58PM +0200, Krzysztof Kozlowski wrote:
>>> On 02/06/2021 10:52, Thierry Reding wrote:
 On Wed, Jun 02, 2021 at 09:35:13AM +0200, Krzysztof Kozlowski wrote:
> On 02/06/2021 09:33, Krzysztof Kozlowski wrote:
>> On 01/06/2021 20:08, Thierry Reding wrote:
>>> On Tue, Jun 01, 2021 at 01:26:46PM +0100, Will Deacon wrote:
 On Fri, May 28, 2021 at 07:05:28PM +0200, Thierry Reding wrote:
> On Tue, Apr 20, 2021 at 07:26:09PM +0200, Thierry Reding wrote:
>> From: Thierry Reding 
>>
 Probably best if I queue 3-6 on a separate branch once you send a v3,
 then Krzysztof can pull that in if he needs it.
>>>
>>> Patch 5 has a build-time dependency on patch 1, so they need to go in
>>> together. The reason why I suggested Krzysztof pick these up is because
>>> there is a restructuring series that this depends on, which will go into
>>> Krzysztof's tree. So in order to pull in 3-6, you'd get a bunch of other
>>> and mostly unrelated stuff as well.
>>
>> I missed that part... what other series are needed for this one? Other than
>> Dmitry's power management set I do not have anything in sight for
>> Tegra's memory controllers.
>>
>> Anyway, I can take the memory bits and provide a stable tag with these.
>> Recently there was quite a lot of work around Tegra memory controllers, so
>> this makes particular sense if new patches appear.
>
> OK, I think I have now the patchset you talked about - "memory: tegra:
> Driver unification" v2, right?

 Yes, that's the one. That series is fairly self-contained, but Dmitry's
 power management set has dependencies that pull in the regulator, clock
 and ARM SoC trees.

 I did a test merge of the driver unification series with a branch that
 has Dmitry's patches and all the dependencies and there are no conflicts
 so that, fortunately, doesn't further complicate things.

 Do you want me to send you a pull request with Dmitry's memory
 controller changes? You could then apply the unification series on top,
 which should allow this SMMU series to apply cleanly on top of that.
>>>
>>> Makes sense and it looks quite bulletproof for future changes. Let's do it
>>> like this. I will apply your patch 1/10 from this v2 on top of it and the
>>> driver unification series later.
>>
>> Okey doke. Thierry -- please let me know which patches I can queue. Having
>> the SMMU driver changes in the IOMMU tree really helps in case of conflicts
>> with other SMMU changes. As I say, I can put stuff on a separate branch for
>> you if it helps.
> 
> Given that the SMMU patches have a build-time dependency on the first
> patch in the series, and the series depends on the unification series, I
> think Krzysztof would have to pull the memory controller branch that I
> have with Dmitry's work, apply the unification series on top and then
> take patch 1 of this series on top of that. That's the state that you'd
> have to pull into the SMMU tree to get the right dependencies.
> 
> The SMMU pieces are the leaf of the dependency tree, so technically no
> separate branch is needed, because I don't think anybody would have to
> pull from it. However, a separate branch might make it easier to back
> any of this work out if we have to.
> 
> Krzysztof, I do plan on sending out a v3 of the unification series to
> address the two cleanups that Dmitry and you have pointed out. After
> that I can send out v3 of this series to fix the ordering issue that
> Krishna had mentioned. After all of those are applied, would you be able
> to provide a stable branch for Will's SMMU tree?

Yes, I will provide a stable branch/tag.

Best regards,
Krzysztof


Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-06-02 Thread Tianyu Lan

Hi Boris:
Thanks for your review.

On 6/2/2021 9:16 AM, Boris Ostrovsky wrote:


On 5/30/21 11:06 AM, Tianyu Lan wrote:

@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
  
  IOMMU_INIT_FINISH(2,
- NULL,
+ hyperv_swiotlb_detect,
  pci_xen_swiotlb_init,
  NULL);



Could you explain this change?


Hyper-V allocates its own swiotlb bounce buffer, so the default
swiotlb buffer should not be allocated; swiotlb_init() in
pci_swiotlb_init() is what allocates the default buffer. To achieve
this, hyperv_swiotlb_detect() is put as the first entry in the
iommu_table_entry list. The detect loop in pci_iommu_alloc() will
exit once hyperv_swiotlb_detect() is called in a Hyper-V VM, and the
other iommu_table_entry callbacks will not be called.
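
For reference, the detect loop works roughly like this (condensed from the
x86 pci_iommu_alloc() of this era; not the verbatim source):

	struct iommu_table_entry *p;

	for (p = __iommu_table; p < __iommu_table_end; p++) {
		if (p->detect && p->detect(p) > 0) {
			p->flags |= IOMMU_DETECTED;
			if (p->early_init)
				p->early_init();
			/* IOMMU_INIT_FINISH entries set this flag, so once
			 * hyperv_swiotlb_detect() matches, the scan stops and
			 * the default swiotlb path is never reached. */
			if (p->flags & IOMMU_FINISH_IF_DETECTED)
				break;
		}
	}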



[PATCH v3 0/6] Add support for ACPI VIOT

2021-06-02 Thread Jean-Philippe Brucker
Add a driver for the ACPI VIOT table, which provides topology
information for para-virtual IOMMUs. Enable virtio-iommu on
non-devicetree platforms, including x86.

Since v2 [1] I tried to improve commit messages and comments. More
feedback and review are always welcome. Joerg offered to take this
series through the IOMMU tree, which requires acks for patches 1-3.

You can find a QEMU implementation at [2], with extra support for
testing all VIOT nodes including MMIO-based endpoints and IOMMU.
This series is at [3].

[1] 
https://lore.kernel.org/linux-iommu/20210423113836.3974972-1-jean-phili...@linaro.org/
[2] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/acpi
[3] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/acpi

Jean-Philippe Brucker (6):
  ACPI: arm64: Move DMA setup operations out of IORT
  ACPI: Move IOMMU setup code out of IORT
  ACPI: Add driver for the VIOT table
  iommu/dma: Pass address limit rather than size to
iommu_setup_dma_ops()
  iommu/dma: Simplify calls to iommu_setup_dma_ops()
  iommu/virtio: Enable x86 support

 drivers/acpi/Kconfig |   3 +
 drivers/iommu/Kconfig|   4 +-
 drivers/acpi/Makefile|   2 +
 drivers/acpi/arm64/Makefile  |   1 +
 include/acpi/acpi_bus.h  |   3 +
 include/linux/acpi.h |   3 +
 include/linux/acpi_iort.h|  14 +-
 include/linux/acpi_viot.h|  19 ++
 include/linux/dma-iommu.h|   4 +-
 arch/arm64/mm/dma-mapping.c  |   2 +-
 drivers/acpi/arm64/dma.c |  50 +
 drivers/acpi/arm64/iort.c| 129 ++---
 drivers/acpi/bus.c   |   2 +
 drivers/acpi/scan.c  |  60 +-
 drivers/acpi/viot.c  | 364 +++
 drivers/iommu/amd/iommu.c|   9 +-
 drivers/iommu/dma-iommu.c|  17 +-
 drivers/iommu/intel/iommu.c  |  10 +-
 drivers/iommu/virtio-iommu.c |   8 +
 MAINTAINERS  |   8 +
 20 files changed, 562 insertions(+), 150 deletions(-)
 create mode 100644 include/linux/acpi_viot.h
 create mode 100644 drivers/acpi/arm64/dma.c
 create mode 100644 drivers/acpi/viot.c

-- 
2.31.1



[PATCH v3 1/6] ACPI: arm64: Move DMA setup operations out of IORT

2021-06-02 Thread Jean-Philippe Brucker
Extract generic DMA setup code out of IORT, so it can be reused by VIOT.
Keep it in drivers/acpi/arm64 for now, since it could break x86
platforms that haven't run this code so far, if they have invalid
tables.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/arm64/Makefile |  1 +
 include/linux/acpi.h|  3 +++
 include/linux/acpi_iort.h   |  6 ++---
 drivers/acpi/arm64/dma.c| 50 ++
 drivers/acpi/arm64/iort.c   | 54 ++---
 drivers/acpi/scan.c |  2 +-
 6 files changed, 66 insertions(+), 50 deletions(-)
 create mode 100644 drivers/acpi/arm64/dma.c

diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 6ff50f4ed947..66acbe77f46e 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_ACPI_IORT)+= iort.o
 obj-$(CONFIG_ACPI_GTDT)+= gtdt.o
+obj-y  += dma.o
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index c60745f657e9..7aaa9559cc19 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -259,9 +259,12 @@ void acpi_numa_x2apic_affinity_init(struct 
acpi_srat_x2apic_cpu_affinity *pa);
 
 #ifdef CONFIG_ARM64
 void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa);
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size);
 #else
 static inline void
 acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa) { }
+static inline void
+acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { }
 #endif
 
 int acpi_numa_memory_affinity_init (struct acpi_srat_mem_affinity *ma);
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 1a12baa58e40..f7f054833afd 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -34,7 +34,7 @@ struct irq_domain *iort_get_device_domain(struct device *dev, 
u32 id,
 void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
-void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
+int iort_dma_get_ranges(struct device *dev, u64 *size);
 const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
@@ -48,8 +48,8 @@ static inline struct irq_domain *iort_get_device_domain(
 { return NULL; }
 static inline void acpi_configure_pmsi_domain(struct device *dev) { }
 /* IOMMU interface */
-static inline void iort_dma_setup(struct device *dev, u64 *dma_addr,
- u64 *size) { }
+static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
+{ return -ENODEV; }
 static inline const struct iommu_ops *iort_iommu_configure_id(
  struct device *dev, const u32 *id_in)
 { return NULL; }
diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
new file mode 100644
index ..f16739ad3cc0
--- /dev/null
+++ b/drivers/acpi/arm64/dma.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
+{
+   int ret;
+   u64 end, mask;
+   u64 dmaaddr = 0, size = 0, offset = 0;
+
+   /*
+* If @dev is expected to be DMA-capable then the bus code that created
+* it should have initialised its dma_mask pointer by this point. For
+* now, we'll continue the legacy behaviour of coercing it to the
+* coherent mask if not, but we'll no longer do so quietly.
+*/
+   if (!dev->dma_mask) {
+   dev_warn(dev, "DMA mask not set\n");
+   dev->dma_mask = &dev->coherent_dma_mask;
+   }
+
+   if (dev->coherent_dma_mask)
+   size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
+   else
+   size = 1ULL << 32;
+
+   ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
+   if (ret == -ENODEV)
+   ret = iort_dma_get_ranges(dev, &size);
+   if (!ret) {
+   /*
+* Limit coherent and dma mask based on size retrieved from
+* firmware.
+*/
+   end = dmaaddr + size - 1;
+   mask = DMA_BIT_MASK(ilog2(end) + 1);
+   dev->bus_dma_limit = end;
+   dev->coherent_dma_mask = min(dev->coherent_dma_mask, mask);
+   *dev->dma_mask = min(*dev->dma_mask, mask);
+   }
+
+   *dma_addr = dmaaddr;
+   *dma_size = size;
+
+   ret = dma_direct_set_offset(dev, dmaaddr + offset, dmaaddr, size);
+
+   dev_dbg(dev, "dma_offset(%#08llx)%s\n", offset, ret ? " failed!" : "");
+}
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3912a1f6058e..a940be1cf2af 100644

[PATCH v3 6/6] iommu/virtio: Enable x86 support

2021-06-02 Thread Jean-Philippe Brucker
With the VIOT support in place, x86 platforms can now use the
virtio-iommu.

Because the other x86 IOMMU drivers aren't yet ready to use the
acpi_dma_setup() path, x86 doesn't implement arch_setup_dma_ops() at the
moment. Similarly to VT-d and AMD IOMMU, call iommu_setup_dma_ops() from
probe_finalize().

Acked-by: Joerg Roedel 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig| 3 ++-
 drivers/iommu/dma-iommu.c| 1 +
 drivers/iommu/virtio-iommu.c | 8 
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index aff8a4830dd1..07b7c25cbed8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -400,8 +400,9 @@ config HYPERV_IOMMU
 config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
depends on VIRTIO
-   depends on ARM64
+   depends on (ARM64 || X86)
select IOMMU_API
+   select IOMMU_DMA
select INTERVAL_TREE
select ACPI_VIOT if ACPI
help
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 175f8eaeb5b3..46ed43c400cf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1332,6 +1332,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 
dma_base, u64 dma_limit)
 pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA 
ops\n",
 dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(iommu_setup_dma_ops);
 
 static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
phys_addr_t msi_addr, struct iommu_domain *domain)
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 218fe8560e8d..77aee1207ced 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1026,6 +1026,13 @@ static struct iommu_device *viommu_probe_device(struct 
device *dev)
return ERR_PTR(ret);
 }
 
+static void viommu_probe_finalize(struct device *dev)
+{
+#ifndef CONFIG_ARCH_HAS_SETUP_DMA_OPS
+   iommu_setup_dma_ops(dev, 0, U64_MAX);
+#endif
+}
+
 static void viommu_release_device(struct device *dev)
 {
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
@@ -1062,6 +1069,7 @@ static struct iommu_ops viommu_ops = {
.iova_to_phys   = viommu_iova_to_phys,
.iotlb_sync = viommu_iotlb_sync,
.probe_device   = viommu_probe_device,
+   .probe_finalize = viommu_probe_finalize,
.release_device = viommu_release_device,
.device_group   = viommu_device_group,
.get_resv_regions   = viommu_get_resv_regions,
-- 
2.31.1



[PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT

2021-06-02 Thread Jean-Philippe Brucker
Extract the code that sets up the IOMMU infrastructure from IORT, since
it can be reused by VIOT. Move it one level up into a new
acpi_iommu_configure_id() function, which calls the IORT parsing
function which in turn calls the acpi_iommu_fwspec_init() helper.

Signed-off-by: Jean-Philippe Brucker 
---
 include/acpi/acpi_bus.h   |  3 ++
 include/linux/acpi_iort.h |  8 ++---
 drivers/acpi/arm64/iort.c | 75 +--
 drivers/acpi/scan.c   | 55 +++-
 4 files changed, 69 insertions(+), 72 deletions(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 3a82faac5767..41f092a269f6 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -588,6 +588,9 @@ struct acpi_pci_root {
 
 bool acpi_dma_supported(struct acpi_device *adev);
 enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev);
+int acpi_iommu_fwspec_init(struct device *dev, u32 id,
+  struct fwnode_handle *fwnode,
+  const struct iommu_ops *ops);
 int acpi_dma_get_range(struct device *dev, u64 *dma_addr, u64 *offset,
   u64 *size);
 int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr,
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index f7f054833afd..f1f0842a2cb2 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -35,8 +35,7 @@ void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-   const u32 *id_in);
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
@@ -50,9 +49,8 @@ static inline void acpi_configure_pmsi_domain(struct device 
*dev) { }
 /* IOMMU interface */
 static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
 { return -ENODEV; }
-static inline const struct iommu_ops *iort_iommu_configure_id(
- struct device *dev, const u32 *id_in)
-{ return NULL; }
+static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
+{ return -ENODEV; }
 static inline
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index a940be1cf2af..b5b021e064b6 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -806,23 +806,6 @@ static struct acpi_iort_node 
*iort_get_msi_resv_iommu(struct device *dev)
return NULL;
 }
 
-static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
-{
-   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-
-   return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
-}
-
-static inline int iort_add_device_replay(struct device *dev)
-{
-   int err = 0;
-
-   if (dev->bus && !device_iommu_mapped(dev))
-   err = iommu_probe_device(dev);
-
-   return err;
-}
-
 /**
  * iort_iommu_msi_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
@@ -900,18 +883,6 @@ static inline bool iort_iommu_driver_enabled(u8 type)
}
 }
 
-static int arm_smmu_iort_xlate(struct device *dev, u32 streamid,
-  struct fwnode_handle *fwnode,
-  const struct iommu_ops *ops)
-{
-   int ret = iommu_fwspec_init(dev, fwnode, ops);
-
-   if (!ret)
-   ret = iommu_fwspec_add_ids(dev, &streamid, 1);
-
-   return ret;
-}
-
 static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
 {
struct acpi_iort_root_complex *pci_rc;
@@ -946,7 +917,7 @@ static int iort_iommu_xlate(struct device *dev, struct 
acpi_iort_node *node,
return iort_iommu_driver_enabled(node->type) ?
   -EPROBE_DEFER : -ENODEV;
 
-   return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops);
+   return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops);
 }
 
 struct iort_pci_alias_info {
@@ -1020,24 +991,14 @@ static int iort_nc_iommu_map_id(struct device *dev,
  * @dev: device to configure
  * @id_in: optional input id const value pointer
  *
- * Returns: iommu_ops pointer on configuration success
- *  NULL on configuration failure
+ * Returns: 0 on success, <0 on failure
  */
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-   const u32 *id_in)
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
struct acpi_iort_node *node;
-   const struct iommu_ops *ops;
+   const struct iommu_ops *ops = NULL;
int err = -ENODEV;
 
-   /*
-

[PATCH v3 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()

2021-06-02 Thread Jean-Philippe Brucker
dma-iommu uses the address bounds described in domain->geometry during
IOVA allocation. The address size parameters of iommu_setup_dma_ops()
are useful for describing additional limits set by the platform
firmware, but aren't needed for drivers that call this function from
probe_finalize(). The base parameter can be zero because dma-iommu
already removes the first IOVA page, and the limit parameter can be
U64_MAX because it's only checked against the domain geometry. Simplify
calls to iommu_setup_dma_ops().

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/amd/iommu.c   |  9 +
 drivers/iommu/dma-iommu.c   |  4 +++-
 drivers/iommu/intel/iommu.c | 10 +-
 3 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 94b96d81fcfd..d3123bc05c08 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1708,14 +1708,7 @@ static struct iommu_device 
*amd_iommu_probe_device(struct device *dev)
 
 static void amd_iommu_probe_finalize(struct device *dev)
 {
-   struct iommu_domain *domain;
-
-   /* Domains are initialized for this device - have a look what we ended 
up with */
-   domain = iommu_get_domain_for_dev(dev);
-   if (domain->type == IOMMU_DOMAIN_DMA)
-   iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX);
-   else
-   set_dma_ops(dev, NULL);
+   iommu_setup_dma_ops(dev, 0, U64_MAX);
 }
 
 static void amd_iommu_release_device(struct device *dev)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c62e19bed302..175f8eaeb5b3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 
dma_base, u64 dma_limit)
if (domain->type == IOMMU_DOMAIN_DMA) {
if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev))
goto out_err;
-   dev->dma_ops = &iommu_dma_ops;
+   set_dma_ops(dev, &iommu_dma_ops);
+   } else {
+   set_dma_ops(dev, NULL);
}
 
return;
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 85f18342603c..8d866940692a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5165,15 +5165,7 @@ static void intel_iommu_release_device(struct device 
*dev)
 
 static void intel_iommu_probe_finalize(struct device *dev)
 {
-   dma_addr_t base = IOVA_START_PFN << VTD_PAGE_SHIFT;
-   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-
-   if (domain && domain->type == IOMMU_DOMAIN_DMA)
-   iommu_setup_dma_ops(dev, base,
-   __DOMAIN_MAX_ADDR(dmar_domain->gaw));
-   else
-   set_dma_ops(dev, NULL);
+   iommu_setup_dma_ops(dev, 0, U64_MAX);
 }
 
 static void intel_iommu_get_resv_regions(struct device *device,
-- 
2.31.1



[PATCH v3 3/6] ACPI: Add driver for the VIOT table

2021-06-02 Thread Jean-Philippe Brucker
The ACPI Virtual I/O Translation Table describes the topology of
para-virtual platforms, similarly to vendor tables DMAR, IVRS and IORT.
For now it describes the relation between virtio-iommu and the endpoints
it manages.

Three steps are needed to configure DMA of endpoints:

(1) acpi_viot_init(): parse the VIOT table, find or create the fwnode
associated to each vIOMMU device.

(2) When probing the vIOMMU device, the driver registers its IOMMU ops
within the IOMMU subsystem. This step doesn't require any
intervention from the VIOT driver.

(3) viot_iommu_configure(): before binding the endpoint to a driver,
find the associated IOMMU ops. Register them, along with the
endpoint ID, into the device's iommu_fwspec.

If step (3) happens before step (2), it is deferred until the IOMMU is
initialized, then retried.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/Kconfig  |   3 +
 drivers/iommu/Kconfig |   1 +
 drivers/acpi/Makefile |   2 +
 include/linux/acpi_viot.h |  19 ++
 drivers/acpi/bus.c|   2 +
 drivers/acpi/scan.c   |   3 +
 drivers/acpi/viot.c   | 364 ++
 MAINTAINERS   |   8 +
 8 files changed, 402 insertions(+)
 create mode 100644 include/linux/acpi_viot.h
 create mode 100644 drivers/acpi/viot.c

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index eedec61e3476..3758c6940ed7 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -526,6 +526,9 @@ endif
 
 source "drivers/acpi/pmic/Kconfig"
 
+config ACPI_VIOT
+   bool
+
 endif  # ACPI
 
 config X86_PM_TIMER
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 1f111b399bca..aff8a4830dd1 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -403,6 +403,7 @@ config VIRTIO_IOMMU
depends on ARM64
select IOMMU_API
select INTERVAL_TREE
+   select ACPI_VIOT if ACPI
help
  Para-virtualised IOMMU driver with virtio.
 
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 700b41adf2db..a6e644c48987 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -118,3 +118,5 @@ video-objs  += acpi_video.o video_detect.o
 obj-y  += dptf/
 
 obj-$(CONFIG_ARM64)+= arm64/
+
+obj-$(CONFIG_ACPI_VIOT)+= viot.o
diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h
new file mode 100644
index ..1eb8ee5b0e5f
--- /dev/null
+++ b/include/linux/acpi_viot.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __ACPI_VIOT_H__
+#define __ACPI_VIOT_H__
+
+#include 
+
+#ifdef CONFIG_ACPI_VIOT
+void __init acpi_viot_init(void);
+int viot_iommu_configure(struct device *dev);
+#else
+static inline void acpi_viot_init(void) {}
+static inline int viot_iommu_configure(struct device *dev)
+{
+   return -ENODEV;
+}
+#endif
+
+#endif /* __ACPI_VIOT_H__ */
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index be7da23fad76..b835ca702ff0 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -27,6 +27,7 @@
 #include 
 #endif
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1339,6 +1340,7 @@ static int __init acpi_init(void)
pci_mmcfg_late_init();
acpi_iort_init();
acpi_scan_init();
+   acpi_viot_init();
acpi_ec_init();
acpi_debugfs_init();
acpi_sleep_proc_init();
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 696b81432ae7..574d657cf4bb 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1555,6 +1556,8 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
return ops;
 
err = iort_iommu_configure_id(dev, id_in);
+   if (err && err != -EPROBE_DEFER)
+   err = viot_iommu_configure(dev);
 
/*
 * If we have reason to believe the IOMMU driver missed the initial
diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c
new file mode 100644
index ..892cd9fa7b6d
--- /dev/null
+++ b/drivers/acpi/viot.c
@@ -0,0 +1,364 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual I/O topology
+ *
+ * The Virtual I/O Translation Table (VIOT) describes the topology of
+ * para-virtual IOMMUs and the endpoints they manage. The OS uses it to
+ * initialize devices in the right order, preventing endpoints from issuing DMA
+ * before their IOMMU is ready.
+ *
+ * When binding a driver to a device, before calling the device driver's 
probe()
+ * method, the driver infrastructure calls dma_configure(). At that point the
+ * VIOT driver looks for an IOMMU associated to the device in the VIOT table.
+ * If an IOMMU exists and has been initialized, the VIOT driver initializes the
+ * device's IOMMU fwspec, allowing the DMA infrastructure to invoke the IOMMU
+ * ops when the device driver configures DMA mappings. If an IOMMU exists a

[PATCH v3 4/6] iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()

2021-06-02 Thread Jean-Philippe Brucker
Passing a 64-bit address width to iommu_setup_dma_ops() is valid on
virtual platforms, but isn't currently possible. The overflow check in
iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass
a limit address instead of a size, so callers don't have to fake a size
to work around the check.

Signed-off-by: Jean-Philippe Brucker 
---
 include/linux/dma-iommu.h   |  4 ++--
 arch/arm64/mm/dma-mapping.c |  2 +-
 drivers/iommu/amd/iommu.c   |  2 +-
 drivers/iommu/dma-iommu.c   | 12 ++--
 drivers/iommu/intel/iommu.c |  2 +-
 5 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 6e75a2d689b4..758ca4694257 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -19,7 +19,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base);
 void iommu_put_dma_cookie(struct iommu_domain *domain);
 
 /* Setup call for arch DMA mapping code */
-void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size);
+void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit);
 
 /* The DMA API isn't _quite_ the whole story, though... */
 /*
@@ -50,7 +50,7 @@ struct msi_msg;
 struct device;
 
 static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base,
-   u64 size)
+  u64 dma_limit)
 {
 }
 
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 4bf1dd3eb041..7bd1d2199141 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -50,7 +50,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 
size,
 
dev->dma_coherent = coherent;
if (iommu)
-   iommu_setup_dma_ops(dev, dma_base, size);
+   iommu_setup_dma_ops(dev, dma_base, size - dma_base - 1);
 
 #ifdef CONFIG_XEN
if (xen_swiotlb_detect())
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..94b96d81fcfd 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1713,7 +1713,7 @@ static void amd_iommu_probe_finalize(struct device *dev)
/* Domains are initialized for this device - have a look what we ended 
up with */
domain = iommu_get_domain_for_dev(dev);
if (domain->type == IOMMU_DOMAIN_DMA)
-   iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0);
+   iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX);
else
set_dma_ops(dev, NULL);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..c62e19bed302 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -319,16 +319,16 @@ static bool dev_is_untrusted(struct device *dev)
  * iommu_dma_init_domain - Initialise a DMA mapping domain
  * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
  * @base: IOVA at which the mappable address space starts
- * @size: Size of IOVA space
+ * @limit: Last address of the IOVA space
  * @dev: Device the domain is being initialised for
  *
- * @base and @size should be exact multiples of IOMMU page granularity to
+ * @base and @limit + 1 should be exact multiples of IOMMU page granularity to
  * avoid rounding surprises. If necessary, we reserve the page at address 0
  * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but
  * any change which could make prior IOVAs invalid will fail.
  */
 static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
-   u64 size, struct device *dev)
+dma_addr_t limit, struct device *dev)
 {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
unsigned long order, base_pfn;
@@ -346,7 +346,7 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
/* Check the domain allows at least some access to the device... */
if (domain->geometry.force_aperture) {
if (base > domain->geometry.aperture_end ||
-   base + size <= domain->geometry.aperture_start) {
+   limit < domain->geometry.aperture_start) {
pr_warn("specified DMA range outside IOMMU 
capability\n");
return -EFAULT;
}
@@ -1308,7 +1308,7 @@ static const struct dma_map_ops iommu_dma_ops = {
  * The IOMMU core code allocates the default DMA domain, which the underlying
  * IOMMU driver needs to support via the dma-iommu layer.
  */
-void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size)
+void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 
@@ -1320,7 +1320,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 
dma_base, u64 size)
 * underlying IOMMU driver needs to support via the dma-iommu layer.
 */
if (domain->type == IOM

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 02:20:15AM +, Tian, Kevin wrote:
> > From: Alex Williamson 
> > Sent: Wednesday, June 2, 2021 6:22 AM
> > 
> > On Tue, 1 Jun 2021 07:01:57 +
> > "Tian, Kevin"  wrote:
> > >
> > > I summarized five opens here, about:
> > >
> > > 1)  Finalizing the name to replace /dev/ioasid;
> > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > 4)  What should/could be specified when allocating an IOASID;
> > > 5)  The protocol between vfio group and kvm;
> > >
> > ...
> > >
> > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > original purpose of this protocol is not about I/O address space. It's
> > > for KVM to know whether any device is assigned to this VM and then
> > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> > 
> > Right, the original use case was for KVM to determine whether it needs
> > to emulate invlpg, so it needs to be aware when an assigned device is
> 
> invlpg -> wbinvd :)
> 
> > present and be able to test if DMA for that device is cache
> > coherent.

Why is this such a strong linkage to VFIO and not just a 'hey kvm
emulate wbinvd' flag from qemu?
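
Concretely, such a flag could be as small as this on QEMU's side (the
capability name is purely hypothetical; only KVM_ENABLE_CAP itself is real):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_WBINVD_EMULATION,	/* hypothetical */
		.args = { 1 },				/* noncoherent DMA present */
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);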

I briefly didn't see any obvious linkage in the arch code, just some
dead code:

$ git grep iommu_noncoherent
arch/x86/include/asm/kvm_host.h:bool iommu_noncoherent;
$ git grep iommu_domain arch/x86
arch/x86/include/asm/kvm_host.h:struct iommu_domain *iommu_domain;

Huh?

It kind of looks like the other main point is to generate the
VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
connect back to the kvm data

But that seems like it would have been better handled with some IOCTL
on the vfio_device fd to import the KVM to the driver, not this
roundabout way?

> > The user, QEMU, creates a KVM "pseudo" device representing the vfio
> > group, providing the file descriptor of that group to show ownership.
> > The ugly symbol_get code is to avoid hard module dependencies, ie. the
> > kvm module should not pull in or require the vfio module, but vfio will
> > be present if attempting to register this device.
> 
> so the symbol_get thing is not about the protocol itself. Whatever protocol
> is defined, as long as kvm needs to call vfio or ioasid helper functions, we
> need to define a proper way to do it. Jason, what's your opinion on an
> alternative option, since you dislike symbol_get?

The symbol_get was to avoid module dependencies because bringing in
vfio along with kvm is not nice.

The symbol get is not nice here, but unless things can be truly moved
to lower levels of code where module dependencies are not a problem (eg
kvm to iommu is a non-issue) I don't see much of a solution.

Other cases like kvmgt or AP would be similarly fine to have had a
kvmgt to kvm module dependency.

> > All of these use cases are related to the IOMMU, whether DMA is
> > coherent, translating device IOVA to GPA, and an acceleration path to
> > emulate IOMMU programming in kernel... they seem pretty relevant.
> 
> One open is whether kvm should hold a device reference or IOASID
> reference. For DMA coherence, it only matters whether assigned 
> devices are coherent or not (not for a specific address space). For kvmgt, 
> it is for recording the kvm pointer in the mdev driver to do write protection. For 
> ppc, it does relate to a specific I/O page table.
> 
> Then I feel only a part of the protocol will be moved to /dev/ioasid and
> something will still remain between kvm and vfio?

Honestly I would try not to touch this too much :\

Jason


Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-06-02 Thread Boris Ostrovsky

On 6/2/21 11:01 AM, Tianyu Lan wrote:
> Hi Boris:
> Thanks for your review.
>
> On 6/2/2021 9:16 AM, Boris Ostrovsky wrote:
>>
>> On 5/30/21 11:06 AM, Tianyu Lan wrote:
>>> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>>>   EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>>>     IOMMU_INIT_FINISH(2,
>>> -  NULL,
>>> +  hyperv_swiotlb_detect,
>>>     pci_xen_swiotlb_init,
>>>     NULL);
>>
>>
>> Could you explain this change?
>
> Hyper-V allocates its own swiotlb bounce buffer, so the default
> swiotlb buffer should not be allocated; swiotlb_init() in
> pci_swiotlb_init() is what allocates the default buffer.
> To achieve this, hyperv_swiotlb_detect() is put as the first entry in the
> iommu_table_entry list. The detect loop in pci_iommu_alloc() will exit
> once hyperv_swiotlb_detect() is called in a Hyper-V VM, and the other
> iommu_table_entry callbacks will not be called.



Right. But pci_xen_swiotlb_detect() will only do something for Xen PV guests,
and those guests don't run on Hyper-V. It's either xen_pv_domain() (i.e.
hypervisor_is_type(X86_HYPER_XEN_PV)) or
hypervisor_is_type(X86_HYPER_MS_HYPERV), but never both. So I don't think there
needs to be a dependency between the two callbacks.
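
In code terms the mutual exclusion looks like this (sketch; both helpers are
real x86 predicates, and a guest reports exactly one hypervisor type):

	if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
		/* only here does pci_xen_swiotlb_detect() do anything */
	} else if (hypervisor_is_type(X86_HYPER_MS_HYPERV)) {
		/* only here does hyperv_swiotlb_detect() fire */
	}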



-boris


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:52:02PM +0800, Jason Wang wrote:
> 
> 在 2021/6/2 上午4:28, Jason Gunthorpe 写道:
> > > I summarized five opens here, about:
> > > 
> > > 1)  Finalizing the name to replace /dev/ioasid;
> > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > 4)  What should/could be specified when allocating an IOASID;
> > > 5)  The protocol between vfio group and kvm;
> > > 
> > > For 1), two alternative names are mentioned: /dev/iommu and
> > > /dev/ioas. I don't have a strong preference and would like to hear
> > > votes from all stakeholders. /dev/iommu is slightly better imho for
> > > two reasons. First, per AMD's presentation at the last KVM forum they
> > > implement vIOMMU in hardware thus need to support user-managed
> > > domains. An iommu uAPI notation might make more sense moving
> > > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > > IOASID_ALLOC_IOASID. :)
> > I think two years ago I suggested /dev/iommu and it didn't go very far
> > at the time.
> 
> 
> It looks to me like using "/dev/iommu" excludes the possibility of implementing
> IOASID in a device-specific way (e.g. through co-operation between a device
> MMU and the platform IOMMU)?

This is intended to be the 'drivers/iommu' subsystem though. I don't
want to see pluggability here beyond what drivers/iommu is providing.

If someone wants to do a something complicated through this interface
then they need to make a drivers/iommu implementation.

Or they need to use the mdev-esque "SW TABLE" mode when their driver
attaches to the interface.

Whether this is good enough for a specific device is an entirely
different question though

> What's more, the ATS spec doesn't forbid the device #PF from being reported
> in a device-specific way.

And if this is done then a kernel function indicating page fault
should be triggered on the ioasid handle that the device has. It is
still drivers/iommu functionality

Jason

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, June 2, 2021 1:42 AM
> > 
> > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > >
> > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > >
> > > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > > software nesting. With hardware support the child and parent I/O page
> > > > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > > > When it's implemented in software, the ioasid driver is responsible for
> > > > > merging the two-level mappings into a single-level shadow I/O page table.
> > > > > Software nesting requires both child/parent page tables operated through
> > > > > the dma mapping protocol, so any change in either level can be captured
> > > > > by the kernel to update the corresponding shadow mapping.
> > > >
> > > > Why? A SW emulation could do this synchronization during invalidation
> > > > processing if invalidation contained an IOVA range.
> > >
> > > In this proposal we differentiate between host-managed and user-
> > > managed I/O page tables. If host-managed, the user is expected to use
> > > map/unmap cmd explicitly upon any change required on the page table.
> > > If user-managed, the user first binds its page table to the IOMMU and
> > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > not required when changing a PTE from non-present to present).
> > >
> > > We expect the user to use map+unmap and bind+invalidate respectively
> > > instead of mixing them together. Following this policy, map+unmap
> > > must be used in both levels for software nesting, so changes in either
> > > level are captured timely to synchronize the shadow mapping.
> > 
> > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > it is created. If you put two different types in a tree then each IOASID
> > must continue to use its own operation mode.
> > 
> > I don't see a reason to force all IOASIDs in a tree to be consistent??
> 
> only for software nesting. With hardware support the parent uses map
> while the child uses bind.
> 
> Yes, the policy is specified per IOASID. But if the policy violates the
> requirement in a specific nesting mode, then nesting should fail.

I don't get it.

If the IOASID is a page table then it is bind/invalidate. SW or not SW
doesn't matter at all.

> > 
> > A software emulated two level page table where the leaf level is a
> > bound page table in guest memory should continue to use
> > bind/invalidate to maintain the guest page table IOASID even though it
> > is a SW construct.
> 
> with software nesting the leaf should be a host-managed page table
> (or metadata). A bind/invalidate protocol doesn't require the user
> to notify the kernel of every page table change. 

The purpose of invalidate is to inform the implementation that the
page table has changed so it can flush the caches. If the page table
is changed and invalidation is not issued then then the implementation
is free to ignore the changes.

In this way the SW mode is the same as a HW mode with an infinite
cache.

The collapsed shadow page table is really just a cache.
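
From userspace the contract reads roughly like this (hypothetical command and
structure names -- the RFC doesn't pin down the final uAPI):

	struct ioasid_invalidate inv = {	/* hypothetical */
		.ioasid = ioasid,
		.iova   = iova,
		.length = length,
	};

	/* 1. Update the user/guest-owned page table in memory. */
	/* 2. Tell the kernel; only now must stale translations go away. */
	ioctl(ioasid_fd, IOASID_INVALIDATE, &inv);	/* hypothetical */

	/* Skip step 2 and a SW-nested shadow table may legitimately keep
	 * serving the old entries -- an IOTLB with an infinite cache. */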

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > I agree with Jean-Philippe - at the very least erasing this
> > information needs a major rational - but I don't really see why it
> > must be erased? The HW reports the originating device, is it just a
> > matter of labeling the devices attached to the /dev/ioasid FD so it
> > can be reported to userspace?
> 
> HW reports the originating device as far as it knows.  In many cases
> where you have multiple devices in an IOMMU group, it's because
> although they're treated as separate devices at the kernel level, they
> have the same RID at the HW level.  Which means a RID for something in
> the right group is the closest you can count on supplying.

Granted there may be cases where exact fidelity is not possible, but
that doesn't excuse eliminating fidelity where it does exist.

> > If there are no hypervisor traps (does this exist?) then there is no
> > way to involve the hypervisor here and the child IOASID should simply
> > be a pointer to the guest's data structure that describes binding. In
> > this case that IOASID should claim all PASIDs when bound to a
> > RID. 
> 
> And in that case I think we should call that object something other
> than an IOASID, since it represents multiple address spaces.

Maybe.. It is certainly a special case.

We can still consider it a single "address space" from the IOMMU
perspective. What has happened is that the address table is not just a
64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
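
Put differently, the lookup key becomes (sketch only; PCIe PASIDs are 20 bits
wide):

	struct extended_iova {		/* illustrative, not a kernel struct */
		u32 pasid;		/* 20-bit PCIe PASID */
		u64 iova;		/* ordinary 64-bit IOVA */
	};				/* ~80 significant bits in total */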

If we are already going in the direction of having the IOASID specify
the page table format and other details, specifying that the page
table format is the 80 bit "PASID, IOVA" format is a fairly small
step.

I wouldn't twist things into knots to create a difference, but if it
is easy to do it wouldn't hurt either.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:

> I don't think presence or absence of a group fd makes a lot of
> difference to this design.  Having a group fd just means we attach
> groups to the ioasid instead of individual devices, and we no longer
> need the bookkeeping of "partial" devices.

Oh, I think we really don't want to attach the group to an ioasid, or
at least not as a first-class idea.

The fundamental problem that got us here is we now live in a world
where there are many ways to attach a device to an IOASID:

 - A RID binding
 - A RID,PASID binding
 - A RID,PASID binding for ENQCMD
 - A SW TABLE binding
 - etc

The selection of which mode to use is based on the specific
driver/device operation. Ie the thing that implements the 'struct
vfio_device' is the thing that has to select the binding mode.

group attachment was fine when there was only one mode. As you say it
is fine to just attach every group member with RID binding if RID
binding is the only option.

When SW TABLE binding was added the group code was hacked up - now the
group logic is choosing between RID/SW TABLE in a very hacky and mdev
specific way, and this is just a mess.

The flow must carry the IOASID from the /dev/iommu to the vfio_device
driver and the vfio_device implementation must choose which binding
mode and parameters it wants based on driver and HW configuration.

eg if two PCI devices are in a group then it is perfectly fine that
one device uses RID binding and the other device uses RID,PASID
binding.
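
As a sketch of what the vfio_device driver would choose between (hypothetical
names, mirroring the list above):

	enum ioasid_bind_mode {			/* hypothetical */
		IOASID_BIND_RID,
		IOASID_BIND_RID_PASID,
		IOASID_BIND_RID_PASID_ENQCMD,
		IOASID_BIND_SW_TABLE,
	};

	/* Selected per device by the vfio_device implementation, never per
	 * group: two devices in one group may pick different modes. */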

The only place I see for a "group bind" in the uAPI is some compat
layer for the vfio container, and the implementation would be quite
different, we'd have to call each vfio_device driver in the group and
execute the IOASID attach IOCTL.

> > I would say no on the container. /dev/ioasid == the container, having
> > two competing objects at once in a single process is just a mess.
> 
> Right.  I'd assume that for compatibility, creating a container would
> create a single IOASID under the hood with a compatiblity layer
> translating the container operations to iosaid operations.

It is a nice dream for sure

/dev/vfio could be a special case of /dev/ioasid just with a different
uapi and ending up with only one IOASID. They could be interchangeable
from then on, which would simplify the internals of VFIO if it
consistently dealt with these new ioasid objects everywhere. But last I
looked it was complicated enough to best be done later on

Jason


[RESEND PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling

2021-06-02 Thread Rob Clark
From: Rob Clark 

(Resend: the first attempt doesn't seem to have entirely shown up in patchwork,
and it had a random already-merged patch tagging along because 00*patch
picks up things I forgot to delete.)

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults.  Original
description:

This is a patch stack to add an Adreno GPU specific handler for pagefaults. The first
patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
an adreno-smmu-priv function hook to capture a handful of important debugging
registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
third patch to print more detailed information on page faults, such as the TTBR0
for the pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
resume translation after it has had a chance to snapshot the GPUs
state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcro...@codeaurora.org/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (3):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: Add crashdump support for stalled SMMU
  drm/msm: devcoredump iommu fault support

 drivers/gpu/drm/msm/adreno/a2xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c   |   2 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c   |   9 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 101 +++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h   |   2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  43 +++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c |  15 +++
 drivers/gpu/drm/msm/msm_debugfs.c   |   2 +-
 drivers/gpu/drm/msm/msm_gem.h   |   1 +
 drivers/gpu/drm/msm/msm_gem_submit.c|   1 +
 drivers/gpu/drm/msm/msm_gpu.c   |  55 ++-
 drivers/gpu/drm/msm/msm_gpu.h   |  19 +++-
 drivers/gpu/drm/msm/msm_gpummu.c|   5 +
 drivers/gpu/drm/msm/msm_iommu.c |  22 -
 drivers/gpu/drm/msm/msm_mmu.h   |   5 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h   |   2 +
 include/linux/adreno-smmu-priv.h|  38 +++-
 20 files changed, 354 insertions(+), 31 deletions(-)

-- 
2.31.1



[RESEND PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers

2021-06-02 Thread Rob Clark
From: Jordan Crouse 

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse 
Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	int idx = smmu_domain->cfg.cbndx;
+	int ret;
 
 	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
 	if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
 	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
 
-	dev_err_ratelimited(smmu->dev,
-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+	ret = report_iommu_fault(domain, NULL, iova,
+		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+	if (ret == -ENOSYS)
+		dev_err_ratelimited(smmu->dev,
+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
 			fsr, iova, fsynr, cbfrsynra, idx);
 
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
-- 
2.31.1



[RESEND PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info

2021-06-02 Thread Rob Clark
From: Jordan Crouse 

Add a callback in adreno-smmu-priv to read interesting SMMU
registers to provide an opportunity for a richer debug experience
in the GPU driver.

Signed-off-by: Jordan Crouse 
Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 
 drivers/iommu/arm/arm-smmu/arm-smmu.h  |  2 ++
 include/linux/adreno-smmu-priv.h   | 31 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+		struct adreno_smmu_fault_info *info)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+	info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+	info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+	info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+	info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+	info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->cookie = smmu_domain;
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
 
 	return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
 #define ARM_SMMU_CB_FSYNR0		0x68
 #define ARM_SMMU_FSYNR0_WNR		BIT(4)
 
+#define ARM_SMMU_CB_FSYNR1		0x6c
+
 #define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
 #define ARM_SMMU_CB_S1_TLBIVAL		0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@
 
 #include <linux/io-pgtable.h>
 
+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes back key page fault information to the GPU driver
+ * through the get_fault_info function pointer.
+ * The GPU driver can use this information to print informative
+ * log messages and provide deeper GPU specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+	u64 far;
+	u64 ttbr0;
+	u32 contextidr;
+	u32 fsr;
+	u32 fsynr0;
+	u32 fsynr1;
+	u32 cbfrsynra;
+};
+
 /**
  * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
  *
@@ -17,6 +43,8 @@
  * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank.  A
  * NULL config disables TTBR0 translation, otherwise
  * TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ *                  the fault
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
     const void *cookie;
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+    void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
 };
 
-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1


[RESEND PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support

2021-06-02 Thread Rob Clark
From: Rob Clark 

Add, via the adreno-smmu-priv interface, a way for the GPU to request
the SMMU to stall translation on faults, and then later resume the
translation, either retrying or terminating the current translation.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.

Signed-off-by: Rob Clark 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++
 include/linux/adreno-smmu-priv.h   |  7 +
 2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
 	struct arm_smmu_device smmu;
 	bool bypass_quirk;
 	u8 bypass_cbndx;
+	u32 stall_enabled;
 };
 
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 		u32 reg)
 {
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
 	/*
 	 * On the GPU device we want to process subsequent transactions after a
 	 * fault to keep the GPU from hanging
 	 */
 	reg |= ARM_SMMU_SCTLR_HUPCF;
 
+	if (qsmmu->stall_enabled & BIT(idx))
+		reg |= ARM_SMMU_SCTLR_CFCFG;
+
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
 	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
 }
 
+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+	if (enabled)
+		qsmmu->stall_enabled |= BIT(cfg->cbndx);
+	else
+		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	u32 reg = 0;
+
+	if (terminate)
+		reg |= ARM_SMMU_RESUME_TERMINATE;
+
+	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+	priv->set_stall = qcom_adreno_smmu_set_stall;
+	priv->resume_translation = qcom_adreno_smmu_resume_translation;
 
 	return 0;
 }
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
  * TTBR0 translation is enabled with the specified cfg
  * @get_fault_info: Called by the GPU fault handler to get information about
  *                  the fault
+ * @set_stall: Configure whether stall on fault (CFCFG) is enabled.  Call
+ *             before set_ttbr0_cfg().  If stalling on fault is enabled,
+ *             the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault
+ *
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
     void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+    void (*set_stall)(const void *cookie, bool enabled);
+    void (*resume_translation)(const void *cookie, bool terminate);
 };
 
 #endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > >   /* Bind guest I/O page table  */
> > >   bind_data = {
> > >   .ioasid = gva_ioasid;
> > >   .addr   = gva_pgtable1;
> > >   // and format information
> > >   };
> > >   ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > 
> > Again I do wonder if this should just be part of alloc_ioasid. Is
> > there any reason to split these things? The only advantage to the
> > split is the device is known, but the device shouldn't impact
> > anything..
> 
> I'm pretty sure the device(s) could matter, although they probably
> won't usually. 

It is a bit subtle, but the /dev/iommu fd itself is connected to the
devices first. This prevents wildly incompatible devices from being
joined together, and allows some "get info" to report the capability
union of all devices if we want to do that.
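
In rough uAPI terms (ioctl names here are hypothetical, just to show
the ordering):

	ioctl(iommu_fd, IOASID_FD_BIND_DEVICE, device_fd);	/* per device */
	ioctl(iommu_fd, IOASID_GET_INFO, &info);
	/* info can report the capability union of all bound devices */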

The original concept was that devices joined would all have to support
the same IOASID format, at least for the kernel owned map/unmap IOASID
type. Supporting different page table formats maybe is reason to
revisit that concept.

There is a small advantage to re-using the IOASID container because of
the get_user_pages caching and pinned accounting management at the FD
level.

I don't know if that small advantage is worth the extra complexity
though.

> But it would certainly be possible for a system to have two
> different host bridges with two different IOMMUs with different
> pagetable formats.  Until you know which devices (and therefore
> which host bridge) you're talking about, you don't know what formats
> of pagetable to accept.  And if you have devices from *both* bridges
> you can't bind a page table at all - you could theoretically support
> a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> in both formats, but it would be pretty reasonable not to support
> that.

The basic process for a user space owned pgtable mode would be:

 1) qemu has to figure out what format of pgtable to use

Presumably it uses query functions using the device label. The
kernel code should look at the entire device path through all the
IOMMU HW to determine what is possible.

Or it already knows because the VM's vIOMMU is running in some
fixed page table format, or the VM's vIOMMU already told it, or
something.

 2) qemu creates an IOASID and based on #1 and says 'I want this format'

 3) qemu binds the IOASID to the device. 

If qemu gets it wrong then it just fails.

 4) For the next device qemu would have to figure out if it can re-use
an existing IOASID based on the required properties.
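
Putting those steps together as a rough sketch (every ioctl and field
name below is invented for illustration, not part of the proposal):

	/* 1) discover what the device's IOMMU path supports */
	ioctl(iommu_fd, IOASID_QUERY_DEVICE_CAPS, &caps);

	/* 2) create an IOASID asking for a specific format */
	alloc.format = IOASID_FMT_ARM64_PGTABLE_V2;
	gva_ioasid = ioctl(iommu_fd, IOASID_ALLOC, &alloc);

	/* 3) attach; fails if the HW cannot realize the format */
	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &gva_ioasid);

	/*
	 * 4) for the next device, attempt to re-use gva_ioasid and only
	 * allocate a new IOASID if that attach fails
	 */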

You pointed to the case of mixing vIOMMU's of different platforms. So
it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
page table mode v2" while running on an x86 because that is what the
vIOMMU is wired to work with.

Presumably qemu will fall back to software emulation if this is not
possible.

One interesting option for software emulation is to just transform the
ARM page table format to an x86 page table format in userspace and use
nested bind/invalidate to synchronize with the kernel. With SW nesting
I suspect this would be much faster.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Wed, 2 Jun 2021 13:01:40 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 02, 2021 at 02:20:15AM +, Tian, Kevin wrote:
> > > From: Alex Williamson 
> > > Sent: Wednesday, June 2, 2021 6:22 AM
> > > 
> > > On Tue, 1 Jun 2021 07:01:57 +
> > > "Tian, Kevin"  wrote:  
> > > >
> > > > I summarized five opens here, about:
> > > >
> > > > 1)  Finalizing the name to replace /dev/ioasid;
> > > > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > > > 3)  Carry device information in invalidation/fault reporting uAPI;
> > > > 4)  What should/could be specified when allocating an IOASID;
> > > > 5)  The protocol between vfio group and kvm;
> > > >  
> > > ...  
> > > >
> > > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > > original purpose of this protocol is not about I/O address space. It's
> > > > for KVM to know whether any device is assigned to this VM and then
> > > > do something special (e.g. posted interrupt, EPT cache attribute, 
> > > > etc.).  
> > > 
> > > Right, the original use case was for KVM to determine whether it needs
> > > to emulate invlpg, so it needs to be aware when an assigned device is  
> > 
> > invlpg -> wbinvd :)

Oops, of course.
   
> > > present and be able to test if DMA for that device is cache
> > > coherent.  
> 
> Why is this such a strong linkage to VFIO and not just a 'hey kvm
> emulate wbinvd' flag from qemu?

IIRC, wbinvd has host implications, a malicious user could tell KVM to
emulate wbinvd then run the op in a loop and induce a disproportionate
load on the system.  We therefore wanted a way that it would only be
enabled when required.

> I briefly didn't see any obvios linkage in the arch code, just some
> dead code:
> 
> $ git grep iommu_noncoherent
> arch/x86/include/asm/kvm_host.h:  bool iommu_noncoherent;
> $ git grep iommu_domain arch/x86
> arch/x86/include/asm/kvm_host.h:struct iommu_domain *iommu_domain;
> 
> Huh?

Cruft from legacy KVM device assignment, I assume.  What you're looking
for is:

kvm_vfio_update_coherency
 kvm_arch_register_noncoherent_dma
  atomic_inc(&kvm->arch.noncoherent_dma_count);

need_emulate_wbinvd
 kvm_arch_has_noncoherent_dma
  atomic_read(&kvm->arch.noncoherent_dma_count);

There are a couple other callers that I'm not as familiar with.

> It kind of looks like the other main point is to generate the
> VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> connect back to the kvm data
> 
> But that seems like it would have been better handled with some IOCTL
> on the vfio_device fd to import the KVM to the driver not this
> roundabout way?

Then QEMU would need to know which drivers require KVM knowledge?  This
allowed transparent backwards compatibility with userspace.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:

> Is there a compelling reason to have all the IOASIDs handled by one
> FD?

There was an answer on this, if every PASID needs an IOASID then there
are too many FDs.

It is difficult to share the get_user_pages cache across FDs.

There are global properties in the /dev/iommu FD, like what devices
are part of it, that are important for group security operations. This
becomes confused if it is split across many FDs.

> > I/O address space can be managed through two protocols, according to 
> > whether the corresponding I/O page table is constructed by the kernel or 
> > the user. When kernel-managed, a dma mapping protocol (similar to 
> > existing VFIO iommu type1) is provided for the user to explicitly specify 
> > how the I/O address space is mapped. Otherwise, a different protocol is 
> > provided for the user to bind an user-managed I/O page table to the 
> > IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> > handling. 
> > 
> > Pgtable binding protocol can be used only on the child IOASID's, implying 
> > IOASID nesting must be enabled. This is because the kernel doesn't trust 
> > userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> > through the parent IOASID.
> 
> To clarify, I'm guessing that's a restriction of likely practice,
> rather than a fundamental API restriction.  I can see a couple of
> theoretical future cases where a user-managed pagetable for a "base"
> IOASID would be feasible:
> 
>   1) On some fancy future MMU allowing free nesting, where the kernel
>  would insert an implicit extra layer translating user addresses
>  to physical addresses, and the userspace manages a pagetable with
>  its own VAs being the target AS

I would model this by having a "SVA" parent IOASID, i.e. an IOASID
where the IOVA == process VA and the kernel maintains this mapping.

Since the uAPI is so general I do have a general expectation that the
drivers/iommu implementations might need to be a bit more complicated,
like if the HW can optimize certain specific graphs of IOASIDs we
would still model them as graphs and the HW driver would have to
"compile" the graph into the optimal hardware.

This approach has worked reasonably well in other kernel areas.

>   2) For a purely software virtual device, where its virtual DMA
>  engine can interpret user addresses fine

This also sounds like an SVA IOASID.

Depending on HW if a device can really only bind to a very narrow kind
of IOASID then it should ask for that (probably platform specific!)
type during its attachment request to drivers/iommu.

eg "I am special hardware and only know how to do PLATFORM_BLAH
transactions, give me an IOASID comatible with that". If the only way
to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
hardwired to the CPU ASID  then that is just how it is.

> I wonder if there's a way to model this using a nested AS rather than
> requiring special operations.  e.g.
> 
>   'prereg' IOAS
>   |
>   \- 'rid' IOAS
>  |
>  \- 'pasid' IOAS (maybe)
> 
> 'prereg' would have a kernel managed pagetable into which (for
> example) qemu platform code would map all guest memory (using
> IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> IO mappings into the 'rid' IOAS in terms of GPA.
> 
> This wouldn't quite work as is, because the 'prereg' IOAS would have
> no devices.  But we could potentially have another call to mark an
> IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> would be an alternative to attaching devices.

It is one option for sure, this is where I was thinking when we were
talking in the other thread. I think the decision is best
implementation driven, as the datastructure to store the
preregistration data should be rather purpose built.

> > /*
> >   * Map/unmap process virtual addresses to I/O virtual addresses.
> >   *
> >   * Provide VFIO type1 equivalent semantics. Start with the same 
> >   * restriction e.g. the unmap size should match those used in the 
> >   * original mapping call. 
> >   *
> >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> >   * must be already in the preregistered list.
> >   *
> >   * Input parameters:
> >   * - u32 ioasid;
> >   * - refer to vfio_iommu_type1_dma_{un}map
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_MAP_DMA  _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA_IO(IOASID_TYPE, IOASID_BASE + 7)
> 
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Me too, or a SVA page table.

This document would do well to have a list of imagined page table
types and the set of operations that act on them. I think they are all
pretty disjoint..

Your presentation of 'kernel owns the table' vs 'userspace owns the
table' is a useful clarification to call out too.
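
Something like this, as a rough first cut (my own hedged summary of
the thread so far, not a definitive list):

	page table type      owner       operations that act on it
	kernel-managed       kernel      MAP_DMA / UNMAP_DMA
	user page table      userspace   BIND_PGTABLE, invalidate, faults
	SVA                  kernel mm   none; follows the CPU page table
	nested child         guest       BIND_PGTABLE, invalidate, faults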

> > 5. Use Cases and Flows
> > 
> > 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:
> 
> On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
> > > We can open up to ~0U file descriptors, I don't see why we need to 
> > > restrict
> > > it in uAPI.
> > There are significant problems with such large file descriptor
> > tables. High FD numbers man things like select don't work at all
> > anymore and IIRC there are more complications.
> 
> 
> I don't see how much difference for IOASID and other type of fds. People can
> choose to use poll or epoll.

Not really, once one thing in an application uses a large number of
FDs the entire application is affected. If any open() can return 'very
big number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for.

> And with the current proposal, (assuming there's a N:1 ioasid to ioasid). I
> wonder how select can work for the specific ioasid.

pagefault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary; multiple
FDs just complicate this.
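
e.g. one ring whose records identify the source, along the lines of
(hypothetical layout):

	struct ioasid_fault_event {
		__u32 ioasid;	/* which IOASID faulted */
		__u32 flags;
		__u64 addr;	/* faulting IOVA */
	};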

Jason

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 10:56:48AM +0200, Enrico Weigelt, metux IT consult 
wrote:

> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
> to other APIs / devices. Why can't the fd be acquired by the
> consumer APIs (eg. kvm, vfio, etc) ?

/dev/ioasid would be similar to /dev/vfio, and everything already
deals with exposing /dev/vfio and /dev/vfio/N together

I don't see it as a problem, just more work.

Having FDs spawn other FDs is pretty ugly, it defeats the "everything
is a file" model of UNIX.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:

> > > > present and be able to test if DMA for that device is cache
> > > > coherent.  
> > 
> > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > emulate wbinvd' flag from qemu?
> 
> IIRC, wbinvd has host implications, a malicious user could tell KVM to
> emulate wbinvd then run the op in a loop and induce a disproportionate
> load on the system.  We therefore wanted a way that it would only be
> enabled when required.

I think the non-coherentness is vfio_device specific? eg a specific
device will decide if it is coherent or not?

If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
implementation and not link it through the IOMMU.

If userspace is telling the vfio_device to be non-coherent or not then
it can call kvm_arch_register_noncoherent_dma() or not based on that
signal.

> > It kind of looks like the other main point is to generate the
> > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > connect back to the kvm data
> > 
> > But that seems like it would have been better handled with some IOCTL
> > on the vfio_device fd to import the KVM to the driver not this
> > roundabout way?
> 
> Then QEMU would need to know which drivers require KVM knowledge?  This
> allowed transparent backwards compatibility with userspace.  Thanks,

I'd just blindly fire a generic 'hey here is your KVM FD' into every
VFIO device.

The backwards compat angle is reasonable enough though.

So those two don't sound so bad. I don't know about PPC, but David
seems optimistic.

A basic idea is to remove the iommu stuff from the kvm connection so
that the scope of the iommu related rework is contained to vfio.
Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Wed, 2 Jun 2021 14:35:10 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> 
> > > > > present and be able to test if DMA for that device is cache
> > > > > coherent.
> > > 
> > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > emulate wbinvd' flag from qemu?  
> > 
> > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > emulate wbinvd then run the op in a loop and induce a disproportionate
> > load on the system.  We therefore wanted a way that it would only be
> > enabled when required.  
> 
> I think the non-coherentness is vfio_device specific? eg a specific
> device will decide if it is coherent or not?

No, this is specifically whether DMA is cache coherent to the
processor, ie. in the case of wbinvd whether the processor needs to
invalidate its cache in order to see data from DMA.

> If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> implementation and not link it through the IOMMU.

The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

> If userspace is telling the vfio_device to be non-coherent or not then
> it can call kvm_arch_register_noncoherent_dma() or not based on that
> signal.

Not non-coherent device memory, that would be a driver issue, cache
coherence of DMA is what we're after.

> > > It kind of looks like the other main point is to generate the
> > > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > > connect back to the kvm data
> > > 
> > > But that seems like it would have been better handled with some IOCTL
> > > on the vfio_device fd to import the KVM to the driver not this
> > > roundabout way?  
> > 
> > Then QEMU would need to know which drivers require KVM knowledge?  This
> > allowed transparent backwards compatibility with userspace.  Thanks,  
> 
> I'd just blindly fire a generic 'hey here is your KVM FD' into every
> VFIO device.

Yes, QEMU could do this, but the vfio-kvm device was already there with
this association and required no uAPI work.  This one is the least IOMMU
related of the use cases.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 14:35:10 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > 
> > > > > > present and be able to test if DMA for that device is cache
> > > > > > coherent.
> > > > 
> > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > emulate wbinvd' flag from qemu?  
> > > 
> > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > load on the system.  We therefore wanted a way that it would only be
> > > enabled when required.  
> > 
> > I think the non-coherentness is vfio_device specific? eg a specific
> > device will decide if it is coherent or not?
> 
> No, this is specifically whether DMA is cache coherent to the
> processor, ie. in the case of wbinvd whether the processor needs to
> invalidate its cache in order to see data from DMA.

I'm confused. This is x86, all DMA is cache coherent unless the device
is doing something special.

> > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > implementation and not link it through the IOMMU.
> 
> The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

And this special IOMMU mode is basically requested by the device
driver, right? Because if you use this mode you have to also use
special programming techniques.

This smells like all the "snoop bypass" stuff from PCIE (for GPUs
even) in a different guise - it is device triggered, not platform
triggered behavior.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Wed, 2 Jun 2021 15:09:25 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 14:35:10 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > >   
> > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > coherent.  
> > > > > 
> > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > emulate wbinvd' flag from qemu?
> > > > 
> > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > load on the system.  We therefore wanted a way that it would only be
> > > > enabled when required.
> > > 
> > > I think the non-coherentness is vfio_device specific? eg a specific
> > > device will decide if it is coherent or not?  
> > 
> > No, this is specifically whether DMA is cache coherent to the
> > processor, ie. in the case of wbinvd whether the processor needs to
> > invalidate its cache in order to see data from DMA.  
> 
> I'm confused. This is x86, all DMA is cache coherent unless the device
> is doing something special.
> 
> > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > implementation and not link it through the IOMMU.  
> > 
> > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.  
> 
> And this special IOMMU mode is basically requested by the device
> driver, right? Because if you use this mode you have to also use
> special programming techniques.
> 
> This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> even) in a different guise - it is device triggered, not platform
> triggered behavior.

Right, the device can generate the no-snoop transactions, but it's the
IOMMU that essentially determines whether those transactions are
actually still cache coherent, AIUI.

I did experiment with virtually hardwiring the Enable No-Snoop bit in
the Device Control Register to zero, which would be generically allowed
by the PCIe spec, but then we get into subtle dependencies in the device
drivers and clearing the bit again after any sort of reset and the
backdoor accesses to config space which exist mostly in the class of
devices that might use no-snoop transactions (yes, GPUs suck).

It was much easier and more robust to ignore the device setting and rely
on the IOMMU behavior.  Yes, maybe we sometimes emulate wbinvd for VMs
where the device doesn't support no-snoop, but it seemed like platforms
were headed in this direction where no-snoop was ignored anyway.
Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 15:09:25 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > > On Wed, 2 Jun 2021 14:35:10 -0300
> > > Jason Gunthorpe  wrote:
> > >   
> > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > > >   
> > > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > > coherent.  
> > > > > > 
> > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > > emulate wbinvd' flag from qemu?
> > > > > 
> > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > > load on the system.  We therefore wanted a way that it would only be
> > > > > enabled when required.
> > > > 
> > > > I think the non-coherentness is vfio_device specific? eg a specific
> > > > device will decide if it is coherent or not?  
> > > 
> > > No, this is specifically whether DMA is cache coherent to the
> > > processor, ie. in the case of wbinvd whether the processor needs to
> > > invalidate its cache in order to see data from DMA.  
> > 
> > I'm confused. This is x86, all DMA is cache coherent unless the device
> > is doing something special.
> > 
> > > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > > implementation and not link it through the IOMMU.  
> > > 
> > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.  
> > 
> > And this special IOMMU mode is basically requested by the device
> > driver, right? Because if you use this mode you have to also use
> > special programming techniques.
> > 
> > This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> > even) in a different guise - it is device triggered, not platform
> > triggered behavior.
> 
> Right, the device can generate the no-snoop transactions, but it's the
> IOMMU that essentially determines whether those transactions are
> actually still cache coherent, AIUI.

Wow, this is really confusing stuff in the code.

At the PCI level there is a TLP bit called no-snoop that is platform
specific. The general intention is to allow devices to selectively
bypass the CPU caching for DMAs. GPUs like to use this feature for
performance.

I assume there are some exciting security issues here. Looks like
allowing cache bypass does something bad inside VMs? Looks like
allowing the VM to use the cache clear instruction that is mandatory
with cache bypass DMA causes some QOS issues? OK.

So how does it work?

What I see in the intel/iommu.c is that some domains support "snoop
control" or not, based on some HW flag. This indicates if the
DMA_PTE_SNP bit is supported on a page by page basis or not.

Since x86 always leans toward "DMA cache coherent" I'm reading some
tea leaves here:

IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
                               transactions */

And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

Further, I guess IOMMUs that don't support PTE_SNP, or have
DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
and all)

So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
incoherent with the CPU caches, it just means that the snoop bit in
the TLP cannot be enforced. ie the device *could* do no-snoop DMA
if it wants. Devices that never do no-snoop remain DMA coherent on
x86, as they always have been.

IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
the PCI device is blocked from using no-snoop in its TLPs.

I wonder if ARM implemented this consistently? I see VDPA is
confused.. I was confused. What a terrible set of names.

In VFIO generic code I see it always sets IOMMU_CACHE:

if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;

And thus also always provides IOMMU_CACHE to iommu_map:

ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
npage << PAGE_SHIFT, prot | d->prot);

So when the IOMMU supports the no-snoop blocking security feature VFIO
turns it on and blocks no-snoop to all pages? Ok..

But I must be missing something big because *something* in the IOVA
map should work with no-snoopable DMA, right? Otherwise what is the
point of exposing the invalidate instruction to the guest?

I would think userspace should be relaying the DMA_PTE_SNP bit from
the guest's page tables up to here??

The KVM hookup is driven by IOMMU_CACHE which is driven by
IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
if the IOMMU can block the SNP bit? And then we map all the pages to
block the snoop bit? Huh?

Re: [PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT

2021-06-02 Thread kernel test robot
Hi Jean-Philippe,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on iommu/next arm64/for-next/core linus/master v5.13-rc4 next-20210602]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jean-Philippe-Brucker/Add-support-for-ACPI-VIOT/20210602-235849
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: x86_64-randconfig-a012-20210602 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project d41cb6bb2607fa5c7a9df2b3dab361353657d225)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/937da71a81108243877fb1f0f568e56a08a62c50
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jean-Philippe-Brucker/Add-support-for-ACPI-VIOT/20210602-235849
        git checkout 937da71a81108243877fb1f0f568e56a08a62c50
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> drivers/acpi/scan.c:1540:26: error: no member named 'ops' in 'struct iommu_fwspec'
           return fwspec ? fwspec->ops : NULL;
                           ~~~~~~  ^
>> drivers/acpi/scan.c:1564:9: error: implicit declaration of function 'iommu_probe_device' [-Werror,-Wimplicit-function-declaration]
           err = iommu_probe_device(dev);
                 ^
   2 errors generated.


vim +1540 drivers/acpi/scan.c

  1535  
  1536  static inline const struct iommu_ops *acpi_iommu_fwspec_ops(struct device *dev)
  1537  {
  1538          struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
  1539  
> 1540          return fwspec ? fwspec->ops : NULL;
  1541  }
  1542  
  1543  static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
  1544                                                          const u32 *id_in)
  1545  {
  1546          int err;
  1547          const struct iommu_ops *ops;
  1548  
  1549          /*
  1550           * If we already translated the fwspec there is nothing left to do,
  1551           * return the iommu_ops.
  1552           */
  1553          ops = acpi_iommu_fwspec_ops(dev);
  1554          if (ops)
  1555                  return ops;
  1556  
  1557          err = iort_iommu_configure_id(dev, id_in);
  1558  
  1559          /*
  1560           * If we have reason to believe the IOMMU driver missed the initial
  1561           * add_device callback for dev, replay it to get things in order.
  1562           */
  1563          if (!err && dev->bus && !device_iommu_mapped(dev))
> 1564                  err = iommu_probe_device(dev);
  1565  
  1566          /* Ignore all other errors apart from EPROBE_DEFER */
  1567          if (err == -EPROBE_DEFER) {
  1568                  return ERR_PTR(err);
  1569          } else if (err) {
  1570                  dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
  1571                  return NULL;
  1572          }
  1573          return acpi_iommu_fwspec_ops(dev);
  1574  }
  1575  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org



Re: [PATCH v3] iommu/dma: Fix IOVA reserve dma ranges

2021-06-02 Thread Sven Peter via iommu
Hi,

I just ran into the exact same issue while working on the M1 DART IOMMU driver
and it was fixed by this commit. Thanks!

Would be great if this could be picked up.

Tested-by: Sven Peter 


Best,


Sven


On Mon, Sep 14, 2020, at 09:23, Srinath Mannam via iommu wrote:
> Fix IOVA reserve failure in the case when address of first memory region
> listed in dma-ranges is equal to 0x0.
> 
> Fixes: aadad097cd46f ("iommu/dma: Reserve IOVA for PCIe inaccessible DMA address")
> Signed-off-by: Srinath Mannam 
> ---
> Changes from v2:
>    Modify error message with useful information based on Bjorn's comments.
> 
> Changes from v1:
>Removed unnecessary changes based on Robin's review comments.
> 
>  drivers/iommu/dma-iommu.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 5141d49a046b..5b9791f35c5e 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -217,9 +217,11 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
>   lo = iova_pfn(iovad, start);
>   hi = iova_pfn(iovad, end);
>   reserve_iova(iovad, lo, hi);
> - } else {
> + } else if (end < start) {
>   /* dma_ranges list should be sorted */
> - dev_err(&dev->dev, "Failed to reserve IOVA\n");
> + dev_err(&dev->dev,
> + "Failed to reserve IOVA [%#010llx-%#010llx]\n",
> + start, end);
>   return -EINVAL;
>   }
>  
> -- 
> 2.17.1
> 


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Wed, 2 Jun 2021 16:54:04 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> > 
> > Right, the device can generate the no-snoop transactions, but it's the
> > IOMMU that essentially determines whether those transactions are
> > actually still cache coherent, AIUI.  
> 
> Wow, this is really confusing stuff in the code.
> 
> At the PCI level there is a TLP bit called no-snoop that is platform
> specific. The general intention is to allow devices to selectively
> bypass the CPU caching for DMAs. GPUs like to use this feature for
> performance.

Yes

> I assume there is some exciting security issues here. Looks like
> allowing cache bypass does something bad inside VMs? Looks like
> allowing the VM to use the cache clear instruction that is mandatory
> with cache bypass DMA causes some QOS issues? OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.

> So how does it work?
> 
> What I see in the intel/iommu.c is that some domains support "snoop
> control" or not, based on some HW flag. This indicates if the
> DMA_PTE_SNP bit is supported on a page by page basis or not.
> 
> Since x86 always leans toward "DMA cache coherent" I'm reading some
> tea leaves here:
> 
>   IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
>  transactions */
> 
> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
> snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

That's my understanding as well.

> Further, I guess IOMMUs that don't support PTE_SNP, or have
> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
> and all)

Yes.

> So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
> incoherent with the CPU caches, it just means that the snoop bit in
> the TLP cannot be enforced. ie the device *could* do no-snoop DMA
> if it wants. Devices that never do no-snoop remain DMA coherent on
> x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.

> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
> the PCI device is blocked from using no-snoop in its TLPs.
> 
> I wonder if ARM implemented this consistently? I see VDPA is
> confused.. I was confused. What a terrible set of names.
> 
> In VFIO generic code I see it always sets IOMMU_CACHE:
> 
> if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> domain->prot |= IOMMU_CACHE;
> 
> And thus also always provides IOMMU_CACHE to iommu_map:
> 
> ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << 
> PAGE_SHIFT,
> npage << PAGE_SHIFT, prot | d->prot);
> 
> So when the IOMMU supports the no-snoop blocking security feature VFIO
> turns it on and blocks no-snoop to all pages? Ok..

Yep, I'd forgotten this nuance that we need to enable it via the
mapping flags.

> But I must be missing something big because *something* in the IOVA
> map should work with no-snoopable DMA, right? Otherwise what is the
> point of exposing the invalidate instruction to the guest?
> 
> I would think userspace should be relaying the DMA_PTE_SNP bit from
> the guest's page tables up to here??
> 
> The KVM hookup is driven by IOMMU_CACHE which is driven by
> IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
> if the IOMMU can block the SNP bit? And then we map all the pages to
> block the snoop bit? Huh?

Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
from the guest page table... what page table?  We don't necessarily
have a vIOMMU to expose such things, I don't think it even existed when
this we added.  Essentially if we can ignore no-snoop at the IOMMU,
then KVM doesn't need to worry about emulating wbinvd because of an
assigned device, whether that device uses it or not.  Win-win.

> Your explanation makes perfect sense: Block guests from using the
> dangerous cache invalidate instruction unless a device that uses
> no-snoop is plugged in. Block devices from using no-snoop because
> something about it is insecure. Ok.

No-snoop itself is not insecure, but to support no-snoop in a VM KVM
can't ignore wbinvd, which has overhead and abuse implications.

> But the conditions I'm looking for "device that uses no-snoop" is:
>  - The device will issue no-snoop TLPs at all

We can't really know this generically.  We can try to set the enable
bit to see if the device is capable of no-snoop, but that doesn't mean
it will use no-snoop.

>  - The IOMMU will let no-snoop through
>  - The platform will honor no-snoop
> 
> Only if all three are met we should allow the dangerous instruction in
> KVM, right?

We test at the IOMMU and assume that the IOMMU knowledge encompasses
whether the platform honors no-snoop (note for example how amd and arm
report true for IOMMU_CAP_CACHE_COHERENCY).

RE: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()

2021-06-02 Thread Luck, Tony
>> ... so on a PASID system, your trivial reproducer would theoretically
>> fire the same way and corrupt FPU state just as well.
>
> This is worse and you can't selftest it because the IPI can just hit in
> the middle of _any_ FPU state operation and corrupt state.

That sounds like we should abandon the "IPI all the other threads
to force enable the PASID for them" approach. It would just be a
nightmare of papering over cracks when the IPI was delivered at
some inconvenient moment when the recipient was in the middle
of touching xsave state.

I've told Fenghua to dig out the previous iteration of this patch where
the plan was to lazily fix the PASID_MSR in other threads in the #GP
handler.

That algorithm is very simple and easy to check. Pseudo-code:

#GP:
	if (usermode && current->mm->pasid && rdmsr(PASID_MSR) != valid) {
		wrmsr(PASID_MSR, current->mm->pasid | PASID_VALID);
		return;
	}

Worst case is that some thread of a multi-threaded process that is using PASID
takes some unrelated #GP ... this code will try to fix it by enabling
the PASID_MSR. That will just #GP a second time and this test will see
the MSR is already set, so fall into the usual #GP handling code.

Seems like a better direction than trying to fix the IPI method. The
virtualization folks will like this way more because an IPI in a guest
causes a couple of VMEXITs and so is somewhat expensive.

-Tony


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:

> Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> from the guest page table... what page table?  

I see my confusion now, the phrasing in your earlier remark led me to
think this was about allowing the no-snoop performance enhancement in
some restricted way.

It is really about blocking no-snoop 100% of the time and then
disabling the dangerous wbinvd when the block is successful.

Didn't closely read the kvm code :\

If it was about allowing the optimization then I'd expect the guest to
enable no-snoopable regions via it's vIOMMU and realize them to the
hypervisor and plumb the whole thing through. Hence my remark about
the guest page tables..

So really the test is just 'were we able to block it' ?

> This support existed before mdev, IIRC we needed it for direct
> assignment of NVIDIA GPUs.

Probably because they ignored the disable no-snoop bits in the control
block, or reset them in some insane way to "fix" broken BIOSes and
kept using it even though by all rights qemu would have tried hard to
turn it off via the config space. Processing no-snoop without a
working wbinvd would be fatal. Yeesh

But OK, back to /dev/ioasid. This answers a few lingering questions I
had..

1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
   domains.

   This doesn't actually matter. If you mix them together then kvm
   will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
   anywhere in this VM.

   Thus if two IOMMUs are joined together into a single /dev/ioasid
   then we can just make them both pretend to be
   !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

2) How to fit this part of kvm in some new /dev/ioasid world

   What we want to do here is iterate over every ioasid associated
   with the group fd that is passed into kvm.

   Today the group fd has a single container which specifies the
   single ioasid so this is being done trivially.

   To reorg we want to get the ioasid from the device not the
   group (see my note to David about the groups vs device rational)

   This is just iterating over each vfio_device in the group and
   querying the ioasid it is using.

   Or perhaps more directly: an op attaching the vfio_device to the
   kvm and having some simple helper 
 '(un)register ioasid with kvm (kvm, ioasid)'
   that the vfio_device driver can call that just sorts this out.

   It is not terrible..
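
   Sketching it (the helper and the coherency check are invented
   names; only kvm_arch_register_noncoherent_dma() is existing code):

	/* hypothetical helper the vfio_device driver could call */
	void vfio_ioasid_register_with_kvm(struct kvm *kvm, u32 ioasid)
	{
		if (!ioasid_enforces_no_snoop(ioasid))
			kvm_arch_register_noncoherent_dma(kvm);
	}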

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> > 
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > > 3) and beyond?
> > 
> > Yes absolutely. The API should be flexible enough to specify the
> > creation of all future page table formats we'd want to have and all HW
> > specific details on those formats.
> 
> OK, stay in the same line.
> 
> > > If so, it seems that we need some in-kernel helpers and uAPIs to
> > > support pre-installing a page table to IOASID.
> > 
> > Not sure what this means..
> 
> Sorry that I didn't make this clear.
> 
> Let me bring back the page table types in my eyes.
> 
>  1) IOMMU format page table (a.k.a. iommu_domain)
>  2) user application CPU page table (SVA for example)
>  3) KVM EPT (future option)
>  4) VM guest managed page table (nesting mode)
> 
> Each type of page table should be able to be associated with its IOASID.
> We have BIND protocol for 4); We explicitly allocate an iommu_domain for
> 1). But we don't have a clear definition for 2) 3) and others. I think
> it's necessary to clearly define a time point and kAPI name between
> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
> opportunity to associate their page table with the allocated IOASID
> before attaching the page table to the real IOMMU hardware.

In my mind these are all actions of creation..

#1 is ALLOC_IOASID 'to be compatible with thes devices attached to
   this FD'
#2 is ALLOC_IOASID_SVA
#3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
#4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
  ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have a set of operations that are allows
map/unmap is only legal on #1. invalidate is only legal on #4, etc.

How you want to split this up in the ioctl interface is a more
interesting question. I generally like more calls than giant unwieldly
multiplexer structs, but some things are naturally flags and optional
modifications of a single ioctl.

In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
then a single 'DESTROY_IOASID' that works on all of them.

> I/O page fault handling is similar. The provider of the page table
> should take the responsibility to handle the possible page faults.

For the faultable types, yes #3 and #4 should hook in the fault
handler and deal with it.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Gunthorpe
On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote:

> OK, this implies that if one user inadvertently creates intended parent/
> child via different fd's then the operation will simply fail.

Remember the number space to refer to the ioasid's inside the FD is
local to that instance of the FD. Each FD should have its own xarray

You can't actually accidently refer to an IOASID in FD A from FD B
because the xarray lookup in FD B will not return 'IOASID A'.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 0/7] iommu: Allow IOVA rcache range be configured

2021-06-02 Thread Lu Baolu

On 6/2/21 3:48 PM, John Garry wrote:

On 02/06/2021 05:37, Lu Baolu wrote:

On 6/1/21 10:29 PM, John Garry wrote:
For streaming DMA mappings involving an IOMMU and whose IOVA len 
regularly

exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced.

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: 
Retry

from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA ageing 
issue,

as discussed at [1].

This series allows the IOVA rcache range be configured, so that we may
cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used
indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated same as how the IOMMU group default domain type is
updated, i.e. must unbind the only device in the group first.


Could you explain why it requires singleton group and driver unbinding
if the user only wants to increase the upper limit? I haven't dived into
the details yet, sorry if this is a silly question.


Hi Baolu,

I did actually try increasing the range for a 'live' domain in the v1 
series, but it turned out too messy. First problem is reallocating the 
memory to hold the rcaches. Second problem is that we need to deal with 
the issue that all IOVAs in the rcache need to be a pow-of-2, which is 
difficult to enforce for IOVAs which weren't being cached before, but 
now would be.


So now I changed to work similar to how we change the default domain 
type, i.e. don't operate on a 'live' domain.


If these hard restrictions on users are just to walk around the messy
code in kernel, I would rather solve those messy code to achieve a
better user experience. :-)

Best regards,
baolu


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 12:09 AM
> 
> On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, June 2, 2021 1:42 AM
> > >
> > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > >
> > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > > >
> > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> and
> > > > > > software nesting. With hardware support the child and parent I/O
> page
> > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > translation.
> > > > > > When it's implemented in software, the ioasid driver is responsible
> for
> > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > table.
> > > > > > Software nesting requires both child/parent page tables operated
> > > through
> > > > > > the dma mapping protocol, so any change in either level can be
> > > captured
> > > > > > by the kernel to update the corresponding shadow mapping.
> > > > >
> > > > > Why? A SW emulation could do this synchronization during
> invalidation
> > > > > processing if invalidation contained an IOVA range.
> > > >
> > > > In this proposal we differentiate between host-managed and user-
> > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > > not required when changing a PTE from non-present to present).
> > > >
> > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > instead of mixing them together. Following this policy, map+unmap
> > > > must be used in both levels for software nesting, so changes in either
> > > > level are captured timely to synchronize the shadow mapping.
> > >
> > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > it is created. If you put two different types in a tree then each IOASID
> > > must continue to use its own operation mode.
> > >
> > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> >
> > only for software nesting. With hardware support the parent uses map
> > while the child uses bind.
> >
> > Yes, the policy is specified per IOASID. But if the policy violates the
> > requirement in a specific nesting mode, then nesting should fail.
> 
> I don't get it.
> 
> If the IOASID is a page table then it is bind/invalidate. SW or not SW
> doesn't matter at all.
> 
> > >
> > > A software emulated two level page table where the leaf level is a
> > > bound page table in guest memory should continue to use
> > > bind/invalidate to maintain the guest page table IOASID even though it
> > > is a SW construct.
> >
> > with software nesting the leaf should be a host-managed page table
> > (or metadata). A bind/invalidate protocol doesn't require the user
> > to notify the kernel of every page table change.
> 
> The purpose of invalidate is to inform the implementation that the
> page table has changed so it can flush the caches. If the page table
> is changed and invalidation is not issued then then the implementation
> is free to ignore the changes.
> 
> In this way the SW mode is the same as a HW mode with an infinite
> cache.
> 
> The collaposed shadow page table is really just a cache.
> 

OK. One additional thing is that we may need a 'caching_mode"
thing reported by /dev/ioasid, indicating whether invalidation is
required when changing non-present to present. For hardware 
nesting it's not reported as the hardware IOMMU will walk the
guest page table in cases of iotlb miss. For software nesting 
caching_mode is reported so the user must issue invalidation 
upon any change in guest page table so the kernel can update
the shadow page table timely.

Following this and your other comment with David, we will mark
host-managed vs. guest-managed explicitly for I/O page table
of each IOASID. map+unmap or bind+invalid is decided by
which owner is specified by the user.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: Virtio hardening for TDX

2021-06-02 Thread Jason Wang


在 2021/6/3 上午8:41, Andi Kleen 写道:

[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host
anymore. The host is allowed to DOS of course, but it is not allowed
to read or write any guest memory not explicitely shared with it.

This has implication for virtio. Traditionally virtio didn't assume
the other side of the communication channel is malicious, and therefore
didn't do any boundary checks in virtio ring data structures.

This patchkit does hardening for virtio.  In a TDX like model
the only host memory accesses allowed are in the virtio ring,
as well as the (forced) swiotlb buffer.

This patch kit does various changes to ensure there can be no
access outside these two areas. It is possible for the host
to break the communication, but this should result in a IO
error on the guest, but no memory safety violations.

virtio is quite complicated with many modes. To simplify
the task we enforce that virtio is only in split mode without
indirect descriptors, when running as a TDX guest. We also
enforce use of the DMA API.

Then these code paths are hardened against any corruptions
on the ring.

This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable the arch_has_restricted_memory_access
for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. This is only needed for the "am I running in TDX" part.



Note that it's probably needed by other cases as well:

1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs

We have already had discussions and some patches have been posted[2][3][4].

I think the basic idea is similar, basically,  we don't trust any 
metadata provided by the device.


[2] is the series that use the metadata stored in the private memory 
which can't be accessed by swiotlb, this series aims to eliminate all 
the possible attacks via virtqueue metadata

[3] is one example for the the used length validation
[4] is the fix for the malicious config space

Thanks

[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376







___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


在 2021/6/3 上午8:41, Andi Kleen 写道:

When running under TDX the virtio host is untrusted. The bulk
of the kernel memory is encrypted and protected, but the virtio
ring is in special shared memory that is shared with the
untrusted host.

This means virtio needs to be hardened against any attacks from
the host through the ring. Of course it's impossible to prevent DOS
(the host can chose at any time to stop doing IO), but there
should be no buffer overruns or similar that might give access to
any private memory in the guest.

virtio has a lot of modes, most are difficult to harden.

The best for hardening seems to be split mode without indirect
descriptors. This also simplifies the hardening job because
it's only a single code path.

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
an malicious host to force anything else. Also disallow
indirect mode for similar reasons.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifdef DEBUG

@@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
unsigned int i;
  
  	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {

+
+   /*
+* In protected guest mode disallow packed or indirect
+* because they ain't hardened.
+*/
+
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;



So we will see huge performance regression without indirect descriptor. 
We need to consider to address this.


Thanks



break;
case VIRTIO_RING_F_EVENT_IDX:
break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
case VIRTIO_F_ACCESS_PLATFORM:
break;
case VIRTIO_F_RING_PACKED:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_F_ORDER_PLATFORM:
break;
+   clear:
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks

2021-06-02 Thread Konrad Rzeszutek Wilk
On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote:
> swiotlb currently only uses the start address of a DMA to check if something
> is in the swiotlb or not. But with virtio and untrusted hosts the host
> could give some DMA mapping that crosses the swiotlb boundaries,
> potentially leaking or corrupting data. Add size checks to all the swiotlb
> checks and reject any DMAs that cross the swiotlb buffer boundaries.

I seem to be only CC-ed on this and #7, so please bear with me.

But could you explain to me why please:

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12)
Author: Martin Radev 
Date:   Tue Jan 12 16:07:29 2021 +0100

swiotlb: Validate bounce size in the sync/unmap path

does not solve the problem as well?

> 
> Signed-off-by: Andi Kleen 
> ---
>  drivers/iommu/dma-iommu.c   | 13 ++---
>  drivers/xen/swiotlb-xen.c   | 11 ++-
>  include/linux/dma-mapping.h |  4 ++--
>  include/linux/swiotlb.h |  8 +---
>  kernel/dma/direct.c |  8 
>  kernel/dma/direct.h |  8 
>  kernel/dma/mapping.c|  4 ++--
>  net/xdp/xsk_buff_pool.c |  2 +-
>  8 files changed, 30 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7bcdd1205535..7ef13198721b 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
> dma_addr_t dma_addr,
>  
>   __iommu_dma_unmap(dev, dma_addr, size);
>  
> - if (unlikely(is_swiotlb_buffer(phys)))
> + if (unlikely(is_swiotlb_buffer(phys, size)))
>   swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
>  }
>  
> @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
> *dev, phys_addr_t phys,
>   }
>  
>   iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
> - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
> + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size))
>   swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
>   return iova;
>  }
> @@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
> *dev,
>   if (!dev_is_dma_coherent(dev))
>   arch_sync_dma_for_cpu(phys, size, dir);
>  
> - if (is_swiotlb_buffer(phys))
> + if (is_swiotlb_buffer(phys, size))
>   swiotlb_sync_single_for_cpu(dev, phys, size, dir);
>  }
>  
> @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct 
> device *dev,
>   return;
>  
>   phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
> - if (is_swiotlb_buffer(phys))
> + if (is_swiotlb_buffer(phys, size))
>   swiotlb_sync_single_for_device(dev, phys, size, dir);
>  
>   if (!dev_is_dma_coherent(dev))
> @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
>   if (!dev_is_dma_coherent(dev))
>   arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
>  
> - if (is_swiotlb_buffer(sg_phys(sg)))
> + if (is_swiotlb_buffer(sg_phys(sg), sg->length))
>   swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
>   sg->length, dir);
>   }
> @@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device 
> *dev,
>   return;
>  
>   for_each_sg(sgl, sg, nelems, i) {
> - if (is_swiotlb_buffer(sg_phys(sg)))
> + if (is_swiotlb_buffer(sg_phys(sg), sg->length))
>   swiotlb_sync_single_for_device(dev, sg_phys(sg),
>  sg->length, dir);
> -
>   if (!dev_is_dma_coherent(dev))
>   arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
>   }
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 24d11861ac7d..333846af8d35 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t 
> p, size_t size)
>   return 0;
>  }
>  
> -static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
> +static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr,
> +  size_t size)
>  {
>   unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
>   unsigned long xen_pfn = bfn_to_local_pfn(bfn);
> @@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
> dma_addr_t dma_addr)
>* in our domain. Therefore _only_ check address within our domain.
>*/
>   if (pfn_valid(PFN_DOWN(paddr)))
> - return is_swiotlb_buffer(paddr);
> + return is_swiotlb_buffer(paddr, size);
>   return 0;
>  }
>  
> @@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device 

Virtio hardening for TDX

2021-06-02 Thread Andi Kleen
[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host
anymore. The host is allowed to DOS of course, but it is not allowed
to read or write any guest memory not explicitely shared with it.

This has implication for virtio. Traditionally virtio didn't assume
the other side of the communication channel is malicious, and therefore
didn't do any boundary checks in virtio ring data structures.

This patchkit does hardening for virtio.  In a TDX like model
the only host memory accesses allowed are in the virtio ring, 
as well as the (forced) swiotlb buffer.

This patch kit does various changes to ensure there can be no
access outside these two areas. It is possible for the host
to break the communication, but this should result in a IO
error on the guest, but no memory safety violations.

virtio is quite complicated with many modes. To simplify
the task we enforce that virtio is only in split mode without
indirect descriptors, when running as a TDX guest. We also
enforce use of the DMA API.

Then these code paths are hardened against any corruptions
on the ring.

This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable the arch_has_restricted_memory_access
for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. This is only needed for the "am I running in TDX" part.



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v1 6/8] dma: Add return value to dma_unmap_page

2021-06-02 Thread Andi Kleen
In some situations when we know swiotlb is forced and we have
to deal with untrusted hosts, it's useful to know if a mapping
was in the swiotlb or not. This allows us to abort any IO
operation that would access memory outside the swiotlb.

Otherwise it might be possible for a malicious host to inject
any guest page in a read operation. While it couldn't directly
access the results of the read() inside the guest, there
might scenarios where data is echoed back with a write(),
and that would then leak guest memory.

Add a return value to dma_unmap_single/page. Most users
of course will ignore it. The return value is set to EIO
if we're in forced swiotlb mode and the buffer is not inside
the swiotlb buffer. Otherwise it's always 0.

A new callback is used to avoid changing all the IOMMU drivers.

Signed-off-by: Andi Kleen 
---
 drivers/iommu/dma-iommu.c   | 17 +++--
 include/linux/dma-map-ops.h |  3 +++
 include/linux/dma-mapping.h |  7 ---
 kernel/dma/mapping.c|  6 +-
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7ef13198721b..babe46f2ae3a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, 
dma_addr_t dma_addr,
iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
-static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
+static int __iommu_dma_unmap_swiotlb_check(struct device *dev,
+   dma_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
unsigned long attrs)
 {
@@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
phys = iommu_iova_to_phys(domain, dma_addr);
if (WARN_ON(!phys))
-   return;
+   return -EIO;
 
__iommu_dma_unmap(dev, dma_addr, size);
 
if (unlikely(is_swiotlb_buffer(phys, size)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
+   else if (swiotlb_force == SWIOTLB_FORCE)
+   return -EIO;
+   return 0;
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, 
struct page *page,
return dma_handle;
 }
 
-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t 
dma_handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
-   __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
+   return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir,
+  attrs);
 }
 
 /*
@@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, 
struct scatterlist *s
int i;
 
for_each_sg(sg, s, nents, i)
-   __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s),
+   __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s),
sg_dma_len(s), dir, attrs);
 }
 
@@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.mmap   = iommu_dma_mmap,
.get_sgtable= iommu_dma_get_sgtable,
.map_page   = iommu_dma_map_page,
-   .unmap_page = iommu_dma_unmap_page,
+   .unmap_page_check   = iommu_dma_unmap_page_check,
.map_sg = iommu_dma_map_sg,
.unmap_sg   = iommu_dma_unmap_sg,
.sync_single_for_cpu= iommu_dma_sync_single_for_cpu,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..0ed0190f7949 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,9 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
unsigned long (*get_merge_boundary)(struct device *dev);
+   int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle,
+   size_t size, enum dma_data_direction dir,
+   unsigned long attrs);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 37fbd12bd4ab..25b8382f8601 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, 
dma_addr_t dma_addr)
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t siz

[PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen
When running under TDX the virtio host is untrusted. The bulk
of the kernel memory is encrypted and protected, but the virtio
ring is in special shared memory that is shared with the
untrusted host.

This means virtio needs to be hardened against any attacks from
the host through the ring. Of course it's impossible to prevent DOS
(the host can chose at any time to stop doing IO), but there
should be no buffer overruns or similar that might give access to
any private memory in the guest.

virtio has a lot of modes, most are difficult to harden.

The best for hardening seems to be split mode without indirect
descriptors. This also simplifies the hardening job because
it's only a single code path.

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
an malicious host to force anything else. Also disallow
indirect mode for similar reasons.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef DEBUG
@@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
unsigned int i;
 
for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
+
+   /*
+* In protected guest mode disallow packed or indirect
+* because they ain't hardened.
+*/
+
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_RING_F_EVENT_IDX:
break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
case VIRTIO_F_ACCESS_PLATFORM:
break;
case VIRTIO_F_RING_PACKED:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_F_ORDER_PLATFORM:
break;
+   clear:
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.25.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v1 7/8] virtio: Abort IO when descriptor points outside forced swiotlb

2021-06-02 Thread Andi Kleen
Now that we have a return value for unmapping DMA mappings that
are outside the forced swiotlb, use that to abort the IO operation.

This prevents the host from subverting a read to access some
data in the guest address space, which it might then get access somehow in
another IO operation. It can subvert reads to point to other
reads or other writes, but since it controls IO it can do
that anyways.

This is only done for the split code path, which is the only
one supported with confidential guests.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e9aa1e95e1b..244a5b62d85c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -365,29 +365,31 @@ static int vring_mapping_error(const struct 
vring_virtqueue *vq,
  * Split ring specific functions - *_split().
  */
 
-static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+static int vring_unmap_one_split(const struct vring_virtqueue *vq,
  struct vring_desc *desc)
 {
u16 flags;
+   int ret;
 
if (!vq->use_dma_api)
-   return;
+   return 0;
 
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
if (flags & VRING_DESC_F_INDIRECT) {
-   dma_unmap_single(vring_dma_dev(vq),
+   ret = dma_unmap_single(vring_dma_dev(vq),
 virtio64_to_cpu(vq->vq.vdev, desc->addr),
 virtio32_to_cpu(vq->vq.vdev, desc->len),
 (flags & VRING_DESC_F_WRITE) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
-   dma_unmap_page(vring_dma_dev(vq),
+   ret = dma_unmap_page(vring_dma_dev(vq),
   virtio64_to_cpu(vq->vq.vdev, desc->addr),
   virtio32_to_cpu(vq->vq.vdev, desc->len),
   (flags & VRING_DESC_F_WRITE) ?
   DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
+   return ret;
 }
 
 static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
@@ -609,6 +611,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
break;
if (!inside_split_ring(vq, i))
break;
+   /*
+* Ignore unmapping errors since
+* we're aborting anyways.
+*/
vring_unmap_one_split(vq, &desc[i]);
i = virtio16_to_cpu(_vq->vdev, desc[i].next);
}
@@ -671,7 +677,10 @@ static int detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
i = head;
 
while (vq->split.vring.desc[i].flags & nextflag) {
-   vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+   int ret;
+   ret = vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+   if (ret)
+   return ret;
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
if (!inside_split_ring(vq, i))
return -EIO;
@@ -878,6 +887,7 @@ static void *virtqueue_detach_unused_buf_split(struct 
virtqueue *_vq)
continue;
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
+   /* Ignore unmap errors because there is nothing to abort */
detach_buf_split(vq, i, NULL);
/* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
-- 
2.25.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v1 8/8] virtio: Error out on endless free lists

2021-06-02 Thread Andi Kleen
Error out with a warning when the free list loops longer
than the maximum size while freeing descriptors. While technically
we don't care about DOS it is still better to abort it early.

We ran into this problem while fuzzing the virtio interactions
where the fuzzed code would get stuck for a long time.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 244a5b62d85c..96adaa4c5404 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,6 +685,11 @@ static int detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
if (!inside_split_ring(vq, i))
return -EIO;
vq->vq.num_free++;
+   if (WARN_ONCE(vq->vq.num_free >
+   vq->split.queue_size_in_bytes /
+   sizeof(struct vring_desc),
+   "Virtio freelist corrupted"))
+   return -EIO;
}
 
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
-- 
2.25.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v1 3/8] virtio: Harden split buffer detachment

2021-06-02 Thread Andi Kleen
Harden the split buffer detachment path by adding boundary checking. Note
that when this fails we may fail to unmap some swiotlb mapping, which could
result in a leak and a DOS. But that's acceptable because an malicious host
can DOS us anyways.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d37ff5a0ff58..1e9aa1e95e1b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue 
*_vq)
return needs_kick;
 }
 
-static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
-void **ctx)
+static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+   void **ctx)
 {
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
 
+   /* We'll leak DMA mappings when this happens, but nothing
+* can be done about that. In the worst case the host
+* could DOS us, but it can of course do that anyways.
+*/
+   if (!inside_split_ring(vq, head))
+   return -EIO;
+
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
 
@@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
while (vq->split.vring.desc[i].flags & nextflag) {
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
+   if (!inside_split_ring(vq, i))
+   return -EIO;
vq->vq.num_free++;
}
 
@@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
 
/* Free the indirect table, if any, now that it's unmapped. */
if (!indir_desc)
-   return;
+   return 0;
 
len = virtio32_to_cpu(vq->vq.vdev,
vq->split.vring.desc[head].len);
@@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
} else if (ctx) {
*ctx = vq->split.desc_state[head].indir_desc;
}
+   return 0;
 }
 
 static inline bool more_used_split(const struct vring_virtqueue *vq)
@@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
void *ret;
unsigned int i;
u16 last_used;
+   int err;
 
START_USE(vq);
 
@@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
 
/* detach_buf_split clears data, so grab it now. */
ret = vq->split.desc_state[i].data;
-   detach_buf_split(vq, i, ctx);
+   err = detach_buf_split(vq, i, ctx);
+   if (err) {
+   END_USE(vq);
+   return NULL;
+   }
+
vq->last_used_idx++;
/* If we expect an interrupt for the next entry, tell host
 * by writing event index and flush out the write before
@@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct 
virtqueue *_vq)
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
detach_buf_split(vq, i, NULL);
+   /* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
-- 
2.25.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v1 5/8] dma: Use size for swiotlb boundary checks

2021-06-02 Thread Andi Kleen
swiotlb currently only uses the start address of a DMA to check if something
is in the swiotlb or not. But with virtio and untrusted hosts the host
could give some DMA mapping that crosses the swiotlb boundaries,
potentially leaking or corrupting data. Add size checks to all the swiotlb
checks and reject any DMAs that cross the swiotlb buffer boundaries.

Signed-off-by: Andi Kleen 
---
 drivers/iommu/dma-iommu.c   | 13 ++---
 drivers/xen/swiotlb-xen.c   | 11 ++-
 include/linux/dma-mapping.h |  4 ++--
 include/linux/swiotlb.h |  8 +---
 kernel/dma/direct.c |  8 
 kernel/dma/direct.h |  8 
 kernel/dma/mapping.c|  4 ++--
 net/xdp/xsk_buff_pool.c |  2 +-
 8 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..7ef13198721b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(phys, size)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(phys, size))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(phys, size))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(sg_phys(sg), sg->length))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device 
*dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(sg_phys(sg), sg->length))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
-
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
}
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 24d11861ac7d..333846af8d35 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t 
p, size_t size)
return 0;
 }
 
-static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
+static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr,
+size_t size)
 {
unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
unsigned long xen_pfn = bfn_to_local_pfn(bfn);
@@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(paddr, size);
return 0;
 }
 
@@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
}
 
/* NOTE: We use dev_addr here, not paddr! */
-   if (is_xen_swiotlb_buffer(hwdev, dev_addr))
+   if (is_xen_swiotlb_buffer(hwdev, dev_addr, size))
swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
 }
 
@@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, 
dma_addr_t dma_addr,
xen_dma_sync_for_cpu(dev, dma_addr, size, dir);
}
 
-   if (is_xen_swiotlb_buffer(dev, dma_addr))
+   if (is_xen

[PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Andi Kleen
In protected guest mode we don't trust the host.

This means we need to make sure the host cannot subvert us through
virtio communication. In general it can corrupt our virtio data
and cause a DOS, but it should not be able to access any data
that is not explicitely under IO.

Also boundary checking so that the free list (which is accessible
to the host) cannot point outside the virtio ring. Note it could
still contain loops or similar, but these should only cause an DOS,
not a memory corruption or leak.

When we detect any out of bounds descriptor trigger an IO error.
We also use a WARN() (in case it was a software bug instead of
an attack). This implies that a malicious host can flood
the guest kernel log, but that's only a DOS and acceptable
in the threat model.

This patch only hardens the initial consumption of the free list,
the freeing comes later.

Any of these errors can cause DMA memory leaks, but there is nothing
we can do about that and that would be just a DOS.

Signed-off-by: Andi Kleen 
---
 drivers/virtio/virtio_ring.c | 46 
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f35629fa47b1..d37ff5a0ff58 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct 
virtqueue *_vq,
return desc;
 }
 
+/* assumes no indirect mode */
+static inline bool inside_split_ring(struct vring_virtqueue *vq,
+unsigned index)
+{
+   return !WARN(index >= vq->split.vring.num,
+   "desc index %u out of bounds (%u)\n",
+   index, vq->split.vring.num);
+}
+
 static inline int virtqueue_add_split(struct virtqueue *_vq,
  struct scatterlist *sgs[],
  unsigned int total_sg,
@@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
unsigned int i, n, avail, descs_used, prev, err_idx;
int head;
bool indirect;
+   int io_err;
 
START_USE(vq);
 
@@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
 
for (n = 0; n < out_sgs; n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_TO_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
}
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_FROM_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
dma_addr_t addr = vring_map_single(
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
+   io_err = -ENOMEM;
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
@@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* We're using some buffers from the free list. */
vq->vq.num_free -= descs_used;
 
+   io_err = -EIO;
+   if (!inside_split_ring(vq, head))
+   goto unmap_release;
+
/* Update free pointer */
if (indirect)
vq->free_head = virtio16_to_cpu(_vq->vdev,
@@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
+
+   if (avail >= vq->split.vring.num)
+   goto unmap_release;
+
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
 
/* Descriptors and available array need to be set before we expose the
@@ -576,6 +607,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
for (n = 0; n < total_sg; n++) {
if (i == err_idx)
 

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen


So we will see huge performance regression without indirect 
descriptor. We need to consider to address this.


A regression would be when some existing case would be slower.

That's not the case because the behavior for the existing cases does not 
change.


Anyways when there are performance problems they can be addressed, but 
first is to make it secure.


-Andi




Thanks



  break;
  case VIRTIO_RING_F_EVENT_IDX:
  break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct 
virtio_device *vdev)

  case VIRTIO_F_ACCESS_PLATFORM:
  break;
  case VIRTIO_F_RING_PACKED:
+    if (protected_guest_has(VM_MEM_ENCRYPT))
+    goto clear;
  break;
  case VIRTIO_F_ORDER_PLATFORM:
  break;
+    clear:
  default:
  /* We don't understand this bit. */
  __virtio_clear_bit(vdev, i);



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks

2021-06-02 Thread Andi Kleen



On 6/2/2021 6:48 PM, Konrad Rzeszutek Wilk wrote:

On Wed, Jun 02, 2021 at 05:41:30PM -0700, Andi Kleen wrote:

swiotlb currently only uses the start address of a DMA to check if something
is in the swiotlb or not. But with virtio and untrusted hosts the host
could give some DMA mapping that crosses the swiotlb boundaries,
potentially leaking or corrupting data. Add size checks to all the swiotlb
checks and reject any DMAs that cross the swiotlb buffer boundaries.

I seem to be only CC-ed on this and #7, so please bear with me.
You weren't cc'ed originally so if you get partial emails it must be 
through some list.


But could you explain to me why please:

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 (swiotlb/stable/for-linus-5.12)
Author: Martin Radev 
Date:   Tue Jan 12 16:07:29 2021 +0100

 swiotlb: Validate bounce size in the sync/unmap path

does not solve the problem as well?


Thanks. I missed that patch, race condition.

One major difference of my patch is that it supports an error return, 
which allows virtio to error out. This is important in virtio because 
otherwise you'll end up with uninitialized memory on the target without 
any indication. This uninitialized memory could be an potential attack 
vector on the guest memory, e.g. if the attacker finds some way to echo 
it out again.


But the error return could be added to your infrastructure too and what 
would make this patch much shorter. I'll take a look at that.


-Andi





___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: Virtio hardening for TDX

2021-06-02 Thread Andi Kleen



Note that it's probably needed by other cases as well:

1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs


Right. I don't see any reason why these shouldn't work. You may just 
need to add the enable for the lockdown, but you can reuse the basic 
infrastructure.




We have already had discussions and some patches have been 
posted[2][3][4].


Thanks.

Yes [2] is indeed an alternative. We considered this at some point, but 
since we don't care about DOS in our case it seemed simpler to just 
harden the existing code.  But yes if it's there it's useful for TDX too.


FWIW I would argue that the descriptor boundary checking should be added 
in any case, security case or separated metadata or not, because it can 
catch bugs and is very cheap. Checking boundaries is good practice.


[4] would be an independent issue, that's something we didn't catch.

Also the swiotlb hardening implemented in this patchkit doesn't seem to 
be in any of the other patches.


So I would say my patches are mostly orthogonal to these patches below 
and not conflicting, even though they address a similar problem space.


-Andi


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX

2021-06-02 Thread Andi Kleen
In virtio the host decides whether the guest uses the DMA
API or not using the strangely named VIRTIO_F_ACCESS_PLATFORM
bit (which really indicates whether the DMA API is used or not)

For hardened virtio on TDX we want to enforce that that swiotlb is
always used, which requires using the DMA API.  While IO wouldn't
really work without the swiotlb, it might be possible that an
attacker forces swiotlbless IO to manipulate memory in the guest.

So we want to force the DMA API (which then forces swiotlb),
but without relying on the host.

There is already an arch_has_restricted_memory_acces hook for
this, which is currently used only by s390. Enable
the config option for the hook for x86 and enable it for TDX.

Signed-off-by: Andi Kleen 
---
 arch/x86/Kconfig | 1 +
 arch/x86/mm/mem_encrypt_common.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1531a0f905ed..3d804fce31b9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select X86_MEM_ENCRYPT_COMMON
+   select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
help
  Provide support for running in a trusted domain on Intel processors
  equipped with Trusted Domain eXtenstions. TDX is a new Intel
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 24c9117547b4..2244d1f033ab 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
@@ -37,3 +38,9 @@ void __init mem_encrypt_init(void)
amd_mem_encrypt_init();
 }
 
+int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}
+EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
+
-- 
2.25.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, June 3, 2021 12:17 AM
>
[...] 
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID.
> >
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
> 
> Maybe.. It is certainly a special case.
> 
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

More accurately 64+20=84 bit IOVA 😊

> 
> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page

I'm leaning toward this direction now, after a discussion with Baolu.
He reminded me that a default domain is already created for each
device when it's probed by the iommu driver. So it looks workable
to expose a per-device capability query uAPI to user once a device
is bound to the ioasid fd. Once it's available, the user should be able
to judge what format/mode should be set when creating an IOASID.

> tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

In concept this view is true. But when designing the uAPI possibly
we will not call it a 84bit format as the PASID table itself just
serves 20bit PASID space. 

Will think more how to mark it in the next version.

> 
> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
> 

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Jason Wang


在 2021/6/3 上午8:41, Andi Kleen 写道:

In protected guest mode we don't trust the host.

This means we need to make sure the host cannot subvert us through
virtio communication. In general it can corrupt our virtio data
and cause a DOS, but it should not be able to access any data
that is not explicitely under IO.

Also boundary checking so that the free list (which is accessible
to the host) cannot point outside the virtio ring. Note it could
still contain loops or similar, but these should only cause an DOS,
not a memory corruption or leak.

When we detect any out of bounds descriptor trigger an IO error.
We also use a WARN() (in case it was a software bug instead of
an attack). This implies that a malicious host can flood
the guest kernel log, but that's only a DOS and acceptable
in the threat model.

This patch only hardens the initial consumption of the free list,
the freeing comes later.

Any of these errors can cause DMA memory leaks, but there is nothing
we can do about that and that would be just a DOS.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 46 
  1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f35629fa47b1..d37ff5a0ff58 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct 
virtqueue *_vq,
return desc;
  }
  
+/* assumes no indirect mode */

+static inline bool inside_split_ring(struct vring_virtqueue *vq,
+unsigned index)
+{
+   return !WARN(index >= vq->split.vring.num,
+   "desc index %u out of bounds (%u)\n",
+   index, vq->split.vring.num);



It's better to use BAD_RING to stop virtqueue in this case.



+}
+
  static inline int virtqueue_add_split(struct virtqueue *_vq,
  struct scatterlist *sgs[],
  unsigned int total_sg,
@@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
unsigned int i, n, avail, descs_used, prev, err_idx;
int head;
bool indirect;
+   int io_err;
  
  	START_USE(vq);
  
@@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
  
  	for (n = 0; n < out_sgs; n++) {

for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_TO_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
  
@@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

}
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_FROM_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;



It looks to me all the evils came from the fact that we depends on the 
descriptor ring.


So the checks in this patch could is unnecessary if we don't even read 
from the descriptor ring which could be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html

Thanks



  
@@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

dma_addr_t addr = vring_map_single(
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
+   io_err = -ENOMEM;
if (vring_mapping_error(vq, addr))
goto unmap_release;
  
@@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

/* We're using some buffers from the free list. */
vq->vq.num_free -= descs_used;
  
+	io_err = -EIO;

+   if (!inside_split_ring(vq, head))
+   goto unmap_release;
+
/* Update free pointer */
if (indirect)
vq->free_head = virtio16_to_cpu(_vq->vdev,
@@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_

Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Andi Kleen



It looks to me all the evils came from the fact that we depends on the 
descriptor ring.


So the checks in this patch could is unnecessary if we don't even read 
from the descriptor ring which could be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html


I would argue that you should boundary check in any case. It was always 
a bug to not have boundary checks in such a data structure with multiple 
users, trust or not.


But yes your patch series is interesting and definitely makes sense for 
TDX too.


Best would be to have both I guess, and always check the boundaries 
everywhere.


So what's the merge status of your series?

-Andi


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 3/8] virtio: Harden split buffer detachment

2021-06-02 Thread Jason Wang


在 2021/6/3 上午8:41, Andi Kleen 写道:

Harden the split buffer detachment path by adding boundary checking. Note
that when this fails we may fail to unmap some swiotlb mapping, which could
result in a leak and a DOS. But that's acceptable because an malicious host
can DOS us anyways.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 25 +
  1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d37ff5a0ff58..1e9aa1e95e1b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue 
*_vq)
return needs_kick;
  }
  
-static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
-			     void **ctx)
+static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+			    void **ctx)
  {
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
  
+	/* We'll leak DMA mappings when this happens, but nothing
+	 * can be done about that. In the worst case the host
+	 * could DOS us, but it can of course do that anyways.
+	 */
+	if (!inside_split_ring(vq, head))
+		return -EIO;



I think the caller has already done this for us, with even more checks on 
the token (virtqueue_get_buf_ctx_split()):


	if (unlikely(i >= vq->split.vring.num)) {
		BAD_RING(vq, "id %u out of range\n", i);
		return NULL;
	}
	if (unlikely(!vq->split.desc_state[i].data)) {
		BAD_RING(vq, "id %u is not a head!\n", i);
		return NULL;
	}



+
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
  
@@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,

while (vq->split.vring.desc[i].flags & nextflag) {
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
+   if (!inside_split_ring(vq, i))
+   return -EIO;



Similarly, if we don't depend on the metadata stored in the descriptor, 
we don't need this check.
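
To illustrate that alternative: a sketch, assuming a driver-private shadow
of the chain (the desc_extra[] array below is hypothetical), so the loop
never reads indices back from device-writable ring memory:

	while (vq->split.desc_extra[i].flags & VRING_DESC_F_NEXT) {
		vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
		/* 'next' comes from driver-owned memory, not the ring */
		i = vq->split.desc_extra[i].next;
		vq->vq.num_free++;
	}

With that, no bounds check is needed because 'i' can only ever hold values
the driver itself wrote.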




vq->vq.num_free++;
}
  
@@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
  
  		/* Free the indirect table, if any, now that it's unmapped. */

if (!indir_desc)
-   return;
+   return 0;
  
  		len = virtio32_to_cpu(vq->vq.vdev,

vq->split.vring.desc[head].len);
@@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
} else if (ctx) {
*ctx = vq->split.desc_state[head].indir_desc;
}
+   return 0;
  }
  
  static inline bool more_used_split(const struct vring_virtqueue *vq)

@@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
void *ret;
unsigned int i;
u16 last_used;
+   int err;
  
  	START_USE(vq);
  
@@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
  
  	/* detach_buf_split clears data, so grab it now. */

ret = vq->split.desc_state[i].data;
-   detach_buf_split(vq, i, ctx);
+   err = detach_buf_split(vq, i, ctx);
+   if (err) {
+   END_USE(vq);



This reminds me that we don't use END_USE() after BAD_RING(), which 
should be fixed.


Thanks



+   return NULL;
+   }
+
vq->last_used_idx++;
/* If we expect an interrupt for the next entry, tell host
 * by writing event index and flush out the write before
@@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
detach_buf_split(vq, i, NULL);
+   /* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);



Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


On 2021/6/3 9:48 AM, Andi Kleen wrote:


So we will see a huge performance regression without indirect 
descriptors. We need to consider addressing this.


A regression would be when some existing case would be slower.

That's not the case because the behavior for the existing cases does 
not change.


Anyway, when there are performance problems they can be addressed later; 
the first step is to make it secure.



I agree, but I want to know why the indirect descriptor needs to be 
disabled. The table can't be written by the device since it's not a 
coherent swiotlb mapping.
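
For context, the indirect table is allocated from driver memory and mapped
in one direction only; simplified from virtqueue_add_split():

	/* the table lives in driver-owned memory... */
	desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
	...
	/* ...and is bounced through swiotlb to-device only */
	addr = vring_map_single(vq, desc,
				total_sg * sizeof(struct vring_desc),
				DMA_TO_DEVICE);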


Thanks




-Andi




Thanks



	break;
	case VIRTIO_RING_F_EVENT_IDX:
	break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)

	case VIRTIO_F_ACCESS_PLATFORM:
	break;
	case VIRTIO_F_RING_PACKED:
+	if (protected_guest_has(VM_MEM_ENCRYPT))
+		goto clear;
	break;
	case VIRTIO_F_ORDER_PLATFORM:
	break;
+	clear:
	default:
	/* We don't understand this bit. */
	__virtio_clear_bit(vdev, i);







Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Jason Wang


On 2021/6/3 10:18 AM, Andi Kleen wrote:


It looks to me all the evils come from the fact that we depend on 
the descriptor ring.


So the checks in this patch could be unnecessary if we don't even 
read from the descriptor ring, which can be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html


I would argue that you should boundary check in any case. It was 
always a bug to not have boundary checks in such a data structure with 
multiple users, trust or not.


But yes your patch series is interesting and definitely makes sense 
for TDX too.


Best would be to have both I guess, and always check the boundaries 
everywhere.



I agree, but some of the checks are unnecessary if we do this series on 
top of my series.





So what's the merge status of your series?



If I understand Michael correctly, I will send a formal series and 
he will try to merge it for 5.14.


Thanks




-Andi





RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, June 3, 2021 12:59 AM
> 
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > /* Bind guest I/O page table  */
> > > > bind_data = {
> > > > .ioasid = gva_ioasid;
> > > > .addr   = gva_pgtable1;
> > > > // and format information
> > > > };
> > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > >
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> >
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually.
> 
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

I would expect the capability to be reported per-device via /dev/iommu. 
Incompatible devices can bind to the same fd but cannot attach to
the same IOASID. This allows incompatible devices to share locked
page accounting.

> 
> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.

If my memory serves, the original concept was that the devices 
attached to the same IOASID must support the same format. Otherwise 
they need to attach to different IOASIDs (but still within the 
same fd).

> 
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

With above concept we don't need IOASID container then.

> 
> I don't know if that small advantage is worth the extra complexity
> though.
> 
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats.  Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept.  And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
> 
> The basic process for a user space owned pgtable mode would be:
> 
>  1) qemu has to figure out what format of pgtable to use
> 
> Presumably it uses query functions using the device label. The
> kernel code should look at the entire device path through all the
> IOMMU HW to determine what is possible.
> 
> Or it already knows because the VM's vIOMMU is running in some
> fixed page table format, or the VM's vIOMMU already told it, or
> something.

I'd expect both: first get the hardware format, then detect whether
it's compatible with the vIOMMU format.

> 
>  2) qemu creates an IOASID and based on #1 and says 'I want this format'

Based on earlier discussion this will possibly be:

struct iommu_ioasid_create_info {
	// if set this is a guest-managed page table, use bind+invalidate,
	// with info provided in struct pgtable_info;
	// if clear it's host-managed and use map+unmap;
#define IOMMU_IOASID_FLAG_USER_PGTABLE		1

	// if set it is for pasid table binding. same implication as
	// USER_PGTABLE except it's for a different pgtable type
#define IOMMU_IOASID_FLAG_USER_PASID_TABLE	2
	int flags;

	// Create nesting if not INVALID_IOASID
	u32 parent_ioasid;

	// additional info about the page table
	union {
		// for user-managed page table
		struct {
			u64 user_pgd;
			u32 format;
			u32 addr_width;
			// and other vendor format info
		} pgtable_info;

		// for kernel-managed page table
		struct {
			// not required on x86
			// for ppc, iirc the user wants to claim a window
			// explicitly?
		} map_info;
	};
};

then there will be no UNBIND_PGTABLE ioctl. The unbind is done 
automatically when the IOASID is freed.
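
A hypothetical usage sketch for the structure above (the ioctl name and
the format constant are placeholders, not settled uAPI):

	struct iommu_ioasid_create_info info = {
		.flags = IOMMU_IOASID_FLAG_USER_PGTABLE,
		.parent_ioasid = gpa_ioasid,	/* nest under the GPA space */
		.pgtable_info = {
			.user_pgd   = gva_pgtable1,
			.format     = IOASID_FORMAT_VTD_STAGE1,	/* placeholder */
			.addr_width = 48,
		},
	};
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE, &info);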

> 
>  3) qemu binds the IOASID to the device.

let's use 'attach' for consistency. 😊 'bind' is for the ioasid fd, which must
be completed in step 0) so the format can be reported in step 1)

> 
> If qemu gets it wrong then it just fails.
> 
>  4) For the next device qemu would have to figure out if it can re-use
> an existing IOASID based on the required properties.
> 
> You pointed to the case of mixing vIOMMU's of different p

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Wed, 2 Jun 2021 19:45:36 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> 
> > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > from the guest page table... what page table?
> 
> I see my confusion now, the phrasing in your earlier remark led me
> think this was about allowing the no-snoop performance enhancement in
> some restricted way.
> 
> It is really about blocking no-snoop 100% of the time and then
> disabling the dangerous wbinvd when the block is successful.
> 
> Didn't closely read the kvm code :\
> 
> If it was about allowing the optimization then I'd expect the guest to
> enable no-snoopable regions via it's vIOMMU and realize them to the
> hypervisor and plumb the whole thing through. Hence my remark about
> the guest page tables..
> 
> So really the test is just 'were we able to block it' ?

Yup.  Do we really still consider that there's some performance benefit
to be had by enabling a device to use no-snoop?  This seems largely a
legacy thing.
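
For reference, the two layers of "blocking" under discussion look roughly
like this (the PCI constants are the standard Linux ones; the mapping flag
is how VT-d realizes enforced snooping):

	/* 1) config space: tell the device not to set the no-snoop TLP bit */
	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
				   PCI_EXP_DEVCTL_NOSNOOP_EN);

	/* 2) IOMMU: enforce snooping even if the device sets the bit anyway */
	prot |= IOMMU_CACHE;	/* becomes DMA_PTE_SNP in VT-d PTEs */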

> > This support existed before mdev, IIRC we needed it for direct
> > assignment of NVIDIA GPUs.  
> 
> Probably because they ignored the disable no-snoop bits in the control
> block, or reset them in some insane way to "fix" broken bioses and
> kept using it even though by all rights qemu would have tried hard to
> turn it off via the config space. Processing no-snoop without a
> working wbinvd would be fatal. Yeesh
> 
> But Ok, back the /dev/ioasid. This answers a few lingering questions I
> had..
> 
> 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
>domains.
> 
>This doesn't actually matter. If you mix them together then kvm
>will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
>anywhere in this VM.
> 
>Thus if two IOMMUs are joined together into a single /dev/ioasid
>then we can just make them both pretend to be
>!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
available based on the per domain support available.  That gives us the
most consistent behavior, ie. we don't have VMs emulating wbinvd
because they used to have a device attached where the domain required
it and we can't atomically remap with new flags to perform the same as
a VM that never had that device attached in the first place.

> 2) How to fit this part of kvm in some new /dev/ioasid world
> 
>What we want to do here is iterate over every ioasid associated
>with the group fd that is passed into kvm.

Yeah, we need some better names, binding a device to an ioasid (fd) but
then attaching a device to an allocated ioasid (non-fd)... I assume
you're talking about the latter ioasid.

>Today the group fd has a single container which specifies the
>single ioasid so this is being done trivially.
> 
>To reorg we want to get the ioasid from the device not the
>group (see my note to David about the groups vs device rational)
> 
>This is just iterating over each vfio_device in the group and
>querying the ioasid it is using.

The IOMMU API group interface is largely iommu_group_for_each_dev()
anyway; we still need to account for all the RIDs and aliases of a
group.

>Or perhaps more directly: an op attaching the vfio_device to the
>kvm and having some simple helper 
>  '(un)register ioasid with kvm (kvm, ioasid)'
>that the vfio_device driver can call that just sorts this out.

We could almost eliminate the device notion altogether here, use an
ioasidfd_for_each_ioasid() but we really want a way to trigger on each
change to the composition of the device set for the ioasid, which is
why we currently do it on addition or removal of a group, where the
group has a consistent set of IOMMU properties.  Register a notifier
callback via the ioasidfd?  Thanks,

Alex
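
Purely illustrative (none of these names exist today), a notifier along
those lines might look like:

	/* hypothetical: runs whenever the ioasid fd's device set changes */
	static int kvm_ioasid_notify(struct notifier_block *nb,
				     unsigned long action, void *data)
	{
		/* re-check whether any attached domain is non-coherent and
		 * enable/disable wbinvd emulation accordingly */
		return NOTIFY_OK;
	}

	/* hypothetical registration helper */
	ioasidfd_register_notifier(ioasid_file, &kvm->ioasid_nb);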



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/3 4:37 AM, Alex Williamson wrote:

On Wed, 2 Jun 2021 16:54:04 -0300
Jason Gunthorpe  wrote:


On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:

Right, the device can generate the no-snoop transactions, but it's the
IOMMU that essentially determines whether those transactions are
actually still cache coherent, AIUI.

Wow, this is really confusing stuff in the code.

At the PCI level there is a TLP bit called no-snoop that is platform
specific. The general intention is to allow devices to selectively
bypass the CPU caching for DMAs. GPUs like to use this feature for
performance.

Yes


I assume there is some exciting security issues here. Looks like
allowing cache bypass does something bad inside VMs? Looks like
allowing the VM to use the cache clear instruction that is mandatory
with cache bypass DMA causes some QOS issues? OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.


So how does it work?

What I see in the intel/iommu.c is that some domains support "snoop
control" or not, based on some HW flag. This indicates if the
DMA_PTE_SNP bit is supported on a page by page basis or not.

Since x86 always leans toward "DMA cache coherent" I'm reading some
tea leaves here:

IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
   transactions */

And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

That's my understanding as well.


Further, I guess IOMMUs that don't support PTE_SNP, or have
DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
and all)

Yes.


So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
incoherent with the CPU caches, it just means that that snoop bit in
the TLP cannot be enforced. ie the device *could* do no-shoop DMA
if it wants. Devices that never do no-snoop remain DMA coherent on
x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.


IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
the PCI device is blocked from using no-snoop in its TLPs.

I wonder if ARM implemented this consistently? I see VDPA is
confused..



Basically, we don't want to bother with a pseudo KVM device like what VFIO 
did. So for simplicity, we rule out IOMMUs that can't enforce 
coherency in vhost-vDPA if the parent purely depends on the platform IOMMU:



	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		return -ENOTSUPP;

For parents that use their own translation logic, an implicit 
assumption is that the hardware will always perform cache-coherent DMA.


Thanks



Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Andi Kleen





I agree, but I want to know why the indirect descriptor needs to be 
disabled. The table can't be written by the device since it's not a 
coherent swiotlb mapping.


I had all kinds of problems with uninitialized entries in the indirect 
table. So I gave up on it and concluded it would be too difficult to secure.



-Andi




RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, June 3, 2021 1:20 AM
> 
[...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations.  e.g.
> >
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> >|
> >\- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices.  But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> > would be an alternative to attaching devices.
> 
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration data should be rather purpose built.

Yes. For now I prefer managing prereg through a separate cmd
instead of special-casing it in the IOASID graph. Anyway this is sort
of a per-fd thing.
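
A sketch of what that separate cmd could look like (the ioctl name comes
from the proposal quoted just below; the struct fields are assumptions):

	struct ioasid_register_memory reg = {	/* fields assumed */
		.vaddr = (__u64)guest_ram,
		.size  = guest_ram_size,
	};
	ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &reg);
	/* later IOASID_MAP_DMA vaddrs must fall inside this range */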

> 
> > > /*
> > >   * Map/unmap process virtual addresses to I/O virtual addresses.
> > >   *
> > >   * Provide VFIO type1 equivalent semantics. Start with the same
> > >   * restriction e.g. the unmap size should match those used in the
> > >   * original mapping call.
> > >   *
> > >   * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >   * must be already in the preregistered list.
> > >   *
> > >   * Input parameters:
> > >   *   - u32 ioasid;
> > >   *   - refer to vfio_iommu_type1_dma_{un}map
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> > > #define IOASID_MAP_DMA_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA  _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
> 
> Me too, or a SVA page table.
> 
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
> 
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too

sure, I incorporated this comment in last reply.

> 
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > >   device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/:BB:SS.F for the
> > filenames for actual PCI functions.  Maybe /dev/vfio/mdev/something
> > for mdevs.  That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
> 
> There are a bunch of nice options here if we go this path

Yes, this part is only roughly visited to focus on /dev/iommu first. In later
versions it will be considered more seriously.

> 
> > > 5.2. Multiple IOASIDs (no nesting)
> > > 
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.

'initially' here is still a user-requested action. For PAPR you should
do the attach only when it's necessary.

Thanks
Kevin


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


On 2021/6/3 10:56 AM, Andi Kleen wrote:




I agree, but I want to know why the indirect descriptor needs to be 
disabled. The table can't be written by the device since it's not a 
coherent swiotlb mapping.


I had all kinds of problems with uninitialized entries in the indirect 
table. So I gave up on it and concluded it would be too difficult to 
secure.



-Andi




OK, but what I meant is this: if we don't read from the descriptor ring, 
and we validate all the other metadata supplied by the device (used id and 
len), then there should be no way for the device to suppress the dma 
flags to write to the indirect descriptor table.


Or do you have an example of how it can do that?

Thanks


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Thursday, June 3, 2021 10:51 AM
> 
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
> 
> Yup.  Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop?  This seems largely a
> legacy thing.

Yes, there is indeed a performance benefit for a device to use no-snoop,
e.g. 8K display and some image processing paths, etc. The problem is
that the IOMMU for such devices is typically a different one from the
default IOMMU for most devices. This special IOMMU may not have
the ability to enforce snoop on no-snoop PCI traffic, so this fact
must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
for such devices to work correctly.

> 
> > > This support existed before mdev, IIRC we needed it for direct
> > > assignment of NVIDIA GPUs.
> >
> > Probably because they ignored the disable no-snoop bits in the control
> > block, or reset them in some insane way to "fix" broken bioses and
> > kept using it even though by all rights qemu would have tried hard to
> > turn it off via the config space. Processing no-snoop without a
> > working wbinvd would be fatal. Yeesh
> >
> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> >domains.
> >
> >This doesn't actually matter. If you mix them together then kvm
> >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> >anywhere in this VM.
> >
> >This if two IOMMU's are joined together into a single /dev/ioasid
> >then we can just make them both pretend to be
> >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> 
> Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available.  That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.
> 
> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> >What we want to do here is iterate over every ioasid associated
> >with the group fd that is passed into kvm.
> 
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.
> 
> >Today the group fd has a single container which specifies the
> >single ioasid so this is being done trivially.
> >
> >To reorg we want to get the ioasid from the device not the
> >group (see my note to David about the groups vs device rational)
> >
> >This is just iterating over each vfio_device in the group and
> >querying the ioasid it is using.
> 
> The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> anyway, we still need to account for all the RIDs and aliases of a
> group.
> 
> >Or perhaps more directly: an op attaching the vfio_device to the
> >kvm and having some simple helper
> >  '(un)register ioasid with kvm (kvm, ioasid)'
> >that the vfio_device driver can call that just sorts this out.
> 
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.  Register a notifier
> callback via the ioasidfd?  Thanks,
> 

When discussing I/O page fault support in another thread, the consensus
is that a device handle will be registered (by the user) or allocated
(returned to the user) in /dev/ioasid when binding the device to the ioasid
fd. From this angle we can register {ioasid_fd, device_handle} to KVM and 
then call something like ioasidfd_device_is_coherent()

Re: [PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX

2021-06-02 Thread Kuppuswamy, Sathyanarayanan




On 6/2/21 5:41 PM, Andi Kleen wrote:

+int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}
+EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
+


This function definition had to be removed from arch/x86/mm/mem_encrypt.c.

Otherwise, if you enable both CONFIG_AMD_MEM_ENCRYPT and
CONFIG_X86_MEM_ENCRYPT_COMMON, it will generate a multiple definition error.

--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -493,9 +493,3 @@ void __init amd_mem_encrypt_init(void)

print_mem_encrypt_feature_info();
 }
-
-int arch_has_restricted_virtio_memory_access(void)
-{
-   return sev_active();
-}
-EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);

--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -40,7 +40,7 @@ void __init mem_encrypt_init(void)

 int arch_has_restricted_virtio_memory_access(void)
 {
-   return is_tdx_guest();
+   return (is_tdx_guest() || sev_active());
 }
 EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Alex Williamson
On Thu, 3 Jun 2021 03:22:27 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Thursday, June 3, 2021 10:51 AM
> > 
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >  
> > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?  
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?  
> > 
> > Yup.  Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop?  This seems largely a
> > legacy thing.  
> 
> Yes, there is indeed performance benefit for device to use no-snoop,
> e.g. 8K display and some imaging processing path, etc. The problem is
> that the IOMMU for such devices is typically a different one from the
> default IOMMU for most devices. This special IOMMU may not have
> the ability of enforcing snoop on no-snoop PCI traffic then this fact
> must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
> for such devices to work correctly.

The case where the IOMMU does not support snoop-control for such a
device already works fine, we can't prevent no-snoop so KVM will
emulate wbinvd.  The harder one is if we should opt to allow no-snoop
even if the IOMMU does support snoop-control.
 
> > > > This support existed before mdev, IIRC we needed it for direct
> > > > assignment of NVIDIA GPUs.  
> > >
> > > Probably because they ignored the disable no-snoop bits in the control
> > > block, or reset them in some insane way to "fix" broken bioses and
> > > kept using it even though by all rights qemu would have tried hard to
> > > turn it off via the config space. Processing no-snoop without a
> > > working wbinvd would be fatal. Yeesh
> > >
> > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > had..
> > >
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > >domains.
> > >
> > >This doesn't actually matter. If you mix them together then kvm
> > >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > >anywhere in this VM.
> > >
> > >This if two IOMMU's are joined together into a single /dev/ioasid
> > >then we can just make them both pretend to be
> > >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.  
> > 
> > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > available based on the per domain support available.  That gives us the
> > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > because they used to have a device attached where the domain required
> > it and we can't atomically remap with new flags to perform the same as
> > a VM that never had that device attached in the first place.
> >   
> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > >
> > >What we want to do here is iterate over every ioasid associated
> > >with the group fd that is passed into kvm.  
> > 
> > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > then attaching a device to an allocated ioasid (non-fd)... I assume
> > you're talking about the latter ioasid.
> >   
> > >Today the group fd has a single container which specifies the
> > >single ioasid so this is being done trivially.
> > >
> > >To reorg we want to get the ioasid from the device not the
> > >group (see my note to David about the groups vs device rational)
> > >
> > >This is just iterating over each vfio_device in the group and
> > >querying the ioasid it is using.  
> > 
> > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > anyway, we still need to account for all the RIDs and aliases of a
> > group.
> >   
> > >Or perhaps more directly: an op attaching the vfio_device to the
> > >kvm and having some simple helper
> > >  '(un)register ioasid with kvm (kvm, ioasid)'
> > >that the vfio_device driver can call that just sorts this out.  
> > 
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Thursday, June 3, 2021 12:15 PM
> 
> On Thu, 3 Jun 2021 03:22:27 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > >
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe  wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >
> > > > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me
> > > > think this was about allowing the no-snoop performance enhancement
> > > > in some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?
> > >
> > > Yup.  Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop?  This seems largely a
> > > legacy thing.
> >
> > Yes, there is indeed performance benefit for device to use no-snoop,
> > e.g. 8K display and some imaging processing path, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability of enforcing snoop on no-snoop PCI traffic then this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> > for such devices to work correctly.
> 
> The case where the IOMMU does not support snoop-control for such a
> device already works fine, we can't prevent no-snoop so KVM will
> emulate wbinvd.  The harder one is if we should opt to allow no-snoop
> even if the IOMMU does support snoop-control.

In other discussions we are leaning toward a per-device capability
reporting scheme through /dev/ioasid (or /dev/iommu as the new
name). It seems natural to also allow setting a capability, e.g. no-
snoop, for a device if the underlying IOMMU driver allows it.
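
That could take a shape like the following (all names invented for
illustration):

	struct iommu_device_cap_info caps;	/* invented */
	struct iommu_device_set_cap set = {
		.dev_handle = dev_handle,
		.cap        = IOMMU_DEVICE_CAP_ALLOW_NOSNOOP,
	};

	ioctl(iommu_fd, IOMMU_DEVICE_GET_CAPS, &caps);
	if (caps.flags & IOMMU_DEVICE_CAP_SNOOP_CTL)	/* can enforce snoop */
		ioctl(iommu_fd, IOMMU_DEVICE_SET_CAP, &set);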

> 
> > > > > This support existed before mdev, IIRC we needed it for direct
> > > > > assignment of NVIDIA GPUs.
> > > >
> > > > Probably because they ignored the disable no-snoop bits in the control
> > > > block, or reset them in some insane way to "fix" broken bioses and
> > > > kept using it even though by all rights qemu would have tried hard to
> > > > turn it off via the config space. Processing no-snoop without a
> > > > working wbinvd would be fatal. Yeesh
> > > >
> > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > > had..
> > > >
> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > >domains.
> > > >
> > > >This doesn't actually matter. If you mix them together then kvm
> > > >will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > >anywhere in this VM.
> > > >
> > > >This if two IOMMU's are joined together into a single /dev/ioasid
> > > >then we can just make them both pretend to be
> > > >!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > >
> > > Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> > > we need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available.  That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.
> > >
> > > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > > >
> > > >What we want to do here is iterate over every ioasid associated
> > > >with the group fd that is passed into kvm.
> > >
> > > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > > then attaching a device to an allocated ioasid (non-fd)... I assume
> > > you're talking about the latter ioasid.
> > >
> > > >Today the group fd has a single container which specifies the
> > > >single ioasid so this is being done trivially.
> > > >
> > > >To reorg we want to get the ioasid from the device not the
> > > >group (see my note to David about the groups vs device rational)
> > > >
> > > >This is just iterating over each vfio_device in the group and
> > > >querying the ioasid it is using.
> > >
> > > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > > anyway, we still need to account for all the RIDs 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Lu Baolu

On 6/3/21 7:23 AM, Jason Gunthorpe wrote:

On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:

On 6/2/21 1:26 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:


This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond?


Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all HW
specific details on those formats.


OK, staying along the same line.


If so, it seems that we need some in-kernel helpers and uAPIs to
support pre-installing a page table to IOASID.


Not sure what this means..


Sorry that I didn't make this clear.

Let me bring back the page table types in my eyes.

  1) IOMMU format page table (a.k.a. iommu_domain)
  2) user application CPU page table (SVA for example)
  3) KVM EPT (future option)
  4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID.
We have BIND protocol for 4); We explicitly allocate an iommu_domain for
1). But we don't have a clear definition for 2) 3) and others. I think
it's necessary to clearly define a time point and kAPI name between
IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
opportunity to associate their page table with the allocated IOASID
before attaching the page table to the real IOMMU hardware.


In my mind these are all actions of creation..

#1 is ALLOC_IOASID 'to be compatible with these devices attached to
this FD'
#2 is ALLOC_IOASID_SVA
#3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
#4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
   ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have a set of operations that it allows:
map/unmap is only legal on #1, invalidate is only legal on #4, etc.


This sounds reasonable. The corresponding page table types and required
callbacks are also part of it.
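
One way to express that in-kernel, as a sketch only with invented names,
is a per-type ops table consulted at ioctl time:

	static const struct ioasid_ops kernel_pgtable_ops = {	/* #1 */
		.map	= ioasid_map_dma,
		.unmap	= ioasid_unmap_dma,
	};

	static const struct ioasid_ops user_pgtable_ops = {	/* #4 */
		.invalidate	= ioasid_invalidate,
		.iopf_handler	= ioasid_handle_iopf,	/* faultable types */
	};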



How you want to split this up in the ioctl interface is a more
interesting question. I generally like more calls than giant unwieldy
multiplexer structs, but some things are naturally flags and optional
modifications of a single ioctl.

In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
then a single 'DESTROY_IOASID' that works on all of them.


I/O page fault handling is similar. The provider of the page table
should take the responsibility to handle the possible page faults.


For the faultable types, yes #3 and #4 should hook in the fault
handler and deal with it.


Agreed.

Best regards,
baolu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> 
> > I don't think presence or absence of a group fd makes a lot of
> > difference to this design.  Having a group fd just means we attach
> > groups to the ioasid instead of individual devices, and we no longer
> > need the bookkeeping of "partial" devices.
> 
> Oh, I think we really don't want to attach the group to an ioasid, or
> at least not as a first-class idea.
> 
> The fundamental problem that got us here is we now live in a world
> where there are many ways to attach a device to an IOASID:

I'm not seeing that that's necessarily a problem.

>  - A RID binding
>  - A RID,PASID binding
>  - A RID,PASID binding for ENQCMD

I have to admit I haven't fully grasped the differences between these
modes.  I'm hoping we can consolidate at least some of them into the
same sort of binding onto different IOASIDs (which may be linked in
parent/child relationships).

>  - A SW TABLE binding
>  - etc
> 
> The selection of which mode to use is based on the specific
> driver/device operation. Ie the thing that implements the 'struct
> vfio_device' is the thing that has to select the binding mode.

I thought userspace selected the binding mode - although not all modes
will be possible for all devices.

> group attachment was fine when there was only one mode. As you say it
> is fine to just attach every group member with RID binding if RID
> binding is the only option.
> 
> When SW TABLE binding was added the group code was hacked up - now the
> group logic is choosing between RID/SW TABLE in a very hacky and mdev
> specific way, and this is just a mess.

Sounds like it.  What do you propose instead to handle backwards
compatibility for group-based VFIO code?

> The flow must carry the IOASID from the /dev/iommu to the vfio_device
> driver and the vfio_device implementation must choose which binding
> mode and parameters it wants based on driver and HW configuration.
> 
> eg if two PCI devices are in a group then it is perfectly fine that
> one device uses RID binding and the other device uses RID,PASID
> binding.

U... I don't see how that can be.  They could well be in the same
group because their RIDs cannot be distinguished from each other.

> The only place I see for a "group bind" in the uAPI is some compat
> layer for the vfio container, and the implementation would be quite
> different, we'd have to call each vfio_device driver in the group and
> execute the IOASID attach IOCTL.
> 
> > > I would say no on the container. /dev/ioasid == the container, having
> > > two competing objects at once in a single process is just a mess.
> > 
> > Right.  I'd assume that for compatibility, creating a container would
> > create a single IOASID under the hood with a compatiblity layer
> > translating the container operations to iosaid operations.
> 
> It is a nice dream for sure
> 
> /dev/vfio could be a special case of /dev/ioasid just with a different
> uapi and ending up with only one IOASID. They could be interchangable
> from then on, which would simplify the internals of VFIO if it
> consistently delt with these new ioasid objects everywhere. But last I
> looked it was complicated enough to best be done later on
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, Jun 03, 2021 at 01:29:58AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 12:09 AM
> > 
> > On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, June 2, 2021 1:42 AM
> > > >
> > > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > > >
> > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > > > >
> > > > > > > IOASID nesting can be implemented in two ways: hardware
> > > > > > > nesting and software nesting. With hardware support the
> > > > > > > child and parent I/O page tables are walked consecutively
> > > > > > > by the IOMMU to form a nested translation. When it's
> > > > > > > implemented in software, the ioasid driver is responsible
> > > > > > > for merging the two-level mappings into a single-level
> > > > > > > shadow I/O page table. Software nesting requires both
> > > > > > > child/parent page tables operated through the dma mapping
> > > > > > > protocol, so any change in either level can be captured
> > > > > > > by the kernel to update the corresponding shadow mapping.
> > > > > >
> > > > > > Why? A SW emulation could do this synchronization during
> > > > > > invalidation processing if invalidation contained an IOVA range.
> > > > >
> > > > > In this proposal we differentiate between host-managed and user-
> > > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > > then use invalidation cmd to flush iotlb when necessary (e.g. 
> > > > > typically
> > > > > not required when changing a PTE from non-present to present).
> > > > >
> > > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > > instead of mixing them together. Following this policy, map+unmap
> > > > > must be used in both levels for software nesting, so changes in either
> > > > > level are captured timely to synchronize the shadow mapping.
> > > >
> > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > > it is created. If you put two different types in a tree then each IOASID
> > > > must continue to use its own operation mode.
> > > >
> > > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> > >
> > > only for software nesting. With hardware support the parent uses map
> > > while the child uses bind.
> > >
> > > Yes, the policy is specified per IOASID. But if the policy violates the
> > > requirement in a specific nesting mode, then nesting should fail.
> > 
> > I don't get it.
> > 
> > If the IOASID is a page table then it is bind/invalidate. SW or not SW
> > doesn't matter at all.
> > 
> > > >
> > > > A software emulated two level page table where the leaf level is a
> > > > bound page table in guest memory should continue to use
> > > > bind/invalidate to maintain the guest page table IOASID even though it
> > > > is a SW construct.
> > >
> > > with software nesting the leaf should be a host-managed page table
> > > (or metadata). A bind/invalidate protocol doesn't require the user
> > > to notify the kernel of every page table change.
> > 
> > The purpose of invalidate is to inform the implementation that the
> > page table has changed so it can flush the caches. If the page table
> > is changed and invalidation is not issued then then the implementation
> > is free to ignore the changes.
> > 
> > In this way the SW mode is the same as a HW mode with an infinite
> > cache.
> > 
> > The collaposed shadow page table is really just a cache.
> > 
> 
> OK. One additional thing is that we may need a 'caching_mode'
> thing reported by /dev/ioasid, indicating whether invalidation is
> required when changing non-present to present. For hardware 
> nesting it's not reported as the hardware IOMMU will walk the
> guest page table in cases of iotlb miss. For software nesting 
> caching_mode is reported so the user must issue invalidation 
> upon any change in guest page table so the kernel can update
> the shadow page table timely.

For the first cut, I'd have the API assume that invalidates are
*always* required.  Some bypass to avoid them in cases where they're
not needed can be an additional extension.
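
Under that first cut, every guest page-table update would be followed by
an explicit flush, something like (struct fields assumed, in the style of
the proposal):

	struct ioasid_invalidate inv = {	/* fields assumed */
		.ioasid = gva_ioasid,
		.addr   = iova,
		.size   = PAGE_SIZE,
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE, &inv);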

> Following this and your other comment with David, we will mark
> host-managed vs. guest-managed explicitly for I/O page table
> of each IOASID. map+unmap or bind+invalid is decided by
> which owner is specified by the user.
> 
> Thanks
> Kevin
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, Jun 03, 2021 at 02:49:56AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, June 3, 2021 12:59 AM
> > 
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > >   /* Bind guest I/O page table  */
> > > > >   bind_data = {
> > > > >   .ioasid = gva_ioasid;
> > > > >   .addr   = gva_pgtable1;
> > > > >   // and format information
> > > > >   };
> > > > >   ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > >
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > >
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually.
> > 
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
> 
> I would expect the capability reported per-device via /dev/iommu. 
> Incompatible devices can bind to the same fd but cannot attach to
> the same IOASID. This allows incompatible devices to share locked
> page accounting.

Yeah... I'm not convinced that everything relevant here can be
reported per-device.  I think we may have edge cases where
combinations of devices have restrictions that individual devices in
the set do not.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > /* Bind guest I/O page table  */
> > > > bind_data = {
> > > > .ioasid = gva_ioasid;
> > > > .addr   = gva_pgtable1;
> > > > // and format information
> > > > };
> > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > 
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> > 
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually. 
> 
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

Right.. but I've not been convinced that having a /dev/iommu fd
instance be the boundary for these types of things actually makes
sense.  For example if we were doing the preregistration thing
(whether by child ASes or otherwise) then that still makes sense
across wildly different devices, but we couldn't share that layer if
we have to open different instances for each of them.

It really seems to me that it's at the granularity of the address
space (including extended RID+PASID ASes) that we need to know what
devices we have, and therefore what capbilities we have for that AS.

> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.
> 
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

Right, but at this stage I'm just not seeing a really clear (across
platforms and device types) boundary for what things have to be per
IOASID container and what have to be per IOASID, so I'm just not sure
the /dev/iommu instance grouping makes any sense.

> I don't know if that small advantage is worth the extra complexity
> though.
> 
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats.  Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept.  And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
> 
> The basic process for a user space owned pgtable mode would be:
> 
>  1) qemu has to figure out what format of pgtable to use
> 
> Presumably it uses query functions using the device label.

No... in the qemu case it would always select the page table format
that it needs to present to the guest.  That's part of the
guest-visible platform that's selected by qemu's configuration.

There's no negotiation here: either the kernel can supply what qemu
needs to pass to the guest, or it can't.  If it can't qemu, will have
to either emulate in SW (if possible, probably using a kernel-managed
IOASID to back it) or fail outright.

> The
> kernel code should look at the entire device path through all the
> IOMMU HW to determine what is possible.
> 
> Or it already knows because the VM's vIOMMU is running in some
> fixed page table format, or the VM's vIOMMU already told it, or
> something.

Again, I think you have the order a bit backwards.  The user selects
the capabilities that the vIOMMU will present to the guest as part of
the qemu configuration.  Qemu then requests that of the host kernel,
and either the host kernel supplies it, qemu emulates it in SW, or
qemu fails to start.

Guest visible properties of the platform never (or *should* never)
depend implicitly on host capabilities - it's impossible to sanely
support migration in such an environment.

>  2) qemu creates an IOASID and based on #1 and says 'I want this format'

Right.

>  3) qemu binds the IOASID to the device. 
> 
> If qmeu gets it wrong then it just fails.

Right, though it may fall back to (partial) software emulation.  In
practice that would mean using a kernel-managed IOASID and walking the
guest IO pagetables itself to mirror them into the host kernel.
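
That is, mirroring each guest PTE change with the kernel-managed calls
from the proposal, roughly (the struct shape is assumed to follow
vfio_iommu_type1_dma_map; the helper is hypothetical):

	/* on a guest vIOMMU map event */
	struct ioasid_dma_map map = {
		.ioasid = shadow_ioasid,
		.iova   = guest_iova,
		.vaddr  = qemu_va_for(guest_pa),	/* hypothetical */
		.size   = len,
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map);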

>  4) For the next device qemu would have to figure out if it can re-use
> an existing IOASID based on the required properties.

Nope.  Again, what devices share an IO address space is a guest
visible part of the platform.  If the host kernel can't supply that,
then qemu must not start.
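
For concreteness, steps 1-3 as I see them would look roughly like
this (every name below is invented, just to pin down the ordering):

    /* 1) the format is dictated by the guest-visible platform qemu
     *    was configured with; there is nothing to negotiate */
    struct ioasid_alloc_args args = {
        .pgtable_format = IOASID_FMT_GUEST_VIOMMU, /* fixed by config */
    };

    /* 2)+3) create the IOASID in that format and attach the device;
     *    the kernel just fails if the HW can't provide the format */
    int ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &args);
    if (ioasid < 0 || ioctl(device_fd, VFIO_ATTACH_IOASID, &ioasid) < 0) {
        /* fall back: shadow guest pagetables into a kernel-managed
         * IOASID, or refuse to start the guest */
    }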

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:16:48PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > > I agree with Jean-Philippe - at the very least erasing this
> > > information needs a major rational - but I don't really see why it
> > > must be erased? The HW reports the originating device, is it just a
> > > matter of labeling the devices attached to the /dev/ioasid FD so it
> > > can be reported to userspace?
> > 
> > HW reports the originating device as far as it knows.  In many cases
> > where you have multiple devices in an IOMMU group, it's because
> > although they're treated as separate devices at the kernel level, they
> > have the same RID at the HW level.  Which means a RID for something in
> > the right group is the closest you can count on supplying.
> 
> Granted there may be cases where exact fidelity is not possible, but
> that doesn't excuse eliminating fidelity where it does exist.
> 
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID. 
> > 
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
> 
> Maybe.. It is certainly a special case.
> 
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

True.  This does complicate how we represent what IOVA ranges are
valid, though.  I'll bet you most implementations don't actually
implement a full 64-bit IOVA, which means we effectively have a large
number of windows from (0..max IOVA) for each valid PASID.  This adds
another reason I don't think my concept of IOVA windows is just a
POWER-specific thing.
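
To illustrate (the window struct is invented for the example):

    struct iova_window {    /* invented for this example */
        __u32 pasid;
        __u64 start;        /* 0 */
        __u64 end;          /* e.g. (1ULL << 48) - 1, not 2^64 - 1 */
    };

    /* with the extended "PASID, IOVA" format on HW that implements a
     * 48-bit IOVA, the valid space is one such window per live PASID,
     * not a flat 80-bit range - so userspace needs a way to enumerate
     * windows rather than assuming everything below 2^64 is mappable */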

> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page
> table format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

Well, rather, I think userspace needs to request the page table format
it wants, and the kernel tells it whether it can oblige or not.

> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
> 
> Jason
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:
> 
> > Is there a compelling reason to have all the IOASIDs handled by one
> > FD?
> 
> There was an answer on this: if every PASID needs an IOASID then there
> are too many FDs.

Too many in what regard?  fd limits?  Something else?

It seems to me there are two different cases for PASID handling here.
One is where userspace explicitly creates each valid PASID and
attaches a separate pagetable for each (or handles each with
MAP/UNMAP).  In that case I wouldn't have expected there to be too
many fds.

Then there's the case where we register a whole PASID table, in which
case I think you only need the one FD.  We can treat that as creating
an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of
pagetables for each PASID).
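
In sketch form (the bind args and format name are invented;
IOASID_BIND_PGTABLE stands in for whatever the bind operation ends up
being called):

    /* one FD, one IOAS spanning the whole PASID space: the
     * "pagetable" we bind is the guest's PASID table, making the
     * input address effectively the ~84-bit (PASID, IOVA) pair */
    int ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

    struct ioasid_bind_args bind = {    /* invented layout */
        .ioasid   = ioasid,
        .format   = IOASID_FMT_PASID_TABLE,   /* hypothetical */
        .user_ptr = (__u64)guest_pasid_table,
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind);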

> It is difficult to share the get_user_pages cache across FDs.

Ah... hrm, yes I can see that.

> There are global properties in the /dev/iommu FD, like what devices
> are part of it, that are important for group security operations. This
> becomes confused if it is split across many FDs.

I'm still not seeing those.  I'm really not seeing any well-defined
meaning to devices being attached to the fd, but not to a particular
IOAS.

> > > I/O address space can be managed through two protocols, according to 
> > > whether the corresponding I/O page table is constructed by the kernel or 
> > > the user. When kernel-managed, a dma mapping protocol (similar to 
> > > existing VFIO iommu type1) is provided for the user to explicitly specify 
> > > how the I/O address space is mapped. Otherwise, a different protocol is 
> > > provided for the user to bind a user-managed I/O page table to the 
> > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> > > handling. 
> > > 
> > > Pgtable binding protocol can be used only on the child IOASID's, implying 
> > > IOASID nesting must be enabled. This is because the kernel doesn't trust 
> > > userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> > > through the parent IOASID.
> > 
> > To clarify, I'm guessing that's a restriction of likely practice,
> > rather than a fundamental API restriction.  I can see a couple of
> > theoretical future cases where a user-managed pagetable for a "base"
> > IOASID would be feasible:
> > 
> >   1) On some fancy future MMU allowing free nesting, where the kernel
> >  would insert an implicit extra layer translating user addresses
> >  to physical addresses, and the userspace manages a pagetable with
> >  its own VAs being the target AS
> 
> I would model this by having an "SVA" parent IOASID. An "SVA" IOASID is
> one where the IOVA == process VA and the kernel maintains this mapping.

That makes sense.  It needs a different name to avoid Intel and PCI
specificity, but having a trivial "pagetable format" which just says
IOVA == user address is a nice idea.
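
Something like this, say (the format names and the .parent linkage
are invented for illustration):

    /* parent: trivial "format", IOVA == process VA, kept coherent
     * with the mm by the kernel */
    struct ioasid_alloc_args sva = {
        .pgtable_format = IOASID_FMT_PROCESS_VA,  /* invented name */
    };
    int parent = ioctl(ioasid_fd, IOASID_ALLOC, &sva);

    /* child: user-managed pagetable whose output addresses are then
     * translated as process VAs by the parent */
    struct ioasid_alloc_args child = {
        .pgtable_format = IOASID_FMT_HW_NATIVE,  /* whatever HW wants */
        .parent         = parent,
    };
    int ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &child);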

> Since the uAPI is so general I do have a general expectation that the
> drivers/iommu implementations might need to be a bit more complicated,
> like if the HW can optimize certain specific graphs of IOASIDs we
> would still model them as graphs and the HW driver would have to
> "compile" the graph into the optimal hardware.
> 
> This approach has worked reasonably well in other kernel areas.

That seems sensible.

> >   2) For a purely software virtual device, where its virtual DMA
> >  engine can interpret user addresses fine
> 
> This also sounds like an SVA IOASID.

Ok.

> Depending on HW, if a device can really only bind to a very narrow kind
> of IOASID then it should ask for that (probably platform specific!)
> type during its attachment request to drivers/iommu.
> 
> eg "I am special hardware and only know how to do PLATFORM_BLAH
> transactions, give me an IOASID compatible with that". If the only way
> to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
> hardwired to the CPU ASID, then that is just how it is.

Fair enough.
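
Concretely, on the drivers/iommu side I'd picture something like this
(kernel-internal sketch, all names invented):

    enum ioasid_hw_type {        /* invented */
        IOASID_HW_GENERIC,
        IOASID_HW_PLATFORM_BLAH,
    };

    static int blah_attach_dev(struct device *dev, struct ioasid *ias)
    {
        /* this device only does BLAH transactions, so refuse any
         * other kind of IOASID up front */
        if (ias->hw_type != IOASID_HW_PLATFORM_BLAH)
            return -EINVAL;
        /* ... program the HW ... */
        return 0;
    }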

> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations.  e.g.
> > 
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> >|
> >\- 'pasid' IOAS (maybe)
> > 
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> > 
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices.  But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS.  Using that
> > would be an alternative to attaching devices.
> 
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration
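
To make the quoted 'prereg' scheme concrete (IOASID_MAP_DMA is from
the proposal; the other names and the .parent linkage are
assumptions):

    /* 'prereg' IOAS: all guest RAM mapped once, GPA -> qemu VA */
    int prereg = ioctl(ioasid_fd, IOASID_ALLOC);
    struct ioasid_dma_map map = {        /* invented layout */
        .ioasid = prereg,
        .iova   = 0,                     /* GPA 0 */
        .vaddr  = (__u64)guest_ram_base,
        .size   = guest_ram_size,
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &map);

    /* 'rid' IOAS nested under it: the vIOMMU mirrors guest mappings
     * here in terms of GPA; only this one has devices attached */
    struct ioasid_alloc_args nest = { .parent = prereg };
    int rid = ioctl(ioasid_fd, IOASID_ALLOC, &nest);
    ioctl(device_fd, VFIO_ATTACH_IOASID, &rid);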
