On Wed, 27 Mar 2019 14:25:00 +0800
Peter Xu <pet...@redhat.com> wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish devices in a conventional PCI topology by bus and devfn,
> > and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1  
> 
> [1]
> 
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > to assigning only one of the endpoints to the guest, because a second
> > endpoint will attempt to use a different AddressSpace.  VFIO only
> > supports IOMMU-group-level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson <alex.william...@redhat.com>  
> 
> The patch looks sane to me even as a bug fix, since otherwise the DMA
> address spaces used under various kinds of PCI bridges can be wrong, so:

I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
I'd be cautious about that if so.  Eric Auger noted that he's seen an
SMMU VM hit a guest kernel BUG_ON, which needs further investigation.
It's not clear whether it's just an untested or unimplemented scenario
for SMMU to see a conventional PCI bus, or whether there's something
wrong in QEMU.  I also haven't tested AMD IOMMU, and have tested VT-d
only to a very limited degree, thus the RFC.
 
> Reviewed-by: Peter Xu <pet...@redhat.com>
> 
> Though I have a question that confused me even before this: Alex, do
> you know why the context entries of all the devices in the IOMMU root
> table are programmed even if the devices are under a PCIe-to-PCI
> bridge?  To be clear, using [1] above as an example: in that case,
> IIUC, we'll program context entries for all three devices (00:01.0,
> 01:00.0, 01:00.1), but they'll point to the same IOMMU table.  DMAs
> from devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0
> on bare metal, so why do we bother to program the context entry of
> 01:00.1?  It seems it is never used.
> 
> (It should be used by current QEMU to work with PCIe-to-PCI bridges
>  without this patch, but I feel like I don't know the real answer
>  behind it.)

We actually have two different scenarios that could be represented by
[1]: the group can be formed by lack of isolation or by lack of
visibility.  In the group above, it's the former, a lack of isolation.
The PCIe root port does not support ACS, so while the IOMMU has
visibility of the individual devices, peer-to-peer between the devices
may also be possible.  Native, trusted, in-kernel drivers for these
devices could still make use of separate IOMMU domains per device, but
in order to expose the devices to a userspace driver we need to
consider them a non-isolated set to prevent side-channel attacks
between devices.  We therefore consider them a group within the IOMMU
API, and it's required that each context entry map to the same domain,
as the IOMMU will see transactions from each requester ID.
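
To illustrate the isolation case, here's a minimal sketch (hypothetical
types and an acs_enabled() helper of my own, not the kernel's actual
grouping code) of the walk that decides whether a device can be
isolated: every port on the upstream path must provide ACS, otherwise
the device shares a group with its peers.

#include <stdbool.h>

/* Illustrative topology node; not the kernel's struct pci_dev. */
struct pci_dev {
    struct pci_dev *upstream;   /* port/bridge above, NULL at the root */
    bool acs_isolated;          /* port implements the needed ACS controls */
};

/* Hypothetical helper: does this port isolate peer-to-peer traffic? */
static bool acs_enabled(const struct pci_dev *port)
{
    return port->acs_isolated;
}

/*
 * If any port on the upstream path lacks ACS, peer-to-peer between the
 * devices below it is possible, so those devices must share an IOMMU
 * group even though the IOMMU can see their individual requester IDs.
 */
static bool path_is_isolated(const struct pci_dev *dev)
{
    for (const struct pci_dev *p = dev->upstream; p; p = p->upstream) {
        if (!acs_enabled(p)) {
            return false;   /* lack of isolation: shared group */
        }
    }
    return true;            /* device could get its own group/domain */
}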

If we had the visibility case, such as if [1] represented a PCIe-to-PCI
bridge subgroup, then the IOMMU really does see only the bridge
requester ID, and there may not be a reason to populate the context
entries for the downstream aliased devices.  Perhaps the IOMMU driver
might still choose to do so, particularly if the bridge is actually a
PCI-X bridge, as PCI-X does incorporate a requester ID but also has
strange rules about the bridge being able to claim ownership of the
transaction.  So it might be paranoia or simplification that causes all
the context entries to be programmed, or, in the case of alias quirks,
uncertainty whether a device exclusively uses a quirked requester ID or
might sometimes use the proper requester ID.
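
To make the "program them all" behavior concrete, here's a minimal
sketch (illustrative types and a stubbed context_entry_set(), not the
actual VT-d driver) of attaching a device to a domain by pointing the
context entries for its own requester ID, plus every bridge alias the
IOMMU might see, at that same domain:

#include <stdint.h>
#include <stdio.h>

typedef uint16_t rid_t;            /* bus:devfn requester ID */
struct domain;                     /* opaque IOMMU domain */

/* Stub: a real driver would write the root/context table entry here. */
static void context_entry_set(rid_t rid, struct domain *dom)
{
    printf("map RID %04x -> domain %p\n", (unsigned)rid, (void *)dom);
}

struct pci_dev {
    rid_t rid;
    struct pci_dev *bridge;        /* aliasing bridge above, or NULL */
};

/*
 * Because a PCI(-X) bridge may either forward the device's own RID or
 * claim ownership of the transaction, the safe and simple policy is to
 * program an identical context entry for every RID the IOMMU might see.
 */
static void attach_device(struct pci_dev *dev, struct domain *dom)
{
    context_entry_set(dev->rid, dom);
    for (struct pci_dev *br = dev->bridge; br; br = br->bridge) {
        context_entry_set(br->rid, dom);
    }
}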

In the example I present, we're taking [1], which could be either case
above, and converting it into the visibility case in order to force the
IOMMU to handle the devices within a single address space.
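
For reference, the aliasing rule itself reduces to a short upward walk.
Below is a minimal sketch using illustrative types (not QEMU's actual
PCIBus/PCIDevice structures, and not the literal patch): a device below
a conventional bus takes the alias of its bridge, either devfn 0 on the
secondary bus for a PCIe-to-PCI bridge or the bridge's own bus/devfn
for a conventional bridge.

#include <stdint.h>

/* Illustrative topology types; not QEMU's PCIBus/PCIDevice. */
struct pci_bus {
    int is_express;             /* express vs. conventional PCI */
    uint8_t number;
    struct pci_dev *bridge;     /* upstream bridge, NULL at the root */
};

struct pci_dev {
    struct pci_bus *bus;
    uint8_t devfn;
    int is_pcie_to_pci_bridge;  /* express upstream, conventional secondary */
};

struct rid { uint8_t bus, devfn; };

/* Requester ID the IOMMU will actually see for DMA from @dev. */
static struct rid dma_alias(struct pci_dev *dev)
{
    struct pci_bus *bus = dev->bus;
    struct rid rid = { bus->number, dev->devfn };

    /* Walk upward while the device sits below a conventional bus. */
    while (!bus->is_express && bus->bridge) {
        struct pci_dev *br = bus->bridge;

        if (br->is_pcie_to_pci_bridge) {
            /* PCIe-to-PCI bridge: alias is secondary bus, devfn 0. */
            rid.bus = bus->number;
            rid.devfn = 0;
        } else {
            /* Conventional bridge: the bridge's own RID tags the DMA. */
            rid.bus = br->bus->number;
            rid.devfn = br->devfn;
        }
        bus = br->bus;
    }
    return rid;
}

Thanks,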

Alex
