On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:
> Change since V1:
>       1) Update motivation for Xen vIOMMU - 288 vcpus support part
>       2) Change definition of struct xen_sysctl_viommu_op
>       3) Update "3.5 Implementation consideration" to explain why we need to
> enable l2 translation first.
>       4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the
> emulated I440 chipset.
>       5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
>       1.1 Enable more than 255 vcpus
>       1.2 Support VFIO-based user space driver
>       1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>       2.1 l2 translation overview
>       2.2 Interrupt remapping overview
> 3. Xen hypervisor
>       3.1 New vIOMMU hypercall interface
>       3.2 l2 translation
>       3.3 Interrupt remapping
>       3.4 l1 translation
>       3.5 Implementation consideration
> 4. Qemu
>       4.1 Qemu vIOMMU framework
>       4.2 Dummy xen-vIOMMU driver
>       4.3 Q35 vs. i440x
>       4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpus
> HPC cloud services require VMs to provide high-performance parallel
> computing, and we hope to create a huge VM with more than 255 vcpus on one
> machine to meet such requirements, pinning each vcpu to a separate pcpu.
> Support for more than 255 vcpus requires x2APIC, and Linux disables x2APIC
> mode if there is no interrupt remapping function, which is provided by the
> vIOMMU. The interrupt remapping function helps to deliver interrupts to
> vcpus with IDs above 255. So we need to add vIOMMU before enabling more
> than 255 vcpus.

What about Windows? Does it care about this?

> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) of the
> vIOMMU. The pIOMMU l2 becomes a shadowing structure of the
> vIOMMU to isolate DMA requests initiated by the user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) of the
> vIOMMU. The pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
> is the main usage today (to support the OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> ================================================================================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons:
>       1) Avoid round trips between Qemu and Xen hypervisor
>       2) Ease of integration with the rest of the hypervisor
>       3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2

destroy
> translation.
> 
> 2.1 l2 translation overview
> For virtual PCI devices, the dummy xen-vIOMMU does the translation in
> Qemu via a new hypercall.
> 
> For physical PCI devices, the vIOMMU in the hypervisor shadows the IO page
> table from IOVA->GPA to IOVA->HPA and loads that page table into the
> physical IOMMU.
> 
> The following diagram shows the l2 translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview
> Interrupts from virtual devices and physical devices will be delivered
> to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU will remap interrupts
> during this procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                                +-------------------+
>                                |   PCI Device      |
>                                +-------------------+
> 
> 
> 
> 
> 3 Xen hypervisor
> ==========================================================================
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> This hypercall should also support the PV IOMMU, which is still under RFC
> review. Only the non-PV part is covered here.
> 
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> 
> struct xen_sysctl_viommu_op {
>       u32 cmd;
>       u32 domid;
>       union {
>               struct {
>                       u32 capabilities;
>               } query_capabilities;
>               struct {
>                       u32 capabilities;
>                       u64 base_address;
>               } create_iommu;
>               struct {
>                       /* IN parameters. */
>                       u16 segment;
>                       u8  bus;
>                       u8  devfn;
>                       u64 iova;
>                       /* OUT parameters. */
>                       u64 translated_addr;
>                       u64 addr_mask; /* Translation page size */
>                       IOMMUAccessFlags permission;
>               } l2_translation;
>       } u;
> };
> 
> typedef enum {
>       IOMMU_NONE = 0,
>       IOMMU_RO   = 1,
>       IOMMU_WO   = 2,
>       IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability            0
> #define XEN_SYSCTL_viommu_create                      1
> #define XEN_SYSCTL_viommu_destroy                     2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev   3
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_l1_translation  (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_l2_translation  (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping     (1 << 2)
> 
> 
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>        Get vIOMMU capabilities (l1/l2 translation and interrupt
> remapping).
> 
> - XEN_SYSCTL_viommu_create
>       Create the vIOMMU in the Xen hypervisor with dom_id, capabilities and
> register base address as parameters.
> 
> - XEN_SYSCTL_viommu_destroy
>       Destroy the vIOMMU in the Xen hypervisor with dom_id as the parameter.
> 
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>       Translate IOVA to GPA for a specified virtual PCI device, with dom_id,
> the PCI device's BDF and the IOVA as inputs; the Xen hypervisor returns the
> translated GPA, address mask and access permission (see the sketch below).
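> 
> A minimal usage sketch follows; xc_viommu_op() is a hypothetical libxc
> wrapper around the new sysctl, not an existing interface, and error
> handling is abbreviated:
> 
> static int viommu_translate_vpdev(xc_interface *xch, uint32_t domid,
>                                   uint16_t seg, uint8_t bus, uint8_t devfn,
>                                   uint64_t iova, uint64_t *gpa)
> {
>     struct xen_sysctl_viommu_op op = {
>         .cmd   = XEN_SYSCTL_viommu_dma_translation_for_vpdev,
>         .domid = domid,
>     };
>     int rc;
> 
>     /* Fill the IN parameters of the l2_translation subop. */
>     op.u.l2_translation.segment = seg;
>     op.u.l2_translation.bus     = bus;
>     op.u.l2_translation.devfn   = devfn;
>     op.u.l2_translation.iova    = iova;
> 
>     rc = xc_viommu_op(xch, &op);    /* issue XEN_SYSCTL_viommu_op */
>     if ( rc )
>         return rc;
> 
>     /* The hypervisor fills in translated GPA, address mask and permission. */
>     *gpa = op.u.l2_translation.translated_addr;
>     return 0;
> }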
> 
> 
> 3.2 l2 translation
> 1) For virtual PCI devices
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
> 
> 2) For physical PCI devices
> DMA operations go through the physical IOMMU directly, and an IO page table
> for IOVA->HPA should be loaded into the physical IOMMU. When the guest
> updates the l2 Page-table pointer field, it provides an IO page table for
> IOVA->GPA. The vIOMMU needs to shadow the l2 translation table, translate
> GPA->HPA and write the shadow page table (IOVA->HPA) pointer into the l2
> Page-table pointer field of the physical IOMMU's context entry.
> 
> Now all PCI devices in the same HVM domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support l2
> translation in the vIOMMU, the IOMMU driver needs to support multiple address
> spaces per device entry. Using existing IO page table(GPA->HPA)
> defaultly and switch to shadow IO page table(IOVA->HPA) when l2

defaultly?

> translation function is enabled. These changes will not affect the current
> P2M logic.
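> 
> A rough sketch of the shadowing flow is below; all helpers and the
> for_each_guest_l2_entry iterator are illustrative placeholders rather
> than existing Xen functions:
> 
> static int viommu_shadow_l2(struct domain *d, struct viommu_dev *vdev,
>                             uint64_t guest_l2_root_gpa)
> {
>     uint64_t iova, gpa, hpa;
>     unsigned int flags;
> 
>     /* Walk every present mapping in the guest l2 (IOVA->GPA) table. */
>     for_each_guest_l2_entry ( d, guest_l2_root_gpa, iova, gpa, flags )
>     {
>         /* Translate GPA->HPA via the existing P2M. */
>         if ( viommu_p2m_lookup(d, gpa, &hpa) )
>             return -EFAULT;    /* guest programmed an invalid GPA */
> 
>         /* Install IOVA->HPA into the per-device shadow table. */
>         if ( viommu_shadow_map(vdev, iova, hpa, flags) )
>             return -ENOMEM;
>     }
> 
>     /* Point the device's context entry at the shadow table root. */
>     return viommu_set_context_root(vdev, viommu_shadow_root_maddr(vdev));
> }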

What happens if the guest's IO page tables have incorrect values?

For example the guest sets up the pagetables to cover some section
of HPA ranges (which are all good and permitted). But then during execution
the guest kernel decides to muck around with the pagetables and adds an HPA
range that is outside what the guest has been allocated.

What then?
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to the vlapic from the vIOAPIC and vMSI. We need to add interrupt remapping
> hooks in vmsi_deliver() and ioapic_deliver() to find the target vlapic
> according to the interrupt remapping table, as sketched below.
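> 
> A minimal sketch of such a hook ahead of vmsi_deliver(); struct viommu_irte
> and the viommu_* helpers are placeholders for the real implementation:
> 
> static void viommu_msi_deliver(struct domain *d, uint64_t addr, uint32_t data,
>                                int vector, uint8_t dest, uint8_t dest_mode,
>                                uint8_t delivery_mode, uint8_t trig_mode)
> {
>     struct viommu_irte irte;
> 
>     /* Remappable-format request: consult the interrupt remapping table. */
>     if ( viommu_msi_is_remappable(addr) &&
>          !viommu_get_irte(d, viommu_msi_index(addr, data), &irte) )
>     {
>         vector        = irte.vector;
>         dest          = irte.dest_id;
>         dest_mode     = irte.dest_mode;
>         delivery_mode = irte.delivery_mode;
>         trig_mode     = irte.trigger_mode;
>     }
> 
>     /* Existing delivery path, now with the remapped attributes. */
>     vmsi_deliver(d, vector, dest, dest_mode, delivery_mode, trig_mode);
> }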
> 
> 
> 3.4 l1 translation
> When nested translation is enabled, any address generated by l1
> translation is used as the input address for nesting with l2
> translation. The physical IOMMU needs to enable both l1 and l2
> translation in nested translation mode (GVA->GPA->HPA) for the
> passthrough device.
> 
> The VT-d context entry points to the guest l1 translation table, which
> will be nest-translated by the l2 translation table, so it can be
> directly linked to the context entry of the physical IOMMU.

I think this means that the shared_ept will be disabled?
>
What about different versions of contexts? Say the V1 is exposed
to guest but the hardware supports V2? Are there any flags that have
swapped positions? Or is it pretty backwards compatible?
 
> To enable l1 translation in a VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) Update the GPA root of the guest l1 translation table in the context
> entry of the physical IOMMU.
> 
> All handles are in hypervisor and no interaction with Qemu.

All is handled in hypervisor.
> 
> 
> 3.5 Implementation consideration
> The VT-d spec doesn't define a capability bit for the l2 translation.
> Architecturally there is no way to tell the guest that the l2 translation
> capability is not available. The Linux Intel IOMMU driver assumes l2
> translation is always available when VT-d exists and fails to load
> without l2 translation support, even if interrupt remapping and l1
> translation are available. So we need to enable l2 translation first

I am lost on that sentence. Are you saying that it tries to load
the IOVA and if they fail.. then it keeps on going? What is the result
of this? That you can't do IOVA (so can't use vfio ?)

> before other functions.
> 
> 
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d or
> AMD IOMMU) and report it in the guest ACPI table. So on the Xen side, a
> dummy xen-vIOMMU wrapper is required to connect with the actual vIOMMU in
> Xen, especially for l2 translation of virtual PCI devices, because the
> emulation of virtual PCI devices is in Qemu. Qemu's vIOMMU
> framework provides a callback to deal with l2 translation when
> DMA operations of virtual PCI devices happen.

You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
could expose a vIOMMU to a guest and actually use the AMD IOMMU
in the hypervisor?
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping and
> Shared Virtual Memory) via hypercall.
> 
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall with the
> DRHD register base address and desired capabilities as parameters. Destroy
> the vIOMMU when the VM is destroyed.
> 
> 3) Virtual PCI device's l2 translation
> Qemu already provides a DMA translation hook. It's called when DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU passes the
> device BDF and IOVA into the Xen hypervisor via the new iommu hypercall and
> returns the translated GPA, as sketched below.
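> 
> A rough sketch of that hook in the dummy driver; XenVIOMMUState and
> xen_viommu_dma_translate() (the wrapper around the new hypercall) are
> placeholders, while the callback signature follows Qemu's
> MemoryRegionIOMMUOps translate hook:
> 
> static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *iommu, hwaddr addr,
>                                           bool is_write)
> {
>     XenVIOMMUState *s = container_of(iommu, XenVIOMMUState, iommu_mr);
>     IOMMUTLBEntry ret = {
>         .target_as = &address_space_memory,
>         .iova = addr,
>         .translated_addr = 0,
>         .addr_mask = ~(hwaddr)0,
>         .perm = IOMMU_NONE,    /* stays NONE if the translation fails */
>     };
> 
>     /* Ask Xen to translate IOVA->GPA for this virtual device. */
>     xen_viommu_dma_translate(s->domid, s->sbdf, addr, &ret.translated_addr,
>                              &ret.addr_mask, &ret.perm);
>     return ret;
> }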
> 
> 
> 4.3 Q35 vs. i440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> the VT-d driver assumes VT-d only exists on Q35 and newer chipsets, so
> we would have to enable Q35 support first. After experiments, Linux/Windows
> guests can boot up on the emulated i440x chipset with VT-d, and the VT-d
> driver enables the interrupt remapping function. So we can skip Q35 support
> and implement vIOMMU directly.
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the
> OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
> whether the vIOMMU is enabled or not, and its capabilities, to prepare the
> ACPI DMAR table for the guest OS.
> 
> There are three ways to do that:
> 1) Extend struct hvm_info_table and add variables in struct
> hvm_info_table to pass vIOMMU information to hvmloader. But this
> requires adding a new xc interface to use struct hvm_info_table in Qemu.
> 
> 2) Pass vIOMMU information to hvmloader via Xenstore
> 
> 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design (4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> The third option seems cleaner: hvmloader doesn't need to deal with
> vIOMMU internals and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling will be done in the dummy xen-vIOMMU driver.
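> 
> For reference, the DMAR table built in Qemu for option 3 is small. A
> minimal layout sketch is below (field layout per the VT-d spec; the
> acpi_header type and the field values are illustrative):
> 
> struct acpi_dmar_drhd {
>     uint16_t type;              /* 0 = DRHD */
>     uint16_t length;            /* 16 + optional device scope entries */
>     uint8_t  flags;             /* bit 0: INCLUDE_PCI_ALL */
>     uint8_t  reserved;
>     uint16_t segment;           /* PCI segment number */
>     uint64_t base_address;      /* vIOMMU register base from the create op */
> } __attribute__((packed));
> 
> struct acpi_dmar {
>     struct acpi_header header;     /* standard header: "DMAR", length, ... */
>     uint8_t  host_address_width;   /* guest physical address width - 1 */
>     uint8_t  flags;                /* bit 0: interrupt remapping supported */
>     uint8_t  reserved[10];
>     struct acpi_dmar_drhd drhd;    /* one remapping hardware unit */
> } __attribute__((packed));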
> 
> 
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
