Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B
Hi Malcolm,

Thank you very much for accommodating our XenGT requirement in your design. Following are some XenGT related questions. :)

On 6/13/2015 12:43 AM, Malcolm Crossley wrote:
> [snip: the full draft B text, which appears in the original posting below]
Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B
On 17.06.15 at 14:48, yu.c.zh...@linux.intel.com wrote:
> Thank you very much for accommodating our XenGT requirement in your
> design. Following are some XenGT related questions. :)

Please trim your replies.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B
On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:

> IOMMUOP_query_caps
> ------------------
>
> This subop queries the runtime capabilities of the PV-IOMMU interface
> for the specific called domain. This subop uses `struct pv_iommu_op`
> directly.

"calling domain" perhaps?

> Field      Purpose
> ---------  ------------------------------------------------------
> `flags`    [out] This field details the IOMMUOP capabilities.
> `status`   [out] Status of this op, op specific values listed below
>
> Defined bits for flags field:
>
> Name                      Bit  Definition
> ------------------------  ---  ----------------------------------------
> IOMMU_QUERY_map_cap       0    IOMMUOP_map_page or IOMMUOP_map_foreign
>                                can be used for this domain

"this" (see also above) perhaps being the calling domain? In which case I wonder how the "for this domain" part and IOMMUOP_map_foreign are meant to fit together: I assume the flag to indicate that mapping into the (calling) domain is possible. Which then makes me wonder - what use is the new hypercall when this flag isn't set?

> IOMMU_QUERY_map_all_gfns  1    IOMMUOP_map_page subop can map any MFN
>                                not used by Xen

gfns or MFN?

> Defined values for map_page subop status field:
>
> Value   Reason
> ------  ------------------------------------------------------------
> 0       subop successfully returned
> -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM  GFN could not be mapped because the GFN belongs to Xen.
> -EPERM  Domain is not a domain and GFN does not belong to domain

"is not a hardware domain"? Also, I think we're pretty determined for there to ever only be one, so perhaps it should be "the hardware domain" here and elsewhere.

> IOMMUOP_unmap_page
> ------------------
>
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> The subop usage of the `struct pv_iommu_op` and `struct unmap_page`
> fields are detailed below:
>
> Field   Purpose
> ------  ----------------------------------------------------------
> `bfn`   [in] Bus address frame number to be unmapped in DOMID_SELF

Has it been determined that unmapping based on GFN is never going to be needed, and that unmapping by BFN is the more practical solution? The map counterpart doesn't seem to exclude establishing multiple mappings for the same BFN, and hence the inverse here would become kind of fuzzy in that case.
> IOMMUOP_map_foreign_page
> ------------------------
>
> This subop uses `struct map_foreign_page` part of the `struct
> pv_iommu_op`. It is not valid to use a domid representing the calling
> domain.

And what's the point of that? Was it considered to have only one map/unmap pair, capable of mapping both local and foreign pages? If so, what speaks against that?

> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.

I didn't see anything explaining the significance of this (namely the ioserver part; I think I can see the need for the domid) - can you explain the background here please?

> Every new M2B entry will take a reference to the MFN backing the GFN.

What happens when that GFN-MFN mapping changes?

> All the following conditions are required to be true for the PV IOMMU
> map_foreign subop to succeed:
>
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is a hardware_domain and the following Xen IOMMU options
>    are NOT enabled: dom0-passthrough

Is there a way for the hardware domain to know it is running in pass-through mode? Also, "the domain" is ambiguous here; I'm sure you mean the invoking domain, not the one owning the page.

> This subop usage of the `struct pv_iommu_op` and `struct
> map_foreign_page` fields are detailed below:
>
> Field       Purpose
> ----------  ---------------------------------------------------
> `domid`     [in] The domain ID for which the gfn field applies
> `ioserver`  [in] IOREQ server id associated with mapping
> `bfn`       [in] Bus address frame number for gfn address

In the description above you speak of returning data in this field. Is [in] really correct?

> Defined bits for flags field:
>
> Name               Bit  Definition
> -----------------  ---  --------------------------------
> IOMMUOP_readable   0    BFN IOMMU mapping is readable
> IOMMUOP_writeable  1    BFN IOMMU mapping is writeable
> IOMMUOP_swap_mfn   2    BFN IOMMU mapping can be safely
Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B
On 16/06/15 14:19, Jan Beulich wrote:
> On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:
>> IOMMUOP_query_caps
>> ------------------
>>
>> This subop queries the runtime capabilities of the PV-IOMMU interface
>> for the specific called domain. This subop uses `struct pv_iommu_op`
>> directly.
>
> "calling domain" perhaps?
>
>> IOMMU_QUERY_map_cap: IOMMUOP_map_page or IOMMUOP_map_foreign can be
>> used for this domain
>
> "this" (see also above) perhaps being the calling domain? In which
> case I wonder how the "for this domain" part and IOMMUOP_map_foreign
> are meant to fit together: I assume the flag to indicate that mapping
> into the (calling) domain is possible. Which then makes me wonder -
> what use is the new hypercall when this flag isn't set?

This is the calling domain. The IOMMU_lookup_foreign should continue to work if this flag is not set.

>> IOMMU_QUERY_map_all_gfns: IOMMUOP_map_page subop can map any MFN not
>> used by Xen
>
> gfns or MFN?

gfns. This is meant to apply to the hardware domain only; it's to allow the same access control as dom0-relaxed mode allows currently.

>> -EPERM  Domain is not a domain and GFN does not belong to domain
>
> "is not a hardware domain"? Also, I think we're pretty determined for
> there to ever only be one, so perhaps it should be "the hardware
> domain" here and elsewhere.

That is a typo. It should say "is not the hardware domain". I will correct the other occurrences of "a hardware domain" to "the hardware domain".

>> IOMMUOP_unmap_page
>> ------------------
>>
>> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
>> The subop usage of the `struct pv_iommu_op` and `struct unmap_page`
>> fields are detailed below:
>>
>> `bfn`  [in] Bus address frame number to be unmapped in DOMID_SELF
>
> Has it been determined that unmapping based on GFN is never going to
> be needed, and that unmapping by BFN is the more practical solution?
> The map counterpart doesn't seem to exclude establishing multiple
> mappings for the same BFN, and hence the inverse here would become
> kind of fuzzy in that case.

There will be only one BFN to MFN mapping per domain; the map hypercall will fail any attempt to map a BFN to more than one GFN. This is why the unmap is based on the BFN. It is allowed to have multiple BFN mappings of the same GFN, however.

>> IOMMUOP_map_foreign_page
>> ------------------------
>>
>> This subop uses `struct map_foreign_page` part of the `struct
>> pv_iommu_op`. It is not valid to use a domid representing the calling
>> domain.
>
> And what's the point of that? Was it considered to have only one
> map/unmap pair, capable of mapping both local and foreign pages? If
> so, what speaks against that?

It was considered to have one map/unmap pair. The foreign map operation is the more complex of the two types of mappings and so I thought it would make for a cleaner API to have separate subops for each type of mapping. The handling of M2B in particular is what may make the internal implementation complex.

>> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
>
> I didn't see anything explaining the significance of this (namely the
> ioserver part; I think I can see the need for the domid) - can you
> explain the background here please?

The ioserver part of the tuple is mainly for supporting a notification mechanism when a guest balloons out a GFN.

>> Every new M2B entry will take a reference to the MFN backing the GFN.
>
> What happens when that GFN-MFN mapping changes?

The IOREQ server will be notified so that it can ensure any mediated device is not using the now invalid BFN mapping. Once all BFN mappings are removed by affected IOREQ servers (decrementing the reference count each time), the MFN will be released back to Xen.

All the following conditions are required to be
Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B
On 16.06.15 at 16:47, malcolm.cross...@citrix.com wrote:
> On 16/06/15 14:19, Jan Beulich wrote:
>> On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:
>>> IOMMU_QUERY_map_all_gfns: IOMMUOP_map_page subop can map any MFN not
>>> used by Xen
>>
>> gfns or MFN?
>
> gfns. This is meant to apply to the hardware domain only; it's to
> allow the same access control as dom0-relaxed mode allows currently.

But why "gfns" in the name and "any MFN" in the description?

>>> `bfn`  [in] Bus address frame number to be unmapped in DOMID_SELF
>>
>> Has it been determined that unmapping based on GFN is never going to
>> be needed, and that unmapping by BFN is the more practical solution?
>> The map counterpart doesn't seem to exclude establishing multiple
>> mappings for the same BFN, and hence the inverse here would become
>> kind of fuzzy in that case.
>
> There will be only one BFN to MFN mapping per domain; the map
> hypercall will fail any attempt to map a BFN to more than one GFN.
> This is why the unmap is based on the BFN. It is allowed to have
> multiple BFN mappings of the same GFN, however.

Okay, I got confused again by the term BFN - I keep mixing up the parts of the bus between device and IOMMU vs between IOMMU and RAM. Alternatives I could think of (DFN for Device Frame Number) wouldn't be any better, so I guess we need to live with the ambiguity.

>>> Each successful subop will add to the M2B if there was not an
>>> existing identical M2B entry. Every new M2B entry will take a
>>> reference to the MFN backing the GFN.
>>
>> This is a lookup - why would it add something somewhere? Perhaps this
>> is just a copy-and-paste mistake? Or is the use of "lookup" here
>> misleading (as the last section of the document seems to suggest)?
>
> I can see how the term "lookup" is misleading. This subop really does
> a lookup and take reference.
> Can you suggest an alternative name? get_foreign_page_map?

In any event, in particular with a possibly ambiguous name, the description should be particularly clear and obvious to help the reader (which also applies to the code to be written). I'm wary of allowing ???

>>> PV IOMMU interactions with self ballooning
>>> ==========================================
>>>
>>> The guest should clear any IOMMU mappings it has of its own pages
>>> before releasing a page back to Xen. It will need to add IOMMU
>>> mappings after repopulating a page with the populate_physmap
>>> hypercall. This requires that IOMMU mappings get a writeable page
>>> type reference count and that guests clear any IOMMU mappings before
>>> pinning page table pages.
>>
>> I suppose this is only for aware PV guests. If so, perhaps this
>> should be made explicit.
>
> This is only for PV guests. I will make the correction.

The emphasis was on "aware", not on "PV" (which is already stated).

>>> The grant map operation would then behave similarly to the
>>> IOMMUOP_map_page subop for the creation of the IOMMU mapping. The
>>> grant unmap operation would then behave similarly to the
>>> IOMMUOP_unmap_page subop for the removal of the IOMMU mapping.
>>
>> We're talking about mappings of foreign pages here - aren't these the
>> wrong IOMMUOPs then? And if so, where would the ioserver id come
>> from?
>
> I don't expect grant mapped pages to be ballooned out or to be
> directly used by ioservers, so I believe grant mapped pages match more
> closely the standard map_page than the foreign_map_page.

Right, that became clear with you saying that the ioserver id is meant to be used for balloon out notifications only. But that should be made explicit.

> Generally I think I need to rework the document to introduce some
> concepts before the actual interface itself.

Yes, that would be very helpful. The interface spec should probably be the (almost) last thing.

Jan
[Xen-devel] [RFC] Xen PV IOMMU interface draft B
Hi All,

Here is a design for allowing guests to control the IOMMU. This allows the guest GFN mapping to be programmed into the IOMMU, avoiding use of the SWIOTLB bounce buffer technique in the Linux kernel (except for legacy 32 bit DMA IO devices).

Draft B has been expanded to include Bus Address mapping/lookup for Mediated pass-through emulators.

The pandoc markdown format of the document is provided below to allow for easier inline comments:

% Xen PV IOMMU interface
% Malcolm Crossley malcolm.cross...@citrix.com
  Paul Durrant paul.durr...@citrix.com
% Draft B

Introduction
============

Revision History
----------------

Version  Date         Changes
-------  -----------  --------------
Draft A  10 Apr 2014  Initial draft.
Draft B  12 Jun 2015  Second draft.

Background
==========

Linux kernel SWIOTLB
--------------------

Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is decoupled from the host Machine Frame Number (MFN) address space. PV guest hardware drivers are aware of the PFN address space only and assume that if PFN addresses are contiguous then the hardware addresses will be contiguous as well. The decoupling between the PFN and MFN address spaces means PFN and MFN addresses may not be contiguous across page boundaries, and thus a buffer allocated in the PFN address space which spans a page boundary may not be contiguous in the MFN address space.

PV hardware drivers cannot tolerate this behaviour and so a special bounce buffer region is used to hide this issue from the drivers. A bounce buffer region is a special part of the PFN address space which has been made contiguous in both the PFN and MFN address spaces.

When a driver requests that a buffer spanning a page boundary be made available for hardware to read, the core operating system code copies the buffer into a temporarily reserved part of the bounce buffer region and then returns the MFN address of that reserved part to the driver. The driver then instructs the hardware to read the copy of the buffer in the bounce buffer.
Similarly, if the driver requests that a buffer be made available for hardware to write to, a region of the bounce buffer is first reserved, and after the hardware completes its writes the reserved region of the bounce buffer is copied to the originally allocated buffer.

The overhead of memory copies to/from the bounce buffer region is high and damages performance. Furthermore, there is a risk that the fixed size bounce buffer region will become exhausted and it will not be possible to return a hardware address to the driver. The Linux kernel drivers do not tolerate this failure and so the kernel is forced to crash, as an uncorrectable error has occurred.

Input/Output Memory Management Units (IOMMU) allow an inbound address mapping to be created from the I/O Bus address space (typically PCI) to the machine frame number address space. IOMMUs typically use a page table mechanism to manage the mappings and therefore can create mappings of page size granularity or larger.

The I/O Bus address space will be referred to as the Bus Frame Number (BFN) address space for the rest of this document.

Mediated Pass-through Emulators
-------------------------------

Mediated pass-through emulators allow guest domains to interact with hardware devices via emulator mediation. The emulator runs in a domain separate from the guest domain and is used to enforce security of guest access to the hardware devices and isolation of different guests accessing the same hardware device. The emulator requires a mechanism to map guest addresses to a bus address that the hardware devices can access.

Clarification of GFN and BFN fields for different guest types
-------------------------------------------------------------

The Guest Frame Number (GFN) definition varies depending on the guest type. The diagram below details the memory accesses originating from the CPU, per guest type:

         HVM guest                PV guest
           (VA)                     (VA)
            |                        |
           MMU                      MMU
            |                        |
          (GFN)                      |
            |                        | (GFN)
     HAP a.k.a EPT/NPT               |
            |                        |
          (MFN)                    (MFN)
            |                        |
           RAM                      RAM

For PV guests, a GFN is equal to the MFN for a single page, but this does not hold for a contiguous range of pages.
Bus Frame Numbers (BFN) refer to the address presented on the physical bus before being translated by the IOMMU. Diagram