Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-17 Thread Yu, Zhang

Hi Malcolm,

  Thank you very much for accommodating our XenGT requirements in your
design. Here are some XenGT-related questions. :)

On 6/13/2015 12:43 AM, Malcolm Crossley wrote:

[...]

Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-17 Thread Jan Beulich
 On 17.06.15 at 14:48, yu.c.zh...@linux.intel.com wrote:
Thank you very much for accommodating our XenGT requirements in your
 design. Here are some XenGT-related questions. :)

Please trim your replies.

Jan




Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-16 Thread Jan Beulich
 On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:
 IOMMUOP_query_caps
 ------------------
 
 This subop queries the runtime capabilities of the PV-IOMMU interface for
 the specific called domain. This subop uses `struct pv_iommu_op` directly.

"calling domain" perhaps?

 
 --------------------------------------------------------------------------
 Field      Purpose
 ---------  ---------------------------------------------------------------
 `flags`    [out] This field details the IOMMUOP capabilities.

 `status`   [out] Status of this op, op specific values listed below
 --------------------------------------------------------------------------
 
 Defined bits for flags field:

 --------------------------------------------------------------------------
 Name                      Bit   Definition
 ------------------------  ----  -----------------------------------------
 IOMMU_QUERY_map_cap        0    IOMMUOP_map_page or IOMMUOP_map_foreign
                                 can be used for this domain

"this" (see also above) perhaps being the calling domain? In which
case I wonder how the "for" and IOMMUOP_map_foreign are
meant to fit together: I assume the flag to indicate that mapping into
the (calling) domain is possible. Which then makes me wonder - what
use is the new hypercall when this flag isn't set?

 IOMMU_QUERY_map_all_gfns   1    IOMMUOP_map_page subop can map any MFN
                                 not used by Xen

"gfns" or "MFN"?

 Defined values for map_page subop status field:

 Value   Reason
 ------  -----------------------------------------------------------------
 0       subop successfully returned
 -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
 -EPERM  GFN could not be mapped because the GFN belongs to Xen.
 -EPERM  Domain is not a  domain and GFN does not belong to domain

"is not a hardware domain"? Also, I think we're pretty determined
for there to ever only be one, so perhaps it should be "the
hardware domain" here and elsewhere.

 IOMMUOP_unmap_page
 ------------------
 This subop uses the `struct unmap_page` part of the `struct pv_iommu_op`.

 The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
 is detailed below:

 Field      Purpose
 ---------  ------------------------------------------------------------
 `bfn`      [in] Bus address frame number to be unmapped in DOMID_SELF

Has it been determined that unmapping based on GFN is never
going to be needed, and that unmapping by BFN is the more
practical solution? The map counterpart doesn't seem to exclude
establishing multiple mappings for the same BFN, and hence the
inverse here would become kind of fuzzy in that case.

 IOMMUOP_map_foreign_page
 ------------------------
 This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
 
 It is not valid to use domid representing the calling domain.

And what's the point of that? Was it considered to have only one
map/unmap pair, capable of mapping both local and foreign pages?
If so, what speaks against that?

 The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.

I didn't see anything explaining the significance of this (namely
the ioserver part; I think I can see the need for the domid) - can
you explain the background here please?

 Every new M2B entry will take a reference to the MFN backing the GFN.

What happens when that GFN-MFN mapping changes?

 All the following conditions are required to be true for the PV IOMMU
 map_foreign subop to succeed:

 1. IOMMU detected and supported by Xen
 2. The domain has IOMMU controlled hardware allocated to it
 3. The domain is a hardware_domain and the following Xen IOMMU options are
    NOT enabled: dom0-passthrough

Is there a way for the hardware domain to know it is running in
pass-through mode? Also, "the domain" is ambiguous here; I'm
sure you mean the invoking domain, not the one owning the page.

 This subop usage of the `struct pv_iommu_op` and `struct map_foreign_page`
 fields is detailed below:

 Field      Purpose
 ---------  ------------------------------------------------------------
 `domid`    [in] The domain ID for which the gfn field applies

 `ioserver` [in] IOREQ server id associated with mapping

 `bfn`      [in] Bus address frame number for gfn address

In the description above you speak of returning data in this field. Is
[in] really correct?

 Defined bits for flags field:

 Name               Bit   Definition
 -----------------  ----  --------------------------------------
 IOMMUOP_readable    0    BFN IOMMU mapping is readable
 IOMMUOP_writeable   1    BFN IOMMU mapping is writeable
 IOMMUOP_swap_mfn    2    BFN IOMMU mapping can be safely
 

Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-16 Thread Malcolm Crossley
On 16/06/15 14:19, Jan Beulich wrote:
 On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:
 IOMMUOP_query_caps
 ------------------

 This subop queries the runtime capabilities of the PV-IOMMU interface for
 the specific called domain. This subop uses `struct pv_iommu_op` directly.
 
 "calling domain" perhaps?
 
 
 --------------------------------------------------------------------------
 Field      Purpose
 ---------  ---------------------------------------------------------------
 `flags`    [out] This field details the IOMMUOP capabilities.

 `status`   [out] Status of this op, op specific values listed below
 --------------------------------------------------------------------------

 Defined bits for flags field:

 --------------------------------------------------------------------------
 Name                      Bit   Definition
 ------------------------  ----  -----------------------------------------
 IOMMU_QUERY_map_cap        0    IOMMUOP_map_page or IOMMUOP_map_foreign
                                 can be used for this domain
 
 "this" (see also above) perhaps being the calling domain? In which
 case I wonder how the "for" and IOMMUOP_map_foreign are
 meant to fit together: I assume the flag to indicate that mapping into
 the (calling) domain is possible. Which then makes me wonder - what
 use is the new hypercall when this flag isn't set?

This is the calling domain. The IOMMU_lookup_foreign should continue to work
if this flag is not set.
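
A rough sketch of the probing pattern I'd expect from callers (the
HYPERVISOR_iommu_op() wrapper, the subop numbering and the struct layout
here are illustrative, not final; the flag name is from the draft):

    /* Sketch only: wrapper name and struct layout are illustrative. */
    static bool pv_iommu_can_map(void)
    {
        struct pv_iommu_op op = {
            .subop_id = IOMMUOP_query_caps,
        };

        if (HYPERVISOR_iommu_op(&op, 1 /* one op */))
            return false;           /* interface not available */

        /* Without map_cap only the lookup operations remain usable. */
        return op.flags & IOMMU_QUERY_map_cap;
    }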

 
 IOMMU_QUERY_map_all_gfns   1    IOMMUOP_map_page subop can map any MFN
                                 not used by Xen
 
 "gfns" or "MFN"?

"gfns". This is meant to apply to the hardware domain only; it's to allow the
same access control as dom0-relaxed mode allows currently.
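
Roughly the access control I have in mind is the following (a sketch; the
helper names are illustrative, not existing Xen internals):

    /* Sketch of the intended check: a "relaxed" hardware domain may map
     * any MFN not used by Xen, other callers only MFNs they own. */
    static int check_map_allowed(struct domain *d, unsigned long mfn)
    {
        if ( mfn_used_by_xen(mfn) )
            return -EPERM;              /* never map Xen's own pages */

        if ( !is_relaxed_hw_domain(d) && page_owner(mfn) != d )
            return -EPERM;              /* strict mode: own pages only */

        return 0;
    }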

 
 Defined values for map_page subop status field:

 Value   Reason
 ------  -----------------------------------------------------------------
 0       subop successfully returned
 -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
 -EPERM  GFN could not be mapped because the GFN belongs to Xen.
 -EPERM  Domain is not a  domain and GFN does not belong to domain
 
 "is not a hardware domain"? Also, I think we're pretty determined
 for there to ever only be one, so perhaps it should be "the
 hardware domain" here and elsewhere.

That is a typo. It should say "is not the hardware domain".

I will correct the other occurrences of "a hardware domain" to "the
hardware domain".

 
 IOMMUOP_unmap_page
 ------------------
 This subop uses the `struct unmap_page` part of the `struct pv_iommu_op`.

 The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
 is detailed below:

 Field      Purpose
 ---------  ------------------------------------------------------------
 `bfn`      [in] Bus address frame number to be unmapped in DOMID_SELF
 
 Has it been determined that unmapping based on GFN is never
 going to be needed, and that unmapping by BFN is the more
 practical solution? The map counterpart doesn't seem to exclude
 establishing multiple mappings for the same BFN, and hence the
 inverse here would become kind of fuzzy in that case.

There will be only one BFN to MFN mapping per domain; the map hypercall will
fail any attempt to map a BFN to more than one GFN. This is why the unmap
is based on the BFN. It is allowed to have multiple BFN mappings of the same
GFN, however.
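
In other words, the hypervisor-side check on map would be roughly this (a
sketch only; the helper names are illustrative):

    /* At most one mapping per BFN per domain; a GFN may be mapped at
     * several different BFNs. Helper names are illustrative. */
    static int pv_iommu_do_map(struct domain *d, uint64_t bfn, uint64_t gfn,
                               uint32_t flags)
    {
        if ( bfn_entry_exists(d, bfn) )
            return -EPERM;          /* BFN already mapped: refuse */

        /* No reverse check: further BFN mappings of this GFN are fine. */
        return iommu_map_page(d, bfn, gfn_to_mfn(d, gfn), flags);
    }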

 
 IOMMUOP_map_foreign_page
 ------------------------
 This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.

 It is not valid to use domid representing the calling domain.
 
 And what's the point of that? Was it considered to have only one
 map/unmap pair, capable of mapping both local and foreign pages?
 If so, what speaks against that?

It was considered to have one map/unmap pair. The foreign map operation is the
more complex of the two types of mapping and so I thought it would make for a
cleaner API to have a separate subop for each type of mapping. The handling of
the M2B in particular is what may make the internal implementation complex.

 
 The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
 
 I didn't see anything explaining the significance of this (namely
 the ioserver part; I think I can see the need for the domid) - can
 you explain the background here please?

The ioserver part of the tuple is mainly for supporting a notification
mechanism when a guest balloons out a GFN.

 
 Every new M2B entry will take a reference to the MFN backing the GFN.
 
 What happens when that GFN-MFN mapping changes?

The IOREQ server will be notified so that it can ensure any mediated device
is not using the now invalid BFN mapping. Once all BFN mappings are removed
by the affected IOREQ servers (decrementing the reference count each time),
the MFN will be released back to Xen.
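
Per M2B entry the teardown is roughly the following (a sketch; the names are
illustrative, not the real Xen internals):

    /* Sketch of the teardown path described above. */
    static void m2b_entry_release(struct domain *d, struct m2b_entry *e)
    {
        iommu_unmap_page(d, e->bfn);   /* IOREQ server dropped this BFN */
        m2b_remove(e->mfn, e->bfn, e->domid, e->ioserver);

        /* Drop the reference taken when the mapping was created; when the
         * last M2B entry for the MFN is gone the page goes back to Xen. */
        if ( put_mfn_ref(e->mfn) == 0 )
            release_mfn_to_xen(e->mfn);
    }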

 
 All the following conditions are required to be 

Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-16 Thread Jan Beulich
 On 16.06.15 at 16:47, malcolm.cross...@citrix.com wrote:
 On 16/06/15 14:19, Jan Beulich wrote:
 On 12.06.15 at 18:43, malcolm.cross...@citrix.com wrote:
 IOMMU_QUERY_map_all_gfns   1    IOMMUOP_map_page subop can map any MFN
                                 not used by Xen
 
 "gfns" or "MFN"?
 
 "gfns". This is meant to apply to the hardware domain only; it's to allow the
 same access control as dom0-relaxed mode allows currently.

But why "gfns" in the name and "any MFN" in the description?

 IOMMUOP_unmap_page
 ------------------
 This subop uses the `struct unmap_page` part of the `struct pv_iommu_op`.

 The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
 is detailed below:

 Field      Purpose
 ---------  ------------------------------------------------------------
 `bfn`      [in] Bus address frame number to be unmapped in DOMID_SELF
 
 Has it been determined that unmapping based on GFN is never
 going to be needed, and that unmapping by BFN is the more
 practical solution? The map counterpart doesn't seem to exclude
 establishing multiple mappings for the same BFN, and hence the
 inverse here would become kind of fuzzy in that case.
 
 There will be only one BFN to MFN mapping per domain; the map hypercall will
 fail any attempt to map a BFN to more than one GFN. This is why the unmap
 is based on the BFN. It is allowed to have multiple BFN mappings of the same
 GFN, however.

Okay, I got confused again by the term BFN - I keep mixing up the
parts of the bus between device and IOMMU vs between IOMMU
and RAM. Alternatives I could think of (DFN for Device Frame Number)
wouldn't be any better, so I guess we need to live with the ambiguity.

 Each successful subop will add to the M2B if there was not an existing
 identical M2B entry.

 Every new M2B entry will take a reference to the MFN backing the GFN.
 
 This is a lookup - why would it add something somewhere? Perhaps
 this is just a copy-and-paste mistake? Or is the use of "lookup" here
 misleading (as the last section of the document seems to suggest)?
 
 I can see how the term "lookup" is misleading. This subop really does a
 lookup and takes a reference. Can you suggest an alternative name?

get_foreign_page_map? In any event, in particular with a possibly
ambiguous name, the description should be particularly clear and
obvious to help the reader (which also applies to the code to be
written).

 I'm wary of allowing

???

 PV IOMMU interactions with self ballooning
 ==

 The guest should clear any IOMMU mappings it has of its own pages before
 releasing a page back to Xen. It will need to add IOMMU mappings after
 repopulating a page with the populate_physmap hypercall.

 This requires that IOMMU mappings get a writeable page type reference count
 and that guests clear any IOMMU mappings before pinning page table pages.
 
 I suppose this is only for "aware" PV guests. If so, perhaps this should
 be made explicit.
 
 This is only for PV guests. I will make the correction.

The emphasis was on "aware", not on "PV" (which is already stated).

 The grant map operation would then behave similarly to the IOMMUOP_map_page
 subop for the creation of the IOMMU mapping.

 The grant unmap operation would then behave similarly to the
 IOMMUOP_unmap_page subop for the removal of the IOMMU mapping.
 
 We're talking about mappings of foreign pages here - aren't these the
 wrong IOMMUOPs then? And if so, where would the ioserver id come
 from?
 
 
 I don't expect grant mapped pages to be ballooned out or to be directly
 used by ioservers, so I believe the grant mapped pages match more closely
 the standard map_page than the foreign_map_page.

Right, that became clear with you saying that the ioserver id is
meant to be used for balloon out notifications only. But that
should be made explicit.

 Generally I think I need to rework the document to introduce some concepts
 before the actual interface itself.

Yes, that would be very helpful. The interface spec should probably be
the (almost) last thing.

Jan



[Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-12 Thread Malcolm Crossley
Hi All,

Here is a design for allowing guests to control the IOMMU. This
allows the guest GFN mapping to be programmed into the IOMMU, avoiding
the SWIOTLB bounce buffer technique in the Linux kernel
(except for legacy 32 bit DMA IO devices).

Draft B has been expanded to include Bus Address mapping/lookup for Mediated
pass-through emulators.

The pandoc markdown format of the document is provided below to allow
for easier inline comments:

% Xen PV IOMMU interface
% Malcolm Crossley malcolm.cross...@citrix.com
  Paul Durrant paul.durr...@citrix.com
% Draft B

Introduction
============

Revision History
----------------


Version  Date         Changes
-------  -----------  --------------
Draft A  10 Apr 2014  Initial draft.

Draft B  12 Jun 2015  Second draft.


Background
==========

Linux kernel SWIOTLB
--------------------

Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space.

PV guest hardware drivers are only aware of the PFN address space and
assume that if PFN addresses are contiguous then the hardware addresses will
be contiguous as well. The decoupling between the PFN and MFN address spaces
means that PFN addresses which are contiguous across a page boundary may be
backed by MFN addresses which are not contiguous, and thus a buffer allocated
in PFN address space which spans a page boundary may not be contiguous in MFN
address space.
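
A per-page check makes the problem concrete (a sketch; pfn_to_mfn() is the
PV translation as found in the Linux Xen headers, the helper itself is
illustrative):

    /* Illustrative: adjacent PFNs do not imply adjacent MFNs. */
    static bool pv_buffer_machine_contiguous(unsigned long first_pfn,
                                             unsigned int nr_pages)
    {
        unsigned int i;

        for (i = 1; i < nr_pages; i++)
            if (pfn_to_mfn(first_pfn + i) != pfn_to_mfn(first_pfn) + i)
                return false;   /* crosses a PFN->MFN discontinuity */

        return true;
    }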

PV hardware drivers cannot tolerate this behaviour and so a special
bounce buffer region is used to hide this issue from the drivers.

A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both the PFN and MFN address spaces. When a
driver requests that a buffer which spans a page boundary be made available
for hardware to read, the core operating system code copies the buffer into a
temporarily reserved part of the bounce buffer region and then returns the
MFN address of the reserved part of the bounce buffer region back to the
driver. The driver then instructs the hardware to read the copy of the buffer
in the bounce buffer. Similarly, if the driver requests that a buffer be made
available for hardware to write to, a region of the bounce buffer is first
reserved, and after the hardware completes writing, the reserved region of
the bounce buffer is copied to the originally allocated buffer.
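
In outline the bounce sequence is the following (a minimal sketch, not the
actual Linux SWIOTLB code; the swiotlb_reserve()/swiotlb_release() helpers
and machine_to_virt() are illustrative stand-ins, the DMA direction values
are modelled on the Linux DMA API):

    /* Minimal sketch of the bounce buffer sequence described above. */
    dma_addr_t bounce_map(void *buf, size_t len, enum dma_data_direction dir)
    {
        void *slot = swiotlb_reserve(len);  /* PFN- and MFN-contiguous */

        if (!slot)
            panic("bounce buffer exhausted"); /* failure mode noted below */

        if (dir == DMA_TO_DEVICE)
            memcpy(slot, buf, len);         /* copy before the device reads */

        return virt_to_machine(slot);       /* MFN-based address for device */
    }

    void bounce_unmap(dma_addr_t handle, void *buf, size_t len,
                      enum dma_data_direction dir)
    {
        void *slot = machine_to_virt(handle);

        if (dir == DMA_FROM_DEVICE)
            memcpy(buf, slot, len);         /* copy back after device wrote */

        swiotlb_release(slot, len);
    }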

The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk that the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
uncorrectable error has occurred.

Input/Output Memory Management Units (IOMMUs) allow an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and can therefore create mappings of page
size granularity or larger.

The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.
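
For example, giving a device access to an N-page buffer at a chosen bus
address means creating one page-sized BFN to MFN mapping per page (a sketch;
iommu_map() is a generic stand-in, not a specific Xen or Linux API):

    /* One page-granular BFN -> MFN entry per page of the buffer. */
    static int map_buffer_at_bfn(uint64_t base_bfn, const uint64_t *mfns,
                                 unsigned int nr_pages)
    {
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
            int rc = iommu_map(base_bfn + i, mfns[i]);

            if (rc)
                return rc;          /* caller unmaps 0..i-1 on failure */
        }
        return 0;
    }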


Mediated Pass-through Emulators
-------------------------------

Mediated pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a separate
domain from the guest domain and is used to enforce the security of guest
access to the hardware devices and the isolation of different guests
accessing the same hardware device.

The emulator requires a mechanism to map guest addresses to bus addresses
that the hardware devices can access.
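
For example, to make a single guest page DMA-addressable, an emulator would
issue something along these lines (a sketch: the HYPERVISOR_iommu_op()
wrapper and the exact struct layout are illustrative; the field and flag
names follow the interface described later in this document):

    /* Sketch of emulator usage; wrapper and struct layout illustrative. */
    static int map_guest_page_for_dma(domid_t guest, ioservid_t ioserver,
                                      uint64_t gfn, uint64_t bfn)
    {
        struct pv_iommu_op op = {
            .subop_id = IOMMUOP_map_foreign_page,
            .flags    = IOMMUOP_readable | IOMMUOP_writeable,
            .u.map_foreign_page = {
                .domid    = guest,    /* domain owning the GFN */
                .ioserver = ioserver, /* for balloon-out notifications */
                .gfn      = gfn,
                .bfn      = bfn,      /* bus address the device will use */
            },
        };

        int rc = HYPERVISOR_iommu_op(&op, 1);

        return rc ? rc : op.status;
    }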


Clarification of GFN and BFN fields for different guest types
--------------------------------------------------------------

The definition of Guest Frame Numbers (GFN) varies depending on the guest
type.

The diagram below details the memory accesses originating from the CPU, per
guest type:

  HVM guest  PV guest

 (VA)   (VA)
  |  |
 MMUMMU
  |  |
 (GFN)   |
  |  | (GFN)
 HAP a.k.a EPT/NPT   |
  |  |
 (MFN)  (MFN)
  |  |
 RAMRAM

For PV guests, GFN is equal to MFN for a single page, but not for a
contiguous range of pages.

Bus Frame Numbers (BFN) refer to the address presented on the physical bus
before being translated by the IOMMU.

Diagram