Re: RFC: vfio API changes needed for powerpc (v3)

2013-04-11 Thread Joerg Roedel
On Tue, Apr 09, 2013 at 01:22:15AM +, Yoder Stuart-B08248 wrote:
  What happens if a normal unmap call is done on the MSI iova?  Do we
  need a separate unmap?
 
 I was thinking a normal unmap on an MSI window would be an error...but
 I'm not set on that.   I put the msi unmap there to make things symmetric,
 a normal unmap would work as well...and then we could drop the msi unmap.

Hmm, this API semantic isn't very clean. When you explicitly map the MSI
banks, a clean API would also allow unmapping them. But that is not
possible in your design, because the kernel is responsible for mapping
MSIs and you can't unmap an MSI bank that is in use by the kernel.

So since the kernel owns the MSI setup anyway, it should also take care
of mapping the MSI banks. What is the reason not to let the kernel
allocate the MSI banks top-down from the end of the DMA window space?
Just let userspace know (or even set, if needed) in advance how many of
the windows it configures the kernel will take for mapping MSI banks,
and you are fine, no?
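To illustrate the arithmetic of that top-down scheme (a sketch under the
thread's constraints; this helper is hypothetical, not code from the
proposal):

    #include <stdint.h>

    /* Hypothetical sketch of the top-down idea: MSI bank 'b' gets the
     * b-th window counted down from the top of the aperture, leaving
     * the low windows free for normal DMA mappings. */
    static uint64_t msi_bank_iova(uint64_t aper_start, uint64_t aper_size,
                                  uint32_t windows, uint32_t bank)
    {
        uint64_t win_size = aper_size / windows;  /* windows is a power of 2 */

        return aper_start + aper_size - (uint64_t)(bank + 1) * win_size;
    }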


Joerg




RE: RFC: vfio API changes needed for powerpc (v3)

2013-04-11 Thread Yoder Stuart-B08248

 -Original Message-
 From: Joerg Roedel [mailto:j...@8bytes.org]
 Sent: Thursday, April 11, 2013 7:57 AM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; kvm@vger.kernel.org; qemu-de...@nongnu.org; 
 io...@lists.linux-foundation.org;
 ag...@suse.de; Bhushan Bharat-R65777
 Subject: Re: RFC: vfio API changes needed for powerpc (v3)
 
 On Tue, Apr 09, 2013 at 01:22:15AM +, Yoder Stuart-B08248 wrote:
   What happens if a normal unmap call is done on the MSI iova?  Do we
   need a separate unmap?
 
 I was thinking a normal unmap on an MSI window would be an error...but
 I'm not set on that.   I put the msi unmap there to make things symmetric,
 a normal unmap would work as well...and then we could drop the msi unmap.
 
 Hmm, this API semantic isn't very clean. When you explicitly map the MSI
 banks, a clean API would also allow unmapping them. But that is not
 possible in your design, because the kernel is responsible for mapping
 MSIs and you can't unmap an MSI bank that is in use by the kernel.

The mapping that the vfio API creates is specific only to the
assigned device.   So it can be unmapped without affecting
any other device... there is nothing else in the kernel keeping
the mapping in use.  Another device in use by the kernel using the
same MSI bank would have its own independent mapping.   So, these
mappings are not global but are device specific...just like normal
DMA memory mappings.
  
 So since the kernel owns the MSI setup anyway, it should also take care
 of mapping the MSI banks. What is the reason not to let the kernel
 allocate the MSI banks top-down from the end of the DMA window space?
 Just let userspace know (or even set, if needed) in advance how many of
 the windows it configures the kernel will take for mapping MSI banks,
 and you are fine, no?

As designed, the API lets user space determine the number of windows
needed for MSIs, so they can be set.  The only difference between
what we've proposed and what you described, I think, is that the proposal
allows user space to set _which_ windows are used for which MSI banks.

Stuart



Re: RFC: vfio API changes needed for powerpc (v3)

2013-04-11 Thread Scott Wood

On 04/11/2013 07:56:59 AM, Joerg Roedel wrote:

On Tue, Apr 09, 2013 at 01:22:15AM +, Yoder Stuart-B08248 wrote:
  What happens if a normal unmap call is done on the MSI iova?  Do we
  need a separate unmap?

 I was thinking a normal unmap on an MSI window would be an error...but
 I'm not set on that.   I put the msi unmap there to make things symmetric,
 a normal unmap would work as well...and then we could drop the msi unmap.

Hmm, this API semantic isn't very clean. When you explicitly map the MSI
banks, a clean API would also allow unmapping them. But that is not
possible in your design, because the kernel is responsible for mapping
MSIs and you can't unmap an MSI bank that is in use by the kernel.


Why is it not possible to unmap them?  Once they've been mapped, they're
just like any other IOMMU mapping.  If the user breaks MSI for their own
devices by unmapping the MSI page, that's their problem.


So since the kernel owns the MSI setup anyway, it should also take care
of mapping the MSI banks. What is the reason not to let the kernel
allocate the MSI banks top-down from the end of the DMA window space?


It's less flexible, and possibly more complicated.

-Scott


RE: RFC: vfio API changes needed for powerpc (v3)

2013-04-09 Thread Bhushan Bharat-R65777

So now the sequence would be something like:

1) VFIO_GROUP_SET_CONTAINER              // add groups to the container

2) VFIO_SET_IOMMU(VFIO_FSL_PAMU)         // set iommu model

3) count = VFIO_IOMMU_GET_MSI_BANK_COUNT // returns max # of MSI banks

4) VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)    // set overall aperture

5) VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)     // set # of windows, including MSI banks

6) for (int i = 0; i < count; i++)
       VFIO_IOMMU_PAMU_MAP_MSI_BANK()    // map the MSI banks; do not
                                         // enable the aperture here

7) Memory listener will call VFIO_IOMMU_MAP_DMA   // map the guest's memory
   --- kernel enables the aperture here on the first VFIO_IOMMU_MAP_DMA

8) VFIO_DEVICE_SET_IRQS
   --- VFIO in the kernel makes pci_enable_msix()/pci_enable_msi_block()
       calls; this sets the actual MSI addr/data in the physical device.
   --- As the address set by the above APIs is not what we want:
       - if using MSI-X, VFIO will update the address in the MSI-X table
       - if using MSI, VFIO will update the MSI address in PCI
         configuration space
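Put together as userspace C, the sequence might look like the sketch below.
The ioctl names and struct layouts are the proposed v3 ones quoted later in
this thread; the group path, aperture size, window count, and bank iovas are
made-up illustration, and error checking is omitted:

    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>   /* assumed to carry the proposed PAMU definitions */

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);           /* hypothetical group */

        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); /* step 1 */
        ioctl(container, VFIO_SET_IOMMU, VFIO_FSL_PAMU);    /* step 2 */

        int count = ioctl(container, VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT); /* step 3 */

        struct vfio_pamu_attr at = { .argsz = sizeof(at) };
        at.attribute = VFIO_ATTR_GEOMETRY;                  /* step 4 */
        at.attr.aperture_start = 0;
        at.attr.aperture_end = (512ULL << 20) - 1;          /* illustrative 512MB */
        ioctl(container, VFIO_IOMMU_PAMU_SET_ATTR, &at);

        at.attribute = VFIO_ATTR_WINDOWS;                   /* step 5 */
        at.windows = 8;                                     /* incl. MSI bank windows */
        ioctl(container, VFIO_IOMMU_PAMU_SET_ATTR, &at);

        for (int i = 0; i < count; i++) {                   /* step 6 */
            struct vfio_pamu_msi_bank_map map = {
                .argsz = sizeof(map),
                .msi_bank_index = i,
                /* illustrative: the last 'count' 64MB windows of the aperture */
                .iova = (512ULL << 20) - (uint64_t)(count - i) * (64ULL << 20),
            };
            ioctl(container, VFIO_IOMMU_PAMU_MAP_MSI_BANK, &map);
        }
        /* steps 7-8: VFIO_IOMMU_MAP_DMA for guest memory, then VFIO_DEVICE_SET_IRQS */
        return 0;
    }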



Thanks
-Bharat


 -Original Message-
 From: Yoder Stuart-B08248
 Sent: Friday, April 05, 2013 3:40 AM
 To: Alex Williamson
 Cc: Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; Sethi 
 Varun-B16395;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
 Subject: RFC: vfio API changes needed for powerpc (v3)
 
 -v3 updates
-made vfio_pamu_attr a union, added flags
-s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more
 clear which fd is being operated on
-added flags to vfio_pamu_msi_bank_map/umap
-VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32
 not a struct
-fixed some typos
 
 
 
 The Freescale PAMU is an aperture-based IOMMU with the following
 characteristics.  Each device has an entry in a table in memory
 describing the iova-to-phys mapping. The mapping has:
    -an overall aperture that is power of 2 sized, and has a start iova that
     is naturally aligned
    -has 1 or more windows within the aperture
       -number of windows must be power of 2, max is 256
       -size of each window is determined by aperture size / # of windows
       -iova of each window is aperture start iova + window number * window size
       -the mapped region in each window can be different than
        the window size...mapping size must be a power of 2
       -physical address of the mapping must be naturally aligned
        with the mapping size
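A small helper (hypothetical, not part of the proposal) that derives a
window's iova and size while checking these constraints:

    #include <stdint.h>
    #include <stdbool.h>

    static bool is_pow2(uint64_t x) { return x && !(x & (x - 1)); }

    /* Derive the iova and size of window 'n' from the aperture geometry,
     * enforcing the PAMU constraints listed above. */
    static bool pamu_window_geom(uint64_t aper_start, uint64_t aper_size,
                                 uint32_t windows, uint32_t n,
                                 uint64_t *win_iova, uint64_t *win_size)
    {
        if (!is_pow2(aper_size) || (aper_start & (aper_size - 1)))
            return false;   /* aperture: power-of-2 size, naturally aligned */
        if (!is_pow2(windows) || windows > 256 || n >= windows)
            return false;   /* windows: power of 2, max 256 */
        *win_size = aper_size / windows;
        *win_iova = aper_start + (uint64_t)n * *win_size;
        return true;
    }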
 
 These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).
 
 /*
  * VFIO_IOMMU_PAMU_GET_ATTR
  *
  * Gets the iommu attributes for the current vfio container.  This
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets argsz and attribute.  The ioctl fills in
  * the provided struct vfio_pamu_attr based on the attribute
  * value that was set.
  * Return: 0 on success, -errno on failure
  */
 struct vfio_pamu_attr {
         __u32   argsz;
         __u32   flags;      /* no flags currently */
         __u32   attribute;

         union {
                 /* VFIO_ATTR_GEOMETRY */
                 struct {
                         __u64 aperture_start; /* first addr that can be mapped */
                         __u64 aperture_end;   /* last addr that can be mapped */
                 } attr;

                 /* VFIO_ATTR_WINDOWS */
                 __u32 windows;  /* number of windows in the aperture;
                                  * initially this will be the max number
                                  * of windows that can be set
                                  */

                 /* VFIO_ATTR_PAMU_STASH */
                 struct {
                         __u32 cpu;   /* CPU number for stashing */
                         __u32 cache; /* cache ID for stashing */
                 } stash;
         };
 };
 #define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)
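A possible call pattern, for illustration only (the container fd is assumed;
VFIO_ATTR_GEOMETRY is the attribute constant from the v2 posting quoted later
in this thread):

    struct vfio_pamu_attr at = {
        .argsz = sizeof(at),
        .attribute = VFIO_ATTR_GEOMETRY,   /* constant from the v2 posting */
    };
    if (ioctl(container, VFIO_IOMMU_PAMU_GET_ATTR, &at) == 0)
        printf("aperture: 0x%llx..0x%llx\n",
               (unsigned long long)at.attr.aperture_start,
               (unsigned long long)at.attr.aperture_end);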
 
 /*
  * VFIO_IOMMU_PAMU_SET_ATTR
  *
  * Sets the iommu attributes for the current vfio container.  This
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets struct vfio_pamu_attr, including argsz and attribute, and
  * sets any fields that are valid for the attribute.
  * Return: 0 on success, -errno on failure
  */
 #define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)
 
 /*
  * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
  *
  * Returns the number of MSI banks for this platform.  This tells user space
  * how many aperture windows should be reserved for MSI banks when setting
  * the PAMU geometry and window count.
  * Return: __u32 bank count on success, -errno on failure
  */
 #define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + x, __u32)
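Reading the comment literally, the count comes back as the ioctl return
value; a minimal sketch under that assumption:

    int count = ioctl(container, VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT);
    if (count < 0)
        perror("VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT");   /* -errno on failure */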
 
 /*
  * VFIO_IOMMU_PAMU_MAP_MSI_BANK
  *
 

RE: RFC: vfio API changes needed for powerpc (v3)

2013-04-08 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Friday, April 05, 2013 5:17 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
 Sethi Varun-B16395;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
 Subject: Re: RFC: vfio API changes needed for powerpc (v3)
 
 On 04/04/2013 05:10:27 PM, Yoder Stuart-B08248 wrote:
  /*
   * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
   *
   * Unmaps the MSI bank at the specified iova.
   * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */

  struct vfio_pamu_msi_bank_unmap {
          __u32   argsz;
          __u32   flags; /* no flags currently */
          __u64   iova;  /* the iova to be unmapped */
  };
  #define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_unmap)
 
 What happens if a normal unmap call is done on the MSI iova?  Do we
 need a separate unmap?

I was thinking a normal unmap on an MSI window would be an error...but
I'm not set on that.   I put the msi unmap there to make things symmetric,
a normal unmap would work as well...and then we could drop the msi unmap.

Stuart





Re: RFC: vfio API changes needed for powerpc (v3)

2013-04-05 Thread Scott Wood

On 04/04/2013 05:10:27 PM, Yoder Stuart-B08248 wrote:

/*
 * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
        __u32   argsz;
        __u32   flags; /* no flags currently */
        __u64   iova;  /* the iova to be unmapped */
};
#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_unmap)


What happens if a normal unmap call is done on the MSI iova?  Do we
need a separate unmap?


-Scott


Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Alex Williamson
On Thu, 2013-04-04 at 17:32 +, Yoder Stuart-B08248 wrote:
 Based on the email thread over the last couple of days, I have
 below an more concrete proposal (v2) for new ioctls supporting vfio-pci
 on SoCs with the Freescale PAMU.
 
 Example usage is as described by Scott:
 
 count = VFIO_IOMMU_GET_MSI_BANK_COUNT
 VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
 VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
 // do other DMA maps now, or later, or not at all, doesn't matter
 for (i = 0; i < count; i++)
 VFIO_IOMMU_MAP_MSI_BANK(iova, i);
 // The kernel now knows where each bank has been mapped, and can
 //   update PCI config space appropriately.
 
 Thanks,
 Stuart
 
 
 
 The Freescale PAMU is an aperture-based IOMMU with the following
 characteristics.  Each device has an entry in a table in memory
 describing the iova-to-phys mapping. The mapping has:
    -an overall aperture that is power of 2 sized, and has a start iova that
     is naturally aligned
    -has 1 or more windows within the aperture
       -number of windows must be power of 2, max is 256
       -size of each window is determined by aperture size / # of windows
       -iova of each window is aperture start iova + window number * window size
       -the mapped region in each window can be different than
        the window size...mapping size must be a power of 2
       -physical address of the mapping must be naturally aligned
        with the mapping size
 
 /*
  * VFIO_PAMU_GET_ATTR
  *
  * Gets the iommu attributes for the current vfio container.  This 
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets argsz and attribute.  The ioctl fills in
  * the provided struct vfio_pamu_attr based on the attribute
  * value that was set.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 struct vfio_pamu_attr {
 __u32   argsz;
 __u32   attribute;
 #define VFIO_ATTR_GEOMETRY 1
 #define VFIO_ATTR_WINDOWS  2
 #define VFIO_ATTR_PAMU_STASH   3
 
 /* following fields are for VFIO_ATTR_GEOMETRY */
 __u64 aperture_start; /* first address that can be mapped*/
 __u64 aperture_end;   /* last address that can be mapped */
 
 /* following fields are for VFIO_ATTR_WINDOWS */
 __u32 windows;/* number of windows in the aperture */
   /* initially this will be the max number
* of windows that can be set
*/
 
 /* following fields are for VFIO_ATTR_PAMU_STASH */
 __u32 cpu;/* CPU number for stashing */
 __u32 cache;  /* cache ID for stashing */

Can multiple attribute bits be set at once?  If not then the above could
be a union and the attribute could be an enum.  A flags field is
probably still a good idea.

 };
 #define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_attr)
 
 /*
  * VFIO_PAMU_SET_ATTR
  *
  * Sets the iommu attributes for the current vfio container.  This 
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets struct vfio_pamu attr, including argsz and attribute and
  * setting any fields that are valid for the attribute.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 #define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_attr)
 
 /*
  * VFIO_PAMU_GET_MSI_BANK_COUNT
  *
  * Returns the number of MSI banks for this platform.  This tells user space
  * how many aperture windows should be reserved for MSI banks when setting
  * the PAMU geometry and window count.
  * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 struct vfio_pamu_msi_banks {
 __u32   argsz;
 __u32   bank_count;  /* the number of MSI banks */
 };
 #define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_msi_banks)

argsz w/o flags doesn't really buy us much.

 
 /*
  * VFIO_PAMU_MAP_MSI_BANK
  *
  * Maps the MSI bank at the specified index and iova.  User space must
  * call this ioctl once for each MSI bank (count of banks is returned by
  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 
 struct vfio_pamu_msi_bank_map {
 __u32   argsz;
 __u32   msi_bank_index;  /* the index of the MSI bank */
 __u64   iova;  /* the iova the bank is to be mapped to */
 };

Again, flags.  If we dynamically add or remove devices from a container
the bank count can change, right?  If bank count goes from 2 to 3, does
userspace now assume the new bank is [2]?  If bank count goes from 3 to
2, which index was that?  If removing a device removes an MSI bank then
vfio-pamu will automatically do the unmap, right?

RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248

  /*
   * VFIO_PAMU_MAP_MSI_BANK
   *
   * Maps the MSI bank at the specified index and iova.  User space must
   * call this ioctl once for each MSI bank (count of banks is returned by
   * VFIO_IOMMU_GET_MSI_BANK_COUNT).
   * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */
 
  struct vfio_pamu_msi_bank_map {
  __u32   argsz;
  __u32   msi_bank_index;  /* the index of the MSI bank */
  __u64   iova;  /* the iova the bank is to be mapped to */
  };
 
 Again, flags.  If we dynamically add or remove devices from a container
 the bank count can change, right?  If bank count goes from 2 to 3, does
 userspace now assume the new bank is [2]?  If bank count goes from 3 to
 2, which index was that?  If removing a device removes an MSI bank then
 vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (The number will most likely always be
3 or 4.)  So it won't change as devices are added or removed.

If devices are added or removed, the kernel side can enable or disable
the corresponding MSI windows.  But that is hidden from user space.

Stuart




Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Scott Wood

On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:


  /*
   * VFIO_PAMU_MAP_MSI_BANK
   *
   * Maps the MSI bank at the specified index and iova.  User space must
   * call this ioctl once for each MSI bank (count of banks is returned by
   * VFIO_IOMMU_GET_MSI_BANK_COUNT).
   * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */

  struct vfio_pamu_msi_bank_map {
          __u32   argsz;
          __u32   msi_bank_index;  /* the index of the MSI bank */
          __u64   iova;  /* the iova the bank is to be mapped to */
  };

 Again, flags.  If we dynamically add or remove devices from a container
 the bank count can change, right?  If bank count goes from 2 to 3, does
 userspace now assume the new bank is [2]?  If bank count goes from 3 to
 2, which index was that?  If removing a device removes an MSI bank then
 vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (The number will most likely always be
3 or 4.)  So it won't change as devices are added or removed.


It should be the actual number of banks used.  This is required if  
we're going to have userspace do the iteration and specify the exact  
iovas to use -- and even if we aren't going to do that, it would be  
more restrictive on available iova-space than is necessary.  Usually  
there will be only one bank in the group.


Actually mapping all of the MSI banks, all the time, would completely  
negate the point of using the separate alias pages.


-Scott


RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, April 04, 2013 5:52 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
 Sethi Varun-B16395;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
 Subject: Re: RFC: vfio API changes needed for powerpc (v2)
 
 On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
 
    /*
     * VFIO_PAMU_MAP_MSI_BANK
     *
     * Maps the MSI bank at the specified index and iova.  User space must
     * call this ioctl once for each MSI bank (count of banks is returned by
     * VFIO_IOMMU_GET_MSI_BANK_COUNT).
     * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
     * Operates on VFIO file descriptor (/dev/vfio/vfio).
     * Return: 0 on success, -errno on failure
     */

    struct vfio_pamu_msi_bank_map {
            __u32   argsz;
            __u32   msi_bank_index;  /* the index of the MSI bank */
            __u64   iova;  /* the iova the bank is to be mapped to */
    };

   Again, flags.  If we dynamically add or remove devices from a container
   the bank count can change, right?  If bank count goes from 2 to 3, does
   userspace now assume the new bank is [2]?  If bank count goes from 3 to
   2, which index was that?  If removing a device removes an MSI bank then
   vfio-pamu will automatically do the unmap, right?

  My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
  is the max bank count for a platform.  (The number will most likely always be
  3 or 4.)  So it won't change as devices are added or removed.

 It should be the actual number of banks used.  This is required if
 we're going to have userspace do the iteration and specify the exact
 iovas to use -- and even if we aren't going to do that, it would be
 more restrictive on available iova-space than is necessary.  Usually
 there will be only one bank in the group.

 Actually mapping all of the MSI banks, all the time, would completely
 negate the point of using the separate alias pages.

The geometry, windows, DMA mappings, etc. are set on a 'container',
which may have multiple groups in it.  So user space needs to
determine the total number of MSI windows needed when setting
the geometry and window count.

In the flow you proposed:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);

...that count has to be the total, not the count for 1 of
N possible groups.   So the get count ioctl is not done on
a group.

However, like you pointed out we don't want to negate isolation 
of the separate alias pages.  All this API is doing is telling
the kernel which windows to use for which MSI banks.   It's up
to the kernel to actually enable them as needed.

Say 3 MSI banks exist.  If there are no groups added to the
vfio container and all 3 MAP_MSI_BANK calls occurred,
the picture may look like this (based on my earlier example):

win #   enabled   iova          gphys/phys    size
-----   -------   ----------    ------------  ----
5       N         0x10000000    0xf_fe044000  4KB   // msi bank 1
6       N         0x14000000    0xf_fe045000  4KB   // msi bank 2
7       N         0x18000000    0xf_fe046000  4KB   // msi bank 3

User space adds 2 groups that use bank 1:

win #   enabled   iova          gphys/phys    size
-----   -------   ----------    ------------  ----
5       Y         0x10000000    0xf_fe044000  4KB   // msi bank 1
6       N         0x14000000    0xf_fe045000  4KB   // msi bank 2
7       N         0x18000000    0xf_fe046000  4KB   // msi bank 3

User space adds another group that uses bank 3:

win #   enabled   iova          gphys/phys    size
-----   -------   ----------    ------------  ----
5       Y         0x10000000    0xf_fe044000  4KB   // msi bank 1
6       N         0x14000000    0xf_fe045000  4KB   // msi bank 2
7       Y         0x18000000    0xf_fe046000  4KB   // msi bank 3

User space doesn't need to care what is actually enabled;
it just needs to tell the kernel which windows to use,
and the kernel can take care of the rest.
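The kernel-side bookkeeping this implies could be as simple as a per-window
refcount; the structure and helpers below are hypothetical illustration,
not actual vfio code:

    /* Hypothetical per-container state: one entry per MSI bank window
     * that userspace mapped with MAP_MSI_BANK. */
    struct pamu_msi_window {
        u64  iova;      /* where userspace asked the bank to be mapped */
        int  users;     /* attached groups whose devices use this bank */
        bool enabled;   /* window currently enabled in PAMU hardware */
    };

    /* Called when an attached group starts using this bank. */
    static void msi_window_get(struct pamu_msi_window *w)
    {
        if (w->users++ == 0)
            w->enabled = true;   /* program and enable the PAMU window here */
    }

    /* Called when a group is detached or stops using the bank. */
    static void msi_window_put(struct pamu_msi_window *w)
    {
        if (--w->users == 0)
            w->enabled = false;  /* disable the PAMU window here */
    }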

Stuart




Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Stuart Yoder
   Type1 is arbitrary.  It might as well be named brown and this one
   can be blue.

 The difference is that type1 seems to refer to hardware that can do
 arbitrary 4K page mappings, possibly constrained by an aperture but
 nothing else.  More than one IOMMU can reasonably fit that.  The odds
 that another IOMMU would have exactly the same restrictions as PAMU
 seem smaller in comparison.

 In any case, if you had to deal with some Intel-only quirk, would it
 make sense to call it a type1 attribute?  I'm not advocating one way
 or the other on whether an abstraction is viable here (though Stuart
 seems to think it's highly unlikely anything but a PAMU will comply),
 just that if it is to be abstracted rather than a hardware-specific
 interface, we need to document what is and is not part of the
 abstraction.  Otherwise a non-PAMU-specific user won't know what they
 can rely on, and someone adding support for a new windowed IOMMU won't
 know if theirs is close enough, or they need to introduce a type3.

 So Alexey named the SPAPR IOMMU something related to spapr...
 surprisingly enough.  I'm fine with that.  If you think it's unique
 enough, name it something appropriately.  I haven't seen the code and
 don't know the architecture sufficiently to have an opinion.

The only reason I suggested type 2 is that I thought that was the
convention...we would enumerate different iommus.   I think that
calling it pamu is better and more clear.

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/02/2013 04:38:45 PM, Alex Williamson wrote:

 On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
  On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com
  wrote:
    C.  Explicit mapping using normal DMA map.  The last idea is that
        we would introduce a new ioctl to give user-space an fd to
        the MSI bank, which could be mmapped.  The flow would be
        something like this:
       -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
       -user space mmaps the fd, getting a vaddr
       -user space does a normal DMA map for desired iova
        This approach makes everything explicit, but adds a new ioctl
        applicable most likely only to the PAMU (type2 iommu).
  
    And the DMA_MAP of that mmap then allows userspace to select the window
    used?  This one seems like a lot of overhead, adding a new ioctl, new
    fd, mmap, special mapping path, etc.


    There's going to be special stuff no matter what.  This would keep it
    separated from the IOMMU map code.

    I'm not sure what you mean by overhead here... the runtime overhead of
    setting things up is not particularly relevant as long as it's reasonable.
    If you mean development and maintenance effort, keeping things well
    separated should help.
 
  We don't need to change DMA_MAP.  If we can simply add a new type 2
  ioctl that allows user space to set which windows are MSIs, it seems
  vastly
  less complex than an ioctl to supply a new fd, mmap of it, etc.
 
  So maybe 2 ioctls:
  VFIO_IOMMU_GET_MSI_COUNT


 Do you mean a count of actual MSIs or a count of MSI banks used by the whole
 VFIO group?

I meant # of MSI banks, so VFIO_IOMMU_GET_MSI_BANK_COUNT would be better.

  VFIO_IOMMU_MAP_MSI(iova, size)


 Not sure how you mean size to be used -- for MPIC it would be 4K per bank,
 and you can only map one bank at a time (which bank you're mapping should be
 a parameter, if only so that the kernel doesn't have to keep iteration state
 for you).

The intent was for user space to tell the kernel which windows to use
for MSI.   So I envisioned a total size of window-size * msi-bank-count.

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/03/2013 01:32:26 PM, Stuart Yoder wrote:
On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood scottw...@freescale.com wrote:

 On 04/02/2013 04:38:45 PM, Alex Williamson wrote:

  On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
   VFIO_IOMMU_MAP_MSI(iova, size)

 Not sure how you mean size to be used -- for MPIC it would be 4K per bank,
 and you can only map one bank at a time (which bank you're mapping should
 be a parameter, if only so that the kernel doesn't have to keep iteration
 state for you).

The intent was for user space to tell the kernel which windows to use
for MSI.   So I envisioned a total size of window-size * msi-bank-count.


Size doesn't tell the kernel *which* banks to use, only how many.  If  
it already knows which banks are used by the group, then it also knows  
how many are used.  And size is misleading because the mapping is not  
generally going to be contiguous.


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Stuart Yoder
 Would it be possible for userspace to simply leave room for MSI bank
 mapping (how much room could be determined by something like
 VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
 DMA_MAP starting at the 0x0 address of the aperture, growing up, and
 VFIO will map banks on demand at the top of the aperture, growing down?
 Wouldn't that avoid a lot of issues with userspace needing to know
 anything about MSI banks (other than count) and coordinating irq numbers
 and enabling handlers?

This is basically option #A in the original proposals sent.  I like
this approach, in that it is simpler and keeps user space mostly out
of this...which is consistent with how things are done on x86.  User
space just needs to know how many windows to leave at the top of the
aperture.  The kernel then has the flexibility to use those windows
how it wants.

But one question is: when should the kernel actually map (and unmap)
the MSI banks?  One thing we need to do is enable the aperture...and
current thinking is that is done on the first DMA_MAP.  Similarly,
when the last mapping is unmapped we could also unmap the MSI banks.

Sequence would be something like:

VFIO_GROUP_SET_CONTAINER              // add groups to the container

VFIO_SET_IOMMU(VFIO_FSL_PAMU)         // set iommu model

cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT   // returns max # of MSI banks

VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)    // set overall aperture

VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)     // set # of windows, including MSI banks

VFIO_IOMMU_MAP_DMA                    // map the guest's memory
   --- kernel enables aperture and maps needed MSI banks here

VFIO_DEVICE_SET_IRQS
   --- kernel sets actual MSI addr/data in physical device here (I think)


Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:

 Would it be possible for userspace to simply leave room for MSI bank
 mapping (how much room could be determined by something like
 VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
 DMA_MAP starting at the 0x0 address of the aperture, growing up, and
 VFIO will map banks on demand at the top of the aperture, growing down?
 Wouldn't that avoid a lot of issues with userspace needing to know
 anything about MSI banks (other than count) and coordinating irq numbers
 and enabling handlers?

This is basically option #A in the original proposals sent.  I like
this approach, in that it is simpler and keeps user space mostly out
of this...which is consistent with how things are done on x86.  User
space just needs to know how many windows to leave at the top of the
aperture.  The kernel then has the flexibility to use those windows
how it wants.

But one question is: when should the kernel actually map (and unmap)
the MSI banks?


I think userspace should explicitly request it.  Userspace still  
wouldn't need to know anything but the count:


count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
    VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can update
// PCI config space appropriately.



One thing we need to do is enable the aperture...and current
thinking is that is done on the first DMA_MAP.


What if there are no other mappings required?

-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Alex Williamson
On Wed, 2013-04-03 at 14:09 -0500, Stuart Yoder wrote:
  Would it be possible for userspace to simply leave room for MSI bank
  mapping (how much room could be determined by something like
  VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
  DMA_MAP starting at the 0x0 address of the aperture, growing up, and
  VFIO will map banks on demand at the top of the aperture, growing down?
  Wouldn't that avoid a lot of issues with userspace needing to know
  anything about MSI banks (other than count) and coordinating irq numbers
  and enabling handlers?
 
 This is basically option #A in the original proposals sent.  I like
 this approach, in that it is simpler and keeps user space mostly out
 of this...which is consistent with how things are done on x86.  User
 space just needs to know how many windows to leave at the top of the
 aperture.  The kernel then has the flexibility to use those windows
 how it wants.

 But one question is: when should the kernel actually map (and unmap)
 the MSI banks?  One thing we need to do is enable the aperture...and
 current thinking is that is done on the first DMA_MAP.  Similarly,
 when the last mapping is unmapped we could also unmap the MSI banks.
 
 Sequence would be something like:

 VFIO_GROUP_SET_CONTAINER              // add groups to the container

 VFIO_SET_IOMMU(VFIO_FSL_PAMU)         // set iommu model

 cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT   // returns max # of MSI banks

 VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)    // set overall aperture

 VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)     // set # of windows, including MSI banks

 VFIO_IOMMU_MAP_DMA                    // map the guest's memory
    --- kernel enables aperture and maps needed MSI banks here

 VFIO_DEVICE_SET_IRQS
    --- kernel sets actual MSI addr/data in physical device here (I think)

You could also make use of the IOMMU_ENABLE/DISABLE entry points that
Alexey plans to use.  Ideally I'd think that you'd want to enable the
required MSI banks for a device on DEVICE_SET_IRQs.  That's effectively
what happens on x86.  Perhaps some information stored in the domain
structure would let architecture hooks in MSI setup enable those
mappings for you?  Thanks,

Alex




Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:

 Would it be possible for userspace to simply leave room for MSI bank
 mapping (how much room could be determined by something like
 VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
 DMA_MAP starting at the 0x0 address of the aperture, growing up, and
 VFIO will map banks on demand at the top of the aperture, growing down?
 Wouldn't that avoid a lot of issues with userspace needing to know
 anything about MSI banks (other than count) and coordinating irq numbers
 and enabling handlers?

This is basically option #A in the original proposals sent.  I like
this approach, in that it is simpler and keeps user space mostly out
of this...which is consistent with how things are done on x86.  User
space just needs to know how many windows to leave at the top of the
aperture.  The kernel then has the flexibility to use those windows
how it wants.

But one question is: when should the kernel actually map (and unmap)
the MSI banks?  One thing we need to do is enable the aperture...and
current thinking is that is done on the first DMA_MAP.  Similarly,
when the last mapping is unmapped we could also unmap the MSI banks.

Sequence would be something like:

VFIO_GROUP_SET_CONTAINER              // add groups to the container

VFIO_SET_IOMMU(VFIO_FSL_PAMU)         // set iommu model

cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT   // returns max # of MSI banks

VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)    // set overall aperture

VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)     // set # of windows, including MSI banks

VFIO_IOMMU_MAP_DMA                    // map the guest's memory
   --- kernel enables aperture and maps needed MSI banks here


Maps them where?

What if there is more than one explicit DMA mapping?  What if DMA  
mappings are changed during operation?


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Stuart Yoder
On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:

  Would it be possible for userspace to simply leave room for MSI bank
  mapping (how much room could be determined by something like
  VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
  DMA_MAP starting at the 0x0 address of the aperture, growing up, and
  VFIO will map banks on demand at the top of the aperture, growing down?
  Wouldn't that avoid a lot of issues with userspace needing to know
  anything about MSI banks (other than count) and coordinating irq numbers
  and enabling handlers?

 This is basically option #A in the original proposals sent.  I like
 this approach, in that it is simpler and keeps user space mostly out
 of this...which is consistent with how things are done on x86.  User
 space just needs to know how many windows to leave at the top of the
 aperture.  The kernel then has the flexibility to use those windows
 how it wants.

 But one question is: when should the kernel actually map (and unmap)
 the MSI banks?


 I think userspace should explicitly request it.  Userspace still wouldn't
 need to know anything but the count:

 count = VFIO_IOMMU_GET_MSI_BANK_COUNT
 VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
 VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
 // do other DMA maps now, or later, or not at all, doesn't matter
 for (i = 0; i < count; i++)
 VFIO_IOMMU_MAP_MSI_BANK(iova, i);
 // The kernel now knows where each bank has been mapped, and can update PCI
 // config space appropriately.

And the overall aperture enable/disable would occur on the first
dma/msi map() and last dma/msi unmap()?

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/03/2013 02:43:06 PM, Stuart Yoder wrote:
On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood scottw...@freescale.com  
wrote:

 On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:

  Would is be possible for userspace to simply leave room for MSI  
bank

  mapping (how much room could be determined by something like
  VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that  
userspace can
  DMA_MAP starting at the 0x0 address of the aperture, growing up,  
and
  VFIO will map banks on demand at the top of the aperture,  
growing down?
  Wouldn't that avoid a lot of issues with userspace needing to  
know
  anything about MSI banks (other than count) and coordinating irq  
numbers

  and enabling handlers?

 This is basically option #A in the original proposals sent.   I  
like

 this approach, in that it
 is simpler and keeps user space mostly out of this...which is
 consistent with how things are done
 on x86.  User space just needs to know how many windows to leave at
 the top of the aperture.
 The kernel then has the flexibility to use those windows how it  
wants.


 But one question, is when should the kernel actually map (and  
unmap)

 the MSI banks.


 I think userspace should explicitly request it.  Userspace still  
wouldn't

 need to know anything but the count:

 count = VFIO_IOMMU_GET_MSI_BANK_COUNT
 VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
 VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
 // do other DMA maps now, or later, or not at all, doesn't matter
 for (i = 0; i  count; i++)
 VFIO_IOMMU_MAP_MSI_BANK(iova, i);
 // The kernel now knows where each bank has been mapped, and can  
update PCI

 config space appropriately.

And the overall aperture enable/disable would occur on the first
dma/msi map() and last dma/msi unmap()?


Yes.  We may want the optional ability to do an overall enable/disable  
for reasons we discussed a while ago, but in the absence of an explicit  
disable the domain would be enabled on first map.


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/02/2013 10:37:20 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
 On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
   On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
    C.  Explicit mapping using normal DMA map.  The last idea is that
        we would introduce a new ioctl to give user-space an fd to
        the MSI bank, which could be mmapped.  The flow would be
        something like this:
           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
           -user space mmaps the fd, getting a vaddr
           -user space does a normal DMA map for desired iova
        This approach makes everything explicit, but adds a new ioctl
        applicable most likely only to the PAMU (type2 iommu).

    And the DMA_MAP of that mmap then allows userspace to select the window
    used?  This one seems like a lot of overhead, adding a new ioctl, new
    fd, mmap, special mapping path, etc.

    There's going to be special stuff no matter what.  This would keep it
    separated from the IOMMU map code.

    I'm not sure what you mean by overhead here... the runtime overhead of
    setting things up is not particularly relevant as long as it's reasonable.
    If you mean development and maintenance effort, keeping things well
    separated should help.

   We don't need to change DMA_MAP.  If we can simply add a new type 2
   ioctl that allows user space to set which windows are MSIs, it seems
   vastly less complex than an ioctl to supply a new fd, mmap of it, etc.

   So maybe 2 ioctls:
   VFIO_IOMMU_GET_MSI_COUNT

 Do you mean a count of actual MSIs or a count of MSI banks used by the
 whole VFIO group?

I hope the latter, which would clarify how this is distinct from
DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
dynamically adding a device could bring along additional MSI banks?


I'm not sure -- maybe we could say that hotplug can add banks, but not  
remove them or change the order, so userspace would just need to check  
if the number of banks changed, and map the extras.


The current VFIO MSI support has the host handling everything about MSI.
The user never programs an MSI vector to the physical device, they set
up everything through ioctl.  On interrupt, we simply trigger an eventfd
and leave it to things like KVM irqfd or QEMU to do the right thing in a
virtual machine.

Here the MSI vector has to go through a PAMU window to hit the correct
MSI bank.  So that means it has some component of the iova involved,
which we're proposing here is controlled by userspace (whether that
vector uses an offset from 0x1000 or 0x depending on which
window slot is used to map the MSI bank).  I assume we're still working
in a model where the physical interrupt fires into the host and a
host-based interrupt handler triggers an eventfd, right?


Yes (subject to possible future optimizations).


So that means the vector also has host components so we trigger the
correct ISR.  How is that coordinated?


Everything but the iova component needs to come from the host MSI
allocator.



Would it be possible for userspace to simply leave room for MSI bank
mapping (how much room could be determined by something like
VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
DMA_MAP starting at the 0x0 address of the aperture, growing up, and
VFIO will map banks on demand at the top of the aperture, growing down?
Wouldn't that avoid a lot of issues with userspace needing to know
anything about MSI banks (other than count) and coordinating irq numbers
and enabling handlers?


This would restrict a (possibly unlikely) use case where the user wants  
to map something near the top of the aperture but has another place  
MSIs can go (or is willing to live without MSIs).  Otherwise it could  
be workable, as long as we can require an explicit MSI enabling on a  
device to happen after the aperture and subwindow count are set up.   
I'm not sure it would really buy anything over having userspace iterate  
over the MSI bank count, though -- it would probably be a bit more  
complicated.



  On x86 MSI count is very device specific, which means it would be a
  VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for
  us on x86).  The trouble with it being a device ioctl is that you
  need to get the device FD, but the IOMMU protection needs to be
  established before you can get that... so there's an ordering problem
  if you need it from the device before configuring the IOMMU.  Thanks,

 What do you mean by IOMMU protection needs to be established?
 Wouldn't we just start with no mappings in place?

If no mappings blocks all DMA, sure, that's fine.  Once the VFIO device
FD is accessible by 

Re: RFC: vfio API changes needed for powerpc

2013-04-03 Thread Scott Wood

On 04/02/2013 10:12:31 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
 On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
   On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
On x86 the interrupt remapper handles this transparently when  
MSI
is enabled and userspace never gets direct access to the  
device

  MSI
address/data registers.
  
   x86 has a totally different mechanism here, as far as I  
understand

  --
   even before you get into restrictions on mappings.
 
  So what control will userspace have over programming the actually  
MSI

  vectors on PAMU?

 Not sure what you mean -- PAMU doesn't get explicitly involved in
 MSIs.  It's just another 4K page mapping (per relevant MSI bank).   
If

 you want isolation, you need to make sure that an MSI group is only
 used by one VFIO group, and that you're on a chip that has alias  
pages

 with just one MSI bank register each (newer chips do, but the first
 chip to have a PAMU didn't).

How does a user figure this out?


The user's involvement could be limited to setting a policy knob of  
whether that degree of isolation is required (if required and  
unavailable, all devices using an MSI bank would be forced into the  
same group).  We'd need to do something with MSI allocation so that we  
avoid using an MSI bank with more than one IOMMU group where possible.   
I'm not sure about the details yet, or how practical this is.  There  
might need to be some MSI bank assignment done as part of the VFIO  
device binding process, if there are going to be more VFIO groups than  
there are MSI banks (reserving one bank for host use).


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

Alex,

We are in the process of implementing vfio-pci support for the Freescale
IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
than x86, and will involve creating a 'type 2' vfio implementation.

For each device's DMA mappings, PAMU has an overall aperture and a number
of windows.  All sizes and window counts must be power of 2.  To illustrate,
below is a mapping for a 256MB guest, including guest memory (backed by
64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win #   iova          gphys/phys    size
-----   ----------    ------------  ----
0       0x00000000    0xX_XX000000  64MB
1       0x04000000    0xX_XX000000  64MB
2       0x08000000    0xX_XX000000  64MB
3       0x0C000000    0xX_XX000000  64MB
4       0x10000000    0xf_fe044000  4KB   // msi bank 1
5       0x14000000    0xf_fe045000  4KB   // msi bank 2
6       0x18000000    0xf_fe046000  4KB   // msi bank 3
7       -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface
that we would like your feedback on.

1.  IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   attributes.

   We want to expose that to user space, so envision needing a couple
   of new ioctls to do this:
VFIO_IOMMU_SET_ATTR
VFIO_IOMMU_GET_ATTR


Note that this means attributes need to be updated for user-API  
appropriateness, such as using fixed-size types.



2.   MSI window mappings

   The more problematic question is how to deal with MSIs.  We need to
   create mappings for up to 3 MSI banks that a device may need to target
   to generate interrupts.  The Linux MSI driver can allocate MSIs from
   the 3 banks any way it wants, and currently user space has no way of
   knowing which bank may be used for a given device.

   There are 3 options we have discussed and would like your direction:


   A.  Implicit mappings -- with this approach user space would not
       explicitly map MSIs.  User space would be required to set the
       geometry so that there are 3 unused windows (the last 3 windows)


Where does userspace get the number 3 from?  E.g. on newer chips there
are 4 MSI banks.  Maybe future chips have even more.



   B.  Explicit mapping using DMA map flags.  The idea is that a new
       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
       a mapping is to be created for the supplied iova.  No vaddr
       is given though.  So in the above example there would be a
       dma map at 0x10000000 for 24KB (and no vaddr).


A single 24 KiB mapping wouldn't work (and why 24KB?  What if only one
MSI group is involved in this VFIO group?  What if four MSI groups are
involved?).  You'd need to either have a naturally aligned,
power-of-two sized mapping that covers exactly the pages you want to
map and no more, or you'd need to create a separate mapping for each
MSI bank, and due to PAMU subwindow alignment restrictions these
mappings could not be contiguous in iova-space.



   C.  Explicit mapping using normal DMA map.  The last idea is that
       we would introduce a new ioctl to give user-space an fd to
       the MSI bank, which could be mmapped.  The flow would be
       something like this:
          -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
          -user space mmaps the fd, getting a vaddr
          -user space does a normal DMA map for desired iova
       This approach makes everything explicit, but adds a new ioctl
       applicable most likely only to the PAMU (type2 iommu).


The new ioctl isn't really specific to PAMU (or whatever type2 is
supposed to be, which nobody ever explains when I ask), so much as to
the MSI implementation.  It just exposes the MSI register as another
device resource (well, technically a groupwide resource, unless we
expose it on a per-device basis and provide enough information for
userspace to recognize when it's the same for other devices in the
group) to be mmapped, which userspace can choose to map in the IOMMU
as well.


Note that in the explicit case, userspace would have to program the MSI
iova into the PCI device's config space (or communicate the chosen
address to the kernel so it can set the config space registers).
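For concreteness, option C might look roughly like this from userspace.
VFIO_GROUP_GET_MSI_FD is only the proposed ioctl; the iova value and the
reuse of the existing type1 DMA-map struct are illustrative assumptions:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    /* hypothetical flow for option C; group_fd/container_fd assumed open */
    void map_msi_bank_option_c(int group_fd, int container_fd)
    {
        int msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);  /* proposed ioctl */
        void *va = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, msi_fd, 0);

        struct vfio_iommu_type1_dma_map dm = {
            .argsz = sizeof(dm),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)va,
            .iova  = 0x10000000,          /* illustrative MSI window iova */
            .size  = 4096,
        };
        ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dm);  /* normal DMA map */
    }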


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
Hi Stuart,

On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote:
 Alex,
 
 We are in the process of implementing vfio-pci support for the Freescale
 IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
 than x86, and will involve creating a 'type 2' vfio implementation.
 
 For each device's DMA mappings, PAMU has an overall aperture and a number
 of windows.  All sizes and window counts must be power of 2.  To illustrate,
 below is a mapping for a 256MB guest, including guest memory (backed by
 64MB huge pages) and some windows for MSIs:
 
 Total aperture: 512MB
 # of windows: 8
 
 win #   iova          gphys/phys    size
 -----   ----------    ------------  ----
 0       0x00000000    0xX_XX000000  64MB
 1       0x04000000    0xX_XX000000  64MB
 2       0x08000000    0xX_XX000000  64MB
 3       0x0C000000    0xX_XX000000  64MB
 4       0x10000000    0xf_fe044000  4KB   // msi bank 1
 5       0x14000000    0xf_fe045000  4KB   // msi bank 2
 6       0x18000000    0xf_fe046000  4KB   // msi bank 3
 7       -             -             disabled
 
 There are a couple of updates needed to the vfio user-kernel interface
 that we would like your feedback on.
 
 1.  IOMMU geometry
 
The kernel IOMMU driver now has an interface (see domain_set_attr,
domain_get_attr) that lets us set the domain geometry using
attributes.
 
We want to expose that to user space, so envision needing a couple
of new ioctls to do this:
 VFIO_IOMMU_SET_ATTR
 VFIO_IOMMU_GET_ATTR 

Any ioctls to the vfiofd (/dev/vfio/vfio) not claimed by vfio-core are
passed to the IOMMU driver.  So you can effectively have your own type2
ioctl extensions.  Alexey has already posted patches to do this for
SPAPR that add VFIO_IOMMU_ENABLE/DISABLE to allow him access to
VFIO_IOMMU_GET_INFO to examine locked page requirements.  As Scott notes
we need to come up with a clean userspace interface for these though.

 2.   MSI window mappings
 
The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.   
 
There are 3 options we have discussed and would like your direction:
 
A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)
for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows)
and it potentially gets a little weird-- when should the kernel
actually create the MSI mappings?  When should they be unmapped?
Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
number exposed to userspace on GET and transparently add MSI entries on
SET.  On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device MSI
address/data registers.  What kind of restrictions do you have around
adding and removing windows while the aperture is enabled?

     B.  Explicit mapping using DMA map flags.  The idea is that a new
         flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
         a mapping is to be created for the supplied iova.  No vaddr
         is given though.  So in the above example there would be a
         dma map at 0x10000000 for 24KB (and no vaddr).   It's
         up to the kernel to determine which bank gets mapped where.
         So, this option puts user space in control of which windows
         are used for MSIs and when MSIs are mapped/unmapped.   There
         would need to be some semantics as to how this is used-- it
         only makes sense

This could also be done as another type2 ioctl extension.  What's the
value to userspace in determining which windows are used by which banks?
It sounds like the case that there are X banks and if userspace wants to
use MSI it needs to leave X windows available for that.  Is this just
buying userspace a few more windows to allow them the choice between MSI
or RAM?
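
A sketch of what option B might look like from userspace, assuming the
proposed VFIO_DMA_MAP_FLAG_MSI flag were added to the existing
type1-style DMA map call (the flag value is a placeholder; the struct
and VFIO_IOMMU_MAP_DMA ioctl are the existing linux/vfio.h interface):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* VFIO_DMA_MAP_FLAG_MSI is the *proposed* flag -- this value is a
 * placeholder, not a real define. */
#define VFIO_DMA_MAP_FLAG_MSI (1 << 2)

static int map_msi_windows(int container_fd)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_MSI;  /* note: no vaddr supplied */
        map.iova  = 0x10000000;             /* windows 4-6 in the example */
        map.size  = 3 * 4096;               /* one 4KB page per MSI bank */

        /* the kernel picks which MSI bank lands in which window
         * (Scott notes below that this exact contiguous layout trips
         * over PAMU subwindow alignment restrictions) */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}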

C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to 
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).
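
A sketch of the option C flow described above -- VFIO_GROUP_GET_MSI_FD is
the proposed (hypothetical) ioctl, its number made up here, while the mmap
and DMA map steps use the existing interface:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

#define VFIO_GROUP_GET_MSI_FD _IO(VFIO_TYPE, VFIO_BASE + 30) /* hypothetical */

static int map_msi_bank(int group_fd, int container_fd, uint64_t iova)
{
        struct vfio_iommu_type1_dma_map map;
        void *vaddr;
        int msi_fd;

        msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);
        if (msi_fd < 0)
                return -1;

        /* one 4KB MSI bank register page */
        vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                     msi_fd, 0);
        if (vaddr == MAP_FAILED)
                return -1;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;   /* vaddr from the mmap above */
        map.iova  = iova;               /* user-chosen window iova */
        map.size  = 4096;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}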

And the DMA_MAP of that mmap then allows 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

 Alex,

 We are in the process of implementing vfio-pci support for the Freescale
 IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
 than x86, and will involve creating a 'type 2' vfio implementation.

 For each device's DMA mappings, PAMU has an overall aperture and a number
 of windows.  All sizes and window counts must be power of 2.  To
 illustrate, below is a mapping for a 256MB guest, including guest memory
 (backed by 64MB huge pages) and some windows for MSIs:

 Total aperture: 512MB
 # of windows: 8

 win   gphys/
 #     iova          phys           size
 ---   -----------   ------------   --------
 0     0x00000000    0xX_XX000000   64MB
 1     0x04000000    0xX_XX000000   64MB
 2     0x08000000    0xX_XX000000   64MB
 3     0x0C000000    0xX_XX000000   64MB
 4     0x10000000    0xf_fe044000   4KB    // msi bank 1
 5     0x14000000    0xf_fe045000   4KB    // msi bank 2
 6     0x18000000    0xf_fe046000   4KB    // msi bank 3
 7     -             -              disabled

 There are a couple of updates needed to the vfio user-kernel interface
 that we would like your feedback on.

 1.  IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr,
domain_get_attr) that lets us set the domain geometry using
attributes.

We want to expose that to user space, so envision needing a couple
of new ioctls to do this:
 VFIO_IOMMU_SET_ATTR
 VFIO_IOMMU_GET_ATTR


 Note that this means attributes need to be updated for user-API
 appropriateness, such as using fixed-size types.


 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)


 Where does userspace get the number 3 from?  E.g. on newer chips there are
 4 MSI banks.  Maybe future chips have even more.

Ok, then make the number 4.   The chance of more MSI banks in future chips
is nil, and if it ever happened user space could adjust.  Also, practically
speaking, since memory is typically allocated in powers of 2, you need to
approximately double the window geometry anyway.

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
 is given though.  So in the above example there would be
 a dma map at 0x10000000 for 24KB (and no vaddr).


 A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
 group is involved in this VFIO group?  What if four MSI groups are
 involved?).  You'd need to either have a naturally aligned, power-of-two
 sized mapping that covers exactly the pages you want to map and no more, or
 you'd need to create a separate mapping for each MSI bank, and due to PAMU
 subwindow alignment restrictions these mappings could not be contiguous in
 iova-space.

You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI banks
perhaps we could just do one 64MB*3 mapping to identify which windows
are used for MSIs.

If only one MSI bank was involved the kernel could get clever and only enable
the banks actually needed.

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:
  On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

   Alex,

   We are in the process of implementing vfio-pci support for the Freescale
   IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite
   different than x86, and will involve creating a 'type 2' vfio
   implementation.

   For each device's DMA mappings, PAMU has an overall aperture and a
   number of windows.  All sizes and window counts must be power of 2.
   To illustrate, below is a mapping for a 256MB guest, including guest
   memory (backed by 64MB huge pages) and some windows for MSIs:

   Total aperture: 512MB
   # of windows: 8

   win   gphys/
   #     iova          phys           size
   ---   -----------   ------------   --------
   0     0x00000000    0xX_XX000000   64MB
   1     0x04000000    0xX_XX000000   64MB
   2     0x08000000    0xX_XX000000   64MB
   3     0x0C000000    0xX_XX000000   64MB
   4     0x10000000    0xf_fe044000   4KB    // msi bank 1
   5     0x14000000    0xf_fe045000   4KB    // msi bank 2
   6     0x18000000    0xf_fe046000   4KB    // msi bank 3
   7     -             -              disabled

   There are a couple of updates needed to the vfio user-kernel interface
   that we would like your feedback on.

   1.  IOMMU geometry

      The kernel IOMMU driver now has an interface (see domain_set_attr,
      domain_get_attr) that lets us set the domain geometry using
      attributes.

      We want to expose that to user space, so envision needing a couple
      of new ioctls to do this:
         VFIO_IOMMU_SET_ATTR
         VFIO_IOMMU_GET_ATTR

  Note that this means attributes need to be updated for user-API
  appropriateness, such as using fixed-size types.

   2.   MSI window mappings

      The more problematic question is how to deal with MSIs.  We need to
      create mappings for up to 3 MSI banks that a device may need to
      target to generate interrupts.  The Linux MSI driver can allocate
      MSIs from the 3 banks any way it wants, and currently user space has
      no way of knowing which bank may be used for a given device.

      There are 3 options we have discussed and would like your direction:

      A.  Implicit mappings -- with this approach user space would not
          explicitly map MSIs.  User space would be required to set the
          geometry so that there are 3 unused windows (the last 3 windows)

  Where does userspace get the number 3 from?  E.g. on newer chips there
  are 4 MSI banks.  Maybe future chips have even more.

 Ok, then make the number 4.   The chance of more MSI banks in future
 chips is nil,

What makes you so sure?  Especially since you seem to be presenting this
as not specifically an MPIC API.

 and if it ever happened user space could adjust.

What bit of API is going to tell it that it needs to adjust?

 Also, practically speaking, since memory is typically allocated in powers
 of 2, you need to approximately double the window geometry anyway.

Only if your existing mapping needs fit exactly in a power of two.

      B.  Explicit mapping using DMA map flags.  The idea is that a new
          flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
          a mapping is to be created for the supplied iova.  No vaddr
          is given though.  So in the above example there would be
          a dma map at 0x10000000 for 24KB (and no vaddr).

  A single 24 KiB mapping wouldn't work (and why 24KB? What if only one
  MSI group is involved in this VFIO group?  What if four MSI groups are
  involved?).  You'd need to either have a naturally aligned, power-of-two
  sized mapping that covers exactly the pages you want to map and no more,
  or you'd need to create a separate mapping for each MSI bank, and due to
  PAMU subwindow alignment restrictions these mappings could not be
  contiguous in iova-space.

 You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI
 banks perhaps we could just do one 64MB*3 mapping to identify which
 windows are used for MSIs.

Where did the assumption of a 64MiB subwindow size come from?

 If only one MSI bank was involved the kernel could get clever and only
 enable the banks actually needed.

I'd rather see cleverness kept in userspace.

-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson alex.william...@redhat.com wrote:
 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)
for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows)
and it potentially gets a little weird-- when should the kernel
actually create the MSI mappings?  When should they be unmapped?
Some convention would need to be established.

 VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
 number exposed to userspace on GET and transparently add MSI entries on
 SET.

The number of windows is always power of 2 (and max is 256).  And to reduce
PAMU cache pressure you want to use the fewest number of windows you can.
So, I don't see practically how we could transparently steal entries to
add the MSIs.  Either user space knows to leave empty windows for MSIs and
by convention the kernel knows which windows those are (as in option #A)
or explicitly tell the kernel which windows (as in option #B).

 On x86 the interrupt remapper handles this transparently when MSI
 is enabled and userspace never gets direct access to the device MSI
 address/data registers.  What kind of restrictions do you have around
 adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is
enabled (pretty sure)...

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).   It's
up to the kernel to determine which bank gets mapped where.
So, this option puts user space in control of which windows
are used for MSIs and when MSIs are mapped/unmapped.   There
would need to be some semantics as to how this is used-- it
only makes sense

 This could also be done as another type2 ioctl extension.  What's the
 value to userspace in determining which windows are used by which banks?
 It sounds like the case that there are X banks and if userspace wants to
 use MSI it needs to leave X windows available for that.  Is this just
 buying userspace a few more windows to allow them the choice between MSI
 or RAM?

Yes, it would potentially give user space the flexibility of some more windows.
It also makes more explicit when the MSI mappings are created.  In option
#A the MSI mappings would probably get created at the time of the first
normal DMA map.

So, you're saying with this approach you'd rather see a new type 2
ioctl instead of adding new flags to DMA map, right?

C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).

 And the DMA_MAP of that mmap then allows userspace to select the window
 used?  This one seems like a lot of overhead, adding a new ioctl, new
 fd, mmap, special mapping path, etc.  It would be less overhead to just
 add an ioctl to enable MSI, maybe letting userspace pick which windows
 get used, but I'm still not sure what the value is to userspace in
 exposing it.  Thanks,

Thanks,
Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
 On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
  2.   MSI window mappings

     The more problematic question is how to deal with MSIs.  We need to
     create mappings for up to 3 MSI banks that a device may need to target
     to generate interrupts.  The Linux MSI driver can allocate MSIs from
     the 3 banks any way it wants, and currently user space has no way of
     knowing which bank may be used for a given device.

     There are 3 options we have discussed and would like your direction:

     A.  Implicit mappings -- with this approach user space would not
         explicitly map MSIs.  User space would be required to set the
         geometry so that there are 3 unused windows (the last 3 windows)
         for MSIs, and it would be up to the kernel to create the mappings.
         This approach requires some specific semantics (leaving 3 windows)
         and it potentially gets a little weird-- when should the kernel
         actually create the MSI mappings?  When should they be unmapped?
         Some convention would need to be established.

 VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
 number exposed to userspace on GET and transparently add MSI entries on
 SET.

What do you mean by reduce the number exposed?  Userspace decides how
many entries there are, but it must be a power of two between 1 and 256.

 On x86 the interrupt remapper handles this transparently when MSI
 is enabled and userspace never gets direct access to the device MSI
 address/data registers.

x86 has a totally different mechanism here, as far as I understand --
even before you get into restrictions on mappings.

 What kind of restrictions do you have around
 adding and removing windows while the aperture is enabled?

Subwindows can be modified while the aperture is enabled, but the
aperture size and number of subwindows cannot be changed.

     B.  Explicit mapping using DMA map flags.  The idea is that a new
         flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
         a mapping is to be created for the supplied iova.  No vaddr
         is given though.  So in the above example there would be
         a dma map at 0x10000000 for 24KB (and no vaddr).   It's
         up to the kernel to determine which bank gets mapped where.
         So, this option puts user space in control of which windows
         are used for MSIs and when MSIs are mapped/unmapped.   There
         would need to be some semantics as to how this is used-- it
         only makes sense

 This could also be done as another type2 ioctl extension.

Again, what is type2, specifically?  If someone else is adding their
own IOMMU that is kind of, sort of like PAMU, how would they know if
it's close enough?  What assumptions can a user make when they see that
they're dealing with type2?

 What's the value to userspace in determining which windows are used
 by which banks?

That depends on who programs the MSI config space address.  What is
important is userspace controlling which iovas will be dedicated to
this, in case it wants to put something else there.

 It sounds like the case that there are X banks and if userspace wants
 to use MSI it needs to leave X windows available for that.  Is this just
 buying userspace a few more windows to allow them the choice between
 MSI or RAM?

Well, there could be that.  But also, userspace will generally have a
much better idea of the type of mappings it's creating, so it's easier
to keep everything explicit at the kernel/user interface than require
more complicated code in the kernel to figure things out automatically
(not just for MSIs but in general).

If the kernel automatically creates the MSI mappings, when does it
assume that userspace is done creating its own?  What if userspace
doesn't need any DMA other than the MSIs?  What if userspace wants to
continue dynamically modifying its other mappings?

     C.  Explicit mapping using normal DMA map.  The last idea is that
         we would introduce a new ioctl to give user-space an fd to
         the MSI bank, which could be mmapped.  The flow would be
         something like this:
            -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
            -user space mmaps the fd, getting a vaddr
            -user space does a normal DMA map for desired iova
         This approach makes everything explicit, but adds a new ioctl
         applicable most likely only to the PAMU (type2 iommu).

 And the DMA_MAP of that mmap then allows userspace to select the window
 used?  This one seems like a lot of overhead, adding a new ioctl, new
 fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what.  This would keep it
separated from the IOMMU map code.

I'm not sure what you mean by overhead here... the runtime overhead of
setting things up is not particularly relevant as long as it's reasonable.
If you mean development and maintenance effort, keeping things well
separated should help.

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:47 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:

  On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:
   On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

    Alex,

    We are in the process of implementing vfio-pci support for the Freescale
    IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite
    different than x86, and will involve creating a 'type 2' vfio
    implementation.

    For each device's DMA mappings, PAMU has an overall aperture and a
    number of windows.  All sizes and window counts must be power of 2.
    To illustrate, below is a mapping for a 256MB guest, including guest
    memory (backed by 64MB huge pages) and some windows for MSIs:

    Total aperture: 512MB
    # of windows: 8

    win   gphys/
    #     iova          phys           size
    ---   -----------   ------------   --------
    0     0x00000000    0xX_XX000000   64MB
    1     0x04000000    0xX_XX000000   64MB
    2     0x08000000    0xX_XX000000   64MB
    3     0x0C000000    0xX_XX000000   64MB
    4     0x10000000    0xf_fe044000   4KB    // msi bank 1
    5     0x14000000    0xf_fe045000   4KB    // msi bank 2
    6     0x18000000    0xf_fe046000   4KB    // msi bank 3
    7     -             -              disabled

    There are a couple of updates needed to the vfio user-kernel interface
    that we would like your feedback on.

    1.  IOMMU geometry

       The kernel IOMMU driver now has an interface (see domain_set_attr,
       domain_get_attr) that lets us set the domain geometry using
       attributes.

       We want to expose that to user space, so envision needing a couple
       of new ioctls to do this:
          VFIO_IOMMU_SET_ATTR
          VFIO_IOMMU_GET_ATTR

   Note that this means attributes need to be updated for user-API
   appropriateness, such as using fixed-size types.

    2.   MSI window mappings

       The more problematic question is how to deal with MSIs.  We need to
       create mappings for up to 3 MSI banks that a device may need to
       target to generate interrupts.  The Linux MSI driver can allocate
       MSIs from the 3 banks any way it wants, and currently user space
       has no way of knowing which bank may be used for a given device.

       There are 3 options we have discussed and would like your direction:

       A.  Implicit mappings -- with this approach user space would not
           explicitly map MSIs.  User space would be required to set the
           geometry so that there are 3 unused windows (the last 3 windows)

   Where does userspace get the number 3 from?  E.g. on newer chips there
   are 4 MSI banks.  Maybe future chips have even more.

  Ok, then make the number 4.   The chance of more MSI banks in future
  chips is nil,

 What makes you so sure?  Especially since you seem to be presenting this
 as not specifically an MPIC API.

  and if it ever happened user space could adjust.

 What bit of API is going to tell it that it needs to adjust?

Haven't thought through that completely, but I guess we could add an API
to return the number of MSI banks for type 2 iommus.

  Also, practically speaking, since memory is typically allocated in powers
  of 2, you need to approximately double the window geometry anyway.

 Only if your existing mapping needs fit exactly in a power of two.

       B.  Explicit mapping using DMA map flags.  The idea is that a new
           flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
           a mapping is to be created for the supplied iova.  No vaddr
           is given though.  So in the above example there would be
           a dma map at 0x10000000 for 24KB (and no vaddr).

   A single 24 KiB mapping wouldn't work (and why 24KB? What if only one
   MSI group is involved in this VFIO group?  What if four MSI groups are
   involved?).  You'd need to either have a naturally aligned, power-of-two
   sized mapping that covers exactly the pages you want to map and no more,
   or you'd need to create a separate mapping for each MSI bank, and due to
   PAMU subwindow alignment restrictions these mappings could not be
   contiguous in iova-space.

  You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI
  banks perhaps we could just do one 64MB*3 mapping to identify which
  windows are used for MSIs.

 Where did the assumption of a 64MiB subwindow size come from?

The example I was using.   User space would need to create a
mapping for window_size * msi_bank_count.
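
To make the arithmetic concrete, a minimal sketch of how user space might
size such a mapping, using the example geometry (64MB windows, 3 banks --
both values come from the example above, not from the API):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t window_size    = 64ULL << 20;  /* 64MB subwindows */
        uint64_t msi_bank_count = 3;            /* chip-specific */

        /* one subwindow per MSI bank, contiguous after the RAM windows */
        uint64_t msi_iova = 4 * window_size;    /* 0x10000000 */
        uint64_t msi_size = msi_bank_count * window_size;

        printf("map MSIs at iova 0x%llx, size 0x%llx\n",
               (unsigned long long)msi_iova, (unsigned long long)msi_size);
        return 0;
}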

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
 This could also be done as another type2 ioctl extension.


 Again, what is type2, specifically?  If someone else is adding their own
 IOMMU that is kind of, sort of like PAMU, how would they know if it's close
 enough?  What assumptions can a user make when they see that they're dealing
 with type2?

We will define that as part of the type2 implementation.   Highly unlikely
anything but a PAMU will comply.

 What's the value to userspace in determining which windows are used by
 which banks?


 That depends on who programs the MSI config space address.  What is
 important is userspace controlling which iovas will be dedicated to this, in
 case it wants to put something else there.


 It sounds like the case that there are X banks and if userspace wants to
 use MSI it needs to leave X windows available for that.  Is this just
 buying userspace a few more windows to allow them the choice between MSI
 or RAM?


 Well, there could be that.  But also, userspace will generally have a much
 better idea of the type of mappings it's creating, so it's easier to keep
 everything explicit at the kernel/user interface than require more
 complicated code in the kernel to figure things out automatically (not just
 for MSIs but in general).

 If the kernel automatically creates the MSI mappings, when does it assume
 that userspace is done creating its own?  What if userspace doesn't need any
 DMA other than the MSIs?  What if userspace wants to continue dynamically
 modifying its other mappings?


 C.  Explicit mapping using normal DMA map.  The last idea is that
 we would introduce a new ioctl to give user-space an fd to
 the MSI bank, which could be mmapped.  The flow would be
 something like this:
-for each group user space calls new ioctl
  VFIO_GROUP_GET_MSI_FD
-user space mmaps the fd, getting a vaddr
-user space does a normal DMA map for desired iova
 This approach makes everything explicit, but adds a new ioctl
 applicable most likely only to the PAMU (type2 iommu).

 And the DMA_MAP of that mmap then allows userspace to select the window
 used?  This one seems like a lot of overhead, adding a new ioctl, new
 fd, mmap, special mapping path, etc.


 There's going to be special stuff no matter what.  This would keep it
 separated from the IOMMU map code.

 I'm not sure what you mean by overhead here... the runtime overhead of
 setting things up is not particularly relevant as long as it's reasonable.
 If you mean development and maintenance effort, keeping things well
 separated should help.

We don't need to change DMA_MAP.  If we can simply add a new type 2
ioctl that allows user space to set which windows are MSIs, it seems vastly
less complex than an ioctl to supply a new fd, mmap of it, etc.

So maybe 2 ioctls:
VFIO_IOMMU_GET_MSI_COUNT
VFIO_IOMMU_MAP_MSI(iova, size)
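
For illustration, the two ioctls might look something like this from user
space -- the argument struct and ioctl numbers below are invented; only the
names come from the proposal:

#include <linux/types.h>
#include <linux/vfio.h>

struct vfio_iommu_msi_map {
        __u32 argsz;
        __u32 flags;
        __u64 iova;     /* start of the windows to dedicate to MSIs */
        __u64 size;     /* e.g. window_size * msi_bank_count */
};

#define VFIO_IOMMU_GET_MSI_COUNT _IO(VFIO_TYPE, VFIO_BASE + 21)
#define VFIO_IOMMU_MAP_MSI       _IOW(VFIO_TYPE, VFIO_BASE + 22, \
                                      struct vfio_iommu_msi_map)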

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson alex.william...@redhat.com wrote:
  2.   MSI window mappings
 
 The more problematic question is how to deal with MSIs.  We need to
 create mappings for up to 3 MSI banks that a device may need to target
 to generate interrupts.  The Linux MSI driver can allocate MSIs from
 the 3 banks any way it wants, and currently user space has no way of
 knowing which bank may be used for a given device.
 
 There are 3 options we have discussed and would like your direction:
 
 A.  Implicit mappings -- with this approach user space would not
 explicitly map MSIs.  User space would be required to set the
 geometry so that there are 3 unused windows (the last 3 windows)
 for MSIs, and it would be up to the kernel to create the mappings.
 This approach requires some specific semantics (leaving 3 windows)
 and it potentially gets a little weird-- when should the kernel
 actually create the MSI mappings?  When should they be unmapped?
 Some convention would need to be established.
 
  VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
  number exposed to userspace on GET and transparently add MSI entries on
  SET.
 
 The number of windows is always power of 2 (and max is 256).  And to reduce
 PAMU cache pressure you want to use the fewest number of windows you can.
 So, I don't see practically how we could transparently steal entries to
 add the MSIs.  Either user space knows to leave empty windows for MSIs and
 by convention the kernel knows which windows those are (as in option #A)
 or explicitly tell the kernel which windows (as in option #B).

Ok, apparently I don't understand the API.  Is it something like
userspace calls GET_ATTR and finds out that there are 256 available
windows, userspace determines that it needs 8 for RAM and then it has an
MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
prone to exploitation by the first userspace to allocate its aperture,
but I'm also not sure why userspace couldn't specify the (non-power of 2)
number of windows it needs for RAM, then VFIO would see that the devices
attached have MSI and add those windows and align to a power of 2.

  On x86 the interrupt remapper handles this transparently when MSI
  is enabled and userspace never gets direct access to the device MSI
  address/data registers.  What kind of restrictions do you have around
  adding and removing windows while the aperture is enabled?
 
 The windows can be enabled/disabled even while the aperture is
 enabled (pretty sure)...
 
 B.  Explicit mapping using DMA map flags.  The idea is that a new
 flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
 a mapping is to be created for the supplied iova.  No vaddr
 is given though.  So in the above example there would be
 a dma map at 0x10000000 for 24KB (and no vaddr).   It's
 up to the kernel to determine which bank gets mapped where.
 So, this option puts user space in control of which windows
 are used for MSIs and when MSIs are mapped/unmapped.   There
 would need to be some semantics as to how this is used-- it
 only makes sense
 
  This could also be done as another type2 ioctl extension.  What's the
  value to userspace in determining which windows are used by which banks?
  It sounds like the case that there are X banks and if userspace wants to
  use MSI it needs to leave X windows available for that.  Is this just
  buying userspace a few more windows to allow them the choice between MSI
  or RAM?
 
 Yes, it would potentially give user space the flexibility of some more windows.
 It also makes more explicit when the MSI mappings are created.  In option
 #A the MSI mappings would probably get created at the time of the first
 normal DMA map.
 
 So, you're saying with this approach you'd rather see a new type 2
 ioctl instead of adding new flags to DMA map, right?

I'm not sure I know enough yet to have a suggestion.  What would be the
purpose of userspace specifying the iova and size here?  If userspace
just needs to know that it needs X additional windows for MSI and can tell
the kernel to use banks 0 through (X-1) for MSI, that sounds more like
an ioctl interface than a DMA_MAP flag.  Thanks,

Alex

 C.  Explicit mapping using normal DMA map.  The last idea is that
 we would introduce a new ioctl to give user-space an fd to
 the MSI bank, which could be mmapped.  The flow would be
 something like this:
-for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
-user space mmaps the fd, getting a vaddr
-user space does a normal DMA map for desired iova
 This approach makes everything explicit, but adds a new ioctl
 applicable most likely only to the PAMU (type2 iommu).
 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
 On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote:
   2.   MSI window mappings
  
  The more problematic question is how to deal with MSIs.  We need  
  to
  create mappings for up to 3 MSI banks that a device may need to  
  target
  to generate interrupts.  The Linux MSI driver can allocate MSIs  
  from
  the 3 banks any way it wants, and currently user space has no  
  way of
  knowing which bank may be used for a given device.
  
  There are 3 options we have discussed and would like your  
  direction:
  
  A.  Implicit mappings -- with this approach user space would not
  explicitly map MSIs.  User space would be required to set the
  geometry so that there are 3 unused windows (the last 3  
  windows)
  for MSIs, and it would be up to the kernel to create the  
  mappings.
  This approach requires some specific semantics (leaving 3  
  windows)
  and it potentially gets a little weird-- when should the  
  kernel
  actually create the MSI mappings?  When should they be  
  unmapped?
  Some convention would need to be established.
  
  VFIO would have control of SET/GET_ATTR, right?  So we could reduce  
  the
  number exposed to userspace on GET and transparently add MSI entries  
  on
  SET.
 
 What do you mean by reduce the number exposed?  Userspace decides how  
 many entries there are, but it must be a power of two beteen 1 and 256.

I didn't understand the API.

  On x86 the interrupt remapper handles this transparently when MSI
  is enabled and userspace never gets direct access to the device MSI
  address/data registers.
 
 x86 has a totally different mechanism here, as far as I understand --  
 even before you get into restrictions on mappings.

So what control will userspace have over programming the actually MSI
vectors on PAMU?

  What kind of restrictions do you have around
  adding and removing windows while the aperture is enabled?
 
 Subwindows can be modified while the aperture is enabled, but the  
 aperture size and number of subwindows cannot be changed.
 
  B.  Explicit mapping using DMA map flags.  The idea is that a new
  flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
  a mapping is to be created for the supplied iova.  No vaddr
  is given though.  So in the above example there would be a
  a dma map at 0x1000 for 24KB (and no vaddr).   It's
  up to the kernel to determine which bank gets mapped where.
  So, this option puts user space in control of which windows
  are used for MSIs and when MSIs are mapped/unmapped.   There
  would need to be some semantics as to how this is used-- it
  only makes sense
  
  This could also be done as another type2 ioctl extension.
 
 Again, what is type2, specifically?  If someone else is adding their  
 own IOMMU that is kind of, sort of like PAMU, how would they know if  
 it's close enough?  What assumptions can a user make when they see that  
 they're dealing with type2?

Naming always has and always will be a problem.  I assume this is named
type2 rather than PAMU because it's trying to expose a generic windowed
IOMMU fitting the IOMMU API.  Like type1, it doesn't really make sense
to name it IOMMU API because that's a kernel internal interface and
we're designing a userspace interface that just happens to use that.
Tagging it to a piece of hardware makes it less reusable.  Type1 is
arbitrary.  It might as well be named brown and this one can be
blue.

  What's the value to userspace in determining which windows are used  
  by which banks?
 
 That depends on who programs the MSI config space address.  What is  
 important is userspace controlling which iovas will be dedicated to  
 this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed
iova?  But an iova selects a window and I thought there were some number
of MSI banks and we don't really know which ones we'll need...  still
confused.

  It sounds like the case that there are X banks and if userspace wants  
  to
  use MSI it needs to leave X windows available for that.  Is this just
  buying userspace a few more windows to allow them the choice between  
  MSI
  or RAM?
 
 Well, there could be that.  But also, userspace will generally have a  
 much better idea of the type of mappings it's creating, so it's easier  
 to keep everything explicit at the kernel/user interface than require  
 more complicated code in the kernel to figure things out automatically  
 (not just for MSIs but in general).
 
 If the kernel automatically creates the MSI mappings, when does it  
 assume that userspace is done creating its own?  What if userspace  
 doesn't need any DMA other than the MSIs?  What if userspace wants to  
 continue 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
  C.  Explicit mapping using normal DMA map.  The last idea is that
  we would introduce a new ioctl to give user-space an fd to
  the MSI bank, which could be mmapped.  The flow would be
  something like this:
 -for each group user space calls new ioctl
   VFIO_GROUP_GET_MSI_FD
 -user space mmaps the fd, getting a vaddr
 -user space does a normal DMA map for desired iova
  This approach makes everything explicit, but adds a new ioctl
  applicable most likely only to the PAMU (type2 iommu).
 
  And the DMA_MAP of that mmap then allows userspace to select the window
  used?  This one seems like a lot of overhead, adding a new ioctl, new
  fd, mmap, special mapping path, etc.
 
 
  There's going to be special stuff no matter what.  This would keep it
  separated from the IOMMU map code.
 
  I'm not sure what you mean by overhead here... the runtime overhead of
  setting things up is not particularly relevant as long as it's reasonable.
  If you mean development and maintenance effort, keeping things well
  separated should help.
 
 We don't need to change DMA_MAP.  If we can simply add a new type 2
 ioctl that allows user space to set which windows are MSIs, it seems vastly
 less complex than an ioctl to supply a new fd, mmap of it, etc.
 
 So maybe 2 ioctls:
 VFIO_IOMMU_GET_MSI_COUNT
 VFIO_IOMMU_MAP_MSI(iova, size)
 

How are MSIs related to devices on PAMU?  On x86 MSI count is very
device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
being a device ioctl is that you need to get the device FD, but the
IOMMU protection needs to be established before you can get that... so
there's an ordering problem if you need it from the device before
configuring the IOMMU.  Thanks,

Alex
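
The ordering Alex refers to is the standard VFIO setup sequence -- a
minimal sketch using the existing API (group number and device name are
examples; error handling omitted).  The device fd, and thus any
VFIO_DEVICE_* ioctl, only becomes reachable after an IOMMU model is set:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_setup_order(void)
{
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);      /* group 26: example */

        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); /* or a "type2" */

        /* too late if MSI bank info were needed to size the aperture: */
        return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0001:03:00.0");
}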



Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:08:27 PM, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
   This could also be done as another type2 ioctl extension.

  Again, what is type2, specifically?  If someone else is adding their own
  IOMMU that is kind of, sort of like PAMU, how would they know if it's
  close enough?  What assumptions can a user make when they see that
  they're dealing with type2?

 We will define that as part of the type2 implementation.   Highly unlikely
 anything but a PAMU will comply.

So then why not just call it pamu instead of being obfuscatory?

  There's going to be special stuff no matter what.  This would keep it
  separated from the IOMMU map code.

  I'm not sure what you mean by overhead here... the runtime overhead of
  setting things up is not particularly relevant as long as it's reasonable.
  If you mean development and maintenance effort, keeping things well
  separated should help.

 We don't need to change DMA_MAP.  If we can simply add a new type 2
 ioctl that allows user space to set which windows are MSIs,

And what specifically does that ioctl do?  It causes new mappings to be
created, right?  So you're changing (or at least adding to) the DMA map
mechanism.

 it seems vastly less complex than an ioctl to supply a new fd, mmap
 of it, etc.

I don't see enough complexity in the mmap approach for anything to be
vastly less complex in comparison.  I think you're building the mmap
approach up in your head to be a lot worse than it would actually be.

-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
 On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
  The number of windows is always power of 2 (and max is 256).  And to
  reduce PAMU cache pressure you want to use the fewest number of windows
  you can.    So, I don't see practically how we could transparently steal
  entries to add the MSIs.  Either user space knows to leave empty windows
  for MSIs and by convention the kernel knows which windows those are (as
  in option #A) or explicitly tell the kernel which windows (as in
  option #B).

 Ok, apparently I don't understand the API.  Is it something like
 userspace calls GET_ATTR and finds out that there are 256 available
 windows, userspace determines that it needs 8 for RAM and then it has an
 MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
 prone to exploitation by the first userspace to allocate its aperture,

What exploitation?

It's not as if there is a pool of 256 global windows that users
allocate from.  The subwindow count is just how finely divided the
aperture is.  The only way one user will affect another is through
cache contention (which is why we want the minimum number of subwindows
that we can get away with).

 but I'm also not sure why userspace couldn't specify the (non-power of 2)
 number of windows it needs for RAM, then VFIO would see that the devices
 attached have MSI and add those windows and align to a power of 2.

If you double the subwindow count without userspace knowing, you have
to double the aperture as well (and you may need to grow up or down
depending on alignment).  This means you also need to halve the maximum
aperture that userspace can request.  And you need to expose a
different number of maximum subwindows in the IOMMU API based on
whether we might have MSIs of this type.  It's ugly and awkward, and
removes the possibility for userspace to place the MSIs in some unused
slot in the middle, or not use MSIs at all.


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
 On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
  On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
   On x86 the interrupt remapper handles this transparently when MSI
   is enabled and userspace never gets direct access to the device MSI
   address/data registers.

  x86 has a totally different mechanism here, as far as I understand --
  even before you get into restrictions on mappings.

 So what control will userspace have over programming the actual MSI
 vectors on PAMU?

Not sure what you mean -- PAMU doesn't get explicitly involved in
MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If
you want isolation, you need to make sure that an MSI group is only
used by one VFIO group, and that you're on a chip that has alias pages
with just one MSI bank register each (newer chips do, but the first
chip to have a PAMU didn't).

   This could also be done as another type2 ioctl extension.

  Again, what is type2, specifically?  If someone else is adding their
  own IOMMU that is kind of, sort of like PAMU, how would they know if
  it's close enough?  What assumptions can a user make when they see
  that they're dealing with type2?

 Naming always has and always will be a problem.  I assume this is named
 type2 rather than PAMU because it's trying to expose a generic windowed
 IOMMU fitting the IOMMU API.

But how closely is the MSI situation related to a generic windowed
IOMMU, then?  We could just as well have a highly flexible IOMMU in
terms of arbitrary 4K page mappings, but still handle MSIs as pages to
be mapped rather than a translation table.  Or we could have a windowed
IOMMU that has an MSI translation table.

 Like type1, it doesn't really make sense
 to name it IOMMU API because that's a kernel internal interface and
 we're designing a userspace interface that just happens to use that.
 Tagging it to a piece of hardware makes it less reusable.

Well, that's my point.  Is it reusable at all, anyway?  If not, then
giving it a more obscure name won't change that.  If it is reusable,
then where is the line drawn between things that are PAMU-specific or
MPIC-specific and things that are part of the generic windowed IOMMU
abstraction?

 Type1 is arbitrary.  It might as well be named brown and this one
 can be blue.

The difference is that type1 seems to refer to hardware that can do
arbitrary 4K page mappings, possibly constrained by an aperture but
nothing else.  More than one IOMMU can reasonably fit that.  The odds
that another IOMMU would have exactly the same restrictions as PAMU
seem smaller in comparison.

In any case, if you had to deal with some Intel-only quirk, would it
make sense to call it a type1 attribute?  I'm not advocating one way
or the other on whether an abstraction is viable here (though Stuart
seems to think it's highly unlikely anything but a PAMU will comply),
just that if it is to be abstracted rather than a hardware-specific
interface, we need to document what is and is not part of the
abstraction.  Otherwise a non-PAMU-specific user won't know what they
can rely on, and someone adding support for a new windowed IOMMU won't
know if theirs is close enough, or they need to introduce a type3.

   What's the value to userspace in determining which windows are used
   by which banks?

  That depends on who programs the MSI config space address.  What is
  important is userspace controlling which iovas will be dedicated to
  this, in case it wants to put something else there.

 So userspace is programming the MSI vectors, targeting a user programmed
 iova?  But an iova selects a window and I thought there were some number
 of MSI banks and we don't really know which ones we'll need...  still
 confused.

Userspace would also need a way to find out the page offset and data
value.  That may be an argument in favor of having the two ioctls
Stuart later suggested (get MSI count, and map MSI).  Would there be
any complication in the VFIO code from tracking a mapping that doesn't
have a userspace virtual address associated with it?

  There's going to be special stuff no matter what.  This would keep it
  separated from the IOMMU map code.

  I'm not sure what you mean by overhead here... the runtime overhead
  of setting things up is not particularly relevant as long as it's
  reasonable.  If you mean development and maintenance effort, keeping
  things well separated should help.

 Overhead in terms of code required and complexity.  More things to
 reference count and shut down in the proper order on userspace exit.
 Thanks,

That didn't stop others from having me convert the KVM device control
API to use file descriptors instead of something more ad-hoc with a
better-defined destruction order. :-)

I don't know if it necessarily needs to be a separate fd -- it could be
just another device resource like BARs, with some way for userspace to

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
 On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
  On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
     C.  Explicit mapping using normal DMA map.  The last idea is that
         we would introduce a new ioctl to give user-space an fd to
         the MSI bank, which could be mmapped.  The flow would be
         something like this:
            -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
            -user space mmaps the fd, getting a vaddr
            -user space does a normal DMA map for desired iova
         This approach makes everything explicit, but adds a new ioctl
         applicable most likely only to the PAMU (type2 iommu).

    And the DMA_MAP of that mmap then allows userspace to select the
    window used?  This one seems like a lot of overhead, adding a new
    ioctl, new fd, mmap, special mapping path, etc.

   There's going to be special stuff no matter what.  This would keep it
   separated from the IOMMU map code.

   I'm not sure what you mean by overhead here... the runtime overhead of
   setting things up is not particularly relevant as long as it's
   reasonable.  If you mean development and maintenance effort, keeping
   things well separated should help.

  We don't need to change DMA_MAP.  If we can simply add a new type 2
  ioctl that allows user space to set which windows are MSIs, it seems
  vastly less complex than an ioctl to supply a new fd, mmap of it, etc.

  So maybe 2 ioctls:
  VFIO_IOMMU_GET_MSI_COUNT

Do you mean a count of actual MSIs or a count of MSI banks used by the
whole VFIO group?

  VFIO_IOMMU_MAP_MSI(iova, size)

Not sure how you mean size to be used -- for MPIC it would be 4K per
bank, and you can only map one bank at a time (which bank you're
mapping should be a parameter, if only so that the kernel doesn't have
to keep iteration state for you).
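
Concretely, a per-bank variant might look something like this (a sketch
only -- the struct and its fields are invented for illustration; nothing
like this was settled on):

#include <linux/types.h>

struct vfio_iommu_msi_bank_map {
        __u32 argsz;
        __u32 flags;
        __u32 bank;     /* which MSI bank to map (0..count-1) */
        __u64 iova;     /* iova of the 4KB window to use for it */
};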



 How are MSIs related to devices on PAMU?

PAMU doesn't care about MSIs.  The relation of individual MSIs to a
device is standard PCI stuff.  Each MSI bank (which is part of the
MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to
map all MSI banks that are in use by any of the devices in the group.
Ideally we'd let the VFIO grouping influence the allocation of MSIs.

 On x86 MSI count is very
 device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
 VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
 being a device ioctl is that you need to get the device FD, but the
 IOMMU protection needs to be established before you can get that... so
 there's an ordering problem if you need it from the device before
 configuring the IOMMU.  Thanks,

What do you mean by IOMMU protection needs to be established?
Wouldn't we just start with no mappings in place?


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:13 -0500, Scott Wood wrote:
 On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
   The number of windows is always power of 2 (and max is 256).  And to
   reduce PAMU cache pressure you want to use the fewest number of windows
   you can.    So, I don't see practically how we could transparently
   steal entries to add the MSIs.  Either user space knows to leave empty
   windows for MSIs and by convention the kernel knows which windows those
   are (as in option #A) or explicitly tell the kernel which windows (as
   in option #B).

  Ok, apparently I don't understand the API.  Is it something like
  userspace calls GET_ATTR and finds out that there are 256 available
  windows, userspace determines that it needs 8 for RAM and then it has
  an MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
  prone to exploitation by the first userspace to allocate its aperture,

 What exploitation?

 It's not as if there is a pool of 256 global windows that users
 allocate from.  The subwindow count is just how finely divided the
 aperture is.  The only way one user will affect another is through
 cache contention (which is why we want the minimum number of subwindows
 that we can get away with).

  but I'm also not sure why userspace couldn't specify the (non-power of 2)
  number of windows it needs for RAM, then VFIO would see that the devices
  attached have MSI and add those windows and align to a power of 2.

 If you double the subwindow count without userspace knowing, you have
 to double the aperture as well (and you may need to grow up or down
 depending on alignment).  This means you also need to halve the maximum
 aperture that userspace can request.  And you need to expose a
 different number of maximum subwindows in the IOMMU API based on
 whether we might have MSIs of this type.  It's ugly and awkward, and
 removes the possibility for userspace to place the MSIs in some unused
 slot in the middle, or not use MSIs at all.

Ok, I missed this in Stuart's example:

Total aperture: 512MB
# of windows: 8

win   gphys/
#     iova          phys           size
---   -----------   ------------   --------
0     0x00000000    0xX_XX000000   64MB
1     0x04000000    0xX_XX000000   64MB
2     0x08000000    0xX_XX000000   64MB
3     0x0C000000    0xX_XX000000   64MB
4     0x10000000    0xf_fe044000   4KB        // msi bank 1
                                   ^^^
5     0x14000000    0xf_fe045000   4KB        // msi bank 2
                                   ^^^
6     0x18000000    0xf_fe046000   4KB        // msi bank 3
                                   ^^^
7     -             -              disabled

So even though the MSI banks are 4k in this example, they're still on
64MB boundaries.  If userspace were to leave this as 256 windows, each
would be 2MB and we'd use 128 of them to map the same memory as these
4x64MB windows and thrash the iotlb harder.  The picture is becoming
clearer.  Thanks,

Alex



Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
 On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
   On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device  
  MSI
address/data registers.
  
   x86 has a totally different mechanism here, as far as I understand  
  --
   even before you get into restrictions on mappings.
  
  So what control will userspace have over programming the actually MSI
  vectors on PAMU?
 
 Not sure what you mean -- PAMU doesn't get explicitly involved in  
 MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If  
 you want isolation, you need to make sure that an MSI group is only  
 used by one VFIO group, and that you're on a chip that has alias pages  
 with just one MSI bank register each (newer chips do, but the first  
 chip to have a PAMU didn't).

How does a user figure this out?

This could also be done as another type2 ioctl extension.
  
   Again, what is type2, specifically?  If someone else is adding  
  their
   own IOMMU that is kind of, sort of like PAMU, how would they know if
   it's close enough?  What assumptions can a user make when they see  
  that
   they're dealing with type2?
  
  Naming always has and always will be a problem.  I assume this is  
  named
  type2 rather than PAMU because it's trying to expose a generic  
  windowed
  IOMMU fitting the IOMMU API.
 
 But how closely is the MSI situation related to a generic windowed  
 IOMMU, then?  We could just as well have a highly flexible IOMMU in  
 terms of arbitrary 4K page mappings, but still handle MSIs as pages to  
 be mapped rather than a translation table.  Or we could have a windowed  
 IOMMU that has an MSI translation table.
 
  Like type1, it doesn't really make sense
  to name it IOMMU API because that's a kernel internal interface and
  we're designing a userspace interface that just happens to use that.
  Tagging it to a piece of hardware makes it less reusable.
 
 Well, that's my point.  Is it reusable at all, anyway?  If not, then  
 giving it a more obscure name won't change that.  If it is reusable,  
 then where is the line drawn between things that are PAMU-specific or  
 MPIC-specific and things that are part of the generic windowed IOMMU  
 abstraction?
 
   Type1 is arbitrary.  It might as well be named brown and this one  
  can be
  blue.
 
 The difference is that type1 seems to refer to hardware that can do  
 arbitrary 4K page mappings, possibly constrained by an aperture but  
 nothing else.  More than one IOMMU can reasonably fit that.  The odds  
 that another IOMMU would have exactly the same restrictions as PAMU  
 seem smaller in comparison.
 
 In any case, if you had to deal with some Intel-only quirk, would it  
 make sense to call it a type1 attribute?  I'm not advocating one way  
 or the other on whether an abstraction is viable here (though Stuart  
 seems to think it's highly unlikely anything but a PAMU will comply),  
 just that if it is to be abstracted rather than a hardware-specific  
 interface, we need to document what is and is not part of the  
 abstraction.  Otherwise a non-PAMU-specific user won't know what they  
 can rely on, and someone adding support for a new windowed IOMMU won't  
 know if theirs is close enough, or they need to introduce a type3.

So Alexey named the SPAPR IOMMU something related to spapr...
surprisingly enough.  I'm fine with that.  If you think it's unique
enough, name it something appropriately.  I haven't seen the code and
don't know the architecture sufficiently to have an opinion.

What's the value to userspace in determining which windows are used
by which banks?
  
   That depends on who programs the MSI config space address.  What is
   important is userspace controlling which iovas will be dedicated to
   this, in case it wants to put something else there.
  
  So userspace is programming the MSI vectors, targeting a user
  programmed iova?  But an iova selects a window and I thought there
  were some number of MSI banks and we don't really know which ones
  we'll need...  still confused.
 
 Userspace would also need a way to find out the page offset and data  
 value.  That may be an argument in favor of having the two ioctls  
 Stuart later suggested (get MSI count, and map MSI).

Connecting the user set iova and host kernel assigned irq number is
where I'm still lost, but I'll follow-up with that question in the other
thread.

 Would there be  
 any complication in the VFIO code from tracking a mapping that doesn't  
 have a userspace virtual address associated with it?

Only the VFIO iommu driver tracks mappings, the QEMU userspace component
doesn't (relies on the memory API for type1), nor does any of the
kernel framework code.

   There's going to be special stuff no matter what.

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
 On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
   On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl
    VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).
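
For concreteness, a rough userspace sketch of that flow.  Everything
here is hypothetical: VFIO_GROUP_GET_MSI_FD does not exist, its ioctl
number is a placeholder, and the final step reuses the existing type1
DMA map structure purely for illustration.  Error handling omitted.

/*
 * Hypothetical sketch of option C: new ioctl -> mmap -> normal DMA map.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

#define VFIO_GROUP_GET_MSI_FD  _IO(VFIO_TYPE, VFIO_BASE + 50)  /* hypothetical */

static int map_msi_bank(int group_fd, int container_fd, uint64_t iova)
{
    /* 1. New ioctl: get an fd representing the MSI bank. */
    int msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);

    /* 2. mmap the fd to get a vaddr for the 4K bank. */
    void *vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, msi_fd, 0);

    /* 3. Normal DMA map of that vaddr at the desired iova. */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)vaddr,
        .iova  = iova,
        .size  = 4096,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}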
   
And the DMA_MAP of that mmap then allows userspace to select the
window used?  This one seems like a lot of overhead, adding a new
ioctl, new fd, mmap, special mapping path, etc.
   
   
There's going to be special stuff no matter what.  This would
keep it separated from the IOMMU map code.

I'm not sure what you mean by overhead here... the runtime
overhead of setting things up is not particularly relevant as
long as it's reasonable.
If you mean development and maintenance effort, keeping things
well separated should help.
  
   We don't need to change DMA_MAP.  If we can simply add a new type 2
   ioctl that allows user space to set which windows are MSIs, it
   seems vastly less complex than an ioctl to supply a new fd, mmap
   of it, etc.
  
   So maybe 2 ioctls:
   VFIO_IOMMU_GET_MSI_COUNT
 
 Do you mean a count of actual MSIs or a count of MSI banks used by the  
 whole VFIO group?

I hope the latter, which would clarify how this is distinct from
DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
dynamically adding a device could bring along additional MSI banks?

   VFIO_IOMMU_MAP_MSI(iova, size)
 
 Not sure how you mean size to be used -- for MPIC it would be 4K per  
 bank, and you can only map one bank at a time (which bank you're  
 mapping should be a parameter, if only so that the kernel doesn't have  
 to keep iteration state for you).
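
A sketch of how those two proposals might look from userspace, with
the bank index made an explicit parameter as suggested above.  None of
these ioctl names, numbers, or struct layouts exist in the kernel;
they are assumptions for illustration only.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_IOMMU_GET_MSI_COUNT  _IO(VFIO_TYPE, VFIO_BASE + 51)  /* hypothetical */
#define VFIO_IOMMU_MAP_MSI        _IO(VFIO_TYPE, VFIO_BASE + 52)  /* hypothetical */

struct vfio_iommu_msi_map {          /* hypothetical layout */
    uint32_t argsz;
    uint32_t bank;                   /* which MSI bank, 0..count-1 */
    uint64_t iova;                   /* user-chosen iova; selects the window */
};

static int map_all_msi_banks(int container_fd, uint64_t iova_base)
{
    int count = ioctl(container_fd, VFIO_IOMMU_GET_MSI_COUNT);
    if (count < 0)
        return -1;

    for (int bank = 0; bank < count; bank++) {
        struct vfio_iommu_msi_map map = {
            .argsz = sizeof(map),
            .bank  = bank,
            .iova  = iova_base + (uint64_t)bank * 4096,  /* 4K per bank */
        };
        if (ioctl(container_fd, VFIO_IOMMU_MAP_MSI, &map) < 0)
            return -1;
    }
    return 0;
}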
 
  How are MSIs related to devices on PAMU?
 
 PAMU doesn't care about MSIs.  The relation of individual MSIs to a  
 device is standard PCI stuff.  Each MSI bank (which is part of the  
 MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to  
 map all MSI banks that are in use by any of the devices in the group.   
 Ideally we'd let the VFIO grouping influence the allocation of MSIs.

The current VFIO MSI support has the host handling everything about MSI.
The user never programs an MSI vector to the physical device, they set
up everything through ioctl.  On interrupt, we simply trigger an eventfd
and leave it to things like KVM irqfd or QEMU to do the right thing in a
virtual machine.
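
For reference, a minimal sketch of that existing flow using the real
VFIO_DEVICE_SET_IRQS interface; error handling trimmed for brevity.

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int wire_msi_to_eventfd(int device_fd)
{
    int efd = eventfd(0, EFD_CLOEXEC);
    char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
    struct vfio_irq_set *set = (struct vfio_irq_set *)buf;

    /* Userspace never writes the vector to the device; it just asks
     * VFIO to signal this eventfd when the MSI fires. */
    set->argsz = sizeof(buf);
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    set->index = VFIO_PCI_MSI_IRQ_INDEX;
    set->start = 0;
    set->count = 1;                        /* enable one vector */
    memcpy(set->data, &efd, sizeof(int));

    return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set) ? -1 : efd;
}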

Here the MSI vector has to go through a PAMU window to hit the correct
MSI bank.  So that means it has some component of the iova involved,
which we're proposing here is controlled by userspace (whether that
vector uses an offset from 0x1000 or 0x depending on which
window slot is used to make the MSI bank).  I assume we're still working
in a model where the physical interrupt fires into the host and a
host-based interrupt handler triggers an eventfd, right?  So that means
the vector also has host components so we trigger the correct ISR.  How
is that coordinated?

Would it be possible for userspace to simply leave room for MSI bank
mapping (how much room could be determined by something like
VFIO_IOMMU_GET_MSI_BANK_COUNT), then document the API so that userspace
DMA_MAPs starting at the 0x0 address of the aperture, growing up, and
VFIO maps banks on demand at the top of the aperture, growing down?
Wouldn't that avoid a lot of issues with userspace needing to know
anything about MSI banks (other than count) and coordinating irq numbers
and enabling handlers?
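
From userspace, that scheme could reduce to something like the
following sketch.  VFIO_IOMMU_GET_MSI_BANK_COUNT is only the name
proposed here; the ioctl number is a made-up placeholder.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_IOMMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + 53)  /* hypothetical */

static int64_t dma_iova_limit(int container_fd, uint64_t aperture_size)
{
    int banks = ioctl(container_fd, VFIO_IOMMU_GET_MSI_BANK_COUNT);
    if (banks < 0)
        return -1;

    /*
     * The kernel claims the top 'banks' 4K pages of the aperture for
     * MSI banks, growing down; userspace DMA_MAPs from iova 0, growing
     * up, and must stay below this limit.
     */
    return (int64_t)(aperture_size - (uint64_t)banks * 4096);
}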

  On x86 MSI count is very
  device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
  VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
  being a device ioctl is that you need to get the device FD, but the
  IOMMU protection needs to be established before you can get that... so
  there's an ordering problem if you need it from the device before
  configuring the IOMMU.  Thanks,
 
 What do you mean by "IOMMU protection needs to be established"?
 Wouldn't we just start with no mappings in place?

If no mappings blocks all DMA, sure, that's fine.  Once the VFIO device
FD is accessible by userspace we have to protect the host against DMA.
If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that
could be exploitable.  Thanks,
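
As context for that ordering: in the existing flow the IOMMU model is
set before any device fd is handed out, along these lines.  These are
the real ioctls; error handling and VFIO_CHECK_EXTENSION omitted.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int open_protected_device(const char *group_path, const char *dev_name)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open(group_path, O_RDWR);      /* e.g. "/dev/vfio/26" */

    /* 1. Attach the group to a container. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* 2. Select an IOMMU model; with no mappings, all DMA is blocked. */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* 3. Only now will the kernel hand out a device fd. */
    return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, dev_name);
}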

Alex
