Re: [Qemu-devel] vfio for platform devices - 9/5/2012 - minutes

2013-09-11 Thread Yoder Stuart-B08248


 -Original Message-
 From: Yoder Stuart-B08248
 Sent: Thursday, September 05, 2013 12:51 PM
 To: Wood Scott-B07421; Sethi Varun-B16395; Bhushan Bharat-R65777; 'Peter
 Maydell'; 'Santosh Shukla'; 'Alex Williamson'; 'Alexander Graf';
 'Antonios Motakis'; 'Christoffer Dall'; 'kim.phill...@linaro.org'
 Cc: kvm...@lists.cs.columbia.edu; 'kvm-...@vger.kernel.org'; 'qemu-
 de...@nongnu.org'
 Subject: vfio for platform devices - 9/5/2012 - minutes
 
 We had a call with those interested and/or working on vfio
 for platform devices.
 
 Participants: Scott Wood, Varun Sethi, Bharat Bhushan, Peter Maydell,
   Santosh Shukla, Alex Williamson, Alexander Graf,
   Antonios Motakis, Christoffer Dall, Kim Phillips,
   Stuart Yoder
 
 Several aspects to vfio for platform devices:
 
 1. IOMMU groups
 
  -iommu driver needs to register a bus notifier for the platform bus
   and create groups for relevant platform devices
  -Antonios is looking at this for several ARM IOMMUs
  -PAMU (Freescale) driver already does this
 
 2. unbinding device from host
 
  PCI:
    echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
  Platform:
    echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind
 
  -don't believe there are issues or work to do here
 
 3. binding device to vfio-platform driver
 
  PCI:
    echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
 
  -this is probably the least understood issue-- platform drivers
   register themselves with the bus for a specific name
   string.  That is matched with device tree compatible strings
   later to bind a device to a driver
 -we want the vfio-platform driver to dynamically bind
  to a variety of platform devices previously unknown to
  vfio-platform
  -ideally unbinding and binding could be an atomic operation
  -Alex W pointed out that x86 could leverage this work so
   keep that in mind in what we design
  -Kim Phillips (Linaro) will start working on this

One thing we didn't discuss that needs to be considered (probably by
Kim, who is looking at the 'binding device' issue) is
returning a passthru device back to the host.

After a platform device has been bound to vfio and is in use by
user space or a virtual machine, we also need to be able
to unwind all that and return the device back to the host
in a sane state.

What happens when user space exits and the vfio file
descriptors are closed?

What if the device is still active and doing bus 
mastering?   (e.g. a VM crashed causing a QEMU
exit)

How can the vfio-platform layer in the host kernel
get a specific device in a sane state?

When a platform device is 'unbound' from vfio, what
specifically happens to the device?

Platform devices don't have generic mechanisms like on PCI
to disable bus mastering or even disable or reset a
device.

Haven't thought through all this yet, but just raising
some issues I see.
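
One possible shape for the reset piece (a sketch only, nothing
agreed on): vfio-platform could carry a table of device-specific
reset hooks, keyed by device tree compatible string, that it
invokes on open, release, and unbind.  All names below are
hypothetical:

struct vfio_platform_device;    /* vfio-platform's per-device state */

struct vfio_platform_reset {
        const char *compat;     /* device tree compatible string */
        int (*reset)(struct vfio_platform_device *vdev);
};

static int fsl_dma_reset(struct vfio_platform_device *vdev)
{
        /* device-specific: halt DMA channels, mask interrupts,
         * clear any pending state */
        return 0;
}

static const struct vfio_platform_reset reset_table[] = {
        { .compat = "fsl,eloplus-dma", .reset = fsl_dma_reset },
};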

Regards,
Stuart







Re: [Qemu-devel] vfio for platform devices - 9/5/2012 - minutes

2013-09-06 Thread Yoder Stuart-B08248
Adding Will...

 -Original Message-
 From: Sethi Varun-B16395
 Sent: Friday, September 06, 2013 11:56 AM
 To: Yoder Stuart-B08248; Wood Scott-B07421; Bhushan Bharat-R65777; 'Peter
 Maydell'; 'Santosh Shukla'; 'Alex Williamson'; 'Alexander Graf';
 'Antonios Motakis'; 'Christoffer Dall'; 'kim.phill...@linaro.org'
 Cc: kvm...@lists.cs.columbia.edu; kvm-...@vger.kernel.org; qemu-
 de...@nongnu.org
 Subject: RE: vfio for platform devices - 9/5/2012 - minutes
 
 I have a query about the ARM SMMU driver. In the ARM smmu driver I see,
 that bus notifiers are registered for both amba and platform bus. Amba is
 the I/O interconnect, right? Why is bus notifier required for the amba
 bus?
 
 -Varun
 
  -Original Message-
  From: Yoder Stuart-B08248
  Sent: Thursday, September 05, 2013 11:21 PM
  To: Wood Scott-B07421; Sethi Varun-B16395; Bhushan Bharat-R65777;
 'Peter
  Maydell'; 'Santosh Shukla'; 'Alex Williamson'; 'Alexander Graf';
  'Antonios Motakis'; 'Christoffer Dall'; 'kim.phill...@linaro.org'
  Cc: kvm...@lists.cs.columbia.edu; kvm-...@vger.kernel.org; qemu-
  de...@nongnu.org
  Subject: vfio for platform devices - 9/5/2012 - minutes
 
  We had a call with those interested and/or working on vfio for platform
  devices.
 
  Participants: Scott Wood, Varun Sethi, Bharat Bhushan, Peter Maydell,
Santosh Shukla, Alex Williamson, Alexander Graf,
Antonios Motakis, Christoffer Dall, Kim Phillips,
Stuart Yoder
 
  Several aspects to vfio for platform devices:
 
  1. IOMMU groups
 
   -iommu driver needs to register a bus notifier for the platform bus
    and create groups for relevant platform devices
   -Antonios is looking at this for several ARM IOMMUs
   -PAMU (Freescale) driver already does this
  
  2. unbinding device from host
  
   PCI:
 echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
   Platform:
 echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind
  
   -don't believe there are issues or work to do here
  
  3. binding device to vfio-platform driver
  
   PCI:
 echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
  
   -this is probably the least understood issue-- platform drivers
    register themselves with the bus for a specific name
    string.  That is matched with device tree compatible strings
    later to bind a device to a driver
   -we want the vfio-platform driver to dynamically bind
    to a variety of platform devices previously unknown to
    vfio-platform
   -ideally unbinding and binding could be an atomic operation
   -Alex W pointed out that x86 could leverage this work so
    keep that in mind in what we design
   -Kim Phillips (Linaro) will start working on this
  
  4. vfio kernel interface
  
   -exposes regions and interrupts to user space via FDs
   -there are 'info' ioctls that allow getting info about
    regions/interrupts such as size and type of interrupt
   -there is a proposed extension to the 'info' ioctls that
    provides device tree paths, allowing user space to correlate
    resources with the device tree
    (https://lists.cs.columbia.edu/pipermail/kvmarm/2013-July/006237.html)
  
  5. QEMU
   -some key tasks
 -interacts with vfio kernel interface
 -registers memslots
 -needs to dynamically get device hooked up to VM's
  platform bus, including IRQs
 -needs to generate device tree node
   -a key point: we don't believe that platform device passthru
    in QEMU can be solved in a completely generic way.  There will
    need to be device specific code for each device type being passed
    through...to do things like generate device tree nodes
   -in general we expect a relatively small number of device types
    to be passed through to VMs
   -Alex Graf is working on dynamic creation of platform devices
    for the PPC e500 paravirt machine
   -see: http://lists.nongnu.org/archive/html/qemu-devel/2013-07/msg03614.html
   -first step is dynamically generating a virtual UART
   -that sets the stage to create vfio devices backed
    by real hardware
 
  There is a session at Linux Plumbers in a couple of weeks and further
  discussions will happen there.
 
  Regards,
  Stuart




[Qemu-devel] vfio for platform devices - 9/5/2012 - minutes

2013-09-05 Thread Yoder Stuart-B08248
We had a call with those interested and/or working on vfio
for platform devices.

Participants: Scott Wood, Varun Sethi, Bharat Bhushan, Peter Maydell,
  Santosh Shukla, Alex Williamson, Alexander Graf,
  Antonios Motakis, Christoffer Dall, Kim Phillips,
  Stuart Yoder

Several aspects to vfio for platform devices:

1. IOMMU groups

 -iommu driver needs to register a bus notifier for the platform bus
  and create groups for relevant platform devices
 -Antonios is looking at this for several ARM IOMMUs
 -PAMU (Freescale) driver already does this

2. unbinding device from host

 PCI:
   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
 Platform:
   echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind

 -don't believe there are issues or work to do here

3. binding device to vfio-platform driver

 PCI:
   echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

 -this is probably the least understood issue-- platform drivers
  register themselves with the bus for a specific name
  string.  That is matched with device tree compatible strings
  later to bind a device to a driver
 -we want the vfio-platform driver to dynamically bind
  to a variety of platform devices previously unknown to
  vfio-platform (see the sketch below)
 -ideally unbinding and binding could be an atomic operation
 -Alex W pointed out that x86 could leverage this work so
  keep that in mind in what we design
 -Kim Phillips (Linaro) will start working on this
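
 To make that concrete, one possible user-visible flow (a sketch
 only-- the 'driver_override' attribute shown is one candidate
 design, not something agreed on):

   # unbind from the host driver
   echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind
   # ask the bus to match this device to vfio-platform next
   echo vfio-platform > /sys/bus/platform/devices/ffe101300.dma/driver_override
   # rebind-- vfio-platform claims the device
   echo ffe101300.dma > /sys/bus/platform/drivers_probe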

4. vfio kernel interface

 -exposes regions and interrupts to user space via FDs
 -there are 'info' ioctls that allow getting info about
  regions/interrupts such as size and type of interrupt
 -there is a proposed extension to the 'info' ioctls that
  provides device tree paths, allowing user space to correlate
  resources with the device tree
  (https://lists.cs.columbia.edu/pipermail/kvmarm/2013-July/006237.html)
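
 To illustrate, user space can already walk regions and interrupts
 with the existing 'info' ioctls (the sketch below uses the current
 linux/vfio.h definitions; only the device tree path extension is
 new):

   struct vfio_device_info dev = { .argsz = sizeof(dev) };
   unsigned int i;

   ioctl(device_fd, VFIO_DEVICE_GET_INFO, &dev);

   for (i = 0; i < dev.num_regions; i++) {
           struct vfio_region_info reg = { .argsz = sizeof(reg),
                                           .index = i };

           ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
           /* reg.size and reg.offset describe the mmap/read window */
   }

   for (i = 0; i < dev.num_irqs; i++) {
           struct vfio_irq_info irq = { .argsz = sizeof(irq),
                                        .index = i };

           ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
           /* irq.count is the number of interrupts at this index */
   }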

5. QEMU
 -some key tasks
   -interacts with vfio kernel interface
   -registers memslots
   -needs to dynamically get device hooked up to VM's
platform bus, including IRQs
   -needs to generate device tree node
 -a key point: we don't believe that platform device passthru
  in QEMU can be solved in a completely generic way.  There will
  need to be device specific code for each device type being passed
  through...to do things like generate device tree nodes
 -in general we expect a relatively small number of device types
  to be passed through to VMs
 -Alex Graf is working on dynamic creation of platform devices
  for the PPC e500 paravirt machine
 -see: http://lists.nongnu.org/archive/html/qemu-devel/2013-07/msg03614.html
 -first step is dynamically generating a virtual UART
 -that sets the stage to create vfio devices backed
  by real hardware

There is a session at Linux Plumbers in a couple of weeks
and further discussions will happen there.

Regards,
Stuart




Re: [Qemu-devel] RFC: vfio API changes needed for powerpc (v3)

2013-04-11 Thread Yoder Stuart-B08248

 -Original Message-
 From: Joerg Roedel [mailto:j...@8bytes.org]
 Sent: Thursday, April 11, 2013 7:57 AM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; k...@vger.kernel.org; qemu-devel@nongnu.org; 
 io...@lists.linux-foundation.org;
 ag...@suse.de; Bhushan Bharat-R65777
 Subject: Re: RFC: vfio API changes needed for powerpc (v3)
 
 On Tue, Apr 09, 2013 at 01:22:15AM +, Yoder Stuart-B08248 wrote:
   What happens if a normal unmap call is done on the MSI iova?  Do we
   need a separate unmap?
 
  I was thinking a normal unmap on an MSI window would be an error...but
  I'm not set on that.   I put the msi unmap there to make things symmetric,
  a normal unmap would work as well...and then we could drop the msi unmap.
 
 Hmm, this API semantic isn't very clean. When you explicitly map the MSI
 banks a clean API would also allow to unmap them. But that is not
 possible in your design because the kernel is responsible for mapping
 MSIs and you can't unmap a MSI bank that is in use by the kernel.

The mapping that the vfio API creates is specific only to the
assigned device.   So it can be unmapped without affecting
any other device... there is nothing else in the kernel making
the mapping in use.  Another device in use by the kernel using the
same MSI bank would have its own independent mapping.   So, these
mappings are not global but are device specific...just like normal
DMA memory mappings.
  
 So since the kernel owns the MSI setup anyways it should also take care
 of mapping the MSI banks. What is the reason to not let the kernel
 allocate the MSI banks top-down from the end of the DMA window space?
 Just let userspace know (or even set if needed) in advance how many of
 the windows it configures the kernel will take for mapping MSI banks and
 you are fine, no?

As designed, the API lets user space determine the number of windows
needed for MSIs, so they can be set.  The only difference between
what we've proposed and what you described, I think, is that the proposal
allows user space to choose _which_ windows are used for which MSI banks.

Stuart




Re: [Qemu-devel] RFC: vfio API changes needed for powerpc (v3)

2013-04-08 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Friday, April 05, 2013 5:17 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
 Sethi Varun-B16395;
 k...@vger.kernel.org; qemu-devel@nongnu.org; io...@lists.linux-foundation.org
 Subject: Re: RFC: vfio API changes needed for powerpc (v3)
 
 On 04/04/2013 05:10:27 PM, Yoder Stuart-B08248 wrote:
  /*
   * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
   *
   * Unmaps the MSI bank at the specified iova.
   * Caller provides struct vfio_pamu_msi_bank_unmap with all fields
  set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */
 
  struct vfio_pamu_msi_bank_unmap {
  __u32   argsz;
  __u32   flags; /* no flags currently */
__u64   iova;  /* the iova to be unmapped */
  };
  #define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
  struct vfio_pamu_msi_bank_unmap )
 
 What happens if a normal unmap call is done on the MSI iova?  Do we
 need a separate unmap?

I was thinking a normal unmap on an MSI window would be an error...but
I'm not set on that.   I put the msi unmap there to make things symmetric,
a normal unmap would work as well...and then we could drop the msi unmap.

Stuart






[Qemu-devel] RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248
Based on the email thread over the last couple of days, I have
below a more concrete proposal (v2) for new ioctls supporting vfio-pci
on SoCs with the Freescale PAMU.

Example usage is as described by Scott:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can
//   update PCI config space appropriately.

Thanks,
Stuart



The Freescale PAMU is an aperture-based IOMMU with the following 
characteristics.  Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
  -the mapped region in each window can be different than
   the window size...mapping must be a power of 2
  -physical address of the mapping must be naturally aligned
   with the mapping size
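
For example: a 512MB aperture starting at iova 0 with 8 windows
yields 64MB windows at iovas 0x00000000, 0x04000000, ... 0x1C000000;
each window can then map a full 64MB region or a smaller power-of-2
region (e.g. a 4KB MSI bank page), but never a non-power-of-2 size
like 48MB.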

/*
 * VFIO_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
__u32   argsz;
__u32   attribute;
#define VFIO_ATTR_GEOMETRY 1
#define VFIO_ATTR_WINDOWS  2
#define VFIO_ATTR_PAMU_STASH   3

/* following fields are for VFIO_ATTR_GEOMETRY */
__u64 aperture_start; /* first address that can be mapped*/
__u64 aperture_end;   /* last address that can be mapped */

/* following fields are for VFIO_ATTR_WINDOWS */
__u32 windows;/* number of windows in the aperture */
  /* initially this will be the max number
   * of windows that can be set
   */

/* following fields are for VFIO_ATTR_PAMU_STASH */
__u32 cpu;/* CPU number for stashing */
__u32 cache;  /* cache ID for stashing */
};
#define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
#define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_banks {
__u32   argsz;
__u32   bank_count;  /* the number of MSI banks */
};
#define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_banks)

/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};

/*
 * VFIO_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u64   iova;  /* the iova to be unmapped */
};




Re: [Qemu-devel] RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248

  /*
   * VFIO_PAMU_MAP_MSI_BANK
   *
   * Maps the MSI bank at the specified index and iova.  User space must
   * call this ioctl once for each MSI bank (count of banks is returned by
   * VFIO_IOMMU_GET_MSI_BANK_COUNT).
   * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */
 
  struct vfio_pamu_msi_bank_map {
  __u32   argsz;
  __u32   msi_bank_index;  /* the index of the MSI bank */
  __u64   iova;  /* the iova the bank is to be mapped to */
  };
 
 Again, flags.  If we dynamically add or remove devices from a container
 the bank count can change, right?  If bank count goes from 2 to 3, does
 userspace know to assume the new bank is [2]?  If bank count goes from 3 to
 2, which index was that?  If removing a device removes an MSI bank then
 vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (number will most likely always be
3 or 4).  So it won't change as devices are added or removed.

If devices are added or removed, the kernel side can enable or disable
the corresponding MSI windows.  But that is hidden from user space.

Stuart




[Qemu-devel] RFC: vfio API changes needed for powerpc (v3)

2013-04-04 Thread Yoder Stuart-B08248
-v3 updates
   -made vfio_pamu_attr a union, added flags
   -s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more
clear which fd is being operated on
   -added flags to vfio_pamu_msi_bank_map/umap
   -VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32
not a struct
   -fixed some typos



The Freescale PAMU is an aperture-based IOMMU with the following
characteristics.  Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
  -the mapped region in each window can be different than
   the window size...mapping must be a power of 2
  -physical address of the mapping must be naturally aligned
   with the mapping size

These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).

/*
 * VFIO_IOMMU_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
        __u32   argsz;
        __u32   flags;      /* no flags currently */
        __u32   attribute;

        union {
                /* VFIO_ATTR_GEOMETRY */
                struct {
                        __u64 aperture_start; /* first addr that can be mapped */
                        __u64 aperture_end;   /* last addr that can be mapped */
                } attr;

                /* VFIO_ATTR_WINDOWS */
                __u32 windows;  /* number of windows in the aperture;
                                 * initially this will be the max number
                                 * of windows that can be set
                                 */

                /* VFIO_ATTR_PAMU_STASH */
                struct {
                        __u32 cpu;   /* CPU number for stashing */
                        __u32 cache; /* cache ID for stashing */
                } stash;
        };
};
#define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Return: 0 on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Return: __u32 bank count on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + x, __u32)

/*
 * VFIO_IOMMU_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   flags; /* no flags currently */
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};
#define VFIO_IOMMU_PAMU_MAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_map )

/*
 * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u32   flags; /* no flags currently */
__u64   iova;  /* the iova to be unmapped */
};
#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_unmap )
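
To pull the above together, a usage sketch of the proposed flow (the
ioctl numbers above are still placeholders ("VFIO_BASE + x"), so this
is illustrative only):

__u32 banks;
__u32 i;
struct vfio_pamu_attr attr = { .argsz = sizeof(attr) };

ioctl(container_fd, VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT, &banks);

attr.attribute = VFIO_ATTR_GEOMETRY;
attr.attr.aperture_start = 0x00000000;
attr.attr.aperture_end   = 0x1fffffff;    /* 512MB aperture */
ioctl(container_fd, VFIO_IOMMU_PAMU_SET_ATTR, &attr);

attr.attribute = VFIO_ATTR_WINDOWS;
attr.windows = 8;                         /* 8 x 64MB windows */
ioctl(container_fd, VFIO_IOMMU_PAMU_SET_ATTR, &attr);

/* normal DMA maps can happen at any point relative to this */
for (i = 0; i < banks; i++) {
        struct vfio_pamu_msi_bank_map map = {
                .argsz = sizeof(map),
                .msi_bank_index = i,
                /* the last 'banks' windows are reserved for MSIs */
                .iova = 0x10000000 + (__u64)i * 0x04000000,
        };

        ioctl(container_fd, VFIO_IOMMU_PAMU_MAP_MSI_BANK, &map);
}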




Re: [Qemu-devel] RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, April 04, 2013 5:52 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
 Sethi Varun-B16395;
 k...@vger.kernel.org; qemu-devel@nongnu.org; io...@lists.linux-foundation.org
 Subject: Re: RFC: vfio API changes needed for powerpc (v2)
 
 On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
 
/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space
  must
 * call this ioctl once for each MSI bank (count of banks is
  returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields
  set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
   
struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped
  to */
};
  
   Again, flags.  If we dynamically add or remove devices from a
  container
   the bank count can change, right?  If bank count goes from 2 to 3,
  does
   userspace know assume the new bank is [2]?  If bank count goes from
  3 to
   2, which index was that?  If removing a device removes an MSI bank
  then
   vfio-pamu will automatically do the unmap, right?
 
  My assumption is that the bank count returned by
  VFIO_IOMMU_GET_MSI_BANK_COUNT
  is the max bank count for a platform.  (number will mostly likely
  always be
  3 or 4).  So it won't change as devices are added or removed.
 
 It should be the actual number of banks used.  This is required if
 we're going to have userspace do the iteration and specify the exact
 iovas to use -- and even if we aren't going to do that, it would be
 more restrictive on available iova-space than is necessary.  Usually
 there will be only one bank in the group.
 
 Actually mapping all of the MSI banks, all the time, would completely
 negate the point of using the separate alias pages.

The geometry, windows, DMA mappings, etc. are set on a 'container'
which may have multiple groups in it.  So user space needs to
determine the total number of MSI windows needed when setting
the geometry and window count.

In the flow you proposed:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);

...that count has to be the total, not the count for 1 of
N possible groups.   So the get count ioctl is not done on
a group.

However, like you pointed out we don't want to negate isolation 
of the separate alias pages.  All this API is doing is telling
the kernel which windows to use for which MSI banks.   It's up
to the kernel to actually enable them as needed.

Say 3 MSI banks exist.  If there are no groups added to the
vfio container and all 3 MAP_MSI_BANK calls occurred
the picture may look like this (based on my earlier example):

        win               gphys/
 #    enabled    iova        phys        size
---   -------  ----------  ------------  ----
 5       N     0x10000000  0xf_fe044000  4KB   // msi bank 1
 6       N     0x14000000  0xf_fe045000  4KB   // msi bank 2
 7       N     0x18000000  0xf_fe046000  4KB   // msi bank 3

User space adds 2 groups that use bank 1:

        win               gphys/
 #    enabled    iova        phys        size
---   -------  ----------  ------------  ----
 5       Y     0x10000000  0xf_fe044000  4KB   // msi bank 1
 6       N     0x14000000  0xf_fe045000  4KB   // msi bank 2
 7       N     0x18000000  0xf_fe046000  4KB   // msi bank 3

User space adds another group that uses bank 3:

        win               gphys/
 #    enabled    iova        phys        size
---   -------  ----------  ------------  ----
 5       Y     0x10000000  0xf_fe044000  4KB   // msi bank 1
 6       N     0x14000000  0xf_fe045000  4KB   // msi bank 2
 7       Y     0x18000000  0xf_fe046000  4KB   // msi bank 3

User space doesn't need to care what is actually enabled,
it just needs to tell the kernel which windows to use
and the kernel can take care of the rest.

Stuart





[Qemu-devel] RFC: vfio API changes needed for powerpc

2013-04-02 Thread Yoder Stuart-B08248
Alex,

We are in the process of implementing vfio-pci support for the Freescale
IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
than x86, and will involve creating a 'type 2' vfio implementation.

For each device's DMA mappings, PAMU has an overall aperture and a number
of windows.  All sizes and window counts must be power of 2.  To illustrate,
below is a mapping for a 256MB guest, including guest memory (backed by
64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win                gphys/
 #      iova         phys         size
---  ----------  ------------  --------
 0   0x00000000  0xX_XX000000   64MB
 1   0x04000000  0xX_XX000000   64MB
 2   0x08000000  0xX_XX000000   64MB
 3   0x0C000000  0xX_XX000000   64MB
 4   0x10000000  0xf_fe044000    4KB   // msi bank 1
 5   0x14000000  0xf_fe045000    4KB   // msi bank 2
 6   0x18000000  0xf_fe046000    4KB   // msi bank 3
 7       -            -         disabled

There are a couple of updates needed to the vfio user-kernel interface
that we would like your feedback on.

1.  IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   attributes.

   We want to expose that to user space, so envision needing a couple
   of new ioctls to do this:
VFIO_IOMMU_SET_ATTR
VFIO_IOMMU_GET_ATTR 

2.   MSI window mappings

   The more problematic question is how to deal with MSIs.  We need to
   create mappings for up to 3 MSI banks that a device may need to target
   to generate interrupts.  The Linux MSI driver can allocate MSIs from
   the 3 banks any way it wants, and currently user space has no way of
   knowing which bank may be used for a given device.   

   There are 3 options we have discussed and would like your direction:

   A.  Implicit mappings -- with this approach user space would not
   explicitly map MSIs.  User space would be required to set the
   geometry so that there are 3 unused windows (the last 3 windows)
   for MSIs, and it would be up to the kernel to create the mappings.
   This approach requires some specific semantics (leaving 3 windows)
   and it potentially gets a little weird-- when should the kernel
   actually create the MSI mappings?  When should they be unmapped?
   Some convention would need to be established.

   B.  Explicit mapping using DMA map flags.  The idea is that a new
   flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
   a mapping is to be created for the supplied iova.  No vaddr
    is given though.  So in the above example there would be
    a dma map at 0x10000000 for 24KB (and no vaddr).   It's
   up to the kernel to determine which bank gets mapped where.
   So, this option puts user space in control of which windows
   are used for MSIs and when MSIs are mapped/unmapped.   There
   would need to be some semantics as to how this is used-- it
   only makes sense

   C.  Explicit mapping using normal DMA map.  The last idea is that
   we would introduce a new ioctl to give user-space an fd to 
   the MSI bank, which could be mmapped.  The flow would be
   something like this:
  -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
  -user space mmaps the fd, getting a vaddr
  -user space does a normal DMA map for desired iova
   This approach makes everything explicit, but adds a new ioctl
   applicable most likely only to the PAMU (type2 iommu).
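
   A sketch of what option C might look like from user space
   (VFIO_GROUP_GET_MSI_FD is the hypothetical new ioctl;
   map_dma() stands in for the normal vfio DMA map):

  int msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);
  void *vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, msi_fd, 0);

  /* a normal DMA map then places the bank at the desired iova */
  map_dma(container_fd, 0x10000000 /* iova */, vaddr, 4096);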

Any feedback or direction?

Thanks,
Stuart 






[Qemu-devel] RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Yoder Stuart-B08248
Alex Graf, Scott Wood, and I met last week to try to flesh out
some details as to how vfio could work for non-PCI devices,
like we have in embedded systems.   This most likely will
require a different kernel driver than vfio-- for now we are
calling it dtio (for device tree I/O) as there is no way
to discover these devices except from the device tree.   But
the dtio driver would use the same architecture and interfaces
as vfio.

For devices on a system bus and represented in a device
tree we have some different requirements than PCI for what
is exposed in the device fd file.  A device may have multiple
address regions, multiple interrupts, a variable length device
tree path, whether a region is mmapable, etc.

With existing vfio, the device fd file layout is something
like:
  0xF Config space offset
  ...
  0x6 ROM offset
  0x5 BAR 5 offset
  0x4 BAR 4 offset
  0x3 BAR 3 offset
  0x2 BAR 2 offset
  0x1 BAR 1 offset
  0x0 BAR 0 offset

We have an alternate proposal that we think is more flexible,
extensible, and will accommodate both PCI and system bus
type devices (and others).

Instead of config space fixed at 0xf, we would propose
a header and multiple 'device info' records at offset 0x0 that would
encode everything that user space needs to know about
the device:

  0x0  +-------------+-------------+
       | magic       |   version   | u64   // magic u64 identifies the type of
       |   "vfio"    |             |       //   passthru I/O, plus version #
       |   "dtio"    |             |       //   "vfio" - PCI devices
       +-------------+-------------+       //   "dtio" - device tree devices
       | flags                     | u32   // encodes any flags (TBD)
       +---------------------------+
       |  dev info record N        |
       |    type                   | u32   // type of record
       |    rec_len                | u32   // length in bytes of record
       |                           |       //   (including record header)
       |    flags                  | u32   // type specific flags
       |    ...content...          |       // record content, which could
       +---------------------------+       //   include sub-records
       |  dev info record N+1      |
       +---------------------------+
       |  dev info record N+2      |
       +---------------------------+
       ...

The device info records following the file header have the following
record types each with content encoded in a record specific way:

 REGION  - describes an addressable address range for the device
 DTPATH - describes the device tree path for the device
 DTINDEX - describes the index into the related device tree
   property (reg,ranges,interrupts,interrupt-map)
 INTERRUPT - describes an interrupt for the device
 PCI_CONFIG_SPACE - describes config space for the device
 PCI_INFO - domain:bus:device:func
 PCI_BAR_INFO - information about the BAR for a device

For a device tree type device the file may look like:

  0x0 +---------------------+
      |   header            |
      +---------------------+
      |   type = REGION     |
      |   rec_len           |
      |   flags =           | u32    // region specific flags
      |     is_mmapable     |
      |   offset            | u64    // seek offset to region from
      |                     |        //   beginning of file
      |   len               | u64    // length of region
      |   addr              | u64    // phys addr of region
      |                     |
      +---------------------+
       \   type = DTPATH    \        // a sub-region
        |   rec_len         |
        |   flags           |
        |   dev tree path   | char[] // device tree path
      +---------------------+
       \   type = DTINDEX   \        // a sub-region
        |   rec_len         |
        |   flags           |
        |   prop_type       | u32    // REG, RANGES
        |   prop_index      | u32    // index into resource list
      +---------------------+
      |   type = INTERRUPT  |
      |   rec_len           |
      |   flags             | u32
      |   ioctl_handle      | u32    // argument to ioctl to get interrupts
      |                     |
      +---------------------+
       \   type = DTPATH    \        // a sub-region
        |   rec_len         |
        |   flags           |
        |   dev tree path   | char[] // device tree path
      +---------------------+
       \   type = DTINDEX   \        // a sub-region
        |   rec_len         |
        |   flags           |
        |   prop_type       | u32    // INTERRUPT, INTERRUPT_MAP
        |   prop_index      | u32    // index
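
In C, the header and record framing might look something like this
(a sketch only; field names and widths are illustrative, not a
final layout):

struct dtio_file_header {
        __u64 magic_version;    /* "vfio"/"dtio" magic plus version # */
        __u32 flags;            /* TBD */
        /* device info records follow */
};

struct dtio_dev_info_record {
        __u32 type;       /* REGION, DTPATH, DTINDEX, INTERRUPT, ... */
        __u32 rec_len;    /* length in bytes, including this header */
        __u32 flags;      /* type specific flags */
        __u8  content[];  /* record content, possibly sub-records */
};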


PCI devices would have a PCI specific encoding.  Instead of
config space and the mappable BAR regions being at specific
predetermined offsets, the device info records 

Re: [Qemu-devel] [PATCH] remove cross prefix from pkg-config command

2011-08-03 Thread Yoder Stuart-B08248


 -Original Message-
 From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo 
 Bonzini
 Sent: Wednesday, August 03, 2011 1:41 AM
 To: Stefan Weil
 Cc: Yoder Stuart-B08248; qemu-devel@nongnu.org
 Subject: Re: [PATCH] remove cross prefix from pkg-config command
 
 On 08/02/2011 11:01 PM, Stefan Weil wrote:
 
  I run cross builds for arm, mips, powerpc and mingw.
  All of them use the cross prefix. When running make, I neither want to
  specify a special PATH nor a PKG_CONFIG_PATH. All I need is something
  like make -C bin/arm (each cross target has its own directory with
  the binaries).
 
  The general idea of your patch is ok, but maybe you can modify it so
  the cross prefix is used if there is no PKG_CONFIG_PATH set?
 
 No, the PKG_CONFIG_PATH can be used by the user to add paths in his home 
 directory.
 
 The right fix is to do something like
 
 -pkg_config=${cross_prefix}${PKG_CONFIG-pkg-config}
 +pkg_config=${PKG_CONFIG-${cross_prefix}pkg-config}
 
 and likewise for all other tools.  Then Stuart can use pkg_config=pkg-config. 
  Stuart, can you
 do that?

That works...will re-submit.
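
For example (assuming the fix above), the cross build could then be
configured with:

  PKG_CONFIG=pkg-config ./configure --cross-prefix=powerpc-linux-gnu-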

Stuart




[Qemu-devel] glib dependency

2011-08-01 Thread Yoder Stuart-B08248
Anthony,

So in QEMU 0.15 rc2 it looks like the dependency on gio and
gthread has been removed, but glib is still required,
correct?

 ##
 # glib support probe
-if $pkg_config --modversion gthread-2.0 gio-2.0 > /dev/null 2>&1 ; then
-glib_cflags=`$pkg_config --cflags gthread-2.0 gio-2.0 2>/dev/null`
-glib_libs=`$pkg_config --libs gthread-2.0 gio-2.0 2>/dev/null`
+if $pkg_config --modversion glib-2.0 > /dev/null 2>&1 ; then
+glib_cflags=`$pkg_config --cflags glib-2.0 2>/dev/null`
+glib_libs=`$pkg_config --libs glib-2.0 2>/dev/null`

...I had the impression that the glib dependency
itself was going away.

I'm trying to get qemu integrated into our internal SDK
system for building user space and am trying to figure
out what is required.


Stuart




Re: [Qemu-devel] [PATCH 04/25] Add hard build dependency on glib

2011-07-27 Thread Yoder Stuart-B08248


 -Original Message-
 From: Anthony Liguori [mailto:aligu...@us.ibm.com]
 Sent: Tuesday, July 26, 2011 5:10 PM
 To: Yoder Stuart-B08248
 Cc: qemu-devel@nongnu.org
 Subject: Re: [PATCH 04/25] Add hard build dependency on glib
 
 On 07/26/2011 04:51 PM, Yoder Stuart-B08248 wrote:
 
  I am having issues with this in a cross compilation
  environment.   In Power embedded, almost all our
  development is using cross toolchains.
 
  Neither glib or pkg-config are in our cross build environment and I'm
  having issues getting them built and installed.
  Not even sure if pkg-config is even supposed to work in a cross
  development environment...I'm new to that tool and poking around a bit
  with google raises some questions.
 
 You're probably setting up your cross environment incorrectly which, 
 unfortunately, is very
 common.

I got glib to build without too much trouble, however, 'make install' tries to
re-link some stuff and at that point there seems to be a bug somewhere where 
libtool
fails to use the correct CFLAGS and PATH, and thus the make install partially
installs glib before erroring out.

 The proper thing to do is to have GCC use a different system include 
 directory and a different
 prefix.  That will result in a directory where there are gcc binaries with 
 normal names
 installed in ${cross_prefix}/bin
 
 You need to build and install pkg-config to this prefix too, and then when it 
 comes time to
 actually doing the QEMU configure, you should do something like:
 
 export PATH=${cross_prefix}/bin:$PATH
 export PKG_CONFIG_PATH=${cross_prefix}/lib/pkg-config:$PKG_CONFIG_PATH
 
 Many automated cross compiler environment scripts will install specially 
 named versions of gcc
 and binutils in your normal $PATH.  The trouble is, this is a bit of a hack 
 and unless you
 know to make this hack work with other build tools, it all comes tumbling 
 down.

Note-- this is not just a matter of me getting this to work in my
own private build environment, I'm working with a cross toolchain
that gets delivered to our customers that I have little control
over, and I need to get it working in that environment.

Looks like our cross tools have both a specially named version of the tool
(e.g. powerpc-linux-gnu-gcc) and plain (e.g. gcc).  Unfortunately the plain
version of the tools don't seem to be functional (and have never been used as
far as I know).

Will keep fiddling with this...

Stuart




Re: [Qemu-devel] [PATCH 04/25] Add hard build dependency on glib

2011-07-26 Thread Yoder Stuart-B08248
 From: Anthony Liguori address@hidden
 
 GLib is an extremely common library that has a portable thread implementation
 along with tons of other goodies.
 
 GLib and GObject have a fantastic amount of infrastructure we can leverage in
 QEMU including an object oriented programming infrastructure.
 
 Short term, it has a very nice thread pool implementation that we could 
 leverage
 in something like virtio-9p.  It also has a test harness implementation that
 this series will use.
 
 Signed-off-by: Anthony Liguori address@hidden
 Signed-off-by: Michael Roth address@hidden
 Signed-off-by: Luiz Capitulino address@hidden
 ---
  Makefile|2 ++
  Makefile.objs   |1 +
  Makefile.target |1 +
  configure   |   13 +
  4 files changed, 17 insertions(+), 0 deletions(-)
 
 diff --git a/Makefile b/Makefile
 index b3ffbe2..42ae4e5 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -106,6 +106,8 @@ audio/audio.o audio/fmodaudio.o: QEMU_CFLAGS += 
 $(FMOD_CFLAGS)
  
  QEMU_CFLAGS+=$(CURL_CFLAGS)
  
 +QEMU_CFLAGS+=$(GLIB_CFLAGS)
 +
  ui/cocoa.o: ui/cocoa.m
  
  ui/sdl.o audio/sdlaudio.o ui/sdl_zoom.o baum.o: QEMU_CFLAGS += $(SDL_CFLAGS)
 diff --git a/Makefile.objs b/Makefile.objs
 index c43ed05..55d18bb 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -376,3 +376,4 @@ vl.o: QEMU_CFLAGS+=$(GPROF_CFLAGS)
  
  vl.o: QEMU_CFLAGS+=$(SDL_CFLAGS)
  
 +vl.o: QEMU_CFLAGS+=$(GLIB_CFLAGS)
 diff --git a/Makefile.target b/Makefile.target
 index e20a313..cde509b 100644
 --- a/Makefile.target
 +++ b/Makefile.target
 @@ -204,6 +204,7 @@ QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
  QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
  QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
  QEMU_CFLAGS += $(VNC_PNG_CFLAGS)
 +QEMU_CFLAGS += $(GLIB_CFLAGS)
  
  # xen support
  obj-$(CONFIG_XEN) += xen-all.o xen_machine_pv.o xen_domainbuild.o 
 xen-mapcache.o
 diff --git a/configure b/configure
 index e57efb1..c0c8fdf 100755
 --- a/configure
 +++ b/configure
 @@ -1803,6 +1803,18 @@ EOF
  fi
  
  ##
 +# glib support probe
+if $pkg_config --modversion gthread-2.0 gio-2.0 > /dev/null 2>&1 ; then
+glib_cflags=`$pkg_config --cflags gthread-2.0 gio-2.0 2>/dev/null`
+glib_libs=`$pkg_config --libs gthread-2.0 gio-2.0 2>/dev/null`
+libs_softmmu="$glib_libs $libs_softmmu"
+libs_tools="$glib_libs $libs_tools"
+else
+echo "glib-2.0 required to compile QEMU"
+exit 1
+fi

I am having issues with this in a cross compilation 
environment.   In Power embedded, almost all our
development is using cross toolchains.

Neither glib or pkg-config are in our cross build environment
and I'm having issues getting them built and installed.
Not even sure if pkg-config is even supposed to work
in a cross development environment...I'm new to that
tool and poking around a bit with google raises
some questions.

Wanted to make you aware of the issue...

Stuart




Re: [Qemu-devel] device assignment for embedded Power

2011-07-05 Thread Yoder Stuart-B08248


 -Original Message-
 From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
 Sent: Thursday, June 30, 2011 7:58 PM
 To: Yoder Stuart-B08248
 Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; 
 alex.william...@redhat.com;
 anth...@codemonkey.ws; d...@au1.ibm.com; joerg.roe...@amd.com; 
 p...@codesourcery.com;
 blauwir...@gmail.com; arm...@redhat.com
 Subject: Re: device assignment for embedded Power
 
 On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
  One feature we need for QEMU/KVM on embedded Power Architecture is the
  ability to do passthru assignment of SoC I/O devices and memory.  An
  important use case in embedded is creating static partitions-- taking
  physical memory and I/O devices (non-PCI) and partitioning
  them between the host Linux and several virtual machines.   Things like
  live migration would not be needed or supported in these types of scenarios.
 
  SoC devices do not sit on a probeable bus and there are no identifiers
  like 01:00.0 with PCI that we can use to identify devices--  the host
  Linux kernel is made aware of SoC I/O devices from nodes/properties in a
  device tree structure passed at boot.   QEMU needs to generate a
  device tree to pass to the guest as well with all the guest's virtual
  and physical resources.  Today a number of mostly complete guest
   device trees are kept under ./pc-bios in QEMU, but this is too static and
  inflexible.
 
  Some new mechanism is needed to assign SoC devices to guests, and we
  (FSL + Alex Graf) have been discussing a few possible approaches for
  doing this from QEMU and would like some feedback.
 
  Some possibilities:
 
  1. Option 1.  Pass the host dev tree to QEMU and assign devices
 by device tree path
 
   -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
 
 /soc/i2c@3000 is the device tree path to the assigned device.
 The device node 'i2c@3000' has some number of properties (e.g.
 address, interrupt info) and possibly subnodes under
 it.   QEMU copies that node when generating the guest dev tree.
 See snippet of entire node:  http://paste2.org/p/1496460
 
 Yuck (see below)
 
  2. Option 2.  Pass the entire assigned device node as a string to
 QEMU
 
    -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
 #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
 reg = <0xffe03000 0x100>; interrupts = <43 2>;
 interrupt-parent = <&mpic>; dfsrr;'
 
 Beuark ! (see below)
 
 This avoids needing to pass the host device tree, but could
 get awkward-- the i2c example above is very simple, some device
 nodes are very large with a complex hierarchy of subnodes and
 could be hundreds of lines of text to represent a single
 node.
 
  It gets more complicated...
 
 
 So, from a qemu command line perspective, all you should have to do is pass 
 qemu the device-
 tree -path- to the device you want to pass-trough (you may support passing a 
 full hierarchy
 here).
 
 That is for normal MMIO mapped SoC devices. Something else (individual i2c, 
 usb, ...) will use
 specific virtualization of the corresponding busses.

Then why 'yuck' to option 1 :)?   That is basically what was being proposed.

 Anything else sucks too much really.
 
 From there, well, there are several approaches inside qemu/kvm to handle
 that path. If you want to do things at the qemu level you can probably
 parse /proc/device-tree. But I'd personally just make it a kernel thing.

 IE. I would have an ioctl to instantiate a pass-through device, that takes
 that path as an argument. I would make it return an anonymous fd which you
 can then use to mmap the resources, etc...

Regarding implementation I think there are 3 things that need
to be set up--  1) mmapping the device's registers, 2) getting the iommu
set up (if there is one), 3) getting the interrupt(s) handled.
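
To make the discussion concrete, the flow you describe might look
like this from user space (all names hypothetical):

int fd = ioctl(dtio_fd, DTIO_INSTANTIATE_DEVICE, "/soc/i2c@3000");

/* 1) the device's registers */
void *regs = mmap(NULL, 0x100, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);

/* 2) iommu setup and 3) interrupt delivery would hang off
 * further ioctls on the same fd */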

  In some cases, modifications to device tree nodes may be needed.
  An example-- sometimes a device tree property references another node
  and that relationship may not exist when assigned to a guest.
  A phy-handle property may need to be deleted and a fixed-link
  property added to a node representing a network device.
 
 That's fishy. Why wouldn't you give full access to the MDIO ? It's shared ? 
 Such things are so
 device-specific that they would have to be handled by device-specific quirks, 
 which can live
 either in qemu or in the kernel.

It is shared and in this case we didn't want the phy shared.   That was a super
simple example to illustrate the idea.  With our experience with the Freescale
Embedded Hypervisor we see this as a definite requirement-- nodes in the
hardware device may need modifications.  In the P4080 device tree there
are some complex relationships expressed between nodes of our 'data
path'.   In some cases the hardware device tree expresses configuration
information, and while it could be argued that config info does not belong

[Qemu-devel] device assignment for embedded Power

2011-06-30 Thread Yoder Stuart-B08248
One feature we need for QEMU/KVM on embedded Power Architecture is the 
ability to do passthru assignment of SoC I/O devices and memory.  An 
important use case in embedded is creating static partitions-- 
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines.   Things like
live migration would not be needed or supported in these types of scenarios.

SoC devices do not sit on a probeable bus and there are no identifiers 
like 01:00.0 with PCI that we can use to identify devices--  the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
device tree structure passed at boot.   QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources.  Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.

Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.

Some possibilities:

1. Option 1.  Pass the host dev tree to QEMU and assign devices
   by device tree path

 -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000

   /soc/i2c@3000 is the device tree path to the assigned device.
   The device node 'i2c@3000' has some number of properties (e.g. 
   address, interrupt info) and possibly subnodes under
   it.   QEMU copies that node when generating the guest dev tree.
   See snippet of entire node:  http://paste2.org/p/1496460

2. Option 2.  Pass the entire assigned device node as a string to
   QEMU

 -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
  #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
  reg = <0xffe03000 0x100>; interrupts = <43 2>;
  interrupt-parent = <&mpic>; dfsrr;'

   This avoids needing to pass the host device tree, but could 
   get awkward-- the i2c example above is very simple, some device
   nodes are very large with a complex hierarchy of subnodes and 
   could be hundreds of lines of text to represent a single
   node.

It gets more complicated...

In some cases, modifications to device tree nodes may be needed.
An example-- sometimes a device tree property references another node 
and that relationship may not exist when assigned to a guest.
A phy-handle property may need to be deleted and a fixed-link
property added to a node representing a network device.

So in addition to assigning a device, a mechanism is needed to update 
device tree nodes.  So for the above example, maybe--

 -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

The types of modifications needed--  deleting nodes, deleting properties, 
adding nodes, adding properties, adding properties that reference other
nodes, changing properties. This device tree transformation mechanism
needed is general enough that it could apply to any device tree based
embedded platform (e.g. ARM, MIPS).

Another complexity relates to the IOMMU.  Here things get very company 
and IOMMU specific. Freescale has a proprietary IOMMU.
Devices have 1 or more logical I/O device numbers used to index into 
the IOMMU table. The IOMMU is limited in that it is designed to only 
support large, physically contiguous mappings per device.  It does not 
support any kind of page table.  The IOMMU hardware architecture 
assumes DMAs are typically targeted to just a few address regions.  
So, a common IOMMU setup for a device would be a device with a single 
IOMMU mapping covering the guest's main memory segment.  However, 
there are many much more complicated IOMMU setups that are common as 
well, such as doing operation translations where a device's write 
transaction is translated to stash directly into CPU caches.  We 
can't assume that all memory slots belonging to the guest are targets 
of DMA.

So for Freescale we would need some very Freescale-specific 
configuration mechanism to set up the IOMMU.  Here I think we would 
need the new qcfg approach to expressing nested
structures (http://wiki.qemu.org/Features/QCFG).   Device
assignment with IOMMU set up might look like the examples
below:

# device with multiple logical i/o device numbers

-device assigned-soc-dev,dev=/qman-portals/qman-portal@4000,
vcpu=1,fsl,iommu.stash-mem={
dma-window.guest-addr=0x0,
dma-window.size=0x1,
liodn-index=1,
operation-mapping=0
stash-dest=1},
fsl,iommu.stash-dqrr={
dma-window.guest-addr=0xff420,
dma-window.size=0x4000,
liodn-index=0,
operation-mapping=0
stash-dest=1}

# assign pci-bus to a guest with multiple memory regions
#    addr        size
#    0x0         512MB
#    0x20000000  4KB   (for MSIs)
#    0x40000000  16MB  (shared memory)
#    0xc0000000  64MB  (shared memory)

-device assigned-soc-dev,dev=/pcie@ffe09000,
fsl,iommu={dma-window.guest-addr=0x0,
dma-window.size=0x1,

[Qemu-devel] RE: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Monday, February 07, 2011 12:52 PM
 To: Alexander Graf
 Cc: Yoder Stuart-B08248; Wood Scott-B07421; kvm-...@vger.kernel.org;
 k...@vger.kernel.org; qemu-devel@nongnu.org
 Subject: Re: RFC: New API for PPC for vcpu mmu access
 
 On Mon, 7 Feb 2011 17:49:51 +0100
 Alexander Graf ag...@suse.de wrote:
 
 
  On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:
 
   Suggested change to this would be to have Qemu set tlb_type as
   an _input_ argument.   If KVM supports it, that type gets used,
   else an error is returned.This would allow Qemu to tell
   the kernel what type of MMU it is prepared to support.   Without
   this Qemu would just have to error out if the type returned is
   unknown.
 
  Yes, we could use the same struct for get and set. On set, it could
 transfer the mmu type, on get it could tell userspace the mmu type.
 
 What happens if a get is done before the first set, and there are multiple
 MMU type options for this hardware, with differing entry sizes?
 
 Qemu would have to know beforehand how large to make the buffer.
 
 We could say that going forward, it's expected that qemu will do a TLB set
 (either a full one, or a lightweight alternative) after creating a vcpu.
 For compatibility, if this doesn't happen before the vcpu is run, the TLB
 is created and initialized as it is today, but no new Qemu-visible features
 will be enabled that way.

Since I think the normal thing Qemu would want to do is determine
the type/size before allocating space for the TLB, we could just
pass in NULL for tlb_data on the first set.   If tlb_data is
NULL we just set the MMU type and return the size (and type).
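
In code, that might look like (a sketch against the struct from the
earlier proposal; the ioctl and type names are illustrative):

struct request_ppc_tlb req = {
        .tlb_type = PPC_TLB_BOOKE_HV,   /* what qemu supports */
        .tlb_data = NULL,               /* NULL: negotiate only */
};

/* kernel sets req.tlb_entries (and confirms or rejects tlb_type) */
r = do_ioctl(SET_PPC_TLB, &req);
if (r == 0)
        req.tlb_data = qemu_malloc(req.tlb_entries * 64 /* bytes/entry */);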

 If Qemu does a get without ever doing some set operation, it should get an
 error, since the requirement to do a set is added at the same time as the
 get API.

Right.

Stuart




[Qemu-devel] RE: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Yoder Stuart-B08248


 -Original Message-
 From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org]
 On Behalf Of Avi Kivity
 Sent: Monday, February 07, 2011 11:14 AM
 To: Alexander Graf
 Cc: Wood Scott-B07421; Yoder Stuart-B08248; kvm-...@vger.kernel.org;
 k...@vger.kernel.org; qemu-devel@nongnu.org
 Subject: Re: RFC: New API for PPC for vcpu mmu access
 
 On 02/03/2011 11:19 AM, Alexander Graf wrote:
  
I have no idea what things will look like 10 years down the road,
   but  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
 
  That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we
 want to transfer every time qemu feels like resolving an EA.
 
 You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or
 similar), or have the TLB stored in user memory, so there is no need to
 transfer it (on the other hand, you have to re-validate it every time you
 peek at it).

The most convenient and flexible thing for  Power Book III-E I think
will be something that operates like a TLB search instruction.  Inputs
are 'address space' and 'process id' and outputs are in which TLB the
entry was found and all the components of a TLB entry:
   address space
   pid
   entry number
   ea
   rpn
   guest state
   permissions flags
   attributes (WIMGE)

Since all of those fields are architected in MAS registers, in the previous
proposal we just proposed to return several 32-bit fields (one per MAS)
that use the architected layout instead of inventing a brand new
structure defining these fields.

Stuart




[Qemu-devel] RE: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Yoder Stuart-B08248

  A fixed array does mean you wouldn't have to worry about whether qemu
  supports the more advanced struct format if fields are added -- you
  can just unconditionally write it, as long as it's backwards
  compatible.  Unless you hit the limit of the pre-determined array
  size, that is.  And if that gets made higher for future expansion,
  that's even more data that has to get transferred, before it's really
 needed.
 
 Yes, it is. And I don't see how we could easily avoid it. Maybe just pass
 in a random __user pointer that we directly write to from kernel space and
 tell qemu how big and what type a tlb entry is?
 
  struct request_ppc_tlb {
          int tlb_type;
          int tlb_entries;
          uint64_t __user *tlb_data;
  };
 
  in qemu:
  
  struct request_ppc_tlb req;
  
  req.tlb_data = qemu_malloc(PPC_TLB_SIZE_MAX);
  r = do_ioctl(REQUEST_PPC_TLB, req);
  if (r == -ENOMEM) {
          cpu_abort(env, "TLB too big");
  }
  
  switch (req.tlb_type) {
  case PPC_TLB_xxx:
          copy_reg_to_tlb_for_xxx(env, req.tlb_data);
  }
  
  something like this. Then we should be flexible enough for the foreseeable
  future and make it possible for kernel space to switch MMU modes in case we
  need that.

Suggested change to this would be to have Qemu set tlb_type as
an _input_ argument.  If KVM supports it, that type gets used;
else an error is returned.  This would allow Qemu to tell the
kernel what type of MMU it is prepared to support.  Without this,
Qemu would just have to error out if the type returned is unknown.
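
A minimal sketch of that negotiation from the Qemu side, reusing the
hypothetical request_ppc_tlb example above (the PPC_TLB_* type values
are likewise made up for illustration):

    /* Try the MMU types qemu understands, most preferred first. */
    static const int known_types[] = { PPC_TLB_BOOKE_HV, PPC_TLB_BOOKE_NOHV };
    struct request_ppc_tlb req = {0};
    int i, r = -1;

    for (i = 0; i < ARRAY_SIZE(known_types); i++) {
        req.tlb_type = known_types[i];   /* input: a type qemu supports */
        req.tlb_data = NULL;             /* first call: query size only */
        r = do_ioctl(REQUEST_PPC_TLB, &req);
        if (r == 0) {
            break;                       /* kernel accepted this type */
        }
    }
    if (r < 0) {
        cpu_abort(env, "no mutually supported MMU type");
    }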

Stuart




[Qemu-devel] RFC: New API for PPC for vcpu mmu access

2011-02-02 Thread Yoder Stuart-B08248

Below is a proposal for a new API for PPC to allow KVM clients
to set MMU state in a vcpu.

BookE processors have one or more software managed TLBs and
currently there is no mechanism for Qemu to initialize
or access them.  This is needed for normal initialization
as well as debug.

There are 4 APIs:
   
-KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
 of MMU with KVM-- the type determines the size and format
 of the data in the other APIs

-KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
 TLBs in the vcpu

-KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
 specifies the format of the MMU data passed in

-KVM_PPC_GET_TLB allows searching, reading a specific TLB entry,
 or iterating over an entire TLB.  Some TLBs may have an unspecified
 geometry and thus the need to be able to iterate in order
 to dump the TLB.  The Power architecture specifies the format
 of the MMU data

Feedback welcome.

Thanks,
Stuart Yoder

--

KVM PPC MMU API
---------------

User space can query whether the APIs to access the vcpu MMU
are available with the KVM_CHECK_EXTENSION API, using the
KVM_CAP_PPC_MMU argument (a probe sketch follows the list below).

If the KVM_CAP_PPC_MMU return value is non-zero it specifies that
the following APIs are available:

   KVM_PPC_SET_MMU_TYPE
   KVM_PPC_INVALIDATE_TLB
   KVM_PPC_SET_TLBE
   KVM_PPC_GET_TLB
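
For illustration, the probe follows the usual KVM_CHECK_EXTENSION
pattern (kvm_fd setup is assumed; KVM_CAP_PPC_MMU is the capability
proposed here, not a merged one):

    /* Probe for the proposed MMU API before relying on it. */
    int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MMU);
    if (r <= 0) {
        /* API absent: no way to initialize or dump the guest TLB */
    }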


KVM_PPC_SET_MMU_TYPE
--------------------

Capability: KVM_CAP_PPC_SET_MMU_TYPE
Architectures: powerpc
Type: vcpu ioctl
Parameters: __u32 mmu_type (in)
Returns: 0 if specified MMU type is supported, else -1

Sets the MMU type.  Valid input values are:
   BOOKE_NOHV   0x1
   BOOKE_HV     0x2

A return value of 0x0 indicates that KVM supports
the specified MMU type.
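
A usage sketch, assuming the argument is passed by pointer as is usual
for KVM vcpu ioctls (vcpu_fd setup and the fallback policy are
illustrative):

    __u32 mmu_type = BOOKE_HV;

    /* Negotiate: fall back to the non-HV format if HV is refused. */
    if (ioctl(vcpu_fd, KVM_PPC_SET_MMU_TYPE, &mmu_type) < 0) {
        mmu_type = BOOKE_NOHV;
        if (ioctl(vcpu_fd, KVM_PPC_SET_MMU_TYPE, &mmu_type) < 0) {
            /* no mutually supported MMU type: API unusable */
        }
    }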

KVM_PPC_INVALIDATE_TLB
----------------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters: none
Returns: 0 on success, -1 on error

Invalidates all TLB entries in all TLBs of the vcpu.
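
Usage is a bare vcpu ioctl; a one-line sketch (vcpu_fd assumed):

    /* Drop every entry in every TLB of this vcpu. */
    if (ioctl(vcpu_fd, KVM_PPC_INVALIDATE_TLB, NULL) < 0) {
        /* invalidate failed */
    }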

KVM_PPC_SET_TLBE
----------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters:
For mmu types BOOKE_NOHV and BOOKE_HV: struct kvm_ppc_booke_mmu (in)
Returns: 0 on success, -1 on error

Sets an MMU entry in a virtual CPU.

For mmu types BOOKE_NOHV and BOOKE_HV:

  To write a TLB entry, set the mas fields of kvm_ppc_booke_mmu 
  as per the Power architecture.

  struct kvm_ppc_booke_mmu {
        union {
              __u64 mas0_1;
              struct {
                    __u32 mas0;
                    __u32 mas1;
              };
        };
        __u64 mas2;
        union {
              __u64 mas7_3;
              struct {
                    __u32 mas7;
                    __u32 mas3;
              };
        };
        union {
              __u64 mas5_6;
              struct {
                    __u32 mas5;
                    __u32 mas6;
              };
        };
        __u32 mas8;
  };

  For a mmu type of BOOKE_NOHV, the mas5 and mas8 fields
  in kvm_ppc_booke_mmu are present but not supported.
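
For example, writing a single entry could look like the following
sketch (the mas*_image values stand for architected MAS register
images built per the Power ISA; they are not spelled out here, and
vcpu_fd setup is assumed):

    struct kvm_ppc_booke_mmu tlbe = {0};

    tlbe.mas0   = mas0_image;    /* TLBSEL/ESEL: which TLB and entry */
    tlbe.mas1   = mas1_image;    /* valid bit, TID, TS, TSIZE */
    tlbe.mas2   = mas2_image;    /* EPN and WIMGE attributes */
    tlbe.mas7_3 = mas7_3_image;  /* RPN and access permissions */

    if (ioctl(vcpu_fd, KVM_PPC_SET_TLBE, &tlbe) < 0) {
        /* entry rejected by KVM */
    }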


KVM_PPC_GET_TLB
---------------

Capability: KVM_CAP_PPC_MMU
Architectures: powerpc
Type: vcpu ioctl
Parameters: struct kvm_ppc_get_mmu (in/out)
Returns: 0 on success
 -1 on error
 errno = ENOENT when iterating and there are no more entries to read

Reads an MMU entry from a virtual CPU.

  struct kvm_ppc_get_mmu {
        /* in */
        void *mmu;
        __u32 flags;
              /* a bitmask of flags to the API */
              /* TLB_READ_FIRST   0x1  */
              /* TLB_SEARCH       0x2  */
        /* out */
        __u32 max_entries;
  };

For mmu types BOOKE_NOHV and BOOKE_HV :

  The void *mmu field of kvm_ppc_get_mmu points to
  a struct of type struct kvm_ppc_booke_mmu.

  If TLBnCFG[NENTRY] > 0 and TLBnCFG[ASSOC] > 0, the TLB has
  a known number of entries and associativity.  The mas0[ESEL]
  and mas2[EPN] fields specify which entry to read.
  
  If TLBnCFG[NENTRY] == 0 the number of TLB entries is 
  undefined and this API can be used to iterate over
  the entire TLB selected with TLBSEL in mas0.
  
  -To read a TLB entry:
  
 set the following fields in the mmu struct (struct kvm_ppc_booke_mmu):
flags=0
mas0[TLBSEL] // select which TLB is being read
mas0[ESEL]   // select which entry is being read
mas2[EPN]// effective address 
  
 On return the following fields are updated as per the Power architecture:
mas0
mas1 
mas2 
mas3 
mas7 
  
  -To iterate over a TLB (read all entries):
  
To start an iteration sequence, set the following fields in
the mmu struct (struct kvm_ppc_booke_mmu):
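
A sketch of a single-entry read using the structures above (the
mas*_image selector values stand for architected MAS images, and
vcpu_fd setup is assumed):

    struct kvm_ppc_booke_mmu entry = {0};
    struct kvm_ppc_get_mmu req = {0};

    entry.mas0 = mas0_image;     /* TLBSEL/ESEL select the entry to read */
    entry.mas2 = mas2_image;     /* EPN, where the geometry requires it */

    req.mmu   = &entry;
    req.flags = 0;               /* plain indexed read, no search/iterate */

    if (ioctl(vcpu_fd, KVM_PPC_GET_TLB, &req) == 0) {
        /* entry.mas0/1/2/3/7 now hold the architected entry contents */
    }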

[Qemu-devel] RE: RFC: New API for PPC for vcpu mmu access

2011-02-02 Thread Yoder Stuart-B08248


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Wednesday, February 02, 2011 3:34 PM
 To: Yoder Stuart-B08248
 Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; qemu-devel@nongnu.org
 Subject: Re: RFC: New API for PPC for vcpu mmu access
 
 
 On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
 
  Below is a proposal for a new API for PPC to allow KVM clients to set
  MMU state in a vcpu.
 
  BookE processors have one or more software managed TLBs and currently
  there is no mechanism for Qemu to initialize or access them.  This is
  needed for normal initialization as well as debug.
 
  There are 4 APIs:
 
  -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU
  with KVM-- the type determines the size and format of the data in the
  other APIs
 
 This should be done through the PVR hint in sregs, no? Usually a single CPU
 type only has a single MMU type.
 
  -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the
  vcpu
 
  -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies
  the format of the MMU data passed in
 
 This seems too fine-grained. I'd prefer a list of all TLB entries to be
 pushed in either direction. What's the foreseeable number of TLB entries
 within the next 10 years?
 
 Having the whole stack available would make the sync with qemu easier and
 also allows us to only do a single ioctl for all the TLB management. Thanks
 to the PVR we know the size of the TLB, so we don't have to shove that
 around.

Yes, we thought about that approach, but the idea here, as Scott
described, was to provide an API that can work even if user space
is unaware of the geometry of the TLB.

Take a look at Power ISA Version 2.06.1 (on power.org) at the definition
of TLBnCFG in Book E.  The NENTRY and ASSOC fields now have meanings that
allow TLB geometries that cannot be described in the TLBnCFG
registers.

I think the use case where this API would be used the most
would be from a gdb stub that needed to look up an effective
address.

Stuart