spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

Alexey Kardashevskiy Thu, 24 Mar 2016 02:11:56 -0700

On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:

On 03/23/2016 05:03 PM, David Gibson wrote:

On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:

On 03/23/2016 01:53 PM, David Gibson wrote:

On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:

On 03/23/2016 12:08 PM, David Gibson wrote:

On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:

On 03/22/2016 04:14 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:

New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is
added/removed.


This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses euristic to decide on the
TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
  hw/vfio/common.c | 108
++++++++++++++++++++++++++++++++++++++++++++++++++++---
  trace-events     |   2 ++
  2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer
*container,
      return 0;
  }

+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr
min_iova)
+{
+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container,
min_iova, 0x1000);


The hard-coded 0x1000 looks dubious..


Well, that's the minimal page size...


Really?  Some BookE CPUs support 1KiB page size..


Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)


Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
it's been done for CPU MMU I wouldn't count on it not being done for
IOMMU.

1 is a safer choice.

+    g_assert(hiommu);
+    QLIST_REMOVE(hiommu, hiommu_next);
+}
+
  static bool vfio_listener_skipped_section(MemoryRegionSection
*section)
  {
      return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void
vfio_listener_region_add(MemoryListener *listener,
      }
      end = int128_get64(llend);

+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {


I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.



It is rather vfio_spapr_create_host_window() and we were avoiding
xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
separate file but this usually triggers more discussion and never
ends well.

+        unsigned entries, pages;
+        struct vfio_iommu_spapr_tce_create create = { .argsz =
sizeof(create) };
+
+        g_assert(section->mr->iommu_ops);
+        g_assert(memory_region_is_iommu(section->mr));


I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.


In what configuration/machine can we do that on SPAPR?


spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.



I am pretty sure VFIO won't work in this case anyway.


I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.


This is not about TCG (pseries TCG guest works with VFIO on powernv host),
this is about things like VFIO_IOMMU_GET_INFO vs.
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.

Should I add such support in this patchset?


Unless adding the generality is really complex, and so far I haven't
seen a reason for it to be.


Seriously? :(



So, I tried.

With q35 machine (pc-i440fx-2.6 have even worse memory tree), there areseveral RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, thena gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after4GB the rest of the RAM:


memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system

0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g@pc.ram 0000000000000000-000000007fffffff

    0000000000000000-ffffffffffffffff (prio -1, RW): pci
      00000000000c0000-00000000000dffff (prio 1, RW): pc.rom

00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios@pc.bios 0000000000020000-000000000003ffff

      00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0

00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0mmaps[0]

      00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
        00000000febf0000-00000000febf007f (prio 0, RW): msix-table
        00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
      00000000febf2000-00000000febf2fff (prio 1, RW): ahci
      00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
[...]

0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g@pc.ram 0000000080000000-00000001ffffffff

Ok, let's say we filter out "pc.bios", "pc.rom", BARs (aka "skip dump"regions) and we end up having RAM below 2GB and above 4GB, 2 windows.Problem is a window needs to be aligned to power of two.

Ok, I change my test from -m8G to -m10G to have 2 aligned regions - 2GB and8GB), next problem - the second window needs to start from 1<<<59, not 4Gor any other random offset.


Ok, I change my test to start with -m1G to have a single window.
Now it fails because here is what happens:
region_add: pc.ram: 0 40000000
region_del: pc.ram: 0 40000000
region_add: pc.ram: 0 c0000

The second "add" is not power of two -> fail. And I cannot avoid this -"pc.rom" is still there, it is a separate region so RAM gets split intosmaller chunks. I do not know to to fix this properly.

So, in order to make it (x86/tcg on powernv host) work anyhow, I had tocreate a single huge window (as it is a single window - it starts from @0)unconditionally in vfio_connect_container(), not invfio_listener_region_add(); and I added filtering (skip "pc.bios","pc.rom", BARs - these things start above 1GB) and then I could boot ax86_64 guest and even pass Mellanox ConnextX3, it would bring 2 interfacesup and dhclient assigned IPs to them (which is quite amusing that it evenworks).



So, I think I will replace assert() with:

unsigned pagesize = qemu_real_host_page_size;
if (section->mr->iommu_ops) {
    pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
}

but there is no practical use for this anyway.

What do I miss and what do I need to try more to proceed with this patch?Thanks.



--
Alexey

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

Reply via email to