On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote: > On 03/20/2014 06:57 AM, Alex Williamson wrote: > > On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote: > >> From: David Gibson <da...@gibson.dropbear.id.au> > >> > >> This patch uses the new IOMMU notifiers to allow VFIO pass through devices > >> to work with guest side IOMMUs, as long as the host-side VFIO iommu has > >> sufficient capability and granularity to match the guest side. This works > >> by tracking all map and unmap operations on the guest IOMMU using the > >> notifiers, and mirroring them into VFIO. > >> > >> There are a number of FIXMEs, and the scheme involves rather more notifier > >> structures than I'd like, but it should make for a reasonable proof of > >> concept. > >> > >> Signed-off-by: David Gibson <da...@gibson.dropbear.id.au> > >> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > >> > >> --- > >> Changes: > >> v4: > >> * fixed list objects naming > >> * vfio_listener_region_add() reworked to call memory_region_ref() from one > >> place only, it is also easier to review the changes > >> * fixes boundary check not to fail on sections == 2^64 bytes, > >> the "vfio: Fix debug output for int128 values" patch is required; > >> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support > >> for 128 bit memory section sizes" patch proposal > >> --- > >> hw/misc/vfio.c | 126 > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++--- > >> 1 file changed, 120 insertions(+), 6 deletions(-) > >> > >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c > >> index 038010b..4f6f5da 100644 > >> --- a/hw/misc/vfio.c > >> +++ b/hw/misc/vfio.c > >> @@ -159,10 +159,18 @@ typedef struct VFIOContainer { > >> }; > >> void (*release)(struct VFIOContainer *); > >> } iommu_data; > >> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; > >> QLIST_HEAD(, VFIOGroup) group_list; > >> QLIST_ENTRY(VFIOContainer) next; > >> } VFIOContainer; > >> > >> +typedef struct VFIOGuestIOMMU { > >> + VFIOContainer *container; > >> + MemoryRegion *iommu; > >> + Notifier n; > >> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; > >> +} VFIOGuestIOMMU; > >> + > >> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR > >> map */ > >> typedef struct VFIOMSIXInfo { > >> uint8_t table_bar; > >> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, > >> hwaddr iova, > >> > >> static bool vfio_listener_skipped_section(MemoryRegionSection *section) > >> { > >> - return !memory_region_is_ram(section->mr) || > >> - /* > >> + return (!memory_region_is_ram(section->mr) && > >> + !memory_region_is_iommu(section->mr)) || > >> + /* > > > > White space damage > > > >> * Sizing an enabled 64-bit BAR can cause spurious mappings to > >> * addresses in the upper part of the 64-bit address space. > >> These > >> * are never accessed by the CPU and beyond the address width > >> of > >> @@ -2251,6 +2260,61 @@ static bool > >> vfio_listener_skipped_section(MemoryRegionSection *section) > >> section->offset_within_address_space & (1ULL << 63); > >> } > >> > >> +static void vfio_iommu_map_notify(Notifier *n, void *data) > >> +{ > >> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); > >> + VFIOContainer *container = giommu->container; > >> + IOMMUTLBEntry *iotlb = data; > >> + MemoryRegion *mr; > >> + hwaddr xlat; > >> + hwaddr len = iotlb->addr_mask + 1; > >> + void *vaddr; > >> + int ret; > >> + > >> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", > >> + iotlb->iova, iotlb->iova + iotlb->addr_mask); > >> + > >> + /* > >> + * The IOMMU TLB entry we have just covers translation through > >> + * this IOMMU to its immediate target. We need to translate > >> + * it the rest of the way through to memory. > >> + */ > >> + mr = address_space_translate(&address_space_memory, > >> + iotlb->translated_addr, > >> + &xlat, &len, iotlb->perm & IOMMU_WO); > > > > Write-only? Is this supposed to be read-write to mask just 2 bits? > > > The last parameter of address_space_translate() bool is_write. So I do not > really understand the problem here.
Oops, my bad, I didn't look at what address_space_translate() used that for. Ok. > >> + if (!memory_region_is_ram(mr)) { > >> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n", > >> + xlat); > >> + return; > >> + } > >> + if (len & iotlb->addr_mask) { > >> + DPRINTF("iommu has granularity incompatible with target AS\n"); > > > > Is this possible? Assuming len is initially a power-of-2, would the > > translate function change it? Maybe worth a comment to explain. > > > Oh. address_space_translate() actually changes @len to min(len, > TARGET_PAGE_SIZE) and TARGET_PAGE_SIZE is hardcoded to 4K. So far it was ok > but lately I have been implementing a huge DMA window (plus one > sPAPRTCETable and one VFIOGuestIOMMU objects) which currently operates with > 16MB pages (can do 64K pages too) and now this "granularity incompatible" > is happening. > > I disabled that check but I need to think of better fix... > > Adding Paolo to cc, may be he picks the context and gives good piece of > advise :) > > > > > > >> + return; > >> + } > >> + > >> + vaddr = memory_region_get_ram_ptr(mr) + xlat; > > > > This lookup isn't free and the unmap path doesn't need it, maybe move > > the variable and lookup into the first branch below? > > > >> + > >> + if (iotlb->perm != IOMMU_NONE) { > >> + ret = vfio_dma_map(container, iotlb->iova, > >> + iotlb->addr_mask + 1, vaddr, > >> + !(iotlb->perm & IOMMU_WO) || mr->readonly); > >> + if (ret) { > >> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " > >> + "0x%"HWADDR_PRIx", %p) = %d (%m)", > >> + container, iotlb->iova, > >> + iotlb->addr_mask + 1, vaddr, ret); > >> + } > >> + } else { > >> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + > >> 1); > >> + if (ret) { > >> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " > >> + "0x%"HWADDR_PRIx") = %d (%m)", > >> + container, iotlb->iova, > >> + iotlb->addr_mask + 1, ret); > >> + } > >> + } > >> +} > >> + > >> static void vfio_listener_region_add(MemoryListener *listener, > >> MemoryRegionSection *section) > >> { > >> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener > >> *listener, > >> void *vaddr; > >> int ret; > >> > >> - assert(!memory_region_is_iommu(section->mr)); > >> - > >> if (vfio_listener_skipped_section(section)) { > >> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", > >> section->offset_within_address_space, > >> @@ -2286,15 +2348,47 @@ static void > >> vfio_listener_region_add(MemoryListener *listener, > >> return; > >> } > >> > >> + memory_region_ref(section->mr); > >> + > >> + if (memory_region_is_iommu(section->mr)) { > >> + VFIOGuestIOMMU *giommu; > >> + > >> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", > >> + iova, int128_get64(int128_sub(llend, int128_one()))); > >> + /* > >> + * FIXME: We should do some checking to see if the > >> + * capabilities of the host VFIO IOMMU are adequate to model > >> + * the guest IOMMU > >> + * > >> + * FIXME: This assumes that the guest IOMMU is empty of > >> + * mappings at this point - we should either enforce this, or > >> + * loop through existing mappings to map them into VFIO. > >> + * > >> + * FIXME: For VFIO iommu types which have KVM acceleration to > >> + * avoid bouncing all map/unmaps through qemu this way, this > >> + * would be the right place to wire that up (tell the KVM > >> + * device emulation the VFIO iommu handles to use). > >> + */ > > > > That's a lot of FIXMEs... The second one in particular looks like it > > needs to expand a bit on why this is likely a valid assumption. The > > last one is more of a TODO than a FIXME. > > > >> + giommu = g_malloc0(sizeof(*giommu)); > >> + giommu->iommu = section->mr; > >> + giommu->container = container; > >> + giommu->n.notify = vfio_iommu_map_notify; > >> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); > >> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); > >> + > >> + return; > >> + } > >> + > >> + /* Here we assume that memory_region_is_ram(section->mr)==true */ > >> + > >> end = int128_get64(llend); > >> vaddr = memory_region_get_ram_ptr(section->mr) + > >> section->offset_within_region + > >> (iova - section->offset_within_address_space); > >> > >> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", > >> + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", > >> iova, end - 1, vaddr); > >> > >> - memory_region_ref(section->mr); > >> ret = vfio_dma_map(container, iova, end - iova, vaddr, > >> section->readonly); > >> if (ret) { > >> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " > >> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener > >> *listener, > >> return; > >> } > >> > >> + if (memory_region_is_iommu(section->mr)) { > >> + VFIOGuestIOMMU *giommu; > >> + > >> + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { > >> + if (giommu->iommu == section->mr) { > >> + memory_region_unregister_iommu_notifier(&giommu->n); > >> + QLIST_REMOVE(giommu, giommu_next); > >> + g_free(giommu); > >> + break; > >> + } > >> + } > >> + > >> + /* > >> + * FIXME: We assume the one big unmap below is adequate to > >> + * remove any individual page mappings in the IOMMU which > >> + * might have been copied into VFIO. That may not be true for > >> + * all IOMMU types > >> + */ > > > > We assume this because the IOVA that gets unmapped is the same > > regardless of whether a guest IOMMU is present? > > > What exactly is meant by "guest IOMMU is present"? Doing the second DMA > window, now I am really confused about terminology :( The confusion for me is that add_region initializes the giommu and all the DMA mapping through VFIO is done in the notifier for the giommu. It's therefore asymmetric that add_region doesn't vfio_dma_map anything, but region_del does vfio_dma_unmap, which is the basis of my question. I thought this comment was trying to address why that is, but apparently it's something else entirely, so it would be nice to understand why this doesn't return() and decode a bit more clearly what the FIXME is trying to say. Thanks, Alex > >> + } > >> + > >> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > >> end = (section->offset_within_address_space + > >> int128_get64(section->size)) & > >> TARGET_PAGE_MASK; > > > > > > > >