On 20/01/18 02:57, Alex Williamson wrote: > On Fri, 19 Jan 2018 19:55:49 +1100 > Alexey Kardashevskiy <a...@ozlabs.ru> wrote: > >> On 19/01/18 08:59, Alex Williamson wrote: >>> On Tue, 16 Jan 2018 16:17:58 +1100 >>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: >>> >>>> On 06/01/18 02:29, Alex Williamson wrote: >>>>> On Fri, 5 Jan 2018 10:48:07 +0100 >>>>> Auger Eric <eric.au...@redhat.com> wrote: >>>>> >>>>>> Hi Alexey, >>>>>> >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>>>> >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>>>> are emulated on top unless the machine requests not to by defining and >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>>>> enabling it as it does not define the "set" callback for the new >>>>>>> property; >>>>>>> the new property also does not appear in "-machine pseries,help". >>>>>>> >>>>>>> If the new capability is present, this puts MSIX IO memory region under >>>>>>> mapped memory region. If the capability is not there, it falls back to >>>>>>> the old behaviour with the sparse capability. >>>>>>> >>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>>>> emulated by QEMU as before. >>>>>>> >>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>>>> >>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>>>>> --- >>>>>>> >>>>>>> This is mtree and flatview BEFORE this patch: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: p...@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): >>>>>>> p...@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba >>>>>>> [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 >>>>>>> BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #0 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 >>>>>>> mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: p...@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): >>>>>>> p...@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 >>>>>>> BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> [disabled] >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba >>>>>>> [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 >>>>>>> BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #2 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 >>>>>>> mmaps[0] >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 >>>>>>> mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>>>> to enable emulation: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: p...@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): >>>>>>> p...@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 >>>>>>> BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba >>>>>>> [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 >>>>>>> BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #1 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 >>>>>>> mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 >>>>>>> mmaps[0] @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 >>>>>>> mmaps[0] >>>>>>> --- >>>>>>> include/hw/vfio/vfio-common.h | 1 + >>>>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>>>> hw/ppc/spapr.c | 8 ++++++++ >>>>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h >>>>>>> b/include/hw/vfio/vfio-common.h >>>>>>> index f3a2ac9..927d600 100644 >>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int >>>>>>> index, >>>>>>> struct vfio_region_info **info); >>>>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> uint32_t subtype, struct vfio_region_info >>>>>>> **info); >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int >>>>>>> region); >>>>>>> #endif >>>>>>> extern const MemoryListener vfio_prereg_listener; >>>>>>> >>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>>>> index 4312e96..b45182e 100644 >>>>>>> --- a/linux-headers/linux/vfio.h >>>>>>> +++ b/linux-headers/linux/vfio.h >>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>>>> >>>>>>> +/* >>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be >>>>>>> mmapped. >>>>>>> + */ >>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>>>> + >>>>>>> /** >>>>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>>>> * struct vfio_irq_info) >>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>>>> index 9de63f0..693394a 100644 >>>>>>> --- a/hw/ppc/spapr.c >>>>>>> +++ b/hw/ppc/spapr.c >>>>>>> @@ -2772,6 +2772,11 @@ static void >>>>>>> spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>>>> spapr->use_hotplug_event_source = value; >>>>>>> } >>>>>>> >>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>>>> +{ >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>>>> { >>>>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>>>> object_property_set_description(obj, "vsmt", >>>>>>> "Virtual SMT: KVM behaves as if >>>>>>> this were" >>>>>>> " the host's SMT mode", >>>>>>> &error_abort); >>>>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>>>> } >>>>>>> >>>>>>> static void spapr_machine_finalizefn(Object *obj) >>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>>>> /* >>>>>>> * pseries-2.11 >>>>>>> */ >>>>>>> + >>>>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>>>> { >>>>>>> } >>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>> index 1fb8a8e..da5f182 100644 >>>>>>> --- a/hw/vfio/common.c >>>>>>> +++ b/hw/vfio/common.c >>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice >>>>>>> *vbasedev, uint32_t type, >>>>>>> return -ENODEV; >>>>>>> } >>>>>>> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int >>>>>>> region) >>>>>>> +{ >>>>>>> + struct vfio_region_info *info = NULL; >>>>>>> + bool ret = false; >>>>>>> + >>>>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>>>> + if (vfio_get_region_info_cap(info, cap_type)) { >>>>>>> + ret = true; >>>>>>> + } >>>>>>> + g_free(info); >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> /* >>>>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>>>> */ >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>> index c977ee3..27a3706 100644 >>>>>>> --- a/hw/vfio/pci.c >>>>>>> +++ b/hw/vfio/pci.c >>>>>>> @@ -1289,6 +1289,11 @@ static void >>>>>>> vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>>>> off_t start, end; >>>>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>>>> >>>>>>> + if (vfio_is_cap_present(&vdev->vbasedev, >>>>>>> VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>>>> + vdev->msix->table_bar)) { >>>>>>> + return; >>>>>>> + } >>>>>> >>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>>>> msix table at 0x2000, using 64kB pages. >>>>>> >>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>>>> mmaps[0] >>>>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>>>> >>>>>> >>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>>>> be mapped with 64kB page, leading to a qemu abort. >>>>>> >>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>>>> 0xffffa8820000) = -22 (Invalid argument) >>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>>>> >>>>>> In my case don't we still need the fixup logic? >>>>> >>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about >>>>> this. I'd say no, the fixup shouldn't be needed, but apparently we >>>>> can't remove it until we figure out this bug and who should be filtering >>>>> out sub-page size listener events. Thanks, >>>> >>>> >>>> This should do it: >>>> >>>> @@ -303,11 +303,19 @@ static bool >>>> vfio_listener_skipped_section(MemoryRegionSection *section) >>>> * Sizing an enabled 64-bit BAR can cause spurious mappings to >>>> * addresses in the upper part of the 64-bit address space. >>>> These >>>> * are never accessed by the CPU and beyond the address width >>>> of >>>> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU >>>> width. >>>> */ >>>> - section->offset_within_address_space & (1ULL << 63); >>>> + section->offset_within_address_space & (1ULL << 63) || >>>> + /* >>>> + * Non-system-page aligned RAM regions have no chance to be >>>> + * DMA mapped so skip them too. >>>> + */ >>>> + (memory_region_is_ram(section->mr) && >>>> + ((section->offset_within_region & qemu_real_host_page_mask) || >>>> + (int128_getlo(section->size) & qemu_real_host_page_mask)) >>>> + ); >>>> } >>>> >>>> >>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have >>>> more complicated logic to detect whether it is MMIO? Or there is something >>>> bigger? >>> >>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the >>> silly address hack above to try to exclude PCI BARs as they're being >>> resized, and b) there is some value to enabling peer-to-peer mappings >>> between devices. Is there a better, more generic solution to a) now? >>> If we know a MR is MMIO, then I think we can handle any associated >>> mapping failure as non-fatal. It's nice to enable p2p, but it's barely >>> used and not clear that it works reliably (sometimes even on bare >>> metal). >> >> When P2P works in a guest, what does it look like exactly? Do the devices >> talk via the IOMMU or directly? I'd assume the latter but then the guest >> device driver needs to know physical MMIO addresses and it does not or >> guest must program BARs exactly as the host did... Has anyone actually >> reported this working? > > I've seen it work, NVIDIA has a peer to peer test in their cuda tests, > see previous patches enabling GPUDirect. The route is generally > through the IOMMU, guests don't know host physical addresses. I say > generally though because ACS does have a Directed Translated P2P > feature that would allow a more efficient path, but I've never seen > anything that can take advantage of this.
Ah, I see. I guess it still makes sense with IOMMU as it eliminates an extra memory copy. > >> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO >> windows to PHB address space") and ended up having >> vfio_listener_region_add() called with RAM regions (which are mapped MMIO) >> and sPAPR cannot handle it otherwise than creating a window per a memory >> section and since we can have just two, this fails. Adding >> memory_region_is_ram_device() to vfio_listener_skipped_section() solves my >> particular problem but I wonder if I should rather enable this P2P on sPAPR? > > Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO) > mappings non-fatal. I think there's probably enough skepticism in > drivers as to whether p2p works that this isn't as critical. Of course > making it work is even better. Thanks, Well, to make P2P-via-IOMMU work on sPAPR (assuming that our IOMMU can actually redirect back to the bus which I am not sure about), these MMIOs need to me mapped via the huge DMA window by the guest and not by QEMU-VFIO so sPAPR will need another machine option ("no-mmio-dma-mmap"? implemented like "vfio-no-msix-emulation" proposed by this patch) to tell vfio_listener_region_add() not to vfio_dma_map on MMIO MRs, vfio_listener_region_del() will need adjustment too. Would that be considered as an option or there is a better interface? -- Alexey