On Fri, 19 Jan 2018 19:55:49 +1100
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> On 19/01/18 08:59, Alex Williamson wrote:
> > On Tue, 16 Jan 2018 16:17:58 +1100
> > Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> > 
> >> On 06/01/18 02:29, Alex Williamson wrote:
> >>> On Fri, 5 Jan 2018 10:48:07 +0100
> >>> Auger Eric <eric.au...@redhat.com> wrote:
> >>> 
> >>>> Hi Alexey,
> >>>> 
> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
> >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>>>> the VFIO PCI driver won't prevent the MSIX vectors area from being
> >>>>> mapped.
> >>>>> 
> >>>>> With this change, all BARs are mapped in a single chunk and MSIX
> >>>>> vectors are emulated on top unless the machine requests not to by
> >>>>> defining and enabling a new "vfio-no-msix-emulation" property. At the
> >>>>> moment only the sPAPR machine does so - it prohibits MSIX emulation
> >>>>> and does not allow enabling it, as it does not define the "set"
> >>>>> callback for the new property; the new property also does not appear
> >>>>> in "-machine pseries,help".
> >>>>> 
> >>>>> If the new capability is present, this puts the MSIX IO memory region
> >>>>> under the mapped memory region. If the capability is not there, it
> >>>>> falls back to the old behaviour with the sparse capability.
> >>>>> 
> >>>>> If the MSIX vectors section is not aligned to the page size, the KVM
> >>>>> memory listener does not register it with KVM as a memory slot and
> >>>>> MSIX is emulated by QEMU as before.
> >>>>> 
> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>>> 
> >>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>>>> ---
> >>>>> 
> >>>>> This is the mtree and flatview BEFORE this patch:
> >>>>> 
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> 
> >>>>> "info mtree -f":
> >>>>> FlatView #0
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> 
> >>>>> 
> >>>>> This is AFTER this patch applied:
> >>>>> 
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> 
> >>>>> 
> >>>>> "info mtree -f":
> >>>>> FlatView #2
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> 
> >>>>> 
> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>>>> to enable emulation:
> >>>>> 
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> 
> >>>>> "info mtree -f":
> >>>>> FlatView #1
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> ---
> >>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>>> 
> >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>>> index f3a2ac9..927d600 100644
> >>>>> --- a/include/hw/vfio/vfio-common.h
> >>>>> +++ b/include/hw/vfio/vfio-common.h
> >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>>>                           struct vfio_region_info **info);
> >>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>                               uint32_t subtype, struct vfio_region_info **info);
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>>>  #endif
> >>>>>  extern const MemoryListener vfio_prereg_listener;
> >>>>> 
> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>> index 4312e96..b45182e 100644
> >>>>> --- a/linux-headers/linux/vfio.h
> >>>>> +++ b/linux-headers/linux/vfio.h
> >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> >>>>> 
> >>>>> +/*
> >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>>>> + */
> >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
> >>>>> +
> >>>>>  /**
> >>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>>>   *                            struct vfio_irq_info)
> >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>> index 9de63f0..693394a 100644
> >>>>> --- a/hw/ppc/spapr.c
> >>>>> +++ b/hw/ppc/spapr.c
> >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>>>      spapr->use_hotplug_event_source = value;
> >>>>>  }
> >>>>> 
> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>>>> +{
> >>>>> +    return true;
> >>>>> +}
> >>>>> +
> >>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>>>  {
> >>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>>>      object_property_set_description(obj, "vsmt",
> >>>>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>>>                                      " the host's SMT mode", &error_abort);
> >>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>>>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>>>  }
> >>>>> 
> >>>>>  static void spapr_machine_finalizefn(Object *obj)
> >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>>>  /*
> >>>>>   * pseries-2.11
> >>>>>   */
> >>>>> +
> >>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>> index 1fb8a8e..da5f182 100644
> >>>>> --- a/hw/vfio/common.c
> >>>>> +++ b/hw/vfio/common.c
> >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>      return -ENODEV;
> >>>>>  }
> >>>>> 
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>>>> +{
> >>>>> +    struct vfio_region_info *info = NULL;
> >>>>> +    bool ret = false;
> >>>>> +
> >>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>>>> +            ret = true;
> >>>>> +        }
> >>>>> +        g_free(info);
> >>>>> +    }
> >>>>> +
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>>>   */
> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>> index c977ee3..27a3706 100644
> >>>>> --- a/hw/vfio/pci.c
> >>>>> +++ b/hw/vfio/pci.c
> >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>>>      off_t start, end;
> >>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>>> 
> >>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>>>> +                            vdev->msix->table_bar)) {
> >>>>> +        return;
> >>>>> +    }
> >>>> 
> >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single BAR 0 +
> >>>> MSIX table at 0x2000, using 64kB pages.
> >>>> 
> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>>>   0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 mmaps[0]
> >>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>>> 
> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR", so we bypass
> >>>> vfio_pci_fixup_msix_region. But as mmaps[0] is not fixed up,
> >>>> vfio_listener_region_add_ram gets called on the first 8kB, which cannot
> >>>> be mapped with a 64kB page, leading to a QEMU abort.
> >>>> 
> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, 0xffffa8820000) = -22 (Invalid argument)
> >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>>> 
> >>>> In my case don't we still need the fixup logic?
> >>> 
> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>> this. I'd say no, the fixup shouldn't be needed, but apparently we
> >>> can't remove it until we figure out this bug and who should be
> >>> filtering out sub-page-size listener events. Thanks,
> >> 
> >> 
> >> This should do it:
> >> 
> >> @@ -303,11 +303,19 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>              * addresses in the upper part of the 64-bit address space. These
> >>              * are never accessed by the CPU and beyond the address width of
> >>              * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
> >>              */
> >> -           section->offset_within_address_space & (1ULL << 63);
> >> +           section->offset_within_address_space & (1ULL << 63) ||
> >> +           /*
> >> +            * Non-system-page-aligned RAM regions have no chance to be
> >> +            * DMA mapped, so skip them too.
> >> +            */
> >> +           (memory_region_is_ram(section->mr) &&
> >> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> >> +             (int128_getlo(section->size) & qemu_real_host_page_mask)));
> >> 
> >> 
> >> Do we VFIO DMA map every RAM MR just because it is not necessary to have
> >> more complicated logic to detect whether it is MMIO? Or is there
> >> something bigger?
> > 
> > Both: a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> > silly address hack above to try to exclude PCI BARs as they're being
> > resized, and b) there is some value to enabling peer-to-peer mappings
> > between devices. Is there a better, more generic solution to a) now?
> > If we know a MR is MMIO, then I think we can handle any associated
> > mapping failure as non-fatal. It's nice to enable p2p, but it's barely
> > used and not clear that it works reliably (sometimes even on bare
> > metal).
> 
> When P2P works in a guest, what does it look like exactly? Do the devices
> talk via the IOMMU or directly? I'd assume the latter, but then the guest
> device driver needs to know physical MMIO addresses and it does not - or
> the guest must program BARs exactly as the host did... Has anyone
> actually reported this working?

I've seen it work; NVIDIA has a peer-to-peer test in their CUDA tests, see
previous patches enabling GPUDirect. The route is generally through the
IOMMU; guests don't know host physical addresses. I say "generally" though
because ACS does have a Directed Translated P2P feature that would allow a
more efficient path, but I've never seen anything that can take advantage
of this.

> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias
> MMIO windows to PHB address space") and ended up having
> vfio_listener_region_add() called with RAM regions (which are mapped
> MMIO), and sPAPR cannot handle it other than by creating a window per
> memory section, and since we can have just two, this fails. Adding
> memory_region_is_ram_device() to vfio_listener_skipped_section() solves
> my particular problem, but I wonder if I should rather enable this P2P
> on sPAPR?

Yeah, as mentioned, it'd be nice to at least make ram_device (i.e. MMIO)
mapping failures non-fatal. I think there's probably enough skepticism in
drivers as to whether p2p works that this isn't as critical. Of course,
making it work is even better. Thanks,

Alex