On Fri, 19 Jan 2018 13:41:41 +1100 Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> On 19/01/18 08:59, Alex Williamson wrote:
> > On Tue, 16 Jan 2018 16:17:58 +1100
> > Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >
> >> On 06/01/18 02:29, Alex Williamson wrote:
> >>> On Fri, 5 Jan 2018 10:48:07 +0100
> >>> Auger Eric <eric.au...@redhat.com> wrote:
> >>>
> >>>> Hi Alexey,
> >>>>
> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
> >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>>>> the VFIO PCI driver won't prevent the MSIX vectors area from being
> >>>>> mapped.
> >>>>>
> >>>>> With this change, all BARs are mapped in a single chunk and MSIX
> >>>>> vectors are emulated on top unless the machine requests not to by
> >>>>> defining and enabling a new "vfio-no-msix-emulation" property. At the
> >>>>> moment only the sPAPR machine does so - it prohibits MSIX emulation
> >>>>> and does not allow enabling it as it does not define the "set"
> >>>>> callback for the new property; the new property also does not appear
> >>>>> in "-machine pseries,help".
> >>>>>
> >>>>> If the new capability is present, this puts the MSIX IO memory region
> >>>>> under the mapped memory region. If the capability is not there, it
> >>>>> falls back to the old behaviour with the sparse capability.
> >>>>>
> >>>>> If the MSIX vectors section is not aligned to the page size, the KVM
> >>>>> memory listener does not register it with KVM as a memory slot and
> >>>>> MSIX is emulated by QEMU as before.
> >>>>>
> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>>>
> >>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>>>> ---
> >>>>>
> >>>>> This is mtree and flatview BEFORE this patch:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #0
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #2
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>>>> to enable emulation:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: p...@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #1
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> ---
> >>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>>> index f3a2ac9..927d600 100644
> >>>>> --- a/include/hw/vfio/vfio-common.h
> >>>>> +++ b/include/hw/vfio/vfio-common.h
> >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>>>                           struct vfio_region_info **info);
> >>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>                               uint32_t subtype, struct vfio_region_info **info);
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>>>  #endif
> >>>>>  extern const MemoryListener vfio_prereg_listener;
> >>>>>
> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>> index 4312e96..b45182e 100644
> >>>>> --- a/linux-headers/linux/vfio.h
> >>>>> +++ b/linux-headers/linux/vfio.h
> >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> >>>>>
> >>>>> +/*
> >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>>>> + */
> >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE     3
> >>>>> +
> >>>>>  /**
> >>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>>>   *                            struct vfio_irq_info)
> >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>> index 9de63f0..693394a 100644
> >>>>> --- a/hw/ppc/spapr.c
> >>>>> +++ b/hw/ppc/spapr.c
> >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>>>      spapr->use_hotplug_event_source = value;
> >>>>>  }
> >>>>>
> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>>>> +{
> >>>>> +    return true;
> >>>>> +}
> >>>>> +
> >>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>>>  {
> >>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>>>      object_property_set_description(obj, "vsmt",
> >>>>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>>>                                      " the host's SMT mode", &error_abort);
> >>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>>>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>>>  }
> >>>>>
> >>>>>  static void spapr_machine_finalizefn(Object *obj)
> >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>>>  /*
> >>>>>   * pseries-2.11
> >>>>>   */
> >>>>> +
> >>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>> index 1fb8a8e..da5f182 100644
> >>>>> --- a/hw/vfio/common.c
> >>>>> +++ b/hw/vfio/common.c
> >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>      return -ENODEV;
> >>>>>  }
> >>>>>
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>>>> +{
> >>>>> +    struct vfio_region_info *info = NULL;
> >>>>> +    bool ret = false;
> >>>>> +
> >>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>>>> +            ret = true;
> >>>>> +        }
> >>>>> +        g_free(info);
> >>>>> +    }
> >>>>> +
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>>>   */
> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>> index c977ee3..27a3706 100644
> >>>>> --- a/hw/vfio/pci.c
> >>>>> +++ b/hw/vfio/pci.c
> >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>>>      off_t start, end;
> >>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>>>
> >>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>>>> +                            vdev->msix->table_bar)) {
> >>>>> +        return;
> >>>>> +    }
> >>>>
> >>>> I am testing this patch on ARM with a Mellanox CX-4 card: a single
> >>>> BAR 0 with the MSIX table at 0x2000, using 64kB pages.
> >>>>
> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>>>   0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 mmaps[0]
> >>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>>>
> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR", so we
> >>>> bypass vfio_pci_fixup_msix_region. But as mmaps[0] is not fixed up,
> >>>> vfio_listener_region_add_ram gets called on the first 8kB, which
> >>>> cannot be mapped with a 64kB page, leading to a QEMU abort:
> >>>>
> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, 0xffffa8820000) = -22 (Invalid argument)
> >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>>>
> >>>> In my case don't we still need the fixup logic?
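[Editor's aside: the vfio_is_cap_present() helper in the patch above boils down to walking the capability chain that the kernel appends after struct vfio_region_info. A minimal standalone sketch of that walk follows; the struct mirrors the vfio_info_cap_header layout from linux/vfio.h, but find_cap and build_demo are illustrative names, not QEMU's actual helpers.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct vfio_info_cap_header in linux/vfio.h. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;   /* byte offset of the next capability; 0 ends the chain */
};

/*
 * Walk the chain starting 'cap_offset' bytes into the info buffer and
 * return the capability with the requested id, or NULL if absent.
 */
static struct cap_header *find_cap(void *buf, uint32_t cap_offset, uint16_t id)
{
    uint32_t off = cap_offset;

    while (off) {
        struct cap_header *hdr = (struct cap_header *)((char *)buf + off);

        if (hdr->id == id) {
            return hdr;
        }
        off = hdr->next;
    }
    return NULL;
}

/* Demo chain for illustration: cap id 1 at offset 16 -> cap id 3 at offset 32. */
static unsigned char demo[64];

static void build_demo(void)
{
    struct cap_header a = { 1, 1, 32 };
    struct cap_header b = { 3, 1, 0 };  /* 3 == VFIO_REGION_INFO_CAP_MSIX_MAPPABLE */

    *(struct cap_header *)(demo + 16) = a;
    *(struct cap_header *)(demo + 32) = b;
}
```

With the demo chain built, looking up id 3 succeeds and an absent id returns NULL, which is all vfio_is_cap_present() needs to report a boolean.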
> >>>
> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>> this. I'd say no, the fixup shouldn't be needed, but apparently we
> >>> can't remove it until we figure out this bug and who should be
> >>> filtering out sub-page size listener events. Thanks,
> >>
> >>
> >> This should do it:
> >>
> >> @@ -303,11 +303,19 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>             * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>             * addresses in the upper part of the 64-bit address space. These
> >>             * are never accessed by the CPU and beyond the address width of
> >>             * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
> >>             */
> >> -          section->offset_within_address_space & (1ULL << 63);
> >> +          section->offset_within_address_space & (1ULL << 63) ||
> >> +          /*
> >> +           * Non-system-page aligned RAM regions have no chance to be
> >> +           * DMA mapped so skip them too.
> >> +           */
> >> +          (memory_region_is_ram(section->mr) &&
> >> +           ((section->offset_within_region & qemu_real_host_page_mask) ||
> >> +            (int128_getlo(section->size) & qemu_real_host_page_mask))
> >> +          );
> >>  }
> >>
> >> Do we VFIO DMA map every RAM MR just because it is not necessary to
> >> have more complicated logic to detect whether it is MMIO? Or is there
> >> something bigger?
> >
> > Both: a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> > silly address hack above to try to exclude PCI BARs as they're being
> > resized, and b) there is some value to enabling peer-to-peer mappings
> > between devices. Is there a better, more generic solution to a) now?
>
> "now?" what? :)

Now - adverb, at the present time or moment. For example, vs in the past.

> > If we know a MR is MMIO, then I think we can handle any associated
> > mapping failure as non-fatal.
>
> mr->ram_device sounds like an appropriate flag to check.

Yeah, I suppose we could, and that didn't exist when we added the bit 63
hack.
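[Editor's aside: the filter Alexey proposes for vfio_listener_skipped_section() can be illustrated standalone. This is a sketch, not QEMU's code: an explicit low-bits test stands in for the qemu_real_host_page_mask arithmetic, skip_subpage_ram is a made-up name, and page_size is assumed to be a power of two.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Skip RAM sections that cannot be DMA mapped because either their
 * offset within the region or their size leaves a sub-page remainder.
 * page_size must be a power of two (e.g. 0x10000 for a 64kB host).
 */
static bool skip_subpage_ram(bool is_ram, uint64_t offset_within_region,
                             uint64_t size, uint64_t page_size)
{
    uint64_t subpage_bits = page_size - 1;

    return is_ram &&
           ((offset_within_region & subpage_bits) || (size & subpage_bits));
}
```

With Eric's CX-4 layout (an 8kB section at the head of BAR 0 on a 64kB-page host), this predicate skips the section instead of letting it reach vfio_dma_map and fail with -EINVAL.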
> > It's nice to enable p2p, but it's barely used and not clear that it
> > works reliably (sometimes even on bare metal). Nit: we don't yet have
> > the IOVA range of the IOMMU exposed, thus the silly address hack, but
> > we do have supported page sizes, so we should probably be referencing
> > that rather than the host page size.
>
> Ah, correct,
> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/

No, that would be the emulated IOMMU page size, which may or may not
(hopefully does) match the physical IOMMU page size. The vfio IOMMU page
sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.

> >> As for now, it is possible to have a non-aligned RAM MR; the KVM
> >> listener just does not register it with KVM if unaligned.
> >
> > For a sanity check, we don't support running a guest with a smaller
> > page size than the host, right?
>
> We do, at least on POWER.
>
> > I assume that would be the only case where we would have a reasonable
> > chance of the guest using a host sub-page as a DMA target, but that
> > seems like just one of an infinite number of problems with such a
> > setup. Thanks,
>
> Like what? I am missing the point here. The most popular config on POWER
> today is a 64k system page size in both host and guest and a 4K IOMMU
> page size for 32bit DMA, which seems alright; I also tried 4K guests, no
> problem there either.

The IOMMU supporting a smaller page size than the processor is the thing
I think I was missing here, but it makes sense that they could; they're
not necessarily linked. Eric, will SMMU support 4K even while the CPU
runs at 64K?

The type1 backend has a number of PAGE_SIZE assumptions in it and will
not expose IOMMU page sizes less than PAGE_SIZE. Perhaps this is really
the root of the problem, but it is advertising
vfio_iommu_type1_info.iova_pgsizes, which userspace should honor.
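[Editor's aside: honoring vfio_iommu_type1_info.iova_pgsizes rather than the host page size would amount to a check along these lines. This is a sketch under the stated assumption that each set bit in the bitmap is a supported IOMMU page size; map_fits_pgsizes is a made-up name.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a mapping of 'size' bytes at 'iova' can be built out of
 * the IOMMU page sizes advertised in the 'pgsizes' bitmap (one set bit
 * per supported size, as in vfio_iommu_type1_info.iova_pgsizes).
 */
static bool map_fits_pgsizes(uint64_t iova, uint64_t size, uint64_t pgsizes)
{
    int bit;

    if (!size) {
        return false;
    }
    /* Any supported page size dividing both iova and size can tile it. */
    for (bit = 0; bit < 64; bit++) {
        uint64_t pgsz = 1ULL << bit;

        if ((pgsizes & pgsz) && !(iova & (pgsz - 1)) && !(size & (pgsz - 1))) {
            return true;
        }
    }
    return false;
}
```

Under a 64kB-only bitmap, Eric's 8kB mapping request is rejected up front rather than failing inside VFIO_MAP_DMA; add a 4K bit (as in the POWER 32bit DMA window) and the same request fits.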
So I think running (target page size) < (host page size) is probably
broken for type1. For instance, the guest could have a 4K RAM page as a
DMA target that the 64K host couldn't map due to alignment, so DMA with
that page would fault. Thanks,

Alex
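[Editor's aside: the trimming a listener would have to do here - registering only the host-page-aligned interior of a section, much as the KVM listener mentioned above drops unaligned slots - can be sketched as follows. align_interior is a hypothetical name and pgsz is assumed to be a power of two.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp [start, start + size) to its page-aligned interior: round the
 * start up and the end down to 'pgsz' (a power of two). Returns the
 * aligned length (0 if nothing is mappable) and stores the aligned start.
 */
static uint64_t align_interior(uint64_t start, uint64_t size, uint64_t pgsz,
                               uint64_t *aligned_start)
{
    uint64_t a = (start + pgsz - 1) & ~(pgsz - 1);   /* round start up */
    uint64_t end = (start + size) & ~(pgsz - 1);     /* round end down */

    *aligned_start = a;
    return end > a ? end - a : 0;
}
```

A lone 4K guest page on a 64K host yields a zero-length interior, i.e. nothing mappable at all, which is exactly why such a DMA target would fault.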