On 20/01/18 02:57, Alex Williamson wrote:
> On Fri, 19 Jan 2018 19:55:49 +1100
> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> 
>> On 19/01/18 08:59, Alex Williamson wrote:
>>> On Tue, 16 Jan 2018 16:17:58 +1100
>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>   
>>>> On 06/01/18 02:29, Alex Williamson wrote:  
>>>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>>>> Auger Eric <eric.au...@redhat.com> wrote:
>>>>>     
>>>>>> Hi Alexey,
>>>>>>
>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>>>> which indicates that a region containing MSIX data can be mapped
>>>>>>> entirely, i.e. the VFIO PCI driver will not prevent the MSIX vectors
>>>>>>> area from being mapped.
>>>>>>>
>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX
>>>>>>> vectors are emulated on top, unless the machine opts out by defining
>>>>>>> and enabling a new "vfio-no-msix-emulation" property. At the moment
>>>>>>> only the sPAPR machine does so: it prohibits MSIX emulation and does
>>>>>>> not allow enabling it, as it does not define a "set" callback for the
>>>>>>> new property; the property also does not appear in
>>>>>>> "-machine pseries,help".
>>>>>>>
>>>>>>> If the new capability is present, this puts the MSIX IO memory region
>>>>>>> under the mapped memory region. If the capability is absent, it falls
>>>>>>> back to the old behaviour with the sparse capability.
>>>>>>>
>>>>>>> If the MSIX vectors section is not aligned to the page size, the KVM
>>>>>>> memory listener does not register it with KVM as a memory slot and
>>>>>>> MSIX is emulated by QEMU as before.
>>>>>>>
>>>>>>> This requires a kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> This is mtree and flatview BEFORE this patch:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: p...@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #0
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: p...@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #2
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>>>> to enable emulation:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: p...@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): p...@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #1
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>> ---
>>>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>>>> index f3a2ac9..927d600 100644
>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>>>                           struct vfio_region_info **info);
>>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>>>  #endif
>>>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>>>  
>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>> index 4312e96..b45182e 100644
>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
>>>>>>>  
>>>>>>> +/*
>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>>>> + */
>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE     3
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>>>   *                                 struct vfio_irq_info)
>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>>> index 9de63f0..693394a 100644
>>>>>>> --- a/hw/ppc/spapr.c
>>>>>>> +++ b/hw/ppc/spapr.c
>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>>>      spapr->use_hotplug_event_source = value;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>>>> +{
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>>>  {
>>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>>>      object_property_set_description(obj, "vsmt",
>>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>>>                                      " the host's SMT mode", &error_abort);
>>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>>>  }
>>>>>>>  
>>>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>>>  /*
>>>>>>>   * pseries-2.11
>>>>>>>   */
>>>>>>> +
>>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>>>  {
>>>>>>>  }
>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>> index 1fb8a8e..da5f182 100644
>>>>>>> --- a/hw/vfio/common.c
>>>>>>> +++ b/hw/vfio/common.c
>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>      return -ENODEV;
>>>>>>>  }
>>>>>>>  
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>>>> +{
>>>>>>> +    struct vfio_region_info *info = NULL;
>>>>>>> +    bool ret = false;
>>>>>>> +
>>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>>>> +            ret = true;
>>>>>>> +        }
>>>>>>> +        g_free(info);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>>>   */
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index c977ee3..27a3706 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>>>      off_t start, end;
>>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>>>  
>>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>>>> +                            vdev->msix->table_bar)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>
>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>>>> msix table at 0x2000, using 64kB pages.
>>>>>>
>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 mmaps[0]
>>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>>>
>>>>>>
>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>>>
>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, 0xffffa8820000) = -22 (Invalid argument)
>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>>>
>>>>>> In my case don't we still need the fixup logic?    
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events.  Thanks,    
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>>>              */
>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>> +           /*
>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>> +            * DMA mapped so skip them too.
>>>> +            */
>>>> +           (memory_region_is_ram(section->mr) &&
>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> +           );
>>>>  }
>>>>
>>>>
>>>> Do we VFIO DMA map every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or is there something
>>>> bigger?
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices.  Is there a better, more generic solution to a) now?
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal.  It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal).   
>>
>> When P2P works in a guest, what does it look like exactly? Do the devices
>> talk via the IOMMU or directly? I'd assume the latter, but then the guest
>> device driver needs to know physical MMIO addresses, which it does not, or
>> the guest must program BARs exactly as the host did... Has anyone actually
>> reported this working?
> 
> I've seen it work, NVIDIA has a peer to peer test in their cuda tests,
> see previous patches enabling GPUDirect.  The route is generally
> through the IOMMU, guests don't know host physical addresses.  I say
> generally though because ACS does have a Directed Translated P2P
> feature that would allow a more efficient path, but I've never seen
> anything that can take advantage of this.

Ah, I see. I guess it still makes sense with IOMMU as it eliminates an
extra memory copy.

>  
>> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
>> windows to PHB address space") and ended up having
>> vfio_listener_region_add() called with RAM regions (which are mmapped MMIO),
>> and sPAPR cannot handle these other than by creating a window per memory
>> section; since we can have just two windows, this fails. Adding a
>> memory_region_is_ram_device() check to vfio_listener_skipped_section()
>> solves my particular problem, but I wonder if I should rather enable this
>> P2P on sPAPR?
> 
> Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO)
> mappings non-fatal.  I think there's probably enough skepticism in
> drivers as to whether p2p works that this isn't as critical.  Of course
> making it work is even better.  Thanks,

Well, to make P2P-via-IOMMU work on sPAPR (assuming that our IOMMU can
actually redirect back to the bus, which I am not sure about), these MMIOs
need to be mapped via the huge DMA window by the guest and not by QEMU-VFIO,
so sPAPR will need another machine option ("no-mmio-dma-mmap"? implemented
like the "vfio-no-msix-emulation" proposed by this patch) to tell
vfio_listener_region_add() not to vfio_dma_map() MMIO MRs;
vfio_listener_region_del() will need a matching adjustment too. Would that
be considered an option, or is there a better interface?



-- 
Alexey
