On Wed, Apr 3, 2024 at 8:53 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote:
> > On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>> Hi Michael,
> >>>>
> >>>> On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:
> >>>>> On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:
> >>>>>> Hi Eugenio,
> >>>>>>
> >>>>>> I thought this new code looks good to me and the original issue I saw
> >>>>>> with x-svq=on should be gone. However, after rebasing my tree on top of
> >>>>>> this, there's a new failure I found around setting up guest mappings at
> >>>>>> early boot, please see attached the specific QEMU config and
> >>>>>> corresponding event traces. Haven't checked into the details yet, but
> >>>>>> thought you would want to be aware of it ahead of time.
> >>>>>>
> >>>>>> Regards,
> >>>>>> -Siwei
> >>>>> Eugenio, were you able to reproduce? Siwei, did you have time to
> >>>>> look into this?
> >>>> Didn't get a chance to look into the details in the past week, but I
> >>>> thought it may have something to do with the (internals of the) iova tree
> >>>> range allocation and the lookup routine. It started to fall apart at the
> >>>> first vhost_vdpa_dma_unmap call showing up in the trace events, where it
> >>>> should've gotten IOVA=0x2000001000, but an incorrect IOVA address 0x1000
> >>>> ended up being returned from the iova tree lookup routine.
> >>>>
> >>>> HVA                                  GPA                            IOVA
> >>>> --------------------------------------------------------------------------------------------------
> >>>> Map
> >>>> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)              [0x1000, 0x80000000)
> >>>> [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)    [0x80001000, 0x2000001000)
> >>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x2000001000, 0x2000021000)
> >>>>
> >>>> Unmap
> >>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x1000, 0x20000) ???
> >>>>                                      shouldn't it be [0x2000001000, 0x2000021000) ???
> >>>>
> >> It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
> >> which is called from vhost_vdpa_listener_region_del(), can't properly
> >> deal with overlapping regions. Specifically, q35's mch_realize() has the
> >> following:
> >>
> >> 579     memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
> >> 580                              mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
> >> 581                              MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> >> 582     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
> >> 583                                         &mch->open_high_smram, 1);
> >> 584     memory_region_set_enabled(&mch->open_high_smram, false);
> >>
> >> #0  0x0000564c30bf6980 in iova_tree_find_address_iterator
> >> (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
> >> ../util/iova-tree.c:96
> >> #1  0x00007f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
> >> #2  0x0000564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
> >> map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
> >> #3  0x0000564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized
> >> out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
> >> #4  0x0000564c3085e49d in vhost_vdpa_listener_region_del
> >> (listener=0x564c331024c8, section=0x7fffb6d74aa0) at
> >> ../hw/virtio/vhost-vdpa.c:444
> >> #5  0x0000564c309f4931 in address_space_update_topology_pass
> >> (as=as@entry=0x564c31ab1840 <address_space_memory>,
> >> old_view=old_view@entry=0x564c33364cc0,
> >> new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
> >> ../system/memory.c:977
> >> #6  0x0000564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840
> >> <address_space_memory>) at ../system/memory.c:1079
> >> #7  0x0000564c309f86d0 in memory_region_transaction_commit () at
> >> ../system/memory.c:1132
> >> #8  0x0000564c309f86d0 in memory_region_transaction_commit () at
> >> ../system/memory.c:1117
> >> #9  0x0000564c307cce64 in mch_realize (d=<optimized out>,
> >> errp=<optimized out>) at ../hw/pci-host/q35.c:584
> >>
> >> However, it looks like iova_tree_find_address_iterator() only checks
> >> whether the translated address (HVA) falls into the range when trying to
> >> locate the desired IOVA, causing the first DMAMap that happens to overlap
> >> in the translated address (HVA) space to be returned prematurely:
> >>
> >>  89 static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>  90                                                 gpointer data)
> >>  91 {
> >>   :
> >>   :
> >>  99     if (map->translated_addr + map->size < needle->translated_addr ||
> >> 100         needle->translated_addr + needle->size < map->translated_addr) {
> >> 101         return false;
> >> 102     }
> >> 103
> >> 104     args->result = map;
> >> 105     return true;
> >> 106 }
> >>
> >> The QEMU trace file reveals that the first DMAMap below gets returned
> >> incorrectly instead of the second, the latter of which is what the
> >> actual IOVA corresponds to:
> >>
> >> HVA                                  GPA                          IOVA
> >> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)            [0x1000, 0x80001000)
> >> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> >>
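For illustration, here is a minimal standalone sketch (not QEMU code; the
struct is just a stand-in for DMAMap, the values are lifted from the trace
above, and sizes are written as inclusive lengths minus one, which is how the
iova tree appears to store them) showing why an HVA-overlap-only check picks
the first entry when the maps are walked in ascending IOVA order, as
g_tree_foreach() does:

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for QEMU's DMAMap. */
typedef struct {
    uint64_t iova;
    uint64_t translated_addr;   /* HVA */
    uint64_t size;              /* inclusive: length - 1 */
} Map;

/* Same predicate as the quoted iova_tree_find_address_iterator() check. */
static bool hva_overlaps(const Map *map, const Map *needle)
{
    return !(map->translated_addr + map->size < needle->translated_addr ||
             needle->translated_addr + needle->size < map->translated_addr);
}

int main(void)
{
    /* Maps in ascending IOVA order, mirroring the tree traversal. */
    Map maps[] = {
        { 0x1000,       0x7f7903e00000, 0x7fffffff }, /* GPA [0x0, 0x80000000)        */
        { 0x2000001000, 0x7f7903ea0000, 0x1ffff },    /* GPA [0xfeda0000, 0xfedc0000) */
    };
    /* The unmap request for the SMRAM alias: same HVA as the second map. */
    Map needle = { 0, 0x7f7903ea0000, 0x1ffff };

    for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
        if (hva_overlaps(&maps[i], &needle)) {
            /* Prints iova 0x1000: the first (wrong) entry wins. */
            printf("matched iova 0x%" PRIx64 "\n", maps[i].iova);
            break;
        }
    }
    return 0;
}
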
> > I think the analysis is totally accurate, as no code expects to unmap /
> > map overlapping regions. In particular, the vdpa kernel does not expect
> > it either.
>
> Maybe I missed something, but I don't see how the vdpa kernel prohibits
> this overlapping case? Having the same HVA remapped under a different GPA
> or IOVA range should be a valid use case, it seems? For example, I don't
> see a mapping failure when x-svq=on is removed from QEMU's vhost-vdpa
> arguments (where you can see in the attached trace that the exact same
> overlapping region from mch_realize got remapped successfully, and then
> got unmapped right after).
>

I meant removing part of a map. You can see that vhost_vdpa_pa_unmap /
vhost_vdpa_va_unmap removes the whole map, not only a portion of it,
much like iova_tree does.

> > Since it is issued at _realize, it should be ok to unmap the whole
> > region range and then map the right range again, even if that implies
> > a lot of unpin / pin.
> You'll find that mch_realize() already does that: line 582 creates a
> mapping that overlaps the existing range, while line 584 disables
> (unmaps) this same mapping shortly after.
>
> 582     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
> 583                                         &mch->open_high_smram, 1);
> 584     memory_region_set_enabled(&mch->open_high_smram, false);
>
> So we shouldn't have to add any special case to deal with these
> overlapping regions (just use the GPA as a hint to help identify which
> iova range it is trying to unmap). Since the vdpa memory listener now
> moves up front to the earliest point in time at machine boot, the
> relevant SVQ code needs to be more flexible and tolerant of the extra
> churn it takes for the rest of QEMU's subsystems to stabilize the guest
> memory layout.

Yes I think the same.
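
To make the idea concrete, something along these lines could work (a rough,
untested sketch only; the "gpa" field and the helper below are hypothetical
and not existing QEMU API, the point is just that overlap in GPA space is
unambiguous even when two maps alias the same HVA):

/* Hypothetical extension: let the listener record the GPA alongside the
 * HVA when it creates the map, so region_del() can use it as the hint. */
typedef struct {
    uint64_t iova;
    uint64_t translated_addr;   /* HVA */
    uint64_t gpa;               /* hypothetical: guest physical address */
    uint64_t size;              /* inclusive: length - 1 */
} DMAMapGpaSketch;

static bool gpa_overlaps(const DMAMapGpaSketch *map,
                         const DMAMapGpaSketch *needle)
{
    /* Same overlap test as today, but on the GPA rather than the HVA. */
    return !(map->gpa + map->size < needle->gpa ||
             needle->gpa + needle->size < map->gpa);
}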

> >> Maybe other than checking the HVA range, we should also match the GPA,
> >> or at least require the sizes to match exactly?
> >>
> > The safe action here would be to unmap the whole memory chunk and then
> > map the overlapping memory again? Or am I missing something?
> >
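If carrying the GPA down into the tree turns out to be too invasive, a
narrower band-aid could be to also require the sizes to match in the
iterator quoted above (sketch only: the elided setup lines are paraphrased
from memory, and this still would not disambiguate two aliases of identical
size):

static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
                                                gpointer data)
{
    const DMAMap *map = key;
    struct IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    /* Existing check: any overlap in HVA space is accepted. */
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return false;
    }

    /* Sketch: additionally require an exact size match, so the 0x20000-byte
     * SMRAM alias does not resolve to the 2 GiB map covering the same HVA. */
    if (map->size != needle->size) {
        return false;
    }

    args->result = map;
    return true;
}
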
> > Another thing I don't get, is this reproducible in previous versions?
> It is not reproducible without this series. I thought that may be
> because previously the vhost-vdpa memory listener was registered at a
> later point in time, after all the overlapped regions were gone and the
> memory layout had completely stabilized. Now that we move registration
> to init time, all the devils come out biting us.
>
> With previous versions of this patch series we didn't get this far yet
> when x-svq=on is specified.
>

Right, good point.

Thanks!

> https://lore.kernel.org/qemu-devel/92ecfd90-8d06-4669-b260-a7a3b1062...@oracle.com/
> > As far as I understand, this bug was never found before. I guess this
> > path of q35's mch_realize is recent?
> No, it's been there for quite a while.
>
> Cheers,
> -Siwei
> >
> > Thanks!
> >
> >>> Yes, I'm still not able to reproduce. In particular, I don't know
> >>> how the memory listener adds a region and then releases a region
> >>> with a different size. I'm talking about these log entries:
> >>>
> >>> 1706854838.154394:vhost_vdpa_listener_region_add vdpa: 0x556d45c75140
> >>> iova 0x0 llend 0x80000000 vaddr: 0x7f7903e00000 read-only: 0
> >>> 452:vhost_vdpa_listener_region_del vdpa: 0x556d45c75140 iova 0x0 llend
> >>> 0x7fffffff
> >> Didn't see a different size here, though if you're referring to the
> >> discrepancy in the traces around llend, I thought the two, between _add()
> >> and _del(), would have to be interpreted differently due to:
> >>
> >> 3d1e4d34 "vhost_vdpa: fix the input in
> >> trace_vhost_vdpa_listener_region_del()"
> >>
> >> Regards,
> >> -Siwei
> >>> Is it possible for you to also trace the skipped regions? We should
> >>> add a debug trace there too...
> >>>
> >>> Thanks!
> >>>
> >>>> PS, I will be taking time off starting today and for the next two
> >>>> weeks. Will try to help out looking more closely after I get back.
> >>>>
> >>>> -Siwei
> >>>>>     Can't merge patches which are known to break things ...

