On Wed, Apr 3, 2024 at 8:53 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote:
> > On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>> Hi Michael,
> >>>>
> >>>> On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:
> >>>>> On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:
> >>>>>> Hi Eugenio,
> >>>>>>
> >>>>>> I thought this new code looks good to me, and the original issue I saw
> >>>>>> with x-svq=on should be gone. However, after rebasing my tree on top of
> >>>>>> this, there's a new failure I found around setting up guest mappings at
> >>>>>> early boot; please see attached the specific QEMU config and the
> >>>>>> corresponding event traces. I haven't checked into the details yet, but
> >>>>>> thought you would need to be aware of it ahead of time.
> >>>>>>
> >>>>>> Regards,
> >>>>>> -Siwei
> >>>>> Eugenio, were you able to reproduce? Siwei, did you have time to
> >>>>> look into this?
> >>>> I didn't get a chance to look into the details in the past week, but I
> >>>> thought it may have something to do with the (internals of) iova tree
> >>>> range allocation and the lookup routine. It started to fall apart at the
> >>>> first vhost_vdpa_dma_unmap call showing up in the trace events, where it
> >>>> should have gotten IOVA=0x2000001000, but an incorrect IOVA address
> >>>> 0x1000 ended up being returned from the iova tree lookup routine.
> >>>>
> >>>> HVA                                  GPA                            IOVA
> >>>> ---------------------------------------------------------------------------------------------------------
> >>>> Map
> >>>> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)              [0x1000, 0x80000000)
> >>>> [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)    [0x80001000, 0x2000001000)
> >>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x2000001000, 0x2000021000)
> >>>>
> >>>> Unmap
> >>>> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x1000, 0x20000) ???
> >>>>                                      shouldn't it be [0x2000001000, 0x2000021000) ???
> >>>>
> >> It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
> >> which is called from vhost_vdpa_listener_region_del(), can't properly
> >> deal with overlapped regions.
> >> Specifically, q35's mch_realize() has the following:
> >>
> >> 579     memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
> >> 580                              mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
> >> 581                              MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> >> 582     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
> >> 583                                         &mch->open_high_smram, 1);
> >> 584     memory_region_set_enabled(&mch->open_high_smram, false);
> >>
> >> #0  0x0000564c30bf6980 in iova_tree_find_address_iterator
> >>     (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0)
> >>     at ../util/iova-tree.c:96
> >> #1  0x00007f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
> >> #2  0x0000564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
> >>     map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
> >> #3  0x0000564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized out>,
> >>     map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
> >> #4  0x0000564c3085e49d in vhost_vdpa_listener_region_del
> >>     (listener=0x564c331024c8, section=0x7fffb6d74aa0)
> >>     at ../hw/virtio/vhost-vdpa.c:444
> >> #5  0x0000564c309f4931 in address_space_update_topology_pass
> >>     (as=as@entry=0x564c31ab1840 <address_space_memory>,
> >>     old_view=old_view@entry=0x564c33364cc0,
> >>     new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false)
> >>     at ../system/memory.c:977
> >> #6  0x0000564c309f4dcd in address_space_set_flatview
> >>     (as=0x564c31ab1840 <address_space_memory>) at ../system/memory.c:1079
> >> #7  0x0000564c309f86d0 in memory_region_transaction_commit ()
> >>     at ../system/memory.c:1132
> >> #8  0x0000564c309f86d0 in memory_region_transaction_commit ()
> >>     at ../system/memory.c:1117
> >> #9  0x0000564c307cce64 in mch_realize (d=<optimized out>,
> >>     errp=<optimized out>) at ../hw/pci-host/q35.c:584
> >>
> >> However, it looks like iova_tree_find_address_iterator() only checks
> >> whether the translated address (HVA) falls into the range when trying to
> >> locate the desired IOVA, causing the first DMAMap that happens to overlap
> >> in the translated address (HVA) space to be returned prematurely:
> >>
> >>  89 static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>  90                                                 gpointer data)
> >>  91 {
> >>   :
> >>   :
> >>  99     if (map->translated_addr + map->size < needle->translated_addr ||
> >> 100         needle->translated_addr + needle->size < map->translated_addr) {
> >> 101         return false;
> >> 102     }
> >> 103
> >> 104     args->result = map;
> >> 105     return true;
> >> 106 }
> >>
> >> The QEMU trace file reveals that the first DMAMap below gets returned
> >> incorrectly instead of the second, the latter of which is the one the
> >> actual IOVA corresponds to:
> >>
> >> HVA                                  GPA                         IOVA
> >> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)           [0x1000, 0x80001000)
> >> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)    [0x2000001000, 0x2000021000)
> >>
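To make the failure mode concrete, here is a minimal standalone sketch, not
the actual QEMU code: it reimplements only the HVA-overlap test quoted above
(the DMAMapSketch type and the hva_overlaps() helper are made up for this
illustration) and feeds it the three mappings from the trace. The first,
larger RAM mapping satisfies the test before the small SMRAM alias is ever
considered, so the lookup reports IOVA 0x1000 instead of 0x2000001000.

    /*
     * Standalone illustration (not QEMU code) of the HVA-only overlap test
     * used by iova_tree_find_address_iterator(), fed with the three
     * mappings from the trace above.  Sizes follow the DMAMap convention
     * of being inclusive, i.e. length - 1.
     */
    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t iova;
        uint64_t translated_addr;   /* HVA */
        uint64_t size;              /* inclusive: length - 1 */
    } DMAMapSketch;

    /* Same condition as lines 99-102 quoted above: overlap in HVA space only */
    static bool hva_overlaps(const DMAMapSketch *map, const DMAMapSketch *needle)
    {
        return !(map->translated_addr + map->size < needle->translated_addr ||
                 needle->translated_addr + needle->size < map->translated_addr);
    }

    int main(void)
    {
        /* The three mappings recorded at map time, in IOVA order */
        const DMAMapSketch maps[] = {
            { 0x1000,       0x7f7903e00000, 0x7fffffff   },  /* guest RAM below 2G  */
            { 0x80001000,   0x7f7983e00000, 0x1f7fffffff },  /* guest RAM above 4G  */
            { 0x2000001000, 0x7f7903ea0000, 0x1ffff      },  /* open_high_smram     */
        };
        /* Needle built at unmap time from the section's HVA and size only */
        const DMAMapSketch needle = { 0, 0x7f7903ea0000, 0x1ffff };

        for (size_t i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
            if (hva_overlaps(&maps[i], &needle)) {
                /* First hit wins: prints iova 0x1000, not 0x2000001000 */
                printf("matched map with iova 0x%" PRIx64 "\n", maps[i].iova);
                break;
            }
        }
        return 0;
    }

Compiling and running this prints "matched map with iova 0x1000", mirroring
the bogus unmap seen in the trace.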
> > I think the analysis is totally accurate, as no code expects to unmap /
> > map overlapping regions. In particular, the vdpa kernel does not expect
> > it either.
>
> Maybe I missed something, but I don't see how the vdpa kernel prohibits
> this overlapping case? The same HVA getting remapped under a different
> GPA or IOVA range should be a valid use case, it seems? For example, I
> don't see a mapping failure when x-svq=on is removed from QEMU's
> vhost-vdpa arguments (where you can see in the attached trace that the
> exact same overlapping region from mch_realize got remapped successfully,
> and then got unmapped right after).
>
I meant removing part of a map. You can see that vhost_vdpa_pa_unmap /
vhost_vdpa_va_unmap removes the whole map, not only a portion of it, much
like iova_tree does.

> > Since it is issued at _realize, it should be ok to unmap all the
> > region range and then map the right range again, even if that implies
> > a lot of unpin / pin.
> You'll find out mch_realize() already does that - lines 582-583 create a
> mapping that overlaps the entire range, while line 584 disables (unmaps)
> this same mapping shortly after.
>
> 582     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
> 583                                         &mch->open_high_smram, 1);
> 584     memory_region_set_enabled(&mch->open_high_smram, false);
>
> So we don't have to add any special case to deal with these overlapping
> cases (just use the GPA as a hint to help identify which iova range it
> tries to unmap). Since vdpa's memory listener now moves upfront to the
> earliest point in time at machine boot, the relevant SVQ code needs to be
> more flexible and tolerant of the extra churn while the rest of QEMU's
> subsystems stabilize the guest memory layout.

Yes, I think the same.

> >> Maybe other than checking the HVA range, we should also match the GPA,
> >> or at least the size should exactly match?
> >>
> > The safe action here would be to unmap the whole memory chunk and then
> > map the overlapping memory? Or am I missing something?
> >
> > Another thing I don't get: is this reproducible in previous versions?
> It is not reproducible without this series. I thought that may be because
> previously the vhost-vdpa memory listener was registered at a later point
> in time, after all the overlapping regions are gone and the memory layout
> is completely stabilized. Now that we move registration to init time, all
> the devils come out biting us.
>
> With previous versions of this patch series we didn't get this far yet
> when x-svq=on is specified.
>

Right, good point. Thanks!

> https://lore.kernel.org/qemu-devel/92ecfd90-8d06-4669-b260-a7a3b1062...@oracle.com/
> > As far as I understand, this bug was never found before. I guess this
> > path of q35's mch_realize is recent?
> No, it's been there for quite a while.
>
> Cheers,
> -Siwei
> >
> > Thanks!
> >
> >>> Yes, I'm still not able to reproduce. In particular, I don't know how
> >>> the memory listener adds a region and then releases a region with a
> >>> different size. I'm talking about these log entries:
> >>>
> >>> 1706854838.154394:vhost_vdpa_listener_region_add vdpa: 0x556d45c75140
> >>> iova 0x0 llend 0x80000000 vaddr: 0x7f7903e00000 read-only: 0
> >>> 452:vhost_vdpa_listener_region_del vdpa: 0x556d45c75140 iova 0x0 llend
> >>> 0x7fffffff
> >> I didn't see a different size here; if you are referring to the
> >> discrepancy in the traces around llend, I thought the two values between
> >> _add() and _del() would have to be interpreted differently due to:
> >>
> >> 3d1e4d34 "vhost_vdpa: fix the input in
> >> trace_vhost_vdpa_listener_region_del()"
> >>
> >> Regards,
> >> -Siwei
> >>> Is it possible for you to also trace the skipped regions? We should
> >>> add a debug trace there too...
> >>>
> >>> Thanks!
> >>>
> >>>> PS, I will be taking off from today and for the next two weeks. Will
> >>>> try to help out looking more closely after I get back.
> >>>>
> >>>> -Siwei
> >>>>> Can't merge patches which are known to break things ...
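Picking up the suggestion quoted above to also match the GPA, or at least
require the size to match exactly: a rough sketch of what such a tightened
iterator could look like follows. This is only an illustration of the
direction being discussed, not an actual patch; it assumes the DMAMap fields
shown in the quoted excerpt and the needle/result argument struct used by
the existing iterator (called IOVATreeFindIOVAArgs here), and the _strict
suffix is made up for this sketch.

    /*
     * Illustrative only: a stricter variant of the iterator quoted earlier.
     * Besides overlapping in HVA space, the candidate map must have exactly
     * the same size as the needle, so the small [0xfeda0000, 0xfedc0000)
     * alias can no longer be satisfied by the large RAM map it happens to
     * overlap in HVA space.  A GPA field, if added to DMAMap, could be
     * compared in the same way.
     */
    static gboolean iova_tree_find_address_iterator_strict(gpointer key,
                                                           gpointer value,
                                                           gpointer data)
    {
        const DMAMap *map = key;
        IOVATreeFindIOVAArgs *args = data;
        const DMAMap *needle = args->needle;

        /* No overlap at all in HVA space: keep iterating */
        if (map->translated_addr + map->size < needle->translated_addr ||
            needle->translated_addr + needle->size < map->translated_addr) {
            return false;
        }

        /* Overlapping but differently sized: not the mapping being removed */
        if (map->size != needle->size) {
            return false;
        }

        args->result = map;
        return true;
    }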