On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:
> > On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> >>
> >>
> >> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>
> >>>>
> >>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>
> >>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>> virtqueue. These mappings may not match the GPA->HVA ones.
> >>>>>>>>>>>
> >>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>>>> translated HVA) exist in the tree, as looking them up by HVA will
> >>>>>>>>>>> return them twice. To solve this, create an id member so we can
> >>>>>>>>>>> assign unique identifiers (GPA) to the maps.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Eugenio Pérez <epere...@redhat.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>   include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>   util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>      hwaddr iova;
> >>>>>>>>>>>      hwaddr translated_addr;
> >>>>>>>>>>>      hwaddr size;                /* Inclusive */
> >>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>      IOMMUAccessFlags perm;
> >>>>>>>>>>>  } QEMU_PACKED DMAMap;
> >>>>>>>>>>>  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>   * @map: the mapping to search
> >>>>>>>>>>>   *
> >>>>>>>>>>>   * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>> - * returned.
> >>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>   *
> >>>>>>>>>>>   * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>>>   * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>
> >>>>>>>>>>>      needle = args->needle;
> >>>>>>>>>>>      if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>> It looks like this iterator can also be invoked by SVQ from
> >>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest GPA
> >>>>>>>>>> space will be searched without passing in the ID (GPA), and an exact
> >>>>>>>>>> match for the same GPA range is not actually needed, unlike the mapping
> >>>>>>>>>> removal case. Could we create an API variant for the SVQ lookup case
> >>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match, to
> >>>>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>>>
> >>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>
> >>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>>>> pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>>>
> >>>>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>>>> the API of the iova_tree more.
> >>>>>>>>>
> >>>>>>>>> I think we can look up the MemoryRegion using memory_region_from_host
> >>>>>>>>> and then get the hwaddr from it. It is another lookup though...
> >>>>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>>>> memory_region_from_host() may perform compared to the iova tree one; the
> >>>>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>> linear too. It is not even ordered.
> >>>>>> Oh sorry, I misread the code; I should have looked at g_tree_foreach()
> >>>>>> instead of g_tree_search_node(). So the former is indeed a linear
> >>>>>> iteration, but it looks to be ordered?
> >>>>>>
> >>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>> The GPA / IOVA are ordered, but we're looking up by QEMU's vaddr.
> >>>>>
> >>>>> If we have these translations:
> >>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>
> >>>>> We will see them in this order, so we cannot stop the search at the
> >>>>> first node.
> >>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>
> >>>>>>> But apart from this detail you're right, I have the same concerns with
> >>>>>>> this solution too. If we see a hard performance regression we could go
> >>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>>>> vhost-iova-tree too. The first RFCs of SVQ did that, actually.
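The id check added by the hunk above, and the skip_id_match escape hatch Si-Wei
suggests, can be illustrated with a small standalone sketch (plain C, not QEMU
code). The skip_id_match field, the dma_map_matches() helper, and the sample
values are illustrative assumptions, not part of the posted patch:

/* Standalone illustration of the iterator predicate being discussed: a map
 * matches only if its HVA range overlaps the needle AND, unless the
 * (hypothetical) skip_id_match flag is set, its id (GPA) also matches. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t hwaddr;

typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;          /* inclusive, as in include/qemu/iova-tree.h */
    uint64_t id;          /* GPA used as unique identifier in this series */
    bool skip_id_match;   /* hypothetical flag from the discussion */
} DMAMap;

static bool dma_map_matches(const DMAMap *map, const DMAMap *needle)
{
    /* No overlap in translated (HVA) space -> no match. */
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return false;
    }
    /* Exact id match required unless the caller asked to skip it. */
    return needle->skip_id_match || needle->id == map->id;
}

int main(void)
{
    DMAMap a = { .translated_addr = 0x10000, .size = 0xfff, .id = 0x1000 };
    DMAMap removal = { .translated_addr = 0x10000, .size = 0xfff, .id = 0x2000 };
    DMAMap svq = { .translated_addr = 0x10500, .size = 0, .skip_id_match = true };

    printf("removal lookup matches: %d\n", dma_map_matches(&a, &removal)); /* 0 */
    printf("SVQ lookup matches:     %d\n", dma_map_matches(&a, &svq));     /* 1 */
    return 0;
}

A removal lookup with a mismatched id fails even though the HVA ranges overlap,
while an SVQ-style lookup that skips the id check still succeeds.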
> >>>>>> Agreed, yep, we can use memory_region_from_host for now. Any reason why
> >>>>>> the reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>>>
> >>>>> No, it is just simplicity. We already have a user in the hot path in
> >>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>> enough to find out if it is a bottleneck or not, to be honest.
> >>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>> profile and see the difference.
> >>>>> I'll send the new series by today, thank you for finding these issues!
> >>>> Thanks! In case you don't have the bandwidth to add back the reverse IOVA
> >>>> tree, Jonah (cc'ed) may have interest in looking into it.
> >>>>
> >>> Actually, yes. I've tried to solve it using:
> >>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>> without messing a lot with IOVATree.
> >>> memory_region_find -> I'm totally unable to make it return sections
> >>> that make sense.
> >>> flatview_for_each_range -> It does not return the same
> >>> MemoryRegionSection as the listener, not sure why.
> >>>
> >>> The only advance I have is that memory_region_from_host is able to
> >>> tell if the vaddr is from the guest or not.
> >>>
> >>> So I'm convinced there must be a way to do it with the memory
> >>> subsystem, but I think the best way to do it ATM is to store a
> >>> parallel tree with GPA -> SVQ IOVA translations. At removal time, if we
> >>> find the entry in this new tree, we can directly remove it by GPA. If
> >>> not, assume it is a host-only address like the SVQ vrings, and remove it
> >>> by iterating on vaddr as we do now. It is guaranteed the guest does not
> >>> translate to that vaddr and that the vaddr is unique in the tree anyway.
> >>>
> >>> Does it sound reasonable? Jonah, would you be interested in moving this
> >>> forward?
> >>>
> >>> Thanks!
> >>>
> >> Sure, I'd be more than happy to work on this stuff! I can probably get
> >> started on this either today or tomorrow.
> >>
> >> Si-Wei mentioned something about these "reverse IOVATree" patches that
> >> were dropped;
> > The patches implementing the reverse IOVA tree were never created /
> > posted, just in case you try to look for them.
> >
> >
> >> is this relevant to what you're asking here? Is it
> >> something I should base my work off of?
> >>
> > So these patches work OK for adding and removing maps. We assign ids,
> > which is the GPA of the memory region that the listener receives. The
> > bad news is that SVQ also needs this id to look for the right
> > translation at vhost_svq_translate_addr, so this series is not
> > complete.
> I have a fundamental question to ask here. Are we sure SVQ really needs
> this id (GPA) to identify the right translation? Or are we just
> concerned too much about the aliased map, where one single HVA could be
> mapped to multiple IOVAs / GPAs, the overlapped part of which is an
> almost transient mapping that usually goes away very soon after the
> guest memory layout is stabilized?
Are we sure all of the overlaps go away after the memory layout is
stabilized, in all conditions? I think it is worth not having two different
ways to query the tree depending on which part of QEMU we are in.

> From what I can tell, the caller in the SVQ datapath
> code (vhost_svq_vring_write_descs) just calls into
> vhost_iova_tree_find_iova to look for the IOVA translation rather than to
> identify a specific section of the memory region, the latter of which
> would need the id (GPA) to perform an exact match. The removal case
> would definitely need a perfect match on GPA with the additional id, but I
> don't find it necessary for the vhost_svq_vring_write_descs code to pass
> in the id / GPA. Am I missing something?
>
Expanding on the other thread, as there are more concrete points there.
Please let me know if I missed something.

> Thanks,
> -Siwei
>
> > You can find the
> > vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.
> >
> > The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
> > is also the least elegant and (potentially) the least performant, as it
> > involves duplicating information that QEMU already has, and a
> > potentially linear search.
> >
> > David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
> > I guess qemu_ram_block_from_host() could return a block where
> > block->offset is the id of the map?
> >
> > It would be great to try this approach. If you don't have the bandwidth
> > for this, going directly for the reverse IOVA tree is also a valid
> > solution.
> >
> > Thanks!
> >
> >> If there's any other relevant information about this issue that you
> >> think I should know, let me know. I'll start digging into this ASAP and
> >> will reach out if I need any guidance. :)
> >>
> >> Jonah
> >>
> >>>> -Siwei
> >>>>
> >>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Of course,
> >>>>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>>>> sure. As this could be on the hot data path, I have a little bit of
> >>>>>>>> hesitance over the potential cost or performance regression this change
> >>>>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>>          return false;
> >>>>>>>>>>>      }
> >>>>>>>>>>>
>
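For reference, here is a standalone model (GLib only, compilable outside QEMU)
of the removal flow proposed earlier in this thread: keep a companion tree
keyed by GPA next to the existing IOVA tree, remove guest-backed maps by GPA,
and fall back to the current HVA search for host-only maps such as the SVQ
vrings. All names (ShadowMaps, map_add, map_remove) and the sample addresses
are illustrative assumptions, not the real vhost-iova-tree API:

/* Model of the "parallel GPA -> SVQ IOVA tree" idea, not QEMU code. */
#include <glib.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t hwaddr;

typedef struct Mapping {
    hwaddr iova;     /* SVQ IOVA */
    void *hva;       /* QEMU virtual address */
    hwaddr size;
} Mapping;

typedef struct {
    GTree *by_gpa;   /* GPA -> Mapping*, guest-backed maps only */
    GList *all;      /* stand-in for the contents of the real IOVA tree */
} ShadowMaps;

static gint cmp_gpa(gconstpointer a, gconstpointer b, gpointer d)
{
    hwaddr ka = (hwaddr)(uintptr_t)a, kb = (hwaddr)(uintptr_t)b;
    (void)d;
    return ka < kb ? -1 : (ka > kb ? 1 : 0);
}

static void map_add(ShadowMaps *s, hwaddr gpa, Mapping *m, bool guest_backed)
{
    s->all = g_list_prepend(s->all, m);
    if (guest_backed) {
        g_tree_insert(s->by_gpa, (gpointer)(uintptr_t)gpa, m);
    }
}

static void map_remove(ShadowMaps *s, hwaddr gpa, void *hva)
{
    Mapping *m = g_tree_lookup(s->by_gpa, (gpointer)(uintptr_t)gpa);
    if (m) {
        /* Guest memory: GPA is unique even when HVAs alias, remove by GPA. */
        g_tree_remove(s->by_gpa, (gpointer)(uintptr_t)gpa);
        s->all = g_list_remove(s->all, m);
        return;
    }
    /* Host-only address (e.g. SVQ vring): its HVA is unique in the tree,
     * so the current linear search by HVA stays correct. */
    for (GList *l = s->all; l; l = l->next) {
        Mapping *cand = l->data;
        if (cand->hva == hva) {
            s->all = g_list_remove(s->all, cand);
            return;
        }
    }
}

int main(void)
{
    ShadowMaps s = { .by_gpa = g_tree_new_full(cmp_gpa, NULL, NULL, NULL) };
    Mapping guest = { .iova = 0x1000, .hva = (void *)0xdead0000, .size = 0x1000 };
    Mapping vring = { .iova = 0x8000, .hva = (void *)0xbeef0000, .size = 0x1000 };

    map_add(&s, 0x40000, &guest, true);   /* added by the memory listener */
    map_add(&s, 0, &vring, false);        /* host-only, no GPA */

    map_remove(&s, 0x40000, guest.hva);   /* removed directly by GPA */
    map_remove(&s, 0, vring.hva);         /* falls back to the HVA search */

    printf("remaining maps: %u\n", g_list_length(s.all)); /* 0 */
    g_tree_destroy(s.by_gpa);
    return 0;
}

Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)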
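And a toy model of the RAMBlock-based alternative quoted above: given an HVA,
find the backing RAM block and derive the map id from the block offset plus
the offset inside it, which is roughly what qemu_ram_block_from_host() plus
block->offset would provide inside QEMU. The block layout, the field names,
and the UINT64_MAX "no id" convention are assumptions made for the example:

/* Standalone model (plain C, not QEMU) of deriving a map id from RAM blocks. */
#include <stdint.h>
#include <stdio.h>

typedef struct RamBlockModel {
    uintptr_t host;     /* start of the host mapping */
    uint64_t size;
    uint64_t offset;    /* offset in ram_addr_t space, used as id base */
} RamBlockModel;

static uint64_t map_id_from_host(const RamBlockModel *blocks, int n, void *hva)
{
    uintptr_t p = (uintptr_t)hva;
    for (int i = 0; i < n; i++) {
        if (p >= blocks[i].host && p < blocks[i].host + blocks[i].size) {
            return blocks[i].offset + (p - blocks[i].host);
        }
    }
    return UINT64_MAX;  /* host-only address, e.g. an SVQ vring */
}

int main(void)
{
    RamBlockModel blocks[] = {
        { .host = 0x7f0000000000, .size = 0x40000000, .offset = 0x0 },
        { .host = 0x7f8000000000, .size = 0x10000000, .offset = 0x40000000 },
    };
    void *guest_hva = (void *)(0x7f8000000000 + 0x1234);
    void *vring_hva = (void *)0x55aa00000000;

    printf("guest id: 0x%llx\n",
           (unsigned long long)map_id_from_host(blocks, 2, guest_hva));
    printf("vring id: 0x%llx\n",
           (unsigned long long)map_id_from_host(blocks, 2, vring_hva));
    return 0;
}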