On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
>
> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >
> > On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > >
> > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasow...@redhat.com> wrote:
> > > >
> > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > > > >
> > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <epere...@redhat.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads to the same HVA. This causes a problem when overlapped regions (different GPA but same translated HVA) exist in the tree, as looking them up by HVA will return them twice.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > > >
> > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Si-Wei found that during initialization this sequence of maps / unmaps happens [1]:
> > > > > > > > > > > > >
> > > > > > > > > > > > > HVA                                  GPA                            IOVA
> > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------
> > > > > > > > > > > > > Map
> > > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)              [0x1000, 0x80000000)
> > > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)    [0x80001000, 0x2000001000)
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x2000001000, 0x2000021000)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmap
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)       [0x1000, 0x20000) ???
> > > > > > > > > > > > >
> > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a different GPA (aliased). This is not "flattened" by QEMU, as GPA does not overlap, only HVA.
> > > > > > > > > > > > >
> > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, not the second one. This series is the way to tell the difference at unmap time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
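Just to make the failure mode concrete, here is a minimal self-contained sketch of the lookup above. The type and helper are made up for the example (the real code uses IOVATree / DMAMap), but the behavior is the same: a reverse search keyed only on HVA returns the first range that contains the address, so the unmap of the aliased third chunk resolves against the first chunk.

/* Minimal self-contained illustration; made-up types, not the actual
 * IOVATree / DMAMap code, but the same failure mode. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t hva;    /* start of the host virtual address range   */
    uint64_t gpa;    /* start of the guest physical address range */
    uint64_t iova;   /* start of the assigned (SVQ) IOVA range    */
    uint64_t size;   /* size of the HVA range                     */
} Mapping;

/* The three "Map" entries of the trace above. */
static const Mapping maps[] = {
    { 0x7f7903e00000, 0x0,         0x1000,       0x80000000   },
    { 0x7f7983e00000, 0x100000000, 0x80001000,   0x1f80000000 },
    { 0x7f7903ea0000, 0xfeda0000,  0x2000001000, 0x20000      },
};

/* Reverse lookup the only way an HVA-keyed tree can do it: return the
 * first range that contains the address. */
static const Mapping *find_by_hva(uint64_t hva)
{
    for (unsigned i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
        if (hva >= maps[i].hva && hva - maps[i].hva < maps[i].size) {
            return &maps[i];
        }
    }
    return NULL;
}

int main(void)
{
    /* The unmap of the third chunk arrives with HVA 0x7f7903ea0000... */
    const Mapping *m = find_by_hva(0x7f7903ea0000);

    /* ...but that HVA is also inside the first chunk, which is found
     * first, so the unmap resolves against IOVA base 0x1000 instead of
     * 0x2000001000. */
    if (m) {
        printf("hit: GPA 0x%" PRIx64 ", IOVA base 0x%" PRIx64 "\n",
               m->gpa, m->iova);
    }
    return 0;
}

Keeping the GPA (or a "no GPA" marker for SVQ-only mappings) next to each entry is what would let the unmap path tell the two overlapping HVA ranges apart, which as far as I understand is what this series is about.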
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in the iova tree to solve this issue completely. Then there won't be aliasing issues.
> > > > > > > > > > >
> > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ vrings and CVQ buffers also need to be addressable by VhostIOVATree, and they do not have GPA.
> > > > > > > > > > >
> > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this transparently as we translate vaddr to SVQ IOVA. How can we store these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > > > >
> > > > > > > > > > This seems to be tricky.
> > > > > > > > > >
> > > > > > > > > > As discussed, it could be another iova tree.
> > > > > > > > >
> > > > > > > > > Yes, but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > > > >
> > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. Let's call it gpa_iova_tree, as opposed to the current iova_tree that translates from vaddr to SVQ IOVA. To know which one to use is easy at adding or removing, like in the memory listener, but how to know at vhost_svq_translate_addr?
> > > > > > > >
> > > > > > > > Then we won't use virtqueue_pop() at all, we need an SVQ version of virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > > >
> > > > > > > The problem is not virtqueue_pop, that's out of the vhost_svq_translate_addr. The problem is the need of adding conditionals / complexity in all the callers of
> > > > > > >
> > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When vaddr is from the guest, it returns a valid MemoryRegion. When it is not, it returns NULL. I'm not sure if this is a valid use case, it just worked in my tests so far.
> > > > > > > > >
> > > > > > > > > Now we have the second problem: the GPA values of the regions of the two IOVA trees must be unique. We need to be able to find unallocated regions in SVQ IOVA. At this moment there is only one IOVATree, so this is done easily by vhost_iova_tree_map_alloc. But it is very complicated with two trees.
> > > > > > > >
> > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we can have a dedicated gtree to track the allocated IOVA ranges. It is shared by both
> > > > > > > >
> > > > > > > > 1) Guest memory (GPA)
> > > > > > > > 2) SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > And another gtree to track the GPA to IOVA.
> > > > > > > >
> > > > > > > > The SVQ code could use either
> > > > > > > >
> > > > > > > > 1) one linear mapping that contains both SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > > > >
> > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > > >
> > > > > > > That's possible, but that scatters the IOVA handling code instead of keeping it self-contained in VhostIOVATree.
> > > > > >
> > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > > > >
> > > > > > An example is the iova allocator in the kernel.
> > > > > >
> > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough code, not sure it is useful here though (haven't had a deep look at that).
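For illustration only, here is a minimal sketch of such a decoupled allocator: it just records which IOVA ranges are busy, independent of what they map to. The names are made up and freeing is elided, so this is neither the kernel's IOVA allocator nor QEMU's actual API, just the shape of the idea.

/* Sketch of an IOVA allocator decoupled from how the IOVA is used: it only
 * remembers which [iova, iova + size) ranges are busy.  Made-up names,
 * freeing elided for brevity. */
#include <glib.h>
#include <stdint.h>

typedef struct {
    uint64_t iova;
    uint64_t size;
} IOVARange;

typedef struct {
    GTree *allocated;     /* IOVARange *, ordered by iova */
    uint64_t iova_first;  /* first usable IOVA            */
    uint64_t iova_last;   /* last usable IOVA, inclusive  */
} IOVAAllocator;

static gint range_cmp(gconstpointer a, gconstpointer b, gpointer user_data)
{
    const IOVARange *ra = a, *rb = b;

    return ra->iova < rb->iova ? -1 : (ra->iova > rb->iova ? 1 : 0);
}

static void iova_allocator_init(IOVAAllocator *a, uint64_t first, uint64_t last)
{
    a->allocated = g_tree_new_full(range_cmp, NULL, g_free, NULL);
    a->iova_first = first;
    a->iova_last = last;
}

typedef struct {
    uint64_t hole_start;  /* cursor: end of the last busy range seen   */
    uint64_t size;        /* requested size                            */
    uint64_t result;      /* first fitting hole, or UINT64_MAX if none */
} AllocState;

static gboolean find_hole(gpointer key, gpointer value, gpointer data)
{
    const IOVARange *r = key; /* busy ranges never overlap, so          */
    AllocState *s = data;     /* r->iova >= s->hole_start always holds  */

    if (r->iova - s->hole_start >= s->size) {
        s->result = s->hole_start;
        return TRUE;          /* stop the in-order walk: hole found */
    }
    s->hole_start = r->iova + r->size;
    return FALSE;
}

/* First-fit allocation: walk the busy ranges in IOVA order and grab the
 * first gap that is big enough. */
static uint64_t iova_alloc(IOVAAllocator *a, uint64_t size)
{
    AllocState s = {
        .hole_start = a->iova_first, .size = size, .result = UINT64_MAX,
    };

    g_tree_foreach(a->allocated, find_hole, &s);
    if (s.result == UINT64_MAX && s.hole_start <= a->iova_last &&
        a->iova_last - s.hole_start + 1 >= size) {
        s.result = s.hole_start; /* gap after the last busy range */
    }
    if (s.result != UINT64_MAX) {
        IOVARange *r = g_new(IOVARange, 1);

        *r = (IOVARange){ .iova = s.result, .size = size };
        g_tree_insert(a->allocated, r, r);
    }
    return s.result;
}

Guest memory, SVQ vrings and CVQ buffers would all reserve their ranges from the same allocator, while the GPA -> IOVA and HVA -> IOVA mappings can live in whatever structure is most convenient for each user.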
> > > > >
> > > > > I don't know enough about them to have an opinion. I keep seeing the drawback of needing to synchronize both allocation & adding in all the places we want to modify the IOVATree. At this moment, these are the vhost-vdpa memory listener, the SVQ vring creation and removal, and net CVQ buffers. But it may be more in the future.
> > > > >
> > > > > What are the advantages of keeping these separated that justify needing to synchronize in all these places, compared with keeping them synchronized in VhostIOVATree?
> > > >
> > > > It doesn't need to be synchronized.
> > > >
> > > > Assuming guest and SVQ share the IOVA range, IOVA only needs to track which part of the range has been used.
> > >
> > > Not sure if I follow, that is what I mean with "synchronized".
> >
> > Oh right.
> >
> > > > This simplifies things, we can store GPA->IOVA mappings and SVQ -> IOVA mappings separately.
> > >
> > > Sorry, I still cannot see the whole picture :).
> > >
> > > Let's assume we have all the GPA mapped to specific IOVA regions, so we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ because of the migration. How can we know where we can place SVQ vrings without having them synchronized?
> >
> > Just allocating a new IOVA range for SVQ?
> >
> > > At this moment we're using a tree. The tree nature of the current SVQ IOVA -> VA makes all nodes ordered, so it is more or less easy to look for holes.
> >
> > Yes, IOVA allocation could still be implemented via a tree.
> >
> > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate over all the nodes of the two trees, order them, and then look for holes there?
> >
> > Let me clarify, correct me if I'm wrong:
> >
> > 1) The IOVA allocator is still implemented via a tree, we just don't need to store how the IOVA is used
> > 2) A dedicated GPA -> IOVA tree, updated via listeners and used in the datapath SVQ translation
> > 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
>
> Ok, so the part I was missing is that now we have 3 whole trees, with somehow redundant information :).

Somehow, it decouples the IOVA usage out of the IOVA allocator. This might be simpler, as the guest and SVQ may try to share a single IOVA address space.
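Very roughly, and only as a sketch to make the idea concrete (made-up names, plain arrays instead of gtrees, reusing the toy iova_alloc() from the sketch above, so again not actual QEMU code), the three pieces could look like:

/* The three structures, side by side.  IOVAAllocator / iova_alloc() come
 * from the allocator sketch above; plain arrays stand in for the gtrees. */
#include <stdint.h>

typedef struct { uint64_t gpa; uint64_t iova; uint64_t size; } GPAToIOVA;
typedef struct { void *hva;    uint64_t iova; uint64_t size; } HVAToIOVA;

typedef struct {
    IOVAAllocator allocator; /* 1) only knows which IOVA ranges are busy   */
    GPAToIOVA gpa_map[128];  /* 2) GPA -> IOVA, filled by the memory       */
    int n_gpa;               /*    listener, used for guest buffers        */
    HVAToIOVA svq_map[16];   /* 3) HVA -> IOVA for SVQ-owned memory only   */
    int n_svq;               /*    (vrings, CVQ buffers), which has no GPA */
} VhostIOVAMaps;

/* Called from the vhost-vdpa memory listener for guest memory. */
static uint64_t map_guest_region(VhostIOVAMaps *m, uint64_t gpa, uint64_t size)
{
    uint64_t iova = iova_alloc(&m->allocator, size);

    m->gpa_map[m->n_gpa++] = (GPAToIOVA){ .gpa = gpa, .iova = iova, .size = size };
    return iova;
}

/* Called when mapping SVQ vrings or CVQ buffers, which have no GPA. */
static uint64_t map_svq_region(VhostIOVAMaps *m, void *hva, uint64_t size)
{
    uint64_t iova = iova_alloc(&m->allocator, size);

    m->svq_map[m->n_svq++] = (HVAToIOVA){ .hva = hva, .iova = iova, .size = size };
    return iova;
}

/* Datapath translation of a guest buffer, assuming the SVQ code can hand us
 * the GPA (which is part of the discussion above).  There is no
 * reverse-by-HVA search here, so the aliasing problem at the top of the
 * thread should not apply to guest buffers anymore. */
static uint64_t gpa_to_iova(const VhostIOVAMaps *m, uint64_t gpa)
{
    for (int i = 0; i < m->n_gpa; i++) {
        if (gpa >= m->gpa_map[i].gpa &&
            gpa - m->gpa_map[i].gpa < m->gpa_map[i].size) {
            return m->gpa_map[i].iova + (gpa - m->gpa_map[i].gpa);
        }
    }
    return UINT64_MAX;
}

The point is just that the allocator knows nothing about GPA or HVA; whether 2) and 3) end up as gtrees, arrays or a single linear mapping is an implementation detail.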
> In some sense this is simpler than trying to get all the information from only two trees. On the bad side, all SVQ calls that allocate some region need to remember to add to one of the two other trees. That is what I mean by synchronized. But sure, we can go that way.

Just a suggestion, if it turns out to complicate the issue, I'm fine to go the other way.

Thanks

> > Thanks
> >
> > > > > Thanks!