> -----Original Message----- > From: Akihiko Odaki <[email protected]> > Sent: 23 June 2026 15:04 > To: Thanos Makatos <[email protected]>; Paolo Bonzini > <[email protected]>; BALATON Zoltan <[email protected]> > Cc: Daniel P. Berrangé <[email protected]>; [email protected]; > Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier > <[email protected]>; Peter Xu <[email protected]>; > Hervé Poussineau <[email protected]>; Alex Bennée > <[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien > Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark Cave- > Ayland <[email protected]>; Marc-André Lureau > <[email protected]>; John Levon <[email protected]>; > Cédric Le Goater <[email protected]> > Subject: Re: [RFC 4/7] system: add memory_region_new / > memory_region_new_io > > !-------------------------------------------------------------------| > CAUTION: External Email > > |-------------------------------------------------------------------! > > On 2026/06/23 21:58, Thanos Makatos wrote: > >> -----Original Message----- > >> From: Akihiko Odaki <[email protected]> > >> Sent: 23 June 2026 06:42 > >> To: Thanos Makatos <[email protected]>; Paolo Bonzini > >> <[email protected]>; BALATON Zoltan <[email protected]> > >> Cc: Daniel P. Berrangé <[email protected]>; [email protected]; > >> Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier > >> <[email protected]>; Peter Xu <[email protected]>; > >> Hervé Poussineau <[email protected]>; Alex Bennée > >> <[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien > >> Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark > Cave- > >> Ayland <[email protected]>; Marc-André Lureau > >> <[email protected]>; John Levon <[email protected]>; > >> Cédric Le Goater <[email protected]> > >> Subject: Re: [RFC 4/7] system: add memory_region_new / > >> memory_region_new_io > >> > >> !-------------------------------------------------------------------| > >> CAUTION: External Email > >> > >> |-------------------------------------------------------------------! > >> > >> On 2026/06/23 6:24, Thanos Makatos wrote: > >>>> -----Original Message----- > >>>> From: Akihiko Odaki <[email protected]> > >>>> Sent: 22 June 2026 15:37 > >>>> To: Paolo Bonzini <[email protected]>; BALATON Zoltan > >>>> <[email protected]> > >>>> Cc: Daniel P. Berrangé <[email protected]>; qemu- > [email protected]; > >>>> Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier > >>>> <[email protected]>; Peter Xu <[email protected]>; > >>>> Hervé Poussineau <[email protected]>; Alex Bennée > >>>> <[email protected]>; Michael S. Tsirkin <[email protected]>; > Aurelien > >>>> Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark > >> Cave- > >>>> Ayland <[email protected]>; Marc-André Lureau > >>>> <[email protected]>; John Levon > <[email protected]>; > >>>> Thanos Makatos <[email protected]>; Cédric Le Goater > >>>> <[email protected]> > >>>> Subject: Re: [RFC 4/7] system: add memory_region_new / > >>>> memory_region_new_io > >>>> > >>>> !-------------------------------------------------------------------| > >>>> CAUTION: External Email > >>>> > >>>> |-------------------------------------------------------------------! > >>>> > >>>> On 2026/06/20 1:32, Paolo Bonzini wrote: > >>>>> On 6/18/26 00:44, BALATON Zoltan wrote: > >>>>>> On Wed, 17 Jun 2026, Paolo Bonzini wrote: > >>>>>>> On 6/16/26 17:55, Daniel P. Berrangé wrote: > >>>>>>>> Prepare for the move to dynamically allocated memory regions by > >>>>>>>> introducing memory_region_new and memory_region_new_io > >> functions > >>>>>>>> which call through to object_new instead of object_initialize. > >>>>>>> > >>>>>>> MemoryRegionOps callback will certainly access fields in the parent. > >>>>>>> Having a separate object gives you no benefit because you still have > >>>>>>> the same use-after-free if the MemoryRegion outlives the device. > >>>>>>> Likewise for buses. > >>>>>>> > >>>>>>> If the device itself is embedded, then all the invariants must hold > >>>>>>> recursively: > >>>>>>> > >>>>>>> - the device must be kept alive by mutual references between the > >>>>>>> device itself and its parent; > >>>>>>> > >>>>>>> - unparent for the device should wait for all guest memory accesses > >>>>>>> to be either completed or cancelled. > >>>>>>> > >>>>>>> So, the problem is that "QEMU automatically references the owning > >>>>>>> Device to keep the MemoryRegion alive" is a hack that should die. > >>>>>>> [...] I disagree that removing embedding will fix any bugs > >>>>>> > >>>>>> I'm trying to understand this better. The way it originally meant to > >>>>>> work according to the memory region docs and implementation is that > >>>>>> memory region is added to an owner as a child which takes a reference > >>>>>> and when the owner is freed it will unparent its children which will > >>>>>> then get unrefed and freed as well if nothing else increased their ref > >>>>>> count. > >>>>> > >>>>> Yes, and that's broken. > >>>>> > >>>>>> The problem is that when something accesses the memory region it > has > >>>>>> to keep the owner alive as well so instead of referencing the memory > >>>>>> region it references the owner (not directly but using > >>>>>> memory_region_ref which does this). > >>>>> > >>>>> A mechanism to keep the owner alive exists, and it is mutual references > >>>>> (so that neither owner nor MR die) + unparent of owner (which should > >>>>> wait for pending users to complete before returning). > >>>>> > >>>>> The fact that this mechanism is not used, i.e. unparent of a PCIDevice > >>>>> does not wait for pending BAR accesses to complete, is a bug. > >>>>> > >>>>>> So embedding seems unrelated to this passing refs on > >>>>>> memory regions to owner and could be fixed separately that my > patches > >>>>>> attempted. > >>>>> > >>>>> Your patches add the possibility of not embedding; I don't like having > >>>>> two separate API, but I think we only have a disagreement on > aesthetics. > >>>>> > >>>>> This is different for Daniel's patches, which try to fix something by > >>>>> not embedding, and I don't think they do. Currently, if the owner dies, > >>>>> the MemoryRegion dies with it while it shouldn't (which is a problem, > >>>>> absolutely). But these patches, while preventing the MR from dying, > >>>>> still leave around a dangling pointer to the owner from the MR, so > >>>>> there's not much gained if anything. > >>>> > >>>> I think we can stack follow-up patches that reject accesses to > >>>> MemoryRegions owned by unrealized devices. That would make this a > >> more > >>>> complete solution. Accessing an unrealized device is an unsafe corner > >>>> case anyway, so it is better to close it explicitly. > >>>> > >>>> memory_region_init_ram_ptr() and > memory_region_init_ram_device_ptr() > >>>> are > >>>> a little more complicated, because direct accesses to those regions can > >>>> happen outside the BQL. Still, we can unmap the memory when the > >>>> MemoryRegion is finalized, as memory_region_init_ram() already does. > >>>> > >>>> However, after looking into this further, I found one complication: > >>>> x-vfio-user-server. In that case, the unmap timing is controlled by > >>>> libvfio-user, so QEMU cannot keep the memory mapped until the > >>>> MemoryRegion reference is dropped. This is not a new issue; it is a > >>>> pre-existing use-after-free hazard. I think a practical fix would be to > >>>> extend the libvfio-user API to support asynchronous unmapping. > >>>> > >>>> Cc'ing the vfio-user maintainers. > >>> > >>> VFIO_USER_DMA_UNMAP is already asynchronous, it sends the response > >> only after it's done unmapping. What would that asynchronous unmapping > >> API look like? VFIO_USER_DMA_UNMAP would send a respond that it > >> received the operation and started unmapping, and then another one to > say > >> it's done unmapping? > >> > >> The problem is of libvfio-user's API, not the vfio-user protocol. > >> vfu_setup_device_dma() takes a DMA region unregistration callback, which > >> is expected to synchronously unmap the region from the guest, and > >> x-vfio-user-server attempts to do so by calling object_unparent(). > > > > I misunderstood. > > > > libvfio-user provides an optional mechanism where the device is asked to > quiesce for certain operations, including VFIO_USER_DMA_UNMAP. Once > quiesced, the device must not modify state. If the device cannot quiesce > immediately it can continue operating and once ready to quiesce it can call a > libvfio-user function to tell the library it's quiesced. Would that solve the > problem? > > I don't think that solves this specific problem. > > The quiesce callback would let QEMU explicitly tell libvfio-user when it > is safe to proceed with the unmap. For example, QEMU could register a > quiesce callback with: > > vfu_setup_device_quiesce_cb(o->vfu_ctx, quiesce_cb); > > Then quiesce_cb() could return -EBUSY while QEMU prepares for the unmap, > and later call: > > vfu_device_quiesced(vfu_ctx, 0); > > At that point, libvfio-user would call dma_unregister(), and > dma_unregister() would be expected to synchronously unmap the memory. > > The difficulty is that this assumes QEMU has a primitive that can > prepare the guest memory mapping so that `dma_unregister()` can later > complete the unmap synchronously. I don't think we have such a > primitive, and I don't think it can be implemented cleanly with the > current memory API. > > If this were protected by a conventional lock, QEMU could acquire the > lock in quiesce_cb() and use the resulting critical section in > dma_unregister() to update the guest memory mapping synchronously. But > the memory API uses RCU instead. RCU does not give the updater an > exclusive critical section where old readers are blocked and the old > mapping is guaranteed to be gone before the function returns. Instead, > the update completes asynchronously after a grace period. That is the > point of RCU: readers avoid blocking, while updates complete later. > > So I think the quiesce mechanism is not enough here, but its API shape > is still useful. Concretely, libvfio-user could provide a variant of the > DMA unregister callback that may complete asynchronously. QEMU would > start the guest memory mapping update, return -1 with errno set to > EBUSY, and then call something like vfu_device_unmapped() once the RCU > grace period has elapsed.
To me it seems that the vfu_device_unmapped() callback you propose is functionally equivalent to the existing vfu_device_quiesced(). Am I missing something? > libvfio-user would free or reuse the memory > only after that completion notification. > > Regards, > Akihiko Odaki > > > >> > >> However, object_unparent() does not actually synchronously unmap the > >> region from the guest so it results in a use-after-free hazard. A > >> possible solution is to have the following three changes in combination: > >> - This series, ensuring the MemoryRegion stays alive > >> - A change of memory_region_init_ram_ptr() to add a callback for > >> asynchronous unmapping > >> - A libvfio-user API change to asynchronously complete unmapping > >> > >>> Also, since (AFAIK) x-vfio-user-server is orphaned is it worth putting the > >> effort in this? > >> > >> I rechecked MAINTAINERS and found its status is Odd Fixes, not Orphan. > >> That said, it is possible to leave a FIXME comment instead of fixing it > >> right now, as it's already done by virtio-gpu-rutabaga: > >> > >> > There is small risk of the MemoryRegion dereferencing the pointer > >> > after rutabaga unmaps it. Please see discussion here: > >> > > >> > https://urldefense.proofpoint.com/v2/url?u=https- > >> 3A__lists.gnu.org_archive_html_qemu-2Ddevel_2023- > >> > 2D09_msg05141.html&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh > >> > 5Ps2zJvtw6ogtti46atk736SI4vgsJiUKIyDE&m=t9M_TxgBz_6K8y6rUD0ypaL_El > >> PVMTV5GPtiCVHrHb9BMKzTkgWiOKmcvA812- > >> 5H&s=4N2t0WV5D358WZZsjHWVvc6CwQrYydmofzhTAlZ4eVM&e= > >> > > >> > It is highly unlikely to happen in practice and doesn't affect known > >> > use cases. However, it should be fixed and is noted here for > >> > posterity. > >> > >> A difference from virtio-gpu-rutabaga is that the Rutabaga library API > >> does not interfere with asynchronous unmapping so it can be fixed > >> entirely on the QEMU side. On contrary, anyone who would want to fix > >> x-vfio-user-server would need to change the libvfio-user API because its > >> current form is too restrictive. > >> > >> Regards, > >> Akihiko Odaki
