> -----Original Message-----
> From: Akihiko Odaki <[email protected]>
> Sent: 23 June 2026 15:04
> To: Thanos Makatos <[email protected]>; Paolo Bonzini
> <[email protected]>; BALATON Zoltan <[email protected]>
> Cc: Daniel P. Berrangé <[email protected]>; [email protected];
> Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier
> <[email protected]>; Peter Xu <[email protected]>;
> Hervé Poussineau <[email protected]>; Alex Bennée
> <[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien
> Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark Cave-
> Ayland <[email protected]>; Marc-André Lureau
> <[email protected]>; John Levon <[email protected]>;
> Cédric Le Goater <[email protected]>
> Subject: Re: [RFC 4/7] system: add memory_region_new /
> memory_region_new_io
> 
> !-------------------------------------------------------------------|
>   CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On 2026/06/23 21:58, Thanos Makatos wrote:
> >> -----Original Message-----
> >> From: Akihiko Odaki <[email protected]>
> >> Sent: 23 June 2026 06:42
> >> To: Thanos Makatos <[email protected]>; Paolo Bonzini
> >> <[email protected]>; BALATON Zoltan <[email protected]>
> >> Cc: Daniel P. Berrangé <[email protected]>; [email protected];
> >> Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier
> >> <[email protected]>; Peter Xu <[email protected]>;
> >> Hervé Poussineau <[email protected]>; Alex Bennée
> >> <[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien
> >> Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark
> Cave-
> >> Ayland <[email protected]>; Marc-André Lureau
> >> <[email protected]>; John Levon <[email protected]>;
> >> Cédric Le Goater <[email protected]>
> >> Subject: Re: [RFC 4/7] system: add memory_region_new /
> >> memory_region_new_io
> >>
> >> !-------------------------------------------------------------------|
> >>    CAUTION: External Email
> >>
> >> |-------------------------------------------------------------------!
> >>
> >> On 2026/06/23 6:24, Thanos Makatos wrote:
> >>>> -----Original Message-----
> >>>> From: Akihiko Odaki <[email protected]>
> >>>> Sent: 22 June 2026 15:37
> >>>> To: Paolo Bonzini <[email protected]>; BALATON Zoltan
> >>>> <[email protected]>
> >>>> Cc: Daniel P. Berrangé <[email protected]>; qemu-
> [email protected];
> >>>> Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier
> >>>> <[email protected]>; Peter Xu <[email protected]>;
> >>>> Hervé Poussineau <[email protected]>; Alex Bennée
> >>>> <[email protected]>; Michael S. Tsirkin <[email protected]>;
> Aurelien
> >>>> Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark
> >> Cave-
> >>>> Ayland <[email protected]>; Marc-André Lureau
> >>>> <[email protected]>; John Levon
> <[email protected]>;
> >>>> Thanos Makatos <[email protected]>; Cédric Le Goater
> >>>> <[email protected]>
> >>>> Subject: Re: [RFC 4/7] system: add memory_region_new /
> >>>> memory_region_new_io
> >>>>
> >>>> !-------------------------------------------------------------------|
> >>>>     CAUTION: External Email
> >>>>
> >>>> |-------------------------------------------------------------------!
> >>>>
> >>>> On 2026/06/20 1:32, Paolo Bonzini wrote:
> >>>>> On 6/18/26 00:44, BALATON Zoltan wrote:
> >>>>>> On Wed, 17 Jun 2026, Paolo Bonzini wrote:
> >>>>>>> On 6/16/26 17:55, Daniel P. Berrangé wrote:
> >>>>>>>> Prepare for the move to dynamically allocated memory regions by
> >>>>>>>> introducing memory_region_new and memory_region_new_io
> >> functions
> >>>>>>>> which call through to object_new instead of object_initialize.
> >>>>>>>
> >>>>>>> MemoryRegionOps callback will certainly access fields in the parent.
> >>>>>>> Having a separate object gives you no benefit because you still have
> >>>>>>> the same use-after-free if the MemoryRegion outlives the device.
> >>>>>>> Likewise for buses.
> >>>>>>>
> >>>>>>> If the device itself is embedded, then all the invariants must hold
> >>>>>>> recursively:
> >>>>>>>
> >>>>>>> - the device must be kept alive by mutual references between the
> >>>>>>> device itself and its parent;
> >>>>>>>
> >>>>>>> - unparent for the device should wait for all guest memory accesses
> >>>>>>> to be either completed or cancelled.
> >>>>>>>
> >>>>>>> So, the problem is that "QEMU automatically references the owning
> >>>>>>> Device to keep the MemoryRegion alive" is a hack that should die.
> >>>>>>> [...] I disagree that removing embedding will fix any bugs
> >>>>>>
> >>>>>> I'm trying to understand this better. The way it originally meant to
> >>>>>> work according to the memory region docs and implementation is that
> >>>>>> memory region is added to an owner as a child which takes a reference
> >>>>>> and when the owner is freed it will unparent its children which will
> >>>>>> then get unrefed and freed as well if nothing else increased their ref
> >>>>>> count.
> >>>>>
> >>>>> Yes, and that's broken.
> >>>>>
> >>>>>> The problem is that when something accesses the memory region it
> has
> >>>>>> to keep the owner alive as well so instead of referencing the memory
> >>>>>> region it references the owner (not directly but using
> >>>>>> memory_region_ref which does this).
> >>>>>
> >>>>> A mechanism to keep the owner alive exists, and it is mutual references
> >>>>> (so that neither owner nor MR die) + unparent of owner (which should
> >>>>> wait for pending users to complete before returning).
> >>>>>
> >>>>> The fact that this mechanism is not used, i.e. unparent of a PCIDevice
> >>>>> does not wait for pending BAR accesses to complete, is a bug.
> >>>>>
> >>>>>> So embedding seems unrelated to this passing refs on
> >>>>>> memory regions to owner and could be fixed separately that my
> patches
> >>>>>> attempted.
> >>>>>
> >>>>> Your patches add the possibility of not embedding; I don't like having
> >>>>> two separate API, but I think we only have a disagreement on
> aesthetics.
> >>>>>
> >>>>> This is different for Daniel's patches, which try to fix something by
> >>>>> not embedding, and I don't think they do.  Currently, if the owner dies,
> >>>>> the MemoryRegion dies with it while it shouldn't (which is a problem,
> >>>>> absolutely).  But these patches, while preventing the MR from dying,
> >>>>> still leave around a dangling pointer to the owner from the MR, so
> >>>>> there's not much gained if anything.
> >>>>
> >>>> I think we can stack follow-up patches that reject accesses to
> >>>> MemoryRegions owned by unrealized devices. That would make this a
> >> more
> >>>> complete solution. Accessing an unrealized device is an unsafe corner
> >>>> case anyway, so it is better to close it explicitly.
> >>>>
> >>>> memory_region_init_ram_ptr() and
> memory_region_init_ram_device_ptr()
> >>>> are
> >>>> a little more complicated, because direct accesses to those regions can
> >>>> happen outside the BQL. Still, we can unmap the memory when the
> >>>> MemoryRegion is finalized, as memory_region_init_ram() already does.
> >>>>
> >>>> However, after looking into this further, I found one complication:
> >>>> x-vfio-user-server. In that case, the unmap timing is controlled by
> >>>> libvfio-user, so QEMU cannot keep the memory mapped until the
> >>>> MemoryRegion reference is dropped. This is not a new issue; it is a
> >>>> pre-existing use-after-free hazard. I think a practical fix would be to
> >>>> extend the libvfio-user API to support asynchronous unmapping.
> >>>>
> >>>> Cc'ing the vfio-user maintainers.
> >>>
> >>> VFIO_USER_DMA_UNMAP is already asynchronous, it sends the response
> >> only after it's done unmapping. What would that asynchronous unmapping
> >> API look like? VFIO_USER_DMA_UNMAP would send a respond that it
> >> received the operation and started unmapping, and then another one to
> say
> >> it's done unmapping?
> >>
> >> The problem is of libvfio-user's API, not the vfio-user protocol.
> >> vfu_setup_device_dma() takes a DMA region unregistration callback, which
> >> is expected to synchronously unmap the region from the guest, and
> >> x-vfio-user-server attempts to do so by calling object_unparent().
> >
> > I misunderstood.
> >
> > libvfio-user provides an optional mechanism where the device is asked to
> quiesce for certain operations, including VFIO_USER_DMA_UNMAP. Once
> quiesced, the device must not modify state. If the device cannot quiesce
> immediately it can continue operating and once ready to quiesce it can call a
> libvfio-user function to tell the library it's quiesced. Would that solve the
> problem?
> 
> I don't think that solves this specific problem.
> 
> The quiesce callback would let QEMU explicitly tell libvfio-user when it
> is safe to proceed with the unmap. For example, QEMU could register a
> quiesce callback with:
> 
>      vfu_setup_device_quiesce_cb(o->vfu_ctx, quiesce_cb);
> 
> Then quiesce_cb() could return -EBUSY while QEMU prepares for the unmap,
> and later call:
> 
>      vfu_device_quiesced(vfu_ctx, 0);
> 
> At that point, libvfio-user would call dma_unregister(), and
> dma_unregister() would be expected to synchronously unmap the memory.
> 
> The difficulty is that this assumes QEMU has a primitive that can
> prepare the guest memory mapping so that `dma_unregister()` can later
> complete the unmap synchronously. I don't think we have such a
> primitive, and I don't think it can be implemented cleanly with the
> current memory API.
> 
> If this were protected by a conventional lock, QEMU could acquire the
> lock in quiesce_cb() and use the resulting critical section in
> dma_unregister() to update the guest memory mapping synchronously. But
> the memory API uses RCU instead. RCU does not give the updater an
> exclusive critical section where old readers are blocked and the old
> mapping is guaranteed to be gone before the function returns. Instead,
> the update completes asynchronously after a grace period. That is the
> point of RCU: readers avoid blocking, while updates complete later.
> 
> So I think the quiesce mechanism is not enough here, but its API shape
> is still useful. Concretely, libvfio-user could provide a variant of the
> DMA unregister callback that may complete asynchronously. QEMU would
> start the guest memory mapping update, return -1 with errno set to
> EBUSY, and then call something like vfu_device_unmapped() once the RCU
> grace period has elapsed.

To me it seems that the vfu_device_unmapped() callback you propose is 
functionally equivalent to the existing vfu_device_quiesced().
Am I missing something?

> libvfio-user would free or reuse the memory
> only after that completion notification.
> 
> Regards,
> Akihiko Odaki
> >
> >>
> >> However, object_unparent() does not actually synchronously unmap the
> >> region from the guest so it results in a use-after-free hazard. A
> >> possible solution is to have the following three changes in combination:
> >> - This series, ensuring the MemoryRegion stays alive
> >> - A change of memory_region_init_ram_ptr() to add a callback for
> >>     asynchronous unmapping
> >> - A libvfio-user API change to asynchronously complete unmapping
> >>
> >>> Also, since (AFAIK) x-vfio-user-server is orphaned is it worth putting the
> >> effort in this?
> >>
> >> I rechecked MAINTAINERS and found its status is Odd Fixes, not Orphan.
> >> That said, it is possible to leave a FIXME comment instead of fixing it
> >> right now, as it's already done by virtio-gpu-rutabaga:
> >>
> >>   > There is small risk of the MemoryRegion dereferencing the pointer
> >>   > after rutabaga unmaps it. Please see discussion here:
> >>   >
> >>   > https://urldefense.proofpoint.com/v2/url?u=https-
> >> 3A__lists.gnu.org_archive_html_qemu-2Ddevel_2023-
> >>
> 2D09_msg05141.html&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh
> >>
> 5Ps2zJvtw6ogtti46atk736SI4vgsJiUKIyDE&m=t9M_TxgBz_6K8y6rUD0ypaL_El
> >> PVMTV5GPtiCVHrHb9BMKzTkgWiOKmcvA812-
> >> 5H&s=4N2t0WV5D358WZZsjHWVvc6CwQrYydmofzhTAlZ4eVM&e=
> >>   >
> >>   > It is highly unlikely to happen in practice and doesn't affect known
> >>   > use cases. However, it should be fixed and is noted here for
> >>   > posterity.
> >>
> >> A difference from virtio-gpu-rutabaga is that the Rutabaga library API
> >> does not interfere with asynchronous unmapping so it can be fixed
> >> entirely on the QEMU side. On contrary, anyone who would want to fix
> >> x-vfio-user-server would need to change the libvfio-user API because its
> >> current form is too restrictive.
> >>
> >> Regards,
> >> Akihiko Odaki

Reply via email to