On 2026/06/23 21:58, Thanos Makatos wrote:
-----Original Message-----
From: Akihiko Odaki <[email protected]>
Sent: 23 June 2026 06:42
To: Thanos Makatos <[email protected]>; Paolo Bonzini
<[email protected]>; BALATON Zoltan <[email protected]>
Cc: Daniel P. Berrangé <[email protected]>; [email protected];
Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier
<[email protected]>; Peter Xu <[email protected]>;
Hervé Poussineau <[email protected]>; Alex Bennée
<[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien
Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark Cave-
Ayland <[email protected]>; Marc-André Lureau
<[email protected]>; John Levon <[email protected]>;
Cédric Le Goater <[email protected]>
Subject: Re: [RFC 4/7] system: add memory_region_new /
memory_region_new_io
!-------------------------------------------------------------------|
CAUTION: External Email
|-------------------------------------------------------------------!
On 2026/06/23 6:24, Thanos Makatos wrote:
-----Original Message-----
From: Akihiko Odaki <[email protected]>
Sent: 22 June 2026 15:37
To: Paolo Bonzini <[email protected]>; BALATON Zoltan
<[email protected]>
Cc: Daniel P. Berrangé <[email protected]>; [email protected];
Philippe Mathieu-Daudé <[email protected]>; Pierrick Bouvier
<[email protected]>; Peter Xu <[email protected]>;
Hervé Poussineau <[email protected]>; Alex Bennée
<[email protected]>; Michael S. Tsirkin <[email protected]>; Aurelien
Jarno <[email protected]>; Fabiano Rosas <[email protected]>; Mark
Cave-
Ayland <[email protected]>; Marc-André Lureau
<[email protected]>; John Levon <[email protected]>;
Thanos Makatos <[email protected]>; Cédric Le Goater
<[email protected]>
Subject: Re: [RFC 4/7] system: add memory_region_new /
memory_region_new_io
!-------------------------------------------------------------------|
CAUTION: External Email
|-------------------------------------------------------------------!
On 2026/06/20 1:32, Paolo Bonzini wrote:
On 6/18/26 00:44, BALATON Zoltan wrote:
On Wed, 17 Jun 2026, Paolo Bonzini wrote:
On 6/16/26 17:55, Daniel P. Berrangé wrote:
Prepare for the move to dynamically allocated memory regions by
introducing memory_region_new and memory_region_new_io
functions
which call through to object_new instead of object_initialize.
MemoryRegionOps callback will certainly access fields in the parent.
Having a separate object gives you no benefit because you still have
the same use-after-free if the MemoryRegion outlives the device.
Likewise for buses.
If the device itself is embedded, then all the invariants must hold
recursively:
- the device must be kept alive by mutual references between the
device itself and its parent;
- unparent for the device should wait for all guest memory accesses
to be either completed or cancelled.
So, the problem is that "QEMU automatically references the owning
Device to keep the MemoryRegion alive" is a hack that should die.
[...] I disagree that removing embedding will fix any bugs
I'm trying to understand this better. The way it originally meant to
work according to the memory region docs and implementation is that
memory region is added to an owner as a child which takes a reference
and when the owner is freed it will unparent its children which will
then get unrefed and freed as well if nothing else increased their ref
count.
Yes, and that's broken.
The problem is that when something accesses the memory region it has
to keep the owner alive as well so instead of referencing the memory
region it references the owner (not directly but using
memory_region_ref which does this).
A mechanism to keep the owner alive exists, and it is mutual references
(so that neither owner nor MR die) + unparent of owner (which should
wait for pending users to complete before returning).
The fact that this mechanism is not used, i.e. unparent of a PCIDevice
does not wait for pending BAR accesses to complete, is a bug.
So embedding seems unrelated to this passing refs on
memory regions to owner and could be fixed separately that my patches
attempted.
Your patches add the possibility of not embedding; I don't like having
two separate API, but I think we only have a disagreement on aesthetics.
This is different for Daniel's patches, which try to fix something by
not embedding, and I don't think they do. Currently, if the owner dies,
the MemoryRegion dies with it while it shouldn't (which is a problem,
absolutely). But these patches, while preventing the MR from dying,
still leave around a dangling pointer to the owner from the MR, so
there's not much gained if anything.
I think we can stack follow-up patches that reject accesses to
MemoryRegions owned by unrealized devices. That would make this a
more
complete solution. Accessing an unrealized device is an unsafe corner
case anyway, so it is better to close it explicitly.
memory_region_init_ram_ptr() and memory_region_init_ram_device_ptr()
are
a little more complicated, because direct accesses to those regions can
happen outside the BQL. Still, we can unmap the memory when the
MemoryRegion is finalized, as memory_region_init_ram() already does.
However, after looking into this further, I found one complication:
x-vfio-user-server. In that case, the unmap timing is controlled by
libvfio-user, so QEMU cannot keep the memory mapped until the
MemoryRegion reference is dropped. This is not a new issue; it is a
pre-existing use-after-free hazard. I think a practical fix would be to
extend the libvfio-user API to support asynchronous unmapping.
Cc'ing the vfio-user maintainers.
VFIO_USER_DMA_UNMAP is already asynchronous, it sends the response
only after it's done unmapping. What would that asynchronous unmapping
API look like? VFIO_USER_DMA_UNMAP would send a respond that it
received the operation and started unmapping, and then another one to say
it's done unmapping?
The problem is of libvfio-user's API, not the vfio-user protocol.
vfu_setup_device_dma() takes a DMA region unregistration callback, which
is expected to synchronously unmap the region from the guest, and
x-vfio-user-server attempts to do so by calling object_unparent().
I misunderstood.
libvfio-user provides an optional mechanism where the device is asked to
quiesce for certain operations, including VFIO_USER_DMA_UNMAP. Once quiesced,
the device must not modify state. If the device cannot quiesce immediately it
can continue operating and once ready to quiesce it can call a libvfio-user
function to tell the library it's quiesced. Would that solve the problem?
I don't think that solves this specific problem.
The quiesce callback would let QEMU explicitly tell libvfio-user when it
is safe to proceed with the unmap. For example, QEMU could register a
quiesce callback with:
vfu_setup_device_quiesce_cb(o->vfu_ctx, quiesce_cb);
Then quiesce_cb() could return -EBUSY while QEMU prepares for the unmap,
and later call:
vfu_device_quiesced(vfu_ctx, 0);
At that point, libvfio-user would call dma_unregister(), and
dma_unregister() would be expected to synchronously unmap the memory.
The difficulty is that this assumes QEMU has a primitive that can
prepare the guest memory mapping so that `dma_unregister()` can later
complete the unmap synchronously. I don't think we have such a
primitive, and I don't think it can be implemented cleanly with the
current memory API.
If this were protected by a conventional lock, QEMU could acquire the
lock in quiesce_cb() and use the resulting critical section in
dma_unregister() to update the guest memory mapping synchronously. But
the memory API uses RCU instead. RCU does not give the updater an
exclusive critical section where old readers are blocked and the old
mapping is guaranteed to be gone before the function returns. Instead,
the update completes asynchronously after a grace period. That is the
point of RCU: readers avoid blocking, while updates complete later.
So I think the quiesce mechanism is not enough here, but its API shape
is still useful. Concretely, libvfio-user could provide a variant of the
DMA unregister callback that may complete asynchronously. QEMU would
start the guest memory mapping update, return -1 with errno set to
EBUSY, and then call something like vfu_device_unmapped() once the RCU
grace period has elapsed. libvfio-user would free or reuse the memory
only after that completion notification.
Regards,
Akihiko Odaki
However, object_unparent() does not actually synchronously unmap the
region from the guest so it results in a use-after-free hazard. A
possible solution is to have the following three changes in combination:
- This series, ensuring the MemoryRegion stays alive
- A change of memory_region_init_ram_ptr() to add a callback for
asynchronous unmapping
- A libvfio-user API change to asynchronously complete unmapping
Also, since (AFAIK) x-vfio-user-server is orphaned is it worth putting the
effort in this?
I rechecked MAINTAINERS and found its status is Odd Fixes, not Orphan.
That said, it is possible to leave a FIXME comment instead of fixing it
right now, as it's already done by virtio-gpu-rutabaga:
> There is small risk of the MemoryRegion dereferencing the pointer
> after rutabaga unmaps it. Please see discussion here:
>
> https://urldefense.proofpoint.com/v2/url?u=https-
3A__lists.gnu.org_archive_html_qemu-2Ddevel_2023-
2D09_msg05141.html&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh
5Ps2zJvtw6ogtti46atk736SI4vgsJiUKIyDE&m=t9M_TxgBz_6K8y6rUD0ypaL_El
PVMTV5GPtiCVHrHb9BMKzTkgWiOKmcvA812-
5H&s=4N2t0WV5D358WZZsjHWVvc6CwQrYydmofzhTAlZ4eVM&e=
>
> It is highly unlikely to happen in practice and doesn't affect known
> use cases. However, it should be fixed and is noted here for
> posterity.
A difference from virtio-gpu-rutabaga is that the Rutabaga library API
does not interfere with asynchronous unmapping so it can be fixed
entirely on the QEMU side. On contrary, anyone who would want to fix
x-vfio-user-server would need to change the libvfio-user API because its
current form is too restrictive.
Regards,
Akihiko Odaki